Environment
Login to Artemisa
Users can access both User Interface (UI) nodes (mlui01.ific.uv.es, mlui02.ific.uv.es) via ssh. For instance, on Linux and macOS using a terminal:
$ ssh YOUR_ARTEMISA_USERNAME@mlui01.ific.uv.es
On Windows, this can be done using an ssh client such as PuTTY.
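For convenience, an entry can be added to the local OpenSSH client configuration so the UI can be reached with a short alias. A minimal sketch (the Host alias and the username placeholder are arbitrary examples, not Artemisa requirements):
# ~/.ssh/config (hypothetical alias; adjust the username)
Host artemisa
    HostName mlui01.ific.uv.es
    User YOUR_ARTEMISA_USERNAME
With this in place, the login becomes simply:
$ ssh artemisa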
Once authenticated, the following welcome prompt is shown:
==========================================================
Welcome to
_ _ _
/ \ _ __| |_ ___ _ __ ___ (_)___ __ _
/ _ \ | '__| __/ _ \ '_ ` _ \| / __|/ _` |
/ ___ \| | | || __/ | | | | | \__ \ (_| |
/_/ \_\_| \__\___|_| |_| |_|_|___/\__,_|
The IFIC's AI and ML infraestructure
This is mlui01.ific.uv.es running CentOS 7.9.2009
Please, read the "Politica de Seguridad de la Informacion"
https://artemisa.ific.uv.es/static/Politica_Seguridad.pdf
==========================================================
**ATTENTION**
CUDA Version: 12.2 Driver Version: 535.54.03
Execute GPUstatus to get a summary on GPU utilization
==========================================================
[artemisa_user@mlui01 ~]$
On UI nodes users can develop and submit production jobs. These nodes also allow users to test and validate their programs using an entry-level local UI GPU. Exclusive use of this local GPU is restricted to 5-minute slots, as it is meant for the development and testing of light tasks.
Users can also submit their code as production batch jobs using the HTCondor Job Management System. This gives access to the Worker Nodes (WNs), enabling the use of high-end GPUs. WNs may have up to 8 GPUs with high-speed NVLink interconnection (see System Overview) and extended execution times, up to 48 hours.
Users can have up to 8 jobs running simultaneously.
User Interface
UI nodes also serve as development nodes, where users can test and validate their processes using the entry-level local GPU. After validation, they can submit their codes as production batch jobs using the Job Management System HTCondor.
Authorized users can log in through the ssh protocol to the UI machines:
mlui01.ific.uv.es
mlui02.ific.uv.es
The user HOME directory resides in the Lustre filesystem:
/lhome/ific/<username_first_letter>/<username> <= IFIC users
/lhome/ext/<groupid>/<username> <= external users
In addition to the user HOME directory, there is space for Projects, available upon request and also accessible from the UI:
/lustre/ific.uv.es/ml/<groupid>
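Since both areas live on the shared Lustre filesystem, usage can be checked directly from the UI. A minimal sketch using standard tools (the quota policy and the exact mount point used below are assumptions; adapt the paths to your own account and group):
# Disk usage of your HOME directory
$ du -sh ~

# Lustre quota report for your user on the shared filesystem (hypothetical mount point)
$ lfs quota -h -u $USER /lustre/ific.uv.es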
Use UI GPU (gpurun)
As GPUs are an exclusive resource, users have to request usage of the local User Interface GPU through the gpurun tool, which grants a time slot of 5 minutes.
$ gpurun -i
Connected
Info: OK 0. Tesla V100-PCIE-16GB [00000000:5E:00.0]
1. Tesla V100-PCIE-16GB [00000000:86:00.0]
Total clients:0 Running:0 Estimated waiting time:0 seconds
If several users request the GPU concurrently, the command queues the requests and serves them synchronously, one after another.
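Beyond querying the status with gpurun -i, a typical development test wraps the command to be executed on the UI GPU with gpurun. The following is only a sketch: the exact invocation syntax is an assumption here, so check the tool's help or the Artemisa documentation for the supported options.
# Hypothetical example: run a short test script on the local UI GPU within the 5-minute slot
$ gpurun python3 my_test_script.py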
Worker Nodes use
HTCondor
HTCondor is the resource management system that runs in this cluster. It manages the job workflow and allows the users to send jobs to be executed in the Worker Nodes (WN). Direct access to WNs is not allowed.
Each WN can be split into slots that accept jobs to be processed. HTCondor deals with job sorting and processing. WN slots are created when a job does not require the entire node's resources, so more jobs can run on the WN. CPU and memory resources are subtracted in chunks from the main slot. Requests of 0, 1, 2, 4 or 8 GPUs are permitted. Each GPU is used exclusively by one job.
HTCondor tries to run jobs from different users in a fair-share manner. It takes into account the processing time previously consumed by each user, so CPU time is assigned evenly among all users.
The HTCondor manual can be found here
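The standard condor_userprio tool reports the effective user priorities that the fair-share mechanism uses; it can typically be run from the UI to see where your account stands:
# Show current user priorities and recent resource usage (a lower priority value is better)
$ condor_userprio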
The current Artemisa WNs configuration:
node | GPU_Model | #nodes
---|---|---
mlwn03 | 4x Tesla V100-SXM2-32GB | 1
mlwn24 | 8x NVIDIA A100-SXM4-40GB | 1
mlwn01-02 | 1x Tesla V100-PCIE-32GB | 2
mlwn04-23 | 1x Tesla V100-PCIE-32GB | 20
mlwn25-35 | 1x NVIDIA A100-PCIE-40GB | 11
In more detail:
Machine | Platform | CPUs | GPUs | Total memory (GB) | Free CPUs
---|---|---|---|---|---
mlwn01.ific.uv.es | x64/CentOS7 | 96 | 1 | 376.4 | 8
mlwn02.ific.uv.es | x64/CentOS7 | 96 | 1 | 376.4 | 8
mlwn03.ific.uv.es | x64/CentOS7 | 112 | 4 | 754.37 | 56
mlwn04.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn05.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn06.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn07.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn08.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn09.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn10.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn11.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn12.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn13.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn14.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn15.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn16.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn17.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn18.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn19.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn20.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn21.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn22.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn23.ific.uv.es | x64/CentOS7 | 80 | 1 | 376.38 | 48
mlwn24.ific.uv.es | x64/CentOS7 | 192 | 8 | 503.82 | 48
mlwn25.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
mlwn26.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
mlwn27.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
mlwn28.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
mlwn29.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
mlwn30.ific.uv.es | x64/CentOS7 | 128 | 1 | 472.32 | 72
mlwn31.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
mlwn32.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
mlwn33.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
mlwn34.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
mlwn35.ific.uv.es | x64/CentOS7 | 128 | 1 | 503.82 | 72
Submit Description File
We are going to submit a simple job, hello_world.sh:
#!/bin/bash
echo "Hello World!"
/bin/sleep 10
A submit description file is required to send a job. This type of file contains the specifics of the submitted task, like job requirements, outputs, etc.
A simple submit description file example might be hello_world.sub:
# hello_world
universe = vanilla
executable = hello_world.sh
arguments = $(Cluster) $(Process)
log = hello_world.log
output = outfile.$(Cluster).$(Process).out
error = errors.$(Cluster).$(Process).err
request_cpus = 4
request_memory = 4000
request_disk = 1G
queue 2
Each line of the submit description file has the form
command_name = value
The command name is case insensitive and precedes an equals sign. Values to the right of the equals sign are likely to be case sensitive, especially when they specify paths and file names.
The first line of this submit description file is a comment. Comments begin with the # character. Comments do not span lines.
universe
: defines an execution environment for a job. If a universe is not specified, the default is vanilla. The vanilla universe is a good default, for it has the fewest restrictions on the job. More on HTCondor Universes.
executable
: specifies the program that becomes the HTCondor job. It can be a shell script or a program. A full path and executable name, or a path and executable relative to the current working directory, may be specified.
Caution
.sh files must be given execute permission. In this case:
chmod +x hello_world.sh
arguments
: While the name of the executable is specified in the submit description file with the executable command, the remainder of the command line will be specified with the arguments command.
log
: The log command causes a job event log file (hello_world.log in this example) to be created once the job is submitted. It is not required, but it can be incredibly useful for figuring out what happened or is happening with a job.
output
: file capturing the redirected standard output. When submitted as an HTCondor job, the standard output of the job is written on the WN where it runs, and thus would otherwise be unavailable. In this example, each execution produces a unique file through the $(Cluster) and $(Process) macros, so it is not overwritten. Each set of queued jobs from a specific user, submitted from a single submit host and sharing an executable, has the same value of $(Cluster). Within a cluster of jobs, each takes on its own unique $(Process). The first job has value 0. More on variables in the Submit Description File
error
: file capturing the redirected standard error. Analogous to output.
request_cpus
: how many CPUs the task needs; requesting them explicitly avoids overloading the WN.
request_memory
: how much RAM the task is going to use. Suffix: ‘M’ for megabytes, ‘G’ for gigabytes. Default unit: megabytes.
request_disk
: how much disk space the task is going to use. Suffix: ‘M’ for megabytes, ‘G’ for gigabytes. Default unit: kilobytes.
queue
: number of instances of the job to be submitted, 2 in this case. The default is 1. This should be the last command in the file; commands after it are ignored.
The submit description file can include other commands that can meet the needs of more complex tasks. Some of these commands are going to be used and explained in the tutorials section. A more comprehensive documentation can be found here.
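For jobs that need one of the high-end WN GPUs described above, the standard HTCondor request_gpus command can be added to the submit description file. The following is only a sketch: the executable name and the resource amounts are arbitrary examples.
# train_gpu.sub (hypothetical example)
universe        = vanilla
executable      = train_model.sh
log             = train_model.log
output          = train_model.$(Cluster).$(Process).out
error           = train_model.$(Cluster).$(Process).err
request_cpus    = 8
request_memory  = 16G
request_disk    = 10G
request_gpus    = 1
queue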
Job management with HTCondor
We now have all the ingredients to submit the job to HTCondor. We can do so by executing condor_submit on the command line:
[artemisa_user@mlui01]$ condor_submit hello_world.sub
Submitting job(s)..
2 job(s) submitted to cluster 568517.
Once the job is submitted, we can monitor this and any other of our jobs currently in the queue with condor_q:
[artemisa_user@mlui01]$ condor_q
-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:20
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
artemisa_user ID: 568517 9/4 13:55 _ _ 2 2 568517.0-1
Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 4 idle, 13 running, 55 held, 0 suspended
This shows that our 2 jobs are waiting to be executed (IDLE column). Once the jobs start running, we see:
[artemisa_user@mlui01]$ condor_q
-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:24
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
artemisa_user ID: 568517 9/4 13:55 _ 2 _ 2 568517.0-1
Total for query: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 2 idle, 15 running, 55 held, 0 suspended
Submitted jobs can be removed from the queue with the condor_rm command, passing the job identifier as a command-line argument:
[artemisa_user@mlui01]$ condor_rm 568517.1
Job 568517.1 marked for removal
If we want to remove all jobs within a cluster, we only specify the cluster number:
[artemisa_user@mlui01]$ condor_rm 568517
All jobs in cluster 568517 have been marked for removal
We can get a more thorough analysis by adding the -better-analyze option to condor_q, which helps detect problems in jobs, for instance when a job is in HOLD or stuck in IDLE status.
$ condor_submit augment_data_cpu.sub
Submitting job(s).
1 job(s) submitted to cluster 579102.
$ condor_q
-- Schedd: cm02.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/21/23 17:26:33
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
artemisa_user ID: 579102 9/21 17:26 _ _ _ 1 1 579102.0
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for artemisa_user: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for all users: 180 jobs; 0 completed, 0 removed, 107 idle, 39 running, 34 held, 0 suspended
$ condor_q -better-analyze 579102.0
-- Schedd: cm02.ific.uv.es : <147.156.116.133:9618?...
The Requirements expression for job 579102.000 is
(RequirementsOrig) && (RequirementsUserGroupLimit)
Job 579102.000 defines the following attributes:
DiskUsage = 1
FileSystemDomain = "ific.uv.es"
GroupMaxRunningJobs = 36.0
ImageSize = 1
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
RequirementsOrig = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
RequirementsUserGroupLimit = ((isUndefined(SubmitterUserResourcesInUse) || (SubmitterUserResourcesInUse < UserMaxRunningJobs)) && (isUndefined(SubmitterGroupResourcesInUse) || (SubmitterGroupResourcesInUse < GroupMaxRunningJobs)))
UserMaxRunningJobs = 450.0
The Requirements expression for job 579102.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 83 RequirementsOrig
579102.000: Job is held.
Hold reason: Error from slot1@mlwn02.ific.uv.es: Failed to execute '/lhome/ific/a/artemisa_user/02_cpu/augment_data_cpu.sh': (errno=13: 'Permission denied')
Last successful match: Thu Sep 21 17:26:28 2023
579102.000: Run analysis summary ignoring user priority. Of 72 machines,
0 are rejected by your job's requirements
70 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
2 are able to run your job
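In this example the hold reason points to a missing execute permission on the job script. Assuming that is the only problem, a sketch of the fix is to make the script executable and then release the held job with the standard condor_release command (the script path and job id below are taken from the example output):
# Give the job script execute permission
$ chmod +x /lhome/ific/a/artemisa_user/02_cpu/augment_data_cpu.sh

# Release the held job so HTCondor can match and run it again
$ condor_release 579102.0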
Storage
Storage is maintained on several disk servers, as detailed here. A distributed Lustre filesystem is shared and mounted on the different nodes of the cluster, including the UIs and WNs.
This means that all data is available on all nodes, and no explicit file transfer is needed for it to be accessible from the WNs. This includes user home directories and project areas.
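As a consequence, a job script can simply read its input from, and write its output to, the shared areas using absolute paths, with no file transfer directives required. A minimal sketch (the data paths and the process.py script are hypothetical placeholders):
#!/bin/bash
# Hypothetical example: input and output live on the shared Lustre filesystem,
# so they are visible from the WN exactly as from the UI.
INPUT=/lustre/ific.uv.es/ml/<groupid>/data/input.csv
OUTPUT=$HOME/results/output.csv

mkdir -p "$(dirname "$OUTPUT")"
python3 process.py "$INPUT" "$OUTPUT"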
Utilities
The following tools are available in both WN and UI nodes.
Containers
Containers are a convenient way to package and distribute software for developers and users.
Singularity is supported, as it is secure and accepts several container types, including Docker images and access to Docker Hub. The current distribution's user documentation can be found here.
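A sketch of a typical Singularity workflow: pull a public image from Docker Hub and run a command inside it with GPU support. The image name is an arbitrary example, and the --nv flag exposes the host NVIDIA driver inside the container.
# Pull a container image from Docker Hub and convert it to a local SIF file
$ singularity pull docker://tensorflow/tensorflow:latest-gpu

# Run a command inside the container, with access to the host GPUs
$ singularity exec --nv tensorflow_latest-gpu.sif python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"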
CVMFS: HEP Software distribution
We adopt CVMFS as the main HEP software distribution method. The software packages are distributed in different repositories and are accessible as locally mounted /cvmfs points on the UIs and WNs.
Currently available repositories:
$ cd /cvmfs/atlas.cern.ch
Other repositories: other CERN CVMFS repositories, maintained by their respective owners, are also available as mount points, as detailed in the CVMFS repositories list.
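Repositories are mounted on demand, so a directory simply needs to be accessed for its contents to appear. A sketch of how CVMFS-distributed software is typically used (the specific repository and setup script below are assumptions; the exact path depends on the experiment's own software distribution):
# List the repositories that have already been mounted
$ ls /cvmfs

# Hypothetical example: source an environment setup script shipped in a repository
$ source /cvmfs/sft.cern.ch/lcg/views/LCG_104/x86_64-centos7-gcc11-opt/setup.sh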
Local Installed software
NVIDIA Drivers : Driver Version: 535.54.03
NVIDIA CUDA Toolkit : provides a development environment for creating high performance GPU-accelerated applications. Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5, 12.2
NVIDIA CUDA-X: a collection of GPU-accelerated libraries, tools, and technologies built on top of NVIDIA CUDA that deliver higher performance than CPU-only alternatives. Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5, 12.2
Compilers: python 2.7.5, python 3.6.8, gcc 4.8.5
Tensorflow: end-to-end open source platform for machine learning. Installed releases: 2.6.0 (python3)
Other Scientific Libraries: SciPy, NumPy, atlas, blas, lapack.
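With several CUDA Toolkit releases installed side by side, a job or interactive session usually selects one explicitly. A minimal sketch, assuming the toolkits follow the conventional /usr/local/cuda-<version> layout (the exact installation paths on Artemisa may differ):
# Hypothetical example: put CUDA 12.2 on the PATH and library path
export CUDA_HOME=/usr/local/cuda-12.2
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# Verify the selected toolkit
nvcc --version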