Environment

Login to Artemisa

Users can access both User Interface (UI) nodes (mlui01.ific.uv.es and mlui02.ific.uv.es) via SSH. For instance, on Linux and Mac using a terminal:

$ ssh YOUR_ARTEMISA_USERNAME@mlui01.ific.uv.es

On Windows, this can be done using an SSH client such as PuTTY.

Once authenticated, the following welcome banner is shown:

==========================================================

Welcome to
    _         _                 _
   / \   _ __| |_ ___ _ __ ___ (_)___  __ _
  / _ \ | '__| __/ _ \ '_ ` _ \| / __|/ _` |
 / ___ \| |  | ||  __/ | | | | | \__ \ (_| |
/_/   \_\_|   \__\___|_| |_| |_|_|___/\__,_|

The IFIC's AI and ML infraestructure

This is mlui01.ific.uv.es running CentOS 7.9.2009
Please, read the "Politica de Seguridad de la Informacion"

https://artemisa.ific.uv.es/static/Politica_Seguridad.pdf

==========================================================
**ATTENTION**
CUDA Version: 12.2  Driver Version: 535.54.03
Execute GPUstatus to get a summary on GPU utilization
==========================================================

[artemisa_user@mlui01 ~]$

On UI nodes users can develop and submit production jobs. These nodes also allow users to test and validate their programs using an entry-level local GPU. Exclusive use of this local GPU is restricted to 5-minute slots, as it is meant for developing and testing light tasks.

Users can also submit their codes as production batch jobs using the HTCondor job management system. This gives access to the Worker Nodes (WNs), enabling the use of high-end GPUs. WNs may have up to 8 GPUs with high-speed NVLink interconnection (see System Overview) and extended execution times of up to 48 hours.

Users can have up to 8 jobs running simultaneously.

User Interface

UI nodes also serve as development nodes, where users can test and validate their processes using the entry-level local GPU. After validation, they can submit their codes as production batch jobs using the HTCondor job management system.

Authorized users can log in to the UI machines through the SSH protocol:

mlui01.ific.uv.es
mlui02.ific.uv.es

The user HOME directory resides in the Lustre filesystem:

/lhome/ific/<username_first_letter>/<username>   <=  IFIC users
/lhome/ext/<groupid>/<username>  <= external users

In addition to the user HOME directory, there is project space available upon request, also accessible from the UI:

/lustre/ific.uv.es/ml/<groupid>

Use UI GPU (gpurun)

As GPUs are an exclusive resource, users must request the local User Interface GPU through the gpurun tool, which grants a time slot of 5 minutes.

$ gpurun -i
Connected
Info: OK 0. Tesla V100-PCIE-16GB [00000000:5E:00.0]
         1. Tesla V100-PCIE-16GB [00000000:86:00.0]

Total clients:0 Running:0 Estimated waiting time:0 seconds

If several users request the GPU concurrently, the tool queues the requests and serves them in order.
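
Besides querying its status, the UI GPU is used by launching the actual command through gpurun. The line below is a minimal sketch, assuming gpurun takes the command to execute as its argument; my_test.py is a hypothetical user script, not a file provided by Artemisa:

$ gpurun python3 my_test.py

The command waits for a free slot if necessary and then runs on one of the local GPUs within the 5-minute limit.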

Worker Nodes use

HTCondor

HTCondor is the resource management system that runs in this cluster. It manages the job workflow and allows users to submit jobs for execution on the Worker Nodes (WNs). Direct access to the WNs is not allowed.

Each WN can be split into slots that accept jobs to be processed; HTCondor deals with job sorting and processing. WN slots are created when a job does not require the entire node's resources, so more jobs can run on the same WN. CPU and memory resources are subtracted in chunks from the main slot. Requests of 0, 1, 2, 4 or 8 GPUs are permitted, and each GPU is used exclusively by one job.

HTCondor tries to run jobs from different users following a fair-share policy: it takes into account the processing time each user has already consumed, so that CPU time is distributed evenly among all users.

The HTCondor manual can be found here.
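
To inspect how the fair-share mechanism currently ranks users, HTCondor provides the condor_userprio command. The invocation below is a generic HTCondor sketch, not output captured on Artemisa:

$ condor_userprio -allusers

Numerically lower effective priority values mean the user is scheduled ahead of users with higher values.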

The current Artemisa WN configuration:

node        GPU_Model                    #nodes
mlwn03      4x Tesla V100-SXM2-32GB           1
mlwn24      8x NVIDIA A100-SXM4-40GB          1
mlwn01-02   1x Tesla V100-PCIE-32GB           2
mlwn04-23   1x Tesla V100-PCIE-32GB          20
mlwn25-35   1x NVIDIA A100-PCIE-40GB         11

In more detail:

Machine             Platform      Cpus  Gpus  TotalGb  FreeCpu
mlwn01.ific.uv.es   x64/CentOS7     96     1   376.4         8
mlwn02.ific.uv.es   x64/CentOS7     96     1   376.4         8
mlwn03.ific.uv.es   x64/CentOS7    112     4   754.37       56
mlwn04.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn05.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn06.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn07.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn08.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn09.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn10.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn11.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn12.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn13.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn14.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn15.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn16.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn17.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn18.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn19.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn20.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn21.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn22.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn23.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn24.ific.uv.es   x64/CentOS7    192     8   503.82       48
mlwn25.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn26.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn27.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn28.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn29.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn30.ific.uv.es   x64/CentOS7    128     1   472.32       72
mlwn31.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn32.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn33.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn34.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn35.ific.uv.es   x64/CentOS7    128     1   503.82       72
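
The detailed listing above can be refreshed at any time from a UI node: its columns correspond to the machine-centric summary printed by condor_status in compact mode (standard HTCondor; exact columns may vary between versions):

$ condor_status -compact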

Submit Description File

We are going to submit a simple job, hello_world.sh:

#!/bin/bash
echo "Hello World!" 
/bin/sleep 10

A submit description file is required to submit a job. It contains the specifics of the task, such as job requirements, output files, etc.

A simple submit description file example might be hello_world.sub:

# hello_world
 
universe = vanilla

executable          = hello_world.sh
arguments           = $(Cluster) $(Process)

log                 = hello_world.log
output              = outfile.$(Cluster).$(Process).out
error               = errors.$(Cluster).$(Process).err

request_cpus        = 4
request_memory      = 4000
request_disk        = 1G

queue 2

Each line of the submit description file has the form

command_name = value

The command name is case insensitive and precedes an equals sign. Values to the right of the equals sign are likely to be case sensitive, especially when they specify paths and file names.

The first line of this submit description file is a comment. Comments begin with the # character. Comments do not span lines.

  • universe : defines an execution environment for a job. If a universe is not specified, the default is vanilla. The vanilla universe is a good default, for it has the fewest restrictions on the job. More on HTCondor Universes.

  • executable : specifies the program that becomes the HTCondor job. It can be a shell script or a program. A full path and executable name, or a path and executable name relative to the current working directory, may be specified.

Caution

.sh files must be given execution permissions. In this case: chmod +x hello_world.sh

  • arguments : While the name of the executable is specified in the submit description file with the executable command, the remainder of the command line will be specified with the arguments command.

  • log : The log command causes a job event log file named hello_world.log to be created once the job is submitted. It is not mandatory, but it can be incredibly useful in figuring out what happened or is happening with a job.

  • output : file capturing the redirected standard output. When the program runs as an HTCondor job, its standard output is written on the WN and would otherwise be unavailable. In this example, each execution produces a unique file through the $(Cluster) and $(Process) macros, so it is not overwritten. Each set of queued jobs from a specific user, submitted from a single submit host and sharing an executable, has the same value of $(Cluster). Within a cluster of jobs, each one takes on its own unique $(Process), the first job having value 0. More on variables in the Submit Description File.

  • error : file capturing the redirected standard error. Analogous to output.

  • request_cpus : the number of CPUs the task needs; requesting them explicitly avoids overloading the WN.

  • request_memory : the amount of RAM the task will use. Suffixes: 'M' for megabytes, 'G' for gigabytes. Default unit: megabytes.

  • request_disk : the amount of disk space the task will use. Suffixes: 'M' for megabytes, 'G' for gigabytes. Default unit: kilobytes.

  • queue : the number of instances of the job to be submitted, 2 in this case (default is 1). This should be the last command in the file; commands after it are ignored.

The submit description file can include other commands to meet the needs of more complex tasks. Some of these commands are used and explained in the tutorials section. More comprehensive documentation can be found here.
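
Because the WNs host the GPUs, a common variation of the file above requests one or more GPUs through the standard HTCondor request_gpus command. The following is only a sketch: train_model.sh and the resource amounts are illustrative, not values prescribed by Artemisa.

# gpu_job.sub - sketch of a single-GPU batch job (names are illustrative)
universe            = vanilla

executable          = train_model.sh
log                 = train_model.log
output              = train_model.$(Cluster).$(Process).out
error               = train_model.$(Cluster).$(Process).err

request_gpus        = 1
request_cpus        = 8
request_memory      = 16G
request_disk        = 10G

queue

Remember that 0, 1, 2, 4 or 8 GPUs may be requested and that each GPU is assigned exclusively to one job.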

Job management with HTCondor

We now have all the ingredients to submit the job to HTCondor. We can do so by executing condor_submit on the command line:

[artemisa_user@mlui01]$ condor_submit hello_world.sub
Submitting job(s)..
2 job(s) submitted to cluster 568517.

Once the job is submitted, we can monitor it, together with any other of our jobs currently in the queue, with condor_q:

[artemisa_user@mlui01]$ condor_q

-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:20
OWNER         BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
artemisa_user ID: 568517    9/4  13:55    _     _       2     2   568517.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 4 idle, 13 running, 55 held, 0 suspended

This shows that our 2 jobs are waiting to be executed (IDLE column). Once the jobs start running, we see:

[artemisa_user@mlui01]$ condor_q

-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:24
OWNER         BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
artemisa_user ID: 568517    9/4  13:55    _     2       _     2   568517.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 2 idle, 15 running, 55 held, 0 suspended

Submitted jobs can be removed from the queue with the condor_rm command, passing the job identifier as a command-line argument:

[artemisa_user@mlui01]$ condor_rm 568517.1
Job 568517.1 marked for removal

If we want to remove all jobs within a cluster, we only specify the cluster number:

[artemisa_user@mlui01]$ condor_rm 568517
All jobs in cluster 568517 have been marked for removal
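
condor_rm also accepts a user name, in which case every job owned by that user is removed (standard HTCondor behaviour; artemisa_user is the placeholder user used throughout this page):

$ condor_rm artemisa_user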

A more thorough analysis can be obtained by adding the -better-analyze option to condor_q; this is useful to detect problems in jobs, for instance when a job is in HOLD or stuck in IDLE status.

$ condor_submit augment_data_cpu.sub
Submitting job(s).
1 job(s) submitted to cluster 579102.
$ condor_q
-- Schedd: cm02.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/21/23 17:26:33
OWNER           BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
artemisa_user   ID: 579102    9/21 17:26    _     _       _     1      1   579102.0

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for artemisa_user: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for all users: 180 jobs; 0 completed, 0 removed, 107 idle, 39 running, 34 held, 0 suspended

$ condor_q -better-analyze 579102.0

-- Schedd: cm02.ific.uv.es : <147.156.116.133:9618?...
The Requirements expression for job 579102.000 is

    (RequirementsOrig) && (RequirementsUserGroupLimit)

Job 579102.000 defines the following attributes:

    DiskUsage = 1
    FileSystemDomain = "ific.uv.es"
    GroupMaxRunningJobs = 36.0
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
    RequirementsOrig = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
    RequirementsUserGroupLimit = ((isUndefined(SubmitterUserResourcesInUse) || (SubmitterUserResourcesInUse < UserMaxRunningJobs)) && (isUndefined(SubmitterGroupResourcesInUse) || (SubmitterGroupResourcesInUse < GroupMaxRunningJobs)))
    UserMaxRunningJobs = 450.0

The Requirements expression for job 579102.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          83  RequirementsOrig


579102.000:  Job is held.

Hold reason: Error from slot1@mlwn02.ific.uv.es: Failed to execute '/lhome/ific/a/artemisa_user/02_cpu/augment_data_cpu.sh': (errno=13: 'Permission denied')

Last successful match: Thu Sep 21 17:26:28 2023


579102.000:  Run analysis summary ignoring user priority.  Of 72 machines,
      0 are rejected by your job's requirements
     70 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      2 are able to run your job
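
In this example the hold reason is a missing execute permission on the script. Since the filesystem is shared (see Storage), one possible recovery, sketched here with the standard condor_release command, is to fix the permission on the UI and put the held job back in the queue:

$ chmod +x /lhome/ific/a/artemisa_user/02_cpu/augment_data_cpu.sh
$ condor_release 579102.0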

Storage

Storage is maintained on several disk servers, as detailed here. A distributed Lustre filesystem is shared and mounted on the different nodes of the cluster, including the UIs and WNs.

This means that all data is available on all nodes, and no explicit file transfer is needed for it to be accessible from the WNs. This includes the user home directories and project areas.
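
Consequently, a job script can reference data under the home or project areas directly by absolute path. A minimal sketch (the <groupid> placeholder, process.py and the file names are illustrative only):

#!/bin/bash
# /lustre and /lhome paths are visible on the WNs exactly as on the UI,
# so no explicit file transfer commands are needed in the submit file.
INPUT=/lustre/ific.uv.es/ml/<groupid>/datasets/input.csv
OUTPUT=/lustre/ific.uv.es/ml/<groupid>/results/output.csv
python3 process.py "$INPUT" "$OUTPUT"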

Utilities

The following tools are available on both the WNs and the UI nodes.

Containers

Containers are a form of software distribution that is very convenient for developers and users.

Singularity is supported, as it is secure and supports several container types, including Docker images and access to Docker Hub. The user documentation for the currently deployed release can be found here.
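
As a sketch of typical usage (the image below is only an example pulled from Docker Hub, not something provided by Artemisa), a containerized command can be run with GPU support by adding the --nv flag, which exposes the NVIDIA driver inside the container:

$ singularity exec --nv docker://tensorflow/tensorflow:2.6.0-gpu \
      python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

On a UI node this command would be launched through gpurun; inside a batch job on a WN it runs directly.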

CVMFS: HEP Software distribution

We adopt CVMFS as the main HEP software distribution method. The software packages are distributed in different repositories and are accessible as locally mounted /cvmfs mount points on the UIs and WNs.

Currently available repositories:

  • CERN/SFT repositories: list of provided packages. The environment can be set up using lcgenv (see the sketch after this list).

$ cd /cvmfs/atlas.cern.ch

  • Other repositories: other CERN CVMFS repositories, maintained by their respective owners, are also available as mount points, as detailed in the CVMFS repositories list.
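
As a sketch of how an SFT software stack is typically activated from CVMFS (the LCG release and platform tag below are illustrative; check which views actually exist under /cvmfs/sft.cern.ch/lcg/views/):

$ source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh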

Locally installed software

  • NVIDIA Drivers : Driver Version: 535.54.03

  • NVIDIA CUDA Toolkit : provides a development environment for creating high performance GPU-accelerated applications. Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5, 12.2

  • NVIDIA CUDA-X: a collection of GPU-accelerated libraries, tools, and technologies built on top of NVIDIA CUDA that deliver higher performance than CPU-only alternatives. Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5, 12.2

  • Compilers and interpreters: Python 2.7.5, Python 3.6.8, GCC 4.8.5

  • TensorFlow: an end-to-end open-source platform for machine learning. Installed release: 2.6.0 (Python 3). A quick GPU availability check is sketched below the list.

  • Other scientific libraries: SciPy, NumPy, ATLAS, BLAS, LAPACK.
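
A quick way to verify that the locally installed TensorFlow release sees a GPU is to run a short check on the UI GPU through gpurun. This is a minimal sketch, not output captured on Artemisa:

$ gpurun python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"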