Environment

Login to Artemisa

Users can access both User Interface (UI) nodes (mlui01.ific.uv.es and mlui02.ific.uv.es) via SSH. For instance, on Linux and Mac using a terminal:

$ ssh YOUR_ARTEMISA_USERNAME@mlui01.ific.uv.es

On Windows, this can be done using an SSH client such as PuTTY.

Once authenticated, the following welcome banner is shown:

==========================================================

Welcome to
    _         _                 _
   / \   _ __| |_ ___ _ __ ___ (_)___  __ _
  / _ \ | '__| __/ _ \ '_ ` _ \| / __|/ _` |
 / ___ \| |  | ||  __/ | | | | | \__ \ (_| |
/_/   \_\_|   \__\___|_| |_| |_|_|___/\__,_|

The IFIC's AI and ML infraestructure

This is mlui01.ific.uv.es running CentOS 7.9.2009
Please, read the "Politica de Seguridad de la Informacion"

https://artemisa.ific.uv.es/static/Politica_Seguridad.pdf

==========================================================
**ATTENTION**
CUDA Version: 12.2  Driver Version: 535.54.03
Execute GPUstatus to get a summary on GPU utilization
==========================================================

[artemisa_user@mlui01 ~]$

On UI nodes users can develop and submit production jobs. These nodes also allow users to test and validate their programs using an entry-level local GPU. Exclusive use of this local GPU is restricted to 5-minute slots, as it is meant for developing and testing light tasks.

Users can also submit their codes as production batch jobs using the HTCondor job management system. This gives access to the Worker Nodes (WNs), enabling the use of high-end GPUs. WNs may have up to 8 GPUs with high-speed NVLink interconnection (see System Overview) and extended execution times of up to 48 hours.

Users can have up to 8 jobs running simultaneously.

User Interface

UI nodes also serve as development nodes, where users can test and validate their processes using the entry-level local GPU. After validation, they can submit their codes as production batch jobs using the HTCondor job management system.

Authorized users can log in to the UI machines through the SSH protocol:

mlui01.ific.uv.es
mlui02.ific.uv.es

The user HOME directory resides in the Lustre filesystem:

/lhome/ific/<username_first_letter>/<username>   <=  IFIC users
/lhome/ext/<groupid>/<username>  <= external users

In addition to the user HOME directory, there is project space available upon request, also accessible from the UI:

/lustre/ific.uv.es/ml/<groupid>

Use UI GPU (gpurun)

As GPUs are an exclusive resource, users must request the local User Interface GPU through the gpurun tool, which grants a time slot of 5 minutes.

$ gpurun -i
Connected
Info: OK 0. Tesla V100-PCIE-16GB [00000000:5E:00.0]
         1. Tesla V100-PCIE-16GB [00000000:86:00.0]

Total clients:0 Running:0 Estimated waiting time:0 seconds

If several users request the GPU concurrently, the tool queues the requests and serves them in order.
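
Besides querying its status, the UI GPU is used by launching the actual command through gpurun. The line below is a minimal sketch, assuming gpurun takes the command to execute as its argument; my_test.py is a hypothetical user script, not a file provided by Artemisa:

$ gpurun python3 my_test.py

The command waits for a free slot if necessary and then runs on one of the local GPUs within the 5-minute limit.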

Worker Nodes use

HTCondor

HTCondor is the resource management system that runs in this cluster. It manages the job workflow and allows users to submit jobs for execution on the Worker Nodes (WNs). Direct access to the WNs is not allowed.

Each WN can be split into slots that accept jobs to be processed; HTCondor deals with job sorting and processing. WN slots are created when a job does not require the entire node's resources, so more jobs can run on the same WN. CPU and memory resources are subtracted in chunks from the main slot. Requests of 0, 1, 2, 4 or 8 GPUs are permitted, and each GPU is used exclusively by one job.

HTCondor tries to run jobs from different users following a fair-share policy: it takes into account the processing time each user has already consumed, so that CPU time is distributed evenly among all users.

The HTCondor manual can be found here.
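
To inspect how the fair-share mechanism currently ranks users, HTCondor provides the condor_userprio command. The invocation below is a generic HTCondor sketch, not output captured on Artemisa:

$ condor_userprio -allusers

Numerically lower effective priority values mean the user is scheduled ahead of users with higher values.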

The current Artemisa WN configuration:

node        GPU_Model                    #nodes
mlwn03      4x Tesla V100-SXM2-32GB           1
mlwn24      8x NVIDIA A100-SXM4-40GB          1
mlwn01-02   1x Tesla V100-PCIE-32GB           2
mlwn04-23   1x Tesla V100-PCIE-32GB          20
mlwn25-35   1x NVIDIA A100-PCIE-40GB         11

In more detail:

Machine             Platform      Cpus  Gpus  TotalGb  FreeCpu
mlwn01.ific.uv.es   x64/CentOS7     96     1   376.4         8
mlwn02.ific.uv.es   x64/CentOS7     96     1   376.4         8
mlwn03.ific.uv.es   x64/CentOS7    112     4   754.37       56
mlwn04.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn05.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn06.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn07.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn08.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn09.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn10.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn11.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn12.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn13.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn14.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn15.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn16.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn17.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn18.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn19.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn20.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn21.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn22.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn23.ific.uv.es   x64/CentOS7     80     1   376.38       48
mlwn24.ific.uv.es   x64/CentOS7    192     8   503.82       48
mlwn25.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn26.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn27.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn28.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn29.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn30.ific.uv.es   x64/CentOS7    128     1   472.32       72
mlwn31.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn32.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn33.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn34.ific.uv.es   x64/CentOS7    128     1   503.82       72
mlwn35.ific.uv.es   x64/CentOS7    128     1   503.82       72
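
The detailed listing above can be refreshed at any time from a UI node: its columns correspond to the machine-centric summary printed by condor_status in compact mode (standard HTCondor; exact columns may vary between versions):

$ condor_status -compact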

Submit Description File

We are going to submit a simple job, hello_world.sh:

#!/bin/bash
echo "Hello World!" 
/bin/sleep 10

A submit description file is required to submit a job. It contains the specifics of the task, such as job requirements, output files, etc.

A simple submit description file example might be hello_world.sub:

# hello_world
 
universe = vanilla

executable          = hello_world.sh
arguments           = $(Cluster) $(Process)

log                 = hello_world.log
output              = outfile.$(Cluster).$(Process).out
error               = errors.$(Cluster).$(Process).err

request_cpus        = 4
request_memory      = 4000
request_disk        = 1G

queue 2

Each line of the submit description file has the form

command_name = value

The command name is case insensitive and precedes an equals sign. Values to the right of the equals sign are likely to be case sensitive, especially when they specify paths and file names.

The first line of this submit description file is a comment. Comments begin with the # character. Comments do not span lines.

  • universe : defines an execution environment for a job. If a universe is not specified, the default is vanilla. The vanilla universe is a good default, for it has the fewest restrictions on the job. More on HTCondor Universes.

  • executable : specifies the program that becomes the HTCondor job. It can be a shell script or a program. A full path and executable name, or a path and executable name relative to the current working directory, may be specified.

Caution

.sh files must be given execution permissions. In this case: chmod +x hello_world.sh

  • arguments : While the name of the executable is specified in the submit description file with the executable command, the remainder of the command line will be specified with the arguments command.

  • log : The log command causes a job event log file named hello_world.log to be created once the job is submitted. It is not mandatory, but it can be incredibly useful in figuring out what happened or is happening with a job.

  • output : file capturing the redirected standard output. When the program runs as an HTCondor job, its standard output is written on the WN and would otherwise be unavailable. In this example, each execution produces a unique file through the $(Cluster) and $(Process) macros, so it is not overwritten. Each set of queued jobs from a specific user, submitted from a single submit host and sharing an executable, has the same value of $(Cluster). Within a cluster of jobs, each one takes on its own unique $(Process), the first job having value 0. More on variables in the Submit Description File.

  • error : file capturing the redirected standard error. Analogous to output.

  • request_cpus : the number of CPUs the task needs; requesting them explicitly avoids overloading the WN.

  • request_memory : the amount of RAM the task will use. Suffixes: 'M' for megabytes, 'G' for gigabytes. Default unit: megabytes.

  • request_disk : the amount of disk space the task will use. Suffixes: 'M' for megabytes, 'G' for gigabytes. Default unit: kilobytes.

  • queue : the number of instances of the job to be submitted, 2 in this case (default is 1). This should be the last command in the file; commands after it are ignored.

The submit description file can include other commands to meet the needs of more complex tasks. Some of these commands are used and explained in the tutorials section. More comprehensive documentation can be found here.
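
Because the WNs host the GPUs, a common variation of the file above requests one or more GPUs through the standard HTCondor request_gpus command. The following is only a sketch: train_model.sh and the resource amounts are illustrative, not values prescribed by Artemisa.

# gpu_job.sub - sketch of a single-GPU batch job (names are illustrative)
universe            = vanilla

executable          = train_model.sh
log                 = train_model.log
output              = train_model.$(Cluster).$(Process).out
error               = train_model.$(Cluster).$(Process).err

request_gpus        = 1
request_cpus        = 8
request_memory      = 16G
request_disk        = 10G

queue

Remember that 0, 1, 2, 4 or 8 GPUs may be requested and that each GPU is assigned exclusively to one job.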

Job management with HTCondor

We now have all the ingredients to submit the job to HTCondor. We can do so by executing condor_submit on the command line:

[artemisa_user@mlui01]$ condor_submit hello_world.sub
Submitting job(s)..
2 job(s) submitted to cluster 568517.

Once the job is submitted, we can monitor it, together with any other of our jobs currently in the queue, with condor_q:

[artemisa_user@mlui01]$ condor_q

-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:20
OWNER         BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
artemisa_user ID: 568517    9/4  13:55    _     _       2     2   568517.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 4 idle, 13 running, 55 held, 0 suspended

This shows that our 2 jobs are waiting to be executed (IDLE column). Once the jobs start running, we see:

[artemisa_user@mlui01]$ condor_q

-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:24
OWNER         BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
artemisa_user ID: 568517    9/4  13:55    _     2       _     2   568517.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 2 idle, 15 running, 55 held, 0 suspended

Submitted jobs can be removed from the queue with the condor_rm command, passing the job identifier as a command-line argument:

[artemisa_user@mlui01]$ condor_rm 568517.1
Job 568517.1 marked for removal

If we want to remove all jobs within a cluster, we only specify the cluster number:

[artemisa_user@mlui01]$ condor_rm 568517
All jobs in cluster 568517 have been marked for removal
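
condor_rm also accepts a user name, in which case every job owned by that user is removed (standard HTCondor behaviour; artemisa_user is the placeholder user used throughout this page):

$ condor_rm artemisa_user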

A more thorough analysis can be obtained by adding the -better-analyze option to condor_q; this is useful to detect problems in jobs, for instance when a job is in HOLD or stuck in IDLE status.

$ condor_submit augment_data_cpu.sub
Submitting job(s).
1 job(s) submitted to cluster 579102.
$ condor_q
-- Schedd: cm02.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/21/23 17:26:33
OWNER           BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
artemisa_user   ID: 579102    9/21 17:26    _     _       _     1      1   579102.0

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for artemisa_user: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for all users: 180 jobs; 0 completed, 0 removed, 107 idle, 39 running, 34 held, 0 suspended

$ condor_q -better-analyze 579102.0

-- Schedd: cm02.ific.uv.es : <147.156.116.133:9618?...
The Requirements expression for job 579102.000 is

    (RequirementsOrig) && (RequirementsUserGroupLimit)

Job 579102.000 defines the following attributes:

    DiskUsage = 1
    FileSystemDomain = "ific.uv.es"
    GroupMaxRunningJobs = 36.0
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
    RequirementsOrig = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
    RequirementsUserGroupLimit = ((isUndefined(SubmitterUserResourcesInUse) || (SubmitterUserResourcesInUse < UserMaxRunningJobs)) && (isUndefined(SubmitterGroupResourcesInUse) || (SubmitterGroupResourcesInUse < GroupMaxRunningJobs)))
    UserMaxRunningJobs = 450.0

The Requirements expression for job 579102.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          83  RequirementsOrig


579102.000:  Job is held.

Hold reason: Error from slot1@mlwn02.ific.uv.es: Failed to execute '/lhome/ific/a/artemisa_user/02_cpu/augment_data_cpu.sh': (errno=13: 'Permission denied')

Last successful match: Thu Sep 21 17:26:28 2023


579102.000:  Run analysis summary ignoring user priority.  Of 72 machines,
      0 are rejected by your job's requirements
     70 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      2 are able to run your job
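
In this example the hold reason is a missing execute permission on the script. Since the filesystem is shared (see Storage), one possible recovery, sketched here with the standard condor_release command, is to fix the permission on the UI and put the held job back in the queue:

$ chmod +x /lhome/ific/a/artemisa_user/02_cpu/augment_data_cpu.sh
$ condor_release 579102.0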

Storage

Storage is maintained on several disk servers, as detailed here. A distributed Lustre filesystem is shared and mounted on the different nodes of the cluster, including the UIs and WNs.

This means that all data is available on all nodes, and no explicit file transfer is needed for it to be accessible from the WNs. This includes the user home directories and project areas.
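
Consequently, a job script can reference data under the home or project areas directly by absolute path. A minimal sketch (the <groupid> placeholder, process.py and the file names are illustrative only):

#!/bin/bash
# /lustre and /lhome paths are visible on the WNs exactly as on the UI,
# so no explicit file transfer commands are needed in the submit file.
INPUT=/lustre/ific.uv.es/ml/<groupid>/datasets/input.csv
OUTPUT=/lustre/ific.uv.es/ml/<groupid>/results/output.csv
python3 process.py "$INPUT" "$OUTPUT"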

Utilities

The following tools are available on both the WNs and the UI nodes.

Containers

Containers are a form of software distribution that is very convenient for developers and users.

Singularity is supported, as it is secure and supports several container types, including Docker images and access to Docker Hub. The user documentation for the currently deployed release can be found here.
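
As a sketch of typical usage (the image below is only an example pulled from Docker Hub, not something provided by Artemisa), a containerized command can be run with GPU support by adding the --nv flag, which exposes the NVIDIA driver inside the container:

$ singularity exec --nv docker://tensorflow/tensorflow:2.6.0-gpu \
      python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

On a UI node this command would be launched through gpurun; inside a batch job on a WN it runs directly.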

CVMFS: HEP Software distribution

We adopt CVMFS as the main HEP software distribution method. The software packages are distributed in different repositories and are accessible as locally mounted /cvmfs mount points on the UIs and WNs.

Currently available repositories:

  • CERN/SFT repositories: list of provided packages. The environment can be set up using lcgenv (see the sketch after this list).

$ cd /cvmfs/atlas.cern.ch

  • Other repositories: other CERN CVMFS repositories, maintained by their respective owners, are also available as mount points, as detailed in the CVMFS repositories list.
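
As a sketch of how an SFT software stack is typically activated from CVMFS (the LCG release and platform tag below are illustrative; check which views actually exist under /cvmfs/sft.cern.ch/lcg/views/):

$ source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh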

Locally installed software

  • NVIDIA Drivers : Driver Version: 535.54.03

  • NVIDIA CUDA Toolkit : provides a development environment for creating high performance GPU-accelerated applications. Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5, 12.2

  • NVIDIA CUDA-X: a collection of GPU-accelerated libraries, tools, and technologies built on top of NVIDIA CUDA that deliver higher performance than CPU-only alternatives. Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5, 12.2

  • Compilers and interpreters: Python 2.7.5, Python 3.6.8, GCC 4.8.5

  • TensorFlow: an end-to-end open-source platform for machine learning. Installed release: 2.6.0 (Python 3). A quick GPU availability check is sketched below the list.

  • Other scientific libraries: SciPy, NumPy, ATLAS, BLAS, LAPACK.
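
A quick way to verify that the locally installed TensorFlow release sees a GPU is to run a short check on the UI GPU through gpurun. This is a minimal sketch, not output captured on Artemisa:

$ gpurun python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"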