HTCondor

HTCondor is the resource management system running in this cluster. It manages the job workflow and allows users to submit jobs for execution on the Worker Nodes (WN). Direct access to the WNs is not allowed.

Each WN can be split into slots that accept jobs for processing. HTCondor deals with job sorting and processing. WN slots are created when a job does not require the full node resources, so more jobs can run on the WN. CPU and memory resources are subtracted in chunks from the main slot. Requests of 0, 1, 2, 4 or 8 GPUs are permitted. Each GPU is used exclusively by one job.
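
For illustration, a minimal sketch of the resource-request lines that would ask for one GPU in a submit description file (the submit file format itself is explained below; the remaining commands of a complete submit file are omitted here):

# Sketch: resource requests including one GPU
request_cpus   = 4
request_memory = 8G
request_gpus   = 1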

HTCondor tries to run jobs from different users in a fair-share way: it takes into account the processing time already consumed by each user's jobs to achieve a fair distribution of computing resources when several users compete for the same resources.
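
For reference, the per-user priorities and accumulated usage that drive this fair-share scheduling can be listed with the standard condor_userprio command (assuming it is run from a User Interface machine; output omitted here):

[artemisa_user@mlui01]$ condor_userprio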

The HTCondor manual can be found here.

Worker Nodes Configuration

The current configuration of the Artemisa WNs:

(*) Per GPU

Machine             Cpus  Memory(GB)  GPU Model               GPU Mem(*)  GPUs
mlwn01.ific.uv.es   96    376         Tesla V100-PCIE-32GB    32GB        1
mlwn02.ific.uv.es   96    376         Tesla V100-PCIE-32GB    32GB        1
mlwn03.ific.uv.es   112   754         Tesla V100-SXM2-32GB    32GB        4
mlwn04.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn05.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn06.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn07.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn08.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn09.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn10.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn11.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn12.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn13.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn14.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn15.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn16.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn17.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn18.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn19.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn20.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn21.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn22.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn23.ific.uv.es   80    377         Tesla V100-PCIE-32GB    32GB        1
mlwn24.ific.uv.es   192   503         NVIDIA A100-SXM4-40GB   40GB        8
mlwn25.ific.uv.es   128   503         NVIDIA A100-PCIE-40GB   40GB        1
mlwn26.ific.uv.es   128   503         NVIDIA A100-PCIE-40GB   40GB        1
mlwn27.ific.uv.es   128   503         NVIDIA A100-PCIE-40GB   40GB        1
mlwn28.ific.uv.es   128   503         NVIDIA A100-PCIE-40GB   40GB        1
mlwn29.ific.uv.es   128   503         NVIDIA A100-PCIE-40GB   40GB        1
mlwn30.ific.uv.es   128   471         NVIDIA A100-PCIE-40GB   40GB        1
mlwn31.ific.uv.es   128   503         NVIDIA A100-PCIE-40GB   40GB        1
mlwn32.ific.uv.es   128   503         NVIDIA A100-PCIE-40GB   40GB        1
mlwn33.ific.uv.es   128   503         NVIDIA A100-PCIE-40GB   40GB        1
mlwn34.ific.uv.es   128   503         NVIDIA A100-PCIE-40GB   40GB        1
mlwn36.ific.uv.es   192   377         NVIDIA H100 NVL         94GB        2
mlwn37.ific.uv.es   192   377         NVIDIA H100 NVL         94GB        2

Submit Description File

We are going to submit a simple job, hello_world.sh:

#!/bin/bash
echo "Hello World!" 
/bin/sleep 10

A submit description file is required to submit a job for execution on the WNs. This type of file contains the specifics of the submitted task, such as job requirements, outputs, etc.

A simple example of a submit description file is hello_world.sub:

# hello_world
 
universe = vanilla

executable          = hello_world.sh
arguments           = $(Cluster) $(Process)

log                 = hello_world.log
output              = outfile.$(Cluster).$(Process).out
error               = errors.$(Cluster).$(Process).err

request_cpus        = 4
request_memory      = 4000
request_disk        = 1G

queue 2

Each line has the form

command_name = value

The command name is case insensitive and precedes the equals sign. Values to the right of the equals sign are likely to be case sensitive, especially when they specify paths and file names.

  • The first line of this submit description file is a comment. Comments begin with the # character. Comments do not span lines.

  • universe : defines an execution environment for a job. If a universe is not specified, the default is vanilla. The vanilla universe is a good default, for it has the fewest restrictions on the job. More on HTCondor Universes.

  • executable : specifies the program that becomes the HTCondor job. It can be a shell script or a program. A full path and executable name, or a path and executable relative to the current working directory, may be specified.

Caution

.sh files must be given execute permission. In this case: chmod +x hello_world.sh

  • arguments : While the name of the executable is specified with the executable command, the remainder of the command line is specified with the arguments command. Here the $(Cluster) and $(Process) macros are passed to the script (see the sketch after this list for how the script can read them).

  • log : The log command causes a job event log file named hello_world.log to be created once the job is submitted. It is not necessary, but it can be incredibly useful in figuring out what happened or is happening with a job.

  • output : file capturing the redirected standard output. When the script runs as an HTCondor job, its standard output is produced on the WN and would otherwise be unavailable. In this example, each execution produces a unique file through the $(Cluster) and $(Process) macros, so it is not overwritten. Each set of queued jobs from a specific user, submitted from a single submit host and sharing an executable, has the same value of $(Cluster). Within a cluster of jobs, each takes on its own unique $(Process); the first job has value 0. More on variables in the Submit Description File

  • error : file capturing the redirected standard error. Analogous to output.

  • request_cpus : The number of CPUs the task needs; requesting the right amount avoids overloading the WN.

  • request_memory : How much RAM the task is going to use. Suffixes: ‘M’ for Megabytes, ‘G’ for Gigabytes. Default unit: megabytes.

  • request_disk : How much disk space the task is going to use. Suffixes: ‘M’ for Megabytes, ‘G’ for Gigabytes. Default unit: kilobytes.

  • queue : Number of instances of the job to be sent. 2 in this case. Default is 1. This should be the last command in the file. Commands after this are ignored.
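
As referenced in the arguments item above, here is a minimal sketch of how the executable could read the values passed through $(Cluster) and $(Process) (an assumed variant of the hello_world.sh example, not part of the original files):

#!/bin/bash
# $1 and $2 receive the values expanded from $(Cluster) and $(Process)
CLUSTER=$1
PROCESS=$2
echo "Hello World! I am process $PROCESS of cluster $CLUSTER"
/bin/sleep 10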

The submit description file can include other commands that can meet the needs of more complex tasks. Some of these commands are going to be used and explained in the tutorials section. A more comprehensive documentation can be found here.

Job management with HTCondor

We now have all the ingredients to submit the job to HTCondor. We can do so by executing condor_submit on the command line:

[artemisa_user@mlui01]$ condor_submit hello_world.sub
Submitting job(s)..
2 job(s) submitted to cluster 568517.

We can monitor our jobs currently in the queue with condor_q:

[artemisa_user@mlui01]$ condor_q

-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:20
OWNER         BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
artemisa_user ID: 568517    9/4  13:55    _     _       2     2   568517.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 4 idle, 13 running, 55 held, 0 suspended

This shows that our 2 jobs are waiting to be executed (IDLE column). Once the jobs start, we see:

[artemisa_user@mlui01]$ condor_q

-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:24
OWNER         BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
artemisa_user ID: 568517    9/4  13:55    _     2       _     2   568517.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 2 idle, 15 running, 55 held, 0 suspended

Submitted jobs can be removed from the queue with the condor_rm command, passing the job identifier as an argument:

[artemisa_user@mlui01]$ condor_rm 568517.1
Job 568517.1 marked for removal

If we want to remove all jobs within a cluster, we specify only the cluster number:

[artemisa_user@mlui01]$ condor_rm 568517
All jobs in cluster 568517 have been marked for removal

We can obtain a more thorough analysis by adding the -better-analyze option to condor_q, which helps detect problems with jobs, e.g. if a job is on HOLD or stuck in IDLE status.

$ condor_submit augment_data_cpu.sub
Submitting job(s).
1 job(s) submitted to cluster 579102.
$ condor_q
-- Schedd: cm02.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/21/23 17:26:33
OWNER           BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
artemisa_user   ID: 579102    9/21 17:26    _     _       _     1      1   579102.0

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for artemisa_user: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for all users: 180 jobs; 0 completed, 0 removed, 107 idle, 39 running, 34 held, 0 suspended

$ condor_q -better-analyze 579102.0

-- Schedd: cm02.ific.uv.es : <147.156.116.133:9618?...
The Requirements expression for job 579102.000 is

    (RequirementsOrig) && (RequirementsUserGroupLimit)

Job 579102.000 defines the following attributes:

    DiskUsage = 1
    FileSystemDomain = "ific.uv.es"
    GroupMaxRunningJobs = 36.0
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
    RequirementsOrig = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
    RequirementsUserGroupLimit = ((isUndefined(SubmitterUserResourcesInUse) || (SubmitterUserResourcesInUse < UserMaxRunningJobs)) && (isUndefined(SubmitterGroupResourcesInUse) || (SubmitterGroupResourcesInUse < GroupMaxRunningJobs)))
    UserMaxRunningJobs = 450.0

The Requirements expression for job 579102.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          83  RequirementsOrig


579102.000:  Job is held.

Hold reason: Error from slot1@mlwn02.ific.uv.es: Failed to execute '/lhome/ific/a/artemisa_user/02_cpu/augment_data_cpu.sh': (errno=13: 'Permission denied')

Last successful match: Thu Sep 21 17:26:28 2023


579102.000:  Run analysis summary ignoring user priority.  Of 72 machines,
      0 are rejected by your job's requirements
     70 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      2 are able to run your job
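
In this example the hold is caused by the executable lacking execute permission. A sketch of the typical fix (assuming the same script and job ID as above) is to correct the permission and then release the held job with condor_release:

[artemisa_user@mlui01]$ chmod +x augment_data_cpu.sh
[artemisa_user@mlui01]$ condor_release 579102.0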