HTCondor
HTCondor is the resource management system running in this cluster. It manages the job workflow and allows users to submit jobs to be executed on the Worker Nodes (WNs). Direct access to the WNs is not allowed.
Each WN can be split into slots that accept jobs to be processed; HTCondor deals with job sorting and processing. WN slots are created when a job does not require the full node resources, so that more jobs can run on the same WN. CPU and memory resources are subtracted in chunks from the main slot. Requests of 0, 1, 2, 4 or 8 GPUs are permitted, and each GPU is used exclusively by one job.
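For example, a job that needs a single GPU declares it in its submit description file (introduced below) with the request_gpus command. This is a minimal sketch of the relevant lines; the CPU and memory figures are only illustrative:
request_cpus   = 8
request_memory = 32G
request_gpus   = 1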
HTCondor tries to run jobs from different users in a fair-share way: it takes into account the processing time previously consumed by each user's jobs to achieve a fair distribution of computing resources when several users compete for the same resource.
The HTCondor manual can be found here.
Worker Nodes Configuration
The current Artemisa WNs configuration:
(*) Per GPU
| Machine | CPUs | Memory (GB) | GPU Model | GPU Mem (*) | GPUs |
|---|---|---|---|---|---|
| mlwn01.ific.uv.es | 96 | 376 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn02.ific.uv.es | 96 | 376 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn03.ific.uv.es | 112 | 754 | Tesla V100-SXM2-32GB | 32GB | 4 |
| mlwn04.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn05.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn06.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn07.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn08.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn09.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn10.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn11.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn12.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn13.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn14.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn15.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn16.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn17.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn18.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn19.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn20.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn21.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn22.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn23.ific.uv.es | 80 | 377 | Tesla V100-PCIE-32GB | 32GB | 1 |
| mlwn24.ific.uv.es | 192 | 503 | NVIDIA A100-SXM4-40GB | 40GB | 8 |
| mlwn25.ific.uv.es | 128 | 503 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn26.ific.uv.es | 128 | 503 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn27.ific.uv.es | 128 | 503 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn28.ific.uv.es | 128 | 503 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn29.ific.uv.es | 128 | 503 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn30.ific.uv.es | 128 | 471 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn31.ific.uv.es | 128 | 503 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn32.ific.uv.es | 128 | 503 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn33.ific.uv.es | 128 | 503 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn34.ific.uv.es | 128 | 503 | NVIDIA A100-PCIE-40GB | 40GB | 1 |
| mlwn36.ific.uv.es | 192 | 377 | NVIDIA H100 NVL | 94GB | 2 |
| mlwn37.ific.uv.es | 192 | 377 | NVIDIA H100 NVL | 94GB | 2 |
Submit Description File
We are going to submit a simple job, hello_world.sh:
#!/bin/bash
echo "Hello World!"
/bin/sleep 10
A submit description file is required to send a job for execution on the WNs. This type of file contains the specifics of the submitted task, such as job requirements, outputs, etc.
A simple example of a submit description file is hello_world.sub:
# hello_world
universe = vanilla
executable = hello_world.sh
arguments = $(Cluster) $(Process)
log = hello_world.log
output = outfile.$(Cluster).$(Process).out
error = errors.$(Cluster).$(Process).err
request_cpus = 4
request_memory = 4000
request_disk = 1G
queue 2
Each line has the form
command_name = value
The command name is case insensitive and precedes an equals sign. Values to the right of the equals sign are likely to be case sensitive, especially when they specify paths and file names.
The first line of this submit description file is a comment. Comments begin with the # character. Comments do not span lines.
universe: defines an execution environment for a job. If a universe is not specified, the default is vanilla. The vanilla universe is a good default, for it has the fewest restrictions on the job. More on HTCondor Universes.
executable: specifies the program that becomes the HTCondor job. It can be a shell script or a program. A full path and executable name, or a path and executable name relative to the current working directory, may be specified.
Caution: .sh files must be given execution permission. In this case:
chmod +x hello_world.sh
arguments: While the name of the executable is specified in the submit description file with the executable command, the remainder of the command line is specified with the arguments command (a sketch below these descriptions shows how these values reach the job's script).
log: The log command causes a job event log file, named hello_world.log in this example, to be created once the job is submitted. It is not required, but it can be incredibly useful for figuring out what happened or is happening with a job.
output: file capturing the redirected standard output. When the script runs as an HTCondor job, its standard output is produced on the WN and would otherwise be unavailable. In this example, each execution produces a unique file through the $(Cluster) and $(Process) macros, so nothing is overwritten; for instance, the first job of the cluster submitted below would write outfile.568517.0.out. Each set of queued jobs from a specific user, submitted from a single submit host and sharing an executable, has the same value of $(Cluster). Within a cluster of jobs, each one takes on its own unique $(Process), the first job having value 0. More on variables in the Submit Description File
error: file capturing the redirected standard error. Analogous to output.
request_cpus: how many CPUs the task needs, so that the WN is not overloaded.
request_memory: how much RAM the task is going to use. Suffixes: 'M' for Megabytes, 'G' for Gigabytes. Default unit: Megabytes.
request_disk: how much disk space the task is going to use. Suffixes: 'M' for Megabytes, 'G' for Gigabytes. Default unit: Kilobytes.
queue: number of instances of the job to submit, 2 in this case. The default is 1. This should be the last command in the file; commands after it are ignored.
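To illustrate how the values passed with arguments reach the job, here is a sketch of hello_world.sh reading $(Cluster) and $(Process) as positional parameters; the variable names are only illustrative:
#!/bin/bash
# First and second command-line arguments: $(Cluster) and $(Process)
CLUSTER=$1
PROCESS=$2
echo "Hello World from job ${CLUSTER}.${PROCESS}!"
/bin/sleep 10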
The submit description file can include other commands that can meet the needs of more complex tasks. Some of these commands are going to be used and explained in the tutorials section. A more comprehensive documentation can be found here.
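As an illustration of such commands, the following hedged snippet enables HTCondor's standard file-transfer mechanism and copies the submitter's environment to the job; the input file names are only placeholders:
getenv                  = True
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = data.txt, config.yaml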
Job management with HTCondor
We now have all the ingredients to submit the job to HTCondor. We can do so by executing condor_submit in the command line:
[artemisa_user@mlui01]$ condor_submit hello_world.sub
Submitting job(s)..
2 job(s) submitted to cluster 568517.
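As a side note, the standard condor_wait tool can block until all the jobs recorded in the event log declared with the log command have finished, which is convenient inside scripts:
[artemisa_user@mlui01]$ condor_wait hello_world.log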
We can monitor our jobs currently in the queue with condor_q:
[artemisa_user@mlui01]$ condor_q
-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:20
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
artemisa_user ID: 568517 9/4 13:55 _ _ 2 2 568517.0-1
Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 4 idle, 13 running, 55 held, 0 suspended
This shows that our 2 jobs are waiting to be executed (IDLE column). Once the jobs start, we see:
[artemisa_user@mlui01]$ condor_q
-- Schedd: xxx.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/04/23 13:55:24
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
artemisa_user ID: 568517 9/4 13:55 _ 2 _ 2 568517.0-1
Total for query: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for artemisa_user: 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
Total for all users: 72 jobs; 0 completed, 0 removed, 2 idle, 15 running, 55 held, 0 suspended
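By default condor_q groups the jobs of a cluster into a single batch line, as above. The standard -nobatch option lists each $(Cluster).$(Process) on its own line instead, which can help when following individual jobs:
[artemisa_user@mlui01]$ condor_q -nobatch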
Submitted jobs can be removed from the queue with the condor_rm command, using the job identifier as an argument:
[artemisa_user@mlui01]$ condor_rm 568517.1
Job 568517.1 marked for removal
If we want to remove all jobs within a cluster, we only specify the cluster number:
[artemisa_user@mlui01]$ condor_rm 568517
All jobs in cluster 568517 have been marked for removal
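All jobs owned by a user can also be removed at once by passing the username to condor_rm (shown here with the placeholder artemisa_user):
[artemisa_user@mlui01]$ condor_rm artemisa_user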
We can obtain a more thorough analysis by adding the -better-analyze option to condor_q, which can be used to detect problems in jobs, e.g. a job on HOLD or stuck in the IDLE state.
$ condor_submit augment_data_cpu.sub
Submitting job(s).
1 job(s) submitted to cluster 579102.
$ condor_q
-- Schedd: cm02.ific.uv.es : <xxx.xxx.xxx.xxx:xxxx?... @ 09/21/23 17:26:33
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
artemisa_user ID: 579102 9/21 17:26 _ _ _ 1 1 579102.0
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for artemisa_user: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for all users: 180 jobs; 0 completed, 0 removed, 107 idle, 39 running, 34 held, 0 suspended
$ condor_q -better-analyze 579102.0
-- Schedd: cm02.ific.uv.es : <147.156.116.133:9618?...
The Requirements expression for job 579102.000 is
(RequirementsOrig) && (RequirementsUserGroupLimit)
Job 579102.000 defines the following attributes:
DiskUsage = 1
FileSystemDomain = "ific.uv.es"
GroupMaxRunningJobs = 36.0
ImageSize = 1
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
RequirementsOrig = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
RequirementsUserGroupLimit = ((isUndefined(SubmitterUserResourcesInUse) || (SubmitterUserResourcesInUse < UserMaxRunningJobs)) && (isUndefined(SubmitterGroupResourcesInUse) || (SubmitterGroupResourcesInUse < GroupMaxRunningJobs)))
UserMaxRunningJobs = 450.0
The Requirements expression for job 579102.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 83 RequirementsOrig
579102.000: Job is held.
Hold reason: Error from slot1@mlwn02.ific.uv.es: Failed to execute '/lhome/ific/a/artemisa_user/02_cpu/augment_data_cpu.sh': (errno=13: 'Permission denied')
Last successful match: Thu Sep 21 17:26:28 2023
579102.000: Run analysis summary ignoring user priority. Of 72 machines,
0 are rejected by your job's requirements
70 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
2 are able to run your job
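Here the hold reason points to a missing execution permission on the script. Once that is fixed, the held job can be put back in the queue with the standard condor_release command; a sketch assuming the same job ID:
$ chmod +x augment_data_cpu.sh
$ condor_release 579102.0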