03-Submit a job to WNs GPU(s)

To use the GPUs in the worker nodes (WNs), jobs must be submitted in batch mode. This has several benefits, such as a higher time limit of 48 hours. In this first example we submit a job that requests a single GPU and prints basic environment information, which is useful for debugging.

Required files

We create the following submit description file, get_info.sub,

universe = vanilla

executable              = get_info.sh
arguments               = $(Process)

log                     = condor_logs/test.log
output                  = condor_logs/outfile.$(Cluster).$(Process).out
error                   = condor_logs/errors.$(Cluster).$(Process).err

request_gpus = 1

queue

that requests one GPU to execute the following script,

#!/bin/sh
echo ">>>> HOST"
hostname
echo ">>>> CURRENT DIR"
pwd
echo ">>>> USER"
whoami 
echo ">>>> SPACE LEFT"
df -h
echo ">>>> NVIDIA INFO"
nvidia-smi
echo ">>>> OS INFO"
cat /etc/os-release

Caution

Don’t forget to give .sh files execute permission: chmod +x get_info.sh

Submit Job

Create the output directory and submit the job:

(artuto) $ mkdir condor_logs
(artuto) $ condor_submit get_info.sub
Submitting job(s).
1 job(s) submitted to cluster 571648.

The standard output and standard error of the script are recorded in the files set by the ‘output’ and ‘error’ commands of the .sub file, respectively. The log file contains HTCondor-related information.
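Because the output and error file names embed $(Cluster) and $(Process), the cluster number printed by condor_submit is enough to locate them. The following is a minimal sketch, assuming the message format shown above and a single job whose process number is 0; the helper names parse_cluster_id and output_paths are hypothetical and not part of HTCondor.

```python
import re
from pathlib import Path


def parse_cluster_id(submit_output: str) -> int:
    """Extract the cluster number from the text printed by condor_submit."""
    match = re.search(r"submitted to cluster (\d+)", submit_output)
    if match is None:
        raise ValueError("no cluster number found in condor_submit output")
    return int(match.group(1))


def output_paths(cluster: int, process: int = 0, log_dir: str = "condor_logs"):
    """Build the stdout/stderr paths using the naming pattern of the .sub file."""
    base = Path(log_dir)
    out = base / f"outfile.{cluster}.{process}.out"
    err = base / f"errors.{cluster}.{process}.err"
    return out, err


# Text copied from the submission example above
message = "Submitting job(s).\n1 job(s) submitted to cluster 571648.\n"
cluster = parse_cluster_id(message)
out_file, err_file = output_paths(cluster)
print(out_file)
print(err_file)
```

The same pattern extends to jobs queued several times by passing a different process number to output_paths.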

Inspecting the content of the output file:

  • >>>> HOST : assigned WN.

  • >>>> CURRENT DIR, USER : working directory in the UI and user.

  • >>>> SPACE LEFT : mounted filesystems and their free space, including the home and project filesystems. AFS is not mounted on the Worker Nodes.

  • >>>> NVIDIA INFO : GPU(s) info.

  • >>>> OS INFO : Operating System info with version details.
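When the output file grows, it can help to split it into the sections listed above. This is a rough sketch that groups lines under each ">>>> " marker written by get_info.sh; the function name split_sections is hypothetical and the sample text is made up for illustration, not real output from a worker node.

```python
def split_sections(text: str, marker: str = ">>>> ") -> dict:
    """Group the lines of the job output under the marker line that precedes them."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith(marker):
            # A marker line opens a new section named after the rest of the line
            current = line[len(marker):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# Made-up sample in the shape produced by get_info.sh (values are placeholders)
sample = """>>>> HOST
worker-node.example
>>>> CURRENT DIR
/some/working/dir
>>>> USER
someuser
"""

sections = split_sections(sample)
print(sections["HOST"])
```

In practice the text would be read from the file named by the output command of the .sub file.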

Requesting a specific resource

Artemisa has three different models of NVIDIA Data Center GPUs: V100, A100 and H100.

Currently you can force your jobs to use a specific type of GPU with:

# run only on V100 GPU machines
+UseNvidiaV100 = True

# run only on A100 GPU machines
+UseNvidiaA100 = True

# run only on H100 GPU machines
+UseNvidiaH100 = True

allowing you to choose the model that best suits the needs of your job.
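If submit files are generated from a script, the selection line can be produced from the model name. This is only a sketch: the helper gpu_selection_line is hypothetical, and it assumes that the three attributes shown above are the only ones available.

```python
# GPU models named by the +UseNvidia... attributes in this tutorial
SUPPORTED_MODELS = ("V100", "A100", "H100")


def gpu_selection_line(model: str) -> str:
    """Return the submit-file line that restricts a job to one GPU model."""
    model = model.upper()
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unknown GPU model: {model}")
    return f"+UseNvidia{model} = True"


print(gpu_selection_line("h100"))
```

Rejecting unknown names early avoids submitting a job with a misspelled attribute.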

The next example runs the same Python code as the previous tutorial, but on a specific GPU model. The following submit description file, specific_gpu.sub, implements this:

universe = vanilla

executable              = augment_data_gpu.py
arguments               = $(Process)

log                     = condor_logs/test.log
output                  = condor_logs/outfile.$(Cluster).$(Process).out
error                   = condor_logs/errors.$(Cluster).$(Process).err

notify_user = your.mail@institution.edu
notification = always

getenv = True

+UseNvidiaH100 = True

request_gpus = 1

queue

We also included two useful commands, since this job will take a bit longer,

notify_user = your.mail@institution.edu
notification = always

HTCondor will send an e-mail when the job finishes, whether it succeeded or failed. This e-mail contains resource metrics from the job.

The Python script is very similar to the one in the previous tutorial, with two changes. The first sets the GPU as the only visible device:

######
# Set GPU as only available physical device
print("Set GPU as only available device")
my_devices = tf.config.list_physical_devices(device_type='GPU')
tf.config.set_visible_devices(devices=my_devices)
######

The second runs the augmentation many times, generating a large number of images:

######
# num_images_generated = batch_size * repetitions
repetitions = 1000
for x in range(repetitions):
   batch_images, batch_labels = next(it)
######
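The comment in the loop gives the number of generated images as batch_size * repetitions. The following sanity check reproduces that count with a stand-in iterator. Note that batch_size = 32 is an assumed value, not taken from the tutorial script, and fake_batches is a hypothetical placeholder for the real augmentation iterator.

```python
batch_size = 32      # assumption: not taken from the tutorial script
repetitions = 1000   # value used in the tutorial


def fake_batches(size):
    """Endless stand-in for the augmentation iterator: yields lists of `size` items."""
    while True:
        yield [None] * size


it = fake_batches(batch_size)
num_images_generated = 0
for _ in range(repetitions):
    batch = next(it)
    num_images_generated += len(batch)

# Matches the formula in the comment: batch_size * repetitions
print(num_images_generated)
```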

The visualization part has also been removed, as it isn't needed in this example.

After sending the job with condor_submit, the augmented_data directory is created and populated with the transformed images.

(artuto) $ condor_submit specific_gpu.sub

Summary

Recap

  • Jobs can be submitted to a WN through HTCondor to gain exclusive access to a WN GPU.

  • The log, output and error commands of the submit description file collect the submission status, standard output and standard error, respectively.

  • Use +UseNvidiaA100, +UseNvidiaV100 and +UseNvidiaH100 to select a specific GPU type.