03-Submit a job to WNs GPU(s)

After developing on the UI nodes, we can submit jobs to the worker nodes (WNs), which gives access to GPUs and a higher time limit of 48 hours. In this example we submit a job that requests a single GPU and retrieves basic environment information, which is useful for debugging.

Required files

We create the following submit description file, get_info.sub,

universe = vanilla

executable              = get_info.sh
arguments               = $(Process)

log                     = condor_logs/test.log
output                  = condor_logs/outfile.$(Cluster).$(Process).out
error                   = condor_logs/errors.$(Cluster).$(Process).err

request_gpus = 1

queue

which requests one GPU to execute the following script, get_info.sh,

#!/bin/sh
echo ">>>> ENVIRONMENT"
printenv
echo ">>>> HOST"
hostname
echo ">>>> CURRENT DIR"
pwd
echo ">>>> USER"
whoami 
echo ">>>> SPACE LEFT"
df -h
echo ">>>> NVIDIA INFO"
set -x #echo on
nvidia-smi

Caution

Don’t forget to give .sh files execute permission: chmod +x get_info.sh

Submit Job

Create the output directory and launch the job:

(artemisa-tuto) $ mkdir condor_logs
(artemisa-tuto) $ condor_submit get_info.sub
Submitting job(s).
1 job(s) submitted to cluster 571648.

The standard output and standard error of the script are recorded in the ‘output’ and ‘error’ files, respectively, as defined by the commands in the .sub file. The ‘log’ file contains HTCondor-related information.
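Since the submit file writes its log, output and error files under condor_logs/, that directory is created by hand before submitting. As a minimal sketch, the same step could be derived from the submit description itself. The function names below are hypothetical helpers, not part of HTCondor, and the parser only understands simple "key = value" lines.

```python
from pathlib import Path

# Submit commands whose values are file paths (taken from the submit file above).
PATH_KEYS = {"log", "output", "error"}


def output_directories(submit_text):
    """Return the parent directories named by the log/output/error lines."""
    dirs = set()
    for line in submit_text.splitlines():
        # Skip comments and lines that are not "key = value" pairs.
        if "=" not in line or line.lstrip().startswith("#"):
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() in PATH_KEYS:
            parent = Path(value.strip()).parent
            if str(parent) != ".":
                dirs.add(parent)
    return dirs


def ensure_output_directories(submit_text, base="."):
    """Create every directory found by output_directories() under base."""
    created = []
    for directory in sorted(output_directories(submit_text)):
        target = Path(base) / directory
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created


SUBMIT = """\
log    = condor_logs/test.log
output = condor_logs/outfile.$(Cluster).$(Process).out
error  = condor_logs/errors.$(Cluster).$(Process).err
"""

if __name__ == "__main__":
    # Only list the directories here; call ensure_output_directories() to create them.
    print(sorted(str(d) for d in output_directories(SUBMIT)))
```

Calling ensure_output_directories(SUBMIT) before condor_submit would play the role of the mkdir condor_logs step above.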

Inspecting the content of outfile.571648.0.out:

  • >>>> ENVIRONMENT : HTCondor variables: assigned GPU, assigned slot, number of threads, etc.

  • >>>> HOST : assigned WN.

  • >>>> CURRENT DIR, USER : working directory in the UI and user.

  • >>>> SPACE LEFT : mounted filesystems and their free space, including the home and project filesystems. AFS is not mounted on Worker Nodes.

  • >>>> NVIDIA INFO : GPU(s) info.
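The ">>>> " headers printed by get_info.sh make the output file easy to split into sections when debugging. The following is a minimal sketch of such a split; split_sections is a hypothetical helper and the sample text is made up, with placeholder values rather than real job output.

```python
def split_sections(text, marker=">>>> "):
    """Map each '>>>> TITLE' header to the list of lines that follow it."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith(marker):
            # A new header starts a new section.
            current = line[len(marker):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# Made-up sample in the shape get_info.sh produces; values are placeholders.
SAMPLE = """\
>>>> ENVIRONMENT
EXAMPLE_VAR=example-value
>>>> HOST
example-worker-node
>>>> USER
example-user
"""

if __name__ == "__main__":
    sections = split_sections(SAMPLE)
    print(sections["HOST"])
```

On a real job, the text would come from reading the file under condor_logs/ instead of SAMPLE.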

Requesting a specific resource

Artemisa provides two different models of NVIDIA Data Center GPUs: V100 and A100.

Currently you can force your jobs to use a specific type of GPU with:

# run only on V100 GPUs machines
+UseNvidiaV100 = True

# run only on A100 GPUs machines
+UseNvidiaA100 = True

# without either directive, the job can run on both kinds of machines

enabling the use of a specific GPU architecture if needed.
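When the same job is submitted to different GPU types, the submit description can be generated instead of edited by hand. The sketch below is one possible way to do this, using only the commands shown in this tutorial; build_submit and GPU_DIRECTIVES are hypothetical names introduced for illustration.

```python
# GPU-type directives exactly as listed above.
GPU_DIRECTIVES = {
    "V100": "+UseNvidiaV100 = True",
    "A100": "+UseNvidiaA100 = True",
}


def build_submit(executable, gpu_type=None):
    """Return submit description text; gpu_type=None leaves the GPU model open."""
    if gpu_type is not None and gpu_type not in GPU_DIRECTIVES:
        raise ValueError(f"unknown GPU type: {gpu_type!r}")
    lines = [
        "universe = vanilla",
        "",
        f"executable = {executable}",
        "arguments = $(Process)",
        "",
        "log = condor_logs/test.log",
        "output = condor_logs/outfile.$(Cluster).$(Process).out",
        "error = condor_logs/errors.$(Cluster).$(Process).err",
        "",
        "request_gpus = 1",
    ]
    if gpu_type is not None:
        # Restrict the job to machines with the requested GPU model.
        lines.append(GPU_DIRECTIVES[gpu_type])
    lines += ["", "queue"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_submit("get_info.sh", gpu_type="A100"))
```

The returned text can be written to a .sub file and passed to condor_submit as usual.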

The next example runs the same Python code as the previous tutorial, but on a GPU type that we choose. The following submit description file, specific_gpu.sub, implements this:

universe = vanilla

executable              = augment_data_gpu.sh
arguments               = $(Process)

log                     = condor_logs/test.log
output                  = condor_logs/outfile.$(Cluster).$(Process).out
error                   = condor_logs/errors.$(Cluster).$(Process).err

notify_user = artemisa.user@ific.uv.es
notification = always

getenv = True

request_gpus = 1
+UseNvidiaA100 = True


queue

We also included two useful commands, since this job takes a bit longer,

notify_user = artemisa.user@ific.uv.es
notification = always

asking HTCondor to send a notification email when the job finishes, whether it failed or not. This email also contains statistics on the resources used by the job.

The Python script is very similar to the one in the previous tutorial, with two changes. The first sets the GPU as the only available device:

######
# Set GPU as only available physical device
print("Set GPU as only available device")
my_devices = tf.config.list_physical_devices(device_type='GPU')
tf.config.set_visible_devices(devices=my_devices)
######

The second change runs the augmentation many times, generating a large number of images:

######
# num_images_generated = batch_size * repetitions
repetitions = 1000
for _ in range(repetitions):
    batch_images, batch_labels = next(it)
######
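The comment in the snippet above states that the number of generated images is batch_size * repetitions. The sketch below checks that arithmetic with a stand-in iterator, so it runs without TensorFlow. fake_batches replaces the tutorial's iterator it, and the batch size of 16 is a hypothetical value, not one taken from the tutorial.

```python
import itertools


def fake_batches(batch_size):
    """Stand-in for the tutorial's iterator `it`: yields (images, labels) forever."""
    for step in itertools.count():
        images = [f"image_{step}_{i}" for i in range(batch_size)]
        labels = [0] * batch_size
        yield images, labels


def count_generated(it, repetitions):
    """Pull `repetitions` batches from `it` and count the images produced."""
    total = 0
    for _ in range(repetitions):
        batch_images, _batch_labels = next(it)
        total += len(batch_images)
    return total


if __name__ == "__main__":
    batch_size = 16  # hypothetical value, for illustration only
    print(count_generated(fake_batches(batch_size), repetitions=1000))
```

With these assumed values the loop produces 16 * 1000 = 16000 images, which gives a quick way to estimate how much disk space a run will need.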

The visualization part has also been removed, as it serves no purpose in this example.

After sending the job with condor_submit, the augmented_data directory is created and populated with the transformed images.

(artemisa-tuto) $ condor_submit specific_gpu.sub

Summary

Recap

  • Jobs can be submitted to a WN through HTCondor to gain exclusive access to a GPU.

  • The log, output and error commands of the submit description file collect HTCondor job information, standard output and standard error, respectively.

  • Use +UseNvidiaA100 and +UseNvidiaV100 to select a specific GPU type.