03 - Submit a job to WN GPU(s)
To use the GPUs on the WNs (Worker Nodes), jobs must be submitted in batch mode. This has several benefits, such as a higher time limit cap of 48 hours. In this first example we will submit a job that requests a single GPU and retrieves basic environment information, which is useful for debugging.
Required files
We create the following submit description file, get_info.sub,
universe = vanilla
executable = get_info.sh
arguments = $(Process)
log = condor_logs/test.log
output = condor_logs/outfile.$(Cluster).$(Process).out
error = condor_logs/errors.$(Cluster).$(Process).err
request_gpus = 1
queue
that requests one GPU to execute the following script, get_info.sh:
#!/bin/sh
echo ">>>> HOST"
hostname
echo ">>>> CURRENT DIR"
pwd
echo ">>>> USER"
whoami
echo ">>>> SPACE LEFT"
df -h
echo ">>>> NVIDIA INFO"
nvidia-smi
echo ">>>> OS INFO"
cat /etc/os-release
Caution
Don’t forget to give .sh files execution permissions:
chmod +x get_info.sh
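You can check that the execute permission was applied with:
ls -l get_info.sh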
Submit Job
Create the output directory and launch the job:
(artuto) $ mkdir condor_logs
(artuto) $ condor_submit get_info.sub
Submitting job(s).
1 job(s) submitted to cluster 571648.
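While the job is in the queue you can monitor it with the standard HTCondor tools, for example:
# list your jobs and their current state
(artuto) $ condor_q
# block until the jobs recorded in the log file complete
(artuto) $ condor_wait condor_logs/test.log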
The standard output and standard error of the script will be recorded in the ‘output’ and ‘error’
files defined in the commands of the .sub file, respectively. The log file contains HTCondor-related information.
Inspecting the content of the output file:
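For the job submitted above (cluster 571648, single process 0), that is:
(artuto) $ cat condor_logs/outfile.571648.0.out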
>>>> HOST: the assigned WN.
>>>> CURRENT DIR, USER: the working directory (in the UI) and the user.
>>>> SPACE LEFT: mounted filesystems and free space (home and project filesystems; AFS is not mounted on the Worker Nodes).
>>>> NVIDIA INFO: GPU(s) information.
>>>> OS INFO: Operating System information with version details.
Requesting a specific resource
Artemisa has three different models of NVIDIA Data Center GPUs: V100, A100 and H100.
Currently you can force your jobs to use a specific type of GPU with:
# run only on V100 GPU machines
+UseNvidiaV100 = True
# run only on A100 GPU machines
+UseNvidiaA100 = True
# run only on H100 GPU machines
+UseNvidiaH100 = True
allowing you to choose the model that best suits the needs of your job.
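If you want to confirm at runtime which model your job actually received, you can query the driver from inside the job script, for instance:
# print the model name of the GPU(s) assigned to the job
nvidia-smi --query-gpu=name --format=csv,noheader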
The next example will run the same Python code from the previous tutorial, but on a specific GPU model.
The following submit description file implements this:
universe = vanilla
executable = augment_data_gpu.py
arguments = $(Process)
log = condor_logs/test.log
output = condor_logs/outfile.$(Cluster).$(Process).out
error = condor_logs/errors.$(Cluster).$(Process).err
notify_user = your.mail@institution.edu
notification = always
getenv = True
+UseNvidiaH100 = True
request_gpus = 1
queue
We also included two useful commands, as this job is going to take a bit longer,
notify_user = your.mail@institution.edu
notification = always
HTCondor will send an email when the job finishes, whether it completed successfully or failed. This email contains resource usage metrics from the job. The getenv = True command, also added here, makes the job inherit the environment variables of the shell from which it was submitted.
The Python script used is very similar to the one in the previous tutorial, with two changes. First, the GPU is set as the only visible device:
######
# Set GPU as only available physical device
print("Set GPU as only available device")
my_devices = tf.config.list_physical_devices(device_type='GPU')
tf.config.set_visible_devices(devices=my_devices)
######
The second change is running the augmentation many times, generating a large number of images:
######
# num_images_generated = batch_size * repetitions
repetitions = 1000
for x in range(repetitions):
    batch_images, batch_labels = next(it)
######
The visualization part has also been removed, as it isn't needed in this example.
After submitting the job with condor_submit, the augmented_data directory is created and populated with the transformed images.
(artuto) $ condor_submit specific_gpu.sub
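Once the job finishes, you can quickly verify the output, for instance by counting the generated images:
(artuto) $ ls augmented_data | wc -l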
Summary
Recap
Jobs can be submitted to a WN through HTCondor to gain exclusive access to a WN GPU.
The log, output and error submit description file commands gather submission status, standard output and standard error, respectively.
Use +UseNvidiaV100, +UseNvidiaA100 and +UseNvidiaH100 to select a specific GPU type.