03-Submit a job to a WN's GPU(s)
After development in the UI nodes, we can submit jobs to the WNs, enabling the use of GPUs and a higher time limit cap of 48 hours. In this example we will submit a job that requests a single GPU in order to retrieve basic environment information, useful for debugging purposes.
Required files
We create the following submit description file, get_info.sub:
universe = vanilla
executable = get_info.sh
arguments = $(Process)
log = condor_logs/test.log
output = condor_logs/outfile.$(Cluster).$(Process).out
error = condor_logs/errors.$(Cluster).$(Process).err
request_gpus = 1
queue
It requests one GPU to execute the following script, get_info.sh:
#!/bin/sh
echo ">>>> ENVIRONMENT"
printenv
echo ">>>> HOST"
hostname
echo ">>>> CURRENT DIR"
pwd
echo ">>>> USER"
whoami
echo ">>>> SPACE LEFT"
df -h
echo ">>>> NVIDIA INFO"
set -x #echo on
nvidia-smi
Caution
Don't forget to give .sh files execution permissions:
chmod +x get_info.sh
Submit Job
Create the output directory and launch the job:
(artemisa-tuto) $ mkdir condor_logs
(artemisa-tuto) $ condor_submit get_info.sub
Submitting job(s).
1 job(s) submitted to cluster 571648.
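While the job is idle or running, you can follow it with HTCondor's standard tools, for example (using the log path defined in the submit file above):
(artemisa-tuto) $ condor_q
(artemisa-tuto) $ condor_wait condor_logs/test.log
condor_q lists your queued and running jobs, while condor_wait blocks until the job recorded in the given log file terminates.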
The standard output and standard error of the script will be recorded in the 'output' and 'error' files defined in the commands of the .sub file, respectively. The 'log' file contains HTCondor-related information.
Inspecting the content of outfile.571648.0.out:
- >>>> ENVIRONMENT: HTCondor variables (assigned GPU, assigned slot, number of threads, etc.).
- >>>> HOST: the assigned WN.
- >>>> CURRENT DIR, USER: the working directory in the UI and the user.
- >>>> SPACE LEFT: mounted filesystems and free space (home and project filesystems); AFS is not mounted in Worker Nodes.
- >>>> NVIDIA INFO: GPU(s) info.
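A quick way to pick out the GPU assignment from the recorded environment is to filter the output file. In a typical HTCondor GPU setup the batch system exports variables such as CUDA_VISIBLE_DEVICES and _CONDOR_AssignedGPUs (names can vary with the pool configuration):
(artemisa-tuto) $ grep -E 'CUDA_VISIBLE_DEVICES|_CONDOR_Assigned' condor_logs/outfile.571648.0.out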
Requesting a specific resource
Artemisa provides two different models of NVIDIA Data Center GPUs: the V100 and the A100.
Currently you can force your jobs to use a specific type of GPU with:
# run only on V100 GPUs machines
+UseNvidiaV100 = True
# run only on A100 GPUs machines
+UseNvidiaA100 = True
# without a directive, jobs can run on both kinds of machines
enabling the use of a specific GPU architecture if needed.
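If you want to check which machines offer each GPU model before adding one of these directives, condor_status can print machine ClassAds. A minimal sketch, assuming the pool publishes GPU discovery attributes such as TotalGpus and GPUs_DeviceName (the exact attribute names depend on the pool configuration):
(artemisa-tuto) $ condor_status -constraint 'TotalGpus > 0' -af Machine GPUs_DeviceName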
The next example runs the same Python code as the previous tutorial, but on a GPU type of our choosing.
The following submit description file, specific_gpu.sub, implements this:
universe = vanilla
executable = augment_data_gpu.sh
arguments = $(Process)
log = condor_logs/test.log
output = condor_logs/outfile.$(Cluster).$(Process).out
error = condor_logs/errors.$(Cluster).$(Process).err
notify_user = artemisa.user@ific.uv.es
notification = always
getenv = True
request_gpus = 1
+UseNvidiaA100 = True
queue
We also included two useful commands, as this job is going to take a bit longer:
notify_user = artemisa.user@ific.uv.es
notification = always
These instruct HTCondor to send a notification email when the job finishes, whether it succeeded or failed. The email also contains statistics on the resources used by the job.
The Python script used is very similar to the one in the previous tutorial, with two changes. The first sets the GPU as the only visible device:
######
# Set GPU as only available physical device
print("Set GPU as only available device")
my_devices = tf.config.list_physical_devices(device_type='GPU')
tf.config.set_visible_devices(devices=my_devices)
######
The second change allows the augmentation to run many times, generating a large number of images:
######
# num_images_generated = batch_size * repetitions
repetitions = 1000
for x in range(repetitions):
    batch_images, batch_labels = next(it)
######
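As the comment indicates, the total number of images generated is batch_size * repetitions; for illustration, with a batch size of 32 (use the value actually set in your generator), 1000 repetitions would produce 32,000 augmented images.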
The visualization part has also been removed, as it serves no purpose in this example.
After sending the job with condor_submit, the augmented_data directory is created and populated with the transformed images.
(artemisa-tuto) $ condor_submit specific_gpu.sub
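Once the job finishes, a quick sanity check is to count the generated files (assuming the script writes the images directly into augmented_data):
(artemisa-tuto) $ ls augmented_data | wc -l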
Summary
Recap
- Jobs can be submitted to a WN through HTCondor to gain exclusive access to a GPU.
- The log, output and error submit description file commands gather the job status, standard output and standard error, respectively.
- Use +UseNvidiaA100 and +UseNvidiaV100 to select a specific GPU type.