04-Multi-GPU processing
This guide demonstrates how to use several GPUs for a classification task with a simple convolutional network, based on the Keras "Simple MNIST convnet" tutorial.
Jupyter Notebook preprocessing
We are going to build a simple convolutional network with the well-known MNIST dataset to recognize handwritten digits.
First, we download the source code of the notebook.
$ conda activate artemisa-tuto
(artemisa-tuto) $ wget https://raw.githubusercontent.com/keras-team/keras-io/master/examples/vision/ipynb/mnist_convnet.ipynb -O simple_convnet.ipynb
Once the notebook is downloaded, we can transform it into a Python file with the nbconvert tool.
(artemisa-tuto) $ pip install notebook
(artemisa-tuto) $ jupyter nbconvert simple_convnet.ipynb --to python --no-prompt
This generates a Python file that contains only the code cells.
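Under the hood, a notebook is just a JSON document, and converting it amounts to keeping the code cells. The following sketch (illustrative only; the real nbconvert tool handles many more details) shows the idea:

```python
import json

def extract_code_cells(nb_json: str) -> str:
    """Concatenate the source of all code cells in a notebook JSON string."""
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            chunks.append("".join(cell["source"]))
    return "\n\n".join(chunks)

# Tiny example notebook with one markdown cell and one code cell
notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n"]},
        {"cell_type": "code", "source": ["print('hello')\n"]},
    ]
})
print(extract_code_cells(notebook))
```

In practice, `jupyter nbconvert --to python` is the supported way to do this; the sketch only illustrates why the resulting `.py` file contains nothing but the code cells.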
Multi-GPU job submission
Next, we have to prepare the .sh and .sub files.
In the .sh file we include the instructions to activate the conda environment and run the generated .py file.
#!/bin/bash
EXERCISE_ENVIRONMENT="artemisa-tuto"
eval "$(conda shell.bash hook)"
conda activate $EXERCISE_ENVIRONMENT
python3 simple_convnet.py
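Inside the job, the GPUs allocated by the scheduler are typically exposed through the `CUDA_VISIBLE_DEVICES` environment variable (this is how HTCondor usually advertises assigned devices; check your local setup). A short sketch to verify what the job actually sees:

```python
import os

def assigned_gpus(env=os.environ):
    """Return the list of GPU ids visible to this process, or [] if unset."""
    devices = env.get("CUDA_VISIBLE_DEVICES", "")
    return [d for d in devices.split(",") if d]

print("GPUs visible to this job:", assigned_gpus())
```

Adding a line like this at the top of the training script can help confirm that a 2-GPU request was actually honored before the training starts.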
In the submit description file we specify the number of GPUs requested for the job with the command
request_gpus = 2
. In order to increase the job throughput of the Artemisa facility,
multi-GPU job usage is restricted: 2- and 4-GPU jobs are accepted. If an 8-GPU job is needed,
please contact us at artemisa-evaluacion@ific.uv.es explaining your specific computation needs
and the time slot required.
universe = vanilla
executable = simple_convnet.sh
log = condor_logs/simple_convnet.log
output = condor_logs/simple_convnet.$(Cluster).$(Process).out
error = condor_logs/simple_convnet.$(Cluster).$(Process).err
getenv = True
request_gpus = 2
queue
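For a 4-GPU run, the same submit description file can be reused with only `request_gpus` changed (subject to the facility policy described above):

```
universe = vanilla
executable = simple_convnet.sh
log = condor_logs/simple_convnet.log
output = condor_logs/simple_convnet.$(Cluster).$(Process).out
error = condor_logs/simple_convnet.$(Cluster).$(Process).err
getenv = True
request_gpus = 4
queue
```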
After giving execution permissions to the .sh file, we can submit the job:
(artemisa-tuto) $ chmod +x simple_convnet.sh
(artemisa-tuto) $ condor_submit simple_convnet.sub
Submitting job(s).
1 job(s) submitted to cluster 578870.
After the job finishes, the output can be found in the condor_logs/*.out file, which shows the
characteristics of the convolutional network: input shape, number of train and test samples, model and layer descriptions, the epochs run,
and finally the evaluation of the model, which achieves ~99% accuracy.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Model: "sequential"
_________________________________________________________________
 Layer (type)                    Output Shape           Param #
=================================================================
 conv2d (Conv2D)                 (None, 26, 26, 32)     320
 max_pooling2d (MaxPooling2D)    (None, 13, 13, 32)     0
 conv2d_1 (Conv2D)               (None, 11, 11, 64)     18496
 max_pooling2d_1 (MaxPooling2D)  (None, 5, 5, 64)       0
 flatten (Flatten)               (None, 1600)           0
 dropout (Dropout)               (None, 1600)           0
 dense (Dense)                   (None, 10)             16010
=================================================================
Total params: 34826 (136.04 KB)
Trainable params: 34826 (136.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/15
422/422 [==============================] - 5s 5ms/step - loss: 0.3746 - accuracy: 0.8859 - val_loss: 0.0847 - val_accuracy: 0.9775
Epoch 2/15
422/422 [==============================] - 1s 3ms/step - loss: 0.1104 - accuracy: 0.9664 - val_loss: 0.0578 - val_accuracy: 0.9845
Epoch 3/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0853 - accuracy: 0.9736 - val_loss: 0.0495 - val_accuracy: 0.9858
Epoch 4/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0701 - accuracy: 0.9788 - val_loss: 0.0419 - val_accuracy: 0.9892
Epoch 5/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0639 - accuracy: 0.9803 - val_loss: 0.0347 - val_accuracy: 0.9893
Epoch 6/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0562 - accuracy: 0.9828 - val_loss: 0.0348 - val_accuracy: 0.9902
Epoch 7/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0508 - accuracy: 0.9841 - val_loss: 0.0341 - val_accuracy: 0.9890
Epoch 8/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0484 - accuracy: 0.9851 - val_loss: 0.0332 - val_accuracy: 0.9900
Epoch 9/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0443 - accuracy: 0.9864 - val_loss: 0.0299 - val_accuracy: 0.9922
Epoch 10/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0396 - accuracy: 0.9871 - val_loss: 0.0289 - val_accuracy: 0.9933
Epoch 11/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0402 - accuracy: 0.9871 - val_loss: 0.0282 - val_accuracy: 0.9922
Epoch 12/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0378 - accuracy: 0.9879 - val_loss: 0.0271 - val_accuracy: 0.9923
Epoch 13/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0354 - accuracy: 0.9884 - val_loss: 0.0296 - val_accuracy: 0.9920
Epoch 14/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0337 - accuracy: 0.9896 - val_loss: 0.0292 - val_accuracy: 0.9910
Epoch 15/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0324 - accuracy: 0.9895 - val_loss: 0.0258 - val_accuracy: 0.9925
Test loss: 0.02214638702571392
Test accuracy: 0.9926999807357788
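When running many jobs, it can be convenient to pull the final metrics out of each `.out` file automatically. A small hypothetical helper (the function name and regex are illustrative, not part of the tutorial code) could look like this:

```python
import re

def final_metrics(log_text: str) -> dict:
    """Extract the 'Test loss' / 'Test accuracy' lines from a job's .out file."""
    metrics = {}
    for key in ("Test loss", "Test accuracy"):
        m = re.search(rf"{key}:\s*([0-9.eE+-]+)", log_text)
        if m:
            metrics[key] = float(m.group(1))
    return metrics

sample = "Test loss: 0.02214638702571392\nTest accuracy: 0.9926999807357788\n"
print(final_metrics(sample))
```

The same pattern can be looped over `condor_logs/*.out` with `glob` to compare runs.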
Summary
Recap
It is possible to gather the code cells of a Jupyter Notebook and transform them into a
.py
file, enabling submission to the WN. We can specify the number of GPUs needed with
request_gpus
; 1-, 2- and 4-GPU jobs are accepted. If an 8-GPU job is needed, please send a mail to artemisa-evaluacion@ific.uv.es.