04-Multi-GPU processing

This guide demonstrates how to use several GPUs for a classification task with a simple convolutional network, based on the Keras mnist_convnet example.

Jupyter Notebook preprocessing

We are going to build a simple convolutional network with the well known MNIST dataset to recognize handwritten digits.

MNIST dataset samples.

First, we download the notebook source code.

$ conda activate artemisa-tuto
(artemisa-tuto) $ wget https://raw.githubusercontent.com/keras-team/keras-io/master/examples/vision/ipynb/mnist_convnet.ipynb -O simple_convnet.ipynb

Once the notebook is downloaded, we can transform it into a Python file with the nbconvert tool.

(artemisa-tuto) $ pip install notebook
(artemisa-tuto) $ jupyter nbconvert simple_convnet.ipynb --to python --no-prompt

This generates a Python file that contains only the code cells.
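Under the hood, a .ipynb file is plain JSON, and nbconvert essentially collects the cells whose cell_type is code. A minimal sketch of that extraction using only the standard library (the notebook content below is a made-up example, not the downloaded one):

```python
import json

def extract_code_cells(notebook_json: str) -> str:
    """Concatenate the source of all code cells in a notebook."""
    nb = json.loads(notebook_json)
    cells = ["".join(c["source"]) for c in nb["cells"] if c["cell_type"] == "code"]
    return "\n\n".join(cells)

# A tiny hand-written notebook with one markdown and one code cell.
demo = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["print('hello')"]},
    ]
})

print(extract_code_cells(demo))  # -> print('hello')
```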

Multi-GPU job submission

Subsequently, we have to prepare the .sh and .sub files.

In the .sh file we include the instructions to activate the conda environment and execute the generated .py file.

.sh executable

#!/bin/bash
EXERCISE_ENVIRONMENT="artemisa-tuto"
eval "$(conda shell.bash hook)"
conda activate $EXERCISE_ENVIRONMENT
python3 simple_convnet.py

In the submit description file we specify the number of GPUs required for the job with the command request_gpus = 2. In order to increase the job throughput of the Artemisa facility, multi-GPU job usage is restricted: only 2- and 4-GPU jobs are accepted. If an 8-GPU job is needed, please contact us at artemisa-evaluacion@ific.uv.es, explaining your specific computation needs and the time slot required.
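Note that request_gpus = 2 only allocates the devices; the training script itself must distribute the work across them. With TensorFlow/Keras (the framework used by this notebook) that is typically done by building and compiling the model inside a tf.distribute.MirroredStrategy scope. A minimal sketch, not part of the downloaded notebook, with a simplified model for brevity:

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# splits each batch across the replicas (it falls back to CPU if
# no GPU is available).
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model creation and compilation must happen inside the scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(...) then runs with each batch split across the GPUs.
```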

Submit description file

universe = vanilla

executable              = simple_convnet.sh

log                     = condor_logs/simple_convnet.log
output                  = condor_logs/simple_convnet.$(Cluster).$(Process).out
error                   = condor_logs/simple_convnet.$(Cluster).$(Process).err

getenv = True

request_gpus = 2
queue
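HTCondor conventionally exposes the devices assigned to a GPU job through the CUDA_VISIBLE_DEVICES environment variable, so the script can verify at runtime that it really received the requested GPUs. A small sketch (the variable name is the standard CUDA one; check your site's configuration if it appears unset):

```python
import os

def assigned_gpus() -> int:
    """Count GPUs exposed to this process via CUDA_VISIBLE_DEVICES."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    # An unset or empty variable means no GPUs are visible.
    if not value:
        return 0
    return len([d for d in value.split(",") if d.strip() != ""])

print("GPUs visible to the job:", assigned_gpus())
```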

After giving execution permissions to the .sh file, we can submit the job:

(artemisa-tuto) $ chmod +x simple_convnet.sh
(artemisa-tuto) $ condor_submit simple_convnet.sub
Submitting job(s).
1 job(s) submitted to cluster 578870.

After the job is finished, the output can be found in the condor_logs/*.out file, which shows the characteristics of the convolutional network: input shape, train and test sample counts, the model and layer descriptions, the epochs run and, finally, the evaluation of the model, which achieves ~99% accuracy.

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Model: "sequential"
_________________________________________________________________
Layer (type)                Output Shape              Param #
=================================================================
conv2d (Conv2D)             (None, 26, 26, 32)        320

max_pooling2d (MaxPooling2  (None, 13, 13, 32)        0
D)

conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496

max_pooling2d_1 (MaxPoolin  (None, 5, 5, 64)          0
g2D)

flatten (Flatten)           (None, 1600)              0

dropout (Dropout)           (None, 1600)              0

dense (Dense)               (None, 10)                16010

=================================================================
Total params: 34826 (136.04 KB)
Trainable params: 34826 (136.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/15
422/422 [==============================] - 5s 5ms/step - loss: 0.3746 - accuracy: 0.8859 - val_loss: 0.0847 - val_accuracy: 0.9775
Epoch 2/15
422/422 [==============================] - 1s 3ms/step - loss: 0.1104 - accuracy: 0.9664 - val_loss: 0.0578 - val_accuracy: 0.9845
Epoch 3/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0853 - accuracy: 0.9736 - val_loss: 0.0495 - val_accuracy: 0.9858
Epoch 4/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0701 - accuracy: 0.9788 - val_loss: 0.0419 - val_accuracy: 0.9892
Epoch 5/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0639 - accuracy: 0.9803 - val_loss: 0.0347 - val_accuracy: 0.9893
Epoch 6/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0562 - accuracy: 0.9828 - val_loss: 0.0348 - val_accuracy: 0.9902
Epoch 7/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0508 - accuracy: 0.9841 - val_loss: 0.0341 - val_accuracy: 0.9890
Epoch 8/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0484 - accuracy: 0.9851 - val_loss: 0.0332 - val_accuracy: 0.9900
Epoch 9/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0443 - accuracy: 0.9864 - val_loss: 0.0299 - val_accuracy: 0.9922
Epoch 10/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0396 - accuracy: 0.9871 - val_loss: 0.0289 - val_accuracy: 0.9933
Epoch 11/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0402 - accuracy: 0.9871 - val_loss: 0.0282 - val_accuracy: 0.9922
Epoch 12/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0378 - accuracy: 0.9879 - val_loss: 0.0271 - val_accuracy: 0.9923
Epoch 13/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0354 - accuracy: 0.9884 - val_loss: 0.0296 - val_accuracy: 0.9920
Epoch 14/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0337 - accuracy: 0.9896 - val_loss: 0.0292 - val_accuracy: 0.9910
Epoch 15/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0324 - accuracy: 0.9895 - val_loss: 0.0258 - val_accuracy: 0.9925
Test loss: 0.02214638702571392
Test accuracy: 0.9926999807357788

Summary

Recap

  • It is possible to gather the code cells of a Jupyter Notebook and transform them into a .py file, enabling submission to the worker nodes (WN).

  • We can specify the number of GPUs needed with request_gpus; 1-, 2- and 4-GPU jobs are accepted. If an 8-GPU job is needed, please send an email to artemisa-evaluacion@ific.uv.es.