04-Multi-GPU processing

This tutorial demonstrates how to use several GPUs for a classification task with a simple convolutional network, based on the Keras simple MNIST convnet example.

Jupyter Notebook preprocessing

We are going to build a simple convolutional network trained on the well-known MNIST dataset to recognize handwritten digits.

MNIST dataset samples.

First, we download the notebook from the Keras examples repository.

$ conda activate artuto
(artuto) $ wget https://raw.githubusercontent.com/keras-team/keras-io/master/examples/vision/ipynb/mnist_convnet.ipynb -O simple_convnet.ipynb

Once the notebook is downloaded, we can transform it into a Python file with the nbconvert tool.

(artuto) $ pip install notebook
(artuto) $ jupyter nbconvert simple_convnet.ipynb --to python --no-prompt

This generates a Python file that contains only the code cells.
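The same conversion can also be done programmatically with nbconvert's Python API. A minimal sketch follows; the in-memory notebook is just a stand-in for simple_convnet.ipynb, which in practice you would load with nbformat.read:

```python
import nbformat
from nbconvert import PythonExporter

# Build a tiny notebook in memory as a stand-in for simple_convnet.ipynb;
# in practice: nb = nbformat.read("simple_convnet.ipynb", as_version=4)
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_code_cell("print('hello from a code cell')"))

# PythonExporter does the same job as `jupyter nbconvert --to python`:
# it returns the notebook's code as a single Python source string.
source, _ = PythonExporter().from_notebook_node(nb)

# Write the result out, ready to be used as an executable for HTCondor.
with open("simple_convnet_demo.py", "w") as f:
    f.write(source)
```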

Multi-GPU job submission

Subsequently, we have to prepare the .sub file.

In the submit description file we specify the number of GPUs requested for the job with the command request_gpus = 2. In order to increase job throughput on the Artemisa facility, multi-GPU usage is restricted: only 2- and 4-GPU jobs are accepted. If an 8-GPU job is needed, please contact us at artemisa-evaluacion@ific.uv.es explaining your specific computation needs and the time slot required. Please keep in mind that few nodes are capable of multi-GPU tasks (see WN Configuration), so longer waiting times may occur.
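Note that request_gpus only makes the devices available to the job; the training script itself has to distribute the work across them. The downloaded notebook trains on a single device by default, so to actually exploit both GPUs one could wrap the model construction in a tf.distribute.MirroredStrategy scope. A minimal sketch, assuming a TensorFlow backend (the model here is a trimmed-down stand-in for the tutorial's convnet):

```python
import tensorflow as tf
from tensorflow import keras

# MirroredStrategy replicates the model on every visible GPU and splits
# each batch across the replicas; with fewer than two GPUs we fall back
# to the default single-device strategy so the script runs anywhere.
gpus = tf.config.list_physical_devices("GPU")
strategy = (
    tf.distribute.MirroredStrategy() if len(gpus) > 1 else tf.distribute.get_strategy()
)
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables (weights, optimizer state) must be created inside the
    # strategy scope so they are mirrored across devices.
    model = keras.Sequential(
        [
            keras.Input(shape=(28, 28, 1)),
            keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
            keras.layers.MaxPooling2D(pool_size=(2, 2)),
            keras.layers.Flatten(),
            keras.layers.Dense(10, activation="softmax"),
        ]
    )
    model.compile(
        loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )

# model.fit(...) now runs data-parallel across the GPUs assigned by HTCondor.
```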

Submit description file

universe = vanilla

executable              = simple_convnet.py

log                     = condor_logs/simple_convnet.log
output                  = condor_logs/simple_convnet.$(Cluster).$(Process).out
error                   = condor_logs/simple_convnet.$(Cluster).$(Process).err

notify_user = your.mail@institution.edu
notification = always

getenv = True

request_gpus = 2

queue
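On the worker node, HTCondor publishes the GPUs granted by request_gpus to the job through the CUDA_VISIBLE_DEVICES environment variable. A small sketch that could be added at the top of the script to confirm the allocation (the example value in the comment is illustrative; the actual identifiers depend on the slot and the HTCondor configuration):

```python
import os

# HTCondor sets CUDA_VISIBLE_DEVICES to the devices granted by
# request_gpus, e.g. "0,1" for a 2-GPU slot (newer HTCondor versions
# may use GPU UUID strings instead of ordinals).
assigned = os.environ.get("CUDA_VISIBLE_DEVICES", "")
gpu_ids = [g for g in assigned.split(",") if g]
print(f"{len(gpu_ids)} GPU(s) assigned: {gpu_ids}")
```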

After giving execution permissions to the .py file, we can submit the job:

(artuto) $ chmod +x simple_convnet.py
(artuto) $ condor_submit simple_convnet.sub
Submitting job(s).
1 job(s) submitted to cluster 578870.

After the job finishes, the output can be found in condor_logs/*.out, which shows the characteristics of the convolutional network: input shape, train and test sample counts, model and layer descriptions, the epochs run and, finally, the evaluation of the model, which achieves ~99% accuracy.

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv2d (Conv2D)                 │ (None, 26, 26, 32)     │           320 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d (MaxPooling2D)    │ (None, 13, 13, 32)     │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_1 (Conv2D)               │ (None, 11, 11, 64)     │        18,496 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_1 (MaxPooling2D)  │ (None, 5, 5, 64)       │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten (Flatten)               │ (None, 1600)           │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 1600)           │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ (None, 10)             │        16,010 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 34,826 (136.04 KB)
 Trainable params: 34,826 (136.04 KB)
 Non-trainable params: 0 (0.00 B)
Epoch 1/5

  1/422 ━━━━━━━━━━━━━━━━━━━━ 45:43 7s/step - accuracy: 0.0625 - loss: 2.3327
 25/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.2527 - loss: 2.1916  
 55/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3939 - loss: 1.9001
 86/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.4813 - loss: 1.6518
117/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5419 - loss: 1.4692
148/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5868 - loss: 1.3306
179/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6215 - loss: 1.2222
210/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6492 - loss: 1.1351
241/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6721 - loss: 1.0629
272/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6913 - loss: 1.0022
302/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7072 - loss: 0.9516
333/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7216 - loss: 0.9060
364/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7341 - loss: 0.8659
395/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7452 - loss: 0.8303
422/422 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7539 - loss: 0.8024
422/422 ━━━━━━━━━━━━━━━━━━━━ 11s 11ms/step - accuracy: 0.7542 - loss: 0.8014 - val_accuracy: 0.9760 - val_loss: 0.0854
Epoch 2/5

  1/422 ━━━━━━━━━━━━━━━━━━━━ 9s 24ms/step - accuracy: 0.9375 - loss: 0.1683
 31/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9615 - loss: 0.1290 
 61/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9609 - loss: 0.1330
 91/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9610 - loss: 0.1322
123/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9612 - loss: 0.1310
154/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9612 - loss: 0.1300
183/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9612 - loss: 0.1295
214/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9612 - loss: 0.1291
246/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9613 - loss: 0.1284
277/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9614 - loss: 0.1276
308/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9615 - loss: 0.1269
340/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9616 - loss: 0.1263
371/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9618 - loss: 0.1256
401/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9620 - loss: 0.1248
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9621 - loss: 0.1243 - val_accuracy: 0.9820 - val_loss: 0.0614
Epoch 3/5

  1/422 ━━━━━━━━━━━━━━━━━━━━ 9s 23ms/step - accuracy: 0.9609 - loss: 0.0633
 30/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9712 - loss: 0.0819 
 61/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9710 - loss: 0.0865
 92/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9714 - loss: 0.0877
122/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9717 - loss: 0.0882
151/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9719 - loss: 0.0885
182/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9719 - loss: 0.0887
212/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9720 - loss: 0.0888
242/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9722 - loss: 0.0887
273/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9723 - loss: 0.0885
304/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9724 - loss: 0.0883
336/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9726 - loss: 0.0881
367/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9727 - loss: 0.0879
398/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9727 - loss: 0.0877
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9728 - loss: 0.0876 - val_accuracy: 0.9860 - val_loss: 0.0502
Epoch 4/5

  1/422 ━━━━━━━━━━━━━━━━━━━━ 9s 23ms/step - accuracy: 0.9766 - loss: 0.0783
 33/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9766 - loss: 0.0787 
 64/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9769 - loss: 0.0767
 95/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9774 - loss: 0.0753
126/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9776 - loss: 0.0743
157/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9777 - loss: 0.0737
189/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9778 - loss: 0.0735
220/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9778 - loss: 0.0734
251/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9778 - loss: 0.0735
282/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9778 - loss: 0.0735
313/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9778 - loss: 0.0735
345/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9778 - loss: 0.0735
377/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9778 - loss: 0.0735
409/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9778 - loss: 0.0734
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9778 - loss: 0.0734 - val_accuracy: 0.9883 - val_loss: 0.0419
Epoch 5/5

  1/422 ━━━━━━━━━━━━━━━━━━━━ 9s 23ms/step - accuracy: 0.9844 - loss: 0.0381
 32/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9822 - loss: 0.0477 
 63/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9812 - loss: 0.0524
 94/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9811 - loss: 0.0539
125/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9811 - loss: 0.0549
157/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9810 - loss: 0.0560
189/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9809 - loss: 0.0568
221/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9809 - loss: 0.0576
253/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9809 - loss: 0.0581
285/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9808 - loss: 0.0585
317/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9808 - loss: 0.0589
349/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9807 - loss: 0.0592
381/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9806 - loss: 0.0595
413/422 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9806 - loss: 0.0598
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9806 - loss: 0.0599 - val_accuracy: 0.9892 - val_loss: 0.0381
Test loss: 0.035545095801353455
Test accuracy: 0.987500011920929

Summary

Recap

  • It is possible to gather the code cells from a Jupyter Notebook and transform them into a .py file, allowing submission to the worker nodes (WN).

  • We can specify the number of GPUs needed with request_gpus; 1-, 2- and 4-GPU jobs are accepted. If an 8-GPU job is needed, please send a mail to artemisa-evaluacion@ific.uv.es.