04-Multi-GPU processing
This tutorial demonstrates the use of several GPUs for a classification task with a simple convolutional network, based on the Keras simple MNIST convnet example.
Jupyter Notebook preprocessing
We are going to build a simple convolutional network that recognizes handwritten digits from the well-known MNIST dataset.
MNIST dataset samples.
First, we download the notebook's source code from the keras-io repository.
$ conda activate artuto
(artuto) $ wget https://raw.githubusercontent.com/keras-team/keras-io/master/examples/vision/ipynb/mnist_convnet.ipynb -O simple_convnet.ipynb
Once the notebook is downloaded, we can transform it into a Python file with the nbconvert tool.
(artuto) $ pip install notebook
(artuto) $ jupyter nbconvert simple_convnet.ipynb --to python --no-prompt
This generates a Python file containing only the notebook's code cells.
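The same conversion can also be scripted through nbconvert's Python API. The following is a sketch: for illustration it builds a throwaway one-cell notebook in memory instead of loading simple_convnet.ipynb (which you would normally do with nbformat.read), and it assumes the notebook package installed above.

```python
# Sketch: the CLI conversion above, done via nbconvert's Python API.
import nbformat
from nbconvert import PythonExporter

# Build a tiny notebook in memory for illustration; normally you would
# load simple_convnet.ipynb with nbformat.read(path, as_version=4).
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_code_cell("print('hello from a cell')"))

exporter = PythonExporter()
exporter.exclude_input_prompt = True  # equivalent of --no-prompt
source, _resources = exporter.from_notebook_node(nb)
print(source)
```

The exported source string is what the CLI writes to simple_convnet.py.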
Multi-GPU job submission
Next, we have to prepare the .sub file.
In the submit description file we specify the number of GPUs requested for the job with the command
request_gpus = 2. In order to preserve the job throughput of the Artemisa facility,
multi-GPU usage is restricted: 2- and 4-GPU jobs are accepted. If an 8-GPU job is needed,
please contact us at artemisa-evaluacion@ific.uv.es, explaining your specific computation needs
and the time slot required. Please keep in mind that there are few nodes capable of multi-GPU tasks
(see WN Configuration), so longer waiting times might occur.
universe = vanilla
executable = simple_convnet.py
log = condor_logs/simple_convnet.log
output = condor_logs/simple_convnet.$(Cluster).$(Process).out
error = condor_logs/simple_convnet.$(Cluster).$(Process).err
notify_user = your.mail@institution.edu
notification = always
getenv = True
request_gpus = 2
queue
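Since the vanilla universe runs the executable directly, simple_convnet.py needs a shebang line as its first line. The snippet below is a hypothetical sketch of opening lines for the script: it adds the shebang and reports which GPUs the batch system made visible to the job, typically exposed through the CUDA_VISIBLE_DEVICES environment variable.

```python
#!/usr/bin/env python3
# Hypothetical opening lines for simple_convnet.py: the shebang above is
# required because HTCondor executes the script directly, and the GPUs
# granted by request_gpus are typically listed in CUDA_VISIBLE_DEVICES.
import os


def assigned_gpus(env=os.environ):
    """Return the GPU indices the batch system made visible to this job."""
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return [g for g in value.split(",") if g]


print("Assigned GPUs:", assigned_gpus() or "(none visible)")
```

Printing this at startup makes the .out file record how many devices the job actually received.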
After giving execution permissions to the .py file, we can submit the job:
(artuto) $ chmod +x simple_convnet.py
(artuto) $ condor_submit simple_convnet.sub
Submitting job(s).
1 job(s) submitted to cluster 578870.
After the job is finished, the output can be found in condor_logs/*.out, which shows the
characteristics of the convolutional network: input shape, number of train and test samples, model and layer description, epochs run,
and finally the evaluation of the model, which achieves ~99% accuracy.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv2d (Conv2D) │ (None, 26, 26, 32) │ 320 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d (MaxPooling2D) │ (None, 13, 13, 32) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_1 (Conv2D) │ (None, 11, 11, 64) │ 18,496 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_1 (MaxPooling2D) │ (None, 5, 5, 64) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten (Flatten) │ (None, 1600) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout) │ (None, 1600) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense) │ (None, 10) │ 16,010 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 34,826 (136.04 KB)
Trainable params: 34,826 (136.04 KB)
Non-trainable params: 0 (0.00 B)
Epoch 1/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 11s 11ms/step - accuracy: 0.7542 - loss: 0.8014 - val_accuracy: 0.9760 - val_loss: 0.0854
Epoch 2/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9621 - loss: 0.1243 - val_accuracy: 0.9820 - val_loss: 0.0614
Epoch 3/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9728 - loss: 0.0876 - val_accuracy: 0.9860 - val_loss: 0.0502
Epoch 4/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9778 - loss: 0.0734 - val_accuracy: 0.9883 - val_loss: 0.0419
Epoch 5/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9806 - loss: 0.0599 - val_accuracy: 0.9892 - val_loss: 0.0381
Test loss: 0.035545095801353455
Test accuracy: 0.987500011920929
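Note that the notebook as downloaded trains on a single device; requesting two GPUs does not by itself parallelize Keras. A minimal sketch, assuming TensorFlow's tf.distribute API, of how the model construction in simple_convnet.py could be wrapped so each training batch is sharded across the assigned GPUs (the layer stack below is abbreviated for illustration, not the notebook's exact architecture):

```python
# Sketch: build and compile the model inside a MirroredStrategy scope so
# Keras replicates it on every visible GPU and splits each batch between
# the replicas. With no GPU available it falls back to a single CPU replica.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible device
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.fit(...) then proceeds unchanged; the global batch is divided
# among the replicas, so larger batch sizes are common in this setup.
```

Only the model construction and compilation move inside the scope; the data loading and fit call stay as they are in the exported script.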
Summary
Recap
It is possible to gather the code cells from a Jupyter Notebook and transform them into a
.py file, allowing submission to the WNs. We can specify the number of GPUs needed with
request_gpus; 1-, 2- and 4-GPU jobs are accepted. If an 8-GPU job is needed, please send a mail to artemisa-evaluacion@ific.uv.es.