02-CPU jobs

ARTEMISA has some resources reserved for CPU-only jobs.

Modern frameworks like TensorFlow or Keras allow the creation of data pipelines that benefit from parallel processing on the CPU (e.g. data augmentation) and on the GPU (e.g. training). However, some tasks, such as data preparation and augmentation, can be done in parallel using only CPUs, speeding up the process. This tutorial presents a basic example of data preparation requesting CPU-only resources, as a first step towards building an image classifier. We will use the CIFAR-10 dataset, which consists of 32x32 pixel images in 10 classes, split into 50k training and 10k test images.
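
For illustration, a minimal tf.data sketch of such a pipeline (the function and parameters here are generic examples, not part of this tutorial's script) could look like this:

import tensorflow as tf

def preprocess(image, label):
    # Example CPU-side step; AUTOTUNE lets TensorFlow choose the degree of parallelism
    image = tf.image.random_flip_left_right(image)
    return tf.cast(image, tf.float32) / 255.0, label

(x, y), _ = tf.keras.datasets.cifar10.load_data()
dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel on CPU cores
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))  # overlap preprocessing with consumption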

Local CPU execution

First, activate the conda environment created in the first tutorial:

$ conda activate artemisa-tuto

We will need the following packages:

(artemisa-tuto) $ pip install matplotlib tensorflow-datasets scipy

We are going to run the following Python code, augment_data_cpu.py:

#!/usr/bin/env python3
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator
# https://stepup.ai/train_data_augmentation_keras/
import os
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

# Helper function to inspect the first images in a dataset
def visualize_data(images, categories, class_names, file_name):
    fig = plt.figure(figsize=(14, 6))
    fig.patch.set_facecolor('white')
    for i in range(3 * 7):
        plt.subplot(3, 7, i+1)
        plt.xticks([])
        plt.yticks([])
        plt.imshow(images[i])
        class_index = categories[i].argmax()
        plt.xlabel(class_names[class_index])
    fig.savefig(file_name)


# Set CPU as only available physical device (avoid GPU)
print("Set CPU as only available device")
my_devices = tf.config.list_physical_devices(device_type='CPU')
tf.config.set_visible_devices(devices=my_devices)
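# Alternatively, all GPUs can be hidden explicitly with an equivalent call:
#   tf.config.set_visible_devices([], 'GPU')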

# CIFAR-10 Dataset
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(class_names)

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

x_train = x_train / 255.0
y_train = to_categorical(y_train, num_classes)

x_test = x_test / 255.0
y_test = to_categorical(y_test, num_classes)

# The above code first downloads the dataset. The included preprocessing rescales the images into the
# range [0, 1] and converts each label from a class index (an integer from 0 to 9) to a one-hot
# encoded categorical vector.
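# For example, with to_categorical the label 3 ('cat') becomes the one-hot
# vector [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].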

# Show the first images of the training set
visualize_data(x_train, y_train, class_names, 'train_samples.png')


# Specification of the augmentation parameters: 
#   width shift: Randomly shift the image left and right by 3 pixels
#   height shift: Randomly shift the image up and down by 3 pixels
#   horizontal flip: Randomly flip the image horizontally.

width_shift = 3/32
height_shift = 3/32
flip = True

datagen = ImageDataGenerator(
    horizontal_flip=flip,
    width_shift_range=width_shift,
    height_shift_range=height_shift,
    )
datagen.fit(x_train)
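# Note: fit() is only strictly required when featurewise statistics are used
# (featurewise_center, featurewise_std_normalization or zca_whitening);
# it is harmless for the transformations chosen here.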

# output directory
path = "./augmented_data"
try:
    os.mkdir(path)
except OSError:
    print ("Creation of the directory %s failed" % path)
else:
    print ("Successfully created the directory %s " % path)

my_batch_size = 32
it = datagen.flow(x_train, y_train, shuffle=False, batch_size=my_batch_size,
        save_to_dir=path, save_prefix='artemisa_ex2')

# Generate (and save) one batch of augmented images. To create more batches,
# iterate instead (num_images_generated = my_batch_size * repetitions):
#   repetitions = 100
#   for _ in range(repetitions):
#       batch_images, batch_labels = next(it)
batch_images, batch_labels = next(it)

# Show samples augmented data
visualize_data(batch_images, batch_labels, class_names, 'augmented_samples.png')

Now we can run it on the UI without the gpurun tool, forcing the use of the CPU:

(artemisa-tuto) $ python augment_data_cpu.py

The code above implements data augmentation on the CIFAR-10 dataset described at the beginning of this tutorial.

The script also generates two figures with samples: a set of original images and the same images after augmentation, to illustrate the transformations performed.

Original images

_images/03_train_samples.png

Augmented set

_images/03_augmented_samples.png

Execution in a Worker Node

Now we will run our augmentation process on a remote Worker Node. To do so, we have to prepare the following submit description file:

universe                = vanilla
executable              = augment_data_cpu.sh
arguments               =
log                     = condor_logs/test.log
output                  = condor_logs/outfile.$(Cluster).$(Process).out
error                   = condor_logs/errors.$(Cluster).$(Process).err

# Needed to read .bashrc and conda environment
getenv = True

# TestJob CPU
+testJob = True
queue

Note the ‘+testJob = True’ command, used to request exclusively CPU resources.

We also need the executable .sh script referenced in the .sub file:

#!/bin/bash
EXERCISE_ENVIRONMENT="artemisa-tuto"
eval "$(conda shell.bash hook)"
conda activate $EXERCISE_ENVIRONMENT
python augment_data_cpu.py

Caution

Don’t forget to give .sh files execution permissions: chmod +x augment_data_cpu.sh

Since the job will run on a remote Worker Node, we need to set up the same environment (conda) there. The beginning of the script does this before invoking the Python code.

We also need to prepare the directory for the log files:

(artemisa-tuto) $ mkdir condor_logs

Finally, launch the job through HTCondor:

(artemisa-tuto) $ condor_submit augment_data_cpu.sub
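
While the job is queued or running, its status can be monitored with the standard HTCondor tools, for example

(artemisa-tuto) $ condor_q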

Once the job finishes, it is possible to check the results generated in the augmented_data directory. The train_samples.png and augmented_samples.png figures illustrate the augmentation made on some samples.
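
As a quick sanity check (a hypothetical snippet; it assumes ImageDataGenerator's default PNG output format), we can count the saved files:

import glob

# Count the augmented images written by the job (save_prefix='artemisa_ex2')
files = glob.glob("augmented_data/artemisa_ex2*")
print("%d augmented images found in ./augmented_data" % len(files))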

Summary

Recap

  • ARTEMISA also provides CPU-only resources.

  • Use the +testJob = True command to use CPU-only slots.

  • Use getenv = True to keep the current environment (e.g. the active conda environment) when submitting with HTCondor.

  • Don’t forget to initialize the required virtual environment (i.e. conda activate) before submitting the job.