Deep learning on HPC cluster with LSF queue

This is a rather short guide to running TensorFlow jobs on a high performance computing (HPC) cluster that uses LSF (Load Sharing Facility) as its scheduler. It can be extrapolated to other HPC systems with some tweaks. If you're new to LSF and to deep learning jobs on HPC, this post nicely summarizes the jargon.

Use case

My use case is to run Bayesian optimization on a deep learning model built with TensorFlow, with the code written in Python. The input shape is 110×1500 and the model has more than 10 layers, which makes it impossible to train on my local workstation with a 6 GB NVIDIA GPU. After experimenting on down-sampled data, it is time to train on the entire dataset at full dimensions. My job script (saved as train_job_submission.job) looks as follows. For the LSF directives (the #BSUB lines), one can refer here; this post from DTU also documents GPU usage with LSF, with explanations.

#!/bin/csh
#BSUB -q BatchGPU
#BSUB -R "a100 span[hosts=1]"
#BSUB -oo gpu_1.log
#BSUB -eo gpu_1.err
#BSUB -J GPU_Job_1
#BSUB -L /bin/csh
#BSUB -n 1
#BSUB -gpu "num=1"

pwd

module purge
module load cuda/v11.2
source ~/custom.tcshrc
conda activate py38

date

python -u BO_train_deeper_networks.py > BO_train_deeper_networks.out

date
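
The script is then submitted with bsub < train_job_submission.job; bjobs shows the job's status and bpeek lets me look at its output while it is still running.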

In the above, after purging all modules, I explicitly load cuda/v11.2. I then source a file where I have defined the conda installation etc. (in many cases this might not be needed) and activate the py38 conda environment, which has a TensorFlow build compatible with cuda/v11.2. In my BO_train_deeper_networks.py file, I make sure to print out the GPU availability and its configuration. For instance, the training script contained the following checks:

import os

# Suppress TensorFlow's C++ log messages; this must be set before importing TF.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = "2"

import tensorflow as tf
tf.get_logger().setLevel('ERROR')

# LSF sets CUDA_VISIBLE_DEVICES for the GPU(s) allocated to the job.
print("CUDA_VISIBLE_DEVICES >>", os.environ['CUDA_VISIBLE_DEVICES'])

print("Available GPUs/CPUs >>")
print(tf.config.list_physical_devices())
print("Printing the details of GPUs available >> ")
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs Available >> ", len(physical_devices))
for device in physical_devices:
    print(tf.config.experimental.get_device_details(device))

And, if everything went right, I could see the following output:

CUDA_VISIBLE_DEVICES >> 5
Available GPUs/CPUs >>
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Printing the details of GPUs available >> 
Num GPUs Available >>  1
{'compute_capability': (8, 0), 'device_name': 'xx xxxx-40GB'}
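
As an additional sanity check that the loaded cuda/v11.2 module actually matches the environment, recent TensorFlow 2.x releases can report the CUDA/cuDNN versions they were built against; a minimal sketch:

import tensorflow as tf

# Print the CUDA/cuDNN versions this TensorFlow build expects, so they can be
# compared against the cuda module loaded in the job script.
build_info = tf.sysconfig.get_build_info()
print("TF version      >>", tf.__version__)
print("Built for CUDA  >>", build_info.get("cuda_version"))
print("Built for cuDNN >>", build_info.get("cudnn_version"))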

Initially, I struggled because I wasn't setting the #BSUB -gpu "num=1" directive, so no GPU was visible to the job at all. After some searching, I added os.environ['CUDA_VISIBLE_DEVICES'] = "0" to the script, but that wasn't right either: it might have made other GPUs visible, but those were not the ones LSF had allocated to my job and were already busy, so my program failed. The proper fix is to request the GPU through #BSUB -gpu and let LSF set CUDA_VISIBLE_DEVICES itself.
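
To avoid both pitfalls, the training script can rely on the CUDA_VISIBLE_DEVICES that LSF exports and fail fast if no GPU is actually visible, instead of silently falling back to the CPU. A minimal sketch (the memory-growth part is optional):

import os
import tensorflow as tf

# LSF exports CUDA_VISIBLE_DEVICES for the GPU it allocated when the job
# requests one via #BSUB -gpu "num=1"; do not overwrite it by hand.
print("CUDA_VISIBLE_DEVICES >>", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

gpus = tf.config.list_physical_devices("GPU")
if not gpus:
    # Abort early rather than running the whole Bayesian optimization on the CPU.
    raise RuntimeError("No GPU visible to TensorFlow; check the #BSUB -gpu request.")

# Optional: allocate GPU memory on demand instead of grabbing all of it up front.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)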

TODO (Depending on need)

  • Multi GPU training (a rough starting point is sketched below)
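
For when that becomes necessary, here is a minimal, untested sketch of what multi-GPU training could look like with tf.distribute.MirroredStrategy, assuming the job also requests more GPUs (e.g. #BSUB -gpu "num=2") and using only a placeholder model:

import tensorflow as tf

# MirroredStrategy replicates the model on every GPU that LSF made visible
# and splits each training batch across them.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas >>", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model with the 110x1500 input shape from the use case above.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(110, 1500)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then runs data-parallel across the visible GPUs.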
