SOL - Deep Learning GPU Cluster

The Diagnostic Image Analysis Group's goal is to improve image analysis tools for clinicians. Since 2015, deep learning has become an essential tool for artificial intelligence, substantially improving the accuracy and generalizability on developed image classification techniques. In order to create high quality deep-learning based algorithms for clinical practice, AI researchers require the use of high-power general-purpose graphic processing units (GPGPUs).

SOL is the RTC Deep Learning's infrastructure. It offers researchers access to a centralized pool of GPGPUs. It is used to create state-of-the-art tools for detecting and classifying illnesses and quantify extent and identify possible treatment options for diseases.

Capacity/Activity

Resource Value
Compute nodes 40
GPUs 100
CPUs 788
Total memory 3.5TB
Active users/month 40
Jobs run/month 7000

Architecture

SOL is a compute cluster which consists of 40 compute nodes that give access to a pool of 100 GPUs (mainly NVIDIA GTX1080Ti and RTX2080Ti cards), a dedicated 500 TB storage server for serving data to deep learning compute nodes using 20 Gbit networking, and a Prometheus+Grafana based monitoring solution to monitor the activity and health of the cluster.

sol-cluster-architecture

Users log in to job nodes directly to either schedule an experiment or to interact with a running experiment using Jupyter Notebooks. We use Slurm as an automated job queue. Experiments are encapsulated in docker containers to isolate their software stacks from the base systems of our compute nodes. This allows researchers maximum flexibility when it comes to trying out new, experimental libraries and software without us needing to install it on the base systems ourselves.

Future

The SOL compute cluster now runs reliably since mid-2017. The cluster is planned to steadily grow until the end of 2019 when it will reach its maximum design capacity of 100 GPUs. After this point, plans are to consolidate and upgrade old hardware and to remove non-stardard components and compute nodes from the cluster.

In addition to this, the compute cluster will be rolled out as a service to other departments of the hospital and we will start offering compute capabilites built around SOL to other departments of Radboud University in 2021.

People

Rita Bijlsma

Rita Bijlsma

Paul Konstantin Gerke

Paul Konstantin Gerke

Research Software Engineer

DIAG Research Software Engineering

Sil van de Leemput

Sil van de Leemput

Research Scientist RTC Deep Learning

RTC Deep Learning

Thomas de Bel

Thomas de Bel

PhD Candidate

Computational Pathology Group

Matin Hosseinzadeh

Matin Hosseinzadeh

PhD Candidate

 Xie Weiyi

Xie Weiyi

Patrick Brand

Patrick Brand

PhD Candidate