GPU-based Machine Learning System for fundamental biological research

Lead Research Organisation: University of Dundee
Department Name: School of Life Sciences

Abstract

In the past decade, the biological sciences have witnessed a major shift towards data-driven research. Consequently, high-performance computing has become a standard research tool in the life sciences. Biological data, however, is not only large but also highly complex. This complexity calls for alternatives to conventional data analysis. Machine learning (ML) has emerged as a powerful methodology that can successfully tackle the analysis of complex biological data.

It is a hopeless task to attempt to develop a mathematical model of an elephant. Yet a three-year-old child can easily point to an elephant in a photo. The child was shown a picture of an elephant and told that the object in the picture was an elephant. In other words, she learnt to recognise an elephant by seeing photos of one, and she can now identify it on her own.

ML emulates this learning process on a computer. Instead of being given a precise description of patterns, the computer is "taught" to recognise them. This is a paradigm shift from conventional computing and, as such, brings its own challenges. Notably, the brain is well suited to learning by example, yet it performs poorly at long division. Computers, on the other hand, were designed to perform numerical operations with great speed and precision. It is, therefore, not surprising that emulating an inherently heuristic process such as learning on a computer requires substantial computational effort. With recent advances in Graphics Processing Unit (GPU) and Solid State Drive (SSD) technologies, the necessary computing power has become broadly available. Traditional High-Performance Computing facilities, however, are not well suited to ML applications.

ML has been used successfully in biology for more than two decades. An excellent example is predicting the viability of cancer cells exposed to a drug. The idea is to associate a response (e.g., whether a cancer cell survives or not) with a set of characteristics, or features (e.g., which genes are mutated and what the chemical properties of the drug are). In so-called supervised learning, the machine is presented with a large set of training data that contains the correct responses for given input parameters. Based on that data, the machine learns to predict the response for new, previously unseen parameters. A major challenge is that it is often not easy to identify the appropriate features. Cells are very complex, and it is often unclear which features are most relevant to a specific response, e.g., mutations of which genes should be considered. An expert is, therefore, required to prepare the appropriate training set.
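
The supervised-learning workflow described above can be illustrated with a minimal sketch. It assumes a deliberately simple method (a nearest-centroid classifier) and entirely invented data; the feature names, values, and labels are hypothetical and are not taken from any study.

```python
# Toy supervised learning: a nearest-centroid classifier in pure Python.
# All feature names and values below are invented for illustration only.

def train(samples, labels):
    """Return the mean feature vector (centroid) for each response label."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: dist2(centroids[label]))


# Each row: (gene_A_mutated, gene_B_mutated, drug_property), hypothetical features.
training_features = [
    (1, 0, 0.9), (1, 1, 0.8), (1, 0, 0.7),   # cells recorded as dying
    (0, 1, 0.2), (0, 0, 0.1), (0, 1, 0.3),   # cells recorded as surviving
]
training_responses = ["dies", "dies", "dies", "survives", "survives", "survives"]

# "Training": learn from examples with known responses.
model = train(training_features, training_responses)

# "Prediction": apply the model to a previously unseen combination of features.
prediction = predict(model, (1, 0, 0.85))
print(prediction)
```

Note that the three features were chosen by hand here, which is exactly the expert step the paragraph identifies as the bottleneck.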

In recent years, so-called deep learning techniques have revolutionised the learning process by allowing the machine to extract the key features automatically from raw data. This is achieved by a set of model neurons, inspired by biological neural cells, organised in a layered network (i.e., a neural network). Information propagates through the layers of the network, with each successive layer capturing increasingly abstract features of the data. This drastically reduces the need for carefully tailored training sets and makes ML applicable to a wider range of problems, especially those where expert-made training sets are unavailable or too costly to produce.
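
As a rough illustration of how information propagates through a layered network, the sketch below passes one input through three small layers of model neurons. The weights are hand-picked assumptions chosen to keep the arithmetic easy to follow; in real deep learning they are learnt from data during training, which is not shown here.

```python
# Forward pass through a tiny layered network in pure Python.
# The layer sizes and weights are hypothetical; real networks learn their weights.

def dense(inputs, weights, biases):
    """One layer of model neurons: a weighted sum of the inputs plus a bias,
    followed by a ReLU non-linearity (negative values are set to zero)."""
    return [
        max(0.0, sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Propagate the inputs through every layer, keeping each layer's output."""
    activations = [inputs]
    for weights, biases in layers:
        activations.append(dense(activations[-1], weights, biases))
    return activations


# Three layers: 4 raw inputs -> 3 neurons -> 2 neurons -> 1 output neuron.
layers = [
    ([[1.0, -1.0, 0.0, 0.0],
      [0.0, 1.0, -1.0, 0.0],
      [0.0, 0.0, 1.0, -1.0]], [0.0, 0.0, 0.0]),   # differences of neighbouring inputs
    ([[1.0, 1.0, 0.0],
      [0.0, 1.0, 1.0]], [0.0, 0.0]),              # combinations of those differences
    ([[0.5, 0.5]], [0.0]),                        # a single summary value
]

raw_data = [4.0, 3.0, 1.0, 0.0]
activations = forward(raw_data, layers)
for depth, values in enumerate(activations):
    print(depth, values)
```

Each layer works only on the output of the previous one, which is how deeper layers come to represent progressively more abstract combinations of the raw input.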

Deep learning approaches, however, require substantial computational resources. Typical deep neural networks contain tens to hundreds of layers, thousands of neurons, and hundreds of thousands of links between them. Training them, therefore, requires hardware that operates at TFLOPS speeds (trillions of floating-point operations per second) and can access data at several GB/s.
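
To give a sense of how these numbers arise, the following sketch counts the neurons and links in a hypothetical fully connected network. The architecture (20 layers of 100 neurons each) is an assumption chosen only to land in the size range quoted above; it does not describe any specific model.

```python
# Count neurons and links in a hypothetical fully connected network.
# The layer sizes are an illustrative assumption, not a real architecture.

def count_neurons_and_links(layer_sizes):
    """In a fully connected network every neuron in one layer is linked to
    every neuron in the next, so links = sum of products of adjacent sizes."""
    neurons = sum(layer_sizes)
    links = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return neurons, links


layer_sizes = [100] * 20          # 20 layers, 100 neurons per layer
neurons, links = count_neurons_and_links(layer_sizes)
print(neurons, links)
```

Every link carries a weight that must be updated many times during training, which is why the cost grows quickly with network size.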

The aim of this proposal is to build a dedicated GPU-based system for applying deep learning methods to fundamental biological research at the University of Dundee.

Technical Summary

The dedicated Machine Learning (ML) facility will support fundamental biological research in the areas of Epigenomics, Cryo-EM, Proteomics, Macromolecular Structure Prediction and Modelling, Image-based Phenotypic Analysis, Segmentation and Cell Tracking in in vivo images of early-stage embryos, Multi-Parameter Optimisation for Drug Design, and exploration of the Biological Mechanisms associated with Antimicrobial Drug Resistance.

The system will operate as a unified ML platform comprising 16 custom-built servers optimised for deep learning applications. Each server will be equipped with two 32-core, 2.35GHz AMD EPYC 7452 CPUs, 512GB of 3200MHz DDR4 RAM, four NVIDIA RTX 3090 GPUs, and one 8TB NVMe SSDs. The full system will, therefore, have 1,024 CPU cores, 671,744 GPU cores (20,992 third-generation Tensor cores), 8TB of RAM and 256TB of fast local storage. This will provide an optimal balance between CPUs, GPUs, and fast local I/O capabilities for maximum deep learning performance. The system will run CentOS 8 Linux with a wide variety of ML software tools.
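
The system-wide totals follow from the per-server specification by simple multiplication, as the sketch below shows. The per-GPU core counts (10,496 CUDA cores and 328 Tensor cores per RTX 3090) are NVIDIA's published figures and are not stated in the text above. Note that one 8TB drive per server gives 128TB of local storage by this arithmetic; the 256TB total quoted in the text would correspond to 16TB per server.

```python
# Aggregate capacity of the proposed system from the per-server specification.
# Per-GPU core counts are NVIDIA's published RTX 3090 figures (not from the text).

SERVERS = 16
CPUS_PER_SERVER = 2
CORES_PER_CPU = 32
RAM_GB_PER_SERVER = 512
GPUS_PER_SERVER = 4
CUDA_CORES_PER_GPU = 10_496
TENSOR_CORES_PER_GPU = 328
SSD_TB_PER_SERVER = 8            # one 8TB NVMe drive, as listed per server

gpus = SERVERS * GPUS_PER_SERVER

totals = {
    "cpu_cores": SERVERS * CPUS_PER_SERVER * CORES_PER_CPU,
    "gpus": gpus,
    "gpu_cores": gpus * CUDA_CORES_PER_GPU,
    "tensor_cores": gpus * TENSOR_CORES_PER_GPU,
    "ram_tb": SERVERS * RAM_GB_PER_SERVER / 1024,
    "local_storage_tb": SERVERS * SSD_TB_PER_SERVER,
}

for name, value in totals.items():
    print(f"{name}: {value:,}")
```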

The system will be fully integrated into the existing HPC infrastructure available in the School of Life Sciences and will directly benefit from access to the 4PB of GPFS-based data storage, a unified job queuing system, and a large library of scientific software tools. The system will be supported by an onsite expert team of HPC professionals.
