Machine Learning Infrastructure

Lead Research Organisation: Lancaster University
Department Name: Computing & Communications

Abstract

Machine Learning Infrastructure refers to the resources, processes, and tools required to train machine learning models and to run inference with them. Providing machine learning through a standard infrastructure gives users ready access to computational power without the heavy up-front cost otherwise attached to performing machine learning tasks. The infrastructure comprises seven vital components that deliver this service to users: Model Selection, Data Ingestion, ML Pipelines, Visualisation, Tuning, Deployment, and Inference (Lee, Yoo, Kim, Lee, & Hong, 2019). The complexity of each component is abstracted away from the user, who need only supply an initial configuration describing how the process should behave, rather than understand its internals. Furthermore, ML infrastructure gives users a wide array of choices in building their ML systems, such as different ML frameworks and compiler optimisation libraries such as TVM and Glow, allowing those less familiar with ML systems to focus only on the input, output, configuration, and evaluation of the resulting model (Li, et al., 2020).

The computational demands of training and inference require dedicated hardware able to accelerate the performance and reduce the computation time of ML tasks. GPUs are extremely common within distributed machine learning environments, with companies such as Nvidia and AMD developing hardware specifically optimised for ML workloads (Mittal & Vaishay, 2019). Nvidia is currently one of the world's leading companies developing such technology: its recent cloud- and datacentre-dedicated hardware, the V100, pushes the boundaries of computational performance for ML tasks, providing 640 Tensor Cores with a theoretical peak performance of 125 teraflops (Markidis, Chien, Laure, Peng, & Vetter, 2018). Moreover, V100s are well tuned to distributed machine learning, scaling workloads across multiple devices via NVLink or PCIe, which makes them strong candidates for integration within machine learning infrastructure (Xu, Han, & Ta, 2018).
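To make the configuration-driven abstraction above concrete, the following is a minimal, hypothetical Python sketch of how such an infrastructure might expose its seven components behind a single configuration object. All names here (PipelineConfig, ingest, deploy, and so on) are illustrative placeholders rather than the API of any particular platform.

    # Hypothetical sketch: the user writes only configuration; the seven
    # components are stubbed out to stand in for the platform's internals.
    from dataclasses import dataclass

    @dataclass
    class PipelineConfig:
        data_source: str   # Data Ingestion
        model_name: str    # Model Selection
        epochs: int        # ML Pipelines (training)
        tuner: str         # Tuning strategy
        target: str        # Deployment target, e.g. a GPU cluster

    def ingest(source):          return f"dataset<{source}>"
    def select_model(name):      return f"model<{name}>"
    def train(model, data, n):   return f"trained<{model} on {data}, {n} epochs>"
    def visualise(model):        print(f"[Visualisation] metrics for {model}")
    def tune(model, strategy):   return f"tuned<{model}, {strategy}>"
    def deploy(model, target):   return f"endpoint<{model} @ {target}>"
    def infer(endpoint, sample): return f"prediction for {sample} from {endpoint}"

    cfg = PipelineConfig("s3://bucket/train", "resnet50", 10,
                         "random-search", "gpu-cluster")
    data = ingest(cfg.data_source)
    model = train(select_model(cfg.model_name), data, cfg.epochs)
    visualise(model)
    model = tune(model, cfg.tuner)
    endpoint = deploy(model, cfg.target)
    print(infer(endpoint, "sample"))   # Inference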
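On the hardware side, a minimal sketch of distributed data-parallel training in PyTorch is shown below, assuming a single machine with multiple CUDA GPUs; the NCCL backend synchronises gradients between devices over NVLink or PCIe. The toy linear model and random data are placeholders.

    # Minimal data-parallel training sketch (PyTorch DistributedDataParallel).
    # Assumes one process per GPU on a single node; NCCL performs the
    # inter-GPU gradient all-reduce over NVLink or PCIe.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP

    def worker(rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = torch.nn.Linear(1024, 10).cuda(rank)   # placeholder model
        ddp_model = DDP(model, device_ids=[rank])
        optimiser = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        for _ in range(100):                           # placeholder data
            x = torch.randn(64, 1024, device=f"cuda:{rank}")
            y = torch.randint(0, 10, (64,), device=f"cuda:{rank}")
            loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
            optimiser.zero_grad()
            loss.backward()   # gradients averaged across all GPUs here
            optimiser.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        n_gpus = torch.cuda.device_count()
        mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)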
Businesses and research institutions can use the ML infrastructure for tasks that require extensive machine learning and GPU computation. Furthermore, establishing a machine learning infrastructure will provide a dedicated research platform for advancing areas such as environmental sustainability, profiling and optimisation of models, security, defence, and cloud computing (Boutaba, et al., 2018). This research can in turn advance machine learning itself, yielding cutting-edge security and performance-profiling tools that enable efficient optimisation and offer guarantees against state-of-the-art attacks, alongside work on data privacy, performance optimisation, model tuning, and much more (Al-Rubaie & Chang, 2019). Providing pre-integrated components and systems will therefore allow users to apply complex technologies and get the most out of their research and projects within ML. Areas such as ML compiler optimisation using systems like autoTVM can increase the performance of models drastically, but add further complexity to an already complex field within computing (Chen, et al., 2018).
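As an illustration of that trade-off, the condensed sketch below follows the pattern of TVM's published autotvm tutorials: tuning tasks are extracted from a Relay model, each task is tuned with a machine-learning cost model, and the best schedules found are applied when compiling. The trial budget, log-file name, and ResNet-18 workload are placeholders, and tuning at this scale can take hours on real hardware.

    # Condensed autoTVM sketch (after TVM's public tutorials); requires
    # TVM built with CUDA support, plus xgboost for the XGBTuner.
    import tvm
    from tvm import relay, autotvm
    import tvm.relay.testing

    target = tvm.target.cuda()
    mod, params = relay.testing.resnet.get_workload(num_layers=18, batch_size=1)

    # Extract the tunable operator tasks (e.g. conv2d) from the model.
    tasks = autotvm.task.extract_from_program(mod["main"], target=target,
                                              params=params)

    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=10, repeat=1, min_repeat_ms=100),
    )

    for task in tasks:
        tuner = autotvm.tuner.XGBTuner(task)   # cost-model-guided search
        tuner.tune(n_trial=200,                # placeholder trial budget
                   measure_option=measure_option,
                   callbacks=[autotvm.callback.log_to_file("tuning.log")])

    # Compile the model, applying the best schedules found during tuning.
    with autotvm.apply_history_best("tuning.log"):
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target=target, params=params)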

Studentship Projects

Project Reference    Relationship    Related To      Start         End           Student Name
EP/R513076/1                                         01/10/2018    30/09/2023
2461248              Studentship     EP/R513076/1    01/07/2020    31/12/2023    William Hackett