A macromolecular structure building toolkit for machine learning and cloud applications

Lead Research Organisation: University of York
Department Name: Chemistry

Abstract

Scientists are interested in the atomic structure of biological molecules: in other words, what the molecules look like. Knowing in detail what a molecule looks like provides important clues to how it might work. If we can go further and capture molecules in the process of interacting with other biological molecules, or artificial compounds such as drugs, we get a clearer picture of how they work.

Most of our knowledge of the structure of biological molecules comes from experimental techniques including X-ray crystallography and electron microscopy (EM). These experimental techniques give us pictures of the real molecules, known as maps, in which we can see an outline of the molecular structure, but we can't usually see the individual atoms or tell them apart. So we need to interpret the map in terms of what we know about the molecule from the genetic code which was used to build it. We address this in two ways: through software which allows the user to place atoms using 3D graphics to see the shapes, or through software which tries to do the same process automatically.

The automatic process involves many steps, from recognising groups of atoms to linking them up and matching them to the genetic code. Recent advances in computer vision have created huge opportunities to improve automatic interpretation, and scientists working in these areas have produced revolutionary improvements in some of the steps. However, these breakthroughs are only useful in combination with the rest of the steps. So we want to break up our automated interpretation software into the individual steps and make those steps very easy for other groups to use. They can then try replacing the step they are interested in with their new code and distribute the resulting method as a complete package.

Another interesting element of this work is that it is structured so that the primary benefit of the science is to others. Science works by scientists building on the work of others. We have observed that some of the ways in which science is done discourage this - science is done by groups led by senior scientists who are in competition with one another for funds and recognition, which disincentivises the sharing of methods and results. We want to test if there is a better way to do science and achieve more progress with less funding by working primarily to benefit others. If we are right, then over the course of 5-10 years we should be able to identify projects which have been enabled by our work, even if we did not initiate or participate in those projects. We will aim to build a qualitative picture of how our approach has impacted practice in the field by comparing projects building on our work to projects building on other components or built from scratch.

A final strand of this project is to make the tools that we write work in web browsers, so that users do not need to install special software. This will link our work with developments in cloud computing, and we will also adapt our methods to help with advances in predicting the shape of molecules which have come from Google's DeepMind project. This will make the steps of determining molecular structures more accessible to new participants in the field, including students, schools, and participants with more limited computing resources such as Chromebooks and mobile devices. Barriers to participation often serve to confine the practice of science to existing privileged groups, so making these methods more widely available will reduce inequalities of opportunity and encourage diversity in the scientific community.

Technical Summary

Understanding the atomic structure of biological macromolecules is a key step in understanding, and when necessary modifying, their behaviour. While the DeepMind AlphaFold2 software can predict the structures of many proteins, experimental methods are still required to examine ligand binding, nucleotide and carbohydrate structures. Experimental techniques including X-ray crystallography and cryo-EM produce electron density maps, which must then be interpreted in terms of atomic positions and bonds. This process has been substantially automated over past decades; however, these methods do not fully exploit modern developments in machine learning, cloud computing, and AlphaFold.

A number of groups have developed machine learning methods for parts of the model building process, but turning these into complete automated tools is a much larger task. We propose to break up our model building software into easily reusable Python components and a sample model building implementation, into which other groups can drop their own components. At the same time, we will carry out an experiment in the organisation of science - we aim to give away our software in such a way that we do not benefit directly from others building on it. The aim of this is to test the hypothesis that the organisation of science into competing hierarchical research groups can in some cases hinder progress and participation. If this is correct, then giving away tools will accelerate methods development in ways which will benefit the community, and possibly also us.
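
As a rough illustration of what "dropping in" a component could look like, the sketch below treats each model building step as a plain Python function in an ordered pipeline. It is not the project's actual software or API: the names `BuildState`, `find_fragments`, `link_fragments`, `ml_find_fragments`, `default_steps` and `run_pipeline` are all hypothetical, and the steps operate on a toy sequence string rather than on real electron density.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class BuildState:
    """Hypothetical state handed from one model building step to the next."""
    sequence: str
    fragments: List[str] = field(default_factory=list)
    model: str = ""
    log: List[str] = field(default_factory=list)


# A step is any callable that takes the state and returns the updated state.
Step = Callable[[BuildState], BuildState]


def find_fragments(state: BuildState) -> BuildState:
    # Toy placeholder for recognising groups of atoms:
    # cut the sequence into chunks of three residues.
    seq = state.sequence
    state.fragments = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return state


def link_fragments(state: BuildState) -> BuildState:
    # Toy placeholder for linking fragments into a chain.
    state.model = "-".join(state.fragments)
    return state


def default_steps() -> Dict[str, Step]:
    # Dict insertion order defines the order in which steps run.
    return {"find": find_fragments, "link": link_fragments}


def run_pipeline(state: BuildState, steps: Dict[str, Step]) -> BuildState:
    # Run every step in order and record which steps were applied.
    for name, step in steps.items():
        state = step(state)
        state.log.append(name)
    return state


def ml_find_fragments(state: BuildState) -> BuildState:
    # Stand-in for another group's replacement component (for example a
    # machine learning method); here it simply uses chunks of four.
    seq = state.sequence
    state.fragments = [seq[i:i + 4] for i in range(0, len(seq), 4)]
    return state


# Swap one step and reuse the rest of the sample pipeline unchanged.
steps = default_steps()
steps["find"] = ml_find_fragments  # reassigning a key keeps its position
result = run_pipeline(BuildState("MKTAYIAK"), steps)
print(result.model)
```

The point of the design is that a group interested in only one step replaces a single entry and still gets a complete, distributable pipeline.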

The third part of the project will be to port the new components to run inside web browsers using WebAssembly. This will support a parallel project which is aiming to enable users to see and build proteins by hand within the browser. When coupled with existing cloud computing services, this will allow people around the world to determine macromolecule structures with no specialist hardware or software.

Publications

Agirre J (2023) The CCP4 suite: integrative software for macromolecular crystallography. In: Acta Crystallographica Section D: Structural Biology