A machine learning approach to identify carbon dioxide-binding proteins for sustainability and health

Lead Research Organisation: Durham University
Department Name: Biosciences

Abstract

Carbon dioxide is one of the most important gases on Earth and an absolute requirement for life. The planet faces long-term increases in atmospheric CO2, which are predicted to have a significant physiological impact on crops. Understanding CO2 biology is therefore of pressing strategic importance. However, protein targets for CO2 sensing are almost wholly unknown. We will develop a computational approach to identify all CO2-binding sites in proteins. The project will provide our first insight into the extent of CO2-protein interactions and supply tools to revolutionise our understanding of CO2 biology.
CO2 can spontaneously form a reversible protein post-translational modification through carbamylation of neutral N-terminal lysine ε-amino groups.
We have developed a chemical proteomics technology for covalent trapping of protein carbamates that allows their identification by mass spectrometry (MS). While effective, this technique is labour- and capital-intensive. In this context, the project has the following aims:
1. Developing a machine learning model to identify putative lysine carbamylation sites. Preliminary results indicate that a machine learning tool, trained on our experimental data and data from the literature, should be able to accurately identify lysines likely to form a carbamate. Specifically, we have demonstrated a link between structure-dependent lysine pKaH, solvent accessibility, and CO2 binding. For each lysine in our experimental dataset, we will use the PDB and AlphaFold databases to extract additional features (local amino acid environment, depth, and predicted dynamics). These data will be used to train a neural network classifier. To identify which sites are biologically relevant, we will explore incorporating sequence conservation data.

2. Predicting proteome-wide carbamylation sites. Our trained classifier will first be applied to higher plants. The most likely candidates will be experimentally verified. Should these predictions prove accurate, we will extend our predictions to every known protein. Both the final dataset and the trained classifier will be made freely available to the community.

3. Understanding the biological role of carbamylation. We will select proteins with known roles in CO2 metabolism and experimentally validated carbamylation sites. We will carry out long molecular dynamics simulations to characterise how CO2 binding affects protein structure and dynamics.
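
As a rough illustration of aims 1 and 2, the sketch below turns a lysine's pKaH and solvent accessibility into features, fits a classifier, and ranks candidate sites by predicted score. It is not the project's model: a single-neuron logistic regression stands in for the neural network classifier, the training data are synthetic, and the function names, the pH of 7.4, the pKaH ranges, and the site identifiers are all assumptions made for the example. The only chemistry used is the Henderson-Hasselbalch relation, which gives the fraction of an amine that is neutral at a given pH.

```python
import math
import random


def neutral_fraction(pka_h, ph=7.4):
    """Henderson-Hasselbalch: fraction of an amine that is unprotonated at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (pka_h - ph))


def featurise(pka_h, rel_sasa):
    """Hypothetical feature vector: neutral fraction and relative solvent accessibility (0 to 1)."""
    return [neutral_fraction(pka_h), rel_sasa]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Predicted probability of the positive class for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train(X, y, lr=0.5, epochs=1500):
    """Full-batch gradient descent on the logistic (cross-entropy) loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic toy data, NOT real measurements. Assumption for illustration only:
# positive sites have a lowered pKaH; solvent accessibility is drawn at random.
rng = random.Random(0)
X, y = [], []
for _ in range(50):
    X.append(featurise(rng.uniform(6.0, 8.0), rng.random()))
    y.append(1)
    X.append(featurise(rng.uniform(9.5, 11.0), rng.random()))
    y.append(0)

w, b = train(X, y)

# Hypothetical candidate sites: (identifier, pKaH, relative solvent accessibility).
candidates = [("protA:K12", 6.8, 0.4), ("protA:K57", 10.4, 0.6), ("protB:K101", 8.9, 0.5)]
ranked = sorted(candidates, key=lambda c: score(w, b, featurise(c[1], c[2])), reverse=True)
for site, pka_h, sasa in ranked:
    print(site, round(score(w, b, featurise(pka_h, sasa)), 3))
```

In a real pipeline the feature vector would be extended with the structure-derived descriptors named in aim 1 (local amino acid environment, depth, predicted dynamics), and the ranked list would feed the experimental verification step of aim 2.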

The project outputs will be a database of all known CO2-binding sites (derived from available structural data) to promote understanding of the molecular consequences of altered CO2 and a software tool (also available via webserver) to enable the prediction of CO2-binding sites from future structural data.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/T008695/1                                   01/10/2020  30/09/2028
2838427            Studentship   BB/T008695/1  01/10/2023  30/09/2027