2021 BBSRC-NSF/BIO UniPlex - Genome-Wide Protein Complex Prediction and Validation

Lead Research Organisation: European Bioinformatics Institute
Department Name: Molecular Networks

Abstract

Proteins are essential components that both build cellular structures and work as the tools that make the cell function. However, proteins do not operate in isolation and often form molecular machines in which several proteins bind together and with other biomolecules to act as a single entity called a molecular complex. This provides tremendous versatility and regulatory capacity, since changing a single component of a complex can dramatically alter its function. Protein complexes also often form more stable structures than isolated proteins, and their formation creates new active sites as protein chains from different molecules assemble in close proximity. It is therefore of crucial importance to know the composition of complexes and to study them as discrete functional entities in order to truly understand how cellular processes work. The Complex Portal (www.ebi.ac.uk/complexportal) is an encyclopaedic database that collates and summarizes information on stable, macromolecular complexes of known function from the scientific literature through manual curation. Complex Portal (CP) curators have now completed a first draft of all the stable molecular complexes from baker's yeast (Saccharomyces cerevisiae) and the gut bacterium Escherichia coli, both model organisms widely used for the study of basic biological processes. The next big goal for the project is the complete annotation of all human complexes (the human complexome). The CP has had multiple requests from the research community to significantly speed up the annotation of human data, but manual curation is laborious and can only partially meet demand.

There are multiple types of data available in the literature that can indicate that different proteins form part of the same complex: co-immunoprecipitation studies, where proteins that bind together are purified out via a selected protein bait; proximity data sets, in which proteins that are very close together in a cell are tagged using a bacterial enzyme; and co-fractionation experiments, where cells are broken apart and proteins that co-purify are identified. There are also public databases that compile data about how individual proteins bind each other (IntAct), describe the processes in which such proteins take part, called pathways (Reactome), or capture the 3D structure of two or more proteins bound together (wwPDB). We propose to extend the scope and relevance of the Complex Portal by applying machine learning algorithms to large datasets generated using the techniques described above, in order to identify groups of proteins that are most likely to represent functional complexes that exist in the cell. These predicted complexes will be validated against other experimental data and, where possible, also against literature evidence. We will also use large-scale studies of protein expression in different cell types, tissues, and conditions to validate the predicted complexes and to differentiate between variants of complexes formed in different conditions.

Complexes predicted to exist at high confidence will be made available through the Complex Portal website, clearly identified as computationally inferred data, where they will both guide the work of Complex Portal curators and dramatically increase the number of complexes available to researchers as reference entities. We will add further information from other resources such as Reactome and PDB to these entries, and we will map amino acid changes recorded in the IntAct database that are known to affect protein interaction strength and stability onto complex binding interfaces. This work will help accelerate our understanding of complexes as the molecular machines essential to biological processes and support basic and applied research.
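
As an illustration of the expression-based validation mentioned above, the following minimal sketch scores a predicted complex by the mean pairwise Pearson correlation of its subunits' abundance profiles across tissues or conditions. The function names, the dictionary layout, and the choice of Pearson correlation are assumptions made for this example; the project's actual validation procedure is not specified here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat profile carries no co-variation signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def mean_coexpression(members, expression):
    """Mean pairwise correlation between subunits of one predicted complex.

    `members` is a list of protein identifiers (hypothetical here) and
    `expression` maps each identifier to its abundance across the same
    ordered set of tissues or conditions. Returns None when fewer than
    two members have expression data.
    """
    profiles = [expression[p] for p in members if p in expression]
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return None
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Toy data: P1 and P2 rise together across four tissues, P3 falls.
toy_expression = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.0],
    "P3": [4.0, 3.0, 2.0, 1.0],
}
score_same_trend = mean_coexpression(["P1", "P2"], toy_expression)
score_opposite_trend = mean_coexpression(["P1", "P3"], toy_expression)
```

Under this assumed scoring, a subunit whose profile correlates poorly with the rest of a complex would be a candidate conditional, rather than core, component.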

Technical Summary

The Complex Portal (CP) is a manually curated reference resource of molecular complexes. Identification and annotation of all molecular complexes is the CP's biggest challenge, especially for the much-demanded human complexome. We propose to rapidly increase the coverage of the CP through computational inference of high confidence complexes, based on large-scale experimental and computational data. We will extend hu.MAP, the most comprehensive complex map available, by adding thousands of newly published large-scale mass spectrometry experiments. Further, we will improve upon the machine learning framework by using an automated model selection algorithm that chooses among deep learning and classical models to best discriminate between true and false protein interactions. Protein complexes will be identified by clustering the highest-scoring pairwise interactions, then validated and refined by protein (co-)expression analysis. This analysis will distinguish between core and conditional subunits and map tissue-specific expression and subunit composition, providing information-rich annotations for each individual complex. We will infer high confidence complexes for species spanning three kingdoms of life: S. cerevisiae, H. sapiens, and A. thaliana. The resulting set of high confidence inferred complexes will be enriched with structural and functional data from IntAct, wwPDB, and Reactome, including amino acid mutations known to disrupt protein interactions, mapped to complex binding interfaces. The entire prediction pipeline will be developed as a highly automated, adaptable and repeatable workflow, ensuring a continuously updated and expanded set of inferred complexes that can evolve rapidly as additional data become available. Presentation and impact of the CP will be improved through website updates and a comprehensive outreach and training program, providing the research community with a powerful tool for biological discovery.
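
The step of clustering the highest-scoring pairwise interactions into complexes can be sketched as follows. This is a simplified stand-in rather than the pipeline's method: it keeps pairs whose classifier score meets a threshold and groups proteins by connected components using union-find. The function name, the threshold of 0.8, the minimum size of 3, and the toy protein identifiers are all assumptions chosen for illustration; the clustering algorithm actually used is not specified in this summary.

```python
from collections import defaultdict


def cluster_complexes(scored_pairs, threshold=0.8, min_size=3):
    """Group proteins into candidate complexes.

    `scored_pairs` is an iterable of (protein_a, protein_b, score) tuples,
    where `score` is an assumed classifier confidence in [0, 1]. Pairs at
    or above `threshold` are treated as edges, and each connected component
    with at least `min_size` proteins is returned as one candidate complex.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b, score in scored_pairs:
        if score >= threshold:
            union(a, b)

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)

    return sorted(sorted(g) for g in groups.values() if len(g) >= min_size)


# Toy scored interactions (hypothetical identifiers and scores).
toy_pairs = [
    ("A", "B", 0.90),
    ("B", "C", 0.85),
    ("C", "D", 0.20),  # below threshold: does not join the two groups
    ("D", "E", 0.95),
    ("E", "F", 0.90),
    ("F", "D", 0.99),
    ("G", "H", 0.90),  # only two members: dropped by min_size
]
candidate_complexes = cluster_complexes(toy_pairs)
```

Connected components is the coarsest possible choice: a single spurious high-scoring pair merges two complexes. Density-aware graph clustering would be more robust, which is one reason to treat this only as a sketch of the idea.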
