2021BBSRC-NSF/BIO UniPlex - Genome-Wide Protein Complex Prediction and Validation

Lead Research Organisation: European Bioinformatics Institute

Department Name: Molecular Networks

Abstract

Proteins are essential components that both build cellular structures and work as the tools that make the cell function. However, proteins do not operate in isolation and often form molecular machines in which several proteins bind together and with other biomolecules to act as a single entity called a molecular complex. This provides tremendous versatility and regulatory capacity, since changing a single component of the complex can dramatically alter its function. Protein complexes often also form more stable structures than isolated proteins, and their formation creates new active sites as protein chains from different molecules assemble in close proximity. It is therefore of crucial importance to know the composition of complexes and study them as discrete functional entities in order to truly understand how cellular processes work. The Complex Portal (www.ebi.ac.uk/complexportal) is an encyclopaedic database that collates and summarizes information on stable, macromolecular complexes of known function from the scientific literature through manual curation. Complex Portal (CP) curators have now completed a first draft of all the stable molecular complexes from baker's yeast (Saccharomyces cerevisiae) and the gut bacterium Escherichia coli, both model organisms widely used for the study of basic biological processes. The next big goal for the project is the complete annotation of all human complexes (the human complexome). The CP has had multiple requests from the research community to significantly speed up the annotation of human data, but manual curation is laborious and can only partially meet demand.

There are multiple types of data available in the literature that can indicate that different proteins form part of the same complex: co-immunoprecipitation studies, where proteins that bind together are purified out via a selected protein bait; proximity data sets, in which a bacterial enzyme tags proteins that are very close together in a cell; and co-fractionation experiments, where cells are broken apart and proteins that co-purify are identified. There are public databases that compile data about how individual proteins bind each other (IntAct), describe the processes in which such proteins take part, called pathways (Reactome), or capture the 3D structure of two or more proteins bound together (wwPDB). We propose to extend the scope and relevance of the Complex Portal by using machine learning algorithms that can identify, from large datasets generated using the techniques described above, groups of proteins that are most likely to represent functional complexes existing in the cell. These predictions of complexes will be validated against other experimental data and, where possible, also against literature evidence. We will also use large-scale studies of protein expression in different cell types, tissues, and conditions to validate the predicted complexes and to differentiate between variants of complexes formed in different conditions.
Complexes predicted to exist at high confidence will be made available through the Complex Portal website, properly identified as computationally inferred data, where they will both guide the work of Complex Portal curators and dramatically increase the number of complexes available to researchers as reference entities. We will add further information from other resources such as Reactome and PDB to these entries, and we will map amino acid changes known to affect protein interaction strength and stability onto complex binding interfaces from the IntAct database. This work will help accelerate our understanding of complexes as the molecular machines essential to biological processes and support basic and applied research.

Technical Summary

The Complex Portal (CP) is a manually curated reference resource of molecular complexes. Identification and annotation of all molecular complexes is the CP's biggest challenge, especially for the much-demanded human complexome. We propose to rapidly increase the coverage of the CP through computational inference of high confidence complexes, based on large-scale experimental and computational data. We will extend hu.MAP, the most comprehensive complex map available, by adding thousands of newly published large-scale mass spectrometry experiments. Further, we will improve upon the machine learning framework using an automated model selection algorithm selecting among deep learning as well as classical models to best discriminate between true and false protein interactions. Protein complexes will be identified by clustering of the highest-scoring pairwise interactions, then validated and refined by protein (co-)expression analysis. This will distinguish between core and conditional subunits and map tissue-specific expression and subunit composition, providing information-rich annotations for each individual complex. We will infer high confidence complexes for species spanning three kingdoms of life: S. cerevisiae, H. sapiens, and A. thaliana. The resulting set of high confidence inferred complexes will be enriched with structural and functional data from IntAct, wwPDB, and Reactome, including amino acid mutations known to disrupt protein interactions mapped to complex binding interfaces. The entire prediction pipeline will be developed as a highly automated, adaptable and repeatable workflow which will ensure a continuously updated and expanded set of inferred complexes that can rapidly evolve with additional data becoming available. Presentation and impact of the CP will be improved through website updates and a comprehensive outreach and training program providing a powerful tool for biological discovery for the research community.
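The clustering and refinement steps described in the Technical Summary can be illustrated with a minimal sketch. It is not the hu.MAP or Complex Portal pipeline: connected components over thresholded pairs stand in for whatever clustering method the project actually uses, and the function names, the 0.8 score threshold, the minimum complex size and the 0.9 "core" cut-off are all assumptions chosen for illustration.

```python
from collections import defaultdict


def cluster_complexes(scored_pairs, threshold=0.8, min_size=3):
    """Group proteins linked by high-scoring pairs into candidate complexes.

    scored_pairs: iterable of (protein_a, protein_b, score) tuples.
    Connected components are a simple stand-in for a real clustering method.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:  # keep only the highest-scoring interactions
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_size)


def split_core_conditional(members, detected_in, core_fraction=0.9):
    """Split subunits into core and conditional by how widely they are expressed.

    detected_in: dict mapping protein -> set of tissues/conditions where it is detected.
    A subunit is 'core' if it is seen in at least core_fraction of all contexts
    in which any member of the complex is seen (an illustrative rule only).
    """
    contexts = set().union(*(detected_in.get(m, set()) for m in members))
    core, conditional = [], []
    for m in members:
        share = len(detected_in.get(m, set())) / len(contexts) if contexts else 0.0
        (core if share >= core_fraction else conditional).append(m)
    return core, conditional


# Toy input: placeholder protein names and scores, not real data.
pairs = [("A", "B", 0.95), ("B", "C", 0.90), ("C", "D", 0.40), ("E", "F", 0.99)]
complexes = cluster_complexes(pairs)
expression = {"A": {"liver", "brain"}, "B": {"liver", "brain"}, "C": {"liver"}}
core, conditional = split_core_conditional(complexes[0], expression)
```

With this toy input, the pair C-D falls below the threshold and E-F is dropped for being smaller than the minimum size, leaving one candidate complex in which C is labelled conditional because it is detected in only one of the two tissues.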

Funded Value:

£437,842

Funded Period:

Feb 23 - Feb 26

Funder:

BBSRC

Project Status:

Active

Project Category:

Research Grant

Project Reference:

BB/X002179/1

Principal Investigator:

Henning Hermjakob

Research Subject:

Biomolecules & biochemistry (21%)

Tools, technologies & methods (77%)

Research Topic:

Bioinformatics (28%)

Multiprotein complexes (7%)

Research approaches (14%)

Structural biology (14%)

Theoretical biology (35%)

Organisations

European Bioinformatics Institute (Lead Research Organisation)

People	ORCID iD
Henning Hermjakob (Principal Investigator)	http://orcid.org/0000-0001-8479-0262
Sandra Orchard (Co-Investigator)	http://orcid.org/0000-0002-8878-3972

Publications

Balu S (2025) Complex portal 2025: predicted human complexes and enhanced visualisation tools for the comparison of orthologous and paralogous complexes. in Nucleic acids research

Fischer S (2024) hu.MAP3.0: Atlas of human protein complexes by integration of > 25,000 proteomic experiments

Key Findings
Research Databases and Models
Engagement Activities


Description	The Complex Portal is an encyclopaedic resource of macromolecular complexes from a number of key model organisms. In addition to the expert manually curated complexes, the portal now holds high-confidence machine-learning predicted human complexes from hu.MAP3.0 and MuSIC. All data is freely available for search and download. As of 01/2025, the portal holds ca. 5,000 manually curated and ca. 15,000 computationally predicted molecular complexes. An innovative visualisation tool, the Complex Navigator, allows user-friendly comparison of related complexes, as well as grouping of complexes by orthology.
Exploitation Route	The Complex Portal is a stable reference resource for molecular complexes, providing unique complex identifiers, allowing other resources like pathway databases to refer to an external resource for molecular complexes, instead of just representing them as "bags of protein identifiers". The high quality manually curated subset of the Complex Portal can also be used as a training and/or validation dataset for machine learning approaches to complex prediction. This has already been implemented for the hu.MAP 3.0 dataset [ https://doi.org/10.1101/2024.10.11.617930 ].
Sectors	Agriculture, Food and Drink; Pharmaceuticals and Medical Biotechnology
URL	https://www.ebi.ac.uk/complexportal/
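The use of the curated subset as a validation dataset, mentioned under Exploitation Route, can be sketched as a set-overlap comparison. This is an illustrative assumption, not the evaluation used for hu.MAP 3.0: the Jaccard index is one common choice among several, and the identifiers below are hypothetical placeholders rather than real Complex Portal accessions.

```python
def jaccard(a, b):
    """Jaccard index of two member sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def best_matches(predicted, curated):
    """For each predicted complex, find the curated complex with the highest overlap.

    predicted, curated: dicts mapping complex ID -> set of member protein IDs.
    Returns a dict: predicted ID -> (best curated ID or None, Jaccard score).
    """
    result = {}
    for pred_id, members in predicted.items():
        best_id, best_score = None, 0.0
        for ref_id, ref_members in curated.items():
            score = jaccard(members, ref_members)
            if score > best_score:
                best_id, best_score = ref_id, score
        result[pred_id] = (best_id, best_score)
    return result


# Hypothetical placeholder IDs and members, for illustration only.
predicted = {"pred_1": {"P1", "P2", "P3"}, "pred_2": {"P9"}}
curated = {"ref_A": {"P1", "P2", "P3", "P4"}, "ref_B": {"P5", "P6"}}
matches = best_matches(predicted, curated)
```

Here pred_1 shares three of four members with ref_A, while pred_2 overlaps with no curated complex and is returned with no match.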


Title	Complex Portal
Description	The Complex Portal is a manually curated, encyclopaedic resource of macromolecular complexes from a number of key model organisms, entered into the IntAct molecular interaction database (https://www.ebi.ac.uk/intact/). Data includes protein-only complexes as well as protein-small molecule and protein-nucleic acid complexes. All complexes are derived from physical molecular interaction evidence extracted from the literature and cross-referenced in the entry, from curator inference based on information on homologs in closely related species, or from inference based on scientific background. All complexes are tagged with Evidence and Conclusion Ontology codes to indicate the type of evidence available for each entry.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
Impact	The Complex Portal is a unique reference resource for manually curated biomolecular complexes.
URL	https://www.re3data.org/repository/r3d100013295


Title	The Complex Portal - beyond binaries or how to tame the spaghetti monster?
Description	The EMBL-EBI Complex Portal (www.ebi.ac.uk/intact/complex) is a central service that provides manually curated information on stable macromolecular complexes from model organisms. The database currently holds approximately 2000 complexes, with the majority from Saccharomyces cerevisiae, human and mouse. It provides unique identifiers, names and synonyms, lists of complex members with their unique identifiers (UniProt, ChEBI, RNAcentral), function, binding and stoichiometry annotations, descriptions of their topology, assembly structure, ligands and associated diseases, as well as cross-references to the same complex in other databases (e.g. ChEMBL, GO, PDB, Reactome). Our stable identifiers are used as annotation objects in IntAct and Protein2GO and as cross-references in ChEMBL, Intermine, MatrixDB and QuickGO. PDBe and Reactome are working towards integrating complex identifiers. Having established the basic data structure and content, we are now focusing on providing a better user experience. We have completely redeveloped our website, developing and incorporating many more visualization tools, such as the ComplexViewer, PDBe's LiteMol Viewer, Reactome's DiagramJS, the Atlas widget of expression data and the MI-Circle viewer, a bespoke Chord diagram developed to give an alternative representation of complex topology, binding regions, mutations and links to InterPro domains. Future plans include building a tool that can a) explore evolutionary relationships between complexes across the database and b) infer quaternary structure of complexes for which no structure exists, using the Periodic Table of Complexes developed by the Teichmann group. This is a collaborative project, to which groups such as UniProtKB, Saccharomyces Genome Database, the UCL Gene Annotation Team and MINT database have already contributed. We welcome groups who are willing to contribute their expertise and will make editorial access and training available to you.
Individual complexes will also be added to the dataset, on request. Contact us on intact-help@ebi.ac.uk for further information.
Type Of Material	Database/Collection of data
Year Produced	2017
Provided To Others?	Yes
Impact	The Complex Portal is a unique reference resource for biomolecular complexes. As of 01/2025, it covers ca. 5,000 manually curated and 15,000 computationally predicted complexes.
URL	https://f1000research.com/slides/6-336


Description	From interactions to quantitative models: FAIR resources for systems biology
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Undergraduate students
Results and Impact	Talk given at the Eötvös Loránd University, Budapest on FAIR resources for systems biology.
Year(s) Of Engagement Activity	2024


Description	HUPO 2024: Complex Portal: a resource for functionally annotated macromolecular complexes
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Poster on Complex Portal presented at the 23rd Human Proteome Organization World Congress, October 20-24, Dresden (Germany)
Year(s) Of Engagement Activity	2024
URL	https://2024.hupo.org/


Description	HUPO 2024: Dynamic organisation of the Human protein-protein interactions
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Poster on the Dynamic organisation of the Human protein-protein interactions presented at the 23rd Human Proteome Organization World Congress, October 20-24, Dresden (Germany).
Year(s) Of Engagement Activity	2024
URL	https://2024.hupo.org/


Description	HUPO 2024: Molecular Complex Navigator: Comparative Visualization of Biomolecular Complexes
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Poster on the Molecular Complex Navigator and Comparative Visualization of Biomolecular Complexes was presented at the 23rd Human Proteome Organization World Congress, October 20-24, Dresden (Germany)
Year(s) Of Engagement Activity	2024
URL	https://2024.hupo.org/


Description	IntAct Demo with Olaitan AWE
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Training workshop IntAct Demo with Olaitan AWE during the African Omics Workshop 2024 at the African Society for Bioinformatics and Computational Biology, Cape Town, 2024.
Year(s) Of Engagement Activity	2024


Description	IntAct: Protein-protein interactions database
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Undergraduate students
Results and Impact	Training on IntAct and protein-protein interaction databases was provided at the University of Andes for undergraduate and postgraduate students in October 2024.
Year(s) Of Engagement Activity	2024


Description	Molecular interactions in the context of Rare diseases: Annotation rich dataset from the IMEx consortium.
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Talk by Senior Scientific Database Curator, Kalpana Panneerselvam, on "Molecular interactions in the context of Rare diseases: Annotation rich dataset from the IMEx consortium" at the 17th Annual International Biocuration Conference, in India.
Year(s) Of Engagement Activity	2024
URL	https://ibdc.rcb.res.in/biocuration2024/


Description	Network Analysis with Cytoscape
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Postgraduate students
Results and Impact	An introduction to the basic theory and concepts of network analysis. Attendees learned how to construct protein-protein interaction networks and subsequently use these to overlay large-scale data such as that obtained through RNA-Seq or mass-spec proteomics. The course focused on giving attendees hands-on experience in the use of one of the most commonly used open source Network Visualisation Platforms, Cytoscape.
Year(s) Of Engagement Activity	2024
URL	https://sites.google.com/cam.ac.uk/lpsjrtfh68dh3kcvfbzgfes4/home


Description	Network Biology theory and practicals with IntAct Cytoscape app
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Undergraduate students
Results and Impact	Training provided to 30 undergraduate students as part of the University of Cambridge Genomic Medicine Advanced Bioinformatics module GMO4.
Year(s) Of Engagement Activity	2024


Description	Network and IntAct basics
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Undergraduate students
Results and Impact	Training session on Network and IntAct basics within the framework of the Mathematics of Life course hosted at EBI. This course provided participants with an introduction to, and hands-on training in, the modelling approaches, tools, and resources used in systems biology, and also touched on network analysis.
Year(s) Of Engagement Activity	2024
URL	https://www.ebi.ac.uk/training/events/mathematics-life-modelling-molecular-mechanisms/


Description	Outreach Activity: Project development: Validating predicted molecular complexes through Xlinking MS
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Undergraduate students
Results and Impact	Talk on Validating predicted molecular complexes through Xlinking MS given at the National Synthesis Center for Emergence in the Molecular and Cellular Sciences during the first annual summit meeting in October 2024 at the University of Chicago.
Year(s) Of Engagement Activity	2024
URL	https://bpb-us-e1.wpmucdn.com/sites.psu.edu/dist/2/180585/files/2024/10/NCEMS-Summit-2024-program_v1...


Description	Poster on Molecular Complex Navigator
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Poster on Molecular Complex Navigator at the 14th international meeting on 'Visualizing Biological Data' (VIZBI 2024) at the University of Southern California
Year(s) Of Engagement Activity	2024
URL	https://calendar.usc.edu/event/14th_international_meeting_on_visualizing_biological_data_vizbi_2024


Description	Poster: Context-specific protein-protein interaction networks, IntAct database - Host ontology mapping
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Other audiences
Results and Impact	Poster presented at the 10th year Open Targets anniversary event held at EBI in October 2024.
Year(s) Of Engagement Activity	2024


Description	Proteomics Bioinformatics Course 2024: IntAct and IMEx
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Undergraduate students
Results and Impact	This course provided hands-on training in the basics of mass spectrometry (MS) and proteomics bioinformatics. 30 undergraduate and postgraduate students received training on how to use search engines and post-processing software, quantitative approaches, MS data repositories, the use of public databases for protein analysis, annotation of subsequent protein lists, and incorporation of information from molecular interaction and pathway databases.
Year(s) Of Engagement Activity	2024
URL	https://www.ebi.ac.uk/training/events/proteomics-bioinformatics-1/


Description	Proteomics Bioinformatics Course 2024: Network Analysis with Cytoscape
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Undergraduate students
Results and Impact	Training course which provided hands-on training in the basics of mass spectrometry (MS) and proteomics bioinformatics. 30 undergraduate and postgraduate students were provided training on how to use search engines and post-processing software, quantitative approaches, MS data repositories, the use of public databases for protein analysis, annotation of subsequent protein lists, and incorporation of information from molecular interaction and pathway databases.
Year(s) Of Engagement Activity	2024
URL	https://www.ebi.ac.uk/training/events/proteomics-bioinformatics-1/


Description	Seminar: IntAct: Curation strategy and data visualisation
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Other audiences
Results and Impact	Seminar as part of the GAA Seminar series that took place in November 2024.
Year(s) Of Engagement Activity	2024


Description	Suffolk Family Carers Young Carers visit
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	Suffolk Family Carers and Young Carers visit at EBI.
Year(s) Of Engagement Activity	2024
