An Integrated CATH Resource for the Postgenomic Era

Lead Research Organisation: University College London

Department Name: Structural Molecular Biology

Abstract

The success of the worldwide genome initiatives has given us the protein sequences for more than 300 species including human and mouse. The challenge now is to predict the functions of these proteins and how they interact with each other to give the diverse biological repertoires observed in nature. The three-dimensional structure of a protein is much harder to determine than its sequence, which explains why fewer than 25,000 structures are known compared with ~2.5 million non-redundant sequences. However, structural data often gives more profound insights into the mechanisms by which proteins act and interact. Also, because structure is more conserved than sequence, we can detect more distant relationships, giving clearer insights into how proteins evolve. A number of structural classifications exist to group proteins by their structural similarity, and these are particularly valuable for understanding how changes in the sequences and structures of relatives can modify functions. Since we cannot experimentally characterise all proteins, being able to accurately predict functions from related proteins is essential for understanding biological systems and determining the causes of and remedies for disease. The CATH classification is one of the most widely used and comprehensive of these structural family resources. It has expanded 12-fold since it was established in 1993 and is now accessed by biologists nearly 1 million times per month over the web. The only other resource of this kind is SCOP, which classifies a similar number of protein structures. The two resources employ different approaches: SCOP relies largely on manual inspection for the identification of remote structural similarities, whilst CATH applies automated algorithms, with manual inspection to validate only the hardest cases. This use of carefully validated automated approaches will ensure that CATH can cope with the massive flood of data expected over the next decade.
The worldwide structural genomics initiatives are currently solving the structures for protein families for which no structural information exists. Although these initiatives are very welcome because they are expanding our knowledge of protein structures, they are necessitating faster and much more sensitive automatic methods for CATH, as well as a greater degree of manual validation. In this project we will develop much more efficient ways of classifying these structures to keep pace with the structural genomics initiatives. Since very few proteins have known structures, CATH will bring much wider benefits to the biological community if structural data can be predicted for the millions of sequences not yet structurally characterised. We have already developed very robust technologies for predicting which genome sequences can be assigned to CATH structural families. International competitions have shown these to be amongst the best performing in the world. Using these techniques we can predict structures for up to 80% of proteins in some organisms. In this project, we therefore propose to develop an integrated resource that combines information on structural families with structural predictions for all sequences in the genomes. We also have methods to integrate any available functional information for the proteins. Furthermore, our in-house modelling techniques can provide reasonable 3D models for many of these sequences, which will help biologists in understanding the functional properties of the proteins and in determining the functional networks in which they participate. The integrated CATH resource we plan will present biologists with structural data for any protein of interest, combined with comprehensive functional data and highly intuitive web pages that help them to view the structures in the context of all the available functional data. By integrating data in this way, this resource will ultimately enrich our understanding of biological systems.

Technical Summary

The CATH integrated resource will combine data on domain structures classified in CATH with predicted sequence relatives in the genomes. In addition, 3D models will be built for genome sequences and protein interactions, where possible. Functional information extracted from public sources will be integrated for each family and inherited between relatives, using safe thresholds. In addition, we will be opening up the resource to the sequence-based community through tightly integrated prediction tools (PSIPRED). To regularly update this information we will develop more sensitive methods and robust Grid-based workflows for classifying structures, predicting structures in genome sequences, 3D modelling and integration of functional annotations. We will keep pace with the worldwide genomics initiatives by expanding our domain boundary recognition suite to include additional algorithms. Sensitivity in homologue detection will also be increased using neural-network-based approaches. Coverage of structural predictions will be significantly improved by exploiting multiple structural alignments, built for each CATH family, to improve the sensitivity and accuracy of HMM and threading methods. Information on conserved structural positions will also improve the homology modelling protocols used to build 3D models for genome sequences. Finally, protocols for integrating functional information will be improved and extended to incorporate data generated by new in-house methods being developed in related projects. The new integrated CATH resource will be available to biologists via new web pages, which will allow users to browse the resource in a much more intuitive manner, moving easily from structural family data to related sequences and their associated functions, and to view available structures or 3D models highlighted to show conserved residue positions and surface features, such as electrostatics. The data and methods will also be available via DAS and Web services.

Funded Value:

£816,262

Funded Period:

Sep 08 - Feb 14

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/F010451/1

Principal Investigator:

Christine Orengo

Research Subject:

Biomolecules & biochemistry (37%)

Tools, technologies & methods (13%)

Research Topic:

Protein expression (37%)

eScience (13%)

Organisations

People	ORCID iD
Christine Orengo (Principal Investigator)
David Jones (Co-Investigator)

Publications

Author Name Title Publication

Date Published


Dessailly BH (2009) The evolution of protein functions and networks: a family-centric approach. in Biochemical Society transactions

Jones DT (2012) PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. in Bioinformatics (Oxford, England)

Dessailly BH (2009) Exploiting structural classifications for function prediction: towards a domain grammar for protein function. in Current opinion in structural biology

Lees JG (2014) Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis. in Nucleic acids research

Furnham N (2012) FunTree: a resource for exploring the functional evolution of structurally defined enzyme superfamilies. in Nucleic acids research

Yeats C (2011) The Gene3D Web Services: a platform for identifying, annotating and comparing structural domains in protein sequences. in Nucleic acids research

Cuff A (2010) Extending CATH: increasing coverage of the protein structure universe and linking structure with function in Nucleic Acids Research

Sillitoe I (2013) New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. in Nucleic acids research

Cuff A (2009) The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies in Nucleic Acids Research

Lees J (2012) Gene3D: a domain-based resource for comparative genomics, functional annotation and protein network analysis. in Nucleic acids research

Key Findings
Impact Summary
Research Databases and Models
Research Tools and Methods
Collaboration
Software and Technical Products
Engagement Activities


Description	This project further developed the CATH-Gene3D classification of protein domain structures. CATH-Gene3D groups together evolutionarily related protein domains in order to analyse protein evolution and to inherit protein structures and functions for uncharacterised proteins. Protein domains are classified on the basis of similarities in their 3D structures and/or sequences. Around 80 million domain sequences had been classified in CATH-Gene3D by the end of the project. Comprehensive protein family data is valuable for understanding the evolution of protein domains, which in turn is helpful for designing proteins that have modified functions. It is also valuable for predicting the functions of query proteins. This is important since fewer than 10% of the 40 million protein sequences now determined have experimental characterisation. A novel algorithm was developed for sub-classifying relatives within a family into subfamilies of relatives with shared functional properties. These functional sub-families were found to be much more accurate in predicting the function of a query protein than assignment to the superfamily. They were also found to be useful for accurately modelling the structures of proteins. In collaboration with Janet Thornton at the EBI, protein family information in CATH-Gene3D was also integrated with functional data from the Catalytic Site Atlas (CSA) and the MACIE database of reaction mechanisms to provide a new web-based resource, FunTree, which displays phylogenetic data for each enzyme superfamily combined with information on ligands bound and chemistries performed by different relatives. During this project the number of proteins classified in CATH-Gene3D was significantly increased to capture nearly 90% of all known protein structures. All the major protein families in nature are now represented in CATH-Gene3D. Another output of the project was a much faster and more sensitive structure comparison algorithm and webserver (CATHEDRAL). This allows biologists to submit a query structure, which is then searched against representative structures from CATH to determine to which family the query belongs.
Exploitation Route	CATH-Gene3D is a member database in the widely accessed InterPro resource hosted at the EBI, which has more than 5 million webpage accesses per month. It is one of only two resources (out of 8 in InterPro) providing structural annotations and is therefore important for providing consensus information on predicted structural regions in protein sequences. CATH-Gene3D classification data is valuable for a number of general activities undertaken by other biologists/biomedical researchers: - Assigning structures or functions to uncharacterised proteins (as mentioned above). - Providing information on conserved and variable structural regions in domain families. This information is important in drug design (i.e. for designing compounds that bind to proteins in pathogenic organisms but not in humans). - The multiple alignment data in CATH is valuable for identifying highly conserved positions in a family and likely functional sites. This data is helpful in assessing the likely impacts of genetic variations, nsSNPs etc. - The structure comparison web-server is valuable for searching for structural analogs that may represent cross-hits for a drug designed to bind to a particular protein structure.
Sectors	Chemicals,Digital/Communication/Information Technologies (including Software),Manufacturing, including Industrial Biotechnology
URL	http://www.cathdb.info


Description	The CATH classification is widely used by biologists and biomedical researchers to understand the structure and functions of query protein sequences. This is evidenced by the web access stats of nearly 2 million web page accesses per month from more than 10,000 unique visitors. The main CATH paper has been cited 2300 times and all CATH publications cited more than 6000 times. It is widely used as a teaching tool to explain the principles of protein evolution and structure-function relationships, as evidenced by the numerous book chapters we have been invited to contribute to educational books. CATH-Gene3D classification data has been used by the following organisations and consortia: 1. Midwest Structural Genomics Consortium to analyse protein families and target structurally uncharacterised families relevant to human health for structure determination 2. Centre for Structural Genomics in Disease to target protein families implicated in virulence of pathogenic organisms for structure determination 3. The London Pain Consortium to predict associations between protein families in order to understand the protein networks/signalling pathways involved in neuropathic pain 4. The European BioSapiens Network to provide structural and functional annotations for completed genomes 5. The European EMBRACE Network to provide a publicly accessible server for protein structure comparison 6. The Europain consortium to provide information on protein families implicated in neuropathic pain 7. The Protein Databank to provide information on domain structure families 8. UniProt to assign function to proteins based on conserved sites in CATH Functional Families (FunFams). CATH algorithms and data have also been widely used by researchers in industry: CATH was one of the four major UCL bioinformatics resources used to establish the UCL company Inpharmatica in 1998. This was involved in predicting structures and functions for proteins via the 'Biopendium'.
Inpharmatica sold this and other related software packages to several large pharmaceutical companies including Pfizer, AstraZeneca and Glaxo-Wellcome. Inpharmatica was acquired by Galapagos in 2006. The latest structure comparison algorithms developed by the CATH team (CATHEDRAL) have been distributed directly to Pharma including UCB Celltech LB, Pfizer India, Cubist, DE Shaw, Signal Pharmaceuticals, Astellas, Adimab, Molecular Health and BioCrea. For example, UCB has licensed CATHEDRAL and PDBsum and ~20 of their employees have directly used these resources. The Director of Computational Structural Biology stated "All these tools work together nicely to turn protein structural information into a more digestible form, which speeds up our work process, accelerates knowledge dissemination and facilitates more informed decision making for the research and development of both small molecule and antibody therapeutics. CATHEDRAL not only offers superior performance in this type of comparison, but also automatically specifies domain boundaries for a multi-domain query through an iterative search strategy. This unique feature has saved us hundreds of man-hours by eliminating the need for manual correction when structurally characterizing potential drug targets of multiple domains". Papers exploiting CATH data and published by Thornton and Orengo have been cited 13 times across 11 patent documents (assessed over 2008 to 2014, i.e. the Research Excellence Framework (REF) period in the UK), indicating the commercial relevance of their work. The patents are filed across the USA, Europe and internationally through the PCT system and are assigned to GSK Ltd, Biogen Idec Inc. and Pharnext.
The CEO of Acpharis has stated: "Protein structure data is core to our research and we rely on fold libraries and HMM data from CATH and related resources to answer the fundamental questions that we are addressing in designing drugs for novel targets, hopefully allowing design of more novel drugs that can better treat a variety of diseases. CATH provides a valuable service to the academic and commercial sectors and is a key resource for analyzing structures and collecting the information necessary for innovative drug design".
Sector	Chemicals,Digital/Communication/Information Technologies (including Software),Education
Impact Types	Economic


Title	CATH-Gene3D FunFams (DFX)
Description	CATH-Gene3D DFX FunFams are a classification of 55 million protein domain sequences into evolutionary families in which relatives share very similar 3D-structures and functional properties, produced using the Domain Family Exploration pipeline. These groupings are described as DFX functional families - DFX FunFams. The Domain Family Exploration (DFX) algorithm used function annotation data from the Gene Ontology to sub-classify the CATH-Gene3D superfamilies into FunFams. This approach was shown to be in the top 5 performing methods in a 2014 assessment of protein function prediction. It is valuable as fewer than 10% of known sequences in UniProt have detailed experimental characterisation. CATH-Gene3D FunFams are also being used to identify domains which are significantly enriched in residue mutations associated with disease e.g. cancer, pain, ageing etc.
Type Of Material	Improvements to research infrastructure
Year Produced	2012
Provided To Others?	Yes
Impact	This functional grouping of sequences was made available to biologists through our public website - CATH-Gene3D. The website receives nearly 2 million webpage accesses per month from more than 10,000 unique visitors.
URL	http://www.cathdb.info


Title	GeMMA
Description	GeMMA (Genome Modelling and Model Annotation) is an approach to automatic functional subfamily classification within families and superfamilies of protein sequences. It is a profile-based agglomerative clustering algorithm that exploits COMPASS to compare the profiles derived from the multiple sequence alignments (MSAs) of clusters present at each stage of the clustering. At each iteration, the cluster profiles matching above a threshold are merged and profiles are generated for the new clusters. These iterations continue giving a hierarchical clustering tree built from the leaf nodes to the root, till a single cluster remains.
Type Of Material	Improvements to research infrastructure
Provided To Others?	No
Impact	The GeMMA clustering program allows subclassification of the CATH-Gene3D superfamilies into smaller groups of sequence relatives that are functionally and structurally related to each other, which is of great importance in understanding the structure-function relationships in a superfamily. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families has been demonstrated. GeMMA clusters can also help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.
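
The agglomerative clustering loop described for GeMMA above can be illustrated with a minimal Python sketch. It is not the GeMMA implementation: the `profile` and `profile_similarity` functions are hypothetical toy stand-ins (3-mer sets and a Jaccard index) for MSA-derived profiles and COMPASS comparisons, and the sketch merges only the single best-scoring pair per iteration as a simplifying assumption.

```python
from itertools import combinations


def profile(cluster):
    """Toy 'profile': the set of 3-mers found in the member sequences.
    Hypothetical stand-in for a profile built from a cluster's MSA."""
    return {seq[i:i + 3] for seq in cluster for i in range(len(seq) - 2)}


def profile_similarity(p, q):
    """Toy profile-profile score (Jaccard index).
    Hypothetical placeholder for a COMPASS comparison."""
    union = p | q
    return len(p & q) / len(union) if union else 0.0


def agglomerate(sequences, threshold):
    """Merge the best-scoring pair of clusters until no pair scores at or
    above `threshold` or a single cluster remains.
    Returns (final clusters, merge history from leaves towards the root)."""
    clusters = [(seq,) for seq in sequences]   # every sequence starts as a leaf
    merges = []
    while len(clusters) > 1:
        # Regenerate profiles for the clusters present at this stage.
        profiles = [profile(c) for c in clusters]
        score, i, j = max(
            (profile_similarity(profiles[a], profiles[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break
        merged = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j], score))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


if __name__ == "__main__":
    seqs = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG"]
    final, history = agglomerate(seqs, threshold=0.3)
    print(final)
    print(len(history), "merges")
```

With a threshold of 0.0 the loop continues until a single root cluster remains, and the recorded merge history then corresponds to a full hierarchical tree.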


Title	Protein Chart
Description	The Protein Chart provides an overview of protein domains observed in structural biology to address a growing need in the scientific community -- making the entire spectrum of protein structures instantly accessible on one chart.
Type Of Material	Improvements to research infrastructure
Year Produced	2007
Provided To Others?	Yes
Impact	The Protein Chart has been used as a teaching tool and a research companion.
URL	http://eu.wiley.com/WileyCDA/WileyTitle/productCd-3527319638,subjectCd-LS43.html


Title	CATH-Gene3D
Description	Please note that this research database is still being continuously developed and improved. CATH-Gene3D is a domain family classification. As of 2018, over 90 million protein domain sequences are classified into evolutionary superfamilies. Within these, relatives are further classed into groups in which relatives share very similar 3D-structures and functional properties. These groupings are described as functional families, or FunFams. The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.2, http://www.cathdb.info). The resource comprises over 450,000 domain structures and over 90 million protein domains classified into over 6000 homologous superfamilies. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 50,000 additional protein domains. Gene3D http://gene3d.biochem.ucl.ac.uk is a database of domain annotations of Ensembl and UniProtKB protein sequences. Domains are predicted using a library of profile HMMs representing over 6000 CATH superfamilies. The current Gene3D (v16) release has expanded its domain assignments to ~20 000 cellular genomes and over 90 million unique protein sequences, more than doubling the number of protein sequences since our last publication. Amongst other updates, we have improved our Functional Family annotation method. We have also improved the quality and coverage of our 3D homology modelling pipeline of predicted CATH domains.
Type Of Material	Database/Collection of data
Provided To Others?	Yes
Impact	CATH-Gene3D is widely used by biologists for teaching and research. There are ~1 million webpage accesses per month from ~9,000 unique visitors. CATH-Gene3D is a member database of InterPro, which receives more than 5 million web page accesses per month. It is also linked to from other major public sites including Pfam, PDB, PDBe.
URL	http://www.cathdb.info


Description	CSGID Structural Genomics Centre
Organisation	Northwestern University
Country	United States
Sector	Academic/University
PI Contribution	We analyse genome sequences to identify structurally uncharacterised protein families which are good drug targets, e.g. associated with virulence in pathogenic organisms. We also provide a webserver/database for submission of community targets for structure determination.
Collaborator Contribution	They solve the structures of representative proteins from the family
Impact	Outputs are publications and a website/database for submitting protein sequences targeted for structure determination. Multi-disciplinary - bioinformatics and structural biology.


Description	EMBRACE
Organisation	EMBL European Bioinformatics Institute (EMBL - EBI)
Country	United Kingdom
Sector	Academic/University
PI Contribution	The EMBRACE project brought together a wide group of experts throughout Europe who were involved in the use of information technology in the biomolecular sciences. The network has worked to integrate the major databases and software tools in bioinformatics, using existing methods and emerging Grid service technologies. We integrated CATH software tools predicting protein functions with tools from other groups (such as genome annotation and protein association prediction).
Collaborator Contribution	Numerous tools and databases to further scientific research
Impact	Publications. Better integrated resources and databases.


Description	ENFIN
Organisation	EMBL European Bioinformatics Institute (EMBL - EBI)
Country	United Kingdom
Sector	Academic/University
PI Contribution	Network of 30 experimental and computational biology groups. We provided CATH annotations of protein functions and protein functional associations.
Collaborator Contribution	ENFIN is a virtual institute, formed to enable systems-level integration of experimental results. Objectives: - To develop a shared approach between traditionally dry and traditionally wet researchers in the area of systems-level interpretation of experimental results - To develop a distributed computational platform for this integration and analysis of experimental data - To directly prove that such an approach has scientific value - To encourage and participate in the critical assessment of systems-level approaches - To disseminate knowledge and techniques to other academic researchers worldwide - To disseminate knowledge and techniques to commercial researchers, in particular European SMEs - To train young European researchers from a variety of backgrounds in systems-level informatics techniques.
Impact	Publications. Community resources to further scientific research.
Start Year	2006


Description	InterPro
Organisation	EMBL European Bioinformatics Institute (EMBL - EBI)
Country	United Kingdom
Sector	Academic/University
PI Contribution	InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. It combines protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. Our research team has provided the following contributions to the InterPro resource: - Structural annotations from CATH - Structural annotations from Genome3D - Mapping between CATH and SCOP protein structure classifications. CATH-Gene3D provides domain family HMMs and structure annotations to InterPro on a regular basis. We have recently provided a new tool, CATH-Resolve-Hits, for generating accurate multi-domain architecture information from sequence matches to the CATH domain HMM libraries. We currently have BBSRC BBR funding to extend the mapping between SCOP and CATH, integrate Genome3D annotations in InterPro for selected model organisms, and provide a 3D viewer for the structural annotations.
Collaborator Contribution	Annotations from other sources, manual curations, central database and web site.
Impact	Publications. Community resource to further biological research.
Start Year	2007
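
The task mentioned above, turning overlapping sequence matches into a multi-domain architecture, can be framed generically as weighted interval scheduling: choose the non-overlapping subset of hits with the highest total score. The Python sketch below illustrates that framing only. It is an assumption for illustration, not a description of how CATH-Resolve-Hits is implemented, and the `resolve_hits` function and its tuple layout are hypothetical.

```python
from bisect import bisect_right


def resolve_hits(hits):
    """Pick the best-scoring set of non-overlapping hits.

    hits: list of (start, stop, score, name) with inclusive integer
    residue coordinates (hypothetical layout for this sketch).
    Returns (total_score, chosen_hits) with hits ordered along the sequence.
    """
    hits = sorted(hits, key=lambda h: h[1])      # order by stop position
    stops = [h[1] for h in hits]
    best = [(0.0, [])]                           # best[k]: optimum over first k hits
    for k, hit in enumerate(hits):
        start, _stop, score, _name = hit
        # Number of earlier hits that end strictly before this hit starts.
        p = bisect_right(stops, start - 1, 0, k)
        take = (best[p][0] + score, best[p][1] + [hit])
        skip = best[k]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


if __name__ == "__main__":
    example = [
        (1, 100, 50.0, "A"),
        (90, 200, 60.0, "B"),    # overlaps both A and C
        (101, 180, 40.0, "C"),
        (201, 300, 30.0, "D"),
    ]
    total, chosen = resolve_hits(example)
    print(total, [h[3] for h in chosen])
```

In the example, hit B scores highest individually but is dropped because A and C together cover the same region with a higher combined score.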


Description	PDBe
Organisation	EMBL European Bioinformatics Institute (EMBL - EBI)
Country	United Kingdom
Sector	Academic/University
PI Contribution	Our resource CATH provides high quality annotations to improve the quality of the information provided by the PDBe, primarily the location of structural domains and the identification of distant evolutionary relationships between known protein structures. Our Gene3D resource provides structural annotations for genome sequences from ~20,000 species. These annotations are also incorporated in the Genome3D resource for selected model organisms. Collaboration between research groups involved in the Genome3D project has resulted in a high quality mapping between the CATH and SCOP structural classification databases. This is being implemented by the PDBe to improve the clarity and coverage of structural annotations in their resource. We currently have a BBSRC BBR funded collaboration with PDBe and InterPro to provide our CATH-Gene3D structural annotations to these resources, via the Genome3D portal.
Collaborator Contribution	Host, maintain and curate the central PDBe resource and website.
Impact	Publications. Community resources to further scientific research.
Start Year	2006


Description	Partner in the NIH-Funded Midwest Centre for Structural Genomics
Organisation	Argonne National Laboratory
Country	United States
Sector	Public
PI Contribution	We analysed completed genomes to identify protein families which had no structural characterisation
Collaborator Contribution	Our partners determined the structures of representatives from these families
Impact	multi-disciplinary - bioinformatics and structural biology


Title	SSAP- structure comparison program
Description	Algorithm for aligning protein structures. It exploits a double dynamic programming algorithm to handle insertions and deletions and so can be used to align very distantly related homologues as well as close homologues. It has been used to identify the structural relationships on which the CATH classification was based.
Type Of Technology	Software
Impact	This software is licensed by UCLi and has been sold to several companies including CellTech, Pfizer India etc.


Description	Computational Biology conference (The Netherlands)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Hundreds of people from computational biology and other life science backgrounds attended the European Conference on Computational Biology in 2016 in The Hague, The Netherlands. A poster was presented during the poster sessions at this conference and was available for attendees to view throughout the conference. During the presentation of the poster, discussions were held on the topics of analysing disease-causing mutation data with CATH-Gene3D, and the CATH-Gene3D functional families.
Year(s) Of Engagement Activity	2016
URL	https://f1000research.com/posters/5-2167


Description	Primary School Visit (Warren Road Primary)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Schools
Results and Impact	Invited to give a 1 hour lesson on "DNA, Proteins and Minecraft" for 12-15 year 6 students (expected to reach L6 Science) at a National Lead Outstanding Primary School (Warren Road, Orpington). Learning objectives included: - understanding what DNA/proteins are made of and why they are important - the basic process of evolution - introduction to how enzymes work The school went on to achieve Gold Primary Science Quality Mark with this lesson mentioned in the award. "The session was absolutely fabulous. I learnt so much! The children loved it." - Tamara Fletcher (Deputy Head and Head of Science)
Year(s) Of Engagement Activity	2014,2015,2017


Description	Public talks and workshops
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	We have given several talks on CATH-Gene3D and Genome3D in local schools in London. We participated in a Wellcome Trust funded workshop on chronic pain at which we gave a talk and demonstration of how CATH-Gene3D and Genome3D data was being used to provide structural and functional information on genes involved in chronic pain. The schools reported that our talks had generated a lot of interest in proteins and structural biology and that several students had decided to seek further information on undergraduate courses with study modules on computational biology. Our talks include images of protein structures which help in intuitively conveying information on the mechanisms by which proteins function. The Wellcome workshop on chronic pain was very well received with excellent responses to the feedback questionnai
Year(s) Of Engagement Activity	2009,2011,2012,2014,2015,2016
