A Greatly Expanded CATH-Gene3D with Functional Fingerprints to Characterise Proteins

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

There are millions of proteins being sequenced which have no known function. New CATH methods will predict their functions.

Whilst other resources do also predict function, CATH-Gene3D (referred to below as CATH) provides unique information on structurally conserved features linked to function. Structure data reveals how proteins perform their function and why the function changes if the protein is modified by mutations or other genetic variations. Protein function information is key to understanding biological systems and by extension drug design, protein engineering and disease.

CATH is a world-leading resource that classifies proteins that evolved from the same ancestral protein into evolutionary families. Currently, CATH classifies 15 million protein domains into 2600 families. Family data is valuable because evolutionary relatives (called homologues) tend to have similar 3D structures and perform similar functions. Thus the benefit of CATH is the ability to infer properties between homologues.

This is important because, of the millions of proteins currently known (>20 million), fewer than 5% have experimentally determined functions. Even in the organism of greatest interest to us, human, <10% of proteins have known functions. Because characterising proteins can be slow and very expensive, it will not be possible to study all these proteins experimentally. Therefore, biologists use CATH to predict the function of a protein based on the family to which it belongs.

Proteins are made up of 'domains' - on average two per protein. These are independently folded entities that act together to confer the function of the whole protein. CATH classifies proteins at the level of the domain and currently classifies ~70% of domains found in nature. Domains are the building blocks of proteins - a few thousand of them are combined in different ways to give the 20 million or more proteins in nature. Our group develops methods for predicting domain functions. This allows functions of whole proteins to be deduced from the functions of their constituent domains. Thus functions can be suggested for proteins made from any combination of domains.

CATH uses information on the 3D structure of the domain to give more accurate family classifications, as structure is more highly conserved, during evolution, than the sequence. Even more important - structure can reveal how the protein performs its function and whether the protein loses its function if a mutation occurs at a particular site.

We will expand CATH by 100%. Since manual validation is very time consuming, we will develop better methods for automatically recognising distant homologues. We will continuously release data (CATH-B), prior to manual curation, so that biologists can benefit from the information much sooner.

We will collaborate with the other major structure classification, SCOP, to develop common classification strategies and provide complementary information on families.

We will improve the accuracy of functional inheritance across a family. We need to do this because in some families, especially those occurring more frequently in nature, the functions can change in some relatives.

We will improve accuracy by characterising important positions in the domain, conserved across functionally similar relatives. We can build patterns of these positions to recognise other domains sharing such patterns and likely to have similar functions.

We will make it easy for biologists to use our web search tool to determine if a protein belongs to one of these functional families. We will set this up on the Cloud so that biologists can quickly search CATH with the massive datasets they obtain using new sequencing technologies. These technologies capture proteins expressed under different conditions. Our web pages will report their functions and variations in the protein that could modify function and cause disease.

Technical Summary

(1) CATHifier
We will build the CATHifier platform for classifying structures/sequences in CATH-Gene3D (referred to below as CATH). This will comprise a better homologue predictor exploiting more powerful sequence matching (meta-methods), structure matching (meta-methods) and function matching (text mining). We will improve the machine learning SVM combining this data.

CATHifier will comprise RESTful web services and Cloud-based workflows. We will make CATHifier available via webservers for users' queries. The web services will also export CATH data to PDB and InterPro.

(2) More sensitive methods for functional classification
Our functional classification/prediction tool (FunFamer) will be improved, eg by using multi-domain architecture (MDA) data, better detection of conserved residues, looking for 3D co-localisation of conserved residues, and exploiting conserved 3D motifs.

(3) FunFamer webserver for function prediction
Most proteins are not experimentally characterised and so function prediction is a major aim of the project. High Performance Compute (HPC) strategies will handle the vast datasets biologists are generating using next gen sequencing.

A recent BBSRC T&R pilot ported FunFamer to HPC facilities ie Amazon and UCL Legion (5500 nodes). We used infrastructure-on-demand services ie Amazon EC2 compute cloud and Amazon S3 storage service. Amazon virtual machines can be used in a cluster-like scenario via Sun SGE's 'cloud adapter' software or in parallel. We will improve scheduling for large datasets and explore using Hadoop and related strategies.

A major aim will be intuitive web pages displaying functions. No other resources identify structure-based functional families. We will show 3D structures highlighting functional sites conserved in both sequence and structure. We have begun this but more work is needed, eg to make the site more intuitive, align query proteins against FunFams, and display mutations close to functional sites or splice variations modifying function.

Planned Impact

We will maintain and develop a world leading resource for protein domain structure classification (CATH-Gene3D, henceforth referred to as CATH) which combines 3D structure data, tens of millions of sequences predicted to belong to CATH families and extensive information on protein functions. We will improve the purity of functional classification and thereby increase the value of the resource for both basic biosciences and also the agricultural and biomedical communities.

CATH already has a very well developed website and this will be extended to provide more detailed information on protein functions and in particular residue sites on the protein surface likely to be important for function. The new web pages will therefore inform protein design or rationalise the impacts of genetic variation eg in different plant or animal strains. For example a single residue mutation in the Rubisco protein, affecting allostery, can alter the catalytic efficiency of this enzyme in rice and promote survival in arid regions.

CATH is already widely used: the website now receives nearly 2 million web page accesses/month from ~61,000 unique visitors, and the original CATH Structure paper has been cited 1986 times (all CATH publications have been cited 6413 times).

Communities in which CATH has an impact

Basic bioscience researchers: Evidenced by the fact that CATH is one of the 8 member groups of InterPro - a consortium of major protein family resources at the EBI. Several European networks of excellence (Biosapiens, EMBRACE, IMPACT, ENFIN) included the CATH group to provide structural/functional annotations for genome sequences.

Structural biologists: Evidenced by the fact that major protein structure repositories (PDB) link directly to CATH; a major structural genomics initiative (PSI) in the States selected CATH as the structural resource for target selection.

Biomedical Researchers: Evidenced by the fact that CATH is used to provide information on protein functions, protein networks and the impacts of SNPs for large consortia researching neuropathic pain (London Pain Consortium, Europain).

Other evidence of impact is given by the range of support letters including letters from directors of major institutes (eg RCSB, EBI), companies undertaking genome annotation (eg Synthetic Genomics) and users of the data.

Research fields in which CATH will have an Impact

Agricultural and Food security - Protein sequencing initiatives are providing increasing amounts of data for plants, crops, cattle and the bacteria that interact with these hosts and cause damage. The data and tools we will develop (eg information on conserved positions involved in function) will explain variations between strains and help identify suitable strains to improve yields, taste or colour or to cope with environmental conditions eg drought, pests and pathogens.

Protein design and biotech industries - modification of proteins in pathways can yield new sources of materials and energy (ie biofuels). New proteins can be designed to build synthetic pathways. The functional family (FunFam) data can be used to constrain conserved structural core positions in the protein and identify positions more tolerant to change and useful for new designs.

Health - Knowledge of structural details in the active sites of proteins and identification of conserved 3D features is valuable for drug design. Another major benefit will be the use of conservation data in FunFams to rationalise the impact of genetic variations (eg SNPs, spliced variations) on protein functions and disease susceptibility. This will inform both diagnostic strategies and drug design.

The FunFam server will characterise the functional repertoire of metagenomes from human cavities eg gut and thereby help explain the role of commensal bacteria in promoting health.

Other - CATH has been widely used to teach students about protein structure and evolution.

Publications

Armstrong DR (2020) PDBe: improved findability of macromolecular structure data in the PDB. in Nucleic acids research

Berman HM (2016) The archiving and dissemination of biological structure data. in Current opinion in structural biology

Berman HM (2014) The Protein Data Bank archive as an open data resource. in Journal of computer-aided molecular design

Das S (2015) Diversity in protein domain superfamilies. in Current opinion in genetics & development

Das S (2016) Protein function annotation using protein domain family resources. in Methods (San Diego, Calif.)

Dawson N (2017) The Classification of Protein Domains. in Methods in molecular biology (Clifton, N.J.)

 
Description This project will maintain and develop the CATH-Gene3D classification of protein domains and improve the functional sub-classification of relatives in superfamilies. In the 2 years that this project has been running we have:

1) Developed a new algorithm for recognising homologous domains based on machine learning (SVM) that combines information on protein sequences and protein structures. This is capable of providing accurate automated assignments for an additional 60,000 structural domains in CATH, increasing the total coverage by over 15%.
2) Provided an improved method (FunFHMMER) for functional sub-classification - this groups together protein domains that have similar patterns of residue conservation indicative of common functional sites.
3) Independently validated the function prediction potential of FunFHMMER by submitting annotations to CAFA (an international function prediction assessment). This method was ranked 1st out of 129 methods for predicting molecular function and 2nd for predicting biological process (using the most stringent metric).
4) Completed two major releases of the CATH database. The latest version (v4.1) now contains over 300,000 structural domains (almost doubling the number of domains annotated before this grant started).
5) Made putative domain structure assignments available to the public through CATH-B, which provides the very latest annotations for new structures in the PDB (updated daily).
6) Updated the predicted domain structure assignments in CATH-Gene3D, which now cover 20,000 cellular genomes and over 43 million domain entries. This represents a two-fold increase since the project started.
7) CATH-Gene3D data has been updated with new functional information and information on protein interactions from a range of public sources eg IRefIndex, GO, EC, REACTOME etc.
8) Used the new functional family groupings in CATH-Gene3D to create 3D models for human sequences (without known structure) that are significantly more accurate than models generated using more established methods (eg HHpred). This work has been accepted for publication by Acta Crystallographica, a journal widely read by structural biologists who use CATH.
9) Used these predicted structures to model protein-protein interactions for human proteins.
10) Additionally, the structural models have been expanded to include an extra model organism (Drosophila melanogaster).
11) Added splicing data into Gene3D allowing structural and functional interpretations of alternative splicing.
12) Expanded our coverage of disease-associated human mutations to the much larger set integrated by UniProt and included additional visualization tools in the website. For example, it is now easier to see where the mutations are on the domains. Furthermore, for a number of binary protein interactions there is information on sub-regions of the proteins that participate in or influence the interaction. Many of these sub-regions overlap with domain regions in Gene3D. Using this data we can see which mutations are likely to disrupt protein interactions.
13) Provided new CATH-Gene3D web-pages to show the relationships between functional families within a superfamily.
14) Added new visualization tools in the Gene3D website.
15) Progressed the mapping between CATH and SCOP to facilitate more rapid annotation of sequences in InterPro and to improve the accuracy of both resources.
16) The improved mapping protocols have increased the number of superfamilies deemed to be equivalent in the two resources from 19% to 39% (732 superfamilies).
17) Improved the structure comparison algorithm to allow domains to be superposed more accurately than previously possible - highlighting the highly conserved ancient structural core across large and diverse groups of proteins.
18) Provided a major release of the DomainFinder algorithm (v4.0) which identifies the optimum domain assignments from a mixed set of sequence search results. This new CATH-Resolve-Hits algorithm is being adopted by the InterPro project.
19) These discoveries and developments have been published in 14 journal articles and 4 textbook chapters.
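
Item 18 above describes choosing the optimum domain assignments from a mixed set of sequence search results. As a rough illustration only, and not the CATH-Resolve-Hits implementation, the task can be framed as weighted interval scheduling: select the non-overlapping subset of hits with the highest total score. The `Hit` record, its fields and the scoring scheme are assumptions made for this sketch.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A hypothetical sequence-search hit: inclusive residue range plus a score."""
    name: str
    start: int
    end: int
    score: float


def resolve_hits(hits: List[Hit]) -> List[Hit]:
    """Return the non-overlapping subset of hits with the highest total score."""
    ordered = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in ordered]
    # best[i] = best total score achievable using only the first i hits
    best = [0.0] * (len(ordered) + 1)
    take = [False] * len(ordered)
    prev = [0] * len(ordered)
    for i, hit in enumerate(ordered):
        # Count of earlier hits that end strictly before this hit starts
        prev[i] = bisect_right(ends, hit.start - 1, 0, i)
        with_hit = best[prev[i]] + hit.score
        take[i] = with_hit > best[i]
        best[i + 1] = max(best[i], with_hit)
    # Walk back through the table to recover the chosen hits
    chosen = []
    i = len(ordered)
    while i > 0:
        if take[i - 1]:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]
```

In this framing a single high-scoring hit can lose to two lower-scoring hits that together cover the same region, which is the behaviour wanted when assembling a multi-domain architecture from competing matches.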
Exploitation Route CATH-Gene3D is a member database in the widely accessed InterPro resource hosted at the EBI, which has more than 5 million webpage accesses per month. It is one of only two resources (out of 11 in InterPro) providing structural annotations and is therefore important for providing consensus information on predicted structural regions in protein sequences. The mapping between CATH and SCOP will therefore be very useful for InterPro researchers integrating the information on structural annotations.

CATH classification data is valuable for a number of general activities undertaken by biologists/biomedical researchers:
- assigning structures or functions to uncharacterised proteins (as mentioned above)
- providing information on conserved and variable structural regions in domain families. This data is important in drug design (ie for designing compounds that bind to proteins in pathogenic organisms but not in humans).
- the multiple alignment data in CATH is valuable for identifying highly conserved positions in a family and likely functional sites. This data is helpful in assessing the likely impacts of genetic variations, nsSNPs etc.
- the up-to-date and comprehensive structural libraries provided by CATH can be searched to find structural analogs of a query structure, which may represent cross-hits for a drug designed to bind to the query.
- CATH-Gene3D is the only structure based resource providing functional families (FunFams currently ~100,000) which group together domain sequences likely to have highly similar structures and functions. We have established a web search tool that allows users to scan new sequences against a library of HMMs for these families in order to obtain predicted functional annotations for their protein.

The developments in our structure comparison algorithm, highlighting conservation across remote evolutionary relationships, are the basis of a collaboration with a large pharmaceutical company and this work will continue to be significant in the field of drug design.
Sectors Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology

URL http://www.cathdb.info
 
Description The CATH-Gene3D classification is widely used by biologists and biomedical researchers to understand the structure and functions of query sequences. This is evidenced by the web access stats of nearly 2 million web page accesses per month from more than 10,000 unique visitors. It is widely used as a teaching tool to explain the principles of protein evolution and structure-function relationships. The CATH data has been used by the following organisations and consortia:
1. Midwest Structural Genomics Consortium to analyse protein families and target structurally uncharacterised families with relevance to human health for structure determination.
2. Centre for Structural Genomics in Disease to target protein families implicated in virulence of pathogenic organisms for structure determination.
3. The London Pain Consortium to predict associations between protein families in order to understand the protein networks/signalling pathways involved in neuropathic pain.
4. The Europain consortium to provide information on protein families implicated in neuropathic pain.
5. The Protein Data Bank to provide information on domain structure families.
CATH algorithms and data have also been widely used by researchers in industry. CATH was one of the four major UCL bioinformatics resources used to establish the UCL company Inpharmatica in 1998. This was involved in predicting structures and functions for proteins via the 'Biopendium'. Inpharmatica sold this and other related software packages to several large pharmaceutical companies including Pfizer, Astra Zeneca and Glaxo-Wellcome. Inpharmatica was acquired by Galapagos in 2006. The main structure comparison algorithm developed by the CATH team (CATHEDRAL) has been distributed directly to Pharma including UCB Celltech LB, Pfizer India, Cubist, DE Shaw, Signal Pharmaceuticals, Astellas, Adimab, Molecular Health and BioCrea.
For example, UCB has licensed CATHEDRAL and PDBsum and ~20 of their employees have directly used these resources. The Director of Computational Structural Biology stated "All these tools work together nicely to turn protein structural information into a more digestible form, which speeds up our work process, accelerates knowledge dissemination and facilitates more informed decision making for the research and development of both small molecule and antibody therapeutics. CATHEDRAL not only offers superior performance in this type of comparison, but also automatically specifies domain boundaries for a multi-domain query through an iterative search strategy. This unique feature has saved us hundreds of man-hours by eliminating the need for manual correction when structurally characterizing potential drug targets of multiple domains". Papers exploiting CATH data and published by Thornton and Orengo have been cited 13 times across 11 patent documents (assessed over the 2008 to 2014 period, ie the Research Excellence Framework (REF) period in the UK), indicating the commercial relevance of their work. The patents are filed across the USA, Europe and internationally through the PCT system and are assigned to GSK Ltd, Biogen Idec Inc. and Pharnext. The CEO of Acpharis has stated: "Protein structure data is core to our research and we rely on fold libraries and HMM data from CATH and related resources to answer the fundamental questions that we are addressing in designing drugs for novel targets, hopefully allowing design of more novel drugs that can better treat a variety of diseases. CATH provides a valuable service to the academic and commercial sectors and is a key resource for analyzing structures and collecting the information necessary for innovative drug design".
Sector Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description Structural Bioinformatics Consortium (ELIXIR)
Geographic Reach Europe 
Policy Influence Type Influenced training of practitioners or researchers
Impact We are part of a consortium of 17 research groups developing training material in structural bioinformatics. This work is being co-ordinated by the Genome3D consortium which is managed by Orengo. Each group within the consortium is developing their own training material relating to their particular research area. This material will be integrated via on-line workflows which are being developed as a part of the TeSS platform - an on-line training catalogue and training facility being organised by the ELIXIR UK node. The CATH-Gene3D training material was developed in 2013 for an ECCB workshop on protein structure to Function held in July 2013, organised by Christine Orengo, Nicholas Furnham and Romain Studer. Christine Orengo is also deputy lead of the Functional Effects domain in Structural Bioinformatics which is integrating tools and resources from the 17 structural bioinformatics research groups mentioned. This integrated resource will be used for the interpretation of genetic variations related to health and disease. Training material is also being developed in this context.
 
Title CATH-Gene3D FunFams (FunFHMMer) 
Description Sequences in each superfamily in the CATH-Gene3D resource have been classified into functional families or FunFams by cutting a hierarchical clustering tree of superfamily sequence relatives. This was previously done by the DFX algorithm which used function annotation data from the Gene Ontology to sub-classify the superfamilies into FunFams. However, due to the paucity of GO terms and the annotation biases existing in the GO, a new approach for functionally classifying CATH superfamilies, FunFHMMer, was developed. It exploits sequence patterns and is therefore unaffected by the limitations of GO. FunFHMMer determines an optimal cut of a hierarchical clustering tree of sequence relatives within a given superfamily by calculating a novel functional coherence index based on conserved positions and specificity-determining positions (SDPs) in sequence alignments. This generates more functionally coherent functional families or FunFams than our previous classification.
Type Of Material Improvements to research infrastructure 
Year Produced 2015 
Provided To Others? Yes  
Impact The CATH-Gene3D FunFams are able to provide functional annotations for nearly 16 million domain sequences in UniProtKB and Ensembl. CATH currently identifies 110,439 FunFams. For the most populated of these (those having high information content), which account for 72% of CATH-Gene3D sequences, FunFams can be used to predict residues implicated in functional sites. CATH FunFams are also useful for analysing the variation in functions across a superfamily and, since functional sites can be identified for many FunFams, they allow a structurally informed analysis of the mechanisms of this divergence. Sequence profiles (HMMs) for highly informative FunFams are being supplied to InterPro for their metagenome portal, to provide functional annotations for bacterial sequences identified by the metagenome projects.
URL http://www.cathdb.info
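
The FunFHMMer description above refers to conserved positions and specificity-determining positions (SDPs) in sequence alignments. The following is a minimal sketch of those two ideas only; it is not the FunFHMMer functional coherence index, whose formula is not given here. It assumes equal-length, pre-aligned sequences, uses Shannon entropy as the conservation measure, and skips any column containing a gap. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    total = len(column)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(column).values())


def columns(alignment: List[str]) -> List[str]:
    """Transpose equal-length aligned sequences into column strings."""
    return ["".join(col) for col in zip(*alignment)]


def conserved_positions(alignment: List[str], max_entropy: float = 0.0) -> List[int]:
    """Indices of gap-free columns whose entropy is at or below the threshold."""
    return [i for i, col in enumerate(columns(alignment))
            if "-" not in col and column_entropy(col) <= max_entropy]


def candidate_sdps(group_a: List[str], group_b: List[str]) -> List[int]:
    """Gap-free columns fully conserved within each group but differing between them."""
    sdps = []
    for i, (col_a, col_b) in enumerate(zip(columns(group_a), columns(group_b))):
        if "-" in col_a or "-" in col_b:
            continue
        if column_entropy(col_a) == 0.0 and column_entropy(col_b) == 0.0:
            if col_a[0] != col_b[0]:
                sdps.append(i)
    return sdps
```

Under these assumptions, a column that is invariant across the whole superfamily counts as conserved, while a column that is invariant within each of two candidate groups but carries a different residue in each is a candidate SDP, which is the kind of signal that argues for keeping the two groups as separate families.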
 
Title CATH-Gene3D sequence search 
Description Within a superfamily, CATH-Gene3D classifies functional families (FunFams) that aim to group together domains that share the same function. Therefore, if a region of protein sequence provides a highly significant match to a particular CATH-Gene3D FunFam, then there is a good chance the protein shares a similar function to experimentally characterised proteins in the FunFam. Based on this assertion, the CATH-Gene3D sequence search tool provides function (GO and EC) annotations for query protein sequences based on the FunFam assignments. 
Type Of Material Improvements to research infrastructure 
Year Produced 2015 
Provided To Others? Yes  
Impact The sequence search tool provides function annotations for query sequences and can be used by biologists via a CATH web server. The function prediction pipeline using the CATH-Gene3D sequence search tool was submitted to the Critical Assessment of protein Function Annotation (CAFA) 2 and our method was ranked among the top methods for accuracy of function prediction according to a number of different scoring methods.
URL http://www.cathdb.info/search/by_sequence
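
The annotation-transfer idea in the sequence search description above can be sketched as follows. This is an illustrative outline under stated assumptions, not the CATH server's pipeline: the `FunFamMatch` record, the `funfam_terms` lookup and the E-value cut-off of 1e-3 are all hypothetical choices made for the example.

```python
from typing import Dict, List, NamedTuple, Set


class FunFamMatch(NamedTuple):
    """A hypothetical search result: FunFam identifier, matched region and E-value."""
    funfam_id: str
    start: int
    end: int
    evalue: float


def transfer_annotations(
    matches: List[FunFamMatch],
    funfam_terms: Dict[str, Set[str]],
    max_evalue: float = 1e-3,
) -> Dict[str, Set[str]]:
    """Map each confidently matched region to the annotation terms of its FunFam.

    Matches with an E-value above max_evalue are ignored. The threshold is an
    illustrative assumption, not the cut-off used by the CATH server.
    """
    annotations: Dict[str, Set[str]] = {}
    for match in matches:
        if match.evalue > max_evalue:
            continue
        region = f"{match.start}-{match.end}"
        terms = funfam_terms.get(match.funfam_id, set())
        annotations.setdefault(region, set()).update(terms)
    return annotations
```

Keying the result by matched region keeps annotations at the domain level, so a multi-domain query can receive different terms for each of its domains.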
 
Title CATHEDRAL 
Description Please note that this method is still being continuously developed and improved. CATHEDRAL is an automatic structure comparison method that can be used to search a library of protein structures to identify a structure with a matching fold. It can also be used to obtain an alignment between two protein structures and to multiply align a set of structures (CATHEDRAL-Multi) to identify the common structural core and equivalences between residues across the structures.
Type Of Material Improvements to research infrastructure 
Year Produced 2010 
Provided To Others? Yes  
Impact CATHEDRAL is being used by pharmaceutical companies with whom we collaborate to compare the ligand binding sites of protein structures and facilitate drug design. 
URL http://www.cathdb.info/search/by_structure
 
Title CATH-Gene3D 
Description Please note that this research database is still being continuously developed and improved. CATH-Gene3D is a domain family classification. As of 2018, over 90 million protein domain sequences are classified into evolutionary superfamilies. Within these, relatives are further classed into groups in which relatives share very similar 3D-structures and functional properties. These groupings are described as functional families, or FunFams. The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.2, http://www.cathdb.info). The resource comprises over 450,000 domain structures and over 90 million protein domains classified into over 6000 homologous superfamilies. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 50,000 additional protein domains. Gene3D http://gene3d.biochem.ucl.ac.uk is a database of domain annotations of Ensembl and UniProtKB protein sequences. Domains are predicted using a library of profile HMMs representing over 6000 CATH superfamilies. The current Gene3D (v16) release has expanded its domain assignments to ~20 000 cellular genomes and over 90 million unique protein sequences, more than doubling the number of protein sequences since our last publication. Amongst other updates, we have improved our Functional Family annotation method. We have also improved the quality and coverage of our 3D homology modelling pipeline of predicted CATH domains. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact CATH-Gene3D is widely used by biologists for teaching and research. There are ~1 million webpage accesses per month from ~9,000 unique visitors. CATH-Gene3D is a member database of InterPro, which receives more than 5 million web page accesses per month. It is also linked to from other major public sites including Pfam, PDB, PDBe. 
URL http://www.cathdb.info
 
Description CSGID Structural Genomics Centre 
Organisation Northwestern University
Country United States 
Sector Academic/University 
PI Contribution We analyse genome sequences to identify structurally uncharacterised protein families which are good drug targets eg associated with virulence in pathogenic organisms. We also provide a webserver/database for submission of community targets for structure determination.
Collaborator Contribution They solve the structures of representative proteins from the family
Impact Outputs are publications and a website/database for submitting protein sequences targeted for structure determination. The collaboration is multi-disciplinary - bioinformatics and structural biology.
 
Description ELIXIR 
Organisation ELIXIR
Department ELIXIR UK
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution We are part of the 3D-BioInfo ELIXIR Community in Structural Bioinformatics, which was established in January 2019 and is being coordinated by Christine Orengo. CATH-Gene3D contributes to two of the four major activities in 3D-BioInfo. Activity I relates to integration of functional sites in PDBe Knowledge Base (PDBe-KB). CATH Functional Families (FunFams) are being used to identify functional sites for domain families and this data is being integrated in PDBe-KB. Activity II relates to integration of tools and data associated with protein structure prediction. CATH functional families are being used to identify templates for homology modelling of structurally uncharacterised proteins. 3D-models have been generated for 14 model organisms including human, mouse, rat, Arabidopsis, fly, yeast and E. coli. 3D-Models are then integrated in the Genome3D resource, managed by Orengo. 3D-BioInfo Activity II involves integration of 3D-Models from Genome3D in PDBe-KB with links to UniProt. CATH-Gene3D recently received ELIXIR implementation study funding to collaborate with the SWISS-MODEL team in Switzerland to use the SWISS-MODEL pipeline together with template data from CATH functional families to build more accurate 3D models. We are planning to extend this activity to include more European partners through collaborations facilitated by 3D-BioInfo workshops. We are also part of an ELIXIR UK consortium of 17 research groups developing training material in structural bioinformatics. This work is being co-ordinated by the Genome3D consortium managed by Orengo. CATH-Gene3D training material was developed in 2013 for an ECCB workshop on protein structure to Function held in July 2013, organised by Christine Orengo, Nicholas Furnham and Romain Studer. This material has been adapted for the ELIXIR training workflows.
Christine Orengo is also deputy lead of the Functional Effects Domain in Structural Bioinformatics which is integrating tools and resources from the 17 structural bioinformatics research groups mentioned above. The Domain is part of Genomics England and is headed by Ewan Birney. The aim is to establish an integrated resource that will be used for the interpretation of genetic variations related to health and disease. Training material is also being developed in this context. ELIXIR UK funding was allocated in March 2017 to develop training workflows for predicting the impacts of genetic variations. These workflows have now been developed and are accessible via the ELIXIR TeSS Training website.
Collaborator Contribution As regards the ELIXIR 3D-BioInfo collaborations, research groups from 15 European countries are involved in this collaboration. For the Activities that CATH-Gene3D contributes to, more than 10 groups are involved from 7 countries including the UK. All are contributing predicted functional site data to PDBe-KB. We all participate in workshops held at the EBI regularly to discuss ontologies and export/import mechanisms and APIs. As regards the ELIXIR UK training workflows, each group within the consortium is developing their own training material relating to their particular research area.
Impact All predicted functional site data will be made available via the PDBe-KB. Predicted domain structure data will be made available through Genome3D, and also through PDBe-KB once the exchange mechanisms have been completed. All training material will be integrated via on-line workflows which are being developed as part of the TeSS platform - an on-line training catalogue and training facility being organised by the ELIXIR UK node.
Start Year 2013
 
Description InterPro 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. It combines protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. Our research team has provided the following contributions to the InterPro resource:

- structural annotations from CATH
- structural annotations from Genome3D
- mapping between the CATH and SCOP protein structure classifications

CATH-Gene3D provides domain family HMMs and structure annotations to InterPro on a regular basis. We have recently provided a new tool, cath-resolve-hits, for generating accurate multi-domain architecture information from sequence matches to the CATH domain HMM libraries. We currently have BBSRC BBR funding to extend the mapping between SCOP and CATH, integrate Genome3D annotations in InterPro for selected model organisms, and provide a 3D viewer for the structural annotations.
Collaborator Contribution Annotations from other sources, manual curations, central database and web site.
Impact Publications Community resource to further biological research.
Start Year 2007
 
Description Metagenomics collaboration with Ward and Hailes Group at UCL 
Organisation University College London
Department Biochemical Engineering
Country United Kingdom 
Sector Academic/University 
PI Contribution We are providing bioinformatics advice for analysing metagenome samples taken from a range of different environments, including Arctic meadows and hospital drains. In particular, we are scanning sequence fragments and contigs assembled from the metagenome samples against our libraries of sequence profiles (HMMs) for functional families (FunFams) in our CATH-Gene3D database of protein domain superfamilies. Matches can be used to identify the putative functions of enzymes in the sample and whether the enzymes are likely to have modified activity or specificity. We are currently applying our FunFam protocol to analyse metagenome samples from the MGnify resource at EBI, searching in particular for novel PETase enzymes. These will be tested experimentally by the Ward group and other collaborators in Cambridge. Our CATH-FunFam HMM library is being developed by a BBSRC BBR funded project on CATH-Gene3D.
Collaborator Contribution The groups of John Ward and Helen Hailes at UCL are performing the molecular biology and chemistry to experimentally validate predictions of enzyme functions.
Impact Multi-disciplinary collaboration: Ward (experimental molecular biology and chemistry), Hailes (experimental chemistry), Orengo (bioinformatics). Two joint publications to date.
Start Year 2016
 
Description PDBe 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution Our resource CATH provides high quality annotations to improve the quality of the information provided by the PDBe, primarily the location of structural domains and the identification of distant evolutionary relationships between known protein structures. Our Gene3D resource provides structural annotations for genome sequences from ~20,000 species. These annotations are also incorporated in the Genome3D resource for selected model organisms. Collaborations between research groups involved in the Genome3D project have resulted in a high quality mapping between the CATH and SCOP structural classification databases. This is being implemented by the PDBe to improve the clarity and coverage of structural annotations in their resource. We currently have a BBSRC BBR funded collaboration with PDBe and InterPro to provide our CATH-Gene3D structural annotations to these resources, via the Genome3D portal.
Collaborator Contribution Host, maintain and curate the central PDBe resource and website.
Impact Publications Community resources to further scientific research.
Start Year 2006
 
Description Partner in the NIH-Funded Midwest Centre for Structural Genomics 
Organisation Argonne National Laboratory
Country United States 
Sector Public 
PI Contribution We analysed completed genomes to identify protein families which had no structural characterisation
Collaborator Contribution Our partners determined the structures of representatives from these families
Impact multi-disciplinary - bioinformatics and structural biology
 
Title CATHEDRAL structure comparison algorithm 
Description CATHEDRAL was developed from SSAP, a double dynamic programming algorithm used for structure comparison. SSAP was modified to perform fast database searches, and the resulting method is nearly 1000 times faster than SSAP. There is a publicly available web server for CATHEDRAL on our in-house CATH-Gene3D database website.
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted 2010
Licensed Yes
Impact CATHEDRAL has been used for several in-house analyses of protein superfamilies leading to several publications. It has also been used in collaborations with other research groups both at UCL and externally to analyse protein superfamilies and these have also led to several publications.
 
Title SSAP- structure comparison program 
Description Algorithm for aligning protein structures. It exploits a double dynamic programming algorithm to handle insertions and deletions, and so can be used to align very distantly related homologues as well as close homologues. It has been used to identify the structural relationships on which the CATH classification was based.
Type Of Technology Software 
Impact This software is licensed by UCLi and has been sold to several companies including CellTech, Pfizer India etc.
 
Title cath-cluster: A simple way to complete-linkage cluster arbitrary data 
Description The software provides a fast implementation of complete linkage clustering that allows arbitrary data to be clustered into groups according to similarity scores. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact The software has increased visibility of our group within the bioinformatics community. 
URL http://cath-tools.readthedocs.io/en/latest/tools/cath-cluster
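
To illustrate the complete-linkage clustering that the cath-cluster entry above refers to, here is a minimal sketch in Python. It is not the cath-cluster implementation (which is written in C++ and optimised for speed); the function name `complete_linkage`, the threshold parameter and the toy scores are all assumptions made for illustration. The defining rule is that two clusters may only merge if their least similar pair of members still meets the threshold.

```python
from itertools import combinations


def complete_linkage(items, similarity, threshold):
    """Naive agglomerative complete-linkage clustering on similarity scores.

    Hypothetical helper for illustration only; not the cath-cluster code.
    """
    clusters = [[item] for item in items]

    def linkage(a, b):
        # Complete linkage: cluster similarity is the MINIMUM pairwise score.
        return min(similarity(x, y) for x in a for y in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the highest complete-linkage score.
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Toy similarity scores between four made-up entries.
scores = {
    frozenset("ab"): 0.9, frozenset("cd"): 0.8, frozenset("ac"): 0.7,
    frozenset("bc"): 0.6, frozenset("bd"): 0.3, frozenset("ad"): 0.2,
}
sim = lambda x, y: scores[frozenset((x, y))]

print(complete_linkage(list("abcd"), sim, threshold=0.5))
# → [['a', 'b'], ['c', 'd']]
```

In this toy example the two clusters are not merged, even though a and c score 0.7, because the weakest cross-cluster pair (a and d, 0.2) falls below the threshold. Single-linkage clustering would have merged them.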
 
Title cath-resolve-hits: a fast, effective way to collapse domain matches to a non-overlapping subset (i.e. domain architecture) 
Description cath-resolve-hits provides a fast, effective way to collapse a set of domain matches (e.g. from a typical protein sequence search) down to a non-overlapping subset or "domain architecture" assignment.

Fast:
- Can process around 1-2 million input hits per second

Powerful:
- Finds the optimal result that maximises the sum of hits' scores
- Handles discontinuous domains
- Supports tolerance for overlaps between hits; auto-resolves any that occur

Transparent:
- Provides visualisation of input data and decisions via graphical HTML

Simple:
- Uses a simple default input file format
- Also accepts HMMER domtblout files and hmmsearch output files
- Accepts input that hasn't been pre-sorted or even pre-grouped (but can exploit that where specified)

Configurable:
- Allows users to determine their own scoring system to be maximised
- Offers many easy-to-use options to configure the default behaviour

Software Features:
- written and tested in strict C++
- removed a number of local dependencies to allow the tool to be used by the wider community
- source code released on GitHub under the GPLv3 license (as part of the cath-tools suite)
- incorporated into a robust continuous integration (CI) build with tests and releases
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact The tool has been adopted by InterPro and the HMMER server (at EBI) as the standard method of resolving domain boundaries. 
URL http://cath-tools.readthedocs.io/en/latest/tools/cath-resolve-hits/
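
The core task described for cath-resolve-hits above, choosing the non-overlapping subset of hits that maximises the summed score, can be sketched as a weighted interval scheduling problem. The Python below is a simplified illustration under stated assumptions, not the tool's actual algorithm. It handles only continuous hits, allows no overlap tolerance, and the function name `resolve_hits` and the `(label, start, stop, score)` tuple layout are hypothetical.

```python
from bisect import bisect_right


def resolve_hits(hits):
    """Return (total score, chosen hits) for the best non-overlapping subset.

    Each hit is (label, start, stop, score) with inclusive residue numbers.
    Hypothetical helper for illustration; not the cath-resolve-hits code.
    """
    hits = sorted(hits, key=lambda h: h[2])  # sort by stop position
    stops = [h[2] for h in hits]
    # best[k] holds the optimal (score, hits) using only the first k hits.
    best = [(0.0, [])]
    for k, (label, start, stop, score) in enumerate(hits):
        # p = number of earlier hits that end strictly before this one starts.
        p = bisect_right(stops, start - 1, 0, k)
        take = (best[p][0] + score, best[p][1] + [hits[k]])
        skip = best[k]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


# Made-up hits on one query sequence.
hits = [
    ("A", 1, 100, 50.0),
    ("B", 80, 200, 60.0),
    ("C", 101, 180, 30.0),
    ("D", 201, 300, 40.0),
]
total, chosen = resolve_hits(hits)
print(total, [h[0] for h in chosen])
# → 120.0 ['A', 'C', 'D']
```

A greedy strategy that always takes the highest-scoring hit first would pick B and then D for a total of 100.0. The dynamic programme instead finds A, C and D for 120.0, which is why an optimal rather than greedy resolution matters for domain architecture assignment.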
 
Title cath-ssap: rigorous protein structure comparison 
Description cath-ssap finds the optimal structural alignment between two protein structures, then uses this alignment to calculate a quantitative measure of the structural similarity. The program employs a highly sensitive double dynamic programming algorithm that calculates and compares the local structural environment of residues. Since protein structure is more conserved than protein sequence during the process of evolution, these similarity scores provide a sensitive measure of remote homologies between distantly related proteins.

- cath-ssap is a complete rewrite of the original SSAP algorithm of Taylor and Orengo (1989)
- ported from C to strictly written and tested C++
- removed a number of local dependencies to allow the tool to be used by the wider community
- source code released on GitHub under the GPLv3 license (as part of the cath-tools suite)
- incorporated into a robust continuous integration (CI) build with tests and releases
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact The alignments and similarity scores provided by this algorithm provide fundamental evidence for the assignment of evolutionary relationships in the CATH database - a world leading resource for protein structural classification (the core scientific resource developed and maintained by our group). 
URL http://cath-tools.readthedocs.io/en/latest/tools/cath-ssap/
 
Title cath-superpose: flexible superpositions of protein structures 
Description cath-superpose provides the optimal structural superposition between two protein structures. When deciding which residues to use for the superposition, the tool takes into account the structural environment of each residue. This focuses the superposition on the parts of the alignment that align well, rather than on variable regions that can disrupt superpositions. In contrast with methods that simply attempt to minimise the RMSD, this approach can be used to build superpositions of hundreds of protein structures that clearly show the highly conserved ancient structural core within distantly related protein domain structures.

- written and tested in strict C++
- removed a number of local dependencies to allow the tool to be used by the wider community
- source code released on GitHub under the GPLv3 license (as part of the cath-tools suite)
- incorporated into a robust continuous integration (CI) build with tests and releases
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Used as a tool to superpose predicted structures from the Genome3D collaboration, and to provide superpositions of entire superfamilies for the CATH database (previously not possible).
URL http://cath-tools.readthedocs.io/en/latest/tools/cath-superpose/
 
Description 14TH INTERNATIONAL SYMPOSIUM ON INTEGRATIVE BIOINFORMATICS 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact A talk at a conference on integrating heterogeneous data to create an in-depth insight into complex biological systems. This was held at Rothamsted in June 2018. Experts were brought together from the fields of bioinformatics, computational biology, computer science, systems biology, and statistics. Christine Orengo gave a talk about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL https://www.rothamsted.ac.uk/events/14th-international-symposium-integrative-bioinformatics
 
Description 31st European Crystallography Meeting, Oviedo 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A talk at an international conference about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL https://ecm31.ecanews.org/en/welcome-to-oviedo.php
 
Description 3rd Student Conference on Mathematical Foundations in Bioinformatics 2018 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Talk to around 50 attendees at a student conference on mathematical foundations in bioinformatics. This was held at King's College London in August 2018.
Year(s) Of Engagement Activity 2018
URL https://nms.kcl.ac.uk/informatics/events/MatBio2018/
 
Description BioProNET Big Data and Computational Biology in Bioprocessing Workshop 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Talk at a workshop on computational biology and bioprocessing in June 2018.
Year(s) Of Engagement Activity 2018
URL http://biopronetuk.org/biopronet-funded-collaboration-building-workshops/
 
Description Biochemical Society Workshop on Exploiting Protein Structure to Determine the Effects of Genetic Variation 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Workshop on protein structure prediction and analysis in the context of analysing the impacts of genetic variations in proteins.

This was organised by Christine Orengo and Michael Sternberg and held at Darwin House, London.

Participants were introduced to concepts underpinning the analysis and prediction programs in CATH-Gene3D, Genome3D, PHYRE and other resources.
Year(s) Of Engagement Activity 2016
 
Description Bioinformatics and Computational Biology Conference 2018, Naples 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A talk at an international conference on computational analyses exploiting CATH-Gene3D and Genome3D data. Held in Naples, Italy in November 2018.
Year(s) Of Engagement Activity 2018
URL https://www.bbcc-meetings.it/
 
Description Bioinformatics talk in UCL Healthcare Careers Day 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Around 100 school students attended the Bioinformatics STEM talk at the Medicine and Healthcare Careers Day at UCL which was aimed at introducing Bioinformatics to school students along with a hands-on practical session on structural bioinformatics and showcasing the CATH database.
Year(s) Of Engagement Activity 2017
 
Description Cold Spring Harbor Asia conference on Frontiers in Computational Biology & Bioinformatics 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A talk at an international conference about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL https://www.csh-asia.org/2018meetings/COMP.html
 
Description Computational Biology conference (The Netherlands) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Hundreds of people from computational biology and other life science backgrounds attended the European Conference on Computational Biology in 2016 in The Hague, The Netherlands. This poster was presented during the poster sessions at this conference and was available for attendees to view throughout the conference. During the presentation of the poster, discussions were held on the topics of analysing disease-causing mutation data with CATH-Gene3D, and the CATH-Gene3D functional families.
Year(s) Of Engagement Activity 2016
URL https://f1000research.com/posters/5-2167
 
Description Computational Biology conference in July 2017 (Prague, Czech Republic) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Intelligent Systems for Molecular Biology (ISMB) is an annual academic conference on the subjects of bioinformatics and computational biology organised by the International Society for Computational Biology (ISCB). In July 2017, ISMB/ECCB was held in Prague. The principal focus of the conference is on the development and application of advanced computational methods for biological problems. Talks and posters were presented during various sessions at this conference. Christine Orengo gave a talk on computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2017
URL https://www.iscb.org/ismbeccb2017
 
Description ELIXIR 3D-BioInfo Launch Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at the ELIXIR 3D-BioInfo Launch Meeting in Basel, Switzerland in October 2018. The talk presented computational analyses exploiting CATH-Gene3D and Genome3D data. This meeting discussed the launch of a new ELIXIR community in structural bioinformatics.
Year(s) Of Engagement Activity 2018
URL https://swissmodel.expasy.org/25years/elixir
 
Description ELIXIR All Hands Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The fourth ELIXIR All Hands meeting brought together ELIXIR Node members and collaborators from partner organisations to review ELIXIR achievements and activities so far and discuss plans for the future. This meeting was held in Berlin in May 2018.

Christine Orengo gave a talk on CATH-Gene3D and Genome3D.
Year(s) Of Engagement Activity 2018
URL https://www.elixir-europe.org/events/elixir-all-hands-2018
 
Description EMBO Workshop on Pseudoenzymes 2018: From molecular mechanisms to cell biology 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Talk at EMBO Workshop on Pseudoenzymes 2018: From molecular mechanisms to cell biology. This was held in Sardinia, May 2018.
Year(s) Of Engagement Activity 2018
URL http://meetings.embo.org/event/18-pseudoenzymes
 
Description Hosted an in2scienceUK student
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact A Year 12 student was hosted in the lab for 2 weeks during the summer of 2017. This was organised by the charity in2scienceUK. The student learnt about protein structure and function and contributed to protein domain chopping which is of great value to the CATH database.
Year(s) Of Engagement Activity 2017
 
Description ISMB Chicago 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact ISMB is the world's largest bioinformatics/computational biology conference. It hosts hundreds of attendees from multiple disciplines to discuss the latest developments and applications of computational methods to solve biological problems. This conference was hosted in Chicago in July 2018. Christine Orengo gave a talk on computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL https://www.iscb.org/ismb2018
 
Description Prague Protein Spring 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at a conference to discuss the current progress and future directions of protein science. This was held in Prague in May 2018. Christine Orengo gave a talk about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL http://www.pragueproteinspring.cz
 
Description Primary School Visit (Warren Road Primary) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Invited to give a 1 hour lesson on "DNA, Proteins and Minecraft" for 12-15 year 6 students (expected to reach L6 Science) at a National Lead Outstanding Primary School (Warren Road, Orpington). Learning objectives included:

- understanding what DNA/proteins are made of and why they are important
- the basic process of evolution
- introduction to how enzymes work

The school went on to achieve Gold Primary Science Quality Mark with this lesson mentioned in the award.

"The session was absolutely fabulous. I learnt so much! The children loved it." - Tamara Fletcher (Deputy Head and Head of Science)
Year(s) Of Engagement Activity 2014,2015,2017
 
Description Public talks and workshops 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact We have given several talks on CATH-Gene3D and Genome3D in local schools in London.

We participated in a Wellcome Trust funded workshop on chronic pain, at which we gave a talk and a demonstration of how CATH-Gene3D and Genome3D data were being used to provide structural and functional information on genes involved in chronic pain.

The schools reported that our talks had generated a lot of interest in proteins and structural biology and that several students had decided to seek further information on undergraduate courses with study modules on computational biology.

Our talks include images of protein structures which help in intuitively conveying information on the mechanisms by which proteins function.

The Wellcome workshop on chronic pain was very well received with excellent responses to the feedback questionnaire
Year(s) Of Engagement Activity 2009,2011,2012,2014,2015,2016
 
Description School Visit (Isleworth Primary School) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Invited to give a 1 hour interactive presentation on "DNA, Proteins and Minecraft" for 90 year 6 students as part of a Science week at Isleworth Town Primary School, London.

Learning objectives included:

- understanding what DNA/proteins are made of and why they are important
- how DNA replication works
- the basic process of evolution
- introduction to how enzymes work
Year(s) Of Engagement Activity 2016
 
Description School Visit (Warren Road Primary School) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Invited to give a 1 hour lesson on "DNA, Proteins and Minecraft" for 12-15 year 6 students (expected to reach L6 Science) at a National Lead Outstanding Primary School (Warren Road, Orpington). Learning objectives included: understanding what DNA/proteins are made of and why they are important; the basic process of evolution; introduction to how enzymes work.
Year(s) Of Engagement Activity 2017
 
Description Structural Bioinformatics Workshop 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Talk at Structural Bioinformatics Workshop in Pune, India in March 2018.
Year(s) Of Engagement Activity 2018
 
Description Web molecular graphics Shonan Meeting (Japan) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Invited to take part in a Shonan Meeting as part of an international panel to discuss the current state-of-the-art and future directions in the field of Web Molecular Graphics. The meeting included a presentation on CATH and many focused discussion groups.

The Shonan meetings aim to promote informatics and informatics research at an international level by providing a venue for world-class scientists, promising young researchers, and practitioners in Asia.

The workshop has resulted in a number of collaborations and sparked discussions on future standards.
Year(s) Of Engagement Activity 2016
URL http://shonan.nii.ac.jp/seminar/086/