Increasing the Coverage and Accuracy of CATH for Comparative Genomics and Variant Interpretation

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

Evolution has given rise to families of protein domains where relatives are linked through speciation events or duplication events in the same genome. Extensive domain duplication and shuffling gives multi-domain proteins with varying functions depending on the domain composition.

The CATH classification takes the domain as the primary evolutionary unit and classifies relatives having significantly similar structures and sequence patterns. Currently there are 5500 CATH superfamilies containing 93 million domains.
Previous funding allowed us to hugely increase the number of domains in CATH. We want to keep increasing this data - even bigger expansions are expected as new technologies make it easier to solve structures and capture sequence data. We will improve the accuracy of our domain data by working with other classification experts (Alexey Murzin of SCOP) to establish a shared domain recognition platform for new domains at the European Bioinformatics Institute, with difficult assignments jointly validated by CATH/SCOP experts. This data will be public and valuable for other resources (eg SCOPe, ECOD).

CATH has been established for 22 years and is renowned for providing accurate structural annotations for biological analyses. More recently it significantly increased its value to the biology community by providing functional predictions.
Although the structural core of the superfamily is highly conserved, variations away from the core cause changes in function. CATH addresses this by grouping evolutionary relatives likely to have highly similar functions and structures into functional families (FunFams). Thus information about structures and functions can be accurately inherited between relatives in a FunFam. This is important as <10% of domains have been experimentally characterised. We verified in silico that FunFams can accurately model structures of uncharacterised relatives, and the ability of FunFams to inherit functional information between relatives has been validated by an international competition, CAFA. We will make the FunFams much more comprehensive and increase the accuracy of FunFams for enzymes.

Extending our FunFam library will allow us to predict more accurate multi-domain annotations in genome sequences. This will help biologists comparing the genomes of organisms occupying different environmental niches, as identification of diverse domain combinations can hint at changes in the functional repertoires of the organisms and different abilities to exploit compounds in their environments.

Because relatives in FunFams are so structurally conserved we can align and superpose them to extract the characteristics of this conserved structural core and use this information to build a '3D core-template'. These templates will help solve the structures of many more relatives since powerful new structural biology techniques (eg cryo-EM) can use core libraries like these to model the structures of uncharacterised proteins from electron dispersion data.

In another exciting development for CATH we will harness the structural data and the additional power that comes from 200-fold greater sequence data to find residue sites in the protein, conserved throughout evolution for their functional importance. We will characterise these sites. We already predict functional sites well from conservation patterns in sequence data, but including structural data can help distinguish the type of site (eg site binding a compound or another protein) and identify additional residues involved in the functional mechanism. This data is valuable for protein design and understanding why mutations near these sites affect the protein and cause disease.

We will disseminate our data via webpages and other web mechanisms and develop e-videos and training material for the new features. We'll also build more efficient mechanisms for scanning our website and for biologists to install our tools on their own computers to analyse genome data.

Technical Summary

The UCL PDRA will spend ~50% of their time maintaining CATH's computational platforms ie the software, hardware, databases and web services required to process a constantly increasing amount of data; manually validating remote homologues and new folds; developing programs to generate derived data for CATH-Plus (eg multiple structure alignments, 3D templates). The remaining time will be spent improving the accuracy of CATH data, improving web pages/APIs and building new features:
-Export DomChop Platform to EBI: modify CATH's DomChop platform to run with SCOP data and move to the EBI (in collaboration with PDBe). This will require removing/replacing all local dependencies (comprising scripts, databases, HPC and webservices).
-Expand FunFams: rework the agglomerative clustering algorithm to speed up clustering so that all domain relatives in superfamilies can be regularly clustered into FunFams. Several strategies will be explored eg using fast, rough clustering (MMseqs2.0) to guide sequence cluster comparisons, improving throughput of profile comparisons, improved batching of HPC jobs, using predictions of likely cluster-merges etc. The faster method will enable FunFams of 'Enzyme Units', with new pipelines to identify domains contributing to enzyme active sites.
-Downloadable implementation of CATH-MDA-Annotate: develop workflow providing external access to CATH tools and data, allowing users to annotate their own sequence datasets (eg full genome annotation). This will be in the form of low-dependency, open source software that is easy to download, install and run.
-Expand multiple structure alignments and site characterisation: build software for analysing multiple structure superpositions to identify conserved positions in the buried core or around known or predicted functional sites.
-Extend API for FunSite data: expand existing FunFam API to include annotations (in Stockholm format) from structure analyses (eg conserved positions in ligand binding pockets)
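
The site-characterisation item above depends on finding conserved positions in an alignment of relatives. The following is a minimal sketch of one common way to score conservation, using Shannon entropy per alignment column. It is not the CATH implementation: the function names, the toy alignment and the 0.9 threshold are assumptions made for illustration, and it works on aligned sequences only, without the structural superposition the project describes.

```python
import math
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Return one score per column: 1.0 is fully conserved, lower is more diverse."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log2(20)  # normalise by the 20 standard amino acids
    scores = []
    for col in range(length):
        # Gaps are ignored when counting residues in a column.
        residues = [seq[col] for seq in alignment if seq[col] != gap]
        if not residues:
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


def conserved_positions(alignment, threshold=0.9):
    """Return zero-based column indices whose score reaches the threshold."""
    return [i for i, s in enumerate(column_conservation(alignment)) if s >= threshold]


# Toy alignment (hypothetical data): columns 0 and 2 are invariant.
toy_alignment = ["HDGS", "HEGT", "HDGA", "H-GS"]
print(conserved_positions(toy_alignment))  # prints [0, 2]
```

In practice the threshold and the treatment of gaps would need tuning per family, and a structure-aware method could additionally restrict the search to buried or pocket-lining columns.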

Planned Impact

CATH is a world leading resource for protein domains, unique in combining 3D structures with millions of sequences predicted to belong to CATH families and extensive functional information. We will improve the accuracy of the domain assignments and predicted functional sites, thereby increasing the value of CATH for basic biosciences and the agricultural and biomedical communities.

The CATH webpages and webservers are highly accessed with 33,747 unique visitors per month and ~1.5 million hits per month (ie all files), measured using awstats which is better than webalizer at distinguishing 'human' users from 'robots'. This is a more appropriate metric than Google Analytics, which uses very strict criteria for "human" interaction and, more problematically, does not record API interactions at all. Over the last 6 months CATH has served an average of 1 million web pages/month to humans on web browsers. Taking all traffic into account (e.g. data downloads, API calls, web robots), CATH has served an average of 3.5 million pages/month. The average session duration is up by 10% and the pages per session are up by 5%, demonstrating that users are spending more time on the site and looking at more pages.

CATH web pages and scientific data are accessed from 179 different countries with the top ten being United States (16%), India (12%), United Kingdom (11%), China (11%), Germany (4%), Spain (3%), France (3%), Japan (3%), Italy (2%) and Canada (2%).

The original CATH paper is cited 2653 times and all CATH papers are cited 7789 times.

CATH has been endorsed as an ELIXIR UK resource (only 5 UK data resources are endorsed) and is the only UK resource with ELIXIR Europe-wide 'Core Resource' status - only 14 resources have similar status across Europe. ELIXIR is a European initiative providing endorsement (but not funding) for computational resources supporting the biology community.

CATH also has impact in directly supplying data to the following resources, accessed by structural, experimental and computational biologists.
- CATH domain structure annotations are provided via PDB, PDBe and RCSB websites. PDBe has ~50,000 unique visitors/month.
- Partner in InterPro - Gene3D structural annotations are disseminated by InterPro ~86,000 unique visitors/month from nearly every country in the world.
- Contributor to UniProt annotations, also widely accessed.
- Partner in Genome3D resource - an integrated resource of UK-structural bioinformatics resources providing structural annotations and 3D models for key model organisms, including human, mouse and representatives from Pfam families. Web access to Genome3D is well distributed across Europe, Asia and Americas.

The impact of CATH data on biology communities is reflected in the fact that since 2002 CATH has been a partner in 7 EU funded European Initiatives, 2 NIH funded consortia for structural genomics and 2 UK funded initiatives (eFamily (MRC) and the London Pain Consortium (Wellcome Trust)). Current partnerships include the DDIP consortium for developmental fly interactome (BBSRC), Genome3D (BBSRC - structural annotations) and FunPDBe (BBSRC funded - functional site annotations). All these projects use CATH data and tools for structural and functional annotations.

Links to Industry: Nearly 20% of CATH's unique visitors per month are from commercial IP addresses. Pharmaceutical companies also use CATH tools for structure analysis (eg the CATH structure comparison tool has been purchased by Celltech, Pfizer India and Lilly). CATH was a founding resource of the UCL company Inpharmatica involved in predicting structures and functions for proteins via the 'Biopendium'. Inpharmatica was acquired by Galapagos in 2006.
Other evidence of impact is given by the range of support letters including letters from directors of major institutes and centres and companies undertaking drug design.
CATH has also been widely used to teach students about proteins.
 
Description CATH is a protein domain structure classification which continuously processes protein structures deposited in the Protein Databank (PDBe) by identifying the boundaries of the individual domains and classifying the domains into evolutionary superfamilies. The structural data is expanded more than 200-fold by adding protein sequence data from UniProt. CATH evolutionary families are widely used by the biology community for understanding protein function and for predicting the structures and functions of uncharacterised proteins. We have developed a platform for continuously updating domain structure entries in CATH (CATH-B). CATH-B domain data has continuously expanded since the project started.

A computational platform has been established at the EBI for recognising new domain entries based on existing SCOP and CATH domains. This will be used for continuously updating the domain structure data in CATH and SCOP.

Two new CATH based algorithms have been developed: CATH-FunFam-MARC, which has been applied to update all the functional families (FunFams) in CATH, and CATH-FunFam-FRAN, which uses random splits and information on Multi Domain Architecture (MDA) to build the tree of functional relationships within a CATH Superfamily. The FunFam MARC classifier is now an order of magnitude faster than the previous method (GeMMA). This will help greatly to manage the ever increasing size of sequence databases and provide regular updates of CATH while improving classification accuracy and purity. To evaluate this new method, we generated functional families for an 'Enzyme Unit' spanning multiple CATH domains. The application to protein kinases (which were previously classified in CATH according to separate N and C terminal domains) resulted in the CATH-KinFams classification, encompassing all 330,000 protein kinases available in UniProt. Novel kinase functional families were identified and the classification is more functionally pure and more comprehensive than the widely used Manning classification. The project is published in MDPI Biomolecules with accompanying data released on Zenodo and the CATH website.

The improvements in CATH-FunFam-FRAN allow us to handle very large scale protein sequence data, e.g. from metagenome resources like MGnify at the EBI. In addition, a new machine learning (ML) algorithm, FunSites, has been developed which exploits CATH sequence and structure data to predict key functional sites in protein domains. This has been published in Bioinformatics.

We have also released a new version of CATH, version 4.3, which contains structural and functional annotations for nearly 200 million protein domains. We have published a paper in Nucleic Acids Research describing the expansion in the data.

The data in CATH functional families has been used to analyse mutations in the Spike receptor binding domain of SARS-CoV-2 and the ACE2 host receptor that it binds to. This provided insights into which animals may be susceptible to infection by the virus, thereby potentially acting as a reservoir for further viral transmission. This work was published in Scientific Reports. It also provided an analysis of genetic variations in ACE2 from different human populations to determine any associations with disease severity.

We have continued to improve the ways in which the community is able to access our data via API endpoints on the CATH website. As a result, CATH annotations, such as Superfamily and Functional Families assignments, are now available in Aquaria, a knowledgebase containing a structural coverage map dedicated to SARS-CoV-2.

We have also continued to update and migrate our in-house computational tools into open source software available to the community on GitHub (cath-tools). This includes our structural alignment and comparison algorithm (cath-ssap) and our structural superposition algorithm (cath-superpose). These tools make it possible to create and analyse the relationships between multiple protein structures in more nuanced ways. For example, optimising the superposition of multiple structures on just the local region makes it easier to see the similarities and differences in a particular place of interest (e.g. an active site).

To connect these structural annotation tools together, and make them available to the wider community, we have created an open source computational workflow (Nextflow). We have ensured that this works on multiple platforms, uses code that is readily available and can be easily installed, and takes advantage of high performance computing (HPC) resources where available. This workflow has been used to provide annotations for a large number of structural models predicted by AlphaFold. The analysis of these large-scale predicted structural annotations was published in Nature Communications.
Exploitation Route protein engineering, drug design, variant interpretation and evolutionary studies.
Sectors Digital/Communication/Information Technologies (including Software),Education,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology

URL http://www.cathdb.info
 
Description Outside academia, CATH is widely used across the global pharmaceutical industry for drug design and research and development. It is also used to assess impacts of mutations in proteins supporting clinical diagnostics (e.g. hypercholesterolemia). CATH has informed policy on the host range of SARS-CoV-2, and led to efficiencies in drug discovery. CATH functional families (FunFams) can facilitate drug repurposing to target disease genes, by providing valuable data for pharmaceutical companies interested in repurposing as a cost effective mechanism for selecting drugs. FunFams can also identify drug targets which are less likely to be associated with side effects, providing information that is valuable for drug design. CATH methods and data are being exploited in the NHS-funded Genomics England Functional Effects Domain, and in a large-scale analysis of lung cancer data to uncover mechanisms of cancer evolution: the GBP 14m, 9-year Cancer Research UK-funded TracerX project. FunFam classification also allows accurate detection of functionally important sites to guide mutagenesis experiments for synthetic biology and is being used to enhance functional sites in bacterial enzymes capable of degrading plastics and pesticides. It has also highlighted sites involved in SARS-CoV-2 infection. Recent machine learning algorithms have exploited FunFams to improve detection of functional sites (FunSites) and FunSites are being incorporated in two highly accessed resources (PDBe, with 367,655 users per month and UniProt, with ~900,000 users per month at the European Bioinformatics Institute). Both capture this data to facilitate disease diagnostics and personalised medicine. CATH is the only resource which is capable of performing functional sub-classification on such a large scale, identifying 220,000 families each with at least one experimentally characterised protein. 
Validation has shown high structural and functional coherence across FunFams, allowing much more accurate predictions to be made. CATH methods ranked in the top three (out of 150) in international assessments of molecular function prediction, and first in 2020. CATH data is also disseminated via the web portal of the international protein structure resource, the Protein Databank (PDBe), with over 4,411,871 unique users/year, and UniProt, a major source of protein functional data with over 10,800,000 unique users/year (2,220,000 of which are from industry). Further links to CATH are provided by many international web-based computational biology resources, for example Pfam and BRENDA. CATH FunFam analyses of binding sites involved in SARS-CoV-2 infection of animal hosts were used by the WHO and a UN Food and Agriculture Organisation policy unit in strategy discussions on animals at risk from infection, or which are likely to become reservoirs for the virus. These sites constitute major mechanisms of infection which are targetable by drugs. The work was also reported in several newspapers globally.
First Year Of Impact 2018
Sector Education,Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology
Impact Types Economic,Policy & public services

 
Title CATH-FRAN - a randomized splitting algorithm for the classification of Functional Families based on CATH-Gardener 
Description CATH-FRAN is an incremental update to CATH-Gardener, a pipeline for the classification of sequences in Functional Families after an initial partitioning of the dataset according to their Multi-Domain-Architecture (MDA). CATH-FRAN further splits the initial dataset into random partitions, allowing for the processing of large SuperFamilies in CATH and metagenomes sets. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact CATH-FRAN allowed the group to create Functional Families from datasets that were untreatable using the previous version of the algorithm (CATH-Gardener). These datasets include promising sequence sets from metagenomic data. 
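
The two-stage partitioning described for CATH-FRAN (an initial split by Multi-Domain Architecture, then random partitions within each group) can be sketched as follows. This is an illustrative outline only and not the CATH-FRAN code: the function names, the `max_size` parameter, the fixed seed and the toy identifiers are all assumptions.

```python
import random
from collections import defaultdict


def partition_by_mda(domains):
    """Group domain identifiers by their MDA label.

    `domains` is an iterable of (domain_id, mda_label) pairs.
    """
    groups = defaultdict(list)
    for domain_id, mda in domains:
        groups[mda].append(domain_id)
    return dict(groups)


def random_split(items, max_size, seed=0):
    """Shuffle items reproducibly and cut them into chunks of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + max_size] for i in range(0, len(shuffled), max_size)]


def partition_dataset(domains, max_size, seed=0):
    """Return (mda_label, chunk) pairs small enough to cluster independently."""
    partitions = []
    for mda, members in sorted(partition_by_mda(domains).items()):
        for chunk in random_split(members, max_size, seed):
            partitions.append((mda, chunk))
    return partitions


# Hypothetical identifiers and MDA labels, for illustration only.
toy_domains = [("d1", "A"), ("d2", "A"), ("d3", "A"), ("d4", "A-B"), ("d5", "A-B")]
for mda, chunk in partition_dataset(toy_domains, max_size=2):
    print(mda, chunk)
```

Each chunk could then be clustered separately and the resulting clusters merged afterwards; how that merge is done is outside the scope of this sketch.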
 
Title CATH-Gardener: high throughput pipeline to create Functional Families in CATH 
Description With the upcoming release of the CATH/Gene3D database (v4.3), we face a 5-fold increase in the number of structural domains (CATH) and predicted structural domains (Gene3D). Among these, the emergence of "mega" superfamilies, containing millions of protein domains has provided major challenges for computation and functional classifications. The "mega" superfamilies account for nearly two thirds of all protein domain sequences in CATH-Gene3D. To address these issues, we have created Gardener, a novel pipeline for clustering massive protein datasets into CATH Functional Families. This pipeline implements our previous clustering/tree-building algorithms (GeMMA/FunFHMMER) in an iterative approach with an initial partitioning of the data by Multi-Domain Architecture (MDA) as identified by Gene3D assignments. Gardener is a Luigi-based Python pipeline package built with HPC integration and monitoring, batch processing and traceability. Gardener allows us to manage the data production workflow for CATH/Gene3D future updates while visualizing and handling failures as well as tailor the process using different partition methods during cluster generation. We developed and applied CATH-Gardener to the CATH "mega" SuperFamilies, resulting in a general increase in FunFam coherence, a reduction in over-splitting of these functional families and a more flexible infrastructure to generate FunFams data for further CATH releases. This expanded and more comprehensive set of FunFams will enable better coverage of predicted functional annotations in CATH-Gene3D, including functional site predictions and the data will be exploited by SWISS-MODEL and other structural modelling groups using FunFams to select targets for structure prediction. There are also plans to use the functional site data to assist UniProt annotations. This algorithm has not yet been published, though we are in the process of doing so. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? No  
Impact The CATH-Gardener pipeline has enabled the CATH resource to create Functional Family clusters for extremely large superfamilies and to do so with far less manual intervention than required previously. Before this, inefficiencies in the existing algorithm had meant it was only possible to "top up" existing clusters with new sequences. Since great care was taken to keep the new algorithm as generic as possible, we have also been able to apply the same pipeline to novel problems such as generating clusters for multi-domain proteins (eg kinases). 
 
Title CATH-predict-GO - a computational pipeline for the functional annotations of proteins using Gene Ontology terms 
Description CATH-predict-GO is a python package that annotates proteins of unknown function by inheriting Gene Ontology terms using homology-based methods in a cascading fashion. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact CATH-predict-GO was evaluated during the Critical Assessment of Function Annotation (CAFA4) at the virtual ISMB2020 conference. Preliminary results showed promising potential for function prediction of model organisms. 
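
The "cascading" inheritance of Gene Ontology terms can be pictured as trying annotation sources in order of specificity and stopping at the first one that returns terms. The sketch below shows that control flow only. It does not reproduce the CATH-predict-GO package: the predictor names, lookup tables and protein identifiers are hypothetical, and the GO identifiers are used purely as example strings.

```python
def predict_go_terms(protein_id, predictors):
    """Try each (name, lookup) pair in order and return the first non-empty result.

    Returns a tuple of (source_name, set_of_go_terms); source_name is None
    when no predictor produced any annotation.
    """
    for name, lookup in predictors:
        terms = lookup(protein_id)
        if terms:
            return name, set(terms)
    return None, set()


# Hypothetical lookup tables standing in for homology searches.
FUNFAM_HITS = {"P1": {"GO:0004672"}}
SUPERFAMILY_HITS = {"P1": {"GO:0016301"}, "P2": {"GO:0016301"}}

# Most specific source first, broader fallback second.
predictors = [
    ("funfam", lambda pid: FUNFAM_HITS.get(pid, set())),
    ("superfamily", lambda pid: SUPERFAMILY_HITS.get(pid, set())),
]

for pid in ("P1", "P2", "P3"):
    print(pid, predict_go_terms(pid, predictors))
```

Returning the source name alongside the terms keeps a record of how specific the evidence was, which matters when predictions are later scored or filtered.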
 
Title FunVar: Using CATH to analyse the functional impact of mutations caused by variations in protein structure 
Description We introduce the Functional Variation (FunVar) platform (https://funvar.cathdb.info) which exploits the FunFams and structural data in CATH. FunVar has been designed to facilitate analysis of population/pathogenic variants in human proteins or pathogen/host genes. Variants (specifically non-synonymous polymorphisms i.e. residue mutations) are mapped to protein structures, where available, to allow assessment of their proximity to functional sites and therefore possible impact on protein function. Variant data is obtained from publicly available sources. We hope that visualisation of these mutations on the protein structure, illustrating their proximity to functional sites, can help guide diagnostics and therapeutics. Future editions of these pages will add quantitative data suggesting the likelihood of functional impact. Currently, CATH FunVar provides two use cases: (1) annotations for proteins of the SARS-CoV-2 virus and its human host interactor proteins, i.e. human proteins interacting with viral proteins; and (2) human proteins with mutations implicated in cancer, taken from TCGA. In the future, data on other important pathogenic infections such as tuberculosis will also be made available at CATH FunVar. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? Yes  
Impact The FunVar web resource allows the wider scientific community to explore the location of mutations (currently from SARS-CoV-2 and cancer datasets) on protein structural domains in CATH. An analysis of the sequence conservation within functionally similar domains enables the locations of these mutations to be compared against highly conserved regions of the structure. 
URL https://funvar.cathdb.info
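
Assessing a variant's proximity to a functional site comes down to a distance calculation once both have been mapped to 3D coordinates. The sketch below illustrates that step under simplifying assumptions: each residue is represented by a single point (for example its C-alpha atom), the 8.0 Å cutoff is arbitrary, and the function name and toy coordinates are hypothetical rather than taken from FunVar.

```python
import math


def variants_near_sites(variant_coords, site_coords, cutoff=8.0):
    """Return {variant_residue: distance} for variants within cutoff of any site residue.

    Both arguments map residue numbers to (x, y, z) coordinates in angstroms.
    """
    near = {}
    if not site_coords:
        return near
    for residue, xyz in variant_coords.items():
        # Distance to the closest functional-site residue.
        closest = min(math.dist(xyz, site_xyz) for site_xyz in site_coords.values())
        if closest <= cutoff:
            near[residue] = closest
    return near


# Toy coordinates (hypothetical): residue 10 is 5 angstroms from site residue 12.
variants = {10: (0.0, 0.0, 0.0), 50: (30.0, 0.0, 0.0)}
sites = {12: (3.0, 4.0, 0.0)}
print(variants_near_sites(variants, sites))  # prints {10: 5.0}
```

A fuller treatment would use all heavy atoms or side-chain centroids, and would only apply where a structure or reliable model covers the variant position.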
 
Title Pipeline to build 3D model structures with CATH and SWISS-MODEL. 
Description This project will increase interoperability between four ELIXIR resources (CATH, SWISS-MODEL, InterPro and PDBe), by building APIs that facilitate the import and export of data between them. The ultimate goal is to improve provision of 3D-Models for protein domain sequences via CATH, SWISS-MODEL and InterPro. Less than 10% of known sequences have experimentally characterised 3D structural information and yet this data is often essential for understanding the protein's molecular function and biological role and for determining whether residue mutations could damage the protein and lead to disease. So this integration is very timely as it will enhance links between sequence and structure data. APIs will be built using well established protocols and as well as promoting interoperability, and therefore sustainability, we will expand the data in each resource to ensure they serve a wider community of biologists. The tool effectively glues together existing webservices from CATH and SWISS-MODEL to provide a pipeline that takes a protein sequence and ultimately exports one or more predicted 3D structures. The results of this pipeline will be made available in Genome3D which will then be made available to the wider community through export to InterPro and PDBe. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? Yes  
Impact This collaboration has proved useful in ongoing discussions and planning when setting up the 3D-Beacons network. 
URL https://github.com/CATH-SWISSMODEL
 
Title Pipeline to identify Functional Sites (FunSites) and add to FunPDBe 
Description FunPDBe is a project of the Protein Data Bank in Europe - Knowledge Base (PDBe-KB) with the goal to create an integrated and accessible resource of structural and functional annotations for macromolecular structure data in the Protein Data Bank (PDB). It is a collaboration between the PDBe-KB and world-leading providers of structural bioinformatics data. This pipeline uses sequence and structural information from Functional Families (FunFams) in CATH to predict the location of functionally important sites within protein structures (FunSites). This information is then transformed into a JSON data structure (according to the FunPDBe schema) and exported to the PDBe-KB resource and made available on their web pages. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? Yes  
Impact Making functional site predictions available to the wider community, in a way that allows results from different resources and algorithms to be compared and contrasted, is an important part of improving the accuracy of these predictions and developing how they can be applied further. Since the PDBe-KB provides a highly trusted and well-used web resource, this also helps to add visibility for the predicting resources themselves. 
URL https://www.ebi.ac.uk/pdbe/funpdbe/deposition/
 
Title CATH-Gene3D 
Description Please note that this research database is still being continuously developed and improved. CATH-Gene3D is a domain family classification. As of 2018, over 90 million protein domain sequences are classified into evolutionary superfamilies. Within these, relatives are further classed into groups in which relatives share very similar 3D-structures and functional properties. These groupings are described as functional families, or FunFams. The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.2, http://www.cathdb.info). The resource comprises over 450,000 domain structures and over 90 million protein domains classified into over 6000 homologous superfamilies. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 50,000 additional protein domains. Gene3D http://gene3d.biochem.ucl.ac.uk is a database of domain annotations of Ensembl and UniProtKB protein sequences. Domains are predicted using a library of profile HMMs representing over 6000 CATH superfamilies. The current Gene3D (v16) release has expanded its domain assignments to ~20 000 cellular genomes and over 90 million unique protein sequences, more than doubling the number of protein sequences since our last publication. Amongst other updates, we have improved our Functional Family annotation method. We have also improved the quality and coverage of our 3D homology modelling pipeline of predicted CATH domains. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact CATH-Gene3D is widely used by biologists for teaching and research. There are ~1 million webpage accesses per month from ~9,000 unique visitors. CATH-Gene3D is a member database of InterPro, which receives more than 5 million web page accesses per month. It is also linked to from other major public sites including Pfam, PDB, PDBe. 
URL http://www.cathdb.info
 
Description ELIXIR 
Organisation ELIXIR
Department ELIXIR UK
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution We are part of the 3D-BioInfo ELIXIR Community in Structural Bioinformatics, which was established in January 2019 and is being coordinated by Christine Orengo. CATH-Gene3D contributes to two of the four major activities in 3D-BioInfo. Activity I relates to integration of functional sites in PDBe Knowledge Base (PDBe-KB). CATH Functional Families (FunFams) are being used to identify functional sites for domain families and this data is being integrated in PDBe-KB. Activity II relates to integration of tools and data associated with protein structure prediction. CATH functional families are being used to identify templates for homology modelling of structurally uncharacterised proteins. 3D-models have been generated for 14 model organisms including human, mouse, rat, Arabidopsis, fly, yeast and E. coli. 3D-Models are then integrated in the Genome3D resource, managed by Orengo. 3D-BioInfo Activity II involves integration of 3D-Models from Genome3D in PDBe-KB with links to UniProt. CATH-Gene3D recently received ELIXIR implementation study funding to collaborate with the SWISS-MODEL team in Switzerland to use the SWISS-MODEL pipeline together with template data from CATH functional families to build more accurate 3D models. We are planning to extend this activity to include more European partners through collaborations facilitated by 3D-BioInfo workshops. We are also part of an ELIXIR UK consortium of 17 research groups developing training material in structural bioinformatics. This work is being co-ordinated by the Genome3D consortium managed by Orengo. CATH-Gene3D training material was developed in 2013 for an ECCB workshop on protein structure to Function held in July 2013, organised by Christine Orengo, Nicholas Furnham and Romain Studer. This material has been adapted for the ELIXIR training workflows. 
Christine Orengo is also deputy lead of the Functional Effects Domain in Structural Bioinformatics which is integrating tools and resources from the 17 structural bioinformatics research groups mentioned above. The Domain is part of Genomics England and is headed by Ewan Birney. The aim is to establish an integrated resource that will be used for the interpretation of genetic variations related to health and disease. Training material is also being developed in this context. ELIXIR UK funding was allocated in March 2017 to develop training workflows for predicting the impacts of genetic variations. These workflows have now been developed and are accessible via the ELIXIR TeSS Training website.
Collaborator Contribution As regards the ELIXIR 3D-BioInfo collaborations, research groups from 15 European countries are involved in this collaboration. For the Activities that CATH-Gene3D contributes to, more than 10 groups are involved from 7 countries including the UK. All are contributing predicted functional site data to PDBe-KB. We all participate in workshops held at the EBI regularly to discuss ontologies and export/import mechanisms and APIs. As regards the ELIXIR UK training workflows, each group within the consortium is developing their own training material relating to their particular research area.
Impact All predicted functional site data will be made available via the PDBe-KB. Predicted domain structure data will be made available through Genome3D and also through PDBe-KB once the exchange mechanisms for that have been completed. All training material will be integrated via on-line workflows which are being developed as a part of the TeSS platform - an on-line training catalogue and training facility being organised by the ELIXIR UK node.
Start Year 2013
 
Description FunPDBe - Community driven enrichment of PDB data with structural and functional annotations 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution My group have generated structural and functional annotations for more than 95 million protein domains from UniProt. This data, which is disseminated via CATH-Gene3D, will also be exported to PDBe for selected model organisms. As part of this collaboration we are also developing training workflows for biologists wishing to access and extract this information from CATH and from FunPDBe. This work is being done in collaboration with five other UK research groups, who are also generating structural and functional annotations using diverse methods. By combining our annotations in PDBe we will increase the coverage of our annotations in the model organisms, and the consensus information helps to provide a weighting on accuracy, i.e. the more independent methods that agree on a prediction, the more likely it is to be correct.
Collaborator Contribution This is a BBSRC funded project involving the PDBe group at EBI and 10 other research groups, which has the aim of increasing the structural and functional annotations in PDBe and exploiting this data to investigate the impacts of genetic variation in proteins. There are 3 workpackages: 1) functional site data, 2) curated functional information, 3) prediction of variant impacts. Each group is contributing derived data or tools to support one or more of these 3 aims.
Impact The project has only been running 6 months. We have built the framework for exporting data from the partner groups to FunPDBe and for importing this data into FunPDBe. We have also built the framework for the training workflows and started populating the workflows with material on functional site annotations and homology modelling.
Start Year 2017
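The consensus weighting described in this record (more agreeing methods means more confidence) can be illustrated with a minimal Python sketch. This is not the FunPDBe scoring scheme, which is not described here; the function name `consensus_support`, the input layout and the use of a simple fraction of agreeing methods are all assumptions made for illustration.

```python
from collections import Counter


def consensus_support(predictions):
    """Score each residue by the fraction of methods that predict it.

    predictions -- dict mapping a (hypothetical) method name to the set of
                   residue numbers it predicts as functional sites
    Returns a dict mapping residue number to a value in (0, 1].
    """
    # Count each residue at most once per method
    counts = Counter(
        res for residues in predictions.values() for res in set(residues)
    )
    n_methods = len(predictions)
    return {res: counts[res] / n_methods for res in sorted(counts)}


if __name__ == "__main__":
    example = {
        "method_a": {10, 25},
        "method_b": {25, 40},
        "method_c": {25},
    }
    for residue, support in consensus_support(example).items():
        print(residue, round(support, 2))
```

A residue predicted by every method receives a support of 1.0, while one predicted by a single method out of three receives roughly 0.33. A real scheme would probably also weight methods by their individual reliability.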
 
Description InterPro 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. It combines protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. Our research team has provided the following contributions to the InterPro resource: structural annotations from CATH; structural annotations from Genome3D; and a mapping between the CATH and SCOP protein structure classifications. CATH-Gene3D provide domain family HMMs and structure annotations to InterPro on a regular basis. We have recently provided a new tool, CATH-Resolve-Hits, for generating accurate multi-domain architecture information from sequence matches to the CATH domain HMM libraries. We currently have BBSRC BBR funding to extend the mapping between SCOP and CATH, integrate Genome3D annotations in InterPro for selected model organisms, and provide a 3D viewer for the structural annotations.
Collaborator Contribution Annotations from other sources, manual curations, central database and web site.
Impact Publications Community resource to further biological research.
Start Year 2007
 
Description Metagenomics collaboration with Ward and Hailes Group at UCL 
Organisation University College London
Department Biochemical Engineering
Country United Kingdom 
Sector Academic/University 
PI Contribution We are providing bioinformatics advice for analysing metagenome samples taken from a range of different environments, including Arctic meadows and hospital drains. In particular we are scanning sequence fragments and contigs assembled from the metagenome samples against our libraries of sequence profiles (HMMs) for functional families (FunFams) in our CATH-Gene3D database of protein domain superfamilies. Matches can be used to identify the putative functions of enzymes in the sample and whether the enzymes are likely to have modified activity or specificity. We are currently applying our FunFam protocol to analyse metagenome samples from the MGnify resource at EBI, searching in particular for novel PETase enzymes. These will be tested experimentally by the Ward group and other collaborators in Cambridge. Our CATH-FunFam HMM library is being developed by a BBSRC BBR funded project on CATH-Gene3D.
Collaborator Contribution The groups of John Ward and Helen Hailes at UCL are performing the molecular biology and chemistry to experimentally validate predictions of enzyme functions.
Impact Multi-disciplinary collaboration: Ward (experimental molecular biology and chemistry), Hailes (experimental chemistry), Orengo (bioinformatics). Two joint publications to date.
Start Year 2016
 
Description PDBe 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution Our resource CATH provides high quality annotations to improve the quality of the information provided by the PDBe, primarily the location of structural domains and the identification of distant evolutionary relationships between known protein structures. Our Gene3D resource provides structural annotations for genome sequences from ~20,000 species. These annotations are also incorporated in the Genome3D resource for selected model organisms. Collaborations between research groups involved in the Genome3D project have resulted in a high quality mapping between the CATH and SCOP structural classification databases. This is being implemented by the PDBe to improve the clarity and coverage of structural annotations in their resource. We currently have a BBSRC BBR funded collaboration with PDBe and InterPro to provide our CATH-Gene3D structural annotations to these resources, via the Genome3D portal.
Collaborator Contribution Host, maintain and curate the central PDBe resource and website.
Impact Publications Community resources to further scientific research.
Start Year 2006
 
Description Swiss-Model - 3D Models for CATH domain sequences 
Organisation University of Basel
Country Switzerland 
Sector Academic/University 
PI Contribution This is an ELIXIR funded collaboration between the Orengo Group and the Swiss-Model Team, led by Prof. Torsten Schwede. The Orengo group will be building a computational platform to provide domain sequences predicted to belong to CATH functional families (FunFams). FunFams are generated using agglomerative clustering of domain sequences in each superfamily, guided by a protocol that assesses similarity in specificity determining residues.
Collaborator Contribution The Swiss-Model team will be building computational pipelines to import the CATH sequence data and then submit these sequences to the established Swiss-Model homology modelling platforms. The 3D models generated will be made available to the biology community via the Swiss-Model, CATH-Gene3D, PDBe and InterPro websites.
Impact We have built APIs that allow exchange of data between CATH and SWISS-MODEL. Using these we have imported 3D-Models for structurally uncharacterised CATH-FunFams into CATH. This pilot work has led to a more substantial collaboration between the partners as part of the 3D-Gateway project, which is establishing the 3D-Beacons portal to integrate 3D-Models from different resources (SWISS-MODEL, PHYRE, Rosetta, DomTHREADER).
Start Year 2017
 
Title cath-cluster: A simple way to complete-linkage cluster arbitrary data 
Description The software provides a fast implementation of complete linkage clustering that allows arbitrary data to be clustered into groups according to similarity scores. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact The software has increased visibility of our group within the bioinformatics community. 
URL http://cath-tools.readthedocs.io/en/latest/tools/cath-cluster
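To illustrate what complete-linkage clustering on similarity scores means, here is a minimal Python sketch. It is not the cath-cluster implementation and makes no attempt at its speed; the function name `complete_linkage`, the dictionary input layout and the `cutoff` parameter are assumptions made for this example. Under complete linkage, two clusters are only as similar as their least similar pair of members, so a merge happens only when every cross-pair meets the cutoff.

```python
from itertools import combinations


def complete_linkage(items, similarity, cutoff):
    """Naive complete-linkage clustering on similarity scores.

    items      -- list of hashable, sortable item labels
    similarity -- dict mapping frozenset({a, b}) to a similarity score;
                  missing pairs are treated as similarity 0.0 (assumption)
    cutoff     -- stop merging once the best cluster pair scores below this
    """
    clusters = [frozenset([item]) for item in items]

    def linkage(c1, c2):
        # Complete linkage: use the weakest similarity across the two clusters
        return min(
            similarity.get(frozenset((a, b)), 0.0) for a in c1 for b in c2
        )

    while len(clusters) > 1:
        best_pair = max(
            combinations(clusters, 2), key=lambda pair: linkage(*pair)
        )
        if linkage(*best_pair) < cutoff:
            break
        # Replace the two merged clusters with their union
        clusters = [c for c in clusters if c not in best_pair]
        clusters.append(best_pair[0] | best_pair[1])
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    scores = {
        frozenset(("A", "B")): 0.9,
        frozenset(("C", "D")): 0.8,
        frozenset(("A", "C")): 0.7,
    }
    print(complete_linkage(["A", "B", "C", "D"], scores, cutoff=0.5))
```

In the usage example, A and C are similar, but the two resulting clusters are not merged because other cross-pairs (such as B and D) have no recorded similarity. This strictness is what distinguishes complete linkage from single linkage.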
 
Title cath-resolve-hits: a fast, effective way to collapse domain matches to a non-overlapping subset (i.e. domain architecture) 
Description cath-resolve-hits provides a fast, effective way to collapse a set of domain matches (e.g. from a typical protein sequence search) down to a non-overlapping subset or "domain architecture" assignment. Fast: Can process around 1-2 million input hits per second Powerful: - Finds the optimal result that maximises the sum of hits' scores - Handles discontinuous domains - Supports tolerance for overlaps between hits; auto-resolves any that occur Transparent: - Provides visualisation of input data and decisions via graphical HTML Simple: - Uses a simple default input file format - Also accepts HMMER domtblout files and hmmsearch output files - Accepts input that hasn't been pre-sorted or even pre-grouped (but can exploit that where specified) Configurable: - Allows users to determine their own scoring system to be maximised - Offers many easy-to-use options to configure the default behaviour Software Features: - written and tested in strict C++ - removed a number of local dependencies to allow the tool to be used by the wider community - source code released on GitHub under the GPLv3 license (as part of the cath-tools suite) - incorporated into a robust continuous integration (CI) build with tests and releases 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact The tool has been adopted by InterPro and the HMMER server (at EBI) as the standard method of resolving domain boundaries. 
URL http://cath-tools.readthedocs.io/en/latest/tools/cath-resolve-hits/
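The core idea of choosing the non-overlapping subset of hits that maximises the sum of scores can be sketched in Python as a weighted interval scheduling problem solved by dynamic programming. This is a simplified illustration, not the cath-resolve-hits algorithm or its input format: it assumes every hit is a single continuous segment, allows no overlap tolerance, and the function name `resolve_hits` and the tuple layout are hypothetical.

```python
from bisect import bisect_right


def resolve_hits(hits):
    """Pick the non-overlapping subset of hits with the highest total score.

    hits -- list of (label, start, stop, score) tuples with inclusive
            residue positions; each hit is one continuous segment
    Returns (best_total_score, chosen_hits_sorted_by_start).
    """
    hits = sorted(hits, key=lambda h: h[2])  # order by stop position
    stops = [h[2] for h in hits]
    # best[i] holds (score, chosen hits) using only the first i hits
    best = [(0.0, [])]
    for i, (label, start, stop, score) in enumerate(hits):
        # j = number of earlier hits that finish before this one starts
        j = bisect_right(stops, start - 1, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [hits[i]]))
        else:
            best.append(best[i])
    total, chosen = best[-1]
    return total, sorted(chosen, key=lambda h: h[1])


if __name__ == "__main__":
    example = [
        ("a", 1, 100, 50.0),
        ("b", 80, 200, 60.0),
        ("c", 101, 180, 30.0),
        ("d", 201, 300, 20.0),
    ]
    total, chosen = resolve_hits(example)
    print(total, [h[0] for h in chosen])
```

In the usage example, hit b has the highest individual score but overlaps both a and c, so the best architecture combines a, c and d instead. A greedy pick of the top-scoring hit would miss this, which is why an exact optimisation is useful.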
 
Title cath-ssap: rigorous protein structure comparison 
Description cath-ssap finds the optimal structural alignment between two protein structures, then uses this alignment to calculate a quantitative measure of the structural similarity. The program employs a highly sensitive double-dynamic algorithm that calculates and compares the local structural environment of residues. Since protein structure is more conserved than protein sequence during the process of evolution, these similarity scores provide a sensitive measure of remote homologies between distantly related proteins. - cath-ssap is a complete rewrite of the original SSAP algorithm of Taylor and Orengo (1989) - ported from C to strictly written and tested C++ - removed a number of local dependencies to allow the tool to be used by the wider community - source code released on GitHub under the GPLv3 license (as part of the cath-tools suite) - incorporated into a robust continuous integration (CI) build with tests and releases 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact The alignments and similarity scores provided by this algorithm provide fundamental evidence for the assignment of evolutionary relationships in the CATH database - a world leading resource for protein structural classification (the core scientific resource developed and maintained by our group). 
URL http://cath-tools.readthedocs.io/en/latest/tools/cath-ssap/
 
Title cath-superpose: flexible superpositions of protein structures 
Description cath-superpose provides the optimal structural superposition between two protein structures. When deciding on which residues to use for the superposition, the tool takes into account the structural environment of each residue. This focuses the superposition on the parts of the alignment that align well rather than on variable regions that can disrupt superpositions. In contrast with methods that simply attempt to minimise the RMSD, this approach can be used to build superpositions of hundreds of protein structures that clearly show the highly conserved ancient structural core within distantly related protein domain structures. - written and tested in strict C++ - removed a number of local dependencies to allow the tool to be used by the wider community - source code released on GitHub under the GPLv3 license (as part of the cath-tools suite) - incorporated into a robust continuous integration (CI) build with tests and releases 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact - used as a tool to superpose predicted structures from the Genome3D collaboration - used to provide superpositions of entire superfamilies for the CATH database (previously not possible) 
URL http://cath-tools.readthedocs.io/en/latest/tools/cath-superpose/
 
Title cathpy: a bioinformatics toolkit written in Python 
Description cathpy is a Bioinformatics toolkit written in Python. It is developed and maintained by the Orengo Group at UCL and is used for maintaining the CATH protein structure database (and associated research). 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact cathpy has made it far easier for the wider scientific community to engage with CATH data and CATH web services through a robust and well-tested software library. 
URL https://cathpy.readthedocs.io/en/latest
 
Description 14TH INTERNATIONAL SYMPOSIUM ON INTEGRATIVE BIOINFORMATICS 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact A talk at a conference on integrating heterogeneous data to create an in-depth insight into complex biological systems. This was held at Rothamsted in June 2018. Experts were brought together from the fields of bioinformatics, computational biology, computer science, systems biology and statistics. Christine Orengo gave a talk about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL https://www.rothamsted.ac.uk/events/14th-international-symposium-integrative-bioinformatics
 
Description 31st European Crystallography Meeting, Oviedo 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A talk at an international conference about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL https://ecm31.ecanews.org/en/welcome-to-oviedo.php
 
Description 3rd Student Conference on Mathematical Foundations in Bioinformatics 2018 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Talk to around 50 attendees at a student conference on mathematical foundations in bioinformatics. This was held at King's College London in August 2018.
Year(s) Of Engagement Activity 2018
URL https://nms.kcl.ac.uk/informatics/events/MatBio2018/
 
Description BioProNET Big Data and Computational Biology in Bioprocessing Workshop 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Talk at a workshop on computational biology and bioprocessing in June 2018.
Year(s) Of Engagement Activity 2018
URL http://biopronetuk.org/biopronet-funded-collaboration-building-workshops/
 
Description Bioinformatics and Computational Biology Conference 2018, Naples 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A talk at an international conference on computational analyses exploiting CATH-Gene3D and Genome3D data. Held in Naples, Italy in November 2018.
Year(s) Of Engagement Activity 2018
URL https://www.bbcc-meetings.it/
 
Description Cold Spring Harbor Asia conference on Frontiers in Computational Biology & Bioinformatics 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A talk at an international conference about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL https://www.csh-asia.org/2018meetings/COMP.html
 
Description ELIXIR 3D-BioInfo Launch Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at the ELIXIR 3D-BioInfo Launch Meeting in Basel, Switzerland in October 2018. The talk presented computational analyses exploiting CATH-Gene3D and Genome3D data. This meeting discussed the launch of a new ELIXIR community in structural bioinformatics.
Year(s) Of Engagement Activity 2018
URL https://swissmodel.expasy.org/25years/elixir
 
Description ELIXIR All Hands Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The fourth ELIXIR All Hands meeting brought together ELIXIR Node members and collaborators from partner organisations to review ELIXIR achievements and activities so far and discuss plans for the future. This meeting was held in Berlin in May 2018.

Christine Orengo gave a talk on CATH-Gene3D and Genome3D.
Year(s) Of Engagement Activity 2018
URL https://www.elixir-europe.org/events/elixir-all-hands-2018
 
Description EMBO Workshop on Pseudoenzymes 2018: From molecular mechanisms to cell biology 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Talk at EMBO Workshop on Pseudoenzymes 2018: From molecular mechanisms to cell biology. This was held in Sardinia, May 2018.
Year(s) Of Engagement Activity 2018
URL http://meetings.embo.org/event/18-pseudoenzymes
 
Description ISMB Chicago 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact ISMB is the world's largest bioinformatics/computational biology conference. It hosts hundreds of attendees from multiple disciplines to discuss the latest developments and applications of computational methods to solve biological problems. This conference was hosted in Chicago in July 2018. Christine Orengo gave a talk on computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL https://www.iscb.org/ismb2018
 
Description Online Tutorial on Introduction to CATH database 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This tutorial was designed as part of the EMBL-EBI training courses. It provides a quick introduction to the CATH database, allowing users to explore the structure and function of proteins. The course is freely available from https://www.ebi.ac.uk/training/online/courses/introduction-to-cath-database/.
Year(s) Of Engagement Activity 2021
URL https://www.ebi.ac.uk/training/online/courses/introduction-to-cath-database/
 
Description Prague Protein Spring 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at a conference to discuss the current progress and future directions of protein science. This was held in Prague in May 2018. Christine Orengo gave a talk about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity 2018
URL http://www.pragueproteinspring.cz
 
Description Structural Bioinformatics Workshop 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Talk at Structural Bioinformatics Workshop in Pune, India in March 2018.
Year(s) Of Engagement Activity 2018