Increasing the Coverage and Accuracy of CATH for Comparative Genomics and Variant Interpretation

Lead Research Organisation: European Bioinformatics Institute
Department Name: Protein Data Bank in Europe

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

The UCL PDRA will spend ~50% of their time on three tasks: maintaining CATH's computational platforms, ie the software, hardware, databases and web services required to process a constantly increasing amount of data; manually validating remote homologues and new folds; and developing programs to generate derived data for CATH-Plus (eg multiple structure alignments, 3D templates). The remaining time will be spent improving the accuracy of CATH data, improving web pages/APIs and building new features:
-Export DomChop Platform to EBI: modify CATH's DomChop platform to run with SCOP data and move to the EBI (in collaboration with PDBe). This will require removing/replacing all local dependencies (comprising scripts, databases, HPC and webservices).
-Expand FunFams: rework the agglomerative clustering algorithm to speed up clustering so that all domain relatives in superfamilies can be regularly clustered into FunFams. Several strategies will be explored, eg using fast, rough clustering (MMseqs2.0) to guide sequence cluster comparisons, improving the throughput of profile comparisons, improving the batching of HPC jobs and using predictions of likely cluster-merges. The faster method will enable FunFams of 'Enzyme Units', with new pipelines to identify domains contributing to enzyme active sites.
-Downloadable implementation of CATH-MDA-Annotate: develop a workflow providing external access to CATH tools and data, allowing users to annotate their own sequence datasets (eg full genome annotation). This will be in the form of low-dependency, open source software that is easy to download, install and run.
-Expand multiple structure alignments and site characterisation: build software for analysing multiple structure superpositions to identify conserved positions in the buried core or around known or predicted functional sites.
-Extend API for FunSite data: expand the existing FunFam API to include annotations (in Stockholm format) from structure analyses (eg conserved positions in ligand binding pockets).
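
The last two items (finding conserved positions in an alignment and exposing them as Stockholm-format annotations) can be illustrated with a minimal sketch. It is not the CATH or FunFam implementation: the entropy-based conservation score, the 0.9 threshold, the gap cut-off and the per-column feature name `conserved_site` are all assumptions chosen for illustration. Only the Stockholm framing (`# STOCKHOLM 1.0` header, `#=GC` per-column lines, `//` terminator) follows the published format.

```python
import math
from collections import Counter


def column_conservation(seqs, max_gap_fraction=0.5):
    """Score each alignment column in [0, 1] as 1 - normalised Shannon entropy.

    Gap characters ('-' and '.') are ignored when counting residues; columns
    with more than `max_gap_fraction` gaps score 0. (Illustrative scoring only.)
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log2(20)  # 20 standard amino acids
    scores = []
    for col in range(length):
        residues = [s[col] for s in seqs if s[col] not in "-."]
        gap_fraction = 1.0 - len(residues) / len(seqs)
        if not residues or gap_fraction > max_gap_fraction:
            scores.append(0.0)
            continue
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in Counter(residues).values()
        )
        scores.append(1.0 - entropy / max_entropy)
    return scores


def to_stockholm(names, seqs, threshold=0.9, feature="conserved_site"):
    """Render the alignment as a Stockholm block with one #=GC line.

    `feature` is a hypothetical annotation name; '*' marks columns whose
    conservation score is at or above `threshold`, '.' marks the rest.
    """
    scores = column_conservation(seqs)
    mask = "".join("*" if s >= threshold else "." for s in scores)
    tag = f"#=GC {feature}"
    width = max(len(tag), *(len(n) for n in names)) + 2
    lines = ["# STOCKHOLM 1.0"]
    lines += [f"{name:<{width}}{seq}" for name, seq in zip(names, seqs)]
    lines.append(f"{tag:<{width}}{mask}")
    lines.append("//")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy alignment with made-up domain identifiers.
    print(to_stockholm(["dom1", "dom2", "dom3"], ["GKSTLA", "GKTSLV", "GKS-LI"]))
```

In a structure-based setting the per-column score would more plausibly come from superposed coordinates (eg burial or distance to a bound ligand) rather than sequence entropy; the sketch only shows how such per-position results could be attached to an alignment as a `#=GC` line.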

Planned Impact

CATH is a world-leading resource for protein domains, unique in combining 3D structures with millions of sequences predicted to belong to CATH families and extensive functional information. We will improve the accuracy of the domain assignments and predicted functional sites, thereby increasing the value of CATH for basic biosciences and the agricultural and biomedical communities.
The CATH webpages and webservers are highly accessed, with 33,747 unique visitors per month and ~1.5 million hits per month (ie all files), measured using awstats, which is better than webalizer at distinguishing 'human' users from 'robots'. This is a more appropriate metric than Google Analytics, which uses very strict criteria for 'human' interaction and, more problematically, does not record API interactions at all. Over the last 6 months CATH has served an average of 1 million web pages/month to humans on web browsers. Taking all traffic into account (e.g. data downloads, API calls, web robots), CATH has served an average of 3.5 million pages/month. The average session duration is up by 10% and the pages per session are up by 5%, indicating that users are spending more time on the site and looking at more pages.

CATH web pages and scientific data are accessed from 179 different countries with the top ten being United States (16%), India (12%), United Kingdom (11%), China (11%), Germany (4%), Spain (3%), France (3%), Japan (3%), Italy (2%) and Canada (2%).

The original CATH paper has been cited 2653 times and all CATH papers together have been cited 7789 times.

CATH has been endorsed as an ELIXIR UK resource (only 5 UK data resources are endorsed) and is the only UK resource with ELIXIR Europe-wide 'Core Resource' status - only 14 resources have similar status across Europe. ELIXIR is a European initiative providing endorsement (but not funding) for computational resources supporting the biology community.
CATH also has impact in directly supplying data to the following resources, accessed by structural, experimental and computational biologists.
- CATH domain structure annotations are used by PDB and provided via PDBe and RCSB websites. PDBe has ~50,000 unique visitors/month.
- Partner in InterPro - Gene3D structural annotations are disseminated by InterPro, which has ~86,000 unique visitors/month from nearly every country in the world.
- Contributor to UniProt annotations, also widely accessed.
- Partner in Genome3D resource - an integrated resource of UK structural bioinformatics resources providing structural annotations and 3D models for key model organisms, including human, mouse and representatives from Pfam families. Web access to Genome3D is well distributed across Europe, Asia and the Americas.

The impact of CATH data on biology communities is reflected in the fact that since 2002 CATH has been a partner in 7 EU funded European Initiatives, 2 NIH funded consortia for structural genomics and 2 UK funded initiatives (eFamily (MRC) and London Pain Consortium (Wellcome Trust)). Current partnerships include the DDIP consortium for developmental fly interactome (BBSRC), Genome3D (BBSRC - structural annotations) and FunPDBe (BBSRC funded - functional site annotations). All these projects use CATH data and tools for structural and functional annotations.
Links to Industry: Nearly 20% of CATH's unique visitors per month are from commercial IP addresses. Pharmaceutical companies also use CATH tools for structure analysis (eg the CATH structure comparison tool has been purchased by Celltech, Pfizer India and Lilly). CATH was a founding resource of the UCL company Inpharmatica involved in predicting structures and functions for proteins via the 'Biopendium'. Inpharmatica was acquired by Galapagos in 2006.
Other evidence of impact is given by the range of support letters including letters from directors of major institutes and centres and companies undertaking drug design.
CATH has also been widely used to teach students about proteins.

Publications

Description A shared domain recognition platform for new domains is now established at the EMBL-European Bioinformatics Institute. Data from CATH and SCOP2 are integrated into the PDBe services. The preliminary assignments CATH-B are also integrated and displayed on PDBe web portals.
Exploitation Route The domain recognition platform will be used by the CATH team for future work. SCOP2 assignments will be extended based on the reference assignments carried out by the SCOP2 team and distributed via the FTP area.
Sectors Education, Healthcare, Pharmaceuticals and Medical Biotechnology

 
Title DomChop v2 
Description DomChop v2 is a platform to assist CATH during the domain chopping step of their protein structural classification workflow. It implements a pipeline to detect close and remote homologs, allows the assignment of potential domain boundaries based on the homologs, and contains a system to edit and view domain boundaries in real time. 
Type Of Technology Software 
Year Produced 2019 
Impact This software allows the CATH homology pipeline to be run against pre-release PDB data, which is unavailable to CATH but available to PDBe, reducing the time before homology data is available for domain chopping. It also improves the domain chopping experience by a significant margin, due to the incorporation of a macromolecular 3D viewer.
 
Description Transferring CATH Chopping program 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Presenting CATH progress to the CATH Scientific Advisory Board.
Year(s) Of Engagement Activity 2019