A Greatly Expanded CATH-Gene3D with Functional Fingerprints to Characterise Proteins

Lead Research Organisation: University College London

Department Name: Structural Molecular Biology

Abstract

There are millions of proteins being sequenced which have no known function. New CATH methods will predict their functions.

Whilst other resources do also predict function, CATH-Gene3D (referred to below as CATH) provides unique information on structurally conserved features linked to function. Structure data reveals how proteins perform their function and why the function changes if the protein is modified by mutations or other genetic variations. Protein function information is key to understanding biological systems and by extension drug design, protein engineering and disease.

CATH is a world-leading resource that classifies proteins that evolved from the same ancestral protein into evolutionary families. Currently, CATH classifies 15 million protein domains into 2600 families. Family data is valuable because evolutionary relatives (called homologues) tend to have similar 3D structures and perform similar functions. Thus the benefit of CATH is the ability to infer properties between homologues.

This is important because, of the millions of proteins currently known (>20 million), fewer than 5% have experimentally determined functions. Even in the organism of greatest interest to us, human, <10% of proteins have known functions. Because it can be slow and very expensive to characterise proteins, it will not be possible to experimentally study all these proteins. Therefore, biologists use CATH to predict the function of a protein based on the family to which it belongs.

Proteins are made up of 'domains' - on average two per protein. These are independently folded entities that act together to confer the function of the whole protein. CATH classifies proteins at the level of the domain and currently classifies ~70% of domains found in nature. Domains are the building blocks of proteins - a few thousand of them are combined in different ways to give the 20 million proteins, or more, in nature. Our group develops methods for predicting domain functions. This allows functions of whole proteins to be deduced from the functions of their constituent domains. Thus functions can be suggested for proteins made from any combination of domains.

CATH uses information on the 3D structure of the domain to give more accurate family classifications, as structure is more highly conserved, during evolution, than the sequence. Even more important - structure can reveal how the protein performs its function and whether the protein loses its function if a mutation occurs at a particular site.

We will expand CATH by 100%. Since manual validation is very time consuming, we will develop better methods for automatically recognising distant homologues. We will continuously release data (CATH-B), prior to manual curation, so that biologists can benefit from the information much sooner.

We will collaborate with the other major structure classification, SCOP, to develop common classification strategies and provide complementary information on families.

We will improve the accuracy of functional inheritance across a family. We need to do this because in some families, especially those occurring more frequently in nature, the functions can change in some relatives.

We will improve accuracy by characterising important positions in the domain, conserved across functionally similar relatives. We can build patterns of these positions to recognise other domains sharing such patterns and likely to have similar functions.

We will make it easy for biologists to use our web search tool to determine if a protein belongs to one of these functional families. We will set this up on the Cloud so that biologists can quickly search CATH with the massive datasets they obtain using new sequencing technologies. These technologies capture proteins expressed under different conditions. Our web pages will report their functions and variations in the protein which could modify function, causing disease.

Technical Summary

(1) CATHifier
We will build the CATHifier platform for classifying structures/sequences in CATH-Gene3D (referred to below as CATH). This will comprise a better homologue predictor exploiting more powerful sequence matching (meta-methods), structure matching (meta-methods) and function matching (text mining). We will improve the machine learning SVM combining this data.

CATHifier will comprise RESTful web services and Cloud based workflows. We will make CATHifier available via webservers for users' queries. The web services will also export CATH data to PDB and InterPro.

(2) More sensitive methods for functional classification
Our functional classification/prediction tool (FunFamer) will be improved eg by using MDA data, better detection of conserved residues, looking for 3D co-localisation of conserved residues, exploiting conserved 3D motifs.
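
As a rough illustration of what detecting conserved residues can involve, the sketch below scores each column of a multiple sequence alignment by Shannon entropy. This is a generic textbook measure and not the FunFamer implementation; the function names, the toy alignment and the 0.9 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column: 1.0 = fully conserved, lower = more variable.

    `alignment` is a list of equal-length aligned sequences, with '-' for gaps.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # 20 standard amino acids
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)  # gap-only column carries no signal
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


def conserved_positions(alignment, threshold=0.9):
    """Return 0-based column indices whose conservation score meets the threshold."""
    return [i for i, s in enumerate(column_conservation(alignment)) if s >= threshold]


# Toy alignment (invented): columns 0 and 2 are invariant.
toy_alignment = ["HDSG", "HESG", "HKSA"]
conserved = conserved_positions(toy_alignment)
```

In practice a threshold like this would need tuning per family, and sequence weighting would be needed to correct for redundant relatives; neither is attempted here.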

(3) FunFamer webserver for function prediction
Most proteins are not experimentally characterised and so function prediction is a major aim of the project. High Performance Compute (HPC) strategies will handle the vast datasets biologists are generating using next gen sequencing.

A recent BBSRC T&R pilot ported FunFamer to HPC facilities ie Amazon and UCL Legion (5500 nodes). We used infrastructure-on-demand services ie Amazon EC2 compute cloud and Amazon S3 storage service. Amazon virtual machines can be used in a cluster-like scenario via Sun SGE's 'cloud adapter' software or in parallel. We will improve scheduling for large datasets and explore using Hadoop and related strategies.

A major aim will be intuitive web pages displaying functions. No other resources identify structure-based functional families. We will show 3D structures highlighting functional sites conserved in both sequence and structure. We have begun this but more work is needed eg to make the site more intuitive, align query proteins against FunFams, display mutations close to functional sites or splice variations modifying function.

Planned Impact

We will maintain and develop a world leading resource for protein domain structure classification (CATH-Gene3D, henceforth referred to as CATH) which combines 3D structure data, tens of millions of sequences predicted to belong to CATH families and extensive information on protein functions. We will improve the purity of functional classification and thereby increase the value of the resource for both basic biosciences and also the agricultural and biomedical communities.

CATH already has a very well developed website and this will be extended to provide more detailed information on protein functions and in particular residue sites on the protein surface likely to be important for function. The new web pages will therefore inform protein design or rationalise the impacts of genetic variation eg in different plant or animal strains. For example a single residue mutation in the Rubisco protein, affecting allostery, can alter the catalytic efficiency of this enzyme in rice and promote survival in arid regions.

CATH is already widely used - the website now receives nearly 2 million web page accesses/month from ~61,000 unique visitors and the CATH papers are highly cited - the original CATH Structure paper has been cited 1986 times (all CATH publications have been cited 6413 times).

Communities in which CATH has an impact

Basic bioscience researchers: Evidenced by the fact that CATH is one of the 8 member groups of InterPro - a consortium of major protein family resources at the EBI. Several European networks of excellence (Biosapiens, EMBRACE, IMPACT, ENFIN) included the CATH group to provide structural/functional annotations for genome sequences.

Structural biologists: Evidenced by the fact that major protein structure repositories (PDB) link directly to CATH; a major structural genomics initiative (PSI) in the States selected CATH as the structural resource for target selection.

Biomedical Researchers: Evidenced by the fact that CATH is used to provide information on protein functions, protein networks and the impacts of SNPs for large consortia researching neuropathic pain (London Pain Consortium, Europain).

Other evidence of impact is given by the range of support letters including letters from directors of major institutes (eg RCSB, EBI), companies undertaking genome annotation (eg Synthetic Genomics) and users of the data.

Research fields in which CATH will have an Impact

Agricultural and Food security - Protein sequencing initiatives are providing increasing amounts of data for plants, crops, cattle and the bacteria that interact with these hosts and cause damage. The data and tools we will develop (eg information on conserved positions involved in function) will explain variations between strains and help identify suitable strains to improve yields, taste or colour or to cope with environmental conditions eg drought, pests and pathogens.

Protein design and biotech industries - modification of proteins in pathways can yield new sources of materials and energy (ie biofuels). New proteins can be designed to build synthetic pathways. The functional family (FunFam) data can be used to constrain conserved structural core positions in the protein and identify positions more tolerant to change and useful for new designs.

Health - Knowledge of structural details in the active sites of proteins and identification of conserved 3D features is valuable for drug design. Another major benefit will be the use of conservation data in FunFams to rationalise the impact of genetic variations (eg SNPs, spliced variations) on protein functions and disease susceptibility. This will inform both diagnostic strategies and drug design.

The FunFam server will characterise the functional repertoire of metagenomes from human cavities eg gut and thereby help explain the role of commensal bacteria in promoting health.

Other - CATH has been widely used to teach students about protein structure and evolution.

Funded Value:

£612,409

Funded Period:

Jan 14 - Dec 17

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/K020013/1

Principal Investigator:

Christine Orengo

Research Subject:

Omic sciences & technologies (40%)

Tools, technologies & methods (60%)

Research Topic:

Bioinformatics (60%)

Functional genomics (20%)

Genomics (20%)

Organisations

People	ORCID iD
Christine Orengo (Principal Investigator)
Gerard Kleywegt (Co-Investigator)
Alexey Murzin (Co-Investigator)

Publications

Sillitoe I (2015) CATH: comprehensive structural and functional annotations for genome sequences. in Nucleic acids research

Sillitoe I (2019) CATH: expanding the horizons of structure-based functional annotations for genome sequences. in Nucleic acids research

Das S (2018) Choosing the Best Enzyme Complex Structure Made Easy. in Structure (London, England : 1993)

Crisci MA (2021) Closely related Lak megaphages replicate in the microbiomes of diverse animals. in iScience

Littmann M (2021) Clustering FunFams using sequence embeddings improves EC purity. in Bioinformatics (Oxford, England)

Crisci MA (2022) Detection and enumeration of Lak megaphages in microbiome samples by endpoint and quantitative PCR. in STAR protocols

Das S (2015) Diversity in protein domain superfamilies. in Current opinion in genetics & development

Ribeiro AJM (2019) Emerging concepts in pseudoenzyme classification, evolution, and signaling. in Science signaling

Adeyelu TT (2022) Exploiting protein family and protein network data to identify novel drug targets for bladder cancer. in Oncotarget

Tyzack JD (2019) Exploring Enzyme Evolution from Changes in Sequence, Structure, and Function. in Methods in molecular biology (Clifton, N.J.)



Description	This project will maintain and develop the CATH-Gene3D classification of protein domains and improve the functional sub-classification of relatives in superfamilies. In the 2 years that this project has been running we have:
1) Developed a new algorithm for recognising homologous domains based on machine learning (SVM) that combines information on protein sequences and protein structures. This is capable of providing accurate automated assignments for an additional 60,000 structural domains in CATH, increasing the total coverage by over 15%.
2) Provided an improved method (FunFHMMER) for functional sub-classification - this groups together protein domains that have similar patterns of residue conservation indicative of common functional sites.
3) Independently validated the function prediction potential of FunFHMMER by submitting annotations to CAFA (an international function prediction assessment). This method was ranked 1st out of 129 methods for predicting molecular function and 2nd for predicting biological process (using the most stringent metric).
4) Completed two major releases of the CATH database. The latest version (v4.1) now contains over 300,000 structural domains (almost doubling the number of domains annotated before this grant started).
5) Made putative domain structure assignments available to the public (CATH-B), providing the very latest annotations for new structures in the PDB (updated daily).
6) Updated the predicted domain structures in CATH-Gene3D, which now cover 20,000 cellular genomes and over 43 million domain entries. This represents a two-fold increase since the project started.
7) Updated CATH-Gene3D data with new functional information and information on protein interactions from a range of public sources eg IRefIndex, GO, EC, REACTOME etc.
8) Used the new functional family groupings in CATH-Gene3D to create 3D models for human sequences (without known structure) that are significantly more accurate than models generated using more established methods (eg HHpred). This work has been accepted for publication by Acta Crystallographica, a journal widely read by structural biologists who use CATH.
9) Used these predicted structures to model protein-protein interactions for human proteins.
10) Expanded the structural models to include an extra model organism (Drosophila melanogaster).
11) Added splicing data into Gene3D, allowing structural and functional interpretations of alternative splicing.
12) Expanded our coverage of disease associated human mutations to the much larger set integrated by UniProt and included additional visualization tools in the website. For example it is now easier to see where the mutations are on the domains. Furthermore, for a number of binary protein interactions there is information on sub-regions of the proteins that participate in or influence the interaction. Many of these sub-regions overlap with domain regions in Gene3D. Using this data we can see which mutations are likely to disrupt protein interactions.
13) Provided new CATH-Gene3D web-pages to show the relationships between functional families within a superfamily.
14) Added new visualization tools in the Gene3D website.
15) Progressed the mapping between CATH and SCOP to facilitate more rapid annotation of sequences in InterPro and to improve the accuracy of both resources.
16) Increased, through the improved mapping protocols, the number of superfamilies deemed to be equivalent in the two resources from 19% to 39% (732 superfamilies).
17) Improved the structure comparison algorithm to allow domains to be superposed more accurately than previously possible - highlighting the highly-conserved ancient structural core across large and diverse groups of proteins.
18) Provided a major release of the DomainFinder algorithm (v4.0) which identifies the optimum domain assignments from a mixed set of sequence search results. This new CATH-Resolve-Hits algorithm is being adopted by the InterPro project.
19) Published these discoveries and developments in 14 journal articles and 4 textbook chapters.
Exploitation Route	CATH-Gene3D is a member database in the widely accessed InterPro resource hosted at the EBI, which has more than 5 million webpage accesses per month. It is one of only two resources (out of 11 in InterPro) providing structural annotations and is therefore important for providing consensus information on predicted structural regions in protein sequences. The mapping between CATH and SCOP will therefore be very useful for InterPro researchers integrating the information on structural annotations. CATH classification data is valuable for a number of general activities undertaken by biologists/biomedical researchers:
- assigning structures or functions to uncharacterised proteins (as mentioned above)
- providing information on conserved and variable structural regions in domain families. This data is important information in drug design (ie for designing compounds that bind to proteins in pathogenic organisms but not in human).
- the multiple alignment data in CATH is valuable for identifying highly conserved positions in a family and likely functional sites. This data is helpful in assessing the likely impacts of genetic variations, nsSNPs etc.
- the up-to-date and comprehensive structural libraries provided by CATH are valuable for searching against to find structural analogs for a query structure that may represent cross-hits for a drug designed to bind to the query.
- CATH-Gene3D is the only structure based resource providing functional families (FunFams currently ~100,000) which group together domain sequences likely to have highly similar structures and functions. We have established a web search tool that allows users to scan new sequences against a library of HMMs for these families in order to obtain predicted functional annotations for their protein.
The developments in our structure comparison algorithm, highlighting conservation across remote evolutionary relationships, are the basis of a collaboration with a large pharmaceutical company and this work will continue to be significant in the field of drug design.
Sectors	Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology
URL	http://www.cathdb.info
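
The DomainFinder / CATH-Resolve-Hits item in the description above refers to choosing the optimum set of domain assignments from a mixed set of sequence search results. One classic way to frame that kind of problem is weighted interval scheduling: pick the non-overlapping hits with the highest total score. The sketch below shows that textbook dynamic programme only. It is not the released CATH-Resolve-Hits code, it ignores discontinuous domains, and the `Hit` type, family labels and scores are invented for illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Hit(NamedTuple):
    family: str   # hypothetical family label
    start: int    # first residue covered, inclusive
    end: int      # last residue covered, inclusive
    score: float  # higher is better


def resolve_hits(hits: List[Hit]) -> List[Hit]:
    """Return the non-overlapping subset of hits with the highest total score."""
    ordered = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in ordered]

    best = [0.0] * (len(ordered) + 1)  # best[i]: best total using first i hits
    take = [False] * len(ordered)      # take[i]: is hit i used in best[i + 1]?
    prev = [0] * len(ordered)          # prev[i]: count of earlier compatible hits

    for i, hit in enumerate(ordered):
        # Earlier hits that end strictly before this hit starts.
        prev[i] = bisect_right(ends, hit.start - 1, 0, i)
        with_hit = best[prev[i]] + hit.score
        if with_hit > best[i]:
            best[i + 1] = with_hit
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Walk back through the table to recover the chosen hits.
    chosen = []
    i = len(ordered)
    while i > 0:
        if take[i - 1]:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


# Invented example: hit B overlaps both A and C, and loses to A + C combined.
example_hits = [
    Hit("A", 1, 100, 50.0),
    Hit("B", 90, 200, 60.0),
    Hit("C", 101, 180, 30.0),
    Hit("D", 201, 300, 40.0),
]
resolved = resolve_hits(example_hits)
```

Sorting by end position and using binary search keeps the sketch at O(n log n), which matters when a single protein collects many candidate matches.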


Description	The CATH-Gene3D classification is widely used by biologists and biomedical researchers to understand the structure and functions of query sequences. This is evidenced by the web access stats of nearly 2 million web page accesses per month from more than 10,000 unique visitors. It is widely used as a teaching tool to explain the principles of protein evolution and structure-function relationships. The CATH data has been used by the following organisations and consortia:
1. Midwest Structural Genomics Consortium, to analyse protein families and target structurally uncharacterised families with relevance to human health for structure determination
2. Centre for Structural Genomics in Disease, to target protein families implicated in virulence of pathogenic organisms for structure determination
3. The London Pain Consortium, to predict associations between protein families in order to understand the protein networks/signalling pathways involved in neuropathic pain
4. The Europain consortium, to provide information on protein families implicated in neuropathic pain
5. The Protein Data Bank, to provide information on domain structure families
CATH algorithms and data have also been widely used by researchers in industry: CATH was one of the four major UCL bioinformatics resources used to establish the UCL company Inpharmatica in 1998. This was involved in predicting structures and functions for proteins via the 'Biopendium'. Inpharmatica sold this and other related software packages to several large pharmaceutical companies including Pfizer, AstraZeneca and Glaxo-Wellcome. Inpharmatica was acquired by Galapagos in 2006. The main structure comparison algorithm developed by the CATH team (CATHEDRAL) has been distributed directly to pharmaceutical companies including UCB Celltech LB, Pfizer India, Cubist, DE Shaw, Signal Pharmaceuticals, Astellas, Adimab, Molecular Health and BioCrea.
For example, UCB has licensed CATHEDRAL and PDBsum and ~20 of their employees have directly used these resources. The Director of Computational Structural Biology stated "All these tools work together nicely to turn protein structural information into a more digestible form, which speeds up our work process, accelerates knowledge dissemination and facilitates more informed decision making for the research and development of both small molecule and antibody therapeutics. CATHEDRAL not only offers superior performance in this type of comparison, but also automatically specifies domain boundaries for a multi-domain query through an iterative search strategy. This unique feature has saved us hundreds of man-hours by eliminating the need for manual correction when structurally characterizing potential drug targets of multiple domains". Papers exploiting CATH data and published by Thornton and Orengo have been cited 13 times across 11 patent documents (assessed over the 2008 to 2014 period, ie the Research Excellence Framework (REF) period in the UK), indicating the commercial relevance of their work. The patents are filed across the USA, Europe and internationally through the PCT system and are assigned to GSK Ltd, Biogen Idec Inc. and Pharnext. The CEO of Acpharis has stated: "Protein structure data is core to our research and we rely on fold libraries and HMM data from CATH and related resources to answer the fundamental questions that we are addressing in designing drugs for novel targets, hopefully allowing design of more novel drugs that can better treat a variety of diseases. CATH provides a valuable service to the academic and commercial sectors and is a key resource for analyzing structures and collecting the information necessary for innovative drug design". Over the last year we have developed new AI / Machine Learning protocols to integrate AlphaFold 3D models into CATH superfamilies. This has nearly doubled the number of domains in CATH with 3D annotations.
Sector	Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Description	Structural Bioinformatics Consortium (ELIXIR)
Geographic Reach	Europe
Policy Influence Type	Influenced training of practitioners or researchers
Impact	We are part of a consortium of 17 research groups developing training material in structural bioinformatics. This work is being co-ordinated by the Genome3D consortium which is managed by Orengo. Each group within the consortium is developing their own training material relating to their particular research area. This material will be integrated via on-line workflows which are being developed as a part of the TeSS platform - an on-line training catalogue and training facility being organised by the ELIXIR UK node. The CATH-Gene3D training material was developed in 2013 for an ECCB workshop on Protein Structure to Function held in July 2013, organised by Christine Orengo, Nicholas Furnham and Romain Studer. Christine Orengo is also deputy lead of the Functional Effects domain in Structural Bioinformatics which is integrating tools and resources from the 17 structural bioinformatics research groups mentioned. This integrated resource will be used for the interpretation of genetic variations related to health and disease. Training material is also being developed in this context.


Title	CATH-Gene3D FunFams (FunFHMMer)
Description	Sequences in each superfamily in the CATH-Gene3D resource have been classified into functional families or FunFams by cutting a hierarchical clustering tree of superfamily sequence relatives. This was previously done by the DFX algorithm which used function annotation data from the Gene Ontology to sub-classify the superfamilies into FunFams. However, due to the paucity of GO terms and the annotation biases existing in the GO, a new approach for functionally classifying CATH superfamilies, FunFHMMer, was developed which exploits sequence patterns and is therefore unaffected by the limitations of GO. FunFHMMer determines an optimal cut of a hierarchical clustering tree of sequence relatives within a given superfamily by calculating a novel functional coherence index based on conserved positions and specificity-determining positions (SDPs) in sequence alignments. This results in the generation of more functionally coherent functional families or FunFams than our previous classification.
Type Of Material	Improvements to research infrastructure
Year Produced	2015
Provided To Others?	Yes
Impact	The CATH-Gene3D FunFams are able to provide functional annotations for nearly 16 million domain sequences in UniProtKB and Ensembl. CATH currently identifies 110,439 FunFams and, for the most populated of these (having high information content and accounting for 72% of CATH-Gene3D sequences), FunFams can be used to predict residues implicated in functional sites. CATH FunFams are also useful for analysing the variation in functions across a superfamily and, since functional sites can be identified for many FunFams, they allow a structurally informed analysis of the mechanisms of this divergence. Sequence profiles (HMMs) for highly informative FunFams are being supplied to InterPro for their metagenome portal, to provide functional annotations for bacterial sequences identified by the metagenome projects.
URL	http://www.cathdb.info
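
The FunFHMMer description above distinguishes positions conserved across a family from specificity-determining positions (SDPs). The sketch below shows one simple way to make that distinction for two candidate sub-groups of an alignment: a column counts as shared if both groups are dominated by the same residue, and as an SDP if each group is dominated by a different residue. It does not reproduce the functional coherence index, which the source does not define; the function names, the 0.8 dominance cut-off and the toy sequences are assumptions.

```python
from collections import Counter


def dominant_residue(column, min_fraction=0.8):
    """Return the residue making up at least `min_fraction` of a column, else None."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return None
    residue, count = Counter(residues).most_common(1)[0]
    return residue if count / len(residues) >= min_fraction else None


def classify_positions(cluster_a, cluster_b, min_fraction=0.8):
    """Split columns into (shared conserved positions, candidate SDPs).

    Both clusters must come from the same alignment, so columns line up.
    Columns that are variable within either cluster are ignored.
    """
    length = len(cluster_a[0])
    shared, sdps = [], []
    for col in range(length):
        res_a = dominant_residue([s[col] for s in cluster_a], min_fraction)
        res_b = dominant_residue([s[col] for s in cluster_b], min_fraction)
        if res_a is None or res_b is None:
            continue
        if res_a == res_b:
            shared.append(col)
        else:
            sdps.append(col)
    return shared, sdps


# Invented sub-groups: column 1 is D in one group and E in the other.
group_a = ["HDSG", "HDSA", "HDSG"]
group_b = ["HESG", "HESC", "HESG"]
shared_positions, sdp_positions = classify_positions(group_a, group_b)
```

Many candidate SDPs between two sub-trees would argue for keeping them as separate families, while few would argue for merging; how such evidence is weighed in FunFHMMer itself is not covered by this sketch.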


Title	CATH-Gene3D sequence search
Description	Within a superfamily, CATH-Gene3D classifies functional families (FunFams) that aim to group together domains that share the same function. Therefore, if a region of protein sequence provides a highly significant match to a particular CATH-Gene3D FunFam, then there is a good chance the protein shares a similar function to experimentally characterised proteins in the FunFam. Based on this assertion, the CATH-Gene3D sequence search tool provides function (GO and EC) annotations for query protein sequences based on the FunFam assignments.
Type Of Material	Improvements to research infrastructure
Year Produced	2015
Provided To Others?	Yes
Impact	The sequence search tool provides function annotations for query sequences and can be used by biologists via a CATH web server. The function prediction pipeline using the CATH-Gene3D sequence search tool was submitted to the Critical Assessment of protein Function Annotation (CAFA) 2 and our method was ranked among the top methods for accuracy of function prediction according to a number of different scoring methods.
URL	http://www.cathdb.info/search/by_sequence
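
The annotation transfer described for the sequence search tool can be pictured as: take the most significant FunFam match for a query, and if it passes a significance cut-off, report the annotations attached to that FunFam. The sketch below is a minimal version of that idea, assuming matches arrive as (FunFam id, E-value) pairs. The identifiers, annotation labels and the 1e-10 cut-off are placeholders and not values used by the CATH server.

```python
def predict_annotations(matches, funfam_annotations, max_evalue=1e-10):
    """Transfer annotations from the most significant FunFam match.

    matches: list of (funfam_id, evalue) pairs for one query region.
    funfam_annotations: dict mapping funfam_id to an iterable of annotations.
    Returns (funfam_id, set_of_annotations), or (None, empty set) if no match
    is significant enough to justify transferring a function.
    """
    significant = [m for m in matches if m[1] <= max_evalue]
    if not significant:
        return None, set()
    best_id, _ = min(significant, key=lambda m: m[1])
    return best_id, set(funfam_annotations.get(best_id, ()))


# Placeholder data: identifiers and annotation labels are invented.
annotations = {
    "funfam_1": ["annotation_A", "annotation_B"],
    "funfam_2": ["annotation_C"],
}
query_matches = [("funfam_2", 1e-12), ("funfam_1", 1e-40), ("funfam_3", 0.5)]
assigned_family, predicted = predict_annotations(query_matches, annotations)
```

Returning nothing when no match is significant is deliberate: a weak match gives little reason to believe the query shares the family's function.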


Title	CATHEDRAL
Description	Please note that this method is still being continuously developed and improved. CATHEDRAL is an automatic structure comparison method that can be used to search a library of protein structures to identify a structure with a matching fold. It can also be used to obtain an alignment between two protein structures and to multiply align a set of structures (CATHEDRAL-Multi) to identify the common structural core and equivalences between residues across the structures.
Type Of Material	Improvements to research infrastructure
Year Produced	2010
Provided To Others?	Yes
Impact	CATHEDRAL is being used by pharmaceutical companies with whom we collaborate to compare the ligand binding sites of protein structures and facilitate drug design.
URL	http://www.cathdb.info/search/by_structure


Title	CATH-Gene3D
Description	Please note that this research database is still being continuously developed and improved. CATH-Gene3D is a domain family classification. As of 2018, over 90 million protein domain sequences are classified into evolutionary superfamilies. Within these, relatives are further classed into groups in which relatives share very similar 3D-structures and functional properties. These groupings are described as functional families, or FunFams. The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.2, http://www.cathdb.info). The resource comprises over 450,000 domain structures and over 90 million protein domains classified into over 6000 homologous superfamilies. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 50,000 additional protein domains. Gene3D http://gene3d.biochem.ucl.ac.uk is a database of domain annotations of Ensembl and UniProtKB protein sequences. Domains are predicted using a library of profile HMMs representing over 6000 CATH superfamilies. The current Gene3D (v16) release has expanded its domain assignments to ~20 000 cellular genomes and over 90 million unique protein sequences, more than doubling the number of protein sequences since our last publication. Amongst other updates, we have improved our Functional Family annotation method. We have also improved the quality and coverage of our 3D homology modelling pipeline of predicted CATH domains.
Type Of Material	Database/Collection of data
Provided To Others?	Yes
Impact	CATH-Gene3D is widely used by biologists for teaching and research. There are ~1 million webpage accesses per month from ~9,000 unique visitors. CATH-Gene3D is a member database of InterPro, which receives more than 5 million web page accesses per month. It is also linked to from other major public sites including Pfam, PDB, PDBe.
URL	http://www.cathdb.info


Title	CATHe Dataset and Weights
Description	This dataset consists of the training, optimization, and testing sets used for developing the CATHe model, which is a deep learning framework capable of detecting extremely remote homologues (< 20% sequence identity) for CATH superfamilies. Additionally, the training weights for the artificial neural network present in the CATHe model have been provided.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
Impact	This work will enable the main research resource in our group (CATH protein structure classification database) to process the deluge of 3D prediction data arriving from the recent developments around AlphaFold.
URL	https://zenodo.org/record/6327572


Description	3D Beacons Network
Organisation	EMBL European Bioinformatics Institute (EMBL - EBI)
Country	United Kingdom
Sector	Academic/University
PI Contribution	Our research team has been part of the core development team responsible for the API schema, architecture and tools underpinning the 3D Beacons framework. Specifically, we have taken direct responsibility for delivering an example client implementation of the 3D beacon API with the intention that this will allow research groups with minimal coding skills and/or technical resources to be part of the 3D Beacons network.
Collaborator Contribution	The PDBe team have direct responsibility for delivering the 3D Beacons "Hub", which gathers information from all the nodes on the network. This also includes developing the front end public web pages that users can use to consume this data. The SWISS-MODEL team are responsible for the quality metrics that are used to normalise the predicted models from various methods.
Impact	Aims of the project: 1.) A major aim of the 3D-Gateway project (renamed as 3D Beacons) is to combine access to experimental and predicted structures to increase the coverage of structure data and structure-based functional annotations available for UniProt sequences from key model organisms linked to human health and agriculture. To increase the predicted structural data we will expand Genome3D and develop a mechanism (3D-Beacons network) for providing access to models from other model providers (i.e., SWISSMODEL, Rosetta, ModBase) in order to significantly increase coverage and reliability when assessing agreement across multiple models. 3D-Gateway will increase the amount of structural information available for UniProt sequences at least 10-fold based on Genome3D data and considerably more taking into account projected model acquisition from the other external resources. 2.) As well as increasing the structural information available for UniProt, the 3D-Gateway project will integrate structure-based functional annotations from the PDBe Knowledge base (PDBe-KB) with the predicted models. These annotations will also be used to build new UniRules - consensus rules used in the annotation of UniProt sequences, including the assignment of functional residues. This data on functional motifs on 3D structures/models will enable a significant expansion of annotations of automatically curated sequences in UniProt. 3.) Another goal is to make these structural data and added value annotations available to non-expert users by building web-pages for displaying the 3D structure models (both experimental and predicted) and added value annotations for UniProt sequences. We will ensure that the information is presented in a way that clearly demonstrates data provenance. Web-based teaching materials and workshops will help biologists to exploit the new data and understand their benefits and limitations.
This project has become more timely due to the improvement in accuracy in protein structural modelling advances in AlphaFold and the pending release of very large scale data (~100m structural models). 3D Beacons will ensure that structural analysis will become more central to Biological and Biomedical research. Outcomes to date: 1) Established specification of the API to establish communication protocols across the 3D Beacons Network 2) Established the 3D Beacons Hub to aggregate queries and responses across the Network 3) Developed an exemplar client implementation to enable groups to join the 3D Beacons Network 4) The Hub web pages have been launched and are available to the public 5) 3D Beacons has been adopted as a central activity of the ELIXIR 3DBioInfo structural bioinformatics community
Start Year	2019


Description	3D Beacons Network
Organisation	University of Basel
Department	Biozentrum Basel
Country	Switzerland
Sector	Academic/University
PI Contribution	Our research team has been part of the core development team responsible for the API schema, architecture and tools underpinning the 3D Beacons framework. Specifically, we have taken direct responsibility for delivering an example client implementation of the 3D beacon API with the intention that this will allow research groups with minimal coding skills and/or technical resources to be part of the 3D Beacons network.
Collaborator Contribution	The PDBe team have direct responsibility for delivering the 3D Beacons "Hub", which gathers information from all the nodes on the network. This also includes developing the front end public web pages that users can use to consume this data. The SWISS-MODEL team are responsible for the quality metrics that are used to normalise the predicted models from various methods.
Impact	Aims of the project: 1.) A major aim of the 3D-Gateway project (renamed as 3D Beacons) is to combine access to experimental and predicted structures to increase the coverage of structure data and structure-based functional annotations available for UniProt sequences from key model organisms linked to human health and agriculture. To increase the predicted structural data we will expand Genome3D and develop a mechanism (3D-Beacons network) for providing access to models from other model providers (i.e., SWISS-MODEL, Rosetta, ModBase) in order to significantly increase coverage and reliability when assessing agreement across multiple models. 3D-Gateway will increase the amount of structural information available for UniProt sequences at least 10-fold based on Genome3D data and considerably more taking into account projected model acquisition from the other external resources. 2.) As well as increasing the structural information available for UniProt, the 3D-Gateway project will integrate structure-based functional annotations from the PDBe Knowledge Base (PDBe-KB) with the predicted models. These annotations will also be used to build new UniRules - consensus rules used in the annotation of UniProt sequences, including the assignment of functional residues. These data on functional motifs on 3D structures/models will enable a significant expansion of annotations of automatically curated sequences in UniProt. 3.) Another goal is to make these structural data and added value annotations available to non-expert users by building web-pages for displaying the 3D structure models (both experimental and predicted) and added value annotations for UniProt sequences. We will ensure that the information is presented in a way that clearly demonstrates data provenance. Web-based teaching materials and workshops will help biologists to exploit the new data and understand their benefits and limitations. 
This project has become more timely owing to the improved accuracy of protein structure modelling achieved by AlphaFold and the pending release of very large-scale data (~100 million structural models). 3D Beacons will help structural analysis become more central to Biological and Biomedical research. Outcomes to date: 1) Established the API specification that defines communication protocols across the 3D Beacons Network 2) Established the 3D Beacons Hub to aggregate queries and responses across the Network 3) Developed an exemplar client implementation to enable groups to join the 3D Beacons Network 4) The Hub web pages have been launched and are available to the public 5) 3D Beacons has been adopted as a central activity of the ELIXIR 3DBioInfo structural bioinformatics community
Start Year	2019


Description	CSGID Structural Genomics Centre
Organisation	Northwestern University
Country	United States
Sector	Academic/University
PI Contribution	We analyse genome sequences to identify structurally uncharacterised protein families that are good drug targets, e.g. those associated with virulence in pathogenic organisms. We also provide a webserver/database for submission of community targets for structure determination.
Collaborator Contribution	They solve the structures of representative proteins from each family.
Impact	Outputs are publications and a website/database for submitting protein sequences targeted for structure determination. Multi-disciplinary: bioinformatics and structural biology.


Description	ELIXIR
Organisation	ELIXIR
Department	ELIXIR UK
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	We are part of the 3D-BioInfo ELIXIR Community in Structural Bioinformatics, which was established in January 2019 and is being coordinated by Christine Orengo. CATH-Gene3D contributes to two of the four major activities in 3D-BioInfo. Activity I relates to integration of functional sites in PDBe Knowledge Base (PDBe-KB). CATH Functional Families (FunFams) are being used to identify functional sites for domain families and this data is being integrated in PDBe-KB. Activity II relates to integration of tools and data associated with protein structure prediction. CATH functional families are being used to identify templates for homology modelling of structurally uncharacterised proteins. 3D models have been generated for 14 model organisms including human, mouse, rat, Arabidopsis, fly, yeast and E. coli. The 3D models are then integrated in the Genome3D resource, managed by Orengo. 3D-BioInfo Activity II involves integration of 3D models from Genome3D in PDBe-KB with links to UniProt. CATH-Gene3D recently received ELIXIR implementation study funding to collaborate with the SWISS-MODEL team in Switzerland to use the SWISS-MODEL pipeline together with template data from CATH functional families to build more accurate 3D models. We are planning to extend this activity to include more European partners through collaborations facilitated by 3D-BioInfo workshops. We are also part of an ELIXIR UK consortium of 17 research groups developing training material in structural bioinformatics. This work is being co-ordinated by the Genome3D consortium managed by Orengo. CATH-Gene3D training material was developed for an ECCB workshop on Protein Structure to Function held in July 2013, organised by Christine Orengo, Nicholas Furnham and Romain Studer. This material has been adapted for the ELIXIR training workflows. 
Christine Orengo is also deputy lead of the Functional Effects Domain in Structural Bioinformatics, which is integrating tools and resources from the 17 structural bioinformatics research groups mentioned above. The Domain is part of Genomics England and is headed by Ewan Birney. The aim is to establish an integrated resource that will be used for the interpretation of genetic variations related to health and disease. Training material is also being developed in this context. ELIXIR UK funding was allocated in March 2017 to develop training workflows for predicting the impacts of genetic variations. These workflows have now been developed and are accessible via the ELIXIR TeSS training website.
Collaborator Contribution	As regards the ELIXIR 3D-BioInfo collaborations, research groups from 15 European countries are involved in this collaboration. For the Activities that CATH-Gene3D contributes to, more than 10 groups are involved from 7 countries including the UK. All are contributing predicted functional site data to PDBe-KB. We all participate in workshops held at the EBI regularly to discuss ontologies and export/import mechanisms and APIs. As regards the ELIXIR UK training workflows, each group within the consortium is developing their own training material relating to their particular research area.
Impact	All predicted functional site data will be made available via the PDBe-KB. Predicted domain structure data will be made available through Genome3D and also through PDBe-KB once the exchange mechanisms for that have been completed. All training material will be integrated via on-line workflows which are being developed as a part of the TeSS platform - an on-line training catalogue and training facility being organised by the ELIXIR UK node.
Start Year	2013


Description	InterPro
Organisation	EMBL European Bioinformatics Institute (EMBL - EBI)
Country	United Kingdom
Sector	Academic/University
PI Contribution	InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. It combines protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. Our research team has provided the following contributions to the InterPro resource: - Structural annotations from CATH - Structural annotations from Genome3D - Mapping between the CATH and SCOP protein structure classifications. CATH-Gene3D provides domain family HMMs and structure annotations to InterPro on a regular basis. We have recently provided a new tool, cath-resolve-hits, for generating accurate multi-domain architecture information from sequence matches to the CATH domain HMM libraries. We currently have BBSRC BBR funding to extend the mapping between SCOP and CATH, integrate Genome3D annotations in InterPro for selected model organisms, and provide a 3D viewer for the structural annotations.
Collaborator Contribution	Annotations from other sources, manual curations, central database and web site.
Impact	Publications Community resource to further biological research.
Start Year	2007


Description	Metagenomics collaboration with Ward and Hailes Group at UCL
Organisation	University College London
Department	Biochemical Engineering
Country	United Kingdom
Sector	Academic/University
PI Contribution	We are providing bioinformatics advice for analysing metagenome samples taken from a range of different environments including Arctic meadows, hospital drains etc. In particular we are scanning sequence fragments and contigs assembled from the metagenome samples against our libraries of sequence profiles (HMMs) for functional families (FunFams) in our CATH-Gene3D database of protein domain superfamilies. Matches can be used to identify the putative functions of enzymes in the sample and whether the enzymes are likely to have modified activity or specificity. We are currently applying our FunFam protocol to analyse metagenome samples from the MGnify resource at EBI, searching in particular for novel PETase enzymes. These will be tested experimentally by the Ward group and other collaborators in Cambridge. Our CATH-FunFam HMM library is being developed by a BBSRC BBR funded project on CATH-Gene3D.
Collaborator Contribution	The groups of John Ward and Helen Hailes at UCL are performing the molecular biology and chemistry to experimentally validate predictions of enzyme functions.
Impact	Multi-disciplinary collaboration: Ward (experimental molecular biology and chemistry), Hailes (experimental chemistry), Orengo (bioinformatics). Two joint publications to date.
Start Year	2016


Description	PDBe
Organisation	EMBL European Bioinformatics Institute (EMBL - EBI)
Country	United Kingdom
Sector	Academic/University
PI Contribution	Our resource CATH provides high quality annotations to improve the quality of the information provided by the PDBe, primarily the location of structural domains and the identification of distant evolutionary relationships between known protein structures. Our Gene3D resource provides structural annotations for genome sequences from ~20,000 species. These annotations are also incorporated in the Genome3D resource for selected model organisms. Collaborations between research groups involved in the Genome3D project have resulted in a high quality mapping between the CATH and SCOP structural classification databases. This is being implemented by the PDBe to improve the clarity and coverage of structural annotations in their resource. We currently have a BBSRC BBR funded collaboration with PDBe and InterPro to provide our CATH-Gene3D structural annotations to these resources, via the Genome3D portal.
Collaborator Contribution	Host, maintain and curate the central PDBe resource and website.
Impact	Publications Community resources to further scientific research.
Start Year	2006


Description	Partner in the NIH-Funded Midwest Centre for Structural Genomics
Organisation	Argonne National Laboratory
Country	United States
Sector	Public
PI Contribution	We analysed completed genomes to identify protein families which had no structural characterisation
Collaborator Contribution	Our partners determined the structures of representatives from these families
Impact	multi-disciplinary - bioinformatics and structural biology


Title	CATHEDRAL structure comparison algorithm
Description	CATHEDRAL was developed from SSAP, a double dynamic algorithm used for structure comparison. SSAP was modified to perform fast database searches, and the resulting method is nearly 1000 times faster than SSAP. There is a publicly available web server for CATHEDRAL on our in-house CATH-Gene3D database website.
IP Reference
Protection	Copyrighted (e.g. software)
Year Protection Granted	2010
Licensed	Yes
Impact	CATHEDRAL has been used for several in-house analyses of protein superfamilies leading to several publications. It has also been used in collaborations with other research groups both at UCL and externally to analyse protein superfamilies and these have also led to several publications.


Title	SSAP- structure comparison program
Description	Algorithm for aligning protein structures. It exploits a double dynamic algorithm to handle insertions and deletions and so can be used to align very distantly related homologues as well as close homologues. It has been used to identify the structural relationships on which the CATH classification was based.
Type Of Technology	Software
Impact	This software is licensed by UCLi and has been sold to several companies, including CellTech and Pfizer India.


Title	cath-cluster: A simple way to complete-linkage cluster arbitrary data
Description	The software provides a fast implementation of complete linkage clustering that allows arbitrary data to be clustered into groups according to similarity scores.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	The software has increased visibility of our group within the bioinformatics community.
URL	http://cath-tools.readthedocs.io/en/latest/tools/cath-cluster
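
As an illustration of the complete-linkage idea described in this entry, the following is a minimal Python sketch and not the cath-cluster implementation. The function name, the dictionary of pairwise similarity scores and the cutoff are all hypothetical, and pairs without a score are assumed to have similarity 0. Two clusters are merged only when even their least similar pair of members scores at or above the cutoff.

```python
from itertools import combinations


def complete_linkage(n_items, similarity, cutoff):
    """Cluster items 0..n_items-1 by complete linkage on similarity scores.

    similarity maps frozenset({i, j}) to a score (higher = more similar);
    missing pairs are treated as 0.0 (an assumption of this sketch).
    """
    clusters = [{i} for i in range(n_items)]

    def link(a, b):
        # Complete linkage: a cluster pair is only as similar as its worst member pair.
        return min(similarity.get(frozenset((i, j)), 0.0) for i in a for j in b)

    while len(clusters) > 1:
        best_score, x, y = max(
            (link(clusters[x], clusters[y]), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        if best_score < cutoff:
            break
        clusters[x] |= clusters[y]  # x < y, so deleting y keeps index x valid
        del clusters[y]
    return sorted(sorted(c) for c in clusters)


# Hypothetical scores: 0-1 and 2-3 are close; 1-2 is fairly close but 0-2 is not.
scores = {
    frozenset((0, 1)): 0.9,
    frozenset((1, 2)): 0.8,
    frozenset((0, 2)): 0.3,
    frozenset((2, 3)): 0.7,
}
result = complete_linkage(4, scores, cutoff=0.5)
print(result)
```

With these made-up scores, items 0 and 1 form one group and items 2 and 3 another; the two groups are not merged because the 0-2 and 0-3 pairs fall below the cutoff, whereas single-linkage clustering would chain all four together through the 1-2 score.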


Title	cath-resolve-hits: a fast, effective way to collapse domain matches to a non-overlapping subset (i.e. domain architecture)
Description	cath-resolve-hits provides a fast, effective way to collapse a set of domain matches (e.g. from a typical protein sequence search) down to a non-overlapping subset or "domain architecture" assignment. Fast: Can process around 1-2 million input hits per second Powerful: - Finds the optimal result that maximises the sum of hits' scores - Handles discontinuous domains - Supports tolerance for overlaps between hits; auto-resolves any that occur Transparent: - Provides visualisation of input data and decisions via graphical HTML Simple: - Uses a simple default input file format - Also accepts HMMER domtblout files and hmmsearch output files - Accepts input that hasn't been pre-sorted or even pre-grouped (but can exploit that where specified) Configurable: - Allows users to determine their own scoring system to be maximised - Offers many easy-to-use options to configure the default behaviour Software Features: - written and tested in strict C++ - removed a number of local dependencies to allow the tool to be used by the wider community - source code released on GitHub under the GPLv3 license (as part of the cath-tools suite) - incorporated into a robust continuous integration (CI) build with tests and releases
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	The tool has been adopted by InterPro and the HMMER server (at EBI) as the standard method of resolving domain boundaries.
URL	http://cath-tools.readthedocs.io/en/latest/tools/cath-resolve-hits/
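
The statement that the tool "finds the optimal result that maximises the sum of hits' scores" can be illustrated with a classic weighted interval scheduling sketch in Python. This is an assumption-laden simplification and not the cath-resolve-hits code: the function name and the `(start, stop, score)` tuple layout are hypothetical, each hit is a single continuous segment with inclusive residue positions, and no overlap tolerance is applied.

```python
from bisect import bisect_left


def resolve_hits(hits):
    """Return (best_total, chosen_hits) for non-overlapping hits of maximal total score.

    hits is a list of (start, stop, score) with inclusive residue positions.
    """
    hits = sorted(hits, key=lambda h: h[1])  # order by stop position
    stops = [h[1] for h in hits]
    # best[i] holds the optimal (total, chosen) using only the first i sorted hits.
    best = [(0.0, [])]
    for i, (start, stop, score) in enumerate(hits):
        # k = number of earlier hits that end strictly before this hit starts.
        k = bisect_left(stops, start, 0, i)
        take_total = best[k][0] + score
        if take_total > best[i][0]:
            best.append((take_total, best[k][1] + [(start, stop, score)]))
        else:
            best.append(best[i])
    return best[-1]


# Hypothetical hits on one sequence: a long weak match competes with shorter ones.
example_hits = [
    (1, 100, 50.0),
    (90, 200, 60.0),
    (201, 300, 40.0),
    (1, 300, 90.0),
]
total, chosen = resolve_hits(example_hits)
print(total, chosen)
```

In this made-up example, picking the single highest-scoring hit first would give a total of 90, while the dynamic programme finds the pair of compatible hits that sum to 100.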


Title	cath-ssap: rigorous protein structure comparison
Description	cath-ssap finds the optimal structural alignment between two protein structures, then uses this alignment to calculate a quantitative measure of the structural similarity. The program employs a highly sensitive double-dynamic algorithm that calculates and compares the local structural environment of residues. Since protein structure is more conserved than protein sequence during the process of evolution, these similarity scores provide a sensitive measure of remote homologies between distantly related proteins. - cath-ssap is a complete rewrite of the original SSAP algorithm of Taylor and Orengo (1989) - ported from C to strictly written and tested C++ - removed a number of local dependencies to allow the tool to be used by the wider community - source code released on GitHub under the GPLv3 license (as part of the cath-tools suite) - incorporated into a robust continuous integration (CI) build with tests and releases
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	The alignments and similarity scores provided by this algorithm provide fundamental evidence for the assignment of evolutionary relationships in the CATH database - a world leading resource for protein structural classification (the core scientific resource developed and maintained by our group).
URL	http://cath-tools.readthedocs.io/en/latest/tools/cath-ssap/


Title	cath-superpose: flexible superpositions of protein structures
Description	cath-superpose provides the optimal structural superposition between two protein structures. When deciding on which residues to use for the superposition, the tool takes into account the structural environment of each residue. This focuses the superposition on the parts of the alignment that align well rather than variable regions that can disrupt superpositions. In contrast with methods that simply attempt to minimise the RMSD, this approach can be used to build superpositions of hundreds of protein structures that clearly show the highly conserved ancient structural core within distantly related protein domain structures. - written and tested in strict C++ - removed a number of local dependencies to allow the tool to be used by the wider community - source code released on GitHub under the GPLv3 license (as part of the cath-tools suite) - incorporated into a robust continuous integration (CI) build with tests and releases
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	- used as a tool to superpose predicted structures from the Genome3D collaboration - used to provide superpositions of entire superfamilies for the CATH database (previously not possible)
URL	http://cath-tools.readthedocs.io/en/latest/tools/cath-superpose/


Description	14TH INTERNATIONAL SYMPOSIUM ON INTEGRATIVE BIOINFORMATICS
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	A talk at a conference on integrating heterogeneous data to create an in-depth insight into complex biological systems. This was held at Rothamsted in June 2018. Experts were brought together from the fields of bioinformatics, computational biology, computer science, systems biology, and statistics. Christine Orengo gave a talk about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity	2018
URL	https://www.rothamsted.ac.uk/events/14th-international-symposium-integrative-bioinformatics


Description	31st European Crystallography Meeting, Oviedo
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	A talk at an international conference about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity	2018
URL	https://ecm31.ecanews.org/en/welcome-to-oviedo.php


Description	3DSIG-ISMB2020 - 'SARS-CoV-2 spike protein predicted to bind strongly to host receptor protein orthologues from mammals, but not fish, birds or reptiles' - Su Datt Lam & Paul Ashford - 16/07/20
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Intelligent Systems for Molecular Biology (ISMB) is the main conference in the field of Bioinformatics. In 2020 it was held virtually. 3DSIG is a structural biology focussed track within the conference. Su Datt Lam and Paul Ashford (group members) presented our work on animal infection susceptibility by SARS-CoV-2. The presentation won the First Prize for Best Talk in 3DSIG.
Year(s) Of Engagement Activity	2020
URL	https://www.iscb.org/ismb2020-general/ismb2020-award-winners#3dsig-talk


Description	3rd Student Conference on Mathematical Foundations in Bioinformatics 2018
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Postgraduate students
Results and Impact	Talk to around 50 attendees at a student conference on mathematical foundations in bioinformatics. This was held at King's College London in August 2018.
Year(s) Of Engagement Activity	2018
URL	https://nms.kcl.ac.uk/informatics/events/MatBio2018/


Description	BioProNET Big Data and Computational Biology in Bioprocessing Workshop
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Talk at a workshop on computational biology and bioprocessing in June 2018.
Year(s) Of Engagement Activity	2018
URL	http://biopronetuk.org/biopronet-funded-collaboration-building-workshops/


Description	Biochemical Society Workshop on Exploiting Protein Structure to Determine the Effects of Genetic Variation
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Workshop on protein structure prediction and analysis in the context of analysing the impacts of genetic variations in proteins. This was organised by Christine Orengo and Michael Sternberg and held at Darwin House, London. Participants were introduced to concepts underpinning the analysis and prediction programs in CATH-Gene3D, Genome3D, PHYRE and other resources.
Year(s) Of Engagement Activity	2016


Description	Bioinformatics and Computational Biology Conference 2018, Naples
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	A talk at an international conference on computational analyses exploiting CATH-Gene3D and Genome3D data. Held in Naples, Italy in November 2018.
Year(s) Of Engagement Activity	2018
URL	https://www.bbcc-meetings.it/


Description	Bioinformatics talk in UCL Healthcare Careers Day
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	Around 100 school students attended the Bioinformatics STEM talk at the Medicine and Healthcare Careers Day at UCL. The talk introduced bioinformatics to school students and was accompanied by a hands-on practical session on structural bioinformatics that showcased the CATH database.
Year(s) Of Engagement Activity	2017


Description	Bristol Biodesign Keynote - 01/07/20
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	A one-day international symposium in synthetic biology and biodesign. Invited keynote
Year(s) Of Engagement Activity	2020
URL	https://www.bristol.ac.uk/biodesign-institute/events/2020/bristol-biodesign-2020.html


Description	Cold Spring Harbor Asia conference on Frontiers in Computational Biology & Bioinformatics
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	A talk at an international conference about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity	2018
URL	https://www.csh-asia.org/2018meetings/COMP.html


Description	Computational Biology conference (The Netherlands)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Hundreds of people from computational biology and other life science backgrounds attended the European Conference on Computational Biology in 2016 in The Hague, The Netherlands. This poster was presented during the poster sessions at this conference and was available for attendees to view throughout the conference. During the presentation of the poster, discussions were held on the topics of analysing disease-causing mutation data with CATH-Gene3D, and the CATH-Gene3D functional families.
Year(s) Of Engagement Activity	2016
URL	https://f1000research.com/posters/5-2167


Description	Computational Biology conference in July 2017 (Prague, Czech Republic)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Intelligent Systems for Molecular Biology (ISMB) is an annual academic conference on the subjects of bioinformatics and computational biology organised by the International Society for Computational Biology (ISCB). In July 2017, ISMB/ECCB was held in Prague. The principal focus of the conference is on the development and application of advanced computational methods for biological problems. Talks and posters were presented during various sessions at this conference. Christine Orengo gave a talk on computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity	2017
URL	https://www.iscb.org/ismbeccb2017


Description	EBI Structural Bioinformatics Course 2021
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Around 20 students attended a week long course hosted by the EBI that covered a number of topics in the area of Structural Bioinformatics. Our research group was invited to provide the lecture and tutorial on protein structure comparison and classification resources.
Year(s) Of Engagement Activity	2021
URL	https://www.ebi.ac.uk/training/events/structural-bioinformatics2021/


Description	ELIXIR 3D-BioInfo Launch Meeting
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Talk at the ELIXIR 3D-BioInfo Launch Meeting in Basel, Switzerland in October 2018. The talk presented computational analyses exploiting CATH-Gene3D and Genome3D data. This meeting discussed the launch of a new ELIXIR community in structural bioinformatics.
Year(s) Of Engagement Activity	2018
URL	https://swissmodel.expasy.org/25years/elixir


Description	ELIXIR All Hands Meeting
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The fourth ELIXIR All Hands meeting brought together ELIXIR Node members and collaborators from partner organisations to review ELIXIR achievements and activities so far and discuss plans for the future. This meeting was held in Berlin in May 2018. Christine Orengo gave a talk on CATH-Gene3D and Genome3D.
Year(s) Of Engagement Activity	2018
URL	https://www.elixir-europe.org/events/elixir-all-hands-2018


Description	EMBL-EBI Structural Bioinformatics Course - 'Protein Structure Classification' - Ian Sillitoe - 24/11/20
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Ian Sillitoe (Group Research Manager) held a Structural Bioinformatics course at the European Bioinformatics Institute (part of the European Molecular Biology Laboratory) on Protein Structure Classification.
Year(s) Of Engagement Activity	2020


Description	EMBO Workshop on Pseudoenzymes 2018: From molecular mechanisms to cell biology
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Talk at EMBO Workshop on Pseudoenzymes 2018: From molecular mechanisms to cell biology. This was held in Sardinia, May 2018.
Year(s) Of Engagement Activity	2018
URL	http://meetings.embo.org/event/18-pseudoenzymes


Description	Function COSI - CAFA4 - Exploiting Evolutionary Functional Family Signatures for Protein Function Prediction - Christine Orengo - 14/07/20
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Intelligent Systems for Molecular Biology (ISMB) is the main conference in the field of Bioinformatics. In 2020 it was held virtually. Function COSI is a protein function prediction focussed track within the conference and the venue for announcing the preliminary results of the Critical Assessment of Functional Annotation (CAFA). Christine Orengo was invited to present our methods for protein function prediction as they ranked among the top 3 in the preliminary assessment.
Year(s) Of Engagement Activity	2020
URL	https://www.iscb.org/cms_addon/conferences/ismb2020/tracks/functioncosi


Description	Function COSI - ISMB2020 - Pruning the protein jungle: recent developments in the CATH-Gardener function analysis and prediction pipeline - N Bordin - 13/07/20
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Intelligent Systems for Molecular Biology (ISMB) is the main conference in the field of Bioinformatics. In 2020 it was held virtually. Function COSI is a protein function prediction focussed track within the conference. Nicola Bordin (group member) presented his work on the new developments for our internal protein function prediction pipeline and its application to various datasets, including protein kinases, plastic-degrading enzymes and CATH-Gene3D v4.3
Year(s) Of Engagement Activity	2020
URL	https://www.iscb.org/cms_addon/conferences/ismb2020/tracks/functioncosi


Description	Hosted an in2scienceUK student
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	A Year 12 student was hosted in the lab for 2 weeks during the summer of 2017. This was organised by the charity in2scienceUK. The student learnt about protein structure and function and contributed to protein domain chopping, which is of great value to the CATH database.
Year(s) Of Engagement Activity	2017


Description	ISMB Chicago
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	ISMB is the world's largest bioinformatics/computational biology conference. It hosts hundreds of attendees from multiple disciplines to discuss the latest developments and applications of computational methods to solve biological problems. This conference was hosted in Chicago in July 2018. Christine Orengo gave a talk on computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity	2018
URL	https://www.iscb.org/ismb2018


Description	Interview on our SARS-CoV-2 research on Cuatro (Spanish National TV)
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	Dr Nicola Bordin, a member of our research group, was asked to appear in the news section of a Spanish national TV channel to talk about our recently published research on the transmission of SARS-CoV-2 in different animals.
Year(s) Of Engagement Activity	2020
URL	https://twitter.com/nicolabordin/status/1314229961839063041


Description	Microbiology@UCL Annual International Conference
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Annual conference organised by the Microbiology Department at UCL. Invited speaker; talk titled "Computational Strategies for Exploring the Possible Host Range of SARS-CoV-2".
Year(s) Of Engagement Activity	2020
URL	https://www.ucl.ac.uk/research/domains/sites/research_domains/files/microbiologyucl_virtual_symposiu...


Description	Prague Protein Spring
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Talk at a conference to discuss the current progress and future directions of protein science. This was held in Prague in May 2018. Christine Orengo gave a talk about computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity	2018
URL	http://www.pragueproteinspring.cz


Description	Primary School Visit (Warren Road Primary)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Schools
Results and Impact	Invited to give a 1 hour lesson on "DNA, Proteins and Minecraft" for 12-15 year 6 students (expected to reach L6 Science) at a National Lead Outstanding Primary School (Warren Road, Orpington). Learning objectives included: understanding what DNA/proteins are made of and why they are important; the basic process of evolution; introduction to how enzymes work. The school went on to achieve the Gold Primary Science Quality Mark, with this lesson mentioned in the award. "The session was absolutely fabulous. I learnt so much! The children loved it." - Tamara Fletcher (Deputy Head and Head of Science)
Year(s) Of Engagement Activity	2014,2015,2017


Description	Public talks and workshops
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	We have given several talks on CATH-Gene3D and Genome3D in local schools in London. We also participated in a Wellcome Trust funded workshop on chronic pain, at which we gave a talk and demonstration of how CATH-Gene3D and Genome3D data were being used to provide structural and functional information on genes involved in chronic pain. The schools reported that our talks had generated a lot of interest in proteins and structural biology and that several students had decided to seek further information on undergraduate courses with study modules on computational biology. Our talks include images of protein structures, which help to convey intuitively the mechanisms by which proteins function. The Wellcome workshop on chronic pain was very well received with excellent responses to the feedback questionnai
Year(s) Of Engagement Activity	2009,2011,2012,2014,2015,2016


Description	Regional Student Group Argentina 10th Anniversary - 13/10/20
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Postgraduate students
Results and Impact	Regional Student Groups are an essential part of the effort of the International Society for Computational Biology (ISCB) to involve students at all levels in bioinformatics and computational biology. Invited speaker for the 10th anniversary of the RSG Argentina.
Year(s) Of Engagement Activity	2020


Description	School Visit (Isleworth Primary School)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Schools
Results and Impact	Invited to give a 1 hour interactive presentation on "DNA, Proteins and Minecraft" for 90 year 6 students as part of a Science week at Isleworth Town Primary School, London. Learning objectives included: understanding what DNA/proteins are made of and why they are important; how DNA replication works; the basic process of evolution; introduction to how enzymes work.
Year(s) Of Engagement Activity	2016


Description	School Visit (Warren Road Primary School)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Schools
Results and Impact	Invited to give a 1 hour lesson on "DNA, Proteins and Minecraft" for 12-15 year 6 students (expected to reach L6 Science) at a National Lead Outstanding Primary School (Warren Road, Orpington). Learning objectives included: understanding what DNA/proteins are made of and why they are important; the basic process of evolution; introduction to how enzymes work.
Year(s) Of Engagement Activity	2017


Description	Spanish Television interview on SARS-CoV-2 susceptibility in animals - Nicola Bordin - 8/10/20
Form Of Engagement Activity	A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	Nicola Bordin (postdoc with the group) briefly presented our work on SARS-CoV-2 susceptibility in animals during an interview for Spanish national television (Cuatro).
Year(s) Of Engagement Activity	2020
URL	https://twitter.com/todoesmentiratv/status/1314227524805832704?lang=en-gb


Description	Structural Bioinformatics Workshop
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Talk at Structural Bioinformatics Workshop in Pune, India in March 2018.
Year(s) Of Engagement Activity	2018


Description	Web molecular graphics Shonan Meeting (Japan)
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Invited to take part in a Shonan Meeting as part of an international panel to discuss the current state-of-the-art and future directions in the field of Web Molecular Graphics. The meeting included a presentation on CATH and many focused discussion groups. The Shonan meetings aim to promote informatics and informatics research at an international level by providing a venue for world-class scientists, promising young researchers, and practitioners in Asia. The workshop has resulted in a number of collaborations and sparked discussions on future standards.
Year(s) Of Engagement Activity	2016
URL	http://shonan.nii.ac.jp/seminar/086/