A FAIR community resource for pathogens, hosts and their interactions to enhance global food security and human health

Lead Research Organisation: European Bioinformatics Institute
Department Name: Genome Assembly and Annotation

Technical Summary

PHI-base is the phenotype data source provider. We will continue to curate the literature for ~200 pathogenic species and include emerging problematic species. New advanced curation will include (a) first host plant targets of pathogen effectors, (b) anti-infective targets and variant sequences causing chemical insensitivity, and (c) ~8 specific genome landscape features. We will further develop the multi-species PHI-Canto tool to enable rapid, accurate and comprehensive publication-based author curation. PHI-base data will be made available in emerging data exchange formats (e.g. Phenopackets) to increase interoperability and use. The new PHIPO ontologies that underpin this curation will be built using Protégé, adhering to the strict ontology development principles outlined by the OBO Foundry.

The PHI-phenotype information will be mapped onto microbial genes in Ensembl Genomes, an established platform combining a relational database back-end for persistent, non-redundant storage of data with web-based tools, programmatic interfaces (including RESTful APIs) and the ability to export and upload (local or remote) annotation files in standard file formats (e.g. BAM, CRAM, VCF). Genomes are overlaid with variation and transcriptome data, along with whole-genome alignments and pan-species comparative relationships, allowing functional annotation to be extrapolated, e.g. from well-understood pathogens to under-studied, under-funded pathogens.

To provide a broader context, we will functionally advance the KnetMiner open-source software to integrate the PHI data and ontologies with biological pathway data (BioCyc) and protein-protein interaction data (BioGRID, IntAct) from eight model organisms, to elucidate the cascading processes triggered by pathogen effectors and their first targets in the host. This will allow multi-species, cross-kingdom network visualisation and analysis. We will create biannual releases of the integrated knowledge base in FAIR-compliant RDF and Neo4j graph formats.

Planned Impact

This FAIR community resource is aligned with the BBSRC fundamental and strategic research priorities to achieve sustainable global food security, and improve human and animal health and wellbeing across the life course.
This resource is of immediate benefit to all researchers in the medical, crop plant, animal and model organism biosciences working on diseases caused by fungi, protists and bacteria, and will remove bottlenecks to new discoveries caused by data sets being unavailable, non-integrated and/or incompatible for simple queries/complex analyses. Priority infectious microbes have previously been selected and included according to UK industrial and academic researcher interests. This project will provide standardised annotation, more powerful comparative analyses, and greater data access through interactive interfaces and new tools.
The interpretation of genome-scale molecular biology and phenotyping data is a key component in the development of novel strategies for sustainable disease control in humans, crop plants and farmed animals, and has considerable academic, economic, social and ecological value. Specifically, this FAIR resource will organise genome sequence, genetic variation and phenotypic data and make them widely accessible through a new set of interfaces and tools that permit genome-wide enquiries, linked to literature-curated pathogenic phenotypes associated with gene mutations.
The driving rationale for the project, as well as its greatest potential for societal impact, lies in two targeted sectors. The first is sustainably increasing the yields of crop plants by assisting the development of strategies for pesticide development and plant breeding. Crucially, this depends on an understanding of gene function (effectors and their targets, and other downstream biological functions dependent on these), which determines the range of possible pesticide targets, the total genetic reservoir available to plant breeders, and possible side effects (in terms of the impact on plant growth, development and overall health). This FAIR resource and the associated new tools will provide access to existing and new knowledge for numerous phytopathogenic species. The second targeted sector is human health and medical interventions to ensure healthy ageing throughout the life course. Understanding pathogen gene function, host targets and downstream biological functions will aid novel drug discoveries, help track clinical efficacy and help diagnostic companies follow emerging problematic pathogenic microbes.
The main route to achieving impact will be through raising awareness and use of the resource among academic and commercial users. Potential beneficiaries include agricultural companies developing pesticides or attempting to breed new varieties of pathogen-resistant plants, and pharmaceutical companies developing new healthcare products to stop or minimise infectious microbes in the general human population and within hospitals. More generally, farmers and the wider global population will benefit from improved strategies for disease control, although they are not expected to be among the direct users of the database. The PIs at each organisation will engage with society, the media and policy makers to make the case for the importance of research into crop plant and medically important pathogens, in the context of rising global concern about food and energy security, human health, farmed animal health and ecosystem resilience, and of the potential benefits of genomics in addressing these concerns.
The five project objectives have been chosen in the light of the above observations. Collectively, the objective is to put the increasing quantities of data being generated back in the hands of researchers in as useful a form as possible, and to allow them to see the full spectrum of experimental results - from the study of an individual mutant phenotype to information about gene expression or its variance in a population - in an integrated fashion.

Publications

Cunningham F (2022) Ensembl 2022. in Nucleic acids research

Harrison PW (2023) Ensembl 2024. in Nucleic acids research

Howe KL (2021) Ensembl 2021. in Nucleic acids research

Howe KL (2020) Ensembl Genomes 2020-enabling non-vertebrate genomic research. in Nucleic acids research

Urban M (2020) PHI-base: the pathogen-host interactions database. in Nucleic acids research

 
Description Over the past twenty years, sequencing techniques have improved dramatically, enabling us to interpret the genomic makeup of any species with increasing accuracy and speed. This has paved the way for deeper biological explorations: for instance, how exactly these species interact with each other at a molecular level, how these interactions influence the outcome, and what factors can change them. This grant has brought together multiple groups specialising in standardised information capture and the representation of genomic data to focus on the interactions between pathogens and their hosts (animals, plants, insects and humans). This work enables the development of precise definitions (i.e. ontologies) to describe these interactions, intuitive interfaces that allow scientists to curate their experimental findings, and software and databases that represent these findings so that anyone around the world can query them freely to make predictions and develop hypotheses that can be tested in the laboratory. The results of this grant extend well beyond pathogens and hosts to encompass species in a variety of ecosystems (human gut, soil, water), and the applications of these efforts range from the discovery of new therapeutics and agriculture to tackling plastic waste and understanding the impact of climatic fluctuations.
Exploitation Route The information captured during the grant lifetime can be applied in many domains, including medicine, ecology and agriculture. The clearest application at this point is in human, animal and plant disease: identifying potential drug targets, mechanisms of drug resistance and approaches to stimulate the host (for example, plant) immune system to fight infection. Looking beyond this, the Ensembl infrastructure put in place during this grant (e.g. analysis methods, database schemas, user interfaces) can be leveraged to capture other types of interactions between species (e.g. symbiotic interactions between fungi and bacteria), which can offer insights into microbial populations in natural environments (e.g. soil, gut) and how they change in response to treatment and climatic conditions. For instance, we already host interactions concerning plastic degradation. These data, coupled with other facets of the Ensembl infrastructure such as comparative analyses, genomic variation and recently added 3D structure predictions, will enable the prediction of as yet unknown interactions and the development of testable hypotheses. Our infrastructure is open source and freely available for third-party reuse.
Sectors Agriculture, Food and Drink; Environment; Pharmaceuticals and Medical Biotechnology

 
Description The funding provided to EMBL-EBI has enabled Ensembl to build robust data schemas, along with visualisation and search techniques, that link proteins across Ensembl species. This development aligns Ensembl with current research trends that explore species within their environments, and in the context of health and disease, as a consequence of interactions at a molecular level, in addition to providing access to continually improving genomic annotation of isolate genomes. This offers a new angle of inquiry to our users in the medical, agricultural and environmental sectors, and the possibility of integrating Ensembl's orthologue, variant and interaction information to identify potential new drug targets, resistance genes or proteins that can play a significant role in mitigating environmental crises. We have imported over 13,500 molecular interactions from PHI-base (the resource described in the grant), HPIDB and PlasticDB; we expanded to the latter two resources with minimal effort and no additional funding thanks to our new pipelines. This has increased the breadth of interactions we support and allowed datasets of medical and environmental importance to intersect in Ensembl. Our REST endpoints allow these interactions to be searched across all species, which was not previously possible. Open access to our code also means that other resources can recreate this, enabling stronger integration of data.
First Year Of Impact 2022
Sector Agriculture, Food and Drink; Environment; Pharmaceuticals and Medical Biotechnology
 
Title Import of host/pathogen interaction data into API 
Description New database to store relationships between proteins and other molecules in inter- and intra-species relationships.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact The resource can hold multiple types of interacting molecules, which span the entirety of the Ensembl resource. Hosting it as a resource separate from Ensembl allows it to be updated outside the normal Ensembl release cycle, and it may be integrated into other Ensembl resources. It currently hosts pathogenic and plastic-degrading interactions.
URL http://interactions.rest.ensembl.org/
 
Title Overhaul of genomes in EnsemblBacteria 
Description The Ensembl Bacteria resource has undergone a significant overhaul: all redundant genomes have been removed to improve scalability. The remaining reference genomes from the ENA have been imported, annotated with pathogen-host interactions and aligned to Rfam covariance models. These changes were made with the planned pathogen effector data from Rothamsted Research in mind; these data will be integrated into our bacteria resource.
Type Of Material Data handling & control 
Year Produced 2020 
Provided To Others? Yes  
Impact The large numbers of almost identical bacterial genomes (for instance, from experiments repeatedly sequencing the same outbreak) have meant that the volume of data has grown exponentially. In the past it was conceivable to represent every one of these genomes in ENA within EnsemblBacteria, but it soon became clear that this was detrimental both to the smooth functioning of our pipelines and to the future ability to pick useful references. Therefore, as of release 102 (end of 2020), Ensembl Bacteria only integrates non-redundant, reference bacterial and archaeal genomes. These references were identified using UniProt, and this has reduced the number of genomes in EnsemblBacteria from 44,048 to 31,332. We believe this move to represent only references will strengthen Ensembl Bacteria's offering and help us focus our annotation efforts on representative gene sets in a timely manner.
URL http://bacteria.ensembl.org
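As a quick check of the scale of the reduction quoted in the Impact statement above, the two genome counts can be compared directly. The short sketch below uses only the two figures given in the text; the variable names are illustrative.

```python
# Genome counts quoted in the Impact statement for the EnsemblBacteria overhaul.
genomes_before = 44_048
genomes_after = 31_332

# Number of redundant genomes removed and the share of the original set they represent.
removed = genomes_before - genomes_after
fraction_removed = removed / genomes_before

print(f"Genomes removed: {removed:,}")
print(f"Reduction: {fraction_removed:.1%}")
```

In other words, roughly 29% of the previously hosted genomes were dropped as redundant.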
 
Title Release of Ensembl resources (over the grant period) 
Description 3-4 releases per year of the Ensembl microbial resources (fungi, bacteria and protists) during the period of the award.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact New species have been integrated into the Ensembl platform expanding potential targets for host-pathogen interactions. Genomes may be used in other forms of analysis without restriction 
 
Title Improvements to our RNA-Seq mapping and TrackHub visualisations 
Description To understand pathogen-host interactions, it is crucial to study the expression of genes during disease and other states. The microbial resources within Ensembl collaborate with EBI's ArrayExpress group to map all RNA-Seq datasets from the Sequence Read Archive (SRA) onto reference assemblies of key pathogens and then import them into our TrackHub registry for easy visualisation on our browser. In the past 12 months, we have improved this workflow, making it resilient to minor failures (and thus more scalable) and better aligned with ENA's new API. The aim was to automate the workflow as far as possible, so that new experiments deposited in the SRA are promptly picked up and made available to our users to view alongside the reference gene sets.
Type Of Technology New/Improved Technique/Technology 
Year Produced 2020 
Impact (1) Made more resilient to minor errors in input files without requiring constant human intervention (invaluable when scaling to many species) (2) Various bug fixes to cope with microbial data 
URL http://protists.ensembl.org
 
Title Improvements to our pathogen-host interaction pipeline 
Description Our pathogen-host annotation pipeline imports curated data on the role of microbial genes in an infection. In the past 12 months it has been updated to a) scale better to thousands of microbial genomes, b) handle various input files robustly, making the pipeline more adaptable to different data repositories in the future, and c) use the Ontology Lookup Service (OLS) provided by the EBI, enabling standardised queries and vocabulary.
Type Of Technology New/Improved Technique/Technology 
Year Produced 2020 
Impact (1) Better optimised to scale to thousands of species by: utilising Ensembl's eHive mechanisms strategically; storing BLAST results and only recomputing them when needed; and adding options to run only sections of the pipeline as needed (without recomputing BLAST results, which is the most compute-intensive part of the pipeline). (2) Automated sanitising of input data received from Rothamsted to cope with changes to formats (previously this was a manual effort). (3) Checks added to map terms to ontologies and make use of EBI's Ontology Lookup Service. This feature will be invaluable as the terminology around interaction data develops. (4) Generation of a summary and statistics report at the end of the run. This has already been invaluable for inclusion in reports/papers and for troubleshooting.
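The idea of storing BLAST results and only recomputing them when needed can be illustrated with a small caching sketch. This is not the Ensembl pipeline code, which is not shown in this record: the function names, the JSON cache layout and the `run_search` callback (standing in for the expensive BLAST step) are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(query_sequence: str, target_genome_id: str) -> str:
    """Derive a stable key from the inputs that determine the search result."""
    payload = f"{target_genome_id}\n{query_sequence}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_search(
    query_sequence: str,
    target_genome_id: str,
    run_search: Callable[[str, str], list],  # hypothetical stand-in for the BLAST step
    cache_dir: Path,
    force: bool = False,
) -> list:
    """Return stored hits when available; run the expensive search otherwise."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{cache_key(query_sequence, target_genome_id)}.json"

    # Reuse the stored result unless the caller explicitly asks for a recompute.
    if cache_file.exists() and not force:
        return json.loads(cache_file.read_text())

    hits = run_search(query_sequence, target_genome_id)
    cache_file.write_text(json.dumps(hits))
    return hits
```

Keying the cache on both the query sequence and the target genome means a result is reused only when neither input has changed, while the `force` flag mirrors the option to rerun a section of a pipeline on demand.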
 
Title Pathogen/host import pipeline improvements 
Description A set of pipelines which can be used to populate our new interactions database automatically. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2022 
Open Source License? Yes  
Impact The pipeline can be used or adapted by external users to populate the database with custom third-party data.
 
Title Visualisation of interactions in Ensembl 
Description A series of interfaces which describe the number of known interactions, details of the interactions and links to external resources to explore further. Currently deployed across all Ensembl sites including our non-vertebrate resources and can show bi-directional interactions. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2022 
Open Source License? Yes  
Impact The interface enables fast exploration of bi-directional relationships between cross-species interactors, and exposes these data through a novel and intuitive interface.
URL https://fungi.ensembl.org
 
Description Annual course on Fungal Pathogen Genomics (virtual), Wellcome Connecting Science 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Fungal Pathogen Genomics annual course organised by Wellcome Connecting Science provides hands-on virtual training in web-based data-mining resources for fungal genomes. Participants learn how to take advantage of the unique tools offered by each database, develop testable hypotheses, and investigate transcriptomics, proteomics and genomics datasets across multiple databases and different user interfaces.
Year(s) Of Engagement Activity 2021
URL https://coursesandconferences.wellcomeconnectingscience.org/event/fungal-pathogen-genomics-virtual-2...
 
Description Annual training workshop on fungal pathogen genomes 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We run an annual training workshop on fungal pathogen genomes in collaboration with FungiDB, SGD, JGI/MycoCosm and PomBase. This has been running since 2017, funded by the Wellcome Trust. Every year we train 25 participants (a mix of bioinformaticians, post-graduates, clinicians, academics and computer scientists) from around the world, who travel to the Wellcome Genome Campus for a week. We present a broad range of Ensembl functions that participants can make use of, including searching for pathogen-host interactions. Feedback from the participants has been very positive and has led to requests for further training in low- to middle-income countries by the Ensembl outreach team. Most participants of the course reported that they had not previously known about the available resources, and that they would continue to incorporate them in their work and lectures.
Year(s) Of Engagement Activity 2020
URL https://coursesandconferences.wellcomegenomecampus.org/our-events/fungal-pathogen-genomics-2020/
 
Description Ensembl-Rothamsted Research booth at New Scientist Live 2022 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Ensembl team members worked with Rothamsted Research at a booth at the 2022 New Scientist Live Festival held at ExCel in London.
Year(s) Of Engagement Activity 2022
URL https://www.excel.london/m/visitor/whats-on/new-scientist-live-2022
 
Description Euglena Network meeting 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented the Ensembl Protists resource, along with the pipeline that links molecular interaction data curated at Rothamsted to microbial genes in Ensembl.
This was the inaugural meeting of this network, which brings together groups interested in euglenoids, whose potential ranges from biofuels and absorption of metals to vegan food. The purpose of this meeting was to prioritise species to be studied and sequenced, and to highlight resources (such as Ensembl) in which the genomic data could be represented.
Year(s) Of Engagement Activity 2020
 
Description European Nucleotide Archive (ENA) annual facilities workshop (14th October 2020) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The purpose of this meeting is three-fold: a) to inform users of updates to ENA and to resources such as Ensembl that add extra information to genomic sequences, b) to gather requirements from the institutes generating big data, and c) to encourage the submission and sharing of data using the public archives.
We presented work and data in EnsemblProtists, EnsemblFungi and EnsemblBacteria.
Year(s) Of Engagement Activity 2020
 
Description European conference on fungal genetics 2020 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The results thus far of this project were presented in a poster at this large annual meeting of researchers in fungal genomes. The sessions sparked lively discussions with both long-term collaborators of Ensembl Fungi and curious new potential users. We also received valuable feedback from past students of our annual fungal workshops who were also attending this conference and many comments on how their bioinformatics journeys had flourished following the course we conducted.
Year(s) Of Engagement Activity 2020
URL https://www.ecfg15.org/
 
Description Poster titled "Enabling microbial ecology studies through molecular interactions in Ensembl" at ISME18 (2022) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster presented by Ensembl Microbes Bioinformatician Manuel Carbajo Martinez at the 18th International Symposium on Microbial Ecology (ISME) organised in Switzerland in 2022.
ABSTRACT
Microbial communities contain complex networks of molecular interactions: pathogenic to mutualistic. Many sequencing projects, and associated data resources, have focused on the genomic makeup of individual species within these populations. Connecting these genomes to their interactions with other genomes and the environment is crucial to bolster our understanding of biological processes and pathways in ecosystems.
Ensembl provides open-access tools to explore genomic, transcriptomic, comparative and variant data from thousands of species, each within one of its divisions: vertebrates, plants, metazoa, fungi, protists or bacteria. We introduce a new tool to integrate experimentally verified interactions between any two entities (genes, proteins, mRNA or synthetic/organic molecules) into Ensembl, enabling relationships between species across its divisions to be represented for the first time. This allows us to assimilate interaction data for a species from a variety of contexts and viewpoints. For instance, in Fusarium solani we can annotate genes involved in a range of activities, such as the hydrolysis of cutin in the cell wall during infection of plant hosts such as squash and the secretion of polyethylene terephthalate (PET) hydrolase enzymes, as well as genes susceptible to the antagonistic secretions of Pseudomonas bacteria.
These data, when combined with Ensembl's sequence searches and orthology predictions, can help elicit potential new molecular strategies in related, under-studied species. For instance, orthologues of the cutA gene in other Fusarium species also have cutin hydrolysis properties that could play a role both in pathogenesis and degradation of pollutants. Thus, Ensembl becomes a hub for the discovery of diverse datasets from diseases to microbial ecology.
Year(s) Of Engagement Activity 2022
URL https://isme18.isme-microbes.org/poster-program
 
Description Poster titled "Ensembl Fungi: Melding data sets to explore species interactions" at the 31st Annual Fungal Genetics conference 2022 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster titled "Ensembl Fungi: Melding data sets to explore species interactions" at the 31st Annual Fungal Genetics conference 2022 organised at the Asilomar Conference Grounds (Genetics Society of America). Virtual poster presented by Ensembl Microbes Bioinformatician Manuel Carbajo Martinez.
ABSTRACT
Biological communities, in both healthy and diseased states, are a myriad of complex and dynamic interactions between species. Diseases, for instance, are a consequence of interactions between pathogen virulence factors and host cell molecules. Understanding these interactions is fertile ground for uncovering crucial biological mechanisms that can lead to better management of disease and agricultural practices, and a clearer understanding of many ecosystems from soil to the human gut. In Ensembl Fungi, we have developed a new data model to capture any pair of interacting entities (for example, a protein in a pathogen and a protein in a host) along with meta information about them, such as experimental details of how the interaction was uncovered, using terms from controlled vocabularies. We have integrated inter-species protein-protein interactions from PHI-base and have infrastructure in place to capture similar, manually curated data. These new data are combined with the 1500+ genomes in Ensembl Fungi, the Ensembl Variant Effect Predictor, transcriptomic data in track hubs and the homologous relationships across fungi. Together, they provide a powerful toolkit to explore host-pathogen, and other, relationships between species. Here we present our data capture pipelines, underlying storage and search strategies. Furthermore, we are formulating methods to make conservative predictions of other potential participants in these interactions from related species. These methods will use a combination of metadata about species (for instance, pathogens infecting similar hosts), orthology and sequence similarity, and will be available to view/download with clear labelling to indicate prediction methods. We believe that these will provide an exciting opportunity for plant, medical, animal and environmental researchers to explore scientific hypotheses before committing to experimentation.
Year(s) Of Engagement Activity 2022
URL https://genetics-gsa.org/fungal-2022/#
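The data model described in the abstract above (any pair of interacting entities plus meta information drawn from controlled vocabularies) might be sketched as follows. This is an illustrative guess at the shape of such a model, not the actual Ensembl schema: every class, field and identifier below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interactor:
    """One side of an interaction (hypothetical fields)."""
    identifier: str     # e.g. a gene or protein accession
    species: str
    molecule_type: str  # e.g. "protein", "gene", "mRNA", "synthetic molecule"


@dataclass
class Interaction:
    """A pair of interacting entities plus controlled-vocabulary metadata."""
    interactor_a: Interactor
    interactor_b: Interactor
    source_db: str
    # Maps a metadata category to a controlled-vocabulary term (placeholder values).
    metadata: dict = field(default_factory=dict)


def partners(interactions: list, identifier: str) -> list:
    """Find partners of an entity regardless of which side it was recorded on."""
    found = []
    for interaction in interactions:
        if interaction.interactor_a.identifier == identifier:
            found.append(interaction.interactor_b)
        elif interaction.interactor_b.identifier == identifier:
            found.append(interaction.interactor_a)
    return found


# Placeholder records; the identifiers and terms are invented for illustration.
pathogen_protein = Interactor("PATHOGEN_PROT_1", "example pathogen", "protein")
host_protein = Interactor("HOST_PROT_1", "example host", "protein")
records = [
    Interaction(pathogen_protein, host_protein, "example source",
                {"experimental_evidence": "EXAMPLE:0000001"}),
]

print(partners(records, "HOST_PROT_1")[0].identifier)
```

Storing each pair once and searching both sides is one simple way to support the bi-directional lookups that the interfaces in this record describe.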
 
Description SAB Meeting (2020) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Scientific Advisory Board Meeting. Developments over the past year were reported. 16 members attended from across the UK and there was a mix of research and industry engagement. The affiliations of the participants were: EBI, Syngenta, NIAB, Sainsbury Laboratory, Cambridge, Rothamsted Research, Imperial, University of Strathclyde, University of Manchester.
Year(s) Of Engagement Activity 2020
 
Description School visit Royston 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Talk and hands-on activities for 60 children at St Mary's Catholic Primary School, Royston. Discussions covered genomics, why it is relevant, and the complex interactions of species in our world and in our bodies.
Year(s) Of Engagement Activity 2020