A FAIR community resource for pathogens, hosts and their interactions to enhance global food security and human health

Lead Research Organisation: University of Cambridge
Department Name: Biochemistry

Abstract

Infectious microbes continue to impose major costs on the UK farming and food industry and increasingly threaten global food security, commercial and ornamental tree health and ecosystem resilience. Similarly, due to the rise in resistance to antimicrobial compounds and the increased globalisation of trade and travel, infectious microbes impose ever greater costs on public and private UK medical and veterinary providers and threaten human and animal health and wellbeing across the life course. This resource addresses the needs of a substantial and diverse UK and international bioscience research community. As the biosciences become increasingly data-intensive and mega-scale data analyses become the new norm, building and maintaining community resources that ensure the Findability, Accessibility, Interoperability and Reusability of data (i.e. that are FAIR) will benefit many different bioscience disciplines.

In recent years, new possibilities for the study (and ultimately control) of pathogens have opened up through the application of high-throughput technologies for determining the molecular nature of life. These include genome sequencing, which reveals the genetic code that determines the inherited properties of cells, and extend to monitoring the varied cellular contents at different stages of life and disease. This FAIR community resource is designed to capture broad molecular information from pathogenic organisms and combine it with descriptive information about the process of infection. The latter includes more specific molecular information, e.g. the pathogen and host proteins that interact during infection, the phenotype of the interaction outcome and which pathogen proteins are already targeted by anti-infective chemicals. The new knowledge on pathogen genomes, patterns of gene expression and potential interacting partners is housed using the Ensembl platform, which contains a comprehensive suite of software for the management and display of genome-scale data. The new phenotypic knowledge on experimentally verified genes required for the disease-causing abilities of each pathogenic species will increasingly be curated by members of the scientific community into the Pathogen-Host Interactions database (PHI-base) using a newly developed tool called PHI-Canto. A new curation focus will increase the detail recorded about (i) the molecular interactions between the repertoires of small effector proteins produced by pathogens and their initial targets within each host species, and (ii) the pathogen targets for anti-infective chemistries. To support the ongoing curation efforts, new generic PHIPO ontologies (controlled definitions) will be developed to accurately describe the depth and breadth of pathogen-host interactions.

We will further develop the interfaces within and between Ensembl Genomes, PHI-base and other key e-science data and information providers to support the joint querying and visualisation of genomic and phenotypic data. We will also deploy new and existing tools (graphical and non-graphical) to improve inter-species comparative analysis and the integration of different large data types. This will speed up analyses and enable new discoveries on the evolutionary origin of genes, on mutations important in the process of infection and on genes and gene networks conferring host resistance, pathogen virulence or resistance to anti-infective chemicals.
We will continue to engage with the large and active UK research community in the biosciences to identify their current needs and emerging requirements through University/Institute visits, and will conduct training activities to demonstrate the potential use of the resource. We will engage with academic and industry-based scientists in other countries by attending international conferences and workshops and presenting this FAIR community resource and its uses there.

Technical Summary

PHI-base is the phenotype data source provider. We will continue to curate the literature for ~200 pathogenic species and include emerging problematic species. New advanced curation will include (a) first host plant targets of pathogen effectors, (b) anti-infective targets and variant sequences causing chemical insensitivity, and (c) ~8 specific genome landscape features. We will further develop the multi-species PHI-Canto tool to enable rapid, accurate and comprehensive publication-based author curation. PHI-base data will be made available in emerging data exchange formats (e.g. Phenopackets) to increase interoperability and use. The new PHIPO ontologies that underpin this curation will be built using Protégé and will adhere to the strict ontology development principles outlined by the OBO Foundry.

The PHI-phenotype information will be mapped onto microbial genes in Ensembl Genomes, an established platform that combines a relational database back-end for persistent, non-redundant storage of data with web-based tools, programmatic interfaces (including RESTful APIs) and the ability to export and upload (local or remote) annotation files in standard file formats (e.g. BAM, CRAM, VCF). Genomes are overlaid with variation and transcriptome data, along with whole-genome alignments and pan-species comparative relationships, allowing functional annotation to be extrapolated, e.g. from well-understood pathogens to under-studied, under-funded pathogens.
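The mapping of curated phenotypes onto gene models can be pictured as a simple join on a shared gene identifier. The following is a minimal Python sketch under stated assumptions: the record layout, the placeholder identifiers (GENE_A and so on), the coordinates and the function name are all invented for illustration and do not reflect the actual PHI-base or Ensembl Genomes schemas.

```python
from collections import defaultdict

# Hypothetical curated phenotype records (placeholder identifiers,
# not real PHI-base entries).
phenotype_records = [
    {"gene_id": "GENE_A", "host": "Host sp. 1", "phenotype": "reduced virulence"},
    {"gene_id": "GENE_A", "host": "Host sp. 2", "phenotype": "loss of pathogenicity"},
    {"gene_id": "GENE_B", "host": "Host sp. 1", "phenotype": "unaffected pathogenicity"},
]

# Hypothetical gene models, roughly as a genome database might hold them.
gene_models = {
    "GENE_A": {"seq_region": "chr1", "start": 1000, "end": 2500},
    "GENE_B": {"seq_region": "chr2", "start": 400, "end": 1300},
    "GENE_C": {"seq_region": "chr2", "start": 5000, "end": 6200},
}


def map_phenotypes_to_genes(records, genes):
    """Attach each phenotype to the gene model sharing its gene_id.

    Records whose gene_id has no gene model are skipped; genes without
    any curated phenotype receive an empty list.
    """
    by_gene = defaultdict(list)
    for record in records:
        if record["gene_id"] in genes:
            by_gene[record["gene_id"]].append(record["phenotype"])
    return {
        gene_id: {**model, "phenotypes": by_gene.get(gene_id, [])}
        for gene_id, model in genes.items()
    }


annotated = map_phenotypes_to_genes(phenotype_records, gene_models)
for gene_id, info in annotated.items():
    print(gene_id, info["seq_region"], info["phenotypes"])
```

Keeping genes with no curated phenotype in the result (here GENE_C) makes the gaps in curation coverage visible, which is where extrapolation from better-studied species would apply.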

To provide a broader context, we will functionally advance the KnetMiner open-source software to integrate the PHI-data and ontologies with biological pathway data (BioCyc) and protein-protein interaction data (BioGRID, IntAct) from eight model organisms, in order to elucidate the cascading processes triggered by pathogen effectors and their first targets in the host. This will allow multi-species, cross-kingdom network visualisation and analysis. We will create biannual releases of the integrated knowledge base in FAIR-compliant RDF and Neo4j graph formats.
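As a rough illustration of what a graph-shaped knowledge base makes possible, the Python sketch below stores a few edges as subject-predicate-object triples, serialises them as N-Triples (a plain-text RDF syntax) and walks the graph from an effector to everything downstream of it. The namespace http://example.org/kb/, the node names (E1, H1 and so on) and the predicate names are hypothetical placeholders; this is not the KnetMiner data model or its API.

```python
from collections import defaultdict, deque

# Hypothetical namespace; real releases would use their own URIs.
BASE = "http://example.org/kb/"

# Invented edges: an effector, its first host target, and what lies downstream.
triples = [
    ("effector/E1", "interacts_with", "host_protein/H1"),
    ("host_protein/H1", "interacts_with", "host_protein/H2"),
    ("host_protein/H1", "participates_in", "pathway/P1"),
    ("host_protein/H2", "participates_in", "pathway/P2"),
]


def to_ntriples(edges, base=BASE):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    return "\n".join(f"<{base}{s}> <{base}{p}> <{base}{o}> ." for s, p, o in edges)


def downstream(start, edges):
    """Breadth-first walk along edge direction; returns nodes reachable from start."""
    adjacency = defaultdict(list)
    for s, _, o in edges:
        adjacency[s].append(o)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


print(to_ntriples(triples))
print(downstream("effector/E1", triples))
```

The breadth-first order mirrors the cascade described above: the first host target appears before the proteins and pathways it influences.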

Planned Impact

This FAIR community resource is aligned with the BBSRC fundamental and strategic research priorities to achieve sustainable global food security, and improve human and animal health and wellbeing across the life course.
This resource is of immediate benefit to all researchers in the medical, crop plant, animal and model organism biosciences working on diseases caused by fungi, protists and bacteria. It will remove bottlenecks to new discoveries caused by data sets being unavailable, non-integrated and/or incompatible with simple queries or complex analyses. Priority infectious microbes have previously been selected and included according to UK industrial and academic researcher interests. This project will provide standardised annotation, more powerful comparative analyses, and greater data access through interactive interfaces and new tools.
The interpretation of genome-scale molecular biology and phenotyping data is a key component in the development of novel strategies for sustainable disease control in humans, crop plants and farmed animals, and has considerable academic, economic, social and ecological value. Specifically, this FAIR resource will organise genome sequence, genetic variation and phenotypic data and make them widely accessible through a new set of interfaces and tools that permit genome-wide enquiries linked to literature-curated pathogenic phenotypes associated with gene mutations.
The driving rationale for the project, as well as its greatest potential for societal impact, lies in two targeted sectors. The first is sustainably increasing the yields of crop plants by supporting strategies for pesticide development and plant breeding. Crucially, this depends on an understanding of gene function (effectors and their targets, and other downstream biological functions dependent on these), which determines the range of possible pesticide targets, the total genetic reservoir available to plant breeders, and possible side effects (in terms of the impact on plant growth, development and overall health). This new resource and the associated new tools will provide access to existing and new knowledge for numerous phytopathogenic species. The second targeted sector is human health and medical interventions to ensure healthy ageing throughout the life course. Understanding pathogen gene function, host targets and downstream biological functions will aid novel drug discovery, help track clinical efficacy and help diagnostic companies follow emerging problematic pathogenic microbes.
The main route to achieving impact will be through raising awareness and use of the resource among academic and commercial users. Potential beneficiaries include agricultural companies developing pesticides or attempting to breed new varieties of pathogen-resistant plants, and pharmaceutical companies developing new health care products to stop or minimise infectious microbes in the general human population and within hospitals. More generally, farmers and the wider global population will benefit from improved strategies for disease control, although they are not expected to be among the direct users of the database. The PIs at each organisation will engage with society, the media and policy makers to make the case for the importance of research into crop plant and medically important pathogens, in the context of rising global concern about food and energy security, human health, farmed animal health and ecosystem resilience, and for the potential benefits of genomics in addressing these concerns.
The five project objectives have been chosen in the light of the above observations. Collectively, the objective is to put the increasing quantities of data being generated back in the hands of researchers in as useful a form as possible, and to allow them to see, in an integrated fashion, the full spectrum of experimental results, from the study of an individual mutant phenotype to information about the expression of a gene or its variation in populations.
