Big Data approaches to identifying potential sources of emerging pathogens in humans, domesticated animals and crops
Lead Research Organisation:
University of Liverpool
Department Name: Institute of Infection and Global Health
Abstract
Emerging infectious diseases continue to pose major threats to humans, animals and plants. Recent years have seen significant outbreaks of several emerging diseases, ranging from the well-known (Ebola and olive quick decline syndrome), to the previously little known (Zika), to the entirely novel (Schmallenberg), to name but a few. It is well established that the ability of a pathogen to infect multiple hosts, particularly hosts in different taxonomic orders or wildlife, is a risk factor for the emergence of human and livestock pathogens. Emerging wildlife diseases have also been linked to 'spill-overs' from humans or domesticated animals. Despite the importance of cross-species disease transmission, relatively little attention has been paid to which species are the most important sources across communities (e.g., zoonotic, wildlife to domestic, plants to other kingdoms), which are the most prolific vectors, how those species acquired the pathogens, and by what means the diseases entered new species or populations. A major reason for this limited understanding is the lack of comprehensive data on the pathogens in animal and plant populations and, in most cases, poorly documented information on how they are transmitted, including to humans.
In this fellowship, I will improve and exploit a novel bioinformatic resource developed at the University of Liverpool to investigate how humans, their domesticated animals and crops are connected to the pathogen reservoir in other species, and how these pathogens pass from that reservoir to the focus populations. The bioinformatic resource, developed by me with funding from BBSRC, is the Enhanced Infectious Disease Database (EID2). EID2 uses state-of-the-art text and data mining procedures to extract information from multiple sources, including millions of metadata records accompanying genetic sequences and scientific publications. After processing, EID2 provides evidence for over 60,000 interactions between species of hosts and pathogens and is the most comprehensive data source on the known pathogens of humans, animals, and plants and their geographical ranges.
During this fellowship, I aim to investigate the factors which lead to emergence of pathogens, asking the following questions:
1. What are the characteristics of the networks that connect species via shared pathogens? How central are humans and their domesticated animals and crops in these networks and which other species are each of those communities most closely connected to?
2. What effect do different pathogen transmission routes have on the nature of these networks? Are the potential species-to-species transmission pathways different for direct, food-borne, water-borne and vector-borne pathogens?
3. What factors determine the host ranges of pathogens? Are host species more likely to become exposed to pathogens that infect a wide range of species? From species that are closer to them genetically? Or from those species with which they often interact?
4. What are we missing? Given the networks, transmission routes and host ranges, what is the risk associated with each pathogen emerging in new species? Which pathogens can be prioritised as more likely to emerge in the future?
Technical Summary
Despite the importance of cross-species disease transmission, relatively little attention has been paid to which species are the most important sources across communities, and by what means the diseases entered new species or populations. A reason for this is the lack of comprehensive data on pathogens in animals and plants and, often, poorly documented information on how they are transmitted. The objective of this fellowship is to transform host-pathogen interactions extracted from EID2 into multi-host ecological networks. Network analysis will be used to study the typology, bridge species and metrics of these networks. By implementing a centrality index based on a cohort of centrality measures, potential super-spreaders will be identified.
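To illustrate the idea of a multi-host network and a composite centrality index, the following is a minimal sketch under stated assumptions. The host and pathogen names are hypothetical toy values (not EID2 data), and the index shown (an unweighted mean of normalised degree and strength) is a stand-in chosen for brevity, not the index developed in the fellowship.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records (host -> set of pathogens); not EID2 data.
HOST_PATHOGENS = {
    "human": {"p1", "p2", "p3"},
    "cattle": {"p1", "p2"},
    "bat": {"p3", "p4"},
    "rodent": {"p4"},
}


def shared_pathogen_network(host_pathogens):
    """Project host-pathogen records onto a host-host network.

    Two hosts are linked when they share at least one pathogen;
    the edge weight is the number of pathogens they share.
    """
    edges = {}
    for a, b in combinations(sorted(host_pathogens), 2):
        shared = host_pathogens[a] & host_pathogens[b]
        if shared:
            edges[(a, b)] = len(shared)
    return edges


def centrality_index(host_pathogens):
    """Illustrative composite index: mean of normalised degree and strength.

    Degree counts neighbouring hosts; strength sums the edge weights.
    Both are scaled by the maximum observed value before averaging.
    """
    edges = shared_pathogen_network(host_pathogens)
    degree = defaultdict(int)
    strength = defaultdict(int)
    for (a, b), weight in edges.items():
        for host in (a, b):
            degree[host] += 1
            strength[host] += weight
    max_degree = max(degree.values(), default=1)
    max_strength = max(strength.values(), default=1)
    return {
        host: (degree[host] / max_degree + strength[host] / max_strength) / 2
        for host in host_pathogens
    }


if __name__ == "__main__":
    index = centrality_index(HOST_PATHOGENS)
    for host in sorted(index, key=index.get, reverse=True):
        print(f"{host}: {index[host]:.3f}")
```

In this toy network the host sharing the most pathogens with the most other hosts ranks highest; a fuller analysis would combine further measures (for example betweenness or closeness) before ranking.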
Transmission routes play a role in determining the parts of the population that are infected, the range of species that can be infected, and the measures required to bring an outbreak under control. A transmission route identification system, utilising an ensemble of classifiers, will be developed to categorise scientific publications to identify transmission routes. Consequences of different routes will be assessed, and their effect on centrality will be quantified.
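The ensemble approach to categorising publications can be sketched as a simple majority vote. This is a minimal illustration only: the keyword lists, the three toy classifiers and the function names are all hypothetical, and the classifiers actually used in the fellowship are not described in this record.

```python
from collections import Counter


def keyword_classifier(keywords):
    """Build a toy classifier from a {route: [keywords]} mapping.

    The classifier returns the route whose keywords occur most often
    in the text, or None when no keyword is found at all.
    """
    def classify(text):
        lowered = text.lower()
        scores = {
            route: sum(lowered.count(word) for word in words)
            for route, words in keywords.items()
        }
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None

    return classify


# Three hypothetical base classifiers with different vocabularies.
CLASSIFIERS = [
    keyword_classifier({"vector-borne": ["mosquito", "tick"],
                        "food-borne": ["meat", "milk"]}),
    keyword_classifier({"vector-borne": ["vector", "bite"],
                        "water-borne": ["drinking water"]}),
    keyword_classifier({"food-borne": ["contaminated food"],
                        "direct": ["contact"]}),
]


def ensemble_route(text, classifiers=CLASSIFIERS):
    """Majority vote over the base classifiers; abstentions are ignored."""
    votes = Counter(
        vote for vote in (clf(text) for clf in classifiers) if vote is not None
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    abstract = "Transmission occurs through the bite of an infected mosquito vector."
    print(ensemble_route(abstract))
```

A vote lets classifiers with different strengths cover each other's blind spots; in practice the base learners would be trained models rather than keyword counters.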
The range of hosts that a pathogen infects influences the dynamics of pathogen transmission and emergence in novel hosts. Statistical models will be developed to rank different traits of hosts in relation to host-specificity. The effects of the traits on centrality will be quantified.
Predicting unobserved or future host-pathogen interactions is imperative for One Health, enabling identification of potential emergence and spill-over events. A link prediction model on host-pathogen networks, with links either positive or negative, will be constructed. A system will be developed to capture negative host-pathogen interactions from scientific literature. The results will be incorporated into the multi-host networks; the effect on centrality will be quantified.
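As a rough illustration of link prediction on a host-pathogen network, the sketch below scores each unobserved host-pathogen pair by how similar the host is to the known hosts of that pathogen. It is a simple similarity heuristic under stated assumptions, not the model proposed here: the data, the Jaccard-based score and the handling of negative links (known non-interactions are simply excluded from the candidates) are all hypothetical choices.

```python
# Hypothetical toy records (host -> set of pathogens); not EID2 data.
KNOWN = {
    "human": {"p1", "p2", "p3"},
    "cattle": {"p1", "p2"},
    "bat": {"p3", "p4"},
    "rodent": {"p4"},
}

# Hypothetical documented non-interaction (a "negative" link).
NEGATIVE = {("cattle", "p4")}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_links(known, negative=frozenset()):
    """Rank unobserved host-pathogen pairs by similarity to known hosts.

    The score for (host, pathogen) is the highest Jaccard similarity
    between the host's pathogen set and that of any known host of the
    pathogen. Pairs listed in `negative` are never proposed.
    """
    pathogens = set().union(*known.values())
    scores = {}
    for host, own in known.items():
        for pathogen in pathogens - own:
            if (host, pathogen) in negative:
                continue
            similarities = [
                jaccard(own, other)
                for other_host, other in known.items()
                if other_host != host and pathogen in other
            ]
            scores[(host, pathogen)] = max(similarities, default=0.0)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for (host, pathogen), score in predict_links(KNOWN, NEGATIVE):
        print(f"{host} - {pathogen}: {score:.2f}")
```

Pairs near the top of the ranking are candidates for unobserved interactions; a trained model would replace the single similarity score with learned features of hosts, pathogens and the network.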
Publications
Blagrove M
(2022)
Reply to: Machine-learning prediction of hosts of novel coronaviruses requires caution as it may affect wildlife conservation
in Nature Communications
Wardeh M
(2020)
Integration of shared-pathogen networks and machine learning reveals the key aspects of zoonoses and predicts mammalian reservoirs.
in Proceedings. Biological sciences
Wardeh M
(2021)
Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations.
in Nature communications
Wells K
(2020)
Distinct spread of DNA and RNA viruses among mammals amid prominent role of domestic species.
in Global ecology and biogeography : a journal of macroecology
Description | Global trade of coronavirus hosts: bringing geographically isolated hosts and viruses together risks novel recombination and spillover to humans |
Amount | £117,406 (GBP) |
Funding ID | BB/W00402X/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 07/2021 |
End | 05/2022 |
Description | Predicting mammalian and avian reservoirs of coronaviruses: identifying current reservoirs and co-infection hosts in which future novel coronavirus could be generated |
Amount | £11,008 (GBP) |
Funding ID | BBSRC IAA COVID - 168478 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2021 |
End | 06/2021 |
Description | Vector in the machine: How accurately can mosquito transmission of viruses be predicted by machine learning? |
Amount | £181,248 (GBP) |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 10/2022 |
End | 09/2026 |
Description | Where coronaviruses hide, where novel strains are generated, and how they get to us: Predicting reservoirs, recombination, and geographical hotspots |
Amount | £79,286 (GBP) |
Funding ID | NE/W002302/1 |
Organisation | Natural Environment Research Council |
Sector | Public |
Country | United Kingdom |
Start | 03/2021 |
End | 03/2022 |
Title | Data and code for: "Monkeypox virus shows potential to infect a diverse range of native animal species across Europe, indicating high risk of becoming endemic in the region." |
Description | Background: Monkeypox is a zoonotic virus which persists in animal reservoirs and periodically spills over into humans, causing outbreaks. During the current 2022 outbreak, monkeypox virus has persisted via human-human transmission, across all major continents and for longer than any previous record. This unprecedented spread creates the potential for the virus to 'spillback' into local susceptible animal populations. Persistent transmission amongst such animals raises the prospect of monkeypox virus becoming enzootic in new regions. However, the full and specific range of potential animal hosts and reservoirs of monkeypox remains unknown, especially in newly at-risk non-endemic areas. Methods: Here, our pipeline utilises ensembles of classifiers comprising different class balancing techniques and incorporating instance weights, to identify which animal species are potentially susceptible to monkeypox virus. Subsequently, we generate spatial distribution maps to highlight high-risk geographic areas at high resolution. Findings: We show that the number of potentially susceptible species is currently underestimated by 2.4 to 4.3-fold. We show a high density of susceptible wild hosts in Europe. We provide lists of these species, and highlight high-risk hosts for spillback and potential long-term reservoirs, which may enable monkeypox virus to become endemic. |
Type Of Material | Computer model/algorithm |
Year Produced | 2022 |
Provided To Others? | Yes |
Impact | N/A |
URL | https://figshare.com/articles/software/Blagrove_et_al_2022_poxvriuses_data_and_code/20485332 |
Title | Divide-and-conquer: data and codes |
Description | Data and codes associated with: Divide-and-conquer: Wardeh, M., Blagrove, M.S.C., Sharkey, K.J. et al. Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations. Nat Commun 12, 3954 (2021). https://doi.org/10.1038/s41467-021-24085-w. Abstract: Our knowledge of viral host ranges remains limited. Completing this picture by identifying unknown hosts of known viruses is an important research aim that can help identify and mitigate zoonotic and animal-disease risks, such as spill-over from animal reservoirs into human populations. To address this knowledge-gap we apply a divide-and-conquer approach which separates viral, mammalian and network features into three unique perspectives, each predicting associations independently to enhance predictive power. Our approach predicts over 20,000 unknown associations between known viruses and susceptible mammalian species, suggesting that current knowledge underestimates the number of associations in wild and semi-domesticated mammals by a factor of 4.3, and the average potential mammalian host-range of viruses by a factor of 3.2. In particular, our results highlight a significant knowledge gap in the wild reservoirs of important zoonotic and domesticated mammals' viruses: specifically, lyssaviruses, bornaviruses and rotaviruses. |
Type Of Material | Computer model/algorithm |
Year Produced | 2021 |
Provided To Others? | Yes |
Impact | The multi-perspective host-pathogen predictive framework underpins the following research awards: Where coronaviruses hide, where novel strains are generated, and how they get to us: Predicting reservoirs, recombination, and geographical hotspots (NE/W002302/1); Global trade of coronavirus hosts: bringing geographically isolated hosts and viruses together risks novel recombination and spillover to humans (BB/W00402X/1); and Predicting mammalian and avian reservoirs of coronaviruses: identifying current reservoirs and co-infection hosts in which future novel coronavirus could be generated (BBSRC IAA COVID - 168478) |
URL | https://doi.org/10.6084/m9.figshare.13270304 |
Title | Machine learning ensemble models to predict and quantify mammalian reservoirs of zoonoses |
Description | State of the art machine learning models to predict, quantify and explain sharing of zoonoses between humans and mammalian hosts. |
Type Of Material | Computer model/algorithm |
Year Produced | 2020 |
Provided To Others? | Yes |
Impact | N/A |
URL | https://figshare.com/articles/R-codes_and_datasets/11536470 |
Title | Models to explain Centrality in networks of shared pathogens |
Description | State-of-the-art machine learning model to explain drivers of centrality (host importance) and influence in networks of shared pathogens between hosts (e.g. non-human mammals). Uniquely, the model integrates a new metric of centrality and a systematic tool to select centrality measures in complex networks. |
Type Of Material | Computer model/algorithm |
Year Produced | 2020 |
Provided To Others? | Yes |
Impact | N/A |
URL | https://figshare.com/articles/R-codes_and_datasets/11536470 |
Title | Predicting mammalian hosts in which novel coronaviruses can be generated - codes |
Description | Novel pathogenic coronaviruses - such as SARS-CoV and probably SARS-CoV-2 - arise by homologous recombination between co-infecting viruses in a single cell. Identifying possible sources of novel coronaviruses therefore requires identifying hosts of multiple coronaviruses; however, most coronavirus-host interactions remain unknown. This novel method deploys a meta-ensemble of similarity learners from three complementary perspectives (viral, mammalian and network) to predict which mammals are hosts of multiple coronaviruses. The results predict that there are 11.5-fold more coronavirus-host associations, over 30-fold more potential SARS-CoV-2 recombination hosts, and over 40-fold more host species with four or more different subgenera of coronaviruses than have been observed to date at >0.5 mean probability cut-off (2.4-, 4.25- and 9-fold, respectively, at >0.9821). |
Type Of Material | Computer model/algorithm |
Year Produced | 2021 |
Provided To Others? | Yes |
Impact | These models underpin the following research awards: Where coronaviruses hide, where novel strains are generated, and how they get to us: Predicting reservoirs, recombination, and geographical hotspots (NE/W002302/1); Global trade of coronavirus hosts: bringing geographically isolated hosts and viruses together risks novel recombination and spillover to humans (BB/W00402X/1); and Predicting mammalian and avian reservoirs of coronaviruses: identifying current reservoirs and co-infection hosts in which future novel coronavirus could be generated (BBSRC IAA COVID - 168478) |
URL | https://figshare.com/articles/software/covs-recombination-hosts/13110896 |
Title | covs-recombination-hosts |
Description | Data and codes associated with Wardeh et al - Predicting mammalian hosts in which novel coronaviruses can be generated. Authors: Maya Wardeh, Matthew Bayls, Marcus Blagrove. Please email Maya Wardeh (maya.wardeh@liverpool.ac.uk) with any queries or requests. |
Type Of Material | Database/Collection of data |
Year Produced | 2021 |
Provided To Others? | Yes |
Impact | N/A |
URL | https://figshare.com/articles/software/covs-recombination-hosts/13110896/1 |
Description | Sapienza University of Rome |
Organisation | Sapienza University of Rome |
Country | Italy |
Sector | Academic/University |
PI Contribution | Exploring the mammalian virome to detect patterns of compatibility between mammal species and viruses at a global scale, identifying eco-biological profiles of viral carriers along the fast-slow continuum of mammalian life-history. |
Collaborator Contribution | Provided insight into virus-mammal interactions and the role virus traits play in the transmission/spill-over of viruses. |
Impact | Publication: Identifying patterns along the fast-slow continuum of mammalian viral carriers (in prep/under review) |
Start Year | 2022 |
Description | Species360 - Impact of global trade in wildlife on virus spread. |
Organisation | IDAs og Berg-Nielsens Studie-og støttefond |
Country | Denmark |
Sector | Charity/Non Profit |
PI Contribution | This partnership aims to quantify the impact of global trade in wild animals on the potential spread of emerging infectious zoonoses prioritized by the WHO Research and Development Blueprint Strategy. Our group provided data on the animal species in which these zoonoses have been found to date and the ecological role of these animals in the transmission of these pathogens (such as reservoir, dead-end and amplifying hosts), as well as information on how these pathogens manifest in the animal host (mortality, morbidity, minor symptoms, or no disease). |
Collaborator Contribution | Our partners are leading on the analyses to identify geographical patterns of trade in the animals identified above, and the key data gaps that need to be resolved to fully assess risks from international wildlife trade and put this information in the context of other drivers of zoonotic diseases. |
Impact | N/A |
Start Year | 2020 |
Description | Swansea University, Department of Biosciences |
Organisation | Swansea University |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | A collaboration was formed with Swansea University to develop new models for network analysis of shared pathogens between arthropod vectors and hosts. To date, the collaboration has resulted in 1 publication and 1 grant application. |
Collaborator Contribution | A collaboration was formed with Dr K Wells (Swansea University, Department of Biosciences) to develop new models for network analysis of shared pathogens between arthropod vectors and hosts. The aim of this partnership is to seek further funding to develop a set of tools to predict vector-borne disease emergence in UK/European livestock. |
Impact | N/A |
Start Year | 2018 |
Title | Network and machine analysis tools reveal reservoirs of zoonoses - Network builder |
Description | Codes, data and additional figures associated with manuscript (Integration of shared-pathogen networks and machine learning reveal key aspects of zoonoses and predict mammalian reservoirs, doi:10.1098/rspb.2019.2882) |
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | N/A |
URL | https://figshare.com/articles/NetworkBuilder/11537742 |
Title | Network and machine analysis tools reveal reservoirs of zoonoses - R solution and datasets (2020) |
Description | Codes, data and additional figures associated with manuscript (Integration of shared-pathogen networks and machine learning reveal key aspects of zoonoses and predict mammalian reservoirs, doi:10.1098/rspb.2019.2882) |
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | N/A |
URL | https://figshare.com/articles/R-codes_and_datasets/11536470 |
Description | BBC - Coronavirus: This is not the last pandemic |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Media (as a channel to the public) |
Results and Impact | BBC coverage of big data and virology research at the University of Liverpool. Includes coverage of EID2. |
Year(s) Of Engagement Activity | 2020 |
Description | BBC interview - AI used to 'predict the next coronavirus' |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | N/A |
Year(s) Of Engagement Activity | 2021 |
URL | https://www.bbc.co.uk/news/science-environment-56076716?fbclid=IwAR0AIa3il2XTl6QbrIlP7aCovc79To-tjPI... |
Description | Big Data epidemiology: turning trends into useful preventive medicine. Workshop at Society for Veterinary Epidemiology and Preventive Medicine Annual Conference 2018 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Research workshop demonstrating the usefulness of Big Data techniques and discussing potential uses within veterinary epidemiology. We particularly highlighted what we had been able to achieve in developing EID2. |
Year(s) Of Engagement Activity | 2018 |
Description | France24 interview - Quand l'IA part à la chasse au prochain coronavirus chez les mammifères (When AI goes hunting for the next coronavirus in mammals) |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Interview and newspaper article in France24. |
Year(s) Of Engagement Activity | 2021 |
URL | https://www.france24.com/fr/%C3%A9co-tech/20210218-quand-l-ia-part-%C3%A0-la-chasse-au-prochain-coro... |
Description | New York Times Interview - AI used to 'predict the next coronavirus' |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Media interview and newspaper article in NYT. |
Year(s) Of Engagement Activity | 2021 |
URL | https://www.nytimes.com/2021/02/16/science/Covid-reemerging-viruses.html?fbclid=IwAR0AIa3il2XTl6QbrI... |
Description | The Coronavirus Menagerie - New York Times coverage |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Newspaper report in NYT. |
Year(s) Of Engagement Activity | 2022 |
URL | https://www.nytimes.com/2022/02/22/health/coronavirus-animals.html |
Description | The Search for Animals That Could Carry the Next Deadly Virus - Wall Street Journal Interview |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Media interview with Wall Street Journal. |
Year(s) Of Engagement Activity | 2021 |
URL | https://www.wsj.com/articles/the-search-for-animals-that-could-carry-the-next-deadly-virus-116166736... |
Description | Virologists use AI to work on next pandemic outbreak. |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Radio interview in New Zealand |
Year(s) Of Engagement Activity | 2021 |
URL | https://www.rnz.co.nz/national/programmes/first-up/audio/2018783912/virologists-use-ai-to-work-on-ne... |