Big Data approaches to identifying potential sources of emerging pathogens in humans, domesticated animals and crops

Lead Research Organisation: University of Liverpool
Department Name: Institute of Infection and Global Health

Abstract

Emerging infectious diseases continue to pose major threats to humans, animals and plants. Recent years have seen significant outbreaks of several emerging diseases, ranging from the well-known (Ebola and Olive quick decline syndrome), to the previously little known (Zika), to the entirely novel (Schmallenberg), to name but a few. It is well established that the ability of a pathogen to infect multiple hosts, particularly hosts in different taxonomic orders or wildlife, is a risk factor for emergence in human and livestock pathogens. Emerging wildlife diseases have also been linked to 'spill-overs' from humans or domesticated animals. Despite the importance of cross-species disease transmission, relatively little attention has been paid to which species are the most important sources across communities (e.g., zoonotic, wildlife to domestic, plants to other kingdoms), which are the most prolific vectors, how those species acquired the pathogens, and by what means the diseases entered new species or populations. A major reason for this limited understanding is the lack of comprehensive data on the pathogens in animal and plant populations and, in most cases, poorly documented information on how they are transmitted, including to humans.
In this fellowship, I will improve and exploit a novel bioinformatic resource developed at the University of Liverpool to investigate how humans, their domesticated animals and crops are connected to the pathogen reservoir in other species, and how these pathogens pass from that reservoir to the focus populations. The bioinformatic resource, developed by me with funding from BBSRC, is the Enhanced Infectious Disease Database (EID2). EID2 utilises state-of-the-art text and data mining procedures to extract information from multiple sources, including millions of metadata records accompanying genetic sequences and scientific publications. After processing, EID2 provides evidence for over 60,000 interactions between species of hosts and pathogens and is the most comprehensive data source on the known pathogens of humans, animals, and plants and their geographical ranges.
During this fellowship, I aim to investigate the factors which lead to emergence of pathogens, asking the following questions:
1. What are the characteristics of the networks that connect species via shared pathogens? How central are humans and their domesticated animals and crops in these networks and which other species are each of those communities most closely connected to?
2. What is the effect of different pathogen transmission routes on the nature of these networks? Are the potential species-to-species transmission pathways different for direct, food-borne, water-borne and vector-borne pathogens?
3. What factors determine the host ranges of pathogens? Are host species more likely to become exposed to pathogens that infect a wide range of species? From species that are closer to them genetically? Or from those species with which they often interact?
4. What are we missing? Given the networks, transmission routes and host ranges, what is the risk associated with each pathogen emerging in new species? What are the pathogens that can be prioritised as more likely to emerge in the future?

Technical Summary

Despite the importance of cross-species disease transmission, relatively little attention has been paid to which species are the most important sources across communities, and by what means the diseases entered new species or populations. A reason for this is the lack of comprehensive data on pathogens in animals and plants and, often, poorly documented information on how they are transmitted. The objective of this fellowship is to transform host-pathogen interactions extracted from EID2 into multi-host ecological networks. Using network analysis, the typology, bridge species and metrics of these networks will be studied. By implementing a centrality index based on a cohort of centrality measures, potential super-spreaders will be identified.
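As a rough illustration of the network step described above, the following Python sketch projects host-pathogen pairs onto a host-host network of shared pathogens and then combines two simple centrality measures into a single index. The interaction records, the function names and the choice of measures (degree and strength, averaged after scaling) are all assumptions made for illustration; they are not EID2 data and not the centrality index developed in the fellowship.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical host-pathogen records; not taken from EID2.
INTERACTIONS = [
    ("human", "virus_A"), ("human", "virus_B"),
    ("cattle", "virus_A"), ("cattle", "bacterium_C"),
    ("bat", "virus_B"), ("bat", "virus_D"),
    ("pig", "virus_A"), ("pig", "virus_B"), ("pig", "bacterium_C"),
]


def shared_pathogen_network(interactions):
    """Project host-pathogen pairs onto a host-host network.

    Returns {frozenset({host1, host2}): number of shared pathogens}.
    """
    hosts_by_pathogen = defaultdict(set)
    for host, pathogen in interactions:
        hosts_by_pathogen[pathogen].add(host)
    edges = defaultdict(int)
    for hosts in hosts_by_pathogen.values():
        for a, b in combinations(sorted(hosts), 2):
            edges[frozenset((a, b))] += 1
    return dict(edges)


def degree_and_strength(edges):
    """Degree = number of neighbours; strength = sum of edge weights."""
    degree = defaultdict(int)
    strength = defaultdict(int)
    for pair, weight in edges.items():
        for host in pair:
            degree[host] += 1
            strength[host] += weight
    return dict(degree), dict(strength)


def combined_index(*measures):
    """Scale each measure by its maximum, then average per host.

    An assumed, deliberately simple way of pooling several measures.
    """
    hosts = set().union(*measures)
    index = {}
    for host in hosts:
        scaled = [m.get(host, 0) / max(m.values()) for m in measures]
        index[host] = sum(scaled) / len(scaled)
    return index


edges = shared_pathogen_network(INTERACTIONS)
degree, strength = degree_and_strength(edges)
index = combined_index(degree, strength)
for host in sorted(index, key=index.get, reverse=True):
    print(f"{host}: {index[host]:.2f}")
```

In this toy network the host ranked highest by the combined index is the one that shares the most pathogens with the most other hosts; a real analysis would draw on many more measures and a principled way of selecting among them.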
Transmission routes play a role in determining the parts of the population that are infected, the range of species that can be infected, and the measures required to bring an outbreak under control. A transmission route identification system, utilising an ensemble of classifiers, will be developed to categorise scientific publications to identify transmission routes. Consequences of different routes will be assessed, and their effect on centrality will be quantified.
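The idea of an ensemble of classifiers voting on the transmission route mentioned in a publication can be sketched as follows. This is a toy hard-voting ensemble of three hand-written keyword rules; the keyword lists, function names and example sentences are hypothetical, and a real system would presumably use classifiers trained on labelled literature rather than fixed rules.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real system would learn features from labelled text.
ROUTE_KEYWORDS = {
    "vector-borne": ("mosquito", "tick", "midge", "vector"),
    "food-borne": ("food", "meat", "milk", "ingestion"),
    "water-borne": ("water", "drinking", "sewage"),
    "direct": ("contact", "aerosol", "bite", "droplet"),
}
KEYWORD_TO_ROUTE = {
    kw: route for route, kws in ROUTE_KEYWORDS.items() for kw in kws
}


def tokens(text):
    """Lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def by_total_mentions(text):
    """Route whose keywords are mentioned most often in total."""
    counts = Counter(
        KEYWORD_TO_ROUTE[t] for t in tokens(text) if t in KEYWORD_TO_ROUTE
    )
    return counts.most_common(1)[0][0] if counts else None


def by_first_mention(text):
    """Route of the first keyword that appears."""
    for t in tokens(text):
        if t in KEYWORD_TO_ROUTE:
            return KEYWORD_TO_ROUTE[t]
    return None


def by_distinct_keywords(text):
    """Route with the most distinct keywords present."""
    present = sorted({t for t in tokens(text) if t in KEYWORD_TO_ROUTE})
    counts = Counter(KEYWORD_TO_ROUTE[t] for t in present)
    return counts.most_common(1)[0][0] if counts else None


CLASSIFIERS = (by_total_mentions, by_first_mention, by_distinct_keywords)


def ensemble_vote(text):
    """Hard majority vote over the member classifiers; None if none fire."""
    votes = [v for v in (clf(text) for clf in CLASSIFIERS) if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None


example = (
    "The virus is spread by the bite of an infected mosquito; "
    "the mosquito vector breeds in standing water."
)
print(ensemble_vote(example))
```

For the example sentence the members disagree (the first-mention rule picks up "bite"), and the vote resolves the disagreement in favour of the route supported by the other two members.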
The range of hosts that a pathogen infects influences the dynamics of pathogen transmission and emergence in novel hosts. Statistical models will be developed to rank different traits of hosts in relation to host-specificity. The effects of the traits on centrality will be quantified.
Predicting unobserved or future host-pathogen interactions is imperative for one-health, enabling identification of potential emergence and spill-over events. A link prediction model on host-pathogen networks, with links either positive or negative, will be constructed. A system will be developed to capture host-pathogen negative interactions from scientific literature. The results will be incorporated into the multi-host networks; the effect on centrality will be quantified.
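To make the notion of link prediction on a host-pathogen network concrete, here is a minimal similarity-based sketch in Python. It scores each unobserved host-pathogen pair by how similar the host's known pathogen set is (Jaccard similarity) to that of the most similar known carrier. The data and function names are hypothetical, the scoring rule is an assumed baseline rather than the model proposed in this fellowship, and the sketch uses positive links only, whereas the text above also envisages negative links.

```python
from collections import defaultdict

# Hypothetical host-pathogen records; not taken from EID2.
INTERACTIONS = [
    ("human", "virus_A"), ("human", "virus_B"),
    ("cattle", "virus_A"), ("cattle", "bacterium_C"),
    ("bat", "virus_B"), ("bat", "virus_D"),
    ("pig", "virus_A"), ("pig", "virus_B"), ("pig", "bacterium_C"),
]


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_unobserved_links(interactions):
    """Score every host-pathogen pair that is not in the observed data.

    Score = highest Jaccard similarity between the host's pathogen set
    and the pathogen set of any host already known to carry the pathogen.
    """
    pathogens_by_host = defaultdict(set)
    hosts_by_pathogen = defaultdict(set)
    for host, pathogen in interactions:
        pathogens_by_host[host].add(pathogen)
        hosts_by_pathogen[pathogen].add(host)

    scores = {}
    for host, known in pathogens_by_host.items():
        for pathogen, carriers in hosts_by_pathogen.items():
            if pathogen in known:
                continue  # already observed, nothing to predict
            scores[(host, pathogen)] = max(
                jaccard(known, pathogens_by_host[c]) for c in carriers
            )
    return scores


scores = score_unobserved_links(INTERACTIONS)
for (host, pathogen), s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{host} - {pathogen}: {s:.2f}")
```

Pairs with high scores are candidate unobserved interactions; a score of zero means the host shares no known pathogen with any known carrier in this toy data.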

Publications

 
Description Global trade of coronavirus hosts: bringing geographically isolated hosts and viruses together risks novel recombination and spillover to humans
Amount £117,406 (GBP)
Funding ID BB/W00402X/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 07/2021 
End 05/2022
 
Description Predicting mammalian and avian reservoirs of coronaviruses: identifying current reservoirs and co-infection hosts in which future novel coronavirus could be generated
Amount £11,008 (GBP)
Funding ID BBSRC IAA COVID - 168478 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 03/2021 
End 06/2021
 
Description Vector in the machine: How accurately can mosquito transmission of viruses be predicted by machine learning?
Amount £181,248 (GBP)
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 10/2022 
End 09/2026
 
Description Where coronaviruses hide, where novel strains are generated, and how they get to us: Predicting reservoirs, recombination, and geographical hotspots
Amount £79,286 (GBP)
Funding ID NE/W002302/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 03/2021 
End 03/2022
 
Title Data and code for: "Monkeypox virus shows potential to infect a diverse range of native animal species across Europe, indicating high risk of becoming endemic in the region." 
Description Background: Monkeypox is a zoonotic virus which persists in animal reservoirs and periodically spills over into humans, causing outbreaks. During the current 2022 outbreak, monkeypox virus has persisted via human-human transmission, across all major continents and for longer than any previous record. This unprecedented spread creates the potential for the virus to 'spillback' into local susceptible animal populations. Persistent transmission amongst such animals raises the prospect of monkeypox virus becoming enzootic in new regions. However, the full and specific range of potential animal hosts and reservoirs of monkeypox remains unknown, especially in newly at-risk non-endemic areas. Methods: Here, our pipeline utilises ensembles of classifiers comprising different class balancing techniques and incorporating instance weights, to identify which animal species are potentially susceptible to monkeypox virus. Subsequently, we generate spatial distribution maps to highlight high-risk geographic areas at high resolution. Findings: We show that the number of potentially susceptible species is currently underestimated by 2.4 to 4.3-fold. We show a high density of susceptible wild hosts in Europe. We provide lists of these species, and highlight high-risk hosts for spillback and potential long-term reservoirs, which may enable monkeypox virus to become endemic. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact N/A 
URL https://figshare.com/articles/software/Blagrove_et_al_2022_poxvriuses_data_and_code/20485332
 
Title Divide-and-conquer: data and codes 
Description Data and codes associated with: Wardeh, M., Blagrove, M.S.C., Sharkey, K.J. et al. Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations. Nat Commun 12, 3954 (2021). https://doi.org/10.1038/s41467-021-24085-w. Abstract: Our knowledge of viral host ranges remains limited. Completing this picture by identifying unknown hosts of known viruses is an important research aim that can help identify and mitigate zoonotic and animal-disease risks, such as spill-over from animal reservoirs into human populations. To address this knowledge-gap we apply a divide-and-conquer approach which separates viral, mammalian and network features into three unique perspectives, each predicting associations independently to enhance predictive power. Our approach predicts over 20,000 unknown associations between known viruses and susceptible mammalian species, suggesting that current knowledge underestimates the number of associations in wild and semi-domesticated mammals by a factor of 4.3, and the average potential mammalian host-range of viruses by a factor of 3.2. In particular, our results highlight a significant knowledge gap in the wild reservoirs of important zoonotic and domesticated mammals' viruses: specifically, lyssaviruses, bornaviruses and rotaviruses.
Type Of Material Computer model/algorithm 
Year Produced 2021 
Provided To Others? Yes  
Impact The multi-perspective host-pathogen predictive framework underpins the following research awards: Where coronaviruses hide, where novel strains are generated, and how they get to us: Predicting reservoirs, recombination, and geographical hotspots (NE/W002302/1); Global trade of coronavirus hosts: bringing geographically isolated hosts and viruses together risks novel recombination and spillover to humans (BB/W00402X/1); and Predicting mammalian and avian reservoirs of coronaviruses: identifying current reservoirs and co-infection hosts in which future novel coronavirus could be generated (BBSRC IAA COVID - 168478)
URL https://doi.org/10.6084/m9.figshare.13270304
 
Title Machine learning ensemble models to predict and quantify mammalian reservoirs of zoonoses 
Description State of the art machine learning models to predict, quantify and explain sharing of zoonoses between humans and mammalian hosts. 
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact N/A
URL https://figshare.com/articles/R-codes_and_datasets/11536470
 
Title Models to explain Centrality in networks of shared pathogens 
Description State of the art machine learning model to explain drivers of centrality (host importance) and influence in networks of shared pathogens between hosts (e.g. non-human mammals). Uniquely, the model integrates a new metric of centrality and a systematic tool to select centrality measures in complex networks.
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact N/A 
URL https://figshare.com/articles/R-codes_and_datasets/11536470
 
Title Predicting mammalian hosts in which novel coronaviruses can be generated - codes 
Description Novel pathogenic coronaviruses - such as SARS-CoV and probably SARS-CoV-2 - arise by homologous recombination between co-infecting viruses in a single cell. Identifying possible sources of novel coronaviruses therefore requires identifying hosts of multiple coronaviruses; however, most coronavirus-host interactions remain unknown. This novel method deploys a meta-ensemble of similarity learners from three complementary perspectives (viral, mammalian and network), to predict which mammals are hosts of multiple coronaviruses. The results predict that there are 11.5-fold more coronavirus-host associations, over 30-fold more potential SARS-CoV-2 recombination hosts, and over 40-fold more host species with four or more different subgenera of coronaviruses than have been observed to date at >0.5 mean probability cut-off (2.4-, 4.25- and 9-fold, respectively, at >0.9821).
Type Of Material Computer model/algorithm 
Year Produced 2021 
Provided To Others? Yes  
Impact These models underpin the following research awards: Where coronaviruses hide, where novel strains are generated, and how they get to us: Predicting reservoirs, recombination, and geographical hotspots (NE/W002302/1); Global trade of coronavirus hosts: bringing geographically isolated hosts and viruses together risks novel recombination and spillover to humans (BB/W00402X/1); and Predicting mammalian and avian reservoirs of coronaviruses: identifying current reservoirs and co-infection hosts in which future novel coronavirus could be generated (BBSRC IAA COVID - 168478)
URL https://figshare.com/articles/software/covs-recombination-hosts/13110896
 
Title covs-recombination-hosts 
Description Data and codes associated with Wardeh et al - Predicting mammalian hosts in which novel coronaviruses can be generated. Authors: Maya Wardeh, Matthew Bayls, Marcus Blagrove. Please email Maya Wardeh (maya.wardeh@liverpool.ac.uk) with any queries or requests.
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact N/A 
URL https://figshare.com/articles/software/covs-recombination-hosts/13110896/1
 
Description Sapienza University of Rome 
Organisation Sapienza University of Rome
Country Italy 
Sector Academic/University 
PI Contribution Exploring the mammalian virome to detect patterns of compatibility between mammal species and viruses at a global scale, identifying eco-biological profiles of viral carriers along the fast-slow continuum of mammalian life-history.
Collaborator Contribution Provided insight into virus-mammal interactions, and the role virus traits have in the transmission/spill-over of viruses.
Impact Publication: Identifying patterns along the fast-slow continuum of mammalian viral carriers (in prep/under review)
Start Year 2022
 
Description Species360 - Impact of global trade in wildlife on virus spread. 
Organisation IDAs og Berg-Nielsens Studie-og støttefond
Country Denmark 
Sector Charity/Non Profit 
PI Contribution This partnership aims at quantifying the impact of global trade in wild animals on the potential spread of emerging infectious zoonoses prioritized by the WHO Research and Development Blueprint Strategy. Our group provided data on animal species in which these zoonoses have been found to date; and the ecological role of these animals in the transmission of these pathogens (such as reservoirs; dead-end; and amplifying hosts), as well as information on how these pathogens manifest in the animal host (mortality, morbidity, minor symptoms, or no disease).
Collaborator Contribution Our partners are leading on the analyses to identify geographical patterns of trade in the animals identified above, and the key data gaps that need to be resolved to fully assess risks from international wildlife trade and put this information in the context of other drivers of zoonotic diseases.
Impact N/A
Start Year 2020
 
Description Swansea University, Department of Biosciences 
Organisation Swansea University
Country United Kingdom 
Sector Academic/University 
PI Contribution Collaboration was formed with Swansea University in order to develop new models for network analysis of shared pathogens between arthropod vectors and hosts. To date, the collaboration has resulted in 1 publication and 1 grant application.
Collaborator Contribution Collaboration was formed with Dr K Wells (Swansea University, Department of Biosciences) in order to develop new models for network analysis of shared pathogens between arthropod vectors and hosts. The aim of this partnership is to seek further funding to develop a set of tools to predict vector-borne disease emergence in UK/European livestock.
Impact N/A
Start Year 2018
 
Title Network and machine analysis tools reveal reservoirs of zoonoses - Network builder 
Description Codes, data and additional figures associated with manuscript (Integration of shared-pathogen networks and machine learning reveal key aspects of zoonoses and predict mammalian reservoirs, doi:10.1098/rspb.2019.2882)
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact N/A 
URL https://figshare.com/articles/NetworkBuilder/11537742
 
Title Network and machine analysis tools reveal reservoirs of zoonoses - R solution and datasets (2020) 
Description Codes, data and additional figures associated with manuscript (Integration of shared-pathogen networks and machine learning reveal key aspects of zoonoses and predict mammalian reservoirs, doi:10.1098/rspb.2019.2882) 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact N/A 
URL https://figshare.com/articles/R-codes_and_datasets/11536470
 
Description BBC - Coronavirus: This is not the last pandemic 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact BBC coverage of big data and virology research at the University of Liverpool. Includes coverage of EID2.
Year(s) Of Engagement Activity 2020
 
Description BBC interview - AI used to 'predict the next coronavirus' 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact N/A
Year(s) Of Engagement Activity 2021
URL https://www.bbc.co.uk/news/science-environment-56076716?fbclid=IwAR0AIa3il2XTl6QbrIlP7aCovc79To-tjPI...
 
Description Big Data epidemiology: turning trends into useful preventive medicine. Workshop at Society for Veterinary Epidemiology and Preventive Medicine Annual Conference 2018 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Research workshop demonstrating the usefulness of Big Data techniques and discussing potential uses within veterinary epidemiology. We particularly highlighted what we had been able to achieve in developing the EID2.
Year(s) Of Engagement Activity 2018
 
Description France24 interview - Quand l'IA part à la chasse au prochain coronavirus chez les mammifères
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact Interview and newspaper article in France24.
Year(s) Of Engagement Activity 2021
URL https://www.france24.com/fr/%C3%A9co-tech/20210218-quand-l-ia-part-%C3%A0-la-chasse-au-prochain-coro...
 
Description New York Times Interview - AI used to 'predict the next coronavirus' 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact Media interview and newspaper article in NYT.
Year(s) Of Engagement Activity 2021
URL https://www.nytimes.com/2021/02/16/science/Covid-reemerging-viruses.html?fbclid=IwAR0AIa3il2XTl6QbrI...
 
Description The Coronavirus Menagerie - New York Times coverage 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact Newspaper report in NYT.
Year(s) Of Engagement Activity 2022
URL https://www.nytimes.com/2022/02/22/health/coronavirus-animals.html
 
Description The Search for Animals That Could Carry the Next Deadly Virus - Wall Street Journal Interview 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact Media interview with Wall Street Journal.
Year(s) Of Engagement Activity 2021
URL https://www.wsj.com/articles/the-search-for-animals-that-could-carry-the-next-deadly-virus-116166736...
 
Description Virologists use AI to work on next pandemic outbreak. 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact Radio interview in New Zealand
Year(s) Of Engagement Activity 2021
URL https://www.rnz.co.nz/national/programmes/first-up/audio/2018783912/virologists-use-ai-to-work-on-ne...