Ecology or genetics? Adapting machine learning approaches to understand determinants of cross-species transmission and virulence in RNA viruses

Lead Research Organisation: University of Liverpool
Department Name: Biostatistics

Abstract

Emerging infectious diseases from animal sources continue to threaten human health, exemplified by the spread and severe disease of recent Ebola virus, Zika virus and MERS coronavirus outbreaks. The WHO has noted the serious possibility of a new emerging pathogen causing a public health crisis, denoting this 'Disease X'.

'Disease X' is most likely to be caused by an RNA virus, as RNA viruses evolve faster and are more likely to emerge and infect humans than other pathogens. Zoonotic viruses (i.e. those that transmit cross-species from non-human animals to humans) are also known to have higher emergence risks.

Although some zoonotic viruses cause severe and life-threatening illness upon infecting humans, others appear to cause mild disease or no disease at all. To produce early predictions of the public health impacts of 'Disease X', it is essential to identify which factors drive this variation in 'virulence' (i.e. how severe disease outcomes are).

However, we currently only have a poor understanding of which factors drive infection and virulence risk in cross-species transmissions, partly because of the lack of available risk factor information. The traditional approach is to identify ecological risk factors using classical statistical models, though these models are often too reductionist to capture the complex evolutionary patterns behind emergence.

Additionally, the ease of modern RNA sequencing has led to a much wider availability of large genetic data resources for viruses. Genetic patterns or 'motifs' recur throughout virus sequences, with certain motifs recurring more often within infections of certain hosts, which may aid virus replication or evasion of the immune system. Genetic sequences could therefore hold important signals towards predicting infection or virulence within a new host after cross-species transmission. However, finding practical ways of capturing motifs for predictive modelling has proven challenging due to the large volumes of potential information within sequence data.
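
As a minimal sketch of how sequence composition might be turned into model predictors (an illustration only, not the project's actual pipeline), the Python example below counts short overlapping motifs (k-mers) in an RNA sequence and converts them to relative frequencies. The function name `kmer_frequencies` and the toy sequence are assumptions introduced here for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2):
    """Return the relative frequency of every possible RNA k-mer in `sequence`.

    k-mers containing ambiguous bases (anything other than A, C, G, U) are ignored.
    """
    alphabet = "ACGU"
    # Accept DNA-style input by mapping T to U.
    seq = sequence.upper().replace("T", "U")
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Toy sequence, invented for illustration only.
toy_sequence = "AUGGCUAUGCCGUAA"
features = kmer_frequencies(toy_sequence, k=2)
print(sorted(features.items(), key=lambda kv: -kv[1])[:4])
```

With k = 2 this gives 16 dinucleotide frequencies per sequence. The number of features grows as 4^k, which is one simple way of seeing why the volume of potential information in sequence data becomes hard to manage.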

The central goal of this research is to combine both ecology and genetics to improve predictions of which animal viruses pose the greatest risks of emergence and severe disease in humans.

To unlock this potential power in RNA virus genetic sequences, new analytical approaches are needed. I will apply machine learning as a state-of-the-art modelling method. Machine learning models can predict outcomes based on large sets of highly diverse predictors and complex interactions. These models will allow me to identify key genetic motifs influencing cross-species transmission and directly compare genetic and ecological traits. To improve predictive performance, I will compare a range of machine learning algorithms (e.g., classification and regression trees, support vector machines) and approaches (e.g. 'bagging', aggregation over many individual models; and 'boosting', allowing models to gradually learn).
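
To make the idea of 'bagging' concrete, the sketch below aggregates many very simple classifiers (one-split decision 'stumps'), each trained on a bootstrap resample of the data, and takes a majority vote. It is a hedged illustration using only the Python standard library. The feature meanings, the data values and all function names are hypothetical and are not drawn from EID2 or from the proposed models.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-split classifier: choose the feature, threshold and direction
    that give the fewest training errors."""
    best = None  # (errors, feature index, threshold, label predicted above threshold)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for above in (0, 1):
                errors = sum(
                    (above if row[f] > t else 1 - above) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    return best[1:]


def predict_stump(stump, row):
    f, t, above = stump
    return above if row[f] > t else 1 - above


def fit_bagged(X, y, n_models=25, seed=0):
    """Bagging: train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagged(models, row):
    """Aggregate the individual models by majority vote."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Invented toy data: [genome GC content, number of known host orders];
# label 1 = infects humans, 0 = does not (hypothetical values).
X = [[0.30, 1], [0.35, 3], [0.32, 2], [0.38, 5],
     [0.55, 2], [0.60, 4], [0.58, 1], [0.62, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

models = fit_bagged(X, y)
print(predict_bagged(models, [0.25, 2]), predict_bagged(models, [0.70, 2]))
```

Because each stump sees a slightly different resample, the errors of individual models tend to average out in the vote. Boosting differs in that models are fitted sequentially, each concentrating on what the previous ones got wrong.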

This research will identify patterns across all known mammal and bird RNA viruses by using the exceptional breadth of data within the Enhanced Infectious Disease Database (EID2), developed at the University of Liverpool. EID2 contains infection data from 29,500 host-pathogen pairs, automatically collected from genetic records (GenBank) and scientific literature texts (PubMed). Despite virulence being a key virus trait, no comparably sized resources exist describing disease outcomes. Extending the EID2 platform, I will develop automated text mining tools to capture data on disease outcomes in different hosts from scientific texts describing experimental infections.

This research will test evolutionary theory across a large diversity of RNA viruses. The proposed machine learning models will inform public health risk assessment by improving our capacity to predict emergence and suggesting strategic target viruses or hosts for preventing future disease outbreaks from cross-species transmission.

Technical Summary

Emerging infectious diseases remain a prominent threat to global health, e.g., Ebola virus, Zika virus. In 2015, the WHO designated 'Disease X' to indicate the serious potential of previously unknown emerging pathogens to cause public health crises.

Though zoonotic RNA viruses are known to present higher risks of emergence, detailed determinants of cross-species transmission remain unclear. Zoonotic viruses also vary widely in their capability to cause severe disease. To predict public health impacts of 'Disease X', a better understanding of which traits drive this variation in infectivity and virulence is urgently needed.

Whilst previous approaches have focused on ecological predictors, these traditional frameworks have been unable to capture the information within increasingly available RNA virus sequences. This research aims to capitalise upon the potential power within large genetic data resources and quantify comparative influences of genetic versus ecological traits of RNA viruses and hosts upon cross-species transmission dynamics.

To fully integrate novel, high-dimensional genetic data, new analytical approaches are needed. I will apply machine learning as a state-of-the-art statistical methodology, comparing several advanced approaches, e.g. gradient boosting, a method of gradual model learning which often outperforms traditional statistical methods in predictive accuracy.
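
The 'gradual learning' behind gradient boosting can be sketched as follows: starting from a constant prediction, each round fits a small model to the current residuals and adds a damped fraction of it to the running prediction. The example is a simplified squared-error version in plain Python, given under stated assumptions. The single predictor, the outcome scores and all names are invented for illustration and do not represent the models proposed here.

```python
def fit_regression_stump(x, residuals):
    """Find the single threshold on x that best splits the residuals
    (lowest sum of squared errors around the two group means)."""
    best = None  # (sse, threshold, left mean, right mean)
    for t in sorted(set(x))[:-1]:  # drop the largest value so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def fit_boosted(x, y, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared error: each stump is fitted to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        t, lm, rm = fit_regression_stump(x, residuals)
        stumps.append((t, lm, rm))
        # Take only a small step towards the stump's correction.
        preds = [p + learning_rate * (lm if xi <= t else rm) for xi, p in zip(x, preds)]
    return base, learning_rate, stumps


def predict(model, xi):
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if xi <= t else rm) for t, lm, rm in stumps)


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Invented toy data: one genomic predictor and a hypothetical virulence score.
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = fit_boosted(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [predict(model, xi) for xi in x])
print(baseline_mse, boosted_mse)
```

The learning rate controls how much of each correction is applied; smaller values learn more slowly but are usually less prone to overfitting, which in practice would be assessed on held-out data rather than on the training error shown here.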

Models will span all known mammal and avian RNA viruses (22 families) using the exceptional breadth of EID2, a large, host-virus infectivity dataset. This project will additionally develop further text-mining tools to capture and integrate virulence data within EID2.

The proposed models will allow tests of evolutionary theory across a range of RNA viruses. Quantified model outputs will contribute to public health risk assessments by informing prioritisation for novel viruses and advancing frameworks for emergence predictions, moving towards a 'smarter', empirically-driven strategy to prevent future disease burden.

Planned Impact

The proposed research addresses an interdisciplinary research question and a critical knowledge gap in global health. The research is therefore likely to be of significant interest and generate impact amongst a wide variety of stakeholders at the national and international level, including policy makers, health authorities, clinicians, and the general public.

The research aims to identify and quantify the determinants of cross-species transmission, with a perspective to identifying traits correlated with increased risks of infectivity and virulence within humans. As such, the research has the potential to inform various strategic measures towards prediction and prevention of the next human pandemic.

Firstly, the identified ecological and/or genomic predictors will suggest general, broad groups of viruses with increased risks of human infectivity and virulence. Research dissemination channels will be chosen such that scientific publications produced from the research will have sufficient reach to global health authorities. Model results and conclusions can then inform ongoing prioritisation or ranking exercises of pathogens likely to cause public health emergencies, e.g. World Health Organisation's Research & Development Preparedness Blueprint. This can in turn lead to target animals and/or geographic locations being identified that might benefit from prioritised resource allocation.

Secondly, the developed statistical machine learning models may offer inference as a predictive tool. Once developed, models can potentially be applied to novel emerging viruses as additional test sets to make immediate predictions of risk, as has been demonstrated for machine learning models predicting host type.

Longer-term, the quantification of relationships between infectivity/virulence and genetic predictors such as genomic biases is of especially high relevance. Traditionally, predictive comparative analyses have relied on trait information from human infections. However, methods using genomic predictors would allow functional estimates of virulence for animal viruses not yet known to infect humans as soon as a genetic sequence becomes available. Research outputs will therefore further assist in the development of predictive tools or programmes that improve the accuracy and timeliness of risk assessments for newly identified animal viruses.

The value of early prioritisation and prediction is straightforward - by informing candidate pathogens likely to have high infectivity and/or cause severe disease in humans, resources for surveillance and intervention can be better allocated by policy makers and health authorities in an empirically-informed strategy. This will optimise efforts to: improve global health by preventing future zoonotic disease burden, particularly in developing countries; mitigate substantial economic costs associated with disease outbreaks; and ultimately contribute to preventing pandemic spread of future emerging viruses.
 
Description CSL Seqirus/Pandemic Institute Collaboration Fund
Amount £495,394 (GBP)
Organisation Seqirus 
Sector Private
Country United States
Start 11/2022 
End 04/2024
 
Description Wellcome Trust Institutional Strategic Support Fund
Amount £13,604 (GBP)
Organisation University of Liverpool 
Sector Academic/University
Country United Kingdom
Start 10/2022 
End 12/2022
 
Description Verena: Viral Emergence Research Initiative 
Organisation Georgetown University
Country United States 
Sector Academic/University 
PI Contribution Bringing genomic and ML expertise as a comparative modeller to the consortium
Collaborator Contribution Consortium otherwise contains experts in global ecology, immunology and experimental design, and phylogeny and network models.
Impact "Data proliferation, reconciliation, and synthesis in viral ecology". Rory Gibb, Gregory F. Albery, Daniel J. Becker, Liam Brierley, Ryan Connor, Tad A. Dallas, Evan A. Eskew, Maxwell J. Farrell, Angela L. Rasmussen, Sadie J. Ryan, Amy Sweeny, Colin J. Carlson, Timothée Poisot. bioRxiv 2021.01.14.426572; doi: https://doi.org/10.1101/2021.01.14.426572
Start Year 2020
 
Description "Origins of the novel coronavirus" (SitP Online) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact This talk aimed at the general public discusses where new viruses come from, reviews evidence surrounding the source of SARS-CoV-2, and looks at how information spread during the early epidemic, including some unsubstantiated claims. Hosted and organised through the Merseyside Skeptics Society. Delivered digitally via Twitch. Q&A was taken live afterwards.
Year(s) Of Engagement Activity 2021
URL https://www.youtube.com/watch?v=tnaqyHBJpUE
 
Description 18th Annual Ecology & Evolution of Infectious Disease Conference, online 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented digital poster 'Predicting the animal hosts of coronaviruses from genomic data through machine learning'. Interest was generated in the published research.
Year(s) Of Engagement Activity 2021
 
Description ASM & FEMS World Microbe Forum, online 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented digital poster 'Predicting the animal hosts of coronaviruses from genomic data through machine learning'. Interest was generated in the published research.
Year(s) Of Engagement Activity 2021
 
Description BES & SFE2 joint Annual Meeting (Ecology Across Borders) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented talk in Liverpool 'Building a comparative database of pathogen tropisms from published literature'. A new collaboration is being discussed with Cardiff University as a result of this engagement.
Year(s) Of Engagement Activity 2021
 
Description British Society for Ecology Annual Meeting, online 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Presented on-demand talk 'Predicting the animal hosts of coronaviruses from genomic data through machine learning'. Interest was generated in the preprint research.
Year(s) Of Engagement Activity 2020
URL https://youtu.be/rqAlyBxUKxo
 
Description Data Mining and Machine Learning Group talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Invited talk given to Data Mining and Machine Learning Group, Department of Computer Science, University of Liverpool. Discussed with non-biologists how computer science techniques can be applied to public health data problems.
Year(s) Of Engagement Activity 2020
 
Description EMBO Workshop: Codon usage: Function, mechanism, and evolution, Edinburgh. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Presented digital poster 'Codon usage biases as informative machine learning features to predict coronavirus origins'. Interest was generated in the published research.
Year(s) Of Engagement Activity 2022
 
Description European R users' meeting (online) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented technical lightning talk "Using open-access data to derive genome composition of emerging viruses" and showed use case for software packages in own research. Invited to submit an rOpenSci blog post on this basis.
Year(s) Of Engagement Activity 2020
URL https://www.youtube.com/watch?v=79lUgThZ4HE
 
Description RCVS Mythbusting: Will good weather affect infection rates of COVID-19? 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Series of short podcasts/videos aimed at practising veterinarians and the general public to provide evidence-based guidance from current research. Invited to present on the topic by RCVS.
Year(s) Of Engagement Activity 2020
URL https://knowledge.rcvs.org.uk/covid-19/covid-19-mythbusting/will-good-weather-affect-infection-rates...
 
Description Royal Statistical Society International Conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Presented talk in Manchester 'Generating data on pathogen tropisms from published literature'. Interest was generated in creating a further conference session the next year focused more on this topic.
Year(s) Of Engagement Activity 2021
 
Description Royal Statistical Society International Conference 2022 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented talk 'Which animals are emerging viruses likely to come from?'. Interest was generated in the published research.
Year(s) Of Engagement Activity 2022
 
Description Royal Statistical Society Young Statisticians' Meeting 2022 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Presented plenary talk 'How can we predict the next pandemic?'. Interest was generated in the published research and my career path to date.
Year(s) Of Engagement Activity 2022
 
Description Science outreach - The Global Science From Home Show (Twitter) 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Chain of social media links between global scientists explaining or demonstrating a topic for 5 minutes.
Year(s) Of Engagement Activity 2020
URL https://twitter.com/L_Brierley/status/1243524486449836032
 
Description Science outreach show, "A Virus To End Humanity?" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Public engagement show presenting an interactive pandemic scenario that explores whether humanity is really at risk of extinction from the next viral outbreak. Audience members reported learning new things and having changed perspectives on the subject matter after debate and interactive discussion during the show. Performed at the Cabaret of Dangerous Ideas, Edinburgh Fringe Festival 2017 (65 tickets, sold out); Science Discovery Day Festival, University of St Andrews 2020; and Skeptics in the Pub groups (2018: Birmingham, Bournemouth, Coventry, Leicester, 2019: High Wycombe).
Year(s) Of Engagement Activity 2017,2018,2019,2020
 
Description Verena Lighthouse talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented talk 'Predicting the animal hosts of coronaviruses from genomic data through machine learning' internally to large global consortium. Interest was generated in the methods and further potential to work together.
Year(s) Of Engagement Activity 2020
 
Description Verena blog: Can AI help us trace Omicron's origins? 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Co-authored blog post based around a paper supported by this same grant (Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning). Intended purpose was to demonstrate good practice and cautious judgement in applying genomics-informed machine learning models to SARS-CoV-2 data in the wake of sudden increased interest in the potential animal spillback origins of the Omicron variant. Intended as a primer and guidance to researchers to ensure the evolutionary scale of their models matches that of their research question, and to prevent misinformed policy and reporting from flawed models.
Year(s) Of Engagement Activity 2021
URL https://www.viralemergence.org/blog/can-ai-help-us-trace-omicrons-origins
 
Description rOpenSci blog post 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited blog post for rOpenSci building on software shown in eRum talk. Generated interest among researchers in relevant fields and created networking opportunities with rOpenSci members which I will draw on in writing my own software package.
Year(s) Of Engagement Activity 2020
URL https://ropensci.org/blog/2020/11/10/coronaviruses-and-hosts/