Predicting emergence risk of future zoonotic viruses through computational learning

Lead Research Organisation: University of Glasgow
Department Name: MRC Centre for Virus Research

Abstract

Despite substantial research, we have so far failed to predict which viruses will emerge to cause outbreaks that impose large burdens on public health and economies. Research on SARS-CoV-2 (the virus causing the COVID-19 pandemic) has shown that, in retrospect, SARS-like coronaviruses could have been identified as high risk. To prepare for future pandemics, we need more reliable and specific predictions of which viruses have the potential to be 'zoonotic', i.e., capable of transmitting from animals to humans.

This research will investigate new ways of making predictions by taking advantage of large contemporary datasets, e.g., genome sequence repositories and text-mined published research. Machine learning will be used as a state-of-the-art computational toolkit that can build models to find patterns in complex information (e.g., images, text, genetic sequences) and apply them to specific tasks (e.g., predicting whether a virus is zoonotic or not). I will model mammal and bird RNA viruses as the most likely sources of emerging infections. By incorporating traditionally neglected data that better capture how viruses interact with host proteins and tissues, models will predict the potential of viruses to infect humans zoonotically and cause disease, with improved quality and precision.

Although viral sequencing has improved in coverage, different viruses have been sampled unequally. The resulting biases can lead to poor performance or misidentified relationships if the data used to train machine learning models are not selected carefully. Alongside three analytical objectives, I will also develop new methods to improve how models represent differently sampled viruses, based on evolutionary relatedness.
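
The sampling problem described above can be illustrated with a minimal sketch. It assumes each sequence record carries a taxonomic label (here a hypothetical `family` field) that serves as a crude stand-in for evolutionary relatedness, and it simply caps how many records any one group contributes. This is not the resampling method the proposal will develop, which is not specified here; the function name, field names and records are all made up for illustration.

```python
import random
from collections import defaultdict


def cap_per_group(records, group_key, max_per_group, seed=0):
    """Downsample so that no group contributes more than max_per_group records.

    Assumption: `group_key` names a field (e.g. a taxonomic family) used as a
    rough proxy for evolutionary relatedness. A real method would likely use
    phylogenetic distances rather than a single label.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    sampled = []
    for name in sorted(groups):  # sorted for a deterministic output order
        members = groups[name]
        if len(members) > max_per_group:
            members = rng.sample(members, max_per_group)
        sampled.extend(members)
    return sampled


# Toy, made-up records: one heavily sampled family and one sparsely sampled one.
records = (
    [{"id": f"cov{i}", "family": "Coronaviridae"} for i in range(50)]
    + [{"id": f"filo{i}", "family": "Filoviridae"} for i in range(3)]
)
balanced = cap_per_group(records, "family", max_per_group=5)
```

Capping is only one option; weighting records inversely to group size would keep all the data while reducing the influence of over-sampled groups.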

Firstly, I will build models using protein sequences to predict which viruses are likely to be zoonotic and from which hosts they will originate. To better represent how viruses interact with host cells, I will build models that use information about the physical and chemical properties of viral proteins. Further models will use newer methods that can automatically find important properties directly from raw sequences. These properties can be used to find protein 'hotspots' where important signals for predicting hosts are concentrated. Models will be tested by searching for predicted zoonotic viruses in surveillance data from ongoing hospital sampling.
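
As a rough illustration of what physical and chemical protein properties might look like as model inputs, the sketch below turns an amino acid sequence into a few numeric features. The choice of features (mean Kyte-Doolittle hydropathy, fractions of acidic and basic residues) and the function name `protein_features` are assumptions made for this example, not the feature set the project will use.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
ACIDIC = set("DE")
BASIC = set("KR")


def protein_features(seq):
    """Summarise a protein sequence as a small dictionary of numeric features.

    Non-standard or ambiguous residue codes (e.g. 'X') are ignored.
    """
    residues = [aa for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    return {
        "length": n,
        "mean_hydropathy": sum(KYTE_DOOLITTLE[aa] for aa in residues) / n,
        "frac_acidic": sum(aa in ACIDIC for aa in residues) / n,
        "frac_basic": sum(aa in BASIC for aa in residues) / n,
    }


# Made-up peptide, used only to show the call.
features = protein_features("MKRDELLAV")
```

Computing the same features over a sliding window along the sequence, rather than over the whole protein, is one simple way to look for localised signal of the kind the 'hotspots' idea refers to.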

Secondly, I will build models using host tissue and organ data. Data describing which tissues/organs are infected by each virus has already been extracted from scientific literature using text mining methods. Based on this new data, I will model the three-way network of viruses, hosts and their tissues and predict which additional tissues viruses are likely to infect. These predictions can then be tested through experimental in-vitro infection of cells from different tissues and hosts, taking advantage of synthetic viral protein toolkits. Once models are validated, further properties can be built into the network, e.g., disease severity or fatality, to predict which animal viruses have potential to cause severe human disease based on tissue patterns.

Finally, I will investigate virus-host interactions in more detail by focusing on host proteins underlying patterns of infection. By combining data on infected tissues and how often those tissues express potential viral-interacting proteins, I will predict which proteins may act as barriers to viral infection and which proteins may act as viral receptors (i.e., structures that directly bind viruses and allow cell entry). Experimental in vitro infection of cells that do or do not express potential receptor proteins will provide further support for the viral interactions identified.
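
One very simplified way to combine infected-tissue data with protein expression data is to contrast each protein's expression in infected versus uninfected tissues. In the sketch below, a strongly positive contrast flags a protein expressed mainly where infection occurs (receptor-like), and a strongly negative contrast flags one expressed mainly where it does not (barrier-like). The scoring rule, protein names and expression values are all hypothetical, and the proposal does not state that this is the approach it will take.

```python
def expression_contrast(expression_by_tissue, infected_tissues):
    """Mean expression in infected tissues minus mean in uninfected tissues."""
    infected = [v for t, v in expression_by_tissue.items() if t in infected_tissues]
    uninfected = [v for t, v in expression_by_tissue.items() if t not in infected_tissues]
    if not infected or not uninfected:
        raise ValueError("need at least one infected and one uninfected tissue")
    return sum(infected) / len(infected) - sum(uninfected) / len(uninfected)


# Hypothetical expression levels (arbitrary units) for two candidate proteins.
expression = {
    "protein_A": {"lung": 9.0, "gut": 8.0, "liver": 1.0, "brain": 0.0},
    "protein_B": {"lung": 0.0, "gut": 1.0, "liver": 7.0, "brain": 8.0},
}
# Hypothetical set of tissues that a given virus is known to infect.
infected_tissues = {"lung", "gut"}

scores = {
    name: expression_contrast(profile, infected_tissues)
    for name, profile in expression.items()
}
# Most receptor-like candidates first, most barrier-like last.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A contrast like this can only prioritise candidates; it cannot show that a protein binds a virus, which is why experimental testing in cells remains necessary.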

The proposed research will generate significant public health impact by identifying priority viruses for targeted surveillance to prevent disease emergence and priority protein interactions for targeted experiments to develop pre-emptive therapeutics.

Technical Summary

Newly emerging or re-emerging viruses represent serious threats to public health, demonstrated by the COVID-19 pandemic and other recent outbreaks (e.g., Ebola virus, monkeypox virus). To better anticipate and prevent future pandemics, we need timely, data-driven predictions of which viruses may have zoonotic potential, which may cause severe human disease, and from which hosts they will originate.

Developments in machine learning now offer methods of generating predictions from high-dimensional data inputs. These can draw on techniques (e.g., boosting, allowing error correction and gradual learning) to improve predictive performance over traditional approaches. As such, these methods can predict traits directly from viral genome sequences or complex interacting networks of hosts and viruses.
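
The "error correction and gradual learning" behind boosting can be sketched in a few lines. The example below is a bare-bones gradient boosting loop with one-split decision stumps and squared-error loss on a single hypothetical feature: each round fits a stump to the errors left by the previous rounds (error correction) and adds only a fraction of its output, set by `learning_rate` (gradual learning). The data, function names and settings are illustrative assumptions; a real analysis would use an established library and many features.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that best splits the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    if best is None:  # every x is identical, so no split is possible
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1], best[2], best[3]


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps sequentially, each one to the current residual errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_val, right_val = fit_stump(xs, residuals)
        stumps.append((t, left_val, right_val))
        preds = [p + learning_rate * (left_val if x <= t else right_val)
                 for x, p in zip(xs, preds)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, x):
    """Sum the base value and the shrunken output of every stump."""
    total = model["base"]
    for t, left_val, right_val in model["stumps"]:
        total += model["learning_rate"] * (left_val if x <= t else right_val)
    return total


# Made-up data: one sequence-derived score per virus, and a 0/1 zoonotic label.
xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
ys = [0, 0, 0, 1, 1, 1]
model = fit_boosted(xs, ys)
low_score, high_score = predict(model, 0.15), predict(model, 0.8)
```

On this toy data, the model's output for a low feature value moves towards 0 and for a high value towards 1 as rounds accumulate, with smaller learning rates giving slower but smoother convergence.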

This research aims to train machine learning models incorporating new data on viral sequences, protein properties, and tissue tropisms to improve precision in predicting the potential of animal viruses to infect and cause disease in humans. Training data will be carefully curated, and a new resampling method that accounts for phylogenetic similarity will be developed for host-virus data. Models will be explicitly validated by applying them to an ongoing viral genomic surveillance program and by developing infection assays that can experimentally test predicted tissue tropisms and viral-interacting proteins.

Model findings will specify high-risk viruses and hosts for practical intervention, as well as target tissues and genomic regions for priority surveillance. Models will also generate new insights into the molecular determinants underpinning host-virus relationships and narrow down target proteins as potential pathways to developing pre-emptive therapeutics. Ultimately, this research will create an empirical modelling foundation that can be built upon as data availability increases, improving estimation of emerging infectious disease and pandemic risks.
