Understanding the evolution and diversity of viral pathogens using next generation sequencing technologies

Lead Research Organisation: University of Manchester
Department Name: Life Sciences


Infectious agents such as viruses are a main cause of animal and human disease. In this project we wish to study the genetic material of these pathogens. Genetic material is encoded as ordered 'sequences' of nucleotides. This information determines a virus's biological properties and its response to the host immune system, and thus the success of veterinary or medical treatments, whether vaccine or drug-based. Until very recently pathogen genetic material was characterized using Sanger sequencing, a technique invented in the late 1970s. More recently, new sequencing technologies have become available that permit extremely large numbers of sequence fragments, called 'reads', to be generated. Many refer to this as a revolution in sequencing because it permits small groups of researchers to tackle projects previously only possible at sequencing centres, while sequencing centres can tackle truly massive sequencing projects, for example the initiative to sequence 1,000 human genomes. This introduces the potential to explore pathogen genetic diversity on an unprecedented scale. However, there is a downside. The amount of data being generated is outstripping our ability to analyse it routinely, let alone carry out sophisticated evolutionary analysis. Particularly for pathogens, data sets can be generated for which no suitable computational tools exist. This is exactly what happened in the preliminary analysis for this project: HIV data important for understanding drug resistance were generated, but no software was available to analyse them. This lack of software arises because most research effort is directed at assembling single complete genomes from next generation sequence data. With pathogens, however, the interesting questions concern the diversity of sequences, as captured by so-called 'ultra-deep' sequencing. 
As a consequence, in this project we propose to develop reliable, easy-to-use software that will be generically useful for all types of pathogen data sets. This will involve exploiting both the error information that is intrinsic to the new sequencing platforms and our considerable knowledge of the pathogen systems that we wish to analyse. Combined, these will permit us to develop software that summarises the variation in a sample of sequences and provides confidence in the sequence changes observed. Just as importantly, our computer-based approach will permit sophisticated analysis of properties of the data in the hunt for clues to understanding a pathogen's biology. We will use this software in conjunction with next-generation sequence data to provide a detailed insight into the intra-host dynamics of RNA viral populations. Particular focus will be given to genome diversity when the selective landscape within the host is altered, for example following transmission between individuals, disease progression or the initiation/alteration of drug treatments. Additionally, our approach will be generically applicable to a wide range of research areas where understanding genetic variation is key.

Technical Summary

Next generation sequencing technologies have massively parallelized the sequencing of genomes. As a result genetic information can be studied at an unprecedented level of detail. These new technologies will undoubtedly have the most impact on the study of animal and human disease. In this project we will focus on the processing, management and analysis of next generation sequences from viral pathogens. First, we will create software that produces a diversity profile of a next generation data set: essentially a probabilistic model of the observed mutations that incorporates the error from the specific sequencing platform in the context of the expected variation within a specific pathogen's genome. This will incorporate a set of priors derived from both biological and technological aspects of the data. The output will be a genetic profile of the genomic regions sequenced, with associated polymorphism likelihood scores. Secondly, we will produce software for interacting with these profiles in order to determine important evolutionary properties that contribute to the persistence of the pathogen within the host. In addition to incorporating standard methodologies for the analysis of polymorphism data and for inferring evolutionary trees, this software will specialize in the analysis of temporal data from infected animals or patients at the level of individual infections, localized epidemics or global pandemics. The key will be the ability of our software to manage, analyse and visualize the results from complex (and extremely large) next generation sequence data sets. Initially we will focus on data being generated by the 454 sequencing platform. The data sets we have at our disposal include HIV-1, influenza, coronavirus and rotavirus sequences. Importantly, the framework that we develop will be of utility to the study of any viral data set and, with modification, will be applicable to more general studies of diversity in any 'ultra-deep' data.
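
The idea of scoring each observed mutation against the platform's error rate and a prior can be illustrated with a minimal sketch. This is not the project's actual model: the function names, the binomial error model, the uniform grid over the true variant frequency and all parameter values are assumptions chosen for illustration only.

```python
from math import lgamma, log, exp


def log_binomial_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    return m + log(sum(exp(v - m) for v in values))


def variant_posterior(depth, alt_count, error_rate, prior_variant, grid=1000):
    """Posterior probability that a site carries a real variant.

    Hypothetical two-hypothesis model (an assumption, not the authors' method):
      H0: every non-reference read is a sequencing error.
      H1: a variant exists at an unknown frequency f, averaged over a
          uniform grid; reads still suffer the same per-base error rate.
    """
    log_l0 = log_binomial_pmf(alt_count, depth, error_rate)

    log_terms = []
    for i in range(1, grid + 1):
        f = i / grid
        # Chance a read shows the variant base: true variant read without
        # error, or a reference read miscalled as the variant.
        p_alt = f * (1 - error_rate) + (1 - f) * error_rate
        log_terms.append(log_binomial_pmf(alt_count, depth, p_alt))
    log_l1 = log_sum_exp(log_terms) - log(grid)

    log_num = log(prior_variant) + log_l1
    log_den = log_sum_exp([log_num, log(1 - prior_variant) + log_l0])
    return exp(log_num - log_den)


# Illustrative values only: 10,000 reads at a site, 1% per-base error rate,
# and a 1% prior probability that any given site is polymorphic.
supported = variant_posterior(10000, 500, 0.01, 0.01)
noise_level = variant_posterior(10000, 100, 0.01, 0.01)
print(supported > noise_level)
```

In this sketch a site whose non-reference read count matches what error alone would produce receives a low score, while a count well above the error expectation receives a high one. A platform-specific error model or a pathogen-specific prior would replace the fixed `error_rate` and `prior_variant` arguments.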

Planned Impact

The research beneficiaries (both public and private) are beginning to produce next-generation sequencing data to answer many research questions. The particular nature of these data (short reads but massive depth of coverage) poses its own challenges as well as some unprecedented opportunities, in particular in pathogen research. (i) The primary beneficiaries will be virologists, medical researchers and public-health scientists who are beginning to collect next-generation sequencing data to consider questions about viral diversity within and between hosts. This includes large public projects to understand the diversity of certain pathogens worldwide. An example is the Influenza Genome Sequencing Project, to which the J. Craig Venter Institute is a major contributor (see attached collaboration letter from Dr. David Spiro, head of the viral genomics group). The UK sequencing centres (see collaboration letters) will also be direct beneficiaries, in particular due to their recent nomination as MRC sequencing hubs and their consequent interest in pathogen sequencing. Our system will have immediate utility (a) for scientists who are studying viral infections (in both humans and livestock) and (b) for research in biomedicine and public health, where it will be useful in tracking drug resistance and the spread of novel viral strains, either in epidemiologically linked groups or globally. (ii) There will be commercial sector beneficiaries in the form of companies developing interventions for viral diseases, such as anti-viral drugs and vaccines. In particular, DR has an existing collaboration with Pfizer R&D to pioneer the use of deep sequencing to study drug resistance evolution. The companies developing next-generation sequencing technologies will be another private beneficiary of the research because it will open their technologies up to a broad range of pathogen-focused research groups. 
(iii) The research will enable a better understanding of the causes and maintenance of virus genetic diversity and thus will provide benefits in terms of how we react to and deal with emerging viral pathogens. This indirect benefit to public-health scientists will emerge from application of our proposed system to the data that are beginning to be collected now. Indeed, we intend to apply the system to some important research questions ourselves, in collaboration with our sequencing collaborators. Our primary forms of communication to potential beneficiaries will be presentations at major conferences (in particular the larger medical and public-health orientated ones) and peer-reviewed papers in high-impact journals. Such traditional media are important to ensure the quality of the research. We will also engage directly with individuals, institutions and companies that are likely to find our research applicable, starting with the collaborative partners we have listed here. Conference presentations are also very effective for establishing contact with other potential beneficiaries and collaborators. We will provide training in the use of the software and methods through workshops. AR, in particular, regularly teaches on a number of such workshops (Workshop on Molecular Evolution and International Workshop on Virus Evolution) and these have proved extremely popular for providing training in the use of his genetic analysis software, BEAST. We intend to make the system freely available to researchers because of its potential public-health benefits. However, the complex nature of genetic analysis software means that there is considerable potential to exploit the experience of the RA and PIs in terms of consultancy to interested private sector users. 
We also envisage that specific commercial applications may require customized versions, perhaps to enable a particular high-throughput workflow, and the modular nature of the proposed system would facilitate the development of such versions.
Description We have released freely available software for the analysis of next generation sequencing (NGS) data from viral populations. The software is generically applicable to viral data sets and to NGS data from different sequencing platforms. We have open access publications associated with the software (Archer et al. 2010; 2012a) and have demonstrated that the framework can be used to process data from multiple NGS platforms, specifically 454, Illumina, Ion Torrent and PacBio (Archer et al. 2012b). We have a further manuscript in preparation that describes a statistical method to leverage signal from ultra-deep data when linked temporal data sets are available (Archer et al. 2015). The software was also presented at international conferences on both computational biology/bioinformatics (e.g., ISMB/ECCB) and retroviruses (e.g., CROI). The PIs have also presented the work in seminars both in the UK and internationally.
Exploitation Route Software is available for use by others.
Sectors Pharmaceuticals and Medical Biotechnology

Description The production of our virus-focussed NGS analysis framework has led to collaborations with two hospitals: the Royal Free in London and University Hospitals Case Medical Center, Cleveland, Ohio. These organisations are using next generation sequencing data to detect low frequency mutations, e.g., associated with drug resistance in viral infections.
Sector Pharmaceuticals and Medical Biotechnology
Title Software for the analysis of viral NGS data 
Description Software, Segminator, for the analysis of viral next generation sequence (NGS) data, available at http://www.bioinf.manchester.ac.uk/segminator. Associated with publication: PMID: 22443413. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact Software has been used by researchers in academia and industry. 
URL http://www.bioinf.manchester.ac.uk/segminator