Understanding recombination through tractable statistical analysis of whole genome sequences

Lead Research Organisation: University of Reading
Department Name: Mathematics and Statistics

Abstract

This project concerns the analysis of whole genome sequence data: that is, the complete DNA sequence, the genetic code, of an organism. The technology for acquiring such a sequence is relatively new. It was used to sequence a single human genome in the "Human Genome Project", which finished in 2003. This project took 13 years and cost $2.7 billion. However, technological advances in the past 10 years have caused the cost of sequencing genomes to drop dramatically. It now costs only $1,000 to sequence a human genome and this price continues to fall.

Why should one wish to sequence a genome? The human genome can be thought of as the blueprint for building a human. Each person has a slightly different genome: the parts that are common to everyone are what make us human; the parts that differ are responsible for the (genetic) differences between us. Understanding our genomes, through studying both the common parts of the genome and the differences, promises scientific breakthroughs in many areas, particularly medicine. For example, Genomics England are currently planning to sequence 100,000 genomes for the purposes of improving clinical practice in dealing with rare disease, cancer and infectious disease.

The study of DNA sequences is not restricted to humans. It is also useful to obtain whole genome sequences from many other living things. This project is focussed on analysing genetic information obtained from bacteria. There are many reasons for studying bacteria, one of the most obvious being that some bacteria cause disease in plants and livestock (affecting our food supply) and in humans (affecting our health). Obtaining a better understanding of the genetic code of bacteria offers the promise of both tracking the spread of infections and also reducing the occurrence of disease.

However, although sequence data is relatively easy to obtain, extracting useful information from it can be very difficult. Genome sequences can be stored on computers in text files as a long sequence of letters. A single gene might consist of a few hundred or thousand letters. A whole bacterial genome, containing thousands of genes, might be 3 million letters long. Most scientific projects involve studying a population (tens, hundreds or thousands) of these genomes, so it is not unusual for a dataset to consist of over a billion letters! To make sense of such a large, complicated dataset, mathematical methods, implemented as computer programs, are required.

This project is concerned with the development of such mathematical methods, and their implementation. The aim of the project is to learn about the evolution of bacteria by studying their genome sequences. As a rule, bacteria reproduce clonally: each individual only has a single parent. However, in some cases they can also exchange DNA, in a manner related to sexual reproduction in humans. It is of great scientific interest to understand such exchanges for several reasons, including that it is one of the main ways in which a bacterium can become resistant to antibiotics. MRSA is an example of an antibiotic-resistant bacterial strain that has been of significant concern to the NHS. Understanding how antibiotic resistance is acquired is one way in which scientists can help to tackle such problems. Analysing whole genome sequences using mathematical methods, as is done in this project, is fundamental to these investigations.

Currently there are several computer programs that can be used to investigate the exchange of DNA, and important discoveries have been made through using them. However, when they are used on whole genome sequences, some of the programs run too slowly to be useful in many cases (sometimes they take months), and others cannot detect genetic exchange events or provide only an incomplete picture of them. This project is developing new programs that are both accurate and fast enough to be useful.

Technical Summary

This project is concerned with understanding homologous recombination in bacteria through the analysis of whole genome sequence (WGS) data. There are several existing methods for inferring recombination from WGS, but they are either computationally expensive (ClonalFrame (CF) and ClonalOrigin (CO)) or use approximate models where the source and destination of recombination events cannot be identified (e.g. BratNextGen).

CF and CO are both Bayesian models that use Monte Carlo methods, specifically Markov chain Monte Carlo (MCMC), to perform inference. This approach has been commonplace since the early 1990s. Since that time, other, more computationally efficient, methods have been proposed in statistics, many of which are not used extensively in applications. This project will use some of these new techniques to make inference using the CF and CO models more efficient. There are two reasons for the computational expense of CF: one is that it searches over a (very large) space of possible clonal ancestries; the second is that it uses a large number of latent variables (labelling recombinant sites) that must be inferred. CO is even more expensive, since it also searches over the large space of possible sources and destinations of recombination events.

There are two ways in which new Monte Carlo approaches will be used. The first is in making an "online" version of CF, in which the parameters in the model are inferred as new WGS data arrives (in contrast to the current, batch, approach), using sequential Monte Carlo. The two main benefits of this approach are that online inference can be performed, and that this also promises to be a more effective means for exploring the space of clonal ancestries. The second is in using a technique called pseudo-marginal MCMC for more efficiently searching over the large numbers of latent variables in both the CF and CO models. Different variations on pseudo-marginal MCMC will be exploited in the two different models.
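
As an illustration of the pseudo-marginal idea only, the following Python sketch uses a toy Gaussian model rather than the CF or CO models. The model, the prior, the tuning values and the names estimate_likelihood and pseudo_marginal_mh are all assumptions made for this example. The likelihood is replaced by an unbiased Monte Carlo estimate that averages over freshly simulated latent variables, so the chain never has to store or update the latent variables themselves.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def estimate_likelihood(theta, data, n_latent, rng):
    """Unbiased Monte Carlo estimate of p(data | theta).

    Toy model (an assumption): z_i ~ N(theta, 1) and y_i | z_i ~ N(z_i, 1).
    Each factor p(y_i | theta) is estimated by averaging p(y_i | z) over
    n_latent fresh draws of the latent variable z.
    """
    estimate = 1.0
    for y in data:
        draws = [rng.gauss(theta, 1.0) for _ in range(n_latent)]
        estimate *= sum(normal_pdf(y, z, 1.0) for z in draws) / n_latent
    return estimate


def pseudo_marginal_mh(data, n_iters, n_latent=20, step=0.5, seed=1):
    """Random-walk Metropolis-Hastings driven by estimated likelihoods."""
    rng = random.Random(seed)
    theta = 0.0
    # Estimated unnormalised posterior: N(0, 10^2) prior times the estimate.
    target = normal_pdf(theta, 0.0, 10.0) * estimate_likelihood(
        theta, data, n_latent, rng)
    chain = []
    for _ in range(n_iters):
        proposal = theta + rng.gauss(0.0, step)
        proposal_target = normal_pdf(proposal, 0.0, 10.0) * estimate_likelihood(
            proposal, data, n_latent, rng)
        # The estimate for the current state is reused, not recomputed;
        # this is what makes the chain target the exact posterior.
        if rng.random() * target < proposal_target:
            theta, target = proposal, proposal_target
        chain.append(theta)
    return chain
```

For example, pseudo_marginal_mh([1.2, 0.8, 1.1, 0.9, 1.0], 4000) returns a chain whose later values can be averaged to approximate the posterior mean of theta under this toy model.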

Planned Impact

This project will develop software (based on statistical methodology) that will be used by other researchers. Therefore its pathway to impact is indirect, in that it is reliant on working with researchers whose application areas can lead to economic and societal impact. The academic impact of the work is a crucial component in the path to other impacts.

Despite this indirect pathway, the potential for impact from this research is significant. Algorithms such as the ones developed in this project, when implemented in freely available software packages, represent the gateway to making sense of whole genome sequence (WGS) data. When scientists make a discovery from WGS (such as the recent widely reported result that there is not a unique Celtic group of people in the UK), the results are a direct output of the underlying algorithms.

The data for which the algorithms in this project are designed is sequence data from bacteria. The algorithms extract the ancestral history of a population of bacteria, and describe the recombination events that have occurred. This is of significant interest in its own right, but there are also compelling practical reasons:
- Understanding horizontal gene transfer is at the core of understanding how genes conferring antimicrobial resistance are acquired;
- Recombination breaks down linkage disequilibrium (association between alleles at different loci in a genome), which is key to conducting association studies in bacteria, the means by which it might be discovered which genes are responsible for antimicrobial resistance or particularly virulent infections;
- Transmission of bacteria between hosts can be inferred from WGS, but recombination must be accounted for if this is to be done accurately, to provide a true picture of the spread of an infection.

The following groups of people have a direct interest in these topics:
- researchers in horticulture and agriculture, who need to understand plant pathogens;
- animal health researchers, in understanding livestock pathogens;
- the medical profession, regarding human pathogens;
- microbiologists.

Research into plant, livestock and human pathogens has direct impacts in the BBSRC strategic research priorities of "Agriculture and Food Security" and "Bioscience for Health". It is through engagement with researchers in these areas that this project will have its most significant impacts: the algorithms that are developed will provide a deeper understanding of pathogens, which can feed through to policy-makers and regulators who can then make informed decisions to improve food production and safeguard the health of animals and humans.

The statistical work in the project also has the potential to lead to impacts in other areas. Again, these are indirect, but are potentially significant. The methodology underlying the algorithms is far more general than its use in this project might suggest. In fact, the methods may also find use in applications as diverse as neuroscience, weather and climate analysis and the analysis of social networks. The PI is beginning to investigate the use of a similar method in providing a better understanding of anaesthesia, based on electroencephalography data. The understanding of how the statistical methods used in this project perform is useful to statisticians who work in different areas of application.

The project is ideal for a young researcher to work on, since they will gain skills in two growth areas of national importance: firstly, in the analysis of sequence data; secondly, in the fields of statistics, machine learning and data science, the subjects at the heart of the "Big Data Revolution" which has been recognised to be of national economic and scientific importance.

Publications

Title Bootstrapped synthetic likelihood 
Description Synthetic likelihood (SL) is a useful alternative to approximate Bayesian computation for parameter inference in models whose likelihood is not available to evaluate pointwise. Suppose that the data being modelled consists of N data points. The computational cost of SL can be high, since at each parameter value it requires running the simulator M times, giving a cost of O(MN) for each parameter value visited. Bootstrapped SL uses a bootstrap of a single simulation to approximate the standard SL, with a cost of O(N) for each parameter value visited. A bag of little bootstraps can be used to reduce the cost further, where only a sub-simulation of size n 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None yet. 
URL https://arxiv.org/abs/1711.05825
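
To make the O(MN) cost of standard SL concrete, the following Python sketch evaluates a synthetic log-likelihood for a toy simulator. It shows the standard method only, not the bootstrapped variant described above. The simulator, the choice of the sample mean as the single summary statistic, and all function names are assumptions made for this example.

```python
import math
import random
import statistics


def simulate(theta, n_points, rng):
    """Toy simulator (an assumption): n_points draws from N(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n_points)]


def summary(dataset):
    """Summary statistic used in this sketch: the sample mean."""
    return statistics.fmean(dataset)


def synthetic_log_likelihood(theta, observed_summary, n_points, n_sims, rng):
    """Standard SL: fit a normal to simulated summaries, then score the data.

    The simulator is run n_sims (M) times on n_points (N) points each,
    which is the O(MN) cost per parameter value.
    """
    sims = [summary(simulate(theta, n_points, rng)) for _ in range(n_sims)]
    mu = statistics.fmean(sims)
    var = statistics.variance(sims, mu)
    return (-0.5 * math.log(2.0 * math.pi * var)
            - (observed_summary - mu) ** 2 / (2.0 * var))
```

A parameter value close to the one that generated the observed data should receive a higher synthetic log-likelihood than one far from it.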
 
Title Delayed acceptance ABC-SMC 
Description Approximate Bayesian computation (ABC) is the method of choice for inferring the parameters of statistical models that are defined by a simulator. The approach originates in genetics, but has also found application in a number of different fields, including the study of infectious diseases and the analysis of network data. ABC can be infeasible when the simulator is computationally expensive, since often the simulator needs to be run a large number of times. The method we have developed is useful in cases where an alternative computationally cheap simulator is available. In these cases, our method uses simulations from the cheap simulator to automatically rule out "bad" regions of parameter space, sometimes reducing the computational cost of ABC by orders of magnitude. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None so far. 
URL https://arxiv.org/abs/1708.02230
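
The screening idea can be illustrated with a simplified two-stage rejection sampler in Python. This is not the ABC-SMC algorithm of the paper: it is plain rejection ABC with an early-rejection step, and it accepts only parameters that pass both checks. The two toy simulators, the uniform prior, the tolerances and all names are assumptions made for this example.

```python
import random
import statistics


def cheap_simulator(theta, rng):
    """Cheap, noisy stand-in (an assumption): mean of 10 draws."""
    return statistics.fmean(rng.gauss(theta, 1.0) for _ in range(10))


def expensive_simulator(theta, rng):
    """Expensive, accurate stand-in (an assumption): mean of 500 draws."""
    return statistics.fmean(rng.gauss(theta, 1.0) for _ in range(500))


def two_stage_abc(observed, n_proposals, cheap_tol, tol, seed=0):
    """Rejection ABC in which a cheap simulation screens each proposal.

    Returns the accepted parameters and the number of expensive runs used.
    """
    rng = random.Random(seed)
    accepted = []
    expensive_runs = 0
    for _ in range(n_proposals):
        theta = rng.uniform(-5.0, 5.0)  # draw from a uniform prior
        # Stage 1: discard clearly poor parameters using the cheap simulator.
        if abs(cheap_simulator(theta, rng) - observed) > cheap_tol:
            continue
        # Stage 2: only survivors pay for the expensive simulator.
        expensive_runs += 1
        if abs(expensive_simulator(theta, rng) - observed) <= tol:
            accepted.append(theta)
    return accepted, expensive_runs
```

Calling two_stage_abc(1.0, 2000, 1.5, 0.1) runs the expensive simulator on only a fraction of the 2000 proposals, which is the source of the saving in this sketch.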
 
Title Sequential Bayesian inference for the coalescent 
Description We have developed a new methodology for sequentially inferring a coalescent tree from whole genome sequences, which can be updated as new sequence data arrives. This method relies on an underpinning development in sequential Monte Carlo methodology. This will eventually lead to a software tool, which can be deployed as part of pipelines that analyse whole genome sequence data (hence it is listed as an improvement to research infrastructure). 
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? No  
Impact Presently this method is only available in prototype form. We are currently working on an implementation that we plan to make freely available. 
URL https://arxiv.org/abs/1612.06468
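
The way sequential Monte Carlo updates an inference as data arrive can be sketched in Python for a single scalar parameter. This is far simpler than inferring a coalescent tree and is not the method of the paper. The Gaussian model, the resampling threshold, the jitter step (a crude stand-in for a proper MCMC move step) and all names are assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def smc_update(particles, weights, observation, rng):
    """Absorb one new observation y ~ N(theta, 1) into a particle set."""
    # Reweight every particle by the likelihood of the new observation.
    weights = [w * normal_pdf(observation, p, 1.0)
               for p, w in zip(particles, weights)]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Resample when the effective sample size falls below half the particles.
    ess = 1.0 / sum(w * w for w in weights)
    if ess < 0.5 * len(particles):
        particles = rng.choices(particles, weights=weights, k=len(particles))
        # Small jitter restores diversity; it slightly perturbs the target.
        particles = [p + rng.gauss(0.0, 0.05) for p in particles]
        weights = [1.0 / len(particles)] * len(particles)
    return particles, weights


def posterior_mean(particles, weights):
    """Weighted mean of the particles."""
    return sum(p * w for p, w in zip(particles, weights))
```

Starting from particles drawn from the prior with equal weights, smc_update is called once per new observation, so earlier data never need to be reprocessed.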
 
Description Short talk at the Modernising Medical Microbiology conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Richard Everitt gave a short talk at the Modernising Medical Microbiology conference, about work on sequential inference for the coalescent. This conference is attended by a wide range of interested parties, including academics, health care professionals, and decision makers at organisations such as Public Health England. Several people expressed interest in the talk, and plan to read our draft paper.
Year(s) Of Engagement Activity 2017
URL http://modmedmicro.nsms.ox.ac.uk/mmm-conference-14th-march-2017/