Understanding recombination through tractable statistical analysis of whole genome sequences

Lead Research Organisation: University of Reading
Department Name: Mathematics and Statistics


This project concerns the analysis of whole genome sequence data: that is, the complete DNA sequence, the genetic code, of an organism. The technology for acquiring such a sequence is relatively new. It was used to sequence a single human genome in the "Human Genome Project", which finished in 2003. That project took 13 years and cost $2.7 billion. However, technological advances in the past 10 years have caused the cost of sequencing genomes to drop dramatically. It now costs only $1,000 to sequence a human genome, and this price continues to fall.

Why should one wish to sequence a genome? The human genome can be thought of as the blueprint for building a human. Each person has a slightly different genome: the parts that are common to everyone are what make us human; the parts that differ are responsible for the (genetic) differences between us. Understanding our genomes, through studying both the common parts of the genome and the differences, promises scientific breakthroughs in many areas, particularly medicine. For example, Genomics England are currently planning to sequence 100,000 genomes for the purposes of improving clinical practice in dealing with rare disease, cancer and infectious disease.

The study of DNA sequences is not restricted to humans. It is also useful to obtain whole genome sequences from many other living things. This project is focussed on analysing genetic information obtained from bacteria. There are many reasons for studying bacteria, one of the most obvious being that some bacteria cause disease in plants and livestock (affecting our food supply) and in humans (affecting our health). Obtaining a better understanding of the genetic code of bacteria offers the promise of both tracking the spread of infections and also reducing the occurrence of disease.

However, although sequence data is relatively easy to obtain, extracting useful information from it can be very difficult. Genome sequences can be stored on computers in text files as a long sequence of letters. A single gene might consist of a few hundred or thousand letters. A whole bacterial genome, containing thousands of genes, might be 3 million letters long. Most scientific projects involve studying a population (tens, hundreds or thousands) of these genomes, so it is not unusual for a dataset to consist of over a billion letters! To make sense of such a large, complicated dataset, mathematical methods, implemented as computer programs, are required.

This project is concerned with the development of such mathematical methods, and their implementation. The aim of the project is to learn about the evolution of bacteria by studying their genome sequences. As a rule, bacteria reproduce clonally: each individual only has a single parent. However, in some cases they can also exchange DNA, in a manner related to sexual reproduction in humans. It is of great scientific interest to understand such exchanges for several reasons, including that it is one of the main ways in which a bacterium can become resistant to antibiotics. MRSA is an example of an antibiotic resistant bacterial strain that has been of significant concern to the NHS. Understanding how antibiotic resistance is acquired is one way in which scientists can help to tackle such problems. Analysing whole genome sequences using mathematical methods, as is done in this project, is fundamental to these investigations.

Currently there are several computer programs that can be used to investigate the exchange of DNA, and important discoveries have been made through using them. However, when they are used on whole genome sequences, some of the programs run too slowly to be useful in many cases (sometimes they take months), and others cannot detect genetic exchange events, or provide only an incomplete picture of them. This project is developing new programs that are both accurate and fast enough to be useful.

Technical Summary

This project is concerned with understanding homologous recombination in bacteria through the analysis of whole genome sequence (WGS) data. There are several existing methods for inferring recombination from WGS, but they are either computationally expensive (ClonalFrame (CF) and ClonalOrigin (CO)) or use approximate models where the source and destination of recombination events cannot be identified (e.g. BratNextGen).

CF and CO are both Bayesian models that use Monte Carlo methods, specifically Markov chain Monte Carlo (MCMC), to perform inference. This approach has been commonplace since the early 1990s. Since that time other, more computationally efficient, methods have been proposed in statistics, many of which are not yet used extensively in applications. This project will use some of these new techniques to make inference under the CF and CO models more efficient. There are two reasons for the computational expense of CF: the first is that it searches over a (very large) space of possible clonal ancestries; the second is that it uses a large number of latent variables (labelling recombinant sites) that must be inferred. CO is even more expensive, since it also searches over the large space of possible sources and destinations of recombination events.

There are two ways in which new Monte Carlo approaches will be used. The first is in making an "online" version of CF, in which the parameters of the model are inferred as new WGS data arrive (in contrast to the current batch approach), using sequential Monte Carlo. The two main benefits of this approach are that inference can be updated as data arrive, and that it promises to be a more effective means of exploring the space of clonal ancestries. The second is in using a technique called pseudo-marginal MCMC to search more efficiently over the large numbers of latent variables in both the CF and CO models. Different variations on pseudo-marginal MCMC will be exploited in the two models.
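The pseudo-marginal idea can be illustrated in isolation (this is a toy sketch, not the CF or CO implementation; the model and all names below are assumptions chosen so the answer can be checked). An unbiased Monte Carlo estimate of an intractable likelihood is plugged into the Metropolis-Hastings acceptance ratio, and the chain still targets the exact posterior, provided the estimate at the current state is recycled rather than refreshed:

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

y_obs = 1.5  # a single observation, purely for illustration

def lik_hat(theta, M=30):
    """Unbiased estimate of p(y | theta) for the toy model
    x ~ N(theta, 1), y | x ~ N(x, 1), obtained by averaging
    p(y | x_m) over M draws of the latent variable x."""
    x = rng.normal(theta, 1.0, size=M)
    return normal_pdf(y_obs, x, 1.0).mean()

def pseudo_marginal_mh(n_iter=5000, step=1.0):
    theta, lhat = 0.0, lik_hat(0.0)
    chain = []
    for _ in range(n_iter):
        prop = theta + step * rng.normal()
        lhat_prop = lik_hat(prop)
        # Flat prior, symmetric proposal: accept with the ratio of the
        # likelihood *estimates*. Recycling the estimate at the current
        # state is what makes the exact posterior the target.
        if rng.uniform() < lhat_prop / lhat:
            theta, lhat = prop, lhat_prop
        chain.append(theta)
    return np.array(chain)

chain = pseudo_marginal_mh()
# The exact posterior here is N(y_obs, 2), so the chain mean should
# settle near 1.5.
```

In CF and CO the latent variables are the recombinant-site labels and the sources and destinations of recombination events, so the estimators are far more elaborate, but the recycling trick is the same.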

Planned Impact

This project will develop software (based on statistical methodology) that will be used by other researchers. Therefore its pathway to impact is indirect, in that it is reliant on working with researchers whose application areas can lead to economic and societal impact. The academic impact of the work is a crucial component in the path to other impacts.

Despite this indirect pathway, the potential for impact from this research is significant. Algorithms such as the ones developed in this project, when implemented in freely available software packages, represent the gateway to making sense of whole genome sequence data (WGS). When scientists make a discovery from WGS (such as the recent widely-reported result that there is not a unique Celtic group of people in the UK), the results are a direct output of the underlying algorithms.

The data for which the algorithms in this project are designed is sequence data from bacteria. The algorithms extract the ancestral history of a population of bacteria, and describe the recombination events that have occurred. This is of significant interest in its own right, but there are also compelling practical reasons:
- Understanding horizontal gene transfer is at the core of understanding how genes conferring antimicrobial resistance are acquired;
- Recombination breaks down linkage disequilibrium (association between alleles at different loci in a genome), which is key to conducting association studies in bacteria: the means by which one might discover which genes are responsible for antimicrobial resistance or for particularly virulent infections;
- Transmission of bacteria between hosts can be inferred from WGS, but recombination must be accounted for if this is to be done accurately, to provide a true picture of the spread of an infection.

The following groups of people have a direct interest in these topics:
- researchers in horticulture and agriculture, who need to understand plant pathogens;
- animal health researchers, in understanding livestock pathogens;
- the medical profession, regarding human pathogens;
- microbiologists.

Research into plant, livestock and human pathogens has direct impacts in the BBSRC strategic research priorities of "Agriculture and Food Security" and "Bioscience for Health". It is through engagement with researchers in these areas that this project will have its most significant impacts: the algorithms that are developed will provide a deeper understanding of pathogens, which can feed through to policy-makers and regulators who can then make informed decisions to improve food production and safeguard the health of animals and humans.

The statistical work in the project also has the potential to lead to impacts in other areas. Again, these are indirect, but are potentially significant. The methodology underlying the algorithms is far more general than its use in this project might suggest. In fact, the methods may also find use in applications as diverse as neuroscience, weather and climate analysis, and the analysis of social networks. The PI is beginning to investigate the use of a similar method in providing a better understanding of anaesthesia, based on electroencephalography data. Understanding how the statistical methods used in this project perform is also useful to statisticians working in other areas of application.

The project is ideal for a young researcher to work on, since they will gain skills in two growth areas of national importance: firstly, in the analysis of sequence data; secondly, in the fields of statistics, machine learning and data science, the subjects at the heart of the "Big Data Revolution", which has been recognised as being of national economic and scientific importance.


Everitt RG (2017) Delayed acceptance ABC-SMC in arXiv

Everitt RG (2017) Bootstrapped synthetic likelihood in arXiv

Everitt R (2019) Sequential Monte Carlo with transformations in Statistics and Computing

Medina-Aguayo F (2019) Perturbation bounds for Monte Carlo within Metropolis via restricted approximations in Stochastic Processes and their Applications

Description This project focussed on statistical models for genetics. The objectives of the award were to develop improved inference for state-of-the-art models for real data.

There were two main strands of work described in the proposal:

- performing the inference of models of pathogen evolution online, for use in epidemics, for example;
- improving inference techniques for models of recombination in bacteria.

Both objectives were met. A "sequential Monte Carlo" method was developed for online inference of coalescent models. This is currently being implemented as part of BEAST2, one of the leading pieces of software for analysing genomic data, and has also been built on by leading researchers in genetics and epidemiology. A new "reversible jump" methodology was developed for ClonalOrigin, improving its inference through the use of parallel computing. In addition to these new methods, other developments to the underpinning computational statistics were made; these aid the performance of our techniques on genetic data, but are also applicable to other application areas, from neuroscience to ecology.

Both of our methods have freely available implementations, which are still under active development.
Exploitation Route Our method for online inference of evolutionary models of pathogens may be used to aid decision making in epidemics. This approach is currently being built on by leaders in this field. Our method for analysing recombination will be used by those who study bacteria - it could potentially help researchers understand the evolution of some mechanisms of anti-microbial resistance.

In addition, some of the improved techniques developed under the grant have formed the basis of a collaboration with the Centre for Environment, Fisheries and Aquaculture Science (Cefas), on inferring fish stocks to aid policy making. This work is ongoing, and has led to two successful applications for funding from NERC under the Landscape Decisions Programme (total funding £340,000).
Sectors Environment,Healthcare

Description The improved techniques developed under the grant have formed the basis of a collaboration with the Centre for Environment, Fisheries and Aquaculture Science (Cefas), on inferring fish stocks to aid policy making on fishery management. This work has so far involved examining models for sea bass, which had suffered a dramatic decline in recent years. The work is ongoing, and has led to two successful applications for funding from NERC under the Landscape Decisions Programme (total funding £340,000).
First Year Of Impact 2018
Sector Environment
Impact Types Policy & public services

Description Quantifying uncertainty in the predictions of complex process-based models
Amount £42,873 (GBP)
Funding ID NE/T004010/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 10/2019 
End 09/2020
Title Bootstrapped synthetic likelihood 
Description Synthetic likelihood (SL) is a useful alternative to approximate Bayesian computation for parameter inference in models whose likelihood is not available to evaluate pointwise. Suppose that the data being modelled consists of N data points. The computational cost of SL can be high, since at each parameter it requires running the simulator M times, giving a cost of O(MN) for each parameter visited. Bootstrapped SL uses a bootstrap of a single simulation to approximate the standard SL, with a cost of O(N) for each parameter visited. A bag of little bootstraps can reduce the cost further, since only a sub-simulation of size n < N is then required.
Type Of Material Data analysis technique 
Year Produced 2017 
Provided To Others? Yes  
Impact None yet. 
URL https://arxiv.org/abs/1711.05825
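The contrast between the two costs can be sketched in a few lines (a toy illustration, not the implementation behind the paper: the simulator, the summary statistics, and the omission of the paper's bias corrections are all simplifications). Standard SL fits a Gaussian to summaries of M fresh simulations per parameter; bootstrapped SL resamples a single simulation instead:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta, N):
    # Stand-in "intractable" simulator: just Normal(theta, 1).
    return rng.normal(theta, 1.0, size=N)

def summaries(data):
    return np.array([data.mean(), data.std()])

def gauss_logpdf(s, mu, cov):
    d = s - mu
    return -0.5 * (d @ np.linalg.solve(cov, d)
                   + np.log(np.linalg.det(cov)) + len(s) * np.log(2 * np.pi))

def standard_sl(theta, s_obs, N, M=200):
    """Standard SL: M full simulations per parameter, cost O(MN)."""
    S = np.array([summaries(simulate(theta, N)) for _ in range(M)])
    return gauss_logpdf(s_obs, S.mean(axis=0), np.cov(S.T))

def bootstrapped_sl(theta, s_obs, N, B=200):
    """Bootstrapped SL: one simulation per parameter, cost O(N); the
    sampling distribution of the summaries is estimated by resampling
    that single simulation B times."""
    data = simulate(theta, N)
    S = np.array([summaries(rng.choice(data, size=N, replace=True))
                  for _ in range(B)])
    return gauss_logpdf(s_obs, S.mean(axis=0), np.cov(S.T))

N = 500
s_obs = summaries(simulate(1.0, N))      # "observed" data, true theta = 1
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
best = max(grid, key=lambda t: bootstrapped_sl(t, s_obs, N))
# best should land on the true value, 1.0.
```

The bootstrapped surface peaks at the true parameter here despite using one simulation per grid point rather than 200.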
Title Delayed acceptance ABC-SMC 
Description Approximate Bayesian computation (ABC) is the method of choice for inferring the parameters of statistical models that are defined by a simulator. The approach originates in genetics, but has also found application in a number of different fields, including the study of infectious diseases and the analysis of network data. ABC can be infeasible when the simulator is computationally expensive, since often the simulator needs to be run a large number of times. The method we have developed is useful in cases where an alternative computationally cheap simulator is available. In these cases, our method uses simulations from the cheap simulator to automatically rule out "bad" regions of parameter space, sometimes reducing the computational cost of ABC by orders of magnitude. 
Type Of Material Data analysis technique 
Year Produced 2017 
Provided To Others? Yes  
Impact None so far. 
URL https://arxiv.org/abs/1708.02230
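A rejection-sampling caricature of the delayed-acceptance idea (the paper embeds it in ABC-SMC; everything below, including the two simulators and the hand-chosen tolerances, is a toy assumption): proposals are screened with the cheap simulator first, and the expensive simulator runs only for parameters that survive the screen.

```python
import numpy as np

rng = np.random.default_rng(3)

def cheap_sim(theta):
    # A small sample stands in for a cheap, rough approximation.
    return rng.normal(theta, 1.0, size=20).mean()

def expensive_sim(theta):
    # A large sample stands in for the costly full simulator.
    return rng.normal(theta, 1.0, size=5000).mean()

def da_abc(y_obs, n_prop=2000, eps_cheap=0.8, eps=0.1):
    accepted, expensive_runs = [], 0
    for _ in range(n_prop):
        theta = rng.uniform(-5.0, 5.0)      # draw from a uniform prior
        # Stage 1: the cheap simulator rules out "bad" regions.
        if abs(cheap_sim(theta) - y_obs) > eps_cheap:
            continue
        # Stage 2: the expensive simulator runs only for survivors.
        expensive_runs += 1
        if abs(expensive_sim(theta) - y_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted), expensive_runs

samples, expensive_runs = da_abc(y_obs=0.0)
# Most of the 2000 proposals never reach the expensive simulator.
```

The screening tolerance eps_cheap trades off savings against the risk of discarding parameters the expensive stage would have accepted; it is fixed by hand here purely for illustration.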
Title Improvements to the balance heuristic for estimating normalising constants 
Description When using importance sampling to estimate normalising constants in Bayesian statistics (used for comparing models), the method can exhibit poor performance when only one proposal distribution is used. This method develops further the existing approaches that combine multiple proposals. It is a technique that can be used within many examples that use importance sampling; in our case, it was developed in order to enhance methods used for inferring popular models in genetics. 
Type Of Material Data analysis technique 
Year Produced 2019 
Provided To Others? Yes  
Impact None so far. 
URL https://arxiv.org/abs/1908.06514
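The baseline being improved on can be sketched with a toy normalising-constant problem whose answer is known (the target, the two proposals, and the sample sizes are all illustrative choices). Under the balance heuristic, every sample is weighted by the mixture of all the proposal densities, whichever proposal actually generated it:

```python
import numpy as np

rng = np.random.default_rng(4)

def target_unnorm(x):
    # Unnormalised target; the true normalising constant is sqrt(2*pi).
    return np.exp(-0.5 * x ** 2)

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

proposals = [(0.0, 1.0), (2.0, 1.5)]   # (mean, sd) of two proposals
n_per = 5000

x = np.concatenate([rng.normal(m, s, size=n_per) for m, s in proposals])

# Balance heuristic: weight each sample by the equal-weight mixture of
# the proposal densities, regardless of which proposal produced it.
mix = sum(normal_pdf(x, m, s) for m, s in proposals) / len(proposals)
Z_hat = (target_unnorm(x) / mix).mean()
# Z_hat should land close to sqrt(2*pi), roughly 2.5066.
```

This equal-weight scheme is what the paper develops improvements to, which matters when the proposals are of unequal quality.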
Title MHAAR for ClonalOrigin 
Description ClonalOrigin is a state-of-the-art model for recombination in bacteria, but was originally too expensive to deploy on whole genome sequences. The method we have developed uses recent research in computational statistics to speed up the inference, making use of parallel computing. 
Type Of Material Data analysis technique 
Year Produced 2020 
Provided To Others? No  
Impact None so far. 
Title Sequential Bayesian inference for the coalescent 
Description We have developed a new methodology for sequentially inferring a coalescent tree from whole genome sequences, which can be updated as new sequence data arrives. This method relies on an underpinning development in sequential Monte Carlo methodology. This will eventually lead to a software tool, which can be deployed as part of pipelines that analyse whole genome sequence data (hence it is listed as an improvement to research infrastructure). 
Type Of Material Data analysis technique 
Year Produced 2018 
Provided To Others? Yes  
Impact Presently this method is only available in prototype form. We are currently working on an implementation that we plan to make freely available. 
URL https://arxiv.org/abs/1612.06468
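The "update as data arrive" pattern can be sketched for a scalar parameter (the real method operates on coalescent trees via the sequential Monte Carlo methodology of the linked paper; the model, thresholds, and omission of particle-rejuvenation moves below are all simplifications): particles representing the posterior are reweighted as each observation arrives, and resampled when the weights degenerate.

```python
import numpy as np

rng = np.random.default_rng(5)

n_particles = 5000
particles = rng.normal(0.0, 3.0, size=n_particles)  # draws from the prior
logw = np.zeros(n_particles)

data_stream = rng.normal(1.0, 1.0, size=20)  # observations arriving in sequence

for y in data_stream:
    # Reweight: multiply each particle's weight by p(y | theta),
    # here Normal(theta, 1) up to a constant.
    logw += -0.5 * (y - particles) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(w ** 2) < n_particles / 2:
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
        logw = np.zeros(n_particles)

w = np.exp(logw - logw.max())
w /= w.sum()
posterior_mean = np.sum(w * particles)
# posterior_mean should sit near the data-generating value of 1.0.
```

A practical implementation would add MCMC move steps after resampling to restore particle diversity; the tree-valued version faces the further problem that "particles" are genealogies, which is what the underpinning methodological development addresses.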
Description Collaboration with Centre for Environment, Fisheries and Aquaculture Science on improved inference of fish stocks 
Organisation Centre For Environment, Fisheries And Aquaculture Science
Country United Kingdom 
Sector Public 
PI Contribution I have brought expertise in approximate Bayesian computation and particle MCMC, and new methods developed under UKRI funding, to this collaboration.
Collaborator Contribution The partners in this collaboration have brought expertise in modelling fish stocks, in interpreting results, and in providing a route to impact for the research (through input to policy development on fishery management).
Impact This collaboration is multi-disciplinary, involving me (Mathematics and Computational Statistics), members of the Ecology Group at the University of Reading, and experts in Ecology and Fishery Management at Cefas. Papers are currently in draft that describe the work, on fitting a bioeconomic model of fish stocks to data. The results of fitting this model will be used by Cefas to input into fishery management policy.
Start Year 2018
Title Improved inference for ClonalOrigin 
Description ClonalOrigin is a state-of-the-art model for recombination in bacteria. However, this model is rarely used to analyse whole genome sequences, since inference for it is computationally demanding. The software implements an improved reversible jump algorithm, which makes the inference more feasible. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact None yet. 
URL https://github.com/fmedina7/ClonOr_cpp
Title Sequential Monte Carlo for models with a variable number of parameters 
Description A java package that implements sequential Monte Carlo for some specific cases where the sequence of distributions changes in dimension, and allows extension to other models. This is associated with the paper "Sequential Monte Carlo with transformations". https://arxiv.org/abs/1612.06468 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact None yet. 
URL https://github.com/fmedina7/tSMC_java
Description Short talk at the Modernising Medical Microbiology conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Richard Everitt gave a short talk at the Modernising Medical Microbiology conference, about work on sequential inference for the coalescent. This conference is attended by a wide range of interested parties, including academics, health care professionals, and decision makers at organisations such as Public Health England. Several people expressed interest in the talk, and plan to read our draft paper.
Year(s) Of Engagement Activity 2017
URL http://modmedmicro.nsms.ox.ac.uk/mmm-conference-14th-march-2017/