RAPIER: from RAD sequencing to population genetics and evolutionary modelling

Lead Research Organisation: University of Bristol
Department Name: Mathematics

Abstract

Individual organisms belonging to the same species differ from one another genetically. The pattern of genetic variation, and how it covaries with phenotype, is highly informative about the past evolutionary history of the population, and also provides insights into gene function. Population genetic analysis has been shown to be very useful in a number of different fields, for example in genome-wide association studies to look for disease genes, epidemiological analysis, forensics, and also in elucidating the past history and evolution of human populations. Other than in model organisms such as humans, it has, until recently, been very expensive to analyse a sufficient amount of genome to be able to make accurate estimates of the quantities of interest. The development of Next-Generation Sequencing (NGS) technology has made it possible to analyse a very large number of genes (regions of the genome). However, NGS by itself is a broad tool more suited to the analysis of whole individual genomes, which is still relatively expensive. For population genetic analysis one requires a sample of genes across the genome to be compared across individuals. The method of RADseq has been developed to do this. It works by sequencing regions of the genome that have a particular motif (such as CCTGCAGG). Because fragments originate at the same motif, the same region can be compared across individuals. The challenge is that these motifs typically occur many thousands of times in a single genome, yielding many genes, which need to be sorted out. Computer software has been developed to do this, but because the technique is very new, there are a number of problems and biases inherent in the current method. This project aims to fix many of these problems by taking a more rigorously statistical approach. We will develop new publicly available software, making it much easier to apply NGS methods in population genetics.
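
As a rough illustration of why a shared motif makes regions comparable across individuals, the following minimal Python sketch locates every occurrence of the motif in a toy sequence and returns the bases immediately downstream of each one. The function names, the toy sequence and the read length are assumptions made for this example only; they are not taken from the project's software.

```python
def find_motif_sites(genome: str, motif: str = "CCTGCAGG") -> list[int]:
    """Return 0-based start positions of every occurrence of the motif."""
    positions = []
    start = genome.find(motif)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping occurrences are not missed.
        start = genome.find(motif, start + 1)
    return positions


def downstream_tags(genome: str, motif: str = "CCTGCAGG", read_length: int = 4) -> list[str]:
    """Return the read_length bases that follow each motif occurrence."""
    tags = []
    for pos in find_motif_sites(genome, motif):
        begin = pos + len(motif)
        tags.append(genome[begin:begin + read_length])
    return tags


# Hypothetical toy sequence containing the motif twice.
toy_genome = "AACCTGCAGGTTTTACGTCCTGCAGGAAAC"
sites = find_motif_sites(toy_genome)
tags = downstream_tags(toy_genome)
```

In a real genome the motif occurs many thousands of times, so the resulting fragments have to be grouped by locus before they can be compared between individuals.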

Technical Summary

In the medium term, sequencing of a reduced representation of the genome is the only feasible way forward for the application of next-generation sequencing technology to population genetic analysis of non-model organisms. It is difficult, however, to organise the plethora of fragments into aligned sequences that can be compared across individuals in order to quantify nucleotide frequencies. This is especially true if no reference genome is available. Currently there is only one software package that attempts to deal with this problem. This pioneering effort has a number of flaws, however, which are addressed in this project. We will develop a clustering algorithm to build short sequence alignments based on all the data in a sample rather than on a per-individual basis as in the current package. This algorithm will take into account sequencing error as measured by the Phred scores. Sets of orthologous sequences will be identified, and genotypes together with their posterior probabilities under an error model will be output. In addition, we will develop software for the inter-conversion of formats so that researchers can use the output in a number of different population genetic analysis packages. Furthermore, we will complete the initial stages of a new software package that will allow parameters to be inferred in a model of diverging populations with gene flow. This package will be based on approximate Bayesian computation, using recent enhancements to the method, and will take into account sequencing error in the estimation of gene frequency.
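
To make the idea of genotype posterior probabilities under an error model concrete, here is a minimal Python sketch for a single diploid, biallelic site. It is not the project's model. It assumes the standard Phred relation (error probability = 10^(-Q/10)), that an erroneous read is equally likely to show any of the three other bases, that reads are independent, and a uniform prior over the three genotypes. All function and parameter names are hypothetical.

```python
import math


def phred_to_error(q: float) -> float:
    """Convert a Phred score Q to an error probability 10^(-Q/10).

    Capped at 0.75 (assumption: a completely uninformative base call).
    """
    return min(10 ** (-q / 10), 0.75)


def genotype_posteriors(reads, alleles=("A", "G"), prior=None):
    """Posterior probability of each diploid genotype at one site.

    reads: list of (base, phred_score) pairs observed at the site.
    prior: optional prior over the genotypes (a/a, a/b, b/b); uniform if None.
    """
    a, b = alleles
    genotypes = [(a, a), (a, b), (b, b)]
    if prior is None:
        prior = [1 / 3] * 3  # assumption: uniform prior
    log_post = []
    for (g1, g2), pr in zip(genotypes, prior):
        ll = math.log(pr)
        for base, q in reads:
            e = phred_to_error(q)
            # Probability of the observed base given each chromosome copy.
            p1 = 1 - e if base == g1 else e / 3
            p2 = 1 - e if base == g2 else e / 3
            # A read comes from either copy with probability 1/2.
            ll += math.log(0.5 * p1 + 0.5 * p2)
        log_post.append(ll)
    # Normalise in log space to avoid underflow with many reads.
    m = max(log_post)
    weights = [math.exp(x - m) for x in log_post]
    total = sum(weights)
    return {g1 + g2: w / total for (g1, g2), w in zip(genotypes, weights)}


# Hypothetical example: five reads of each allele, all with Phred score 30.
het = genotype_posteriors([("A", 30)] * 5 + [("G", 30)] * 5)
hom = genotype_posteriors([("A", 30)] * 10)
```

With balanced reads the heterozygote receives almost all of the posterior mass, while ten high-quality reads of one allele favour the corresponding homozygote. A population-level prior based on allele frequencies could replace the uniform prior.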

Planned Impact

The main impact of this research will be that the software tool generated in this project will allow far greater use of NGS methods in population analysis. This will have a number of benefits outside academia:

Livestock and crop breeding technologies will benefit, particularly when involving organisms for which reference genomes have not yet been produced. The software will provide improved identification of genetic markers. These will be useful in QTL identification, the formation of high-density linkage maps, and also targeted back-crossing when breeding.

The software tool will have an impact on decision makers in conservation and wildlife management. For example, with improved generation of multiple genetic markers, the precision in the detection of hybrid individuals will be increased, thereby helping to control the effects of introgressive invasions. In addition, improved markers will allow better assessment of levels of inbreeding depression, by comparison of current levels of genetic variation with inferred past levels.

Veterinarians and clinicians will benefit because improved marker development for novel disease organisms will allow improved fitting of epidemiological models by means of NGS data.

An increase in the number of genetic markers will enable agricultural decision makers to gain an improved understanding of the routes by which certain pests have arrived in a country. It is possible to use the genetic markers to compare different models of demographic history.

There are also more indirect and long-term benefits through improved identification of the functional roles of genes involved in local adaptation. Genes identified as having adaptive value in a particular organism, from a genome-wide scan, can be further investigated, and their properties analysed. In this way, novel modes of action and regulatory pathways may be discovered, which may improve our understanding of gene action in humans, with potential medical applications.

Publications

 
Description The aim of this research has been to develop an accurate method for scoring the genotype of individuals from a particular type of genetic marker known as restriction site associated DNA (RAD). The research has resulted in a software pipeline that can efficiently cluster raw DNA reads from a RAD analysis, which is a key part of generating genotypes from RAD data. We have also outlined a likelihood-based approach for obtaining the genotypes from the clusters. The software is publicly available in a GitHub repository.
Exploitation Route The findings can easily be taken forward, because we have a documented pipeline, available in the public GitHub repository. It should therefore be straightforward for an interested research student or postdoctoral researcher to combine the pipeline with the likelihood model that we developed in order to produce a method for calling genotype frequencies.
Sectors Agriculture, Food and Drink; Education; Environment; Manufacturing, including Industrial Biotechnology

URL https://github.com/MrKriss/rapier
 
Description This research has resulted in a software pipeline that can efficiently cluster raw Illumina reads from a restriction site associated DNA (RAD) analysis, which is a key part of generating genotypes from RAD data. The software is publicly available in a GitHub repository.
First Year Of Impact 2013
Sector Digital/Communication/Information Technologies (including Software)
 
Title rad 
Description A novel clustering method for Illumina RAD reads. Designed as the first part of a pipeline for making accurate genotype calls from RAD data. 
Type Of Material Data analysis technique 
Year Produced 2013 
Provided To Others? Yes  
Impact Too early to judge. 
URL https://github.com/MrKriss/rapier