Comparative methods for second generation sequence analysis

Lead Research Organisation: University of Reading
Department Name: Sch of Biological Sciences

Abstract

Biological data from more than one species must be analysed in an evolutionary context that takes the species' evolutionary histories into account, because the data are not independent. For example, a trait found in mice has a higher probability of being found in rats than in humans, as mice share a more recent common ancestor with rats than with humans. If these evolutionary histories, known as phylogenies, are not accounted for, the analysis can give an incorrect result. The techniques for analysing data in an evolutionary context are called comparative methods.
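One common way to make this non-independence concrete is to assume that expected similarity between two species is proportional to the evolutionary time they shared before diverging. The sketch below builds such a matrix for the mouse, rat and human example. The divergence ages are rough illustrative placeholders rather than values from this project, the function name `phylo_covariance` is hypothetical, and the tree is assumed to be ultrametric (all species sampled at the present).

```python
def phylo_covariance(species, mrca_age, root_age):
    """Return a matrix whose entry [i][j] is the time species i and j
    shared between the root of the tree and their last common ancestor.

    Assumes an ultrametric tree, so every species has the same
    root-to-tip distance (root_age) on the diagonal.
    """
    n = len(species)
    cov = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(species):
        for j, b in enumerate(species):
            if i == j:
                cov[i][j] = root_age
            else:
                # Shared history ends where the two lineages split.
                cov[i][j] = root_age - mrca_age[frozenset((a, b))]
    return cov


SPECIES = ["mouse", "rat", "human"]

# Illustrative ages in millions of years (assumptions, not measured data).
ROOT_AGE = 90.0
MRCA_AGE = {
    frozenset(("mouse", "rat")): 15.0,
    frozenset(("mouse", "human")): 90.0,
    frozenset(("rat", "human")): 90.0,
}

COV = phylo_covariance(SPECIES, MRCA_AGE, ROOT_AGE)

# Dividing by the root age gives the expected correlation between species.
for name, row in zip(SPECIES, COV):
    print(name, [round(value / ROOT_AGE, 2) for value in row])
```

Under these assumptions mice and rats share most of their history and so are expected to be strongly correlated, while each is uncorrelated with humans. Treating the three species as independent data points would ignore that structure.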
Current DNA sequencing technology creates very large data sets, both in terms of the number of species and the types of data. Comparative methods use computationally complex mathematical models to combine the phylogeny with the data of interest, and more complex mathematical models are being developed. The increase in the volume of data and the complexity of the models is creating a gap between the ideas biologists would like to test and the computational power needed to perform the analysis. A single analysis can take weeks or even months on a desktop computer, and this is currently a rate-limiting step in biological research. Supercomputers can be used to solve these issues, but they are rare, complex, expensive to buy and run, and require a large amount of technical knowledge to use. Supercomputers also require large amounts of electricity to power and cool them.
The hardware used to play computer games, found in PCs and games consoles, has the potential to offer a solution to this problem. The vast computing power needed to generate 3D images can now be applied to solve other problems. A recent study, analysing medical data, showed how a PC with a number of graphics cards, costing $5300, could outperform a $4.6 million supercomputer. This project aims to vastly accelerate comparative methods analysis by using graphics hardware. A popular comparative methods package, BayesTraits, will be converted to use a range of graphics hardware. While graphics hardware has a large amount of computing power, it can be hard to utilise because it is designed primarily to perform a very different task. This makes developing programs for graphics hardware more complex and time-consuming than traditional computer programming.
Converting comparative methods programs to use graphics hardware will give biologists access to the effective computer hardware and software required to analyse the vast quantities of data being generated, allowing them to explore large data sets, answer complex questions and develop new insights into biological systems. It will eliminate the large technical hurdle associated with supercomputers and is cost-effective, costing hundreds or thousands of pounds instead of millions. Graphics cards require a twentieth of the power of traditional computers, making them more environmentally friendly.

Technical Summary

Comparative methods are computationally intensive because of the growth of large biological data sets and the complexity of the underlying statistical models. This is creating a gap between the data and models that biologists would like to analyse and the available computing resources. High performance computers can be used to bridge this gap, but they are unrealistic for many biologists because of their expense, rarity, running costs and technical requirements. General Purpose computing on Graphics Processing Units (GPGPU) has the potential to solve these problems, as the hardware is cheap, ubiquitous, has low technical requirements and requires a twentieth of the power of traditional computing. Converting programs to run on GPGPU requires a significant rewrite to take advantage of the hardware's massively parallel nature.
This project will convert a comparative methods package, BayesTraits, for GPGPU use. The majority of the program run time (>99%) is concentrated in the likelihood function, which calculates the probability of observing the comparative data given a phylogeny and model parameters. The likelihood function is based on a phylogenetic generalised least squares (GLS) calculation for continuous data and a continuous time Markov model for discrete data. These are fundamentally different calculations. The GLS method is dominated by matrix operations: inversions, powers and multiplications. The continuous time Markov model is a mix of matrix powers and a pruning function to collapse the likelihood through the phylogeny. The OpenCL framework will be used for development as it offers a hardware independent programming environment with a high degree of portability. BayesTraits is a general purpose comparative methods package and must deal effectively with a wide range of data sets, so tuning the program to work well across varied data sets and complex models will be important.
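The discrete-data calculation described above can be sketched for the simplest case, a two-state trait. This is a minimal CPU illustration of a pruning pass under stated assumptions, not the BayesTraits or OpenCL implementation: the function names and the nested-list tree encoding are hypothetical, the two-state transition probabilities use the standard closed form rather than a general matrix power, and the root states are weighted by the stationary frequencies of the model, which is one of several possible choices.

```python
import math


def two_state_transition(q01, q10, t):
    """Transition probabilities P[i][j] along a branch of length t for a
    two-state continuous time Markov model with rates q01 and q10."""
    total = q01 + q10
    decay = math.exp(-total * t)
    return [
        [(q10 + q01 * decay) / total, q01 * (1.0 - decay) / total],
        [q10 * (1.0 - decay) / total, (q01 + q10 * decay) / total],
    ]


def partial_likelihoods(node, q01, q10):
    """Pruning step: likelihood of the data below `node` for each state.

    A tip is an int (observed state 0 or 1); an internal node is a list
    of (child, branch_length) pairs.
    """
    if isinstance(node, int):
        return [1.0 if node == state else 0.0 for state in (0, 1)]
    result = [1.0, 1.0]
    for child, length in node:
        p = two_state_transition(q01, q10, length)
        below = partial_likelihoods(child, q01, q10)
        for state in (0, 1):
            # Sum over the child's possible states, then multiply
            # the contributions of all children together.
            result[state] *= sum(p[state][k] * below[k] for k in (0, 1))
    return result


def tree_likelihood(root, q01, q10):
    """Likelihood of the tip data, weighting root states by the
    stationary frequencies (an assumption of this sketch)."""
    total = q01 + q10
    stationary = [q10 / total, q01 / total]
    at_root = partial_likelihoods(root, q01, q10)
    return stationary[0] * at_root[0] + stationary[1] * at_root[1]


# Hypothetical tree ((mouse:0.2, rat:0.2):0.8, human:1.0) with the trait
# present (1) in mouse and rat and absent (0) in human.
EXAMPLE_TREE = [([(1, 0.2), (1, 0.2)], 0.8), (0, 1.0)]
print(tree_likelihood(EXAMPLE_TREE, q01=0.3, q10=0.7))
```

Each node combines only the results of its own children, so sibling subtrees can be evaluated independently of one another. That independence, together with the per-branch matrix calculations, is the kind of structure a parallel implementation can exploit.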

Planned Impact

Using graphics hardware to accelerate comparative methods will have a diverse range of impacts. Comparative methods are gaining ground in cultural research, and BayesTraits has been used in a number of diverse fields including linguistics and anthropology. Languages and cultures have many similarities with species: they mutate and adapt over time, are heritable and compete for resources. Researchers also ask similar questions: what is the rate of change, what is the ancestral state, and are there correlations in the data? BayesTraits has been used to analyse a range of cultural data, including identifying the rate at which meanings evolve through Indo-European languages, investigating how cultures evolve and sustain complex social systems, and showing how marriage systems and wealth transfer at marriage are correlated across cultures.
While supercomputers are widely used in scientific research, they are limited to countries with large research budgets. The top 500 supercomputers in the world are shared between 31 countries, with the United States owning 56% of them. This leaves over 160 countries without access to high-performance computing. Personal supercomputing, offered by GPGPU, has the ability to change this situation, giving researchers across the world access to cheap and powerful computing.
It is estimated that 2% of the world's total energy is consumed by computer equipment. Graphics cards require a tenth of the power of traditional supercomputers, making them an excellent green alternative. In 2007 the University of Reading purchased a supercomputer, ThamesBlue, which was rated the 36th fastest supercomputer in the world and consisted of over 700 nodes. One of its primary tasks was to analyse biological data. It cost an estimated £25,000 a month in electricity to run, including cooling. Currently, 40 of the latest graphics cards have the same computational power and would fit into 10 nodes, requiring no more than £250 a month in electricity to run. These examples are not quite comparable, as the supercomputer is much easier to program and the comparison assumes that any program would use all of the capabilities of the graphics cards, which is hard to achieve. The comparison does, however, serve to highlight how energy efficient and powerful this technology is.

Publications

Baker J (2016) Positive phenotypic selection inferred from phylogenies in Biological Journal of the Linnean Society

Baker J (2015) Adaptive evolution toward larger size in mammals in Proceedings of the National Academy of Sciences of the United States of America

Cooper N (2016) A cautionary note on the use of Ornstein Uhlenbeck models in macroevolutionary studies in Biological Journal of the Linnean Society

 
Description The grant has developed a new range of algorithms to effectively analyse large evolutionary data sets. It is now possible to analyse data sets consisting of thousands of taxa using standard computer hardware. This represents a speed increase of orders of magnitude over the previous methods and bridges the gap between data availability and analytical techniques.
Exploitation Route The software developed during this grant has been downloaded over 5300 times in the last 12 months and has led to a range of high impact papers by the group and other researchers.
Sectors Digital/Communication/Information Technologies (including Software); Environment; Culture, Heritage, Museums and Collections; Other

URL http://www.evolution.reading.ac.uk/BayesTraits.html
 
Description BBSRC tools and development
Amount £150,000 (GBP)
Funding ID H5183100 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 11/2014 
End 06/2016
 
Title BayesTraits V2 (Beta) 
Description BayesTraits V2 (Beta) allows effective analysis of data sets orders of magnitude larger than the previous version could handle, enabling analysis of data sets of tens of thousands of taxa generated from next generation sequencing methods.
Type Of Technology Software 
Year Produced 2013 
Impact BayesTraits V2 (Beta) allows effective analysis of data sets orders of magnitude larger than the previous version could handle, enabling analysis of data sets of tens of thousands of taxa generated from next generation sequencing methods.
URL http://www.evolution.reading.ac.uk/BayesTraitsV2Beta.html
 
Title BayesTraits V2.0 
Description BayesTraits is a computer package for performing analyses of trait evolution among groups of species for which a phylogeny or sample of phylogenies is available. Such phylogenies can be created using BayesPhylogenies. BayesTraits can be applied to the analysis of traits that adopt a finite number of discrete states, or to the analysis of continuously varying traits. The methods can be used to take into account uncertainty about the model of evolution and the underlying phylogeny.
Type Of Technology Software 
Year Produced 2014 
Impact The new version of BayesTraits can effectively analyse large trees with hundreds or thousands of taxa, for continuous, discrete and multistate data. Improvements have led to a speed increase of over three orders of magnitude when analysing a near complete bird phylogeny of 6000 taxa, reducing the run time from years to hours. This allows the analysis of current and next generation phylogenetic trees, as well as the development of more complex and realistic evolutionary models. The new release includes a comprehensive manual, worked examples and supporting technical information. Since the last submission, BayesTraits V2 has been downloaded over 3,500 times.
URL http://www.evolution.reading.ac.uk/BayesTraits.html