Epicluster: A novel tool for high throughput detection of epistasis in studies of the genetics of complex traits

Lead Research Organisation: University of Edinburgh
Department Name: MRC Human Genetics Unit

Abstract

Gene interactions are thought to be important in shaping complex trait variation in agricultural, model organism and human disease genetics. They have been poorly explored, however, because of the lack of high throughput tools to analyse many different traits. With support from the GridQTL project funded by BBSRC, we have developed a tool that can perform high throughput analyses of gene interactions in experimental populations genotyped with low density genetic markers. The tool, however, is not applicable to the large datasets produced by genome-wide association studies in natural/commercial populations. Such datasets typically include hundreds of thousands of genetic markers and thousands of individuals with a large number of phenotypic traits. Genome-wide association studies have become increasingly popular for the investigation of the genetics of complex traits in the livestock, plant, and human sectors. Despite much effort, a comprehensive analysis of gene interactions in those large datasets is still intractable for even a single trait (on the order of CPU months) because of the excessive computing demand and the lack of algorithms to handle billions of tests of marker combinations. A new high throughput analysis tool has become a necessity to study gene interactions in these large datasets. We propose the development of Epicluster, a novel tool to support routine high throughput analysis of gene interactions in large association study datasets. Instead of directly testing billions of marker combinations exhaustively, Epicluster will effectively select candidate markers with consistent genotype distribution patterns that differentiate the group of individuals with high trait values from the group with low trait values. It will then perform comprehensive statistical tests only among the selected candidate markers and can thus reduce the time needed to analyse gene interactions for one trait to CPU hours.
Epicluster development will adapt a bi-clustering algorithm that has been successfully applied in gene expression studies. A proof-of-principle test showed that the bi-clustering algorithm could cluster a large dataset with 500,000 markers in minutes. On completion, Epicluster will be implemented as distributed software (i.e. automated analysis) to be used in high performance computing environments. In summary, we expect Epicluster to herald a breakthrough in gene interaction analyses in large datasets across species. Hence Epicluster will facilitate a fuller understanding of the importance of gene interactions in complex traits.
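
To give a sense of the scale involved, the number of distinct marker pairs grows quadratically with the number of markers. The short sketch below counts pairwise tests for the 500,000-marker dataset mentioned above and for a reduced candidate set. The candidate count of 5,000 is a purely illustrative assumption, not a figure from the project, and the function name is hypothetical.

```python
from math import comb


def pairwise_tests(n_markers: int) -> int:
    """Number of distinct marker pairs in an exhaustive pairwise scan."""
    return comb(n_markers, 2)


# Exhaustive scan over all 500,000 markers (figure taken from the abstract).
full_scan = pairwise_tests(500_000)

# Scan restricted to a hypothetical set of 5,000 pre-selected candidates.
candidate_scan = pairwise_tests(5_000)

# How many times fewer tests the restricted scan needs.
reduction_factor = full_scan / candidate_scan

print(f"{full_scan:,} pairs vs {candidate_scan:,} pairs")
```

Higher-order combinations (three or more markers) grow faster still, which is why restricting the search space before testing matters.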

Technical Summary

Detecting statistical epistasis is equivalent to identifying multiple loci whose joint behaviour is significantly associated with a trait of interest. Most multi-locus methods are based on an exhaustive search of genotype combinations and are therefore not computationally feasible for analysing epistasis in GWAS data. New algorithms based on data mining (e.g. grammatical evolution optimized neural networks) or Bayesian marker partitioning (e.g. epistatic module detection) have been developed to address this issue, but these still require many CPU days to analyse one trait. In addition, the success of these algorithms depends on the parameter settings and/or priors used for a given analysis. Epicluster takes a different approach: it uses a bi-clustering algorithm to select clusters of potentially interacting SNPs from the total search space, then tests for epistasis among the selected SNPs only. The member SNPs of each cluster have identical or inverse (i.e. possibly interacting) genotype distribution patterns across a subset of individuals with high trait values (cases) and a subset of individuals with low trait values (controls), and the patterns in cases are different from the patterns in controls. An established multi-locus method, e.g. the regression-based approach, can be used to search for epistasis within the much reduced space. The bi-clustering algorithm is adapted from the Mining Attribute Profiles method, which is effective in discovering co-regulated gene expression patterns. As a proof of principle, it was tried on a GWAS dataset with 500,000 SNPs and completed the bi-clustering in minutes. Given the speed of the bi-clustering algorithm, we expect Epicluster to be able to analyse a trait on the order of CPU hours. Epicluster will be validated by simulations and compared against established multi-locus methods.
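
The second stage described above, a regression-based test applied only among selected SNPs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the Epicluster implementation: genotypes are assumed to be coded 0/1/2, the trait is quantitative, the test compares an additive model with one that adds a single product term, and the candidate indices are simply taken as input (the bi-clustering stage that would produce them is not shown). All function names are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(design, y):
    """Residual sum of squares of the least-squares fit of y on the design rows."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(design, y))


def interaction_f(g1, g2, y):
    """F statistic (1 numerator df) for adding the g1*g2 term to an additive model."""
    reduced = [[1.0, a, b] for a, b in zip(g1, g2)]
    full = [[1.0, a, b, a * b] for a, b in zip(g1, g2)]
    rss_reduced, rss_full = rss(reduced, y), rss(full, y)
    residual_df = len(y) - 4  # four parameters in the full model
    return (rss_reduced - rss_full) / (rss_full / residual_df)


def scan_candidates(genotypes, y, candidates):
    """Run the interaction test on every pair among the candidate SNP indices only."""
    return {(i, j): interaction_f(genotypes[i], genotypes[j], y)
            for i, j in combinations(sorted(candidates), 2)}
```

Calling `scan_candidates(genotypes, trait, candidates)` returns one F statistic per candidate pair. The sketch assumes each pair's design matrix has full rank and that the full model does not fit perfectly; a production tool would guard against both cases and convert the statistics to p-values with a multiple-testing correction.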

Planned Impact

Epicluster will herald a breakthrough in the analysis of epistasis in GWAS data, which have been used to address a wide range of issues across species, including animal health and human ageing. It will be the first application that can support high throughput epistasis analysis in GWAS datasets. The immediate impact includes:
1. Existing GWAS datasets can be reused to generate new knowledge, maximising the returns on significant previous investment and encouraging data sharing.
2. Accumulated epistasis results from the high throughput analyses will lead to a better understanding of the role of epistasis in complex traits.
3. Identification of epistatic loci will increase the trait variance explained, increasing the utility of the results for predicting, for example, disease risk (in all organisms including man) and for estimating breeding values (for selection in agricultural species).
4. New genes and pathways may be identified, increasing understanding of the biological basis of trait variation.
5. Epistatic loci specifically indicate the presence of gene interactions in causal networks, so this information can be combined with other sources to facilitate reverse engineering of gene networks and pathways.
6. High throughput GWAS epistasis analyses will significantly increase the utility and usage of high-performance computing resources.
7. The output of the GWAS epistasis analyses can be collated into a dedicated database (to be developed as a publicly shared resource), which will encourage data sharing (e.g. meta-analysis) and can be linked to other biological databases, such as those for protein interactions, pathways and networks, for meaningful interpretation.

Results generated from the GWAS epistasis analyses have important scientific implications. These results can help answer the question of missing heritability in traditional GWAS, i.e. why only a small proportion of phenotypic variation could be explained in most studies. These results may also be the basis for functional studies of newly discovered interactions and networks. Furthermore, they could be applied in animal/plant breeding programmes, e.g. the genomic selection programmes currently under development in the cattle and chicken industries.

To engage users and beneficiaries, we will publish Epicluster in peer reviewed journals, present it at international conferences and local seminars, and make the software and source code freely available to the scientific community via the MRC Human Genetics Unit (HGU) website. A user manual and an online tutorial will also be prepared and placed on the website. In addition, we will develop Epicluster user groups based on our links to researchers in HGU and the University of Edinburgh (human geneticists), The Roslin Institute (animal geneticists), and Rothamsted Research Institute (plant geneticists). Collaborations will be developed from those links to use Epicluster to study important research problems in the human, animal and plant sectors respectively. We will also explore the possibilities of building new research projects based on Epicluster, including a portal-based service to allow Epicluster users to perform high throughput analysis using Grid computing resources, a database project to collate epistasis results, and development of a new tool to prioritize epistatic loci for functional studies.

Publications

 
Description New algorithms and a fast tool that makes testing gene-gene interactions a routine exercise in genome-wide association studies.
Exploitation Route Software is publicly available. A further step is necessary to extend the algorithms and to develop new tools allowing meta-analysis of multiple cohorts to boost the power of detection of gene-gene interactions.
Sectors Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Education, Healthcare, Pharmaceuticals and Medical Biotechnology

 
Description Yes. The software BiForce Toolbox has been downloaded and used by various groups, and the papers have been cited by 10 published papers so far.
First Year Of Impact 2012
Sector Digital/Communication/Information Technologies (including Software), Education
Impact Types Societal

 
Title BiForce Toolbox 
Description BiForce Toolbox addresses the demand for high-throughput analysis of pairwise epistasis in either quantitative or disease traits across all commonly used computer systems. BiForce Toolbox is a stand-alone Java program that integrates bitwise computing with multithreaded parallelization and thus allows rapid full pairwise genome scans via a graphical user interface or the command line. Furthermore, BiForce Toolbox incorporates additional tests of interactions involving SNPs with significant marginal effects, potentially increasing the power of detection of epistasis.
Type Of Technology Software 
Year Produced 2012 
Impact By August 2014, BiForce had been downloaded by 129 users and applied in several published studies of epistasis that reported interesting signals and pathways.
URL http://bioinfo.utu.fi/biforcetoolbox
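
The bitwise computing mentioned in the BiForce Toolbox description can be illustrated in general terms. The sketch below is an assumption-laden illustration of bit-parallel genotype counting, not BiForce Toolbox's actual Java code: each SNP is stored as three bitmasks (one per genotype 0, 1 and 2, with one bit per individual), so the 3x3 joint genotype table for a SNP pair comes from bitwise AND plus a population count rather than a loop over individuals. All names are hypothetical.

```python
def encode(genotypes):
    """Return three bitmasks; bit i of masks[g] is set when individual i has genotype g."""
    masks = [0, 0, 0]
    for i, g in enumerate(genotypes):
        masks[g] |= 1 << i
    return masks


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def joint_counts(masks_a, masks_b, subset=None):
    """3x3 table of genotype-pair counts, optionally restricted to a subset bitmask."""
    table = []
    for mask_a in masks_a:
        row = []
        for mask_b in masks_b:
            both = mask_a & mask_b
            if subset is not None:
                both &= subset  # keep only individuals in the subset (e.g. cases)
            row.append(popcount(both))
        table.append(row)
    return table


# Toy data: six individuals, two SNPs, and the first three individuals as cases.
snp_a = encode([0, 1, 2, 0, 1, 2])
snp_b = encode([0, 0, 1, 1, 2, 2])
cases = 0b000111

all_table = joint_counts(snp_a, snp_b)
case_table = joint_counts(snp_a, snp_b, cases)
```

Because every pair of SNPs is counted independently of every other pair, a scan built this way splits naturally across threads, which is consistent with the multithreaded parallelization the description mentions.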