Multi-relational Association Mining Software for Genome Wide Association Studies

Lead Research Organisation: Aberystwyth University
Department Name: Computer Science


We are now in the genomic age, and have a variety of technologies available to tell us about the specific genomes of the organisms we work with. In humans we would like to know about the genes associated with disease or ageing, so that we can more effectively target drugs or engineer vaccines. In plants we would like to know about the genes associated with resistance to drought, tolerance of stress, the production of seed and the density of growth, so that we can breed better crops that will be more suitable to future climates and demands for food and fuel production.

Full genome sequencing to discover this genomic information for large populations is still expensive, but the sequencing of the DNA at certain marker locations is now reasonably affordable and technologically possible. For humans we can use more than a million markers to determine their genetic makeup at these locations in their genome. The question then, for humans, crops or other organisms, is how to relate this genotype to the disease/health/drought resistance/seed production they show (the phenotype). We need to find associations between genotype and phenotype.

Association mining is a data mining technique commonly used by academic and industrial data mining experts to find frequent associations. It is used by commercial retailers to suggest other products that a customer might also like to buy, given the history of frequently associated purchases made by others. This technology should be ideal for finding genotype/phenotype associations. However, for this problem we have more complex data than the standard algorithms can process. The standard algorithms will only work for 'single table' data: data that could be represented in a single matrix of rows and columns. As soon as we want to specify more interesting relationships we need a more powerful representation for the data and for the association. For this we need to use first order predicate logic. 'First order' refers to the ability to use variables to represent relationships between the parts of the association (rather than just constants). The predicates describe those relationships. An example of such a relationship is:

if genotype(Plant, ssr_1057, long, confident) and genotype(Plant, ssr_369, short, confident) and parent(Plant, Parent) and location(Parent, thailand) then dense_stems are 70% likely
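
To make the rule above concrete, the following is a minimal sketch of how the confidence of such a multi-relational rule could be measured over a small set of facts. It is an illustration only and not the project's software: the plant and parent identifiers (p1, a1 and so on), the fact tables and the helper names body_matches and rule_confidence are all assumptions invented for this example. The variables Plant and Parent are bound by joining across three separate relations, which is what a single-table representation cannot express directly.

```python
# Hypothetical facts: every identifier and value here is invented for illustration.
genotype = {
    ("p1", "ssr_1057", "long", "confident"),
    ("p1", "ssr_369", "short", "confident"),
    ("p2", "ssr_1057", "long", "confident"),
    ("p2", "ssr_369", "short", "confident"),
    ("p3", "ssr_1057", "long", "confident"),
    ("p3", "ssr_369", "long", "confident"),
}
parent = {("p1", "a1"), ("p2", "a2"), ("p3", "a1")}   # (Plant, Parent)
location = {("a1", "thailand"), ("a2", "thailand")}    # (Parent, Place)
dense_stems = {"p1"}                                    # plants showing the phenotype


def body_matches():
    """Return the plants for which some Parent binding satisfies the rule body."""
    matched = set()
    for plant, par in parent:
        if ((plant, "ssr_1057", "long", "confident") in genotype
                and (plant, "ssr_369", "short", "confident") in genotype
                and (par, "thailand") in location):
            matched.add(plant)
    return matched


def rule_confidence():
    """Fraction of body-matching plants that also show dense stems."""
    body = body_matches()
    if not body:
        return 0.0
    return len(body & dense_stems) / len(body)


matched = body_matches()
confidence = rule_confidence()
print(sorted(matched), confidence)
```

With these invented facts, two plants satisfy the body of the rule and one of them has dense stems, so the measured confidence is 0.5. A figure such as the 70% in the rule above would be obtained in the same way from real data.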

We will produce software that can find multi-relational associations such as these in large amounts of complex data. We will apply this software to standard test data, and to a case study at Aberystwyth, for analysis of our population of the bio-energy crop Miscanthus. We will release the software as open source, with documentation and tutorials for the biological community to use.

Technical Summary

This proposal will produce association mining software for genome wide association studies. The software will find multi-relational associations, that is, it will be able to work with data expressed as relations spanning multiple database tables, or expressed as first order predicate logic. In this way we will be able to make use of not just simple marker variations and a basic phenotype, but complex structured phenotype data, information about parental genotype and phenotype, environmental data, information about sequence similarity, geography, longitudinal data and other data as required.

The software will be based on high performance data structures (inverted indices and data compression) to provide an effective solution for large data that cannot easily be handled by existing algorithms. The software will be open source and documented.
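
As a rough illustration of why inverted indices help here, the sketch below maps each item (for example a marker value or a phenotype) to the set of records that contain it, so that the support of a candidate association can be found by intersecting sets rather than rescanning every record. This is a generic sketch under stated assumptions, not the data structure used in the project: the record contents and the function names build_inverted_index and support are hypothetical, and no compression is shown.

```python
from collections import defaultdict

# Hypothetical records: plant identifier -> items observed for that plant.
records = {
    "p1": {"ssr_1057=long", "ssr_369=short", "dense_stems"},
    "p2": {"ssr_1057=long", "ssr_369=short"},
    "p3": {"ssr_1057=long", "ssr_369=long", "dense_stems"},
}


def build_inverted_index(records):
    """Map each item to the set of record identifiers that contain it."""
    index = defaultdict(set)
    for record_id, items in records.items():
        for item in items:
            index[item].add(record_id)
    return index


def support(index, itemset):
    """Return the records containing every item, by intersecting posting sets."""
    postings = [index.get(item, set()) for item in itemset]
    if not postings:
        return set()
    postings.sort(key=len)          # start from the rarest item to keep sets small
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


index = build_inverted_index(records)
print(sorted(support(index, ["ssr_1057=long", "dense_stems"])))
```

Starting the intersection from the shortest posting set keeps the intermediate results small, which matters when there are very many markers and most candidate associations are rare.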

Aberystwyth University has a world-leading breeding program for the bioenergy crop, Miscanthus, with a collection of several thousand accessions. We will apply the software to the Miscanthus case study in Aberystwyth. We are currently obtaining genotype data for these collections, and these data will provide an excellent real-world application for the software.

Planned Impact

We will produce software for data mining associations between genotype and phenotype. This software will benefit industrial and academic researchers who need to associate genotype and phenotype. This includes those who work in genomic selection (crop and animal breeding specialists), those investigating genetic involvement in human diseases and ageing, and those who want to conduct core functional genomics work. They will have access to open source free software, designed specifically to investigate these relationships.

The Miscanthus research community in particular will benefit from our application of this software to the Miscanthus breeding programme during this project. This bioenergy crop is being investigated to increase our understanding of factors affecting biomass accumulation and the quality of this biomass for downstream processing, such as combustion in power stations or for conversion to liquid fuels, so that we can identify and breed improved Miscanthus varieties. The collection at Aberystwyth exhibits enormous phenotypic and genetic diversity for the development of new varieties optimised for maximum production in varied climates.

Other beneficiaries of this work include schools within the Convergence area of Wales, targeted by the Technocamps activity to help bring computer science to secondary school children. We will produce a bioinformatics 'activity pack' so that the children can learn about how computers are used to help biologists analyse their data, and how the information in DNA can produce different phenotypes as an analogy to how information in computer code can be executed to create different results. Elaine Jensen (a School Regional Champion for the BBSRC) and the BBSRC Inspiring Young Scientists coordinator (Tristan Bunn) will review the bioinformatics activity pack with the intention of publishing it on the BBSRC schoolscience web pages and elibrary in order to make it widely available to teachers and researchers.

Finally we believe that a good demonstration of multi-relational association mining on this problem will benefit both the data mining community and the GWAS community by providing the data mining community with a really challenging problem needing new solutions and by providing the GWAS community with better exposure to data miners and new algorithms.


Description: GWAS is difficult to analyse, and the data that were available at the time of this proposal were insufficient to draw useful conclusions. It is now more widely accepted that most GWAS studies are underpowered.
Exploitation Route: A PhD student project in our department has taken the ideas from this project on to explore collating GWAS databases across species in order to extract relations that can be used to predict phenotype.
Sectors: Agriculture, Food and Drink, Environment

Description: Genome Game
Form Of Engagement Activity: Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach: Regional
Primary Audience: Public/other audiences
Results and Impact: We created a web-based game describing the problem of inferring genotype from phenotype. We have used this game as an activity at multiple venues including during National Science Week, the National Eisteddfod, and various local events such as a display of computing and robots on Aberystwyth promenade in the summer.

Children ask relevant questions, express surprise at the size of genomes, ask why there are so many genes in wheat/yeast/humans, and enjoy solving the puzzle.
Year(s) Of Engagement Activity: 2013, 2014