Principled Application of Learning Classifier Systems to Large-Scale Challenging Datasets (LCSxLCD)

Lead Research Organisation: University of Nottingham
Department Name: Sch of Biosciences

Abstract

The goal of this project is to study the general applicability of Learning Classifier Systems (LCS) to large-scale challenging data mining tasks. Data Mining and Knowledge Discovery have become crucial technologies for the advancement of many scientific disciplines. Vast amounts of data are available thanks to initiatives such as the Human Genome Project, the virtual human physiome, etc. Successful data mining techniques have to scale with the volume of the data, extract accurate models from (often) noisy and ambiguous datasets, and provide new insight that enhances our understanding of complex problems. LCS are robust machine learning techniques with very high potential for data mining. The frontier of competence for LCS has been pushed forward in recent years with the help of advanced representations, better search mechanisms and theoretical analysis, as well as a few examples of their application to challenging real-world domains. This success notwithstanding, most if not all of the progress has been heuristically driven. In this project we will (1) develop theoretical models for the performance of LCS when applied to large volumes of data, which can inform us of when and why LCS methods succeed and when they fail; and (2) use the insight gained from these models to design new LCS methods with improved performance and robustness. The end product of the project will be a framework containing all the studied techniques with theory-based efficient implementations, adapted for use in high performance computing environments. Datasets known to be difficult to mine will be used to validate the success of the developed techniques.

Publications

 
Description The goal of this project was to study the general applicability of Learning Classifier Systems (LCS) to large-scale challenging data mining tasks. The work programme designed to achieve this goal was structured into three main Work Packages: (1) theoretical foundations, where formal models for the functioning of the different subcomponents of an LCS were created; (2) algorithmic advances, providing improved and efficient mechanisms to tackle large-scale datasets; and (3) knowledge transfer, where LCS methods were applied to real-world problems. All three work packages have been successful:

- We have created theoretical models for the initialisation stage of our LCS that help us understand how LCS work and that explain, in a principled way, the difficulties such methods face on a specific class of problems: datasets with rule overlap.

- We have proposed a method to integrate cutting-edge high performance computing hardware (GPGPUs) within LCS that can speed up LCS methods by orders of magnitude.

- We have applied LCS to a variety of problems in bioinformatics, systems biology and synthetic biology. Worth mentioning is our application of LCS to understand the process of seed germination in Arabidopsis thaliana, which led to the discovery (experimentally verified) of four novel regulators of germination.
Sectors Digital/Communication/Information Technologies (including Software)

 
Title BioHEL 
Description This is rule-based machine learning software designed to deal with large-scale datasets. 
Type Of Technology Software 
Year Produced 2011 
Open Source License? Yes  
Impact Using this software we participated in the CASP9 protein structure prediction challenge, in which our method was ranked as the best ab-initio contact map predictor. Moreover, two papers that describe different aspects of the BioHEL software won the best paper award for the machine learning track of the ACM GECCO conference in 2010 and 2011. 
URL http://ico2s.org/software/biohel.html
 
Title ICOS PSP server 
Description This is a web service to access a broad range of predictors related to the field of protein structure prediction, including our Contact Map prediction method that was ranked as the best ab-initio method in the CASP9 competition 
Type Of Technology Webtool/Application 
Year Produced 2012 
Impact Our Contact Map prediction method was ranked as the best ab-initio method in the CASP9 protein structure prediction competition 
URL http://ico2s.org/servers/psp.html