Principled Application of Learning Classifier Systems to Large-Scale Challenging Datasets (LCSxLCD)
Lead Research Organisation:
University of Nottingham
Department Name: Sch of Biosciences
Abstract
The goal of this project is to study the general applicability of Learning Classifier Systems (LCS) to large-scale, challenging data mining tasks. Data Mining and Knowledge Discovery have become crucial technologies for the advancement of many scientific disciplines. Vast amounts of data are available thanks to initiatives such as the human genome project, the virtual human physiome, etc. Successful data mining techniques have to scale according to the volume of the data, extract accurate models out of (often) noisy and ambiguous datasets and provide new insight that enhances our understanding of complex problems. LCS are robust machine learning techniques with very high potential for data mining. The frontier of competence for LCS has been pushed forward in recent years with the help of advanced representations, better search mechanisms and theoretical analysis, as well as a few examples of their application to challenging real-world domains. This success notwithstanding, most if not all of the progress has been heuristically driven. In this project we will (1) develop theoretical models for the performance of LCS when applied to large volumes of data that can inform us of when and why LCS methods are successful and also when they fail; (2) afterwards, use the insight gained from these models to design new LCS methods with improved performance and robustness. The end product of the project will be a framework containing all the studied techniques with theory-based efficient implementations, adapted for their usage in high performance computing environments. Datasets known to be difficult to data mine will be used to validate the success of the developed techniques.
Organisations
People |
ORCID iD |
Jaume Bacardit (Principal Investigator) |
Publications
Garcia-Piquer A
(2014)
Large-Scale Experimental Evaluation of Cluster Representations for Multiobjective Evolutionary Clustering
in IEEE Transactions on Evolutionary Computation
Franco María A.
(2010)
Speeding up the evaluation of evolutionary learning systems using GPGPUs
Franco M
(2013)
GAssist vs. BioHEL: critical assessment of two paradigms of genetics-based machine learning
in Soft Computing
Franco M
(2020)
Automatic Tuning of Rule-Based Evolutionary Machine Learning via Problem Structure Identification
in IEEE Computational Intelligence Magazine
Franco M
(2012)
Analysing BioHEL using challenging boolean functions
in Evolutionary Intelligence
Franco M
(2016)
Large-scale experimental evaluation of GPU strategies for evolutionary machine learning
in Information Sciences
Calian D
(2013)
Integrating memetic search into the BioHEL evolutionary learning system for large-scale datasets
in Memetic Computing
Bassel GW
(2011)
Functional network construction in Arabidopsis using rule-based machine learning on large-scale data sets.
in The Plant Cell
Bacardit J
(2012)
Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features.
in Bioinformatics (Oxford, England)
Bacardit J
(2014)
Hard Data Analytics Problems Make for Better Data Analysis Algorithms: Bioinformatics as an Example.
in Big Data
Description | The goal of this project was to study the general applicability of Learning Classifier Systems (LCS) to large-scale, challenging data mining tasks. The work programme designed to achieve this goal was structured in three main Work Packages: (1) theoretical foundations, where formal models for the functioning of the different subcomponents of an LCS were created; (2) algorithmic advances, providing improved and efficient mechanisms to tackle large-scale datasets; and (3) knowledge transfer, where LCS methods were applied to real-world problems. All three work packages have been successful: - We have created theoretical models for the initialisation stage of our LCS that help us understand how LCS work and explain in a principled way the difficulties that such methods face on a specific class of problems: datasets with rule overlap. - We have proposed a method to integrate cutting-edge high performance computing hardware (GPGPUs) within LCS that speeds up LCS methods by orders of magnitude. - We have applied LCS to a variety of problems in bioinformatics, systems biology and synthetic biology. Worth mentioning is our application of LCS to understand the process of seed germination in Arabidopsis thaliana, which led to the discovery (experimentally verified) of four novel regulators of germination. |
Sectors | Digital/Communication/Information Technologies (including Software) |
Title | BioHEL |
Description | Rule-based machine learning software designed to deal with large-scale datasets. |
Type Of Technology | Software |
Year Produced | 2011 |
Open Source License? | Yes |
Impact | Using this software, we participated in the CASP9 protein structure prediction challenge, in which our method was ranked as the best ab-initio contact map predictor. Moreover, two papers describing different aspects of the BioHEL software won the best paper award in the machine learning track of the ACM GECCO conference in 2010 and 2011. |
URL | http://ico2s.org/software/biohel.html |
Title | ICOS PSP server |
Description | This is a web service to access a broad range of predictors related to the field of protein structure prediction, including our Contact Map prediction method that was ranked as the best ab-initio method in the CASP9 competition |
Type Of Technology | Webtool/Application |
Year Produced | 2012 |
Impact | Our Contact Map prediction method was ranked as the best ab-initio method in the CASP9 protein structure prediction competition |
URL | http://ico2s.org/servers/psp.html |