Fast Generalised Rule Induction

Lead Research Organisation: University of Reading
Department Name: Computer Science

Abstract

The proposed research will significantly advance the state of the art in the field of data stream mining, in particular by providing adaptable rules that are understandable by humans and can also be implemented in rule-based expert systems.

Data stream mining is growing in importance, and packages such as MOA are widely used in the data stream mining community. However, none of these environments provides techniques for extracting descriptive rules that can express patterns, and changes in those patterns, encoded in data streams. Descriptive data stream mining techniques such as cluster analysis exist, but none identifies patterns in the form of expressive and meaningful rules. Such rules are important because they allow domain experts to look for potentially interesting but unknown patterns and for changes in those patterns over time. Because the patterns are described as rules, conclusions can be drawn about why a pattern exists and how it could be influenced. Creating these rule sets is challenging for two reasons: they need to be adapted automatically, because patterns encoded in a stream may change over time (known as concept drift), and they need to be created in a single pass through the data, because data streams are potentially never-ending.

The research will be integrated into a popular open source environment, with the two contenders being the MOA and KNIME data mining workbenches. Integration of the methodology into MOA in particular will accelerate the adoption of expressive descriptive data stream mining techniques in both academic and commercial communities. UK telecommunication and chemical companies are among those that stand to gain significant advantages from extracting descriptive rules from streaming data in real time. The method will enable the telecommunication industry to detect interesting event patterns in national telecommunication networks more efficiently and on the fly, and thus help to forecast performance bottlenecks and faults (personal communication with British Telecom). Chemical companies can employ this technology to monitor sensors in chemical plants and identify plant stages on the fly, without the need for time-consuming laboratory analyses. This will trigger R&D investment by UK companies seeking to make use of this research, as reflected by the fact that British Telecom has already contributed £30,000 towards a PhD studentship for the PI's research in stream data mining techniques. Such exploitation of this research will lead to growth in the performance of technology companies, new jobs and an increase in revenue, and will thus give the UK an economic competitive advantage.


Society will benefit indirectly from the scientific areas that advance through the results of this research. Wider applications of the results will be explored by considering the fast, real-time analysis of electroencephalogram (EEG) data. New insights could have a direct impact on public health, aiding diagnosis and understanding of the brain. Advances in industry resulting from this research will also have an indirect impact on society. For example, forecasting performance bottlenecks in national telecommunication networks will lead to more reliable telecommunication applications such as telemedicine.

The wider academic community will also benefit from the results of this research through the open source implementation of this project's methodology, which will allow researchers to find and express interesting patterns on the fly in a wide variety of fast scientific data streams. Examples include expressing and adapting complex patterns from meteorological sensors, and detecting and/or expressing changes in patterns of brain activity through live functional Magnetic Resonance Imaging (fMRI). Our unique position at the University of Reading provides ready access to these scientific data sources.

Planned Impact

The outcome of this project will have considerable implications for the design of data mining algorithms for commercial systems that enable the exploitation of fast data streams to tackle large-scale scientific and industrial problems. Data mining companies range from service-oriented to software development organisations. Examples of software environments produced by such organisations are Weka, RapidMiner, KNIME and MOA. These specialist software environments incorporate a collection of algorithms and facilitate the creation and execution of data mining tasks. They typically arose from academic research projects, and some have become commercial products (such as KNIME and RapidMiner). Large companies adopt these environments because they allow data mining processes to be automated; for example, KNIME is adopted by Actian Ltd and Weka by BT Research. However, none of these environments provides techniques for extracting descriptive rules that can express patterns, and changes in those patterns, encoded in data streams. Open source MOA is currently the only one of these environments that provides algorithms to analyse streaming data in real time; however, expressive and descriptive algorithms do not yet exist for data streams.

The proposed research will significantly advance the state of the art in the field and be integrated into the popular MOA environment for mining data streams. This will accelerate the adoption of expressive descriptive data stream mining techniques in both academic and commercial communities. UK telecommunication and chemical companies are among those that stand to gain significant advantages from extracting descriptive rules from streaming data in real time. For the telecommunication industry, this will improve the efficiency of detecting interesting event patterns in national telecommunication networks on the fly, and thus help to forecast performance bottlenecks and faults (personal communication with BT). Chemical companies can employ this technology to monitor sensors in chemical plants and identify plant stages on the fly, without the need for time-consuming laboratory analyses. This will trigger R&D investment by UK companies seeking to make use of this research, as reflected by the fact that British Telecom already contributes £30,000 towards a PhD studentship for the PI's research in this area. Such exploitation of this research will lead to growth in the performance of technology companies, new jobs and an increase in revenue, and will thus give the UK an economic competitive advantage.

The academic community will benefit greatly from the results of this research through the open source implementation of this project's methodology, which will allow researchers to find and express interesting patterns on the fly in fast data streams and thus gain new insights in their field of research. Examples include expressing and adapting complex patterns from meteorological sensors, and detecting and/or expressing changes in patterns of brain activity through live functional Magnetic Resonance Imaging (fMRI).

The results of this research can be used to express rules describing sentiment changes in social media, which can help to predict social unrest and thus benefit national security, government policy and the general public. The general public will also benefit indirectly from the scientific areas that advance through the results of this research. For example, the results can be used to analyse electroencephalogram (EEG) data quickly and in real time to gain new insights, which indirectly has an impact on public health. Advances in industry resulting from this research will likewise have an indirect impact on the general public. For example, forecasting performance bottlenecks in national telecommunication networks will lead to more reliable telecommunication applications such as telemedicine.
 
Description The project aims to develop a novel, flexible and generic research method for describing patterns in data streams in a human-readable form. Potential applications of such a method include describing the causes of computer network alarms as they happen, describing human movement based on real-time sensor data, and describing the causality of flight delays.

A particular challenge is that patterns in data streams can change over time, so such a method has to update its model over time. Descriptive algorithms for data streams exist; however, they are either non-expressive black-box approaches (difficult for humans to understand) or highly specialised to a particular type of data. So far the project has realised three stages that lead to the proposed research method.

Stage 1
A rule-based descriptive method. This method does not work on data streams and is not adaptive. However, it was designed so that it could be adapted for application to real-time data streams.

Stage 2
A rule-based predictive method based on the techniques developed in Stage 1. This technique adapts to changes in the pattern encoded in the stream. A new evaluation method has also been developed to quantify the expressiveness of the method.

Stage 3
The final method proposed in the project. It is based on the descriptive rule-based approach developed in Stage 1 and the adaptation technique used in Stage 2. This method has now been evaluated on real data streams and a research paper will be submitted shortly. Because no similar technique yet exists for streaming data, a new evaluation method has been developed to measure the change in the expressiveness and interestingness of the rules induced by the method.
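
The two constraints that the stages above address, processing each instance only once and adapting when a pattern changes, can be illustrated with a small sketch. This is not the project's algorithm: the class `WindowedRule`, the function `monitor`, the window size and the thresholds are all hypothetical names and values chosen for illustration. The sketch only shows how the confidence of a single IF-THEN rule could be tracked over recent instances so that a rule that no longer holds is flagged.

```python
from collections import deque


class WindowedRule:
    """Hypothetical helper: one IF-THEN rule scored on recent stream instances."""

    def __init__(self, conditions, head, window=100):
        self.conditions = conditions  # rule body, e.g. {"load": "high"}
        self.head = head              # (attribute, value), e.g. ("alarm", "yes")
        # Keeps only the outcomes of the last `window` instances the rule covered.
        self.outcomes = deque(maxlen=window)

    def covers(self, instance):
        """True if the instance satisfies every condition in the rule body."""
        return all(instance.get(a) == v for a, v in self.conditions.items())

    def update(self, instance):
        """Single-pass update: the instance is inspected once and not stored."""
        if self.covers(instance):
            attr, value = self.head
            self.outcomes.append(instance.get(attr) == value)

    def confidence(self):
        """Fraction of recently covered instances for which the head held."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)


def monitor(stream, rule, min_confidence=0.7, min_covered=5):
    """Yield the index of every instance at which the rule looks outdated."""
    for i, instance in enumerate(stream):
        rule.update(instance)
        conf = rule.confidence()
        if len(rule.outcomes) >= min_covered and conf < min_confidence:
            yield i


if __name__ == "__main__":
    # Synthetic stream in which the pattern changes after 20 instances.
    stream = [{"load": "high", "alarm": "yes"}] * 20
    stream += [{"load": "high", "alarm": "no"}] * 20
    rule = WindowedRule({"load": "high"}, ("alarm", "yes"), window=10)
    print(list(monitor(stream, rule)))
```

In a full system a flagged rule would be revised or replaced rather than merely reported; how that is done is specific to the method and is not shown here.
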
Exploitation Route Some impact activities have already been taken forward. Owing to the descriptive nature of the algorithm, the project team detected a bug in one of the data stream generators of MOA, the leading academic data stream analytics platform. The project team liaised with the MOA team to resolve the bug. This has led to an understanding with the MOA team that the algorithms developed in the FGRI project will be considered for integration into the MOA platform. Stakeholders interested in potential applications could then access the research and take it forward by downloading the MOA platform free of charge.

The research artifact has been published in a GitHub repository: https://github.com/thienle2401/GeneralisedRulesAlgorithm. A new version of the code is in development and will be released once ready.
Sectors Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Electronics,Energy,Environment,Financial Services, and Management Consultancy,Manufacturing, including Industrial Biotechnology,Retail,Transport

URL http://fgri.reading.ac.uk/site/
 
Description In parallel with this project, BT sponsored a PhD studentship that took some of the algorithms developed in this project forward to predict nationwide network alarms ahead of time, which would help to improve the reliability of the UK's telecommunication network. Although the overall accuracy of forecasting alarms was low, in some cases relatively high precision was achieved. In a meeting with a large telecommunication network operator it became clear that the company had not expected such alarms to be forecastable at all, and it therefore believes that further resources should be invested in this application. However, it is not known at this stage whether this investment has happened.
First Year Of Impact 2017
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic
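
The distinction drawn in the description above between low accuracy and relatively high precision can be made concrete with a short worked example. The confusion-matrix counts below are entirely hypothetical and are not results from the studentship; they only show that a forecaster which raises few alarms, most of them correct, can score well on precision while missing many alarms and so scoring poorly on accuracy.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all forecasts (alarm or no alarm) that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Share of forecast alarms that turned out to be real alarms."""
    return tp / (tp + fp)


# Hypothetical counts: 45 alarms forecast, 40 of them correct, 300 alarms missed.
tp, fp, fn, tn = 40, 5, 300, 155

print(f"accuracy  = {accuracy(tp, fp, fn, tn):.2f}")   # 195 / 500
print(f"precision = {precision(tp, fp):.2f}")          # 40 / 45
```

With these illustrative counts the accuracy is 0.39 while the precision is 0.89, so the alarms that are forecast can still be trusted even though many go undetected.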

 
Description Collaboration on the development of evaluation techniques for FGRI. 
Organisation Birmingham City University
Country United Kingdom 
Sector Academic/University 
PI Contribution Provision of a prototype implementation of the methodology developed by the project.
Collaborator Contribution The team collaborated with Prof. Mohamed Medhat Gaber who helped in devising a novel evaluation method for the methodology developed in this research project.
Impact This is not a multi-disciplinary collaboration. The collaboration resulted in a joint publication. This publication does not contain the aforementioned evaluation method per se; however, it contains preliminary work that led up to the evaluation method, which will be published in the near future. The publication DOI is: 10.1109/ICMLA.2016.0168
Start Year 2016
 
Title Generalised Rules Induction Algorithm 
Description This software implements an algorithm for Generalised Rule Induction. The algorithm produces rules induced from a dataset. It can be used for describing relationships between attributes in the data in a form that is readable by human users. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact The software has only recently been released so impact will be updated at a later stage. 
URL https://github.com/thienle2401/GeneralisedRulesAlgorithm
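
The kind of output described for this software, rules that relate attributes to one another in a human-readable form, can be illustrated with a minimal sketch. It does not reproduce the algorithm in the repository above: it is a brute-force enumeration of single-condition rules, and the function names `single_condition_rules` and `render`, the thresholds and the toy flight-delay data are all assumptions made for illustration. The point it shows is that in a generalised rule any attribute, not only a designated class label, may appear on the right-hand side.

```python
from collections import Counter
from itertools import permutations


def single_condition_rules(rows, min_support=2, min_confidence=0.8):
    """Enumerate rules 'IF a = v THEN b = w' in which any attribute may be the head."""
    body_counts = Counter()   # how often each (attribute, value) pair occurs
    pair_counts = Counter()   # how often a body and a head occur in the same row
    for row in rows:
        items = sorted(row.items())
        for item in items:
            body_counts[item] += 1
        for body, head in permutations(items, 2):
            pair_counts[(body, head)] += 1

    rules = []
    for (body, head), support in pair_counts.items():
        confidence = support / body_counts[body]
        if support >= min_support and confidence >= min_confidence:
            rules.append((body, head, support, confidence))
    # Highest confidence first, then highest support, then alphabetical order.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return rules


def render(rule):
    """Format one rule as a human-readable string."""
    (b_attr, b_val), (h_attr, h_val), support, confidence = rule
    return (f"IF {b_attr} = {b_val} THEN {h_attr} = {h_val} "
            f"(support={support}, confidence={confidence:.2f})")


if __name__ == "__main__":
    # Toy, made-up data set.
    rows = [
        {"weather": "rain", "traffic": "heavy", "delay": "yes"},
        {"weather": "rain", "traffic": "heavy", "delay": "yes"},
        {"weather": "sun", "traffic": "light", "delay": "no"},
        {"weather": "sun", "traffic": "heavy", "delay": "yes"},
    ]
    for rule in single_condition_rules(rows):
        print(render(rule))
```

On the toy data this yields four rules, including both "IF traffic = heavy THEN delay = yes" and its converse, which a classifier restricted to one target attribute would not report.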
 
Description Information website/blog for the Fast Generalised Rule Induction project 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Reporting on the progress of the project to the general public.
Pointing to scientific publications about the project.
Providing information about the project team to the general public.
Year(s) Of Engagement Activity 2016,2017
URL http://fgri.reading.ac.uk/site/
 
Description Presentation of research and research findings at the UK Symposium on Knowledge Discovery and Data Mining 2017
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact More than 50 delegates attended the event. The presentation was titled "Real-Time Fast Predictive Rule Induction Directly from Continuous Streaming Data" and sparked questions and discussion with industry afterwards. The telecommunication industry in particular has shown a keen interest in this research with regard to the analytics of network performance, and a major telecommunication company is currently supporting the PI's application for an EPSRC fellowship.
Year(s) Of Engagement Activity 2017
URL http://ukkdd.org.uk/2017/