From Frequent Itemsets to Informative Patterns: Theory and Applications

Lead Research Organisation: University of Bristol
Department Name: Engineering Mathematics

Abstract

This project will address a problem that has recently been highlighted as a major research challenge in data mining: the output of many data mining algorithms is typically (and with present technology often unavoidably) too large for human understanding and interpretation, and is therefore of little use for many practical purposes. This is the case in particular for the output of the important and broad class of frequent pattern mining algorithms, which are the focus of this proposal. (Such patterns can be subsets in sets, subgraphs in graphs, or subsequences, substrings, or approximate substrings in strings, and much more.)

To address this challenge, we will develop new data mining approaches for effective database summarization and understanding. We will achieve this by combining, integrating, and developing state-of-the-art ideas from data mining, statistical modeling, and optimization theory.

Solving this problem is likely to boost the impact of all frequent pattern mining techniques, whose development required a considerable community effort but whose uptake in applied research and industry has so far remained limited. This project may therefore have a significantly non-linear effect on the research community: it has the potential to unleash a large amount of data mining expertise for practical application.

To underline the impact of our theoretical results, and to ensure that they are used in practice, in the second part of this project we will be the first to apply our results to a number of case studies. We have intentionally chosen these case studies from three diverse domains: bioinformatics, text mining, and marketing. Each of these domains has a large user base, all of whom will become beneficiaries of this project. By developing these applications, we will be able to demonstrate the use of our newly developed methods to these user groups.

The bioinformatics application we will tackle is the search for transcriptional modules, that is, groups of genes that are regulated by the same regulatory program under certain conditions. The data-driven nature of bioinformatics makes data mining approaches very well suited, and our new methodologies will ensure their successful application with a minimal amount of expert knowledge required. Note that this is just one example of a bioinformatics application; frequent pattern mining (subsets, subgraphs, subsequences, substrings...) is of use in many other branches of bioinformatics.

In the text mining application we will search for interesting sets of words, sequences of words, or approximate sequences (strings) of words occurring in a corpus of text documents. These text documents will be sets of news articles over a certain time span. We are already gathering thousands of news articles every day for subsequent text mining analyses, in the context of another project being developed together with Prof. Nello Cristianini in the same department. Integrating our results with that project will provide our new methodology with a unique showcase demonstration.

For the marketing application, the most obvious task is the search for items in a supermarket that are often sold together. Marketers are interested in this information, for example to optimize their promotion strategies: they may choose to reduce the price of one product and increase the margin on other products that are strongly associated with it. Our collaboration with Unilever, who have such retail transaction data available for use in this project, will enable us to successfully complete this application.
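
As a rough illustration of the marketing case (items that are often sold together), the sketch below counts how often small itemsets occur in a handful of made-up baskets and keeps those above a support threshold. It is a brute-force toy under stated assumptions, not the method developed in this project: the function name `frequent_itemsets`, the parameters `min_support` and `max_size`, and the basket data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items.

    Support is the fraction of transactions containing the itemset.
    Brute-force enumeration: fine for a toy example, not for real data.
    """
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # de-duplicate and fix an item order
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


# Hypothetical supermarket baskets (illustrative only).
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
    ["bread", "butter", "jam"],
]

result = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in sorted(result.items()):
    print(itemset, round(support, 2))
```

Lowering `min_support` or raising `max_size` makes the number of reported itemsets grow quickly even on small data, which is the output-size problem this abstract describes.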
 
Description This project's focus was on making sense of structured data by developing new advanced data mining techniques.

We have made considerable progress towards this goal, gaining key insights into how to quantify, in a subjective sense, how 'interesting' a piece of information discovered in data is to a specific user.

We have also worked on applications to specific types of structured data, with results in machine translation (understanding how the performance of statistical machine translation systems evolves as the amount of training data increases), web and online news media mining (e.g. creating a map of the EU media sphere), and, primarily, music informatics (new methods for chord recognition).

Our theoretical work has received much attention and has been widely published. We are currently organising a scientific workshop on the topic, inspired by our own findings and those of some of our colleagues.

Our applied work, in particular the work on music information retrieval, has also led to a number of high-impact publications in music informatics and signal processing venues, as well as to a highly successful public outreach story (www.scoreahit.com).
Exploitation Route The applied work on music informatics has a large number of possible applications in the creative economy. We are currently talking to our tech transfer office to investigate possible routes for exploitation.
Sectors Creative Economy, Digital/Communication/Information Technologies (including Software)

URL http://www.tijldebie.net
 
Description The research findings have led to two newly awarded projects: one ERC Consolidator grant (FORSIED), and one new EPSRC grant (EP/M000060/1), which will build upon the research outcomes. Other pathways to impact are currently being investigated.
Sector Creative Economy