Matheuristics for multi-criterion data clustering: towards multi-criterion big data analytics

Lead Research Organisation: University of Manchester
Department Name: Alliance Manchester Business School

Abstract

With rapid increases in data volume in all areas of life, the meaningful analysis of these data is becoming a crucial bottleneck. Whether data are generated by customer transactions, through communications on social media, or as a by-product of manufacturing processes, data are meaningless unless suitable techniques are available to select the most relevant data, analyze these data and turn raw data into tangible information and insight. To some extent, "big data" reverses traditional approaches in data-mining, as data collection now frequently precedes the definition of an actual question or hypothesis. The purported advantage of this approach is that novel, unexpected findings may materialize - a premise that relies, however, on the expert use of suitable approaches for exploratory data analysis. The prominence of "big data" therefore fuels the need for, and the use of, scalable and powerful approaches to exploratory data analysis. Data clustering techniques present one of the most fundamental tools in exploratory data analysis, and this project aims to deliver novel techniques that are accurate, flexible and scalable to large data sets.

Conceptually, data clustering refers to the identification of sub-groups within a data set so that items within the same group are similar and those in different groups are dissimilar; e.g., in the context of insurance data, a "cluster" of people may relate to customers who show similar behaviour in their claim patterns over time, while those in different clusters behave differently. Mathematically, data clustering can be seen as an example of a problem where good solutions are best described using a set of different criteria that account for conflicting properties such as the compactness of clusters and the separation between clusters.
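The tension between compactness and separation can be illustrated with a minimal sketch. The two measures below (total distance to cluster centroids, and smallest distance between points in different clusters) are illustrative assumptions chosen for simplicity; they are not necessarily the criteria used in this project, and the function names and toy data are hypothetical.

```python
from math import dist

def compactness(points, labels):
    """Sum of distances from each point to its cluster centroid (lower is better)."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = tuple(sum(x) / len(members) for x in zip(*members))
        total += sum(dist(p, centroid) for p in members)
    return total

def separation(points, labels):
    """Smallest distance between two points in different clusters (higher is better)."""
    best = float("inf")
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] != labels[j]:
                best = min(best, dist(points[i], points[j]))
    return best

# Toy data (assumption): two visibly distinct groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
natural = [0, 0, 1, 1]      # groups nearby points together
singletons = [0, 1, 2, 3]   # every point in its own cluster

print(compactness(points, natural), separation(points, natural))        # 2.0 5.0
print(compactness(points, singletons), separation(points, singletons))  # 0.0 1.0
```

The all-singletons labelling achieves perfect compactness but poor separation, which is why a single criterion on its own can favour degenerate solutions.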

The above observation has recently led to the development of multi-criterion approaches to data clustering, which explicitly consider a number of clustering criteria. This approach has shown a lot of promise, in terms of the accuracy and the robustness of the solutions obtained. However, current techniques for multi-criterion clustering are limited regarding their scalability to very large data sets and also their flexibility with respect to their consideration of different sources of dissimilarity data. This project proposes a novel technique for multi-criterion clustering: the algorithm will combine complementary ideas from two sub-fields of computer science, leading to improved scalability and flexibility of the technique developed. The work will include the development of an interactive user-interface and the application of multi-criterion clustering to problems in finance and marketing. All software produced will be released publicly.

Planned Impact

Briefly, the non-academic beneficiaries of this project are:
- Users of exploratory data analysis in industry / commerce, including manufacturing as a major source of "big data".
- Commercial providers of software tools for data-mining.
- The wider public, as they benefit from improved data analytical techniques through the spin-offs of such knowledge, e.g. through the development of improved recommendation systems (marketing), the identification of groups of co-regulated genes (biology), etc.

One of the applications considered in this proposal (work package WP4.2) involves measures of customer experience and loyalty and will likely involve industrial partners as the primary source of these data. Further opportunities for collaborations will be sought, e.g. as a part of industrial dissertation projects. The establishment of fruitful collaborations within and outside academia is supported by the third objective of the proposal, which aims to embed multi-criterion clustering techniques into an interactive, easy-to-use interface that is suitable for rapid training of, and use by, students, fellow academics, industrial partners etc. Similarly, work packages WP4.1 and WP4.2 support exploitation activities directly, as they deliver practical application examples that can facilitate and underpin the dissemination of the approach.

Publications

Description We have developed a novel multi-objective clustering technique that is more scalable than existing approaches (i.e. it can be applied to larger problems and is generally quicker) and also improves on their accuracy.
Exploitation Route The code has been made publicly available at https://github.com/garzafabre/Delta-MOCK and the methods may be used for cluster analysis in different subject areas.
Sectors Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology; Retail

URL https://github.com/garzafabre/Delta-MOCK
 
Description Multi-objective and multi-view clustering approaches are increasingly being trialled in a range of different application areas, in collaboration with industrial stakeholders, including the analysis of electronic health records, the analysis of biological data, and web marketing / social media settings. Approaches based on Pareto optimization allow the direct exploration of trade-offs between different criteria and the generation of practically relevant trade-off solutions for upstream analytics.
First Year Of Impact 2017
Sector Digital/Communication/Information Technologies (including Software); Healthcare; Pharmaceuticals and Medical Biotechnology
Impact Types Economic
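
The idea of exploring trade-offs through Pareto optimization, mentioned in the description above, can be sketched as a simple non-dominated filter. This is a generic illustration, not the Delta-MOCK implementation: the function names and the score values are hypothetical, and both criteria are assumed to be minimised.

```python
def dominates(a, b):
    """True if a is no worse than b on every criterion and strictly better on at least one.

    Assumption: all criteria are to be minimised.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# Hypothetical (criterion 1, criterion 2) scores for five candidate clusterings.
scores = [(2.0, 3.0), (1.0, 5.0), (3.0, 1.0), (3.0, 4.0), (2.5, 3.5)]
front = pareto_front(scores)
print(front)  # [(2.0, 3.0), (1.0, 5.0), (3.0, 1.0)]
```

Each remaining solution represents a different compromise between the two criteria; none can be improved on one criterion without worsening the other.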

 
Title Delta-MOCK 
Description New, scalable multi-criterion clustering technique that uses a reduced encoding and delta evaluation of objectives. 
Type Of Material Computer model/algorithm 
Provided To Others? No  
Impact The code of the method will be released once published. At this point, I expect this method to be of wide interest (the original version of MOCK has been used widely) and employed for data analysis in a variety of application areas (including applications in marketing, finance and bioinformatics that my own research group works on). 
 
Description FAPESP fellowship 
Organisation Federal University of Sao Carlos
Country Brazil 
Sector Academic/University 
PI Contribution Prof. Katti Faceli is joining my research group with a FAPESP fellowship for a year. We will be collaborating on the development of advanced visualization techniques for the ensembles of clustering solutions resulting from multi-objective clustering. My team provides methodological expertise and state-of-the-art methods for multi-objective clustering, and I have contributed to the writing of the research proposal.
Collaborator Contribution Prof. Katti Faceli brings expertise in visualization and, specifically, the visualization of clustering solutions.
Impact Conference publication
Start Year 2015
 
Description Big Data Panel 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact I helped organize and chaired a panel discussion on Big Data, as a part of ESOF 2016 in Manchester.
Year(s) Of Engagement Activity 2016
URL http://www.esof.eu/en/
 
Description Scratch Club 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Ran a weekly Scratch Club for Year 6 students to introduce them to computer science and programming.
Year(s) Of Engagement Activity 2015, 2016