Matheuristics for multi-criterion data clustering: towards multi-criterion big data analytics

Lead Research Organisation: University of Manchester

Department Name: Alliance Manchester Business School

Abstract

With rapid increases in data volume in all areas of life, the meaningful analysis of these data is becoming a crucial bottleneck. Whether data are generated by customer transactions, through communications on social media, or as a by-product of manufacturing processes, data are meaningless unless suitable techniques are available to select the most relevant data, analyze these data and turn raw data into tangible information and insight. To some extent, "big data" reverses traditional approaches in data mining, as data collection now frequently precedes the definition of an actual question or hypothesis. The purported advantage of this approach is that novel, unexpected findings may materialize - a premise that relies, however, on the expert use of suitable approaches for exploratory data analysis. The prominence of "big data" therefore fuels the need for, and use of, scalable and powerful approaches to exploratory data analysis. Data clustering techniques are among the most fundamental tools in exploratory data analysis, and this project aims to deliver novel techniques that are accurate, flexible and scalable to large data sets.

Conceptually, data clustering refers to the identification of sub-groups within a data set so that items within the same group are similar and those in different groups are dissimilar; e.g., in the context of insurance data, a "cluster" of people may relate to customers who show similar behaviour in their claim patterns over time, while those in different clusters behave differently. Mathematically, data clustering can be seen as an example of a problem where good solutions are best described using a set of different criteria that account for conflicting properties such as the compactness of clusters and the separation between clusters.
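To make the idea of conflicting criteria concrete, the following is a minimal sketch in Python. It is not taken from the project's software: the function names, the specific definitions (within-cluster sum of squares for compactness, smallest between-cluster point distance for separation) and the toy data are all assumptions chosen for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(members):
    """Component-wise mean of a non-empty list of points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def compactness(points, labels):
    """Sum of squared distances from each point to its cluster centroid.

    Lower is better. Illustrative definition, not the project's own.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centre = centroid(members)
        total += sum(dist(p, centre) ** 2 for p in members)
    return total


def separation(points, labels):
    """Smallest distance between two points assigned to different clusters.

    Higher is better. Returns infinity if there is only one cluster.
    """
    gaps = [
        dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if labels[i] != labels[j]
    ]
    return min(gaps) if gaps else float("inf")


# Hypothetical toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
two_groups = [0, 0, 1, 1]   # one cluster per pair
singletons = [0, 1, 2, 3]   # every point in its own cluster

for name, labels in [("two_groups", two_groups), ("singletons", singletons)]:
    print(name, compactness(points, labels), separation(points, labels))
```

On this toy data the two-group partition scores 1.0 on compactness and 5.0 on separation, whereas the singleton partition scores 0.0 and 1.0: it wins on compactness but loses on separation, so neither partition is better on both criteria.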

The above observation has recently led to the development of multi-criterion approaches to data clustering, which explicitly consider a number of clustering criteria. This approach has shown considerable promise in terms of both the accuracy and the robustness of the solutions obtained. However, current techniques for multi-criterion clustering are limited in their scalability to very large data sets and in their flexibility to consider different sources of dissimilarity data. This project proposes a novel technique for multi-criterion clustering: the algorithm will combine complementary ideas from two sub-fields of computer science, leading to improved scalability and flexibility of the technique developed. The work will include the development of an interactive user interface and the application of multi-criterion clustering to problems in finance and marketing. All software produced will be released publicly.

Planned Impact

Briefly, the non-academic beneficiaries of this project are:
- Users of exploratory data analysis in industry / commerce, including manufacturing as a major source of "big data".
- Commercial providers of software tools for data-mining.
- The wider public, as they benefit from improved data analytical techniques through the spin-offs of such knowledge, e.g. through the development of improved recommendation systems (marketing), the identification of groups of co-regulated genes (biology), etc.

One of the applications considered in this proposal (work package WP4.2) involves measures of customer experience and loyalty and will likely involve industrial partners as the primary source of these data. Further opportunities for collaboration will be sought, e.g. as part of industrial dissertation projects. The establishment of fruitful collaborations within and outside academia is supported by the third objective of the proposal, which aims to embed multi-criterion clustering techniques into an interactive, easy-to-use interface that is suitable for rapid training of, and use by, students, fellow academics, industrial partners, etc. Similarly, work packages WP4.1 and WP4.2 support exploitation activities directly, as they deliver practical application examples that can facilitate and underpin the dissemination of the approach.

Funded Value:

£100,317

Funded Period:

Jun 15 - Jul 17

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/M013766/1

Principal Investigator:

Julia Handl

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Artificial Intelligence (50%)

Information & Knowledge Mgmt (50%)

Organisations

People	ORCID iD
Julia Handl (Principal Investigator)

Publications

Deng Z (2022) The Russia-Ukraine war disproportionately threatens the nutrition security of developing countries. in Discover sustainability

Faceli K (2017) CVis - Towards a novel visualization tool to explore the relationship between input and output partitions in multi-objective clustering ensembles

Garza-Fabre M (2018) An Improved and More Scalable Evolutionary Approach to Multiobjective Clustering in IEEE Transactions on Evolutionary Computation

Garza-Fabre M (2023) Evolutionary Multiobjective Clustering Over Multiple Conflicting Data Views in IEEE Transactions on Evolutionary Computation

Garza-Fabre M (2017) Evolutionary Multi-Criterion Optimization

José-García A (2019) Many-view clustering

José-García A (2021) An evolutionary many-objective approach to multiview clustering using feature and relational data in Applied Soft Computing

Kandathil SM (2018) Improved fragment-based protein structure prediction by redesign of search heuristics. in Scientific reports

Kandathil SM (2019) Reliable Generation of Native-Like Decoys Limits Predictive Ability in Fragment-Based Protein Structure Prediction. in Biomolecules

Lu E (2015) Bioinspired Computation in Artificial Systems - International Work-Conference on the Interplay Between Natural and Artificial Computation, IWINAC 2015, Elche, Spain, June 1-5, 2015, Proceedings, Part II

Key Findings
Impact Summary
Research Databases and Models
Collaboration
Engagement Activities


Description	We have developed a novel multi-objective clustering technique that is more scalable than existing approaches (i.e. it can be applied to larger problems and is generally quicker) and is also more accurate.
Exploitation Route	The code has been made publicly available at https://github.com/garzafabre/Delta-MOCK and the methods may be used for cluster analysis in different subject areas.
Sectors	Digital/Communication/Information Technologies (including Software), Financial Services and Management Consultancy, Healthcare, Pharmaceuticals and Medical Biotechnology, Retail
URL	https://github.com/garzafabre/Delta-MOCK


Description	Multi-objective and multi-view clustering approaches are increasingly being trialled in a range of application areas and in collaboration with industrial stakeholders, including the analysis of electronic health records, the analysis of biological data, and web marketing / social media settings. Approaches based on Pareto optimization allow the direct exploration of trade-offs between different criteria and generate practically relevant trade-offs for upstream analytics.
First Year Of Impact	2017
Sector	Digital/Communication/Information Technologies (including Software), Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types	Economic
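
As a rough illustration of what Pareto optimization means for a set of candidate clustering solutions, the sketch below keeps only the non-dominated candidates. It is a generic Python example, not the project's algorithm: the function names and the objective values are hypothetical, and both objectives are assumed to be minimised.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are assumed to be minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(solutions):
    """Return the solutions that no other solution dominates, in input order."""
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]


# Hypothetical (criterion_1, criterion_2) scores for four candidate partitions.
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(candidates)
print(front)
```

Here (3.0, 3.0) is dropped because (2.0, 2.0) is better on both criteria; the three remaining candidates each represent a different trade-off that an analyst could inspect.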


Title	Delta-MOCK
Description	New, scalable multicriterion clustering technique that uses a reduced encoding and delta evaluation of objectives.
Type Of Material	Computer model/algorithm
Provided To Others?	No
Impact	The code of the method will be released once published. At this point, I expect this method to be of wide interest (the original version of MOCK has been used widely) and employed for data analysis in a variety of application areas (including applications in marketing, finance and bioinformatics that my own research group works on).


Description	FAPESP fellowship
Organisation	Federal University of Sao Carlos
Country	Brazil
Sector	Academic/University
PI Contribution	Prof. Katti Faceli is joining my research group with a FAPESP fellowship for a year. We will be collaborating on the development of advanced visualization techniques for the ensembles of clustering solutions resulting from multi-objective clustering. My team provides methodological expertise and state-of-the-art methods for multi-objective clustering, and I have contributed to the writing of the research proposal.
Collaborator Contribution	Prof. Katti Faceli brings expertise in visualization and, specifically, the visualization of clustering solutions.
Impact	Conference publication
Start Year	2015


Description	Big Data Panel
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	I helped organize and chaired a panel discussion on Big Data, as a part of ESOF 2016 in Manchester.
Year(s) Of Engagement Activity	2016
URL	http://www.esof.eu/en/


Description	Scratch Club
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	Ran a weekly Scratch Club for Year 6 students to introduce them to computer science and programming.
Year(s) Of Engagement Activity	2015,2016
