Matheuristics for multi-criterion data clustering: towards multi-criterion big data analytics
Lead Research Organisation:
University of Manchester
Department Name: Alliance Manchester Business School
Abstract
With rapid increases in data volume in all areas of life, the meaningful analysis of these data is becoming a crucial bottleneck. Whether data are generated by customer transactions, through communications on social media, or as a by-product of manufacturing processes, they are meaningless unless suitable techniques are available to select the most relevant data, analyze them and turn raw data into tangible information and insight. To some extent, "big data" reverses traditional approaches in data-mining, as data collection now frequently precedes the definition of an actual question or hypothesis. The purported advantage of this approach is that novel, unexpected findings may materialize - a premise that relies, however, on the expert use of suitable approaches for exploratory data analysis. The prominence of "big data" therefore fuels the need for, and use of, scalable and powerful approaches to exploratory data analysis. Data clustering techniques present one of the most fundamental tools in exploratory data analysis, and this project aims to deliver novel techniques that are accurate, flexible and scalable to large data sets.
Data clustering techniques present one of the most fundamental tools in exploratory data analysis. Conceptually, data clustering refers to the identification of sub-groups within a data set so that items within the same group are similar and those in different groups are dissimilar; e.g., in the context of insurance data, a "cluster" of people may relate to customers who show similar behaviour in their claim patterns over time, while those in different clusters behave differently. Mathematically, data clustering can be seen as an example of a problem where good solutions are best described using a set of different criteria that account for conflicting properties such as the compactness of clusters and the separation between clusters.
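To make the two conflicting properties concrete, the following is a minimal Python sketch that scores a partition on compactness and separation. The function names, the toy data and the specific definitions (distance to the cluster centroid, and smallest gap between clusters) are assumptions chosen for illustration; they are not necessarily the criteria used in this project.

```python
from itertools import combinations
from math import dist


def compactness(points, labels):
    """Sum of distances from each point to its cluster centroid (lower is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(dist(p, centroid) for p in members)
    return total


def separation(points, labels):
    """Smallest distance between two points in different clusters (higher is better)."""
    gaps = [
        dist(p, q)
        for (p, a), (q, b) in combinations(zip(points, labels), 2)
        if a != b
    ]
    return min(gaps) if gaps else float("inf")


# Toy data (hypothetical): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # groups distant points together

print(compactness(points, good), separation(points, good))
print(compactness(points, bad), separation(points, bad))
```

On this toy example the first partition scores better on both criteria; on real data the criteria frequently disagree, which is the motivation for treating them jointly.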
The above observation has recently led to the development of multi-criterion approaches to data clustering, which explicitly consider a number of clustering criteria. This approach has shown considerable promise in terms of the accuracy and the robustness of the solutions obtained. However, current techniques for multi-criterion clustering are limited in their scalability to very large data sets and in their flexibility to consider different sources of dissimilarity data. This project proposes a novel technique for multi-criterion clustering: the algorithm will combine complementary ideas from two sub-fields of computer science, leading to improved scalability and flexibility of the technique developed. The work will include the development of an interactive user interface and the application of multi-criterion clustering to problems in finance and marketing. All software produced will be released publicly.
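One common way of considering several criteria at once is to keep every candidate solution that is not dominated by another, rather than collapsing the criteria into a single score. The sketch below illustrates this filtering step in Python. The candidate names and score values are hypothetical, both criteria are assumed to be minimised, and the code is an illustration of the general idea rather than the algorithm proposed in this project.

```python
def dominates(a, b):
    """True if score vector a is no worse than b on every criterion
    and strictly better on at least one (all criteria minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the sorted names of candidates not dominated by any other candidate."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    )


# Hypothetical scores for four candidate partitions: (criterion 1, criterion 2),
# both to be minimised. The values are made up for illustration.
scores = {
    "k=2": (10.0, 1.0),
    "k=3": (6.0, 2.0),
    "k=4": (5.0, 4.0),
    "k=4 (poor)": (7.0, 5.0),
}

front = pareto_front(scores)
print(front)
```

Here "k=4 (poor)" is discarded because "k=3" is better on both criteria, while the remaining three candidates each represent a different trade-off and are all retained for inspection.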
Planned Impact
Briefly, the non-academic beneficiaries of this project are:
- Users of exploratory data analysis in industry / commerce, including manufacturing as a major source of "big data".
- Commercial providers of software tools for data-mining.
- The wider public, as they benefit from improved data analytical techniques through the spin-offs of such knowledge, e.g. through the development of improved recommendation systems (marketing), the identification of groups of co-regulated genes (biology), etc.
One of the applications considered in this proposal (work package WP4.2) involves measures of customer experience and loyalty and will likely involve industrial partners as the primary source of these data. Further opportunities for collaborations will be sought, e.g. as part of industrial dissertation projects. The establishment of fruitful collaborations within and outside academia is supported by the third objective of the proposal, which aims to embed multi-criterion clustering techniques into an interactive, easy-to-use interface that is suitable for rapid training of, and use by, students, fellow academics, industrial partners, etc. Similarly, work packages WP4.1 and WP4.2 support exploitation activities directly, as they deliver practical application examples that can facilitate and underpin the dissemination of the approach.
People | Julia Handl (Principal Investigator) |
Publications
Deng Z (2022). The Russia-Ukraine war disproportionately threatens the nutrition security of developing countries. In Discover sustainability.
Garza-Fabre M (2018). An Improved and More Scalable Evolutionary Approach to Multiobjective Clustering. In IEEE Transactions on Evolutionary Computation.
Garza-Fabre M (2023). Evolutionary Multiobjective Clustering Over Multiple Conflicting Data Views. In IEEE Transactions on Evolutionary Computation.
Garza-Fabre M (2017). Evolutionary Multi-Criterion Optimization.
José-García A (2019). Many-view clustering.
José-García A (2021). An evolutionary many-objective approach to multiview clustering using feature and relational data. In Applied Soft Computing.
Kandathil SM (2018). Improved fragment-based protein structure prediction by redesign of search heuristics. In Scientific reports.
Kandathil SM (2019). Reliable Generation of Native-Like Decoys Limits Predictive Ability in Fragment-Based Protein Structure Prediction. In Biomolecules.
Description | We have developed a novel multi-objective clustering technique that is more scalable (i.e. can be applied to larger problems and is generally quicker) but also improves the accuracy of existing approaches. |
Exploitation Route | The code has been made publicly available at https://github.com/garzafabre/Delta-MOCK and the methods may be used for cluster analysis in different subject areas. |
Sectors | Digital/Communication/Information Technologies (including Software) Financial Services and Management Consultancy Healthcare Pharmaceuticals and Medical Biotechnology Retail |
URL | https://github.com/garzafabre/Delta-MOCK |
Description | Multi-objective and multi-view clustering approaches are increasingly being trialled in a range of application areas, in collaboration with industrial stakeholders, including the analysis of electronic health records, the analysis of biological data, and web marketing / social media settings. Approaches based on Pareto optimization allow the direct exploration of trade-offs between different criteria and generate practically relevant trade-offs for upstream analytics. |
First Year Of Impact | 2017 |
Sector | Digital/Communication/Information Technologies (including Software),Healthcare,Pharmaceuticals and Medical Biotechnology |
Impact Types | Economic |
Title | Delta-MOCK |
Description | New, scalable multicriterion clustering technique that uses a reduced encoding and delta evaluation of objectives. |
Type Of Material | Computer model/algorithm |
Provided To Others? | No |
Impact | The code of the method will be released once published. At this point, I expect this method to be of wide interest (the original version of MOCK has been used widely) and employed for data analysis in a variety of application areas (including applications in marketing, finance and bioinformatics that my own research group works on). |
Description | FAPESP fellowship |
Organisation | Federal University of Sao Carlos |
Country | Brazil |
Sector | Academic/University |
PI Contribution | Prof. Katti Faceli is joining my research group with a FAPESP fellowship for a year. We will be collaborating on the development of advanced visualization techniques for the ensembles of clustering solutions resulting from multi-objective clustering. My team provides methodological expertise and state-of-the-art methods for multi-objective clustering, and I have contributed to the writing of the research proposal. |
Collaborator Contribution | Prof. Katti Faceli brings expertise in visualization and, specifically, the visualization of clustering solutions. |
Impact | Conference publication |
Start Year | 2015 |
Description | Big Data Panel |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | I helped organize and chaired a panel discussion on Big Data, as a part of ESOF 2016 in Manchester. |
Year(s) Of Engagement Activity | 2016 |
URL | http://www.esof.eu/en/ |
Description | Scratch Club |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | Running a weekly Scratch Club to introduce Year 6 students to Computer Science and Programming. |
Year(s) Of Engagement Activity | 2015,2016 |