A multicriterion approach for cluster validation
Lead Research Organisation:
University College London
Department Name: Statistical Science
Abstract
Cluster analysis is about finding groups in data. It has applications in various areas such as biology, medicine, marketing, computer science, psychology, archaeology, and sociology.
The aim of the proposed project is to address cluster validation, which is a fundamental problem in cluster analysis. Cluster validation refers to both the evaluation of the quality of a clustering and the determination of the number of clusters.
The main idea is to develop a systematic catalogue of cluster validity indexes and to explore their properties, so that a user can match the requirements of a given application of cluster analysis with an appropriate set or aggregation of criteria. This is original, because most existing literature on cluster validation advocates "one criterion fits all" approaches that ignore the specific aims of clustering.
Given such a catalogue, a number of clusters in a given application can be determined by specifying a set of minimum requirements or by aggregating criteria with weights depending on the clustering aim. The quality of these approaches will be investigated.
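The two selection rules described above (minimum requirements and weighted aggregation) can be sketched as follows. This is only an illustration under stated assumptions: the index names `separation` and `homogeneity`, the example values, and both function names are hypothetical, and every index is assumed to be already scaled so that larger values are better. It is not the project's own implementation.

```python
from typing import Dict, List

# Hypothetical table: for each candidate number of clusters k, the value of
# each validity index. Assumption: all indexes are scaled so larger is better.
IndexTable = Dict[int, Dict[str, float]]


def choose_k_by_weights(table: IndexTable, weights: Dict[str, float]) -> int:
    """Return the k with the largest weighted mean of the index values."""
    total = sum(weights.values())

    def score(k: int) -> float:
        return sum(w * table[k][name] for name, w in weights.items()) / total

    return max(sorted(table), key=score)


def choose_k_by_requirements(table: IndexTable,
                             minimums: Dict[str, float]) -> List[int]:
    """Return every k whose index values meet all minimum requirements."""
    return [
        k for k in sorted(table)
        if all(table[k][name] >= m for name, m in minimums.items())
    ]


# Illustrative values only, not results from the project.
example: IndexTable = {
    2: {"separation": 0.9, "homogeneity": 0.4},
    3: {"separation": 0.7, "homogeneity": 0.8},
    4: {"separation": 0.3, "homogeneity": 0.9},
}

# A clustering aim that stresses separation favours fewer clusters here.
k_sep = choose_k_by_weights(example, {"separation": 3, "homogeneity": 1})
# An aim that stresses within-cluster homogeneity favours more clusters.
k_hom = choose_k_by_weights(example, {"separation": 1, "homogeneity": 5})
# Minimum requirements on both criteria.
k_ok = choose_k_by_requirements(example, {"separation": 0.5, "homogeneity": 0.5})
```

With these illustrative values, the separation-weighted aim selects 2 clusters, the homogeneity-weighted aim selects 4, and only 3 clusters meets both minimum requirements, which shows how the chosen number of clusters can depend on the clustering aim.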
The methods will be generalised to clusterings where some data ("outliers") are not assigned to any cluster.
To benchmark cluster analysis methods, the criteria will be used to explain how different clustering methods perform on benchmark data sets in terms of the characteristics of the known true clusterings of those data sets.
The approaches developed for determining the number of clusters will be used to decide on the number of biological species present in data sets with genetic information.
Planned Impact
The results of this project are of interest to companies working on customer grouping and market segmentation. Cluster analysis is an important tool in market research, because it enables businesses to identify segments of customers to be addressed by new products and marketing strategies.
Currently there are three industrial collaborators who will use the results of this project, namely ecommera Limited, select Statistical Solutions and adam&eveDDB. These collaborators work for leading UK firms such as John Lewis, Pizza Express (adam&eveDDB), Asda, House of Fraser (ecommera) as well as the UK and local governments (select). The Swiss market research company GfK has recently also expressed interest.
Data sets in the IFCS cluster benchmarking repository, to which the current project contributes, will be openly available for teaching and learning, and for the choice of suitable algorithms in areas in which cluster analysis is needed.
Results regarding species delimitation will be used to advise on biodiversity conservation efforts. Archaeological results will be disseminated in exhibitions. Results on musical styles will be used by the BBC and potentially further organisations.
The project focuses on general results that can be used in a wide variety of applications of cluster analysis, so that there is a scope for long term impact in clustering in medicine (e.g., classification of diseases), genetics (grouping of genes), neuroscience (image analysis), social sciences (social stratification, social network analysis), archaeology (classification of artifacts), biology (species delimitation), ecology (habitat classification), astronomy (object classification), chemistry (multiresolution analysis of spectra), psychology and education science (analysis of test and survey results), machine learning (object recognition), image segmentation, data base organisation and document clustering, and market segmentation.
People |
ORCID iD |
Christian Hennig (Principal Investigator) |
Publications
Akhanli S
(2023)
Clustering of football players based on performance data and aggregated clustering validity indexes
in Journal of Quantitative Analysis in Sports
Akhanli S
(2020)
Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes
in Statistics and Computing
Akhanli SE
(2017)
Some Issues in Distance Construction for Football Players Performance Data
in Archives of Data Science, Series A
Anderlucci L
(2014)
The Clustering of Categorical Data: A Comparison of a Model-based and a Distance-based Approach
in Communications in Statistics - Theory and Methods
Coretto P
(2017)
Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering
in Journal of the American Statistical Association
Coretto P
(2017)
Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering
in Journal of Machine Learning Research
De Amorim R
(2015)
Recovering the number of clusters in data sets with noise features using feature rescaling factors
in Information Sciences
Halkidi M
(2015)
Handbook of Cluster Analysis
Hennig C
(2015)
What are the true clusters?
in Pattern Recognition Letters
Description | A battery of indexes for analysing the quality of a clustering has been developed. It can be used flexibly to find good clusterings in a variety of applications (see narrative impact for applications where it is currently used) and to compare different clustering algorithms. A computational method that uses these indexes to distinguish clustered data from homogeneous data has also been developed; it can be used for estimating the number of clusters as well. I have compared the use of several such indexes for estimating the number of clusters. There has also been progress on identifying outliers in cluster analysis. I have developed a range of computer-intensive techniques, such as the parametric bootstrap and the generation of random clusterings on a fixed dataset, for calibrating indexes and for comparing and aggregating outcomes. These can be used for assessing the number of clusters. Software (R packages) is either already developed or in the final stages of development for all these achievements. |
Exploitation Route | The results can be applied wherever cluster analysis is applicable, i.e., whenever a scientific method is needed to partition objects into groups. |
Sectors | Agriculture, Food and Drink,Chemicals,Communities and Social Services/Policy,Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Electronics,Energy,Environment,Financial Services, and Management Consultancy,Healthcare,Leisure Activities, including Sports, Recreation and Tourism,Government, Democracy and Justice,Culture, Heritage, Museums and Collections,Pharmaceuticals and Medical Biotechnology,Retail,Transport,Other |
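The idea of calibrating an index against random clusterings generated on a fixed dataset can be sketched as below. This is a minimal illustration, not the method implemented in the R packages: the spread index, the uniform random label assignment, and all function names are assumptions chosen to keep the example short and dependent only on the Python standard library. The index value of a clustering is standardised by the mean and standard deviation of the same index over random clusterings with the same number of clusters, so that indexes measured on different scales become comparable before aggregation.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence

Index = Callable[[Sequence[float], Sequence[int]], float]


def within_cluster_spread(points: Sequence[float],
                          labels: Sequence[int]) -> float:
    """Mean absolute deviation of each point from its cluster mean.

    Hypothetical stand-in index on one-dimensional data; smaller is better.
    """
    clusters: Dict[int, List[float]] = {}
    for x, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(x)
    total = 0.0
    for members in clusters.values():
        centre = statistics.fmean(members)
        total += sum(abs(x - centre) for x in members)
    return total / len(points)


def random_labels(n: int, k: int, rng: random.Random) -> List[int]:
    """Random clustering of n points into k non-empty clusters (needs n >= k)."""
    labels = list(range(k)) + [rng.randrange(k) for _ in range(n - k)]
    rng.shuffle(labels)
    return labels


def calibrated_index(points: Sequence[float], labels: Sequence[int],
                     index: Index, n_random: int = 200, seed: int = 0) -> float:
    """Standardise index(points, labels) against random clusterings.

    The reference distribution uses the same data and the same number of
    clusters; only the assignment of points to clusters is randomised.
    """
    rng = random.Random(seed)
    k = len(set(labels))
    reference = [
        index(points, random_labels(len(points), k, rng))
        for _ in range(n_random)
    ]
    mean = statistics.fmean(reference)
    sd = statistics.stdev(reference)
    return (index(points, labels) - mean) / sd


# Illustrative data: two well-separated groups on the real line.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
labels = [0, 0, 0, 1, 1, 1]
z = calibrated_index(points, labels, within_cluster_spread)
```

Because a smaller spread is better for this stand-in index, a negative calibrated value indicates a clustering that is more compact than random clusterings of the same data. Calibrated values of several indexes could then be combined, for example by a weighted mean, after aligning their directions.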
Description | A workshop presentation of mine was attended by a bank statistician who uses clustering and now uses validation as I suggested. The same was reported by a number of attendees of my PyData presentation and my COMPSTAT tutorial in 2016. Overall, I received much feedback on many of my presentations from experts in various fields reporting on how they could make use of the presented ideas, although I cannot evidence specific use. I also did some market segmentation work for Daniel Muellensiefen, adam&eveDDB. I have collaborated with Highbury & Islington Council on clustering their patient data. I have collaborated with Alice Stephenson (UCL Petrie Museum) on dating Egyptian artifacts, and with Jean-Patrick Baudry and Gilles Celeux (Laboratoire de Statistique Theorique et Appliquee, Universite Paris IV) on clustering flow cytometry data as a preprocessing step for cancer detection. My former PhD student Serhat Akhanli, who works on cluster analysis of football player performance data, has used our work in collaboration with a first league Turkish football club. |
First Year Of Impact | 2018 |
Sector | Digital/Communication/Information Technologies (including Software),Energy,Financial Services, and Management Consultancy,Healthcare,Leisure Activities, including Sports, Recreation and Tourism,Culture, Heritage, Museums and Collections,Retail |
Impact Types | Cultural,Economic,Policy & public services |
Title | IFCS Cluster Benchmark Data Repository |
Description | The IFCS Cluster Benchmark Data Repository is a collection of benchmark datasets for comparing cluster analysis methods. The special feature of the Repository is that every dataset comes with detailed metadata regarding the aim of clustering and the subject matter background. This is closely connected to the philosophy behind the funded project, namely that such information should be used for measuring the quality of clusterings and comparing them, and to the question of how it should be used. This is joint work with the IFCS Cluster Benchmark Task Force as indicated in the grant proposal. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | Benchmark Repositories such as the UCI Machine Learning Repository are widely used for comparing statistical learning methods. Our Repository will stimulate the use of the metadata for this task, which is a core requirement for measuring quality in cluster analysis according to my funded research. |
URL | https://ifcs.boku.ac.at/repository/ |
Title | fpc package for R |
Description | The package existed before my grant started, but in the most recent update I have added a number of functions for the evaluation of the quality of a clustering and single clusters that are the result of my funded work. (The year 2018 indicated below is for the current update.) |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | Users can apply the new functions for evaluating the quality of a clustering and specific clusters in flexible ways, depending on the aims of the data analysis. |
URL | http://cran.r-project.org/web/packages/fpc/index.html |
Title | otrimle package for R |
Description | Performs robust cluster analysis allowing for outliers and noise that cannot be fitted by any cluster, methodology as published in Coretto and Hennig (2016, 2017). |
Type Of Technology | Software |
Year Produced | 2017 |
Open Source License? | Yes |
Impact | The package was added recently, so its impact will take some time to emerge. |
URL | https://cran.r-project.org/web/packages/otrimle/index.html |
Description | ASMDA - Cluster validation: how to think and what to do |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Invited presentation of general framework of research for grant. General statistical audience plus some practitioners. Questions and discussion. |
Year(s) Of Engagement Activity | 2017 |
URL | http://www.asmda.es/asmda2017.html |
Description | Assessing the quality of a clustering |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation on methods developed in the funded research for a broad audience of data analysis practitioners, PyData 2016, London. I had much discussion and a number of contacts afterwards. |
Year(s) Of Engagement Activity | 2016 |
URL | http://pydata.org/london2016/schedule/presentation/24/ |
Description | Cluster Benchmarking Data Analysis Challenge |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | After the Data Submission Challenge (reported earlier), the IFCS Cluster Benchmark task force (in which my participation was supported by the grant) organised a widely advertised challenge/competition about the analysis of the dataset that won the Submission Challenge. This was mainly carried out in the time funded by the grant, although results were presented later, on Tuesday 8 August, at the conference of the International Federation of Classification Societies, Tokyo. The presentation event prompted a lot of discussion and interest including plans for a future joint publication and further work. |
Year(s) Of Engagement Activity | 2017 |
URL | https://ifcs.boku.ac.at/repository/challenge2/ |
Description | Cluster analysis |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Postgraduate students |
Results and Impact | Course on cluster analysis given at the University of Valladolid, January 2017, using some material from the funded research. This had implications for our ongoing collaboration, and I also received a request for advice. |
Year(s) Of Engagement Activity | 2017 |
URL | http://www.imuva.uva.es/en/actividades/ver/426 |
Description | Clustering Data Submission Challenge |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | The IFCS Cluster Benchmark Task Force, of which I am a member and for which my work is funded by the grant, organised and widely advertised a challenge (competition) to submit data with high quality metadata to the IFCS Cluster Benchmark Repository. The deadline was 15 January 2017. We received six submissions and awarded a winner. Based on the winning dataset, there will be another challenge, on analysing the data. |
Year(s) Of Engagement Activity | 2016 |
URL | http://ifcs.boku.ac.at/_conference/index.php/ifcs2017/index/pages/view/challenge |
Description | IFCS 2017 - Decisions that are needed when using cluster analysis, and research that helps with making them |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Invited keynote presentation at the conference of the International Federation of Classification Societies, Tokyo, 10 August 2017. Although this took place after the end of the grant-funded period, it was based on research carried out while supported by the grant. It was a big success, with many requests for more information and collaboration. |
Year(s) Of Engagement Activity | 2017 |
URL | http://ifcs.boku.ac.at/_conference/index.php/ifcs2017/ifcs2017 |
Description | Invited keynote presentation: Cluster validation: How to think and what to do |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Keynote presentation at the Unsupervised Machine Learning Workshop associated with the AFEKA Conference for Speech Processing, Tel Aviv, May 2016. The outcome was much interesting discussion and a potential future collaboration. |
Year(s) Of Engagement Activity | 2016 |
URL | http://www.afekaconference.co.il/sp2016/Keynote-Speakers#580174-dr-christian-hennig |
Description | Overview presentation on clustering |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Overview presentation on clustering, including some of the issues the grant is about, for researchers and practitioners active in finance. |
Year(s) Of Engagement Activity | 2013 |
URL | http://www.mathematik.uni-kl.de/fileadmin/AGs/fima/Sass/Workshop/RegimeSwitchingWorkshop.pdf |
Description | Tutorial on Clustering |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | I gave a tutorial on general principles of clustering with the Gaussian mixture model, covering some of the methodology my grant is about, at the conference of the Portuguese classification society. Attendance was about 30-40. |
Year(s) Of Engagement Activity | 2014 |
URL | http://www.clad.pt/DOC_ACTIVIDADES/JOCLAD2015_Programa_publico.pdf |
Description | Tutorial: Practical decision making in cluster analysis: Choice of method and evaluation of quality |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Tutorial on cluster validation methods, including research outputs from the grant, at COMPSTAT 2016 in Oviedo. I received much feedback and several requests for advice, which I provided. |
Year(s) Of Engagement Activity | 2016 |
URL | http://www.compstat2016.org/tutorials.php |