A multicriterion approach for cluster validation

Lead Research Organisation: University College London

Department Name: Statistical Science

Abstract

Cluster analysis is about finding groups in data. It has applications in various areas such as biology, medicine, marketing, computer science, psychology, archaeology, and sociology.

The aim of the proposed project is to address cluster validation, which is a fundamental problem in cluster analysis. Cluster validation refers to both the evaluation of the quality of a clustering and the determination of the number of clusters.

The main idea is to develop a systematic catalogue of cluster validity indexes and to explore their properties, so that a user can match the requirements of a given application of cluster analysis with an appropriate set or aggregation of criteria. This is original, because most existing literature on cluster validation advertises "one criterion fits all" approaches that ignore the specific aims of clustering.

Given such a catalogue, the number of clusters in a given application can be determined either by specifying a set of minimum requirements or by aggregating criteria with weights that depend on the clustering aim. The quality of these approaches will be investigated.
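
As a rough illustration of the two routes described above (minimum requirements and weighted aggregation), the following Python sketch picks a number of clusters from a small table of validity index values. The index names, their values, the weights, the threshold, and the z-score standardisation are all assumptions made for this example; they are not taken from the project, whose own calibration and aggregation schemes may differ.

```python
from statistics import mean, pstdev

# Hypothetical index values for candidate numbers of clusters k = 2..5.
# Larger is better for both indexes; all numbers are made up for illustration.
index_values = {
    "separation":  {2: 0.80, 3: 0.70, 4: 0.40, 5: 0.30},
    "homogeneity": {2: 0.30, 3: 0.60, 4: 0.70, 5: 0.75},
}

# Hypothetical weights expressing the clustering aim
# (here homogeneity matters twice as much as separation).
weights = {"separation": 1.0, "homogeneity": 2.0}

# Hypothetical minimum requirement: separation must be at least 0.5.
minimums = {"separation": 0.5}


def standardise(values_by_k):
    """Z-score one index across the candidate k so that indexes share a scale."""
    m = mean(values_by_k.values())
    s = pstdev(values_by_k.values())
    if s == 0:
        return {k: 0.0 for k in values_by_k}
    return {k: (v - m) / s for k, v in values_by_k.items()}


def aggregate(index_values, weights):
    """Weighted average of the standardised indexes, one score per k."""
    z = {name: standardise(vals) for name, vals in index_values.items()}
    total = sum(weights[name] for name in z)
    candidate_ks = list(next(iter(index_values.values())))
    return {
        k: sum(weights[name] * z[name][k] for name in z) / total
        for k in candidate_ks
    }


def meets_requirements(k, index_values, minimums):
    """True if every index with a stated minimum reaches it at this k."""
    return all(index_values[name][k] >= t for name, t in minimums.items())


scores = aggregate(index_values, weights)
admissible = [k for k in scores if meets_requirements(k, index_values, minimums)]
best_k = max(admissible, key=scores.get)
print("admissible k:", admissible, "chosen k:", best_k)
```

Standardising before aggregation is one simple way to stop an index with a wide numerical range from dominating the weighted sum; other calibration schemes could be substituted in `standardise` without changing the rest of the sketch.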

The methods will be generalised to clusterings where some data ("outliers") are not assigned to any cluster.

To benchmark cluster analysis methods, the given criteria will be used to explain how different clustering methods perform on benchmark data sets in terms of the characteristics of the known true clusterings of those data sets.

The approaches developed for determining the number of clusters will be used to decide on the number of biological species present in data sets with genetic information.

Planned Impact

The results of this project are of interest to companies working on customer grouping and market segmentation. Cluster analysis is an important tool in market research, because it enables businesses to identify segments of customers to be addressed by new products and marketing strategies.

Currently there are three industrial collaborators who will use the results of this project, namely ecommera Limited, select Statistical Solutions and adam&eveDDB. These collaborators work for leading UK firms such as John Lewis, Pizza Express (adam&eveDDB), Asda, House of Fraser (ecommera) as well as the UK and local governments (select). The Swiss market research company GfK has recently also expressed interest.

Data sets in the IFCS cluster benchmarking repository, to which the current project contributes, will be openly available for teaching and learning, and for the choice of suitable algorithms in areas in which cluster analysis is needed.

Results regarding species delimitation will be used to advise on efforts to conserve biodiversity. Archaeological results will be disseminated in exhibitions. Results on musical styles will be used by the BBC and potentially further organisations.

The project focuses on general results that can be used in a wide variety of applications of cluster analysis, so that there is scope for long-term impact in clustering in medicine (e.g., classification of diseases), genetics (grouping of genes), neuroscience (image analysis), social sciences (social stratification, social network analysis), archaeology (classification of artifacts), biology (species delimitation), ecology (habitat classification), astronomy (object classification), chemistry (multiresolution analysis of spectra), psychology and education science (analysis of test and survey results), machine learning (object recognition), image segmentation, database organisation and document clustering, and market segmentation.

Funded Value:

£98,024

Funded Period:

Jun 13 - May 17

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/K033972/1

Principal Investigator:

Christian Hennig

Research Subject:

Mathematical sciences (100%)

Research Topic:

Statistics & Appl. Probability (100%)

People	ORCID iD
Christian Hennig (Principal Investigator)

Publications

Akhanli S (2023) Clustering of football players based on performance data and aggregated clustering validity indexes in Journal of Quantitative Analysis in Sports

Akhanli S (2020) Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes in Statistics and Computing

Akhanli S (2020) Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes

Akhanli SE (2017) Some Issues in Distance Construction for Football Players Performance Data in Archives of Data Science, Series A

Anderlucci L (2014) The Clustering of Categorical Data: A Comparison of a Model-based and a Distance-based Approach in Communications in Statistics - Theory and Methods

Coretto P (2017) Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering in Journal of the American Statistical Association

Coretto P (2017) Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering in Journal of Machine Learning Research

De Amorim R (2015) Recovering the number of clusters in data sets with noise features using feature rescaling factors in Information Sciences

Halkidi M (2015) Handbook of Cluster Analysis

Hennig C (2015) What are the true clusters? in Pattern Recognition Letters

Key Findings


Description	A battery of indexes for analysing the quality of a clustering has been developed. It can be used flexibly to find good clusterings in a variety of applications (see the narrative impact for applications where it is currently used), and also to compare different clustering algorithms. A computational method that uses these indexes to distinguish clustered data from homogeneous data has been developed. This can also be used for estimating the number of clusters, and I have compared the use of several such indexes for that purpose. There has also been progress on identifying outliers in cluster analysis. I have developed a range of computer-intensive techniques, such as parametric bootstrap and the generation of random clusterings on a fixed dataset, for calibrating indexes and for comparing and aggregating outcomes. These can be used for assessing the number of clusters. Software (R packages) is either already developed or in the final stages of development for all these achievements.
Exploitation Route	This can be applied wherever cluster analysis is applicable, i.e., whenever a scientific method is needed to partition objects into groups.
Sectors	Agriculture, Food and Drink; Chemicals; Communities and Social Services/Policy; Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Electronics; Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology; Retail; Transport; Other
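
The idea of calibrating an index against random clusterings on a fixed dataset, mentioned in the description above, might be sketched in Python as follows. This is only a toy under stated assumptions: the index (average within-cluster distance), the one-dimensional data, and the way random clusterings are generated (shuffling the labels, which keeps cluster sizes fixed) are all choices made for this example and are not claimed to match the methods or the R packages developed in the project.

```python
import random
from statistics import mean, pstdev


def avg_within_distance(points, labels):
    """Average distance between pairs of points in the same cluster (smaller is better)."""
    dists = [
        abs(points[i] - points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if labels[i] == labels[j]
    ]
    return mean(dists) if dists else 0.0


def calibrated_index(points, labels, n_random=200, seed=0):
    """Standardise the index by its distribution over random clusterings.

    Random clusterings are produced here by shuffling the labels
    (an assumption of this sketch). The sign is flipped so that
    larger calibrated values indicate a better clustering.
    """
    rng = random.Random(seed)
    observed = avg_within_distance(points, labels)
    random_values = []
    for _ in range(n_random):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        random_values.append(avg_within_distance(points, shuffled))
    spread = pstdev(random_values)
    if spread == 0:
        return 0.0
    return (mean(random_values) - observed) / spread


# Hypothetical toy data: two well-separated groups on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = [0, 0, 0, 1, 1, 1]
score = calibrated_index(points, labels)
print("calibrated index:", score)
```

Because calibrated values of different indexes are expressed on a comparable scale (deviations from what random clusterings of the same data achieve), they can then be aggregated or compared across numbers of clusters.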


Impact Summary

Description	A workshop presentation of mine was attended by a bank statistician who uses clustering and now uses validation as I suggested. The same was reported by a number of attendees of my PyData presentation and my COMPSTAT tutorial in 2016. Overall, I received much feedback on many of my presentations from experts in various fields on how they could make use of the presented ideas, although I cannot evidence specific use. I also did some market segmentation work for Daniel Muellensiefen, adam&eveDDB. I have collaborated with Highbury & Islington Council on clustering their patient data. I have collaborated with Alice Stephenson (UCL Petrie Museum) on dating Egyptian artifacts, and with Jean-Patrick Baudry and Gilles Celeux (Laboratoire de Statistique Theorique et Appliquee, Universite Paris IV) on clustering flow cytometry data as preprocessing for cancer detection. My former PhD student Serhat Akhanli, who works on cluster analysis of football player performance data, has used our work in collaboration with a first-league Turkish football club.
First Year Of Impact	2018
Sector	Digital/Communication/Information Technologies (including Software); Energy; Financial Services, and Management Consultancy; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Culture, Heritage, Museums and Collections; Retail
Impact Types	Cultural, Economic, Policy & public services


Research Databases and Models

Title	IFCS Cluster Benchmark Data Repository
Description	The IFCS Cluster Benchmark Data Repository is a collection of benchmark datasets for comparing cluster analysis methods. The special feature of the Repository is that every dataset comes with sophisticated metadata regarding the aim of clustering and the subject matter background. This is closely connected to the philosophy behind the funded project, namely that such information should be used, and in a specified way, for measuring the quality of clusterings and comparing them. This is joint work with the IFCS Cluster Benchmark Task Force, as indicated in the grant proposal.
Type Of Material	Database/Collection of data
Year Produced	2016
Provided To Others?	Yes
Impact	Benchmark Repositories such as the UCI Machine Learning Repository are widely used for comparing statistical learning methods. Our Repository will stimulate the use of the metadata for this task, which is a core requirement for measuring quality in cluster analysis according to my funded research.
URL	https://ifcs.boku.ac.at/repository/


Software and Technical Products

Title	fpc package for R
Description	The package existed before my grant started, but in the most recent update I have added a number of functions for the evaluation of the quality of a clustering and single clusters that are the result of my funded work. (The year 2018 indicated below is for the current update.)
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	Users can apply the new functions for evaluating the quality of a clustering and specific clusters in flexible ways, depending on the aims of the data analysis.
URL	http://cran.r-project.org/web/packages/fpc/index.html


Title	otrimle package for R
Description	Performs robust cluster analysis allowing for outliers and noise that cannot be fitted by any cluster, methodology as published in Coretto and Hennig (2016, 2017).
Type Of Technology	Software
Year Produced	2017
Open Source License?	Yes
Impact	This was added fairly recently, so impact will take some time to materialise.
URL	https://cran.r-project.org/web/packages/otrimle/index.html


Engagement Activities

Description	ASMDA - Cluster validation: how to think and what to do
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Invited presentation of general framework of research for grant. General statistical audience plus some practitioners. Questions and discussion.
Year(s) Of Engagement Activity	2017
URL	http://www.asmda.es/asmda2017.html


Description	Assessing the quality of a clustering
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Presentation on methods developed in the funded research for a broad audience of data analysis practitioners, PyData 2016, London. I had much discussion and a number of contacts afterwards.
Year(s) Of Engagement Activity	2016
URL	http://pydata.org/london2016/schedule/presentation/24/


Description	Cluster Benchmarking Data Analysis Challenge
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	After the Data Submission Challenge (reported earlier), the IFCS Cluster Benchmark task force (in which my participation was supported by the grant) organised a widely advertised challenge/competition about the analysis of the dataset that won the Submission Challenge. This was mainly carried out in the time funded by the grant, although results were presented later, on Tuesday 8 August, at the conference of the International Federation of Classification Societies, Tokyo. The presentation event prompted a lot of discussion and interest including plans for a future joint publication and further work.
Year(s) Of Engagement Activity	2017
URL	https://ifcs.boku.ac.at/repository/challenge2/


Description	Cluster analysis
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Postgraduate students
Results and Impact	Course on cluster analysis given at the University of Valladolid, January 2017, using some material from the funded research. This had implications for our ongoing collaboration; I also received a request for advice.
Year(s) Of Engagement Activity	2017
URL	http://www.imuva.uva.es/en/actividades/ver/426


Description	Clustering Data Submission Challenge
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The IFCS Cluster Benchmark Task Force, of which I am a member and for which my work is funded by the grant, organised and widely advertised a challenge (competition) to submit data with high-quality metadata to the IFCS Cluster Benchmark Repository. The deadline was 15 January 2017. We received six submissions and awarded a winner. Based on the winning dataset, there will be another challenge, on analysing the data.
Year(s) Of Engagement Activity	2016
URL	http://ifcs.boku.ac.at/_conference/index.php/ifcs2017/index/pages/view/challenge


Description	IFCS 2017 - Decisions that are needed when using cluster analysis, and research that helps with making them
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Invited keynote presentation at the conference of the International Federation of Classification Societies, Tokyo, 10 August 2017. Although this took place after the end of the grant-funded period, it was based on research carried out with the support of the grant. It was a big success, with many requests for more information and collaboration.
Year(s) Of Engagement Activity	2017
URL	http://ifcs.boku.ac.at/_conference/index.php/ifcs2017/ifcs2017


Description	Invited keynote presentation: Cluster validation: How to think and what to do
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Keynote presentation at the Unsupervised Machine Learning Workshop associated with the AFEKA Conference for Speech Processing, Tel Aviv, May 2016. The outcome was a lot of interesting discussion and a potential future collaboration.
Year(s) Of Engagement Activity	2016
URL	http://www.afekaconference.co.il/sp2016/Keynote-Speakers#580174-dr-christian-hennig


Description	Overview presentation on clustering
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Overview presentation on clustering, including some of the issues the grant is about, for researchers and practitioners active in finance.
Year(s) Of Engagement Activity	2013
URL	http://www.mathematik.uni-kl.de/fileadmin/AGs/fima/Sass/Workshop/RegimeSwitchingWorkshop.pdf


Description	Tutorial on Clustering
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	I gave a tutorial on general principles of clustering with the Gaussian mixture model, covering some of the methodology my grant is about, at the conference of the Portuguese classification society. Attendance was about 30-40.
Year(s) Of Engagement Activity	2014
URL	http://www.clad.pt/DOC_ACTIVIDADES/JOCLAD2015_Programa_publico.pdf


Description	Tutorial: Practical decision making in cluster analysis: Choice of method and evaluation of quality
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Tutorial on cluster validation methods, including research outputs from the grant, at COMPSTAT 2016 in Oviedo. I received much feedback and several requests for advice, which I provided.
Year(s) Of Engagement Activity	2016
URL	http://www.compstat2016.org/tutorials.php
