A multicriterion approach for cluster validation

Lead Research Organisation: University College London
Department Name: Statistical Science

Abstract

Cluster analysis is about finding groups in data. It has applications in various areas such as biology, medicine, marketing, computer science, psychology, archaeology, and sociology.

The aim of the proposed project is to address cluster validation, which is a fundamental problem in cluster analysis. Cluster validation refers to both the evaluation of the quality of a clustering and the determination of the number of clusters.

The main idea is to develop a systematic catalogue of cluster validity indexes and to explore their properties, so that a user can match the requirements of a given application of cluster analysis with an appropriate set or aggregation of criteria. This is original, because most existing literature on cluster validation advocates "one criterion fits all" approaches that ignore the specific aims of clustering.

Given such a catalogue, the number of clusters in a given application can be determined either by specifying a set of minimum requirements or by aggregating criteria with weights that depend on the clustering aim. The quality of both approaches will be investigated.
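The two selection strategies named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's method: the index names, the index values, the weights and the thresholds are all made up, every index is assumed to be scaled so that larger is better, and z-scoring each index across the candidate clusterings is one possible way (assumed here) of making indexes comparable before weighting.

```python
from statistics import mean, pstdev

def standardise(values):
    """Z-score a list of index values so that indexes on different scales are comparable."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]

def aggregate(index_table, weights):
    """Weighted mean of standardised indexes for each candidate number of clusters."""
    ks = sorted(index_table)
    names = list(weights)
    # Standardise each index across the candidate clusterings.
    z = {n: standardise([index_table[k][n] for k in ks]) for n in names}
    total = sum(weights.values())
    return {k: sum(weights[n] * z[n][i] for n in names) / total
            for i, k in enumerate(ks)}

def meets_requirements(index_table, minimums):
    """Candidate numbers of clusters whose indexes all reach the stated minimum."""
    return [k for k in sorted(index_table)
            if all(index_table[k][n] >= m for n, m in minimums.items())]

# Hypothetical index values (larger = better) for k = 2, 3, 4 clusters.
index_table = {
    2: {"separation": 0.90, "homogeneity": 0.40},
    3: {"separation": 0.70, "homogeneity": 0.70},
    4: {"separation": 0.30, "homogeneity": 0.80},
}

# Hypothetical weights: this clustering aim values homogeneity twice as much.
weights = {"separation": 1.0, "homogeneity": 2.0}
scores = aggregate(index_table, weights)
best_k = max(scores, key=scores.get)

# Hypothetical minimum requirements on both indexes.
eligible = meets_requirements(index_table, {"separation": 0.5, "homogeneity": 0.5})
```

With these made-up numbers both strategies select three clusters; changing the weights or the thresholds to reflect a different clustering aim can change the selected number.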

The methods will be generalised to clusterings where some data ("outliers") are not assigned to any cluster.

For benchmarking the quality of cluster analysis methods, the given criteria will be used to explain how different clustering methods perform on benchmark data sets in terms of the characteristics of the known true clusterings of those data sets.

The developed approaches to determine the number of clusters will be used for deciding about the number of biological species present in data sets with genetic information.

Planned Impact

The results of this project are of interest to companies working on customer grouping and market segmentation. Cluster analysis is an important tool in market research, because it enables businesses to identify segments of customers to be addressed by new products and marketing strategies.

Currently there are three industrial collaborators who will use the results of this project, namely ecommera Limited, select Statistical Solutions and adam&eveDDB. These collaborators work for leading UK firms such as John Lewis, Pizza Express (adam&eveDDB), Asda, House of Fraser (ecommera) as well as the UK and local governments (select). The Swiss market research company GfK has recently also expressed interest.

Data sets in the IFCS cluster benchmarking repository, to which the current project contributes, will be openly available for teaching and learning, and for the choice of suitable algorithms in areas in which cluster analysis is needed.

Results regarding species delimitation will be used to advise on efforts to conserve biodiversity. Archaeological results will be disseminated in exhibitions. Results on musical styles will be used by the BBC and potentially further organisations.

The project focuses on general results that can be used in a wide variety of applications of cluster analysis, so that there is scope for long-term impact on clustering in medicine (e.g., classification of diseases), genetics (grouping of genes), neuroscience (image analysis), social sciences (social stratification, social network analysis), archaeology (classification of artifacts), biology (species delimitation), ecology (habitat classification), astronomy (object classification), chemistry (multiresolution analysis of spectra), psychology and education science (analysis of test and survey results), machine learning (object recognition), image segmentation, database organisation and document clustering, and market segmentation.

Publications

 
Description A battery of indexes for analysing the quality of a clustering has been developed. This can be used in a flexible way to find good clusterings in a variety of applications (see narrative impact for applications where this is currently used), and also for comparing different clustering algorithms.
A computational method, using indexes as mentioned above, to distinguish clustered data from homogeneous data has been developed. This can also be used for estimating the number of clusters. I have compared the use of several such indexes for estimating the number of clusters. There has also been progress regarding identifying outliers in cluster analysis.
I have developed a range of computer-intensive techniques like parametric bootstrap and generation of random clusterings on a fixed dataset for calibration of indexes and comparing and aggregating outcomes. This can be used for assessing the number of clusters.
There is software (R-packages) either already developed or in final stages of development for all these achievements.
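The idea of calibrating an index against random clusterings on a fixed dataset, as described above, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy one-dimensional data, the simple homogeneity index, and the uniform random labelling, which is a deliberately simple stand-in and not necessarily the generation scheme used in the project or its R packages.

```python
import random
from itertools import combinations
from statistics import mean, pstdev

def within_index(points, labels):
    """Negative mean distance between points sharing a cluster (larger = more homogeneous)."""
    dists = [abs(points[i] - points[j])
             for i, j in combinations(range(len(points)), 2)
             if labels[i] == labels[j]]
    return -mean(dists) if dists else 0.0

def random_clustering(n, k, rng):
    """Uniform random labels for n points, with each of the k clusters non-empty."""
    labels = list(range(k)) + [rng.randrange(k) for _ in range(n - k)]
    rng.shuffle(labels)
    return labels

def calibrate(points, labels, k, n_random=200, seed=0):
    """Standardise the observed index against its values on random clusterings."""
    rng = random.Random(seed)
    reference = [within_index(points, random_clustering(len(points), k, rng))
                 for _ in range(n_random)]
    sd = pstdev(reference)
    observed = within_index(points, labels)
    return 0.0 if sd == 0 else (observed - mean(reference)) / sd

# Toy data with two obvious groups, and the clustering to be assessed.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
observed_labels = [0, 0, 0, 1, 1, 1]
z = calibrate(points, observed_labels, k=2)
```

A calibrated value well above zero indicates that the assessed clustering scores better on this index than random clusterings of the same data. Because calibrated values of different indexes are on a common scale, they can then be compared or aggregated.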
Exploitation Route These methods can be applied wherever cluster analysis can be applied, i.e., wherever there is a need for a scientific method to partition objects into groups.
Sectors Agriculture, Food and Drink; Chemicals; Communities and Social Services/Policy; Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Electronics; Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology; Retail; Transport; Other

 
Description A workshop presentation of mine was attended by a bank statistician who uses clustering and now applies validation as I suggested. The same was reported by a number of attendees of my PyData presentation and my COMPSTAT tutorial in 2016. Overall, I received much feedback on my presentations from experts in various fields reporting how they could make use of the presented ideas, although I cannot evidence specific use. I also did some market segmentation work for Daniel Muellensiefen, adam&eveDDB. I have collaborated with Highbury & Islington Council on clustering their patient data. I have collaborated with Alice Stephenson (UCL Petrie Museum) on dating Egyptian artifacts, and with Jean-Patrick Baudry and Gilles Celeux (Laboratoire de Statistique Theorique et Appliquee, Universite Paris IV) on clustering flow cytometry data as a preprocessing step for cancer detection. My former PhD student Serhat Akhanli, who works on cluster analysis of football player performance data, has used our work in collaboration with a Turkish first-league football club.
First Year Of Impact 2018
Sector Digital/Communication/Information Technologies (including Software); Energy; Financial Services, and Management Consultancy; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Culture, Heritage, Museums and Collections; Retail
Impact Types Cultural, Economic, Policy & public services

 
Title IFCS Cluster Benchmark Data Repository 
Description The IFCS Cluster Benchmark Data Repository is a collection of benchmark datasets for comparing cluster analysis methods. The special feature of the Repository is that every dataset comes with detailed metadata regarding the aim of clustering and the subject matter background. This is closely connected to the philosophy behind the funded project, namely that such information should be used for measuring the quality of clusterings and comparing them, and to the question of how it should be used. This is joint work with the IFCS Cluster Benchmark Task Force as indicated in the grant proposal. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact Benchmark Repositories such as the UCI Machine Learning Repository are widely used for comparing statistical learning methods. Our Repository will stimulate the use of the metadata for this task, which is a core requirement for measuring quality in cluster analysis according to my funded research. 
URL https://ifcs.boku.ac.at/repository/
 
Title fpc package for R 
Description The package existed before my grant started, but in the most recent update I have added a number of functions for the evaluation of the quality of a clustering and single clusters that are the result of my funded work. (The year 2018 indicated below is for the current update.) 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Users can apply the new functions for evaluating the quality of a clustering and specific clusters in flexible ways, depending on the aims of the data analysis. 
URL http://cran.r-project.org/web/packages/fpc/index.html
 
Title otrimle package for R 
Description Performs robust cluster analysis allowing for outliers and noise that cannot be fitted by any cluster, methodology as published in Coretto and Hennig (2016, 2017). 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact The package was added recently, so impact will take some time to materialise. 
URL https://cran.r-project.org/web/packages/otrimle/index.html
 
Description ASMDA - Cluster validation: how to think and what to do 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited presentation of general framework of research for grant. General statistical audience plus some practitioners. Questions and discussion.
Year(s) Of Engagement Activity 2017
URL http://www.asmda.es/asmda2017.html
 
Description Assessing the quality of a clustering 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation on methods developed in the funded research for a broad audience of data analysis practitioners, PyData 2016, London. I had much discussion and a number of contacts afterwards.
Year(s) Of Engagement Activity 2016
URL http://pydata.org/london2016/schedule/presentation/24/
 
Description Cluster Benchmarking Data Analysis Challenge 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact After the Data Submission Challenge (reported earlier), the IFCS Cluster Benchmark task force (in which my participation was supported by the grant) organised a widely advertised challenge/competition about the analysis of the dataset that won the Submission Challenge. This was mainly carried out in the time funded by the grant, although results were presented later, on Tuesday 8 August, at the conference of the International Federation of Classification Societies, Tokyo. The presentation event prompted a lot of discussion and interest including plans for a future joint publication and further work.
Year(s) Of Engagement Activity 2017
URL https://ifcs.boku.ac.at/repository/challenge2/
 
Description Cluster analysis 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Course on cluster analysis given at the University of Valladolid, January 2017, using some material from the funded research. This influenced our ongoing collaboration, and I also received a request for advice.
Year(s) Of Engagement Activity 2017
URL http://www.imuva.uva.es/en/actividades/ver/426
 
Description Clustering Data Submission Challenge 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The IFCS Cluster Benchmark Task Force, of which I am a member and for which my work is funded by the grant, organised and widely advertised a challenge (competition) to submit data with high-quality metadata to the IFCS Cluster Benchmark Repository. The deadline was 15 January 2017. We received six submissions and awarded a winner. Based on the winning dataset, there will be another challenge, on analysing the data.
Year(s) Of Engagement Activity 2016
URL http://ifcs.boku.ac.at/_conference/index.php/ifcs2017/index/pages/view/challenge
 
Description IFCS 2017 - Decisions that are needed when using cluster analysis, and research that helps with making them 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited keynote presentation at the conference of the International Federation of Classification Societies, Tokyo, 10 August 2017. Although this took place after the end of the grant-funded period, it was based on research carried out with the support of the grant. This was a big success with many requests for more information and collaboration.
Year(s) Of Engagement Activity 2017
URL http://ifcs.boku.ac.at/_conference/index.php/ifcs2017/ifcs2017
 
Description Invited keynote presentation: Cluster validation: How to think and what to do 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote presentation at the Unsupervised Machine Learning Workshop associated with the AFEKA Conference for Speech Processing, Tel Aviv, May 2016. The outcome was much interesting discussion and a potential future collaboration.
Year(s) Of Engagement Activity 2016
URL http://www.afekaconference.co.il/sp2016/Keynote-Speakers#580174-dr-christian-hennig
 
Description Overview presentation on clustering 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Overview presentation on clustering, including some of the issues the grant is about, for researchers and practitioners active in finance.
Year(s) Of Engagement Activity 2013
URL http://www.mathematik.uni-kl.de/fileadmin/AGs/fima/Sass/Workshop/RegimeSwitchingWorkshop.pdf
 
Description Tutorial on Clustering 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I gave a tutorial on general principles of clustering with the Gaussian mixture model, covering some of the methodology my grant is about, at the conference of the Portuguese classification society. Attendance was about 30-40.
Year(s) Of Engagement Activity 2014
URL http://www.clad.pt/DOC_ACTIVIDADES/JOCLAD2015_Programa_publico.pdf
 
Description Tutorial: Practical decision making in cluster analysis: Choice of method and evaluation of quality 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Tutorial on cluster validation methods, including research outputs from the grant, at COMPSTAT 2016 in Oviedo. I received much feedback and requests for advice, which I provided.
Year(s) Of Engagement Activity 2016
URL http://www.compstat2016.org/tutorials.php