Robust clustering of mixed-type data

Lead Research Organisation: Imperial College London
Department Name: Mathematics

Abstract

Given the vast amount of data now available, industry has a pressing need for efficient 'segmentation' algorithms. A typical example comes from market research, where one of the main tasks is dividing a large group of a company's current or potential customers into smaller groups. The ultimate goal is to aggregate the subjects into 'segments', such that each segment consists of subjects who are likely to share the same needs or have common interests. This is achieved by identifying similarities among the subjects, which may be demographic, geographic, psychographic or behavioural in nature. In statistical machine learning, segmentation is known as 'cluster analysis' or 'clustering'.

However, clustering is not straightforward when a data set includes both numerical and text data (commonly referred to as 'mixed-type data'), or when it includes anomalous points. A data point is said to be 'anomalous' or 'outlying' if it does not conform to a general pattern that may exist within the data set, or if it consists of 'unusual' values that are 'abnormal' enough, compared with the values of the majority of the other data points, to arouse suspicion. Although a significant number of methods for cluster analysis of mixed-type data exist in the literature, none of them is 'robust' to the presence of outlying data points. This is potentially because 'outliers' have a well-established definition for numerical data but not for non-numeric data. A more general definition of 'categorical outliers' ('categorical' referring to the fact that some variables may only take a fixed number of values, called 'categories') is therefore needed, so that we can better understand what it means to have 'outliers' or 'anomalies' in a mixed data set.

Although a simple approach could involve detecting outliers in the numerical and the categorical data separately, this is rather naive, since outliers might still exist based on the relationships among variables of different types. In fact, the literature contains only a very small number of anomaly detection algorithms for mixed-type data, and these rely on the aforementioned naive approach and have no software implementation available.

Our project aims to develop novel methodology for identifying data points that are anomalous in a mixed data set, by employing anomaly detection techniques in an unsupervised manner (meaning that we do not have access to a 'ground truth' regarding which data points are the anomalies). This will involve using statistical tools to capture any dependencies or interactions between the variables that make up a mixed data set, in order to achieve a good understanding of the data set and, as a result, of which observations may be outlying. Combining the results of such a method with clustering algorithms for mixed-type data could enhance the performance of existing non-robust methods. Moreover, such a method could be extended to account for an additional aspect of robustness concerning 'incomplete' observations within a data set.

Data irregularities and missing observations are very common issues for practitioners in several industry sectors, such as automotive, education, insurance, retail and telecommunications, all of which use segmentation techniques in some form. We therefore want to provide them with a framework under which they can obtain results that are meaningful and easily interpretable, without being affected by 'misleading' or missing observations.

This project falls within the EPSRC Statistics and applied probability research area.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high-calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting-edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This research will be relevant, addressing the key questions behind real-world problems. It will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/S023151/1                                   01/04/2019  30/09/2027
2602507            Studentship   EP/S023151/1  02/10/2021  30/08/2025  Efthymios COSTA