Novel advances in unsupervised learning for mixed-type data
Lead Research Organisation:
Imperial College London
Department Name: Mathematics
Abstract
Given the vast amount of data now available, there is strong demand in industry for efficient and flexible unsupervised learning algorithms. Unsupervised learning includes cluster analysis and outlier detection, among other tasks. A typical use case of the former comes from market research, where one of the main tasks is dividing a large group of a company's current or potential customers into smaller groups. The ultimate goal is to aggregate the subjects into segments, or clusters, such that each segment consists of subjects who are likely to share the same needs or interests. This is achieved by identifying similarities among the subjects, which may be demographic, geographic, psychographic or behavioural in nature. However, clustering can be very challenging and may lead to misleading conclusions when the data set includes outliers. Outliers are data points with unusual values that arouse suspicion about the mechanism that generated them.
Although a significant number of methods for cluster analysis and outlier detection exist in the literature, the majority cannot deal with a combination of continuous and categorical variables, also known as mixed-type data. Moreover, very few such methods are robust to outlying data points in a mixed-attribute domain. This is potentially because outliers have a well-established definition for numerical data but not for nominal observations. A more general definition of categorical outliers is therefore needed ('categorical' meaning that some variables may only take a fixed number of values, called 'categories'), so that we can better understand what it means to have outliers in a mixed data set. A simple approach would be to detect outliers in the numerical and the categorical data separately, but this is rather naive: outliers may still arise from the relationships among variables of different types. In fact, only a very small number of anomaly detection algorithms for mixed-type data exist in the literature; these rely on the aforementioned simplistic approach, and most of them lack a software implementation.
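The limitation of treating each variable type separately can be illustrated with a small sketch. The toy data, the function names and the thresholds below are all hypothetical and chosen only for illustration; the conditional check is one simple way to expose a dependency between a nominal and a continuous variable, not the methodology proposed in this project.

```python
from collections import Counter, defaultdict
from statistics import median


def robust_z(values):
    """Robust z-scores based on the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad  # consistency constant under a Gaussian assumption
    if scale == 0:
        # Degenerate case: no spread to measure against, so flag nothing.
        return [0.0 for _ in values]
    return [abs(v - med) / scale for v in values]


def marginal_outliers(rows, z_cut=3.5, min_freq=0.05):
    """Naive approach: check the numeric and the categorical variable separately."""
    cats = [c for c, _ in rows]
    nums = [x for _, x in rows]
    freq = Counter(cats)
    n = len(rows)
    z = robust_z(nums)
    return {i for i in range(n) if z[i] > z_cut or freq[cats[i]] / n < min_freq}


def conditional_outliers(rows, z_cut=3.5):
    """Check the numeric variable within each category, so that the
    relationship between the two variable types is taken into account."""
    groups = defaultdict(list)
    for i, (c, _) in enumerate(rows):
        groups[c].append(i)
    flagged = set()
    for idx in groups.values():
        z = robust_z([rows[i][1] for i in idx])
        flagged.update(i for i, zi in zip(idx, z) if zi > z_cut)
    return flagged


# Hypothetical (occupation, income) records; the last row combines a common
# category with a commonly seen income, but the combination itself is unusual.
rows = (
    [("student", x) for x in (10, 11, 12, 13, 14) * 2]
    + [("executive", x) for x in (90, 95, 100, 105, 110) * 2]
    + [("student", 100)]
)

print("marginal checks flag:", marginal_outliers(rows))
print("conditional check flags:", conditional_outliers(rows))
```

In this toy example the separate checks flag nothing, because "student" is a frequent category and an income of 100 is unremarkable across the whole data set, whereas the within-category check flags the last record. The cut-off values are arbitrary assumptions and would need tuning on real data.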
Our project seeks to address these gaps in the literature by defining a notion of outlyingness for purely nominal variables and, building on it, developing novel methodology for identifying data points that are anomalous in a mixed data set. This will involve using statistical tools to capture any dependencies or interactions between the variables in a mixed data set, in order to gain a good understanding of the data set and, as a result, of which observations may be outlying. Combining the results of such a method with clustering algorithms for mixed-type data could enhance the performance of existing non-robust methods. Ultimately, we aim to provide a novel framework under which practitioners from several industry sectors (such as automotive, education, insurance, retail or telecommunications, all of which use segmentation techniques in some form) can obtain results that are meaningful and easily interpretable to them, without being misled by anomalous observations. This project falls within the EPSRC Statistics and applied probability research area.
Planned Impact
The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high-calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting-edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. It will be relevant research, addressing the key questions behind real-world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximise reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.
Organisations
| People | ORCID iD |
|---|---|
| Efthymios COSTA (Student) | |
Studentship Projects
| Project Reference | Relationship | Related To | Start | End | Student Name |
|---|---|---|---|---|---|
| EP/S023151/1 | | | 31/03/2019 | 29/09/2027 | |
| 2602507 | Studentship | EP/S023151/1 | 01/10/2021 | 30/09/2025 | Efthymios COSTA |