Semi-supervised learning of deep hierarchical hidden representations

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

Until the end of the 20th century, most computer programs were written by hand to automate repetitive tasks and thereby alleviate human work. Towards the end of the century, the field of Machine Learning emerged with the aim of creating algorithms that generate programs automatically from data and examples. These methods, together with an exponential growth in available data and computational power, made it possible to train deep hierarchical models. Nowadays, deep hierarchical models achieve, and occasionally surpass, human performance on a variety of tasks such as object recognition, automatic translation, speech recognition, autonomous transportation and medical applications.
One of the main problems of current state-of-the-art models is that they need fully annotated data to solve any specific task; such problems are known as Supervised Learning tasks. For this reason, one of the bottlenecks for training these models is the creation of good, large datasets, as they require a great deal of manual annotation.
To address this problem, the field of Semi-Supervised Learning uses data that has not been annotated to support the supervised part of the learning. For example, in problems where labels are scarce, unlabelled data can be used to learn hierarchical hidden representations that improve the performance of supervised models. New methods are still being investigated, and this is one of the main topics of this Ph.D. Another problem is that most of the current Machine Learning literature assumes that the data available during training follows the same distribution as the data encountered at deployment time. However, this assumption only holds in a few controlled scenarios, such as a closed factory. In most real-world scenarios the data evolves and changes with new objects, words or patterns. For this reason, it is important to provide Machine Learning models with the ability to flag when new patterns appear, thus avoiding possible mistakes.
This Ph.D. proposes to address this problem by means of Semi-Supervised Learning techniques. We want to enhance existing models by giving them the ability to discern between known and unknown patterns, so that they can express the confidence of their predictions. This is important in order to make confident predictions on familiar patterns while being able to ask for further inspection otherwise.
In conclusion, machine learning models that learn deep hierarchical hidden representations of the data are increasingly popular. These models are being applied to a wide range of problems, some of which have important implications. However, they need large amounts of annotated data and are unable to output confidence values for their predictions. For that reason, the goal of this Ph.D. is to improve models using unlabelled data, make them aware of new situations, and enable them to avoid uninformed decisions.

Publications


Studentship Projects

Project Reference: EP/N509619/1 (Start: 01/10/2016, End: 30/09/2021)
Studentship 1793885, related to EP/N509619/1 (Start: 19/09/2016, End: 18/03/2020), Student: Miquel Perello Nieto
 
Description In the field of Machine Learning, multi-class classification is a very important task. It consists of creating mathematical models that can classify instances into different categories (e.g. an autonomous car needs to classify objects into pedestrians, cars, animals and traffic signs, among others). However, in most of the literature the data available during training is assumed to be a good representation of all future examples. This assumption can lead to many mispredictions when the context changes, as has happened several times in the past, e.g. Google apologising for inappropriate automatic photo tagging caused by a biased training set. We have proposed a new generic method that equips arbitrary probabilistic classifiers with the ability to discern between predictions on inputs that are, or are not, similar to previously seen examples. We published an article at the 16th International Conference on Data Mining (ICDM 2016) demonstrating that the proposed method can be applied in multiple scenarios, and that it performs comparably to, and on occasion outperforms, state-of-the-art non-generic approaches. Similarly, it is important that the probabilities output by a multi-class classifier are good representations of the expected proportions of correct predictions. We have proposed a new method which enhances current state-of-the-art methods, and we presented this work at the Neural Information Processing Systems Conference 2019 under the title "Beyond Temperature Scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration". Further work on classifier calibration has led to a tutorial presented at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2020.
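To illustrate the calibration idea, the sketch below shows the core of Dirichlet calibration: a calibration map that is a multinomial logistic regression on the logarithm of the uncalibrated class probabilities, fitted on a held-out calibration set. This is a minimal sketch under those assumptions; the published method and the PyCalib library include further options (e.g. regularisation) that are omitted here, and the function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_dirichlet_calibrator(probs_cal, y_cal):
    # Dirichlet calibration in its simplest form: the calibration map is
    # linear in the log of the uncalibrated class probabilities, followed
    # by a softmax, i.e. multinomial logistic regression on log-probabilities.
    log_p = np.log(np.clip(probs_cal, 1e-12, 1.0))
    calibrator = LogisticRegression(max_iter=1000)  # lbfgs solver gives a softmax output
    calibrator.fit(log_p, y_cal)
    return calibrator

def apply_calibrator(calibrator, probs):
    # Map new uncalibrated probabilities to calibrated ones.
    log_p = np.log(np.clip(probs, 1e-12, 1.0))
    return calibrator.predict_proba(log_p)
```

In practice the uncalibrated probabilities come from any probabilistic classifier (e.g. a neural network), evaluated on a held-out set that was not used to train that classifier.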

Another important concern when training multi-class classification models is the requirement for good annotations. These are usually human annotations, which are time-consuming to obtain and on occasion require expert annotators (e.g. deciding whether a mammogram shows breast cancer). Related tasks such as Semi-Supervised Learning include non-annotated data during training to improve performance; however, this research is still under development and it is not yet clear under what circumstances an improvement can be achieved. We propose instead to use annotations with different degrees of quality. In this scenario, cheaper annotations can be collected by accepting a certain number of labelling mistakes, for example by crowd-sourcing the annotations, obtaining automatic annotations from machine learning models, or allowing annotators to provide coarse labels instead of finer ones (e.g. indicating that a picture contains a mammal or a plant, instead of specifying which animal or plant). For that reason, we are studying the empirical applicability of a set of theoretical results that allow the use of annotations with different levels of quality (namely weak labels). The empirical results showed that, with real data and different types of noise, it is possible to obtain good results. This finding was published in Advances in Intelligent Data Analysis XVI: 16th International Symposium, IDA 2017, London. We also show in our recent article "Recycling weak labels for multiclass classification", published in Neurocomputing in 2020, that it is possible to aggregate labels of different quality into a larger dataset and obtain better performance than using only a smaller but perfectly labelled dataset.
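A minimal sketch of the weak-label idea is shown below, assuming the mixing matrix relating true labels to weak labels is known or has been estimated; the names and the toy matrix are illustrative and not taken from the published code. One-hot weak labels are mapped to "virtual labels" whose expectation matches the true labels, and these virtual labels can then be plugged into any loss that accepts real-valued targets.

```python
import numpy as np

def virtual_labels(weak_onehot, M):
    # weak_onehot: (n_samples, n_weak) one-hot encoding of the observed weak labels.
    # M: (n_weak, n_true) mixing matrix with M[j, c] = P(weak label j | true class c).
    # Returns (n_samples, n_true) virtual labels whose expectation, under the
    # mixing process, equals the one-hot encoding of the true labels.
    M_pinv = np.linalg.pinv(M)          # left inverse of M, shape (n_true, n_weak)
    return weak_onehot @ M_pinv.T

# Toy example: 3 true classes, the weak label flips to another class 10% of the time each.
M = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
z = np.eye(3)[[0, 2, 1]]                # observed weak labels for three samples
v = virtual_labels(z, M)
# v may contain negative entries; this is expected in the weak-label framework,
# and v can be used as the target of a cross-entropy-style loss on predicted probabilities.
```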

More recently, the ideas presented in the aforementioned pieces of work have been used to propose new methods that reduce the number of annotations required to train a classification model. Similar to our weak-labels work, we were interested in reducing annotation costs, in this case not by using imprecise annotations but by requiring as few annotations as possible. We achieved this by selecting the data instances most different from the already labelled ones (as in the Background Check work), and at the same time selecting the samples for which the model was least confident in its prediction (as in the Dirichlet Calibration work). This work led to the article "Human Activity Recognition Based on Dynamic Active Learning", published in the IEEE Journal of Biomedical and Health Informatics, 2020.
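The sketch below illustrates this kind of acquisition rule, assuming a scikit-learn-style classifier with a predict_proba method; the scoring function is illustrative only and is not the exact criterion of the published dynamic active learning method.

```python
import numpy as np
from scipy.spatial.distance import cdist

def acquisition_scores(model, X_pool, X_labelled):
    # Combine dissimilarity to the labelled set with predictive uncertainty.
    # Dissimilarity: distance to the nearest already-labelled instance.
    # Uncertainty: entropy of the model's predicted class probabilities.
    probs = model.predict_proba(X_pool)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    dist = cdist(X_pool, X_labelled).min(axis=1)

    def minmax(v):
        # Normalise both terms to [0, 1] before combining them.
        return (v - v.min()) / (v.max() - v.min() + 1e-12)

    return minmax(entropy) + minmax(dist)

# The next instance to annotate is the pool sample with the highest score:
# idx = np.argmax(acquisition_scores(clf, X_pool, X_labelled))
```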
Exploitation Route It is important to understand the limitations of common machine learning classification methods. With my current work, researchers can use the tools I have created to improve the interpretability of these models. It is also possible to benefit from my work by reducing the cost of manual annotations, as I have demonstrated empirically under what types of weak annotations it is still possible to train machine learning classifiers, and even how to reduce the number of annotations necessary to train models. These techniques have a broad span of applicability, from researchers performing probabilistic estimations to companies driven by real-world use cases.
Sectors Digital/Communication/Information Technologies (including Software)

URL https://www.bristol.ac.uk/people/person/Miquel-Perello%20Nieto-4bac86a5-6a60-48fd-baec-140e4d70ba95#publications
 
Title Background Check 
Description This is a framework to create and evaluate classifier models with confidence levels. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Some publications have used the proposed approach in their own work. 
URL https://reframe.github.io/background_check/
 
Title PyCalib 
Description Python library with tools for the calibration of probabilistic classifiers. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Several researchers used the code in their own research work. 
URL http://www.perellonieto.com/PyCalib/
 
Description Presented our work on Dirichlet Calibration to a PyData Bristol Meetup 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact About 80 participants attended an event that I co-organised, where I presented our work on Dirichlet Calibration to an audience with an interest in science, technology, engineering and mathematics. The audience was interested in the concept of calibration and expressed interest in the publicly available source code.
Year(s) Of Engagement Activity 2019
URL https://github.com/pydatabristol/meetups/blob/81b78a66ba7b99540540fe637480650200ba6c30/meetup_2019_1...
 
Description Tutorial at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact We delivered a tutorial on classifier calibration, a topic that has gained general interest in recent years. The attendees were interested in learning more about the theory behind calibration, and in how to assess models and improve their predictions in a meaningful manner. The tutorial also included a hands-on part in which participants were able to try some of the proposed practices themselves. Several questions were raised during the tutorial, demonstrating a clear interest in the topic.
Year(s) Of Engagement Activity 2020
URL https://classifier-calibration.github.io/