Encyclopedic Lexical Representations for Natural Language Processing

Lead Research Organisation: CARDIFF UNIVERSITY
Department Name: Computer Science


The field of Natural Language Processing (NLP) has made unprecedented progress over the last decade, fuelled by the introduction of increasingly powerful neural network models. These models have an impressive ability to discover patterns in training examples, and to transfer these patterns to previously unseen test cases. Despite their strong performance in many NLP tasks, however, the extent to which they "understand" language is still remarkably limited. The key underlying problem is that language understanding requires a vast amount of world knowledge, which current NLP systems largely lack. In this project, we focus on conceptual knowledge, and more specifically on:

(i) capturing what properties are associated with a given concept (e.g. lions are dangerous, boats can float);
(ii) characterising how different concepts are related (e.g. brooms are used for cleaning, bees produce honey).

Our proposed approach relies on the fact that Wikipedia contains a wealth of such knowledge. A key problem, however, is that important properties and relationships are often not explicitly mentioned in text, especially if a human reader can straightforwardly infer them from other information (e.g. if X is an animal that can fly then X probably has wings). Apart from learning to extract knowledge expressed in text, we thus also have to learn how to reason about conceptual knowledge.

A central question is how conceptual knowledge should be represented. Current NLP systems rely heavily on vector representations, in which each concept is represented by a single vector. It is now well understood how such representations can be learned, and they are straightforward to incorporate into neural network architectures. However, they also have important theoretical limitations in terms of what knowledge they can capture, and they only allow for shallow and heuristic forms of reasoning. In contrast, in symbolic AI, conceptual knowledge is typically represented using facts and rules. This enables powerful forms of reasoning, but symbolic representations are harder to learn and to use in neural networks. Moreover, symbolic representations are also limited because they cannot capture aspects of knowledge that are matters of degree (e.g. similarity and typicality), which is especially restrictive when modelling commonsense knowledge.

The solution we propose relies on a novel hybrid representation framework, which combines the main advantages of vector representations with those of symbolic methods. In particular, we will explicitly represent properties and relationships, as in symbolic frameworks, but these properties and relations will be encoded as vectors. Each concept will thus be associated with several property vectors, while pairs of related concepts will be associated with one or more relation vectors. Our vectors will thus intuitively play the same role that facts play in symbolic frameworks, with associated neural network models then playing the role of rules.
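As a purely illustrative sketch of this idea (not the project's implementation), the Python fragment below stores several property vectors per concept and one or more relation vectors per concept pair. All names (`Concept`, `ConceptStore`, `has_property`), the toy two-dimensional vectors, and the cosine-threshold check that stands in for a learned neural "rule" are assumptions introduced here for illustration only.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vector = List[float]


@dataclass
class Concept:
    """A concept described by several property vectors (hypothetical structure)."""
    name: str
    property_vectors: List[Vector] = field(default_factory=list)


@dataclass
class ConceptStore:
    """Holds concepts plus relation vectors for pairs of related concepts."""
    concepts: Dict[str, Concept] = field(default_factory=dict)
    relation_vectors: Dict[Tuple[str, str], List[Vector]] = field(default_factory=dict)

    def add_property(self, name: str, vec: Vector) -> None:
        # Create the concept on first use, then attach the property vector.
        concept = self.concepts.setdefault(name, Concept(name))
        concept.property_vectors.append(vec)

    def add_relation(self, head: str, tail: str, vec: Vector) -> None:
        # A pair of concepts may carry more than one relation vector.
        self.relation_vectors.setdefault((head, tail), []).append(vec)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def has_property(concept: Concept, query: Vector, threshold: float = 0.8) -> bool:
    """Stand-in for a learned 'rule': does any property vector match the query?

    In the framework described above this role is played by a neural network;
    a fixed cosine threshold is used here only to keep the sketch runnable.
    """
    return any(cosine(p, query) >= threshold for p in concept.property_vectors)


# Toy usage with made-up vectors.
store = ConceptStore()
store.add_property("lion", [1.0, 0.0])            # e.g. "dangerous"
store.add_relation("bee", "honey", [0.0, 1.0])    # e.g. "produces"
lion_is_dangerous = has_property(store.concepts["lion"], [0.9, 0.1])
```

The point of the sketch is the shape of the data: the vectors act like facts that can be matched by degree, while the matching function is where a trained model would be plugged in.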

The main output from this project will be a comprehensive resource, in which conceptual knowledge is encoded in this hybrid way. We expect that our resource will play an important role in NLP, given the importance of conceptual knowledge for language understanding and its highly complementary nature to existing resources. To demonstrate its usefulness, we will focus on two challenging applications: reading comprehension and topic/trend modelling. We will also develop three case studies. In one case study, we will learn representations of companies, by using our resource to summarise the activities of companies in a semantically meaningful way. In another case study, we will use our resource to identify news stories that are relevant to a given theme. Finally, we will use our methods to learn semantically coherent descriptions of emerging trends in patents.
Title Pre-trained concept and property encoders 
Description We developed a technique for learning concept and property embeddings, which allow us to predict which concepts have which properties (in accordance with WP1 of the project). Both the code to train these models as well as the pre-trained models themselves have been made publicly available. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact The proposed model has been used as the basis for our COLING 2022 paper. We are currently developing follow-up work that builds on this model. 
URL https://github.com/amitgajbhiye/biencoder_concept_property
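To illustrate how separate concept and property embeddings can be used to predict which concepts have which properties, the following minimal Python sketch scores a concept-property pair by passing the dot product of the two embeddings through a sigmoid. The scoring function, the function names, and the hand-written toy embeddings are all assumptions made for illustration; this is not taken from the released code, where the embeddings would come from trained encoders.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x: float) -> float:
    """Logistic function mapping a raw score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def property_score(concept_vec: Vector, property_vec: Vector) -> float:
    """Assumed scoring: sigmoid of the concept/property dot product."""
    return sigmoid(dot(concept_vec, property_vec))


def rank_properties(
    concept_vec: Vector, property_vecs: Dict[str, Vector]
) -> List[Tuple[str, float]]:
    """Return (property, score) pairs sorted from most to least likely."""
    scored = [(name, property_score(concept_vec, vec))
              for name, vec in property_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real ones would be produced by the encoders.
concept_embeddings = {"lion": [2.0, -1.0], "boat": [-1.0, 2.0]}
property_embeddings = {"dangerous": [1.5, -0.5], "can float": [-0.5, 1.5]}

ranking_for_lion = rank_properties(concept_embeddings["lion"], property_embeddings)
```

Because concepts and properties are embedded independently in such a setup, property embeddings can be computed once and reused to score any number of concepts.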
Description AMPLYFI 
Organisation AMPLYFI
Country United Kingdom 
Sector Private 
PI Contribution AMPLYFI is a partner of this project, and we have been discussing our progress with them regularly since its early stages. After extensive discussions about possible collaborations, we have decided to focus on creating a public benchmark for modelling emerging technologies in terms of the concepts involved. We are currently exploring how to derive a list of relevant concepts, given the name of an emerging technology. These candidate terms will then be annotated. Our aim is to publish a paper about the resulting dataset and organise a SemEval competition.
Collaborator Contribution Our collaborators at AMPLYFI have taken the lead on two central problems for the creation of our benchmark: (i) identifying relevant emerging technologies and (ii) automatically generating lists of concepts that are likely to be relevant (using both GPT-3 and Wikipedia-based strategies).
Impact The development of the proposed benchmark is still ongoing.
Start Year 2022
Description Zied Bouraoui 
Organisation Artois University
Country France 
Sector Academic/University 
PI Contribution I have been working with Dr Zied Bouraoui from the Université d'Artois on the problem of learning concept embeddings, which is very closely aligned with work package 1 (WP1) of the ELEXIR project. Specifically, I have been developing models for learning concept embeddings from mentions of the concept name, by fine-tuning an encoder based on the BERT language model. Moreover, I have designed a method for exploiting these concept embeddings in few-shot learning settings, such as ultra-fine entity typing. We are currently designing improved methods, based on the idea that each concept can now be represented as a list of properties, as a result of the methods developed in WP1 of the ELEXIR project.
Collaborator Contribution The work described above has been developed in close collaboration. My primary role has been in the design of the model, whereas Dr Bouraoui has led the implementation of the different methods.
Impact The collaboration has led to two publications which are currently under review:
* Na Li, Hanane Kteich, Zied Bouraoui, Steven Schockaert: Distilling semantic concept embeddings from contrastively fine-tuned language models
* Na Li, Zied Bouraoui, Steven Schockaert: Exploiting pre-trained label embeddings and conceptual neighbourhood for ultra-fine entity typing
Start Year 2021
Description Invited seminar at Ulster University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact I gave a seminar in which I talked about the aims of the project.
Year(s) Of Engagement Activity 2022
Description Invited talk at Workshop on Ten Years of BabelNet and Multilingual Neurosymbolic Natural Language Understanding 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I delivered an invited talk at the Workshop on Ten Years of BabelNet and Multilingual Neurosymbolic Natural Language Understanding. The participants of this workshop primarily consisted of academics (staff and PhD students), but also included, for instance, professional lexicographers.
Year(s) Of Engagement Activity 2022
URL http://mousse-project.org/events/event-a5f3r5.html
Description Research School In Artificial Intelligence in Bergen 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Together with two colleagues, I delivered a lecture on the modelling of concepts at a summer school in Bergen, which was aimed at PhD students.
Year(s) Of Engagement Activity 2022