Encyclopedic Lexical Representations for Natural Language Processing

Lead Research Organisation: CARDIFF UNIVERSITY
Department Name: Computer Science

Abstract

The field of Natural Language Processing (NLP) has made unprecedented progress over the last decade, fuelled by the introduction of increasingly powerful neural network models. These models have an impressive ability to discover patterns in training examples, and to transfer these patterns to previously unseen test cases. Despite their strong performance in many NLP tasks, however, the extent to which they "understand" language is still remarkably limited. The key underlying problem is that language understanding requires a vast amount of world knowledge, which current NLP systems largely lack. In this project, we focus on conceptual knowledge, and in particular on:

(i) capturing what properties are associated with a given concept (e.g. lions are dangerous, boats can float);
(ii) characterising how different concepts are related (e.g. brooms are used for cleaning, bees produce honey).

Our proposed approach relies on the fact that Wikipedia contains a wealth of such knowledge. A key problem, however, is that important properties and relationships are often not explicitly mentioned in text, especially if, for a human reader, they follow straightforwardly from other information (e.g. if X is an animal that can fly, then X probably has wings). Apart from learning to extract knowledge that is expressed in text, we thus also have to learn how to reason about conceptual knowledge.
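The kind of inference involved can be illustrated with a small sketch (the names and the hand-written rule below are purely hypothetical; the project aims to learn such reasoning from data rather than hand-code it):

```python
# Purely illustrative: a hand-written rule standing in for the kind of
# inference over conceptual knowledge that the project aims to learn.
knowledge = {
    "eagle": {"animal", "can_fly"},
    "boat": {"can_float"},
}

def apply_rules(properties):
    """Derive properties that are rarely stated explicitly in text
    because they follow straightforwardly for a human reader."""
    inferred = set(properties)
    if {"animal", "can_fly"} <= inferred:
        inferred.add("has_wings")  # flying animals probably have wings
    return inferred

assert "has_wings" in apply_rules(knowledge["eagle"])
assert "has_wings" not in apply_rules(knowledge["boat"])
```

In the project itself, the role of such rules is played by learned models rather than hand-written conditions.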

A central question is how conceptual knowledge should be represented. Current NLP systems heavily rely on vector representations. Each concept is then represented by a single vector. It is now well-understood how such representations can be learned, and they are straightforward to incorporate into neural network architectures. However, they also have important theoretical limitations in terms of what knowledge they can capture, and they only allow for shallow and heuristic forms of reasoning. In contrast, in symbolic AI, conceptual knowledge is typically represented using facts and rules. This enables powerful forms of reasoning, but symbolic representations are harder to learn and to use in neural networks. Moreover, symbolic representations are also limited because they cannot capture aspects of knowledge that are matters of degree (e.g. similarity and typicality), which is especially restrictive when modelling commonsense knowledge.

The solution we propose relies on a novel hybrid representation framework, which combines the main advantages of vector representations with those of symbolic methods. In particular, we will explicitly represent properties and relationships, as in symbolic frameworks, but these properties and relations will be encoded as vectors. Each concept will thus be associated with several property vectors, while pairs of related concepts will be associated with one or more relation vectors. Our vectors will thus intuitively play the same role that facts play in symbolic frameworks, with associated neural network models then playing the role of rules.
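The hybrid idea can be sketched as follows (toy vectors and hypothetical names, not the project's actual models): each concept is linked to several property vectors, which play the role of facts, while a similarity-based scorer stands in for the neural "rules":

```python
import math

# Toy illustration (hypothetical vectors and names): each concept is
# associated with several property vectors, rather than a single vector.
property_vectors = {
    "lion": {"dangerous": [0.9, 0.1, 0.2], "has_fur": [0.1, 0.8, 0.3]},
    "boat": {"can_float": [0.2, 0.1, 0.9]},
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def has_property(concept, query_vec, threshold=0.8):
    """A stand-in for the neural 'rules': a concept is predicted to have
    the queried property if any of its property vectors is close enough
    to the query vector."""
    return any(cosine(v, query_vec) >= threshold
               for v in property_vectors.get(concept, {}).values())

assert has_property("lion", [0.9, 0.1, 0.2])      # matches "dangerous"
assert not has_property("boat", [0.9, 0.1, 0.2])  # no similar property
```

Because properties are encoded as vectors rather than symbols, graded notions such as similarity and typicality fall out naturally from the geometry.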

The main output from this project will consist of a comprehensive resource, in which conceptual knowledge is encoded in this hybrid way. We expect that our resource will play an important role in NLP, given the importance of conceptual knowledge for language understanding and its highly complementary nature to existing resources. To demonstrate its usefulness, we will focus on two challenging applications: reading comprehension and topic/trend modelling. We will also develop three case studies. In one case study, we will learn representations of companies, by using our resource to summarise the activities of companies in a semantically meaningful way. In another case study, we will use our resource to identify news stories that are relevant to a given theme. Finally, we will use our methods to learn semantically coherent descriptions of emerging trends in patents.
 
Description One of the main objectives of this project is to learn so-called concept embeddings, i.e. vector representations of everyday concepts (e.g. "banana" or "fire truck"). We have developed several new methods for learning such embeddings (published e.g. at COLING 2022, SIGIR 2023, EMNLP 2023, COLING 2024). We found in particular that our proposed embeddings substantially outperform existing alternatives. We have also developed several case studies to demonstrate the importance of these concept embeddings in various applications, including few-shot multi-label classification (in particular image classification and ultra-fine entity typing) and automated knowledge base completion.

A second aim of the project is to learn representations of relationships between concepts. Such embeddings allow us to better model how different concepts are related to each other. An important application is modelling analogies (which can in turn be used for supporting machine learning models or in the context of computational creativity). We have developed methods for learning better relation embeddings (e.g. published at EMNLP 2022 and EMNLP 2024) and extended our approach to relationships between named entities (e.g. published at EACL 2024).
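A classic baseline for this idea (the vector-offset method, shown here with toy embeddings and hypothetical names; our published methods are more sophisticated) represents the relation between a pair of concepts as the offset between their embeddings, so that analogous pairs have similar relation vectors:

```python
import math

# Toy embeddings (hypothetical values, purely for illustration).
embeddings = {
    "bee":   [0.8, 0.1, 0.1],
    "honey": [0.7, 0.5, 0.1],
    "cow":   [0.2, 0.1, 0.8],
    "milk":  [0.1, 0.5, 0.8],
    "broom": [0.5, 0.5, 0.5],
}

def relation_vector(a, b):
    """Vector-offset baseline: the relation from a to b is b - a."""
    return [x - y for x, y in zip(embeddings[b], embeddings[a])]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy_score(pair1, pair2):
    """Two pairs are analogous if their relation vectors are similar."""
    return cosine(relation_vector(*pair1), relation_vector(*pair2))

# "bee produces honey" is analogous to "cow produces milk":
assert analogy_score(("bee", "honey"), ("cow", "milk")) > 0.99
assert analogy_score(("bee", "honey"), ("cow", "broom")) < 0.9
```

Dedicated relation embeddings improve on this baseline by being learned directly from how the two concepts are mentioned together in text.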
Exploitation Route We believe that concept embeddings can be useful in a broad range of applications. We have demonstrated this with a few case studies, but there are many other areas where our representations can make a difference. Similarly, relation embeddings (and their ability to model analogies) are useful for many applications.

The introduction of Large Language Models also offers possibilities for building on our work, e.g. by developing methods that further improve concept and relation representations using the latest models.
Sectors Creative Economy

Digital/Communication/Information Technologies (including Software)

 
Description The RAGAs evaluation framework has been adopted by a wide range of users, to support them in the development of retrieval-augmented language models.
First Year Of Impact 2023
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Title Pre-trained concept and property encoders 
Description We developed a technique for learning concept and property embeddings, which allows us to predict which concepts have which properties (in accordance with WP1 of the project). Both the code to train these models and the pre-trained models themselves have been made publicly available. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact The proposed model has been used as the basis for our COLING 2022 paper. We are currently developing follow-up work that builds on this model. 
URL https://github.com/amitgajbhiye/biencoder_concept_property
 
Title RAGAs 
Description We introduce a framework for automatically analysing the effectiveness of retrieval augmented generation with Large Language Models. 
Type Of Material Data analysis technique 
Year Produced 2023 
Provided To Others? Yes  
Impact The framework has received several thousand GitHub stars, reflecting its widespread adoption in both industry and academia. 
URL https://github.com/explodinggradients/ragas
 
Title RelEntLess dataset 
Description We introduce a benchmark for evaluating the ability of language models to capture fine-grained relationships between named entities. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
Impact Our analysis based on this dataset has been used as the core proof-of-principle analysis to support a new grant proposal. Others have started using our dataset to analyse language models. 
URL https://huggingface.co/datasets/cardiffnlp/relentless
 
Description AMPLYFI 
Organisation AMPLYFI
Country United Kingdom 
Sector Private 
PI Contribution Since AMPLYFI is a partner of this project, we have been regularly discussing our progress with them from early on. After extensive discussions about possible collaborations, we have decided to focus on the creation of a public benchmark on the problem of modelling emerging technologies, in terms of the concepts involved. We are currently exploring how we can come up with a list of relevant concepts, given the name of an emerging technology. These candidate terms will then be annotated. Our aim is to publish a paper about the resulting dataset and organise a SemEval competition.
Collaborator Contribution Our collaborators at AMPLYFI have taken the lead on two central problems for the creation of our benchmark: (i) identifying relevant emerging technologies and (ii) automatically generating lists of concepts that are likely to be relevant (both using GPT-3 and Wikipedia based strategies).
Impact The development of the proposed benchmark is still ongoing.
Start Year 2022
 
Description Zied Bouraoui 
Organisation Artois University
Country France 
Sector Academic/University 
PI Contribution I have been working with Dr Zied Bouraoui from Artois University on the problem of learning concept embeddings, which is very closely aligned with work package 1 of the ELEXIR project. Specifically, I have been developing models for learning concept embeddings from mentions of the concept name, by fine-tuning an encoder based on the BERT language model. Moreover, I have designed a method for exploiting these concept embeddings in few-shot learning settings, such as ultra-fine entity typing. We are currently working on designing new, improved methods, which are based on the idea that we can now represent each concept as a list of properties, as a result of the methods that have been developed in WP1 of the ELEXIR project.
Collaborator Contribution The work described above has been developed in close collaboration. My primary role has been in the design of the model, whereas Dr Bouraoui has led the implementation of the different methods.
Impact The collaboration has led to two publications which are currently under review:
* Na Li, Hanane Kteich, Zied Bouraoui, Steven Schockaert: Distilling semantic concept embeddings from contrastively fine-tuned language models
* Na Li, Zied Bouraoui, Steven Schockaert: Exploiting pre-trained label embeddings and conceptual neighbourhood for ultra-fine entity typing
Start Year 2021
 
Description Invited seminar at Ulster University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact I gave a seminar in which I talked about the aims of the project.
Year(s) Of Engagement Activity 2022
 
Description Invited talk at Workshop on Ten Years of BabelNet and Multilingual Neurosymbolic Natural Language Understanding 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I delivered an invited talk at the Workshop on Ten Years of BabelNet and Multilingual Neurosymbolic Natural Language Understanding. The participants of this workshop primarily consisted of academics (staff and PhD students), but also included, for instance, professional lexicographers.
Year(s) Of Engagement Activity 2022
URL http://mousse-project.org/events/event-a5f3r5.html
 
Description Keynote talk at the Neuro-Symbolic AI Workshop at ECSQARU 2023 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave an invited keynote talk at the workshop, discussing the importance of concept embeddings and region-based representations for neuro-symbolic reasoning.
Year(s) Of Engagement Activity 2023
 
Description Keynote talk at the UKRI Interactive AI CDT conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact I gave an invited keynote talk on strategies for learning concept embeddings using language models.
Year(s) Of Engagement Activity 2023
 
Description Research School In Artificial Intelligence in Bergen 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Together with two colleagues, I delivered a lecture on the modelling of concepts at a summer school in Bergen, which was aimed at PhD students.
Year(s) Of Engagement Activity 2022
 
Description Talk at the Creigiau 23 charity event 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Supporters
Results and Impact I gave a talk at a charity event, discussing Large Language Models and their impact on society.
Year(s) Of Engagement Activity 2023