Encyclopedic Lexical Representations for Natural Language Processing

Lead Research Organisation: CARDIFF UNIVERSITY
Department Name: Computer Science

Abstract

The field of Natural Language Processing (NLP) has made unprecedented progress over the last decade, fuelled by the introduction of increasingly powerful neural network models. These models have an impressive ability to discover patterns in training examples, and to transfer these patterns to previously unseen test cases. Despite their strong performance in many NLP tasks, however, the extent to which they "understand" language is still remarkably limited. The key underlying problem is that language understanding requires a vast amount of world knowledge, which current NLP systems largely lack. In this project, we focus on conceptual knowledge, and in particular on:

(i) capturing what properties are associated with a given concept (e.g. lions are dangerous, boats can float);
(ii) characterising how different concepts are related (e.g. brooms are used for cleaning, bees produce honey).

Our proposed approach relies on the fact that Wikipedia contains a wealth of such knowledge. A key problem, however, is that important properties and relationships are often not explicitly mentioned in text, especially if, for a human reader, they follow straightforwardly from other information (e.g. if X is an animal that can fly, then X probably has wings). Apart from learning to extract knowledge that is expressed in text, we thus also have to learn how to reason about conceptual knowledge.
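The kind of inference involved can be illustrated with a small sketch (the names and the hand-written rule below are purely hypothetical; the project aims to learn such reasoning from data rather than hand-code it):

```python
# Purely illustrative: a hand-written rule standing in for the kind of
# inference over conceptual knowledge that the project aims to learn.
knowledge = {
    "eagle": {"animal", "can_fly"},
    "boat": {"can_float"},
}

def apply_rules(properties):
    """Derive properties that are rarely stated explicitly in text
    because they follow straightforwardly for a human reader."""
    inferred = set(properties)
    if {"animal", "can_fly"} <= inferred:
        inferred.add("has_wings")  # flying animals probably have wings
    return inferred

assert "has_wings" in apply_rules(knowledge["eagle"])
assert "has_wings" not in apply_rules(knowledge["boat"])
```

In the project itself, the role of such rules is played by learned models rather than hand-written conditions.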

A central question is how conceptual knowledge should be represented. Current NLP systems heavily rely on vector representations. Each concept is then represented by a single vector. It is now well-understood how such representations can be learned, and they are straightforward to incorporate into neural network architectures. However, they also have important theoretical limitations in terms of what knowledge they can capture, and they only allow for shallow and heuristic forms of reasoning. In contrast, in symbolic AI, conceptual knowledge is typically represented using facts and rules. This enables powerful forms of reasoning, but symbolic representations are harder to learn and to use in neural networks. Moreover, symbolic representations are also limited because they cannot capture aspects of knowledge that are matters of degree (e.g. similarity and typicality), which is especially restrictive when modelling commonsense knowledge.

The solution we propose relies on a novel hybrid representation framework, which combines the main advantages of vector representations with those of symbolic methods. In particular, we will explicitly represent properties and relationships, as in symbolic frameworks, but these properties and relations will be encoded as vectors. Each concept will thus be associated with several property vectors, while pairs of related concepts will be associated with one or more relation vectors. Our vectors will thus intuitively play the same role that facts play in symbolic frameworks, with associated neural network models then playing the role of rules.
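The hybrid idea can be sketched as follows (toy vectors and hypothetical names, not the project's actual models): each concept is linked to several property vectors, which play the role of facts, while a similarity-based scorer stands in for the neural "rules":

```python
import math

# Toy illustration (hypothetical vectors and names): each concept is
# associated with several property vectors, rather than a single vector.
property_vectors = {
    "lion": {"dangerous": [0.9, 0.1, 0.2], "has_fur": [0.1, 0.8, 0.3]},
    "boat": {"can_float": [0.2, 0.1, 0.9]},
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def has_property(concept, query_vec, threshold=0.8):
    """A stand-in for the neural 'rules': a concept is predicted to have
    the queried property if any of its property vectors is close enough
    to the query vector."""
    return any(cosine(v, query_vec) >= threshold
               for v in property_vectors.get(concept, {}).values())

assert has_property("lion", [0.9, 0.1, 0.2])      # matches "dangerous"
assert not has_property("boat", [0.9, 0.1, 0.2])  # no similar property
```

Because properties are encoded as vectors rather than symbols, graded notions such as similarity and typicality fall out naturally from the geometry.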

The main output from this project will consist of a comprehensive resource, in which conceptual knowledge is encoded in this hybrid way. We expect that our resource will play an important role in NLP, given the importance of conceptual knowledge for language understanding and its highly complementary nature to existing resources. To demonstrate its usefulness, we will focus on two challenging applications: reading comprehension and topic/trend modelling. We will also develop three case studies. In one case study, we will learn representations of companies, by using our resource to summarise the activities of companies in a semantically meaningful way. In another case study, we will use our resource to identify news stories that are relevant to a given theme. Finally, we will use our methods to learn semantically coherent descriptions of emerging trends in patents.
 
Description One of the main objectives of this project is to learn so-called concept embeddings, i.e. vector representations of everyday concepts (e.g. "banana" or "fire truck"). We have developed several new methods for learning such embeddings (published e.g. at COLING 2022, SIGIR 2023, EMNLP 2023, COLING 2024). We found in particular that our proposed embeddings substantially outperform existing alternatives. We have also developed several case studies to demonstrate the importance of these concept embeddings in various applications, including few-shot multi-label classification (in particular image classification and ultra-fine entity typing) and automated knowledge base completion.

A second aim of the project is to learn representations of relationships between concepts. Such embeddings allow us to better model how different concepts are related to each other. An important application is modelling analogies (which can in turn be used for supporting machine learning models or in the context of computational creativity). We have developed methods for learning better relation embeddings (e.g. published at EMNLP 2022 and EMNLP 2024) and extended our approach to relationships between named entities (e.g. published at EACL 2024).
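A classic baseline for this idea (the vector-offset method, shown here with toy embeddings and hypothetical names; our published methods are more sophisticated) represents the relation between a pair of concepts as the offset between their embeddings, so that analogous pairs have similar relation vectors:

```python
import math

# Toy embeddings (hypothetical values, purely for illustration).
embeddings = {
    "bee":   [0.8, 0.1, 0.1],
    "honey": [0.7, 0.5, 0.1],
    "cow":   [0.2, 0.1, 0.8],
    "milk":  [0.1, 0.5, 0.8],
    "broom": [0.5, 0.5, 0.5],
}

def relation_vector(a, b):
    """Vector-offset baseline: the relation from a to b is b - a."""
    return [x - y for x, y in zip(embeddings[b], embeddings[a])]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy_score(pair1, pair2):
    """Two pairs are analogous if their relation vectors are similar."""
    return cosine(relation_vector(*pair1), relation_vector(*pair2))

# "bee produces honey" is analogous to "cow produces milk":
assert analogy_score(("bee", "honey"), ("cow", "milk")) > 0.99
assert analogy_score(("bee", "honey"), ("cow", "broom")) < 0.9
```

Dedicated relation embeddings improve on this baseline by being learned directly from how the two concepts are mentioned together in text.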
Exploitation Route We believe that concept embeddings can be useful in a broad range of applications. We have demonstrated this with a few case studies, but there are many other areas where our representations can make a difference. Similarly, relation embeddings (and their ability to model analogies) are useful for many applications.

The introduction of Large Language Models also offers possibilities for building on our work, e.g. by developing methods that further improve concept and relation representations using the latest models.
Sectors Creative Economy

Digital/Communication/Information Technologies (including Software)

 
Description The RAGAs evaluation framework has been adopted by a wide range of users, to support them in the development of retrieval-augmented language models.
First Year Of Impact 2023
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Title Pre-trained concept and property encoders 
Description We developed a technique for learning concept and property embeddings, which allows us to predict which concepts have which properties (in accordance with WP1 of the project). Both the code to train these models and the pre-trained models themselves have been made publicly available. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact The proposed model has been used as the basis for our COLING 2022 paper. We are currently developing follow-up work that builds on this model. 
URL https://github.com/amitgajbhiye/biencoder_concept_property
 
Title RAGAs 
Description We introduce a framework for automatically analysing the effectiveness of retrieval augmented generation with Large Language Models. 
Type Of Material Data analysis technique 
Year Produced 2023 
Provided To Others? Yes  
Impact The framework has received several thousand GitHub stars, reflecting its widespread adoption in both industry and academia. 
URL https://github.com/explodinggradients/ragas
 
Title RelEntLess dataset 
Description We introduce a benchmark for evaluating the ability of language models to capture fine-grained relationships between named entities. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
Impact Our analysis based on this dataset has been used as the core proof-of-principle analysis to support a new grant proposal. Others have started using our dataset to analyse language models. 
URL https://huggingface.co/datasets/cardiffnlp/relentless
 
Description AMPLYFI 
Organisation AMPLYFI
Country United Kingdom 
Sector Private 
PI Contribution Since AMPLYFI is a partner of this project, we have been regularly discussing our progress with them from early on. After extensive discussions about possible collaborations, we have decided to focus on the creation of a public benchmark on the problem of modelling emerging technologies, in terms of the concepts involved. We are currently exploring how we can come up with a list of relevant concepts, given the name of an emerging technology. These candidate terms will then be annotated. Our aim is to publish a paper about the resulting dataset and organise a SemEval competition.
Collaborator Contribution Our collaborators at AMPLYFI have taken the lead on two central problems for the creation of our benchmark: (i) identifying relevant emerging technologies and (ii) automatically generating lists of concepts that are likely to be relevant (both using GPT-3 and Wikipedia based strategies).
Impact The development of the proposed benchmark is still ongoing.
Start Year 2022
 
Description Zied Bouraoui 
Organisation Artois University
Country France 
Sector Academic/University 
PI Contribution I have been working with Dr Zied Bouraoui from Artois University on the problem of learning concept embeddings, which is very closely aligned with work package 1 of the ELEXIR project. Specifically, I have been developing models for learning concept embeddings from mentions of the concept name, by fine-tuning an encoder based on the BERT language model. Moreover, I have designed a method for exploiting these concept embeddings in few-shot learning settings, such as ultra-fine entity typing. We are currently working on designing new, improved methods, which are based on the idea that we can now represent each concept as a list of properties, as a result of the methods that have been developed in WP1 of the ELEXIR project.
Collaborator Contribution The work described above has been developed in close collaboration. My primary role has been in the design of the model, whereas Dr Bouraoui has led the implementation of the different methods.
Impact The collaboration has led to two publications which are currently under review:
* Na Li, Hanane Kteich, Zied Bouraoui, Steven Schockaert: Distilling semantic concept embeddings from contrastively fine-tuned language models
* Na Li, Zied Bouraoui, Steven Schockaert: Exploiting pre-trained label embeddings and conceptual neighbourhood for ultra-fine entity typing
Start Year 2021
 
Description Invited seminar at Ulster University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact I gave a seminar in which I talked about the aims of the project.
Year(s) Of Engagement Activity 2022
 
Description Invited talk at Workshop on Ten Years of BabelNet and Multilingual Neurosymbolic Natural Language Understanding 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I delivered an invited talk at the Workshop on Ten Years of BabelNet and Multilingual Neurosymbolic Natural Language Understanding. The participants of this workshop primarily consisted of academics (staff and PhD students), but also included, for instance, professional lexicographers.
Year(s) Of Engagement Activity 2022
URL http://mousse-project.org/events/event-a5f3r5.html
 
Description Keynote talk at the Neuro-Symbolic AI Workshop at ECSQARU 2023 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave an invited keynote talk at the workshop, discussing the importance of concept embeddings and region-based representations for neuro-symbolic reasoning.
Year(s) Of Engagement Activity 2023
 
Description Keynote talk at the UKRI Interactive AI CDT conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact I gave an invited keynote talk on strategies for learning concept embeddings using language models.
Year(s) Of Engagement Activity 2023
 
Description Research School In Artificial Intelligence in Bergen 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Together with two colleagues, I delivered a lecture on the modelling of concepts at a summer school in Bergen, which was aimed at PhD students.
Year(s) Of Engagement Activity 2022
 
Description Talk at the Creigiau 23 charity event 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Supporters
Results and Impact I gave a talk at a charity event, discussing Large Language Models and their impact on society.
Year(s) Of Engagement Activity 2023