A Lebesgue Integral based Approximation for Language Modelling

Lead Research Organisation: King's College London

Department Name: Computer Science

Abstract

Deep learning (DL) based Natural Language Processing (NLP) technologies have attracted significant interest in recent years. The current SOTA language models, a.k.a. transformer-based language models, typically assume that the representation of a given word can be captured by the interpolation of its related context in a convex hull. However, it has recently been shown that in high-dimensional spaces, the interpolation almost surely never occurs regardless of the underlying intrinsic dimension of the data manifold. The representations generated by such transformer-based language models will converge into a dense cone-like hyperspace which is often discontinuous with many nonadjacent clusters. To overcome the limitation of current methods in most DL-based NLP models, this project aims to deploy Lebesgue integral, which can be defined as an ensemble of integrals among partitions (i.e., discontinuous feature clusters), to approximate the posterior distributions of clusters given input word features in finite measurable sets by automatically identifying the boundary of such discontinuous set, which in turn could help to generate better interpretations and quantify the uncertainty. By our proposed Lebesgue integral based approximation, the input text will be characterised by two properties: an indicator vector encoding its membership in clusters (i.e., measurable sets), and another continuous feature representation for better capturing its semantic meaning for downstream tasks. This not only allows for a more faithful approximation of commonly observed countably discontinuities in distributions of input text in NLP, but also enables learning text representations that are better understood by humans.

Funded Value:

£202,209

Funded Period:

Feb 23 - Feb 25

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/X019063/1

Principal Investigator:

Lin Gui

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Artificial Intelligence (100%)

Organisations

People	ORCID iD
Lin Gui (Principal Investigator)	http://orcid.org/0000-0002-8054-9524
Yulan He (Co-Investigator)

Publications

Author Name

Title Publication Date Published

|< < 1 2 3 > >|

10 25 50

Chen H (2023) Cone: Unsupervised Contrastive Opinion Extraction

Chen H (2023) Uncertainty Quantification for Text Classification

Da Silva I.L. (2024) Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems in EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Hanqi Yan (2023) Explainable Recommender with Geometric Information Bottleneck in IEEE Transactions on Knowledge and Data Engineering.

Hanqi Yan (2023) Counterfactual Generation with Identifiability Guarantees

Li H (2023) Distinguishability Calibration to In-Context Learning

Li H. (2023) Distinguishability Calibration to In-Context Learning in EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023

Li J (2023) CUE: An Uncertainty Interpretation Framework for Text Classifiers Built on Pre-Trained Language Models

Li J (2023) Distilling ChatGPT for Explainable Automated Student Answer Assessment

Li, J. (2023) OverPrompt: Enhancing ChatGPT Capabilities through an Efficient In-Context Learning Approach in arXiv

Key Findings
Impact Summary
Research Databases and Models
Software and Technical Products


Description	In this project, we're working on making large language model's decision easier to understand for users. Usually, these systems are like a black box which contains billions of parameters and it's hard to see how they work. Our new approach breaks the system down into simpler parts and figures out which part is responsible for the decisions it makes. We also introduce a new way to measure how sure the system is about its decisions based on these parts. This means our system can explain its decisions and how certain it is about them in a way that's easy for anyone to understand.
Exploitation Route	By breaking down the decision-making process of machine learning models into simpler parts, users can better understand why a model makes certain decisions. This transparency is crucial in sensitive areas like healthcare, finance, and criminal justice, where understanding the basis of a decision can be as important as the decision itself. Another potential application is that the proposed method is able to inspire the research in trustworthy AI. When users can see how and why decisions are made, they're likely to trust the technology more. This could lead to wider adoption of machine learning systems in various fields. By knowing how confident the model is in its predictions, users can make more informed decisions about when and how to rely on machine learning outputs. This could be particularly beneficial in scenarios where the stakes are high, and the cost of incorrect decisions is significant.
Sectors	Digital/Communication/Information Technologies (including Software)


Description	In this project, we have developed an interactive narrative understanding system that is open to the public. This demonstration was presented at the AAAI 2024 conference, a premier event in AI research, where it garnered significant attention from both academia and industry. On the academic front, we collaborated with the medical school at King's College London (KCL) to submit a joint funding application based on our current project to support their research. In industry, we developed a prototype interactive platform that enables a murder mistrial game between humans and AI. Huawei Research UK highly praised our project and provided funding to explore its potential in other directions.
Sector	Education,Other
Impact Types	Societal


Title	Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives
Description	We introduce Conan, a benchmark for extracting and analysing complex character relationship graphs from detective narratives. The input originates from character background stories of text-based games, which comprises k background stories, Nc, each uniquely crafted from the perspective of character c. We then manually extracted and annotated role-oriented relationships from these diverse viewpoints.
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
Impact	Existing datasets for narrative understanding often fail to represent the complexity and uncertainty of relationships in real-life social scenarios. To address this gap, we introduce a new benchmark, Conan, designed for extracting and analysing intricate character relation graphs from detective narratives. Specifically, we designed hierarchical relationship categories and manually extracted and annotated role-oriented relationships from the perspectives of various characters, incorporating both public relationships known to most characters and secret ones known to only a few. Our experiments with advanced Large Language Models (LLMs) like GPT-3.5, GPT-4, and Llama2 reveal their limitations in inferencing complex relationships and handling longer narratives. The combination of the Conan dataset and our pipeline strategy is geared towards understanding the ability of LLMs to comprehend nuanced relational dynamics in narrative contexts.
URL	https://zenodo.org/doi/10.5281/zenodo.10724723


Title	NewsQuote Dataset: Quote-Speaker Pairs from New Articles
Description	To seek reliable information sources for news events, we introduce a novel task of expert recommendation, which aims to identify trustworthy sources based on their previously quoted statements. To achieve this, we built a novel dataset, called NewsQuote, consisting of 23,571 quote-speaker pairs sourced from a collection of news articles. We built our NewsQuote dataset from the AYLIEN coronavirus dataset, which contains news articales published between November 2019 and August 2020. Apart from text, each article is also accompanied with the meta data such as authors, keywords, summary, source, publishing time, topical categories coded by both the Interactive Advertising Bureau (IAB) taxonomy and the IPTC NewsCodes, as well as the recognized entities and entity links from the DBpedia. Our data covers three categories of quotes: direct quote, indirect quote and mixed quote. We roughly estimated the weight of each quotation type on the dataset by the number and position of quotation marks: 81% for indirect quotes, 11% for direct quotes, and 7% for mixed quotes. In the test set, there are 1,582 (79%) indirect quotes, 178 (9%) mixed quotes, and 240 direct quotes (12%). The following table shows the statistics of our final NewsQuote dataset. In summary, we have a total of 23,571 English speaker-quote pairs with 2,843 speakers from 263 global domains. For each sample in the dataset: 'ID': sample id. In the id, number before '-' is the corresponding article id in the AYLIEN coronavirus dataset 'Sentence': main sentence. 'Tags': semantic role labels. 'Left_sentence': context before the sentence. 'Right_sentence': context after the sentence. 'Speaker': speaker. 'Words': words in the sentence. 'Verb': trigger verb that indicates the quotation. 'Quotation1': possible quotation. 'Time': publish time. 'Entity_link': speaker's Dbpedia link.
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
URL	https://zenodo.org/doi/10.5281/zenodo.11032190


Title	NewsQuote Dataset: Quote-Speaker Pairs from New Articles
Description	To seek reliable information sources for news events, we introduce a novel task of expert recommendation, which aims to identify trustworthy sources based on their previously quoted statements. To achieve this, we built a novel dataset, called NewsQuote, consisting of 23,571 quote-speaker pairs sourced from a collection of news articles. We built our NewsQuote dataset from the AYLIEN coronavirus dataset, which contains news articales published between November 2019 and August 2020. Apart from text, each article is also accompanied with the meta data such as authors, keywords, summary, source, publishing time, topical categories coded by both the Interactive Advertising Bureau (IAB) taxonomy and the IPTC NewsCodes, as well as the recognized entities and entity links from the DBpedia. Our data covers three categories of quotes: direct quote, indirect quote and mixed quote. We roughly estimated the weight of each quotation type on the dataset by the number and position of quotation marks: 81% for indirect quotes, 11% for direct quotes, and 7% for mixed quotes. In the test set, there are 1,582 (79%) indirect quotes, 178 (9%) mixed quotes, and 240 direct quotes (12%). The following table shows the statistics of our final NewsQuote dataset. In summary, we have a total of 23,571 English speaker-quote pairs with 2,843 speakers from 263 global domains. For each sample in the dataset: 'ID': sample id. In the id, number before '-' is the corresponding article id in the AYLIEN coronavirus dataset 'Sentence': main sentence. 'Tags': semantic role labels. 'Left_sentence': context before the sentence. 'Right_sentence': context after the sentence. 'Speaker': speaker. 'Words': words in the sentence. 'Verb': trigger verb that indicates the quotation. 'Quotation1': possible quotation. 'Time': publish time. 'Entity_link': speaker's Dbpedia link.
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
URL	https://zenodo.org/doi/10.5281/zenodo.11032189


Title	Tracking Brand-Associated Polarity-Bearing Topics in User Reviews
Description	Preprocessed data included in [data/beauty_makeupalley/] can be used directly for dBTM, O-dBTM. It can also be used for baseline BTM, dJST and TBIP, with some tiny change to fit the input formats of those models. The original data is from MakeupAlley, a review website on beauty products. In the repo data/{dataset_name}/time. The following files are included: counts.npz: a [num_documents, num_words] sparse CSR matrix containing the word counts for each document. brand_indices.npy: a [num_documents] vector where each entry is an integer in the set {0, 1, ..., num_brands - 1}, indicating the brand of the corresponding document in counts.npz. score_indices.npy: a [num_documents] vector where each entry is an integer in the set {-1, 0, 1}, indicating the review polarity of the corresponding document in counts.npz. Also in data/{dataset_name}/clean. The following files are included: brand_map.txt: a [num_brands]-length file where each line denotes the name of the brand in the corpus. vocabulary.txt: a [num_words]-length file where each line denotes the corresponding word in the vocabulary.
Type Of Material	Database/Collection of data
Year Produced	2023
Provided To Others?	Yes
Impact	Monitoring online customer reviews is important for business organisations to measure customer satisfaction and better manage their reputations. Popular datasets such as Yelp and Amazon products and Multi-Domain Sentiment dataset are constructed by randomly selecting reviews from Amazon or Yelp without considering their distributions over various brands and across different time periods. Therefore, we construct our own dataset by crawling reviews from top 25 brands from MakeupAlley, a review website on beauty products. Each review is accompanied with a rating score, product type, brand and post time. The entire dataset contains 611,128 reviews spanning over 9 years (2005 to 2013). We treat each year as a time slice and split reviews into 9 time slices. The average review length is 123 words.
URL	https://github.com/BLPXSPG/dBTM


Title	CUE: An Uncertainty Interpretation Framework for Text Classifiers Built on Pre-Trained Language Models
Description	Text classifiers built on Pre-trained Language Models (PLMs) have achieved remarkable progress in various tasks including sentiment analysis, natural language inference, and question-answering. However, the occurrence of uncertain predictions by these classifiers poses a challenge to their reliability when deployed in practical applications. Much effort has been devoted to designing various probes in order to understand what PLMs capture. But few studies have delved into factors influencing PLM-based classifiers' predictive uncertainty. In this paper, we propose a novel framework, called CUE, which aims to interpret uncertainties inherent in the predictions of PLM-based models. In particular, we first map PLM-encoded representations to a latent space via a variational auto-encoder. We then generate text representations by perturbing the latent space which causes fluctuation in predictive uncertainty. By comparing the difference in predictive uncertainty between the perturbed and the original text representations, we are able to identify the latent dimensions responsible for uncertainty and subsequently trace back to the input features that contribute to such uncertainty. Our extensive experiments on four benchmark datasets encompassing linguistic acceptability classification, emotion classification, and natural language inference show the feasibility of our proposed framework.
Type Of Technology	Software
Year Produced	2023
Open Source License?	Yes
Impact	This approach offers a potential solution for disentangling the representation learned from a pre-trained language model and interpreting the uncertainty caused by various learned factors.
URL	https://zenodo.org/doi/10.5281/zenodo.10795529


Title	OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning
Description	The remarkable performance of pre-trained large language models has revolutionised various natural language processing applications. Due to huge parameter sizes and extensive running costs, companies or organisations tend to transfer the models to the target task by zero-shot prompting techniques. However, the prohibitive costs of tokens and time have hindered their adoption in applications. We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs, thereby reducing token and time costs. This approach could potentially improve task performance during API queries due to better conditional distribution mapping. Evaluated across diverse classification datasets, our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance, and in some cases, even improving it. An ablation study conducted on various LLMs, along with an investigation into the robustness of our prompting strategy to different input ordering, offers valuable insights into the broader applicability of our method across diverse tasks. These findings also suggest a more seamless integration of our method with LLMs through an API.
Type Of Technology	Software
Year Produced	2023
Open Source License?	Yes
Impact	This tool provides a more efficient method for instruction-based searching on large language models.
URL	https://zenodo.org/doi/10.5281/zenodo.10795545

Abstract

Organisations

People

ORCID iD

Publications