A Lebesgue Integral based Approximation for Language Modelling

Lead Research Organisation: King's College London
Department Name: Computer Science

Abstract

Deep learning (DL) based Natural Language Processing (NLP) technologies have attracted significant interest in recent years. Current state-of-the-art language models, i.e., transformer-based language models, typically assume that the representation of a given word can be captured by interpolating its related context within a convex hull. However, it has recently been shown that in high-dimensional spaces, interpolation almost surely never occurs, regardless of the intrinsic dimension of the underlying data manifold. The representations generated by such transformer-based language models converge into a dense cone-like hyperspace, which is often discontinuous, with many nonadjacent clusters. To overcome this limitation of most current DL-based NLP models, this project aims to deploy the Lebesgue integral, which can be defined as an ensemble of integrals over partitions (i.e., discontinuous feature clusters), to approximate the posterior distributions of clusters given input word features in finite measurable sets, by automatically identifying the boundaries of such discontinuous sets. This, in turn, could help to generate better interpretations and quantify uncertainty. Under the proposed Lebesgue integral based approximation, input text is characterised by two properties: an indicator vector encoding its membership in clusters (i.e., measurable sets), and a continuous feature representation that better captures its semantic meaning for downstream tasks. This not only allows a more faithful approximation of the countable discontinuities commonly observed in distributions of input text in NLP, but also enables learning text representations that are better understood by humans.
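The measure-theoretic intuition behind the proposal can be sketched as follows (the symbols below are illustrative, not taken from the proposal): given a partition of the representation space X into K disjoint measurable sets A_1, ..., A_K (the discontinuous feature clusters), the Lebesgue integral of a function f decomposes cluster-wise:

```latex
\int_X f \, d\mu \;=\; \sum_{k=1}^{K} \int_{A_k} f \, d\mu,
\qquad X = \bigcup_{k=1}^{K} A_k,
\quad A_i \cap A_j = \emptyset \ (i \neq j).
```

Each cluster can thus be integrated over separately even when the overall distribution is discontinuous. The two-part representation described above then pairs a membership indicator with a continuous feature vector, e.g. r(x) = (1_{A_1}(x), ..., 1_{A_K}(x), h(x)) with h(x) continuous.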

Publications

Hanqi Yan (2023) Explainable Recommender with Geometric Information Bottleneck in IEEE Transactions on Knowledge and Data Engineering.

Li H. (2023) Distinguishability Calibration to In-Context Learning in EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023

Lu J. (2023) NapSS: Paragraph-level Medical Text Simplification via Narrative Prompting and Sentence-matching Summarization in EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023

Tan X. (2023) Event Temporal Relation Extraction with Bayesian Translational Model in EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference

 
Description In this project, we are working on making the decisions of large language models easier for users to understand. These systems are typically black boxes containing billions of parameters, and it is hard to see how they work. Our new approach breaks the system down into simpler parts and identifies which part is responsible for the decisions it makes. We also introduce a new way to measure how sure the system is about its decisions based on these parts. This means our system can explain its decisions, and how certain it is about them, in a way that is easy for anyone to understand.
Exploitation Route By breaking down the decision-making process of machine learning models into simpler parts, users can better understand why a model makes certain decisions. This transparency is crucial in sensitive areas like healthcare, finance, and criminal justice, where understanding the basis of a decision can be as important as the decision itself.
The proposed method could also inspire research in trustworthy AI. When users can see how and why decisions are made, they are more likely to trust the technology, which could lead to wider adoption of machine learning systems in various fields. By knowing how confident the model is in its predictions, users can make more informed decisions about when and how to rely on machine learning outputs. This could be particularly beneficial in high-stakes scenarios where the cost of incorrect decisions is significant.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description In this project, we have developed an interactive narrative understanding system that is open to the public. This demonstration system was presented at the AAAI 2024 conference, a premier event in AI research. Our presentation attracted significant attention from both the academic and industrial sectors. Potential collaborations include applications in second language learning environments and entertainment, such as open-world gaming.
Sector Education
Impact Types Societal

 
Title Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives 
Description We introduce Conan, a benchmark for extracting and analysing complex character relationship graphs from detective narratives. The input originates from the character background stories of text-based games, comprising k background stories N_c, each uniquely crafted from the perspective of a character c. We then manually extracted and annotated role-oriented relationships from these diverse viewpoints. 
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
Impact Existing datasets for narrative understanding often fail to represent the complexity and uncertainty of relationships in real-life social scenarios. To address this gap, we introduce a new benchmark, Conan, designed for extracting and analysing intricate character relation graphs from detective narratives. Specifically, we designed hierarchical relationship categories and manually extracted and annotated role-oriented relationships from the perspectives of various characters, incorporating both public relationships known to most characters and secret ones known to only a few. Our experiments with advanced Large Language Models (LLMs) such as GPT-3.5, GPT-4, and Llama2 reveal their limitations in inferring complex relationships and handling longer narratives. The combination of the Conan dataset and our pipeline strategy is geared towards understanding the ability of LLMs to comprehend nuanced relational dynamics in narrative contexts. 
URL https://zenodo.org/doi/10.5281/zenodo.10724723
 
Title Tracking Brand-Associated Polarity-Bearing Topics in User Reviews 
Description The preprocessed data included in [data/beauty_makeupalley/] can be used directly for dBTM and O-dBTM. It can also be used for the baselines BTM, dJST and TBIP, with some tiny changes to fit the input formats of those models. The original data is from MakeupAlley, a review website on beauty products. In the repo, data/{dataset_name}/time contains the following files: counts.npz, a [num_documents, num_words] sparse CSR matrix containing the word counts for each document; brand_indices.npy, a [num_documents] vector where each entry is an integer in the set {0, 1, ..., num_brands - 1}, indicating the brand of the corresponding document in counts.npz; score_indices.npy, a [num_documents] vector where each entry is an integer in the set {-1, 0, 1}, indicating the review polarity of the corresponding document in counts.npz. data/{dataset_name}/clean contains the following files: brand_map.txt, a [num_brands]-length file where each line denotes the name of a brand in the corpus; vocabulary.txt, a [num_words]-length file where each line denotes the corresponding word in the vocabulary. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
Impact Monitoring online customer reviews is important for business organisations to measure customer satisfaction and better manage their reputations. Popular datasets such as Yelp, Amazon products, and the Multi-Domain Sentiment dataset are constructed by randomly selecting reviews from Amazon or Yelp without considering their distributions over brands and across time periods. We therefore constructed our own dataset by crawling reviews of the top 25 brands on MakeupAlley, a review website on beauty products. Each review is accompanied by a rating score, product type, brand and post time. The entire dataset contains 611,128 reviews spanning 9 years (2005 to 2013). We treat each year as a time slice and split the reviews into 9 time slices. The average review length is 123 words. 
URL https://github.com/BLPXSPG/dBTM
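The file layout described above can be consumed with NumPy and SciPy as follows. This is a minimal sketch: the tiny arrays are synthetic stand-ins for the real dataset, and only the file names and shapes follow the description; the per-brand summary at the end is an illustrative use, not part of the released code.

```python
import numpy as np
from scipy import sparse

# Synthetic stand-ins mirroring the described layout:
#   counts.npz        -> [num_documents, num_words] sparse CSR word counts
#   brand_indices.npy -> [num_documents] brand ids in {0, ..., num_brands-1}
#   score_indices.npy -> [num_documents] review polarity in {-1, 0, 1}
counts = sparse.csr_matrix(np.array([[1, 0, 2],
                                     [0, 3, 0],
                                     [1, 1, 0]]))
brand_indices = np.array([0, 1, 0])
score_indices = np.array([1, -1, 0])

# With the real repo, the equivalents would be loaded as:
#   counts = sparse.load_npz("data/beauty_makeupalley/time/counts.npz")
#   brand_indices = np.load("data/beauty_makeupalley/time/brand_indices.npy")
#   score_indices = np.load("data/beauty_makeupalley/time/score_indices.npy")

# Example use: documents per brand and mean review polarity per brand.
num_brands = brand_indices.max() + 1
docs_per_brand = np.bincount(brand_indices, minlength=num_brands)
mean_polarity = np.array([score_indices[brand_indices == b].mean()
                          for b in range(num_brands)])
```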
 
Title CUE: An Uncertainty Interpretation Framework for Text Classifiers Built on Pre-Trained Language Models 
Description Text classifiers built on Pre-trained Language Models (PLMs) have achieved remarkable progress in various tasks, including sentiment analysis, natural language inference, and question answering. However, the occurrence of uncertain predictions by these classifiers poses a challenge to their reliability when deployed in practical applications. Much effort has been devoted to designing probes to understand what PLMs capture, but few studies have delved into the factors influencing the predictive uncertainty of PLM-based classifiers. In this paper, we propose a novel framework, called CUE, which aims to interpret the uncertainties inherent in the predictions of PLM-based models. In particular, we first map PLM-encoded representations to a latent space via a variational auto-encoder. We then generate text representations by perturbing the latent space, which causes fluctuations in predictive uncertainty. By comparing the difference in predictive uncertainty between the perturbed and the original text representations, we are able to identify the latent dimensions responsible for uncertainty and subsequently trace back to the input features that contribute to it. Our extensive experiments on four benchmark datasets encompassing linguistic acceptability classification, emotion classification, and natural language inference show the feasibility of our proposed framework. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact This approach offers a potential solution for disentangling the representation learned from a pre-trained language model and interpreting the uncertainty caused by various learned factors. 
URL https://zenodo.org/doi/10.5281/zenodo.10795529
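The perturbation idea behind CUE can be sketched with plain NumPy. This is a toy illustration under stated assumptions, not the released implementation: random linear maps stand in for the trained VAE encoder/decoder and the PLM classifier, and the perturbation magnitude is arbitrary. The procedure, however, matches the description above: perturb each latent dimension, measure the resulting change in predictive entropy, and rank dimensions by that fluctuation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Predictive entropy of a probability vector."""
    return -np.sum(p * np.log(p + 1e-12))

# Stand-ins for learned components (random weights, illustrative only):
# W_enc: PLM representation -> latent mean; W_dec: latent -> representation;
# W_clf: representation -> class logits.
d_rep, d_lat, n_cls = 16, 4, 3
W_enc = rng.normal(size=(d_lat, d_rep))
W_dec = rng.normal(size=(d_rep, d_lat))
W_clf = rng.normal(size=(n_cls, d_rep))

x = rng.normal(size=d_rep)      # a PLM-encoded text representation
z = W_enc @ x                   # map to the latent space
base_h = entropy(softmax(W_clf @ (W_dec @ z)))

# Perturb each latent dimension and record the uncertainty fluctuation.
eps = 0.5
fluct = np.empty(d_lat)
for k in range(d_lat):
    z_pert = z.copy()
    z_pert[k] += eps
    h = entropy(softmax(W_clf @ (W_dec @ z_pert)))
    fluct[k] = abs(h - base_h)

# Latent dimension whose perturbation most changes predictive uncertainty.
uncertain_dim = int(np.argmax(fluct))
```

In the full framework this ranking would then be traced back through the decoder to input features; here the sketch stops at the latent attribution step.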
 
Title OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning 
Description The remarkable performance of pre-trained large language models has revolutionised various natural language processing applications. Due to their huge parameter sizes and high running costs, companies or organisations tend to transfer these models to a target task via zero-shot prompting techniques. However, the prohibitive token and time costs have hindered their adoption in applications. We propose OverPrompt, which leverages the in-context learning capability of LLMs to handle multiple task inputs in a single query, thereby reducing token and time costs. This approach could also improve task performance during API queries due to better conditional distribution mapping. Evaluated across diverse classification datasets, our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without significant detriment to task performance, and in some cases even improves it. An ablation study conducted on various LLMs, along with an investigation into the robustness of our prompting strategy to different input orderings, offers valuable insights into the broader applicability of our method across diverse tasks. These findings also suggest a more seamless integration of our method with LLMs through an API. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact This tool provides a more efficient method for instruction-based searching on large language models. 
URL https://zenodo.org/doi/10.5281/zenodo.10795545
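The core packing step can be sketched as follows. This is a hedged illustration, not the released code: the function name, prompt template, and answer format are hypothetical, and only the idea of batching several task inputs into one zero-shot prompt comes from the description above.

```python
def overprompt(task_instruction, inputs):
    """Pack multiple task inputs into a single zero-shot prompt so one
    API call handles them all, amortising the instruction's token cost."""
    lines = [task_instruction,
             "Answer each item on its own line as '<id>: <label>'."]
    for i, text in enumerate(inputs, start=1):
        lines.append(f"{i}. {text}")
    return "\n".join(lines)

# Example: three sentiment-classification inputs sharing one instruction.
reviews = ["Great battery life.",
           "Screen cracked in a week.",
           "Does the job."]
prompt = overprompt(
    "Classify the sentiment of each review as positive, negative or neutral.",
    reviews,
)
```

The single `prompt` string would then be sent in one API query instead of three, which is where the token and time savings arise.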