Unsupervised Background Knowledge for Language Understanding

Lead Research Organisation: CARDIFF UNIVERSITY
Department Name: Computer Science

Abstract

Significant progress has been made in Artificial Intelligence (AI) in recent years, raising high expectations about what this technology can offer us in the future. However, there are still many challenges that must be addressed before this promise can be turned into a reality, and one of these challenges is Natural Language Processing (NLP). If a computer is ever to understand humans in a natural way and to demonstrate a level of intelligence that we would normally expect, then the problem of Language Understanding must be solved.

Making computers understand natural language is a non-trivial task. Current approaches to language understanding rely on end-to-end supervised learning, exemplified in recent years by deep learning techniques. Typically, a corpus of relevant text is collected and then used to train the computer to perform a certain task. However, this approach has several problems; for example, the words used to train a computer often carry implicit meanings and can be ambiguous. Consider the following two sentences:

(1) We found many birds during our visit to the zoo: eagles, parrots, cranes...
(2) The crane was hurt and could barely move.

From these training examples, a computer will not be able to work out that there are in fact two types of crane (a bird and a machine), and that only one of them (the bird) can get hurt. It is widely recognised that handling word ambiguity and, more broadly, understanding what words mean, is a significant challenge in NLP. For instance, Google Translate, widely considered the state of the art in machine translation, fails to translate these two sentences correctly even into closely related languages such as Spanish. More generally, current techniques struggle to generalise across different tasks and domains, especially in applications requiring language understanding.

The proposed research intends to develop theories and novel solutions to bridge this gap by combining and leveraging lexical resources and unsupervised techniques for analysing text corpora, thereby learning the much-needed, but not explicitly available, background knowledge. Our goal is then to seamlessly integrate this background knowledge into real-world applications for more accurate language understanding. We will develop these techniques for different languages, including lower-resourced languages such as Welsh, making them directly applicable to important multilingual NLP tasks and to domains with direct societal impact such as social media and health care.

Planned Impact

The beneficiaries of this fellowship are diverse. Natural Language Processing (NLP) and text mining practitioners and researchers will directly benefit from the resources and models that we will develop. Moreover, researchers in other fields (e.g. social scientists) can benefit from our tools to understand textual content in different languages. In the longer term, better language understanding systems will impact end-users in their daily interactions: better virtual assistants, more accurate translation systems, etc.

Industry will benefit from the training of three postdoctoral research associates, including placements in other top research labs that will provide them with complementary expertise. Moreover, companies will benefit from the large-scale lexical resource that we will make available and from the open-source, practical NLP models that we will release for different domains and languages.

As far as our first case study is concerned, this fellowship will provide tools to analyse and fight disinformation campaigns on social media. This will benefit the general public, as disinformation is an important problem threatening key aspects of our society, including democracy. More generally, detecting deceptive content can support public security and provide police and government bodies, including policymakers, with a useful tool for addressing this issue directly. Our second case study will provide better insights for developing treatments for patients with mental health conditions. In addition to saving the NHS money, it can potentially improve the general wellbeing of patients with psychiatric disorders. Moreover, it will open up avenues for future research in which automatic tools become instrumental for analysing the available unstructured data, such as clinical records.
 
Description Overall, the project has shown that it is possible to build flexible and specialised models that are useful for a variety of applications (and languages), as demonstrated by our research outputs and the wide usage of our released tools. In particular, for social media these tools have been used to combat disinformation, analyse political communication at scale, help detect users with depression, analyse first responses to earthquakes, and monitor potential health outbreaks, among other uses. For this very dynamic domain the temporal aspect is key, and we have made progress towards developing time-aware models and releasing them open source.
Exploitation Route I have applied for and been awarded dedicated impact funding (EPSRC) to continue the development of the project. In particular, we have established the TweetNLP hub, where we keep integrating models and improving the user experience. So far, there is a dedicated demo covering all models, as well as a Python library with extensive documentation and tutorials.
Sectors Other

URL https://tweetnlp.org/
 
Description Our models, released under the TweetNLP umbrella (https://tweetnlp.org/), have been used in many companies. They are among the most popular open machine learning models on Hugging Face, downloaded and used millions of times per month. In particular, our models are specialised in informal text such as that found on social media, and they enable a wide range of applications, from health monitoring to disinformation analysis (the two case studies directly connected to the fellowship), as well as many others such as sentiment analysis, online hate speech detection, and content analysis at scale.
First Year Of Impact 2021
Sector Other
Impact Types Societal, Economic
 
Description EPSRC Harmonised Impact Acceleration Account
Amount £49,652 (GBP)
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 03/2023 
End 12/2023
 
Title DataBench 
Description This benchmark contains 65 diverse datasets for question answering over tabular data, which can be used to evaluate or train large language models on this task (see the loading sketch below). 
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
Impact The associated paper was published at LREC-COLING 2024. Based on this benchmark, we organised a SemEval-2025 shared task that attracted over 100 participating teams worldwide, 35 of which submitted a system description paper. 
URL https://huggingface.co/datasets/cardiffnlp/databench
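
As an illustration, the benchmark can be loaded with the Hugging Face datasets library. This is a minimal sketch only: the "qa" configuration name below is an assumption used for illustration, and the actual configuration and split names are listed on the dataset card at the URL above.

# Minimal sketch: loading DataBench via the Hugging Face `datasets` library.
from datasets import load_dataset

# NOTE: the "qa" configuration name is assumed for illustration; consult the
# dataset card (URL above) for the actual configuration and split names.
questions = load_dataset("cardiffnlp/databench", "qa", split="train")
print(questions[0])  # a question paired with the table it refers to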
 
Title Hate speech detection model 
Description Robust transformer-based hate speech detection model specialised for social media (see the usage sketch below). 
Type Of Material Computer model/algorithm 
Year Produced 2023 
Provided To Others? Yes  
Impact Downloaded over 1 million times per month 
URL https://huggingface.co/cardiffnlp/twitter-roberta-base-hate-latest
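
Because the model is hosted on the Hugging Face Hub, it can be loaded with the standard transformers pipeline API. The snippet below is a minimal usage sketch; the exact label set is documented on the model card at the URL above.

# Minimal sketch: scoring a post with the released hate speech model.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-hate-latest",
)
# Returns a label (as defined on the model card) with a confidence score.
print(classifier("I really enjoyed the concert last night!"))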
 
Title Multilingual topic classification dataset and associated models 
Description A topic classification dataset of tweets in English, Japanese, Greek and Spanish. The collection also includes transformer-based language models fine-tuned on the task to automatically classify the topic of social media posts. 
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
Impact The paper was published at EMNLP 2025. The dataset, as well as the associated models, is downloaded thousands of times every month. 
URL https://huggingface.co/collections/cardiffnlp/tweettopic-65eb2a0eada92a05d3d103ce
 
Title TimeLMs: Language Models for Social Media 
Description Despite its importance, the time variable has been largely neglected in the NLP and language model literature. We present TimeLMs, a set of language models specialised on diachronic Twitter data. We show that a continual learning strategy contributes to enhancing Twitter-based language models' capacity to deal with future and out-of-distribution tweets, while keeping them competitive on standardised and more monolithic benchmarks. We also perform a number of qualitative analyses showing how they cope with trends and peaks in activity involving specific named entities or concept drift (see the usage sketch below). 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact Featured in Import AI newsletter 
URL https://github.com/cardiffnlp/timelms
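
The TimeLMs checkpoints are standard RoBERTa-style masked language models, so they can be queried with a fill-mask pipeline. The checkpoint name below is one example taken from the project's model list and may differ from the latest release; see the repository above for the full set of time-stamped models.

# Minimal sketch: querying a time-specific TimeLMs checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cardiffnlp/twitter-roberta-base-2021-124m")
# Predictions reflect the time period the checkpoint was trained on.
for prediction in fill_mask("So glad I'm <mask> vaccinated."):
    print(prediction["token_str"], round(prediction["score"], 3))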
 
Description KAIST, South Korea 
Organisation Korea Advanced Institute of Science and Technology (KAIST)
Country Korea, Republic of 
Sector Academic/University 
PI Contribution Collaboration on developing methodologies to make NLP technologies (in particular language models) more culturally aware and sensitive. As part of this, we have published a paper analysing hate speech detection models and their cultural bias, and we are collaborating on building a major benchmark that can be used to further improve the cultural awareness of these models and reduce their bias. We contribute mainly to the NLP aspect.
Collaborator Contribution The collaboration involved members from both teams, who contributed substantially to the outcomes. I visited their lab for almost two months, where I learned their perspective and integrated it into our work.
Impact https://arxiv.org/abs/2308.16705
Start Year 2023
 
Title AutoQG: Automatic question generation 
Description Automatic question generation model. Given a paragraph, it generates questions and answers automatically, in several languages 
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact Used in different applications, ranging from technical uses in NLP research to education 
URL https://autoqg.net/
 
Title Metaphor dataset repository 
Description This website provides a platform to share and find metaphor datasets under a unified environment. It also provides an editing tool, and a catalogue of existing datasets. 
Type Of Technology Webtool/Application 
Year Produced 2024 
Open Source License? Yes  
Impact A paper describing the platform has been accepted to NAACL 2025 (Demo track) 
URL https://www.metaphorshare.com/
 
Title T-NER (Named Entity Recognition with transformers) 
Description Easy-to-use Python library for state-of-the-art Named Entity Recognition based on transformers (see the usage sketch below). 
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact Used by thousands of users, with over 130 stars on GitHub. Demo paper published at a top NLP conference 
URL https://github.com/asahi417/tner
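
A minimal usage sketch follows, assuming the TransformersNER entry point described in the repository README; the checkpoint name is one of the pretrained models distributed with the library and is used here for illustration only.

# Minimal sketch of the T-NER Python API (see the GitHub README for the
# authoritative interface and the list of pretrained checkpoints).
from tner import TransformersNER

model = TransformersNER("tner/roberta-large-wnut2017")  # example checkpoint
print(model.predict(["Jacob Collier is a Grammy-awarded artist from London."]))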
 
Title TweetNLP 
Description TweetNLP is a website that enables users to apply cutting-edge language technologies to social media, irrespective of their level of expertise. It stems from a multilingual collaboration between academia and industry and builds on our research publications. TweetNLP provides a Python API, demos and tutorials with many examples to get you started (see the usage sketch below). Best of all, everything is free! Tasks supported: sentiment analysis, emotion recognition, hate speech detection, offensive language identification, emoji prediction, topic classification, named entity recognition, question answering and question generation 
Type Of Technology Webtool/Application 
Year Produced 2022 
Open Source License? Yes  
Impact Some of the included models are used in industrial settings and are downloaded millions of times per month. For instance, the sentiment analysis model was the most downloaded model on the Hugging Face Hub in January 2021, with over 15 million downloads that month, and it remains among the most used models overall. 
URL https://tweetnlp.org/
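
A minimal usage sketch of the Python API follows, assuming the tweetnlp package described in the project documentation (pip install tweetnlp); see https://tweetnlp.org/ for the authoritative interface and the full list of supported tasks.

# Minimal sketch of the TweetNLP Python API.
import tweetnlp

model = tweetnlp.load_model("sentiment")          # load the sentiment analysis model
print(model.sentiment("I love this new phone!"))  # e.g. {'label': 'positive'}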
 
Description Organisation of Cardiff NLP Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Organisation of the annual Cardiff NLP Workshop, with participation (invited speakers) from industry (e.g. DeepMind, Huawei, Hugging Face, Amazon) and academia (e.g. the University of Cambridge, the University of Sheffield). It is mainly aimed at PhD students, early-career researchers and practitioners working with NLP, and has been held every year since 2022, with 2025 being the fourth iteration.
Year(s) Of Engagement Activity 2022,2023,2024,2025
URL https://www.cardiffnlpworkshop.org/