Unsupervised Background Knowledge for Language Understanding
Lead Research Organisation:
CARDIFF UNIVERSITY
Department Name: Computer Science
Abstract
Significant progress in Artificial Intelligence (AI) has been made in recent years, and this has raised huge expectations about what the technology can offer us in the future. However, many challenges must still be addressed before this promise can be turned into a reality, and one of them is Natural Language Processing (NLP). If a computer is ever to understand humans in a natural way and to demonstrate the level of intelligence we would normally expect, then the problem of Language Understanding must be solved.
Making computers understand natural language is a non-trivial task. Current approaches to language understanding rely on end-to-end supervised learning, exemplified in recent years by deep learning techniques. Typically, a corpus of relevant text is collected and then used to train the computer to perform a certain task. However, this approach has several problems: for example, the words extracted and used to train a computer often carry implicit meanings and can be ambiguous. Consider the following two sentences:
(1) We found many birds during our visit to the zoo: eagles, parrots, cranes...
(2) The crane was hurt and could barely move.
From these training examples, a computer will not be able to understand that there are in fact two types of crane (the bird and the machine), nor that only one of them (the bird) can get hurt. It is widely recognised that handling word ambiguity and, more broadly, understanding what words mean, is a significant challenge in NLP. For instance, Google Translate, widely considered the state of the art in machine translation, fails to translate these two sentences correctly even into closely related languages such as Spanish. Generally speaking, current techniques struggle to generalise across different tasks and domains, especially in applications requiring language understanding.
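The point can be illustrated with a minimal sketch: a contextual language model assigns different vectors to the same word form depending on its sentence, whereas a single static vector cannot. The sketch assumes the `transformers` and `torch` Python packages and the public `bert-base-uncased` checkpoint, and adds a hypothetical third sentence for the machine sense of "crane"; none of these choices are prescribed by the proposal.

```python
# Illustrative sketch (assumptions: transformers + torch installed, public
# bert-base-uncased checkpoint; the second sentence is a hypothetical example
# of the "machine" sense of "crane").
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The crane was hurt and could barely move.",        # bird sense (sentence 2 above)
    "The crane lifted the steel beam onto the roof.",   # machine sense (hypothetical)
]

crane_id = tokenizer.convert_tokens_to_ids("crane")
crane_vectors = []
for sentence in sentences:
    encoding = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**encoding).last_hidden_state[0]  # (seq_len, dim)
    # Locate the token "crane" in the sentence and keep its contextual vector.
    position = (encoding["input_ids"][0] == crane_id).nonzero()[0].item()
    crane_vectors.append(hidden_states[position])

# A static embedding would give both occurrences the same vector; a contextual
# model separates them, reflecting the two senses of the word.
similarity = torch.cosine_similarity(crane_vectors[0], crane_vectors[1], dim=0)
print(f"Cosine similarity between the two uses of 'crane': {similarity.item():.2f}")
```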
The proposed research intends to develop theories and novel solutions to bridge this gap by combining and leveraging lexical resources and unsupervised techniques for analysing text corpora, thereby learning the much-needed, but not explicitly available, background knowledge. Our goal is then to seamlessly integrate this background knowledge into real-world applications for more accurate language understanding. We will develop these techniques for different languages, including lower-resourced languages such as Welsh, making them directly applicable to important multilingual NLP tasks and to domains with direct societal impact such as social media and health care.
Planned Impact
The beneficiaries of this fellowship are diverse. Natural Language Processing (NLP) and text mining practitioners and researchers will directly benefit from the resources and models that we will develop. Moreover, researchers in other fields (e.g. social scientists) can benefit from our tools to understand textual content in different languages. In the longer term, better language understanding systems will impact end-users in their daily interactions: better virtual assistants, more accurate translation systems, etc.
Industry will benefit from the training of three postdoctoral research associates, whose development will include placements in other top research labs, providing them with complementary expertise. Moreover, companies will benefit from the large-scale lexical resource that we will make available and from the open-source, practical NLP models that we will release, which can be used across different domains and languages.
As far as our first case study is concerned, this fellowship will provide tools to analyse and fight disinformation campaigns on social media. This will benefit the general public, as disinformation is a pressing problem that threatens important aspects of our society, including democracy. More generally, detecting deceptive content can help public security and provide police, government bodies and policymakers with a useful tool for addressing this issue directly. Our second case study will provide better insights for developing treatments for patients with mental health conditions. In addition to saving money for the NHS, it can potentially improve the general wellbeing of patients with psychiatric disorders. Moreover, it will open up avenues for future research, where automatic tools will become important for analysing available unstructured data such as clinical records.
Organisations
Publications
Aleksandra Edwards (2022) Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification
Antypas D (2024) Multilingual Topic Classification in X: Dataset and Analysis
Antypas D (2024) A Multi-Faceted NLP Analysis of Misinformation Spreaders in Twitter
Antypas D (2022) Politics, Sentiment and Virality: A Large-Scale Multilingual Twitter Analysis in Greece, Spain and United Kingdom, in SSRN Electronic Journal
Asahi Ushio (2022) Generative Language Models for Paragraph-Level Question Generation
| Description | Overall, the project has shown that it is possible to build flexible yet specialised models that are useful for a variety of applications (and languages), as demonstrated by our research outputs and the wide usage of our released tools. In particular, for social media these tools have been used to combat disinformation, analyse political communication at scale, help detect users with depression, analyse first responses to earthquakes, and monitor potential health outbreaks, among other applications. In this very dynamic domain the temporal aspect is key, and there has been progress towards developing time-aware models and releasing them as open source. |
| Exploitation Route | I have applied for and been awarded dedicated impact funding (EPSRC) to continue the development of the project. In particular, we have established the TweetNLP hub, where we keep integrating models and improving the user experience. So far, there is a dedicated demo with all models, as well as a Python library with extensive documentation and tutorials (a minimal usage sketch is shown below). |
| Sectors | Other |
| URL | https://tweetnlp.org/ |
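As a sketch of how the TweetNLP Python library mentioned above is intended to be used, the example below assumes the `tweetnlp` package is installed (`pip install tweetnlp`) and follows the usage documented in its tutorials; the exact output format may vary between versions.

```python
# Minimal sketch of the TweetNLP Python library (assumes `pip install tweetnlp`;
# the model name and output format follow the library's documentation and may change).
import tweetnlp

# Load one of the social-media-specialised models, e.g. sentiment analysis.
sentiment_model = tweetnlp.load_model("sentiment")
print(sentiment_model.sentiment("The new Cardiff NLP workshop dates are out!"))
# Expected output is a label such as {'label': 'positive'} (exact format may vary).
```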
| Description | Our models, released under the TweetNLP umbrella (https://tweetnlp.org/), have been used in many companies. They are among the most popular open machine learning models (as reflected in Hugging Face statistics, with our models downloaded and used millions of times per month). In particular, our models are specialised in informal text such as that found on social media, and enable a wide range of applications, from the health domain to the analysis of disinformation (the two case studies directly connected to the fellowship), as well as many others such as sentiment analysis, detecting hate speech online, and analysing content at scale. |
| First Year Of Impact | 2021 |
| Sector | Other |
| Impact Types | Societal Economic |
| Description | EPSRC Harmonised Impact Acceleration Account |
| Amount | £49,652 (GBP) |
| Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
| Sector | Public |
| Country | United Kingdom |
| Start | 03/2023 |
| End | 12/2023 |
| Title | DataBench |
| Description | This benchmark contains 65 diverse datasets for question answering over tabular data, which can be used to evaluate or train large language models on the task. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2024 |
| Provided To Others? | Yes |
| Impact | The associated paper was published at LREC-COLING 2024. Based on this benchmark, we organised a SemEval-2025 shared task that attracted over 100 participants worldwide, 35 of whom submitted a system description paper. |
| URL | https://huggingface.co/datasets/cardiffnlp/databench |
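A sketch of how the benchmark can be loaded from the Hugging Face Hub is shown below; it assumes the `datasets` package, and the subset name and printed fields are illustrative assumptions that should be checked against the dataset card.

```python
# Sketch of loading DataBench with the Hugging Face `datasets` package.
# The subset name ("qa") and the example fields are assumptions; see the
# dataset card at https://huggingface.co/datasets/cardiffnlp/databench.
from datasets import load_dataset

databench_qa = load_dataset("cardiffnlp/databench", "qa", split="train")  # subset assumed
example = databench_qa[0]
print(example)  # e.g. a natural-language question, its source table id and the gold answer
```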
| Title | Hate speech detection model |
| Description | A robust, transformer-based hate speech detection model specialised in social media text |
| Type Of Material | Computer model/algorithm |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| Impact | Downloaded over 1 million times per month |
| URL | https://huggingface.co/cardiffnlp/twitter-roberta-base-hate-latest |
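The model can be used directly through the Hugging Face `transformers` library, as in the sketch below; the input tweet is an invented example, and the exact label names follow the model card.

```python
# Sketch: using the released hate speech model via the `transformers` pipeline
# (assumes transformers is installed; the input tweet is an invented example).
from transformers import pipeline

hate_classifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-hate-latest",
)
print(hate_classifier("I really enjoyed the match last night!"))
# Returns a label (e.g. hate / not-hate) together with a confidence score.
```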
| Title | Multilingual topic classification dataset and associated models |
| Description | A dataset of topic classification tweets in English, Japanese, Greek and Spanish. The collection also includes transformer-based language models fine-tuned on the task to automatically classify the topic of posts in social media. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2024 |
| Provided To Others? | Yes |
| Impact | The paper was published at the EMNLP 2025 conference. The dataset and the associated models are downloaded thousands of times every month. |
| URL | https://huggingface.co/collections/cardiffnlp/tweettopic-65eb2a0eada92a05d3d103ce |
| Title | TimeLMs: Language Models for Social Media |
| Description | Despite its importance, the time variable has been largely neglected in the NLP and language model literature. We present TimeLMs, a set of language models specialized on diachronic Twitter data. We show that a continual learning strategy contributes to enhancing Twitter-based language models' capacity to deal with future and out-of-distribution tweets, while making them competitive with standardized and more monolithic benchmarks. We also perform a number of qualitative analyses showing how they cope with trends and peaks in activity involving specific named entities or concept drift. |
| Type Of Material | Computer model/algorithm |
| Year Produced | 2022 |
| Provided To Others? | Yes |
| Impact | Featured in Import AI newsletter |
| URL | https://github.com/cardiffnlp/timelms |
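The sketch below shows how one of the released checkpoints can be queried for time-sensitive masked predictions via the `transformers` fill-mask pipeline; the checkpoint id is one of the publicly listed TimeLMs models and should be verified against the repository's up-to-date list of quarterly releases.

```python
# Sketch: querying a TimeLMs checkpoint with the `transformers` fill-mask pipeline.
# The checkpoint id below is assumed to be one of the released TimeLMs models;
# consult https://github.com/cardiffnlp/timelms for the full list.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cardiffnlp/twitter-roberta-base-2021-124m")
for prediction in fill_mask("So glad I'm <mask> vaccinated."):
    # Each prediction contains the filled token and its probability score.
    print(prediction["token_str"], round(prediction["score"], 3))
```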
| Description | KAIST, South Korea |
| Organisation | Korea Advanced Institute of Science and Technology (KAIST) |
| Country | Korea, Republic of |
| Sector | Academic/University |
| PI Contribution | Collaboration in developing methodologies to make NLP technologies (in particular Language Models) more culturally aware and sensitive. As part of this, we have published a paper analysing hate speech detection models and their cultural bias, and we are collaborating on building a major benchmark that can be further used to reduce the cultural bias and improve the cultural awareness of these models. We contribute mainly to the NLP aspect. |
| Collaborator Contribution | The collaboration involved members from both teams, and they contributed substantially to the outcomes. I visited their lab for almost two months, where I learned from their perspective and integrated it into our work. |
| Impact | https://arxiv.org/abs/2308.16705 |
| Start Year | 2023 |
| Title | AutoQG: Automatic question generation |
| Description | Automatic question generation model. Given a paragraph, it generates questions and answers automatically, in several languages |
| Type Of Technology | Webtool/Application |
| Year Produced | 2023 |
| Open Source License? | Yes |
| Impact | Used in different applications, ranging from technical uses in NLP research to education |
| URL | https://autoqg.net/ |
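The webtool's functionality can also be reproduced programmatically. The sketch below uses the `lmqg` Python package associated with our question generation work; the package name, class and method follow its public documentation and should be treated as assumptions, since details may vary between versions.

```python
# Sketch: paragraph-level question and answer generation with the `lmqg` package
# (assumes `pip install lmqg`; API names follow the package documentation and may change).
from lmqg import TransformersQG

qg_model = TransformersQG(language="en")  # loads a default English QG model
context = (
    "William Turner was an English painter who specialised in watercolour "
    "landscapes and is known for his expressive colouring."
)
# Returns a list of (question, answer) pairs generated from the paragraph.
for question, answer in qg_model.generate_qa(context):
    print(question, "->", answer)
```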
| Title | Metaphor dataset repository |
| Description | This website provides a platform to share and find metaphor datasets under a unified environment. It also provides an editing tool, and a catalogue of existing datasets. |
| Type Of Technology | Webtool/Application |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | A paper describing the platform has been accepted to NAACL 2025 (Demo track) |
| URL | https://www.metaphorshare.com/ |
| Title | T-NER (Named Entity Recognition with transformers) |
| Description | Easy-to-use Python library for state-of-the-art Named Entity Recognition based on Transformers |
| Type Of Technology | Webtool/Application |
| Year Produced | 2021 |
| Open Source License? | Yes |
| Impact | Used by thousands of users, with over 130 stars on GitHub. A demo paper was published at a top NLP conference |
| URL | https://github.com/asahi417/tner |
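A sketch of the library's basic usage is given below; it follows the examples in the repository's documentation, so the model id and output format are assumptions to verify there.

```python
# Sketch: basic usage of the T-NER library (assumes `pip install tner`; the
# model id and output format follow the repository's examples and may differ).
import tner

ner_model = tner.TransformersNER("tner/roberta-large-tweetner7-all")  # model id assumed
predictions = ner_model.predict(
    ["Cardiff NLP will host its workshop in Wales this summer."]
)
print(predictions)  # detected entity spans with their types (e.g. location, group)
```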
| Title | TweetNLP |
| Description | TweetNLP is a website that enables users to apply cutting-edge language technologies to social media, irrespective of their level of expertise. It stems from a multilingual collaboration between academia and industry, building on our research publications. TweetNLP contains a Python API, demos and tutorials with many examples to get you started. Best of all, everything is free! Tasks supported: sentiment analysis, emotion recognition, hate speech detection, offensive language identification, emoji prediction, topic classification, named entity recognition, question answering and question generation |
| Type Of Technology | Webtool/Application |
| Year Produced | 2022 |
| Open Source License? | Yes |
| Impact | Some of the models included are used in industrial settings and are downloaded millions of times per month. For instance, the sentiment analysis model was the most downloaded model on the Hugging Face Hub in January 2021, with over 15 million downloads that month, and it remains among the most used models overall. |
| URL | https://tweetnlp.org/ |
| Description | Organisation of Cardiff NLP Workshop |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | Organisation of the annual Cardiff NLP Workshop, with participation (invited speakers) from industry (e.g. DeepMind, Huawei, Hugging Face, Amazon) and academia (e.g. Cambridge University, University of Sheffield). It is aimed mainly at PhD students, early career researchers and practitioners working with NLP. It has been held every year since 2022, with 2025 being the 4th iteration. |
| Year(s) Of Engagement Activity | 2022,2023,2024,2025 |
| URL | https://www.cardiffnlpworkshop.org/ |
