Modeling Idiomaticity in Human and Artificial Language Processing

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

This project will develop computational models with the ability to recognize and accurately process idiomatic (non-literal) language that are linguistically motivated and cognitively-inspired by human processing data. Equipping models with the ability to process idiomatic expressions is particularly important for obtaining more accurate representations as these can lead to gains in downstream tasks, such as machine translation and text simplification. The originality of this work is in integrating linguistic and cognitive clues about human idiomatic language processing in the construction of models for word and phrase representations, and in integrating them in downstream tasks.

The main objectives and research challenges of this project are:
1. To explore cognitive and linguistic clues linked to idiomaticity that can be used to guide models for word and phrase representations.
2. To investigate idiomatically-aware models.
3. To explore alternative forms of integrating these models in applications and to develop a framework for idiomaticity evaluation in word and phrase representation models.
4. To release software implementations of the proposed models to facilitate reproducibility and wider adoption by the research community.

This proposal targets a crucial limitation in standard NLP models, as idiomaticity is pervasive in human communication, with potential benefits to various applications that include natural language interfaces, such as conversational agents, computer-assisted language learning platforms, question answering and information retrieval systems. As a consequence, we anticipate the proposal will have a wide academic impact in the community. Moreover, enabling more precise language understanding and generation also has the potential to enhance accessibility and digital inclusion by promoting more natural and accurate communication between humans and machines. We intend to demonstrate the additional potential benefits of these models through interactions with our external collaborators, including by means of an advisory board. The board will include other academics and industrial partners working on related topics, such as Dr. Fabio Kepler (Unbabel Portugal, for machine translation), Prof. Mathieu Constant (Université de Lorraine, France, for parsing and idiomaticity), Prof. Lucia Specia (Imperial College, UK, for text simplification) and Dr. Afsaneh Fazly (Samsung, Canada, for idiomaticity).

Planned Impact

Correctly representing and treating idiomatic expressions can have a positive impact on the accuracy of NLP applications such as machine translation (MT), text simplification (TS), text summarization, dialog systems, among others. Consequently, we anticipate an impact on the following areas:

- Economy
NLP service providers are highly interested in providing accurate outputs in order to avoid misleading their users, which can result in liability for these providers. In this project we will focus on two downstream applications, text simplification and machine translation, which are available as stand-alone commercial platforms. However, there is a large variety of products that also involve understanding and generating natural language, such as conversational agents and digital assistants, which can be found around the world in software and hardware including commercial websites, computers, mobile phones and other portable devices, car navigation systems, etc. As users' trust in NLP applications increases, so does their economic value. Our approaches can therefore positively impact the NLP industry.

- Society
As text simplification and machine translation reach millions of users worldwide, advances in idiomatic language processing have the potential to bring a positive impact by improving their experience. Although we address these two tasks, the results of this project have a high potential to improve the reliability of a larger variety of NLP downstream tasks, providing users with more accurate and human-friendly interfaces. This can also enable disadvantaged or vulnerable users, such as non-native speakers of English or low-literate individuals, to fully access and comprehend information that contains idiomatic expressions.

- Knowledge
This proposal targets a fundamental limitation in traditional NLP approaches that needs to be adequately addressed in precise language technology. As a consequence, we anticipate publishing the resources and methods developed, together with the results obtained, in high-profile NLP conferences. Moreover, to enhance reproducibility of results and to promote reusability of resources, we will make them available on the project's dedicated website and GitHub repository. With its focus on analysis, modelling and application, this project has the potential to encourage discussions about responsible innovation and ethics in NLP.

- People
This proposal has the potential to help establish the careers of both the Sheffield PI and Co-I. It will enable the PI to use her previous experience coordinating projects in Brazil to establish a research profile in the UK and solidify her position in the field, while using her research expertise to address an important shortcoming in current language technology. This EPSRC grant will be instrumental in allowing both the Sheffield PI and Co-I to build a collaboration network, advance their research and demonstrate successful project delivery in the UK for securing further funding. It will also enable the RA to work in a well-established, fast-growing NLP group, one of the largest in Europe, with connections to the main NLP and ML centres around the world. The RA will also benefit from knowledge exchange with the PI and Co-Is, working on a relevant NLP topic. BSc and MSc students may also benefit from this project, as we will propose small projects related to this topic.

Publications

 
Description Large pre-trained language models (PLMs, also known as foundation models) such as BERT, GPT-2 and GPT-3 have been widely claimed to successfully understand language and have been incorporated in many applications. This research has shown that they still do not understand figurative, idiomatic language. Through careful analyses and the design of dedicated probing methods, it produced concrete findings and evidence that stress the need to revise these models by incorporating dedicated techniques for accurate idiomaticity representation. Without such revision, there is a risk of incorrect understanding, with consequent impact on applications (like machine translation) and on any decision-making that relies on them. Moreover, the research generated datasets and probes that can be used as benchmarks for assessing how well PLMs understand language.
Exploitation Route The project has generated models, datasets and probes that can be used as benchmarks for assessing how well PLMs understand language before deployment.
Sectors Communities and Social Services/Policy,Digital/Communication/Information Technologies (including Software),Education,Healthcare,Culture, Heritage, Museums and Collections,Pharmaceuticals and Medical Biotechnology

 
Description This project has contributed to bringing the topic of multiword expressions, idiomaticity and figurative language to the forefront of the research community, through the creation and sharing of resources and models, as well as through the preparation of a shared task. The shared task was selected to run at SemEval, via a selective submission process, and attracted over 100 participants, in 25 groups, from both academia and industry. The shared task invited the community to contribute systems and models, both to map the landscape of existing state-of-the-art solutions and to promote the design of innovative solutions for improving the performance of PLMs in the accurate detection and representation of idiomatic language. In addition, the shared task created dedicated resources for Galician, a low-resource language, as well as for Portuguese and English. Additional impact is expected in the final period of the project.
First Year Of Impact 2021
Sector Digital/Communication/Information Technologies (including Software),Education,Healthcare
 
Title AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models 
Description Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings, spanning both English and Portuguese. We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms. Our experiments demonstrate that, on the task of detecting idiomatic usage, these models perform reasonably well in the one-shot and few-shot scenarios, but that there is significant scope for improvement in the zero-shot scenario. On the task of representing idiomaticity, we find that pre-training is not always effective, while fine-tuning could provide a sample efficient method of learning representations of sentences containing MWEs. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact This dataset gave rise to a SEMEVAL-2022 Shared Task devoted to idiomaticity detection. 
URL https://underline.io/lecture/38531-astitchinlanguagemodels-dataset-and-methods-for-the-exploration-o...
 
Title Noun Compound Type and Token Idiomaticity Dataset 
Description The publicly available repository for the dataset includes information about how to generate the Noun Compound Type and Token Idiomaticity (NCTTI) dataset, which contains noun compounds (NCs) in English and Portuguese with the following information (provided by annotators): (i) type-level compositionality scores (from Cordeiro et al., 2019 and Reddy et al., 2011); (ii) token-level compositionality scores in three sentences; (iii) suggestions of paraphrases classified at type-level (for the three sentences) or token-level (for each sentence). The NCTTI dataset has data for 280 and 180 noun compounds in English and Portuguese, respectively, with different degrees of idiomaticity. For each compound, it contains 3 naturalistic sentences obtained from corpora. Due to copyright restrictions we do not release all the original sentences. Instead, we include a script to obtain them from the ukWaC (Baroni et al., 2009) and brWaC (Wagner Filho et al., 2018) corpora. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact The NCTTI dataset is used to explore how vector space models reflect the variability of idiomaticity across sentences. Several experiments using state-of-the-art contextualised models suggest that their representations do not capture the idiomaticity of noun compounds as human annotators do. This new multilingual resource also contains suggestions for paraphrases of the noun compounds at both type and token levels, with uses for lexical substitution or disambiguation in context. This dataset has already been used as the basis for additional papers, a Shared Task, and model evaluations, including a review paper (https://www.annualreviews.org/doi/full/10.1146/annurev-linguistics-031120-122924#_i46). 
URL https://github.com/marcospln/nctti
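
As a rough illustration of the record layout described above (one type-level score, three token-level scores, and paraphrase suggestions at both levels), the following Python sketch models a single entry. The class name, field names and all values are hypothetical and are not taken from the NCTTI repository; the score scale is not stated in this description, so the numbers are placeholders only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NounCompoundEntry:
    """Hypothetical in-memory view of one NCTTI-style record."""
    compound: str
    language: str                      # e.g. "en" or "pt"
    type_score: float                  # type-level compositionality score
    token_scores: List[float]          # one score per sentence
    type_paraphrases: List[str] = field(default_factory=list)
    token_paraphrases: List[List[str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The description states that each compound has three sentences.
        if len(self.token_scores) != 3:
            raise ValueError("expected exactly three token-level scores")

    def token_spread(self) -> float:
        """Range of token-level scores: a crude measure of how much
        idiomaticity varies across the three sentences."""
        return max(self.token_scores) - min(self.token_scores)


# Placeholder values for illustration only (not real annotations).
entry = NounCompoundEntry(
    compound="dry land",
    language="en",
    type_score=3.0,
    token_scores=[4.0, 1.5, 3.0],
    type_paraphrases=["solid ground"],
    token_paraphrases=[["dry ground"], ["solid ground"], ["dry ground"]],
)
print(entry.token_spread())
```

A large spread for a compound would indicate that its idiomaticity depends strongly on the sentence, which is the kind of variability the dataset is intended to expose.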
 
Title SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding 
Description Given the shortcomings of existing state-of-the-art models in handling idiomaticity, this task (part of SemEval 2022) is aimed at detecting and representing multiword expressions (MWEs) which are potentially idiomatic phrases across English, Portuguese and Galician. We call these potentially idiomatic phrases because some MWEs, such as "wedding date", are not idiomatic (i.e. they are literal). This task consists of two subtasks, each available in two "settings". Participants have the freedom to choose a subset of subtasks or settings that they'd like to participate in (see the task website for details of each subtask). Participants cannot pick a subset of languages. The two subtasks are: * Subtask A - A binary classification task aimed at determining whether a sentence contains an idiomatic expression. * Subtask B - A novel task which requires models to output the correct Semantic Text Similarity (STS) scores between sentence pairs whether or not either sentence contains an idiomatic expression. Participants must submit STS scores which range between 0 (least similar) and 1 (most similar). This will require models to correctly encode the meaning of idiomatic phrases such that the encoding of a sentence containing an idiomatic phrase (e.g. Who will he start a program with and will it lead to his own swan song?) and the same sentence with the idiomatic phrase replaced by a (literal) paraphrase (e.g. Who will he start a program with and will it lead to his own final performance?) are semantically similar to each other and equally similar to any other sentence. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact The shared task attracted considerable interest from the community, with around 20 participating systems in the competition. The shared task and results will be published as part of SEMEVAL, and the dataset will be kept as freely available to the community. 
URL https://sites.google.com/view/semeval2022task2-idiomaticity
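
The consistency property behind Subtask B can be sketched in a few lines of Python. This is a minimal illustration, not the task's official scorer: the bag-of-words `embed` function is a deliberately naive stand-in for a real sentence encoder, the `consistency_gap` helper is a hypothetical name, and the third sentence (`other`) is invented for the example. The two swan-song sentences are the ones quoted in the task description.

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real encoder)."""
    cleaned = sentence.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity; with non-negative counts it lies between 0 and 1."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
        math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def sts(a: str, b: str) -> float:
    """Similarity score for a sentence pair, from 0 (least) to 1 (most similar)."""
    return cosine(embed(a), embed(b))


def consistency_gap(idiomatic: str, paraphrase: str, other: str) -> float:
    """How differently the idiomatic sentence and its literal paraphrase
    relate to a third sentence. An ideal encoder would give a gap of 0."""
    return abs(sts(idiomatic, other) - sts(paraphrase, other))


idiomatic = "Who will he start a program with and will it lead to his own swan song?"
paraphrase = "Who will he start a program with and will it lead to his own final performance?"
other = "The singer gave her final performance last night."  # invented example

print("sim(idiomatic, paraphrase):", sts(idiomatic, paraphrase))
print("consistency gap:", consistency_gap(idiomatic, paraphrase, other))
```

Because the toy encoder only counts surface words, it treats "swan song" and "final performance" as unrelated, so the gap is non-zero. This is the kind of failure the subtask is designed to expose in models that rely on compositional meaning.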
 
Description UniDive 
Organisation Aix-Marseille University
Country France 
Sector Academic/University 
PI Contribution This project has been the basis for the inclusion in the UniDive project, and collaborations.
Collaborator Contribution Establishing collaborations with EU and international partners working on related topics
Impact Still active
Start Year 2022
 
Description UniDive 
Organisation University of Paris-Saclay
Country France 
Sector Academic/University 
PI Contribution This project has been the basis for the inclusion in the UniDive project, and collaborations.
Collaborator Contribution Establishing collaborations with EU and international partners working on related topics
Impact Still active
Start Year 2022
 
Description COLING Tutorial 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Psychological, Cognitive and Linguistic BERTology: An Idiomatic Multiword Expression Perspective
In this tutorial we will introduce participants to the exciting research from topics associated with four themes: a) we will provide an overview of BERTology (Rogers et al., 2020) from a linguistic and cognitive standpoint, b) introduce participants to the highlights from research on idiomaticity available in cognitive and corpus linguistics, along with studies that show the need for cultural, world and common sense knowledge in handling idiomaticity and related problems, c) cover traditional methods of handling idiomaticity by providing explicit information regarding idiomatic phrases to models, and d) present the state of the art in idiomaticity detection and representation.

The first of these four themes - the capabilities of PLMs - will include an overview of BERTology and what PLMs' capabilities (or the lack thereof) mean from a linguistic and cognitive standpoint, with an emphasis on phenomena such as understanding idiomatic expressions. For example, it has been shown that PLMs are good at syntax (Hewitt and Manning, 2019) and semantic roles (Tenney et al., 2019b), while being less effective at pragmatic inference, role-based event knowledge (Ettinger, 2020) and abstract attributes of objects (Da and Kasai, 2019). Importantly, PLMs are particularly bad at representing numbers (Wallace et al., 2019) and at reasoning based on the world knowledge they have access to (Forbes et al., 2019). We argue that this shows that PLMs are good at "low-level" linguistic tasks but struggle with "high-level" tasks associated with reasoning and understanding. These high-level cognitive tasks, such as the ability to make use of world and common sense knowledge, are of particular relevance to idiomaticity. Consider the sentences ". . . cultivated land in this study accounts areas used for paddy fields and dry land" and ". . . It's a great feeling to be back on dry land". 'Dry land' literally refers to 'dry ground' in the first but refers to the more abstract 'terra firma' in the second.

In addressing the second theme, the tutorial will cover elements of MWE research that originate in linguistics. One example is the highly influential work by Nunberg et al. (1994), who discuss how, in some cases, parts of an idiom might be modified (e.g. "Your remark touched a nerve that I didn't even know existed"), quantified (e.g. "touch a couple of nerves"), or emphasized (e.g. "Those strings, he wouldn't pull for you"). Such an exploration would serve to highlight the kind of nuanced understanding of language and the world that is required to completely understand these utterances. We will also explore studies on how humans process idiomaticity (Geeraert et al., 2020; Chanturia et al., 2011).

The third theme will deal with methods of identifying and representing idiomaticity using the traditional approach in NLP wherein a phenomenon is explicitly modeled more or less independently of other levels of analysis. Such work, as is the case with much of MWE research, hypothesizes that such explicit information will be useful to models on downstream tasks.

Finally, the tutorial will address the current state of the art in identifying and representing idiomaticity. This is made possible by the fact that some of the proposed presenters of this tutorial are also involved in the organisation of the related task "Multilingual Idiomaticity Detection and Sentence Embedding" at SemEval 2022.
Year(s) Of Engagement Activity 2022
URL https://sites.google.com/view/coling2022tutorial
 
Description Dagstuhl Seminar on 'Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics' 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Dagstuhl Seminar on Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics aimed at the following objectives:
Theoretical: To deepen the understanding of language universals, and of how they apply to linguistic idiosyncrasy, so as to further promote unified modelling while preserving diversity.
Practical: To improve the treatment of idiosyncrasy in treebanking frameworks, in computationally tractable ways and, thus, to foster high quality NLP tools for more languages with greater typological diversity.
Networking: To promote a higher degree of convergence across typology-driven initiatives, while focusing on three main aspects of language modelling: morphology, syntax, and semantics.


https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=21351
Year(s) Of Engagement Activity 2021,2022
URL https://gitlab.com/unlid/dagstuhl-seminar/-/wikis/home
 
Description Invited Speaker of First Workshop on Current Trends in Text Simplification (2021) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This invited talk aimed at raising awareness of the challenges of idiomaticity for natural language understanding in general and text simplification systems in particular. The findings of the project were disseminated to the research community and all resources generated so far are publicly available.
Year(s) Of Engagement Activity 2021
URL https://www.taln.upf.edu/pages/cttsr2021-ws/
 
Description Invited Talk at the Department of Computer Science, Federal University of Rio Grande do Norte (Brazil) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This talk aimed to raise awareness of the challenges that current NLP models face in capturing idiomaticity, disseminating results and resources produced by the project. The audience was composed of computational neuroscientists, and undergraduate and graduate students of the Computer Science Department. The talk raised questions about possible trends in the area and some scientific challenges related to different languages, and the organisers reported additional interest in follow-up from the audience.
Year(s) Of Engagement Activity 2021
 
Description Invited Talk to Turing Institute Group 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact This talk aimed to raise awareness of the challenges that current NLP models face in capturing idiomaticity, disseminating results and resources produced by the project. The audience was composed of researchers linked to the Turing Institute working on NLP. The talk raised questions about possible trends in the area and some scientific challenges related to idiomaticity.
Year(s) Of Engagement Activity 2021
 
Description Invited Talk to the CL group at Cornell University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This talk aimed to raise awareness of the challenges that current NLP models face in capturing idiomaticity, disseminating results and resources produced by the project. The audience was composed of academics and students working on cognitive computational models. The talk raised questions about possible common interests and the exchange of data for analyses by members of the audience.
Year(s) Of Engagement Activity 2021
 
Description Invited talk to Cardiff NLP group 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact This talk aimed to raise awareness of the challenges that current NLP models face in capturing idiomaticity, disseminating results and resources produced by the project. The audience was composed of NLP students and academics. The talk focused on the EACL-2021 paper and raised questions about possible trends in the area, some scientific challenges related to different languages, and possible common interests among the groups.
Year(s) Of Engagement Activity 2021
 
Description Invited talk to Núcleo de Linguística Computacional da FALE, Federal University of Minas Gerais, Brazil 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This talk aimed to raise awareness of the challenges that current NLP models face in capturing idiomaticity, disseminating results and resources produced by the project. The audience was composed of linguists, computer scientists and undergraduate and graduate students working on NLP. The talk raised questions about possible trends in the area and some scientific challenges related to different languages, and the organisers reported additional interest in follow-up from the audience, including direct emails to the PI of the project.
Year(s) Of Engagement Activity 2021
 
Description LREC Tutorial 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact LREC 2022 Tutorial - Psychological, Cognitive and Linguistic BERTology: An Idiomatic Multiword Expression Perspective
The success of BERT and similar pre-trained language models (PLMs) has led to what might be described as an existential crisis for certain aspects of Natural Language Processing: PLMs can now do better than other models on numerous tasks in multiple evaluation scenarios and are argued to outperform humans on some benchmarks (Wang et al., 2018; Sun et al., 2020; Hassan et al., 2018). In addition, PLMs also seem to have access to a variety of linguistic information as diverse as parse trees (Hewitt and Manning, 2019), entity types, relations, semantic roles (Tenney et al., 2019a), and constructional information (Tayyar Madabushi et al., 2020).

Does this mean that there is no longer a need to tap into the decades of progress that was made in traditional NLP and related fields including corpus and cognitive linguistics? In short, can deep(er) models replace linguistically motivated (layered) models and systematic engineering as we work towards high-level symbolic artificial intelligence systems?

This tutorial will explore these questions through the lens of a linguistically and cognitively important phenomenon that PLMs do not (yet) handle very well: Idiomatic Multiword Expressions (MWEs) (Yu and Ettinger, 2020; Garcia et al., 2021; Tayyar Madabushi et al., 2021).
Year(s) Of Engagement Activity 2022
URL https://sites.google.com/view/psych-bertology-lrec-2022
 
Description SemEval Shared Task 2022 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact SemEval 2022 Task 2 Multilingual Idiomaticity Detection and Sentence Embedding
Given the shortcomings of existing state-of-the-art models in handling idiomaticity, this task (part of SemEval 2022) is aimed at detecting and representing multiword expressions (MWEs) which are potentially idiomatic phrases across English, Portuguese and Galician. We call these potentially idiomatic phrases because some MWEs, such as "wedding date", are not idiomatic (i.e. they are literal). This task consists of two subtasks, each available in two "settings".

Participants have the freedom to choose a subset of subtasks or settings that they'd like to participate in (see sections detailing each of the subtasks for details). You cannot pick a subset of languages.
Year(s) Of Engagement Activity 2022
URL https://sites.google.com/view/semeval2022task2-idiomaticity
 
Description Shared Task dedicated to assessing idiomaticity in representations 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact This shared task aimed at raising awareness of the challenges involved in capturing idiomaticity using current models. The task was open to general participation. The call for participation was widely announced and invited participants to design systems that can detect idiomatic language. New datasets were collected and made available in 3 languages: English, Portuguese and Galician, the last two being lower-resourced languages. The task and the datasets are new contributions to the community and can provide the basis for benchmarking systems.
Year(s) Of Engagement Activity 2021,2022
URL https://sites.google.com/view/semeval2022task2-idiomaticity