Modeling Idiomaticity in Human and Artificial Language Processing

Lead Research Organisation: University of Sheffield

Department Name: Computer Science

Abstract

This project will develop linguistically motivated and cognitively inspired computational models, informed by human processing data, that can recognize and accurately process idiomatic (non-literal) language. Equipping models with the ability to process idiomatic expressions is particularly important for obtaining more accurate representations, as these can lead to gains in downstream tasks such as machine translation and text simplification. The originality of this work lies in integrating linguistic and cognitive clues about human idiomatic language processing into the construction of models for word and phrase representations, and in integrating these models into downstream tasks.

The main objectives and research challenges of this project are:
1. To explore cognitive and linguistic clues linked to idiomaticity that can be used to guide models for word and phrase representations.
2. To investigate idiomatically aware models.
3. To explore alternative forms of integrating these models in applications and to develop a framework for idiomaticity evaluation in word and phrase representation models.
4. To release software implementations of the proposed models to facilitate reproducibility and wider adoption by the research community.

This proposal targets a crucial limitation in standard NLP models, as idiomaticity is everywhere in human communication, with potential benefits to various applications that include natural language interfaces, such as conversational agents, computer-assisted language learning platforms, question answering and information retrieval systems. As a consequence, we anticipate the proposal will have a wide academic impact in the community. Moreover, enabling more precise language understanding and generation also has the potential to enhance accessibility and digital inclusion by promoting more natural and accurate communication between humans and machines. We intend to demonstrate the additional potential benefits of these models through interactions with our external collaborators, including by means of an advisory board. The board will include other academics and industrial partners working on related topics, such as Dr. Fabio Kepler (Unbabel Portugal, for machine translation), Prof. Mathieu Constant (Université de Lorraine, France, for parsing and idiomaticity), Prof. Lucia Specia (Imperial College, UK, for text simplification) and Dr. Afsaneh Fazly (Samsung, Canada, for idiomaticity).

Planned Impact

Correctly representing and treating idiomatic expressions can have a positive impact on the accuracy of NLP applications such as machine translation (MT), text simplification (TS), text summarization, dialog systems, among others. Consequently, we anticipate an impact on the following areas:

- Economy
NLP service providers are highly interested in providing accurate outputs in order to avoid misleading their users, which can result in liability for these providers. In this project we will focus on two downstream applications, text simplification and machine translation, which are available as stand-alone commercial platforms. However, there is a large variety of products that also involve understanding and generating natural language, such as conversational agents and digital assistants, which can be found around the world in software and hardware including commercial websites, computers, mobile phones and other portable devices, car navigation systems, etc. As users' trust in NLP applications increases, the economic value of these applications also increases; our approaches can therefore positively impact the NLP industry.

- Society
As text simplification and machine translation reach millions of users worldwide, advances in idiomatic language processing have the potential to bring a positive impact by improving their experience. We address both tasks; however, the results of this project have a high potential to improve the reliability of a larger variety of NLP downstream tasks, providing users with more accurate and human-friendly interfaces. This can also enable disadvantaged or vulnerable users, such as non-native speakers of English or low-literate individuals, to fully access and comprehend information that contains idiomatic expressions.

- Knowledge
This proposal targets a fundamental limitation in traditional NLP approaches that needs to be adequately addressed for precise language technology. As a consequence, we anticipate publishing the resources and methods developed, together with the results obtained, in high-profile NLP conferences. Moreover, to enhance reproducibility of results and to promote reusability of resources, we will make them available on the project's dedicated website and GitHub repository. By focusing on analysis, modelling and application, this project has the potential to encourage discussions about responsible innovation and ethics in NLP.

- People
This proposal has the potential to help establish the careers of both the Sheffield PI and Co-I. It will enable the PI to use her previous experience coordinating projects in Brazil to establish a research profile in the UK and solidify her position in the field, while using her research expertise to address an important shortcoming in current language technology. This EPSRC grant will be instrumental in allowing both the Sheffield PI and Co-I to build a collaboration network, advance their research and demonstrate successful project delivery in the UK for securing further funding. It will also enable the RA to work in a well-established, fast-growing NLP group, one of the largest in Europe, with connections to the main NLP and ML centres around the world. The RA will also benefit from knowledge exchange with the PI and Co-Is, working on a relevant NLP topic. BSc and MSc students may also benefit from this project, as we will propose small projects related to this topic.

Funded Value:

£446,163

Funded Period:

Dec 20 - Nov 24

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/T02450X/1

Principal Investigator:

Aline Villavicencio

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Artificial Intelligence (100%)

Organisations

People	ORCID iD
Aline Villavicencio (Principal Investigator)	http://orcid.org/0000-0002-3731-9168
Anna Korhonen (Co-Investigator)
Carolina Scarton (Co-Investigator)

Publications

Atsuki Yamaguchi (2024) An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference

Bigoulaeva I (2022) Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5

Bigoulaeva I. (2022) Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5 in FLP 2022 - 3rd Workshop on Figurative Language Processing, Proceedings of the Workshop

Boito M (2021) Investigating alignment interpretability for low-resource NMT in Machine Translation

Boito, M.Z. (2020) Investigating Language Impact in Bilingual Approaches for Computational Language Documentation in arXiv

Boito, M.Z. (2021) Unsupervised word segmentation from discrete speech units in low-resource settings in arXiv

Dylan Phelps (2024) Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection

Gamallo, P. (2020) Preface in CEUR Workshop Proceedings

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Collaboration
Engagement Activities


Description	Large pre-trained language models (PLMs, also known as foundation models) such as BERT, GPT-2 and GPT-3 have been widely claimed to understand language successfully and have been incorporated in many applications, yet this research has shown that they still do not understand figurative, idiomatic language. Through careful analyses and the design of dedicated probing methods, this research produced concrete findings and evidence that stress the need to revise these models by incorporating dedicated techniques for accurate idiomaticity representation. Failing to do so risks incorrect understanding, with consequent impact on applications (such as machine translation) as well as on any decision-making. Moreover, the research generated datasets and probes that can be used as benchmarks for assessing the language understanding of PLMs.
Exploitation Route	The project has generated models, datasets and probes that can be used as benchmarks for assessing the language understanding of PLMs before deployment.
Sectors	Communities and Social Services/Policy; Digital/Communication/Information Technologies (including Software); Education; Healthcare; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology


Description	This project has contributed to bringing the topic of multiword expressions, idiomaticity and figurative language to the forefront of the research community, through the creation and sharing of resources and models, as well as through the preparation of a shared task. The shared task was selected to run at SemEval, via a selective submission process, and attracted over 100 participants, in 25 groups, from both academia and industry. The shared task invited the community to contribute systems and models, both to map the landscape of existing state-of-the-art solutions and to promote the design of innovative solutions for improving the performance of PLMs in the accurate detection and representation of idiomatic language. In addition, the shared task created dedicated resources for Galician, a low-resource language, as well as for Portuguese and English. Additional impact is expected in the final period of the project.
First Year Of Impact	2021
Sector	Digital/Communication/Information Technologies (including Software),Education,Healthcare


Description	AI-Based Support for Mental Health Communication (AIM-Health)
Amount	£115,774 (GBP)
Organisation	São Paulo Research Foundation (FAPESP)
Sector	Public
Country	Brazil
Start	02/2025
End	02/2028


Description	Turing Fellowships 2024
Amount	£0 (GBP)
Organisation	Alan Turing Institute
Sector	Academic/University
Country	United Kingdom
Start	03/2024
End	03/2025


Title	AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models
Description	Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings, spanning both English and Portuguese. We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms. Our experiments demonstrate that, on the task of detecting idiomatic usage, these models perform reasonably well in the one-shot and few-shot scenarios, but that there is significant scope for improvement in the zero-shot scenario. On the task of representing idiomaticity, we find that pre-training is not always effective, while fine-tuning could provide a sample efficient method of learning representations of sentences containing MWEs.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
Impact	This dataset gave rise to a SEMEVAL-2022 Shared Task devoted to idiomaticity detection.
URL	https://underline.io/lecture/38531-astitchinlanguagemodels-dataset-and-methods-for-the-exploration-o...


Title	AdMIRe: Advancing Multimodal Idiomaticity Representation (SemEval-2025 Task 1) - Labelled Datasets
Description	The AdMIRe shared task was organised and run as SemEval-2025 Task 1: https://semeval.github.io/SemEval2025/ The datasets contain potentially idiomatic expressions (PIEs) in English (EN) and Brazilian Portuguese (PT), context sentences in which the expressions are used in either a literal or idiomatic sense, and associated images depicting the expressions with either a single image or a sequence of three images capturing change over time (like a comic strip). See the task website (https://semeval2025-task1.github.io/), the attached task description document (SemEval_2025_Task_1__ADMIRE___Advancing_Multimodal_Idiomaticity_Representation.pdf) or the following task paper for more information: Thomas Pickard, Aline Villavicencio, Maggie Mi, Wei He, Dylan Phelps, Carolina Scarton and Marco Idiart. 2025. SemEval-2025 Task 1: AdMIRe - Advancing Multimodal Idiomaticity Representation. Proceedings of the 19th International Workshop on Semantic Evaluations (SemEval-2025). Association for Computational Linguistics, Vienna, Austria.
Type Of Material	Database/Collection of data
Year Produced	2025
Provided To Others?	Yes
Impact	This dataset is the first to enable the evaluation of the impact of idiomaticity understanding in multiple modalities (text and image). We also propose a novel task in the community of next image prediction. The dataset covers two diverse languages, English and Portuguese, and the SemEval Shared Task attracted substantial attention, with over 100 participants, and 20 system submissions. A follow up edition of the shared task is in planning stages.
URL	https://orda.shef.ac.uk/articles/dataset/AdMIRe_Advancing_Multimodal_Idiomaticity_Representation_Sem...


Title	Noun Compound Type and Token Idiomaticity Dataset
Description	The publicly available repository for the dataset includes information about how to generate the Noun Compound Type and Token Idiomaticity (NCTTI) dataset, which contains noun compounds (NCs) in English and Portuguese with the following information (provided by annotators): Type-level compositionality scores (from Cordeiro et al., 2019 and Reddy et al., 2011). Token-level compositionality scores in three sentences. Suggestions of paraphrases classified at type-level (for the three sentences) or token-level (for each sentence). The NCTTI dataset has data for 280 and 180 noun compounds in English and Portuguese, respectively, with different degrees of idiomaticity. For each compound, it contains 3 naturalistic sentences obtained from corpora. Due to copyright restrictions we do not release all the original sentences. Instead, we include a script to obtain them from the ukWaC (Baroni et al., 2009) and brWaC (Wagner Filho et al., 2018) corpora.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
Impact	The NCTTI dataset is used to explore how vector space models reflect the variability of idiomaticity across sentences. Several experiments using state-of-the-art contextualised models suggest that their representations do not capture the idiomaticity of noun compounds in the way human annotators do. This new multilingual resource also contains suggestions for paraphrases of the noun compounds at both type and token levels, with uses for lexical substitution or disambiguation in context. This dataset has already been used as the basis for additional papers, a Shared Task, and model evaluations, including a review paper (https://www.annualreviews.org/doi/full/10.1146/annurev-linguistics-031120-122924#_i46).
URL	https://github.com/marcospln/nctti


Title	SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding
Description	Given the shortcomings of existing state-of-the-art models in handling idiomaticity, this task (part of SemEval 2022) is aimed at detecting and representing multiword expressions (MWEs) which are potentially idiomatic phrases across English, Portuguese and Galician. We call these potentially idiomatic phrases because some MWEs, such as "wedding date", are not idiomatic (i.e. they are literal). This task consists of two subtasks, each available in two "settings". Participants have the freedom to choose a subset of subtasks or settings that they'd like to participate in (see sections detailing each of the subtasks for details). You cannot pick a subset of languages. The two subtasks are: * Subtask A - A binary classification task aimed at determining whether a sentence contains an idiomatic expression. * Subtask B - A novel task which requires models to output the correct Semantic Text Similarity (STS) scores between sentence pairs whether or not either sentence contains an idiomatic expression. Participants must submit STS scores which range between 0 (least similar) and 1 (most similar). This will require models to correctly encode the meaning of idiomatic phrases such that the encoding of a sentence containing an idiomatic phrase (e.g. Who will he start a program with and will it lead to his own swan song?) and the same sentence with the idiomatic phrase replaced by a (literal) paraphrase (e.g. Who will he start a program with and will it lead to his own final performance?) are semantically similar to each other and equally similar to any other sentence. (See details of subtask below).
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
Impact	The shared task attracted considerable interest from the community, with around 20 participating systems in the competition. The shared task and results will be published as part of SEMEVAL, and the dataset will be kept as freely available to the community.
URL	https://sites.google.com/view/semeval2022task2-idiomaticity
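
The Subtask B requirement described above can be illustrated with a minimal sketch. It is not the task's official scorer: the bag-of-words `toy_embed` function, the `subtask_b_check` helper and the third "other" sentence are hypothetical stand-ins introduced here for illustration, and a real system would use a sentence encoder instead. The sketch checks two things for one example: how similar the idiomatic sentence is to its literal paraphrase, and how far apart their similarities to a third sentence are (ideally zero).

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for a sentence encoder: bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0 to 1 here)."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def subtask_b_check(embed, idiomatic, paraphrase, other):
    """Return the idiom/paraphrase similarity and the consistency gap.

    'pair' should be high, and 'gap' (the difference between the two
    sentences' similarities to a third sentence) should be close to zero.
    """
    sim_pair = cosine(embed(idiomatic), embed(paraphrase))
    sim_idiom_other = cosine(embed(idiomatic), embed(other))
    sim_para_other = cosine(embed(paraphrase), embed(other))
    return {"pair": sim_pair, "gap": abs(sim_idiom_other - sim_para_other)}


# Sentence pair taken from the task description above.
IDIOMATIC = "Who will he start a program with and will it lead to his own swan song?"
PARAPHRASE = "Who will he start a program with and will it lead to his own final performance?"
# Assumed third sentence, invented for this example.
OTHER = "It was his final performance on stage."

result = subtask_b_check(toy_embed, IDIOMATIC, PARAPHRASE, OTHER)
print(result)
```

With this surface-level embedder the gap is non-zero, because "swan song" shares no words with "final performance"; an encoder that represented the idiom's meaning would be expected to shrink that gap.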


Description	AI-Based Support for Mental Health Communication (AIM-Health)
Organisation	Federal University of Sao Carlos
Country	Brazil
Sector	Academic/University
PI Contribution	The project is based on analyses of idiomaticity and multiword expressions derived from those in this project
Collaborator Contribution	The partners will apply idiomaticity and metaphoricity detection to a mental health chatbot
Impact	Funding for project just started
Start Year	2025


Description	UniDive
Organisation	Aix-Marseille University
Country	France
Sector	Academic/University
PI Contribution	This project has been the basis for the inclusion in the UniDive project, and collaborations.
Collaborator Contribution	Establishing collaborations with EU and international partners working on related topics
Impact	Still active
Start Year	2022


Description	UniDive
Organisation	University of Paris-Saclay
Country	France
Sector	Academic/University
PI Contribution	This project has been the basis for the inclusion in the UniDive project, and collaborations.
Collaborator Contribution	Establishing collaborations with EU and international partners working on related topics
Impact	Still active
Start Year	2022


Description	A talk or presentation - Invited Talk for the Interdisciplinary Nucleus of Computational Linguistics, University of São Paulo/São Carlos (Brazil)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Postgraduate students
Results and Impact	Invited talk in the Seminar Series of the Interdisciplinary Nucleus of Computational Linguistics, University of São Paulo/São Carlos (Brazil), gathering researchers and practitioners working on Natural Language Processing.
Year(s) Of Engagement Activity	2024


Description	COLING Tutorial
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Psychological, Cognitive and Linguistic BERTology: An Idiomatic Multiword Expression Perspective. In this tutorial we will introduce participants to the exciting research from topics associated with four themes: a) we will provide an overview of BERTology (Rogers et al., 2020) from a linguistic and cognitive standpoint, b) introduce participants to the highlights from research on idiomaticity available in cognitive and corpus linguistics, along with studies that show the need for cultural, world and common sense knowledge in handling idiomaticity and related problems, c) traditional methods of handling idiomaticity by providing explicit information regarding idiomatic phrases to models, and d) the state of the art in idiomaticity detection and representation. The first of these four themes - the capabilities of PLMs - will include an overview of BERTology and what PLMs' capabilities (or the lack thereof) mean from a linguistic and cognitive standpoint, with an emphasis on phenomena such as understanding idiomatic expressions. For example, it has been shown that PLMs are good at syntax (Hewitt and Manning, 2019) and semantic roles (Tenney et al., 2019b), while being less effective at pragmatic inference, role-based event knowledge (Ettinger, 2020) and abstract attributes of objects (Da and Kasai, 2019). Importantly, PLMs are particularly bad at representing numbers (Wallace et al., 2019) and in reasoning based on the world knowledge they have access to (Forbes et al., 2019). We argue that this shows that PLMs are good at "low-level" linguistic tasks but struggle with "high-level" tasks associated with reasoning and understanding. These high-level cognitive tasks, such as the ability to make use of world and common sense knowledge, are of particular relevance to idiomaticity. Consider the sentences ". . . cultivated land in this study accounts areas used for paddy fields and dry land" and ". . . It's a great feeling to be back on dry land". 'Dry land' literally refers to 'dry ground' in the first but refers to the more abstract 'terra firma' in the second. In addressing the second theme, the tutorial will cover elements of MWE research that originate in linguistics. One example is the highly influential work by Nunberg et al. (1994), who discuss how, in some cases, parts of an idiom might be modified (e.g. "Your remark touched a nerve that I didn't even know existed"), quantified (e.g. "touch a couple of nerves"), or emphasized (e.g. "Those strings, he wouldn't pull for you"). Such an exploration would serve to highlight the kind of nuanced understanding of language and the world that is required to completely understand these utterances. Additionally, we will also explore studies on how humans process idiomaticity (Geeraert et al., 2020; Chanturia et al., 2011). The third theme will deal with methods of identifying and representing idiomaticity using the traditional approach in NLP wherein a phenomenon is explicitly modeled more or less independently of other levels of analysis. Such work, as is the case with much of MWE research, hypothesizes that such explicit information will be useful to models on downstream tasks. Finally, the tutorial will address the absolute state of the art in identifying and representing idiomaticity. This is made possible by the fact that some of the proposed presenters of this tutorial are also involved in the organisation of the related task "Multilingual Idiomaticity Detection and Sentence Embedding" at SemEval 2022.
Year(s) Of Engagement Activity	2022
URL	https://sites.google.com/view/coling2022tutorial


Description	Dagstuhl Seminar on 'Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics'
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The Dagstuhl Seminar on Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics aimed at the following objectives: Theoretical: To deepen the understanding of language universals, and of how they apply to linguistic idiosyncrasy, so as to further promote unified modelling while preserving diversity. Practical: To improve the treatment of idiosyncrasy in treebanking frameworks, in computationally tractable ways and, thus, to foster high quality NLP tools for more languages with greater typological diversity. Networking: To promote a higher degree of convergence across typology-driven initiatives, while focusing on three main aspects of language modelling: morphology, syntax, and semantics. https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=21351
Year(s) Of Engagement Activity	2021,2022
URL	https://gitlab.com/unlid/dagstuhl-seminar/-/wikis/home


Description	Invited Speaker of First Workshop on Current Trends in Text Simplification (2021)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	This invited talk aimed at raising awareness of the challenges of idiomaticity for natural language understanding in general and text simplification systems in particular. The findings of the project were disseminated with the research community and all resources generated so far are publicly available.
Year(s) Of Engagement Activity	2021
URL	https://www.taln.upf.edu/pages/cttsr2021-ws/


Description	Invited Talk at SimBIG (Peru)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Undergraduate students
Results and Impact	Invited talk at SimBIG in Peru.
Year(s) Of Engagement Activity	2024
URL	https://simbig.org/SIMBig2024/#speakers


Description	Invited Talk at the Department of Computer Science, Federal University of Rio Grande do Norte (Brazil)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	This talk aimed to raise awareness of the challenges of current NLP models to capture idiomaticity, disseminating results and resources produced by the project. The audience was composed of computational neuroscientists, and undergraduate and graduate students of the Computer Science Department. The talk raised questions about possible trends in the areas, and some scientific challenges related to different languages, and the organisers reported additional interest for follow-up from the audience.
Year(s) Of Engagement Activity	2021


Description	Invited Talk at the Department of Computer Science, Federal University of Rio Grande do Sul (Brazil)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Postgraduate students
Results and Impact	Invited Talk to the Seminar Series of the Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
Year(s) Of Engagement Activity	2024
URL	https://www.inf.ufrgs.br/site/noticia/seminf-aline-villavicencio/


Description	Invited Talk at the UFRN IMD CCHLA - Festival of the Digital Humanities (Brazil)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Postgraduate students
Results and Impact	Invited talk at the UFRN IMD CCHLA - Festival of the Digital Humanities, at the Federal University of Rio Grande do Norte (Brazil)
Year(s) Of Engagement Activity	2024
URL	https://humanidadesdigitais.imd.ufrn.br/


Description	Invited Talk in the SemChangeMWE Workshop
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Invited talk at the SemChangeMWE Workshop. Follow-up collaborations including a proposal for an edited book have been in development.
Year(s) Of Engagement Activity	2024
URL	https://www.ims.uni-stuttgart.de/en/research/colloquium/


Description	Invited Talk in the Seminar Series of Challenges in the NLP of Indigenous Languages
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Postgraduate students
Results and Impact	Invited talk at the Seminar Series on Indigenous Language Processing, discussing how the multilingual and low resource techniques needed to process idiomatic language could be adapted and extended for indigenous languages.
Year(s) Of Engagement Activity	2024
URL	http://www.iea.usp.br/midiateca/video/videos-2024/desafios-no-processamento-de-linguas-indigenas


Description	Invited Talk to Turing Institute Group
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Postgraduate students
Results and Impact	This talk aimed to raise awareness of the challenges of current NLP models to capture idiomaticity, disseminating results and resources produced by the project. The audience was composed of researchers linked to the Turing Institute working on NLP. The talk raised questions about possible trends in the areas, and some scientific challenges related to idiomaticity.
Year(s) Of Engagement Activity	2021


Description	Invited Talk to the CL group at Cornell University
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	This talk aimed to raise awareness of the challenges of current NLP models to capture idiomaticity, disseminating results and resources produced by the project. The audience was composed of academics and students working on cognitive computational models. The talk raised questions about possible common interests, and the exchange of data for analyses by members of the audience.
Year(s) Of Engagement Activity	2021


Description	Invited Talk: Compass Annual Conference 2024 (UK)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Postgraduate students
Results and Impact	Invited Talk at the Compass Annual Conference 2024 at the University of Bristol (UK)
Year(s) Of Engagement Activity	2024
URL	https://compass.blogs.bristol.ac.uk/tag/professor-aline-villavicencio/


Description	Invited talk to Cardiff NLP group
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Postgraduate students
Results and Impact	This talk aimed to raise awareness of the challenges of current NLP models to capture idiomaticity, disseminating results and resources produced by the project. The audience was composed of NLP students and academics. The talk focused on the EACL-2021 paper and raised questions about possible trends in the areas, and some scientific challenges related to different languages, and about possible common interests among the groups.
Year(s) Of Engagement Activity	2021


Description	Invited talk to Núcleo de Linguística Computacional da FALE, Federal University of Minas Gerais, Brazil
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	This talk aimed to raise awareness of the challenges that current NLP models face in capturing idiomaticity and to disseminate results and resources produced by the project. The audience was composed of linguists, computer scientists, and undergraduate and graduate students working on NLP. The talk raised questions about possible trends in the area and some scientific challenges related to different languages. The organisers reported additional interest in follow-up from the audience, including direct emails to the PI of the project.
Year(s) Of Engagement Activity	2021


Description	LREC Tutorial
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	LREC 2022 Tutorial: Psychological, Cognitive and Linguistic BERTology: An Idiomatic Multiword Expression Perspective. The success of BERT and similar pre-trained language models (PLMs) has led to what might be described as an existential crisis for certain aspects of Natural Language Processing: PLMs now do better than other models on numerous tasks in multiple evaluation scenarios and are argued to outperform humans on some benchmarks (Wang et al., 2018; Sun et al., 2020; Hassan et al., 2018). In addition, PLMs also seem to have access to linguistic information as diverse as parse trees (Hewitt and Manning, 2019), entity types, relations, semantic roles (Tenney et al., 2019a), and constructional information (Tayyar Madabushi et al., 2020). Does this mean that there is no longer a need to tap into the decades of progress made in traditional NLP and related fields, including corpus and cognitive linguistics? In short, can deep(er) models replace linguistically motivated (layered) models and systematic engineering as we work towards high-level symbolic artificial intelligence systems? This tutorial will explore these questions through the lens of a linguistically and cognitively important phenomenon that PLMs do not (yet) handle very well: Idiomatic Multiword Expressions (MWEs) (Yu and Ettinger, 2020; Garcia et al., 2021; Tayyar Madabushi et al., 2021).
Year(s) Of Engagement Activity	2022
URL	https://sites.google.com/view/psych-bertology-lrec-2022


Description	Invited talk in the Advances in Data Science and AI Seminar Series, The University of Manchester
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Invited talk in the Advances in Data Science and AI Seminar Series on Tuesday 29 November 2022 at The University of Manchester.
Year(s) Of Engagement Activity	2022
URL	https://www.youtube.com/watch?v=XcE3qGPMU4A


Description	Invited talk in the 3rd Workshop on Figurative Language Processing
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Invited talk in the 3rd Workshop on Figurative Language Processing. Participants were from the Natural Language Processing and Computational Linguistics community.
Year(s) Of Engagement Activity	2022
URL	https://sites.google.com/view/figlang2022/home


Description	SemEval Shared Task 2022
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding. Because existing state-of-the-art models struggle to capture idiomaticity, this task (part of SemEval 2022) is aimed at detecting and representing multiword expressions (MWEs) that are potentially idiomatic phrases across English, Portuguese and Galician. We call these potentially idiomatic phrases because some MWEs, such as "wedding date", are not idiomatic (i.e., they are literal). The task consists of two subtasks, each available in two "settings". Participants are free to choose which subtasks or settings to participate in (see the sections detailing each subtask), but they cannot pick a subset of languages.
Year(s) Of Engagement Activity	2022
URL	https://sites.google.com/view/semeval2022task2-idiomaticity


Description	Shared Task dedicated to assessing idiomaticity in representations
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	This shared task aimed to raise awareness of the challenges involved in capturing idiomaticity with current models. The task was open to general participation, and the widely announced call for participation invited participants to design systems that can detect idiomatic language. New datasets were collected and made available in three languages: English, Portuguese and Galician, the last two being lower-resourced languages. The task and the datasets are new contributions to the community and can provide the basis for benchmarking systems.
Year(s) Of Engagement Activity	2021,2022
URL	https://sites.google.com/view/semeval2022task2-idiomaticity
