Cross-lingual multiword expression identification and idiomaticity prediction using autoregressive and masked language models

Date

2025-05

Publisher

University of New Brunswick

Abstract

Token-level multiword expression (MWE) identification and idiomaticity prediction remain major challenges in natural language processing, and demand approaches that can handle non-compositional meanings and idiosyncratic syntactic behavior. Both tasks involve identifying idiomatic expressions at the level of individual tokens, allowing systems to distinguish figurative from literal usages. This thesis explores cross-lingual MWE identification on the PARSEME 1.2 shared task dataset and idiomaticity prediction on the SemEval 2022 Task 2 dataset, where models are evaluated on languages not seen during training. We use multilingual masked language models (MLMs), e.g., XLM-R and mT5, that are larger than those in previous work [137], which relied on supervised fine-tuning, as well as larger autoregressive models, e.g., GPT-4o, which previous work on these tasks has not considered. We apply supervised fine-tuning to both the MLMs and the autoregressive models, and additionally apply a prompt-based approach to the autoregressive models. Our findings indicate that larger MLMs do not outperform the results of Swaminathan and Cook [137] on the SemEval and PARSEME tasks, but that autoregressive models with supervised fine-tuning do.
