Cross-lingual multiword expression identification and idiomaticity prediction using autoregressive and masked language models

Hasan, Md. Arid

Files

Md. Arid Hasan - Thesis.pdf (743.8 KB)

Date

2025-05

Authors

Hasan, Md. Arid

Publisher

University of New Brunswick

Abstract

Token-level multiword expression (MWE) identification and idiomaticity prediction remain major challenges in natural language processing, demanding sophisticated approaches to address non-compositional meanings and idiosyncratic syntactic behaviors. These tasks involve identifying idiomatic expressions at the level of individual tokens, allowing systems to distinguish figurative from literal usages. This thesis explores cross-lingual MWE identification using the PARSEME 1.2 shared task dataset and idiomaticity prediction on the SemEval 2022 Task 2 dataset, where models are evaluated on unseen languages. We employ multilingual masked language models (MLMs), e.g., XLM-R and mT5, that are larger than those used in previous work [137], which relied on supervised fine-tuning, as well as larger autoregressive models, e.g., GPT-4o, which previous work on these tasks has not considered. We apply supervised fine-tuning to both MLMs and autoregressive models, and additionally apply a prompt-based approach to autoregressive models. Our findings indicate that larger MLMs do not outperform the results of Swaminathan and Cook [137] on the SemEval and PARSEME tasks, but that supervised fine-tuning of autoregressive models does.

URI

https://unbscholar.lib.unb.ca/handle/1882/38373

Collections

Open Theses & Dissertations
