Cross-lingual multiword expression identification and idiomaticity prediction using autoregressive and masked language models
| dc.contributor.advisor | Cook, Paul | |
| dc.contributor.author | Hasan, Md. Arid | |
| dc.date.accessioned | 2025-08-13T14:57:10Z | |
| dc.date.available | 2025-08-13T14:57:10Z | |
| dc.date.issued | 2025-05 | |
| dc.description.abstract | Token-level multiword expression (MWE) identification and idiomaticity prediction remain major challenges in natural language processing, demanding sophisticated approaches to address non-compositional meanings and idiosyncratic syntactic behaviors. These tasks involve identifying idiomatic expressions at the level of individual tokens, allowing systems to distinguish figurative from literal usages. This thesis explores cross-lingual MWE identification using the PARSEME 1.2 shared task dataset and idiomaticity prediction on the SemEval 2022 Task 2 dataset, where models are evaluated on unseen languages. We employ multilingual masked language models (MLMs), e.g., XLM-R and mT5, that are larger than those used in previous work [137], which relied on supervised fine-tuning, as well as larger autoregressive models, e.g., GPT-4o, which previous work on these tasks has not considered. We apply supervised fine-tuning to both MLMs and autoregressive models, and additionally apply a prompt-based approach to the autoregressive models. Our findings indicate that larger MLMs do not outperform the results of Swaminathan and Cook [137] on the SemEval and PARSEME tasks, but that supervised fine-tuning of autoregressive models does. | |
| dc.description.copyright | © Md. Arid Hasan, 2025 | |
| dc.format.extent | viii, 94 | |
| dc.format.medium | electronic | |
| dc.identifier.uri | https://unbscholar.lib.unb.ca/handle/1882/38373 | |
| dc.language.iso | en | |
| dc.publisher | University of New Brunswick | |
| dc.rights | http://purl.org/coar/access_right/c_abf2 | |
| dc.subject.discipline | Computer Science | |
| dc.title | Cross-lingual multiword expression identification and idiomaticity prediction using autoregressive and masked language models | |
| dc.type | master thesis | |
| oaire.license.condition | other | |
| thesis.degree.discipline | Computer Science | |
| thesis.degree.grantor | University of New Brunswick | |
| thesis.degree.level | masters | |
| thesis.degree.name | M.C.S. |
