Token-level identification of multiword expressions using pre-trained multilingual language models

Swaminathan, Raghuraman

Files

Raghuraman Swaminathan - Thesis.pdf (1 MB)

Date

2023-09

Authors

Swaminathan, Raghuraman

Publisher

University of New Brunswick

Abstract

Multiword expressions (MWEs) are combinations of words whose meaning cannot be derived from their component words. MWEs are common across languages and are difficult to identify. For NLP tasks such as sentiment analysis and machine translation, it is important that language models automatically identify and classify MWEs. While considerable work has been done on identifying and classifying MWEs, little of it has been done in a cross-lingual setting. In this thesis, we consider novel cross-lingual settings for MWE identification and idiomaticity prediction in which systems are tested on languages that are unseen during training. We use multilingual BERT-based models, specifically mBERT, RoBERTa and mDeBERTa. Our findings indicate that pre-trained multilingual language models are able to learn knowledge about MWEs and idiomaticity that is not language-specific. Moreover, we find that training data from other languages can be leveraged to give improvements over monolingual models.

URI

https://unbscholar.lib.unb.ca/handle/1882/37717

Collections

Open Theses & Dissertations
