Token-level identification of multiword expressions using pre-trained multilingual language models
dc.contributor.advisor | Cook, Paul | |
dc.contributor.author | Swaminathan, Raghuraman | |
dc.date.accessioned | 2024-02-22T18:19:10Z | |
dc.date.available | 2024-02-22T18:19:10Z | |
dc.date.issued | 2023-09 | |
dc.description.abstract | Multiword expressions (MWEs) are combinations of words where the meaning of the expression cannot be derived from its component words. MWEs are common across languages and are difficult to identify. For NLP tasks such as sentiment analysis and machine translation, it is important that language models automatically identify and classify these MWEs. While considerable work has been done on identifying and classifying MWEs, little work has been done in a cross-lingual setting. In this thesis, we consider novel cross-lingual settings for MWE identification and idiomaticity prediction in which systems are tested on languages that are unseen during training. We use multilingual models of BERT, specifically mBERT, RoBERTa and mDeBERTa. Our findings indicate that pre-trained multilingual language models are able to learn knowledge about MWEs and idiomaticity that is not language-specific. Moreover, we find that training data from other languages can be leveraged to give improvements over monolingual models. | |
dc.description.copyright | © Raghuraman Swaminathan, 2023 | |
dc.format.extent | vii, 72 | |
dc.format.medium | electronic | |
dc.identifier.oclc | (OCoLC)1439830622 | en |
dc.identifier.other | Thesis 11409 | en |
dc.identifier.uri | https://unbscholar.lib.unb.ca/handle/1882/37717 | |
dc.language.iso | en | |
dc.publisher | University of New Brunswick | |
dc.rights | http://purl.org/coar/access_right/c_abf2 | |
dc.subject.discipline | Computer Science | |
dc.subject.lcsh | Natural language processing (Computer science) | en |
dc.subject.lcsh | Computational linguistics. | en |
dc.subject.lcsh | Multilingual computing. | en |
dc.title | Token-level identification of multiword expressions using pre-trained multilingual language models | |
dc.type | master thesis | |
oaire.license.condition | other | |
thesis.degree.discipline | Computer Science | |
thesis.degree.grantor | University of New Brunswick | |
thesis.degree.level | masters | |
thesis.degree.name | M.C.S. | |