Leveraging bilingual dictionaries to learn word embeddings for low-resource languages

dc.contributor.advisor: Cook, Paul
dc.contributor.author: Bear, Diego
dc.date.accessioned: 2025-04-04T13:34:39Z
dc.date.available: 2025-04-04T13:34:39Z
dc.date.issued: 2025-02
dc.description.abstract: Word embeddings [33, 36] have been used to bolster the performance of natural language processing systems in a wide variety of tasks, including information retrieval [42] and machine translation [37]. However, approaches to learning word embeddings typically require large corpora of running text to learn high-quality representations. For many languages, such resources are unavailable. This is the case for Wolastoqey and Mi’kmaq, two endangered low-resource Eastern Algonquian languages. Because no large corpora exist for Wolastoqey and Mi’kmaq, in this thesis we leverage bilingual dictionaries to learn Wolastoqey and Mi’kmaq word embeddings by encoding their corresponding English definitions into vector representations using English word and sequence representation models. Specifically, we consider representations based on pretrained word2vec [33], RoBERTa [31], and sentence-RoBERTa [40] models, as well as fine-tuned sentence-RoBERTa models [40]. We evaluate these embeddings in word prediction tasks focused on part-of-speech, animacy, and transitivity; in semantic clustering; and in reverse dictionary search. We additionally construct word embeddings for higher-resource languages (English, German, and Spanish) using our methods and evaluate these embeddings on existing word-similarity datasets. Our findings indicate that our word embedding methods can produce meaningful vector representations both for low-resource languages such as Wolastoqey and Mi’kmaq and for higher-resource languages.
dc.description.copyright: © Diego Bear, 2025
dc.format.extent: viii, 57
dc.format.medium: electronic
dc.identifier.uri: https://unbscholar.lib.unb.ca/handle/1882/38275
dc.language.iso: en
dc.publisher: University of New Brunswick
dc.rights: http://purl.org/coar/access_right/c_abf2
dc.subject.discipline: Computer Science
dc.title: Leveraging bilingual dictionaries to learn word embeddings for low-resource languages
dc.type: master thesis
oaire.license.condition: other
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of New Brunswick
thesis.degree.level: masters
thesis.degree.name: M.C.S.
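
Illustrative note: the abstract describes representing each headword by a vector encoding of its English dictionary definition, and evaluating by reverse dictionary search among other tasks. The following is a minimal sketch of that general idea, not the thesis's implementation. It assumes mean pooling of per-word vectors (the record does not say how word-level vectors are combined), uses a tiny made-up vector table in place of a pretrained English model, and uses placeholder headwords rather than real Wolastoqey or Mi’kmaq entries. All names (`ENGLISH_VECTORS`, `embed_definition`, `reverse_lookup`) are hypothetical.

```python
import math

# Toy stand-in for a pretrained English embedding table.
# The values are invented for illustration only.
ENGLISH_VECTORS = {
    "small": [0.9, 0.1, 0.0],
    "river": [0.1, 0.9, 0.1],
    "large": [-0.8, 0.2, 0.1],
    "lake": [0.2, 0.8, 0.3],
    "run": [0.0, 0.1, 0.9],
    "fast": [0.1, 0.0, 0.8],
}

# Placeholder bilingual dictionary: headword -> English definition.
# These are NOT real Wolastoqey or Mi'kmaq entries.
DICTIONARY = {
    "headword_a": "small river",
    "headword_b": "large lake",
    "headword_c": "run fast",
}


def embed_definition(definition):
    """Mean-pool the vectors of the definition's known tokens (an assumption)."""
    tokens = definition.lower().split()
    vectors = [ENGLISH_VECTORS[t] for t in tokens if t in ENGLISH_VECTORS]
    if not vectors:
        raise ValueError("no known words in definition: %r" % definition)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Each headword's embedding is the encoding of its English definition.
HEADWORD_EMBEDDINGS = {h: embed_definition(d) for h, d in DICTIONARY.items()}


def reverse_lookup(query):
    """Rank headwords by similarity between the query and their definitions."""
    q = embed_definition(query)
    return sorted(
        HEADWORD_EMBEDDINGS,
        key=lambda h: cosine(q, HEADWORD_EMBEDDINGS[h]),
        reverse=True,
    )
```

Calling `reverse_lookup("fast run")` ranks the placeholder headwords by how close their definition vectors are to the query vector. A sequence model such as sentence-RoBERTa would replace `embed_definition` with a single call that encodes the whole definition at once.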

Files

Original bundle
Name: Diego Bear - Thesis.pdf
Size: 312.6 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.13 KB
Format: Item-specific license agreed upon to submission