Leveraging bilingual dictionaries to learn word embeddings for low-resource languages
Date
2025-02
Publisher
University of New Brunswick
Abstract
Word embeddings [33, 36] have been used to bolster the performance of natural language processing systems in a wide variety of tasks, including information retrieval [42] and machine translation [37]. However, approaches to learning word embeddings typically require large corpora of running text to learn high-quality representations. For many languages, such resources are unavailable. This is the case for Wolastoqey and Mi’kmaq, two endangered, low-resource Eastern Algonquian languages. As no large corpora exist for Wolastoqey and Mi’kmaq, in this thesis we leverage bilingual dictionaries to learn Wolastoqey and Mi’kmaq word embeddings by encoding their corresponding English definitions into vector representations using English word and sequence representation models. Specifically, we consider representations based on pretrained word2vec [33], RoBERTa [31], and sentence-RoBERTa [40] models, as well as fine-tuned sentence-RoBERTa models [40]. We evaluate these embeddings on word prediction tasks focused on part of speech, animacy, and transitivity; on semantic clustering; and on reverse dictionary search. We additionally construct word embeddings for higher-resource languages (English, German, and Spanish) using our methods and evaluate them on existing word-similarity datasets. Our findings indicate that our word embedding methods can produce meaningful vector representations both for low-resource languages such as Wolastoqey and Mi’kmaq and for higher-resource languages.
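The core idea admits a minimal sketch: a headword in the low-resource language inherits the embedding of its English dictionary definition. Assuming the sentence-transformers library and its pretrained sentence-RoBERTa checkpoint all-roberta-large-v1 (the thesis's exact models, data, and headwords may differ; the entries below are hypothetical), this looks roughly like:

```python
# A minimal sketch, not the thesis implementation: derive an embedding for a
# low-resource headword by encoding its English definition with a pretrained
# sentence-RoBERTa model from the sentence-transformers library.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-roberta-large-v1")  # pretrained sentence-RoBERTa

# Hypothetical bilingual-dictionary entries: headword -> English definition.
dictionary = {
    "headword_1": "car, automobile",
    "headword_2": "he or she walks along",
}

# Each headword's vector is the sentence embedding of its English definition;
# these vectors can then be compared (e.g., by cosine similarity) or clustered.
embeddings = {word: model.encode(definition) for word, definition in dictionary.items()}
```

With a word-level model such as word2vec, a natural counterpart (though not necessarily the thesis's exact procedure) is to average the vectors of the words in each definition instead of encoding it as a sequence.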