Cross-lingual word embeddings for low-resource and morphologically-rich languages

Hakimi Parizi, Ali

dc.contributor.advisor	Cook, Paul
dc.contributor.author	Hakimi Parizi, Ali
dc.date.accessioned	2023-03-01T16:49:34Z
dc.date.available	2023-03-01T16:49:34Z
dc.date.issued	2021
dc.date.updated	2023-03-01T15:03:26Z
dc.description.abstract	Despite recent advances in natural language processing, there is still a gap in state-of-the-art methods to address problems related to low-resource and morphologically-rich languages. These methods are data-hungry, and due to the scarcity of training data for low-resource and morphologically-rich languages, developing NLP tools for them is a challenging task. Approaches for forming cross-lingual embeddings and transferring knowledge from a rich- to a low-resource language have emerged to overcome the lack of training data. Although in recent years we have seen major improvements in cross-lingual methods, these methods still have some limitations that have not been addressed properly. An important problem is the out-of-vocabulary (OOV) word problem, i.e., words that occur in a document being processed but that the model did not observe during training. The OOV problem is more significant in the case of low-resource languages, since there is relatively little training data available for them, and also in the case of morphologically-rich languages, since a considerable number of their word forms are very likely to be absent from the training data. Approaches to learning sub-word embeddings have been proposed to address the OOV problem in monolingual models, but most prior work has not considered sub-word embeddings in cross-lingual models. The hypothesis of this thesis is that it is possible to leverage sub-word information to overcome the OOV problem in low-resource and morphologically-rich languages. This thesis presents a novel bilingual lexicon induction task to demonstrate the effectiveness of sub-word information in the cross-lingual space and how it can be employed to overcome the OOV problem.
Moreover, this thesis presents a novel cross-lingual word representation method that incorporates sub-word information during the training process to learn a better cross-lingual shared space and to better represent OOVs in that space. This method is particularly suitable for low-resource scenarios, a claim that is supported by a series of experiments on bilingual lexicon induction, monolingual word similarity, and a downstream document classification task. More specifically, the method's suitability for low-resource languages is shown by conducting bilingual lexicon induction on twelve low-resource and morphologically-rich languages.
dc.description.copyright	© Ali Hakimi Parizi, 2021
dc.format	text/xml
dc.format.extent	xiii, 133 pages
dc.format.medium	electronic
dc.identifier.oclc	(OCoLC)1410952293	en
dc.identifier.other	Thesis 10737	en
dc.identifier.uri	https://unbscholar.lib.unb.ca/handle/1882/14534
dc.language.iso	en_CA
dc.publisher	University of New Brunswick
dc.rights	http://purl.org/coar/access_right/c_abf2
dc.subject.discipline	Computer Science
dc.subject.lcsh	Natural language processing (Computer science)	en
dc.subject.lcsh	Similarity (Language learning)	en
dc.subject.lcsh	Bilingualism.	en
dc.subject.lcsh	Grammar, Comparative and general--Morphology.	en
dc.title	Cross-lingual word embeddings for low-resource and morphologically-rich languages
dc.type	doctoral thesis
thesis.degree.discipline	Computer Science
thesis.degree.fullname	Doctor of Philosophy
thesis.degree.grantor	University of New Brunswick
thesis.degree.level	doctoral
thesis.degree.name	Ph.D.

Files

Original bundle

Name: item.pdf
Size: 660.88 KB
Format: Adobe Portable Document Format

Collections

Open Theses & Dissertations