Browsing by Author "Cook, Paul"
Now showing 1 - 12 of 12
Item A comparison of machine learning algorithms for zero-shot cross-lingual phishing detection (University of New Brunswick, 2023-08) Staples, Dakota; Hakak, Saqib; Cook, Paul
Phishing is a major problem worldwide. Existing studies have focused mainly on detecting emails in one language (mostly English). However, detecting emails in multiple languages is challenging due to a lack of datasets. Without ample data from which to learn, models cannot accurately distinguish benign emails from phishing emails, resulting in false positives and negatives. This research compares the performance of numerous machine learning models and transformers using zero-shot learning for multilingual phishing detection. In a zero-shot learning set-up, the model is trained on one language and tested on another. English, French, and Russian emails are used as the training and testing languages. My results show that, on average, XLM-RoBERTa performs best of all the tested models in terms of accuracy, scoring 99% when testing on English, 99% when testing on French, and 95% when testing on Russian.

Item A multi-sense context-agnostic definition generation model evaluated on multiple languages (University of New Brunswick, 2020) Kabiri, Arman; Cook, Paul
Definition modelling is a recently introduced task in natural language processing (NLP) that aims to predict and generate dictionary-style definitions for any given word. Most prior work on definition modelling has not accounted for polysemy — the linguistic phenomenon in which a word can convey multiple meanings in different contexts — or has done so by considering definition modelling for a target word in a given context. In contrast, in this study, we propose a context-agnostic approach to definition modelling, based on multi-sense word embeddings, that is capable of generating multiple definitions for a target word.
In further contrast to most prior work, which has primarily focused on English, we evaluate our proposed approach on fifteen different datasets covering nine languages from several language families. To evaluate our approach we consider several variations of BLEU, a widely used evaluation metric initially introduced for machine translation, adapted here to definition modelling. Our results demonstrate that our proposed multi-sense model outperforms a single-sense model on all fifteen datasets.

Item An automatic approach to discover lexical semantic differences in varieties of English (University of New Brunswick, 2017) Nagra, Priyal; Cook, Paul
The English language is not uniform. Speakers of English in different parts of the world can use the same word, but with different meanings. Investigating lexical semantic differences in varieties of English such as American, Australian, British, and Canadian English is an interesting area of research in computational linguistics. We use corpora of varieties of English to detect words whose meaning differs from one variety to another. The methods used in this work to automatically identify lexical variation are distributional semantic models, keyword measures, and word embedding models inspired by neural network language models. We determine whether word embedding models can detect lexical semantic differences between varieties of English better than distributional similarity approaches and approaches based on keywords. This study presents the first important step towards a robust application of word embeddings to variational linguistics.
Our results indicate that word embeddings perform best among all methods in 2 out of 3 cases.

Item Android authorship attribution through string analysis (University of New Brunswick, 2017) Kalgutkar, Vaibhavi; Stakhanova, Natalia; Cook, Paul

Item Building and evaluating web corpora representing national varieties of English (ACM Digital Library, 2017) Cook, Paul; Brinton, Laurel J.
Corpora are essential resources for language studies, as well as for training statistical natural language processing systems. Although very large English corpora have been built, only relatively small corpora are available for many varieties of English. National top-level domains (e.g., .au, .ca) could be exploited to automatically build web corpora, but it is unclear whether such corpora would reflect the corresponding national varieties of English; i.e., would a web corpus built from the .ca domain correspond to Canadian English? In this article we build web corpora from national top-level domains corresponding to countries in which English is widely spoken. We then carry out statistical analyses of these corpora in terms of keywords, measures of corpus comparison based on the chi-square test and spelling variants, and the frequencies of words known to be marked in particular varieties of English. We find evidence that the web corpora indeed reflect the corresponding national varieties of English. We then demonstrate, through a case study on the analysis of Canadianisms, that these corpora could be valuable lexicographical resources.

Item Contextualized embeddings encode knowledge of English verb-noun combination idiomaticity (University of New Brunswick, 2021) Fakharian, Samin; Cook, Paul
English verb-noun combinations (VNCs) consist of a verb with a noun in its direct object position, and can be used as idioms or as literal combinations (e.g., hit the road).
As VNCs are commonly used in language and their meaning is often not predictable, they are an important topic of research for NLP. In this study, we propose a supervised approach to distinguishing idiomatic and literal usages of VNCs in text based on contextualized representations, specifically BERT and RoBERTa. We show that this model using contextualized embeddings outperforms previous approaches, including when the model is tested on instances of VNC types that were not observed during training. We further consider incorporating linguistic knowledge of the lexico-syntactic fixedness of VNCs into our model. Our findings indicate that contextualized embeddings capture this information.

Item Cross-lingual word embeddings for low-resource and morphologically-rich languages (University of New Brunswick, 2021) Hakimi Parizi, Ali; Cook, Paul
Despite recent advances in natural language processing, state-of-the-art methods still fall short on problems related to low-resource and morphologically-rich languages. These methods are data-hungry, and due to the scarcity of training data for low-resource and morphologically-rich languages, developing NLP tools for them is challenging. Approaches for forming cross-lingual embeddings and transferring knowledge from a rich- to a low-resource language have emerged to overcome the lack of training data. Although recent years have seen major improvements in cross-lingual methods, these methods still have limitations that have not been properly addressed. An important one is the out-of-vocabulary (OOV) word problem, i.e., words that occur in a document being processed, but that the model did not observe during training.
The OOV problem is more significant for low-resource languages, since relatively little training data is available for them, and for morphologically-rich languages, since many of their word forms are likely not observed in the training data. Approaches to learning sub-word embeddings have been proposed to address the OOV problem in monolingual models, but most prior work has not considered sub-word embeddings in cross-lingual models. The hypothesis of this thesis is that it is possible to leverage sub-word information to overcome the OOV problem in low-resource and morphologically-rich languages. This thesis presents a novel bilingual lexicon induction task to demonstrate the effectiveness of sub-word information in the cross-lingual space and how it can be employed to overcome the OOV problem. Moreover, this thesis presents a novel cross-lingual word representation method that incorporates sub-word information during training to learn a better cross-lingual shared space and to better represent OOVs in that space. This method is particularly suitable for low-resource scenarios, as demonstrated through a series of experiments on bilingual lexicon induction, monolingual word similarity, and a downstream task, document classification. More specifically, the method's suitability for low-resource languages is shown by conducting bilingual lexicon induction on twelve low-resource and morphologically-rich languages.

Item Determining if this word is used like that word: predicting usage similarity with supervised and unsupervised approaches (University of New Brunswick, 2017) King, Milton; Cook, Paul
Determining the meaning of a word in context is an important task for a variety of natural language processing applications such as translating between languages, summarizing paragraphs, and phrase completion.
Usage similarity (USim) is an approach to describing the meaning of a word in context that does not rely on a sense inventory -- a set of dictionary-like definitions. Instead, pairs of usages of a target word are rated in terms of their similarity on a scale. In this thesis, we evaluate unsupervised approaches to USim based on embeddings for words, contexts, and sentences, and achieve state-of-the-art results on two USim datasets. We further consider supervised approaches to USim, and find that they can increase the performance of our models. We carry out a more detailed evaluation, observing performance on different parts of speech as well as the change in performance when using different features. Our models also perform competitively in two word sense induction tasks, which involve clustering instances of a word based on its meaning in context.

Item Multiword expression identification using deep learning (University of New Brunswick, 2017) Gharbieh, Waseem; Bhavsar, Virendrakumar; Cook, Paul
Multiword expressions combine words in various ways to produce phrases whose properties are not predictable from the properties of their individual words or their normal mode of combination. There are many types of multiword expressions, including proverbs, named entities, and verb-noun combinations. In this thesis, we propose various deep learning models to identify multiword expressions and compare their performance to more traditional machine learning models and current multiword expression identification systems. We show that convolutional neural networks outperform the state of the art, with the three-hidden-layer convolutional neural network performing best.
To our knowledge, this is the first work to apply deep learning models to broad multiword expression identification.

Item That ain’t how I speak: Personalizing natural language processing (University of New Brunswick, 2021-10) King, Milton; Cook, Paul
Natural language processing (NLP) involves automatically analyzing text written by human authors. Each person develops their own individual use of a language, known as an idiolect, which can result in poor performance from generic NLP systems. Ideally, each person would have their own personalized system tailored toward them. In this thesis, I demonstrate the potential benefits of personalizing systems in three different NLP tasks: language modeling (estimating the probability of a sequence of words), authorship verification (determining if a document belongs to a specific person), and word sense disambiguation (assigning a dictionary-like meaning to a word in context). Personalization in these tasks has not been widely studied, and to the best of my knowledge, this is the first work to consider personalization for word sense disambiguation, for which I design a novel dataset. For each task, I show the increase in performance that the proposed personalized models achieve over state-of-the-art models. The experiments in this thesis are designed without consideration of people’s demographics, and all personalized methods require relatively small amounts of text from an individual. These two criteria ensure the personalized methods work well for each individual regardless of their demographic or the amount of text they have authored.

Item Token-level identification of multiword expressions using pre-trained multilingual language models (University of New Brunswick, 2023-09) Swaminathan, Raghuraman; Cook, Paul
Multiword expressions (MWEs) are combinations of words where the meaning of the expression cannot be derived from its component words.
MWEs are commonly used across languages and are difficult to identify. For NLP tasks such as sentiment analysis and machine translation, it is important that language models automatically identify and classify these MWEs. While considerable work has been done on identifying and classifying MWEs, little work has been done in a cross-lingual setting. In this thesis, we consider novel cross-lingual settings for MWE identification and idiomaticity prediction in which systems are tested on languages that are unseen during training. We use multilingual models of BERT, specifically mBERT, RoBERTa, and mDeBERTa. Our findings indicate that pre-trained multilingual language models are able to learn knowledge about MWEs and idiomaticity that is not language-specific. Moreover, we find that training data from other languages can be leveraged to give improvements over monolingual models.

Item WaCadie: Towards a web corpus of Acadian French (University of New Brunswick, 2023-12) Robichaud, Jérémy; Cook, Paul
Corpora are important assets within the natural language processing and linguistics communities. However, not all low-resource language varieties have corpus representation. Acadians, a people of eastern North America, do not have a corpus representing their variety of French. An Acadian French corpus could allow for a better understanding of this unique dialect. Leveraging web-as-corpus methodologies such as BootCaT, domain crawling, and social media scraping, we create three different corpus representations of Acadian French. Each corpus is, on its own, an Acadian French resource, while also showcasing the strengths of its individual method of creation. We propose 22 statistical corpus-based measures stemming from previously researched Acadian French characteristics to compare these newly built corpora to known Acadian French text.
We found that while all three corpora yield traces of Acadian French text, the BootCaT-derived corpus is the largest, and the social media corpus has the highest count of Acadian French characteristics.
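The corpus-comparison idea in the last entry, scoring a corpus by the rates of characteristic lexical features, can be sketched as follows. This is a minimal illustration only: the feature words and function names below are hypothetical stand-ins, not the thesis's actual 22 measures.

```python
from collections import Counter


def per_million(tokens, features):
    """Rate per million tokens of each characteristic feature.

    `features` is an illustrative list of lexical markers of the target
    variety; the real measures in the thesis are richer than this.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {f: counts[f] / total * 1_000_000 for f in features}


def characteristic_score(tokens, features):
    """Sum of per-million feature rates: one crude scalar indicating how
    strongly a corpus exhibits the target variety's markers."""
    return sum(per_million(tokens, features).values())
```

In use, each candidate corpus (BootCaT, crawled, social media) would be tokenized and scored with the same feature list, and the resulting profiles compared against known reference text of the variety.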