WaCadie: Towards a web corpus of Acadian French

University of New Brunswick


Corpora are important assets within the natural language processing and linguistics communities. However, not all low-resource languages have corpus representation. Acadians, a people of eastern North America, have no corpus representing their variety of French. An Acadian French corpus could allow for a better understanding of this unique dialect. Leveraging three web-as-corpus methodologies (BootCaT, domain crawling, and social media scraping), we create three different corpus representations of Acadian French. Each corpus is an Acadian French resource in its own right and also showcases the strengths of its method of creation. We propose 22 statistical corpus-based measures, derived from previously researched Acadian French characteristics, to compare these newly built corpora to known Acadian French text. We find that while all three corpora yield traces of Acadian French text, the BootCaT corpus is the largest, and the social media corpus has the highest count of Acadian French characteristics.
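
To make the idea of a statistical corpus-based measure concrete, the following is a minimal sketch of one possible measure: the rate of a dialect feature per 1,000 tokens, which lets corpora of different sizes be compared. The abstract does not list the 22 measures, so everything here is an assumption made for illustration: the function names `tokenize` and `rate_per_thousand`, the normalization per 1,000 tokens, and the example pattern `JE_ONS` (first-person "je" followed by a verb form ending in "-ons", as in "je parlons") are hypothetical and may not correspond to any measure actually used in the work.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer (illustrative, not the authors' tokenizer).

    Keeps accented letters and joins words linked by an apostrophe or hyphen.
    """
    return re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text.lower())


def rate_per_thousand(text, pattern):
    """Return occurrences of `pattern` per 1,000 tokens of `text`.

    Normalizing by token count makes the rate comparable across
    corpora of different sizes. Returns 0.0 for empty input.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    # Match against the normalized token stream so spacing and
    # punctuation in the raw text do not affect the count.
    hits = len(re.findall(pattern, " ".join(tokens)))
    return 1000.0 * hits / len(tokens)


# Hypothetical feature pattern: "je" followed by a word ending in "-ons".
# This is an assumed example and will over-match some non-verb words.
JE_ONS = r"\bje \w+ons\b"

if __name__ == "__main__":
    sample = "je parlons français et je mangeons du pain"
    print(rate_per_thousand(sample, JE_ONS))
```

Under these assumptions, the same function could be run over each web-derived corpus and over a reference set of known Acadian French text, and the resulting rates compared feature by feature.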