WaCadie: Towards a web corpus of Acadian French

dc.contributor.advisor: Cook, Paul
dc.contributor.author: Robichaud, Jérémy
dc.date.accessioned: 2024-03-20T14:03:32Z
dc.date.available: 2024-03-20T14:03:32Z
dc.date.issued: 2023-12
dc.description.abstract: Corpora are important assets within the natural language processing and linguistics communities. However, not all low-resource languages have corpus representation. Acadians, a people of eastern North America, do not have a corpus representing their variety of French. An Acadian French corpus could allow for a better understanding of this unique dialect. Leveraging web-as-corpus methodologies such as BootCaT, domain crawling, and social media scraping, we create three different corpus representations of Acadian French. Each corpus is, on its own, an Acadian French resource while also showcasing the strengths of its method of creation. We propose 22 statistical corpus-based measures, stemming from previously researched Acadian French characteristics, to compare these newly built corpora to known Acadian French text. We found that while all three yield traces of Acadian French text, the BootCaT corpus is the largest, and the social media scraping corpus has the highest count of Acadian French characteristics.
dc.description.copyright: © Jérémy Robichaud, 2023
dc.format.extent: viii, 73
dc.format.medium: electronic
dc.identifier.uri: https://unbscholar.lib.unb.ca/handle/1882/37767
dc.language.iso: en
dc.publisher: University of New Brunswick
dc.rights: http://purl.org/coar/access_right/c_abf2
dc.subject.discipline: Computer Science
dc.title: WaCadie: Towards a web corpus of Acadian French
dc.type: master thesis
oaire.license.condition: other
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of New Brunswick
thesis.degree.level: masters
thesis.degree.name: M.C.S.

Files

Original bundle
Name: Jeremy Robichaud - Thesis.pdf
Size: 438.3 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.13 KB
Format: Item-specific license agreed upon to submission
Description: