Building and evaluating web corpora representing national varieties of English

dc.contributor.author: Cook, Paul
dc.contributor.author: Brinton, Laurel J.
dc.date.accessioned: 2023-03-02T23:44:16Z
dc.date.available: 2023-03-02T23:44:16Z
dc.date.issued: 2017
dc.description.abstract: Corpora are essential resources for language studies, as well as for training statistical natural language processing systems. Although very large English corpora have been built, only relatively small corpora are available for many varieties of English. National top-level domains (e.g., .au, .ca) could be exploited to automatically build web corpora, but it is unclear whether such corpora would reflect the corresponding national varieties of English; i.e., would a web corpus built from the .ca domain correspond to Canadian English? In this article we build web corpora from national top-level domains corresponding to countries in which English is widely spoken. We then carry out statistical analyses of these corpora in terms of keywords, measures of corpus comparison based on the Chi-square test and spelling variants, and the frequencies of words known to be marked in particular varieties of English. We find evidence that the web corpora indeed reflect the corresponding national varieties of English. We then demonstrate, through a case study on the analysis of Canadianisms, that these corpora could be valuable lexicographical resources.
dc.description.copyright: Copyright © 2021 ACM, Inc.
dc.identifier.uri: https://unbscholar.lib.unb.ca/handle/1882/22383
dc.identifier.url: https://dl.acm.org/doi/abs/10.1007/s10579-016-9378-z
dc.publisher: ACM Digital Library
dc.relation.hasversion: 10.1007/s10579-016-9378-z
dc.rights: http://purl.org/coar/access_right/c_abf2
dc.subject.discipline: Computer Science
dc.title: Building and evaluating web corpora representing national varieties of English
dc.type: journal article
oaire.citation.endPage: 662
oaire.citation.issue: 3
oaire.citation.startPage: 643
oaire.citation.title: Language Resources and Evaluation
oaire.citation.volume: 51

Files

Original bundle
Name: item.pdf
Size: 354.91 KB
Format: Adobe Portable Document Format
