From Web Crawl to Clean Register-Annotated Corpora
2020
Workshop on Web as Corpus
The web presents unprecedented opportunities for large-scale collection of text in many languages. However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents. In this paper, we evaluate a multilingual approach to this end. Our starting points are the Swedish and French Common Crawl datasets gathered for the 2017 CoNLL shared task, particularly the URLs.
dblp:conf/aclwac/LaippalaRHLRSSP20
fatcat:p6ltpzhmpjcxrosnw5m5rbxghy