The Danish Gigaword Corpus
2021
Nordic Conference of Computational Linguistics
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.
1 BotXO maintains a Danish BERT instance at https://github.com/botxo/nordic_bert. This model was
dblp:conf/nodalida/Stromberg-Derczynski21
fatcat:3fxd6vjxczfr3o35b4br72kngi