The Danish Gigaword Corpus

Leon Strømberg-Derczynski, Manuel R. Ciosici, Rebekah Baglini, Morten H. Christiansen, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn Årup Nielsen (+4 others)
2021 Nordic Conference of Computational Linguistics  
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.

1 BotXO maintains a Danish BERT instance at https://github.com/botxo/nordic_bert. This model was trained exclusively on uncurated web text and, therefore, (a) has a spurious understanding of Danish among other languages and (b) is particularly susceptible to the kind of toxic language identified by Gehman et al. (2020).
2 http://ordnet.dk
dblp:conf/nodalida/Stromberg-Derczynski21 fatcat:3fxd6vjxczfr3o35b4br72kngi