The shortcomings of a tagger

Kristin Hagen, Janne Bondi Johannessen, Anders Nøklestad
1999 Nordic Conference of Computational Linguistics  
The tagger used for the Oslo Corpus of Tagged Norwegian Texts has very good statistical results. In spite of this, it makes mistakes. In this paper we take a closer look at some of them. Although some mistakes are of a kind that would disappear if we improved the tagger, many are impossible or very difficult to do anything about. They are due to errors in the corpus (spelling errors, foreign words, non-standard spellings), to elliptic sentences, such as headlines, and to structural ambiguity, which abounds to a surprising extent. Proofreading the corpus would have removed the first kind of problem, but the other two types cannot be resolved in any obvious way.
dblp:conf/nodalida/HagenJN99 fatcat:h34dz6hz45d4dhfstscagbspwa