Genre as noise: noise in genre

Andrea Stubbe, Christoph Ringlstetter, Klaus U. Schulz
2007 International Journal on Document Analysis and Recognition  
Given a specific information need, documents of the wrong genre can be considered noise. From this perspective, genre classification helps to separate relevant documents from noise. Orthographic errors represent a second, finer notion of noise. Since specific genres often include documents with many errors, an interesting question is whether this "micro-noise" can help to classify genre. In this paper we consider both problems. After introducing a comprehensive hierarchy of genres, we ... an intuitive method to build specialized and distinctive classifiers that also work for very small training corpora. We then investigate the correlation between genre and micro-noise. Using special error dictionaries, we estimate the typical error rates for each genre. Finally, we test whether the error rate of a document is a useful feature for genre classification.

3. We present a detailed evaluation of the distribution of orthographic error rates across distinct genres.
4. We show first results on the extent to which an automated analysis of a document's error rate can serve as an additional feature to improve genre classification.

As to 1, our genre hierarchy extends previous work [Crowston and Williams, 1997; Dewe et al., 1998]. We aimed for maximal completeness while avoiding fuzzy and overlapping genre classes. With two levels and 32 leaf categories, the hierarchy is meant to guarantee sufficient granularity for practical applications while still allowing a return to a coarser scheme where that is preferable. Our work on features and classifiers is motivated by the practical experience that standard learning-based classifiers (e.g., support vector machines [Joachims, 2001]) do not yield satisfactory results when only a small number of training documents is available. In our test, the complete corpus comprises 1,280 files, 40 documents for each genre. When 20 documents per genre were used for training, standard classifiers and uniform feature sets produced poor results. We were then interested to see if a heuristic classifier based on a small set of intelligent hand-crafted fea-
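
The abstract describes estimating a per-document error rate from error dictionaries and using it as a classification feature. The excerpt does not say how those dictionaries are built or how the rate is normalized, so the following is only a minimal sketch under stated assumptions: `ERROR_DICT`, `tokenize`, and `error_rate` are hypothetical names, the three dictionary entries are made up for illustration, and the rate is expressed as dictionary hits per 1,000 tokens.

```python
import re

# Hypothetical error dictionary mapping a misspelling to its intended word.
# The paper's actual dictionaries are not shown in this excerpt.
ERROR_DICT = {
    "teh": "the",
    "recieve": "receive",
    "definately": "definitely",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def error_rate(text, error_dict=ERROR_DICT, per=1000):
    """Return the number of error-dictionary hits per `per` tokens.

    Assumption: a token counts as an error if it appears as a key in
    the error dictionary. An empty document gets a rate of 0.0.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in error_dict)
    return per * hits / len(tokens)
```

A value computed this way could be appended to a document's feature vector alongside other genre features; whether it helps is the question the paper investigates.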
doi:10.1007/s10032-007-0060-2 fatcat:uvvznwc3ofawdh4huvzwndxp7u