Robust Models in Information Retrieval

Nedim Lipka, Benno Stein
2011 22nd International Workshop on Database and Expert Systems Applications
Classification tasks in information retrieval deal with document collections of enormous size, which makes the ratio between the document set underlying the learning process and the set of unseen documents very small. With a ratio close to zero, the evaluation of a model-classifier combination's generalization ability with leave-n-out methods or cross-validation becomes unreliable: the generalization error of a complex model (with a more complex hypothesis structure) might be underestimated compared to the generalization error of a simple model (with a less complex hypothesis structure). Given this situation, optimizing the bias-variance tradeoff to select among these models will lead one astray. To address this problem we introduce the idea of robust models, where one intentionally restricts the hypothesis structure within the model formation process. We observe that, despite the fact that such a robust model entails a higher test error, its efficiency "in the wild" outperforms that of the model that would have been chosen normally, under the perspective of the best bias-variance tradeoff. We present two case studies: (1) a categorization task, which demonstrates that robust models are more stable in retrieval situations when training data is scarce, and (2) a genre identification task, which underlines the practical relevance of robust models.
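The core claim can be illustrated with a small sketch (hypothetical, not from the paper): when training data is scarce, a deliberately restricted hypothesis (here a constant model) can beat a far more flexible one (an exact polynomial interpolant) on a large unseen set, even though the flexible model fits the training data perfectly. The target function sin(x), sample sizes, and noise level are all illustrative assumptions.

```python
import math
import random

random.seed(0)

def target(x):
    # ground-truth function, an assumption for this illustration
    return math.sin(x)

# scarce training data: 8 noisy samples
xs = [random.uniform(0.0, 3.0) for _ in range(8)]
ys = [target(x) + random.gauss(0.0, 0.3) for x in xs]

# robust model: hypothesis structure restricted to a constant
mean_y = sum(ys) / len(ys)

def complex_model(x):
    # flexible model: Lagrange interpolant passing exactly
    # through every (noisy) training point
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# large "in the wild" evaluation set
grid = [3.0 * k / 999 for k in range(1000)]
mse_simple = sum((mean_y - target(x)) ** 2 for x in grid) / len(grid)
mse_complex = sum((complex_model(x) - target(x)) ** 2 for x in grid) / len(grid)

print(f"robust (constant) model MSE:  {mse_simple:.3f}")
print(f"complex (interpolant) MSE:    {mse_complex:.3f}")
```

The interpolant has zero training error but oscillates between the sample points, so its error on the dense unseen grid is typically far larger than that of the constant model, mirroring the paper's point that the best training-set fit is a poor guide when the seen/unseen ratio is near zero.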
doi:10.1109/dexa.2011.73 dblp:conf/dexaw/LipkaS11 fatcat:2vtjpybstfhjpntx3trmbtmquy