Moving away from semantic overfitting in disambiguation datasets

Marten Postma, Filip Ilievski, Piek Vossen, Marieke van Erp
2016 Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods
Entities and events in the world have no frequency, but our communication about them and the expressions we use to refer to them do have a strong frequency profile. Language expressions and their meanings follow a Zipfian distribution, featuring a small number of very frequent observations and a very long tail of low-frequency observations. Since our NLP datasets sample texts but do not sample the world, they are no exception to Zipf's law. This causes a lack of representativeness in our NLP ...s, leading to models that can capture the head phenomena in language, but fail when dealing with the long tail. We therefore propose a referential challenge for semantic NLP that reflects a higher degree of ambiguity and variance and captures a large range of small real-world phenomena. To perform well, systems would have to show deep understanding of the linguistic tail.
doi:10.18653/v1/w16-6004 fatcat:6je6eholijhv5muvokropifesa