Closed yesterday and closed minds
Proceedings of the 14th conference on Computational linguistics -
Collocation-based tagging and bracketing programs have attained promising results. Yet they have not arrived at the stage where they could be used as pre-processors for full-fledged parsing, because accuracy is still not high enough. To improve accuracy, it is necessary to investigate the points where statistical data is being misinterpreted, leading to incorrect results. In this paper we investigate the inaccuracy that is introduced when a pre-processor relies solely on collocations and blurs the
... ion between two separate relations: thematic relations and sentential relations. Thematic relations are word pairs, not necessarily adjacent (e.g., adjourn a meeting), that encode information at the concept level. Sentential relations, on the other hand, concern adjacent word pairs that form a noun group. For example, preferred stock is a noun group that must be identified as such at the syntactic level. Blurring the difference between these two phenomena contributes to errors in tagging pairs such as expressed concerns, a verb-noun construct, as opposed to preferred stocks, an adjective-noun construct. Although both relations are manifested in the corpus as high mutual-information collocations, they possess different properties and need to be separated.

In our method, we distinguish between these two cases by asking additional questions of the corpus. By definition, thematic relations take on further variations in the corpus. Expressed concerns (a thematic relation) takes concerns expressed, expressing concerns, express his concerns, etc. On the other hand, preferred stock (a sentential relation) does not take any such syntactic variations. We show how this method impacts pre-processing and parsing, and we provide empirical results based on the analysis of an 80-million-word corpus.¹

2 Pre-Processing: The Greater Picture

Sentences in a typical newspaper story include idioms, ellipses, and ungrammatical constructs. Since authentic language defies textbook grammar, we must rethink our basic pars-

¹ This research was sponsored (in part) by the Defense Advanced Research Project Agency (DOD) and other government agencies.
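The variation test described in the abstract (a thematic pair such as expressed concerns recurs in reordered, inflected, and interrupted forms, while a sentential pair such as preferred stock does not) can be illustrated with a minimal Python sketch. This is not the implementation used in this work. The suffix stripper `crude_stem`, the look-ahead `window`, the `min_variation_ratio` threshold, and the toy sentences in `TOY_CORPUS` are all hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def crude_stem(word: str) -> str:
    """Rough suffix stripper (an assumption; a real system would use
    a proper morphological analyzer)."""
    w = word.lower()
    if w.endswith("ss"):  # keep "express" intact
        return w
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def count_variations(sentences: List[List[str]], first: str, second: str,
                     window: int = 3) -> Tuple[int, int]:
    """Return (fixed, varied) co-occurrence counts for a word pair.

    fixed  = the exact adjacent form "first second"
    varied = any other co-occurrence of the two stems within `window`
             tokens (reordered, inflected, or with intervening words)
    """
    s1, s2 = crude_stem(first), crude_stem(second)
    fixed = varied = 0
    for tokens in sentences:
        lowered = [t.lower() for t in tokens]
        stems = [crude_stem(t) for t in lowered]
        for i, stem in enumerate(stems):
            if stem not in (s1, s2):
                continue
            other = s2 if stem == s1 else s1
            # Look ahead for the partner stem within the window.
            for j in range(i + 1, min(i + 1 + window, len(stems))):
                if stems[j] != other:
                    continue
                is_fixed = (j == i + 1
                            and lowered[i] == first.lower()
                            and lowered[j] == second.lower())
                if is_fixed:
                    fixed += 1
                else:
                    varied += 1
                break
    return fixed, varied


def classify_pair(sentences: List[List[str]], first: str, second: str,
                  min_variation_ratio: float = 0.2) -> str:
    """Label a pair 'thematic' if enough of its occurrences are variations.
    The threshold is a hypothetical value, not taken from the paper."""
    fixed, varied = count_variations(sentences, first, second)
    total = fixed + varied
    if total == 0:
        return "unknown"
    return "thematic" if varied / total >= min_variation_ratio else "sentential"


# Toy sentences invented for illustration; not drawn from the paper's corpus.
TOY_CORPUS = [s.split() for s in (
    "analysts expressed concerns about the deal",
    "the concerns expressed by analysts were ignored",
    "he is expressing concerns again",
    "she will express his concerns",
    "the company issued preferred stock",
    "preferred stock rose sharply",
    "holders of preferred stock were paid",
)]

if __name__ == "__main__":
    for pair in (("expressed", "concerns"), ("preferred", "stock")):
        print(pair, count_variations(TOY_CORPUS, *pair),
              classify_pair(TOY_CORPUS, *pair))
```

On this toy data the pair expressed concerns shows more varied than fixed occurrences, whereas preferred stock appears only in its fixed adjacent form. Treating the ratio of varied to total occurrences as the deciding signal is one simple design choice; the decision procedure applied to the full corpus may differ.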