Empirical Methods for the Study of Denotation in Nominalizations in Spanish

Aina Peris, Mariona Taulé, Horacio Rodríguez
2012 Computational Linguistics  
This article deals with deverbal nominalizations in Spanish; concretely, we focus on the denotative distinction between event and result nominalizations. The goal of this work is twofold: first, to detect the most relevant features for this denotative distinction; and, second, to build an automatic classification system of deverbal nominalizations according to their denotation. We have based our study on theoretical hypotheses dealing with this semantic distinction and we have analyzed them empirically by means of Machine Learning techniques, which are the basis of the ADN-Classifier. This is the first tool that aims to automatically classify deverbal nominalizations into event, result, or underspecified denotation types in Spanish. The ADN-Classifier has helped us to quantitatively evaluate the validity of our claims regarding deverbal nominalizations. We set up a series of experiments in order to test the ADN-Classifier with different models and in different realistic scenarios depending on the knowledge resources and natural language processors available. The ADN-Classifier achieved good results (87.20% accuracy).
Computational Linguistics Volume 38, Number 4. Peris, Taulé, and Rodríguez, Empirical Methods for the Study of Denotation in Nominalizations.
Sporleder (2008) work with an unsupervised SRL system, and in Surdeanu et al. (2008) the work presented uses supervised SRL systems. The kind of argument annotated is also different in these works: Although only two, more syntactic, labels (subj [subject] and obj [object]) are used to annotate the arguments in Lapata (2002), Gurevich et al. (2006), and Gurevich and Waterman (2009), Padó, Pennacchiotti, and Sporleder (2008) use FrameNet labels and Surdeanu et al. (2008) use NomBank (Meyers, Reeves, and Macleod 2004) labels. The interpretation of nominalizations is crucial because they are common in texts and a substantial amount of information is represented within them. In the AnCora-ES corpus (Taulé, Martí, and Recasens 2008), for instance, the semantic information is mostly coded in verbs (56,590 verbal occurrences), but a significant number of deverbal nominalizations (23,431 occurrences) also encode rich semantic information. Most of the work on this topic sets out from the denotative distinction between nominalizations referring to an event, those that express an action or a process, and nominalizations referring to a result, those expressing the outcome of an action or process.
From a theoretical point of view, it has been claimed that this denotative distinction may have repercussions on the argument-taking capability of deverbal nominalizations. Despite being aware of this distinction, computational approaches focus on event nominalizations, either not taking result nominalizations into account or, more frequently, not characterizing the difference. For instance, SRL systems are mostly applied to event nominalizations (Pradhan et al. 2004; Erk and Padó 2006; Liu and Ng 2007). However, result nominalizations are more frequent than event nominalizations, at least in Spanish (1,845 event occurrences in contrast to 20,037 result occurrences in AnCora-ES). In the present work, we hypothesize that result nominalizations, like event nominalizations, can take arguments; therefore, discarding result nominalizations would imply a loss of semantic information that is equally relevant to text representation. In this article, we focus our interest on this denotative distinction. Concretely, we aim to determine the relevant linguistic information required to classify deverbal nominalizations as event or result types in Spanish. In order to achieve this goal, we have built an automatic classifier of deverbal nominalizations for Spanish, the ADN-Classifier, aimed at identifying the semantic denotation of these nominal predicates. The ADN-Classifier is a tool that takes into account different levels of linguistic information depending on its availability, such as senses, lemmas, or syntactic and semantic information coded in the verbal and nominal lexicons (AnCora-Verb [Aparicio, Taulé, and Martí 2008] and AnCora-Nom [Peris and Taulé 2011]) or in the AnCora-ES corpus. Therefore, this article contributes to the semantic analysis of texts focusing on Spanish deverbal nominalizations, although the proposal presented could be extended to other Romance languages.
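The AnCora-ES counts quoted above can be turned into proportions with a few lines of arithmetic. The following is only an illustrative sketch: the four counts come from the text, whereas the constant names, the helper `share`, and the function `corpus_summary` are hypothetical and are not part of the ADN-Classifier or of any AnCora tooling. The text does not say how the nominalization occurrences that are neither event nor result are labeled, so the sketch reports them only as a remainder.

```python
# Counts as reported in the text for the AnCora-ES corpus.
VERB_OCCURRENCES = 56_590
NOMINALIZATION_OCCURRENCES = 23_431
EVENT_OCCURRENCES = 1_845
RESULT_OCCURRENCES = 20_037


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 2)


def corpus_summary() -> dict[str, float]:
    """Derive simple proportions from the reported counts (illustrative only)."""
    predicates = VERB_OCCURRENCES + NOMINALIZATION_OCCURRENCES
    event_or_result = EVENT_OCCURRENCES + RESULT_OCCURRENCES
    # Occurrences counted as nominalizations but not as event or result;
    # the text does not attribute them to a specific label.
    remainder = NOMINALIZATION_OCCURRENCES - event_or_result
    return {
        "nominalizations_among_predicates_pct": share(
            NOMINALIZATION_OCCURRENCES, predicates
        ),
        "result_among_event_or_result_pct": share(
            RESULT_OCCURRENCES, event_or_result
        ),
        "event_among_event_or_result_pct": share(
            EVENT_OCCURRENCES, event_or_result
        ),
        "remainder_among_nominalizations_pct": share(
            remainder, NOMINALIZATION_OCCURRENCES
        ),
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        print(f"{name}: {value}")
```

Under these counts, result occurrences make up roughly nine in ten of the event-or-result occurrences, which is the imbalance behind the argument that discarding result nominalizations loses information.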
We base our study on theoretical hypotheses that we analyze empirically, and as a result we have developed three new resources: 1) the ADN-Classifier, the first tool that allows for the automatic classification of deverbal nouns as event or result nominalizations; 2) the AnCora-ES corpus enriched with the annotation of deverbal nominalizations according to their semantic denotation, the only Spanish corpus that incorporates this information; and 3) AnCora-Nom, a lexicon of deverbal nominalizations containing information about denotation types and argument structure. The ADN-Classifier can be used independently in NLP tasks, such as Coreference Resolution and Paraphrase Detection (Recasens and Vila 2010). For Coreference
doi:10.1162/coli_a_00112 fatcat:ufenrbaakrgn3ixavhln7murqa