Enhanced Representations and Efficient Analysis of Syntactic Dependencies Within and Beyond Tree Structures

Tianze Shi
2021
As a fundamental task in natural language processing, dependency-based syntactic analysis provides useful structural representations of textual data. It is supported by an abundance of multilingual annotations and statistical parsers. A common representation format widely adopted in contemporary computational dependency-based syntactic analysis is the single-rooted directed tree, where each edge represents a dependency relation. These governor-dependent relations capture bilexical syntactic relations and facilitate efficient parsing algorithms that break down the analysis of whole trees into identifications of individual dependency edges. However, edge-focused dependency-tree representations are known to face practical challenges in properly handling certain linguistic phenomena involving multiple dependency edges, such as valency patterns and certain types of multi-word expressions. Further, dependency tree structures fall short in explicitly representing coordination structures, argument sharing in control and raising constructions, and so on.

This thesis aims to address the aforementioned issues and to improve dependency-based syntactic analysis via augmented and enhanced representations within and beyond tree structures, which involves new challenges in the design of computational models, learning regimes from empirical data, and inference procedures to derive the desired structures. To guide parsers to consider wider structural contexts and to recognize linguistic constructions as a whole, in addition to predicting individual dependency relations, this thesis introduces two parser designs that combine parsing and tagging modules. In the first parser, taggers are trained to predict valency patterns, which encode the number, types, and linear orderings of each word's dependent syntactic relations (e.g., a transitive verb in English has a subject to its left and a direct object to its right). This method is demonstrated to improve precision on the selected subsets of dependency relations used in the valency patterns. The second effort focuses on headless multi-word expressions (MWEs), which are typically identified with taggers when full syntactic analysis is not required. By integrating a tagging view of the MWEs into the decoding process, the parsers become more accurate in MWE identification.
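To make concrete the kind of information a valency pattern encodes, the following is a minimal Python sketch. The edge format, the relation labels, and the left/right markers are illustrative assumptions for exposition, not the representation or inventory used in the thesis.

```python
# Hypothetical sketch: extracting a valency pattern for each head word
# from a dependency tree given as (head, dependent, relation) edges.
from collections import defaultdict

def valency_patterns(edges):
    """edges: (head_index, dependent_index, relation) tuples with
    1-based word indices and 0 denoting the artificial root.
    Returns, per head, a pattern string listing its dependents in
    linear order, with '<' for left dependents and '>' for right ones."""
    deps = defaultdict(list)
    for head, dep, rel in edges:
        deps[head].append((dep, rel))
    patterns = {}
    for head, items in deps.items():
        parts = []
        for dep, rel in sorted(items):  # sort by position = linear order
            side = "<" if dep < head else ">"
            parts.append(f"{rel}{side}")
        patterns[head] = " ".join(parts)
    return patterns

# "She reads books": the transitive verb "reads" (word 2) has a
# subject to its left and a direct object to its right.
edges = [(2, 1, "nsubj"), (2, 3, "obj"), (0, 2, "root")]
print(valency_patterns(edges)[2])  # nsubj< obj>
```

A tagger predicting such pattern strings as word-level labels gives the parser a view of each word's full set of dependents at once, rather than one edge at a time.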
Certain syntactic constructions, such as coordination, pose extra representational challenges for dependency trees, and this thesis explores two types of enhanced structures beyond dependency trees and presents methods to analyze natural language texts into those formats. The Enhanced Universal Dependencies format removes the tree constraint, and the target structures become connected graphs. This thesis details the design of a tree-graph integrated-format parser, which serves as the basis of the winning solution at the IWPT 2021 shared task, in combination with other techniques including a two-stage fine-tuning strategy and text pre-processing pipelines powered by pre-training. Finally, this thesis revisits Kahane's (1997) idea of bubble trees, which mark span boundaries on top of otherwise dependency-based structures, to provide an explicit mechanism for representing coordination structures. The transition-based system developed to parse into such bubble tree structures shows improvement on the task of coordination structure prediction.

ingly unremarkable "little things", how careful and thorough she is in exhausting all possible ways to investigate and interpret research questions, experiment results, and beyond, and how caring she is for the people she knows, including her students, members of Cornell, and the entire NLP community. It is really hard to verbally describe how fortunate I feel about having Lillian as my advisor. She writes on her homepage "My debt to [my students] is unbounded", but in fact I am the one owing an unbounded debt to Lillian. Thank you, Lillian! My minor advisor in linguistics, Mats Rooth, is one of my academic role models at Cornell as well. He is incredibly knowledgeable and enthusiastic about linguistic research. I would also like to thank my other former and current committee members, Erik Andersen and Karthik Sridharan, for their feedback. My Ph.D. experience has benefited greatly from input from many people.
I thank the entire Cornell NLP group and the NLP seminar attendees, espe-
doi:10.7298/jm0e-tj72