Comparative Summarization of Document Collections [article]

Umanga Bista, University, The Australian National
Comparing documents is an important task that help us in understanding the differences between documents. Example of document comparisons include comparing laws on same related subject matter in different jurisdictions, comparing the specifications of similar product from different manufacturers. One can see that the need for comparison does not stop at individual documents, and extends to large collections of documents. For example comparing the writing styles of an author early vs late in
more » ... r life, identifying linguistic and lexical patterns of different political ideologies, or discover commonalities of political arguments in disparate events. Comparing large document collections calls for automated algorithms to do so. Every day a huge volume of documents are produced in social and news media. There has been a lot of research in summarizing individual document such as a news article, or document collections such as a collection of news articles on a related topic or event. However, comparatively summarizing different document collections, or comparative summarization is under-explored problem in terms of methodology, datasets, evaluations and applicability in different domains. To address this, in this thesis, we make three types of contributions to comparative summarization, methodology, datasets and evaluation, and empirical measurements on a range of settings where comparative summarization can be applied. We propose a new formulation the problem of comparative summarization as competing binary classifiers. This formulation help us to develop new unsupervised and supervised methods for comparative summarization. Our methods are based on Maximum Mean Discrepancy (MMD), a metric that measures the distance between two sets of datapoints (or documents). The unsupervised methods incorporate information coverage, information diversity and discriminativeness of the prototypes based on global-model of sentence-sentence similarity, and be optimized with greedy and gradient methods. We show the efficacy of the app [...]
doi:10.25911/7mq1-fr27 fatcat:5i5heoejyngopcjkhgq5xmk4uy