Crowdsourcing Interaction Logs to Understand Text Reuse from the Web

Martin Potthast, Matthias Hagen, Michael Völske, Benno Stein
2013 Annual Meeting of the Association for Computational Linguistics  
We report on the construction of the Webis text reuse corpus 2012 for advanced research on text reuse. The corpus compiles manually written documents obtained from a completely controlled, yet representative environment that emulates the web. Each of the 297 documents in the corpus is about one of the 150 topics used at the TREC Web Tracks 2009-2011, thus forming a strong connection with existing evaluation efforts. Writers, hired at the crowdsourcing platform oDesk, had to retrieve sources for a given topic and to reuse text from what they found. The corpus includes detailed interaction logs that consistently cover the search for sources as well as the creation of documents. This will allow for in-depth analyses of how text is composed if a writer is at liberty to reuse texts from a third party, a setting which has not been studied so far. In addition, the corpus provides an original resource for the evaluation of text reuse and plagiarism detectors, where currently only less realistic resources are employed.
dblp:conf/acl/PotthastHVS13 fatcat:ictfs2hjxfeobnj6fucfowtpzi