Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection

Mike Kestemont, Michael Tschuggnall, Efstathios Stamatatos, Walter Daelemans, Günther Specht, Benno Stein, Martin Potthast
2018 Conference and Labs of the Evaluation Forum  
Author identification attempts to reveal the authors behind texts. It is an emerging research area with applications in literary research, cyber-security, forensics, and social media analysis. In this edition of PAN, we study two tasks: the novel task of cross-domain authorship attribution, where the texts of known and unknown authorship belong to different domains, and style change detection, where single-author and multi-author texts are to be distinguished. For the former task,
we make use of fanfiction texts, a large part of contemporary fiction written by non-professional authors who are inspired by specific well-known works, which enables us to control the domain of texts for the first time. We describe a new corpus of fanfiction texts covering five languages (English, French, Italian, Polish, and Spanish). For the latter task, a new data set of Q&As covering multiple topics in English is introduced. We received 11 submissions for the cross-domain authorship attribution task and 5 submissions for the style change detection task. A survey of participant methods and analytical evaluation results are presented in this paper.

Closed-set authorship attribution is a task with a rich relevant literature [42, 29]. Two previous editions of PAN included corresponding shared tasks [1, 19]. However, they only examined the case where both training and test documents belong to the same domain, as is the case for the vast majority of published studies in this area. Cross-domain authorship attribution has been sporadically studied in the last decade [3, 31, 37, 38, 39, 45, 46]. In such cases, training and test texts belong to different domains, where a domain may refer to topic, genre, or language. In this section we focus on the construction of suitable resources for evaluating cross-domain attribution methods.

The most frequent scenario examined in previous cross-domain attribution studies considers cross-topic conditions. To control topic, general thematic categories are usually defined and all texts are pre-assigned to a topic. For example, Koppel et al. use three thematic categories (ritual, business, and family) of religious Hebrew-Aramaic texts [24]. Newspaper articles are considered by Mikros and Argiri [27] (classified into two thematic areas: politics and culture) and Stamatatos (classified into four areas: politics, society, world, and UK) [45].
Another approach is to use a controlled corpus where individuals are asked to write texts on a specific, well-defined topic [47]. The latter provides fine-grained control over topic; on the other hand, such controlled corpora are relatively small. Another important cross-domain perspective concerns cross-genre conditions. In general, it is hard to collect texts by several authors in different genres. Kestemont et al. make use of literary texts (theater plays and literary prose) [21], while Stamatatos explores differences between opinion articles and book reviews published in the same newspaper [45]. Another idea is to use social media texts, based on the fact that many users are active in different social networks (e.g., Facebook and Twitter) [31]. Finally, a controlled corpus can be built, where each subject (author) is asked to write a text in a set of genres (e.g., email, blog, essay) [47]. The most extreme case concerns cross-language conditions, where training and test texts are in different languages [3]. A convenient source of such cases is provided by novels that have been translated into other languages, in the hope that the translator's preferences do not significantly affect the style of the original author. To the best of our knowledge, so far there is no cross-domain authorship attribution study using fanfiction texts.

With respect to intrinsic analyses of texts, PAN has included several shared tasks in recent years. Starting from intrinsic plagiarism detection [33], the focus went from clustering authors within documents [48] to the detection of positions where the style, i.e., the authorship, changes [50]. In general, all those tasks imply an intrinsic, stylometric analysis of the texts, as no reference corpora are available.
Thus, stylistic fingerprints are created that include lexical features like character n-grams (e.g., [43]), word frequencies (e.g., [16]) or average word/sentence lengths (e.g., [51]), syntactic features like part-of-speech (POS) tag frequencies/structures (e.g., [49]), or structural features such as indentation usage (e.g., [51]). Approaches specifically tackling the similar style breach detection task at PAN 2017 also utilize typical stylometric features such as bags of character n-grams, frequencies of function words, and other lexical metrics, which are then processed by algorithms that detect style borders [7, 22] or outliers [35].
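To illustrate, a minimal stylometric fingerprint combining two of the feature families mentioned above (character n-gram frequencies and average word/sentence lengths) could be sketched as follows. The function name and the particular feature choices are hypothetical, intended only to convey the general idea rather than the exact feature sets of the cited systems:

```python
from collections import Counter
import re

def stylometric_fingerprint(text, n=3, top_k=5):
    """Illustrative sketch: relative frequencies of the top_k character
    n-grams, plus average word and sentence lengths."""
    # Character n-gram relative frequencies (lexical features)
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(ngrams.values())
    top_ngrams = {g: c / total for g, c in ngrams.most_common(top_k)}

    # Simple length-based features
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sentences)

    return {"ngrams": top_ngrams,
            "avg_word_len": avg_word_len,
            "avg_sent_len": avg_sent_len}
```

Fingerprints of this kind, computed over sliding windows of a document, can then be compared against each other to flag positions where the style appears to change.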