Web Genre Benchmark Under Construction
Journal for Language Technology and Computational Linguistics
The project presented in this article focuses on the creation of web genre benchmarks (a.k.a. web genre reference corpora or web genre test collections), i.e. newly conceived test collections against which it will be possible to judge the performance of future genre-enabled web applications. The creation of web genre benchmarks is of key importance for the next generation of web applications because, at present, it is impossible to evaluate existing and in-progress genre-enabled prototypes. We
... suggest focusing on the following key points: 1) propose a characterisation of genre suitable for digital environments and empirical approaches, shared by a number of genre experts working in automatic genre identification; 2) define the criteria for the construction of web genre benchmarks and draw up annotation guidelines; 3) create web genre benchmarks in several languages; 4) validate the methodology and evaluate the results. We describe work in progress and our plans for future development. Since it is hard to anticipate the difficulties that will arise when developing a large resource, we present our ideas, our current views on genre issues and our first results with the aim of stimulating a proactive discussion, so that the stakeholders, i.e. researchers who will ultimately benefit from the resource, can contribute to its design.
The Concept of Genre
The concept of genre is hard to agree upon. Many interpretations have been proposed since Aristotle's Poetics without reaching any definite conclusions about the inventory of genres or even the principles for classifying documents into genres. Some studies put the number of genres to , (Görlach, ) or even , (Adamzik, ). Additionally, the lack of an agreed definition of genre blurs the boundaries between the term 'genre' and neighbouring terms, such as 'register', 'domain', 'topic', and 'style'. The inventory of genres can be based on linguistic theories or on 'folksonomies', i.e. labels used by users (Rosso and Haas, forthcoming). For instance, users are confident with a term like novel, whereas linguistic researchers may prefer functional terms, like recreation, to indicate a wider range of texts aimed at recreational reading. Recently, definitions of genre have been adapted to the new digital environments, e.g., (Yates and Orlikowski, ; Erickson, ; Toms and Campbell, ; Beghtol, ; Heyd, ; Bateman, ).
JLCL 2009 - Volume 24 (1) - 129-145, Santini, Sharoff
Undoubtedly, the situation on the web is more difficult than in the offline world: the web is new, genres are fluid, and web documents are very often characterised by a high level of hybridism, by the fragmentation of textuality across several documents, and by the impact of technical features such as hyperlinking and posting facilities. Nevertheless, as stressed by Karlgren (), the term 'genre' is established and generally understood, at least intuitively, by web users, and it is currently employed in many web-based real-world environments. For instance, online bookshops, like Amazon, organise their catalogues by genre, even if their genres are not defined in a systematic way, e.g., in addition to proper genres the Amazon list contains subject labels, like Arts, Computing or Science. At present, many researchers in different fields are working with genres of electronic documents, such as FAQs, e-shops, home pages, or conference websites, in order to better satisfy users' needs in a number of application areas, such as information retrieval, e.g., (Stamatatos et al., ; Meyer zu Eissen and Stein, ), digital libraries, e.g., (Rauber and Müller-Kögler, ; Kim and Ross, forthcoming), and information extraction, e.g., (Maynard et al., ; Gupta et al., ). Arguably, genre is a fundamental concept in information management and deserves in-depth investigation.
Genre-Enabled Prototypes
Attempts at automatic genre identification on the Brown Corpus start with (Karlgren and Cutting, ; Kessler et al., ). The first prototype of a genre-enabled application for the web was created in (Karlgren et al., ) (see DropJaw below). More recently, a genre add-on that can be installed on a general-purpose web browser (namely Mozilla Firefox) has been completed at Bauhaus University Weimar, Germany (Stein et al., forthcoming) (see WEGA below).
In both cases, these applications could not and cannot be fully evaluated because of the absence of web genre benchmarks enabling the objective assessment of their effectiveness. Yet, the design and the construction of genre-enabled prototypes show the potential of genre in real-world applications. All in all, four prototypes have been described and documented, namely: DropJaw, Hyppia, X-Site and WEGA.
DropJaw (for English): Karlgren and co-workers (Karlgren et al., ) built a fully functional prototype system, DropJaw, to experiment with iterative search on the web. DropJaw bases its searches for web documents on terms entered by the user, as in a traditional system. However, rather than producing ranked lists of output based on term occurrence, DropJaw displays the distribution of the resulting set over two dimensions: dynamically generated topical clusters and document genres. The two-dimensional document space is displayed on a work board or matrix for further user processing.
Hyppia (for English): The Hyppia demo allows news articles to be filtered and searched based on genre information. The genre classes in this demo are considered to be "whether a document is subjective or objective" (Finn et al., ; Finn and Kushmerick, ). Dimitrova and Kushmerick () contributed to the Hyppia project by showing how shallow text classification techniques can be used to sort the documents returned by web search engines according to genre dimensions, such as the degree of expertise assumed by the document, the amount of detail presented, or whether the document reports mainly facts or opinions.
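
As a rough illustration of the two ideas described above, the following Python sketch tabulates a result set over topical clusters and genres (in the spirit of DropJaw's matrix) and orders results along a single genre dimension (in the spirit of the Hyppia work). It is not the implementation of either system: the `Result` record, its fields, the 0-to-1 subjectivity scale and the sample data are all assumptions made for this example, and the topic and genre labels are taken as given rather than computed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One search hit with hypothetical, pre-assigned labels."""
    url: str
    topic: str           # topical cluster, assumed to be computed elsewhere
    genre: str           # genre label, assumed to come from some classifier
    subjectivity: float  # assumed scale: 0.0 = factual, 1.0 = opinionated


def topic_genre_matrix(results):
    """Count results per (topic, genre) cell of a two-dimensional display."""
    counts = Counter((r.topic, r.genre) for r in results)
    topics = sorted({r.topic for r in results})
    genres = sorted({r.genre for r in results})
    # Counter returns 0 for cells that have no results.
    matrix = [[counts[(t, g)] for g in genres] for t in topics]
    return topics, genres, matrix


def sort_by_dimension(results, dimension="subjectivity"):
    """Order results along one numeric genre dimension, lowest score first."""
    return sorted(results, key=lambda r: getattr(r, dimension))


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        Result("http://example.org/1", "cars", "e-shop", 0.2),
        Result("http://example.org/2", "cars", "FAQ", 0.1),
        Result("http://example.org/3", "wildlife", "home page", 0.7),
        Result("http://example.org/4", "cars", "e-shop", 0.4),
    ]
    topics, genres, matrix = topic_genre_matrix(sample)
    print("genres:", genres)
    for topic, row in zip(topics, matrix):
        print(topic, row)
    for r in sort_by_dimension(sample):
        print(r.subjectivity, r.url)
```

In a real system both the genre labels and the dimension scores would be produced by classifiers, and it is exactly their quality that cannot be measured objectively without a web genre benchmark.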