Learning to cite framework: How to automatically construct citations for hierarchical data

Gianmaria Silvello
2017 Journal of the Association for Information Science and Technology  
The practice of citation is foundational for the propagation of knowledge and for scientific development, and it is one of the core aspects on which scholarship and scientific publishing rely. Within the broad context of data citation, we focus on the problem of automatically constructing citations for hierarchically structured data. We present the "learning to cite" framework, which enables the automatic construction of human- and machine-readable citations with different levels of coarseness. The
main goal is to reduce human intervention on data to a minimum and to provide a citation system general enough to work on heterogeneous and complex XML datasets. We describe how this framework can be realized by a system for creating citations to single nodes within an XML dataset and, as a use case, show how it can be applied in the context of digital archives. We conduct an extensive evaluation of the proposed citation system by analyzing its effectiveness from the correctness and completeness viewpoints, showing that it represents a suitable solution that can be easily employed in real-world environments and that reduces human intervention on data to a minimum.
Pre-print paper. Accepted for publication in JASIST, June 2016.
Nonetheless, traditional citation procedures cannot be straightforwardly applied to data citation, which calls for new methodologies and solutions [Buneman et al., 2014]. Data citation is of utmost importance for giving credit to data curators and for connecting scholarly publications to data with the purpose of sustaining and validating scientific claims and results. In particular, data citation has a fundamental role in the call for better transparency and reproducibility in science [Baggerly, 2010], which has been embraced by several fields such as Astronomy [Kurtz, 2012], Information Retrieval [Arguello et al., 2015], Database Systems [Freire et al., 2012], Biomedical Research [AMS, 2015], and Public Health Research [Carr and Littler, 2015], just to name a few.
Data citation has been predominantly analyzed from the scholarly publishing and the infrastructural viewpoints. The former has investigated policies and meanings of data sharing and citation as a support for reproducibility and validation in science [Borgman, 2012a]; the necessity to connect (cite) scientific publications with the data used for supporting the reported results [Lawrence et al., 2011; Callaghan et al., 2012], as in the case of enhanced publications [Vernooy-Gerritsen, 2009; Bardi and Manghi, 2015]; the role of data journals [Candela et al., 2015]; and how to give credit to data creators and curators [Borgman, 2012b]. From the infrastructural viewpoint, research has focused on the information and publishing infrastructures required to handle dynamic data changing through time [Auer et al., 2012; Pröll and Rauber, 2013], to use persistent identifiers for the identification of and access to data, and to realize data repositories to store, preserve and provide access to data [Burton et al., 2015].
Within the infrastructural viewpoint, data citation has started to be considered specifically from the computational perspective [Buneman et al., 2016], further strengthening the necessity to design tools and systems able to automatically construct both machine- and human-readable data citations (i.e., references or citation snippets), to cite data at different levels of coarseness, to cite evolving datasets, and to group and structure sets of citations. In this work, by focusing on XML structured datasets, we tackle the problem of automatically constructing citations, which is composed of two key challenges: (i) modeling the referent of a citation and (ii) the automatic generation of citations. The first challenge requires us to define a general framework for specifying what a citation to data should look like and what the elements that compose a citation are.
In a traditional setting, citations are structured around well-accepted concepts; for example, the elements composing a citation to a journal article may be title, authors, pages, and year. Data citations, by contrast, do not fit this framework: the elements structuring a citation may vary from dataset to dataset and may need to be decided on the fly by considering the specific characteristics of the dataset being cited. This challenge also comprises the need to cite data at different levels of coarseness, i.e., to produce deep citations [Buneman, 2006]. For instance, if we consider an XML file, then every attribute or data element at any level (the root, an internal node or a leaf) of the XML hierarchy is a viable citable unit (footnote 1). When XML is considered, all relevant information required to construct a citation may be directly available in the citable unit or, more likely, it may be distributed across coarser data elements related to the citable unit. The second challenge, i.e., the automatic generation of citations, requires defining a methodology to automatically produce data citations, because we cannot assume that the people citing the data understand the complexity of the dataset, know how data should be cited in a specific context, and can select the relevant information to form a complete and correct citation. To the best of our knowledge, only one solution for addressing the problem of the automatic construction of citations has been defined [Buneman, 2006; Buneman and Silvello, 2010], and it is based on a rule-based system to build citations for XML files. This approach exploits the hierarchical nature of XML files to cite data at different levels of coarseness, create human- and machine-readable citations and associate descriptive metadata with the cited data.
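To make the idea of a citable unit whose citation information is spread over coarser elements more concrete, the following is a minimal sketch in Python. It is not the rule-based system of Buneman and Silvello, nor the "learning to cite" framework; it only assumes that the useful information sits in attributes along the path from the root to the cited node. The sample document, its tag and attribute names, and the helpers `path_to_root` and `build_citation` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: tag and attribute names are illustrative only.
SAMPLE = """
<archive title="State Archive" curator="A. Rossi">
  <fonds title="Trade Office">
    <series title="Letters" date="1850-1860">
      <item id="i42" title="Letter to the Consul"/>
    </series>
  </fonds>
</archive>
"""


def path_to_root(root, target):
    """Return the elements from the root down to `target`, inclusive."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    path = [target]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return list(reversed(path))


def build_citation(root, target, fields=("curator", "title", "date")):
    """Collect the chosen attributes along the root-to-target path.

    Returns a (human_readable, locator) pair. The choice of `fields` is an
    assumption made for this sketch, not something prescribed by the paper.
    """
    path = path_to_root(root, target)
    parts = [node.attrib[name]
             for node in path
             for name in fields
             if name in node.attrib]
    # A simple machine-readable locator: the tag path from the root.
    locator = "/" + "/".join(node.tag for node in path)
    return ", ".join(parts), locator


root = ET.fromstring(SAMPLE.strip())

# Fine-grained citable unit: a leaf of the hierarchy.
item = root.find(".//item[@id='i42']")
item_text, item_locator = build_citation(root, item)

# Coarser citable unit: an internal node of the same hierarchy.
series = root.find(".//series")
series_text, series_locator = build_citation(root, series)

print(item_text)
print(item_locator)
print(series_text)
print(series_locator)
```

Citing the leaf and citing the internal node reuse the same ancestor information, which is why the hierarchy lends itself to citations at different levels of coarseness. A fixed field list such as the one assumed here is exactly what tends to break on heterogeneous collections, where tags and nesting differ from file to file.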
The rule-based approach is computationally efficient and effective for XML, but it has some limitations when it comes to being adopted by practitioners: (i) citation rules have to be embedded in the XML files, and thus a non-negligible amount of work is required to prepare the data in order to make it citable; (ii) the definition of the rules requires knowledge of both the data domain and XML technology; (iii) the heterogeneity of the XML files (e.g., differences in the use of tags, tag nesting and/or the intended tag semantics) is directly reflected in the rules, which need to be customized to adapt to it, and thus general rules may not apply to all the XML files in a given collection. We propose the "learning to cite" framework, which enables the automatic construction of human- and machine-readable citations to XML data with different levels of coarseness, with the final goal of reducing human intervention on data to a minimum and providing a citation system general enough to work on different data collections. The basic
Footnote 1: In this work, any element in a dataset that can be cited is considered a citable unit.
doi:10.1002/asi.23774 fatcat:aehsjjaj2vbhtdbkd7odwel6vu