Corpus Sharing Strategy for Descriptive Linguistics

Kazushi Ohya
2015 Journal of the Japanese Association for Digital Humanities  
This paper introduces a data sharing strategy based on a conversion service rather than on a shared application, scheme, or ontology, the three approaches that dominate proposals for language documentation. Although these three methods have been the basic tactics for sharing corpora, they have a conceptual flaw from the standpoint of descriptive linguistics. In this paper we report the results of a previous project, the LingDy project, and propose a basic concept for a corpus sharing strategy that supports personal diachronic data sharing. This paper is a revised version of a handout distributed at JADH2012, so readers should note that its content reflects results as of 2012.

Introduction

This paper is a report on the three-year LingDy project (2008-2010) on the documentation of six endangered languages, and a progress report on a subsequent four-year project (2011-2014). As a result of the previous project, we confirmed that (1) a method of transforming language data created by individual linguists into data in a shared format, such as is used in a global-scale archive, is an important and fundamental research target of language documentation; (2) it is difficult to realize this mechanism by the three ways hitherto adopted: sharing of application, scheme, and ontology; and (3) the key issues of language documentation in terms of data management are demarcation change and sound data handling, which pose key problems in computer science and linguistics respectively. These are the bases of the current objectives for our ongoing (in 2012) four-year project. In the following sections, first, we report on the previous project and show drawbacks of a broadly used scheme proposed by many standards; second, we confirm a philosophy of descriptive linguistics and an idea of personal diachronic data sharing; and finally, we propose an idea for sharing data based on data conversion services.

Language Documentation in the LingDy Project

Since 2008, we have been experimenting with a computational environment for the study of endangered languages, a practice which in recent years has come to be called language documentation (Gippert 2006). In our understanding, language documentation is a process of recording language information using computers.
This project, with support from the LingDy project at Tokyo University of Foreign Studies, aimed (1) to improve the environments of individual documentation activities by introducing new tools and teaching their usage, and (2) to seek a framework of archive systems for multiple endangered languages which will be used for typology research and for sharing the data with the language communities. The targeted languages were Yukaghir, Alyutor, Itelmen, Hezhen, Xibe, and Tiddim Chin. Our project's approach to language study is descriptive linguistics, unlike natural language processing studies, which use a prescriptive approach. This difference is a vital point for language documentation to overcome. Through this project, in addition to basic field linguistic activities, we (1) made a shared list of data items to be stored in our database, which will be used for both individual language studies and inter-language studies, such as typology; (2) provided opportunities to learn ToolBox, ELAN (Brugman et al. 2002 and Wittenburg et al. 2004), and Perl; (3) made a Java application for batch sound-cutting processing; (4) made a Java application to convert data from a ToolBox format into XML, which has functions for data validation and normalization; and (5) made an experimental database system with Berkeley DB XML as a database engine and XQuery as a query language (Ohya 2009). We presupposed plain text data with tags produced by ToolBox as the input data for data transformation (fig. 1).
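The conversion step in item (4) can be illustrated with a minimal sketch. It is not the project's Java converter. It assumes ToolBox-style plain text in which each field starts with a backslash marker, that records begin with a `\ref` marker, and that `\tx` is a required field. The marker names, the XML element names (`corpus`, `record`, `field`), and the use of Unicode NFC as the normalization step are all assumptions made for illustration.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Assumed marker that opens a new record; real projects may use another one.
RECORD_MARKER = "ref"


def parse_records(text, record_marker=RECORD_MARKER):
    """Split backslash-marked plain text into records of (marker, value) pairs."""
    records = []
    current = None
    for raw in text.splitlines():
        # Normalization (assumed here to mean Unicode NFC plus trimming).
        line = unicodedata.normalize("NFC", raw).strip()
        if not line:
            continue
        if line.startswith("\\"):
            parts = line[1:].split(None, 1)
            if not parts:
                continue  # a lone backslash carries no marker
            marker = parts[0]
            value = parts[1].strip() if len(parts) > 1 else ""
            if marker == record_marker:
                current = []
                records.append(current)
            if current is None:
                continue  # header fields before the first record are skipped
            current.append((marker, value))
        elif current:
            # A line without a marker continues the previous field.
            last_marker, last_value = current[-1]
            current[-1] = (last_marker, (last_value + " " + line).strip())
    return records


def validate(record, required=("ref", "tx")):
    """Return the required markers (an assumed list) missing from a record."""
    present = {marker for marker, _ in record}
    return [marker for marker in required if marker not in present]


def to_xml(records):
    """Build an XML tree; markers become attributes so any marker name is safe."""
    root = ET.Element("corpus")
    for record in records:
        record_el = ET.SubElement(root, "record")
        for marker, value in record:
            field_el = ET.SubElement(record_el, "field", marker=marker)
            field_el.text = value
    return root


SAMPLE = r"""\_sh v3.0  400  Text

\ref 001
\tx  first line of text
      continued here
\ft  free translation

\ref 002
\ft  translation only
"""

if __name__ == "__main__":
    parsed = parse_records(SAMPLE)
    for number, record in enumerate(parsed, start=1):
        missing = validate(record)
        if missing:
            print(f"record {number}: missing markers {missing}")
    print(ET.tostring(to_xml(parsed), encoding="unicode"))
```

Storing the marker as an attribute rather than as an element name keeps the output well-formed even when a marker is not a legal XML name. A real converter would likely map markers onto the project's shared list of data items instead.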
doi:10.17928/jjadh.1.1_68 fatcat:jrbldprsanb4pppjj3xdcdydkm