Computational approaches to the comparison of regional variety corpora: prototyping a semi-automatic system for German [article]

Stefanie Anstein, Universität Stuttgart
Regional varieties of pluri-centric languages such as German are generally very similar in their structure and in the linguistic phenomena that occur. Extracting the differences between them is thus crucial, e.g., for variety documentation, lexicography, or didactics. This thesis explores computational approaches to the comparison of regional variety corpora in order to support manual analyses by variety linguists. A feasibility study on semi-automatic corpus comparison has been [...] by developing a prototype system, in order to determine on which levels of linguistic description such automation is possible and to what extent. Further research aims to show which features of the input corpora produce the best results and addresses the 'relevance ranking' of the output. In addition, the potential of integrating available standard tools and the transferability of the system to other languages have been explored. Written corpora, which have become increasingly available through initiatives such as Korpus Südtirol, serve as the empirical basis for extracting differences semi-automatically, which is more efficient and more objective than a purely manual approach. The results yielded by the prototype system Vis-À-Vis assist variety linguists in their detailed qualitative analyses by reducing corpus comparison output to presumably relevant phenomena. In regional variety linguistics, numerous manual approaches have been applied and various individual studies have been carried out, followed more recently by a growing number of automated studies based on corpora that are being developed for pluri-centric languages. In computational linguistics, automated systems have long been used to analyse and compare corpora in order to find differences on various levels of linguistic description (e.g., for register studies), with promising results. Vis-À-Vis applies linguistic pattern extraction as well as statistical output comparison, combining existing standard tools w [...]
doi:10.18419/opus-3240 fatcat:ol2toaqv2fbvhm2ofi4intlqd4