Estimating Language Relationships from a Parallel Corpus. A Study of the Europarl Corpus

Taraka Rama, Lars Borin
2011 Nordic Conference of Computational Linguistics  
Since the 1950s, linguists have been using short lists (40-200 items) of basic vocabulary as the central component in a methodology which is claimed to make it possible to automatically calculate genetic relationships among languages. In the last few years these methods have experienced something of a revival, in that more languages are involved, different distance measures are systematically compared and evaluated, and methods from computational biology are used for calculating language family trees. In this paper, we explore how this methodology can be extended in another direction, by using larger word lists automatically extracted from a parallel corpus using word alignment software. We present preliminary results from using the Europarl parallel corpus in this way for estimating the distances between some languages in the Indo-European language family.
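To make the idea of a word-list distance concrete, here is a minimal sketch, not the authors' method. The abstract does not say which distance measure the paper uses, so a length-normalized Levenshtein distance averaged over aligned word pairs is assumed here as one common choice. The function names and the tiny English-German word list are hypothetical illustrations and are not taken from Europarl.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length, in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(pairs: Iterable[Tuple[str, str]]) -> float:
    """Mean normalized distance over aligned word pairs (assumed measure)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one aligned word pair")
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical aligned pairs, standing in for word-alignment output.
english_german = [("water", "wasser"), ("hand", "hand"), ("night", "nacht")]
print(language_distance(english_german))
```

A matrix of such pairwise distances between languages is the kind of input that tree-building methods from computational biology typically take.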
dblp:conf/nodalida/RamaB11 fatcat:sxcojpbwqvapxewy53uwvgk57y