Basic Word Order Frequencies and Transition Probabilities in the Languages of the World

Harald Hammarström
Traditionally, typologists look at the frequencies of various types of languages of the world to gain insight into possible human languages. At least potentially, this picture might be skewed by "historical accidents" that surface as large-scale areal relationships. Whether or not this is an actual problem, one solution to it has already been suggested, i.e., a method to estimate the natural incidence of various types of languages that is [meant to be] immune to historical accidents. Originally proposed by Maslova (2000) and taken up by Cysouw (2007), the idea is to change from estimating probabilities of occurrence to estimating probabilities of transition. At the center of this approach lies the assumption that there is a constant probability of change inherent in every linguistic parameter, henceforth CPCH ("constant probability of change hypothesis"). This further allows the interpretation of frequent types as stable, i.e., the constant probability distribution favours changes to the type and disfavours changes from it, versus infrequent types as less stable, i.e., the constant probability distribution disfavours changes to the type and favours changes from it (Maslova and Nikitina submitted).

In addition to CPCH, the Maslova/Cysouw model also allows birth and death effects, henceforth BDE ("birth-death effects"). That is, languages, in addition to transitioning in features, can also die and/or fork into two or more languages. Thus, the languages we find today are not only the result of independent feature transitions from earlier versions of the same languages; they are the surviving members of isolate languages or languages which inherited features from an ancestor language. The specific rates of birth and death are kept open, but we may assume that birth and death processes are independent of features. For example, a language is no more (or less) likely to die (or fork) if it has SVO rather than some other value. We do not question BDE, but we will attempt to show that CPCH is not valid.

We have put together three databases on basic word order: 1. Ethnologue: This database contains 1097 data points (Gordon 2005). Sources for the data points are not indicated. It is not clear how the data points/languages were selected. 2. WALS: This database contains 1203 data points (Dryer 2005). Sources for the data points are indicated.
It is not clear how the data points/languages were selected, but it may be guessed that it is some kind of convenience sample. 3. Hammarström: This database contains 338 data points (Hammarström 2007a). Sources for the data points are indicated. The languages were sampled at random, one for every attested language family in the world.

These three databases put together, without overlap, amount to 2086 languages, possibly the biggest database of a syntactic parameter so far assembled in linguistic typology. Using the classification of Hammarström (2007b), these 2086 languages fall into 338 distinct families.¹ 198 of the families have only one [language with a] data point (henceforth 'isolates'), and 140 of them have more than one. Intuitively, the word order distribution in the Hammarström sample, the isolates, and the majority word order for the non-isolates should agree. This property is

¹ According to this classification, a family is a set of languages which have been shown, in publication, using orthodox comparative methodology to be genetically related. This classification, in general, disregards subgrouping matters.
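
The split into isolates and non-isolates, and the majority word order per non-isolate family, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record layout, the toy rows (which are not taken from any of the three databases), the function names, and in particular the treatment of ties, which the text above does not specify (here a tied family is simply given no majority).

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (language, family, basic word order).
# Illustrative only; not drawn from Ethnologue, WALS, or the Hammarström sample.
RECORDS = [
    ("lang_a1", "FamilyA", "SOV"),
    ("lang_a2", "FamilyA", "SOV"),
    ("lang_a3", "FamilyA", "SVO"),
    ("lang_b1", "FamilyB", "SVO"),
    ("lang_b2", "FamilyB", "SVO"),
    ("lang_c1", "FamilyC", "VSO"),
    ("lang_d1", "FamilyD", "SOV"),
]


def split_families(records):
    """Group word orders by family, then separate one-member families
    ('isolates' in the sense used above) from families with more data points."""
    by_family = defaultdict(list)
    for _language, family, order in records:
        by_family[family].append(order)
    isolates = {f: orders[0] for f, orders in by_family.items() if len(orders) == 1}
    non_isolates = {f: orders for f, orders in by_family.items() if len(orders) > 1}
    return isolates, non_isolates


def majority_order(orders):
    """Return the most frequent word order, or None on a tie.
    Returning None for ties is an assumption of this sketch."""
    counts = Counter(orders)
    top = max(counts.values())
    winners = [order for order, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None


def relative_frequencies(orders):
    """Map each word order to its share of the given collection."""
    counts = Counter(orders)
    total = sum(counts.values())
    return {order: count / total for order, count in counts.items()}


isolates, non_isolates = split_families(RECORDS)
majorities = {family: majority_order(orders) for family, orders in non_isolates.items()}

# The two family-level distributions that one would then compare.
isolate_dist = relative_frequencies(isolates.values())
majority_dist = relative_frequencies(m for m in majorities.values() if m is not None)
```

Counting one value per family, rather than one per language, keeps large families from dominating the comparison; whether and how the resulting distributions should be tested for agreement is left open here.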