On Sentence-Length as a Statistical Characteristic of Style in Prose: With Application to Two Cases of Disputed Authorship

G. Udny Yule
1939 Biometrika  
ONE element of style which seems to be characteristic of an author, in so far as can be judged from general impressions, is the length of his sentences. This author develops his thought in long, complex and wandering periods: that finds sufficient for his purpose a sequence of sentences that are brief, clear and perspicuous. Since the length of a sentence can be readily measured, for practical purposes, by the number of words, it occurred to me that it would be of interest to subject this impression to statistical investigation.

In carrying out the investigation, I met with more difficulties than I had foreseen. There are two terms used above: (1) Sentence, (2) Word. What is a sentence? What is a word, or what for present purposes is to be regarded as a word?

Sentence. Let me cite the New English Dictionary:

SENTENCE. sb. 6. A series of words in connected speech or writing, forming the grammatically complete expression of a single thought; in popular use often (= Period sb. 10) such a portion of a composition or utterance as extends from one full stop to another. In Grammar, the verbal expression of a proposition, question, command, or request, containing normally a subject and a predicate (though either of these may be omitted by ellipsis). In grammatical use, though not in popular language, a sentence may consist of a single word.... English grammarians usually recognize three classes: simple sentences, complex sentences (which contain one or more subordinate clauses), and compound sentences (which have more than one subject or predicate).

From these definitions I conclude, I hope rightly, that we may drop the term "period" and use the term "sentence" to cover any sentence (or as I should have been inclined to write "period"), however complex and however compound in the senses defined. It is convenient to be able to avoid a term which to a statistician would generally suggest a different meaning.

Now, not being a grammarian but just one of the populace, I confess that I started with the popular notion of a "sentence" in this general sense: "such a portion of a composition as extends from one full stop to another", and thought I would have nothing to do but tot up the words from full stop to full stop. The first definition, however, reads: "the grammatically complete expression of a single thought." I feel some doubts as to the "single thought".
(Is not "I am tired and hungry" a sentence, and does it not convey two thoughts, the thought of being tired and the thought of being hungry?) But the "grammatically complete expression" surely is essential to make a word-series a sentence; the word-series must be what Webster calls a "sense unit", and the trouble is that, especially in older works, "a portion of a composition" which "extends from one full stop to another" is often not the grammatically complete expression of anything. When the author or compositor has used punctuation in this fashion it is no longer possible simply to add up words from one full stop to the next, paying little or no attention to sense: it is necessary for the reader frequently to pull up and ask himself if the words just read do or do not form a sentence, and if they do not, what are in fact the limits of the sentence within which they must be assumed to lie. I need hardly point out how much this increases labour, and even, if the sentences are very long and complicated, brings in largely the element of personal judgement. Two readers, at least unskilled readers like myself, may well differ as to where a given sentence terminates.

Here is quite a simple illustration of the difficulty from a modern essay on The Politics of Burns (ref. 1, at end of paper):

There are several points here all at once calling for notice, and seldom getting it from friends of the poet: The extraordinary talent for history shown by Robert Burns. His attention to British History in preference to Scottish. The originality of his views.

In this passage there are four word-series, the first divided from the second only by a colon (though the second begins with a capital letter), the second divided from the third, and the third from the fourth, by full stops. But neither the second, nor the third, nor the fourth word-series is a grammatically complete expression.
The whole passage must be taken together, as it seems to me, as one single sentence. I am of course simply illustrating my difficulty, not criticizing the punctuation.

On the other hand, where an author has written a very long and meandering sentence, a question may well arise between two different readers as to whether a halt should not be called in the middle, and a full stop entered where author or compositor has placed only a colon. I say author or compositor, for it must not be assumed that one is necessarily laying sacrilegious hands on the deliberate construction of the author himself. "So far as punctuation is concerned," says McKerrow (ref. 2), "there seems very little evidence that many authors exercised any care about it whatever. After all, even at present, few authors trouble to punctuate their MSS. with any care or consistency. Such punctuation as is found in ordinary MSS. of the sixteenth and seventeenth centuries is indeed most erratic and seldom goes beyond full stops at the end of most of the sentences and some indication of the caesura in verse." I had, before I started the present work, expected that this comment would apply much more to intermediate punctuation than to full stops, trusting that authors would at least insert "full stops at the end of most of their sentences". But that it applies to both was enforced on me by different versions of the short tract by Gerson, De Meditatione Cordis, in the edition of his complete works that I used (see below section III and ref. 9) and in four editions of the Imitatio Christi on my shelves. The versions differed, not only verbally, but also as regards full stops. If punctuation, even as regards full stops, is largely the work of the compositor, there need be no hesitation in overriding them if necessary: indeed, the use of personal judgement seems unavoidable.
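The difficulty with counting "from full stop to full stop" can be seen in a minimal sketch. It assumes a deliberately naive rule that is not part of the original investigation: break the text after every `.`, `!` or `?` that is followed by white space, and count white-space-separated tokens as words. The function names are hypothetical. Applied to the passage on Burns quoted above, the mechanical rule returns three short "sentences", whereas the reading adopted in the text treats the whole passage as a single sentence.

```python
import re


def split_on_full_stops(text):
    """Naively split text after '.', '!' or '?' when followed by white space."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in pieces if p]


def crude_word_count(piece):
    """Count white-space-separated tokens (a crude stand-in for 'words')."""
    return len(piece.split())


passage = (
    "There are several points here all at once calling for notice, "
    "and seldom getting it from friends of the poet: "
    "The extraordinary talent for history shown by Robert Burns. "
    "His attention to British History in preference to Scottish. "
    "The originality of his views."
)

# Mechanical reading: one "sentence" per full stop.
pieces = split_on_full_stops(passage)
naive_lengths = [crude_word_count(p) for p in pieces]

# Reading by sense: the whole passage taken as one sentence.
merged_length = crude_word_count(passage)

print(naive_lengths, merged_length)
```

The same rule would also break wrongly after an abbreviation such as "MSS.", which is one more reason a purely mechanical count cannot replace the reader's judgement.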
Let me add that at first I by no means realized the full extent of this difficulty, and when I did often felt myself horribly incompetent to deal with it. I am sure my final decisions could often be contested, and were not infrequently inconsistent with one another. But after all difficult cases are but a small proportion of all sentences in most writers and, if only as an exploratory piece of work, I hope the investigation may still retain interest and value.

Word. Compared with the difficulties as to the sentence, the difficulties concerning words are really of a minor kind. One large class is indicated by the lines of Calverley:

Forever; 'tis a single word!
Our rude forefathers deemed it two:
Can you imagine so absurd
A view?

Our rude forefathers also wrote it self, any where, every where and so forth, where their rude descendants write itself, anywhere, everywhere. How shall we reckon such expressions? It is best, I think, to follow modern usage and I generally endeavoured to do so; but in rapid counting it is very easy to make a slip.

Hyphened words present the same sort of difficulty. Law-courts, china-manufacturer, news-journal, well-earned, I would count as two words each; out-of-the-way as four: but co-acervation, contra-distinguish, tri-syllabic, pre-disposed, re-produce, as one each. A something-nothing-every-thing (Coleridge) presents a special problem: I think it should be three words. But how many words is matter-of-factness? Coleridge calls it a word, "an uncouth and new coined word".

Then there are abbreviations such as viz., i.e., etc. or &c. The first there is no reason to reckon as anything but one word. The second, third and fourth in spite of their meaning, I also reckoned as one each: eye and mind grasp them as wholes.

Finally, what are we to do with figures? Dates may occur even in literary or historical essays: any year stated in figures (1825 or 1798) I reckoned as a word.
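The conventions described so far for hyphened words, abbreviations and years in figures can be put into a rough sketch. Everything in it is an assumption made for illustration: the prefix list is hypothetical and covers only the examples given, the function names are invented, and no rule of this kind can settle special cases such as something-nothing-every-thing, which the sketch would count as four words rather than three.

```python
# Hypothetical list of prefixes: a hyphened word beginning with one of these
# is treated as a single word (co-acervation, pre-disposed, and the like).
PREFIXES = {"co", "contra", "tri", "pre", "re"}

# Abbreviations reckoned as one word each, written without the final stop.
ABBREVIATIONS = {"viz", "i.e", "etc", "&c"}


def count_token(token):
    """Return how many words a single white-space-separated token contributes."""
    core = token.strip(".,;:!?()\"'-")
    if not core:
        return 0
    # Abbreviations, and figures such as a year, count as one word.
    if core.lower() in ABBREVIATIONS or core[0].isdigit():
        return 1
    parts = [p for p in core.split("-") if p]
    # A recognized prefix makes the whole hyphened word one word.
    if len(parts) > 1 and parts[0].lower() in PREFIXES:
        return 1
    # Otherwise every hyphen-separated part counts as a word.
    return len(parts)


def count_words(text):
    """Sum the word counts of all tokens in the text."""
    return sum(count_token(t) for t in text.split())


print(count_words("Law-courts, china-manufacturer, news-journal, well-earned"))
print(count_words("out-of-the-way"))
print(count_words("co-acervation, contra-distinguish, tri-syllabic"))
```

Following modern usage for forms such as "any where" would need a word list in addition, which is left out here.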
Whether days of the month ever occurred I do not recall: but I would reckon the day of the month stated in figures, as in January 10th, as a word for the month and a word for the number of the day. Any actual number if stated in figures, and such numbers are frequent of course in the work of Graunt and Petty that I have discussed, would be reckoned as one word whatever the ...

... quotations, it became obvious that this was unsatisfactory, and I then adopted the easier method of simply cutting out all pages on which this source of trouble was serious. This is, I think, the best course.

SECTION II. ILLUSTRATIONS FROM BACON, COLERIDGE, LAMB AND MACAULAY

This section is in part purely illustrative, showing what sort of distributions of sentence-length we may expect, but in part is concerned with the fundamental question, how far sentence-length is really a characteristic of an author's style. If, that is to say, we take two lengthy passages, each containing a few hundred sentences, from a given fairly homogeneous work, will they present us with proportional numbers of sentences of each particular length in reasonably close agreement with one another? If they do not; if, although dealing with the same sort of material in the same sort of way, the author is liable capriciously to vary in the length of his sentences, sentence-length is not a characteristic of his style in any proper sense of the term, and one's impression to the contrary will be proved mistaken. If, however, there is reasonably close agreement, we can accept sentence-length as a characteristic.
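The comparison of "proportional numbers" between two passages can be sketched as follows. The counts are invented purely for illustration, the function names are hypothetical, and the largest absolute difference between proportions is only one convenient summary chosen here as an assumption; it is not put forward as the measure of agreement used in the investigation.

```python
def proportions(counts):
    """Convert a list of class frequencies into proportions of the total."""
    total = sum(counts)
    return [c / total for c in counts]


def largest_gap(counts_a, counts_b):
    """Largest absolute difference between the two sets of proportions."""
    pa, pb = proportions(counts_a), proportions(counts_b)
    return max(abs(a - b) for a, b in zip(pa, pb))


# Invented frequencies for four classes of sentence-length (not real data).
sample_a = [10, 30, 40, 20]
sample_b = [20, 60, 80, 40]   # twice as many sentences, same proportions
sample_c = [40, 30, 20, 10]   # a markedly different distribution

print(largest_gap(sample_a, sample_b))
print(largest_gap(sample_a, sample_c))
```

Working with proportions rather than raw counts is what allows passages containing different numbers of sentences to be set side by side.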
It is necessary, I think, to insert the condition that the author shall be dealing with the same sort of material in the same sort of way, since (again judging from general impressions) it seems clear that sentence-length may be affected by the author's matter as well as by his individuality: argumentative passages, for example, may well tend to longer sentences than matter purely descriptive.*

The four authors chosen as illustrations are Bacon, Coleridge, Lamb and Macaulay; and their works, Bacon's Essays, Coleridge's Biographia Literaria, Lamb's Elia and Last Essays of Elia, and Macaulay's Essays. The particular editions used are not probably of any importance in this instance but are cited in the references at the end of the paper. They were simply those that I happened to have on my shelves. The fundamental tables, all in the same form and showing the numbers of sentences with 1 to 5, 6 to 10, 11 to 15 words, and so on, are given in the Appendix.

Table A gives the data derived from Bacon's Essays. Here, when I had got to the end of Essay XXVI, "Of Seeming Wise", I judged myself to be about half-way, and called this batch of 462 sentences sample A: I then proceeded to the end of Essay LI, "Of Faction", and as this had given me 474 sentences, or approximately the same number, I called it sample B. The total number of essays being 58, the two samples together cover almost 90% of the essays. Table A shows, in addition to the distributions for the two samples

* Compare, for example, in Hazlitt's Lectures on the English Comic Writers, the style of the first essay "On Wit and Humour" with that of the subsequent lectures on definite groups of writers. See also below, section IV, for some remarks on Petty.

* Webster and the O.E.D. concur in classifying this expression as "Colloq. or Slang". But after all the early Christians, judging from both gospels and epistles, did write in short sentences.
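The form of the fundamental tables, with sentences grouped into classes of 1 to 5, 6 to 10, 11 to 15 words, and so on, can be illustrated by a short sketch. The list of sentence-lengths is invented, and the function names and the `width` parameter are hypothetical conveniences, not anything taken from the Appendix.

```python
from collections import Counter


def class_label(length, width=5):
    """Label of the class containing a sentence of the given length, e.g. '6-10'."""
    lower = ((length - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


def tabulate(lengths, width=5):
    """Count sentences per class, returned in order of increasing length."""
    counts = Counter((n - 1) // width for n in lengths)
    return {class_label(k * width + 1, width): counts[k] for k in sorted(counts)}


# Invented sentence-lengths in words (not data from any of the four authors).
lengths = [1, 5, 6, 10, 11, 29, 9, 5]

print(tabulate(lengths))
```

Classes in which no sentence falls are simply omitted by this sketch; a printed table would normally show them with a frequency of zero.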
