Interpreting Quantitative Data in Corpus Linguistics
Quantitative information has become increasingly important in Corpus Linguistics, and increasingly sophisticated as measures that are sensitive to how language works have become more readily available. Questions around the use of quantitative information are driven by the need in Corpus Linguistics to innovate methodologically and theoretically. In 'phase 1' studies, corpora from different geographical areas, time periods, or registers are compared by quantifying the relative frequency of given grammatical or semantic categories. Such methods have underpinned substantial advances, for example the Longman Grammar of Spoken and Written English, work on Systemic-Functional Linguistics, and work comparing learner varieties of English. 'Phase 2' studies prioritise lexis over grammar and individual wordforms over categories of form or meaning. In these studies, frequency is often reduced to a concept of 'typicality' or 'centrality', and comparison between corpora is usually not the identifying feature of the work. Examples include Sinclair's work on Units of Meaning, and Francis and Hunston's work on grammar patterns. The key aspects of phase 2 studies are their exploratory, 'bottom-up' approach and the novelty of their insights. A challenge for Corpus Linguistics is to marry the rigour of quantitative measures with innovation in insight. One way of doing this is to allow numbers to drive the way that information in the corpus is organised. Studies that do so are what I term 'phase 3' studies. They are illustrated by a study of adjectives in a corpus of comments about university teaching staff (see Millar and Hunston 2015), and by a study of lexis in a corpus compiled from an interdisciplinary academic journal (see Murakami et al. in press). In both cases the initial corpus work treats the corpus as a 'bag of words', allowing co-occurrence calculations to organise the data before linguistic considerations are brought to bear. Phase 3 studies remain true to a data-driven approach to corpora. They produce a sketch of a corpus rather than an analysis of it.
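To make the 'bag of words' idea concrete, the following is a minimal sketch of one possible co-occurrence calculation. It is an illustration under stated assumptions, not a reconstruction of the method used in either cited study. The function names (`cooccurrence_counts`, `pmi`), the choice of pointwise mutual information as the association measure, and the toy comments are all hypothetical. Each text is reduced to its set of wordforms, with word order and grammar ignored, and pairs of wordforms are scored by how often they appear in the same text.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_counts(texts):
    """Count how many texts contain each wordform and each unordered pair."""
    word_counts = Counter()
    pair_counts = Counter()
    for text in texts:
        # Bag of words: keep only the distinct wordforms, ignoring order.
        types = sorted(set(text.lower().split()))
        word_counts.update(types)
        # Sorted input means each pair is stored in alphabetical order.
        pair_counts.update(combinations(types, 2))
    return word_counts, pair_counts


def pmi(pair, word_counts, pair_counts, n_texts):
    """Pointwise mutual information (log base 2) for an alphabetically ordered pair."""
    if pair_counts[pair] == 0:
        return float("-inf")  # the two wordforms never share a text
    a, b = pair
    p_ab = pair_counts[pair] / n_texts
    p_a = word_counts[a] / n_texts
    p_b = word_counts[b] / n_texts
    return math.log2(p_ab / (p_a * p_b))


# Hypothetical toy data, invented purely for illustration.
comments = [
    "helpful clear funny",
    "clear helpful organised",
    "boring hard unclear",
    "hard boring",
]

word_counts, pair_counts = cooccurrence_counts(comments)
for pair, count in pair_counts.most_common(3):
    print(pair, count, round(pmi(pair, word_counts, pair_counts, len(comments)), 2))
```

In a sketch of this kind the numbers group the wordforms first, and only afterwards would the analyst ask what the resulting groupings mean linguistically.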