A Multistrategy Approach to Improving Pronunciation by Analogy

Yannick Marchand, Robert I. Damper
2000 Computational Linguistics  
Pronunciation by analogy (PbA) is a data-driven method for relating letters to sound, with potential application to next-generation text-to-speech systems. This paper extends previous work on PbA in several directions. First, we have included "full" pattern matching between input letter string and dictionary entries, as well as including lexical stress in letter-to-phoneme conversion. Second, we have extended the method to phoneme-to-letter conversion. Third, and most important, we have experimented with multiple, different strategies for scoring the candidate pronunciations. Individual scores for each strategy are obtained on the basis of rank and either multiplied or summed to produce a final, overall score. Five strategies have been studied and results obtained from all 31 possible combinations. The two combination methods perform comparably, with the product rule only very marginally superior to the sum rule. Nonparametric statistical analysis reveals that performance improves as more strategies are included in the combination: this trend is very highly significant (p << 0.0005). Accordingly, for letter-to-phoneme conversion, best results are obtained when all five strategies are combined: word accuracy is raised to 65.5% relative to 61.7% for our best previous result and 63.0% for the best-performing single strategy. These improvements are very highly significant (p ~ 0 and p = 0.00011 respectively). Similar results were found for phoneme-to-letter and letter-to-stress conversion, although the former was an easier problem for PbA than letter-to-phoneme conversion and the latter was harder. The main sources of error for the multistrategy approach are very similar to those for the best single strategy, and mostly involve vowel letters and phonemes.
Computational Linguistics Volume 26, Number 2
that (literate) humans are able to read aloud, so that systems that can pronounce print serve as models of human cognitive performance. Modern text-to-speech (TTS) systems use lookup in a large dictionary or lexicon (we use the terms interchangeably) as the primary strategy to determine the pronunciation of input words. However, it is not possible to list exhaustively all the words of a language, so a secondary or backup strategy is required for the automatic phonemization of words not in the system dictionary. The latter are mostly (but not exclusively) proper names, acronyms, and neologisms.
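As an aside on the rank-based combination summarized in the abstract, the following is a minimal sketch of how per-strategy ranks might be fused by a product or a sum rule. The rank-to-points mapping (best candidate receives as many points as there are candidates, ties share points), the function names, and the example values are all assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations
from math import prod


def rank_points(raw_scores):
    """Map raw scores (higher = better) to rank-based points.

    Assumed scheme: the best candidate gets len(raw_scores) points and
    tied candidates receive the same number of points.
    """
    n = len(raw_scores)
    return [n - sum(1 for other in raw_scores if other > s) for s in raw_scores]


def fuse(points_per_strategy, rule="product"):
    """Combine each candidate's points across strategies by product or sum."""
    combine = prod if rule == "product" else sum
    return [combine(column) for column in zip(*points_per_strategy)]


def strategy_subsets(n_strategies):
    """Enumerate every non-empty subset of strategy indices."""
    return [
        subset
        for size in range(1, n_strategies + 1)
        for subset in combinations(range(n_strategies), size)
    ]


# Hypothetical raw scores: three strategies (rows), three candidates (columns).
raw = [
    [1, 3, 2],
    [1, 2, 3],
    [0.1, 0.8, 0.3],
]
points = [rank_points(row) for row in raw]

print(fuse(points, "product"))   # [1, 18, 12]
print(fuse(points, "sum"))       # [3, 8, 7]
print(len(strategy_subsets(5)))  # 31
```

With five strategies there are 2^5 - 1 = 31 non-empty subsets, which matches the number of combinations reported in the abstract. In this toy example both rules select the same candidate (index 1).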
At this stage of our work, we concentrate on English and assume that any such missing words are dictionary-like with respect to their spelling and pronunciation, as will probably be the case for many neologisms. Even if the missing words are dictionary-like, automatic determination of pronunciation is a hard problem for languages like English and French (van den Bosch et al. 1994). In fact, English is notorious for the lack of regularity in its spelling-to-sound correspondence. That is, it has a deep orthography (Coltheart 1978; Liberman et al. 1980; Sampson 1985) as opposed to the shallow orthography of, for example, Serbo-Croatian (Turvey, Feldman, and Lukatela 1984). To a large extent, this reflects the many complex historical influences on the spelling system (Venezky 1965; Scragg 1975; Carney 1994). Indeed, Abercrombie (1981, 209) describes English orthography as "one of the least successful applications of the Roman alphabet." We use 26 letters in English orthography yet about 45-55 phonemes in specifying pronunciation. It follows that the relation between letters and phonemes cannot be simply one-to-one. For instance, the letter c is pronounced /s/ in cider but /k/ in cat. On the other hand, the /k/ sound of kitten is written with a letter k. Nor is this lack of invariance between letters and phonemes the only problem. There is no strict correspondence between the number of letters and the number of phonemes in English words. Letter combinations (ch, gh, ll, ea) frequently act as a functional spelling unit (Coltheart 1984), or grapheme, signaling a single phoneme. Thus, the combination ough is pronounced /Af/ in enough, while ph is pronounced as the single phoneme /f/ in phase. However, ph in uphill is pronounced as two phonemes, /ph/. Usually, there are fewer phonemes than letters but there are exceptions, e.g., (six, /sIks/). Pronunciation can depend upon word class (e.g., convict, subject).
English also has noncontiguous markings (Wijk 1966; Venezky 1970) as, for instance, when the letter e is added to (mad, /mad/) to make (made, /meId/), also spelled maid! The final e is not sounded; rather it indicates that the vowel is lengthened or diphthongized. Such markings can be quite complex, or long-range, as when the suffix y is added to photograph or telegraph to yield photography or telegraphy, respectively. As a final comment, although not considered further here, English contains many proper nouns (place names, surnames) that display idiosyncratic pronunciations, and loan words from other languages that conform to a different set of (partial) regularities. These further complicate the problem.
This paper is concerned with an analogical approach to letter-to-sound conversion and related string rewriting problems. Specifically, we aim to improve the performance of pronunciation by analogy (PbA) by information fusion, an approach to automated reasoning that seeks to utilize multiple sources of information in reaching a decision: in this case, a decision about the pronunciation of a word.
The remainder of this paper is organized as follows: In the next section, we contrast traditional rule-based and more modern data-driven approaches (e.g., analogical reasoning) to language processing tasks, such as text-to-phoneme conversion. In Section 3, we describe the original (PRONOUNCE) PbA system of Dedina and Nusbaum (1986) in some detail as this forms the basis for the later work. Section 4 reviews our own work in this area. Next, in Section 5, we make some motivating remarks about information fusion and its use in computational linguistics in general. In Section 6, we present in some detail the
doi:10.1162/089120100561674 fatcat:mc5xhs25i5dzblt3w6usilc5bu