Using bilingual Web data to mine and rank translations
IEEE Intelligent Systems
In the Internet era, the traditional Tower of Babel problem, how we read and write foreign languages, has become even more serious. According to research, about three-fourths of the Web pages that non-English speakers need to read are in English, while for English speakers, roughly one-fourth of the pages are in other languages (see www.statistics.com/content/datapages/data5.html). We propose using multilingual Web data and statistical-learning methods to help readers understand foreign
languages. We've created an intelligent English reading-assistance system that offers word and phrase translation, with automatic translation mining and ranking features based on these methods.

English Reading Wizard

Full machine translation has made substantial progress, but its quality hasn't reached a satisfactory level. Figure 1 shows such a system's Chinese-to-English translation. English speakers can get a rough sense of what the original Chinese text describes, but they'll probably have difficulty understanding the details. (For an example machine translation system, see Babelfish, http://babelfish.altavista.com.)

Nearly 90 percent of Internet users in China have educational backgrounds beyond high school, and they can read English, although their abilities vary (see www.cnnic.net.cn). For many of them, therefore, a reading-assistance tool would be more helpful than full machine translation. The situation in other Asian countries such as Japan and Korea is very similar.

Our English reading-assistance system, English Reading Wizard (ERW), provides dictionary consultation for words and phrases through two basic features: mouse hovering and searching. When a user places the cursor on a word such as cellular, ERW displays the word and its translations in a pop-up menu (as shown in the lower part of Figure 2). When a user searches for a word such as biology by typing it in the reference window on the left, ERW displays the detailed translation under Dictionary Lookup Results. Searching consults the local dictionary when the Local tab is chosen in the reference window; local lookup covers both basic and personal translations, the latter drawn from a user-compiled dictionary. ERW supports English-to-Chinese and English-to-Japanese translation.

To make ERW easier to use, we've developed two advanced features.
The first, translation mining, automatically extracts the translations of words and phrases from the Web when no translation can be found in the local dictionary. This feature addresses the local "out of vocabulary" problem that often plagues foreign-language reading-assistance systems. The second advanced feature, translation ranking, orders the candidate translations of a word or phrase according to its context. Because many words and phrases are ambiguous, putting the correct translation at the top of the list saves users time in dictionary consultation. This feature ranks the translations found in the local dictionary. Several commercial products exist for foreign-language reading assistance, such as Ciba (www.iciba.net), and related research has been conducted,¹ but no other product offers ERW's advanced features.

The English Reading Wizard uses bilingual Web and local-dictionary data to help readers understand foreign languages by translating words and phrases. Methods include the Expectation-Maximization (EM) algorithm and bilingual bootstrapping.
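
The lookup flow described above (local dictionary first, translation mining as a fallback when the word is out of vocabulary, and context-based ranking of local translations) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the dictionary contents, the cue words, the names lookup_local, rank_by_context, translate, and mine_from_web, and the choice to list personal translations before basic ones. In particular, the ranker is a simple context-word overlap count, not the EM or bilingual-bootstrapping methods that ERW uses, and the mining step is left as a caller-supplied function.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical toy dictionaries; ERW's real data is not described here.
BASIC_DICT: Dict[str, List[str]] = {
    "plant": ["植物", "工厂"],   # "vegetation", "factory"
    "biology": ["生物学"],
}
PERSONAL_DICT: Dict[str, List[str]] = {
    "cellular": ["细胞的"],      # user-compiled entry
}

# Hypothetical English cue words per translation, used only by the toy ranker.
CONTEXT_CUES: Dict[str, Set[str]] = {
    "植物": {"leaf", "grow", "water", "garden"},
    "工厂": {"worker", "production", "machine", "manufacturing"},
}


def lookup_local(word: str) -> List[str]:
    """Merge personal and basic translations (personal first, by assumption)."""
    merged: List[str] = []
    for source in (PERSONAL_DICT, BASIC_DICT):
        for translation in source.get(word.lower(), []):
            if translation not in merged:
                merged.append(translation)
    return merged


def rank_by_context(translations: List[str], context: List[str]) -> List[str]:
    """Toy ranker: more cue words found in the context means a higher rank.

    sorted() is stable, so ties keep their dictionary order.
    """
    ctx = {w.lower() for w in context}
    return sorted(
        translations,
        key=lambda t: -len(CONTEXT_CUES.get(t, set()) & ctx),
    )


def translate(
    word: str,
    context: List[str],
    mine_from_web: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Rank local translations; fall back to mining if none are found."""
    local = lookup_local(word)
    if local:
        return rank_by_context(local, context)
    if mine_from_web is not None:
        # Out-of-vocabulary word: defer to the (hypothetical) miner.
        return mine_from_web(word)
    return []


if __name__ == "__main__":
    print(translate("plant", ["the", "plant", "hired", "a", "worker"]))
    print(translate("genome", [], mine_from_web=lambda w: ["基因组"]))
```

Keeping the miner as a pluggable function mirrors the division of labor in the text: ranking applies only to translations already in the local dictionary, while mining is consulted only when the local lookup comes back empty.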