Japanese Kana-to-Kanji Conversion Using Large Scale Collocation Data

Yasuo Koyama, Masako Yasutake, Kenji Yoshimura, Kosho Shudo
1998 Pacific Asia Conference on Language, Information and Computation  
Japanese word processors and computers used in Japan employ an input method through keyboard strokes combined with Kana (phonetic) character to Kanji (ideographic, Chinese) character conversion technology. The key factor of Kana-to-Kanji conversion technology is how to raise the accuracy of the conversion through homophone processing, since there are so many homophonic Kanji. In this paper, we report the results of our Kana-to-Kanji conversion experiments which embody homophone processing using extensive collocation data. It is shown that approximately 135,000 collocations yield a 9.1% rise in conversion accuracy compared with the prototype system, which has no collocation data.

Introduction

Japanese word processors and computers used in Japan ordinarily employ the Japanese input method through keyboard strokes combined with Kana (phonetic) to Kanji (ideographic, Chinese) character conversion technology, because no extra technology such as free-hand character recognition or speech recognition is required. Kana-to-Kanji conversion is performed by morphological analysis of the input Kana string, which has no spaces between words. Word or phrase segmentation is carried out by the analysis to decide the substring of the input to be converted from Kana to Kanji. A Kana-Kanji mixed string, which is the ordinary form of Japanese written text, is obtained as the final result. The major issue of this technology lies in raising the accuracy of the segmentation and of the homophone processing that selects the most proper Kanji among many homophonic candidates.

The conventional methodology for processing homophones has given first priority to the word which was used most recently or to the word which is used most frequently. In fact, this method is effective in some situations, but it sometimes tends to output an inadequate conversion result because it does not consider the semantic consistency of the co-occurrence of words. While it is difficult to employ syntactic or semantic processing in earnest in a word processor from the cost vs. performance viewpoint, the following attempts to improve the conversion accuracy have been reported: employing the case frame to check the semantic consistency of combinations of words [Oshima, Y. et al., 1986]; examining the consistency of the co-occurrence of adjacent words [Honma, S. et al., 1986]; employing a neural network to describe the consistency of the co-occurrence of words [Kobayashi, T. et al., 1992]; making a co-occurrence dictionary for a specific topic or field, and giving priority to the word which is in the dictionary when a keyword appropriate to the topic is detected in the input [Yamamoto, K. et al., 1992]; and employing the validity of the co-occurrence of a noun and a verb, calculated statistically [Takahashi, M. et al., 1996]. In all of these studies, where the main concern is to examine the consistency of word co-occurrence in the input, many problems are left unsolved.

Besides these semantic or quasi-semantic gadgets, there seems to be room for improvement by using word-level resources, namely, by the extensive use of collocations, recurrent combinations of words which co-occur more often than expected by chance. How many collocations should be collected and how much they contribute to the accuracy of Kana-to-Kanji conversion have not been reported yet. In this paper, we present some results of our experiments on Kana-to-Kanji conversion, focusing on the usage of the large scale collocation
dblp:conf/paclic/KoyamaYYS98 fatcat:3y23ioby7bc2vkbzpnpc4rb7o4