Japanese word segmentation by hidden Markov model

Constantine P. Papageorgiou
1994 Proceedings of the workshop on Human Language Technology - HLT '94   unpublished
The processing of Japanese text is complicated by the fact that there are no word delimiters. To segment Japanese text, systems typically use knowledge-based methods and large lexicons. This paper presents a novel approach to Japanese word segmentation which avoids the need for Japanese word lexicons and explicit rule bases. The algorithm utilizes a hidden Markov model, a stochastic process, to determine word boundaries. This method has achieved 91% accuracy in segmenting words in a test corpus.
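The abstract does not spell out the model's states or parameters, so the following is only a minimal sketch of the general idea of finding word boundaries by hidden Markov model decoding, not the paper's actual formulation. It assumes a two-state tagging scheme (B: the character begins a word, I: the character continues a word) and uses Viterbi decoding to pick the most likely tag sequence. The names `viterbi`, `segment`, `START`, `TRANS`, and `EMIT` are hypothetical, and the probabilities are hand-set toy values chosen for illustration; in practice they would be estimated from a segmented training corpus.

```python
import math

# Assumed tag set (not taken from the paper):
# "B" = character begins a word, "I" = character continues the current word.
STATES = ("B", "I")

# Toy, hand-set parameters for illustration only.
START = {"B": 1.0, "I": 0.0}
TRANS = {"B": {"B": 0.5, "I": 0.5}, "I": {"B": 0.9, "I": 0.1}}
EMIT = {
    "B": {"私": 0.3, "は": 0.2, "学": 0.3, "で": 0.2},
    "I": {"生": 0.5, "す": 0.5},
}


def viterbi(obs, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for a non-empty observation string."""
    def lp(p):
        # Log-probability with a floor so unseen events stay finite.
        return math.log(max(p, floor))

    # best[t][s] = (log score of the best path ending in state s at time t,
    #               previous state on that path)
    best = [{s: (lp(start_p[s]) + lp(emit_p[s].get(obs[0], 0.0)), None)
             for s in STATES}]
    for ch in obs[1:]:
        row = {}
        for s in STATES:
            score, prev = max((best[-1][r][0] + lp(trans_p[r][s]), r)
                              for r in STATES)
            row[s] = (score + lp(emit_p[s].get(ch, 0.0)), prev)
        best.append(row)

    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    path.reverse()
    return path


def segment(text, start_p=START, trans_p=TRANS, emit_p=EMIT):
    """Split text into words by starting a new word at every "B" tag."""
    if not text:
        return []
    words = []
    for ch, tag in zip(text, viterbi(text, start_p, trans_p, emit_p)):
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


print(segment("私は学生です"))  # ['私', 'は', '学生', 'です']
```

Note that the emission table here is indexed by single characters rather than by words, which is in keeping with the abstract's point about avoiding word lexicons, though the real system's features may differ.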
doi:10.3115/1075812.1075875 fatcat:c46yaotwybbkhavah7iqo5hkwe