Punctuation as Implicit Annotations for Chinese Word Segmentation

Zhongguo Li, Maosong Sun
2009 Computational Linguistics  
We present a Chinese word segmentation model learned from punctuation marks which are perfect word delimiters. The learning is aided by a manually segmented corpus. Our method is considerably more effective than previous methods in unknown word recognition. This is a step toward addressing one of the toughest problems in Chinese word segmentation.

Segmentation as Tagging

We call the first character of a Chinese word its left boundary L, and the last character its right boundary R. If we regard ... and R as random events, then we can derive four events (or tags) from them:
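
The text breaks off before listing the four tags. Since each character either is or is not a left boundary, and either is or is not a right boundary, the two events combine in exactly four ways. The sketch below enumerates them. The tag names b, m, e, s follow the common begin/middle/end/single convention and are an assumption here, as are the function names and the punctuation set. The second function illustrates, again only as an assumption about how the idea might be applied, how punctuation marks in unsegmented text yield characters whose L or R status is known for free.

```python
# Illustrative sketch only: tag names and helper names are hypothetical,
# not taken from the paper.

# Map (is_left_boundary, is_right_boundary) to an assumed tag name.
TAGS = {
    (True, False): "b",   # L and not R: first character of a multi-character word
    (False, False): "m",  # neither L nor R: word-internal character
    (False, True): "e",   # R and not L: last character of a multi-character word
    (True, True): "s",    # both L and R: single-character word
}

# Assumed set of delimiting punctuation marks, for illustration.
PUNCT = set("，。！？；：、")


def tag_sentence(words):
    """Tag every character of a segmented sentence by its L/R status."""
    tags = []
    for word in words:
        last = len(word) - 1
        for i in range(len(word)):
            tags.append(TAGS[(i == 0, i == last)])
    return tags


def implicit_boundaries(text):
    """Collect characters that punctuation marks reveal as L or R.

    A character directly after a punctuation mark (or at the start of the
    text) must begin a word; one directly before a mark (or at the end of
    the text) must end a word.
    """
    left, right = [], []
    for i, ch in enumerate(text):
        if ch in PUNCT:
            continue
        if i == 0 or text[i - 1] in PUNCT:
            left.append(ch)
        if i == len(text) - 1 or text[i + 1] in PUNCT:
            right.append(ch)
    return left, right


if __name__ == "__main__":
    print(tag_sentence(["我们", "爱", "自然语言"]))
    print(implicit_boundaries("你好，世界。"))
```

Under these assumptions, a manually segmented sentence supplies a tag for every character, while raw punctuated text supplies boundary labels only for the characters adjacent to punctuation.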
doi:10.1162/coli.2009.35.4.35403 fatcat:6inl26biejcqjioa6bh6bpp5xm