A Cascaded Approach for CIPS-SIGHAN Micro-Blog Word Segmentation Bakeoff 2012

Bei Shi, Xianpei Han, Le Sun
2012 Workshop on Chinese Language Processing  
State-of-the-art Chinese word segmentation systems achieve high performance on well-formed long documents. Segmenting microblog text, however, is difficult because of noise and out-of-vocabulary (OOV) words. In this paper, we present a Chinese micro-blog segmentation system for the CIPS-SIGHAN Word Segmentation Bakeoff 2012 track. The proposed system adopts a cascaded approach with three steps: preprocessing, word segmentation, and post-processing. In the preprocessing step, noise containing special characters is processed and removed. The remaining sentences are segmented in the second step. Finally, a dictionary is used to detect OOV words that were not correctly segmented. The results show that our approach performs competitively.
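
The three-step cascade described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the noise patterns (URLs, @mentions, bracketed emoticons), the use of forward maximum matching as a stand-in segmenter, the merge-based OOV repair, and all function names and toy dictionaries.

```python
import re

# Hypothetical noise patterns (an assumption): URLs, @mentions, and
# bracketed emoticons such as [哈哈]. Mentions are assumed to be followed
# by whitespace, since \w also matches Chinese characters.
NOISE = re.compile(r"https?://\S+|@[\w\-]+|\[[^\[\]]{1,8}\]")


def preprocess(text):
    """Step 1: replace noise spans with spaces and return clean fragments."""
    return NOISE.sub(" ", text).split()


def segment(sentence, lexicon, max_len=4):
    """Step 2: forward maximum matching, a stand-in for the real segmenter."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + size]
            if size == 1 or cand in lexicon:
                words.append(cand)
                i += size
                break
    return words


def postprocess(words, oov_dict, max_span=3):
    """Step 3: merge adjacent tokens whose concatenation is a known OOV word."""
    out, i = [], 0
    while i < len(words):
        for span in range(min(max_span, len(words) - i), 0, -1):
            cand = "".join(words[i:i + span])
            if span == 1 or cand in oov_dict:
                out.append(cand)
                i += span
                break
    return out


def cascade(text, lexicon, oov_dict):
    """Run the three steps in order on one microblog post."""
    result = []
    for fragment in preprocess(text):
        result.extend(postprocess(segment(fragment, lexicon), oov_dict))
    return result


if __name__ == "__main__":
    lexicon = {"今天", "天气", "很好"}   # toy segmenter vocabulary
    oov_dict = {"微博"}                  # toy dictionary of OOV words
    print(cascade("@user 今天微博天气很好 http://t.cn/abc", lexicon, oov_dict))
```

In this toy run the segmenter splits the unknown word 微博 into single characters, and the post-processing step merges them back using the dictionary, so the final output is `['今天', '微博', '天气', '很好']`.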
dblp:conf/acl-sighan/ShiHS12 fatcat:5xkc6fgu7jeojgxvnmo7itwlru