Automatic Chinese Confusion Words Extraction Using Conditional Random Fields and the Web

Chun-Hung Wang, Jason S. Chang, Jian-Cheng Wu
2013 Workshop on Chinese Language Processing  
A ready set of commonly confused words plays an important role in detecting and correcting spelling errors in text. In this paper, we present a system named ACE (Automatic Confusion words Extraction), which takes a Chinese word as input (e.g., "不脛而走") and automatically outputs its easily confused words (e.g., "不徑而走", "不逕而走"). The purpose of ACE is similar to that of web-based set expansion, the problem of finding all instances (e.g., "Halloween", "Thanksgiving Day", "Independence Day") of a set given a small number of class names (e.g., "holidays"). Unlike set expansion, our system produces the commonly confused words of a given Chinese word. In brief, we use hand-coded patterns to retrieve a set of sentence fragments from a search engine, and then assign an array of tags to each character in each sentence fragment. Finally, these tagged fragments serve as input to a pre-trained conditional random fields (CRF) model. We present experimental results on 3,211 test cases, showing that our system achieves 95.2% precision while maintaining 91.2% recall.
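The per-character tagging step described in the abstract can be pictured with a minimal sketch. The abstract does not specify the actual tag set, so the function name `tag_fragment` and every feature below (`in_query`, `is_cjk`, `prev`, `next`) are hypothetical choices made for illustration only; they are not the features used by ACE. The sketch shows only the general shape of the data a character-level CRF might consume: one feature dictionary per character of a retrieved fragment.

```python
from typing import Dict, List


def tag_fragment(query: str, fragment: str) -> List[Dict[str, object]]:
    """Build one illustrative feature dict per character of `fragment`.

    All feature names here are assumptions for this sketch, not the
    tag set from the paper.
    """
    query_chars = set(query)
    last = len(fragment) - 1
    features = []
    for i, ch in enumerate(fragment):
        features.append({
            "char": ch,
            # Does this character also occur in the input word?
            "in_query": ch in query_chars,
            # Is it inside the basic CJK Unified Ideographs block?
            "is_cjk": "\u4e00" <= ch <= "\u9fff",
            # Neighbouring characters, with boundary markers.
            "prev": fragment[i - 1] if i > 0 else "<BOS>",
            "next": fragment[i + 1] if i < last else "<EOS>",
        })
    return features


if __name__ == "__main__":
    # Input word and one of the confused forms quoted in the abstract.
    for feats in tag_fragment("不脛而走", "不徑而走"):
        print(feats)
```

In this toy run, the only character whose `in_query` flag is false is "徑", the position where the confused form departs from the input word. A sequence labeller would be trained on many such tagged fragments rather than relying on any single flag.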
dblp:conf/acl-sighan/WangCW13 fatcat:7b6gtlvesncatdsrhca6dmugpi