Recognition of Anomalously Deformed Kana Sequences in Japanese Historical Documents

Nam Tuan LY, Kha Cong NGUYEN, Cuong Tuan NGUYEN, Masaki NAKAGAWA
2019 IEICE Transactions on Information and Systems
This paper presents recognition of anomalously deformed Kana sequences in Japanese historical documents, for which a contest was held by IEICE PRMU 2017. The contest was divided into three levels in accordance with the number of characters to be recognized: level 1: single characters; level 2: sequences of three vertically written Kana characters; and level 3: unrestricted sets of characters composed of three or more characters, possibly in multiple lines. This paper focuses on the methods for ... levels 2 and 3 that won the contest. We basically follow the segmentation-free approach and employ the hierarchy of a Convolutional Neural Network (CNN) for feature extraction, Bidirectional Long Short-Term Memory (BLSTM) for frame prediction, and Connectionist Temporal Classification (CTC) for text recognition, which is named a Deep Convolutional Recurrent Network (DCRN). We compare the pretrained CNN approach and the end-to-end approach with more detailed variations for level 2. Then, we propose a method of vertical text line segmentation and multiple line concatenation before applying DCRN for level 3. We also examine a two-dimensional BLSTM (2DBLSTM) based method for level 3. We present the evaluation of the best methods by cross validation. We achieved an accuracy of 89.10% for the three-Kana-character sequence recognition and an accuracy of 87.70% for the unrestricted Kana recognition without employing linguistic context. These results demonstrate the performance of the proposed models on the level 2 and 3 tasks.
key words: historical documents, deformed kana recognition, handwriting recognition, deep neural networks
* Hanja is the Korean name for Chinese characters incorporated into the Korean language with Korean pronunciation.
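To make the CTC stage of the CNN-BLSTM-CTC pipeline more concrete, the following is a minimal sketch of best-path (greedy) CTC decoding in plain Python. It is an illustration only: the abstract does not state which decoding strategy the authors used, and the function name `ctc_greedy_decode`, the blank index `0`, and the toy frame scores are all assumptions introduced here. The per-frame score lists stand in for the frame predictions that a BLSTM would produce.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label index for each frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(best_path)]
    # Remove blanks; a blank between two equal labels keeps them distinct.
    return [label for label in collapsed if label != blank]


# Toy frame predictions over 3 classes (blank, label 1, label 2).
frames = [
    [0.10, 0.80, 0.10],  # label 1
    [0.10, 0.70, 0.20],  # label 1 again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.70, 0.10],  # label 1, a new occurrence after the blank
    [0.10, 0.10, 0.80],  # label 2
]
decoded = ctc_greedy_decode(frames)
print(decoded)  # [1, 1, 2]
```

The blank label is what lets a segmentation-free recognizer emit the same character twice in a row: without the blank frame in the middle, the two occurrences of label 1 would collapse into one.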
doi:10.1587/transinf.2018edp7361 fatcat:fambt3but5b2zmlwdcurgvvx7a