Korean automatic spacing using pretrained transformer encoder and analysis

Taewook Hwang, Sangkeun Jung, Yoon‐Hyung Roh
2021 ETRI Journal  
Automatic spacing in Korean corrects the spacing units in a given input sentence. Demand for automatic spacing has been increasing owing to frequent incorrect spacing in recent media, such as the Internet and mobile networks. We therefore propose a transformer encoder that reads a sentence bidirectionally and can be pretrained using an out-of-task corpus. Notably, our model exhibited the highest character accuracy (98.42%) among the existing automatic spacing models for ... We experimentally validated the effectiveness of bidirectional encoding and pretraining for automatic spacing in Korean. Moreover, we conclude that pretraining is more important than fine-tuning and data size.
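
As an illustration of the task and metric named in the abstract, the sketch below frames spacing as per-character binary labelling and computes a character-level accuracy. The labelling scheme (1 means "a space follows this character"), the function names, and the accuracy definition are assumptions made for this example; the abstract does not state the exact scheme or metric definition used by the authors.

```python
from __future__ import annotations


def to_labels(spaced: str) -> tuple[str, list[int]]:
    """Remove spaces and return (characters, labels).

    Assumed scheme: labels[i] == 1 if a space follows character i
    in the correctly spaced sentence, otherwise 0.
    """
    chars: list[str] = []
    labels: list[int] = []
    for ch in spaced.strip():
        if ch == " ":
            if labels:  # mark the previous non-space character
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def apply_labels(chars: str, labels: list[int]) -> str:
    """Rebuild a spaced sentence from characters and per-character labels."""
    out: list[str] = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label:
            out.append(" ")
    return "".join(out).rstrip()


def character_accuracy(gold: list[int], pred: list[int]) -> float:
    """Fraction of characters whose predicted label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 1.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical example sentence, not taken from the paper's data.
chars, gold = to_labels("나는 학교에 간다")
pred = [0, 1, 0, 0, 0, 0, 0]  # a made-up prediction that misses one space
restored = apply_labels(chars, pred)
accuracy = character_accuracy(gold, pred)
```

Under this framing, a bidirectional encoder would predict one label per input character, and the restored sentence follows directly from those labels.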
doi:10.4218/etrij.2020-0092 fatcat:i6pe5eklvzc4vlt4lcf6m5h34e