Harvesting the Bitexts of the Laws of Hong Kong From the Web

Chunyu Kit, Xiaoyue Liu, KingKui Sin, Jonathan J. Webster
2005 International Joint Conference on Natural Language Processing  
In this paper we present our recent work on harvesting English-Chinese bitexts of the laws of Hong Kong from the Web and aligning them to the subparagraph level by utilizing the numbering system in the legal text hierarchy. Basic methodology and practical techniques are reported in detail. The resultant bilingual corpus, 10.4M English words and 18.3M Chinese characters, is an authoritative and comprehensive text collection covering the specialized domain of HK laws. It is particularly valuable to empirical MT research. This piece of work has also laid a foundation for exploring and harvesting English-Chinese bitexts in a larger volume from the Web.
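The idea of aligning by the numbering system can be illustrated with a minimal sketch. It is not the authors' implementation: the marker pattern, the three-level hierarchy (digits, then letters, then roman numerals) and the function names `index_units` and `align` are all assumptions made for illustration. Each numbered unit is keyed by its path in the hierarchy, and units sharing a key in the two languages are paired.

```python
import re

# Assumed marker format: a leading label such as "(1)", "(a)" or "(iv)".
MARKER = re.compile(r"^\s*\((\d+|[a-z]+)\)\s*(.*)$")
ROMAN = re.compile(r"^[ivx]+$")


def depth_of(label, path):
    """Guess the hierarchy depth of a label (assumed scheme, not from the paper)."""
    if label.isdigit():
        return 0  # subsection, e.g. (1)
    if ROMAN.match(label) and len(path) >= 2:
        return 2  # subparagraph, e.g. (ii), only once inside a lettered paragraph
    return 1      # paragraph, e.g. (a)


def index_units(lines):
    """Map a hierarchical key such as ('1', 'a', 'ii') to the text of that unit."""
    units = {}
    path = []
    for line in lines:
        m = MARKER.match(line)
        if m is None:
            continue  # unnumbered lines are ignored in this sketch
        label, text = m.group(1), m.group(2).strip()
        d = depth_of(label, path)
        path = path[:d] + [label]
        units[tuple(path)] = text
    return units


def align(english_lines, chinese_lines):
    """Pair units whose numbering keys match, in English document order."""
    en = index_units(english_lines)
    zh = index_units(chinese_lines)
    return [(key, en[key], zh[key]) for key in en if key in zh]
```

The sketch depends on both language versions carrying the same numbering labels, which is the property of the legal text the abstract describes. Its depth heuristic is deliberately crude: a lettered paragraph such as (v) or (x) that follows other lettered paragraphs would be misread as a roman-numeral subparagraph, so a real system would need a more careful disambiguation step.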
dblp:conf/ijcnlp/KitLSW05 fatcat:m2ah3rw7xzgvpbpd2xyempyrqy