The automatic extraction of open compounds from text corpora
1996
Proceedings of the 16th conference on Computational linguistics
unpublished
This paper describes a new method for extracting open compounds (uninterrupted sequences of words) from text corpora of languages such as Thai, Japanese and Korean, which have no explicit word segmentation. Without applying word segmentation techniques to the input plain text, we generate n-gram data from it. We then count the occurrences of each string and sort the strings in alphabetical order. It is significant that the frequency of occurrence of strings decreases when the window size of
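The steps named in the abstract (generate n-gram data from unsegmented text, count each string, sort alphabetically) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: it assumes character-level n-grams, uses Python's default string ordering as a stand-in for alphabetical order, and the names `char_ngrams`, `ngram_table` and `max_n` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Yield every contiguous substring of length n (assumed: character n-grams)."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def ngram_table(text, max_n):
    """Count all n-grams for window sizes 1..max_n and return (string, count)
    pairs sorted by string (code-point order, standing in for alphabetical)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(char_ngrams(text, n))
    return sorted(counts.items())


# Toy input with no spaces, mimicking text without explicit word boundaries.
table = ngram_table("abcabcab", 3)
for string, count in table:
    print(string, count)
```

In this toy input, "ab" occurs three times and its extension "abc" only twice: a longer string can never occur more often than its own prefix, so counts do not increase as the window grows.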
doi:10.3115/993268.993386
fatcat:kdk57ajj5fephi3cvahg3ldidu