Weakly Supervised Morphology Learning for Agglutinating Languages Using Small Training Sets

Ksenia Shalonova, Bruno Golénia
2010 International Conference on Computational Linguistics  
The paper describes a weakly supervised approach for decomposing words into all of their morphemes (stems, prefixes and suffixes), using wordforms with marked stems as training data. Because we concentrate on under-resourced languages, the amount of training data is limited, so the supervision takes the form of a small number of wordforms with marked stems. In the first stage we introduce a new Supervised Stem Extraction (SSE) algorithm. Once stems have been extracted, an improved unsupervised segmentation algorithm, GBUMS (Graph-Based Unsupervised Morpheme Segmentation), is used to segment suffix or prefix sequences into individual suffixes and prefixes. The approach, experimentally validated on Turkish and isiZulu, gives high performance on test data and is comparable to a fully supervised method.
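
To make the shape of the training data concrete, the sketch below shows one way a wordform with a marked stem could be split into a prefix sequence, a stem and a suffix sequence, which are the units the two stages operate on. The bracket notation, the `MarkedWord` type and the `split_marked` function are assumptions made for illustration only; the abstract does not specify the annotation format, and this code implements neither SSE nor GBUMS. The example words are illustrative and are not taken from the paper.

```python
from typing import NamedTuple


class MarkedWord(NamedTuple):
    """A wordform split around its marked stem (hypothetical representation)."""
    prefixes: str  # unsegmented prefix sequence, possibly empty
    stem: str
    suffixes: str  # unsegmented suffix sequence, possibly empty


def split_marked(wordform: str, open_mark: str = "[", close_mark: str = "]") -> MarkedWord:
    """Split a wordform whose stem is delimited by markers.

    The bracket convention is an assumption for this sketch, not the
    annotation scheme used by the authors.
    """
    start = wordform.index(open_mark)  # raises ValueError if no stem is marked
    end = wordform.index(close_mark, start + len(open_mark))
    return MarkedWord(
        prefixes=wordform[:start],
        stem=wordform[start + len(open_mark):end],
        suffixes=wordform[end + len(close_mark):],
    )


# Turkish "evlerde" (ev + ler + de): the suffix sequence "lerde" is left
# unsegmented; splitting it further is the job of the segmentation stage.
turkish = split_marked("[ev]lerde")

# isiZulu "ngiyahamba" (ngi + ya + hamb + a): prefixes and a suffix surround the stem.
zulu = split_marked("ngiya[hamb]a")
```

Keeping the affix sequences unsegmented at this point mirrors the division of labour in the abstract: stems are identified first, and only afterwards are the remaining prefix or suffix sequences broken into individual morphemes.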
dblp:conf/coling/ShalonovaG10 fatcat:wedjkunttvhorpx7p3uzlu3d6i