Statistical language modeling (SLM) attempts to capture the regularities of natural language by estimating the probability distribution of linguistic units such as words, sentences, and whole documents. SLM is crucial for a wide variety of language technology applications: speech recognition, document classification, information retrieval, POS (part-of-speech) tagging, machine translation, and many more.
In this thesis, we construct word- (morpheme-) and class-based n-gram language models for Korean. We verify the effectiveness of these models through thorough experiments on a POS-tagged text corpus of about 20 million words. For word-based n-gram modeling, we compare Katz's backoff method with Kneser-Ney smoothing. For class-based modeling, a POS-based class model is compared with automatically clustered class models. Finally, we combine the word- and class-based language models by linear interpolation. The results show that Kneser-Ney smoothing outperforms the widely used Katz backoff technique, that the automatically derived word-class model is better than the POS-based class model, and that the combined model performs best.
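As a sketch of the combination scheme described above: a class-based bigram model factors the word probability through word classes, and linear interpolation mixes it with the word-based model. The notation here (the weight \(\lambda\), and \(c_i\) for the class of word \(w_i\)) is assumed for illustration and is not necessarily the thesis's own:

\[
P_{\text{class}}(w_i \mid w_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-1}),
\qquad
P(w_i \mid h) = \lambda\, P_{\text{word}}(w_i \mid h) + (1-\lambda)\, P_{\text{class}}(w_i \mid h),
\]

where \(h\) is the n-gram history and \(0 \le \lambda \le 1\) is typically estimated on held-out data.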
Our results lay an experimental foundation for statistical language modeling of Korean and can be used in various applications such as speech recognition and document classification. All of these techniques, including the n-gram counting, smoothing, clustering, and evaluation algorithms, were implemented by the author and released to the public domain as the KLM toolkit.
Further investigation is necessary to incorporate more sophisticated linguistic information (e.g., dependency grammar, semantic coherence) into statistical language models.