Identifying index terms in Korean documents poses unique problems for which the methods used to index English documents are inappropriate. In this thesis, we address two problems in the automatic indexing of Korean documents.
The first problem is to identify nominal words and to remove suffixes from them in order to extract candidate index terms. Although many existing systems adopt morphological analyzers for this purpose, they suffer from heavy ambiguity and unknown words. To cope with these problems, this thesis introduces an improved tagging model for Korean that takes into account the type of wordphrases. Because Korean sentences consist of wordphrases that contain one or more morphemes, Korean tagging must be posed differently from English tagging. We introduce a hidden Markov model that closely reflects the natural structure of Korean. An experiment with 10,702 wordphrases shows that the tagging accuracy of the proposed model is 96.18%, which is at the upper end of reported results for Korean.
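As a rough illustration of how a hidden Markov model resolves tagging ambiguity, the following sketch runs plain Viterbi decoding over a sequence of morphemes. It is not the proposed model: it ignores wordphrase types, and the tag set (N, J, V, E), the romanized morphemes, the probabilities, and the probability floor standing in for smoothing are all hypothetical values chosen for the example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable tag sequence under a first-order HMM."""
    floor = 1e-12  # assumed stand-in for proper smoothing of unseen events

    def lp(p):
        return math.log(max(p, floor))

    # best[t][s] = (log probability of the best path ending in s, previous tag)
    first = observations[0]
    best = [{s: (lp(start_p.get(s, 0.0)) + lp(emit_p[s].get(first, 0.0)), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p].get(s, 0.0)), p) for p in states
            )
            column[s] = (score + lp(emit_p[s].get(obs, 0.0)), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical toy model: N = noun, J = particle, V = verb stem, E = ending.
states = ["N", "J", "V", "E"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {
    "N": {"J": 0.7, "N": 0.2, "V": 0.1},
    "J": {"N": 0.5, "V": 0.5},
    "V": {"E": 0.9, "N": 0.1},
    "E": {"N": 0.6, "V": 0.4},
}
emit_p = {
    "N": {"hakgyo": 0.5, "chaek": 0.5},
    "J": {"e": 0.5, "ga": 0.5},   # "ga" can be a particle ...
    "V": {"ga": 0.6, "meok": 0.4},  # ... or a verb stem
    "E": {"da": 1.0},
}

# Two wordphrases, "hakgyo+e" and "ga+da", flattened into morphemes.
morphemes = ["hakgyo", "e", "ga", "da"]
tags = viterbi(morphemes, states, start_p, trans_p, emit_p)
```

In this toy setting the ambiguous morpheme "ga" is resolved by its context: a particle directly after another particle has no transition probability, so the verb-stem reading wins. Morphemes tagged N would then be the candidate index terms.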
The second problem is compound noun analysis, which concerns segmenting or decomposing compound nouns into promising index terms. Compound nouns used as index terms usually denote specific concepts and therefore tend to increase retrieval precision. The indiscriminate use of the component nouns of compound nouns as index terms, on the other hand, may improve recall but can decrease precision. We propose a method for handling compound nouns with the goal of preserving precision while gaining recall. In the proposed method, the relevance of the component nouns of a compound noun to the document content is computed by comparing the document sets that are supported by the component nouns and by the terms of the document. The operational content of a term is represented as the probability distribution of the term over the document set. Experiments with a set of 1,000 documents show that the proposed method outperforms other known methods, including even manual indexing, for 30 sample queries.
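The idea of comparing term distributions over a document set can be sketched as follows. This is only one possible reading under stated assumptions: the distribution of a term is taken to be its normalized frequency across documents, the comparison is a simple overlap (sum of element-wise minima), and the relevance of a component noun is the mean overlap with the other terms of the document. The function names, the overlap measure, and the tiny English stand-in collection are all hypothetical.

```python
def term_distribution(term, docs):
    """Normalized frequency of a term across the documents of a collection."""
    counts = [doc.count(term) for doc in docs]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(docs)
    return [c / total for c in counts]


def overlap(p, q):
    """Assumed comparison measure: shared probability mass of two distributions."""
    return sum(min(a, b) for a, b in zip(p, q))


def component_relevance(component, doc_terms, docs):
    """Mean overlap between a component noun and the other terms of a document."""
    others = [t for t in set(doc_terms) if t != component]
    if not others:
        return 0.0
    p = term_distribution(component, docs)
    return sum(overlap(p, term_distribution(t, docs)) for t in others) / len(others)


# Hypothetical collection; each document is a list of extracted nouns.
docs = [
    ["retrieval", "index", "query", "retrieval"],
    ["index", "query", "ranking"],
    ["information", "policy", "government"],
    ["information", "law", "policy"],
]

# Score both components of the compound "information retrieval" for docs[0].
rel_retrieval = component_relevance("retrieval", docs[0], docs)
rel_information = component_relevance("information", docs[0], docs)
```

Here "retrieval" shares document support with the other terms of the document while "information" does not, so only "retrieval" would be kept as an additional index term under a relevance threshold.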