A variety of indexing methods for Hangul texts have been proposed. They fall into two groups: those that extract index terms by removing particles, endings, suffixes, and the like from word phrases, and those that generate index terms from the morphemes of word phrases. The former can be implemented easily with the longest-match principle, but it suffers from the word boundary problem when documents contain many compound nouns. The latter overcomes the word boundary problem by extracting simple nouns, but it incurs considerable overhead in developing the extensive linguistic knowledge that the indexing procedure requires.
In this paper we propose a new indexing method based on n-grams. The method consists of four steps. First, word phrases are recognized in the Hangul text. Second, stopwords, which are not appropriate for representing the text, are eliminated. Third, the meaningless parts, consisting of particles, endings, suffixes, and the like, are removed from the remaining word phrases. Finally, n-grams are extracted from the meaningful parts. The proposed method alleviates the word boundary and linguistic knowledge problems of previous indexing methods. A performance comparison also shows that n-gram based indexing achieves retrieval effectiveness similar to that obtained when texts are indexed with manually extracted simple nouns.
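
The four steps can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the stopword set, the suffix list, the helper names (`strip_suffix`, `ngrams`, `index_terms`), and the choice of n = 2 are all assumptions made for illustration. Word phrases are approximated by whitespace-separated tokens, stopwords are matched against whole word phrases, and the meaningless tail is removed by longest match against a tiny hypothetical list, whereas a real system would need much larger resources.

```python
import unicodedata

# Hypothetical, deliberately tiny resources (assumptions, not from the paper).
STOPWORDS = {"그리고", "그러나"}
SUFFIXES = ["에서는", "에서", "으로", "은", "는", "이", "가", "을", "를", "의", "에"]


def strip_suffix(phrase):
    """Remove the longest matching trailing suffix, keeping at least one syllable."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if phrase.endswith(suffix) and len(phrase) > len(suffix):
            return phrase[:-len(suffix)]
    return phrase


def ngrams(text, n=2):
    """Return the syllable n-grams of text; short text is returned as one term."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_terms(text, n=2):
    """Run the four steps: word phrases, stopwords, suffix removal, n-grams."""
    # Normalize to NFC so that each Hangul syllable is a single code point.
    text = unicodedata.normalize("NFC", text)
    terms = []
    for phrase in text.split():          # step 1: recognize word phrases
        if phrase in STOPWORDS:          # step 2: eliminate stopwords
            continue
        stem = strip_suffix(phrase)      # step 3: remove the meaningless part
        terms.extend(ngrams(stem, n))    # step 4: extract n-grams
    return terms


if __name__ == "__main__":
    print(index_terms("그리고 정보검색에서 색인을"))
```

In this sketch the compound noun 정보검색 yields the bigrams 정보, 보검, and 검색 without any attempt to locate the boundary between 정보 and 검색, which is how an n-gram approach can sidestep the word boundary problem.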