The goal of automatic indexing is to generate descriptors that reflect document content so that they can serve as document surrogates in text storage and retrieval systems. Popular approaches to automatic indexing include statistical analysis, syntactic processing, and knowledge-based methods.
This paper presents the design and implementation of an automatic indexing system for Korean texts that combines statistical, syntactic, and semi-semantic methods. The system adopts the Case Frame formalism, which examines the surface structure of a sentence in order to predict its deep structure. It uses three types of information from a document: case roles, importance-carrying phrases, and term frequency. It is assumed that each case role carries a different degree of importance, and that these differences can help in extracting content-bearing textual units. The paper argues that adding linguistic information to indexing may yield higher-quality indices, which in turn improves retrieval efficiency. The experimental results support this argument.
The indexing procedure consists of three parts: analyzing morpheme structures, extracting noun phrases, and weighting the noun phrases. The vector space retrieval model was used in the experiment.
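As a rough illustration of the weighting and retrieval steps, the following minimal Python sketch combines a case-role weight, a bonus for importance-carrying phrases, and term frequency into a single phrase weight, and then compares documents with queries by cosine similarity, as is usual in the vector space model. The role names, the numeric weights, the multiplicative combination, and the function names `weight_phrases` and `cosine` are all assumptions made for illustration; they are not taken from the system described here, and morphological analysis and noun phrase extraction are assumed to have been done already.

```python
import math
from collections import Counter

# Hypothetical case-role weights; the actual roles and values used by the
# system are not given in this text.
CASE_ROLE_WEIGHT = {"agent": 1.0, "object": 0.8, "location": 0.4}
DEFAULT_ROLE_WEIGHT = 0.2   # assumed weight for any other case role
CUE_BONUS = 0.5             # assumed bonus for importance-carrying phrases


def weight_phrases(phrases):
    """Weight extracted noun phrases.

    phrases: list of (noun_phrase, case_role, in_important_phrase) tuples,
    one per occurrence in the document.
    Returns a dict mapping each noun phrase to its weight.
    """
    # Term frequency: number of occurrences of each noun phrase.
    tf = Counter(phrase for phrase, _, _ in phrases)

    # For each phrase, keep the highest linguistic score over its occurrences.
    best = {}
    for phrase, role, important in phrases:
        score = CASE_ROLE_WEIGHT.get(role, DEFAULT_ROLE_WEIGHT)
        if important:
            score += CUE_BONUS
        best[phrase] = max(best.get(phrase, 0.0), score)

    # Assumed combination: linguistic score multiplied by term frequency.
    return {phrase: best[phrase] * tf[phrase] for phrase in best}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy document: three noun phrase occurrences with their case roles.
doc_vector = weight_phrases([
    ("automatic indexing", "agent", True),
    ("automatic indexing", "object", False),
    ("retrieval system", "location", False),
])
query_vector = {"automatic indexing": 1.0}
similarity = cosine(doc_vector, query_vector)
```

In this sketch a phrase that occurs often, fills a highly weighted case role, and appears inside an importance-carrying phrase receives the largest weight, so a query on that phrase ranks the document highly.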