This thesis presents a method for automatic word sense disambiguation using a part-of-speech (POS) tagged corpus and a machine-readable dictionary (MRD). Word sense disambiguation (WSD) is the problem of assigning the appropriate sense to an ambiguous word based on its context. WSD algorithms may be categorized by the method used to overcome the knowledge acquisition bottleneck and the problem of data sparseness: knowledge-based WSD methods use information from an MRD or a thesaurus such as LDOCE or WordNet, while corpus-based WSD methods use information gained from training on raw or tagged text corpora. We describe a method that attacks both the knowledge acquisition bottleneck and the problem of data sparseness using a POS-tagged corpus and an MRD.
Typical corpus-based WSD methods depend critically on manual sense tagging, which is a laborious and time-consuming process. This need for a sense-tagged corpus constitutes the knowledge acquisition bottleneck. We circumvent this problem by acquiring selectional restriction knowledge from a POS-tagged corpus. The selectional restrictions that predicates impose on their arguments provide useful information for resolving sense ambiguity. However, some phenomena in Korean, namely (1) postposition shift caused by auxiliaries, (2) the multiple surface forms of a compound predicate, and (3) the omission of case components, degrade the quality of the acquired selectional restriction knowledge. We define corpus normalization as a process that transforms such problematic constructions in the corpus into forms suitable for extracting the target knowledge, while preserving the integrity of the corpus. To prevent the negative effects of these phenomena, we develop corpus normalization rules. Verb sense disambiguation is then performed by clustering the objects of the ambiguous verb using the normalized selectional restriction knowledge. Because a word's dictionary definitions are likely to be good indicators of the senses they define, each verb sense to be clustered is defined as a sense entry in a dictionary definition. The experiment reveals that the corpus normalization rules are correct with a precision of over 93% and are significantly effective in reducing erroneous and sparse data; as a result, both the recall and the precision of WSD are improved.
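The sense-selection step described above can be sketched in miniature: each dictionary sense entry of an ambiguous verb is associated with a cluster of object nouns gathered from the normalized corpus, and an occurrence is disambiguated by matching its object against those clusters. The verb, sense labels, and noun sets below are illustrative assumptions, not data from the thesis.

```python
# Hypothetical sense inventory for an ambiguous verb: each dictionary
# sense entry is paired with object nouns observed (after corpus
# normalization) for that sense. All data here is illustrative.
SENSE_OBJECTS = {
    "open_1 (unfold)": {"book", "letter", "map"},
    "open_2 (start)":  {"meeting", "shop", "account"},
}

def disambiguate(obj_noun, sense_objects):
    """Pick the sense whose object cluster the observed object noun
    falls into; return None if no cluster matches."""
    for sense, nouns in sense_objects.items():
        if obj_noun in nouns:
            return sense
    return None

print(disambiguate("letter", SENSE_OBJECTS))   # → open_1 (unfold)
print(disambiguate("meeting", SENSE_OBJECTS))  # → open_2 (start)
```

A realistic system would replace the exact-membership test with the graded similarity measure discussed below, so that object nouns unseen during training can still be assigned to the nearest cluster.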
Corpus-based WSD methods suffer from the problem of data sparseness. Traditionally, this problem is approached by estimating the probability of unobserved cooccurrences from the cooccurrences actually observed in the training corpus, either by smoothing the observed frequencies or by class-based methods. We address the problem in two ways. First, we replace the all-or-none indicator of cooccurrence in the noun distribution with a graded measure over both noun and verb distributions: nouns are considered similar if they appear as the objects of similar verbs, and verbs are similar if they take similar nouns as their objects. Using both distributions reduces data sparseness more than using the noun distribution alone. Second, we classify the nouns appearing in the corpus using IS-A relations extracted from dictionary definitions. The experiments show that using both noun and verb distributions improves recall by about 22%, and using the dictionary improves precision by about 25%.
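The graded similarity idea can be illustrated with a small sketch: represent each noun by the vector of verbs that take it as an object, and compare nouns by cosine similarity, so that nouns never observed together still receive a nonzero similarity when they share governing verbs. The cooccurrence counts below are toy assumptions, and cosine is used here only as one plausible graded measure, not necessarily the one used in the thesis.

```python
import math
from collections import Counter

# Toy verb-object cooccurrence counts (illustrative only).
# pairs[(verb, noun)] = frequency of noun as object of verb
pairs = Counter({
    ("eat", "apple"): 4, ("eat", "bread"): 3,
    ("buy", "apple"): 2, ("buy", "bread"): 2, ("buy", "car"): 5,
    ("drive", "car"): 6,
})

def noun_vector(noun):
    """Distribution of a noun over the verbs that govern it."""
    return {v: c for (v, n), c in pairs.items() if n == noun}

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# "apple" and "bread" share the verbs eat/buy, while "apple" and "car"
# share only buy, so the first pair gets the higher graded similarity.
print(cosine(noun_vector("apple"), noun_vector("bread")))
print(cosine(noun_vector("apple"), noun_vector("car")))
```

The symmetric measure for verbs (similar if they take similar object nouns) follows the same pattern with the roles of verb and noun exchanged; the class-based step would then back off from an unseen noun to its IS-A class from the dictionary.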