In Korean text, the use of English words, with or without accompanying phonetic transliterations, is growing rapidly. To make matters worse, the Korean transliteration of a single English word may vary greatly. The mixed use of English words and their various Korean transliterations in the same document or document collection can cause a severe word mismatch problem in Korean information retrieval. When a user query and the document text use different transliterations of the same word, simple word matching cannot retrieve the document. Simple word matching also fails when the query uses a Korean transliteration and the document contains the English word, or vice versa.
To resolve the word mismatch problem, it is necessary to find equivalence classes among English words and their various Korean transliterations. However, constructing these equivalence classes is not easy because of the inherent difficulties of the problem. There are two possible approaches. One is to transform, i.e., back-transliterate, foreign words into their original English words and use the English words as canonical forms for indexing and querying. The other, which is proposed in this thesis, is to transliterate English words into Korean and construct equivalence classes among foreign words by measuring the phonetic similarities among them. We call the former the back-transliteration approach and the latter the transliteration approach.
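The idea of grouping foreign words by phonetic similarity can be illustrated with a minimal sketch. It is not the similarity measure developed in this thesis: as a stand-in, it decomposes each Hangul syllable into its initial, vowel, and final components using the standard Unicode syllable arithmetic, compares words by a normalized edit distance over those components, and merges any pair above a threshold. The function names, the threshold of 0.8, and the sample words are assumptions chosen for illustration only.

```python
from itertools import combinations

# Layout of the Unicode Hangul Syllables block (U+AC00..U+D7A3):
# code = BASE + (lead * 21 + vowel) * 28 + tail
BASE, N_LEAD, N_VOWEL, N_TAIL = 0xAC00, 19, 21, 28

def to_jamo(word):
    """Decompose each Hangul syllable into tagged (lead, vowel, tail) tokens."""
    tokens = []
    for ch in word:
        code = ord(ch) - BASE
        if 0 <= code < N_LEAD * N_VOWEL * N_TAIL:
            lead, rest = divmod(code, N_VOWEL * N_TAIL)
            vowel, tail = divmod(rest, N_TAIL)
            tokens += [("L", lead), ("V", vowel)]
            if tail:  # tail index 0 means "no final consonant"
                tokens.append(("T", tail))
        else:
            tokens.append(("X", ch))  # non-Hangul characters are kept as-is
    return tokens

def edit_distance(a, b):
    """Plain Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def similarity(w1, w2):
    """Stand-in phonetic similarity in [0, 1]; an assumption, not the thesis measure."""
    a, b = to_jamo(w1), to_jamo(w2)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def equivalence_classes(words, threshold=0.8):
    """Merge every pair of words whose similarity reaches the threshold (union-find)."""
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for w1, w2 in combinations(words, 2):
        if similarity(w1, w2) >= threshold:
            parent[find(w1)] = find(w2)

    classes = {}
    for w in words:
        classes.setdefault(find(w), set()).add(w)
    return list(classes.values())

# Two spellings each of "data" and "digital" (illustrative input).
classes = equivalence_classes(["데이터", "데이타", "디지털", "디지탈"])
```

With this input, the two spellings of "data" fall into one class and the two spellings of "digital" into another, because each pair differs in a single vowel while the cross-word pairs differ in several components.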
The back-transliteration approach appears more convincing, since the original English word is unique whereas it can be transliterated into Korean in multiple ways. However, the back-transliteration approach is more difficult to implement than the transliteration approach. This claim rests on three observations: (1) back-transliteration is inherently more difficult than transliteration; (2) Korean text generally contains many more foreign words than English words; and (3) the English multi-word problem is much harder to handle in the back-transliteration approach than in the transliteration approach. Based on these observations, we argue that our proposed transliteration approach is more advantageous for resolving the word mismatch problem than the previously proposed back-transliteration approach. The results of our information retrieval experiments supported this argument.
Implementing either the transliteration approach or the back-transliteration approach is not easy, since both require very good solutions to the following largely unsolved problems: foreign word extraction, automatic transliteration and back-transliteration, and phonetic similarity comparison between foreign words. Low performance in any one of the processing modules would greatly degrade the final accuracy of the equivalence class construction. In this thesis we proposed a solution for each of the following tasks: foreign word extraction, automatic Korean-English transliteration and back-transliteration, Korean phonetic similarity comparison, and Korean-English character alignment. Automatic character alignment is indispensable for the automatic generation of training examples for automatic transliteration and back-transliteration. Our character alignment algorithm was highly accurate, but the solutions for the other tasks were not good enough. Hence the generated equivalence classes turned out to be too poor for practical application. We concluded that, for practical use in Korean information retrieval, more effective solutions must be sought for foreign word extraction, automatic transliteration and back-transliteration, and Korean phonetic similarity comparison. In the current situation, a realistic approach that avoids harming information retrieval performance is to decide more conservatively whether a word belongs to an equivalence class.
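One way to read "more conservative" is sketched below, under stated assumptions. A word is admitted to an existing class only if it is similar enough to every current member, so a chain of pairwise-similar words cannot pull an unrelated word into the class. The similarity function is a generic stand-in, not the thesis measure: it applies Python's `difflib.SequenceMatcher` to words decomposed into conjoining jamo by Unicode NFD normalization. The names `jamo_similarity` and `conservative_classes`, the threshold of 0.75, and the sample words are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher

def jamo_similarity(w1, w2):
    """Stand-in similarity: match ratio over NFD-decomposed jamo sequences."""
    a = unicodedata.normalize("NFD", w1)  # Hangul syllable -> conjoining jamo
    b = unicodedata.normalize("NFD", w2)
    return SequenceMatcher(None, a, b).ratio()

def conservative_classes(words, threshold=0.75):
    """Greedy grouping: a word joins a class only if it reaches the
    threshold against every member already in that class."""
    classes = []
    for w in words:
        for cls in classes:
            if all(jamo_similarity(w, m) >= threshold for m in cls):
                cls.append(w)
                break
        else:
            classes.append([w])  # no class accepted the word; start a new one
    return classes

# 센터 and 센타 are two spellings of "center"; 산타 ("Santa") is a different word.
groups = conservative_classes(["센터", "센타", "산타"])
```

Here 산타 scores 0.8 against 센타 but only 0.6 against 센터, so it is kept out of the "center" class. A rule that merged on any single pairwise match would have placed all three words together. The result of this greedy sketch depends on input order, which is one of the trade-offs of such a simple rule.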