Korean documents contain many transliterations of foreign loan words. These are usually proper nouns and technical terms, which play an important role in information retrieval. In cross-lingual information retrieval, such transliterations are a barrier to automatic term translation because they usually do not appear in a dictionary. Moreover, the same foreign word is often transliterated in several different ways across documents, which makes automatic transliteration for cross-lingual information retrieval more difficult.
Transliteration from English to Korean is assumed to be done in two ways: (1) automatically extracting the pronunciation from the English letters of a word and then converting that pronunciation to a Korean word, and (2) converting the English letters directly to a Korean word. In this thesis, the first is called the pivot method and the second the direct method. In addition to these two methods, a hybrid method is proposed, and the three methods are compared with one another. To allow a proper comparison, a statistical transliteration model (STM) is proposed. The STM automatically learns transliteration rules from bilingual word-aligned data and introduces pronunciation units to reflect the different sound structures of the two languages.
The pivot method works in two stages: in the first stage it uses the STM to produce pronunciations from English letters, and in the second stage it applies the Korean standard conversion rules for foreign word transliteration to convert those pronunciations to Korean characters. The direct method is implemented with the STM alone. The hybrid method collects the results of the other two methods and selects the one with the higher probability.
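The selection step of the hybrid method can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis: the `Candidate` type, the `hybrid_transliterate` function, and the two toy transliterators are hypothetical names introduced only for illustration, the probabilities are made-up values, and the sketch assumes that the scores produced by the pivot and direct methods are directly comparable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    """One transliteration hypothesis (hypothetical type for this sketch)."""
    korean: str
    probability: float
    method: str  # "pivot" or "direct"


# A transliterator maps an English word to scored Korean candidates.
Transliterator = Callable[[str], List[Candidate]]


def hybrid_transliterate(word: str,
                         pivot: Transliterator,
                         direct: Transliterator) -> Candidate:
    """Collect the results of both methods and keep the most probable one.

    Assumes both methods report probabilities on a comparable scale.
    """
    candidates = pivot(word) + direct(word)
    if not candidates:
        raise ValueError(f"no transliteration candidates for {word!r}")
    return max(candidates, key=lambda c: c.probability)


# Toy stand-ins for the two methods; the scores are invented for the example.
def toy_pivot(word: str) -> List[Candidate]:
    return [Candidate("데이타", 0.40, "pivot")]


def toy_direct(word: str) -> List[Candidate]:
    return [Candidate("데이터", 0.55, "direct")]


best = hybrid_transliterate("data", toy_pivot, toy_direct)
print(best.korean, best.method)
```

Passing the two methods in as callables keeps the selection logic independent of how each method computes its candidates, so either stand-in could be replaced by a real model without changing `hybrid_transliterate`.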
Experiments were performed on both transliteration and retransliteration: the former converts English words to Korean words, and the latter converts transliterated Korean words back to the original English words. For the transliteration experiment, performance was measured by transliteration accuracy, variation coverage, and the efficiency of information retrieval. For the retransliteration experiment, performance was measured by the accuracy of finding the original English word. The experiments showed that the hybrid method was the best on all criteria, and that the direct method performed slightly better than the pivot method in all tests except the information retrieval test. In conclusion, the experiments showed that Korean documents contain varied transliterations of the same words, and that the hybrid method, applied to both transliteration and retransliteration, was the most effective at retrieving these variants and the related documents in a cross-lingual information retrieval system.
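The abstract does not define the evaluation measures, so the following Python sketch shows only one plausible reading of two of them. It assumes that transliteration accuracy is the fraction of English words whose top-ranked output matches an accepted reference, and that variation coverage is the fraction of attested Korean variants that appear among the generated candidates. The function names and the toy data are hypothetical and are not results from the thesis.

```python
from typing import Dict, List, Set


def top1_accuracy(predictions: Dict[str, List[str]],
                  references: Dict[str, Set[str]]) -> float:
    """Fraction of source words whose top-ranked output is an accepted reference."""
    if not references:
        return 0.0
    hits = 0
    for word, accepted in references.items():
        ranked = predictions.get(word, [])
        if ranked and ranked[0] in accepted:
            hits += 1
    return hits / len(references)


def variation_coverage(predictions: Dict[str, List[str]],
                       variants: Dict[str, Set[str]]) -> float:
    """Fraction of attested variants found anywhere in the generated candidates."""
    total = sum(len(v) for v in variants.values())
    if total == 0:
        return 0.0
    covered = sum(len(v & set(predictions.get(word, [])))
                  for word, v in variants.items())
    return covered / total


# Invented toy data: ranked candidates per English word.
predictions = {"data": ["데이터", "데이타"], "file": ["파일"]}
references = {"data": {"데이터"}, "file": {"화일"}}
variants = {"data": {"데이터", "데이타"}, "file": {"파일", "화일"}}

accuracy = top1_accuracy(predictions, references)
coverage = variation_coverage(predictions, variants)
print(accuracy, coverage)
```

Under these assumed definitions, accuracy rewards only the single best output, while coverage rewards producing every spelling that actually occurs in documents, which is the property that matters when retrieving documents that use different transliterations of the same word.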