Bibliographic Information
한글 문헌를 위한 확률적 자동색인 모델 연구 = A probabilistic approach to automatic indexing of Korean texts
Title / Author: 한글 문헌를 위한 확률적 자동색인 모델 연구 = A probabilistic approach to automatic indexing of Korean texts / 박혁로 (Park, Hyouk-Ro).
Author: 박혁로 ; Park, Hyouk-Ro
Publication: [Daejeon : Korea Advanced Institute of Science and Technology, 1997].

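Holdings

Accession no.: 8007230
Location / Call no.: Academic Cultural Complex (Cultural Center), closed stacks / DCS 97003
Status: Available, loanable

Accession no.: 9003966
Location / Call no.: Seoul, thesis shelves / DCS 97003 c. 2
Status: Available, loanable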

Abstract

Identifying index terms from Korean documents poses unique problems for which the methods used to index English documents are inappropriate. In this thesis, we address two problems in the automatic indexing of Korean. The first problem is to identify nominal words and remove their suffixes in order to extract candidate index terms. Although many existing systems adopt morphological analyzers for this purpose, they suffer from heavy ambiguity and unknown words. To cope with these problems, this thesis introduces an improved tagging model for Korean that takes into account the type of wordphrases. Because Korean sentences consist of wordphrases that contain one or more morphemes, Korean tagging must be posed differently from English tagging. We introduce a hidden Markov model that closely reflects the natural structure of Korean. An experiment with 10,702 wordphrases shows that the tagging accuracy of the proposed model is 96.18%, which is at the upper end of reported results for Korean. The second problem is compound noun analysis, which concerns segmenting or decomposing compound nouns into promising index terms. Compound nouns used as index terms usually denote specific concepts and therefore tend to increase retrieval precision. The indiscriminate use of the component nouns of compound nouns as index terms, on the other hand, may improve recall but can decrease precision. We propose a method for handling compound nouns with the goal of preserving precision while gaining recall. In the proposed method, the relevance of the component nouns of a compound noun to the document content is computed by comparing the document sets supported by the component nouns with those supported by the terms of the document. The operational content of a term is represented as the probability distribution of the term over the document set. Experiments with a set of 1,000 documents show that the proposed method outperforms other known methods, including even manual indexing, for 30 sample queries.
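
The abstract does not say how the distributions are compared, so the following is only a minimal sketch of the general idea, not the thesis's actual method. It assumes that a term's distribution is its occurrence count per document normalized over the collection, and that two distributions are compared by histogram intersection. The function names, the comparison measure, and the toy English tokens are all hypothetical choices made for illustration.

```python
from collections import Counter

def term_distribution(term, docs):
    """Return the share of the term's occurrences falling in each document."""
    counts = [Counter(doc)[term] for doc in docs]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(docs)
    return [c / total for c in counts]

def overlap(p, q):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones.

    Assumption: the thesis does not name its comparison measure in the abstract.
    """
    return sum(min(a, b) for a, b in zip(p, q))

def component_relevance(component, doc_terms, docs):
    """Average overlap between a component noun and the document's other terms."""
    others = set(doc_terms) - {component}
    if not others:
        return 0.0
    p = term_distribution(component, docs)
    return sum(overlap(p, term_distribution(t, docs)) for t in others) / len(others)

# Toy collection (hypothetical data): two retrieval documents, two market documents.
docs = [
    ["index", "query", "retrieval", "index"],
    ["retrieval", "query", "index"],
    ["market", "price", "system"],
    ["system", "market", "price", "system"],
]

# A document that contains the compound "retrieval system" plus these terms.
doc_terms = ["index", "query"]

scores = {c: component_relevance(c, doc_terms, docs) for c in ("retrieval", "system")}
print(scores)
```

In this toy setting, "retrieval" is distributed over the same documents as the document's own terms and scores high, while "system" occurs only in unrelated documents and scores zero. Under this sketch, only the first component would be kept as an additional index term.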

Additional Bibliographic Information

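Call number: DCS 97003
Physical description: [vi], 65 p. : illustrations ; 26 cm
Language: Korean
General note: Appendices: A, KT data examples. - B, KT query. - C, List of noun affixes
Author's name in English: Hyouk-Ro Park
Advisor's name in Korean: 최기선
Advisor's name in English: Key-Sun Choi
Published in: Computer Processing of Oriental Languages
Thesis note: Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology : Department of Computer Science
Bibliography note: References : p. 54-60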
Subjects: Automatic indexing (자동색인)
Information retrieval (정보검색)
Tagging (품사태깅)
Compound noun (복합명사)