Korea Advanced Institute of Science and Technology (KAIST) Library

Main bibliographic information
Comparisons of classification methods in the original and pattern spaces and development of new pattern selection approaches for the logical analysis of data = 원래의 영역과 패턴 영역에서의 분류 기법 비교와 LAD의 패턴 선택 방법 개발
Title / Author	Comparisons of classification methods in the original and pattern spaces and development of new pattern selection approaches for the logical analysis of data = 원래의 영역과 패턴 영역에서의 분류 기법 비교와 LAD의 패턴 선택 방법 개발 / Jeong Han.
Publication	[Daejeon : Korea Advanced Institute of Science and Technology, 2010].

Holdings
Registration number	8021332
Location / Call number	학술문화관 (Academic Cultural Complex) preservation stacks / MIE 10019
Status	Available (not for loan)

Abstract

The logical analysis of data (LAD) is one of the promising data mining and machine learning techniques for extracting knowledge from data. The LAD was developed from concepts in combinatorics, optimization, and Boolean functions. The main steps of the LAD are data binarization, support set construction, pattern generation and selection, and theory formulation. The key feature of the LAD is its ability to detect hidden patterns in the data. Patterns are essentially combinations of certain attributes, and the LAD uses them to build a decision boundary for classification. The patterns can provide important information for distinguishing observations in one class from those in the other class. Because patterns are robust to measurement errors, their use may result in more stable classification performance for both the positive and negative classes. In addition, the patterns are interpretable and can serve as an essential tool for understanding the problem. These desirable properties motivate the use of the LAD patterns as input variables to other classification techniques in order to achieve more stable and accurate performance. In the first part of this thesis, the patterns generated by the LAD are used as input variables to the decision tree and k-nearest neighbor classification methods. The applicability and usefulness of the LAD patterns for classification are investigated through an experimental study. The classification results of different classifiers in the original and pattern spaces are compared on several public data sets in terms of classification accuracy and sensitivity. The LAD is also compared with other classification methods in the pattern space on the same data sets to examine the effect of the LAD steps that follow pattern generation.
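The idea of classifying in the pattern space can be illustrated with a minimal sketch. It assumes that the data are already binarized and that the patterns are already given; the `Pattern` representation, the helper names, and the toy data are hypothetical and are not taken from the thesis. Each observation is mapped to a 0/1 vector recording which patterns cover it, and a plain k-nearest neighbor vote with Hamming distance is then run on those vectors.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: a pattern is a conjunction of literals,
# stored as {attribute index: required binary value}.
Pattern = Dict[int, int]


def covers(pattern: Pattern, x: Sequence[int]) -> bool:
    """Return True if the binarized observation x satisfies every literal."""
    return all(x[i] == v for i, v in pattern.items())


def to_pattern_space(X: Sequence[Sequence[int]],
                     patterns: Sequence[Pattern]) -> List[List[int]]:
    """Map each observation to a 0/1 vector: entry j is 1 if pattern j covers it."""
    return [[int(covers(p, x)) for p in patterns] for x in X]


def knn_predict(train_Z: Sequence[Sequence[int]], train_y: Sequence[int],
                z: Sequence[int], k: int = 1) -> int:
    """Majority vote among the k training points closest to z in Hamming distance."""
    ranked = sorted(range(len(train_Z)),
                    key=lambda i: sum(a != b for a, b in zip(train_Z[i], z)))
    votes = sum(train_y[i] for i in ranked[:k])
    return int(2 * votes > k)  # class 1 only on a strict majority


# Toy data (assumed): four binarized observations, labels 1 = positive, 0 = negative.
X = [[1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 1]]
y = [1, 1, 0, 0]
patterns = [{0: 1, 1: 1},  # covers the two positive observations
            {0: 0}]        # covers the two negative observations

Z = to_pattern_space(X, patterns)
prediction = knn_predict(Z, y, to_pattern_space([[1, 1, 1]], patterns)[0], k=1)
```

A decision tree could be trained on `Z` in the same way; the only change from the original space is the feature matrix handed to the classifier.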
The experimental results show that classification in the pattern space can yield better accuracy than classification in the original space when the classification accuracy of the LAD is relatively good (i.e., the LAD patterns are of good quality), when the ratio of the number of patterns to the number of attributes is small, or when the data set is balanced between the two classes. Classification in the pattern space also achieves more stable results than classification in the original space in terms of sensitivity, and the decision tree and k-nearest neighbor methods applied in the pattern space can yield more accurate results than the LAD. On the other hand, the LAD, which builds its classifier by solving a set covering problem, tends to choose too many patterns, especially when outliers exist in the data set. In the set covering problem of the LAD, every observation must be covered by at least one pattern, even if the observation is an outlier. Existing approaches therefore select many patterns to cover these outliers, which results in overfitting. In the second part of this thesis, new pattern selection approaches for the LAD are proposed that take into account outliers and the coverage of a pattern. The proposed approaches can avoid overfitting by building a sparse classifier. The performance of the proposed pattern selection approaches is compared with that of the existing LAD approaches on several public data sets. Computational results show that the sparse classifiers built on the patterns selected by the proposed approaches yield improved classification performance compared to the existing approaches, especially when outliers exist in the data set.
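To make the overfitting argument concrete, here is a small sketch of pattern selection as a set covering problem, solved with the standard greedy heuristic. It is not the thesis's formulation, which the abstract does not spell out: the `min_gain` stopping rule is an assumed, simplified stand-in for "considering outliers and the coverage of a pattern", and the function name and toy coverage sets are hypothetical.

```python
from typing import List, Set


def greedy_select(coverage: List[Set[int]], n_obs: int, min_gain: int = 1) -> List[int]:
    """Greedy set-cover heuristic for choosing patterns.

    coverage[j] is the set of observation indices covered by pattern j.
    min_gain is an assumed knob: selection stops once the best remaining
    pattern would cover fewer than min_gain new observations.
    min_gain=1 covers every coverable observation (the classical setting);
    min_gain>1 leaves hard-to-cover observations, such as outliers, uncovered.
    """
    if not coverage:
        return []
    uncovered = set(range(n_obs))
    chosen: List[int] = []
    while uncovered:
        # Pick the pattern that covers the most still-uncovered observations.
        best = max(range(len(coverage)), key=lambda j: len(coverage[j] & uncovered))
        gain = len(coverage[best] & uncovered)
        if gain < min_gain:
            break
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


# Toy example (assumed): observation 5 behaves like an outlier, since the only
# pattern that covers it (pattern 2) covers nothing else.
coverage = [{0, 1, 2}, {3, 4}, {5}, {2, 3}]
full_cover = greedy_select(coverage, n_obs=6)                # covers all six observations
sparse_cover = greedy_select(coverage, n_obs=6, min_gain=2)  # skips the outlier-only pattern
```

With full coverage the classifier needs three patterns, one of which exists only to cover a single observation; relaxing the coverage requirement drops that pattern and gives a sparser model.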

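The logical analysis of data (LAD) is a useful data mining and machine learning methodology for extracting knowledge from data. Developed on the basis of combinatorics, optimization theory, and Boolean concepts, the LAD consists of data binarization, support set generation, pattern generation and selection, and theory formulation. The core of the LAD is finding patterns hidden in the data, and these patterns are built as combinations of particular variables. Patterns not only determine the criteria for classification but also provide important information for distinguishing the classes of observations. Using patterns therefore allows stable classification for each class and classification that is robust to measurement errors. Patterns also have the advantage of being easy to interpret and of serving as a fundamental tool for understanding the problem. These useful properties motivate the use of LAD patterns as new input variables to other classification methods, with the aim of obtaining more stable and accurate results. This thesis first uses LAD patterns as input variables to methods such as decision tree and k-nearest neighbor classification and experimentally confirms the applicability and usefulness of the patterns. Using a variety of public data sets, classification methods in the original space and in the pattern space are compared on the basis of prediction accuracy and sensitivity, and the LAD is also compared with other classification methods to evaluate the steps that follow pattern generation. The experimental results show that classification in the pattern space improves on classification in the original space in terms of prediction accuracy when the accuracy of the LAD is relatively high, when the ratio of the number of patterns to the number of variables is small, or when the two classes are of similar size. In terms of sensitivity, classification in the pattern space gives better results, and other classification methods applied in the pattern space achieve higher prediction accuracy than the LAD. On the other hand, the LAD tends to select too many patterns when building a classification model, a weakness that can become serious when outliers are present. Under the set covering problem used for pattern selection in the LAD, every observation must be explained by at least one pattern; this applies to outliers as well and can cause overfitting. The second study of this thesis therefore establishes new pattern selection methods that take into account outliers and the coverage of a pattern. These methods build a classification model with fewer patterns and are confirmed to alleviate the overfitting problem. To compare the existing and newly proposed methods, experiments were likewise carried out on a variety of data sets, and the results demonstrate that the proposed pattern selection methods bring an improvement when outliers are present.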

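Additional bibliographic information
Call number	MIE 10019
Physical description	vi, 53 p. : illustrations ; 26 cm
Language	English
General note	Author's name in Korean: 한정. Advisor's name in English: Bong-Jin Yum. Advisor's name in Korean: 염봉진.
Thesis	Thesis (Master's) - Korea Advanced Institute of Science and Technology : Department of Industrial and Systems Engineering,
Bibliography note	References: p. 51-53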
