One of the major problems in this context is combining nominal, discrete, and continuous variables in the same model. Many studies have sought efficient algorithms for partitioning the range of a continuous variable into a discrete number of intervals.
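As a minimal sketch of what such a partitioning looks like (not a method from this thesis), equal-width binning divides the range of a continuous variable into k intervals of identical length; the function name below is ours:

```python
def equal_width_bins(values, k):
    """Map each value to one of k equal-width intervals, indexed 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = []
    for v in values:
        # The maximum value would index bin k, so clamp it into the last interval.
        idx = min(int((v - lo) / width), k - 1)
        bins.append(idx)
    return bins

print(equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], 4))  # → [0, 1, 2, 3, 3]
```

Equal-width binning is unsupervised: it ignores any other variable, which is exactly the limitation that motivates the supervised methods discussed next.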
These difficulties prompt researchers and practitioners to discretize continuous features before or during a machine learning or data mining task. Most discretization methods use a continuous variable's relationship to another variable, often the class, to determine the partitions; this is common in classification procedures such as decision trees and naive Bayesian classifiers, and numerous discretization methods are available in the literature. Discretization also reduces and simplifies data: for both users and experts, discrete features are easier to understand, use, and explain. Discrete values therefore play an important role in data mining and knowledge discovery.
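A hedged sketch of this supervised, class-informed idea (function names are ours, not the thesis's): choose the cut point on a continuous feature that minimizes the class-weighted entropy of the induced binary partition, as decision trees and entropy-based discretizers do.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label sample, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the midpoint threshold whose binary split has the lowest
    class-weighted entropy."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate cut between neighbors
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t

# The cleanest split of these two class clusters is at 6.5.
print(best_cut([1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
               ["a", "a", "a", "b", "b", "b"]))  # → 6.5
```

Applying the same search recursively to each resulting interval, with a stopping rule, yields a multi-interval supervised discretization.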
Many studies show that induction tasks can benefit from discretization: rules with discrete values are normally shorter and more understandable, and discretization can lead to improved predictive accuracy.
Widely used systems such as CART (Breiman et al., 1984) deploy various ways to avoid using continuous values directly. Discrete features are closer to a knowledge-level representation (Simon, 1981) than continuous ones.
There are many other advantages of using discrete values over continuous ones.
Our thesis aims at a systematic study of discretization methods: their history of development, their effect on classification, and the trade-off between speed and accuracy. The contributions of this thesis are an abstract description summarizing existing discretization methods; a hierarchical framework that categorizes the existing methods and paves the way for further development; concise discussions of representative discretization methods; and extensive experiments together with their analysis. The method is demonstrated on data, and the results show that the triple partitioning maps continuous variables into discrete ones with a high goodness-of-fit level.
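As an illustration only of how such a fit might be scored (the thesis's exact criterion may differ, and the cut points and expected proportions below are invented), a chi-square goodness-of-fit statistic compares the observed counts in the three intervals of a triple partition with the counts expected under a reference distribution:

```python
def chi_square_gof(values, cuts, expected_props):
    """Chi-square goodness-of-fit statistic for a triple partition.

    cuts: two thresholds defining three intervals.
    expected_props: expected probability mass of each interval.
    """
    counts = [0, 0, 0]
    for v in values:
        if v <= cuts[0]:
            counts[0] += 1
        elif v <= cuts[1]:
            counts[1] += 1
        else:
            counts[2] += 1
    n = len(values)
    # Sum of (observed - expected)^2 / expected over the three intervals.
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(counts, expected_props))

# Perfect agreement with a uniform expectation yields a statistic of 0.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(chi_square_gof(data, cuts=(3.5, 6.5),
                     expected_props=[1/3, 1/3, 1/3]))  # → 0.0
```

Lower values indicate a better fit; a statistic near zero means the discrete partition preserves the distribution of the continuous variable well.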
Combining discrete and continuous variables in one model is the main problem of this research. To date there have been various studies of algorithms that partition a continuous variable into intervals of discrete values. By converting data into discrete variables, the data can be reduced and simplified. Systems such as CART (classification and regression tree; Breiman et al., 1984) avoid using continuous values directly. For users and experts, discrete variables have many advantages in understanding, use, and explanation; discrete data therefore play an important role in data mining and knowledge discovery. This thesis aims at a systematic study of discretization methods, a review of existing discretization methods, and the application of discretization methods to graphical models.