These days we are exposed to huge amounts of data, some of which are related to one another, but the relationships are difficult to find. Data mining is a series of procedures that extract information by exploring and modelling the relationships within such data, and CART is one of the most popular data mining tools. It builds a classification tree for a categorical response variable and a regression tree for a continuous response variable. The trees are built so that predictor variables are selected one after another in order of the amount of information each carries about the response variable, where that amount is computed conditional on the outcomes of the predictor variables already selected during tree construction.
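As a rough illustration of ordering predictors by the information they carry about a response, the following Python sketch ranks categorical predictors by entropy-based information gain. The measure is an assumption made for illustration only: the text above does not name one, and CART implementations commonly use Gini impurity for classification. The function names `entropy`, `information_gain`, and `rank_predictors` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of categorical labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(x, y):
    """Reduction in the entropy of y after splitting on the categorical x."""
    n = len(y)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    # Entropy of y within each level of x, weighted by the level's frequency.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(y) - conditional


def rank_predictors(predictors, y):
    """Return predictor names ordered by decreasing information gain about y."""
    return sorted(
        predictors,
        key=lambda name: information_gain(predictors[name], y),
        reverse=True,
    )


# Toy data (hypothetical): "a" determines y exactly, "b" is unrelated to y.
y = [0, 0, 1, 1]
predictors = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
ranking = rank_predictors(predictors, y)
```

In a tree, the same computation would be repeated within each node, that is, on the subset of observations determined by the predictors already selected, which is what makes the information amount conditional.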
Our goal in this thesis is to find a model structure for a large set of random variables, some of which are continuous and the rest categorical.
While CART is a supervised learning method, log-linear modelling is an unsupervised one. We use CART at the initial stage of large-scale modelling to select subgroups of the random variables in the whole data set. Because CART can be applied to data sets with many random variables of mixed type, is easy to apply, and yields results that are easy to interpret, we can readily group the variables so that the variables within a group are highly associated with each other.
Once the groups of random variables are obtained, we apply log-linear modelling to each group and obtain graphical log-linear models whose structures are representable as graphs of vertices and edges. From each graphical model we find a particular type of graph separator called a "prime separator", defined as a graph separator that separates cliques or irreducible cycles. Prime separators have the useful property that they remain prime separators both in a graphical model and in its marginal models. This property is used in combining marginal models of a graphical log-linear model.
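To make the notion of a graph separator concrete, the following Python sketch checks whether removing a set of vertices disconnects two given vertices of an undirected graph. It illustrates only the general separator idea and does not test whether a separator is prime in the sense used here. The function name `is_separator`, the adjacency-dictionary representation, and the example graph are assumptions made for illustration.

```python
from collections import deque


def is_separator(adj, sep, a, b):
    """Return True if removing the vertices in sep disconnects a from b.

    adj maps each vertex to the set of its neighbours (undirected graph).
    """
    sep = set(sep)
    if a in sep or b in sep:
        raise ValueError("a and b must lie outside the separator")
    seen = {a}
    queue = deque([a])
    # Breadth-first search from a that never enters a separator vertex.
    while queue:
        v = queue.popleft()
        if v == b:
            return False
        for w in adj[v]:
            if w not in sep and w not in seen:
                seen.add(w)
                queue.append(w)
    return True


# Hypothetical graph: two cliques {1, 2, 3} and {3, 4, 5} sharing vertex 3.
adj = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}
```

In this example the single vertex 3 separates the two cliques, so `is_separator(adj, {3}, 1, 5)` holds, whereas removing vertex 2 leaves vertices 1 and 5 connected.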
We find that the grouping of the random variables strongly affects the whole modelling procedure. An edge connecting a pair of random variables has a high probability of being missing from the combined model if no group of variables contains both of them. To recover these missing edges, we need a further grouping of variables, building a marginal model for a set of variables that contains both variables of each pair corresponding to a missing edge.
The proposed approach is applied to simulated data on 100 random variables, 80 of which are binary and the rest continuous. We categorized the continuous variables into binary or 4-level categorical variables. The approach produced a model that detected most of the edges in the true model, together with some superfluous edges that can be removed by an extended marginal modelling procedure.
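The categorization of a continuous variable into two or four levels can be sketched as follows. The cut points used in the thesis are not stated here, so the use of empirical quantiles is an assumption, and the function name `discretize` and the sample values are hypothetical.

```python
def discretize(values, levels):
    """Map each value to a level in 0..levels-1 using empirical quantile cuts.

    Assumption: cut points are taken at equally spaced ranks of the sorted
    sample; other cut-point rules would work equally well.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[(k * n) // levels] for k in range(1, levels)]
    # The level of a value is the number of cut points it reaches or exceeds.
    return [sum(v >= c for c in cuts) for v in values]


# Hypothetical sample of a continuous variable.
sample = [0.1, 0.4, 0.2, 0.9, 0.7, 0.3, 0.8, 0.6]
binary = discretize(sample, 2)
four_level = discretize(sample, 4)
```

With this sample, the binary version splits at the median and the 4-level version at the quartiles, so each level receives the same number of observations.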
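Data mining is a technique for finding relationships in data that are hidden or too complex to be readily apparent, and for making predictions based on those relationships. The CART algorithm identifies, one after another, the main predictor variables that affect the response variable, regardless of linearity or of whether the variables are associated with one another, so it can serve as a principal analysis tool for data mining.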
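The aim of this thesis is to find the structure of a large model whose variables include both categorical and continuous data. Because CART makes it possible to grasp the relationships among variables quickly and easily and to cluster the variables into subgroups, this thesis proposes a method in which each group obtained through CART is analysed by log-linear modelling to obtain a marginal graphical model, the "prime separators" of each model are found, and the marginal models are combined around these separators as a skeleton to develop the large model.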
Finally, we apply the method to data to confirm the effectiveness of this model search procedure and point out the problems that arise.