Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A number of popular classifiers construct decision trees to generate class models. Current algorithms for constructing decision trees, including main-memory algorithms, generally make several scans over the training database. Given the computation-intensive nature of decision tree construction, algorithms for efficiently constructing an approximate decision tree are of crucial importance.
In this paper, we propose a novel solution to the problem of efficient decision tree construction based on the idea of stratified sampling. We observe that a decision tree structure provides a natural, effective collection of strata that can significantly improve sampling accuracy compared to simple random sampling. By employing random samples drawn from the leaves of the tree to construct the new tree, our algorithm builds trees very quickly, with higher accuracy than trees constructed from a uniform random sample of the same size drawn from the entire database. We present theoretical results that demonstrate the superiority of our stratified sampling method for estimating class proportions, a critical element in the construction of accurate decision trees. We validate this analysis with extensive experiments on both real-life and synthetic data sets, confirming the effectiveness of the proposed scheme.
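The intuition behind using tree leaves as strata can be illustrated with a small simulation. The sketch below is not the paper's algorithm: the data set, the two-way partition standing in for tree leaves, and all function names are illustrative assumptions. It compares the empirical variance of a class-proportion estimate from a simple random sample against a proportional-allocation stratified estimate over nearly pure strata, the situation that decision tree leaves approximate.

```python
import random

random.seed(0)

# Hypothetical database of (feature, label) records. Labels correlate
# strongly with the feature, mimicking the class-homogeneous leaves a
# decision tree induces (illustrative data, not the paper's datasets).
database = [(x, "A" if random.random() < (0.9 if x < 700 else 0.1) else "B")
            for x in range(1000)]

# Strata induced by a hypothetical tree split at x < 700: each stratum
# plays the role of one leaf's records and is nearly pure in one class.
strata = [[r for r in database if r[0] < 700],
          [r for r in database if r[0] >= 700]]

def srs_estimate(db, n):
    """Estimate P(label == 'A') from a simple random sample of size n."""
    sample = random.sample(db, n)
    return sum(1 for _, y in sample if y == "A") / n

def stratified_estimate(strata, n):
    """Proportional-allocation stratified estimate of P(label == 'A')."""
    total = sum(len(s) for s in strata)
    est = 0.0
    for s in strata:
        n_h = max(1, round(n * len(s) / total))   # per-stratum sample size
        sample = random.sample(s, n_h)
        p_h = sum(1 for _, y in sample if y == "A") / n_h
        est += (len(s) / total) * p_h             # weight by stratum size
    return est

def empirical_variance(estimator, trials=2000):
    ests = [estimator() for _ in range(trials)]
    mean = sum(ests) / trials
    return sum((e - mean) ** 2 for e in ests) / trials

var_srs = empirical_variance(lambda: srs_estimate(database, 50))
var_str = empirical_variance(lambda: stratified_estimate(strata, 50))
print(f"SRS variance:        {var_srs:.6f}")
print(f"Stratified variance: {var_str:.6f}")
# With nearly pure strata, the stratified variance should be noticeably
# smaller: between-stratum variation is removed from the estimator.
```

Because the within-stratum class proportions are close to 0 or 1, most of the variability in a simple random sample comes from how many records it happens to draw from each stratum; fixing those counts in advance, as stratified sampling does, eliminates that component of the variance.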