Histograms have received considerable attention recently. They are
commonly used in commercial database systems to capture attribute value
distributions for query optimization. With the advent of research
on approximate query answering and stream data, interest in histograms
has spread widely. The simplest approach assumes that the attributes
in relational tables are independent, under the AVI (Attribute Value
Independence) assumption.
However, this assumption is not generally valid for real-life datasets.
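
As a rough illustration of why the AVI assumption can fail, the following sketch compares the true selectivity of a conjunctive predicate with the product of the per-attribute selectivities. The toy table, the predicate, and the helper name `selectivity` are assumptions made for this example only and are not taken from the thesis.

```python
def selectivity(rows, pred):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for r in rows if pred(r)) / len(rows)

# Hypothetical toy table with two perfectly correlated attributes (a == b).
rows = [(i % 4, i % 4) for i in range(100)]

# Per-attribute selectivities, as two one-dimensional histograms would give.
sel_a = selectivity(rows, lambda r: r[0] == 0)
sel_b = selectivity(rows, lambda r: r[1] == 0)

# AVI estimate: multiply the marginals as if a and b were independent.
avi_estimate = sel_a * sel_b

# True selectivity of the conjunction "a = 0 AND b = 0".
true_sel = selectivity(rows, lambda r: r[0] == 0 and r[1] == 0)

print(f"AVI estimate: {avi_estimate}, true selectivity: {true_sel}")
```

On this correlated data the AVI estimate is four times smaller than the true selectivity, which is the kind of error that multi-dimensional summaries aim to reduce.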
To alleviate the problem of approximating multi-dimensional data with
multiple one-dimensional histograms, several techniques
such as wavelets, random sampling, and multi-dimensional histograms have been proposed.
Among them, GENHIST is a multi-dimensional histogram
that is designed to approximate the data distribution over real-valued attributes.
It uses overlapping buckets, which allow a more efficient approximation
of the data distribution.
In this thesis, we propose a scheme, OPT, that determines
the optimal frequencies of overlapping buckets,
that is, the frequencies that minimize the SSE (Sum of Squared Errors).
A histogram with overlapping buckets is first generated by GENHIST,
and OPT then improves the histogram by calculating the optimal frequency
for each bucket.
Our experimental results confirm that our technique can significantly improve the
accuracy of histograms generated by GENHIST.
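
To make the idea of SSE-minimizing bucket frequencies concrete, the sketch below treats it as an ordinary linear least-squares problem on a one-dimensional toy domain. This is only one plausible reading under stated assumptions, not necessarily the formulation used by OPT: the SSE is measured over unit cells, each bucket spreads its frequency uniformly over the cells it covers, and the estimate for a cell is the sum over all buckets covering it. The cell counts, the bucket ranges, the starting frequencies, and all function names are hypothetical.

```python
def design_matrix(n_cells, buckets):
    """A[i][b] = share of bucket b's frequency that falls in cell i."""
    return [[1.0 / (hi - lo) if lo <= i < hi else 0.0 for (lo, hi) in buckets]
            for i in range(n_cells)]

def estimate(A, freqs):
    """Estimated count per cell: sum of contributions of covering buckets."""
    return [sum(a * f for a, f in zip(row, freqs)) for row in A]

def sse(A, freqs, counts):
    """Sum of squared errors between estimated and true cell counts."""
    return sum((e - c) ** 2 for e, c in zip(estimate(A, freqs), counts))

def solve(M, v):
    """Gaussian elimination with partial pivoting (M must be nonsingular)."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def optimal_frequencies(A, counts):
    """Solve the normal equations (A^T A) f = A^T c for the frequencies f."""
    n_b = len(A[0])
    AtA = [[sum(row[p] * row[q] for row in A) for q in range(n_b)]
           for p in range(n_b)]
    Atc = [sum(row[p] * c for row, c in zip(A, counts)) for p in range(n_b)]
    return solve(AtA, Atc)

# Hypothetical data: true counts for six unit cells, with a dense middle.
counts = [2, 2, 10, 10, 2, 2]
# Two overlapping buckets as half-open cell ranges: the whole domain and
# a nested bucket over the dense cells.
buckets = [(0, 6), (2, 4)]
A = design_matrix(len(counts), buckets)

# Hypothetical starting frequencies: each point is credited to the smallest
# bucket containing it (20 points in the inner bucket, 8 in the outer one).
initial = [8.0, 20.0]
optimal = optimal_frequencies(A, counts)

print("initial SSE:", sse(A, initial, counts))
print("optimal frequencies:", optimal, "SSE:", sse(A, optimal, counts))
```

In this toy case the bucket layout is left unchanged and only the frequencies are recomputed, which mirrors the division of labor described above: the bucket boundaries come from the histogram construction step, and the frequencies are tuned afterwards.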