Multidimensional on-line analytical processing (MOLAP) is a database application that allows users to extract the information necessary for decision-making. Aggregation is an operation that plays a key role in MOLAP. Since computing aggregation is very expensive, good aggregation algorithms are crucial for achieving performance in MOLAP systems.
In the first part of this dissertation, we present a method for computing aggregation in MOLAP. Existing aggregation methods have been proposed for file structures such as multidimensional arrays. These file structures are suitable for data with uniform distributions, but do not work well with skewed distributions. In this dissertation we consider an aggregation method that uses dynamic multidimensional files adapting to skewed distributions. In these multidimensional files, the sizes of page regions vary according to the data density in these regions, and the pages that belong to a larger region are accessed multiple times while computing aggregations. To solve this problem, we first present an aggregation computation model, called the $\textit{Disjoint-Inclusive Partition (DIP) computation model}$, that is the formal basis of our approach. DIP is defined as a partition of the domain space in which, when two regions are projected onto any subspace, the projected regions are disjoint or one region includes the other. Based on this model, we then present the one-pass aggregation algorithm. This algorithm computes aggregations using the $\textit{one-pass buffer size}$, which is the minimum buffer size required for guaranteeing one disk access per page. We prove that our aggregation algorithm is optimal with respect to the one-pass buffer size under our aggregation computation model. The DIP computation model allows us to correctly predict the order of accessing data pages in advance. Thus, our algorithm achieves the optimal one-pass buffer size by using a buffer replacement policy, such as Belady`s $B_0$ or Toss-Immediate policies, that exploits the page access order computed in advance. Since the page access order is not known $\textit{a priori}$ in general, these policies have been known to lack practicality despite its theoretic significance. Nevertheless, in this dissertation, we show that these policies can be effectively used for aggregation computation.
We have conducted extensive experiments. We first demonstrate that the one-pass buffer size theoretically derived is indeed correct in real environments. We also show that the memory requirement of our algorithm for processing the aggregation in one-pass is very small being 0.05\%~0.6\% of the size of the database. These results indicate that our algorithm is practically usable even for a fairly large database.
In the second part of this dissertation, we define a new class of aggregation queries, called range-groupby queries, and present a method for processing them. A range-groupby query is defined as a query that, for an arbitrarily specified region of an n-dimensional cube, computes aggregations for each combination of values of the grouping attributes given. Range-groupby queries are very frequently used in analyzing information in MOLAP since they allow us to summarize trends in an arbitrarily specified subregion of the domain space. In MOLAP applications, in order to improve the performance of query processing, methods of maintaining precomputed aggregation results called the prefix-sum arrays are widely used. For the case of range-groupby queries, however, maintaining precomputed aggregation results for each combination of the grouping attributes incurs enormous storage overhead. Here, we propose a fast algorithm that can compute range-groupby queries with minimal storage overhead. We study range-groupby queries with the SUM operator, which is an aggregation operator most frequently used in MOLAP. However, our method is also applicable to other aggregation operators such as COUNT or AVERAGE. We use the prefix-sum array method to process range-groupby queries. Our algorithm maintains only one prefix-sum array and still effectively processes range-groupby queries for all possible combinations of values of the grouping attributes. Compared with the method that maintains a precomputed prefix-sum for each combination of the grouping attributes in an n-dimensional cube, our algorithm reduces the space overhead by $\mathrm{O}(\frac{1}{2^n})$, while accessing a similar number of cells.
In summary, we believe the notion of DIP provides an excellent formal basis for investigating further issues in computing aggregations in MOLAP. We also believe that our work for range-groupby query processing provides a formal basis for investigating other advanced MOLAP queries that are confined to an arbitrarily specified region of the cube.