Range Selectivity Estimation for Continuous Attributes

Flip Korn


Many commercial database systems maintain histograms to estimate query selectivities efficiently as part of query optimization. Most work on histogram design is implicitly geared towards discrete or categorical attribute value domains. In this paper, we consider approaches that are better suited for the continuous valued attributes commonly found in scientific and statistical databases. We propose two methods based on spline functions for estimating the selectivity of range queries over univariate and multivariate data.

These methods have two advantages over histograms. First, they are more accurate. As the results from our experiments on both real and synthetic datasets demonstrate, the proposed methods achieve substantially better (up to 6 times) estimation error than the state-of-the-art histograms, at exactly the same storage space and with comparable CPU runtime overhead; moreover, the superiority of the proposed spline methods becomes even more dramatic over multivariate data. Second, the proposed methods can be used for data mining. The continuity and smoothness of splines make them more adept at identifying and locating features of the density (modalities, heavy tails, etc.) than histograms. For these reasons, we recommend splines for statistical profiling.

Back to the Spring 1998 dbchat index.