Using Histograms for Skyline Size Estimation
Abstract
Let T be a table of n points described by a set of d attributes (dimensions).
Let p and q be two points in T. p dominates q iff it is at least as good as q in every
dimension and there exists at least one dimension in which p is
strictly better than q. p is a skyline point of T iff it is not dominated
by any other point of T. A skyline query returns the set of all skyline
points. To integrate skyline queries into database management
systems, an estimate of the skyline cardinality is
important for query optimization purposes. We propose techniques
for estimating skyline cardinality when data distribution is known.
We first provide an unbiased estimator that requires a single pass
over the whole data, which is much faster than computing the exact
skyline. Then, we show that this estimator can be used on a sample
of the underlying data while preserving the estimation quality, i.e.,
it is still unbiased. Next, we provide a convergent estimator that
requires no access to the data, only the data distribution. It estimates
the expected skyline cardinality over the data sets that follow
this distribution. The advantages of these solutions are their ease
of implementation and the fact that, in contrast to other proposals, no costly
subskyline queries are required. We implemented our solutions
and report experiments showing both the accuracy of
the estimates and the efficiency with which they are
obtained.
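
To make the definitions of dominance and skyline concrete, the following is a minimal sketch in Python. It assumes that smaller values are better in every dimension (the abstract does not fix a preference order), and the names `dominates` and `skyline` are illustrative, not taken from the paper. It computes the exact skyline by quadratic pairwise comparison, which is the costly baseline the estimators aim to avoid; it is not one of the proposed estimators.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(p: Point, q: Point) -> bool:
    """Return True iff p dominates q.

    Assumption: smaller is better in every dimension.
    p must be at least as good as q everywhere and strictly better somewhere.
    """
    at_least_as_good = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return at_least_as_good and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates (O(n^2) comparisons)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Toy table T with n = 5 points and d = 2 dimensions.
T = [(1, 4), (2, 2), (3, 3), (4, 1), (2, 5)]
print(len(skyline(T)))  # skyline cardinality of T
```

In this toy table, (3, 3) is dominated by (2, 2) and (2, 5) is dominated by (1, 4), so the skyline cardinality is 3.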