Representative Skylines using Threshold-based Preference Distributions
Venue
International Conference on Data Engineering (ICDE) (2011)
Publication Year
2011
Authors
Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Richard J. Lipton, Jim Xu
BibTeX
Abstract
The study of skylines and their variants has received considerable attention in
recent years. Skylines are essentially sets of most interesting (undominated)
tuples in a database. However, since the skyline is often very large, much research
effort has been devoted to identifying a smaller subset of (say k) “representative
skyline” points. Several different definitions of representative skylines have been
considered. Most of these formulations are intuitive in that they try to achieve
some kind of clustering “spread” over the entire skyline, with k points. In this
work, we take a more principled approach in defining the representative skyline
objective. One of our main contributions is to formulate the problem of displaying
k representative skyline points such that the probability that a random user would
click on one of them is maximized. Two major research questions arise naturally
from this formulation. First, how does one mathematically model the likelihood with
which a user is interested in and will “click” on a certain tuple? Second, how does
one negotiate the absence of the knowledge of an explicit set of target users; in
particular what do we mean by “a random user”? To answer the first question, we
model users based on a novel formulation of threshold preferences which we will
motivate further in the paper. To answer the second question, we assume a
probability distribution of users instead of a fixed set of users. While this makes
the problem harder, it lends more mathematical structures that can be exploited as
well, as one can now work with probabilities of thresholds and handle cumulative
density functions. On the theoretical front, our objective is NP-hard. For the case
of a finite set of users with known thresholds, we present a simple greedy
algorithm that attains an approximation ratio of (1 − 1/e) of the optimal. For the
case of user distributions, we show that a careful yet similar greedy algorithm
achieves the same approximation ratio. Unfortunately, it turns out that this
algorithm is rather involved and computationally expensive. So we present a
threshold sampling based algorithm that is more computationally affordable and, for
any fixed epsilon > 0, has an approximation ratio of (1 − 1/e − epsilon). We
perform experiments on both real and synthetic data to show that our algorithm
significantly outperforms previously proposed approaches.
