Publication Data
Representative Skylines using Threshold-based Preference Distributions
Abstract: The study of skylines and their variants has received
considerable attention in recent years. Skylines are essentially sets of most
interesting (undominated) tuples in a database. However, since the skyline is often
very large, much research effort has been devoted to identifying a smaller subset of
(say k) “representative skyline” points. Several different definitions of
representative skylines have been considered. Most of these formulations are intuitive
in that they try to achieve some kind of clustering “spread” over the entire skyline,
with k points. In this work, we take a more principled approach in defining the
representative skyline objective. One of our main contributions is to formulate the
problem of displaying k representative skyline points such that the probability that a
random user would click on one of them is maximized. Two major research questions arise
naturally from this formulation. First, how does one mathematically model the
likelihood with which a user is interested in and will “click” on a certain tuple?
Second, how does one negotiate the absence of the knowledge of an explicit set of
target users; in particular what do we mean by “a random user”? To answer the first
question, we model users based on a novel formulation of threshold preferences which we
will motivate further in the paper. To answer the second question, we assume a
probability distribution of users instead of a fixed set of users. While this makes the
problem harder, it lends more mathematical structures that can be exploited as well, as
one can now work with probabilities of thresholds and handle cumulative density
functions. On the theoretical front, our objective is NP-hard. For the case of a finite
set of users with known thresholds, we present a simple greedy algorithm that attains
an approximation ratio of (1 − 1/e) of the optimal. For the case of user distributions,
we show that a careful yet similar greedy algorithm achieves the same approximation
ratio. Unfortunately, it turns out that this algorithm is rather involved and
computationally expensive. So we present a threshold sampling based algorithm that is
more computationally affordable and, for any fixed epsilon > 0, has an approximation
ratio of (1 − 1/e − epsilon). We perform experiments on both real and synthetic data to
show that our algorithm significantly outperforms previously proposed approaches.
