# Clustering Small Samples with Quality Guarantees: Adaptivity with One2all pps

## Abstract

Clustering of data points is a fundamental tool in data analysis. We consider
points $X$ in a relaxed metric space, where the triangle inequality holds within a
constant factor. A clustering of $X$ is a partition of $X$ defined by a set of
points $Q$ ({\em centroids}), according to the closest centroid. The {\em cost} of
clustering $X$ by $Q$ is $V(Q)=\sum_{x\in X} d_{xQ}$. This formulation generalizes
classic $k$-means clustering, which uses squared distances. Two basic tasks,
parametrized by $k \geq 1$, are {\em cost estimation}, which returns (approximate)
$V(Q)$ for queries $Q$ such that $|Q|=k$ and {\em clustering}, which returns an
(approximate) minimizer of $V(Q)$ of size $|Q|=k$. When the data set $X$ is very
large, we seek efficient constructions of small samples that can act as surrogates
for performing these tasks. Existing constructions that provide quality guarantees,
however, are either worst-case, and unable to benefit from structure of real data
sets, or make explicit strong assumptions on the structure. We show here how to
avoid both these pitfalls using adaptive designs. The core of our design are the
novel {\em one2all} probabilities, computed for a set $M$ of centroids and $\alpha
\geq 1$: The clustering cost of {\em each} $Q$ with cost $V(Q) \geq V(M)/\alpha$
can be estimated well from a sample of size $O(\alpha |M|\epsilon^{-2})$. For cost
estimation, we apply one2all with a bicriteria approximate $M$, while adaptively
balancing $|M|$ and $\alpha$ to optimize sample size per quality. For clustering,
we present a wrapper that adaptively applies a base clustering algorithm to a
sample $S$, using the smallest sample that provides the desired statistical
guarantees on quality. We demonstrate experimentally the huge gains of using our
adaptive instead of worst-case methods.