# An optimal online algorithm for retrieving heavily perturbed statistical databases in the low-dimensional querying model

### Venue

Proceedings of the Twenty-Fourth ACM International Conference on Information and Knowledge Management (CIKM 2015)

### Publication Year

2015

### Authors

Krzysztof Choromanski, Afshin Rostamizadeh, Umar Syed

## Abstract

We give the first O(1/sqrt{T})-error online algorithm for reconstructing noisy
statistical databases, where T is the number of (online) sample queries received.
The algorithm is optimal up to a poly(log(T)) factor in terms of the error and
requires only O(log T) memory. It aims to learn a hidden database-vector w* in R^D
in order to accurately answer a stream of queries regarding the hidden database,
which arrive in an online fashion from some unknown distribution D. We assume the
distribution D is defined on the neighborhood of a low-dimensional manifold. The
presented algorithm runs in O(dD)-time per query, where d is the dimensionality of
the query-space. Contrary to the classical setting, there is no separate training
set that is used by the algorithm to learn the database: the stream on which the
algorithm will be evaluated must also be used to learn the database-vector. The
algorithm only has access to a binary oracle O that answers whether a particular
linear function of the database-vector plus random noise is larger than a
threshold, which is specified by the algorithm. We note that we allow for a
significant O(D) amount of noise to be added, while other works focused on the
low-noise o(sqrt{D}) setting. For a stream of T queries our algorithm achieves an
average error O(1/sqrt{T}) by filtering out random noise, adapting threshold values
given to the oracle based on its previous answers and, as a consequence, recovering
with high precision a projection of a database-vector w* onto the manifold defining
the query-space. Our algorithm may also be applied in the adversarial machine
learning context to compromise machine learning engines by heavily exploiting the
vulnerabilities of systems that output only a binary signal, even in the presence
of significant noise.
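
To make the threshold-adaptation idea concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes zero-mean Gaussian noise and a single fixed query, and it swaps in a standard Robbins-Monro style stochastic approximation: the threshold is nudged up when the oracle answers "larger" and down otherwise, with a shrinking step size, so that it settles near the noise-free value of the linear function. The names `make_oracle`, `estimate_answer`, `noise_std`, `rounds`, and `step` are hypothetical and chosen only for illustration.

```python
import random


def make_oracle(w_star, noise_std, rng):
    """Hypothetical binary oracle: is <w*, x> + noise larger than t?"""
    def oracle(x, t):
        value = sum(wi * xi for wi, xi in zip(w_star, x))
        # Assumption: zero-mean Gaussian noise with standard deviation noise_std.
        return value + rng.gauss(0.0, noise_std) > t
    return oracle


def estimate_answer(oracle, x, rounds, step):
    """Estimate <w*, x> using only binary answers to adaptive thresholds."""
    t = 0.0
    for n in range(1, rounds + 1):
        # Raise the threshold after a "larger" answer, lower it otherwise.
        direction = 1.0 if oracle(x, t) else -1.0
        # The step shrinks like 1/n, which averages out the random noise.
        t += (step / n) * direction
    return t


rng = random.Random(0)
w_star = [0.5, -1.0, 2.0, 0.25]   # hidden database-vector (illustrative values)
x = [1.0, 0.5, 0.25, -1.0]        # one query vector (illustrative values)
oracle = make_oracle(w_star, noise_std=1.0, rng=rng)

estimate = estimate_answer(oracle, x, rounds=20000, step=2.0)
true_answer = sum(wi * xi for wi, xi in zip(w_star, x))
print(f"true={true_answer:.3f} estimate={estimate:.3f}")
```

The sketch keeps only the current threshold and a round counter in memory, which is in the spirit of the small-memory claim above, but it handles one query at a time and does not model the low-dimensional manifold or the projection of w* onto it.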