Succinct approximate counting of skewed data
Abstract
Practical data analysis relies on the ability to count
observations of objects succinctly and efficiently.
Unfortunately the space usage of an exact estimator
grows with the size of the a priori set from
which objects are drawn while the time required
to maintain such an estimator grows with the size
of the data set. We present static and on-line approximation
schemes that avoid these limitations
when approximate frequency estimates are acceptable.
Our Log-Frequency Sketch extends the approximate
counting algorithm of Morris [1978] to
estimate frequencies with bounded relative error
via a single pass over a data set. It uses constant
space per object when the frequencies follow a
power law and can be maintained in constant time
per observation. We give an (ε,δ)-approximation
scheme which we verify empirically on a large natural
language data set where, for instance, 95 percent
of frequencies are estimated with relative error
less than 0.25 using fewer than 11 bits per object in
the static case and 15 bits per object on-line.
observations of objects succinctly and efficiently.
Unfortunately the space usage of an exact estimator
grows with the size of the a priori set from
which objects are drawn while the time required
to maintain such an estimator grows with the size
of the data set. We present static and on-line approximation
schemes that avoid these limitations
when approximate frequency estimates are acceptable.
Our Log-Frequency Sketch extends the approximate
counting algorithm of Morris [1978] to
estimate frequencies with bounded relative error
via a single pass over a data set. It uses constant
space per object when the frequencies follow a
power law and can be maintained in constant time
per observation. We give an (ε,δ)-approximation
scheme which we verify empirically on a large natural
language data set where, for instance, 95 percent
of frequencies are estimated with relative error
less than 0.25 using fewer than 11 bits per object in
the static case and 15 bits per object on-line.