Improved Consistent Sampling, Weighted Minhash and L1 Sketching
Abstract: We propose a new Consistent Weighted Sampling method, where
the probability of drawing identical samples for a pair of inputs is equal to their
Jaccard similarity. Our method takes deterministic constant time per non-zero weight,
improving on the best previous approach which takes expected constant time. The samples
can be used as Weighted Minhash for efficient retrieval and compression (sketching)
under Jaccard or L1 metric. A method is presented for using simple data statistics to
reduce the running time of hash computation by two orders of magnitude. We compare our
method with the random projection method and show that it has better characteristics
for retrieval under L1. We present a novel method of mapping hashes to short
bit-strings, apply it to Weighted Minhash, and achieve more accurate distance estimates
from sketches than existing methods, as long as the inputs are sufficiently distinct.
We show how to choose the optimal number of bits per hash for sketching, and
demonstrate experimental results which agree with the theoretical analysis.