Probabilistic Counting with Stochastic Averaging (PCSA)

One of the most fascinating algorithms, if not the most, that I have come across. It is a solution to the count-distinct problem.

Jan 01, 2017

WIP

Very rough blog post.
Need to provide an interactive visualisation somehow?
Do the analysis of the algorithm.
Compare it to the current state of the art algorithm for the count-distinct problem.

Introduction

The Flajolet–Martin algorithm (PCSA) estimates how many distinct elements (cardinality) appear in a stream using very little memory by exploiting randomness in hash values and counting trailing zero bits in binary hash outputs.

Uses

Counting unique users/IPs/cookies on high-traffic sites. Used when exact counting is too expensive (billions of events/day). For example: “How many unique visitors hit this endpoint today?”. PCSA avoids storing every user ID and instead uses compact bitmaps.
Estimating: Unique requests, unique trace IDs and unique error signatures.
Counting: Unique source IPs hitting a server and unique destination ports scanned.
DDoS detection.
Port scan detection.
Traffic cardinality estimation.
Real-time pipelines: “How many unique users in the last 5 minutes?”.
Ad Tech & Marketing: Counting unique impressions and counting unique users exposed to an ad.

Core Idea

Hash each element to a (pseudo) random-looking binary string.
Look at how many trailing zeros are in that hash (e.g., $1001000_2$ 10010002
$1001000_2$
has three trailing zeros).
Very long runs of trailing zeros are rare; seeing them suggests many distinct elements have been observed.

A rule of thumb: if you see a hash with RR trailing zeros, that event happens with probability about , so you expect to see it only after on the order of distinct elements.

Basic single-register algorithm

Assume a stream of elements x1,x2,…x1,x2,….

Choose a hash function
- Use a hash function h(x)h(x) that maps elements uniformly into {0,1,…,2L−1}{0,1,…,2L−1}, i.e., to LL-bit binary strings.
Process each stream element
For each incoming element xx:
- Compute y=h(x)y=h(x).
- Let ρ(y)ρ(y) be the index (0-based) of the least significant 1 bit of yy (number of trailing zeros). Example:
  - y=20=101002⇒ρ(y)=2y=20=101002⇒ρ(y)=2 (two trailing zeros).
- Maintain a single integer register RR = maximum ρ(y)ρ(y) seen so far:
  - R←max⁡(R,ρ(y))R←max(R,ρ(y)).
Output estimate
- At any time, estimate the number of distinct elements nn byn^=2Rn^=2R(in practice, a constant factor correction is often used, but 2R2R captures the core idea).

Intuition: about half of all hashes end with 0, a quarter end with 00, an eighth with 000, etc., so the largest number of trailing zeros you’ve seen gives a log-scale clue about nn. 2𝑅, ,2𝑅,

2𝑅

For a random hash, the probability distribution is:

P(ρ=k)=2−kP(\rho = k) = 2^{-k}P(ρ=k)=2−k

because we need:

Analysis

References

Link to PCSA paper.

Mohamed Bana’s Substack

Discussion about this post

Ready for more?