PoC: Using Count-Min Sketch for join cardinality estimation - Mailing list pgsql-hackers

From Tomas Vondra
Subject PoC: Using Count-Min Sketch for join cardinality estimation
Date
Msg-id a08dda4c-aad4-a6b4-2cec-91363da73183@enterprisedb.com
Whole thread Raw
Responses Re: PoC: Using Count-Min Sketch for join cardinality estimation
List pgsql-hackers
Hi,

During the recent "CMU vaccination" talk given by Robert [1], a couple 
of the attendees (some of which were engineers working on various other 
database systems) asked whether PostgreSQL optimizer uses sketches. 
Which it does not, as far as I'm aware. Perhaps some of our statistics 
could be considered sketches, but we've not using data structures like 
hyperloglog, count-min sketch, etc.

But it reminded me that I thought about using one of the common sketches 
in the past, namely the Count-Min sketch [2], which is often mentioned 
as useful to estimating join cardinalities. There's a couple papers 
explaining how it works [3], [4], [5], but the general idea is that it 
approximates frequency table, i.e. a table tracking frequencies for all 
values. Our MCV list is one way to do that, but that only keeps a 
limited number of common values - for the rest we approximate the 
frequencies as uniform distribution. When the MCV covers only a tiny 
fraction of the data, or missing entirely, this may be an issue.

We can't possibly store exact frequencies all values for tables with 
many distinct values. The Count-Min sketch works around this by tracking 
frequencies in a limited number of counters - imagine you have 128 
counters. To add a value to the sketch, we hash it and the hash says 
which counter to increment.

To estimate a join size, we simply calculate "dot product" of the two 
sketches (which need to use the same number of counters):

   S = sum(s1(i) * s2(i) for i in 1 .. 128)

The actual sketches have multiple of those arrays (e.g. 8) using 
different hash functions, and we use the minimum of the sums. That 
limits the error, but I'll ignore it here for simplicity.

The attached patch is a very simple (and perhaps naive) implementation 
adding count-min sketch to pg_statistic for all attributes with a hash 
function (as a new statistics slot kind), and considering it in 
equijoinsel_inner. There's a GUC use_count_min_sketch to make it easier 
to see how it works.

A simple example

   create table t1 (a int, b int);
   create table t2 (a int, b int);

   insert into t1 select pow(random(), 2) * 1000, i
     from generate_series(1,30000) s(i);
   insert into t2 select pow(random(), 2) * 1000, i
     from generate_series(1,30000) s(i);

   analyze t1, t2;

   explain analyze select * from t1 join t2 using (a);

                              QUERY PLAN
   ------------------------------------------------------------------
    Hash Join  (cost=808.00..115470.35 rows=8936685 width=12)
               (actual time=31.231..1083.330 rows=2177649 loops=1)


So it's about 4x over-estimated, while without the count-min sketch it's 
about 2x under-estimated:

   set use_count_min_sketch = false;

                              QUERY PLAN
   ------------------------------------------------------------------
    Merge Join  (cost=5327.96..18964.16 rows=899101 width=12)
                (actual time=60.780..2896.829 rows=2177649 loops=1)

More about this a bit later.


The nice thing on count-min sketch is that there are pretty clear 
boundaries for error:

   size(t1,t2) <= dot_product(s1,2) <= epsilon * size(t1) * size(t2)

where s1/s2 are sketches on t1/t2, and epsilon is relative error. User 
may pick epsilon, and that determines size of the necessary sketch as 
2/epsilon. So with 128 buckets, the relative error is ~1.6%.

The trouble here is that this is relative to cartesian product of the 
two relations. So with two relations, each 30k rows, the error is up to 
~14.5M. Which is not great. We can pick lower epsilon value, but that 
increases the sketch size.

Where does the error come from? Each counter combines frequencies for 
multiple distinct values. So for example with 128 counters and 1024 
distinct values, each counter is representing ~4 values on average. But 
the dot product ignores this - it treats as if all the frequency was for 
a single value. It calculates the worst case for the bucket, because if 
you split the frequency e.g. in half, the estimate is always lower

    (f/2)^2 + (f/2)^2 < f^2

So maybe this could calculate the average number of items per counter 
and correct for this, somehow. We'd lose some of the sketch guarantees, 
but maybe it's the right thing to do.

There's a bunch of commented-out code doing this in different ways, and 
with the geometric mean variant the result looks like this:

                              QUERY PLAN
   ------------------------------------------------------------------
    Merge Join  (cost=5328.34..53412.58 rows=3195688 width=12)
                (actual time=64.037..2937.818 rows=2177649 loops=1)

which is much closer, but of course that depends on how exactly is the 
data set skewed.


There's a bunch of other open questions:

1) The papers about count-min sketch seem to be written for streaming 
use cases, which implies all the inserted data pass through the sketch. 
This patch only builds the sketch on analyze sample, which makes it less 
reliable. I doubt we want to do something different (e.g. because it'd 
require handling deletes, etc.).


2) The patch considers the sketch before MCVs, simply because it makes 
it much simpler to enable/disable the sketch, and compare it to MCVs. 
That's probably not what should be done - if we have MCVs, we should 
prefer using that, simply because it determines the frequencies more 
accurately than the sketch. And only use the sketch as a fallback, when 
we don't have MCVs on both sides of the join, instead of just assuming 
uniform distribution and relying on ndistinct.

We may have histograms, but AFAIK we don't use those when estimating 
joins (at least not equijoins). That's another thing we might maybe look 
into, comparing the histograms to verify how much they overlap. But 
that's irrelevant here.

Anyway, count-min sketches would be a better way to estimate the part 
not covered by MCVs - we might even assume the uniform distribution for 
individual counters, because that's what we do without MCVs anyway.


3) It's not clear to me how to extend this for multiple columns, so that 
it can be used to estimate joins on multiple correlated columns. For 
MCVs it was pretty simple, but let's say we add this as a new extended 
statistics kind, and user does

     CREATE STATISTICS s (cmsketch) ON a, b, c FROM t;

Should that build sketch on (a,b,c) or something else? The trouble is a 
sketch on (a,b,c) is useless for joins on (a,b).

We might do something like for ndistinct coefficients, and build a 
sketch for each combination of the columns. The sketches are much larger 
than ndistinct coefficients, though. But maybe that's fine - with 8 
columns we'd need ~56 sketches, each ~8kB. So that's not extreme.


regards


[1] 
https://db.cs.cmu.edu/events/vaccination-2021-postgresql-optimizer-methodology-robert-haas/

[2] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch

[3] https://dsf.berkeley.edu/cs286/papers/countmin-latin2004.pdf

[4] http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf

[5] http://dimacs.rutgers.edu/~graham/pubs/papers/cmz-sdm.pdf

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: Support for NSS as a libpq TLS backend
Next
From: Yugo NAGATA
Date:
Subject: Re: Avoid stuck of pbgench due to skipped transactions