Re: PoC: Using Count-Min Sketch for join cardinality estimation - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC: Using Count-Min Sketch for join cardinality estimation
Date
Msg-id e7406f53-0d9b-8c5f-0372-05ab1ea39454@enterprisedb.com
In response to Re: PoC: Using Count-Min Sketch for join cardinality estimation  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On 6/17/21 2:23 AM, Tomas Vondra wrote:
> On 6/17/21 1:31 AM, John Naylor wrote:
>> On Wed, Jun 16, 2021 at 12:23 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>>  ...
>>
>> + * depth 8 and width 128 is sufficient for relative error ~1.5% with a
>> + * probability of approximately 99.6%
>>
>> Okay, so in the example above, we have a 99.6% probability of having 
>> less than 14.5M, but the actual error is much smaller. Do we know how 
>> tight the error bounds are with some lower probability?
>>
> 
> I don't recall such a formula being mentioned in any of the papers. The [3]
> paper has a proof in section 4.2, deriving the formula using Markov's
> inequality, but it's not obvious how to relax that (it's been ages since
> I last did things like this).
> 


I've been thinking about this a bit more, and while I still don't have
a nice formula, I think I have a fairly good illustration that may
provide some intuition about the "typical" error. I'll talk about self
joins, because that makes some of the formulas simpler, but in principle
the same thing works for a join of two relations too.

Imagine you have a relation with N rows and D distinct values, and let's
build a count-min sketch on it, with W counters. So assuming d=1 for
simplicity, we have one set of counters with frequencies:

    [f(1), f(2), ..., f(W)]
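
To make this a bit more concrete, here is a toy Python version of such a
single-row sketch (the data, W and the use of Python's hash() are just
placeholders, not what the patch does), including the dot product
discussed next:

    import random

    W = 128
    data = [random.randint(1, 1000) for _ in range(100000)]  # made-up data

    # d = 1: a single row of W counters
    counters = [0] * W
    for v in data:
        counters[hash(str(v)) % W] += 1   # hash() as a stand-in hash function

    # self-join estimate = dot product of the counter row with itself
    S = sum(f * f for f in counters)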

Now, the dot product effectively calculates

    S = sum[ f(i)^2 for i in 1 ... W ]

which treats each counter as if it represented just a single distinct
value. But we know this is an upper bound on the actual join size,
because if we "split" a group in any way, the join size can only get
lower:

    (f(i) - X)^2 + X^2 <= f(i)^2
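
For example, splitting a counter with f(i) = 10 into 6 + 4 gives
6^2 + 4^2 = 52, well below the 10^2 = 100 the dot product uses.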

It's as if you have a square - if you split its side into parts and sum
the areas of the smaller squares built on those parts, the total will be
smaller than the area of the whole square. To minimize that total, the
parts need to be of equal size, and for K equal parts it's

    K * (f(i) / K) ^ 2 = f(i)^2 / K
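
E.g. for f(i) = 10 and K = 4 equal parts that's 4 * 2.5^2 = 25 = 100 / 4.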

This is the minimum possible join size consistent with the counters, and
it corresponds to a uniform distribution of the data, I think. If there
are D distinct values in the data set, then for a uniform distribution we
can assume each counter represents about D / W = K distinct values, and
that f(i) = N / W, so then

    S = W * (N/W)^2 / (D/W) = N^2 / D

Of course, for a uniform distribution this is exactly the cardinality
of the join - while the count-min sketch simply sums the squared f(i)
values, ignoring D entirely. But I think this shows that the fewer
distinct values there are and/or the more skewed the data set is, the
closer the estimate is to the actual value. More uniform data sets with
more distinct values will have an actual join size close to (N^2 / D),
which the sketch will significantly over-estimate.
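
To put some (made-up) numbers on that: with N = 1M rows, D = 10k distinct
values and W = 128 counters, a uniform data set has an actual self-join
size of N^2 / D = 10^8, while the sketch estimates about N^2 / W = 7.8e9,
i.e. an over-estimate by a factor of D / W = ~78. The more skew there is,
the smaller that gap gets.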

So the question is whether to attempt any "custom" correction based on
the number of distinct values (which I think the count-min sketch does
not do, because the papers assume it's unknown).
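
(The correction I have in mind is essentially what the formula above
suggests - divide the dot product by the assumed number of distinct
values per counter, i.e. something like

    S_adjusted = S / (D / W) = S * W / D

with D taken from the regular ndistinct estimate.)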

I still don't know of an analytical solution giving us a smaller
confidence interval (with lower probability). But we could perform some
experiments, generating data sets with various data distributions and
measuring how accurate the adjusted estimate is.
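
Something like this, perhaps (a rough Python sketch just for illustration,
not the patch code - the Zipf-like distribution, the use of Python's
hash() and all the parameters are arbitrary choices):

    import numpy as np

    N, D, W, d = 1_000_000, 10_000, 128, 8   # rows, ndistinct, width, depth
    rng = np.random.default_rng(42)

    # skewed data set: D distinct values with Zipf-like frequencies
    s = 1.2                                  # skew parameter (0 = uniform)
    probs = 1.0 / np.arange(1, D + 1) ** s
    probs /= probs.sum()
    data = rng.choice(D, size=N, p=probs)

    # actual self-join size = sum of squared per-value frequencies
    _, freqs = np.unique(data, return_counts=True)
    actual = np.sum(freqs.astype(np.int64) ** 2)

    # count-min sketch: d rows of W counters, with Python's hash() as a
    # stand-in for a proper hash family (one "function" per row)
    sketch = np.zeros((d, W), dtype=np.int64)
    for i in range(d):
        value_to_bucket = np.array([hash((v, i)) % W for v in range(D)])
        np.add.at(sketch[i], value_to_bucket[data], 1)

    # join size estimate = minimum dot product across the d rows
    raw = min(int(np.dot(row, row)) for row in sketch)

    # "adjusted" variant: divide by the distinct values per counter
    adjusted = raw / (len(freqs) / W)

    print(f"actual: {actual}  raw: {raw}  adjusted: {adjusted:.0f}")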

But I think the fact that for "more skewed" data sets the estimate is
closer to reality is very interesting, and pretty much what we want.
It's probably better than just assuming uniformity on both sides, which
is what we do when we only have an MCV list on one side (a fairly
common case, I think).

The other interesting feature is that it *always* overestimates (at
least the default version, not the variant adjusted by distinct values).
That's probably good, because under-estimates are generally much more
dangerous than over-estimates (the execution often degrades pretty
quickly, not gracefully).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


