Re: PoC: Using Count-Min Sketch for join cardinality estimation - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC: Using Count-Min Sketch for join cardinality estimation
Msg-id 846b316e-04b7-ed10-0bd1-58a22d143897@enterprisedb.com
In response to Re: PoC: Using Count-Min Sketch for join cardinality estimation  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: PoC: Using Count-Min Sketch for join cardinality estimation
Re: PoC: Using Count-Min Sketch for join cardinality estimation
List pgsql-hackers
On 6/17/21 1:31 AM, John Naylor wrote:
> On Wed, Jun 16, 2021 at 12:23 PM Tomas Vondra 
> <tomas.vondra@enterprisedb.com> 
> wrote:
> 
>  > The attached patch is a very simple (and perhaps naive) implementation
>  > adding count-min sketch to pg_statistic for all attributes with a hash
>  > function (as a new statistics slot kind), and considering it in
>  > equijoinsel_inner. There's a GUC use_count_min_sketch to make it easier
>  > to see how it works.
> 
> Cool! I have some high level questions below.
> 
>  > So it's about 4x over-estimated, while without the count-min sketch it's
>  > about 2x under-estimated:
> 
>  > The nice thing on count-min sketch is that there are pretty clear
>  > boundaries for error:
>  >
>  >    size(t1,t2) <= dot_product(s1,s2)
>  >                <= size(t1,t2) + epsilon * size(t1) * size(t2)
>  >
>  > where s1/s2 are sketches on t1/t2, and epsilon is relative error. User
>  > may pick epsilon, and that determines size of the necessary sketch as
>  > 2/epsilon. So with 128 buckets, the relative error is ~1.6%.
>  >
>  > The trouble here is that this is relative to cartesian product of the
>  > two relations. So with two relations, each 30k rows, the error is up to
>  > ~14.5M. Which is not great. We can pick lower epsilon value, but that
>  > increases the sketch size.
> 
> + * depth 8 and width 128 is sufficient for relative error ~1.5% with a
> + * probability of approximately 99.6%
> 
> Okay, so in the example above, we have a 99.6% probability of having 
> less than 14.5M, but the actual error is much smaller. Do we know how 
> tight the error bounds are with some lower probability?
> 

I don't recall such a formula being mentioned in any of the papers. The [3]
paper has a proof in section 4.2, deriving the formula using Markov's
inequality, but it's not obvious how to relax that (it's been ages since
I last did things like this).
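
For reference, this is roughly what the estimate and its guarantee look
like in code. It's only an illustrative sketch with made-up names
(CountMinSketch, CM_DEPTH, CM_WIDTH, cm_estimate_join_size), not the code
from the attached patch, and it assumes both sketches were built with the
same hash functions:

#include <stdint.h>

#define CM_DEPTH 8      /* number of hash functions (rows), delta = (1/2)^8 */
#define CM_WIDTH 128    /* counters per row, epsilon = 2.0 / CM_WIDTH */

typedef struct CountMinSketch
{
    uint64_t    counters[CM_DEPTH][CM_WIDTH];
} CountMinSketch;

/*
 * Estimate the join size as the minimum (over the CM_DEPTH rows) of the
 * dot products of the two counter arrays.
 */
static uint64_t
cm_estimate_join_size(const CountMinSketch *s1, const CountMinSketch *s2)
{
    uint64_t    estimate = UINT64_MAX;

    for (int i = 0; i < CM_DEPTH; i++)
    {
        uint64_t    dot = 0;

        for (int j = 0; j < CM_WIDTH; j++)
            dot += s1->counters[i][j] * s2->counters[i][j];

        if (dot < estimate)
            estimate = dot;
    }

    /*
     * Each row only over-estimates, so with probability 1 - (1/2)^CM_DEPTH
     * (~99.6%) the minimum satisfies
     *
     *   size(t1,t2) <= estimate <= size(t1,t2) + epsilon * size(t1) * size(t2)
     *
     * which for two 30k-row tables gives the ~14.5M worst case from above.
     */
    return estimate;
}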

>  > There's a bunch of other open questions:
>  >
>  > 1) The papers about count-min sketch seem to be written for streaming
>  > use cases, which implies all the inserted data pass through the sketch.
>  > This patch only builds the sketch on analyze sample, which makes it less
>  > reliable. I doubt we want to do something different (e.g. because it'd
>  > require handling deletes, etc.).
> 
> We currently determine the sample size from the number of histogram 
> buckets requested, which is from the guc we expose. If these sketches 
> are more designed for the whole stream, do we have any idea how big a 
> sample we need to be reasonably accurate with them?
> 

Not really, but to be fair, even for histograms the sample size is only
really supported by "seems to work in practice" :-(

My feeling is it's more about the number of distinct values than the size
of the table. If there are only a couple of distinct values, a small sample
is good enough. With many distinct values we may need a larger sample, but
maybe not - we'll have to try, I guess.

FWIW there are a lot of assumptions baked into the join estimates. For
example, we assume the domains match (i.e. the domain of the smaller table
is a subset of the domain of the larger table), etc.

>  > 2) The patch considers the sketch before MCVs, simply because it makes
>  > it much simpler to enable/disable the sketch, and compare it to MCVs.
>  > That's probably not what should be done - if we have MCVs, we should
>  > prefer using that, simply because it determines the frequencies more
>  > accurately than the sketch. And only use the sketch as a fallback, when
>  > we don't have MCVs on both sides of the join, instead of just assuming
>  > uniform distribution and relying on ndistinct.
> 
>  > Anyway, count-min sketches would be a better way to estimate the part
>  > not covered by MCVs - we might even assume the uniform distribution for
>  > individual counters, because that's what we do without MCVs anyway.
> 
> When we calculate the sketch, would it make sense to exclude the MCVs 
> that we found? And use both sources for the estimate?
> 

Not sure. I've thought about this a bit, and excluding the MCV values from
the sketch would make it more like an MCV + histogram. So we'd have the MCV
list and then (sketch, histogram) on the non-MCV values.

I think the partial sketch is mostly useless, at least for join estimates.
Imagine we have an MCV list and a sketch on both sides of the join, so we
have (MCV1, sketch1) and (MCV2, sketch2). Naively, we could do the estimate
using (MCV1, MCV2) and then (sketch1, sketch2). But that's too simplistic -
there may be "overlap" between MCV1 and sketch2, for example.

So it seems more likely we'll just do MCV estimation if both sides have
it, and switch to sketch-only estimation otherwise.
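
For the sketch-only path, the glue would be tiny - something like the
following, reusing the toy CountMinSketch / cm_estimate_join_size from the
earlier snippet, with rows1/rows2 being the number of sampled values that
went into each sketch (so this is a selectivity of the sample, applied to
the actual row counts as usual):

/*
 * Hypothetical glue, not the patch: convert the sketch-based join size
 * estimate into a join selectivity.
 */
static double
cm_join_selectivity(const CountMinSketch *s1, const CountMinSketch *s2,
                    double rows1, double rows2)
{
    return (double) cm_estimate_join_size(s1, s2) / (rows1 * rows2);
}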

There's also the fact that we exclude values wider than 1kB so that the
stats don't get too big, and there's no reason to do that for the sketch
(which is fixed-size thanks to hashing). So it's a bit simpler to just
build the full sketch during the initial scan of the data.

But it's not a very important detail - it's trivial to both add and
remove values from the sketch, if needed. So we can either exclude the
MCV values and "add them" to the partial sketch later, or we can build a
full sketch and then subtract them later.
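
Just to illustrate, with the same toy CountMinSketch struct as before and a
caller-supplied hash of the value (the per-row bucket derivation here is
made up; the real patch would use the type's hash function family):

/*
 * Adding a value increments one counter in each row; removing an MCV item
 * later just decrements the same counters by its count.
 */
static void
cm_add_value(CountMinSketch *sketch, uint64_t hash, uint64_t count)
{
    for (int i = 0; i < CM_DEPTH; i++)
    {
        /* mix the row number into the hash to get per-row buckets */
        uint64_t    h = hash ^ (0x9e3779b97f4a7c15ULL * (uint64_t) (i + 1));

        sketch->counters[i][h % CM_WIDTH] += count;
    }
}

static void
cm_remove_value(CountMinSketch *sketch, uint64_t hash, uint64_t count)
{
    for (int i = 0; i < CM_DEPTH; i++)
    {
        uint64_t    h = hash ^ (0x9e3779b97f4a7c15ULL * (uint64_t) (i + 1));

        sketch->counters[i][h % CM_WIDTH] -= count;
    }
}

Either way the resulting counters are the same, so it's really just a
question of which order is more convenient during ANALYZE.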


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


