Re: PoC: Using Count-Min Sketch for join cardinality estimation - Mailing list pgsql-hackers

From John Naylor
Subject Re: PoC: Using Count-Min Sketch for join cardinality estimation
Date
Msg-id CAFBsxsE3wQ4J5t-UBPQK06=PVfvowq3dFoNCod-cE9M3yo8n4Q@mail.gmail.com
Whole thread Raw
In response to PoC: Using Count-Min Sketch for join cardinality estimation  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: PoC: Using Count-Min Sketch for join cardinality estimation
List pgsql-hackers
On Wed, Jun 16, 2021 at 12:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> The attached patch is a very simple (and perhaps naive) implementation
> adding count-min sketch to pg_statistic for all attributes with a hash
> function (as a new statistics slot kind), and considering it in
> equijoinsel_inner. There's a GUC use_count_min_sketch to make it easier
> to see how it works.

Cool! I have some high level questions below.

> So it's about 4x over-estimated, while without the count-min sketch it's
> about 2x under-estimated:

> The nice thing on count-min sketch is that there are pretty clear
> boundaries for error:
>
>    size(t1,t2) <= dot_product(s1,2) <= epsilon * size(t1) * size(t2)
>
> where s1/s2 are sketches on t1/t2, and epsilon is relative error. User
> may pick epsilon, and that determines size of the necessary sketch as
> 2/epsilon. So with 128 buckets, the relative error is ~1.6%.
>
> The trouble here is that this is relative to cartesian product of the
> two relations. So with two relations, each 30k rows, the error is up to
> ~14.5M. Which is not great. We can pick lower epsilon value, but that
> increases the sketch size.

+ * depth 8 and width 128 is sufficient for relative error ~1.5% with a
+ * probability of approximately 99.6%

Okay, so in the example above, we have a 99.6% probability of having less than 14.5M, but the actual error is much smaller. Do we know how tight the error bounds are with some lower probability?

> There's a bunch of other open questions:
>
> 1) The papers about count-min sketch seem to be written for streaming
> use cases, which implies all the inserted data pass through the sketch.
> This patch only builds the sketch on analyze sample, which makes it less
> reliable. I doubt we want to do something different (e.g. because it'd
> require handling deletes, etc.).

We currently determine the sample size from the number of histogram buckets requested, which is from the guc we expose. If these sketches are more designed for the whole stream, do we have any idea how big a sample we need to be reasonably accurate with them?

> 2) The patch considers the sketch before MCVs, simply because it makes
> it much simpler to enable/disable the sketch, and compare it to MCVs.
> That's probably not what should be done - if we have MCVs, we should
> prefer using that, simply because it determines the frequencies more
> accurately than the sketch. And only use the sketch as a fallback, when
> we don't have MCVs on both sides of the join, instead of just assuming
> uniform distribution and relying on ndistinct.

> Anyway, count-min sketches would be a better way to estimate the part
> not covered by MCVs - we might even assume the uniform distribution for
> individual counters, because that's what we do without MCVs anyway.

When we calculate the sketch, would it make sense to exclude the MCVs that we found? And use both sources for the estimate?

--
John Naylor
EDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Outdated replication protocol error?
Next
From: Tom Lane
Date:
Subject: Re: A qsort template