Re: PoC: Using Count-Min Sketch for join cardinality estimation - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: PoC: Using Count-Min Sketch for join cardinality estimation |
Date | |
Msg-id | e7406f53-0d9b-8c5f-0372-05ab1ea39454@enterprisedb.com |
In response to | Re: PoC: Using Count-Min Sketch for join cardinality estimation (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
List | pgsql-hackers |
On 6/17/21 2:23 AM, Tomas Vondra wrote:
> On 6/17/21 1:31 AM, John Naylor wrote:
>> On Wed, Jun 16, 2021 at 12:23 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>> ...
>>
>> + * depth 8 and width 128 is sufficient for relative error ~1.5% with a
>> + * probability of approximately 99.6%
>>
>> Okay, so in the example above, we have a 99.6% probability of having
>> less than 14.5M, but the actual error is much smaller. Do we know how
>> tight the error bounds are with some lower probability?
>>
>
> I don't recall such formula mentioned in any of the papers. The [3]
> paper has a proof in section 4.2, deriving the formula using Markov's
> inequality, but it's not obvious how to relax that (it's been ages since
> I last did things like this).
>

I've been thinking about this a bit more, and while I still don't know of a nice formula, I think I have a fairly good illustration that may provide some intuition about the "typical" error. I'll talk about self joins, because that makes some of the formulas simpler, but in principle the same thing works for a join of two relations too.

Imagine you have a relation with N rows and D distinct values, and let's build a count-min sketch on it with W counters. Assuming depth d=1 for simplicity, we have one set of counters with frequencies:

    [f(1), f(2), ..., f(W)]

Now, the dot product effectively calculates

    S = sum[ f(i)^2 for i in 1 ... W ]

which treats each counter as if it represented just a single distinct value. But we know this is an upper bound on the actual join size, because if we "split" a counter in any way, the join will always be smaller:

    (f(i) - X)^2 + X^2 <= f(i)^2

It's as if you had a square with side f(i) - if you split the side into pieces and build squares on those pieces, their total area will be smaller than the area of the original square. To minimize the total area, the pieces need to be of equal size, and for K pieces that gives

    K * (f(i) / K)^2 = f(i)^2 / K

This is the "minimum relative error" case assuming a uniform distribution of the data, I think. If there are D distinct values in the data set, then for a uniform distribution we can assume each counter represents about D / W = K distinct values, and that f(i) = N / W, so then

    S = W * (N/W)^2 / (D/W) = N^2 / D

Of course, this is the exact cardinality of the join - the count-min sketch estimate simply sums the f(i)^2 products, ignoring D entirely. But I think this shows that the fewer distinct values there are and/or the more skewed the data set is, the closer the estimate is to the actual value. More uniform data sets with more distinct values will have an actual join size closer to (N^2 / D), and the sketch will significantly over-estimate them.

So the question is whether to attempt any "custom" correction based on the number of distinct values (which I think the count-min sketch does not do, because the papers assume it's unknown).

I still don't know of an analytical solution giving us a smaller confidence interval (with a lower probability). But we could perform some experiments, generating data sets with various data distributions and measuring how accurate the adjusted estimate is.

But I think the fact that for "more skewed" data sets the estimate is closer to reality is very interesting, and pretty much what we want. It's probably better than just assuming uniformity on both sides, which is what we do when we only have an MCV list on one side (that's a fairly common case, I think).
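To make the experiment idea concrete, here is a rough simulation sketch (hypothetical Python, not the PoC patch itself - the data generator, the stand-in hash and all the parameters are made up for illustration). It builds a d=1 sketch with W counters, computes sum of f(i)^2, and compares it against the exact self-join size and against the estimate adjusted by D/W:

    # Hypothetical experiment sketch (not the PoC code): for a self join,
    # compare the exact join size (sum of count(v)^2 over values) with the
    # plain sketch estimate (sum of f(i)^2 over counters) and with the
    # estimate adjusted by the observed number of distinct values.
    import random
    from collections import Counter

    def simulate(N, D, W, skew=1.0, seed=0):
        rnd = random.Random(seed)
        # zipf-ish data: value k gets weight 1 / (k+1)^skew (skew=0 is uniform)
        weights = [1.0 / (k + 1) ** skew for k in range(D)]
        values = rnd.choices(range(D), weights=weights, k=N)

        counts = Counter(values)
        actual = sum(c * c for c in counts.values())   # exact self-join size

        # d = 1 count-min sketch: map each value to one of W counters,
        # using Python's hash() as a stand-in hash function
        f = [0] * W
        for v, c in counts.items():
            f[hash(v) % W] += c

        estimate = sum(x * x for x in f)               # sum f(i)^2
        adjusted = estimate / (len(counts) / W)        # divide by K = D/W

        return actual, estimate, adjusted

    for skew in (0.0, 1.0, 2.0):
        actual, estimate, adjusted = simulate(N=100000, D=1000, W=128, skew=skew)
        print(f"skew={skew}: actual={actual}  sketch={estimate}  adjusted={adjusted:.0f}")

For uniform data the plain sketch estimate lands near N^2/W while the adjusted one lands near the exact N^2/D; as the skew grows, the plain estimate gets close to reality on its own, and the adjusted one can start to under-estimate - which ties into the point below about only the unadjusted variant never under-estimating.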
The other interesting feature is that it *always* overestimates (at least the default version, not the variant adjusted by distinct values). That's probably good, because under-estimates are generally much more dangerous than over-estimates (the execution often degrades pretty quickly, not gracefully).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company