Re: improving GROUP BY estimation - Mailing list pgsql-hackers
| From | Mark Dilger |
|---|---|
| Subject | Re: improving GROUP BY estimation |
| Date | |
| Msg-id | B9765D22-4BDC-4764-935C-2CBE329B0455@gmail.com |
| In response to | Re: improving GROUP BY estimation (Mark Dilger <hornschnorter@gmail.com>) |
| List | pgsql-hackers |
> On Feb 25, 2016, at 3:16 PM, Mark Dilger <hornschnorter@gmail.com> wrote:
>
>
>> On Feb 23, 2016, at 5:12 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> <snip>
>>
>> So much better. Clearly, there are cases where this will over-estimate the cardinality - for example when the values
>> are somehow correlated.
>>
>
> I applied your patch, which caused a few regression tests to fail. Attached
> is a patch that includes the necessary changes to the expected test results.
>
> It is not hard to write test cases where your patched version overestimates
> the number of rows by a very similar factor as the old code underestimates
> them. My very first test, which was not specifically designed to demonstrate
> this, happens to be one such example:
>
>
> CREATE TABLE t (a INT, b int);
> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
> ANALYZE t;
> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
> QUERY PLAN
> ---------------------------------------------------------------
> HashAggregate (cost=169250.21..169258.71 rows=850 width=4)
> Group Key: a
> -> Seq Scan on t (cost=0.00..169247.71 rows=1000 width=4)
> Filter: (b < 1000)
> (4 rows)
>
> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
> count
> -------
> 32
> (1 row)
>
>
>
> So, it estimates 850 rows where only 32 are returned. Without applying your patch,
> it estimates just 1 row where 32 are returned. That's an overestimate of roughly 26 times,
> rather than an underestimate of 32 times.
>
> As a patch review, I'd say that your patch does what you claim it does, and it applies
> cleanly, and passes the regression tests with my included modifications. I think there
> needs to be some discussion on the list about whether the patch is a good idea.
>
> Mark Dilger
>
>
> <estimate-num-groups-v2.txt>
Assuming the goal is to minimize the degree to which the estimates are inaccurate, I
get better results with a kind of averaging of the old values from the current code base
and the new values from Tomas's patch. I tried this and, at least for the few examples
we've been throwing around, found the results were not as wildly inaccurate in the
worst-case examples as in either of the other two approaches:
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			Assert(rel->reloptkind == RELOPT_BASEREL);
 			if (rel->tuples > 0)
 			{
+				double old_factor;	/* The way it is currently done on master */
+				double new_factor;	/* The way Tomas Vondra proposes doing it */
 				/*
 				 * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
 				 * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				/*
 				 * Multiply by restriction selectivity.
 				 */
-				reldistinct *= rel->rows / rel->tuples;
+				old_factor = rel->rows / rel->tuples;	/* old way */
+				new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));	/* Tomas's way */
+
+				reldistinct *= sqrt(old_factor * new_factor);	/* average of old way and new way, sort of */

 				/*
 				 * Update estimate of total distinct groups.