Re: improving GROUP BY estimation - Mailing list pgsql-hackers
From: Mark Dilger
Subject: Re: improving GROUP BY estimation
Date:
Msg-id: B9765D22-4BDC-4764-935C-2CBE329B0455@gmail.com
In response to: Re: improving GROUP BY estimation (Mark Dilger <hornschnorter@gmail.com>)
List: pgsql-hackers
> On Feb 25, 2016, at 3:16 PM, Mark Dilger <hornschnorter@gmail.com> wrote:
>
>> On Feb 23, 2016, at 5:12 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> <snip>
>>
>> So much better. Clearly, there are cases where this will over-estimate the cardinality -
>> for example when the values are somehow correlated.
>
> I applied your patch, which caused a few regression tests to fail. Attached
> is a patch that includes the necessary changes to the expected test results.
>
> It is not hard to write test cases where your patched version overestimates
> the number of rows by a factor very similar to that by which the old code
> underestimates them. My very first test, which was not specifically designed
> to demonstrate this, happens to be one such example:
>
>
> CREATE TABLE t (a INT, b int);
> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
> ANALYZE t;
> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>                            QUERY PLAN
> ---------------------------------------------------------------
>  HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>    Group Key: a
>    ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>          Filter: (b < 1000)
> (4 rows)
>
> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>  count
> -------
>     32
> (1 row)
>
>
> So it estimates 850 rows where only 32 are returned. Without applying your patch,
> it estimates just 1 row where 32 are returned. That's an overestimate of roughly
> 26 times, rather than an underestimate of 32 times.
>
> As a patch review, I'd say that your patch does what you claim it does, it applies
> cleanly, and it passes the regression tests with my included modifications. I think
> there needs to be some discussion on the list about whether the patch is a good idea.
>
> Mark Dilger
>
>
> <estimate-num-groups-v2.txt>

Assuming the goal is to minimize the degree to which the estimates are inaccurate, I get
better results with a kind of averaging of the old values from the current code base and
the new values from Tomas's patch. I tried this and, at least for the few examples we've
been throwing around, I found the results in the worst-case examples were not as wildly
inaccurate as they were under either of the other two approaches:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		Assert(rel->reloptkind == RELOPT_BASEREL);
 		if (rel->tuples > 0)
 		{
+			double		old_factor;		/* The way it is currently done on master */
+			double		new_factor;		/* The way Tomas Vondra proposes doing it */
 			/*
 			 * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
 			 * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			/*
 			 * Multiply by restriction selectivity.
 			 */
-			reldistinct *= rel->rows / rel->tuples;
+			old_factor = rel->rows / rel->tuples;	/* old way */
+			new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));	/* Tomas's way */
+
+			reldistinct *= sqrt(old_factor * new_factor);	/* average of old way and new way, sort of */

 			/*
 			 * Update estimate of total distinct groups.
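To make the comparison concrete, here is a small standalone sketch (not part of the patch)
that plugs rough numbers from the sqrt() example above into the three formulas: the current
code, Tomas's patch, and the geometric-mean combination. The inputs (about 3162 distinct
values of a, 1000 estimated qualifying rows out of 10,000,000) are approximations of what
ANALYZE would produce for that table, not the exact statistics from the run above:

/*
 * Sketch only: evaluate the three group-count estimators with approximate
 * inputs taken from the sqrt() example.  Compile with: cc sketch.c -lm
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		reldistinct = 3162.0;	/* ~sqrt(10000000) distinct values of a */
	double		rows = 1000.0;			/* estimated rows passing "b < 1000" */
	double		tuples = 10000000.0;	/* total rows in t */

	double		old_factor = rows / tuples;
	double		new_factor = 1 - powl((reldistinct - 1) / reldistinct, rows);

	/* the planner clamps the group estimate to at least 1 */
	printf("current code  : %4.0f groups\n", fmax(reldistinct * old_factor, 1.0));
	printf("Tomas's patch : %4.0f groups\n", reldistinct * new_factor);
	printf("geometric mean: %4.0f groups\n",
		   reldistinct * sqrt(old_factor * new_factor));
	/* actual number of groups returned by the example query: 32 */
	return 0;
}

With those inputs the three formulas come out at roughly 1, 850, and 16 groups respectively,
against the actual 32, which is the sort of middle-ground behavior the averaging is after.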