Re: improving GROUP BY estimation - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: improving GROUP BY estimation
Date
Msg-id B9765D22-4BDC-4764-935C-2CBE329B0455@gmail.com
In response to Re: improving GROUP BY estimation  (Mark Dilger <hornschnorter@gmail.com>)
List pgsql-hackers
> On Feb 25, 2016, at 3:16 PM, Mark Dilger <hornschnorter@gmail.com> wrote:
>
>
>> On Feb 23, 2016, at 5:12 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> <snip>
>>
>> So much better. Clearly, there are cases where this will over-estimate the cardinality - for example when the values are somehow correlated.
>>
>
> I applied your patch, which caused a few regression tests to fail.  Attached
> is a patch that includes the necessary changes to the expected test results.
>
> It is not hard to write test cases where your patched version overestimates
> the number of rows by a very similar factor as the old code underestimates
> them.  My very first test, which was not specifically designed to demonstrate
> this, happens to be one such example:
>
>
> CREATE TABLE t (a INT, b int);
> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
> ANALYZE t;
> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>                          QUERY PLAN
> ---------------------------------------------------------------
> HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>   Group Key: a
>   ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>         Filter: (b < 1000)
> (4 rows)
>
> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
> count
> -------
>    32
> (1 row)
>
>
>
> So, it estimates 850 rows where only 32 are returned.  Without applying your patch,
> it estimates just 1 row where 32 are returned.  That's an overestimate of roughly 26 times,
> rather than an underestimate of 32 times.
>
> As a patch review, I'd say that your patch does what you claim it does, and it applies
> cleanly, and passes the regression tests with my included modifications.  I think there
> needs to be some discussion on the list about whether the patch is a good idea.
>
> Mark Dilger
>
>
> <estimate-num-groups-v2.txt>

Assuming the goal is to minimize the degree to which the estimates are inaccurate, I
get better results by a kind of averaging of the old values from the current code base
and the new values from Tomas's patch.  I tried this and, at least for the few examples
we've been throwing around, I found the results were less wildly inaccurate in the
worst-case examples than with either of the other two approaches:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			Assert(rel->reloptkind == RELOPT_BASEREL);
 			if (rel->tuples > 0)
 			{
+				double		old_factor;		/* The way it is currently done on master */
+				double		new_factor;		/* The way Tomas Vondra proposes doing it */
 				/*
 				 * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
 				 * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				/*
 				 * Multiply by restriction selectivity.
 				 */
-				reldistinct *= rel->rows / rel->tuples;
+				old_factor = rel->rows / rel->tuples;		/* old way */
+				new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));	/* Tomas's way */
+
+				reldistinct *= sqrt(old_factor * new_factor);	/* average of old way and new way, sort of */
 				/*
 				 * Update estimate of total distinct groups.




