Home > mailing lists

Back-patch change in hashed DISTINCT estimation? - Mailing list pgsql-hackers

From	Tom Lane
Subject	Back-patch change in hashed DISTINCT estimation?
Date	August 21, 2013 00:24:21
Msg-id	17178.1377033858@sss.pgh.pa.us Whole thread Raw
Responses	Re: Back-patch change in hashed DISTINCT estimation? Re: Back-patch change in hashed DISTINCT estimation?
List	pgsql-hackers

Tree view

In a thread over in pgsql-performance, Tomas Vondra pointed out that
choose_hashed_distinct was sometimes making different choices than
choose_hashed_grouping, so that queries like these:
select distinct x from ... ;select x from ... group by 1;

might get different plans even though they should be equivalent.
After some debugging it turns out that I omitted hash_agg_entry_size()
from the hash table size estimate in choose_hashed_distinct --- I'm pretty
sure I momentarily thought that this function must yield zero if there are
no aggregates, but that's wrong.  So we need a patch like this:

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bcc0d45..99284cb 100644
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** choose_hashed_distinct(PlannerInfo *root
*** 2848,2854 ****
--- 2848,2858 ----      * Don't do it if it doesn't look like the hashtable will fit into      * work_mem.      */
+ 
+     /* Estimate per-hash-entry space at tuple width... */     hashentrysize = MAXALIGN(path_width) +
MAXALIGN(sizeof(MinimalTupleData));
+     /* plus the per-hash-entry overhead */
+     hashentrysize += hash_agg_entry_size(0);      if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
returnfalse;
 

When grouping narrow data, like a float or a couple of ints, this
oversight makes for more than 2X error in the hash table size estimate.

What I'm wondering is whether to back-patch this or leave well enough
alone.  The risk of back-patching is that it might destabilize plan
choices that people like.  (In Tomas' original example, the underestimate
of the table size leads it to choose a plan that is in fact better.)
The risk of not back-patching is that the error could lead to
out-of-memory failures because the hash aggregation uses more memory
than the planner expected.  (Tomas was rather fortunate in that his
case had an overestimate of dNumDistinctRows, so it didn't end up
blowing out memory ... but usually I think we underestimate that more
than overestimate it.)

A possible compromise is to back-patch into 9.3 (where the argument about
destabilizing plan choices doesn't have much force yet), but not further.

Thoughts?
        regards, tom lane

pgsql-hackers by date:

From: Martijn van Oosterhout
Date: 20 August 2013, 22:41:31
Subject: Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: Proposal for Allow postgresql.conf values to be changed via SQL [review])

From: Alvaro Herrera
Date: 21 August 2013, 02:10:32
Subject: Re: danger of stats_temp_directory = /dev/shm

Back-patch change in hashed DISTINCT estimation? - Mailing list pgsql-hackers

Previous

Next