pgsql: Improve planner's estimates of tuple hash table sizes. - Mailing list pgsql-committers

From Tom Lane
Subject pgsql: Improve planner's estimates of tuple hash table sizes.
Date
Msg-id E1vFg5H-004k5c-2k@gemulon.postgresql.org
Whole thread Raw
List pgsql-committers
Improve planner's estimates of tuple hash table sizes.

For several types of plan nodes that use TupleHashTables, the
planner estimated the expected size of the table as basically
numEntries * (MAXALIGN(dataWidth) + MAXALIGN(SizeofHeapTupleHeader)).
This is pretty far off, especially for small data widths, because
it doesn't account for the overhead of the simplehash.h hash table
nor for any per-tuple "additional space" the plan node may request.
Jeff Janes noted a case where the estimate was off by about a factor
of three, even though the obvious hazards such as inaccurate estimates
of numEntries or dataWidth didn't apply.

To improve matters, create functions provided by the relevant executor
modules that can estimate the required sizes with reasonable accuracy.
(We're still not accounting for effects like allocator padding, but
this at least gets the first-order effects correct.)

I added functions that can estimate the tuple table sizes for
nodeSetOp and nodeSubplan; these rely on an estimator for
TupleHashTables in general, and that in turn relies on one for
simplehash.h hash tables.  That feels like kind of a lot of mechanism,
but if we take any short-cuts we're violating modularity boundaries.

The other places that use TupleHashTables are nodeAgg, which took
pains to get its numbers right already, and nodeRecursiveunion.
I did not try to improve the situation for nodeRecursiveunion because
there's nothing to improve: we are not making an estimate of the hash
table size, and it wouldn't help us to do so because we have no
non-hashed alternative implementation.  On top of that, our estimate
of the number of entries to be hashed in that module is so suspect
that we'd likely often choose the wrong implementation if we did have
two ways to do it.

Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/1ea5bdb00bfbc6f8034859cd19769346bf31dc53

Modified Files
--------------
src/backend/executor/execGrouping.c    | 59 ++++++++++++++++++++++++++++++++++
src/backend/executor/nodeSetOp.c       |  9 ++++++
src/backend/executor/nodeSubplan.c     | 53 ++++++++++++++++++++++++++++--
src/backend/optimizer/plan/subselect.c | 45 +++++++++++++-------------
src/backend/optimizer/util/pathnode.c  | 12 ++++---
src/include/executor/executor.h        |  3 ++
src/include/executor/nodeSetOp.h       |  2 ++
src/include/executor/nodeSubplan.h     |  4 +++
src/include/lib/simplehash.h           | 50 ++++++++++++++++++++++++++--
9 files changed, 206 insertions(+), 31 deletions(-)


pgsql-committers by date:

Previous
From: Peter Geoghegan
Date:
Subject: pgsql: Document nbtree row comparison design.
Next
From: Peter Eisentraut
Date:
Subject: pgsql: Sort guc_parameters.dat alphabetically by name