Thread: weird hash plan cost, starting with pg10

weird hash plan cost, starting with pg10

From: Alvaro Herrera

Hello

While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
current Postgres 10 emits weird bucket/batch/memory values for certain
hash nodes:

                         ->  Hash  (cost=0.11..0.11 rows=10 width=12) (actual time=0.002..0.002 rows=1 loops=8)
                               Buckets: 2139062143  Batches: 2139062143  Memory Usage: 8971876904722400kB
                               ->  Function Scan on unnest init_1  (cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)

It shows normal values in 9.6.

The complete query is:

SELECT c.tableoid, c.oid, c.relname, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM
pg_catalog.unnest(coalesce(c.relacl,pg_catalog.acldefault(CASE WHEN c.relkind = 'S' THEN 's' ELSE 'r'
END::"char",c.relowner))) WITH ORDINALITY AS perm(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM
pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault(CASE WHEN c.relkind = 'S' THEN 's' ELSE 'r'
END::"char",c.relowner))) AS init(init_acl) WHERE acl = init_acl)) as foo) AS relacl, (SELECT pg_catalog.array_agg(acl
ORDER BY row_n) FROM (SELECT acl, row_n FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault(CASE WHEN
c.relkind = 'S' THEN 's' ELSE 'r' END::"char",c.relowner))) WITH ORDINALITY AS initp(acl,row_n) WHERE NOT EXISTS (
SELECT 1 FROM pg_catalog.unnest(coalesce(c.relacl,pg_catalog.acldefault(CASE WHEN c.relkind = 'S' THEN 's' ELSE 'r'
END::"char",c.relowner))) AS permp(orig_acl) WHERE acl = orig_acl)) as foo) as rrelacl, NULL AS initrelacl, NULL as
initrrelacl, c.relkind, c.relnamespace, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = c.relowner) AS rolname,
c.relchecks, c.relhastriggers, c.relhasindex, c.relhasrules, 'f'::bool AS relhasoids, c.relrowsecurity,
c.relforcerowsecurity, c.relfrozenxid, c.relminmxid, tc.oid AS toid, tc.relfrozenxid AS tfrozenxid, tc.relminmxid AS
tminmxid, c.relpersistence, c.relispopulated, c.relreplident, c.relpages, am.amname, CASE WHEN c.reloftype <> 0 THEN
c.reloftype::pg_catalog.regtype ELSE NULL END AS reloftype, d.refobjid AS owning_tab, d.refobjsubid AS owning_col,
(SELECT spcname FROM pg_tablespace t WHERE t.oid = c.reltablespace) AS reltablespace,
array_remove(array_remove(c.reloptions,'check_option=local'),'check_option=cascaded') AS reloptions, CASE WHEN
'check_option=local' = ANY (c.reloptions) THEN 'LOCAL'::text WHEN 'check_option=cascaded' = ANY (c.reloptions) THEN
'CASCADED'::text ELSE NULL END AS checkoption, tc.reloptions AS toast_reloptions, c.relkind = 'S' AND EXISTS (SELECT 1
FROM pg_depend WHERE classid = 'pg_class'::regclass AND objid = c.oid AND objsubid = 0 AND refclassid =
'pg_class'::regclass AND deptype = 'i') AS is_identity_sequence, EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN
pg_init_privs pip ON (c.oid = pip.objoid AND pip.classoid = 'pg_class'::regclass AND pip.objsubid = at.attnum) WHERE
at.attrelid = c.oid AND ((SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM
pg_catalog.unnest(coalesce(at.attacl,pg_catalog.acldefault('c',c.relowner))) WITH ORDINALITY AS perm(acl,row_n) WHERE
NOT EXISTS ( SELECT 1 FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('c',c.relowner))) AS
init(init_acl) WHERE acl = init_acl)) as foo) IS NOT NULL OR (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM
(SELECT acl, row_n FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('c',c.relowner))) WITH ORDINALITY
AS initp(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM
pg_catalog.unnest(coalesce(at.attacl,pg_catalog.acldefault('c',c.relowner))) AS permp(orig_acl) WHERE acl = orig_acl))
as foo) IS NOT NULL OR NULL IS NOT NULL OR NULL IS NOT NULL)) AS changed_acl, pg_get_partkeydef(c.oid) AS partkeydef,
c.relispartition AS ispartition, pg_get_expr(c.relpartbound, c.oid) AS partbound FROM pg_class c LEFT JOIN pg_depend d
ON (c.relkind = 'S' AND d.classid = c.tableoid AND d.objid = c.oid AND d.objsubid = 0 AND d.refclassid = c.tableoid AND
d.deptype IN ('a', 'i')) LEFT JOIN pg_class tc ON (c.reltoastrelid = tc.oid AND c.relkind <> 'p') LEFT JOIN pg_am am ON
(c.relam = am.oid) LEFT JOIN pg_init_privs pip ON (c.oid = pip.objoid AND pip.classoid = 'pg_class'::regclass AND
pip.objsubid = 0) WHERE c.relkind in ('r', 'S', 'v', 'c', 'm', 'f', 'p') ORDER BY c.oid

I'm not looking into this right now.  If somebody is bored in
quarantine, they might have a good time bisecting this.

--
Álvaro Herrera



Re: weird hash plan cost, starting with pg10

From: Tom Lane

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
> current Postgres 10 emits weird bucket/batch/memory values for certain
> hash nodes:

>                          ->  Hash  (cost=0.11..0.11 rows=10 width=12) (actual time=0.002..0.002 rows=1 loops=8)
>                                Buckets: 2139062143  Batches: 2139062143  Memory Usage: 8971876904722400kB
>                                ->  Function Scan on unnest init_1  (cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)

Looks suspiciously like uninitialized memory ...

> The complete query is:

Reproduces here, though oddly only a couple of the several hash subplans
are doing that.

I'm not planning to dig into it right this second either.

            regards, tom lane



Re: weird hash plan cost, starting with pg10

From: Thomas Munro

On Tue, Mar 24, 2020 at 6:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
> > current Postgres 10 emits weird bucket/batch/memory values for certain
> > hash nodes:
>
> >                          ->  Hash  (cost=0.11..0.11 rows=10 width=12) (actual time=0.002..0.002 rows=1 loops=8)
> >                                Buckets: 2139062143  Batches: 2139062143  Memory Usage: 8971876904722400kB
> >                                ->  Function Scan on unnest init_1  (cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)
>
> Looks suspiciously like uninitialized memory ...

I think "hashtable" might have been pfree'd before
ExecHashGetInstrumentation() ran, because those numbers look like
CLOBBER_FREED_MEMORY's pattern:

>>> hex(2139062143)
'0x7f7f7f7f'
>>> hex(8971876904722400 / 1024)
'0x7f7f7f7f7f7'

Maybe there is something wrong with the shutdown order of nested subplans.
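
For anyone unfamiliar with that pattern: a CLOBBER_FREED_MEMORY build
overwrites freed palloc chunks with 0x7F bytes, precisely so that reads
through stale pointers become recognizable.  A minimal sketch of the idea
(illustrative only, not the actual memory-context code):

#include <string.h>

/*
 * Rough equivalent of what pfree() does under CLOBBER_FREED_MEMORY:
 * fill the chunk with 0x7F bytes before releasing it.  Any int later
 * read through a dangling pointer then comes back as 0x7f7f7f7f =
 * 2139062143, matching the bogus Buckets/Batches values above.
 */
static void
wipe_mem_sketch(void *ptr, size_t size)
{
    memset(ptr, 0x7F, size);
}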



Re: weird hash plan cost, starting with pg10

From: Thomas Munro

On Tue, Mar 24, 2020 at 9:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Mar 24, 2020 at 6:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > > While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
> > > current Postgres 10 emits weird bucket/batch/memory values for certain
> > > hash nodes:
> >
> > >                          ->  Hash  (cost=0.11..0.11 rows=10 width=12) (actual time=0.002..0.002 rows=1 loops=8)
> > >                                Buckets: 2139062143  Batches: 2139062143  Memory Usage: 8971876904722400kB
> > >                                ->  Function Scan on unnest init_1  (cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)
> >
> > Looks suspiciously like uninitialized memory ...
>
> I think "hashtable" might have been pfree'd before
> ExecHashGetInstrumentation() ran, because those numbers look like
> CLOBBER_FREED_MEMORY's pattern:
>
> >>> hex(2139062143)
> '0x7f7f7f7f'
> >>> hex(8971876904722400 / 1024)
> '0x7f7f7f7f7f7'
>
> Maybe there is something wrong with the shutdown order of nested subplans.

I think there might be a case like this:

* ExecRescanHashJoin() decides it can't reuse the hash table for a
rescan, so it calls ExecHashTableDestroy(), clears HashJoinState's
hj_HashTable and sets hj_JoinState to HJ_BUILD_HASHTABLE
* the HashState node still has a reference to the pfree'd HashJoinTable!
* HJ_BUILD_HASHTABLE case reaches the empty-outer optimisation case so
it doesn't bother to build a new hash table
* EXPLAIN examines the HashState's pointer to a freed HashJoinTable struct

You could fix the dangling pointer problem by clearing it, but then
you'd have no data for EXPLAIN to show in this case.  Some other
solution is probably needed, but I didn't have time to dig further
today.
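
For concreteness, here is a self-contained toy (standalone C, not backend
code; the struct and field names merely mirror the executor's) that models
that shape -- two nodes sharing one table pointer, only one copy of which
gets cleared -- and prints the telltale clobber value:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for the executor structs; not the real definitions. */
typedef struct HashTable { int nbuckets; int nbatch; } HashTable;
typedef struct HashState { HashTable *hashtable; } HashState;
typedef struct HashJoinState { HashTable *hj_HashTable; } HashJoinState;

int
main(void)
{
    HashJoinState join;
    HashState   hash;

    join.hj_HashTable = malloc(sizeof(HashTable));
    hash.hashtable = join.hj_HashTable;     /* child node shares the pointer */
    join.hj_HashTable->nbuckets = 1024;
    join.hj_HashTable->nbatch = 1;

    /*
     * "Rescan": mimic pfree() under CLOBBER_FREED_MEMORY by wiping the
     * chunk with 0x7F.  (A real pfree would also release the chunk; that
     * is skipped here so the stale read below stays well-defined.)
     */
    memset(join.hj_HashTable, 0x7F, sizeof(HashTable));
    join.hj_HashTable = NULL;               /* join node's copy cleared... */

    /* ...but the child's copy is not, so an EXPLAIN-style read sees: */
    printf("Buckets: %d\n", hash.hashtable->nbuckets);  /* 2139062143 */

    free(hash.hashtable);
    return 0;
}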



Re: weird hash plan cost, starting with pg10

From: Justin Pryzby

On Mon, Mar 23, 2020 at 01:50:59PM -0300, Alvaro Herrera wrote:
> While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
> current Postgres 10 emits weird bucket/batch/memory values for certain
> hash nodes:
> 
>                          ->  Hash  (cost=0.11..0.11 rows=10 width=12) (actual time=0.002..0.002 rows=1 loops=8)
>                                Buckets: 2139062143  Batches: 2139062143  Memory Usage: 8971876904722400kB
>                                ->  Function Scan on unnest init_1  (cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)
> 
> It shows normal values in 9.6.

Your message wasn't totally clear, but this is a live bug on 13dev.

It's actually broken in 9.6 too, but the issue isn't exposed until commit
6f236e1eb ("psql: Add tab completion for logical replication"), which adds
a nondefault ACL.

I reproduced the problem with this recipe, which doesn't depend on
c.relispartition, pg_get_partkeydef, or everything else that shifts
underfoot across versions:

|CREATE TABLE t (i int); REVOKE ALL ON t FROM pryzbyj; explain analyze SELECT (SELECT 1 FROM (SELECT * FROM
|unnest(c.relacl) AS acl WHERE NOT EXISTS ( SELECT 1 FROM unnest(c.relacl) AS init(init_acl) WHERE acl=init_acl)) as foo)
|AS relacl, EXISTS (SELECT 1 FROM pg_depend WHERE objid=c.oid) FROM pg_class c ORDER BY c.oid;
 
| Index Scan using pg_class_oid_index on pg_class c  (cost=0.27..4704.25 rows=333 width=9) (actual time=16.257..28.054 rows=334 loops=1)
|   SubPlan 1
|     ->  Hash Anti Join  (cost=2.25..3.63 rows=1 width=4) (actual time=0.024..0.024 rows=0 loops=334)
|           Hash Cond: (acl.acl = init.init_acl)
|           ->  Function Scan on unnest acl  (cost=0.00..1.00 rows=100 width=12) (actual time=0.007..0.007 rows=1 loops=334)
|           ->  Hash  (cost=1.00..1.00 rows=100 width=12) (actual time=0.015..0.015 rows=2 loops=179)
|                 Buckets: 2139062143  Batches: 2139062143  Memory Usage: 8971876904722400kB
|                 ->  Function Scan on unnest init  (cost=0.00..1.00 rows=100 width=12) (actual time=0.009..0.010 rows=2 loops=179)
|   SubPlan 2
|     ->  Seq Scan on pg_depend  (cost=0.00..144.21 rows=14 width=0) (never executed)
|           Filter: (objid = c.oid)
|   SubPlan 3
|     ->  Seq Scan on pg_depend pg_depend_1  (cost=0.00..126.17 rows=7217 width=4) (actual time=0.035..6.270 rows=7220 loops=1)

When I finally gave up on thinking I knew what branch was broken, I got:

|3fc6e2d7f5b652b417fa6937c34de2438d60fa9f is the first bad commit
|commit 3fc6e2d7f5b652b417fa6937c34de2438d60fa9f
|Author: Tom Lane <tgl@sss.pgh.pa.us>
|Date:   Mon Mar 7 15:58:22 2016 -0500
|
|    Make the upper part of the planner work by generating and comparing Paths.

-- 
Justin



Re: weird hash plan cost, starting with pg10

From: Richard Guo

On Tue, Mar 24, 2020 at 11:05 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Mar 24, 2020 at 9:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Tue, Mar 24, 2020 at 6:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > > > While messing with EXPLAIN on a query emitted by pg_dump, I noticed that
> > > > current Postgres 10 emits weird bucket/batch/memory values for certain
> > > > hash nodes:
> > >
> > > >                          ->  Hash  (cost=0.11..0.11 rows=10 width=12) (actual time=0.002..0.002 rows=1 loops=8)
> > > >                                Buckets: 2139062143  Batches: 2139062143  Memory Usage: 8971876904722400kB
> > > >                                ->  Function Scan on unnest init_1  (cost=0.01..0.11 rows=10 width=12) (actual time=0.001..0.001 rows=1 loops=8)
> > >
> > > Looks suspiciously like uninitialized memory ...
> >
> > I think "hashtable" might have been pfree'd before
> > ExecHashGetInstrumentation() ran, because those numbers look like
> > CLOBBER_FREED_MEMORY's pattern:
> >
> > >>> hex(2139062143)
> > '0x7f7f7f7f'
> > >>> hex(8971876904722400 / 1024)
> > '0x7f7f7f7f7f7'
> >
> > Maybe there is something wrong with the shutdown order of nested subplans.
>
> I think there might be a case like this:
>
> * ExecRescanHashJoin() decides it can't reuse the hash table for a
> rescan, so it calls ExecHashTableDestroy(), clears HashJoinState's
> hj_HashTable and sets hj_JoinState to HJ_BUILD_HASHTABLE
> * the HashState node still has a reference to the pfree'd HashJoinTable!
> * HJ_BUILD_HASHTABLE case reaches the empty-outer optimisation case so
> it doesn't bother to build a new hash table
> * EXPLAIN examines the HashState's pointer to a freed HashJoinTable struct

Yes, debugging with gdb shows this is exactly what happens.

Thanks
Richard

Re: weird hash plan cost, starting with pg10

From: Richard Guo

On Tue, Mar 24, 2020 at 3:36 PM Richard Guo <guofenglinux@gmail.com> wrote:
> On Tue, Mar 24, 2020 at 11:05 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > I think there might be a case like this:
> >
> > * ExecRescanHashJoin() decides it can't reuse the hash table for a
> > rescan, so it calls ExecHashTableDestroy(), clears HashJoinState's
> > hj_HashTable and sets hj_JoinState to HJ_BUILD_HASHTABLE
> > * the HashState node still has a reference to the pfree'd HashJoinTable!
> > * HJ_BUILD_HASHTABLE case reaches the empty-outer optimisation case so
> > it doesn't bother to build a new hash table
> > * EXPLAIN examines the HashState's pointer to a freed HashJoinTable struct
>
> Yes, debugging with gdb shows this is exactly what happens.

According to the scenario above, here is a recipe that reproduces this
issue.

-- recipe start
create table a(i int, j int);
create table b(i int, j int);
create table c(i int, j int);

insert into a select 3,3;
insert into a select 2,2;
insert into a select 1,1;

insert into b select 3,3;

insert into c select 0,0;

analyze a;
analyze b;
analyze c;

set enable_nestloop to off;
set enable_mergejoin to off;

explain analyze
select exists(select * from b join c on a.i > c.i and a.i = b.i and b.j = c.j) from a;
-- recipe end

I tried this recipe on different PostgreSQL versions, starting from
current master and going backwards. I was able to reproduce this issue
on all versions above 8.4. In 8.4 version, we do not output information
on hash buckets/batches. But manual inspection with gdb shows in 8.4 we
also have the dangling pointer for HashState->hashtable. I didn't check
versions below 8.4 though.

Thanks
Richard

Re: weird hash plan cost, starting with pg10

From: Konstantin Knizhnik


On 25.03.2020 13:36, Richard Guo wrote:

> On Tue, Mar 24, 2020 at 3:36 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > On Tue, Mar 24, 2020 at 11:05 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > >
> > > I think there might be a case like this:
> > >
> > > * ExecRescanHashJoin() decides it can't reuse the hash table for a
> > > rescan, so it calls ExecHashTableDestroy(), clears HashJoinState's
> > > hj_HashTable and sets hj_JoinState to HJ_BUILD_HASHTABLE
> > > * the HashState node still has a reference to the pfree'd HashJoinTable!
> > > * HJ_BUILD_HASHTABLE case reaches the empty-outer optimisation case so
> > > it doesn't bother to build a new hash table
> > > * EXPLAIN examines the HashState's pointer to a freed HashJoinTable struct
> >
> > Yes, debugging with gdb shows this is exactly what happens.
>
> According to the scenario above, here is a recipe that reproduces this
> issue.
>
> -- recipe start
> create table a(i int, j int);
> create table b(i int, j int);
> create table c(i int, j int);
>
> insert into a select 3,3;
> insert into a select 2,2;
> insert into a select 1,1;
>
> insert into b select 3,3;
>
> insert into c select 0,0;
>
> analyze a;
> analyze b;
> analyze c;
>
> set enable_nestloop to off;
> set enable_mergejoin to off;
>
> explain analyze
> select exists(select * from b join c on a.i > c.i and a.i = b.i and b.j = c.j) from a;
> -- recipe end
>
> I tried this recipe on different PostgreSQL versions, starting from
> current master and going backwards. I was able to reproduce this issue
> on all versions above 8.4. In 8.4 version, we do not output information
> on hash buckets/batches. But manual inspection with gdb shows in 8.4 we
> also have the dangling pointer for HashState->hashtable. I didn't check
> versions below 8.4 though.
>
> Thanks
> Richard

I can propose the following patch for the problem.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: weird hash plan cost, starting with pg10

From: Tom Lane

Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> On 25.03.2020 13:36, Richard Guo wrote:
>> I tried this recipe on different PostgreSQL versions, starting from
>> current master and going backwards. I was able to reproduce this issue
>> on all versions above 8.4. In 8.4 version, we do not output information
>> on hash buckets/batches. But manual inspection with gdb shows in 8.4 we
>> also have the dangling pointer for HashState->hashtable. I didn't check
>> versions below 8.4 though.

> I can propose the following patch for the problem.

I looked at this patch a bit, and I don't think it goes far enough.
What this issue is really pointing out is that EXPLAIN is not considering
the possibility of a Hash node having had several hashtable instantiations
over its lifespan.  I propose what we do about that is generalize the
policy that show_hash_info() is already implementing (in a rather half
baked way) for multiple workers, and report the maximum field values
across all instantiations.  We can combine the code needed to do so
with the code for the parallelism case, as shown in the 0001 patch
below.

In principle we could probably get away with back-patching 0001,
at least into branches that already have the HashState.hinstrument
pointer.  I'm not sure it's worth any risk though.  A much simpler
fix is to make sure we clear the dangling hashtable pointer, as in
0002 below (a simplified form of Konstantin's patch).  The net
effect of that is that in the case where a hash table is destroyed
and never rebuilt, EXPLAIN ANALYZE would report no hash stats,
rather than possibly-garbage stats like it does today.  That's
probably good enough, because it should be an uncommon corner case.

Thoughts?

            regards, tom lane

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 455f54e..ff7c592 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2964,14 +2964,16 @@ show_hash_info(HashState *hashstate, ExplainState *es)
     HashInstrumentation hinstrument = {0};

     /*
+     * Collect stats from the local process, even when it's a parallel query.
      * In a parallel query, the leader process may or may not have run the
      * hash join, and even if it did it may not have built a hash table due to
      * timing (if it started late it might have seen no tuples in the outer
      * relation and skipped building the hash table).  Therefore we have to be
      * prepared to get instrumentation data from all participants.
      */
-    if (hashstate->hashtable)
-        ExecHashGetInstrumentation(&hinstrument, hashstate->hashtable);
+    if (hashstate->hinstrument)
+        memcpy(&hinstrument, hashstate->hinstrument,
+               sizeof(HashInstrumentation));

     /*
      * Merge results from workers.  In the parallel-oblivious case, the
@@ -2979,7 +2981,10 @@ show_hash_info(HashState *hashstate, ExplainState *es)
      * participants didn't run the join at all so have no data.  In the
      * parallel-aware case, we need to consider all the results.  Each worker
      * may have seen a different subset of batches and we want to find the
-     * highest memory usage for any one batch across all batches.
+     * highest memory usage for any one batch across all batches.  We take the
+     * maxima of other values too, for safety.  (In principle all workers
+     * should have the same nbuckets values, but workers that started late
+     * might have seen fewer batches than others.)
      */
     if (hashstate->shared_info)
     {
@@ -2990,31 +2995,16 @@ show_hash_info(HashState *hashstate, ExplainState *es)
         {
             HashInstrumentation *worker_hi = &shared_info->hinstrument[i];

-            if (worker_hi->nbatch > 0)
-            {
-                /*
-                 * Every participant should agree on the buckets, so to be
-                 * sure we have a value we'll just overwrite each time.
-                 */
-                hinstrument.nbuckets = worker_hi->nbuckets;
-                hinstrument.nbuckets_original = worker_hi->nbuckets_original;
-
-                /*
-                 * Normally every participant should agree on the number of
-                 * batches too, but it's possible for a backend that started
-                 * late and missed the whole join not to have the final nbatch
-                 * number.  So we'll take the largest number.
-                 */
-                hinstrument.nbatch = Max(hinstrument.nbatch, worker_hi->nbatch);
-                hinstrument.nbatch_original = worker_hi->nbatch_original;
-
-                /*
-                 * In a parallel-aware hash join, for now we report the
-                 * maximum peak memory reported by any worker.
-                 */
-                hinstrument.space_peak =
-                    Max(hinstrument.space_peak, worker_hi->space_peak);
-            }
+            hinstrument.nbuckets = Max(hinstrument.nbuckets,
+                                       worker_hi->nbuckets);
+            hinstrument.nbuckets_original = Max(hinstrument.nbuckets_original,
+                                                worker_hi->nbuckets_original);
+            hinstrument.nbatch = Max(hinstrument.nbatch,
+                                     worker_hi->nbatch);
+            hinstrument.nbatch_original = Max(hinstrument.nbatch_original,
+                                              worker_hi->nbatch_original);
+            hinstrument.space_peak = Max(hinstrument.space_peak,
+                                         worker_hi->space_peak);
         }
     }

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index c881dc1..165c693 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -2597,7 +2597,10 @@ ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt)
     size = offsetof(SharedHashInfo, hinstrument) +
         pcxt->nworkers * sizeof(HashInstrumentation);
     node->shared_info = (SharedHashInfo *) shm_toc_allocate(pcxt->toc, size);
+
+    /* Each per-worker area must start out as zeroes. */
     memset(node->shared_info, 0, size);
+
     node->shared_info->num_workers = pcxt->nworkers;
     shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id,
                    node->shared_info);
@@ -2616,22 +2619,33 @@ ExecHashInitializeWorker(HashState *node, ParallelWorkerContext *pwcxt)
     if (!node->ps.instrument)
         return;

+    /*
+     * Find our entry in the shared area, and set up a pointer to it so that
+     * we'll accumulate stats there when shutting down or rebuilding the hash
+     * table.
+     */
     shared_info = (SharedHashInfo *)
         shm_toc_lookup(pwcxt->toc, node->ps.plan->plan_node_id, false);
     node->hinstrument = &shared_info->hinstrument[ParallelWorkerNumber];
 }

 /*
- * Copy instrumentation data from this worker's hash table (if it built one)
- * to DSM memory so the leader can retrieve it.  This must be done in an
- * ExecShutdownHash() rather than ExecEndHash() because the latter runs after
- * we've detached from the DSM segment.
+ * Collect EXPLAIN stats if needed, saving them into DSM memory if
+ * ExecHashInitializeWorker was called, or local storage if not.  In the
+ * parallel case, this must be done in ExecShutdownHash() rather than
+ * ExecEndHash() because the latter runs after we've detached from the DSM
+ * segment.
  */
 void
 ExecShutdownHash(HashState *node)
 {
+    /* Allocate save space if EXPLAIN'ing and we didn't do so already */
+    if (node->ps.instrument && !node->hinstrument)
+        node->hinstrument = (HashInstrumentation *)
+            palloc0(sizeof(HashInstrumentation));
+    /* Now accumulate data for the current (final) hash table */
     if (node->hinstrument && node->hashtable)
-        ExecHashGetInstrumentation(node->hinstrument, node->hashtable);
+        ExecHashAccumInstrumentation(node->hinstrument, node->hashtable);
 }

 /*
@@ -2655,18 +2669,27 @@ ExecHashRetrieveInstrumentation(HashState *node)
 }

 /*
- * Copy the instrumentation data from 'hashtable' into a HashInstrumentation
- * struct.
+ * Accumulate instrumentation data from 'hashtable' into an
+ * initially-zeroed HashInstrumentation struct.
+ *
+ * This is used to merge information across successive hash table instances
+ * within a single plan node.  For now, we just take the maximum per-table
+ * values.  (See also explain.c's show_hash_info().)
  */
 void
-ExecHashGetInstrumentation(HashInstrumentation *instrument,
-                           HashJoinTable hashtable)
+ExecHashAccumInstrumentation(HashInstrumentation *instrument,
+                             HashJoinTable hashtable)
 {
-    instrument->nbuckets = hashtable->nbuckets;
-    instrument->nbuckets_original = hashtable->nbuckets_original;
-    instrument->nbatch = hashtable->nbatch;
-    instrument->nbatch_original = hashtable->nbatch_original;
-    instrument->space_peak = hashtable->spacePeak;
+    instrument->nbuckets = Max(instrument->nbuckets,
+                               hashtable->nbuckets);
+    instrument->nbuckets_original = Max(instrument->nbuckets_original,
+                                        hashtable->nbuckets_original);
+    instrument->nbatch = Max(instrument->nbatch,
+                             hashtable->nbatch);
+    instrument->nbatch_original = Max(instrument->nbatch_original,
+                                      hashtable->nbatch_original);
+    instrument->space_peak = Max(instrument->space_peak,
+                                 hashtable->spacePeak);
 }

 /*
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index c901a80..cc8edac 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1336,6 +1336,20 @@ ExecReScanHashJoin(HashJoinState *node)
         else
         {
             /* must destroy and rebuild hash table */
+            HashState  *hashNode = castNode(HashState, innerPlanState(node));
+
+            Assert(hashNode->hashtable == node->hj_HashTable);
+            /* accumulate stats from old hash table, if wanted */
+            /* (this should match ExecShutdownHash) */
+            if (hashNode->ps.instrument && !hashNode->hinstrument)
+                hashNode->hinstrument = (HashInstrumentation *)
+                    palloc0(sizeof(HashInstrumentation));
+            if (hashNode->hinstrument)
+                ExecHashAccumInstrumentation(hashNode->hinstrument,
+                                             hashNode->hashtable);
+            /* for safety, be sure to clear child plan node's pointer too */
+            hashNode->hashtable = NULL;
+
             ExecHashTableDestroy(node->hj_HashTable);
             node->hj_HashTable = NULL;
             node->hj_JoinState = HJ_BUILD_HASHTABLE;
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 1336fde..64d2ce6 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -73,7 +73,7 @@ extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeWorker(HashState *node, ParallelWorkerContext *pwcxt);
 extern void ExecHashRetrieveInstrumentation(HashState *node);
 extern void ExecShutdownHash(HashState *node);
-extern void ExecHashGetInstrumentation(HashInstrumentation *instrument,
-                                       HashJoinTable hashtable);
+extern void ExecHashAccumInstrumentation(HashInstrumentation *instrument,
+                                         HashJoinTable hashtable);

 #endif                            /* NODEHASH_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4c009b1..4fee043 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2358,7 +2358,7 @@ typedef struct HashInstrumentation
     int            nbuckets_original;    /* planned number of buckets */
     int            nbatch;            /* number of batches at end of execution */
     int            nbatch_original;    /* planned number of batches */
-    size_t        space_peak;        /* peak memory usage in bytes */
+    Size        space_peak;        /* peak memory usage in bytes */
 } HashInstrumentation;

 /* ----------------
@@ -2381,8 +2381,20 @@ typedef struct HashState
     HashJoinTable hashtable;    /* hash table for the hashjoin */
     List       *hashkeys;        /* list of ExprState nodes */

-    SharedHashInfo *shared_info;    /* one entry per worker */
-    HashInstrumentation *hinstrument;    /* this worker's entry */
+    /*
+     * In a parallelized hash join, the leader retains a pointer to the
+     * shared-memory stats area in its shared_info field, and then copies the
+     * shared-memory info back to local storage before DSM shutdown.  The
+     * shared_info field remains NULL in workers, or in non-parallel joins.
+     */
+    SharedHashInfo *shared_info;
+
+    /*
+     * If we are collecting hash stats, this points to an initially-zeroed
+     * collection area, which could be either local storage or in shared
+     * memory; either way it's for just one process.
+     */
+    HashInstrumentation *hinstrument;

     /* Parallel hash state. */
     struct ParallelHashJoinState *parallel_state;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index c901a80..9e28ddd 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1336,6 +1336,12 @@ ExecReScanHashJoin(HashJoinState *node)
         else
         {
             /* must destroy and rebuild hash table */
+            HashState  *hashNode = castNode(HashState, innerPlanState(node));
+
+            /* for safety, be sure to clear child plan node's pointer too */
+            Assert(hashNode->hashtable == node->hj_HashTable);
+            hashNode->hashtable = NULL;
+
             ExecHashTableDestroy(node->hj_HashTable);
             node->hj_HashTable = NULL;
             node->hj_JoinState = HJ_BUILD_HASHTABLE;

Re: weird hash plan cost, starting with pg10

From: Richard Guo

On Sat, Apr 11, 2020 at 4:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> > On 25.03.2020 13:36, Richard Guo wrote:
> >> I tried this recipe on different PostgreSQL versions, starting from
> >> current master and going backwards. I was able to reproduce this issue
> >> on all versions above 8.4. In 8.4 version, we do not output information
> >> on hash buckets/batches. But manual inspection with gdb shows in 8.4 we
> >> also have the dangling pointer for HashState->hashtable. I didn't check
> >> versions below 8.4 though.
>
> > I can propose the following patch for the problem.
>
> I looked at this patch a bit, and I don't think it goes far enough.
> What this issue is really pointing out is that EXPLAIN is not considering
> the possibility of a Hash node having had several hashtable instantiations
> over its lifespan.  I propose what we do about that is generalize the
> policy that show_hash_info() is already implementing (in a rather half
> baked way) for multiple workers, and report the maximum field values
> across all instantiations.  We can combine the code needed to do so
> with the code for the parallelism case, as shown in the 0001 patch
> below.

I looked through the 0001 patch and it looks good to me.

At first I was wondering if we need to check whether HashState.hashtable
is not NULL in ExecShutdownHash() before we decide to allocate save
space for HashState.hinstrument. And then I convinced myself that that's
not necessary since HashState.hinstrument and HashState.hashtable cannot
be both NULL there.
 

> In principle we could probably get away with back-patching 0001,
> at least into branches that already have the HashState.hinstrument
> pointer.  I'm not sure it's worth any risk though.  A much simpler
> fix is to make sure we clear the dangling hashtable pointer, as in
> 0002 below (a simplified form of Konstantin's patch).  The net
> effect of that is that in the case where a hash table is destroyed
> and never rebuilt, EXPLAIN ANALYZE would report no hash stats,
> rather than possibly-garbage stats like it does today.  That's
> probably good enough, because it should be an uncommon corner case.

Yes, it's an uncommon corner case. But I think it may still surprise
people that the hash stats usually show up fine but sometimes do not.

Thanks
Richard

Re: weird hash plan cost, starting with pg10

From: Tom Lane

Richard Guo <guofenglinux@gmail.com> writes:
> At first I was wondering if we need to check whether HashState.hashtable
> is not NULL in ExecShutdownHash() before we decide to allocate save
> space for HashState.hinstrument. And then I convinced myself that that's
> not necessary since HashState.hinstrument and HashState.hashtable cannot
> be both NULL there.

Even if the hashtable is null at that point, creating an all-zeroes
hinstrument struct is harmless.

            regards, tom lane



Re: weird hash plan cost, starting with pg10

From: Richard Guo

On Mon, Apr 13, 2020 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > At first I was wondering if we need to check whether HashState.hashtable
> > is not NULL in ExecShutdownHash() before we decide to allocate save
> > space for HashState.hinstrument. And then I convinced myself that that's
> > not necessary since HashState.hinstrument and HashState.hashtable cannot
> > be both NULL there.
>
> Even if the hashtable is null at that point, creating an all-zeroes
> hinstrument struct is harmless.

Correct. The only benefit we might get from checking whether the hashtable
is null is avoiding an unnecessary palloc0 for hinstrument. But that case
cannot happen anyway.

Thanks
Richard 

Re: weird hash plan cost, starting with pg10

From: Justin Pryzby

On Fri, Apr 10, 2020 at 04:11:27PM -0400, Tom Lane wrote:
> I'm not sure it's worth any risk though.  A much simpler
> fix is to make sure we clear the dangling hashtable pointer, as in
> 0002 below (a simplified form of Konstantin's patch).  The net
> effect of that is that in the case where a hash table is destroyed
> and never rebuilt, EXPLAIN ANALYZE would report no hash stats,
> rather than possibly-garbage stats like it does today.  That's
> probably good enough, because it should be an uncommon corner case.
> 
> Thoughts?

Checking if you're planning to backpatch this ?

> diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
> index c901a80..9e28ddd 100644
> --- a/src/backend/executor/nodeHashjoin.c
> +++ b/src/backend/executor/nodeHashjoin.c
> @@ -1336,6 +1336,12 @@ ExecReScanHashJoin(HashJoinState *node)
>          else
>          {
>              /* must destroy and rebuild hash table */
> +            HashState  *hashNode = castNode(HashState, innerPlanState(node));
> +
> +            /* for safety, be sure to clear child plan node's pointer too */
> +            Assert(hashNode->hashtable == node->hj_HashTable);
> +            hashNode->hashtable = NULL;
> +
>              ExecHashTableDestroy(node->hj_HashTable);
>              node->hj_HashTable = NULL;
>              node->hj_JoinState = HJ_BUILD_HASHTABLE;


-- 
Justin



Re: weird hash plan cost, starting with pg10

From: Tom Lane

Justin Pryzby <pryzby@telsasoft.com> writes:
> Checking if you're planning to backpatch this ?

Are you speaking of 5c27bce7f et al?

            regards, tom lane



Re: weird hash plan cost, starting with pg10

From: Justin Pryzby

On Mon, Apr 27, 2020 at 12:26:03PM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > Checking if you're planning to backpatch this ?
> 
> Are you speaking of 5c27bce7f et al?

Oops, yes, thanks.

I updated wiki/PostgreSQL_13_Open_Items just now.

-- 
Justin