Thread: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
The following bug has been logged on the website: Bug reference: 16104 Logged by: James Coleman Email address: jtc331@gmail.com PostgreSQL version: 11.5 Operating system: Debian Description: We have a query that, after a recent logical migration to 11.5, ends up with a parallel hash join (I don't think the query plan/query itself is important here, but if needed after the rest of the explanation, I can try to redact it for posting). The query results in this error: ERROR: invalid DSA memory alloc request size 1375731712 (the size changes sometimes significantly, but always over a GB) At first glance it sounded eerily similar to this report which preceded the final release of 11.0: https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde but I confirmed that the patch for that bug was applied and is in 11.5 (and earlier). We managed to reproduce this on a replica, and so were able to attach gdb in production to capture a backtrace: #0 errfinish (dummy=dummy@entry=0) at ./build/../src/backend/utils/error/elog.c:423 #1 0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20, fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at ./build/../src/backend/utils/error/elog.c:1385 #2 0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38, size=1140850688, flags=flags@entry=4) at ./build/../src/backend/utils/mmgr/dsa.c:677 #3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches (hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at ./build/../src/backend/executor/nodeHash.c:2889 #4 0x000055a7c079e5f9 in ExecParallelHashIncreaseNumBatches (hashtable=0x55a7c1db2740) at ./build/../src/backend/executor/nodeHash.c:1122 #5 0x000055a7c079ef6e in ExecParallelHashTuplePrealloc (size=56, batchno=<optimized out>, hashtable=0x55a7c1db2740) at ./build/../src/backend/executor/nodeHash.c:3283 #6 ExecParallelHashTableInsert (hashtable=hashtable@entry=0x55a7c1db2740, slot=slot@entry=0x55a7c1dadc90, hashvalue=<optimized out>) at ./build/../src/backend/executor/nodeHash.c:1716 #7 0x000055a7c079f17f in MultiExecParallelHash (node=0x55a7c1dacb78) at ./build/../src/backend/executor/nodeHash.c:288 #8 MultiExecHash (node=node@entry=0x55a7c1dacb78) at ./build/../src/backend/executor/nodeHash.c:112 #9 0x000055a7c078c40c in MultiExecProcNode (node=node@entry=0x55a7c1dacb78) at ./build/../src/backend/executor/execProcnode.c:501 #10 0x000055a7c07a07d5 in ExecHashJoinImpl (parallel=true, pstate=<optimized out>) at ./build/../src/backend/executor/nodeHashjoin.c:290 #11 ExecParallelHashJoin (pstate=<optimized out>) at ./build/../src/backend/executor/nodeHashjoin.c:581 #12 0x000055a7c078bdd9 in ExecProcNodeInstr (node=0x55a7c1d7b018) at ./build/../src/backend/executor/execProcnode.c:461 #13 0x000055a7c079f142 in ExecProcNode (node=0x55a7c1d7b018) at ./build/../src/include/executor/executor.h:251 #14 MultiExecParallelHash (node=0x55a7c1d759d0) at ./build/../src/backend/executor/nodeHash.c:281 #15 MultiExecHash (node=node@entry=0x55a7c1d759d0) at ./build/../src/backend/executor/nodeHash.c:112 #16 0x000055a7c078c40c in MultiExecProcNode (node=node@entry=0x55a7c1d759d0) at ./build/../src/backend/executor/execProcnode.c:501 #17 0x000055a7c07a07d5 in ExecHashJoinImpl (parallel=true, pstate=<optimized out>) at ./build/../src/backend/executor/nodeHashjoin.c:290 #18 ExecParallelHashJoin (pstate=<optimized out>) at ./build/../src/backend/executor/nodeHashjoin.c:581 #19 0x000055a7c078bdd9 in 
ExecProcNodeInstr (node=0x55a7c1d74e60) at ./build/../src/backend/executor/execProcnode.c:461 #20 0x000055a7c0784303 in ExecProcNode (node=0x55a7c1d74e60) at ./build/../src/include/executor/executor.h:251 #21 ExecutePlan (execute_once=<optimized out>, dest=0x55a7c1d0be00, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55a7c1d74e60, estate=0x55a7c1d74b70) at ./build/../src/backend/executor/execMain.c:1640 #22 standard_ExecutorRun (queryDesc=0x55a7c1d5dcd0, direction=<optimized out>, count=0, execute_once=<optimized out>) at ./build/../src/backend/executor/execMain.c:369 #23 0x00007f4b8b9ace85 in pgss_ExecutorRun (queryDesc=0x55a7c1d5dcd0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:892 #24 0x000055a7c07893d1 in ParallelQueryMain (seg=0x55a7c1caa648, toc=<optimized out>) at ./build/../src/backend/executor/execParallel.c:1401 #25 0x000055a7c064ee64 in ParallelWorkerMain (main_arg=<optimized out>) at ./build/../src/backend/access/transam/parallel.c:1409 #26 0x000055a7c08568ed in StartBackgroundWorker () at ./build/../src/backend/postmaster/bgworker.c:834 #27 0x000055a7c08637b5 in do_start_bgworker (rw=0x55a7c1c98200) at ./build/../src/backend/postmaster/postmaster.c:5722 #28 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:5935 #29 0x000055a7c0864355 in sigusr1_handler (postgres_signal_arg=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:5096 #30 <signal handler called> #31 0x00007f4b915895e3 in select () from /lib/x86_64-linux-gnu/libc.so.6 #32 0x000055a7c05d8b5d in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1671 #33 0x000055a7c08654f1 in PostmasterMain (argc=5, argv=0x55a7c1c73e50) at ./build/../src/backend/postmaster/postmaster.c:1380 #34 0x000055a7c05dac34 in main (argc=5, argv=0x55a7c1c73e50) at ./build/../src/backend/main/main.c:228 From what I can tell it seems that src/backend/executor/nodeHash.c:2888 (looking at the 11.5 release tag) is another entry point into similar potential problems as were guarded against in the patch I mentioned earlier, and that this is another way parallel hash nodes can end up attempting to allocate more memory than is allowed. Thanks, James Coleman
On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
>The following bug has been logged on the website:
>
>Bug reference: 16104
>Logged by: James Coleman
>Email address: jtc331@gmail.com
>PostgreSQL version: 11.5
>Operating system: Debian
>Description:
>
>We have a query that, after a recent logical migration to 11.5, ends up with
>a parallel hash join (I don't think the query plan/query itself is important
>here, but if needed after the rest of the explanation, I can try to redact
>it for posting). The query results in this error:
>
>ERROR: invalid DSA memory alloc request size 1375731712
>
>(the size changes sometimes significantly, but always over a GB)
>
>At first glance it sounded eerily similar to this report which preceded the
>final release of 11.0:
>https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde
>but I confirmed that the patch for that bug was applied and is in 11.5 (and
>earlier).
>
>We managed to reproduce this on a replica, and so were able to attach gdb in
>production to capture a backtrace:
>
>#0 errfinish (dummy=dummy@entry=0) at
>./build/../src/backend/utils/error/elog.c:423
>#1 0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20,
>fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at
>./build/../src/backend/utils/error/elog.c:1385
>#2 0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38,
>size=1140850688, flags=flags@entry=4) at
>./build/../src/backend/utils/mmgr/dsa.c:677
>#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
>(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
>./build/../src/backend/executor/nodeHash.c:2889
> ...

I've briefly looked at this today, and I think the root cause is
somewhat similar to what is described in [1], where we simply increase
the number of batches in an effort to keep batch contents in work_mem,
but ignore that each batch itself requires quite a bit of memory. So we
end up with a lot of batches where each is small enough to fit into
work_mem, but we need much more than work_mem to track the batches.

This seems to be about the same problem, except that instead of
forgetting about BufFile, the parallel hash join ignores this:

    pstate->batches =
        dsa_allocate0(hashtable->area,
                      EstimateParallelHashJoinBatch(hashtable) * nbatch);

Looking at the backtrace, you ended up with nbatch=2097152 and it failed
trying to allocate 1140850688 bytes, i.e. exactly 544B per batch.

The EstimateParallelHashJoinBatch combines a bunch of structs and does
depend on the number of participants, and by my rough calculation 544B
means ~13 participants. Any idea if that matches the number of parallel
workers for this query? Can you try running the query with fewer workers
per gather?

IMHO this has the same issue as the non-parallel case described in [1]
in that we don't really consider this memory when planning the query,
which means we may accept hash join for queries where we already know
during planning that the number of batches will be too high. We need to
reject parallel hash join in those cases.

Of course, I don't know if this is such a case - perhaps this is a case
of underestimation and we only learn the actual number while executing
the query, at which point it's too late to reject PHJ. I suppose the
right solution in that case would be some form of BNL, as discussed in
[2], but I'm not sure how far we are from having that, or if that works
for PHJ.
regards [1] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development [2] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
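To make the arithmetic above concrete, here is a small standalone C
sketch (not PostgreSQL code; the ~1GB cap mirrors the MaxAllocSize limit
on ordinary palloc/dsa_allocate requests, and the other values are taken
straight from the backtrace):

    /* Back-of-the-envelope check of the numbers in the backtrace above.
     * Standalone sketch; only the ~1GB cap and the two quoted values are
     * assumed. */
    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint64_t request   = 1140850688;    /* size passed to dsa_allocate_extended */
        uint64_t nbatch    = 2097152;       /* nbatch in frame #3 */
        uint64_t max_alloc = 0x3fffffff;    /* ~1GB cap on ordinary allocations */

        printf("bytes per batch: %llu\n",
               (unsigned long long) (request / nbatch));    /* 544 */
        printf("over the cap by: %llu bytes\n",
               (unsigned long long) (request - max_alloc)); /* ~64MB */
        return 0;
    }

So the failing request is only about 6% past the 1GB ceiling, and the
544 bytes per batch is the figure the rest of the thread keeps coming
back to.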
On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: > >The following bug has been logged on the website: > > > >Bug reference: 16104 > >Logged by: James Coleman > >Email address: jtc331@gmail.com > >PostgreSQL version: 11.5 > >Operating system: Debian > >Description: > > > >We have a query that, after a recent logical migration to 11.5, ends up with > >a parallel hash join (I don't think the query plan/query itself is important > >here, but if needed after the rest of the explanation, I can try to redact > >it for posting). The query results in this error: > > > >ERROR: invalid DSA memory alloc request size 1375731712 > > > >(the size changes sometimes significantly, but always over a GB) > > > >At first glance it sounded eerily similar to this report which preceded the > >final release of 11.0: > >https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde > >but I confirmed that the patch for that bug was applied and is in 11.5 (and > >earlier). > > > >We managed to reproduce this on a replica, and so were able to attach gdb in > >production to capture a backtrace: > > > >#0 errfinish (dummy=dummy@entry=0) at > >./build/../src/backend/utils/error/elog.c:423 > >#1 0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20, > >fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at > >./build/../src/backend/utils/error/elog.c:1385 > >#2 0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38, > >size=1140850688, flags=flags@entry=4) at > >./build/../src/backend/utils/mmgr/dsa.c:677 > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at > >./build/../src/backend/executor/nodeHash.c:2889 > > ... > > I've briefly looked at this today, and I think the root cause is > somewhat similar to what is described in [1] where we simply increase > the number of batches in an effort to keep batch contents in work_mem, > but ignoring that each batch requires quite a bit of memory. So we end > up with a lot of batches where each is small enough to fit into > work_mem, but we need much more than work_mem to track the batches. > > This seems to be about the same problem, except that instead of > forgeting about BufFile, the parallel hash join ignores this: > > pstate->batches = > dsa_allocate0(hashtable->area, > EstimateParallelHashJoinBatch(hashtable) * nbatch); > > Looking at the backtrace, you ended up with nbatch=2097152 and it failed > trying to allocate 1140850688 bytes, i.e. exactly 544B per batch. > > The EstimateParallelHashJoinBatch combines a bunch of structs and does > depend on the number of participants, and by my rough calculation 544B > means ~13 participants. Any idea if that matches the number of parallel > workers for this query? We currently have max_parallel_workers_per_gather = 6, which is where the repro happened with the backtrace. We initially had the problem with max_parallel_workers_per_gather = 2. I also just confirmed an explain shows Workers Planned: 6. > Can you try running the query with fewer workers per gather? Since it failed initially with 2, I think we've effectively tested this already (I assume that's a significant enough change to be expected to solve it). 
> IMHO this has the same issue as the non-parallel case described in [1] > in that we don't really consider this memory when planning the query, > which means we may accept hash join for queries where we know the number > of batches will be too high during planning. We need to reject parallel > hash join in those cases. > > Of course, I don't know if this is such case - perhaps this is a case of > underestimation and we only learn the actual number while executing the > query, at which point it's too late to reject PHJ. I suppose the right > solution in that case would be some form of BNL, as discussed in [2] but > I'm not sure how far are we from having that, or if that works for PHJ. > > > regards > > [1] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development > > [2] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote:
> On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> > >ERROR: invalid DSA memory alloc request size 1375731712
> > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
> > I've briefly looked at this today, and I think the root cause is
> > somewhat similar to what is described in [1] where we simply increase
> > the number of batches in an effort to keep batch contents in work_mem,
> > but ignoring that each batch requires quite a bit of memory. So we end
> > up with a lot of batches where each is small enough to fit into
> > work_mem, but we need much more than work_mem to track the batches.

Yeah. So even when this is fixed, the query is going to perform
*terribly*, opening and closing millions of files in random order to
stream tuples into, if this is a case where there really are tuples to
go to all partitions (and not just a case of extreme skew that our
extreme skew detector fails to detect because it only detects absolute
extreme skew).

> > This seems to be about the same problem, except that instead of
> > forgetting about BufFile, the parallel hash join ignores this:
> >
> >     pstate->batches =
> >         dsa_allocate0(hashtable->area,
> >                       EstimateParallelHashJoinBatch(hashtable) * nbatch);

Yeah, I failed to consider that possibility. I suppose it could be
avoided with something like this (not tested, I will find a repro for
this on Monday to convince myself that it's right):

@@ -1246,7 +1246,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
                        }

                        /* Don't keep growing if it's not helping or we'd overflow. */
-                       if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+                       if (extreme_skew_detected ||
+                           hashtable->nbatch >= INT_MAX / 2 ||
+                           !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
+                                             hashtable->nbatch * 2))
                                pstate->growth = PHJ_GROWTH_DISABLED;
                        else if (space_exhausted)
                                pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;

But James's query is still going to be terrible.

Do you know if it's extreme skew (many tuples with the same key, just
a few scattered around in other keys), or simply too much data for
your work_mem setting?
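To get a feel for where that extra guard would kick in on this
particular workload, here is a small standalone sketch (assuming the
~544 bytes of shared batch metadata per batch observed above; the cap
mirrors MaxAllocSize, and none of this is PostgreSQL code):

    /* Sketch of where the proposed AllocSizeIsValid() guard would stop
     * batch growth, assuming ~544 bytes of shared metadata per batch as
     * observed in this report. */
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_ALLOC ((uint64_t) 0x3fffffff)   /* mirrors MaxAllocSize */

    int
    main(void)
    {
        uint64_t per_batch = 544;   /* observed bytes per batch */
        uint64_t nbatch;

        for (nbatch = 4096; nbatch <= 2097152; nbatch *= 2)
            printf("nbatch=%7llu: next doubling %s\n",
                   (unsigned long long) nbatch,
                   per_batch * nbatch * 2 <= MAX_ALLOC ? "allowed"
                                                       : "disabled");
        return 0;
    }

With these numbers the guard would flip growth to PHJ_GROWTH_DISABLED at
1,048,576 batches - exactly one doubling before the allocation that
failed here - rather than erroring out.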
On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: > > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: > > > >ERROR: invalid DSA memory alloc request size 1375731712 > > > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches > > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at > > > > I've briefly looked at this today, and I think the root cause is > > > somewhat similar to what is described in [1] where we simply increase > > > the number of batches in an effort to keep batch contents in work_mem, > > > but ignoring that each batch requires quite a bit of memory. So we end > > > up with a lot of batches where each is small enough to fit into > > > work_mem, but we need much more than work_mem to track the batches. > > Yeah. So even when this is fixed, the query is going to perform > *terribly*, opening and closing millions of files in random order to > stream tuples into, if this is case where there really are tuples to > go to all partitions (and not just a case of extreme skew that our > extreme skew detector fails to detect because it only detects absolute > extreme skew). work_mem in the repro case is 500MB (the original failure was at 150MB). I realize that's too small for this query, though it's also worth knowing that if I get rid of some other cluster-wide tunings that shouldn't have been cluster-wide original (modifications to cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into an index scan and no hash (and performs much better). I think this also correlates with us seeing ~TB spike in disk usage, so your explanation of the lots of "small" files would seem to be consistent with that. > > > This seems to be about the same problem, except that instead of > > > forgeting about BufFile, the parallel hash join ignores this: > > > > > > pstate->batches = > > > dsa_allocate0(hashtable->area, > > > EstimateParallelHashJoinBatch(hashtable) * nbatch); > > Yeah, I failed to consider that possibility. I suppose it could be > avoided with something like this (not tested, I will find a repro for > this on Monday to convince myself that it's right): > > @@ -1246,7 +1246,10 @@ > ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) > } > > /* Don't keep growing if it's not > helping or we'd overflow. */ > - if (extreme_skew_detected || > hashtable->nbatch >= INT_MAX / 2) > + if (extreme_skew_detected || > + hashtable->nbatch >= INT_MAX / 2 || > + > !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * > + > hashtable->nbatch * 2)) > pstate->growth = PHJ_GROWTH_DISABLED; > else if (space_exhausted) > pstate->growth = > PHJ_GROWTH_NEED_MORE_BATCHES; > > But James's query is still going to be terrible. > > Do you know if it's extreme skew (many tuples with the same key, just > a few scattered around in other keys), or simply too much data for > your work_mem setting? Given my description earlier (seq scan on a very large table), I assume it's likely the latter? If you think that's sufficiently likely, I'll leave it at that, or if not I could do calculation on that key to see how distributed it is. James
On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote: >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: >> > > >ERROR: invalid DSA memory alloc request size 1375731712 >> >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at >> >> > > I've briefly looked at this today, and I think the root cause is >> > > somewhat similar to what is described in [1] where we simply increase >> > > the number of batches in an effort to keep batch contents in work_mem, >> > > but ignoring that each batch requires quite a bit of memory. So we end >> > > up with a lot of batches where each is small enough to fit into >> > > work_mem, but we need much more than work_mem to track the batches. >> >> Yeah. So even when this is fixed, the query is going to perform >> *terribly*, opening and closing millions of files in random order to >> stream tuples into, if this is case where there really are tuples to >> go to all partitions (and not just a case of extreme skew that our >> extreme skew detector fails to detect because it only detects absolute >> extreme skew). > >work_mem in the repro case is 500MB (the original failure was at >150MB). I realize that's too small for this query, though it's also >worth knowing that if I get rid of some other cluster-wide tunings >that shouldn't have been cluster-wide original (modifications to >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into >an index scan and no hash (and performs much better). > So is it a case of underestimate? I.e. does the Hash side expect much less data (rows) than it gets during execution? >I think this also correlates with us seeing ~TB spike in disk usage, >so your explanation of the lots of "small" files would seem to be >consistent with that. That's consistent with the data. 500MB and nbatch=2097152 is exactly 1TB, and there'll be some additional overhead. > >> > > This seems to be about the same problem, except that instead of >> > > forgeting about BufFile, the parallel hash join ignores this: >> > > >> > > pstate->batches = >> > > dsa_allocate0(hashtable->area, >> > > EstimateParallelHashJoinBatch(hashtable) * nbatch); >> >> Yeah, I failed to consider that possibility. I suppose it could be >> avoided with something like this (not tested, I will find a repro for >> this on Monday to convince myself that it's right): >> >> @@ -1246,7 +1246,10 @@ >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) >> } >> >> /* Don't keep growing if it's not >> helping or we'd overflow. */ >> - if (extreme_skew_detected || >> hashtable->nbatch >= INT_MAX / 2) >> + if (extreme_skew_detected || >> + hashtable->nbatch >= INT_MAX / 2 || >> + >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * >> + >> hashtable->nbatch * 2)) >> pstate->growth = PHJ_GROWTH_DISABLED; >> else if (space_exhausted) >> pstate->growth = >> PHJ_GROWTH_NEED_MORE_BATCHES; >> Probably, but that's the execution part of it. I think we should also consider this in ExecChooseHashTableSize, and just don't do PHJ when it exceeds work_mem from the very beginning. Once we start executing we probably can't do much better than what you proposed. 
In particular it doesn't make sense to limit the space by work_mem,
unless we also tweak that limit, because then the batch size increases
arbitrarily.

I think we need to do something about this in PG13 - both while planning
(considering BufFile and SharedTuplestore), and during execution. The
planner part seems fairly simple and independent, and I might have said
before that I'll look into it.

For the executor I think we've agreed the "proper" solution is BNL or
something like that. Not sure how far we are from that, though; I don't
recall any recent updates (but maybe I just missed them, the
pgsql-hackers traffic is pretty insane). I wonder if we should get
something like the "rebalancing" I proposed before, which is not a 100%
fix but may at least reduce the negative impact.

>> But James's query is still going to be terrible.
>>
>> Do you know if it's extreme skew (many tuples with the same key, just
>> a few scattered around in other keys), or simply too much data for
>> your work_mem setting?
>
>Given my description earlier (seq scan on a very large table), I
>assume it's likely the latter? If you think that's sufficiently
>likely, I'll leave it at that, or if not I could do calculation on
>that key to see how distributed it is.
>

Depends on where exactly the seq scan is - is it under the Hash? If yes,
how come we even pick hash join in this case? Hard to say without seeing
the query plan.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
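As a rough illustration of the planner-side idea: the sketch below
refuses (or would cap) a parallel hash join when the per-batch
bookkeeping alone rivals work_mem. The constants and the function name
are assumptions picked for the example - the real check would live in
ExecChooseHashTableSize and use the actual struct sizes.

    /* Illustrative sketch only: reject a parallel hash join when the
     * per-batch bookkeeping alone would rival work_mem.  Constants and
     * names are assumptions for the example, not PostgreSQL's. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define META_PER_BATCH(workers) (192 + 64 * (uint64_t) (workers)) /* assumed */
    #define BUFFILE_PER_BATCH       8192                              /* assumed */

    static bool
    batch_overhead_ok(uint64_t nbatch, int workers, uint64_t work_mem_bytes)
    {
        return nbatch * (META_PER_BATCH(workers) + BUFFILE_PER_BATCH)
               <= work_mem_bytes;
    }

    int
    main(void)
    {
        uint64_t work_mem = 500 * 1024 * 1024;  /* 500MB, as in the repro */

        /* Roughly what the planner expected: a few hundred batches. */
        printf("nbatch=512:     %s\n",
               batch_overhead_ok(512, 6, work_mem) ? "ok" : "reject/cap");

        /* What execution actually grew to before failing. */
        printf("nbatch=2097152: %s\n",
               batch_overhead_ok(2097152, 6, work_mem) ? "ok" : "reject/cap");
        return 0;
    }

Which also illustrates the caveat above: with the planner's own estimate
the check passes, so a planning-time guard only helps when the batch
explosion is predictable up front; the runaway case in this report is
only caught during execution.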
Just realized I accidentally replied only to Tomas. On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote: > >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: > >> > >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: > >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: > >> > > >ERROR: invalid DSA memory alloc request size 1375731712 > >> > >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches > >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at > >> > >> > > I've briefly looked at this today, and I think the root cause is > >> > > somewhat similar to what is described in [1] where we simply increase > >> > > the number of batches in an effort to keep batch contents in work_mem, > >> > > but ignoring that each batch requires quite a bit of memory. So we end > >> > > up with a lot of batches where each is small enough to fit into > >> > > work_mem, but we need much more than work_mem to track the batches. > >> > >> Yeah. So even when this is fixed, the query is going to perform > >> *terribly*, opening and closing millions of files in random order to > >> stream tuples into, if this is case where there really are tuples to > >> go to all partitions (and not just a case of extreme skew that our > >> extreme skew detector fails to detect because it only detects absolute > >> extreme skew). > > > >work_mem in the repro case is 500MB (the original failure was at > >150MB). I realize that's too small for this query, though it's also > >worth knowing that if I get rid of some other cluster-wide tunings > >that shouldn't have been cluster-wide original (modifications to > >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into > >an index scan and no hash (and performs much better). > > > > So is it a case of underestimate? I.e. does the Hash side expect much > less data (rows) than it gets during execution? I've attached a redacted plan, but the estimate with workers planned = 6 is 986,975,281, which seems quite reasonable to be as the current table count is 5,917,539,491. > >I think this also correlates with us seeing ~TB spike in disk usage, > >so your explanation of the lots of "small" files would seem to be > >consistent with that. > > That's consistent with the data. 500MB and nbatch=2097152 is exactly > 1TB, and there'll be some additional overhead. Good, that helps make sense of that piece of the puzzle. > >> > > This seems to be about the same problem, except that instead of > >> > > forgeting about BufFile, the parallel hash join ignores this: > >> > > > >> > > pstate->batches = > >> > > dsa_allocate0(hashtable->area, > >> > > EstimateParallelHashJoinBatch(hashtable) * nbatch); > >> > >> Yeah, I failed to consider that possibility. I suppose it could be > >> avoided with something like this (not tested, I will find a repro for > >> this on Monday to convince myself that it's right): > >> > >> @@ -1246,7 +1246,10 @@ > >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) > >> } > >> > >> /* Don't keep growing if it's not > >> helping or we'd overflow. 
*/ > >> - if (extreme_skew_detected || > >> hashtable->nbatch >= INT_MAX / 2) > >> + if (extreme_skew_detected || > >> + hashtable->nbatch >= INT_MAX / 2 || > >> + > >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * > >> + > >> hashtable->nbatch * 2)) > >> pstate->growth = PHJ_GROWTH_DISABLED; > >> else if (space_exhausted) > >> pstate->growth = > >> PHJ_GROWTH_NEED_MORE_BATCHES; > >> > > Probably, but that's the execution part of it. I think we should also > consider this in ExecChooseHashTableSize, and just don't do PHJ when > it exceeds work_mem from the very beginning. > > Once we start executing we probably can't do much better than what you > proposed. In particular it doesn't make sense to limit the space by > work_mem, unless we also tweak that limit because then the batch size > increases arbitrarily. > > I think we need do something about this in PG13 - both while planning > (considering BufFile and SharedTuplestore), and during execution. The > planner part seems fairly simple and independent, and I might have said > before I'll look into it. > > For the executor I think we've agreed the "proper" solution is BNL or > something like that. Not sure how far are we from that, though, I > don't recall any recent updates (but maybe I just missed that, the > pgsql-hackers traffic is pretty insane). I wonder if we should get > something like the "rebalancing" I proposed before, which is not a 100% > fix but may at least reduce the negative impact. Do you happen have a link to those discussions? I'd be interested in following along. I also can't say I know what BNL stands for. > >> But James's query is still going to be terrible. > >> > >> Do you know if it's extreme skew (many tuples with the same key, just > >> a few scattered around in other keys), or simply too much data for > >> your work_mem setting? > > > >Given my description earlier (seq scan on a very large table), I > >assume it's likely the latter? If you think that's sufficiently > >likely, I'll leave it at that, or if not I could do calculation on > >that key to see how distributed it is. > > > > Depends where exactly is the seq scan - is it under the Hash? If yes, > how come we even pick hash join in this case? Hard to say without seeing > the query plan. As noted earlier, I've attached a redacted plan here. The seq scan is under the parallel hash. Note that also as hinted at earlier, this happens with cpu_tuple_cost and cpu_operator_cost both set to 0.5, which has been a long-standing tweak on this cluster to affect some plans that are otherwise painfully bad, but really should be targeted at specific queries. With those reset, everything stays the same except that the hash join turns into a nested loop, and instead of a hash on the inner side of the join there's an index scan (the join key is indexed on that table). Because of the modified cpu_*_cost gucs we have two ways around this specific query plan: either reset those gucs or set enable_parallel_hash = off. But even if it means poor performance, it seems to me that we wouldn't want to fail the query. I can confirm that it is in fact capable of completing with this plan even with work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would have been some level of unsurprising in this batch system, and also much closer to "normal" for the non-parallel plan we used to have in 9.6. 
We were able to see this result on replicas while trying to find out exactly how to reproduce the error (it seems sometimes it was right under the boundary needed to raise the error). James
On Sat, Nov 09, 2019 at 08:23:23AM -0500, James Coleman wrote: >Just realized I accidentally replied only to Tomas. > >On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote: >> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> >> >> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: >> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: >> >> > > >ERROR: invalid DSA memory alloc request size 1375731712 >> >> >> >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches >> >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at >> >> >> >> > > I've briefly looked at this today, and I think the root cause is >> >> > > somewhat similar to what is described in [1] where we simply increase >> >> > > the number of batches in an effort to keep batch contents in work_mem, >> >> > > but ignoring that each batch requires quite a bit of memory. So we end >> >> > > up with a lot of batches where each is small enough to fit into >> >> > > work_mem, but we need much more than work_mem to track the batches. >> >> >> >> Yeah. So even when this is fixed, the query is going to perform >> >> *terribly*, opening and closing millions of files in random order to >> >> stream tuples into, if this is case where there really are tuples to >> >> go to all partitions (and not just a case of extreme skew that our >> >> extreme skew detector fails to detect because it only detects absolute >> >> extreme skew). >> > >> >work_mem in the repro case is 500MB (the original failure was at >> >150MB). I realize that's too small for this query, though it's also >> >worth knowing that if I get rid of some other cluster-wide tunings >> >that shouldn't have been cluster-wide original (modifications to >> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into >> >an index scan and no hash (and performs much better). >> > >> >> So is it a case of underestimate? I.e. does the Hash side expect much >> less data (rows) than it gets during execution? > >I've attached a redacted plan, but the estimate with workers planned = >6 is 986,975,281, which seems quite reasonable to be as the current >table count is 5,917,539,491. > Hmmm, but the expected row width is only 16B, and with 6M rows that's only about 90GB. So how come this needs 1TB temporary files? I'm sure there's a bit of overhead, but 10X seems a bit much. >> >I think this also correlates with us seeing ~TB spike in disk usage, >> >so your explanation of the lots of "small" files would seem to be >> >consistent with that. >> >> That's consistent with the data. 500MB and nbatch=2097152 is exactly >> 1TB, and there'll be some additional overhead. > >Good, that helps make sense of that piece of the puzzle. > >> >> > > This seems to be about the same problem, except that instead of >> >> > > forgeting about BufFile, the parallel hash join ignores this: >> >> > > >> >> > > pstate->batches = >> >> > > dsa_allocate0(hashtable->area, >> >> > > EstimateParallelHashJoinBatch(hashtable) * nbatch); >> >> >> >> Yeah, I failed to consider that possibility. 
I suppose it could be >> >> avoided with something like this (not tested, I will find a repro for >> >> this on Monday to convince myself that it's right): >> >> >> >> @@ -1246,7 +1246,10 @@ >> >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) >> >> } >> >> >> >> /* Don't keep growing if it's not >> >> helping or we'd overflow. */ >> >> - if (extreme_skew_detected || >> >> hashtable->nbatch >= INT_MAX / 2) >> >> + if (extreme_skew_detected || >> >> + hashtable->nbatch >= INT_MAX / 2 || >> >> + >> >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * >> >> + >> >> hashtable->nbatch * 2)) >> >> pstate->growth = PHJ_GROWTH_DISABLED; >> >> else if (space_exhausted) >> >> pstate->growth = >> >> PHJ_GROWTH_NEED_MORE_BATCHES; >> >> >> >> Probably, but that's the execution part of it. I think we should also >> consider this in ExecChooseHashTableSize, and just don't do PHJ when >> it exceeds work_mem from the very beginning. >> >> Once we start executing we probably can't do much better than what you >> proposed. In particular it doesn't make sense to limit the space by >> work_mem, unless we also tweak that limit because then the batch size >> increases arbitrarily. >> >> I think we need do something about this in PG13 - both while planning >> (considering BufFile and SharedTuplestore), and during execution. The >> planner part seems fairly simple and independent, and I might have said >> before I'll look into it. >> >> For the executor I think we've agreed the "proper" solution is BNL or >> something like that. Not sure how far are we from that, though, I >> don't recall any recent updates (but maybe I just missed that, the >> pgsql-hackers traffic is pretty insane). I wonder if we should get >> something like the "rebalancing" I proposed before, which is not a 100% >> fix but may at least reduce the negative impact. > >Do you happen have a link to those discussions? I'd be interested in >following along. I also can't say I know what BNL stands for. > I think the two main threads are these two: 1) accounting for memory used for BufFile during hash joins https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development 2) Avoiding hash join batch explosions with extreme skew and weird stats https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com The BNL means "block nested loop" which means we split the inner relation into blocks, and then do nested loop for (some of) them. This works as a fallback strategy for hash join - when a batch is too large to fit into work_mem, we can split it into blocks and do a BNL against the outer relation. >> >> But James's query is still going to be terrible. >> >> >> >> Do you know if it's extreme skew (many tuples with the same key, just >> >> a few scattered around in other keys), or simply too much data for >> >> your work_mem setting? >> > >> >Given my description earlier (seq scan on a very large table), I >> >assume it's likely the latter? If you think that's sufficiently >> >likely, I'll leave it at that, or if not I could do calculation on >> >that key to see how distributed it is. >> > >> >> Depends where exactly is the seq scan - is it under the Hash? If yes, >> how come we even pick hash join in this case? Hard to say without seeing >> the query plan. > >As noted earlier, I've attached a redacted plan here. > >The seq scan is under the parallel hash. 
>Note that also as hinted at earlier, this happens with cpu_tuple_cost
>and cpu_operator_cost both set to 0.5, which has been a long-standing
>tweak on this cluster to affect some plans that are otherwise painfully
>bad, but really should be targeted at specific queries. With those
>reset, everything stays the same except that the hash join turns into a
>nested loop, and instead of a hash on the inner side of the join there's
>an index scan (the join key is indexed on that table).
>

I wonder what the bad plans look like and why this fixes them, but it
seems like a separate issue.

>Because of the modified cpu_*_cost gucs we have two ways around this
>specific query plan: either reset those gucs or set
>enable_parallel_hash = off. But even if it means poor performance, it
>seems to me that we wouldn't want to fail the query. I can confirm
>that it is in fact capable of completing with this plan even with
>work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would
>have been some level of unsurprising in this batch system, and also
>much closer to "normal" for the non-parallel plan we used to have in
>9.6. We were able to see this result on replicas while trying to find
>out exactly how to reproduce the error (it seems sometimes it was
>right under the boundary needed to raise the error).
>

It's certainly true that completing a query is preferable to failing.
It does depend on when we can identify that's the case - the later we
realize the issue, the harder it is to fix. If we notice it during
planning, we can simply disable the hash join, which then forces using a
different join method (and it may actually be quite fast). Once we start
executing, it's way harder.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
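For readers unfamiliar with the term: a toy standalone sketch of the
block nested loop idea mentioned above (purely illustrative - the real
proposal operates on spilled hash join batches and BufFiles, not
in-memory integer arrays):

    /* Toy illustration of a block nested loop: when the inner side of a
     * batch is too big for work_mem, process it one work_mem-sized block
     * at a time, rescanning the outer side once per block. */
    #include <stdio.h>

    #define BLOCK_ROWS 2   /* pretend only 2 inner rows fit in work_mem */

    int
    main(void)
    {
        int inner[] = {1, 2, 2, 3, 5, 5};   /* oversized inner batch */
        int outer[] = {2, 3, 4, 5};         /* outer batch */
        int ninner = 6, nouter = 4;

        for (int b = 0; b < ninner; b += BLOCK_ROWS)  /* one block at a time */
        {
            int end = b + BLOCK_ROWS < ninner ? b + BLOCK_ROWS : ninner;

            /* Build an (imaginary) hash table on inner[b..end), then probe
             * it with a full pass over the outer batch. */
            for (int o = 0; o < nouter; o++)
                for (int i = b; i < end; i++)
                    if (outer[o] == inner[i])
                        printf("match: %d\n", outer[o]);
        }
        return 0;
    }

The point being that an oversized batch never has to fit in memory all
at once; the price is one extra pass over the outer side per block.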
On Sat, Nov 9, 2019 at 1:27 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Sat, Nov 09, 2019 at 08:23:23AM -0500, James Coleman wrote: > >Just realized I accidentally replied only to Tomas. > > > >On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >> > >> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote: > >> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: > >> >> > >> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: > >> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: > >> >> > > >ERROR: invalid DSA memory alloc request size 1375731712 > >> >> > >> >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches > >> >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at > >> >> > >> >> > > I've briefly looked at this today, and I think the root cause is > >> >> > > somewhat similar to what is described in [1] where we simply increase > >> >> > > the number of batches in an effort to keep batch contents in work_mem, > >> >> > > but ignoring that each batch requires quite a bit of memory. So we end > >> >> > > up with a lot of batches where each is small enough to fit into > >> >> > > work_mem, but we need much more than work_mem to track the batches. > >> >> > >> >> Yeah. So even when this is fixed, the query is going to perform > >> >> *terribly*, opening and closing millions of files in random order to > >> >> stream tuples into, if this is case where there really are tuples to > >> >> go to all partitions (and not just a case of extreme skew that our > >> >> extreme skew detector fails to detect because it only detects absolute > >> >> extreme skew). > >> > > >> >work_mem in the repro case is 500MB (the original failure was at > >> >150MB). I realize that's too small for this query, though it's also > >> >worth knowing that if I get rid of some other cluster-wide tunings > >> >that shouldn't have been cluster-wide original (modifications to > >> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into > >> >an index scan and no hash (and performs much better). > >> > > >> > >> So is it a case of underestimate? I.e. does the Hash side expect much > >> less data (rows) than it gets during execution? > > > >I've attached a redacted plan, but the estimate with workers planned = > >6 is 986,975,281, which seems quite reasonable to be as the current > >table count is 5,917,539,491. > > > > Hmmm, but the expected row width is only 16B, and with 6M rows that's > only about 90GB. So how come this needs 1TB temporary files? I'm sure > there's a bit of overhead, but 10X seems a bit much. I'm running the query again to confirm this is indeed the one that triggers that, though I'm reasonably confident it is as I was able to reproduce the disk spike on a barely used replica box. I don't suppose there's any way for the row width to be wrong, or not include some other variable (e.g., variable length fields)? The key value here is type citext, and is often 8 chars long, but sometimes much longer (I sampled some that were 34 chars long, for example). I'm not sure the actual distribution of those lengths. > >> >I think this also correlates with us seeing ~TB spike in disk usage, > >> >so your explanation of the lots of "small" files would seem to be > >> >consistent with that. > >> > >> That's consistent with the data. 
500MB and nbatch=2097152 is exactly > >> 1TB, and there'll be some additional overhead. > > > >Good, that helps make sense of that piece of the puzzle. > > > >> >> > > This seems to be about the same problem, except that instead of > >> >> > > forgeting about BufFile, the parallel hash join ignores this: > >> >> > > > >> >> > > pstate->batches = > >> >> > > dsa_allocate0(hashtable->area, > >> >> > > EstimateParallelHashJoinBatch(hashtable) * nbatch); > >> >> > >> >> Yeah, I failed to consider that possibility. I suppose it could be > >> >> avoided with something like this (not tested, I will find a repro for > >> >> this on Monday to convince myself that it's right): > >> >> > >> >> @@ -1246,7 +1246,10 @@ > >> >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) > >> >> } > >> >> > >> >> /* Don't keep growing if it's not > >> >> helping or we'd overflow. */ > >> >> - if (extreme_skew_detected || > >> >> hashtable->nbatch >= INT_MAX / 2) > >> >> + if (extreme_skew_detected || > >> >> + hashtable->nbatch >= INT_MAX / 2 || > >> >> + > >> >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * > >> >> + > >> >> hashtable->nbatch * 2)) > >> >> pstate->growth = PHJ_GROWTH_DISABLED; > >> >> else if (space_exhausted) > >> >> pstate->growth = > >> >> PHJ_GROWTH_NEED_MORE_BATCHES; > >> >> > >> > >> Probably, but that's the execution part of it. I think we should also > >> consider this in ExecChooseHashTableSize, and just don't do PHJ when > >> it exceeds work_mem from the very beginning. > >> > >> Once we start executing we probably can't do much better than what you > >> proposed. In particular it doesn't make sense to limit the space by > >> work_mem, unless we also tweak that limit because then the batch size > >> increases arbitrarily. > >> > >> I think we need do something about this in PG13 - both while planning > >> (considering BufFile and SharedTuplestore), and during execution. The > >> planner part seems fairly simple and independent, and I might have said > >> before I'll look into it. > >> > >> For the executor I think we've agreed the "proper" solution is BNL or > >> something like that. Not sure how far are we from that, though, I > >> don't recall any recent updates (but maybe I just missed that, the > >> pgsql-hackers traffic is pretty insane). I wonder if we should get > >> something like the "rebalancing" I proposed before, which is not a 100% > >> fix but may at least reduce the negative impact. > > > >Do you happen have a link to those discussions? I'd be interested in > >following along. I also can't say I know what BNL stands for. > > > > I think the two main threads are these two: > > 1) accounting for memory used for BufFile during hash joins > https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development > > 2) Avoiding hash join batch explosions with extreme skew and weird stats > https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com > > The BNL means "block nested loop" which means we split the inner > relation into blocks, and then do nested loop for (some of) them. This > works as a fallback strategy for hash join - when a batch is too large > to fit into work_mem, we can split it into blocks and do a BNL against > the outer relation. Thanks. > >> >> But James's query is still going to be terrible. 
> >> >> > >> >> Do you know if it's extreme skew (many tuples with the same key, just > >> >> a few scattered around in other keys), or simply too much data for > >> >> your work_mem setting? > >> > > >> >Given my description earlier (seq scan on a very large table), I > >> >assume it's likely the latter? If you think that's sufficiently > >> >likely, I'll leave it at that, or if not I could do calculation on > >> >that key to see how distributed it is. > >> > > >> > >> Depends where exactly is the seq scan - is it under the Hash? If yes, > >> how come we even pick hash join in this case? Hard to say without seeing > >> the query plan. > > > >As noted earlier, I've attached a redacted plan here. > > > >The seq scan is under the parallel hash. Note that also as hinted at > >earlier, this happens with cpu_tuple_cost and cpu_operator_cost both > >set to 0.5, which has been a long-standing tweak on this cluster to > >affect some plans that are otherwise painfully bad, but really should > >be targeted at specific queries. With those reset, everything stays > >the same except that the hash join turns into a nested loop, and > >instead of a hash on the inner side of the join there's an index scan > >(the join key is indexed on that table). > > > > I wonder what the bad plans look like and why this fixes them, but it > seems like a separate issue. I am going to look into those further, but it's been configured this way for some time, so it's always a bit messy to find all of these affected queries. > >Because of the modified cpu_*_cost gucs we have two ways around this > >specific query plan: either reset those gucs or set > >enable_parallel_hash = off. But even if it means poor performance, it > >seems to me that we wouldn't want to fail the query. I can confirm > >that it is in fact capable of completing with this plan even with > >work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would > >have been some level of unsurprising in this batch system, and also > >much closer to "normal" for the non-parallel plan we used to have in > >9.6. We were able to see this result on replicas while trying to find > >out exactly how to reproduce the error (it seems sometimes it was > >right under the boundary needed to raise the error). > > > > It's certainly true that completing a query is preferrable to failing. > It does depend when we can identify that's the case - the later we > realize the issue, the harder it's to fix it. If we notice that during > planning, we can simply disable the hash join, which then forces using a > different join method (and it may actually be quite fast). Once we start > executing, it's way harder. In this case though the failure in some sense seems fairly artificial. Aside from the query being slow, there doesn't appear to be any real limitation on the query completing. The box in question has 768GB of memory, so limiting this memory structure to 1GB seems artificial. James
On Sun, Nov 10, 2019 at 9:05 AM James Coleman <jtc331@gmail.com> wrote:
> In this case though the failure in some sense seems fairly artificial.
> Aside from the query being slow, there doesn't appear to be any real
> limitation on the query completing. The box in question has 768GB of
> memory, so limiting this memory structure to 1GB seems artificial.

Yeah, completely artificial. We limit ourselves to MaxAllocSize, a
self-imposed limit that applies to palloc() and dsa_allocate(), probably
dating back to a time in history when anyone asking for more than that
must surely be out of their mind. Then there are a few places where we
allow ourselves to go past that limit by passing in an extra flag
MCXT_ALLOC_HUGE or DSA_ALLOC_HUGE: from a quick grep, that's the bitmap
heap scan page table, and anything using simplehash.h's default
allocator (so I guess that includes hash agg; that's interesting, we can
do a 64GB-bucket-array hash agg but not hash join).

As noted on https://wiki.postgresql.org/wiki/Hash_Join, there are a
couple of artificial constraints on hash joins: the limit on the number
of hash join buckets, which comes entirely from using 'int' as a data
type for bucket numbers (an anachronism from pre-standard C or maybe
just 32 bit-ism: the correct type is surely size_t, which is by
definition big enough for any array that you could possibly address),
and the MaxAllocSize thing. Easy to fix, of course. I noticed that when
working on PHJ and decided to keep the same restriction for the new
parallel code paths, because (1) it seemed like a policy choice we
should make and then apply to both, and (2) it does provide some kind of
sanity check, though it's increasingly looking overdue to be removed (in
other words: previously I was only speculating about people showing up
with ~1TB RAM machines and running into these ancient limits, but ...
here we are).

For example, suppose we added the DSA_ALLOC_HUGE flag to the line that
is failing in your case. Now it would be able to blow through that 1GB
limit, but where would it stop? Until we know how you're reaching this
state, it's hard to know whether it'd go to (say) 2GB, and then work
perfectly, job done, or whether it'd keep going until it ate all your
RAM and made your users really unhappy.

I think this must be a case of extreme skew, as complained about in [1].
Let's see... you have ~6 billion rows, and you said the planner knew
that (it estimated about a billion, when there were 6 workers, so it's
in the ball park). You didn't say how many batches the planner planned
for. Let's see if I can guess... 256 or 512? That'd allow for 6 billion
* 16 byte rows + some slop, chopped up into a power-of-two number of
partitions that fit inside 500MB. Then at execution time, they didn't
fit, and we went into repartition-until-it-fits mode. At some point we
tried to cut them into ~2 million partitions and hit this error. That'd
be a paltry 3,000 rows per partition if the keys were uniformly
distributed, but it thinks that at least one partition is not fitting
into 500MB.

Conclusion: unless this is a carefully crafted hash attack, there must
be one particular key that has more than 500MB worth of rows, but also a
smattering of other keys that fall into the same partition, that are
causing us to keep trying to repartition until it eventually squeezes
all of them in the same direction during a split (this requires
repeatedly splitting partitions until you reach one partition per
tuple!).
So I'm wondering if this would be fixed by, say, a 95% threshold (instead of 100%) for the extreme skew detector, as I proposed in a patch in the first email in that thread that later veered off into the BNL discussion[1]. Unfortunately that patch only deals with non-parallel HJ, but a PHJ version should be easy. And if not by 95%, I wonder what threshold would be needed for this case, and what kind of maths to use to think about it. If I wrote a patch like [1] with PHJ support, would you be able to test it on a copy of your workload? [1] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
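A quick check of that guess, using only figures already quoted in this
thread (~5.9 billion rows, 16-byte planned width, work_mem = 500MB);
standalone sketch, ignoring per-tuple overhead:

    /* Rough check of the batch-count guess above, using figures from the
     * thread: ~5.9 billion inner rows of ~16 bytes vs work_mem = 500MB. */
    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        double   rows = 5917539491.0;           /* current table count */
        double   width = 16.0;                  /* planner's row width */
        double   work_mem = 500.0 * 1024 * 1024;
        double   inner_bytes = rows * width;    /* ~88GB of raw tuple data */
        uint64_t nbatch = 1;

        /* Smallest power-of-two batch count that fits one batch in work_mem. */
        while (inner_bytes / nbatch > work_mem)
            nbatch *= 2;
        printf("planned batches (no slop): %llu\n", (unsigned long long) nbatch);

        /* Where execution gave up, and what that means per partition. */
        printf("rows per partition at nbatch=2097152: %.0f\n", rows / 2097152.0);
        return 0;
    }

That comes out at 256 batches before any slop (so 256 or 512 once
per-tuple overhead is added, matching the guess above) and roughly 2,800
rows per partition at the point of failure - the "paltry 3,000" figure.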
On Sun, Nov 10, 2019 at 7:27 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Hmmm, but the expected row width is only 16B, and with 6M rows that's
> only about 90GB. So how come this needs 1TB temporary files? I'm sure
> there's a bit of overhead, but 10X seems a bit much.

(s/6M/6B/) Yeah, that comes out to only ~90GB but ... PHJ doesn't
immediately unlink files from the previous generation when it
repartitions. You need at least two generations' worth of files (old and
new) while repartitioning, but you don't need the grand-parent
generation. I didn't think this was a problem because I didn't expect to
have to repartition many times (and there is a similar but different
kind of amplification in the non-parallel code).

If this problem is due to the 100% extreme skew threshold causing us to
go berserk, then that 10X multiplier is of the right order, if you
imagine this thing started out with ~512 batches and got up to ~1M
batches before it blew a gasket.
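Spelling that estimate out (assuming, as a simplification, that each
repartitioning pass rewrites the ~90GB inner side once):

    /* Rough estimate of the temp-file amplification described above,
     * assuming each repartitioning pass rewrites the ~90GB inner side. */
    #include <stdio.h>

    int
    main(void)
    {
        double inner_gb = 90.0;
        int    passes = 0;
        int    nbatch;

        for (nbatch = 512; nbatch < 2097152; nbatch *= 2)
            passes++;                  /* 512 -> 2M batches: 12 doublings */

        printf("passes=%d, ~%.0f GB written\n", passes, passes * inner_gb);
        return 0;
    }

Twelve doublings at ~90GB a pass is on the order of 1TB of temporary
data, which lines up with the disk-usage spike reported earlier.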
On Sat, Nov 9, 2019 at 4:54 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Nov 10, 2019 at 9:05 AM James Coleman <jtc331@gmail.com> wrote: > > In this case though the failure in some sense seems fairly artificial. > > Aside from the query being slow, there doesn't appear to be any real > > limitation on the query completing. The box in question has 768GB of > > memory, so limiting this memory structure to 1GB seems artificial. > > Yeah, completely artificial. We limit ourselves to MaxAllocSize, a > self-imposed limit that applies to palloc() and dsa_allocate(), > probably dating back to a time in history when anyone asking for more > than that must surely be out of their mind. Then there are a few > places where we allow ourselves to go past that limit by passing in an > extra flag MCXT_ALLOC_HUGE or DSA_ALLOC_HUGE: from a quick grep, > that's the bitmap heap scan page table, and anything using > simplehash.h's default allocator (so I guess that includes hash agg; > that's interesting, we can do a 64GB-bucket-array hash agg but not > hash join). > > As noted on https://wiki.postgresql.org/wiki/Hash_Join, there are a > couple of artificial constraints on hash joins: the limit on the > number of hash join buckets which comes entirely from using 'int' as a > data type for bucket numbers (an anachronism from pre-standard C or > maybe just 32 bit-ism: the correct type is surely size_t, which is by > definition big enough for any array that you could possibly address), > and the MaxAllocSize thing. Easy to fix, of course. I noticed that > when working on PHJ and decided to keep the same restriction for the > new parallel code paths, because (1) it seemed like a policy choice we > should make and then apply to both, and (2) it does provide some kind > of sanity check, though it's increasingly looking overdue to be > removed (in other words: previously I was only speculating about > people showing up with ~1TB RAM machines and running into these > ancient limits, but ... here we are). Heh. We've run with some clusters at 2TB of memory for many years, but I also realize that's not all that common. > For example, suppose we added the DSA_ALLOC_HUGE flag to the line that > is failing in your case. Now it would be able to blow through that > 1GB limit, but where would it stop? Until we know how you're reaching > this state, it's hard to know whether it'd go to (say) 2GB, and then > work perfectly, job done, or whether it'd keep going until it ate all > your RAM and made your users really unhappy. Yeah, I think it's definitely non-obvious exactly what to do. In general I think preferring not failing queries is good, but on the other hand that could result in OOM-killing a bunch of other things too and/or failing other queries because they can't alloc. > I think this must be a case of extreme skew, as complained about > in[1]. Let's see... you have ~6 billion rows, and you said the > planner knew that (it estimated about a billion, when there were 6 > workers, so it's in the ball park). You didn't say how many batches > the planner planned for. Let's see if I can guess... 256 or 512? 
I ran a successful one (this run happened to have max parallel workers per gather set to 8, and used 8 workers) at 150MB work_mem, and the parallel hash node had these stats: Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB Buffers: shared hit=38360 read=142422702, temp read=45766294 written=76912272 I'm not sure how to capture the batch information on a failing run, since you only get it on ANALYZE, but the query doesn't complete so you can't get that. I do notice in the original bug report stack trace though that: #3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches (hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at ./build/../src/backend/executor/nodeHash.c:2889 which I think tells us the number of batches? That buffers written is ~615GB which corresponds to one blip on the disk graphs for this run (that was around 600GB) and the HashAggregate node at the top of the query (which is only grouping on items.account_id; see below, but this has significant skew) had: Buffers: shared hit=608075044 read=142422702, temp read=77222547 written=77284432 which corresponds to the second disk graph blip of similar size. Side note about explain output: it seems that the hash aggregate node does not output bucket stats. I've also noticed that the nested loop joins appear to not output the join condition, but hash joins output the hash condition, and if I remember correctly, merge joins output the join condition. If I'm right about that...I guess it means another thing on the list of potential patches. Seems like it'd make understanding plans easier. > That'd allow for 6 billion * 16 byte rows + some slop, chopped up into > a power-of-two number of partitions that fit inside 500MB. Then at > execution time, they didn't fit, and we went into > repartition-until-it-fits mode. At some point we tried to cut them > into ~2 million partitions and hit this error. That'd be a paltry > 3,000 rows per partition if the keys were uniformly distributed, but > it thinks that at least one partition is not fitting into 500MB. > Conclusion: unless this is a carefully crafted hash attack, there must > be one particular key has more than 500MB worth of rows, but also a > smattering of other keys that fall into the same partition, that are > causing us to keep trying to repartition until it eventually squeezes > all of them all in the same direction during a split (this requires > repeatedly splitting partitions until you reach one partition per > tuple!). So I'm wondering if this would be fixed by, say, a 95% > threshold (instead of 100%) for the extreme skew detector, as I > proposed in a patch in the first email in that thread that later > veered off into the BNL discussion[1]. Unfortunately that patch only > deals with non-parallel HJ, but a PHJ version should be easy. And if > not by 95%, I wonder what threshold would be needed for this case, and > what kind of maths to use to think about it. So I should have run the earlier attached plan with VERBOSE, but here's the interesting thing: the parallel hash node's seq scan node outputs two columns: let's call them (from the redacted plan) items.system_id and items.account_id. The first (system_id) is both not null and unique; the second (account_id) definitely has massive skew. I'm not very up-to-speed on how the hash building works, but I would have (perhaps naïvely?)
assumed that the first column being unique would make the hash keys very likely not to collide in any significantly skewed way. Am I missing something here? > If I wrote a patch like > [1] with PHJ support, would you be able to test it on a copy of your > workload? I think that would be a significant effort to pull off, but _maybe_ not entirely impossible. Certainly quite difficult though. > [1] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
On Sat, Nov 9, 2019 at 6:14 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Nov 10, 2019 at 7:27 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > Hmmm, but the expected row width is only 16B, and with 6M rows that's > > only about 90GB. So how come this needs 1TB temporary files? I'm sure > > there's a bit of overhead, but 10X seems a bit much. > > (s/6M/6B/) Yeah, that comes out to only ~90GB but ... PHJ doesn't > immediately unlink files from the previous generation when it > repartitions. You need at two generations' worth of files (old and > new) while repartitioning, but you don't need the grand-parent > generation. I didn't think this was a problem because I didn't expect > to have to repartition many times (and there is a similar but > different kind of amplification in the non-parallel code). If this > problem is due to the 100% extreme skew threshold causing us to go > berserk, then that 10X multiplier is of the right order, if you > imagine this thing started out with ~512 batches and got up to ~1M > batches before it blew a gasket. Are you saying that it also doesn't unlink the grand-parent until the end?
On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: > On Sat, Nov 9, 2019 at 6:14 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Nov 10, 2019 at 7:27 AM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > > Hmmm, but the expected row width is only 16B, and with 6M rows that's > > > only about 90GB. So how come this needs 1TB temporary files? I'm sure > > > there's a bit of overhead, but 10X seems a bit much. > > > > (s/6M/6B/) Yeah, that comes out to only ~90GB but ... PHJ doesn't > > immediately unlink files from the previous generation when it > > repartitions. You need at two generations' worth of files (old and > > new) while repartitioning, but you don't need the grand-parent > > generation. I didn't think this was a problem because I didn't expect > > to have to repartition many times (and there is a similar but > > different kind of amplification in the non-parallel code). If this > > problem is due to the 100% extreme skew threshold causing us to go > > berserk, then that 10X multiplier is of the right order, if you > > imagine this thing started out with ~512 batches and got up to ~1M > > batches before it blew a gasket. > > Are you saying that it also doesn't unlink the grand-parent until the end? Yeah, it doesn't unlink anything until the end, which obviously isn't great in this case. Might be something to add to my list of patches to write.
On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: > So I should have run the earlier attached plan with VERBOSE, but > here's the interesting thing: the parallel hash node's seq scan node > outputs two columns: let's call them (from the redacted plan) > items.system_id and items.account_id. The first (system_id) is both > not null and unique; the second (account_id) definitely has massive > skew. I'm not very up-to-speed on how the hash building works, but I > would have (perhaps naïvely?) assumed that the first column being > unique would make the hash keys very likely not to collide in any > significantly skewed way. Am I missing something here? Hrm. So the compound key is unique then. I was assuming up until now that it had duplicates. The hashes of the individual keys are combined (see ExecHashGetHashValue()), so assuming there is nothing funky about the way citext gets hashed (and it's not jumping out at me), your unique keys should give you uniform hash values and thus partition size, and repartitioning should be an effective way of reducing hash table size. So now it sounds like you have a simple case of underestimation, but now I'm confused about how you got a 344MB hash table with work_mem = 150MB: Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB And I'm confused about what was different when it wanted the crazy number of batches.
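For readers who want to follow the ExecHashGetHashValue() reference, the per-key combination works roughly like the standalone sketch below (a simplified rendering from memory, not the actual nodeHash.c source): the running value is rotated left one bit per join key and XORed with that key's datum hash, so a single unique key such as system_id yields that key's hash unchanged and should partition uniformly.

    #include <stdint.h>

    /*
     * Simplified sketch of how the hashes of the individual join keys are
     * folded into one 32-bit hash: rotate left 1 bit, then XOR in the next
     * key's hash (hash_any() of the citext value in this thread's case).
     */
    uint32_t
    combine_key_hashes(const uint32_t *key_hashes, int nkeys)
    {
        uint32_t hashkey = 0;

        for (int i = 0; i < nkeys; i++)
        {
            hashkey = (hashkey << 1) | ((hashkey & 0x80000000u) ? 1u : 0u);
            hashkey ^= key_hashes[i];
        }
        return hashkey;
    }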
On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: > > So I should have run the earlier attached plan with VERBOSE, but > > here's the interesting thing: the parallel hash node's seq scan node > > outputs two columns: let's call them (from the redacted plan) > > items.system_id and items.account_id. The first (system_id) is both > > not null and unique; the second (account_id) definitely has massive > > skew. I'm not very up-to-speed on how the hash building works, but I > > would have (perhaps naïvely?) assumed that the first column being > > unique would make the hash keys very likely not to collide in any > > significantly skewed way. Am I missing something here? > > Hrm. So the compound key is unique then. I was assuming up until now > that it had duplicates. The hashes of the individual keys are > combined (see ExecHashGetHashValue()), so assuming there is nothing > funky about the way citext gets hashed (and it's not jumping out at > me), your unique keys should give you uniform hash values and thus > partition size, and repartitioning should be an effective way of > reducing hash table size. So now it sounds like you have a simple > case of underestimation, but now I'm confused about how you got a > 344MB hash table with work_mem = 150MB: Looking at the source, citext's hash is a pretty standard call to hash_any, so I don't see how that would lead to any oddities (barring an intentional hash collision etc., but this is real data). Do you have any theories about where the underestimation is happening? It knows the number of rows reasonably well. The one thing I'm not sure about yet is the row width = 16. The account_id is a bigint, so we can assume it knows the size properly there. The system_id being citext...I think the 8 bytes it has for that is probably a bit low on average, but I'm not sure yet by how much (I'm going to run a query to find out). Hmm. Query came back, and average length is just shy of 7.9...so the 16 byte row size is looking pretty good. So I'm at a loss as to why/what it would be underestimating (does it know about uniqueness/non-uniqueness? could that be a factor?). I also don't know why it seems to regularly fail on the primary, but not on the sync, unless we adjust the work_mem up. I've double-checked all GUCs and the only differences are things related to replication that you'd expect to be different on primary/replica. I know there are some things that execute differently on replicas, so I assume it's something like that, but I don't know what specifically would cause this here. > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally > 4096) Memory Usage: 344448kB > > And I'm confused about what was different when it wanted the crazy > number of batches. I'm not quite sure how to find out; if you have any ideas, I'd love to hear them. The one thing I can think of to try is to slowly increase work_mem (since setting it to 500MB reproduced the error on the replica) and see if the bucket info starts to trend up. James
On Sun, Nov 10, 2019 at 09:15:06AM -0500, James Coleman wrote: >On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> >> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: >> > So I should have run the earlier attached plan with VERBOSE, but >> > here's the interesting thing: the parallel hash node's seq scan node >> > outputs two columns: let's call them (from the redacted plan) >> > items.system_id and items.account_id. The first (system_id) is both >> > not null and unique; the second (account_id) definitely has massive >> > skew. I'm not very up-to-speed on how the hash building works, but I >> > would have (perhaps naïvely?) assumed that the first column being >> > unique would make the hash keys very likely not to collide in any >> > significantly skewed way. Am I missing something here? >> >> Hrm. So the compound key is unique then. I was assuming up until now >> that it had duplicates. The hashes of the individual keys are >> combined (see ExecHashGetHashValue()), so assuming there is nothing >> funky about the way citext gets hashed (and it's not jumping out at >> me), your unique keys should give you uniform hash values and thus >> partition size, and repartitioning should be an effective way of >> reducing hash table size. So now it sounds like you have a simple >> case of underestimation, but now I'm confused about how you got a >> 344MB hash table with work_mem = 150MB:

Why would the compound key matter? The join only does this

    Hash Cond: (group_details.group_number = items.system_id)

so the only thing that really matters when it comes to skew is system_id, i.e. the citext column. The other column is used by aggregation, but that's irrelevant here, no?

>Looking at the source, citext's hash is a pretty standard call to >hash_any, so I don't see how that would lead to any oddities (barring >an intentional hash collision etc., but this is real data). > >Do you have any theories about where the underestimation is happening? >It knows the number of rows reasonably well. The one thing I'm not >sure about yet is the row width = 16. The account_id is a bigint, so >we can assume it knows the size properly there. The system_id being >citext...I think the 8 bytes it has for that is probably a bit low on >average, but I'm not sure yet by how much (I'm going to run a query to >find out). Hmm. Query came back, and average length is just shy of >7.9...so the 16 byte row size is looking pretty good. So I'm at a loss >of why/what it would be underestimating (does it know about >uniqueness/non-uniqueness? could that be a factor?)?

I think the easiest thing we can do is to run a couple of queries collecting useful stats, like

    -- the most frequent system_id values
    select system_id, count(*), sum(length(system_id)) from items
    group by system_id
    order by count(*) desc limit 100;

    -- the largest group by system_id
    select system_id, count(*), sum(length(system_id)) from items
    group by system_id
    order by sum(length(system_id)) desc limit 100;

That should tell us if there's something off.

>I also don't know why it seems to regularly fail on the primary, but >not on the sync, unless we adjust the work_mem up. I've double-checked >all GUCs and the only differences are things related to replication >that you'd expect to be different on primary/replica. > >I know there are some things that execute differently on replicas, so >I assume it's something like that, but I don't know what specifically >would cause this here.
Hmm, I hadn't realized it fails on the primary more often. That's certainly strange; I don't have a very good idea why that could be happening. Do the query plans look the same? What does the EXPLAIN ANALYZE output look like? >> Buckets: 4194304 (originally 4194304) Batches: 32768 (originally >> 4096) Memory Usage: 344448kB >> >> And I'm confused about what was different when it wanted the crazy >> number of batches. > >I'm not quite sure how to find out; if you have any ideas, I'd love to >hear them. The one thing I can think of to try is to slowly increase >work_mem (since setting it to 500MB reproduced the error on the >replica) and see if the bucket info starts to trend up.

Chances are it actually happened to detect skew and disabled the growth, hence the table grew above work_mem. We should probably add info about this (skew, disabling growth) to EXPLAIN ANALYZE verbose mode. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Nov 10, 2019 at 12:29 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Sun, Nov 10, 2019 at 09:15:06AM -0500, James Coleman wrote: > >On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <thomas.munro@gmail.com> wrote: > >> > >> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: > >> > So I should have run the earlier attached plan with VERBOSE, but > >> > here's the interesting thing: the parallel hash node's seq scan node > >> > outputs two columns: let's call them (from the redacted plan) > >> > items.system_id and items.account_id. The first (system_id) is both > >> > not null and unique; the second (account_id) definitely has massive > >> > skew. I'm not very up-to-speed on how the hash building works, but I > >> > would have (perhaps naïvely?) assumed that the first column being > >> > unique would make the hash keys very likely not to collide in any > >> > significantly skewed way. Am I missing something here? > >> > >> Hrm. So the compound key is unique then. I was assuming up until now > >> that it had duplicates. The hashes of the individual keys are > >> combined (see ExecHashGetHashValue()), so assuming there is nothing > >> funky about the way citext gets hashed (and it's not jumping out at > >> me), your unique keys should give you uniform hash values and thus > >> partition size, and repartitioning should be an effective way of > >> reducing hash table size. So now it sounds like you have a simple > >> case of underestimation, but now I'm confused about how you got a > >> 344MB hash table with work_mem = 150MB: > > > > Why would the compound key matter? The join only does this > > Hash Cond: (group_details.group_number = items.system_id) > > so the only thing that really matters when it comes to skew is > system_id, i.e. the citext column. The other column is used by > aggregation, but that's irrelevant here, no? Yeah, that seems right. I think I got confused because account_id is used in another join (well, technically a correlated subquery, but effectively a join in implementation). But even so, system_id is unique and not null. So the above discussion of how uniqueness affects this still holds, right? > >Looking at the source, citext's hash is a pretty standard call to > >hash_any, so I don't see how that would lead to any oddities (barring > >an intentional hash collision etc., but this is real data). > > > >Do you have any theories about where the underestimation is happening? > >It knows the number of rows reasonably well. The one thing I'm not > >sure about yet is the row width = 16. The account_id is a bigint, so > >we can assume it knows the size properly there. The system_id being > >citext...I think the 8 bytes it has for that is probably a bit low on > >average, but I'm not sure yet by how much (I'm going to run a query to > >find out). Hmm. Query came back, and average length is just shy of > >7.9...so the 16 byte row size is looking pretty good. So I'm at a loss > >of why/what it would be underestimating (does it know about > >uniqueness/non-uniqueness? could that be a factor?)?
> > > > I think the easiest thing we can do is running a couple of queries > collecting useful stats, like > > -- the most frequent system_id values > select system_id, count(*), sum(length(system_id)) from items > group by system_id > order by count(*) desc limit 100; > > -- the largest group by system_id > select system_id, count(*), sum(length(system_id)) from items > group by system_id > order by sum(length(system_id)) desc limit 100; > > That should tell us if there's something off. Since the column is unique, we don't need to do any grouping to get stats; as noted earlier I queried and found that the average length was just shy of 7.9 chars long (though I happen to know the length varies at least between 7 and 34). > >I also don't know why it seems to regularly fail on the primary, but > >not on the sync, unless we adjust the work_mem up. I've double-checked > >all GUCs and the only differences are things related to replication > >that you'd expect to be different on primary/replica. > > > >I know there are some things that execute differently on replicas, so > >I assume it's something like that, but I don't know what specifically > >would cause this here. > > > > Hmm, I haven't realized it fails on primary more often. That's certainly > strange, I don't have a very good idea why that could be happening. Do > the query plans look the same? How does the EXPLAIN ANALYZE look like? So far they all seem to look very similar. The other possibility is that it's just that the replica is behind (it's not a sync replica), but this has happened enough times on different days that that explanation feels pretty unsatisfactory to me. As mentioned in my previous email, I'm slowly checking against various work_mem values to see if I can determine 1.) where it starts to only fail on the primary and 2.) if the bucket stats start to grow before the failure case. So far I've seen it fail on both primary and sync at 500MB, 400MB, 300MB, and 200MB, and, interestingly, with today's data, succeed on both at 150MB. That last point does imply data changing could have something to do with the sync/replica difference I suppose. ...And, it just failed on the replica at 150MB and succeeded on the primary at 150MB. So...probably data changing? In all of those runs I haven't seen the bucket stats change. > >> Buckets: 4194304 (originally 4194304) Batches: 32768 (originally > >> 4096) Memory Usage: 344448kB > >> > >> And I'm confused about what was different when it wanted the crazy > >> number of batches. > > > >I'm not quite sure how to find out; if you have any ideas, I'd love to > >hear them. The one thing I can think of to try is to slowly increase > >work_mem (since setting it to 500MB reproduced the error on the > >replica) and see if the bucket info starts to trend up. > > > > Chances are it actually happened to detect skew and disabled the growth, > hence the table grew above work_mem. We should probably add info about > this (skew, disabling growth) to EXPLAIN ANALYZE verbose mode. Would skew be likely though if the column is unique?
I think I see what's happening: we're running out of hash bits. > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB Here it's using the lower 22 bits for the bucket number, and started out using 12 bits for the batch (!), and increased that until it got to 15 (!!). After using 22 bits for the bucket, there are only 10 bits left, so all the tuples go into the lower 1024 batches. I'm not sure how exactly this leads to wildly varying numbers of repartitioning cycles (the above-quoted example did it 3 times, the version that crashed into MaxAllocSize did it ~10 times). Besides switching to 64 bit hashes so that we don't run out of information (clearly a good idea), what other options do we have? (1) We could disable repartitioning (give up on work_mem) after we've run out of bits; this will eat more memory than it should. (2) You could start stealing bucket bits; this will eat more CPU than it should, because you'd effectively have fewer active buckets (tuples will be concentrated on the bits you didn't steal).
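To make the bit budget concrete, here is a small standalone illustration (not PostgreSQL source; the real split happens in ExecHashGetBucketAndBatch()) using the figures quoted above: with 4194304 buckets the low 22 bits are consumed by the bucket number, and because the unpatched code takes the batch number from the bits directly above them, only 10 usable bits remain and no tuple ever lands in a batch above 1023, even though nbatch is 32768.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    int
    main(void)
    {
        const uint32_t nbatch = 32768;          /* 2^15, from the EXPLAIN above */
        const int      log2_nbuckets = 22;      /* 4194304 buckets */
        uint32_t       max_batchno = 0;

        for (int i = 0; i < 1000000; i++)
        {
            /* stand-in for a well-mixed 32-bit hash value */
            uint32_t hashvalue = (uint32_t) random() ^ ((uint32_t) random() << 16);
            uint32_t batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);

            if (batchno > max_batchno)
                max_batchno = batchno;
        }

        /* prints 1023: only 10 of the 15 requested batch bits actually vary */
        printf("highest batch number seen: %u (nbatch = %u)\n",
               max_batchno, nbatch);
        return 0;
    }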
On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: >I think I see what's happening: we're running out of hash bits. > >> Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB > >Here it's using the lower 22 bits for the bucket number, and started >out using 12 bits for the batch (!), and increased that until it got >to 15 (!!). After using 22 bits for the bucket, there are only 10 >bits left, so all the tuples go into the lower 1024 batches. > Ouch! >I'm not sure how exactly this leads to wildly varying numbers of >repartioning cycles (the above-quoted example did it 3 times, the >version that crashed into MaxAllocSize did it ~10 times). > >Besides switching to 64 bit hashes so that we don't run out of >information (clearly a good idea), what other options do we have? (1) >We could disable repartitioning (give up on work_mem) after we've run >out of bits; this will eat more memory than it should. (2) You could >start stealing bucket bits; this will eat more CPU than it should, >because you'd effectively have fewer active buckets (tuples will >concentrated on the bits you didn't steal). Can't we simply compute two hash values, using different seeds - one for bucket and the other for batch? Of course, that'll be more expensive. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thomas Munro <thomas.munro@gmail.com> writes: > I think I see what's happening: we're running out of hash bits. Ugh. > Besides switching to 64 bit hashes so that we don't run out of > information (clearly a good idea), what other options do we have? If the input column is less than 64 bits wide, this is an illusory solution, because there can't really be that many independent hash bits. I think we need to have some constraint that keeps us from widening the batch+bucket counts to more hash bits than we have. regards, tom lane
On Sun, Nov 10, 2019 at 4:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > I think I see what's happening: we're running out of hash bits. > > > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB > > Here it's using the lower 22 bits for the bucket number, and started > out using 12 bits for the batch (!), and increased that until it got > to 15 (!!). After using 22 bits for the bucket, there are only 10 > bits left, so all the tuples go into the lower 1024 batches. > > I'm not sure how exactly this leads to wildly varying numbers of > repartioning cycles (the above-quoted example did it 3 times, the > version that crashed into MaxAllocSize did it ~10 times). > > Besides switching to 64 bit hashes so that we don't run out of > information (clearly a good idea), what other options do we have? (1) > We could disable repartitioning (give up on work_mem) after we've run > out of bits; this will eat more memory than it should. (2) You could > start stealing bucket bits; this will eat more CPU than it should, > because you'd effectively have fewer active buckets (tuples will > concentrated on the bits you didn't steal). Thanks for working through this! But now we're at the end of my understanding of how hash tables and joins are implemented in PG; is there a wiki page or design that might give me some current design description of how the buckets and batches work with the hash so I can keep following along? James
On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: >On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: >>I think I see what's happening: we're running out of hash bits. >> >>>Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB >> >>Here it's using the lower 22 bits for the bucket number, and started >>out using 12 bits for the batch (!), and increased that until it got >>to 15 (!!). After using 22 bits for the bucket, there are only 10 >>bits left, so all the tuples go into the lower 1024 batches. >> > >Ouch! > >>I'm not sure how exactly this leads to wildly varying numbers of >>repartioning cycles (the above-quoted example did it 3 times, the >>version that crashed into MaxAllocSize did it ~10 times). >> >>Besides switching to 64 bit hashes so that we don't run out of >>information (clearly a good idea), what other options do we have? (1) >>We could disable repartitioning (give up on work_mem) after we've run >>out of bits; this will eat more memory than it should. (2) You could >>start stealing bucket bits; this will eat more CPU than it should, >>because you'd effectively have fewer active buckets (tuples will >>concentrated on the bits you didn't steal). > >Can't we simply compute two hash values, using different seeds - one for >bucket and the other for batch? Of course, that'll be more expensive. Meh, I realized that's pretty much just a different way to get 64-bit hashes (which is what you mentioned). And I think you're right that we should detect cases where we use all the bits, and stop making it worse. Stealing bucket bits seems reasonable - we can't stop adding batches, because that would mean we stop enforcing work_mem. The question is how far that gets us - we enforce nbuckets to be at least 1024 (i.e. 10 bits), which leaves 22 bits for nbatch. And in one of the explains we've seen nbatch=2097152 (i.e. 21 bits). Of course, this is bound to be (extremely) slow. The hashjoin changes in 9.5 which reduced the hash table load factor from 10 to 1 resulted in speedups close to 3x. And if you go from 4194304 to 1024 buckets, those will be loooooooong chains in each bucket :-( regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Nov 10, 2019 at 04:34:09PM -0500, Tom Lane wrote: >Thomas Munro <thomas.munro@gmail.com> writes: >> I think I see what's happening: we're running out of hash bits. > >Ugh. > >> Besides switching to 64 bit hashes so that we don't run out of >> information (clearly a good idea), what other options do we have? > >If the input column is less than 64 bits wide, this is an illusory >solution, because there can't really be that many independent hash bits. >I think we need to have some constraint that keeps us from widening >the batch+bucket counts to more hash bits than we have. > That's not quite correct - it's a solution/improvement as long as the values have more than 32 bits of information. I.e. it works even for values that are less than 64 bits wide, as long as there's more than 32 bits of information (assuming the hash does not lose the extra bits). In the case discussed in this thread we're dealing with a citext column, with average length ~8 chars. So it probably has less than 64 bits of information, because it's likely just alphanumeric, or something like that. But likely more than 32 bits. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
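A quick back-of-the-envelope check of that claim, purely for illustration; it assumes the ~7.9-character system_id values draw from a case-insensitive alphanumeric alphabet of 36 symbols, which is an assumption and not something the thread confirms:

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        double alphabet = 36.0;     /* assumed: case-insensitive letters + digits */
        double avg_len = 7.9;       /* average length reported upthread */

        /* ~40.8 bits: more than 32, well short of 64, matching the argument */
        printf("approx. information content: %.1f bits\n",
               avg_len * log2(alphabet));
        return 0;
    }

(Build with -lm; the exact number obviously shifts with the real alphabet and length distribution.)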
On Mon, Nov 11, 2019 at 10:38 AM James Coleman <jtc331@gmail.com> wrote: > But now we're at the end of my understanding of how hash tables and > joins are implemented in PG; is there a wiki page or design that might > give me some current design description of how the buckets and batches > work with the hash so I can keep following along? We have something like the so-called "Grace" hash join (with the "hybrid" refinement, irrelevant for this discussion): https://en.wikipedia.org/wiki/Hash_join#Grace_hash_join Our word "batch" just means partition. Most descriptions talk about using two different hash functions for partition and bucket, but our implementation uses a single hash function, and takes some of the bits to choose the bucket and some of the bits to choose the batch. That happens here: https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeHash.c#L1872
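The function linked above boils down to something like the following sketch (simplified, not the verbatim source): both indexes come from one 32-bit hash, the bucket from the low log2(nbuckets) bits and the batch from the bits immediately above them, which is why the two bit ranges run into each other once nbuckets * nbatch needs more than 32 bits.

    #include <stdint.h>

    /*
     * Simplified sketch of the unpatched bucket/batch split: nbuckets and
     * nbatch are both powers of two.
     */
    void
    get_bucket_and_batch(uint32_t hashvalue,
                         uint32_t nbuckets, int log2_nbuckets, uint32_t nbatch,
                         uint32_t *bucketno, uint32_t *batchno)
    {
        *bucketno = hashvalue & (nbuckets - 1);     /* low bits */
        if (nbatch > 1)
            *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);   /* next bits up */
        else
            *batchno = 0;
    }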
Hi, On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote: > On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: > > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: > > Can't we simply compute two hash values, using different seeds - one for > > bucket and the other for batch? Of course, that'll be more expensive. > > Meh, I realized that's pretty much just a different way to get 64-bit > hashes (which is what you mentioned). I'm not sure it's really the same, given practical realities in postgres. Right now the "extended" hash function supporting 64 bit hash functions is optional. So we couldn't unconditionally rely on it being present, even in master, unless we're prepared to declare it as required from now on. So computing two different hash values at the same time, by using a different IV and a different combine function, doesn't seem like an unreasonable approach. Greetings, Andres Freund
On Sun, Nov 10, 2019 at 02:46:31PM -0800, Andres Freund wrote: >Hi, > >On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote: >> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: >> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: >> > Can't we simply compute two hash values, using different seeds - one for >> > bucket and the other for batch? Of course, that'll be more expensive. >> >> Meh, I realized that's pretty much just a different way to get 64-bit >> hashes (which is what you mentioned). > >I'm not sure it's really the same, given practical realities in >postgres. Right now the "extended" hash function supporting 64 bit hash >functions is optional. So we couldn't unconditionally rely on it being >present, even in master, unless we're prepared to declare it as >required from now on. > >So computing two different hash values at the same time, by using a >different IV and a different combine function, doesn't seem like an >unreasonable approach. > True. I was commenting on the theoretical fact that computing two 32-bit hashes is close to computing a 64-bit hash, but you're right there are implementation details that may make it more usable in our case. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 11, 2019 at 12:44 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Sun, Nov 10, 2019 at 02:46:31PM -0800, Andres Freund wrote: > >On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote: > >> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: > >> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: > >> > Can't we simply compute two hash values, using different seeds - one for > >> > bucket and the other for batch? Of course, that'll be more expensive. > >> > >> Meh, I realized that's pretty much just a different way to get 64-bit > >> hashes (which is what you mentioned). > > > >I'm not sure it's really the same, given practical realities in > >postgres. Right now the "extended" hash function supporting 64 bit hash > >functions is optional. So we couldn't unconditionally rely on it being > >present, even in master, unless we're prepared to declare it as > >required from now on. > > > >So computing two different hash values at the same time, by using a > >different IV and a different combine function, doesn't seem like an > >unreasonable approach. > > True. I was commenting on the theoretical fact that computing two 32-bit > hashes is close to computing a 64-bit hash, but you're right there are > implementation details that may make it more usable in our case. Here is a quick sketch of something like that, for discussion only. I figured that simply mixing the hash value we have with some arbitrary bits afterwards would be just as good as having started with a different IV, which leads to a very simple change without refactoring. From quick experiments with unique keys (generate_series) I seem to get approximately even sized partitions, and correct answers, but I make no claim to strong hash-math-fu and haven't tested on very large inputs. Thoughts?
Attachment
On Mon, Nov 11, 2019 at 01:33:41PM +1300, Thomas Munro wrote: >On Mon, Nov 11, 2019 at 12:44 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> On Sun, Nov 10, 2019 at 02:46:31PM -0800, Andres Freund wrote: >> >On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote: >> >> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: >> >> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: >> >> > Can't we simply compute two hash values, using different seeds - one for >> >> > bucket and the other for batch? Of course, that'll be more expensive. >> >> >> >> Meh, I realized that's pretty much just a different way to get 64-bit >> >> hashes (which is what you mentioned). >> > >> >I'm not sure it's really the same, given practical realities in >> >postgres. Right now the "extended" hash function supporting 64 bit hash >> >functions is optional. So we couldn't unconditionally rely on it being >> >present, even in master, unless we're prepared to declare it as >> >required from now on. >> > >> >So computing two different hash values at the same time, by using a >> >different IV and a different combine function, doesn't seem like an >> >unreasonable approach. >> >> True. I was commenting on the theoretical fact that computing two 32-bit >> hashes is close to computing a 64-bit hash, but you're right there are >> implementation details that may make it more usable in our case. > >Here is a quick sketch of something like that, for discussion only. I >figured that simply mixing the hash value we have with some arbitrary >bits afterwards would be just as good as having started with a >different IV, which leads to a very simple change without refactoring. >From quick experiments with unique keys (generate_series) I seem to >get approximately even sized partitions, and correct answers, but I >make no claim to strong hash-math-fu and haven't tested on very large >inputs. Thoughts? Hmmm, interesting. Using a hard-coded constant seems a bit weird, although I don't have any precise argument why it's wrong yet. Essentially, the patch does this:

    batchno = hash_combine(0x9e3779b9, hashvalue) & (nbatch - 1);
    bucketno = hashvalue & (nbuckets - 1);

but adding a constant does not 'propagate' any bits from the original value to the hash. So it seems pretty much equivalent to

    batchno = hashvalue & (nbatch - 1);
    bucketno = hashvalue & (nbuckets - 1);

so we still operate with the same number of bits. It might solve this particular case, but I doubt it's a good idea in general ... I think we will have to actually compute two hashes, to get more than 32 bits. But yes, that'll be more invasive, and unlikely to be backpatchable. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
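For reference when weighing that argument, the 32-bit hash_combine() being discussed is the boost-style mixer below (reproduced from memory, so treat it as an approximation of src/include/utils/hashutils.h). Unlike a plain mask it does fold shifted copies of the accumulator into the result, but it cannot manufacture information beyond the 32 bits it is given.

    #include <stdint.h>

    /* Boost-style 32-bit hash combiner, as described in the discussion above. */
    static inline uint32_t
    hash_combine32(uint32_t a, uint32_t b)
    {
        a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
        return a;
    }

    /*
     * The draft patch's scheme, per the quote above (illustrative rendering):
     *
     *   batchno  = hash_combine32(seed, hashvalue) & (nbatch - 1);
     *   bucketno = hashvalue & (nbuckets - 1);
     */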
On Sun, Nov 10, 2019 at 4:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > I think I see what's happening: we're running out of hash bits. > > > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB > > Here it's using the lower 22 bits for the bucket number, and started > out using 12 bits for the batch (!), and increased that until it got > to 15 (!!). After using 22 bits for the bucket, there are only 10 > bits left, so all the tuples go into the lower 1024 batches. Do we have this kind of problem with hash aggregates also? I've noticed the temp disk usage pattern applies to both, and the buffer stats show that to be the case, but unfortunately the hash aggregate node doesn't report memory usage for its hash or buckets info. Given it's not a join, maybe we only need buckets and not batches, but I don't know this part of the code at all, so I'm just guessing either way. James
On Mon, Nov 11, 2019 at 09:14:43AM -0500, James Coleman wrote: >On Sun, Nov 10, 2019 at 4:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> >> I think I see what's happening: we're running out of hash bits. >> >> > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB >> >> Here it's using the lower 22 bits for the bucket number, and started >> out using 12 bits for the batch (!), and increased that until it got >> to 15 (!!). After using 22 bits for the bucket, there are only 10 >> bits left, so all the tuples go into the lower 1024 batches. > >Do we have this kind of problem with hash aggregates also? I've >noticed the temp disk usage pattern applies to both, and the buffers >stats shows that being the case, but unfortunately the hash aggregate >node doesn't report memory usage for its hash or buckets info. Given >it's not a join, maybe we only need buckets and not batches, but I >don't know this part of the code at all, so I'm just guessing to >assume either way. > I don't think so, because hash aggregate does not spill to disk. The trouble with hashjoins is that we need two independent indexes - bucket and batch, and we only have a single 32-bit hash value. The hash agg is currently unable to spill to disk, so it's not affected by this (but it does have plenty of issues on its own). There's work in progress aiming to add memory-bounded hash aggregate, but I think the spilling is supposed to work very differently there. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 11, 2019 at 11:46 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Essentially, the patch does this: > > batchno = hash_combine(0x9e3779b9, hashvalue) & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); > > but adding a constant does not 'propagate' any bits from the original > value to the hash. So it seems pretty much equivalent to > > batchno = hashvalue & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); > > so we still operate with the same number of bits. It might solve this > particular case, but I doubt it's a good idea in general ... Yeah, I think you're generally right, but it's much better than the specific coding batchno = hashvalue & (nbatch - 1). With that recipe, batch #42 finishes up with all its tuples crammed into bucket #42. That's the worst case of the steal-the-bucket-bits idea mentioned earlier. After sleeping on it, I think the re-hash idea is equivalent to stealing the minimum number of bucket bits. A simple but subtly broken version of that would be: take bucket number from the low end, and batch number from the high end, and don't worry if they begin to overlap because it's the best we can do: batchno = (hashvalue >> (hash_width - nbatch_width)) & (nbatch - 1); bucketno = hashvalue & (nbuckets - 1); Tuples start concentrating in a subset of buckets when the bit ranges begin to overlap, due to a correlation between batch and bucket, but that seems to be true for the re-hash idea too. You can't actually use that exact coding, though, because of the way we increase nbatch in serial hash joins: we're only allowed to add new bits to the high end of the generated batch numbers so that every tuple either stays in the current batch or is "thrown forward" to a higher-numbered batch when nbatch increases. So instead you could write it as: batchno = reverse_bits(hashvalue) & (nbatch - 1); bucketno = hashvalue & (nbuckets - 1); I'm not sure if reversing bits can be done any more efficiently than the cheap-and-cheerful hash_combine() code, though. With the right inlining it seems like nice straight-line no-jmp no-lookup-table code, so maybe it's still worth considering if it has equivalent performance for our purposes. A tangential observation is that with any of these approaches you can probably lift the current restriction on increasing nbucket when nbatch > 1 (that prohibition is based on not wanting overlapping bucket and batch bits ranges, but if we're deciding that's OK in extreme cases and taking steps to minimise it in usual cases...) > I think we will have to actually compute two hashes, to get more than 32 > bits. But yes, that'll be more invasive, and unlikely backpatchable. Agreed. I think if you started with a different IV *inside the hash function*, in other words, if you effectively had two different hash functions (which is the way Grace is usually described) so that the two hash function each get to see (potentially) more than 32 bits' worth of input, it'd be about the same as having a single 64 bit hash and using non-overlapping bit ranges for batch and bucket. Nothing like that seems plausible for back branches.
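An illustrative bit-reversal routine for the scheme sketched above (a standard swap-based reversal; the reverse_bits_u32() in the draft patch may differ in detail):

    #include <stdint.h>

    /* Reverse the bits of a 32-bit word by swapping progressively larger groups. */
    static inline uint32_t
    reverse_bits_u32(uint32_t x)
    {
        x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);  /* adjacent bits */
        x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);  /* bit pairs */
        x = ((x & 0x0f0f0f0fu) << 4) | ((x >> 4) & 0x0f0f0f0fu);  /* nibbles */
        x = (x << 24) | ((x & 0xff00u) << 8) |
            ((x >> 8) & 0xff00u) | (x >> 24);                     /* bytes */
        return x;
    }

    /*
     * With it, the "take batch bits from the other end" computation above is:
     *
     *   batchno  = reverse_bits_u32(hashvalue) & (nbatch - 1);
     *   bucketno = hashvalue & (nbuckets - 1);
     */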
Hi, On 2019-11-11 11:46:05 +0100, Tomas Vondra wrote: > On Mon, Nov 11, 2019 at 01:33:41PM +1300, Thomas Munro wrote: > Hmmm, iteresting. Using a hard-coded constant seems a bit weird, > although I don't have any precise argument why it's wrong yet. > > Essentially, the patch does this: > > batchno = hash_combine(0x9e3779b9, hashvalue) & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); > > but adding a constant does not 'propagate' any bits from the original > value to the hash. So it seems pretty much equivalent to I was over IM arguing that we ought to also use a different mixing function when computing the hash. > batchno = hashvalue & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); > so we still operate with the same number of bits. It might solve this > particular case, but I doubt it's a good idea in general ... > > I think we will have to actually compute two hashes, to get more than 32 > bits. But yes, that'll be more invasive, and unlikely backpatchable. I think it might also just generally not be acceptable, from a performance POV. Computing the hashes is already a major bottleneck for HJ. Penalizing everybody for these extreme cases doesn't strike me as great. It might be ok to compute a 64bit hash, but I don't see how computing two hashes in parallel wouldn't have significant overhead. I don't really see what the problem is with using just a differently mixed hash. There's not really a requirement for there to be a tremendous amount of independence between the bits used for the bucket, and the bits for the batch. We cannot just reuse exactly the same bits for the bucket that we used for the hash, as otherwise we'd just raise the number of bucket conflicts (as the whole hashtable will only contain tuples with the reduced number of bits). But I don't see why that implies a need for full bit independence. As long as the mixing guarantees that we use the bits for bucketing differently than the ones for batches, I think we're good. All that's needed is that each batch is roughly equally sized, and contains tuples with hashes that are roughly equally distributed - you could bitreverse the hash for batch selection, and get that, no? Greetings, Andres Freund
On Thu, Nov 14, 2019 at 7:48 AM Andres Freund <andres@anarazel.de> wrote: > I don't really see what the problem is with using just a differently > mixed hash. There's not really a requirement for there to be a > tremendous amount of independence between the bits used for the bucket, > and the bits for the batch. We cannot just reuse exactly the same bits > for the bucket that we used for the hash, as otherwise we'd just raise > the number of bucket conflicts (as the whole hashtable will only contain > tuples with the reduced number of bits). But I don't see why that > implies a need for full bit indepence. As long as the mixing guarantees > that we use the bits for bucketing differently than the ones for > batches, I think we're good. All that need is that each batch is is > roughly equally sized, and contains tuples with hashes that roughly > equally distributed - you could bitreverse the hash for batch selection, > and get that, no? Right, that's what I said earlier: > batchno = reverse_bits(hashvalue) & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); My theory is that batchno = hash_combine(<constant>, hashvalue) & (nbatch - 1) has about the same distribution properties. That is, it works nicely up until you have more than about 2^32 keys, and then it begins to break down due to lack of information, but the break down manifests as more conflicts in buckets, instead of the failure to repartition once you start reading off the end of the hash value (the cause of the present bug report). The bit reverse approach might be a bit less mysterious, but when I looked at ways to implement that they looked potentially slower than the xor, shifts and adds used by hash_combine. I didn't attempt to test that, though.
On Thu, Nov 14, 2019 at 9:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Nov 14, 2019 at 7:48 AM Andres Freund <andres@anarazel.de> wrote: > > I don't really see what the problem is with using just a differently > > mixed hash. There's not really a requirement for there to be a > > tremendous amount of independence between the bits used for the bucket, > > and the bits for the batch. We cannot just reuse exactly the same bits > > for the bucket that we used for the hash, as otherwise we'd just raise > > the number of bucket conflicts (as the whole hashtable will only contain > > tuples with the reduced number of bits). But I don't see why that > > implies a need for full bit indepence. As long as the mixing guarantees > > that we use the bits for bucketing differently than the ones for > > batches, I think we're good. All that need is that each batch is is > > roughly equally sized, and contains tuples with hashes that roughly > > equally distributed - you could bitreverse the hash for batch selection, > > and get that, no? > > Right, that's what I said earlier: > > > batchno = reverse_bits(hashvalue) & (nbatch - 1); > > bucketno = hashvalue & (nbuckets - 1); > > My theory is that batchno = hash_combine(<constant>, hashvalue) & > (nbatch - 1) has about the same distribution properties. That is, it > works nicely up until you have more than about 2^32 keys, and then it > begins to break down due to lack of information, but the break down > manifests as more conflicts in buckets, instead of the failure to > repartition once you start reading off the end of the hash value (the > cause of the present bug report). The bit reverse approach might be a > bit less mysterious, but when I looked at ways to implement that they > looked potentially slower than the xor, shifts and adds used by > hash_combine. I didn't attempt to test that, though. If you have plenty of disk space and time, you can repro the problem like so:

    create unlogged table sevenb as select generate_series(1, 7000000000)::bigint i;
    analyze sevenb;
    set work_mem = '150MB';
    explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);

If you have max_parallel_workers_per_gather >= 6 it gives up the strange idea of using a sort-merge join, otherwise you also need enable_mergejoin = off to prevent that, which is curious. You can see as soon as it starts executing the join that it's only writing to 1023 partitions (that is, 1023 * nprocesses temporary files with names like "i...of4096.p...", and then when one of those gets too big, "i...of8192.p...", and so on until something goes wrong). My (slow) machine hasn't finished repartitioning yet... I'll check again tomorrow, if it hasn't melted... Patched, it writes to 4095 as it should (there is no "i0of4096..."), their sizes are uniform, and the query completes after a single repartitioning operation. (Well it's 4096 with work_mem = 150MB and 2 workers, YMMV). I'm planning to push something like this late this week (though I think the constant might be better as 0 since it doesn't matter much given the definition, and I need a back-patchable version that is open coded or I need to back-patch hash_combine), unless someone has a better idea. Separately, I'm working on an idea for how to support extended hashing for the master branch only.
On Wed, Nov 20, 2019 at 1:05 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I'm planning to push something like this late this week (though I > think the constant might be better as 0 since it doesn't matter much > given the definition, and I need a back-patchable version that is open > coded or I need to back-patch hash_combine), unless someone has a > better idea. Scratch that plan, it turned out to be a terrible idea. Although it solves the batch problem, it's much worse than I expected at bucket distribution, so the performance is bad. I think something like that might work given better bit mixing, but that sounds expensive (or maybe I'm just being really stupid here). Anyway, back to the drawing board. I'm now thinking the bit stealing scheme (ie take bits from the other end of the hash, and just let them start to overlap with bucket bits if you have to when nbatch increases) is a better plan, and it is so much easier to reason about, and still in the realm of a handful of instructions. Here are some numbers [1] showing the number of non-empty buckets ("pop") and chain length distribution for non-empty buckets, given 7 billion bigints as input.

     method | nbatch | nbucket |  pop   | min  | max  | avg  | stddev
    --------+--------+---------+--------+------+------+------+--------
     rehash |   4096 | 1048576 |    256 | 6212 | 7001 | 6674 |    110
     steal  |   4096 | 1048576 | 659574 |    1 |   17 |  2.5 |    1.5
     steal  |   4096 | 4194304 | 659574 |    1 |   17 |  2.5 |    1.5
     steal  |   8192 | 4194304 | 329685 |    1 |   17 |  2.5 |    1.5

The rehash numbers are obviously catastrophic in this case, and that's before we've even run out of hash bits. The steal numbers show that you just waste memory on empty buckets when hash bits get thin on the ground (the last two lines in the table). We could perhaps consider automatically lowering nbucket in this case too, since you can't really have more than 2^32 virtual buckets, so pretending you can just wastes array memory. So here's a draft steal-the-bucket-bits patch. Yeah, reverse_bits_u32() may be in the wrong header. But is it too slow? On my desktop I can call ExecGetBucketAndBatch() 353 million times per second (~2.8ns), and unpatched gets me 656 million/sec (~1.5ns) (though that includes a function call, and when Hash does it it's inlined), but it's only slower when nbatch > 1 due to the condition. To put that number into perspective, I can hash 10 million single-int tuples from a prewarmed seq scan in 2.5s without batching or parallelism, so that's 250ns per tuple. This'd be +0.4% of that, and I do see it in a few more samples with my profiler, but it's still basically nothing, and lost in the noise with other noisy partitioning overheads like IO. Thoughts? [1] Working: https://gist.github.com/macdice/b7eb7015846b8384972c2c7cfd6d2c22
Attachment
Hello, I can make no useful comment on the correctness of the new bit distribution. But I can make two comments on this part: On 2019-Nov-21, Thomas Munro wrote: > So here's a draft steal-the-bucket-bits patch. Yeah, > reverse_bits_u32() may be in the wrong header. I think pg_bitutils.h is the right place in master, but that file didn't exist earlier. I think access/hash.h is the right place in older branches, which is where hash_any() used to be. > But is it too slow? > On my desktop I can call ExecGetBucketAndBatch() 353 million times per > second (~2.8ns), and unpatched gets me 656 million/sec (~1.5ns) > (though that includes a function call, and when Hash does it it's > inlined), but it's only slower when nbatch > 1 due to the condition. If the whole query takes hours (or even minutes) to run, then adding one second of runtime is not going to change things in any noticeable way. I'd rather have the extreme cases working and take an additional 1.3ns per output row than not work at all. > To put that number into persepective, I can hash 10 million single-int > tuples from a prewarmed seq scan in 2.5s without batching or > parallelism, so that's 250ns per tuple. This'd be +0.4% of that, and > I do see it in a few more samples with my profiler, but it's still > basically nothing, and lost in the noise with other noisy partitioning > overheads like IO. Thoughts? That summarizes it well for me: yes, it's a slowdown, yes it's barely measurable. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 03, 2019 at 02:07:04PM -0300, Alvaro Herrera wrote: >Hello > >I can make no useful comment on the correctness of the new bit >distribution. But I can make two comments on this part: > >On 2019-Nov-21, Thomas Munro wrote: > >> So here's a draft steal-the-bucket-bits patch. Yeah, >> reverse_bits_u32() may be in the wrong header. > >I think pg_bitutils.h is the right place in master, but that file didn't >exist earlier. I think access/hash.h is the right place in older >branches, which is where hash_any() used to be. > >> But is it too slow? >> On my desktop I can call ExecGetBucketAndBatch() 353 million times per >> second (~2.8ns), and unpatched gets me 656 million/sec (~1.5ns) >> (though that includes a function call, and when Hash does it it's >> inlined), but it's only slower when nbatch > 1 due to the condition. > >If the whole query takes hours (or even minutes) to run, then adding one >second of runtime is not going to change things in any noticeable way. >I'd rather have the extreme cases working and take additional 1.3ns per >output row, than not work at all. > >> To put that number into persepective, I can hash 10 million single-int >> tuples from a prewarmed seq scan in 2.5s without batching or >> parallelism, so that's 250ns per tuple. This'd be +0.4% of that, and >> I do see it in a few more samples with my profiler, but it's still >> basically nothing, and lost in the noise with other noisy partitioning >> overheads like IO. Thoughts? > >That summarizes it well for me: yes, it's a slowdown, yes it's barely >measurable.

OK, I did a bit of testing with this patch today, and I can confirm it does indeed help quite a bit. I only used a smaller data set with 4.5B rows, but it was enough to trigger the issue - it allocated ~1TB of temp space, and then crashed. The annoying thing is that it's the workers that crash, and the leader failed to notice that, so it was waiting in WaitForParallelWorkersToExit forever. Not sure what the issue is. Anyway, with the patch applied, the query completed in ~2.2 hours, after allocating ~235GB of temp files. So that's good, the patch clearly improves the situation quite a bit. As for the performance impact, I did this:

    create table dim (id int, val text);
    insert into dim select i, md5(i::text) from generate_series(1,1000000) s(i);
    create table fact (id int, val text);
    insert into fact select mod(i,1000000)+1, md5(i::text) from generate_series(1,25000000) s(i);
    set max_parallel_workers_per_gather = 0;
    select count(*) from fact join dim using (id);

So a perfectly regular join between a 1M and a 25M table. On my machine, this takes ~8851ms on master and 8979ms with the patch (average of about 20 runs with minimal variability). That's a ~1.4% regression, so a bit more than the 0.4% mentioned before. Not a huge difference though, and some of it might be due to different binary layout etc. It's probably still a good idea to do this, although I wonder if we might start doing this only when actually running out of bits (and use the current algorithm by default). But I have no idea if that would be any cheaper, and it would add complexity. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Dec-10, Tomas Vondra wrote: > It's probably still a good idea to do this, although I wonder if we > might start doing this only when actually running out of bits (and use > the current algorithm by default). But I have no idea if that would be > any cheaper, and it would add complexity. I'm under the impression that this patch is proposed for backpatching, and that in master we can simply increase the hash width. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 10, 2019 at 02:35:16PM -0300, Alvaro Herrera wrote: >On 2019-Dec-10, Tomas Vondra wrote: > >> It's probably still a good idea to do this, although I wonder if we >> might start doing this only when actually running out of bits (and use >> the current algorithm by default). But I have no idea if that would be >> any cheaper, and it would add complexity. > >I'm under the impression that this patch is proposed for backpatching, >and that in master we can simply increase the hash width. > Possibly, although AFAIK we don't have that patch yet, and I don't recall any measurements (so it might have overhead too, not sure). But it's true additional complexity might complicate backpatching. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > As for the performance impact, I did this: > create table dim (id int, val text); > insert into dim select i, md5(i::text) from generate_series(1,1000000) s(i); > create table fact (id int, val text); > insert into fact select mod(i,1000000)+1, md5(i::text) from generate_series(1,25000000) s(i); > set max_parallel_workers_per_gather = 0; > select count(*) from fact join dim using (id); > So a perfectly regular join between 1M and 25M table. On my machine, > this takes ~8851ms on master and 8979ms with the patch (average of about > 20 runs with minimal variability). That's ~1.4% regression, so a bit > more than the 0.4% mentioned before. Not a huge difference though, and > some of it might be due to different binary layout etc. Hmm ... I replicated this experiment here, using my usual precautions to get more-or-less-reproducible numbers [1]. I concur that the patch seems to be slower, but only by around half a percent on the median numbers, which is much less than the run-to-run variation. So that would be fine --- except that in my first set of runs, I forgot the "set max_parallel_workers_per_gather" step and hence tested this same data set with a parallel hash join. And in that scenario, I got a repeatable slowdown of around 7.5%, which is far above the noise floor. So that's not good --- why does this change make PHJ worse? regards, tom lane [1] https://www.postgresql.org/message-id/31686.1574722301%40sss.pgh.pa.us
On Tue, Dec 10, 2019 at 02:49:42PM -0500, Tom Lane wrote: >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> As for the performance impact, I did this: > >> create table dim (id int, val text); >> insert into dim select i, md5(i::text) from generate_series(1,1000000) s(i); >> create table fact (id int, val text); >> insert into fact select mod(i,1000000)+1, md5(i::text) from generate_series(1,25000000) s(i); >> set max_parallel_workers_per_gather = 0; >> select count(*) from fact join dim using (id); > >> So a perfectly regular join between 1M and 25M table. On my machine, >> this takes ~8851ms on master and 8979ms with the patch (average of about >> 20 runs with minimal variability). That's ~1.4% regression, so a bit >> more than the 0.4% mentioned before. Not a huge difference though, and >> some of it might be due to different binary layout etc. > >Hmm ... I replicated this experiment here, using my usual precautions >to get more-or-less-reproducible numbers [1]. I concur that the >patch seems to be slower, but only by around half a percent on the >median numbers, which is much less than the run-to-run variation. > Sounds good. >So that would be fine --- except that in my first set of runs, >I forgot the "set max_parallel_workers_per_gather" step and hence >tested this same data set with a parallel hash join. And in that >scenario, I got a repeatable slowdown of around 7.5%, which is far >above the noise floor. So that's not good --- why does this change >make PHJ worse? > Hmmm, I can't reproduce this. For me the timings from 20 runs look like this:

           master         |         patched
   workers=2   workers=5  |  workers=2   workers=5
  ------------------------|--------------------------
        3153        1785  |       3185        1790
        3167        1783  |       3220        1783

I haven't done the extra steps with cpupower/taskset, but the numbers seem pretty consistent. I'll try on another machine tomorrow. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Dec 11, 2019 at 7:36 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Tue, Dec 10, 2019 at 02:35:16PM -0300, Alvaro Herrera wrote: > >On 2019-Dec-10, Tomas Vondra wrote: > >> It's probably still a good idea to do this, although I wonder if we > >> might start doing this only when actually running out of bits (and use > >> the current algorithm by default). But I have no idea if that would be > >> any cheaper, and it would add complexity. Yeah, I understand your desire to minimise the change here. One thing we could do to get exactly the same algorithm as today right up until we run out of bits and then begin stealing bucket bits would be to rotate instead of shifting:

- *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+ *batchno = rotate(hashvalue, hashtable->log2_nbuckets) & (nbatch - 1);

Attached. Thoughts? At a glance, the definition of rotate() that I used seems to be recognised by my compiler, so that the generated code differs only by a SHR instruction changing to a ROR instruction on my system. The bit-reversing thing has one nice advantage: you are allowed to increase nbuckets later. But then I thought of another way to achieve that: take batchno from the low end, and bucketno from the high end. I'm not pursuing that, based on the suspicion that what we all want here is the minimal possible change. > >I'm under the impression that this patch is proposed for backpatching, > >and that in master we can simply increase the hash width. > > Possibly, although AFAIK we don't have that patch yet, and I don't > recall any measurements (so it might have overhead too, not sure). But > it's true additional complexity might complicate backpatching. What I'd like to experiment with in the new year is a code specialisation scheme that further develops what I did with ExecHashJoinImpl(). This would increase the executable size by including a bunch of different variants of the hash join code, but avoid adding new branches (and also remove some existing branches) in the hot paths. We'd still write one "template" version of the algorithm that contains if statements for the points of variation, and then we'd call it from a load of wrapper functions with constant arguments and forced inlining. For example, there might be a variant for { non-parallel, 64 bit hashes, no skew optimisation }. A more extreme version of the idea inlines the actual hashing functions for a handful of common cases like text and integer keys, but that gets into combinatorial explosion territory among other problems. I have some early prototype code along these lines but it's not good enough yet; more soon. I should probably think about Andres's proposal to merge Hash and Hash Join nodes first.
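To make the one-line difference concrete, here is a minimal, compilable sketch of both variants; rotate_right32(), the table sizes and the hash value are illustrative stand-ins (the real change is in the attached patch), but the two batchno expressions mirror the diff above:

#include <stdint.h>
#include <stdio.h>

/* Rotate right by n bits; most compilers recognise this pattern and emit a
 * single ROR instruction.  Assumes 0 < n < 32. */
static inline uint32_t
rotate_right32(uint32_t x, int n)
{
    return (x >> n) | (x << (32 - n));
}

int
main(void)
{
    uint32_t hashvalue = 0xDEADBEEFu;     /* arbitrary example hash */
    int      log2_nbuckets = 22;          /* 4M buckets */
    uint32_t nbatch = 1U << 12;           /* 4096 batches: 22 + 12 > 32 bits */

    /* Unpatched "shift": once the high bits are exhausted, the upper batch
     * bits are always zero, so batch numbers above 2^10 can never be chosen. */
    uint32_t batch_shift = (hashvalue >> log2_nbuckets) & (nbatch - 1);

    /* Patched "rotate": identical to the shift until the high bits run out,
     * then it steals the lowest bucket bits instead of reading zeroes. */
    uint32_t batch_rotate = rotate_right32(hashvalue, log2_nbuckets) & (nbatch - 1);

    printf("shift=%u rotate=%u\n", (unsigned) batch_shift, (unsigned) batch_rotate);
    return 0;
}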
On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > The annoying thing is that it's the workers that crash, and the leader > failed to notice that, so it was waiting in WaitForParallelWorkersToExit > forever. Not sure what the issue is. That's bad. Some edge case in WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is b0rked perhaps? Can you reproduce it, and tell us what the stack looks like, maybe on a new thread?
On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: >On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> The annoying thing is that it's the workers that crash, and the leader >> failed to notice that, so it was waiting in WaitForParallelWorkersToExit >> forever. Not sure what the issue is. > >That's bad. Some edge case in >WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is >b0rked perhaps? Can you reproduce it, and tell us what the stack >looks like, maybe on a new thread? Sure. I'll try reproducing it and get this extra info. I was planning to do that, but then I lost the data for the report because of a silly mistake. Early next week, hopefully. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote: >On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: >>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra >><tomas.vondra@2ndquadrant.com> wrote: >>>The annoying thing is that it's the workers that crash, and the leader >>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit >>>forever. Not sure what the issue is. >> >>That's bad. Some edge case in >>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is >>b0rked perhaps? Can you reproduce it, and tell us what the stack >>looks like, maybe on a new thread? > >Sure. I'll try reproducing it and get this extra info. I was planning to >do that, but then I lost the data for the report because of a silly >mistake. Early next week, hopefully. > I've tried reproducing this, but so far no luck. I've only managed to do two runs, though, because this particular machine only has space for the 1TB of temp files on spinning rust. I wonder if it's possible to reproduce the issue with a smaller data set? I've tried using smaller work_mem, but then it does not trigger the issue :-( regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote: >On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote: >>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: >>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra >>><tomas.vondra@2ndquadrant.com> wrote: >>>>The annoying thing is that it's the workers that crash, and the leader >>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit >>>>forever. Not sure what the issue is. >>> >>>That's bad. Some edge case in >>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is >>>b0rked perhaps? Can you reproduce it, and tell us what the stack >>>looks like, maybe on a new thread? >> >>Sure. I'll try reproducing it and get this extra info. I was planning to >>do that, but then I lost the data for the report because of a silly >>mistake. Early next week, hopefully. >> > >I've tried reproducing this, but so far no luck. I've only managed to do >two runs, though, because this particular machine only has space for the >1TB of temp files on spinning rust. I wonder if it's possible to >reproduce the issue with a smaller data set? I've tried using smaller >work_mem, but then it does not trigger the issue :-( > And of course, the very next run after I send the "can't reproduce" message actually reproduces the issue. How typical ... So, there were ~8 workers while the query was running, with the leader process being 29417. At the moment, the list of processes looks like this: bench2 ~ # ps ax | grep postgres 11877 pts/2 S 0:00 su - postgres 14082 ? Ss 12:38 postgres: parallel worker for PID 29417 14083 ? Ss 12:28 postgres: parallel worker for PID 29417 17334 pts/0 S+ 0:00 grep --colour=auto postgres 29408 ? Ss 0:01 /var/lib/postgresql/pg-master/bin/postgres -D /mnt/data/data-master 29410 ? Ss 0:00 postgres: checkpointer 29411 ? Ss 0:01 postgres: background writer 29412 ? Ss 0:01 postgres: walwriter 29413 ? Ss 0:02 postgres: autovacuum launcher 29414 ? Ss 0:14 postgres: stats collector 29415 ? Ss 0:00 postgres: logical replication launcher 29417 ? 
Ss 75:11 postgres: postgres test [local] EXPLAIN and it failes like this: 2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032 2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032 2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker 2019-12-16 20:25:00.617 CET [29417] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14080] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14080] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14085] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14085] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14084] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14084] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14079] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14079] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14078] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14078] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1 2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1 which is ~2h ago. There's no activity at all (no I/O or anything like that), the temp directory is ~1TB of allocated files: bench2 /var/lib/postgresql # du -s /mnt/raid/pgsql_tmp/ 1099760800 /mnt/raid/pgsql_tmp/ which means nothing got cleaned up. The backtraces for the remaining backends look like this: 1) 29417 (i.e. 
the leader) 0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout); (gdb) bt #0 0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d887416d48, cur_timeout=-1, occurred_events=0x7ffc6e805c70, nevents=1)at latch.c:1080 #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d887416d48, timeout=-1, occurred_events=0x7ffc6e805c70, nevents=1, wait_event_info=134217728)at latch.c:1032 #3 0x000055d8864d4565 in WaitLatchOrSocket (latch=0x7f64670ac194, wakeEvents=17, sock=-1, timeout=-1, wait_event_info=134217728)at latch.c:407 #4 0x000055d8864d4421 in WaitLatch (latch=0x7f64670ac194, wakeEvents=17, timeout=0, wait_event_info=134217728) at latch.c:347 #5 0x000055d88644c18b in WaitForBackgroundWorkerShutdown (handle=0x55d88742abc8) at bgworker.c:1153 #6 0x000055d886147690 in WaitForParallelWorkersToExit (pcxt=0x55d88742a338) at parallel.c:842 #7 0x000055d8861478af in DestroyParallelContext (pcxt=0x55d88742a338) at parallel.c:924 #8 0x000055d886147feb in AtEOXact_Parallel (isCommit=false) at parallel.c:1197 #9 0x000055d886153774 in AbortTransaction () at xact.c:2654 #10 0x000055d886153f80 in AbortCurrentTransaction () at xact.c:3196 #11 0x000055d8865044e7 in PostgresMain (argc=1, argv=0x55d8873d41e8, dbname=0x55d8873d4118 "test", username=0x55d8873d40f8"postgres") at postgres.c:4081 #12 0x000055d88645f0d6 in BackendRun (port=0x55d8873ca150) at postmaster.c:4498 #13 0x000055d88645e7fc in BackendStartup (port=0x55d8873ca150) at postmaster.c:4189 #14 0x000055d88645abd0 in ServerLoop () at postmaster.c:1727 #15 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400 #16 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210 2) 14082 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout); (gdb) bt #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1)at latch.c:1080 #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1, wait_event_info=134217742)at latch.c:1032 #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186 #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126 #5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191 #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189 #7 0x000055d88632ca1f in ExecParallelHashTuplePrealloc (hashtable=0x55d887481e70, batchno=237, size=40) at nodeHash.c:3308 #8 0x000055d886329dfe in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10, hashvalue=994179508)at nodeHash.c:1725 #9 0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288 #10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at 
nodeHash.c:111 #11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500 #12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291 #13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588 #14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460 #15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444 #16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240 #17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405 #18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757 #19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572 #20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460 #21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444 #22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240 #23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false, operation=CMD_SELECT,sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408, execute_once=true)at execMain.c:1646 #24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true)at execMain.c:364 #25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true)at execMain.c:308 #26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401 #27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430 #28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834 #29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca340) at postmaster.c:5839 #30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065 #31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234 #32 <signal handler called> #33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e806050)at ../sysdeps/unix/sysv/linux/select.c:41 #34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691 #35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400 #36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210 3) 14083 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout); (gdb) bt #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805120, nevents=1)at latch.c:1080 #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805120, nevents=1, wait_event_info=134217742)at latch.c:1032 #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186 #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126 #5 
0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191 #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189 #7 0x000055d88632b93c in ExecParallelHashTupleAlloc (hashtable=0x55d887481e70, size=40, shared=0x7ffc6e8052b8) at nodeHash.c:2805 #8 0x000055d886329d46 in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10, hashvalue=2006491)at nodeHash.c:1702 #9 0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288 #10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at nodeHash.c:111 #11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500 #12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291 #13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588 #14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460 #15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444 #16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240 #17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405 #18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757 #19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572 #20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460 #21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444 #22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240 #23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false, operation=CMD_SELECT,sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408, execute_once=true)at execMain.c:1646 #24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true)at execMain.c:364 #25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true)at execMain.c:308 #26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401 #27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430 #28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834 #29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca590) at postmaster.c:5839 #30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065 #31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234 #32 <signal handler called> #33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e806050)at ../sysdeps/unix/sysv/linux/select.c:41 #34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691 #35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400 #36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210 So the only difference between the two remaining worker processes is frame #7, where 14082 calls ExecParallelHashTuplePrealloc while 14083 calls ExecParallelHashTupleAlloc. I'll leave the system in this "stuck" space in case we need to do further investigation. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 17, 2019 at 10:02 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote: > >On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote: > >>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: > >>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra > >>><tomas.vondra@2ndquadrant.com> wrote: > >>>>The annoying thing is that it's the workers that crash, and the leader > >>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit > >>>>forever. Not sure what the issue is. > >>> > >>>That's bad. Some edge case in > >>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is > >>>b0rked perhaps? Can you reproduce it, and tell us what the stack > >>>looks like, maybe on a new thread? > >> > >>Sure. I'll try reproducing it and get this extra info. I was planning to > >>do that, but then I lost the data for the report because of a silly > >>mistake. Early next week, hopefully. > >> > > > >I've tried reproducing this, but so far no luck. I've only managed to do > >two runs, though, because this particular machine only has space for the > >1TB of temp files on spinning rust. I wonder if it's possible to > >reproduce the issue with a smaller data set? I've tried using smaller > >work_mem, but then it does not trigger the issue :-( > > > > And of course, the very next run after I send the "can't reproduce" > message actually reproduces the issue. How typical ... Thanks for reproducing this. > 2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032 > 2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); > 2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032 > 2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker Ok, so 14081 raised an error and exited, ... > 2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1 ... and then the leader cancelled the query on 5 other workers, > 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1 > 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1 > 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1 > 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1 > 2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1 ... 
but somehow there were 2 more workers that didn't get the message and are waiting:

> #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1) at latch.c:1080
> #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1, wait_event_info=134217742) at latch.c:1032
> #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186
> #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126
> #5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
> #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189

They're waiting on a condition variable, in a barrier, while trying to coordinate an increase in the number of batches, while hashing. I suspect the details of what exactly it's waiting for are not relevant here though. ConditionVariableTimedSleep() contains a standard latch wait loop with CHECK_FOR_INTERRUPTS(), and parallel workers run with the standard SIGTERM handler die(). So what I expect to happen here is:

1. DestroyParallelContext() calls TerminateBackgroundWorker() for all workers.
2. TerminateBackgroundWorker() sets slot->terminate, and signals the postmaster.
3. The postmaster sees the terminate flag and does kill(worker, SIGTERM).
4. The worker handles SIGTERM by setting ProcDiePending and setting its own latch.
5. The ConditionVariableTimedSleep()'s syscall returns, CHECK_FOR_INTERRUPTS() runs and sees ProcDiePending.

So how did it escape this fate? Ooooh. Now I'm wondering if the problem is that ConditionVariablePrepareToSleep() contains a call to ResetLatch(), but doesn't then run CHECK_FOR_INTERRUPTS(). Perhaps that ate the latch that was set by die(). There isn't another CHECK_FOR_INTERRUPTS() until *after* the WaitEventSetWait() in ConditionVariableTimedSleep().

> I'll leave the system in this "stuck" space in case we need to do further investigation.

What are the values of InterruptHoldoffCount, CritSectionCount, InterruptPending and ProcDiePending in the stuck workers?
On Wed, Dec 18, 2019 at 11:21:11AM +1300, Thomas Munro wrote: >On Tue, Dec 17, 2019 at 10:02 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote: >> >On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote: >> >>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: >> >>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra >> >>><tomas.vondra@2ndquadrant.com> wrote: >> >>>>The annoying thing is that it's the workers that crash, and the leader >> >>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit >> >>>>forever. Not sure what the issue is. >> >>> >> >>>That's bad. Some edge case in >> >>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is >> >>>b0rked perhaps? Can you reproduce it, and tell us what the stack >> >>>looks like, maybe on a new thread? >> >> >> >>Sure. I'll try reproducing it and get this extra info. I was planning to >> >>do that, but then I lost the data for the report because of a silly >> >>mistake. Early next week, hopefully. >> >> >> > >> >I've tried reproducing this, but so far no luck. I've only managed to do >> >two runs, though, because this particular machine only has space for the >> >1TB of temp files on spinning rust. I wonder if it's possible to >> >reproduce the issue with a smaller data set? I've tried using smaller >> >work_mem, but then it does not trigger the issue :-( >> > >> >> And of course, the very next run after I send the "can't reproduce" >> message actually reproduces the issue. How typical ... > >Thanks for reproducing this. > >> 2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032 >> 2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); >> 2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032 >> 2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker > >Ok, so 14081 raised an error and exited, ... > >> 2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1 > >... and then the leader cancelled the query on 5 other workers, > >> 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1 >> 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1 >> 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1 >> 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1 >> 2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1 > >... 
but somehow there were 2 more workers that didn't get the message >and are waiting: > >> #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 >> #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1)at latch.c:1080 >> #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1,wait_event_info=134217742) at latch.c:1032 >> #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186 >> #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126 >> #5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191 >> #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189 > >They're waiting on a condition variable, in a barrier, while trying to >coordinate an increase in the number of batches, while hashing. I >suspect the details of what exactly it's waiting for are not relevant >here though. ConditionVariableTimedSleep() contains a standard latch >wait loop with CHECK_FOR_INTERRUPTS(), and parallel workers run with >the standard SIGTERM handler die(). So what I expect to happen here >is: > >1. DestroyParallelContext() calls TerminateBackgroundWorker() for all workers. >2. TerminateBackgroundWorker() sets slot->terminate, and signals the >postmaster. >3. The postmaster sees the terminate flag and does kill(worker, SIGTERM). >4. The worker handles SIGTERM by setting ProcDiePending and setting >its own latch. >5. The ConditionVariableTimedSleep()'s syscall returs, >CHECK_FOR_INTERRUPTS() runs and sees ProcDiePending. > >So how did it escape this fate? Ooooh. Now I'm wondering if the >problem is that ConditionVariablePrepareToSleep() contains a call to >ResetLatch(), but doesn't then run CHECK_FOR_INTERRUPTS(). Perhaps >that ate the latch that was set by die(). There isn't another >CHECK_FOR_INTERRUPTS() until *after* the WaitEventSetWait() in >ConditionVariableTimedSleep(). > >> I'll leave the system in this "stuck" space in case we need to do >> further investigation. > >What are the value sof InterruptHoldoffCount, CritSectionCount, >InterruptPending and ProcDiePending in the stuck workers? In both processes it's like this: (gdb) p InterruptHoldoffCount $1 = 0 (gdb) p CritSectionCount $2 = 0 (gdb) p InterruptPending $3 = 1 (gdb) p ProcDiePending $5 = 1 So yeah, that probably means the theory about CHECK_FOR_INTERRUPTS call missing right after latch reset is correct. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Dec 18, 2019 at 12:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > So yeah, that probably means the theory about CHECK_FOR_INTERRUPTS call > missing right after latch reset is correct. I wonder if it'd be better to remove that ResetLatch() (what is it good for?) or add CFI().
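To make the suspected lost wakeup concrete, here is a tiny, self-contained toy model of the sequence described above. It is deliberately not the real latch or condition-variable code: latch_is_set and die_pending merely stand in for the worker's latch and ProcDiePending, and the "fix" branch models adding a CHECK_FOR_INTERRUPTS() right after the reset.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for MyLatch and ProcDiePending; a model of the race being
 * discussed, not the actual PostgreSQL latch machinery. */
static bool latch_is_set;
static bool die_pending;

/* What die() does, conceptually: flag the pending exit and set the latch,
 * which is the sleeping worker's only wakeup. */
static void
toy_sigterm(void)
{
    die_pending = true;
    latch_is_set = true;
}

/* The shape of ConditionVariablePrepareToSleep() in question: it resets the
 * latch before the caller's predicate loop starts waiting. */
static void
toy_prepare_to_sleep(bool check_interrupts_after_reset)
{
    latch_is_set = false;                   /* ResetLatch() */

    if (check_interrupts_after_reset && die_pending)
        printf("interrupt seen right after reset: worker aborts promptly\n");
}

int
main(void)
{
    /* SIGTERM lands just before the worker prepares to sleep ... */
    toy_sigterm();
    toy_prepare_to_sleep(false);
    if (!latch_is_set && die_pending)
        printf("stuck: die is pending but the only wakeup was just reset\n");

    /* ... versus the proposed extra check after the reset. */
    toy_sigterm();
    toy_prepare_to_sleep(true);
    return 0;
}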
On Sat, Dec 14, 2019 at 1:22 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > >On 2019-Dec-10, Tomas Vondra wrote: > > >> It's probably still a good idea to do this, although I wonder if we > > >> might start doing this only when actually running out of bits (and use > > >> the current algorithm by default). But I have no idea if that would be > > >> any cheaper, and it would add complexity.

> - *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
> + *batchno = rotate(hashvalue, hashtable->log2_nbuckets) & (nbatch - 1);

I ran the 150MB 4096 batch "sevenb" self-join with the "rotate" patch, and it worked as expected. I'm now planning to commit that version, unless there are objections or someone wants to argue for a different way to spell rotate() etc. If I disassemble and diff unpatched and patched ExecHashGetBucketAndBatch() and functions that inline it, as compiled by Clang 6, I see just eg:

- 0x00000000006ba54e <+30>: shr %cl,%esi
+ 0x00000000006ba54e <+30>: ror %cl,%esi

Here's a festive pictorial representation of the various dead horses flogged in this thread, for the holidays (view in monospace font):

The "shift" algorithm in (pre-existing coding):

  <----------batchno---|  |---bucketno---|
  [MSB . . . . . . . . . . . . LSB]

  * can't increase nbuckets (or max bucket load would degrade too soon)
  * reads zeroes off the end, so further partitioning is broken

The "bit reverse" technique proposed earlier:

  |---batchno----------->  <---bucketno---|
  [MSB . . . . . . . . . . . . LSB]

  * could increase nbuckets: bucket load degradation is maximally deferred
  * takes a few more instructions to compute batchno

The "rotate" technique:

  ------batchno---|  <---  |---bucketno---|
  [MSB . . . . . . . . . . . . LSB]

  * same as "shift" until you run out of bits, then bucket load degrades
  * can't increase nbuckets
  * only changes one machine code instruction

The "swap ends" technique:

  <----------batchno---|  |---bucketno---|
  [MSB . . . . . . . . . . . . LSB]

  * really just a rotated version of "rotate" technique
  * meh
Thomas Munro <thomas.munro@gmail.com> writes: > I ran the 150MB 4096 batch "sevenb" self-join with the "rotate" patch, > and it worked as expected. I'm now planning to commit that version, > unless there are objections or someone wants to argue for a different > way to spell rotate() etc. FWIW, I do want to argue for a different way to spell that. It seems like a mighty generic function name --- in particular, there's no hint of the word width we're rotating in. Maybe "rotate32" would be good enough, though I'm tempted to propose "pg_rotate32" or "rotate_bit32" or some other variation on that. regards, tom lane
On Sat, Dec 21, 2019 at 6:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > I ran the 150MB 4096 batch "sevenb" self-join with the "rotate" patch, > > and it worked as expected. I'm now planning to commit that version, > > unless there are objections or someone wants to argue for a different > > way to spell rotate() etc. > > FWIW, I do want to argue for a different way to spell that. It seems > like a mighty generic function name --- in particular, there's no > hint of the word width we're rotating in. Maybe "rotate32" would be > good enough, though I'm tempted to propose "pg_rotate32" or "rotate_bit32" > or some other variation on that. Yeah, I suppose it would be nice to get something reusable out of this rather than a local definition only. The other important characteristic is the direction of rotation, so here's a version that defines pg_rotate_right32() in pg_bitutils.h. In back-branches without that header we could define it locally.
Thomas Munro <thomas.munro@gmail.com> writes: > Yeah, I suppose it would be nice to get something reusable out of this > rather than a local definition only. The other important > characteristic is the direction of rotation, so here's a version that > defines pg_rotate_right32() in pg_bitutils.h. In back-branches > without that header we could define it locally. That looks like a good approach. One thought is that you could rely on BITS_PER_BYTE from c.h, rather than CHAR_BIT; then you don't have to #include <limits.h> in bitutils.h, which seems like header creep we could do without. (I think most other places that want this constant use BITS_PER_BYTE, too.) regards, tom lane
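For reference, a sketch of what the helper could end up looking like with that suggestion applied; this is an approximation of the shape being discussed for pg_bitutils.h, not a quote of the committed code:

/* Sketch of a pg_bitutils.h helper: rotate the bits of "word" right by n.
 * uint32 and BITS_PER_BYTE come from c.h, so no include of <limits.h> is
 * needed. */
static inline uint32
pg_rotate_right32(uint32 word, int n)
{
    return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
}

The hash join call site would then presumably read *batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) & (nbatch - 1), and, as noted upthread, back branches without pg_bitutils.h could carry the same static inline definition locally.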
On Sun, Dec 22, 2019 at 4:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > Yeah, I suppose it would be nice to get something reusable out of this > > rather than a local definition only. The other important > > characteristic is the direction of rotation, so here's a version that > > defines pg_rotate_right32() in pg_bitutils.h. In back-branches > > without that header we could define it locally. > > That looks like a good approach. One thought is that you could rely > on BITS_PER_BYTE from c.h, rather than CHAR_BIT; then you don't have > to #include <limits.h> in bitutils.h, which seems like header creep > we could do without. (I think most other places that want this > constant use BITS_PER_BYTE, too.) Done, and pushed. I'm hoping to look into wider hashing support which will give better results on big joins soon, but not in time for 13.