Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date
Msg-id 20191109182735.ewficcv2hibbyzl4@development
In response to Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash  (James Coleman <jtc331@gmail.com>)
Responses Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
List pgsql-bugs
On Sat, Nov 09, 2019 at 08:23:23AM -0500, James Coleman wrote:
>Just realized I accidentally replied only to Tomas.
>
>On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote:
>> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> >>
>> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote:
>> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
>> >> > > >ERROR:  invalid DSA memory alloc request size 1375731712
>> >>
>> >> > > >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
>> >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
>> >>
>> >> > > I've briefly looked at this today, and I think the root cause is
>> >> > > somewhat similar to what is described in [1] where we simply increase
>> >> > > the number of batches in an effort to keep batch contents in work_mem,
>> >> > > but ignoring that each batch requires quite a bit of memory. So we end
>> >> > > up with a lot of batches where each is small enough to fit into
>> >> > > work_mem, but we need much more than work_mem to track the batches.
>> >>
>> >> Yeah.  So even when this is fixed, the query is going to perform
>> >> *terribly*, opening and closing millions of files in random order to
>> >> stream tuples into, if this is a case where there really are tuples to
>> >> go to all partitions (and not just a case of extreme skew that our
>> >> extreme skew detector fails to detect because it only detects absolute
>> >> extreme skew).
>> >
>> >work_mem in the repro case is 500MB (the original failure was at
>> >150MB). I realize that's too small for this query, though it's also
>> >worth knowing that if I get rid of some other cluster-wide tunings
>> >that shouldn't have been cluster-wide originally (modifications to
>> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
>> >an index scan and no hash (and performs much better).
>> >
>>
>> So is it a case of underestimate? I.e. does the Hash side expect much
>> less data (rows) than it gets during execution?
>
>I've attached a redacted plan, but the estimate with workers planned =
>6 is 986,975,281, which seems quite reasonable to me, as the current
>table count is 5,917,539,491.
>

Hmmm, but the expected row width is only 16B, and with ~6B rows that's
only about 90GB. So how come this needs 1TB of temporary files? I'm sure
there's a bit of overhead, but 10x seems a bit much.
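
(Spelling out the arithmetic, with the row count from the attached plan:

    5,917,539,491 rows * 16 B/row ~= 94.7 GB

which is where the ~90GB figure comes from.)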

>> >I think this also correlates with us seeing ~TB spike in disk usage,
>> >so your explanation of the lots of "small" files would seem to be
>> >consistent with that.
>>
>> That's consistent with the data. 500MB and nbatch=2097152 is exactly
>> 1TB, and there'll be some additional overhead.
>
>Good, that helps make sense of that piece of the puzzle.
>
>> >> > > This seems to be about the same problem, except that instead of
>> >> > > forgetting about BufFile, the parallel hash join ignores this:
>> >> > >
>> >> > >    pstate->batches =
>> >> > >      dsa_allocate0(hashtable->area,
>> >> > >                    EstimateParallelHashJoinBatch(hashtable) * nbatch);
>> >>
>> >> Yeah, I failed to consider that possibility.  I suppose it could be
>> >> avoided with something like this (not tested, I will find a repro for
>> >> this on Monday to convince myself that it's right):
>> >>
>> >> @@ -1246,7 +1246,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
>> >>                                 }
>> >>
>> >>                                 /* Don't keep growing if it's not helping or we'd overflow. */
>> >> -                               if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
>> >> +                               if (extreme_skew_detected ||
>> >> +                                   hashtable->nbatch >= INT_MAX / 2 ||
>> >> +                                   !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
>> >> +                                                     hashtable->nbatch * 2))
>> >>                                         pstate->growth = PHJ_GROWTH_DISABLED;
>> >>                                 else if (space_exhausted)
>> >>                                         pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
>> >>
>>
>> Probably, but that's the execution part of it. I think we should also
>> consider this in ExecChooseHashTableSize, and just don't do PHJ when
>> it exceeds work_mem from the very beginning.
>>
>> Once we start executing we probably can't do much better than what you
>> proposed. In particular it doesn't make sense to limit the space by
>> work_mem, unless we also tweak that limit because then the batch size
>> increases arbitrarily.
>>
>> I think we need to do something about this in PG13 - both while planning
>> (considering BufFile and SharedTuplestore), and during execution. The
>> planner part seems fairly simple and independent, and I might have said
>> before I'll look into it.
>>
>> For the executor I think we've agreed the "proper" solution is BNL or
>> something like that. Not sure how far we are from that, though; I
>> don't recall any recent updates (but maybe I just missed that, the
>> pgsql-hackers traffic is pretty insane). I wonder if we should get
>> something like the "rebalancing" I proposed before, which is not a 100%
>> fix but may at least reduce the negative impact.
>
>Do you happen have a link to those discussions? I'd be interested in
>following along. I also can't say I know what BNL stands for.
>

I think these are the two main threads:

1) accounting for memory used for BufFile during hash joins
https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development

2) Avoiding hash join batch explosions with extreme skew and weird stats
https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

BNL means "block nested loop" - we split the inner relation into
blocks and then do a nested loop for (some of) them. It works as a
fallback strategy for hash join: when a batch is too large to fit into
work_mem, we can split it into blocks and do a BNL against the outer
relation.
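
To illustrate the idea, here's a toy standalone sketch in plain C - it
has nothing to do with the actual executor code, and BLOCK_SIZE /
HASH_SLOTS are made-up constants standing in for work_mem and the
bucket array:

    /* bnl_sketch.c - block nested loop over a single oversized "batch" */
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4        /* pretend only 4 inner tuples fit in memory */
    #define HASH_SLOTS 8

    int main(void)
    {
        int inner[] = {1, 3, 5, 7, 9, 11, 3, 5, 13, 7}; /* inner batch (too big) */
        int outer[] = {2, 3, 5, 8, 9, 13};              /* outer batch */
        int n_inner = sizeof(inner) / sizeof(inner[0]);
        int n_outer = sizeof(outer) / sizeof(outer[0]);

        /* One pass over the outer side per inner block. */
        for (int start = 0; start < n_inner; start += BLOCK_SIZE)
        {
            int end = (start + BLOCK_SIZE < n_inner) ? start + BLOCK_SIZE : n_inner;
            int table[HASH_SLOTS][BLOCK_SIZE];
            int nkeys[HASH_SLOTS];

            memset(nkeys, 0, sizeof(nkeys));

            /* Build a small hash table over just this block of the inner side. */
            for (int i = start; i < end; i++)
            {
                int bucket = inner[i] % HASH_SLOTS;

                table[bucket][nkeys[bucket]++] = inner[i];
            }

            /* Probe it with every outer tuple. */
            for (int j = 0; j < n_outer; j++)
            {
                int bucket = outer[j] % HASH_SLOTS;

                for (int k = 0; k < nkeys[bucket]; k++)
                    if (table[bucket][k] == outer[j])
                        printf("match: %d\n", outer[j]);
            }
        }

        return 0;
    }

The trade-off is that the outer side has to be scanned once per block,
but the memory needed for the hash table stays bounded no matter how
badly the inner side is skewed or underestimated.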

>> >> But James's query is still going to be terrible.
>> >>
>> >> Do you know if it's extreme skew (many tuples with the same key, just
>> >> a few scattered around in other keys), or simply too much data for
>> >> your work_mem setting?
>> >
>> >Given my description earlier (seq scan on a very large table), I
>> >assume it's likely the latter? If you think that's sufficiently
>> >likely, I'll leave it at that, or if not I could do a calculation on
>> >that key to see how distributed it is.
>> >
>>
>> Depends on where exactly the seq scan is - is it under the Hash? If yes,
>> how come we even pick hash join in this case? Hard to say without seeing
>> the query plan.
>
>As noted earlier, I've attached a redacted plan here.
>
>The seq scan is under the parallel hash. Note that also as hinted at
>earlier, this happens with cpu_tuple_cost and cpu_operator_cost both
>set to 0.5, which has been a long-standing tweak on this cluster to
>affect some plans that are otherwise painfully bad, but really should
>be targeted at specific queries. With those reset, everything stays
>the same except that the hash join turns into a nested loop, and
>instead of a hash on the inner side of the join there's an index scan
>(the join key is indexed on that table).
>

I wonder what the bad plans look like and why this fixes them, but it
seems like a separate issue.

>Because of the modified cpu_*_cost gucs we have two ways around this
>specific query plan: either reset those gucs or set
>enable_parallel_hash = off. But even if it means poor performance, it
>seems to me that we wouldn't want to fail the query. I can confirm
>that it is in fact capable of completing with this plan even with
>work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would
>have been some level of unsurprising in this batch system, and also
>much closer to "normal" for the non-parallel plan we used to have in
>9.6. We were able to see this result on replicas while trying to find
>out exactly how to reproduce the error (it seems sometimes it was
>right under the boundary needed to raise the error).
>

It's certainly true that completing a query is preferable to failing.
But it depends on when we can identify the problem - the later we
realize it, the harder it is to fix. If we notice it during planning,
we can simply disable the hash join, which forces a different join
method (and that may actually be quite fast). Once we start executing,
it's much harder.
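
FWIW, to make the failure mode concrete: the reported request size of
1375731712 bytes divided by nbatch=2097152 works out to 656 bytes of
shared metadata per batch, and dsa_allocate() (without DSA_ALLOC_HUGE)
rejects anything larger than MaxAllocSize (0x3fffffff). A standalone
back-of-the-envelope (not PostgreSQL code) showing where the doubling
crosses that limit:

    #include <stdio.h>
    #include <stdint.h>

    #define PER_BATCH_BYTES 656         /* 1375731712 / 2097152, from the report */
    #define MAX_ALLOC 0x3fffffffULL     /* MaxAllocSize, just under 1GB */

    int main(void)
    {
        for (uint64_t nbatch = 262144; nbatch <= 4194304; nbatch *= 2)
        {
            uint64_t request = (uint64_t) PER_BATCH_BYTES * nbatch;

            printf("nbatch = %7llu -> batch array = %10llu bytes%s\n",
                   (unsigned long long) nbatch,
                   (unsigned long long) request,
                   request > MAX_ALLOC ? "  <-- fails AllocSizeIsValid()" : "");
        }

        return 0;
    }

So with ~656 bytes per batch the doubling has to stop at nbatch=1048576
at the latest, which is what the AllocSizeIsValid() check in Thomas's
sketch above would do - the question is how much earlier the planner
could reasonably have given up on the parallel hash join.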

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


