Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date
Msg-id 20191109095456.csw3f5rhc364z5fg@development
In response to Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash  (James Coleman <jtc331@gmail.com>)
Responses Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
List pgsql-bugs
On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote:
>On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>
>> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote:
>> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
>> > > >ERROR:  invalid DSA memory alloc request size 1375731712
>>
>> > > >#3  0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
>> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
>>
>> > > I've briefly looked at this today, and I think the root cause is
>> > > somewhat similar to what is described in [1] where we simply increase
>> > > the number of batches in an effort to keep batch contents in work_mem,
>> > > but ignoring that each batch requires quite a bit of memory. So we end
>> > > up with a lot of batches where each is small enough to fit into
>> > > work_mem, but we need much more than work_mem to track the batches.
>>
>> Yeah.  So even when this is fixed, the query is going to perform
>> *terribly*, opening and closing millions of files in random order to
>> stream tuples into, if this is case where there really are tuples to
>> go to all partitions (and not just a case of extreme skew that our
>> extreme skew detector fails to detect because it only detects absolute
>> extreme skew).
>
>work_mem in the repro case is 500MB (the original failure was at
>150MB). I realize that's too small for this query, though it's also
>worth knowing that if I get rid of some other cluster-wide tunings
>that shouldn't have been cluster-wide original (modifications to
>cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
>an index scan and no hash (and performs much better).
>

So is it a case of underestimation? I.e. does the Hash side expect much
less data (fewer rows) than it actually receives during execution?

>I think this also correlates with us seeing ~TB spike in disk usage,
>so your explanation of the lots of "small" files would seem to be
>consistent with that.

That's consistent with the data. Spread over nbatch=2097152 batch files,
a ~1TB spike in temporary files works out to only about 512kB per file on
average - far below the 500MB work_mem - and there'll be some additional
per-file overhead on top of that.

>
>> > > This seems to be about the same problem, except that instead of
>> > > forgeting about BufFile, the parallel hash join ignores this:
>> > >
>> > >    pstate->batches =
>> > >      dsa_allocate0(hashtable->area,
>> > >                    EstimateParallelHashJoinBatch(hashtable) * nbatch);
>>
>> Yeah, I failed to consider that possibility.  I suppose it could be
>> avoided with something like this (not tested, I will find a repro for
>> this on Monday to convince myself that it's right):
>>
>> @@ -1246,7 +1246,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
>>                                 }
>>
>>                                 /* Don't keep growing if it's not helping or we'd overflow. */
>> -                               if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
>> +                               if (extreme_skew_detected ||
>> +                                       hashtable->nbatch >= INT_MAX / 2 ||
>> +                                       !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
>> +                                                         hashtable->nbatch * 2))
>>                                         pstate->growth = PHJ_GROWTH_DISABLED;
>>                                 else if (space_exhausted)
>>                                         pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
>>

Probably, but that's only the execution part of it. I think we should also
consider this in ExecChooseHashTableSize, and simply not pick PHJ when the
per-batch bookkeeping alone would exceed work_mem from the very beginning.
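
For illustration, here is a rough standalone sketch (not the actual
PostgreSQL code) of the kind of check I have in mind. MAX_ALLOC_SIZE mirrors
MaxAllocSize from memutils.h, and PER_BATCH_BYTES is just a placeholder
derived from the numbers in this report (1375731712 / 2097152 = 656 bytes
of shared state per batch):

#include <stdbool.h>
#include <stdint.h>

#define MAX_ALLOC_SIZE   ((uint64_t) 0x3fffffff)  /* MaxAllocSize: 1GB - 1 */
#define PER_BATCH_BYTES  ((uint64_t) 656)         /* assumed shared state per batch */

/* Would the dsa_allocate0() of the batch array fit both limits? */
static bool
phj_batch_array_fits(int nbatch, uint64_t work_mem_bytes)
{
    uint64_t batch_array = (uint64_t) nbatch * PER_BATCH_BYTES;

    return batch_array <= MAX_ALLOC_SIZE && batch_array <= work_mem_bytes;
}

With the numbers from this report the check fails on both counts: 2097152
batches * 656 bytes = 1375731712 bytes, which is exactly the size from the
error message, and is above both MaxAllocSize and the 500MB work_mem.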

Once we start executing, we probably can't do much better than what you
proposed. In particular, it doesn't make sense to cap this allocation at
work_mem unless we also tweak that limit somehow, because then we'd simply
stop adding batches and the size of each batch would grow arbitrarily
instead.

I think we need to do something about this in PG13 - both while planning
(considering the BufFile and SharedTuplestore overhead) and during
execution. The planner part seems fairly simple and independent, and I
think I said before that I'd look into it.
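
Just to illustrate why the planner has to worry about this, some
back-of-the-envelope arithmetic (mine, assuming a participant ends up with
an open 8kB BufFile buffer for each batch file it writes to):

#include <stdio.h>

int main(void)
{
    long long nbatch = 2097152LL;   /* number of batches from the backtrace */
    long long bufsize = 8192LL;     /* BLCKSZ-sized buffer kept by each open BufFile */

    /* prints "16 GB" - per participant, before counting the outer side */
    printf("%lld GB\n", nbatch * bufsize / (1024LL * 1024 * 1024));
    return 0;
}

So even ignoring the batch array itself, the file buffers alone dwarf any
reasonable work_mem setting once nbatch gets into the millions.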

For the executor I think we've agreed the "proper" solution is BNL (block
nested loop join) or something like that. Not sure how far we are from
that, though - I don't recall any recent updates (but maybe I just missed
them, the pgsql-hackers traffic is pretty insane). I wonder if we should do
something like the "rebalancing" I proposed before, which is not a 100% fix
but might at least reduce the negative impact.

>> But James's query is still going to be terrible.
>>
>> Do you know if it's extreme skew (many tuples with the same key, just
>> a few scattered around in other keys), or simply too much data for
>> your work_mem setting?
>
>Given my description earlier (seq scan on a very large table), I
>assume it's likely the latter? If you think that's sufficiently
>likely, I'll leave it at that, or if not I could do calculation on
>that key to see how distributed it is.
>

Depends on where exactly the seq scan is - is it under the Hash? If so,
how come we even pick a hash join in this case? Hard to say without seeing
the query plan.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


