Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs
From | Tomas Vondra
Subject | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date |
Msg-id | 20191109182735.ewficcv2hibbyzl4@development
In response to | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash (James Coleman <jtc331@gmail.com>)
Responses | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
 | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
List | pgsql-bugs
On Sat, Nov 09, 2019 at 08:23:23AM -0500, James Coleman wrote:
>Just realized I accidentally replied only to Tomas.
>
>On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote:
>> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> >>
>> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote:
>> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
>> >> > > >ERROR: invalid DSA memory alloc request size 1375731712
>> >>
>> >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
>> >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
>> >>
>> >> > > I've briefly looked at this today, and I think the root cause is
>> >> > > somewhat similar to what is described in [1] where we simply increase
>> >> > > the number of batches in an effort to keep batch contents in work_mem,
>> >> > > but ignore that each batch requires quite a bit of memory. So we end
>> >> > > up with a lot of batches where each is small enough to fit into
>> >> > > work_mem, but we need much more than work_mem to track the batches.
>> >>
>> >> Yeah. So even when this is fixed, the query is going to perform
>> >> *terribly*, opening and closing millions of files in random order to
>> >> stream tuples into, if this is a case where there really are tuples to
>> >> go to all partitions (and not just a case of extreme skew that our
>> >> extreme skew detector fails to detect because it only detects absolute
>> >> extreme skew).
>> >
>> >work_mem in the repro case is 500MB (the original failure was at
>> >150MB). I realize that's too small for this query, though it's also
>> >worth knowing that if I get rid of some other cluster-wide tunings
>> >that shouldn't have been cluster-wide originally (modifications to
>> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
>> >an index scan and no hash (and performs much better).
>> >
>>
>> So is it a case of underestimate? I.e. does the Hash side expect much
>> less data (rows) than it gets during execution?
>
>I've attached a redacted plan, but the estimate with workers planned =
>6 is 986,975,281, which seems quite reasonable to me as the current
>table count is 5,917,539,491.
>

Hmmm, but the expected row width is only 16B, and with ~6B rows that's
only about 90GB. So how come this needs 1TB temporary files? I'm sure
there's a bit of overhead, but 10X seems a bit much.

>> >I think this also correlates with us seeing ~TB spike in disk usage,
>> >so your explanation of the lots of "small" files would seem to be
>> >consistent with that.
>>
>> That's consistent with the data. 500MB and nbatch=2097152 is exactly
>> 1TB, and there'll be some additional overhead.
>
>Good, that helps make sense of that piece of the puzzle.
>
>> >> > > This seems to be about the same problem, except that instead of
>> >> > > forgetting about BufFile, the parallel hash join ignores this:
>> >> > >
>> >> > >     pstate->batches =
>> >> > >       dsa_allocate0(hashtable->area,
>> >> > >                     EstimateParallelHashJoinBatch(hashtable) * nbatch);
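
For scale, here is a tiny standalone sketch (not PostgreSQL source; the
~656 bytes of per-batch shared state is simply inferred from the reported
request size divided by nbatch) showing how that batch array blows through
the ~1GB MaxAllocSize cap on the final doubling to 2M batches:

/*
 * Standalone sketch, not PostgreSQL source.  The failed request of
 * 1375731712 bytes at nbatch=2097152 implies roughly 656 bytes of shared
 * per-batch state, so the batch array crosses MaxAllocSize (0x3fffffff,
 * just under 1GB) exactly on that final doubling, and dsa_allocate0()
 * rejects the request - which is the error in the report.
 */
#include <stdio.h>

#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)    /* PostgreSQL's MaxAllocSize */

int
main(void)
{
    size_t      per_batch = 1375731712UL / 2097152;    /* ~656 bytes, inferred */
    size_t      nbatch;

    for (nbatch = 32768; nbatch <= 2097152; nbatch *= 2)
        printf("nbatch=%8zu  batch array=%11zu bytes%s\n",
               nbatch, per_batch * nbatch,
               per_batch * nbatch > MAX_ALLOC_SIZE ? "  <- exceeds MaxAllocSize" : "");

    return 0;
}

The AllocSizeIsValid() guard proposed below would simply stop the batch
growth one step before this point instead of erroring out.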
>> >>
>> >> Yeah, I failed to consider that possibility.  I suppose it could be
>> >> avoided with something like this (not tested, I will find a repro for
>> >> this on Monday to convince myself that it's right):
>> >>
>> >> @@ -1246,7 +1246,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
>> >>                         }
>> >>
>> >>                         /* Don't keep growing if it's not helping or we'd overflow. */
>> >> -                       if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
>> >> +                       if (extreme_skew_detected ||
>> >> +                           hashtable->nbatch >= INT_MAX / 2 ||
>> >> +                           !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
>> >> +                                             hashtable->nbatch * 2))
>> >>                                 pstate->growth = PHJ_GROWTH_DISABLED;
>> >>                         else if (space_exhausted)
>> >>                                 pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
>> >>
>>
>> Probably, but that's the execution part of it. I think we should also
>> consider this in ExecChooseHashTableSize, and just not do PHJ when
>> it exceeds work_mem from the very beginning.
>>
>> Once we start executing we probably can't do much better than what you
>> proposed. In particular it doesn't make sense to limit the space by
>> work_mem, unless we also tweak that limit, because then the batch size
>> increases arbitrarily.
>>
>> I think we need to do something about this in PG13 - both while planning
>> (considering BufFile and SharedTuplestore), and during execution. The
>> planner part seems fairly simple and independent, and I might have said
>> before that I'll look into it.
>>
>> For the executor I think we've agreed the "proper" solution is BNL or
>> something like that. Not sure how far we are from that, though - I
>> don't recall any recent updates (but maybe I just missed that, the
>> pgsql-hackers traffic is pretty insane). I wonder if we should get
>> something like the "rebalancing" I proposed before, which is not a 100%
>> fix but may at least reduce the negative impact.
>
>Do you happen to have a link to those discussions? I'd be interested in
>following along. I also can't say I know what BNL stands for.
>

I think the two main threads are these:

1) accounting for memory used for BufFile during hash joins
https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development

2) Avoiding hash join batch explosions with extreme skew and weird stats
https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

BNL means "block nested loop", i.e. we split the inner relation into
blocks and then do a nested loop for (some of) them. This works as a
fallback strategy for hash join - when a batch is too large to fit into
work_mem, we can split it into blocks and do a BNL against the outer
relation.
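
As a toy illustration only (nothing like the actual patches discussed in
thread 2 above), the fallback for a single oversized batch looks roughly
like this:

/*
 * Toy sketch of the BNL idea, not PostgreSQL code: instead of building a
 * hash table for an inner batch that no longer fits in work_mem, consume
 * it one memory-sized block at a time and rescan the outer side of that
 * batch once per block.
 */
#include <stdio.h>

#define BLOCK_SIZE 3        /* pretend work_mem holds 3 inner tuples */

int
main(void)
{
    int     inner[] = {1, 2, 3, 4, 5, 6, 7};    /* oversized inner batch */
    int     outer[] = {2, 4, 6, 8};             /* outer side of the same batch */
    int     ninner = 7;
    int     nouter = 4;

    for (int start = 0; start < ninner; start += BLOCK_SIZE)
    {
        int     end = (start + BLOCK_SIZE < ninner) ? start + BLOCK_SIZE : ninner;

        /* one full pass over the outer side per inner block */
        for (int o = 0; o < nouter; o++)
            for (int i = start; i < end; i++)
                if (outer[o] == inner[i])
                    printf("join match on key %d\n", inner[i]);
    }

    return 0;
}

The price is an extra pass over the outer side of the batch for every
block, which is why it only becomes attractive once the alternative is an
explosion in the number of batches.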
>> >> But James's query is still going to be terrible.
>> >>
>> >> Do you know if it's extreme skew (many tuples with the same key, just
>> >> a few scattered around in other keys), or simply too much data for
>> >> your work_mem setting?
>> >
>> >Given my description earlier (seq scan on a very large table), I
>> >assume it's likely the latter? If you think that's sufficiently
>> >likely, I'll leave it at that, or if not I could do a calculation on
>> >that key to see how distributed it is.
>> >
>>
>> Depends where exactly the seq scan is - is it under the Hash? If yes,
>> how come we even pick hash join in this case? Hard to say without seeing
>> the query plan.
>
>As noted earlier, I've attached a redacted plan here.
>
>The seq scan is under the parallel hash. Note also that, as hinted at
>earlier, this happens with cpu_tuple_cost and cpu_operator_cost both
>set to 0.5, which has been a long-standing tweak on this cluster to
>affect some plans that are otherwise painfully bad, but really should
>be targeted at specific queries. With those reset, everything stays
>the same except that the hash join turns into a nested loop, and
>instead of a hash on the inner side of the join there's an index scan
>(the join key is indexed on that table).
>

I wonder what the bad plans look like and why this fixes them, but it
seems like a separate issue.

>Because of the modified cpu_*_cost gucs we have two ways around this
>specific query plan: either reset those gucs or set
>enable_parallel_hash = off. But even if it means poor performance, it
>seems to me that we wouldn't want to fail the query. I can confirm
>that it is in fact capable of completing with this plan even with
>work_mem = 150MB. It's slower... but "only" by 7-8x. That timing would
>have been relatively unsurprising in this batch system, and also
>much closer to "normal" for the non-parallel plan we used to have in
>9.6. We were able to see this result on replicas while trying to find
>out exactly how to reproduce the error (it seems sometimes it was
>right under the boundary needed to raise the error).
>

It's certainly true that completing a query is preferable to failing.
It does depend on when we can identify that's the case - the later we
realize the issue, the harder it is to fix. If we notice that during
planning, we can simply disable the hash join, which then forces using
a different join method (and it may actually be quite fast). Once we
start executing, it's way harder.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services