Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs

From: James Coleman
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Msg-id: CAAaqYe_CKJg6F8TK6Q5c7WjOtFQ3cQ5DeZ=grAYxVU-B8FKuTg@mail.gmail.com
In response to: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-bugs
On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote:
> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> > > >ERROR: invalid DSA memory alloc request size 1375731712
> > > >
> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
> > >
> > > I've briefly looked at this today, and I think the root cause is
> > > somewhat similar to what is described in [1], where we simply increase
> > > the number of batches in an effort to keep batch contents in work_mem,
> > > while ignoring that each batch requires quite a bit of memory itself. So
> > > we end up with a lot of batches, each small enough to fit into
> > > work_mem, but we need much more than work_mem to track the batches.
>
> Yeah. So even when this is fixed, the query is going to perform
> *terribly*, opening and closing millions of files in random order to
> stream tuples into, if this is a case where there really are tuples to
> go to all partitions (and not just a case of extreme skew that our
> extreme skew detector fails to detect, because it only detects absolute
> extreme skew).

work_mem in the repro case is 500MB (the original failure was at
150MB). I realize that's too small for this query, though it's also
worth knowing that if I get rid of some other cluster-wide tunings
that shouldn't have been cluster-wide originally (modifications to
cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
an index scan with no hash at all (and performs much better).

I think this also correlates with the ~TB spike in disk usage we saw,
so your explanation of the many "small" files would seem to be
consistent with that.
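
For what it's worth, the numbers in the error line up with that
explanation. Here's a quick standalone check I did (my own sketch: the
per-batch figure is just the division implied by the error message and
backtrace, not a claim about the actual struct layout, and MaxAllocSize
is copied from src/include/utils/memutils.h):

    /* Standalone sanity check of the numbers in the backtrace. */
    #include <stdio.h>

    #define MaxAllocSize ((size_t) 0x3fffffff)  /* 1 gigabyte - 1, per memutils.h */

    int
    main(void)
    {
        size_t      nbatch = 2097152;       /* from the backtrace */
        size_t      request = 1375731712;   /* from the error message */

        /* implied per-batch bookkeeping overhead */
        printf("bytes per batch: %zu\n", request / nbatch);    /* 656 */

        /* and the request is indeed past the allocator's limit */
        printf("exceeds MaxAllocSize: %s\n",
               request > MaxAllocSize ? "yes" : "no");         /* yes */
        return 0;
    }

So that's roughly 656 bytes of shared bookkeeping per batch, and at
2097152 batches the batch array alone needs ~1.3GB, past the allocator
cap, which is exactly the error at the top of the thread.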

> > > This seems to be about the same problem, except that instead of
> > > forgetting about BufFile, the parallel hash join ignores this:
> > >
> > >     pstate->batches =
> > >         dsa_allocate0(hashtable->area,
> > >                       EstimateParallelHashJoinBatch(hashtable) * nbatch);
>
> Yeah, I failed to consider that possibility. I suppose it could be
> avoided with something like this (not tested, I will find a repro for
> this on Monday to convince myself that it's right):
>
> @@ -1246,7 +1246,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
>                                 }
>
>                                 /* Don't keep growing if it's not helping or we'd overflow. */
> -                               if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
> +                               if (extreme_skew_detected ||
> +                                   hashtable->nbatch >= INT_MAX / 2 ||
> +                                   !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
> +                                                     hashtable->nbatch * 2))
>                                         pstate->growth = PHJ_GROWTH_DISABLED;
>                                 else if (space_exhausted)
>                                         pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
>
> But James's query is still going to be terrible.
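
Out of curiosity I simulated that check with the per-batch size implied
by my backtrace (again just a sketch, assuming ~656 bytes per batch,
the usual doubling of nbatch, and the memutils.h definitions inlined):

    /* Simulate the proposed guard against the doubling growth strategy. */
    #include <stdio.h>

    #define MaxAllocSize ((size_t) 0x3fffffff)
    #define AllocSizeIsValid(size) ((size_t) (size) <= MaxAllocSize)

    int
    main(void)
    {
        size_t      per_batch = 656;    /* implied by my backtrace */
        size_t      nbatch = 32;        /* some small starting point */

        /* keep doubling only while the next request would still be valid */
        while (AllocSizeIsValid(per_batch * nbatch * 2))
            nbatch *= 2;

        printf("growth stops at nbatch = %zu\n", nbatch);  /* 1048576 */
        printf("next request would have been %zu bytes\n",
               per_batch * nbatch * 2);                    /* 1375731712 */
        return 0;
    }

With those numbers the doubling stops at 1048576 batches, because the
next step would request exactly the 1375731712 bytes from my error
message, so the guard would have prevented this particular failure.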

>
> Do you know if it's extreme skew (many tuples with the same key, just
> a few scattered around in other keys), or simply too much data for
> your work_mem setting?

Given my description earlier (seq scan on a very large table), I
assume it's likely the latter? If you think that's sufficiently
likely, I'll leave it at that; if not, I could do a calculation on
that key to see how distributed it is.

James