Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: Avoiding hash join batch explosions with extreme skew and weird stats
Msg-id:         20190520143152.77nrjewyd3mbsbyj@development
In response to: Re: Avoiding hash join batch explosions with extreme skew and weird stats (Thomas Munro <thomas.munro@gmail.com>)
List:           pgsql-hackers
On Mon, May 20, 2019 at 01:25:52PM +1200, Thomas Munro wrote:
>On Mon, May 20, 2019 at 12:22 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:
>> >First let me restate the PostgreSQL terminology for this stuff so I
>> >don't get confused while talking about it:
>> >
>> >* The inner side of the join = the right side = the side we use to
>> >build a hash table. Right and full joins emit inner tuples when there
>> >is no matching tuple on the outer side.
>> >
>> >* The outer side of the join = the left side = the side we use to
>> >probe the hash table. Left and full joins emit outer tuples when
>> >there is no matching tuple on the inner side.
>> >
>> >* Semi and anti joins emit exactly one instance of each outer tuple if
>> >there is/isn't at least one match on the inner side.
>> >
>>
>> I think you're conflating inner/outer side and left/right, or rather
>> assuming it's always left=inner and right=outer.
>
>In PostgreSQL, it's always inner = right, outer = left. You can see
>that reflected in plannodes.h and elsewhere:
>
>/* ----------------
> *      these are defined to avoid confusion problems with "left"
> *      and "right" and "inner" and "outer".  The convention is that
> *      the "left" plan is the "outer" plan and the "right" plan is
> *      the inner plan, but these make the code more readable.
> * ----------------
> */
>#define innerPlan(node)    (((Plan *)(node))->righttree)
>#define outerPlan(node)    (((Plan *)(node))->lefttree)
>
>I'm not sure you think it's not always like that: are you referring to
>the fact that the planner can choose to reverse the join (compared to
>the SQL LEFT|RIGHT JOIN that appeared in the query), creating an extra
>layer of confusion? In my email I was talking only about left and
>right as seen by the executor.
>

It might be my lack of understanding, but I'm not sure how we map
LEFT/RIGHT JOIN to left/righttree and inner/outer at the plan level. My
assumption was that for "a LEFT JOIN b", both "a" and "b" can end up as
either the inner or the outer (sub)tree. But I haven't checked, so I may
easily be wrong. Maybe the comment you quoted clarifies that, not sure.

>> >About the question of when exactly to set the "use_NLJ" flag: I had
>> >originally been thinking of this only as a way to deal with the
>> >extreme skew problem. But in light of Tomas's complaints about
>> >unmetered per-batch memory overheads, I had a new thought: it should
>> >also be triggered whenever doubling the number of batches would halve
>> >the amount of memory left for the hash table (after including the size
>> >of all those BufFile objects in the computation as Tomas proposes). I
>> >think that might be exactly the right cut-off if you want to do
>> >as much Grace partitioning as your work_mem can afford, and therefore
>> >as little looping as possible to complete the join while respecting
>> >work_mem.
>> >
>>
>> Not sure what NLJ flag rule you propose, exactly.
>>
>> Regarding the threshold value - once the space for BufFiles (and other
>> overhead) gets over work_mem/2, it does not make any sense to increase
>> the number of batches because then the work_mem would be entirely
>> occupied by BufFiles.
>>
>> The WIP patches don't actually do exactly that though - they just check
>> if the incremented size would be over work_mem/2. I think we should
>> instead allow up to work_mem*2/3, i.e. stop adding batches after the
>> BufFiles start consuming more than work_mem/3 memory.
>>
>> I think that's actually what you mean by "halving the amount of memory
>> left for the hash table" because that's what happens after reaching the
>> work_mem/3.
>
>Well, instead of an arbitrary number like work_mem/2 or work_mem *
>2/3, I was trying to figure out the precise threshold beyond which it
>doesn't make sense to expend more memory on BufFile objects, even if
>the keys are uniformly distributed so that splitting batches halves
>the expected tuple count per batch. Let work_mem_for_hash_table =
>work_mem - nbatch * sizeof(BufFile). Whenever you increase nbatch,
>work_mem_for_hash_table goes down, but it had better be more than half
>what it was before, or we expect to run out of memory again (if the
>batch didn't fit before, and we're now splitting it so that we'll try
>to load only half of it, we'd better have more than half the budget
>for the hash table than we had before). Otherwise you'd be making
>matters worse, and this process probably won't terminate.
>

But the work_mem/3 does exactly that. Let's say BufFiles need a bit
less than work_mem/3. That means we have a bit more than 2*work_mem/3
for the hash table. If you double the number of batches, then you'll
end up with a bit more than work_mem/3 for the hash table. That is,
we've not halved the hash table size. If BufFiles need a bit more
memory than work_mem/3, then after doubling the number of batches we'll
end up with less than half the initial hash table space. So I think
work_mem/3 is the threshold we're looking for.

>> But I think that rule is irrelevant here, really, because this thread
>> was discussing cases where adding batches is futile due to skew, no? In
>> which case we should stop adding batches after reaching some % of tuples
>> not moving from the batch.
>
>Yeah, this thread started off just about the 95% thing, but veered off
>course since these topics are tangled up. Sorry.
>
>> Or are you suggesting we should remove that rule, and instead rely on
>> this rule about halving the hash table space? That might work too, I
>> guess.
>
>No, I suspect you need both rules. We still want to detect extreme
>skew as soon as possible, even though the other rule will eventually
>fire; might as well do it sooner in clear-cut cases.
>

Right, I agree. I think we need the 95% rule (or whatever) to handle
the cases with skew / many duplicates, and then the overflow files to
handle underestimates with uniform distribution (or some other
solution).

>> OTOH I'm not sure it's a good idea to handle both those cases the same
>> way - the "overflow file" idea works pretty well for cases where the hash
>> table actually can be split into batches, and I'm afraid NLJ will be
>> much less efficient for those cases.
>
>Yeah, you might be right about that, and everything I'm describing is
>pure vapourware anyway. But your overflow file scheme isn't exactly
>free of IO-amplification and multiple-processing of input data
>either... and I haven't yet grokked how it would work for parallel
>hash. Parallel hash generally doesn't have the
>'throw-the-tuples-forward' concept, which is inherently based on
>sequential in-order processing of batches.
>

Sure, let's do some math. With the overflow scheme, the amplification
is roughly ~2x (relative to master), because we need to write data for
most batches first into the overflow file and then to the correct one.
Master has write amplification of about ~1.25x (due to the gradual
increase of batches), so the "total" amplification is ~2.5x.
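FWIW, to make the work_mem/3 cut-off a bit more concrete, here is a rough
standalone sketch of the "is another doubling of nbatch still worth it?"
check. This is not the actual WIP patch code - BUFFILE_SIZE is just a
made-up stand-in for sizeof(BufFile) plus whatever other per-batch
overhead we decide to count:

#include <stdbool.h>
#include <stdio.h>

#define BUFFILE_SIZE    8192    /* assumed per-batch BufFile overhead */

/*
 * Doubling nbatch halves the expected tuple count per batch, so it only
 * helps if more than half of the current hash-table budget survives the
 * extra BufFiles.  Algebraically that is nbatch * BUFFILE_SIZE < work_mem/3,
 * i.e. the BufFiles may eat at most ~1/3 of work_mem.
 */
static bool
worth_doubling_nbatch(long work_mem, int nbatch)
{
    long    old_budget = work_mem - (long) nbatch * BUFFILE_SIZE;
    long    new_budget = work_mem - (long) nbatch * 2 * BUFFILE_SIZE;

    return new_budget > old_budget / 2;
}

int
main(void)
{
    long    work_mem = 4L * 1024 * 1024;    /* 4MB, the default */

    for (int nbatch = 32; nbatch <= 512; nbatch *= 2)
        printf("nbatch %4d -> %s\n", nbatch,
               worth_doubling_nbatch(work_mem, nbatch) ? "keep doubling" : "stop");

    return 0;
}

With the default 4MB work_mem this flips from "keep doubling" to "stop"
between 128 and 256 batches, i.e. right where the BufFiles approach
work_mem/3 - the same threshold as above.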
For the NLJ, the amplification fully depends on what fraction of the
hash table fits into work_mem. For example, when it needs to be split
into 32 fragments, we have ~32x amplification. It might affect just
some batches, of course.

So I still think those approaches are complementary and we need both.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services