Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers
From:            Thomas Munro
Subject:         Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date:
Msg-id:          CA+hUKGJjs6H77u+PL3ovMSowFZ8nib9Z+nHGNF6YNmw6osUU+A@mail.gmail.com
In response to:  Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Melanie Plageman <melanieplageman@gmail.com>)
Responses:       Re: Avoiding hash join batch explosions with extreme skew and weird stats
                 Re: Avoiding hash join batch explosions with extreme skew and weird stats
List:            pgsql-hackers
On Mon, Dec 30, 2019 at 4:34 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> So, I finally have a prototype to share of parallel hashloop fallback.

Hi Melanie,

Thanks for all your continued work on this!  I started looking at it today; it's a difficult project and I think it'll take me a while to grok.  I do have some early comments though:

* I am uneasy about BarrierArriveExplicitAndWait() (a variant of BarrierArriveAndWait() that lets you skip directly to a given phase?); perhaps you only needed that for a circular phase system, which you could do with modular phase numbers, like PHJ_GROW_BATCHES_PHASE?  I tried to make the barrier interfaces look like the libraries in other parallel programming environments, and I'd be worried that the explicit phase thing could easily lead to bugs.

* It seems a bit strange to have "outer_match_status_file" in SharedTupleStore; something's gone awry layering-wise there.

* I'm not sure it's OK to wait at the end of each loop, as described in the commit message:

    Workers probing a fallback batch will wait until all workers have
    finished probing before moving on so that an elected worker can read
    and combine the outer match status files into a single bitmap and use
    it to emit unmatched outer tuples after all chunks of the inner side
    have been processed.

Maybe I misunderstood completely, but that seems to break the programming rule described in nodeHashjoin.c's comment beginning "To avoid deadlocks, ...".  To recap: (1) when you emit a tuple, the program counter escapes to some other node, and maybe that other node waits for thee; (2) maybe the leader is waiting for you, but you're waiting for it to drain its queue so you can emit a tuple (I learned a proper name for this: "flow control deadlock").  That's why the current code only ever detaches (a non-waiting operation) after it's begun emitting tuples (that is, in the probing phase); it just moves on to another batch.  That's not a solution here: you can't simply move to another loop, because loops are not independent of each other the way batches are.  It's possible that barriers are not the right tool for this part of the problem, or that there is a way to use a barrier that you don't remain attached to while emitting, or that we should remove the deadlock risks another way entirely [1], but I'm not sure.  Furthermore, the new code in ExecParallelHashJoinNewBatch() appears to break the rule even in the non-looping case: it calls BarrierArriveAndWait() in ExecParallelHashJoinNewBatch(), where the existing code just detaches.

> This patch does contain refactoring of nodeHashjoin.
>
> I have split the Parallel HashJoin and Serial HashJoin state machines
> up, as they were diverging in my patch to a point that made for a
> really cluttered ExecHashJoinImpl() (ExecHashJoinImpl() is now gone).

Hmm.  I'm rather keen on extending that technique further: I'd like there to be more configuration points in the form of parameters to that function, so that we write the algorithm just once but generate a bunch of specialised variants that are the best possible machine code for each combination of parameters, via constant-folding using the "always inline" trick (steampunk C++ function templates).  My motivations for wanting to do that are: supporting different hash sizes (CF commit e69d6445), removing branches for unused optimisations (eg skew), and inlining common hash functions.
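To make that concrete, here is a toy sketch of the trick as standalone C, with invented names (hash_join_impl, use_skew, etc.); it's not the real executor code, just the shape I have in mind, i.e. what ExecHashJoinImpl() with its single "parallel" parameter already gives us, extended with more parameters:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy illustration only -- not nodeHashjoin.c.  The shared implementation
 * takes its variation points as parameters; the always-inline attribute plus
 * constant arguments in each wrapper let the compiler fold away the branches
 * that a given specialisation doesn't use.
 */
static inline __attribute__((always_inline)) unsigned
hash_join_impl(unsigned key, bool parallel, bool use_skew)
{
    unsigned h = key * 2654435761u;   /* stand-in for the real hash step */

    if (use_skew)
        h ^= 0x9e3779b9u;             /* stand-in for the skew-bucket path */
    if (parallel)
        h += 1;                       /* stand-in for shared-memory coordination */

    return h & 1023;                  /* bucket number */
}

/* Specialised entry points, in the spirit of ExecHashJoin()/ExecParallelHashJoin(). */
static unsigned hash_join_serial(unsigned key)   { return hash_join_impl(key, false, false); }
static unsigned hash_join_parallel(unsigned key) { return hash_join_impl(key, true, false); }

int
main(void)
{
    printf("%u %u\n", hash_join_serial(42), hash_join_parallel(42));
    return 0;
}

The point being that each wrapper compiles to a copy with the use_skew/parallel tests constant-folded away, so adding more such parameters should cost nothing at runtime, only in the number of generated specialisations.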
That isn't to say we couldn't have two different templatoid functions from which many others are specialised, but I feel like that's going to lead to a lot of duplication.

> The reason I didn't do this refactoring in one patch and then put the
> adaptive hashjoin code on top of it is that I might like to make
> Parallel HashJoin and Serial HashJoin different nodes.
>
> I think that has been discussed elsewhere and was looking to
> understand the rationale for keeping them in the same node.

Well, there is a discussion about getting rid of the Hash node, since it's so tightly coupled with Hash Join that it might as well not exist as a separate entity.  (Incidentally, I noticed in someone's blog that MySQL now shows Hash separately in its PostgreSQL-style EXPLAIN output; now we'll remove it, CF the Dr Seuss story about the Sneetches.)  But as for Parallel Hash Join vs [Serial] Hash Join, I think it makes sense to use the same node because they are substantially the same thing, with optional extra magic, and I think it's our job to figure out how to write code in a style that makes the differences maintainable.  That fits into a general pattern that "Parallel" is a mode, not a different node.  On the other hand, PHJ is by far the most different from the original code, compared to things like Parallel Sequential Scan etc.

FWIW I think we're probably in relatively new territory here: as far as I know, other traditional RDBMSs didn't really seem to have a concept like parallel-aware executor nodes, because they tended to be based on partitioning, so that the operators are all oblivious to parallelism and don't have to share/coordinate anything at this level.  It seems that everyone is now coming around to the view that shared hash table hash joins are a good idea now that we have so many cores connected up to shared memory.  Curiously, judging from another blog article I saw, on the surface it looks like Oracle's brand new HASH JOIN SHARED is a different operator than HASH JOIN (just an observation; I could be way off, and I don't know or want to know how that's done under the covers in that system).

> - number of batches is not deterministic from run-to-run

Yeah, I had a lot of fun with that sort of thing on the build farm when PHJ was first committed, and the effects were different on systems I don't have access to that have different sizeof() for certain types.

> - Rename "chunk" (as in chunks of inner side) to something that is
>   not already used in the context of memory chunks and, more
>   importantly, SharedTuplestoreChunk

+1.  Fragments?  Loops?  Blocks (from https://en.wikipedia.org/wiki/Block_nested_loop, though, no, strike that, blocks are also super overloaded)?

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com