
From: Thomas Munro
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Msg-id: CA+hUKGJjs6H77u+PL3ovMSowFZ8nib9Z+nHGNF6YNmw6osUU+A@mail.gmail.com
In response to: Re: Avoiding hash join batch explosions with extreme skew and weird stats (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
On Mon, Dec 30, 2019 at 4:34 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> So, I finally have a prototype to share of parallel hashloop fallback.

Hi Melanie,

Thanks for all your continued work on this!  I started looking at it
today; it's a difficult project and I think it'll take me a while to
grok.  I do have some early comments though:

* I am uneasy about BarrierArriveExplicitAndWait() (a variant of
BarrierArriveAndWait() that lets you skip directly to a given phase?);
perhaps you only needed that for a circular phase system, which you
could do with modular phase numbers, like PHJ_GROW_BATCHES_PHASE?  I
tried to make the barrier interfaces look like the libraries in other
parallel programming environments, and I'd be worried that the
explicit phase thing could easily lead to bugs.
* It seems a bit strange to have "outer_match_status_file" in
SharedTupleStore; something's gone awry layering-wise there.
* I'm not sure it's OK to wait at the end of each loop, as described
in the commit message:

    Workers probing a fallback batch will wait until all workers have
    finished probing before moving on so that an elected worker can read
    and combine the outer match status files into a single bitmap and use
    it to emit unmatched outer tuples after all chunks of the inner side
    have been processed.

Maybe I misunderstood completely, but that seems to break the
programming rule described in nodeHashjoin.c's comment beginning "To
avoid deadlocks, ...".  To recap: (1) When you emit a tuple, the
program counter escapes to some other node, and maybe that other node
waits for thee, (2) Maybe the leader is waiting for you but you're
waiting for it to drain its queue so you can emit a tuple (I learned a
proper name for this: "flow control deadlock").  That's why the
current code only ever detaches (a non-waiting operation) after it's
begun emitting tuples (that is, the probing phase).  It just moves
on to another batch.  That's not a solution here: you can't simply move
to another loop; loops are not independent of each other like batches.
It's possible that barriers are not the right tool for this part of
the problem, or that there is a way to use a barrier that you don't
remain attached to while emitting, or that we should remove the
deadlock risks another way entirely[1], but I'm not sure.  Furthermore,
the new code in ExecParallelHashJoinNewBatch() appears to break the
rule even in the non-looping case (it calls BarrierArriveAndWait() in
ExecParallelHashJoinNewBatch(), where the existing code just
detaches).
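
To spell out the pattern I mean (just a hand-wavy sketch of the shape,
with an invented helper name, not the real nodeHashjoin.c code): once a
worker might emit tuples for a batch, the only barrier operation it can
safely perform on that batch's barrier is a detach, because detaching
never blocks:

    /*
     * Sketch only.  BarrierDetach() never waits; it just reports whether
     * we were the last participant attached, in which case it's safe to
     * clean up because nobody can be waiting on us.
     */
    if (BarrierDetach(&batch->batch_barrier))
        release_batch_resources(batch);     /* invented helper */
    /* Either way, we're free to emit tuples or move to another batch. */

As soon as a BarrierArriveAndWait() can happen after the first tuple has
been returned, scenario (2) above becomes reachable.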

> This patch does contain refactoring of nodeHashjoin.
>
> I have split the Parallel HashJoin and Serial HashJoin state machines
> up, as they were diverging in my patch to a point that made for a
> really cluttered ExecHashJoinImpl() (ExecHashJoinImpl() is now gone).

Hmm.  I'm rather keen on extending that technique further: I'd like
there to be more configuration points in the form of parameters to
that function, so that we write the algorithm just once but we
generate a bunch of specialised variants that are the best possible
machine code for each combination of parameters via constant-folding
using the "always inline" trick (steampunk C++ function templates).
My motivations for wanting to do that are: supporting different hash
sizes (cf. commit e69d6445), removing branches for unused optimisations
(e.g. skew), and inlining common hash functions.  That isn't to say we
couldn't have two different templatoid functions from which many
others are specialised, but I feel like that's going to lead to a lot
of duplication.
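
To make that concrete, here's a toy standalone sketch (invented names,
nothing to do with the actual executor code; in-tree we'd spell the
attribute as pg_attribute_always_inline):

    #include <stdbool.h>

    /* One copy of the algorithm, parameterised by flags that callers
     * pass as compile-time constants. */
    static inline __attribute__((always_inline)) int
    join_step(int ntuples, bool parallel, bool use_skew)
    {
        int         work = ntuples;

        if (use_skew)           /* folds away when the caller passes false */
            work += 1;
        if (parallel)           /* likewise */
            work += 2;
        return work;
    }

    /* Specialised variants: the compiler constant-folds the flags and
     * deletes the dead branches, so each wrapper is straight-line code. */
    int join_step_serial(int ntuples)   { return join_step(ntuples, false, false); }
    int join_step_parallel(int ntuples) { return join_step(ntuples, true, false); }

That's what master's ExecHashJoinImpl() already does for its constant
"parallel" argument; I'm suggesting we push more parameters through the
same funnel rather than fork the whole state machine.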

> The reason I didn't do this refactoring in one patch and then put the
> adaptive hashjoin code on top of it is that I might like to make
> Parallel HashJoin and Serial HashJoin different nodes.
>
> I think that has been discussed elsewhere and was looking to
> understand the rationale for keeping them in the same node.

Well, there is a discussion about getting rid of the Hash node, since
it's so tightly coupled with Hash Join that it might as well not exist
as a separate entity.  (Incidentally, I noticed in someone's blog that
MySQL now shows Hash separately in its PostgreSQL-style EXPLAIN
output; now we'll remove it, cf. the Dr. Seuss story about the
Sneetches).  But as for Parallel Hash Join vs [Serial] Hash Join, I
think it makes sense to use the same node because they are
substantially the same thing, with optional extra magic, and I think
it's our job to figure out how to write code in a style that makes the
differences maintainable.  That fits into a general pattern that
"Parallel" is a mode, not a different node.  On the other hand, PHJ is
by far the most different from the original code, compared to things
like Parallel Sequential Scan etc.  FWIW I think we're probably in
relatively new territory here: as far as I know, other traditional
RDBMSs didn't really seem to have a concept like parallel-aware
executor nodes, because they tended to be based on partitioning, so
that the operators are all oblivious to parallelism and don't have to
share/coordinate anything at this level.  It seems that everyone is
now coming around to the view that shared hash table hash joins are a
good idea now that we have so many cores connected up to shared
memory.  Curiously, judging from another blog article I saw, on the
surface it looks like Oracle's brand new HASH JOIN SHARED is a
different operator from HASH JOIN (just an observation, I could be way
off and I don't know or want to know how that's done under the covers
in that system).

>   - number of batches is not deterministic from run-to-run

Yeah, I had a lot of fun with that sort of thing on the build farm
when PHJ was first committed, and the effects were different on
systems I don't have access to that have different sizeof() for
certain types.

>   - Rename "chunk" (as in chunks of inner side) to something that is
>     not already used in the context of memory chunks and, more
>     importantly, SharedTuplestoreChunk

+1.  Fragments?  Loops?  Blocks (from
https://en.wikipedia.org/wiki/Block_nested_loop, though, no, strike
that, blocks are also super overloaded).

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com


