Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Avoiding hash join batch explosions with extreme skew and weird stats
Msg-id CAAKRu_aEmMtr1p7noNwz3wqgpcmYJZh=ppgLqUafHpjU8ogrPA@mail.gmail.com
In response to Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers

On Tue, Jan 7, 2020 at 4:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> * I am uneasy about BarrierArriveExplicitAndWait() (a variant of
> BarrierArriveAndWait() that lets you skip directly to a given phase?);
> perhaps you only needed that for a circular phase system, which you
> could do with modular phase numbers, like PHJ_GROW_BATCHES_PHASE?  I
> tried to make the barrier interfaces look like the libraries in other
> parallel programming environments, and I'd be worried that the
> explicit phase thing could easily lead to bugs.

BarrierArriveExplicitAndWait() is gone now due to the refactor to
address the barrier waiting deadlock hazard (mentioned below).
 
> * It seems a bit strange to have "outer_match_status_file" in
> SharedTupleStore; something's gone awry layering-wise there.

outer_match_status_file is now out of the SharedTuplestore. Jesse
Zhang and I worked on a new API, SharedBits, for workers to
collaboratively make a bitmap and then used it for the outer match
status file and the combined bitmap file
(v4-0004-Add-SharedBits-API.patch).

The SharedBits API is modeled closely after the SharedTuplestore API.
It uses a control object in shared memory to synchronize access to
some files in a SharedFileSet and maintains some participant-specific
shared state. The big difference (other than that the files are for
bitmaps and not tuples) is that each backend writes to its file in one
phase and a single backend reads from all of the files and combines
them in another phase.
In other words, it supports parallel write but not parallel scan (and
not concurrent read/write). This could definitely be modified in the
future.
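
To give an idea of the shape, the interface looks roughly like this
(the names here are illustrative, not necessarily the exact
identifiers in v4-0004-Add-SharedBits-API.patch):

    /*
     * Approximate shape only; illustrative names, not the exact
     * identifiers in the patch.
     */
    typedef struct SharedBits SharedBits;                  /* control object in shmem */
    typedef struct SharedBitsAccessor SharedBitsAccessor;  /* backend-local state */

    /* write phase: each participant sets bits in its own file in the fileset */
    extern SharedBitsAccessor *sb_attach(SharedBits *sb, int participant,
                                         SharedFileSet *fileset);
    extern void sb_setbit(SharedBitsAccessor *accessor, uint64 bitnum);
    extern void sb_end_write(SharedBitsAccessor *accessor);

    /* combine phase: a single elected backend ORs all participants' files */
    extern void sb_combine(SharedBits *sb);
    extern bool sb_checkbit(SharedBits *sb, uint64 bitnum);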
   
Also, SharedBits uses a SharedFileSet, which uses BufFiles. This is not
the ideal API for the bitmap: the access pattern is small sequential
writes and random reads. It would also be nice to keep the fixed-size
buffer but have an API that lets us write an arbitrary number of bytes
to it in bufsize chunks without incurring additional function call
overhead.
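
To make that wish concrete, the kind of interface I have in mind is
something like this (purely hypothetical; nothing in the patchset
provides it yet):

    /*
     * Hypothetical: accept appends of arbitrary length into a fixed-size
     * buffer and flush one bufsize chunk at a time.  Flushing goes
     * through BufFile here only for simplicity; ideally the buffer would
     * go to the file directly instead of being copied again into
     * BufFile's own buffer.
     */
    typedef struct BitmapFileBuffer
    {
        BufFile    *file;
        char        buffer[BLCKSZ];     /* fixed-size staging buffer */
        size_t      used;
    } BitmapFileBuffer;

    static void
    BitmapFileAppend(BitmapFileBuffer *b, const char *data, size_t len)
    {
        while (len > 0)
        {
            size_t      n = Min(len, BLCKSZ - b->used);

            memcpy(b->buffer + b->used, data, n);
            b->used += n;
            data += n;
            len -= n;

            if (b->used == BLCKSZ)
            {
                BufFileWrite(b->file, b->buffer, b->used);
                b->used = 0;
            }
        }
    }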
 
> * I'm not sure it's OK to wait at the end of each loop, as described
> in the commit message:
>
>     Workers probing a fallback batch will wait until all workers have
>     finished probing before moving on so that an elected worker can read
>     and combine the outer match status files into a single bitmap and use
>     it to emit unmatched outer tuples after all chunks of the inner side
>     have been processed.
>
> Maybe I misunderstood completely, but that seems to break the
> programming rule described in nodeHashjoin.c's comment beginning "To
> avoid deadlocks, ...".  To recap: (1) When you emit a tuple, the
> program counter escapes to some other node, and maybe that other node
> waits for thee, (2) Maybe the leader is waiting for you but you're
> waiting for it to drain its queue so you can emit a tuple (I learned a
> proper name for this: "flow control deadlock").  That's why the
> current code only ever detaches (a non-waiting operation) after it's
> begun emitting tuples (that is, the probing phase).  It just moves
> onto another batch.  That's not a solution here: you can't simply move
> to another loop, loops are not independent of each other like batches.
> It's possible that barriers are not the right tool for this part of
> the problem, or that there is a way to use a barrier that you don't
> remain attached to while emitting, or that we should remove the
> deadlock risks another way entirely[1] but I'm not sure.  Furthermore,
> the new code in ExecParallelHashJoinNewBatch() appears to break the
> rule even in the non-looping case (it calls BarrierArriveAndWait() in
> ExecParallelHashJoinNewBatch(), where the existing code just
> detaches).


So, after a more careful reading of the parallel full hashjoin email
[1], I think I understand the ways in which I am violating the rule in
nodeHashjoin.c.
I do have some questions about the potential solutions mentioned in
that thread, but I'll pose those over there.

For adaptive hashjoin, for now, the options for addressing the barrier
wait hazard that Jesse and I came up with based on the PFHJ thread are:
- the leader doesn't participate in fallback batches (this has the
  downsides of reduced parallelism and of needing special casing when
  the leader ends up being the only worker because the other workers
  get used for something else [like autovacuum])
- use some kind of spool to avoid deadlock
- the original solution I proposed in which all workers detach from
  the batch barrier (instead of waiting)

I revisited the original solution I proposed and realized that I had
not implemented it as advertised. By reverting to the original design,
I can skirt the issue for now.

In the original solution I suggested, all workers would detach from the
batch barrier and the last to detach would combine the bitmaps. That
was not what I actually implemented (my patch had all the workers wait
on the barrier).

I've now changed the patch to actually do this, which addresses some of
the potential deadlock hazard.

The two waits causing the deadlock hazard were the wait on the chunk
barrier and the wait on the batch barrier. In order to fully address
the hazard, Jesse and I came up with the following solutions (in
v4-0003-Address-barrier-wait-deadlock-hazard.patch in the attached
patchset) for each:

chunk barrier wait:
- instead of waiting on the chunk barrier while it is not yet in its
  final state and then reusing it by jumping back to the initial state,
  initialize an array of chunk barriers, one per chunk. Workers only
  wait on a chunk barrier when it is in its final state; the last
  worker to arrive increments the chunk number, and then all workers
  detach from the chunk barrier they are attached to and select the
  next chunk barrier (see the sketch below)

Jesse brought up that there isn't a safe time to reinitialize the
chunk barrier, so reusing it doesn't seem like a good idea.
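
In rough code terms, probing a fallback batch now loops over an array
of per-chunk barriers, something like this (a simplified sketch; the
field and wait event names are placeholders rather than the
identifiers used in the patch, and late-attaching workers are glossed
over):

    /* Simplified sketch with placeholder names, not the literal patch code. */
    uint32      chunknum = pg_atomic_read_u32(&batch->current_chunk);

    while (chunknum < batch->nchunks)
    {
        Barrier    *chunk_barrier = &batch->chunk_barriers[chunknum];

        BarrierAttach(chunk_barrier);

        /* ... probe this chunk of the inner side, emitting matches ... */

        /*
         * The only wait is here, once the chunk barrier is in its final
         * phase.  The elected worker bumps the shared chunk number, then
         * everyone detaches and picks up the next chunk's barrier.
         */
        if (BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASHJOIN_CHUNK))
            pg_atomic_fetch_add_u32(&batch->current_chunk, 1);
        BarrierDetach(chunk_barrier);

        chunknum++;
    }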

batch barrier wait:
- In order to mitigate the other cause of the deadlock hazard (workers
  waiting on the batch barrier after emitting tuples), now, in
  ExecParallelHashJoinNewBatch(), if we are attached to a batch
  barrier and it is a fallback batch, all workers detach from the
  batch barrier and then end their scan of that batch. The last
  worker to detach combines the outer match status files, then
  detaches from the batch, cleans up the hashtable, and ends its scan
  of the inner side. It then returns and proceeds to emit the
  unmatched outer tuples.
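
Sketched out (and assuming the last-to-detach detection is done with
BarrierArriveAndDetach(); the patch may detect it differently), the
fallback-batch path behaves roughly like this:

    if (BarrierArriveAndDetach(&batch->batch_barrier))
    {
        /*
         * We were the last participant to detach, so nobody else can
         * still be writing: read all of the workers' outer match status
         * files and OR them into the combined bitmap (the SharedBits
         * combine step), detach from the batch, clean up the hashtable,
         * and end the scan of the inner side.  We then return and go on
         * to emit the unmatched outer tuples using the combined bitmap.
         */
    }
    else
    {
        /*
         * Some other worker will do the combining; we just end our scan
         * of this batch without waiting, so we never wait on the batch
         * barrier after having emitted tuples.
         */
    }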
 
> > This patch does contain refactoring of nodeHashjoin.
> >
> > I have split the Parallel HashJoin and Serial HashJoin state machines
> > up, as they were diverging in my patch to a point that made for a
> > really cluttered ExecHashJoinImpl() (ExecHashJoinImpl() is now gone).

> Hmm.  I'm rather keen on extending that technique further: I'd like
> there to be more configuration points in the form of parameters to
> that function, so that we write the algorithm just once but we
> generate a bunch of specialised variants that are the best possible
> machine code for each combination of parameters via constant-folding
> using the "always inline" trick (steampunk C++ function templates).
> My motivations for wanting to do that are: supporting different hash
> sizes (CF commit e69d6445), removing branches for unused optimisations
> (eg skew), and inlining common hash functions.  That isn't to say we
> couldn't have two different templatoid functions from which many
> others are specialised, but I feel like that's going to lead to a lot
> of duplication.


I'm okay with using templating. For now, while I am addressing large
TODO items with the patchset, I will keep them as separate functions.
Once it is in a better state, I will look at the overlap and explore
templating. The caveat is that if a lot of new commits start going
into nodeHashjoin.c, keeping this long-running branch rebased could
get painful.
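
For anyone following along, the constant-folding trick Thomas mentions
is the one already in master's nodeHashjoin.c, shaped approximately
like this (abridged):

    /*
     * Abridged: one shared implementation, specialised at compile time
     * by a constant parameter plus forced inlining.
     */
    static pg_attribute_always_inline TupleTableSlot *
    ExecHashJoinImpl(PlanState *pstate, bool parallel)
    {
        /* ... the whole hash join state machine, branching on "parallel" ... */
    }

    static TupleTableSlot *
    ExecHashJoin(PlanState *pstate)
    {
        /* the constant "false" lets the compiler drop the parallel branches */
        return ExecHashJoinImpl(pstate, false);
    }

    static TupleTableSlot *
    ExecParallelHashJoin(PlanState *pstate)
    {
        /* the constant "true" yields the parallel-aware specialisation */
        return ExecHashJoinImpl(pstate, true);
    }

Extending it the way Thomas describes would presumably mean more such
constant parameters (hash width, skew, hash function) and more thin
wrappers.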

The patchset has also been run through pgindent, so
v4-0001-Implement-Adaptive-Hashjoin.patch will look a bit different
from v3-0001-hashloop-fallback.patch, but it has the same content.
v4-0002-Fixup-tupleMetadata-struct-issues.patch is just some other
fixups and small cosmetic changes.

The new big TODO is to make a file type that better suits the
SharedBits API, but I don't want to do that unless the idea is
validated.

[1] https://www.postgresql.org/message-id/flat/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com
