Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Avoiding hash join batch explosions with extreme skew and weird stats
Msg-id CAAKRu_YOcGjUzGHXUVw8oHHYdsCw9Z+WtTgqCVa7tdhg0HNrXg@mail.gmail.com
In response to Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers


On Tue, Jan 7, 2020 at 4:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> * I am uneasy about BarrierArriveExplicitAndWait() (a variant of
> BarrierArriveAndWait() that lets you skip directly to a given phase?);
> perhaps you only needed that for a circular phase system, which you
> could do with modular phase numbers, like PHJ_GROW_BATCHES_PHASE?  I
> tried to make the barrier interfaces look like the libraries in other
> parallel programming environments, and I'd be worried that the
> explicit phase thing could easily lead to bugs.

So, I actually use it to circle back up to the first phase while
skipping the last phase, which is why I couldn't do it with modular
phase numbers and a loop. The last phase detaches from the chunk
barrier, and I don't want to detach from the chunk barrier if there
are more chunks. Basically, I need a way to attach to the chunk
barrier only at the beginning of the first chunk and detach only at
the end of the last chunk, not in between chunks, since I will return
from the function and re-enter between chunks (say, between chunk 2
and chunk 3 of 5).

However, could this be solved by having more than one chunk barrier?
A worker would attach to one chunk barrier, attach to the other one
when it moves to the next chunk, and then switch back again for the
chunk after that. That way it could detach and attach each time it
enters and leaves the function.
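To make that concrete, here is a minimal sketch of the two-barrier
idea (purely hypothetical, not from the patch: FallbackBatchShared and
probe_one_chunk() are made-up names, and I've passed 0 for the wait
event just to keep the sketch short):

#include "postgres.h"
#include "storage/barrier.h"

/*
 * Hypothetical sketch only: a shared struct with two chunk barriers so a
 * worker can attach on the way into the probe function and detach on the
 * way out, alternating barriers between consecutive chunks.
 */
typedef struct FallbackBatchShared
{
    Barrier     chunk_barriers[2];  /* alternated between chunks */
} FallbackBatchShared;

static void
probe_one_chunk(FallbackBatchShared *shared, int chunkno)
{
    /* Chunk n uses one barrier, chunk n+1 the other, and so on. */
    Barrier    *chunk_barrier = &shared->chunk_barriers[chunkno % 2];

    BarrierAttach(chunk_barrier);

    /* ... probe the current chunk of the inner side here ... */

    /* Wait for all attached workers to finish this chunk. */
    BarrierArriveAndWait(chunk_barrier, 0);     /* wait event omitted */

    /* Detach before returning, so nothing stays attached between chunks. */
    BarrierDetach(chunk_barrier);
}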
 
> * I'm not sure it's OK to wait at the end of each loop, as described
> in the commit message:
>
>     Workers probing a fallback batch will wait until all workers have
>     finished probing before moving on so that an elected worker can read
>     and combine the outer match status files into a single bitmap and use
>     it to emit unmatched outer tuples after all chunks of the inner side
>     have been processed.
>
> Maybe I misunderstood completely, but that seems to break the
> programming rule described in nodeHashjoin.c's comment beginning "To
> avoid deadlocks, ...".  To recap: (1) When you emit a tuple, the
> program counter escapes to some other node, and maybe that other node
> waits for thee, (2) Maybe the leader is waiting for you but you're
> waiting for it to drain its queue so you can emit a tuple (I learned a
> proper name for this: "flow control deadlock").  That's why the
> current code only ever detaches (a non-waiting operation) after it's
> begun emitting tuples (that is, the probing phase).  It just moves
> onto another batch.  That's not a solution here: you can't simply move
> to another loop, loops are not independent of each other like batches.
> It's possible that barriers are not the right tool for this part of
> the problem, or that there is a way to use a barrier that you don't
> remain attached to while emitting, or that we should remove the
> deadlock risks another way entirely[1] but I'm not sure.  Furthermore,
> the new code in ExecParallelHashJoinNewBatch() appears to break the
> rule even in the non-looping case (it calls BarrierArriveAndWait() in
> ExecParallelHashJoinNewBatch(), where the existing code just
> detaches).

Yeah, I think I'm totally breaking that rule.
Just to make sure I understand exactly how I'm breaking it:

In my patch, while attached to a chunk_barrier, worker1 emits a
matched tuple (control leaves the current node). Meanwhile, worker2
has finished probing the chunk and is waiting on the chunk_barrier for
worker1. How, though, could worker1 be waiting for worker2?

Is this only a problem when one of the barrier participants is the
leader and is reading from the tuple queue (as in your tuple queue
deadlock hazard example in the thread [1] you referred to)? Basically,
is my deadlock hazard a tuple queue deadlock hazard?

I thought maybe this could be a problem with nested HJ nodes, but I'm
not sure.

As I understand it, this isn't a problem on current master with batch
barriers: a worker can emit tuples while attached to a batch_barrier
because no other worker will wait on the batch barrier once it has
started probing.
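
For my own reference, here is roughly the shape of that rule as I
understand it (a paraphrase, not the actual nodeHashjoin.c code;
FallbackBatch, choose_next_batch() and emit_matches() are made-up
names):

#include "postgres.h"
#include "storage/barrier.h"

/*
 * Paraphrased shape of the existing rule, not a verbatim excerpt: once a
 * worker has begun emitting tuples for a batch, it only ever detaches from
 * that batch's barrier (non-blocking); it never waits on it again.
 */
typedef struct FallbackBatch
{
    Barrier     batch_barrier;
} FallbackBatch;

extern FallbackBatch *choose_next_batch(void);  /* made-up helper */
extern void emit_matches(FallbackBatch *batch); /* made-up helper */

static void
probe_remaining_batches(void)
{
    FallbackBatch *batch;

    while ((batch = choose_next_batch()) != NULL)
    {
        Barrier    *batch_barrier = &batch->batch_barrier;

        BarrierAttach(batch_barrier);

        /* Build/load phases may wait here; no tuples emitted yet. */

        /* Probe phase: emit matched tuples up to the parent node. */
        emit_matches(batch);

        /*
         * Once emitting has begun, only ever detach (non-blocking) and
         * move on; waiting here could deadlock with a participant stuck
         * writing to a full tuple queue.
         */
        BarrierDetach(batch_barrier);
    }
}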

I need to think more about the suggestions you provided in [1] about
nixing the tuple queue deadlock hazard.

However, hypothetically, if we decide we don't want to break the "no
emitting tuples while attached to a barrier" rule, how can we still
allow workers to coordinate while probing chunks of the batch
sequentially (one chunk at a time)?

I could think of two options (both sound slow and bad):

Option 1:
Stash away the matched tuples in a tuplestore and emit them at the end
of the batch, incurring more writes (rough sketch below, after option 2).

Option 2:
Degenerate to 1 worker for fallback batches
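
For option 1, I am imagining something like this (just a sketch, not
code from the patch; stash_match() and emit_deferred_matches() are
made-up helpers, and the real thing would need proper slot and memory
handling):

#include "postgres.h"
#include "executor/tuptable.h"
#include "miscadmin.h"
#include "utils/tuplestore.h"

/*
 * Sketch of option 1 (hypothetical, not in the patch): buffer matched
 * tuples in a per-worker tuplestore while attached to the chunk barrier,
 * then emit them only after detaching at the end of the batch.
 */
static void
stash_match(Tuplestorestate **matches, TupleTableSlot *slot)
{
    if (*matches == NULL)
        *matches = tuplestore_begin_heap(false, false, work_mem);
    /* Spills to disk once work_mem is exceeded (the extra writes). */
    tuplestore_puttupleslot(*matches, slot);
}

static TupleTableSlot *
emit_deferred_matches(Tuplestorestate *matches, TupleTableSlot *slot)
{
    /* Only called after detaching from the chunk/batch barriers. */
    if (matches != NULL &&
        tuplestore_gettupleslot(matches, true, false, slot))
        return slot;
    return NULL;                /* no more buffered matches */
}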

Any other ideas?
 

> >   - Rename "chunk" (as in chunks of inner side) to something that is
> >     not already used in the context of memory chunks and, more
> >     importantly, SharedTuplestoreChunk
>
> +1.  Fragments?  Loops?  Blocks (from
> https://en.wikipedia.org/wiki/Block_nested_loop, though, no, strike
> that, blocks are also super overloaded).

Hmmm. I think "loop" is kinda confusing, but "fragment" has potential.
I also thought of "piece", which is actually where I'm leaning now.
What do you think?

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com

--
Melanie Plageman
