Thread: Parallel Full Hash Join

Parallel Full Hash Join

From

Thomas Munro

Date:

12 September 2019, 05:56:00

Hello,

While thinking about looping hash joins (an alternative strategy for
limiting hash join memory usage currently being investigated by
Melanie Plageman in a nearby thread[1]), the topic of parallel query
deadlock hazards came back to haunt me. I wanted to illustrate the
problems I'm aware of with the concrete code where I ran into this
stuff, so here is a new-but-still-broken implementation of $SUBJECT.
This was removed from the original PHJ submission when I got stuck and
ran out of time in the release cycle for 11. Since the original
discussion is buried in long threads and some of it was also a bit
confused, here's a fresh description of the problems as I see them.
Hopefully these thoughts might help Melanie's project move forward,
because it's closely related, but I didn't want to dump another patch
into that other thread. Hence this new thread.

I haven't succeeded in actually observing a deadlock with the attached
patch (though I did last year, very rarely), but I also haven't tried
very hard. The patch seems to produce the right answers and is pretty
scalable, so it's really frustrating not to be able to get it over the
line.

Tuple queue deadlock hazard:

If the leader process is executing the subplan itself and waiting for
all processes to arrive in ExecParallelHashEndProbe() (in this patch)
while another process has filled up its tuple queue and is waiting for
the leader to read some tuples an unblock it, they will deadlock
forever. That can't happen in the the committed version of PHJ,
because it never waits for barriers after it has begun emitting
tuples.

Some possible ways to fix this:

1. You could probably make it so that the PHJ_BATCH_SCAN_INNER phase
in this patch (the scan for unmatched tuples) is executed by only one
process, using the "detach-and-see-if-you-were-last" trick. Melanie
proposed that for an equivalent problem in the looping hash join. I
think it probably works, but it gives up a lot of parallelism and thus
won't scale as nicely as the attached patch.

2. You could probably make it so that only the leader process drops
out of executing the inner unmatched scan, and then I think you
wouldn't have this very specific problem at the cost of losing some
(but not all) parallelism (ie the leader), but there might be other
variants of the problem. For example, a GatherMerge leader process
might be blocked waiting for the next tuple for a tuple from P1, while
P2 is try to write to a full queue, and P1 waits for P2.

3. You could introduce some kind of overflow for tuple queues, so
that tuple queues can never block because they're full (until you run
out of extra memory buffers or disk and error out). I haven't
seriously looked into this but I'm starting to suspect it's the
industrial strength general solution to the problem and variants of it
that show up in other parallelism projects (Parallel Repartition). As
Robert mentioned last time I talked about this[2], you'd probably only
want to allow spooling (rather than waiting) when the leader is
actually waiting for other processes; I'm not sure how exactly to
control that.

4. <thinking-really-big>Goetz Graefe's writing about parallel sorting
comes close to this topic, which he calls flow control deadlocks. He
mentions the possibility of infinite spooling like (3) as a solution.
He's describing a world where producers and consumers are running
concurrently, and the consumer doesn't just decide to start running
the subplan (what we call "leader participation"), so he doesn't
actually have a problem like Gather deadlock. He describes
planner-enforced rules that allow deadlock free execution even with
fixed-size tuple queue flow control by careful controlling where
order-forcing operators are allowed to appear, so he doesn't have a
problem like Gather Merge deadlock. I'm not proposing we should
create a whole bunch of producer and consumer processes to run
different plan fragments, but I think you can virtualise the general
idea in an async executor with "streams", and that also solves other
problems when you start working with partitions in a world where it's
not even sure how many workers will show up. I see this as a long
term architectural goal requiring vast amounts of energy to achieve,
hence my new interest in (3) for now.</thinking-really-big>

Hypothetical inter-node deadlock hazard:

Right now I think it is the case the whenever any node begins pulling
tuples from a subplan, it continues to do so until either the query
ends early or the subplan runs out of tuples. For example, Append
processes its subplans one at a time until they're done -- it doesn't
jump back and forth. Parallel Append doesn't necessarily run them in
the order that they appear in the plan, but it still runs each one to
completion before picking another one. If we ever had a node that
didn't adhere to that rule, then two Parallel Full Hash Join nodes
could dead lock, if some of the workers were stuck waiting in one
while some were stuck waiting in the other.

If we were happy to decree that that is a rule of the current
PostgreSQL executor, then this hypothetical problem would go away.
For example, consider the old patch I recently rebased[3] to allow
Append over a bunch of FDWs representing remote shards to return
tuples as soon as they're ready, not necessarily sequentially (and I
think several others have worked on similar patches). To be
committable under such a rule that applies globally to the whole
executor, that patch would only be allowed to *start* them in any
order, but once it's started pulling tuples from a given subplan it'd
have to pull them all to completion before considering another node.

(Again, that problem goes away in an async model like (4), which will
also be able to do much more interesting things with FDWs, and it's
the FDW thing that I think generates more interest in async execution
than my rambling about abstract parallel query problems.)

Some other notes on the patch:

Aside from the deadlock problem, there are some minor details to tidy
up (handling of late starters probably not quite right, rescans not
yet considered). There is a fun hard-coded parameter that controls
the parallel step size in terms of cache lines for the unmatched scan;
I found that 8 was a lot faster than 4, but no slower than 128 on my
laptop, so I set it to 8. More thoughts along those micro-optimistic
lines: instead of match bit in the header, you could tag the pointer
and sometimes avoid having to follow it, and you could prefetch next
non-matching tuple's cacheline by looking a head a bit.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BTgmoY4LogYcg1y5JPtto_fL-DBUqvxRiZRndDC70iFiVsVFQ%40mail.gmail.com
[3]
https://www.postgresql.org/message-id/flat/CA%2BhUKGLBRyu0rHrDCMC4%3DRn3252gogyp1SjOgG8SEKKZv%3DFwfQ%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com

Thread: Parallel Full Hash Join

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment