
From: Thomas Munro
Subject: Re: Parallel Full Hash Join
Date:
Msg-id: CA+hUKGJjN-vJXa3P9=-h4XU-Ty-h7WPpzzXJnA2b4NzitdYFgg@mail.gmail.com
In response to: Re: Parallel Full Hash Join (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
On Fri, Mar 31, 2023 at 8:23 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I understand the scenario you are thinking of, however, I question how
> those incorrectly formed tuples would ever be returned by the query. The
> hashjoin would only start to shutdown once enough tuples had been
> emitted to satisfy the limit, at which point, those tuples buffered in
> p0 may be emitted by this worker but wouldn't be included in the query
> result, no?

Yeah, I think I must have been confused by that too early on.  The
thing is, Gather asks every worker process for n tuples so that any
one of them could satisfy the LIMIT if required, but it's unknown
which process's output the Gather node will receive first (or which
process's tuples might make it into intermediate nodes and affect the
results).  I guess to see bogus unmatched tuples actually escaping
anywhere (with the earlier patches) you'd need parallel leader
participation turned off plus diabolical scheduling?

I thought about 3 solutions before settling on #3:  (1)
Hypothetically, P1 could somehow steal/finish P0's work, but our
executor has no mechanism for anything like that.  (2) P0 isn't
allowed to leave the probe early, instead it has to keep going but
throw away the tuples it'd normally emit, so we are sure we have all
the match bits in shared memory.  (3) P0 seizes responsibility for
emitting those tuples, but then does nothing because the top level
executor doesn't want more tuples, which in practice looks like a flag
telling everyone else not to bother.

Idea #1 would probably require shared address space (threads) and a
non-recursive executor, as speculated about a few times before, and
that type of magic could address several kinds of deadlock risks, but
in this case we still wouldn't want to do that even if we could; it's
work that is provably (by idea #3's argument) a waste of time.  Idea
#2 is a horrible pessimisation of idea #1 within our existing executor
design, but it helped me think about what it really means to be
authorised to throw away tuples from on high.
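
To make idea #3 concrete, here's a minimal sketch of the kind of shared
flag it boils down to.  The struct and function names are hypothetical
(not the ones in the committed patch), and it assumes PostgreSQL's
port/atomics.h and that the flag was initialised with
pg_atomic_init_u32() when the shared area was set up.

#include "postgres.h"
#include "port/atomics.h"

/*
 * Hypothetical shared state for idea #3: the participant that was
 * elected to emit the unmatched inner tuples, on noticing that the
 * executor no longer wants tuples, sets a flag so that no other
 * participant bothers with the unmatched scan either.
 */
typedef struct SketchSharedHashJoin
{
    pg_atomic_uint32 abandon_unmatched; /* 1 => skip unmatched scan */
} SketchSharedHashJoin;

/* Called by the elected worker when the top-level executor stops pulling. */
static void
sketch_abandon_unmatched(SketchSharedHashJoin *shared)
{
    pg_atomic_write_u32(&shared->abandon_unmatched, 1);
}

/* Checked by every participant before starting the unmatched scan. */
static bool
sketch_unmatched_scan_needed(SketchSharedHashJoin *shared)
{
    return pg_atomic_read_u32(&shared->abandon_unmatched) == 0;
}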

> I suppose even if what I said is true, we do not want the hashjoin node
> to ever produce incorrect tuples. In which case, your fix seems correct to me.

Yeah, that's a good way to put it.

> > The last things I'm thinking about now:  Are the planner changes
> > right?
>
> I think the current changes are correct. I wonder if we have to change
> anything in initial/final_cost_hashjoin to account for the fact that
> for a single batch full/right parallel hash join, part of the
> execution is serial. And, if so, do we need to consider the estimated
> number of unmatched tuples to be emitted?

I have no idea how to model that, and I'm assuming the existing model
should continue to work as well as it does today "on average".  The
expected number of tuples will be the same across all workers, it's
just an unfortunate implementation detail that the distribution sucks
(but is still much better than a serial plan).  I wondered if
get_parallel_divisor() might provide some inspiration, but that's
dealing with a different problem: a partially participating extra
process that will take some of the work (i.e. tuples) away from the
other processes, and that's not the case here.
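
For reference, a rough paraphrase of what get_parallel_divisor() does
(from memory, so treat the names and constants as approximate): it
spreads a path's rows over the workers, crediting a partially
participating leader with a diminishing share, which is a different
effect from one participant doing a serial unmatched scan while the
others sit idle.

#include <stdbool.h>

/*
 * Sketch only: loosely paraphrases get_parallel_divisor() in
 * costsize.c; the real function takes a Path and consults the
 * parallel_leader_participation GUC, and its constants may differ.
 */
static double
sketch_parallel_divisor(int parallel_workers, bool leader_participates)
{
    double divisor = parallel_workers;

    if (leader_participates)
    {
        /*
         * The leader is assumed to spend ~30% of its time per worker
         * servicing them, so its useful contribution shrinks.
         */
        double leader_contribution = 1.0 - 0.3 * parallel_workers;

        if (leader_contribution > 0)
            divisor += leader_contribution;
    }

    return divisor;
}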

> > Are the tests enough?
>
> So, the tests currently in the patch set cover the unmatched tuple scan
> phase for single batch parallel full hash join. I've attached the
> dumbest possible addition to that which adds in a multi-batch full
> parallel hash join case. I did not do any checking to ensure I picked
> the case which would add the least execution time to the test, etc.

Thanks, added.

I should probably try to figure out how to get the join_hash tests to
run with smaller tables.  It's one of the slower tests and this adds
to it.  I vaguely recall it was hard to get the batch counts to be
stable across the build farm, which makes me hesitant to change the
tests but perhaps I can figure out how to screw it down...

I decided to drop the scan order change for now (0001 in v13).  Yes,
it's better than what we have now, but it seems to cut off some other
possible ideas to do even better, so it feels premature to change it
without more work.  I changed the parallel unmatched scan back to
being as similar as possible to the serial one for now.

I committed the main patch.

Here are a few ideas that came up while working on this, for future study:

* the "opportunistic help" thing you once suggested to make it a
little fairer in multi-batch cases.  Quick draft attached, for future
experimentation.  Seems to work pretty well, but could definitely be
tidier and there may be holes in it.  Pretty picture attached.

* should we pass HJ_FILL_INNER(hjstate) as a new parameter fill_inner
to ExecHashJoinImpl(), so that we can build specialised hash join
routines for the yes and no cases and remove the branching and memory
traffic related to match bits?
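
For what it's worth, a minimal sketch of that specialisation, using the
same constant-propagation trick ExecHashJoinImpl() already relies on
for its parallel flag; the names below are made-up stand-ins for
ExecHashJoinImpl() and its wrappers, and the body is elided.

#include "postgres.h"
#include "nodes/execnodes.h"

/*
 * Sketch of specialising on fill_inner: the always-inline
 * implementation receives fill_inner as a compile-time constant from
 * thin wrappers, so the compiler can drop the match-bit branches and
 * stores entirely in the !fill_inner copies.
 */
static pg_attribute_always_inline TupleTableSlot *
SketchHashJoinImpl(PlanState *pstate, bool parallel, bool fill_inner)
{
    /*
     * ... the usual hash join state machine; wherever the current code
     * tests HJ_FILL_INNER(hjstate), it would test the constant
     * fill_inner instead ...
     */
    return NULL;
}

static TupleTableSlot *
SketchHashJoinNoFillInner(PlanState *pstate)
{
    return SketchHashJoinImpl(pstate, false, false);
}

static TupleTableSlot *
SketchHashJoinFillInner(PlanState *pstate)
{
    return SketchHashJoinImpl(pstate, false, true);
}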

* could we use tagged pointers to track matched tuples?  Tuples are
MAXALIGNed, so bits 0 and 1 of pointers to them are certainly always
0.  Perhaps we could use bit 0 for "matched" and bit 1 for "I am not
the last tuple in my chain, you'll have to check the next one too".
Then you could scan for unmatched tuples without following many
pointers, if you're lucky.  And you could skip the masking required
for that when !fill_inner.
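
A rough sketch of the pointer-tagging part, just to show the bit
twiddling (the names are hypothetical and this ignores where the tagged
pointers would actually live):

#include <stdbool.h>
#include <stdint.h>

/*
 * Tuples are MAXALIGNed, so bits 0 and 1 of a pointer to one are always
 * zero and can carry flags: bit 0 = "this tuple has been matched",
 * bit 1 = "there are more tuples in this bucket chain to check".
 */
#define SKETCH_MATCHED  ((uintptr_t) 1)
#define SKETCH_HAS_NEXT ((uintptr_t) 2)
#define SKETCH_FLAGS    (SKETCH_MATCHED | SKETCH_HAS_NEXT)

static inline void *
sketch_untag(void *tagged)
{
    return (void *) ((uintptr_t) tagged & ~SKETCH_FLAGS);
}

static inline void *
sketch_mark_matched(void *tagged)
{
    return (void *) ((uintptr_t) tagged | SKETCH_MATCHED);
}

static inline bool
sketch_is_matched(void *tagged)
{
    return ((uintptr_t) tagged & SKETCH_MATCHED) != 0;
}

static inline bool
sketch_has_next(void *tagged)
{
    return ((uintptr_t) tagged & SKETCH_HAS_NEXT) != 0;
}

With that, a bucket whose head pointer is tagged matched-and-no-next
can be skipped without dereferencing anything.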

* should we use software prefetching to smooth over the random memory
order problem when you do have to follow pointers?  Though it's hard
to prefetch within chains, here we at least have an array full of
pointers to the first tuple in each chain.  This probably goes along
with the general hash join memory prefetching work that I started a
couple of years back and need to restart for 17.
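
Something like this, perhaps (illustrative only; __builtin_prefetch is
the GCC/Clang builtin, and the distance of 8 is a number picked out of
the air):

#include <stddef.h>

/*
 * While walking the bucket-head array during the unmatched scan, issue
 * a prefetch for a bucket a fixed distance ahead so its first tuple is
 * (hopefully) in cache by the time we get there.
 */
#define SKETCH_PREFETCH_DISTANCE 8

static void
sketch_scan_buckets(void **buckets, size_t nbuckets)
{
    for (size_t i = 0; i < nbuckets; i++)
    {
        if (i + SKETCH_PREFETCH_DISTANCE < nbuckets &&
            buckets[i + SKETCH_PREFETCH_DISTANCE] != NULL)
            __builtin_prefetch(buckets[i + SKETCH_PREFETCH_DISTANCE]);

        /* ... examine the chain starting at buckets[i] ... */
    }
}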

* this idea is probably stupid overkill, but it's something that
v13-0001 made me think about: could it be worth the effort to sample a
fraction of the match bits in the hash table buckets (with the scheme
above), and determine whether you'll be emitting a high fraction of
the tuples, and then switch to a chunk-based scan so that you can do
it in memory order if so?  That requires having the match flag in
*two* places, which seems silly; you'd need some experimental evidence
that any of this is worth bothering with.

* currently, the "hash inner" phase only loads tuples into batch 0's
hash table (the so-called "hybrid Grace" technique), but if there are
(say) 4 processes, you could actually load batches 0-3 into memory
during that phase, to avoid having to dump batches 1-3 out to disk and
then immediately load them back in again; you'd get to skip "l1",
"l2", "l3" on those diagrams and finish a good bit faster.

