Here is a rebased and lightly hacked-upon version that I'm testing.
0001-Scan-for-unmatched-hash-join-tuples-in-memory-order.patch
 * this change can stand on its own, separately from any PHJ changes
 * renamed hashtable->current_chunk[_idx] to unmatched_scan_{chunk,idx}
 * introduced a local variable to avoid some x->y->z stuff
 * removed some references to no-longer-relevant hj_XXX variables in
the Prep function
I haven't attempted to prove anything about the performance of this
one yet, but it seems fairly obvious that it can't be worse than what
we're doing today.  I have suppressed the urge to look into improving
locality and software prefetching.
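To illustrate the memory-order scan that 0001 is about, here is a
minimal sketch, not the patch itself.  The Toy* types, the fixed-size
tuples and the function names are assumptions made up for this
example; only the unmatched_scan_{chunk,idx} naming comes from the
notes above.  The real hash table stores variable-length tuples, so
its index would not be a simple array slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real PostgreSQL structs. */
#define TOY_CHUNK_CAPACITY 4

typedef struct ToyTuple
{
	int			key;
	bool		matched;	/* set during probing when a partner was found */
} ToyTuple;

typedef struct ToyChunk
{
	struct ToyChunk *next;	/* next chunk in the table's chunk list */
	size_t		used;		/* number of tuples stored in this chunk */
	ToyTuple	tuples[TOY_CHUNK_CAPACITY];
} ToyChunk;

typedef struct ToyHashTable
{
	ToyChunk   *chunks;					/* head of the chunk list */
	ToyChunk   *unmatched_scan_chunk;	/* scan position: current chunk */
	size_t		unmatched_scan_idx;		/* scan position: slot in chunk */
} ToyHashTable;

/* Reset the scan position to the first tuple of the first chunk. */
static void
toy_prep_unmatched_scan(ToyHashTable *ht)
{
	ht->unmatched_scan_chunk = ht->chunks;
	ht->unmatched_scan_idx = 0;
}

/*
 * Return the next unmatched tuple in memory order, or NULL when every
 * chunk has been scanned.  Buckets are never consulted.
 */
static ToyTuple *
toy_next_unmatched(ToyHashTable *ht)
{
	ToyChunk   *chunk;		/* local variable avoids repeated ht->x->y */

	while ((chunk = ht->unmatched_scan_chunk) != NULL)
	{
		assert(chunk->used <= TOY_CHUNK_CAPACITY);

		while (ht->unmatched_scan_idx < chunk->used)
		{
			ToyTuple   *tuple = &chunk->tuples[ht->unmatched_scan_idx++];

			if (!tuple->matched)
				return tuple;
		}

		/* This chunk is exhausted; move on to the next one. */
		ht->unmatched_scan_chunk = chunk->next;
		ht->unmatched_scan_idx = 0;
	}

	return NULL;
}
```

The point of the sketch is only that the scan touches each chunk once,
front to back, instead of chasing bucket chains around memory.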
0002-Parallel-Hash-Full-Join.patch
 * reuse the same unmatched_scan_{chunk,idx} variables as above
 * rename the list of chunks to scan to work_queue
 * fix race/memory leak if we see PHJ_BATCH_SCAN when we attach (it
wasn't OK to just fall through)
That "work queue" name/concept already exists in other places that
need to process every chunk, namely rebucketing and repartitioning.
In later work, I'd like to harmonise these work queues, but I'm not
trying to increase the size of this patch set at this time; I just
want to use consistent naming.
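As a rough picture of the "work queue" concept, here is a toy sketch
in which several workers each claim the next unprocessed chunk until
none are left.  Everything in it is hypothetical: the ToyWorkQueue
type, its function names and the use of a C11 atomic counter over
chunk numbers are assumptions for illustration, and it is not how the
patch represents or protects its queue of chunks.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue of chunk numbers 0..nchunks-1 shared by workers. */
typedef struct ToyWorkQueue
{
	size_t		nchunks;	/* total number of chunks to process */
	atomic_size_t next;		/* number of the next unclaimed chunk */
} ToyWorkQueue;

static void
toy_work_queue_init(ToyWorkQueue *q, size_t nchunks)
{
	q->nchunks = nchunks;
	atomic_init(&q->next, 0);
}

/*
 * Try to claim one chunk.  Returns true and stores the chunk number in
 * *chunk_no on success; returns false, leaving *chunk_no untouched,
 * once the queue is drained.  Each chunk is handed to exactly one caller.
 */
static bool
toy_work_queue_pop(ToyWorkQueue *q, size_t *chunk_no)
{
	size_t		n = atomic_load(&q->next);

	while (n < q->nchunks)
	{
		/* On failure, n is reloaded with the current value of next. */
		if (atomic_compare_exchange_weak(&q->next, &n, n + 1))
		{
			*chunk_no = n;
			return true;
		}
	}

	return false;
}
```

Each worker would loop on toy_work_queue_pop() and process the chunk
it gets back, so every chunk is visited once no matter how many
workers take part.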
I don't love the way that both ExecHashTableDetachBatch() and
ExecParallelPrepHashTableForUnmatched() duplicate logic relating to
the _SCAN/_FREE protocol, but I'm struggling to find a better idea.
Perhaps I just need more coffee.
I think your idea of opportunistically joining the scan if it's
already running makes sense to explore as a later step, i.e. to make
multi-batch PHFJ fully fair.  I think that should be a fairly easy
code change, and I put in some comments where changes would be needed.
Continuing to test, more soon.