Richard Guo <guofenglinux@gmail.com> writes:
> AFAICT, there are 3 possible options for a fix.
> 1) Revert aa86129e1.
> 2) Modify the code to perform atomic operations on the matched flag
> using CAS (or a similar mechanism) during parallel execution.
> 3) Disable parallel right semi joins in the planner.
Right. I agree that #3 is the most attractive stopgap answer.
We can look into #2 later, but it doesn't sound like something
to back-patch. (The main problem, from my brief look, is that
t_infomask2 is a uint16, and we haven't built out any 16-bit
atomic primitives; perhaps they don't exist on every platform.)
> (I'm still trying to understand why concurrent access to the matched
> flag in cases other than right semi joins (such as right or full
> joins) doesn't lead to concurrency issues.)
I believe parallel right semi join (PRSJ) is the only case where we
need to both set and concurrently inspect the HEAP_TUPLE_HAS_MATCH
flag in a shared hash table.
I have a nasty feeling that this was well understood back when
we first did parallel hash join, which is why it wasn't done
already. Apparently the problem didn't get documented though,
or at least not in any place you chanced to look.
Looking at the code now, ExecParallelScanHashTableForUnmatched
also has unprotected tests (not sets) of the flag, but I think
that may be okay because we shouldn't still be mutating the
flags while that runs.
regards, tom lane