Lori Corbani <Lori.Corbani@jax.org> writes:
> Attached is a file with this info. Please let me know if this is what you need.
Thank you for the test case. With your smaller data set (for some
reason not the larger one) I can duplicate the misbehavior: v17
consistently returns 115427 rows, v18 often one or two more than
that. I just loaded up the data, vacuum analyzed, and ran
select count(*) from (
select s._Strain_key
from mgd.prb_strain s
where s.private = 0
and s.strain not ilike '%involves%'
and s.strain not ilike '%either%'
and s.strain not ilike '% and %'
and s.strain not ilike '% or %'
and exists (select 1 from mgd.voc_annot va, mgd.voc_term t
where va._AnnotType_key = 1009
and va._Term_key = t._Term_key
and t.term != 'Not Applicable'
and t.term != 'Not Specified'
and va._Object_key = s._Strain_key)
) ss;
repeatedly in an out-of-the-box configuration. (BTW, without the
outer "count(*)" subquery, it tends not to pick a PHJ plan, so the
misbehavior isn't there. I didn't look into why not.)
After a good deal of time bisecting, I find the first commit that
shows the problem is
Author: Richard Guo <rguo@postgresql.org>
Branch: master Release: REL_18_BR [aa86129e1] 2024-07-05 09:26:48 +0900
Support "Right Semi Join" plan shapes
which is perhaps unsurprising, because the planner picks a "Parallel
Hash Right Semi Join" plan after that and a non-Right plan before it.
So it looks like there is some race condition in that.
The failure rate at aa86129e1 and following commits is only perhaps
two or three tries out of 100, and I never saw more than 115428
rows produced. However, after
Author: Thomas Munro <tmunro@postgresql.org>
Branch: master Release: REL_18_BR [4effd0844] 2024-08-31 17:28:02 +1200
Branch: REL_17_STABLE Release: REL_17_0 [3ed368361] 2024-08-31 17:29:30 +1200
Fix unfairness in all-cached parallel seq scan.
the failure rate gets enormously worse, and there are many runs
producing more than one extra row, for example these stats from
200 test runs at 5668a857d:
$ sort -n ~/corbanicounts | uniq -c
50 115427
78 115428
38 115429
19 115430
11 115431
4 115432
I don't know what to make of that, exactly. 4effd0844 is nowhere near
the PHJ code, and it hasn't been causing any problems in the v17
branch as far as I've heard. My guess is that it changes the order of
delivery of rows to PHJ in a way that tends to tickle the hypothesized
race condition.
BTW, 5668a857d did not move the needle one way or the other, so it's
not that.
None of this is code that I've been heavily into, so I hope somebody
else will pick it up. But if we can't solve it in the next 10 days or
so, I think we need to disable parallel right semi joins for 18.1.
regards, tom lane