Hi,
On Fri, Oct 25, 2024 at 09:33:12AM +0900, Michael Paquier wrote:
> This is in the first permutation of the test done with "wait1 wakeup2
> detach2", and the diff means that the backend running the "wait"
> callback is reported as finished after the detach is done,
> injection_points_run being only used for the waits. Hence the wait is
> so slow to finish that the detach has time to complete and finish,
> breaking the output.
Yeah, I agree with your analysis.
> And here comes the puzzling part: all of failures involve FreeBSD 13
> in the CI. Reproducing this failure would not be difficult, I thought
> first; we can add a hardcoded pg_usleep() to delay the end of
> injection_wait() so as we make sure that the backend doing the wait
> reports later than the detach. Or just do the same at the end of
> injection_points_run() once the callback exits. I've sure done that,
> placing some strategic pg_usleep() calls on Linux to make the paths
> that matter in the wait slower, but the test remains stable.
Right, I did the same and observed the exact same behavior.
Still, it's possible to observe the s1 wait finishing after the s2 detach by
running this test manually (that is, creating 2 sessions and running the test
commands by hand) and either:
1. attaching a debugger to the first session (say, with a breakpoint in injection_wait()),
or
2. adding a hardcoded pg_usleep() in injection_wait().
So I think that the s1 wait finishing after the s2 detach is possible if
session 1 is "frozen" (gdb case) or slow enough (pg_usleep() case).
> The CI
> on Linux is stable as well: 3rd and 4th columns of the cfbot are
> green, I did not spot any failures related to this isolation test in
> injection_points. Only the second column about FreeBSD is going rogue
> on a periodic basis.
>
> One fix would be to remove this first permutation test, still that's
> hiding a problem rather than solving it, and something looks wrong
> with condition variables, specific to FreeBSD?
Hmm, if condition variables were broken on FreeBSD, we would probably observe
failures in other tests too, no?
> Any thoughts or comments?
I think that we cannot be 100% sure that the s1 wait will finish before the
s2 detach (easily reproducible with gdb attached to s1 or a hardcoded sleep) and
that other OSes could also report the test as failing for the same reason.
It's not ideal, but instead of removing this first permutation test, what about
adding a "sleep2" step to it (doing, say, SELECT pg_sleep(1);) and calling this
new step before the detach2 one?
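To make the idea concrete, here is a rough sketch of what the change to the
isolation spec could look like. It is only an illustration: wait1, wakeup2 and
detach2 are the steps of the existing permutation, sleep2 is the new
(hypothetical) step, and I'm assuming it would be declared in session s2 next
to the other s2 steps. The 1 second value is arbitrary.

```
# Hypothetical new step for session s2: gives the s1 wait time to
# finish and be reported before the detach is run.
step sleep2	{ SELECT pg_sleep(1); }

# First permutation, with the sleep placed before the detach.
permutation wait1 wakeup2 sleep2 detach2
```

This does not remove the race, it only makes it much less likely: a session 1
stalled for longer than the sleep would still report after the detach.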
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com