Thread: Test of a partition with an incomplete detach has a timing issue
From: "osumi.takamichi@fujitsu.com"
Hi,

During my review of a patch in the community, I encountered failures of OSS HEAD's
"make check-world" when running it in a continuous loop. I was simply repeating
make check-world, so this should be an existing issue. In my environment,
make check-world fails about once in 20 runs, so I'd like to report it.

The failing run ended with the stderr messages below:

NOTICE:  database "regression" does not exist, skipping
make[2]: *** [check] Error 1
make[1]: *** [check-isolation-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

Also, I have some logs left:

* src/test/isolation/output_iso/regression.out

test detach-partition-concurrently-1 ... ok          682 ms
test detach-partition-concurrently-2 ... ok          321 ms
test detach-partition-concurrently-3 ... FAILED     1084 ms
test detach-partition-concurrently-4 ... ok         1078 ms
test fk-contention                   ... ok           77 ms

* src/test/isolation/output_iso/regression.diffs

diff -U3 /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
--- /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out	2021-05-24 03:30:15.735488295 +0000
+++ /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out	2021-05-24 04:46:48.851488295 +0000
@@ -12,9 +12,9 @@
 pg_cancel_backend
 
 t
-step s2detach: <... completed>
-error in steps s1cancel s2detach: ERROR: canceling statement due to user request
 step s1c: COMMIT;
+step s2detach: <... completed>
+error in steps s1c s2detach: ERROR: canceling statement due to user request
 step s1describe: SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_listp') UNION ALL SELECT 'd3_listp1', * FROM pg_partition_tree('d3_listp1');
 root       relid      parentrelid isleaf     level

The steps I took:

1 - ./configure --enable-cassert --enable-debug --enable-tap-tests --with-icu CFLAGS=-O0 --prefix=/where/I/put/binary
2 - make -j2 2> make.log            # no stderr output at this stage, of course
3 - make check-world -j8 2> make_check_world.log

The first run succeeded, but repeating make check-world eventually failed.

Best Regards,
	Takamichi Osumi
On 2021-May-24, osumi.takamichi@fujitsu.com wrote:

> Also, I've gotten some logs left.
> * src/test/isolation/output_iso/regression.out
>
> test detach-partition-concurrently-1 ... ok          682 ms
> test detach-partition-concurrently-2 ... ok          321 ms
> test detach-partition-concurrently-3 ... FAILED     1084 ms
> test detach-partition-concurrently-4 ... ok         1078 ms
> test fk-contention                   ... ok           77 ms
>
> * src/test/isolation/output_iso/regression.diffs
>
> diff -U3 /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
> --- /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out	2021-05-24 03:30:15.735488295 +0000
> +++ /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out	2021-05-24 04:46:48.851488295 +0000
> @@ -12,9 +12,9 @@
>  pg_cancel_backend
>
>  t
> -step s2detach: <... completed>
> -error in steps s1cancel s2detach: ERROR: canceling statement due to user request
>  step s1c: COMMIT;
> +step s2detach: <... completed>
> +error in steps s1c s2detach: ERROR: canceling statement due to user request

Uh, how annoying.  If I understand correctly, I agree that this is a
timing issue: sometimes it is fast enough that the cancel is reported
together with its own step, but other times it takes longer so it is
reported with the next command of that session instead, s1c (commit).

I suppose a fix would imply that the error report waits until after the
"cancel" step is over, but I'm not sure how to do that.

Maybe we can change the "cancel" query to something like

SELECT pg_cancel_backend(pid), somehow_wait_for_detach_to_terminate() FROM d3_pid;

... where maybe that function can check the "state" column in s3's
pg_stat_activity row?  I'll give that a try.

--
Álvaro Herrera       39°49'30"S 73°17'W
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any
key sequence which makes any sense."                        (Paul Thomas)
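To make that idea concrete, such a wait function could be a small PL/pgSQL
routine that polls pg_stat_activity for the pid stored in d3_pid.  The sketch
below is illustrative only: the function name wait_for_detach_to_terminate()
and the 10 ms poll interval are assumptions for this example, not part of any
posted patch.

-- Hypothetical helper, sketched for illustration only: poll pg_stat_activity
-- until the given backend is no longer actively running a statement, so the
-- canceled DETACH gets reported together with its own step.
CREATE FUNCTION wait_for_detach_to_terminate(target_pid int) RETURNS boolean
LANGUAGE plpgsql AS $$
BEGIN
  LOOP
    PERFORM 1 FROM pg_stat_activity
     WHERE pid = target_pid AND state = 'active';
    EXIT WHEN NOT FOUND;      -- the backend finished (or aborted) its query
    PERFORM pg_sleep(0.01);   -- brief pause between polls; duration is arbitrary
  END LOOP;
  RETURN true;
END;
$$;

Such a helper would then be called from the cancel step, along the lines of
"SELECT pg_cancel_backend(pid), wait_for_detach_to_terminate(pid) FROM d3_pid;".
Even so, the canceled backend can appear idle in pg_stat_activity before its
error message has been flushed to the client, so polling of this kind cannot
close the window completely, as Tom explains below.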
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-May-24, osumi.takamichi@fujitsu.com wrote:
>> t
>> -step s2detach: <... completed>
>> -error in steps s1cancel s2detach: ERROR: canceling statement due to user request
>> step s1c: COMMIT;
>> +step s2detach: <... completed>
>> +error in steps s1c s2detach: ERROR: canceling statement due to user request

> Uh, how annoying.  If I understand correctly, I agree that this is a
> timing issue: sometimes it is fast enough that the cancel is reported
> together with its own step, but other times it takes longer so it is
> reported with the next command of that session instead, s1c (commit).

Yeah, we see such failures in the buildfarm with various isolation tests;
some recent examples:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-23%2019%3A43%3A04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-05-08%2006%3A34%3A13
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-04-29%2009%3A43%3A04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-04-22%2021%3A24%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-21%2010%3A38%3A32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fossa&dt=2021-04-08%2019%3A36%3A06

I remember having tried to rewrite the isolation tester to eliminate the
race condition, without success (and I don't seem to have kept my notes,
which now I regret).  However, the existing hazards seem to hit rarely
enough to not be much of a problem.  We might need to see if we can
rejigger the timing in this test to make it a little more stable.

			regards, tom lane
RE: Test of a partition with an incomplete detach has a timing issue
From: "osumi.takamichi@fujitsu.com"
On Tuesday, May 25, 2021 3:07 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-May-24, osumi.takamichi@fujitsu.com wrote:
>
> > Also, I've gotten some logs left.
> > * src/test/isolation/output_iso/regression.out
> >
> > test detach-partition-concurrently-1 ... ok          682 ms
> > test detach-partition-concurrently-2 ... ok          321 ms
> > test detach-partition-concurrently-3 ... FAILED     1084 ms
> > test detach-partition-concurrently-4 ... ok         1078 ms
> > test fk-contention                   ... ok           77 ms
> >
> > * src/test/isolation/output_iso/regression.diffs
> >
> > diff -U3 /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
> > --- /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out	2021-05-24 03:30:15.735488295 +0000
> > +++ /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out	2021-05-24 04:46:48.851488295 +0000
> > @@ -12,9 +12,9 @@
> >  pg_cancel_backend
> >
> >  t
> > -step s2detach: <... completed>
> > -error in steps s1cancel s2detach: ERROR: canceling statement due to user request
> >  step s1c: COMMIT;
> > +step s2detach: <... completed>
> > +error in steps s1c s2detach: ERROR: canceling statement due to user request
>
> Uh, how annoying.  If I understand correctly, I agree that this is a timing issue:
> sometimes it is fast enough that the cancel is reported together with its own
> step, but other times it takes longer so it is reported with the next command of
> that session instead, s1c (commit).
>
> I suppose a fix would imply that the error report waits until after the "cancel"
> step is over, but I'm not sure how to do that.
>
> Maybe we can change the "cancel" query to something like
>
> SELECT pg_cancel_backend(pid), somehow_wait_for_detach_to_terminate()
> FROM d3_pid;
>
> ... where maybe that function can check the "state" column in s3's
> pg_stat_activity row?  I'll give that a try.
Thank you so much for addressing this issue.

Best Regards,
	Takamichi Osumi
On Mon, May 24, 2021 at 02:07:12PM -0400, Alvaro Herrera wrote:
> I suppose a fix would imply that the error report waits until after the
> "cancel" step is over, but I'm not sure how to do that.
>
> Maybe we can change the "cancel" query to something like
>
> SELECT pg_cancel_backend(pid), somehow_wait_for_detach_to_terminate() FROM d3_pid;
>
> ... where maybe that function can check the "state" column in s3's
> pg_stat_activity row?  I'll give that a try.

Couldn't you achieve that with a small PL function in a way similar to
what 32a9c0b did, except that you track a different state in
pg_stat_activity for this PID?
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, May 24, 2021 at 02:07:12PM -0400, Alvaro Herrera wrote:
>> Maybe we can change the "cancel" query to something like
>> SELECT pg_cancel_backend(pid), somehow_wait_for_detach_to_terminate() FROM d3_pid;
>> ... where maybe that function can check the "state" column in s3's
>> pg_stat_activity row?  I'll give that a try.

> Couldn't you achieve that with a small PL function in a way similar to
> what 32a9c0b did, except that you track a different state in
> pg_stat_activity for this PID?

The number of subsequent fixes to 32a9c0b seems to argue against using
that as a model :-(

The experiments I did awhile ago are coming back to me now.  I tried a
number of variations on this same theme, and none of them closed the gap
entirely.  The fundamental problem is that it's possible for backend A
to complete its transaction, and for backend B (which is the
isolationtester's monitoring session) to observe that A has completed
its transaction, and for B to report that fact to the isolationtester,
and for that report to arrive at the isolationtester *before A's query
result does*.  You need some bad luck for that to happen, like A losing
the CPU right before it flushes its output buffer to the client, but I
was able to demonstrate it fairly repeatably.  (IIRC, the reason I was
looking into this was that the clobber-cache-always buildfarm critters
were showing such failures somewhat regularly.)

It doesn't really matter whether B's observation technique involves
locks (as now), or the pgstat activity table, or what.  Conceivably, if
we used the activity data, we could have A postpone updating its state
to "idle" until after it's flushed its buffer to the client.  But that
would likely break things for other use-cases.  Moreover it still
guarantees nothing, really, because we're still at the mercy of the
kernel as to when it will choose to deliver network packets.

So a completely bulletproof interlock seems out of reach.  Maybe
something like what Alvaro's thinking of will get the failure rate down
to an acceptable level for most developers.  A simple "pg_sleep(1)"
might have about the same effect for much less work, though.

I do agree we need to do something.  I've had two or three failures in
those test cases in the past few days in my own manual check-world runs,
which is orders of magnitude worse than the previous reliability.

			regards, tom lane
On Mon, May 24, 2021 at 09:12:40PM -0400, Tom Lane wrote:
> The experiments I did awhile ago are coming back to me now.  I tried a
> number of variations on this same theme, and none of them closed the gap
> entirely.  The fundamental problem is that it's possible for backend A
> to complete its transaction, and for backend B (which is the
> isolationtester's monitoring session) to observe that A has completed
> its transaction, and for B to report that fact to the isolationtester,
> and for that report to arrive at the isolationtester *before A's query
> result does*.  You need some bad luck for that to happen, like A losing
> the CPU right before it flushes its output buffer to the client, but I
> was able to demonstrate it fairly repeatably.

> So a completely bulletproof interlock seems out of reach.

What if we had a standard that the step after the cancel shall send a query
to the backend that just received the cancel?  Something like:

--- a/src/test/isolation/specs/detach-partition-concurrently-3.spec
+++ b/src/test/isolation/specs/detach-partition-concurrently-3.spec
@@ -34,16 +34,18 @@ step "s1describe" { SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_list
 
 session "s2"
 step "s2begin"       { BEGIN; }
 step "s2snitch"      { INSERT INTO d3_pid SELECT pg_backend_pid(); }
 step "s2detach"      { ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; }
+step "s2noop"        { UNLISTEN noop; }
+# TODO follow every instance of s1cancel w/ s2noop
 step "s2detach2"     { ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY; }
 step "s2detachfinal" { ALTER TABLE d3_listp DETACH PARTITION d3_listp1 FINALIZE; }
 step "s2drop"        { DROP TABLE d3_listp1; }
 step "s2commit"      { COMMIT; }
 
 # Try various things while the partition is in "being detached" state, with
 # no session waiting.
-permutation "s2snitch" "s1b" "s1s" "s2detach" "s1cancel" "s1c" "s1describe" "s1alter"
+permutation "s2snitch" "s1b" "s1s" "s2detach" "s1cancel" "s2noop" "s1c" "s1describe" "s1alter"
 permutation "s2snitch" "s1b" "s1s" "s2detach" "s1cancel" "s1insert" "s1c"
 permutation "s2snitch" "s1brr" "s1s" "s2detach" "s1cancel" "s1insert" "s1c" "s1spart"
 permutation "s2snitch" "s1b" "s1s" "s2detach" "s1cancel" "s1c" "s1insertpart"
So I had a hard time reproducing the problem, until I realized that I
needed to limit the server to use only one CPU, and in addition run some
other stuff concurrently in the same server in order to keep it busy.
With that, I see about one failure every 10 runs.

So I start the server as "numactl -C0 postmaster", then another terminal
with an infinite loop doing "make -C src/test/regress installcheck-parallel";
and a third terminal doing this

while [ $? == 0 ]; do
  ../../../src/test/isolation/pg_isolation_regress \
    --inputdir=/pgsql/source/master/src/test/isolation \
    --outputdir=output_iso \
    --bindir='/pgsql/install/master/bin' \
    detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3
done

With the test unpatched, I get about one failure in the set.

On 2021-May-24, Noah Misch wrote:

> What if we had a standard that the step after the cancel shall send a query
> to the backend that just received the cancel?  Something like:

Hmm ... I don't understand why this fixes the problem, but it drastically
reduces the probability.  Here's a complete patch.  I got about one
failure in 1000 instead of 1 in 10.  The new failure looks like this:

diff -U3 /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
--- /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out	2021-05-25 11:12:42.333987835 -0400
+++ /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out	2021-05-25 11:19:03.714947775 -0400
@@ -13,7 +13,7 @@
 
 t
 step s2detach: <... completed>
-error in steps s1cancel s2detach: ERROR: canceling statement due to user request
+ERROR: canceling statement due to user request
 step s2noop: UNLISTEN noop;
 step s1c: COMMIT;
 step s1describe: SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_listp')

I find this a bit weird and I'm wondering if it could be an
isolationtester bug -- why is it not attributing the error message to
any steps?

The problem disappears completely if I add a sleep to the cancel query:

step "s1cancel" { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM d3_pid; }

I suppose a 0.01 second sleep is not going to be sufficient to close the
problem in slower animals, but I hesitate to propose a much longer sleep
because this test has 18 permutations, so even a one second sleep adds
quite a lot of (mostly useless) test runtime.

--
Álvaro Herrera       39°49'30"S 73°17'W
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> The problem disappears completely if I add a sleep to the cancel query:
> step "s1cancel" { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM d3_pid; }

> I suppose a 0.01 second sleep is not going to be sufficient to close the
> problem in slower animals, but I hesitate to propose a much longer sleep
> because this test has 18 permutations so even a one second sleep adds
> quite a lot of (mostly useless) test runtime.

Yeah ... maybe 0.1 second is the right tradeoff?

Note that on slow (like CCA) animals, the extra query required by Noah's
suggestion is likely to take more than 0.1 second.

			regards, tom lane
On 2021-May-25, Tom Lane wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > The problem disappears completely if I add a sleep to the cancel query:
> > step "s1cancel" { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM d3_pid; }
> > I suppose a 0.01 second sleep is not going to be sufficient to close the
> > problem in slower animals, but I hesitate to propose a much longer sleep
> > because this test has 18 permutations so even a one second sleep adds
> > quite a lot of (mostly useless) test runtime.
>
> Yeah ... maybe 0.1 second is the right tradeoff?

Pushed with a 0.1 sleep, and some commentary.

> Note that on slow (like CCA) animals, the extra query required by
> Noah's suggestion is likely to take more than 0.1 second.

Hmm, but the sleep is to compete with the cancelling of detach, not
with the noop query.

I tried running the test under CCA here and it didn't fail, but of
course that's not a guarantee of anything since it only completed one
iteration.

--
Álvaro Herrera               Valdivia, Chile
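For reference, with the 0.1 second value the cancel step presumably ends up
along these lines; this is only a sketch of the committed change, and the
exact committed text and accompanying comments may differ:

step "s1cancel" { SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; }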
On Tue, May 25, 2021 at 11:32:38AM -0400, Alvaro Herrera wrote:
> On 2021-May-24, Noah Misch wrote:
> > What if we had a standard that the step after the cancel shall send a query
> > to the backend that just received the cancel?  Something like:
>
> Hmm ... I don't understand why this fixes the problem, but it
> drastically reduces the probability.

This comment, from run_permutation(), explains:

	/*
	 * Check whether the session that needs to perform the next step is
	 * still blocked on an earlier step.  If so, wait for it to finish.
	 *
	 * (In older versions of this tool, we allowed precisely one session
	 * to be waiting at a time.  If we reached a step that required that
	 * session to execute the next command, we would declare the whole
	 * permutation invalid, cancel everything, and move on to the next
	 * one.  Unfortunately, that made it impossible to test the deadlock
	 * detector using this framework, unless the number of processes
	 * involved in the deadlock was precisely two.  We now assume that if
	 * we reach a step that is still blocked, we need to wait for it to
	 * unblock itself.)
	 */

> Here's a complete patch.  I got about one
> failure in 1000 instead of 1 in 10.  The new failure looks like this:
>
> diff -U3 /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
> --- /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out	2021-05-25 11:12:42.333987835 -0400
> +++ /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out	2021-05-25 11:19:03.714947775 -0400
> @@ -13,7 +13,7 @@
>
>  t
>  step s2detach: <... completed>
> -error in steps s1cancel s2detach: ERROR: canceling statement due to user request

I'm guessing this is: report_multiple_error_messages("s1cancel", 1, {"s2detach"})

> +ERROR: canceling statement due to user request

And this is: report_multiple_error_messages("s2detach", 0, {})

> step s2noop: UNLISTEN noop;
> step s1c: COMMIT;
> step s1describe: SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_listp')
>
> I find this a bit weird and I'm wondering if it could be an
> isolationtester bug -- why is it not attributing the error message to
> any steps?

I agree that looks like an isolationtester bug.  isolationtester already
printed the tuples from s1cancel, so s1cancel should be considered finished.
This error message emission should never have any step name prefixing it.

Thanks,
nm