Thread: Test of a partition with an incomplete detach has a timing issue

Test of a partition with an incomplete detach has a timing issue

From
"osumi.takamichi@fujitsu.com"
Date:
Hi


While reviewing a patch in the community,
I encountered failures of make check-world on OSS HEAD while running it in a loop.
I was simply repeating make check-world, so this should be an existing issue.
In my environment, make check-world fails about once in 20 runs. I'd like to report this.

The failing run ended with the stderr messages below.

NOTICE:  database "regression" does not exist, skipping
make[2]: *** [check] Error 1
make[1]: *** [check-isolation-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

Also, I've gotten some logs left.
* src/test/isolation/output_iso/regression.out

test detach-partition-concurrently-1 ... ok          682 ms
test detach-partition-concurrently-2 ... ok          321 ms
test detach-partition-concurrently-3 ... FAILED     1084 ms
test detach-partition-concurrently-4 ... ok         1078 ms
test fk-contention                ... ok           77 ms

* src/test/isolation/output_iso/regression.diffs

diff -U3 /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
--- /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out     2021-05-24 03:30:15.735488295 +0000
+++ /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out   2021-05-24 04:46:48.851488295 +0000
@@ -12,9 +12,9 @@
 pg_cancel_backend

 t
-step s2detach: <... completed>
-error in steps s1cancel s2detach: ERROR:  canceling statement due to user request
 step s1c: COMMIT;
+step s2detach: <... completed>
+error in steps s1c s2detach: ERROR:  canceling statement due to user request
 step s1describe: SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_listp')
                                          UNION ALL SELECT 'd3_listp1', * FROM pg_partition_tree('d3_listp1');
 root           relid          parentrelid    isleaf         level

The steps I took:
1 - ./configure --enable-cassert --enable-debug --enable-tap-tests --with-icu CFLAGS=-O0 --prefix=/where/I/put/binary
2 - make -j2 2> make.log # no stderr output at this stage, of course
3 - make check-world -j8 2> make_check_world.log
    The first run succeeded, but repeating make check-world eventually failed.


Best Regards,
    Takamichi Osumi




Re: Test of a partition with an incomplete detach has a timing issue

From
Alvaro Herrera
Date:
On 2021-May-24, osumi.takamichi@fujitsu.com wrote:

> Also, I've gotten some logs left.
> * src/test/isolation/output_iso/regression.out
> 
> test detach-partition-concurrently-1 ... ok          682 ms
> test detach-partition-concurrently-2 ... ok          321 ms
> test detach-partition-concurrently-3 ... FAILED     1084 ms
> test detach-partition-concurrently-4 ... ok         1078 ms
> test fk-contention                ... ok           77 ms
> 
> * src/test/isolation/output_iso/regression.diffs
> 
> diff -U3 /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
> --- /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out     2021-05-24 03:30:15.735488295 +0000
> +++ /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out   2021-05-24 04:46:48.851488295 +0000
> @@ -12,9 +12,9 @@
>  pg_cancel_backend
>  
>  t              
> -step s2detach: <... completed>
> -error in steps s1cancel s2detach: ERROR:  canceling statement due to user request
>  step s1c: COMMIT;
> +step s2detach: <... completed>
> +error in steps s1c s2detach: ERROR:  canceling statement due to user request

Uh, how annoying.  If I understand correctly, I agree that this is a
timing issue: sometimes it is fast enough that the cancel is reported
together with its own step, but other times it takes longer so it is
reported with the next command of that session instead, s1c (commit).

I suppose a fix would imply that the error report waits until after the
"cancel" step is over, but I'm not sure how to do that.

Maybe we can change the "cancel" query to something like

SELECT pg_cancel_backend(pid), somehow_wait_for_detach_to_terminate() FROM d3_pid;

... where maybe that function can check the "state" column in s3's
pg_stat_activity row?  I'll give that a try.
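
For illustration, a minimal sketch of such a wait helper, assuming it just polls
the "state" column for the pid saved in d3_pid (the function name and polling
interval here are made up, not something that exists in the tree):

CREATE FUNCTION wait_until_not_active(p int) RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
  -- poll pg_stat_activity until the target backend stops running its query
  LOOP
    PERFORM 1 FROM pg_stat_activity WHERE pid = p AND state = 'active';
    EXIT WHEN NOT FOUND;
    PERFORM pg_sleep(0.01);
  END LOOP;
END;
$$;

-- the cancel step could then become something like
-- step "s1cancel" { SELECT pg_cancel_backend(pid), wait_until_not_active(pid) FROM d3_pid; }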

-- 
Álvaro Herrera                            39°49'30"S 73°17'W
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense."                        (Paul Thomas)



Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-May-24, osumi.takamichi@fujitsu.com wrote:
>> t
>> -step s2detach: <... completed>
>> -error in steps s1cancel s2detach: ERROR:  canceling statement due to user request
>> step s1c: COMMIT;
>> +step s2detach: <... completed>
>> +error in steps s1c s2detach: ERROR:  canceling statement due to user request

> Uh, how annoying.  If I understand correctly, I agree that this is a
> timing issue: sometimes it is fast enough that the cancel is reported
> together with its own step, but other times it takes longer so it is
> reported with the next command of that session instead, s1c (commit).

Yeah, we see such failures in the buildfarm with various isolation
tests; some recent examples:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-05-23%2019%3A43%3A04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-05-08%2006%3A34%3A13
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2021-04-29%2009%3A43%3A04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2021-04-22%2021%3A24%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-21%2010%3A38%3A32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fossa&dt=2021-04-08%2019%3A36%3A06

I remember having tried to rewrite the isolation tester to eliminate
the race condition, without success (and I don't seem to have kept
my notes, which now I regret).

However, the existing hazards seem to hit rarely enough to not be
much of a problem.  We might need to see if we can rejigger the
timing in this test to make it a little more stable.

            regards, tom lane



RE: Test of a partition with an incomplete detach has a timing issue

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, May 25, 2021 3:07 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-May-24, osumi.takamichi@fujitsu.com wrote:
> 
> > Also, I've gotten some logs left.
> > * src/test/isolation/output_iso/regression.out
> >
> > test detach-partition-concurrently-1 ... ok          682 ms
> > test detach-partition-concurrently-2 ... ok          321 ms
> > test detach-partition-concurrently-3 ... FAILED     1084 ms
> > test detach-partition-concurrently-4 ... ok         1078 ms
> > test fk-contention                ... ok           77 ms
> >
> > * src/test/isolation/output_iso/regression.diffs
> >
> > diff -U3 /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
> > --- /(where/I/put/PG)/src/test/isolation/expected/detach-partition-concurrently-3.out     2021-05-24 03:30:15.735488295 +0000
> > +++ /(where/I/put/PG)/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out   2021-05-24 04:46:48.851488295 +0000
> > @@ -12,9 +12,9 @@
> >  pg_cancel_backend
> >
> >  t
> > -step s2detach: <... completed>
> > -error in steps s1cancel s2detach: ERROR:  canceling statement due to user request
> >  step s1c: COMMIT;
> > +step s2detach: <... completed>
> > +error in steps s1c s2detach: ERROR:  canceling statement due to user request
> 
> Uh, how annoying.  If I understand correctly, I agree that this is a timing issue:
> sometimes it is fast enough that the cancel is reported together with its own
> step, but other times it takes longer so it is reported with the next command of
> that session instead, s1c (commit).
> 
> I suppose a fix would imply that the error report waits until after the "cancel"
> step is over, but I'm not sure how to do that.
> 
> Maybe we can change the "cancel" query to something like
> 
> SELECT pg_cancel_backend(pid), somehow_wait_for_detach_to_terminate()
> FROM d3_pid;
> 
> ... where maybe that function can check the "state" column in s3's
> pg_stat_activity row?  I'll give that a try.
Thank you so much for addressing this issue.


Best Regards,
    Takamichi Osumi


Re: Test of a partition with an incomplete detach has a timing issue

From
Michael Paquier
Date:
On Mon, May 24, 2021 at 02:07:12PM -0400, Alvaro Herrera wrote:
> I suppose a fix would imply that the error report waits until after the
> "cancel" step is over, but I'm not sure how to do that.
>
> Maybe we can change the "cancel" query to something like
>
> SELECT pg_cancel_backend(pid), somehow_wait_for_detach_to_terminate() FROM d3_pid;
>
> ... where maybe that function can check the "state" column in s3's
> pg_stat_activity row?  I'll give that a try.

Couldn't you achieve that with a small PL function in a way similar to
what 32a9c0b did, except that you track a different state in
pg_stat_activity for this PID?
--
Michael

Michael Paquier <michael@paquier.xyz> writes:
> On Mon, May 24, 2021 at 02:07:12PM -0400, Alvaro Herrera wrote:
>> Maybe we can change the "cancel" query to something like
>> SELECT pg_cancel_backend(pid), somehow_wait_for_detach_to_terminate() FROM d3_pid;
>> ... where maybe that function can check the "state" column in s3's
>> pg_stat_activity row?  I'll give that a try.

> Couldn't you achieve that with a small PL function in a way similar to
> what 32a9c0b did, except that you track a different state in
> pg_stat_activity for this PID?

The number of subsequent fixes to 32a9c0b seems to argue against
using that as a model :-(

The experiments I did awhile ago are coming back to me now.  I tried
a number of variations on this same theme, and none of them closed
the gap entirely.  The fundamental problem is that it's possible
for backend A to complete its transaction, and for backend B (which
is the isolationtester's monitoring session) to observe that A has
completed its transaction, and for B to report that fact to the
isolationtester, and for that report to arrive at the isolationtester
*before A's query result does*.  You need some bad luck for that
to happen, like A losing the CPU right before it flushes its output
buffer to the client, but I was able to demonstrate it fairly
repeatably.  (IIRC, the reason I was looking into this was that
the clobber-cache-always buildfarm critters were showing such
failures somewhat regularly.)

It doesn't really matter whether B's observation technique involves
locks (as now), or the pgstat activity table, or what.  Conceivably,
if we used the activity data, we could have A postpone updating its
state to "idle" until after it's flushed its buffer to the client.
But that would likely break things for other use-cases.  Moreover
it still guarantees nothing, really, because we're still at the
mercy of the kernel as to when it will choose to deliver network
packets.

So a completely bulletproof interlock seems out of reach.
Maybe something like what Alvaro's thinking of will get the
failure rate down to an acceptable level for most developers.
A simple "pg_sleep(1)" might have about the same effect for
much less work, though.

I do agree we need to do something.  I've had two or three failures
in those test cases in the past few days in my own manual check-world
runs, which is orders of magnitude worse than the previous
reliability.

            regards, tom lane



Re: Test of a partition with an incomplete detach has a timing issue

From
Noah Misch
Date:
On Mon, May 24, 2021 at 09:12:40PM -0400, Tom Lane wrote:
> The experiments I did awhile ago are coming back to me now.  I tried
> a number of variations on this same theme, and none of them closed
> the gap entirely.  The fundamental problem is that it's possible
> for backend A to complete its transaction, and for backend B (which
> is the isolationtester's monitoring session) to observe that A has
> completed its transaction, and for B to report that fact to the
> isolationtester, and for that report to arrive at the isolationtester
> *before A's query result does*.  You need some bad luck for that
> to happen, like A losing the CPU right before it flushes its output
> buffer to the client, but I was able to demonstrate it fairly
> repeatably.

> So a completely bulletproof interlock seems out of reach.

What if we had a standard that the step after the cancel shall send a query to
the backend that just received the cancel?  Something like:

--- a/src/test/isolation/specs/detach-partition-concurrently-3.spec
+++ b/src/test/isolation/specs/detach-partition-concurrently-3.spec
@@ -34,16 +34,18 @@ step "s1describe"    { SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_list
 session "s2"
 step "s2begin"        { BEGIN; }
 step "s2snitch"        { INSERT INTO d3_pid SELECT pg_backend_pid(); }
 step "s2detach"        { ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; }
+step "s2noop"        { UNLISTEN noop; }
+# TODO follow every instance of s1cancel w/ s2noop
 step "s2detach2"    { ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY; }
 step "s2detachfinal"    { ALTER TABLE d3_listp DETACH PARTITION d3_listp1 FINALIZE; }
 step "s2drop"        { DROP TABLE d3_listp1; }
 step "s2commit"        { COMMIT; }
 
 # Try various things while the partition is in "being detached" state, with
 # no session waiting.
-permutation "s2snitch" "s1b" "s1s" "s2detach" "s1cancel" "s1c" "s1describe" "s1alter"
+permutation "s2snitch" "s1b" "s1s" "s2detach" "s1cancel" "s2noop" "s1c" "s1describe" "s1alter"
 permutation "s2snitch" "s1b" "s1s" "s2detach" "s1cancel" "s1insert" "s1c"
 permutation "s2snitch" "s1brr" "s1s" "s2detach" "s1cancel" "s1insert" "s1c" "s1spart"
 permutation "s2snitch" "s1b" "s1s" "s2detach" "s1cancel" "s1c" "s1insertpart"
 



Re: Test of a partition with an incomplete detach has a timing issue

From
Alvaro Herrera
Date:
So I had a hard time reproducing the problem, until I realized that I
needed to limit the server to use only one CPU, and in addition run some
other stuff concurrently in the same server in order to keep it busy.
With that, I see about one failure every 10 runs.

So I start the server as "numactl -C0 postmaster", then another terminal
with an infinite loop doing "make -C src/test/regress installcheck-parallel";
and a third terminal doing this

while [ $? == 0 ]; do
  ../../../src/test/isolation/pg_isolation_regress \
    --inputdir=/pgsql/source/master/src/test/isolation \
    --outputdir=output_iso \
    --bindir='/pgsql/install/master/bin' \
    detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 \
    detach-partition-concurrently-3
done
 

With the test unpatched, I get about one failure in the set.

On 2021-May-24, Noah Misch wrote:

> What if we had a standard that the step after the cancel shall send a query to
> the backend that just received the cancel?  Something like:

Hmm ... I don't understand why this fixes the problem, but it
drastically reduces the probability.  Here's a complete patch.  I got
about one failure in 1000 instead of 1 in 10.  The new failure looks
like this:

diff -U3 /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
--- /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out    2021-05-25 11:12:42.333987835 -0400
+++ /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out    2021-05-25 11:19:03.714947775 -0400
@@ -13,7 +13,7 @@
 
 t              
 step s2detach: <... completed>
-error in steps s1cancel s2detach: ERROR:  canceling statement due to user request
+ERROR:  canceling statement due to user request
 step s2noop: UNLISTEN noop;
 step s1c: COMMIT;
 step s1describe: SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_listp')


I find this a bit weird and I'm wondering if it could be an
isolationtester bug -- why is it not attributing the error message to
any steps?

The problem disappears completely if I add a sleep to the cancel query:

step "s1cancel"     { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM d3_pid; }

I suppose a 0.01 second sleep is not going to be sufficient to close the
problem in slower animals, but I hesitate to propose a much longer sleep
because this test has 18 permutations so even a one second sleep adds
quite a lot of (mostly useless) test runtime.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W



Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> The problem disappears completely if I add a sleep to the cancel query:
> step "s1cancel"     { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM d3_pid; }
> I suppose a 0.01 second sleep is not going to be sufficient to close the
> problem in slower animals, but I hesitate to propose a much longer sleep
> because this test has 18 permutations so even a one second sleep adds
> quite a lot of (mostly useless) test runtime.

Yeah ... maybe 0.1 second is the right tradeoff?

Note that on slow (like CCA) animals, the extra query required by
Noah's suggestion is likely to take more than 0.1 second.

            regards, tom lane



Re: Test of a partition with an incomplete detach has a timing issue

From
Alvaro Herrera
Date:
On 2021-May-25, Tom Lane wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > The problem disappears completely if I add a sleep to the cancel query:
> > step "s1cancel"     { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM d3_pid; }
> > I suppose a 0.01 second sleep is not going to be sufficient to close the
> > problem in slower animals, but I hesitate to propose a much longer sleep
> > because this test has 18 permutations so even a one second sleep adds
> > quite a lot of (mostly useless) test runtime.
> 
> Yeah ... maybe 0.1 second is the right tradeoff?

Pushed with a 0.1 sleep, and some commentary.
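
That is, the cancel step now reads roughly like this (modulo the exact commentary):

step "s1cancel"	{ SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid; }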

> Note that on slow (like CCA) animals, the extra query required by
> Noah's suggestion is likely to take more than 0.1 second.

Hmm, but the sleep is to compete with the cancelling of detach, not with
the noop query.

I tried running the test under CCA here and it didn't fail, but of
course that's not a guarantee of anything since it only completed one
iteration.

-- 
Álvaro Herrera       Valdivia, Chile



Re: Test of a partition with an incomplete detach has a timing issue

From
Noah Misch
Date:
On Tue, May 25, 2021 at 11:32:38AM -0400, Alvaro Herrera wrote:
> On 2021-May-24, Noah Misch wrote:
> > What if we had a standard that the step after the cancel shall send a query to
> > the backend that just received the cancel?  Something like:
> 
> Hmm ... I don't understand why this fixes the problem, but it
> drastically reduces the probability.

This comment, from run_permutation(), explains:

        /*
         * Check whether the session that needs to perform the next step is
         * still blocked on an earlier step.  If so, wait for it to finish.
         *
         * (In older versions of this tool, we allowed precisely one session
         * to be waiting at a time.  If we reached a step that required that
         * session to execute the next command, we would declare the whole
         * permutation invalid, cancel everything, and move on to the next
         * one.  Unfortunately, that made it impossible to test the deadlock
         * detector using this framework, unless the number of processes
         * involved in the deadlock was precisely two.  We now assume that if
         * we reach a step that is still blocked, we need to wait for it to
         * unblock itself.)
         */

> Here's a complete patch.  I got
> about one failure in 1000 instead of 1 in 10.  The new failure looks
> like this:
> 
> diff -U3 /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
> --- /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out    2021-05-25 11:12:42.333987835 -0400
> +++ /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out    2021-05-25 11:19:03.714947775 -0400
> @@ -13,7 +13,7 @@
>  
>  t              
>  step s2detach: <... completed>
> -error in steps s1cancel s2detach: ERROR:  canceling statement due to user request

I'm guessing this is:
report_multiple_error_messages("s1cancel", 1, {"s2detach"})

> +ERROR:  canceling statement due to user request

And this is:
report_multiple_error_messages("s2detach", 0, {})

>  step s2noop: UNLISTEN noop;
>  step s1c: COMMIT;
>  step s1describe: SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_listp')
> 
> 
> I find this a bit weird and I'm wondering if it could be an
> isolationtester bug -- why is it not attributing the error message to
> any steps?

I agree that looks like an isolationtester bug.  isolationtester already
printed the tuples from s1cancel, so s1cancel should be considered finished.
This error message emission should never have any step name prefixing it.

Thanks,
nm