Re: Test of a partition with an incomplete detach has a timing issue - Mailing list pgsql-hackers
From | Alvaro Herrera
---|---
Subject | Re: Test of a partition with an incomplete detach has a timing issue
Date |
Msg-id | 20210525153238.GA17744@alvherre.pgsql
In response to | Re: Test of a partition with an incomplete detach has a timing issue (Noah Misch <noah@leadboat.com>)
Responses | Re: Test of a partition with an incomplete detach has a timing issue; Re: Test of a partition with an incomplete detach has a timing issue
List | pgsql-hackers
So I had a hard time reproducing the problem, until I realized that I needed to limit the server to use only one CPU, and in addition run some other stuff concurrently in the same server in order to keep it busy. With that, I see about one failure every 10 runs.

So I start the server as "numactl -C0 postmaster", then another terminal with an infinite loop doing "make -C src/test/regress installcheck-parallel"; and a third terminal doing this:

    while [ $? == 0 ]; do
      ../../../src/test/isolation/pg_isolation_regress \
        --inputdir=/pgsql/source/master/src/test/isolation \
        --outputdir=output_iso \
        --bindir='/pgsql/install/master/bin' \
        detach-partition-concurrently-3 detach-partition-concurrently-3 \
        detach-partition-concurrently-3 detach-partition-concurrently-3 \
        detach-partition-concurrently-3 detach-partition-concurrently-3 \
        detach-partition-concurrently-3 detach-partition-concurrently-3 \
        detach-partition-concurrently-3 detach-partition-concurrently-3 \
        detach-partition-concurrently-3 detach-partition-concurrently-3 \
        detach-partition-concurrently-3
    done

With the test unpatched, I get about one failure in the set.

On 2021-May-24, Noah Misch wrote:

> What if we had a standard that the step after the cancel shall send a query to
> the backend that just received the cancel?  Something like:

Hmm ... I don't understand why this fixes the problem, but it drastically reduces the probability. Here's a complete patch. I got about one failure in 1000 instead of 1 in 10.
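For concreteness, the convention Noah proposes would look roughly like this in the isolation spec. This is a sketch reconstructed from the step names visible in the test output (s1cancel, s2noop, d3_pid); the exact spec wording and permutation are assumptions, not the committed test:

```
# Hypothetical sketch of the proposed convention: immediately after the
# cancel step, run a trivial query in the session whose backend was just
# cancelled, so isolationtester waits until the cancel has been processed
# before moving to the next step.
step "s1cancel"  { SELECT pg_cancel_backend(pid) FROM d3_pid; }
step "s2noop"    { UNLISTEN noop; }

# e.g. ... "s2detach" "s1cancel" "s2noop" ...
```

The idea is that the follow-up query cannot be answered until the cancelled backend has finished handling the cancel, which serializes the two events that are otherwise racing.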
The new failure looks like this:

diff -U3 /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
--- /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out	2021-05-25 11:12:42.333987835 -0400
+++ /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out	2021-05-25 11:19:03.714947775 -0400
@@ -13,7 +13,7 @@
 t
 step s2detach: <... completed>
-error in steps s1cancel s2detach: ERROR:  canceling statement due to user request
+ERROR:  canceling statement due to user request
 step s2noop: UNLISTEN noop;
 step s1c: COMMIT;
 step s1describe: SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_listp')

I find this a bit weird and I'm wondering if it could be an isolationtester bug -- why is it not attributing the error message to any steps?

The problem disappears completely if I add a sleep to the cancel query:

step "s1cancel"	{ SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM d3_pid; }

I suppose a 0.01 second sleep is not going to be sufficient to close the problem in slower animals, but I hesitate to propose a much longer sleep, because this test has 18 permutations, so even a one-second sleep adds quite a lot of (mostly useless) test runtime.

--
Álvaro Herrera       39°49'30"S 73°17'W