Thread: xid_wraparound tests intermittent failure.
I noticed this when working on the PostgreSQL::Test::Session project I have in hand. All the tests pass except occasionally the xid_wraparound tests fail. It's not always the same test script that fails either. I tried everything but couldn't make the failure stop. So then I switched out my patch so it's running on plain master and set things running in a loop. Lo and behold, it can be relied on to fail after only a few iterations. In the latest iteration the failure looks like this:

stderr:
# poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# 	SELECT *
# 	FROM pg_database
# 	WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 1.
(test program exited with status code 29)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
Summary of Failures:

295/295 postgresql:xid_wraparound / xid_wraparound/001_emergency_vacuum ERROR 211.76s exit status 29

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > I noticed this when working on the PostgreSQL::Test::Session project I > have in hand. All the tests pass except occasionally the xid_wraparound > tests fail. It's not always the same test script that fails either. I > tried everything but couldn't make the failure stop. So then I switched > out my patch so it's running on plain master and set things running in a > loop. Lo and behold it can be relied on to fail after only a few > iterations. I have been noticing xid_wraparound failures in the buildfarm too. They seemed quite infrequent, but it wasn't till just now that I realized that xid_wraparound is not run by default. (You have to put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the only buildfarm animals that have enabled it are dodo and perentie. dodo is failing this test fairly often: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD perentie doesn't seem to be having a problem, but I will bet that part of the reason is it's running with cranked-up timeouts: 'build_env' => { 'PG_TEST_EXTRA' => 'xid_wraparound', 'PG_TEST_TIMEOUT_DEFAULT' => '360' }, One thing that seems quite interesting is that the test seems to take about 10 minutes when successful on dodo, but when it fails it's twice that. Why the instability? (Perhaps dodo has highly variable background load, and the thing simply times out in some runs but not others?) Locally, I've not managed to reproduce the failure yet; so perhaps there is some platform dependency. What are you testing on? regards, tom lane
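[As a sketch for anyone reproducing this locally: since the suite is opt-in, enabling it and raising the TAP timeout to match perentie's settings quoted above could look like the following. PG_TEST_TIMEOUT_DEFAULT normally defaults to 180 seconds; the values here simply mirror perentie's configuration.]

```shell
# Enable the opt-in xid_wraparound suite and give a slow machine more
# headroom, mirroring perentie's buildfarm settings quoted above.
export PG_TEST_EXTRA="xid_wraparound"
export PG_TEST_TIMEOUT_DEFAULT=360   # stock default is 180 seconds
```

[With those set, a meson build would run it via `meson test --suite xid_wraparound`, as used later in this thread.]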
On 2024-07-21 Su 1:34 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I noticed this when working on the PostgreSQL::Test::Session project I >> have in hand. All the tests pass except occasionally the xid_wraparound >> tests fail. It's not always the same test script that fails either. I >> tried everything but couldn't make the failure stop. So then I switched >> out my patch so it's running on plain master and set things running in a >> loop. Lo and behold it can be relied on to fail after only a few >> iterations. > I have been noticing xid_wraparound failures in the buildfarm too. > They seemed quite infrequent, but it wasn't till just now that > I realized that xid_wraparound is not run by default. (You have to > put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the > only buildfarm animals that have enabled it are dodo and perentie. > dodo is failing this test fairly often: > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD > > perentie doesn't seem to be having a problem, but I will bet that > part of the reason is it's running with cranked-up timeouts: > > 'build_env' => { > 'PG_TEST_EXTRA' => 'xid_wraparound', > 'PG_TEST_TIMEOUT_DEFAULT' => '360' > }, > > One thing that seems quite interesting is that the test seems to > take about 10 minutes when successful on dodo, but when it fails > it's twice that. Why the instability? (Perhaps dodo has highly > variable background load, and the thing simply times out in some > runs but not others?) > > Locally, I've not managed to reproduce the failure yet; so perhaps > there is some platform dependency. What are you testing on? Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux It's a VM running on UTM/Apple Silicon cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hello, 21.07.2024 20:34, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I noticed this when working on the PostgreSQL::Test::Session project I >> have in hand. All the tests pass except occasionally the xid_wraparound >> tests fail. It's not always the same test script that fails either. I >> tried everything but couldn't make the failure stop. So then I switched >> out my patch so it's running on plain master and set things running in a >> loop. Lo and behold it can be relied on to fail after only a few >> iterations. > I have been noticing xid_wraparound failures in the buildfarm too. > They seemed quite infrequent, but it wasn't till just now that > I realized that xid_wraparound is not run by default. (You have to > put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the > only buildfarm animals that have enabled it are dodo and perentie. > dodo is failing this test fairly often: > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD I think this failure is counted at [1]. Please look at the linked message [2], where I described what makes the test fail. [1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing [2] https://www.postgresql.org/message-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com Best regards, Alexander
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
>> Locally, I've not managed to reproduce the failure yet; so perhaps
>> there is some platform dependency. What are you testing on?

> Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
> 2024 aarch64 aarch64 aarch64 GNU/Linux
> It's a VM running on UTM/Apple Silicon

Hmm, doesn't sound like that ought to be slow.

I did manage to reproduce dodo's failures by running xid_wraparound manually on mamba's very slow host:

$ time make -s installcheck PROVE_FLAGS=--timer
# +++ tap install-check in src/test/modules/xid_wraparound +++
[13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# 	SELECT *
# 	FROM pg_database
# 	WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 4 just after 1.
[13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
All 1 subtests passed
[14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
[14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
[14:31:16] t/003_wraparounds.pl ....... ok  7564763 ms ( 0.00 usr  0.01 sys + 13.82 cusr  9.26 csys = 23.09 CPU)
[16:37:21]

Test Summary Report
-------------------
t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
  Non-zero exit status: 4
  Parse errors: No plan found in TAP output
t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output
Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
    10772.99 real        59.34 user        60.14 sys

Each of those two failures looks just like something that dodo has shown at one time or another. So it's at least plausible that "slow machine" is the whole explanation. I'm still wondering though if there's some effect that causes the test's runtime to be unstable in itself, sometimes leading to timeouts.

regards, tom lane
On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote: > https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures This is great. Thanks for collating all this info here! And of course all the research and reports behind it.
On 2024-07-21 Su 4:08 PM, Alexander Lakhin wrote:
> Hello,
>
> 21.07.2024 20:34, Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> I noticed this when working on the PostgreSQL::Test::Session project I have in hand. All the tests pass except occasionally the xid_wraparound tests fail. It's not always the same test script that fails either. I tried everything but couldn't make the failure stop. So then I switched out my patch so it's running on plain master and set things running in a loop. Lo and behold it can be relied on to fail after only a few iterations.
>> I have been noticing xid_wraparound failures in the buildfarm too. They seemed quite infrequent, but it wasn't till just now that I realized that xid_wraparound is not run by default. (You have to put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the only buildfarm animals that have enabled it are dodo and perentie. dodo is failing this test fairly often:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
>
> I think this failure is counted at [1]. Please look at the linked message [2], where I described what makes the test fail.
>
> [1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing
> [2] https://www.postgresql.org/message-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com

It's sad nothing has happened about this for 2 months. There's no point in having unreliable tests. What's not 100% clear to me is whether this failure indicates a badly formulated test, or whether the test is correct and has identified an underlying bug.

Regarding the point in [2] about the test being run twice in buildfarm clients, I think we should mark the module as NO_INSTALLCHECK in the Makefile.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
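[A minimal sketch of what that Makefile change could look like, following the pattern other src/test/modules Makefiles use. Only the NO_INSTALLCHECK line is the point being made; the surrounding lines are illustrative, not the module's actual Makefile.]

```makefile
# Hypothetical fragment of src/test/modules/xid_wraparound/Makefile
TAP_TESTS = 1

# Skip this module during "make installcheck", so the buildfarm's
# testmodules-install-check step no longer runs the suite a second time.
NO_INSTALLCHECK = 1
```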
On Sun, Jul 21, 2024 at 7:28 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote: > > https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures > > This is great. Thanks for collating all this info here! And of > course all the research and reports behind it. Wow, that's an incredible wiki page. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Jul 21, 2024 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
> >> Locally, I've not managed to reproduce the failure yet; so perhaps
> >> there is some platform dependency. What are you testing on?
>
> > Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
> > 2024 aarch64 aarch64 aarch64 GNU/Linux
> > It's a VM running on UTM/Apple Silicon
>
> Hmm, doesn't sound like that ought to be slow.
>
> I did manage to reproduce dodo's failures by running xid_wraparound
> manually on mamba's very slow host:
>
> $ time make -s installcheck PROVE_FLAGS=--timer
> # +++ tap install-check in src/test/modules/xid_wraparound +++
> [13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
> #
> # SELECT NOT EXISTS (
> # 	SELECT *
> # 	FROM pg_database
> # 	WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
> #
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 4 just after 1.
> [13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
> All 1 subtests passed
> [14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 29 just after 2.
> [14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
> All 2 subtests passed
> [14:31:16] t/003_wraparounds.pl ....... ok  7564763 ms ( 0.00 usr  0.01 sys + 13.82 cusr  9.26 csys = 23.09 CPU)
> [16:37:21]
>
> Test Summary Report
> -------------------
> t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
>   Non-zero exit status: 4
>   Parse errors: No plan found in TAP output
> t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: No plan found in TAP output
> Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
> Result: FAIL
> make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
> 10772.99 real 59.34 user 60.14 sys
>
> Each of those two failures looks just like something that dodo has
> shown at one time or another. So it's at least plausible that
> "slow machine" is the whole explanation. I'm still wondering
> though if there's some effect that causes the test's runtime to
> be unstable in itself, sometimes leading to timeouts.

Since the server writes a lot of logs during the xid_wraparound test, "slow disk" could also be a reason.

Looking at dodo's failures, it seems that while it passes module-xid_wraparound-check, all failures happened only during testmodules-install-check-C. Can we check the server logs written during the xid_wraparound test in testmodules-install-check-C? I thought the following link is the server logs, but since it seems there were no autovacuum logs I suspected there is another log file:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dodo&dt=2024-07-20%2020%3A35%3A39&stg=testmodules-install-check-C&raw=1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes: > Looking at dodo's failures, it seems that while it passes > module-xid_wraparound-check, all failures happened only during > testmodules-install-check-C. Can we check the server logs written > during xid_wraparound test in testmodules-install-check-C? Oooh, that is indeed an interesting observation. There are enough examples now that it's hard to dismiss it as chance, but why would the two runs be different? (I agree with the comment that we shouldn't be running this test twice, but that's a separate matter.) regards, tom lane
On Mon, Jul 22, 2024 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
> > Looking at dodo's failures, it seems that while it passes
> > module-xid_wraparound-check, all failures happened only during
> > testmodules-install-check-C. Can we check the server logs written
> > during xid_wraparound test in testmodules-install-check-C?
>
> Oooh, that is indeed an interesting observation. There are enough
> examples now that it's hard to dismiss it as chance, but why would
> the two runs be different?

During the xid_wraparound test in testmodules-install-check-C, two clusters are running at the same time, which could conceivably make the xid_wraparound test slower.

> (I agree with the comment that we shouldn't be running this test
> twice, but that's a separate matter.)

+1 on not running it twice. There are test modules that have only TAP tests and are not marked as NO_INSTALLCHECK, for example test_custom_rmgrs. Presumably we don't want to run those tests twice either?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>> Looking at dodo's failures, it seems that while it passes module-xid_wraparound-check, all failures happened only during testmodules-install-check-C. Can we check the server logs written during xid_wraparound test in testmodules-install-check-C?
> Oooh, that is indeed an interesting observation. There are enough examples now that it's hard to dismiss it as chance, but why would the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
>
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>
> Looking at dodo's failures, it seems that while it passes
> module-xid_wraparound-check, all failures happened only during
> testmodules-install-check-C. Can we check the server logs written
> during xid_wraparound test in testmodules-install-check-C?
>
> Oooh, that is indeed an interesting observation. There are enough
> examples now that it's hard to dismiss it as chance, but why would
> the two runs be different?
>
> It's not deterministic.
>
> I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
>
> for f in `seq 1 100`
> do echo iteration = $f
> meson test --suite xid_wraparound || break
> done
>
> It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.

Could you provide server logs in both OK and NG tests? I want to see if there's a difference in the rate at which tables are vacuumed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2024-07-22 Mo 12:46 PM, Tom Lane wrote: >> Masahiko Sawada<sawada.mshk@gmail.com> writes: >>> Looking at dodo's failures, it seems that while it passes >>> module-xid_wraparound-check, all failures happened only during >>> testmodules-install-check-C. Can we check the server logs written >>> during xid_wraparound test in testmodules-install-check-C? >> Oooh, that is indeed an interesting observation. There are enough >> examples now that it's hard to dismiss it as chance, but why would >> the two runs be different? > It's not deterministic. Perhaps. I tried "make check" on mamba's host and got exactly the same failures as with "make installcheck", which counts in favor of dodo's results being just luck. Still, dodo has now shown 11 failures in "make installcheck" and zero in "make check", so it's getting hard to credit that there's no difference. regards, tom lane
On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:
> On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>> On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
>>> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>>>> Looking at dodo's failures, it seems that while it passes module-xid_wraparound-check, all failures happened only during testmodules-install-check-C. Can we check the server logs written during xid_wraparound test in testmodules-install-check-C?
>>> Oooh, that is indeed an interesting observation. There are enough examples now that it's hard to dismiss it as chance, but why would the two runs be different?
>> It's not deterministic.
>>
>> I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
>>
>> for f in `seq 1 100`
>> do echo iteration = $f
>> meson test --suite xid_wraparound || break
>> done
>>
>> It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
> Could you provide server logs in both OK and NG tests? I want to see if there's a difference in the rate at which tables are vacuumed.
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2024-07-22 Mo 10:11 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 2024-07-22 Mo 12:46 PM, Tom Lane wrote: >>> Masahiko Sawada<sawada.mshk@gmail.com> writes: >>>> Looking at dodo's failures, it seems that while it passes >>>> module-xid_wraparound-check, all failures happened only during >>>> testmodules-install-check-C. Can we check the server logs written >>>> during xid_wraparound test in testmodules-install-check-C? >>> Oooh, that is indeed an interesting observation. There are enough >>> examples now that it's hard to dismiss it as chance, but why would >>> the two runs be different? >> It's not deterministic. > Perhaps. I tried "make check" on mamba's host and got exactly the > same failures as with "make installcheck", which counts in favor of > dodo's results being just luck. Still, dodo has now shown 11 failures > in "make installcheck" and zero in "make check", so it's getting hard > to credit that there's no difference. > > Yeah, I agree that's perplexing. That step doesn't run with "make -j nn", so it's a bit hard to see why it should get different results from one run rather than the other. The only thing that's different is that there's another postgres instance running. Maybe that's just enough to slow the test down? After all, this is an RPi. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Jul 23, 2024 at 3:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:
>
> On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
>
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>
> Looking at dodo's failures, it seems that while it passes module-xid_wraparound-check, all failures happened only during testmodules-install-check-C. Can we check the server logs written during xid_wraparound test in testmodules-install-check-C?
>
> Oooh, that is indeed an interesting observation. There are enough examples now that it's hard to dismiss it as chance, but why would the two runs be different?
>
> It's not deterministic.
>
> I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
>
> for f in `seq 1 100`
> do echo iteration = $f
> meson test --suite xid_wraparound || break
> done
>
> It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
>
> Could you provide server logs in both OK and NG tests? I want to see if there's a difference in the rate at which tables are vacuumed.
>
> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
>
> The failure logs are from a run where both tests 1 and 2 failed.

Thank you for sharing the logs.

I think that the problem seems to match what Alexander Lakhin mentioned[1]. Probably we can fix such a race condition somehow but I'm not sure it's worth it as setting autovacuum = off and autovacuum_max_workers = 1 (or a low number) is an extremely rare case. I think it would be better to stabilize these tests. One idea is to turn the autovacuum GUC parameter on while setting autovacuum_enabled = off for each table. That way, we can ensure that autovacuum workers are launched. And I think it seems to align real use cases.

Regards,

[1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
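[A rough sketch of that idea — leave the launcher running and exempt only the test's own tables via the per-table storage parameter. The table name here is hypothetical, for illustration only, and this is not the actual patch:]

```sql
-- Instead of "autovacuum = off" in postgresql.conf, leave autovacuum on
-- (the server default) so workers keep getting launched and keep freezing
-- template0/template1, and exempt only the table the test drives toward
-- wraparound. "big_tbl" is an illustrative name.
CREATE TABLE big_tbl (id int) WITH (autovacuum_enabled = off);

-- or, for an existing table:
ALTER TABLE big_tbl SET (autovacuum_enabled = off);
```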
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
>> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
>>
>> The failure logs are from a run where both tests 1 and 2 failed.
> Thank you for sharing the logs.
>
> I think that the problem seems to match what Alexander Lakhin mentioned[1]. Probably we can fix such a race condition somehow but I'm not sure it's worth it as setting autovacuum = off and autovacuum_max_workers = 1 (or a low number) is an extremely rare case. I think it would be better to stabilize these tests. One idea is to turn the autovacuum GUC parameter on while setting autovacuum_enabled = off for each table. That way, we can ensure that autovacuum workers are launched. And I think it seems to align real use cases.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com
OK, do you want to propose a patch?
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote: > > See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2> > > > The failure logs are from a run where both tests 1 and 2 failed. > > Thank you for sharing the logs. > > I think that the problem seems to match what Alexander Lakhin > mentioned[1]. Probably we can fix such a race condition somehow but > I'm not sure it's worth it as setting autovacuum = off and > autovacuum_max_workers = 1 (or a low number) is an extremely rare > case. I think it would be better to stabilize these tests. One idea is > to turn the autovacuum GUC parameter on while setting > autovacuum_enabled = off for each table. That way, we can ensure that > autovacuum workers are launched. And I think it seems to align real > use cases. > > Regards, > > [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com > > > OK, do you want to propose a patch? > Yes, I'll prepare and share it soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote: > > > > > > On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote: > > > > See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2> > > > > > > The failure logs are from a run where both tests 1 and 2 failed. > > > > Thank you for sharing the logs. > > > > I think that the problem seems to match what Alexander Lakhin > > mentioned[1]. Probably we can fix such a race condition somehow but > > I'm not sure it's worth it as setting autovacuum = off and > > autovacuum_max_workers = 1 (or a low number) is an extremely rare > > case. I think it would be better to stabilize these tests. One idea is > > to turn the autovacuum GUC parameter on while setting > > autovacuum_enabled = off for each table. That way, we can ensure that > > autovacuum workers are launched. And I think it seems to align real > > use cases. > > > > Regards, > > > > [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com > > > > > > OK, do you want to propose a patch? > > > > Yes, I'll prepare and share it soon. > I've attached the patch. Could you please test if the patch fixes the instability you observed? Since we turn off autovacuum on all three tests and we wait for autovacuum to complete processing databases, these tests potentially have a similar (but lower) risk. So I modified these tests to turn it on so we can ensure the autovacuum runs periodically. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote:
> On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>>> On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
>>>>> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
>>>>>
>>>>> The failure logs are from a run where both tests 1 and 2 failed.
>>>> Thank you for sharing the logs.
>>>>
>>>> I think that the problem seems to match what Alexander Lakhin mentioned[1]. Probably we can fix such a race condition somehow but I'm not sure it's worth it as setting autovacuum = off and autovacuum_max_workers = 1 (or a low number) is an extremely rare case. I think it would be better to stabilize these tests. One idea is to turn the autovacuum GUC parameter on while setting autovacuum_enabled = off for each table. That way, we can ensure that autovacuum workers are launched. And I think it seems to align real use cases.
>>>>
>>>> Regards,
>>>>
>>>> [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com
>>> OK, do you want to propose a patch?
>> Yes, I'll prepare and share it soon.
> I've attached the patch. Could you please test if the patch fixes the instability you observed?
>
> Since we turn off autovacuum on all three tests and we wait for autovacuum to complete processing databases, these tests potentially have a similar (but lower) risk. So I modified these tests to turn it on so we can ensure the autovacuum runs periodically.
I assume you actually meant to remove the "autovacuum = off" in 003_wraparound.pl. With that change in your patch I retried my test, but on iteration 100 out of 100 it failed on test 002_limits.pl.
You can see the logs at <https://f001.backblazeb2.com/file/net-dunslane-public/002_limits-failure-log.tar.bz2>
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, Jul 25, 2024 at 6:52 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote:
>
> On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
>
> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
>
> The failure logs are from a run where both tests 1 and 2 failed.
>
> Thank you for sharing the logs.
>
> I think that the problem seems to match what Alexander Lakhin mentioned[1]. Probably we can fix such a race condition somehow but I'm not sure it's worth it as setting autovacuum = off and autovacuum_max_workers = 1 (or a low number) is an extremely rare case. I think it would be better to stabilize these tests. One idea is to turn the autovacuum GUC parameter on while setting autovacuum_enabled = off for each table. That way, we can ensure that autovacuum workers are launched. And I think it seems to align real use cases.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com
>
> OK, do you want to propose a patch?
>
> Yes, I'll prepare and share it soon.
>
> I've attached the patch. Could you please test if the patch fixes the instability you observed?
>
> Since we turn off autovacuum on all three tests and we wait for autovacuum to complete processing databases, these tests potentially have a similar (but lower) risk. So I modified these tests to turn it on so we can ensure the autovacuum runs periodically.
>
> I assume you actually meant to remove the "autovacuum = off" in 003_wraparound.pl. With that change in your patch I retried my test, but on iteration 100 out of 100 it failed on test 002_limits.pl.

I think we need to remove the "autovacuum = off" also in 002_limits.pl as it waits for autovacuum to process both template0 and template1 databases. Just to be clear, the failure happened even without "autovacuum = off"?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-26 Fr 1:46 PM, Masahiko Sawada wrote: > On Thu, Jul 25, 2024 at 6:52 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> >> On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote: >> >> On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> >> On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote: >> >> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2> >> >> >> The failure logs are from a run where both tests 1 and 2 failed. >> >> Thank you for sharing the logs. >> >> I think that the problem seems to match what Alexander Lakhin >> mentioned[1]. Probably we can fix such a race condition somehow but >> I'm not sure it's worth it as setting autovacuum = off and >> autovacuum_max_workers = 1 (or a low number) is an extremely rare >> case. I think it would be better to stabilize these tests. One idea is >> to turn the autovacuum GUC parameter on while setting >> autovacuum_enabled = off for each table. That way, we can ensure that >> autovacuum workers are launched. And I think it seems to align real >> use cases. >> >> Regards, >> >> [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com >> >> >> OK, do you want to propose a patch? >> >> Yes, I'll prepare and share it soon. >> >> I've attached the patch. Could you please test if the patch fixes the >> instability you observed? >> >> Since we turn off autovacuum on all three tests and we wait for >> autovacuum to complete processing databases, these tests potentially >> have a similar (but lower) risk. So I modified these tests to turn it >> on so we can ensure the autovacuum runs periodically. >> >> >> I assume you actually meant to remove the "autovacuum = off" in 003_wraparound.pl. With that change in your patch I retriedmy test, but on iteration 100 out of 100 it failed on test 002_limits.pl. 
> I think we need to remove the "autovacuum = off" in 002_limits.pl as
> well, as it waits for autovacuum to process both the template0 and
> template1 databases. Just to be clear: did the failure happen even
> without "autovacuum = off"?

The attached patch, a slight modification of yours, removes "autovacuum = off" for all three tests, and given that, a set of 200 runs was clean for me.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
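The stabilization approach described upthread is to leave the autovacuum launcher running cluster-wide and instead disable autovacuum only for the table the test drives manually, via its storage parameter. A minimal sketch of the idea (the table name is illustrative, not the test's actual table):

```sql
-- In the test cluster's postgresql.conf, keep the launcher on,
-- i.e. simply drop the "autovacuum = off" line so template0 and
-- template1 still get frozen by autovacuum.

-- The table the test vacuums explicitly opts out, so only the test's
-- own VACUUM commands advance its relfrozenxid:
CREATE TABLE wraptest (id int) WITH (autovacuum_enabled = off);
```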
On Sat, Jul 27, 2024 at 1:06 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-26 Fr 1:46 PM, Masahiko Sawada wrote:
> > I've attached the patch. Could you please test if the patch fixes the
> > instability you observed?
> >
> > Since we turn off autovacuum in all three tests and we wait for
> > autovacuum to finish processing the databases, these tests potentially
> > carry a similar (but lower) risk. So I modified these tests to turn it
> > on, so we can ensure that autovacuum runs periodically.
>
> I assume you actually meant to remove the "autovacuum = off" in
> 003_wraparound.pl. With that change in your patch I retried my test,
> but on iteration 100 out of 100 it failed on test 002_limits.pl.
>
> > I think we need to remove the "autovacuum = off" in 002_limits.pl as
> > well, as it waits for autovacuum to process both the template0 and
> > template1 databases. Just to be clear: did the failure happen even
> > without "autovacuum = off"?
>
> The attached patch, a slight modification of yours, removes "autovacuum
> = off" for all three tests, and given that, a set of 200 runs was clean
> for me.

Oh, I missed that I left "autovacuum = off" in the 002 test for some reason. Thank you for attaching the patch; it looks good to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-29 Mo 5:25 PM, Masahiko Sawada wrote:
> Oh, I missed that I left "autovacuum = off" in the 002 test for some
> reason. Thank you for attaching the patch; it looks good to me.
Thanks, pushed.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Jul 30, 2024 at 3:29 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-29 Mo 5:25 PM, Masahiko Sawada wrote:
> > Oh, I missed that I left "autovacuum = off" in the 002 test for some
> > reason. Thank you for attaching the patch; it looks good to me.
>
> Thanks, pushed.

Thanks!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com