Thread: xid_wraparound tests intermittent failure.
I noticed this when working on the PostgreSQL::Test::Session project I have in hand. All the tests pass except occasionally the xid_wraparound tests fail. It's not always the same test script that fails either. I tried everything but couldn't make the failure stop. So then I switched out my patch so it's running on plain master and set things running in a loop. Lo and behold, it can be relied on to fail after only a few iterations. In the latest iteration the failure looks like this:

stderr:
# poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# 	SELECT *
# 	FROM pg_database
# 	WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 1.
(test program exited with status code 29)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
Summary of Failures:

295/295 postgresql:xid_wraparound / xid_wraparound/001_emergency_vacuum ERROR 211.76s exit status 29

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > I noticed this when working on the PostgreSQL::Test::Session project I > have in hand. All the tests pass except occasionally the xid_wraparound > tests fail. It's not always the same test script that fails either. I > tried everything but couldn't make the failure stop. So then I switched > out my patch so it's running on plain master and set things running in a > loop. Lo and behold it can be relied on to fail after only a few > iterations. I have been noticing xid_wraparound failures in the buildfarm too. They seemed quite infrequent, but it wasn't till just now that I realized that xid_wraparound is not run by default. (You have to put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the only buildfarm animals that have enabled it are dodo and perentie. dodo is failing this test fairly often: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD perentie doesn't seem to be having a problem, but I will bet that part of the reason is it's running with cranked-up timeouts: 'build_env' => { 'PG_TEST_EXTRA' => 'xid_wraparound', 'PG_TEST_TIMEOUT_DEFAULT' => '360' }, One thing that seems quite interesting is that the test seems to take about 10 minutes when successful on dodo, but when it fails it's twice that. Why the instability? (Perhaps dodo has highly variable background load, and the thing simply times out in some runs but not others?) Locally, I've not managed to reproduce the failure yet; so perhaps there is some platform dependency. What are you testing on? regards, tom lane
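[As a sketch for anyone reproducing this locally: since the suite is opt-in, enabling it and raising the TAP timeout to match perentie's settings quoted above could look like the following. PG_TEST_TIMEOUT_DEFAULT normally defaults to 180 seconds; the values here simply mirror perentie's configuration.]

```shell
# Enable the opt-in xid_wraparound suite and give a slow machine more
# headroom, mirroring perentie's buildfarm settings quoted above.
export PG_TEST_EXTRA="xid_wraparound"
export PG_TEST_TIMEOUT_DEFAULT=360   # stock default is 180 seconds
```

[With those set, a meson build would run it via `meson test --suite xid_wraparound`, as used later in this thread.]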
On 2024-07-21 Su 1:34 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I noticed this when working on the PostgreSQL::Test::Session project I >> have in hand. All the tests pass except occasionally the xid_wraparound >> tests fail. It's not always the same test script that fails either. I >> tried everything but couldn't make the failure stop. So then I switched >> out my patch so it's running on plain master and set things running in a >> loop. Lo and behold it can be relied on to fail after only a few >> iterations. > I have been noticing xid_wraparound failures in the buildfarm too. > They seemed quite infrequent, but it wasn't till just now that > I realized that xid_wraparound is not run by default. (You have to > put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the > only buildfarm animals that have enabled it are dodo and perentie. > dodo is failing this test fairly often: > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD > > perentie doesn't seem to be having a problem, but I will bet that > part of the reason is it's running with cranked-up timeouts: > > 'build_env' => { > 'PG_TEST_EXTRA' => 'xid_wraparound', > 'PG_TEST_TIMEOUT_DEFAULT' => '360' > }, > > One thing that seems quite interesting is that the test seems to > take about 10 minutes when successful on dodo, but when it fails > it's twice that. Why the instability? (Perhaps dodo has highly > variable background load, and the thing simply times out in some > runs but not others?) > > Locally, I've not managed to reproduce the failure yet; so perhaps > there is some platform dependency. What are you testing on? Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux It's a VM running on UTM/Apple Silicon cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hello, 21.07.2024 20:34, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I noticed this when working on the PostgreSQL::Test::Session project I >> have in hand. All the tests pass except occasionally the xid_wraparound >> tests fail. It's not always the same test script that fails either. I >> tried everything but couldn't make the failure stop. So then I switched >> out my patch so it's running on plain master and set things running in a >> loop. Lo and behold it can be relied on to fail after only a few >> iterations. > I have been noticing xid_wraparound failures in the buildfarm too. > They seemed quite infrequent, but it wasn't till just now that > I realized that xid_wraparound is not run by default. (You have to > put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the > only buildfarm animals that have enabled it are dodo and perentie. > dodo is failing this test fairly often: > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD I think this failure is counted at [1]. Please look at the linked message [2], where I described what makes the test fail. [1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing [2] https://www.postgresql.org/message-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com Best regards, Alexander
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
>> Locally, I've not managed to reproduce the failure yet; so perhaps
>> there is some platform dependency. What are you testing on?

> Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
> 2024 aarch64 aarch64 aarch64 GNU/Linux
> It's a VM running on UTM/Apple Silicon

Hmm, doesn't sound like that ought to be slow.

I did manage to reproduce dodo's failures by running xid_wraparound manually on mamba's very slow host:

$ time make -s installcheck PROVE_FLAGS=--timer
# +++ tap install-check in src/test/modules/xid_wraparound +++
[13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# 	SELECT *
# 	FROM pg_database
# 	WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 4 just after 1.
[13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
All 1 subtests passed
[14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
[14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
[14:31:16] t/003_wraparounds.pl ....... ok  7564763 ms ( 0.00 usr  0.01 sys + 13.82 cusr  9.26 csys = 23.09 CPU)
[16:37:21]

Test Summary Report
-------------------
t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
  Non-zero exit status: 4
  Parse errors: No plan found in TAP output
t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output
Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
    10772.99 real        59.34 user        60.14 sys

Each of those two failures looks just like something that dodo has shown at one time or another. So it's at least plausible that "slow machine" is the whole explanation. I'm still wondering though if there's some effect that causes the test's runtime to be unstable in itself, sometimes leading to timeouts.

regards, tom lane
On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote: > https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures This is great. Thanks for collating all this info here! And of course all the research and reports behind it.
On 2024-07-21 Su 4:08 PM, Alexander Lakhin wrote:
> Hello,
>
> 21.07.2024 20:34, Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> I noticed this when working on the PostgreSQL::Test::Session project I have in hand. All the tests pass except occasionally the xid_wraparound tests fail. It's not always the same test script that fails either. I tried everything but couldn't make the failure stop. So then I switched out my patch so it's running on plain master and set things running in a loop. Lo and behold it can be relied on to fail after only a few iterations.
>> I have been noticing xid_wraparound failures in the buildfarm too. They seemed quite infrequent, but it wasn't till just now that I realized that xid_wraparound is not run by default. (You have to put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the only buildfarm animals that have enabled it are dodo and perentie. dodo is failing this test fairly often:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
>
> I think this failure is counted at [1]. Please look at the linked message [2], where I described what makes the test fail.
>
> [1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing
> [2] https://www.postgresql.org/message-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com

It's sad nothing has happened about this for 2 months. There's no point in having unreliable tests. What's not 100% clear to me is whether this failure indicates a badly formulated test, or whether the test is correct and has identified an underlying bug.

Regarding the point in [2] about the test being run twice in buildfarm clients, I think we should mark the module as NO_INSTALLCHECK in the Makefile.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
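[A minimal sketch of what that Makefile change could look like, following the pattern other src/test/modules Makefiles use. Only the NO_INSTALLCHECK line is the point being made; the surrounding lines are illustrative, not the module's actual Makefile.]

```makefile
# Hypothetical fragment of src/test/modules/xid_wraparound/Makefile
TAP_TESTS = 1

# Skip this module during "make installcheck", so the buildfarm's
# testmodules-install-check step no longer runs the suite a second time.
NO_INSTALLCHECK = 1
```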
On Sun, Jul 21, 2024 at 7:28 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote: > > https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures > > This is great. Thanks for collating all this info here! And of > course all the research and reports behind it. Wow, that's an incredible wiki page. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Jul 21, 2024 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
> >> Locally, I've not managed to reproduce the failure yet; so perhaps
> >> there is some platform dependency. What are you testing on?
>
> > Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
> > 2024 aarch64 aarch64 aarch64 GNU/Linux
> > It's a VM running on UTM/Apple Silicon
>
> Hmm, doesn't sound like that ought to be slow.
>
> I did manage to reproduce dodo's failures by running xid_wraparound
> manually on mamba's very slow host:
>
> $ time make -s installcheck PROVE_FLAGS=--timer
> # +++ tap install-check in src/test/modules/xid_wraparound +++
> [13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
> #
> # SELECT NOT EXISTS (
> # 	SELECT *
> # 	FROM pg_database
> # 	WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
> #
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 4 just after 1.
> [13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
> All 1 subtests passed
> [14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 29 just after 2.
> [14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
> All 2 subtests passed
> [14:31:16] t/003_wraparounds.pl ....... ok  7564763 ms ( 0.00 usr  0.01 sys + 13.82 cusr  9.26 csys = 23.09 CPU)
> [16:37:21]
>
> Test Summary Report
> -------------------
> t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
>   Non-zero exit status: 4
>   Parse errors: No plan found in TAP output
> t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: No plan found in TAP output
> Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
> Result: FAIL
> make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
> 10772.99 real 59.34 user 60.14 sys
>
> Each of those two failures looks just like something that dodo has
> shown at one time or another. So it's at least plausible that
> "slow machine" is the whole explanation. I'm still wondering
> though if there's some effect that causes the test's runtime to
> be unstable in itself, sometimes leading to timeouts.

Since the server writes a lot of logs during the xid_wraparound test, "slow disk" could also be a reason.

Looking at dodo's failures, it seems that while it passes module-xid_wraparound-check, all failures happened only during testmodules-install-check-C. Can we check the server logs written during the xid_wraparound test in testmodules-install-check-C? I thought the following link is the server logs, but since it seems there were no autovacuum logs I suspected there is another log file:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dodo&dt=2024-07-20%2020%3A35%3A39&stg=testmodules-install-check-C&raw=1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes: > Looking at dodo's failures, it seems that while it passes > module-xid_wraparound-check, all failures happened only during > testmodules-install-check-C. Can we check the server logs written > during xid_wraparound test in testmodules-install-check-C? Oooh, that is indeed an interesting observation. There are enough examples now that it's hard to dismiss it as chance, but why would the two runs be different? (I agree with the comment that we shouldn't be running this test twice, but that's a separate matter.) regards, tom lane
On Mon, Jul 22, 2024 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
> > Looking at dodo's failures, it seems that while it passes
> > module-xid_wraparound-check, all failures happened only during
> > testmodules-install-check-C. Can we check the server logs written
> > during xid_wraparound test in testmodules-install-check-C?
>
> Oooh, that is indeed an interesting observation. There are enough
> examples now that it's hard to dismiss it as chance, but why would
> the two runs be different?

During the xid_wraparound test in testmodules-install-check-C, two clusters are running at the same time, which could conceivably make the xid_wraparound test slower.

> (I agree with the comment that we shouldn't be running this test
> twice, but that's a separate matter.)

+1 on not running it twice. There are test modules that have only TAP tests and are not marked as NO_INSTALLCHECK, for example test_custom_rmgrs. Presumably we don't want to run those tests twice either?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>> Looking at dodo's failures, it seems that while it passes module-xid_wraparound-check, all failures happened only during testmodules-install-check-C. Can we check the server logs written during xid_wraparound test in testmodules-install-check-C?
> Oooh, that is indeed an interesting observation. There are enough examples now that it's hard to dismiss it as chance, but why would the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
>
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>
> Looking at dodo's failures, it seems that while it passes
> module-xid_wraparound-check, all failures happened only during
> testmodules-install-check-C. Can we check the server logs written
> during xid_wraparound test in testmodules-install-check-C?
>
> Oooh, that is indeed an interesting observation. There are enough
> examples now that it's hard to dismiss it as chance, but why would
> the two runs be different?
>
> It's not deterministic.
>
> I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
>
> for f in `seq 1 100`
> do echo iteration = $f
> meson test --suite xid_wraparound || break
> done
>
> It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.

Could you provide server logs in both OK and NG tests? I want to see if there's a difference in the rate at which tables are vacuumed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2024-07-22 Mo 12:46 PM, Tom Lane wrote: >> Masahiko Sawada<sawada.mshk@gmail.com> writes: >>> Looking at dodo's failures, it seems that while it passes >>> module-xid_wraparound-check, all failures happened only during >>> testmodules-install-check-C. Can we check the server logs written >>> during xid_wraparound test in testmodules-install-check-C? >> Oooh, that is indeed an interesting observation. There are enough >> examples now that it's hard to dismiss it as chance, but why would >> the two runs be different? > It's not deterministic. Perhaps. I tried "make check" on mamba's host and got exactly the same failures as with "make installcheck", which counts in favor of dodo's results being just luck. Still, dodo has now shown 11 failures in "make installcheck" and zero in "make check", so it's getting hard to credit that there's no difference. regards, tom lane
On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:
> On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>> On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
>>> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>>>> Looking at dodo's failures, it seems that while it passes module-xid_wraparound-check, all failures happened only during testmodules-install-check-C. Can we check the server logs written during xid_wraparound test in testmodules-install-check-C?
>>> Oooh, that is indeed an interesting observation. There are enough examples now that it's hard to dismiss it as chance, but why would the two runs be different?
>> It's not deterministic.
>>
>> I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
>>
>> for f in `seq 1 100`
>> do echo iteration = $f
>> meson test --suite xid_wraparound || break
>> done
>>
>> It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
> Could you provide server logs in both OK and NG tests? I want to see if there's a difference in the rate at which tables are vacuumed.
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2024-07-22 Mo 10:11 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 2024-07-22 Mo 12:46 PM, Tom Lane wrote: >>> Masahiko Sawada<sawada.mshk@gmail.com> writes: >>>> Looking at dodo's failures, it seems that while it passes >>>> module-xid_wraparound-check, all failures happened only during >>>> testmodules-install-check-C. Can we check the server logs written >>>> during xid_wraparound test in testmodules-install-check-C? >>> Oooh, that is indeed an interesting observation. There are enough >>> examples now that it's hard to dismiss it as chance, but why would >>> the two runs be different? >> It's not deterministic. > Perhaps. I tried "make check" on mamba's host and got exactly the > same failures as with "make installcheck", which counts in favor of > dodo's results being just luck. Still, dodo has now shown 11 failures > in "make installcheck" and zero in "make check", so it's getting hard > to credit that there's no difference. > > Yeah, I agree that's perplexing. That step doesn't run with "make -j nn", so it's a bit hard to see why it should get different results from one run rather than the other. The only thing that's different is that there's another postgres instance running. Maybe that's just enough to slow the test down? After all, this is an RPi. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Jul 23, 2024 at 3:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:
>
> On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
>
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>
> Looking at dodo's failures, it seems that while it passes module-xid_wraparound-check, all failures happened only during testmodules-install-check-C. Can we check the server logs written during xid_wraparound test in testmodules-install-check-C?
>
> Oooh, that is indeed an interesting observation. There are enough examples now that it's hard to dismiss it as chance, but why would the two runs be different?
>
> It's not deterministic.
>
> I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
>
> for f in `seq 1 100`
> do echo iteration = $f
> meson test --suite xid_wraparound || break
> done
>
> It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
>
> Could you provide server logs in both OK and NG tests? I want to see if there's a difference in the rate at which tables are vacuumed.
>
> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
>
> The failure logs are from a run where both tests 1 and 2 failed.

Thank you for sharing the logs.

I think that the problem seems to match what Alexander Lakhin mentioned[1]. Probably we can fix such a race condition somehow but I'm not sure it's worth it as setting autovacuum = off and autovacuum_max_workers = 1 (or a low number) is an extremely rare case. I think it would be better to stabilize these tests. One idea is to turn the autovacuum GUC parameter on while setting autovacuum_enabled = off for each table. That way, we can ensure that autovacuum workers are launched. And I think it seems to align real use cases.

Regards,

[1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
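[A rough sketch of that idea — leave the launcher running and exempt only the test's own tables via the per-table storage parameter. The table name here is hypothetical, for illustration only, and this is not the actual patch:]

```sql
-- Instead of "autovacuum = off" in postgresql.conf, leave autovacuum on
-- (the server default) so workers keep getting launched and keep freezing
-- template0/template1, and exempt only the table the test drives toward
-- wraparound. "big_tbl" is an illustrative name.
CREATE TABLE big_tbl (id int) WITH (autovacuum_enabled = off);

-- or, for an existing table:
ALTER TABLE big_tbl SET (autovacuum_enabled = off);
```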
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
>> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
>>
>> The failure logs are from a run where both tests 1 and 2 failed.
> Thank you for sharing the logs.
>
> I think that the problem seems to match what Alexander Lakhin mentioned[1]. Probably we can fix such a race condition somehow but I'm not sure it's worth it as setting autovacuum = off and autovacuum_max_workers = 1 (or a low number) is an extremely rare case. I think it would be better to stabilize these tests. One idea is to turn the autovacuum GUC parameter on while setting autovacuum_enabled = off for each table. That way, we can ensure that autovacuum workers are launched. And I think it seems to align real use cases.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com
OK, do you want to propose a patch?
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote: > > See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2> > > > The failure logs are from a run where both tests 1 and 2 failed. > > Thank you for sharing the logs. > > I think that the problem seems to match what Alexander Lakhin > mentioned[1]. Probably we can fix such a race condition somehow but > I'm not sure it's worth it as setting autovacuum = off and > autovacuum_max_workers = 1 (or a low number) is an extremely rare > case. I think it would be better to stabilize these tests. One idea is > to turn the autovacuum GUC parameter on while setting > autovacuum_enabled = off for each table. That way, we can ensure that > autovacuum workers are launched. And I think it seems to align real > use cases. > > Regards, > > [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com > > > OK, do you want to propose a patch? > Yes, I'll prepare and share it soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote: > > > > > > On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote: > > > > See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2> > > > > > > The failure logs are from a run where both tests 1 and 2 failed. > > > > Thank you for sharing the logs. > > > > I think that the problem seems to match what Alexander Lakhin > > mentioned[1]. Probably we can fix such a race condition somehow but > > I'm not sure it's worth it as setting autovacuum = off and > > autovacuum_max_workers = 1 (or a low number) is an extremely rare > > case. I think it would be better to stabilize these tests. One idea is > > to turn the autovacuum GUC parameter on while setting > > autovacuum_enabled = off for each table. That way, we can ensure that > > autovacuum workers are launched. And I think it seems to align real > > use cases. > > > > Regards, > > > > [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com > > > > > > OK, do you want to propose a patch? > > > > Yes, I'll prepare and share it soon. > I've attached the patch. Could you please test if the patch fixes the instability you observed? Since we turn off autovacuum on all three tests and we wait for autovacuum to complete processing databases, these tests potentially have a similar (but lower) risk. So I modified these tests to turn it on so we can ensure the autovacuum runs periodically. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote:
> On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>>> On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
>>>>> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
>>>>>
>>>>> The failure logs are from a run where both tests 1 and 2 failed.
>>>> Thank you for sharing the logs.
>>>>
>>>> I think that the problem seems to match what Alexander Lakhin mentioned[1]. Probably we can fix such a race condition somehow but I'm not sure it's worth it as setting autovacuum = off and autovacuum_max_workers = 1 (or a low number) is an extremely rare case. I think it would be better to stabilize these tests. One idea is to turn the autovacuum GUC parameter on while setting autovacuum_enabled = off for each table. That way, we can ensure that autovacuum workers are launched. And I think it seems to align real use cases.
>>>>
>>>> Regards,
>>>>
>>>> [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com
>>> OK, do you want to propose a patch?
>> Yes, I'll prepare and share it soon.
> I've attached the patch. Could you please test if the patch fixes the instability you observed?
>
> Since we turn off autovacuum on all three tests and we wait for autovacuum to complete processing databases, these tests potentially have a similar (but lower) risk. So I modified these tests to turn it on so we can ensure the autovacuum runs periodically.
I assume you actually meant to remove the "autovacuum = off" in 003_wraparound.pl. With that change in your patch I retried my test, but on iteration 100 out of 100 it failed on test 002_limits.pl.
You can see the logs at <https://f001.backblazeb2.com/file/net-dunslane-public/002_limits-failure-log.tar.bz2>
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, Jul 25, 2024 at 6:52 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote:
>
> On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
>
> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
>
> The failure logs are from a run where both tests 1 and 2 failed.
>
> Thank you for sharing the logs.
>
> I think that the problem seems to match what Alexander Lakhin mentioned[1]. Probably we can fix such a race condition somehow but I'm not sure it's worth it as setting autovacuum = off and autovacuum_max_workers = 1 (or a low number) is an extremely rare case. I think it would be better to stabilize these tests. One idea is to turn the autovacuum GUC parameter on while setting autovacuum_enabled = off for each table. That way, we can ensure that autovacuum workers are launched. And I think it seems to align real use cases.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com
>
> OK, do you want to propose a patch?
>
> Yes, I'll prepare and share it soon.
>
> I've attached the patch. Could you please test if the patch fixes the instability you observed?
>
> Since we turn off autovacuum on all three tests and we wait for autovacuum to complete processing databases, these tests potentially have a similar (but lower) risk. So I modified these tests to turn it on so we can ensure the autovacuum runs periodically.
>
> I assume you actually meant to remove the "autovacuum = off" in 003_wraparound.pl. With that change in your patch I retried my test, but on iteration 100 out of 100 it failed on test 002_limits.pl.

I think we need to remove the "autovacuum = off" also in 002_limits.pl as it waits for autovacuum to process both template0 and template1 databases. Just to be clear, the failure happened even without "autovacuum = off"?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-26 Fr 1:46 PM, Masahiko Sawada wrote: > On Thu, Jul 25, 2024 at 6:52 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> >> On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote: >> >> On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> >> On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote: >> >> See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2> >> >> >> The failure logs are from a run where both tests 1 and 2 failed. >> >> Thank you for sharing the logs. >> >> I think that the problem seems to match what Alexander Lakhin >> mentioned[1]. Probably we can fix such a race condition somehow but >> I'm not sure it's worth it as setting autovacuum = off and >> autovacuum_max_workers = 1 (or a low number) is an extremely rare >> case. I think it would be better to stabilize these tests. One idea is >> to turn the autovacuum GUC parameter on while setting >> autovacuum_enabled = off for each table. That way, we can ensure that >> autovacuum workers are launched. And I think it seems to align real >> use cases. >> >> Regards, >> >> [1] https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com >> >> >> OK, do you want to propose a patch? >> >> Yes, I'll prepare and share it soon. >> >> I've attached the patch. Could you please test if the patch fixes the >> instability you observed? >> >> Since we turn off autovacuum on all three tests and we wait for >> autovacuum to complete processing databases, these tests potentially >> have a similar (but lower) risk. So I modified these tests to turn it >> on so we can ensure the autovacuum runs periodically. >> >> >> I assume you actually meant to remove the "autovacuum = off" in 003_wraparound.pl. With that change in your patch I retriedmy test, but on iteration 100 out of 100 it failed on test 002_limits.pl. 
> I think we need to remove the "autovacuum = off" in 002_limits.pl as
> well, as it waits for autovacuum to process both the template0 and
> template1 databases. Just to be clear: did the failure happen even
> without "autovacuum = off"?

The attached patch, a slight modification of yours, removes "autovacuum = off" for all three tests, and given that, a set of 200 runs was clean for me.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
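The stabilization approach described upthread is to leave the autovacuum launcher running cluster-wide and instead disable autovacuum only for the table the test drives manually, via its storage parameter. A minimal sketch of the idea (the table name is illustrative, not the test's actual table):

```sql
-- In the test cluster's postgresql.conf, keep the launcher on,
-- i.e. simply drop the "autovacuum = off" line so template0 and
-- template1 still get frozen by autovacuum.

-- The table the test vacuums explicitly opts out, so only the test's
-- own VACUUM commands advance its relfrozenxid:
CREATE TABLE wraptest (id int) WITH (autovacuum_enabled = off);
```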
On Sat, Jul 27, 2024 at 1:06 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-26 Fr 1:46 PM, Masahiko Sawada wrote:
> > I've attached the patch. Could you please test if the patch fixes the
> > instability you observed?
> >
> > Since we turn off autovacuum in all three tests and we wait for
> > autovacuum to finish processing the databases, these tests potentially
> > carry a similar (but lower) risk. So I modified these tests to turn it
> > on, so we can ensure that autovacuum runs periodically.
>
> I assume you actually meant to remove the "autovacuum = off" in
> 003_wraparound.pl. With that change in your patch I retried my test,
> but on iteration 100 out of 100 it failed on test 002_limits.pl.
>
> > I think we need to remove the "autovacuum = off" in 002_limits.pl as
> > well, as it waits for autovacuum to process both the template0 and
> > template1 databases. Just to be clear: did the failure happen even
> > without "autovacuum = off"?
>
> The attached patch, a slight modification of yours, removes "autovacuum
> = off" for all three tests, and given that, a set of 200 runs was clean
> for me.

Oh, I missed that I left "autovacuum = off" in the 002 test for some reason. Thank you for attaching the patch; it looks good to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-29 Mo 5:25 PM, Masahiko Sawada wrote:
> Oh, I missed that I left "autovacuum = off" in the 002 test for some
> reason. Thank you for attaching the patch; it looks good to me.
Thanks, pushed.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Jul 30, 2024 at 3:29 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 2024-07-29 Mo 5:25 PM, Masahiko Sawada wrote:
> > Oh, I missed that I left "autovacuum = off" in the 002 test for some
> > reason. Thank you for attaching the patch; it looks good to me.
>
> Thanks, pushed.

Thanks!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com