
From: Alexander Lakhin
Subject: Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
Date:
Msg-id: db093cce-7eec-8516-ef0f-891895178c46@gmail.com
In response to: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?) (Alexander Lakhin <exclusion@gmail.com>)
Responses: Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
List: pgsql-hackers
22.06.2024 12:00, Alexander Lakhin wrote:
> On the other hand, backporting a7f417107 could fix the issue too, but I'm
> afraid we'll still see other tests (027_stream_regress) failing like [4].
> When similar failures were observed on Andres Freund's animals, Andres
> came to the conclusion that they were caused by fsync too ([5]).
>

It seems to me that another dodo failure [1] has the same cause:
t/001_emergency_vacuum.pl .. ok
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
t/002_limits.pl ............
Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
t/003_wraparounds.pl ....... ok

Test Summary Report
-------------------
t/002_limits.pl          (Wstat: 7424 Tests: 2 Failed: 0)
   Non-zero exit status: 29
   Parse errors: No plan found in TAP output
Files=3, Tests=10, 4235 wallclock secs ( 0.10 usr  0.13 sys + 18.05 cusr 12.76 csys = 31.04 CPU)
Result: FAIL

Unfortunately, the buildfarm log doesn't contain regress_log_002_limits,
but I managed to reproduce the failure on my device when its storage is
as slow as this:
$ dd if=/dev/zero of=./test count=1024 oflag=dsync bs=128k
1024+0 records in
1024+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 33.9446 s, 4.0 MB/s
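
For a more direct measure of per-commit fsync latency on such storage,
pg_test_fsync could be pointed at the same filesystem; a possible
invocation (the test file path and the per-test duration here are picked
arbitrarily):
$ pg_test_fsync -f ./pg_test_fsync.out -s 5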

The test fails as follows:
[15:36:04.253](729.754s) ok 1 - warn-limit reached
#### Begin standard error
psql:<stdin>:1: WARNING:  database "postgres" must be vacuumed within 37483631 transactions
HINT:  To avoid XID assignment failures, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
#### End standard error
[15:36:45.220](40.968s) ok 2 - stop-limit
[15:36:45.222](0.002s) # issuing query via background psql: COMMIT
IPC::Run: timeout on timer #1 at /usr/share/perl5/IPC/Run.pm line 2944.

It looks like this timeout bump (introduced in [2]) is not enough for
machines that are that slow:
# Bump the query timeout to avoid false negatives on slow test systems.
my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
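
As a stopgap for such machines, note that timeout_default honors the
PG_TEST_TIMEOUT_DEFAULT environment variable (it defaults to 180 seconds
when unset), so the owner of a slow animal could raise it without
patching the test. A sketch, with the value picked arbitrarily (assuming
the module is run directly via its check target):
# gives the background psql 4 * 1800 s instead of 4 * 180 s
$ PG_TEST_EXTRA=xid_wraparound PG_TEST_TIMEOUT_DEFAULT=1800 \
    make -C src/test/modules/xid_wraparound check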

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-20%2007%3A18%3A46
[2] https://www.postgresql.org/message-id/CAD21AoBKBVkXyEwkApSUqN98CuOWw%3DYQdbkeE6gGJ0zH7z-TBw%40mail.gmail.com

Best regards,
Alexander


