Thread: has_wal_read_bug

has_wal_read_bug

From: Thomas Munro
Date: 17 May 2022, 02:50:51

027_stream_regress.pl has:

if (PostgreSQL::Test::Utils::has_wal_read_bug)
{
    # We'd prefer to use Test::More->builder->todo_start, but the bug causes
    # this test file to die(), not merely to fail.
    plan skip_all => 'filesystem bug';
}

Is the die() referenced there the one from the system_or_bail() call
that commit a096813b got rid of?

Here's a failure in 031_recovery_conflict.pl that smells like
concurrent pread() corruption:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-16%2015%3A45%3A54

2022-05-16 18:10:33.375 CEST [52106:1] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2022-05-16 18:10:33.621 CEST [52105:5] LOG:  incorrect resource manager data checksum in record at 0/338FDC8
2022-05-16 18:10:33.622 CEST [52106:2] FATAL:  terminating walreceiver process due to administrator command

Presumably we also need the has_wal_read_bug kludge in all these new
tests that use replication.

Re: has_wal_read_bug

From: Noah Misch
Date: 17 May 2022, 10:15:35

On Tue, May 17, 2022 at 11:50:51AM +1200, Thomas Munro wrote:
> 027_stream_regress.pl has:
> 
> if (PostgreSQL::Test::Utils::has_wal_read_bug)
> {
>     # We'd prefer to use Test::More->builder->todo_start, but the bug causes
>     # this test file to die(), not merely to fail.
>     plan skip_all => 'filesystem bug';
> }
> 
> Is the die() referenced there the one from the system_or_bail() call
> that commit a096813b got rid of?

No, it was the 'croak "timed out waiting for catchup"',
e.g. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-01-25%2016%3A56%3A26

> Here's a failure in 031_recovery_conflict.pl that smells like
> concurrent pread() corruption:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-16%2015%3A45%3A54
> 
> 2022-05-16 18:10:33.375 CEST [52106:1] LOG:  started streaming WAL
> from primary at 0/3000000 on timeline 1
> 2022-05-16 18:10:33.621 CEST [52105:5] LOG:  incorrect resource
> manager data checksum in record at 0/338FDC8
> 2022-05-16 18:10:33.622 CEST [52106:2] FATAL:  terminating walreceiver
> process due to administrator command

Agreed.  Here, too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-09%2015%3A46%3A03

> Presumably we also need the has_wal_read_bug kludge in all these new
> tests that use replication.

That is an option.  One alternative is to reconfigure those three animals to
remove --enable-tap-tests.  Another alternative is to make the build system
skip all files of all TAP suites on affected systems.  Handling this on a
file-by-file basis seemed reasonable to me when only two files had failed that
way.  Now, five files have failed.  We have wait_for_catchup calls in
fifty-one files, and I wouldn't have chosen the has_wal_read_bug approach if I
had expected fifty-one files to eventually call it.  I could tolerate it,
though.

Re: has_wal_read_bug

From: Noah Misch
Date: 30 October 2022, 06:16:39

On Tue, May 17, 2022 at 12:15:35AM -0700, Noah Misch wrote:
> On Tue, May 17, 2022 at 11:50:51AM +1200, Thomas Munro wrote:
> > 027_stream_regress.pl has:
> > 
> > if (PostgreSQL::Test::Utils::has_wal_read_bug)
> > {
> >     # We'd prefer to use Test::More->builder->todo_start, but the bug causes
> >     # this test file to die(), not merely to fail.
> >     plan skip_all => 'filesystem bug';
> > }
> > 
> > Is the die() referenced there the one from the system_or_bail() call
> > that commit a096813b got rid of?
> 
> No, it was the 'croak "timed out waiting for catchup"',
> e.g. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-01-25%2016%3A56%3A26
> 
> > Here's a failure in 031_recovery_conflict.pl that smells like
> > concurrent pread() corruption:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-16%2015%3A45%3A54
> > 
> > 2022-05-16 18:10:33.375 CEST [52106:1] LOG:  started streaming WAL
> > from primary at 0/3000000 on timeline 1
> > 2022-05-16 18:10:33.621 CEST [52105:5] LOG:  incorrect resource
> > manager data checksum in record at 0/338FDC8
> > 2022-05-16 18:10:33.622 CEST [52106:2] FATAL:  terminating walreceiver
> > process due to administrator command
> 
> Agreed.  Here, too:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-09%2015%3A46%3A03
> 
> > Presumably we also need the has_wal_read_bug kludge in all these new
> > tests that use replication.
> 
> That is an option.  One alternative is to reconfigure those three animals to
> remove --enable-tap-tests.  Another alternative is to make the build system
> skip all files of all TAP suites on affected systems.  Handling this on a
> file-by-file basis seemed reasonable to me when only two files had failed that
> way.  Now, five files have failed.  We have wait_for_catchup calls in
> fifty-one files, and I wouldn't have chosen the has_wal_read_bug approach if I
> had expected fifty-one files to eventually call it.  I could tolerate it,
> though.

Squashing another test that failed multiple times (commit a9f8ca6) led me to
think of another option, attached.  When wait_for_catchup() fails under
has_wal_read_bug(), end the suite with an abrupt success.  Thoughts?

Attachment: wait_for_catchup-vs-has_wal_read_bug-v1.patch
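
For readers without the attachment, here is a minimal sketch of what "end the suite with an abrupt success" could look like. It is not the attached patch. handle_catchup_timeout and its $bug_present argument are hypothetical names for illustration; the real change would presumably sit in the wait_for_catchup() failure path and consult PostgreSQL::Test::Utils::has_wal_read_bug directly. The sketch also assumes that Test::More will not accept a skip once a test has run and will not accept done_testing() when no test has run, which is why it runs one dummy test first.

```perl
use strict;
use warnings;
use Carp;
use Test::More;

# Hypothetical helper (not taken from the patch): decide what happens
# after polling for catchup has timed out.  $bug_present stands in for
# the result of PostgreSQL::Test::Utils::has_wal_read_bug.
sub handle_catchup_timeout
{
	my ($bug_present) = @_;

	# On unaffected systems, keep the existing behavior: die loudly.
	croak "timed out waiting for catchup" unless $bug_present;

	# Mimic having skipped the file.  Run one dummy test so that
	# done_testing() does not see an empty test run.
	ok(1, 'dummy test before skip for filesystem bug');
	carp "skip rest: timed out waiting for catchup & filesystem bug";
	done_testing();
	exit 0;
}
```

Called with a true argument, the helper records one passing test, warns, finishes the plan and exits with status 0, so the harness would see the file as passed instead of died. Called with a false argument, it croaks as before. One consequence of this design is that the check lives in a single place and not in each of the test files that call wait_for_catchup.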