Thread: has_wal_read_bug
027_stream_regress.pl has:

    if (PostgreSQL::Test::Utils::has_wal_read_bug)
    {
        # We'd prefer to use Test::More->builder->todo_start, but the bug causes
        # this test file to die(), not merely to fail.
        plan skip_all => 'filesystem bug';
    }

Is the die() referenced there the one from the system_or_bail() call
that commit a096813b got rid of?

Here's a failure in 031_recovery_conflict.pl that smells like
concurrent pread() corruption:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-16%2015%3A45%3A54

2022-05-16 18:10:33.375 CEST [52106:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2022-05-16 18:10:33.621 CEST [52105:5] LOG: incorrect resource manager data checksum in record at 0/338FDC8
2022-05-16 18:10:33.622 CEST [52106:2] FATAL: terminating walreceiver process due to administrator command

Presumably we also need the has_wal_read_bug kludge in all these new
tests that use replication.
On Tue, May 17, 2022 at 11:50:51AM +1200, Thomas Munro wrote:
> 027_stream_regress.pl has:
>
> if (PostgreSQL::Test::Utils::has_wal_read_bug)
> {
>     # We'd prefer to use Test::More->builder->todo_start, but the bug causes
>     # this test file to die(), not merely to fail.
>     plan skip_all => 'filesystem bug';
> }
>
> Is the die() referenced there the one from the system_or_bail() call
> that commit a096813b got rid of?

No, it was the 'croak "timed out waiting for catchup"',
e.g. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-01-25%2016%3A56%3A26

> Here's a failure in 031_recovery_conflict.pl that smells like
> concurrent pread() corruption:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-16%2015%3A45%3A54
>
> 2022-05-16 18:10:33.375 CEST [52106:1] LOG: started streaming WAL
> from primary at 0/3000000 on timeline 1
> 2022-05-16 18:10:33.621 CEST [52105:5] LOG: incorrect resource
> manager data checksum in record at 0/338FDC8
> 2022-05-16 18:10:33.622 CEST [52106:2] FATAL: terminating walreceiver
> process due to administrator command

Agreed. Here, too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-09%2015%3A46%3A03

> Presumably we also need the has_wal_read_bug kludge in all these new
> tests that use replication.

That is an option. One alternative is to reconfigure those three animals to
remove --enable-tap-tests. Another alternative is to make the build system
skip all files of all TAP suites on affected systems. Handling this on a
file-by-file basis seemed reasonable to me when only two files had failed that
way. Now, five files have failed. We have wait_for_catchup calls in
fifty-one files, and I wouldn't have chosen the has_wal_read_bug approach if I
had expected fifty-one files to eventually call it. I could tolerate it,
though.
On Tue, May 17, 2022 at 12:15:35AM -0700, Noah Misch wrote:
> On Tue, May 17, 2022 at 11:50:51AM +1200, Thomas Munro wrote:
> > Presumably we also need the has_wal_read_bug kludge in all these new
> > tests that use replication.
>
> That is an option. One alternative is to reconfigure those three animals to
> remove --enable-tap-tests. Another alternative is to make the build system
> skip all files of all TAP suites on affected systems. Handling this on a
> file-by-file basis seemed reasonable to me when only two files had failed that
> way. Now, five files have failed. We have wait_for_catchup calls in
> fifty-one files, and I wouldn't have chosen the has_wal_read_bug approach if I
> had expected fifty-one files to eventually call it. I could tolerate it,
> though.

Squashing another test that failed multiple times (commit a9f8ca6) led me to think of another option, attached. When wait_for_catchup() fails under has_wal_read_bug(), end the suite with an abrupt success. Thoughts?
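
For readers without the attachment, here is a rough sketch of the idea. It is not the attached patch. The names has_wal_read_bug_stub, poll_until_caught_up_stub and wait_for_catchup_sketch are hypothetical stand-ins so that the example runs without a PostgreSQL tree. The stubs are hard-coded to simulate an affected system whose standby never catches up.

```perl
use strict;
use warnings;
use Carp;
use Test::More;

# Hypothetical stand-in for PostgreSQL::Test::Utils::has_wal_read_bug;
# hard-coded to true so the sketch exercises the workaround path.
sub has_wal_read_bug_stub { return 1; }

# Hypothetical stand-in for the polling step inside wait_for_catchup();
# returns false to simulate a standby that never catches up.
sub poll_until_caught_up_stub { return 0; }

sub wait_for_catchup_sketch
{
    # Normal case: the standby caught up, so the caller carries on.
    return 1 if poll_until_caught_up_stub();

    # On unaffected systems, keep the existing hard failure.
    croak "timed out waiting for catchup" unless has_wal_read_bug_stub();

    # Affected system: record one trivial passing test so that
    # done_testing() has a non-empty plan to report, then end the
    # whole test file with exit status 0.
    ok(1, 'dummy test before skip for filesystem bug');
    carp "skip rest: timed out waiting for catchup & filesystem bug";
    done_testing();
    exit 0;
}

wait_for_catchup_sketch();

# Not reached in this simulation: the file has already exited successfully.
fail('remaining tests would run here on a healthy system');
```

Putting the check inside the shared helper means none of the fifty-one calling files would need its own has_wal_read_bug guard. The cost is that, on affected systems, whatever a test file would have checked after the failed catchup is silently skipped.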