Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Date
Msg-id CA+hUKGJ-gvLg9vJgMvu62XW-Nsb2Hf40DvqmKdusVeJ7KV-crA@mail.gmail.com
Whole thread Raw
In response to Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Michael Paquier <michael@paquier.xyz>)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Sat, Apr 9, 2022 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> > I tried adjusting the patch so it does guarantee that (as attached),
> > and in two out of two tries it got past the archive_cleanup_command
> > failure but then hung up waiting for standby2 to promote.
>
> Adding
>
> $node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
> just after
> $node_standby2->start;
>
> makes the tests pass here.

Sorry for the delay... I got a bit confused about the different things
going on in this thread but I hope I've got it now:

1.  This test had some pre-existing bugs/races, which hadn't failed
before due to scheduling, even under Valgrind.  The above changes
appear to fix those problems.  To Michael for comment.

> What is that second test really testing?
>
> # Check the presence of temporary files specifically generated during
> # archive recovery.  To ensure the presence of the temporary history
> # file, switch to a timeline large enough to allow a standby to recover
> # a history file from an archive.  As this requires at least two timeline
> # switches, promote the existing standby first.  Then create a second
> # standby based on the promoted one.  Finally, the second standby is
> # promoted.
>
> Note "Then create a second standby based on the promoted one." - but that's
> not actually what's happening:

2.  There may also be other problems with the test but those aren't
relevant to skink's failure, which starts on the 5th test.  To Michael
for comment.

> But I think there's also something funky in the prefetching logic. I think it
> may attempt restoring during prefetching somehow, even though there's code
> that appears to try to prevent that?

3.  Urghl.  Yeah.  There is indeed code to report XLREAD_WOULDBLOCK
for the lastSourceFailed case, but we don't really want to do that
more often than every 5 seconds.  So I think I need to make that
"sticky", so that we don't attempt to prefetch again (ie to read from
the next segment, which invokes restore_command) until we've replayed
all the records we already have, then hit the end and go to sleep.
The attached patch does that, and makes the offending test pass under
Valgrind for me, even without the other changes already mentioned.  If
I understand correctly, this is due to a timing race in the tests
(though I didn't check where exactly), because all those extra
fork/exec calls are extremely slow under Valgrind.

> And interestingly I'm not seeing the
> "switched WAL source from stream to archive after failure"
> lines I'd expect.

I see them now.  It's because it gives up when it's reading ahead
(nonblocking), which may not be strictly necessary but I found it
simpler to think about.  Then when it tries again in 5 seconds it's in
blocking mode so it doesn't give up so easily.

2022-04-11 18:15:08.220 NZST [524796] DEBUG:  switched WAL source from
stream to archive after failure
cp: cannot stat '/tmp/archive/000000010000000000000017': No such file
or directory
2022-04-11 18:15:08.226 NZST [524796] DEBUG:  could not restore file
"000000010000000000000017" from archive: child process exited with
exit code 1
2022-04-11 18:15:08.226 NZST [524796] DEBUG:  could not open file
"pg_wal/000000010000000000000017": No such file or directory
2022-04-11 18:15:08.226 NZST [524796] DEBUG:  switched WAL source from
archive to stream after failure

Attachment

pgsql-hackers by date:

Previous
From: Qiongwen Liu
Date:
Subject: PostgreSQL Program Application
Next
From: Michael Paquier
Date:
Subject: Fixes for compression options of pg_receivewal and refactoring of backup_compression.{c,h}