I noticed that the new TAP test for basic_archive was failing
intermittently for cfbot. It looks like the query for checking that the
post-backup WAL is restored sometimes executes before archive recovery is
complete (because hot_standby is on). To fix this, I adjusted the test to
use poll_query_until instead. There are no other changes in v14.
I first tried to set hot_standby to off on the restored node so that the
query wouldn't run until archive recovery completed. This seemed like it
would work because start() useѕ "pg_ctl --wait", which has the following
note in the docs:
Startup is considered complete when the PID file indicates that the
server is ready to accept connections.
However, that's not what happens when hot_standby is off. In that case,
the postmaster.pid file is updated with PM_STATUS_STANDBY once recovery
starts, which wait_for_postmaster_start() interprets as "ready." I see
this was reported before [0], but that discussion fizzled out. IIUC it was
done this way to avoid infinite waits when hot_standby is off and standby
mode is enabled. I could be missing something obvious, but that doesn't
seem necessary when hot_standby is off and recovery mode is enabled because
recovery should end at some point (never mind the halting problem). I'm
still digging into this and may spin off a new thread if I can conjure up a
proposal.
[0] https://postgr.es/m/CAMkU%3D1wrMqPggnEfszE-c3PPLmKgRK17_qr7tmxBECYEbyV-4Q%40mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com