Hi,
Our team has previously asked on pgsql-admin and pgsql-general about a standby
that never switches to streaming replication while recovering: [1]
Our investigation has shown that this happens because an xlog record often
falls on a WAL segment boundary, which makes a single XLogDecodeNextRecord
call fetch pages from two different WAL segments. With prefetching enabled,
this causes the following sequence of events:
1. prefetching successfully reads page 1;
2. prefetching fails to read page 2 because the corresponding WAL segment has
not been uploaded to the archive yet, and gets XLREAD_WOULDBLOCK;
3. once all of the prefetched records have been decoded, recovery attempts to
read the next record;
4. recovery reads page 1 again, reinvoking restore_command.
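
To make the boundary case concrete, here is a rough standalone sketch of the
arithmetic in Python (this is not the patch and not PostgreSQL code; the
default 16 MB segment size, the 8 kB page size and the LSN values are just
assumptions for illustration):

WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # assumed default wal_segment_size
XLOG_BLCKSZ = 8192                    # assumed WAL page size

def seg_no(lsn):
    return lsn // WAL_SEGMENT_SIZE

def page_start(lsn):
    return lsn - (lsn % XLOG_BLCKSZ)

# A made-up record that starts 100 bytes before the end of segment 42 and is
# 300 bytes long, so it ends inside segment 43.
rec_start = 43 * WAL_SEGMENT_SIZE - 100
rec_end = rec_start + 300

print(seg_no(rec_start), seg_no(rec_end))          # 42 43
print(page_start(rec_start), page_start(rec_end))  # the two pages a single
                                                   # XLogDecodeNextRecord call needs

The last page of segment 42 and the first page of segment 43 live in different
segment files, so when the read of the second page returns XLREAD_WOULDBLOCK
and the record is later retried from its start, segment 42 gets restored from
the archive a second time.
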
Because only one WAL segment is kept open at a time, this causes PostgreSQL to
fetch the same segment from the archive twice, which can be very slow if the
archive is network-attached; the latency of archive fetches may even be high
enough that recovery never catches up to the primary. Since the only piece of
information a restore_command receives is the segment number, it cannot
distinguish this situation from the database restarting, so it cannot refuse
to redownload the segment either. We use the CloudNativePG operator, which
prefetches multiple WAL segments at a time and relies on a one-off flag to
stop, but the nonmonotonicity of the requested segment numbers makes that
flag useless.
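
To illustrate why monotonicity matters to external tooling, here is a minimal
hypothetical restore_command wrapper (this is not CloudNativePG's
implementation; the spool directory, the fetch-wal-from-archive command and
the eviction policy are all made up) that keeps a local spool of prefetched
segments and evicts everything it assumes will not be requested again:

#!/usr/bin/env python3
# Hypothetical wrapper, configured e.g. as:
#   restore_command = '/path/to/wal_restore.py %f %p'
import os
import shutil
import subprocess
import sys

SPOOL_DIR = "/var/lib/postgresql/wal-spool"   # made-up local cache of prefetched segments

def fetch_from_archive(wal_name, dest_path):
    # Stand-in for the slow fetch from a network-attached archive; the
    # "fetch-wal-from-archive" command is made up for this sketch.  A failure
    # makes the script exit nonzero, which recovery treats as "not available".
    subprocess.run(["fetch-wal-from-archive", wal_name, dest_path], check=True)

def main():
    wal_name, dest_path = sys.argv[1], sys.argv[2]
    os.makedirs(SPOOL_DIR, exist_ok=True)
    cached = os.path.join(SPOOL_DIR, wal_name)

    if os.path.exists(cached):
        shutil.copy(cached, dest_path)            # cheap: served from the local spool
    else:
        fetch_from_archive(wal_name, dest_path)   # expensive: remote archive fetch

    # Eviction assumes requests arrive in increasing order: anything whose name
    # sorts at or below the segment just served should never be needed again.
    # A repeated request for an earlier segment (the behaviour described above)
    # therefore always misses the spool and pays the remote fetch a second time.
    for f in os.listdir(SPOOL_DIR):
        if f <= wal_name:
            os.remove(os.path.join(SPOOL_DIR, f))

if __name__ == "__main__":
    main()

With monotonically increasing requests this kind of eviction (or CloudNativePG's
one-off flag) is safe; with the current behaviour the repeated request for an
already-served segment always falls through to the slow remote fetch.
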
The attached patch fixes this by skipping calls to ReadPageInternal when the
required data is already present in the record reassembly buffer. This reduces
the number of I/O operations during recovery and ensures that restore_command
is only invoked with monotonically increasing segment numbers within a single
recovery run.
The patch is against the current master branch, but the nonmonotonicity has
been present since at least v15. I'm not sure whether it makes sense to
backport the patch, since it is technically just a performance improvement.
I'm also not sure how to write a regression test for this, but the code passes
all existing regression tests, and I ran the manual reproduction to confirm
that the issue we observed has been eliminated.
[1] https://postgr.es/m/CANOng2i1G_57nvZ4ip4uKKU87jtt%2BfzqWUFV_ou6L8N3bteSXQ%40mail.gmail.com
// Sonya Valchuk