Reuse data from readRecordBuf in XLogDecodeNextRecord - Mailing list pgsql-hackers

From Sonya Valchuk
Subject Reuse data from readRecordBuf in XLogDecodeNextRecord
Date
Msg-id CAJLmdKyC3iq-UjYDG0S0rLQfRxfhcctWNcN-i3=3t6ceaUu1oA@mail.gmail.com
List pgsql-hackers
Hi,

Our team has previously asked on pgsql-admin/pgsql-general about a standby
that never switches to streaming replication while recovering: [1]

Our investigation has shown that this happens because an xlog record often
straddles a WAL segment boundary, which makes a single XLogDecodeNextRecord
call fetch pages from two different archived segments. With prefetching
enabled, this causes the following sequence of events:

1. prefetching successfully reads page 1;
2. prefetching fails to read page 2 because the corresponding WAL has not been
   uploaded to the archive yet, gets XLREAD_WOULDBLOCK;
3. all of the prefetched records are decoded, recovery attempts to read the
   next record;
4. recovery reads page 1 again, reinvoking restore_command.

Because only one WAL is kept open at a time, this causes PostgreSQL to fetch
one WAL from the archive twice, which can be a very slow operation if the
archive is network-attached; the latency of archive fetches may even be
significant enough that recovery never catches up to the primary. Since the
only piece of information a restore_command receives is the segment number,
it cannot distinguish this situation from the database restarting, so it can't
refuse to redownload the WAL either. We use the CloudNativePG operator, which
prefetches multiple WALs at a time and makes use of a one-off flag to stop,
but the nonmonotonicity of the segment number makes the one-off flag useless.
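
As a rough illustration of the sequence above, here is a toy Python model (not
PostgreSQL code). The Archive and Reader classes, the page numbers, and
PAGES_PER_SEGMENT are all made up for the example; the only behaviour taken
from the description is that one segment is open at a time and that switching
segments invokes restore_command.

```python
# Toy model only: every name here is hypothetical, not a PostgreSQL API.
PAGES_PER_SEGMENT = 4  # assumption; real WAL segments hold far more pages


class Archive:
    """Stands in for restore_command and logs every invocation."""

    def __init__(self, available):
        self.available = set(available)  # segments already uploaded
        self.calls = []                  # segment numbers requested, in order

    def restore(self, segno):
        self.calls.append(segno)
        return segno in self.available


class Reader:
    """Keeps at most one segment open, as described above."""

    def __init__(self, archive):
        self.archive = archive
        self.open_segno = None

    def read_page(self, pageno):
        segno = pageno // PAGES_PER_SEGMENT
        if segno != self.open_segno:
            # Switching segments closes the current one first.
            self.open_segno = None
            if not self.archive.restore(segno):
                return False  # analogous to XLREAD_WOULDBLOCK
            self.open_segno = segno
        return True


archive = Archive(available=[1])   # segment 2 is not in the archive yet
reader = Reader(archive)

reader.read_page(7)   # step 1: last page of segment 1, succeeds
reader.read_page(8)   # step 2: first page of segment 2, would block
reader.read_page(7)   # step 4: recovery re-reads page 1 of the record

print(archive.calls)  # [1, 2, 1]: segment 1 is fetched twice
```

The resulting call sequence 1, 2, 1 is the nonmonotonicity that defeats a
one-off flag in the restore_command wrapper.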

The attached patch fixes this situation by skipping calls to ReadPageInternal
if the required data is already present in the record reassembly buffer,
reducing the number of I/O operations during recovery and ensuring that
restore_command is only executed with monotonically increasing segment
numbers during a single recovery run.
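
The idea can be sketched in Python as follows. This is not the attached patch
and does not mirror the xlogreader code: RecordReader, fetch_page, and the
per-page dictionary standing in for the record reassembly buffer are
assumptions made for the illustration.

```python
# Sketch of the idea only; names and data layout are hypothetical.
class RecordReader:
    def __init__(self, fetch_page):
        self.fetch_page = fetch_page  # callable: page number -> bytes or None
        self.buf = {}                 # stand-in for the reassembly buffer

    def read_record(self, pagenos):
        """Assemble a record from the given pages, or return None if a
        page is not available yet. Already-buffered pages are not re-read."""
        for pageno in pagenos:
            if pageno in self.buf:
                continue              # data already buffered: skip the read
            data = self.fetch_page(pageno)
            if data is None:
                return None           # would block; keep what we have so far
            self.buf[pageno] = data
        record = b"".join(self.buf[p] for p in pagenos)
        self.buf.clear()              # record complete, buffer can be reused
        return record


pages = {7: b"head"}                  # page 8 has not been archived yet
fetch_log = []


def fetch_page(pageno):
    fetch_log.append(pageno)
    return pages.get(pageno)


reader = RecordReader(fetch_page)
first = reader.read_record([7, 8])    # blocks on page 8, page 7 is buffered
pages[8] = b"tail"                    # the next segment arrives
second = reader.read_record([7, 8])   # only page 8 is fetched this time

print(fetch_log)                      # [7, 8, 8]: page 7 was read once
```

In this sketch the retry never goes back to page 7, so the page reads (and
therefore the segments requested) only move forward.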

The patch is for the current master branch, but the nonmonotonicity has
been present since at least v15. I don't know if it makes sense to backport
the patch, since it's technically just a performance improvement. I'm also not
sure how to regression-test this, but the code passes all existing
regression tests, and I ran the manual reproduction to confirm that the issue
we observed is gone.

[1] https://postgr.es/m/CANOng2i1G_57nvZ4ip4uKKU87jtt%2BfzqWUFV_ou6L8N3bteSXQ%40mail.gmail.com

// Sonya Valchuk
