Re: Infinite loop in XLogPageRead() on standby - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Infinite loop in XLogPageRead() on standby
Date
Msg-id ZeEQcMBA98eZOHpC@paquier.xyz
In response to Re: Infinite loop in XLogPageRead() on standby  (Alexander Kukushkin <cyberdemn@gmail.com>)
Responses Re: Infinite loop in XLogPageRead() on standby
List pgsql-hackers
On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
> On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
>> In the first place, it's important to note that we do not guarantee
>> that an async standby can always switch its replication connection to
>> the old primary or another sibling standby. This is due to the
>> variations in replication lag among standbys. pg_rewind is required to
>> adjust such discrepancies.
>
> Sure, I know. But in this case the async standby received and flushed
> absolutely the same amount of WAL as the promoted one.

Ugh.  If this can happen transparently, without the user knowing about
it, that does not sound good to me.  I have not looked very closely at
the monitoring tools available out there, but if both standbys flushed
the same WAL location, a rewind should not be required.  This is not
something monitoring tools would be able to detect, because they only
look at LSNs.
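To illustrate why an LSN-based check cannot catch this, here is a small
sketch (in Python, not anything an actual tool does; the LSN values are
made up).  A pg_lsn value is the textual "hi/lo" pair of hex words, and
comparing the parsed values is all such a check can do:

```python
# Sketch: how a monitoring tool would compare flushed positions.
# pg_lsn text values are "hi/lo" pairs of hexadecimal 32-bit words.
def parse_lsn(lsn: str) -> int:
    """Convert a pg_lsn string such as '0/3000148' to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Both standbys report the same flushed LSN, so an LSN-based check
# concludes no rewind is needed -- yet the stuck-read problem can
# still occur.
standby_a = parse_lsn("0/3000148")
standby_b = parse_lsn("0/3000148")
assert standby_a == standby_b
```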

>> In the repro script, the replication connection of the second standby
>> is switched from the old primary to the first standby after its
>> promotion. After the switching, replication is expected to continue
>> from the beginning of the last replayed segment.
>
> Well, maybe, but apparently the standby is busy trying to decode a record
> that spans multiple pages, and it is just infinitely waiting for the next
> page to arrive. Also, the restart "fixes" the problem, because indeed it is
> reading the file from the beginning.

What happens if the continuation record spans across multiple segment
file boundaries in this case?  We would go back to the beginning of
the segment where the record spanning across multiple segments began,
right?  (I may be recalling this part of contrecords incorrectly; feel
free to correct me if necessary.)
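In arithmetic terms, the restart position I have in mind is the start
of the segment holding the beginning of the record (a hypothetical
sketch, not backend code; it assumes the default 16MB
wal_segment_size):

```python
# Sketch of the restart arithmetic, assuming the default
# wal_segment_size of 16 MB.  Streaming is assumed to restart at the
# beginning of the segment that holds the start of the record.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

def segment_start(lsn: int) -> int:
    """LSN of the first byte of the segment containing `lsn`."""
    return lsn - (lsn % WAL_SEGMENT_SIZE)

# A record beginning 100 bytes before the end of segment 1 continues
# into segment 2; replay would have to go back to segment 1's start.
record_start = 2 * WAL_SEGMENT_SIZE - 100
assert segment_start(record_start) == 1 * WAL_SEGMENT_SIZE
```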

>> But with the script,
>> the second standby copies the intentionally broken file, which differs
>> from the data that should be received via streaming.
>
> As I already said, this is a simple way to emulate the primary crash while
> standbys receiving WAL.
> It could easily happen that a record spanning multiple pages is not
> fully received and flushed.

FWIW, I think that's OK to do at the test level to force the behavior
in the backend, because it's cheaper, and 039_end_of_wal.pl has proved
that such a test can be designed to be cheap and stable across the
buildfarm fleet.

For anything like that, the result had better have solid test
coverage, where perhaps we'd better refactor some of the routines of
039_end_of_wal.pl into a module so they can be reused here, rather
than duplicating the code.  The other test makes a few assumptions
about the calculation of page boundaries, and I'd rather not duplicate
that across the tree.

Seeing that Alexander is a maintainer of Patroni, which is very
probably used by his employer across a large set of PostgreSQL
instances, if he says that he has seen this in the field, that is good
enough for me to respond to the problem, especially when it involves
reconnecting a standby to a promoted node where both have flushed the
same LSN.  Now, the level of response also depends on the invasiveness
of the change, and we need a very careful evaluation here.
--
Michael
