Re: Infinite loop in XLogPageRead() on standby - Mailing list pgsql-hackers

From	Alexander Kukushkin
Subject	Re: Infinite loop in XLogPageRead() on standby
Date	February 29, 2024 16:44:25
Msg-id	CAFh8B==zUj1+asN5REAvqJccgUZFgOh5Ze9c=mOrGypRuTEm=g@mail.gmail.com
In response to	Re: Infinite loop in XLogPageRead() on standby (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses	Re: Infinite loop in XLogPageRead() on standby
List	pgsql-hackers

Hi Kyotaro,

On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> In the first place, it's important to note that we do not guarantee
> that an async standby can always switch its replication connection to
> the old primary or another sibling standby. This is due to the
> variations in replication lag among standbys. pg_rewind is required to
> adjust such discrepancies.

Sure, I know. But in this case the async standby received and flushed exactly the same amount of WAL as the promoted one.

> I might be overlooking something, but I don't understand how this
> occurs without purposefully tweaking WAL files. The repro script
> pushes an incomplete WAL file to the archive as a non-partial
> segment. This shouldn't happen in the real world.

It happens easily if the primary crashes before the standbys have received the next page, the one containing the continuation of the record.

> In the repro script, the replication connection of the second standby
> is switched from the old primary to the first standby after its
> promotion. After the switching, replication is expected to continue
> from the beginning of the last replayed segment.

Well, maybe, but apparently the standby is busy trying to decode a record that spans multiple pages, and it just waits forever for the next page to arrive. Also, a restart "fixes" the problem, because then it really does read the file from the beginning.

> But with the script,
> the second standby copies the intentionally broken file, which differs
> from the data that should be received via streaming.

As I already said, this is a simple way to emulate a primary crash while the standbys are receiving WAL.

It can easily happen that a record spanning multiple pages is not fully received and flushed.
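
To make the scenario concrete, here is a toy model of a record that crosses a page boundary when only the first page has been flushed. It is only a sketch and not PostgreSQL code: the function names `pages_needed` and `try_read_record` are hypothetical, offsets are plain byte positions rather than real LSNs, and page headers on continuation pages are ignored. The only real value is the 8192-byte page size, which is the default WAL block size.

```python
PAGE_SIZE = 8192  # default WAL block size; everything else here is illustrative


def pages_needed(record_start: int, record_len: int) -> range:
    """Page numbers touched by a record (page headers are ignored)."""
    first = record_start // PAGE_SIZE
    last = (record_start + record_len - 1) // PAGE_SIZE
    return range(first, last + 1)


def try_read_record(record_start: int, record_len: int, flushed_upto: int):
    """Return ("ok", None) if the whole record lies below flushed_upto,
    otherwise ("wait", page) for the first page that is still incomplete."""
    record_end = record_start + record_len
    for page in pages_needed(record_start, record_len):
        page_end = (page + 1) * PAGE_SIZE
        # Bytes of this record that must be present on this page.
        needed_end = min(record_end, page_end)
        if flushed_upto < needed_end:
            return ("wait", page)
    return ("ok", None)


# A 500-byte record starting 192 bytes before the end of page 0.
# Only page 0 was received and flushed before the primary went away.
print(try_read_record(8000, 500, flushed_upto=8192))
```

In this model the reader keeps asking for page 1. If that page never arrives with the expected continuation, the wait never ends, whereas re-reading from the start of the segment (as happens after a restart) begins from a clean state.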

Regards,

Alexander Kukushkin
