Re: Infinite loop in XLogPageRead() on standby - Mailing list pgsql-hackers

From Alexander Kukushkin
Subject Re: Infinite loop in XLogPageRead() on standby
Msg-id CAFh8B=nPSERv7NyYHmjVXK4xK3va1XzU3-rhOswjgEZMWkV=RQ@mail.gmail.com
In response to Re: Infinite loop in XLogPageRead() on standby  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Hi Michael,

On Thu, 29 Feb 2024 at 06:05, Michael Paquier <michael@paquier.xyz> wrote:

> Wow.  Have you seen that in an actual production environment?

Yes, we see it regularly, and it is reproducible in test environments as well.
 
> > my $start_page = start_of_page($end_lsn);
> > my $wal_file = write_wal($primary, $TLI, $start_page,
> >                          "\x00" x $WAL_BLOCK_SIZE);
> > # copy the file we just "hacked" to the archive
> > copy($wal_file, $primary->archive_dir);
>
> So you are emulating a failure by filling with zeros the second page
> where the last emit_message() generated a record, and the page before
> that includes the continuation record.  Then abuse of WAL archiving to
> force the replay of the last record.  That's kind of cool.

Right, at this point it is easier than causing an artificial crash on the primary right after it has finished writing just one page.
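
For readers following the test, the page arithmetic behind the zero-fill can be sketched in C. This is only an illustration under assumptions: the test's own start_of_page() and write_wal() helpers are not shown in this message, WAL_BLOCK_SIZE is assumed to be the default 8192 bytes, and segment_offset() and zero_page() are hypothetical names that operate on an in-memory copy of a segment rather than on a real WAL file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed default WAL block size; a given build may use another value. */
#define WAL_BLOCK_SIZE 8192u

/* Round an LSN down to the first byte of the WAL page that contains it. */
static uint64_t
start_of_page(uint64_t lsn)
{
    return lsn - (lsn % WAL_BLOCK_SIZE);
}

/* Byte offset of an LSN inside its segment (hypothetical helper). */
static uint64_t
segment_offset(uint64_t lsn, uint64_t wal_segment_size)
{
    assert(wal_segment_size > 0);
    return lsn % wal_segment_size;
}

/*
 * Overwrite one whole page of an in-memory segment image with zeros
 * (hypothetical helper). page_lsn must point at the start of a page.
 */
static void
zero_page(unsigned char *segment, uint64_t wal_segment_size, uint64_t page_lsn)
{
    uint64_t off = segment_offset(page_lsn, wal_segment_size);

    assert(off % WAL_BLOCK_SIZE == 0);
    assert(off + WAL_BLOCK_SIZE <= wal_segment_size);
    memset(segment + off, 0, WAL_BLOCK_SIZE);
}
```

With 16 MB segments, a call such as zero_page(buf, 16 * 1024 * 1024, start_of_page(end_lsn)) would blank the page that contains end_lsn, which is roughly what the Perl snippet quoted above does to the file on disk before copying it to the archive.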
 
> > To be honest, I don't know yet how to fix it nicely. I am thinking about
> > returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> > timeline while trying to read a page and if this page is invalid.
>
> Hmm.  I suspect that you may be right on a TLI change when reading a
> page.  There are a bunch of side cases with continuation records and
> header validation around XLogReaderValidatePageHeader().  Perhaps you
> have an idea of patch to show your point?

Not yet, but hopefully I will get something done next week.
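
Until there is a real patch, the idea quoted above can at least be illustrated with a toy model in C. This is not PostgreSQL code and not the eventual fix: ToyReader, ToyReadResult and toy_page_read() are hypothetical names, the retry loop is a deliberate simplification, and max_attempts exists only so that the demonstration terminates where the real loop would not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes; TOY_READ_FAIL only mirrors the idea of XLREAD_FAIL. */
typedef enum
{
    TOY_READ_OK,        /* page validated */
    TOY_READ_FAIL,      /* gave up because of a timeline switch */
    TOY_READ_GAVE_UP    /* demo-only bound reached; stands in for "loops forever" */
} ToyReadResult;

typedef struct ToyReader
{
    int  start_tli;     /* timeline when the read started */
    int  current_tli;   /* timeline after the page was fetched */
    bool page_valid;    /* did the page header validate? */
    int  attempts;      /* how many times the page was fetched */
} ToyReader;

/*
 * Toy retry loop. Without fail_on_tli_switch an invalid page is simply
 * fetched again and again. With it, an invalid page that was fetched after
 * a timeline switch makes the function report failure to the caller.
 */
static ToyReadResult
toy_page_read(ToyReader *r, bool fail_on_tli_switch, int max_attempts)
{
    assert(max_attempts > 0);
    r->attempts = 0;

    for (;;)
    {
        r->attempts++;

        if (r->page_valid)
            return TOY_READ_OK;

        if (fail_on_tli_switch && r->current_tli != r->start_tli)
            return TOY_READ_FAIL;

        if (r->attempts >= max_attempts)
            return TOY_READ_GAVE_UP;
    }
}
```

In this model, a reader that started on timeline 1, ended up on timeline 2 and got a zeroed page spins until the bound when the flag is off, and returns the failure code on the first attempt when the flag is on. Whether the real function can bail out that simply depends on the continuation-record and header-validation cases mentioned in the quoted reply.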
 

> Nit.  In your test, it seems to me that you should not call directly
> set_standby_mode and enable_restoring, just rely on has_restoring with
> the standby option included.

Thanks, I'll look into it. 

--
Regards,
--
Alexander Kukushkin
