Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction" - Mailing list pgsql-bugs

From Kirill Reshke
Subject Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction"
Msg-id CALdSSPhMhNzRRd-SeU0PTwKiGDpFOb5Yss7PWBPN3cHv6kW8eQ@mail.gmail.com
In response to Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction"  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction"
List pgsql-bugs
On Sat, 14 Feb 2026 at 16:42, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 13/02/2026 22:31, Sebastian Webber wrote:
> > PostgreSQL version: 17.8 (standby), 17.5 (primary)
> >
> > Primary: PostgreSQL 17.5 (Debian 17.5-1.pgdg130+1) on aarch64-unknown-linux-gnu
> > Standby: PostgreSQL 17.8 (Debian 17.8-1.pgdg13+1) on aarch64-unknown-linux-gnu
> >
> > Platform: Docker containers on macOS (Apple Silicon / aarch64), Docker
> > Desktop
> >
> >
> > Description
> > -----------
> >
> > A PostgreSQL 17.8 standby crashes during WAL replay when streaming
> > from a 17.5 primary. The crash occurs after replaying a
> > MultiXact/TRUNCATE_ID record followed by a MultiXact/CREATE_ID
> > record.
>
> Thanks for the report, I can repro it with your script. It is indeed a
> regression introduced in the latest minor release, in the logic to
> replay multixact WAL generated on older minor versions. (Commit
> 8ba61bc063). Adding the folks from the thread that led to that commit.
>
> The commit added this in RecordNewMultiXact():
>
> >       /*
> >        * Older minor versions didn't set the next multixid's offset in this
> >        * function, and therefore didn't initialize the next page until the next
> >        * multixid was assigned.  If we're replaying WAL that was generated by
> >        * such a version, the next page might not be initialized yet.  Initialize
> >        * it now.
> >        */
> >       if (InRecovery &&
> >               next_pageno != pageno &&
> >               pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
> >       {
> >               elog(DEBUG1, "next offsets page is not initialized, initializing it now");
>
> The idea is that if the next offset falls on a different page
> (next_pageno != pageno), and we have not yet initialized the next page
> (pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) ==
> pageno), we initialize it now. However, that last check goes wrong after
> a truncation record is replayed. Replaying a truncation record does this:
>
> >
> >               /*
> >                * During XLOG replay, latest_page_number isn't necessarily set up
> >                * yet; insert a suitable value to bypass the sanity test in
> >                * SimpleLruTruncate.
> >                */
> >               pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
> >               pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number,
> >                                                       pageno);
>
> Thanks to that, latest_page_number moves backwards to a much older page
> number. That breaks the "was the next offset page already initialized?"
> test in RecordNewMultiXact().
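
To make the failure mode above concrete, here is a toy model of the two pieces of logic quoted so far. It is a sketch only, not PostgreSQL code: the struct and the `toy_*` functions are hypothetical names, pages are plain integers, and the `InRecovery` condition is assumed to be true throughout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for MultiXactOffsetCtl->shared. */
typedef struct ToyOffsetsSlru
{
    int64_t latest_page_number; /* newest offsets page believed initialized */
} ToyOffsetsSlru;

/*
 * Simplified form of the test added to RecordNewMultiXact(): the next
 * multixid lands on a different page, and that page is assumed not to
 * be initialized yet because the current page is still the latest one.
 */
bool
toy_needs_next_page_init(const ToyOffsetsSlru *slru,
                         int64_t pageno, int64_t next_pageno)
{
    return next_pageno != pageno && slru->latest_page_number == pageno;
}

/*
 * Simplified truncation replay: it overwrites latest_page_number with
 * the truncation page, as the quoted "bypass the sanity test" code does.
 */
void
toy_replay_truncate(ToyOffsetsSlru *slru, int64_t end_trunc_page)
{
    slru->latest_page_number = end_trunc_page;
}
```

With `latest_page_number` at 10, `toy_needs_next_page_init(&slru, 10, 11)` is true, so page 11 would be initialized. After `toy_replay_truncate(&slru, 3)` the same call returns false, the initialization is skipped, and a later access to page 11 would fail.
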
>
> I don't understand why that "bypass the sanity check" is needed. As far
> as I can see, latest_page_number is tracked accurately during WAL
> replay, and should already be set up. It's initialized in
> StartupMultiXact(), and updated whenever the next page is initialized.
>
> That was introduced a long time ago, in commit 4f627f8973, which in turn
> was backpatched and had to deal with WAL that was generated before that
> commit. I suspect it was necessary back then, for backwards
> compatibility, but isn't necessary any more. Hence, I propose to remove
> that "bypass the sanity check" code (attached). Does anyone see a
> scenario where latest_page_number might not be set correctly?
>
> If we want to play it even more safe -- and I guess that's the right
> thing to do for backpatching -- we could set latest_page_number
> *temporarily* while we do the truncation, and restore the old value
> afterwards.
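
The "set temporarily and restore" variant can be sketched as follows. This is an illustration under assumptions, not the attached patch: `ToySlruShared`, `toy_truncate()` and `toy_replay_truncate_restoring()` are hypothetical names, and the sanity test is reduced to a plain integer comparison with no wraparound handling.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the SLRU shared state. */
typedef struct ToySlruShared
{
    int64_t latest_page_number;
} ToySlruShared;

/*
 * Toy truncation with a sanity test: refuse to truncate if the cutoff
 * lies beyond the latest known page. Returns true if it truncated.
 */
bool
toy_truncate(const ToySlruShared *shared, int64_t cutoff_page)
{
    if (shared->latest_page_number < cutoff_page)
        return false;           /* would look like wraparound; refuse */
    /* ... segment removal would happen here ... */
    return true;
}

/*
 * Replay a truncation while leaving latest_page_number unchanged
 * afterwards: save it, set a value that passes the sanity test,
 * truncate, then put the saved value back.
 */
bool
toy_replay_truncate_restoring(ToySlruShared *shared, int64_t end_trunc_page)
{
    int64_t saved = shared->latest_page_number;
    bool    truncated;

    shared->latest_page_number = end_trunc_page;    /* temporary value */
    truncated = toy_truncate(shared, end_trunc_page);
    shared->latest_page_number = saved;             /* restore old value */

    return truncated;
}
```

In this model the truncation still passes its sanity test even if `latest_page_number` was stale beforehand, and any later comparison against `latest_page_number` sees the value that was tracked before the truncation record was replayed.
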
>
> This fixes the bug. With this fix, you can replay WAL that's already
> been generated.
>
> - Heikki

Hi!
Patch LGTM. Let's wrap new minors with it?

-- 
Best regards,
Kirill Reshke


