Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction" - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction"
Date
Msg-id 349f9c82-3a8b-48ad-8cc4-fe81553793dd@iki.fi
In response to 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction"  (Sebastian Webber <sebastian@swebber.me>)
Responses Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction"
Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction"
List pgsql-bugs
On 13/02/2026 22:31, Sebastian Webber wrote:
> PostgreSQL version: 17.8 (standby), 17.5 (primary)
> 
> Primary: PostgreSQL 17.5 (Debian 17.5-1.pgdg130+1) on aarch64-unknown-linux-gnu
> Standby: PostgreSQL 17.8 (Debian 17.8-1.pgdg13+1) on aarch64-unknown-linux-gnu
> 
> Platform: Docker containers on macOS (Apple Silicon / aarch64), Docker 
> Desktop
> 
> 
> Description
> -----------
> 
> A PostgreSQL 17.8 standby crashes during WAL replay when streaming
> from a 17.5 primary. The crash occurs after replaying a
> MultiXact/TRUNCATE_ID record followed by a MultiXact/CREATE_ID
> record.

Thanks for the report, I can reproduce it with your script. It is indeed 
a regression introduced in the latest minor release, in the logic that 
replays multixact WAL generated by older minor versions (commit 
8ba61bc063). Adding the folks from the thread that led to that commit.

The commit added this in RecordNewMultiXact():

>     /*
>      * Older minor versions didn't set the next multixid's offset in this
>      * function, and therefore didn't initialize the next page until the next
>      * multixid was assigned.  If we're replaying WAL that was generated by
>      * such a version, the next page might not be initialized yet.  Initialize
>      * it now.
>      */
>     if (InRecovery &&
>         next_pageno != pageno &&
>         pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
>     {
>         elog(DEBUG1, "next offsets page is not initialized, initializing it now");

The idea is that if the next offset falls on a different page 
(next_pageno != pageno), and we have not yet initialized the next page 
(pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == 
pageno), we initialize it now. However, that last check goes wrong after 
a truncation record is replayed. Replaying a truncation record does this:

> 
>         /*
>          * During XLOG replay, latest_page_number isn't necessarily set up
>          * yet; insert a suitable value to bypass the sanity test in
>          * SimpleLruTruncate.
>          */
>         pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
>         pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number,
>                             pageno);
Thanks to that, latest_page_number moves backwards, to a much older page 
number. That breaks the "was the next offset page already initialized?" 
test in RecordNewMultiXact(): latest_page_number no longer equals 
pageno, so the test wrongly concludes the next page was already 
initialized and skips it.

I don't understand why that "bypass the sanity check" is needed. As far 
as I can see, latest_page_number is tracked accurately during WAL 
replay, and should already be set up. It's initialized in 
StartupMultiXact(), and updated whenever the next page is initialized.

That was introduced a long time ago, in commit 4f627f8973, which in turn 
was backpatched and had to deal with WAL generated before that commit. I 
suspect it was necessary back then, for backwards compatibility, but 
isn't necessary any more. Hence, I propose to remove that "bypass the 
sanity check" code (attached). Does anyone see a scenario where 
latest_page_number might not be set correctly?

If we want to play it even safer -- and I guess that's the right thing 
to do for backpatching -- we could set latest_page_number *temporarily* 
while we do the truncation, and restore the old value afterwards.

This fixes the bug, and with the fix you can also replay WAL that has 
already been generated.

- Heikki

Attachment
