Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
Date
Msg-id 18a40051-abc2-4fc2-8dcc-4dc39aa3e79e@iki.fi
Whole thread Raw
In response to Bug in MultiXact replay compat logic for older minor version after crash-recovery  ("段坤仁(刻韧)" <duankunren.dkr@alibaba-inc.com>)
Responses 回复:Bug in MultiXact replay compat logic for older minor version after crash-recovery
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
List pgsql-hackers
On 19/03/2026 20:02, 段坤仁(刻韧) wrote:
> Hi hackers,
>      
> This is related to two recent threads:
>      
> [0] https://www.postgresql.org/message-id/flat/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru
> [1]
https://www.postgresql.org/message-id/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com
>      
> I think there may be a bug in the compat logic introduced in
> RecordNewMultiXact() to fix [0]. The compat logic handles WAL
> generated by older  minor versions where ZERO_OFF_PAGE N+1
> may appear after CREATE_ID:N in the WAL stream. However, the condition
> used to detect whether the next offset page needs initialization
> seems to fail after a crash-restart, causing the standby to enter a
> FATAL crash loop.
>      
> We hit this in production: a PG 16.12 standby replaying WAL from a
> PG 16.8 primary entered a crash-restart loop with:
>      
>> FATAL:  could not access status of transaction 1277952
>> DETAIL:  Could not read from file "pg_multixact/offsets/0013" at offset
>> 131072: read too few bytes.
>> CONTEXT:  WAL redo at 5194/5F5F7118 for MultiXact/CREATE_ID: 1277951
>> offset 2605275 nmembers 2: 814605500 (keysh) 814605501 (keysh)

:-(. This is a gift that keeps giving.

> == Analysis ==
> 
> ...
> 
> == Timeline: the lock-free window ==
>      
> ...
> 
> == Reproduction ==

Thanks for the thorough analysis and the repro!

> == Fix ideas ==
>      
> I have two ideas for fixing this. Two independent patches are
> attached.
>      
> Idea 1 (0001-Fix-multixact-compat-logic-via-unconditional-zero.patch):
>      
> Remove the "latest_page_number == pageno" condition entirely. During
> recovery, whenever we cross a page boundary in RecordNewMultiXact(),
> unconditionally ensure the next page is initialized via
> SimpleLruZeroPage().
>      
> Pro: Minimal change, easy to review, impossible to miss any edge
> case since we always initialize. Safe because SimpleLruZeroPage is
> idempotent -- at this point the next page only contains zeros
> (CREATE_ID records for that page haven't been replayed yet), so
> re-zeroing loses nothing.
> 
> Con: When replaying WAL from a new-version primary (where
> ZERO_OFF_PAGE is emitted BEFORE CREATE_ID), the page has already
> been initialized by the ZERO_OFF_PAGE record. This patch will
> redundantly zero+write it again. The cost is one extra
> SimpleLruZeroPage + SimpleLruWritePage per 2048 multixacts during
> recovery, which is negligible in practice, but not zero.

This fix is not correct. The CREATE_ID records can appear in the WAL out 
of order, e.g:

CREATE_ID:N+2 -> CREATE_ID:N+3 -> CREATE_ID:N+1

With the patch, replaying the CREATE_ID:N+1 record would overwrite the 
changes made by the CREATE_ID:N+2 and CREATE_ID:N+3 records.

> Idea 2 (0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch):
>      
> Also remove the "latest_page_number == pageno" condition, but add a
> two-tier check: first, if latest_page_number == pageno (fast path),
> we know the next page needs initialization. Otherwise, scan the SLRU
> buffer slots to check if the next page already exists (e.g. from a
> prior ZERO_OFF_PAGE replay). Only initialize if the page is not
> found in the buffer.
>      
> Pro: Avoids the redundant zero+write when replaying new-version WAL
> where ZERO_OFF_PAGE has already been replayed before CREATE_ID. The
> SLRU buffer scan is O(n_buffers) which is cheap (default 8 buffers).
>      
> Con: More complex than Idea 1. The SLRU buffer check is a heuristic
> -- if the page was written out and evicted from the buffer pool
> before CREATE_ID replay, we'd zero it again (still safe, just
> redundant). Also introduces a dependency on SLRU buffer pool
> internals.
>      
> I'd appreciate it if someone could review whether my analysis of the
> bug is correct. If it is, I'd be happy to hear any thoughts on the
> patches or better approaches to fix this.

I think this has the same problem, although the extra conditions make it 
much harder to hit. Maybe even impossible, because the SLRU never evicts 
the latest page. But it seems fragile to rely on that, and the 
consequences are pretty bad if you get that wrong.

Idea 3:

I think a better fix is to accept that our tracking is a little 
imprecise and use SimpleLruDoesPhysicalPageExist() to check if the page 
exists. I suspect that's too expensive to do on every 
RecordNewMultiXact() call that crosses a page, but perhaps we could do 
it once at StartupMultiXact().

Or perhaps track last-zeroed page separately from latest_page_number, 
and if we haven't seen any XLOG_MULTIXACT_ZERO_OFF_PAGE records yet 
after startup, call SimpleLruDoesPhysicalPageExist() to determine if 
initialization is needed. Attached patch does that.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: pg_restore: TAP test case typo(wrong word) for an error hint in 001_basic.pl
Next
From: Matheus Alcantara
Date:
Subject: Re: pg_plan_advice