Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery - Mailing list pgsql-hackers
| From | Heikki Linnakangas |
|---|---|
| Subject | Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery |
| Date | |
| Msg-id | 18a40051-abc2-4fc2-8dcc-4dc39aa3e79e@iki.fi Whole thread Raw |
| In response to | Bug in MultiXact replay compat logic for older minor version after crash-recovery ("段坤仁(刻韧)" <duankunren.dkr@alibaba-inc.com>) |
| Responses |
回复:Bug in MultiXact replay compat logic for older minor version after crash-recovery
Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery |
| List | pgsql-hackers |
On 19/03/2026 20:02, 段坤仁(刻韧) wrote: > Hi hackers, > > This is related to two recent threads: > > [0] https://www.postgresql.org/message-id/flat/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru > [1] https://www.postgresql.org/message-id/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com > > I think there may be a bug in the compat logic introduced in > RecordNewMultiXact() to fix [0]. The compat logic handles WAL > generated by older minor versions where ZERO_OFF_PAGE N+1 > may appear after CREATE_ID:N in the WAL stream. However, the condition > used to detect whether the next offset page needs initialization > seems to fail after a crash-restart, causing the standby to enter a > FATAL crash loop. > > We hit this in production: a PG 16.12 standby replaying WAL from a > PG 16.8 primary entered a crash-restart loop with: > >> FATAL: could not access status of transaction 1277952 >> DETAIL: Could not read from file "pg_multixact/offsets/0013" at offset >> 131072: read too few bytes. >> CONTEXT: WAL redo at 5194/5F5F7118 for MultiXact/CREATE_ID: 1277951 >> offset 2605275 nmembers 2: 814605500 (keysh) 814605501 (keysh) :-(. This is a gift that keeps giving. > == Analysis == > > ... > > == Timeline: the lock-free window == > > ... > > == Reproduction == Thanks for the thorough analysis and the repro! > == Fix ideas == > > I have two ideas for fixing this. Two independent patches are > attached. > > Idea 1 (0001-Fix-multixact-compat-logic-via-unconditional-zero.patch): > > Remove the "latest_page_number == pageno" condition entirely. During > recovery, whenever we cross a page boundary in RecordNewMultiXact(), > unconditionally ensure the next page is initialized via > SimpleLruZeroPage(). > > Pro: Minimal change, easy to review, impossible to miss any edge > case since we always initialize. Safe because SimpleLruZeroPage is > idempotent -- at this point the next page only contains zeros > (CREATE_ID records for that page haven't been replayed yet), so > re-zeroing loses nothing. > > Con: When replaying WAL from a new-version primary (where > ZERO_OFF_PAGE is emitted BEFORE CREATE_ID), the page has already > been initialized by the ZERO_OFF_PAGE record. This patch will > redundantly zero+write it again. The cost is one extra > SimpleLruZeroPage + SimpleLruWritePage per 2048 multixacts during > recovery, which is negligible in practice, but not zero. This fix is not correct. The CREATE_ID records can appear in the WAL out of order, e.g: CREATE_ID:N+2 -> CREATE_ID:N+3 -> CREATE_ID:N+1 With the patch, replaying the CREATE_ID:N+1 record would overwrite the changes made by the CREATE_ID:N+2 and CREATE_ID:N+3 records. > Idea 2 (0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch): > > Also remove the "latest_page_number == pageno" condition, but add a > two-tier check: first, if latest_page_number == pageno (fast path), > we know the next page needs initialization. Otherwise, scan the SLRU > buffer slots to check if the next page already exists (e.g. from a > prior ZERO_OFF_PAGE replay). Only initialize if the page is not > found in the buffer. > > Pro: Avoids the redundant zero+write when replaying new-version WAL > where ZERO_OFF_PAGE has already been replayed before CREATE_ID. The > SLRU buffer scan is O(n_buffers) which is cheap (default 8 buffers). > > Con: More complex than Idea 1. The SLRU buffer check is a heuristic > -- if the page was written out and evicted from the buffer pool > before CREATE_ID replay, we'd zero it again (still safe, just > redundant). Also introduces a dependency on SLRU buffer pool > internals. > > I'd appreciate it if someone could review whether my analysis of the > bug is correct. If it is, I'd be happy to hear any thoughts on the > patches or better approaches to fix this. I think this has the same problem, although the extra conditions make it much harder to hit. Maybe even impossible, because the SLRU never evicts the latest page. But it seems fragile to rely on that, and the consequences are pretty bad if you get that wrong. Idea 3: I think a better fix is to accept that our tracking is a little imprecise and use SimpleLruDoesPhysicalPageExist() to check if the page exists. I suspect that's too expensive to do on every RecordNewMultiXact() call that crosses a page, but perhaps we could do it once at StartupMultiXact(). Or perhaps track last-zeroed page separately from latest_page_number, and if we haven't seen any XLOG_MULTIXACT_ZERO_OFF_PAGE records yet after startup, call SimpleLruDoesPhysicalPageExist() to determine if initialization is needed. Attached patch does that. - Heikki
Attachment
pgsql-hackers by date: