17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction" - Mailing list pgsql-bugs
| From | Sebastian Webber |
|---|---|
| Subject | 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction" |
| Date | |
| Msg-id | CACV2tSw3VYS7d27ftO_cs+aF3M54+JwWBbqSGLcKoG9cvyb6EA@mail.gmail.com |
| List | pgsql-bugs |
PostgreSQL version: 17.8 (standby), 17.5 (primary)
Primary: PostgreSQL 17.5 (Debian 17.5-1.pgdg130+1) on aarch64-unknown-linux-gnu
Standby: PostgreSQL 17.8 (Debian 17.8-1.pgdg13+1) on aarch64-unknown-linux-gnu
Platform: Docker containers on macOS (Apple Silicon / aarch64), Docker Desktop
Description
-----------
A PostgreSQL 17.8 standby crashes during WAL replay when streaming
from a 17.5 primary. The crash occurs after replaying a
MultiXact/TRUNCATE_ID record followed by a MultiXact/CREATE_ID
record.
Steps to reproduce
------------------
1. Start a 17.5 primary configured for streaming replication
2. Seed a database with ~2GB of data (tables with foreign key
constraints)
3. Start a 17.5 standby via pg_basebackup, confirm streaming
replication
4. Generate ~500K MultiXact IDs using concurrent SELECT ... FOR SHARE
/ FOR KEY SHARE on the same rows
5. Run VACUUM on the multixact-heavy tables (generates TRUNCATE_ID
WAL records)
6. Stop the 17.5 standby
7. Continue generating ~2M additional MultiXact IDs on the primary
(builds WAL backlog)
8. Start a 17.8 standby on the same data volume -- it begins
replaying the WAL backlog
9. Standby crashes during replay
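As a rough sanity check on the scale involved, the ~500K multixact IDs from step 4 span on the order of 244 offsets pages, i.e. about 7 full SLRU segments, which lines up with the `offsets segments [0, 7)` truncation range seen later in the log. A minimal sketch of that arithmetic in Go (constants assume PostgreSQL's default BLCKSZ of 8192; the function name is illustrative, not a PostgreSQL identifier):

```go
package main

import "fmt"

const (
	offsetsPerPage  = 8192 / 4 // BLCKSZ / sizeof(MultiXactOffset)
	pagesPerSegment = 32       // SLRU_PAGES_PER_SEGMENT
)

// offsetsFootprint estimates how many pg_multixact/offsets pages and
// full SLRU segments a given number of multixact IDs occupies.
func offsetsFootprint(nMxids int) (pages, segments int) {
	pages = nMxids / offsetsPerPage
	segments = pages / pagesPerSegment
	return pages, segments
}

func main() {
	pages, segments := offsetsFootprint(500_000)
	fmt.Printf("~%d offsets pages, ~%d full segments\n", pages, segments)
	// prints: ~244 offsets pages, ~7 full segments
}
```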
An automated reproducer (Go program + shell scripts) is available at:
https://gist.github.com/sebastianwebber/2cd25d298bfe85cabcd8d41f83591acb
It requires Go 1.22+ and Docker; typical runtime is ~10 minutes. Run it with:
    go run main.go --cleanup
Actual output (standby log)
---------------------------
The standby initially replays multiple SLRU page boundaries
successfully, each with this pattern:
DEBUG: next offsets page is not initialized, initializing it now
CONTEXT: WAL redo at 3/28C148D8 for MultiXact/CREATE_ID: 856063 offset 6680130 nmembers 9: ...
DEBUG: skipping initialization of offsets page 418 because it was already initialized on multixid creation
CONTEXT: WAL redo at 3/28C149B8 for MultiXact/ZERO_OFF_PAGE: 418
This repeats for pages 408 through 418. Then a truncation occurs:
DEBUG: replaying multixact truncation: offsets [1, 490986), offsets segments [0, 7), members [1, 3864017), members segments [0, 49)
CONTEXT: WAL redo at 3/29D6D548 for MultiXact/TRUNCATE_ID: offsets [1, 490986), members [1, 3864017)
The very next CREATE_ID crashes:
FATAL: could not access status of transaction 858112
DETAIL: Could not read from file "pg_multixact/offsets/000D" at offset 24576: read too few bytes.
CONTEXT: WAL redo at 3/2A3AB408 for MultiXact/CREATE_ID: 858111 offset 6695072 nmembers 5: 1048228 (sh) 1048271 (keysh) 1048316 (sh) 1048344 (keysh) 1048370 (sh)
LOG: startup process (PID 29) exited with exit code 1
LOG: shutting down due to startup process failure
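For cross-checking, the failing read maps back to the multixact ID via SLRU arithmetic. A minimal sketch in Go (constants assume PostgreSQL's default BLCKSZ of 8192; the function name is illustrative, not an actual PostgreSQL identifier):

```go
package main

import "fmt"

const (
	blckSz              = 8192
	offsetSize          = 4                  // sizeof(MultiXactOffset), a uint32
	offsetsPerPage      = blckSz / offsetSize // 2048 entries per page
	slruPagesPerSegment = 32                  // SLRU_PAGES_PER_SEGMENT
)

// offsetsLocation maps a MultiXactId to the pg_multixact/offsets
// segment file name and the byte offset of its page within that file.
func offsetsLocation(mxid uint32) (file string, byteOff int) {
	page := int(mxid) / offsetsPerPage
	seg := page / slruPagesPerSegment
	byteOff = (page % slruPagesPerSegment) * blckSz
	return fmt.Sprintf("%04X", seg), byteOff
}

func main() {
	// The multixact named in the FATAL/DETAIL lines above:
	file, off := offsetsLocation(858112)
	fmt.Printf("pg_multixact/offsets/%s at offset %d\n", file, off)
	// prints: pg_multixact/offsets/000D at offset 24576
}
```

This matches the DETAIL line exactly: 858112 is the first entry of offsets page 419, which lives in segment 0x000D at byte offset 24576. (Likewise, 856063 is the last entry of page 417, consistent with its CREATE_ID triggering initialization of the next page, 418.) Since the "read too few bytes" wording indicates the read was attempted, the segment file appears to exist after truncation but be shorter than the page being read.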
Expected output
---------------
The standby should successfully replay all WAL records and reach a
consistent streaming state.
Configuration (non-default on primary)
--------------------------------------
wal_level = replica
max_wal_senders = 10
max_connections = 1200
shared_buffers = 256MB
wal_keep_size = 16GB
autovacuum_multixact_freeze_max_age = 100000
vacuum_multixact_freeze_min_age = 1000
vacuum_multixact_freeze_table_age = 50000
Standby configured with log_min_messages = debug1.
Sebastian Webber