Fix possible 'unexpected data beyond EOF' on replica restart - Mailing list pgsql-hackers

From Anthonin Bonnefoy
Subject Fix possible 'unexpected data beyond EOF' on replica restart
Date
Msg-id CAO6_Xqrv-snNJNhbj1KjQmWiWHX3nYGDgAc=vxaZP3qc4g1Siw@mail.gmail.com
Whole thread Raw
List pgsql-hackers
Hi,

On restart, a replica can fail with an 'unexpected data beyond EOF in block x of relation T/D/R' error. This happened on a PG17.7 and I've been able to reproduce it on PG 18. This can happen under the following circumstances:

- A relation has a size of 400 blocks.
  - Blocks 201 to 400 are empty.
  - Block 200 has two rows.
  - Blocks 101 to 199 are empty.
- A restartpoint is done
- Vacuum truncates the relation to 200 blocks
- A FPW deletes a row in block 200
- A checkpoint is done
- A FPW deletes the last row in block 200
- Vacuum truncates the relation to 100 blocks
- The replica restarts

When the replica restarts:
- The relation on disk is reduced to 100 blocks due to having applied the truncate before restart.
- The first truncate to 200 blocks is replayed. It silently fails in mdtruncate since 'nblocks > curnblk', but the caller isn't aware of that and will still update the cached size to 200 blocks
- The first FPW on block 200 is applied, XLogReadBufferForRead will rely on the cached size and incorrectly assume the page exists in file, and thus won't extend the relation.
- The Checkpoint Online is replayed, calling smgrdestroyall which will discard the cached size.
- The second FPW on block 200 is applied. This time, the detected size is 100 blocks, an extend is attempted. However, the block 200 is already present in the buffer table due to the first FPW. This triggers the 'unexpected data beyond EOF' since the page isn't new.

The issue can be reproduced with the following script:

"""
pgbench -i
# Prepare the relation
psql -c "DELETE FROM pgbench_accounts WHERE aid > 80000 AND aid != ALL('{90000, 90001}');"
psql -c "VACUUM (VERBOSE, INDEX_CLEANUP ON, TRUNCATE OFF) pgbench_accounts;"

# Restartpoint here
psql -c "CHECKPOINT;"
psql -p 5433 -c "CHECKPOINT;"

# First truncate
psql -c "VACUUM (VERBOSE, INDEX_CLEANUP ON, TRUNCATE ON) pgbench_accounts;"

# First FPW deletion
psql -c "DELETE FROM pgbench_accounts WHERE aid = 90001;"

# Second FPW deletion
psql -c "CHECKPOINT;"
psql -c "DELETE FROM pgbench_accounts WHERE aid = 90000;"

# Second truncate
psql -c "VACUUM (VERBOSE, INDEX_CLEANUP ON, TRUNCATE ON) pgbench_accounts;"

# Let some time for replica to replay the truncate
psql -c "SELECT pg_sleep(1);"

# Stop without advancing the restartpoint
kill -9 $(pgrep -f "pg_data_replica")

# Restart should fail with the EOF error
pg_ctl -D pg_data_replica restart
"""

This assumes the replica is running on port 5433 and no hot_standby_feedback (otherwise, tuples will be seen as 'not yet removable'). I've used kill -9 to avoid advancing the restart point, but I've seen the issue happening with a clean shutdown. 

The patch fixes the issue by moving smgr_cached_nblocks updates in mdtruncate and only updating the cached value if truncate was successful.

Regards,
Anthonin Bonnefoy
Attachment

pgsql-hackers by date:

Previous
From: Daniil Davydov
Date:
Subject: Proposal : Use bump memory context for temp buffers
Next
From: Christoph Berg
Date:
Subject: Re: failed NUMA pages inquiry status: Operation not permitted