I found a bug in pg_surgery's heap_force_kill that can produce WAL records with incorrect CRC checksums when multiple sessions operate on heap pages that share the same visibility map page.
When heap_force_kill kills a tuple on an all-visible page, it calls visibilitymap_clear() to clear the VM bits, then later calls log_newpage_buffer(vmbuf) to write a full-page image of the VM page to WAL. However, visibilitymap_clear() releases the exclusive lock on the VM buffer before returning -- that's its normal API contract. The subsequent log_newpage_buffer(vmbuf) then writes the FPI without holding any content lock on the VM buffer.
XLogInsert reads the page content twice via the XLogRecData chain pointer to shared memory: once to compute the CRC in XLogRecordAssemble, and once to copy the data to WAL buffers in CopyXLogRecordToWAL. If a concurrent backend modifies the VM page between those two reads (e.g., another heap_force_kill session clearing a different bit on the same VM page), the CRC will not match the written bytes.
Fix it by re-acquiring the VM buffer exclusive lock immediately after visibilitymap_clear() returns, and release it after log_newpage_buffer() completes. The lock is only acquired when RelationNeedsWAL() is true, since unlogged relations skip the FPI write entirely.
The patch includes a TAP test that uses injection points to deterministically reproduce the race condition. Three injection points are used:
* "heap-force-kill-vm-pin" in heap_surgery.c (outside the critical section) to initialize DSM shared memory for the wait machinery.
* "heap-force-kill-before-vm-wal" in heap_surgery.c (inside the critical section, between the heap FPI and VM FPI writes) as a synchronization barrier.
* "wal-insert-after-crc" in xloginsert.c (inside XLogInsert, between XLogRecordAssemble and XLogInsertRecord) to pause after CRC computation but before the data copy.
The test pauses session 1 inside XLogInsert after the VM FPI's CRC is computed, runs a concurrent heap_force_kill from session 2 on the same VM page, then wakes session 1. Without the fix, session 2 modifies the VM page between CRC and copy, and pg_walinspect reports "incorrect resource manager data checksum". With the fix, session 2 blocks on the VM buffer lock and no corruption occurs.
The "wal-insert-after-crc" injection point in xloginsert.c uses INJECTION_POINT_CACHED (pre-loaded by the caller before the critical section), so it only fires when explicitly loaded and adds no overhead to the normal WAL write path.
Regards,
-- Fabrízio de Royes Mello
On behalf of PlanetScale