Thread: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages
pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages
From
vignesh C
Date:
Hi, Buildfarm identified one issue at [1] where it could not identify a partial WAL record spanning across 2 pages was written due to immediate shutdown. Consider a scenario where a WAL record is split across multiple WAL pages. If the server crashes before the entire WAL record is written, the recovery process detects the incomplete (broken) record. In such cases, a special flag, XLP_FIRST_IS_OVERWRITE_CONTRECORD, is set in the page header to indicate this condition. When we attempt to retrieve changes using a logical replication slot, the system reads WAL data based on the record header. In one specific instance, a total of 8133 bytes were required to reconstruct the WAL record. Of this, 277 bytes were expected to be present in the subsequent WAL page. However, only 248 bytes were available, because no new WAL records had been generated as there were no operations performed on the system after the crash. As a result, the slot_get_changes function continuously checks whether the current WAL flush position has reached 0x29004115 (the offset corresponding to the required 277 bytes in the next WAL page). Since no new WAL is being generated, the flush position always returns 0x290040F8 (the 248-byte offset), causing the function to wait indefinitely at read_local_xlog_page_guts function. Currently, the logic attempts to read the complete WAL record based on the size obtained before the crash—even though only a partial record was written. It then checks the page header to determine whether the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the complete WAL record at XLogDecodeNextRecord function, but since that much WAL data was not available in the system we never get a chance to check the header after this.. To address this issue, a more robust approach would be to first read the page header, check if the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed to read the WAL record size if the record is not marked as a partial overwrite. This would prevent the system from waiting for WAL data that will never arrive. Attached partial_wal_record_fix.patch patch for this. I don't have a consistent test to reproduce this issue, currently this issue can be reproduced by running the 046_checkpoint_logical_slot test about 50 times. This test 046_checkpoint_logical_slot was reverted recently after it caused a few buildfarm failures discussed at [2]. The same test is also attached as reverted_test_046_checkpoint_logical_slot.patch. Added members who were involved in the discussion of this issue. Thoughts? [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2025-06-19+23%3A47%3A08 [2] - https://www.postgresql.org/message-id/CAPpHfdsO9s5he3xHWNFtwvXtvsftu3nNz%3DPV9fdN32UOh-Z3tA%40mail.gmail.com Regards, Vignesh
Attachment
Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages
From
Michael Paquier
Date:
On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote: > Currently, the logic attempts to read the complete WAL record based on > the size obtained before the crash—even though only a partial record > was written. It then checks the page header to determine whether the > XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the > complete WAL record at XLogDecodeNextRecord function, but since that > much WAL data was not available in the system we never get a chance to > check the header after this.. To address this issue, a more robust > approach would be to first read the page header, check if the > XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed > to read the WAL record size if the record is not marked as a partial > overwrite. This would prevent the system from waiting for WAL data > that will never arrive. Attached partial_wal_record_fix.patch patch > for this. So you are suggesting the addition of an extra ReadPageInternal() that forces a read of only the read, perform the checks on the header, then read the rest. After reading SizeOfXLogShortPHD worth of data, shouldn't the checks on xlp_rem_len be done a bit earlier than what you are proposing in this patch? > I don't have a consistent test to reproduce this issue, currently this > issue can be reproduced by running the 046_checkpoint_logical_slot > test about 50 times. This test 046_checkpoint_logical_slot was > reverted recently after it caused a few buildfarm failures discussed > at [2]. The same test is also attached as > reverted_test_046_checkpoint_logical_slot.patch. Seeing the noise that this originally created in the CFbot and the buildfarm, even if the issue would be only triggered after a timeout, I'd vote for relying on this test being good enough for the purpose of this race condition. Another reliable approach would be to make the code wait before reading the record in the internal loop of ReadPageInternal() with an injection point when we know that we have a contrecord, but I'm not really excited about this prospect in xlogreader.c which can be also used in the frontend. -- Michael