Hot standby queries see transient all-zeros pages - Mailing list pgsql-hackers

From Noah Misch
Subject Hot standby queries see transient all-zeros pages
Date
Msg-id 20240512171658.7e.nmisch@google.com
Whole thread Raw
List pgsql-hackers
XLogReadBufferForRedoExtended() precedes RestoreBlockImage() with
RBM_ZERO_AND_LOCK.  Per src/backend/storage/buffer/README:

  Once one has determined that a tuple is interesting (visible to the current
  transaction) one may drop the content lock, yet continue to access the
  tuple's data for as long as one holds the buffer pin.

The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led me
to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp)) in
heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros page.
RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby existed,
but it wasn't a bug until queries could run concurrently.

I suspect the fix is to add a ReadBufferMode specified as, "If the block is
already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
Otherwise, do RBM_ZERO_AND_LOCK."  That avoids RBM_NORMAL for a block past the
current end of the file.  Like RBM_ZERO_AND_LOCK, it avoids wasting disk reads
on data we discard.  Are there other strategies to consider?

I got here from a Windows CI failure,
https://cirrus-ci.com/task/6247605141766144.  That involved patched code, but
adding the sleep suffices on Linux, with today's git master:

--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -388,6 +388,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
         *buf = XLogReadBufferExtended(rlocator, forknum, blkno,
                                       get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
                                       prefetch_buffer);
+        if (!get_cleanup_lock)
+            pg_usleep(10 * 1000);
         page = BufferGetPage(*buf);
         if (!RestoreBlockImage(record, block_id, page))
             ereport(ERROR,



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Why is citext/regress failing on hamerkop?
Next
From: Noah Misch
Date:
Subject: Re: Weird test mixup