Thread: Hot standby queries see transient all-zeros pages

Hot standby queries see transient all-zeros pages

From
Noah Misch
Date:
XLogReadBufferForRedoExtended() precedes RestoreBlockImage() with
RBM_ZERO_AND_LOCK.  Per src/backend/storage/buffer/README:

  Once one has determined that a tuple is interesting (visible to the current
  transaction) one may drop the content lock, yet continue to access the
  tuple's data for as long as one holds the buffer pin.

The use of RBM_ZERO_AND_LOCK is incompatible with that.  See a similar
argument at https://postgr.es/m/flat/5101.1328219790@sss.pgh.pa.us that led me
to the cause.  Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp)) in
heapgettup_pagemode().  In the core file, lpp pointed into an all-zeros page.
RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby existed,
but it wasn't a bug until queries could run concurrently.

I suspect the fix is to add a ReadBufferMode specified as, "If the block is
already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
Otherwise, do RBM_ZERO_AND_LOCK."  That avoids RBM_NORMAL for a block past the
current end of the file.  Like RBM_ZERO_AND_LOCK, it avoids wasting disk reads
on data we discard.  Are there other strategies to consider?

I got here from a Windows CI failure,
https://cirrus-ci.com/task/6247605141766144.  That involved patched code, but
adding the sleep suffices on Linux, with today's git master:

--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -388,6 +388,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
         *buf = XLogReadBufferExtended(rlocator, forknum, blkno,
                                       get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
                                       prefetch_buffer);
+        if (!get_cleanup_lock)
+            pg_usleep(10 * 1000);
         page = BufferGetPage(*buf);
         if (!RestoreBlockImage(record, block_id, page))
             ereport(ERROR,