Re: Inadequate thought about buffer locking during hot standby replay - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Inadequate thought about buffer locking during hot standby replay
Date
Msg-id 25470.1352571406@sss.pgh.pa.us
Whole thread Raw
In response to Inadequate thought about buffer locking during hot standby replay  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Inadequate thought about buffer locking during hot standby replay
List pgsql-hackers
I wrote:
> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately).  That's not going to be a
> small change unfortunately :-(

Here's a WIP patch that attacks it in this way.  I've only gone as far as
fixing btree_xlog_split() for the moment, since the main point is to see
whether the coding change seems reasonable.  At least in this function,
it seems that locking considerations are handled correctly as long as no
full-page image is used, so it's pretty straightforward to tweak it to
handle the case with full-page image(s) as well.  I've not looked
through any other replay functions yet --- some of them may need more
invasive hacking.  But so far this looks pretty good.

Note that this patch isn't correct yet even for btree_xlog_split(),
because I've not removed the RestoreBkpBlocks() call in btree_redo().
All the btree replay routines will have to get fixed before it's
testable at all.

One thing that seems a bit annoying is the use of zero-based backup
block indexes in RestoreBackupBlock, when most (not all) of the callers
are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
That's a bug waiting to happen.  We could address it by changing
RestoreBackupBlock to accept a one-based argument, but what I'm thinking
of doing instead is getting rid of the one-based convention entirely;
that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
advantage of doing that is that it'll force inspection of all code
related to this.

Comments, opinions?

            regards, tom lane

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..ee7e15fddcfca7dfe7ccba6ab021397937b1a96d 100644
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
*************** btree_xlog_split(bool onleft, bool isroo
*** 323,329 ****
          datalen -= newitemsz;
      }

!     /* Reconstruct right (new) sibling from scratch */
      rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
      Assert(BufferIsValid(rbuf));
      rpage = (Page) BufferGetPage(rbuf);
--- 323,329 ----
          datalen -= newitemsz;
      }

!     /* Reconstruct right (new) sibling page from scratch */
      rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
      Assert(BufferIsValid(rbuf));
      rpage = (Page) BufferGetPage(rbuf);
*************** btree_xlog_split(bool onleft, bool isroo
*** 357,374 ****

      /* don't release the buffer yet; we touch right page's first item below */

!     /*
!      * Reconstruct left (original) sibling if needed.  Note that this code
!      * ensures that the items remaining on the left page are in the correct
!      * item number order, but it does not reproduce the physical order they
!      * would have had.    Is this worth changing?  See also _bt_restore_page().
!      */
!     if (!(record->xl_info & XLR_BKP_BLOCK_1))
      {
          Buffer        lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

          if (BufferIsValid(lbuf))
          {
              Page        lpage = (Page) BufferGetPage(lbuf);
              BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);

--- 357,377 ----

      /* don't release the buffer yet; we touch right page's first item below */

!     /* Now reconstruct left (original) sibling page */
!     if (record->xl_info & XLR_BKP_BLOCK_1)
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!     else
      {
          Buffer        lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

          if (BufferIsValid(lbuf))
          {
+             /*
+              * Note that this code ensures that the items remaining on the
+              * left page are in the correct item number order, but it does not
+              * reproduce the physical order they would have had.  Is this
+              * worth changing?  See also _bt_restore_page().
+              */
              Page        lpage = (Page) BufferGetPage(lbuf);
              BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);

*************** btree_xlog_split(bool onleft, bool isroo
*** 432,439 ****
      /* We no longer need the right buffer */
      UnlockReleaseBuffer(rbuf);

!     /* Fix left-link of the page to the right of the new right sibling */
!     if (xlrec->rnext != P_NONE && !(record->xl_info & XLR_BKP_BLOCK_2))
      {
          Buffer        buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);

--- 435,451 ----
      /* We no longer need the right buffer */
      UnlockReleaseBuffer(rbuf);

!     /*
!      * Fix left-link of the page to the right of the new right sibling.
!      *
!      * Note: in normal operation, we do this while still holding lock on the
!      * two split pages.  However, that's not necessary for correctness in WAL
!      * replay, because no other index update can be in progress, and readers
!      * will cope properly when following an obsolete left-link.
!      */
!     if (record->xl_info & XLR_BKP_BLOCK_2)
!         (void) RestoreBackupBlock(lsn, record, 1, false, false);
!     else if (xlrec->rnext != P_NONE)
      {
          Buffer        buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 573c9ad682e59d9e7b01ad7c8d9d321f15f948f3..b167ae0382b58ac3b4c995e5652bb6cdb3ecc59e 100644
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
*************** ID at least once; otherwise there is no
*** 498,508 ****
  The standard replay-routine pattern for this case is

      if (record->xl_info & XLR_BKP_BLOCK_n)
!         << do nothing, page was rewritten from logged copy >>;

      buffer = XLogReadBuffer(rnode, blkno, false);
      if (!BufferIsValid(buffer))
!         << do nothing, page has been deleted >>;
      page = (Page) BufferGetPage(buffer);

      if (XLByteLE(lsn, PageGetLSN(page)))
--- 498,515 ----
  The standard replay-routine pattern for this case is

      if (record->xl_info & XLR_BKP_BLOCK_n)
!     {
!         /* apply the change from the full-page image */
!         (void) RestoreBackupBlock(lsn, record, n-1, false, false);
!         return;
!     }

      buffer = XLogReadBuffer(rnode, blkno, false);
      if (!BufferIsValid(buffer))
!     {
!         /* page has been deleted, so we need do nothing */
!         return;
!     }
      page = (Page) BufferGetPage(buffer);

      if (XLByteLE(lsn, PageGetLSN(page)))
*************** associated with the n'th distinct buffer
*** 527,532 ****
--- 534,570 ----
  per the above discussion, fully-rewritable buffers shouldn't be mentioned in
  "rdata".)

+ When replaying a WAL record that describes changes on multiple pages, you
+ must be careful to lock the pages in the same way that the original action
+ did, because in Hot Standby mode there may be other backends inspecting
+ these pages.  Replay can't expose inconsistent page states to concurrent
+ queries, any more than normal operation did.  If this requires that two
+ or more buffer locks be held concurrently, the coding pattern shown above
+ is too simplistic, since it assumes the routine can exit as soon as it's
+ known the current page requires no modification.  Instead, you might have
+ something like
+
+     if (record->xl_info & XLR_BKP_BLOCK_1)
+     {
+         /* apply the change from the full-page image */
+         buffer1 = RestoreBackupBlock(lsn, record, 0, false, true);
+     }
+     else
+     {
+         buffer1 = XLogReadBuffer(rnode, blkno, false);
+         if (BufferIsValid(buffer1))
+         {
+             ... apply the change if not already done ...
+             MarkBufferDirty(buffer1);
+         }
+     }
+
+     ... similarly apply the changes for page 2 ...
+
+     /* and now we can release the lock on the first page */
+     if (BufferIsValid(buffer1))
+         UnlockReleaseBuffer(buffer1);
+
  Due to all these constraints, complex changes (such as a multilevel index
  insertion) normally need to be described by a series of atomic-action WAL
  records.  What do you do if the intermediate states are not self-consistent?
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf76f6d24cb8bbf3b1b8b19e0a1c007ff08e4035..4d379e2f78a89604795362e76a03867dcfdf42e6 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec
*** 3155,3160 ****
--- 3155,3256 ----
  }

  /*
+  * Restore a full-page image from a backup block attached to an XLOG record.
+  *
+  * lsn: LSN of the XLOG record being replayed
+  * record: the complete XLOG record
+  * block_index: which backup block to restore (0 .. XLR_MAX_BKP_BLOCKS - 1)
+  * get_cleanup_lock: TRUE to get a cleanup rather than plain exclusive lock
+  * keep_buffer: TRUE to return the buffer still locked and pinned
+  *
+  * Returns the buffer number containing the page.  Note this is not terribly
+  * useful unless keep_buffer is specified as TRUE.
+  *
+  * Note: when a backup block is available in XLOG, we restore it
+  * unconditionally, even if the page in the database appears newer.
+  * This is to protect ourselves against database pages that were partially
+  * or incorrectly written during a crash.  We assume that the XLOG data
+  * must be good because it has passed a CRC check, while the database
+  * page might not be.  This will force us to replay all subsequent
+  * modifications of the page that appear in XLOG, rather than possibly
+  * ignoring them as already applied, but that's not a huge drawback.
+  *
+  * If 'get_cleanup_lock' is true, a cleanup lock is obtained on the buffer,
+  * else a normal exclusive lock is used.  During crash recovery, that's just
+  * pro forma because there can't be any regular backends in the system, but
+  * in hot standby mode the distinction is important.
+  *
+  * If 'keep_buffer' is true, return without releasing the buffer lock and pin;
+  * then caller is responsible for doing UnlockReleaseBuffer() later.  This
+  * is needed in some cases when replaying XLOG records that touch multiple
+  * pages, to prevent inconsistent states from being visible to other backends.
+  * (Again, that's only important in hot standby mode.)
+  */
+ Buffer
+ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
+                    bool get_cleanup_lock, bool keep_buffer)
+ {
+     Buffer        buffer;
+     Page        page;
+     BkpBlock    bkpb;
+     char       *blk;
+     int            i;
+
+     /* Locate requested BkpBlock in the record */
+     blk = (char *) XLogRecGetData(record) + record->xl_len;
+     for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+     {
+         if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
+             continue;
+
+         memcpy(&bkpb, blk, sizeof(BkpBlock));
+         blk += sizeof(BkpBlock);
+
+         if (i == block_index)
+         {
+             buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
+                                             RBM_ZERO);
+             Assert(BufferIsValid(buffer));
+             if (get_cleanup_lock)
+                 LockBufferForCleanup(buffer);
+             else
+                 LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+             page = (Page) BufferGetPage(buffer);
+
+             if (bkpb.hole_length == 0)
+             {
+                 memcpy((char *) page, blk, BLCKSZ);
+             }
+             else
+             {
+                 memcpy((char *) page, blk, bkpb.hole_offset);
+                 /* must zero-fill the hole */
+                 MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
+                 memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
+                        blk + bkpb.hole_offset,
+                        BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
+             }
+
+             PageSetLSN(page, lsn);
+             PageSetTLI(page, ThisTimeLineID);
+             MarkBufferDirty(buffer);
+
+             if (!keep_buffer)
+                 UnlockReleaseBuffer(buffer);
+
+             return buffer;
+         }
+
+         blk += BLCKSZ - bkpb.hole_length;
+     }
+
+     /* Caller specified a bogus block_index */
+     elog(ERROR, "failed to restore block_index %d", block_index);
+     return InvalidBuffer;        /* keep compiler quiet */
+ }
+
+ /*
   * CRC-check an XLOG record.  We do not believe the contents of an XLOG
   * record (other than to the minimal extent of computing the amount of
   * data to read in) until we've checked the CRCs.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2893f3b35243effb3890630dde423daddb51eaea..e5cfb96a600b83b50bee2871ee23116c54d0883d 100644
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
*************** extern void XLogGetLastRemoved(XLogSegNo
*** 275,280 ****
--- 275,283 ----
  extern void XLogSetAsyncXactLSN(XLogRecPtr record);

  extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);
+ extern Buffer RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record,
+                    int block_index,
+                    bool get_cleanup_lock, bool keep_buffer);

  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Further pg_upgrade analysis for many tables
Next
From: Simon Riggs
Date:
Subject: Re: Inadequate thought about buffer locking during hot standby replay