Re: Inadequate thought about buffer locking during hot standby replay - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Inadequate thought about buffer locking during hot standby replay
Date
Msg-id 1448.1352582662@sss.pgh.pa.us
In response to Re: Inadequate thought about buffer locking during hot standby replay  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Inadequate thought about buffer locking during hot standby replay  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Inadequate thought about buffer locking during hot standby replay  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Inadequate thought about buffer locking during hot standby replay  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 10 November 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Here's a WIP patch that attacks it in this way.

> Looks fine, but we need a top-level comment in btree code explaining
> why/how it follows the very well explained rules in the README.

Not sure what you'd want it to say exactly?  Really these issues are per
WAL-replay function, not global.  I've now been through all of the btree
functions, and AFAICS btree_xlog_split is the only one of them that
actually has a bug; the others can be a bit lax about the order in which
things get done.
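
For reference, the ordering btree_xlog_split now follows looks roughly like
this (condensed from the nbtxlog.c hunks below; the "..." elides most of the
routine):

    /* Reconstruct right (new) sibling page from scratch */
    rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
    ... (fill the new right page, keeping rbuf locked) ...

    /* Now reconstruct left (original) sibling page */
    if (record->xl_info & XLR_BKP_BLOCK(0))
        (void) RestoreBackupBlock(lsn, record, 0, false, false);
    else
    {
        /* incremental update of the existing left page */
    }

i.e. the new right page is fully in place before the left page's updated
right-link can become visible to standby queries.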

>> One thing that seems a bit annoying is the use of zero-based backup
>> block indexes in RestoreBackupBlock, when most (not all) of the callers
>> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
>> That's a bug waiting to happen.  We could address it by changing
>> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
>> of doing instead is getting rid of the one-based convention entirely;
>> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
>> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
>> advantage of doing that is that it'll force inspection of all code
>> related to this.

> I wouldn't do that in a back branch, but I can see why it's a good idea.

If any of this stuff were getting used by external modules, changing it
would be problematic ... but fortunately, it isn't, because we lack
support for plug-in rmgrs.  So I'm not worried about back-patching the
change, and would prefer to keep the 9.x branches in sync.
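
Concretely, with the one-based names gone, replay code just tests the
zero-based bit directly; both the macro and the usage pattern shown here are
lifted from the attached patch:

    #define XLR_BKP_BLOCK(iblk)        (0x08 >> (iblk))        /* iblk in 0..3 */

    if (record->xl_info & XLR_BKP_BLOCK(0))
        (void) RestoreBackupBlock(lsn, record, 0, false, false);
    else
    {
        /* no full-page image: read the buffer and apply the change normally */
    }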

Attached is an updated patch in which nbtxlog.c has been fully converted,
but I've not touched other rmgrs yet.  I've tested this to the extent of
having a replication slave follow a master that's running the regression
tests, and it seems to work.  It's difficult to prove much about concurrent
behavior by testing, of course.

I found that there's another benefit of managing backup-block replay
this way, which is that we can de-contort the handling of conflict
resolution by moving it into the per-record-type subroutines, instead of
needing an extra switch before the RestoreBkpBlocks call.  So that gives
me more confidence that this is a good way to go.
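
For instance, btree_xlog_delete in the attached patch now reads roughly like
this (abridged), with both the conflict resolution and the backup-block
restore handled inside the per-record-type routine:

    static void
    btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
    {
        xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);

        /* conflict processing must happen before we touch the page */
        if (InHotStandby)
        {
            TransactionId latestRemovedXid =
                btree_xlog_delete_get_latestRemovedXid(xlrec);

            ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
        }

        /* if we have a full-page image, restore it and we're done */
        if (record->xl_info & XLR_BKP_BLOCK(0))
        {
            (void) RestoreBackupBlock(lsn, record, 0, false, false);
            return;
        }

        /* ... otherwise apply the deletions to the existing page ... */
    }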

            regards, tom lane

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..8f534803a7cea75105869de6eaf2946e03d5437c 100644
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
*************** btree_xlog_insert(bool isleaf, bool isme
*** 218,227 ****
          datalen -= sizeof(xl_btree_metadata);
      }

!     if ((record->xl_info & XLR_BKP_BLOCK_1) && !ismeta && isleaf)
!         return;                    /* nothing to do */
!
!     if (!(record->xl_info & XLR_BKP_BLOCK_1))
      {
          buffer = XLogReadBuffer(xlrec->target.node,
                               ItemPointerGetBlockNumber(&(xlrec->target.tid)),
--- 218,226 ----
          datalen -= sizeof(xl_btree_metadata);
      }

!     if (record->xl_info & XLR_BKP_BLOCK(0))
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!     else
      {
          buffer = XLogReadBuffer(xlrec->target.node,
                               ItemPointerGetBlockNumber(&(xlrec->target.tid)),
*************** btree_xlog_insert(bool isleaf, bool isme
*** 249,254 ****
--- 248,260 ----
          }
      }

+     /*
+      * Note: in normal operation, we'd update the metapage while still holding
+      * lock on the page we inserted into.  But during replay it's not
+      * necessary to hold that lock, since no other index updates can be
+      * happening concurrently, and readers will cope fine with following an
+      * obsolete link from the metapage.
+      */
      if (ismeta)
          _bt_restore_meta(xlrec->target.node, lsn,
                           md.root, md.level,
*************** btree_xlog_split(bool onleft, bool isroo
*** 290,296 ****
          forget_matching_split(xlrec->node, downlink, false);

          /* Extract left hikey and its size (still assuming 16-bit alignment) */
!         if (!(record->xl_info & XLR_BKP_BLOCK_1))
          {
              /* We assume 16-bit alignment is enough for IndexTupleSize */
              left_hikey = (Item) datapos;
--- 296,302 ----
          forget_matching_split(xlrec->node, downlink, false);

          /* Extract left hikey and its size (still assuming 16-bit alignment) */
!         if (!(record->xl_info & XLR_BKP_BLOCK(0)))
          {
              /* We assume 16-bit alignment is enough for IndexTupleSize */
              left_hikey = (Item) datapos;
*************** btree_xlog_split(bool onleft, bool isroo
*** 310,316 ****
          datalen -= sizeof(OffsetNumber);
      }

!     if (onleft && !(record->xl_info & XLR_BKP_BLOCK_1))
      {
          /*
           * We assume that 16-bit alignment is enough to apply IndexTupleSize
--- 316,322 ----
          datalen -= sizeof(OffsetNumber);
      }

!     if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
      {
          /*
           * We assume that 16-bit alignment is enough to apply IndexTupleSize
*************** btree_xlog_split(bool onleft, bool isroo
*** 323,329 ****
          datalen -= newitemsz;
      }

!     /* Reconstruct right (new) sibling from scratch */
      rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
      Assert(BufferIsValid(rbuf));
      rpage = (Page) BufferGetPage(rbuf);
--- 329,335 ----
          datalen -= newitemsz;
      }

!     /* Reconstruct right (new) sibling page from scratch */
      rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
      Assert(BufferIsValid(rbuf));
      rpage = (Page) BufferGetPage(rbuf);
*************** btree_xlog_split(bool onleft, bool isroo
*** 357,374 ****

      /* don't release the buffer yet; we touch right page's first item below */

!     /*
!      * Reconstruct left (original) sibling if needed.  Note that this code
!      * ensures that the items remaining on the left page are in the correct
!      * item number order, but it does not reproduce the physical order they
!      * would have had.    Is this worth changing?  See also _bt_restore_page().
!      */
!     if (!(record->xl_info & XLR_BKP_BLOCK_1))
      {
          Buffer        lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

          if (BufferIsValid(lbuf))
          {
              Page        lpage = (Page) BufferGetPage(lbuf);
              BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);

--- 363,383 ----

      /* don't release the buffer yet; we touch right page's first item below */

!     /* Now reconstruct left (original) sibling page */
!     if (record->xl_info & XLR_BKP_BLOCK(0))
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!     else
      {
          Buffer        lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

          if (BufferIsValid(lbuf))
          {
+             /*
+              * Note that this code ensures that the items remaining on the
+              * left page are in the correct item number order, but it does not
+              * reproduce the physical order they would have had.  Is this
+              * worth changing?  See also _bt_restore_page().
+              */
              Page        lpage = (Page) BufferGetPage(lbuf);
              BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);

*************** btree_xlog_split(bool onleft, bool isroo
*** 432,439 ****
      /* We no longer need the right buffer */
      UnlockReleaseBuffer(rbuf);

!     /* Fix left-link of the page to the right of the new right sibling */
!     if (xlrec->rnext != P_NONE && !(record->xl_info & XLR_BKP_BLOCK_2))
      {
          Buffer        buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);

--- 441,457 ----
      /* We no longer need the right buffer */
      UnlockReleaseBuffer(rbuf);

!     /*
!      * Fix left-link of the page to the right of the new right sibling.
!      *
!      * Note: in normal operation, we do this while still holding lock on the
!      * two split pages.  However, that's not necessary for correctness in WAL
!      * replay, because no other index update can be in progress, and readers
!      * will cope properly when following an obsolete left-link.
!      */
!     if (record->xl_info & XLR_BKP_BLOCK(1))
!         (void) RestoreBackupBlock(lsn, record, 1, false, false);
!     else if (xlrec->rnext != P_NONE)
      {
          Buffer        buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);

*************** btree_xlog_split(bool onleft, bool isroo
*** 463,475 ****
  static void
  btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_vacuum *xlrec;
      Buffer        buffer;
      Page        page;
      BTPageOpaque opaque;

-     xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
-
      /*
       * If queries might be active then we need to ensure every block is
       * unpinned between the lastBlockVacuumed and the current block, if there
--- 481,491 ----
  static void
  btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
      Buffer        buffer;
      Page        page;
      BTPageOpaque opaque;

      /*
       * If queries might be active then we need to ensure every block is
       * unpinned between the lastBlockVacuumed and the current block, if there
*************** btree_xlog_vacuum(XLogRecPtr lsn, XLogRe
*** 502,514 ****
      }

      /*
!      * If the block was restored from a full page image, nothing more to do.
!      * The RestoreBkpBlocks() call already pinned and took cleanup lock on it.
!      * XXX: Perhaps we should call RestoreBkpBlocks() *after* the loop above,
!      * to make the disk access more sequential.
       */
!     if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

      /*
       * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
--- 518,531 ----
      }

      /*
!      * If we have a full-page image, restore it (using a cleanup lock) and
!      * we're done.
       */
!     if (record->xl_info & XLR_BKP_BLOCK(0))
!     {
!         (void) RestoreBackupBlock(lsn, record, 0, true, false);
          return;
+     }

      /*
       * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
*************** btree_xlog_vacuum(XLogRecPtr lsn, XLogRe
*** 563,571 ****
   * XXX optimise later with something like XLogPrefetchBuffer()
   */
  static TransactionId
! btree_xlog_delete_get_latestRemovedXid(XLogRecord *record)
  {
-     xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
      OffsetNumber *unused;
      Buffer        ibuffer,
                  hbuffer;
--- 580,587 ----
   * XXX optimise later with something like XLogPrefetchBuffer()
   */
  static TransactionId
! btree_xlog_delete_get_latestRemovedXid(xl_btree_delete *xlrec)
  {
      OffsetNumber *unused;
      Buffer        ibuffer,
                  hbuffer;
*************** btree_xlog_delete_get_latestRemovedXid(X
*** 702,716 ****
  static void
  btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_delete *xlrec;
      Buffer        buffer;
      Page        page;
      BTPageOpaque opaque;

!     if (record->xl_info & XLR_BKP_BLOCK_1)
!         return;

!     xlrec = (xl_btree_delete *) XLogRecGetData(record);

      /*
       * We don't need to take a cleanup lock to apply these changes. See
--- 718,752 ----
  static void
  btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
      Buffer        buffer;
      Page        page;
      BTPageOpaque opaque;

!     /*
!      * If we have any conflict processing to do, it must happen before we
!      * update the page.
!      *
!      * Btree delete records can conflict with standby queries.  You might
!      * think that vacuum records would conflict as well, but we've handled
!      * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
!      * cleaned by the vacuum of the heap and so we can resolve any conflicts
!      * just once when that arrives.  After that we know that no conflicts
!      * exist from individual btree vacuum records on that index.
!      */
!     if (InHotStandby)
!     {
!         TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(xlrec);

!         ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
!     }
!
!     /* If we have a full-page image, restore it and we're done */
!     if (record->xl_info & XLR_BKP_BLOCK(0))
!     {
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!         return;
!     }

      /*
       * We don't need to take a cleanup lock to apply these changes. See
*************** btree_xlog_delete_page(uint8 info, XLogR
*** 766,773 ****
      leftsib = xlrec->leftblk;
      rightsib = xlrec->rightblk;

      /* parent page */
!     if (!(record->xl_info & XLR_BKP_BLOCK_1))
      {
          buffer = XLogReadBuffer(xlrec->target.node, parent, false);
          if (BufferIsValid(buffer))
--- 802,819 ----
      leftsib = xlrec->leftblk;
      rightsib = xlrec->rightblk;

+     /*
+      * In normal operation, we would lock all the pages this WAL record
+      * touches before changing any of them.  In WAL replay, it should be okay
+      * to lock just one page at a time, since no concurrent index updates can
+      * be happening, and readers should not care whether they arrive at the
+      * target page or not (since it's surely empty).
+      */
+
      /* parent page */
!     if (record->xl_info & XLR_BKP_BLOCK(0))
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!     else
      {
          buffer = XLogReadBuffer(xlrec->target.node, parent, false);
          if (BufferIsValid(buffer))
*************** btree_xlog_delete_page(uint8 info, XLogR
*** 813,819 ****
      }

      /* Fix left-link of right sibling */
!     if (!(record->xl_info & XLR_BKP_BLOCK_2))
      {
          buffer = XLogReadBuffer(xlrec->target.node, rightsib, false);
          if (BufferIsValid(buffer))
--- 859,867 ----
      }

      /* Fix left-link of right sibling */
!     if (record->xl_info & XLR_BKP_BLOCK(1))
!         (void) RestoreBackupBlock(lsn, record, 1, false, false);
!     else
      {
          buffer = XLogReadBuffer(xlrec->target.node, rightsib, false);
          if (BufferIsValid(buffer))
*************** btree_xlog_delete_page(uint8 info, XLogR
*** 837,843 ****
      }

      /* Fix right-link of left sibling, if any */
!     if (!(record->xl_info & XLR_BKP_BLOCK_3))
      {
          if (leftsib != P_NONE)
          {
--- 885,893 ----
      }

      /* Fix right-link of left sibling, if any */
!     if (record->xl_info & XLR_BKP_BLOCK(2))
!         (void) RestoreBackupBlock(lsn, record, 2, false, false);
!     else
      {
          if (leftsib != P_NONE)
          {
*************** btree_xlog_newroot(XLogRecPtr lsn, XLogR
*** 911,916 ****
--- 961,969 ----
      BTPageOpaque pageop;
      BlockNumber downlink = 0;

+     /* Backup blocks are not used in newroot records */
+     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
      buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
      Assert(BufferIsValid(buffer));
      page = (Page) BufferGetPage(buffer);
*************** btree_xlog_newroot(XLogRecPtr lsn, XLogR
*** 952,1018 ****
          forget_matching_split(xlrec->node, downlink, true);
  }

!
! void
! btree_redo(XLogRecPtr lsn, XLogRecord *record)
  {
!     uint8        info = record->xl_info & ~XLR_INFO_MASK;

      /*
!      * If we have any conflict processing to do, it must happen before we
!      * update the page.
       */
      if (InHotStandby)
      {
!         switch (info)
!         {
!             case XLOG_BTREE_DELETE:
!
!                 /*
!                  * Btree delete records can conflict with standby queries. You
!                  * might think that vacuum records would conflict as well, but
!                  * we've handled that already. XLOG_HEAP2_CLEANUP_INFO records
!                  * provide the highest xid cleaned by the vacuum of the heap
!                  * and so we can resolve any conflicts just once when that
!                  * arrives. After that any we know that no conflicts exist
!                  * from individual btree vacuum records on that index.
!                  */
!                 {
!                     TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
!                     xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
!
!                     ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
!                 }
!                 break;
!
!             case XLOG_BTREE_REUSE_PAGE:
!
!                 /*
!                  * Btree reuse page records exist to provide a conflict point
!                  * when we reuse pages in the index via the FSM. That's all it
!                  * does though. latestRemovedXid was the page's btpo.xact. The
!                  * btpo.xact < RecentGlobalXmin test in _bt_page_recyclable()
!                  * conceptually mirrors the pgxact->xmin > limitXmin test in
!                  * GetConflictingVirtualXIDs().  Consequently, one XID value
!                  * achieves the same exclusion effect on master and standby.
!                  */
!                 {
!                     xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);

!                     ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
!                 }
!                 return;

-             default:
-                 break;
-         }
-     }

!     /*
!      * Vacuum needs to pin and take cleanup lock on every leaf page, a regular
!      * exclusive lock is enough for all other purposes.
!      */
!     RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM));

      switch (info)
      {
--- 1005,1040 ----
          forget_matching_split(xlrec->node, downlink, true);
  }

! static void
! btree_xlog_reuse_page(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);

      /*
!      * Btree reuse_page records exist to provide a conflict point when we
!      * reuse pages in the index via the FSM.  That's all they do though.
!      *
!      * latestRemovedXid was the page's btpo.xact.  The btpo.xact <
!      * RecentGlobalXmin test in _bt_page_recyclable() conceptually mirrors the
!      * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
!      * Consequently, one XID value achieves the same exclusion effect on
!      * master and standby.
       */
      if (InHotStandby)
      {
!         ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
!                                             xlrec->node);
!     }

!     /* Backup blocks are not used in reuse_page records */
!     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
! }


! void
! btree_redo(XLogRecPtr lsn, XLogRecord *record)
! {
!     uint8        info = record->xl_info & ~XLR_INFO_MASK;

      switch (info)
      {
*************** btree_redo(XLogRecPtr lsn, XLogRecord *r
*** 1052,1058 ****
              btree_xlog_newroot(lsn, record);
              break;
          case XLOG_BTREE_REUSE_PAGE:
!             /* Handled above before restoring bkp block */
              break;
          default:
              elog(PANIC, "btree_redo: unknown op code %u", info);
--- 1074,1080 ----
              btree_xlog_newroot(lsn, record);
              break;
          case XLOG_BTREE_REUSE_PAGE:
!             btree_xlog_reuse_page(lsn, record);
              break;
          default:
              elog(PANIC, "btree_redo: unknown op code %u", info);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 573c9ad682e59d9e7b01ad7c8d9d321f15f948f3..f8ebf5758f0a9782016348dca041d92533e94494 100644
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
*************** critical section.)
*** 438,445 ****
  4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This must
  happen before the WAL record is inserted; see notes in SyncOneBuffer().)

! 5. Build a WAL log record and pass it to XLogInsert(); then update the page's
! LSN and TLI using the returned XLOG location.  For instance,

          recptr = XLogInsert(rmgr_id, info, rdata);

--- 438,446 ----
  4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This must
  happen before the WAL record is inserted; see notes in SyncOneBuffer().)

! 5. If the relation requires WAL-logging, build a WAL log record and pass it
! to XLogInsert(); then update the page's LSN and TLI using the returned XLOG
! location.  For instance,

          recptr = XLogInsert(rmgr_id, info, rdata);

*************** which buffers were handled that way ---
*** 466,474 ****
  what the XLOG record actually contains.  XLOG records that describe multi-page
  changes therefore require some care to design: you must be certain that you
  know what data is indicated by each "BKP" bit.  An example of the trickiness
! is that in a HEAP_UPDATE record, BKP(1) normally is associated with the source
! page and BKP(2) is associated with the destination page --- but if these are
! the same page, only BKP(1) would have been set.

  For this reason as well as the risk of deadlocking on buffer locks, it's best
  to design WAL records so that they reflect small atomic actions involving just
--- 467,475 ----
  what the XLOG record actually contains.  XLOG records that describe multi-page
  changes therefore require some care to design: you must be certain that you
  know what data is indicated by each "BKP" bit.  An example of the trickiness
! is that in a HEAP_UPDATE record, BKP(0) normally is associated with the source
! page and BKP(1) is associated with the destination page --- but if these are
! the same page, only BKP(0) would have been set.

  For this reason as well as the risk of deadlocking on buffer locks, it's best
  to design WAL records so that they reflect small atomic actions involving just
*************** incrementally update the page, the rdata
*** 497,508 ****
  ID at least once; otherwise there is no defense against torn-page problems.
  The standard replay-routine pattern for this case is

!     if (record->xl_info & XLR_BKP_BLOCK_n)
!         << do nothing, page was rewritten from logged copy >>;

      buffer = XLogReadBuffer(rnode, blkno, false);
      if (!BufferIsValid(buffer))
!         << do nothing, page has been deleted >>;
      page = (Page) BufferGetPage(buffer);

      if (XLByteLE(lsn, PageGetLSN(page)))
--- 498,516 ----
  ID at least once; otherwise there is no defense against torn-page problems.
  The standard replay-routine pattern for this case is

!     if (record->xl_info & XLR_BKP_BLOCK(N))
!     {
!         /* apply the change from the full-page image */
!         (void) RestoreBackupBlock(lsn, record, N, false, false);
!         return;
!     }

      buffer = XLogReadBuffer(rnode, blkno, false);
      if (!BufferIsValid(buffer))
!     {
!         /* page has been deleted, so we need do nothing */
!         return;
!     }
      page = (Page) BufferGetPage(buffer);

      if (XLByteLE(lsn, PageGetLSN(page)))
*************** The standard replay-routine pattern for
*** 520,532 ****
      UnlockReleaseBuffer(buffer);

  As noted above, for a multi-page update you need to be able to determine
! which XLR_BKP_BLOCK_n flag applies to each page.  If a WAL record reflects
  a combination of fully-rewritable and incremental updates, then the rewritable
! pages don't count for the XLR_BKP_BLOCK_n numbering.  (XLR_BKP_BLOCK_n is
! associated with the n'th distinct buffer ID seen in the "rdata" array, and
  per the above discussion, fully-rewritable buffers shouldn't be mentioned in
  "rdata".)

  Due to all these constraints, complex changes (such as a multilevel index
  insertion) normally need to be described by a series of atomic-action WAL
  records.  What do you do if the intermediate states are not self-consistent?
--- 528,569 ----
      UnlockReleaseBuffer(buffer);

  As noted above, for a multi-page update you need to be able to determine
! which XLR_BKP_BLOCK(N) flag applies to each page.  If a WAL record reflects
  a combination of fully-rewritable and incremental updates, then the rewritable
! pages don't count for the XLR_BKP_BLOCK(N) numbering.  (XLR_BKP_BLOCK(N) is
! associated with the N'th distinct buffer ID seen in the "rdata" array, and
  per the above discussion, fully-rewritable buffers shouldn't be mentioned in
  "rdata".)

+ When replaying a WAL record that describes changes on multiple pages, you
+ must be careful to lock the pages properly to prevent concurrent Hot Standby
+ queries from seeing an inconsistent state.  If this requires that two
+ or more buffer locks be held concurrently, the coding pattern shown above
+ is too simplistic, since it assumes the routine can exit as soon as it's
+ known the current page requires no modification.  Instead, you might have
+ something like
+
+     if (record->xl_info & XLR_BKP_BLOCK(0))
+     {
+         /* apply the change from the full-page image */
+         buffer0 = RestoreBackupBlock(lsn, record, 0, false, true);
+     }
+     else
+     {
+         buffer0 = XLogReadBuffer(rnode, blkno, false);
+         if (BufferIsValid(buffer0))
+         {
+             ... apply the change if not already done ...
+             MarkBufferDirty(buffer0);
+         }
+     }
+
+     ... similarly apply the changes for remaining pages ...
+
+     /* and now we can release the lock on the first page */
+     if (BufferIsValid(buffer0))
+         UnlockReleaseBuffer(buffer0);
+
  Due to all these constraints, complex changes (such as a multilevel index
  insertion) normally need to be described by a series of atomic-action WAL
  records.  What do you do if the intermediate states are not self-consistent?
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf76f6d24cb8bbf3b1b8b19e0a1c007ff08e4035..555c69c460ef8baf7be43540673132f766fb0e01 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** begin:;
*** 835,842 ****
       * At the exit of this loop, write_len includes the backup block data.
       *
       * Also set the appropriate info bits to show which buffers were backed
!      * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
!      * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
       */
      rdt_lastnormal = rdt;
      write_len = len;
--- 835,842 ----
       * At the exit of this loop, write_len includes the backup block data.
       *
       * Also set the appropriate info bits to show which buffers were backed
!      * up. The XLR_BKP_BLOCK(N) bit corresponds to the N'th distinct buffer
!      * value (ignoring InvalidBuffer) appearing in the rdata chain.
       */
      rdt_lastnormal = rdt;
      write_len = len;
*************** begin:;
*** 848,854 ****
          if (!dtbuf_bkp[i])
              continue;

!         info |= XLR_SET_BKP_BLOCK(i);

          bkpb = &(dtbuf_xlg[i]);
          page = (char *) BufferGetBlock(dtbuf[i]);
--- 848,854 ----
          if (!dtbuf_bkp[i])
              continue;

!         info |= XLR_BKP_BLOCK(i);

          bkpb = &(dtbuf_xlg[i]);
          page = (char *) BufferGetBlock(dtbuf[i]);
*************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec
*** 3155,3160 ****
--- 3155,3257 ----
  }

  /*
+  * Restore a full-page image from a backup block attached to an XLOG record.
+  *
+  * lsn: LSN of the XLOG record being replayed
+  * record: the complete XLOG record
+  * block_index: which backup block to restore (0 .. XLR_MAX_BKP_BLOCKS - 1)
+  * get_cleanup_lock: TRUE to get a cleanup rather than plain exclusive lock
+  * keep_buffer: TRUE to return the buffer still locked and pinned
+  *
+  * Returns the buffer number containing the page.  Note this is not terribly
+  * useful unless keep_buffer is specified as TRUE.
+  *
+  * Note: when a backup block is available in XLOG, we restore it
+  * unconditionally, even if the page in the database appears newer.
+  * This is to protect ourselves against database pages that were partially
+  * or incorrectly written during a crash.  We assume that the XLOG data
+  * must be good because it has passed a CRC check, while the database
+  * page might not be.  This will force us to replay all subsequent
+  * modifications of the page that appear in XLOG, rather than possibly
+  * ignoring them as already applied, but that's not a huge drawback.
+  *
+  * If 'get_cleanup_lock' is true, a cleanup lock is obtained on the buffer,
+  * else a normal exclusive lock is used.  During crash recovery, that's just
+  * pro forma because there can't be any regular backends in the system, but
+  * in hot standby mode the distinction is important.
+  *
+  * If 'keep_buffer' is true, return without releasing the buffer lock and pin;
+  * then caller is responsible for doing UnlockReleaseBuffer() later.  This
+  * is needed in some cases when replaying XLOG records that touch multiple
+  * pages, to prevent inconsistent states from being visible to other backends.
+  * (Again, that's only important in hot standby mode.)
+  */
+ Buffer
+ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
+                    bool get_cleanup_lock, bool keep_buffer)
+ {
+     Buffer        buffer;
+     Page        page;
+     BkpBlock    bkpb;
+     char       *blk;
+     int            i;
+
+     /* Locate requested BkpBlock in the record */
+     blk = (char *) XLogRecGetData(record) + record->xl_len;
+     for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+     {
+         if (!(record->xl_info & XLR_BKP_BLOCK(i)))
+             continue;
+
+         memcpy(&bkpb, blk, sizeof(BkpBlock));
+         blk += sizeof(BkpBlock);
+
+         if (i == block_index)
+         {
+             /* Found it, apply the update */
+             buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
+                                             RBM_ZERO);
+             Assert(BufferIsValid(buffer));
+             if (get_cleanup_lock)
+                 LockBufferForCleanup(buffer);
+             else
+                 LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+             page = (Page) BufferGetPage(buffer);
+
+             if (bkpb.hole_length == 0)
+             {
+                 memcpy((char *) page, blk, BLCKSZ);
+             }
+             else
+             {
+                 memcpy((char *) page, blk, bkpb.hole_offset);
+                 /* must zero-fill the hole */
+                 MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
+                 memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
+                        blk + bkpb.hole_offset,
+                        BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
+             }
+
+             PageSetLSN(page, lsn);
+             PageSetTLI(page, ThisTimeLineID);
+             MarkBufferDirty(buffer);
+
+             if (!keep_buffer)
+                 UnlockReleaseBuffer(buffer);
+
+             return buffer;
+         }
+
+         blk += BLCKSZ - bkpb.hole_length;
+     }
+
+     /* Caller specified a bogus block_index */
+     elog(ERROR, "failed to restore block_index %d", block_index);
+     return InvalidBuffer;        /* keep compiler quiet */
+ }
+
+ /*
   * CRC-check an XLOG record.  We do not believe the contents of an XLOG
   * record (other than to the minimal extent of computing the amount of
   * data to read in) until we've checked the CRCs.
*************** RecordIsValid(XLogRecord *record, XLogRe
*** 3193,3199 ****
      {
          uint32        blen;

!         if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
              continue;

          if (remaining < sizeof(BkpBlock))
--- 3290,3296 ----
      {
          uint32        blen;

!         if (!(record->xl_info & XLR_BKP_BLOCK(i)))
              continue;

          if (remaining < sizeof(BkpBlock))
*************** xlog_outrec(StringInfo buf, XLogRecord *
*** 8081,8087 ****
      int            i;

      appendStringInfo(buf, "prev %X/%X; xid %u",
!                      (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
                       record->xl_xid);

      appendStringInfo(buf, "; len %u",
--- 8178,8185 ----
      int            i;

      appendStringInfo(buf, "prev %X/%X; xid %u",
!                      (uint32) (record->xl_prev >> 32),
!                      (uint32) record->xl_prev,
                       record->xl_xid);

      appendStringInfo(buf, "; len %u",
*************** xlog_outrec(StringInfo buf, XLogRecord *
*** 8089,8096 ****

      for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
      {
!         if (record->xl_info & XLR_SET_BKP_BLOCK(i))
!             appendStringInfo(buf, "; bkpb%d", i + 1);
      }

      appendStringInfo(buf, ": %s", RmgrTable[record->xl_rmid].rm_name);
--- 8187,8194 ----

      for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
      {
!         if (record->xl_info & XLR_BKP_BLOCK(i))
!             appendStringInfo(buf, "; bkpb%d", i);
      }

      appendStringInfo(buf, ": %s", RmgrTable[record->xl_rmid].rm_name);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2893f3b35243effb3890630dde423daddb51eaea..4aedb25dfc31ffbc307765955737a43bc5054bc0 100644
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
*************** typedef struct XLogRecord
*** 71,76 ****
--- 71,78 ----
   */
  #define XLR_BKP_BLOCK_MASK        0x0F    /* all info bits used for bkp blocks */
  #define XLR_MAX_BKP_BLOCKS        4
+ #define XLR_BKP_BLOCK(iblk)        (0x08 >> (iblk))        /* iblk in 0..3 */
+ // XXX these macros will go away:
  #define XLR_SET_BKP_BLOCK(iblk) (0x08 >> (iblk))
  #define XLR_BKP_BLOCK_1            XLR_SET_BKP_BLOCK(0)    /* 0x08 */
  #define XLR_BKP_BLOCK_2            XLR_SET_BKP_BLOCK(1)    /* 0x04 */
*************** extern int    sync_method;
*** 94,106 ****
   * If buffer is valid then XLOG will check if buffer must be backed up
   * (ie, whether this is first change of that page since last checkpoint).
   * If so, the whole page contents are attached to the XLOG record, and XLOG
!  * sets XLR_BKP_BLOCK_X bit in xl_info.  Note that the buffer must be pinned
   * and exclusive-locked by the caller, so that it won't change under us.
   * NB: when the buffer is backed up, we DO NOT insert the data pointed to by
   * this XLogRecData struct into the XLOG record, since we assume it's present
   * in the buffer.  Therefore, rmgr redo routines MUST pay attention to
!  * XLR_BKP_BLOCK_X to know what is actually stored in the XLOG record.
!  * The i'th XLR_BKP_BLOCK bit corresponds to the i'th distinct buffer
   * value (ignoring InvalidBuffer) appearing in the rdata chain.
   *
   * When buffer is valid, caller must set buffer_std to indicate whether the
--- 96,108 ----
   * If buffer is valid then XLOG will check if buffer must be backed up
   * (ie, whether this is first change of that page since last checkpoint).
   * If so, the whole page contents are attached to the XLOG record, and XLOG
!  * sets XLR_BKP_BLOCK(N) bit in xl_info.  Note that the buffer must be pinned
   * and exclusive-locked by the caller, so that it won't change under us.
   * NB: when the buffer is backed up, we DO NOT insert the data pointed to by
   * this XLogRecData struct into the XLOG record, since we assume it's present
   * in the buffer.  Therefore, rmgr redo routines MUST pay attention to
!  * XLR_BKP_BLOCK(N) to know what is actually stored in the XLOG record.
!  * The N'th XLR_BKP_BLOCK bit corresponds to the N'th distinct buffer
   * value (ignoring InvalidBuffer) appearing in the rdata chain.
   *
   * When buffer is valid, caller must set buffer_std to indicate whether the
*************** extern int    XLogFileOpen(XLogSegNo segno)
*** 274,279 ****
--- 276,285 ----
  extern void XLogGetLastRemoved(XLogSegNo *segno);
  extern void XLogSetAsyncXactLSN(XLogRecPtr record);

+ extern Buffer RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record,
+                    int block_index,
+                    bool get_cleanup_lock, bool keep_buffer);
+ // XXX this will go away:
  extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);

  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
