Inadequate thought about buffer locking during hot standby replay

From: Tom Lane
Date: 2012-11-09 18:24:25 -0500
During normal running, operations such as btree page splits are
extremely careful about the order in which they acquire and release
buffer locks, if they're doing something that concurrently modifies
multiple pages.

During WAL replay, that all goes out the window.  Even if an individual
WAL-record replay function does things in the right order for "standard"
cases, RestoreBkpBlocks has no idea what it's doing.  So if one or more
of the referenced pages gets treated as a full-page image, we are left
with no guarantee whatsoever about what order the pages are restored in.
That never mattered when the code was originally designed, but it sure
matters during Hot Standby when other queries might be able to see the
intermediate states.
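
For illustration, the current control flow is roughly this (a
simplified sketch, not the exact code):

    /* in the rmgr's redo entry point, e.g. btree_redo(): first restore
     * ALL full-page images, releasing each buffer lock as soon as that
     * page has been written ... */
    RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM));

    /* ... and only then apply the remaining (non-FPI) changes */
    btree_xlog_split(onleft, isroot, lsn, record);

Nothing connects the order of the FPI restores to the locking protocol
the redo routine itself would follow.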

I can't prove that this is the cause of bug #7648, but it's fairly easy
to see that it could explain the symptom.  You only need to assume that
the page-being-split had been handled as a full-page image, and that the
new right-hand page had gotten allocated by extending the relation.
Then there will be an interval just after RestoreBkpBlocks does its
thing where the updated left-hand sibling is in the index and is not
locked in any way, but its right-link points off the end of the index.
If a few indexscans come along before the replay process gets to
continue, you'd get exactly the reported errors.
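
Concretely, the interleaving I'm positing is about like this (my
sketch, not taken from the report):

    /*
     * startup proc: RestoreBkpBlocks() restores the left page's FPI and
     *               immediately drops its buffer lock; that page's
     *               right-link now points past the end of the index
     * backend:      an indexscan follows the right-link and tries to
     *               read a block that doesn't exist yet, and errors out
     * startup proc: btree_xlog_split() finally creates the right
     *               sibling page, too late for that scan
     */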

I'm inclined to think that we need to fix this by getting rid of
RestoreBkpBlocks per se, and instead having the per-WAL-record restore
routines dictate when each full-page image is restored (and whether or
not to release the buffer lock immediately).  That's not going to be a
small change unfortunately :-(
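
Roughly, each replay routine would then do something like this for each
potentially-backed-up page (only a sketch, with a hypothetical helper,
say RestoreBackupBlock(); details would vary per record type):

    if (record->xl_info & XLR_BKP_BLOCK_1)
        (void) RestoreBackupBlock(lsn, record, 0, false, false);
    else
    {
        /* read the page and apply the incremental change, as now */
    }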
        regards, tom lane



Re: Inadequate thought about buffer locking during hot standby replay

From: Andres Freund
On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
> I can't prove that this is the cause of bug #7648, but it's fairly easy
> to see that it could explain the symptom.  You only need to assume that
> the page-being-split had been handled as a full-page image, and that the
> new right-hand page had gotten allocated by extending the relation.
> Then there will be an interval just after RestoreBkpBlocks does its
> thing where the updated left-hand sibling is in the index and is not
> locked in any way, but its right-link points off the end of the index.
> If a few indexscans come along before the replay process gets to
> continue, you'd get exactly the reported errors.

Sounds plausible.

> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately).  That's not going to be a
> small change unfortunately :-(

I wonder if we couldn't instead "fix" it by ensuring the backup blocks
are in the right order at the inserting location. That would "just"
need some care about the order of XLogRecData blocks.
I am pretty unfamiliar with the nbtree locking, but I seem to remember
that we should be fine if we always restore from left to right?
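
Something like this when building the record, say (hypothetical sketch
with invented variable names), relying on the fact that backup-block
numbering follows the order of distinct buffers seen in the rdata
chain:

    rdata[0].buffer = lbuf;         /* reference the left page first */
    rdata[0].next = &(rdata[1]);
    rdata[1].buffer = rnextbuf;     /* then the page further right */
    rdata[1].next = NULL;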

Greetings,

Andres Freund



Re: Inadequate thought about buffer locking during hot standby replay

From: Daniel Farina
On Fri, Nov 9, 2012 at 3:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> During normal running, operations such as btree page splits are
> extremely careful about the order in which they acquire and release
> buffer locks, if they're doing something that concurrently modifies
> multiple pages.
>
> During WAL replay, that all goes out the window.  Even if an individual
> WAL-record replay function does things in the right order for "standard"
> cases, RestoreBkpBlocks has no idea what it's doing.  So if one or more
> of the referenced pages gets treated as a full-page image, we are left
> with no guarantee whatsoever about what order the pages are restored in.
> That never mattered when the code was originally designed, but it sure
> matters during Hot Standby when other queries might be able to see the
> intermediate states.
>
> I can't prove that this is the cause of bug #7648,

(I was the reporter of 7648)

To lend slightly more circumstantial evidence in support of this, I
also happened to note that the relfile in question was the last
segment and was only about a quarter full, so the access attempt was
definitely at the extreme outer edge of the index.

--
fdr



Re: Inadequate thought about buffer locking during hot standby replay

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately).  That's not going to be a
>> small change unfortunately :-(

> I wonder if we couldn't instead "fix" it by ensuring the backup blocks
> are in the right order in the backup blocks at the inserting
> location. That would "just" need some care about the order of
> XLogRecData blocks.

I don't think that's a good way to go.  In the first place, if we did
that the fix would require incompatible changes in the contents of WAL
streams.  In the second place, there are already severe constraints on
the positioning of backup blocks to ensure that WAL records can be
uniquely decoded (the section of access/transam/README about WAL coding
touches on this) --- I don't think it's a good plan to add still more
constraints there.  And in the third place, the specific problem we're
positing here results from a failure to hold the buffer lock for a
full-page image until after we're done restoring a *non* full-page image
represented elsewhere in the same WAL record.  In general, of the set
of pages touched by a WAL record, any arbitrary subset of them might be
converted to FPIs during XLogInsert; but the replay-time locking
requirements are going to be the same regardless of that.  So AFAICS,
any design in which RestoreBkpBlocks acts independently of the
non-full-page-image updates is just broken.
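
For a record touching two pages, that means replay must handle all four
FPI combinations with identical locking behavior; schematically (just a
sketch):

    /* page 0: restore the FPI if present, else update incrementally */
    if (record->xl_info & XLR_BKP_BLOCK_1)
        (void) RestoreBackupBlock(lsn, record, 0, false, false);
    else
        { /* ... incremental update of page 0 ... */ }

    /* page 1: likewise, independently of what happened for page 0 */
    if (record->xl_info & XLR_BKP_BLOCK_2)
        (void) RestoreBackupBlock(lsn, record, 1, false, false);
    else
        { /* ... incremental update of page 1 ... */ }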
        regards, tom lane



Re: Inadequate thought about buffer locking during hot standby replay

From: Simon Riggs
On 9 November 2012 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> During normal running, operations such as btree page splits are
> extremely careful about the order in which they acquire and release
> buffer locks, if they're doing something that concurrently modifies
> multiple pages.
>
> During WAL replay, that all goes out the window.  Even if an individual
> WAL-record replay function does things in the right order for "standard"
> cases, RestoreBkpBlocks has no idea what it's doing.  So if one or more
> of the referenced pages gets treated as a full-page image, we are left
> with no guarantee whatsoever about what order the pages are restored in.
> That never mattered when the code was originally designed, but it sure
> matters during Hot Standby when other queries might be able to see the
> intermediate states.
>
> I can't prove that this is the cause of bug #7648, but it's fairly easy
> to see that it could explain the symptom.  You only need to assume that
> the page-being-split had been handled as a full-page image, and that the
> new right-hand page had gotten allocated by extending the relation.
> Then there will be an interval just after RestoreBkpBlocks does its
> thing where the updated left-hand sibling is in the index and is not
> locked in any way, but its right-link points off the end of the index.
> If a few indexscans come along before the replay process gets to
> continue, you'd get exactly the reported errors.
>
> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately).  That's not going to be a
> small change unfortunately :-(

No, but it looks like a clear bug scenario and a clear resolution also.

I'll start looking at it.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inadequate thought about buffer locking during hot standby replay

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 9 November 2012 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately).  That's not going to be a
>> small change unfortunately :-(

> No, but it looks like a clear bug scenario and a clear resolution also.
> I'll start looking at it.

I'm on it already.
        regards, tom lane



Re: Inadequate thought about buffer locking during hot standby replay

From: Tom Lane
Date: 2012-11-10 18:16
I wrote:
> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately).  That's not going to be a
> small change unfortunately :-(

Here's a WIP patch that attacks it in this way.  I've only gone as far as
fixing btree_xlog_split() for the moment, since the main point is to see
whether the coding change seems reasonable.  At least in this function,
it seems that locking considerations are handled correctly as long as no
full-page image is used, so it's pretty straightforward to tweak it to
handle the case with full-page image(s) as well.  I've not looked
through any other replay functions yet --- some of them may need more
invasive hacking.  But so far this looks pretty good.

Note that this patch isn't correct yet even for btree_xlog_split(),
because I've not removed the RestoreBkpBlocks() call in btree_redo().
All the btree replay routines will have to get fixed before it's
testable at all.

One thing that seems a bit annoying is the use of zero-based backup
block indexes in RestoreBackupBlock, when most (not all) of the callers
are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
That's a bug waiting to happen.  We could address it by changing
RestoreBackupBlock to accept a one-based argument, but what I'm thinking
of doing instead is getting rid of the one-based convention entirely;
that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
advantage of doing that is that it'll force inspection of all code
related to this.
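
For reference, the macros in question (from src/include/access/xlog.h):

    #define XLR_SET_BKP_BLOCK(iblk) (0x08 >> (iblk))
    #define XLR_BKP_BLOCK_1         XLR_SET_BKP_BLOCK(0)    /* 0x08 */
    #define XLR_BKP_BLOCK_2         XLR_SET_BKP_BLOCK(1)    /* 0x04 */
    #define XLR_BKP_BLOCK_3         XLR_SET_BKP_BLOCK(2)    /* 0x02 */
    #define XLR_BKP_BLOCK_4         XLR_SET_BKP_BLOCK(3)    /* 0x01 */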

Comments, opinions?

            regards, tom lane

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..ee7e15fddcfca7dfe7ccba6ab021397937b1a96d 100644
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
*************** btree_xlog_split(bool onleft, bool isroo
*** 323,329 ****
          datalen -= newitemsz;
      }

!     /* Reconstruct right (new) sibling from scratch */
      rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
      Assert(BufferIsValid(rbuf));
      rpage = (Page) BufferGetPage(rbuf);
--- 323,329 ----
          datalen -= newitemsz;
      }

!     /* Reconstruct right (new) sibling page from scratch */
      rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
      Assert(BufferIsValid(rbuf));
      rpage = (Page) BufferGetPage(rbuf);
*************** btree_xlog_split(bool onleft, bool isroo
*** 357,374 ****

      /* don't release the buffer yet; we touch right page's first item below */

!     /*
!      * Reconstruct left (original) sibling if needed.  Note that this code
!      * ensures that the items remaining on the left page are in the correct
!      * item number order, but it does not reproduce the physical order they
!      * would have had.    Is this worth changing?  See also _bt_restore_page().
!      */
!     if (!(record->xl_info & XLR_BKP_BLOCK_1))
      {
          Buffer        lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

          if (BufferIsValid(lbuf))
          {
              Page        lpage = (Page) BufferGetPage(lbuf);
              BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);

--- 357,377 ----

      /* don't release the buffer yet; we touch right page's first item below */

!     /* Now reconstruct left (original) sibling page */
!     if (record->xl_info & XLR_BKP_BLOCK_1)
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!     else
      {
          Buffer        lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

          if (BufferIsValid(lbuf))
          {
+             /*
+              * Note that this code ensures that the items remaining on the
+              * left page are in the correct item number order, but it does not
+              * reproduce the physical order they would have had.  Is this
+              * worth changing?  See also _bt_restore_page().
+              */
              Page        lpage = (Page) BufferGetPage(lbuf);
              BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);

*************** btree_xlog_split(bool onleft, bool isroo
*** 432,439 ****
      /* We no longer need the right buffer */
      UnlockReleaseBuffer(rbuf);

!     /* Fix left-link of the page to the right of the new right sibling */
!     if (xlrec->rnext != P_NONE && !(record->xl_info & XLR_BKP_BLOCK_2))
      {
          Buffer        buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);

--- 435,451 ----
      /* We no longer need the right buffer */
      UnlockReleaseBuffer(rbuf);

!     /*
!      * Fix left-link of the page to the right of the new right sibling.
!      *
!      * Note: in normal operation, we do this while still holding lock on the
!      * two split pages.  However, that's not necessary for correctness in WAL
!      * replay, because no other index update can be in progress, and readers
!      * will cope properly when following an obsolete left-link.
!      */
!     if (record->xl_info & XLR_BKP_BLOCK_2)
!         (void) RestoreBackupBlock(lsn, record, 1, false, false);
!     else if (xlrec->rnext != P_NONE)
      {
          Buffer        buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 573c9ad682e59d9e7b01ad7c8d9d321f15f948f3..b167ae0382b58ac3b4c995e5652bb6cdb3ecc59e 100644
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
*************** ID at least once; otherwise there is no
*** 498,508 ****
  The standard replay-routine pattern for this case is

      if (record->xl_info & XLR_BKP_BLOCK_n)
!         << do nothing, page was rewritten from logged copy >>;

      buffer = XLogReadBuffer(rnode, blkno, false);
      if (!BufferIsValid(buffer))
!         << do nothing, page has been deleted >>;
      page = (Page) BufferGetPage(buffer);

      if (XLByteLE(lsn, PageGetLSN(page)))
--- 498,515 ----
  The standard replay-routine pattern for this case is

      if (record->xl_info & XLR_BKP_BLOCK_n)
!     {
!         /* apply the change from the full-page image */
!         (void) RestoreBackupBlock(lsn, record, n-1, false, false);
!         return;
!     }

      buffer = XLogReadBuffer(rnode, blkno, false);
      if (!BufferIsValid(buffer))
!     {
!         /* page has been deleted, so we need do nothing */
!         return;
!     }
      page = (Page) BufferGetPage(buffer);

      if (XLByteLE(lsn, PageGetLSN(page)))
*************** associated with the n'th distinct buffer
*** 527,532 ****
--- 534,570 ----
  per the above discussion, fully-rewritable buffers shouldn't be mentioned in
  "rdata".)

+ When replaying a WAL record that describes changes on multiple pages, you
+ must be careful to lock the pages in the same way that the original action
+ did, because in Hot Standby mode there may be other backends inspecting
+ these pages.  Replay can't expose inconsistent page states to concurrent
+ queries, any more than normal operation did.  If this requires that two
+ or more buffer locks be held concurrently, the coding pattern shown above
+ is too simplistic, since it assumes the routine can exit as soon as it's
+ known the current page requires no modification.  Instead, you might have
+ something like
+
+     if (record->xl_info & XLR_BKP_BLOCK_1)
+     {
+         /* apply the change from the full-page image */
+         buffer1 = RestoreBackupBlock(lsn, record, 0, false, true);
+     }
+     else
+     {
+         buffer1 = XLogReadBuffer(rnode, blkno, false);
+         if (BufferIsValid(buffer1))
+         {
+             ... apply the change if not already done ...
+             MarkBufferDirty(buffer1);
+         }
+     }
+
+     ... similarly apply the changes for page 2 ...
+
+     /* and now we can release the lock on the first page */
+     if (BufferIsValid(buffer1))
+         UnlockReleaseBuffer(buffer1);
+
  Due to all these constraints, complex changes (such as a multilevel index
  insertion) normally need to be described by a series of atomic-action WAL
  records.  What do you do if the intermediate states are not self-consistent?
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf76f6d24cb8bbf3b1b8b19e0a1c007ff08e4035..4d379e2f78a89604795362e76a03867dcfdf42e6 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec
*** 3155,3160 ****
--- 3155,3256 ----
  }

  /*
+  * Restore a full-page image from a backup block attached to an XLOG record.
+  *
+  * lsn: LSN of the XLOG record being replayed
+  * record: the complete XLOG record
+  * block_index: which backup block to restore (0 .. XLR_MAX_BKP_BLOCKS - 1)
+  * get_cleanup_lock: TRUE to get a cleanup rather than plain exclusive lock
+  * keep_buffer: TRUE to return the buffer still locked and pinned
+  *
+  * Returns the buffer number containing the page.  Note this is not terribly
+  * useful unless keep_buffer is specified as TRUE.
+  *
+  * Note: when a backup block is available in XLOG, we restore it
+  * unconditionally, even if the page in the database appears newer.
+  * This is to protect ourselves against database pages that were partially
+  * or incorrectly written during a crash.  We assume that the XLOG data
+  * must be good because it has passed a CRC check, while the database
+  * page might not be.  This will force us to replay all subsequent
+  * modifications of the page that appear in XLOG, rather than possibly
+  * ignoring them as already applied, but that's not a huge drawback.
+  *
+  * If 'get_cleanup_lock' is true, a cleanup lock is obtained on the buffer,
+  * else a normal exclusive lock is used.  During crash recovery, that's just
+  * pro forma because there can't be any regular backends in the system, but
+  * in hot standby mode the distinction is important.
+  *
+  * If 'keep_buffer' is true, return without releasing the buffer lock and pin;
+  * then caller is responsible for doing UnlockReleaseBuffer() later.  This
+  * is needed in some cases when replaying XLOG records that touch multiple
+  * pages, to prevent inconsistent states from being visible to other backends.
+  * (Again, that's only important in hot standby mode.)
+  */
+ Buffer
+ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
+                    bool get_cleanup_lock, bool keep_buffer)
+ {
+     Buffer        buffer;
+     Page        page;
+     BkpBlock    bkpb;
+     char       *blk;
+     int            i;
+
+     /* Locate requested BkpBlock in the record */
+     blk = (char *) XLogRecGetData(record) + record->xl_len;
+     for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+     {
+         if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
+             continue;
+
+         memcpy(&bkpb, blk, sizeof(BkpBlock));
+         blk += sizeof(BkpBlock);
+
+         if (i == block_index)
+         {
+             buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
+                                             RBM_ZERO);
+             Assert(BufferIsValid(buffer));
+             if (get_cleanup_lock)
+                 LockBufferForCleanup(buffer);
+             else
+                 LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+             page = (Page) BufferGetPage(buffer);
+
+             if (bkpb.hole_length == 0)
+             {
+                 memcpy((char *) page, blk, BLCKSZ);
+             }
+             else
+             {
+                 memcpy((char *) page, blk, bkpb.hole_offset);
+                 /* must zero-fill the hole */
+                 MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
+                 memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
+                        blk + bkpb.hole_offset,
+                        BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
+             }
+
+             PageSetLSN(page, lsn);
+             PageSetTLI(page, ThisTimeLineID);
+             MarkBufferDirty(buffer);
+
+             if (!keep_buffer)
+                 UnlockReleaseBuffer(buffer);
+
+             return buffer;
+         }
+
+         blk += BLCKSZ - bkpb.hole_length;
+     }
+
+     /* Caller specified a bogus block_index */
+     elog(ERROR, "failed to restore block_index %d", block_index);
+     return InvalidBuffer;        /* keep compiler quiet */
+ }
+
+ /*
   * CRC-check an XLOG record.  We do not believe the contents of an XLOG
   * record (other than to the minimal extent of computing the amount of
   * data to read in) until we've checked the CRCs.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2893f3b35243effb3890630dde423daddb51eaea..e5cfb96a600b83b50bee2871ee23116c54d0883d 100644
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
*************** extern void XLogGetLastRemoved(XLogSegNo
*** 275,280 ****
--- 275,283 ----
  extern void XLogSetAsyncXactLSN(XLogRecPtr record);

  extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);
+ extern Buffer RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record,
+                    int block_index,
+                    bool get_cleanup_lock, bool keep_buffer);

  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);



Re: Inadequate thought about buffer locking during hot standby replay

From: Simon Riggs
On 10 November 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately).  That's not going to be a
>> small change unfortunately :-(
>
> Here's a WIP patch that attacks it in this way.  I've only gone as far as
> fixing btree_xlog_split() for the moment, since the main point is to see
> whether the coding change seems reasonable.  At least in this function,
> it seems that locking considerations are handled correctly as long as no
> full-page image is used, so it's pretty straightforward to tweak it to
> handle the case with full-page image(s) as well.  I've not looked
> through any other replay functions yet --- some of them may need more
> invasive hacking.  But so far this looks pretty good.

Looks fine, but we need a top-level comment in btree code explaining
why/how it follows the very well explained rules in the README.

> Note that this patch isn't correct yet even for btree_xlog_split(),
> because I've not removed the RestoreBkpBlocks() call in btree_redo().
> All the btree replay routines will have to get fixed before it's
> testable at all.

I was just about to write back to say you'd missed that...

> One thing that seems a bit annoying is the use of zero-based backup
> block indexes in RestoreBackupBlock, when most (not all) of the callers
> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
> That's a bug waiting to happen.  We could address it by changing
> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
> of doing instead is getting rid of the one-based convention entirely;
> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
> advantage of doing that is that it'll force inspection of all code
> related to this.

I wouldn't do that in a back branch, but I can see why it's a good idea.

Thanks for fixing.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inadequate thought about buffer locking during hot standby replay

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 10 November 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Here's a WIP patch that attacks it in this way.

> Looks fine, but we need a top-level comment in btree code explaining
> why/how it follows the very well explained rules in the README.

Not sure what you'd want it to say exactly?  Really these issues are per
WAL-replay function, not global.  I've now been through all of the btree
functions, and AFAICS btree_xlog_split is the only one of them that
actually has a bug; the others can be a bit lax about the order in which
things get done.

>> One thing that seems a bit annoying is the use of zero-based backup
>> block indexes in RestoreBackupBlock, when most (not all) of the callers
>> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
>> That's a bug waiting to happen.  We could address it by changing
>> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
>> of doing instead is getting rid of the one-based convention entirely;
>> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
>> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
>> advantage of doing that is that it'll force inspection of all code
>> related to this.

> I wouldn't do that in a back branch, but I can see why it's a good idea.

If any of this stuff were getting used by external modules, changing it
would be problematic ... but fortunately, it isn't, because we lack
support for plug-in rmgrs.  So I'm not worried about back-patching the
change, and would prefer to keep the 9.x branches in sync.

Attached is an updated patch in which nbtxlog.c has been fully converted,
but I've not touched other rmgrs yet.  I've tested this to the extent of
having a replication slave follow a master that's running the regression
tests, and it seems to work.  It's difficult to prove much about
concurrent behavior by testing, of course.

I found that there's another benefit of managing backup-block replay
this way, which is that we can de-contort the handling of conflict
resolution by moving it into the per-record-type subroutines, instead of
needing an extra switch before the RestoreBkpBlocks call.  So that gives
me more confidence that this is a good way to go.

            regards, tom lane

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..8f534803a7cea75105869de6eaf2946e03d5437c 100644
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
*************** btree_xlog_insert(bool isleaf, bool isme
*** 218,227 ****
          datalen -= sizeof(xl_btree_metadata);
      }

!     if ((record->xl_info & XLR_BKP_BLOCK_1) && !ismeta && isleaf)
!         return;                    /* nothing to do */
!
!     if (!(record->xl_info & XLR_BKP_BLOCK_1))
      {
          buffer = XLogReadBuffer(xlrec->target.node,
                               ItemPointerGetBlockNumber(&(xlrec->target.tid)),
--- 218,226 ----
          datalen -= sizeof(xl_btree_metadata);
      }

!     if (record->xl_info & XLR_BKP_BLOCK(0))
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!     else
      {
          buffer = XLogReadBuffer(xlrec->target.node,
                               ItemPointerGetBlockNumber(&(xlrec->target.tid)),
*************** btree_xlog_insert(bool isleaf, bool isme
*** 249,254 ****
--- 248,260 ----
          }
      }

+     /*
+      * Note: in normal operation, we'd update the metapage while still holding
+      * lock on the page we inserted into.  But during replay it's not
+      * necessary to hold that lock, since no other index updates can be
+      * happening concurrently, and readers will cope fine with following an
+      * obsolete link from the metapage.
+      */
      if (ismeta)
          _bt_restore_meta(xlrec->target.node, lsn,
                           md.root, md.level,
*************** btree_xlog_split(bool onleft, bool isroo
*** 290,296 ****
          forget_matching_split(xlrec->node, downlink, false);

          /* Extract left hikey and its size (still assuming 16-bit alignment) */
!         if (!(record->xl_info & XLR_BKP_BLOCK_1))
          {
              /* We assume 16-bit alignment is enough for IndexTupleSize */
              left_hikey = (Item) datapos;
--- 296,302 ----
          forget_matching_split(xlrec->node, downlink, false);

          /* Extract left hikey and its size (still assuming 16-bit alignment) */
!         if (!(record->xl_info & XLR_BKP_BLOCK(0)))
          {
              /* We assume 16-bit alignment is enough for IndexTupleSize */
              left_hikey = (Item) datapos;
*************** btree_xlog_split(bool onleft, bool isroo
*** 310,316 ****
          datalen -= sizeof(OffsetNumber);
      }

!     if (onleft && !(record->xl_info & XLR_BKP_BLOCK_1))
      {
          /*
           * We assume that 16-bit alignment is enough to apply IndexTupleSize
--- 316,322 ----
          datalen -= sizeof(OffsetNumber);
      }

!     if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
      {
          /*
           * We assume that 16-bit alignment is enough to apply IndexTupleSize
*************** btree_xlog_split(bool onleft, bool isroo
*** 323,329 ****
          datalen -= newitemsz;
      }

!     /* Reconstruct right (new) sibling from scratch */
      rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
      Assert(BufferIsValid(rbuf));
      rpage = (Page) BufferGetPage(rbuf);
--- 329,335 ----
          datalen -= newitemsz;
      }

!     /* Reconstruct right (new) sibling page from scratch */
      rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
      Assert(BufferIsValid(rbuf));
      rpage = (Page) BufferGetPage(rbuf);
*************** btree_xlog_split(bool onleft, bool isroo
*** 357,374 ****

      /* don't release the buffer yet; we touch right page's first item below */

!     /*
!      * Reconstruct left (original) sibling if needed.  Note that this code
!      * ensures that the items remaining on the left page are in the correct
!      * item number order, but it does not reproduce the physical order they
!      * would have had.    Is this worth changing?  See also _bt_restore_page().
!      */
!     if (!(record->xl_info & XLR_BKP_BLOCK_1))
      {
          Buffer        lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

          if (BufferIsValid(lbuf))
          {
              Page        lpage = (Page) BufferGetPage(lbuf);
              BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);

--- 363,383 ----

      /* don't release the buffer yet; we touch right page's first item below */

!     /* Now reconstruct left (original) sibling page */
!     if (record->xl_info & XLR_BKP_BLOCK(0))
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!     else
      {
          Buffer        lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);

          if (BufferIsValid(lbuf))
          {
+             /*
+              * Note that this code ensures that the items remaining on the
+              * left page are in the correct item number order, but it does not
+              * reproduce the physical order they would have had.  Is this
+              * worth changing?  See also _bt_restore_page().
+              */
              Page        lpage = (Page) BufferGetPage(lbuf);
              BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);

*************** btree_xlog_split(bool onleft, bool isroo
*** 432,439 ****
      /* We no longer need the right buffer */
      UnlockReleaseBuffer(rbuf);

!     /* Fix left-link of the page to the right of the new right sibling */
!     if (xlrec->rnext != P_NONE && !(record->xl_info & XLR_BKP_BLOCK_2))
      {
          Buffer        buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);

--- 441,457 ----
      /* We no longer need the right buffer */
      UnlockReleaseBuffer(rbuf);

!     /*
!      * Fix left-link of the page to the right of the new right sibling.
!      *
!      * Note: in normal operation, we do this while still holding lock on the
!      * two split pages.  However, that's not necessary for correctness in WAL
!      * replay, because no other index update can be in progress, and readers
!      * will cope properly when following an obsolete left-link.
!      */
!     if (record->xl_info & XLR_BKP_BLOCK(1))
!         (void) RestoreBackupBlock(lsn, record, 1, false, false);
!     else if (xlrec->rnext != P_NONE)
      {
          Buffer        buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);

*************** btree_xlog_split(bool onleft, bool isroo
*** 463,475 ****
  static void
  btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_vacuum *xlrec;
      Buffer        buffer;
      Page        page;
      BTPageOpaque opaque;

-     xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
-
      /*
       * If queries might be active then we need to ensure every block is
       * unpinned between the lastBlockVacuumed and the current block, if there
--- 481,491 ----
  static void
  btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
      Buffer        buffer;
      Page        page;
      BTPageOpaque opaque;

      /*
       * If queries might be active then we need to ensure every block is
       * unpinned between the lastBlockVacuumed and the current block, if there
*************** btree_xlog_vacuum(XLogRecPtr lsn, XLogRe
*** 502,514 ****
      }

      /*
!      * If the block was restored from a full page image, nothing more to do.
!      * The RestoreBkpBlocks() call already pinned and took cleanup lock on it.
!      * XXX: Perhaps we should call RestoreBkpBlocks() *after* the loop above,
!      * to make the disk access more sequential.
       */
!     if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

      /*
       * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
--- 518,531 ----
      }

      /*
!      * If we have a full-page image, restore it (using a cleanup lock) and
!      * we're done.
       */
!     if (record->xl_info & XLR_BKP_BLOCK(0))
!     {
!         (void) RestoreBackupBlock(lsn, record, 0, true, false);
          return;
+     }

      /*
       * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
*************** btree_xlog_vacuum(XLogRecPtr lsn, XLogRe
*** 563,571 ****
   * XXX optimise later with something like XLogPrefetchBuffer()
   */
  static TransactionId
! btree_xlog_delete_get_latestRemovedXid(XLogRecord *record)
  {
-     xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
      OffsetNumber *unused;
      Buffer        ibuffer,
                  hbuffer;
--- 580,587 ----
   * XXX optimise later with something like XLogPrefetchBuffer()
   */
  static TransactionId
! btree_xlog_delete_get_latestRemovedXid(xl_btree_delete *xlrec)
  {
      OffsetNumber *unused;
      Buffer        ibuffer,
                  hbuffer;
*************** btree_xlog_delete_get_latestRemovedXid(X
*** 702,716 ****
  static void
  btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_delete *xlrec;
      Buffer        buffer;
      Page        page;
      BTPageOpaque opaque;

!     if (record->xl_info & XLR_BKP_BLOCK_1)
!         return;

!     xlrec = (xl_btree_delete *) XLogRecGetData(record);

      /*
       * We don't need to take a cleanup lock to apply these changes. See
--- 718,752 ----
  static void
  btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
      Buffer        buffer;
      Page        page;
      BTPageOpaque opaque;

!     /*
!      * If we have any conflict processing to do, it must happen before we
!      * update the page.
!      *
!      * Btree delete records can conflict with standby queries.  You might
!      * think that vacuum records would conflict as well, but we've handled
!      * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
!      * cleaned by the vacuum of the heap and so we can resolve any conflicts
!      * just once when that arrives.  After that we know that no conflicts
!      * exist from individual btree vacuum records on that index.
!      */
!     if (InHotStandby)
!     {
!         TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(xlrec);

!         ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
!     }
!
!     /* If we have a full-page image, restore it and we're done */
!     if (record->xl_info & XLR_BKP_BLOCK(0))
!     {
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!         return;
!     }

      /*
       * We don't need to take a cleanup lock to apply these changes. See
*************** btree_xlog_delete_page(uint8 info, XLogR
*** 766,773 ****
      leftsib = xlrec->leftblk;
      rightsib = xlrec->rightblk;

      /* parent page */
!     if (!(record->xl_info & XLR_BKP_BLOCK_1))
      {
          buffer = XLogReadBuffer(xlrec->target.node, parent, false);
          if (BufferIsValid(buffer))
--- 802,819 ----
      leftsib = xlrec->leftblk;
      rightsib = xlrec->rightblk;

+     /*
+      * In normal operation, we would lock all the pages this WAL record
+      * touches before changing any of them.  In WAL replay, it should be okay
+      * to lock just one page at a time, since no concurrent index updates can
+      * be happening, and readers should not care whether they arrive at the
+      * target page or not (since it's surely empty).
+      */
+
      /* parent page */
!     if (record->xl_info & XLR_BKP_BLOCK(0))
!         (void) RestoreBackupBlock(lsn, record, 0, false, false);
!     else
      {
          buffer = XLogReadBuffer(xlrec->target.node, parent, false);
          if (BufferIsValid(buffer))
*************** btree_xlog_delete_page(uint8 info, XLogR
*** 813,819 ****
      }

      /* Fix left-link of right sibling */
!     if (!(record->xl_info & XLR_BKP_BLOCK_2))
      {
          buffer = XLogReadBuffer(xlrec->target.node, rightsib, false);
          if (BufferIsValid(buffer))
--- 859,867 ----
      }

      /* Fix left-link of right sibling */
!     if (record->xl_info & XLR_BKP_BLOCK(1))
!         (void) RestoreBackupBlock(lsn, record, 1, false, false);
!     else
      {
          buffer = XLogReadBuffer(xlrec->target.node, rightsib, false);
          if (BufferIsValid(buffer))
*************** btree_xlog_delete_page(uint8 info, XLogR
*** 837,843 ****
      }

      /* Fix right-link of left sibling, if any */
!     if (!(record->xl_info & XLR_BKP_BLOCK_3))
      {
          if (leftsib != P_NONE)
          {
--- 885,893 ----
      }

      /* Fix right-link of left sibling, if any */
!     if (record->xl_info & XLR_BKP_BLOCK(2))
!         (void) RestoreBackupBlock(lsn, record, 2, false, false);
!     else
      {
          if (leftsib != P_NONE)
          {
*************** btree_xlog_newroot(XLogRecPtr lsn, XLogR
*** 911,916 ****
--- 961,969 ----
      BTPageOpaque pageop;
      BlockNumber downlink = 0;

+     /* Backup blocks are not used in newroot records */
+     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
      buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
      Assert(BufferIsValid(buffer));
      page = (Page) BufferGetPage(buffer);
*************** btree_xlog_newroot(XLogRecPtr lsn, XLogR
*** 952,1018 ****
          forget_matching_split(xlrec->node, downlink, true);
  }

!
! void
! btree_redo(XLogRecPtr lsn, XLogRecord *record)
  {
!     uint8        info = record->xl_info & ~XLR_INFO_MASK;

      /*
!      * If we have any conflict processing to do, it must happen before we
!      * update the page.
       */
      if (InHotStandby)
      {
!         switch (info)
!         {
!             case XLOG_BTREE_DELETE:
!
!                 /*
!                  * Btree delete records can conflict with standby queries. You
!                  * might think that vacuum records would conflict as well, but
!                  * we've handled that already. XLOG_HEAP2_CLEANUP_INFO records
!                  * provide the highest xid cleaned by the vacuum of the heap
!                  * and so we can resolve any conflicts just once when that
!                  * arrives. After that any we know that no conflicts exist
!                  * from individual btree vacuum records on that index.
!                  */
!                 {
!                     TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
!                     xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
!
!                     ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
!                 }
!                 break;
!
!             case XLOG_BTREE_REUSE_PAGE:
!
!                 /*
!                  * Btree reuse page records exist to provide a conflict point
!                  * when we reuse pages in the index via the FSM. That's all it
!                  * does though. latestRemovedXid was the page's btpo.xact. The
!                  * btpo.xact < RecentGlobalXmin test in _bt_page_recyclable()
!                  * conceptually mirrors the pgxact->xmin > limitXmin test in
!                  * GetConflictingVirtualXIDs().  Consequently, one XID value
!                  * achieves the same exclusion effect on master and standby.
!                  */
!                 {
!                     xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);

!                     ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
!                 }
!                 return;

-             default:
-                 break;
-         }
-     }

!     /*
!      * Vacuum needs to pin and take cleanup lock on every leaf page, a regular
!      * exclusive lock is enough for all other purposes.
!      */
!     RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM));

      switch (info)
      {
--- 1005,1040 ----
          forget_matching_split(xlrec->node, downlink, true);
  }

! static void
! btree_xlog_reuse_page(XLogRecPtr lsn, XLogRecord *record)
  {
!     xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);

      /*
!      * Btree reuse_page records exist to provide a conflict point when we
!      * reuse pages in the index via the FSM.  That's all they do though.
!      *
!      * latestRemovedXid was the page's btpo.xact.  The btpo.xact <
!      * RecentGlobalXmin test in _bt_page_recyclable() conceptually mirrors the
!      * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
!      * Consequently, one XID value achieves the same exclusion effect on
!      * master and standby.
       */
      if (InHotStandby)
      {
!         ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
!                                             xlrec->node);
!     }

!     /* Backup blocks are not used in reuse_page records */
!     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
! }


! void
! btree_redo(XLogRecPtr lsn, XLogRecord *record)
! {
!     uint8        info = record->xl_info & ~XLR_INFO_MASK;

      switch (info)
      {
*************** btree_redo(XLogRecPtr lsn, XLogRecord *r
*** 1052,1058 ****
              btree_xlog_newroot(lsn, record);
              break;
          case XLOG_BTREE_REUSE_PAGE:
!             /* Handled above before restoring bkp block */
              break;
          default:
              elog(PANIC, "btree_redo: unknown op code %u", info);
--- 1074,1080 ----
              btree_xlog_newroot(lsn, record);
              break;
          case XLOG_BTREE_REUSE_PAGE:
!             btree_xlog_reuse_page(lsn, record);
              break;
          default:
              elog(PANIC, "btree_redo: unknown op code %u", info);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 573c9ad682e59d9e7b01ad7c8d9d321f15f948f3..f8ebf5758f0a9782016348dca041d92533e94494 100644
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
*************** critical section.)
*** 438,445 ****
  4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This must
  happen before the WAL record is inserted; see notes in SyncOneBuffer().)

! 5. Build a WAL log record and pass it to XLogInsert(); then update the page's
! LSN and TLI using the returned XLOG location.  For instance,

          recptr = XLogInsert(rmgr_id, info, rdata);

--- 438,446 ----
  4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This must
  happen before the WAL record is inserted; see notes in SyncOneBuffer().)

! 5. If the relation requires WAL-logging, build a WAL log record and pass it
! to XLogInsert(); then update the page's LSN and TLI using the returned XLOG
! location.  For instance,

          recptr = XLogInsert(rmgr_id, info, rdata);

*************** which buffers were handled that way ---
*** 466,474 ****
  what the XLOG record actually contains.  XLOG records that describe multi-page
  changes therefore require some care to design: you must be certain that you
  know what data is indicated by each "BKP" bit.  An example of the trickiness
! is that in a HEAP_UPDATE record, BKP(1) normally is associated with the source
! page and BKP(2) is associated with the destination page --- but if these are
! the same page, only BKP(1) would have been set.

  For this reason as well as the risk of deadlocking on buffer locks, it's best
  to design WAL records so that they reflect small atomic actions involving just
--- 467,475 ----
  what the XLOG record actually contains.  XLOG records that describe multi-page
  changes therefore require some care to design: you must be certain that you
  know what data is indicated by each "BKP" bit.  An example of the trickiness
! is that in a HEAP_UPDATE record, BKP(0) normally is associated with the source
! page and BKP(1) is associated with the destination page --- but if these are
! the same page, only BKP(0) would have been set.

  For this reason as well as the risk of deadlocking on buffer locks, it's best
  to design WAL records so that they reflect small atomic actions involving just
*************** incrementally update the page, the rdata
*** 497,508 ****
  ID at least once; otherwise there is no defense against torn-page problems.
  The standard replay-routine pattern for this case is

!     if (record->xl_info & XLR_BKP_BLOCK_n)
!         << do nothing, page was rewritten from logged copy >>;

      buffer = XLogReadBuffer(rnode, blkno, false);
      if (!BufferIsValid(buffer))
!         << do nothing, page has been deleted >>;
      page = (Page) BufferGetPage(buffer);

      if (XLByteLE(lsn, PageGetLSN(page)))
--- 498,516 ----
  ID at least once; otherwise there is no defense against torn-page problems.
  The standard replay-routine pattern for this case is

!     if (record->xl_info & XLR_BKP_BLOCK(N))
!     {
!         /* apply the change from the full-page image */
!         (void) RestoreBackupBlock(lsn, record, N, false, false);
!         return;
!     }

      buffer = XLogReadBuffer(rnode, blkno, false);
      if (!BufferIsValid(buffer))
!     {
!         /* page has been deleted, so we need do nothing */
!         return;
!     }
      page = (Page) BufferGetPage(buffer);

      if (XLByteLE(lsn, PageGetLSN(page)))
*************** The standard replay-routine pattern for
*** 520,532 ****
      UnlockReleaseBuffer(buffer);

  As noted above, for a multi-page update you need to be able to determine
! which XLR_BKP_BLOCK_n flag applies to each page.  If a WAL record reflects
  a combination of fully-rewritable and incremental updates, then the rewritable
! pages don't count for the XLR_BKP_BLOCK_n numbering.  (XLR_BKP_BLOCK_n is
! associated with the n'th distinct buffer ID seen in the "rdata" array, and
  per the above discussion, fully-rewritable buffers shouldn't be mentioned in
  "rdata".)

  Due to all these constraints, complex changes (such as a multilevel index
  insertion) normally need to be described by a series of atomic-action WAL
  records.  What do you do if the intermediate states are not self-consistent?
--- 528,569 ----
      UnlockReleaseBuffer(buffer);

  As noted above, for a multi-page update you need to be able to determine
! which XLR_BKP_BLOCK(N) flag applies to each page.  If a WAL record reflects
  a combination of fully-rewritable and incremental updates, then the rewritable
! pages don't count for the XLR_BKP_BLOCK(N) numbering.  (XLR_BKP_BLOCK(N) is
! associated with the N'th distinct buffer ID seen in the "rdata" array, and
  per the above discussion, fully-rewritable buffers shouldn't be mentioned in
  "rdata".)

+ When replaying a WAL record that describes changes on multiple pages, you
+ must be careful to lock the pages properly to prevent concurrent Hot Standby
+ queries from seeing an inconsistent state.  If this requires that two
+ or more buffer locks be held concurrently, the coding pattern shown above
+ is too simplistic, since it assumes the routine can exit as soon as it's
+ known the current page requires no modification.  Instead, you might have
+ something like
+
+     if (record->xl_info & XLR_BKP_BLOCK(0))
+     {
+         /* apply the change from the full-page image */
+         buffer0 = RestoreBackupBlock(lsn, record, 0, false, true);
+     }
+     else
+     {
+         buffer0 = XLogReadBuffer(rnode, blkno, false);
+         if (BufferIsValid(buffer0))
+         {
+             ... apply the change if not already done ...
+             MarkBufferDirty(buffer0);
+         }
+     }
+
+     ... similarly apply the changes for remaining pages ...
+
+     /* and now we can release the lock on the first page */
+     if (BufferIsValid(buffer0))
+         UnlockReleaseBuffer(buffer0);
+
  Due to all these constraints, complex changes (such as a multilevel index
  insertion) normally need to be described by a series of atomic-action WAL
  records.  What do you do if the intermediate states are not self-consistent?
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf76f6d24cb8bbf3b1b8b19e0a1c007ff08e4035..555c69c460ef8baf7be43540673132f766fb0e01 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** begin:;
*** 835,842 ****
       * At the exit of this loop, write_len includes the backup block data.
       *
       * Also set the appropriate info bits to show which buffers were backed
!      * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
!      * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
       */
      rdt_lastnormal = rdt;
      write_len = len;
--- 835,842 ----
       * At the exit of this loop, write_len includes the backup block data.
       *
       * Also set the appropriate info bits to show which buffers were backed
!      * up. The XLR_BKP_BLOCK(N) bit corresponds to the N'th distinct buffer
!      * value (ignoring InvalidBuffer) appearing in the rdata chain.
       */
      rdt_lastnormal = rdt;
      write_len = len;
*************** begin:;
*** 848,854 ****
          if (!dtbuf_bkp[i])
              continue;

!         info |= XLR_SET_BKP_BLOCK(i);

          bkpb = &(dtbuf_xlg[i]);
          page = (char *) BufferGetBlock(dtbuf[i]);
--- 848,854 ----
          if (!dtbuf_bkp[i])
              continue;

!         info |= XLR_BKP_BLOCK(i);

          bkpb = &(dtbuf_xlg[i]);
          page = (char *) BufferGetBlock(dtbuf[i]);
*************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec
*** 3155,3160 ****
--- 3155,3257 ----
  }

  /*
+  * Restore a full-page image from a backup block attached to an XLOG record.
+  *
+  * lsn: LSN of the XLOG record being replayed
+  * record: the complete XLOG record
+  * block_index: which backup block to restore (0 .. XLR_MAX_BKP_BLOCKS - 1)
+  * get_cleanup_lock: TRUE to get a cleanup rather than plain exclusive lock
+  * keep_buffer: TRUE to return the buffer still locked and pinned
+  *
+  * Returns the buffer number containing the page.  Note this is not terribly
+  * useful unless keep_buffer is specified as TRUE.
+  *
+  * Note: when a backup block is available in XLOG, we restore it
+  * unconditionally, even if the page in the database appears newer.
+  * This is to protect ourselves against database pages that were partially
+  * or incorrectly written during a crash.  We assume that the XLOG data
+  * must be good because it has passed a CRC check, while the database
+  * page might not be.  This will force us to replay all subsequent
+  * modifications of the page that appear in XLOG, rather than possibly
+  * ignoring them as already applied, but that's not a huge drawback.
+  *
+  * If 'get_cleanup_lock' is true, a cleanup lock is obtained on the buffer,
+  * else a normal exclusive lock is used.  During crash recovery, that's just
+  * pro forma because there can't be any regular backends in the system, but
+  * in hot standby mode the distinction is important.
+  *
+  * If 'keep_buffer' is true, return without releasing the buffer lock and pin;
+  * then caller is responsible for doing UnlockReleaseBuffer() later.  This
+  * is needed in some cases when replaying XLOG records that touch multiple
+  * pages, to prevent inconsistent states from being visible to other backends.
+  * (Again, that's only important in hot standby mode.)
+  */
+ Buffer
+ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
+                    bool get_cleanup_lock, bool keep_buffer)
+ {
+     Buffer        buffer;
+     Page        page;
+     BkpBlock    bkpb;
+     char       *blk;
+     int            i;
+
+     /* Locate requested BkpBlock in the record */
+     blk = (char *) XLogRecGetData(record) + record->xl_len;
+     for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+     {
+         if (!(record->xl_info & XLR_BKP_BLOCK(i)))
+             continue;
+
+         memcpy(&bkpb, blk, sizeof(BkpBlock));
+         blk += sizeof(BkpBlock);
+
+         if (i == block_index)
+         {
+             /* Found it, apply the update */
+             buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
+                                             RBM_ZERO);
+             Assert(BufferIsValid(buffer));
+             if (get_cleanup_lock)
+                 LockBufferForCleanup(buffer);
+             else
+                 LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+             page = (Page) BufferGetPage(buffer);
+
+             if (bkpb.hole_length == 0)
+             {
+                 memcpy((char *) page, blk, BLCKSZ);
+             }
+             else
+             {
+                 memcpy((char *) page, blk, bkpb.hole_offset);
+                 /* must zero-fill the hole */
+                 MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
+                 memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
+                        blk + bkpb.hole_offset,
+                        BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
+             }
+
+             PageSetLSN(page, lsn);
+             PageSetTLI(page, ThisTimeLineID);
+             MarkBufferDirty(buffer);
+
+             if (!keep_buffer)
+                 UnlockReleaseBuffer(buffer);
+
+             return buffer;
+         }
+
+         blk += BLCKSZ - bkpb.hole_length;
+     }
+
+     /* Caller specified a bogus block_index */
+     elog(ERROR, "failed to restore block_index %d", block_index);
+     return InvalidBuffer;        /* keep compiler quiet */
+ }
+
+ /*
   * CRC-check an XLOG record.  We do not believe the contents of an XLOG
   * record (other than to the minimal extent of computing the amount of
   * data to read in) until we've checked the CRCs.
*************** RecordIsValid(XLogRecord *record, XLogRe
*** 3193,3199 ****
      {
          uint32        blen;

!         if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
              continue;

          if (remaining < sizeof(BkpBlock))
--- 3290,3296 ----
      {
          uint32        blen;

!         if (!(record->xl_info & XLR_BKP_BLOCK(i)))
              continue;

          if (remaining < sizeof(BkpBlock))
*************** xlog_outrec(StringInfo buf, XLogRecord *
*** 8081,8087 ****
      int            i;

      appendStringInfo(buf, "prev %X/%X; xid %u",
!                      (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
                       record->xl_xid);

      appendStringInfo(buf, "; len %u",
--- 8178,8185 ----
      int            i;

      appendStringInfo(buf, "prev %X/%X; xid %u",
!                      (uint32) (record->xl_prev >> 32),
!                      (uint32) record->xl_prev,
                       record->xl_xid);

      appendStringInfo(buf, "; len %u",
*************** xlog_outrec(StringInfo buf, XLogRecord *
*** 8089,8096 ****

      for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
      {
!         if (record->xl_info & XLR_SET_BKP_BLOCK(i))
!             appendStringInfo(buf, "; bkpb%d", i + 1);
      }

      appendStringInfo(buf, ": %s", RmgrTable[record->xl_rmid].rm_name);
--- 8187,8194 ----

      for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
      {
!         if (record->xl_info & XLR_BKP_BLOCK(i))
!             appendStringInfo(buf, "; bkpb%d", i);
      }

      appendStringInfo(buf, ": %s", RmgrTable[record->xl_rmid].rm_name);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2893f3b35243effb3890630dde423daddb51eaea..4aedb25dfc31ffbc307765955737a43bc5054bc0 100644
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
*************** typedef struct XLogRecord
*** 71,76 ****
--- 71,78 ----
   */
  #define XLR_BKP_BLOCK_MASK        0x0F    /* all info bits used for bkp blocks */
  #define XLR_MAX_BKP_BLOCKS        4
+ #define XLR_BKP_BLOCK(iblk)        (0x08 >> (iblk))        /* iblk in 0..3 */
+ /* XXX these macros will go away: */
  #define XLR_SET_BKP_BLOCK(iblk) (0x08 >> (iblk))
  #define XLR_BKP_BLOCK_1            XLR_SET_BKP_BLOCK(0)    /* 0x08 */
  #define XLR_BKP_BLOCK_2            XLR_SET_BKP_BLOCK(1)    /* 0x04 */
*************** extern int    sync_method;
*** 94,106 ****
   * If buffer is valid then XLOG will check if buffer must be backed up
   * (ie, whether this is first change of that page since last checkpoint).
   * If so, the whole page contents are attached to the XLOG record, and XLOG
!  * sets XLR_BKP_BLOCK_X bit in xl_info.  Note that the buffer must be pinned
   * and exclusive-locked by the caller, so that it won't change under us.
   * NB: when the buffer is backed up, we DO NOT insert the data pointed to by
   * this XLogRecData struct into the XLOG record, since we assume it's present
   * in the buffer.  Therefore, rmgr redo routines MUST pay attention to
!  * XLR_BKP_BLOCK_X to know what is actually stored in the XLOG record.
!  * The i'th XLR_BKP_BLOCK bit corresponds to the i'th distinct buffer
   * value (ignoring InvalidBuffer) appearing in the rdata chain.
   *
   * When buffer is valid, caller must set buffer_std to indicate whether the
--- 96,108 ----
   * If buffer is valid then XLOG will check if buffer must be backed up
   * (ie, whether this is first change of that page since last checkpoint).
   * If so, the whole page contents are attached to the XLOG record, and XLOG
!  * sets XLR_BKP_BLOCK(N) bit in xl_info.  Note that the buffer must be pinned
   * and exclusive-locked by the caller, so that it won't change under us.
   * NB: when the buffer is backed up, we DO NOT insert the data pointed to by
   * this XLogRecData struct into the XLOG record, since we assume it's present
   * in the buffer.  Therefore, rmgr redo routines MUST pay attention to
!  * XLR_BKP_BLOCK(N) to know what is actually stored in the XLOG record.
!  * The N'th XLR_BKP_BLOCK bit corresponds to the N'th distinct buffer
   * value (ignoring InvalidBuffer) appearing in the rdata chain.
   *
   * When buffer is valid, caller must set buffer_std to indicate whether the
*************** extern int    XLogFileOpen(XLogSegNo segno)
*** 274,279 ****
--- 276,285 ----
  extern void XLogGetLastRemoved(XLogSegNo *segno);
  extern void XLogSetAsyncXactLSN(XLogRecPtr record);

+ extern Buffer RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record,
+                    int block_index,
+                    bool get_cleanup_lock, bool keep_buffer);
+ /* XXX this will go away: */
  extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);

  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
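
For clarity, here is a sketch of the calling convention the new function
is meant to support (illustrative only: the routine name and record
layout below are invented for this sketch, not taken from the patch):

/*
 * Hypothetical redo routine for a record type touching two pages.
 * The record format is made up; the point is how RestoreBackupBlock()
 * is called and when the buffer lock gets released.
 */
#include "access/xlog.h"
#include "storage/bufmgr.h"

static void
example_xlog_redo(XLogRecPtr lsn, XLogRecord *record)
{
    Buffer      buf0 = InvalidBuffer;

    if (record->xl_info & XLR_BKP_BLOCK(0))
    {
        /*
         * A full-page image covers the first page, so restoring it
         * applies that page's whole update.  Because the second page
         * must appear to change atomically with it, ask for the buffer
         * back still locked (keep_buffer = true) and hold it until the
         * second page is done; hot-standby queries then never see the
         * intermediate state.
         */
        buf0 = RestoreBackupBlock(lsn, record, 0,
                                  false,    /* plain exclusive lock */
                                  true);    /* keep locked and pinned */
    }
    else
    {
        /* no backup block: read, lock, and update the page normally */
    }

    /* ... replay the change to the second page here ... */

    if (BufferIsValid(buf0))
        UnlockReleaseBuffer(buf0);
}

Note how the per-record routine, not a generic RestoreBkpBlocks pass,
now decides both the lock strength and the point of release; that is
the whole point of the new API.
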

Re: Inadequate thought about buffer locking during hot standby replay

From
Tom Lane
Date:
While I'm looking at this: there seems to be a bug/inconsistency in
heap_xlog_freeze().  It uses a cleanup lock in the "normal" code path,
but it doesn't tell RestoreBkpBlocks to use a cleanup lock.  Which
choice is correct?
        regards, tom lane



Re: Inadequate thought about buffer locking during hot standby replay

From
Simon Riggs
Date:
On 10 November 2012 21:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 10 November 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Here's a WIP patch that attacks it in this way.
>
>> Looks fine, but we need a top-level comment in btree code explaining
>> why/how it follows the very well explained rules in the README.
>
> Not sure what you'd want it to say exactly?  Really these issues are per
> WAL-replay function, not global.  I've now been through all of the btree
> functions, and AFAICS btree_xlog_split is the only one of them that
> actually has a bug; the others can be a bit lax about the order in which
> things get done.
>
>>> One thing that seems a bit annoying is the use of zero-based backup
>>> block indexes in RestoreBackupBlock, when most (not all) of the callers
>>> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
>>> That's a bug waiting to happen.  We could address it by changing
>>> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
>>> of doing instead is getting rid of the one-based convention entirely;
>>> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
>>> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
>>> advantage of doing that is that it'll force inspection of all code
>>> related to this.
>
>> I wouldn't do that in a back branch, but I can see why it's a good idea.
>
> If any of this stuff were getting used by external modules, changing it
> would be problematic ... but fortunately, it isn't, because we lack
> support for plug-in rmgrs.  So I'm not worried about back-patching the
> change, and would prefer to keep the 9.x branches in sync.
>
> Attached is an updated patch in which nbtxlog.c has been fully converted,
> but I've not touched other rmgrs yet.  I've tested this to the extent of
> having a replication slave follow a master that's running the regression
> tests, and it seems to work.  It's difficult to prove much by testing
> about concurrent behavior, of course.
>
> I found that there's another benefit of managing backup-block replay
> this way, which is that we can de-contort the handling of conflict
> resolution by moving it into the per-record-type subroutines, instead of
> needing an extra switch before the RestoreBkpBlocks call.  So that gives
> me more confidence that this is a good way to go.

It's a good way to go; I'm just glad you're handling it for the back
branches ;-)

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inadequate thought about buffer locking during hot standby replay

From
Tom Lane
Date:
Attached is a complete draft patch against HEAD for this issue.  I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some
bug of this type.  In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.)  Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes.  There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s).  If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action.  This needs to be
looked at closer by somebody who's more up on the GIST code than I am.
The attached patch doesn't try very hard to make the GIST functions
safe, it just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already.  This is an index corruption bug, not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's "merge lists" case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed.  As is, this patch uses a cleanup
lock, but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
anything that emits XLOG_GIST_PAGE_DELETE.
        regards, tom lane

Attachment



Re: Inadequate thought about buffer locking during hot standby replay

From
Andres Freund
Date:
On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> >> One thing that seems a bit annoying is the use of zero-based backup
> >> block indexes in RestoreBackupBlock, when most (not all) of the callers
> >> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
> >> That's a bug waiting to happen.  We could address it by changing
> >> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
> >> of doing instead is getting rid of the one-based convention entirely;
> >> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
> >> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
> >> advantage of doing that is that it'll force inspection of all code
> >> related to this.
>
> > I wouldn't do that in a back branch, but I can see why it's a good idea.
> 
> If any of this stuff were getting used by external modules, changing it
> would be problematic ... but fortunately, it isn't, because we lack
> support for plug-in rmgrs.  So I'm not worried about back-patching the
> change, and would prefer to keep the 9.x branches in sync.

XLR_BKP_BLOCK_* might be used by things like pg_lesslog and it's surely
used by xlogdump. Not sure if either is worth that much attention, but
it seems worth noting that such a change will break stuff.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inadequate thought about buffer locking during hot standby replay

From
Heikki Linnakangas
Date:
On 12.11.2012 01:25, Tom Lane wrote:
> Worse than that, GIST WAL replay seems to be a total disaster from this
> standpoint, and I'm not convinced that it's even fixable without
> incompatible WAL changes.  There are several cases where index
> insertion operations generate more than one WAL record while holding
> exclusive lock on buffer(s).  If the lock continuity is actually
> necessary to prevent concurrent queries from seeing inconsistent index
> states, then we have a big problem, because WAL replay isn't designed to
> hold any buffer lock for more than one WAL action.

Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), 
the GiST index is always in a consistent state. Until the downlink for 
a newly split page has been inserted, its left sibling is flagged so 
that a search knows to follow the right-link to find it. In normal 
operation, the lock continuity is needed to prevent another backend from 
seeing the incomplete split and trying to fix it, but in hot standby, we 
never try to fix incomplete splits anyway.
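
In rough code terms, the search-side rule is simply this (a conceptual 
sketch with invented stand-in names, not the actual gistget.c code):

/*
 * Conceptual sketch of the >= 9.1 search behaviour described above.
 * SketchPage, SKETCH_FOLLOW_RIGHT and sketch_enqueue_page() are
 * stand-ins invented for illustration.
 */
typedef struct SketchPage
{
    uint16      flags;      /* SKETCH_FOLLOW_RIGHT set at split time */
    uint32      rightlink;  /* right sibling page, if any */
} SketchPage;

#define SKETCH_FOLLOW_RIGHT 0x0001

extern void sketch_enqueue_page(uint32 blkno);  /* hypothetical */

static void
sketch_scan_page(SketchPage *page)
{
    /*
     * If this page was split but the new sibling's downlink is not in
     * the parent yet, the flag tells us tuples may have moved right.
     * Queue the right sibling so the scan cannot miss them.
     */
    if (page->flags & SKETCH_FOLLOW_RIGHT)
        sketch_enqueue_page(page->rightlink);

    /* ... then examine the tuples on this page as usual ... */
}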

So I think we're good on >= 9.1. The 9.0 code is broken, however. In 
9.0, when a child page is split, the parent and new children are kept 
locked until the downlinks are inserted/updated. If a concurrent scan 
comes along and sees that incomplete state, it will miss tuples on the 
new right siblings. We rely on a rm_cleanup operation at the end of WAL 
replay to fix that situation, if the downlink insertion record is not 
there. I don't see any easy way to fix that, unfortunately. Perhaps we 
could backpatch the 9.1 rewrite, now that it's gotten some real-world 
testing, but it was a big change so I don't feel very comfortable doing 
that.

> Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
> anything that emits XLOG_GIST_PAGE_DELETE.

Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was 
removed.

- Heikki



Re: Inadequate thought about buffer locking during hot standby replay

From
Robert Haas
Date:
On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I already pointed out the inconsistency in heap_xlog_freeze about
> whether a cleanup lock is needed.  As is, this patch uses a cleanup
> lock, but I suspect that a regular lock is sufficient --- comments?

Well, according to storage/buffer/README:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock.  To examine the commit status (XIDs and status bits)
of a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

That does indeed make it sound like an x-lock is enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Inadequate thought about buffer locking during hot standby replay

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I already pointed out the inconsistency in heap_xlog_freeze about
>> whether a cleanup lock is needed.  As is, this patch uses a cleanup
>> lock, but I suspect that a regular lock is sufficient --- comments?

> Well, according to storage/buffer/README:

> 1. To scan a page for tuples, one must hold a pin and either shared or
> exclusive content lock.  To examine the commit status (XIDs and status bits)
> of a tuple in a shared buffer, one must likewise hold a pin and either shared
> or exclusive lock.

> That does indeed make it sound like an x-lock is enough.

Yeah.  AFAICS, the only possible downside is that somebody might be
using the tuple (while holding just a buffer pin), and that its xmin
might change while that's happening.  So for instance a nestloop join
might read out different xmin values for the same row while the join
proceeds.  But that could happen anyway if a different plan had been
chosen (viz, this table on the inside not the outside of the nestloop).
The xmin update ought to be physically atomic, so you shouldn't be able
to get a garbage result, just the old value or the new.

The cleanup lock is needed for cases where we'd like to remove or move a
tuple, but that's not required here.
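
In terms of the new API, that conclusion would make the backup-block
path of freeze replay look roughly like this (a sketch under that
assumption, not the committed code):

    /* in heap_xlog_freeze(): if a full-page image is present,
     * restoring it is the entire replay action.  A plain exclusive
     * lock (get_cleanup_lock = false) suffices, since freezing only
     * rewrites xmin/infomask bits and never removes or moves tuples. */
    if (record->xl_info & XLR_BKP_BLOCK(0))
    {
        (void) RestoreBackupBlock(lsn, record, 0,
                                  false,    /* exclusive, not cleanup */
                                  false);   /* release immediately */
        return;
    }
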
        regards, tom lane



Re: Inadequate thought about buffer locking during hot standby replay

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 12.11.2012 01:25, Tom Lane wrote:
>> Worse than that, GIST WAL replay seems to be a total disaster from this
>> standpoint, and I'm not convinced that it's even fixable without
>> incompatible WAL changes.

> Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), 
> the GiST index is always in a consistent state. Before the downlink for 
> a newly split page has been inserted yet, its left sibling is flagged so 
> that a search knows to follow the right-link to find it. In normal 
> operation, the lock continuity is needed to prevent another backend from 
> seeing the incomplete split and trying to fix it, but in hot standby, we 
> never try to fix incomplete splits anyway.

> So I think we're good on >= 9.1.

Okay.  I'll update the patch to make sure that the individual WAL replay
functions hold all locks, but not worry about holding locks across
actions.

> The 9.0 code is broken, however. In 
> 9.0, when a child page is split, the parent and new children are kept 
> locked until the downlinks are inserted/updated. If a concurrent scan 
> comes along and sees that incomplete state, it will miss tuples on the 
> new right siblings. We rely on a rm_cleanup operation at the end of WAL 
> replay to fix that situation, if the downlink insertion record is not 
> there. I don't see any easy way to fix that, unfortunately. Perhaps we 
> could backpatch the 9.1 rewrite, now that it's gotten some real-world 
> testing, but it was a big change so I don't feel very comfortable doing 
> that.

Me either.  Given the lack of field complaints, I think we're better
advised to just leave it unfixed in 9.0.  It'd not be a step forward
if we broke something trying to make this work.

>> Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
>> anything that emits XLOG_GIST_PAGE_DELETE.

> Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was 
> removed.

Okay.  I see we bumped XLOG_PAGE_MAGIC in 9.0, so there's no longer
any way that 9.0 or later versions could see this WAL record type.
I'll delete that replay function rather than bothering to fix it.
        regards, tom lane



Re: Inadequate thought about buffer locking during hot standby replay

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
>> If any of this stuff were getting used by external modules, changing it
>> would be problematic ... but fortunately, it isn't, because we lack
>> support for plug-in rmgrs.  So I'm not worried about back-patching the
>> change, and would prefer to keep the 9.x branches in sync.

> XLR_BKP_BLOCK_* might be used by things like pg_lesslog and it's surely
> used by xlogdump. Not sure if either is worth that much attention, but
> it seems worth noting that such a change will break stuff.

Hm.  Okay, how about we leave the old macros in place in the back
branches?
        regards, tom lane



Re: Inadequate thought about buffer locking during hot standby replay

From
Andres Freund
Date:
On 2012-11-12 10:19:09 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
> >> If any of this stuff were getting used by external modules, changing it
> >> would be problematic ... but fortunately, it isn't, because we lack
> >> support for plug-in rmgrs.  So I'm not worried about back-patching the
> >> change, and would prefer to keep the 9.x branches in sync.
>
> > XLR_BKP_BLOCK_* might be used by things like pg_lesslog and it's surely
> > used by xlogdump. Not sure if either is worth that much attention, but
> > it seems worth noting that such a change will break stuff.
>
> Hm.  Okay, how about we leave the old macros in place in the back
> branches?

Sounds good to me. The RestoreBkpBlocks change seems unproblematic to
me. If anything it's good that it has been renamed.

Thanks,

Andres



Re: Inadequate thought about buffer locking during hot standby replay

From
Simon Riggs
Date:
On 11 November 2012 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Practically all WAL record types that touch multiple pages have some
> bug of this type.  In addition to btree_xlog_split, I found that
> heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
> spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
> locks as required to make their updates safe for concurrent queries.
> (I'm not totally sure about ginRedoDeletePage, but the original action
> definitely locks the pages simultaneously, and it's not clear that it's
> safe not to.)  Most of these are okay in cases without any full-page
> images, but could fail if the wrong subset of the pages-to-be-touched
> were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Hmm, not good. Thanks for spotting.

Do these changes do anything to actions that occur across multiple
records? I assume not and think those are OK, agreed?

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inadequate thought about buffer locking during hot standby replay

From
Simon Riggs
Date:
On 12 November 2012 14:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> The 9.0 code is broken, however. In
>> 9.0, when a child page is split, the parent and new children are kept
>> locked until the downlinks are inserted/updated. If a concurrent scan
>> comes along and sees that incomplete state, it will miss tuples on the
>> new right siblings. We rely on a rm_cleanup operation at the end of WAL
>> replay to fix that situation, if the downlink insertion record is not
>> there. I don't see any easy way to fix that, unfortunately. Perhaps we
>> could backpatch the 9.1 rewrite, now that it's gotten some real-world
>> testing, but it was a big change so I don't feel very comfortable doing
>> that.
>
> Me either.  Given the lack of field complaints, I think we're better
> advised to just leave it unfixed in 9.0.  It'd not be a step forward
> if we broke something trying to make this work.

Agreed. Most people running 9.0 with GIST indexes have already upgraded.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Inadequate thought about buffer locking during hot standby replay

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 11 November 2012 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Practically all WAL record types that touch multiple pages have some
>> bug of this type.  In addition to btree_xlog_split, I found that
>> heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
>> spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
>> locks as required to make their updates safe for concurrent queries.
>> (I'm not totally sure about ginRedoDeletePage, but the original action
>> definitely locks the pages simultaneously, and it's not clear that it's
>> safe not to.)  Most of these are okay in cases without any full-page
>> images, but could fail if the wrong subset of the pages-to-be-touched
>> were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

> Hmm, not good. Thanks for spotting.

> Do these changes do anything to actions that occur across multiple
> records? I assume not and think those are OK, agreed?

Right, we were and still are assuming that any state that exists between
WAL records is consistent and safe to expose to hot-standby queries.
The important thing here is that these WAL replay functions were failing
to ensure that their own actions appear atomic to onlookers.  This is
basically a hangover from pre-hot-standby coding conventions, when no such
concern existed.
        regards, tom lane



Re: Inadequate thought about buffer locking during hot standby replay

From
Tom Lane
Date:
Here's an updated patch that fixes the GIST replay functions as well as
the other minor issues that were mentioned.  Barring objections, I'll
set about back-patching this as far as 9.0.

One thing that could use verification is my fix for
gistRedoPageSplitRecord.  AFAICS, the first page listed in the WAL
record is always the "original" page, and the ones following it are
pages that were split off from it, and can (as yet) only be reached by
following right-links from the "original" page.  As such, it should be
okay to release locks on the non-first pages as soon as we've written
them.  We have to hold lock on the original page though to avoid letting
readers follow dangling right-links.  Also, the update of
NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically
with all this, so that has to be done before releasing the original-page
lock as well.  Does that sound right?
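
In outline, the ordering that implements this is roughly as follows
(a sketch; restore_split_page() is a hypothetical helper standing in
for the read-page-and-apply logic, and the flag/NSN fixup is elided):

static Buffer restore_split_page(XLogRecPtr lsn, XLogRecord *record,
                                 int pageno);   /* hypothetical */

static void
gist_redo_split_sketch(XLogRecPtr lsn, XLogRecord *record,
                       int npage, Buffer childbuf)
{
    Buffer      origbuf;
    int         i;

    /*
     * Split-off pages first: they are reachable only via right-links
     * from the "original" page, so each lock can be dropped as soon
     * as the page has been written.
     */
    for (i = 1; i < npage; i++)
        UnlockReleaseBuffer(restore_split_page(lsn, record, i));

    /* the "original" page: write it, but keep it locked ... */
    origbuf = restore_split_page(lsn, record, 0);

    /*
     * ... so that fixing up the child page (clearing its follow-right
     * flag and setting its NSN) is atomic with the split becoming
     * visible to readers.
     */
    if (BufferIsValid(childbuf))
    {
        /* clear follow-right / set NSN on the child page here */
        MarkBufferDirty(childbuf);
        UnlockReleaseBuffer(childbuf);
    }

    /* only now may readers follow the original page's right-links */
    UnlockReleaseBuffer(origbuf);
}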

            regards, tom lane


Attachment

Re: Inadequate thought about buffer locking during hot standby replay

From
Heikki Linnakangas
Date:
On 12.11.2012 22:53, Tom Lane wrote:
> Here's an updated patch that fixes the GIST replay functions as well as
> the other minor issues that were mentioned.  Barring objections, I'll
> set about back-patching this as far as 9.0.

Ok. It won't help all that much on 9.0, though.

> One thing that could use verification is my fix for
> gistRedoPageSplitRecord.  AFAICS, the first page listed in the WAL
> record is always the "original" page, and the ones following it are
> pages that were split off from it, and can (as yet) only be reached by
> following right-links from the "original" page.  As such, it should be
> okay to release locks on the non-first pages as soon as we've written
> them.  We have to hold lock on the original page though to avoid letting
> readers follow dangling right-links.  Also, the update of
> NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically
> with all this, so that has to be done before releasing the original-page
> lock as well.  Does that sound right?

Yep.

- Heikki



Re: Inadequate thought about buffer locking during hot standby replay

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 12.11.2012 22:53, Tom Lane wrote:
>> Here's an updated patch that fixes the GIST replay functions as well as
>> the other minor issues that were mentioned.  Barring objections, I'll
>> set about back-patching this as far as 9.0.

> Ok. It won't help all that much on 9.0, though.

Well, it won't help GIST much, but the actually-reported-from-the-field
case is in btree, and it does fix that.

It occurs to me that if we're sufficiently scared of this case, we could
probably hack the planner (in 9.0 only) to refuse to use GIST indexes
in hot-standby queries.  That cure might be worse than the disease though.
        regards, tom lane



Re: Inadequate thought about buffer locking during hot standby replay

From
Merlin Moncure
Date:
On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ok. It won't help all that much on 9.0, though.
>
> Well, it won't help GIST much, but the actually-reported-from-the-field
> case is in btree, and it does fix that.
>
> It occurs to me that if we're sufficiently scared of this case, we could
> probably hack the planner (in 9.0 only) to refuse to use GIST indexes
> in hot-standby queries.  That cure might be worse than the disease though.

If anything, it should be documented.  If you do this kind of thing,
people will stop installing bugfix releases.

merlin



Re: Inadequate thought about buffer locking during hot standby replay

From
Gavin Flower
Date:
On 14/11/12 04:32, Merlin Moncure wrote:
> On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Ok. It won't help all that much on 9.0, though.
>> Well, it won't help GIST much, but the actually-reported-from-the-field
>> case is in btree, and it does fix that.
>>
>> It occurs to me that if we're sufficiently scared of this case, we could
>> probably hack the planner (in 9.0 only) to refuse to use GIST indexes
>> in hot-standby queries.  That cure might be worse than the disease though.
> If anything, it should be documented.  If you do this kind of thing,
> people will stop installing bugfix releases.
>
> merlin
>
>
How about displaying a warning when people try to use the 'feature', as 
well as documenting it?

Cheers,
Gavin




Re: Inadequate thought about buffer locking during hot standby replay

From
Robert Haas
Date:
On Tue, Nov 13, 2012 at 10:32 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Ok. It won't help all that much on 9.0, though.
>>
>> Well, it won't help GIST much, but the actually-reported-from-the-field
>> case is in btree, and it does fix that.
>>
>> It occurs to me that if we're sufficiently scared of this case, we could
>> probably hack the planner (in 9.0 only) to refuse to use GIST indexes
>> in hot-standby queries.  That cure might be worse than the disease though.
>
> If anything, it should be documented.  If you do this kind of thing,
> people will stop installing bugfix releases.

Agreed.  I think doing that in a back-branch release would be
extremely user-hostile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company