Thread: Inadequate thought about buffer locking during hot standby replay
During normal running, operations such as btree page splits are extremely careful about the order in which they acquire and release buffer locks, if they're doing something that concurrently modifies multiple pages.

During WAL replay, that all goes out the window. Even if an individual WAL-record replay function does things in the right order for "standard" cases, RestoreBkpBlocks has no idea what it's doing. So if one or more of the referenced pages gets treated as a full-page image, we are left with no guarantee whatsoever about what order the pages are restored in. That never mattered when the code was originally designed, but it sure matters during Hot Standby when other queries might be able to see the intermediate states.

I can't prove that this is the cause of bug #7648, but it's fairly easy to see that it could explain the symptom. You only need to assume that the page-being-split had been handled as a full-page image, and that the new right-hand page had gotten allocated by extending the relation. Then there will be an interval just after RestoreBkpBlocks does its thing where the updated left-hand sibling is in the index and is not locked in any way, but its right-link points off the end of the index. If a few indexscans come along before the replay process gets to continue, you'd get exactly the reported errors.

I'm inclined to think that we need to fix this by getting rid of RestoreBkpBlocks per se, and instead having the per-WAL-record restore routines dictate when each full-page image is restored (and whether or not to release the buffer lock immediately). That's not going to be a small change unfortunately :-(

regards, tom lane
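For readers following along, the sequence being described looks roughly like this in the pre-patch code (a condensed sketch of the replay dispatch that appears in full in the diffs later in this thread; the conflict-resolution switch and other details are elided). The comments mark where a hot-standby query can observe the half-applied split:

    /*
     * Condensed sketch of the PRE-patch replay flow -- not the literal
     * source.  It only shows where the dangerous window opens.
     */
    void
    btree_redo(XLogRecPtr lsn, XLogRecord *record)
    {
        uint8   info = record->xl_info & ~XLR_INFO_MASK;

        /*
         * Step 1: every backup block attached to the record is restored
         * here.  Each page is exclusive-locked, overwritten from the
         * full-page image, marked dirty, and then UNLOCKED again, one page
         * at a time, before any record-specific code runs.
         */
        RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM));

        /*
         * <-- Window: if the left sibling of a split was restored from a
         * full-page image above, it is now visible to hot-standby index
         * scans with no lock held, and its right-link may point to a block
         * that the record-specific code below has not created yet.
         */

        switch (info)
        {
            case XLOG_BTREE_SPLIT_L:
                /* Step 2: non-FPI pages (e.g. the brand-new right sibling)
                 * are only initialized here, after the window has opened. */
                btree_xlog_split(true, false, lsn, record);
                break;
            /* ... other record types ... */
        }
    }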
On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
> I can't prove that this is the cause of bug #7648, but it's fairly easy
> to see that it could explain the symptom. You only need to assume that
> the page-being-split had been handled as a full-page image, and that the
> new right-hand page had gotten allocated by extending the relation.
> Then there will be an interval just after RestoreBkpBlocks does its
> thing where the updated left-hand sibling is in the index and is not
> locked in any way, but its right-link points off the end of the index.
> If a few indexscans come along before the replay process gets to
> continue, you'd get exactly the reported errors.

Sounds plausible.

> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately). That's not going to be a
> small change unfortunately :-(

I wonder if we couldn't instead "fix" it by ensuring the backup blocks are put into the WAL record in the right order at the inserting location. That would "just" need some care about the order of XLogRecData blocks.

I am pretty unfamiliar with the nbtree locking, but I seem to remember that we should be fine if we always restore from left to right?

Greetings,

Andres Freund
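The ordering Andres has in mind is the order of buffer references in the XLogRecData chain handed to XLogInsert() at the original insertion site; as the xlog.h comments quoted later in the thread explain, the N'th backup-block bit corresponds to the N'th distinct buffer appearing in that chain. A hypothetical sketch (not the actual _bt_split() code) of registering the left sibling ahead of its old right neighbour:

    /*
     * Hypothetical sketch only -- the real _bt_split() rdata chain carries
     * more pieces.  The point: backup-block numbering follows the order of
     * distinct buffers in the chain, so the left sibling registered first
     * would become backup block 0 and the old right neighbour backup
     * block 1.  The brand-new right page is rebuilt from record data and
     * is not registered at all (fully-rewritable buffers shouldn't appear
     * in "rdata", per the README).
     */
    Buffer          lbuf;       /* left (original) sibling; assumed locked */
    Buffer          rnextbuf;   /* old right neighbour; assumed locked */
    xl_btree_split  xlrec;      /* assumed filled in elsewhere */
    XLogRecData     rdata[3];
    XLogRecPtr      recptr;

    rdata[0].data = (char *) &xlrec;
    rdata[0].len = SizeOfBtreeSplit;
    rdata[0].buffer = InvalidBuffer;        /* always present in the record */
    rdata[0].next = &rdata[1];

    rdata[1].data = NULL;
    rdata[1].len = 0;
    rdata[1].buffer = lbuf;                 /* candidate for backup block 0 */
    rdata[1].buffer_std = true;
    rdata[1].next = &rdata[2];

    rdata[2].data = NULL;
    rdata[2].len = 0;
    rdata[2].buffer = rnextbuf;             /* candidate for backup block 1 */
    rdata[2].buffer_std = true;
    rdata[2].next = NULL;

    recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_SPLIT_L, rdata);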
On Fri, Nov 9, 2012 at 3:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> During normal running, operations such as btree page splits are
> extremely careful about the order in which they acquire and release
> buffer locks, if they're doing something that concurrently modifies
> multiple pages.
>
> During WAL replay, that all goes out the window. Even if an individual
> WAL-record replay function does things in the right order for "standard"
> cases, RestoreBkpBlocks has no idea what it's doing. So if one or more
> of the referenced pages gets treated as a full-page image, we are left
> with no guarantee whatsoever about what order the pages are restored in.
> That never mattered when the code was originally designed, but it sure
> matters during Hot Standby when other queries might be able to see the
> intermediate states.
>
> I can't prove that this is the cause of bug #7648,

(I was the reporter of 7648)

To lend slightly more circumstantial evidence in support of this, I also happened to note that the relfile in question was the last segment and it was about a quarter full, so the access attempt was definitely at the extreme outermost edge of the index most generally.

--
fdr
Andres Freund <andres@anarazel.de> writes:
> On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately). That's not going to be a
>> small change unfortunately :-(

> I wonder if we couldn't instead "fix" it by ensuring the backup blocks
> are put into the WAL record in the right order at the inserting
> location. That would "just" need some care about the order of
> XLogRecData blocks.

I don't think that's a good way to go. In the first place, if we did that the fix would require incompatible changes in the contents of WAL streams. In the second place, there are already severe constraints on the positioning of backup blocks to ensure that WAL records can be uniquely decoded (the section of access/transam/README about WAL coding touches on this) --- I don't think it's a good plan to add still more constraints there.

And in the third place, the specific problem we're positing here results from a failure to hold the buffer lock for a full-page image until after we're done restoring a *non* full-page image represented elsewhere in the same WAL record. In general, of the set of pages touched by a WAL record, any arbitrary subset of them might be converted to FPIs during XLogInsert; but the replay-time locking requirements are going to be the same regardless of that. So AFAICS, any design in which RestoreBkpBlocks acts independently of the non-full-page-image updates is just broken.

regards, tom lane
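The point that any subset of a record's pages may arrive as full-page images can be made concrete with the existing backup-block bit layout (the XLR_SET_BKP_BLOCK macro shown in the xlog.h hunks later in the thread). This small standalone program, not PostgreSQL source, just enumerates the four possible xl_info combinations for a record that touches two buffers:

    /*
     * Standalone illustration: XLogInsert decides independently, per
     * buffer, whether to attach a full-page image, so replay must cope
     * with every combination of backup-block bits.
     */
    #include <stdio.h>

    #define XLR_SET_BKP_BLOCK(iblk)   (0x08 >> (iblk))

    int
    main(void)
    {
        for (int subset = 0; subset < 4; subset++)
        {
            unsigned char info = 0;

            if (subset & 1)
                info |= XLR_SET_BKP_BLOCK(0);   /* e.g. the left sibling */
            if (subset & 2)
                info |= XLR_SET_BKP_BLOCK(1);   /* e.g. its right neighbour */

            printf("page0 FPI=%d  page1 FPI=%d  ->  xl_info bkp bits = 0x%02x\n",
                   subset & 1, (subset >> 1) & 1, info);
        }
        return 0;
    }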
On 9 November 2012 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> During normal running, operations such as btree page splits are
> extremely careful about the order in which they acquire and release
> buffer locks, if they're doing something that concurrently modifies
> multiple pages.
>
> During WAL replay, that all goes out the window. Even if an individual
> WAL-record replay function does things in the right order for "standard"
> cases, RestoreBkpBlocks has no idea what it's doing. So if one or more
> of the referenced pages gets treated as a full-page image, we are left
> with no guarantee whatsoever about what order the pages are restored in.
> That never mattered when the code was originally designed, but it sure
> matters during Hot Standby when other queries might be able to see the
> intermediate states.
>
> I can't prove that this is the cause of bug #7648, but it's fairly easy
> to see that it could explain the symptom. You only need to assume that
> the page-being-split had been handled as a full-page image, and that the
> new right-hand page had gotten allocated by extending the relation.
> Then there will be an interval just after RestoreBkpBlocks does its
> thing where the updated left-hand sibling is in the index and is not
> locked in any way, but its right-link points off the end of the index.
> If a few indexscans come along before the replay process gets to
> continue, you'd get exactly the reported errors.
>
> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately). That's not going to be a
> small change unfortunately :-(

No, but it looks like a clear bug scenario and a clear resolution also. I'll start looking at it.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 9 November 2012 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately). That's not going to be a
>> small change unfortunately :-(

> No, but it looks like a clear bug scenario and a clear resolution also.
> I'll start looking at it.

I'm on it already.

regards, tom lane
I wrote: > I'm inclined to think that we need to fix this by getting rid of > RestoreBkpBlocks per se, and instead having the per-WAL-record restore > routines dictate when each full-page image is restored (and whether or > not to release the buffer lock immediately). That's not going to be a > small change unfortunately :-( Here's a WIP patch that attacks it in this way. I've only gone as far as fixing btree_xlog_split() for the moment, since the main point is to see whether the coding change seems reasonable. At least in this function, it seems that locking considerations are handled correctly as long as no full-page image is used, so it's pretty straightforward to tweak it to handle the case with full-page image(s) as well. I've not looked through any other replay functions yet --- some of them may need more invasive hacking. But so far this looks pretty good. Note that this patch isn't correct yet even for btree_xlog_split(), because I've not removed the RestoreBkpBlocks() call in btree_redo(). All the btree replay routines will have to get fixed before it's testable at all. One thing that seems a bit annoying is the use of zero-based backup block indexes in RestoreBackupBlock, when most (not all) of the callers are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc). That's a bug waiting to happen. We could address it by changing RestoreBackupBlock to accept a one-based argument, but what I'm thinking of doing instead is getting rid of the one-based convention entirely; that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One advantage of doing that is that it'll force inspection of all code related to this. Comments, opinions? regards, tom lane diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..ee7e15fddcfca7dfe7ccba6ab021397937b1a96d 100644 *** a/src/backend/access/nbtree/nbtxlog.c --- b/src/backend/access/nbtree/nbtxlog.c *************** btree_xlog_split(bool onleft, bool isroo *** 323,329 **** datalen -= newitemsz; } ! /* Reconstruct right (new) sibling from scratch */ rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true); Assert(BufferIsValid(rbuf)); rpage = (Page) BufferGetPage(rbuf); --- 323,329 ---- datalen -= newitemsz; } ! /* Reconstruct right (new) sibling page from scratch */ rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true); Assert(BufferIsValid(rbuf)); rpage = (Page) BufferGetPage(rbuf); *************** btree_xlog_split(bool onleft, bool isroo *** 357,374 **** /* don't release the buffer yet; we touch right page's first item below */ ! /* ! * Reconstruct left (original) sibling if needed. Note that this code ! * ensures that the items remaining on the left page are in the correct ! * item number order, but it does not reproduce the physical order they ! * would have had. Is this worth changing? See also _bt_restore_page(). ! */ ! if (!(record->xl_info & XLR_BKP_BLOCK_1)) { Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false); if (BufferIsValid(lbuf)) { Page lpage = (Page) BufferGetPage(lbuf); BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage); --- 357,377 ---- /* don't release the buffer yet; we touch right page's first item below */ ! /* Now reconstruct left (original) sibling page */ ! if (record->xl_info & XLR_BKP_BLOCK_1) ! (void) RestoreBackupBlock(lsn, record, 0, false, false); ! 
else { Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false); if (BufferIsValid(lbuf)) { + /* + * Note that this code ensures that the items remaining on the + * left page are in the correct item number order, but it does not + * reproduce the physical order they would have had. Is this + * worth changing? See also _bt_restore_page(). + */ Page lpage = (Page) BufferGetPage(lbuf); BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage); *************** btree_xlog_split(bool onleft, bool isroo *** 432,439 **** /* We no longer need the right buffer */ UnlockReleaseBuffer(rbuf); ! /* Fix left-link of the page to the right of the new right sibling */ ! if (xlrec->rnext != P_NONE && !(record->xl_info & XLR_BKP_BLOCK_2)) { Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false); --- 435,451 ---- /* We no longer need the right buffer */ UnlockReleaseBuffer(rbuf); ! /* ! * Fix left-link of the page to the right of the new right sibling. ! * ! * Note: in normal operation, we do this while still holding lock on the ! * two split pages. However, that's not necessary for correctness in WAL ! * replay, because no other index update can be in progress, and readers ! * will cope properly when following an obsolete left-link. ! */ ! if (record->xl_info & XLR_BKP_BLOCK_2) ! (void) RestoreBackupBlock(lsn, record, 1, false, false); ! else if (xlrec->rnext != P_NONE) { Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false); diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 573c9ad682e59d9e7b01ad7c8d9d321f15f948f3..b167ae0382b58ac3b4c995e5652bb6cdb3ecc59e 100644 *** a/src/backend/access/transam/README --- b/src/backend/access/transam/README *************** ID at least once; otherwise there is no *** 498,508 **** The standard replay-routine pattern for this case is if (record->xl_info & XLR_BKP_BLOCK_n) ! << do nothing, page was rewritten from logged copy >>; buffer = XLogReadBuffer(rnode, blkno, false); if (!BufferIsValid(buffer)) ! << do nothing, page has been deleted >>; page = (Page) BufferGetPage(buffer); if (XLByteLE(lsn, PageGetLSN(page))) --- 498,515 ---- The standard replay-routine pattern for this case is if (record->xl_info & XLR_BKP_BLOCK_n) ! { ! /* apply the change from the full-page image */ ! (void) RestoreBackupBlock(lsn, record, n-1, false, false); ! return; ! } buffer = XLogReadBuffer(rnode, blkno, false); if (!BufferIsValid(buffer)) ! { ! /* page has been deleted, so we need do nothing */ ! return; ! } page = (Page) BufferGetPage(buffer); if (XLByteLE(lsn, PageGetLSN(page))) *************** associated with the n'th distinct buffer *** 527,532 **** --- 534,570 ---- per the above discussion, fully-rewritable buffers shouldn't be mentioned in "rdata".) + When replaying a WAL record that describes changes on multiple pages, you + must be careful to lock the pages in the same way that the original action + did, because in Hot Standby mode there may be other backends inspecting + these pages. Replay can't expose inconsistent page states to concurrent + queries, any more than normal operation did. If this requires that two + or more buffer locks be held concurrently, the coding pattern shown above + is too simplistic, since it assumes the routine can exit as soon as it's + known the current page requires no modification. 
Instead, you might have + something like + + if (record->xl_info & XLR_BKP_BLOCK_1) + { + /* apply the change from the full-page image */ + buffer1 = RestoreBackupBlock(lsn, record, 0, false, true); + } + else + { + buffer1 = XLogReadBuffer(rnode, blkno, false); + if (BufferIsValid(buffer1)) + { + ... apply the change if not already done ... + MarkBufferDirty(buffer1); + } + } + + ... similarly apply the changes for page 2 ... + + /* and now we can release the lock on the first page */ + if (BufferIsValid(buffer1)) + UnlockReleaseBuffer(buffer1); + Due to all these constraints, complex changes (such as a multilevel index insertion) normally need to be described by a series of atomic-action WAL records. What do you do if the intermediate states are not self-consistent? diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index bf76f6d24cb8bbf3b1b8b19e0a1c007ff08e4035..4d379e2f78a89604795362e76a03867dcfdf42e6 100644 *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec *** 3155,3160 **** --- 3155,3256 ---- } /* + * Restore a full-page image from a backup block attached to an XLOG record. + * + * lsn: LSN of the XLOG record being replayed + * record: the complete XLOG record + * block_index: which backup block to restore (0 .. XLR_MAX_BKP_BLOCKS - 1) + * get_cleanup_lock: TRUE to get a cleanup rather than plain exclusive lock + * keep_buffer: TRUE to return the buffer still locked and pinned + * + * Returns the buffer number containing the page. Note this is not terribly + * useful unless keep_buffer is specified as TRUE. + * + * Note: when a backup block is available in XLOG, we restore it + * unconditionally, even if the page in the database appears newer. + * This is to protect ourselves against database pages that were partially + * or incorrectly written during a crash. We assume that the XLOG data + * must be good because it has passed a CRC check, while the database + * page might not be. This will force us to replay all subsequent + * modifications of the page that appear in XLOG, rather than possibly + * ignoring them as already applied, but that's not a huge drawback. + * + * If 'get_cleanup_lock' is true, a cleanup lock is obtained on the buffer, + * else a normal exclusive lock is used. During crash recovery, that's just + * pro forma because there can't be any regular backends in the system, but + * in hot standby mode the distinction is important. + * + * If 'keep_buffer' is true, return without releasing the buffer lock and pin; + * then caller is responsible for doing UnlockReleaseBuffer() later. This + * is needed in some cases when replaying XLOG records that touch multiple + * pages, to prevent inconsistent states from being visible to other backends. + * (Again, that's only important in hot standby mode.) 
+ */ + Buffer + RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index, + bool get_cleanup_lock, bool keep_buffer) + { + Buffer buffer; + Page page; + BkpBlock bkpb; + char *blk; + int i; + + /* Locate requested BkpBlock in the record */ + blk = (char *) XLogRecGetData(record) + record->xl_len; + for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++) + { + if (!(record->xl_info & XLR_SET_BKP_BLOCK(i))) + continue; + + memcpy(&bkpb, blk, sizeof(BkpBlock)); + blk += sizeof(BkpBlock); + + if (i == block_index) + { + buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block, + RBM_ZERO); + Assert(BufferIsValid(buffer)); + if (get_cleanup_lock) + LockBufferForCleanup(buffer); + else + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = (Page) BufferGetPage(buffer); + + if (bkpb.hole_length == 0) + { + memcpy((char *) page, blk, BLCKSZ); + } + else + { + memcpy((char *) page, blk, bkpb.hole_offset); + /* must zero-fill the hole */ + MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length); + memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length), + blk + bkpb.hole_offset, + BLCKSZ - (bkpb.hole_offset + bkpb.hole_length)); + } + + PageSetLSN(page, lsn); + PageSetTLI(page, ThisTimeLineID); + MarkBufferDirty(buffer); + + if (!keep_buffer) + UnlockReleaseBuffer(buffer); + + return buffer; + } + + blk += BLCKSZ - bkpb.hole_length; + } + + /* Caller specified a bogus block_index */ + elog(ERROR, "failed to restore block_index %d", block_index); + return InvalidBuffer; /* keep compiler quiet */ + } + + /* * CRC-check an XLOG record. We do not believe the contents of an XLOG * record (other than to the minimal extent of computing the amount of * data to read in) until we've checked the CRCs. diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 2893f3b35243effb3890630dde423daddb51eaea..e5cfb96a600b83b50bee2871ee23116c54d0883d 100644 *** a/src/include/access/xlog.h --- b/src/include/access/xlog.h *************** extern void XLogGetLastRemoved(XLogSegNo *** 275,280 **** --- 275,283 ---- extern void XLogSetAsyncXactLSN(XLogRecPtr record); extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup); + extern Buffer RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, + int block_index, + bool get_cleanup_lock, bool keep_buffer); extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record); extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
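The zero-based/one-based mismatch described above is easy to trip over. Using the macro definitions that appear in the xlog.h hunk of the follow-up patch later in the thread, this small standalone program (not PostgreSQL source) shows the offset between the one-based macro names and the zero-based index that RestoreBackupBlock takes:

    /*
     * Standalone illustration of the naming clash: the bit tested with the
     * one-based name XLR_BKP_BLOCK_n is the block restored with the
     * zero-based call RestoreBackupBlock(..., n-1, ...), so writing "n" in
     * both places silently selects the wrong backup block.
     */
    #include <stdio.h>

    #define XLR_SET_BKP_BLOCK(iblk)  (0x08 >> (iblk))
    #define XLR_BKP_BLOCK_1          XLR_SET_BKP_BLOCK(0)   /* 0x08 */
    #define XLR_BKP_BLOCK_2          XLR_SET_BKP_BLOCK(1)   /* 0x04 */
    #define XLR_BKP_BLOCK_3          XLR_SET_BKP_BLOCK(2)   /* 0x02 */
    #define XLR_BKP_BLOCK_4          XLR_SET_BKP_BLOCK(3)   /* 0x01 */

    int
    main(void)
    {
        int one_based[] = {XLR_BKP_BLOCK_1, XLR_BKP_BLOCK_2,
                           XLR_BKP_BLOCK_3, XLR_BKP_BLOCK_4};

        for (int i = 0; i < 4; i++)
            printf("XLR_BKP_BLOCK_%d = 0x%02x  <=>  zero-based index %d (mask 0x%02x)\n",
                   i + 1, one_based[i], i, XLR_SET_BKP_BLOCK(i));
        return 0;
    }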
On 10 November 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately). That's not going to be a
>> small change unfortunately :-(
>
> Here's a WIP patch that attacks it in this way. I've only gone as far as
> fixing btree_xlog_split() for the moment, since the main point is to see
> whether the coding change seems reasonable. At least in this function,
> it seems that locking considerations are handled correctly as long as no
> full-page image is used, so it's pretty straightforward to tweak it to
> handle the case with full-page image(s) as well. I've not looked
> through any other replay functions yet --- some of them may need more
> invasive hacking. But so far this looks pretty good.

Looks fine, but we need a top-level comment in btree code explaining why/how it follows the very well explained rules in the README.

> Note that this patch isn't correct yet even for btree_xlog_split(),
> because I've not removed the RestoreBkpBlocks() call in btree_redo().
> All the btree replay routines will have to get fixed before it's
> testable at all.

I was just about to write back to say you'd missed that...

> One thing that seems a bit annoying is the use of zero-based backup
> block indexes in RestoreBackupBlock, when most (not all) of the callers
> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
> That's a bug waiting to happen. We could address it by changing
> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
> of doing instead is getting rid of the one-based convention entirely;
> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One
> advantage of doing that is that it'll force inspection of all code
> related to this.

I wouldn't do that in a back branch, but I can see why it's a good idea.

Thanks for fixing.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 10 November 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Here's a WIP patch that attacks it in this way. > Looks fine, but we need a top-level comment in btree code explaining > why/how it follows the very well explained rules in the README. Not sure what you'd want it to say exactly? Really these issues are per WAL-replay function, not global. I've now been through all of the btree functions, and AFAICS btree_xlog_split is the only one of them that actually has a bug; the others can be a bit lax about the order in which things get done. >> One thing that seems a bit annoying is the use of zero-based backup >> block indexes in RestoreBackupBlock, when most (not all) of the callers >> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc). >> That's a bug waiting to happen. We could address it by changing >> RestoreBackupBlock to accept a one-based argument, but what I'm thinking >> of doing instead is getting rid of the one-based convention entirely; >> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have >> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One >> advantage of doing that is that it'll force inspection of all code >> related to this. > I wouldn't do that in a back branch, but I can see why its a good idea. If any of this stuff were getting used by external modules, changing it would be problematic ... but fortunately, it isn't, because we lack support for plug-in rmgrs. So I'm not worried about back-patching the change, and would prefer to keep the 9.x branches in sync. Attached is an updated patch in which nbtxlog.c has been fully converted, but I've not touched other rmgrs yet. I've tested this to the extent of having a replication slave follow a master that's running the regression tests, and it seems to work. It's difficult to prove much by testing about concurrent behavior, of course. I found that there's another benefit of managing backup-block replay this way, which is that we can de-contort the handling of conflict resolution by moving it into the per-record-type subroutines, instead of needing an extra switch before the RestoreBkpBlocks call. So that gives me more confidence that this is a good way to go. regards, tom lane diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..8f534803a7cea75105869de6eaf2946e03d5437c 100644 *** a/src/backend/access/nbtree/nbtxlog.c --- b/src/backend/access/nbtree/nbtxlog.c *************** btree_xlog_insert(bool isleaf, bool isme *** 218,227 **** datalen -= sizeof(xl_btree_metadata); } ! if ((record->xl_info & XLR_BKP_BLOCK_1) && !ismeta && isleaf) ! return; /* nothing to do */ ! ! if (!(record->xl_info & XLR_BKP_BLOCK_1)) { buffer = XLogReadBuffer(xlrec->target.node, ItemPointerGetBlockNumber(&(xlrec->target.tid)), --- 218,226 ---- datalen -= sizeof(xl_btree_metadata); } ! if (record->xl_info & XLR_BKP_BLOCK(0)) ! (void) RestoreBackupBlock(lsn, record, 0, false, false); ! else { buffer = XLogReadBuffer(xlrec->target.node, ItemPointerGetBlockNumber(&(xlrec->target.tid)), *************** btree_xlog_insert(bool isleaf, bool isme *** 249,254 **** --- 248,260 ---- } } + /* + * Note: in normal operation, we'd update the metapage while still holding + * lock on the page we inserted into. 
But during replay it's not + * necessary to hold that lock, since no other index updates can be + * happening concurrently, and readers will cope fine with following an + * obsolete link from the metapage. + */ if (ismeta) _bt_restore_meta(xlrec->target.node, lsn, md.root, md.level, *************** btree_xlog_split(bool onleft, bool isroo *** 290,296 **** forget_matching_split(xlrec->node, downlink, false); /* Extract left hikey and its size (still assuming 16-bit alignment) */ ! if (!(record->xl_info & XLR_BKP_BLOCK_1)) { /* We assume 16-bit alignment is enough for IndexTupleSize */ left_hikey = (Item) datapos; --- 296,302 ---- forget_matching_split(xlrec->node, downlink, false); /* Extract left hikey and its size (still assuming 16-bit alignment) */ ! if (!(record->xl_info & XLR_BKP_BLOCK(0))) { /* We assume 16-bit alignment is enough for IndexTupleSize */ left_hikey = (Item) datapos; *************** btree_xlog_split(bool onleft, bool isroo *** 310,316 **** datalen -= sizeof(OffsetNumber); } ! if (onleft && !(record->xl_info & XLR_BKP_BLOCK_1)) { /* * We assume that 16-bit alignment is enough to apply IndexTupleSize --- 316,322 ---- datalen -= sizeof(OffsetNumber); } ! if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0))) { /* * We assume that 16-bit alignment is enough to apply IndexTupleSize *************** btree_xlog_split(bool onleft, bool isroo *** 323,329 **** datalen -= newitemsz; } ! /* Reconstruct right (new) sibling from scratch */ rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true); Assert(BufferIsValid(rbuf)); rpage = (Page) BufferGetPage(rbuf); --- 329,335 ---- datalen -= newitemsz; } ! /* Reconstruct right (new) sibling page from scratch */ rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true); Assert(BufferIsValid(rbuf)); rpage = (Page) BufferGetPage(rbuf); *************** btree_xlog_split(bool onleft, bool isroo *** 357,374 **** /* don't release the buffer yet; we touch right page's first item below */ ! /* ! * Reconstruct left (original) sibling if needed. Note that this code ! * ensures that the items remaining on the left page are in the correct ! * item number order, but it does not reproduce the physical order they ! * would have had. Is this worth changing? See also _bt_restore_page(). ! */ ! if (!(record->xl_info & XLR_BKP_BLOCK_1)) { Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false); if (BufferIsValid(lbuf)) { Page lpage = (Page) BufferGetPage(lbuf); BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage); --- 363,383 ---- /* don't release the buffer yet; we touch right page's first item below */ ! /* Now reconstruct left (original) sibling page */ ! if (record->xl_info & XLR_BKP_BLOCK(0)) ! (void) RestoreBackupBlock(lsn, record, 0, false, false); ! else { Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false); if (BufferIsValid(lbuf)) { + /* + * Note that this code ensures that the items remaining on the + * left page are in the correct item number order, but it does not + * reproduce the physical order they would have had. Is this + * worth changing? See also _bt_restore_page(). + */ Page lpage = (Page) BufferGetPage(lbuf); BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage); *************** btree_xlog_split(bool onleft, bool isroo *** 432,439 **** /* We no longer need the right buffer */ UnlockReleaseBuffer(rbuf); ! /* Fix left-link of the page to the right of the new right sibling */ ! 
if (xlrec->rnext != P_NONE && !(record->xl_info & XLR_BKP_BLOCK_2)) { Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false); --- 441,457 ---- /* We no longer need the right buffer */ UnlockReleaseBuffer(rbuf); ! /* ! * Fix left-link of the page to the right of the new right sibling. ! * ! * Note: in normal operation, we do this while still holding lock on the ! * two split pages. However, that's not necessary for correctness in WAL ! * replay, because no other index update can be in progress, and readers ! * will cope properly when following an obsolete left-link. ! */ ! if (record->xl_info & XLR_BKP_BLOCK(1)) ! (void) RestoreBackupBlock(lsn, record, 1, false, false); ! else if (xlrec->rnext != P_NONE) { Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false); *************** btree_xlog_split(bool onleft, bool isroo *** 463,475 **** static void btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record) { ! xl_btree_vacuum *xlrec; Buffer buffer; Page page; BTPageOpaque opaque; - xlrec = (xl_btree_vacuum *) XLogRecGetData(record); - /* * If queries might be active then we need to ensure every block is * unpinned between the lastBlockVacuumed and the current block, if there --- 481,491 ---- static void btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record) { ! xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record); Buffer buffer; Page page; BTPageOpaque opaque; /* * If queries might be active then we need to ensure every block is * unpinned between the lastBlockVacuumed and the current block, if there *************** btree_xlog_vacuum(XLogRecPtr lsn, XLogRe *** 502,514 **** } /* ! * If the block was restored from a full page image, nothing more to do. ! * The RestoreBkpBlocks() call already pinned and took cleanup lock on it. ! * XXX: Perhaps we should call RestoreBkpBlocks() *after* the loop above, ! * to make the disk access more sequential. */ ! if (record->xl_info & XLR_BKP_BLOCK_1) return; /* * Like in btvacuumpage(), we need to take a cleanup lock on every leaf --- 518,531 ---- } /* ! * If we have a full-page image, restore it (using a cleanup lock) and ! * we're done. */ ! if (record->xl_info & XLR_BKP_BLOCK(0)) ! { ! (void) RestoreBackupBlock(lsn, record, 0, true, false); return; + } /* * Like in btvacuumpage(), we need to take a cleanup lock on every leaf *************** btree_xlog_vacuum(XLogRecPtr lsn, XLogRe *** 563,571 **** * XXX optimise later with something like XLogPrefetchBuffer() */ static TransactionId ! btree_xlog_delete_get_latestRemovedXid(XLogRecord *record) { - xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record); OffsetNumber *unused; Buffer ibuffer, hbuffer; --- 580,587 ---- * XXX optimise later with something like XLogPrefetchBuffer() */ static TransactionId ! btree_xlog_delete_get_latestRemovedXid(xl_btree_delete *xlrec) { OffsetNumber *unused; Buffer ibuffer, hbuffer; *************** btree_xlog_delete_get_latestRemovedXid(X *** 702,716 **** static void btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record) { ! xl_btree_delete *xlrec; Buffer buffer; Page page; BTPageOpaque opaque; ! if (record->xl_info & XLR_BKP_BLOCK_1) ! return; ! xlrec = (xl_btree_delete *) XLogRecGetData(record); /* * We don't need to take a cleanup lock to apply these changes. See --- 718,752 ---- static void btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record) { ! xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record); Buffer buffer; Page page; BTPageOpaque opaque; ! /* ! 
* If we have any conflict processing to do, it must happen before we ! * update the page. ! * ! * Btree delete records can conflict with standby queries. You might ! * think that vacuum records would conflict as well, but we've handled ! * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid ! * cleaned by the vacuum of the heap and so we can resolve any conflicts ! * just once when that arrives. After that we know that no conflicts ! * exist from individual btree vacuum records on that index. ! */ ! if (InHotStandby) ! { ! TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(xlrec); ! ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node); ! } ! ! /* If we have a full-page image, restore it and we're done */ ! if (record->xl_info & XLR_BKP_BLOCK(0)) ! { ! (void) RestoreBackupBlock(lsn, record, 0, false, false); ! return; ! } /* * We don't need to take a cleanup lock to apply these changes. See *************** btree_xlog_delete_page(uint8 info, XLogR *** 766,773 **** leftsib = xlrec->leftblk; rightsib = xlrec->rightblk; /* parent page */ ! if (!(record->xl_info & XLR_BKP_BLOCK_1)) { buffer = XLogReadBuffer(xlrec->target.node, parent, false); if (BufferIsValid(buffer)) --- 802,819 ---- leftsib = xlrec->leftblk; rightsib = xlrec->rightblk; + /* + * In normal operation, we would lock all the pages this WAL record + * touches before changing any of them. In WAL replay, it should be okay + * to lock just one page at a time, since no concurrent index updates can + * be happening, and readers should not care whether they arrive at the + * target page or not (since it's surely empty). + */ + /* parent page */ ! if (record->xl_info & XLR_BKP_BLOCK(0)) ! (void) RestoreBackupBlock(lsn, record, 0, false, false); ! else { buffer = XLogReadBuffer(xlrec->target.node, parent, false); if (BufferIsValid(buffer)) *************** btree_xlog_delete_page(uint8 info, XLogR *** 813,819 **** } /* Fix left-link of right sibling */ ! if (!(record->xl_info & XLR_BKP_BLOCK_2)) { buffer = XLogReadBuffer(xlrec->target.node, rightsib, false); if (BufferIsValid(buffer)) --- 859,867 ---- } /* Fix left-link of right sibling */ ! if (record->xl_info & XLR_BKP_BLOCK(1)) ! (void) RestoreBackupBlock(lsn, record, 1, false, false); ! else { buffer = XLogReadBuffer(xlrec->target.node, rightsib, false); if (BufferIsValid(buffer)) *************** btree_xlog_delete_page(uint8 info, XLogR *** 837,843 **** } /* Fix right-link of left sibling, if any */ ! if (!(record->xl_info & XLR_BKP_BLOCK_3)) { if (leftsib != P_NONE) { --- 885,893 ---- } /* Fix right-link of left sibling, if any */ ! if (record->xl_info & XLR_BKP_BLOCK(2)) ! (void) RestoreBackupBlock(lsn, record, 2, false, false); ! else { if (leftsib != P_NONE) { *************** btree_xlog_newroot(XLogRecPtr lsn, XLogR *** 911,916 **** --- 961,969 ---- BTPageOpaque pageop; BlockNumber downlink = 0; + /* Backup blocks are not used in newroot records */ + Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK)); + buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true); Assert(BufferIsValid(buffer)); page = (Page) BufferGetPage(buffer); *************** btree_xlog_newroot(XLogRecPtr lsn, XLogR *** 952,1018 **** forget_matching_split(xlrec->node, downlink, true); } ! ! void ! btree_redo(XLogRecPtr lsn, XLogRecord *record) { ! uint8 info = record->xl_info & ~XLR_INFO_MASK; /* ! * If we have any conflict processing to do, it must happen before we ! * update the page. */ if (InHotStandby) { ! switch (info) ! { ! case XLOG_BTREE_DELETE: ! ! /* ! 
* Btree delete records can conflict with standby queries. You ! * might think that vacuum records would conflict as well, but ! * we've handled that already. XLOG_HEAP2_CLEANUP_INFO records ! * provide the highest xid cleaned by the vacuum of the heap ! * and so we can resolve any conflicts just once when that ! * arrives. After that any we know that no conflicts exist ! * from individual btree vacuum records on that index. ! */ ! { ! TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record); ! xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record); ! ! ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node); ! } ! break; ! ! case XLOG_BTREE_REUSE_PAGE: ! ! /* ! * Btree reuse page records exist to provide a conflict point ! * when we reuse pages in the index via the FSM. That's all it ! * does though. latestRemovedXid was the page's btpo.xact. The ! * btpo.xact < RecentGlobalXmin test in _bt_page_recyclable() ! * conceptually mirrors the pgxact->xmin > limitXmin test in ! * GetConflictingVirtualXIDs(). Consequently, one XID value ! * achieves the same exclusion effect on master and standby. ! */ ! { ! xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record); ! ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node); ! } ! return; - default: - break; - } - } ! /* ! * Vacuum needs to pin and take cleanup lock on every leaf page, a regular ! * exclusive lock is enough for all other purposes. ! */ ! RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM)); switch (info) { --- 1005,1040 ---- forget_matching_split(xlrec->node, downlink, true); } ! static void ! btree_xlog_reuse_page(XLogRecPtr lsn, XLogRecord *record) { ! xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record); /* ! * Btree reuse_page records exist to provide a conflict point when we ! * reuse pages in the index via the FSM. That's all they do though. ! * ! * latestRemovedXid was the page's btpo.xact. The btpo.xact < ! * RecentGlobalXmin test in _bt_page_recyclable() conceptually mirrors the ! * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs(). ! * Consequently, one XID value achieves the same exclusion effect on ! * master and standby. */ if (InHotStandby) { ! ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, ! xlrec->node); ! } ! /* Backup blocks are not used in reuse_page records */ ! Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK)); ! } ! void ! btree_redo(XLogRecPtr lsn, XLogRecord *record) ! { ! uint8 info = record->xl_info & ~XLR_INFO_MASK; switch (info) { *************** btree_redo(XLogRecPtr lsn, XLogRecord *r *** 1052,1058 **** btree_xlog_newroot(lsn, record); break; case XLOG_BTREE_REUSE_PAGE: ! /* Handled above before restoring bkp block */ break; default: elog(PANIC, "btree_redo: unknown op code %u", info); --- 1074,1080 ---- btree_xlog_newroot(lsn, record); break; case XLOG_BTREE_REUSE_PAGE: ! btree_xlog_reuse_page(lsn, record); break; default: elog(PANIC, "btree_redo: unknown op code %u", info); diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 573c9ad682e59d9e7b01ad7c8d9d321f15f948f3..f8ebf5758f0a9782016348dca041d92533e94494 100644 *** a/src/backend/access/transam/README --- b/src/backend/access/transam/README *************** critical section.) *** 438,445 **** 4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must happen before the WAL record is inserted; see notes in SyncOneBuffer().) ! 5. 
Build a WAL log record and pass it to XLogInsert(); then update the page's ! LSN and TLI using the returned XLOG location. For instance, recptr = XLogInsert(rmgr_id, info, rdata); --- 438,446 ---- 4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must happen before the WAL record is inserted; see notes in SyncOneBuffer().) ! 5. If the relation requires WAL-logging, build a WAL log record and pass it ! to XLogInsert(); then update the page's LSN and TLI using the returned XLOG ! location. For instance, recptr = XLogInsert(rmgr_id, info, rdata); *************** which buffers were handled that way --- *** 466,474 **** what the XLOG record actually contains. XLOG records that describe multi-page changes therefore require some care to design: you must be certain that you know what data is indicated by each "BKP" bit. An example of the trickiness ! is that in a HEAP_UPDATE record, BKP(1) normally is associated with the source ! page and BKP(2) is associated with the destination page --- but if these are ! the same page, only BKP(1) would have been set. For this reason as well as the risk of deadlocking on buffer locks, it's best to design WAL records so that they reflect small atomic actions involving just --- 467,475 ---- what the XLOG record actually contains. XLOG records that describe multi-page changes therefore require some care to design: you must be certain that you know what data is indicated by each "BKP" bit. An example of the trickiness ! is that in a HEAP_UPDATE record, BKP(0) normally is associated with the source ! page and BKP(1) is associated with the destination page --- but if these are ! the same page, only BKP(0) would have been set. For this reason as well as the risk of deadlocking on buffer locks, it's best to design WAL records so that they reflect small atomic actions involving just *************** incrementally update the page, the rdata *** 497,508 **** ID at least once; otherwise there is no defense against torn-page problems. The standard replay-routine pattern for this case is ! if (record->xl_info & XLR_BKP_BLOCK_n) ! << do nothing, page was rewritten from logged copy >>; buffer = XLogReadBuffer(rnode, blkno, false); if (!BufferIsValid(buffer)) ! << do nothing, page has been deleted >>; page = (Page) BufferGetPage(buffer); if (XLByteLE(lsn, PageGetLSN(page))) --- 498,516 ---- ID at least once; otherwise there is no defense against torn-page problems. The standard replay-routine pattern for this case is ! if (record->xl_info & XLR_BKP_BLOCK(N)) ! { ! /* apply the change from the full-page image */ ! (void) RestoreBackupBlock(lsn, record, N, false, false); ! return; ! } buffer = XLogReadBuffer(rnode, blkno, false); if (!BufferIsValid(buffer)) ! { ! /* page has been deleted, so we need do nothing */ ! return; ! } page = (Page) BufferGetPage(buffer); if (XLByteLE(lsn, PageGetLSN(page))) *************** The standard replay-routine pattern for *** 520,532 **** UnlockReleaseBuffer(buffer); As noted above, for a multi-page update you need to be able to determine ! which XLR_BKP_BLOCK_n flag applies to each page. If a WAL record reflects a combination of fully-rewritable and incremental updates, then the rewritable ! pages don't count for the XLR_BKP_BLOCK_n numbering. (XLR_BKP_BLOCK_n is ! associated with the n'th distinct buffer ID seen in the "rdata" array, and per the above discussion, fully-rewritable buffers shouldn't be mentioned in "rdata".) 
Due to all these constraints, complex changes (such as a multilevel index insertion) normally need to be described by a series of atomic-action WAL records. What do you do if the intermediate states are not self-consistent? --- 528,569 ---- UnlockReleaseBuffer(buffer); As noted above, for a multi-page update you need to be able to determine ! which XLR_BKP_BLOCK(N) flag applies to each page. If a WAL record reflects a combination of fully-rewritable and incremental updates, then the rewritable ! pages don't count for the XLR_BKP_BLOCK(N) numbering. (XLR_BKP_BLOCK(N) is ! associated with the N'th distinct buffer ID seen in the "rdata" array, and per the above discussion, fully-rewritable buffers shouldn't be mentioned in "rdata".) + When replaying a WAL record that describes changes on multiple pages, you + must be careful to lock the pages properly to prevent concurrent Hot Standby + queries from seeing an inconsistent state. If this requires that two + or more buffer locks be held concurrently, the coding pattern shown above + is too simplistic, since it assumes the routine can exit as soon as it's + known the current page requires no modification. Instead, you might have + something like + + if (record->xl_info & XLR_BKP_BLOCK(0)) + { + /* apply the change from the full-page image */ + buffer0 = RestoreBackupBlock(lsn, record, 0, false, true); + } + else + { + buffer0 = XLogReadBuffer(rnode, blkno, false); + if (BufferIsValid(buffer0)) + { + ... apply the change if not already done ... + MarkBufferDirty(buffer0); + } + } + + ... similarly apply the changes for remaining pages ... + + /* and now we can release the lock on the first page */ + if (BufferIsValid(buffer0)) + UnlockReleaseBuffer(buffer0); + Due to all these constraints, complex changes (such as a multilevel index insertion) normally need to be described by a series of atomic-action WAL records. What do you do if the intermediate states are not self-consistent? diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index bf76f6d24cb8bbf3b1b8b19e0a1c007ff08e4035..555c69c460ef8baf7be43540673132f766fb0e01 100644 *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *************** begin:; *** 835,842 **** * At the exit of this loop, write_len includes the backup block data. * * Also set the appropriate info bits to show which buffers were backed ! * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct ! * buffer value (ignoring InvalidBuffer) appearing in the rdata chain. */ rdt_lastnormal = rdt; write_len = len; --- 835,842 ---- * At the exit of this loop, write_len includes the backup block data. * * Also set the appropriate info bits to show which buffers were backed ! * up. The XLR_BKP_BLOCK(N) bit corresponds to the N'th distinct buffer ! * value (ignoring InvalidBuffer) appearing in the rdata chain. */ rdt_lastnormal = rdt; write_len = len; *************** begin:; *** 848,854 **** if (!dtbuf_bkp[i]) continue; ! info |= XLR_SET_BKP_BLOCK(i); bkpb = &(dtbuf_xlg[i]); page = (char *) BufferGetBlock(dtbuf[i]); --- 848,854 ---- if (!dtbuf_bkp[i]) continue; ! info |= XLR_BKP_BLOCK(i); bkpb = &(dtbuf_xlg[i]); page = (char *) BufferGetBlock(dtbuf[i]); *************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec *** 3155,3160 **** --- 3155,3257 ---- } /* + * Restore a full-page image from a backup block attached to an XLOG record. + * + * lsn: LSN of the XLOG record being replayed + * record: the complete XLOG record + * block_index: which backup block to restore (0 .. 
XLR_MAX_BKP_BLOCKS - 1) + * get_cleanup_lock: TRUE to get a cleanup rather than plain exclusive lock + * keep_buffer: TRUE to return the buffer still locked and pinned + * + * Returns the buffer number containing the page. Note this is not terribly + * useful unless keep_buffer is specified as TRUE. + * + * Note: when a backup block is available in XLOG, we restore it + * unconditionally, even if the page in the database appears newer. + * This is to protect ourselves against database pages that were partially + * or incorrectly written during a crash. We assume that the XLOG data + * must be good because it has passed a CRC check, while the database + * page might not be. This will force us to replay all subsequent + * modifications of the page that appear in XLOG, rather than possibly + * ignoring them as already applied, but that's not a huge drawback. + * + * If 'get_cleanup_lock' is true, a cleanup lock is obtained on the buffer, + * else a normal exclusive lock is used. During crash recovery, that's just + * pro forma because there can't be any regular backends in the system, but + * in hot standby mode the distinction is important. + * + * If 'keep_buffer' is true, return without releasing the buffer lock and pin; + * then caller is responsible for doing UnlockReleaseBuffer() later. This + * is needed in some cases when replaying XLOG records that touch multiple + * pages, to prevent inconsistent states from being visible to other backends. + * (Again, that's only important in hot standby mode.) + */ + Buffer + RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index, + bool get_cleanup_lock, bool keep_buffer) + { + Buffer buffer; + Page page; + BkpBlock bkpb; + char *blk; + int i; + + /* Locate requested BkpBlock in the record */ + blk = (char *) XLogRecGetData(record) + record->xl_len; + for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++) + { + if (!(record->xl_info & XLR_BKP_BLOCK(i))) + continue; + + memcpy(&bkpb, blk, sizeof(BkpBlock)); + blk += sizeof(BkpBlock); + + if (i == block_index) + { + /* Found it, apply the update */ + buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block, + RBM_ZERO); + Assert(BufferIsValid(buffer)); + if (get_cleanup_lock) + LockBufferForCleanup(buffer); + else + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = (Page) BufferGetPage(buffer); + + if (bkpb.hole_length == 0) + { + memcpy((char *) page, blk, BLCKSZ); + } + else + { + memcpy((char *) page, blk, bkpb.hole_offset); + /* must zero-fill the hole */ + MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length); + memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length), + blk + bkpb.hole_offset, + BLCKSZ - (bkpb.hole_offset + bkpb.hole_length)); + } + + PageSetLSN(page, lsn); + PageSetTLI(page, ThisTimeLineID); + MarkBufferDirty(buffer); + + if (!keep_buffer) + UnlockReleaseBuffer(buffer); + + return buffer; + } + + blk += BLCKSZ - bkpb.hole_length; + } + + /* Caller specified a bogus block_index */ + elog(ERROR, "failed to restore block_index %d", block_index); + return InvalidBuffer; /* keep compiler quiet */ + } + + /* * CRC-check an XLOG record. We do not believe the contents of an XLOG * record (other than to the minimal extent of computing the amount of * data to read in) until we've checked the CRCs. *************** RecordIsValid(XLogRecord *record, XLogRe *** 3193,3199 **** { uint32 blen; ! if (!(record->xl_info & XLR_SET_BKP_BLOCK(i))) continue; if (remaining < sizeof(BkpBlock)) --- 3290,3296 ---- { uint32 blen; ! 
if (!(record->xl_info & XLR_BKP_BLOCK(i))) continue; if (remaining < sizeof(BkpBlock)) *************** xlog_outrec(StringInfo buf, XLogRecord * *** 8081,8087 **** int i; appendStringInfo(buf, "prev %X/%X; xid %u", ! (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev, record->xl_xid); appendStringInfo(buf, "; len %u", --- 8178,8185 ---- int i; appendStringInfo(buf, "prev %X/%X; xid %u", ! (uint32) (record->xl_prev >> 32), ! (uint32) record->xl_prev, record->xl_xid); appendStringInfo(buf, "; len %u", *************** xlog_outrec(StringInfo buf, XLogRecord * *** 8089,8096 **** for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++) { ! if (record->xl_info & XLR_SET_BKP_BLOCK(i)) ! appendStringInfo(buf, "; bkpb%d", i + 1); } appendStringInfo(buf, ": %s", RmgrTable[record->xl_rmid].rm_name); --- 8187,8194 ---- for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++) { ! if (record->xl_info & XLR_BKP_BLOCK(i)) ! appendStringInfo(buf, "; bkpb%d", i); } appendStringInfo(buf, ": %s", RmgrTable[record->xl_rmid].rm_name); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 2893f3b35243effb3890630dde423daddb51eaea..4aedb25dfc31ffbc307765955737a43bc5054bc0 100644 *** a/src/include/access/xlog.h --- b/src/include/access/xlog.h *************** typedef struct XLogRecord *** 71,76 **** --- 71,78 ---- */ #define XLR_BKP_BLOCK_MASK 0x0F /* all info bits used for bkp blocks */ #define XLR_MAX_BKP_BLOCKS 4 + #define XLR_BKP_BLOCK(iblk) (0x08 >> (iblk)) /* iblk in 0..3 */ + // XXX these macros will go away: #define XLR_SET_BKP_BLOCK(iblk) (0x08 >> (iblk)) #define XLR_BKP_BLOCK_1 XLR_SET_BKP_BLOCK(0) /* 0x08 */ #define XLR_BKP_BLOCK_2 XLR_SET_BKP_BLOCK(1) /* 0x04 */ *************** extern int sync_method; *** 94,106 **** * If buffer is valid then XLOG will check if buffer must be backed up * (ie, whether this is first change of that page since last checkpoint). * If so, the whole page contents are attached to the XLOG record, and XLOG ! * sets XLR_BKP_BLOCK_X bit in xl_info. Note that the buffer must be pinned * and exclusive-locked by the caller, so that it won't change under us. * NB: when the buffer is backed up, we DO NOT insert the data pointed to by * this XLogRecData struct into the XLOG record, since we assume it's present * in the buffer. Therefore, rmgr redo routines MUST pay attention to ! * XLR_BKP_BLOCK_X to know what is actually stored in the XLOG record. ! * The i'th XLR_BKP_BLOCK bit corresponds to the i'th distinct buffer * value (ignoring InvalidBuffer) appearing in the rdata chain. * * When buffer is valid, caller must set buffer_std to indicate whether the --- 96,108 ---- * If buffer is valid then XLOG will check if buffer must be backed up * (ie, whether this is first change of that page since last checkpoint). * If so, the whole page contents are attached to the XLOG record, and XLOG ! * sets XLR_BKP_BLOCK(N) bit in xl_info. Note that the buffer must be pinned * and exclusive-locked by the caller, so that it won't change under us. * NB: when the buffer is backed up, we DO NOT insert the data pointed to by * this XLogRecData struct into the XLOG record, since we assume it's present * in the buffer. Therefore, rmgr redo routines MUST pay attention to ! * XLR_BKP_BLOCK(N) to know what is actually stored in the XLOG record. ! * The N'th XLR_BKP_BLOCK bit corresponds to the N'th distinct buffer * value (ignoring InvalidBuffer) appearing in the rdata chain. 
* * When buffer is valid, caller must set buffer_std to indicate whether the *************** extern int XLogFileOpen(XLogSegNo segno) *** 274,279 **** --- 276,285 ---- extern void XLogGetLastRemoved(XLogSegNo *segno); extern void XLogSetAsyncXactLSN(XLogRecPtr record); + extern Buffer RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, + int block_index, + bool get_cleanup_lock, bool keep_buffer); + // XXX this will go away: extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup); extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
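To summarize the control-flow change in the converted code above (a condensed sketch, not the full diff): btree_redo() no longer restores backup blocks up front, and each replay routine now resolves its own hot-standby conflicts and then decides, page by page, whether to restore a full-page image or apply the incremental change.

    /* Condensed sketch of the converted flow; see the full diff above. */
    static void
    btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
    {
        xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);

        /* Conflict processing now lives inside the per-record routine and
         * must happen before the page is touched. */
        if (InHotStandby)
            ResolveRecoveryConflictWithSnapshot(
                btree_xlog_delete_get_latestRemovedXid(xlrec),
                xlrec->node);

        /* Each page is handled explicitly: restore the full-page image if
         * one was attached, otherwise apply the incremental change. */
        if (record->xl_info & XLR_BKP_BLOCK(0))
        {
            (void) RestoreBackupBlock(lsn, record, 0, false, false);
            return;
        }

        /* ... incremental-update path, as in the diff above ... */
    }

    void
    btree_redo(XLogRecPtr lsn, XLogRecord *record)
    {
        uint8   info = record->xl_info & ~XLR_INFO_MASK;

        /* No RestoreBkpBlocks() call here any more; just dispatch. */
        switch (info)
        {
            case XLOG_BTREE_DELETE:
                btree_xlog_delete(lsn, record);
                break;
            /* ... other record types ... */
            default:
                elog(PANIC, "btree_redo: unknown op code %u", info);
        }
    }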
While I'm looking at this: there seems to be a bug/inconsistency in heap_xlog_freeze(). It uses a cleanup lock in the "normal" code path, but it doesn't tell RestoreBkpBlocks to use a cleanup lock. Which choice is correct? regards, tom lane
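For context, with the RestoreBackupBlock() API from the patch above, the two possibilities differ only in the get_cleanup_lock argument; per the function's comments, the choice matters only in hot standby, since during crash recovery there are no other backends. A minimal sketch of the two hypothetical call sites being compared (not the actual heapam.c code):

    /* Variant A: plain exclusive lock on the restored page, which is what
     * the current RestoreBkpBlocks() call effectively does for freeze
     * records (cleanup == false). */
    (void) RestoreBackupBlock(lsn, record, 0, false, false);

    /* Variant B: cleanup (super-exclusive) lock, matching what the
     * "normal" non-FPI replay path in heap_xlog_freeze takes today. */
    (void) RestoreBackupBlock(lsn, record, 0, true, false);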
On 10 November 2012 21:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 10 November 2012 18:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Here's a WIP patch that attacks it in this way.
>
>> Looks fine, but we need a top-level comment in btree code explaining
>> why/how it follows the very well explained rules in the README.
>
> Not sure what you'd want it to say exactly? Really these issues are per
> WAL-replay function, not global. I've now been through all of the btree
> functions, and AFAICS btree_xlog_split is the only one of them that
> actually has a bug; the others can be a bit lax about the order in which
> things get done.
>
>>> One thing that seems a bit annoying is the use of zero-based backup
>>> block indexes in RestoreBackupBlock, when most (not all) of the callers
>>> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
>>> That's a bug waiting to happen. We could address it by changing
>>> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
>>> of doing instead is getting rid of the one-based convention entirely;
>>> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
>>> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One
>>> advantage of doing that is that it'll force inspection of all code
>>> related to this.
>
>> I wouldn't do that in a back branch, but I can see why it's a good idea.
>
> If any of this stuff were getting used by external modules, changing it
> would be problematic ... but fortunately, it isn't, because we lack
> support for plug-in rmgrs. So I'm not worried about back-patching the
> change, and would prefer to keep the 9.x branches in sync.
>
> Attached is an updated patch in which nbtxlog.c has been fully converted,
> but I've not touched other rmgrs yet. I've tested this to the extent of
> having a replication slave follow a master that's running the regression
> tests, and it seems to work. It's difficult to prove much by testing
> about concurrent behavior, of course.
>
> I found that there's another benefit of managing backup-block replay
> this way, which is that we can de-contort the handling of conflict
> resolution by moving it into the per-record-type subroutines, instead of
> needing an extra switch before the RestoreBkpBlocks call. So that gives
> me more confidence that this is a good way to go.

It's a good way to go, I'm just glad you're handling it for the back branches ;-)

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[ this time *with* the patch ] Attached is a complete draft patch against HEAD for this issue. I found a depressingly large number of pre-existing bugs: Practically all WAL record types that touch multiple pages have some bug of this type. In addition to btree_xlog_split, I found that heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs, spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold locks as required to make their updates safe for concurrent queries. (I'm not totally sure about ginRedoDeletePage, but the original action definitely locks the pages simultaneously, and it's not clear that it's safe not to.) Most of these are okay in cases without any full-page images, but could fail if the wrong subset of the pages-to-be-touched were processed by RestoreBkpBlocks. Some had bugs even without that :-( Worse than that, GIST WAL replay seems to be a total disaster from this standpoint, and I'm not convinced that it's even fixable without incompatible WAL changes. There are several cases where index insertion operations generate more than one WAL record while holding exclusive lock on buffer(s). If the lock continuity is actually necessary to prevent concurrent queries from seeing inconsistent index states, then we have a big problem, because WAL replay isn't designed to hold any buffer lock for more than one WAL action. This needs to be looked at closer by somebody who's more up on the GIST code than I am. The attached patch doesn't try very hard to make the GIST functions safe, it just transposes them to the new API. I also found that spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple --- it would get fooled by the LSN having been advanced already. This is an index corruption bug not just a transient-failure problem. Also, ginHeapTupleFastInsert's "merge lists" case fails to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. I already pointed out the inconsistency in heap_xlog_freeze about whether a cleanup lock is needed. As is, this patch uses a cleanup lock, but I suspect that a regular lock is sufficient --- comments? Also, gistRedoPageDeleteRecord seems to be dead code? I don't see anything that emits XLOG_GIST_PAGE_DELETE. regards, tom lane
Attachment
On 2012-11-10 16:24:22 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > >> One thing that seems a bit annoying is the use of zero-based backup > >> block indexes in RestoreBackupBlock, when most (not all) of the callers > >> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc). > >> That's a bug waiting to happen. We could address it by changing > >> RestoreBackupBlock to accept a one-based argument, but what I'm thinking > >> of doing instead is getting rid of the one-based convention entirely; > >> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have > >> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One > >> advantage of doing that is that it'll force inspection of all code > >> related to this. > > > I wouldn't do that in a back branch, but I can see why its a good idea. > > If any of this stuff were getting used by external modules, changing it > would be problematic ... but fortunately, it isn't, because we lack > support for plug-in rmgrs. So I'm not worried about back-patching the > change, and would prefer to keep the 9.x branches in sync. XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely used by xlogdump. Not sure if either are worth that much attention, but it seems worth noticing that such a change will break stuff. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12.11.2012 01:25, Tom Lane wrote: > Worse than that, GIST WAL replay seems to be a total disaster from this > standpoint, and I'm not convinced that it's even fixable without > incompatible WAL changes. There are several cases where index > insertion operations generate more than one WAL record while holding > exclusive lock on buffer(s). If the lock continuity is actually > necessary to prevent concurrent queries from seeing inconsistent index > states, then we have a big problem, because WAL replay isn't designed to > hold any buffer lock for more than one WAL action. Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), the GiST index is always in a consistent state. Before the downlink for a newly split page has been inserted yet, its left sibling is flagged so that a search knows to follow the right-link to find it. In normal operation, the lock continuity is needed to prevent another backend from seeing the incomplete split and trying to fix it, but in hot standby, we never try to fix incomplete splits anyway. So I think we're good on >= 9.1. The 9.0 code is broken, however. In 9.0, when a child page is split, the parent and new children are kept locked until the downlinks are inserted/updated. If a concurrent scan comes along and sees that incomplete state, it will miss tuples on the new right siblings. We rely on a rm_cleanup operation at the end of WAL replay to fix that situation, if the downlink insertion record is not there. I don't see any easy way to fix that, unfortunately. Perhaps we could backpatch the 9.1 rewrite, now that it's gotten some real-world testing, but it was a big change so I don't feel very comfortable doing that. > Also, gistRedoPageDeleteRecord seems to be dead code? I don't see > anything that emits XLOG_GIST_PAGE_DELETE. Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was removed. - Heikki
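For reference, the >= 9.1 invariant described above boils down to a one-flag check during a scan. The following is a schematic paraphrase, not the actual gistget.c logic: GistFollowRight, GistPageGetOpaque, and the right-link are real, while the helper function is made up for illustration.

#include "postgres.h"
#include "access/gist.h"
#include "storage/bufpage.h"

/*
 * If a page's split has not had its downlink inserted yet (including the
 * case where hot standby exposes that state during replay), the page is
 * flagged F_FOLLOW_RIGHT and a scan must also visit its right sibling;
 * otherwise the page can be treated as self-contained.
 */
static BlockNumber
gist_extra_page_to_visit(Page page)
{
	if (GistFollowRight(page))
		return GistPageGetOpaque(page)->rightlink;	/* incomplete split */

	return InvalidBlockNumber;
}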
On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I already pointed out the inconsistency in heap_xlog_freeze about > whether a cleanup lock is needed. As is, this patch uses a cleanup > lock, but I suspect that a regular lock is sufficient --- comments? Well, according to storage/buffer/README: 1. To scan a page for tuples, one must hold a pin and either shared or exclusive content lock. To examine the commit status (XIDs and status bits) of a tuple in a shared buffer, one must likewise hold a pin and either shared or exclusive lock. That does indeed make it sound like an x-lock is enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I already pointed out the inconsistency in heap_xlog_freeze about >> whether a cleanup lock is needed. As is, this patch uses a cleanup >> lock, but I suspect that a regular lock is sufficient --- comments? > Well, according to storage/buffer/README: > 1. To scan a page for tuples, one must hold a pin and either shared or > exclusive content lock. To examine the commit status (XIDs and status bits) > of a tuple in a shared buffer, one must likewise hold a pin and either shared > or exclusive lock. > That does indeed make it sound like an x-lock is enough. Yeah. AFAICS, the only possible downside is that somebody might be using the tuple (while holding just a buffer pin), and that its xmin might change while that's happening. So for instance a nestloop join might read out different xmin values for the same row while the join proceeds. But that could happen anyway if a different plan had been chosen (viz, this table on the inside not the outside of the nestloop). The xmin update ought to be physically atomic, so you shouldn't be able to get a garbage result, just the old value or the new. The cleanup lock is needed for cases where we'd like to remove or move a tuple, but that's not required here. regards, tom lane
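The two lock strengths under discussion, side by side. This is a sketch, not the heap_xlog_freeze() code itself: the wrapper function is hypothetical, while LockBuffer() and LockBufferForCleanup() are the real primitives.

#include "postgres.h"
#include "storage/bufmgr.h"

/* Hypothetical wrapper contrasting the two choices for freeze replay. */
static void
lock_page_for_freeze_replay(Buffer buffer, bool use_cleanup_lock)
{
	if (use_cleanup_lock)
	{
		/*
		 * Cleanup lock = exclusive content lock plus "no other pins".
		 * Needed when tuples might be removed or moved, because a
		 * pin-only reader may still hold pointers into the page.
		 */
		LockBufferForCleanup(buffer);
	}
	else
	{
		/*
		 * Plain exclusive lock.  Sufficient for freezing: per the README
		 * rule quoted above, examining xmin requires a pin plus at least
		 * a share lock, and even a pin-only onlooker can only see the old
		 * or the new xmin, since the overwrite is atomic.
		 */
		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
	}
}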
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 12.11.2012 01:25, Tom Lane wrote: >> Worse than that, GIST WAL replay seems to be a total disaster from this >> standpoint, and I'm not convinced that it's even fixable without >> incompatible WAL changes. > Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), > the GiST index is always in a consistent state. Before the downlink for > a newly split page has been inserted yet, its left sibling is flagged so > that a search knows to follow the right-link to find it. In normal > operation, the lock continuity is needed to prevent another backend from > seeing the incomplete split and trying to fix it, but in hot standby, we > never try to fix incomplete splits anyway. > So I think we're good on >= 9.1. Okay. I'll update the patch to make sure that the individual WAL replay functions hold all locks, but not worry about holding locks across actions. > The 9.0 code is broken, however. In > 9.0, when a child page is split, the parent and new children are kept > locked until the downlinks are inserted/updated. If a concurrent scan > comes along and sees that incomplete state, it will miss tuples on the > new right siblings. We rely on a rm_cleanup operation at the end of WAL > replay to fix that situation, if the downlink insertion record is not > there. I don't see any easy way to fix that, unfortunately. Perhaps we > could backpatch the 9.1 rewrite, now that it's gotten some real-world > testing, but it was a big change so I don't feel very comfortable doing > that. Me either. Given the lack of field complaints, I think we're better advised to just leave it unfixed in 9.0. It'd not be a step forward if we broke something trying to make this work. >> Also, gistRedoPageDeleteRecord seems to be dead code? I don't see >> anything that emits XLOG_GIST_PAGE_DELETE. > Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was > removed. Okay. I see we bumped XLOG_PAGE_MAGIC in 9.0, so there's no longer any way that 9.0 or later versions could see this WAL record type. I'll delete that replay function rather than bothering to fix it. regards, tom lane
Andres Freund <andres@2ndquadrant.com> writes: > On 2012-11-10 16:24:22 -0500, Tom Lane wrote: >> If any of this stuff were getting used by external modules, changing it >> would be problematic ... but fortunately, it isn't, because we lack >> support for plug-in rmgrs. So I'm not worried about back-patching the >> change, and would prefer to keep the 9.x branches in sync. > XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely > used by xlogdump. Not sure if either are worth that much attention, but > it seems worth noticing that such a change will break stuff. Hm. Okay, how about we leave the old macros in place in the back branches? regards, tom lane
On 2012-11-12 10:19:09 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2012-11-10 16:24:22 -0500, Tom Lane wrote: > >> If any of this stuff were getting used by external modules, changing it > >> would be problematic ... but fortunately, it isn't, because we lack > >> support for plug-in rmgrs. So I'm not worried about back-patching the > >> change, and would prefer to keep the 9.x branches in sync. > > > XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely > > used by xlogdump. Not sure if either are worth that much attention, but > > it seems worth noticing that such a change will break stuff. > > Hm. Okay, how about we leave the old macros in place in the back > branches? Sounds good to me. The RestoreBkpBlocks change seems unproblematic to me. If anything its good that it has been renamed. Thanks, Andres
On 11 November 2012 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Practically all WAL record types that touch multiple pages have some > bug of this type. In addition to btree_xlog_split, I found that > heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs, > spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold > locks as required to make their updates safe for concurrent queries. > (I'm not totally sure about ginRedoDeletePage, but the original action > definitely locks the pages simultaneously, and it's not clear that it's > safe not to.) Most of these are okay in cases without any full-page > images, but could fail if the wrong subset of the pages-to-be-touched > were processed by RestoreBkpBlocks. Some had bugs even without that :-( Hmm, not good. Thanks for spotting. Do these changes do anything to actions that occur across multiple records? I assume not and think those are OK, agreed? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12 November 2012 14:51, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The 9.0 code is broken, however. In >> 9.0, when a child page is split, the parent and new children are kept >> locked until the downlinks are inserted/updated. If a concurrent scan >> comes along and sees that incomplete state, it will miss tuples on the >> new right siblings. We rely on a rm_cleanup operation at the end of WAL >> replay to fix that situation, if the downlink insertion record is not >> there. I don't see any easy way to fix that, unfortunately. Perhaps we >> could backpatch the 9.1 rewrite, now that it's gotten some real-world >> testing, but it was a big change so I don't feel very comfortable doing >> that. > > Me either. Given the lack of field complaints, I think we're better > advised to just leave it unfixed in 9.0. It'd not be a step forward > if we broke something trying to make this work. Agreed. Most people running 9.0 with GIST indexes have already upgraded. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 11 November 2012 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Practically all WAL record types that touch multiple pages have some >> bug of this type. In addition to btree_xlog_split, I found that >> heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs, >> spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold >> locks as required to make their updates safe for concurrent queries. >> (I'm not totally sure about ginRedoDeletePage, but the original action >> definitely locks the pages simultaneously, and it's not clear that it's >> safe not to.) Most of these are okay in cases without any full-page >> images, but could fail if the wrong subset of the pages-to-be-touched >> were processed by RestoreBkpBlocks. Some had bugs even without that :-( > Hmm, not good. Thanks for spotting. > Do these changes do anything to actions that occur across multiple > records? I assume not and think those are OK, agreed? Right, we were and still are assuming that any state that exists between WAL records is consistent and safe to expose to hot-standby queries. The important thing here is that these WAL replay functions were failing to ensure that their own actions appear atomic to onlookers. This is basically hangover from pre-hot-standby coding conventions, when no such concern existed. regards, tom lane
Here's an updated patch that fixes the GIST replay functions as well as the other minor issues that were mentioned. Barring objections, I'll set about back-patching this as far as 9.0. One thing that could use verification is my fix for gistRedoPageSplitRecord. AFAICS, the first page listed in the WAL record is always the "original" page, and the ones following it are pages that were split off from it, and can (as yet) only be reached by following right-links from the "original" page. As such, it should be okay to release locks on the non-first pages as soon as we've written them. We have to hold lock on the original page though to avoid letting readers follow dangling right-links. Also, the update of NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically with all this, so that has to be done before releasing the original-page lock as well. Does that sound right? regards, tom lane
Attachment
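The ordering being asked about, in skeleton form. Only the release ordering reflects the description above; restore_split_pages() and update_child_nsn_if_any() are hypothetical stand-ins for the record-decoding and page-filling work.

#include "postgres.h"
#include "access/xlog.h"
#include "storage/bufmgr.h"

/* hypothetical helpers standing in for the real decode/fill logic */
static Buffer *restore_split_pages(XLogRecPtr lsn, XLogRecord *record, int npages);
static void update_child_nsn_if_any(XLogRecPtr lsn, XLogRecord *record);

static void
gist_redo_split_sketch(XLogRecPtr lsn, XLogRecord *record, int npages)
{
	Buffer	   *bufs = restore_split_pages(lsn, record, npages);
	int			i;

	/*
	 * Pages 1..npages-1 were split off from page 0 and are reachable only
	 * via right-links starting at page 0, which is still locked; they can
	 * be written and released as soon as their contents are in place.
	 */
	for (i = npages - 1; i > 0; i--)
		UnlockReleaseBuffer(bufs[i]);

	/*
	 * The NSN/FOLLOW_RIGHT update on the child page (if the record carries
	 * one) must be done while the original page is still locked, so the
	 * whole action appears atomic to concurrent scans.
	 */
	update_child_nsn_if_any(lsn, record);

	/* Only now publish the original page with its new right-link. */
	UnlockReleaseBuffer(bufs[0]);
}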
On 12.11.2012 22:53, Tom Lane wrote: > Here's an updated patch that fixes the GIST replay functions as well as > the other minor issues that were mentioned. Barring objections, I'll > set about back-patching this as far as 9.0. Ok. It won't help all that much on 9.0, though. > One thing that could use verification is my fix for > gistRedoPageSplitRecord. AFAICS, the first page listed in the WAL > record is always the "original" page, and the ones following it are > pages that were split off from it, and can (as yet) only be reached by > following right-links from the "original" page. As such, it should be > okay to release locks on the non-first pages as soon as we've written > them. We have to hold lock on the original page though to avoid letting > readers follow dangling right-links. Also, the update of > NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically > with all this, so that has to be done before releasing the original-page > lock as well. Does that sound right? Yep. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 12.11.2012 22:53, Tom Lane wrote: >> Here's an updated patch that fixes the GIST replay functions as well as >> the other minor issues that were mentioned. Barring objections, I'll >> set about back-patching this as far as 9.0. > Ok. It won't help all that much on 9.0, though. Well, it won't help GIST much, but the actually-reported-from-the-field case is in btree, and it does fix that. It occurs to me that if we're sufficiently scared of this case, we could probably hack the planner (in 9.0 only) to refuse to use GIST indexes in hot-standby queries. That cure might be worse than the disease though. regards, tom lane
On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Ok. It won't help all that much on 9.0, though. > > Well, it won't help GIST much, but the actually-reported-from-the-field > case is in btree, and it does fix that. > > It occurs to me that if we're sufficiently scared of this case, we could > probably hack the planner (in 9.0 only) to refuse to use GIST indexes > in hot-standby queries. That cure might be worse than the disease though. if anything, it should be documented. if you do this kind of thing people will stop installing bugfix releases. merlin
On 14/11/12 04:32, Merlin Moncure wrote: > On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Ok. It won't help all that much on 9.0, though. >> Well, it won't help GIST much, but the actually-reported-from-the-field >> case is in btree, and it does fix that. >> >> It occurs to me that if we're sufficiently scared of this case, we could >> probably hack the planner (in 9.0 only) to refuse to use GIST indexes >> in hot-standby queries. That cure might be worse than the disease though. > if anything, it should be documented. if you do this kind of thing > people will stop installing bugfix releases. > > merlin > > How about displaying a warning when people try to use the 'feature', as well as documenting it? Cheers, Gavin
On Tue, Nov 13, 2012 at 10:32 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Ok. It won't help all that much on 9.0, though. >> >> Well, it won't help GIST much, but the actually-reported-from-the-field >> case is in btree, and it does fix that. >> >> It occurs to me that if we're sufficiently scared of this case, we could >> probably hack the planner (in 9.0 only) to refuse to use GIST indexes >> in hot-standby queries. That cure might be worse than the disease though. > > if anything, it should be documented. if you do this kind of thing > people will stop installing bugfix releases. Agreed. I think doing that in a back-branch release would be extremely user-hostile. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company