Thread: Hot standby, RestoreBkpBlocks and cleanup locks
The hot standby patch has some hacks to decide which full-page-images can be restored holding an exclusive lock and which ones need a vacuum-strength lock. It's not very pretty as is, as mentioned in comments too. How about we refactor things so that redo-functions are responsible for calling RestoreBkpBlocks? The redo function can then pass an argument indicating what kind of lock is required. We should also change XLogReadBufferExtended so that it doesn't lock the page; the caller knows better what kind of a lock it needs. That makes it more analogous with ReadBufferExtended too, although I think we should keep XLogReadBuffer() unchanged for now. See attached patch. One shortfall of this patch is that you can pass only one argument to RestoreBkpBlocks, but there can multiple backup blocks in one WAL record. That's OK in current usage, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** src/backend/access/gin/ginxlog.c --- src/backend/access/gin/ginxlog.c *************** *** 438,443 **** gin_redo(XLogRecPtr lsn, XLogRecord *record) --- 438,445 ---- { uint8 info = record->xl_info & ~XLR_INFO_MASK; + RestoreBkpBlocks(lsn, record, false); + topCtx = MemoryContextSwitchTo(opCtx); switch (info) { *** src/backend/access/gist/gistxlog.c --- src/backend/access/gist/gistxlog.c *************** *** 395,400 **** gist_redo(XLogRecPtr lsn, XLogRecord *record) --- 395,402 ---- { uint8 info = record->xl_info & ~XLR_INFO_MASK; + RestoreBkpBlocks(lsn, record, false); + MemoryContext oldCxt; oldCxt = MemoryContextSwitchTo(opCtx); *** src/backend/access/heap/heapam.c --- src/backend/access/heap/heapam.c *************** *** 4777,4782 **** heap_redo(XLogRecPtr lsn, XLogRecord *record) --- 4777,4784 ---- { uint8 info = record->xl_info & ~XLR_INFO_MASK; + RestoreBkpBlocks(lsn, record, false); + switch (info & XLOG_HEAP_OPMASK) { case XLOG_HEAP_INSERT: *************** *** 4816,4827 **** heap2_redo(XLogRecPtr lsn, XLogRecord *record) --- 4818,4832 ---- switch (info & XLOG_HEAP_OPMASK) { case XLOG_HEAP2_FREEZE: + RestoreBkpBlocks(lsn, record, false); heap_xlog_freeze(lsn, record); break; case XLOG_HEAP2_CLEAN: + RestoreBkpBlocks(lsn, record, true); heap_xlog_clean(lsn, record, false); break; case XLOG_HEAP2_CLEAN_MOVE: + RestoreBkpBlocks(lsn, record, true); heap_xlog_clean(lsn, record, true); break; default: *** src/backend/access/nbtree/nbtxlog.c --- src/backend/access/nbtree/nbtxlog.c *************** *** 714,719 **** btree_redo(XLogRecPtr lsn, XLogRecord *record) --- 714,721 ---- { uint8 info = record->xl_info & ~XLR_INFO_MASK; + RestoreBkpBlocks(lsn, record, false); + switch (info) { case XLOG_BTREE_INSERT_LEAF: *** src/backend/access/transam/xlog.c --- src/backend/access/transam/xlog.c *************** *** 2934,2941 **** CleanupBackupHistory(void) * modifications of the page that appear in XLOG, rather than possibly * ignoring them as already applied, but that's not a huge drawback. */ ! static void ! RestoreBkpBlocks(XLogRecord *record, XLogRecPtr lsn) { Buffer buffer; Page page; --- 2934,2941 ---- * modifications of the page that appear in XLOG, rather than possibly * ignoring them as already applied, but that's not a huge drawback. */ ! void ! RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup) { Buffer buffer; Page page; *************** *** 2943,2948 **** RestoreBkpBlocks(XLogRecord *record, XLogRecPtr lsn) --- 2943,2951 ---- char *blk; int i; + if (!(record->xl_info & XLR_BKP_BLOCK_MASK)) + return; + blk = (char *) XLogRecGetData(record) + record->xl_len; for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++) { *************** *** 2955,2960 **** RestoreBkpBlocks(XLogRecord *record, XLogRecPtr lsn) --- 2958,2968 ---- buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block, RBM_ZERO); Assert(BufferIsValid(buffer)); + if (cleanup) + LockBufferForCleanup(buffer); + else + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + page = (Page) BufferGetPage(buffer); if (bkpb.hole_length == 0) *************** *** 5210,5218 **** StartupXLOG(void) TransactionIdAdvance(ShmemVariableCache->nextXid); } - if (record->xl_info & XLR_BKP_BLOCK_MASK) - RestoreBkpBlocks(record, EndRecPtr); - RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record); /* Pop the error context stack */ --- 5218,5223 ---- *** src/backend/access/transam/xlogutils.c --- src/backend/access/transam/xlogutils.c *************** *** 217,224 **** XLogCheckInvalidPages(void) /* * XLogReadBuffer ! * A shorthand of XLogReadBufferExtended(), for reading from the main ! * fork. * * For historical reasons, instead of a ReadBufferMode argument, this only * supports RBM_ZERO (init == true) and RBM_NORMAL (init == false) modes. --- 217,234 ---- /* * XLogReadBuffer ! * Read a page during XLOG replay. ! * ! * This is a shorthand of XLogReadBufferExtended() followed by ! * LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE), for reading from the main ! * fork. ! * ! * (Getting the lock is not really necessary, since we ! * expect that this is only used during single-process XLOG replay, but ! * some subroutines such as MarkBufferDirty will complain if we don't.) ! * XXX: but it will be with the hot standby patch. ! * ! * The returned buffer is exclusively-locked. * * For historical reasons, instead of a ReadBufferMode argument, this only * supports RBM_ZERO (init == true) and RBM_NORMAL (init == false) modes. *************** *** 226,247 **** XLogCheckInvalidPages(void) Buffer XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init) { ! return XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, ! init ? RBM_ZERO : RBM_NORMAL); } /* * XLogReadBufferExtended * Read a page during XLOG replay * ! * This is functionally comparable to ReadBuffer followed by ! * LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE): you get back a pinned ! * and locked buffer. (Getting the lock is not really necessary, since we ! * expect that this is only used during single-process XLOG replay, but ! * some subroutines such as MarkBufferDirty will complain if we don't.) ! * ! * There's some differences in the behavior wrt. the "mode" argument, ! * compared to ReadBufferExtended: * * In RBM_NORMAL mode, if the page doesn't exist, or contains all-zeroes, we * return InvalidBuffer. In this case the caller should silently skip the --- 236,256 ---- Buffer XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init) { ! Buffer buf; ! buf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, ! init ? RBM_ZERO : RBM_NORMAL); ! if (BufferIsValid(buf)) ! LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); ! ! return buf; } /* * XLogReadBufferExtended * Read a page during XLOG replay * ! * This is functionally comparable to ReadBufferExtended. There's some ! * differences in the behavior wrt. the "mode" argument: * * In RBM_NORMAL mode, if the page doesn't exist, or contains all-zeroes, we * return InvalidBuffer. In this case the caller should silently skip the *************** *** 306,321 **** XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, Assert(BufferGetBlockNumber(buffer) == blkno); } - LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); - if (mode == RBM_NORMAL) { /* check that page has been initialized */ Page page = (Page) BufferGetPage(buffer); if (PageIsNew(page)) { - UnlockReleaseBuffer(buffer); log_invalid_page(rnode, forknum, blkno, true); return InvalidBuffer; } --- 315,331 ---- Assert(BufferGetBlockNumber(buffer) == blkno); } if (mode == RBM_NORMAL) { /* check that page has been initialized */ Page page = (Page) BufferGetPage(buffer); + /* + * We assume that PageIsNew works without a lock. During recovery, + * no other backend should modify the buffer at the same time. + */ if (PageIsNew(page)) { log_invalid_page(rnode, forknum, blkno, true); return InvalidBuffer; } *** src/backend/commands/sequence.c --- src/backend/commands/sequence.c *************** *** 1339,1344 **** seq_redo(XLogRecPtr lsn, XLogRecord *record) --- 1339,1346 ---- xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record); sequence_magic *sm; + RestoreBkpBlocks(lsn, record, false); + if (info != XLOG_SEQ_LOG) elog(PANIC, "seq_redo: unknown op code %u", info); *** src/backend/storage/freespace/freespace.c --- src/backend/storage/freespace/freespace.c *************** *** 212,217 **** XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk, --- 212,219 ---- /* If the page doesn't exist already, extend */ buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + page = BufferGetPage(buf); if (PageIsNew(page)) PageInit(page, BLCKSZ, 0); *** src/include/access/xlog.h --- src/include/access/xlog.h *************** *** 197,202 **** extern void XLogSetAsyncCommitLSN(XLogRecPtr record); --- 197,204 ---- extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record); extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec); + extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup); + extern void UpdateControlFile(void); extern Size XLOGShmemSize(void); extern void XLOGShmemInit(void);
On Fri, 2009-01-09 at 18:30 +0200, Heikki Linnakangas wrote: > The hot standby patch has some hacks to decide which full-page-images > can be restored holding an exclusive lock and which ones need a > vacuum-strength lock. It's not very pretty as is, as mentioned in > comments too. Agreed! > How about we refactor things so that redo-functions are responsible for > calling RestoreBkpBlocks? The redo function can then pass an argument > indicating what kind of lock is required. We should also change > XLogReadBufferExtended so that it doesn't lock the page; the caller > knows better what kind of a lock it needs. That makes it more analogous > with ReadBufferExtended too, although I think we should keep > XLogReadBuffer() unchanged for now. Much better idea, thanks. I felt a new rmgr function was overkill but couldn't think of how to do this. > See attached patch. One shortfall of this patch is that you can pass > only one argument to RestoreBkpBlocks, but there can multiple backup > blocks in one WAL record. That's OK in current usage, though. If we're doing this because of cleanup locks, then I'd say we don't currently need a cleanup lock with any WAL record that uses multiple backup blocks. So we can just document that so anybody adding such a record in the future will be careful. So all seems good. Would you want to push ResolveRedoVisibilityConflicts() down into the rmgrs as well and make reachedSafeStartPoint a global? That is only called for cleanup records. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2009-01-09 at 18:30 +0200, Heikki Linnakangas wrote: > How about we refactor things? Was that a patch against HEAD or a patch on patch? It would be useful to nibble away at the patch, committing all the necessary refactorings to make the patch work. That will reduce size of eventual commit. Do you want me to start using the GIT repo to make it easier to extract parts? You'd need to show me the setup you use first. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > On Fri, 2009-01-09 at 18:30 +0200, Heikki Linnakangas wrote: >> How about we refactor things? > > Was that a patch against HEAD or a patch on patch? Against HEAD. > It would be useful to nibble away at the patch, committing all the > necessary refactorings to make the patch work. That will reduce size of > eventual commit. Agreed. This in particular is a change I feel pretty confident to commit beforehand. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Maintaining patchset with GIT (was Re: Hot standby, RestoreBkpBlocks and cleanup locks)
From
Heikki Linnakangas
Date:
Simon Riggs wrote:> Do you want me to start using the GIT repo to make it easier to extract parts? It would be useful. Even more so for your own sanity, I think :-). I find maintaining multiple interdependent patches much easier with GIT, though it's still easy to get confused. Feel free to continue with CVS, of course. > You'd need to show me the setup you use first. Well, the first thing to do is to clone the repository, see wiki for that. And get an account at git.postgresql.org so that you can publish your stuff as a git repository. (I should get one myself..) For reviewing hot standby, I created one "hotstandbyv6a" branch, and applied and committed all the patches in right order. But if you want to keep the patches separate, you should create a separate branch for each. If patch B depends on patch A, create branch A first, and then branch branch B from that. That's what I did for the relation forks and FSM work. Just remember to always commit to the right branch. Whenever you commit changes to branch A, also merge those changes to branch B with "git checkout B; git merge A". To sync with PostgreSQL CVS HEAD: "git merge origin/master" To generate diffs, you can do for example "git diff A..B" to create a diff between A and B, or "git diff origin/master..A" to create a diff between PostgreSQL CVS HEAD and A. "git log" is also very helpful. There's a learning curve, but don't hesitate to ask. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Maintaining patchset with GIT (was Re: Hot standby, RestoreBkpBlocks and cleanup locks)
From
Simon Riggs
Date:
On Fri, 2009-01-09 at 19:38 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > Do you want me to start using the GIT repo to make it easier to > extract parts? > > It would be useful. Thanks, this is all good. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Please commit soon.... On Fri, 2009-01-09 at 18:30 +0200, Heikki Linnakangas wrote: > The hot standby patch has some hacks to decide which full-page-images > can be restored holding an exclusive lock and which ones need a > vacuum-strength lock. It's not very pretty as is, as mentioned in > comments too. > > How about we refactor things so that redo-functions are responsible for > calling RestoreBkpBlocks? The redo function can then pass an argument > indicating what kind of lock is required. We should also change > XLogReadBufferExtended so that it doesn't lock the page; the caller > knows better what kind of a lock it needs. That makes it more analogous > with ReadBufferExtended too, although I think we should keep > XLogReadBuffer() unchanged for now. > > See attached patch. One shortfall of this patch is that you can pass > only one argument to RestoreBkpBlocks, but there can multiple backup > blocks in one WAL record. That's OK in current usage, though. > -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2009-01-09 at 19:34 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > It would be useful to nibble away at the patch, committing all the > > necessary refactorings to make the patch work. That will reduce size of > > eventual commit. > > Agreed. This in particular is a change I feel pretty confident to commit > beforehand. I'm looking to implement refactoring of this now. The idea outlined before didn't deal with all call points for RecordIsCleanupRecord(), so doesn't actually work. ISTM easier to do things within the rmgr at the time WAL records are written, rather than in the rmgr while handling redo. We currently have 2 bytes spare on the WAL record, so I propose to add an uint16 xl_info2 field (again). This can then be marked with 2 bits: * 1 bit to show that it is a cleanup record and may conflict * 1 bit to show that backup blocks must be applied with cleanup lock Just to say again that adding these is free: we use no more space because of alignment. This avoids another rmgr call and is much more straightforward since we define how to redo the record at the time it is written, rather than via a separate mechanism that could mismatch. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > The idea outlined before didn't deal with all call points for > RecordIsCleanupRecord(), so doesn't actually work. Are we talking about the same thing? If we put the control of locking to the hands of the redo-function, I don't see why it couldn't use a lock of the right strength. Surely the redo-function can be taught what lock it needs to take. > ISTM easier to do things within the rmgr at the time WAL records are > written, rather than in the rmgr while handling redo. I don't like that idea. I'd like to keep the coupling between the primary and standby to the minimum. > This avoids another rmgr call and is much more straightforward since we > define how to redo the record at the time it is written, rather than via > a separate mechanism that could mismatch. The code that generates a WAL record and the redo-functions need to match in general anyway. The strength of the lock is not any more error-prone than other things that a redo-function must do. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > The idea outlined before didn't deal with all call points for > RecordIsCleanupRecord(), so doesn't actually work. Hmm, grep finds two call points for that: > ! case RECOVERY_TARGET_PAUSE_CLEANUP: > ! /* > ! * Advance until we see a cleanup record, then pause. > ! */ > ! if (RecordIsCleanupRecord(record)) > ! pauseHere = true; > ! break; > ! and > + /* > + * Wait, kill or otherwise resolve any conflicts between > + * incoming cleanup records and user queries. This is the > + * main barrier that allows MVCC to work correctly when > + * running standby servers. Only need to do this if there > + * is a possibility that users may be active. > + */ > + if (reachedSafeStartPoint && RecordIsCleanupRecord(record)) > + ResolveRedoVisibilityConflicts(EndRecPtr, record); The second we can easily handle by getting rid of ResolveRedoVisibilityConflicts functions and making the redo-functions call XactResolveRecoveryConflicts when necessary. Is the first really useful? I would understand "advance until next cleanup record *that would block/kill queries*", but why would you want to advance until the next cleanup record? Anyway, if it is useful, you could do the pausing in XactResolveRecoveryConflicts, too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-01-15 at 18:05 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > The idea outlined before didn't deal with all call points for > > RecordIsCleanupRecord(), so doesn't actually work. > > Are we talking about the same thing? I guess not. Look at the other call points for that function, cos your suggestion didn't include them AFAICS. > If we put the control of locking to > the hands of the redo-function, I don't see why it couldn't use a lock > of the right strength. Surely the redo-function can be taught what lock > it needs to take. Yes, it can, but there were other uses. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Thu, 2009-01-15 at 18:16 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > Is the first really useful? I would understand "advance until next > cleanup record *that would block/kill queries*", but why would you > want to advance until the next cleanup record? Minor difference there, but noted. > Anyway, if it is useful, you could do the pausing in > XactResolveRecoveryConflicts, too. Well that spreads code around that was previously fairly tight, which I'm not that happy with but I'll do as you suggest. We need to crack on now. I want to keep the feature at least until we have some serious feedback in the beta phase that it isn't necessary. Usability is close behind correctness and robustness. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > It would be useful to nibble away at the patch, committing all the > necessary refactorings to make the patch work. That will reduce size of > eventual commit. I committed this part now; one less thing to review. Please adjust the patch accordingly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2009-01-20 at 21:01 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > It would be useful to nibble away at the patch, committing all the > > necessary refactorings to make the patch work. That will reduce size of > > eventual commit. > > I committed this part now; one less thing to review. Please adjust the > patch accordingly. OK, thanks. I did want you to commit those changes, but that was 8 days ago and much has changed since then. Now, I'm slightly worried that this may be a retrograde step. The last 3 days I've been refactoring the code to account for other requirements, as I have been discussing, and some of these things have now changed. Though the general pattern of your suggested refactoring remains the same. I'll check in more detail, but please could we talk before you commit parts, otherwise we'll just trip over each other. I'll post the new version (v8f) (which is pre-commit) this evening, so we can discuss. Can we plan a move to GIT? We can both work on things at the same time and my changes can be more visible. There will be some initial pain... -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > I did want you to commit those changes, but that was 8 days ago and much > has changed since then. Now, I'm slightly worried that this may be a > retrograde step. The last 3 days I've been refactoring the code to > account for other requirements, as I have been discussing, and some of > these things have now changed. Though the general pattern of your > suggested refactoring remains the same. I figured there's possibly some further changes, but the general idea to move the responsibility of restoring backup blocks to the redo function should remain the same. > Can we plan a move to GIT? We can both work on things at the same time > and my changes can be more visible. There will be some initial pain... Sure, just go ahead and publish a repository. Or would you like me to apply the patches and publish a repository you can clone from? Perhaps that would be easier since I'm already familiar with GIT. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com