Thread: BUG #4879: bgwriter fails to fsync the file in recovery mode
The following bug has been logged online: Bug reference: 4879 Logged by: Fujii Masao Email address: masao.fujii@gmail.com PostgreSQL version: 8.4dev Operating system: RHEL5.1 x86_64 Description: bgwriter fails to fsync the file in recovery mode Details: The restartpoint by bgwriter in recovery mode caused the following error. ERROR: could not fsync segment 0 of relation base/11564/16422_fsm: No such file or directory The following procedure can reproduce this error. (1) create a warm-standby environment (2) execute "pgbench -i -s10" (3) execute the following SQLs TRUNCATE pgbench_accounts ; TRUNCATE pgbench_branches ; TRUNCATE pgbench_history ; TRUNCATE pgbench_tellers ; CHECKPOINT ; SELECT pg_switch_xlog(); (4) wait a minute, then the upcoming restartpoint should cause the error on the standby server. Whether this error happens or not depends on the timing of operations, so you might need to repeat steps (2) and (3) in order to reproduce the error. I suspect that the cause of this error is a race condition between file deletion by the startup process and fsync by the bgwriter: the TRUNCATE xlog record immediately deletes the corresponding file, while that file might still be scheduled to be fsynced by the bgwriter. Should we leave the actual file deletion to the bgwriter instead of the startup process, as in normal mode?
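For illustration only, the failure mode Fujii describes can be mimicked outside PostgreSQL with a few lines of C. md.c remembers dirty segments and later re-opens them by path to fsync them, so if another process has unlinked the file in the meantime, the deferred open-and-fsync fails with ENOENT, which is exactly the "could not fsync segment ... No such file or directory" error above. The file name and the flow below are invented for the demo; none of it is PostgreSQL code.

/*
 * Toy demo (not PostgreSQL code): a deferred "fsync by path" fails with
 * ENOENT once the file has been unlinked, just as the bgwriter's
 * restartpoint does after the startup process has replayed the TRUNCATE
 * and removed a relation file that was still queued for fsync.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "/tmp/fsync_race_demo";		/* hypothetical file */
	int			fd = open(path, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return 1;
	if (write(fd, "x", 1) != 1)
		return 1;
	close(fd);					/* the fsync is postponed, as md.c does */

	unlink(path);				/* "startup process" replays the TRUNCATE */

	/* later, the "bgwriter" re-opens the file by path to fsync it */
	fd = open(path, O_RDWR);
	if (fd < 0)
		printf("could not fsync segment: %s\n", strerror(errno));	/* ENOENT */
	else
	{
		fsync(fd);
		close(fd);
	}
	return 0;
}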
On Thu, 2009-06-25 at 12:55 +0000, Fujii Masao wrote: > The following bug has been logged online: > > Bug reference: 4879 > Logged by: Fujii Masao > Email address: masao.fujii@gmail.com > PostgreSQL version: 8.4dev > Operating system: RHEL5.1 x86_64 > Description: bgwriter fails to fsync the file in recovery mode > Details: Looking at it now. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > On Thu, 2009-06-25 at 12:55 +0000, Fujii Masao wrote: >> The following bug has been logged online: >> >> Bug reference: 4879 >> Logged by: Fujii Masao >> Email address: masao.fujii@gmail.com >> PostgreSQL version: 8.4dev >> Operating system: RHEL5.1 x86_64 >> Description: bgwriter fails to fsync the file in recovery mode >> Details: > > Looking at it now. Thanks. >> I suspect that the cause of this error is a race condition between >> file deletion by the startup process and fsync by the bgwriter: the >> TRUNCATE xlog record immediately deletes the corresponding file, while >> that file might still be scheduled to be fsynced by the bgwriter. >> Should we leave the actual file deletion to the bgwriter instead of >> the startup process, as in normal mode? I think the real problem is this in mdunlink(): > /* Register request to unlink first segment later */ > if (!isRedo && forkNum == MAIN_FORKNUM) > register_unlink(rnode); When we replay the unlink of the relation, we don't tell bgwriter about it. Normally we do, so bgwriter knows that if the fsync() fails with ENOENT, it's ok since the file was deleted. It's tempting to just remove the "!isRedo" condition, but then we have another problem: if bgwriter hasn't been started yet, and the shmem queue is full, we get stuck in register_unlink() trying to send the message and failing. In archive recovery, we always start bgwriter at the beginning of WAL replay. In crash recovery, we don't start bgwriter until the end of WAL replay. So we could change the "!isRedo" condition to "!InArchiveRecovery". It's not a very clean solution, but it's simple. Hmm, what happens when the startup process performs a write, and bgwriter is not running? Do the fsync requests queue up in the shmem queue until the end of recovery when bgwriter is launched? I guess I'll have to try it out... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
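For concreteness, the change being floated here would be a one-line edit in mdunlink(), along these lines (a sketch of the idea only; as the thread later concludes, this particular change turns out to be unnecessary):

 	/* Register request to unlink first segment later */
-	if (!isRedo && forkNum == MAIN_FORKNUM)
+	if (!InArchiveRecovery && forkNum == MAIN_FORKNUM)
 		register_unlink(rnode);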
Heikki Linnakangas wrote: > Hmm, what happens when the startup process performs a write, and > bgwriter is not running? Do the fsync requests queue up in the shmem > queue until the end of recovery when bgwriter is launched? I guess I'll > have to try it out... Oh dear, doesn't look good. The startup process has a pendingOpsTable of its own. bgwriter won't fsync() files that the startup process has written itself. That needs to be fixed, or you can lose data when an archive recovery crashes after a restartpoint. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > In archive recovery, we always start bgwriter at the beginning of WAL > replay. In crash recovery, we don't start bgwriter until the end of WAL > replay. So we could change the "!isRedo" condition to > "!InArchiveRecovery". It's not a very clean solution, but it's simple. This is probably what is needed. We need to look around for other tests of "in redo" that have been obsoleted by the change in bgwriter behavior. regards, tom lane
On Thu, 2009-06-25 at 12:55 +0000, Fujii Masao wrote: > The restartpoint by bgwriter in recovery mode caused the following error. > > ERROR: could not fsync segment 0 of relation base/11564/16422_fsm: No > such file or directory I think I see a related bug also. register_dirty_segment(), when run by the startup process, doesn't correctly check whether the bgwriter is active (md.c line 1194). So it will continue to register fsync requests in its own private pendingOpsTable when it should be forwarding them to the bgwriter. When the bgwriter runs mdsync(), the contents of the startup process's pendingOpsTable will be ignored. That looks to me to be a serious bug. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
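For readers without the source at hand, the md.c logic Simon is pointing at looks roughly like the sketch below (paraphrased and simplified; the exact signatures and error message are approximations, not a quotation of md.c). The point is that a process which owns a pendingOpsTable always takes the first branch, so fsync requests made by the startup process never reach the bgwriter's table.

/* Paraphrased sketch of register_dirty_segment() in md.c, 8.4-era */
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
	if (pendingOpsTable)
	{
		/* "I do my own fsyncs": remember the segment locally */
		RememberFsyncRequest(reln->smgr_rnode, forknum, seg->mdfd_segno);
	}
	else
	{
		/* ordinary backend: hand the request to the bgwriter */
		if (ForwardFsyncRequest(reln->smgr_rnode, forknum, seg->mdfd_segno))
			return;
		/* request queue full: fall back to fsyncing it ourselves */
		if (FileSync(seg->mdfd_vfd) < 0)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not fsync segment %u",
							seg->mdfd_segno)));
	}
}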
On Thu, 2009-06-25 at 17:35 +0300, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > Hmm, what happens when the startup process performs a write, and > > bgwriter is not running? Do the fsync requests queue up in the shmem > > queue until the end of recovery when bgwriter is launched? I guess I'll > > have to try it out... > > Oh dear, doesn't look good. The startup process has a pendingOpsTable of > its own. bgwriter won't fsync() files that the startup process has > written itself. That needs to be fixed, or you can lose data when an > archive recovery crashes after a restartpoint. Yes, that's what I see also. Patch attached. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> In archive recovery, we always start bgwriter at the beginning of WAL >> replay. In crash recovery, we don't start bgwriter until the end of WAL >> replay. So we could change the "!isRedo" condition to >> "!InArchiveRecovery". It's not a very clean solution, but it's simple. > > This is probably what is needed. We need to look around for other tests > of "in redo" that have been obsoleted by the change in bgwriter > behavior. We have another problem with the end-of-recovery checkpoint. When the startup process does the checkpoint, it won't know to perform the pending fsync() requests the bgwriter has absorbed. A short fix would be to have bgwriter do the shutdown checkpoint instead in archive recovery. I don't recall if there was a reason it wasn't coded like that to begin with, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Heikki Linnakangas wrote: >> Hmm, what happens when the startup process performs a write, and >> bgwriter is not running? Do the fsync requests queue up in the shmem >> queue until the end of recovery when bgwriter is launched? I guess I'll >> have to try it out... > Oh dear, doesn't look good. The startup process has a pendingOpsTable of > its own. bgwriter won't fsync() files that the startup process has > written itself. That needs to be fixed, or you can lose data when an > archive recovery crashes after a restartpoint. Ouch. I'm beginning to think that the best thing is to temporarily revert the change that made bgwriter active during recovery. It's obviously not been adequately thought through or tested. regards, tom lane
On Thu, 2009-06-25 at 17:02 +0300, Heikki Linnakangas wrote: > I think the real problem is this in mdunlink(): > > > /* Register request to unlink first segment later */ > > if (!isRedo && forkNum == MAIN_FORKNUM) > > register_unlink(rnode); > > When we replay the unlink of the relation, we don't tell bgwriter about > it. Normally we do, so bgwriter knows that if the fsync() fails with > ENOENT, it's ok since the file was deleted. > > It's tempting to just remove the "!isRedo" condition, but then we have > another problem: if bgwriter hasn't been started yet, and the shmem > queue is full, we get stuck in register_unlink() trying to send the > message and failing. > > In archive recovery, we always start bgwriter at the beginning of WAL > replay. In crash recovery, we don't start bgwriter until the end of WAL > replay. So we could change the "!isRedo" condition to > "!InArchiveRecovery". It's not a very clean solution, but it's simple. That seems to work for me, though I have some doubts as to the way two-phase commit is coded. 2PC seems to assume that if a file still exists we must be in recovery and it's OK to ignore. Clean? We've changed the conditions under which the unlink needs to be registered, and !InArchiveRecovery defines the changed conditions precisely. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2009-06-25 at 11:31 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > Heikki Linnakangas wrote: > >> Hmm, what happens when the startup process performs a write, and > >> bgwriter is not running? Do the fsync requests queue up in the shmem > >> queue until the end of recovery when bgwriter is launched? I guess I'll > >> have to try it out... > > > Oh dear, doesn't look good. The startup process has a pendingOpsTable of > > its own. bgwriter won't fsync() files that the startup process has > > written itself. That needs to be fixed, or you can lose data when an > > archive recovery crashes after a restartpoint. > > Ouch. I'm beginning to think that the best thing is to temporarily > revert the change that made bgwriter active during recovery. That seems the safest course, to avoid derailing the schedule. Please define "temporarily". Will it be re-enabled in 8.4.1, assuming we find an acceptable fix? > It's > obviously not been adequately thought through or tested. Hmmm... -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Heikki Linnakangas wrote: >>> Hmm, what happens when the startup process performs a write, and >>> bgwriter is not running? Do the fsync requests queue up in the shmem >>> queue until the end of recovery when bgwriter is launched? I guess I'll >>> have to try it out... > >> Oh dear, doesn't look good. The startup process has a pendingOpsTable of >> its own. bgwriter won't fsync() files that the startup process has >> written itself. That needs to be fixed, or you can lose data when an >> archive recovery crashes after a restartpoint. > > Ouch. I'm beginning to think that the best thing is to temporarily > revert the change that made bgwriter active during recovery. It's > obviously not been adequately thought through or tested. That was my first thought too, but unfortunately we now rely on bgwriter to perform restartpoints :-(. I came up with the attached patch, which includes Simon's patch to have all fsync requests forwarded to bgwriter during archive recovery. To fix the startup checkpoint issue, startup process requests a forced restartpoint, which will flush any fsync requests bgwriter has accumulated, before doing the actual checkpoint in the startup process. This is completely untested still, but does anyone immediately see any more problems? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4fd9d41..5114664 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5156,6 +5156,7 @@ StartupXLOG(void) XLogRecord *record; uint32 freespace; TransactionId oldestActiveXID; + bool bgwriterLaunched = false; XLogCtl->SharedRecoveryInProgress = true; @@ -5472,7 +5473,11 @@ StartupXLOG(void) * process in addition to postmaster! */ if (InArchiveRecovery && IsUnderPostmaster) + { + SetForwardFsyncRequests(); SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED); + bgwriterLaunched = true; + } /* * main redo apply loop @@ -5742,7 +5747,16 @@ StartupXLOG(void) * assigning a new TLI, using a shutdown checkpoint allows us to have * the rule that TLI only changes in shutdown checkpoints, which * allows some extra error checking in xlog_redo. + * + * If bgwriter is active, we can't do the necessary fsyncs in the + * startup process because bgwriter has accumulated fsync requests + * into its private memory. So we request a last forced restart point, + * which will flush them to disk, before performing the actual + * checkpoint. */ + if (bgwriterLaunched) + RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | + CHECKPOINT_WAIT); CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); /* @@ -6219,27 +6233,11 @@ CreateCheckPoint(int flags) /* * Acquire CheckpointLock to ensure only one checkpoint happens at a time. - * During normal operation, bgwriter is the only process that creates - * checkpoints, but at the end of archive recovery, the bgwriter can be - * busy creating a restartpoint while the startup process tries to perform - * the startup checkpoint. + * (This is just pro forma, since in the present system structure there is + * only one process that is allowed to issue checkpoints at any given + * time.) */ - if (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE)) - { - Assert(InRecovery); - - /* - * A restartpoint is in progress. Wait until it finishes. 
This can - * cause an extra restartpoint to be performed, but that's OK because - * we're just about to perform a checkpoint anyway. Flushing the - * buffers in this restartpoint can take some time, but that time is - * saved from the upcoming checkpoint so the net effect is zero. - */ - ereport(DEBUG2, (errmsg("hurrying in-progress restartpoint"))); - RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT); - - LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); - } + LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); /* * Prepare to accumulate statistics. @@ -6660,8 +6658,9 @@ CreateRestartPoint(int flags) * restartpoint. It's assumed that flushing the buffers will do that as a * side-effect. */ - if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) || - XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo)) + if (!(flags & CHECKPOINT_FORCE) && + (XLogRecPtrIsInvalid(lastCheckPointRecPtr) || + XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))) { XLogRecPtr InvalidXLogRecPtr = {0, 0}; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index d42e86a..b01b2ab 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -204,6 +204,21 @@ mdinit(void) } /* + * In archive recovery, we rely on bgwriter to do fsyncs(), but we don't + * know that we do archive recovery at process startup when pendingOpsTable + * has already been created. Calling this function drops pendingOpsTable + * and causes any subsequent requests to be forwarded to bgwriter. + */ +void +SetForwardFsyncRequests(void) +{ + /* Perform any pending ops we may have queued up */ + if (pendingOpsTable) + mdsync(); + pendingOpsTable = NULL; +} + +/* * mdexists() -- Does the physical file exist? * * Note: this will return true for lingering files, with pending deletions diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index 7556b14..fd79d7b 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -109,6 +109,7 @@ extern void mdpreckpt(void); extern void mdsync(void); extern void mdpostckpt(void); +extern void SetForwardFsyncRequests(void); extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno); extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
On Thu, 2009-06-25 at 18:12 +0300, Heikki Linnakangas wrote: > A short fix would be to have bgwriter do the shutdown checkpoint instead > in archive recovery. I don't recall if there was a reason it wasn't > coded like that to begin with, though. I think the problem was that it was coded both ways at different stages of patch evolution. The decision to retain the shutdown checkpoint by the startup process was taken in January, IIRC. Having startup process issue this if (InArchiveRecovery) RequestCheckpoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE | CHECKPOINT_WAIT) else should make the startup process wait for bgwriter to complete the checkpoint. But we need to set LocalRecoveryInProgress appropriately also. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2009-06-25 at 18:57 +0300, Heikki Linnakangas wrote: > I came up with the attached patch, which includes Simon's patch to > have all fsync requests forwarded to bgwriter during archive recovery. > To fix the startup checkpoint issue, startup process requests a forced > restartpoint, which will flush any fsync requests bgwriter has > accumulated, before doing the actual checkpoint in the startup > process. That looks fine. > This is completely untested still, but does anyone immediately see any > more problems? Nothing seen. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > On Thu, 2009-06-25 at 18:12 +0300, Heikki Linnakangas wrote: > >> A short fix would be to have bgwriter do the shutdown checkpoint instead >> in archive recovery. I don't recall if there was a reason it wasn't >> coded like that to begin with, though. > > I think the problem was that it was coded both ways at different stages > of patch evolution. The decision to retain the shutdown checkpoint by > the startup process was taken in January, IIRC. > > Having startup process issue this > > if (InArchiveRecovery) > RequestCheckpoint(CHECKPOINT_IS_SHUTDOWN | > CHECKPOINT_FORCE | > CHECKPOINT_WAIT) > else > > should make the startup process wait for bgwriter to complete the > checkpoint. But we need to set LocalRecoveryInProgress appropriately > also. Yeah, the trouble is to tell bgwriter that it's OK for it to create the checkpoint, which includes writing a WAL record, while still keeping the system "in-recovery" for all other purposes. While we could just relax the checks, that seems like a very important safeguard. (I posted in the other mail to do a restartpoint before the startup process does the checkpoint) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-06-25 at 19:20 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > > But we need to set LocalRecoveryInProgress appropriately > > also. > > Yeah, the trouble is to tell bgwriter that it's OK for it to create the > checkpoint, which includes writing a WAL record, while still keeping the > system "in-recovery" for all other purposes. While we could just relax > the checks, that seems like a very important safeguard. > > (I posted in the other mail to do a restartpoint before the startup > process does the checkpoint) I think we're in strong agreement, just thinking and writing emails in realtime gives a few race conditions... your patch covers everything I mentioned above. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > This is completely untested still, but does anyone immediately see any > more problems? Isn't SetForwardFsyncRequests going to cause the final mdsync to fail? regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> This is completely untested still, but does anyone immediately see any >> more problems? > > Isn't SetForwardFsyncRequests going to cause the final mdsync to fail? Yes, I just noticed that myself, testing it for the first time. That check needs to be suppressed in the startup checkpoint. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-06-25 at 12:33 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > This is completely untested still, but does anyone immediately see any > > more problems? > > Isn't SetForwardFsyncRequests going to cause the final mdsync to fail? Yes it will, as it is now. I didn't read the parts of the patch that were described as being from my patch. Heikki, please tell me if you change things... it means I have to be both patch author and reviewer. Perhaps just separate patches, so we can review them independently. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> Isn't SetForwardFsyncRequests going to cause the final mdsync to fail? > Yes, I just noticed that myself, testing it for the first time. That > check needs to be suppressed in the startup checkpoint. Ugh. Better would be to move the responsibility for the final checkpoint to the bgwriter. However, this whole line of thought still seems to be leading towards "fix the patch", and I don't have much confidence that we can have a trustworthy fix today. What about "revert the patch"? regards, tom lane
On Thu, 2009-06-25 at 19:37 +0300, Heikki Linnakangas wrote: > Tom Lane wrote: > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >> This is completely untested still, but does anyone immediately see any > >> more problems? > > > > Isn't SetForwardFsyncRequests going to cause the final mdsync to fail? > > Yes, I just noticed that myself, testing it for the first time. That > check needs to be suppressed in the startup checkpoint. Better to do it as it was in my patch. We can turn it off/on then. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2009-06-25 at 12:43 -0400, Tom Lane wrote: > What about "revert the patch"? That's probably just as dangerous. Much easier to just avoid that state altogether, if you must. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > On Thu, 2009-06-25 at 12:43 -0400, Tom Lane wrote: > >> What about "revert the patch"? > > That's probably just as dangerous. I don't feel comfortable either reverting such a big patch at last minute. Would need a fair amount of testing to make sure that the revertion is correct and that no patches committed after the patch are now broken. I'm testing the attached patch at the moment. It's the same as the previous one, with the elog() in mdsync() issue fixed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4fd9d41..274369f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5156,6 +5156,7 @@ StartupXLOG(void) XLogRecord *record; uint32 freespace; TransactionId oldestActiveXID; + bool bgwriterLaunched = false; XLogCtl->SharedRecoveryInProgress = true; @@ -5472,7 +5473,11 @@ StartupXLOG(void) * process in addition to postmaster! */ if (InArchiveRecovery && IsUnderPostmaster) + { + SetForwardFsyncRequests(); SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED); + bgwriterLaunched = true; + } /* * main redo apply loop @@ -5742,7 +5747,17 @@ StartupXLOG(void) * assigning a new TLI, using a shutdown checkpoint allows us to have * the rule that TLI only changes in shutdown checkpoints, which * allows some extra error checking in xlog_redo. + * + * If bgwriter is active, we can't do the necessary fsyncs in the + * startup process because bgwriter has accumulated fsync requests + * into its private memory. So we request a last forced restart point, + * which will flush them to disk, before performing the actual + * checkpoint. It's safe to flush the buffers first, because no other + * process can do (WAL-logged) changes to data pages in between. */ + if (bgwriterLaunched) + RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | + CHECKPOINT_WAIT); CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); /* @@ -6219,27 +6234,11 @@ CreateCheckPoint(int flags) /* * Acquire CheckpointLock to ensure only one checkpoint happens at a time. - * During normal operation, bgwriter is the only process that creates - * checkpoints, but at the end of archive recovery, the bgwriter can be - * busy creating a restartpoint while the startup process tries to perform - * the startup checkpoint. + * (This is just pro forma, since in the present system structure there is + * only one process that is allowed to issue checkpoints at any given + * time.) */ - if (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE)) - { - Assert(InRecovery); - - /* - * A restartpoint is in progress. Wait until it finishes. This can - * cause an extra restartpoint to be performed, but that's OK because - * we're just about to perform a checkpoint anyway. Flushing the - * buffers in this restartpoint can take some time, but that time is - * saved from the upcoming checkpoint so the net effect is zero. - */ - ereport(DEBUG2, (errmsg("hurrying in-progress restartpoint"))); - RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT); - - LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); - } + LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); /* * Prepare to accumulate statistics. @@ -6660,8 +6659,9 @@ CreateRestartPoint(int flags) * restartpoint. It's assumed that flushing the buffers will do that as a * side-effect. 
*/ - if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) || - XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo)) + if (!(flags & CHECKPOINT_FORCE) && + (XLogRecPtrIsInvalid(lastCheckPointRecPtr) || + XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))) { XLogRecPtr InvalidXLogRecPtr = {0, 0}; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index d42e86a..56824d5 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -204,6 +204,21 @@ mdinit(void) } /* + * In archive recovery, we rely on bgwriter to do fsyncs(), but we don't + * know that we do archive recovery at process startup when pendingOpsTable + * has already been created. Calling this function drops pendingOpsTable + * and causes any subsequent requests to be forwarded to bgwriter. + */ +void +SetForwardFsyncRequests(void) +{ + /* Perform any pending ops we may have queued up */ + if (pendingOpsTable) + mdsync(); + pendingOpsTable = NULL; +} + +/* * mdexists() -- Does the physical file exist? * * Note: this will return true for lingering files, with pending deletions @@ -908,9 +923,18 @@ mdsync(void) /* * This is only called during checkpoints, and checkpoints should only * occur in processes that have created a pendingOpsTable. + * + * Startup process performing a startup checkpoint after archive recovery + * is an exception. It has no pendingOpsTable, but that's OK because it + * has requested bgwriter to perform a restartpoint before the checkpoint + * which does mdsync() instead. */ if (!pendingOpsTable) + { + if (InRecovery) + return; elog(ERROR, "cannot sync without a pendingOpsTable"); + } /* * If we are in the bgwriter, the sync had better include all fsync diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index 7556b14..fd79d7b 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -109,6 +109,7 @@ extern void mdpreckpt(void); extern void mdsync(void); extern void mdpostckpt(void); +extern void SetForwardFsyncRequests(void); extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno); extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > It's tempting to just remove the "!isRedo" condition, but then we have > another problem: if bgwriter hasn't been started yet, and the shmem > queue is full, we get stuck in register_unlink() trying to send the > message and failing. > In archive recovery, we always start bgwriter at the beginning of WAL > replay. In crash recovery, we don't start bgwriter until the end of WAL > replay. So we could change the "!isRedo" condition to > "!InArchiveRecovery". It's not a very clean solution, but it's simple. I looked through the callers of smgrdounlink(), and found that FinishPreparedTransaction passes isRedo = true. I'm not sure at the moment what the reasoning behind that was, but it does seem to mean that checking InArchiveRecovery instead of isRedo may not be quite right. regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Simon Riggs wrote: >> On Thu, 2009-06-25 at 12:43 -0400, Tom Lane wrote: >>> What about "revert the patch"? >> >> That's probably just as dangerous. > I don't feel comfortable either reverting such a big patch at last > minute. Yeah, I'm not very happy with that either. However, I've still got no confidence in anything proposed so far. > I'm testing the attached patch at the moment. It's the same as the > previous one, with the elog() in mdsync() issue fixed. This seems like a kluge on top of a hack. Can't we have the bgwriter do the final checkpoint instead? regards, tom lane
On Thu, 2009-06-25 at 13:25 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > It's tempting to just remove the "!isRedo" condition, but then we have > > another problem: if bgwriter hasn't been started yet, and the shmem > > queue is full, we get stuck in register_unlink() trying to send the > > message and failing. > > > In archive recovery, we always start bgwriter at the beginning of WAL > > replay. In crash recovery, we don't start bgwriter until the end of WAL > > replay. So we could change the "!isRedo" condition to > > "!InArchiveRecovery". It's not a very clean solution, but it's simple. > > I looked through the callers of smgrdounlink(), and found that > FinishPreparedTransaction passes isRedo = true. I'm not sure at the > moment what the reasoning behind that was, but it does seem to mean that > checking InArchiveRecovery instead of isRedo may not be quite right. I think that's because FinishPreparedTransaction() implicitly assumes that if a file still exists we are in recovery. I can't comment on whether that is necessarily true for some reason, but it doesn't sound like it is a safe assumption. I would guess it's using isRedo as an implicit override rather than using the correct meaning of the variable. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > On Thu, 2009-06-25 at 13:25 -0400, Tom Lane wrote: >> I looked through the callers of smgrdounlink(), and found that >> FinishPreparedTransaction passes isRedo = true. I'm not sure at the >> moment what the reasoning behind that was, but it does seem to mean that >> checking InArchiveRecovery instead of isRedo may not be quite right. > > I think that's because FinishPreparedTransaction() implicitly assumes > that if a file still exists we are in recovery. I can't comment on > whether that is necessarily true for some reason, but it doesn't sound > like it is a safe assumption. I would guess it's using isRedo as an > implicit override rather than using the correct meaning of the variable. It looks like a copy-pasto. In 8.3 it used to be: > for (i = 0; i < hdr->ncommitrels; i++) > smgrdounlink(smgropen(commitrels[i]), false, false); and now it's: > for (fork = 0; fork <= MAX_FORKNUM; fork++) > { > if (smgrexists(srel, fork)) > { > XLogDropRelation(delrels[i], fork); > smgrdounlink(srel, fork, false, true); > } > } That clearly assumes that we're in recovery, hence the XLogDropRelation call, but FinishPreparedTransaction is never called in recovery. I don't think this is related to the recovery bug we have at hand. Still needs to be fixed, though it's not that urgent. AFAICS it causes no other ill effect except that we don't complain if the file doesn't exist, and we don't leave it lingering like we do for other files to avoid the OID wraparound issue. I'll fix this as a separate patch. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
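One plausible shape of that separate fix (a sketch of the idea, not the actual commit): drop the recovery-only XLogDropRelation() call and pass isRedo = false, restoring the pre-8.4 behaviour for the commit-time unlinks.

 	for (fork = 0; fork <= MAX_FORKNUM; fork++)
 	{
 		if (smgrexists(srel, fork))
-		{
-			XLogDropRelation(delrels[i], fork);
-			smgrdounlink(srel, fork, false, true);
-		}
+			smgrdounlink(srel, fork, false, false);
 	}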
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Simon Riggs wrote: >>> On Thu, 2009-06-25 at 12:43 -0400, Tom Lane wrote: >>>> What about "revert the patch"? >>> That's probably just as dangerous. > >> I don't feel comfortable either reverting such a big patch at last >> minute. > > Yeah, I'm not very happy with that either. However, I've still got no > confidence in anything proposed so far. > >> I'm testing the attached patch at the moment. It's the same as the >> previous one, with the elog() in mdsync() issue fixed. > > This seems like a kluge on top of a hack. Can't we have the bgwriter > do the final checkpoint instead? Here's a patch taking that approach, and I think it's better than the previous one. I was afraid we would lose robustness if we have to set the shared state as "out of recovery" before requesting the checkpoint, but we can use the same trick we were using in startup process and set LocalRecoveryInProgress=false before setting the shared variable. I introduced a new CHECKPOINT_IS_STARTUP flag, which is otherwise identical to CHECKPOINT_IS_SHUTDOWN, but in a startup checkpoint CreateCheckPoint() excpects to be called while recovery is still active, and sets LocalRecoveryInProgress=false. It also tells bgwriter that it needs to do a checkpoint instead of a restartpoint, even though recovery is still in progress. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4fd9d41..18c47aa 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5156,6 +5156,7 @@ StartupXLOG(void) XLogRecord *record; uint32 freespace; TransactionId oldestActiveXID; + bool bgwriterLaunched; XLogCtl->SharedRecoveryInProgress = true; @@ -5472,7 +5473,11 @@ StartupXLOG(void) * process in addition to postmaster! */ if (InArchiveRecovery && IsUnderPostmaster) + { + SetForwardFsyncRequests(); SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED); + bgwriterLaunched = true; + } /* * main redo apply loop @@ -5709,12 +5714,6 @@ StartupXLOG(void) /* Pre-scan prepared transactions to find out the range of XIDs present */ oldestActiveXID = PrescanPreparedTransactions(); - /* - * Allow writing WAL for us, so that we can create a checkpoint record. - * But not yet for other backends! - */ - LocalRecoveryInProgress = false; - if (InRecovery) { int rmid; @@ -5743,7 +5742,11 @@ StartupXLOG(void) * the rule that TLI only changes in shutdown checkpoints, which * allows some extra error checking in xlog_redo. */ - CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); + if (bgwriterLaunched) + RequestCheckpoint(CHECKPOINT_IS_STARTUP | CHECKPOINT_IMMEDIATE | + CHECKPOINT_WAIT); + else + CreateCheckPoint(CHECKPOINT_IS_STARTUP | CHECKPOINT_IMMEDIATE); /* * And finally, execute the recovery_end_command, if any. @@ -5806,7 +5809,7 @@ StartupXLOG(void) } /* - * All done. Allow others to write WAL. + * All done. Allow backends to write WAL. */ XLogCtl->SharedRecoveryInProgress = false; } @@ -6123,12 +6126,13 @@ LogCheckpointStart(int flags, bool restartpoint) * the main message, but what about all the flags? */ if (restartpoint) - msg = "restartpoint starting:%s%s%s%s%s%s"; + msg = "restartpoint starting:%s%s%s%s%s%s%s"; else - msg = "checkpoint starting:%s%s%s%s%s%s"; + msg = "checkpoint starting:%s%s%s%s%s%s%s"; elog(LOG, msg, (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "", + (flags & CHECKPOINT_IS_STARTUP) ? 
" startup" : "", (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "", (flags & CHECKPOINT_FORCE) ? " force" : "", (flags & CHECKPOINT_WAIT) ? " wait" : "", @@ -6190,10 +6194,12 @@ LogCheckpointEnd(bool restartpoint) * * flags is a bitwise OR of the following: * CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown. + * CHECKPOINT_IS_STARTUP: checkpoint is for database startup. * CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP, * ignoring checkpoint_completion_target parameter. * CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured - * since the last one (implied by CHECKPOINT_IS_SHUTDOWN). + * since the last one (implied by CHECKPOINT_IS_SHUTDOWN and + * CHECKPOINT_IS_STARTUP). * * Note: flags contains other bits, of interest here only for logging purposes. * In particular note that this routine is synchronous and does not pay @@ -6202,7 +6208,7 @@ LogCheckpointEnd(bool restartpoint) void CreateCheckPoint(int flags) { - bool shutdown = (flags & CHECKPOINT_IS_SHUTDOWN) != 0; + bool shutdown; CheckPoint checkPoint; XLogRecPtr recptr; XLogCtlInsert *Insert = &XLogCtl->Insert; @@ -6213,35 +6219,40 @@ CreateCheckPoint(int flags) TransactionId *inCommitXids; int nInCommit; - /* shouldn't happen */ - if (RecoveryInProgress()) - elog(ERROR, "can't create a checkpoint during recovery"); + /* + * A startup checkpoint is really a shutdown checkpoint, just issued at + * a different time. + */ + shutdown = (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IS_STARTUP)) != 0; /* - * Acquire CheckpointLock to ensure only one checkpoint happens at a time. - * During normal operation, bgwriter is the only process that creates - * checkpoints, but at the end of archive recovery, the bgwriter can be - * busy creating a restartpoint while the startup process tries to perform - * the startup checkpoint. + * A startup checkpoint is created before anyone else is allowed to + * write WAL. To allow us to write the checkpoint record, set + * LocalRecoveryInProgress to false. This lets us write WAL, but others + * are still not allowed to do so. */ - if (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE)) + if (flags & CHECKPOINT_IS_STARTUP) { - Assert(InRecovery); - - /* - * A restartpoint is in progress. Wait until it finishes. This can - * cause an extra restartpoint to be performed, but that's OK because - * we're just about to perform a checkpoint anyway. Flushing the - * buffers in this restartpoint can take some time, but that time is - * saved from the upcoming checkpoint so the net effect is zero. - */ - ereport(DEBUG2, (errmsg("hurrying in-progress restartpoint"))); - RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT); - - LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); + Assert(RecoveryInProgress()); + LocalRecoveryInProgress = false; + InitXLOGAccess(); + } + else + { + /* shouldn't happen */ + if (RecoveryInProgress()) + elog(ERROR, "can't create a checkpoint during recovery"); } /* + * Acquire CheckpointLock to ensure only one checkpoint happens at a time. + * (This is just pro forma, since in the present system structure there is + * only one process that is allowed to issue checkpoints at any given + * time.) + */ + LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); + + /* * Prepare to accumulate statistics. * * Note: because it is possible for log_checkpoints to change while a @@ -6298,7 +6309,8 @@ CreateCheckPoint(int flags) * the end of the last checkpoint record, and its redo pointer must point * to itself. 
*/ - if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE)) == 0) + if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IS_STARTUP | + CHECKPOINT_FORCE)) == 0) { XLogRecPtr curInsert; @@ -6528,7 +6540,7 @@ CreateCheckPoint(int flags) * in subtrans.c). During recovery, though, we mustn't do this because * StartupSUBTRANS hasn't been called yet. */ - if (!InRecovery) + if (!InRecovery && (flags & CHECKPOINT_IS_STARTUP) == 0) TruncateSUBTRANS(GetOldestXmin(true, false)); /* All real work is done, but log before releasing lock. */ diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index 55dff57..2c42584 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -449,6 +449,13 @@ BackgroundWriterMain(void) SpinLockRelease(&bgs->ckpt_lck); /* + * A startup checkpoint is a real checkpoint that's performed + * while we're still in recovery. + */ + if (flags & CHECKPOINT_IS_STARTUP) + do_restartpoint = false; + + /* * We will warn if (a) too soon since last checkpoint (whatever * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag * since the last checkpoint start. Note in particular that this @@ -895,10 +902,12 @@ BgWriterShmemInit(void) * * flags is a bitwise OR of the following: * CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown. + * CHECKPOINT_IS_STARTUP: checkpoint is for database startup. * CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP, * ignoring checkpoint_completion_target parameter. * CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured - * since the last one (implied by CHECKPOINT_IS_SHUTDOWN). + * since the last one (implied by CHECKPOINT_IS_SHUTDOWN and + * CHECKPOINT_IS_STARTUP). * CHECKPOINT_WAIT: wait for completion before returning (otherwise, * just signal bgwriter to do it, and return). * CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling. diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index d42e86a..b01b2ab 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -204,6 +204,21 @@ mdinit(void) } /* + * In archive recovery, we rely on bgwriter to do fsyncs(), but we don't + * know that we do archive recovery at process startup when pendingOpsTable + * has already been created. Calling this function drops pendingOpsTable + * and causes any subsequent requests to be forwarded to bgwriter. + */ +void +SetForwardFsyncRequests(void) +{ + /* Perform any pending ops we may have queued up */ + if (pendingOpsTable) + mdsync(); + pendingOpsTable = NULL; +} + +/* * mdexists() -- Does the physical file exist? 
* * Note: this will return true for lingering files, with pending deletions diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index f8720bb..d206b93 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -166,6 +166,7 @@ extern bool XLOG_DEBUG; /* These indicate the cause of a checkpoint request */ #define CHECKPOINT_CAUSE_XLOG 0x0010 /* XLOG consumption */ #define CHECKPOINT_CAUSE_TIME 0x0020 /* Elapsed time */ +#define CHECKPOINT_IS_STARTUP 0x0040 /* Checkpoint is for startup */ /* Checkpoint statistics */ typedef struct CheckpointStatsData diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index 7556b14..fd79d7b 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -109,6 +109,7 @@ extern void mdpreckpt(void); extern void mdsync(void); extern void mdpostckpt(void); +extern void SetForwardFsyncRequests(void); extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno); extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Here's a patch taking that approach, and I think it's better than the > previous one. I was afraid we would lose robustness if we have to set > the shared state as "out of recovery" before requesting the checkpoint, > but we can use the same trick we were using in startup process and set > LocalRecoveryInProgress=false before setting the shared variable. I > introduced a new CHECKPOINT_IS_STARTUP flag, This isn't a "startup" checkpoint, it's an "end of recovery" checkpoint, because we don't do it during normal startup. Not sure about an equally concise name based on that, but I don't like IS_STARTUP. On the other point: are we going to eliminate mdunlink's isRedo parameter? Maybe the better thing is to have its callers pass the value of InArchiveRecovery? regards, tom lane
On Thu, 2009-06-25 at 14:46 -0400, Tom Lane wrote: > On the other point: are we going to eliminate mdunlink's isRedo > parameter? Maybe the better thing is to have its callers pass the value > of InArchiveRecovery? We have one caller that is using a parameter incorrectly. It seems we should correct the caller rather than change the API and potentially affect all callers. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Here's a patch taking that approach, and I think it's better than the >> previous one. I was afraid we would lose robustness if we have to set >> the shared state as "out of recovery" before requesting the checkpoint, >> but we can use the same trick we were using in startup process and set >> LocalRecoveryInProgress=false before setting the shared variable. I >> introduced a new CHECKPOINT_IS_STARTUP flag, > > This isn't a "startup" checkpoint, it's an "end of recovery" checkpoint, > because we don't do it during normal startup. Not sure about an equally > concise name based on that, but I don't like IS_STARTUP. > > On the other point: are we going to eliminate mdunlink's isRedo > parameter? Maybe the better thing is to have its callers pass the value > of InArchiveRecovery? I think my initial analysis of this bug was bogus. There's nothing wrong with mdunlink() as it is: it calls ForgetRelationFsyncRequests() before it unlinks anything, regardless of the isRedo parameter. In Fujii-san's scenario, the request was just going to the pendingOpsTable of the startup process and not sent to the bgwriter as it should have been. Setting pendingOpsTable=NULL when bgwriter is launched will fix that. I somehow confused register_unlink() and ForgetRelationFsyncRequests() and thought that we need the register_unlink() call when bgwriter is active, but we don't. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
While nosing around the problem areas, I think I've found yet another issue here. The global bool InRecovery is only maintained correctly in the startup process, which wasn't a problem before 8.4. However, if we are making the bgwriter execute the end-of-recovery checkpoint, there are multiple places where it is tested that are going to be executed by bgwriter. I think (but am not 100% sure) that these are all the at-risk references: XLogFlush, CheckPointMultiXact, and CreateCheckPoint (2 places). Heikki's latest patch deals with the tests in CreateCheckPoint (rather klugily IMO) but not the others. I think it might be better to fix things so that InRecovery is maintained correctly in the bgwriter too. regards, tom lane
On Thu, 2009-06-25 at 21:59 +0300, Heikki Linnakangas wrote: > I think my initial analysis of this bug was bogus. Perhaps, but the pendingOpsTable issue is a serious one, so we haven't wasted our time by fixing it. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> On the other point: are we going to eliminate mdunlink's isRedo >> parameter? Maybe the better thing is to have its callers pass the value >> of InArchiveRecovery? > I think my initial analysis of this bug was bogus. There's nothing wrong > with mdunlink() as it is: it calls ForgetRelationFsyncRequests() > before it unlinks anything, regardless of the isRedo parameter. In > Fujii-san's scenario, the request was just going to the pendingOpsTable > of the startup process and not sent to the bgwriter as it should have > been. Setting pendingOpsTable=NULL when bgwriter is launched will fix that. +1, I think that's okay. So to summarize the state of play, it seems we have these issues: * need to delete startup process's local pendingOpsTable once bgwriter is launched, so that requests go to bgwriter instead * need to push end-of-recovery checkpoint into bgwriter * need to do something about InRecovery tests that will now be executed in bgwriter * need to fix mistaken code in FinishPreparedTransaction Anything else? Are you (Heikki and Simon) in a position to finish these things today? I know it's starting to get late on your side of the pond. Core have already been discussing whether we have to abort the release for this. regards, tom lane
Tom Lane wrote: > While nosing around the problem areas, I think I've found yet another > issue here. The global bool InRecovery is only maintained correctly > in the startup process, which wasn't a problem before 8.4. However, > if we are making the bgwriter execute the end-of-recovery checkpoint, > there are multiple places where it is tested that are going to be > executed by bgwriter. I think (but am not 100% sure) that these > are all the at-risk references: > XLogFlush > CheckPointMultiXact > CreateCheckPoint (2 places) > Heikki's latest patch deals with the tests in CreateCheckPoint (rather > klugily IMO) but not the others. I think it might be better to fix > things so that InRecovery is maintained correctly in the bgwriter too. We could set InRecovery=true in CreateCheckPoint if it's a startup checkpoint, and reset it afterwards. I'm not 100% sure it's safe to have bgwriter running with InRecovery=true at other times. Grepping for InRecovery doesn't show anything that bgwriter calls, but it feels safer that way. Hmm, I see another small issue. We now keep track of the "minimum recovery point". Whenever a data page is flushed, we set minimum recovery point to the LSN of the page in XLogFlush(), instead of fsyncing WAL like we do in normal operation. During the end-of-recovery checkpoint, however, RecoveryInProgress() returns false, so we don't update minimum recovery point in XLogFlush(). You're unlikely to be bitten by that in practice; you would need to crash during the end-of-recovery checkpoint, and then set the recovery target to an earlier point. It should be fixed nevertheless. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
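For readers following along, the branch in question sits near the top of XLogFlush() and, paraphrased from Heikki's description (function names and the extra argument are approximations), looks something like this. The tension is that the test must keep routing data-page flushes to the minimum-recovery-point bookkeeping while recovery is in progress, yet must really flush WAL once the end-of-recovery checkpoint record itself is written.

/* Paraphrased sketch of the start of XLogFlush(); not the committed text */
void
XLogFlush(XLogRecPtr record)
{
	/*
	 * During replay we read WAL rather than write it, so instead of
	 * flushing, remember that replay must reach 'record' before the data
	 * page carrying this LSN can be trusted on disk.
	 */
	if (RecoveryInProgress())	/* the test whose behaviour is at issue */
	{
		UpdateMinRecoveryPoint(record, false);
		return;
	}

	/* ... normal case: flush WAL through 'record' ... */
}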
Tom Lane wrote: > +1, I think that's okay. So to summarize the state of play, it seems > we have these issues: > > * need to delete startup process's local pendingOpsTable once bgwriter > is launched, so that requests go to bgwriter instead > > * need to push end-of-recovery checkpoint into bgwriter > > * need to do something about InRecovery tests that will now be executed > in bgwriter Yep, that's all I have in mind. > * need to fix mistaken code in FinishPreparedTransaction This one I committed already. > Are you (Heikki and Simon) in a position to finish these things today? > I know it's starting to get late on your side of the pond. Core have > already been discussing whether we have to abort the release for this. I'm planning to finish this today still. Assuming no new issues are found, it won't take long to finish the patch, and then it needs an hour or two of testing. If there are any new issues, it will have to wait till tomorrow, since after a few hours I won't be able to think straight anymore. I'm not familiar with the release process, so I don't have an opinion on what to do about the release. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Here's a patch taking that approach, and I think it's better than the >> previous one. I was afraid we would lose robustness if we have to set >> the shared state as "out of recovery" before requesting the checkpoint, >> but we can use the same trick we were using in startup process and set >> LocalRecoveryInProgress=false before setting the shared variable. I >> introduced a new CHECKPOINT_IS_STARTUP flag, > > This isn't a "startup" checkpoint, it's an "end of recovery" checkpoint, > because we don't do it during normal startup. Not sure about an equally > concise name based on that, but I don't like IS_STARTUP. Ok, END_OF_RECOVERY works for me. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
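Relative to the xlog.h hunk in the patch above, the agreed rename is mechanical, roughly (sketch only, with the flag's users renamed to match):

-#define CHECKPOINT_IS_STARTUP		0x0040	/* Checkpoint is for startup */
+#define CHECKPOINT_END_OF_RECOVERY	0x0040	/* Like shutdown checkpoint, but
+											 * issued at the end of WAL
+											 * recovery */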
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> ... I think it might be better to fix >> things so that InRecovery is maintained correctly in the bgwriter too. > We could set InRecovery=true in CreateCheckPoint if it's a startup > checkpoint, and reset it afterwards. I'm not 100% sure it's safe to have > bgwriter running with InRecovery=true at other times. Grepping for > InRecovery doesn't show anything that bgwriter calls, but it feels safer > that way. Actually, my thought was exactly that it would be better if it was set correctly earlier in the run --- if there ever are any places where it matters, this way is more likely to be right. (I'm not convinced that it doesn't matter today, anyhow --- are we sure these places are not called in a restartpoint?) What you suggest might be a good minimum-change solution for right now, but I think eventually it should work the other way. > Hmm, I see another small issue. We now keep track of the "minimum > recovery point". Whenever a data page is flushed, we set minimum > recovery point to the LSN of the page in XLogFlush(), instead of > fsyncing WAL like we do in normal operation. During the end-of-recovery > checkpoint, however, RecoveryInProgress() returns false, so we don't > update minimum recovery point in XLogFlush(). You're unlikely to be > bitten by that in practice; you would need to crash during the > end-of-recovery checkpoint, and then set the recovery target to an > earlier point. It should be fixed nevertheless. We would want the end-of-recovery checkpoint to act like it's not in recovery anymore for this purpose, no? regards, tom lane
I wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Hmm, I see another small issue. We now keep track of the "minimum >> recovery point". Whenever a data page is flushed, we set minimum >> recovery point to the LSN of the page in XLogFlush(), instead of >> fsyncing WAL like we do in normal operation. > We would want the end-of-recovery checkpoint to act like it's not in > recovery anymore for this purpose, no? Actually, what in the world is the purpose of that code at all? I can't see a case where it would trigger that wouldn't be a result of data corruption rather than a real need to advance the min restart point. regards, tom lane
Tom Lane wrote: > I wrote: >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >>> Hmm, I see another small issue. We now keep track of the "minimum >>> recovery point". Whenever a data page is flushed, we set minimum >>> recovery point to the LSN of the page in XLogFlush(), instead of >>> fsyncing WAL like we do in normal operation. > >> We would want the end-of-recovery checkpoint to act like it's not in >> recovery anymore for this purpose, no? > > Actually, what in the world is the purpose of that code at all? > I can't see a case where it would trigger that wouldn't be a result > of data corruption rather than a real need to advance the min restart > point. Huh? The only other place where we advance minimum recovery point is CreateRestartPoint() when we skip the restartpoint. The UpdateMinRecoveryPoint() call in XLogFlush() is what we rely on. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> Actually, what in the world is the purpose of that code at all? > Huh? The only other place where we advance minimum recovery point is > CreateRestartPoint() when we skip the restartpoint. The > UpdateMinRecoveryPoint() call in XLogFlush() is what we rely on. I would like an explanation of why minimum recovery point needs to be advanced at all. If there's any use for it, it is surely part of functionality that is not there in 8.4. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Tom Lane wrote: >>> ... I think it might be better to fix >>> things so that InRecovery is maintained correctly in the bgwriter too. > >> We could set InRecovery=true in CreateCheckPoint if it's a startup >> checkpoint, and reset it afterwards. I'm not 100% sure it's safe to have >> bgwriter running with InRecovery=true at other times. Grepping for >> InRecovery doesn't show anything that bgwriter calls, but it feels safer >> that way. > > Actually, my thought was exactly that it would be better if it was set > correctly earlier in the run --- if there ever are any places where it > matters, this way is more likely to be right. Well, we have RecoveryInProgress() now that answers the question "is recovery still in progress in the system". InRecovery now means "am I a process that's performing WAL replay?". > (I'm not convinced that > it doesn't matter today, anyhow --- are we sure these places are not > called in a restartpoint?) Hmm, good point, I didn't think of restartpoints. But skimming through all the references to InRecovery, I can't see any. >> Hmm, I see another small issue. We now keep track of the "minimum >> recovery point". Whenever a data page is flushed, we set minimum >> recovery point to the LSN of the page in XLogFlush(), instead of >> fsyncing WAL like we do in normal operation. During the end-of-recovery >> checkpoint, however, RecoveryInProgress() returns false, so we don't >> update minimum recovery point in XLogFlush(). You're unlikely to be >> bitten by that in practice; you would need to crash during the >> end-of-recovery checkpoint, and then set the recovery target to an >> earlier point. It should be fixed nevertheless. > > We would want the end-of-recovery checkpoint to act like it's not in > recovery anymore for this purpose, no? For the purpose of updating min recovery point, we want it to act like it *is* still in recovery. But in the XLogFlush() call in CreateCheckPoint(), we really want it to flush the WAL, not update min recovery point. A simple fix is to call UpdateMinRecoveryPoint() after the WAL replay is finished, but before creating the checkpoint. exitArchiveRecovery() seems like a good place. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-06-25 at 15:10 -0400, Tom Lane wrote: > Are you (Heikki and Simon) in a position to finish these things today? Certainly will try. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> We would want the end-of-recovery checkpoint to act like it's not in >> recovery anymore for this purpose, no? > For the purpose of updating min recovery point, we want it to act like > it *is* still in recovery. Well, without a clear explanation of why min recovery point should move at all, I'm unable to evaluate that statement. regards, tom lane
On Thu, 2009-06-25 at 15:10 -0400, Tom Lane wrote: > So to summarize the state of play, it seems > we have these issues: > > * need to delete startup process's local pendingOpsTable once bgwriter > is launched, so that requests go to bgwriter instead Need to ensure that fsync requests are directed to the process that will act on the fsync requests. > * need to push end-of-recovery checkpoint into bgwriter That's probably the easiest thing to do, but the issue is that we must fsync all files mentioned in the pendingOpsTable in *any* process that has been accumulating such requests. > * need to do something about IsRecovery tests that will now be executed > in bgwriter Yes > * need to fix mistaken code in FinishPreparedTransaction Yes -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
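To make the invariant Simon states here concrete (any process that has been accumulating fsync requests must act on them before handing the job over), here is a minimal standalone C model; the names remember_fsync_request and hand_over_to_bgwriter are illustrative stand-ins, not the actual md.c/smgr code:

#include <stdio.h>
#include <stdbool.h>

#define MAX_PENDING 16

static int  pending[MAX_PENDING];    /* stand-in for the local pendingOpsTable */
static int  npending = 0;
static bool bgwriter_running = false;

static void remember_fsync_request(int relfile)
{
    if (bgwriter_running)
        printf("forward fsync request for rel %d to the bgwriter\n", relfile);
    else
        pending[npending++] = relfile;   /* keep it in the local table */
}

/* called once the bgwriter is up: drain what we accumulated ourselves */
static void hand_over_to_bgwriter(void)
{
    for (int i = 0; i < npending; i++)
        printf("startup process fsyncs rel %d itself\n", pending[i]);
    npending = 0;
    bgwriter_running = true;
}

int main(void)
{
    remember_fsync_request(16422);   /* queued before the bgwriter exists */
    hand_over_to_bgwriter();         /* must not simply discard the table */
    remember_fsync_request(16423);   /* later requests go to the bgwriter */
    return 0;
}

The point of the model is only the ordering: the local table is drained, not discarded, before requests start flowing to the bgwriter.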
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Tom Lane wrote: >>> Actually, what in the world is the purpose of that code at all? > >> Huh? The only other place where we advance minimum recovery point is >> CreateRestartPoint() when we skip the restartpoint. The >> UpdateMinRecoveryPoint() call in XLogFlush() is what we rely on. > > I would like an explanation of why minimum recovery point needs to > be advanced at all. If there's any use for it, it is surely part > of functionality that is not there in 8.4. It gives protection against starting up the database too early. It stops the database from starting up if you stop recovery, and move recovery target backwards or delete WAL files from pg_xlog, in a way that's not safe. Of course, that's a case of "don't do that!", but it's a mistake you can make when trying to manually recover to a given point in time. Or if your recovery scripts are broken. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> I would like an explanation of why minimum recovery point needs to >> be advanced at all. If there's any use for it, it is surely part >> of functionality that is not there in 8.4. > It gives protection against starting up the database too early. Protection against what failure scenario? > It stops > the database from starting up if you stop recovery, and move recovery > target backwards or delete WAL files from pg_xlog, in a way that's not > safe. Of course, that's a case of "don't do that!", but it's a mistake > you can make when trying to manually recover to a given point in time. You can't move the target backwards --- it's embedded in pg_control. And even if you could, I don't see what protection this particular mechanism gives you. To the extent that it does anything at all, it's going to be full of holes because it only examines pages that happen to be dirtied by (the current instance of) the recovery process. regards, tom lane
On Thu, 2009-06-25 at 22:29 +0300, Heikki Linnakangas wrote: > Tom Lane wrote: > > While nosing around the problem areas, I think I've found yet another > > issue here. The global bool InRecovery is only maintained correctly > > in the startup process, which wasn't a problem before 8.4. However, > > if we are making the bgwriter execute the end-of-recovery checkpoint, > > there are multiple places where it is tested that are going to be > > executed by bgwriter. I think (but am not 100% sure) that these > > are all the at-risk references: > > XLogFlush > > CheckPointMultiXact > > CreateCheckPoint (2 places) > > Heikki's latest patch deals with the tests in CreateCheckPoint (rather > > klugily IMO) but not the others. I think it might be better to fix > > things so that InRecovery is maintained correctly in the bgwriter too. > > We could set InRecovery=true in CreateCheckPoint if it's a startup > checkpoint, and reset it afterwards. That seems like a bad idea. As you said earlier, On Thu, 2009-06-25 at 23:15 +0300, Heikki Linnakangas wrote: > Well, we have RecoveryInProgress() now that answers the question "is > recovery still in progress in the system". InRecovery now means "am I > a process that's performing WAL replay?". so it would be a mistake to do as you propose above because it changes the meaning of these well-defined parts of the system. * XLogFlush mentions InRecovery right at the end and the correct usage would be RecoveryInProgress() rather than InRecovery. * The usage of InRecovery in CheckPointMultiXact() should also be replaced with RecoveryInProgress() * The two usages of InRecovery in CreateCheckPoint() can also be replaced by RecoveryInProgress() So, yes, there are some places where InRecovery is used in code executed by the bgwriter, but the correct fix is to use RecoveryInProgress(). Trying to set the InRecovery state flag in the bgwriter is a hack, and one that would change the clear meaning of InRecovery to no good purpose. Subsequent discussion on this subthread may no longer be relevant. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Tom Lane wrote: >>> I would like an explanation of why minimum recovery point needs to >>> be advanced at all. If there's any use for it, it is surely part >>> of functionality that is not there in 8.4. > >> It gives protection against starting up the database too early. > > Protection against what failure scenario? > >> It stops >> the database from starting up if you stop recovery, and move recovery >> target backwards or delete WAL files from pg_xlog, in a way that's not >> safe. Of course, that's a case of "don't do that!", but it's a mistake >> you can make when trying to manually recover to a given point in time. > > You can't move the target backwards --- it's embedded in pg_control. No it's not. It's in recovery.conf. We do store the minimum recovery point required by the base backup in control file, but it's not advanced once the recovery starts. So if you start recovery, starting from say 123, let it progress to location 456, kill recovery and start it again, but only let it go up to 300, you end up with a corrupt database. > And even if you could, I don't see what protection this particular > mechanism gives you. To the extent that it does anything at all, it's > going to be full of holes because it only examines pages that happen to > be dirtied by (the current instance of) the recovery process. The only process dirtying pages during recovery is the startup process, when it replays WAL. And for the sake of a consistent database if we crash, we only care about pages that have been written to disk. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
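Heikki's 123/456/300 scenario is easier to see with a toy model of the check that minRecoveryPoint enables; the types and function names below are simplified stand-ins for the xlog.c logic, not the real code:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t Lsn;

static Lsn min_recovery_point = 0;   /* persisted in pg_control in reality */

/* advanced whenever replay flushes a data page ahead of the stored value */
static void update_min_recovery_point(Lsn last_replayed)
{
    if (last_replayed > min_recovery_point)
        min_recovery_point = last_replayed;
}

/* may we declare the cluster consistent and open for business? */
static bool safe_to_start_up(Lsn replay_stopped_at)
{
    return replay_stopped_at >= min_recovery_point;
}

int main(void)
{
    update_min_recovery_point(456);   /* first run: pages up to 456 hit disk */
    printf("stop at 300: %s\n", safe_to_start_up(300) ? "ok" : "refuse");
    printf("stop at 456: %s\n", safe_to_start_up(456) ? "ok" : "refuse");
    return 0;
}

In other words, once the first run has flushed pages reflecting WAL up to 456, the recorded minimum forbids declaring the cluster consistent after replaying only to 300.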
Simon Riggs <simon@2ndQuadrant.com> writes: > So, yes, there are some places where InRecovery is used in code executed > by the bgwriter, but the correct fix is to use RecoveryInProgress(). Agreed, but this gets us no closer to solving the real problem, which is that when we perform the end-of-recovery checkpoint, we need to act like we are *not* in recovery anymore, for at least some purposes. Most notably, to allow us to write a WAL entry at all; but I am suspicious that pretty much every InRecovery/RecoveryInProgress test that that checkpoint might execute should behave as if we're not in recovery. regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > We do store the minimum recovery point required by the base backup in > control file, but it's not advanced once the recovery starts. So if you > start recovery, starting from say 123, let it progress to location 456, > kill recovery and start it again, but only let it go up to 300, you end > up with a corrupt database. I don't believe that you have the ability to do that. Once the recovery process is launched, the user does not have the ability to control where the WAL scan proceeds from. Or at least that's how it used to work, and if someone has tried to change it, it's broken for exactly this reason. The behavior you seem to have in mind would be completely disastrous from a performance standpoint, as we'd be writing and fsyncing pg_control constantly during a recovery. I wouldn't consider that a good idea from a reliability standpoint either --- the more writes to pg_control, the more risk of fatal corruption of that file. This is sounding more and more like something that needs to be reverted. regards, tom lane
On Thu, 2009-06-25 at 22:29 +0300, Heikki Linnakangas wrote: > Hmm, I see another small issue. We now keep track of the "minimum > recovery point". Whenever a data page is flushed, we set minimum > recovery point to the LSN of the page in XLogFlush(), instead of > fsyncing WAL like we do in normal operation. During the end-of-recovery > checkpoint, however, RecoveryInProgress() returns false, so we don't > update minimum recovery point in XLogFlush(). You're unlikely to be > bitten by that in practice; you would need to crash during the > end-of-recovery checkpoint, and then set the recovery target to an > earlier point. It should be fixed nevertheless. RecoveryInProgress() returns false only because you clear LocalRecoveryInProgress at the start of the checkpoint, not at the end. It only needs to be cleared immediately prior to the call to XLogInsert() in CreateCheckPoint(). If we do that then XLogFlush() acts as advertised and we have no worries. Tom: Heikki's work on MinRecoveryPoint seems sound to me and altering it now seems like something too dangerous to attempt at this stage. I see no need; we have no new bug, just a minor point of how we code the fix to the bug we do have. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2009-06-25 at 17:11 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > So, yes, there are some places where InRecovery is used in code executed > > by the bgwriter, but the correct fix is to use RecoveryInProgress(). > > Agreed, but this gets us no closer to solving the real problem, which is > that when we perform the end-of-recovery checkpoint, we need to act like > we are *not* in recovery anymore, for at least some purposes. Not for some purposes, just for *one* purpose: the end-of-recovery checkpoint needs to run XLogInsert(), which has specific protection against being run during recovery. > Most > notably, to allow us to write a WAL entry at all; but I am suspicious > that pretty much every InRecovery/RecoveryInProgress test that that > checkpoint might execute should behave as if we're not in recovery. You are right to question whether we should revert the patch. ISTM that we are likely to decrease code robustness by doing that at this stage. I think we have the problems pretty much solved now. If we want to increase robustness, I would suggest we add a recovery.conf parameter to explicitly enable use of the bgwriter during recovery, off by default, in addition to the fixes being worked on currently. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > On Thu, 2009-06-25 at 15:10 -0400, Tom Lane wrote: > >> So to summarize the state of play, it seems >> we have these issues: >> >> * need to delete startup process's local pendingOpsTable once bgwriter >> is launched, so that requests go to bgwriter instead > > Need to ensure that fsync requests are directed to the process that will > act on the fsync requests. > >> * need to push end-of-recovery checkpoint into bgwriter > > That's probably the easiest thing to do, but the issue is that we must > fsync all files mentioned in the pendingOpsTable in *any* process that > has been accumulating such requests. > >> * need to do something about IsRecovery tests that will now be executed >> in bgwriter > > Yes > >> * need to fix mistaken code in FinishPreparedTransaction > > Yes Ok, I've committed the above fixes everyone agreed on. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-06-25 at 17:17 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > We do store the minimum recovery point required by the base backup in > > control file, but it's not advanced once the recovery starts. So if you > > start recovery, starting from say 123, let it progress to location 456, > > kill recovery and start it again, but only let it go up to 300, you end > > up with a corrupt database. > > I don't believe that you have the ability to do that. Once the recovery > process is launched, the user does not have the ability to control where > the WAL scan proceeds from. Or at least that's how it used to work, and > if someone has tried to change it, it's broken for exactly this reason. > > The behavior you seem to have in mind would be completely disastrous > from a performance standpoint, as we'd be writing and fsyncing > pg_control constantly during a recovery. I wouldn't consider that a > good idea from a reliability standpoint either --- the more writes to > pg_control, the more risk of fatal corruption of that file. > > This is sounding more and more like something that needs to be reverted. AFAICS the problem Heikki is worried about exists in 8.2+. If you stop recovery, edit recovery.conf to an earlier recovery target and then re-run recovery, then it is possible that data that would not exist until after the (new) recovery point has made its way to disk. The code in 8.4 does a few things to improve that, but the base problem persists and reverting code won't change that. We should describe the issue in the docs and leave it at that - there is no particular reason to believe anybody would want to do such a thing. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > AFAICS the problem Heikki is worried about exists in 8.2+. If you stop > recovery, edit recovery.conf to an earlier recovery target and then > re-run recovery, then it is possible that data that would not exist until > after the (new) recovery point has made its way to disk. The code in 8.4 > does a few things to improve that, but the base problem persists and > reverting code won't change that. We should describe the issue in the > docs and leave it at that - there is no particular reason to believe > anybody would want to do such a thing. The way I've bumped into that is when playing with pg_standby: 1. Create base backup, set up standby 2. Let pg_standby catch up, so that it waits for the next segment to arrive. 3. killall -9 postmaster 4. Start standby again 5. Create trigger file, before recovery catches up again (that is, before it reaches the point where it was before killing it) Now that we have the "smart" mode in pg_standby, that's harder to do by accident, but it can still happen if you e.g. remove a WAL file from the archive. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Simon Riggs wrote: >> AFAICS the problem Heikki is worried about exists 8.2+. If you stop >> recovery, edit recovery.conf to an earlier recovery target and then >> re-run recovery then it is possible that data that would not exist until >> after the (new) recovery point has made its way to disk. The code in 8.4 >> does a few things to improve that but the base problem persists and >> revoking code won't change that. We should describe the issue in the >> docs and leave it at that - there is no particular reason to believe >> anybody would want to do such a thing. > The way I've bumped into that is when playing with pg_standby: > [ different scenario *not* involving any explicit recovery target ] Okay, I misunderstood that code as being intended to prevent some scenario that was new with Hot Standby. I still think it's a bad solution though because of the large number of pg_control writes it will cause. I agree that the code can be made to work in connection with the fixes for the immediate bugs, but I am not convinced that we want it there in its current form. regards, tom lane
Tom Lane wrote: > The behavior you seem to have in mind would be completely disastrous > from a performance standpoint, as we'd be writing and fsyncing > pg_control constantly during a recovery. Please define "constantly". We discussed that part of the patch here: http://archives.postgresql.org/message-id/498AB55D.50408@enterprisedb.com > I wouldn't consider that a > good idea from a reliability standpoint either --- the more writes to > pg_control, the more risk of fatal corruption of that file. We certainly update it an order of magnitude more often than before, but I don't think that's an issue. We're talking about archive recovery here. It's not like in normal operation where a corrupt pg_control file means that you lose your data. It will stop the server from starting up, but there's many other files that can be corrupt in a way that causes recovery to fail or stop too early. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-06-26 at 00:37 +0300, Heikki Linnakangas wrote: > Ok, I've committed the above fixes everyone agreed on. Sorry, but I haven't agreed to very much of what you just committed. There are some unnecessary and confusing hacks that really don't help anybody sort out why any of this has been done. At this stage of RC, I expected you to post the patch and have the other 2 people working actively on this issue review it before you commit. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Heikki Linnakangas wrote: > Tom Lane wrote: >> I wouldn't consider that a >> good idea from a reliability standpoint either --- the more writes to >> pg_control, the more risk of fatal corruption of that file. > > We certainly update it an order of magnitude more often than before, but > I don't think that's an issue. We're talking about archive recovery > here. It's not like in normal operation where a corrupt pg_control file > means that you lose your data. It will stop the server from starting up, > but there's many other files that can be corrupt in a way that causes > recovery to fail or stop too early. If you still find the frequent pg_control updates unacceptable, we could always move minRecoveryPoint to a file of its own.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> The behavior you seem to have in mind would be completely disastrous >> from a performance standpoint, as we'd be writing and fsyncing >> pg_control constantly during a recovery. > Please define "constantly". We discussed that part of the patch here: > http://archives.postgresql.org/message-id/498AB55D.50408@enterprisedb.com Okay, after reading the code a bit more I found this: /* * To avoid having to update the control file too often, we update it * all the way to the last record being replayed, even though 'lsn' * would suffice for correctness. */ which should alleviate the too-many-writes syndrome. Never mind that complaint then. (But shouldn't there be an Assert that this is >= lsn?) regards, tom lane
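For what the quoted comment describes, a toy rendering of the Assert Tom is asking about might look like the following; the real UpdateMinRecoveryPoint() also writes and fsyncs pg_control, and downthread a WARNING rather than a hard Assert is preferred for the page-LSN-too-new case, so treat this purely as an illustration:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;

static Lsn min_recovery_point = 0;
static Lsn last_replayed_rec_ptr = 0;   /* end of the record just replayed */

static void update_min_recovery_point(Lsn lsn)
{
    /* 'lsn' would suffice for correctness, but going all the way to the
     * last replayed record avoids updating the control file too often */
    Lsn new_point = last_replayed_rec_ptr;

    assert(new_point >= lsn);   /* the Assert Tom is asking about */

    if (new_point > min_recovery_point)
        min_recovery_point = new_point;   /* then write and fsync pg_control */
}

int main(void)
{
    last_replayed_rec_ptr = 500;
    update_min_recovery_point(456);   /* fine: 500 >= 456 */
    printf("minRecoveryPoint is now %llu\n",
           (unsigned long long) min_recovery_point);
    return 0;
}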
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2009-06-26 at 00:37 +0300, Heikki Linnakangas wrote: >> Ok, I've committed the above fixes everyone agreed on. > At this stage of RC, I expected you to post the patch and have the other > 2 people working actively on this issue review it before you commit. We're going to slip release a day to allow time to review this properly. Given that, I have no problem with the proposed fix being in CVS --- it makes it a shade easier for other people to test it. regards, tom lane
On Thu, 2009-06-25 at 18:40 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Fri, 2009-06-26 at 00:37 +0300, Heikki Linnakangas wrote: > >> Ok, I've committed the above fixes everyone agreed on. > > > At this stage of RC, I expected you to post the patch and have the other > > 2 people working actively on this issue review it before you commit. > > We're going to slip release a day to allow time to review this properly. > Given that, I have no problem with the proposed fix being in CVS --- it > makes it a shade easier for other people to test it. I wasn't suggesting that we post the patch and leave it for days, I was thinking more of the 3 people actively working together on the problem looking at the code before its committed. Describing it as stuff we agreed seems particularly strange in that light. On the patch, the kluge to set InRecovery is unnecessary and confusing. If I submitted that you'd kick my ass all around town. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On the patch, the kluge to set InRecovery is unnecessary and confusing. I'll look into a better way to do that tonight. Did you have any specific improvement in mind? regards, tom lane
On Thu, 2009-06-25 at 19:16 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On the patch, the kluge to set InRecovery is unnecessary and confusing. > > I'll look into a better way to do that tonight. Did you have any > specific improvement in mind? Yes, all already mentioned on this thread. I'm unable to spend time on this tomorrow, but I think the main elements of the solution are there now. The bug was found in my feature, so mea culpa. Thanks to everybody for helping identify and fix it. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Thu, 2009-06-25 at 19:16 -0400, Tom Lane wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >>> On the patch, the kluge to set InRecovery is unnecessary and confusing. >> >> I'll look into a better way to do that tonight. Did you have any >> specific improvement in mind? > Yes, all already mentioned on this thread. After looking at this a bit more, the "easy" solution mentioned upthread doesn't work. We want the end-of-recovery checkpoint to be able to write WAL (obviously), but it *must not* try to call TruncateMultiXact or TruncateSUBTRANS, because those subsystems aren't initialized yet. The check in XLogFlush needs to behave as though InRecovery were true too. So the idea of testing RecoveryInProgress() rather than InRecovery cannot handle all these cases. However, I still agree with your thought that having InRecovery true only in the process that's applying WAL records is probably the cleanest definition. And it also seems to me that having crystal-clear definitions of these states is essential, because not being clear about them is exactly what got us into this mess. What I am thinking is that instead of the hack of clearing LocalRecoveryInProgress to allow the current process to write WAL, we should have a separate test function WALWriteAllowed() with a state variable LocalWALWriteAllowed, and the hack should set that state without playing any games with LocalRecoveryInProgress. Then RecoveryInProgress() remains true during the end-of-recovery checkpoint and we can condition the TruncateMultiXact and TruncateSUBTRANS calls on that. Meanwhile the various places that check RecoveryInProgress to decide if WAL writing is allowed should call WALWriteAllowed() instead. I have not yet gone through all the code sites to make sure this is a consistent way to do it, but we clearly need more states than we have now if we are to avoid weird overloadings of the state meanings. regards, tom lane
Hi, Thank you for addressing the problem! On Fri, Jun 26, 2009 at 7:14 AM, Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> wrote: > We certainly update it an order of magnitude more often than before, but > I don't think that's an issue. We're talking about archive recovery > here. It's not like in normal operation where a corrupt pg_control file > means that you lose your data. It will stop the server from starting up, > but there's many other files that can be corrupt in a way that causes > recovery to fail or stop too early. Frequent updating of pg_control causes the significant performance degradation of archive recovery. I think that this is an issue to be fixed. The warm-standby users (including me) care about the performance of the standby server, because that affects the failover time, for example. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
I wrote: > ... Never mind that complaint then. Be advised that I'm going to be on the warpath again in the morning. I observe that the substantial amount of care we have taken over XLogFlush's handling of bad-input-LSN scenarios has been completely destroyed by the UpdateMinRecoveryPoint patch, which will fail disastrously (leaving the database unstartable/unrecoverable) if a bogusly large LSN is encountered during recovery. However, I grow weary... regards, tom lane
On Thu, 2009-06-25 at 20:25 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Thu, 2009-06-25 at 19:16 -0400, Tom Lane wrote: > >> Simon Riggs <simon@2ndQuadrant.com> writes: > >>> On the patch, the kluge to set InRecovery is unnecessary and confusing. > >> > >> I'll look into a better way to do that tonight. Did you have any > >> specific improvement in mind? > > > Yes, all already mentioned on this thread. > > After looking at this a bit more, the "easy" solution mentioned upthread > doesn't work. We want the end-of-recovery checkpoint to be able to > write WAL (obviously), but it *must not* try to call TruncateMultiXact > or TruncateSUBTRANS, because those subsystems aren't initialized yet. > The check in XLogFlush needs to behave as though InRecovery were true > too. So the idea of testing RecoveryInProgress() rather than InRecovery > cannot handle all these cases. Yes it can, but only if LocalRecoveryInProgress is cleared immediately before XLogInsert() in CreateCheckPoint(). Then it behaves correctly. > However, I still agree with your thought that having InRecovery true > only in the process that's applying WAL records is probably the cleanest > definition. And it also seems to me that having crystal-clear > definitions of these states is essential, because not being clear about > them is exactly what got us into this mess. Yes > What I am thinking is that instead of the hack of clearing > LocalRecoveryInProgress to allow the current process to write WAL, > we should have a separate test function WALWriteAllowed() with a > state variable LocalWALWriteAllowed, and the hack should set that > state without playing any games with LocalRecoveryInProgress. Then > RecoveryInProgress() remains true during the end-of-recovery checkpoint > and we can condition the TruncateMultiXact and TruncateSUBTRANS calls > on that. Meanwhile the various places that check RecoveryInProgress > to decide if WAL writing is allowed should call WALWriteAllowed() > instead. No need. > I have not yet gone through all the code sites to make sure this is a > consistent way to do it, but we clearly need more states than we have > now if we are to avoid weird overloadings of the state meanings. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2009-06-26 at 10:56 +0900, Fujii Masao wrote: > Hi, > > Thank you for addressing the problem! > > On Fri, Jun 26, 2009 at 7:14 AM, Heikki > Linnakangas<heikki.linnakangas@enterprisedb.com> wrote: > > We certainly update it an order of magnitude more often than before, but > > I don't think that's an issue. We're talking about archive recovery > > here. It's not like in normal operation where a corrupt pg_control file > > means that you lose your data. It will stop the server from starting up, > > but there's many other files that can be corrupt in a way that causes > > recovery to fail or stop too early. > > Frequent updating of pg_control causes the significant performance > degradation of archive recovery. I think that this is an issue to be fixed. > The warm-standby users (including me) care about the performance > of the standby server, because that affects the failover time, for example. An important point. The update rate "should" be once per clock cycle at most and should occur in bgwriter not startup. If there is evidence of a problem then I would support reducing the update rate. The purpose of the patch was performance, not to fix an obscure and unreported inconsistency. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2009-06-26 at 05:14 +0100, Simon Riggs wrote: > On Thu, 2009-06-25 at 20:25 -0400, Tom Lane wrote: > > What I am thinking is that instead of the hack of clearing > > LocalRecoveryInProgress to allow the current process to write WAL, > > we should have a separate test function WALWriteAllowed() with a > > state variable LocalWALWriteAllowed, and the hack should set that > > state without playing any games with LocalRecoveryInProgress. Then > > RecoveryInProgress() remains true during the end-of-recovery checkpoint > > and we can condition the TruncateMultiXact and TruncateSUBTRANS calls > > on that. Meanwhile the various places that check RecoveryInProgress > > to decide if WAL writing is allowed should call WALWriteAllowed() > > instead. > > No need. Belay that. Yes, I agree we need the additional state, though I think it's more like EndRecoveryIsComplete(). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Fujii Masao wrote: > On Fri, Jun 26, 2009 at 7:14 AM, Heikki > Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> We certainly update it an order of magnitude more often than before, but >> I don't think that's an issue. We're talking about archive recovery >> here. It's not like in normal operation where a corrupt pg_control file >> means that you lose your data. It will stop the server from starting up, >> but there's many other files that can be corrupt in a way that causes >> recovery to fail or stop too early. > > Frequent updating of pg_control causes the significant performance > degradation of archive recovery. I think that this is an issue to be fixed. > The warm-standby users (including me) care about the performance > of the standby server, because that affects the failover time, for example. Are you actually seeing performance degradation caused by frequent pg_control updates? In the simple test scenarios I've tested, pg_control is updated only once every few WAL segments, and this with shared_buffers=32MB. With larger shared_buffers, it happens even less frequently. There's a DEBUG2 line in UpdateMinRecoveryPoint() that you can bump to LOG level if you want to observe that behavior. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > I observe that the substantial amount of care we have taken over > XLogFlush's handling of bad-input-LSN scenarios has been completely > destroyed by the UpdateMinRecoveryPoint patch, which will fail > disastrously (leaving the database unstartable/unrecoverable) if a > bogusly large LSN is encountered during recovery. Note that we don't update minRecoveryPoint to the LSN from the data page, but to the LSN of the last replayed WAL record. A warning similar to that at the end of XLogFlush() would be a good idea though, if the data page LSN is greater. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi, On Fri, Jun 26, 2009 at 3:22 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Are you actually seeing performance degradation caused by frequent > pg_control updates? In the simple test scenarios I've tested, pg_control > is updated only once every few WAL segments, and this with > shared_buffers=32MB. With larger shared_buffers, it happens even less > frequently. Oh, I misunderstood how UpdateMinRecoveryPoint() is called. ISTM that recovery is still fast unless shared_buffers is set to an unreasonably small value. Sorry for the noise. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Simon Riggs wrote: > On Fri, 2009-06-26 at 05:14 +0100, Simon Riggs wrote: >> On Thu, 2009-06-25 at 20:25 -0400, Tom Lane wrote: > >>> What I am thinking is that instead of the hack of clearing >>> LocalRecoveryInProgress to allow the current process to write WAL, >>> we should have a separate test function WALWriteAllowed() with a >>> state variable LocalWALWriteAllowed, and the hack should set that >>> state without playing any games with LocalRecoveryInProgress. Then >>> RecoveryInProgress() remains true during the end-of-recovery checkpoint >>> and we can condition the TruncateMultiXact and TruncateSUBTRANS calls >>> on that. Meanwhile the various places that check RecoveryInProgress >>> to decide if WAL writing is allowed should call WALWriteAllowed() >>> instead. >> No need. > > Belay that. Yes, agree need for additional state, though think its more > like EndRecoveryIsComplete(). Here's a patch implementing the WALWriteAllowed() idea (I'm not wedded to the name). There's two things that trouble me with this patch: - CreateCheckPoint() calls AdvanceXLInsertBuffer() before setting LocalWALWriteAllowed. I don't see anything in AdvanceXLInsertBuffer() that would fail, but it doesn't feel right. While strictly speaking it doesn't insert new WAL records, it does write WAL page headers. - As noted with an XXX comment in the patch, CreateCheckPoint() now resets LocalWALWriteAllowed to false after a shutdown/end-of-recovery checkpoint. But that's not enough to stop WAL inserts after a shutdown checkpoint, because when RecoveryInProgress() is false, we WALWriteAllowed() still returns true. We haven't had such a safeguard in place before, so we can keep living without it, but now that we have a WALWriteAllowed() macro it would be nice if it returned false when WAL writes are no longer allowed after a shutdown checkpoint. (that would've caught a bug in Guillaume Smet's original patch to rotate a WAL segment at shutdown, where the xlog switch was done after shutdown checkpoint) On whole, this is probably the right way going forward, but I'm not sure if it'd make 8.4 more or less robust than what's in CVS now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 7314341..6f86961 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -1543,7 +1543,7 @@ CheckPointMultiXact(void) * SimpleLruTruncate would get confused. It seems best not to risk * removing any data during recovery anyway, so don't truncate. */ - if (!InRecovery) + if (!RecoveryInProgress()) TruncateMultiXact(); TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d6f63c7..f28bd4d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -142,6 +142,16 @@ static bool InArchiveRecovery = false; */ static bool LocalRecoveryInProgress = true; +/* + * Am I allowed to write new WAL records? It's always allowed after recovery. + * End-of-recovery checkpoint sets LocalWALWriteAllowed before we exit + * recovery, so that it can write the checkpoint record even though + * RecoveryInProgress() is still true. + */ +#define WALWriteAllowed() (LocalWALWriteAllowed || !RecoveryInProgress()) + +static bool LocalWALWriteAllowed = false; + /* Was the last xlog file restored from archive, or local? 
*/ static bool restoredFromArchive = false; @@ -537,7 +547,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata) bool isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH); /* cross-check on whether we should be here or not */ - if (RecoveryInProgress()) + if (!WALWriteAllowed()) elog(FATAL, "cannot make new WAL entries during recovery"); /* info's high bits are reserved for use by me */ @@ -1846,7 +1856,7 @@ XLogFlush(XLogRecPtr record) * During REDO, we don't try to flush the WAL, but update minRecoveryPoint * instead. */ - if (RecoveryInProgress()) + if (!WALWriteAllowed()) { UpdateMinRecoveryPoint(record, false); return; @@ -1949,7 +1959,7 @@ XLogFlush(XLogRecPtr record) * and so we will not force a restart for a bad LSN on a data page. */ if (XLByteLT(LogwrtResult.Flush, record)) - elog(InRecovery ? WARNING : ERROR, + elog(RecoveryInProgress() ? WARNING : ERROR, "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X", record.xlogid, record.xrecoff, LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff); @@ -1977,7 +1987,7 @@ XLogBackgroundFlush(void) bool flexible = true; /* XLOG doesn't need flushing during recovery */ - if (RecoveryInProgress()) + if (!WALWriteAllowed()) return; /* read LogwrtResult and update local state */ @@ -5849,7 +5859,10 @@ RecoveryInProgress(void) * recovery is finished. */ if (!LocalRecoveryInProgress) + { InitXLOGAccess(); + LocalWALWriteAllowed = true; + } return LocalRecoveryInProgress; } @@ -6225,7 +6238,6 @@ CreateCheckPoint(int flags) uint32 _logSeg; TransactionId *inCommitXids; int nInCommit; - bool OldInRecovery = InRecovery; /* * An end-of-recovery checkpoint is really a shutdown checkpoint, just @@ -6236,33 +6248,9 @@ CreateCheckPoint(int flags) else shutdown = false; - /* - * A startup checkpoint is created before anyone else is allowed to - * write WAL. To allow us to write the checkpoint record, set - * LocalRecoveryInProgress to false. This lets us write WAL, but others - * are still not allowed to do so. - */ - if (flags & CHECKPOINT_END_OF_RECOVERY) - { - Assert(RecoveryInProgress()); - LocalRecoveryInProgress = false; - InitXLOGAccess(); - - /* - * Before 8.4, end-of-recovery checkpoints were always performed by - * the startup process, and InRecovery was set true. InRecovery is not - * normally set in bgwriter, but we set it here temporarily to avoid - * confusing old code in the end-of-recovery checkpoint code path that - * rely on it. - */ - InRecovery = true; - } - else - { - /* shouldn't happen */ - if (RecoveryInProgress()) - elog(ERROR, "can't create a checkpoint during recovery"); - } + /* shouldn't happen */ + if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0) + elog(ERROR, "can't create a checkpoint during recovery"); /* * Acquire CheckpointLock to ensure only one checkpoint happens at a time. @@ -6305,7 +6293,6 @@ CreateCheckPoint(int flags) /* Begin filling in the checkpoint WAL record */ MemSet(&checkPoint, 0, sizeof(checkPoint)); - checkPoint.ThisTimeLineID = ThisTimeLineID; checkPoint.time = (pg_time_t) time(NULL); /* @@ -6473,6 +6460,24 @@ CreateCheckPoint(int flags) START_CRIT_SECTION(); /* + * An end-of-recovery checkpoint is created before anyone is allowed to + * write WAL. To allow us to write the checkpoint record, temporarily + * enable LocalWALWriteAllowed. 
+ */ + if (flags & CHECKPOINT_END_OF_RECOVERY) + { + Assert(RecoveryInProgress()); + LocalWALWriteAllowed = true; + InitXLOGAccess(); + } + + /* + * this needs to be done after the InitXLOGAccess() call above or + * ThisTimeLineID might be uninitialized + */ + checkPoint.ThisTimeLineID = ThisTimeLineID; + + /* * Now insert the checkpoint record into XLOG. */ rdata.data = (char *) (&checkPoint); @@ -6488,6 +6493,19 @@ CreateCheckPoint(int flags) XLogFlush(recptr); /* + * We mustn't write any new WAL after a shutdown checkpoint, or it will + * be overwritten at next startup. No-one should even try, this just + * allows a bit more sanity-checking. XXX: since RecoveryInProgress() is + * false at that point, we'll just set LocalWriteAllowed again if anyone + * calls XLogInsert(), so this is actually useless in the shutdown case. + * + * Don't allow WAL writes after an end-of-recovery checkpoint either. It + * will be enabled again after the startup is fully completed. + */ + if (shutdown) + LocalWALWriteAllowed = false; + + /* * We now have ProcLastRecPtr = start of actual checkpoint record, recptr * = end of actual checkpoint record. */ @@ -6560,7 +6578,7 @@ CreateCheckPoint(int flags) * in subtrans.c). During recovery, though, we mustn't do this because * StartupSUBTRANS hasn't been called yet. */ - if (!InRecovery) + if (!RecoveryInProgress()) TruncateSUBTRANS(GetOldestXmin(true, false)); /* All real work is done, but log before releasing lock. */ @@ -6574,9 +6592,6 @@ CreateCheckPoint(int flags) CheckpointStats.ckpt_segs_recycled); LWLockRelease(CheckpointLock); - - /* Restore old value */ - InRecovery = OldInRecovery; } /*
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> I observe that the substantial amount of care we have taken over >> XLogFlush's handling of bad-input-LSN scenarios has been completely >> destroyed by the UpdateMinRecoveryPoint patch, which will fail >> disastrously (leaving the database unstartable/unrecoverable) if a >> bogusly large LSN is encountered during recovery. > Note that we don't update minRecoveryPoint to the LSN from the data > page, but to the LSN of the last replayed WAL record. A warning similar > to that at the end of XLogFlush() would be a good idea though, if the > data page LSN is greater. Ah, roger, so actually we can make this *better* than it was before. The special case in XLogFlush is no longer needed, because the case in which we formerly wished to use WARNING is now diverted to UpdateMinRecoveryPoint. But the latter ought to handle the situation explicitly. regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Here's a patch implementing the WALWriteAllowed() idea (I'm not wedded > to the name). I'm not either ... this morning it seems to me that XLogWriteAllowed (or maybe XLogInsertAllowed?) would be more in keeping with the existing code names in this area. > There's two things that trouble me with this patch: > - CreateCheckPoint() calls AdvanceXLInsertBuffer() before setting > LocalWALWriteAllowed. I don't see anything in AdvanceXLInsertBuffer() > that would fail, but it doesn't feel right. While strictly speaking it > doesn't insert new WAL records, it does write WAL page headers. The ordering is necessary because we have to do that before we start flushing buffers, and XLogFlush has to see WriteAllowed as FALSE or it will do the wrong thing. If we ever put something into AdvanceXLInsertBuffer that would depend on this, we could flip the flag to TRUE just for the duration of calling AdvanceXLInsertBuffer, though I agree that's a bit ugly. Perhaps calling the flag/routine XLogInsertAllowed() would alleviate this issue somewhat? > - As noted with an XXX comment in the patch, CreateCheckPoint() now > resets LocalWALWriteAllowed to false after a shutdown/end-of-recovery > checkpoint. But that's not enough to stop WAL inserts after a shutdown > checkpoint, because when RecoveryInProgress() is false, we > WALWriteAllowed() still returns true. We haven't had such a safeguard in > place before, so we can keep living without it, but now that we have a > WALWriteAllowed() macro it would be nice if it returned false when WAL > writes are no longer allowed after a shutdown checkpoint. (that would've > caught a bug in Guillaume Smet's original patch to rotate a WAL segment > at shutdown, where the xlog switch was done after shutdown checkpoint) It would be possible to make LocalWALWriteAllowed a three-state flag: * allowed, don't check RecoveryInProgress, * disallowed, don't check RecoveryInProgress * check RecoveryInProgress, transition to first state if clear Not sure if worth the trouble. > On whole, this is probably the right way going forward, but I'm not sure > if it'd make 8.4 more or less robust than what's in CVS now. I think we should do it. There is a lot of stuff I'm still not happy with in this area, but without a clean and understandable set of state definitions we aren't going to be able to improve it. regards, tom lane
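The three-state flag Tom sketches can be modelled in a few lines of standalone C; the 1/0/-1 encoding and the names here are assumptions made for illustration, not the committed implementation:

#include <stdio.h>
#include <stdbool.h>

static bool shared_recovery_in_progress = true;   /* stand-in for the shmem flag */

/* 1 = allowed, 0 = disallowed, -1 = defer to the shared recovery state */
static int local_xlog_insert_allowed = -1;

static bool xlog_insert_allowed(void)
{
    if (local_xlog_insert_allowed >= 0)
        return local_xlog_insert_allowed != 0;

    if (shared_recovery_in_progress)
        return false;              /* keep deferring while recovery runs */

    local_xlog_insert_allowed = 1; /* recovery over: latch "allowed" */
    return true;
}

int main(void)
{
    printf("during recovery: %d\n", xlog_insert_allowed());   /* 0 */
    local_xlog_insert_allowed = 1;   /* bgwriter, around the EOR checkpoint */
    printf("EOR checkpoint:  %d\n", xlog_insert_allowed());   /* 1 */
    local_xlog_insert_allowed = 0;   /* after a shutdown checkpoint */
    printf("post-shutdown:   %d\n", xlog_insert_allowed());   /* 0 */
    return 0;
}

The design point is that the "defer" state can latch to "allowed" exactly once, when recovery is observed to be over, while the bgwriter can force "allowed" around the end-of-recovery checkpoint and "disallowed" after a shutdown checkpoint without touching LocalRecoveryInProgress.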
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> - CreateCheckPoint() calls AdvanceXLInsertBuffer() before setting >> LocalWALWriteAllowed. I don't see anything in AdvanceXLInsertBuffer() >> that would fail, but it doesn't feel right. While strictly speaking it >> doesn't insert new WAL records, it does write WAL page headers. > > The ordering is necessary because we have to do that before we start > flushing buffers, and XLogFlush has to see WriteAllowed as FALSE or it > will do the wrong thing. If we ever put something into > AdvanceXLInsertBuffer that would depend on this, we could flip the flag > to TRUE just for the duration of calling AdvanceXLInsertBuffer, though > I agree that's a bit ugly. Perhaps calling the flag/routine > XLogInsertAllowed() would alleviate this issue somewhat? Yeah, it would look much less dirty if called XLogInsertAllowed(). >> - As noted with an XXX comment in the patch, CreateCheckPoint() now >> resets LocalWALWriteAllowed to false after a shutdown/end-of-recovery >> checkpoint. But that's not enough to stop WAL inserts after a shutdown >> checkpoint, because when RecoveryInProgress() is false, we >> WALWriteAllowed() still returns true. We haven't had such a safeguard in >> place before, so we can keep living without it, but now that we have a >> WALWriteAllowed() macro it would be nice if it returned false when WAL >> writes are no longer allowed after a shutdown checkpoint. (that would've >> caught a bug in Guillaume Smet's original patch to rotate a WAL segment >> at shutdown, where the xlog switch was done after shutdown checkpoint) > > It would be possible to make LocalWALWriteAllowed a three-state flag: > * allowed, don't check RecoveryInProgress, > * disallowed, don't check RecoveryInProgress > * check RecoveryInProgress, transition to first state if clear > Not sure if worth the trouble. One idea is to merge LocalWALWriteAllowed and LocalRecoveryInProgress (and the corresponding shared variables too) into one XLogState variable with three states * in-recovery * normal operation * shutdown. The variable would always move from a lower state to higher, similar to PMState in postmaster.c, so it could be cached like RecoveryInProgress works now. WAL inserts would only be allowed in normal operation. An end-of-recovery checkpoint would set LocalXLogState to "normal operation" ahead of SharedXLogState, to allow writing the checkpoint record, and a shutdown checkpoint would set Shared- and LocalXLogState to "shutdown" right after writing the checkpoint record. >> On whole, this is probably the right way going forward, but I'm not sure >> if it'd make 8.4 more or less robust than what's in CVS now. > > I think we should do it. There is a lot of stuff I'm still not happy > with in this area, but without a clean and understandable set of state > definitions we aren't going to be able to improve it. I agree, we need clear states and invariants that can be checked and relied on. This gets even more complicated when the hot standby stuff gets involved. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
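For comparison, the merged single-variable design Heikki floats here would have roughly this shape; it is a toy model only (Tom argues against it in the next two messages), not committed code:

#include <stdio.h>

typedef enum XLogState
{
    XLOG_STATE_IN_RECOVERY,   /* replaying WAL; no WAL inserts allowed */
    XLOG_STATE_NORMAL,        /* normal operation; inserts allowed */
    XLOG_STATE_SHUTDOWN       /* shutdown checkpoint written; frozen again */
} XLogState;

static XLogState LocalXLogState = XLOG_STATE_IN_RECOVERY;

#define XLogInsertAllowed() (LocalXLogState == XLOG_STATE_NORMAL)

/* states only ever move forward, like PMState in postmaster.c */
static void advance_xlog_state(XLogState s)
{
    if (s > LocalXLogState)
        LocalXLogState = s;
}

int main(void)
{
    printf("recovery: %d\n", XLogInsertAllowed());   /* 0 */
    advance_xlog_state(XLOG_STATE_NORMAL);           /* end-of-recovery checkpoint */
    printf("normal:   %d\n", XLogInsertAllowed());   /* 1 */
    advance_xlog_state(XLOG_STATE_SHUTDOWN);         /* shutdown checkpoint */
    printf("shutdown: %d\n", XLogInsertAllowed());   /* 0 */
    return 0;
}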
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > One idea is to merge LocalWALWriteAllowed and LocalRecoveryInProgress > (and the corresponding shared variables too) into one XLogState variable > with three states > * in-recovery > * normal operation > * shutdown. > The variable would always move from a lower state to higher, Hmm ... this doesn't really feel cleaner to me, although I'm not sure why not. One point is that you really don't have enough states there; there's a difference between "in recovery" and "writing end-of-recovery checkpoint". Also, you stated that you want to disable XLogInsert again as soon as the EOR checkpoint is written, so I don't believe that the state sequence is really monotonic, unless you split in-recovery into three sub-states. On the whole I prefer keeping "InRecovery" and "XLogInsertAllowed" as separate concepts, I think. regards, tom lane
I wrote: > Hmm ... this doesn't really feel cleaner to me, although I'm not sure > why not. Oh, I thought of a more concrete point: InRecovery is inherently a system-wide state, but XLogInsertAllowed is *not*. While we write the EOR checkpoint, we really want only the bgwriter to be authorized to write WAL, but the scheme you propose would effectively authorize all processes during that window. regards, tom lane
Tom Lane wrote: > I wrote: > > Hmm ... this doesn't really feel cleaner to me, although I'm not sure > > why not. > > Oh, I thought of a more concrete point: InRecovery is inherently a > system-wide state, but XLogInsertAllowed is *not*. While we write > the EOR checkpoint, we really want only the bgwriter to be authorized > to write WAL, but the scheme you propose would effectively authorize > all processes during that window. BTW one of the problems of the block-CRC patch was that we needed to emit WAL records for hint bit states set during shutdown, as buffers were flushed out, which obviously caused that error message to show up. Due to another more serious problem I didn't investigate the solution to this one, but it looks like it would be easy to set XLogInsertAllowed to true while flushing buffers. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Tom Lane wrote: > I wrote: >> Hmm ... this doesn't really feel cleaner to me, although I'm not sure >> why not. > > Oh, I thought of a more concrete point: InRecovery is inherently a > system-wide state, but XLogInsertAllowed is *not*. While we write > the EOR checkpoint, we really want only the bgwriter to be authorized > to write WAL, but the scheme you propose would effectively authorize > all processes during that window. I was thinking of doing the same hack we have now, and set LocalXLogState to "normal processing" in bgwriter, but leave SharedXLogState still in "recovery". It's getting late again, and I'm afraid I have to stop for today :-(. I'll probably check my emails tomorrow, but probably won't be doing much else until Monday. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > It's getting late again, and I'm afraid I have to stop for today :-(. OK, I'll start from your last patch and see what I can do. regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Here's a patch implementing the WALWriteAllowed() idea (I'm not wedded > to the name). There's two things that trouble me with this patch: I've committed this patch plus the mentioned revisions, along with a bunch of code review of my own (mostly doctoring comments that needed it). There are a couple of loose ends that should get looked at later: * AFAICS there is no mechanism preventing the bgwriter from trying to issue a restartpoint after the end-of-recovery checkpoint and before the startup process manages to clear SharedRecoveryInProgress. It's unlikely, because the EOR checkpoint would advance the local idea of when the next checkpoint should happen, but it doesn't appear to be impossible. If it did happen it would be really bad because pg_control's notion of the latest checkpoint could go backward, perhaps into a part of XLOG that isn't there anymore. You'd never notice unless you were unfortunate enough to crash before the next regular checkpoint. I put a hack defense into CreateRestartPoint so that it won't overwrite pg_control if the state doesn't look right once it's got ControlFileLock, but I wonder if there's a better answer. * CreateRestartPoint tries to report recoveryLastXTime, but that is dead code because that variable isn't maintained in the bgwriter process. This is not critical functionality so I don't mind releasing 8.4 with it broken. Perhaps we should use the time from the restartpoint's checkpoint? * I'm unconvinced by the code you added to "Update min recovery point one last time". What is that supposed to be good for? It won't do anything in a crash-recovery case (when minRecoveryPoint is 0/0); does that matter? * I find the RecoveryInProgress test in XLogNeedsFlush rather dubious. Why is it okay to disable that? For at least one of the two callers (SetHintBits) it seems like the safe answer is "true" not "false". This doesn't matter too much yet but it will for hot standby. * I find the RecoveryInProgress test in ShutdownXLOG rather dubious too. If that code path can actually be taken, isn't it trying to shut down modules that haven't been initialized? regards, tom lane
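The "hack defense" Tom describes for CreateRestartPoint() amounts to re-checking, under ControlFileLock, that a restartpoint is still the right thing to do before touching pg_control. A standalone sketch of that shape (field and function names are simplified stand-ins, not the real xlog.c code):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t Lsn;

/* toy stand-ins; in xlog.c this is pg_control protected by ControlFileLock */
static Lsn  control_checkpoint_redo = 400;
static bool shared_recovery_in_progress = true;

static bool restartpoint_may_update_control(Lsn new_redo)
{
    /* re-check under the lock: the end-of-recovery checkpoint may have run */
    if (!shared_recovery_in_progress)
        return false;
    /* and never move the recorded checkpoint backwards */
    if (new_redo <= control_checkpoint_redo)
        return false;
    control_checkpoint_redo = new_redo;   /* write and fsync pg_control */
    return true;
}

int main(void)
{
    printf("advance to 500: %d\n", restartpoint_may_update_control(500));  /* 1 */
    shared_recovery_in_progress = false;   /* end-of-recovery checkpoint ran */
    printf("advance to 600: %d\n", restartpoint_may_update_control(600));  /* 0 */
    return 0;
}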
On Fri, 2009-06-26 at 16:48 -0400, Tom Lane wrote: > * I find the RecoveryInProgress test in XLogNeedsFlush rather dubious. > Why is it okay to disable that? For at least one of the two callers > (SetHintBits) it seems like the safe answer is "true" not "false". > This doesn't matter too much yet but it will for hot standby. IIUC you think it is safest to *not* write hint bits during Hot Standby? So we would treat all commits as async commits? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2009-06-26 at 16:48 -0400, Tom Lane wrote: >> * I find the RecoveryInProgress test in XLogNeedsFlush rather dubious. >> Why is it okay to disable that? For at least one of the two callers >> (SetHintBits) it seems like the safe answer is "true" not "false". >> This doesn't matter too much yet but it will for hot standby. > IIUC you think it is safest to *not* write hint bits during Hot Standby? Well, the point I was trying to make is that if you want the code to do nothing then hot-wiring it to return "false" all the time during hot standby doesn't accomplish that. I think it probably is safe to update hint bits during HS, but we need to think through the timing relative to WAL processing and make sure that we don't set them too early. Maybe that analysis has actually been done already, but there's no evidence of it in the code or comments. regards, tom lane