Re: [PATCHES] Infrastructure changes for recovery - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: [PATCHES] Infrastructure changes for recovery
Msg-id: 28973.1222381719@sss.pgh.pa.us
In response to: Re: [PATCHES] Infrastructure changes for recovery (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: [PATCHES] Infrastructure changes for recovery
           Re: [PATCHES] Infrastructure changes for recovery
List: pgsql-hackers
Simon Riggs <simon@2ndQuadrant.com> writes:
> Version 7

After reading this for a while, I realized that there is a rather
fundamental problem with it: it switches into "consistent recovery" mode
as soon as it's read WAL beyond ControlFile->minRecoveryPoint.  In a
crash recovery situation that typically is before the last checkpoint
(if indeed it's not still zero), and what that means is that this patch
will activate the bgwriter and start letting in backends instantaneously
after a crash, long before we can have any certainty that the DB state
really is consistent.

In a normal crash recovery situation this would be easily fixed by
simply not letting it go to "consistent recovery" state at all, but what
about recovery from a restartpoint?  We don't want a slave that's
crashed once to never let backends in again.  But I don't see how to
determine that we're far enough past the restartpoint to be consistent
again.  In crash recovery we assume (without proof ;-)) that we're
consistent once we reach the end of valid-looking WAL, but that rule
doesn't help for a slave that's following a continuing WAL sequence.
Perhaps something could be done based on noting when we have to pull in
a WAL segment from the restore_command, but it sounds like a pretty
fragile assumption.

Anyway, that's sufficiently bad that I'm bouncing the patch for
reconsideration.  Some other issues I noted before giving up:

* I'm a bit uncomfortable with the fact that the
IsRecoveryProcessingMode flag is read and written with no lock.  There's
no atomicity or concurrent-write problem, sure, but on a multiprocessor
with weak memory ordering guarantees (eg PPC) readers could get a fairly
stale value of the flag.  The false-to-true transition happens before
anyone except the startup process is running, so that's no problem; the
net effect is then that backends might think that recovery mode was
still active for a while after it wasn't.  This seems a bit scary, eg in
the patch as it stands that'll disable XLogFlush calls that should have
happened.  You could fix that by grabbing/releasing some spinlock (any
old one) around the accesses, but are any of the call sites
performance-critical?  The one in XLogInsert looks like it is, if
nothing else.
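To illustrate, a minimal sketch of that kind of spinlock-guarded
accessor, assuming the flag lives in the shared XLogCtl struct and can
piggyback on its existing info_lck spinlock (the field name
SharedRecoveryProcessingMode is made up for the example; this is not the
patch's actual code):

    /*
     * Sketch only -- assumes xlog.c context: XLogCtl is the shared-memory
     * control structure and info_lck its existing spinlock.  The spinlock
     * acquire/release acts as a memory barrier, so readers on machines
     * with weak memory ordering (eg PPC) can't keep seeing a stale value
     * of the flag.
     */
    bool
    IsRecoveryProcessingMode(void)
    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;
        bool        result;

        SpinLockAcquire(&xlogctl->info_lck);
        result = xlogctl->SharedRecoveryProcessingMode;
        SpinLockRelease(&xlogctl->info_lck);

        return result;
    }

For the performance-critical call sites, one could presumably also cache
a process-local copy once the flag has been observed false, since after
startup it only ever transitions true-to-false and never back.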
* I kinda think you broke XLogFlush anyway.  It's certainly clear that
the WARNING case at the bottom is unreachable with the patch, and I
think that means that you've messed up an important error robustness
behavior.  Is it still possible to get out of recovery mode if there are
any bad LSNs in the shared buffer pool?

* The use of InRecovery in CreateCheckPoint seems pretty bogus, since
that function can be called from the bgwriter, in which case the flag
will never be true.  Either this needs to be IsRecoveryProcessingMode(),
or it's useless because we'll never create ordinary checkpoints until
after subtrans.c is up anyway.

* The patch moves the clearing of InRecovery from after to before
StartupCLOG, RecoverPreparedTransactions, etc.  Is that really a good
idea?  It makes it hard for those modules to know if they are coming up
after a normal or recovery startup.  I think they may not care at the
moment, but I'd leave that alone without a darn good reason to change
it.

* The comment about CheckpointLock being only pro forma is now wrong, if
you are going to lock it to implement a delay in exitRecovery.  But I
don't understand why the delay is there at all.  If it's needed, seems
like the path where a restartpoint *isn't* in progress is wrong ---
don't you actually need to start one and wait for it?  Also I note that
if the LWLockConditionalAcquire succeeds, you neglect to release the
lock, which surely can't be right.
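For concreteness, a hypothetical sketch of the lock handshake this
bullet is questioning -- not the patch's actual code; the point is just
that both paths must end by releasing CheckpointLock, and the
succeeded-acquire path presumably has to perform (or start) the
restartpoint work itself before letting recovery finish:

    if (LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
    {
        /*
         * No restartpoint is in progress.  If exiting recovery really
         * requires one to have completed, presumably it must be done
         * (or requested and waited for) here before proceeding -- and
         * in any case the lock must be released, which the patch as
         * submitted fails to do.
         */
        LWLockRelease(CheckpointLock);
    }
    else
    {
        /*
         * A restartpoint is in progress; queue up behind the lock to
         * wait for it to finish, then release immediately.
         */
        LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
        LWLockRelease(CheckpointLock);
    }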
* The tail end of StartupXLOG() looks pretty unsafe to me.  Surely we
mustn't clear IsRecoveryProcessingMode until after we have established
the safe-recovery checkpoint.  The comments there seem to be only
vaguely related to the current state of the patch, too.

* Logging of recoveryLastXTime seems pretty bogus now.  It's supposed to
be associated with a restartpoint completion report, but now it's just
out in the ether somewhere and doesn't represent a guarantee that we're
synchronized that far.

* backup.sgml needs to be updated to say that log_restartpoints is
obsolete.

* There are a bunch of disturbing assumptions in the SLRU-related
modules about their StartUp calls being executed without any concurrent
access.  This isn't really a problem that needs to be dealt with in this
patch, I think, but that will all have to be cleaned up before we dare
allow any backends to run concurrently with recovery.  btree's
suppression of relcache invals for metapage updates will be a problem
too.

			regards, tom lane