Thread: Proposal for Recover mode in pg_ctl (in 8.0)
Joachim Wieland has diligently and sensibly pointed out a potential for user error with the current PITR implementation. This is not a bug *per se*, but is a design flaw that more than one person could fall into. It is a minor issue and not that likely, since the manual describes what to do...but I'm proposing a possible solution nonetheless, since this error is likely to occur whilst learning the PITR functionality and might potentially dissuade users from using the approach. Following restoration of a base backup, archive recovery is entered by placing a recovery.conf file in the data directory and then restarting the server using pg_ctl start. If the recovery.conf file is misnamed, e.g. Recovery.conf, or if it is misplaced, or simply absent, then the server start will not enter archive recovery, yet will start normally. The precise difference might well not be apparent. Subsequent server activity could potentially overwrite archived log files in a poorly managed archive. Both of those situations are likely to occur simultaneously with inexperienced users. The first fix to this is clearly to document the possibility and warn the user of this possibility - i.e. describe what NOT to do when invoking archive recovery. This will be submitted shortly. A further fix is to implement a fail safe mode for invoking recovery. Rather than making the recovery a normal server start, which then searches for recovery.conf, it would be better to indicate that the server start is expected to be a recovery and so the absence of a recovery.conf should be regarded as an error. If a further pg_ctl mode, recover, were implemented, this would allow a fail safe mode for recovery. e.g. pg_ctl -D datadir recover pg_ctl could then check for the existence of a recovery.conf file and return an error if none were found. This mode would not then need to be passed through to the postmaster as an option, which could accidentally be re-invoked later should a crash recovery occur (the main reason to avoid adding recovery.conf options to postgresql.conf in the first place). This mode would do nothing more than: - check for recovery.conf, if not found, return error - invoke a start normally, as if mode=start had been requested The doc for invoking recovery could then be changed to include this new mode, and the potential for error would be removed. This is a change requested for 8.0, to ensure that PITR doesn't fall into disrepute by novice users shooting themselves in the foot. It is a minor change, effecting only PITR, and only to the extent of a documentation change and a file existence check in pg_ctl. No server code would be changed. Alternative suggestions are welcome. Thoughts? -- Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > If a further pg_ctl mode, recover, were implemented, this would allow a > fail safe mode for recovery. > e.g. pg_ctl -D datadir recover > pg_ctl could then check for the existence of a recovery.conf file and > return an error if none were found. I can't get very excited about this approach, because it only protects those people who (a) use pg_ctl to start the postmaster (not everyone) and (b) carefully follow the recovery directions (which the people you are worried about are very bad at, by hypothesis). A possibly more reliable interlock would involve having the postmaster probe during normal startup to see if there is already an archived WAL segment for what it thinks is the current segment. However there are several issues here: one is that if you're doing partial-log-file shipping, that isn't necessarily an error condition; another is that we don't know how to do such a probe unless more information is added to postgresql.conf. We could imagine adding another shell command string (something like "test -f ..." perhaps) but if the user gets it wrong he may still be left with no protection. regards, tom lane
I like it - nice and simple, but targets a large (and likely) foot gun situation. regards Mark Simon Riggs wrote: > >If a further pg_ctl mode, recover, were implemented, this would allow a >fail safe mode for recovery. > >e.g. pg_ctl -D datadir recover > >pg_ctl could then check for the existence of a recovery.conf file and >return an error if none were found. This mode would not then need to be >passed through to the postmaster as an option, which could accidentally >be re-invoked later should a crash recovery occur (the main reason to >avoid adding recovery.conf options to postgresql.conf in the first >place). > >This mode would do nothing more than: >- check for recovery.conf, if not found, return error >- invoke a start normally, as if mode=start had been requested > >The doc for invoking recovery could then be changed to include this new >mode, and the potential for error would be removed. > >This is a change requested for 8.0, to ensure that PITR doesn't fall >into disrepute by novice users shooting themselves in the foot. It is a >minor change, effecting only PITR, and only to the extent of a >documentation change and a file existence check in pg_ctl. No server >code would be changed. > >Alternative suggestions are welcome. Thoughts? > > >
I wrote: > A possibly more reliable interlock would involve having the postmaster > probe during normal startup to see if there is already an archived WAL > segment for what it thinks is the current segment. Another and simpler way is to recommend that people use archive_command strings that won't overwrite an existing archive file. For instance instead of showing the example archive_command = 'cp %p /mnt/server/archivedir/%f' we could show archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f' or on some machines archive_command = 'cp -i %p /mnt/server/archivedir/%f </dev/null' One tricky point is to make sure that the command will return nonzero status if the target exists. In some quick testing I found that the "cp -i" method doesn't have portable behavior on this point. I think that the "test" method is portable across Unixen, but no idea what to do on Windows. regards, tom lane
While this is nice, it will not help you if the restoration directory is different from your archive directory. That is : restore_command in recovery.conf fetches from somewhere other than where archive_command in postgresql.conf copied. I am not sure how likely this situation is, but setting up log shipping, or maybe recovering from disk failure *might* mean you need to bring the saved archive files "back from somewhere else". regards Mark Tom Lane wrote: > >Another and simpler way is to recommend that people use archive_command >strings that won't overwrite an existing archive file. > >For instance instead of showing the example >archive_command = 'cp %p /mnt/server/archivedir/%f' >we could show >archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f' > > >
I was thinking that even mildly experienced folks could benefit from a helpful sanity check. Typically the need to recover a system never comes at a good time, and features that help prevent silly mistakes are a great stress saver. As an aside, while testing recovery during pre beta, I think I probably "forgot" to put in a recovery.conf about 1 time in 10. Now I was using a small database cluster tar'ed up in /tmp, so no big deal, but if it had been a 100G beast that had to come off tape .... regards Mark Tom Lane wrote: > >I can't get very excited about this approach, because it only protects >those people who (a) use pg_ctl to start the postmaster (not everyone) >and (b) carefully follow the recovery directions (which the people you >are worried about are very bad at, by hypothesis). > > >
On Sun, 2004-11-07 at 00:16, Tom Lane wrote: > I wrote: > > A possibly more reliable interlock would involve having the postmaster > > probe during normal startup to see if there is already an archived WAL > > segment for what it thinks is the current segment. > > Another and simpler way is to recommend that people use archive_command > strings that won't overwrite an existing archive file. > Yes, exactly what I meant by saying I'd document the warning. -- Best Regards, Simon Riggs
On Sun, 2004-11-07 at 00:54, Mark Kirkwood wrote: > While this is nice, it will not help you if the restoration directory is > different from your archive directory. That is : restore_command in > recovery.conf fetches from somewhere other than where archive_command in > postgresql.conf copied. > ...hmmm, this doesn't look like an issue to me. If the archive is protected when it's written, then it will be safe to use later for restoration. The path to the archive may differ, but the archive itself *needs* to be the same archive. -- Best Regards, Simon Riggs
On Sat, 2004-11-06 at 23:29, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > If a further pg_ctl mode, recover, were implemented, this would allow a > > fail safe mode for recovery. > > > e.g. pg_ctl -D datadir recover > > > pg_ctl could then check for the existence of a recovery.conf file and > > return an error if none were found. > ... > A possibly more reliable interlock would involve having the postmaster > probe during normal startup to see if there is already an archived WAL > segment for what it thinks is the current segment. However there are > several issues here: one is that if you're doing partial-log-file > shipping, that isn't necessarily an error condition; another is that > we don't know how to do such a probe unless more information is added > to postgresql.conf. We could imagine adding another shell command > string (something like "test -f ..." perhaps) but if the user gets it > wrong he may still be left with no protection. > Yes, checking the archive is the safe way, but we don't know how to do that unless restore_command has been successfully read in (currently from recovery.conf). Putting it in postgresql.conf is the wrong place, because that will likely be wrongly set when we restore, and I'm against editing that file as part of a recovery...once edited, you could lose all context and thus completely screw the recovery. All the suggested change is about is trying to find a safe way to fail if restore_command has not been set because recovery.conf is missing for whatever reason. I can't get very excited about this approach, because it only protects > those people who (a) use pg_ctl to start the postmaster (not everyone) > and (b) carefully follow the recovery directions (which the people you > are worried about are very bad at, by hypothesis). > Well, I was trying to find a least-risk approach. Touching pg_ctl code at this stage of beta-ness seemed more reliable than touching postmaster code. pg_ctl doesn't catch everyone. Placing the test in postmaster is the natural place for it, yes. This additional keyword should never be placed in any startup script, so the commonly used interface would not change, just the one used at recovery time. Documentation changes are one thing - and as you point out, if they miss one thing in the manual, they'll miss three. I'm not expecting many people to actually use an archive created by using copy or cp. I'm with Mark on this: you need all the help you can get at 2am when you just got called out of bed with a phone call from your boss *requesting* you to recover your incredibly important production system. I'll cut the doc changes first, then produce a slim patch on postmaster. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > On Sat, 2004-11-06 at 23:29, Tom Lane wrote: >> A possibly more reliable interlock would involve having the postmaster >> probe during normal startup to see if there is already an archived WAL >> segment for what it thinks is the current segment. > Yes, checking the archive is the safe way, but we don't know how to do > that unless restore_command has been successfully read in (currently > from recovery.conf). Putting it in postgresql.conf is the wrong place, Agreed; we left it out of postgresql.conf for good reasons. I was thinking in terms of adding a third command string, perhaps like test_archive_file = 'test -f /mnt/server/archive/%f' But it's probably best just to tell people to write their archive_commands in a non-overwrite style. regards, tom lane