Thread: Proposal for Recover mode in pg_ctl (in 8.0)

Proposal for Recover mode in pg_ctl (in 8.0)

From
Simon Riggs
Date:
Joachim Wieland has diligently and sensibly pointed out a potential for
user error with the current PITR implementation. This is not a bug *per
se*, but is a design flaw that more than one person could fall into. It
is a minor issue and not that likely, since the manual describes what to
do...but I'm proposing a possible solution nonetheless, since this error
is likely to occur whilst learning the PITR functionality and might
potentially dissuade users from using the approach.

Following restoration of a base backup, archive recovery is entered by
placing a recovery.conf file in the data directory and then restarting
the server using pg_ctl start. If the recovery.conf file is misnamed,
e.g. Recovery.conf, or if it is misplaced, or simply absent, then the
server start will not enter archive recovery, yet will start normally.
The precise difference might well not be apparent. Subsequent server
activity could potentially overwrite archived log files in a poorly
managed archive. Both of those situations are likely to occur
simultaneously with inexperienced users.

The first fix to this is clearly to document the possibility and warn
the user of this possibility - i.e. describe what NOT to do when
invoking archive recovery. This will be submitted shortly.

A further fix is to implement a fail safe mode for invoking recovery.
Rather than making the recovery a normal server start, which then
searches for recovery.conf, it would be better to indicate that the
server start is expected to be a recovery and so the absence of a
recovery.conf should be regarded as an error.

If a further pg_ctl mode, recover, were implemented, this would allow a
fail safe mode for recovery.

e.g.    pg_ctl -D datadir recover

pg_ctl could then check for the existence of a recovery.conf file and
return an error if none were found. This mode would not then need to be
passed through to the postmaster as an option, which could accidentally
be re-invoked later should a crash recovery occur (the main reason to
avoid adding recovery.conf options to postgresql.conf in the first
place).

This mode would do nothing more than:
- check for recovery.conf, if not found, return error
- invoke a start normally, as if mode=start had been requested

The doc for invoking recovery could then be changed to include this new
mode, and the potential for error would be removed.

This is a change requested for 8.0, to ensure that PITR doesn't fall
into disrepute by novice users shooting themselves in the foot. It is a
minor change, effecting only PITR, and only to the extent of a
documentation change and a file existence check in pg_ctl. No server
code would be changed.

Alternative suggestions are welcome. Thoughts?

-- 
Best Regards, Simon Riggs



Re: Proposal for Recover mode in pg_ctl (in 8.0)

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> If a further pg_ctl mode, recover, were implemented, this would allow a
> fail safe mode for recovery.

> e.g.    pg_ctl -D datadir recover

> pg_ctl could then check for the existence of a recovery.conf file and
> return an error if none were found.

I can't get very excited about this approach, because it only protects
those people who (a) use pg_ctl to start the postmaster (not everyone)
and (b) carefully follow the recovery directions (which the people you
are worried about are very bad at, by hypothesis).

A possibly more reliable interlock would involve having the postmaster
probe during normal startup to see if there is already an archived WAL
segment for what it thinks is the current segment.  However there are
several issues here: one is that if you're doing partial-log-file
shipping, that isn't necessarily an error condition; another is that
we don't know how to do such a probe unless more information is added
to postgresql.conf.  We could imagine adding another shell command
string (something like "test -f ..." perhaps) but if the user gets it
wrong he may still be left with no protection.
        regards, tom lane


Re: Proposal for Recover mode in pg_ctl (in 8.0)

From
Mark Kirkwood
Date:
I like it - nice and simple, but targets a large (and likely) foot gun 
situation.

regards

Mark

Simon Riggs wrote:

>
>If a further pg_ctl mode, recover, were implemented, this would allow a
>fail safe mode for recovery.
>
>e.g.    pg_ctl -D datadir recover
>
>pg_ctl could then check for the existence of a recovery.conf file and
>return an error if none were found. This mode would not then need to be
>passed through to the postmaster as an option, which could accidentally
>be re-invoked later should a crash recovery occur (the main reason to
>avoid adding recovery.conf options to postgresql.conf in the first
>place).
>
>This mode would do nothing more than:
>- check for recovery.conf, if not found, return error
>- invoke a start normally, as if mode=start had been requested
>
>The doc for invoking recovery could then be changed to include this new
>mode, and the potential for error would be removed.
>
>This is a change requested for 8.0, to ensure that PITR doesn't fall
>into disrepute by novice users shooting themselves in the foot. It is a
>minor change, effecting only PITR, and only to the extent of a
>documentation change and a file existence check in pg_ctl. No server
>code would be changed.
>
>Alternative suggestions are welcome. Thoughts?
>
>  
>


Re: Proposal for Recover mode in pg_ctl (in 8.0)

From
Tom Lane
Date:
I wrote:
> A possibly more reliable interlock would involve having the postmaster
> probe during normal startup to see if there is already an archived WAL
> segment for what it thinks is the current segment.

Another and simpler way is to recommend that people use archive_command
strings that won't overwrite an existing archive file.

For instance instead of showing the example
archive_command = 'cp %p /mnt/server/archivedir/%f'
we could show
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
or on some machines
archive_command = 'cp -i %p /mnt/server/archivedir/%f </dev/null'

One tricky point is to make sure that the command will return nonzero
status if the target exists.  In some quick testing I found that
the "cp -i" method doesn't have portable behavior on this point.
I think that the "test" method is portable across Unixen, but no idea
what to do on Windows.
        regards, tom lane


Re: Proposal for Recover mode in pg_ctl (in 8.0)

From
Mark Kirkwood
Date:
While this is nice, it will not help you if the restoration directory is 
different from your archive directory. That is : restore_command in 
recovery.conf fetches from somewhere other than where archive_command in 
postgresql.conf copied.

I am not sure how likely this situation is, but setting up log shipping, 
or maybe recovering from disk failure *might* mean you need to bring the 
saved archive files "back from somewhere else".

regards

Mark

Tom Lane wrote:

>
>Another and simpler way is to recommend that people use archive_command
>strings that won't overwrite an existing archive file.
>
>For instance instead of showing the example
>archive_command = 'cp %p /mnt/server/archivedir/%f'
>we could show
>archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
>
>  
>


Re: Proposal for Recover mode in pg_ctl (in 8.0)

From
Mark Kirkwood
Date:
I was thinking that even mildly experienced folks could benefit from a 
helpful sanity check. Typically the need to recover a system never comes 
at a good time, and features that help prevent silly mistakes are a 
great stress saver.

As an aside, while testing recovery during pre beta, I think I probably 
"forgot" to put in a recovery.conf about 1 time in 10. Now I was using a 
small database cluster tar'ed up in /tmp, so no big deal, but if it had 
been a 100G beast that had to come off tape ....

regards

Mark

Tom Lane wrote:

>
>I can't get very excited about this approach, because it only protects
>those people who (a) use pg_ctl to start the postmaster (not everyone)
>and (b) carefully follow the recovery directions (which the people you
>are worried about are very bad at, by hypothesis).
>
>  
>


Re: Proposal for Recover mode in pg_ctl (in 8.0)

From
Simon Riggs
Date:
On Sun, 2004-11-07 at 00:16, Tom Lane wrote:
> I wrote:
> > A possibly more reliable interlock would involve having the postmaster
> > probe during normal startup to see if there is already an archived WAL
> > segment for what it thinks is the current segment.
> 
> Another and simpler way is to recommend that people use archive_command
> strings that won't overwrite an existing archive file.
> 

Yes, exactly what I meant by saying I'd document the warning.

-- 
Best Regards, Simon Riggs



Re: Proposal for Recover mode in pg_ctl (in 8.0)

From
Simon Riggs
Date:
On Sun, 2004-11-07 at 00:54, Mark Kirkwood wrote:
> While this is nice, it will not help you if the restoration directory is 
> different from your archive directory. That is : restore_command in 
> recovery.conf fetches from somewhere other than where archive_command in 
> postgresql.conf copied.
> 

...hmmm, this doesn't look like an issue to me.

If the archive is protected when it's written, then it will be safe to
use later for restoration. The path to the archive may differ, but the
archive itself *needs* to be the same archive.

-- 
Best Regards, Simon Riggs



Re: Proposal for Recover mode in pg_ctl (in 8.0)

From
Simon Riggs
Date:
On Sat, 2004-11-06 at 23:29, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > If a further pg_ctl mode, recover, were implemented, this would allow a
> > fail safe mode for recovery.
> 
> > e.g.    pg_ctl -D datadir recover
> 
> > pg_ctl could then check for the existence of a recovery.conf file and
> > return an error if none were found.
> 

...

> A possibly more reliable interlock would involve having the postmaster
> probe during normal startup to see if there is already an archived WAL
> segment for what it thinks is the current segment.  However there are
> several issues here: one is that if you're doing partial-log-file
> shipping, that isn't necessarily an error condition; another is that
> we don't know how to do such a probe unless more information is added
> to postgresql.conf.  We could imagine adding another shell command
> string (something like "test -f ..." perhaps) but if the user gets it
> wrong he may still be left with no protection.
> 

Yes, checking the archive is the safe way, but we don't know how to do
that unless restore_command has been successfully read in (currently
from recovery.conf). Putting it in postgresql.conf is the wrong place,
because that will likely be wrongly set when we restore, and I'm against
editing that file as part of a recovery...once edited, you could lose
all context and thus completely screw the recovery.

All the suggested change is about is trying to find a safe way to fail
if restore_command has not been set because recovery.conf is missing for
whatever reason.

I can't get very excited about this approach, because it only protects
> those people who (a) use pg_ctl to start the postmaster (not everyone)
> and (b) carefully follow the recovery directions (which the people you
> are worried about are very bad at, by hypothesis).
> 

Well, I was trying to find a least-risk approach. Touching pg_ctl code
at this stage of beta-ness seemed more reliable than touching postmaster
code. pg_ctl doesn't catch everyone.

Placing the test in postmaster is the natural place for it, yes.

This additional keyword should never be placed in any startup script, so
the commonly used interface would not change, just the one used at
recovery time.

Documentation changes are one thing - and as you point out, if they miss
one thing in the manual, they'll miss three. I'm not expecting many
people to actually use an archive created by using copy or cp.

I'm with Mark on this: you need all the help you can get at 2am when you
just got called out of bed with a phone call from your boss *requesting*
you to recover your incredibly important production system.

I'll cut the doc changes first, then produce a slim patch on postmaster.
Best Regards, Simon Riggs



Re: Proposal for Recover mode in pg_ctl (in 8.0)

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Sat, 2004-11-06 at 23:29, Tom Lane wrote:
>> A possibly more reliable interlock would involve having the postmaster
>> probe during normal startup to see if there is already an archived WAL
>> segment for what it thinks is the current segment.

> Yes, checking the archive is the safe way, but we don't know how to do
> that unless restore_command has been successfully read in (currently
> from recovery.conf). Putting it in postgresql.conf is the wrong place,

Agreed; we left it out of postgresql.conf for good reasons.  I was
thinking in terms of adding a third command string, perhaps like
test_archive_file = 'test -f /mnt/server/archive/%f'

But it's probably best just to tell people to write their
archive_commands in a non-overwrite style.
        regards, tom lane