Re: Point in Time Recovery - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Point in Time Recovery
Date
Msg-id 1669.1090282798@sss.pgh.pa.us
In response to Point in Time Recovery  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: Point in Time Recovery
Re: Point in Time Recovery
List pgsql-hackers
Bruce and I had another phone chat about the problems that can ensue
if you restore a tar backup that contains old (incompletely filled)
versions of WAL segment files.  While the current code will ignore them
during the recovery-from-archive run, leaving them lying around seems
awfully dangerous.  One nasty possibility is that the archiving
mechanism will pick up these files and overwrite good copies in the
archive area with the obsolete ones from the backup :-(.

Bruce earlier proposed that we simply "rm pg_xlog/*" at the start of
a recovery-from-archive run, but as I said I'm scared to death of code
that does such a thing automatically.  In particular this would make it
impossible to handle scenarios where you want to do a PITR recovery but
you need to use some recent WAL segments that didn't make it into your
archive yet.  (Maybe you could get around this by forcibly transferring
such segments into the archive, but that seems like a bad idea for
incomplete segments.)

It would really be best for the DBA to make sure that the starting
condition for the recovery run does not have any obsolete segment files
in pg_xlog.  He could do this either by setting up his backup policy so
that pg_xlog isn't included in the tar backup in the first place, or by
manually removing the included files just after restoring the backup,
before he tries to start the recovery run.

Of course the objection to that is "what if the DBA forgets to do it?"

The idea that we came to on the phone was for the postmaster, when it
enters recovery mode because a recovery.conf file exists, to look in
pg_xlog for existing segment files and refuse to start if any are there
--- *unless* the user has put a special, non-default overriding flag
into recovery.conf.  Call it "use_unarchived_files" or something like
that.  We'd have to provide good documentation and an extensive HINT of
course, but basically the DBA would have two choices when he gets this
refusal to start:

1. Remove all the segment files in pg_xlog.  (This would be the right
thing to do if he knows they all came off the backup.)

2. Verify that pg_xlog contains only segment files that are newer than
what's stored in the WAL archive, and then set the override flag in
recovery.conf.  In this case the DBA is taking responsibility for
leaving only segment files that are good to use.
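A minimal sketch of the proposed startup check follows. It is written in Python purely for illustration; a real implementation would presumably be C code in the server. It assumes WAL segment files are named with 24 uppercase hex digits, as in PostgreSQL 8.0 and later, and uses use_unarchived_files, the tentative flag name suggested above. The recovery.conf parser, check_pg_xlog and the RecoveryRefused exception are hypothetical helpers, not anything in the PostgreSQL source.

```python
import os
import re

# Assumption: WAL segment files are named with 24 uppercase hex digits
# (timeline ID, log ID, segment number), as in PostgreSQL 8.0 and later.
WAL_SEGMENT_RE = re.compile(r"[0-9A-F]{24}")

# Hypothetical override flag: the tentative name proposed in this message.
OVERRIDE_FLAG = "use_unarchived_files"


class RecoveryRefused(Exception):
    """Hypothetical error: recovery would start with unverified pg_xlog segments."""


def parse_recovery_conf(text):
    """Minimal parser for lines of the form  name = 'value'  (sketch only).

    Everything after a '#' is treated as a comment, so values that
    themselves contain '#' are not handled.
    """
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue  # blank line or not a setting
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip("'\"")
    return settings


def check_pg_xlog(xlog_dir, settings):
    """Return the segment files in xlog_dir that recovery is allowed to use.

    If pg_xlog holds any segment files and the override flag is not set,
    refuse to start rather than silently using (or archiving) them.
    """
    segments = sorted(f for f in os.listdir(xlog_dir) if WAL_SEGMENT_RE.fullmatch(f))
    if not segments:
        return []  # clean starting condition: nothing to worry about
    if settings.get(OVERRIDE_FLAG, "false").lower() in ("true", "on", "yes", "1"):
        # The DBA has vouched that these segments are newer than the archive.
        return segments
    raise RecoveryRefused(
        "pg_xlog contains %d segment file(s), e.g. %s. HINT: remove them if "
        "they came from the backup, or verify they are newer than the WAL "
        "archive and set %s in recovery.conf."
        % (len(segments), segments[0], OVERRIDE_FLAG)
    )
```

With an empty pg_xlog the check passes silently. With leftover segments it refuses to start unless the DBA has set use_unarchived_files = 'true', which corresponds to choice 2; choice 1 simply returns pg_xlog to the empty case.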

One interesting point is that with such a policy, we could use locally
available WAL segments in preference to pulling the same segments from
the archive, which would be at least marginally more efficient, and seems
logically cleaner anyway.
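The "prefer local segments" idea could look roughly like the following, again as a hypothetical Python sketch rather than server code. The restore callable stands in for restore_command, local_ok means the override flag was set, and the temporary file name RECOVERYXLOG is an illustrative choice. It ensures that a partially copied file never sits in pg_xlog under a real segment name.

```python
import os
from typing import Callable, Optional, Tuple

# Hypothetical temporary name for a segment copied back from the archive,
# so a half-copied file never sits in pg_xlog under a real segment name.
RESTORE_TMP_NAME = "RECOVERYXLOG"


def locate_segment(
    name: str,
    xlog_dir: str,
    local_ok: bool,
    restore: Callable[[str, str], bool],
) -> Tuple[Optional[str], str]:
    """Find WAL segment `name` for replay; return (path, source).

    local_ok is True only when the DBA set the override flag, i.e. has
    vouched for whatever segment files were left in pg_xlog.
    restore(name, dest) stands in for restore_command and returns True
    if it copied the segment from the archive to dest.
    """
    local_path = os.path.join(xlog_dir, name)
    if local_ok and os.path.isfile(local_path):
        # Cheaper than fetching the same segment from the archive.
        return local_path, "pg_xlog"
    tmp_path = os.path.join(xlog_dir, RESTORE_TMP_NAME)
    if restore(name, tmp_path) and os.path.isfile(tmp_path):
        return tmp_path, "archive"
    return None, "missing"  # end of available WAL: recovery would stop here
```

Local files are consulted only when the override is set. Without it, a startup check like the one proposed above would already have refused to run, so pg_xlog would hold no segment files to prefer.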

In particular it seems that this would be a useful arrangement in cases
where you have questionable WAL segments --- you're not sure if they're
good or not.  Rather than having to push questionable data into your WAL
archive, you can leave it local, try a recovery run, and see if you like
the resulting state.  If not, it's a lot easier to start over when you have
not corrupted your archive area.

Comments?  Better ideas?
        regards, tom lane