Re: Point in Time Recovery - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Re: Point in Time Recovery |
Date | |
Msg-id | 1089150005.17493.270.camel@stromboli |
In response to | Re: Point in Time Recovery (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Point in Time Recovery |
List | pgsql-hackers |
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > - when we stop, keep reading records until EOF, just don't apply them.
> > When we write a checkpoint at end of recovery, the unapplied
> > transactions are buried alive, never to return.
> > - stop where we stop, then force zeros to EOF, so that no possible
> > record remains of previous transactions.
>
> Go with plan B; it's best not to destroy data (what if you chose the
> wrong restart point the first time)?
>
> Actually this now reminds me of a discussion I had with Patrick
> Macdonald some time ago. The DB2 practice in this connection is that
> you *never* overwrite existing logfile data when recovering. Instead
> you start a brand new xlog segment file, which is given a new "branch
> number" so it can be distinguished from the future-time xlog segments
> that you chose not to apply. I don't recall what the DB2 terminology
> was exactly --- not "branch number" I don't think --- but anyway the
> idea is that when you restart the database after an incomplete recovery,
> you are now in a sort of parallel universe that has its own history
> after the branch point (PITR stop point). You need to be able to
> distinguish archived log segments of this parallel universe from those
> of previous and subsequent incarnations. I'm not sure whether Vadim
> intended our StartUpID to serve this purpose, but it could perhaps be
> used that way, if we reflected it in the WAL file names.
>

Some more thoughts...focusing on what we do after we've finished
recovering.

The objectives, as I see them, are to put the system into a state that
preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we need to ensure that any xlog records skipped at the admin's
request (in PITR mode) are never in a position to be re-applied to this
timeline.
4. ensure we can re-recover, if we need to, without further problems

Tom's concept above, I'm going to call timelines. A timeline is the
sequence of logs created by the execution of a server. If you recover
the database, you create a new timeline. [This is because, if you've
invoked PITR you absolutely definitely want log records written to, say,
xlog15 to be different to those that were written to xlog15 in a
previous timeline that you have chosen not to reapply.]

Objective (1) is complex.
When we are restoring, we always start with archived copies of the xlog,
to make sure we don't finish too soon. We roll forward until we either
reach the PITR stop point, or we hit the end of the archived logs. If we
hit the end of the logs on archive, then we switch to a local copy, if
one exists that is higher than those, and carry on rolling forward until
either we reach the PITR stop point, or we hit the end of that log.
(Hopefully, there isn't more than one local xlog higher than the
archive, but it's possible).
If we are rolling forward on local copies, then they are our only
copies. We'd really like to archive them ASAP, but the archiver's not
running yet - we don't want to force that situation in case the archive
device (say a tape) is the one being used to recover right now. So we
write an archive_status of .ready for that file, ensuring that the
checkpoint won't remove it until it gets copied to archive, whenever
that starts working again. Objective (1) met.

When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches
That way, when we come up and begin running, we never overwrite files
that might have been written previously. Objective (2) met.
We do the other stuff because recovery finishes up by pointing to the
last applied record...which is what was causing all of this extra work
in the first place.
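
To illustrate the roll-forward order described under Objective (1), here
is a minimal sketch in Python. It is not taken from the xlog code:
segment numbers are modelled as plain integers, and plan_roll_forward is
a hypothetical name chosen for this example. The PITR stop point is left
out to keep it short.

```python
def plan_roll_forward(archived, local):
    """Return (replay_order, mark_ready) for a recovery run.

    archived: segment numbers available in the archive (assumed ints)
    local:    segment numbers present in the local xlog directory
    """
    # Always replay archived copies first, in ascending order.
    replay_order = sorted(archived)
    highest_archived = replay_order[-1] if replay_order else -1

    # Only local segments higher than anything in the archive are used.
    # They are our only copies, so they are the ones to flag .ready.
    local_only = sorted(s for s in local if s > highest_archived)

    return replay_order + local_only, local_only


# Archive holds 10..12; the local directory also has 12 and a newer 13.
plan, mark_ready = plan_roll_forward(archived=[10, 11, 12], local=[12, 13])
print(plan)        # segments replayed, in order
print(mark_ready)  # local segments that would get a .ready status file
```

The local copy of segment 12 is ignored in favour of the archived one,
and only segment 13 is flagged, which matches the rule that the archive
is read first and local files are used only beyond it.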

When recovery finishes, we also reset the secondary checkpoint record,
so that should recovery be required again before the next checkpoint AND
the shutdown checkpoint record written after recovery completes is
wrong/damaged, the recovery will not autorewind back past the PITR stop
point and attempt to recover the records we have just tried so hard to
reverse/ignore. Objective (3) met. (Clearly, that situation seems
unlikely, but I feel we must deal with it...a newly restored system is
actually very fragile, so a crash again within 3 minutes or so is very
commonplace, as far as these things go).

Should we need to re-recover, we can do so because the new timeline
xlogs are further forward than the old timeline, so never get seen by
any processes (all of which look backwards). Re-recovery is possible
without problems, if required. This means you're a lot safer from some
of the mistakes you might have made, such as deciding you need to go
into recovery, then realising it wasn't required (or some other painful
flapping as goes on in computer rooms at 3am).

How do we implement timelines?
The main presumption in the code is that xlogs are sequential. That has
two effects:
1. during recovery, we try to open the "next" xlog by adding one to the
numbers and then looking for that file
2. during checkpoint, we look for filenames less than the current
checkpoint marker
Creating a timeline by adding a larger number to LogId allows us to
prevent (1) from working, yet without breaking (2).
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.

Should we implement timelines?
Yes, I think we should. I've already hit the problems that timelines
solve in my testing, so they will be hit in practice too, just when you
don't need the hassle.
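
To make the LogId idea from "How do we implement timelines?" concrete,
here is a rough sketch in Python. Everything in it is an assumption for
illustration: the file name layout (LogId and segment, each as 8 hex
digits), the TIMELINE_JUMP value, and the function names. Segment
rollover into the next LogId is ignored.

```python
TIMELINE_JUMP = 0x01000000  # hypothetical "very large number" added to LogId


def xlog_name(log_id, seg):
    # Assumed naming scheme: fixed-width hex, so names sort like numbers.
    return f"{log_id:08X}{seg:08X}"


def start_new_timeline(log_id):
    # After an archive recovery, jump far ahead instead of reusing names.
    return log_id + TIMELINE_JUMP, 0


def next_exists(existing, log_id, seg):
    # Effect (1): recovery probes for the "next" file by adding one.
    return xlog_name(log_id, seg + 1) in existing


def removable(existing, checkpoint_name):
    # Effect (2): checkpoint looks for names below the current marker.
    return sorted(n for n in existing if n < checkpoint_name)


# Old timeline: recovery stopped in segment 15; segment 16 was skipped.
old = {xlog_name(0, 14), xlog_name(0, 15), xlog_name(0, 16)}
new_log, new_seg = start_new_timeline(0)
new_name = xlog_name(new_log, new_seg)
existing = old | {new_name}

# Adding one on the new timeline can never land on the skipped segment 16.
skipped_reachable = xlog_name(new_log, new_seg + 1) == xlog_name(0, 16)
# Adding one on the old timeline never walks into the new timeline either.
old_reaches_new = next_exists(existing, 0, 16)
# The "less than the checkpoint marker" test still sees every old file.
cleanup = removable(existing, new_name)
```

With these assumptions the sequential probe in (1) cannot cross between
timelines in either direction, while the comparison in (2) still treats
all old-timeline files as older than the new checkpoint.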

Comments much appreciated, assuming you read this far...

Best regards, Simon Riggs