Why we really need timelines *now* in PITR - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Why we really need timelines *now* in PITR |
Date | |
Msg-id | 16709.1090096569@sss.pgh.pa.us |
Responses | Re: Why we really need timelines *now* in PITR |
List | pgsql-hackers |

If we do not add timeline numbers to WAL file names, we will be forced to destroy information during recovery. Consider the following scenario:

1. You have a WAL directory containing, say, WAL segments 0010 to 0020 (for the purposes of this example I won't bother typing out realistic 16-digit filenames, but just use 4-digit names).

2. You discover that your junior DBA messed up badly and you need to revert to yesterday evening's state. Let's say the chosen recovery end time is in the middle of file 0014.

3. You run the recovery process. At its end, the WAL end pointer will be 0014 and some offset.

If we simply run forward from this situation, then we will be overwriting existing WAL records in the existing files 0014-0020. This is bad from the point of view of not wanting to discard information (what if we decide we should have recovered to a later time??), but there is an even more serious reason for not doing that. Suppose we suffer a crash sometime after recovery. On restart, the system will start replaying the logs, and *there will be nothing to keep it from replaying all the way to the end of file 0020*. (The files will contain proper, in-sequence page headers, so the tests that normally detect recycled log segments won't think there is anything wrong.) This will leave you with a thoroughly corrupt database.

One way to solve this would be to physically discard 0015-0020 as soon as we decide we're stopping short of the end of WAL. I think that is unacceptable on don't-throw-away-information grounds.

I think it would be far better to invent the timeline concept. Then, our old WAL files would be named say 0001.0010 through 0001.0020, and we would start logging into 0002.0014 after recovery.

A slightly tricky point is that we have to "sew together" the end of one timeline and the start of the next --- for instance, we only want the front part of 0001.0014, not the back part, to be part of the new timeline.
Patrick Macdonald told me about a pretty baroque scheme that DB2 uses for this, but I think it would be simplest if we just copied the appropriate amount of data from 0001.0014 into 0002.0014 and then ran forward from there. Copying a max of 16MB of data doesn't sound very onerous.

During WAL replay or recovery, there would be a notion of the "target timeline" that you are trying to recover to a point within. The rule for selecting which WAL segment file to read is "use the one with largest timeline number less than or equal to the target, and never less than the timeline number you used for the previous segment".

So for example if we realized we'd chosen the wrong recovery target time, we could backpedal and redo the same recovery process with target timeline 0001, ignoring any WAL segments that had been archived with timeline 0002. Alternatively, if we were simply doing crash recovery in timeline 0002, we could stop at (say) segment 0002.0018, and we'd know that we should ignore 0001.0019 because it is not in our timeline.

regards, tom lane
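
The segment-selection rule quoted above can be illustrated with a small sketch. This is only a toy model of the rule as stated in this message, not code from any actual implementation: the names `choose_segment` and `replay_sequence` are hypothetical, and segments are modeled as `(timeline, segment number)` pairs using the 4-digit example numbering.

```python
from typing import List, Optional, Set, Tuple

# A WAL segment is modeled as (timeline, segment_number); this is an
# illustrative assumption, not a real file-name parser.
Segment = Tuple[int, int]


def choose_segment(available: Set[Segment], segno: int,
                   target_tli: int, prev_tli: int) -> Optional[Segment]:
    """Pick the file for segment `segno`: the largest timeline that is
    <= target_tli and >= the timeline used for the previous segment."""
    for tli in range(target_tli, prev_tli - 1, -1):
        if (tli, segno) in available:
            return (tli, segno)
    return None  # nothing in our timeline: replay stops here


def replay_sequence(available: Set[Segment], start_segno: int,
                    target_tli: int, start_tli: int = 1) -> List[Segment]:
    """List the segments that replay would read, in order."""
    sequence: List[Segment] = []
    prev_tli = start_tli
    segno = start_segno
    while True:
        seg = choose_segment(available, segno, target_tli, prev_tli)
        if seg is None:
            break
        sequence.append(seg)
        prev_tli = seg[0]  # never drop back to an older timeline
        segno += 1
    return sequence


if __name__ == "__main__":
    # Timeline 1 holds segments 0010-0020; timeline 2 was started at 0014
    # and has reached 0018.
    available = {(1, s) for s in range(10, 21)} | {(2, s) for s in range(14, 19)}
    for target in (2, 1):
        names = ["%04d.%04d" % seg for seg in replay_sequence(available, 10, target)]
        print("target timeline %04d:" % target, " ".join(names))
```

With target timeline 2, the sketch switches from timeline 1 to timeline 2 at segment 0014 and stops after 0002.0018, because 0001.0019 is below the timeline already in use. With target timeline 1, it ignores every timeline-2 file and reads 0001.0010 through 0001.0020.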