Thread: Why we really need timelines *now* in PITR
If we do not add timeline numbers to WAL file names, we will be forced to destroy information during recovery. Consider the following scenario:

1. You have a WAL directory containing, say, WAL segments 0010 to 0020 (for the purposes of this example I won't bother typing out realistic 16-digit filenames, but just use 4-digit names).

2. You discover that your junior DBA messed up badly and you need to revert to yesterday evening's state. Let's say the chosen recovery end time is in the middle of file 0014.

3. You run the recovery process. At its end, the WAL end pointer will be 0014 and some offset.

If we simply run forward from this situation, then we will be overwriting existing WAL records in the existing files 0014-0020. This is bad from the point of view of not wanting to discard information (what if we decide we should have recovered to a later time??), but there is an even more serious reason for not doing that. Suppose we suffer a crash sometime after recovery. On restart, the system will start replaying the logs, and *there will be nothing to keep it from replaying all the way to the end of file 0020*. (The files will contain proper, in-sequence page headers, so the tests that normally detect recycled log segments won't think there is anything wrong.) This will leave you with a thoroughly corrupt database.

One way to solve this would be to physically discard 0015-0020 as soon as we decide we're stopping short of the end of WAL. I think that is unacceptable on don't-throw-away-information grounds. I think it would be far better to invent the timeline concept. Then, our old WAL files would be named say 0001.0010 through 0001.0020, and we would start logging into 0002.0014 after recovery.

A slightly tricky point is that we have to "sew together" the end of one timeline and the start of the next --- for instance, we only want the front part of 0001.0014, not the back part, to be part of the new timeline.
Patrick Macdonald told me about a pretty baroque scheme that DB2 uses for this, but I think it would be simplest if we just copied the appropriate amount of data from 0001.0014 into 0002.0014 and then ran forward from there. Copying a max of 16MB of data doesn't sound very onerous.

During WAL replay or recovery, there would be a notion of the "target timeline" that you are trying to recover to a point within. The rule for selecting which WAL segment file to read is "use the one with the largest timeline number less than or equal to the target, and never less than the timeline number you used for the previous segment". So for example if we realized we'd chosen the wrong recovery target time, we could backpedal and redo the same recovery process with target timeline 0001, ignoring any WAL segments that had been archived with timeline 0002. Alternatively, if we were simply doing crash recovery in timeline 0002, we could stop at (say) segment 0002.0018, and we'd know that we should ignore 0001.0019 because it is not in our timeline.

			regards, tom lane
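[The selection rule above can be sketched in a few lines. This is illustrative Python, not PostgreSQL source; segment files are modeled as (timeline, segment) integer pairs standing in for names like 0002.0016.]

```python
# Sketch of the segment-selection rule: read forward segment by segment,
# each time taking the copy whose timeline number is the largest one that
# is <= the target timeline, but never smaller than the timeline used for
# the previous segment.

def select_segments(available, target_tli, start_seg, end_seg):
    """available: set of (timeline, segment) pairs present on disk/archive."""
    chosen = []
    current_tli = 0           # may only ratchet upward as we scan forward
    for seg in range(start_seg, end_seg + 1):
        candidates = [tli for (tli, s) in available
                      if s == seg and current_tli <= tli <= target_tli]
        if not candidates:
            break             # ran off the end of WAL for this timeline
        current_tli = max(candidates)
        chosen.append((current_tli, seg))
    return chosen
```

With segments 0001.0014-0001.0020 plus 0002.0014-0002.0018 present, a target of timeline 0002 replays the 0002 files and stops after 0002.0018, ignoring 0001.0019; a target of 0001 replays only the 0001 files.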
On Sat, 2004-07-17 at 21:36, Tom Lane wrote:
> If we do not add timeline numbers to WAL file names, we will be forced
> to destroy information during recovery. Consider the following
> scenario:
>
> 1. You have a WAL directory containing, say, WAL segments 0010 to 0020
> (for the purposes of this example I won't bother typing out realistic
> 16-digit filenames, but just use 4-digit names).
>
> 2. You discover that your junior DBA messed up badly and you need to
> revert to yesterday evening's state. Let's say the chosen recovery end
> time is in the middle of file 0014.
>
> 3. You run the recovery process. At its end, the WAL end pointer will
> be 0014 and some offset.
>
> If we simply run forward from this situation, then we will be
> overwriting existing WAL records in the existing files 0014-0020.
> This is bad from the point of view of not wanting to discard information
> (what if we decide we should have recovered to a later time??), but
> there is an even more serious reason for not doing that. Suppose we
> suffer a crash sometime after recovery. On restart, the system will
> start replaying the logs, and *there will be nothing to keep it from
> replaying all the way to the end of file 0020*. (The files will contain
> proper, in-sequence page headers, so the tests that normally detect
> recycled log segments won't think there is anything wrong.) This will
> leave you with a thoroughly corrupt database.
>
> One way to solve this would be to physically discard 0015-0020 as soon
> as we decide we're stopping short of the end of WAL. I think that is
> unacceptable on don't-throw-away-information grounds. I think it would
> be far better to invent the timeline concept. Then, our old WAL files
> would be named say 0001.0010 through 0001.0020, and we would start
> logging into 0002.0014 after recovery.
> A slightly tricky point is that we have to "sew together" the end of one
> timeline and the start of the next --- for instance, we only want the
> front part of 0001.0014, not the back part, to be part of the new
> timeline. Patrick Macdonald told me about a pretty baroque scheme that
> DB2 uses for this, but I think it would be simplest if we just copied
> the appropriate amount of data from 0001.0014 into 0002.0014 and then
> ran forward from there. Copying a max of 16MB of data doesn't sound
> very onerous.

Well, yes - I completely agree that we need the timeline concept as one of the highest priorities. I originally raised the problem timelines solve because of the errors I had experienced re-running restores many times with the same archive set. It's just too easy to overwrite log files without the timeline concept.

IMHO you don't need to change the xlog format as a necessary step to introduce timelines. Simply adding FFFF to the logid is sufficient (which, let's face it, takes a heck of a long time before it gets to 1...).

[Also, as an extra detail on your analysis, when recovery is finished you need to move both primary and secondary checkpoint markers forward to the new timeline, so that crash recovery can't go back to the old timeline.]

If you're going to change xlog filenames, then I would think that adding the system identifier to the xlogs would be a very good addition. I would simply have recommended keeping them in separate directories, but putting it on the name would be best. PostgreSQL doesn't have a name concept... which would be the thing to use if it did.

> During WAL replay or recovery, there would be a notion of the "target
> timeline" that you are trying to recover to a point within. The rule
> for selecting which WAL segment file to read is "use the one with
> largest timeline number less than or equal to the target, and never less
> than the timeline number you used for the previous segment".
> So for
> example if we realized we'd chosen the wrong recovery target time, we
> could backpedal and redo the same recovery process with target timeline
> 0001, ignoring any WAL segments that had been archived with timeline
> 0002. Alternatively, if we were simply doing crash recovery in timeline
> 0002, we could stop at (say) segment 0002.0018, and we'd know that we
> should ignore 0001.0019 because it is not in our timeline.

That sounds like the way it should work.

The way you write this makes me think you might mean you would allow: we can start recovering in one timeline, then rollforward takes us through all the timeline nexus points required to get us to the target timeline. I had imagined that recovery would only ever be allowed to start and end on the same timeline. I think you probably mean that?

Another of the issues I was thinking through was what happens at the end of your scenario above:
- You're on timeline 1 and you need to perform recovery.
- You perform recovery and timeline 2 is created.
- You discover another error and decide to recover again.
- You recover timeline 1 again: what do you name the new timeline created? 2 or 3?

If you call it 2 you will be overwriting data just like you would have done - which is why timelines were invented, so that's got to be a bad plan. If you think to avoid this by calling it 3, how do you know to do that?

My imperfect solution to that was to use a randomised future timeline number, greatly reducing the chance of ever conflicting on timeline names. There's probably a solution to this used by other RDBMS, but I don't know what it is and haven't gone looking, on the basis that it's likely to be patented anyway...

[If you do this by adding a big number to the LogId, then 0002.0018 is simply numerically larger than 0001.0019, so it wouldn't ever be considered.] You then don't need the idea of a target timeline explicitly - and therefore the user can't get wrong which timeline they're on/want to be on.
Best regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> The way you write this makes me think you might mean you would allow: we
> can start recovering in one timeline, then rollforward takes us through
> all the timeline nexus points required to get us to the target
> timeline.

Sure. Let's draw a diagram:

	0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...
	                 |
	                 + 0002.0016 - 0002.0017 - ...
	                        |
	                        + 0003.0017 - ...

If you decide you would like to recover to someplace in timeline 0002, you need to take the 0002 log files where they exist, and the 0001 log files where there is no 0002, except you do not revert to 0001 once you have used an 0002 file (this restriction is needed in case the 0001 timeline goes to higher segment numbers than 0002 has reached). In no case do you use an 0003 file.

> I had imagined that recovery would only ever be allowed to start and end
> on the same timeline. I think you probably mean that?

Logically it's all one timeline, I suppose, but to implement it physically that way would mean duplicating all past 0001 segments when we want to create the 0002 timeline. That's not practical and not necessary.

> Another of the issues I was thinking through was what happens at the end
> of your scenario above:
> - You're on timeline 1 and you need to perform recovery.
> - You perform recovery and timeline 2 is created.
> - You discover another error and decide to recover again.
> - You recover timeline 1 again: what do you name the new timeline
> created? 2 or 3?

You really want to call it 3. To enforce this mechanically would require having a counter that sits outside the $PGDATA directory and is not subject to being reverted by a restore-from-backup. I don't see any very clean way to do that at the moment --- any thoughts?

In the absence of such a counter we could ask the DBA to specify a new timeline number in recovery.conf, but this strikes me as one of those easy-to-get-wrong things ...
One possibility is to extend the archiving API so that we can inquire about the largest timeline number that exists anywhere in the archive. If we take new timeline number = 1 + max(any in archive, any in pg_xlog) then we are safe. But I'm not really convinced that such a thing would be any less error-prone than the manual way :-(, because for any archival method that's more complicated than "cp them all into one directory", it'd be hard to extract the max stored filename.

			regards, tom lane
On Mon, 2004-07-19 at 04:31, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The way you write this makes me think you might mean you would allow: we
> > can start recovering in one timeline, then rollforward takes us through
> > all the timeline nexus points required to get us to the target
> > timeline.
>
> Sure. Let's draw a diagram:
>
> 	0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...
> 	                 |
> 	                 + 0002.0016 - 0002.0017 - ...
> 	                        |
> 	                        + 0003.0017 - ...
>
> If you decide you would like to recover to someplace in timeline 0002,
> you need to take the 0002 log files where they exist, and the 0001
> log files where there is no 0002, except you do not revert to 0001
> once you have used an 0002 file (this restriction is needed in case
> the 0001 timeline goes to higher segment numbers than 0002 has reached).
> In no case do you use an 0003 file.
>
> > I had imagined that recovery would only ever be allowed to start and end
> > on the same timeline. I think you probably mean that?
>
> Logically it's all one timeline, I suppose, but to implement it
> physically that way would mean duplicating all past 0001 segments when
> we want to create the 0002 timeline. That's not practical and not
> necessary.
>
> > Another of the issues I was thinking through was what happens at the end
> > of your scenario above:
> > - You're on timeline 1 and you need to perform recovery.
> > - You perform recovery and timeline 2 is created.
> > - You discover another error and decide to recover again.
> > - You recover timeline 1 again: what do you name the new timeline
> > created? 2 or 3?
>
> You really want to call it 3. To enforce this mechanically would
> require having a counter that sits outside the $PGDATA directory and
> is not subject to being reverted by a restore-from-backup. I don't
> see any very clean way to do that at the moment --- any thoughts?
> In the absence of such a counter we could ask the DBA to specify a new
> timeline number in recovery.conf, but this strikes me as one of those
> easy-to-get-wrong things ...
>
> One possibility is to extend the archiving API so that we can inquire
> about the largest timeline number that exists anywhere in the archive.
> If we take new timeline number = 1 + max(any in archive, any in pg_xlog)
> then we are safe. But I'm not really convinced that such a thing would
> be any less error-prone than the manual way :-(, because for any
> archival method that's more complicated than "cp them all into one
> directory", it'd be hard to extract the max stored filename.

Think the same as you do on all of that. Excellent.

Some further thinking from that base... Perhaps timelines should be nest-numbered (using 0 as a counter also):

0 etc is the original branch
0.1 is the first recovery off the original branch
0.2 is the second recovery off the original branch
0.1.1 is the first recovery off the first recovery (so to speak)
0.1.2 is the second etc

That way you don't have the problem of "which is 3?" in the examples above. [Would we number a recovery of 1 as 3, or would the next recovery off 2 be numbered 3?]

Not necessarily the way we would show that as a timeline number. It could still be shown as a single hex number representing each nesting level as 4 bits... (restricting us to 7 recoveries per timeline...)

Just as a thought, DB2 uses the concept of a history file also... If we go with renaming recovery.conf when it completes, why not make that the record of previous recoveries? Move it to archive_status and name it according to the timeline it just created, e.g. recovery.done.<timeline>.<timestamp>. If you're re-restoring into the same directory you wouldn't then overwrite the history of previous recoveries.

It would be an extremely bad thing to have to specify the timeline number, and I agree we can't ask the archive.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Some further thinking from that base... Perhaps timelines should be
> nest-numbered (using 0 as a counter also):
> 0 etc is the original branch
> 0.1 is the first recovery off the original branch
> 0.2 is the second recovery off the original branch
> 0.1.1 is the first recovery off the first recovery (so to speak)
> 0.1.2 is the second etc
> That way you don't have the problem of "which is 3?" in the examples
> above. [Would we number a recovery of 1 as 3, or would the next
> recovery off 2 be numbered 3?]

Hmm. This would have some usefulness as far as documenting "how did we get here", but unless you knew where/when the timeline splits had occurred, I don't think it would be super useful. It'd make more sense to record the parentage and split time of each timeline in some human-readable "meta-history" reference file (but where exactly?). I don't think it does anything to solve our immediate problem, anyhow. You may know that you are recovering off of branch 0.1, but how do you know if this is the first, second, or Nth time you have done that?

> Not necessarily the way we would show that as a timeline number. It
> could still be shown as a single hex number representing each nesting
> level as 4 bits... (restricting us to 7 recoveries per timeline...)

Sounds too tight to me :-(

I do see a hole in my original concept now that you mention it. It'd be quite possible for timeline 2 *not* to be an ancestor of timeline 3; that is, you might have tried a restore, not liked the result, and decided to re-restore from someplace else on timeline 1. That is, instead of

	0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...
	                 |
	                 + 0002.0016 - 0002.0017 - ...
	                        |
	                        + 0003.0017 - ...

maybe the history is

	0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...
	                 |           |
	                 |           + 0003.0017 - ...
	                 |
	                 + 0002.0016 - 0002.0017 - ...

where I've had to draw 3 above 2 to avoid having unrelated lines crossing each other in my diagram.
The problem here is that a crash recovery in timeline 3 would not realize that it should not use WAL segment 0002.0016. So we need a more sophisticated rule than just numerical comparison of timeline numbers. I think your idea of nested numbers might fix this, but I'm concerned about the restrictions of fitting it into 32 bits as you suggest. Can we think of a less restrictive representation?

> If we go with renaming recovery.conf when it completes, why not make
> that the record of previous recoveries? Move it to archive_status and
> name it according to the timeline it just created, e.g.
> recovery.done.<timeline>.<timestamp>

There's still the problem of how you can be sure that all the files created in the past are still in there. It'd be way too likely for someone to decide they ought to do a recovery restore by first doing "rm -rf $PGDATA". Or they lost the disk entirely and are restoring their last full backup onto virgin media.

I think there's really no way around the issue: somehow we've got to keep some meta-history outside the $PGDATA area, if we want to do this in a clean fashion. We could perhaps expect the archive area to store it, but I'm a bit worried about the idea of overwriting the meta-history file in archive from time to time; it's mighty critical data and you'd not be happy if a crash corrupted your only copy. We could archive meta-history files with successively higher versioned names ... but then we need an API extension to get back the latest one.

			regards, tom lane
I wrote:
> I think there's really no way around the issue: somehow we've got to
> keep some meta-history outside the $PGDATA area, if we want to do this
> in a clean fashion.

After further thought I think we can fix this stuff by creating a "history file" for each timeline. This will make recovery slightly more complicated, but I don't think it would be any material performance problem. Here's how it goes:

* Timeline IDs are 32-bit ints with no particular semantic significance (that is, we do not assume timeline 3 is a child of 2, or anything like that). The actual parentage of a timeline has to be found by inspecting its history file.

* History files will be named by their timeline ID, say "00000042.history". They will be created in /pg_xlog whenever a new timeline is created by the act of doing a recovery to a point in time earlier than the end of existing WAL. When doing WAL archiving, a history file can be copied off to the archive area by the existing archiver mechanism (ie, we'll make a .ready file for it as soon as it's written).

* History files will be plain text (for human consumption) and will essentially consist of a list of parent timeline IDs in sequence. I envision adding the timeline split timestamp and starting WAL segment number too, but these are for documentation purposes --- the system doesn't need them. We may as well allow comments in there as well, so that the DBA can annotate the reasons for a PITR split to have been done. So the contents might look like

	# Recover from unintentional TRUNCATE
	00000001	0000000A00142568	2005-05-16 12:34:15 EDT
	# Ex-assistant DBA dropped wrong table
	00000007	0000002200005434	2005-11-17 18:44:44 EST

When we split off a new timeline, we just have to copy the parent's history file (which we can do verbatim including comments) and then add a new line at the end showing the immediate parent's timeline ID and the other details of the split.
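[To illustrate how simple consuming this format would be, here is a hypothetical parser - plain Python, a sketch only, not proposed server code. Comment lines are skipped, and only the leading timeline-ID field on each data line matters to the system.]

```python
# Sketch of reading the proposed history-file format: '#' lines are
# comments; each data line starts with a parent timeline ID in hex,
# followed by split segment and timestamp (documentation only).

def parse_history(text):
    parents = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        parents.append(int(line.split()[0], 16))
    return parents
```

For the sample file above this yields the parent chain [1, 7], which is all recovery needs to know.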
Initdb can create 00000001.history with empty contents (since that timeline has no parents).

* When we need to do recovery, we first identify the source timeline (either by reading the current timeline ID from pg_control, or the DBA can tell us with a parameter in recovery.conf). We then read the history file for that timeline, and remember its sequence of parent timeline IDs. We can crosscheck that pg_control's timeline ID is one of this set of timeline IDs, too --- if it's not, then the wrong backup file was restored.

* During recovery, whenever we need to open a WAL segment file, we first try to open it with the source timeline ID; if that doesn't exist, try the immediate parent timeline ID; then the grandparent, etc. Whenever we find a WAL file with a particular timeline ID, we forget about all parents further up in the history, and won't try to open their segments anymore (this is the generalization of my previous rule that you never drop down in timeline number as you scan forward).

* If we end recovery because we have rolled forward off the end of WAL, we can just continue using the source timeline ID --- we are extending that timeline. (Thus, an ordinary crash and restart doesn't require generating a new timeline ID; nor do we generate a new one during normal postmaster stop/start.) But if we stop recovery at a requested point-in-time earlier than end of WAL, we have to branch off a new timeline. We do this by:

	* Selecting a previously unused timeline ID (see below).
	* Writing a history file for this ID, by copying the parent timeline's history file and adding a new line at the end.
	* Copying the last-used WAL segment of the parent timeline, giving it the same segment number but the new timeline's ID. This becomes the active WAL segment when we start operating.
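[The open-with-fallback rule can be sketched as follows - illustrative Python only. `tli_chain` is the source timeline followed by its parents as read from the history file, and segments are modeled as (timeline, segment) pairs.]

```python
# Sketch of the recovery read rule: for each segment, try the source
# timeline, then each parent in turn; once a segment is found on some
# timeline, all ancestors above that timeline are dropped from
# consideration for later segments.

def replay_segments(available, tli_chain, start_seg, end_seg):
    """tli_chain: [source, parent, grandparent, ...];
    available: set of (timeline, segment) pairs in pg_xlog/archive."""
    used = []
    chain = list(tli_chain)
    for seg in range(start_seg, end_seg + 1):
        for i, tli in enumerate(chain):
            if (tli, seg) in available:
                chain = chain[:i + 1]   # forget parents above this one
                used.append((tli, seg))
                break
        else:
            break                       # end of WAL
    return used
```

In the "timeline 3 branched from 1, not 2" history discussed earlier, recovery on timeline 3 uses the chain [3, 1], so 0002.0016 is never even considered although 2 < 3 numerically - exactly the case that plain numeric comparison gets wrong.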
* We can identify the highest timeline ID ever used by simply starting with the source timeline ID and probing pg_xlog and the archive area for history files N+1.history, N+2.history, etc until we find an ID for which there is no history file. Under reasonable scenarios this will not take very many probes, so it doesn't seem that we need any addition to the archiver API to make it more efficient.

* Since history files will be small and made infrequently (one hopes you do not need to do a PITR recovery very often...) I see no particular reason not to leave them in /pg_xlog indefinitely. The DBA can clean out old ones if she is a neatnik, but I don't think the system needs to or should delete them. Similarly, the archive area could be expected to retain history files indefinitely.

* However, you *can* throw away a history file once you are no longer interested in rolling back to times predating the splitoff point of the timeline. If we don't find a history file, we can just act as though the timeline has no parents (extends indefinitely far in the past). (Hm, so we don't actually have to bother creating 00000001.history...)

* I'm intending to replace the current concept of StartUpID (SUI) by timeline IDs --- we'll record timeline IDs not SUIs in data page headers and WAL page headers. SUI isn't doing anything of value for us; I think it was probably intended to do what timelines will do, but it's not defined quite right for the purpose. One good thing about timeline IDs for WAL page headers is that we know exactly which IDs should be expected in a WAL file (either the current timeline or one of its parents); this allows a much tighter check than is possible with SUIs.

Anybody see any holes in this design?

			regards, tom lane
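[The probe loop for picking a fresh timeline ID is equally small. This is an illustrative Python sketch; the existence check is abstracted as a callable so it can stand for lookups in both pg_xlog and the archive area.]

```python
# Sketch of selecting a previously unused timeline ID: starting from the
# source timeline, probe for N+1.history, N+2.history, ... until a
# history file is missing everywhere.

def next_timeline_id(source_tli, history_exists):
    """history_exists(tli) -> True if NNNNNNNN.history is present
    in pg_xlog or the archive area for that timeline ID."""
    tli = source_tli + 1
    while history_exists(tli):
        tli += 1
    return tli
```

If timelines 2 and 3 already have history files and we are recovering on timeline 1 again, the new branch gets ID 4 - not 2, which a naive "source + 1" rule would wrongly reuse.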
On Mon, 2004-07-19 at 16:58, Tom Lane wrote:
> I think there's really no way around the issue: somehow we've got to
> keep some meta-history outside the $PGDATA area, if we want to do this
> in a clean fashion. We could perhaps expect the archive area to store
> it, but I'm a bit worried about the idea of overwriting the meta-history
> file in archive from time to time; it's mighty critical data and you'd
> not be happy if a crash corrupted your only copy. We could archive
> meta-history files with successively higher versioned names ... but then
> we need an API extension to get back the latest one.

Yes, you've convinced me. It is critical data, but never for that long. If we only split timelines when we recover, then we just make sure to take about ~100 copies of it immediately. If we really did recover OK, then it'll only be a few days/weeks before we can forget it ever happened.

The crucial time is when re-running recoveries repeatedly, and if we write the manual with sufficient red ink then we'll avoid this. But heck, not having your history file is only as bad as not having added timelines in the first place. Not great, just more care required.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> The crucial time is when re-running recoveries repeatedly, and if we
> write the manual with sufficient red ink then we'll avoid this. But
> heck, not having your history file is only as bad as not having added
> timelines in the first place. Not great, just more care required.

Yeah, you only really need them when you are hip-deep in repeated recovery retries. If you haven't gotten to my later proposal yet, the history files will be plain text, and it'd be at least theoretically possible for someone to reconstruct one by hand if needed. All you need to have is the sequence of parent timeline IDs, which you could reconstruct in most cases by looking at the archived WAL files.

			regards, tom lane
On Mon, 2004-07-19 at 19:33, Tom Lane wrote:
> I wrote:
> > I think there's really no way around the issue: somehow we've got to
> > keep some meta-history outside the $PGDATA area, if we want to do this
> > in a clean fashion.
>
> After further thought I think we can fix this stuff by creating a
> "history file" for each timeline. This will make recovery slightly more
> complicated but I don't think it would be any material performance
> problem. Here's how it goes:

Yes... I came to the conclusion that trying to avoid doing something like DB2 does was just stubbornness on my part. We may as well use analogies with other systems when they are available.

All of this is good. Two main areas of comments/questions, noted (**). Timelines should be easy to understand for anybody that can follow a HACKERS conversation anyhow :)

> * Timeline IDs are 32-bit ints with no particular semantic significance
> (that is, we do not assume timeline 3 is a child of 2, or anything like
> that). The actual parentage of a timeline has to be found by inspecting
> its history file.

OK... that's better. The nested idea doesn't read well second time through.

> * History files will be named by their timeline ID, say "00000042.history".
> They will be created in /pg_xlog whenever a new timeline is created
> by the act of doing a recovery to a point in time earlier than the end
> of existing WAL. When doing WAL archiving a history file can be copied
> off to the archive area by the existing archiver mechanism (ie, we'll
> make a .ready file for it as soon as it's written).

Need to check the archive code, which relies on file shape and length.

> * History files will be plain text (for human consumption) and will
> essentially consist of a list of parent timeline IDs in sequence.
> I envision adding the timeline split timestamp and starting WAL segment
> number too, but these are for documentation purposes --- the system
> doesn't need them.
> We may as well allow comments in there as well,
> so that the DBA can annotate the reasons for a PITR split to have been
> done. So the contents might look like
>
> # Recover from unintentional TRUNCATE
> 00000001 0000000A00142568 2005-05-16 12:34:15 EDT
> # Ex-assistant DBA dropped wrong table
> 00000007 0000002200005434 2005-11-17 18:44:44 EST

Or should there be a recovery_comment parameter in recovery.conf? That would be better than suggesting that admins can edit such an important file. (Even if they can, it's best not to encourage it.)

> When we split off a new timeline, we just have to copy the parent's
> history file (which we can do verbatim including comments) and then
> add a new line at the end showing the immediate parent's timeline ID
> and the other details of the split. Initdb can create 00000001.history
> with empty contents (since that timeline has no parents).

Yes. Will you then delete the previous timeline's history file or just leave it there? (OK, you say that later.)

> * When we need to do recovery, we first identify the source timeline
> (either by reading the current timeline ID from pg_control, or the DBA
> can tell us with a parameter in recovery.conf). We then read the
> history file for that timeline, and remember its sequence of parent
> timeline IDs. We can crosscheck that pg_control's timeline ID is
> one of this set of timeline IDs, too --- if it's not then the wrong
> backup file was restored.

** Surely it is the backup itself that determines the source timeline? Backups are always taken in one particular timeline. The rollforward must start at a checkpoint before the begin-backup marker and roll past the end-of-backup marker onwards. The starting checkpoint should be the last checkpoint prior to backup - why would you pick another?
That checkpoint will always be in the current timeline, since we always come out of startup with a checkpoint (either because we shut down earlier, or we recovered and just wrote another shutdown checkpoint). So the backup's timeline will determine the source timeline, but not necessarily the target timeline.

...thinking... recovery.conf would need to specify:
- recovery_target (if there is one, either a time or txnid)
- recovery_target_timeline (if there is one, otherwise end of last one)
- recovery_target_history_file (which specifies how the timeline ids are sequenced)

I take it that your understanding is that the recovery_target timeline needs to be specified also?

> * During recovery, whenever we need to open a WAL segment file, we first
> try to open it with the source timeline ID; if that doesn't exist, try
> the immediate parent timeline ID; then the grandparent, etc. Whenever
> we find a WAL file with a particular timeline ID, we forget about all
> parents further up in the history, and won't try to open their segments
> anymore (this is the generalization of my previous rule that you never
> drop down in timeline number as you scan forward).

This jigging around is OK, because most people will be using only one timeline anyhow, so it's not likely to cause too much fuss for the user.

> * If we end recovery because we have rolled forward off the end of WAL,
> we can just continue using the source timeline ID --- we are extending
> that timeline. (Thus, an ordinary crash and restart doesn't require
> generating a new timeline ID; nor do we generate a new one during
> normal postmaster stop/start.)

Yes, exactly - that's why it can't be the SUID.

> But if we stop recovery at a requested
> point-in-time earlier than end of WAL, we have to branch off a new
> timeline. We do this by:
> * Selecting a previously unused timeline ID (see below).
> * Writing a history file for this ID, by copying the parent
> timeline's history file and adding a new line at the end.
> * Copying the last-used WAL segment of the parent timeline,
> giving it the same segment number but the new timeline's ID.
> This becomes the active WAL segment when we start operating.
>
> * We can identify the highest timeline ID ever used by simply starting
> with the source timeline ID and probing pg_xlog and the archive area
> for history files N+1.history, N+2.history, etc until we find an ID
> for which there is no history file. Under reasonable scenarios this
> will not take very many probes, so it doesn't seem that we need any
> addition to the archiver API to make it more efficient.

** I would prefer to add a random number to the timeline as a way of identifying the next one. This will produce fewer probes, so fewer wasted tape mounts, but most importantly it gets round this issue: You're on timeline X, then you recover and run for a while on timeline Y. You then realise recovering to that target was a really bad idea for some reason (some VIP's record wasn't in the recovered data, etc). We then need to re-recover from the backup on X to a new timeline, Z. But how does X know that Y existed when it creates Z? If Y = f(X) in a deterministic way, then Y will always == Z.

Of course, we could provide an id, but what would you pick? The best way is to get out of trouble by picking a new timeline id that's very unlikely to have been picked before. If the sequence of timeline ids is not important, just pick one from the billions you have available to you (and that aren't mentioned in the history file). We can do this automatically and pick it randomly. That way, when you re-recover you stand a vanishingly small chance of picking any timeline id that you (or indeed anyone!) have ever used.

This will be very important for diagnosing problems, and it is my experience that the re-recovery scenario happens on about 50% of recoveries, i.e. if you recover once, you're very likely to recover 2 or more times before you're really done.
(...and if you don't believe me, look what happened to Danske Bank running DB2 - recovered 4 times inside a week, but hats off to those guys - they got it back in the end).

But then - we also need to be able to identify which was the latest history file, and searching a billion files might take a while. So the sequential numbering does serve a purpose. Both ideas solve only one of the two problems....hmmm, I think perhaps finding the latest file is more important, and so perhaps sequential numbering should win after all?

> * Since history files will be small and made infrequently (one hopes you
> do not need to do a PITR recovery very often...) I see no particular
> reason not to leave them in /pg_xlog indefinitely. The DBA can clean
> out old ones if she is a neatnik, but I don't think the system needs to
> or should delete them. Similarly the archive area could be expected to
> retain history files indefinitely.

OK. Answered question above... Yes, agreed. We'll want them for diagnostics anyway.

> * However, you *can* throw away a history file once you are no longer
> interested in rolling back to times predating the splitoff point of the
> timeline. If we don't find a history file we can just act as though the
> timeline has no parents (extends indefinitely far in the past). (Hm,
> so we don't actually have to bother creating 00000001.history...)

Agreed. That's better: fewer files waiting around, less chance of being deleted by over-diligent admins. But we shouldn't encourage the deletion of those files. The worst problems happen when people "tidy up" after they think recovery is over, then delete an important file, and we're back in traction again.

> * I'm intending to replace the current concept of StartUpID (SUI) by
> timeline IDs --- we'll record timeline IDs not SUIs in data page headers
> and WAL page headers.
> SUI isn't doing anything of value for us; I think
> it was probably intended to do what timelines will do, but it's not
> defined quite right for the purpose. One good thing about timeline IDs
> for WAL page headers is that we know exactly which IDs should be
> expected in a WAL file (either the current timeline or one of its
> parents); this allows a much tighter check than is possible with SUIs.

Definitely agree on this last part; that stuff about 512 SUIs was weird.

> Anybody see any holes in this design?

As said already, all of this is good. Two main areas of comments/questions, noted above (**). That's coherent and good.

Best regards, Simon Riggs
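The segment-selection rule quoted above (try the target timeline's file first, fall back through its parents, and never climb back up to a discarded ancestor) can be sketched like this - a toy Python model, not the actual C implementation. `available` stands in for probing pg_xlog and the archive, and the short segment numbers follow the simplified naming from the start of the thread:

```python
def open_wal_segment(segno, tli_history, available):
    """Choose which timeline's copy of segment `segno` to read.

    tli_history: recovery-target timeline first, then its parent,
                 grandparent, etc.
    available:   set of (timeline, segno) pairs that exist on disk or
                 in the archive (a stand-in for open() probes).
    Returns the chosen timeline plus the pruned history to use for the
    next segment: once a file is found on some timeline, that
    timeline's ancestors are never tried again.
    """
    for i, tli in enumerate(tli_history):
        if (tli, segno) in available:
            return tli, tli_history[: i + 1]
    raise FileNotFoundError("no copy of segment %#x" % segno)

# Timeline 2 branched off timeline 1 partway through segment 0x14, so
# segment 0x14 exists under both IDs; we must prefer timeline 2's copy.
available = {(1, 0x12), (1, 0x13), (1, 0x14), (2, 0x14), (2, 0x15)}
history = [2, 1]
chosen = []
for seg in (0x12, 0x13, 0x14, 0x15):
    tli, history = open_wal_segment(seg, history, available)
    chosen.append(tli)
# chosen is now [1, 1, 2, 2]
```

Note how the pruning (`tli_history[: i + 1]`) drops only the ancestors above the timeline we found, while the branch child is still tried first for every later segment - which is exactly the "never drop down in timeline number as you scan forward" rule.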
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2004-07-19 at 19:33, Tom Lane wrote:
>> When doing WAL archiving a history file can be copied
>> off to the archive area by the existing archiver mechanism (ie, we'll
>> make a .ready file for it as soon as it's written).

> Need to check the archive code which relies on file shape and length

Yeah. I made some adjustments already with the expectation that we'd want to do this, but it'll take a little bit more weakening of the code's tests.

>> * When we need to do recovery, we first identify the source timeline
>> (either by reading the current timeline ID from pg_control, or the DBA
>> can tell us with a parameter in recovery.conf).

> ** Surely it is the backup itself that determines the source timeline?

The backup determines the starting point, but there may be several timelines you could follow after that (especially in the scenario where you're redoing a recovery starting from the same backup). The point here is that there could be timeline branches after the backup occurred. So yes the backup has to be in an ancestral timeline, but not necessarily exactly the recovery-target timeline.

> ...thinking....recovery.conf would need to specify:
> recovery_target (if there is one, either a time or txnid)
> recovery_target_timeline (if there is one, otherwise end of last one)
> recovery_target_history_file (which specifies how the timeline ids are
> sequenced)

No, the source timeline is not necessarily associated with a recovery_target --- for instance you might want it to run to the end of a particular timeline. I suspect it might be more confusing than helpful to use the term "target timeline".

>> try to open it with the source timeline ID; if that doesn't exist, try
>> the immediate parent timeline ID; then the grandparent, etc.

> This jigging around is OK, because most people will be using only one
> timeline anyhow, so its not likely to cause too much fuss for the user.
It might confuse someone who's watching the sequence of archive retrieval requests, but as long as that's all mechanized it doesn't seem like there's any real potential for trouble.

> ** I would prefer to add a random number to the timeline as a way of
> identifying the next one. This will produce fewer probes, so less wasted
> tape mounts,

How do you figure that? Seems like the same number of probes either way.

> but most importantly it gets round this issue:
> You're on timeline X, then you recover and run for a while on timeline
> Y. You then realise recovering to that target was a really bad idea for
> some reason (some VIP's record wasn't in the recovered data etc). We then
> need to re-recover from the backup on X to a new timeline, Z. But how
> does X know that Y existed when it creates Z?

Because there is a Y.history file lying about (either in the archive or pg_xlog). We will need to recommend to DBAs that they not delete Y.history from the archive unless they've already deleted all Y.whatever log segments. Once they have done this, the past existence of timeline Y is no longer of interest, and so there'd be no real problem in recycling the ID.

I would say the above is just as true if you use random IDs as if you use sequential ones. I distrust systems that assume there will never be a collision of "randomly-chosen" IDs.

> If Y = f(x) in a deterministic way, then Y will always == Z. Of course,
> we could provide an id, but what would you pick? The best way is to get
> out of trouble by picking a new timeline id that's very unlikely to have
> been picked before.

I do not see the advantage.

regards, tom lane
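The sequential-probe scheme being debated here is cheap to sketch - roughly this (toy Python, with `history_exists` standing in for the pg_xlog/archive probes for `<N>.history`):

```python
def next_timeline_id(source_tli, history_exists):
    """Find the first timeline ID above source_tli with no history file.

    history_exists(tli): probe pg_xlog and the archive area for
    <tli>.history (stubbed out here).  The loop runs once per timeline
    ID ever used above the source, so under reasonable scenarios it
    stops after only a few probes.
    """
    candidate = source_tli + 1
    while history_exists(candidate):
        candidate += 1
    return candidate

# History files exist for timelines 2 and 3 (00000001.history is
# optional, per the discussion above).
existing = {2, 3}
new_tli = next_timeline_id(1, lambda tli: tli in existing)
# new_tli is 4
```

The same probing also answers the "which was the latest history file?" question that sequential numbering is credited with solving - something a randomly chosen ID cannot do without scanning.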
On Mon, 2004-07-19 at 23:15, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2004-07-19 at 19:33, Tom Lane wrote:
> >> * When we need to do recovery, we first identify the source timeline
> >> (either by reading the current timeline ID from pg_control, or the DBA
> >> can tell us with a parameter in recovery.conf).
>
> > ** Surely it is the backup itself that determines the source timeline?
>
> The backup determines the starting point, but there may be several
> timelines you could follow after that (especially in the scenario where
> you're redoing a recovery starting from the same backup). The point
> here is that there could be timeline branches after the backup
> occurred. So yes the backup has to be in an ancestral timeline, but not
> necessarily exactly the recovery-target timeline.

Agreed.

> > ...thinking....recovery.conf would need to specify:
> > recovery_target (if there is one, either a time or txnid)
> > recovery_target_timeline (if there is one, otherwise end of last one)
> > recovery_target_history_file (which specifies how the timeline ids are
> > sequenced)
>
> No, the source timeline is not necessarily associated with a
> recovery_target --- for instance you might want it to run to the end of
> a particular timeline. I suspect it might be more confusing than
> helpful to use the term "target timeline".

I think we're heatedly agreeing again. A summary: we don't specify the start timeline, but we do specify the timeline which contains our chosen endpoint. [....But when we reach it, we may create a new timeline id if we didn't go to end of logs on that timeline.] The history file specifies how to get from start to end, through however many branchpoints there are....and the history file we use for recovery is the one pointed to by (target_in_timeline).
Or even shorter:
- backup specifies starting timeline (and so user specifies indirectly)
- user specifies end point (explicitly in recovery.conf)
- history file shows how to get from start to end

more thoughts...if you specify:
  target = X
  target_in_timeline
where the default is <notarget>, and if you specify a target, the default target_in_timeline is <latest>. I don't like the name target_in_timeline; I'm just trying to clarify what we mean so we can think of a better name for it.

...we definitely need an offline-log inspection tool, don't we? Next month...

> We will need to recommend to DBAs that they not delete Y.history from
> the archive unless they've already deleted all Y.whatever log segments.
> Once they have done this, the past existence of timeline Y is no longer
> of interest and so there'd be no real problem in recycling the ID.
> I would say the above is just as true if you use random IDs as if you
> use sequential ones. I distrust systems that assume there will never be
> a collision of "randomly-chosen" IDs.

Yes, I argued myself in a circle, but it seemed worth recording just to avoid repeating the thought next time.

Best regards, Simon Riggs
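For concreteness, a recovery.conf in the shape being sketched here might look like the following (parameter names as eventually settled on later in this thread; restore_command is the existing archive-retrieval hook, and the path is of course just an example):

```
restore_command = 'cp /mnt/server/archivedir/%f %p'

# Stop partway through the WAL rather than replaying to the end.
recovery_target_time = '2004-07-19 19:33:00'

# Which timeline to follow to that point; 'latest' means pick the
# newest history file found in the archive.
recovery_target_timeline = 'latest'
```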
Simon Riggs <simon@2ndquadrant.com> writes:
> I think we're heatedly agreeing again.

Yeah, I think so. I'll get started on this tomorrow.

> where the default is <notarget> and if you specify a target, the default
> target_in_timeline is <latest>.

I think actually the default target has to be the timeline ID found in pg_control --- otherwise you get weird behavior in the plain crash recovery, non-PITR case.

> I don't like the name target_in_timeline,

Agreed, but I don't have a better name offhand for it. The point I was making is that we seem to be using "target" to mean a point-in-time stopping target. But you might be interested in going to the end of timeline N, and so there's not a "target" in that sense. That's why I was wanting to avoid using the term "target" for the desired timeline. But maybe there's not a better word.

> ...we definitely need an offline-log inspection tool, don't we?
> Next month...

Yeah. When you get started, I have a toy tool I've been using for awhile that might serve as a starting point. (I'm going to have to whack it around for timelines, so there's little point in attaching it right now...)

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Anybody see any holes in this design?

God help the DBA who deletes a history file with needed information. Or edits it inappropriately.

Why can't every log file contain a header that says which timeline it's part of and which timeline the preceding log file was part of? That would avoid having a file outside PGDATA and would mean the backups would always contain enough information to be self-sufficient.

Then if you want to restore from a cold backup and apply PITR up to segment 0020 in timeline X, you read the header of X.0020, find out it followed X.0019, and so on. If X.0010 branched from Y.0009 you'll find out, and you'll be able to continue threading back until you find a segment that matches the current segment in the cold backup.

The only problem I see is that if your backups are stored on tape it might be awkward to have to read the headers of all log segments in reverse order to backtrack to the right place to start.

-- greg
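Greg's backward-threading idea can be sketched as follows (toy Python; `headers` stands in for reading each segment file's first-page header, which under his proposal would record both this file's timeline and the previous file's):

```python
def thread_back(start, backup_segno, headers):
    """Walk WAL segment headers backwards from `start` until we reach
    the segment contained in the cold backup.

    start:   (timeline, segno) of the chosen recovery endpoint.
    headers: {(timeline, segno): previous_file_timeline} -- a toy
             stand-in for reading each segment file's header.
    Returns the chain of (timeline, segno) pairs, newest first.
    """
    tli, segno = start
    chain = [start]
    while segno > backup_segno:
        tli = headers[(tli, segno)]   # timeline of the preceding file
        segno -= 1
        chain.append((tli, segno))
    return chain

# Timeline 2 branched off timeline 1 at segment 0x14, so segment
# 2.0x14's header says the previous file was on timeline 1.
headers = {(2, 0x15): 2, (2, 0x14): 1, (1, 0x13): 1}
chain = thread_back((2, 0x15), 0x12, headers)
# chain is [(2, 0x15), (2, 0x14), (1, 0x13), (1, 0x12)]
```

This replays Greg's X.0020 → X.0019 → ... → Y.0009 walk in miniature; the tape-backup awkwardness he mentions corresponds to each `headers[...]` lookup being a potentially expensive media read.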
On Tue, 2004-07-20 at 00:58, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > where the default is <notarget> and if you specify a target, the default
> > target_in_timeline is <latest>.
>
> I think actually the default target has to be the timeline ID found in
> pg_control -- otherwise you get weird behavior in the plain crash
> recovery, non-PITR case.

Yes, I was talking about recovery.conf only.

So, overall there would be 5 recovery modes with 4 levels of default, summarised in the table, also showing what set of actions/parameters we need to specify to invoke that mode...

                                          -------- SPECIFIED ITEMS --------
 MODE                                     recovery.conf  rec_target*  target_in_timeline
 crash recovery (up to 7.4)               no             no           no
 archive recovery:
  - current end                           yes            no           no
  - current target, current timeline (*)  yes            yes          no
  - other target, latest timeline (*)     yes            either       yes, = 'latest'
  - other target, other timeline (*)      yes            either       yes, = '<value>'

 (*) these operations cause a new timeline to be created

More verbosely (not numbered because they're not a sequence or progression):

- if no recovery.conf is present, we do crash recovery to end of logs on the pg_control timeline (i.e. current)

- if recovery.conf is present and we do not specify a target, we do archive recovery to end of logs on the pg_control timeline (i.e. current)

- if recovery.conf is present and we specify a target, but no timeline, then we do archive recovery on the pg_control timeline only, and assume that the target was supposed to be on this, even if we don't find it

- if recovery.conf is present and we specify a timeline of literally 'latest' (without having to know what that is), then we search the archive for the latest history file, and then we do archive recovery from the pg_control timeline to the latest timeline as specified in the latest history file. If we specify a target, then this is searched for on whatever timeline we're on as we roll forward.

- if recovery.conf is present and we specify a timeline, then we search the archive for that history file, and then we do archive recovery from the pg_control timeline to the specified timeline as shown in that history file. If we specify a target, then this is searched for on whatever timeline we're on as we roll forward.

> > I don't like the name target_in_timeline,
>
> Agreed, but I don't have a better name offhand for it. The point I was
> making is that we seem to be using "target" to mean a point-in-time
> stopping target. But you might be interested in going to the end of
> timeline N and so there's not a "target" in that sense. That's why I
> was wanting to avoid using the term "target" for the desired timeline.
> But maybe there's not a better word.

How about end_at_timeline? It's more neutral towards the idea of whether a target has been specified or not...

Another thing I note is that archive_status .ready messages are written for all restored xlog files (rather than .done messages). That seems to cause the archive to be completely overwritten when recovery is complete. Is that part of the plan for timelines also? Not sure I understand that...

Best Regards, Simon Riggs
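Simon's mode table boils down to a small decision function; a rough sketch (toy Python, purely illustrative - the real logic would live in the backend's recovery code):

```python
def recovery_mode(have_recovery_conf, target, timeline):
    """Map the five-mode table onto a decision.

    target:   None, or a recovery target (a time or txnid).
    timeline: None, 'latest', or an explicit timeline ID.
    Modes marked (*) create a new timeline when recovery stops short
    of end-of-WAL.
    """
    if not have_recovery_conf:
        return "crash recovery, pg_control timeline"
    if timeline is None:
        if target is None:
            return "archive recovery to current end"
        return "archive recovery to target, current timeline (*)"
    if timeline == "latest":
        return "archive recovery, latest timeline (*)"
    return "archive recovery, timeline %s (*)" % timeline

mode = recovery_mode(True, "txid 104223", "latest")
# mode is "archive recovery, latest timeline (*)"
```

Note that in the two explicit-timeline branches the target is "either" (optional), matching the table: the timeline choice must be made before any target can even be looked for.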
Simon Riggs <simon@2ndquadrant.com> writes:
> More verbosely (not numbered because they're not a sequence or
> progression)

> - if no recovery.conf is present we do crash recovery to end of logs on
> pg_control timeline (i.e. current)

Check.

> - if recovery.conf is present and we do not specify a target we do
> archive recovery to end of logs on pg_control timeline (i.e. current)

I have done it this way for now, but I'm unconvinced whether this is the best default --- it might be that we'd be better off making 'latest' be the default. The point here is that when you restore a tar backup, 'current' is going to become the timeline that was current when the backup was made, not the one that was current just before you wiped $PGDATA. I'm not really sure which case is going to be more commonly wanted.

> - if recovery.conf is present and we specify a target, but no timeline,
> then we do archive recovery on the pg_control timeline only, and assume
> that the target was supposed to be on this, even if we don't find it

Whether you specify a target stopping point does not matter AFAICS. The timeline selection has to be made before we can even look at the data.

> - if recovery.conf is present and we specify a timeline of literally
> 'latest' (without having to know what that is) - then we search archive
> for the latest history file, then we do archive recovery from the
> pg_control timeline to the latest timeline as specified in the latest
> history file. If we specify a target, then this is searched for on
> whatever timeline we're on as we rollforward.

Check.

> - if recovery.conf is present and we specify a timeline - then we search
> archive for that history file, then we do archive recovery from the
> pg_control timeline to the specified timeline as shown in that history
> file. If we specify a target, then this is searched for on whatever
> timeline we're on as we rollforward.

Check.

>>> I don't like the name target_in_timeline,
>>
>> Agreed, but I don't have a better name offhand for it.

For lack of any better idea, I have swallowed my objections to "target" and called it "recovery_target_timeline". We can easily rename the parameter if anyone comes up with something more compelling.

Above behavior is all committed to CVS as of a few minutes ago.

> Another thing I note is that archive_status .ready messages are written
> for all restored xlog files (rather than .done messages).

I think this is gone now. However, we still have the issue of preventing re-archival of old, incomplete XLOG segments that might be brought back into pg_xlog/ as a result of restoring a tar backup. I don't have a better solution to that than the one Bruce and I proposed yesterday (make the DBA clean out pg_xlog before starting a recovery run).

regards, tom lane
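Since each timeline's history file is just its parent's file plus one line, reading a timeline's ancestry is a matter of collecting the leading timeline IDs. A sketch (toy Python; the line format shown - parent timeline, branch segment, reason - is an assumption for illustration, not necessarily what was committed):

```python
def ancestors_from_history(text):
    """Return the ancestor timeline IDs listed in a <tli>.history file,
    oldest first.

    Assumed (illustrative) line format:
      "<parentTLI> <branch WAL segment> <reason>"
    '#' starts a comment.  A missing or empty file means the timeline
    has no known parents (extends indefinitely far into the past).
    """
    tlis = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        tlis.append(int(line.split()[0]))
    return tlis

sample = """\
# ancestry of timeline 3
1 000000010000000000000014 recovered to 2004-07-19 19:33
2 000000020000000000000018 recovered to txid 104223
"""
ancestors = ancestors_from_history(sample)
# ancestors is [1, 2]
```

Reversing this list (target's parent first, then grandparent, ...) gives exactly the fallback order the segment-selection rule needs.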
On Wed, 2004-07-21 at 23:42, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > More verbosely (not numbered because they're not a sequence or
> > progression)
>
> > - if no recovery.conf is present we do crash recovery to end of logs on
> > pg_control timeline (i.e. current)
>
> Check.
>
> > - if recovery.conf is present and we do not specify a target we do
> > archive recovery to end of logs on pg_control timeline (i.e. current)
>
> I have done it this way for now, but I'm unconvinced whether this is the
> best default --- it might be that we'd be better off making 'latest' be
> the default. The point here is that when you restore a tar backup,
> 'current' is going to become the timeline that was current when the
> backup was made, not the one that was current just before you wiped
> $PGDATA. I'm not really sure which case is going to be more commonly
> wanted.

Right now, that sounds the best option. But my head hurts :)

> > - if recovery.conf is present and we specify a target, but no timeline,
> > then we do archive recovery on the pg_control timeline only, and assume
> > that the target was supposed to be on this, even if we don't find it
>
> Whether you specify a target stopping point does not matter AFAICS. The
> timeline selection has to be made before we can even look at the data.

Yes, I was describing a case where a default behaviour would be required to make the timeline selection before the "desired" behaviour could be enacted.

> > - if recovery.conf is present and we specify a timeline of literally
> > 'latest' (without having to know what that is) - then we search archive
> > for the latest history file, then we do archive recovery from the
> > pg_control timeline to the latest timeline as specified in the latest
> > history file. If we specify a target, then this is searched for on
> > whatever timeline we're on as we rollforward.
>
> Check.
> > - if recovery.conf is present and we specify a timeline - then we search
> > archive for that history file, then we do archive recovery from the
> > pg_control timeline to the specified timeline as shown in that history
> > file. If we specify a target, then this is searched for on whatever
> > timeline we're on as we rollforward.
>
> Check.
>
> >>> I don't like the name target_in_timeline,
> >>
> >> Agreed, but I don't have a better name offhand for it.
>
> For lack of any better idea, I have swallowed my objections to "target"
> and called it "recovery_target_timeline". We can easily rename the
> parameter if anyone comes up with something more compelling.
>
> Above behavior is all committed to CVS as of a few minutes ago.

...very cool.

OK, back to first principles as a cross-check then: PITR should cope with these scenarios. These are described reasonably closely but not as exact mechanical tests, so as to ensure that if multiple solutions exist to these recovery scenarios, all paths are tested. These are written with a view to *rough* functionality of timelines, rather than reading the above and making up cases to fit. I suggest we see if these all work, see why not (if not), and make up some other cases to make sure all possibilities are catered for.

1. We crash, and wish to recover, as per 7.4.

2. We are running happily, using an automated standby database. The first database fails irrecoverably and we are forced to switch to the second system, which recovers quickly to end of logs, though without the partially full current xlog from the downed system.

3. We are running happily, but spot a rogue transaction that we wish to expunge. We decide to run a PITR up to that txnid. We do an archive recovery to a recovery_target_xid. We have available to us local copies of the xlogs if required.

4. We perform (3), then after operating for an hour, we realise that this was an extremely bad idea and decide to recover back to the point BEFORE we started to recover the first time - i.e. to try to pretend we had never attempted PITR in the first place, because there was some even more important data just recently committed that we didn't know about.

5. We attempt (4) but fail because the then-current log, which had not been archived, was deleted because we wouldn't need it anymore. We decide that we made the right choice in the first place and decide to re-run the PITR, though to a point slightly ahead of where we stopped last time we tried that.

6. We are running a distributed system that does not properly support two-phase commit in all of its persistent components. One of the other components fails (of course not pg!) and we are forced to do a PITR to a point in time that matches the best last known timestamp of all persistent system components. We PITR to a recovery_target_time.

7. We have just done (6), but 10 minutes into production we realise that the clocks between 2 of our systems were out by 3 seconds. Not much, but it is causing serious errors to bang around the system. We decide to re-run the previous PITR, but this time to a point 3 seconds further along the same chain of xlogs. We don't specify a timeline, 'cos that's really complex stuff and we don't understand it.

8. We perform (3,4,5), then after operating for three hours the rogue transaction happens again. We realise that the rogue transaction is in fact a deliberate security violation. We immediately close network access and try to recover. Management decides we must accept the first rogue transaction's effects, but the second is too large to be acceptable. We need to recover to a recovery_target_xid prior to the second attack. The first recovery meant that xids were being reused (on a different timeline), and so the xid we wish to recover MAY exist on both the first and second timeline. To ensure we don't recover the wrong transactions, we decide to specify that we wish to recover to a recovery_target_xid on recovery_target_timeline = 2.

9. A mistake was made setting a system clock - the month and day were transposed (7th May -> 5th July), setting the system apparently into the future. To reset the clock, we have to perform a full database recovery into the newly reset system, which is now apparently in the past. We roll forward to end of logs using local and archive copies. We want crash recovery to still work, even though we have apparently gone backwards in time according to the log timestamps.

10. We perform (6), then realise that the database server is hosted in another timezone and we accidentally recovered to a different point in time, out by a few hours. We want to re-run the recovery, correctly specifying the point in time.

11. We are in the same position as (7), but specify a timeline and also a time that in fact does not exist on that timeline.

That gives us enough to talk through and begin some testing. Anybody have any other horror stories, bring 'em on.

> > Another thing I note is that archive_status .ready messages are written
> > for all restored xlog files (rather than .done messages).
>
> I think this is gone now. However, we still have the issue of preventing
> re-archival of old, incomplete XLOG segments that might be brought back
> into pg_xlog/ as a result of restoring a tar backup. I don't have a
> better solution to that than the one Bruce and I proposed yesterday
> (make the DBA clean out pg_xlog before starting a recovery run).

I'll give that some thought.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> PITR should cope with these scenarios. These are described reasonably
> closely but not as exact mechanical tests, so as to ensure that if
> multiple solutions exist to these recovery scenarios that all paths are
> tested.
> [ snip ... ]

Now *my* head is hurting ;-)

AFAICS, we should be able to handle all of these cases, as long as (a) the DBA doesn't manually delete any xlog files (else he probably loses the chance to recover to the associated timeline), and (b) the DBA doesn't screw up on picking which timeline to recover to. I'd guess that (b) is likely to be the bigger threat :-(.

regards, tom lane
> That gives us enough to talk through and begin some testing.
>
> Anybody have any other horror stories, bring 'em on.

I think that the PITR docs will have to be written in two sections. One will need to be a pure reference that orthogonally describes the options, etc. The other section will need to be a scenario-based explanation of what to do/how to recover in all the major different failure patterns. It's the only way people (I!) will understand it all.

From my point of view, what I need PITR to be able to do is allow me to restore to any point in the 24 hour period between pg_dumpalls. I also need to know what the exact criteria are for deleting archived logs every 24 hours, and how that can be determined automatically in a script (checking that the pg_dumpall end-of-log marker exists as well). I need to be told to copy, not move, the logs. Also, I need to be sure that pg_dumpall is enough, and that I don't need to make sure I issue a checkpoint before the pg_dumpall or anything.

Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> Also, I need to be sure that pg_dumpall is enough, and I don't need to make
> sure I issue a checkpoint before the pg_dumpall or anything.

I think if you're using PITR you don't use pg_dump; you just tar up the PGDATA directory.

Of course you can still use pg_dump to save logical exports for other purposes. They're useful for loading into test databases for poking around in the data, for example. But the transaction logs aren't going to be applicable to a database restored from a logical pg_dump. That's effectively a completely fresh database, even if it has functionally equivalent data in it.

The transaction logs are physical; they store "replace this block with the following raw data". That can't be applied to a logically equivalent but physically dissimilar database.

-- greg