Re: Why we really need timelines *now* in PITR - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Re: Why we really need timelines *now* in PITR |
Date | |
Msg-id | 1090274296.28049.317.camel@stromboli |
In response to | Re: Why we really need timelines *now* in PITR (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Why we really need timelines *now* in PITR |
List | pgsql-hackers |
On Mon, 2004-07-19 at 19:33, Tom Lane wrote:
> I wrote:
> > I think there's really no way around the issue: somehow we've got to
> > keep some meta-history outside the $PGDATA area, if we want to do this
> > in a clean fashion.
>
> After further thought I think we can fix this stuff by creating a
> "history file" for each timeline. This will make recovery slightly more
> complicated but I don't think it would be any material performance
> problem. Here's how it goes:

Yes...I came to the conclusion that trying to avoid doing something like
DB2 does was just stubbornness on my part. We may as well use analogies
with other systems when they are available.

All of this is good. Two main areas of comments/questions, noted (**)

Timelines should be easy to understand for anybody that can follow a
HACKERS conversation anyhow :)

>
> * Timeline IDs are 32-bit ints with no particular semantic significance
> (that is, we do not assume timeline 3 is a child of 2, or anything like
> that). The actual parentage of a timeline has to be found by inspecting
> its history file.
>

OK...that's better. The nested idea doesn't read well second time
through.

> * History files will be named by their timeline ID, say "00000042.history".
> They will be created in /pg_xlog whenever a new timeline is created
> by the act of doing a recovery to a point in time earlier than the end
> of existing WAL. When doing WAL archiving a history file can be copied
> off to the archive area by the existing archiver mechanism (ie, we'll
> make a .ready file for it as soon as it's written).
>

Need to check the archive code, which relies on file shape and length.

> * History files will be plain text (for human consumption) and will
> essentially consist of a list of parent timeline IDs in sequence.
> I envision adding the timeline split timestamp and starting WAL segment
> number too, but these are for documentation purposes --- the system
> doesn't need them.
> We may as well allow comments in there as well,
> so that the DBA can annotate the reasons for a PITR split to have been
> done. So the contents might look like
>
> # Recover from unintentional TRUNCATE
> 00000001 0000000A00142568 2005-05-16 12:34:15 EDT
> # Ex-assistant DBA dropped wrong table
> 00000007 0000002200005434 2005-11-17 18:44:44 EST
>

Or should there be a recovery_comment parameter in the recovery.conf?
That would be better than suggesting that admins can edit such an
important file. (Even if they can, it's best not to encourage it).

> When we split off a new timeline, we just have to copy the parent's
> history file (which we can do verbatim including comments) and then
> add a new line at the end showing the immediate parent's timeline ID
> and the other details of the split. Initdb can create 00000001.history
> with empty contents (since that timeline has no parents).

Yes. Will you then delete the previous timeline's history file or just
leave it there? (OK, you say that later)

> * When we need to do recovery, we first identify the source timeline
> (either by reading the current timeline ID from pg_control, or the DBA
> can tell us with a parameter in recovery.conf). We then read the
> history file for that timeline, and remember its sequence of parent
> timeline IDs. We can crosscheck that pg_control's timeline ID is
> one of this set of timeline IDs, too --- if it's not then the wrong
> backup file was restored.

** Surely it is the backup itself that determines the source timeline?
Backups are always taken in one particular timeline. The rollforward
must start at a checkpoint before the begin backup and roll past the end
of backup marker onwards. The starting checkpoint should be the last
checkpoint prior to backup - why would you pick another?
That checkpoint will always be in the current timeline, since we always
come out of startup with a checkpoint (either because we shut down
earlier, or we recovered and just wrote another shutdown checkpoint).

So the backup's timeline will determine the source timeline, but not
necessarily the target timeline.

...thinking....recovery.conf would need to specify:
recovery_target (if there is one, either a time or txnid)
recovery_target_timeline (if there is one, otherwise end of last one)
recovery_target_history_file (which specifies how the timeline ids are
sequenced)

I take it that your understanding is that the recovery_target timeline
needs to be specified also?

> * During recovery, whenever we need to open a WAL segment file, we first
> try to open it with the source timeline ID; if that doesn't exist, try
> the immediate parent timeline ID; then the grandparent, etc. Whenever
> we find a WAL file with a particular timeline ID, we forget about all
> parents further up in the history, and won't try to open their segments
> anymore (this is the generalization of my previous rule that you never
> drop down in timeline number as you scan forward).
>

This jigging around is OK, because most people will be using only one
timeline anyhow, so it's not likely to cause too much fuss for the user.

> * If we end recovery because we have rolled forward off the end of WAL,
> we can just continue using the source timeline ID --- we are extending
> that timeline. (Thus, an ordinary crash and restart doesn't require
> generating a new timeline ID; nor do we generate a new line during
> normal postmaster stop/start.)

Yes, exactly - that's why it can't be the SUID.

> But if we stop recovery at a requested
> point-in-time earlier than end of WAL, we have to branch off a new
> timeline. We do this by:
>   * Selecting a previously unused timeline ID (see below).
>   * Writing a history file for this ID, by copying the parent
>     timeline's history file and adding a new line at the end.
>   * Copying the last-used WAL segment of the parent timeline,
>     giving it the same segment number but the new timeline's ID.
>     This becomes the active WAL segment when we start operating.
>
> * We can identify the highest timeline ID ever used by simply starting
> with the source timeline ID and probing pg_xlog and the archive area
> for history files N+1.history, N+2.history, etc until we find an ID
> for which there is no history file. Under reasonable scenarios this
> will not take very many probes, so it doesn't seem that we need any
> addition to the archiver API to make it more efficient.

** I would prefer to add a random number to the timeline as a way of
identifying the next one. This will produce fewer probes, so fewer wasted
tape mounts, but most importantly it gets round this issue:

You're on timeline X, then you recover and run for a while on timeline
Y. You then realise recovering to that target was a really bad idea for
some reason (some VIP's record wasn't in the recovered data etc). We then
need to re-recover from the backup on X to a new timeline, Z. But how
does X know that Y existed when it creates Z? If Y = f(X) in a
deterministic way, then Y will always == Z. Of course, we could provide
an id, but what would you pick?

The best way is to get out of trouble by picking a new timeline id
that's very unlikely to have been picked before. If the sequence of
timeline ids is not important, just pick one from the billions you have
available to you (and that aren't mentioned in the history file). We can
do this automatically and pick it randomly. That way, when you
re-recover you stand a vanishingly small chance of picking any timeline
id that you (or indeed anyone!) have ever used.

This will be very important for diagnosing problems, and it is my
experience that the re-recovery scenario happens on about 50% of
recoveries, i.e. if you recover once, you're very likely to recover 2 or
more times before you're really done.
(...and if you don't believe me, look what happened to Danske Bank
running DB2 - recovered 4 times inside a week, but hats off to those
guys - they got it back in the end).

But then - we also need to be able to identify which was the latest
history file, and searching a billion files might take a while. So the
sequential numbering does serve a purpose.

Both ideas solve only one of the two problems....hmmm, I think perhaps
finding the latest file is more important and so perhaps sequential
numbering should win after all?

> * Since history files will be small and made infrequently (one hopes you
> do not need to do a PITR recovery very often...) I see no particular
> reason not to leave them in /pg_xlog indefinitely. The DBA can clean
> out old ones if she is a neatnik, but I don't think the system needs to
> or should delete them. Similarly the archive area could be expected to
> retain history files indefinitely.
>

OK. Answered question above... Yes, agreed. We'll want them for
diagnostics anyway.

> * However, you *can* throw away a history file once you are no longer
> interested in rolling back to times predating the splitoff point of the
> timeline. If we don't find a history file we can just act as though the
> timeline has no parents (extends indefinitely far in the past). (Hm,
> so we don't actually have to bother creating 00000001.history...)
>

Agreed. That's better: fewer files waiting around, less chance of being
deleted by over-diligent admins. But we shouldn't encourage the deletion
of those files. The worst problems happen when people "tidy up" after
they think recovery is over, then delete an important file and we're
back in traction again.

> * I'm intending to replace the current concept of StartUpID (SUI) by
> timeline IDs --- we'll record timeline IDs not SUIs in data page headers
> and WAL page headers.
> SUI isn't doing anything of value for us; I think
> it was probably intended to do what timelines will do, but it's not
> defined quite right for the purpose. One good thing about timeline IDs
> for WAL page headers is that we know exactly which IDs should be
> expected in a WAL file (either the current timeline or one of its
> parents); this allows a much tighter check than is possible with SUIs.
>

Definitely agree on this last part, that stuff about 512 SUIs was weird.

> Anybody see any holes in this design?
>

As said already, all of this is good. Two main areas of
comments/questions, noted above (**).

That's coherent and good.

Best regards, Simon Riggs
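
For anyone following the thread, here is a rough sketch in Python of the three mechanical pieces of the proposal quoted above: reading a history file, walking back through parent timelines when opening a WAL segment, and probing for the next unused timeline ID. It is only an illustration, not code from any patch. The function names are made up, timeline IDs are assumed to be written as 8-digit hexadecimal, and the archive is modelled as an in-memory set of (timeline, segment) pairs rather than real files.

```python
def parse_history(text):
    """Return parent timeline IDs, oldest first, from a history file body.

    Assumption: the first field on each non-comment line is the parent
    timeline ID in hexadecimal; the remaining fields are documentation only.
    """
    parents = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and DBA comments carry no parentage
        parents.append(int(line.split()[0], 16))
    return parents


def candidate_timelines(source_tli, parents):
    """Order to try when opening a segment: source, parent, grandparent..."""
    return [source_tli] + list(reversed(parents))


def find_segment(candidates, segno, available):
    """Return the timeline whose copy of segment `segno` should be used.

    `available` is a set of (timeline_id, segno) pairs standing in for
    pg_xlog plus the archive. Once a segment is found on some timeline,
    all older ancestors are dropped from `candidates` in place, so later
    calls never fall back to them.
    """
    for i, tli in enumerate(candidates):
        if (tli, segno) in available:
            del candidates[i + 1:]  # forget parents further up the history
            return tli
    return None


def next_timeline_id(source_tli, history_exists):
    """Probe N+1, N+2, ... until an ID with no history file is found.

    `history_exists` is a hypothetical callback that checks both pg_xlog
    and the archive area for "<id>.history".
    """
    tli = source_tli + 1
    while history_exists(tli):
        tli += 1
    return tli
```

With the example history file from the quoted text, parse_history gives parents 1 and 7, so a source timeline of, say, 9 would try timelines 9, 7, 1 in that order. Note that in this sketch the sequential probe only avoids reusing an ID if the earlier timeline's history file is still visible in pg_xlog or the archive when the probe runs.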