Thread: Why we really need timelines *now* in PITR

Why we really need timelines *now* in PITR

From
Tom Lane
Date:
If we do not add timeline numbers to WAL file names, we will be forced
to destroy information during recovery.  Consider the following
scenario:

1. You have a WAL directory containing, say, WAL segments 0010 to 0020
(for the purposes of this example I won't bother typing out realistic
16-digit filenames, but just use 4-digit names).

2. You discover that your junior DBA messed up badly and you need to
revert to yesterday evening's state.  Let's say the chosen recovery end
time is in the middle of file 0014.

3. You run the recovery process.  At its end, the WAL end pointer will
be 0014 and some offset.

If we simply run forward from this situation, then we will be
overwriting existing WAL records in the existing files 0014-0020.
This is bad from the point of view of not wanting to discard information
(what if we decide we should have recovered to a later time??), but
there is an even more serious reason for not doing that.  Suppose we
suffer a crash sometime after recovery.  On restart, the system will
start replaying the logs, and *there will be nothing to keep it from
replaying all the way to the end of file 0020*.  (The files will contain
proper, in-sequence page headers, so the tests that normally detect
recycled log segments won't think there is anything wrong.)  This will
leave you with a thoroughly corrupt database.

One way to solve this would be to physically discard 0015-0020 as soon
as we decide we're stopping short of the end of WAL.  I think that is
unacceptable on don't-throw-away-information grounds.  I think it would
be far better to invent the timeline concept.  Then, our old WAL files
would be named say 0001.0010 through 0001.0020, and we would start
logging into 0002.0014 after recovery.

A slightly tricky point is that we have to "sew together" the end of one
timeline and the start of the next --- for instance, we only want the
front part of 0001.0014, not the back part, to be part of the new
timeline.  Patrick Macdonald told me about a pretty baroque scheme that
DB2 uses for this, but I think it would be simplest if we just copied
the appropriate amount of data from 0001.0014 into 0002.0014 and then
ran forward from there.  Copying a max of 16MB of data doesn't sound
very onerous.

During WAL replay or recovery, there would be a notion of the "target
timeline" that you are trying to recover to a point within.  The rule
for selecting which WAL segment file to read is "use the one with
largest timeline number less than or equal to the target, and never less
than the timeline number you used for the previous segment".  So for
example if we realized we'd chosen the wrong recovery target time, we
could backpedal and redo the same recovery process with target timeline
0001, ignoring any WAL segments that had been archived with timeline
0002.  Alternatively, if we were simply doing crash recovery in timeline
0002, we could stop at (say) segment 0002.0018, and we'd know that we
should ignore 0001.0019 because it is not in our timeline.
        regards, tom lane


Re: Why we really need timelines *now* in PITR

From
Simon Riggs
Date:
On Sat, 2004-07-17 at 21:36, Tom Lane wrote:
> If we do not add timeline numbers to WAL file names, we will be forced
> to destroy information during recovery.  Consider the following
> scenario:
> 
> 1. You have a WAL directory containing, say, WAL segments 0010 to 0020
> (for the purposes of this example I won't bother typing out realistic
> 16-digit filenames, but just use 4-digit names).
> 
> 2. You discover that your junior DBA messed up badly and you need to
> revert to yesterday evening's state.  Let's say the chosen recovery end
> time is in the middle of file 0014.
> 
> 3. You run the recovery process.  At its end, the WAL end pointer will
> be 0014 and some offset.
> 
> If we simply run forward from this situation, then we will be
> overwriting existing WAL records in the existing files 0014-0020.
> This is bad from the point of view of not wanting to discard information
> (what if we decide we should have recovered to a later time??), but
> there is an even more serious reason for not doing that.  Suppose we
> suffer a crash sometime after recovery.  On restart, the system will
> start replaying the logs, and *there will be nothing to keep it from
> replaying all the way to the end of file 0020*.  (The files will contain
> proper, in-sequence page headers, so the tests that normally detect
> recycled log segments won't think there is anything wrong.)  This will
> leave you with a thoroughly corrupt database.
> 
> One way to solve this would be to physically discard 0015-0020 as soon
> as we decide we're stopping short of the end of WAL.  I think that is
> unacceptable on don't-throw-away-information grounds.  I think it would
> be far better to invent the timeline concept.  Then, our old WAL files
> would be named say 0001.0010 through 0001.0020, and we would start
> logging into 0002.0014 after recovery.
> 
> A slightly tricky point is that we have to "sew together" the end of one
> timeline and the start of the next --- for instance, we only want the
> front part of 0001.0014, not the back part, to be part of the new
> timeline.  Patrick Macdonald told me about a pretty baroque scheme that
> DB2 uses for this, but I think it would be simplest if we just copied
> the appropriate amount of data from 0001.0014 into 0002.0014 and then
> ran forward from there.  Copying a max of 16MB of data doesn't sound
> very onerous.
> 

Well, yes - I completely agree that we need the timeline concept as one
of the highest priorities. I originally raised the problem timelines
solve because of the errors I had experienced re-running restores many
times with the same archive set. It's just too easy to overwrite log
files without the timeline concept.

IMHO you don't need to change the xlog format as a necessary step to
introduce timelines. Simply adding FFFF to the logid is sufficient
(which lets face it takes a heck of long time before it gets to 1...)

[Also, as an extra detail on your analysis, when recovery is finished
you need to move both primary and secondary checkpoint markers forwards
to the new timeline, so that crash recovery can't go back to the old
timeline]

If you're going to change xlog filenames, then I would think that adding
the system identifier to the xlogs would be a very good addition. I
would simply have recommended keeping them in separate directories, but
putting it on the name would be best. PostgreSQL doesn't have a name
concept...which would be the thing to use if it did.

> During WAL replay or recovery, there would be a notion of the "target
> timeline" that you are trying to recover to a point within.  The rule
> for selecting which WAL segment file to read is "use the one with
> largest timeline number less than or equal to the target, and never less
> than the timeline number you used for the previous segment".  So for
> example if we realized we'd chosen the wrong recovery target time, we
> could backpedal and redo the same recovery process with target timeline
> 0001, ignoring any WAL segments that had been archived with timeline
> 0002.  Alternatively, if we were simply doing crash recovery in timeline
> 0002, we could stop at (say) segment 0002.0018, and we'd know that we
> should ignore 0001.0019 because it is not in our timeline.
> 

That sounds like the way it should work.

The way you write this makes me think you might mean you would allow: we
can start recovering in one timelines, then rollforward takes us through
all the timeline nexus points required to get us to the target timeline.

I had imagined that recovery would only ever be allowed to start and end
on the same timeline. I think you probably mean that?


Another of the issues I was thinking through was what happens at the end
of your scenario abobe
- You're on timeline 1 and you need to perform recovery.
- You perform recovery and timeline 2 is created.
- You discover another error and decide to recover again.
- You recover timeline 1 again: what do you name the new timeline
created? 2 or 3? If you call it 2 you will be overwriting data just like
you would have done - which is why timelines were invented, so thats got
to be a bad plan. If you think to avoid this by calling it 3, how do you
know to do that?  

My imperfect solution to that was to use a randomised future timeline
number, reducing greatly the chance of ever conflicting on timeline
names. There's probably a solution to this used by other RDBMS, but I
don't know what it is and haven't gone looking on the basis that is
likely to be patented anyway...

[If you do this by adding a big number to the LogId, then 0002.0018 is
simply numerically larger than 0001.0019, so wouldn't ever be
considered.]

You then don't need the idea of a target timeline explicitly - and
therefore the user can't get wrong which timeline their on/want to be
on.

Best regards, Simon Riggs



Re: Why we really need timelines *now* in PITR

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> The way you write this makes me think you might mean you would allow: we
> can start recovering in one timelines, then rollforward takes us through
> all the timeline nexus points required to get us to the target
> timeline.

Sure.  Let's draw a diagram:
0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...                      |                      + 0002.0016 - 0002.0017
-...                                  |                                  + 0003.0017 - ...
 

If you decide you would like to recover to someplace in timeline 0002,
you need to take the 0002 log files where they exist, and the 0001
log files where there is no 0002, except you do not revert to 0001
once you have used an 0002 file (this restriction is needed in case
the 0001 timeline goes to higher segment numbers than 0002 has reached).
In no case do you use an 0003 file.

> I had imagined that recovery would only ever be allowed to start and end
> on the same timeline. I think you probably mean that?

Logically it's all one timeline, I suppose, but to implement it
physically that way would mean duplicating all past 0001 segments when
we want to create the 0002 timeline.  That's not practical and not
necessary.

> Another of the issues I was thinking through was what happens at the end
> of your scenario abobe
> - You're on timeline 1 and you need to perform recovery.
> - You perform recovery and timeline 2 is created.
> - You discover another error and decide to recover again.
> - You recover timeline 1 again: what do you name the new timeline
> created? 2 or 3?

You really want to call it 3.  To enforce this mechanically would
require having a counter that sits outside the $PGDATA directory and
is not subject to being reverted by a restore-from-backup.  I don't
see any very clean way to do that at the moment --- any thoughts?

In the absence of such a counter we could ask the DBA to specify a new
timeline number in recovery.conf, but this strikes me as one of those
easy-to-get-wrong things ...

One possibility is to extend the archiving API so that we can inquire
about the largest timeline number that exists anywhere in the archive.
If we take new timeline number = 1 + max(any in archive, any in pg_xlog)
then we are safe.  But I'm not really convinced that such a thing would
be any less error-prone than the manual way :-(, because for any
archival method that's more complicated than "cp them all into one
directory", it'd be hard to extract the max stored filename.
        regards, tom lane


Re: Why we really need timelines *now* in PITR

From
Simon Riggs
Date:
On Mon, 2004-07-19 at 04:31, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The way you write this makes me think you might mean you would allow: we
> > can start recovering in one timelines, then rollforward takes us through
> > all the timeline nexus points required to get us to the target
> > timeline.
> 
> Sure.  Let's draw a diagram:
> 
>     0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...
>                           |
>                           + 0002.0016 - 0002.0017 - ...
>                                       |
>                                       + 0003.0017 - ...
> 
> If you decide you would like to recover to someplace in timeline 0002,
> you need to take the 0002 log files where they exist, and the 0001
> log files where there is no 0002, except you do not revert to 0001
> once you have used an 0002 file (this restriction is needed in case
> the 0001 timeline goes to higher segment numbers than 0002 has reached).
> In no case do you use an 0003 file.
> 
> > I had imagined that recovery would only ever be allowed to start and end
> > on the same timeline. I think you probably mean that?
> 
> Logically it's all one timeline, I suppose, but to implement it
> physically that way would mean duplicating all past 0001 segments when
> we want to create the 0002 timeline.  That's not practical and not
> necessary.
> 
> > Another of the issues I was thinking through was what happens at the end
> > of your scenario abobe
> > - You're on timeline 1 and you need to perform recovery.
> > - You perform recovery and timeline 2 is created.
> > - You discover another error and decide to recover again.
> > - You recover timeline 1 again: what do you name the new timeline
> > created? 2 or 3?
> 
> You really want to call it 3.  To enforce this mechanically would
> require having a counter that sits outside the $PGDATA directory and
> is not subject to being reverted by a restore-from-backup.  I don't
> see any very clean way to do that at the moment --- any thoughts?
> 
> In the absence of such a counter we could ask the DBA to specify a new
> timeline number in recovery.conf, but this strikes me as one of those
> easy-to-get-wrong things ...
> 
> One possibility is to extend the archiving API so that we can inquire
> about the largest timeline number that exists anywhere in the archive.
> If we take new timeline number = 1 + max(any in archive, any in pg_xlog)
> then we are safe.  But I'm not really convinced that such a thing would
> be any less error-prone than the manual way :-(, because for any
> archival method that's more complicated than "cp them all into one
> directory", it'd be hard to extract the max stored filename.
> 

Think the same as you do on all of that. Excellent.

Some further thinking from that base...

Perhaps timelines should be nest-numbered: (using 0 as a counter also)
0 etc is the original branch
0.1 is the first recovery off the original branch
0.2 is the second recovery off the original branch
0.1.1 is the first recovery off the first recovery (so to speak)
0.1.2 is the second etc
That way you don't have the problem of "which is 3?" in the examples
above. [Would we number a recovery of 1 as 3 or would then next recovery
off 2 be numbered 3?]

Not necessarily the way we would show that as a timeline number. It
could still be shown as a single hex number representing each nesting
level as 4 bits...(restricting us to 7 recoveries per timeline...)

Just as a thought, DB2 uses the concept of a history file also....

If we go with the renaming recovery.conf when it completes, why not make
that the record of previous recoveries? Move it to archive_status and
name it according to the timeline it just created, e.g.
recovery.done.<timeline>.<timestamp>

If you're re-restoring into the same directory you wouldn't then
overwrite the history of previous recoveries.

It would be an extremely bad thing to have to specify the timeline
number, and I agree we can't ask the archive. 

Best Regards, Simon Riggs



Re: Why we really need timelines *now* in PITR

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Some further thinking from that base...

> Perhaps timelines should be nest-numbered: (using 0 as a counter also)
> 0 etc is the original branch
> 0.1 is the first recovery off the original branch
> 0.2 is the second recovery off the original branch
> 0.1.1 is the first recovery off the first recovery (so to speak)
> 0.1.2 is the second etc
> That way you don't have the problem of "which is 3?" in the examples
> above. [Would we number a recovery of 1 as 3 or would then next recovery
> off 2 be numbered 3?]

Hmm.  This would have some usefulness as far as documenting "how did we
get here", but unless you knew where/when the timeline splits had
occurred, I don't think it would be super useful.  It'd make more sense
to record the parentage and split time of each timeline in some
human-readable "meta-history" reference file (but where exactly?)

I don't think it does anything to solve our immediate problem, anyhow.
You may know that you are recovering off of branch 0.1, but how do you
know if this is the first, second, or Nth time you have done that?

> Not necessarily the way we would show that as a timeline number. It
> could still be shown as a single hex number representing each nesting
> level as 4 bits...(restricting us to 7 recoveries per timeline...)

Sounds too tight to me :-(

I do see a hole in my original concept now that you mention it.  It'd be
quite possible for timeline 2 *not* to be an ancestor of timeline 3,
that is you might have tried a restore, not liked the result, and
decided to re-restore from someplace else on timeline 1.  That is,
instead of
0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...                      |                      + 0002.0016 - 0002.0017
-...                                  |                                  + 0003.0017 - ...
 

maybe the history is
0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...                      |           |                      |
+0003.0017 - ...                      |                      + 0002.0016 - 0002.0017 - ...
 

where I've had to draw 3 above 2 to avoid having unrelated lines
crossing each other in my diagram.  The problem here is that a crash
recovery in timeline 3 would not realize that it should not use WAL
segment 0002.0016.  So we need a more sophisticated rule than just
numerical comparison of timeline numbers.

I think your idea of nested numbers might fix this, but I'm concerned
about the restrictions of fitting it into 32 bits as you suggest.
Can we think of a less restrictive representation?

> If we go with the renaming recovery.conf when it completes, why not make
> that the record of previous recoveries? Move it to archive_status and
> name it according to the timeline it just created, e.g.
> recovery.done.<timeline>.<timestamp>

There's still the problem of how can you be sure that all the files
created in the past are still in there.  It'd be way too likely for
someone to decide they ought to do a recovery restore by first doing
"rm -rf $PGDATA".  Or they lost the disk entirely and are restoring
their last full backup onto virgin media.

I think there's really no way around the issue: somehow we've got to
keep some meta-history outside the $PGDATA area, if we want to do this
in a clean fashion.  We could perhaps expect the archive area to store
it, but I'm a bit worried about the idea of overwriting the meta-history
file in archive from time to time; it's mighty critical data and you'd
not be happy if a crash corrupted your only copy.  We could archive
meta-history files with successively higher versioned names ... but then
we need an API extension to get back the latest one.
        regards, tom lane


Re: Why we really need timelines *now* in PITR

From
Tom Lane
Date:
I wrote:
> I think there's really no way around the issue: somehow we've got to
> keep some meta-history outside the $PGDATA area, if we want to do this
> in a clean fashion.

After further thought I think we can fix this stuff by creating a
"history file" for each timeline.  This will make recovery slightly more
complicated but I don't think it would be any material performance
problem.  Here's how it goes:

* Timeline IDs are 32-bit ints with no particular semantic significance
(that is, we do not assume timeline 3 is a child of 2, or anything like
that).  The actual parentage of a timeline has to be found by inspecting
its history file.

* History files will be named by their timeline ID, say "00000042.history".
They will be created in /pg_xlog whenever a new timeline is created
by the act of doing a recovery to a point in time earlier than the end
of existing WAL.  When doing WAL archiving a history file can be copied
off to the archive area by the existing archiver mechanism (ie, we'll
make a .ready file for it as soon as it's written).

* History files will be plain text (for human consumption) and will
essentially consist of a list of parent timeline IDs in sequence.
I envision adding the timeline split timestamp and starting WAL segment
number too, but these are for documentation purposes --- the system
doesn't need them.  We may as well allow comments in there as well,
so that the DBA can annotate the reasons for a PITR split to have been
done.  So the contents might look like
# Recover from unintentional TRUNCATE00000001    0000000A00142568    2005-05-16 12:34:15 EDT# Ex-assistant DBA dropped
wrongtable00000007    0000002200005434    2005-11-17 18:44:44 EST
 

When we split off a new timeline, we just have to copy the parent's
history file (which we can do verbatim including comments) and then
add a new line at the end showing the immediate parent's timeline ID
and the other details of the split.  Initdb can create 00000001.history
with empty contents (since that timeline has no parents).

* When we need to do recovery, we first identify the source timeline
(either by reading the current timeline ID from pg_control, or the DBA
can tell us with a parameter in recovery.conf).  We then read the
history file for that timeline, and remember its sequence of parent
timeline IDs.  We can crosscheck that pg_control's timeline ID is
one of this set of timeline IDs, too --- if it's not then the wrong
backup file was restored.

* During recovery, whenever we need to open a WAL segment file, we first
try to open it with the source timeline ID; if that doesn't exist, try
the immediate parent timeline ID; then the grandparent, etc.  Whenever
we find a WAL file with a particular timeline ID, we forget about all
parents further up in the history, and won't try to open their segments
anymore (this is the generalization of my previous rule that you never
drop down in timeline number as you scan forward).

* If we end recovery because we have rolled forward off the end of WAL,
we can just continue using the source timeline ID --- we are extending
that timeline.  (Thus, an ordinary crash and restart doesn't require
generating a new timeline ID; nor do we generate a new line during
normal postmaster stop/start.)  But if we stop recovery at a requested
point-in-time earlier than end of WAL, we have to branch off a new
timeline.  We do this by:* Selecting a previously unused timeline ID (see below).* Writing a history file for this ID,
bycopying the parent  timeline's history file and adding a new line at the end.* Copying the last-used WAL segment of
theparent timeline,  giving it the same segment number but the new timeline's ID.  This becomes the active WAL segment
whenwe start operating.
 

* We can identify the highest timeline ID ever used by simply starting
with the source timeline ID and probing pg_xlog and the archive area
for history files N+1.history, N+2.history, etc until we find an ID
for which there is no history file.  Under reasonable scenarios this
will not take very many probes, so it doesn't seem that we need any
addition to the archiver API to make it more efficient.

* Since history files will be small and made infrequently (one hopes you
do not need to do a PITR recovery very often...) I see no particular
reason not to leave them in /pg_xlog indefinitely.  The DBA can clean
out old ones if she is a neatnik, but I don't think the system needs to
or should delete them.  Similarly the archive area could be expected to
retain history files indefinitely.

* However, you *can* throw away a history file once you are no longer
interested in rolling back to times predating the splitoff point of the
timeline.  If we don't find a history file we can just act as though the
timeline has no parents (extends indefinitely far in the past).  (Hm,
so we don't actually have to bother creating 00000001.history...)

* I'm intending to replace the current concept of StartUpID (SUI) by
timeline IDs --- we'll record timeline IDs not SUIs in data page headers
and WAL page headers.  SUI isn't doing anything of value for us; I think
it was probably intended to do what timelines will do, but it's not
defined quite right for the purpose.  One good thing about timeline IDs
for WAL page headers is that we know exactly which IDs should be
expected in a WAL file (either the current timeline or one of its
parents); this allows a much tighter check than is possible with SUIs.

Anybody see any holes in this design?
        regards, tom lane


Re: Why we really need timelines *now* in PITR

From
Simon Riggs
Date:
On Mon, 2004-07-19 at 16:58, Tom Lane wrote:
> I think there's really no way around the issue: somehow we've got to
> keep some meta-history outside the $PGDATA area, if we want to do this
> in a clean fashion.  We could perhaps expect the archive area to store
> it, but I'm a bit worried about the idea of overwriting the meta-history
> file in archive from time to time; it's mighty critical data and you'd
> not be happy if a crash corrupted your only copy.  We could archive
> meta-history files with successively higher versioned names ... but then
> we need an API extension to get back the latest one.
> 

Yes, you've convinced me.

It is critical data, but never for that long. If we only split timelines
when we recover, then we just make not to take about ~100 copies of it
immediately. If we really did recover OK, then it'll only be a few
days/weeks before we can forget it ever happened.

The crucial time is when re-running recoveries repeatedly and if we
write the manual with sufficient red ink then we'll avoid this. But
heck, not having your history file is only as bad as not having added
timelines in the first place. Not great, just more care required.

Best Regards, Simon Riggs



Re: Why we really need timelines *now* in PITR

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> The crucial time is when re-running recoveries repeatedly and if we
> write the manual with sufficient red ink then we'll avoid this. But
> heck, not having your history file is only as bad as not having added
> timelines in the first place. Not great, just more care required.

Yeah, you only really need them when you are hip-deep in repeated
recovery retries.

If you haven't gotten to my later proposal yet, the history files will
be plain text and it'd be at least theoretically possible for someone to
reconstruct one by hand if needed.  All you need to have is the sequence
of parent timeline IDs, which you could reconstruct in most cases by
looking at the archived WAL files.
        regards, tom lane


Re: Why we really need timelines *now* in PITR

From
Simon Riggs
Date:
On Mon, 2004-07-19 at 19:33, Tom Lane wrote:
> I wrote:
> > I think there's really no way around the issue: somehow we've got to
> > keep some meta-history outside the $PGDATA area, if we want to do this
> > in a clean fashion.
> 
> After further thought I think we can fix this stuff by creating a
> "history file" for each timeline.  This will make recovery slightly more
> complicated but I don't think it would be any material performance
> problem.  Here's how it goes:

Yes...I came to the conclusion that trying to avoid doing something like
DB2 does was just stubornness on my part. We may as well use analogies
with other systems when they are available.

All of this is good. Two main areas of comments/questions, noted (**)

Timelines should be easy to understand for anybody that can follow a
HACKERS conversation anyhow :)

> 
> * Timeline IDs are 32-bit ints with no particular semantic significance
> (that is, we do not assume timeline 3 is a child of 2, or anything like
> that).  The actual parentage of a timeline has to be found by inspecting
> its history file.
> 

OK...thats better. The nested idea doesn't read well second time
through.

> * History files will be named by their timeline ID, say "00000042.history".
> They will be created in /pg_xlog whenever a new timeline is created
> by the act of doing a recovery to a point in time earlier than the end
> of existing WAL.  When doing WAL archiving a history file can be copied
> off to the archive area by the existing archiver mechanism (ie, we'll
> make a .ready file for it as soon as it's written).
> 

Need to check the archive code which relies on file shape and length

> * History files will be plain text (for human consumption) and will
> essentially consist of a list of parent timeline IDs in sequence.
> I envision adding the timeline split timestamp and starting WAL segment
> number too, but these are for documentation purposes --- the system
> doesn't need them.  We may as well allow comments in there as well,
> so that the DBA can annotate the reasons for a PITR split to have been
> done.  So the contents might look like
> 
>     # Recover from unintentional TRUNCATE
>     00000001    0000000A00142568    2005-05-16 12:34:15 EDT
>     # Ex-assistant DBA dropped wrong table
>     00000007    0000002200005434    2005-11-17 18:44:44 EST
> 

Or should there be a recovery_comment parameter in the recovery.conf?
That would be better than suggesting that admins can edit such an
important file. (Even if they can, its best not to encourage it).

> When we split off a new timeline, we just have to copy the parent's
> history file (which we can do verbatim including comments) and then
> add a new line at the end showing the immediate parent's timeline ID
> and the other details of the split.  Initdb can create 00000001.history
> with empty contents (since that timeline has no parents).

Yes.

Will you then delete the previous timeline's history file or just leave
it there? (OK, you say that later)

> * When we need to do recovery, we first identify the source timeline
> (either by reading the current timeline ID from pg_control, or the DBA
> can tell us with a parameter in recovery.conf).  We then read the
> history file for that timeline, and remember its sequence of parent
> timeline IDs.  We can crosscheck that pg_control's timeline ID is
> one of this set of timeline IDs, too --- if it's not then the wrong
> backup file was restored.

** Surely it is the backup itself that determines the source timeline? 

Backups are always taken in one particular timeline. The rollforward
must start at a checkpoint before the begin backup and roll past the end
of backup marker onwards. The starting checkpoint should be the last
checkpoint prior to backup - why would you pick another? That checkpoint
will always be in the current timeline, since we always come out of
startup with a checkpoint (either because we shutdown earlier, or we
recovered and just wrote another shutdown checkpoint). 

So the backup's timeline will determine the source timeline, but not
necessarily the target timeline.

...thinking....recovery.conf would need to specify:
recovery_target (if there is one, either a time or txnid)
recovery_target_timeline (if there is one, otherwise end of last one)
recovery_target_history_file (which specifies how the timeline ids are
sequenced)

I take it that your understanding is that the recovery_target timeline
needs to be specified also?

> * During recovery, whenever we need to open a WAL segment file, we first
> try to open it with the source timeline ID; if that doesn't exist, try
> the immediate parent timeline ID; then the grandparent, etc.  Whenever
> we find a WAL file with a particular timeline ID, we forget about all
> parents further up in the history, and won't try to open their segments
> anymore (this is the generalization of my previous rule that you never
> drop down in timeline number as you scan forward).
> 

This jigging around is OK, because most people will be using only one
timeline anyhow, so its not likely to cause too much fuss for the user.

> * If we end recovery because we have rolled forward off the end of WAL,
> we can just continue using the source timeline ID --- we are extending
> that timeline.  (Thus, an ordinary crash and restart doesn't require
> generating a new timeline ID; nor do we generate a new line during
> normal postmaster stop/start.)  

Yes, exactly - thats why it can't be the SUID.

> But if we stop recovery at a requested
> point-in-time earlier than end of WAL, we have to branch off a new
> timeline.  We do this by:
>     * Selecting a previously unused timeline ID (see below).
>     * Writing a history file for this ID, by copying the parent
>       timeline's history file and adding a new line at the end.
>     * Copying the last-used WAL segment of the parent timeline,
>       giving it the same segment number but the new timeline's ID.
>       This becomes the active WAL segment when we start operating.
> 
> * We can identify the highest timeline ID ever used by simply starting
> with the source timeline ID and probing pg_xlog and the archive area
> for history files N+1.history, N+2.history, etc until we find an ID
> for which there is no history file.  Under reasonable scenarios this
> will not take very many probes, so it doesn't seem that we need any
> addition to the archiver API to make it more efficient.

** I would prefer to add a random number to the timeline as a way of
identifying the next one. This will produce fewer probes, so less wasted
tape mounts, but most importantly it gets round this issue:

You're on timeline X, then you recover and run for a while on timeline
Y. You then realise recovering to that target was a really bad idea for
some reason (some VIPs record wasn't in the recovered data etc). We then
need to re-recover from the backup on X to a new timeline, Z. But how
does X know that Y existed when it creates Z?

If Y = f(x) in a deterministic way, then Y will always == Z. Of course,
we could provide an id, but what would you pick? The best way is to get
out of trouble by picking a new timeline id that's very unlikely to have
been picked before.

If the sequence of timeline ids is not important, just pick one from the
billions you have available to you (and that aren't mentioned in the
history file). We can do this automatically and pick it randomly.

That way, when you re-recover you stand a vanishingly small chance of
picking any timeline id that you (or indeed anyone!) have ever used.

This will be very important for diagnosing problems, and it is my
experience that the re-recovery scenario happens on about 50% of
recoveries. i.e. if you recover once, you're very likely to recover 2 or
more times before you're really done. (...and if you don't believe me,
look what happened to Danske Bank running DB2 - recovered 4 times inside
a week, but hats off to those guys - they got it back in the end).

But then - we also need to be able to identify which was the latest
history file and searching a billion files might take a while. So the
sequential numbering does serve a purpose. Both ideas solve only one of
the two problems....hmmm, I think perhaps finding latest file is more
important and so perhaps sequential numbering should win after all?

> * Since history files will be small and made infrequently (one hopes you
> do not need to do a PITR recovery very often...) I see no particular
> reason not to leave them in /pg_xlog indefinitely.  The DBA can clean
> out old ones if she is a neatnik, but I don't think the system needs to
> or should delete them.  Similarly the archive area could be expected to
> retain history files indefinitely.
> 

OK. Answered question above...

Yes, agreed. We'll want them for diagnostics anyway.

> * However, you *can* throw away a history file once you are no longer
> interested in rolling back to times predating the splitoff point of the
> timeline.  If we don't find a history file we can just act as though the
> timeline has no parents (extends indefinitely far in the past).  (Hm,
> so we don't actually have to bother creating 00000001.history...)
> 

Agreed. Thats better, less files waiting around, less chance of being
deleted by over-diligent admins.

But we shouldn't encourage the deletion of those files. The worst
problems happen when people "tidy up" after they think recovery is over,
then delete an important file and we're back in traction again.

> * I'm intending to replace the current concept of StartUpID (SUI) by
> timeline IDs --- we'll record timeline IDs not SUIs in data page headers
> and WAL page headers.  SUI isn't doing anything of value for us; I think
> it was probably intended to do what timelines will do, but it's not
> defined quite right for the purpose.  One good thing about timeline IDs
> for WAL page headers is that we know exactly which IDs should be
> expected in a WAL file (either the current timeline or one of its
> parents); this allows a much tighter check than is possible with SUIs.
> 

Definitely agree on this last part, that stuff about 512 SUIs was wierd.

> Anybody see any holes in this design?
> 

As said already, All of this is good. Two main areas of
comments/questions, noted above. (**)

That's coherent and good. 

Best regards, Simon Riggs



Re: Why we really need timelines *now* in PITR

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2004-07-19 at 19:33, Tom Lane wrote:
>> When doing WAL archiving a history file can be copied
>> off to the archive area by the existing archiver mechanism (ie, we'll
>> make a .ready file for it as soon as it's written).

> Need to check the archive code which relies on file shape and length

Yeah.  I made some adjustments already with the expectation that we'd
want to do this, but it'll take a little bit more weakening of the
code's tests.

>> * When we need to do recovery, we first identify the source timeline
>> (either by reading the current timeline ID from pg_control, or the DBA
>> can tell us with a parameter in recovery.conf).

> ** Surely it is the backup itself that determines the source timeline? 

The backup determines the starting point, but there may be several
timelines you could follow after that (especially in the scenario where
you're redoing a recovery starting from the same backup).  The point
here is that there could be timeline branches after the backup
occurred.  So yes the backup has to be in an ancestral timeline, but not
necessarily exactly the recovery-target timeline.

> ...thinking....recovery.conf would need to specify:
> recovery_target (if there is one, either a time or txnid)
> recovery_target_timeline (if there is one, otherwise end of last one)
> recovery_target_history_file (which specifies how the timeline ids are
> sequenced)

No, the source timeline is not necessarily associated with a
recovery_target --- for instance you might want it to run to the end of
a particular timeline.  I suspect it might be more confusing than
helpful to use the term "target timeline".

>> try to open it with the source timeline ID; if that doesn't exist, try
>> the immediate parent timeline ID; then the grandparent, etc.

> This jigging around is OK, because most people will be using only one
> timeline anyhow, so its not likely to cause too much fuss for the user.

It might confuse someone who's watching the sequence of archive
retrieval requests, but as long as that's all mechanized it doesn't
seem like there's any real potential for trouble.

> ** I would prefer to add a random number to the timeline as a way of
> identifying the next one. This will produce fewer probes, so less wasted
> tape mounts,

How do you figure that?  Seems like the same number of probes either way.

> but most importantly it gets round this issue:

> You're on timeline X, then you recover and run for a while on timeline
> Y. You then realise recovering to that target was a really bad idea for
> some reason (some VIPs record wasn't in the recovered data etc). We then
> need to re-recover from the backup on X to a new timeline, Z. But how
> does X know that Y existed when it creates Z?

Because there is a Y.history file laying about (either in the archive or
pg_xlog).

We will need to recommend to DBAs that they not delete Y.history from
the archive unless they've already deleted all Y.whatever log segments.
Once they have done this, the past existence of timeline Y is no longer
of interest and so there'd be no real problem in recycling the ID.
I would say the above is just as true if you use random IDs as if you
use sequential ones.  I distrust systems that assume there will never be
a collision of "randomly-chosen" IDs.

> If Y = f(x) in a deterministic way, then Y will always == Z. Of course,
> we could provide an id, but what would you pick? The best way is to get
> out of trouble by picking a new timeline id that's very unlikely to have
> been picked before.

I do not see the advantage.
        regards, tom lane


Re: Why we really need timelines *now* in PITR

From
Simon Riggs
Date:
On Mon, 2004-07-19 at 23:15, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2004-07-19 at 19:33, Tom Lane wrote:

> >> * When we need to do recovery, we first identify the source timeline
> >> (either by reading the current timeline ID from pg_control, or the DBA
> >> can tell us with a parameter in recovery.conf).
> 
> > ** Surely it is the backup itself that determines the source timeline? 
> 
> The backup determines the starting point, but there may be several
> timelines you could follow after that (especially in the scenario where
> you're redoing a recovery starting from the same backup).  The point
> here is that there could be timeline branches after the backup
> occurred.  So yes the backup has to be in an ancestral timeline, but not
> necessarily exactly the recovery-target timeline.
> 

Agreed.

> > ...thinking....recovery.conf would need to specify:
> > recovery_target (if there is one, either a time or txnid)
> > recovery_target_timeline (if there is one, otherwise end of last one)
> > recovery_target_history_file (which specifies how the timeline ids are
> > sequenced)
> 
> No, the source timeline is not necessarily associated with a
> recovery_target --- for instance you might want it to run to the end of
> a particular timeline.  I suspect it might be more confusing than
> helpful to use the term "target timeline".
> 

I think we're heatedly agreeing again.

A summary: we don't specify the start timeline, but we do specify the
timeline which contains our chosen endpoint. [....But when we reach it,
we may create a new timeline id if we didn't go to end of logs on that
timeline.] The history file specifies how to get from start to end,
through however many branchpoints there are....and the history file we
use for recovery is the one pointed to by (target_in_timeline).

Or even shorter:
- backup specifies starting timeline (and so user specifies indirectly)
- user specifies end point (explicitly in recovery.conf)
- history file shows how to get from start to end

more thoughts...if you specify:
target = X
target_in_timeline

where the default is <notarget> and if you specify a target, the default
target_in_timeline is <latest>.

I don't like the name target_in_timeline, I'm just trying to clarify
what we mean so we can think of a better name for it.


...we definitely need an offline-log inspection tool, don't we? 
Next month...

> We will need to recommend to DBAs that they not delete Y.history from
> the archive unless they've already deleted all Y.whatever log segments.
> Once they have done this, the past existence of timeline Y is no longer
> of interest and so there'd be no real problem in recycling the ID.
> I would say the above is just as true if you use random IDs as if you
> use sequential ones.  I distrust systems that assume there will never be
> a collision of "randomly-chosen" IDs.
> 

Yes, I argued myself in a circle, but it seemed worth recording just to
avoid repeating the thought next time.


Best regards, Simon Riggs



Re: Why we really need timelines *now* in PITR

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I think we're heatedly agreeing again.

Yeah, I think so.  I'll get started on this tomorrow.

> where the default is <notarget> and if you specify a target, the default
> target_in_timeline is <latest>.

I think actually the default target has to be the timeline ID found in
pg_control --- otherwise you get weird behavior in the plain crash
recovery, non-PITR case.

> I don't like the name target_in_timeline,

Agreed, but I don't have a better name offhand for it.  The point I was
making is that we seem to be using "target" to mean a point-in-time
stopping target.  But you might be interested in going to the end of
timeline N and so there's not a "target" in that sense.  That's why I
was wanting to avoid using the term "target" for the desired timeline.
But maybe there's not a better word.

> ...we definitely need an offline-log inspection tool, don't we? 
> Next month...

Yeah.  When you get started, I have a toy tool I've been using for
awhile that might serve as a starting point.  (I'm going to have to
whack it around for timelines so there's little point in attaching
it right now...)
        regards, tom lane


Re: Why we really need timelines *now* in PITR

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Anybody see any holes in this design?

God help the DBA who deletes a history file with needed information. Or edits
it inappropriately.

Why can't every log file contain a header that says which timeline it's part
of and which timeline the preceding log file was part of? That would avoid
having a file outside PGDATA and would mean the backups would always contain
enough information to be self-sufficient.

Then if you want to restore from a cold backup and apply PITR up to segment
0020 in timeline X. You read the header of X.0020, find out it followed X.0019
and so on. If X.0010 branched from Y.0009 you'll find out and be able to
continue threading back until you find a segment that matches the current
segment in the cold backup.

The only problem I see is that if your backups are stored on tape it might be
awkward to have to read the headers of all log segments in reverse order to
backtrack to the right place to start.

-- 
greg



Re: Why we really need timelines *now* in PITR

From
Simon Riggs
Date:
On Tue, 2004-07-20 at 00:58, Tom Lane wrote: 
> Simon Riggs <simon@2ndquadrant.com> writes:
> > where the default is <notarget> and if you specify a target, the default
> > target_in_timeline is <latest>.
> 
> I think actually the default target has to be the timeline ID found in
> pg_control -- otherwise you get weird behavior in the plain crash
> recovery, non-PITR case.

Yes, I was talking about recovery.conf only

So, overall there would be 5 recovery modes with 4 levels of default:

Summarised in the table, also showing what set of actions/parameters we
need to specify to invoke that mode...

----------------SPECIFIED ITEMs
MODE        recovery.conf     rec_target*     target_in_timeline
crash recovery
- up to 7.4-------------no------no--------------no
archive recovery 
- current end-----------yes-----no--------------no
- current target (*)----yes-----yes-------------no current timeline
- other target (*)------yes-----either----------yes, = 'latest' latest timeline
- other target (*)------yes-----either----------yes, = '<value>' other timeline

(*) these operations cause a new timeline to be created

More verbosely (not numbered because they're not a sequence or
progression)

- if no recovery.conf is present we do crash recovery to end of logs on
pg_control timeline (i.e. current) 

- if recovery.conf is present and we do not specify a target we do
archive recovery to end of logs on pg_control timeline (i.e. current) 

- if recovery.conf is present and we specify a target, but no timeline,
then we do archive recovery on the pg_control timeline only, and assume
that the target was supposed to be on this, even if we don't find it

- if recovery.conf is present and we specify a timeline of literally
'latest' (without having to know what that is) - then we search archive
for the latest history file, then we do archive recovery from the
pg_control timeline to the latest timeline as specified in the latest
history file. If we specify a target, then this is searched for on
whatever timeline we're on as we rollforward.

- if recovery.conf is present and we specify a timeline - then we search
archive for that history file, then we do archive recovery from the
pg_control timeline to the specified timeline as shown in that history
file. If we specify a target, then this is searched for on whatever
timeline we're on as we rollforward.

> > I don't like the name target_in_timeline,
> 
> Agreed, but I don't have a better name offhand for it.  The point I was
> making is that we seem to be using "target" to mean a point-in-time
> stopping target.  But you might be interested in going to the end of
> timeline N and so there's not a "target" in that sense.  That's why I
> was wanting to avoid using the term "target" for the desired timeline.
> But maybe there's not a better word.

how about?

end_at_timeline

which is more neutral towards the idea of whether a target has been
specified or not...


Another thing I note is that archive_status .ready messages are written
for all restored xlog files (rather than .done messages). That seems to
cause the archive to be completely overwritten when recovery is
complete. Is that part of the plan for timelines also. Not sure I
understand that...


Best Regards, Simon Riggs



Re: Why we really need timelines *now* in PITR

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> More verbosely (not numbered because they're not a sequence or
> progression)

> - if no recovery.conf is present we do crash recovery to end of logs on
> pg_control timeline (i.e. current) 

Check.

> - if recovery.conf is present and we do not specify a target we do
> archive recovery to end of logs on pg_control timeline (i.e. current) 

I have done it this way for now, but I'm unconvinced whether this is the
best default --- it might be that we'd be better off making 'latest' be
the default.  The point here is that when you restore a tar backup,
'current' is going to become the timeline that was current when the
backup was made, not the one that was current just before you wiped
$PGDATA.  I'm not really sure which case is going to be more commonly
wanted.

> - if recovery.conf is present and we specify a target, but no timeline,
> then we do archive recovery on the pg_control timeline only, and assume
> that the target was supposed to be on this, even if we don't find it

Whether you specify a target stopping point does not matter AFAICS.  The
timeline selection has to be made before we can even look at the data.

> - if recovery.conf is present and we specify a timeline of literally
> 'latest' (without having to know what that is) - then we search archive
> for the latest history file, then we do archive recovery from the
> pg_control timeline to the latest timeline as specified in the latest
> history file. If we specify a target, then this is searched for on
> whatever timeline we're on as we rollforward.

Check.

> - if recovery.conf is present and we specify a timeline - then we search
> archive for that history file, then we do archive recovery from the
> pg_control timeline to the specified timeline as shown in that history
> file. If we specify a target, then this is searched for on whatever
> timeline we're on as we rollforward.

Check.

>>> I don't like the name target_in_timeline,
>> 
>> Agreed, but I don't have a better name offhand for it.

For lack of any better idea, I have swallowed my objections to "target"
and called it "recovery_target_timeline".  We can easily rename the
parameter if anyone comes up with something more compelling.

Above behavior is all committed to CVS as of a few minutes ago.

> Another thing I note is that archive_status .ready messages are written
> for all restored xlog files (rather than .done messages).

I think this is gone now.  However, we still have the issue of preventing
re-archival of old, incomplete XLOG segments that might be brought back
into pg_xlog/ as a result of restoring a tar backup.  I don't have a
better solution to that than the one Bruce and I proposed yesterday
(make the DBA clean out pg_xlog before starting a recovery run).
        regards, tom lane


Re: Why we really need timelines *now* in PITR

From
Simon Riggs
Date:
On Wed, 2004-07-21 at 23:42, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > More verbosely (not numbered because they're not a sequence or
> > progression)
> 
> > - if no recovery.conf is present we do crash recovery to end of logs on
> > pg_control timeline (i.e. current) 
> 
> Check.
> 
> > - if recovery.conf is present and we do not specify a target we do
> > archive recovery to end of logs on pg_control timeline (i.e. current) 
> 
> I have done it this way for now, but I'm unconvinced whether this is the
> best default --- it might be that we'd be better off making 'latest' be
> the default.  The point here is that when you restore a tar backup,
> 'current' is going to become the timeline that was current when the
> backup was made, not the one that was current just before you wiped
> $PGDATA.  I'm not really sure which case is going to be more commonly
> wanted.

Right now, that sounds the best option. But my head hurts :)

> 
> > - if recovery.conf is present and we specify a target, but no timeline,
> > then we do archive recovery on the pg_control timeline only, and assume
> > that the target was supposed to be on this, even if we don't find it
> 
> Whether you specify a target stopping point does not matter AFAICS.  The
> timeline selection has to be made before we can even look at the data.
> 

Yes, I was describing a case where a default behaviour would be required
to make the timeline selection before the "desired" behaviour could be
enacted. 

> > - if recovery.conf is present and we specify a timeline of literally
> > 'latest' (without having to know what that is) - then we search archive
> > for the latest history file, then we do archive recovery from the
> > pg_control timeline to the latest timeline as specified in the latest
> > history file. If we specify a target, then this is searched for on
> > whatever timeline we're on as we rollforward.
> 
> Check.
> 
> > - if recovery.conf is present and we specify a timeline - then we search
> > archive for that history file, then we do archive recovery from the
> > pg_control timeline to the specified timeline as shown in that history
> > file. If we specify a target, then this is searched for on whatever
> > timeline we're on as we rollforward.
> 
> Check.
> 
> >>> I don't like the name target_in_timeline,
> >> 
> >> Agreed, but I don't have a better name offhand for it.
> 
> For lack of any better idea, I have swallowed my objections to "target"
> and called it "recovery_target_timeline".  We can easily rename the
> parameter if anyone comes up with something more compelling.
> 
> Above behavior is all committed to CVS as of a few minutes ago.
> 

...very cool.

OK, back to first principles as a cross-check then:

PITR should cope with these scenarios. These are described reasonably
closely but not as exact mechanical tests, so as to ensure that if
multiple solutions exist to these recovery scenarios that all paths are
tested.

These are written with a view to *rough* functionality of timelines,
rather than reading the above and making up cases to fit. I suggest we
see if these all work, see why not (if not) and make up some other cases
to make sure all possibilities are catered for.

1. We crash, and wish to recover, as per 7.4

2. We are running happily, using an automated standby database. The
first database fails irrecoverably and we are forced to switch to the
second system which recovers quickly to end of logs, though without the
partially full current xlog from the downed system.

3. We are running happily, but spot a rogue transaction that we wish to
expunge. We decide to run a PITR up to that txnid. We do an archive
recovery to a recovery_target_xid. We have available to us local copies
of the xlogs if required.

4. We perform (3), then after operating for an hour, we realise that
this was an extremely bad idea and decide to recover back to the point
BEFORE we started to recover the first time - i.e. to try to pretend we
had never attempted PITR in the first place because there was some even
more important data just recently committed we didn't know about.

5. We attempt (4) but fail because the then-current log, which has not
been archived, was deleted because we wouldn't need it anymore. We
decide that we made the right choice in the first place and decide to
re-run the PITR, though to a point slightly ahead of where we stopped
last time we tried that.

6. We are running a distributed system that does not properly support
two-phase commit in all of its persistent components. One of the other
components fails (of course not pg!) and we are forced to do a PITR to a
point in time that matches the best last known timestamp of all
persistent system components. We PITR to a recovery_target_time.

7. We have just done (6), but 10 minutes into production we realise that
the clocks between 2 of our systems were out by 3 seconds. Not much, but
it is causing serious errors to bang around the system. We decide to
re-run the previous PITR, but this time to a point 3 seconds further
along the same chain of xlogs. We don't specify timeline, cos thats
really complex stuff and we don't understand it.

8. We perform (3,4,5), then after operating for three hours the rogue
transaction happens again. We realise that the rogue transaction is in
fact a deliberate security violation. We immediately close network
access and try to recover. Management decides we must accept the first
rogue transaction's effects, but the second is too large to be
acceptable. We need to recover to a recovery_target_xid prior to the
second attack. The first recovery meant that xids were being reused (on
a different timeline) and so the xid we wish to recover MAY exist on
both the first and second timeline. To ensure we don't recover the wrong
transactions, we decide to specify we wish to recover to a
recovery_target_xid on recovery_target_timeline = 2.

9. A mistake was made setting setting a system clock - the month and day
were transposed (7th May -> 5 July), so setting the system apparently
into the future. To reset the clock, we have to perform a full database
recovery into the newly reset system which is now apparently in the
past. We rollforward to end of logs using local and archive copies. We
want crash recovery to still work, even though we have apparently gone
backwards in time according to the log timestamps.

10. We perform (6), then realise that the database server is hosted in
another timezone and we accidentally recovered to a different point in
time, out by a few hours. We want to re-run the recovery, correctly
specifying the point in time.

11. We are in same position as (7), but specify a timeline and also a
time that in fact does not exist on that timeline.

That gives us enough to talk through and begin some testing.

Anybody have any other horror stories, bring 'em on.

> Another thing I note is that archive_status .ready messages are
written
> > for all restored xlog files (rather than .done messages).
> 
> I think this is gone now.  However, we still have the issue of preventing
> re-archival of old, incomplete XLOG segments that might be brought back
> into pg_xlog/ as a result of restoring a tar backup.  I don't have a
> better solution to that than the one Bruce and I proposed yesterday
> (make the DBA clean out pg_xlog before starting a recovery run).

I'll give that some thought.

Best Regards, Simon Riggs



Re: Why we really need timelines *now* in PITR

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> PITR should cope with these scenarios. These are described reasonably
> closely but not as exact mechanical tests, so as to ensure that if
> multiple solutions exist to these recovery scenarios that all paths are
> tested.

> [ snip ... ]

Now *my* head is hurting ;-)

AFAICS, we should be able to handle all of these cases, as long as
(a) the DBA doesn't manually delete any xlog files (else he probably
loses the chance to recover to the associated timeline), and
(b) the DBA doesn't screw up on picking which timeline to recover to.

I'd guess that (b) is likely to be the bigger threat :-(.
        regards, tom lane


Re: Why we really need timelines *now* in PITR

From
Christopher Kings-Lynne
Date:
> That gives us enough to talk through and begin some testing.
> 
> Anybody have any other horror stories, bring 'em on.

I think that the PITR docs will have to be written in two sections.  One 
will need to be a pure reference that orthogonally describes the 
options, etc.  The other section will need to be a scenario-based 
explanation of what to do/how to recover in all the major different 
failure patterns.  It's the only way people (I!) will understand it all.
From my point of view, what I need PITR to be able to do is allow me to 
restore to any point in the 24 hour period between pg_dumpalls.  I also 
need to know what the exact criteria for deleting archived logs every 24 
hours, and how that can be determined automatically in a script 
(checking the pg_dumpall end-of-log marker exists as well).  I need to 
be told to copy, not move the logs.  Also, I need to be sure that 
pg_dumpall is enough, and I don't need to make sure I issue a checkpoint 
before the pg_dumpall or anything.

Chris



Re: Why we really need timelines *now* in PITR

From
Greg Stark
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

> Also, I need to be sure that pg_dumpall is enough, and I don't need to make
> sure I issue a checkpoint before the pg_dumpall or anything.

I think if you're using PITR you don't use pg_dump you just tar up the PGDATA
directory.

Of course you can still use pg_dump to save logical exports for use in other
purposes. They're useful for loading into test databases for poking around in
the data for example.

But the transaction logs aren't going to be applicable to a database restored
from a logical pg_dump. That's effectively a completely fresh database even if
it has functionally equivalent data in it. The transaction logs are physical;
they store "replace this block with the following raw data". That can't be
applied to a logically equivalent but physically dissimilar database.

-- 
greg