Thread: Point in Time Recovery

Point in Time Recovery

From
Simon Riggs
Date:
Taking advantage of the freeze bubble allowed us... there are some
last-minute features to add.

Summarising earlier thoughts, with some detailed digging and design from
myself in the last few days - we're now in a position to add Point-in-Time
Recovery, on top of what's been achieved.

The target for the last record to recover to can be specified in 2 ways:
- by transactionId - not that useful, unless you have a means of
identifying what has happened from the log, then using that info to
specify how to recover - coming later - not in next few days :( 
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?
If we did, would that be portable?
Suggestions welcome, because I know very little of the details of
various *nix systems and win* on that topic.

Only COMMIT and ABORT records have timestamps, allowing us to circumvent
any discussion about partial transaction recovery and nested
transactions.

When we do recover, stopping at the timestamp is just half the battle.
We need to leave the xlog in which we stop in a state from which we can
enter production smoothly and cleanly. To do this, we could:
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.
I'm tempted by the first plan, because it is more straightforward and
stands much less chance of me introducing 50 weird bugs just before
close.

Also, I think it is straightforward to introduce control file duplexing,
with a second copy stored and maintained in the pg_xlog directory. This
would provide additional protection for pg_control, which takes on more
importance now that archive recovery is working. pg_xlog is a natural
home, since on busy systems it's on its own disk away from everything
else, ensuring that at least one copy survives. I can't see a downside
to that, but others might... We can introduce user specifiable
duplexing, in later releases.

For later, I envisage an off-line utility that can be used to inspect
xlog records. This could provide a number of features:
- validate archived xlogs, to check they are sound.
- produce summary reports, to allow identification of transactionIds and
the effects of particular transactions
- performance analysis to allow decisions to be made about whether group
commit features could be utilised to good effect
(Not now...)

Best regards, Simon Riggs




Re: Point in Time Recovery

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Should we use a different datatype than time_t for the commit timestamp,
> one that offers more fine grained differentiation between checkpoints?

Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that.  Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless.  You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.
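
A minimal sketch of the point about non-distinct timestamps, written in Python rather than C. A commit timestamp is held as a (seconds, microseconds) pair, the way gettimeofday() reports it, and commits are grouped by timestamp to show that ties still have to be handled. The CommitRecord type and the sample values are made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class CommitRecord(NamedTuple):
    """Hypothetical commit record: xid plus a gettimeofday()-style timestamp."""
    xid: int
    sec: int    # whole seconds since the epoch (the time_t part)
    usec: int   # microseconds within that second, 0..999999


def group_by_timestamp(records):
    """Map each (sec, usec) timestamp to the xids that committed at it."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.sec, rec.usec)].append(rec.xid)
    return dict(groups)


# Illustrative data: the clock only ticks every 10 ms here, so two commits
# share a timestamp even though the field has microsecond precision.
records = [
    CommitRecord(xid=100, sec=1089000000, usec=10000),
    CommitRecord(xid=101, sec=1089000000, usec=10000),
    CommitRecord(xid=102, sec=1089000000, usec=20000),
]
groups = group_by_timestamp(records)
ties = {ts: xids for ts, xids in groups.items() if len(xids) > 1}
```

Any interface that selects a stop point by time would need a rule for entries like the one left in ties.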

> - when we stop, keep reading records until EOF, just don't apply them.
> When we write a checkpoint at end of recovery, the unapplied
> transactions are buried alive, never to return.
> - stop where we stop, then force zeros to EOF, so that no possible
> record remains of previous transactions.

Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?

Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago.  The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering.  Instead
you start a brand new xlog segment file, which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply.  I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point).  You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations.  I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.
        regards, tom lane


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Should we use a different datatype than time_t for the commit timestamp,
> > one that offers more fine grained differentiation between checkpoints?
> 
> Pretty much everybody supports gettimeofday() (time_t and separate
> integer microseconds); you might as well use that.  Note that the actual
> resolution is not necessarily microseconds, and it'd still not be
> certain that successive commits have distinct timestamps --- so maybe
> this refinement would be pointless.  You'll still have to design a user
> interface that allows selection without the assumption of distinct
> timestamps.

Well, I agree, though without the desired-for UI now, I think some finer
grained mechanism would be good. This means extending the xlog commit
record by a couple of bytes...OK, let's live a little.

> > - when we stop, keep reading records until EOF, just don't apply them.
> > When we write a checkpoint at end of recovery, the unapplied
> > transactions are buried alive, never to return.
> > - stop where we stop, then force zeros to EOF, so that no possible
> > record remains of previous transactions.
> 
> Go with plan B; it's best not to destroy data (what if you chose the
> wrong restart point the first time)?
> 

eh? Which way round? The second plan was the one where I would destroy
data by overwriting it; that's exactly why I preferred the first.

Actually, the files are always copied from archive, so re-recovery is
always an available option in the design that's been implemented.

No matter...

> Actually this now reminds me of a discussion I had with Patrick
> Macdonald some time ago.  The DB2 practice in this connection is that
> you *never* overwrite existing logfile data when recovering.  Instead
> you start a brand new xlog segment file, 

Now that's a much better plan...I suppose I just have to rack up the
recovery pointer to the first record on the first page of a new xlog
file, similar to first plan, but just fast-forwarding rather than
forwarding.

My only issue was to do with the secondary Checkpoint marker, which is
always reset to the place you just restored FROM, when you complete a
recovery. That could lead to a situation where you recover, then before
next checkpoint, fail and lose last checkpoint marker, then crash
recover from previous checkpoint (again), but this time replay the
records you were careful to avoid.

> which is given a new "branch
> number" so it can be distinguished from the future-time xlog segments
> that you chose not to apply.  I don't recall what the DB2 terminology
> was exactly --- not "branch number" I don't think --- but anyway the
> idea is that when you restart the database after an incomplete recovery,
> you are now in a sort of parallel universe that has its own history
> after the branch point (PITR stop point).  You need to be able to
> distinguish archived log segments of this parallel universe from those
> of previous and subsequent incarnations.  

That's a good idea, if only because you so easily screw up your test data
during multiple recovery situations. But if it's good during testing, it
must be good in production too...since you may well perform
recovery...run for a while, then discover that you got it wrong first
time, then need to re-recover again. I already added that to my list of
gotchas and that would solve it.

I was going to say hats off to the Blue-hued ones, when I remembered
this little gem from last year
http://www.danskebank.com/link/ITreport20030403uk/$file/ITreport20030403uk.pdf

> I'm not sure whether Vadim
> intended our StartUpID to serve this purpose, but it could perhaps be
> used that way, if we reflected it in the WAL file names.
> 

Well, I'm not sure about StartUpId....but certainly the high 2 bytes of
LogId look pretty certain never to be anything but zeros. You have 2.8
x 10^14 (2^48)...which is about 9,000 years at 1000 log files/sec
We could use the scheme you describe:
add xFFFF to the logid every time you complete an archive recovery...so
the log files look like 0001000000000CE3 after you've recovered a load of
files that look like 0000000000000CE3
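
A rough sketch of that naming scheme in Python, under two assumptions that are mine rather than anything settled here: a segment name is 8 hex digits of log id followed by 8 hex digits of segment number, and the bump is one step in the high 2 bytes of the log id (0x10000), which is what the two example file names imply. The function names are hypothetical.

```python
LOGID_BUMP = 0x10000  # assumed: one step in the high 2 bytes of a 32-bit log id


def xlog_file_name(log_id: int, seg: int) -> str:
    """Format a segment name as 8 hex digits of log id plus 8 of segment."""
    return f"{log_id:08X}{seg:08X}"


def log_id_after_recovery(log_id: int) -> int:
    """Jump the log id far ahead of anything the old history could reach."""
    new_id = log_id + LOGID_BUMP
    if new_id > 0xFFFFFFFF:
        raise OverflowError("log id space exhausted")
    return new_id


before = xlog_file_name(0x00000000, 0x00000CE3)
after = xlog_file_name(log_id_after_recovery(0x00000000), 0x00000CE3)

# Rough capacity check: file names available below the first bump, treating
# all 48 low bits as usable, consumed at 1000 files per second.
names_below_first_bump = 2 ** 48
years = names_below_first_bump / 1000 / (365.25 * 24 * 3600)
```

Here before is 0000000000000CE3 and after is 0001000000000CE3, and years comes out a little under 9,000.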

If you used StartUpID directly, you might just run out....but it's very
unlikely you would ever perform 65000 recovery situations - unless
you've run the <expletive> code as often as I have :(.

Doing that also means we don't have to work out how to do that with
StartUpID. Of course, altering the length and makeup of the xlog files
is possible too, but that will cause other stuff to stop working....

[We'll have to give this a no-SciFi name, unless we want to make
in-roads into the Dr.Who fanbase :) Don't get them started. Better
still, don't give it a name at all.]

I'll sleep on that lot.

Best regards, Simon Riggs



Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> - by time - but the time stamp on each xlog record only specifies to the
> second, which could easily be 10 or more commits (we hope....)
>
> Should we use a different datatype than time_t for the commit timestamp,
> one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.

> - when we stop, keep reading records until EOF, just don't apply them.
> When we write a checkpoint at end of recovery, the unapplied
> transactions are buried alive, never to return.
> - stop where we stop, then force zeros to EOF, so that no possible
> record remains of previous transactions.
> I'm tempted by the first plan, because it is more straightforward and
> stands much less chance of me introducing 50 wierd bugs just before
> close.

But what if you restore because after that PIT everything went haywire,
including the log? Did you mean to apply the remaining changes but not
commit those xids? I think it is essential to only leave xlogs around
that allow a subsequent rollforward from the same old full backup.
Or is an instant new full backup required after restore?

Andreas


Re: Point in Time Recovery

From
Richard Huxton
Date:
Simon Riggs wrote:
> On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> 
>>Simon Riggs <simon@2ndquadrant.com> writes:
>>
>>>Should we use a different datatype than time_t for the commit timestamp,
>>>one that offers more fine grained differentiation between checkpoints?
>>
>>Pretty much everybody supports gettimeofday() (time_t and separate
>>integer microseconds); you might as well use that.  Note that the actual
>>resolution is not necessarily microseconds, and it'd still not be
>>certain that successive commits have distinct timestamps --- so maybe
>>this refinement would be pointless.  You'll still have to design a user
>>interface that allows selection without the assumption of distinct
>>timestamps.
> 
> 
> Well, I agree, though without the desired-for UI now, I think some finer
> grained mechanism would be good. This means extending the xlog commit
> record by a couple of bytes...OK, lets live a little.

At the risk of irritating people, I'll repeat what I suggested a few 
weeks ago...

Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz, 
pitr_comment text)
Let the user insert rows in transactions as desired. Let them stop the 
restore when a specific (pitr_ts,pitr_comment) gets inserted (or on 
pitr_id if they record it).

IMHO time is seldom relevant, event boundaries are.

If you want to add special syntax for this, fine. If not, an INSERT 
statement is a convenient way to do this anyway.
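
To illustrate the idea only: a small Python sketch in which the rows of such a table are modelled as tuples and the stop point is picked by comment rather than by time. The Marker type, the helper and the sample rows are all hypothetical; nothing like this exists in the code.

```python
from typing import NamedTuple, Optional


class Marker(NamedTuple):
    """Hypothetical stand-in for one row of the proposed pg_pitr_checkpt table."""
    pitr_id: int
    pitr_ts: str
    pitr_comment: str


def find_stop_marker(markers, comment: str) -> Optional[Marker]:
    """Return the first marker, in insertion order, whose comment matches."""
    for m in markers:
        if m.pitr_comment == comment:
            return m
    return None


# Two markers share a timestamp, yet remain distinguishable by id and comment.
markers = [
    Marker(1, "2004-07-06 09:00:00+00", "before nightly batch"),
    Marker(2, "2004-07-06 09:00:00+00", "after nightly batch"),
    Marker(3, "2004-07-06 11:30:00+00", "before schema change"),
]
stop = find_stop_marker(markers, "after nightly batch")
```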

--   Richard Huxton  Archonet Ltd


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
>  > - by time - but the time stamp on each xlog record only specifies to the
> > second, which could easily be 10 or more commits (we hope....)
> > 
> > Should we use a different datatype than time_t for the commit timestamp,
> > one that offers more fine grained differentiation between checkpoints?
> 
> Imho seconds is really sufficient. If you know a more precise position
> you will probably know it from backend log or an xlog sniffer. With those
> you can easily use the TransactionId way.
> 

OK, thanks. I'll leave the time_t datatype just the way it is.

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Tue, 2004-07-06 at 20:00, Richard Huxton wrote:
> Simon Riggs wrote:
> > On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> > 
> >>Simon Riggs <simon@2ndquadrant.com> writes:
> >>
> >>>Should we use a different datatype than time_t for the commit timestamp,
> >>>one that offers more fine grained differentiation between checkpoints?
> >>
> >>Pretty much everybody supports gettimeofday() (time_t and separate
> >>integer microseconds); you might as well use that.  Note that the actual
> >>resolution is not necessarily microseconds, and it'd still not be
> >>certain that successive commits have distinct timestamps --- so maybe
> >>this refinement would be pointless.  You'll still have to design a user
> >>interface that allows selection without the assumption of distinct
> >>timestamps.
> > 
> > 
> > Well, I agree, though without the desired-for UI now, I think some finer
> > grained mechanism would be good. This means extending the xlog commit
> > record by a couple of bytes...OK, lets live a little.
> 
> At the risk of irritating people, I'll repeat what I suggested a few 
> weeks ago...
> 

All feedback is good. Thanks.

> Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz, 
> pitr_comment text)
> Let the user insert rows in transactions as desired. Let them stop the 
> restore when a specific (pitr_ts,pitr_comment) gets inserted (or on 
> pitr_id if they record it).
> 

It's a good plan, but the recovery is currently offline recovery and no
SQL is possible. So no way to insert, no way to access tables until
recovery completes. I like that plan and probably would have used it if
it was viable.

> IMHO time is seldom relevant, event boundaries are.
> 

Agreed, but time is the universally agreed way of describing two events
as being simultaneous. No other way to say "recover to the point when
the message queue went wild".

As of last post to Andreas, I've said I'll not bother changing the
granularity of the timestamp.

> If you want to add special syntax for this, fine. If not, an INSERT 
> statement is a convenient way to do this anyway.

The special syntax isn't hugely important - I did suggest a kind of
SQL-like syntax previously, but that's gone now. Invoking recovery via a
command file IS important, so that we can tell the system it's not in
crash recovery AND that, once recovery has finished, it should respond to
crashes without re-entering archive recovery.

Thanks for your comments. I'm not making this more complex than needs
be; in fact much of the code is very simple - it's just the planning
that's complex.

Best regards, Simon Riggs



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:

> > - when we stop, keep reading records until EOF, just don't apply them.
> > When we write a checkpoint at end of recovery, the unapplied
> > transactions are buried alive, never to return.
> > - stop where we stop, then force zeros to EOF, so that no possible
> > record remains of previous transactions.
> 
> Go with plan B; it's best not to destroy data (what if you chose the
> wrong restart point the first time)?
> 
> Actually this now reminds me of a discussion I had with Patrick
> Macdonald some time ago.  The DB2 practice in this connection is that
> you *never* overwrite existing logfile data when recovering.  Instead
> you start a brand new xlog segment file, which is given a new "branch
> number" so it can be distinguished from the future-time xlog segments
> that you chose not to apply.  I don't recall what the DB2 terminology
> was exactly --- not "branch number" I don't think --- but anyway the
> idea is that when you restart the database after an incomplete recovery,
> you are now in a sort of parallel universe that has its own history
> after the branch point (PITR stop point).  You need to be able to
> distinguish archived log segments of this parallel universe from those
> of previous and subsequent incarnations.  I'm not sure whether Vadim
> intended our StartUpID to serve this purpose, but it could perhaps be
> used that way, if we reflected it in the WAL file names.
> 

Some more thoughts...focusing on what we do after we've finished
recovering. The objectives, as I see them, are to put the system into a
state that preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we need to ensure that any xlog records skipped at the admin's request
(in PITR mode) are never in a position to be re-applied to this timeline.
4. ensure we can re-recover, if we need to, without further problems

Tom's concept above, I'm going to call timelines. A timeline is the
sequence of logs created by the execution of a server. If you recover
the database, you create a new timeline. [This is because, if you've
invoked PITR you absolutely definitely want log records written to, say,
xlog15 to be different to those that were written to xlog15 in a
previous timeline that you have chosen not to reapply.]

Objective (1) is complex.
When we are restoring, we always start with archived copies of the xlog,
to make sure we don't finish too soon. We roll forward until we either
reach PITR stop point, or we hit end of archived logs. If we hit end of
logs on archive, then we switch to a local copy, if one exists that is
higher than those, and carry on rolling forward until either we reach
PITR stop point, or we hit end of that log. (Hopefully, there isn't more
than one local xlog higher than the archive, but it's possible).

If we are rolling forward on local copies, then they are our only
copies. We'd really like to archive them ASAP, but the archiver's not
running yet - we don't want to force that situation in case the archive
device (say a tape) is the one being used to recover right now. So we
write an archive_status of .ready for that file, ensuring that the
checkpoint won't remove it until it gets copied to archive, whenever
that starts working again. Objective (1) met.
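
The ordering described for objective (1) can be sketched as follows, in Python for brevity. Segments are modelled as plain integers and the PITR stop point is left out; plan_replay and its return shape are hypothetical names, not anything in the patch.

```python
def plan_replay(archived, local):
    """Return (replay order, local segments to flag as .ready).

    Archived copies are always preferred; a local segment is used only when
    it is numbered higher than every archived segment.
    """
    archived = sorted(archived)
    highest_archived = archived[-1] if archived else -1
    local_only = sorted(s for s in local if s > highest_archived)
    order = [("archive", s) for s in archived]
    order += [("local", s) for s in local_only]
    return order, local_only


# Segments 11 and 12 exist in both places, so the archived copies win;
# segment 13 exists only locally and must be kept until it is archived.
order, needs_ready = plan_replay(archived=[10, 11, 12], local=[11, 12, 13])
```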

When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches
That way, when we come up and begin running, we never overwrite files
that might have been written previously. Objective (2) met.
We do the other stuff because recovery finishes up by pointing to the
last applied record...which is what was causing all of this extra work
in the first place.

At this point, we also reset the secondary checkpoint record, so that
should recovery be required again before next checkpoint AND the
shutdown checkpoint record written after recovery completes is
wrong/damaged, the recovery will not autorewind back past the PITR stop
point and attempt to recover the records we have just tried so hard to
reverse/ignore. Objective (3) met. (Clearly, that situation seems
unlikely, but I feel we must deal with it...a newly restored system is
actually very fragile, so a crash again within 3 minutes or so is very
commonplace, as far as these things go).

Should we need to re-recover, we can do so because the new timeline
xlogs are further forward than the old timeline, so never get seen by
any processes (all of which look backwards). Re-recovery is possible
without problems, if required. This means you're a lot safer from some
of the mistakes you might have made, such as deciding you need to go into
recovery, then realising it wasn't required (or some other painful
flapping as goes on in computer rooms at 3am).

How do we implement timelines?
The main presumption in the code is that xlogs are sequential. That has
two effects:
1. during recovery, we try to open the "next" xlog by adding one to the
numbers and then looking for that file
2. during checkpoint, we look for filenames less than the current
checkpoint marker
Creating a timeline by adding a larger number to LogId allows us to
prevent (1) from working, yet without breaking (2).
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.
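
A toy model of those two effects, in Python, assuming each xlog file is identified by a single integer (log id in the high 32 bits, segment in the low 32) and that the jump is one step in the high 2 bytes of the log id. All names and values are illustrative only.

```python
def replay_chain(existing, start):
    """Follow the 'next file is current + 1' rule until a file is missing."""
    chain = []
    current = start
    while current in existing:
        chain.append(current)
        current += 1
    return chain


def removable_at_checkpoint(existing, checkpoint_marker):
    """Files numbered below the checkpoint marker are candidates for removal."""
    return sorted(f for f in existing if f < checkpoint_marker)


BUMP = 0x10000 << 32  # assumed: log id + 0x10000, expressed in the combined number

old_timeline = {0x0CE1, 0x0CE2, 0x0CE3, 0x0CE4}  # 0x0CE4 holds records PITR skipped
new_start = 0x0CE3 + BUMP  # hypothetical first file written after stopping in 0x0CE3
new_timeline = {new_start, new_start + 1}
existing = old_timeline | new_timeline

old_chain = replay_chain(existing, 0x0CE1)   # never walks into the new timeline
new_chain = replay_chain(existing, new_start)
removable = removable_at_checkpoint(existing, new_start + 1)
```

Walking forward from the old files stops at the gap, so effect (1) cannot carry a replay across timelines, while the "less than the checkpoint marker" comparison in (2) still sees the old files as older.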

Should we implement timelines?
Yes, I think we should. I've already hit the problems that timelines
solve in my testing and so that means they'll be hit when you don't need
the hassle.

Comments much appreciated, assuming you read this far...

Best regards, Simon Riggs



Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> Well, Tom does seem to have something with regard to StartUpIds. I feel
> it is easier to force a new timeline by adding a very large number to
> the LogId IF, and only if, we have performed an archive recovery. That
> way, we do not change at all the behaviour of the system for people that
> choose not to implement archive_mode.

Imho you should take a close look at StartUpId, I think it is exactly this
"large number". Maybe you can add +2 to intentionally leave a hole.

Once you increment, I think it is very essential to checkpoint and double
check pg_control, cause otherwise a crashrecovery would read the wrong xlogs.

> Should we implement timelines?

Yes :-)

Andreas


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Wed, 2004-07-07 at 14:17, Zeugswetter Andreas SB SD wrote:
> > Well, Tom does seem to have something with regard to StartUpIds. I feel
> > it is easier to force a new timeline by adding a very large number to
> > the LogId IF, and only if, we have performed an archive recovery. That
> > way, we do not change at all the behaviour of the system for people that
> > choose not to implement archive_mode.
> 
> Imho you should take a close look at StartUpId, I think it is exactly this 
> "large number". Maybe you can add +2 to intentionally leave a hole.
> 
> Once you increment, I think it is very essential to checkpoint and double 
> check pg_control, cause otherwise a crashrecovery would read the wrong xlogs.

Thanks for your thoughts - you have made me rethink this over some
hours. Trouble is, on this occasion, the other suggestion still seems
the best one, IMVHO.

If we number timelines based upon StartUpId, then there is still some
potential for conflict and this is what we're seeking to avoid.

Simply adding FFFF to the LogId puts the new timeline so far into the
previous timeline's future that there aren't any problems. We only
increment the timeline when we recover, so we're not eating up the
numbers quickly. Simply adding a number means that there isn't any
conflict with any normal operations. The timelines aren't numbered
separately, so I'm not introducing a new version of
StartUpID...technically there isn't a new timeline, just a chunk of
geological time between them.

We don't need to mention timelines in the docs, nor do we need to alter
pg_controldata to display it...just a comment in the code to explain why
we add a large number to the LogId after each recovery completes.

Best regards, Simon Riggs




Re: Point in Time Recovery

From
Simon Riggs
Date:
On Thu, 2004-07-08 at 07:57, spock@mgnet.de wrote:
> On Thu, 8 Jul 2004, Simon Riggs wrote:
> 
> > We don't need to mention timelines in the docs, nor do we need to alter
> > pg_controldata to display it...just a comment in the code to explain why
> > we add a large number to the LogId after each recovery completes.
> 
> I'd disagree on that. Knowing what exactly happens when recovering the
> database is a must. It makes people feel more safe about the process. This
> is because the software doesn't do things you wouldn't expect.
> 
> On Oracle e.g. you create a new database incarnation when you recover a
> database (incomplete). Which means your System Change Number and your Log
> Sequence is reset to 0.
> Knowing this is crucial because you otherwise would wonder "Why is it
> doing that? Is this a bug or a feature?"
> 

OK, will do.

Best regards, Simon Riggs



Re: Point in Time Recovery

From
spock@mgnet.de
Date:
On Tue, 6 Jul 2004, Zeugswetter Andreas SB SD wrote:

> > Should we use a different datatype than time_t for the commit timestamp,
> > one that offers more fine grained differentiation between checkpoints?
>
> Imho seconds is really sufficient. If you know a more precise position
> you will probably know it from backend log or an xlog sniffer. With those
> you can easily use the TransactionId way.

I'd also think that seconds are absolutely sufficient. From my daily
experience I can say that you're normally lucky to know the time
on one minute basis.
If you need to get closer, a log sniffer is unavoidable ...

Greetings, Klaus

-- 
Full Name   : Klaus Naumann     | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964  | E-Mail: (kn@mgnet.de)


Re: Point in Time Recovery

From
spock@mgnet.de
Date:
On Thu, 8 Jul 2004, Simon Riggs wrote:

> We don't need to mention timelines in the docs, nor do we need to alter
> pg_controldata to display it...just a comment in the code to explain why
> we add a large number to the LogId after each recovery completes.

I'd disagree on that. Knowing what exactly happens when recovering the
database is a must. It makes people feel more safe about the process. This
is because the software doesn't do things you wouldn't expect.

On Oracle e.g. you create a new database incarnation when you recover a
database (incomplete). Which means your System Change Number and your Log
Sequence is reset to 0.
Knowing this is crucial because you otherwise would wonder "Why is it
doing that? Is this a bug or a feature?"

Just my 2ct :-))

Greetings, Klaus

-- 
Full Name   : Klaus Naumann     | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964  | E-Mail: (kn@mgnet.de)


Re: Point in Time Recovery

From
Jan Wieck
Date:
On 7/6/2004 3:58 PM, Simon Riggs wrote:

> On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
>>  > - by time - but the time stamp on each xlog record only specifies to the
>> > second, which could easily be 10 or more commits (we hope....)
>> > 
>> > Should we use a different datatype than time_t for the commit timestamp,
>> > one that offers more fine grained differentiation between checkpoints?
>> 
>> Imho seconds is really sufficient. If you know a more precise position
>> you will probably know it from backend log or an xlog sniffer. With those
>> you can easily use the TransactionId way.

TransactionID and timestamp is only sufficient if the transactions are 
selected by their commit order. Especially in read committed mode, 
consider this execution:

    xid-1: start
    xid-2: start
    xid-2: update field x
    xid-2: commit
    xid-1: update field y
    xid-1: commit

In this case, the update done by xid-1 depends on the row created by 
xid-2. So logically xid-2 precedes xid-1, because it made its changes 
earlier.

So you have to apply the log until you find the commit record of the 
transaction you want to apply last, and then stamp all transactions that 
were in progress at that time as aborted.


Jan

>> 
> 
> OK, thanks. I'll just leave the time_t datatype just the way it is.
> 
> Best Regards, Simon Riggs


-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Sat, 2004-07-10 at 15:17, Jan Wieck wrote:
> On 7/6/2004 3:58 PM, Simon Riggs wrote:
> 
> > On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
> >>  > - by time - but the time stamp on each xlog record only specifies to the
> >> > second, which could easily be 10 or more commits (we hope....)
> >> > 
> >> > Should we use a different datatype than time_t for the commit timestamp,
> >> > one that offers more fine grained differentiation between checkpoints?
> >> 
> >> Imho seconds is really sufficient. If you know a more precise position
> >> you will probably know it from backend log or an xlog sniffer. With those
> >> you can easily use the TransactionId way.
> 
> TransactionID and timestamp is only sufficient if the transactions are 
> selected by their commit order. Especially in read committed mode, 
> consider this execution:
> 
>      xid-1: start
>      xid-2: start
>      xid-2: update field x
>      xid-2: commit
>      xid-1: update field y
>      xid-1: commit
> 
> In this case, the update done by xid-1 depends on the row created by 
> xid-2. So logically xid-2 precedes xid-1, because it made its changes 
> earlier.
> 
> So you have to apply the log until you find the commit record of the 
> transaction you want apply last, and then stamp all transactions that 
> where in progress at that time as aborted.
> 

Agreed.

I've implemented this exactly as you say....

This turns out to be very easy because:
- when looking where to stop we only ever stop at commit or aborts -
these are the only records that have timestamps anyway...
- any record that isn't specifically committed is not updated in the
clog and therefore not visible. The clog starts in indeterminate state,
0, and is then updated to either committed or aborted. Aborted and
indeterminate are handled similarly in the current code, to allow for
crash recovery - PITR doesn't change anything there.
So, PITR doesn't do anything that crash recovery doesn't already do.
Crash recovery makes no attempt to keep track of in-progress
transactions and doesn't make a special journey to the clog to
specifically mark them as aborted - they just are by default.

So, what we mean by "stop at a transactionId" is "stop applying redo at
the commit/abort record for that transactionId." It has to be an exact
match, not a "greater than", for exactly the reason you mention. That
means that although we stop at the commit record of transactionId X, we
may also have applied records for transactions with later transactionIds
e.g. X+1, X+2...etc (without limit or restriction).
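
A minimal sketch of that stopping rule, in Python and with a made-up record format of (kind, xid) pairs. The status codes and function names are illustrative; the real clog and redo code look nothing like this, but the logic is the same: exact match on the commit/abort record, and anything never marked committed stays invisible.

```python
IN_PROGRESS, COMMITTED, ABORTED = 0, 1, 2  # illustrative status codes


def replay_until_xid(records, target_xid):
    """Apply records in log order, stopping after target_xid's commit/abort.

    Returns (clog, applied): the status map and the number of records applied.
    """
    clog = {}
    applied = 0
    for kind, xid in records:
        applied += 1
        if kind == "commit":
            clog[xid] = COMMITTED
        elif kind == "abort":
            clog[xid] = ABORTED
        else:
            clog.setdefault(xid, IN_PROGRESS)
        if kind in ("commit", "abort") and xid == target_xid:
            break
    return clog, applied


def is_visible(clog, xid):
    """Only an explicit commit makes a transaction's changes visible."""
    return clog.get(xid, IN_PROGRESS) == COMMITTED


records = [
    ("update", 1), ("update", 2), ("update", 3),
    ("commit", 3), ("commit", 2), ("commit", 1),
]
clog, applied = replay_until_xid(records, target_xid=2)
```

Stopping at xid 2 leaves xid 3 committed and visible, while xid 1 is still in progress and so treated as aborted, which is why a "less than" comparison on the xid would give the wrong answer.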

(I'll even admit that at first, I did think we could get away with the
"less than" test that you are warning me about. Overall, I've spent more
time on theory/analysis than on coding, on the idea that you can improve
poor code, but wrong code just needs to be thrown away).

Timestamps are more vague...When time is used, there might easily be 10+
transactions whose commit/abort records have identical timestamp values.
So we either stop at the first or last record depending upon whether we
specified inclusive or exclusive on the recovery target value.
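
A small sketch of the inclusive/exclusive choice, in Python for illustration only. The function name is hypothetical, and it assumes that timestamps are whole seconds and appear in non-decreasing order in the log, and that "exclusive" means the records stamped exactly at the target are not applied:

```python
def stop_index(commit_times, target, inclusive):
    """Return how many commit/abort records get applied.

    commit_times: timestamps (whole seconds) of commit/abort records,
                  in log order.
    inclusive=True  -> apply every record stamped <= target
                       (stop after the last record with that timestamp).
    inclusive=False -> apply only records stamped < target
                       (stop before the first record with that timestamp).
    """
    count = 0
    for ts in commit_times:
        if ts > target or (ts == target and not inclusive):
            break
        count += 1
    return count


# Three commits share the same one-second timestamp.
times = [100, 101, 101, 101, 102]
applied_inclusive = stop_index(times, 101, inclusive=True)
applied_exclusive = stop_index(times, 101, inclusive=False)
```

The three records stamped 101 cannot be told apart by time alone, so the target either takes all of them or none of them.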

The hard bit, IMHO, is what we do with the part of the log that we have
chosen not to apply....which has been discussed on list in detail also.

Thanks for keeping an eye out for possible errors - this one is
something I'd thought through and catered for (there are comments in my
current latest published code to that effect, although I have not yet
finished coding the clean-up-after-stopping part). 

This implies nothing with regard to other possible errors or oversights
and so I very much welcome any questioning of this nature - I am as
prone to error as the next man. It's important we get this right.

Best regards, Simon Riggs



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Tue, 2004-07-06 at 22:40, Simon Riggs wrote:
> On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> 
> > > - when we stop, keep reading records until EOF, just don't apply them.
> > > When we write a checkpoint at end of recovery, the unapplied
> > > transactions are buried alive, never to return.
> > > - stop where we stop, then force zeros to EOF, so that no possible
> > > record remains of previous transactions.
> > 
> > Go with plan B; it's best not to destroy data (what if you chose the
> > wrong restart point the first time)?
> > 
> > Actually this now reminds me of a discussion I had with Patrick
> > Macdonald some time ago.  The DB2 practice in this connection is that
> > you *never* overwrite existing logfile data when recovering.  Instead
> > you start a brand new xlog segment file, which is given a new "branch
> > number" so it can be distinguished from the future-time xlog segments
> > that you chose not to apply.  I don't recall what the DB2 terminology
> > was exactly --- not "branch number" I don't think --- but anyway the
> > idea is that when you restart the database after an incomplete recovery,
> > you are now in a sort of parallel universe that has its own history
> > after the branch point (PITR stop point).  You need to be able to
> > distinguish archived log segments of this parallel universe from those
> > of previous and subsequent incarnations.  I'm not sure whether Vadim
> > intended our StartUpID to serve this purpose, but it could perhaps be
> > used that way, if we reflected it in the WAL file names.
> > 
> 
> Some more thoughts...focusing on the what do we do after we've finished
> recovering. The objectives, as I see them, are to put the system into a
> state, that preserves these features:
> 1. we never overwrite files, in case we want to re-run recovery
> 2. we never write files that MIGHT have been written previously
> 3. we need to ensure that any xlog records skipped at the admin's request (in
> PITR mode) are never in a position to be re-applied to this timeline.
> 4. ensure we can re-recover, if we need to, without further problems
> 
> Tom's concept above, I'm going to call timelines. A timeline is the
> sequence of logs created by the execution of a server. If you recover
> the database, you create a new timeline. [This is because, if you've
> invoked PITR you absolutely definitely want log records written to, say,
> xlog15 to be different to those that were written to xlog15 in a
> previous timeline that you have chosen not to reapply.]
> 
> Objective (1) is complex.
> When we are restoring, we always start with archived copies of the xlog,
> to make sure we don't finish too soon. We roll forward until we either
> reach the PITR stop point, or we hit the end of the archived logs. If we hit
> the end of the logs on archive, then we switch to a local copy, if one exists
> that is higher than those, and carry on rolling forward until either we reach
> the PITR stop point, or we hit the end of that log. (Hopefully, there isn't more
> than one local xlog higher than the archive, but it's possible). 
> If we are rolling forward on local copies, then they are our only
> copies. We'd really like to archive them ASAP, but the archiver's not
> running yet - we don't want to force that situation in case the archive
> device (say a tape) is the one being used to recover right now. So we
> write an archive_status of .ready for that file, ensuring that the
> checkpoint won't remove it until it gets copied to archive, whenever
> that starts working again. Objective (1) met.
> 
> When we have finished recovering we:
> - create a new xlog at the start of a new ++timeline
> - copy the last applied xlog record to it as the first record
> - set the record pointer so that it matches
> That way, when we come up and begin running, we never overwrite files
> that might have been written previously. Objective (2) met.
> We do the other stuff because recovery finishes up by pointing to the
> last applied record...which is what was causing all of this extra work
> in the first place.
> 
> At this point, we also reset the secondary checkpoint record, so that
> should recovery be required again before next checkpoint AND the
> shutdown checkpoint record written after recovery completes is
> wrong/damaged, the recovery will not autorewind back past the PITR stop
> point and attempt to recover the records we have just tried so hard to
> reverse/ignore. Objective (3) met. (Clearly, that situation seems
> unlikely, but I feel we must deal with it...a newly restored system is
> actually very fragile, so a crash again within 3 minutes or so is very
> commonplace, as far as these things go).
> 
> Should we need to re-recover, we can do so because the new timeline
> xlogs are further forward than the old timeline, so never get seen by
> any processes (all of which look backwards). Re-recovery is possible
> without problems, if required. This means you're a lot safer from some
> of the mistakes you might have made, such as deciding you need to go into
> recovery, then realising it wasn't required (or some other painful
> flapping as goes on in computer rooms at 3am).
> 
> How do we implement timelines?
> The main presumption in the code is that xlogs are sequential. That has
> two effects:
> 1. during recovery, we try to open the "next" xlog by adding one to the
> numbers and then looking for that file
> 2. during checkpoint, we look for filenames less than the current
> checkpoint marker
> Creating a timeline by adding a larger number to LogId allows us to
> prevent (1) from working, yet without breaking (2).
> Well, Tom does seem to have something with regard to StartUpIds. I feel
> it is easier to force a new timeline by adding a very large number to
> the LogId IF, and only if, we have performed an archive recovery. That
> way, we do not change at all the behaviour of the system for people that
> choose not to implement archive_mode.
> 
> Should we implement timelines?
> Yes, I think we should. I've already hit the problems that timelines
> solve in my testing and so that means they'll be hit when you don't need
> the hassle.
> 

I'm still wrestling with the cleanup-after-stopping-at-point-in-time
code and have some important conclusions.

Moving forward on a timeline is somewhat tricky for xlogs, as shown
above,...but...

My earlier treatment seems to have neglected to include the clog also.
If we stop before end of log, then we also have potentially many (though
presumably at least one) committed transactions that we do not want to
be told about ever again. 

The idea of starting a new timeline works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (yet
vanishing) possibility that there is an as-yet-undiscovered committed
transaction in the future. (This is because transactions are ordered in the
clog by xid, and xids are assigned sequentially at txn start, whereas in
the xlog they are recorded in the order the txns complete).

Please tell me that we can ignore the state of the clog, but I think we
can't - if a new xid re-used a previous xid that had committed AND then
we crashed...we would have inconsistent data. Unless we physically write
zeros to clog for every begin transaction after a recovery...err, no...

The only recourse that I can see is to "truncate the future" of the
clog, which would mean:
- keeping track of the highest xid provided by any record from the xlog,
in xact.c, xact_redo
- using that xid to write zeros to the clog after this point until EOF
- drop any clog segment files past the new "high" segment
- no idea whether that affects NT or not...
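
To make the bookkeeping in that list concrete, here is a rough Python sketch of where zeroing would have to begin. It relies on the clog storing two status bits per transaction (four transactions per byte) and assumes the default 8 kB page size; the function names are hypothetical, and xid wraparound is ignored:

```python
BLCKSZ = 8192                  # assumed default page size
XACTS_PER_BYTE = 4             # two status bits per transaction
XACTS_PER_PAGE = BLCKSZ * XACTS_PER_BYTE


def clog_position(xid):
    """Return (page, byte_in_page, bit_shift) of an xid's status bits."""
    page = xid // XACTS_PER_PAGE
    byte_in_page = (xid % XACTS_PER_PAGE) // XACTS_PER_BYTE
    bit_shift = (xid % XACTS_PER_BYTE) * 2
    return page, byte_in_page, bit_shift


def highest_xid_seen(redo_xids):
    """Track the highest xid carried by any replayed xlog record."""
    high = 0
    for xid in redo_xids:
        high = max(high, xid)
    return high


# Everything after this position would be the "future" to zero out.
high = highest_xid_seen([7, 3, 9, 8])
first_future = clog_position(high + 1)
```

Because each xid has a fixed slot, the position is computed directly from the xid rather than found by scanning, which is the contrast with xlog records drawn in this message.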

The timeline idea works for xlog because once we've applied the xlog
records and checkpointed, we can discard them. We can't do that with
clog records (unless we followed recovery with a vacuum full - which is
possible, but not hugely desirable). Even that wouldn't solve the
underlying issue: xlog records have no prescribed position in the
file, whereas clog records do.

Right now, I don't know where to start with the clog code and the
opportunity for code-overlap with NT seems way high. These problems can
be conquered, given time and "given enough eyeballs".

I'm all ears for some bright ideas...but I'm getting pretty wary that we
may introduce some unintended features if we try to get this stabilised
within two weeks. My current conclusion is: let's commit archive recovery
in this release, then wait until next dot release for full recovery
target features. We've hit all the features which were a priority and
the fundamental architecture is there, so I think it is time to be happy
with what we've got, for now.

Comments, please....remembering that I'd love it if I've missed
something that simplifies the task. Fire away.

Best regards, Simon Riggs



Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> The starting a new timeline thought works for xlogs, but not for clogs.
> No matter how far you go into the future, there is a small (yet
> vanishing) possibility that there is a yet undiscovered committed
> transaction in the future. (Because transactions are ordered in the clog
> because xids are assigned sequentially at txn start, but not ordered in
> the xlog where they are recorded in the order the txns complete).

Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem
would arise at xid wraparound, no ?

Andreas


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Tue, 2004-07-13 at 13:18, Zeugswetter Andreas SB SD wrote:
> > The starting a new timeline thought works for xlogs, but not for clogs.
> > No matter how far you go into the future, there is a small (yet
> > vanishing) possibility that there is a yet undiscovered committed
> > transaction in the future. (Because transactions are ordered in the clog
> > because xids are assigned sequentially at txn start, but not ordered in
> > the xlog where they are recorded in the order the txns complete).
> 
> Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem
> would arise at xid wraparound, no ?
> 

Sounds like a good start...

When PITR ends, we need to stop mid-way through a file. Does that handle
that situation?

Simon Riggs



Re: Point in Time Recovery

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Please tell me that we can ignore the state of the clog,

We can.

The reason that keeping track of timelines is interesting for xlog is
simply to take pity on the poor DBA who needs to distinguish the various
archived xlog files he's got laying about, and so that we can detect
errors like supplying inconsistent sets of xlog segments during restore.

This does not apply to clog because it's not archived.  It's no more
than a data file.  If you think you have trouble recreating clog then
you have the same issues recreating data files.
        regards, tom lane


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Tue, 2004-07-13 at 15:29, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Please tell me that we can ignore the state of the clog,
> 
> We can.
> 

In general, you are of course correct.

> The reason that keeping track of timelines is interesting for xlog is
> simply to take pity on the poor DBA who needs to distinguish the various
> archived xlog files he's got laying about, and so that we can detect
> errors like supplying inconsistent sets of xlog segments during restore.
> 
> This does not apply to clog because it's not archived.  It's no more
> than a data file.  If you think you have trouble recreating clog then
> you have the same issues recreating data files.

I'm getting carried away with the improbable....but this is the rather
strange, but possible scenario I foresee:

A sequence of times...
1. We start archiving xlogs
2. We take a checkpoint
3. we commit an important transaction
4. We take a backup
5. We take a checkpoint

As stands currently, when we restore the backup, controlfile says that
last checkpoint was at 2, so we rollforward from 2 THRU 4 and continue
on past 5 until end of logs. Normally, end of logs isn't until after
4...

When we specify a recovery target, it is possible to specify the
rollforward to complete just before point 3. So we use the backup taken
at 4 to rollforward to a point in the past (from the backup's
perspective). The backup taken at 4 may now have data and clog records
written by bgwriter.

Given that time between checkpoints is likely to be longer than
previously was the case...the chance of this happening becomes non-zero.

I was trying to solve this problem head on, but the best way is to make
sure we never allow ourselves such a muddled situation:

ISTM the way to avoid this is to insist that we always rollforward
through at least one checkpoint to guarantee that this will not occur. 

...then I can forget I ever mentioned the ****** clog again.

I'm ignoring this issue for now....whether it exists or not!

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I'm getting carried away with the improbable....but this is the rather
> strange, but possible scenario I foresee:

> A sequence of times...
> 1. We start archiving xlogs
> 2. We take a checkpoint
> 3. we commit an important transaction
> 4. We take a backup
> 5. We take a checkpoint

> When we specify a recovery target, it is possible to specify the
> rollforward to complete just before point 3.

No, it isn't possible.  The recovery *must* proceed at least as far as
wherever the end of the log was at the time the backup was completed.
Otherwise everything is broken, not only clog, because you may have disk
blocks in your backup that postdate where you stopped log replay.

To have a consistent recovery at all, you must replay the log starting
from a checkpoint before the backup began and extending to the time that
the backup finished.  You only get to decide where to stop after that
point.
        regards, tom lane
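
The constraint described in this message can be sketched as a simple check, in Python for illustration. The function name and the use of plain numbers for times are assumptions; as the follow-ups note, nothing in the patch enforced this at the time:

```python
def validate_recovery_target(backup_end_time, target_time):
    """Reject a recovery target that falls before the backup finished.

    Replay must start at a checkpoint before the backup began and reach
    at least the end of the log as of backup completion; only after that
    point may the stop position be chosen.
    """
    if target_time < backup_end_time:
        raise ValueError(
            "recovery target precedes end of backup; "
            "replay must reach at least the backup's end-of-log position"
        )
    return target_time


# Backup finished at t=200; stopping at t=250 is acceptable.
chosen = validate_recovery_target(backup_end_time=200, target_time=250)
```

Calling it with `target_time=150` raises `ValueError`, which corresponds to trying to stop "just before point 3" using the backup taken at point 4.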


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Tue, 2004-07-13 at 22:19, Tom Lane wrote:

> To have a consistent recovery at all, you must replay the log starting
> from a checkpoint before the backup began and extending to the time that
> the backup finished.  You only get to decide where to stop after that
> point.
> 

So the situation is: 
- You must only stop recovery at a point in time (in the logs) after the
backup had completed.

No way to enforce that currently, apart from procedurally. Not exactly
frequent, so I think I just document that and move on, eh?

Thanks for your help,

Best regards, Simon Riggs



Re: Point in Time Recovery

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Tue, 2004-07-13 at 22:19, Tom Lane wrote:
> 
> > To have a consistent recovery at all, you must replay the log starting
> > from a checkpoint before the backup began and extending to the time that
> > the backup finished.  You only get to decide where to stop after that
> > point.
> > 
> 
> So the situation is: 
> - You must only stop recovery at a point in time (in the logs) after the
> backup had completed.
> 
> No way to enforce that currently, apart from procedurally. Not exactly
> frequent, so I think I just document that and move on, eh?

If it happens, could you use your previous full backup and the PITR logs
from before you stopped logging, and then after?  Is there a period
where they could not restore reliably?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Tue, 2004-07-13 at 23:42, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Tue, 2004-07-13 at 22:19, Tom Lane wrote:
> > 
> > > To have a consistent recovery at all, you must replay the log starting
> > > from a checkpoint before the backup began and extending to the time that
> > > the backup finished.  You only get to decide where to stop after that
> > > point.
> > > 
> > 
> > So the situation is: 
> > - You must only stop recovery at a point in time (in the logs) after the
> > backup had completed.
> > 
> > No way to enforce that currently, apart from procedurally. Not exactly
> > frequent, so I think I just document that and move on, eh?
> 
> If it happens, could you use your previous full backup and the PITR logs
> from before stop stopped logging, and then after?  

Yes.

> Is there a period
> where they could not restore reliably?

Good question. No is the answer. 

The situation is that the backup isn't timestamped with respect to the
logs, so it's possible to attempt to use the wrong backup for recovery.

The solution is procedural - make sure you timestamp your backup files,
so you know which ones to recover with...

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> So the situation is: 
> - You must only stop recovery at a point in time (in the logs) after the
> backup had completed.

Right.

> No way to enforce that currently, apart from procedurally. Not exactly
> frequent, so I think I just document that and move on, eh?

The procedure that generates a backup has got to be responsible for
recording both the start and stop times.  If it does not do so then
it's fatally flawed.  (Note also that you had better be careful to get
the time as seen on the server machine's clock ... this could be a nasty
gotcha if the backup is run on a different machine, such as an NFS
server.)
        regards, tom lane


Re: Point in Time Recovery

From
Bruce Momjian
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > So the situation is: 
> > - You must only stop recovery at a point in time (in the logs) after the
> > backup had completed.
> 
> Right.
> 
> > No way to enforce that currently, apart from procedurally. Not exactly
> > frequent, so I think I just document that and move on, eh?
> 
> The procedure that generates a backup has got to be responsible for
> recording both the start and stop times.  If it does not do so then
> it's fatally flawed.  (Note also that you had better be careful to get
> the time as seen on the server machine's clock ... this could be a nasty
> gotcha if the backup is run on a different machine, such as an NFS
> server.)

OK, but procedurally, how do you correlate the start/stop time of the
tar backup with the WAL numeric file names?



Re: Point in Time Recovery

From
Simon Riggs
Date:
PITR Patch v5_1 just posted has Point in Time Recovery working....

Still some rough edges....but we really need some testers now to give
this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.

Many thanks,

Simon Riggs


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Wed, 2004-07-14 at 00:01, Bruce Momjian wrote:
> Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > So the situation is: 
> > > - You must only stop recovery at a point in time (in the logs) after the
> > > backup had completed.
> > 
> > Right.
> > 
> > > No way to enforce that currently, apart from procedurally. Not exactly
> > > frequent, so I think I just document that and move on, eh?
> > 
> > The procedure that generates a backup has got to be responsible for
> > recording both the start and stop times.  If it does not do so then
> > it's fatally flawed.  (Note also that you had better be careful to get
> > the time as seen on the server machine's clock ... this could be a nasty
> > gotcha if the backup is run on a different machine, such as an NFS
> > server.)
> 
> OK, but procedurally, how do you correlate the start/stop time of the
> tar backup with the WAL numeric file names?

No need. You just correlate the recovery target with the backup file
times. Mostly, you'll only ever use your last backup and won't need to
fuss with the times.

Backup should begin with a CHECKPOINT...then wait for that to complete,
just to make the backup as current as possible.

If you want to start purging your archives of old archived xlogs, you
can use the filedate (assuming you preserved that on your copy to
archive - but even if not, they'll be fairly close).

Best regards, Simon Riggs



Re: Point in Time Recovery

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> OK, but procedurally, how do you correlate the start/stop time of the
> tar backup with the WAL numeric file names?

Ideally the procedure for making a backup would go something like:

1. Inquire of the server its current time and the WAL position of the
most recent checkpoint record (which is what you really need).

2. Make the backup.

3. Inquire of the server its current time and the current end-of-WAL
position.

4. Record items 1 and 3 along with the backup itself.

I think the current theory was you could fake #1 by copying pg_control
before everything else, but this doesn't really help for step #3, so
it would probably be better to add some server functions to get this
info.
        regards, tom lane
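
The four steps above can be sketched as follows, in Python for illustration. The server methods (`current_time`, `last_checkpoint`, `end_of_wal`) are hypothetical stand-ins for the server functions this message suggests adding; they are not existing APIs, and `FakeServer` exists only to make the sketch runnable:

```python
import json


def run_backup(server, make_backup):
    """Record server-side time and WAL position around a backup.

    `server` is any object offering the three hypothetical methods used
    below; `make_backup` is a callable that copies the data directory.
    """
    label = {
        "start_time": server.current_time(),           # step 1
        "start_checkpoint": server.last_checkpoint(),  # step 1
    }
    make_backup()                                      # step 2
    label["stop_time"] = server.current_time()         # step 3
    label["stop_wal"] = server.end_of_wal()            # step 3
    return json.dumps(label, sort_keys=True)           # step 4: keep with backup


class FakeServer:
    """Stand-in server with a manually advanced clock and WAL position."""

    def __init__(self):
        self.clock, self.wal = 1000, "0/1000"

    def current_time(self):
        return self.clock

    def last_checkpoint(self):
        return "0/0F00"

    def end_of_wal(self):
        return self.wal


srv = FakeServer()


def fake_copy():
    # Simulate time passing and WAL being written during the copy.
    srv.clock, srv.wal = 1060, "0/2400"


label = json.loads(run_backup(srv, fake_copy))
```

Both times come from the server rather than from the machine running the copy, which sidesteps the clock-skew problem mentioned earlier for backups driven from another host.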


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Wed, 2004-07-14 at 00:28, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > OK, but procedurally, how do you correlate the start/stop time of the
> > tar backup with the WAL numeric file names?
> 
> Ideally the procedure for making a backup would go something like:
> 
> 1. Inquire of the server its current time and the WAL position of the
> most recent checkpoint record (which is what you really need).
> 
> 2. Make the backup.
> 
> 3. Inquire of the server its current time and the current end-of-WAL
> position.
> 
> 4. Record items 1 and 3 along with the backup itself.
> 
> I think the current theory was you could fake #1 by copying pg_control
> before everything else, but this doesn't really help for step #3, so
> it would probably be better to add some server functions to get this
> info.
> 

err...I think at this point we should review the PITR patch....

The recovery mechanism doesn't rely upon you knowing 1 or 3. The
recovery reads pg_control (from the backup) and then attempts to
de-archive the appropriate xlog segment file and then starts rollforward
from there. Effectively, restore assumes it has access to an infinite
timeline of logs....which clearly isn't the case, but it's up to *you* to
check that you have the logs that go with the backups. (Or put another
way, if this sounds hard, buy some software that administers the
procedure for you). That's the mechanism that allows "infinite
recovery".

In brief, the code path is as identical as possible to the current crash
recovery situation...archive recovery restores the files from archive
when they are needed, just as if they had always been in pg_xlog, in a
way that ensures pg_xlog never runs out of space.

Recovery ends when: it reaches the recovery target you specified, or it
runs out of xlogs (first it runs out of archived xlogs, then tries to
find a more recent local copy if there is one).
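
A rough sketch of that control flow, in Python for illustration only: the archive and pg_xlog are modelled as dictionaries from segment number to a list of records, and all names are hypothetical rather than taken from the patch.

```python
def next_segment(seg_no, archive, local):
    """Look for segment seg_no in the archive first, then in local pg_xlog.

    `archive` and `local` map segment number -> list of records.
    Returns None when neither location has the segment.
    """
    if seg_no in archive:
        return archive[seg_no]
    return local.get(seg_no)


def recover(start_seg, archive, local, reached_target):
    """Replay segments in order until the target is hit or logs run out."""
    applied, seg_no = [], start_seg
    while True:
        records = next_segment(seg_no, archive, local)
        if records is None:
            return applied, "end of logs"
        for rec in records:
            applied.append(rec)
            if reached_target(rec):
                return applied, "recovery target"
        seg_no += 1


archive = {1: ["a", "b"], 2: ["c"]}
local = {2: ["stale"], 3: ["d", "e"]}

no_target = recover(1, archive, local, lambda rec: False)
with_target = recover(1, archive, local, lambda rec: rec == "c")
```

Segment 2 exists in both places and the archived copy wins; the local copy of segment 3 is only consulted once the archive has nothing further to offer.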

> I think the current theory was you could fake #1 by copying pg_control
> before everything else, but this doesn't really help for step #3, so
> it would probably be better to add some server functions to get this
> info.

Not sure what you mean by "fake"....

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Christopher Kings-Lynne
Date:
Can you give us some suggestions of what kind of stuff to test?  Is
there a way we can artificially kill the backend in all sorts of nasty
spots to see if recovery works?  Does kill -9 simulate a 'power off'?

Chris

Simon Riggs wrote:

> PITR Patch v5_1 just posted has Point in Time Recovery working....
>
> Still some rough edges....but we really need some testers now to give
> this a try and let me know what you think.
>
> Klaus Naumann and Mark Wong are the only [non-committers] to have tried
> to run the code (and let me know about it), so please have a look at
> [PATCHES] and try it out.
>
> Many thanks,
>
> Simon Riggs

Re: Point in Time Recovery

From
Simon Riggs
Date:
On Wed, 2004-07-14 at 03:31, Christopher Kings-Lynne wrote:
> Can you give us some suggestions of what kind of stuff to test?  Is
> there a way we can artificially kill the backend in all sorts of nasty
> spots to see if recovery works?  Does kill -9 simulate a 'power off'?
>

I was hoping some fiendish plans would be presented to me...

But please start with "this feels like typical usage" and we'll go from
there...the important thing is to try the first one.

I've not done power off tests, yet. They need to be done just to
check...actually you don't need to do this to test PITR...

We need exhaustive tests of...
- power off
- scp and cross network copies
- all the permuted recovery options
- archive_mode = off (i.e. current behaviour)
- deliberately incorrectly set options (idiot-proof testing)

I'd love some help assembling a test document with numbered tests...

Best regards, Simon Riggs


Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> The recovery mechanism doesn't rely upon you knowing 1 or 3. The
> recovery reads pg_control (from the backup) and then attempts to
> de-archive the appropriate xlog segment file and then starts
> rollforward

Unfortunately this only works if pg_control was the first file to be
backed up (or if by chance no checkpoint happened between backup start and
the pg_control backup)

Other db's have commands for:
start/end external backup

Maybe we should add those two commands that would initially only do
the following:

start external backup:
- (checkpoint as an option)
- make a copy of pg_control
end external backup:
- record WAL position (helps choose an allowed minimum PIT)

Those commands would actually not be obligatory but recommended, and would
only help with the restore process.

Restore would then either take the existing pg_control backup, or ask
the dba where rollforward should start and create a corresponding pg_control.
A helper utility could list possible checkpoints in a given xlog.
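
The ordering concern can be sketched as follows, in Python for illustration. The helper name is hypothetical, and the path `global/pg_control` is assumed to be relative to the data directory:

```python
def backup_order(paths, control_file="global/pg_control"):
    """Return the files to copy, with the control file first.

    Copying pg_control before anything else means the checkpoint it
    points at cannot postdate any data block copied afterwards.
    """
    if control_file not in paths:
        raise FileNotFoundError(control_file)
    rest = sorted(p for p in paths if p != control_file)
    return [control_file] + rest


files = ["base/1/1259", "global/pg_control", "base/1/1249", "pg_clog/0000"]
ordered = backup_order(files)
```

A "start external backup" command along the lines proposed here would remove the need to rely on copy order at all, since the saved pg_control copy would be taken at a known point.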

Andreas


Re: Point in Time Recovery

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I've not done power off tests, yet. They need to be done just to
> check...actually you don't need to do this to test PITR...

I agree, power off is not really the point here.  What we need to check
into is (a) the mechanics of archiving WAL segments and (b) the
process of restoring given a backup and a bunch of WAL segments.

            regards, tom lane

Re: Point in Time Recovery

From
markw@osdl.org
Date:
On 14 Jul, Simon Riggs wrote:
> PITR Patch v5_1 just posted has Point in Time Recovery working....
>
> Still some rough edges....but we really need some testers now to give
> this a try and let me know what you think.
>
> Klaus Naumann and Mark Wong are the only [non-committers] to have tried
> to run the code (and let me know about it), so please have a look at
> [PATCHES] and try it out.
>
> Many thanks,
>
> Simon Riggs

Simon,

I just tried applying the v5_1 patch against the cvs tip today and got a
couple of rejections.  I'll copy the patch output here.  Let me know if
you want to see the reject files or anything else:

$ patch -p0 < ../../../pitr-v5_1.diff
patching file backend/access/nbtree/nbtsort.c
Hunk #2 FAILED at 221.
1 out of 2 hunks FAILED -- saving rejects to file backend/access/nbtree/nbtsort.c.rej
patching file backend/access/transam/xlog.c
Hunk #11 FAILED at 1802.
Hunk #15 FAILED at 2152.
Hunk #16 FAILED at 2202.
Hunk #21 FAILED at 3450.
Hunk #23 FAILED at 3539.
Hunk #25 FAILED at 3582.
Hunk #26 FAILED at 3833.
Hunk #27 succeeded at 3883 with fuzz 2.
Hunk #28 FAILED at 4446.
Hunk #29 succeeded at 4470 with fuzz 2.
8 out of 29 hunks FAILED -- saving rejects to file backend/access/transam/xlog.c.rej
patching file backend/postmaster/Makefile
patching file backend/postmaster/postmaster.c
Hunk #3 succeeded at 1218 with fuzz 2 (offset 70 lines).
Hunk #4 succeeded at 1827 (offset 70 lines).
Hunk #5 succeeded at 1874 (offset 70 lines).
Hunk #6 succeeded at 1894 (offset 70 lines).
Hunk #7 FAILED at 1985.
Hunk #8 succeeded at 2039 (offset 70 lines).
Hunk #9 succeeded at 2236 (offset 70 lines).
Hunk #10 succeeded at 2996 with fuzz 2 (offset 70 lines).
1 out of 10 hunks FAILED -- saving rejects to file backend/postmaster/postmaster.c.rej
patching file backend/storage/smgr/md.c
Hunk #1 succeeded at 162 with fuzz 2.
patching file backend/utils/misc/guc.c
Hunk #1 succeeded at 342 (offset 9 lines).
Hunk #2 succeeded at 1387 (offset 9 lines).
patching file backend/utils/misc/postgresql.conf.sample
Hunk #1 succeeded at 113 (offset 10 lines).
patching file bin/initdb/initdb.c
patching file include/access/xlog.h
patching file include/storage/pmsignal.h


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Wed, 2004-07-14 at 16:55, markw@osdl.org wrote:
> On 14 Jul, Simon Riggs wrote:
> > PITR Patch v5_1 just posted has Point in Time Recovery working....
> >
> > Still some rough edges....but we really need some testers now to give
> > this a try and let me know what you think.
> >
> > Klaus Naumann and Mark Wong are the only [non-committers] to have tried
> > to run the code (and let me know about it), so please have a look at
> > [PATCHES] and try it out.
> >

> I just tried applying the v5_1 patch against the cvs tip today and got a
> couple of rejections.  I'll copy the patch output here.  Let me know if
> you want to see the reject files or anything else:
>

I'm on it. Sorry 'bout that all - midnight fingers.


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:
> > The recovery mechanism doesn't rely upon you knowing 1 or 3. The
> > recovery reads pg_control (from the backup) and then attempts to
> > de-archive the appropriate xlog segment file and then starts 
> > rollforward
> 
> Unfortunately this only works if pg_control was the first file to be 
> backed up (or by chance no checkpoint happened after backup start and 
> pg_control backup)
> 
> Other db's have commands for:
> start/end external backup
> 

OK...this idea has come up a few times. Here's my take:

- OS and hardware facilities exist now to make instant copies of sets of
files. Some of these are open source, others not. If you use these, you
have no requirement for this functionality....but these alone are no
replacement for archive recovery.... I accept that some people may not
wish to go to the expense or effort to use those options, but in my mind
these are the people that will not be using archive_mode anyway.

- all we would really need to do is to stop the bgwriter from doing
anything during backup. pg_control is only updated at checkpoint. The
current xlog is updated constantly, but this need not be copied because
we are already archiving it as soon as it's full. That leaves the
bgwriter, which is now responsible for both lazy writing and
checkpoints.
So, put a switch into bgwriter to halt for a period, then turn it back
on at the end. Could be a SIGHUP GUC...or...

...and with my greatest respects....

- please could somebody else code that?... my time is limited

Best regards, Simon Riggs



Re: Point in Time Recovery

From
Mark Kirkwood
Date:
I noticed that compiling with the 5_1 patch applied fails due to
XLOG_archive_dir being removed from xlog.c, but
src/backend/commands/tablecmds.c still uses it.

I did the following to tablecmds.c :

5408c5408
<               extern char XLOG_archive_dir[];
---
 >               extern char *XLogArchiveDest;
5410c5410
<               use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
---
 >               use_wal = XLogArchiveDest[0] && !rel->rd_istemp;


Now I have to see if I have broken it with this change :-)

regards

Mark


Re: Point in Time Recovery

From
SAKATA Tetsuo
Date:
Hi, folks.

My colleagues and I are planning to test PITR after the 7.5 beta release.
We are now designing the test items, but some parts of the specification
are not clear enough (to us).

For example, it is not clear to us which resource managers are supposed
to store log records:
  - whether some access methods (like B-tree) are required to log their
    data or not.
  - whether create/drop actions on tablespaces are stored in the log or
    not.

We would be pleased if someone could tell us.

The test set we will run has the following items:
  - PITR can recover data from ordinary committed transactions.
    - tuple data themselves
    - index data associated with them
  - PITR can recover data from some special committed transactions.
    - DDL: create database, table, index and so on
    - maintenance commands (handling large amounts of data):
      truncate, vacuum, reindex and so on.

The items above are the 'data aspects' of the test. Other aspects are as
follows:
  - Location of the archive log's drive:
    PITR can recover a database from archived log data
      - stored on the same drive as xlog.
      - stored on a different drive on the same machine
        on which PostgreSQL runs.
      - stored on a different drive on a different machine.

  - Duration between a checkpoint and recovery:
    PITR can recover a database long enough after a checkpoint.
  - Time to recover:
    - to end of the log.
    - to some specified time.
  - Type of failure:
    - system down --- kill the PostgreSQL process (as a simulation).
    - media loss  --- delete database files (as a simulation).
    - These two cases will be tested in a simulated situation first, and
      we will try some 'real' failures afterwards (a real power-down of
      the test machine for the first case, and unplugging the disk drive
      for the second one. These actions could damage the test machine,
      which is why we plan them after the 'ordinary' test items.)

The test set is under construction; we will test the 7.5 beta
for some weeks and report the results of the test here.

Sincerely yours,
Tetsuo SAKATA.

-- 
sakata.tetsuo _at_ lab.ntt.co.jp
SAKATA, Tetsuo. Yokosuka JAPAN.



Re: Point in Time Recovery

From
Bruce Momjian
Date:
I talked to Tom on the phone today and I think we have a procedure
for doing backup/restore in a fairly foolproof way.

As outlined below, we need to record the start/stop and checkpoint WAL
file names and offsets, and somehow pass those on to restore.  I think
any system that requires users to link those values together is going
to cause confusion and be error-prone.

My idea is to do much of this automatically.  First, create a
server-side function called pitr_backup_start() which creates a file in
the /data directory which contains the WAL filename/offsets for
last checkpoint and start.  Then do the backup of the data directory. 
Then call pitr_backup_stop() which adds the stop filename/offsets to the
file, and archive that file in the same place as the WAL files.

To restore, you untar the backup of /data.  Then the recovery backend
reads the file created by pitr_backup_start() to find the name of the
backup parameter file.  It then recovers that file from the archive
location and uses the start/stop/checkpoint filename/offset information
to do the restore.

The advantage of this is that the tar backup contains everything needed
to find the proper parameter file for restore.  Ideally we could get all
the parameters into the tar backup, but that isn't possible because we
can't push the stop counters into the backup after the backup has
completed.

I recommend the pitr_backup_start() file be named for the current WAL
filename/offset, perhaps 000000000000032c.3da390.backup or something
like that.  The file would be a simple text file in
pg_xlog/archive_status:
    # start 2004-07-14 21:35:22.324579-04
    wal_checkpoint = 0000000000000319.021233
    wal_start = 000000000000032c.92a9cb
    ...added after backup completes...
    wal_stop = 000000000000034a.3db030
    # stop 2004-07-14 21:32:22.0923213-04

The timestamps are for documentation only.  These files give admins
looking in the archive directory information on backup times.
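
To make the proposed file format concrete, here is a minimal Python sketch of how a restore tool might read such a backup parameter file. It is only an illustration, not part of any patch: the names `parse_backup_file` and `backup_is_complete` are hypothetical, and the sketch assumes that the part after the dot in each value is a hexadecimal byte offset, which the proposal does not state explicitly.

```python
def parse_backup_file(text):
    """Return {parameter: (wal_file_name, offset)} for each 'key = file.offset' line."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and the '# start' / '# stop' timestamp comments,
        # which are described as documentation only.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a 'key = value' line
        wal_file, dot, offset = value.strip().partition(".")
        if not dot:
            continue  # value is not in 'file.offset' form
        # Assumption: the offset is written in hexadecimal.
        params[key.strip()] = (wal_file, int(offset, 16))
    return params


def backup_is_complete(params):
    """A backup is usable only once the stop position has been appended."""
    return "wal_stop" in params


# Values copied from the example file in the proposal.
SAMPLE = """\
# start 2004-07-14 21:35:22.324579-04
wal_checkpoint = 0000000000000319.021233
wal_start = 000000000000032c.92a9cb
wal_stop = 000000000000034a.3db030
# stop 2004-07-14 21:32:22.0923213-04
"""

params = parse_backup_file(SAMPLE)
```

A file that was written by the start call but never completed by the stop call would simply lack the `wal_stop` entry, which is how this sketch tells an unfinished backup from a finished one.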

(As an idea, there is no need for the user to specify a recovery mode. 
If the postmaster starts and sees the pitr_backup_start() file in /data,
it can go into recovery mode automatically.  If the archiver can't find
the file in the archive location, it can assume that it is just being
started from power failure mode.  However if it finds the file in the
archive location, it can assume it is to enter recovery mode.  There is
a race condition that a crash during copy of the file to the archive
location would be a problem.   The solution would be to create a special
flag file before copying the file to archive, and then archive it and
remove the flag file.  If the postmaster starts up and sees the
pitr_backup_start() file in /data and in the archive location, and it
doesn't see the flag file, it then knows it is doing a restore because
the flag file would never appear in a backup.  Anyway, this is just an
idea.)
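
One possible reading of the startup logic in the parenthetical idea can be written as a small decision function. This is a hedged sketch in Python, not PostgreSQL code: the function name, its three boolean inputs and the returned mode names are all hypothetical, and the treatment of the "flag file present" case as ordinary crash recovery is an interpretation of the text rather than something it spells out.

```python
def decide_startup_mode(in_data_dir, in_archive, flag_file_present):
    """Pick a startup mode from where the backup parameter file is found.

    in_data_dir       -- the pitr_backup_start() file exists in /data
    in_archive        -- the same file exists in the archive location
    flag_file_present -- the 'copy to archive in progress' flag file exists
    """
    if not in_data_dir:
        # No backup parameter file at all: ordinary startup.
        return "normal"
    if not in_archive:
        # The backup was started but never archived: the server most likely
        # went down mid-backup, so this is a restart after a failure.
        return "crash_recovery"
    if flag_file_present:
        # The flag file never appears in a backup, so seeing it means the
        # server crashed while copying the file to the archive.
        return "crash_recovery"
    # File in /data and in the archive, no flag file: this is a restore.
    return "archive_recovery"
```

The point of the flag file in this reading is that it separates the two cases in which the parameter file exists in both places: a restored backup and a crash in the middle of the archive copy.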

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:
> > > The recovery mechanism doesn't rely upon you knowing 1 or 3. The
> > > recovery reads pg_control (from the backup) and then attempts to
> > > de-archive the appropriate xlog segment file and then starts 
> > > rollforward
> > 
> > Unfortunately this only works if pg_control was the first file to be 
> > backed up (or by chance no checkpoint happened after backup start and 
> > pg_control backup)
> > 
> > Other db's have commands for:
> > start/end external backup
> > 
> 
> OK...this idea has come up a few times. Here's my take:
> 
> - OS and hardware facilities exist now to make instant copies of sets of
> files. Some of these are open source, others not. If you use these, you
> have no requirement for this functionality....but these alone are no
> replacement for archive recovery.... I accept that some people may not
> wish to go to the expense or effort to use those options, but in my mind
> these are the people that will not be using archive_mode anyway.
> 
> - all we would really need to do is to stop the bgwriter from doing
> anything during backup. pgcontrol is only updated at checkpoint. The
> current xlog is updated constantly, but this need not be copied because
> we are already archiving it as soon as its full. That leaves the
> bgwriter, which is now responsible for both lazy writing and
> checkpoints.
> So, put a switch into bgwriter to halt for a period, then turn it back
> on at the end. Could be a SIGHUP GUC...or...
> 
> ...and with my greatest respects....
> 
> - please could somebody else code that?... my time is limited
> 
> Best regards, Simon Riggs

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Thu, 2004-07-15 at 02:43, Mark Kirkwood wrote:
> I noticed that compiling with 5_1 patch applied fails due to
> XLOG_archive_dir being removed from xlog.c , but
> src/backend/commands/tablecmds.c still uses it.
>
> I did the following to tablecmds.c :
>
> 5408c5408
> <               extern char XLOG_archive_dir[];
> ---
>  >               extern char *XLogArchiveDest;
> 5410c5410
> <               use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
> ---
>  >               use_wal = XLogArchiveDest[0] && !rel->rd_istemp;
>
>

Yes, I discovered that myself.

The fix is included in pitr_v5_2.patch...

Your patch follows the right thinking and looks like it would have
worked...
- XLogArchiveMode carries the main bool value for mode on/off
- XLogArchiveDest might also be used, though best to use the mode

Thanks for looking through the code...

Best Regards, Simon Riggs


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote:
> I talked to Tom on the phone today and and I think we have a procedure
> for doing backup/restore in a fairly foolproof way.
> 
> As outlined below, we need to record the start/stop and checkpoint WAL
> file names and offsets, and somehow pass those on to restore.  I think
> any system that requires users to link those values together is going
> to cause confusion and be error-prone.
> 

Unfortunately, it seems clear that many of my posts have not been read,
nor has anyone here actually tried to use the patch. Everybody's views
on what constitutes error-prone might well differ then.

Speculation about additional requirements is just great, but please
don't assume that I have infinite resources to apply to these problems.
Documentation has still to be written.

For a long time now, I've been adding "one last feature" to what is
there, but we're still no nearer to anybody inspecting the patch or
committing it.

There is building consensus on other threads that PITR should not even
be included in the release (3 tentative votes). This latest request
feels more like the necessary excuse to take the decision to pull PITR.
I would much rather that we took the brave decision and pull it NOW,
rather than have me work like crazy to chase this release.

:(

Best Regards, Simon Riggs





Re: Point in Time Recovery

From
Mark Kirkwood
Date:
I tried what I thought was a straightforward scenario, and seem to have 
broken it :-(

Here is the little tale

1) initdb
2) set archive_mode and archive_dest in postgresql.conf
3) startup
4) create database called 'test'
5) connect to 'test' and type 'checkpoint'
6) backup PGDATA using 'tar -zcvf'
7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
8) shutdown and remove PGDATA
9)  recover using 'tar -zxvf'
10) copy recovery.conf into PGDATA
11) startup

This is what I get :

LOG:  database system was interrupted at 2004-07-15 21:24:04 NZST
LOG:  recovery command file found...
LOG:  restore_program = cp %s/%s %s
LOG:  recovery_target_inclusive = true
LOG:  recovery_debug_log = true
LOG:  starting archive recovery
LOG:  restored log file "0000000000000000" from archive
LOG:  checkpoint record is at 0/A48054
LOG:  redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 496; next OID: 25419
LOG:  database system was not properly shut down; automatic recovery in 
progress
LOG:  redo starts at 0/A48094
LOG:  restored log file "0000000000000001" from archive
LOG:  record with zero length at 0/1FFFFE0
LOG:  redo done at 0/1FFFF30
LOG:  restored log file "0000000000000001" from archive
LOG:  restored log file "0000000000000001" from archive
PANIC:  concurrent transaction log activity while database system is 
shutting down
LOG:  startup process (PID 13492) was terminated by signal 6
LOG:  aborting startup due to startup process failure

The concurrent access is a bit of a puzzle, as this is my home machine 
(i.e. I am *sure* no one else is connected!)


Mark

P.s : CVS HEAD from about 1 hour ago, PITR 5.2, FreeBSD 4.10 on x86
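
The segment names in the log above line up with the reported log positions if one assumes 16 MB segments and file names made of the log id followed by the segment number, each as 8 hexadecimal digits. The following Python sketch shows that mapping; the function name `segment_file_for` is hypothetical, and the segment size and naming scheme are assumptions inferred from the log output, not taken from the patch.

```python
XLOG_SEG_SIZE = 16 * 1024 * 1024  # assumed segment size: 16 MB


def segment_file_for(position):
    """Map a log position such as '0/A48054' to its segment file name."""
    xlogid_text, offset_text = position.split("/")
    xlogid = int(xlogid_text, 16)       # log id, before the slash
    offset = int(offset_text, 16)       # byte offset within that log id
    segment = offset // XLOG_SEG_SIZE   # which 16 MB segment holds the offset
    return "%08X%08X" % (xlogid, segment)


checkpoint_segment = segment_file_for("0/A48054")   # "0000000000000000"
last_record_segment = segment_file_for("0/1FFFFE0")  # "0000000000000001"
```

Under these assumptions the checkpoint record at 0/A48054 lives in the first restored file, and the zero-length record at 0/1FFFFE0 sits just before the end of the second one, which matches "exactly 2 logs worth" of data in step 7.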


Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> > Other db's have commands for:
> > start/end external backup

I see that the analogy to external backup was not good, since you are correct
that DBAs would expect that to stop all writes, so they can safely split
their mirror or some such. The time from start until end of an external
backup is usually expected to be only seconds. I actually think we
do not need this functionality, since in pg you can safely split the mirror any
time you like.

My comment was meant to give dba's a familiar tool. The effect of it
would only have been to create a separate backup of pg_control.
Might as well tell people to always backup pg_control first.

I originally thought you would require restore to specify an xlog id
from which recovery will start. You would search this log for the first
checkpoint record, create an appropriate pg_control, and start rollforward.

I still think this would be a nice feature, since then all that would be required
for restore is a system backup (that includes pg data files) and the xlogs.

Andreas


Re: Point in Time Recovery

From
HISADAMasaki
Date:
Dear Simon,

I've just tested pitr_v5_2.patch and got an error message
during archiving process as follows.

-- begin
 LOG:  archive command="cp /usr/local/pgsql/data/pg_xlog/0000000000000000 /tmp",return code=-1
-- end

The command called in system(3) works, but system(3) returns -1.
system(3) cannot get the right exit code from its child process
when SIGCHLD is set to SIG_IGN.

So I made the following change to pgarch_Main() in pgarch.c

-- line 236 ---
- pgsignal(SIGCHLD, SIG_IGN);

-- line 236 ---
+ pgsignal(SIGCHLD, SIG_DFL);

After that,
the error message doesn't come out and it seems to be working properly.
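
The same precaution can be illustrated outside PostgreSQL. The Python sketch below is not the pgarch.c code; `run_archive_command` is a hypothetical helper. It relies on `os.system`, which wraps system(3), and restores the default SIGCHLD disposition around the call so that the child's exit status can be collected. What happens with SIG_IGN is platform-dependent, so the sketch only demonstrates the SIG_DFL case.

```python
import os
import signal


def run_archive_command(command):
    """Run a shell command and return its exit code, or None if unavailable."""
    # Make sure SIGCHLD has its default disposition while the child runs;
    # with SIG_IGN some systems reap the child before its status can be read.
    previous = signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    try:
        status = os.system(command)
    finally:
        # Put back whatever handler was installed before, if it is known.
        if previous is not None:
            signal.signal(signal.SIGCHLD, previous)
    if status == -1 or not os.WIFEXITED(status):
        return None  # status could not be collected, or child was signalled
    return os.WEXITSTATUS(status)
```

A caller would treat only a return value of 0 as a successful archive copy, and anything else (including None) as a failure to be retried.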

Regards,
Hisada, Masaki

On Wed, 14 Jul 2004 00:13:37 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:

> PITR Patch v5_1 just posted has Point in Time Recovery working....
> 
> Still some rough edges....but we really need some testers now to give
> this a try and let me know what you think.
> 
> Klaus Naumann and Mark Wong are the only [non-committers] to have tried
> to run the code (and let me know about it), so please have a look at
> [PATCHES] and try it out.
> 
> Many thanks,
> 
> Simon Riggs
> 

-- 
HISADA, Masaki <hisada.masaki@lab.ntt.co.jp>




Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
Sorry for the stupid question, but how do I get this patch if I do not
receive the patches mails ?

The web interface html'ifies it, thus making it unusable.

Thanks
Andreas


Re: Point in Time Recovery

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote:
> > I talked to Tom on the phone today and and I think we have a procedure
> > for doing backup/restore in a fairly foolproof way.
> > 
> > As outlined below, we need to record the start/stop and checkpoint WAL
> > file names and offsets, and somehow pass those on to restore.  I think
> > any system that requires users to link those values together is going
> > to cause confusion and be error-prone.
> > 
> 
> Unfortunately, it seems clear that many of my posts have not been read,
> nor has anyone here actually tried to use the patch. Everybody's views
> on what constitutes error-prone might well differ then.
> 
> Speculation about additional requirements is just great, but please
> don't assume that I have infinite resources to apply to these problems.
> Documentation has still to be written.
> 
> For a long time now, I've been adding "one last feature" to what is
> there, but we're still no nearer to anybody inspecting the patch or
> committing it.

I totally understand your feeling this, and I would be feeling the exact
same way (but would probably have complained much earlier  :-)  ).
Anyway, the problem is that Tom and I are serializing application of the
major features in the pipeline.  We decided to focus on nested
transactions (NT) first (it is a larger patch), and that is why PITR has
gotten so little attention from us.   However, there is no sense that
you had anything to do with it being placed behind NT in the queue, and
therefore there is no feeling on our part that PITR is less important or
deserves less time than NT.  Certainly any system that made your patch
less likely to be applied would be unfair and something we will not do.

My explanation about the file format was an attempt to address the
method of passing the wal filename/offset to the recover process.  If
that isn't needed, I am sorry.

> There is building consensus on other threads that PITR should not even
> be included in the release (3 tentative votes). This latest request
> feels more like the necessary excuse to take the decision to pull PITR.
> I would much rather that we took the brave decision and pull it NOW,
> rather than have me work like crazy to chase this release.

Those three individuals are not representative of the group.  Sorry it
might seem that there is a lack of enthusiasm for PITR, but it isn't true
from our end.  You might have noticed that the patch queue has shrunk
dramatically, and now we are focused on NT and PITR almost exclusively.

We will get there --- it just seems dark at this time.

 


Re: Point in Time Recovery

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:
> > > The recovery mechanism doesn't rely upon you knowing 1 or 3. The
> > > recovery reads pg_control (from the backup) and then attempts to
> > > de-archive the appropriate xlog segment file and then starts 
> > > rollforward
> > 
> > Unfortunately this only works if pg_control was the first file to be 
> > backed up (or by chance no checkpoint happened after backup start and 
> > pg_control backup)
> > 
> > Other db's have commands for:
> > start/end external backup
> > 
> 
> OK...this idea has come up a few times. Here's my take:
> 
> - OS and hardware facilities exist now to make instant copies of sets of
> files. Some of these are open source, others not. If you use these, you
> have no requirement for this functionality....but these alone are no
> replacement for archive recovery.... I accept that some people may not
> wish to go to the expense or effort to use those options, but in my mind
> these are the people that will not be using archive_mode anyway.
> 
> - all we would really need to do is to stop the bgwriter from doing
> anything during backup. pgcontrol is only updated at checkpoint. The
> current xlog is updated constantly, but this need not be copied because
> we are already archiving it as soon as its full. That leaves the
> bgwriter, which is now responsible for both lazy writing and
> checkpoints.
> So, put a switch into bgwriter to halt for a period, then turn it back
> on at the end. Could be a SIGHUP GUC...or...

I don't think we can turn off all file system writes during a backup. 
Imagine writing to a tape.  Preventing file system writes would make the
system useless.

> - please could somebody else code that?... my time is limited

Yes, I think someone else could code this.

 


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:
> I tried what I thought was a straightforward scenario, and seem to have 
> broken it :-(
> 
> Here is the little tale
> 
> 1) initdb
> 2) set archive_mode and archive_dest in postgresql.conf
> 3) startup
> 4) create database called 'test'
> 5) connect to 'test' and type 'checkpoint'
> 6) backup PGDATA using 'tar -zcvf'
> 7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
> 8) shutdown and remove PGDATA
> 9)  recover using 'tar -zxvf'
> 10) copy recovery.conf into PGDATA
> 11) startup
> 
> This is what I get :
> 
> LOG:  database system was interrupted at 2004-07-15 21:24:04 NZST
> LOG:  recovery command file found...
> LOG:  restore_program = cp %s/%s %s
> LOG:  recovery_target_inclusive = true
> LOG:  recovery_debug_log = true
> LOG:  starting archive recovery
> LOG:  restored log file "0000000000000000" from archive
> LOG:  checkpoint record is at 0/A48054
> LOG:  redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
> LOG:  next transaction ID: 496; next OID: 25419
> LOG:  database system was not properly shut down; automatic recovery in 
> progress
> LOG:  redo starts at 0/A48094
> LOG:  restored log file "0000000000000001" from archive
> LOG:  record with zero length at 0/1FFFFE0
> LOG:  redo done at 0/1FFFF30
> LOG:  restored log file "0000000000000001" from archive
> LOG:  restored log file "0000000000000001" from archive
> PANIC:  concurrent transaction log activity while database system is 
> shutting down
> LOG:  startup process (PID 13492) was terminated by signal 6
> LOG:  aborting startup due to startup process failure
> 
> The concurrent access is a bit of a puzzle, as this is my home machine 
> (i.e. I am *sure* noone else is connected!)

First, thanks for sticking with it to test this.

I've not received such a message myself - this is interesting.

Is it possible to copy that directory to one side and re-run the test?
Add another parameter in postgresql.conf called "archive_debug = true"
Does it happen identically the second time?

What time difference was there between steps 5 and 6? I think I can hear
Andreas saying "told you".... I'm thinking the backup might be somehow
corrupted because the checkpoint occurred during the backup. Hmmm...

Could you also post me the recovery.log file? (don't post to list)

Thanks, Simon Riggs



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:

> We will get there --- it just seems dark at this time.

Thanks for that. My comments were heartfelt, but not useful right now. 

I'm badly overdrawn already on my time budget, though that is my concern
alone. There is more to do than I have time for. Pragmatically, if we
aren't going to get there then I need to stop now, so I can progress
other outstanding issues. All help is appreciated.

I'm aiming for the minimum feature set - which means we do need to take
care over whether that set is insufficient and also to pull any part
that doesn't stand up to close scrutiny over the next few days.

Overall, my primary goal is increased robustness and availability for
PostgreSQL...and then to have a rest!

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Devrim GUNDUZ
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Simon,

On Thu, 15 Jul 2004, Simon Riggs wrote:

> > We will get there --- it just seems dark at this time.
> 
> Thanks for that. My comments were heartfelt, but not useful right now. 
> 
> I'm badly overdrawn already on my time budget, though that is my concern
> alone. There is more to do than I have time for. Pragmatically, if we
> aren't going to get there then I need to stop now, so I can progress
> other outstanding issues. All help is appreciated.

Personally, as a PostgreSQL Advocate, I believe that PITR is one of the
most important missing features in PostgreSQL. I've been keeping 'all' of
your e-mails about PITR and I'm really excited about that feature.

Please do not stop working on PITR. I'm pretty sure that most of the
'silent' people on the lists are waiting for PITR for an {Oracle, DB2, ...}-killer
database. In my country (Turkey), too many people spend a lot of
money for  proprietary databases, just for some missing features in
PostgreSQL. If you finish your work on PITR (and the other guys on NT, the
Win32 port, etc.), then we'll be able to concentrate more on PostgreSQL
advocacy, so that PostgreSQL will be used more and more. (Oh, we also need
native clustering...)

Maybe I should send this e-mail offlist, but I wanted everyone to learn my 
feelings.

Regards and best wishes,
- -- 
Devrim GUNDUZ           
devrim~gunduz.org                devrim.gunduz~linux.org.tr         http://www.tdmsoft.com
http://www.gunduz.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQFA9wKxtl86P3SPfQ4RAms+AJ95RfFi0lVwMD7u7zQ/DzLFEBC8MACgvRzd
HRqAjVqI3hekwImPpqelj9U=
=l445
-----END PGP SIGNATURE-----



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:
> Dear Simon,
> 
> I've just tested pitr_v5_2.patch and got an error message
> during archiving process as follows.
> 
> -- begin
>  LOG:  archive command="cp /usr/local/pgsql/data/pg_xlog/0000000000000000 /tmp",return code=-1
> -- end
> 
> The command called in system(3) works, but it returns -1.
> system(3) can not get right exit code from its child process,
> when SIGCHLD is set as SIG_IGN.
> 
> So I did following change to pgarch_Main() in pgarch.c
> 
> -- line 236 ---
> - pgsignal(SIGCHLD, SIG_IGN);
> 
> -- line 236 ---
> + pgsignal(SIGCHLD, SIG_DFL);
> 

Thank you for testing the patch. Very much appreciated.

I was aware of the potential issues of incorrect return codes, and that
exact part of the code is the part I'm least happy with. 

I'm not sure I understand why it returned -1, though I'll take your
recommendation. I've not witnessed such an issue. What system are you
running, or is it a default shell issue?

Do people think that the change is appropriate for all systems, or just
the one you're using?

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Simon Riggs
Date:
> On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote: 

Thanks for the vote of confidence, on or off list.

>  too many people spend a lot of 
> money for  proprietary databases, just for some missing features in 
> PostgreSQL

Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
already, even though it doesn't have it, then you are quite likely to be
able to keep going without it.

Most commercial users won't touch anything that doesn't have PITR.

>  (Oh, we also need native 
> clustering...)

Next week, OK? :)


Best Regards, Simon





Re: Point in Time Recovery

From
Alvaro Herrera
Date:
On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:
> On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:

> > -- line 236 ---
> > - pgsignal(SIGCHLD, SIG_IGN);
> > 
> > -- line 236 ---
> > + pgsignal(SIGCHLD, SIG_DFL);
> 
> I'm not sure I understand why its returned -1, though I'll take you
> recommendation. I've not witnessed such an issue. What system are you
> running, or is it a default shell issue?
> 
> Do people think that the change is appropriate for all systems, or just
> the one you're using?

My manpage for signal(2) says that you shouldn't assign SIG_IGN to
SIGCHLD, according to POSIX.  It goes on to say that BSD and SysV
behaviors differ on this aspect.

(This is on linux BTW)

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)



Re: Point in Time Recovery

From
Bruce Momjian
Date:
Simon Riggs wrote:
> > On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote: 
> 
> Thanks for the vote of confidence, on or off list.
> 
> >  too many people spend a lot of 
> > money for  proprietary databases, just for some missing features in 
> > PostgreSQL
> 
> Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
> already, even though it doesn't have it, then you are quite likely to be
> able to keep going without it.
> 
> Most commercial users won't touch anything that doesn't have PITR.

Agreed. I am surprised at how few requests we have gotten for PITR.  I
assume people are either using replication or not considering us.

 


Re: Point in Time Recovery

From
Mark Kirkwood
Date:

Simon Riggs wrote:

>
>First, thanks for sticking with it to test this.
>
>I've not received such a message myself - this is interesting.
>
>Is it possible to copy that directory to one side and re-run the test?
>Add another parameter in postgresql.conf called "archive_debug = true"
>Does it happen identically the second time?
>
>  
>
Yes, identical results - I re-initdb'ed and ran the process again, 
rather than reuse the files.

>What time difference was there between steps 5 and 6? I think I can here
>Andreas saying "told you".... I'm thinking the backup might be somehow
>corrupted because the checkpoint occurred during the backup. Hmmm...
>
>  
>
I was wondering about this, so I left a bit more time in between, and
forced a sync as well for good measure.

5) $ psql -d test -c "checkpoint"; sleep 30;sync;sleep 30
6) $ tar -zcvf /data1/dump/pgdata-7.5.tar.gz *

>
>Thanks, Simon Riggs
>
>  
>


Re: Point in Time Recovery

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:
> 
> > We will get there --- it just seems dark at this time.
> 
> Thanks for that. My comments were heartfelt, but not useful right now. 
> 
> I'm badly overdrawn already on my time budget, though that is my concern
> alone. There is more to do than I have time for. Pragmatically, if we
> aren't going to get there then I need to stop now, so I can progress
> other outstanding issues. All help is appreciated.
> 
> I'm aiming for the minimum feature set - which means we do need to take
> care over whether that set is insufficient and also to pull any part
> that doesn't stand up to close scrutiny over the next few days.

As you can see, we are still chewing on NT.  What PITR features are
missing?  I assume because we can't stop file system writes during
backup that we will need a backup parameter file like I described.  Is
there anything else that PITR needs?

 


Re: Point in Time Recovery

From
"Glen Parker"
Date:
> Simon Riggs wrote:
> > > On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:
> >
> > Thanks for the vote of confidence, on or off list.
> >
> > >  too many people spend a lot of
> > > money for  proprietary databases, just for some missing features in
> > > PostgreSQL
> >
> > Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
> > already, even though it doesn't have it, then you are quite likely to be
> > able to keep going without it.
> >
> > Most commercial users won't touch anything that doesn't have PITR.
>
> Agreed. I am surprised at how few requests we have gotten for PITR.  I
> assume people are either using replication or not considering us.

Don't forget that there are (must be) lots of us that know it's coming and
are just waiting until it's available.  I haven't requested per se, but
believe me, I'm waiting for it :-)



Re: Point in Time Recovery

From
Mark Kirkwood
Date:
Couldn't agree more. Maybe we should have made more noise :-)

Glen Parker wrote:

>>Simon Riggs wrote:
>>    
>>
>>>>On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:
>>>>        
>>>>
>>>Thanks for the vote of confidence, on or off list.
>>>
>>>      
>>>
>>>> too many people spend a lot of
>>>>money for  proprietary databases, just for some missing features in
>>>>PostgreSQL
>>>>        
>>>>
>>>Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
>>>already, even though it doesn't have it, then you are quite likely to be
>>>able to keep going without it.
>>>
>>>Most commercial users won't touch anything that doesn't have PITR.
>>>      
>>>
>>Agreed. I am surprised at how few requests we have gotten for PITR.  I
>>assume people are either using replication or not considering us.
>>    
>>
>
>Don't forget that there are (must be) lots of us that know it's coming and
>are just waiting until it's available.  I haven't requested per se, but
>believe me, I'm waiting for it :-)
>
>
>  
>


Re: Point in Time Recovery

From
Mark Kirkwood
Date:
Simon Riggs wrote:

>
>So far:
>
>I've tried to re-create the problem as exactly as I can, but it works
>for me. 
>
>This is clearly an important case to chase down.
>
>I assume that this is the very first time you tried recovery? Second and
>subsequent recoveries using the same set have a potential loophole,
>which we have been discussing.
>
>Right now, I'm thinking that the "exactly 2 logs worth" of data has
>brought you very close to the end of the log file (FFFFE0) ending with 1
>and the shutdown checkpoint that is then subsequently written is
>failing.
>
>Can you repeat this your end?
>
>  
>
It is repeatable at my end. It is actually fairly easy to recreate the 
example I am using: download 

http://sourceforge.net/projects/benchw

and generate the dataset for Pg - but trim the large "fact0.dat" dump 
file using head -100000.
Thus step 7 consists of creating the 4 tables and COPYing in the data 
for them.

>The nearest I can get to the exact record pointers you show are to start
>recovery at A4807C and to end at FFFF88.
>
>Overall, PITR changes the recovery process very little, if at all. The
>main areas of effect are to do with sequencing of actions and matching
>up the right logs with the right backup. I'm not looking for bugs in the
>code but in subtle side-effects and "edge" cases. Everything you can
>tell me will help me greatly in chasing that down. 
>
>  
>
I agree - I will try this sort of example again, but will change the 
number of rows I am COPYing (currently 100000) and see if that helps.

>Best Regards, Simon Riggs
>
>  
>

By way of contrast, using the *same* procedure (1-11), but generating 2 
logs worth of INSERTS/UPDATES using 10 concurrent processes *works fine* - 
e.g :

LOG:  database system was interrupted at 2004-07-16 11:17:52 NZST
LOG:  recovery command file found...
LOG:  restore_program = cp %s/%s %s
LOG:  recovery_target_inclusive = true
LOG:  recovery_debug_log = true
LOG:  starting archive recovery
LOG:  restored log file "0000000000000000" from archive
LOG:  checkpoint record is at 0/A4803C
LOG:  redo record is at 0/A4803C; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 496; next OID: 25419
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/A4807C
postmaster starting
[postgres@shroudeater 7.5]$ LOG:  restored log file "0000000000000001" from archive
cp: cannot stat `/data1/pgdata/7.5-archive/0000000000000002': No such file or directory
LOG:  could not restore "0000000000000002" from archive
LOG:  could not open file "/data1/pgdata/7.5/pg_xlog/0000000000000002" (log file 0, segment 2): No such file or directory
LOG:  redo done at 0/1FFFFD4
LOG:  archive recovery complete
LOG:  database system is ready
LOG:  archiver started




Re: Point in Time Recovery

From
Simon Riggs
Date:
On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
> On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:
> > On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:
> 
> > > -- line 236 ---
> > > - pgsignal(SIGCHLD, SIG_IGN);
> > > 
> > > -- line 236 ---
> > > + pgsignal(SIGCHLD, SIG_DFL);
> > 
> > I'm not sure I understand why it's returned -1, though I'll take your
> > recommendation. I've not witnessed such an issue. What system are you
> > running, or is it a default shell issue?
> > 
> > Do people think that the change is appropriate for all systems, or just
> > the one you're using?
> 
> My manpage for signal(2) says that you shouldn't assign SIG_IGN to
> SIGCHLD, according to POSIX.  It goes on to say that BSD and SysV
> behaviors differ on this aspect.
> 

POSIX rules OK!

So - I should be setting this to SIG_DFL and that's good for everyone?

OK. Will do.

Best regards, Simon Riggs



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Fri, 2004-07-16 at 00:46, Mark Kirkwood wrote:

> 
> By way of contrast, using the *same* procedure (1-11), but generating 2 
> logs worth of INSERTS/UPDATES using 10 concurrent processes *works fine* - 
> e.g :
> 

Great...at least we have shown that something works (or can work) and
have begun to isolate the problem whatever it is.

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
>> My manpage for signal(2) says that you shouldn't assign SIG_IGN to
>> SIGCHLD, according to POSIX.

> So - I should be setting this to SIG_DFL and that's good for everyone?

Yeah, we learned the same lesson in the backend not too many releases
back.  SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms
but not others.

You could do worse than to look at the existing handling of signals in
the postmaster and its children; that code has been beat on pretty
heavily ...
        regards, tom lane


Re: Point in Time Recovery

From
Christopher Kings-Lynne
Date:
> Thanks for that. My comments were heartfelt, but not useful right now. 

Hi Simon,  I'm sorry if I gave the impression that I thought your work 
wasn't worthwhile, it is :(

> I'm badly overdrawn already on my time budget, though that is my concern
> alone. There is more to do than I have time for. Pragmatically, if we
> aren't going to get there then I need to stop now, so I can progress
> other outstanding issues. All help is appreciated.

I've got your patch applied (but having some compilation problem), but 
I'm really not sure what to test really.  I don't really understand the 
whole thing fully :/

> I'm aiming for the minimum feature set - which means we do need to take
> care over whether that set is insufficient and also to pull any part
> that doesn't stand up to close scrutiny over the next few days.
> 
> Overall, my primary goal is increased robustness and availability for
> PostgreSQL...and then to have a rest!

Definitely!

Chris



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Fri, 2004-07-16 at 04:49, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
> >> My manpage for signal(2) says that you shouldn't assign SIG_IGN to
> >> SIGCHLD, according to POSIX.
> 
> > So - I should be setting this to SIG_DFL and that's good for everyone?
> 
> Yeah, we learned the same lesson in the backend not too many releases
> back.  SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms
> but not others.

Many thanks all, Best Regards Simon Riggs






Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> > I'm aiming for the minimum feature set - which means we do need to take
> > care over whether that set is insufficient and also to pull any part
> > that doesn't stand up to close scrutiny over the next few days.
>
> As you can see, we are still chewing on NT.  What PITR features are
> missing?  I assume because we can't stop file system writes during
> backup that we will need a backup parameter file like I described.  Is
> there anything else that PITR needs?

No, we don't need to stop writes! Not even to split a mirror;
other DBs need that to be able to restore, but we don't.
We only need to tell people to backup pg_control first. The rest was only
intended to enforce
1. that pg_control is the first file backed up
2. the dba uses a large enough PIT (or xid) for restore

I think the idea with an extra file with WAL start position was overly
complicated, since all you need is pg_control (+ WAL end position to enforce 2.).

If we don't want to tell people to backup pg_control first, imho the next
best plan would be to add a "WAL start" input (e.g. xlog name) parameter
to recovery.conf, that "fixes" pg_control.

Andreas


Re: Point in Time Recovery

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> We only need to tell people to backup pg_control first. The rest was only 
> intended to enforce 
> 1. that pg_control is the first file backed up
> 2. the dba uses a large enough PIT (or xid) for restore

Right, but I think Bruce's point is that it is far too easy to get those
things wrong; especially point 2 for which a straight tar dump will
simply not contain the information you need to determine what is a safe
stopping point.

I agree with Bruce that we should have some mechanism that doesn't rely
on the DBA to get this right.  Exactly what the mechanism should be is
certainly open for discussion...
        regards, tom lane


Re: Point in Time Recovery

From
Bruce Momjian
Date:
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> > We only need to tell people to backup pg_control first. The rest was only 
> > intended to enforce 
> > 1. that pg_control is the first file backed up
> > 2. the dba uses a large enough PIT (or xid) for restore
> 
> Right, but I think Bruce's point is that it is far too easy to get those
> things wrong; especially point 2 for which a straight tar dump will
> simply not contain the information you need to determine what is a safe
> stopping point.
> 
> I agree with Bruce that we should have some mechanism that doesn't rely
> on the DBA to get this right.  Exactly what the mechanism should be is
> certainly open for discussion...

Right.  I am wondering what process people would use to backup
pg_control first?  If they do:
tar -cf $TAPE ./global/pg_control .

They will get two copies of pg_control, the early one, and one as part
of the directory scan.  On restore, they would restore the early one,
but the directory scan would overwrite it.  I suppose they could do:
cp global/pg_control global/pg_control.backup
tar -cf $TAPE .

then on restore once all the files are restored move the
pg_control.backup to its original name.  That gives us the checkpoint
wal/offset but how do we get the start/stop information?  Is that not
required?  Maybe we should just have start/stop server-side functions
that create a file in the archive directory describing the start/stop
counters and time and the admin would then have to find those values.
Why are the start/stop wal/offset values needed anyway?  I know why we
need the checkpoint value.  Do we need a checkpoint after the archiving
starts but before the backup begins?
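
The two-copies problem described above can be reproduced on a scratch
directory without touching a real cluster. This is only a sketch: the
directory layout and file contents are stand-ins, and the variable names
(demo, copies) are made up for illustration.

```shell
# Scratch stand-in for a data directory; no real cluster is involved.
demo=$(mktemp -d)
mkdir -p "$demo/data/global" "$demo/data/base"
echo "control" > "$demo/data/global/pg_control"
echo "heap"    > "$demo/data/base/1"

# Name pg_control first, then the whole directory, as in the command above.
(cd "$demo/data" && tar -cf "$demo/backup.tar" ./global/pg_control .)

# Count how many archive members are pg_control.
copies=$(tar -tf "$demo/backup.tar" | grep -c 'global/pg_control$')
echo "pg_control members in archive: $copies"
```

With GNU tar the count is 2: the file named explicitly and the one picked
up again by the directory scan, which is the copy that wins on restore.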

Also, when you are in recovery mode, how do you get out of recovery
mode, meaning if you have a power failure, how do you prevent the system
from doing another recovery?  Do you remove the recovery.conf file?

 


Re: Point in Time Recovery

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Also, when you are in recovery mode, how do you get out of recovery
> mode, meaning if you have a power failure, how do you prevent the system
> from doing another recovery?  Do you remove the recovery.conf file?

I do not care for the idea of a recovery.conf file at all, and have been
intending to look to see what we'd need to do to not have one.  I find
it hard to believe that there is anything one would put in it that is
really persistent state.  The above concern shows why it shouldn't be
treated as a persistent configuration file.
        regards, tom lane


Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> then on restore once all the files are restored move the
> pg_control.backup to its original name.  That gives us the checkpoint
> wal/offset but how do we get the start/stop information.  Is that not
> required?

The checkpoint wal/offset is in pg_control, that is sufficient start
information. The stop info is only necessary as a safeguard.

> Do we need a checkpoint after the archiving
> starts but before the backup begins?

No.

> Also, when you are in recovery mode, how do you get out of recovery
> mode, meaning if you have a power failure, how do you prevent the system
> from doing another recovery?  Do you remove the recovery.conf file?

pg_control could be updated during rollforward (only if that actually
does a checkpoint). So if pg_control is also the recovery start info, then
we can continue from there if we have a power failure.
For the first release it would imho also be ok to simply start over if
you lose power.

I think the filename 'recovery.conf' is misleading, since it is not a
static configuration file, but a command file for one recovery.
How about 'recovery.command' then 'recovery.inprogress', and on recovery
completion it should be renamed to 'recovery.done'

Andreas


Re: Point in Time Recovery

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>> Do we need a checkpoint after the archiving
>> starts but before the backup begins?

> No.

Actually yes.  You have to start at a checkpoint record when replaying
the log, so if no checkpoint occurred between starting to archive WAL
and starting the tar backup, you have a useless backup.

It would be reasonable to issue a CHECKPOINT just before starting the
backup as part of the standard operating procedure for taking PITR
dumps.  We need not require this, but it would help to avoid this
particular sort of mistake; and of course it might save a little bit of
replay effort if the backup is ever used.

As far as the business about copying pg_control first goes: there is
another way to think about it, which is to copy pg_control to another
place that will be included in your backup.  For example the standard
backup procedure could be

1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.

2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump

3. tar cf /dev/mt $PGDATA

4. do something to record ending WAL position

If we standardized on this way, then the tar archive would automatically
contain the pre-backup checkpoint position in ./pg_control.dump, and
there is no need for any special assumptions about the order in which
tar processes things.
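
A minimal sketch of the four steps above as a shell function. The function
name pitr_backup and the RUN_CHECKPOINT switch are hypothetical, chosen
only for illustration; step 4 is left as a comment because the mechanism
for recording the ending WAL position is still open in this thread.

```shell
# pitr_backup DATADIR ARCHIVE: hypothetical helper, not part of PostgreSQL.
pitr_backup() {
    datadir=$1
    archive=$2

    # Step 1 (somewhat optional): force a checkpoint. This needs a running
    # server, so it is only attempted when RUN_CHECKPOINT=1 is set.
    if [ "${RUN_CHECKPOINT:-0}" = "1" ]; then
        psql -d "${PGDATABASE:-postgres}" -c "CHECKPOINT" || return 1
    fi

    # Step 2: snapshot pg_control before the directory scan starts.
    cp "$datadir/global/pg_control" "$datadir/pg_control.dump" || return 1

    # Step 3: archive the whole data directory, snapshot included.
    # The archive must live outside DATADIR.
    tar -cf "$archive" -C "$datadir" .

    # Step 4: record the ending WAL position. No mechanism has been agreed
    # yet, so nothing is done here.
}

# Demonstration against a scratch directory rather than a real cluster.
scratch=$(mktemp -d)
mkdir -p "$scratch/data/global"
echo "control" > "$scratch/data/global/pg_control"
pitr_backup "$scratch/data" "$scratch/backup.tar"
tar -tf "$scratch/backup.tar" | grep pg_control
```

Because pg_control.dump has a different name from pg_control, the order in
which tar visits files no longer matters.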

However, once you decide to do things like that, there is no reason why
the copied file has to be an exact image of pg_control.  I claim it
would be more useful if the copied file were plain text so that you
could just "cat" it to find out the starting WAL position; that would
let you determine without any special tools what range of WAL archive
files you are going to need to bring back from your archives.

This is pretty much the same chain of reasoning that Bruce and I went
through yesterday to come up with the idea of putting a label file
inside the tar backups.  We concluded that it'd be worth putting
both the backup starting time and the checkpoint WAL position into
the label file --- the starting time isn't needed for restore but
might be really helpful as documentation, if you needed to verify
which dump file was which.
        regards, tom lane


Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> >> Do we need a checkpoint after the archiving
> >> starts but before the backup begins?
>
> > No.
>
> Actually yes.

Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
So yes, you need one checkpoint after archiving starts. Imho turning on xlog
archiving should issue such a checkpoint just to be sure.

Andreas


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote:
> > >> Do we need a checkpoint after the archiving
> > >> starts but before the backup begins?
> > 
> > > No.
> > 
> > Actually yes.
> 
> Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
> So yes, you need one checkpoint after archiving starts. Imho turning on xlog
> archiving should issue such a checkpoint just to be sure. 
> 

By agreement, archive_mode can only be turned on at postmaster startup,
which means you always have a checkpoint - either because you shut it
down cleanly, or you didn't and it recovers, then writes one.

There is always something to start the rollforward. 

So, non-issue.

Best regards, Simon Riggs



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Fri, 2004-07-16 at 15:27, Bruce Momjian wrote:

> Also, when you are in recovery mode, how do you get out of recovery
> mode, meaning if you have a power failure, how do you prevent the system
> from doing another recovery?  Do you remove the recovery.conf file?

That was the whole point of the recovery.conf file:
it prevents you from re-entering recovery accidentally, as would occur
if the parameters were set in the normal .conf file.

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Fri, 2004-07-16 at 16:25, Zeugswetter Andreas SB SD wrote:

> I think the filename 'recovery.conf' is misleading, since it is not a 
> static configuration file, but a command file for one recovery.
> How about 'recovery.command' then 'recovery.inprogress', and on recovery 
> completion it should be renamed to 'recovery.done'

You understand this and your assessment is correct.

recovery.conf isn't an attempt to persist information. It is a means of
delivering a set of parameters to the recovery process, as well as
signalling overall that archive recovery is required (because the system
default remains the same, which is to recover from the logs it has
locally available to it).

I originally offered a design which used a command, similar to
DB2/Oracle...that was overruled as too complex. The (whatever you call
it) file is just a very simple way of specifying what's required.

There is more to be said here...clearly some explanations are required
and I will provide those later...

Best regards, Simon Riggs



Re: Point in Time Recovery

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote:
> > > >> Do we need a checkpoint after the archiving
> > > >> starts but before the backup begins?
> > > 
> > > > No.
> > > 
> > > Actually yes.
> > 
> > Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
> > So yes, you need one checkpoint after archiving starts. Imho turning on xlog
> > archiving should issue such a checkpoint just to be sure. 
> > 
> 
> By agreement, archive_mode can only be turned on at postmaster startup,
> which means you always have a checkpoint - either because you shut it
> down cleanly, or you didn't and it recovers, then writes one.
> 
> There is always something to start the rollforward. 
> 
> So, non-issue.

I don't think so.  I can imagine many cases where you want to do a
nightly tar backup without turning archiving on/off or restarting the
postmaster.  In those cases, a manual checkpoint would have to be issued
before the backup begins.

Imagine a system that is up for a month, and they don't have enough
archive space to keep a month's worth of WAL files.  They would probably
do nightly or weekend tar backups, and then discard the WAL archives.

What procedure would they use?  I assume they would copy all their old
WAL files to a save directory, issue a checkpoint, do a tar backup, then
they can delete the saved WAL files.  Is that correct?

 


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Fri, 2004-07-16 at 19:30, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote:
> > > > >> Do we need a checkpoint after the archiving
> > > > >> starts but before the backup begins?
> > > > 
> > > > > No.
> > > > 
> > > > Actually yes.
> > > 
> > > Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
> > > So yes, you need one checkpoint after archiving starts. Imho turning on xlog
> > > archiving should issue such a checkpoint just to be sure. 
> > > 
> > 
> > By agreement, archive_mode can only be turned on at postmaster startup,
> > which means you always have a checkpoint - either because you shut it
> > down cleanly, or you didn't and it recovers, then writes one.
> > 
> > There is always something to start the rollforward. 
> > 
> > So, non-issue.
> 

I was discussing the claim that there might not be a checkpoint to begin
the rollforward from. There always is: if you are in archive_mode=true
then you will always have a checkpoint that can be used for recovery. It
may be "a long way in the past", if there has been no write activity,
but the rollforward will be very, very quick, since there will be no log
records.

> I don't think so.  I can imagine many cases where you want to do a
> nightly tar backup without turning archiving on/off or restarting the
> postmaster.  

This is a misunderstanding. I strongly agree with what you say: the
whole system has been designed to avoid any benefit from turning on/off
archiving and there is no requirement to restart postmaster to take
backups.

> In those cases, a manual checkpoint would have to be issued
> before the backup begins.

A manual checkpoint doesn't HAVE TO be issued. Presumably most systems
will be running checkpoint every few minutes. Wherever the last one was
is where the rollforward would start from.

But you can if that's the way you want to do things, just wait long
enough for the checkpoint to have completed, otherwise your objective of
reducing rollforward time will not be met.

(please note my earlier reported rollback performance of approximately
x10 rate of recovery v elapsed time - will require testing on your own
systems).

> Imagine a system that is up for a month, and they don't have enough
> archive space to keep a month's worth of WAL files.  They would probably
> do nightly or weekend tar backups, and then discard the WAL archives.
> 

Yes, that would be normal practice. I would recommend keeping at least
the last 3 full backups and all of the WAL logs to cover that period.

> What procedure would they use?  I assume they would copy all their old
> WAL files to a save directory, issue a checkpoint, do a tar backup, then
> they can delete the saved WAL files.  Is that correct?

PITR is designed to interface with a wide range of systems, through the
extensible archive/recovery program interface. We shouldn't focus on
just tar backups - if you do, then the whole thing seems less
feature-rich. The current design allows interfacing with tape, remote
backup, internet backup providers, automated standby servers and the
dozen major storage/archive vendors' solutions.

Writing a procedure to backup, assign filenames, keep track of stuff
isn't too difficult if you're a competent DBA with a mild knowledge of
shell or perl scripting. But if data is important, people will want to
invest the time and trouble to adopt one of the open source or licenced
vendors that provide solutions in this area.

Systems management is a discipline and procedures should be in place for
everything. I fully agree with the "automate everything" dictum, but
just don't want to constrain people too much to a particular way of
doing things.

-o-o-

Overall, for first release, I think the complexity of this design is
acceptable. PITR is similar to Oracle7 Backup/Recovery, and easily
recognisable to any DBA with current experience of SQL Server,
DB2 (MVS, UDB) or Teradata systems. [I can't comment much on Ingres,
Informix, Sybase etc]

My main areas of concern are:
- the formal correctness of the recovery process
As a result of this concern, PITR makes ZERO alterations to the recovery
code itself. The trick is to feed it the right xlog files and to stop,
if required, at the right place and allow normal work to resume.

- the robustness and quality of my implementation
This requires quality checking of the code and full beta testing

-o-o-

We've raised a couple of valid points on the lists in the last few days:
- it's probably a desirable feature (but not essential) to implement a
write suspend feature on the bgwriter, if nothing else it will be a
confidence building feature...as said previously, for many people, this
will not be required, but people will no doubt keep asking
- there is a small window of risk around the possibility that a recovery
target might be set by the user that doesn't rollforward all the way
past the end of the backup. That is real, but in general, people aren't
likely to be performing archive recovery within minutes of a backup
being taken - and if they are, they can always start from the previous
one to that. This is a gap we should close, but it's just something to be
aware of...just like pg_dump not sorting things in the correct order in
its first release.

Not for now, but soon, I would propose:
- a command to suspend/resume bgwriter to allow backups. 
- use the suspend/resume feature to write a log record "backup end
marker" which shows when this took place. Ensure that any rollforward
goes through AT LEAST ONE "backup end marker" on its way. (If a Point in
Time is specified too early, we can check this immediately against the
checkpoint record. We can then refuse to stop at any point in time
earlier than the backup end marker.)
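
The refusal rule in the last item can be sketched as a simple comparison.
This assumes both the recovery target and the backup end marker are
available as seconds since the epoch; the function name
check_recovery_target is hypothetical, and a real check would read the
marker from the log rather than take it as an argument.

```shell
# check_recovery_target TARGET BACKUP_END: both in seconds since the epoch.
# Hypothetical helper; it only models the proposed rule.
check_recovery_target() {
    if [ "$1" -lt "$2" ]; then
        echo "recovery target precedes end of backup; refusing" >&2
        return 1
    fi
}

backup_end=1089936000   # illustrative value only

if check_recovery_target 1089935000 "$backup_end" 2>/dev/null; then
    early=accepted
else
    early=refused
fi

if check_recovery_target 1089940000 "$backup_end"; then
    late=accepted
else
    late=refused
fi

echo "early target: $early, late target: $late"
# prints "early target: refused, late target: accepted"
```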

I've written a todo list and will post this again soon.

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Simon Riggs
Date:
On Fri, 2004-07-16 at 16:47, Tom Lane wrote:
> As far as the business about copying pg_control first goes: there is
> another way to think about it, which is to copy pg_control to another
> place that will be included in your backup.  For example the standard
> backup procedure could be
> 
> 1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
> 
> 2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
> 
> 3. tar cf /dev/mt $PGDATA
> 
> 4. do something to record ending WAL position
> 
> If we standardized on this way, then the tar archive would automatically
> contain the pre-backup checkpoint position in ./pg_control.dump, and
> there is no need for any special assumptions about the order in which
> tar processes things.
> 

Sounds good. That would be familiar to Oracle DBAs doing BACKUP
CONTROLFILE. We can document that and offer it as a suggested procedure.

> However, once you decide to do things like that, there is no reason why
> the copied file has to be an exact image of pg_control.  I claim it
> would be more useful if the copied file were plain text so that you
> could just "cat" it to find out the starting WAL position; that would
> let you determine without any special tools what range of WAL archive
> files you are going to need to bring back from your archives.

I wouldn't be in favour of a manual mechanism. If you want an automated
mechanism, what's wrong with using the one that's already there? You can
use pg_controldata to read the controlfile, again what's wrong with that?

We agreed some time back that an off-line xlog file inspector would be
required to allow us to inspect the logs and make a decision about where
to end recovery. You'd still need that.

It's scary enough having to specify the end point, let alone having to
specify the starting point as well.

At your request, and with Bruce's idea, I designed and built the
recovery system so that you don't need to know what range of xlogs to
bring back. You just run it, it brings back the right files from archive
and does recovery with them, then cleans up - and it works without
running out of disk space on long recoveries.

I've built it now and it works...

> This is pretty much the same chain of reasoning that Bruce and I went
> through yesterday to come up with the idea of putting a label file
> inside the tar backups.  We concluded that it'd be worth putting
> both the backup starting time and the checkpoint WAL position into
> the label file --- the starting time isn't needed for restore but
> might be really helpful as documentation, if you needed to verify
> which dump file was which.

...if you are doing tar backups...what will you do if you're not using
that mechanism?

If you are: It's common practice to make up a backup filename from
elements such as systemname, databasename, date and time etc. That gives
you the start time, the file last mod date gives you the end time. 
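
As a sketch of that naming practice, a backup name can be assembled from
the system name, database name and start time. The helper backup_name and
the exact layout of the name are assumptions made for illustration; the
host and database names are taken from the test logs earlier in the thread.

```shell
# backup_name SYSTEM DATABASE: hypothetical helper that builds a file name
# from the elements listed above plus the current date and time.
backup_name() {
    printf '%s_%s_%s.tar.gz\n' "$1" "$2" "$(date +%Y%m%d_%H%M%S)"
}

name=$(backup_name shroudeater test)
echo "$name"
```

The timestamp in the name records when the backup started; the file's last
modification time then records when it finished.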

I think it's perfectly fine for everybody to do backups any way they
please. There are many licenced variants of PostgreSQL and it might be
appropriate in those to specify particular ways of doing things.

I'll be trusting the management of backup metadata and storage media to
a solution designed for the purpose (open or closed source), just as
I'll be trusting my data to a database solution designed for that
purpose. That for me is one of the good things about PostgreSQL - we use
the filesystem, we don't write our own, we provide language interfaces
not invent our own proprietary server language etc..

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Bruce Momjian
Date:
OK, I think I have some solid ideas and reasons for them.

First, I think we need server-side functions to call when we start/stop
the backup.  The advantage of these server-side functions is that they
will do the required work of recording the pg_control values and
creating needed files with little chance for user error.  It also allows
us to change the internal operations in later releases without requiring
admins to change their procedures.  We are even able to adjust the
internal operation in minor releases without forcing a new procedure on
users.

Second, I think once we start a restore, we should rename recovery.conf
to recovery.in_progress, and when complete rename that to
recovery.done.  If the postmaster starts and sees recovery.in_progress,
it will fail to start knowing its recovery was interrupted.  This allows
the admin to take appropriate action.  (I am not sure what that action
would be. Does he bring back the backup files or just keep going?)
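
A minimal sketch of that startup check, assuming the three marker file names proposed above sit directly in the data directory. The function names and return values are hypothetical, not actual postmaster behaviour:

```python
import os

def recovery_state(data_dir):
    """Classify a data directory by which recovery marker file it contains.

    File names follow the proposal in this message; the return values are
    illustrative only.
    """
    if os.path.exists(os.path.join(data_dir, "recovery.in_progress")):
        # A previous recovery was interrupted: refuse to start.
        return "refuse"
    if os.path.exists(os.path.join(data_dir, "recovery.conf")):
        # A fresh recovery has been requested.
        return "recover"
    # recovery.done (or no marker at all) means a normal startup.
    return "normal"

def begin_recovery(data_dir):
    """Mark the recovery as started by renaming the command file."""
    os.rename(os.path.join(data_dir, "recovery.conf"),
              os.path.join(data_dir, "recovery.in_progress"))

def finish_recovery(data_dir):
    """Mark the recovery as complete."""
    os.rename(os.path.join(data_dir, "recovery.in_progress"),
              os.path.join(data_dir, "recovery.done"))
```

What the admin should do after a "refuse" result is left open here, just as it is in the proposal.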

Third, I think we need to put a file in the archive location once we
complete a backup, recording the start/stop xid and wal/offsets.  This
gives the admin documentation on what archive logs to keep and what xids
are available for recovery.  Ideally the recover program would read that
file and check the recover xid to make sure it is after the stop xid
recorded in the file.
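
A rough sketch of the check described above, assuming a plain "key = value" text layout for the file. The layout, the field names and the example values are all hypothetical, and xid wraparound is ignored:

```python
def parse_label(text):
    """Parse 'key = value' lines into a dict (hypothetical file layout)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields

def target_xid_is_recoverable(label_text, target_xid):
    """True if the requested recovery xid lies after the backup's stop xid.

    Plain integer comparison; xid wraparound is deliberately ignored.
    """
    label = parse_label(label_text)
    return target_xid > int(label["stop_xid"])

# Illustrative contents only; none of these values come from a real backup.
example_label = """
start_xid = 496
stop_xid = 512
start_wal = 000000000000093b.032b9
"""

print(target_xid_is_recoverable(example_label, 600))
print(target_xid_is_recoverable(example_label, 500))
```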

How would the recover program know the name of that file?  We need to
create it in /data with start contents before the backup, then complete
it with end contents and archive it.

What should we name it?  Ideally it would be named by the WAL
name/offset of the start so it orders in the proper spot in the archive
file listing, e.g.:
000000000000093a
000000000000093b
000000000000093b.032b9.start
000000000000093c

Are people going to know they need 000000000000093b for
000000000000093b.032b9.start?  I hope so.  Another idea is to do:

000000000000093a.xlog
000000000000093b.032b9.start
000000000000093b.xlog
000000000000093c.xlog

This would order properly.  It might be a very good idea to add
extensions to these log files now that we are archiving them in strange
places.  In fact, maybe we should use *.pg_xlog to document the
directory they came from.
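
The ordering claim can be checked with a plain byte-order sort. This is only a sketch: the helper name is hypothetical, and a real directory listing may sort differently depending on locale.

```python
# File names taken from the example above, deliberately shuffled.
archive_listing = [
    "000000000000093c.xlog",
    "000000000000093b.xlog",
    "000000000000093b.032b9.start",
    "000000000000093a.xlog",
]

def segment_for_label(label_name):
    """Return the WAL file a '<segment>.<offset>.start' label points at."""
    return label_name.split(".", 1)[0] + ".xlog"

for name in sorted(archive_listing):
    print(name)

print(segment_for_label("000000000000093b.032b9.start"))
```

With the .xlog extension the label sorts just before the segment it refers to, because '0' compares lower than 'x'.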

---------------------------------------------------------------------------


Simon Riggs wrote:
> On Fri, 2004-07-16 at 16:47, Tom Lane wrote:
> > As far as the business about copying pg_control first goes: there is
> > another way to think about it, which is to copy pg_control to another
> > place that will be included in your backup.  For example the standard
> > backup procedure could be
> > 
> > 1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
> > 
> > 2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
> > 
> > 3. tar cf /dev/mt $PGDATA
> > 
> > 4. do something to record ending WAL position
> > 
> > If we standardized on this way, then the tar archive would automatically
> > contain the pre-backup checkpoint position in ./pg_control.dump, and
> > there is no need for any special assumptions about the order in which
> > tar processes things.
> > 
> 
> Sounds good. That would be familiar to Oracle DBAs doing BACKUP
> CONTROLFILE. We can document that and offer it as a suggested procedure.
> 
> > However, once you decide to do things like that, there is no reason why
> > the copied file has to be an exact image of pg_control.  I claim it
> > would be more useful if the copied file were plain text so that you
> > could just "cat" it to find out the starting WAL position; that would
> > let you determine without any special tools what range of WAL archive
> > files you are going to need to bring back from your archives.
> 
> I wouldn't be in favour of a manual mechanism. If you want an automated
> mechanism, whats wrong with using the one thats already there? You can
> use pg_controldata to read the controlfile, again whats wrong with that?
> 
> We agreed some time back that an off-line xlog file inspector would be
> required to allow us to inspect the logs and make a decision about where
> to end recovery. You'd still need that.
> 
> It's scary enough having to specify the end point, let alone having to
> specify the starting point as well.
> 
> At your request, and with Bruce's idea, I designed and built the
> recovery system so that you don't need to know what range of xlogs to
> bring back. You just run it, it brings back the right files from archive
> and does recovery with them, then cleans up - and it works without
> running out of disk space on long recoveries.
> 
> I've built it now and it works...
> 
> > This is pretty much the same chain of reasoning that Bruce and I went
> > through yesterday to come up with the idea of putting a label file
> > inside the tar backups.  We concluded that it'd be worth putting
> > both the backup starting time and the checkpoint WAL position into
> > the label file --- the starting time isn't needed for restore but
> > might be really helpful as documentation, if you needed to verify
> > which dump file was which.
> 
> ...if you are doing tar backups...what will you do if you're not using
> that mechanism?
> 
> If you are: It's common practice to make up a backup filename from
> elements such as systemname, databasename, date and time etc. That gives
> you the start time, the file last mod date gives you the end time. 
> 
> I think it's perfectly fine for everybody to do backups any way they
> please. There are many licenced variants of PostgreSQL and it might be
> appropriate in those to specify particular ways of doing things.
> 
> I'll be trusting the management of backup metadata and storage media to
> a solution designed for the purpose (open or closed source), just as
> I'll be trusting my data to a database solution designed for that
> purpose. That for me is one of the good things about PostgreSQL - we use
> the filesystem, we don't write our own, we provide language interfaces
> not invent our own proprietary server language etc..
> 
> Best Regards, Simon Riggs
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Point in Time Recovery

From
Bruce Momjian
Date:
Let me address your concerns about PITR getting into 7.5. I think a few
people spoke last week expressing concern about our release process and
wanting to take drastic action.  However, looking at the release status
report I am about to post, you will see we are on track for an August 1
beta.

PITR has been neglected only because it has been moving along so well we
haven't needed to get deeply involved.  Simon has been able to address
concerns as we raised them and make adjustments quickly with little
guidance.  

Now, we certainly don't want to skip adding PITR by not giving it our
full attention to get into 7.5.  Once Tom completes the cursor issues
with NT in the next day or so,  I think that removes the last big NT
stumbling block, and we will start to focus on PITR.  Unless there is
some major thing we are missing, we fully expect to get PITR in 7.5.  We
don't have a crystal ball to know for sure, but our intent is clear.

I know Simon is going away July 26 so we want to get him feedback as
soon as possible.  If we wait until after July 26, we will have to make
all the adjustments without Simon's guidance, which will be difficult.

As far as the importance of PITR, it is a _key_ enterprise feature, even
more key than NT.  PITR is going to be one of the crowning jewels of the
7.5 release, and I don't want to go into beta without it unless we can't
help it.

So, I know the deadline is looming and everyone is getting nervous,
but keep the faith.  I can see the light at the end of the tunnel.  I
know this is a tighter schedule than we would like, but I know we can do
it, and I expect we will do it.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:
> 
> > We will get there --- it just seems dark at this time.
> 
> Thanks for that. My comments were heartfelt, but not useful right now. 
> 
> I'm badly overdrawn already on my time budget, though that is my concern
> alone. There is more to do than I have time for. Pragmatically, if we
> aren't going to get there then I need to stop now, so I can progress
> other outstanding issues. All help is appreciated.
> 
> I'm aiming for the minimum feature set - which means we do need to take
> care over whether that set is insufficient and also to pull any part
> that doesn't stand up to close scrutiny over the next few days.
> 
> Overall, my primary goal is increased robustness and availability for
> PostgreSQL...and then to have a rest!
> 
> Best Regards, Simon Riggs
> 
> 
> 

 


PITR COPY Failure (was Point in Time Recovery)

From
Mark Kirkwood
Date:
I decided to produce a nice simple example, so that anyone could 
hopefully replicate what I am seeing.

The scenario is the same as before (the 11 steps), but the CREATE TABLE 
and COPY step has been reduced to:

CREATE TABLE test0 (filler  VARCHAR(120));
COPY test0 FROM '/data0/dump/test0.dat' USING DELIMITERS ',';

Now the file 'test0.dat' consists of (128293) identical lines, each of 
109 'a' characters (plus end of line)


A script to run the whole business can be found here :

http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz

(It will need a bit of editing for things like location of Pg, PGDATA, 
and you will need to make your own data file)

The main points of interest are:
- anything <=128392 rows in test0.dat results in 1 archived log, and the 
recovery succeeds
- anything >=128393 rows in test0.dat results in 2 or more archived 
logs, and recovery fails on the second log (and gives the zero length 
redo at 0/1FFFFE0 message).
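
As a rough cross-check of where that address falls, the offset can be split into a segment number and a position within the segment. This assumes the default 16 MB WAL segment size, which the thread does not state explicitly; the function name is hypothetical.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default WAL segment size (16 MB)

def locate(offset, segment_size=SEGMENT_SIZE):
    """Split a byte offset within one log ID into (segment, offset, bytes left)."""
    segment, within = divmod(offset, segment_size)
    return segment, within, segment_size - within

segment, within, left = locate(0x1FFFFE0)
print(segment, hex(within), left)  # prints: 1 0xffffe0 32
```

Under that assumption the zero-length record sits 32 bytes before the end of the second segment.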

Let me know if I can do any more legwork on this (I am considering 
re-compiling with WAL_DEBUG now that example is simpler)

regards

Mark

Simon Riggs wrote:

>On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:
>  
>
>>I tried what I thought was a straightforward scenario, and seem to have 
>>broken it :-(
>>
>>Here is the little tale
>>
>>1) initdb
>>2) set archive_mode and archive_dest in postgresql.conf
>>3) startup
>>4) create database called 'test'
>>5) connect to 'test' and type 'checkpoint'
>>6) backup PGDATA using 'tar -zcvf'
>>7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
>>8) shutdown and remove PGDATA
>>9)  recover using 'tar -zxvf'
>>10) copy recovery.conf into PGDATA
>>11) startup
>>
>>This is what I get :
>>
>>LOG:  database system was interrupted at 2004-07-15 21:24:04 NZST
>>LOG:  recovery command file found...
>>LOG:  restore_program = cp %s/%s %s
>>LOG:  recovery_target_inclusive = true
>>LOG:  recovery_debug_log = true
>>LOG:  starting archive recovery
>>LOG:  restored log file "0000000000000000" from archive
>>LOG:  checkpoint record is at 0/A48054
>>LOG:  redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
>>LOG:  next transaction ID: 496; next OID: 25419
>>LOG:  database system was not properly shut down; automatic recovery in 
>>progress
>>LOG:  redo starts at 0/A48094
>>LOG:  restored log file "0000000000000001" from archive
>>LOG:  record with zero length at 0/1FFFFE0
>>LOG:  redo done at 0/1FFFF30
>>LOG:  restored log file "0000000000000001" from archive
>>LOG:  restored log file "0000000000000001" from archive
>>PANIC:  concurrent transaction log activity while database system is 
>>shutting down
>>LOG:  startup process (PID 13492) was terminated by signal 6
>>LOG:  aborting startup due to startup process failure
>>
>>The concurrent access is a bit of a puzzle, as this is my home machine 
>>(i.e. I am *sure* no one else is connected!)
>>
>>    
>>
>
>I can see what is wrong now, but you'll have to help me on details your
>end...
>
>The log shows that xlog 1 was restored from archive. It contains a zero
>length record, which indicates that it isn't yet full (or that's what the
>existing recovery code assumes it means). Which also indicates that it
>should never have been archived in the first place, and should not
>therefore be a candidate for a restore from archive.
>
>The double message "restored log file" can only occur after you've
>retrieved a partially full file from archive - which as I say, shouldn't
>be there.
>
>Other messages are essentially spurious in those circumstances.
>
>Either:
>- somehow the files have been mixed up in the archive directory, which
>is possible if the filing discipline is not strict - various ways,
>unfortunately I would guess this to be the most likely, somehow
>- the file that has been restored has been damaged in some way
>- the archiver has archived a file too early (very unlikely, IMHO -
>that's the most robust bit of the code)
>- some aspect of the code has written a zero length record to WAL (which
>is supposed to not be possible, but we mustn't discount an error in
>recent committed work)
>
>- there may also be an effect going on with checkpoints that I don't
>understand...spurious checkpoint warning messages have already been
>observed and reported,
>
>Best regards, Simon Riggs
>
>
>
>
>  
>


Re: PITR COPY Failure (was Point in Time Recovery)

From
Tom Lane
Date:
Mark Kirkwood <markir@coretech.co.nz> writes:
> - anything >=128393 rows in test0.dat results in 2 or more archived 
> logs, and recovery fails on the second log (and gives the zero length 
> redo at 0/1FFFFE0 message).

Zero length record is not an error, it's the normal way of detecting
end-of-log.
        regards, tom lane


Re: PITR COPY Failure (was Point in Time Recovery)

From
Mark Kirkwood
Date:
There are some silly bugs in the script:

- forgot to export PGDATA and PATH after changing them
- forgot to mention the need to edit test.sql (COPY line needs path to 
dump file)

Apologies - I will submit a fixed version a little later

regards

Mark

Mark Kirkwood wrote:

>
> A script to run the whole business can be found here :
>
> http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz
>
> (It will need a bit of editing for things like location of Pg, PGDATA, 
> and you will need to make your own data file)
>


Re: PITR COPY Failure (was Point in Time Recovery)

From
Mark Kirkwood
Date:
fixed.

Mark Kirkwood wrote:

> There are some silly bugs in the script:
>
> - forgot to export PGDATA and PATH after changing them
> - forgot to mention the need to edit test.sql (COPY line needs path to 
> dump file)
>
> Apologies - I will submit a fixed version a little later
>
> regards
>
> Mark
>
> Mark Kirkwood wrote:
>
>>
>> A script to run the whole business can be found here :
>>
>> http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz
>>
>> (It will need a bit of editing for things like location of Pg, 
>> PGDATA, and you will need to make your own data file)
>>
>


Re: Point in Time Recovery

From
Simon Riggs
Date:
On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:
> OK, I think I have some solid ideas and reasons for them.
> 

Sorry for taking so long to reply...

> First, I think we need server-side functions to call when we start/stop
> the backup.  The advantage of these server-side functions is that they
> will do the required work of recording the pg_control values and
> creating needed files with little chance for user error.  It also allows
> us to change the internal operations in later releases without requiring
> admins to change their procedures.  We are even able to adjust the
> internal operation in minor releases without forcing a new procedure on
> users.

Yes, I think we should go down this route. ....there's a "but" and that
is we don't absolutely need it for correctness....and so I must decline
adding it to THIS release. I don't imagine I'll stop being associated with
this code for a while yet....

Can we recommend that users should expect to have to call a start and
end backup routine in later releases? Don't expect you'll agree to
that..

> 
> Second, I think once we start a restore, we should rename recovery.conf
> to recovery.in_progress, and when complete rename that to
> recovery.done.  If the postmaster starts and sees recovery.in_progress,
> it will fail to start knowing its recovery was interrupted.  This allows
> the admin to take appropriate action.  (I am not sure what that action
> would be. Does he bring back the backup files or just keep going?)
> 

Superseded by Tom's actions. Two states are required: start and stop.
Recovery isn't going to be checkpoint-restartable anytime soon, IMHO.

Best regards, Simon Riggs



Re: Point in Time Recovery

From
Tom Lane
Date:
Bruce and I had another phone chat about the problems that can ensue
if you restore a tar backup that contains old (incompletely filled)
versions of WAL segment files.  While the current code will ignore them
during the recovery-from-archive run, leaving them lying around seems
awfully dangerous.  One nasty possibility is that the archiving
mechanism will pick up these files and overwrite good copies in the
archive area with the obsolete ones from the backup :-(.

Bruce earlier proposed that we simply "rm pg_xlog/*" at the start of
a recovery-from-archive run, but as I said I'm scared to death of code
that does such a thing automatically.  In particular this would make it
impossible to handle scenarios where you want to do a PITR recovery but
you need to use some recent WAL segments that didn't make it into your
archive yet.  (Maybe you could get around this by forcibly transferring
such segments into the archive, but that seems like a bad idea for
incomplete segments.)

It would really be best for the DBA to make sure that the starting
condition for the recovery run does not have any obsolete segment files
in pg_xlog.  He could do this either by setting up his backup policy so
that pg_xlog isn't included in the tar backup in the first place, or by
manually removing the included files just after restoring the backup,
before he tries to start the recovery run.

Of course the objection to that is "what if the DBA forgets to do it?"

The idea that we came to on the phone was for the postmaster, when it
enters recovery mode because a recovery.conf file exists, to look in
pg_xlog for existing segment files and refuse to start if any are there
--- *unless* the user has put a special, non-default overriding flag
into recovery.conf.  Call it "use_unarchived_files" or something like
that.  We'd have to provide good documentation and an extensive HINT of
course, but basically the DBA would have two choices when he gets this
refusal to start:

1. Remove all the segment files in pg_xlog.  (This would be the right
thing to do if he knows they all came off the backup.)

2. Verify that pg_xlog contains only segment files that are newer than
what's stored in the WAL archive, and then set the override flag in
recovery.conf.  In this case the DBA is taking responsibility for
leaving only segment files that are good to use.
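
A minimal sketch of the proposed check, assuming every 16-character hexadecimal file name in pg_xlog is a WAL segment. The function name, the naming test and the message text are illustrative; only the flag name comes from the proposal above.

```python
import os

def may_start_recovery(pg_xlog_dir, use_unarchived_files=False):
    """Return (ok, message) for the proposed pre-recovery check.

    Treats every 16-character hexadecimal file name as a WAL segment;
    that naming test is an assumption made for this sketch.
    """
    hexdigits = set("0123456789abcdefABCDEF")
    segments = sorted(
        name for name in os.listdir(pg_xlog_dir)
        if len(name) == 16 and set(name) <= hexdigits
    )
    if segments and not use_unarchived_files:
        return False, (
            "pg_xlog contains %d segment file(s); remove them or set "
            "use_unarchived_files in recovery.conf" % len(segments)
        )
    return True, "ok"
```

Choice 1 corresponds to emptying the directory so the check passes on its own; choice 2 corresponds to calling it with use_unarchived_files=True.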

One interesting point is that with such a policy, we could use locally
available WAL segments in preference to pulling the same segments from
archive, which would be at least marginally more efficient, and seems
logically cleaner anyway.

In particular it seems that this would be a useful arrangement in cases
where you have questionable WAL segments --- you're not sure if they're
good or not.  Rather than having to push questionable data into your WAL
archive, you can leave it local, try a recovery run, and see if you like
the resulting state.  If not, it's a lot easier to do-over when you have
not corrupted your archive area.

Comments?  Better ideas?
        regards, tom lane


Re: PITR COPY Failure (was Point in Time Recovery)

From
Mark Kirkwood
Date:
I have been doing some re-testing with CVS HEAD from about 1 hour ago 
using the simplified example posted previously.


It is quite interesting:

i) create the table as:

CREATE TABLE test0 (filler  TEXT);

and COPY 100 000 rows of length 109, then recovery succeeds.


ii) create the table as:

CREATE TABLE test0 (filler    VARCHAR(120));

and COPY as above, then recovery *fails* with the signal 6 error below.



LOG:  database system was not properly shut down; automatic recovery in 
progress
LOG:  redo starts at 0/A4807C
LOG:  record with zero length at 0/FFFFE0
LOG:  redo done at 0/FFFF30
LOG:  restored log file "0000000000000000" from archive
LOG:  archive recovery complete
PANIC:  concurrent transaction log activity while database system is 
shutting down
LOG:  startup process (PID 17546) was terminated by signal 6
LOG:  aborting startup due to startup process failure

(I am pretty sure both TEXT and VARCHAR(120) failed using the original 
patch)

Any suggestions for the best way to dig a bit deeper?

regards

Mark




Re: Point in Time Recovery

From
Christopher Kings-Lynne
Date:
I've got a PITR set up here that's happily scp'ing WAL files across to 
another machine.  However, the NIC in the machine is currently stuffed, 
so it gets like 50k/s :)  What happens in general if you are generating 
WAL file bytes faster always than they can be copied off?

Also, does the archive dir just basically keep filling up forever?  How 
do I know when I can prune some files?  Anything older than the last 
full backup?

Chris



Re: Point in Time Recovery

From
Tom Lane
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> I've got a PITR set up here that's happily scp'ing WAL files across to 
> another machine.  However, the NIC in the machine is currently stuffed, 
> so it gets like 50k/s :)  What happens in general if you are generating 
> WAL file bytes faster always than they can be copied off?

If you keep falling further and further behind, eventually your pg_xlog
directory will fill the space available on its disk, and I think at that
point PG will panic and shut down because it can't create any more xlog
segments.
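
For a feel of the numbers, here is a back-of-the-envelope sketch. It assumes 16 MB segments, reads "50k/s" as 50 * 1024 bytes per second, and ignores checkpoint timing and recycling; the function names are hypothetical.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed 16 MB WAL segments

def seconds_to_copy_segment(copy_rate):
    """Seconds needed to ship one segment at copy_rate bytes per second."""
    return SEGMENT_SIZE / copy_rate

def seconds_until_full(free_bytes, wal_rate, copy_rate):
    """Seconds until pg_xlog fills, or None if archiving keeps up.

    Simplified model: the backlog grows at (wal_rate - copy_rate) bytes/s.
    """
    growth = wal_rate - copy_rate
    if growth <= 0:
        return None
    return free_bytes / growth

print(seconds_to_copy_segment(50 * 1024))
```

At that link speed each segment takes several minutes to ship, so any sustained WAL rate above the copy rate eventually exhausts the disk.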

> Also, does the archive dir just basically keep filling up forever?  How 
> do I know when I can prune some files?  Anything older than the last 
> full backup?

Anything older than the starting checkpoint of the last full backup that
you might want to restore to.  We need to adjust the backup procedure so
that the starting segment number for a backup is more readily visible;
see recent discussions about logging that explicitly in some fashion.
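
The pruning rule can be sketched as follows, assuming archived segments use fixed-width hexadecimal names so that string order matches age. The function name is hypothetical, and the starting segment must be that of the oldest backup you still want to be able to restore.

```python
def prunable_segments(archived, backup_start_segment):
    """Return archived segment names strictly older than the starting segment."""
    return sorted(name for name in archived if name < backup_start_segment)

archived = [
    "000000000000093a",
    "000000000000093b",
    "000000000000093c",
]
print(prunable_segments(archived, "000000000000093b"))
```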
        regards, tom lane


Re: Point in Time Recovery

From
Christopher Kings-Lynne
Date:
> If you keep falling further and further behind, eventually your pg_xlog
> directory will fill the space available on its disk, and I think at that
> point PG will panic and shut down because it can't create any more xlog
> segments.

Hang on, are you supposed to MOVE or COPY away WAL segments?

Chris



Re: Point in Time Recovery

From
Bruce Momjian
Date:
Christopher Kings-Lynne wrote:
> > If you keep falling further and further behind, eventually your pg_xlog
> > directory will fill the space available on its disk, and I think at that
> > point PG will panic and shut down because it can't create any more xlog
> > segments.
> 
> Hang on, are you supposed to MOVE or COPY away WAL segments?

Copy.  pg will delete them once they are archived.

 


Re: Point in Time Recovery

From
Tom Lane
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> If you keep falling further and further behind, eventually your pg_xlog
>> directory will fill the space available on its disk, and I think at that
>> point PG will panic and shut down because it can't create any more xlog
>> segments.

> Hang on, are you supposed to MOVE or COPY away WAL segments?

COPY.  The checkpoint code will then delete or recycle the segment file,
as appropriate.
        regards, tom lane


Re: Point in Time Recovery

From
Christopher Kings-Lynne
Date:
>>Hang on, are you supposed to MOVE or COPY away WAL segments?
> 
> COPY.  The checkpoint code will then delete or recycle the segment file,
> as appropriate.

So what happens if you just move it?  Postgres breaks?

Chris



Re: Point in Time Recovery

From
Tom Lane
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>>> Hang on, are you supposed to MOVE or COPY away WAL segments?
>> 
>> COPY.  The checkpoint code will then delete or recycle the segment file,
>> as appropriate.

> So what happens if you just move it?  Postgres breaks?

I don't think so, but it seems like a much less robust way to do things.
What happens if you have a failure partway through?  For instance
archive machine dies and loses recent data right after you've rm'd the
source file.  The recommended COPY procedure at least provides some
breathing room between when you install the data on the archive and when
the original file is removed.

It's not like you save any effort by using a MOVE anyway.  You're not
going to have the archive on the same machine as the database (or if you
are, you ain't gonna be *my* DBA ...)
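
A sketch of a copy-style archive step in that spirit, assuming the archive is a directory reachable as a local path. The function name and the ".tmp" suffix are hypothetical, and this is not the archiver's actual code:

```python
import os
import shutil

def archive_segment(src_path, archive_dir):
    """Copy one WAL segment into archive_dir without touching the source.

    Returns True on success. The copy goes to a temporary name first and is
    renamed into place, so a partial copy never appears under the final name.
    """
    final = os.path.join(archive_dir, os.path.basename(src_path))
    if os.path.exists(final):
        return False  # never overwrite an existing archived segment
    tmp = final + ".tmp"
    try:
        shutil.copyfile(src_path, tmp)
        os.rename(tmp, final)
    except OSError:
        if os.path.exists(tmp):
            os.remove(tmp)
        return False
    return True
```

The source file is never removed here; deleting or recycling it stays with the server.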
        regards, tom lane


Re: Point in Time Recovery

From
Christopher Kings-Lynne
Date:
> I don't think so, but it seems like a much less robust way to do things.
> What happens if you have a failure partway through?  For instance
> archive machine dies and loses recent data right after you've rm'd the
> source file.  The recommended COPY procedure at least provides some
> breathing room between when you install the data on the archive and when
> the original file is removed.

Well, I tried it in 'cross your fingers' mode and it works, at least:

archive_command = 'rm %p'

:)

Chris



Re: PITR COPY Failure (was Point in Time Recovery)

From
Tom Lane
Date:
Mark Kirkwood <markir@coretech.co.nz> writes:
> I have been doing some re-testing with CVS HEAD from about 1 hour ago 
> using the simplified example posted previously.

> It is quite interesting:

The problem seems to be that the computation of checkPoint.redo at
xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
allowing for the possibility that XLogInsert will decide it doesn't
want to split the checkpoint record across XLOG files, and will then
insert a WASTED_SPACE record to avoid that (see comment and following
code at lines 758-795).  This wouldn't really matter except that there
is a safety crosscheck at line 4268 that tries to detect unexpected
insertions of other records during a shutdown checkpoint.

I think the code in CreateCheckPoint was correct when it was written,
because we only recently changed XLogInsert to not split records
across files.  But it's got a boundary-case bug now, which your test
scenario is able to exercise by making the recovery run try to write
a shutdown checkpoint exactly at the end of a WAL file segment.
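
A toy model of that boundary case, not the xlog.c code. It assumes 16 MB segments and a hypothetical 64-byte record length, and all names are illustrative:

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed 16 MB segments

def place_record(insert_pos, record_len, segment_size=SEGMENT_SIZE):
    """Return (start, padding) for a record under a no-split rule.

    If the record does not fit in what is left of the current segment, the
    remainder is padded and the record starts at the next segment boundary.
    """
    left = segment_size - insert_pos % segment_size
    if record_len > left:
        return insert_pos + left, left
    return insert_pos, 0

def shutdown_check_passes(predicted_redo, insert_pos, record_len):
    """Mimic a cross-check that the checkpoint record landed where predicted."""
    start, _padding = place_record(insert_pos, record_len)
    return start == predicted_redo

# Predicting the redo pointer from the insert position works mid-segment...
print(shutdown_check_passes(0x100, 0x100, 64))
# ...but not 32 bytes before a segment boundary, where padding moves the record.
print(shutdown_check_passes(0xFFFFE0, 0xFFFFE0, 64))
```

In this model the prediction fails only when the padding is non-zero, which is why the failure depends so sharply on the amount of data loaded.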

The quick and dirty solution would be to dike out the safety check at
4268ff.  I don't much care for that, but am too tired right now to work
out a better answer.  I'm not real sure whether it's better to adjust
the computation of checkPoint.redo or to smarten the safety check
... but one or the other needs to allow for file-end padding, or maybe
we could hack some update of the state in WasteXLInsertBuffer().  (But
at some point you have to say "this is more trouble than it's worth",
so maybe we'll end up taking out the safety check.)

In any case this isn't a fundamental bug, just an insufficiently
smart safety check.  But thanks for finding it!  As is, the code has
a nonzero probability of failure in the field :-( and I don't know
how we'd have tracked it down without a reproducible test case.
        regards, tom lane


Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> > Hang on, are you supposed to MOVE or COPY away WAL segments?
>
> Copy.  pg will delete them once they are archived.

Copy. pg will recycle them once they are archived.

Andreas


Re: PITR COPY Failure (was Point in Time Recovery)

From
Simon Riggs
Date:
On Tue, 2004-07-20 at 05:14, Tom Lane wrote:
> Mark Kirkwood <markir@coretech.co.nz> writes:
> > I have been doing some re-testing with CVS HEAD from about 1 hour ago 
> > using the simplified example posted previously.
> 
> > It is quite interesting:

> The problem seems to be that the computation of checkPoint.redo at
> xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
> allowing for the possibility that XLogInsert will decide it doesn't
> want to split the checkpoint record across XLOG files, and will then
> insert a WASTED_SPACE record to avoid that (see comment and following
> code at lines 758-795).  This wouldn't really matter except that there
> is a safety crosscheck at line 4268 that tries to detect unexpected
> insertions of other records during a shutdown checkpoint.
> 
> I think the code in CreateCheckPoint was correct when it was written,
> because we only recently changed XLogInsert to not split records
> across files.  But it's got a boundary-case bug now, which your test
> scenario is able to exercise by making the recovery run try to write
> a shutdown checkpoint exactly at the end of a WAL file segment.
> 

Thanks for locating that, I was suspicious of that piece of code, but it
would have taken me longer than this to locate it exactly.

It was clear (to me) that it had to be of this nature, since I've done a
fair amount of recovery testing and not hit anything like that.

> The quick and dirty solution would be to dike out the safety check at
> 4268ff.  I don't much care for that, but am too tired right now to work
> out a better answer.  I'm not real sure whether it's better to adjust
> the computation of checkPoint.redo or to smarten the safety check
> ... but one or the other needs to allow for file-end padding, or maybe
> we could hack some update of the state in WasteXLInsertBuffer().  (But
> at some point you have to say "this is more trouble than it's worth",
> so maybe we'll end up taking out the safety check.)
> 

I'll take a look

> In any case this isn't a fundamental bug, just an insufficiently
> smart safety check.  But thanks for finding it!  As is, the code has
> a nonzero probability of failure in the field :-( and I don't know
> how we'd have tracked it down without a reproducible test case.

All code has a non-zero probability of failure in the field, it's just
that they don't usually tell you that. The main thing here is that we write
everything we need to write to the logs in the first place. 

If that is true, then the code can always be adjusted or the logs dumped
and re-spliced to recover data.

Definitely: Thanks Mark! Reproducibility is key.

Best regards, Simon Riggs




Re: PITR COPY Failure (was Point in Time Recovery)

From
Mark Kirkwood
Date:
Great that it's not fundamental - and hopefully with this discovery, the 
probability you mentioned is being squashed towards zero a bit more  :-)

Don't let this early bug detract from what is really a superb piece of work!

regards

Mark

Tom Lane wrote:

>In any case this isn't a fundamental bug, just an insufficiently
>smart safety check.  But thanks for finding it!  As is, the code has
>a nonzero probability of failure in the field :-( and I don't know
>how we'd have tracked it down without a reproducible test case.
>
>
>  
>


Re: PITR COPY Failure (was Point in Time Recovery)

From
Simon Riggs
Date:
On Tue, 2004-07-20 at 05:14, Tom Lane wrote:
> Mark Kirkwood <markir@coretech.co.nz> writes:
> > I have been doing some re-testing with CVS HEAD from about 1 hour ago 
> > using the simplified example posted previously.
> 
> > It is quite interesting:
> 
> The problem seems to be that the computation of checkPoint.redo at
> xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
> allowing for the possibility that XLogInsert will decide it doesn't
> want to split the checkpoint record across XLOG files, and will then
> insert a WASTED_SPACE record to avoid that (see comment and following
> code at lines 758-795).  This wouldn't really matter except that there
> is a safety crosscheck at line 4268 that tries to detect unexpected
> insertions of other records during a shutdown checkpoint.
> 
> I think the code in CreateCheckPoint was correct when it was written,
> because we only recently changed XLogInsert to not split records
> across files.  But it's got a boundary-case bug now, which your test
> scenario is able to exercise by making the recovery run try to write
> a shutdown checkpoint exactly at the end of a WAL file segment.
> 
> The quick and dirty solution would be to dike out the safety check at
> 4268ff.  

Well, taking out the safety check isn't the answer.

The check produces the last error message "concurrent transaction...",
but it isn't the cause of the mismatch in the first place.

If you take out that check, we still fail because the wasted space at
the end is causing a "record with zero length" error.
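
To make the boundary case concrete, here is a minimal sketch in Python. It is a toy model, not the xlog.c code: the segment size, the record length, and the function names `naive_redo` and `insert_no_split` are assumptions chosen for illustration. It shows how a start position predicted before insertion can disagree with the real one once end-of-segment padding is inserted, which is the kind of mismatch a crosscheck on the expected position would trip over.

```python
SEG_SIZE = 1024  # hypothetical segment size in bytes; real WAL segments are far larger


def naive_redo(insert_pos):
    """Predict where the next record will start, ignoring end-of-file padding."""
    return insert_pos


def insert_no_split(insert_pos, rec_len):
    """Place a record without splitting it across segments.

    Returns (start_of_record, padding_bytes). If the record does not fit in
    the space left in the current segment, that space is padded out and the
    record starts at the beginning of the next segment.
    """
    room = SEG_SIZE - insert_pos % SEG_SIZE
    if rec_len > room:
        return insert_pos + room, room
    return insert_pos, 0


# A 100-byte checkpoint record arriving with only 40 bytes left in the segment.
pos = SEG_SIZE - 40
predicted = naive_redo(pos)
actual, padding = insert_no_split(pos, 100)
print(predicted, actual, padding)
```

In this model the prediction and the actual start differ by exactly the padding, so either the prediction or the check has to account for it.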

>  I'm not real sure whether it's better to adjust
> the computation of checkPoint.redo or to smarten the safety check
> ... but one or the other needs to allow for file-end padding, or maybe
> we could hack some update of the state in WasteXLInsertBuffer().  (But
> at some point you have to say "this is more trouble than it's worth",
> so maybe we'll end up taking out the safety check.)

...I'm looking at other options now.

Best Regards, Simon Riggs



Re: PITR COPY Failure (was Point in Time Recovery)

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
>> The quick and dirty solution would be to dike out the safety check at
>> 4268ff.  

> If you take out that check, we still fail because the wasted space at
> the end is causing a "record with zero length" error.

Ugh.  I'm beginning to think we ought to revert the patch that added the
don't-split-across-files logic to XLogInsert; that seems to have broken
more assumptions than I realized.  That was added here:

2004-02-11 17:55  tgl

    * src/: backend/access/transam/xact.c,
    backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
    backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
    bin/pg_controldata/pg_controldata.c,
    bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
    include/access/xlog.h, include/access/xlogutils.h,
    include/pg_config_manual.h, include/catalog/pg_control.h,
    include/storage/smgr.h: Commit the reasonably uncontroversial parts
    of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
    segment file so that it can be reliably identified.  Avoid
    splitting WAL records across segment files (this is not strictly
    necessary, but makes it simpler to incorporate the header records).
    Make WAL entries for file creation, deletion, and truncation (as
    foreseen but never implemented by Vadim).  Also, add support for
    making XLOG_SEG_SIZE configurable at compile time, similarly to
    BLCKSZ.  Fix a couple bugs I introduced in WAL replay during recent
    smgr API changes.  initdb is forced due to changes in pg_control
    contents.
 

There are other ways to do this, for example we could treat the WAL page
headers as variable-size, and stick the file labeling info into the
first page's header instead of making it be a separate record.  The
separate-record way makes it easier to incorporate future additions to
the file labeling info, but I don't really think it's critical to allow
for that.
        regards, tom lane


Re: PITR COPY Failure (was Point in Time Recovery)

From
Simon Riggs
Date:
On Tue, 2004-07-20 at 13:51, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> >> The quick and dirty solution would be to dike out the safety check at
> >> 4268ff.  
> 
> > If you take out that check, we still fail because the wasted space at
> > the end is causing a "record with zero length" error.
> 
> Ugh.  I'm beginning to think we ought to revert the patch that added the
> don't-split-across-files logic to XLogInsert; that seems to have broken
> more assumptions than I realized.  That was added here:
> 
> 2004-02-11 17:55  tgl
> 
>     * src/: backend/access/transam/xact.c,
>     backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
>     backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
>     bin/pg_controldata/pg_controldata.c,
>     bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
>     include/access/xlog.h, include/access/xlogutils.h,
>     include/pg_config_manual.h, include/catalog/pg_control.h,
>     include/storage/smgr.h: Commit the reasonably uncontroversial parts
>     of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
>     segment file so that it can be reliably identified.  Avoid
>     splitting WAL records across segment files (this is not strictly
>     necessary, but makes it simpler to incorporate the header records).
>      Make WAL entries for file creation, deletion, and truncation (as
>     foreseen but never implemented by Vadim).  Also, add support for
>     making XLOG_SEG_SIZE configurable at compile time, similarly to
>     BLCKSZ.  Fix a couple bugs I introduced in WAL replay during recent
>     smgr API changes.  initdb is forced due to changes in pg_control
>     contents.
> 
> There are other ways to do this, for example we could treat the WAL page
> headers as variable-size, and stick the file labeling info into the
> first page's header instead of making it be a separate record.  The
> separate-record way makes it easier to incorporate future additions to
> the file labeling info, but I don't really think it's critical to allow
> for that.
> 

I think I've fixed it now...but wait 20

The problem was that a zero length XLOG_WASTED_SPACE record just fell
out of ReadRecord when it shouldn't have. By giving it a helping hand it
makes it through with pointers correctly set, and everything else was
already thought of in the earlier patch, so xlog_redo etc happens.

I'll update again in a few minutes....no point us both looking at this.

Best regards, Simon Riggs





Re: PITR COPY Failure (was Point in Time Recovery)

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2004-07-20 at 13:51, Tom Lane wrote:
>> Ugh.  I'm beginning to think we ought to revert the patch that added the
>> don't-split-across-files logic to XLogInsert; that seems to have broken
>> more assumptions than I realized.

> The problem was that a zero length XLOG_WASTED_SPACE record just fell
> out of ReadRecord when it shouldn't have. By giving it a helping hand it
> makes it through with pointers correctly set, and everything else was
> already thought of in the earlier patch, so xlog_redo etc happens.

Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
adding two more warts to the code to support it is sticking in my craw.
I'm thinking it would be cleaner to treat the extra labeling information
as an extension of the WAL page header.
        regards, tom lane


Re: PITR COPY Failure (was Point in Time Recovery)

From
Simon Riggs
Date:
On Tue, 2004-07-20 at 14:11, Simon Riggs wrote:
> On Tue, 2004-07-20 at 13:51, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > >> The quick and dirty solution would be to dike out the safety check at
> > >> 4268ff.
> >
> > > If you take out that check, we still fail because the wasted space at
> > > the end is causing a "record with zero length" error.
> >
> > Ugh.  I'm beginning to think we ought to revert the patch that added the
> > don't-split-across-files logic to XLogInsert; that seems to have broken
> > more assumptions than I realized.  That was added here:
> >
> > 2004-02-11 17:55  tgl
> >
> >     * src/: backend/access/transam/xact.c,
> >     backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
> >     backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
> >     bin/pg_controldata/pg_controldata.c,
> >     bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
> >     include/access/xlog.h, include/access/xlogutils.h,
> >     include/pg_config_manual.h, include/catalog/pg_control.h,
> >     include/storage/smgr.h: Commit the reasonably uncontroversial parts
> >     of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
> >     segment file so that it can be reliably identified.  Avoid
> >     splitting WAL records across segment files (this is not strictly
> >     necessary, but makes it simpler to incorporate the header records).
> >      Make WAL entries for file creation, deletion, and truncation (as
> >     foreseen but never implemented by Vadim).  Also, add support for
> >     making XLOG_SEG_SIZE configurable at compile time, similarly to
> >     BLCKSZ.  Fix a couple bugs I introduced in WAL replay during recent
> >     smgr API changes.  initdb is forced due to changes in pg_control
> >     contents.
> >
> > There are other ways to do this, for example we could treat the WAL page
> > headers as variable-size, and stick the file labeling info into the
> > first page's header instead of making it be a separate record.  The
> > separate-record way makes it easier to incorporate future additions to
> > the file labeling info, but I don't really think it's critical to allow
> > for that.
> >
>
> I think I've fixed it now...but wait 20
>
> The problem was that a zero length XLOG_WASTED_SPACE record just fell
> out of ReadRecord when it shouldn't have. By giving it a helping hand it
> makes it through with pointers correctly set, and everything else was
> already thought of in the earlier patch, so xlog_redo etc happens.
>
> I'll update again in a few minutes....no point us both looking at this.
>

This was a very confusing test...Here's what I think happened:

Mark discovered a numerological coincidence that meant that the
XLOG_WASTED_SPACE record was zero length at the end of EACH file he was
writing to, as long as there was just that one writer. So no matter how
many records were inserted, each xlog file had a zero length
XLOG_WASTED_SPACE record at its end.

ReadRecord failed on seeing a zero length record, i.e. when it got to
the FIRST of the XLOG_WASTED_SPACE records. That's why the test fails no
matter how many records you give it, as long as it was more than enough
to write into a second xlog segment file.

By telling ReadRecord that XLOG_WASTED_SPACE records of zero length are
in fact *OK*, it continues happily. (That's just a partial fix, see
later)
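
As a rough illustration of that change, here is a toy reader in Python. It is a sketch under stated assumptions, not the ReadRecord code: records are modelled as (kind, length) pairs, and the function name `read_records` and the flag `allow_empty_padding` are hypothetical. Only the XLOG_WASTED_SPACE label comes from the discussion.

```python
WASTED_SPACE = "XLOG_WASTED_SPACE"  # label taken from the thread; the rest is hypothetical


def read_records(records, allow_empty_padding):
    """Return the records a reader would accept before stopping.

    `records` is a list of (kind, length) pairs standing in for one WAL
    stream. A zero length normally means "end of valid log", so the reader
    stops there. With `allow_empty_padding` set, a zero-length padding
    record is accepted and reading continues into the next segment.
    """
    accepted = []
    for kind, length in records:
        if length == 0:
            if allow_empty_padding and kind == WASTED_SPACE:
                accepted.append((kind, length))
                continue
            break  # treated as "record with zero length": stop reading
        accepted.append((kind, length))
    return accepted


log = [("insert", 64), (WASTED_SPACE, 0), ("insert", 64), ("commit", 32)]
print(len(read_records(log, allow_empty_padding=False)))
print(len(read_records(log, allow_empty_padding=True)))
```

Without the flag the reader gives up at the first padding record, no matter how many records follow it; with the flag it reaches the commit.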

The test works, but gives what looks like strange results: the test
blows away the data directory completely, so the then-current xlog dies
too. That contained the commit for the large COPY, so even though the
recovery now works, the table has zero rows in it. (When things die
you're still likely to lose *some* data).

Anyway, so then we put the "concurrent transaction" test back in and the
test passes because we have now set the pointers correctly.

After all that, I think the wasted space idea is still sensible. You
mustn't have a continuation record across files, otherwise we'll end up
with half a commit one day, which would break ACID.

I'm happy that we have the explicit test in XLogInsert for zero-length
records. Somebody will one day write a resource manager with zero length
records when they didn't mean to, and we need to catch that at write
time, not at recovery time as Mark has done. The WasteXLInsertBuffer
was the only part of the code that *can* write a zero-length record, so
we will *not* see another recurrence of this situation --at recovery
time--.

Though further concerns along this theme are:
- what happens when the space at the end of a file is so small we can't
even write a zero-length XLOG_WASTED_SPACE record? Hopefully, you're
gonna say "damn your eyes...couldn't you see that, it's already there".
- if the space at the end of a file was just zeros, then the "concurrent
transaction test" would still fail....we probably need to enhance this
to treat a few zeros at end of file AS IF it was an XLOG_WASTED_SPACE
record and continue. (That scenario would happen if we were doing a
recovery that included a local un-archived xlog that was very close to
being full - probably more likely to occur in crash recovery than
archive recovery)

The included patch doesn't attempt to address those issues, yet.

Best regards, Simon Riggs


Attachment

Re: PITR COPY Failure (was Point in Time Recovery)

From
Simon Riggs
Date:
On Tue, 2004-07-20 at 15:00, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2004-07-20 at 13:51, Tom Lane wrote:
> >> Ugh.  I'm beginning to think we ought to revert the patch that added the
> >> don't-split-across-files logic to XLogInsert; that seems to have broken
> >> more assumptions than I realized.
> 
> > The problem was that a zero length XLOG_WASTED_SPACE record just fell
> > out of ReadRecord when it shouldn't have. By giving it a helping hand it
> > makes it through with pointers correctly set, and everything else was
> > already thought of in the earlier patch, so xlog_redo etc happens.
> 
> Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
> adding two more warts to the code to support it is sticking in my craw.
> I'm thinking it would be cleaner to treat the extra labeling information
> as an extension of the WAL page header.

Sounds like a better solution than scrabbling around at the end of file
with too many edge cases to test properly 

...over to you then...

Best Regards, Simon Riggs



Re: Point in Time Recovery

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:
> > OK, I think I have some solid ideas and reasons for them.
> > 
> 
> Sorry for taking so long to reply...
> 
> > First, I think we need server-side functions to call when we start/stop
> > the backup.  The advantage of these server-side functions is that they
> > will do the required work of recording the pg_control values and
> > creating needed files with little chance for user error.  It also allows
> > us to change the internal operations in later releases without requiring
> > admins to change their procedures.  We are even able to adjust the
> > internal operation in minor releases without forcing a new procedure on
> > users.
> 
> Yes, I think we should go down this route. ....there's a "but" and that
> is we don't absolutely need it for correctness....and so I must decline
> adding it to THIS release. I don't imagine I'll stop be associated with
> this code for a while yet....
> 
> Can we recommend that users should expect to have to call a start and
> end backup routine in later releases? Don't expect you'll agree to
> that..

I guess my big question is that if we don't do this for 7.5 how will
people doing restores know if the xid they specify is valid for the
backup they used.  If we recover to most recent time, is there any check
that will tell them their backup is invalid because there are no archive
records that span the time of their backup?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: PITR COPY Failure (was Point in Time Recovery)

From
Mark Kirkwood
Date:
FYI - I can confirm that the patch fixes the main issue.

Simon Riggs wrote:

>
>This was a very confusing test...Here's what I think happened:
>.....
>The included patch doesn't attempt to address those issues, yet.
>
>Best regards, Simon Riggs


Re: PITR COPY Failure (was Point in Time Recovery)

From
Mark Kirkwood
Date:
This is presumably a standard feature of any PITR design - if the 
failure event destroys the current transaction log, then you can only 
recover transactions that committed in the last *archived* log.

regards

Mark

Simon Riggs wrote:

>
>The test works, but gives what looks like strange results: the test
>blows away the data directory completely, so the then-current xlog dies
>too. That contained the commit for the large COPY, so even though the
>recovery now works, the table has zero rows in it. (When things die
>you're still likely to lose *some* data).


Re: PITR COPY Failure (was Point in Time Recovery)

From
Mark Kirkwood
Date:
Looks good to me. Log file numbering scheme seems to have changed - is 
that part of the fix too?

Tom Lane wrote:

>
>This is done in CVS tip.  Mark, could you retest to verify it's fixed?
>
>            regards, tom lane


Re: PITR COPY Failure (was Point in Time Recovery)

From
Tom Lane
Date:
Mark Kirkwood <markir@coretech.co.nz> writes:
> Looks good to me. Log file numbering scheme seems to have changed - is 
> that part of the fix too?.

That's for timelines ... it's not directly related but I thought I
should put in both changes at once to avoid forcing an extra initdb.
        regards, tom lane


Re: PITR COPY Failure (was Point in Time Recovery)

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2004-07-20 at 15:00, Tom Lane wrote:
>> Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
>> adding two more warts to the code to support it is sticking in my craw.
>> I'm thinking it would be cleaner to treat the extra labeling information
>> as an extension of the WAL page header.

> Sounds like a better solution than scrabbling around at the end of file
> with too many edge cases to test properly 

This is done in CVS tip.  Mark, could you retest to verify it's fixed?
        regards, tom lane


Re: Point in Time Recovery

From
Bruce Momjian
Date:
We need someone to code two backend functions to complete PITR.  The
functions would be called at the start and stop of a backup of the data
directory.  The values they record would be checked during restore to make
sure the requested xid is not between the start/stop xids of the backup.
They would also record timestamps so the admin can easily review the
archive directory.

The start function needs to call checkpoint and create a file in the data
directory that contains a few server parameters.  At backup stop, the
function needs to move the file to pg_xlog and set the *.ready archive
flag so it is archived.

As for checking during recovery, the file needs to be retrieved and
checked to see that the recovery xid is valid.  Tom and I can help you
with that detail.
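
A minimal sketch of the start/stop pair, written in Python purely to show the file handling. Everything named here is an assumption for illustration: the function names `backup_start` and `backup_stop`, the label file name, its fields, and placing the `.ready` flag next to the file. The checkpoint call itself is left out.

```python
import os
import tempfile
import time

LABEL_NAME = "backup_label"  # hypothetical file name


def backup_start(data_dir, xid, wal_position):
    """Write a plain-text label file into the data directory before the backup.

    A real implementation would force a checkpoint first; that step is omitted.
    """
    label = os.path.join(data_dir, LABEL_NAME)
    with open(label, "w") as f:
        f.write("start_xid: %s\n" % xid)
        f.write("start_wal: %s\n" % wal_position)
        f.write("start_time: %s\n" % time.strftime("%Y-%m-%d %H:%M:%S"))
    return label


def backup_stop(data_dir, xid, wal_position):
    """Complete the label, move it to pg_xlog, and flag it for archiving."""
    label = os.path.join(data_dir, LABEL_NAME)
    with open(label, "a") as f:
        f.write("stop_xid: %s\n" % xid)
        f.write("stop_wal: %s\n" % wal_position)
    xlog_dir = os.path.join(data_dir, "pg_xlog")
    os.makedirs(xlog_dir, exist_ok=True)
    dest = os.path.join(xlog_dir, LABEL_NAME)
    os.replace(label, dest)
    # An empty "<name>.ready" file stands in for the archive flag.
    open(dest + ".ready", "w").close()
    return dest


with tempfile.TemporaryDirectory() as demo_dir:
    backup_start(demo_dir, 1000, "000000000000093b/032b9")
    # ... the data directory would be copied here ...
    backup_stop(demo_dir, 1042, "000000000000093c/00100")
    print(sorted(os.listdir(os.path.join(demo_dir, "pg_xlog"))))
```

Because the label is plain text, the restore side could read the start/stop xids back with an ordinary line parser and refuse a recovery target that falls between them.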

Don't worry about all the details of the email below.  It is just a
general summary.  We can give you details once you volunteer.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> 
> OK, I think I have some solid ideas and reasons for them.
> 
> First, I think we need server-side functions to call when we start/stop
> the backup.  The advantage of these server-side functions is that they
> will do the required work of recording the pg_control values and
> creating needed files with little chance for user error.  It also allows
> us to change the internal operations in later releases without requiring
> admins to change their procedures.  We are even able to adjust the
> internal operation in minor releases without forcing a new procedure on
> users.
> 
> Second, I think once we start a restore, we should rename recovery.conf
> to recovery.in_progress, and when complete rename that to
> recovery.done.  If the postmaster starts and sees recovery.in_progress,
> it will fail to start knowing its recovery was interrupted.  This allows
> the admin to take appropriate action.  (I am not sure what that action
> would be. Does he bring back the backup files or just keep going?)
> 
> Third, I think we need to put a file in the archive location once we
> complete a backup, recording the start/stop xid and wal/offsets.  This
> gives the admin documentation on what archive logs to keep and what xids
> are available for recovery.  Ideally the recover program would read that
> file and check the recover xid to make sure it is after the stop xid
> recorded in the file.
> 
> How would the recover program know the name of that file?  We need to
> create it in /data with start contents before the backup, then complete
> it with end contents and archive it.
> 
> What should we name it?  Ideally it would be named by the WAL
> name/offset of the start so it orders in the proper spot in the archive
> file listing, e.g.:
> 
>     000000000000093a
>     000000000000093b
>     000000000000093b.032b9.start
>     000000000000093c
> 
> Are people going to know they need 000000000000093b for
> 000000000000093b.032b9.start?  I hope so.  Another idea is to do:
> 
> 
>     000000000000093a.xlog
>     000000000000093b.032b9.start
>     000000000000093b.xlog
>     000000000000093c.xlog
> 
> This would order properly.  It might be a very good idea to add
> extensions to these log files now that we are archiving them in strange
> places.  In fact, maybe we should use *.pg_xlog to document the
> directory they came from.
> 
> ---------------------------------------------------------------------------
> 
> 
> Simon Riggs wrote:
> > On Fri, 2004-07-16 at 16:47, Tom Lane wrote:
> > > As far as the business about copying pg_control first goes: there is
> > > another way to think about it, which is to copy pg_control to another
> > > place that will be included in your backup.  For example the standard
> > > backup procedure could be
> > > 
> > > 1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
> > > 
> > > 2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
> > > 
> > > 3. tar cf /dev/mt $PGDATA
> > > 
> > > 4. do something to record ending WAL position
> > > 
> > > If we standardized on this way, then the tar archive would automatically
> > > contain the pre-backup checkpoint position in ./pg_control.dump, and
> > > there is no need for any special assumptions about the order in which
> > > tar processes things.
> > > 
> > 
> > Sounds good. That would be familiar to Oracle DBAs doing BACKUP
> > CONTROLFILE. We can document that and off it as a suggested procedure.
> > 
> > > However, once you decide to do things like that, there is no reason why
> > > the copied file has to be an exact image of pg_control.  I claim it
> > > would be more useful if the copied file were plain text so that you
> > > could just "cat" it to find out the starting WAL position; that would
> > > let you determine without any special tools what range of WAL archive
> > > files you are going to need to bring back from your archives.
> > 
> > I wouldn't be in favour of a manual mechanism. If you want an automated
> > mechanism, whats wrong with using the one thats already there? You can
> > use pg_controldata to read the controlfile, again whats wrong with that?
> > 
> > We agreed some time back that an off-line xlog file inspector would be
> > required to allow us to inspect the logs and make a decision about where
> > to end recovery. You'd still need that.
> > 
> > It's scary enough having to specify the end point, let alone having to
> > specify the starting point as well.
> > 
> > At your request, and with Bruce's idea, I designed and built the
> > recovery system so that you don't need to know what range of xlogs to
> > bring back. You just run it, it brings back the right files from archive
> > and does recovery with them, then cleans up - and it works without
> > running out of disk space on long recoveries.
> > 
> > I've built it now and it works...
> > 
> > > This is pretty much the same chain of reasoning that Bruce and I went
> > > through yesterday to come up with the idea of putting a label file
> > > inside the tar backups.  We concluded that it'd be worth putting
> > > both the backup starting time and the checkpoint WAL position into
> > > the label file --- the starting time isn't needed for restore but
> > > might be really helpful as documentation, if you needed to verify
> > > which dump file was which.
> > 
> > ...if you are doing tar backups...what will you do if you're not using
> > that mechanism?
> > 
> > If you are: It's common practice to make up a backup filename from
> > elements such as systemname, databasename, date and time etc. That gives
> > you the start time, the file last mod date gives you the end time. 
> > 
> > I think its perfectly fine for everybody to do backups any way they
> > please. There are many licenced variants of PostgreSQL and it might be
> > appropriate in those to specify particular ways of doing things.
> > 
> > I'll be trusting the management of backup metadata and storage media to
> > a solution designed for the purpose (open or closed source), just as
> > I'll be trusting my data to a database solution designed for that
> > purpose. That for me is one of the good things about PostgreSQL - we use
> > the filesystem, we don't write our own, we provide language interfaces
> > not invent our own proprietary server language etc..
> > 
> > Best Regards, Simon Riggs
> > 
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> 
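
The ordering claim in the file-naming proposal quoted above can be checked with a few lines of Python. This is only a sketch: it uses the example names from the message and Python's plain code-point ordering, which is an assumption about how the archive listing would be sorted (a locale-aware `ls` may order differently).

```python
# Names from the first proposal: bare segment names plus a ".start" label file.
plain = [
    "000000000000093c",
    "000000000000093b.032b9.start",
    "000000000000093a",
    "000000000000093b",
]

# Names from the second proposal: segments carry an ".xlog" extension.
suffixed = [
    "000000000000093c.xlog",
    "000000000000093b.xlog",
    "000000000000093b.032b9.start",
    "000000000000093a.xlog",
]

for names in (plain, suffixed):
    for name in sorted(names):
        print(name)
    print()
```

With bare names the label file sorts just after the segment it refers to; with the `.xlog` extension it sorts just before it, matching the two listings in the message.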



Re: Point in Time Recovery

From
Bruce Momjian
Date:
Oh, here is something else we need to add --- a GUC to control whether
pg_xlog is clean on recovery start.

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce and I had another phone chat about the problems that can ensue
> if you restore a tar backup that contains old (incompletely filled)
> versions of WAL segment files.  While the current code will ignore them
> during the recovery-from-archive run, leaving them laying around seems
> awfully dangerous.  One nasty possibility is that the archiving
> mechanism will pick up these files and overwrite good copies in the
> archive area with the obsolete ones from the backup :-(.
> 
> Bruce earlier proposed that we simply "rm pg_xlog/*" at the start of
> a recovery-from-archive run, but as I said I'm scared to death of code
> that does such a thing automatically.  In particular this would make it
> impossible to handle scenarios where you want to do a PITR recovery but
> you need to use some recent WAL segments that didn't make it into your
> archive yet.  (Maybe you could get around this by forcibly transferring
> such segments into the archive, but that seems like a bad idea for
> incomplete segments.)
> 
> It would really be best for the DBA to make sure that the starting
> condition for the recovery run does not have any obsolete segment files
> in pg_xlog.  He could do this either by setting up his backup policy so
> that pg_xlog isn't included in the tar backup in the first place, or by
> manually removing the included files just after restoring the backup,
> before he tries to start the recovery run.
> 
> Of course the objection to that is "what if the DBA forgets to do it?"
> 
> The idea that we came to on the phone was for the postmaster, when it
> enters recovery mode because a recovery.conf file exists, to look in
> pg_xlog for existing segment files and refuse to start if any are there
> --- *unless* the user has put a special, non-default overriding flag
> into recovery.conf.  Call it "use_unarchived_files" or something like
> that.  We'd have to provide good documentation and an extensive HINT of
> course, but basically the DBA would have two choices when he gets this
> refusal to start:
> 
> 1. Remove all the segment files in pg_xlog.  (This would be the right
> thing to do if he knows they all came off the backup.)
> 
> 2. Verify that pg_xlog contains only segment files that are newer than
> what's stored in the WAL archive, and then set the override flag in
> recovery.conf.  In this case the DBA is taking responsibility for
> leaving only segment files that are good to use.
> 
> One interesting point is that with such a policy, we could use locally
> available WAL segments in preference to pulling the same segments from
> archive, which would be at least marginally more efficient, and seems
> logically cleaner anyway.
> 
> In particular it seems that this would be a useful arrangement in cases
> where you have questionable WAL segments --- you're not sure if they're
> good or not.  Rather than having to push questionable data into your WAL
> archive, you can leave it local, try a recovery run, and see if you like
> the resulting state.  If not, it's a lot easier to do-over when you have
> not corrupted your archive area.
> 
> Comments?  Better ideas?
> 
>             regards, tom lane
> 
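
The startup check Tom describes above could look roughly like the following Python sketch. It is an illustration only: the function name `check_recovery_start` is hypothetical, the override is passed as an argument rather than parsed out of recovery.conf, and every file in pg_xlog is treated as a segment file.

```python
import os
import tempfile


def check_recovery_start(data_dir, use_unarchived_files=False):
    """Decide whether a recovery-from-archive run may start.

    Returns "normal startup" when there is no recovery.conf, "recovery" when
    recovery may proceed, and raises RuntimeError when pg_xlog already holds
    segment files and the override flag is not set.
    """
    if not os.path.exists(os.path.join(data_dir, "recovery.conf")):
        return "normal startup"
    xlog_dir = os.path.join(data_dir, "pg_xlog")
    segments = sorted(os.listdir(xlog_dir)) if os.path.isdir(xlog_dir) else []
    if segments and not use_unarchived_files:
        raise RuntimeError(
            "pg_xlog contains segment files: %s\n"
            "HINT: remove them if they came from the backup, or set "
            "use_unarchived_files if they are newer than the archive."
            % ", ".join(segments)
        )
    return "recovery"


with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "pg_xlog"))
    open(os.path.join(demo_dir, "pg_xlog", "000000000000093b"), "w").close()
    open(os.path.join(demo_dir, "recovery.conf"), "w").close()
    try:
        check_recovery_start(demo_dir)
    except RuntimeError as err:
        print(err)
    print(check_recovery_start(demo_dir, use_unarchived_files=True))
```

The refusal is the default, so a DBA who forgets to clean pg_xlog gets an error and a hint instead of a silent replay of stale segments.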



Re: Point in Time Recovery

From
Mark Kirkwood
Date:
I was wondering about this point - might it not be just as reasonable 
for the copied file to *be* an exact image of pg_control?  Then a very 
simple variant of pg_controldata (or maybe even just adding switches to 
pg_controldata itself) would enable the relevant info to be extracted.

P.S.: would love to be that volunteer - however up to the eyeballs in 
Business Objects (cringe) and Db2 for the next week or so....

regards

Mark

Bruce Momjian wrote:

>We need someone to code two backend functions to complete PITR. 
>
>>><snippage>
>>>
>>>>However, once you decide to do things like that, there is no reason why
>>>>the copied file has to be an exact image of pg_control.  I claim it
>>>>would be more useful if the copied file were plain text so that you
>>>>could just "cat" it to find out the starting WAL position; that would
>>>>let you determine without any special tools what range of WAL archive
>>>>files you are going to need to bring back from your archives.


Re: Point in Time Recovery

From
Bruce Momjian
Date:
Mark Kirkwood wrote:
> I was wondering about this point - might it not be just as reasonable 
> for the copied file to *be* an exact image of pg_control?  Then a very 
> simple variant of pg_controldata (or maybe even just adding switches to 
> pg_controldata itself) would enable the relevant info to be extracted

We didn't do that so admins could easily read the file contents.



Re: Point in Time Recovery

From
markir@coretech.co.nz
Date:
Quoting Bruce Momjian <pgman@candle.pha.pa.us>:

> Mark Kirkwood wrote:
> > I was wondering about this point - might it not be just as reasonable
> > for the copied file to *be* an exact image of pg_control?  Then a very
> > simple variant of pg_controldata (or maybe even just adding switches to
> > pg_controldata itself) would enable the relevant info to be extracted
>
> We didn't do that so admins could easily read the file contents.
>
Ease of reading is a good thing, no argument there.

However, using 'pg_controldata' (or similar) to perform the read is not really
that much harder than using 'cat' (it is a wee bit harder, I grant you).

When I posted the original mail I was thinking that the pg_control image is good
because it has much more information than just the last WAL offset, and could
be used to perform a recovery in the event of the "actual" pg_control being
unsuitable (e.g. backed up last instead of first on a busy system).

Of course this thinking didn't make it into the original mail, sorry about that!

regards

Mark




Re: Point in Time Recovery

From
"Zeugswetter Andreas SB SD"
Date:
> > I was wondering about this point - might it not be just as reasonable
> > for the copied file to *be* an exact image of pg_control?  Then a very
> > simple variant of pg_controldata (or maybe even just adding switches to
> > pg_controldata itself) would enable the relevant info to be extracted
>
> We didn't do that so admins could easily read the file contents.

If you use a readable file you will also need a feature for restore (or a tool)
to create an appropriate pg_control file, or are you intending to still require
that pg_control be the first file backed up?
Another possibility would be that the start function writes the readable file and
also copies pg_control.

Andreas


Re: Point in Time Recovery

From
Bruce Momjian
Date:
Zeugswetter Andreas SB SD wrote:
> 
> > > I was wondering about this point - might it not be just as reasonable 
> > > for the copied file to *be* an exact image of pg_control?  Then a very 
> > > simple variant of pg_controldata (or maybe even just adding switches to 
> > > pg_controldata itself) would enable the relevant info to be extracted
> > 
> > We didn't do that so admins could easily read the file contents.
> 
> If you use a readable file you will also need a feature for restore (or a tool) 
> to create an appropriate pg_control file, or are you intending to still require
> that pg_control be the first file backed up. 
> Another possibility would be that the start function writes the readable file and
> also copies pg_control.

We will back up pg_control in the tar file, but it doesn't have to contain
all the correct information. WAL replay will set it properly, I think.
In fact, it has to start with the recovery checkpoint settings, not the
backup settings at all.

 


Re: Point in Time Recovery

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> If you use a readable file you will also need a feature for restore
> (or a tool) to create an appropriate pg_control file, or are you
> intending to still require that pg_control be the first file backed
> up.

No, the entire point of this exercise is to get rid of that assumption.
You do need *a* copy of pg_control, but the only reason you'd need to
back it up first rather than later is so that its checkpoint pointer
points to the last checkpoint before the dump starts.  Which is the
information we want to put in the archive-label file instead.

If a copy of pg_control were sufficient then I'd be all for using it as
the archive-label file, but it's *not* sufficient because you also need
the ending WAL offset.  So we need a different file layout in any case,
and we may as well take some pity on the poor DBA and make the file
easily human-readable.
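
To make the idea concrete, here is a minimal sketch of how a restore script
might read such a human-readable label and work out which span of WAL it has
to fetch from the archive. Everything specific in it is an assumption made
for illustration only: the key names "START WAL LOCATION" and "STOP WAL
LOCATION", the "hi/lo" hexadecimal position syntax, the 16 MB segment size,
and the flat segment numbering are not a file layout agreed in this thread.

```python
# Hypothetical sketch: none of these names or formats are a settled design.
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed WAL segment size in bytes


def parse_wal_position(text):
    """Turn 'hi/lo' (two hex numbers) into a single integer byte position."""
    hi, lo = text.strip().split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_label(label_text):
    """Read 'KEY: value' lines into a dict; the key names are hypothetical."""
    fields = {}
    for line in label_text.splitlines():
        if ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def segments_needed(label_text):
    """Return the flat segment numbers covering start..stop, inclusive."""
    fields = parse_label(label_text)
    start = parse_wal_position(fields["START WAL LOCATION"])
    stop = parse_wal_position(fields["STOP WAL LOCATION"])
    if stop < start:
        raise ValueError("stop position precedes start position")
    return list(range(start // SEGMENT_SIZE, stop // SEGMENT_SIZE + 1))


example_label = """START WAL LOCATION: 0/9000020
STOP WAL LOCATION: 0/B0000F0
"""

print(segments_needed(example_label))
```

The point of the sketch is only that a plain "KEY: value" text file can carry
both the starting and the ending WAL position, and that either a person using
"cat" or a few lines of script can consume it. Mapping segment numbers to real
WAL file names is deliberately left out.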
        regards, tom lane


Re: Point in Time Recovery

From
Mark Kirkwood
Date:
Ok - that is a much better way of doing it!

regards

Mark

Tom Lane wrote:

>"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>  
>
>>If you use a readable file you will also need a feature for restore
>>(or a tool) to create an appropriate pg_control file, or are you
>>intending to still require that pg_control be the first file backed
>>up.
>>    
>>
>
>No, the entire point of this exercise is to get rid of that assumption.
>You do need *a* copy of pg_control, but the only reason you'd need to
>back it up first rather than later is so that its checkpoint pointer
>points to the last checkpoint before the dump starts.  Which is the
>information we want to put in the archive-label file instead.
>
>If a copy of pg_control were sufficient then I'd be all for using it as
>the archive-label file, but it's *not* sufficient because you also need
>the ending WAL offset.  So we need a different file layout in any case,
>and we may as well take some pity on the poor DBA and make the file
>easily human-readable.
>
>            regards, tom lane
>