Thread: Point in Time Recovery
Taking advantage of the freeze bubble allowed us... there are some last minute features to add. Summarising earlier thoughts, with some detailed digging and design from myself over the last few days, we're now in a position to add Point-in-Time Recovery on top of what's been achieved.

The target for the last record to recover to can be specified in two ways:
- by TransactionId - not that useful, unless you have a means of identifying what has happened from the log, then using that info to specify how to recover - coming later - not in next few days :(
- by time - but the time stamp on each xlog record only specifies to the second, which could easily be 10 or more commits (we hope....)

Should we use a different datatype than time_t for the commit timestamp, one that offers more fine grained differentiation between checkpoints? If we did, would that be portable? Suggestions welcome, because I know very little of the details of various *nix systems and win* on that topic.

Only COMMIT and ABORT records have timestamps, allowing us to circumvent any discussion about partial transaction recovery and nested transactions.

When we do recover, stopping at the timestamp is just half the battle. We need to leave the xlog in which we stop in a state from which we can enter production smoothly and cleanly. To do this, we could:
- when we stop, keep reading records until EOF, just don't apply them. When we write a checkpoint at end of recovery, the unapplied transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible record remains of previous transactions.

I'm tempted by the first plan, because it is more straightforward and stands much less chance of me introducing 50 weird bugs just before close.

Also, I think it is straightforward to introduce control file duplexing, with a second copy stored and maintained in the pg_xlog directory. This would provide additional protection for pg_control, which takes on more importance now that archive recovery is working. pg_xlog is a natural home, since on busy systems it's on its own disk away from everything else, ensuring that at least one copy survives. I can't see a downside to that, but others might... We can introduce user-specifiable duplexing in later releases.

For later, I envisage an off-line utility that can be used to inspect xlog records. This could provide a number of features:
- validate archived xlogs, to check they are sound
- produce summary reports, to allow identification of TransactionIds and the effects of particular transactions
- performance analysis to allow decisions to be made about whether group commit features could be utilised to good effect (Not now...)

Best regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Should we use a different datatype than time_t for the commit timestamp,
> one that offers more fine grained differentiation between checkpoints?

Pretty much everybody supports gettimeofday() (time_t and separate integer microseconds); you might as well use that. Note that the actual resolution is not necessarily microseconds, and it'd still not be certain that successive commits have distinct timestamps --- so maybe this refinement would be pointless. You'll still have to design a user interface that allows selection without the assumption of distinct timestamps.

> - when we stop, keep reading records until EOF, just don't apply them.
> When we write a checkpoint at end of recovery, the unapplied
> transactions are buried alive, never to return.
> - stop where we stop, then force zeros to EOF, so that no possible
> record remains of previous transactions.

Go with plan B; it's best not to destroy data (what if you chose the wrong restart point the first time?).

Actually this now reminds me of a discussion I had with Patrick Macdonald some time ago. The DB2 practice in this connection is that you *never* overwrite existing logfile data when recovering. Instead you start a brand new xlog segment file, which is given a new "branch number" so it can be distinguished from the future-time xlog segments that you chose not to apply. I don't recall what the DB2 terminology was exactly --- not "branch number" I don't think --- but anyway the idea is that when you restart the database after an incomplete recovery, you are now in a sort of parallel universe that has its own history after the branch point (PITR stop point). You need to be able to distinguish archived log segments of this parallel universe from those of previous and subsequent incarnations. I'm not sure whether Vadim intended our StartUpID to serve this purpose, but it could perhaps be used that way, if we reflected it in the WAL file names.

regards, tom lane
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Should we use a different datatype than time_t for the commit timestamp,
> > one that offers more fine grained differentiation between checkpoints?
>
> Pretty much everybody supports gettimeofday() (time_t and separate
> integer microseconds); you might as well use that. Note that the actual
> resolution is not necessarily microseconds, and it'd still not be
> certain that successive commits have distinct timestamps --- so maybe
> this refinement would be pointless. You'll still have to design a user
> interface that allows selection without the assumption of distinct
> timestamps.

Well, I agree, though without the desired-for UI now, I think some finer-grained mechanism would be good. This means extending the xlog commit record by a couple of bytes... OK, let's live a little.

> > - when we stop, keep reading records until EOF, just don't apply them.
> > When we write a checkpoint at end of recovery, the unapplied
> > transactions are buried alive, never to return.
> > - stop where we stop, then force zeros to EOF, so that no possible
> > record remains of previous transactions.
>
> Go with plan B; it's best not to destroy data (what if you chose the
> wrong restart point the first time)?

Eh? Which way round? The second plan was the one where I would destroy data by overwriting it; that's exactly why I preferred the first. Actually, the files are always copied from archive, so re-recovery is always an available option in the design that's been implemented. No matter...

> Actually this now reminds me of a discussion I had with Patrick
> Macdonald some time ago. The DB2 practice in this connection is that
> you *never* overwrite existing logfile data when recovering. Instead
> you start a brand new xlog segment file,

Now that's a much better plan... I suppose I just have to rack up the recovery pointer to the first record on the first page of a new xlog file - similar to the first plan, but fast-forwarding rather than forwarding. My only issue was to do with the secondary checkpoint marker, which is always reset to the place you just restored FROM when you complete a recovery. That could lead to a situation where you recover, then before the next checkpoint, fail and lose the last checkpoint marker, then crash-recover from the previous checkpoint (again), but this time replay the records you were careful to avoid.

> which is given a new "branch
> number" so it can be distinguished from the future-time xlog segments
> that you chose not to apply. I don't recall what the DB2 terminology
> was exactly --- not "branch number" I don't think --- but anyway the
> idea is that when you restart the database after an incomplete recovery,
> you are now in a sort of parallel universe that has its own history
> after the branch point (PITR stop point). You need to be able to
> distinguish archived log segments of this parallel universe from those
> of previous and subsequent incarnations.

That's a good idea, if only because you so easily screw your test data during multiple recovery situations. But if it's good during testing, it must be good in production too... since you may well perform recovery... run for a while, then discover that you got it wrong first time, then need to re-recover again. I already added that to my list of gotchas and that would solve it. I was going to say hats off to the blue-hued ones, when I remembered this little gem from last year:
http://www.danskebank.com/link/ITreport20030403uk/$file/ITreport20030403uk.pdf

> I'm not sure whether Vadim
> intended our StartUpID to serve this purpose, but it could perhaps be
> used that way, if we reflected it in the WAL file names.

Well, I'm not sure about StartUpID... but certainly the high 2 bytes of LogId look pretty certain never to be anything but zeros. You have 2.4 x 10^14... which is 9,000 years at 1000 log files/sec. We could use the scheme you describe: add 0xFFFF to the LogId every time you complete an archive recovery... so the log files look like 0001000000000CE3 after you've recovered a load of files that look like 0000000000000CE3.

If you used StartUpID directly, you might just run out... but it's very unlikely you would ever perform 65000 recovery situations - unless you've run the <expletive> code as often as I have :(. Doing it this way also means we don't have to work out how to do that with StartUpID. Of course, altering the length and makeup of the xlog files is possible too, but that will cause other stuff to stop working....

[We'll have to give this a non-SciFi name, unless we want to make in-roads into the Dr. Who fanbase :) Don't get them started. Better still, don't give it a name at all.]

I'll sleep on that lot.

Best regards, Simon Riggs
> - by time - but the time stamp on each xlog record only specifies to the
> second, which could easily be 10 or more commits (we hope....)
>
> Should we use a different datatype than time_t for the commit timestamp,
> one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position you will probably know it from the backend log or an xlog sniffer. With those you can easily use the TransactionId way.

> - when we stop, keep reading records until EOF, just don't apply them.
> When we write a checkpoint at end of recovery, the unapplied
> transactions are buried alive, never to return.
> - stop where we stop, then force zeros to EOF, so that no possible
> record remains of previous transactions.
> I'm tempted by the first plan, because it is more straightforward and
> stands much less chance of me introducing 50 weird bugs just before
> close.

But what if you restore because after that PIT everything went haywire, including the log? Did you mean to apply the remaining changes but not commit those xids? I think it is essential to only leave xlogs around that allow a subsequent roll-forward from the same old full backup. Or is an instant new full backup required after restore?

Andreas
Simon Riggs wrote:
> On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
>
>> Simon Riggs <simon@2ndquadrant.com> writes:
>>
>>> Should we use a different datatype than time_t for the commit timestamp,
>>> one that offers more fine grained differentiation between checkpoints?
>>
>> Pretty much everybody supports gettimeofday() (time_t and separate
>> integer microseconds); you might as well use that. Note that the actual
>> resolution is not necessarily microseconds, and it'd still not be
>> certain that successive commits have distinct timestamps --- so maybe
>> this refinement would be pointless. You'll still have to design a user
>> interface that allows selection without the assumption of distinct
>> timestamps.
>
> Well, I agree, though without the desired-for UI now, I think some finer
> grained mechanism would be good. This means extending the xlog commit
> record by a couple of bytes...OK, lets live a little.

At the risk of irritating people, I'll repeat what I suggested a few weeks ago...

Add a table:

  pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz, pitr_comment text)

Let the user insert rows in transactions as desired. Let them stop the restore when a specific (pitr_ts, pitr_comment) gets inserted (or on pitr_id if they record it).

IMHO time is seldom relevant, event boundaries are. If you want to add special syntax for this, fine. If not, an INSERT statement is a convenient way to do this anyway.

-- Richard Huxton Archonet Ltd
On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
> > - by time - but the time stamp on each xlog record only specifies to the
> > second, which could easily be 10 or more commits (we hope....)
> >
> > Should we use a different datatype than time_t for the commit timestamp,
> > one that offers more fine grained differentiation between checkpoints?
>
> Imho seconds is really sufficient. If you know a more precise position
> you will probably know it from backend log or an xlog sniffer. With those
> you can easily use the TransactionId way.

OK, thanks. I'll leave the time_t datatype just the way it is.

Best Regards, Simon Riggs
On Tue, 2004-07-06 at 20:00, Richard Huxton wrote:
> Simon Riggs wrote:
> > On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> >
> >> Simon Riggs <simon@2ndquadrant.com> writes:
> >>
> >>> Should we use a different datatype than time_t for the commit timestamp,
> >>> one that offers more fine grained differentiation between checkpoints?
> >>
> >> Pretty much everybody supports gettimeofday() (time_t and separate
> >> integer microseconds); you might as well use that. Note that the actual
> >> resolution is not necessarily microseconds, and it'd still not be
> >> certain that successive commits have distinct timestamps --- so maybe
> >> this refinement would be pointless. You'll still have to design a user
> >> interface that allows selection without the assumption of distinct
> >> timestamps.
> >
> > Well, I agree, though without the desired-for UI now, I think some finer
> > grained mechanism would be good. This means extending the xlog commit
> > record by a couple of bytes...OK, lets live a little.
>
> At the risk of irritating people, I'll repeat what I suggested a few
> weeks ago...

All feedback is good. Thanks.

> Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz,
> pitr_comment text)
> Let the user insert rows in transactions as desired. Let them stop the
> restore when a specific (pitr_ts,pitr_comment) gets inserted (or on
> pitr_id if they record it).

It's a good plan, but the recovery is currently offline recovery and no SQL is possible. So no way to insert, no way to access tables until recovery completes. I like that plan and probably would have used it if it were viable.

> IMHO time is seldom relevant, event boundaries are.

Agreed, but time is the universally agreed way of describing two events as being simultaneous. There is no other way to say "recover to the point when the message queue went wild". As of my last post to Andreas, I've said I'll not bother changing the granularity of the timestamp.

> If you want to add special syntax for this, fine. If not, an INSERT
> statement is a convenient way to do this anyway.

The special syntax isn't hugely important - I did suggest a kind of SQL-like syntax previously, but that's gone now. Invoking recovery via a command file IS, so we are able to tell the system it's not in crash recovery AND that, when it's finished, I want it to respond to crashes without re-entering archive recovery.

Thanks for your comments. I'm not making this more complex than it needs to be; in fact much of the code is very simple - it's just the planning that's complex.

Best regards, Simon Riggs
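[The command file Simon mentions might look something like the following - a purely illustrative sketch; the file location and parameter names here are invented for the example, not a committed interface:]

```
# hypothetical recovery command file, placed in the data directory so the
# postmaster knows this is an archive recovery, not a crash recovery
restore_command = 'cp /mnt/archive/%f %p'      # how to fetch archived xlogs
recovery_target_time = '2004-07-06 14:02:11'   # stop at this commit timestamp
```

[Its presence distinguishes archive recovery from crash recovery; its removal (or renaming) at completion ensures a subsequent crash restarts normal crash recovery rather than re-entering archive recovery.]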
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > - when we stop, keep reading records until EOF, just don't apply them.
> > When we write a checkpoint at end of recovery, the unapplied
> > transactions are buried alive, never to return.
> > - stop where we stop, then force zeros to EOF, so that no possible
> > record remains of previous transactions.
>
> Go with plan B; it's best not to destroy data (what if you chose the
> wrong restart point the first time)?
>
> Actually this now reminds me of a discussion I had with Patrick
> Macdonald some time ago. The DB2 practice in this connection is that
> you *never* overwrite existing logfile data when recovering. Instead
> you start a brand new xlog segment file, which is given a new "branch
> number" so it can be distinguished from the future-time xlog segments
> that you chose not to apply. I don't recall what the DB2 terminology
> was exactly --- not "branch number" I don't think --- but anyway the
> idea is that when you restart the database after an incomplete recovery,
> you are now in a sort of parallel universe that has its own history
> after the branch point (PITR stop point). You need to be able to
> distinguish archived log segments of this parallel universe from those
> of previous and subsequent incarnations. I'm not sure whether Vadim
> intended our StartUpID to serve this purpose, but it could perhaps be
> used that way, if we reflected it in the WAL file names.

Some more thoughts... focusing on what we do after we've finished recovering. The objectives, as I see them, are to put the system into a state that preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we ensure that any xlog records skipped at the admin's request (in PITR mode) are never in a position to be re-applied to this timeline
4. we ensure we can re-recover, if we need to, without further problems

Tom's concept above I'm going to call timelines. A timeline is the sequence of logs created by the execution of a server. If you recover the database, you create a new timeline. [This is because, if you've invoked PITR, you absolutely definitely want log records written to, say, xlog15 to be different from those that were written to xlog15 in a previous timeline that you have chosen not to reapply.]

Objective (1) is complex. When we are restoring, we always start with archived copies of the xlog, to make sure we don't finish too soon. We roll forward until we either reach the PITR stop point, or we hit the end of the archived logs. If we hit the end of the logs in the archive, then we switch to a local copy, if one exists that is higher than those, and carry on rolling forward until either we reach the PITR stop point or we hit the end of that log. (Hopefully there isn't more than one local xlog higher than the archive, but it's possible.) If we are rolling forward on local copies, then they are our only copies. We'd really like to archive them ASAP, but the archiver's not running yet - and we don't want to force that situation in case the archive device (say a tape) is the one being used to recover right now. So we write an archive_status of .ready for that file, ensuring that the checkpoint won't remove it until it gets copied to archive, whenever that starts working again. Objective (1) met.

When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches

That way, when we come up and begin running, we never overwrite files that might have been written previously. Objective (2) met. We do the other stuff because recovery finishes up by pointing to the last applied record... which is what was causing all of this extra work in the first place.

At this point, we also reset the secondary checkpoint record, so that should recovery be required again before the next checkpoint AND the shutdown checkpoint record written after recovery completes is wrong/damaged, the recovery will not auto-rewind back past the PITR stop point and attempt to recover the records we have just tried so hard to reverse/ignore. Objective (3) met. (Clearly, that situation seems unlikely, but I feel we must deal with it... a newly restored system is actually very fragile, so a crash again within 3 minutes or so is very commonplace, as far as these things go.)

Should we need to re-recover, we can do so because the new timeline xlogs are further forward than the old timeline, so they never get seen by any processes (all of which look backwards). Re-recovery is possible without problems, if required. Objective (4) met. This means you're a lot safer from some of the mistakes you might have made, such as deciding you need to go into recovery, then realising it wasn't required (or some other painful flapping as goes on in computer rooms at 3am).

How do we implement timelines? The main presumption in the code is that xlogs are sequential. That has two effects:
1. during recovery, we try to open the "next" xlog by adding one to the numbers and then looking for that file
2. during checkpoint, we look for filenames less than the current checkpoint marker

Creating a timeline by adding a larger number to the LogId allows us to prevent (1) from working, yet without breaking (2). Well, Tom does seem to have something with regard to StartUpIDs. I feel it is easier to force a new timeline by adding a very large number to the LogId IF, and only if, we have performed an archive recovery. That way, we do not change at all the behaviour of the system for people who choose not to use archive_mode.

Should we implement timelines? Yes, I think we should. I've already hit the problems that timelines solve in my testing, and so that means they'll be hit in production, when you don't need the hassle.

Comments much appreciated, assuming you read this far...

Best regards, Simon Riggs
> Well, Tom does seem to have something with regard to StartUpIds. I feel
> it is easier to force a new timeline by adding a very large number to
> the LogId IF, and only if, we have performed an archive recovery. That
> way, we do not change at all the behaviour of the system for people that
> choose not to implement archive_mode.

Imho you should take a close look at StartUpId; I think it is exactly this "large number". Maybe you can add +2 to intentionally leave a hole.

Once you increment, I think it is very essential to checkpoint and double-check pg_control, because otherwise a crash recovery would read the wrong xlogs.

> Should we implement timelines?

Yes :-)

Andreas
On Wed, 2004-07-07 at 14:17, Zeugswetter Andreas SB SD wrote:
> > Well, Tom does seem to have something with regard to StartUpIds. I feel
> > it is easier to force a new timeline by adding a very large number to
> > the LogId IF, and only if, we have performed an archive recovery. That
> > way, we do not change at all the behaviour of the system for people that
> > choose not to implement archive_mode.
>
> Imho you should take a close look at StartUpId, I think it is exactly this
> "large number". Maybe you can add +2 to intentionally leave a hole.
>
> Once you increment, I think it is very essential to checkpoint and double
> check pg_control, cause otherwise a crash recovery would read the wrong xlogs.

Thanks for your thoughts - you have made me rethink this over some hours. The trouble is, on this occasion, the other suggestion still seems the best one, IMVHO.

If we number timelines based upon StartUpId, then there is still some potential for conflict, and that is what we're seeking to avoid. Simply adding 0xFFFF to the LogId puts the new timeline so far into the previous timeline's future that there aren't any problems. We only increment the timeline when we recover, so we're not eating up the numbers quickly. Simply adding a number means that there isn't any conflict with any normal operations.

The timelines aren't numbered separately, so I'm not introducing a new version of StartUpID... technically there isn't a new timeline, just a chunk of geological time between them. We don't need to mention timelines in the docs, nor do we need to alter pg_controldata to display it... just a comment in the code to explain why we add a large number to the LogId after each recovery completes.

Best regards, Simon Riggs
On Thu, 2004-07-08 at 07:57, spock@mgnet.de wrote:
> On Thu, 8 Jul 2004, Simon Riggs wrote:
>
> > We don't need to mention timelines in the docs, nor do we need to alter
> > pg_controldata to display it...just a comment in the code to explain why
> > we add a large number to the LogId after each recovery completes.
>
> I'd disagree on that. Knowing what exactly happens when recovering the
> database is a must. It makes people feel more safe about the process,
> because then the software doesn't do things you wouldn't expect.
>
> On Oracle e.g. you create a new database incarnation when you recover a
> database (incomplete), which means your System Change Number and your Log
> Sequence are reset to 0.
> Knowing this is crucial because you otherwise would wonder "Why is it
> doing that? Is this a bug or a feature?"

OK, will do.

Best regards, Simon Riggs
On Tue, 6 Jul 2004, Zeugswetter Andreas SB SD wrote:
> > Should we use a different datatype than time_t for the commit timestamp,
> > one that offers more fine grained differentiation between checkpoints?
>
> Imho seconds is really sufficient. If you know a more precise position
> you will probably know it from backend log or an xlog sniffer. With those
> you can easily use the TransactionId way.

I'd also think that seconds are absolutely sufficient. From my daily experience I can say that you're normally lucky to know the time on a one-minute basis. If you need to get closer, a log sniffer is unavoidable...

Greetings, Klaus

--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn@mgnet.de)
On Thu, 8 Jul 2004, Simon Riggs wrote:
> We don't need to mention timelines in the docs, nor do we need to alter
> pg_controldata to display it...just a comment in the code to explain why
> we add a large number to the LogId after each recovery completes.

I'd disagree on that. Knowing what exactly happens when recovering the database is a must. It makes people feel more safe about the process, because then the software doesn't do things you wouldn't expect.

On Oracle e.g. you create a new database incarnation when you recover a database (incomplete), which means your System Change Number and your Log Sequence are reset to 0. Knowing this is crucial because you otherwise would wonder "Why is it doing that? Is this a bug or a feature?"

Just my 2ct :-))

Greetings, Klaus

--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn@mgnet.de)
On 7/6/2004 3:58 PM, Simon Riggs wrote:
> On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
>> > - by time - but the time stamp on each xlog record only specifies to the
>> > second, which could easily be 10 or more commits (we hope....)
>> >
>> > Should we use a different datatype than time_t for the commit timestamp,
>> > one that offers more fine grained differentiation between checkpoints?
>>
>> Imho seconds is really sufficient. If you know a more precise position
>> you will probably know it from backend log or an xlog sniffer. With those
>> you can easily use the TransactionId way.

TransactionId and timestamp are only sufficient if the transactions are selected by their commit order. Especially in read committed mode, consider this execution:

    xid-1: start
    xid-2: start
    xid-2: update field x
    xid-2: commit
    xid-1: update field y
    xid-1: commit

In this case, the update done by xid-1 depends on the row created by xid-2. So logically xid-2 precedes xid-1, because it made its changes earlier.

So you have to apply the log until you find the commit record of the transaction you want to apply last, and then stamp all transactions that were in progress at that time as aborted.

Jan

> OK, thanks. I'll just leave the time_t datatype just the way it is.
>
> Best Regards, Simon Riggs

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Sat, 2004-07-10 at 15:17, Jan Wieck wrote:
> On 7/6/2004 3:58 PM, Simon Riggs wrote:
>
> > On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
> >> > - by time - but the time stamp on each xlog record only specifies to the
> >> > second, which could easily be 10 or more commits (we hope....)
> >> >
> >> > Should we use a different datatype than time_t for the commit timestamp,
> >> > one that offers more fine grained differentiation between checkpoints?
> >>
> >> Imho seconds is really sufficient. If you know a more precise position
> >> you will probably know it from backend log or an xlog sniffer. With those
> >> you can easily use the TransactionId way.
>
> TransactionID and timestamp is only sufficient if the transactions are
> selected by their commit order. Especially in read committed mode,
> consider this execution:
>
>     xid-1: start
>     xid-2: start
>     xid-2: update field x
>     xid-2: commit
>     xid-1: update field y
>     xid-1: commit
>
> In this case, the update done by xid-1 depends on the row created by
> xid-2. So logically xid-2 precedes xid-1, because it made its changes
> earlier.
>
> So you have to apply the log until you find the commit record of the
> transaction you want apply last, and then stamp all transactions that
> where in progress at that time as aborted.

Agreed. I've implemented this exactly as you say.... This turns out to be very easy because:
- when looking for where to stop, we only ever stop at commits or aborts
- these are the only records that have timestamps anyway
- any record that isn't specifically committed is not updated in the clog and is therefore not visible

The clog starts in an indeterminate state, 0, and is then updated to either committed or aborted. Aborted and indeterminate are handled similarly in the current code, to allow for crash recovery - PITR doesn't change anything there. So, PITR doesn't do anything that crash recovery doesn't already do. Crash recovery makes no attempt to keep track of in-progress transactions and doesn't make a special journey to the clog to specifically mark them as aborted - they just are, by default.

So, what we mean by "stop at a TransactionId" is "stop applying redo at the commit/abort record for that TransactionId". It has to be an exact match, not a "greater than", for exactly the reason you mention. That means that although we stop at the commit record of TransactionId X, we may also have applied records for transactions with later TransactionIds, e.g. X+1, X+2... etc (without limit or restriction). (I'll even admit that at first I did think we could get away with the "less than" test that you are warning me about. Overall, I've spent more time on theory/analysis than on coding, on the idea that you can improve poor code, but wrong code just needs to be thrown away.)

Timestamps are more vague... When time is used, there might easily be 10+ transactions whose commit/abort records have identical timestamp values. So we either stop at the first or the last such record, depending upon whether we specified inclusive or exclusive on the recovery target value.

The hard bit, IMHO, is what we do with the part of the log that we have chosen not to apply... which has been discussed on list in detail also.

Thanks for keeping an eye out for possible errors - this one is something I'd thought through and catered for (there are comments in my current latest published code to that effect, although I have not yet finished coding the clean-up-after-stopping part). This implies nothing with regard to other possible errors or oversights, so I very much welcome any questioning of this nature - I am as prone to error as the next man. It's important we get this right.

Best regards, Simon Riggs
On Tue, 2004-07-06 at 22:40, Simon Riggs wrote: > On Mon, 2004-07-05 at 22:46, Tom Lane wrote: > > Simon Riggs <simon@2ndquadrant.com> writes: > > > > - when we stop, keep reading records until EOF, just don't apply them. > > > When we write a checkpoint at end of recovery, the unapplied > > > transactions are buried alive, never to return. > > > - stop where we stop, then force zeros to EOF, so that no possible > > > record remains of previous transactions. > > > > Go with plan B; it's best not to destroy data (what if you chose the > > wrong restart point the first time)? > > > > Actually this now reminds me of a discussion I had with Patrick > > Macdonald some time ago. The DB2 practice in this connection is that > > you *never* overwrite existing logfile data when recovering. Instead > > you start a brand new xlog segment file, which is given a new "branch > > number" so it can be distinguished from the future-time xlog segments > > that you chose not to apply. I don't recall what the DB2 terminology > > was exactly --- not "branch number" I don't think --- but anyway the > > idea is that when you restart the database after an incomplete recovery, > > you are now in a sort of parallel universe that has its own history > > after the branch point (PITR stop point). You need to be able to > > distinguish archived log segments of this parallel universe from those > > of previous and subsequent incarnations. I'm not sure whether Vadim > > intended our StartUpID to serve this purpose, but it could perhaps be > > used that way, if we reflected it in the WAL file names. > > > > Some more thoughts...focusing on the what do we do after we've finished > recovering. The objectives, as I see them, are to put the system into a > state, that preserves these features: > 1. we never overwrite files, in case we want to re-run recovery > 2. we never write files that MIGHT have been written previously > 3. 
we need to ensure that any xlog records skipped at admins request (in > PITR mode) are never in a position to be re-applied to this timeline. > 4. ensure we can re-recover, if we need to, without further problems > > Tom's concept above, I'm going to call timelines. A timeline is the > sequence of logs created by the execution of a server. If you recover > the database, you create a new timeline. [This is because, if you've > invoked PITR you absolutely definitely want log records written to, say, > xlog15 to be different to those that were written to xlog15 in a > previous timeline that you have chosen not to reapply.] > > Objective (1) is complex. > When we are restoring, we always start with archived copies of the xlog, > to make sure we don't finish too soon. We roll forward until we either > reach PITR stop point, or we hit end of archived logs. If we hit end of > logs on archive, then we switch to a local copy, if one exists that is > higher than those, we carry on rolling forward until either we reach > PITR stop point, or we hit end of that log. (Hopefully, there isn't more > than one local xlog higher than the archive, but it's possible). > If we are rolling forward on local copies, then they are our only > copies. We'd really like to archive them ASAP, but the archiver's not > running yet - we don't want to force that situation in case the archive > device (say a tape) is the one being used to recover right now. So we > write an archive_status of .ready for that file, ensuring that the > checkpoint won't remove it until it gets copied to archive, whenever > that starts working again. Objective (1) met. > > When we have finished recovering we: > - create a new xlog at the start of a new ++timeline > - copy the last applied xlog record to it as the first record > - set the record pointer so that it matches > That way, when we come up and begin running, we never overwrite files > that might have been written previously. Objective (2) met. 
> We do the other stuff because recovery finishes up by pointing to the > last applied record...which is what was causing all of this extra work > in the first place. > > At this point, we also reset the secondary checkpoint record, so that > should recovery be required again before next checkpoint AND the > shutdown checkpoint record written after recovery completes is > wrong/damaged, the recovery will not autorewind back past the PITR stop > point and attempt to recover the records we have just tried so hard to > reverse/ignore. Objective (3) met. (Clearly, that situation seems > unlikely, but I feel we must deal with it...a newly restored system is > actually very fragile, so a crash again within 3 minutes or so is very > commonplace, as far as these things go). > > Should we need to re-recover, we can do so because the new timeline > xlogs are further forward than the old timeline, so never get seen by > any processes (all of which look backwards). Re-recovery is possible > without problems, if required. This means you're a lot safer from some > of the mistakes you might have made, such as deciding you need to go into > recovery, then realising it wasn't required (or some other painful > flapping as goes on in computer rooms at 3am). > > How do we implement timelines? > The main presumption in the code is that xlogs are sequential. That has > two effects: > 1. during recovery, we try to open the "next" xlog by adding one to the > numbers and then looking for that file > 2. during checkpoint, we look for filenames less than the current > checkpoint marker > Creating a timeline by adding a larger number to LogId allows us to > prevent (1) from working, yet without breaking (2). > Well, Tom does seem to have something with regard to StartUpIds. I feel > it is easier to force a new timeline by adding a very large number to > the LogId IF, and only if, we have performed an archive recovery. 
That > way, we do not change at all the behaviour of the system for people that > choose not to implement archive_mode. > > Should we implement timelines? > Yes, I think we should. I've already hit the problems that timelines > solve in my testing and so that means they'll be hit when you don't need > the hassle. > I'm still wrestling with the cleanup-after-stopping-at-point-in-time code and have some important conclusions. Moving forward on a timeline is somewhat tricky for xlogs, as shown above,...but... My earlier treatment seems to have neglected to include the clog also. If we stop before end of log, then we also have potentially many (though presumably at least one) committed transactions that we do not want to be told about ever again. The starting a new timeline thought works for xlogs, but not for clogs. No matter how far you go into the future, there is a small (yet vanishing) possibility that there is a yet undiscovered committed transaction in the future. (Because transactions are ordered in the clog because xids are assigned sequentially at txn start, but not ordered in the xlog where they are recorded in the order the txns complete). Please tell me that we can ignore the state of the clog, but I think we can't - if a new xid re-used a previous xid that had committed AND then we crashed...we would have inconsistent data. Unless we physically write zeros to clog for every begin transaction after a recovery...err, no... The only recourse that I can see is to "truncate the future" of the clog, which would mean: - keeping track of the highest xid provided by any record from the xlog, in xact.c, xact_redo - using that xid to write zeros to the clog after this point until EOF - drop any clog segment files past the new "high" segment - no idea how that affects NT or not... The timeline idea works for xlog because once we've applied the xlog records and checkpointed, we can discard the xlog records. 
We can't do that with clog records (unless we followed recovery with a vacuum full - which is possible, but not hugely desirable) - though this doesn't solve the issue that xlog records don't have any prescribed position in the file, clog records do. Right now, I don't know where to start with the clog code and the opportunity for code-overlap with NT seems way high. These problems can be conquered, given time and "given enough eyeballs". I'm all ears for some bright ideas...but I'm getting pretty wary that we may introduce some unintended features if we try to get this stabilised within two weeks. My current conclusion is: let's commit archive recovery in this release, then wait until next dot release for full recovery target features. We've hit all the features which were a priority and the fundamental architecture is there, so I think it is time to be happy with what we've got, for now. Comments, please....remembering that I'd love it if I've missed something that simplifies the task. Fire away. Best regards, Simon Riggs
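[Editor's note] The "add a very large number to LogId" trick from the mail above can be modelled in miniature. This toy (Python; the constant and all names are invented for the example) shows why the bump satisfies both properties: recovery's probe for the "next" xlog fails at the timeline boundary, while checkpoint cleanup, which only looks backwards, still works:

```python
# Toy model of forcing a new timeline by bumping LogId after an
# archive recovery. TIMELINE_STEP is an invented constant, not a
# value from the patch.
TIMELINE_STEP = 1_000_000

def new_timeline(log_id):
    """First segment number of the new ('parallel universe') timeline."""
    return log_id + TIMELINE_STEP

def probe_next(existing, log_id):
    """Recovery's 'try the next xlog' probe: only looks at log_id + 1,
    so it can never wander from one timeline into another."""
    return (log_id + 1) in existing

def removable_by_checkpoint(existing, checkpoint_log_id):
    """Checkpoint cleanup: files numbered below the checkpoint marker."""
    return {f for f in existing if f < checkpoint_log_id}

old = {15, 16, 17}         # segments of the abandoned timeline
cur = new_timeline(15)     # first segment of the new timeline
files = old | {cur}
```

Property (1): `probe_next` from the old timeline's tip never finds the new timeline's files; property (2): a checkpoint on the new timeline still sees all old segments as "less than" and can clean them up.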
> The starting a new timeline thought works for xlogs, but not for clogs. > No matter how far you go into the future, there is a small (yet > vanishing) possibility that there is a yet undiscovered committed > transaction in the future. (Because transactions are ordered in the clog > because xids are assigned sequentially at txn start, but not ordered in > the xlog where they are recorded in the order the txns complete). Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem would arise at xid wraparound, no ? Andreas
On Tue, 2004-07-13 at 13:18, Zeugswetter Andreas SB SD wrote: > > The starting a new timeline thought works for xlogs, but not for clogs. > > No matter how far you go into the future, there is a small (yet > > vanishing) possibility that there is a yet undiscovered committed > > transaction in the future. (Because transactions are ordered in the clog > > because xids are assigned sequentially at txn start, but not ordered in > > the xlog where they are recorded in the order the txns complete). > > Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem > would arise at xid wraparound, no ? > Sounds like a good start... When PITR ends, we need to stop mid-way through a file. Does that handle that situation? Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > Please tell me that we can ignore the state of the clog, We can. The reason that keeping track of timelines is interesting for xlog is simply to take pity on the poor DBA who needs to distinguish the various archived xlog files he's got laying about, and so that we can detect errors like supplying inconsistent sets of xlog segments during restore. This does not apply to clog because it's not archived. It's no more than a data file. If you think you have trouble recreating clog then you have the same issues recreating data files. regards, tom lane
On Tue, 2004-07-13 at 15:29, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Please tell me that we can ignore the state of the clog, > > We can. > In general, you are of course correct. > The reason that keeping track of timelines is interesting for xlog is > simply to take pity on the poor DBA who needs to distinguish the various > archived xlog files he's got laying about, and so that we can detect > errors like supplying inconsistent sets of xlog segments during restore. > > This does not apply to clog because it's not archived. It's no more > than a data file. If you think you have trouble recreating clog then > you have the same issues recreating data files. I'm getting carried away with the improbable....but this is the rather strange, but possible scenario I foresee: A sequence of times... 1. We start archiving xlogs 2. We take a checkpoint 3. we commit an important transaction 4. We take a backup 5. We take a checkpoint As it stands currently, when we restore the backup, controlfile says that last checkpoint was at 2, so we rollforward from 2 THRU 4 and continue on past 5 until end of logs. Normally, end of logs isn't until after 4... When we specify a recovery target, it is possible to specify the rollforward to complete just before point 3. So we use the backup taken at 4 to rollforward to a point in the past (from the backup's perspective). The backup taken at 4 may now have data and clog records written by bgwriter. Given that time between checkpoints is likely to be longer than previously was the case...this becomes a non-zero situation. I was trying to solve this problem head on, but the best way is to make sure we never allow ourselves such a muddled situation: ISTM the way to avoid this is to insist that we always rollforward through at least one checkpoint to guarantee that this will not occur. ...then I can forget I ever mentioned the ****** clog again. I'm ignoring this issue for now....whether it exists or not! 
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > I'm getting carried away with the improbable....but this is the rather > strange, but possible scenario I foresee: > A sequence of times... > 1. We start archiving xlogs > 2. We take a checkpoint > 3. we commit an important transaction > 4. We take a backup > 5. We take a checkpoint > When we specify a recovery target, it is possible to specify the > rollforward to complete just before point 3. No, it isn't possible. The recovery *must* proceed at least as far as wherever the end of the log was at the time the backup was completed. Otherwise everything is broken, not only clog, because you may have disk blocks in your backup that postdate where you stopped log replay. To have a consistent recovery at all, you must replay the log starting from a checkpoint before the backup began and extending to the time that the backup finished. You only get to decide where to stop after that point. regards, tom lane
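[Editor's note] Tom's constraint is easy to state as a predicate. A minimal sketch (Python; WAL positions reduced to plain integers purely for illustration):

```python
# Tom's rule as a validity check: a recovery stop target is legal only
# if it lies at or beyond the end-of-WAL position recorded when the
# backup finished. Integer "LSNs" are a simplification for the toy.
def valid_stop_point(stop_lsn, backup_start_checkpoint_lsn, backup_end_lsn):
    """Replay must start from a checkpoint before the backup began and
    run at least to where WAL ended when the backup completed; only
    past that point may the admin choose where to stop."""
    assert backup_start_checkpoint_lsn <= backup_end_lsn
    return stop_lsn >= backup_end_lsn
```

Stopping earlier than `backup_end_lsn` leaves disk blocks in the restored backup that postdate the replay stop point - the inconsistency Tom describes, which affects all data files, not only clog.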
On Tue, 2004-07-13 at 22:19, Tom Lane wrote: > To have a consistent recovery at all, you must replay the log starting > from a checkpoint before the backup began and extending to the time that > the backup finished. You only get to decide where to stop after that > point. > So the situation is: - You must only stop recovery at a point in time (in the logs) after the backup had completed. No way to enforce that currently, apart from procedurally. Not exactly frequent, so I think I just document that and move on, eh? Thanks for your help, Best regards, Simon Riggs
Simon Riggs wrote: > On Tue, 2004-07-13 at 22:19, Tom Lane wrote: > > > To have a consistent recovery at all, you must replay the log starting > > from a checkpoint before the backup began and extending to the time that > > the backup finished. You only get to decide where to stop after that > > point. > > > > So the situation is: > - You must only stop recovery at a point in time (in the logs) after the > backup had completed. > > No way to enforce that currently, apart from procedurally. Not exactly > frequent, so I think I just document that and move on, eh? If it happens, could you use your previous full backup and the PITR logs from before you stopped logging, and then after? Is there a period where they could not restore reliably? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Tue, 2004-07-13 at 23:42, Bruce Momjian wrote: > Simon Riggs wrote: > > On Tue, 2004-07-13 at 22:19, Tom Lane wrote: > > > > > To have a consistent recovery at all, you must replay the log starting > > > from a checkpoint before the backup began and extending to the time that > > > the backup finished. You only get to decide where to stop after that > > > point. > > > > > > > So the situation is: > > - You must only stop recovery at a point in time (in the logs) after the > > backup had completed. > > > > No way to enforce that currently, apart from procedurally. Not exactly > > frequent, so I think I just document that and move on, eh? > > If it happens, could you use your previous full backup and the PITR logs > from before you stopped logging, and then after? Yes. > Is there a period > where they could not restore reliably? Good question. No is the answer. The situation is that the backup isn't timestamped with respect to the logs, so it's possible to attempt to use the wrong backup for recovery. The solution is procedural - make sure you timestamp your backup files, so you know which ones to recover with... Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > So the situation is: > - You must only stop recovery at a point in time (in the logs) after the > backup had completed. Right. > No way to enforce that currently, apart from procedurally. Not exactly > frequent, so I think I just document that and move on, eh? The procedure that generates a backup has got to be responsible for recording both the start and stop times. If it does not do so then it's fatally flawed. (Note also that you had better be careful to get the time as seen on the server machine's clock ... this could be a nasty gotcha if the backup is run on a different machine, such as an NFS server.) regards, tom lane
Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > So the situation is: > > - You must only stop recovery at a point in time (in the logs) after the > > backup had completed. > > Right. > > > No way to enforce that currently, apart from procedurally. Not exactly > > frequent, so I think I just document that and move on, eh? > > The procedure that generates a backup has got to be responsible for > recording both the start and stop times. If it does not do so then > it's fatally flawed. (Note also that you had better be careful to get > the time as seen on the server machine's clock ... this could be a nasty > gotcha if the backup is run on a different machine, such as an NFS > server.) OK, but procedurally, how do you correlate the start/stop time of the tar backup with the WAL numeric file names? -- Bruce Momjian
PITR Patch v5_1 just posted has Point in Time Recovery working.... Still some rough edges....but we really need some testers now to give this a try and let me know what you think. Klaus Naumann and Mark Wong are the only [non-committers] to have tried to run the code (and let me know about it), so please have a look at [PATCHES] and try it out. Many thanks, Simon Riggs
On Wed, 2004-07-14 at 00:01, Bruce Momjian wrote: > Tom Lane wrote: > > Simon Riggs <simon@2ndquadrant.com> writes: > > > So the situation is: > > > - You must only stop recovery at a point in time (in the logs) after the > > > backup had completed. > > > > Right. > > > > > No way to enforce that currently, apart from procedurally. Not exactly > > > frequent, so I think I just document that and move on, eh? > > > > The procedure that generates a backup has got to be responsible for > > recording both the start and stop times. If it does not do so then > > it's fatally flawed. (Note also that you had better be careful to get > > the time as seen on the server machine's clock ... this could be a nasty > > gotcha if the backup is run on a different machine, such as an NFS > > server.) > > OK, but procedurally, how do you correlate the start/stop time of the > tar backup with the WAL numeric file names? No need. You just correlate the recovery target with the backup file times. Mostly, you'll only ever use your last backup and won't need to fuss with the times. Backup should begin with a CHECKPOINT...then wait for that to complete, just to make the backup as current as possible. If you want to start purging your archives of old archived xlogs, you can use the filedate (assuming you preserved that on your copy to archive - but even if not, they'll be fairly close). Best regards, Simon Riggs
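[Editor's note] The archive-purging idea at the end of the mail above can be sketched as a small script. This assumes, as the mail does, that file dates were preserved on the copy to archive; the directory layout and segment names are inventions of the example:

```python
# Sketch of purging archived xlogs by file date: an archived segment
# is safe to purge once it predates the oldest base backup you still
# want to be able to restore.
import os
import tempfile

def prunable_segments(archive_dir, oldest_backup_mtime):
    """Archived files whose modification time is older than the oldest
    backup we intend to keep restorable."""
    out = []
    for name in sorted(os.listdir(archive_dir)):
        path = os.path.join(archive_dir, name)
        if os.path.getmtime(path) < oldest_backup_mtime:
            out.append(name)
    return out

# Demo with two fake archived segments and forced file dates
demo = tempfile.mkdtemp()
for name, mtime in [('xlog_0001', 100), ('xlog_0002', 200)]:
    path = os.path.join(demo, name)
    open(path, 'w').close()
    os.utime(path, (mtime, mtime))
```

As the mail notes, even if file dates were not preserved exactly on copy, they will be fairly close, so a date-based cutoff with some slack is still workable.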
Bruce Momjian <pgman@candle.pha.pa.us> writes: > OK, but procedurally, how do you correlate the start/stop time of the > tar backup with the WAL numeric file names? Ideally the procedure for making a backup would go something like: 1. Inquire of the server its current time and the WAL position of the most recent checkpoint record (which is what you really need). 2. Make the backup. 3. Inquire of the server its current time and the current end-of-WAL position. 4. Record items 1 and 3 along with the backup itself. I think the current theory was you could fake #1 by copying pg_control before everything else, but this doesn't really help for step #3, so it would probably be better to add some server functions to get this info. regards, tom lane
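[Editor's note] Tom's four steps can be expressed as a small driver that keeps the metadata with the backup. Note that `query_server()` here is a stand-in for the server functions he proposes adding; no such API existed at the time:

```python
# Tom's 4-step backup procedure, sketched as a driver that records
# the required metadata alongside the backup. All names are invented
# for the example.
def make_backup(query_server, do_backup):
    meta = {}
    # 1. current time + WAL position of the most recent checkpoint
    meta['start_time'], meta['checkpoint_wal'] = query_server('checkpoint')
    # 2. the backup itself
    archive = do_backup()
    # 3. current time + current end-of-WAL position
    meta['stop_time'], meta['end_wal'] = query_server('end_of_wal')
    # 4. record items 1 and 3 along with the backup
    return {'archive': archive, 'meta': meta}

def fake_server(what):
    # Stand-in for the proposed server-side inquiry functions.
    return {'checkpoint': ('t0', 'xlog-0319'),
            'end_of_wal': ('t1', 'xlog-034a')}[what]

result = make_backup(fake_server, lambda: 'base.tar')
```

Step 3 is exactly the part that copying pg_control first cannot fake, which is why Tom suggests real server functions.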
On Wed, 2004-07-14 at 00:28, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > OK, but procedurally, how do you correlate the start/stop time of the > > tar backup with the WAL numeric file names? > > Ideally the procedure for making a backup would go something like: > > 1. Inquire of the server its current time and the WAL position of the > most recent checkpoint record (which is what you really need). > > 2. Make the backup. > > 3. Inquire of the server its current time and the current end-of-WAL > position. > > 4. Record items 1 and 3 along with the backup itself. > > I think the current theory was you could fake #1 by copying pg_control > before everything else, but this doesn't really help for step #3, so > it would probably be better to add some server functions to get this > info. > err...I think at this point we should review the PITR patch.... The recovery mechanism doesn't rely upon you knowing 1 or 3. The recovery reads pg_control (from the backup) and then attempts to de-archive the appropriate xlog segment file and then starts rollforward from there. Effectively, restore assumes it has access to an infinite timeline of logs....which clearly isn't the case, but it's up to *you* to check that you have the logs that go with the backups. (Or put another way, if this sounds hard, buy some software that administers the procedure for you). That's the mechanism that allows "infinite recovery". In brief, the code path is as identical as possible to the current crash recovery situation...archive recovery restores the files from archive when they are needed, just as if they had always been in pg_xlog, in a way that ensures pg_xlog never runs out of space. Recovery ends when: it reaches the recovery target you specified, or it runs out of xlogs (first it runs out of archived xlogs, then tries to find a more recent local copy if there is one). 
> I think the current theory was you could fake #1 by copying pg_control > before everything else, but this doesn't really help for step #3, so > it would probably be better to add some server functions to get this > info. Not sure what you mean by "fake".... Best Regards, Simon Riggs
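[Editor's note] The "archive first, then a more recent local copy" rollforward Simon describes above can be simulated like this (Python; segment numbering and return values are invented for the toy):

```python
# Simulation of the segment-selection rule described above: prefer
# archived copies; only when the archive is exhausted, fall back to a
# local segment numbered higher than anything in the archive.
def next_segment(seg_no, archived, local):
    """Return 'archive', 'local', or None for the next segment to replay."""
    if seg_no in archived:
        return 'archive'
    if seg_no in local and all(seg_no > a for a in archived):
        # Local-only segment: it is our only copy, so (per the earlier
        # mail) it would also be flagged .ready for the archiver.
        return 'local'
    return None  # out of xlogs: recovery ends here

def replay(archived, local, start):
    plan, ready, seg = [], [], start
    while True:
        src = next_segment(seg, archived, local)
        if src is None:
            break
        plan.append((src, seg))
        if src == 'local':
            ready.append(seg)  # mark for archiving once archiver is back
        seg += 1
    return plan, ready

# Segments 10-12 are archived; 12 and 13 also exist locally.
plan, ready = replay(archived={10, 11, 12}, local={12, 13}, start=10)
```

Recovery replays 10-12 from the archive, then switches to the local-only segment 13, and stops when no further segment exists - matching "it runs out of archived xlogs, then tries to find a more recent local copy".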
Can you give us some suggestions of what kind of stuff to test? Is there a way we can artificially kill the backend in all sorts of nasty spots to see if recovery works? Does kill -9 simulate a 'power off'? Chris Simon Riggs wrote: > PITR Patch v5_1 just posted has Point in Time Recovery working.... > > Still some rough edges....but we really need some testers now to give > this a try and let me know what you think. > > Klaus Naumann and Mark Wong are the only [non-committers] to have tried > to run the code (and let me know about it), so please have a look at > [PATCHES] and try it out. > > Many thanks, > > Simon Riggs > > > ---------------------------(end of broadcast)--------------------------- > TIP 8: explain analyze is your friend
On Wed, 2004-07-14 at 03:31, Christopher Kings-Lynne wrote: > Can you give us some suggestions of what kind of stuff to test? Is > there a way we can artificially kill the backend in all sorts of nasty > spots to see if recovery works? Does kill -9 simulate a 'power off'? > I was hoping some fiendish plans would be presented to me... But please start with "this feels like typical usage" and we'll go from there...the important thing is to try the first one. I've not done power off tests, yet. They need to be done just to check...actually you don't need to do this to test PITR... We need to do exhaustive tests of... - power off - scp and cross network copies - all the permuted recovery options - archive_mode = off (i.e. current behaviour) - deliberately incorrectly set options (idiot-proof testing) I'd love some help assembling a test document with numbered tests... Best regards, Simon Riggs
> The recovery mechanism doesn't rely upon you knowing 1 or 3. The > recovery reads pg_control (from the backup) and then attempts to > de-archive the appropriate xlog segment file and then starts > rollforward Unfortunately this only works if pg_control was the first file to be backed up (or by chance no checkpoint happened after backup start and pg_control backup). Other db's have commands for: start/end external backup. Maybe we should add those two commands, that would initially only do the following:

start external backup:
- (checkpoint as an option)
- make a copy of pg_control

end external backup:
- record WAL position (helps choose an allowed minimum PIT)

Those commands would actually not be obligatory but recommended, and would only help with the restore process. Restore would then either take the existing pg_control backup, or ask the dba where rollforward should start and create a corresponding pg_control. A helper utility could list possible checkpoints in a given xlog. Andreas
Simon Riggs <simon@2ndquadrant.com> writes: > I've not done power off tests, yet. They need to be done just to > check...actually you don't need to do this to test PITR... I agree, power off is not really the point here. What we need to check into is (a) the mechanics of archiving WAL segments and (b) the process of restoring given a backup and a bunch of WAL segments. regards, tom lane
On 14 Jul, Simon Riggs wrote: > PITR Patch v5_1 just posted has Point in Time Recovery working.... > > Still some rough edges....but we really need some testers now to give > this a try and let me know what you think. > > Klaus Naumann and Mark Wong are the only [non-committers] to have tried > to run the code (and let me know about it), so please have a look at > [PATCHES] and try it out. > > Many thanks, > > Simon Riggs Simon, I just tried applying the v5_1 patch against the cvs tip today and got a couple of rejections. I'll copy the patch output here. Let me know if you want to see the reject files or anything else: $ patch -p0 < ../../../pitr-v5_1.diff patching file backend/access/nbtree/nbtsort.c Hunk #2 FAILED at 221. 1 out of 2 hunks FAILED -- saving rejects to file backend/access/nbtree/nbtsort.c.rej patching file backend/access/transam/xlog.c Hunk #11 FAILED at 1802. Hunk #15 FAILED at 2152. Hunk #16 FAILED at 2202. Hunk #21 FAILED at 3450. Hunk #23 FAILED at 3539. Hunk #25 FAILED at 3582. Hunk #26 FAILED at 3833. Hunk #27 succeeded at 3883 with fuzz 2. Hunk #28 FAILED at 4446. Hunk #29 succeeded at 4470 with fuzz 2. 8 out of 29 hunks FAILED -- saving rejects to file backend/access/transam/xlog.c.rej patching file backend/postmaster/Makefile patching file backend/postmaster/postmaster.c Hunk #3 succeeded at 1218 with fuzz 2 (offset 70 lines). Hunk #4 succeeded at 1827 (offset 70 lines). Hunk #5 succeeded at 1874 (offset 70 lines). Hunk #6 succeeded at 1894 (offset 70 lines). Hunk #7 FAILED at 1985. Hunk #8 succeeded at 2039 (offset 70 lines). Hunk #9 succeeded at 2236 (offset 70 lines). Hunk #10 succeeded at 2996 with fuzz 2 (offset 70 lines). 1 out of 10 hunks FAILED -- saving rejects to file backend/postmaster/postmaster.c.rej patching file backend/storage/smgr/md.c Hunk #1 succeeded at 162 with fuzz 2. patching file backend/utils/misc/guc.c Hunk #1 succeeded at 342 (offset 9 lines). Hunk #2 succeeded at 1387 (offset 9 lines). 
patching file backend/utils/misc/postgresql.conf.sample Hunk #1 succeeded at 113 (offset 10 lines). patching file bin/initdb/initdb.c patching file include/access/xlog.h patching file include/storage/pmsignal.h
On Wed, 2004-07-14 at 16:55, markw@osdl.org wrote: > On 14 Jul, Simon Riggs wrote: > > PITR Patch v5_1 just posted has Point in Time Recovery working.... > > > > Still some rough edges....but we really need some testers now to give > > this a try and let me know what you think. > > > > Klaus Naumann and Mark Wong are the only [non-committers] to have tried > > to run the code (and let me know about it), so please have a look at > > [PATCHES] and try it out. > > > I just tried applying the v5_1 patch against the cvs tip today and got a > couple of rejections. I'll copy the patch output here. Let me know if > you want to see the reject files or anything else: > I'm on it. Sorry 'bout that all - midnight fingers.
On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote: > > The recovery mechanism doesn't rely upon you knowing 1 or 3. The > > recovery reads pg_control (from the backup) and then attempts to > > de-archive the appropriate xlog segment file and then starts > > rollforward > > Unfortunately this only works if pg_control was the first file to be > backed up (or by chance no checkpoint happened after backup start and > pg_control backup) > > Other db's have commands for: > start/end external backup > OK...this idea has come up a few times. Here's my take: - OS and hardware facilities exist now to make instant copies of sets of files. Some of these are open source, others not. If you use these, you have no requirement for this functionality....but these alone are no replacement for archive recovery.... I accept that some people may not wish to go to the expense or effort to use those options, but in my mind these are the people that will not be using archive_mode anyway. - all we would really need to do is to stop the bgwriter from doing anything during backup. pg_control is only updated at checkpoint. The current xlog is updated constantly, but this need not be copied because we are already archiving it as soon as it's full. That leaves the bgwriter, which is now responsible for both lazy writing and checkpoints. So, put a switch into bgwriter to halt for a period, then turn it back on at the end. Could be a SIGHUP GUC...or... ...and with my greatest respects.... - please could somebody else code that?... my time is limited Best regards, Simon Riggs
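[Editor's note] The proposed switch is tiny in concept: a flag the backup procedure sets so the bgwriter skips both lazy writes and checkpoints for the duration. A toy model (Python; the class and flag are inventions of the example - nothing like this was in the patch):

```python
# Toy model of the 'halt the bgwriter during an external backup'
# switch proposed above.
class BgWriter:
    def __init__(self):
        self.paused = False
        self.writes = 0

    def tick(self):
        # While paused, skip both lazy writes and checkpoints, so no
        # data-file or pg_control changes land mid-backup.
        if not self.paused:
            self.writes += 1

bg = BgWriter()
bg.tick()             # normal operation: one write cycle
bg.paused = True      # start of external backup
bg.tick(); bg.tick()  # cycles elapse, nothing is written
bg.paused = False     # end of external backup
bg.tick()             # writing resumes
```

The current xlog keeps advancing during the pause; only the background writes to data files and pg_control are held back, which is the whole point of the switch.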
I noticed that compiling with 5_1 patch applied fails due to XLOG_archive_dir being removed from xlog.c , but src/backend/commands/tablecmds.c still uses it. I did the following to tablecmds.c :

5408c5408
< extern char XLOG_archive_dir[];
---
> extern char *XLogArchiveDest;
5410c5410
< use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
---
> use_wal = XLogArchiveDest[0] && !rel->rd_istemp;

Now I have to see if I have broken it with this change :-) regards Mark Simon Riggs wrote: >On Wed, 2004-07-14 at 16:55, markw@osdl.org wrote: >>On 14 Jul, Simon Riggs wrote: >>>PITR Patch v5_1 just posted has Point in Time Recovery working.... >>> >>>Still some rough edges....but we really need some testers now to give >>>this a try and let me know what you think. >>> >>>Klaus Naumann and Mark Wong are the only [non-committers] to have tried >>>to run the code (and let me know about it), so please have a look at >>>[PATCHES] and try it out. >> >>I just tried applying the v5_1 patch against the cvs tip today and got a >>couple of rejections. I'll copy the patch output here. Let me know if >>you want to see the reject files or anything else: > >I'm on it. Sorry 'bout that all - midnight fingers.
Hi, folks. My colleagues and I are planning to test PITR after the 7.5 beta release. Now we are designing test items, but some specifications are not clear enough (to us). For example, we are not clear which resource managers store log records: - whether some access methods (like B-tree) log their data or not. - whether create/drop actions on table spaces are stored to the log or not. We'll be pleased if someone informs us.

The test set we'll proceed with has the following items:
- PITR can recover ordinary committed transactions' data:
  - tuple data themselves
  - index data associated with them
- PITR can recover some special committed transactions' data:
  - DDL; create database, table, index and so on
  - maintenance commands (handling large amounts of data); truncate, vacuum, reindex and so on.

Items above are the 'data aspects' of the test. Other aspects are as follows:
- Place of the archival log's drive; PITR can recover a database from archived log data
  - stored on the same drive as xlog.
  - stored on a different drive on the same machine on which the PostgreSQL runs.
  - stored on a different drive on a different machine.
- Duration between a checkpoint and recovery; PITR can recover a database long enough after a checkpoint.
- Time to recover;
  - to end of the log.
  - to some specified time.
- Type of failures;
  - system down --- kill the PostgreSQL process (as a simulation).
  - media lost --- delete database files (as a simulation).
  - These two cases will be tested in a simulated situation first, and we would try some 'real' failures after (real power down of the test machine for the first case, and pulling the disk drive for the second one; these actions could damage the test machine, which is why we plan them after the 'ordinary' test items).

The test set is under construction. We'll test the 7.5 beta for some weeks, and report the results of the test here. Sincerely yours, Tetsuo SAKATA. -- sakata.tetsuo _at_ lab.ntt.co.jp SAKATA, Tetsuo. Yokosuka JAPAN.
I talked to Tom on the phone today and I think we have a procedure for doing backup/restore in a fairly foolproof way. As outlined below, we need to record the start/stop and checkpoint WAL file names and offsets, and somehow pass those on to restore. I think any system that requires users to link those values together is going to cause confusion and be error-prone. My idea is to do much of this automatically. First, create a server-side function called pitr_backup_start() which creates a file in the /data directory which contains the WAL filename/offsets for last checkpoint and start. Then do the backup of the data directory. Then call pitr_backup_stop() which adds the stop filename/offsets to the file, and archives that file in the same place as the WAL files. To restore, you untar the backup of /data. Then the recovering backend reads the file created by pitr_backup_start() to find the name of the backup parameter file. It then recovers that file from the archive location and uses the start/stop/checkpoint filename/offset information to do the restore. The advantage of this is that the tar backup contains everything needed to find the proper parameter file for restore. Ideally we could get all the parameters into the tar backup, but that isn't possible because we can't push the stop counters into the backup after the backup has completed. I recommend the pitr_backup_start() file be named for the current WAL filename/offset, perhaps 000000000000032c.3da390.backup or something like that. The file would be a simple text file in pg_xlog/archive_status:

# start 2004-07-14 21:35:22.324579-04
wal_checkpoint = 0000000000000319.021233
wal_start = 000000000000032c.92a9cb
...added after backup completes...
wal_stop = 000000000000034a.3db030
# stop 2004-07-14 21:32:22.0923213-04

The timestamps are for documentation only. These files give admins looking in the archive directory information on backup times. (As an idea, there is no need for the user to specify a recovery mode. 
If the postmaster starts and sees the pitr_backup_start() file in /data, it can go into recovery mode automatically. If the archiver can't find the file in the archive location, it can assume that it is just being started from power failure mode. However if it finds the file in the archive location, it can assume it is to enter recovery mode. There is a race condition that a crash during copy of the file to the archive location would be a problem. The solution would be to create a special flag file before copying the file to archive, and then archive it and remove the flag file. If the postmaster starts up and sees the pitr_backup_start() file in /data and in the archive location, and it doesn't see the flag file, it then knows it is doing a restore because the flag file would never appear in a backup. Anyway, this is just an idea.) --------------------------------------------------------------------------- Simon Riggs wrote: > On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote: > > > The recovery mechanism doesn't rely upon you knowing 1 or 3. The > > > recovery reads pg_control (from the backup) and then attempts to > > > de-archive the appropriate xlog segment file and then starts > > > rollforward > > > > Unfortunately this only works if pg_control was the first file to be > > backed up (or by chance no checkpoint happened after backup start and > > pg_control backup) > > > > Other db's have commands for: > > start/end external backup > > > > OK...this idea has come up a few times. Here's my take: > > - OS and hardware facilities exist now to make instant copies of sets of > files. Some of these are open source, others not. If you use these, you > have no requirement for this functionality....but these alone are no > replacement for archive recovery.... I accept that some people may not > wish to go to the expense or effort to use those options, but in my mind > these are the people that will not be using archive_mode anyway. 
> > - all we would really need to do is to stop the bgwriter from doing > anything during backup. pgcontrol is only updated at checkpoint. The > current xlog is updated constantly, but this need not be copied because > we are already archiving it as soon as its full. That leaves the > bgwriter, which is now responsible for both lazy writing and > checkpoints. > So, put a switch into bgwriter to halt for a period, then turn it back > on at the end. Could be a SIGHUP GUC...or... > > ...and with my greatest respects.... > > - please could somebody else code that?... my time is limited > > Best regards, Simon Riggs > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Thu, 2004-07-15 at 02:43, Mark Kirkwood wrote:
> I noticed that compiling with 5_1 patch applied fails due to
> XLOG_archive_dir being removed from xlog.c, but
> src/backend/commands/tablecmds.c still uses it.
>
> I did the following to tablecmds.c:
>
> 5408c5408
> < extern char XLOG_archive_dir[];
> ---
> > extern char *XLogArchiveDest;
> 5410c5410
> < use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
> ---
> > use_wal = XLogArchiveDest[0] && !rel->rd_istemp;

Yes, I discovered that myself. The fix is included in pitr_v5_2.patch... Your patch follows the right thinking and looks like it would have worked...
- XLogArchiveMode carries the main bool value for mode on/off
- XLogArchiveDest might also be used, though best to use the mode

Thanks for looking through the code... Best Regards, Simon Riggs
On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote: > I talked to Tom on the phone today and and I think we have a procedure > for doing backup/restore in a fairly foolproof way. > > As outlined below, we need to record the start/stop and checkpoint WAL > file names and offsets, and somehow pass those on to restore. I think > any system that requires users to link those values together is going > to cause confusion and be error-prone. > Unfortunately, it seems clear that many of my posts have not been read, nor has anyone here actually tried to use the patch. Everybody's views on what constitutes error-prone might well differ then. Speculation about additional requirements is just great, but please don't assume that I have infinite resources to apply to these problems. Documentation has still to be written. For a long time now, I've been adding "one last feature" to what is there, but we're still no nearer to anybody inspecting the patch or committing it. There is building consensus on other threads that PITR should not even be included in the release (3 tentative votes). This latest request feels more like the necessary excuse to take the decision to pull PITR. I would much rather that we took the brave decision and pull it NOW, rather than have me work like crazy to chase this release. :( Best Regards, Simon Riggs
I tried what I thought was a straightforward scenario, and seem to have broken it :-( Here is the little tale:

1) initdb
2) set archive_mode and archive_dest in postgresql.conf
3) startup
4) create database called 'test'
5) connect to 'test' and type 'checkpoint'
6) backup PGDATA using 'tar -zcvf'
7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
8) shutdown and remove PGDATA
9) recover using 'tar -zxvf'
10) copy recovery.conf into PGDATA
11) startup

This is what I get:

LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
LOG: recovery command file found...
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file "0000000000000000" from archive
LOG: checkpoint record is at 0/A48054
LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/A48094
LOG: restored log file "0000000000000001" from archive
LOG: record with zero length at 0/1FFFFE0
LOG: redo done at 0/1FFFF30
LOG: restored log file "0000000000000001" from archive
LOG: restored log file "0000000000000001" from archive
PANIC: concurrent transaction log activity while database system is shutting down
LOG: startup process (PID 13492) was terminated by signal 6
LOG: aborting startup due to startup process failure

The concurrent access is a bit of a puzzle, as this is my home machine (i.e. I am *sure* no one else is connected!)

Mark

P.S.: CVS HEAD from about 1 hour ago, PITR 5.2, FreeBSD 4.10 on x86
> > Other db's have commands for: > > start/end external backup I see that the analogy to external backup was not good, since you are correct that dba's would expect that to stop all writes, so they can safely split their mirror or some such. Usually the time from 'start' until 'end external backup' is expected to be only seconds. I actually think we do not need this functionality, since in pg you can safely split the mirror any time you like. My comment was meant to give dba's a familiar tool. The effect of it would only have been to create a separate backup of pg_control. Might as well tell people to always backup pg_control first. I originally thought you would require restore to specify an xlog id from which recovery will start. You would search this log for the first checkpoint record, create an appropriate pg_control, and start rollforward. I still think this would be a nice feature, since then all that would be required for restore is a system backup (that includes pg data files) and the xlogs. Andreas
Dear Simon,

I've just tested pitr_v5_2.patch and got an error message during the archiving process, as follows.

-- begin
LOG: archive command="cp /usr/local/pgsql/data/pg_xlog/0000000000000000 /tmp", return code=-1
-- end

The command called in system(3) works, but it returns -1. system(3) cannot get the right exit code from its child process when SIGCHLD is set to SIG_IGN. So I made the following change to pgarch_Main() in pgarch.c:

-- line 236 ---
- pgsignal(SIGCHLD, SIG_IGN);
-- line 236 ---
+ pgsignal(SIGCHLD, SIG_DFL);

After that, the error message doesn't come out and it seems to be working properly. Regards, Hisada, Masaki

On Wed, 14 Jul 2004 00:13:37 +0100 Simon Riggs <simon@2ndquadrant.com> wrote: > PITR Patch v5_1 just posted has Point in Time Recovery working.... > > Still some rough edges....but we really need some testers now to give > this a try and let me know what you think. > > Klaus Naumann and Mark Wong are the only [non-committers] to have tried > to run the code (and let me know about it), so please have a look at > [PATCHES] and try it out. > > Many thanks, > > Simon Riggs -- HISADA, Masaki <hisada.masaki@lab.ntt.co.jp>
Sorry for the stupid question, but how do I get this patch if I do not receive the [PATCHES] mails? The web interface html'ifies it, thus making it unusable. Thanks Andreas
Simon Riggs wrote: > On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote: > > I talked to Tom on the phone today and and I think we have a procedure > > for doing backup/restore in a fairly foolproof way. > > > > As outlined below, we need to record the start/stop and checkpoint WAL > > file names and offsets, and somehow pass those on to restore. I think > > any system that requires users to link those values together is going > > to cause confusion and be error-prone. > > > > Unfortunately, it seems clear that many of my posts have not been read, > nor has anyone here actually tried to use the patch. Everybody's views > on what constitutes error-prone might well differ then. > > Speculation about additional requirements is just great, but please > don't assume that I have infinite resources to apply to these problems. > Documentation has still to be written. > > For a long time now, I've been adding "one last feature" to what is > there, but we're still no nearer to anybody inspecting the patch or > committing it. I totally understand your feeling this way, and I would be feeling the exact same way (but would probably have complained much earlier :-) ). Anyway, the problem is that Tom and I are serializing application of the major features in the pipeline. We decided to focus on nested transactions (NT) first (it is a larger patch), and that is why PITR has gotten so little attention from us. However, there is no sense that you had anything to do with it being placed behind NT in the queue, and therefore there is no feeling on our part that PITR is less important or deserves less time than NT. Certainly any system that made your patch less likely to be applied would be unfair and something we will not do. My explanation about the file format was an attempt to address the method of passing the WAL filename/offset to the recovery process. If that isn't needed, I am sorry. 
> There is building consensus on other threads that PITR should not even > be included in the release (3 tentative votes). This latest request > feels more like the necessary excuse to take the decision to pull PITR. > I would much rather that we took the brave decision and pull it NOW, > rather than have me work like crazy to chase this release. Those three individuals are not representative of the group. Sorry it might seem that there is a lack of enthusiasm for PITR, but it isn't true from our end. You might have noticed that the patch queue has shrunk dramatically, and now we are focused on NT and PITR almost exclusively. We will get there --- it just seems dark at this time.
Simon Riggs wrote: > On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote: > > > The recovery mechanism doesn't rely upon you knowing 1 or 3. The > > > recovery reads pg_control (from the backup) and then attempts to > > > de-archive the appropriate xlog segment file and then starts > > > rollforward > > > > Unfortunately this only works if pg_control was the first file to be > > backed up (or by chance no checkpoint happened after backup start and > > pg_control backup) > > > > Other db's have commands for: > > start/end external backup > > > > OK...this idea has come up a few times. Here's my take: > > - OS and hardware facilities exist now to make instant copies of sets of > files. Some of these are open source, others not. If you use these, you > have no requirement for this functionality....but these alone are no > replacement for archive recovery.... I accept that some people may not > wish to go to the expense or effort to use those options, but in my mind > these are the people that will not be using archive_mode anyway. > > - all we would really need to do is to stop the bgwriter from doing > anything during backup. pgcontrol is only updated at checkpoint. The > current xlog is updated constantly, but this need not be copied because > we are already archiving it as soon as its full. That leaves the > bgwriter, which is now responsible for both lazy writing and > checkpoints. > So, put a switch into bgwriter to halt for a period, then turn it back > on at the end. Could be a SIGHUP GUC...or... I don't think we can turn off all file system writes during a backup. Imagine writing to a tape. Preventing file system writes would make the system useless. > - please could somebody else code that?... my time is limited Yes, I think someone else could code this.
On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote: > I tried what I thought was a straightforward scenario, and seem to have > broken it :-( > > Here is the little tale > > 1) initdb > 2) set archive_mode and archive_dest in postgresql.conf > 3) startup > 4) create database called 'test' > 5) connect to 'test' and type 'checkpoint' > 6) backup PGDATA using 'tar -zcvf' > 7) create tables in 'test' and add data using COPY (exactly 2 logs worth) > 8) shutdown and remove PGDATA > 9) recover using 'tar -zxvf' > 10) copy recovery.conf into PGDATA > 11) startup > > This is what I get : > > LOG: database system was interrupted at 2004-07-15 21:24:04 NZST > LOG: recovery command file found... > LOG: restore_program = cp %s/%s %s > LOG: recovery_target_inclusive = true > LOG: recovery_debug_log = true > LOG: starting archive recovery > LOG: restored log file "0000000000000000" from archive > LOG: checkpoint record is at 0/A48054 > LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE > LOG: next transaction ID: 496; next OID: 25419 > LOG: database system was not properly shut down; automatic recovery in > progress > LOG: redo starts at 0/A48094 > LOG: restored log file "0000000000000001" from archive > LOG: record with zero length at 0/1FFFFE0 > LOG: redo done at 0/1FFFF30 > LOG: restored log file "0000000000000001" from archive > LOG: restored log file "0000000000000001" from archive > PANIC: concurrent transaction log activity while database system is > shutting down > LOG: startup process (PID 13492) was terminated by signal 6 > LOG: aborting startup due to startup process failure > > The concurrent access is a bit of a puzzle, as this is my home machine > (i.e. I am *sure* noone else is connected!) First, thanks for sticking with it to test this. I've not received such a message myself - this is interesting. Is it possible to copy that directory to one side and re-run the test? 
Add another parameter in postgresql.conf called "archive_debug = true". Does it happen identically the second time? What time difference was there between steps 5 and 6? I think I can hear Andreas saying "told you".... I'm thinking the backup might be somehow corrupted because the checkpoint occurred during the backup. Hmmm... Could you also post me the recovery.log file? (don't post to list) Thanks, Simon Riggs
On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote: > We will get there --- it just seems dark at this time. Thanks for that. My comments were heartfelt, but not useful right now. I'm badly overdrawn already on my time budget, though that is my concern alone. There is more to do than I have time for. Pragmatically, if we aren't going to get there then I need to stop now, so I can progress other outstanding issues. All help is appreciated. I'm aiming for the minimum feature set - which means we do need to take care over whether that set is insufficient and also to pull any part that doesn't stand up to close scrutiny over the next few days. Overall, my primary goal is increased robustness and availability for PostgreSQL...and then to have a rest! Best Regards, Simon Riggs
Simon, On Thu, 15 Jul 2004, Simon Riggs wrote: > > We will get there --- it just seems dark at this time. > > Thanks for that. My comments were heartfelt, but not useful right now. > > I'm badly overdrawn already on my time budget, though that is my concern > alone. There is more to do than I have time for. Pragmatically, if we > aren't going to get there then I need to stop now, so I can progress > other outstanding issues. All help is appreciated. Personally, as a PostgreSQL Advocate, I believe that PITR is one of the most important missing features in PostgreSQL. I've been keeping 'all' of your e-mails about PITR and I'm really excited about that feature. Please do not stop working on PITR. I'm pretty sure that most of the 'silent' people on the lists are waiting for PITR for an {Oracle, DB2, ...}-killer database. In my country (Turkey), too many people spend a lot of money on proprietary databases, just for some missing features in PostgreSQL. If you finish your work on PITR (and other guys on NT, Win32 port, etc), then we'll be able to concentrate more on PostgreSQL advocacy, so that PostgreSQL will be used more and more. (Oh, we also need native clustering...) Maybe I should send this e-mail off-list, but I wanted everyone to know my feelings. Regards and best wishes, -- Devrim GUNDUZ devrim~gunduz.org devrim.gunduz~linux.org.tr http://www.tdmsoft.com http://www.gunduz.org
On Thu, 2004-07-15 at 13:16, HISADA Masaki wrote: > Dear Simon, > > I've just tested pitr_v5_2.patch and got an error message > during archiving process as follows. > > -- begin > LOG: archive command="cp /usr/local/pgsql/data/pg_xlog/0000000000000000 /tmp",return code=-1 > -- end > > The command called in system(3) works, but it returns -1. > system(3) can not get right exit code from its child process, > when SIGCHLD is set as SIG_IGN. > > So I did following change to pgarch_Main() in pgarch.c > > -- line 236 --- > - pgsignal(SIGCHLD, SIG_IGN); > > -- line 236 --- > + pgsignal(SIGCHLD, SIG_DFL); > Thank you for testing the patch. Very much appreciated. I was aware of the potential issues of incorrect return codes, and that exact part of the code is the part I'm least happy with. I'm not sure I understand why it returned -1, though I'll take your recommendation. I've not witnessed such an issue. What system are you running, or is it a default shell issue? Do people think that the change is appropriate for all systems, or just the one you're using? Best Regards, Simon Riggs
> On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote: Thanks for the vote of confidence, on or off list. > too many people spend a lot of > money for proprietary databases, just for some missing features in > PostgreSQL Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it already, even though it doesn't have it, then you are quite likely to be able to keep going without it. Most commercial users won't touch anything that doesn't have PITR. > (Oh, we also need native > clustering...) Next week, OK? :) Best Regards, Simon
On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote: > On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote: > > -- line 236 --- > > - pgsignal(SIGCHLD, SIG_IGN); > > > > -- line 236 --- > > + pgsignal(SIGCHLD, SIG_DFL); > > I'm not sure I understand why its returned -1, though I'll take you > recommendation. I've not witnessed such an issue. What system are you > running, or is it a default shell issue? > > Do people think that the change is appropriate for all systems, or just > the one you're using? My manpage for signal(2) says that you shouldn't assign SIG_IGN to SIGCHLD, according to POSIX. It goes on to say that BSD and SysV behaviors differ on this aspect. (This is on linux BTW) -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "La experiencia nos dice que el hombre peló millones de veces las patatas, pero era forzoso admitir la posibilidad de que en un caso entre millones, las patatas pelarían al hombre" (Ijon Tichy)
Simon Riggs wrote: > > On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote: > > Thanks for the vote of confidence, on or off list. > > > too many people spend a lot of > > money for proprietary databases, just for some missing features in > > PostgreSQL > > > > Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it > > already, even though it doesn't have it, then you are quite likely to be > > able to keep going without it. > > > > Most commercial users won't touch anything that doesn't have PITR. Agreed. I am surprised at how few requests we have gotten for PITR. I assume people are either using replication or not considering us.
Simon Riggs wrote: > >First, thanks for sticking with it to test this. > >I've not received such a message myself - this is interesting. > >Is it possible to copy that directory to one side and re-run the test? >Add another parameter in postgresql.conf called "archive_debug = true" >Does it happen identically the second time? > > > Yes, identical results - I re-initdb'ed and ran the process again, rather than reuse the files. >What time difference was there between steps 5 and 6? I think I can here >Andreas saying "told you".... I'm thinking the backup might be somehow >corrupted because the checkpoint occurred during the backup. Hmmm... > > > I was wondering about this, so left a bit more time in between, and forced a sync as well for good measure. 5) $ psql -d test -c "checkpoint"; sleep 30;sync;sleep 30 6) $ tar -zcvf /data1/dump/pgdata-7.5.tar.gz * > >Thanks, Simon Riggs > > >
Simon Riggs wrote: > On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote: > > > We will get there --- it just seems dark at this time. > > Thanks for that. My comments were heartfelt, but not useful right now. > > I'm badly overdrawn already on my time budget, though that is my concern > alone. There is more to do than I have time for. Pragmatically, if we > aren't going to get there then I need to stop now, so I can progress > other outstanding issues. All help is appreciated. > > I'm aiming for the minimum feature set - which means we do need to take > care over whether that set is insufficient and also to pull any part > that doesn't stand up to close scrutiny over the next few days. As you can see, we are still chewing on NT. What PITR features are missing? I assume because we can't stop file system writes during backup that we will need a backup parameter file like I described. Is there anything else that PITR needs?
> Simon Riggs wrote: > > > On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote: > > > > Thanks for the vote of confidence, on or off list. > > > > > too many people spend a lot of > > > money for proprietary databases, just for some missing features in > > > PostgreSQL > > > > Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it > > already, even though it doesn't have it, then you are quite likely to be > > able to keep going without it. > > > > Most commercial users won't touch anything that doesn't have PITR. > > Agreed. I am surprised at how few requests we have gotten for PITR. I > assume people are either using replication or not considering us. Don't forget that there are (must be) lots of us that know it's coming and are just waiting until it's available. I haven't requested per se, but believe me, I'm waiting for it :-)
Couldn't agree more. Maybe we should have made more noise :-) Glen Parker wrote: >>Simon Riggs wrote: >>>>On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote: >>>Thanks for the vote of confidence, on or off list. >>>> too many people spend a lot of money for proprietary databases, just for some missing features in PostgreSQL >>>Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it already, even though it doesn't have it, then you are quite likely to be able to keep going without it. >>>Most commercial users won't touch anything that doesn't have PITR. >>Agreed. I am surprised at how few requests we have gotten for PITR. I assume people are either using replication or not considering us. >Don't forget that there are (must be) lots of us that know it's coming and are just waiting until it's available. I haven't requested per se, but believe me, I'm waiting for it :-)
Simon Riggs wrote: > >So far: > >I've tried to re-create the problem as exactly as I can, but it works >for me. > >This is clearly an important case to chase down. > >I assume that this is the very first time you tried recovery? Second and >subsequent recoveries using the same set have a potential loophole, >which we have been discussing. > >Right now, I'm thinking that the "exactly 2 logs worth" of data has >brought you very close to the end of the log file (FFFFE0) ending with 1 >and the shutdown checkpoint that is then subsequently written is >failing. > >Can you repeat this your end? > > > It is repeatable at my end. It is actually fairly easy to recreate the example I am using, download http://sourceforge.net/projects/benchw and generate the dataset for Pg - but trim the large "fact0.dat" dump file using head -100000. Thus step 7 consists of creating the 4 tables and COPYing in the data for them. >The nearest I can get to the exact record pointers you show are to start >recovery at A4807C and to end at with FFFF88. > >Overall, PITR changes the recovery process very little, if at all. The >main areas of effect are to do with sequencing of actions and matching >up the right logs with the right backup. I'm not looking for bugs in the >code but in subtle side-effects and "edge" cases. Everything you can >tell me will help me greatly in chasing that down. > > > I agree - I will try this sort of example again, but will change the number of rows I am COPYing (currently 100000) and see if that helps. >Best Regards, Simon Riggs > > > By way of contrast, using the *same* procedure (1-11), but generating 2 logs worth of INSERTS/UPDATES using 10 concurrent process *works fine* - e.g : LOG: database system was interrupted at 2004-07-16 11:17:52 NZST LOG: recovery command file found... 
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file "0000000000000000" from archive
LOG: checkpoint record is at 0/A4803C
LOG: redo record is at 0/A4803C; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/A4807C
postmaster starting
[postgres@shroudeater 7.5]$ LOG: restored log file "0000000000000001" from archive
cp: cannot stat `/data1/pgdata/7.5-archive/0000000000000002': No such file or directory
LOG: could not restore "0000000000000002" from archive
LOG: could not open file "/data1/pgdata/7.5/pg_xlog/0000000000000002" (log file 0, segment 2): No such file or directory
LOG: redo done at 0/1FFFFD4
LOG: archive recovery complete
LOG: database system is ready
LOG: archiver started
On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote: > On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote: > > On Thu, 2004-07-15 at 13:16, HISADA Masaki wrote: > > > > -- line 236 --- > > > - pgsignal(SIGCHLD, SIG_IGN); > > > > > > -- line 236 --- > > > + pgsignal(SIGCHLD, SIG_DFL); > > > > I'm not sure I understand why its returned -1, though I'll take you > > recommendation. I've not witnessed such an issue. What system are you > > running, or is it a default shell issue? > > > > Do people think that the change is appropriate for all systems, or just > > the one you're using? > > My manpage for signal(2) says that you shouldn't assign SIG_IGN to > SIGCHLD, according to POSIX. It goes on to say that BSD and SysV > behaviors differ on this aspect. > POSIX rules OK! So - I should be setting this to SIG_DFL and that's good for everyone? OK. Will do. Best regards, Simon Riggs
On Fri, 2004-07-16 at 00:46, Mark Kirkwood wrote: > > By way of contrast, using the *same* procedure (1-11), but generating 2 > logs worth of INSERTS/UPDATES using 10 concurrent process *works fine* - > e.g : > Great...at least we have shown that something works (or can work) and have begun to isolate the problem whatever it is. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote: >> My manpage for signal(2) says that you shouldn't assign SIG_IGN to >> SIGCHLD, according to POSIX. > So - I should be setting this to SIG_DFL and thats good for everyone? Yeah, we learned the same lesson in the backend not too many releases back. SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms but not others. You could do worse than to look at the existing handling of signals in the postmaster and its children; that code has been beat on pretty heavily ... regards, tom lane
> Thanks for that. My comments were heartfelt, but not useful right now. Hi Simon, I'm sorry if I gave the impression that I thought your work wasn't worthwhile - it is :( > I'm badly overdrawn already on my time budget, though that is my concern > alone. There is more to do than I have time for. Pragmatically, if we > aren't going to get there then I need to stop now, so I can progress > other outstanding issues. All help is appreciated. I've got your patch applied (but I'm having some compilation problems), and I'm really not sure what to test. I don't really understand the whole thing fully :/ > I'm aiming for the minimum feature set - which means we do need to take > care over whether that set is insufficient and also to pull any part > that doesn't stand up to close scrutiny over the next few days. > > Overall, my primary goal is increased robustness and availability for > PostgreSQL...and then to have a rest! Definitely! Chris
On Fri, 2004-07-16 at 04:49, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote: > >> My manpage for signal(2) says that you shouldn't assign SIG_IGN to > >> SIGCHLD, according to POSIX. > > > So - I should be setting this to SIG_DFL and thats good for everyone? > > Yeah, we learned the same lesson in the backend not too many releases > back. SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms > but not others. Many thanks all, Best Regards Simon Riggs
> > I'm aiming for the minimum feature set - which means we do need to take > > care over whether that set is insufficient and also to pull any part > > that doesn't stand up to close scrutiny over the next few days. > > As you can see, we are still chewing on NT. What PITR features are > missing? I assume because we can't stop file system writes during > backup that we will need a backup parameter file like I described. Is > there anything else that PITR needs? No, we don't need to stop writes! Not even to split a mirror; other DBs need that to be able to restore, but we don't. We only need to tell people to backup pg_control first. The rest was only intended to enforce 1. that pg_control is the first file backed up 2. the dba uses a large enough PIT (or xid) for restore I think the idea with an extra file with WAL start position was overly complicated, since all you need is pg_control (+ WAL end position to enforce 2.). If we don't want to tell people to backup pg_control first, imho the next best plan would be to add a "WAL start" input (e.g. xlog name) parameter to recovery.conf, that "fixes" pg_control. Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > We only need to tell people to backup pg_control first. The rest was only > intended to enforce > 1. that pg_control is the first file backed up > 2. the dba uses a large enough PIT (or xid) for restore Right, but I think Bruce's point is that it is far too easy to get those things wrong; especially point 2 for which a straight tar dump will simply not contain the information you need to determine what is a safe stopping point. I agree with Bruce that we should have some mechanism that doesn't rely on the DBA to get this right. Exactly what the mechanism should be is certainly open for discussion... regards, tom lane
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > > We only need to tell people to backup pg_control first. The rest was only > > intended to enforce > > 1. that pg_control is the first file backed up > > 2. the dba uses a large enough PIT (or xid) for restore > > Right, but I think Bruce's point is that it is far too easy to get those > things wrong; especially point 2 for which a straight tar dump will > simply not contain the information you need to determine what is a safe > stopping point. > > I agree with Bruce that we should have some mechanism that doesn't rely > on the DBA to get this right. Exactly what the mechanism should be is > certainly open for discussion... Right. I am wondering what process people would use to backup pg_control first? If they do:

    tar -cf $TAPE ./global/pg_control .

They will get two copies of pg_control, the early one, and one as part of the directory scan. On restore, they would restore the early one, but the directory scan would overwrite it. I suppose they could do:

    cp global/pg_control global/pg_control.backup
    tar -cf $TAPE .

then on restore once all the files are restored move the pg_control.backup to its original name. That gives us the checkpoint wal/offset but how do we get the start/stop information. Is that not required? Maybe we should just have start/stop server-side functions that create a file in the archive directory describing the start/stop counters and time, and the admin would then have to find those values. Why are the start/stop wal/offset values needed anyway? I know why we need the checkpoint value. Do we need a checkpoint after the archiving starts but before the backup begins? Also, when you are in recovery mode, how do you get out of recovery mode, meaning if you have a power failure, how do you prevent the system from doing another recovery? Do you remove the recovery.conf file?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
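The copy-aside procedure Bruce sketches can be simulated with plain file operations. A minimal sketch, where the file names follow the message above, the contents are placeholders, and the tar/restore steps are represented by comments rather than real commands:

```python
# Simulate saving pg_control aside before a base backup, then moving the
# saved copy back after restore. Paths and file contents are illustrative
# placeholders, not PostgreSQL internals.
import os
import shutil
import tempfile

pgdata = tempfile.mkdtemp()
os.makedirs(os.path.join(pgdata, "global"))

control = os.path.join(pgdata, "global", "pg_control")
backup = os.path.join(pgdata, "global", "pg_control.backup")

# State of pg_control at the moment the backup begins.
with open(control, "wb") as f:
    f.write(b"checkpoint-at-backup-start")

# Step 1: save a copy before the directory scan starts.
shutil.copy2(control, backup)

# ... "tar -cf $TAPE ." would run here; meanwhile the live server keeps
# updating pg_control, so the copy picked up by the directory scan is a
# later one.
with open(control, "wb") as f:
    f.write(b"checkpoint-after-backup-started")

# Step 2: after restore, the directory scan has overwritten pg_control
# with the later copy; move the saved copy back to its original name.
os.replace(backup, control)

with open(control, "rb") as f:
    print(f.read().decode())  # prints checkpoint-at-backup-start
```

The point of the rename-back step is exactly the ordering problem described above: whichever copy the directory scan restores last wins, so the pre-backup copy must be kept under a different name.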
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Also, when you are in recovery mode, how do you get out of recovery > mode, meaning if you have a power failure, how do you prevent the system > from doing another recovery? Do you remove the recovery.conf file? I do not care for the idea of a recovery.conf file at all, and have been intending to look to see what we'd need to do to not have one. I find it hard to believe that there is anything one would put in it that is really persistent state. The above concern shows why it shouldn't be treated as a persistent configuration file. regards, tom lane
> then on restore once all the files are restored move the > pg_control.backup to its original name. That gives us the checkpoint > wal/offset but how do we get the start/stop information. Is that not > required? The checkpoint wal/offset is in pg_control, that is sufficient start information. The stop info is only necessary as a safeguard. > Do we need a checkpoint after the archiving > starts but before the backup begins? No. > Also, when you are in recovery mode, how do you get out of recovery > mode, meaning if you have a power failure, how do you prevent the system > from doing another recovery? Do you remove the recovery.conf file? pg_control could be updated during rollforward (only if that actually does a checkpoint). So if pg_control is also the recovery start info, then we can continue from there if we have a power failure. For the first release it would imho also be ok to simply start over if you lose power. I think the filename 'recovery.conf' is misleading, since it is not a static configuration file, but a command file for one recovery. How about 'recovery.command', renamed to 'recovery.inprogress' while recovery runs, and on completion renamed to 'recovery.done'? Andreas
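The rename lifecycle Andreas proposes is easy to sketch with ordinary file operations. File names follow the proposal; the directory is a throwaway temp dir:

```python
# Sketch of the recovery command-file lifecycle: the file moves through
# three names so that an interrupted recovery is detectable on restart.
import os
import tempfile

pgdata = tempfile.mkdtemp()

def path(name):
    return os.path.join(pgdata, name)

# The DBA drops the command file in place before starting the postmaster.
open(path("recovery.command"), "w").close()

# Startup: on seeing recovery.command, rename it before replay begins.
os.rename(path("recovery.command"), path("recovery.inprogress"))

# If we crashed at this point, the next startup would find
# recovery.inprogress and know the previous recovery never completed.

# Recovery finished cleanly: mark it done.
os.rename(path("recovery.inprogress"), path("recovery.done"))

print(os.listdir(pgdata))  # only recovery.done remains
```

The design point is that no state needs to be written into the file itself; the file's name is the state, and a power failure at any step leaves an unambiguous marker behind.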
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: >> Do we need a checkpoint after the archiving >> starts but before the backup begins? > No. Actually yes. You have to start at a checkpoint record when replaying the log, so if no checkpoint occurred between starting to archive WAL and starting the tar backup, you have a useless backup. It would be reasonable to issue a CHECKPOINT just before starting the backup as part of the standard operating procedure for taking PITR dumps. We need not require this, but it would help to avoid this particular sort of mistake; and of course it might save a little bit of replay effort if the backup is ever used. As far as the business about copying pg_control first goes: there is another way to think about it, which is to copy pg_control to another place that will be included in your backup. For example the standard backup procedure could be:

1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
3. tar cf /dev/mt $PGDATA
4. do something to record ending WAL position

If we standardized on this way, then the tar archive would automatically contain the pre-backup checkpoint position in ./pg_control.dump, and there is no need for any special assumptions about the order in which tar processes things. However, once you decide to do things like that, there is no reason why the copied file has to be an exact image of pg_control. I claim it would be more useful if the copied file were plain text so that you could just "cat" it to find out the starting WAL position; that would let you determine without any special tools what range of WAL archive files you are going to need to bring back from your archives. This is pretty much the same chain of reasoning that Bruce and I went through yesterday to come up with the idea of putting a label file inside the tar backups.
We concluded that it'd be worth putting both the backup starting time and the checkpoint WAL position into the label file --- the starting time isn't needed for restore but might be really helpful as documentation, if you needed to verify which dump file was which. regards, tom lane
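The plain-text label file Tom describes could look something like the sketch below. The field names here are illustrative guesses, not a committed format; the point being demonstrated is that such a file is both cat-able and trivially parseable:

```python
# Write and read back a plain-text backup label carrying the pre-backup
# checkpoint WAL position and the backup start time. Field names are
# assumptions for illustration only.
import io

def write_label(f, checkpoint_wal, start_time):
    f.write("START WAL LOCATION: %s\n" % checkpoint_wal)
    f.write("START TIME: %s\n" % start_time)

def read_label(f):
    fields = {}
    for line in f:
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

buf = io.StringIO()
write_label(buf, "0/A48054", "2004-07-16 12:00:00")
buf.seek(0)
label = read_label(buf)
print(label["START WAL LOCATION"])  # prints 0/A48054
```

A file in this shape answers the "what range of WAL archives do I need" question with nothing more than cat, while still being easy for a restore tool to parse mechanically.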
> >> Do we need a checkpoint after the archiving > >> starts but before the backup begins? > > > No. > > Actually yes. Sorry, I incorrectly failed to connect 'archiving' with the backed-up xlogs :-( So yes, you need one checkpoint after archiving starts. Imho turning on xlog archiving should issue such a checkpoint just to be sure. Andreas
On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote: > > >> Do we need a checkpoint after the archiving > > >> starts but before the backup begins? > > > > > No. > > > > Actually yes. > > Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-( > So yes, you need one checkpoint after archiving starts. Imho turning on xlog > archiving should issue such a checkpoint just to be sure. > By agreement, archive_mode can only be turned on at postmaster startup, which means you always have a checkpoint - either because you shut it down cleanly, or you didn't and it recovers, then writes one. There is always something to start the rollforward. So, non-issue. Best regards, Simon Riggs
On Fri, 2004-07-16 at 15:27, Bruce Momjian wrote: > Also, when you are in recovery mode, how do you get out of recovery > mode, meaning if you have a power failure, how do you prevent the system > from doing another recovery? Do you remove the recovery.conf file? That was the whole point of the recovery.conf file: it prevents you from re-entering recovery accidentally, as would occur if the parameters were set in the normal .conf file. Best Regards, Simon Riggs
On Fri, 2004-07-16 at 16:25, Zeugswetter Andreas SB SD wrote: > I think the filename 'recovery.conf' is misleading, since it is not a > static configuration file, but a command file for one recovery. > How about 'recovery.command' then 'recovery.inprogress', and on recovery > completion it should be renamed to 'recovery.done' You understand this and your assessment is correct. recovery.conf isn't an attempt to persist information. It is a means of delivering a set of parameters to the recovery process, as well as signalling overall that archive recovery is required (because the system default remains the same, which is to recover from the logs it has locally available to it). I originally offered a design which used a command, similar to DB2/Oracle...that was overruled as too complex. The (whatever you call it) file is just a very simple way of specifying what's required. There is more to be said here...clearly some explanations are required and I will provide those later... Best regards, Simon Riggs
Simon Riggs wrote: > On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote: > > > >> Do we need a checkpoint after the archiving > > > >> starts but before the backup begins? > > > > > > > No. > > > > > > Actually yes. > > > > Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-( > > So yes, you need one checkpoint after archiving starts. Imho turning on xlog > > archiving should issue such a checkpoint just to be sure. > > > > By agreement, archive_mode can only be turned on at postmaster startup, > which means you always have a checkpoint - either because you shut it > down cleanly, or you didn't and it recovers, then writes one. > > There is always something to start the rollforward. > > So, non-issue. I don't think so. I can imagine many cases where you want to do a nightly tar backup without turning archiving on/off or restarting the postmaster. In those cases, a manual checkpoint would have to be issued before the backup begins. Imagine a system that is up for a month, and they don't have enough archive space to keep a month's worth of WAL files. They would probably do nightly or weekend tar backups, and then discard the WAL archives. What procedure would they use? I assume they would copy all their old WAL files to a save directory, issue a checkpoint, do a tar backup, then they can delete the saved WAL files. Is that correct?
On Fri, 2004-07-16 at 19:30, Bruce Momjian wrote: > Simon Riggs wrote: > > On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote: > > > > >> Do we need a checkpoint after the archiving > > > > >> starts but before the backup begins? > > > > > > > > > No. > > > > > > > > Actually yes. > > > > > > Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-( > > > So yes, you need one checkpoint after archiving starts. Imho turning on xlog > > > archiving should issue such a checkpoint just to be sure. > > > > > > > By agreement, archive_mode can only be turned on at postmaster startup, > > which means you always have a checkpoint - either because you shut it > > down cleanly, or you didn't and it recovers, then writes one. > > > > There is always something to start the rollforward. > > > > So, non-issue. > I was discussing the claim that there might not be a checkpoint to begin the rollforward from. There always is: if you are in archive_mode=true then you will always have a checkpoint that can be used for recovery. It may be "a long way in the past", if there has been no write activity, but the rollforward will be very quick, since there will be no log records. > I don't think so. I can imagine many cases where you want to do a > nightly tar backup without turning archiving on/off or restarting the > postmaster. This is a misunderstanding. I strongly agree with what you say: the whole system has been designed to avoid any benefit from turning on/off archiving and there is no requirement to restart postmaster to take backups. > In those cases, a manual checkpoint would have to be issued > before the backup begins. A manual checkpoint doesn't HAVE TO be issued. Presumably most systems will be running checkpoint every few minutes. Wherever the last one was is where the rollforward would start from.
But you can if that's the way you want to do things, just wait long enough for the checkpoint to have completed, otherwise your objective of reducing rollforward time will not be met. (Please note my earlier reported rollforward performance of approximately x10 rate of recovery v elapsed time - will require testing on your own systems.) > Imagine a system that is up for a month, and they don't have enough > archive space to keep a months worth of WAL files. They would probably > do nightly or weekend tar backups, and then discard the WAL archives. > Yes, that would be normal practice. I would recommend keeping at least the last 3 full backups and all of the WAL logs to cover that period. > What procedure would they use? I assume they would copy all their old > WAL files to a save directory, issue a checkpoint, do a tar backup, then > they can delete the saved WAL files. Is that correct? PITR is designed to interface with a wide range of systems, through the extensible archive/recovery program interface. We shouldn't focus on just tar backups - if you do, then the whole thing seems less feature-rich. The current design allows interfacing with tape, remote backup, internet backup providers, automated standby servers and the dozen major storage/archive vendors' solutions. Writing a procedure to backup, assign filenames, keep track of stuff isn't too difficult if you're a competent DBA with a mild knowledge of shell or perl scripting. But if data is important, people will want to invest the time and trouble to adopt one of the open source or licensed vendors that provide solutions in this area. Systems management is a discipline and procedures should be in place for everything. I fully agree with the "automate everything" dictum, but just don't want to constrain people too much to a particular way of doing things. -o-o- Overall, for first release, I think the complexity of this design is acceptable.
PITR is similar to Oracle7 Backup/Recovery, and easily recognisable to any DBA with current experience of current SQL Server, DB2 (MVS, UDB) or Teradata systems. [I can't comment much on Ingres, Informix, Sybase etc] My main areas of concern are: - the formal correctness of the recovery process As a result of this concern, PITR makes ZERO alterations to the recovery code itself. The trick is to feed it the right xlog files and to stop, if required, at the right place and allow normal work to resume. - the robustness and quality of my implementation This requires quality checking of the code and full beta testing -o-o- We've raised a couple of valid points on the lists in the last few days: - it's probably a desirable feature (but not essential) to implement a write suspend feature on the bgwriter, if nothing else it will be a confidence building feature...as said previously, for many people, this will not be required, but people will no doubt keep asking - there is a small window of risk around the possibility that a recovery target might be set by the user that doesn't rollforward all the way past the end of the backup. That is real, but in general, people aren't likely to be performing archive recovery within minutes of a backup being taken - and if they are, they can always start from the previous one to that. This is a gap we should close, but it's just something to be aware of...just like pg_dump not sorting things in the correct order in its first release. Not for now, but soon, I would propose: - a command to suspend/resume bgwriter to allow backups. - use the suspend/resume feature to write a log record "backup end marker" which shows when this took place. Ensure that any rollforward goes through AT LEAST ONE "backup end marker" on its way. (If a Point in Time is specified too early, we can check this immediately against the checkpoint record.) We can then refuse to stop at any point in time earlier than the backup end marker.
I've written a todo list and will post this again soon. Best Regards, Simon Riggs
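The "refuse to stop before the backup end marker" safeguard Simon proposes amounts to a single comparison at recovery-target validation time. A sketch, using timestamp strings in place of real WAL positions; the function name and error text are illustrative:

```python
# Validate a recovery target against the backup end marker: any target
# earlier than the marker would leave the restored cluster inconsistent,
# because rollforward would not cover the whole base backup.
def validate_target(target, backup_end_marker):
    if target < backup_end_marker:
        raise ValueError("recovery target precedes end of base backup")
    return target

# A target after the marker is accepted...
validate_target("2004-07-16 23:00:00", "2004-07-16 22:15:00")

# ...while a target before it is refused.
try:
    validate_target("2004-07-16 21:00:00", "2004-07-16 22:15:00")
except ValueError as e:
    print(e)
```

In the real system the comparison would of course be on WAL positions read from the log, not strings, but the shape of the check is the same: the end-of-backup record is a floor below which no stopping point may be accepted.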
On Fri, 2004-07-16 at 16:47, Tom Lane wrote: > As far as the business about copying pg_control first goes: there is > another way to think about it, which is to copy pg_control to another > place that will be included in your backup. For example the standard > backup procedure could be > > 1. [somewhat optional] Issue CHECKPOINT and wait till it finishes. > > 2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump > > 3. tar cf /dev/mt $PGDATA > > 4. do something to record ending WAL position > > If we standardized on this way, then the tar archive would automatically > contain the pre-backup checkpoint position in ./pg_control.dump, and > there is no need for any special assumptions about the order in which > tar processes things. > Sounds good. That would be familiar to Oracle DBAs doing BACKUP CONTROLFILE. We can document that and offer it as a suggested procedure. > However, once you decide to do things like that, there is no reason why > the copied file has to be an exact image of pg_control. I claim it > would be more useful if the copied file were plain text so that you > could just "cat" it to find out the starting WAL position; that would > let you determine without any special tools what range of WAL archive > files you are going to need to bring back from your archives. I wouldn't be in favour of a manual mechanism. If you want an automated mechanism, what's wrong with using the one that's already there? You can use pg_controldata to read the controlfile, again what's wrong with that? We agreed some time back that an off-line xlog file inspector would be required to allow us to inspect the logs and make a decision about where to end recovery. You'd still need that. It's scary enough having to specify the end point, let alone having to specify the starting point as well. At your request, and with Bruce's idea, I designed and built the recovery system so that you don't need to know what range of xlogs to bring back.
You just run it, it brings back the right files from archive and does recovery with them, then cleans up - and it works without running out of disk space on long recoveries. I've built it now and it works... > This is pretty much the same chain of reasoning that Bruce and I went > through yesterday to come up with the idea of putting a label file > inside the tar backups. We concluded that it'd be worth putting > both the backup starting time and the checkpoint WAL position into > the label file --- the starting time isn't needed for restore but > might be really helpful as documentation, if you needed to verify > which dump file was which. ...if you are doing tar backups...what will you do if you're not using that mechanism? If you are: It's common practice to make up a backup filename from elements such as systemname, databasename, date and time etc. That gives you the start time, the file last-mod date gives you the end time. I think it's perfectly fine for everybody to do backups any way they please. There are many licensed variants of PostgreSQL and it might be appropriate in those to specify particular ways of doing things. I'll be trusting the management of backup metadata and storage media to a solution designed for the purpose (open or closed source), just as I'll be trusting my data to a database solution designed for that purpose. That for me is one of the good things about PostgreSQL - we use the filesystem rather than writing our own, we provide language interfaces rather than inventing our own proprietary server language, etc. Best Regards, Simon Riggs
OK, I think I have some solid ideas and reasons for them. First, I think we need server-side functions to call when we start/stop the backup. The advantage of these server-side functions is that they will do the required work of recording the pg_control values and creating needed files with little chance for user error. It also allows us to change the internal operations in later releases without requiring admins to change their procedures. We are even able to adjust the internal operation in minor releases without forcing a new procedure on users. Second, I think once we start a restore, we should rename recovery.conf to recovery.in_progress, and when complete rename that to recovery.done. If the postmaster starts and sees recovery.in_progress, it will fail to start knowing its recovery was interrupted. This allows the admin to take appropriate action. (I am not sure what that action would be. Does he bring back the backup files or just keep going?) Third, I think we need to put a file in the archive location once we complete a backup, recording the start/stop xid and wal/offsets. This gives the admin documentation on what archive logs to keep and what xids are available for recovery. Ideally the recover program would read that file and check the recover xid to make sure it is after the stop xid recorded in the file. How would the recover program know the name of that file? We need to create it in /data with start contents before the backup, then complete it with end contents and archive it. What should we name it? Ideally it would be named by the WAL name/offset of the start so it orders in the proper spot in the archive file listing, e.g.:

000000000000093a
000000000000093b
000000000000093b.032b9.start
000000000000093c

Are people going to know they need 000000000000093b for 000000000000093b.032b9.start? I hope so. Another idea is to do:

000000000000093a.xlog
000000000000093b.032b9.start
000000000000093b.xlog
000000000000093c.xlog

This would order properly.
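The ordering claim is easy to verify: a plain lexical sort of the second naming scheme puts each .start label immediately before the segment it belongs to, because the digit that begins the offset part compares lower than the 'x' of .xlog:

```python
# Check that the proposed archive file names sort into the order shown
# above. Names are the ones from the message; sorting is plain lexical.
names = [
    "000000000000093c.xlog",
    "000000000000093b.xlog",
    "000000000000093b.032b9.start",
    "000000000000093a.xlog",
]
for name in sorted(names):
    print(name)
```

So a bare `ls` of the archive directory would show the .start marker adjacent to the segment that contains the backup start, which is exactly the property Bruce is after.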
It might be a very good idea to add extensions to these log files now that we are archiving them in strange places. In fact, maybe we should use *.pg_xlog to document the directory they came from. --------------------------------------------------------------------------- Simon Riggs wrote: > On Fri, 2004-07-16 at 16:47, Tom Lane wrote: > > As far as the business about copying pg_control first goes: there is > > another way to think about it, which is to copy pg_control to another > > place that will be included in your backup. For example the standard > > backup procedure could be > > > > 1. [somewhat optional] Issue CHECKPOINT and wait till it finishes. > > > > 2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump > > > > 3. tar cf /dev/mt $PGDATA > > > > 4. do something to record ending WAL position > > > > If we standardized on this way, then the tar archive would automatically > > contain the pre-backup checkpoint position in ./pg_control.dump, and > > there is no need for any special assumptions about the order in which > > tar processes things. > > > > Sounds good. That would be familiar to Oracle DBAs doing BACKUP > CONTROLFILE. We can document that and off it as a suggested procedure. > > > However, once you decide to do things like that, there is no reason why > > the copied file has to be an exact image of pg_control. I claim it > > would be more useful if the copied file were plain text so that you > > could just "cat" it to find out the starting WAL position; that would > > let you determine without any special tools what range of WAL archive > > files you are going to need to bring back from your archives. > > I wouldn't be in favour of a manual mechanism. If you want an automated > mechanism, whats wrong with using the one thats already there? You can > use pg_controldata to read the controlfile, again whats wrong with that? 
> > We agreed some time back that an off-line xlog file inspector would be > required to allow us to inspect the logs and make a decision about where > to end recovery. You'd still need that. > > It's scary enough having to specify the end point, let alone having to > specify the starting point as well. > > At your request, and with Bruce's idea, I designed and built the > recovery system so that you don't need to know what range of xlogs to > bring back. You just run it, it brings back the right files from archive > and does recovery with them, then cleans up - and it works without > running out of disk space on long recoveries. > > I've built it now and it works... > > > This is pretty much the same chain of reasoning that Bruce and I went > > through yesterday to come up with the idea of putting a label file > > inside the tar backups. We concluded that it'd be worth putting > > both the backup starting time and the checkpoint WAL position into > > the label file --- the starting time isn't needed for restore but > > might be really helpful as documentation, if you needed to verify > > which dump file was which. > > ...if you are doing tar backups...what will you do if you're not using > that mechanism? > > If you are: It's common practice to make up a backup filename from > elements such as systemname, databasename, date and time etc. That gives > you the start time, the file last mod date gives you the end time. > > I think its perfectly fine for everybody to do backups any way they > please. There are many licenced variants of PostgreSQL and it might be > appropriate in those to specify particular ways of doing things. > > I'll be trusting the management of backup metadata and storage media to > a solution designed for the purpose (open or closed source), just as > I'll be trusting my data to a database solution designed for that > purpose. 
That for me is one of the good things about PostgreSQL - we use > the filesystem, we don't write our own, we provide language interfaces > not invent our own proprietary server language etc.. > > Best Regards, Simon Riggs >
Let me address your concerns about PITR getting into 7.5. I think a few people spoke last week expressing concern about our release process and wanting to take drastic action. However, looking at the release status report I am about to post, you will see we are on track for an August 1 beta. PITR has been neglected only because it has been moving along so well we haven't needed to get deeply involved. Simon has been able to address concerns as we raised them and make adjustments quickly with little guidance. Now, we certainly don't want to skip adding PITR by not giving it our full attention to get into 7.5. Once Tom completes the cursor issues with NT in the next day or so, I think that removes the last big NT stumbling block, and we will start to focus on PITR. Unless there is some major thing we are missing, we fully expect to get PITR in 7.5. We don't have a crystal ball to know for sure, but our intent is clear. I know Simon is going away July 26 so we want to get him feedback as soon as possible. If we wait until after July 26, we will have to make all the adjustments without Simon's guidance, which will be difficult. As far as the importance of PITR, it is a _key_ enterprise feature, even more key than NT. PITR is going to be one of the crowning jewels of the 7.5 release, and I don't want to go into beta without it unless we can't help it. So, I know the deadline is looming and everyone is getting nervous, but keep the faith. I can see the light at the end of the tunnel. I know this is a tighter schedule than we would like, but I know we can do it, and I expect we will do it. --------------------------------------------------------------------------- Simon Riggs wrote: > On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote: > > > We will get there --- it just seems dark at this time. > > Thanks for that. My comments were heartfelt, but not useful right now. > > I'm badly overdrawn already on my time budget, though that is my concern > alone.
There is more to do than I have time for. Pragmatically, if we > aren't going to get there then I need to stop now, so I can progress > other outstanding issues. All help is appreciated. > > I'm aiming for the minimum feature set - which means we do need to take > care over whether that set is insufficient and also to pull any part > that doesn't stand up to close scrutiny over the next few days. > > Overall, my primary goal is increased robustness and availability for > PostgreSQL...and then to have a rest! > > Best Regards, Simon Riggs
I decided to produce a nice simple example, so that anyone could hopefully replicate what I am seeing. The scenario is the same as before (the 11 steps), but the CREATE TABLE and COPY step has been reduced to: CREATE TABLE test0 (filler VARCHAR(120)); COPY test0 FROM '/data0/dump/test0.dat' USING DELIMITERS ','; Now the file 'test0.dat' consists of (128293) identical lines, each of 109 'a' characters (plus end of line) A script to run the whole business can be found here : http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz (It will need a bit of editing for things like location of Pg, PGDATA, and you will need to make your own data file) The main points of interest are: - anything <=128392 rows in test0.dat results in 1 archived log, and the recovery succeeds - anything >=128393 rows in test0.dat results in 2 or more archived logs, and recovery fails on the second log (and gives the zero length redo at 0/1FFFFE0 message). Let me know if I can do any more legwork on this (I am considering re-compiling with WAL_DEBUG now that the example is simpler) regards Mark Simon Riggs wrote: >On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote: > > >>I tried what I thought was a straightforward scenario, and seem to have >>broken it :-( >> >>Here is the little tale >> >>1) initdb >>2) set archive_mode and archive_dest in postgresql.conf >>3) startup >>4) create database called 'test' >>5) connect to 'test' and type 'checkpoint' >>6) backup PGDATA using 'tar -zcvf' >>7) create tables in 'test' and add data using COPY (exactly 2 logs worth) >>8) shutdown and remove PGDATA >>9) recover using 'tar -zxvf' >>10) copy recovery.conf into PGDATA >>11) startup >> >>This is what I get : >> >>LOG: database system was interrupted at 2004-07-15 21:24:04 NZST >>LOG: recovery command file found...
>>LOG: restore_program = cp %s/%s %s
>>LOG: recovery_target_inclusive = true
>>LOG: recovery_debug_log = true
>>LOG: starting archive recovery
>>LOG: restored log file "0000000000000000" from archive
>>LOG: checkpoint record is at 0/A48054
>>LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
>>LOG: next transaction ID: 496; next OID: 25419
>>LOG: database system was not properly shut down; automatic recovery in
>>progress
>>LOG: redo starts at 0/A48094
>>LOG: restored log file "0000000000000001" from archive
>>LOG: record with zero length at 0/1FFFFE0
>>LOG: redo done at 0/1FFFF30
>>LOG: restored log file "0000000000000001" from archive
>>LOG: restored log file "0000000000000001" from archive
>>PANIC: concurrent transaction log activity while database system is
>>shutting down
>>LOG: startup process (PID 13492) was terminated by signal 6
>>LOG: aborting startup due to startup process failure
>>
>>The concurrent access is a bit of a puzzle, as this is my home machine
>>(i.e. I am *sure* no one else is connected!)
>>
>
>I can see what is wrong now, but you'll have to help me on details at
>your end...
>
>The log shows that xlog 1 was restored from archive. It contains a zero
>length record, which indicates that it isn't yet full (or that's what the
>existing recovery code assumes it means). Which also indicates that it
>should never have been archived in the first place, and should not
>therefore be a candidate for a restore from archive.
>
>The double message "restored log file" can only occur after you've
>retrieved a partially full file from archive - which, as I say, shouldn't
>be there.
>
>Other messages are essentially spurious in those circumstances.
>
>Either:
>- somehow the files have been mixed up in the archive directory, which
>is possible if the filing discipline is not strict - various ways;
>unfortunately I would guess this to be the most likely, somehow
>- the file that has been restored has been damaged in some way
>- the archiver has archived a file too early (very unlikely, IMHO -
>that's the most robust bit of the code)
>- some aspect of the code has written a zero length record to WAL (which
>is supposed to not be possible, but we mustn't discount an error in
>recently committed work)
>
>- there may also be an effect going on with checkpoints that I don't
>understand...spurious checkpoint warning messages have already been
>observed and reported.
>
>Best regards, Simon Riggs
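For anyone trying to replicate Mark's test, the data file he describes (identical rows of 109 'a' characters, one per line) can be generated with a small shell sketch like this; the row count and output path here are placeholders, not the exact values from his run:

```shell
# Generate a COPY-loadable data file of N identical rows, each 109 'a'
# characters plus a newline - the same shape as Mark's test0.dat.
# N=1000 here for illustration; his test used ~128k rows so the load
# would cross a WAL segment boundary.
N=1000
row=$(printf 'a%.0s' $(seq 1 109))   # a string of exactly 109 'a's
out=$(mktemp)
for i in $(seq 1 "$N"); do
  echo "$row"
done > "$out"
```

Loading it is then just the COPY from the scenario above; since the comma delimiter never appears in the data, each line becomes one single-column row.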
Mark Kirkwood <markir@coretech.co.nz> writes: > - anything >=128393 rows in test0.dat results in 2 or more archived > logs, and recovery fails on the second log (and gives the zero length > redo at 0/1FFFFE0 message). Zero length record is not an error, it's the normal way of detecting end-of-log. regards, tom lane
There are some silly bugs in the script: - forgot to export PGDATA and PATH after changing them - forgot to mention the need to edit test.sql (COPY line needs path to dump file) Apologies - I will submit a fixed version a little later regards Mark Mark Kirkwood wrote: > > A script to run the whole business can be found here : > > http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz > > (It will need a bit of editing for things like location of Pg, PGDATA, > and you will need to make your own data file) >
fixed. Mark Kirkwood wrote: > There are some silly bugs in the script: > > - forgot to export PGDATA and PATH after changing them > - forgot to mention the need to edit test.sql (COPY line needs path to > dump file) > > Apologies - I will submit a fixed version a little later > > regards > > Mark > > Mark Kirkwood wrote: > >> >> A script to run the whole business can be found here : >> >> http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz >> >> (It will need a bit of editing for things like location of Pg, >> PGDATA, and you will need to make your own data file) >> >
On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:
> OK, I think I have some solid ideas and reasons for them.

Sorry for taking so long to reply...

> First, I think we need server-side functions to call when we start/stop
> the backup. The advantage of these server-side functions is that they
> will do the required work of recording the pg_control values and
> creating needed files with little chance for user error. It also allows
> us to change the internal operations in later releases without requiring
> admins to change their procedures. We are even able to adjust the
> internal operation in minor releases without forcing a new procedure on
> users.

Yes, I think we should go down this route. ...there's a "but", and that
is we don't absolutely need it for correctness...and so I must decline
adding it to THIS release. I don't imagine I'll stop being associated
with this code for a while yet...

Can we recommend that users should expect to have to call a start and
end backup routine in later releases? Don't expect you'll agree to
that..

> Second, I think once we start a restore, we should rename recovery.conf
> to recovery.in_progress, and when complete rename that to
> recovery.done. If the postmaster starts and sees recovery.in_progress,
> it will fail to start, knowing its recovery was interrupted. This allows
> the admin to take appropriate action. (I am not sure what that action
> would be. Does he bring back the backup files or just keep going?)

Superseded by Tom's actions. Two states are required: start and stop.
Recovery isn't going to be checkpoint-restartable anytime soon, IMHO.

Best regards, Simon Riggs
Bruce and I had another phone chat about the problems that can ensue if
you restore a tar backup that contains old (incompletely filled)
versions of WAL segment files. While the current code will ignore them
during the recovery-from-archive run, leaving them lying around seems
awfully dangerous. One nasty possibility is that the archiving mechanism
will pick up these files and overwrite good copies in the archive area
with the obsolete ones from the backup :-(.

Bruce earlier proposed that we simply "rm pg_xlog/*" at the start of a
recovery-from-archive run, but as I said I'm scared to death of code
that does such a thing automatically. In particular this would make it
impossible to handle scenarios where you want to do a PITR recovery but
you need to use some recent WAL segments that didn't make it into your
archive yet. (Maybe you could get around this by forcibly transferring
such segments into the archive, but that seems like a bad idea for
incomplete segments.)

It would really be best for the DBA to make sure that the starting
condition for the recovery run does not have any obsolete segment files
in pg_xlog. He could do this either by setting up his backup policy so
that pg_xlog isn't included in the tar backup in the first place, or by
manually removing the included files just after restoring the backup,
before he tries to start the recovery run. Of course the objection to
that is "what if the DBA forgets to do it?"

The idea that we came to on the phone was for the postmaster, when it
enters recovery mode because a recovery.conf file exists, to look in
pg_xlog for existing segment files and refuse to start if any are there
--- *unless* the user has put a special, non-default overriding flag
into recovery.conf. Call it "use_unarchived_files" or something like
that. We'd have to provide good documentation and an extensive HINT of
course, but basically the DBA would have two choices when he gets this
refusal to start:

1. Remove all the segment files in pg_xlog.
(This would be the right thing to do if he knows they all came off the
backup.)

2. Verify that pg_xlog contains only segment files that are newer than
what's stored in the WAL archive, and then set the override flag in
recovery.conf. In this case the DBA is taking responsibility for leaving
only segment files that are good to use.

One interesting point is that with such a policy, we could use locally
available WAL segments in preference to pulling the same segments from
archive, which would be at least marginally more efficient, and seems
logically cleaner anyway.

In particular it seems that this would be a useful arrangement in cases
where you have questionable WAL segments --- you're not sure if they're
good or not. Rather than having to push questionable data into your WAL
archive, you can leave it local, try a recovery run, and see if you like
the resulting state. If not, it's a lot easier to do-over when you have
not corrupted your archive area.

Comments? Better ideas?

regards, tom lane
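A rough sketch of the refusal check Tom describes: scan pg_xlog for leftover segment files before recovery begins. The directory, the 16-hex-digit segment-name pattern (as in the 7.5-era logs above), and the message text are illustrative only, not the committed behavior:

```shell
# Simulate a restored data directory whose pg_xlog still contains an
# old segment file left over from the tar backup.
PGXLOG=$(mktemp -d)
touch "$PGXLOG/0000000000000003"

# Count files that look like WAL segment names (16 hex digits).
leftovers=$(ls "$PGXLOG" | grep -c '^[0-9A-F]\{16\}$')

if [ "$leftovers" -gt 0 ]; then
  msg="refusing to start: $leftovers leftover segment file(s) in pg_xlog"
else
  msg="pg_xlog clean, proceeding with archive recovery"
fi
```

With the hypothetical override flag set, the same scan would instead whitelist those files as local segments usable in preference to archived copies.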
I have been doing some re-testing with CVS HEAD from about 1 hour ago,
using the simplified example posted previously. It is quite interesting:

i) create the table as:

CREATE TABLE test0 (filler TEXT);

and COPY 100,000 rows of length 109; then recovery succeeds.

ii) create the table as:

CREATE TABLE test0 (filler VARCHAR(120));

and COPY as above; then recovery *fails* with the signal 6 error below.

LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/A4807C
LOG: record with zero length at 0/FFFFE0
LOG: redo done at 0/FFFF30
LOG: restored log file "0000000000000000" from archive
LOG: archive recovery complete
PANIC: concurrent transaction log activity while database system is
shutting down
LOG: startup process (PID 17546) was terminated by signal 6
LOG: aborting startup due to startup process failure

(I am pretty sure both TEXT and VARCHAR(120) failed using the original
patch)

Any suggestions for the best way to dig a bit deeper?

regards

Mark
I've got a PITR setup here that's happily scp'ing WAL files across to
another machine. However, the NIC in the machine is currently stuffed,
so it gets like 50k/s :)

What happens, in general, if you are always generating WAL bytes faster
than they can be copied off?

Also, does the archive dir just basically keep filling up forever? How
do I know when I can prune some files? Anything older than the last full
backup?

Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > I've got a PITR set up here that's happily scp'ing WAL files across to > another machine. However, the NIC in the machine is currently stuffed, > so it gets like 50k/s :) What happens in general if you are generating > WAL file bytes faster always than they can be copied off? If you keep falling further and further behind, eventually your pg_xlog directory will fill the space available on its disk, and I think at that point PG will panic and shut down because it can't create any more xlog segments. > Also, does the archive dir just basically keep filling up forever? How > do I know when I can prune some files? Anything older than the last > full backup? Anything older than the starting checkpoint of the last full backup that you might want to restore to. We need to adjust the backup procedure so that the starting segment number for a backup is more readily visible; see recent discussions about logging that explicitly in some fashion. regards, tom lane
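Tom's pruning rule can be sketched in shell: because WAL segment file names sort in creation order, "older than the starting segment of the last base backup you might restore to" comes down to a plain string comparison. The archive path and starting segment name here are hypothetical:

```shell
# Populate a throwaway "archive" with three segment files.
archive=$(mktemp -d)
touch "$archive/0000000000000001" \
      "$archive/0000000000000002" \
      "$archive/0000000000000003"

# Starting segment of the last base backup (hypothetical value - in
# practice this is the number the backup procedure should record).
start=0000000000000002

# Remove every archived segment lexically older than $start.
for f in "$archive"/*; do
  name=$(basename "$f")
  if [ "$name" \< "$start" ]; then
    rm "$f"
  fi
done
```

Note the asymmetry: the starting segment itself and everything after it must stay, since recovery needs an unbroken sequence from the backup's starting checkpoint forward.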
> If you keep falling further and further behind, eventually your pg_xlog > directory will fill the space available on its disk, and I think at that > point PG will panic and shut down because it can't create any more xlog > segments. Hang on, are you supposed to MOVE or COPY away WAL segments? Chris
Christopher Kings-Lynne wrote:
> > If you keep falling further and further behind, eventually your pg_xlog
> > directory will fill the space available on its disk, and I think at that
> > point PG will panic and shut down because it can't create any more xlog
> > segments.
>
> Hang on, are you supposed to MOVE or COPY away WAL segments?

Copy. pg will delete them once they are archived.

-- Bruce Momjian
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: >> If you keep falling further and further behind, eventually your pg_xlog >> directory will fill the space available on its disk, and I think at that >> point PG will panic and shut down because it can't create any more xlog >> segments. > Hang on, are you supposed to MOVE or COPY away WAL segments? COPY. The checkpoint code will then delete or recycle the segment file, as appropriate. regards, tom lane
>>Hang on, are you supposed to MOVE or COPY away WAL segments? > > COPY. The checkpoint code will then delete or recycle the segment file, > as appropriate. So what happens if you just move it? Postgres breaks? Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: >>> Hang on, are you supposed to MOVE or COPY away WAL segments? >> >> COPY. The checkpoint code will then delete or recycle the segment file, >> as appropriate. > So what happens if you just move it? Postgres breaks? I don't think so, but it seems like a much less robust way to do things. What happens if you have a failure partway through? For instance archive machine dies and loses recent data right after you've rm'd the source file. The recommended COPY procedure at least provides some breathing room between when you install the data on the archive and when the original file is removed. It's not like you save any effort by using a MOVE anyway. You're not going to have the archive on the same machine as the database (or if you are, you ain't gonna be *my* DBA ...) regards, tom lane
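In the spirit of Tom's copy-don't-move advice, the archive step can be made deliberately timid: never overwrite an existing archive copy, and return non-zero on any doubt so the server simply retries the segment later. This is a hypothetical helper script, not a documented interface:

```shell
# archive_wal SRC DESTDIR - copy a WAL segment into the archive,
# refusing to clobber an existing copy. A non-zero exit means
# "not archived; try again later", and the original stays in pg_xlog
# until the checkpoint code recycles it.
archive_wal() {
  src=$1; destdir=$2
  dest="$destdir/$(basename "$src")"
  if [ -f "$dest" ]; then
    return 1                     # already there: do not overwrite
  fi
  cp "$src" "$dest"
}

# Demonstration against throwaway directories.
tmp=$(mktemp -d)
mkdir "$tmp/archive"
echo "wal data" > "$tmp/0000000000000007"
archive_wal "$tmp/0000000000000007" "$tmp/archive"
```

The refuse-to-clobber check is exactly what protects the archive in the restored-tar-backup scenario discussed elsewhere in this thread: an obsolete segment from the backup cannot silently replace a good archived copy.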
> I don't think so, but it seems like a much less robust way to do things. > What happens if you have a failure partway through? For instance > archive machine dies and loses recent data right after you've rm'd the > source file. The recommended COPY procedure at least provides some > breathing room between when you install the data on the archive and when > the original file is removed. Well, I tried it in 'cross your fingers' mode and it works, at least: archive_command = 'rm %p' :) Chris
Mark Kirkwood <markir@coretech.co.nz> writes:
> I have been doing some re-testing with CVS HEAD from about 1 hour ago
> using the simplified example posted previously.
> It is quite interesting:

The problem seems to be that the computation of checkPoint.redo at
xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
allowing for the possibility that XLogInsert will decide it doesn't want
to split the checkpoint record across XLOG files, and will then insert a
WASTED_SPACE record to avoid that (see comment and following code at
lines 758-795). This wouldn't really matter except that there is a
safety crosscheck at line 4268 that tries to detect unexpected
insertions of other records during a shutdown checkpoint.

I think the code in CreateCheckPoint was correct when it was written,
because we only recently changed XLogInsert to not split records across
files. But it's got a boundary-case bug now, which your test scenario is
able to exercise by making the recovery run try to write a shutdown
checkpoint exactly at the end of a WAL file segment.

The quick and dirty solution would be to dike out the safety check at
4268ff. I don't much care for that, but am too tired right now to work
out a better answer. I'm not real sure whether it's better to adjust the
computation of checkPoint.redo or to smarten the safety check ... but
one or the other needs to allow for file-end padding, or maybe we could
hack some update of the state in WasteXLInsertBuffer(). (But at some
point you have to say "this is more trouble than it's worth", so maybe
we'll end up taking out the safety check.)

In any case this isn't a fundamental bug, just an insufficiently smart
safety check. But thanks for finding it! As is, the code has a nonzero
probability of failure in the field :-( and I don't know how we'd have
tracked it down without a reproducible test case.

regards, tom lane
> > Hang on, are you supposed to MOVE or COPY away WAL segments? > > Copy. pg will delete them once they are archived. Copy. pg will recycle them once they are archived. Andreas
On Tue, 2004-07-20 at 05:14, Tom Lane wrote: > Mark Kirkwood <markir@coretech.co.nz> writes: > > I have been doing some re-testing with CVS HEAD from about 1 hour ago > > using the simplified example posted previously. > > > It is quite interesting: > The problem seems to be that the computation of checkPoint.redo at > xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not > allowing for the possibility that XLogInsert will decide it doesn't > want to split the checkpoint record across XLOG files, and will then > insert a WASTED_SPACE record to avoid that (see comment and following > code at lines 758-795). This wouldn't really matter except that there > is a safety crosscheck at line 4268 that tries to detect unexpected > insertions of other records during a shutdown checkpoint. > > I think the code in CreateCheckPoint was correct when it was written, > because we only recently changed XLogInsert to not split records > across files. But it's got a boundary-case bug now, which your test > scenario is able to exercise by making the recovery run try to write > a shutdown checkpoint exactly at the end of a WAL file segment. > Thanks for locating that, I was suspicious of that piece of code, but it would have taken me longer than this to locate it exactly. It was clear (to me) that it had to be of this nature, since I've done a fair amount of recovery testing and not hit anything like that. > The quick and dirty solution would be to dike out the safety check at > 4268ff. I don't much care for that, but am too tired right now to work > out a better answer. I'm not real sure whether it's better to adjust > the computation of checkPoint.redo or to smarten the safety check > ... but one or the other needs to allow for file-end padding, or maybe > we could hack some update of the state in WasteXLInsertBuffer(). (But > at some point you have to say "this is more trouble than it's worth", > so maybe we'll end up taking out the safety check.) 
>

I'll take a look.

> In any case this isn't a fundamental bug, just an insufficiently
> smart safety check. But thanks for finding it! As is, the code has
> a nonzero probability of failure in the field :-( and I don't know
> how we'd have tracked it down without a reproducible test case.

All code has a non-zero probability of failure in the field; it's just
that they don't usually tell you that. The main thing here is that we
write everything we need to write to the logs in the first place. If
that is true, then the code can always be adjusted, or the logs dumped
and re-spliced, to recover data.

Definitely: Thanks Mark! Reproducibility is key.

Best regards, Simon Riggs
Great that it's not fundamental - and hopefully with this discovery, the probability you mentioned is being squashed towards zero a bit more :-) Don't let this early bug detract from what is really a superb piece of work! regards Mark Tom Lane wrote: >In any case this isn't a fundamental bug, just an insufficiently >smart safety check. But thanks for finding it! As is, the code has >a nonzero probability of failure in the field :-( and I don't know >how we'd have tracked it down without a reproducible test case. > > > >
On Tue, 2004-07-20 at 05:14, Tom Lane wrote: > Mark Kirkwood <markir@coretech.co.nz> writes: > > I have been doing some re-testing with CVS HEAD from about 1 hour ago > > using the simplified example posted previously. > > > It is quite interesting: > > The problem seems to be that the computation of checkPoint.redo at > xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not > allowing for the possibility that XLogInsert will decide it doesn't > want to split the checkpoint record across XLOG files, and will then > insert a WASTED_SPACE record to avoid that (see comment and following > code at lines 758-795). This wouldn't really matter except that there > is a safety crosscheck at line 4268 that tries to detect unexpected > insertions of other records during a shutdown checkpoint. > > I think the code in CreateCheckPoint was correct when it was written, > because we only recently changed XLogInsert to not split records > across files. But it's got a boundary-case bug now, which your test > scenario is able to exercise by making the recovery run try to write > a shutdown checkpoint exactly at the end of a WAL file segment. > > The quick and dirty solution would be to dike out the safety check at > 4268ff. Well, taking out the safety check isn't the answer. The check produces the last error message "concurrent transaction...", but it isn't the cause of the mismatch in the first place. If you take out that check, we still fail because the wasted space at the end is causing a "record with zero length" error. > I'm not real sure whether it's better to adjust > the computation of checkPoint.redo or to smarten the safety check > ... but one or the other needs to allow for file-end padding, or maybe > we could hack some update of the state in WasteXLInsertBuffer(). (But > at some point you have to say "this is more trouble than it's worth", > so maybe we'll end up taking out the safety check.) ...I'm looking at other options now. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
>> The quick and dirty solution would be to dike out the safety check at
>> 4268ff.

> If you take out that check, we still fail because the wasted space at
> the end is causing a "record with zero length" error.

Ugh. I'm beginning to think we ought to revert the patch that added the
don't-split-across-files logic to XLogInsert; that seems to have broken
more assumptions than I realized. That was added here:

2004-02-11 17:55 tgl

    * src/: backend/access/transam/xact.c,
    backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
    backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
    bin/pg_controldata/pg_controldata.c,
    bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
    include/access/xlog.h, include/access/xlogutils.h,
    include/pg_config_manual.h, include/catalog/pg_control.h,
    include/storage/smgr.h: Commit the reasonably uncontroversial parts
    of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
    segment file so that it can be reliably identified. Avoid
    splitting WAL records across segment files (this is not strictly
    necessary, but makes it simpler to incorporate the header records).
    Make WAL entries for file creation, deletion, and truncation (as
    foreseen but never implemented by Vadim). Also, add support for
    making XLOG_SEG_SIZE configurable at compile time, similarly to
    BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent
    smgr API changes. initdb is forced due to changes in pg_control
    contents.

There are other ways to do this, for example we could treat the WAL page
headers as variable-size, and stick the file labeling info into the
first page's header instead of making it be a separate record. The
separate-record way makes it easier to incorporate future additions to
the file labeling info, but I don't really think it's critical to allow
for that.

regards, tom lane
On Tue, 2004-07-20 at 13:51, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > >> The quick and dirty solution would be to dike out the safety check at > >> 4268ff. > > > If you take out that check, we still fail because the wasted space at > > the end is causing a "record with zero length" error. > > Ugh. I'm beginning to think we ought to revert the patch that added the > don't-split-across-files logic to XLogInsert; that seems to have broken > more assumptions than I realized. That was added here: > > 2004-02-11 17:55 tgl > > * src/: backend/access/transam/xact.c, > backend/access/transam/xlog.c, backend/access/transam/xlogutils.c, > backend/storage/smgr/md.c, backend/storage/smgr/smgr.c, > bin/pg_controldata/pg_controldata.c, > bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h, > include/access/xlog.h, include/access/xlogutils.h, > include/pg_config_manual.h, include/catalog/pg_control.h, > include/storage/smgr.h: Commit the reasonably uncontroversial parts > of J.R. Nield's PITR patch, to wit: Add a header record to each WAL > segment file so that it can be reliably identified. Avoid > splitting WAL records across segment files (this is not strictly > necessary, but makes it simpler to incorporate the header records). > Make WAL entries for file creation, deletion, and truncation (as > foreseen but never implemented by Vadim). Also, add support for > making XLOG_SEG_SIZE configurable at compile time, similarly to > BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent > smgr API changes. initdb is forced due to changes in pg_control > contents. > > There are other ways to do this, for example we could treat the WAL page > headers as variable-size, and stick the file labeling info into the > first page's header instead of making it be a separate record. The > separate-record way makes it easier to incorporate future additions to > the file labeling info, but I don't really think it's critical to allow > for that. 
> I think I've fixed it now...but wait 20 The problem was that a zero length XLOG_WASTED_SPACE record just fell out of ReadRecord when it shouldn't have. By giving it a helping hand it makes it through with pointers correctly set, and everything else was already thought of in the earlier patch, so xlog_redo etc happens. I'll update again in a few minutes....no point us both looking at this. Best regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > On Tue, 2004-07-20 at 13:51, Tom Lane wrote: >> Ugh. I'm beginning to think we ought to revert the patch that added the >> don't-split-across-files logic to XLogInsert; that seems to have broken >> more assumptions than I realized. > The problem was that a zero length XLOG_WASTED_SPACE record just fell > out of ReadRecord when it shouldn't have. By giving it a helping hand it > makes it through with pointers correctly set, and everything else was > already thought of in the earlier patch, so xlog_redo etc happens. Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and adding two more warts to the code to support it is sticking in my craw. I'm thinking it would be cleaner to treat the extra labeling information as an extension of the WAL page header. regards, tom lane
On Tue, 2004-07-20 at 14:11, Simon Riggs wrote: > On Tue, 2004-07-20 at 13:51, Tom Lane wrote: > > Simon Riggs <simon@2ndquadrant.com> writes: > > >> The quick and dirty solution would be to dike out the safety check at > > >> 4268ff. > > > > > If you take out that check, we still fail because the wasted space at > > > the end is causing a "record with zero length" error. > > > > Ugh. I'm beginning to think we ought to revert the patch that added the > > don't-split-across-files logic to XLogInsert; that seems to have broken > > more assumptions than I realized. That was added here: > > > > 2004-02-11 17:55 tgl > > > > * src/: backend/access/transam/xact.c, > > backend/access/transam/xlog.c, backend/access/transam/xlogutils.c, > > backend/storage/smgr/md.c, backend/storage/smgr/smgr.c, > > bin/pg_controldata/pg_controldata.c, > > bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h, > > include/access/xlog.h, include/access/xlogutils.h, > > include/pg_config_manual.h, include/catalog/pg_control.h, > > include/storage/smgr.h: Commit the reasonably uncontroversial parts > > of J.R. Nield's PITR patch, to wit: Add a header record to each WAL > > segment file so that it can be reliably identified. Avoid > > splitting WAL records across segment files (this is not strictly > > necessary, but makes it simpler to incorporate the header records). > > Make WAL entries for file creation, deletion, and truncation (as > > foreseen but never implemented by Vadim). Also, add support for > > making XLOG_SEG_SIZE configurable at compile time, similarly to > > BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent > > smgr API changes. initdb is forced due to changes in pg_control > > contents. > > > > There are other ways to do this, for example we could treat the WAL page > > headers as variable-size, and stick the file labeling info into the > > first page's header instead of making it be a separate record. 
> > The separate-record way makes it easier to incorporate future
> > additions to the file labeling info, but I don't really think it's
> > critical to allow for that.
>
> I think I've fixed it now...but wait 20
>
> The problem was that a zero length XLOG_WASTED_SPACE record just fell
> out of ReadRecord when it shouldn't have. By giving it a helping hand it
> makes it through with pointers correctly set, and everything else was
> already thought of in the earlier patch, so xlog_redo etc happens.
>
> I'll update again in a few minutes....no point us both looking at this.

This was a very confusing test... Here's what I think happened:

Mark discovered a numerological coincidence that meant that the
XLOG_WASTED_SPACE record was zero length at the end of EACH file he was
writing to, as long as there was just that one writer. So no matter how
many records were inserted, each xlog file had a zero length
XLOG_WASTED_SPACE record at its end.

ReadRecord failed on seeing a zero length record, i.e. when it got to
the FIRST of the XLOG_WASTED_SPACE records. That's why the test fails no
matter how many records you give it, as long as it was more than enough
to write into a second xlog segment file. By telling ReadRecord that
XLOG_WASTED_SPACE records of zero length are in fact *OK*, it continues
happily. (That's just a partial fix, see later.)

The test works, but gives what look like strange results: the test blows
away the data directory completely, so the then-current xlog dies too.
That contained the commit for the large COPY, so even though the
recovery now works, the table has zero rows in it. (When things die
you're still likely to lose *some* data.)

Anyway, so then we put the "concurrent transaction" test back in, and
the test passes because we have now set the pointers correctly.

After all that, I think the wasted space idea is still sensible.
You mustn't have a continuation record across files, otherwise we'll end
up with half a commit one day, which would break ACID.

I'm happy that we have the explicit test in XLogInsert for zero-length
records. Somebody will one day write a resource manager with zero length
records when they didn't mean to, and we need to catch that at write
time, not at recovery time like Mark has done. The WasteXLInsertBuffer
was the only part of the code that *can* write a zero-length record, so
we will *not* see another recurrence of this situation --at recovery
time--.

Though further concerns along this theme are:

- what happens when the space at the end of a file is so small we can't
even write a zero-length XLOG_WASTED_SPACE record? Hopefully, you're
gonna say "damn your eyes...couldn't you see that, it's already there".

- if the space at the end of a file was just zeros, then the "concurrent
transaction" test would still fail...we probably need to enhance this to
treat a few zeros at end of file AS IF it was an XLOG_WASTED_SPACE
record and continue. (That scenario would happen if we were doing a
recovery that included a local un-archived xlog that was very close to
being full - probably more likely to occur in crash recovery than
archive recovery)

The included patch doesn't attempt to address those issues, yet.

Best regards, Simon Riggs
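The end-of-segment cases Simon lists come down to simple arithmetic: does the next record, header included, fit in the space remaining before the segment boundary? A sketch with illustrative constants - the real header and segment sizes live in xlog.c and are not these values:

```shell
# Illustrative sizes only (not the real xlog.c constants).
SEG_SIZE=$((16 * 1024 * 1024))   # 16MB segment
REC_HDR=32                       # assumed per-record header size
curr_offset=$((SEG_SIZE - 48))   # insert point near the segment end
rec_len=104                      # payload of the next record

space_left=$((SEG_SIZE - curr_offset))
if [ $((REC_HDR + rec_len)) -gt "$space_left" ]; then
  # Record would cross the boundary: pad out the remainder (this is
  # where an XLOG_WASTED_SPACE record goes) and start the record at
  # the beginning of the next segment instead.
  decision="waste $space_left bytes, continue in next segment"
else
  decision="record fits in current segment"
fi
```

Simon's first concern is the degenerate case where space_left is smaller than even an empty record's header; his second is a tail of plain zeros with no padding record at all.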
On Tue, 2004-07-20 at 15:00, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Tue, 2004-07-20 at 13:51, Tom Lane wrote: > >> Ugh. I'm beginning to think we ought to revert the patch that added the > >> don't-split-across-files logic to XLogInsert; that seems to have broken > >> more assumptions than I realized. > > > The problem was that a zero length XLOG_WASTED_SPACE record just fell > > out of ReadRecord when it shouldn't have. By giving it a helping hand it > > makes it through with pointers correctly set, and everything else was > > already thought of in the earlier patch, so xlog_redo etc happens. > > Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and > adding two more warts to the code to support it is sticking in my craw. > I'm thinking it would be cleaner to treat the extra labeling information > as an extension of the WAL page header. Sounds like a better solution than scrabbling around at the end of file with too many edge cases to test properly ...over to you then... Best Regards, Simon Riggs
Simon Riggs wrote:
> On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:
> > OK, I think I have some solid ideas and reasons for them.
>
> Sorry for taking so long to reply...
>
> > First, I think we need server-side functions to call when we start/stop
> > the backup. The advantage of these server-side functions is that they
> > will do the required work of recording the pg_control values and
> > creating needed files with little chance for user error. It also allows
> > us to change the internal operations in later releases without requiring
> > admins to change their procedures. We are even able to adjust the
> > internal operation in minor releases without forcing a new procedure on
> > users.
>
> Yes, I think we should go down this route. ....there's a "but" and that
> is we don't absolutely need it for correctness....and so I must decline
> adding it to THIS release. I don't imagine I'll stop being associated with
> this code for a while yet....
>
> Can we recommend that users should expect to have to call a start and
> end backup routine in later releases? Don't expect you'll agree to
> that..

I guess my big question is that if we don't do this for 7.5, how will people doing restores know if the xid they specify is valid for the backup they used? If we recover to the most recent time, is there any check that will tell them their backup is invalid because there are no archive records that span the time of their backup?

--
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
FYI - I can confirm that the patch fixes the main issue.

Simon Riggs wrote:
> This was a very confusing test... Here's what I think happened:
> .....
> The included patch doesn't attempt to address those issues, yet.
>
> Best regards, Simon Riggs
This is presumably a standard feature of any PITR design - if the failure event destroys the current transaction log, then you can only recover transactions that committed in the last *archived* log.

regards

Mark

Simon Riggs wrote:
> The test works, but gives what looks like strange results: the test
> blows away the data directory completely, so the then-current xlog dies
> too. That contained the commit for the large COPY, so even though the
> recovery now works, the table has zero rows in it. (When things die
> you're still likely to lose *some* data.)
Looks good to me. The log file numbering scheme seems to have changed - is that part of the fix too?

Tom Lane wrote:
> This is done in CVS tip. Mark, could you retest to verify it's fixed?
>
> regards, tom lane
Mark Kirkwood <markir@coretech.co.nz> writes:
> Looks good to me. The log file numbering scheme seems to have changed - is
> that part of the fix too?

That's for timelines ... it's not directly related, but I thought I should put in both changes at once to avoid forcing an extra initdb.

regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2004-07-20 at 15:00, Tom Lane wrote:
>> Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
>> adding two more warts to the code to support it is sticking in my craw.
>> I'm thinking it would be cleaner to treat the extra labeling information
>> as an extension of the WAL page header.

> Sounds like a better solution than scrabbling around at the end of file
> with too many edge cases to test properly

This is done in CVS tip. Mark, could you retest to verify it's fixed?

regards, tom lane
We need someone to code two backend functions to complete PITR. The functions would be called at start/stop of backup of the data directory. The functions would record xids that are checked during restore to make sure the requested xid is not between the start/stop xids of the backup. They would also contain timestamps so the admin can easily review the archive directory.

The start function needs to issue a checkpoint and create a file in the data directory that contains a few server parameters. At backup stop, the function needs to move the file to pg_xlog and set the *.ready archive flag so it is archived. As for checking during recovery, the file needs to be retrieved and checked to see that the recovery xid is valid. Tom and I can help you with that detail.

Don't worry about all the details of the email below. It is just a general summary. We can give you details once you volunteer.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> OK, I think I have some solid ideas and reasons for them.
>
> First, I think we need server-side functions to call when we start/stop
> the backup. The advantage of these server-side functions is that they
> will do the required work of recording the pg_control values and
> creating needed files with little chance for user error. It also allows
> us to change the internal operations in later releases without requiring
> admins to change their procedures. We are even able to adjust the
> internal operation in minor releases without forcing a new procedure on
> users.
>
> Second, I think once we start a restore, we should rename recovery.conf
> to recovery.in_progress, and when complete rename that to
> recovery.done. If the postmaster starts and sees recovery.in_progress,
> it will fail to start, knowing its recovery was interrupted. This allows
> the admin to take appropriate action. (I am not sure what that action
> would be. Does he bring back the backup files or just keep going?)
> Third, I think we need to put a file in the archive location once we
> complete a backup, recording the start/stop xid and WAL name/offsets. This
> gives the admin documentation on what archive logs to keep and what xids
> are available for recovery. Ideally the recovery program would read that
> file and check the recovery xid to make sure it is after the stop xid
> recorded in the file.
>
> How would the recovery program know the name of that file? We need to
> create it in /data with start contents before the backup, then complete
> it with end contents and archive it.
>
> What should we name it? Ideally it would be named by the WAL
> name/offset of the start so it orders in the proper spot in the archive
> file listing, e.g.:
>
>     000000000000093a
>     000000000000093b
>     000000000000093b.032b9.start
>     000000000000093c
>
> Are people going to know they need 000000000000093b for
> 000000000000093b.032b9.start? I hope so. Another idea is to do:
>
>     000000000000093a.xlog
>     000000000000093b.032b9.start
>     000000000000093b.xlog
>     000000000000093c.xlog
>
> This would order properly. It might be a very good idea to add
> extensions to these log files now that we are archiving them in strange
> places. In fact, maybe we should use *.pg_xlog to document the
> directory they came from.
>
> ---------------------------------------------------------------------------
>
> Simon Riggs wrote:
> > On Fri, 2004-07-16 at 16:47, Tom Lane wrote:
> > > As far as the business about copying pg_control first goes: there is
> > > another way to think about it, which is to copy pg_control to another
> > > place that will be included in your backup. For example the standard
> > > backup procedure could be
> > >
> > > 1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
> > > 2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
> > > 3. tar cf /dev/mt $PGDATA
> > > 4. do something to record the ending WAL position
> > >
> > > If we standardized on this way, then the tar archive would automatically
> > > contain the pre-backup checkpoint position in ./pg_control.dump, and
> > > there is no need for any special assumptions about the order in which
> > > tar processes things.
> >
> > Sounds good. That would be familiar to Oracle DBAs doing BACKUP
> > CONTROLFILE. We can document that and offer it as a suggested procedure.
> >
> > > However, once you decide to do things like that, there is no reason why
> > > the copied file has to be an exact image of pg_control. I claim it
> > > would be more useful if the copied file were plain text so that you
> > > could just "cat" it to find out the starting WAL position; that would
> > > let you determine without any special tools what range of WAL archive
> > > files you are going to need to bring back from your archives.
> >
> > I wouldn't be in favour of a manual mechanism. If you want an automated
> > mechanism, what's wrong with using the one that's already there? You can
> > use pg_controldata to read the control file; again, what's wrong with that?
> >
> > We agreed some time back that an off-line xlog file inspector would be
> > required to allow us to inspect the logs and make a decision about where
> > to end recovery. You'd still need that.
> >
> > It's scary enough having to specify the end point, let alone having to
> > specify the starting point as well.
> >
> > At your request, and with Bruce's idea, I designed and built the
> > recovery system so that you don't need to know what range of xlogs to
> > bring back. You just run it, it brings back the right files from archive
> > and does recovery with them, then cleans up - and it works without
> > running out of disk space on long recoveries.
> >
> > I've built it now and it works...
> > > This is pretty much the same chain of reasoning that Bruce and I went
> > > through yesterday to come up with the idea of putting a label file
> > > inside the tar backups. We concluded that it'd be worth putting
> > > both the backup starting time and the checkpoint WAL position into
> > > the label file --- the starting time isn't needed for restore but
> > > might be really helpful as documentation, if you needed to verify
> > > which dump file was which.
> >
> > ...if you are doing tar backups... what will you do if you're not using
> > that mechanism?
> >
> > If you are: it's common practice to make up a backup filename from
> > elements such as system name, database name, date and time etc. That
> > gives you the start time, and the file's last-mod date gives you the
> > end time.
> >
> > I think it's perfectly fine for everybody to do backups any way they
> > please. There are many licensed variants of PostgreSQL and it might be
> > appropriate in those to specify particular ways of doing things.
> >
> > I'll be trusting the management of backup metadata and storage media to
> > a solution designed for the purpose (open or closed source), just as
> > I'll be trusting my data to a database solution designed for that
> > purpose. That for me is one of the good things about PostgreSQL - we use
> > the filesystem, we don't write our own; we provide language interfaces,
> > not invent our own proprietary server language, etc.
> >
> > Best Regards, Simon Riggs

--
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
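The restore-time check Bruce asks for at the top of this message could be as small as the sketch below (names are invented, and transaction-id wraparound is deliberately ignored): a backup only becomes consistent once WAL replay passes the xid recorded at backup stop, so any requested recovery target at or before that point must be rejected.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;  /* simplified stand-in for the real type */

/* Hypothetical restore-time sanity check: the recovery target xid must
 * be strictly after the stop xid recorded in the backup's label file,
 * otherwise the backup cannot reach a consistent state at the target. */
static bool
recovery_target_ok(TransactionId target_xid, TransactionId backup_stop_xid)
{
    /* xid wraparound deliberately ignored in this sketch */
    return target_xid > backup_stop_xid;
}
```

This is exactly the check that the proposed start/stop backup functions make possible: without the recorded stop xid, there is nothing to compare the user's target against.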
Oh, here is something else we need to add --- a GUC to control whether pg_xlog is clean on recovery start.

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce and I had another phone chat about the problems that can ensue
> if you restore a tar backup that contains old (incompletely filled)
> versions of WAL segment files. While the current code will ignore them
> during the recovery-from-archive run, leaving them lying around seems
> awfully dangerous. One nasty possibility is that the archiving
> mechanism will pick up these files and overwrite good copies in the
> archive area with the obsolete ones from the backup :-(.
>
> Bruce earlier proposed that we simply "rm pg_xlog/*" at the start of
> a recovery-from-archive run, but as I said I'm scared to death of code
> that does such a thing automatically. In particular this would make it
> impossible to handle scenarios where you want to do a PITR recovery but
> you need to use some recent WAL segments that didn't make it into your
> archive yet. (Maybe you could get around this by forcibly transferring
> such segments into the archive, but that seems like a bad idea for
> incomplete segments.)
>
> It would really be best for the DBA to make sure that the starting
> condition for the recovery run does not have any obsolete segment files
> in pg_xlog. He could do this either by setting up his backup policy so
> that pg_xlog isn't included in the tar backup in the first place, or by
> manually removing the included files just after restoring the backup,
> before he tries to start the recovery run.
>
> Of course the objection to that is "what if the DBA forgets to do it?"
> The idea that we came to on the phone was for the postmaster, when it
> enters recovery mode because a recovery.conf file exists, to look in
> pg_xlog for existing segment files and refuse to start if any are there
> --- *unless* the user has put a special, non-default overriding flag
> into recovery.conf. Call it "use_unarchived_files" or something like
> that. We'd have to provide good documentation and an extensive HINT of
> course, but basically the DBA would have two choices when he gets this
> refusal to start:
>
> 1. Remove all the segment files in pg_xlog. (This would be the right
> thing to do if he knows they all came off the backup.)
>
> 2. Verify that pg_xlog contains only segment files that are newer than
> what's stored in the WAL archive, and then set the override flag in
> recovery.conf. In this case the DBA is taking responsibility for
> leaving only segment files that are good to use.
>
> One interesting point is that with such a policy, we could use locally
> available WAL segments in preference to pulling the same segments from
> archive, which would be at least marginally more efficient, and seems
> logically cleaner anyway.
>
> In particular it seems that this would be a useful arrangement in cases
> where you have questionable WAL segments --- you're not sure if they're
> good or not. Rather than having to push questionable data into your WAL
> archive, you can leave it local, try a recovery run, and see if you like
> the resulting state. If not, it's a lot easier to do-over when you have
> not corrupted your archive area.
>
> Comments? Better ideas?
>
> regards, tom lane

--
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
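The core of the proposed startup check is recognizing which files in pg_xlog are WAL segments at all. A sketch (the helper name and exact name lengths are my reading of the thread, not committed code) that such a refuse-to-start test could build on:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does 'name' look like a WAL segment file?
 * Segment names are runs of uppercase hex digits: 16 characters in the
 * pre-timeline scheme, 24 once the timeline id was prepended.  Anything
 * else (recovery.conf, *.ready flags, etc.) is left alone. */
static bool
looks_like_wal_segment(const char *name)
{
    size_t len = strlen(name);

    if (len != 16 && len != 24)
        return false;
    return strspn(name, "0123456789ABCDEF") == len;
}
```

At startup in recovery mode, the postmaster would scan pg_xlog with this predicate and refuse to proceed if any match is found and the override flag is unset.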
I was wondering about this point - might it not be just as reasonable for the copied file to *be* an exact image of pg_control? Then a very simple variant of pg_controldata (or maybe even just adding switches to pg_controldata itself) would enable the relevant info to be extracted.

P.S.: would love to be that volunteer - however, up to the eyeballs in Business Objects (cringe) and DB2 for the next week or so....

regards

Mark

Bruce Momjian wrote:
> We need someone to code two backend functions to complete PITR.
>
>>> <snippage>
>>>
>>>> However, once you decide to do things like that, there is no reason why
>>>> the copied file has to be an exact image of pg_control. I claim it
>>>> would be more useful if the copied file were plain text so that you
>>>> could just "cat" it to find out the starting WAL position; that would
>>>> let you determine without any special tools what range of WAL archive
>>>> files you are going to need to bring back from your archives.
Mark Kirkwood wrote:
> I was wondering about this point - might it not be just as reasonable
> for the copied file to *be* an exact image of pg_control? Then a very
> simple variant of pg_controldata (or maybe even just adding switches to
> pg_controldata itself) would enable the relevant info to be extracted.

We didn't do that so admins could easily read the file contents.

--
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
Quoting Bruce Momjian <pgman@candle.pha.pa.us>:
> Mark Kirkwood wrote:
> > I was wondering about this point - might it not be just as reasonable
> > for the copied file to *be* an exact image of pg_control? Then a very
> > simple variant of pg_controldata (or maybe even just adding switches to
> > pg_controldata itself) would enable the relevant info to be extracted.
>
> We didn't do that so admins could easily read the file contents.

Ease of reading is a good thing, no argument there. However, using 'pg_controldata' (or similar) to perform the read is not really that much harder than using 'cat' - (it is a wee bit harder, I grant you).

When I posted the original mail I was thinking that the pg_control image is good because it has much more information than just the last WAL offset, and could be used to perform a recovery in the event of the "actual" pg_control being unsuitable (e.g. backed up last instead of first on a busy system). Of course this thinking didn't make it into the original mail, sorry about that!

regards

Mark
> > I was wondering about this point - might it not be just as reasonable
> > for the copied file to *be* an exact image of pg_control? Then a very
> > simple variant of pg_controldata (or maybe even just adding switches to
> > pg_controldata itself) would enable the relevant info to be extracted.
>
> We didn't do that so admins could easily read the file contents.

If you use a readable file you will also need a feature for restore (or a tool) to create an appropriate pg_control file - or are you intending to still require that pg_control be the first file backed up? Another possibility would be that the start function writes the readable file and also copies pg_control.

Andreas
Zeugswetter Andreas SB SD wrote:
> > > I was wondering about this point - might it not be just as reasonable
> > > for the copied file to *be* an exact image of pg_control? Then a very
> > > simple variant of pg_controldata (or maybe even just adding switches to
> > > pg_controldata itself) would enable the relevant info to be extracted.
> >
> > We didn't do that so admins could easily read the file contents.
>
> If you use a readable file you will also need a feature for restore (or a tool)
> to create an appropriate pg_control file, or are you intending to still require
> that pg_control be the first file backed up?
> Another possibility would be that the start function writes the readable file and
> also copies pg_control.

We will back up pg_control in the tar file, but it doesn't have to have all-correct information. The WAL replay will set it properly, I think. In fact it has to start recovery from the checkpoint settings, not the backup settings at all.

--
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> If you use a readable file you will also need a feature for restore
> (or a tool) to create an appropriate pg_control file, or are you
> intending to still require that pg_control be the first file backed
> up.

No, the entire point of this exercise is to get rid of that assumption. You do need *a* copy of pg_control, but the only reason you'd need to back it up first rather than later is so that its checkpoint pointer points to the last checkpoint before the dump starts. Which is the information we want to put in the archive-label file instead.

If a copy of pg_control were sufficient then I'd be all for using it as the archive-label file, but it's *not* sufficient, because you also need the ending WAL offset. So we need a different file layout in any case, and we may as well take some pity on the poor DBA and make the file easily human-readable.

regards, tom lane
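To make the discussion concrete, such a human-readable archive-label file might read something like the sketch below. The format, field names, and WAL positions are purely illustrative - the thread does not fix a layout:

```
START WAL LOCATION: 0/93B032B9 (file 000000000000093b)
CHECKPOINT LOCATION: 0/93B03300
START TIME: 2004-07-20 15:00:00 BST
STOP WAL LOCATION: 0/93C01FF0 (file 000000000000093c)
STOP TIME: 2004-07-20 15:42:10 BST
```

A plain "cat" of such a file would tell the DBA both the first archive segment needed (from the checkpoint location) and the last (from the stop location), with no special tools - which is exactly the argument for plain text over a binary pg_control image.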
OK - that is a much better way of doing it!

regards

Mark

Tom Lane wrote:
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> > If you use a readable file you will also need a feature for restore
> > (or a tool) to create an appropriate pg_control file, or are you
> > intending to still require that pg_control be the first file backed
> > up.
>
> No, the entire point of this exercise is to get rid of that assumption.
> You do need *a* copy of pg_control, but the only reason you'd need to
> back it up first rather than later is so that its checkpoint pointer
> points to the last checkpoint before the dump starts. Which is the
> information we want to put in the archive-label file instead.
>
> If a copy of pg_control were sufficient then I'd be all for using it as
> the archive-label file, but it's *not* sufficient because you also need
> the ending WAL offset. So we need a different file layout in any case,
> and we may as well take some pity on the poor DBA and make the file
> easily human-readable.
>
> regards, tom lane