Thread: Re: [HACKERS] PITR Dead horse?
>Bruce Momjian
> Marc G. Fournier wrote:
> >
> >   pgsql-hackers-pitr@postgresql.org
> >
> > I set myself as owner, since I didn't figure it was something you really
> > needed added to your plate? :)  Just means you don't have to go through
> > and do the Approvals for postings when they need it, I'll just do it as my
> > normal stuff ...
>
> OK, I have added the mailing list to the web page:
>
>   http://momjian.postgresql.org/main/writings/pgsql/project
>
> and have subscribed myself.

Sorry, I've joined also - I thought Marc's post was a suggestion rather than
a reality. I checked http://www.postgresql.org/lists.html but it wasn't
listed, so I thought it didn't exist yet.

Is there a digest of all previous postings? Or another way to access them?

Regards, Simon
Simon Riggs wrote:
> >Bruce Momjian
> > Marc G. Fournier wrote:
> > >
> > >   pgsql-hackers-pitr@postgresql.org
> > >
> > > I set myself as owner, since I didn't figure it was something you really
> > > needed added to your plate? :)  Just means you don't have to go through
> > > and do the Approvals for postings when they need it, I'll just do it as
> > > my normal stuff ...
> >
> > OK, I have added the mailing list to the web page:
> >
> >   http://momjian.postgresql.org/main/writings/pgsql/project
> >
> > and have subscribed myself.
>
> Sorry, I've joined also - I thought Marc's post was a suggestion rather
> than a reality. I checked http://www.postgresql.org/lists.html but it
> wasn't listed, so I thought it didn't exist yet.
>
> Is there a digest of all previous postings? Or another way to access them?

No one has said anything on the list yet.  :-)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
>Bruce Momjian
> No one has said anything on the list yet.  :-)

Great. Well, I'm planning to be on-line tomorrow night to begin thinking this
through and making some proposals.

If anybody's out there and intending to help, please shout about it on this
list now - Marc's set it up for us, so we should use it, or shut it down!

Looking forward to working with y'all.

Very best regards, Simon Riggs
On Wed, 2004-02-11 at 21:10, Simon Riggs wrote:
> >Bruce Momjian
> > No one has said anything on the list yet.  :-)
>
> Great. Well, I'm planning to be on-line tomorrow night to begin thinking
> this through and making some proposals.
>
> If anybody's out there and intending to help, please shout about it on
> this list now - Marc's set it up for us, so we should use it, or shut it
> down!

I haven't been involved on the development side of PostgreSQL yet, but I have
managed to make my way to this list with an interest in working on PITR. Let
me know what I can do to help - I don't mind doing grunt work if it helps
push this along.

> Looking forward to working with y'all.

I am also looking forward to working on this.

Hope I can contribute to this great project.

Regards, Fred
Tom recently posted the PITR patch to PATCHES - maybe we should send it to
this list too. That might kick-start the discussion...

regards

Mark

Bruce Momjian wrote:
> No one has said anything on the list yet.  :-)
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom recently posted the PITR patch to PATCHES - maybe we should send it
> to this list too. That might kick-start the discussion...

Actually, I tried to send it here first. This list bounced it because it was
over 40K (which I think is the standard limit for all the lists except
PATCHES). So, if you want to look at it, go look in the PATCHES archives:
http://archives.postgresql.org/pgsql-patches/2004-02/msg00134.php

(BTW, Marc, I still don't see an entry for the -pitr list at
http://archives.postgresql.org/ ...)

            regards, tom lane
On Thu, 12 Feb 2004, Tom Lane wrote:

> (BTW, Marc, I still don't see an entry for the -pitr list at
> http://archives.postgresql.org/ ...)

fixed ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
PITR for PostgreSQL 7.5+
========================

I'm posting a few thoughts and plans on how to progress PITR. Initially, I'd
suggest chewing through some ideas to get some direction and then splitting
out some tasks to pursue individually or in teams.

I'm not claiming leadership, infallibility or anything really - I'm more than
happy to work together on these or other ideas for PITR. Any and all comments
or completely alternate views are welcome. It seems very likely that this
design/plan comes much later than other designs/thoughts from members of the
PGDG community or elsewhere. All ideas, corrections or offerings of partial
code will be gratefully accepted. There's no competition here - if you do it
first and it works, great. I'll buy you a beer.

I'll update this as information arrives and the picture changes.

OVERVIEW

PITR means Point-in-Time Recovery and is an extension to the current
functionality of PostgreSQL. Achieving PITR will be a mixture of OS features
and integrated dbms features. My suggested first priority is to sort out the
latter and not reinvent the wheel too much on OS-provided features.

OVERALL PITR REQUIREMENTS

- A full backup of the database
- A backup of a set of time-sequenceable changes that can be used to "roll
  forward" from the point at which the full backup was taken to the desired
  point in time
- Best practice, possibly offered by backup management software, to implement
  both types of backup using a planned & structured approach

SUGGESTED IMPLEMENTATION GOALS

1. Exactly accurate PITR for databases
2. Robustness of code
3. Close to zero performance impact on the normal (backed up) system
4. Good overall performance of the total recovery procedures, to minimise
   outage
5. Other normal PostgreSQL code givens: Berkeley licence, portable ANSI C

GENERAL PITR RECOVERY SCENARIO

The general PITR recovery scenario requires:

A - Take a full database backup
B - Take regular log file backups
C - Restore a full backup of a database. If you cannot do this, hide your
    face in shame. If it fails, retry it until it works. If it never does,
    you're out of luck. If it succeeds, but portions of the database are
    corrupt, it may still be possible to continue if non-critical sections of
    the dbms are still intact, such as the system tables.
D - Restore a full set of log records. If you cannot do this, you will have
    to accept that you have lost data. If it fails, retry it until it works.
    If it never does, at least you have the full backup point, even if that
    is (slightly) out of date.
E - Locate the point in the set of log files that matches the state at which
    the full backup was taken. If this cannot occur, or errors, it may be
    because you have a set of log files that don't match your full backup. In
    that case, you must accept that the full backup you have is the best
    you've got.
F - Decide somehow what the point in time is that you would like to recover
    to.
G - Issue commands to define the point in time to which you would like to
    recover.
H - Roll forward from the point of the backup to the defined PIT. If this
    fails because of corrupted log files, then you may choose to either try
    to carry on restoring past the corruption point, or give up and be happy
    with however far you've got.

SUGGESTED FEATURES TO BE IMPLEMENTED

My suggestion is that the PostgreSQL PITR team should try to provide
facilities to allow B, E, F, G and H.

Partial facilities already exist to achieve:

A - either using pg_dump or via a full OS file copy of a shutdown database
G - roll forward on WAL logs is already possible, but we need to implement
    "stop at point in time" logic

It shouldn't be our role to implement file handling: backup and restore at a
later time can be done many ways, so let the user pick their preferred
method, according to their preferences, data volume, available funds etc.

We should assume that the system onto which the restore takes place is
completely different from the system on which the backup was taken. We may
need to alter config files prior to starting the restored database. Any other
pre-requisites?

COMPARISON WITH EXISTING TODO ENTRIES

- Add entries to WAL files so it contains enough information for recovery
  after a full backup
  This is assumed to be complete. [I'm not sure where the role of pg_clog is
  in all of this, so this may be a broken assumption.] Comments?

- Write application to archive WAL files to tape, disk, or network
  Probably need to do the first part, but I'm arguing not to do the copy to
  tape.

- Determine how the archive writer will interact with the checkpoint process
  and WAL file recycling
  Required for the above.

- Write an application that will recover from a pg_dump or tar backup and
  then recover WAL files to a specific time
  Seems possible, but PITR is possible without this initially, so do the
  other stuff first.

SUGGESTED PHASING OF IMPLEMENTATION

Here's a suggested phasing of work, to move towards full PITR. Clearly other
ways are also possible...

Phase 0: Planning & Discussion, including writing a detailed test plan

Thread 1
Phase 1.1: Implement backup & full restore from OS image (B, E)
Phase 1.2: Implement PITR (add G and modify for H)
Phase 1.3: Implement WAL log inspection facilities (to improve step F)

Thread 2
Phase 2.1: Implement step A/C using pg_dump

Overall, doing this is going to require lots of investigation and discussion.
Most importantly, it's going to involve a fair amount of detailed and fairly
intrusive testing, i.e. complete database wipe and restore. Please don't
anyone reading this think I'm going to cut acres of code... we'll need lots
of help from all sorts of people, especially from veteran PostgreSQL
developers.

No time plans. We have time to get it right.

FEATURE OVERVIEWS & INVESTIGATIONS/IMPLEMENTATIONS REQUIRED

B - Backing up WAL log files

Ordinarily, when old log segment files are no longer needed, they are
recycled (renamed to become the next segments in the numbered sequence). This
means that the data within them must be copied from there to another location
AFTER postgres has closed that file and BEFORE it is renamed and recycled.
Spotting that window of opportunity fairly accurately is important. We will
need some mechanism to recognise the availability of a file for copying, then
lock the file and copy it away.

Think about:
- What will happen if postgres tries to reuse the file while we are still
  copying? Does postgres hang, waiting for the copy to complete (which may be
  a long time if the copy is hanging because of a disk full condition)? This
  may already be catered for in the code, since the recycle logic can handle
  not-ready-to-recycle conditions (not sure).
- Is it possible to allow read-only queries to continue in this case?
- What will happen if the copy fails?
- Who will do the copying? Should we use another sub-process of the
  postmaster to do the copying, or should we try to use a separate program
  entirely?
- What will happen if that process slows down? What will we do if it fails?
- How will we notice that a WAL log is now ready for copying? How will we
  communicate that to the archiver process?
- The manual indicates that the current WAL format is bulky and would require
  some compressed format to be implemented. Initially, I suggest ignoring
  this and simply relying on OS or hardware/tape compression methods.

E - Tee up log files to backup state

Should be fairly straightforward when using a full OS file backup of a
correctly shutdown database.
- Investigate use of WAL file names, to see if any problems exist there.

G - Issue command for recovery point in time

With a full OS file backup, if the database is shut down correctly, then we
will need a way to tell the database "you think you're up to date, but you're
not - I've added some more WAL files into the directories, so roll forward on
those now please".
- Decide when and how to issue the command.
- Decide syntax for the command? RECOVER DATABASE TO TIME <TIME> if an SQL
  statement. Could also implement a new switch on the postmaster, to tell it
  to come up in recovery mode even though it thinks it doesn't need to?

H - Modify WAL rollforward

Rolling forward until a point in time should be fairly straightforward. The
current implementation, once started, will run until it runs out of files to
apply, so simply change the termination test (a minimal sketch of such a test
follows at the end of this post). Where to store the point-in-time variable?
Global? Pass it as a parameter when the WAL rollforward logic is invoked as a
command?

A/C - Using pg_dump

Following a restore from pg_dump, pg_control will not be set at the same
point it was at when the backup was taken. As a result, step E would
incorrectly identify the starting position for the rollforward, which would
either result in a "cannot find correct log" message, or attempt to roll
forward with completely inappropriate WAL data. Some means of getting round
this issue needs to be investigated.

Idea: modify pg_dump to record the transaction id at the start of the pg_dump
and have it generate, as the LAST SQL statement in the dump, a statement to
modify the system tables (gasp!) to reflect the equivalent state at backup.
Maybe; investigation required. This route may well be complicated by National
Language issues etc, whereas the full backup method will at least ignore all
of that.

What else do we need to do? What complications are there?

= = = = = = =
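To make the termination test under H concrete, here is a minimal standalone
sketch of the idea: stop applying WAL records at the first commit timestamp
past the requested target. The WalRecord type, its fields and the
recovery_stops_here() name are invented purely for illustration - they are
not the real xlog.c structures or functions.

/*
 * Illustrative sketch only: a "stop at point in time" termination test for
 * WAL replay.  Types and names are invented; not the actual xlog.c code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef struct WalRecord
{
    unsigned int xid;           /* transaction that produced the record */
    time_t       commit_time;   /* commit timestamp carried by the record */
} WalRecord;

/*
 * Return true if replay should stop BEFORE applying 'rec', i.e. at the
 * first commit whose timestamp lies beyond the requested target.
 * A target of 0 means "no target: replay everything available".
 */
static bool
recovery_stops_here(const WalRecord *rec, time_t recovery_target_time)
{
    if (recovery_target_time == 0)
        return false;
    return rec->commit_time > recovery_target_time;
}

int
main(void)
{
    WalRecord records[] = { {100, 1000}, {101, 2000}, {102, 3000} };
    time_t    target = 2500;    /* recover up to, but not past, this time */
    int       i;

    for (i = 0; i < 3; i++)
    {
        if (recovery_stops_here(&records[i], target))
        {
            printf("stopping before xid %u\n", records[i].xid);
            break;
        }
        printf("applying xid %u\n", records[i].xid);
    }
    return 0;
}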
"Simon Riggs" <simon@2ndquadrant.com> writes: > The general PITR recovery scenario requires: > A - Take a full database backup > B - Take regular log file backups > C - Restore a full backup of a database. Note that pg_dump has absolutely zero to do with this. The base for a PITR recovery scenario has to be a physical dump, not a logical dump; else WAL replay won't work. The conclusion that we came to in previous discussions was that the physical dump can be taken simply by doing "tar" (or your favorite alternative tool) on the $PGDATA directory tree --- you could optionally exclude the pg_xlog directory but must include all else. It is okay to do this while the database is live, so long as the series of log files you will use to recover starts from a checkpoint that occurred before you began the tar run. When you do this the contents of the tar archive will be an inconsistent snapshot of the database and therefore not directly useful. However, if you load the tar archive and then replay WAL from a prior checkpoint to any time after completion of the tar run, you will have brought the database back to consistency. Part A of the problem therefore only requires tar plus enough management software to keep track of the start and stop times of the tar run and correlate the start time to the WAL log series. It is worth pointing out though that you need to start up your Part B solution (archiving off WAL files as they come free) *before* you start taking your base dump. (When we get around to implementing tablespaces, things might get more complicated, since you'd need to remember to archive tablespace trees that might not live under $PGDATA. But we can ignore that refinement for now, I think.) > We should assume that the system onto which the restore takes place is > completely different from the system on which the backup was taken. It can't be too different, since you will need a binary-compatible server; you won't be able to move across machine architectures this way. > - Add entries to WAL files so it contains enough information for > recovery after a full backup > This is assumed to be complete. [I'm not sure where the role of pg_clog > is in all of this, so this may be a broken assumption.] Comments? I believe that the major problems (lack of recording of file creation/ deletion) were solved by the J.R. Nield patches I applied last week. There may be some minor issues left to fix but I can't think of any showstoppers. > - Write application to archive WAL files to tape, disk, or network > Probably need to do first part, but I'm arguing not to do the copy to > tape.. I'd like to somehow see this handled by a user-supplied program or script. What we mainly need is to define a good API that lets the archiver program understand which WAL segment files to archive when. > B - Backing up WAL log files > -Ordinarily, when old log segment files are no longer needed, they are > recycled (renamed to become the next segments in the numbered sequence). > This means that the data within them must be copied from there to > another location > AFTER postgres has closed that file > BEFORE it is renamed and recycled My inclination would be to change the backend code so that as soon as a WAL segment is completed, it is flagged as being ready to dump to tape (or wherever). Possibly the easiest way to do this is to rename the segment file somehow, perhaps "nnn" becomes "nnn.full". Then, after the archiver process has properly dumped the file, reflag it as being dumped (perhaps rename to "nnn.done"). 
Obviously there are any number of ways we could do this flagging, and depending on an OS rename facility might not be the best. A segment then can be recycled when it is both (a) older than the latest checkpoint and (b) flagged as dumped. Note that this approach allows dumping of a file to start before the first time at which it could be recycled. In the event of a crash and restart, WAL replay has to be able to find the flagged segments, so the flagging mechanism can't be one that would make this impossible. > Think about > -what will happen if postgres tries to reuse file while we are still > copying. This is a non-problem; segments that haven't been recycled can't become reuse targets. > -what will happen if copy fails? This is the archiver's problem to deal with. It only gets to be a serious problem when you run out of disk space for WAL segment files, so in most scenarios there is time for manual intervention to fix any such problem and restart the archiver. > -Manual indicates that current WAL format is bulky and would require > some compressed format to be implemented. Initially, I suggest ignoring > this and simply relying of OS or hardware/tape compression methods. This is definitely something we could leave for later. > With full OS file backup, if the database is shutdown correctly, then we > will need a way to tell the database "you think you're up to date, but > you're not - I've added some more WAL files into the directories, so > roll forward on those now please". I do not think this is an issue either, because my vision of this does not include tar backups of shutdown databases. What will be backed up is a live database, therefore the postmaster will definitely know that it needs to perform WAL replay. What we will need is hooks to make sure that the full set of required log files is available. It's entirely possible that that set of log files exceeds available disk space, so it needs to be possible to run WAL replay incrementally, loading and then replaying additional log segments after deleting old ones. Possibly we could do this with some postmaster command-line switches. J. R. Nield's patch embodied an "interactive recovery" backend mode, which I didn't like in detail but the general idea is not necessarily wrong. regards, tom lane
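As a concrete illustration of the kind of user-supplied archiver program
discussed above, here is a minimal single-pass sketch built on the proposed
"nnn" -> "nnn.full" -> "nnn.done" renaming convention. The directory names,
the assumption that it runs from inside $PGDATA with an existing archive
directory, and the single-pass structure are illustrative choices only; none
of this is an existing PostgreSQL interface.

/*
 * Sketch of a stand-alone WAL archiver for the flagging scheme proposed
 * above: the server renames a finished segment "nnn" to "nnn.full"; we copy
 * it somewhere safe and re-flag it "nnn.done" so the server may recycle it.
 * Assumes it is run from inside $PGDATA and that ARCHIVE_DIR already exists.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define XLOG_DIR    "pg_xlog"       /* where the server keeps segments */
#define ARCHIVE_DIR "wal_archive"   /* where we keep safe copies       */

/* Copy src to dst; return 0 on success, -1 on failure. */
static int
copy_file(const char *src, const char *dst)
{
    FILE   *in;
    FILE   *out;
    char    buf[65536];
    size_t  n;
    int     rc = 0;

    if ((in = fopen(src, "rb")) == NULL)
        return -1;
    if ((out = fopen(dst, "wb")) == NULL)
    {
        fclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
        if (fwrite(buf, 1, n, out) != n)
            rc = -1;
    fclose(in);
    if (fclose(out) != 0)
        rc = -1;
    return rc;
}

int
main(void)
{
    DIR            *dir = opendir(XLOG_DIR);
    struct dirent  *de;

    if (dir == NULL)
    {
        perror(XLOG_DIR);
        return 1;
    }

    while ((de = readdir(dir)) != NULL)
    {
        const char *suffix = strstr(de->d_name, ".full");
        char        segname[256];
        char        srcpath[512], dstpath[512], donepath[512];

        /* Only act on segments flagged "<segment>.full". */
        if (suffix == NULL || strcmp(suffix, ".full") != 0)
            continue;

        /* Strip ".full" to recover the segment's base name. */
        snprintf(segname, sizeof(segname), "%.*s",
                 (int) (suffix - de->d_name), de->d_name);
        snprintf(srcpath, sizeof(srcpath), "%s/%s.full", XLOG_DIR, segname);
        snprintf(dstpath, sizeof(dstpath), "%s/%s", ARCHIVE_DIR, segname);
        snprintf(donepath, sizeof(donepath), "%s/%s.done", XLOG_DIR, segname);

        if (copy_file(srcpath, dstpath) != 0)
        {
            /* Leave the .full flag alone so this segment is retried later. */
            fprintf(stderr, "archiving %s failed, will retry\n", segname);
            continue;
        }

        /* Re-flag as dumped; the server is now free to recycle it. */
        if (rename(srcpath, donepath) != 0)
            perror(srcpath);
        else
            printf("archived %s\n", segname);
    }

    closedir(dir);
    return 0;
}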
Tom,

Thanks for your many comments and practical suggestions - for most of which I
think I should be able to bash something out once I've got my new dev env
sorted. I'll update the proposal into a design document with some of my
earlier blah taken out and all of your clarifications put in.

There's a few comments on stuff below.

Best Regards, Simon Riggs

>Tom Lane [mailto:tgl@sss.pgh.pa.us] writes
> > - Write application to archive WAL files to tape, disk, or network
> >   Probably need to do the first part, but I'm arguing not to do the copy
> >   to tape.
>
> I'd like to somehow see this handled by a user-supplied program or script.
> What we mainly need is to define a good API that lets the archiver program
> understand which WAL segment files to archive when.
>
> > B - Backing up WAL log files
> > Ordinarily, when old log segment files are no longer needed, they are
> > recycled (renamed to become the next segments in the numbered sequence).
> > This means that the data within them must be copied from there to another
> > location AFTER postgres has closed that file and BEFORE it is renamed and
> > recycled.
>
> My inclination would be to change the backend code so that as soon as a WAL
> segment is completed, it is flagged as being ready to dump to tape (or
> wherever). Possibly the easiest way to do this is to rename the segment
> file somehow, perhaps "nnn" becomes "nnn.full". Then, after the archiver
> process has properly dumped the file, reflag it as being dumped (perhaps
> rename to "nnn.done"). Obviously there are any number of ways we could do
> this flagging, and depending on an OS rename facility might not be the
> best.
>
> A segment then can be recycled when it is both (a) older than the latest
> checkpoint and (b) flagged as dumped. Note that this approach allows
> dumping of a file to start before the first time at which it could be
> recycled. In the event of a crash and restart, WAL replay has to be able to
> find the flagged segments, so the flagging mechanism can't be one that
> would make this impossible.

That sort of API doesn't do much for my sense of truth-and-beauty, but it
will work and allow us to get to the testing stage and beyond, where we will,
I'm sure, discover many things. When that knowledge is gained *we* can
refactor. Spawning a new post to think through the API in more detail.

> > With a full OS file backup, if the database is shut down correctly, then
> > we will need a way to tell the database "you think you're up to date, but
> > you're not - I've added some more WAL files into the directories, so roll
> > forward on those now please".
>
> I do not think this is an issue either, because my vision of this does not
> include tar backups of shutdown databases. What will be backed up is a live
> database, therefore the postmaster will definitely know that it needs to
> perform WAL replay. What we will need is hooks to make sure that the full
> set of required log files is available.

OK, again let's go for it on that assumption. Longer term, I would feel more
comfortable with a specific "backup state". Relying on a side-effect of crash
recovery for disaster recovery doesn't give me a warm feeling. BUT, that
feeling is for later, not now.

> It's entirely possible that that set of log files exceeds available disk
> space, so it needs to be possible to run WAL replay incrementally, loading
> and then replaying additional log segments after deleting old ones.
> Possibly we could do this with some postmaster command-line switches.
> J. R. Nield's patch embodied an "interactive recovery" backend mode, which
> I didn't like in detail but the general idea is not necessarily wrong.

Again, yes, though I will for now aim at the assumption that recovery can be
completed within available disk space, with this as an immediate add-on when
we have something that works.

That is also the basis for a "warm standby" solution: copy the tar to a new
system (similar, as you say), then repeatedly move new WAL logs across to it,
then start it up in recover-only mode. "Recover-only" mode would be initiated
by a command-line switch, as you say. This would recover all of the WAL logs,
then immediately shut down again. The extension to that is what Oli
Sennhauser has suggested, which is to allow the second system to come up in
read-only mode.

Best Regards, Simon Riggs
>Tom Lane [mailto:tgl@sss.pgh.pa.us]
>Re: [pgsql-hackers-pitr] Proposals for PITR
>>"Simon Riggs" <simon@2ndquadrant.com> writes:
>> The general PITR recovery scenario requires:
>> A - Take a full database backup
>> B - Take regular log file backups
>> C - Restore a full backup of a database.
> Note that pg_dump has absolutely zero to do with this. The base for a
> PITR recovery scenario has to be a physical dump, not a logical dump;
> else WAL replay won't work.

Yes, I agree - I only included it because it was on the TODO. I'll cut it out
of the further thinking on PITR, in case it confuses the issue in
design/usage.

Regards, Simon
Simon Riggs wrote:
> >Tom Lane [mailto:tgl@sss.pgh.pa.us]
> >Re: [pgsql-hackers-pitr] Proposals for PITR
> >>"Simon Riggs" <simon@2ndquadrant.com> writes:
> >> The general PITR recovery scenario requires:
> >> A - Take a full database backup
> >> B - Take regular log file backups
> >> C - Restore a full backup of a database.
> > Note that pg_dump has absolutely zero to do with this. The base for a
> > PITR recovery scenario has to be a physical dump, not a logical dump;
> > else WAL replay won't work.
>
> Yes, I agree - I only included it because it was on the TODO. I'll cut it
> out of the further thinking on PITR, in case it confuses the issue in
> design/usage.

I removed the pg_dump mention on the web page.  Thanks.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
>Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> Simon Riggs wrote
> > - Write application to archive WAL files to tape, disk, or network
> >   Probably need to do the first part, but I'm arguing not to do the copy
> >   to tape.
>
> I'd like to somehow see this handled by a user-supplied program or script.
> What we mainly need is to define a good API that lets the archiver program
> understand which WAL segment files to archive when.
>
> > B - Backing up WAL log files
> > Ordinarily, when old log segment files are no longer needed, they are
> > recycled (renamed to become the next segments in the numbered sequence).
> > This means that the data within them must be copied from there to another
> > location AFTER postgres has closed that file and BEFORE it is renamed and
> > recycled.
>
> My inclination would be to change the backend code so that as soon as a WAL
> segment is completed, it is flagged as being ready to dump to tape (or
> wherever). Possibly the easiest way to do this is to rename the segment
> file somehow, perhaps "nnn" becomes "nnn.full". Then, after the archiver
> process has properly dumped the file, reflag it as being dumped (perhaps
> rename to "nnn.done"). Obviously there are any number of ways we could do
> this flagging, and depending on an OS rename facility might not be the
> best.

Yes, that would be the correct time to begin archiving.

The way the code is currently written, there is a slot in MoveOfflineLogs
which looks to see if XLOG_archive_dir is set before entering a section which
is empty apart from a message. That routine doesn't get called until we're
about to recycle the files, which means we've lost our window of opportunity
to archive them. Making the number of files larger doesn't affect that being
called last.... I'm going to ignore that "hint", and any patch will include
deletion of that code to avoid later confusion.

The log switch and close occurs during XLogWrite, when it is established that
there is no more room in the current log file for the current segment.

The file-flagging mechanism only allows a single archiver program to operate,
so I'll structure it as a new function XLogArchiveNotify() so we can add in
extra stuff later to improve/change things. That way we have a home for the
API.

> A segment then can be recycled when it is both (a) older than the latest
> checkpoint and (b) flagged as dumped. Note that this approach allows
> dumping of a file to start before the first time at which it could be
> recycled. In the event of a crash and restart, WAL replay has to be able to
> find the flagged segments, so the flagging mechanism can't be one that
> would make this impossible.

The number of WAL logs is effectively tunable anyway because it depends on
the number of checkpoint segments, so we can increase that if there are
issues with archival speed vs. txn rate.

The rename is always safe because the log file names never wrap. However, I'm
loath to touch the files, in case something crashes somewhere and we are left
with recovery failing because of an unlocatable file. (To paraphrase one of
the existing code comments, only the truly paranoid survive.)

A similar way is to have a "buddy" file, which indicates whether the segment
is full and ready for archival, i.e. when we close file "nnn" we also write a
nearly empty file called "nnn.full". That file can then be deleted later BY
THE archiver once archival has finished, allowing the segment to be recycled
by InstallXLogFileSegment(). (This would require at least 6 more file
descriptors, but I'm not sure if that's an issue.)

InstallXLogFileSegment() can check XLogArchiveBusy() to see whether it is
allowed to reuse a segment or must allocate a new one. In the initial
implementation this would just test whether "nnn.full" still exists. This
will allow a range of behaviour to be catered for, such as long waits while
manual tape mounts are requested by the archiver, etc.

So in summary, the API is:

  Archiver initialises and waits on notify
  PostgreSQL initialises
    ...then
  PostgreSQL fills a log, switches and closes it, then calls
    XLogArchiveNotify()
  Archiver moves the log somewhere safe, then sets state such that...
    ...sometime later
  PostgreSQL checks XLogArchiveBusy() to see if it's safe to recycle the
    file, and discovers the state set by the archiver

The API is completely unintrusive on current tried and tested operation, and
leaves the archiver(s) free to act as they choose, outside of the address
space of PostgreSQL. That way we don't have to update regression tests with
some destructive non-manual crash tests to show that it works. (A minimal
sketch of the two calls follows at the end of this post.)

Clearly, we wouldn't want WAL logs to hang around too long, so we need an
initiation method for the archival process. Otherwise, we'll be writing
"nnn.full" notifications without anybody ever deleting them. Either this
could be set at startup with an archive_log_mode parameter (OK, the name's
been used before, but if the cap fits, wear it), or by setting a maximum
limit on the number of archive logs, and a few other ideas, none of which I
like.

Hmmmm... any listeners got any ideas here? How do we want this to work?
Anybody want to write a more complex archiver process to act as more than
just a test harness?

Best regards, Simon Riggs
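Here is the minimal sketch mentioned above of what the two calls could look
like under the buddy-file scheme: XLogArchiveNotify() drops an (almost) empty
"nnn.full" file next to the closed segment, and XLogArchiveBusy() reports the
segment as busy for as long as that file still exists (the archiver deletes
it once the segment is safely copied away). The XLogDir constant, the exact
signatures and the standalone main() are illustrative assumptions, not
committed code.

/*
 * Minimal sketch of the proposed archiver-notification API, using the
 * "buddy file" scheme above.  Names and signatures are assumptions only.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define XLogDir "pg_xlog"

/*
 * Called once a WAL segment has been filled, switched away from and closed:
 * create the (nearly empty) "<segment>.full" notification file that tells
 * the archiver the segment is ready.  Returns true on success.
 */
static bool
XLogArchiveNotify(const char *segname)
{
    char    notifypath[512];
    FILE   *fd;

    snprintf(notifypath, sizeof(notifypath), "%s/%s.full", XLogDir, segname);
    if ((fd = fopen(notifypath, "w")) == NULL)
    {
        fprintf(stderr, "could not create %s: %s\n",
                notifypath, strerror(errno));
        return false;
    }
    fclose(fd);
    return true;
}

/*
 * Called by the recycling code (e.g. from InstallXLogFileSegment()): the
 * segment is still "busy" - must not be recycled - for as long as its
 * ".full" notification exists, i.e. the archiver has not yet copied it
 * away and deleted the notification file.
 */
static bool
XLogArchiveBusy(const char *segname)
{
    char    notifypath[512];

    snprintf(notifypath, sizeof(notifypath), "%s/%s.full", XLogDir, segname);
    return access(notifypath, F_OK) == 0;
}

int
main(void)
{
    /* Demo only: assumes a ./pg_xlog directory; the segment name is made up. */
    const char *seg = "0000000100000002";

    if (XLogArchiveNotify(seg))
        printf("segment %s flagged for archiving; busy = %d\n",
               seg, XLogArchiveBusy(seg));
    return 0;
}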