Thread: PITR Phase 1 - Test results
I've now completed the coding of Phase 1 of PITR.

This allows a backup to be recovered and then rolled forward (all the way) on transaction logs. This proves that the code and the design work, and it also validates a lot of the earlier assumptions that were the subject of much earlier debate.

As noted in the previous designs, PostgreSQL talks to an external archiver using the XLogArchive API. I've now completed:
- changes to PostgreSQL
- a simple archiving utility, pg_arch

Using both of these together, I have successfully:
- started pg_arch
- started postgres
- taken a backup using tar
- run pgbench for an extended period, so that the transaction logs taken at the start have long since been recycled
- killed the postmaster
- waited for completion
- rm -R $PGDATA
- restored using tar
- restored xlogs from the archive directory
- started the postmaster and watched it recover to the end of the logs

This has been run through a number of times on non-trivial tests, and I've sat and watched the beast at work to make sure nothing weird was happening with timing.

At this stage:

Missing functions
- recovery does NOT yet stop at a specified point in time (that was always planned for Phase 2)
- a few more log messages are required to report progress
- a debug mode is required to allow most of them to be turned off

Wrinkles
- the code is system-testable, but not as cute as it could be
- input from committers is now sought to complete the work
- you are strongly advised not to treat any of the patches as usable in any real-world situation YET - that bit comes next

Bugs - two bugs currently occur during some tests:
1. The notification mechanism as originally designed causes ALL backends to report that a log file has closed. That works most of the time, though it does give rise to occasional timing errors - nothing too serious, but this inexactness could lead to later errors.
2. After restore, the notification system doesn't recover fully - this is a straightforward one.

I'm building a full patchset for this code and will upload it soon. As you might expect over the time it's taken me to develop this, some bitrot has set in, so I'm rebuilding it against the latest dev version now, and will complete fixes for the two bugs mentioned above.

I'm sure some will say "no words, show me the code"... I thought you all would appreciate some advance warning of this, to plan time to investigate and comment upon the coding.

Best Regards, Simon Riggs, 2ndQuadrant
http://www.2ndquadrant.com
Simon Riggs wrote:
> Well, I guess I was fairly happy too :-)

YES!

> I'd be more comfortable if I'd found more bugs though, but I'm sure the
> kind folk on this list will see that wish of mine comes true!
>
> The code is in a "needs more polishing" state - which is just the right
> time for some last discussions before everything sets too solid.

Once we see the patch, we will be able to eyeball all the code paths and the interfaces to existing code, and will be able to spot a lot of stuff, I am sure. It might take a few passes over it, but you will get all the support and ideas we have.

-- Bruce Momjian  http://candle.pha.pa.us
Well, I guess I was fairly happy too :-)

I'd be more comfortable if I'd found more bugs though, but I'm sure the kind folk on this list will see that wish of mine comes true!

The code is in a "needs more polishing" state - which is just the right time for some last discussions before everything sets too solid.

Regards, Simon

On Mon, 2004-04-26 at 17:48, Bruce Momjian wrote:
> I want to come hug you --- where do you live? !!!
>
> :-)
>
> Simon Riggs wrote:
> > I've now completed the coding of Phase 1 of PITR.
> > [...]
I want to come hug you --- where do you live? !!!

:-)

---------------------------------------------------------------------------

Simon Riggs wrote:
> I've now completed the coding of Phase 1 of PITR.
>
> This allows a backup to be recovered and then rolled forward (all the
> way) on transaction logs. This proves the code and the design works, but
> also validates a lot of the earlier assumptions that were the subject of
> much earlier debate.
> [...]
> Best Regards, Simon Riggs, 2ndQuadrant
> http://www.2ndquadrant.com

-- Bruce Momjian  http://candle.pha.pa.us
On Mon, 2004-04-26 at 16:37, Simon Riggs wrote:
> I've now completed the coding of Phase 1 of PITR.
>
> This allows a backup to be recovered and then rolled forward (all the
> way) on transaction logs. This proves the code and the design works, but
> also validates a lot of the earlier assumptions that were the subject of
> much earlier debate.
>
> As noted in the previous designs, PostgreSQL talks to an external
> archiver using the XLogArchive API.
> I've now completed:
> - changes to PostgreSQL
> - written a simple archiving utility, pg_arch

This will be on HACKERS not PATCHES for a while...

OVERVIEW:

Various code changes. Not all are included here... but I want to prove this is real, rather than have you waiting for my patch release skills to improve.

PostgreSQL changes include:
============================

- guc.c
New GUC called wal_archive to control archival logging or not.

- xlog.h
GUC added here.

- xlog.c
The most critical parts of the code live here. The way things currently work can be thought of as a circular set of logs, with the current log position sweeping around the circle like a clock. In order to archive an xlog, you must start just AFTER the file has been closed and BEFORE the pointer sweeps round again. The code here tries to spot the right moment to notify the archiver that it's time to archive. That point is critical: too early and the archive may yet be incomplete, too late and a window of failure creeps into the system.

Finding that point is more complicated than it seems because every backend has the same file open and decides to close it at different times - nearly the same time if you're running pgbench, but it could vary considerably otherwise. That timing difference is the source of Bug #1. My solution is to use the piece of code that first updates pg_control, since there is a similar need to only-do-it-once. My understanding is that the other backends eventually discover they are supposed to be looking at a different file now and reset themselves - so the xlog gets fsynced only once. It's taken me a week to consider the alternatives... this point is critical, so please suggest if you know/think differently.

When the pointer sweeps round again, if we are still archiving, we simply increase the number of logs in the cycle to defer when we can recycle the xlog. The code doesn't yet handle a failure condition we discussed previously: running out of disk space and how we handle that (there was detailed debate, noted for future implementation).

New utility aimed at being located in src/bin/pg_arch
=======================================================

- pg_arch.c
The idea of pg_arch is that it is a functioning archival tool and at the same time the reference implementation of the XLogArchive API. The API is all wrapped up in the same file currently, to make it easier to implement, but I envisage separating these out into two parts after it passes initial inspection - that shouldn't take too much work given that was its design goal. This will then allow the API to be used for wider applications that want to back up PostgreSQL.

- src/bin/Makefile
Has been updated to include pg_arch, so that this then gets made as part of the full system rather than as an add-on. I'm sure somebody has feelings on this... my thinking was that it ought to be available without too much effort.

What's NOT included (YET!)
==========================
- changes to initdb
- changes to postgresql.conf
- changes to wal_debug
- related changes
- user documentation

- changes to initdb

The XLogArchive API implementation relies on the existence of $PGDATA/pg_rlog. That would be relatively simple to add to initdb, but it's also a no-brainer to add the directory without it, so I thought I'd leave it for discussion in case anybody has good reasons to put it elsewhere, rename it, etc.

More importantly, this affects the security model used by XLogArchive. The way I had originally envisaged this, the directory permissions would be opened up for group-level read/write, thus:
    pg_xlog  rwxr-x---
    pg_rlog  rwxrwx---
though this of course relies on $PGDATA being opened up also. That would then allow the archiving tool to be in its own account, yet with a shared group. (Thinking that a standard Legato install (for instance) is unlikely to recommend sharing a UNIX userid with PostgreSQL.)

I was unaware that PostgreSQL checks the permissions of PGDATA before it starts and does not allow you to proceed if group permissions exist. We have two options:
i) alter all things that rely on security being user-level only - initdb, startup, most other security features?
ii) encourage (i.e. force) people using the XLogArchive API to run as the PostgreSQL owning user (postgres).
I've avoided this issue in the general implementation, thinking that there'll be some strong feelings either way, or an alternative that I haven't thought of yet (please...).

- changes to postgresql.conf

The parameter setting wal_archive=true needs to be added to make XLogArchive work or not. I've not added this to the install template (yet), in case we had some further suggestions for what this might be called.

- changes to wal_debug

The XLOG_DEBUG flag is set as a value between 1 and 16, though the code only ever treats this as a boolean. For my development, I partially implemented an earlier suggestion of mine: set the flag to 1 in the config file, then set the more verbose portions of debug output to trigger when it's set to 16. That affected a couple of places in xlog.c. That may not be needed, so that's not included either.

- user documentation

Not yet... but it will be.

> Bugs
> - two bugs currently occur during some tests:
> 1. the notification mechanism as originally designed causes ALL backends
> to report that a log file has closed. That works most of the time,
> though does give rise to occasional timing errors - nothing too
> serious, but this inexactness could lead to later errors.
> 2. After restore, the notification system doesn't recover fully - this
> is a straightforward one
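To make the notification mechanism described in the xlog.c section above concrete, here is a minimal sketch of what the backend side might look like. It is not the actual patch: the function name XLogArchiveNotify, the exact file-name format, and the 0660 mode are illustrative assumptions; only the $PGDATA/pg_rlog directory and the zero-length per-segment ".full" notification file are ideas taken from this thread.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
XLogArchiveNotify(const char *pgdata, const char *seg)   /* seg: e.g. "00000001000000C6" */
{
    char    path[1024];
    int     fd;

    snprintf(path, sizeof(path), "%s/pg_rlog/%s.full", pgdata, seg);

    /* The file's existence is the message; its content does not matter. */
    fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0660);
    if (fd < 0)
        return -1;      /* e.g. already notified, or pg_rlog missing */
    close(fd);
    return 0;
}

The zero-length file doubles as the asynchronous "signal": the archiver can pick it up whenever it is ready, even if it was down at the moment the segment closed.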
> I want to come hug you --- where do you live? !!!

You're not the only one. But we don't want to smother the poor guy, at least not before he completes his work :-)
On Mon, 2004-04-26 at 18:08, Bruce Momjian wrote:
> Simon Riggs wrote:
> > Well, I guess I was fairly happy too :-)
>
> YES!
>
> > I'd be more comfortable if I'd found more bugs though, but I'm sure the
> > kind folk on this list will see that wish of mine comes true!
> >
> > The code is in a "needs more polishing" state - which is just the right
> > time for some last discussions before everything sets too solid.
>
> Once we see the patch, we will be able to eyeball all the code paths and
> the interfaces to existing code, and will be able to spot a lot of stuff,
> I am sure.
>
> It might take a few passes over it but you will get all the support and
> ideas we have.

Thanks very much. The code will be there in full tomorrow now (oh, it is tomorrow...). I've fixed the bugs that I spoke of earlier, though. They all make sense when you try to tell someone else about them...

Best Regards, Simon
Simon Riggs wrote:
> New utility aimed at being located in src/bin/pg_arch

Why isn't the archiver process integrated into the server?
Peter Eisentraut wrote:
> Simon Riggs wrote:
> > New utility aimed at being located in src/bin/pg_arch
>
> Why isn't the archiver process integrated into the server?

I think it is because the archiver process has to be started/stopped independently of the server.

-- Bruce Momjian  http://candle.pha.pa.us
On Tue, 2004-04-27 at 18:10, Peter Eisentraut wrote:
> Simon Riggs wrote:
> > New utility aimed at being located in src/bin/pg_arch
>
> Why isn't the archiver process integrated into the server?

A number of reasons....

Overall, I initially favoured the archiver as another special backend, like the checkpoint process. That is exactly the same architecture as Oracle uses, so it is a good starting place for thought. We discussed the design in detail on the list and the suggestion was made to implement PITR using an API to send notification to an archiver.

In Oracle7, it was considered OK to just dump the files in some directory and call them archived. Since then, most DBMSs have gone to some trouble to integrate with generic or at least market-leading backup and recovery (BAR) software products. Informix and DB2 provide open interfaces to BARs; Oracle does not, but then it figures it already (had) market share, so we'll just do it our way.

The XLogArchive design allows ANY external archiver to work with PostgreSQL. The pg_arch program supplied is really to show how that might be implemented. This leaves the door open for any BAR product to interface with PostgreSQL, whether this be your favourite open source BAR or the leading proprietary vendors. Wide adoption is an important design feature and the design presented offers this.

The other reason is to do with how and when archival takes place. An asynchronous communication mechanism is required between PostgreSQL and the archiver, to allow for such situations as tape mounts or simple failure of the archiver. The method chosen for implementing this asynchronous comms mechanism lends itself to being an external API - there were other designs, but these were limited to internal use only.

You ask a reasonable question, however. If pg_autovacuum exists, why should pg_autoarch not work also? My own thinking about external connectivity may have overshadowed my thinking there. It would not require too much additional work to add another GUC which gives the name of an external archiver whose execution should be confirmed, or which should be started/restarted if it fails. At this point, such a feature is a nice-to-have in comparison with the goal of being able to recover to a PIT, so I will defer this issue to Phase 3....

Best regards, Simon Riggs
On Tuesday 27 April 2004 22:21, Simon Riggs wrote:
> > Why isn't the archiver process integrated into the server?
>
> You ask a reasonable question, however. If pg_autovacuum exists, why
> should pg_autoarch not work also?

pg_autovacuum is going away to be integrated as a backend process.
On Tuesday 27 April 2004 19:59, Bruce Momjian wrote:
> Peter Eisentraut wrote:
> > Simon Riggs wrote:
> > > New utility aimed at being located in src/bin/pg_arch
> >
> > Why isn't the archiver process integrated into the server?
>
> I think it is because the archiver process has to be started/stopped
> independently of the server.

When the server is not running there is nothing to archive, so I don't follow this argument.
On Monday 26 April 2004 23:11, Simon Riggs wrote:
> ii) encourage (i.e. force) people using the XLogArchive API to run as the
> PostgreSQL owning user (postgres).

I think this is perfectly reasonable.
On Wed, 2004-04-28 at 16:14, Peter Eisentraut wrote:
> On Tuesday 27 April 2004 19:59, Bruce Momjian wrote:
> > I think it is because the archiver process has to be started/stopped
> > independently of the server.
>
> When the server is not running there is nothing to archive, so I don't
> follow this argument.

The running server creates xlogs, which are still available for archive even when the server is not running...

Overall, your point is taken, with many additional comments in my other posts in reply to you.

I accept that this may be desirable in the future, for some simple implementations. The pg_autovacuum evolution path is a good model - if it works and the code is stable, bring it under the postmaster at a later time.

Best Regards, Simon Riggs
Simon Riggs wrote:
> > When the server is not running there is nothing to archive, so I don't
> > follow this argument.
>
> The running server creates xlogs, which are still available for archive
> even when the server is not running...
>
> Overall, your point is taken, with many additional comments in my other
> posts in reply to you.
>
> I accept that this may be desirable in the future, for some simple
> implementations. The pg_autovacuum evolution path is a good model - if
> it works and the code is stable, bring it under the postmaster at a
> later time.

[This email isn't focused because I haven't resolved all my ideas yet.]

OK, I looked over the code. Basically it appears pg_arch is a client-side program that copies files from pg_xlog to a specified directory, and marks completion in a new pg_rlog directory.

The driving part of the program seems to be:

    while ( (n = read( xlogfd, buf, BLCKSZ)) > 0)
        if ( write( archfd, buf, n) != n)
            return false;

The program basically sleeps, and when it awakes it checks to see if new WAL files have been created. There is an additional GUC variable to prevent WAL from being recycled until it has been archived, but the posted patch only had pg_arch.c, its Makefile, and a patch to update bin/Makefile.

Simon (the submitter) specified he was providing an API to archive, but it is really just a set of C routines to call that do copies. It is not a wire protocol or anything like that. The program has a mode where it archives all available WAL files and exits, but by default it has to remain running to continue archiving.

I am wondering if this is the way to approach the situation. I apologize for not considering this earlier. Archives of PITR postings of interest are at:

    http://momjian.postgresql.org/cgi-bin/pgtodo?pitr

It seems the backend is the one who knows right away when a new WAL file has been created and needs to be archived. Also, are folks happy with archiving only full WAL files? This will not restore all transactions up to the point of failure, but might lose perhaps 2-5 minutes of transactions before the failure.

Also, a client application is a separate process that must remain running. With Informix, there is a separate utility to do PITR logging. It is a pain to have to make sure a separate process is always running.

Here is an idea. What if we add two GUC settings:

    pitr = true/false
    pitr_path = 'filename or |program'

In this way, you would basically specify a path to dump all WAL logs into (just keep appending 16MB chunks) or a program that you pipe all the WAL logs into. You can't change pitr_path while pitr is on. Each backend opens the filename in append mode before writing. One problem is that this slows down the backend because it has to do the write, and the write might be slow.

We also need the ability to write to a tape drive, and you can't open/close those like a file. Different backends will be doing the WAL file additions; there isn't a central process to keep a tape drive file descriptor open.

It seems pg_arch should at least use libpq to connect to a database and do a LISTEN, and have the backends NOTIFY when they create a new WAL file, or something. Polling for new WAL files seems non-optimal, but maybe a database connection is overkill.

Then, you start the backend, specify the path, turn on pitr, do the tar, and you are on your way.

Also, pg_arch should only be run by the install user. No need to allow other users to run this.
Another idea is to have a client program like pg_ctl that controls PITR logging (start, stop, location), but does its job and exits, rather than remaining running.

I apologize for not bringing up these issues earlier. I didn't realize the direction it was going. I wasn't focused on it. Sorry.

-- Bruce Momjian  http://candle.pha.pa.us
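For illustration only, here is a rough sketch of the pitr_path idea from Bruce's message above. Nothing like this exists in the posted patch; the function name and the "leading '|' means pipe into a program" convention are assumptions taken from the proposed GUC description, and error handling is simplified.

#include <stdio.h>

static int
archive_completed_segment(const char *seg_path, const char *pitr_path)
{
    FILE   *src, *dst;
    char    buf[8192];
    size_t  n;
    int     piped = (pitr_path[0] == '|');

    src = fopen(seg_path, "rb");
    if (src == NULL)
        return -1;

    /* '|program' pipes the segment into a program; otherwise append to a file */
    dst = piped ? popen(pitr_path + 1, "w") : fopen(pitr_path, "ab");
    if (dst == NULL)
    {
        fclose(src);
        return -1;
    }

    while ((n = fread(buf, 1, sizeof(buf), src)) > 0)
        if (fwrite(buf, 1, n, dst) != n)
            break;

    fclose(src);
    return (piped ? pclose(dst) : fclose(dst)) == 0 ? 0 : -1;
}

This also makes Bruce's concern visible: whichever backend does the copy blocks for as long as the destination (file, pipe, or tape) takes to accept the data.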
On Thu, Apr 29, 2004 at 12:18:38AM -0400, Bruce Momjian wrote:
> OK, I looked over the code. Basically it appears pg_arch is a
> client-side program that copies files from pg_xlog to a specified
> directory, and marks completion in a new pg_rlog directory.
>
> The driving part of the program seems to be:
>
>     while ( (n = read( xlogfd, buf, BLCKSZ)) > 0)
>         if ( write( archfd, buf, n) != n)
>             return false;
>
> The program basically sleeps and when it awakes checks to see if new WAL
> files have been created.

Is the API able to indicate a written but not-yet-filled WAL segment? So an archiver could copy the filled part, and refill it later. This may be needed because a segment could take a while to be filled.

-- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
On Thu, Apr 29, 2004 at 10:07:01AM -0400, Bruce Momjian wrote:
> Alvaro Herrera wrote:
> > Is the API able to indicate a written but not-yet-filled WAL segment?
> > So an archiver could copy the filled part, and refill it later. This
> > may be needed because a segment could take a while to be filled.
>
> I couldn't figure that out, but I don't think it does. It would have to
> lock the WAL writes so it could get a good copy, I think, and I didn't
> see that.

I'm not sure, but I don't think so. You don't have to lock the WAL for writing, because it will always write later in the file than you are allowed to read. (If you read more than you were told to, it's your fault as an archiver.)

-- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Alvaro Herrera wrote:
> On Thu, Apr 29, 2004 at 12:18:38AM -0400, Bruce Momjian wrote:
> > OK, I looked over the code. Basically it appears pg_arch is a
> > client-side program that copies files from pg_xlog to a specified
> > directory, and marks completion in a new pg_rlog directory.
> >
> > The driving part of the program seems to be:
> >
> >     while ( (n = read( xlogfd, buf, BLCKSZ)) > 0)
> >         if ( write( archfd, buf, n) != n)
> >             return false;
> >
> > The program basically sleeps and when it awakes checks to see if new WAL
> > files have been created.
>
> Is the API able to indicate a written but not-yet-filled WAL segment?
> So an archiver could copy the filled part, and refill it later. This
> may be needed because a segment could take a while to be filled.

I couldn't figure that out, but I don't think it does. It would have to lock the WAL writes so it could get a good copy, I think, and I didn't see that.

-- Bruce Momjian  http://candle.pha.pa.us
Alvaro Herrera wrote:
> On Thu, Apr 29, 2004 at 10:07:01AM -0400, Bruce Momjian wrote:
> > Alvaro Herrera wrote:
> > > Is the API able to indicate a written but not-yet-filled WAL segment?
> > > So an archiver could copy the filled part, and refill it later. This
> > > may be needed because a segment could take a while to be filled.
> >
> > I couldn't figure that out, but I don't think it does. It would have to
> > lock the WAL writes so it could get a good copy, I think, and I didn't
> > see that.
>
> I'm not sure but I don't think so. You don't have to lock the WAL for
> writing, because it will always write later in the file than you are
> allowed to read. (If you read more than you were told to, it's your
> fault as an archiver.)

My point was that without locking the WAL, we might get part of a WAL write in our file, but I now realize that during a crash the same thing might happen, so it would be OK to just copy it even if it is being written to.

Simon posted the rest of his patch that shows changes to the backend, and a comment reads:

+ * The name of the notification file is the message that will be picked up
+ * by the archiver, e.g. we write RLogDir/00000001000000C6.full
+ * and the archiver then knows to archive XLogDir/00000001000000C6,
+ * while it is doing so it will rename RLogDir/00000001000000C6.full
+ * to RLogDir/00000001000000C6.busy, then when complete, rename it again
+ * to RLogDir/00000001000000C6.done

so it is only archiving full logs.

Also, I think this archiver should be able to log to a local drive, network drive (trivial), tape drive, ftp, or use an external script to transfer the logs somewhere. (ftp would probably be an external script with 'expect'.)

-- Bruce Momjian  http://candle.pha.pa.us
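As a sketch of the consumer side of the rename protocol in the quoted comment (this is not the actual pg_arch code; paths, buffer sizes, and error handling are simplified, and the copy loop just mirrors the read/write loop quoted earlier in the thread):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
copy_file(const char *src, const char *dst)
{
    char    buf[8192];
    ssize_t n = 0;
    int     in = open(src, O_RDONLY);
    int     out = open(dst, O_CREAT | O_WRONLY | O_TRUNC, 0660);

    if (in < 0 || out < 0)
        return -1;
    while ((n = read(in, buf, sizeof(buf))) > 0)
        if (write(out, buf, n) != n)
            return -1;
    close(in);
    close(out);
    return (n < 0) ? -1 : 0;
}

static int
archive_one_segment(const char *pgdata, const char *archdir, const char *seg)
{
    char full[1024], busy[1024], done[1024], src[1024], dst[1024];

    snprintf(full, sizeof(full), "%s/pg_rlog/%s.full", pgdata, seg);
    snprintf(busy, sizeof(busy), "%s/pg_rlog/%s.busy", pgdata, seg);
    snprintf(done, sizeof(done), "%s/pg_rlog/%s.done", pgdata, seg);
    snprintf(src,  sizeof(src),  "%s/pg_xlog/%s", pgdata, seg);
    snprintf(dst,  sizeof(dst),  "%s/%s", archdir, seg);

    if (rename(full, busy) != 0)        /* claim the notification */
        return -1;
    if (copy_file(src, dst) != 0)       /* copy the closed segment */
        return -1;
    return rename(busy, done);          /* signal that archiving completed */
}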
On Thu, 2004-04-29 at 15:22, Bruce Momjian wrote:
> [...]
> My point was that without locking the WAL, we might get part of a WAL
> write in our file, but I now realize that during a crash the same thing
> might happen, so it would be OK to just copy it even if it is being
> written to.
> [...]
> so it is only archiving full logs.
>
> Also, I think this archiver should be able to log to a local drive,
> network drive (trivial), tape drive, ftp, or use an external script to
> transfer the logs somewhere. (ftp would probably be an external script
> with 'expect'.)

Bruce is correct, the API waits for the xlog file to be full before archiving.

I had thought about the case for partial archiving: basically, if you want to archive in smaller chunks, make your log files smaller... this is currently a compile-time option. Possibly there is an argument to make the xlog file size configurable, as a way of doing what you suggest.

Taking multiple copies of the same file, yet trying to work out which one to apply, sounds complex and error-prone to me. It also increases the cost of the archival process and thus drains other resources.

The archiver should be able to do a whole range of things. Basically, that point was discussed and the agreed approach was to provide an API that would allow anybody and everybody to write whatever they wanted. The design included pg_arch since it was clear that there would be a requirement in the basic product to have those facilities - and in any case any practically focused API needs a reference implementation as a way of showing how to use it and exposing any bugs in the server-side implementation.

The point is... everybody is now empowered to write tape drive code, whatever you fancy... go do.

Best regards, Simon Riggs
Simon Riggs wrote:
> > Also, I think this archiver should be able to log to a local drive,
> > network drive (trivial), tape drive, ftp, or use an external script to
> > transfer the logs somewhere. (ftp would probably be an external script
> > with 'expect'.)
>
> Bruce is correct, the API waits for the xlog file to be full before
> archiving.
> [...]
> The point is... everybody is now empowered to write tape drive code,
> whatever you fancy... go do.

Agreed we want to allow the superuser control over the writing of the archive logs. The question is how they get access to that. Is it by running a client program continuously, or by calling an interface script from the backend?

My point was that having the backend call the program gives improved reliability, control over when to write, and easier administration.

How are people going to run pg_arch? Via nohup? In virtual screens? If I am at the console and I want to start it, do I use "&"? If I want to stop it, do I do a 'ps' and issue a 'kill'? This doesn't seem like a good user interface to me.

To me the problem isn't pg_arch itself but the idea that a client program is going to be independently finding (polling) and copying the archive logs.

I am thinking the client program is called with two arguments, the xlog file name and the arch location defined in GUC. Then the client program does the write. The problem there, though, is who gets the write error, since the backend will not wait around for completion?

Another case is server start/stop. You want to start/stop the archive logger to match the database server, particularly if you reboot the server. I know Informix used a client program for logging, and it was a pain to administer.

I would be happy with an external program if it was started/stopped by the postmaster (or via a GUC change) and received a signal when a WAL file was written. But if we do that, it isn't really an external program anymore but another child process like our stats collector.

I am willing to work on this if folks think this is a better approach.

-- Bruce Momjian  http://candle.pha.pa.us
On Thu, Apr 29, 2004 at 07:34:47PM +0100, Simon Riggs wrote:
> Bruce is correct, the API waits for the xlog file to be full before
> archiving.
>
> I had thought about the case for partial archiving: basically, if you
> want to archive in smaller chunks, make your log files smaller... this is
> currently a compile-time option. Possibly there is an argument to make the
> xlog file size configurable, as a way of doing what you suggest.
>
> Taking multiple copies of the same file, yet trying to work out which
> one to apply, sounds complex and error-prone to me. It also increases the
> cost of the archival process and thus drains other resources.

My idea was basically that the archiver could be told "I've finished writing XLog segment 1 up to byte 9000", so the archiver would

    dd if=xlog-1 seek=0 skip=0 bs=1c count=9000c of=archive-1

And later, it would get a notification "segment 1 up to byte 18000" and do

    dd if=xlog-1 seek=0 skip=0 bs=1c count=18000c of=archive-1

Or, if it's smart enough,

    dd if=xlog-1 seek=9000c skip=9000c bs=1c count=9000c of=archive-1

Basically it is updating the logs as soon as it receives the notifications. Writing 16 MB of xlogs could take some time.

When a full xlog segment has been written, a different kind of notification can be issued. A dumb archiver could just ignore the incremental ones and copy the files only upon receiving this other kind.

I think that if log files are too small, maybe it will be a waste of resources (which ones?). Anyway, it's just an idea.

-- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
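To illustrate Alvaro's incremental idea (nothing like this exists in the patch - the function and its interface are invented here), a sketch that keeps a per-segment "bytes already archived" offset and copies only the newly valid bytes on each notification, i.e. the C equivalent of the dd commands above:

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

static int
copy_increment(const char *xlog_path, const char *arch_path,
               off_t *copied, off_t valid_upto)
{
    char    buf[8192];
    int     in = open(xlog_path, O_RDONLY);
    int     out = open(arch_path, O_CREAT | O_WRONLY, 0660);

    if (in < 0 || out < 0)
        return -1;

    /* resume where the previous notification left off */
    lseek(in, *copied, SEEK_SET);
    lseek(out, *copied, SEEK_SET);

    while (*copied < valid_upto)
    {
        size_t  want = sizeof(buf);
        ssize_t n;

        if ((off_t) want > valid_upto - *copied)
            want = (size_t) (valid_upto - *copied);   /* never read past the told offset */
        n = read(in, buf, want);
        if (n <= 0)
            break;
        if (write(out, buf, n) != n)
            break;
        *copied += n;
    }

    close(in);
    close(out);
    return (*copied == valid_upto) ? 0 : -1;
}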
On Thu, 2004-04-29 at 20:24, Bruce Momjian wrote:
> I am willing to work on this...

There is much work still to be done to make PITR work, accepting all of the many comments made. If anybody wants this by 1 June, I think we'd better look sharp. My aim has been to knock one of the URGENT items on the TODO list into touch, however that was to be achieved.

The following work remains... from all that has been said...
- halt restore at a particular condition (point in time, txnid etc)
- archive policy to control whether to halt the database should archiving fail and space run out (as Oracle, DB2 do), or not (as discussed)
- cope with restoring a stream of logs larger than the disk space on the restoration target system
- integrate restore with the tablespace code, to allow tablespace backups
- build an XLogSpy mechanism to allow the DBA to better know when to recover to
- extend the logging mechanism to allow recovery time prediction
- publicise the API with BAR open source teams, to get feedback and to encourage them to use the API to allow PostgreSQL support in their BAR
- use the API to build interfaces to the 100+ BAR products on the market
- performance tuning of xlogs, to ensure minimum xlog volume written
- performance tuning of recovery, to ensure wasted effort is avoided
- allow the archiver utility to be managed by the postmaster
- write some good documentation
- comprehensive crash testing
- really comprehensive crash testing
- very comprehensive crash testing

It seems worth working on things in some kind of priority order. I claim these, by the way, but many others look important and interesting to me:
- halt restore at a particular condition (point in time, txnid etc)
- cope with restoring a stream of logs larger than the disk space on the restoration target system
- write some good documentation

Best Regards, Simon Riggs
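The first item in Simon's list (halting restore at a particular condition) is still Phase 2 work; purely as a hedged illustration of the general idea, with a record layout and function name invented for this sketch, replay would compare each commit record's timestamp against a target and stop once it passes it:

#include <stdbool.h>
#include <time.h>

typedef struct ReplayRecord
{
    bool    is_commit;      /* is this a transaction-commit record? */
    time_t  commit_time;    /* commit timestamp carried in the record */
} ReplayRecord;

/* Returns true if replay should stop BEFORE applying this record. */
static bool
recovery_target_reached(const ReplayRecord *rec, time_t target_time)
{
    if (!rec->is_commit)
        return false;       /* only commits move the visible state forward */
    return rec->commit_time > target_time;
}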
On Thu, 2004-04-29 at 20:24, Bruce Momjian wrote:
> Simon Riggs wrote:
> > The archiver should be able to do a whole range of things. Basically,
> > that point was discussed and the agreed approach was to provide an API
> > that would allow anybody and everybody to write whatever they wanted.
> > [...]
> > The point is... everybody is now empowered to write tape drive code,
> > whatever you fancy... go do.
>
> Agreed we want to allow the superuser control over the writing of the
> archive logs. The question is how they get access to that. Is it by
> running a client program continuously, or by calling an interface script
> from the backend?
>
> My point was that having the backend call the program gives improved
> reliability, control over when to write, and easier administration.

Agreed. We've both suggested ways that can occur, though I suggest this is much less of a priority, for now. Not "no", just not "now".

> How are people going to run pg_arch? Via nohup? In virtual screens? If
> I am at the console and I want to start it, do I use "&"? If I want to
> stop it, do I do a 'ps' and issue a 'kill'? This doesn't seem like a
> good user interface to me.
>
> To me the problem isn't pg_arch itself but the idea that a client
> program is going to be independently finding (polling) and copying the
> archive logs.
>
> I am thinking the client program is called with two arguments, the xlog
> file name and the arch location defined in GUC. Then the client
> program does the write. The problem there, though, is who gets the write
> error, since the backend will not wait around for completion?
>
> Another case is server start/stop. You want to start/stop the archive
> logger to match the database server, particularly if you reboot the
> server. I know Informix used a client program for logging, and it was a
> pain to administer.

pg_arch is just icing on top of the API. The API is the real deal here. I'm not bothered if pg_arch is not accepted, as long as we can adopt the API. As noted previously, my original mind was to split the API away from the pg_arch application to make it clearer what was what. Once that has been done, I encourage others to improve pg_arch - but also to use the API to interface with other BAR products.

If you're using PostgreSQL for serious business then you will be using a serious BAR product as well. There are many FOSS alternatives...

The API's purpose is to allow larger, pre-existing BAR products to know when and how to retrieve data from PostgreSQL. Those products don't and won't run underneath the postmaster, so although I agree with Peter's original train of thought, I also agree with Tom's suggestion that we need an API more than we need an archiver process.

> I would be happy with an external program if it was started/stopped by the
> postmaster (or via a GUC change) and received a signal when a WAL file was
> written.

That is exactly what has been written.

The PostgreSQL side of the API is written directly into the backend, in xlog.c, and is therefore activated by postmaster-controlled code. That then sends "a signal" to the process that will do the archiving - the archiver side of the XLogArchive API is an in-process library.
(The "signal" is, in fact, a zero-length file written to disk because there are many reasons why an external archiver may not be ready to archive or even up and running to receive a signal). The only difference is that there is some confusion as to the role and importance of pg_arch. Best Regards, Simon Riggs
Simon Riggs wrote:
> pg_arch is just icing on top of the API. The API is the real deal here.
> I'm not bothered if pg_arch is not accepted, as long as we can adopt the
> API.
> [...]
> The only difference is that there is some confusion as to the role and
> importance of pg_arch.

OK, I have finalized my thinking on this.

We both agree that a pg_arch client-side program certainly works for PITR logging. The big question in my mind is whether a client-side program is what we want to use long-term, and whether we want to release a 7.5 that uses it and then change it in 7.6 to something more integrated into the backend.

Let me add that this is a little different from pg_autovacuum. With that, you could put it in cron and be done with it. With pg_arch, there is a routine that has to be used to do PITR, and if we change the process in 7.6, I am afraid there will be confusion.

Let me also add that I am not terribly worried about having the feature to restore to an arbitrary point in time for 7.5. I would much rather have a good PITR solution that works cleanly in 7.5 and add that in 7.6, than to have restore to an arbitrary point but have a strained implementation that we have to revisit for 7.6.

Here are my ideas. (I talked to Tom about this and am including his ideas too.)
Basically, the archiver that scans the xlog directory to identify files to be archived should be a subprocess of the postmaster. You already have that code and it can be moved into the backend.

Here is my implementation idea. First, your pg_arch code runs in the backend and is started just like the statistics process. It has to be started whether PITR is being used or not, but will be inactive if PITR isn't enabled. This must be done because we can't have a backend start this process later in case they turn on PITR after server start.

The process id of the archive process is stored in shared memory. When PITR is turned on, each backend that completes a WAL file sends a signal to the archiver process. The archiver wakes up on the signal and scans the directory, finds files that need archiving, and either does a 'cp' or runs a user-defined program (like scp) to transfer the file to the archive location.

In GUC we add:

    pitr = true/false
    pitr_location = 'directory, user@host:/dir, etc'
    pitr_transfer = 'cp, scp, etc'

The archiver program updates its config values when someone changes these values via postgresql.conf (and uses pg_ctl reload). These can only be modified from postgresql.conf. Changing them via SET has to be disabled because they are cluster-level settings, not per-session ones, like the port number or checkpoint_segments.

Basically, I think that we need to push user-level control of this process down beyond the directory scanning code (that is pretty standard), and allow them to call an arbitrary program to transfer the logs. My idea is that the pitr_transfer program will get $1=WAL file name and $2=pitr_location, and the program can use those arguments to do the transfer. We can even put a pitr_transfer.sample program in share and document $1 and $2.

-- Bruce Momjian  http://candle.pha.pa.us
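A sketch of how the archiver subprocess might invoke the proposed pitr_transfer program with $1 = WAL file name and $2 = pitr_location. The GUC names come from Bruce's proposal above and nothing here is implemented; a real version would report failures properly rather than just returning -1.

#include <stdio.h>
#include <stdlib.h>

static int
run_pitr_transfer(const char *pitr_transfer,   /* e.g. "cp" or "scp" */
                  const char *wal_path,        /* $1: full path of the WAL file */
                  const char *pitr_location)   /* $2: destination dir or host:/dir */
{
    char cmd[2048];

    /* e.g. "scp .../pg_xlog/00000001000000C6 backup@host:/wal" */
    snprintf(cmd, sizeof(cmd), "%s %s %s", pitr_transfer, wal_path, pitr_location);

    return system(cmd) == 0 ? 0 : -1;
}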
On Fri, 2004-04-30 at 04:02, Bruce Momjian wrote:
> Let me also add that I am not terribly worried about having the feature
> to restore to an arbitrary point in time for 7.5. I would much rather
> have a good PITR solution that works cleanly in 7.5 and add that in 7.6,
> than to have restore to an arbitrary point but have a strained
> implementation that we have to revisit for 7.6.

Interesting thought - I see now your priorities. Will read and digest over the next few days.

Thanks for your help and attention,

Best regards, Simon Riggs
On Fri, 2004-04-30 at 04:02, Bruce Momjian wrote:
> OK, I have finalized my thinking on this.
>
> We both agree that a pg_arch client-side program certainly works for
> PITR logging. The big question in my mind is whether a client-side
> program is what we want to use long-term...
> [...]
> Basically, I think that we need to push user-level control of this
> process down beyond the directory scanning code (that is pretty
> standard), and allow them to call an arbitrary program to transfer the
> logs.

...Bruce and I have just discussed this in some detail and reached a good understanding of the design proposals as a whole. It looks like all of this can happen in the next few weeks, with a worst-case time estimate of mid-June. TGFT!

I'll write this up and post it shortly, with a rough roadmap for further development of recovery-related features.

Best Regards, Simon Riggs
2nd Quadrant