Thread: Using streaming replication as log archiving
As has been previously mentioned a couple of times, it should be perfectly possible to use streaming replication to get around the limitations of archive_command/archive_timeout for doing log archiving for PITR (namely, that you either keep archive_timeout high and risk data loss, or you set it very low and generate a huge log archive without need).

I've put together a tool to do this. The basic idea is to just stream down replication and write it to regular WAL files, which can then be used for recovery. You'll still need to use archive_command together with it to ensure that the backups are complete. Streaming replication doesn't guarantee that - in fact, regular replication will fall back to using whatever archive_command created when wal_keep_segments isn't enough.

I've put up an early version of the tool at http://github.com/mhagander/pg_streamrecv

Comments and contributions are most welcome. And frankly, a good review is very much required before I'd trust it ;) Hopefully, I didn't overlook something critical :D

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
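[Editor's note: for readers unfamiliar with the mechanism, here is a minimal sketch of the kind of receive loop such a tool can be built around, using libpq's replication support. The connection string, the hardcoded START_REPLICATION position, and the omission of per-message header parsing and segment switching are all simplifying assumptions for illustration; this is not pg_streamrecv's actual code.]

/* Minimal sketch: connect as a replication client and stream WAL down.
 * Header parsing and segment handling are omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    /* "replication=true" requests a walsender connection (assumption:
     * adjust host/port/user for your setup). */
    PGconn *conn = PQconnectdb("host=127.0.0.1 user=postgres replication=true");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Ask the server for its system id, timeline and current WAL position. */
    PGresult *res = PQexec(conn, "IDENTIFY_SYSTEM");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("system %s, timeline %s, xlog position %s\n",
               PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1), PQgetvalue(res, 0, 2));
    PQclear(res);

    /* Start streaming from the beginning of some segment (position
     * hardcoded here purely for illustration). */
    res = PQexec(conn, "START_REPLICATION 0/0D000000");
    /* A 9.0 server reports PGRES_COPY_OUT here; later servers use PGRES_COPY_BOTH. */
    if (PQresultStatus(res) != PGRES_COPY_OUT)
        fprintf(stderr, "unexpected result: %s", PQerrorMessage(conn));
    PQclear(res);

    for (;;)
    {
        char *buf;
        int   len = PQgetCopyData(conn, &buf, 0);   /* blocking mode */

        if (len < 0)
            break;              /* stream ended or errored */

        /* buf starts with the streaming-replication message header
         * ('w' plus WAL positions); the payload after it would be
         * appended to the current 16MB segment file. */
        PQfreemem(buf);
    }

    PQfinish(conn);
    return 0;
}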
On Mon, Sep 27, 2010 at 9:07 PM, Magnus Hagander <magnus@hagander.net> wrote:
> As has been previously mentioned a couple of times, it should be perfectly possible to use streaming replication to get around the limitations of archive_command/archive_timeout for doing log archiving for PITR (namely, that you either keep archive_timeout high and risk data loss, or you set it very low and generate a huge log archive without need).
>
> I've put together a tool to do this. The basic idea is to just stream down replication and write it to regular WAL files, which can then be used for recovery. You'll still need to use archive_command together with it to ensure that the backups are complete. Streaming replication doesn't guarantee that - in fact, regular replication will fall back to using whatever archive_command created when wal_keep_segments isn't enough.
>
> I've put up an early version of the tool at http://github.com/mhagander/pg_streamrecv

Great! This also might be useful for users who want something like Oracle redo log mirroring.

> Comments and contributions are most welcome. And frankly, a good review is very much required before I'd trust it ;) Hopefully, I didn't overlook something critical :D

When I ran it, the WAL file in the inprogress directory became larger than 16MB. Obviously something isn't right.

When I requested an immediate shutdown of the master, a segmentation fault occurred in pg_streamrecv. I guess the return value 0 of PQgetCopyData is not handled correctly.

After repeating Ctrl+C and restarting pg_streamrecv a few times, I encountered the following error and pg_streamrecv would no longer start. Is this intentional?

In progress directory contains more than one file!

$ ls foo/inprogress/
00000001000000000000000D  00000001000000000000000D.save

When there is an in-progress or archived WAL file, shouldn't pg_streamrecv skip executing pg_current_xlog_location, since the result is not used?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Sep 28, 2010 at 06:25, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Sep 27, 2010 at 9:07 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> As has been previously mentioned a couple of times, it should be perfectly possible to use streaming replication to get around the limitations of archive_command/archive_timeout for doing log archiving for PITR (namely, that you either keep archive_timeout high and risk data loss, or you set it very low and generate a huge log archive without need).
>>
>> I've put together a tool to do this. The basic idea is to just stream down replication and write it to regular WAL files, which can then be used for recovery. You'll still need to use archive_command together with it to ensure that the backups are complete. Streaming replication doesn't guarantee that - in fact, regular replication will fall back to using whatever archive_command created when wal_keep_segments isn't enough.
>>
>> I've put up an early version of the tool at http://github.com/mhagander/pg_streamrecv
>
> Great! This also might be useful for users who want something like Oracle redo log mirroring.

Thanks.

>> Comments and contributions are most welcome. And frankly, a good review is very much required before I'd trust it ;) Hopefully, I didn't overlook something critical :D
>
> When I ran it, the WAL file in the inprogress directory became larger than 16MB. Obviously something isn't right.

Wow, that's weird. I'm unable to reproduce that here - can you try to figure out why that happened?

> When I requested an immediate shutdown of the master, a segmentation fault occurred in pg_streamrecv. I guess the return value 0 of PQgetCopyData is not handled correctly.

Almost right - it actually returns -2, which isn't handled. I've added a fix for that - and while at it, covering anything that's so small it can't contain the streaming replication header.

> After repeating Ctrl+C and restarting pg_streamrecv a few times, I encountered the following error and pg_streamrecv would no longer start. Is this intentional?
>
> In progress directory contains more than one file!
>
> $ ls foo/inprogress/
> 00000001000000000000000D  00000001000000000000000D.save

Yes, that is actually intentional. When it finds the ..0D file there the first time, it gets renamed to ".save", and it retries the transmission from the beginning of that segment. As soon as the retransmission has passed the point that ..0D was at, the .save file is removed. If you Ctrl-C the process again *before* it has reached that point, it will leave both files around - it's up to you to clean them up. This is to make sure we don't overwrite a file that contains more log data than is currently available on the master.

> When there is an in-progress or archived WAL file, shouldn't pg_streamrecv skip executing pg_current_xlog_location, since the result is not used?

Yeah, that's just a bit of lazy programming that I should fix :-)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
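[Editor's note: the fix Magnus describes amounts to checking every return value of PQgetCopyData before touching the buffer. A hedged sketch of what such handling can look like follows; STREAM_HDR_LEN and the error-handling style are illustrative assumptions, not the tool's actual code.]

/* Sketch of defensive handling around PQgetCopyData. STREAM_HDR_LEN is a
 * placeholder for however many bytes the streaming-replication message
 * header occupies; the real value is version-specific. */
#include <stdio.h>
#include <libpq-fe.h>

#define STREAM_HDR_LEN 25   /* assumption, for illustration only */

int
receive_one_message(PGconn *conn)
{
    char *buf = NULL;
    int   len = PQgetCopyData(conn, &buf, 0);   /* blocking mode */

    if (len == -1)
    {
        fprintf(stderr, "copy stream ended normally\n");
        return 0;
    }
    if (len == -2)
    {
        /* e.g. the master was shut down immediately */
        fprintf(stderr, "copy stream failed: %s", PQerrorMessage(conn));
        return -1;
    }
    if (len == 0)
    {
        /* only possible in non-blocking mode: no data available yet */
        return 1;
    }
    if (len < STREAM_HDR_LEN)
    {
        /* too small to even contain the replication header - ignore */
        fprintf(stderr, "short message (%d bytes), skipping\n", len);
        PQfreemem(buf);
        return 1;
    }

    /* ... write buf + STREAM_HDR_LEN, len - STREAM_HDR_LEN bytes of WAL ... */
    PQfreemem(buf);
    return 1;
}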
On Tue, Sep 28, 2010 at 5:23 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> When I ran it, the WAL file in the inprogress directory became larger than 16MB. Obviously something isn't right.
>
> Wow, that's weird. I'm unable to reproduce that here - can you try to figure out why that happened?

Sorry, I overlooked the single-digit figure in the result of "ls -l". To be exact, the WAL file in the inprogress directory can be less than 16MB. Here is the result of "ls -l inprogress":

$ ls -l inprogress/
total 1724
-rw-rw-r-- 1 postgres postgres 1757352 Sep 29 12:03 000000010000000000000003

This also would be a problem, since a WAL file smaller than 16MB cannot be used for recovery. I think pg_streamrecv should create a 16MB file filled with zeros first and then write the received WAL records into it, as walreceiver does.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Sep 29, 2010 at 05:40, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Sep 28, 2010 at 5:23 PM, Magnus Hagander <magnus@hagander.net> wrote:
>>> When I ran it, the WAL file in the inprogress directory became larger than 16MB. Obviously something isn't right.
>>
>> Wow, that's weird. I'm unable to reproduce that here - can you try to figure out why that happened?
>
> Sorry, I overlooked the single-digit figure in the result of "ls -l".

Aha, that explains it.

> To be exact, the WAL file in the inprogress directory can be less than 16MB. Here is the result of "ls -l inprogress":
>
> $ ls -l inprogress/
> total 1724
> -rw-rw-r-- 1 postgres postgres 1757352 Sep 29 12:03 000000010000000000000003
>
> This also would be a problem, since a WAL file smaller than 16MB cannot be used for recovery. I think pg_streamrecv should create a 16MB file filled with zeros first and then write the received WAL records into it, as walreceiver does.

It's actually intentional. If we create a full-size file up front, there is no way to figure out exactly how far through a partial segment we are without parsing the details of the log. Knowing that is useful both for the admin (who can look at the directory and watch the file grow) and for the tool itself (to know when the .save file can be rotated away when recovering from a partial segment receive).

My idea was to just have the admin pad the file when it's time to do the restore. I could perhaps even add an option to the tool to do it - the idea being that it's still a manual step.

Do you have another suggestion for how to provide those two things?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
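[Editor's note: the padding step Magnus has in mind is simple to script or build in; roughly, the partial segment just needs to be extended with zero bytes to the full segment size before recovery will accept it. A minimal sketch follows, assuming the default 16MB segment size (it is configurable at build time) and taking the file name on the command line.]

/* Pad a partial WAL segment out to the full 16MB with zero bytes.
 * Assumes the default segment size; adjust if the server was built
 * with a different --with-wal-segsize. */
#include <stdio.h>
#include <string.h>

#define WAL_SEG_SIZE (16 * 1024 * 1024)

int
main(int argc, char **argv)
{
    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <partial-segment-file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "r+b");
    if (!f)
    {
        perror("fopen");
        return 1;
    }

    /* Find out how much has been received so far. */
    fseek(f, 0, SEEK_END);
    long cur = ftell(f);

    /* Append zeros until the file reaches the full segment size. */
    char zeros[8192];
    memset(zeros, 0, sizeof(zeros));
    while (cur < WAL_SEG_SIZE)
    {
        size_t chunk = sizeof(zeros);
        if (WAL_SEG_SIZE - cur < (long) chunk)
            chunk = WAL_SEG_SIZE - cur;
        fwrite(zeros, 1, chunk, f);
        cur += chunk;
    }

    fclose(f);
    return 0;
}

[Something like "truncate -s 16M <file>" should achieve the same thing from the shell on systems that have it, since the extension reads back as zeros.]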
Magnus Hagander <magnus@hagander.net> wrote:
> Comments and contributions are most welcome.

This is probably too esoteric to be worked on yet, but for this to be useful for us we would need to pass the resulting files through pg_clearxlogtail and gzip in an automated fashion. And we would need to do regular log file archiving in parallel with it.

As background, our databases around the state archive to a directory which is then pushed via rsync to a "dumb" backup location in the same room as the database server (we're lucky to have rsync on the target of this copy; any other executable is out of the question), and the same directory is pulled via rsync to a central location. We would be interested in using streaming replication to a tool such as you describe for the copy to the central location, but since we would still be forcing a WAL-file switch once per hour, we would need the current capability to shrink an "empty" file from 16MB to 16kB using the above-mentioned tools.

Also, the ability to limit bandwidth would be a nice feature for us, preferably in a way which could be changed on the fly.

If you could keep the development "friendly" to such features, I may get around to adding them to support our needs....

-Kevin
On Wed, Sep 29, 2010 at 23:45, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Magnus Hagander <magnus@hagander.net> wrote:
>
>> Comments and contributions are most welcome.
>
> This is probably too esoteric to be worked on yet, but for this to be useful for us we would need to pass the resulting files through pg_clearxlogtail and gzip in an automated fashion. And we would need to do regular log file archiving in parallel with it.
>
> As background, our databases around the state archive to a directory which is then pushed via rsync to a "dumb" backup location in the same room as the database server (we're lucky to have rsync on the target of this copy; any other executable is out of the question), and the same directory is pulled via rsync to a central location. We would be interested in using streaming replication to a tool such as you describe for the copy to the central location, but since we would still be forcing a WAL-file switch once per hour, we would need the current capability to shrink an "empty" file from 16MB to 16kB using the above-mentioned tools.

You could just have one stream going locally and one stream going to the other location in parallel, though? Or use the stream to the local directory and rsync that off? While I haven't tested it, rsyncing the partial WAL files *should* be fine, I think...

> Also, the ability to limit bandwidth would be a nice feature for us, preferably in a way which could be changed on the fly.
>
> If you could keep the development "friendly" to such features, I may get around to adding them to support our needs....

Would it be enough to have a kind of "archive_command" switch that says "whenever you've finished a complete WAL segment, run this command on it"? Then that command could clear the tail, compress, and send it off?

And in that case, should it run inline or in the background alongside the streaming? I would assume just forking it off and leaving it to its own business would be best?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> wrote:
>> If you could keep the development "friendly" to such features, I may get around to adding them to support our needs....
>
> Would it be enough to have a kind of "archive_command" switch that says "whenever you've finished a complete WAL segment, run this command on it"?

That would allow some nice options. I've been thinking about what would be the ideal use of this with our backup scheme, and the best I've thought up would be that each WAL segment file would be a single output stream, with the option of calling an executable (which could be a script) with the target file name and then piping the stream to it. At 16MB or a forced xlog switch, it would close the stream and call the executable again with a new file name. You could have a default executable for the default behavior, or just build in a default if no executable is specified.

The reason I like this is that I could pipe the stream through pg_clearxlogtail and gzip pretty much "as is" to the locations on the database server currently used for rsync to the two targets, and the rsync commands would send the incremental changes once per minute to both targets. I haven't thought of another solution which provides incremental transmission of the WAL to the local backup location, which would be a nice thing to have, since this is most crucial when the WAN is down: not only is WAL data not coming back to our central location, but our application-framework-based replication stream isn't making it back, either.

-Kevin
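[Editor's note: the mechanism Kevin describes - start a filter command per segment, pipe the incoming stream into it, and close it at the segment boundary - is easy to sketch with popen(). Everything below is illustrative: the filter command line, the paths, and the function names are assumptions, and whether pg_clearxlogtail accepts a partial segment on stdin is something one would need to verify.]

/* Sketch of "one filter process per WAL segment": start the user's command
 * when a segment begins, feed it every chunk received from the stream, and
 * close it when the segment is done. All names here are illustrative. */
#include <stdio.h>
#include <stdlib.h>

static FILE *filter = NULL;

/* Called when streaming moves to a new segment. */
void
open_segment_filter(const char *segname)
{
    char cmd[1024];

    /* Example filter: zero the unused tail and compress, writing to the
     * directory that rsync already picks up (paths are assumptions). */
    snprintf(cmd, sizeof(cmd),
             "pg_clearxlogtail | gzip > /var/backup/wal/%s.gz", segname);

    filter = popen(cmd, "w");
    if (!filter)
    {
        perror("popen");
        exit(1);
    }
}

/* Called for every chunk of WAL payload received for the current segment. */
void
write_segment_chunk(const char *data, size_t len)
{
    if (fwrite(data, 1, len, filter) != len)
    {
        perror("fwrite to filter");
        exit(1);
    }
}

/* Called at 16MB or at a forced xlog switch. */
void
close_segment_filter(void)
{
    if (pclose(filter) != 0)
        fprintf(stderr, "filter command exited with an error\n");
    filter = NULL;
}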
On Thu, Sep 30, 2010 at 15:45, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Magnus Hagander <magnus@hagander.net> wrote:
>
>>> If you could keep the development "friendly" to such features, I may get around to adding them to support our needs....
>>
>> Would it be enough to have a kind of "archive_command" switch that says "whenever you've finished a complete WAL segment, run this command on it"?
>
> That would allow some nice options. I've been thinking about what would be the ideal use of this with our backup scheme, and the best I've thought up would be that each WAL segment file would be a single output stream, with the option of calling an executable (which could be a script) with the target file name and then piping the stream to it. At 16MB or a forced xlog switch, it would close the stream and call the executable again with a new file name. You could have a default executable for the default behavior, or just build in a default if no executable is specified.

The problem with that one (which I'm sure is solvable somehow) is how to deal with restarts - both restarts in the middle of a segment (which happens all the time if you don't have an archive_timeout set) and restarts between segments. How would the tool know where to begin streaming again? Right now, it looks at the files - but doing it your way there are no files to look at. We'd need a second script/command to call to figure out where to restart from in that case, no?

> The reason I like this is that I could pipe the stream through pg_clearxlogtail and gzip pretty much "as is" to the locations on the database server currently used for rsync to the two targets, and the rsync commands would send the incremental changes once per minute to both targets. I haven't thought of another solution which provides incremental transmission of the WAL to the local backup location, which would be a nice thing to have, since this is most crucial when the WAN is down: not only is WAL data not coming back to our central location, but our application-framework-based replication stream isn't making it back, either.

It should be safe to just rsync the archive directory as it's being written by pg_streamrecv. Doesn't that give you the property you're looking for - local machine gets data streamed in live, remote machine gets it rsynced every minute?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
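[Editor's note: the "look at the files" restart logic Magnus refers to can be sketched roughly as follows - find the highest-numbered segment in the output directory, parse the timeline/log/segment out of its 24-hex-digit name, and resume streaming from the start of that segment, so any partial contents get retransmitted (matching the .save behaviour described earlier). The directory scan, the fixed 16MB segment size, and the single-timeline assumption are simplifications; this is not the tool's actual code.]

/* Sketch: derive a restart position from the newest WAL file on disk.
 * Assumes default 16MB segments and a single timeline. */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

#define WAL_SEG_SIZE (16 * 1024 * 1024)

/* Returns 1 and fills in xlogid/xrecoff if a segment file was found. */
int
find_restart_position(const char *dir, unsigned int *xlogid, unsigned int *xrecoff)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    char newest[25] = "";

    if (!d)
        return 0;

    while ((de = readdir(d)) != NULL)
    {
        /* WAL segment names are exactly 24 hex digits; skip .save files etc. */
        if (strlen(de->d_name) == 24 &&
            strspn(de->d_name, "0123456789ABCDEF") == 24 &&
            strcmp(de->d_name, newest) > 0)
            strcpy(newest, de->d_name);
    }
    closedir(d);

    if (newest[0] == '\0')
        return 0;

    unsigned int tli, log, seg;
    sscanf(newest, "%08X%08X%08X", &tli, &log, &seg);

    /* Restart from the very beginning of the newest (possibly partial)
     * segment, so its contents get streamed again in full. */
    *xlogid = log;
    *xrecoff = seg * WAL_SEG_SIZE;
    return 1;
}

[The resulting position would then go straight into a "START_REPLICATION %X/%X" command.]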
On Thu, Sep 30, 2010 at 10:24 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> That would allow some nice options. I've been thinking about what would be the ideal use of this with our backup scheme, and the best I've thought up would be that each WAL segment file would be a single output stream, with the option of calling an executable (which could be a script) with the target file name and then piping the stream to it. At 16MB or a forced xlog switch, it would close the stream and call the executable again with a new file name. You could have a default executable for the default behavior, or just build in a default if no executable is specified.
>
> The problem with that one (which I'm sure is solvable somehow) is how to deal with restarts - both restarts in the middle of a segment (which happens all the time if you don't have an archive_timeout set) and restarts between segments. How would the tool know where to begin streaming again? Right now, it looks at the files - but doing it your way there are no files to look at. We'd need a second script/command to call to figure out where to restart from in that case, no?

And then think of the future, when sync rep is in... I'm hoping to be able to use something like this to do synchronous replication to my archive (instead of to a live server).

> It should be safe to just rsync the archive directory as it's being written by pg_streamrecv. Doesn't that give you the property you're looking for - local machine gets data streamed in live, remote machine gets it rsynced every minute?

When the "being written to" segment completes and moves to its final location, he'll get an extra whole "copy" of the file. But if the "move" can be an exec of his script, the compressed/gzipped final result shouldn't be that bad. Certainly no worse than what he's currently getting with archive_command ;-) And he's got the uncompressed incremental updates as they are happening.

a.
On Thu, Sep 30, 2010 at 16:39, Aidan Van Dyk <aidan@highrise.ca> wrote:
> On Thu, Sep 30, 2010 at 10:24 AM, Magnus Hagander <magnus@hagander.net> wrote:
>
>>> That would allow some nice options. I've been thinking about what would be the ideal use of this with our backup scheme, and the best I've thought up would be that each WAL segment file would be a single output stream, with the option of calling an executable (which could be a script) with the target file name and then piping the stream to it. At 16MB or a forced xlog switch, it would close the stream and call the executable again with a new file name. You could have a default executable for the default behavior, or just build in a default if no executable is specified.
>>
>> The problem with that one (which I'm sure is solvable somehow) is how to deal with restarts - both restarts in the middle of a segment (which happens all the time if you don't have an archive_timeout set) and restarts between segments. How would the tool know where to begin streaming again? Right now, it looks at the files - but doing it your way there are no files to look at. We'd need a second script/command to call to figure out where to restart from in that case, no?
>
> And then think of the future, when sync rep is in... I'm hoping to be able to use something like this to do synchronous replication to my archive (instead of to a live server).

Right, that could be a future enhancement. Doesn't mean we shouldn't still do our best with the async mode, of course :P

>> It should be safe to just rsync the archive directory as it's being written by pg_streamrecv. Doesn't that give you the property you're looking for - local machine gets data streamed in live, remote machine gets it rsynced every minute?
>
> When the "being written to" segment completes and moves to its final location, he'll get an extra whole "copy" of the file.

Ah, good point.

> But if the "move" can be an exec of his script, the compressed/gzipped final result shouldn't be that bad. Certainly no worse than what he's currently getting with archive_command ;-) And he's got the uncompressed incremental updates as they are happening.

Yeah, it would be trivial to replace the rename() call with a call to a script that gets to do whatever is suitable to the file. Actually, it'd probably be better to rename() it *and* call the script, so that we can continue properly if the script fails.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
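[Editor's note: what Magnus sketches at the end - rename() the completed segment into the archive directory and additionally fire a user-supplied command on it, without letting a failing command break the stream - might look roughly like this. The configuration variable, the paths, and the choice to ignore SIGCHLD rather than wait for the child are assumptions for illustration.]

/* Sketch: finish a segment by renaming it into the archive directory and,
 * if configured, running a user command on it in the background. */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical configuration: command to run on each completed segment,
 * or NULL for none. */
const char *segment_command = NULL;

void
segment_completed(const char *inprogress_path, const char *archive_path)
{
    /* Always rename first, so a failing command doesn't lose the file. */
    if (rename(inprogress_path, archive_path) != 0)
    {
        perror("rename");
        exit(1);
    }

    if (segment_command == NULL)
        return;

    /* Don't leave zombies around from background commands. */
    signal(SIGCHLD, SIG_IGN);

    pid_t pid = fork();
    if (pid == 0)
    {
        /* Child: run "command <archived file>" via the shell and exit. */
        char cmd[1024];
        snprintf(cmd, sizeof(cmd), "%s %s", segment_command, archive_path);
        execl("/bin/sh", "sh", "-c", cmd, (char *) NULL);
        _exit(127);             /* exec failed */
    }
    else if (pid < 0)
        fprintf(stderr, "fork failed, skipping segment command\n");

    /* Parent returns immediately and keeps streaming. */
}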
Magnus Hagander <magnus@hagander.net> wrote:
> We'd need a second script/command to call to figure out where to restart from in that case, no?

I see your point; I guess we would need that.

> It should be safe to just rsync the archive directory as it's being written by pg_streamrecv. Doesn't that give you the property you're looking for - local machine gets data streamed in live, remote machine gets it rsynced every minute?

Well, the local target can't run pg_streamrecv -- it's a backup machine where we pretty much have rsync and nothing else. We could run pg_streamrecv on the database server itself and rsync to the local machine every minute. I just checked with the DBA who monitors space issues for such things, and it would be OK to rsync the uncompressed file to the local backup as it is written (we have enough space for it without compression), as long as we compress it before sending it to the central location. For that, your idea of firing a script on completion of the file would work -- we could maintain both raw and compressed files on the database server for rsync to the two locations. You can probably see the appeal of filtering it as it is written, though, if that is feasible. :-)

-Kevin
Aidan Van Dyk <aidan@highrise.ca> wrote:
> When the "being written to" segment completes and moves to its final location, he'll get an extra whole "copy" of the file. But if the "move" can be an exec of his script, the compressed/gzipped final result shouldn't be that bad. Certainly no worse than what he's currently getting with archive_command ;-) And he's got the uncompressed incremental updates as they are happening.

Hmmm... As long as streaming replication doesn't send the "tail" of an incomplete WAL segment file, the only thing we'd be missing on the send to the central location is the compression. That typically reduces the size of the transmission by 50% to 75% (e.g., the gzipped "full" files are usually in the range of 4MB to 8MB). At our WAN speeds, that is significant. I don't suppose that streaming replication uses (or offers as an option) a compressed stream?

-Kevin
On Thu, Sep 30, 2010 at 17:25, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Aidan Van Dyk <aidan@highrise.ca> wrote:
>
>> When the "being written to" segment completes and moves to its final location, he'll get an extra whole "copy" of the file. But if the "move" can be an exec of his script, the compressed/gzipped final result shouldn't be that bad. Certainly no worse than what he's currently getting with archive_command ;-) And he's got the uncompressed incremental updates as they are happening.
>
> Hmmm... As long as streaming replication doesn't send the "tail" of an incomplete WAL segment file, the only thing we'd be missing on the send to the central location is the compression. That typically reduces the size of the transmission by 50% to 75% (e.g., the gzipped "full" files are usually in the range of 4MB to 8MB). At our WAN speeds, that is significant. I don't suppose that streaming replication uses (or offers as an option) a compressed stream?

No, it sends a regular COPY stream with the raw transaction log data. So the files generated will eventually be exactly 16MB, just like they are in pg_xlog.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Wed, Sep 29, 2010 at 5:47 PM, Magnus Hagander <magnus@hagander.net> wrote:
> It's actually intentional. If we create a full-size file up front, there is no way to figure out exactly how far through a partial segment we are without parsing the details of the log. Knowing that is useful both for the admin (who can look at the directory and watch the file grow) and for the tool itself (to know when the .save file can be rotated away when recovering from a partial segment receive).
>
> My idea was to just have the admin pad the file when it's time to do the restore. I could perhaps even add an option to the tool to do it - the idea being that it's still a manual step.
>
> Do you have another suggestion for how to provide those two things?

My idea is to implement something like xlogdump in contrib and use it for those two things, though that's harder to implement than a padding tool.

BTW, implementing something like xlogdump is already on the TODO list:

---
Create dump tool for write-ahead logs for use in determining transaction id for point-in-time recovery. This is useful for checking PITR recovery.

http://wiki.postgresql.org/wiki/TODO#Point-In-Time_Recovery_.28PITR.29
---

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Oct 1, 2010 at 11:13, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Sep 29, 2010 at 5:47 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> It's actually intentional. If we create a full-size file up front, there is no way to figure out exactly how far through a partial segment we are without parsing the details of the log. Knowing that is useful both for the admin (who can look at the directory and watch the file grow) and for the tool itself (to know when the .save file can be rotated away when recovering from a partial segment receive).
>>
>> My idea was to just have the admin pad the file when it's time to do the restore. I could perhaps even add an option to the tool to do it - the idea being that it's still a manual step.
>>
>> Do you have another suggestion for how to provide those two things?
>
> My idea is to implement something like xlogdump in contrib and use it for those two things, though that's harder to implement than a padding tool.

Yes, much harder. I prefer keeping the tool simple when possible ;) Doesn't mean we couldn't do both, I guess.

> BTW, implementing something like xlogdump is already on the TODO list:

Yes, obviously such a tool would be very useful independent of the streaming thing.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/