Thread: Improving compressibility of WAL files
The attached patch from Aidan Van Dyk zeros out the end of WAL files to
improve their compressibility.  (The patch was originally sent to
'general', which explains why it was lost until now.)

Would someone please eyeball it?  It is useful for compressing PITR
logs even if we find a better solution for replication streaming.

As for why this patch is useful:

> > The real reason not to put that functionality into core (or even
> > contrib) is that it's a stopgap kluge.  What the people who want this
> > functionality *really* want is continuous (streaming) log-shipping, not
> > WAL-segment-at-a-time shipping.  Putting functionality like that into
> > core is infinitely more interesting than putting band-aids on a
> > segmented approach.
>
> Well, I realize we want streaming archive logs, but there are still
> going to be people who are archiving for point-in-time recovery, and I
> assume a good number of them are going to compress their WAL files to
> save space, because they have to store a lot of them.  Wouldn't zeroing
> out the trailing bytes of WAL still help those people?

---------------------------------------------------------------------------

Aidan Van Dyk wrote:
-- Start of PGP signed section.
> * Aidan Van Dyk <aidan@highrise.ca> [081031 15:11]:
> > How about something like the attached.  It's been spun quickly, passed
> > regression tests, and some simple hand tests on REL8_3_STABLE.  It seems
> > like HEAD can't initdb on my machine (quad opteron with SW raid1); I
> > tried a few revisions in the last few days, and initdb dies on them all...
>
> OK, HEAD does work; I don't know what was going on previously...  Attached
> is my patch against HEAD.
>
> I'll try and pull out some machines on Monday to really thrash/crash this,
> but I'm running out of time today to set that up.
>
> But in running HEAD, I've come across this:
> regression=# SELECT pg_stop_backup();
> WARNING:  pg_stop_backup still waiting for archive to complete (60 seconds elapsed)
> WARNING:  pg_stop_backup still waiting for archive to complete (120 seconds elapsed)
> WARNING:  pg_stop_backup still waiting for archive to complete (240 seconds elapsed)
>
> My archive script is *not* running; it ran and exited:
> mountie@pumpkin:~/projects/postgresql/PostgreSQL/src/test/regress$ ps -ewf | grep post
> mountie   2904     1  0 16:31 pts/14  00:00:00 /home/mountie/projects/postgresql/PostgreSQL/src/test/regress/tmp_check/install/usr/local/pgsql
> mountie   2906  2904  0 16:31 ?       00:00:01 postgres: writer process
> mountie   2907  2904  0 16:31 ?       00:00:00 postgres: wal writer process
> mountie   2908  2904  0 16:31 ?       00:00:00 postgres: archiver process   last was 00000001000000000000001F
> mountie   2909  2904  0 16:31 ?       00:00:01 postgres: stats collector process
> mountie   2921  2904  1 16:31 ?       00:00:18 postgres: mountie regression 127.0.0.1(56455) idle
>
> Those all match up:
> mountie@pumpkin:~/projects/postgresql/PostgreSQL/src/test/regress$ pstree -acp 2904
> postgres,2904 -D/home/mountie/projects/postgres
>   |-postgres,2906
>   |-postgres,2907
>   |-postgres,2908
>   |-postgres,2909
>   `-postgres,2921
>
> strace on the "archiver process" postgres:
> select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
> getppid()                               = 2904
> select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
> getppid()                               = 2904
> select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
> getppid()                               = 2904
> select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
> getppid()                               = 2904
> select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
> getppid()                               = 2904
>
> It *does* finally finish.  The postmaster log looks like this ("Archiving ..."
> is what my archive script prints; bytes is the gzip'ed size):
> Archiving 000000010000000000000016 [16397 bytes]
> Archiving 000000010000000000000017 [4405457 bytes]
> Archiving 000000010000000000000018 [3349243 bytes]
> Archiving 000000010000000000000019 [3349505 bytes]
> LOG:  ZEROING xlog file 0 segment 27 from 7954432 - 16777216 [8822784 bytes]
> Archiving 00000001000000000000001A [3349590 bytes]
> Archiving 00000001000000000000001B [1596676 bytes]
> LOG:  ZEROING xlog file 0 segment 28 from 8192 - 16777216 [16769024 bytes]
> Archiving 00000001000000000000001C [16398 bytes]
> LOG:  ZEROING xlog file 0 segment 29 from 8192 - 16777216 [16769024 bytes]
> Archiving 00000001000000000000001D [16397 bytes]
> LOG:  ZEROING xlog file 0 segment 30 from 8192 - 16777216 [16769024 bytes]
> Archiving 00000001000000000000001E [16393 bytes]
> Archiving 00000001000000000000001E.00000020.backup [146 bytes]
> WARNING:  pg_stop_backup still waiting for archive to complete (60 seconds elapsed)
> WARNING:  pg_stop_backup still waiting for archive to complete (120 seconds elapsed)
> WARNING:  pg_stop_backup still waiting for archive to complete (240 seconds elapsed)
> LOG:  ZEROING xlog file 0 segment 31 from 8192 - 16777216 [16769024 bytes]
> Archiving 00000001000000000000001F [16395 bytes]
>
> So what's this "pg_stop_backup still waiting for archive to complete"
> for-five-minutes state?  I've not seen that before (running 8.2 and 8.3).
>
> a.
> --
> Aidan Van Dyk                                             Create like a god,
> aidan@highrise.ca                                       command like a king,
> http://www.highrise.ca/                                   work like a slave.

[ Attachment, skipping... ]

-- End of PGP section, PGP failed!

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

commit fba38257e52564276bb106d55aef14d0de481169
Author: Aidan Van Dyk <aidan@highrise.ca>
Date:   Fri Oct 31 12:35:24 2008 -0400

    WIP: Zero xlog tail on a forced switch

    If XLogWrite is called with xlog_switch, an xlog switch has been
    forced, either by a timeout-based switch (archive_timeout) or an
    interactively forced xlog switch (pg_switch_xlog/pg_stop_backup).
    In those cases, we assume we can afford a little extra IO bandwidth
    to make xlogs so much more compressible.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 003098f..c6f9c79 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1600,6 +1600,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 		 */
 		if (finishing_seg || (xlog_switch && last_iteration))
 		{
+			/*
+			 * If we've had an xlog switch forced, then we want to zero
+			 * out the rest of the segment.  We zero it out here because at
+			 * the force switch time, IO bandwidth isn't a problem.
+			 * -- AIDAN
+			 */
+			if (xlog_switch)
+			{
+				char		buf[1024];
+				uint32		left = (XLogSegSize - openLogOff);
+
+				ereport(LOG,
+						(errmsg("ZEROING xlog file %u segment %u from %u - %u [%u bytes]",
+								openLogId, openLogSeg,
+								openLogOff, XLogSegSize, left)
+						));
+				memset(buf, 0, sizeof(buf));
+				while (left > 0)
+				{
+					size_t		len = (left > sizeof(buf)) ? sizeof(buf) : left;
+					write(openLogFile, buf, len);
+					left -= len;
+				}
+			}
+
 			issue_xlog_fsync();
 			LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */
Bruce Momjian <bruce@momjian.us> writes:
> The attached patch from Aidan Van Dyk zeros out the end of WAL files to
> improve their compressibility.  (The patch was originally sent to
> 'general', which explains why it was lost until now.)

Isn't this redundant given the existence of pglesslog?

			regards, tom lane
* Bruce Momjian <bruce@momjian.us> [090108 16:43]:
>
> The attached patch from Aidan Van Dyk zeros out the end of WAL files to
> improve their compressibility.  (The patch was originally sent to
> 'general', which explains why it was lost until now.)
>
> Would someone please eyeball it?  It is useful for compressing PITR
> logs even if we find a better solution for replication streaming.

The reason I didn't push it was that people claimed it would chew up too
much WAL bandwidth (causing a large commit latency) when the new 0's are
all written/fsynced at once...

I don't necessarily buy it, because the force_switch is usually either:
1) a timed occurrence at an otherwise idle time, or
2) user-forced (i.e. a forced checkpoint/pg_backup), so your IO is going
   to be hammered anyway...

But that's why I didn't follow up on it...

There are possibly a few other ways to do it, such as zeroing the WAL on
recycling (but not fsyncing it), and hoping most of the zeros get
trickled out by the OS before it comes down to a single 16MB fsync; but
not many people seemed too enthused about the whole WAL compressibility
subject...

But, the way I see things going on -hackers, I must admit, sync-rep (WAL
streaming) looks like it's a long way off and possibly not even going to
do what I want, so *I* would really like this WAL zeroing...

If anybody has any specific things in the patch they think need
changing, I'll try to accommodate them, but I do note that I never
submitted it for the commitfest...

a.
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
>>> Aidan Van Dyk <aidan@highrise.ca> 01/08/09 5:02 PM >>>
> *I* would really like this WAL zeroing...

pg_clearxlogtail (on pgFoundry) does exactly the same zeroing of the
tail, as a filter.  If you pipe through it on the way to gzip, there is
no increase in disk I/O over a straight gzip, and often an I/O savings.
Benchmarks of the final version showed no measurable performance cost,
even with full WAL files.  It's not as convenient to use as what your
patch does, but it's not all that hard either.

There is also pglesslog, although we had pg_clearxlogtail working
before we found the other, so we've never checked it out.  Perhaps it
does even better.

-Kevin
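For concreteness, the filter arrangement Kevin describes would be wired
into postgresql.conf something like the sketch below; the archive
directory path is illustrative, and this assumes only what the thread
states, namely that pg_clearxlogtail reads a segment on stdin and
writes the tail-cleared segment on stdout:

    archive_command = 'pg_clearxlogtail < %p | gzip > /mnt/server/archivedir/%f.gz'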
On Thu, 2009-01-08 at 18:02 -0500, Aidan Van Dyk wrote:
> * Bruce Momjian <bruce@momjian.us> [090108 16:43]:
> >
> > The attached patch from Aidan Van Dyk zeros out the end of WAL files to
> > improve their compressibility.  (The patch was originally sent to
> > 'general', which explains why it was lost until now.)
> >
> > Would someone please eyeball it?  It is useful for compressing PITR
> > logs even if we find a better solution for replication streaming.
>
> The reason I didn't push it was that people claimed it would chew up too
> much WAL bandwidth (causing a large commit latency) when the new 0's are
> all written/fsynced at once...
>
> I don't necessarily buy it, because the force_switch is usually either:
> 1) a timed occurrence at an otherwise idle time, or
> 2) user-forced (i.e. a forced checkpoint/pg_backup), so your IO is going
>    to be hammered anyway...
>
> But that's why I didn't follow up on it...
>
> There are possibly a few other ways to do it, such as zeroing the WAL on
> recycling (but not fsyncing it), and hoping most of the zeros get
> trickled out by the OS before it comes down to a single 16MB fsync; but
> not many people seemed too enthused about the whole WAL compressibility
> subject...
>
> But, the way I see things going on -hackers, I must admit, sync-rep (WAL
> streaming) looks like it's a long way off and possibly not even going to
> do what I want, so *I* would really like this WAL zeroing...
>
> If anybody has any specific things in the patch they think need
> changing, I'll try to accommodate them, but I do note that I never
> submitted it for the commitfest...

Wouldn't it still be easier, less intrusive on core functionality, and
more flexible to just record end-of-valid-WAL somewhere, and then let
the compressor discard the invalid part when compressing and recreate
it with zeros on decompression?

-------------------
Hannu
On Fri, 2009-01-09 at 01:29 +0200, Hannu Krosing wrote:
> On Thu, 2009-01-08 at 18:02 -0500, Aidan Van Dyk wrote:
...
> > There are possibly a few other ways to do it, such as zeroing the WAL on
> > recycling (but not fsyncing it), and hoping most of the zeros get
> > trickled out by the OS before it comes down to a single 16MB fsync; but
> > not many people seemed too enthused about the whole WAL compressibility
> > subject...
> >
> > But, the way I see things going on -hackers, I must admit, sync-rep (WAL
> > streaming) looks like it's a long way off and possibly not even going to
> > do what I want, so *I* would really like this WAL zeroing...
> >
> > If anybody has any specific things in the patch they think need
> > changing, I'll try to accommodate them, but I do note that I never
> > submitted it for the commitfest...
>
> Wouldn't it still be easier, less intrusive on core functionality, and
> more flexible to just record end-of-valid-WAL somewhere, and then let
> the compressor discard the invalid part when compressing and recreate
> it with zeros on decompression?

And some of the functionality already exists for in-progress WAL files,
in the form of pg_current_xlog_location() and
pg_current_xlog_insert_location(); recording end-of-data in the WAL
file just extends this to completed log files.

--
------------------------------------------
Hannu Krosing           http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training
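As a rough illustration of Hannu's point that the valid-data offset is
already exposed for the in-progress segment, a bash sketch (not from
the thread) might look like this; it assumes the textual 0/XXXXXXXX
location format and the usual 16MB segment size, and simply takes the
location modulo the segment size:

    # How many bytes of the in-progress segment hold valid WAL data.
    # pg_current_xlog_location() returns text like '0/1B596676'.
    loc=$(psql -Atc "SELECT pg_current_xlog_location()")
    off=$(( 0x${loc#*/} % (16 * 1024 * 1024) ))
    echo "current segment: $off valid bytes"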
On Fri, 9 Jan 2009, Hannu Krosing wrote:

> Wouldn't it still be easier, less intrusive on core functionality, and
> more flexible to just record end-of-valid-WAL somewhere, and then let
> the compressor discard the invalid part when compressing and recreate
> it with zeros on decompression?

I thought at one point that the direction this was going toward was to
provide the size of the WAL file as a parameter you can use in the
archive_command: %p provides the path, %f the file name, and now %l the
length.  That makes an example archive command something like:

head -c "%l" "%p" | gzip > /mnt/server/archivedir/"%f"

Expanding it back to always be 16MB on the other side might require
some trivial script; I can't think of a standard UNIX tool suitable for
that, but it's easy enough to write.  I'm assuming I'm just remembering
someone else's suggestion here; maybe I just invented the above.

You don't want to just modify pg_standby to accept small files, because
then you've made it harder to make absolutely sure when the file is
ready to be processed if a non-atomic copy is being done.  And it may
make sense to provide some simple C implementations of the clear/expand
tools in contrib even with the %l addition, mainly to help out Windows
users.

To reiterate the choices I remember popping up in the multiple rounds
this has come up, possible implementations that would work for this
general requirement include:

1) Provide the length as part of the archive command
2) Add a more explicit end-of-WAL delimiter
3) Write zeros to the unused portion in the server
4) pglesslog
5) pg_clearxlogtail

With "(6) use sync rep" being not quite a perfect answer; there are
certainly WAN-based use cases where you don't want full sync rep but do
want the WAL to compress as much as possible.  I think (1) is a better
solution than most of these in the context of an improvement to core,
with (4) pglesslog being the main other contender because of how it
provides additional full-page-write improvements.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
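For illustration, the "trivial script" for the restore side could be as
small as the following sketch (not from the thread): it assumes the
default 16MB segment size, GNU coreutils' truncate(1), and the gzip
naming convention from the archive command above; the script name is
made up:

    #!/bin/bash
    # expand_wal.sh: restore-side helper -- decompress an archived
    # segment and zero-pad it back out to the full 16MB segment size.
    set -e
    gunzip -c "/mnt/server/archivedir/$1.gz" > "$2"
    truncate -s 16777216 "$2"   # extend the file with zeros to 16MB

A warm standby could then invoke it as
restore_command = 'expand_wal.sh %f %p'.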
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > The attached patch from Aidan Van Dyk zeros out the end of WAL files to
> > improve their compressibility.  (The patch was originally sent to
> > 'general', which explains why it was lost until now.)
>
> Isn't this redundant given the existence of pglesslog?

It does the same as pglesslog, but is simpler to use because it is
automatic.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> Isn't this redundant given the existence of pglesslog?

> It does the same as pglesslog, but is simpler to use because it is
> automatic.

Which also means that everyone pays the performance penalty whether
they get any benefit or not.  The point of the external solution is to
do the work only in installations that get some benefit.  We've been
over this ground before...

			regards, tom lane
> You don't want to just modify pg_standby to accept small files, because
> then you've made it harder to make absolutely sure when the file is
> ready to be processed if a non-atomic copy is being done.

It is hard, but I think it is the right way forward.  Anyway, I think
the size is not robust at all, because some (most? e.g. win32)
non-atomic copy implementations will also show the final size right
from the beginning.

Could we use stricter file locking when opening WAL for recovery?  Or
implement a wait loop for when the CRC check (plus a basic validity
check) for the next record fails (and the next record is on a 512-byte
boundary)?

I think standby and restore recovery can be treated differently from
startup recovery, because a copied WAL file, even if the copy is not
atomic, will not have trailing valid WAL records from a recycled WAL.
(A solution that recycles WAL files for restore/standby would need to
make sure it renames the files *after* restoring the content.)

Btw, how do we detect the end of WAL when restoring a backup and WAL
after a PANIC?

> 1) Provide the length as part of the archive command

+1

Andreas
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> Isn't this redundant given the existence of pglesslog?
>
> > It does the same as pglesslog, but is simpler to use because it is
> > automatic.
>
> Which also means that everyone pays the performance penalty whether
> they get any benefit or not.  The point of the external solution
> is to do the work only in installations that get some benefit.
> We've been over this ground before...

If there is a performance penalty, you are right, but if the zeroing is
done as part of the archiving, it seems close enough to zero-cost to do
it all the time, no?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
>>> Greg Smith <gsmith@gregsmith.com> wrote:
> I thought at one point that the direction this was going toward was to
> provide the size of the WAL file as a parameter you can use in the
> archive_command: %p provides the path, %f the file name, and now %l the
> length.  That makes an example archive command something like:
>
> head -c "%l" "%p" | gzip > /mnt/server/archivedir/"%f"

Hard to beat for performance.  I thought there was some technical snag.

> Expanding it back to always be 16MB on the other side might require
> some trivial script; I can't think of a standard UNIX tool suitable
> for that, but it's easy enough to write.

Untested, but it seems like something close to this would work:

( cat "$p"; dd if=/dev/zero count=1 ibs=$(( (16 * 1024 * 1024) - $(stat -c%s "$p") )) )

-Kevin
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> Which also means that everyone pays the performance penalty whether
>> they get any benefit or not.  The point of the external solution
>> is to do the work only in installations that get some benefit.
>> We've been over this ground before...

> If there is a performance penalty, you are right, but if the zeroing is
> done as part of the archiving, it seems close enough to zero-cost to do
> it all the time, no?

It's the same cost no matter which process does it.

			regards, tom lane
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Greg Smith <gsmith@gregsmith.com> wrote: >> I thought at one point that the direction this was going toward was to >> provide the size of the WAL file as a parameter you can use in the >> archive_command: > Hard to beat for performance. I thought there was some technical > snag. Yeah: the archiver process doesn't have that information available. regards, tom lane
All that is useless until we get a %l in archive_command...

*I* didn't see an easy way to get at the "written" size later on in the
chain (i.e. in the actual archiving), so I took the path of least
resistance.

The reason *I* shy away from pg_lesslog and pg_clearxlogtail is that
they seem like they could be frail...  I'm just scared of something
changing in PG some time, my pg_clearxlogtail not knowing, me
forgetting to upgrade, and me not doing enough testing of actually
restoring my backups...  Sure, that's all me being negligent, but the
simpler, the better...

If I wrapped this zeroing in a GUC, so people could choose whether to
pay the penalty or not, would that satisfy anyone?  Again, *I* think
that the force_switch case is going to happen when the admin is quite
happy to pay that penalty...  But obviously not everyone...

a.

* Kevin Grittner <Kevin.Grittner@wicourts.gov> [090109 11:01]:
> >>> Greg Smith <gsmith@gregsmith.com> wrote:
> > I thought at one point that the direction this was going toward was to
> > provide the size of the WAL file as a parameter you can use in the
> > archive_command: %p provides the path, %f the file name, and now %l the
> > length.  That makes an example archive command something like:
> >
> > head -c "%l" "%p" | gzip > /mnt/server/archivedir/"%f"
>
> Hard to beat for performance.  I thought there was some technical snag.
>
> > Expanding it back to always be 16MB on the other side might require
> > some trivial script; I can't think of a standard UNIX tool suitable
> > for that, but it's easy enough to write.
>
> Untested, but it seems like something close to this would work:
>
> ( cat "$p"; dd if=/dev/zero count=1 ibs=$(( (16 * 1024 * 1024) - $(stat -c%s "$p") )) )
>
> -Kevin

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
On Fri, 2009-01-09 at 09:31 -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Tom Lane wrote:
> > >> Isn't this redundant given the existence of pglesslog?
> > >
> > > It does the same as pglesslog, but is simpler to use because it is
> > > automatic.
> >
> > Which also means that everyone pays the performance penalty whether
> > they get any benefit or not.  The point of the external solution
> > is to do the work only in installations that get some benefit.
> > We've been over this ground before...
>
> If there is a performance penalty, you are right, but if the zeroing is
> done as part of the archiving, it seems close enough to zero-cost to do
> it all the time, no?

It can already be done as part of the archiving, using an external
tool, as Tom notes.  Yes, we could make the archiver do this, but I see
no big advantage over having it done externally.  It's not faster,
safer, or easier.  Not easier because we would want a parameter to turn
it off when not wanted.

The patch as it stands is IMHO not acceptable, because the work to zero
the file is performed by the unlucky backend that hits EOF on the
current WAL file, which is bad enough, but it is also performed while
holding WALWriteLock.

I like Greg Smith's analysis of this and his conclusion that we could
provide a %l option, but even that would require work to have that info
passed to the archiver, perhaps inside the notification file which is
already written and read by the write processes.  But whether that can
or should be done for this release is a different debate.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
* Simon Riggs <simon@2ndQuadrant.com> [090109 11:33]:

> The patch as it stands is IMHO not acceptable, because the work to zero
> the file is performed by the unlucky backend that hits EOF on the
> current WAL file, which is bad enough, but it is also performed while
> holding WALWriteLock.

Agreed, but note that that extra zeroing work is conditional on the
force_switch, meaning that commits back up behind that WALWriteLock
only during forced xlog switches (like archive_timeout and pg_backup).
I actually did look through and verify that when I made the patch,
although I won't claim that verification is something anybody else
should believe ;-)  But the output I showed (the stats/lines/etc.) did
demonstrate it.

> I like Greg Smith's analysis of this and his conclusion that we could
> provide a %l option, but even that would require work to have that info
> passed to the archiver, perhaps inside the notification file which is
> already written and read by the write processes.  But whether that can
> or should be done for this release is a different debate.

It's certainly not already in this commitfest, just like this patch.  I
thought the initial reaction after I posted it made it pretty clear it
wasn't something people (other than a few of us) were willing to
allow...

a.
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
>>> Aidan Van Dyk <aidan@highrise.ca> 01/09/09 10:22 AM >>>
> The reason *I* shy away from pg_lesslog and pg_clearxlogtail is that
> they seem like they could be frail...  I'm just scared of something
> changing in PG some time, my pg_clearxlogtail not knowing, me
> forgetting to upgrade, and me not doing enough testing of actually
> restoring my backups...

A fair concern.  I can't speak for pglesslog, but pg_clearxlogtail goes
out of its way to minimize this risk.  Changes to the log records
themselves can't break it; it is only dependent on the partitioning.
It bails out with a message to stderr and a non-zero return code if it
finds anything obviously wrong.

It also checks the WAL format for which it was compiled against the WAL
format of the file on which it is invoked, and issues a warning if they
don't match.  We ran into this once on a machine running multiple
releases of PostgreSQL, where the archive script invoked the wrong
executable.  It worked correctly in spite of the warning, but the
warning was enough to alert us to the mismatch and change the path in
the archive script.

-Kevin
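As a sketch of how an archive script might act on that non-zero return
code (the wrapper below is illustrative, not from the thread; pipefail
is bash-specific and is needed because a plain pipeline reports only
gzip's exit status):

    #!/bin/bash
    # archive_wal.sh: invoked as  archive_command = 'archive_wal.sh %p %f'
    # (script name and archive path are made up for this example).
    # With pipefail, a failure in pg_clearxlogtail makes the whole
    # pipeline fail, so archive_command returns non-zero and the server
    # keeps the segment and retries rather than archiving a bad copy.
    set -o pipefail
    pg_clearxlogtail < "$1" | gzip > "/mnt/server/archivedir/$2.gz"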
Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Greg Smith <gsmith@gregsmith.com> wrote:
>>> I thought at one point that the direction this was going toward was to
>>> provide the size of the WAL file as a parameter you can use in the
>>> archive_command:
>
>> Hard to beat for performance.  I thought there was some technical
>> snag.
>
> Yeah: the archiver process doesn't have that information available.

Am I being really dim here - why isn't the first record in the WAL file
a fixed-length record containing e.g. txid_start, time_start, txid_end,
time_end, length?  Write it once when you start using the file and once
when it's finished.

--
  Richard Huxton
  Archonet Ltd
* Richard Huxton <dev@archonet.com> [090109 12:22]:

> > Yeah: the archiver process doesn't have that information available.
>
> Am I being really dim here - why isn't the first record in the WAL file
> a fixed-length record containing e.g. txid_start, time_start, txid_end,
> time_end, length?  Write it once when you start using the file and once
> when it's finished.

It would break the WAL "write-block/sync-block" forward-only progress
of the xlog, which avoids the whole torn-page problem that the heap
has.

a.
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > Greg Smith <gsmith@gregsmith.com> wrote:
> >> I thought at one point that the direction this was going toward was to
> >> provide the size of the WAL file as a parameter you can use in the
> >> archive_command:
>
> > Hard to beat for performance.  I thought there was some technical
> > snag.
>
> Yeah: the archiver process doesn't have that information available.

OK, thanks, I understand now.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Aidan Van Dyk wrote:
> * Richard Huxton <dev@archonet.com> [090109 12:22]:
>
>>> Yeah: the archiver process doesn't have that information available.
>
>> Am I being really dim here - why isn't the first record in the WAL file
>> a fixed-length record containing e.g. txid_start, time_start, txid_end,
>> time_end, length?  Write it once when you start using the file and once
>> when it's finished.
>
> It would break the WAL "write-block/sync-block" forward-only progress
> of the xlog, which avoids the whole torn-page problem that the heap
> has.

I thought that only applied when the filesystem page-size was less than
the data we were writing?

--
  Richard Huxton
  Archonet Ltd
Simon Riggs <simon@2ndQuadrant.com> writes:
> Yes, we could make the archiver do this, but I see no big advantage over
> having it done externally.  It's not faster, safer, or easier.  Not
> easier because we would want a parameter to turn it off when not wanted.

And the other question to ask is how much effort and code we should be
putting into the concept anyway.  AFAICS, file-at-a-time WAL shipping
is a stopgap implementation that will be dead as a doornail once the
current efforts towards realtime replication are finished.  There will
still be some use for forced log switches in connection with backups,
but that's not going to occur often enough to be important to optimize.

			regards, tom lane
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> AFAICS, file-at-a-time WAL shipping is a stopgap implementation that
> will be dead as a doornail once the current efforts towards realtime
> replication are finished.

As long as there is a way to rsync log data to multiple targets not
running replicas, with compression because of low-speed WAN
connections, I'm happy.  It doesn't matter whether that is done using
existing techniques or the new realtime techniques.

-Kevin
On Fri, 2009-01-09 at 13:22 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > Yes, we could make the archiver do this, but I see no big advantage over
> > having it done externally.  It's not faster, safer, or easier.  Not
> > easier because we would want a parameter to turn it off when not wanted.
>
> And the other question to ask is how much effort and code we should be
> putting into the concept anyway.  AFAICS, file-at-a-time WAL shipping
> is a stopgap implementation that will be dead as a doornail once the
> current efforts towards realtime replication are finished.  There will
> still be some use for forced log switches in connection with backups,
> but that's not going to occur often enough to be important to optimize.

Agreed.  Half-filled WAL files were necessary to honour
archive_timeout.  With continuous streaming, all WAL files will be 100%
full before we switch, for most purposes.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
On Fri, 9 Jan 2009, Simon Riggs wrote:

> Half-filled WAL files were necessary to honour archive_timeout.  With
> continuous streaming, all WAL files will be 100% full before we switch,
> for most purposes.

The main use case I'm concerned about losing support for is:

1) Two systems connected by a WAN with significant transmit latency
2) The secondary system runs a warm standby aimed at disaster recovery
3) Business requirements want the standby to never be more than (say) 5
   minutes behind the primary, presuming the WAN is up
4) WAN traffic is "expensive" (money==bandwidth, and one of the two is
   scarce)

This seems a pretty common scenario in my experience.  Right now, this
case is served quite well like this:

-archive_timeout='5 minutes'
-[pglesslog|pg_clearxlogtail] | gzip | rsync

The main concern I have with switching to a more synchronous scheme is
that network efficiency drops as the payload breaks into smaller
pieces.  I haven't had enough time to keep up with all the sync-rep
advances recently to know for sure if there's a configuration there
that's suitable for this case.

If that can be configured to send only in relatively large chunks,
while still never letting things lag too far behind, then I'd agree
completely that the case for any of these WAL-cleaner utilities is
dead, presuming said support makes it into the next release.  If that's
not available, say because the only useful option sends in very small
pieces, there may still be a need for some utility to fill in for this
particular requirement.  Luckily there are many to choose from if it
comes to that.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
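Concretely, and purely as an illustration (the standby host, staging
path, and archive path are made up; either tail-clearing tool could
stand in; archive_timeout is in seconds), that setup might be wired up
like this in postgresql.conf:

    archive_timeout = 300   # force a segment switch at least every 5 minutes
    archive_command = 'pg_clearxlogtail < %p | gzip > /tmp/%f.gz && rsync -a /tmp/%f.gz standby:/wal_archive/ && rm /tmp/%f.gz'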
>>> Greg Smith <gsmith@gregsmith.com> wrote:
> The main use case I'm concerned about losing support for is:
>
> 1) Two systems connected by a WAN with significant transmit latency
> 2) The secondary system runs a warm standby aimed at disaster recovery
> 3) Business requirements want the standby to never be more than (say) 5
>    minutes behind the primary, presuming the WAN is up
> 4) WAN traffic is "expensive" (money==bandwidth, and one of the two is
>    scarce)
>
> This seems a pretty common scenario in my experience.  Right now, this
> case is served quite well like this:
>
> -archive_timeout='5 minutes'
> -[pglesslog|pg_clearxlogtail] | gzip | rsync

You've come pretty close to describing our environment, other than
having 72 primaries, each using rsync to push the WAL files to another
server at the same site while a server at the central site uses rsync
to pull them back.  We don't run warm standby on the backup server at
the site of origin, and don't want to have to do so.

It is critically important that the flow of xlog data never hold up the
primary databases, and that failure to copy xlog to either of the
targets not interfere with copying to the other.  (We have WAN failures
surprisingly often, sometimes for days at a time, and the backup server
on-site is in the same rack of the same cabinet as the database
server.)

Compression of xlog data is important not only for WAN transmission but
for storage space.  We keep two weeks of WAL files to allow recovery
from either of the last two weekly backups, and we archive the first
weekly backup of each month, with the WAL files needed for recovery,
for one year.

So it appears we care about somewhat similar issues.

-Kevin
On Fri, 9 Jan 2009, Aidan Van Dyk wrote:

> *I* didn't see an easy way to get at the "written" size later on in the
> chain (i.e. in the actual archiving), so I took the path of least
> resistance.

I was hoping it might fall out of the other work being done in that
area, given how much that code is still being poked at right now.  As
Hannu pointed out, from a conceptual level you just need to carry along
the same information that pg_current_xlog_location() returns to the
archiver, on all the paths where a segment might end early.

> If I wrapped this zeroing in a GUC, so people could choose whether to
> pay the penalty or not, would that satisfy anyone?

Rather than creating a whole new GUC, it might be possible to turn
archive_mode into an enum setting: off, on, and cleaned as the modes,
perhaps.  That would avoid making a new setting, with the downside that
a bunch of critical code would look less clear than it does with a
boolean.

> Again, *I* think that the force_switch case is going to happen when the
> admin is quite happy to pay that penalty...  But obviously not
> everyone...

I understand the case you've made for why it doesn't matter, and for
almost every case you're right.  The situation it may be vulnerable to
is where a burst of transactions comes in just as the archive timeout
expires after minimal WAL activity.  There I think you can end up with
a bunch of clients waiting behind an almost-full-segment zero-fill
operation, which pushes up the worst-case latency.  I've been able to
measure how the similar case of zero-filling a brand-new segment can
impact things; this would be much less likely to happen, because the
timing would have to line up just wrong, but I think it's still
possible.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
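Under that hypothetical scheme (never implemented; "cleaned" is just
the proposed third value restated from Greg's suggestion above), the
configuration would reduce to a single line:

    archive_mode = cleaned   # archive as usual, but zero the unused tail first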
* Greg Smith <gsmith@gregsmith.com> [090109 18:39]:

> I was hoping it might fall out of the other work being done in that
> area, given how much that code is still being poked at right now.  As
> Hannu pointed out, from a conceptual level you just need to carry along
> the same information that pg_current_xlog_location() returns to the
> archiver, on all the paths where a segment might end early.

I was (am) also hoping that something falls out of sync-rep that gives
me better PITR backups (better than a small archive_timeout)...  That
hope is what made me abandon this patch after the initial feedback.

> Rather than creating a whole new GUC, it might be possible to turn
> archive_mode into an enum setting: off, on, and cleaned as the modes,
> perhaps.  That would avoid making a new setting, with the downside that
> a bunch of critical code would look less clear than it does with a
> boolean.

I'm content to wait and see what falls out of the sync-rep stuff...
... for now ...

> I understand the case you've made for why it doesn't matter, and for
> almost every case you're right.  The situation it may be vulnerable to
> is where a burst of transactions comes in just as the archive timeout
> expires after minimal WAL activity.  There I think you can end up with
> a bunch of clients waiting behind an almost-full-segment zero-fill
> operation, which pushes up the worst-case latency.  I've been able to
> measure how the similar case of zero-filling a brand-new segment can
> impact things; this would be much less likely to happen, because the
> timing would have to line up just wrong, but I think it's still
> possible.

Ya, and it's one of the many times PG hits these worst-latency spikes
;-)  Generally, it's *very* good... and once in a while, when all the
stars line up just wrong, you get these spikes...  But even with these
spikes, it's plenty fast enough for the stuff I've done...

a.
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.