Thread: Improving compressibility of WAL files

Improving compressibility of WAL files

From
Bruce Momjian
Date:
The attached patch from Aidan Van Dyk zeros out the end of WAL files to
improve their compressibility. (The patch was originally sent to
'general' which explains why it was lost until now.)

Would someone please eyeball it?  It is useful for compressing PITR
logs even if we find a better solution for replication streaming.

As for why this patch is useful:

> > The real reason not to put that functionality into core (or even
> > contrib) is that it's a stopgap kluge.  What the people who want this
> > functionality *really* want is continuous (streaming) log-shipping, not
> > WAL-segment-at-a-time shipping.  Putting functionality like that into
> > core is infinitely more interesting than putting band-aids on a
> > segmented approach.
>
> Well, I realize we want streaming archive logs, but there are still
> going to be people who are archiving for point-in-time recovery, and I
> assume a good number of them are going to compress their WAL files to
> save space, because they have to store a lot of them.  Wouldn't zeroing
> out the trailing byte of WAL still help those people?
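
As a rough illustration (made-up file names, with random data standing in
for the real WAL content): a segment whose unused tail is zeroed compresses
down to little more than its real content, while one whose tail still holds
stale, incompressible data does not:

    dd if=/dev/urandom of=seg.head bs=1M count=1 2>/dev/null
    { cat seg.head; head -c $((15*1024*1024)) /dev/zero; }    > zero_tail.seg
    { cat seg.head; head -c $((15*1024*1024)) /dev/urandom; } > stale_tail.seg
    gzip -c zero_tail.seg  | wc -c   # ~1MB: only the real data costs anything
    gzip -c stale_tail.seg | wc -c   # ~16MB: the stale tail doesn't compress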

---------------------------------------------------------------------------

Aidan Van Dyk wrote:
> * Aidan Van Dyk <aidan@highrise.ca> [081031 15:11]:
> > How about something like the attached.  It's been spun quickly, passed
> > regression tests, and some simple hand tests on REL8_3_STABLE.  It seems like
> > HEAD can't initdb on my machine (quad opteron with SW raid1); I tried a few
> > revisions in the last few days, and initdb dies on them all...
>
> OK, HEAD does work; I don't know what was going on previously... Attached is my
> patch against head.
>
> I'll try and pull out some machines on Monday to really thrash/crash this but
> I'm running out of time today to set that up.
>
> But in running HEAD, I've come across this:
>     regression=# SELECT pg_stop_backup();
>     WARNING:  pg_stop_backup still waiting for archive to complete (60 seconds elapsed)
>     WARNING:  pg_stop_backup still waiting for archive to complete (120 seconds elapsed)
>     WARNING:  pg_stop_backup still waiting for archive to complete (240 seconds elapsed)
>
> My archive script is *not* running; it ran and exited:
>     mountie@pumpkin:~/projects/postgresql/PostgreSQL/src/test/regress$ ps -ewf | grep post
>     mountie   2904     1  0 16:31 pts/14   00:00:00 /home/mountie/projects/postgresql/PostgreSQL/src/test/regress/tmp_check/install/usr/local/pgsql
>     mountie   2906  2904  0 16:31 ?        00:00:01 postgres: writer process
>     mountie   2907  2904  0 16:31 ?        00:00:00 postgres: wal writer process
>     mountie   2908  2904  0 16:31 ?        00:00:00 postgres: archiver process   last was 00000001000000000000001F
>     mountie   2909  2904  0 16:31 ?        00:00:01 postgres: stats collector process
>     mountie   2921  2904  1 16:31 ?        00:00:18 postgres: mountie regression 127.0.0.1(56455) idle
>
> Those all match up:
>     mountie@pumpkin:~/projects/postgresql/PostgreSQL/src/test/regress$ pstree -acp 2904
>     postgres,2904 -D/home/mountie/projects/postgres
>       ├─postgres,2906
>       ├─postgres,2907
>       ├─postgres,2908
>       ├─postgres,2909
>       └─postgres,2921
>
> strace on the "archiver process" postgres:
>     select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
>     getppid()                               = 2904
>     select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
>     getppid()                               = 2904
>     select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
>     getppid()                               = 2904
>     select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
>     getppid()                               = 2904
>     select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
>     getppid()                               = 2904
>
> It *does* finally finish; the postmaster log looks like this ("Archiving ..." is what my
> archive script prints, bytes is the gzip'ed size):
>     Archiving 000000010000000000000016 [16397 bytes]
>     Archiving 000000010000000000000017 [4405457 bytes]
>     Archiving 000000010000000000000018 [3349243 bytes]
>     Archiving 000000010000000000000019 [3349505 bytes]
>     LOG:  ZEROING xlog file 0 segment 27 from 7954432 - 16777216 [8822784 bytes]
>     Archiving 00000001000000000000001A [3349590 bytes]
>     Archiving 00000001000000000000001B [1596676 bytes]
>     LOG:  ZEROING xlog file 0 segment 28 from 8192 - 16777216 [16769024 bytes]
>     Archiving 00000001000000000000001C [16398 bytes]
>     LOG:  ZEROING xlog file 0 segment 29 from 8192 - 16777216 [16769024 bytes]
>     Archiving 00000001000000000000001D [16397 bytes]
>     LOG:  ZEROING xlog file 0 segment 30 from 8192 - 16777216 [16769024 bytes]
>     Archiving 00000001000000000000001E [16393 bytes]
>     Archiving 00000001000000000000001E.00000020.backup [146 bytes]
>     WARNING:  pg_stop_backup still waiting for archive to complete (60 seconds elapsed)
>     WARNING:  pg_stop_backup still waiting for archive to complete (120 seconds elapsed)
>     WARNING:  pg_stop_backup still waiting for archive to complete (240 seconds elapsed)
>     LOG:  ZEROING xlog file 0 segment 31 from 8192 - 16777216 [16769024 bytes]
>     Archiving 00000001000000000000001F [16395 bytes]
>
>
> So what's this "pg_stop_backup still waiting for archive to complete" for 5
> minutes state?  I've not seen that before (runing 8.2 and 8.3).
>
> a.
> --
> Aidan Van Dyk                                             Create like a god,
> aidan@highrise.ca                                       command like a king,
> http://www.highrise.ca/                                   work like a slave.


--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
commit fba38257e52564276bb106d55aef14d0de481169
Author: Aidan Van Dyk <aidan@highrise.ca>
Date:   Fri Oct 31 12:35:24 2008 -0400

    WIP: Zero the xlog tail on a forced switch

    If XLogWrite is called with xlog_switch, an xlog switch has been forced, either
    by a timeout-based switch (archive_timeout) or an interactive forced xlog
    switch (pg_switch_xlog/pg_stop_backup).  In those cases, we assume we can
    afford a little extra I/O bandwidth to make xlogs much more compressible.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 003098f..c6f9c79 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1600,6 +1600,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
              */
             if (finishing_seg || (xlog_switch && last_iteration))
             {
+                /*
+                 * If we've had an xlog switch forced, then we want to zero
+                 * out the rest of the segment.  We zero it out here because at the
+                 * force switch time, IO bandwidth isn't a problem.
+                 *   -- AIDAN
+                 */
+                if (xlog_switch)
+                {
+                    char buf[1024];
+                    uint32 left = (XLogSegSize - openLogOff);
+                    ereport(LOG,
+                        (errmsg("ZEROING xlog file %u segment %u from %u - %u [%u bytes]",
+                                openLogId, openLogSeg,
+                                openLogOff, XLogSegSize, left)
+                         ));
+                    memset(buf, 0, sizeof(buf));
+                    while (left > 0)
+                    {
+                        size_t len = (left > sizeof(buf)) ? sizeof(buf) : left;
+                        write(openLogFile, buf, len);
+                        left -= len;
+                    }
+                }
+
                 issue_xlog_fsync();
                 LogwrtResult.Flush = LogwrtResult.Write;        /* end of page */


Re: Improving compressibility of WAL files

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> The attached patch from Aidan Van Dyk zeros out the end of WAL files to
> improve their compressibility. (The patch was originally sent to
> 'general' which explains why it was lost until now.)

Isn't this redundant given the existence of pglesslog?
        regards, tom lane


Re: Improving compressibility of WAL files

From
Aidan Van Dyk
Date:
* Bruce Momjian <bruce@momjian.us> [090108 16:43]:
> 
> The attached patch from Aidan Van Dyk zeros out the end of WAL files to
> improve their compressibility. (The patch was originally sent to
> 'general' which explains why it was lost until now.)
> 
> Would someone please eyeball it?  It is useful for compressing PITR
> logs even if we find a better solution for replication streaming.

The reason I didn't push it was that people claimed it would chew up too
much WAL bandwidth (causing large commit latency) when the new zeros are
all written/fsynced at once...

I don't necessarily buy it, because the forced switch is usually either:
1) a timed occurrence at an otherwise idle time, or
2) user-forced (i.e. a forced checkpoint/pg_backup), so your I/O is going to be hammered anyway...

But that's why I didn't follow up on it...

There are possibly a few other ways to do it, such as zeroing the WAL on
recycling (but not fsyncing it), hoping that most of the zeros get
trickled out by the OS before it comes down to a single 16MB fsync, but
not many people seemed too enthused about the whole WAL compressibility
subject...

But, the way I see things going on -hackers, I must admit, sync-rep (WAL
streaming) looks like it's a long way off and possibly not even going to
do what I want, so *I* would really like this wal zero'ing...

If anybody has any specific things with the patch they think need
changing, I'll try to accommodate, but I do note that I never
submitted it for the commitfest...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Improving compressibility of WAL files

From
"Kevin Grittner"
Date:
>>> Aidan Van Dyk <aidan@highrise.ca> 01/08/09 5:02 PM >>> 
> *I* would really like this wal zero'ing...
pg_clearxlogtail (in pgfoundry) does exactly the same zeroing of the
tail as a filter.  If you pipe through it on the way to gzip, there
is no increase in disk I/O over a straight gzip, and often an I/O
savings.  Benchmarks of the final version showed no measurable
performance cost, even with full WAL files.
It's not as convenient to use as what your patch does, but it's not
all that hard either.  There is also pglesslog, although we had
pg_clearxlogtail working before we found the other, so we've never
checked it out.  Perhaps it does even better.
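
For example, an archive_command along these lines (untested as written
here; the paths and the .gz suffix are just placeholders):

    archive_command = 'pg_clearxlogtail < "%p" | gzip > /mnt/server/archivedir/"%f".gz'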
-Kevin


Re: Improving compressibility of WAL files

From
Hannu Krosing
Date:
On Thu, 2009-01-08 at 18:02 -0500, Aidan Van Dyk wrote:
> * Bruce Momjian <bruce@momjian.us> [090108 16:43]:
> > 
> > The attached patch from Aidan Van Dyk zeros out the end of WAL files to
> > improve their compressibility. (The patch was originally sent to
> > 'general' which explains why it was lost until now.)
> > 
> > Would someone please eyeball it?  It is useful for compressing PITR
> > logs even if we find a better solution for replication streaming.
> 
> The reason I didn't push it was that people claimed it would chew up too
> much WAL bandwidth (causing large commit latency) when the new zeros are
> all written/fsynced at once...
> 
> I don't necessarily buy it, because the forced switch is usually either:
> 1) a timed occurrence at an otherwise idle time, or
> 2) user-forced (i.e. a forced checkpoint/pg_backup), so your I/O is going
>    to be hammered anyway...
> 
> But that's why I didn't follow up on it...
> 
> There are possibly a few other ways to do it, such as zeroing the WAL on
> recycling (but not fsyncing it), hoping that most of the zeros get
> trickled out by the OS before it comes down to a single 16MB fsync, but
> not many people seemed too enthused about the whole WAL compressibility
> subject...
> 
> But, the way I see things going on -hackers, I must admit, sync-rep (WAL
> streaming) looks like it's a long way off and possibly not even going to
> do what I want, so *I* would really like this wal zero'ing...
> 
> If anybody has any specific things with the patch they think need
> changing, I'll try to accommodate, but I do note that I never
> submitted it for the commitfest...

won't it still be easier/less intrusive on inline core functionality and
more flexible to just record end-of-valid-wal somewhere and then let the
compressor discard the invalid part when compressing and recreate it
with zeros on decompression ?

-------------------
Hannu





Re: Improving compressibility of WAL files

From
Hannu Krosing
Date:
On Fri, 2009-01-09 at 01:29 +0200, Hannu Krosing wrote:
> On Thu, 2009-01-08 at 18:02 -0500, Aidan Van Dyk wrote:
...
> > There are possibly a few other ways to do it, such as zeroing the WAL on
> > recycling (but not fsyncing it), hoping that most of the zeros get
> > trickled out by the OS before it comes down to a single 16MB fsync, but
> > not many people seemed too enthused about the whole WAL compressibility
> > subject...
> > 
> > But, the way I see things going on -hackers, I must admit, sync-rep (WAL
> > streaming) looks like it's a long way off and possibly not even going to
> > do what I want, so *I* would really like this wal zero'ing...
> > 
> > If anybody has any specific things with the patch they think need
> > changing, I'll try to accommodate, but I do note that I never
> > submitted it for the commitfest...
> 
> won't it still be easier/less intrusive on inline core functionality and
> more flexible to just record end-of-valid-wal somewhere and then let the
> compressor discard the invalid part when compressing and recreate it
> with zeros on decompression ?

And some of the functionality already exists for in-progress WAL files in
the form of pg_current_xlog_location() and
pg_current_xlog_insert_location(); recording end-of-data in the WAL file
just extends this to completed log files.
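
For the segment currently being written, a rough sketch of that (it relies
on pg_xlogfile_name_offset(), which I believe already exists) shows how far
into the 16MB file the valid data reaches:

    psql -Atc "SELECT * FROM pg_xlogfile_name_offset(pg_current_xlog_location())"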

-- 
------------------------------------------
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training



Re: Improving compressibility of WAL files

From
Greg Smith
Date:
On Fri, 9 Jan 2009, Hannu Krosing wrote:

> won't it still be easier/less intrusive on inline core functionality and
> more flexible to just record end-of-valid-wal somewhere and then let the
> compressor discard the invalid part when compressing and recreate it
> with zeros on decompression ?

I thought at one point that the direction this was going toward was to 
provide the size of the WAL file as a parameter you can use in the 
archive_command:  %p provides the path, %f the file name, and now %l the 
length.  That makes an example archive command something like:

head -c "%l" "%p" | gzip > /mnt/server/archivedir/"%f"

Expanding it back to always be 16MB on the other side might require some 
trivial script, can't think of a standard UNIX tool suitable for that but 
it's easy enough to write.  I'm assuming I just remembering someone else's 
suggestion here, maybe I just invented the above.  You don't want to just 
modify pg_standby to accept small files, because then you've made it 
harder to make absolutely sure when the file is ready to be processed if a 
non-atomic copy is being done.  And it may make sense to provide some 
simple C implementations of the clear/expand tools in contrib even with 
the %l addition, mainly to help out Windows users.
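
For the record, a restore-side sketch along those lines (untested; $f and 
$target stand in for restore_command's %f and %p, and it assumes GNU stat, 
16MB segments, and the gzip'd archive layout above):

    gunzip -c < /mnt/server/archivedir/"$f" > "$target"
    pad=$(( 16*1024*1024 - $(stat -c%s "$target") ))
    if [ "$pad" -gt 0 ]; then
        head -c "$pad" /dev/zero >> "$target"
    fi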

To reiterate the choices I remember popping up in the multiple rounds this 
has come up, possible implementations that would work for this general 
requirement include:

1) Provide the length as part of the archive command
2) Add a more explicit end-of-WAL delimiter
3) Write zeros to the unused portion in the server
4) pglesslog
5) pg_clearxlogtail

With "(6) use sync rep" being not quite a perfect answer; there are 
certainly WAN-based use cases where you don't want full sync rep but do 
want the WAL to compress as much as possible.

I think (1) is a better solution than most of these in the context of an 
improvement to core, with (4) pglesslog being the main other contender 
because of how it provides additional full-page write improvements.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Improving compressibility of WAL files

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > The attached patch from Aidan Van Dyk zeros out the end of WAL files to
> > improve their compressibility. (The patch was originally sent to
> > 'general' which explains why it was lost until now.)
> 
> Isn't this redundant given the existence of pglesslog?

It does the same as pglesslog, but is simpler to use because it is
automatic.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Improving compressibility of WAL files

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> Isn't this redundant given the existence of pglesslog?

> It does the same as pglesslog, but is simpler to use because it is
> automatic.

Which also means that everyone pays the performance penalty whether
they get any benefit or not.  The point of the external solution
is to do the work only in installations that get some benefit.
We've been over this ground before...
        regards, tom lane


Re: Improving compressibility of WAL files

From
Zeugswetter Andreas OSB sIT
Date:
> You don't want to just
> modify pg_standby to accept small files, because then you've made it
> harder to make absolutely sure when the file is ready to be
> processed if a non-atomic copy is being done.

It is hard, but I think it is the right way forward.
Anyway I think the size is not robust at all because some (most ? e.g. win32) non-atomic copy
implementations will also show the final size right from the beginning.

Could we use stricter file locking when opening WAL for recovery ?

Or implement a wait loop when the crc check (+ a basic validity check) for the next
record fails (and the next record is on a 512 byte boundary ?).
I think standby and restore recovery can be treated differently to startup recovery because
a copied wal file, even if the copy is not atomic, will not have trailing valid WAL records
from a recycled WAL. (A solution that recycles WAL files for restore/standby would need to make
sure it renames the files *after* restoring the content.)

Btw how do we detect end of WAL when restoring a backup and WAL after PANIC ?

> 1) Provide the length as part of the archive command

+1

Andreas

Re: Improving compressibility of WAL files

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> Isn't this redundant given the existence of pglesslog?
> 
> > It does the same as pglesslog, but is simpler to use because it is
> > automatic.
> 
> Which also means that everyone pays the performance penalty whether
> they get any benefit or not.  The point of the external solution
> is to do the work only in installations that get some benefit.
> We've been over this ground before...

If there is a performance penalty, you are right, but if the zeroing is
done as part of the archiving, it seems close enough to zero cost to do it
all the time, no?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Improving compressibility of WAL files

From
"Kevin Grittner"
Date:
>>> Greg Smith <gsmith@gregsmith.com> wrote: 
> I thought at one point that the direction this was going toward was to
> provide the size of the WAL file as a parameter you can use in the
> archive_command:  %p provides the path, %f the file name, and now %l the
> length.  That makes an example archive command something like:
> 
> head -c "%l" "%p" | gzip > /mnt/server/archivedir/"%f"
Hard to beat for performance.  I thought there was some technical
snag.
> Expanding it back to always be 16MB on the other side might require some
> trivial script, can't think of a standard UNIX tool suitable for that but
> it's easy enough to write.
Untested, but it seems like something close to this would work:
( cat "$p"; dd if=/dev/zero ibs=$(( 16*1024*1024 - $(stat -c%s "$p") )) count=1 ) 2>/dev/null
-Kevin


Re: Improving compressibility of WAL files

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> Which also means that everyone pays the performance penalty whether
>> they get any benefit or not.  The point of the external solution
>> is to do the work only in installations that get some benefit.
>> We've been over this ground before...

> If there is a performance penalty, you are right, but if the zeroing is
> done as part of the archiving, it seems close enough to zero cost to do it
> all the time, no?

It's the same cost no matter which process does it.
        regards, tom lane


Re: Improving compressibility of WAL files

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Greg Smith <gsmith@gregsmith.com> wrote: 
>> I thought at one point that the direction this was going toward was to 
>> provide the size of the WAL file as a parameter you can use in the 
>> archive_command:
> Hard to beat for performance.  I thought there was some technical
> snag.
Yeah: the archiver process doesn't have that information available.
        regards, tom lane


Re: Improving compressibility of WAL files

From
Aidan Van Dyk
Date:
All that is useless until we get a %l in archive_command...

*I* didn't see an easy way to get at the "written" size later on in the
chain (i.e. in the actual archiving), so I took the path of least
resistance.

The reason *I* shy away from pg_lesslog and pg_clearxlogtail is that
they seem like they could be frail... I'm just scared of something changing
in PG some time, my pg_clearxlogtail not knowing, me forgetting to
upgrade, and me not doing enough testing of actually restoring my backups...

Sure, it's all me being negligent, but the simpler, the better...

If I wrapped this zeroing in a GUC, people could choose whether to pay the
penalty or not; would that satisfy anyone?

Again, *I* think that the force_switch case is going to happen when the
admin's quite happy to pay that penalty...  But obviously not
everyone...

a.

* Kevin Grittner <Kevin.Grittner@wicourts.gov> [090109 11:01]:
> >>> Greg Smith <gsmith@gregsmith.com> wrote: 
> > I thought at one point that the direction this was going toward was to
> > provide the size of the WAL file as a parameter you can use in the
> > archive_command:  %p provides the path, %f the file name, and now %l the
> > length.  That makes an example archive command something like:
> > 
> > head -c "%l" "%p" | gzip > /mnt/server/archivedir/"%f"
>  
> Hard to beat for performance.  I thought there was some technical
> snag.
>  
> > Expanding it back to always be 16MB on the other side might require some
> > trivial script, can't think of a standard UNIX tool suitable for that but
> > it's easy enough to write.
>  
> Untested, but it seems like something close to this would work:
>  
> ( cat "$p"; dd if=/dev/zero ibs=$(( 16*1024*1024 - $(stat -c%s "$p") )) count=1 ) 2>/dev/null
>  
> -Kevin
> 

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Improving compressibility of WAL files

From
Simon Riggs
Date:
On Fri, 2009-01-09 at 09:31 -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Tom Lane wrote:
> > >> Isn't this redundant given the existence of pglesslog?
> > 
> > > It does the same as pglesslog, but is simpler to use because it is
> > > automatic.
> > 
> > Which also means that everyone pays the performance penalty whether
> > they get any benefit or not.  The point of the external solution
> > is to do the work only in installations that get some benefit.
> > We've been over this ground before...
> 
> If there is a performance penalty, you are right, but if the zeroing is
> done as part of the archiving, it seems close enough to zero cost to do it
> all the time, no?

It can already be done as part of the archiving, using an external tool
as Tom notes.

Yes, we could make the archiver do this, but I see no big advantage over
having it done externally. It's not faster, safer, easier. Not easier
because we would want a parameter to turn it off when not wanted.

The patch as it stands is IMHO not acceptable because the work to zero the
file is performed by the unlucky backend that hits EOF on the current
WAL file, which is bad enough, but it is also performed while holding
WALWriteLock. 

I like Greg Smith's analysis of this and his conclusion that we could
provide a %l option, but even that would require work to have that info
passed to the archiver. Perhaps inside the notification file which is
already written and read by the write processes. But whether that can or
should be done for this release is a different debate.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Improving compressibility of WAL files

From
Aidan Van Dyk
Date:
* Simon Riggs <simon@2ndQuadrant.com> [090109 11:33]:
> The patch as it stands is IMHO not acceptable because the work to zero the
> file is performed by the unlucky backend that hits EOF on the current
> WAL file, which is bad enough, but it is also performed while holding
> WALWriteLock. 

Agreed, but note that the extra zeroing work is conditional on the
"force_switch", meaning that commits back up behind that WALWriteLock only
during forced xlog switches (like archive_timeout and pg_backup).  I
actually did look through and verify that when I made the patch, although I
don't claim that verification is something anybody else should believe ;-)
But the output I showed (the stats/lines/etc.) did demonstrate that.

> I like Greg Smith's analysis of this and his conclusion that we could
> provide a %l option, but even that would require work to have that info
> passed to the archiver. Perhaps inside the notification file which is
> already written and read by the write processes. But whether that can or
> should be done for this release is a different debate.

It's certainly not already in this commitfest, just like this patch.  I
thought the initial reaction after I posted it made it pretty clear it
wasn't something people (other than a few of us) were willing to
allow...

a.
-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Improving compressibility of WAL files

From
"Kevin Grittner"
Date:
>>> Aidan Van Dyk <aidan@highrise.ca> 01/09/09 10:22 AM >>> 
> The reason *I* shy away from pg_lesslog and pg_clearxlogtail is that
> they seem like they could be frail... I'm just scared of something
> changing in PG some time, my pg_clearxlogtail not knowing, me
> forgetting to upgrade, and me not doing enough testing of actually
> restoring my backups...
A fair concern.  I can't speak about pglesslog, but pg_clearxlogtail
goes out of its way to minimize this risk.  Changes to log records
themselves can't break it; it is only dependent on the partitioning. 
It bails with a message to stderr and a non-zero return code if it
finds anything obviously wrong.  It also checks the WAL format for
which it was compiled against the WAL format on which it was invoked,
and issues a warning if they don't match.  We ran into this once on a
machine running multiple releases of PostgreSQL where the archive
script invoked the wrong executable.  It worked correctly in spite of
the warning, but the warning was enough to alert us to the mismatch
and change the path in the archive script.
-Kevin


Re: Improving compressibility of WAL files

From
Richard Huxton
Date:
Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Greg Smith <gsmith@gregsmith.com> wrote: 
>>> I thought at one point that the direction this was going toward was to 
>>> provide the size of the WAL file as a parameter you can use in the 
>>> archive_command:
>  
>> Hard to beat for performance.  I thought there was some technical
>> snag.
>  
> Yeah: the archiver process doesn't have that information available.

Am I being really dim here - why isn't the first record in the WAL file
a fixed-length record containing e.g. txid_start, time_start, txid_end,
time_end, length? Write it once when you start using the file and once
when it's finished.

-- 
  Richard Huxton
  Archonet Ltd


Re: Improving compressibility of WAL files

From
Aidan Van Dyk
Date:
* Richard Huxton <dev@archonet.com> [090109 12:22]:

> > Yeah: the archiver process doesn't have that information available.

> Am I being really dim here - why isn't the first record in the WAL file
> a fixed-length record containing e.g. txid_start, time_start, txid_end,
> time_end, length? Write it once when you start using the file and once
> when it's finished.

It would break the WAL "write-block/sync-block" forward only progress of
the xlog, which avoids the whole torn-page problem that the heap has.

a.
-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Improving compressibility of WAL files

From
Bruce Momjian
Date:
Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > Greg Smith <gsmith@gregsmith.com> wrote: 
> >> I thought at one point that the direction this was going toward was to 
> >> provide the size of the WAL file as a parameter you can use in the 
> >> archive_command:
>  
> > Hard to beat for performance.  I thought there was some technical
> > snag.
>  
> Yeah: the archiver process doesn't have that information available.

OK, thanks, I understand now.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Improving compressibility of WAL files

From
Richard Huxton
Date:
Aidan Van Dyk wrote:
> * Richard Huxton <dev@archonet.com> [090109 12:22]:
> 
>>> Yeah: the archiver process doesn't have that information available.
> 
>> Am I being really dim here - why isn't the first record in the WAL file
>> a fixed-length record containing e.g. txid_start, time_start, txid_end,
>> time_end, length? Write it once when you start using the file and once
>> when it's finished.
> 
> It would break the WAL "write-block/sync-block" forward only progress of
> the xlog, which avoids the whole torn-page problem that the heap has.

I thought that only applied when the filesystem page-size was less than
the data we were writing?

-- 
  Richard Huxton
  Archonet Ltd


Re: Improving compressibility of WAL files

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> Yes, we could make the archiver do this, but I see no big advantage over
> having it done externally. It's not faster, safer, easier. Not easier
> because we would want a parameter to turn it off when not wanted.

And the other question to ask is how much effort and code should we be
putting into the concept anyway.  AFAICS, file-at-a-time WAL shipping
is a stopgap implementation that will be dead as a doornail once the
current efforts towards realtime replication are finished.  There will
still be some use for forced log switches in connection with backups,
but that's not going to occur often enough to be important to optimize.
        regards, tom lane


Re: Improving compressibility of WAL files

From
"Kevin Grittner"
Date:
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> AFAICS, file-at-a-time WAL shipping
> is a stopgap implementation that will be dead as a doornail once the
> current efforts towards realtime replication are finished.
As long as there is a way to rsync log data to multiple targets not
running replicas, with compression because of low-speed WAN
connections, I'm happy.  Doesn't matter whether that is using existing
techniques or the new realtime techniques.
-Kevin


Re: Improving compressibility of WAL files

From
Simon Riggs
Date:
On Fri, 2009-01-09 at 13:22 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > Yes, we could make the archiver do this, but I see no big advantage over
> > having it done externally. It's not faster, safer, easier. Not easier
> > because we would want a parameter to turn it off when not wanted.
> 
> And the other question to ask is how much effort and code should we be
> putting into the concept anyway.  AFAICS, file-at-a-time WAL shipping
> is a stopgap implementation that will be dead as a doornail once the
> current efforts towards realtime replication are finished.  There will
> still be some use for forced log switches in connection with backups,
> but that's not going to occur often enough to be important to optimize.

Agreed.

Half-filled WAL files were necessary to honour archive_timeout. With
continuous streaming all WAL files will be 100% full before we switch,
for most purposes.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Improving compressibility of WAL files

From
Greg Smith
Date:
On Fri, 9 Jan 2009, Simon Riggs wrote:

> Half-filled WAL files were necessary to honour archive_timeout. With
> continuous streaming all WAL files will be 100% full before we switch,
> for most purposes.

The main use case I'm concerned about losing support for is:

1) Two systems connected by a WAN with significant transmit latency
2) The secondary system runs a warm standby aimed at disaster recovery
3) Business requirements want the standby to never be more than (say) 5 
minutes behind the primary, presuming the WAN is up
4) WAN traffic is "expensive" (money==bandwidth, one of the two is scarce)

This seems a pretty common scenario in my experience.  Right now, this 
case is served quite well like this:

-archive_timeout='5 minutes'
-[pglesslog|pg_clearxlogtail] | gzip | rsync

The main concern I have with switching to a more synchronous scheme is 
that network efficiency drops as the payload breaks into smaller pieces. 
I haven't had enough time to keep up with all the sync rep advances 
recently to know for sure if there's a configuration there that's suitable 
for this case.  If that can be configured to send only in relatively large 
chunks, while still never letting things lag too far behind, then I'd 
agree completely that the case for any of these WAL cleaner utilities is 
dead--presuming said support makes it into the next release.

If that's not available, say because the only useful option sends in very 
small pieces, there may still be a need for some utility to fill in for 
this particular requirement.  Luckily there are many to choose from if it 
comes to that.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Improving compressibility of WAL files

From
"Kevin Grittner"
Date:
>>> Greg Smith <gsmith@gregsmith.com> wrote: 
> The main use case I'm concerned about losing support for is:
> 
> 1) Two systems connected by a WAN with significant transmit latency
> 2) The secondary system runs a warm standby aimed at disaster recovery
> 3) Business requirements want the standby to never be more than (say) 5
> minutes behind the primary, presuming the WAN is up
> 4) WAN traffic is "expensive" (money==bandwidth, one of the two is scarce)
> 
> This seems a pretty common scenario in my experience.  Right now, this
> case is served quite well like this:
> 
> -archive_timeout='5 minutes'
> -[pglesslog|pg_clearxlogtail] | gzip | rsync
You've come pretty close to describing our environment, other than
having 72 primaries each using rsync to push the WAL files to another
server at the same site while a server at the central site uses rsync
to pull them back.  We don't run warm standby on the backup server at
the site of origin, and don't want to have to do so.
It is critically important that the flow of xlog data never hold up
the primary databases, and that failure to copy xlog to either of the
targets not interfere with copying to the other.  (We have WAN
failures surprisingly often, sometimes for days at a time, and the
backup server on-site is in the same rack of the same cabinet as the
database server.)
Compression of xlog data is important not only for WAN transmission,
but for storage space.  We keep two weeks of WAL files to allow
recovery from either of the last two weekly backups, and we archive
the first weekly backup of each month, with the WAL files needed for
recovery, for one year.
So it appears we care about somewhat similar issues.
-Kevin


Re: Improving compressibility of WAL files

From
Greg Smith
Date:
On Fri, 9 Jan 2009, Aidan Van Dyk wrote:

> *I* didn't see an easy way to get at the "written" size later on in the
> chain (i.e. in the actual archiving), so I took the path of least
> resistance.

I was hoping it might fall out of the other work being done in that area, 
given how much that code is still being poked at right now.  As Hannu 
pointed out, from a conceptual level you just need to carry along the same 
information that pg_current_xlog_location() returns to the archiver on all 
the paths where a segment might end early.

> If I wrapped this zeroing in a GUC, people could choose whether to pay the
> penalty or not; would that satisfy anyone?

Rather than creating a whole new GUC, it might be possible to turn 
archive_mode into an enum setting:  off, on, and cleaned as the modes 
perhaps.  That would avoid making a new setting, with the downside that a 
bunch of critical code would look less clear than it does with a boolean.
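
Hypothetically that would look something like this in postgresql.conf (not 
an existing setting, just the idea):

    archive_mode = cleaned        # would accept off | on | cleaned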

> Again, *I* think that the force_switch case is going to happen when the
> admin's quite happy to pay that penalty...  But obviously not
> everyone...

I understand the case you've made for why it doesn't matter, and for 
almost every case you're right.  The situation it may be vulnerable to is 
where a burst of transactions come in just as the archive timeout expires 
after minimal WAL activity.  There I think you can end up with a bunch of 
clients waiting behind an almost full zero fill operation, which pushes up 
the worst-case latency.  I've been able to measure the impact of the 
similar case where zero-filling a brand new segment can impact things; 
this would be much less likely to happen because the timing would have to 
line up just wrong, but I think it's still possible.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Improving compressibility of WAL files

From
Aidan Van Dyk
Date:
* Greg Smith <gsmith@gregsmith.com> [090109 18:39]:

> I was hoping it might fall out of the other work being done in that area, 
> given how much that code is still being poked at right now.  As Hannu  
> pointed out, from a conceptual level you just need to carry along the 
> same information that pg_current_xlog_location() returns to the archiver 
> on all the paths where a segment might end early.

I was (and am) also hoping that something falls out of sync-rep that gives me
better PITR backups (better than a small archive_timeout)... That hope
is what made me abandon this patch after the initial feedback.

> Rather than creating a whole new GUC, it might be possible to turn  
> archive_mode into an enum setting:  off, on, and cleaned as the modes  
> perhaps.  That would avoid making a new setting, with the downside that a 
> bunch of critical code would look less clear than it does with a boolean.

I'm content to wait and see what falls out of sync-rep stuff...

... for now ... 

> I understand the case you've made for why it doesn't matter, and for  
> almost every case you're right.  The situation it may be vulnerable to is 
> where a burst of transactions come in just as the archive timeout expires 
> after minimal WAL activity.  There I think you can end up with a bunch of 
> clients waiting behind an almost full zero fill operation, which pushes 
> up the worst-case latency.  I've been able to measure the impact of the  
> similar case where zero-filling a brand new segment can impact things;  
> this would be much less likely to happen because the timing would have to  
> line up just wrong, but I think it's still possible.

Ya, and it's just one of the many times PG hits these worst-latency
spikes ;-)  Generally, it's *very* good... and once in a while, when all
the stars line up correctly, you get these spikes....

But even with these spikes, it's plenty fast enough for the stuff I've
done...

a.
-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.