Thread: Quick-and-dirty compression for WAL backup blocks

Quick-and-dirty compression for WAL backup blocks

From
Tom Lane
Date:
It seems we are more or less agreed that 32-bit CRC ought to be enough
for WAL; and we also need to make a change to ensure that backup blocks
are positively linked to their parent WAL record, as I noted earlier
today.  So as long as we have to mess with the WAL record format, I was
wondering what else we could get done in the same change.

The TODO item that comes to mind immediately is "Compress WAL entries".
The TODO.detail file for that has a whole lot of ideas of various
(mostly high) levels of complexity, but one thing we could do fairly
trivially is to try to compress the page images that are dumped into WAL
to protect against partial-write problems.  After reviewing the old
discussion I still like the proposal I made:

> ... make the WAL writing logic aware of the layout
> of buffer pages --- specifically, to know that our pages generally
> contain an uninteresting "hole" in the middle, and not write the hole.
> Optimistically this might reduce the WAL data volume by something
> approaching 50%; though pessimistically (if most pages are near full)
> it wouldn't help much.

A more concrete version of this is: examine the page to see if the
pd_lower field is between SizeOfPageHeaderData and BLCKSZ, and if so
whether there is a run of consecutive zero bytes beginning at the
pd_lower position.  Omit any such bytes from what is written to WAL.
(This definition ensures that nothing goes wrong if the page does not
follow the normal page layout conventions: the transformation is
lossless no matter what, since we can always reconstruct the exact page
contents.)  The overhead needed is only 2 bytes to show the number of
bytes removed.
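
For concreteness, a minimal sketch of that check (illustrative only;
the constants and the function name here are stand-ins, not what any
actual patch uses):

#include <stdint.h>

/*
 * Sketch of the "omit the hole" test.  Given the page image and its
 * pd_lower value, return the length of the run of zero bytes starting
 * at pd_lower (0 if the page doesn't qualify) and report where that run
 * starts.  Only bytes verified to be zero are dropped, so the
 * transformation is lossless even for pages that don't follow the
 * standard layout; on restore, the hole is simply refilled with zeroes.
 */
#define BLCKSZ 8192                /* stand-in for the configured value */
#define PAGE_HEADER_SIZE 24        /* stand-in for SizeOfPageHeaderData */

static uint16_t
find_page_hole(const char *page, uint16_t pd_lower, uint16_t *hole_start)
{
    uint16_t    len = 0;

    *hole_start = 0;
    if (pd_lower < PAGE_HEADER_SIZE || pd_lower >= BLCKSZ)
        return 0;                  /* nonstandard page: write it whole */

    while (pd_lower + len < BLCKSZ && page[pd_lower + len] == 0)
        len++;

    *hole_start = pd_lower;
    return len;   /* bytes omitted from WAL; a 2-byte count records this */
}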

The other alternatives that were suggested included running the page
contents through the same compressor used for TOAST, and implementing
a general-purpose run-length compressor that could get rid of runs of
zeroes anywhere on the page.  However, considering that the compression
work has to be done while holding WALInsertLock, it seems to me there
is a strong premium on speed.  I think that lets out the TOAST
compressor, which isn't amazingly speedy.  (Another objection to the
TOAST compressor is that it certainly won't win on already-compressed
toasted data.)  A run-length compressor would be reasonably quick but
I think that the omit-the-middle-hole approach gets most of the possible
win with even less work.  In particular, I think it can be proven that
omit-the-hole will actually require less CPU than now, since counting
zero bytes should be strictly faster than CRC'ing bytes, and we'll be
able to save the CRC work on whatever bytes we omit.
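
To illustrate the CRC point: the omitted bytes drop out of the CRC as
well, so the CRC loop simply runs over the two spans on either side of
the hole.  A rough sketch (crc_accum() is a stand-in for the backend's
CRC32 accumulation, not a real function name):

#include <stdint.h>

#define BLCKSZ 8192

/* Stand-in for the backend's CRC32 accumulation routine. */
extern uint32_t crc_accum(uint32_t crc, const char *data, int len);

/*
 * Illustrative only: CRC a backup block while skipping the hole.  The
 * bytes in [hole_start, hole_start + hole_len) are neither written to
 * WAL nor CRC'd, so scanning them for zeroes is all the work they cost.
 */
static uint32_t
crc_backup_block(const char *page, uint16_t hole_start, uint16_t hole_len,
                 uint32_t crc)
{
    crc = crc_accum(crc, page, hole_start);
    crc = crc_accum(crc, page + hole_start + hole_len,
                    BLCKSZ - (hole_start + hole_len));
    return crc;
}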

Any objections?
        regards, tom lane


Re: Quick-and-dirty compression for WAL backup blocks

From
Simon Riggs
Date:
On Tue, 2005-05-31 at 16:26 -0400, Tom Lane wrote:
> The TODO item that comes to mind immediately is "Compress WAL entries".
> A more concrete version of this is: examine the page to see if the
> pd_lower field is between SizeOfPageHeaderData and BLCKSZ, and if so
> whether there is a run of consecutive zero bytes beginning at the
> pd_lower position.  Omit any such bytes from what is written to WAL.
> (This definition ensures that nothing goes wrong if the page does not
> follow the normal page layout conventions: the transformation is
> lossless no matter what, since we can always reconstruct the exact page
> contents.)  The overhead needed is only 2 bytes to show the number of
> bytes removed.
> 
> The other alternatives that were suggested included running the page
> contents through the same compressor used for TOAST, and implementing
> a general-purpose run-length compressor that could get rid of runs of
> zeroes anywhere on the page.  However, considering that the compression
> work has to be done while holding WALInsertLock, it seems to me there
> is a strong premium on speed.  I think that lets out the TOAST
> compressor, which isn't amazingly speedy.  (Another objection to the
> TOAST compressor is that it certainly won't win on already-compressed
> toasted data.)  A run-length compressor would be reasonably quick but
> I think that the omit-the-middle-hole approach gets most of the possible
> win with even less work.  In particular, I think it can be proven that
> omit-the-hole will actually require less CPU than now, since counting
> zero bytes should be strictly faster than CRC'ing bytes, and we'll be
> able to save the CRC work on whatever bytes we omit.
> 
> Any objections?

None: completely agree with your analysis. Sounds great.

> It seems we are more or less agreed that 32-bit CRC ought to be enough
> for WAL; and we also need to make a change to ensure that backup blocks
> are positively linked to their parent WAL record, as I noted earlier
> today.  So as long as we have to mess with the WAL record format, I was
> wondering what else we could get done in the same change.

Is this a change that would be backpatched as you suggested previously?
It seems a rather large patch to change three things at once. Can the
backpatch wait until 8.1 has gone through beta to allow the changes to
be proven?

Best Regards, Simon Riggs




Re: Quick-and-dirty compression for WAL backup blocks

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Is this a change that would be backpatched as you suggested previously?

I don't think we can backpatch any of these items, since they involve
changes in the on-disk file format.  I was thinking of them as CVS HEAD
changes only.
        regards, tom lane


Re: Quick-and-dirty compression for WAL backup blocks

From
"Mark Cave-Ayland"
Date:
> Date: Tue, 31 May 2005 16:26:20 -0400
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: pgsql-hackers@postgreSQL.org
> Subject: Quick-and-dirty compression for WAL backup blocks
> Message-ID: <23210.1117571180@sss.pgh.pa.us>

(cut)

> ... make the WAL writing logic aware of the layout
> of buffer pages --- specifically, to know that our pages generally
> contain an uninteresting "hole" in the middle, and not write the hole.
> Optimistically this might reduce the WAL data volume by something
> approaching 50%; though pessimistically (if most pages are near full)
> it wouldn't help much.

> A more concrete version of this is: examine the page to see if the
> pd_lower field is between SizeOfPageHeaderData and BLCKSZ, and if so
> whether there is a run of consecutive zero bytes beginning at the
> pd_lower position.  Omit any such bytes from what is written to WAL.
> (This definition ensures that nothing goes wrong if the page does not
> follow the normal page layout conventions: the transformation is
> lossless no matter what, since we can always reconstruct the exact page
> contents.)  The overhead needed is only 2 bytes to show the number of
> bytes removed.

I can't say I'm familiar enough with the exact format of the XLog records
(wasn't sure where to find this), but from what you're saying (and by
definition I suppose) I'm guessing it is just a complete copy of the new
page just before it gets written to the heap?

I also noticed your comment above that mentioned that compression would be
less effective as the pages became more full. Would changing the loading
factor of database pages have an effect here, as I would have thought that
the WAL would be fsync'd more aggressively than the heap?

> A run-length compressor would be reasonably quick but I think that the
> omit-the-middle-hole approach gets most of the possible win with even
> less work.

I can't think that an RLE scheme would be much more expensive than a 'count
the hole' approach, and it could give more benefit, so I wouldn't like to
discount this straight away...

If you do manage to go ahead with the code, I'd be very interested to see
some comparisons in bytes written to XLog for old and new approaches for
some inserts/updates. Perhaps we could ask Mark to run another TPC benchmark
at OSDL when this and the CRC changes have been completed.


Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




Re: Quick-and-dirty compression for WAL backup blocks

From
Tom Lane
Date:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> I also noticed your comment above that mentioned that compression would be
> less effective as the pages became more full. Would changing the loading
> factor of database pages have an effect here, as I would have thought that
> the WAL would be fsync'd more aggressively than the heap?

Yeah, it's predictable that leaving more free space per page would make
the optimization more effective.  I don't have anything quantitative to
say about it.

> If you do manage to go ahead with the code, I'd be very interested to see
> some comparisons in bytes written to XLog for old and new approaches for
> some inserts/updates. Perhaps we could ask Mark to run another TPC benchmark
> at OSDL when this and the CRC changes have been completed.

Good idea.  One point is that the larger the checkpoint interval, the
less this matters at all, because a backup block is only written for
a given page the first time it is modified after a checkpoint.  (This
is one of the big reasons frequent checkpoints are expensive.)  IIRC
Mark was using unrealistically large checkpoint intervals, which would
mean he'd underestimate the effects of optimizing backup blocks.
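
For reference, the rule that makes the checkpoint interval matter can be
sketched like this (simplified types; the backend actually compares
(xlogid, xrecoff) pairs, not plain integers):

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t lsn_sketch_t;     /* stand-in for XLogRecPtr */

/*
 * A full backup block is attached to a WAL record only when the page
 * has not already been dumped since the most recent checkpoint, i.e.
 * when its LSN does not exceed the checkpoint's redo pointer.  Longer
 * checkpoint intervals mean fewer "first modifications" and hence fewer
 * backup blocks for this optimization to act on.
 */
static bool
needs_backup_block(lsn_sketch_t page_lsn, lsn_sketch_t checkpoint_redo)
{
    return page_lsn <= checkpoint_redo;
}
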
        regards, tom lane


Re: Quick-and-dirty compression for WAL backup blocks

From
Alvaro Herrera
Date:
On Wed, Jun 01, 2005 at 10:12:34AM -0400, Tom Lane wrote:
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > I also noticed your comment above that mentioned that compression would be
> > less effective as the pages became more full. Would changing the loading
> > factor of database pages have an effect here, as I would have thought that
> > the WAL would be fsync'd more aggressively than the heap?
> 
> Yeah, it's predictable that leaving more free space per page would make
> the optimization more effective.

But it would also increase the number of pages, so maybe there would
have to be block dumps more often, which may make the whole thing worse
on average.

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"La realidad se compone de muchos sueños, todos ellos diferentes,
pero en cierto aspecto, parecidos..." (Yo, hablando de sueños eróticos)


Re: Quick-and-dirty compression for WAL backup blocks

From
Tom Lane
Date:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
>> A run-length compressor would be reasonably quick but I think that the
>> omit-the-middle-hole approach gets most of the possible win with even
>> less work.

> I can't think that an RLE scheme would be much more expensive than a 'count
> the hole' approach, and it could give more benefit, so I wouldn't like to
> discount this straight away...

RLE would require scanning the whole page with no certainty of win,
whereas count-the-hole is a certain win since you only examine bytes
that are potentially removable from the later CRC calculation.

> If you do manage to go ahead with the code, I'd be very interested to see
> some comparisons in bytes written to XLog for old and new approaches for
> some inserts/updates. Perhaps we could ask Mark to run another TPC benchmark
> at OSDL when this and the CRC changes have been completed.

I've completed a test run for this (it's essentially MySQL's sql-bench
done immediately after initdb).  What I get is:

CVS tip of 6/1: ending WAL offset = 0/A364A780 = 2741282688 bytes written

CVS tip of 6/2: ending WAL offset = 0/8BB091DC = 2343604700 bytes written

or about a 15% savings.  This is with a checkpoint_segments setting of 30.
One can presume that the savings would be larger at smaller checkpoint
intervals and smaller at larger intervals, but I didn't try more than
one set of test conditions.
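
(Spelling out the arithmetic: both end pointers have xlogid 0, so the
hex offsets convert directly to byte counts.)

    0/A364A780:  0xA364A780 = 2741282688 bytes
    0/8BB091DC:  0x8BB091DC = 2343604700 bytes
    difference:  397677988 bytes, i.e. roughly 14.5% less WAL written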

I'd say that's an improvement worth having, especially considering that
it requires no net expenditure of CPU time.  But the table is certainly
still open to discuss more complicated approaches.
        regards, tom lane


Re: Quick-and-dirty compression for WAL backup blocks

From
Junji TERAMOTO
Date:
Hello all,

I am interested in how to "Compress WAL entries".
Then, I study the source now, and read this discussion.

There are some questions.

1.
In XLogInsert(), two kinds of logs are made: a "whole buffer (page)
log" and a "partial buffer log", aren't they?  Is it only the "whole
buffer log" that generates a log with a "hole"?

2.
Tom Lane wrote:
> The overhead needed is only 2 bytes to show the number of
> bytes removed.

In "whole buffer log", there is a page header that includes offset of
"hole" (lower and upper). If we use that information, we don't need
any overhead, do we?

# Sorry for my bad english..

-- 
Junji Teramoto


Re: Quick-and-dirty compression for WAL backup blocks

From
"Mark Cave-Ayland"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 04 June 2005 16:46
> To: Mark Cave-Ayland (External)
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: Quick-and-dirty compression for WAL backup blocks

(cut)

> I've completed a test run for this (it's essentially MySQL's
> sql-bench done immediately after initdb).  What I get is:
>
> CVS tip of 6/1: ending WAL offset = 0/A364A780 = 2741282688
> bytes written
>
> CVS tip of 6/2: ending WAL offset = 0/8BB091DC = 2343604700
> bytes written
>
> or about a 15% savings.  This is with a checkpoint_segments
> setting of 30. One can presume that the savings would be
> larger at smaller checkpoint intervals and smaller at larger
> intervals, but I didn't try more than one set of test conditions.
>
> I'd say that's an improvement worth having, especially
> considering that it requires no net expenditure of CPU time.
> But the table is certainly still open to discuss more
> complicated approaches.


Hi Tom,

Thanks for the numbers. I've just been across to the OSDL STP website and it
appears that the automatic nightly PostgreSQL CVS builds set up by Mark have
been broken since the middle of April :( A brief look shows that compilation
of the PostgreSQL sources is failing with lots of errors and warnings. I
don't think I can justify the time at the moment to go and look at this
myself, however I thought I'd post to -hackers as a heads up in case anyone
else could pick this up.


Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




Re: Quick-and-dirty compression for WAL backup blocks

From
Tom Lane
Date:
Junji TERAMOTO <teramoto.junji@lab.ntt.co.jp> writes:
> In XLogInsert(), two kinds of logs are made: a "whole buffer (page)
> log" and a "partial buffer log", aren't they?  Is it only the "whole
> buffer log" that generates a log with a "hole"?

Right.

> Tom Lane wrote:
>> The overhead needed is only 2 bytes to show the number of
>> bytes removed.

> In "whole buffer log", there is a page header that includes offset of
> "hole" (lower and upper). If we use that information, we don't need
> any overhead, do we?

No, because the WAL code cannot assume that all pages follow the
convention that pd_lower and pd_upper represent the boundaries of
free space.  (As a counterexample: index metapages don't always
do that.)  I think the transformation has to be guaranteed lossless,
which means that at a minimum you'd need to check whether the data
in between pd_lower and pd_upper really is zeroes.  So the irreducible
minimum overhead is 1 bit to tell whether you compressed or not.
Considering alignment requirements, you might as well expend a couple
of bytes and keep it simple.  (In the patch as committed, I ended up
using 4 bytes --- a uint16 hole start and a uint16 hole length ---
because it kept the code simple.  The alignment requirements mean the
extra 2 bytes are usually free anyway.)
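
In outline, the per-backup-block header then carries something like the
following (a sketch only; the real header also identifies the relation
and block number, and the field names may differ):

#include <stdint.h>

/*
 * Sketch of a backup-block header carrying the hole description.
 * hole_length == 0 means nothing was removed, which is how nonstandard
 * pages are handled, so the format stays lossless in every case.
 */
typedef struct BkpBlockSketch
{
    uint16_t    hole_offset;   /* number of page bytes before the hole */
    uint16_t    hole_length;   /* number of bytes removed (0 = no hole) */
    /* BLCKSZ - hole_length bytes of page data follow */
} BkpBlockSketch;
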
        regards, tom lane


Re: Quick-and-dirty compression for WAL backup blocks

From
Heikki Linnakangas
Date:
On Mon, 6 Jun 2005, Tom Lane wrote:

> Junji TERAMOTO <teramoto.junji@lab.ntt.co.jp> writes:
>
>> In "whole buffer log", there is a page header that includes offset of
>> "hole" (lower and upper). If we use that information, we don't need
>> any overhead, do we?
>
> No, because the WAL code cannot assume that all pages follow the
> convention that pd_lower and pd_upper represent the boundaries of
> free space.  (As a counterexample: index metapages don't always
> do that.)  I think the transformation has to be guaranteed lossless,
> which means that at a minimum you'd need to check whether the data
> in between pd_lower and pd_upper really is zeroes.  So the irreducible
> minimum overhead is 1 bit to tell whether you compressed or not.

Vacuum doesn't zero out the free space between lower and upper; it's
just marked as unused, so lossless compression becomes less efficient
on tables that have free space released by vacuum.

How about adding a flag to XLogRecData to indicate if the space between 
pd_lower and pd_upper is meaningful or not? The XLogInsert caller probably 
knows that. That way you could completely skip over the free space if 
it's not meaningful, saving even more cycles.
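
Concretely, that might look something like this (a sketch only; the
flag name is a guess, not necessarily what would be committed):

#include <stdint.h>
#include <stdbool.h>

typedef int Buffer;                /* stand-in for the backend typedef */

/*
 * Sketch of the suggested flag in XLogRecData.  When set, the caller
 * vouches that the page follows the standard layout, so XLogInsert may
 * drop the pd_lower..pd_upper region outright instead of verifying that
 * it is all zeroes.
 */
typedef struct XLogRecDataSketch
{
    char       *data;              /* rmgr data to include in the record */
    uint32_t    len;               /* length of that data */
    Buffer      buffer;            /* buffer to back up, if any */
    bool        page_std;          /* page follows standard layout? */
    struct XLogRecDataSketch *next;    /* next element of the chain */
} XLogRecDataSketch;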

- Heikki


Re: Quick-and-dirty compression for WAL backup blocks

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Vacuum doesn't zero out the free space between lower and upper,

It does now ;-)

> How about adding a flag to XLogRecData to indicate if the space between 
> pd_lower and pd_upper is meaningful or not? The XLogInsert caller probably 
> knows that. That way you could completely skip over the free space if 
> it's not meaningful, saving even more cycles.

Hmm ... that might not be a bad idea.  As far as I can think offhand,
all the XLogInsert callers know very well what type of page they are
working with, so they would always be able to set such a flag correctly.

Would this be institutionalizing a particular approach to data
compression in the XLogInsert API, though?
        regards, tom lane


Re: Quick-and-dirty compression for WAL backup blocks

From
Heikki Linnakangas
Date:
On Mon, 6 Jun 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Vacuum doesn't zero out the free space between lower and upper,
>
> It does now ;-)

Oh :). Does it affect vacuum performance?

>> How about adding a flag to XLogRecData to indicate if the space between
>> pd_lower and pd_upper is meaningful or not? The XLogInsert caller probably
>> knows that. That way you could completely skip over the free space if
>> it's not meaningful, saving even more cycles.
>
> Hmm ... that might not be a bad idea.  As far as I can think offhand,
> all the XLogInsert callers know very well what type of page they are
> working with, so they would always be able to set such a flag correctly.
>
> Would this be institutionalizing a particular approach to data
> compression in the XLogInsert API, though?

The "skip the free space" optimization is still useful and worthwhile 
even if we have a more sophisticated compression method for the 
rest of the page.

- Heikki


Re: Quick-and-dirty compression for WAL backup blocks

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Mon, 6 Jun 2005, Tom Lane wrote:
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>> Vacuum doesn't zero out the free space between lower and upper,
>> 
>> It does now ;-)

> Oh :). Does it affect vacuum performance?

I haven't tried to measure it ... but certainly it's not totally free.
I'd be happy to rip that change out again.

>> Would this be institutionalizing a particular approach to data
>> compression in the XLogInsert API, though?

> The "skip the free space" optimization is still useful and worthwhile 
> even if we have a more sophisticated compression method for the 
> rest of the page.

Good point.  OK, I'm hacking XLOG stuff now anyway so I'll see about
making that happen.
        regards, tom lane


Re: Quick-and-dirty compression for WAL backup blocks

From
Junji TERAMOTO
Date:
Tom Lane wrote:

>>>> In XLogInsert(), two kinds of logs are made: a "whole buffer (page)
>>>> log" and a "partial buffer log", aren't they?  Is it only the
>>>> "whole buffer log" that generates a log with a "hole"?
>>
>> Right.

I see.
I think it is important to reduce the need to write whole pages to WAL
at all (as in the TODO list).
# It seems difficult to do that, though... Compressing WAL is the easier way.

>> No, because the WAL code cannot assume that all pages follow the
>> convention that pd_lower and pd_upper represent the boundaries of
>> free space.  (As a counterexample: index metapages don't always
>> do that.)

Oh, I forgot that.
And I think it is a good idea to modify the XLogInsert API in CVS, too.

-- 
Junji Teramoto


Re: Quick-and-dirty compression for WAL backup blocks

From
"Jim C. Nasby"
Date:
On Sat, Jun 04, 2005 at 11:46:07AM -0400, Tom Lane wrote:
> I've completed a test run for this (it's essentially MySQL's sql-bench
> done immediately after initdb).  What I get is:
> 
> CVS tip of 6/1: ending WAL offset = 0/A364A780 = 2741282688 bytes written
> 
> CVS tip of 6/2: ending WAL offset = 0/8BB091DC = 2343604700 bytes written
> 
> or about a 15% savings.  This is with a checkpoint_segments setting of 30.
> One can presume that the savings would be larger at smaller checkpoint
> intervals and smaller at larger intervals, but I didn't try more than
> one set of test conditions.
> 
> I'd say that's an improvement worth having, especially considering that
> it requires no net expenditure of CPU time.  But the table is certainly
> still open to discuss more complicated approaches.

If it's not hard to hack in as a test, it'd be interesting to see what
additional gains a more aggressive compression algorithm like LZW would get.
CPU is more of a concern in that case, but for databases generating a
lot of WAL it might still be a win.

BTW, is this the thread you referred to? I wasn't able to find it in the
TODO and had to go into the archives to find it...
http://archives.postgresql.org/pgsql-performance/2005-04/msg00264.php
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: Quick-and-dirty compression for WAL backup blocks

From
Rod Taylor
Date:
> > I'd say that's an improvement worth having, especially considering that
> > it requires no net expenditure of CPU time.  But the table is certainly
> > still open to discuss more complicated approaches.
> 
> If it's not hard to hack in as a test, it'd be interesting to see what
> additional gains a more aggressive compression algorithm like LZW would get.
> CPU is more of a concern in that case, but for databases generating a
> lot of WAL it might still be a win.

I generate a fair amount of WAL segments (about 20GB per day), and
have a CPU issue.

I implemented a cronjob which compresses them using gzip on a different
machine.

Any chance we could have an official compression tool which is
independent of the server itself so we can distribute the load a little?

> BTW, is this the thread you referred to? I wasn't able to find it in the
> TODO and had to go into the archives to find it...
> http://archives.postgresql.org/pgsql-performance/2005-04/msg00264.php
-- 



Re: Quick-and-dirty compression for WAL backup blocks

From
Junji TERAMOTO
Date:
Hi all,

I examined the effect of the block-hole compression patch.

I compared the amount of WAL data written by CVS (7/11) and by 8.0.3.
(The amount of WAL data written was measured by the number of
executions of the write() function in XLogWrite().)
I also measured the size of the hole.

Environment;
IBM x206 P4 3.0GHz Mem 4GB
CentOS 4.0 (Linux 2.6.9-5.0.3.ELsmp)

Parameters;
shared_buffers = 65535
checkpoint_segments = 30
default_with_oids = false (8.0.3)
default_with_oids = off   (CVS)

How to exam;
0) initdb --no-locale
1) pgbench -i -s 100 pgbench
2) pgbench -n -c 50 -t 5000 pgbench
3) vacuumdb -d pgbench
4) pgbench -n -c 50 -t 5000 pgbench

Results;
     |    8.0.3    |  CVS(7/11)  |
Exam |  write  |C.P|  write  |C.P|        block-hole (byte)
     |         |   |         |   |   total   | min | max  |   avg
-----+---------+---+---------+---+-----------+-----+------+---------
 1)  |  187505 | 3 |  187373 | 4 |    194056 |  36 | 8124 | 3881.12
 2)  |  509725 | 6 |  513489 | 5 | 115564476 |  12 | 8096 |  347.69
 3)  |  280456 | 2 |  172973 | 2 |  95923360 | 248 | 8156 |  614.08
 4)  |  533971 | 7 |  525135 | 6 | 171147256 |  12 | 8140 |  482.11

C.P = Checkpoint frequency

From these measurements, the patch seems to be effective for VACUUM.
In the other cases, however, the effect was not really visible.



-- 
Junji Teramoto