Thread: Compression of full-page-writes

Compression of full-page-writes

From

Fujii Masao

Date:

30 August 2013, 02:56:00

Hi,

Attached patch adds new GUC parameter 'compress_backup_block'.
When this parameter is enabled, the server just compresses FPW
(full-page-writes) in WAL by using pglz_compress() before inserting it
to the WAL buffers. Then, the compressed FPW is decompressed
in recovery. This is very simple patch.

The purpose of this patch is the reduction of WAL size.
Under heavy write load, the server needs to write a large amount of
WAL and this is likely to be a bottleneck. What's the worse is,
in replication, a large amount of WAL would have harmful effect on
not only WAL writing in the master, but also WAL streaming and
WAL writing in the standby. Also we would need to spend more
money on the storage to store such a large data.
I'd like to alleviate such harmful situations by reducing WAL size.

My idea is very simple, just compress FPW because FPW is
a big part of WAL. I used pglz_compress() as a compression method,
but you might think that other method is better. We can add
something like FPW-compression-hook for that later. The patch
adds new GUC parameter, but I'm thinking to merge it to full_page_writes
parameter to avoid increasing the number of GUC. That is,
I'm thinking to change full_page_writes so that it can accept new value
'compress'.

I measured how much WAL this patch can reduce, by using pgbench.

* Server spec
  CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  Mem: 16GB
  Disk: 500GB SSD Samsung 840

* Benchmark
  pgbench -c 32 -j 4 -T 900 -M prepared
  scaling factor: 100

  checkpoint_segments = 1024
  checkpoint_timeout = 5min
  (every checkpoint during benchmark were triggered by checkpoint_timeout)

* Result
  [tps]
  1386.8 (compress_backup_block = off)
  1627.7 (compress_backup_block = on)

  [the amount of WAL generated during running pgbench]
  4302 MB (compress_backup_block = off)
  1521 MB (compress_backup_block = on)

At least in my test, the patch could reduce the WAL size to one-third!

The patch is WIP yet. But I'd like to hear the opinions about this idea
before completing it, and then add the patch to next CF if okay.

Regards,

--
Fujii Masao

Attachment

compress_fpw_v1.patch

Re: Compression of full-page-writes

From

Satoshi Nagayasu

Date:

30 August 2013, 03:08:00


(2013/08/30 11:55), Fujii Masao wrote:
> Hi,
>
> Attached patch adds new GUC parameter 'compress_backup_block'.
> When this parameter is enabled, the server just compresses FPW
> (full-page-writes) in WAL by using pglz_compress() before inserting it
> to the WAL buffers. Then, the compressed FPW is decompressed
> in recovery. This is very simple patch.
>
> The purpose of this patch is the reduction of WAL size.
> Under heavy write load, the server needs to write a large amount of
> WAL and this is likely to be a bottleneck. What's the worse is,
> in replication, a large amount of WAL would have harmful effect on
> not only WAL writing in the master, but also WAL streaming and
> WAL writing in the standby. Also we would need to spend more
> money on the storage to store such a large data.
> I'd like to alleviate such harmful situations by reducing WAL size.
>
> My idea is very simple, just compress FPW because FPW is
> a big part of WAL. I used pglz_compress() as a compression method,
> but you might think that other method is better. We can add
> something like FPW-compression-hook for that later. The patch
> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
> parameter to avoid increasing the number of GUC. That is,
> I'm thinking to change full_page_writes so that it can accept new value
> 'compress'.
>
> I measured how much WAL this patch can reduce, by using pgbench.
>
> * Server spec
>    CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
>    Mem: 16GB
>    Disk: 500GB SSD Samsung 840
>
> * Benchmark
>    pgbench -c 32 -j 4 -T 900 -M prepared
>    scaling factor: 100
>
>    checkpoint_segments = 1024
>    checkpoint_timeout = 5min
>    (every checkpoint during benchmark were triggered by checkpoint_timeout)

I believe that the amount of backup blocks in WAL files is affected
by how often the checkpoints are occurring, particularly under such
update-intensive workload.

Under your configuration, checkpoint should occur so often.
So, you need to change checkpoint_timeout larger in order to
determine whether the patch is realistic.

Regards,

>
> * Result
>    [tps]
>    1386.8 (compress_backup_block = off)
>    1627.7 (compress_backup_block = on)
>
>    [the amount of WAL generated during running pgbench]
>    4302 MB (compress_backup_block = off)
>    1521 MB (compress_backup_block = on)
>
> At least in my test, the patch could reduce the WAL size to one-third!
>
> The patch is WIP yet. But I'd like to hear the opinions about this idea
> before completing it, and then add the patch to next CF if okay.
>
> Regards,
>
>
>
>

-- 
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

Re: Compression of full-page-writes

From

Satoshi Nagayasu

Date:

30 August 2013, 03:21:16


(2013/08/30 12:07), Satoshi Nagayasu wrote:
>
>
> (2013/08/30 11:55), Fujii Masao wrote:
>> Hi,
>>
>> Attached patch adds new GUC parameter 'compress_backup_block'.
>> When this parameter is enabled, the server just compresses FPW
>> (full-page-writes) in WAL by using pglz_compress() before inserting it
>> to the WAL buffers. Then, the compressed FPW is decompressed
>> in recovery. This is very simple patch.
>>
>> The purpose of this patch is the reduction of WAL size.
>> Under heavy write load, the server needs to write a large amount of
>> WAL and this is likely to be a bottleneck. What's the worse is,
>> in replication, a large amount of WAL would have harmful effect on
>> not only WAL writing in the master, but also WAL streaming and
>> WAL writing in the standby. Also we would need to spend more
>> money on the storage to store such a large data.
>> I'd like to alleviate such harmful situations by reducing WAL size.
>>
>> My idea is very simple, just compress FPW because FPW is
>> a big part of WAL. I used pglz_compress() as a compression method,
>> but you might think that other method is better. We can add
>> something like FPW-compression-hook for that later. The patch
>> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
>> parameter to avoid increasing the number of GUC. That is,
>> I'm thinking to change full_page_writes so that it can accept new value
>> 'compress'.
>>
>> I measured how much WAL this patch can reduce, by using pgbench.
>>
>> * Server spec
>>    CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
>>    Mem: 16GB
>>    Disk: 500GB SSD Samsung 840
>>
>> * Benchmark
>>    pgbench -c 32 -j 4 -T 900 -M prepared
>>    scaling factor: 100
>>
>>    checkpoint_segments = 1024
>>    checkpoint_timeout = 5min
>>    (every checkpoint during benchmark were triggered by
>> checkpoint_timeout)
>
> I believe that the amount of backup blocks in WAL files is affected
> by how often the checkpoints are occurring, particularly under such
> update-intensive workload.
>
> Under your configuration, checkpoint should occur so often.
> So, you need to change checkpoint_timeout larger in order to
> determine whether the patch is realistic.

In fact, the following chart shows that checkpoint_timeout=30min
also reduces WAL size to one-third, compared with 5min timeout,
in the pgbench experimentation.

https://www.oss.ecl.ntt.co.jp/ossc/oss/img/pglesslog_img02.jpg

Regards,

>
> Regards,
>
>>
>> * Result
>>    [tps]
>>    1386.8 (compress_backup_block = off)
>>    1627.7 (compress_backup_block = on)
>>
>>    [the amount of WAL generated during running pgbench]
>>    4302 MB (compress_backup_block = off)
>>    1521 MB (compress_backup_block = on)
>>
>> At least in my test, the patch could reduce the WAL size to one-third!
>>
>> The patch is WIP yet. But I'd like to hear the opinions about this idea
>> before completing it, and then add the patch to next CF if okay.
>>
>> Regards,
>>
>>
>>
>>
>

-- 
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

Re: Compression of full-page-writes

From

Peter Geoghegan

Date:

30 August 2013, 03:43:38

On Thu, Aug 29, 2013 at 7:55 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>   [the amount of WAL generated during running pgbench]
>   4302 MB (compress_backup_block = off)
>   1521 MB (compress_backup_block = on)

Interesting.

I wonder, what is the impact on recovery time under the same
conditions? I suppose that the cost of the random I/O involved would
probably dominate just as with compress_backup_block = off. That said,
you've used an SSD here, so perhaps not.

-- 
Peter Geoghegan

Re: Compression of full-page-writes

From

Amit Kapila

Date:

30 August 2013, 04:44:07

On Fri, Aug 30, 2013 at 8:25 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Hi,
>
> Attached patch adds new GUC parameter 'compress_backup_block'.
> When this parameter is enabled, the server just compresses FPW
> (full-page-writes) in WAL by using pglz_compress() before inserting it
> to the WAL buffers. Then, the compressed FPW is decompressed
> in recovery. This is very simple patch.
>
> The purpose of this patch is the reduction of WAL size.
> Under heavy write load, the server needs to write a large amount of
> WAL and this is likely to be a bottleneck. What's the worse is,
> in replication, a large amount of WAL would have harmful effect on
> not only WAL writing in the master, but also WAL streaming and
> WAL writing in the standby. Also we would need to spend more
> money on the storage to store such a large data.
> I'd like to alleviate such harmful situations by reducing WAL size.
>
> My idea is very simple, just compress FPW because FPW is
> a big part of WAL. I used pglz_compress() as a compression method,
> but you might think that other method is better. We can add
> something like FPW-compression-hook for that later. The patch
> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
> parameter to avoid increasing the number of GUC. That is,
> I'm thinking to change full_page_writes so that it can accept new value
> 'compress'.
>
> I measured how much WAL this patch can reduce, by using pgbench.
>
> * Server spec
>   CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
>   Mem: 16GB
>   Disk: 500GB SSD Samsung 840
>
> * Benchmark
>   pgbench -c 32 -j 4 -T 900 -M prepared
>   scaling factor: 100
>
>   checkpoint_segments = 1024
>   checkpoint_timeout = 5min
>   (every checkpoint during benchmark were triggered by checkpoint_timeout)
>
> * Result
>   [tps]
>   1386.8 (compress_backup_block = off)
>   1627.7 (compress_backup_block = on)
>
>   [the amount of WAL generated during running pgbench]
>   4302 MB (compress_backup_block = off)
>   1521 MB (compress_backup_block = on)

This is really nice data.

I think if you want, you can once try with one of the tests Heikki has
posted for one of my other patch which is here:
http://www.postgresql.org/message-id/51366323.8070606@vmware.com

Also if possible, for with lesser clients (1,2,4) and may be with more
frequency of checkpoint.

This is just to show benefits of this idea with other kind of workload.

I think we can do these tests later as well, I had mentioned because
sometime back (probably 6 months), one of my colleagues have tried
exactly the same idea of using compression method (LZ and few others)
for FPW, but it turned out that even though the WAL size is reduced
but performance went down which is not the case in the data you have
shown even though you have used SSD, might be he has done some mistake
as he was not as experienced, but I think still it's good to check on
various workloads.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

30 August 2013, 05:28:21

(2013/08/30 11:55), Fujii Masao wrote:
> * Benchmark
>    pgbench -c 32 -j 4 -T 900 -M prepared
>    scaling factor: 100
>
>    checkpoint_segments = 1024
>    checkpoint_timeout = 5min
>    (every checkpoint during benchmark were triggered by checkpoint_timeout)
Did you execute munual checkpoint before starting benchmark?
We read only your message, it occuered three times checkpoint during benchmark.
But if you did not executed manual checkpoint, it would be different.

You had better clear this point for more transparent evaluation.

Regards,
-- 
Mitsumasa KONDO
NTT Open Software Center

Re: Compression of full-page-writes

From

Nikhil Sontakke

Date:

30 August 2013, 05:38:10

Hi Fujii-san,

I must be missing something really trivial, but why not try to compress all types of WAL blocks and not just FPW?

Regards,
Nikhils

On Fri, Aug 30, 2013 at 8:25 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Hi,

Attached patch adds new GUC parameter 'compress_backup_block'.
When this parameter is enabled, the server just compresses FPW
(full-page-writes) in WAL by using pglz_compress() before inserting it
to the WAL buffers. Then, the compressed FPW is decompressed
in recovery. This is very simple patch.

The purpose of this patch is the reduction of WAL size.
Under heavy write load, the server needs to write a large amount of
WAL and this is likely to be a bottleneck. What's the worse is,
in replication, a large amount of WAL would have harmful effect on
not only WAL writing in the master, but also WAL streaming and
WAL writing in the standby. Also we would need to spend more
money on the storage to store such a large data.
I'd like to alleviate such harmful situations by reducing WAL size.

My idea is very simple, just compress FPW because FPW is
a big part of WAL. I used pglz_compress() as a compression method,
but you might think that other method is better. We can add
something like FPW-compression-hook for that later. The patch
adds new GUC parameter, but I'm thinking to merge it to full_page_writes
parameter to avoid increasing the number of GUC. That is,
I'm thinking to change full_page_writes so that it can accept new value
'compress'.

I measured how much WAL this patch can reduce, by using pgbench.

* Server spec
CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Mem: 16GB
Disk: 500GB SSD Samsung 840

* Benchmark
pgbench -c 32 -j 4 -T 900 -M prepared
scaling factor: 100

checkpoint_segments = 1024
checkpoint_timeout = 5min
(every checkpoint during benchmark were triggered by checkpoint_timeout)

* Result
[tps]
1386.8 (compress_backup_block = off)
1627.7 (compress_backup_block = on)

[the amount of WAL generated during running pgbench]
4302 MB (compress_backup_block = off)
1521 MB (compress_backup_block = on)

At least in my test, the patch could reduce the WAL size to one-third!

The patch is WIP yet. But I'd like to hear the opinions about this idea
before completing it, and then add the patch to next CF if okay.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Compression of full-page-writes

From

Michael Paquier

Date:

30 August 2013, 05:55:28

On Fri, Aug 30, 2013 at 11:55 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> My idea is very simple, just compress FPW because FPW is
> a big part of WAL. I used pglz_compress() as a compression method,
> but you might think that other method is better. We can add
> something like FPW-compression-hook for that later. The patch
> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
> parameter to avoid increasing the number of GUC. That is,
> I'm thinking to change full_page_writes so that it can accept new value
> 'compress'.
Instead of a generic 'compress', what about using the name of the
compression method as parameter value? Just to keep the door open to
new types of compression methods.

> * Result
>   [tps]
>   1386.8 (compress_backup_block = off)
>   1627.7 (compress_backup_block = on)
>
>   [the amount of WAL generated during running pgbench]
>   4302 MB (compress_backup_block = off)
>   1521 MB (compress_backup_block = on)
>
> At least in my test, the patch could reduce the WAL size to one-third!
Nice numbers! Testing this patch with other benchmarks than pgbench
would be interesting as well.
-- 
Michael

Re: Compression of full-page-writes

From

Fujii Masao

Date:

30 August 2013, 05:55:35

On Fri, Aug 30, 2013 at 12:43 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 29, 2013 at 7:55 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>   [the amount of WAL generated during running pgbench]
>>   4302 MB (compress_backup_block = off)
>>   1521 MB (compress_backup_block = on)
>
> Interesting.
>
> I wonder, what is the impact on recovery time under the same
> conditions?

Will test! I can imagine that the recovery time would be a bit
longer with compress_backup_block=on because compressed
FPW needs to be decompressed.

> I suppose that the cost of the random I/O involved would
> probably dominate just as with compress_backup_block = off. That said,
> you've used an SSD here, so perhaps not.

Oh, maybe my description was confusing. full_page_writes was enabled
while running the benchmark even if compress_backup_block = off.
I've not merged those two parameters yet. So even in
compress_backup_block = off, random I/O would not be increased in recovery.

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Fujii Masao

Date:

30 August 2013, 06:02:14

On Fri, Aug 30, 2013 at 1:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Aug 30, 2013 at 8:25 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Hi,
>>
>> Attached patch adds new GUC parameter 'compress_backup_block'.
>> When this parameter is enabled, the server just compresses FPW
>> (full-page-writes) in WAL by using pglz_compress() before inserting it
>> to the WAL buffers. Then, the compressed FPW is decompressed
>> in recovery. This is very simple patch.
>>
>> The purpose of this patch is the reduction of WAL size.
>> Under heavy write load, the server needs to write a large amount of
>> WAL and this is likely to be a bottleneck. What's the worse is,
>> in replication, a large amount of WAL would have harmful effect on
>> not only WAL writing in the master, but also WAL streaming and
>> WAL writing in the standby. Also we would need to spend more
>> money on the storage to store such a large data.
>> I'd like to alleviate such harmful situations by reducing WAL size.
>>
>> My idea is very simple, just compress FPW because FPW is
>> a big part of WAL. I used pglz_compress() as a compression method,
>> but you might think that other method is better. We can add
>> something like FPW-compression-hook for that later. The patch
>> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
>> parameter to avoid increasing the number of GUC. That is,
>> I'm thinking to change full_page_writes so that it can accept new value
>> 'compress'.
>>
>> I measured how much WAL this patch can reduce, by using pgbench.
>>
>> * Server spec
>>   CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
>>   Mem: 16GB
>>   Disk: 500GB SSD Samsung 840
>>
>> * Benchmark
>>   pgbench -c 32 -j 4 -T 900 -M prepared
>>   scaling factor: 100
>>
>>   checkpoint_segments = 1024
>>   checkpoint_timeout = 5min
>>   (every checkpoint during benchmark were triggered by checkpoint_timeout)
>>
>> * Result
>>   [tps]
>>   1386.8 (compress_backup_block = off)
>>   1627.7 (compress_backup_block = on)
>>
>>   [the amount of WAL generated during running pgbench]
>>   4302 MB (compress_backup_block = off)
>>   1521 MB (compress_backup_block = on)
>
> This is really nice data.
>
> I think if you want, you can once try with one of the tests Heikki has
> posted for one of my other patch which is here:
> http://www.postgresql.org/message-id/51366323.8070606@vmware.com
>
> Also if possible, for with lesser clients (1,2,4) and may be with more
> frequency of checkpoint.
>
> This is just to show benefits of this idea with other kind of workload.

Yep, I will do more tests.

> I think we can do these tests later as well, I had mentioned because
> sometime back (probably 6 months), one of my colleagues have tried
> exactly the same idea of using compression method (LZ and few others)
> for FPW, but it turned out that even though the WAL size is reduced
> but performance went down which is not the case in the data you have
> shown even though you have used SSD, might be he has done some mistake
> as he was not as experienced, but I think still it's good to check on
> various workloads.

I'd appreciate if you test the patch with HDD. Now I have no machine with HDD.

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Fujii Masao

Date:

30 August 2013, 06:03:50

On Fri, Aug 30, 2013 at 2:32 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> (2013/08/30 11:55), Fujii Masao wrote:
>>
>> * Benchmark
>>    pgbench -c 32 -j 4 -T 900 -M prepared
>>    scaling factor: 100
>>
>>    checkpoint_segments = 1024
>>    checkpoint_timeout = 5min
>>    (every checkpoint during benchmark were triggered by
>> checkpoint_timeout)
>
> Did you execute munual checkpoint before starting benchmark?

Yes.

> We read only your message, it occuered three times checkpoint during
> benchmark.
> But if you did not executed manual checkpoint, it would be different.
>
> You had better clear this point for more transparent evaluation.

What I executed was:

-------------------------------------
CHECKPOINT
SELECT pg_current_xlog_location()
pgbench -c 32 -j 4 -T 900 -M prepared -r -P 10
SELECT pg_current_xlog_location()
SELECT pg_xlog_location_diff() -- calculate the diff of the above locations
-------------------------------------

I repeated this several times to eliminate the noise.

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Peter Geoghegan

Date:

30 August 2013, 06:05:49

On Thu, Aug 29, 2013 at 10:55 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I suppose that the cost of the random I/O involved would
>> probably dominate just as with compress_backup_block = off. That said,
>> you've used an SSD here, so perhaps not.
>
> Oh, maybe my description was confusing. full_page_writes was enabled
> while running the benchmark even if compress_backup_block = off.
> I've not merged those two parameters yet. So even in
> compress_backup_block = off, random I/O would not be increased in recovery.

I understood it that way. I just meant that it could be that the
random I/O was so expensive that the additional cost of decompressing
the FPIs looked insignificant in comparison. If that was the case, the
increase in recovery time would be modest.

-- 
Peter Geoghegan

Re: Compression of full-page-writes

From

Fujii Masao

Date:

30 August 2013, 06:15:21

On Fri, Aug 30, 2013 at 2:37 PM, Nikhil Sontakke <nikkhils@gmail.com> wrote:
> Hi Fujii-san,
>
> I must be missing something really trivial, but why not try to compress all
> types of WAL blocks and not just FPW?

The size of non-FPW WAL is small, compared to that of FPW.
I thought that compression of such a small WAL would not have
big effect on the reduction of WAL size. Rather, compression of
every WAL records might cause large performance overhead.

Also, focusing on FPW makes the patch very simple. We can
add the compression of other WAL later if we want.

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Heikki Linnakangas

Date:

30 August 2013, 06:57:20

On 30.08.2013 05:55, Fujii Masao wrote:
> * Result
>    [tps]
>    1386.8 (compress_backup_block = off)
>    1627.7 (compress_backup_block = on)

It would be good to check how much of this effect comes from reducing 
the amount of data that needs to be CRC'd, because there has been some 
talk of replacing the current CRC-32 algorithm with something faster. 
See 
http://www.postgresql.org/message-id/20130829223004.GD4283@awork2.anarazel.de. 
It might even be beneficial to use one routine for full-page-writes, 
which are generally much larger than other WAL records, and another 
routine for smaller records. As long as they both produce the same CRC, 
of course.

Speeding up the CRC calculation obviously won't help with the WAL volume 
per se, ie. you still generate the same amount of WAL that needs to be 
shipped in replication. But then again, if all you want to do is to 
reduce the volume, you could just compress the whole WAL stream.

- Heikki

Re: Compression of full-page-writes

From

Robert Haas

Date:

30 August 2013, 22:34:26

On Thu, Aug 29, 2013 at 10:55 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Attached patch adds new GUC parameter 'compress_backup_block'.

I think this is a great idea.

(This is not to disagree with any of the suggestions made on this
thread for further investigation, all of which I think I basically
agree with.  But I just wanted to voice general support for the
general idea, regardless of what specifically we end up with.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Compression of full-page-writes

From

Fujii Masao

Date:

11 September 2013, 10:39:22

On Fri, Aug 30, 2013 at 11:55 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Hi,
>
> Attached patch adds new GUC parameter 'compress_backup_block'.
> When this parameter is enabled, the server just compresses FPW
> (full-page-writes) in WAL by using pglz_compress() before inserting it
> to the WAL buffers. Then, the compressed FPW is decompressed
> in recovery. This is very simple patch.
>
> The purpose of this patch is the reduction of WAL size.
> Under heavy write load, the server needs to write a large amount of
> WAL and this is likely to be a bottleneck. What's the worse is,
> in replication, a large amount of WAL would have harmful effect on
> not only WAL writing in the master, but also WAL streaming and
> WAL writing in the standby. Also we would need to spend more
> money on the storage to store such a large data.
> I'd like to alleviate such harmful situations by reducing WAL size.
>
> My idea is very simple, just compress FPW because FPW is
> a big part of WAL. I used pglz_compress() as a compression method,
> but you might think that other method is better. We can add
> something like FPW-compression-hook for that later. The patch
> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
> parameter to avoid increasing the number of GUC. That is,
> I'm thinking to change full_page_writes so that it can accept new value
> 'compress'.

Done. Attached is the updated version of the patch.

In this patch, full_page_writes accepts three values: on, compress, and off.
When it's set to compress, the full page image is compressed before it's
inserted into the WAL buffers.

I measured how much this patch affects the performance and the WAL
volume again, and I also measured how much this patch affects the
recovery time.

* Server spec
  CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  Mem: 16GB
  Disk: 500GB SSD Samsung 840

* Benchmark
  pgbench -c 32 -j 4 -T 900 -M prepared
  scaling factor: 100

  checkpoint_segments = 1024
  checkpoint_timeout = 5min
  (every checkpoint during benchmark were triggered by checkpoint_timeout)

* Result
  [tps]
  1344.2 (full_page_writes = on)
  1605.9 (compress)
  1810.1 (off)

  [the amount of WAL generated during running pgbench]
  4422 MB (on)
  1517 MB (compress)
    885 MB (off)

  [time required to replay WAL generated during running pgbench]
  61s (on)                 .... 1209911 transactions were replayed,
recovery speed: 19834.6 transactions/sec
  39s (compress)      .... 1445446 transactions were replayed,
recovery speed: 37062.7 transactions/sec
  37s (off)                 .... 1629235 transactions were replayed,
recovery speed: 44033.3 transactions/sec

When full_page_writes is disabled, the recovery speed is basically very low
because of random I/O. But, ISTM that, since I was using SSD in my box,
the recovery with full_page_writse=off was fastest.

Regards,

--
Fujii Masao

Attachment

compress_fpw_v2.patch

Re: Compression of full-page-writes

From

Andres Freund

Date:

11 September 2013, 10:43:27

On 2013-09-11 19:39:14 +0900, Fujii Masao wrote:
> * Benchmark
>   pgbench -c 32 -j 4 -T 900 -M prepared
>   scaling factor: 100
> 
>   checkpoint_segments = 1024
>   checkpoint_timeout = 5min
>   (every checkpoint during benchmark were triggered by checkpoint_timeout)
> 
> * Result
>   [tps]
>   1344.2 (full_page_writes = on)
>   1605.9 (compress)
>   1810.1 (off)
> 
>   [the amount of WAL generated during running pgbench]
>   4422 MB (on)
>   1517 MB (compress)
>     885 MB (off)
> 
>   [time required to replay WAL generated during running pgbench]
>   61s (on)                 .... 1209911 transactions were replayed,
> recovery speed: 19834.6 transactions/sec
>   39s (compress)      .... 1445446 transactions were replayed,
> recovery speed: 37062.7 transactions/sec
>   37s (off)                 .... 1629235 transactions were replayed,
> recovery speed: 44033.3 transactions/sec

ISTM for those benchmarks you should use an absolute number of
transactions, not one based on elapsed time. Otherwise the comparison
isn't really meaningful.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

Fujii Masao

Date:

30 September 2013, 03:49:50

On Wed, Sep 11, 2013 at 7:39 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Aug 30, 2013 at 11:55 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Hi,
>>
>> Attached patch adds new GUC parameter 'compress_backup_block'.
>> When this parameter is enabled, the server just compresses FPW
>> (full-page-writes) in WAL by using pglz_compress() before inserting it
>> to the WAL buffers. Then, the compressed FPW is decompressed
>> in recovery. This is very simple patch.
>>
>> The purpose of this patch is the reduction of WAL size.
>> Under heavy write load, the server needs to write a large amount of
>> WAL and this is likely to be a bottleneck. What's the worse is,
>> in replication, a large amount of WAL would have harmful effect on
>> not only WAL writing in the master, but also WAL streaming and
>> WAL writing in the standby. Also we would need to spend more
>> money on the storage to store such a large data.
>> I'd like to alleviate such harmful situations by reducing WAL size.
>>
>> My idea is very simple, just compress FPW because FPW is
>> a big part of WAL. I used pglz_compress() as a compression method,
>> but you might think that other method is better. We can add
>> something like FPW-compression-hook for that later. The patch
>> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
>> parameter to avoid increasing the number of GUC. That is,
>> I'm thinking to change full_page_writes so that it can accept new value
>> 'compress'.
>
> Done. Attached is the updated version of the patch.
>
> In this patch, full_page_writes accepts three values: on, compress, and off.
> When it's set to compress, the full page image is compressed before it's
> inserted into the WAL buffers.
>
> I measured how much this patch affects the performance and the WAL
> volume again, and I also measured how much this patch affects the
> recovery time.
>
> * Server spec
>   CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
>   Mem: 16GB
>   Disk: 500GB SSD Samsung 840
>
> * Benchmark
>   pgbench -c 32 -j 4 -T 900 -M prepared
>   scaling factor: 100
>
>   checkpoint_segments = 1024
>   checkpoint_timeout = 5min
>   (every checkpoint during benchmark were triggered by checkpoint_timeout)
>
> * Result
>   [tps]
>   1344.2 (full_page_writes = on)
>   1605.9 (compress)
>   1810.1 (off)
>
>   [the amount of WAL generated during running pgbench]
>   4422 MB (on)
>   1517 MB (compress)
>     885 MB (off)

On second thought, the patch could compress WAL very much because I
used pgbench.
Most of data in pgbench are pgbench_accounts table's "filler" columns, i.e.,
blank-padded empty strings. So, the compression ratio of WAL was very high.

I will do the same measurement by using another benchmark.

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

30 September 2013, 04:22:52

Hi Fujii-san,

(2013/09/30 12:49), Fujii Masao wrote:> On second thought, the patch could compress WAL very much because I used
pgbench.
> I will do the same measurement by using another benchmark.
If you hope, I can test this patch in DBT-2 benchmark in end of this week.
I will use under following test server.

* Test server  Server: HP Proliant DL360 G7  CPU:    Xeon E5640 2.66GHz (1P/4C)  Memory: 18GB(PC3-10600R-9)  Disk:
146GB(15k)*4RAID1+0  RAID controller: P410i/256MB
 

This is PG-REX test server as you know.

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center

Re: Compression of full-page-writes

From

Fujii Masao

Date:

30 September 2013, 04:34:20

On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> Hi Fujii-san,
>
>
> (2013/09/30 12:49), Fujii Masao wrote:
>> On second thought, the patch could compress WAL very much because I used
>> pgbench.
>>
>> I will do the same measurement by using another benchmark.
>
> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
> I will use under following test server.
>
> * Test server
>   Server: HP Proliant DL360 G7
>   CPU:    Xeon E5640 2.66GHz (1P/4C)
>   Memory: 18GB(PC3-10600R-9)
>   Disk:   146GB(15k)*4 RAID1+0
>   RAID controller: P410i/256MB

Yep, please! It's really helpful!

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Amit Kapila

Date:

30 September 2013, 04:55:53

On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>> Hi Fujii-san,
>>
>>
>> (2013/09/30 12:49), Fujii Masao wrote:
>>> On second thought, the patch could compress WAL very much because I used
>>> pgbench.
>>>
>>> I will do the same measurement by using another benchmark.
>>
>> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
>> I will use under following test server.
>>
>> * Test server
>>   Server: HP Proliant DL360 G7
>>   CPU:    Xeon E5640 2.66GHz (1P/4C)
>>   Memory: 18GB(PC3-10600R-9)
>>   Disk:   146GB(15k)*4 RAID1+0
>>   RAID controller: P410i/256MB
>
> Yep, please! It's really helpful!

I think it will be useful if you can get the data for 1 and 2 threads
(may be with pgbench itself) as well, because the WAL reduction is
almost sure, but the only thing is that it should not dip tps in some
of the scenarios.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

30 September 2013, 10:06:43

(2013/09/30 13:55), Amit Kapila wrote:
> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Yep, please! It's really helpful!
OK! I test with single instance and synchronous replication constitution.

By the way, you posted patch which is sync_file_range() WAL writing method in 3 
years ago. I think it is also good for performance. As the reason, I read 
sync_file_range() and fdatasync() in latest linux kernel code(3.9.11), 
fdatasync() writes in dirty buffers of the whole file, on the other hand, 
sync_file_range() writes in partial dirty buffers. In more detail, these 
functions use the same function in kernel source code, fdatasync() is 
vfs_fsync_range(file, 0, LLONG_MAX, 1), and sync_file_range() is 
vfs_fsync_range(file, offset, amount, 1).
It is obvious that which is more efficiently in WAL writing.

You had better confirm it in linux kernel's git. I think your conviction will be 
more deeply.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/sync.c?id=refs/tags/v3.11.2

> I think it will be useful if you can get the data for 1 and 2 threads
> (may be with pgbench itself) as well, because the WAL reduction is
> almost sure, but the only thing is that it should not dip tps in some
> of the scenarios.
That's right. I also want to know about this patch in MD environment, because
MD has strong point in sequential write which like WAL writing.

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center

Re: Compression of full-page-writes

From

Fujii Masao

Date:

04 October 2013, 05:19:33

On Mon, Sep 30, 2013 at 1:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
>> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>>> Hi Fujii-san,
>>>
>>>
>>> (2013/09/30 12:49), Fujii Masao wrote:
>>>> On second thought, the patch could compress WAL very much because I used
>>>> pgbench.
>>>>
>>>> I will do the same measurement by using another benchmark.
>>>
>>> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
>>> I will use under following test server.
>>>
>>> * Test server
>>>   Server: HP Proliant DL360 G7
>>>   CPU:    Xeon E5640 2.66GHz (1P/4C)
>>>   Memory: 18GB(PC3-10600R-9)
>>>   Disk:   146GB(15k)*4 RAID1+0
>>>   RAID controller: P410i/256MB
>>
>> Yep, please! It's really helpful!
>
> I think it will be useful if you can get the data for 1 and 2 threads
> (may be with pgbench itself) as well, because the WAL reduction is
> almost sure, but the only thing is that it should not dip tps in some
> of the scenarios.

Here is the measurement result of pgbench with 1 thread.

scaling factor: 100
query mode: prepared
number of clients: 1
number of threads: 1
duration: 900 s

WAL Volume
- 1344 MB (full_page_writes = on)
-   349 MB (compress)
-     78 MB (off)

TPS
117.369221 (on)
143.908024 (compress)
163.722063 (off)

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Amit Kapila

Date:

05 October 2013, 11:42:19

On Fri, Oct 4, 2013 at 10:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Sep 30, 2013 at 1:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
>>> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>>>> Hi Fujii-san,
>>>>
>>>>
>>>> (2013/09/30 12:49), Fujii Masao wrote:
>>>>> On second thought, the patch could compress WAL very much because I used
>>>>> pgbench.
>>>>>
>>>>> I will do the same measurement by using another benchmark.
>>>>
>>>> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
>>>> I will use under following test server.
>>>>
>>>> * Test server
>>>>   Server: HP Proliant DL360 G7
>>>>   CPU:    Xeon E5640 2.66GHz (1P/4C)
>>>>   Memory: 18GB(PC3-10600R-9)
>>>>   Disk:   146GB(15k)*4 RAID1+0
>>>>   RAID controller: P410i/256MB
>>>
>>> Yep, please! It's really helpful!
>>
>> I think it will be useful if you can get the data for 1 and 2 threads
>> (may be with pgbench itself) as well, because the WAL reduction is
>> almost sure, but the only thing is that it should not dip tps in some
>> of the scenarios.
>
> Here is the measurement result of pgbench with 1 thread.
>
> scaling factor: 100
> query mode: prepared
> number of clients: 1
> number of threads: 1
> duration: 900 s
>
> WAL Volume
> - 1344 MB (full_page_writes = on)
> -   349 MB (compress)
> -     78 MB (off)
>
> TPS
> 117.369221 (on)
> 143.908024 (compress)
> 163.722063 (off)

This data is good.
I will check if with the help of my old colleagues, I can get the
performance data on m/c where we have tried similar idea.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

Haribabu kommi

Date:

08 October 2013, 08:34:19

On 05 October 2013 17:12 Amit Kapila wrote:
>On Fri, Oct 4, 2013 at 10:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Mon, Sep 30, 2013 at 1:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
>>>> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>>>>> Hi Fujii-san,
>>>>>
>>>>>
>>>>> (2013/09/30 12:49), Fujii Masao wrote:
>>>>>> On second thought, the patch could compress WAL very much because
>>>>>> I used pgbench.
>>>>>>
>>>>>> I will do the same measurement by using another benchmark.
>>>>>
>>>>> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
>>>>> I will use under following test server.
>>>>>
>>>>> * Test server
>>>>>   Server: HP Proliant DL360 G7
>>>>>   CPU:    Xeon E5640 2.66GHz (1P/4C)
>>>>>   Memory: 18GB(PC3-10600R-9)
>>>>>   Disk:   146GB(15k)*4 RAID1+0
>>>>>   RAID controller: P410i/256MB
>>>>
>>>> Yep, please! It's really helpful!
>>>
>>> I think it will be useful if you can get the data for 1 and 2 threads
>>> (may be with pgbench itself) as well, because the WAL reduction is
>>> almost sure, but the only thing is that it should not dip tps in some
>>> of the scenarios.
>>
>> Here is the measurement result of pgbench with 1 thread.
>>
>> scaling factor: 100
>> query mode: prepared
>> number of clients: 1
>> number of threads: 1
>> duration: 900 s
>>
>> WAL Volume
>> - 1344 MB (full_page_writes = on)
>> -   349 MB (compress)
>> -     78 MB (off)
>>
>> TPS
>> 117.369221 (on)
>> 143.908024 (compress)
>> 163.722063 (off)

>This data is good.
>I will check if with the help of my old colleagues, I can get the performance data on m/c where we have tried similar
idea.

                        Thread-1                    Threads-2
                Head code        FPW compress    Head code        FPW compress
Pgbench-org 5min        1011(0.96GB)    815(0.20GB)        2083(1.24GB)    1843(0.40GB)
Pgbench-1000 5min        958(1.16GB)        778(0.24GB)        1937(2.80GB)    1659(0.73GB)
Pgbench-org 15min        1065(1.43GB)    983(0.56GB)        2094(1.93GB)    2025(1.09GB)
Pgbench-1000 15min    1020(3.70GB)    898(1.05GB)        1383(5.31GB)    1908(2.49GB)

Pgbench-org - original pgbench
Pgbench-1000 - changed pgbench with a record size of 1000.
5 min - pgbench test carried out for 5 min.
15 min - pgbench test carried out for 15 min.

The checkpoint_timeout and checkpoint_segments are increased to make sure no checkpoint happens during the test run.

>From the above readings it is observed that,
1. There a performance dip in one or two threads test, the amount of dip reduces with the test run time.
2. For two threads pgbench-1000 record size test, the fpw compress performance is good in 15min run.
3. More than 50% WAL reduction in all scenarios.

All these readings are measured with pgbench query mode as simple.
Please find the attached sheet for more details regarding machine and test configuration.


Regards,
Hari babu.

Attachment

compress_fpw.htm

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

08 October 2013, 09:46:33

(2013/10/08 17:33), Haribabu kommi wrote:
> The checkpoint_timeout and checkpoint_segments are increased to make sure no checkpoint happens during the test run.
Your setting is easy occurred checkpoint in checkpoint_segments = 256. I don't 
know number of disks in your test server, in my test server which has 4 magnetic 
disk(1.5k rpm), postgres generates 50 - 100 WALs per minutes.

And I cannot understand your setting which is sync_commit = off. This setting 
tend to cause cpu bottle-neck and data-loss. It is not general in database usage.
Therefore, your test is not fair comparison for Fujii's patch.

Going back to my DBT-2 benchmark, I have not got good performance (almost same 
performance).  So I am checking hunk, my setting, or something wrong in Fujii's 
patch now. I am going to try to send test result tonight.

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center

Re: Compression of full-page-writes

From

Andres Freund

Date:

08 October 2013, 09:49:21

On 2013-09-11 12:43:21 +0200, Andres Freund wrote:
> On 2013-09-11 19:39:14 +0900, Fujii Masao wrote:
> > * Benchmark
> >   pgbench -c 32 -j 4 -T 900 -M prepared
> >   scaling factor: 100
> > 
> >   checkpoint_segments = 1024
> >   checkpoint_timeout = 5min
> >   (every checkpoint during benchmark were triggered by checkpoint_timeout)
> > 
> > * Result
> >   [tps]
> >   1344.2 (full_page_writes = on)
> >   1605.9 (compress)
> >   1810.1 (off)
> > 
> >   [the amount of WAL generated during running pgbench]
> >   4422 MB (on)
> >   1517 MB (compress)
> >     885 MB (off)
> > 
> >   [time required to replay WAL generated during running pgbench]
> >   61s (on)                 .... 1209911 transactions were replayed,
> > recovery speed: 19834.6 transactions/sec
> >   39s (compress)      .... 1445446 transactions were replayed,
> > recovery speed: 37062.7 transactions/sec
> >   37s (off)                 .... 1629235 transactions were replayed,
> > recovery speed: 44033.3 transactions/sec
> 
> ISTM for those benchmarks you should use an absolute number of
> transactions, not one based on elapsed time. Otherwise the comparison
> isn't really meaningful.

I really think we need to see recovery time benchmarks with a constant
amount of transactions to judge this properly.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

Haribabu kommi

Date:

08 October 2013, 11:13:27

On 08 October 2013 15:22 KONDO Mitsumasa wrote:
> (2013/10/08 17:33), Haribabu kommi wrote:
>> The checkpoint_timeout and checkpoint_segments are increased to make sure no checkpoint happens during the test
run.

>Your setting is easy occurred checkpoint in checkpoint_segments = 256. I don't know number of disks in your test
server,in my test server which has 4 magnetic disk(1.5k rpm), postgres generates 50 - 100 WALs per minutes.

A manual checkpoint is executed before starting of the test and verified as no checkpoint happened during the run by
increasingthe "checkpoint_warning".

>And I cannot understand your setting which is sync_commit = off. This setting tend to cause cpu bottle-neck and
data-loss.It is not general in database usage.

Therefore, your test is not fair comparison for Fujii's patch.

I chosen the sync_commit=off mode because it generates more tps, thus it increases the volume of WAL.
I will test with sync_commit=on mode and provide the test results.

Regards,
Hari babu.

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

08 October 2013, 13:02:49

Hi,

I tested dbt-2 benchmark in single instance and synchronous replication.
Unfortunately, my benchmark results were not seen many differences...

* Test server
    Server: HP Proliant DL360 G7
    CPU:    Xeon E5640 2.66GHz (1P/4C)
    Memory: 18GB(PC3-10600R-9)
    Disk:   146GB(15k)*4 RAID1+0
    RAID controller: P410i/256MB

* Result
** Single instance**
             | NOTPM     | 90%tile     | Average | S.Deviation
------------+-----------+-------------+---------+-------------
no-patched  | 3322.93   | 20.469071   | 5.882   | 10.478
patched     | 3315.42   | 19.086105   | 5.669   | 9.108


** Synchronous Replication **
             | NOTPM     | 90%tile     | Average | S.Deviation
------------+-----------+-------------+---------+-------------
no-patched  | 3275.55   | 21.332866   | 6.072   | 9.882
patched     | 3318.82   | 18.141807   | 5.757   | 9.829

** Detail of result
http://pgstatsinfo.projects.pgfoundry.org/DBT-2_Fujii_patch/


I set full_page_write = compress with Fujii's patch in DBT-2. But it does not
seems to effect for eleminating WAL files. I will try to DBT-2 benchmark more
once, and try to normal pgbench in my test server.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

08 October 2013, 13:06:51

(2013/10/08 20:13), Haribabu kommi wrote:
> I chosen the sync_commit=off mode because it generates more tps, thus it increases the volume of WAL.
I did not think to there. Sorry...

> I will test with sync_commit=on mode and provide the test results.
OK. Thanks!

-- 
Mitsumasa KONDO
NTT Open Source Software Center

Re: Compression of full-page-writes

From

Haribabu kommi

Date:

09 October 2013, 04:36:19

On 08 October 2013 18:42 KONDO Mitsumasa wrote:
>(2013/10/08 20:13), Haribabu kommi wrote:
>> I will test with sync_commit=on mode and provide the test results.
>OK. Thanks!

Pgbench test results with synchronous_commit mode as on.

                        Thread-1                    Threads-2
                Head code        FPW compress    Head code        FPW compress
Pgbench-org 5min        138(0.24GB)        131(0.04GB)        160(0.28GB)        163(0.05GB)
Pgbench-1000 5min        140(0.29GB)        128(0.03GB)        160(0.33GB)        162(0.02GB)
Pgbench-org 15min        141(0.59GB)        136(0.12GB)        160(0.65GB)        162(0.14GB)
Pgbench-1000 15min    138(0.81GB)        134(0.11GB)        159(0.92GB)        162(0.18GB)

Pgbench-org - original pgbench
Pgbench-1000 - changed pgbench with a record size of 1000.
5 min - pgbench test carried out for 5 min.
15 min - pgbench test carried out for 15 min.

From the above readings it is observed that,
1. There a performance dip in one thread test, the amount of dip reduces with the test run time.
2. More than 75% WAL reduction in all scenarios.

Please find the attached sheet for more details regarding machine and test configuration

Regards,
Hari babu.

Attachment

compress_fpw_sync_on.htm

Re: Compression of full-page-writes

From

Dimitri Fontaine

Date:

10 October 2013, 16:22:17

Hi,

I did a partial review of this patch, wherein I focused on the patch and
the code itself, as I saw other contributors already did some testing on
it, so that we know it applies cleanly and work to some good extend.

Fujii Masao <masao.fujii@gmail.com> writes:
> In this patch, full_page_writes accepts three values: on, compress, and off.
> When it's set to compress, the full page image is compressed before it's
> inserted into the WAL buffers.

Code review :

In full_page_writes_str() why are you returning "unrecognized" rather
than doing an ELOG(ERROR, …) for this unexpected situation?

The code switches to compression (or trying to) when the following
condition is met:

+         if (fpw <= FULL_PAGE_WRITES_COMPRESS)
+         {
+             rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length, &(rdt->len));

We have

+ typedef enum FullPageWritesLevel
+ {
+     FULL_PAGE_WRITES_OFF = 0,
+     FULL_PAGE_WRITES_COMPRESS,
+     FULL_PAGE_WRITES_ON
+ } FullPageWritesLevel;

+ #define FullPageWritesIsNeeded(fpw)    (fpw >= FULL_PAGE_WRITES_COMPRESS)

I don't much like using the <= test against and ENUM and I'm not sure I
understand the intention you have here. It somehow looks like a typo and
disagrees with the macro. What about using the FullPageWritesIsNeeded
macro, and maybe rewriting the macro as

#define FullPageWritesIsNeeded(fpw) \  (fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_ON)

Also, having "on" imply "compress" is a little funny to me. Maybe we
should just finish our testing and be happy to always compress the full
page writes. What would the downside be exactly (on buzy IO system
writing less data even if needing more CPU will be the right trade-off).

I like that you're checking the savings of the compressed data with
respect to the uncompressed data and cancel the compression if there's
no gain. I wonder if your test accounts for enough padding and headers
though given the results we saw in other tests made in this thread.

Why do we have both the static function full_page_writes_str() and the
macro FullPageWritesStr, with two different implementations issuing
either "true" and "false" or "on" and "off"?

!     unsigned    hole_offset:15,    /* number of bytes before "hole" */
!                 flags:2,        /* state of a backup block, see below */
!                 hole_length:15;    /* number of bytes in "hole" */

I don't understand that. I wanted to use that patch as a leverage to
smoothly discover the internals of our WAL system but won't have the
time to do that here. That said, I don't even know that C syntax.

+ #define BKPBLOCK_UNCOMPRESSED    0    /* uncompressed */
+ #define BKPBLOCK_COMPRESSED        1    /* comperssed */

There's a typo in the comment above.

>   [time required to replay WAL generated during running pgbench]
>   61s (on)                 .... 1209911 transactions were replayed,
> recovery speed: 19834.6 transactions/sec
>   39s (compress)      .... 1445446 transactions were replayed,
> recovery speed: 37062.7 transactions/sec
>   37s (off)                 .... 1629235 transactions were replayed,
> recovery speed: 44033.3 transactions/sec

How did you get those numbers ? pg_basebackup before the test and
archiving, then a PITR maybe? Is it possible to do the same test with
the same number of transactions to replay, I guess using the -t
parameter rather than the -T one for this testing.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Compression of full-page-writes

From

Fujii Masao

Date:

10 October 2013, 17:32:40

On Tue, Oct 8, 2013 at 10:07 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> Hi,
>
> I tested dbt-2 benchmark in single instance and synchronous replication.

Thanks!

> Unfortunately, my benchmark results were not seen many differences...
>
>
> * Test server
>    Server: HP Proliant DL360 G7
>    CPU:    Xeon E5640 2.66GHz (1P/4C)
>    Memory: 18GB(PC3-10600R-9)
>    Disk:   146GB(15k)*4 RAID1+0
>    RAID controller: P410i/256MB
>
> * Result
> ** Single instance**
>             | NOTPM     | 90%tile     | Average | S.Deviation
> ------------+-----------+-------------+---------+-------------
> no-patched  | 3322.93   | 20.469071   | 5.882   | 10.478
> patched     | 3315.42   | 19.086105   | 5.669   | 9.108
>
>
> ** Synchronous Replication **
>             | NOTPM     | 90%tile     | Average | S.Deviation
> ------------+-----------+-------------+---------+-------------
> no-patched  | 3275.55   | 21.332866   | 6.072   | 9.882
> patched     | 3318.82   | 18.141807   | 5.757   | 9.829
>
> ** Detail of result
> http://pgstatsinfo.projects.pgfoundry.org/DBT-2_Fujii_patch/
>
>
> I set full_page_write = compress with Fujii's patch in DBT-2. But it does
> not
> seems to effect for eleminating WAL files.

Could you let me know how much WAL records were generated
during each benchmark?

I think that this benchmark result clearly means that the patch
has only limited effects in the reduction of WAL volume and
the performance improvement unless the database contains
highly-compressible data like pgbench_accounts.filler. But if
we can use other compression algorithm, maybe we can reduce
WAL volume very much. I'm not sure what algorithm is good
for WAL compression, though.

It might be better to introduce the hook for compression of FPW
so that users can freely use their compression module, rather
than just using pglz_compress(). Thought?

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Fujii Masao

Date:

10 October 2013, 17:36:15

On Wed, Oct 9, 2013 at 1:35 PM, Haribabu kommi
<haribabu.kommi@huawei.com> wrote:
> On 08 October 2013 18:42 KONDO Mitsumasa wrote:
>>(2013/10/08 20:13), Haribabu kommi wrote:
>>> I will test with sync_commit=on mode and provide the test results.
>>OK. Thanks!
>
> Pgbench test results with synchronous_commit mode as on.

Thanks!

>                                                 Thread-1                                        Threads-2
>                                 Head code               FPW compress    Head code               FPW compress
> Pgbench-org 5min                138(0.24GB)             131(0.04GB)             160(0.28GB)             163(0.05GB)
> Pgbench-1000 5min               140(0.29GB)             128(0.03GB)             160(0.33GB)             162(0.02GB)
> Pgbench-org 15min               141(0.59GB)             136(0.12GB)             160(0.65GB)             162(0.14GB)
> Pgbench-1000 15min      138(0.81GB)             134(0.11GB)             159(0.92GB)             162(0.18GB)
>
> Pgbench-org - original pgbench
> Pgbench-1000 - changed pgbench with a record size of 1000.

This means that you changed the data type of pgbench_accounts.filler
to char(1000)?

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Fujii Masao

Date:

10 October 2013, 18:44:07

On Fri, Oct 11, 2013 at 1:20 AM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:
> Hi,
>
> I did a partial review of this patch, wherein I focused on the patch and
> the code itself, as I saw other contributors already did some testing on
> it, so that we know it applies cleanly and work to some good extend.

Thanks a lot!

> In full_page_writes_str() why are you returning "unrecognized" rather
> than doing an ELOG(ERROR, …) for this unexpected situation?

It's because the similar functions 'wal_level_str' and 'dbState' also return
'unrecognized' in the unexpected situation. I just implemented
full_page_writes_str()
in the same manner.

If we do an elog(ERROR) in that case, pg_xlogdump would fail to dump
the 'broken' (i.e., unrecognized fpw is set) WAL file. I think that some
users want to use pg_xlogdump to investigate the broken WAL file, so
doing an elog(ERROR) seems not good to me.

> The code switches to compression (or trying to) when the following
> condition is met:
>
> +               if (fpw <= FULL_PAGE_WRITES_COMPRESS)
> +               {
> +                       rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length, &(rdt->len));
>
> We have
>
> + typedef enum FullPageWritesLevel
> + {
> +       FULL_PAGE_WRITES_OFF = 0,
> +       FULL_PAGE_WRITES_COMPRESS,
> +       FULL_PAGE_WRITES_ON
> + } FullPageWritesLevel;
>
> + #define FullPageWritesIsNeeded(fpw)   (fpw >= FULL_PAGE_WRITES_COMPRESS)
>
> I don't much like using the <= test against and ENUM and I'm not sure I
> understand the intention you have here. It somehow looks like a typo and
> disagrees with the macro.

I thought that FPW should be compressed only when full_page_writes is
set to 'compress' or 'off'. That is, 'off' implies a compression. When it's set
to 'off', FPW is basically not generated, so there is no need to call
CompressBackupBlock() in that case. But only during online base backup,
FPW is forcibly generated even when it's set to 'off'. So I used the check
"fpw <= FULL_PAGE_WRITES_COMPRESS" there.

> What about using the FullPageWritesIsNeeded
> macro, and maybe rewriting the macro as
>
> #define FullPageWritesIsNeeded(fpw) \
>    (fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_ON)

I'm OK to change the macro so that the <= test is not used.

> Also, having "on" imply "compress" is a little funny to me. Maybe we
> should just finish our testing and be happy to always compress the full
> page writes. What would the downside be exactly (on buzy IO system
> writing less data even if needing more CPU will be the right trade-off).

"on" doesn't imply "compress". When full_page_writes is set to "on",
FPW is not compressed at all.

> I like that you're checking the savings of the compressed data with
> respect to the uncompressed data and cancel the compression if there's
> no gain. I wonder if your test accounts for enough padding and headers
> though given the results we saw in other tests made in this thread.

I'm afraid that the patch has only limited effects in WAL reduction and
performance improvement unless the database contains highly-compressible
data like large blank characters column. It really depends on the contents
of the database. So, obviously FPW compression should not be the default.
Maybe we can treat it as just tuning knob.

> Why do we have both the static function full_page_writes_str() and the
> macro FullPageWritesStr, with two different implementations issuing
> either "true" and "false" or "on" and "off"?

First I was thinking to use "on" and "off" because they are often used
as the setting value of boolean GUC. But unfortunately the existing
pg_xlogdump uses "true" and "false" to show the value of full_page_writes
in WAL. To avoid breaking the backward compatibility, I implmented
the "true/false" version of function. I'm really not sure how many people
want such a compatibility of pg_xlogdump, though.

> !       unsigned        hole_offset:15, /* number of bytes before "hole" */
> !                               flags:2,                /* state of a backup block, see below */
> !                               hole_length:15; /* number of bytes in "hole" */
>
> I don't understand that. I wanted to use that patch as a leverage to
> smoothly discover the internals of our WAL system but won't have the
> time to do that here.

We need the flag indicating whether each FPW is compressed or not.
If no such a flag exists in WAL, the standby cannot determine whether
it should decompress each FPW or not, and then cannot replay
the WAL containing FPW properly. That is, I just used a 'space' in
the header of FPW to have such a flag.

> That said, I don't even know that C syntax.

The struct 'ItemIdData' uses the same C syntax.

> + #define BKPBLOCK_UNCOMPRESSED 0       /* uncompressed */
> + #define BKPBLOCK_COMPRESSED           1       /* comperssed */
>
> There's a typo in the comment above.

Yep.

>>   [time required to replay WAL generated during running pgbench]
>>   61s (on)                 .... 1209911 transactions were replayed,
>> recovery speed: 19834.6 transactions/sec
>>   39s (compress)      .... 1445446 transactions were replayed,
>> recovery speed: 37062.7 transactions/sec
>>   37s (off)                 .... 1629235 transactions were replayed,
>> recovery speed: 44033.3 transactions/sec
>
> How did you get those numbers ? pg_basebackup before the test and
> archiving, then a PITR maybe? Is it possible to do the same test with
> the same number of transactions to replay, I guess using the -t
> parameter rather than the -T one for this testing.

Sure. To be honest, when I received the same request from Andres,
I did that benchmark. But unfortunately because of machine trouble,
I could not report it, yet. Will do that again.

Regards,

--
Fujii Masao

Re: Compression of full-page-writes

From

Andres Freund

Date:

10 October 2013, 23:35:17

Hi,
On 2013-10-11 03:44:01 +0900, Fujii Masao wrote:
> I'm afraid that the patch has only limited effects in WAL reduction and
> performance improvement unless the database contains highly-compressible
> data like large blank characters column. It really depends on the contents
> of the database. So, obviously FPW compression should not be the default.
> Maybe we can treat it as just tuning knob.

Have you tried using lz4 (or snappy) instead of pglz? There's a patch
adding it to pg in
http://archives.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de 

If this really is only a benefit in scenarios with lots of such data, I
have to say I have my doubts about the benefits of the patch.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

Fujii Masao

Date:

11 October 2013, 03:30:49

On Fri, Oct 11, 2013 at 3:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Oct 11, 2013 at 1:20 AM, Dimitri Fontaine
> <dimitri@2ndquadrant.fr> wrote:
>> Hi,
>>
>> I did a partial review of this patch, wherein I focused on the patch and
>> the code itself, as I saw other contributors already did some testing on
>> it, so that we know it applies cleanly and work to some good extend.
>
> Thanks a lot!
>
>> In full_page_writes_str() why are you returning "unrecognized" rather
>> than doing an ELOG(ERROR, …) for this unexpected situation?
>
> It's because the similar functions 'wal_level_str' and 'dbState' also return
> 'unrecognized' in the unexpected situation. I just implemented
> full_page_writes_str()
> in the same manner.
>
> If we do an elog(ERROR) in that case, pg_xlogdump would fail to dump
> the 'broken' (i.e., unrecognized fpw is set) WAL file. I think that some
> users want to use pg_xlogdump to investigate the broken WAL file, so
> doing an elog(ERROR) seems not good to me.
>
>> The code switches to compression (or trying to) when the following
>> condition is met:
>>
>> +               if (fpw <= FULL_PAGE_WRITES_COMPRESS)
>> +               {
>> +                       rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length, &(rdt->len));
>>
>> We have
>>
>> + typedef enum FullPageWritesLevel
>> + {
>> +       FULL_PAGE_WRITES_OFF = 0,
>> +       FULL_PAGE_WRITES_COMPRESS,
>> +       FULL_PAGE_WRITES_ON
>> + } FullPageWritesLevel;
>>
>> + #define FullPageWritesIsNeeded(fpw)   (fpw >= FULL_PAGE_WRITES_COMPRESS)
>>
>> I don't much like using the <= test against and ENUM and I'm not sure I
>> understand the intention you have here. It somehow looks like a typo and
>> disagrees with the macro.
>
> I thought that FPW should be compressed only when full_page_writes is
> set to 'compress' or 'off'. That is, 'off' implies a compression. When it's set
> to 'off', FPW is basically not generated, so there is no need to call
> CompressBackupBlock() in that case. But only during online base backup,
> FPW is forcibly generated even when it's set to 'off'. So I used the check
> "fpw <= FULL_PAGE_WRITES_COMPRESS" there.
>
>> What about using the FullPageWritesIsNeeded
>> macro, and maybe rewriting the macro as
>>
>> #define FullPageWritesIsNeeded(fpw) \
>>    (fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_ON)
>
> I'm OK to change the macro so that the <= test is not used.
>
>> Also, having "on" imply "compress" is a little funny to me. Maybe we
>> should just finish our testing and be happy to always compress the full
>> page writes. What would the downside be exactly (on buzy IO system
>> writing less data even if needing more CPU will be the right trade-off).
>
> "on" doesn't imply "compress". When full_page_writes is set to "on",
> FPW is not compressed at all.
>
>> I like that you're checking the savings of the compressed data with
>> respect to the uncompressed data and cancel the compression if there's
>> no gain. I wonder if your test accounts for enough padding and headers
>> though given the results we saw in other tests made in this thread.
>
> I'm afraid that the patch has only limited effects in WAL reduction and
> performance improvement unless the database contains highly-compressible
> data like large blank characters column. It really depends on the contents
> of the database. So, obviously FPW compression should not be the default.
> Maybe we can treat it as just tuning knob.
>
>> Why do we have both the static function full_page_writes_str() and the
>> macro FullPageWritesStr, with two different implementations issuing
>> either "true" and "false" or "on" and "off"?
>
> First I was thinking to use "on" and "off" because they are often used
> as the setting value of boolean GUC. But unfortunately the existing
> pg_xlogdump uses "true" and "false" to show the value of full_page_writes
> in WAL. To avoid breaking the backward compatibility, I implmented
> the "true/false" version of function. I'm really not sure how many people
> want such a compatibility of pg_xlogdump, though.
>
>> !       unsigned        hole_offset:15, /* number of bytes before "hole" */
>> !                               flags:2,                /* state of a backup block, see below */
>> !                               hole_length:15; /* number of bytes in "hole" */
>>
>> I don't understand that. I wanted to use that patch as a leverage to
>> smoothly discover the internals of our WAL system but won't have the
>> time to do that here.
>
> We need the flag indicating whether each FPW is compressed or not.
> If no such a flag exists in WAL, the standby cannot determine whether
> it should decompress each FPW or not, and then cannot replay
> the WAL containing FPW properly. That is, I just used a 'space' in
> the header of FPW to have such a flag.
>
>> That said, I don't even know that C syntax.
>
> The struct 'ItemIdData' uses the same C syntax.
>
>> + #define BKPBLOCK_UNCOMPRESSED 0       /* uncompressed */
>> + #define BKPBLOCK_COMPRESSED           1       /* comperssed */
>>
>> There's a typo in the comment above.
>
> Yep.
>
>>>   [time required to replay WAL generated during running pgbench]
>>>   61s (on)                 .... 1209911 transactions were replayed,
>>> recovery speed: 19834.6 transactions/sec
>>>   39s (compress)      .... 1445446 transactions were replayed,
>>> recovery speed: 37062.7 transactions/sec
>>>   37s (off)                 .... 1629235 transactions were replayed,
>>> recovery speed: 44033.3 transactions/sec
>>
>> How did you get those numbers ? pg_basebackup before the test and
>> archiving, then a PITR maybe? Is it possible to do the same test with
>> the same number of transactions to replay, I guess using the -t
>> parameter rather than the -T one for this testing.
>
> Sure. To be honest, when I received the same request from Andres,
> I did that benchmark. But unfortunately because of machine trouble,
> I could not report it, yet. Will do that again.

Here is the benchmark result:

* Result
[tps]
1317.306391 (full_page_writes = on)
1628.407752 (compress)

[the amount of WAL generated during running pgbench]
1319 MB (on)326 MB (compress)

[time required to replay WAL generated during running pgbench]
19s (on)
2013-10-11 12:05:09 JST LOG:  redo starts at F/F1000028
2013-10-11 12:05:28 JST LOG:  redo done at 10/446B7BF0

12s (on)
2013-10-11 12:06:22 JST LOG:  redo starts at F/F1000028
2013-10-11 12:06:34 JST LOG:  redo done at 10/446B7BF0

12s (on)
2013-10-11 12:07:19 JST LOG:  redo starts at F/F1000028
2013-10-11 12:07:31 JST LOG:  redo done at 10/446B7BF0
8s (compress)
2013-10-11 12:17:36 JST LOG:  redo starts at 10/50000028
2013-10-11 12:17:44 JST LOG:  redo done at 10/655AE478
8s (compress)
2013-10-11 12:18:26 JST LOG:  redo starts at 10/50000028
2013-10-11 12:18:34 JST LOG:  redo done at 10/655AE478
8s (compress)
2013-10-11 12:19:07 JST LOG:  redo starts at 10/50000028
2013-10-11 12:19:15 JST LOG:  redo done at 10/655AE478

[benchmark]
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 32
number of threads: 4
number of transactions per client: 10000
number of transactions actually processed: 320000/320000

Regards,

--
Fujii Masao

Re: Compression of full-page-writes

From

Fujii Masao

Date:

11 October 2013, 03:45:52

On Fri, Oct 11, 2013 at 8:35 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
> On 2013-10-11 03:44:01 +0900, Fujii Masao wrote:
>> I'm afraid that the patch has only limited effects in WAL reduction and
>> performance improvement unless the database contains highly-compressible
>> data like large blank characters column. It really depends on the contents
>> of the database. So, obviously FPW compression should not be the default.
>> Maybe we can treat it as just tuning knob.
>
>
> Have you tried using lz4 (or snappy) instead of pglz? There's a patch
> adding it to pg in
> http://archives.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de

Yeah, it's worth checking them! Will do that.

> If this really is only a benefit in scenarios with lots of such data, I
> have to say I have my doubts about the benefits of the patch.

Yep, maybe the patch needs to be redesigned. Currently in the patch
compression is performed per FPW, i.e., the size of data to compress
is just 8KB. If we can increase the size of data to compress, we might
be able to improve the compression ratio. For example, by storing
all outstanding WAL data temporarily in local buffer, compressing them,
and then storing the compressed WAL data to WAL buffers.

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Amit Kapila

Date:

11 October 2013, 03:53:00

On Fri, Oct 11, 2013 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
> On 2013-10-11 03:44:01 +0900, Fujii Masao wrote:
>> I'm afraid that the patch has only limited effects in WAL reduction and
>> performance improvement unless the database contains highly-compressible
>> data like large blank characters column. It really depends on the contents
>> of the database. So, obviously FPW compression should not be the default.
>> Maybe we can treat it as just tuning knob.
>
>
> Have you tried using lz4 (or snappy) instead of pglz? There's a patch
> adding it to pg in
> http://archives.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de
>
> If this really is only a benefit in scenarios with lots of such data, I
> have to say I have my doubts about the benefits of the patch.

I think it will be difficult to prove by using any compression
algorithm, that it compresses in most of the scenario's.
In many cases it can so happen that the WAL will also not be reduced
and tps can also come down if the data is non-compressible, because
any compression algorithm will have to try to compress the data and it
will burn some cpu for that, which inturn will reduce tps.

As this patch is giving a knob to user to turn compression on/off, so
users can decide if they want such benefit.
Now some users can say that they have no idea, how or what kind of
data will be there in their databases, so such kind of users should
not use this option, but on the other side some users know that they
have similar pattern of data, so they can get benefit out of such
optimisations. For example in telecom industry, i have seen that they
have lot of data as CDR's (call data records) in their HLR databases
for which the data records will be different but of same pattern.

Being said above, I think both this patch and my patch "WAL reduction
for Update" (https://commitfest.postgresql.org/action/patch_view?id=1209)
are using same technique for WAL compression and can lead to similar
consequences in different ways.
So I suggest to have unified method to enable WAL Compression for both
the patches.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

Haribabu kommi

Date:

11 October 2013, 07:10:26

On 10 October 2013 23:06 Fujii Masao wrote:
>On Wed, Oct 9, 2013 at 1:35 PM, Haribabu kommi <haribabu.kommi@huawei.com> wrote:
>>                                                 Thread-1                                        Threads-2
>>                                 Head code               FPW compress    Head code               FPW compress
>> Pgbench-org 5min                138(0.24GB)             131(0.04GB)             160(0.28GB)             163(0.05GB)
>> Pgbench-1000 5min               140(0.29GB)             128(0.03GB)             160(0.33GB)             162(0.02GB)
>> Pgbench-org 15min               141(0.59GB)             136(0.12GB)             160(0.65GB)             162(0.14GB)
>> Pgbench-1000 15min      138(0.81GB)             134(0.11GB)             159(0.92GB)             162(0.18GB)
>>
>> Pgbench-org - original pgbench
>> Pgbench-1000 - changed pgbench with a record size of 1000.

>This means that you changed the data type of pgbench_accounts.filler to char(1000)?

Yes, I changed the filler column as char(1000).

Regards,
Hari babu.

Re: Compression of full-page-writes

From

Andres Freund

Date:

11 October 2013, 17:07:05

On 2013-10-11 09:22:50 +0530, Amit Kapila wrote:
> I think it will be difficult to prove by using any compression
> algorithm, that it compresses in most of the scenario's.
> In many cases it can so happen that the WAL will also not be reduced
> and tps can also come down if the data is non-compressible, because
> any compression algorithm will have to try to compress the data and it
> will burn some cpu for that, which inturn will reduce tps.

Then those concepts maybe aren't such a good idea after all. Storing
lots of compressible data in an uncompressed fashion isn't an all that
common usecase. I most certainly don't want postgres to optimize for
blank padded data, especially if it can hurt other scenarios. Just not
enough benefit.
That said, I actually have relatively high hopes for compressing full
page writes. There often enough is lot of repetitiveness between rows on
the same page that it should be useful outside of such strange
scenarios. But maybe pglz is just not a good fit for this, it really
isn't a very good algorithm in this day and aage.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

Amit Kapila

Date:

12 October 2013, 15:14:32

On Fri, Oct 11, 2013 at 10:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-11 09:22:50 +0530, Amit Kapila wrote:
>> I think it will be difficult to prove by using any compression
>> algorithm, that it compresses in most of the scenario's.
>> In many cases it can so happen that the WAL will also not be reduced
>> and tps can also come down if the data is non-compressible, because
>> any compression algorithm will have to try to compress the data and it
>> will burn some cpu for that, which inturn will reduce tps.
>
> Then those concepts maybe aren't such a good idea after all. Storing
> lots of compressible data in an uncompressed fashion isn't an all that
> common usecase. I most certainly don't want postgres to optimize for
> blank padded data, especially if it can hurt other scenarios. Just not
> enough benefit.
> That said, I actually have relatively high hopes for compressing full
> page writes. There often enough is lot of repetitiveness between rows on
> the same page that it should be useful outside of such strange
> scenarios. But maybe pglz is just not a good fit for this, it really
> isn't a very good algorithm in this day and aage.

Do you think that if WAL reduction or performance with other
compression algorithm (for ex. snappy)  is better, then chances of
getting the new compression algorithm in postresql will be more?
Wouldn't it be okay, if we have GUC to enable it and have pluggable
api for calling compression method, with this we can even include
other compression algorithm's if they proved to be good and reduce the
dependency of this patch on inclusion of new compression methods in
postgresql?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

Jesper Krogh

Date:

13 October 2013, 10:10:16

On 11/10/13 19:06, Andres Freund wrote:
> On 2013-10-11 09:22:50 +0530, Amit Kapila wrote:
>> I think it will be difficult to prove by using any compression
>> algorithm, that it compresses in most of the scenario's.
>> In many cases it can so happen that the WAL will also not be reduced
>> and tps can also come down if the data is non-compressible, because
>> any compression algorithm will have to try to compress the data and it
>> will burn some cpu for that, which inturn will reduce tps.
> Then those concepts maybe aren't such a good idea after all. Storing
> lots of compressible data in an uncompressed fashion isn't an all that
> common usecase. I most certainly don't want postgres to optimize for
> blank padded data, especially if it can hurt other scenarios. Just not
> enough benefit.
> That said, I actually have relatively high hopes for compressing full
> page writes. There often enough is lot of repetitiveness between rows on
> the same page that it should be useful outside of such strange
> scenarios. But maybe pglz is just not a good fit for this, it really
> isn't a very good algorithm in this day and aage.
>
Hm,. There is a clear benefit for compressible data and clearly
no benefit from incompressible data..

how about letting autovacuum "taste" the compressibillity of
pages on per relation/index basis and set a flag that triggers
this functionality where it provides a benefit?

not hugely more magical than figuring out wether the data ends up
in the heap or in a toast table as it is now.

-- 
Jesper

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

15 October 2013, 00:55:31

(2013/10/13 0:14), Amit Kapila wrote:
> On Fri, Oct 11, 2013 at 10:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> But maybe pglz is just not a good fit for this, it really
>> isn't a very good algorithm in this day and aage.
+1. This compression algorithm is needed more faster than pglz which is like 
general compression algorithm, to avoid the CPU bottle-neck. I think pglz doesn't 
have good performance, and it is like fossil compression algorithm. So we need to 
change latest compression algorithm for more better future.

> Do you think that if WAL reduction or performance with other
> compression algorithm (for ex. snappy)  is better, then chances of
> getting the new compression algorithm in postresql will be more?
Latest compression algorithms papers(also snappy) have indecated. I think it is 
enough to select algorithm. It may be also good work in postgres.

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center

Re: Compression of full-page-writes

From

Amit Kapila

Date:

15 October 2013, 04:33:31

On Tue, Oct 15, 2013 at 6:30 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> (2013/10/13 0:14), Amit Kapila wrote:
>>
>> On Fri, Oct 11, 2013 at 10:36 PM, Andres Freund <andres@2ndquadrant.com>
>> wrote:
>>>
>>> But maybe pglz is just not a good fit for this, it really
>>> isn't a very good algorithm in this day and aage.
>
> +1. This compression algorithm is needed more faster than pglz which is like
> general compression algorithm, to avoid the CPU bottle-neck. I think pglz
> doesn't have good performance, and it is like fossil compression algorithm.
> So we need to change latest compression algorithm for more better future.
>
>
>> Do you think that if WAL reduction or performance with other
>> compression algorithm (for ex. snappy)  is better, then chances of
>> getting the new compression algorithm in postresql will be more?
>
> Latest compression algorithms papers(also snappy) have indecated. I think it
> is enough to select algorithm. It may be also good work in postgres.

Snappy is good mainly for un-compressible data, see the link below:
http://www.postgresql.org/message-id/CAAZKuFZCOCHsswQM60ioDO_hk12tA7OG3YcJA8v=4YebMOA-wA@mail.gmail.com

I think it is bit difficult to prove that any one algorithm is best
for all kind of loads.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

15 October 2013, 06:06:27

(2013/10/15 13:33), Amit Kapila wrote:
> Snappy is good mainly for un-compressible data, see the link below:
> http://www.postgresql.org/message-id/CAAZKuFZCOCHsswQM60ioDO_hk12tA7OG3YcJA8v=4YebMOA-wA@mail.gmail.com
This result was gotten in ARM architecture, it is not general CPU.
Please see detail document.
http://www.reddit.com/r/programming/comments/1aim6s/lz4_extremely_fast_compression_algorithm/c8y0ew9

I found compression algorithm test in HBase. I don't read detail, but it 
indicates snnapy algorithm gets best performance.
http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of

In fact, most of modern NoSQL storages use snappy. Because it has good 
performance and good licence(BSD license).

> I think it is bit difficult to prove that any one algorithm is best
> for all kind of loads.
I think it is necessary to make best efforts in community than I do the best 
choice with strict test.

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center

Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

15 October 2013, 13:01:12

On Tue, Oct 15, 2013 at 03:11:22PM +0900, KONDO Mitsumasa wrote:
> (2013/10/15 13:33), Amit Kapila wrote:
> >Snappy is good mainly for un-compressible data, see the link below:
> >http://www.postgresql.org/message-id/CAAZKuFZCOCHsswQM60ioDO_hk12tA7OG3YcJA8v=4YebMOA-wA@mail.gmail.com
> This result was gotten in ARM architecture, it is not general CPU.
> Please see detail document.
> http://www.reddit.com/r/programming/comments/1aim6s/lz4_extremely_fast_compression_algorithm/c8y0ew9
> 
> I found compression algorithm test in HBase. I don't read detail,
> but it indicates snnapy algorithm gets best performance.
> http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of
> 
> In fact, most of modern NoSQL storages use snappy. Because it has
> good performance and good licence(BSD license).
> 
> >I think it is bit difficult to prove that any one algorithm is best
> >for all kind of loads.
> I think it is necessary to make best efforts in community than I do
> the best choice with strict test.
> 
> Regards,
> -- 
> Mitsumasa KONDO
> NTT Open Source Software Center
> 

Google's lz4 is also a very nice algorithm with 33% better compression
performance than snappy and 2X the decompression performance in some
benchmarks also with a bsd license:

https://code.google.com/p/lz4/

Regards,
Ken

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

16 October 2013, 04:37:41

(2013/10/15 22:01), ktm@rice.edu wrote:
> Google's lz4 is also a very nice algorithm with 33% better compression
> performance than snappy and 2X the decompression performance in some
> benchmarks also with a bsd license:
>
> https://code.google.com/p/lz4/
If we judge only performance, we will select lz4. However, we should think another important factor which is software
robustness,achievement, bug

fix history, and etc... If we see unknown bugs, can we fix it or improve
algorithm? It seems very difficult, because we only use it and don't
understand algorihtms. Therefore, I think that we had better to select
robust and having more user software.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software

Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

16 October 2013, 13:39:42

On Wed, Oct 16, 2013 at 01:42:34PM +0900, KONDO Mitsumasa wrote:
> (2013/10/15 22:01), ktm@rice.edu wrote:
> >Google's lz4 is also a very nice algorithm with 33% better compression
> >performance than snappy and 2X the decompression performance in some
> >benchmarks also with a bsd license:
> >
> >https://code.google.com/p/lz4/
> If we judge only performance, we will select lz4. However, we should think
>  another important factor which is software robustness, achievement, bug
> fix history, and etc... If we see unknown bugs, can we fix it or improve
> algorithm? It seems very difficult, because we only use it and don't
> understand algorihtms. Therefore, I think that we had better to select
> robust and having more user software.
> 
> Regards,
> --
> Mitsumasa KONDO
> NTT Open Source Software
> 
Hi,

Those are all very good points. lz4 however is being used by Hadoop. It
is implemented natively in the Linux 3.11 kernel and the BSD version of
the ZFS filesystem supports the lz4 algorithm for on-the-fly compression.
With more and more CPU cores available in modern system, using an
algorithm with very fast decompression speeds can make storing data, even
in memory, in a compressed form can reduce space requirements in exchange
for a higher CPU cycle cost. The ability to make those sorts of trade-offs
can really benefit from a plug-able compression algorithm interface.

Regards,
Ken

Re: Compression of full-page-writes

From

KONDO Mitsumasa

Date:

18 October 2013, 06:23:54

Hi,

Sorry for my reply late...

(2013/10/11 2:32), Fujii Masao wrote:
> Could you let me know how much WAL records were generated
> during each benchmark?
It was not seen difference hardly about WAL in DBT-2 benchmark. It was because
largest tuples are filled in random character which is difficult to compress, I
survey it.

So I test two pattern data. One is original data which is hard to compress data.
Second is little bit changing data which are easy to compress data. Specifically,
I substitute zero padding tuple for random character tuple.
Record size is same in original test data, I changed only character fo record.
Sample changed record is here.

* Original record (item table)
> 1       9830    W+ùMî/aGhÞVJ;t+Pöþm5v2î.        82.62   Tî%N#ROò|?ö;[_îë~!YäHPÜï[S!JV58Ü#;+$cPì=dãNò;=Þô5
> 2       1492    VIKëyC..UCçWSèQð2?&s÷Jf 95.78   >ýoCj'nîHR`i]cøuDH&-wì4èè}{39ámLß2mC712Tao÷
> 3       4485    oJ)kLvP^_:91BOïé        32.00   ð<èüJ÷RÝ_Jze+?é4Ü7ä-r=DÝK\\$;Fsà8ál5

* Changed sample record (item table)
> 1       9830    000000000000000000000000        95.77   00000000000000000000000000000000000000000
> 2       764     00000000000000  47.92   00000000000000000000000000000000000000000000000000
> 3       4893    000000000000000000000   15.90   00000000000000000000000000000000000



* DBT-2 Result

@Werehouse = 340
                         | NOTPM     | 90%tile     | Average | S.Deviation
------------------------+-----------+-------------+---------+-------------
no-patched              | 3319.02   | 13.606648   | 7.589   | 8.428
patched                 | 3341.25   | 20.132364   | 7.471   | 10.458
patched-testdata_changed| 3738.07   | 20.493533   | 3.795   | 10.003

Compression patch gets higher performance than no-patch in easy to compress test
data. It is because compression patch make archive WAL more small size, in
result, waste file cache is less than no-patch. Therefore, it was inflected
file-cache more effectively.

However, test in hard to compress test data have little bit lessor performance
than no-patch. I think it is compression overhead in pglz.


> I think that this benchmark result clearly means that the patch
> has only limited effects in the reduction of WAL volume and
> the performance improvement unless the database contains
> highly-compressible data like pgbench_accounts.
Your expectation is right. I think that low CPU cost and high compression
algorithm make your patch more better and better performance, too.

> filler. But if
> we can use other compression algorithm, maybe we can reduce
> WAL volume very much.
Yes, Please!

> I'm not sure what algorithm is good  for WAL compression, though.
Community member think Snappy or lz4 is better. You'd better to select one,
or test two algorithms.

> It might be better to introduce the hook for compression of FPW
> so that users can freely use their compression module, rather
> than just using pglz_compress(). Thought?
In my memory, Andres Freund developed like this patch. Did it commit or
developing now? I have thought this idea is very good.

Regards,
--
Mitsumasa KONDO
NTT Open Source  Software Center

>Thanks for extending and revising the FPW-compress patch! Could you add
>your patch into next CF?

Sure. I will make improvements and add it to next CF.

>Isn't it worth measuring the recovery performance for each compression
>algorithm?

Yes I will post this soon.

On Wed, May 28, 2014 at 8:04 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, May 27, 2014 at 12:57 PM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
> Hello All,
>
> 0001-CompressBackupBlock_snappy_lz4_pglz extends patch on compression of
> full page writes to include LZ4 and Snappy . Changes include making
> "compress_backup_block" GUC from boolean to enum. Value of the GUC can be
> OFF, pglz, snappy or lz4 which can be used to turn off compression or set
> the desired compression algorithm.
>
> 0002-Support_snappy_lz4 adds support for LZ4 and Snappy in PostgreSQL. It
> uses Andres’s patch for getting Makefiles working and has a few wrappers to
> make the function calls to LZ4 and Snappy compression functions and handle
> varlena datatypes.
> Patch Courtesy: Pavan Deolasee

Thanks for extending and revising the FPW-compress patch! Could you add
your patch into next CF?

> Also, compress_backup_block GUC needs to be merged with full_page_writes.

Basically I agree with you because I don't want to add new GUC very similar to
the existing one.

But could you imagine the case where full_page_writes = off. Even in this case,
FPW is forcibly written only during base backup. Such FPW also should be
compressed? Which compression algorithm should be used? If we want to
choose the algorithm for such FPW, we would not be able to merge those two
GUCs. IMO it's OK to always use the best compression algorithm for such FPW
and merge them, though.

> Tests use JDBC runner TPC-C benchmark to measure the amount of WAL
> compression ,tps and response time in each of the scenarios viz .
> Compression = OFF , pglz, LZ4 , snappy ,FPW=off

Isn't it worth measuring the recovery performance for each compression
algorithm?

Regards,

--
Fujii Masao

Re: Compression of full-page-writes

From

Bruce Momjian

Date:

29 May 2014, 17:21:41

On Thu, May 29, 2014 at 11:21:44AM +0100, Simon Riggs wrote:
> > Uh, how would that work if you want to compress the background_FPWs?
> > Use compressed_background_FPWs?
> 
> We've currently got 1 technique for torn page protection, soon to have
> 2 and with a 3rd on the horizon and likely to receive effort in next
> release.
> 
> It seems sensible to have just one parameter to describe the various
> techniques, as suggested. I'm suggesting that we plan for how things
> will look when we have the 3rd one as well.
> 
> Alternate suggestions welcome.

I was just pointing out that we might need compression to be a separate
boolean variable from the type of page tear protection.  I know I am
usually anti-adding-variables, but in this case it seems trying to have
one variable control several things will lead to confusion.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Compression of full-page-writes

From

Fujii Masao

Date:

02 June 2014, 12:44:33

On Thu, May 29, 2014 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 29 May 2014 01:07, Bruce Momjian <bruce@momjian.us> wrote:
>> On Wed, May 28, 2014 at 04:04:13PM +0100, Simon Riggs wrote:
>>> On 28 May 2014 15:34, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>
>>> >> Also, compress_backup_block GUC needs to be merged with full_page_writes.
>>> >
>>> > Basically I agree with you because I don't want to add new GUC very similar to
>>> > the existing one.
>>> >
>>> > But could you imagine the case where full_page_writes = off. Even in this case,
>>> > FPW is forcibly written only during base backup. Such FPW also should be
>>> > compressed? Which compression algorithm should be used? If we want to
>>> > choose the algorithm for such FPW, we would not be able to merge those two
>>> > GUCs. IMO it's OK to always use the best compression algorithm for such FPW
>>> > and merge them, though.
>>>
>>> I'd prefer a new name altogether
>>>
>>> torn_page_protection = 'full_page_writes'
>>> torn_page_protection = 'compressed_full_page_writes'
>>> torn_page_protection = 'none'
>>>
>>> this allows us to add new techniques later like
>>>
>>> torn_page_protection = 'background_FPWs'
>>>
>>> or
>>>
>>> torn_page_protection = 'double_buffering'
>>>
>>> when/if we add those new techniques
>>
>> Uh, how would that work if you want to compress the background_FPWs?
>> Use compressed_background_FPWs?
>
> We've currently got 1 technique for torn page protection, soon to have
> 2 and with a 3rd on the horizon and likely to receive effort in next
> release.
>
> It seems sensible to have just one parameter to describe the various
> techniques, as suggested. I'm suggesting that we plan for how things
> will look when we have the 3rd one as well.
>
> Alternate suggestions welcome.

Is even compression of double buffer worthwhile? If yes, what about separating
the GUC parameter into torn_page_protection and something like
full_page_compression? ISTM that any combination of settings of those parameters
can work.

torn_page_protection = 'FPW', 'background FPW', 'none', 'double buffer'
full_page_compression = 'no', 'pglz', 'lz4', 'snappy'

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Rahila Syed

Date:

10 June 2014, 14:49:52

<div dir="ltr"><p class="MsoNormal">Hello ,<p class="MsoNormal"><br /><p class="MsoNormal">In order to facilitate
changingof compression algorithms  and to be able to recover using WAL records compressed with different compression
algorithms,information about compression algorithm can be stored in WAL record.<p class="MsoNormal">XLOG record header
has2 to 4 padding bytes in order to align the WAL record. This space can be used for  a new flag in order to store
informationabout the compression algorithm used. Like the xl_info field of XlogRecord struct,  8 bits flag  can be
constructedwith the lower 4 bits of the flag used to indicate which backup block is compressed out of 0,1,2,3. Higher
fourbits can be used to indicate state of compression i.e off,lz4,snappy,pglz.<p class="MsoNormal">The flag can be
extendedto incorporate more compression algorithms added in future if any.<p class="MsoNormal">What is your opinion on
this?<pclass="MsoNormal"><br /><p class="MsoNormal">Thank you,<p class="MsoNormal">Rahila Syed</div><div
class="gmail_extra"><br/><br /><div class="gmail_quote">On Tue, May 27, 2014 at 9:27 AM, Rahila Syed <span
dir="ltr"><<ahref="mailto:rahilasyed.90@gmail.com" target="_blank">rahilasyed.90@gmail.com</a>></span> wrote:<br
/><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello All,<br
/><br/> 0001-CompressBackupBlock_snappy_lz4_pglz extends patch on compression of<br /> full page writes to include LZ4
andSnappy . Changes include making<br /> "compress_backup_block" GUC from boolean to enum. Value of the GUC can be<br
/>OFF, pglz, snappy or lz4 which can be used to turn off compression or set<br /> the desired compression algorithm.<br
/><br/> 0002-Support_snappy_lz4 adds support for LZ4 and Snappy in PostgreSQL. It<br /> uses Andres’s patch for getting
Makefilesworking and has a few wrappers to<br /> make the function calls to LZ4 and Snappy compression functions and
handle<br/> varlena datatypes.<br /> Patch Courtesy: Pavan Deolasee<br /><br /> These patches serve as a way to test
variouscompression algorithms. These<br /> are WIP yet. They don’t support changing compression algorithms on standby
.<br/> Also, compress_backup_block GUC needs to be merged with full_page_writes.<br /> The patch uses LZ4 high
compression(HC)variant.<br /> I have conducted initial tests which I would like to share and solicit<br /> feedback<br
/><br/> Tests use JDBC runner TPC-C benchmark to measure the amount of WAL<br /> compression ,tps and response time in
eachof the scenarios viz .<br /> Compression = OFF , pglz, LZ4 , snappy ,FPW=off<br /><br /> Server specifications:<br
/>Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos<br /> RAM: 32GB<br /> Disk : HDD      450GB
10KHot Plug 2.5-inch SAS HDD * 8 nos<br /> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm<br /><br /><br />
Benchmark:<br/> Scale : 100<br /> Command  :java JR  /home/postgres/jdbcrunner-1.2/scripts/tpcc.js  -sleepTime<br />
600,350,300,250,250<br/> Warmup time          : 1 sec<br /> Measurement time     : 900 sec<br /> Number of tx types   :
5<br/> Number of agents     : 16<br /> Connection pool size : 16<br /> Statement cache size : 40<br /> Auto commit    
    : false<br /> Sleep time           : 600,350,300,250,250 msec<br /><br /> Checkpoint segments:1024<br /> Checkpoint
timeout:5mins<br /><br /><br /> Scenario           WAL generated(bytes)                   Compression<br /> (bytes)    
 TPS (tx1,tx2,tx3,tx4,tx5)<br /> No_compress      2220787088 (~2221MB)                 NULL<br /> 13.3,13.3,1.3,1.3,1.3
tps<br/> Pglz                  1796213760 (~1796MB)                 424573328<br /> (19.11%)     13.1,13.1,1.3,1.3,1.3
tps<br/> Snappy             1724171112 (~1724MB)                 496615976( 22.36%)<br /> 13.2,13.2,1.3,1.3,1.3 tps<br
/>LZ4(HC)            1658941328 (~1659MB)                 561845760(25.29%)<br /> 13.2,13.2,1.3,1.3,1.3 tps<br />
FPW(off)          139384320(~139 MB)                    NULL<br /> 13.3,13.3,1.3,1.3,1.3 tps<br /><br /> As per
measurementresults, WAL reduction using LZ4 is close to 25% which<br /> shows 6 percent increase in WAL reduction when
comparedto pglz . WAL<br /> reduction in snappy is close to 22 % .<br /> The numbers for compression using LZ4 and
Snappydoesn’t seem to be very<br /> high as compared to pglz for given workload. This can be due to<br />
in-compressiblenature of the TPC-C data which contains random strings<br /><br /> Compression does not have bad impact
onthe response time. In fact, response<br /> times for Snappy, LZ4 are much better than no compression with almost ½
to<br/> 1/3 of the response times of no-compression(FPW=on) and FPW = off.<br /> The response time order for each  type
ofcompression is<br /> Pglz>Snappy>LZ4<br /><br /> Scenario              Response time (tx1,tx2,tx3,tx4,tx5)<br
/>no_compress        5555,1848,4221,6791,5747 msec<br /> pglz                    4275,2659,1828,4025,3326 msec<br />
Snappy              3790,2828,2186,1284,1120 msec<br /> LZ4(hC)              2519,2449,1158,2066,2065 msec<br />
FPW(off)            6234,2430,3017,5417,5885 msec<br /><br /> LZ4 and Snappy are almost at par with each other in terms
ofresponse time<br /> as average response times of five types of transactions remains almost same<br /> for both.<br />
0001-CompressBackupBlock_snappy_lz4_pglz.patch<br/> <<a
href="http://postgresql.1045698.n5.nabble.com/file/n5805044/0001-CompressBackupBlock_snappy_lz4_pglz.patch"
target="_blank">http://postgresql.1045698.n5.nabble.com/file/n5805044/0001-CompressBackupBlock_snappy_lz4_pglz.patch</a>><br
/>0002-Support_snappy_lz4.patch<br /> <<a
href="http://postgresql.1045698.n5.nabble.com/file/n5805044/0002-Support_snappy_lz4.patch"
target="_blank">http://postgresql.1045698.n5.nabble.com/file/n5805044/0002-Support_snappy_lz4.patch</a>><br/><br
/><br/><br /><br /> --<br /> View this message in context: <a
href="http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5805044.html"
target="_blank">http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5805044.html</a><br/>
Sentfrom the PostgreSQL - hackers mailing list archive at Nabble.com.<br /><span class="HOEnZb"><font
color="#888888"><br/><br /> --<br /> Sent via pgsql-hackers mailing list (<a
href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your
subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers"
target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/></font></span></blockquote></div><br /></div>

Re: Compression of full-page-writes

From

Michael Paquier

Date:

11 June 2014, 01:05:24

On Tue, Jun 10, 2014 at 11:49 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello ,
>
>
> In order to facilitate changing of compression algorithms  and to be able to
> recover using WAL records compressed with different compression algorithms,
> information about compression algorithm can be stored in WAL record.
>
> XLOG record header has 2 to 4 padding bytes in order to align the WAL
> record. This space can be used for  a new flag in order to store information
> about the compression algorithm used. Like the xl_info field of XlogRecord
> struct,  8 bits flag  can be constructed with the lower 4 bits of the flag
> used to indicate which backup block is compressed out of 0,1,2,3. Higher
> four bits can be used to indicate state of compression i.e
> off,lz4,snappy,pglz.
>
> The flag can be extended to incorporate more compression algorithms added in
> future if any.
>
> What is your opinion on this?
-1 for any additional bytes in WAL record to control such things,
having one single compression that we know performs well and relying
on it makes the life of user and developer easier.
-- 
Michael

Re: Compression of full-page-writes

From

Fujii Masao

Date:

11 June 2014, 10:49:17

On Wed, Jun 11, 2014 at 10:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jun 10, 2014 at 11:49 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>> Hello ,
>>
>>
>> In order to facilitate changing of compression algorithms  and to be able to
>> recover using WAL records compressed with different compression algorithms,
>> information about compression algorithm can be stored in WAL record.
>>
>> XLOG record header has 2 to 4 padding bytes in order to align the WAL
>> record. This space can be used for  a new flag in order to store information
>> about the compression algorithm used. Like the xl_info field of XlogRecord
>> struct,  8 bits flag  can be constructed with the lower 4 bits of the flag
>> used to indicate which backup block is compressed out of 0,1,2,3. Higher
>> four bits can be used to indicate state of compression i.e
>> off,lz4,snappy,pglz.
>>
>> The flag can be extended to incorporate more compression algorithms added in
>> future if any.
>>
>> What is your opinion on this?
> -1 for any additional bytes in WAL record to control such things,
> having one single compression that we know performs well and relying
> on it makes the life of user and developer easier.

IIUC even when we adopt only one algorithm, additional at least one bit is
necessary to see whether this backup block is compressed or not.

This flag is necessary only for backup block, so there is no need to use
the header of each WAL record. What about just using the backup block
header?

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Pavan Deolasee

Date:

11 June 2014, 11:05:27

On Wed, Jun 11, 2014 at 4:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

IIUC even when we adopt only one algorithm, additional at least one bit is
necessary to see whether this backup block is compressed or not.

This flag is necessary only for backup block, so there is no need to use
the header of each WAL record. What about just using the backup block
header?

+1. We can also steal a few bits from ForkNumber field in the backup block header if required.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Compression of full-page-writes

From

Rahila Syed

Date:

13 June 2014, 14:37:38

Hello,

The attached patch named CompressBackupBlock_snappy_lz4_pglz accomplishes compression of FPW in WAL using pglz ,LZ4 and Snappy. This serves as a means to test performance of various compression algorithms for FPW compression

Minor correction in check for compression/decompression is made since the last time it was posted.

Patch named Support-for-lz4-and-snappy adds support for LZ4 and Snappy in PostgreSQL.

Below are the performance numbers taken for various values of compress_backup_block GUC parameter.

Scenario Amount of WAL(bytes) Compression (bytes) WALRecovery time(secs) TPS

FPW(on)Compression(Off) 1393681216 (~1394MB) NA 17 s 15.8,15.8,1.6,1.6,1.6 tps

Pglz 1192524560 (~1193 MB) 14% 17 s 15.6,15.6,1.6,1.6,1.6 tps

LZ4 1124745880 (~1125MB) 19.2% 16 s 15.7,15.7,1.6,1.6,1.6 tps

Snappy 1123117704 (~1123MB) 19.4% 17 s 15.6,15.6,1.6,1.6,1.6 tps

FPW (off) 171287384 ( ~171MB) NA 12 s 16.0,16.0,1.6,1.6,1.6 tps

Compression ratios of LZ4 and Snappy are almost at par for given workload. The nature of TPC-C type of data used is highly incompressible which explains the low compression ratios.

Turning compression on reduces tps overall. TPS numbers for LZ4 is slightly better than pglz and snappy.

Recovery(decompression) speed of LZ4 is slightly faster than Snappy.

Overall LZ4 scores over Snappy and pglz in terms of recovery (decompression) speed ,TPS and response times. Also, compression of LZ4 is at par with Snappy.

Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Benchmark:
Scale : 16
Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime 550,250,250,200,200
Warmup time : 1 sec
Measurement time : 900 sec
Number of tx types : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit : false

Checkpoint segments:1024
Checkpoint timeout:5 mins

Limitations of the current patch:

1. The patch currently compresses entire backup block inclusive of ‘hole’ unlike normal code which backs up the part before and after the hole separately. There can be performance issues when ‘hole’ is not filled with zeros. Hence separately compressing parts of block before and after hole can be considered.

2. Patch currently relies on ‘compress_backup_block’ GUC parameter to check if FPW is compressed or not. Information about whether FPW is compressed and which compression algorithm is used can be included in WAL record header . This will enable switching compression off and changing compression algorithm whenever desired.

3. Extending decompression logic to pg_xlogdump.

On Tue, May 27, 2014 at 9:27 AM, Rahila Syed <rahilasyed.90@gmail.com> wrote:

Hello All,

0001-CompressBackupBlock_snappy_lz4_pglz extends patch on compression of
full page writes to include LZ4 and Snappy . Changes include making
"compress_backup_block" GUC from boolean to enum. Value of the GUC can be
OFF, pglz, snappy or lz4 which can be used to turn off compression or set
the desired compression algorithm.

0002-Support_snappy_lz4 adds support for LZ4 and Snappy in PostgreSQL. It
uses Andres’s patch for getting Makefiles working and has a few wrappers to
make the function calls to LZ4 and Snappy compression functions and handle
varlena datatypes.
Patch Courtesy: Pavan Deolasee

These patches serve as a way to test various compression algorithms. These
are WIP yet. They don’t support changing compression algorithms on standby .
Also, compress_backup_block GUC needs to be merged with full_page_writes.
The patch uses LZ4 high compression(HC) variant.
I have conducted initial tests which I would like to share and solicit
feedback

Tests use JDBC runner TPC-C benchmark to measure the amount of WAL
compression ,tps and response time in each of the scenarios viz .
Compression = OFF , pglz, LZ4 , snappy ,FPW=off

Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Benchmark:
Scale : 100
Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime
600,350,300,250,250
Warmup time : 1 sec
Measurement time : 900 sec
Number of tx types : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit : false
Sleep time : 600,350,300,250,250 msec

Checkpoint segments:1024
Checkpoint timeout:5 mins

Scenario WAL generated(bytes) Compression
(bytes) TPS (tx1,tx2,tx3,tx4,tx5)
No_compress 2220787088 (~2221MB) NULL
13.3,13.3,1.3,1.3,1.3 tps
Pglz 1796213760 (~1796MB) 424573328
(19.11%) 13.1,13.1,1.3,1.3,1.3 tps
Snappy 1724171112 (~1724MB) 496615976( 22.36%)
13.2,13.2,1.3,1.3,1.3 tps
LZ4(HC) 1658941328 (~1659MB) 561845760(25.29%)
13.2,13.2,1.3,1.3,1.3 tps
FPW(off) 139384320(~139 MB) NULL
13.3,13.3,1.3,1.3,1.3 tps

As per measurement results, WAL reduction using LZ4 is close to 25% which
shows 6 percent increase in WAL reduction when compared to pglz . WAL
reduction in snappy is close to 22 % .
The numbers for compression using LZ4 and Snappy doesn’t seem to be very
high as compared to pglz for given workload. This can be due to
in-compressible nature of the TPC-C data which contains random strings

Compression does not have bad impact on the response time. In fact, response
times for Snappy, LZ4 are much better than no compression with almost ½ to
1/3 of the response times of no-compression(FPW=on) and FPW = off.
The response time order for each type of compression is
Pglz>Snappy>LZ4

Scenario Response time (tx1,tx2,tx3,tx4,tx5)
no_compress 5555,1848,4221,6791,5747 msec
pglz 4275,2659,1828,4025,3326 msec
Snappy 3790,2828,2186,1284,1120 msec
LZ4(hC) 2519,2449,1158,2066,2065 msec
FPW(off) 6234,2430,3017,5417,5885 msec

LZ4 and Snappy are almost at par with each other in terms of response time
as average response times of five types of transactions remains almost same
for both.
0001-CompressBackupBlock_snappy_lz4_pglz.patch
<http://postgresql.1045698.n5.nabble.com/file/n5805044/0001-CompressBackupBlock_snappy_lz4_pglz.patch>
0002-Support_snappy_lz4.patch
<http://postgresql.1045698.n5.nabble.com/file/n5805044/0002-Support_snappy_lz4.patch>

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5805044.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Hello ,

>I have a few preliminary comments about your patch

Thank you for review comments.

>the patch creates src/common/lz4/.travis.yml, which it shouldn't.

Agree. I will remove it.

>Shouldn't this use palloc?

palloc() is disallowed in critical sections and we are already in CS while executing this code. So we use malloc(). It's OK since the memory is allocated just once per session and it stays till the end.

>At the very minimum, I would move the "if (!compressed_pages_allocated)"

>block outside the "for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)" loop,.

Yes , the code for allocating memory is being executed just once for each run of the program so it can be taken out of the for loop. But as the condition

if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&

!compressed_pages_allocated) evaluates to be true for just the first loop , I am not sure if the change will be a significant improvement from performance point of view except it will save few condition checks.

>and
>add some comments. I think we could live with that

I will add comments.

>If we were going to keep multiple compression algorithms around, I'd be
>inclined to create a "pg_compress(…, compression_algorithm)" function to
>hide these return-value differences from the callers. and a "pg_decompress()" function that does error checking

+1 for abstracting out the differences in the return values and arguments and provide a common interface for all compression algorithms.

> if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_SNAPPY)
> {
if (pg_snappy_compress(page, BLCKSZ, buf) == EIO)
return NULL;
> }
> else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_LZ4)
> {
> if (pg_LZ4_compress(page, BLCKSZ, buf) == 0)
> return NULL;
> }
> else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_PGLZ)
> {
> if (pglz_compress(page, BLCKSZ, (PGLZ_Header *) buf,
> PGLZ_strategy_default) != 0)
> return NULL;
> }
> else
> elog(ERROR, "Wrong value for compress_backup_block GUC");

> /*
> * …comment about insisting on saving at least two bytes…
> */

> if (VARSIZE(buf) >= orig_len - 2)
> return NULL;

> *len = VARHDRSIZE + VARSIZE(buf);

> return buf;

>I guess it doesn't matter *too* much if the intention is to have all
>these compression algorithms only during development/testing and pick
>just one in the end. But the above is considerably easier to read in
>the meanwhile.

The above version is better as it avoids goto statement.

>I don't mind the suggestion elsewhere in this thread to use

>"full_page_compression = y" (as a setting alongside
>"torn_page_protection = x").

This change of GUC is in the ToDo for this patch.

Thank you,

Rahila

On Tue, Jun 17, 2014 at 5:17 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:

At 2014-06-13 20:07:29 +0530, rahilasyed90@gmail.com wrote:
>
> Patch named Support-for-lz4-and-snappy adds support for LZ4 and Snappy
> in PostgreSQL.

I haven't looked at this in any detail yet, but I note that the patch
creates src/common/lz4/.travis.yml, which it shouldn't.

I have a few preliminary comments about your patch.

> @@ -84,6 +87,7 @@ bool XLogArchiveMode = false;
> char *XLogArchiveCommand = NULL;
> bool EnableHotStandby = false;
> bool fullPageWrites = true;
> +int compress_backup_block = false;

I think compress_backup_block should be initialised to
BACKUP_BLOCK_COMPRESSION_OFF. (But see below.)

> + for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
> + compressed_pages[j] = (char *) malloc(buffer_size);

Shouldn't this use palloc?

> + * Create a compressed version of a backup block
> + *
> + * If successful, return a compressed result and set 'len' to its length.
> + * Otherwise (ie, compressed result is actually bigger than original),
> + * return NULL.
> + */
> +static char *
> +CompressBackupBlock(char *page, uint32 orig_len, char *dest, uint32 *len)
> +{

First, the calling convention is a bit strange. I understand that you're
pre-allocating compressed_pages[] so as to avoid repeated allocations;
and that you're doing it outside CompressBackupBlock so as to avoid
passing in the index i. But the result is a little weird.

At the very minimum, I would move the "if (!compressed_pages_allocated)"
block outside the "for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)" loop, and
add some comments. I think we could live with that.

But I'm not at all fond of the code in this function either. I'd write
it like this:

struct varlena *buf = (struct varlena *) dest;

if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_SNAPPY)
{
if (pg_snappy_compress(page, BLCKSZ, buf) == EIO)
return NULL;
}
else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_LZ4)
{
if (pg_LZ4_compress(page, BLCKSZ, buf) == 0)
return NULL;
}
else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_PGLZ)
{
if (pglz_compress(page, BLCKSZ, (PGLZ_Header *) buf,
PGLZ_strategy_default) != 0)
return NULL;
}
else
elog(ERROR, "Wrong value for compress_backup_block GUC");

/*
* …comment about insisting on saving at least two bytes…
*/

if (VARSIZE(buf) >= orig_len - 2)
return NULL;

*len = VARHDRSIZE + VARSIZE(buf);

return buf;

I guess it doesn't matter *too* much if the intention is to have all
these compression algorithms only during development/testing and pick
just one in the end. But the above is considerably easier to read in
the meanwhile.

If we were going to keep multiple compression algorithms around, I'd be
inclined to create a "pg_compress(…, compression_algorithm)" function to
hide these return-value differences from the callers.

> + else if (VARATT_IS_COMPRESSED((struct varlena *) blk) && compress_backup_block!=BACKUP_BLOCK_COMPRESSION_OFF)
> + {
> + if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + {
> + int ret;
> + size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> + char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> + size_t s_uncompressed_length;
> +
> + ret = snappy_uncompressed_length(compressed_data,
> + compressed_length,
> + &s_uncompressed_length);
> + if (!ret)
> + elog(ERROR, "snappy: failed to determine compression length");
> + if (BLCKSZ != s_uncompressed_length)
> + elog(ERROR, "snappy: compression size mismatch %d != %zu",
> + BLCKSZ, s_uncompressed_length);
> +
> + ret = snappy_uncompress(compressed_data,
> + compressed_length,
> + page);
> + if (ret != 0)
> + elog(ERROR, "snappy: decompression failed: %d", ret);
> + }

…and a "pg_decompress()" function that does error checking.

> +static const struct config_enum_entry backup_block_compression_options[] = {
> + {"off", BACKUP_BLOCK_COMPRESSION_OFF, false},
> + {"false", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"no", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"0", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"pglz", BACKUP_BLOCK_COMPRESSION_PGLZ, true},
> + {"snappy", BACKUP_BLOCK_COMPRESSION_SNAPPY, true},
> + {"lz4", BACKUP_BLOCK_COMPRESSION_LZ4, true},
> + {NULL, 0, false}
> +};

Finally, I don't like the name "compress_backup_block".

1. It should have been plural (compress_backup_blockS).

2. Looking at the enum values, "backup_block_compression = x" would be a
better name anyway…

3. But we don't use the term "backup block" anywhere in the
documentation, and it's very likely to confuse people.

I don't mind the suggestion elsewhere in this thread to use
"full_page_compression = y" (as a setting alongside
"torn_page_protection = x").

I haven't tried the patch (other than applying and building it) yet. I
will do so after I hear what you and others think of the above points.

-- Abhijit

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

18 June 2014, 12:43:13

On 2014-06-18 18:10:34 +0530, Rahila Syed wrote:
> Hello ,
> 
> >I have a few preliminary comments about your patch
> Thank you for review comments.
> 
> >the patch creates src/common/lz4/.travis.yml, which it shouldn't.
> Agree. I will remove it.
> 
> >Shouldn't this use palloc?
> palloc() is disallowed in critical sections and we are already in CS while
> executing this code. So we use malloc(). It's OK since the memory is
> allocated just once per session and it stays till the end.

malloc() isn't allowed either. You'll need to make sure all memory is
allocated beforehand

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Abhijit Menon-Sen

Date:

18 June 2014, 12:55:42

At 2014-06-18 18:10:34 +0530, rahilasyed90@gmail.com wrote:
>
> palloc() is disallowed in critical sections and we are already in CS
> while executing this code. So we use malloc().

Are these allocations actually inside a critical section? It seems to me
that the critical section starts further down, but perhaps I am missing
something.

Second, as Andres says, you shouldn't malloc() inside a critical section
either; and anyway, certainly not without checking the return value.

> I am not sure if the change will be a significant improvement from
> performance point of view except it will save few condition checks.

Moving that allocation out of the outer for loop it's currently in is
*nothing* to do with performance, but about making the code easier to
read.

-- Abhijit

Re: [REVIEW] Re: Compression of full-page-writes

From

Abhijit Menon-Sen

Date:

18 June 2014, 12:58:22

At 2014-06-18 18:25:34 +0530, ams@2ndQuadrant.com wrote:
>
> Are these allocations actually inside a critical section? It seems to me
> that the critical section starts further down, but perhaps I am missing
> something.

OK, I was missing that XLogInsert() itself can be called from inside a
critical section. So the allocation has to be moved somewhere else
altogether.

-- Abhijit

Re: [REVIEW] Re: Compression of full-page-writes

From

Pavan Deolasee

Date:

18 June 2014, 13:06:51

On Wed, Jun 18, 2014 at 6:25 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:

At 2014-06-18 18:10:34 +0530, rahilasyed90@gmail.com wrote:
>
> palloc() is disallowed in critical sections and we are already in CS
> while executing this code. So we use malloc().

Are these allocations actually inside a critical section? It seems to me
that the critical section starts further down, but perhaps I am missing
something.

ISTM XLogInsert() itself is called from other critical sections. See heapam.c for example.

Second, as Andres says, you shouldn't malloc() inside a critical section
either; and anyway, certainly not without checking the return value.

I was actually surprised to see Andreas comment. But he is right. OOM inside CS will result in a PANIC. I wonder if we can or if we really do enforce that though. The code within #ifdef WAL_DEBUG in the same function is surely doing a palloc(). That will be caught since there is an assert inside palloc(). May be nobody tried building with WAL_DEBUG since that assert was added.

May be Rahila can move that code to InitXLogAccess or even better check for malloc() return value and proceed without compression. There is code in snappy.c which will need similar handling, if we decide to finally add that to core.

> I am not sure if the change will be a significant improvement from
> performance point of view except it will save few condition checks.

Moving that allocation out of the outer for loop it's currently in is
*nothing* to do with performance, but about making the code easier to
read.

+1.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

03 July 2014, 19:58:33

Hello,

Updated version of patches are attached.

Changes are as follows

1. Improved readability of the code as per the review comments.

2. Addition of block_compression field in BkpBlock structure to store information about compression of block. This provides for switching compression on/off and changing compression algorithm as required.

3.Handling of OOM in critical section by checking for return value of malloc and proceeding without compression of FPW if return value is NULL.

Thank you,

Rahila Syed

On Tue, Jun 17, 2014 at 5:17 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:

At 2014-06-13 20:07:29 +0530, rahilasyed90@gmail.com wrote:
>
> Patch named Support-for-lz4-and-snappy adds support for LZ4 and Snappy
> in PostgreSQL.

I haven't looked at this in any detail yet, but I note that the patch
creates src/common/lz4/.travis.yml, which it shouldn't.

I have a few preliminary comments about your patch.

> @@ -84,6 +87,7 @@ bool XLogArchiveMode = false;
> char *XLogArchiveCommand = NULL;
> bool EnableHotStandby = false;
> bool fullPageWrites = true;
> +int compress_backup_block = false;

I think compress_backup_block should be initialised to
BACKUP_BLOCK_COMPRESSION_OFF. (But see below.)

> + for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
> + compressed_pages[j] = (char *) malloc(buffer_size);

Shouldn't this use palloc?

> + * Create a compressed version of a backup block
> + *
> + * If successful, return a compressed result and set 'len' to its length.
> + * Otherwise (ie, compressed result is actually bigger than original),
> + * return NULL.
> + */
> +static char *
> +CompressBackupBlock(char *page, uint32 orig_len, char *dest, uint32 *len)
> +{

First, the calling convention is a bit strange. I understand that you're
pre-allocating compressed_pages[] so as to avoid repeated allocations;
and that you're doing it outside CompressBackupBlock so as to avoid
passing in the index i. But the result is a little weird.

At the very minimum, I would move the "if (!compressed_pages_allocated)"
block outside the "for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)" loop, and
add some comments. I think we could live with that.

But I'm not at all fond of the code in this function either. I'd write
it like this:

struct varlena *buf = (struct varlena *) dest;

if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_SNAPPY)
{
if (pg_snappy_compress(page, BLCKSZ, buf) == EIO)
return NULL;
}
else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_LZ4)
{
if (pg_LZ4_compress(page, BLCKSZ, buf) == 0)
return NULL;
}
else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_PGLZ)
{
if (pglz_compress(page, BLCKSZ, (PGLZ_Header *) buf,
PGLZ_strategy_default) != 0)
return NULL;
}
else
elog(ERROR, "Wrong value for compress_backup_block GUC");

/*
* …comment about insisting on saving at least two bytes…
*/

if (VARSIZE(buf) >= orig_len - 2)
return NULL;

*len = VARHDRSIZE + VARSIZE(buf);

return buf;

I guess it doesn't matter *too* much if the intention is to have all
these compression algorithms only during development/testing and pick
just one in the end. But the above is considerably easier to read in
the meanwhile.

If we were going to keep multiple compression algorithms around, I'd be
inclined to create a "pg_compress(…, compression_algorithm)" function to
hide these return-value differences from the callers.

> + else if (VARATT_IS_COMPRESSED((struct varlena *) blk) && compress_backup_block!=BACKUP_BLOCK_COMPRESSION_OFF)
> + {
> + if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + {
> + int ret;
> + size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> + char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> + size_t s_uncompressed_length;
> +
> + ret = snappy_uncompressed_length(compressed_data,
> + compressed_length,
> + &s_uncompressed_length);
> + if (!ret)
> + elog(ERROR, "snappy: failed to determine compression length");
> + if (BLCKSZ != s_uncompressed_length)
> + elog(ERROR, "snappy: compression size mismatch %d != %zu",
> + BLCKSZ, s_uncompressed_length);
> +
> + ret = snappy_uncompress(compressed_data,
> + compressed_length,
> + page);
> + if (ret != 0)
> + elog(ERROR, "snappy: decompression failed: %d", ret);
> + }

…and a "pg_decompress()" function that does error checking.

> +static const struct config_enum_entry backup_block_compression_options[] = {
> + {"off", BACKUP_BLOCK_COMPRESSION_OFF, false},
> + {"false", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"no", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"0", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"pglz", BACKUP_BLOCK_COMPRESSION_PGLZ, true},
> + {"snappy", BACKUP_BLOCK_COMPRESSION_SNAPPY, true},
> + {"lz4", BACKUP_BLOCK_COMPRESSION_LZ4, true},
> + {NULL, 0, false}
> +};

Finally, I don't like the name "compress_backup_block".

1. It should have been plural (compress_backup_blockS).

2. Looking at the enum values, "backup_block_compression = x" would be a
better name anyway…

3. But we don't use the term "backup block" anywhere in the
documentation, and it's very likely to confuse people.

I don't mind the suggestion elsewhere in this thread to use
"full_page_compression = y" (as a setting alongside
"torn_page_protection = x").

I haven't tried the patch (other than applying and building it) yet. I
will do so after I hear what you and others think of the above points.

-- Abhijit

>
> But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to
> be able to apply to HEAD cleanly.

>Yes, and it needs quite some reformatting beyond fixing whitespace
>damage too (long lines, comment formatting, consistent spacing etc.).

Please find attached patches with no whitespace error and improved formatting.

Thank you,

On Fri, Jul 4, 2014 at 11:39 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:

At 2014-07-04 14:38:27 +0900, masao.fujii@gmail.com wrote:
>
> But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to
> be able to apply to HEAD cleanly.

Yes, and it needs quite some reformatting beyond fixing whitespace
damage too (long lines, comment formatting, consistent spacing etc.).

-- Abhijit

Thank you for review comments.

>There are still numerous formatting changes required, e.g. spaces around
>"=" and correct formatting of comments. And "git diff --check" still has
>a few whitespace problems. I won't point these out one by one, but maybe
>you should run pgindent

I will do this.

>Could you look into his suggestions of other places to do the

>allocation, please?

I will get back to you on this

>Wouldn't it be better to set

> bkpb->block_compression = compress_backup_block;

>once earlier instead of setting it that way once and setting it to
>BACKUP_BLOCK_COMPRESSION_OFF in two other places

Yes.

If you're using VARATT_IS_COMPRESSED() to detect compression, don't you

need SET_VARSIZE_COMPRESSED() in CompressBackupBlock? pglz_compress()
does it for you, but the other two algorithms don't.

Yes we need SET_VARSIZE_COMPRESSED. It is present in wrappers around snappy and LZ4 namely pg_snappy_compress and pg_LZ4_compress.

>But now that you've added bkpb.block_compression, you should be able to
>avoid VARATT_IS_COMPRESSED() altogether, unless I'm missing something.
>What do you think?

You are right. It can be removed.

Thank you,

On Fri, Jul 4, 2014 at 9:35 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:

At 2014-07-04 21:02:33 +0530, ams@2ndQuadrant.com wrote:
>
> > +/*
> > + */
> > +static const struct config_enum_entry backup_block_compression_options[] = {

Oh, I forgot to mention that the configuration setting changes are also
pending. I think we had a working consensus to use full_page_compression
as the name of the GUC. As I understand it, that'll accept an algorithm
name as an argument while we're still experimenting, but eventually once
we select an algorithm, it'll become just a boolean (and then we don't
need to put algorithm information into BkpBlock any more either).

-- Abhijit

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

09 July 2014, 18:56:36

>But though the code looks better locally than before, the larger problem
>is that this is still unsafe. As Pavan pointed out, XLogInsert is called
>from inside critical sections, so we can't allocate memory here.

>Could you look into his suggestions of other places to do the

>allocation, please?

If I understand correctly , the reason memory allocation is not allowed in critical section is because OOM error in critical section can lead to PANIC.

This patch does not report an OOM error on memory allocation failure instead it proceeds without compression of FPW if sufficient memory is not available for compression. Also, the memory is allocated just once very early in the session. So , the probability of OOM seems to be low and even if it occurs it is handled as mentioned above.

Though Andres said we cannot use malloc in critical section, the memory allocation done in the patch does not involve reporting OOM error in case of failure. IIUC, this eliminates the probability of PANIC in critical section. So, I think keeping this allocation in critical section should be fine. Am I missing something?

Thank you,

Rahila Syed

On Mon, Jul 7, 2014 at 4:43 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:

Thank you for review comments.

>There are still numerous formatting changes required, e.g. spaces around
>"=" and correct formatting of comments. And "git diff --check" still has
>a few whitespace problems. I won't point these out one by one, but maybe
>you should run pgindent

I will do this.

>Could you look into his suggestions of other places to do the
>allocation, please?

I will get back to you on this

>Wouldn't it be better to set

> bkpb->block_compression = compress_backup_block;

>once earlier instead of setting it that way once and setting it to
>BACKUP_BLOCK_COMPRESSION_OFF in two other places
Yes.

If you're using VARATT_IS_COMPRESSED() to detect compression, don't you
need SET_VARSIZE_COMPRESSED() in CompressBackupBlock? pglz_compress()
does it for you, but the other two algorithms don't.
Yes we need SET_VARSIZE_COMPRESSED. It is present in wrappers around snappy and LZ4 namely pg_snappy_compress and pg_LZ4_compress.

>But now that you've added bkpb.block_compression, you should be able to
>avoid VARATT_IS_COMPRESSED() altogether, unless I'm missing something.
>What do you think?
You are right. It can be removed.

Thank you,

On Fri, Jul 4, 2014 at 9:35 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
At 2014-07-04 21:02:33 +0530, ams@2ndQuadrant.com wrote:
>
> > +/*
> > + */
> > +static const struct config_enum_entry backup_block_compression_options[] = {

Oh, I forgot to mention that the configuration setting changes are also
pending. I think we had a working consensus to use full_page_compression
as the name of the GUC. As I understand it, that'll accept an algorithm
name as an argument while we're still experimenting, but eventually once
we select an algorithm, it'll become just a boolean (and then we don't
need to put algorithm information into BkpBlock any more either).

-- Abhijit

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

11 July 2014, 06:30:57

On 2014-07-04 19:27:10 +0530, Rahila Syed wrote:
> +    /* Allocates memory for compressed backup blocks according to the compression
> +     * algorithm used.Once per session at the time of insertion of first XLOG
> +     * record.
> +     * This memory stays till the end of session. OOM is handled by making the
> +     * code proceed without FPW compression*/
> +    static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
> +    static bool compressed_pages_allocated = false;
> +    if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
> +        compressed_pages_allocated!= true)
> +    {
> +        size_t buffer_size = VARHDRSZ;
> +        int j;
> +        if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> +            buffer_size += snappy_max_compressed_length(BLCKSZ);
> +        else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_LZ4)
> +            buffer_size += LZ4_compressBound(BLCKSZ);
> +        else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_PGLZ)
> +            buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);
> +        for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
> +        {    compressed_pages[j] = (char *) malloc(buffer_size);
> +            if(compressed_pages[j] == NULL)
> +            {
> +                compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
> +                break;
> +            }
> +        }
> +        compressed_pages_allocated = true;
> +    }

Why not do this in InitXLOGAccess() or similar?

>      /*
>       * Make additional rdata chain entries for the backup blocks, so that we
>       * don't need to special-case them in the write loop.  This modifies the
> @@ -1015,11 +1048,32 @@ begin:;
>          rdt->next = &(dtbuf_rdt2[i]);
>          rdt = rdt->next;
>  
> +        if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF)
> +        {
> +        /* Compress the backup block before including it in rdata chain */
> +            rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length,
> +                                            compressed_pages[i], &(rdt->len));
> +            if (rdt->data != NULL)
> +            {
> +                /*
> +                 * write_len is the length of compressed block and its varlena
> +                 * header
> +                 */
> +                write_len += rdt->len;
> +                bkpb->hole_length = BLCKSZ - rdt->len;
> +                /*Adding information about compression in the backup block header*/
> +                bkpb->block_compression=compress_backup_block;
> +                rdt->next = NULL;
> +                continue;
> +            }
> +        }
> +

So, you're compressing backup blocks one by one. I wonder if that's the
right idea and if we shouldn't instead compress all of them in one run to
increase the compression ratio.


> +/*
>   * Get a pointer to the right location in the WAL buffer containing the
>   * given XLogRecPtr.
>   *
> @@ -4061,6 +4174,50 @@ RestoreBackupBlockContents(XLogRecPtr lsn, BkpBlock bkpb, char *blk,
>      {
>          memcpy((char *) page, blk, BLCKSZ);
>      }
> +    /* Decompress if backup block is compressed*/
> +    else if (VARATT_IS_COMPRESSED((struct varlena *) blk)
> +                && bkpb.block_compression!=BACKUP_BLOCK_COMPRESSION_OFF)
> +    {
> +        if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> +        {
> +            int ret;
> +            size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> +            char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> +            size_t s_uncompressed_length;
> +
> +            ret = snappy_uncompressed_length(compressed_data,
> +                    compressed_length,
> +                    &s_uncompressed_length);
> +            if (!ret)
> +                elog(ERROR, "snappy: failed to determine compression length");
> +            if (BLCKSZ != s_uncompressed_length)
> +                elog(ERROR, "snappy: compression size mismatch %d != %zu",
> +                        BLCKSZ, s_uncompressed_length);
> +
> +            ret = snappy_uncompress(compressed_data,
> +                    compressed_length,
> +                    page);
> +            if (ret != 0)
> +                elog(ERROR, "snappy: decompression failed: %d", ret);
> +        }
> +        else if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_LZ4)
> +        {
> +            int ret;
> +            size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> +            char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> +            ret = LZ4_decompress_fast(compressed_data, page,
> +                    BLCKSZ);
> +            if (ret != compressed_length)
> +                elog(ERROR, "lz4: decompression size mismatch: %d vs %zu", ret,
> +                        compressed_length);
> +        }
> +        else if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_PGLZ)
> +        {
> +            pglz_decompress((PGLZ_Header *) blk, (char *) page);
> +        }
> +        else
> +            elog(ERROR, "Wrong value for compress_backup_block GUC");
> +    }
>      else
>      {
>          memcpy((char *) page, blk, bkpb.hole_offset);

So why aren't we compressing the hole here instead of compressing the
parts that the current logic deems to be filled with important information?

>  /*
>   * Options for enum values stored in other modules
>   */
> @@ -3498,6 +3512,16 @@ static struct config_enum ConfigureNamesEnum[] =
>          NULL, NULL, NULL
>      },
>  
> +    {
> +        {"compress_backup_block", PGC_SIGHUP, WAL_SETTINGS,
> +            gettext_noop("Compress backup block in WAL using specified compression algorithm."),
> +            NULL
> +        },
> +        &compress_backup_block,
> +        BACKUP_BLOCK_COMPRESSION_OFF, backup_block_compression_options,
> +        NULL, NULL, NULL
> +    },
> +

This should be named 'compress_full_page_writes' or so, even if a
temporary guc. There's the 'full_page_writes' guc and I see little
reaason to deviate from its name.

Greetings,

Andres Freund

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

11 July 2014, 13:13:29

Thank you for review.

>So, you're compressing backup blocks one by one. I wonder if that's the
>right idea and if we shouldn't instead compress all of them in one run to
>increase the compression ratio.

The idea behind compressing blocks one by one was to keep the code as much similar to the original as possible.

For instance the easiest change I could think of is , if we compress all backup blocks of a WAL record together the below format of WAL record might change

Fixed-size header (XLogRecord struct)

rmgr-specific data

BkpBlock

backup block data

BkpBlock

backup block data

....

Fixed-size header (XLogRecord struct)

rmgr-specific data

BkpBlock

backup blocks data

...

But at the same time, it can be worth giving a try to see if there is significant improvement in compression .

>So why aren't we compressing the hole here instead of compressing the
>parts that the current logic deems to be filled with important information?

Entire full page image in the WAL record is compressed. The unimportant part of the full page image which is hole is not WAL logged in original code. This patch compresses entire full page image inclusive of hole. This can be optimized by omitting hole in the compressed FPW(incase hole is filled with non-zeros) like the original uncompressed FPW . But this can lead to change in BkpBlock structure.

>This should be named 'compress_full_page_writes' or so, even if a
>temporary guc. There's the 'full_page_writes' guc and I see little
>reaason to deviate from its name.

Yes. This will be renamed to full_page_compression according to suggestions earlier in the discussion.

Thank you,

Rahila Syed

On Fri, Jul 11, 2014 at 12:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-07-04 19:27:10 +0530, Rahila Syed wrote:
> + /* Allocates memory for compressed backup blocks according to the compression
> + * algorithm used.Once per session at the time of insertion of first XLOG
> + * record.
> + * This memory stays till the end of session. OOM is handled by making the
> + * code proceed without FPW compression*/
> + static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
> + static bool compressed_pages_allocated = false;
> + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
> + compressed_pages_allocated!= true)
> + {
> + size_t buffer_size = VARHDRSZ;
> + int j;
> + if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + buffer_size += snappy_max_compressed_length(BLCKSZ);
> + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_LZ4)
> + buffer_size += LZ4_compressBound(BLCKSZ);
> + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_PGLZ)
> + buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);
> + for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
> + { compressed_pages[j] = (char *) malloc(buffer_size);
> + if(compressed_pages[j] == NULL)
> + {
> + compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
> + break;
> + }
> + }
> + compressed_pages_allocated = true;
> + }

Why not do this in InitXLOGAccess() or similar?

> /*
> * Make additional rdata chain entries for the backup blocks, so that we
> * don't need to special-case them in the write loop. This modifies the
> @@ -1015,11 +1048,32 @@ begin:;
> rdt->next = &(dtbuf_rdt2[i]);
> rdt = rdt->next;
>
> + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF)
> + {
> + /* Compress the backup block before including it in rdata chain */
> + rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length,
> + compressed_pages[i], &(rdt->len));
> + if (rdt->data != NULL)
> + {
> + /*
> + * write_len is the length of compressed block and its varlena
> + * header
> + */
> + write_len += rdt->len;
> + bkpb->hole_length = BLCKSZ - rdt->len;
> + /*Adding information about compression in the backup block header*/
> + bkpb->block_compression=compress_backup_block;
> + rdt->next = NULL;
> + continue;
> + }
> + }
> +

So, you're compressing backup blocks one by one. I wonder if that's the
right idea and if we shouldn't instead compress all of them in one run to
increase the compression ratio.

> +/*
> * Get a pointer to the right location in the WAL buffer containing the
> * given XLogRecPtr.
> *
> @@ -4061,6 +4174,50 @@ RestoreBackupBlockContents(XLogRecPtr lsn, BkpBlock bkpb, char *blk,
> {
> memcpy((char *) page, blk, BLCKSZ);
> }
> + /* Decompress if backup block is compressed*/
> + else if (VARATT_IS_COMPRESSED((struct varlena *) blk)
> + && bkpb.block_compression!=BACKUP_BLOCK_COMPRESSION_OFF)
> + {
> + if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + {
> + int ret;
> + size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> + char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> + size_t s_uncompressed_length;
> +
> + ret = snappy_uncompressed_length(compressed_data,
> + compressed_length,
> + &s_uncompressed_length);
> + if (!ret)
> + elog(ERROR, "snappy: failed to determine compression length");
> + if (BLCKSZ != s_uncompressed_length)
> + elog(ERROR, "snappy: compression size mismatch %d != %zu",
> + BLCKSZ, s_uncompressed_length);
> +
> + ret = snappy_uncompress(compressed_data,
> + compressed_length,
> + page);
> + if (ret != 0)
> + elog(ERROR, "snappy: decompression failed: %d", ret);
> + }
> + else if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_LZ4)
> + {
> + int ret;
> + size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> + char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> + ret = LZ4_decompress_fast(compressed_data, page,
> + BLCKSZ);
> + if (ret != compressed_length)
> + elog(ERROR, "lz4: decompression size mismatch: %d vs %zu", ret,
> + compressed_length);
> + }
> + else if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_PGLZ)
> + {
> + pglz_decompress((PGLZ_Header *) blk, (char *) page);
> + }
> + else
> + elog(ERROR, "Wrong value for compress_backup_block GUC");
> + }
> else
> {
> memcpy((char *) page, blk, bkpb.hole_offset);

So why aren't we compressing the hole here instead of compressing the
parts that the current logic deems to be filled with important information?

> /*
> * Options for enum values stored in other modules
> */
> @@ -3498,6 +3512,16 @@ static struct config_enum ConfigureNamesEnum[] =
> NULL, NULL, NULL
> },
>
> + {
> + {"compress_backup_block", PGC_SIGHUP, WAL_SETTINGS,
> + gettext_noop("Compress backup block in WAL using specified compression algorithm."),
> + NULL
> + },
> + &compress_backup_block,
> + BACKUP_BLOCK_COMPRESSION_OFF, backup_block_compression_options,
> + NULL, NULL, NULL
> + },
> +

This should be named 'compress_full_page_writes' or so, even if a
temporary guc. There's the 'full_page_writes' guc and I see little
reaason to deviate from its name.

Greetings,

Andres Freund

Re: [REVIEW] Re: Compression of full-page-writes

From

Pavan Deolasee

Date:

23 July 2014, 08:21:34

I'm trying to understand what would it take to have this patch in an acceptable form before the next commitfest. Both Abhijit and Andres has done some extensive review of the patch and have given many useful suggestions to Rahila. While she has incorporated most of them, I feel we are still some distance away from having something which can be committed. Here are my observations based on the discussion on this thread so far.

1. Need for compressing full page backups:

There are good number of benchmarks done by various people on this list which clearly shows the need of the feature. Many people have already voiced their agreement on having this in core, even as a configurable parameter. There had been some requests to have more benchmarks such as response times immediately after a checkpoint or CPU consumption which I'm not entirely sure if already done.

2. Need for different compression algorithms:

There were requests for comparing different compression algorithms such as LZ4 and snappy. Based on the numbers that Rahila has posted, I can see LZ4 has the best compression ratio, at least for TPC-C benchmarks she tried. Having said that, I was hoping to see more numbers in terms of CPU resource utilization which will demonstrate the trade-off, if any. Anyways, there were also apprehensions expressed about whether to have pluggable algorithm in the final patch that gets committed. If we do decide to support more compression algorithms, I like what Andres had done before i.e. store the compression algorithm information in the varlena header. So basically, we should have a abstract API which can take a buffer and the desired algorithm and returns compressed data, along with varlena header with encoded information. ISTM that the patch Andres had posted earlier was focused primarily on toast data, but I think we can make it more generic so that both toast and FPW can use it.

Having said that, IMHO we should go one step at a time. We are using pglz for compressing toast data for long, so we can continue to use the same for compressing full page images. We can simultaneously work on adding more algorithms to core and choose the right candidate for different scenarios such as toast or FPW based on test evidences. But that work can happen independent of this patch.

3. Compressing one block vs all blocks:

Andres suggested that compressing all backup blocks in one go may give us better compression ratio. This is worth trying. I'm wondering what would the best way to do so without minimal changes to the xlog insertion code. Today, we add more rdata items for backup block header(s) and backup blocks themselves (if there is a "hole" then 2 per backup block) beyond what the caller has supplied. If we have to compress all the backup blocks together, then one approach is to copy the backup block headers and the blocks to a temp buffer, compress that and replace the rdata entries added previously with a single rdata. Is there a better way to handle multiple blocks in one go?

We still need a way to tell the restore path that the wal data is compressed. One way is to always add a varlena header irrespective of whether the blocks are compressed or not. This looks overkill. Another way to add a new field to XLogRecord to record this information. Looks like we can do this without increasing the size of the header since there are 2 bytes padding after the xl_rmid field.

4. Handling holes in backup blocks:

I think we address (3) then this can be easily done. Alternatively, we can also memzero the "hole" and then compress the entire page. The compression algorithm should handle that well.

Thoughts/comments?

Thanks,

Pavan

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

05 August 2014, 12:55:34

On Wed, Jul 23, 2014 at 5:21 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> 1. Need for compressing full page backups:
> There are good number of benchmarks done by various people on this list
> which clearly shows the need of the feature. Many people have already voiced
> their agreement on having this in core, even as a configurable parameter.

Yes!

> Having said that, IMHO we should go one step at a time. We are using pglz
> for compressing toast data for long, so we can continue to use the same for
> compressing full page images. We can simultaneously work on adding more
> algorithms to core and choose the right candidate for different scenarios
> such as toast or FPW based on test evidences. But that work can happen
> independent of this patch.

This gradual approach looks good to me. And, if the additional compression
algorithm like lz4 is always better than pglz for every scenarios, we can just
change the code so that the additional algorithm is always used. Which would
make the code simpler.

> 3. Compressing one block vs all blocks:
> Andres suggested that compressing all backup blocks in one go may give us
> better compression ratio. This is worth trying. I'm wondering what would the
> best way to do so without minimal changes to the xlog insertion code. Today,
> we add more rdata items for backup block header(s) and backup blocks
> themselves (if there is a "hole" then 2 per backup block) beyond what the
> caller has supplied. If we have to compress all the backup blocks together,
> then one approach is to copy the backup block headers and the blocks to a
> temp buffer, compress that and replace the rdata entries added previously
> with a single rdata.

Basically sounds reasonable. But, how does this logic work if there are
multiple rdata and only some of them are backup blocks?

If a "hole" is not copied to that temp buffer, ISTM that we should
change backup block header  so that it contains the info for a
"hole", e.g., location that a "hole" starts. No?

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Pavan Deolasee

Date:

06 August 2014, 08:03:43

On Tue, Aug 5, 2014 at 6:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

This gradual approach looks good to me. And, if the additional compression
algorithm like lz4 is always better than pglz for every scenarios, we can just
change the code so that the additional algorithm is always used. Which would
make the code simpler.

Right.

> 3. Compressing one block vs all blocks:
> Andres suggested that compressing all backup blocks in one go may give us
> better compression ratio. This is worth trying. I'm wondering what would the
> best way to do so without minimal changes to the xlog insertion code. Today,
> we add more rdata items for backup block header(s) and backup blocks
> themselves (if there is a "hole" then 2 per backup block) beyond what the
> caller has supplied. If we have to compress all the backup blocks together,
> then one approach is to copy the backup block headers and the blocks to a
> temp buffer, compress that and replace the rdata entries added previously
> with a single rdata.

Basically sounds reasonable. But, how does this logic work if there are
multiple rdata and only some of them are backup blocks?

My idea is to just make a pass over the rdata entries past the rdt_lastnormal element after processing the backup blocks and making additional entries in the chain. These additional rdata entries correspond to the backup blocks and their headers. So we can copy the rdata->data of these elements in a temp buffer and compress the entire thing in one go. We can then replace the rdata chain past the rdt_lastnormal with a single rdata with data pointing to the compressed data. Recovery code just needs to decompress this data the record header indicates that the backup data is compressed. Sure the exact mechanism to indicate if the data is compressed (and by which algorithm) can be worked out.

If a "hole" is not copied to that temp buffer, ISTM that we should
change backup block header so that it contains the info for a
"hole", e.g., location that a "hole" starts. No?

AFAICS its not required if we compress the stream of BkpBlock and the block data. The current mechanism of constructing the additional rdata chain items takes care of hole anyways.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

16 August 2014, 09:51:32

>So, you're compressing backup blocks one by one. I wonder if that's the
>right idea and if we shouldn't instead compress all of them in one run to
>increase the compression ratio

Please find attached patch for compression of all blocks of a record together .

Following are the measurement results:

Benchmark:

Scale : 16
Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime 550,250,250,200,200

Warmup time : 1 sec
Measurement time : 900 sec
Number of tx types : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit : false

Checkpoint segments:1024
Checkpoint timeout:5 mins

Compression Multiple Blocks in one run Single Block in one run

Bytes saved 0 0

OFF WAL generated 1265150984(~1265MB) 1264771760(~1265MB)

% Compression NA NA

Bytes saved 215215079 (~215MB) 285675622 (~286MB)

LZ4 WAL generated 125118783(~1251MB) 1329031918(~1329MB)

% Compression 17.2 % 21.49 %

Bytes saved 203705959 (~204MB) 271009408 (~271MB)

Snappy WAL generated 1254505415(~1254MB) 1329628352(~1330MB)

% Compression 16.23 % 20.38%

Bytes saved 155910177(~156MB) 182804997(~182MB)

pglz WAL generated 1259773129(~1260MB) 1286670317(~1287MB)

% Compression 12.37% 14.21%

As per measurement results of this benchmark, compression of multiple blocks didn't improve compression ratio over compression of single block.

LZ4 outperforms Snappy and pglz in terms of compression ratio.

Thank you,

On Fri, Jul 11, 2014 at 12:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-07-04 19:27:10 +0530, Rahila Syed wrote:
> + /* Allocates memory for compressed backup blocks according to the compression
> + * algorithm used.Once per session at the time of insertion of first XLOG
> + * record.
> + * This memory stays till the end of session. OOM is handled by making the
> + * code proceed without FPW compression*/
> + static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
> + static bool compressed_pages_allocated = false;
> + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
> + compressed_pages_allocated!= true)
> + {
> + size_t buffer_size = VARHDRSZ;
> + int j;
> + if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + buffer_size += snappy_max_compressed_length(BLCKSZ);
> + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_LZ4)
> + buffer_size += LZ4_compressBound(BLCKSZ);
> + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_PGLZ)
> + buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);
> + for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
> + { compressed_pages[j] = (char *) malloc(buffer_size);
> + if(compressed_pages[j] == NULL)
> + {
> + compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
> + break;
> + }
> + }
> + compressed_pages_allocated = true;
> + }

Why not do this in InitXLOGAccess() or similar?

> /*
> * Make additional rdata chain entries for the backup blocks, so that we
> * don't need to special-case them in the write loop. This modifies the
> @@ -1015,11 +1048,32 @@ begin:;
> rdt->next = &(dtbuf_rdt2[i]);
> rdt = rdt->next;
>
> + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF)
> + {
> + /* Compress the backup block before including it in rdata chain */
> + rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length,
> + compressed_pages[i], &(rdt->len));
> + if (rdt->data != NULL)
> + {
> + /*
> + * write_len is the length of compressed block and its varlena
> + * header
> + */
> + write_len += rdt->len;
> + bkpb->hole_length = BLCKSZ - rdt->len;
> + /*Adding information about compression in the backup block header*/
> + bkpb->block_compression=compress_backup_block;
> + rdt->next = NULL;
> + continue;
> + }
> + }
> +

So, you're compressing backup blocks one by one. I wonder if that's the
right idea and if we shouldn't instead compress all of them in one run to
increase the compression ratio.

> +/*
> * Get a pointer to the right location in the WAL buffer containing the
> * given XLogRecPtr.
> *
> @@ -4061,6 +4174,50 @@ RestoreBackupBlockContents(XLogRecPtr lsn, BkpBlock bkpb, char *blk,
> {
> memcpy((char *) page, blk, BLCKSZ);
> }
> + /* Decompress if backup block is compressed*/
> + else if (VARATT_IS_COMPRESSED((struct varlena *) blk)
> + && bkpb.block_compression!=BACKUP_BLOCK_COMPRESSION_OFF)
> + {
> + if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + {
> + int ret;
> + size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> + char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> + size_t s_uncompressed_length;
> +
> + ret = snappy_uncompressed_length(compressed_data,
> + compressed_length,
> + &s_uncompressed_length);
> + if (!ret)
> + elog(ERROR, "snappy: failed to determine compression length");
> + if (BLCKSZ != s_uncompressed_length)
> + elog(ERROR, "snappy: compression size mismatch %d != %zu",
> + BLCKSZ, s_uncompressed_length);
> +
> + ret = snappy_uncompress(compressed_data,
> + compressed_length,
> + page);
> + if (ret != 0)
> + elog(ERROR, "snappy: decompression failed: %d", ret);
> + }
> + else if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_LZ4)
> + {
> + int ret;
> + size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> + char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> + ret = LZ4_decompress_fast(compressed_data, page,
> + BLCKSZ);
> + if (ret != compressed_length)
> + elog(ERROR, "lz4: decompression size mismatch: %d vs %zu", ret,
> + compressed_length);
> + }
> + else if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_PGLZ)
> + {
> + pglz_decompress((PGLZ_Header *) blk, (char *) page);
> + }
> + else
> + elog(ERROR, "Wrong value for compress_backup_block GUC");
> + }
> else
> {
> memcpy((char *) page, blk, bkpb.hole_offset);

So why aren't we compressing the hole here instead of compressing the
parts that the current logic deems to be filled with important information?

> /*
> * Options for enum values stored in other modules
> */
> @@ -3498,6 +3512,16 @@ static struct config_enum ConfigureNamesEnum[] =
> NULL, NULL, NULL
> },
>
> + {
> + {"compress_backup_block", PGC_SIGHUP, WAL_SETTINGS,
> + gettext_noop("Compress backup block in WAL using specified compression algorithm."),
> + NULL
> + },
> + &compress_backup_block,
> + BACKUP_BLOCK_COMPRESSION_OFF, backup_block_compression_options,
> + NULL, NULL, NULL
> + },
> +

This should be named 'compress_full_page_writes' or so, even if a
temporary guc. There's the 'full_page_writes' guc and I see little
reaason to deviate from its name.

Greetings,

Andres Freund

Attachment

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

18 August 2014, 03:24:53

On Sat, Aug 16, 2014 at 6:51 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>So, you're compressing backup blocks one by one. I wonder if that's the
>>right idea and if we shouldn't instead compress all of them in one run to
>>increase the compression ratio
>
> Please find attached patch for compression of all blocks of a record
> together .
>
> Following are the measurement results:
>
>
> Benchmark:
>
> Scale : 16
> Command  :java JR  /home/postgres/jdbcrunner-1.2/scripts/tpcc.js  -sleepTime
> 550,250,250,200,200
>
> Warmup time          : 1 sec
> Measurement time     : 900 sec
> Number of tx types   : 5
> Number of agents     : 16
> Connection pool size : 16
> Statement cache size : 40
> Auto commit          : false
>
>
> Checkpoint segments:1024
> Checkpoint timeout:5 mins
>
>
>
>
> Compression                                                       Multiple
> Blocks in one run                    Single Block in one run
>
>                                           Bytes saved
> 0                                                           0
>
>
>
> OFF                                   WAL generated
> 1265150984(~1265MB)                   1264771760(~1265MB)
>
>
>
>                                          % Compression
> NA                                                          NA
>
>
>
>
>                                           Bytes saved
> 215215079 (~215MB)                       285675622 (~286MB)
>
>
>
> LZ4                                   WAL generated
> 125118783(~1251MB)                    1329031918(~1329MB)
>
>
>
>                                          % Compression                  17.2
> %                                                 21.49 %
>
>
>
>
>                                          Bytes saved
> 203705959 (~204MB)                      271009408 (~271MB)
>
>
>
> Snappy                           WAL generated
> 1254505415(~1254MB)                  1329628352(~1330MB)
>
>
>
>                                          % Compression                16.23
> %                                               20.38%
>
>
>
>
>                                          Bytes saved
> 155910177(~156MB)                       182804997(~182MB)
>
>
>
> pglz                                 WAL generated
> 1259773129(~1260MB)                 1286670317(~1287MB)
>
>
>
>                                          % Compression                12.37%
> 14.21%
>
>
>
>
>
> As per measurement results of this benchmark, compression of multiple blocks
> didn't improve compression ratio over compression  of single block.

According to the measurement result, the amount of WAL generated in
"Multiple Blocks in one run" than that in "Single Block in one run".
So ISTM that compression of multiple blocks at one run can improve
the compression ratio. Am I missing something?

Regards,

-- 
Fujii Masao

Fwd: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

18 August 2014, 11:19:54

>According to the measurement result, the amount of WAL generated in

>"Multiple Blocks in one run" than that in "Single Block in one run".

>So ISTM that compression of multiple blocks at one run can improve

>the compression ratio. Am I missing something?

Sorry for using unclear terminology. WAL generated here means WAL that gets generated in each run without compression.

So, the value WAL generated in the above measurement is uncompressed WAL generated to be specific.

uncompressed WAL = compressed WAL + Bytes saved.

Here, the measurements are done for a constant amount of time rather than fixed number of transactions. Hence amount of WAL generated does not correspond to compression ratios of each algo. Hence have calculated bytes saved in order to get accurate idea of the amount of compression in each scenario and for various algorithms.

Compression ratio i.e Uncompressed WAL/compressed WAL in each of the above scenarios are as follows:

Compression algo Multiple Blocks in one run Single Block in one run

LZ4 1.21 1.27

Snappy 1.19 1.25

pglz 1.14 1.16

This shows compression ratios of both the scenarios Multiple blocks and single block are nearly same for this benchmark.

Thank you,

Rahila Syed

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

18 August 2014, 17:06:20

On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>According to the measurement result, the amount of WAL generated in
>>"Multiple Blocks in one run" than that in "Single Block in one run".
>>So ISTM that compression of multiple blocks at one run can improve
>>the compression ratio. Am I missing something?
>
> Sorry for using unclear terminology. WAL generated here means WAL that gets
> generated in each run without compression.
> So, the value WAL generated in the  above measurement is uncompressed WAL
> generated to be specific.
> uncompressed WAL = compressed WAL  + Bytes saved.
>
> Here, the measurements are done for a constant amount of time rather than
> fixed number of transactions. Hence amount of WAL generated does not
> correspond to compression ratios of each algo. Hence have calculated bytes
> saved in order to get accurate idea of the amount of compression in each
> scenario and for various algorithms.
>
> Compression ratio i.e Uncompressed WAL/compressed WAL in each of the above
> scenarios are as follows:
>
> Compression algo       Multiple Blocks in one run    Single Block in one run
>
> LZ4                              1.21                                   1.27
>
> Snappy                        1.19                                   1.25
>
> pglz                             1.14                                   1.16
>
> This shows compression ratios of both the scenarios Multiple blocks and
> single block  are nearly same for this benchmark.

I don't agree with that conclusion.  The difference between 1.21 and
1.27, or between 1.19 and 1.25, is quite significant.  Even the
difference beyond 1.14 and 1.16 is not trivial.  We should try to get
the larger benefit, if it is possible to do so without an unreasonable
effort.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

18 August 2014, 17:09:05

On 2014-08-18 13:06:15 -0400, Robert Haas wrote:
> On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> >>According to the measurement result, the amount of WAL generated in
> >>"Multiple Blocks in one run" than that in "Single Block in one run".
> >>So ISTM that compression of multiple blocks at one run can improve
> >>the compression ratio. Am I missing something?
> >
> > Sorry for using unclear terminology. WAL generated here means WAL that gets
> > generated in each run without compression.
> > So, the value WAL generated in the  above measurement is uncompressed WAL
> > generated to be specific.
> > uncompressed WAL = compressed WAL  + Bytes saved.
> >
> > Here, the measurements are done for a constant amount of time rather than
> > fixed number of transactions. Hence amount of WAL generated does not
> > correspond to compression ratios of each algo. Hence have calculated bytes
> > saved in order to get accurate idea of the amount of compression in each
> > scenario and for various algorithms.
> >
> > Compression ratio i.e Uncompressed WAL/compressed WAL in each of the above
> > scenarios are as follows:
> >
> > Compression algo       Multiple Blocks in one run    Single Block in one run
> >
> > LZ4                              1.21                                   1.27
> >
> > Snappy                        1.19                                   1.25
> >
> > pglz                             1.14                                   1.16
> >
> > This shows compression ratios of both the scenarios Multiple blocks and
> > single block  are nearly same for this benchmark.
> 
> I don't agree with that conclusion.  The difference between 1.21 and
> 1.27, or between 1.19 and 1.25, is quite significant.  Even the
> difference beyond 1.14 and 1.16 is not trivial.  We should try to get
> the larger benefit, if it is possible to do so without an unreasonable
> effort.

Agreed.

One more question: Do I see it right that multiple blocks compressed
together compress *worse* than compressing individual blocks? If so, I
have a rather hard time believing that the patch is sane.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

18 August 2014, 17:10:57

On Thu, Jul 3, 2014 at 3:58 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Updated version of patches are attached.
> Changes are as follows
> 1. Improved readability of the code as per the review comments.
> 2. Addition of block_compression field in BkpBlock structure to store
> information about compression of block. This provides for switching
> compression on/off and changing compression algorithm as required.
> 3.Handling of OOM in critical section by checking for return value of malloc
> and proceeding without compression of FPW if return value is NULL.

So, it seems like you're basically using malloc to work around the
fact that a palloc failure is an error, and we can't throw an error in
a critical section.  I don't think that's good; we want all of our
allocations, as far as possible, to be tracked via palloc.  It might
be a good idea to add a new variant of palloc or MemoryContextAlloc
that returns NULL on failure instead of throwing an error; I've wanted
that once or twice.  But in this particular case, I'm not quite seeing
why it should be necessary - the number of backup blocks per record is
limited to some pretty small number, so it ought to be possible to
preallocate enough memory to compress them all, perhaps just by
declaring a global variable like char wal_compression_space[8192]; or
whatever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

19 August 2014, 07:48:06

On Tue, Aug 19, 2014 at 2:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-08-18 13:06:15 -0400, Robert Haas wrote:
>> On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>> >>According to the measurement result, the amount of WAL generated in
>> >>"Multiple Blocks in one run" than that in "Single Block in one run".
>> >>So ISTM that compression of multiple blocks at one run can improve
>> >>the compression ratio. Am I missing something?
>> >
>> > Sorry for using unclear terminology. WAL generated here means WAL that gets
>> > generated in each run without compression.
>> > So, the value WAL generated in the  above measurement is uncompressed WAL
>> > generated to be specific.
>> > uncompressed WAL = compressed WAL  + Bytes saved.
>> >
>> > Here, the measurements are done for a constant amount of time rather than
>> > fixed number of transactions. Hence amount of WAL generated does not
>> > correspond to compression ratios of each algo. Hence have calculated bytes
>> > saved in order to get accurate idea of the amount of compression in each
>> > scenario and for various algorithms.
>> >
>> > Compression ratio i.e Uncompressed WAL/compressed WAL in each of the above
>> > scenarios are as follows:
>> >
>> > Compression algo       Multiple Blocks in one run    Single Block in one run
>> >
>> > LZ4                              1.21                                   1.27
>> >
>> > Snappy                        1.19                                   1.25
>> >
>> > pglz                             1.14                                   1.16
>> >
>> > This shows compression ratios of both the scenarios Multiple blocks and
>> > single block  are nearly same for this benchmark.
>>
>> I don't agree with that conclusion.  The difference between 1.21 and
>> 1.27, or between 1.19 and 1.25, is quite significant.  Even the
>> difference beyond 1.14 and 1.16 is not trivial.  We should try to get
>> the larger benefit, if it is possible to do so without an unreasonable
>> effort.
>
> Agreed.
>
> One more question: Do I see it right that multiple blocks compressed
> together compress *worse* than compressing individual blocks? If so, I
> have a rather hard time believing that the patch is sane.

Or the way of benchmark might have some problems.

Rahila,
I'd like to measure the compression ratio in both multiple blocks and
single block cases.
Could you tell me where the patch for "single block in one run" is?

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

19 August 2014, 09:37:10

Hello,

Thank you for comments.

>Could you tell me where the patch for "single block in one run" is?

Please find attached patch for single block compression in one run.

Thank you,

On Tue, Aug 19, 2014 at 1:17 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Aug 19, 2014 at 2:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-08-18 13:06:15 -0400, Robert Haas wrote:
>> On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>> >>According to the measurement result, the amount of WAL generated in
>> >>"Multiple Blocks in one run" than that in "Single Block in one run".
>> >>So ISTM that compression of multiple blocks at one run can improve
>> >>the compression ratio. Am I missing something?
>> >
>> > Sorry for using unclear terminology. WAL generated here means WAL that gets
>> > generated in each run without compression.
>> > So, the value WAL generated in the above measurement is uncompressed WAL
>> > generated to be specific.
>> > uncompressed WAL = compressed WAL + Bytes saved.
>> >
>> > Here, the measurements are done for a constant amount of time rather than
>> > fixed number of transactions. Hence amount of WAL generated does not
>> > correspond to compression ratios of each algo. Hence have calculated bytes
>> > saved in order to get accurate idea of the amount of compression in each
>> > scenario and for various algorithms.
>> >
>> > Compression ratio i.e Uncompressed WAL/compressed WAL in each of the above
>> > scenarios are as follows:
>> >
>> > Compression algo Multiple Blocks in one run Single Block in one run
>> >
>> > LZ4 1.21 1.27
>> >
>> > Snappy 1.19 1.25
>> >
>> > pglz 1.14 1.16
>> >
>> > This shows compression ratios of both the scenarios Multiple blocks and
>> > single block are nearly same for this benchmark.
>>
>> I don't agree with that conclusion. The difference between 1.21 and
>> 1.27, or between 1.19 and 1.25, is quite significant. Even the
>> difference beyond 1.14 and 1.16 is not trivial. We should try to get
>> the larger benefit, if it is possible to do so without an unreasonable
>> effort.
>
> Agreed.
>
> One more question: Do I see it right that multiple blocks compressed
> together compress *worse* than compressing individual blocks? If so, I
> have a rather hard time believing that the patch is sane.

Or the way of benchmark might have some problems.

Rahila,
I'd like to measure the compression ratio in both multiple blocks and
single block cases.
Could you tell me where the patch for "single block in one run" is?

Regards,

--
Fujii Masao

Attachment

CompressSingleBlock.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

19 August 2014, 18:36:53

>So, it seems like you're basically using malloc to work around the
>fact that a palloc failure is an error, and we can't throw an error in
>a critical section. I don't think that's good; we want all of our
>allocations, as far as possible, to be tracked via palloc. It might
>be a good idea to add a new variant of palloc or MemoryContextAlloc
>that returns NULL on failure instead of throwing an error; I've wanted
>that once or twice. But in this particular case, I'm not quite seeing
>why it should be necessary

I am using malloc to return NULL in case of failure and proceed without compression of FPW ,if it returns NULL.

Proceeding without compression seems to be more accurate than throwing an error and exiting because of failure to allocate memory for compression.

>the number of backup blocks per record is

>limited to some pretty small number, so it ought to be possible to
>preallocate enough memory to compress them all, perhaps just by
>declaring a global variable like char wal_compression_space[8192]; or
>whatever.

In the updated patch a static global variable is added to which memory is allocated from heap using malloc outside critical section. The size of the memory block is 4 * BkpBlock header + 4 * BLCKSZ.

Thank you,

On Mon, Aug 18, 2014 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jul 3, 2014 at 3:58 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Updated version of patches are attached.
> Changes are as follows
> 1. Improved readability of the code as per the review comments.
> 2. Addition of block_compression field in BkpBlock structure to store
> information about compression of block. This provides for switching
> compression on/off and changing compression algorithm as required.
> 3.Handling of OOM in critical section by checking for return value of malloc
> and proceeding without compression of FPW if return value is NULL.

So, it seems like you're basically using malloc to work around the
fact that a palloc failure is an error, and we can't throw an error in
a critical section. I don't think that's good; we want all of our
allocations, as far as possible, to be tracked via palloc. It might
be a good idea to add a new variant of palloc or MemoryContextAlloc
that returns NULL on failure instead of throwing an error; I've wanted
that once or twice. But in this particular case, I'm not quite seeing
why it should be necessary - the number of backup blocks per record is
limited to some pretty small number, so it ought to be possible to
preallocate enough memory to compress them all, perhaps just by
declaring a global variable like char wal_compression_space[8192]; or
whatever.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

26 August 2014, 12:14:56

On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
> Thank you for comments.
>
>>Could you tell me where the patch for "single block in one run" is?
> Please find attached patch for single block compression in one run.

Thanks! I ran the benchmark using pgbench and compared the results.
I'd like to share the results.

[RESULT]
Amount of WAL generated during the benchmark. Unit is MB.
               Multiple                Single   off            202.0                201.5   on            6051.0
       6053.0   pglz            3543.0                3567.0   lz4            3344.0                3485.0   snappy
      3354.0                3449.5

Latency average during the benchmark. Unit is ms.
               Multiple                Single   off            19.1                19.0   on            55.3
   57.3   pglz            45.0                45.9   lz4            44.2                44.7   snappy            43.4
            43.3

These results show that FPW compression is really helpful for decreasing
the WAL volume and improving the performance.

The compression ratio by lz4 or snappy is better than that by pglz. But
it's difficult to conclude which lz4 or snappy is best, according to these
results.

ISTM that compression-of-multiple-pages-at-a-time approach can compress
WAL more than compression-of-single-... does.

[HOW TO BENCHMARK]
Create pgbench database with scall factor 1000.

Change the data type of the column "filler" on each pgbench table
from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
gen_random_uuid() in order to avoid empty column, e.g.,
alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text

After creating the test database, run the pgbench as follows. The
number of transactions executed during benchmark is almost same
between each benchmark because -R option is used.
 pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared

checkpoint_timeout is 5min, so it's expected that checkpoint was
executed at least two times during the benchmark.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

27 August 2014, 14:52:34

On Tue, Aug 26, 2014 at 8:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>> Hello,
>> Thank you for comments.
>>
>>>Could you tell me where the patch for "single block in one run" is?
>> Please find attached patch for single block compression in one run.
>
> Thanks! I ran the benchmark using pgbench and compared the results.
> I'd like to share the results.
>
> [RESULT]
> Amount of WAL generated during the benchmark. Unit is MB.
>
>                 Multiple                Single
>     off            202.0                201.5
>     on            6051.0                6053.0
>     pglz            3543.0                3567.0
>     lz4            3344.0                3485.0
>     snappy            3354.0                3449.5
>
> Latency average during the benchmark. Unit is ms.
>
>                 Multiple                Single
>     off            19.1                19.0
>     on            55.3                57.3
>     pglz            45.0                45.9
>     lz4            44.2                44.7
>     snappy            43.4                43.3
>
> These results show that FPW compression is really helpful for decreasing
> the WAL volume and improving the performance.

Yeah, those look like good numbers.  What happens if you run it at
full speed, without -R?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Arthur Silva

Date:

27 August 2014, 15:46:29

<p dir="ltr"><br /> Em 26/08/2014 09:16, "Fujii Masao" <<a
href="mailto:masao.fujii@gmail.com">masao.fujii@gmail.com</a>>escreveu:<br /> ><br /> > On Tue, Aug 19, 2014
at6:37 PM, Rahila Syed <<a href="mailto:rahilasyed90@gmail.com">rahilasyed90@gmail.com</a>> wrote:<br /> >
>Hello,<br /> > > Thank you for comments.<br /> > ><br /> > >>Could you tell me where the patch
for"single block in one run" is?<br /> > > Please find attached patch for single block compression in one run.<br
/>><br /> > Thanks! I ran the benchmark using pgbench and compared the results.<br /> > I'd like to share the
results.<br/> ><br /> > [RESULT]<br /> > Amount of WAL generated during the benchmark. Unit is MB.<br />
><br/> >                 Multiple                Single<br /> >     off            202.0               
201.5<br/> >     on            6051.0                6053.0<br /> >     pglz            3543.0               
3567.0<br/> >     lz4            3344.0                3485.0<br /> >     snappy            3354.0               
3449.5<br/> ><br /> > Latency average during the benchmark. Unit is ms.<br /> ><br /> >                
Multiple               Single<br /> >     off            19.1                19.0<br /> >     on            55.3 
             57.3<br /> >     pglz            45.0                45.9<br /> >     lz4            44.2           
   44.7<br /> >     snappy            43.4                43.3<br /> ><br /> > These results show that FPW
compressionis really helpful for decreasing<br /> > the WAL volume and improving the performance.<br /> ><br />
>The compression ratio by lz4 or snappy is better than that by pglz. But<br /> > it's difficult to conclude which
lz4or snappy is best, according to these<br /> > results.<br /> ><br /> > ISTM that
compression-of-multiple-pages-at-a-timeapproach can compress<br /> > WAL more than compression-of-single-...
does.<br/> ><br /> > [HOW TO BENCHMARK]<br /> > Create pgbench database with scall factor 1000.<br /> ><br
/>> Change the data type of the column "filler" on each pgbench table<br /> > from CHAR(n) to TEXT, and fill the
datawith the result of pgcrypto's<br /> > gen_random_uuid() in order to avoid empty column, e.g.,<br /> ><br />
> alter table pgbench_accounts alter column filler type text using<br /> > gen_random_uuid()::text<br /> ><br
/>> After creating the test database, run the pgbench as follows. The<br /> > number of transactions executed
duringbenchmark is almost same<br /> > between each benchmark because -R option is used.<br /> ><br /> >  
pgbench-c 64 -j 64 -r -R 400 -T 900 -M prepared<br /> ><br /> > checkpoint_timeout is 5min, so it's expected that
checkpointwas<br /> > executed at least two times during the benchmark.<br /> ><br /> > Regards,<br /> ><br
/>> --<br /> > Fujii Masao<br /> ><br /> ><br /> > --<br /> > Sent via pgsql-hackers mailing list (<a
href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> > To make changes to your
subscription:<br/> > <a
href="http://www.postgresql.org/mailpref/pgsql-hackers">http://www.postgresql.org/mailpref/pgsql-hackers</a><p
dir="ltr">It'dbe interesting to check avg cpu usage as well.

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

02 September 2014, 06:52:19

On Wed, Aug 27, 2014 at 11:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 26, 2014 at 8:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>> Hello,
>>> Thank you for comments.
>>>
>>>>Could you tell me where the patch for "single block in one run" is?
>>> Please find attached patch for single block compression in one run.
>>
>> Thanks! I ran the benchmark using pgbench and compared the results.
>> I'd like to share the results.
>>
>> [RESULT]
>> Amount of WAL generated during the benchmark. Unit is MB.
>>
>>                 Multiple                Single
>>     off            202.0                201.5
>>     on            6051.0                6053.0
>>     pglz            3543.0                3567.0
>>     lz4            3344.0                3485.0
>>     snappy            3354.0                3449.5
>>
>> Latency average during the benchmark. Unit is ms.
>>
>>                 Multiple                Single
>>     off            19.1                19.0
>>     on            55.3                57.3
>>     pglz            45.0                45.9
>>     lz4            44.2                44.7
>>     snappy            43.4                43.3
>>
>> These results show that FPW compression is really helpful for decreasing
>> the WAL volume and improving the performance.
>
> Yeah, those look like good numbers.  What happens if you run it at
> full speed, without -R?

OK, I ran the same benchmark except -R option. Here are the results:

[RESULT]
Throughput in the benchmark.
                           Multiple                    Single       off                    2162.6
2164.5      on                    891.8                    895.6       pglz                    1037.2
1042.3       lz4                    1084.7                    1091.8       snappy                    1058.4
      1073.3
 

Latency average during the benchmark. Unit is ms.
                           Multiple                    Single       off                    29.6                    29.6
     on                    71.7                    71.5       pglz                    61.7                    61.4
lz4                    59.0                    58.6       snappy                    60.5                    59.6
 

Amount of WAL generated during the benchmark. Unit is MB.
                           Multiple                    Single       off                    948.0
948.0      on                    7675.5                    7702.0       pglz                    5492.0
 5528.5       lz4                    5494.5                    5596.0       snappy                    5667.0
       5804.0
 

pglz vs. lz4 vs. snappy       In this benchmark, lz4 seems to have been the best compression
algorithm.       It caused best performance and highest WAL compression ratio.

Multiple vs. Single       WAL volume with "Multiple" was smaller than that with "Single". But       the throughput was
betterin "Single". So the "Multiple" is more useful       for WAL compression, but it may cause higher performance
overhead      at least in current implementation.
 

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

02 September 2014, 06:53:11

On Thu, Aug 28, 2014 at 12:46 AM, Arthur Silva <arthurprs@gmail.com> wrote:
>
> Em 26/08/2014 09:16, "Fujii Masao" <masao.fujii@gmail.com> escreveu:
>
>
>>
>> On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasyed90@gmail.com>
>> wrote:
>> > Hello,
>> > Thank you for comments.
>> >
>> >>Could you tell me where the patch for "single block in one run" is?
>> > Please find attached patch for single block compression in one run.
>>
>> Thanks! I ran the benchmark using pgbench and compared the results.
>> I'd like to share the results.
>>
>> [RESULT]
>> Amount of WAL generated during the benchmark. Unit is MB.
>>
>>                 Multiple                Single
>>     off            202.0                201.5
>>     on            6051.0                6053.0
>>     pglz            3543.0                3567.0
>>     lz4            3344.0                3485.0
>>     snappy            3354.0                3449.5
>>
>> Latency average during the benchmark. Unit is ms.
>>
>>                 Multiple                Single
>>     off            19.1                19.0
>>     on            55.3                57.3
>>     pglz            45.0                45.9
>>     lz4            44.2                44.7
>>     snappy            43.4                43.3
>>
>> These results show that FPW compression is really helpful for decreasing
>> the WAL volume and improving the performance.
>>
>> The compression ratio by lz4 or snappy is better than that by pglz. But
>> it's difficult to conclude which lz4 or snappy is best, according to these
>> results.
>>
>> ISTM that compression-of-multiple-pages-at-a-time approach can compress
>> WAL more than compression-of-single-... does.
>>
>> [HOW TO BENCHMARK]
>> Create pgbench database with scall factor 1000.
>>
>> Change the data type of the column "filler" on each pgbench table
>> from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
>> gen_random_uuid() in order to avoid empty column, e.g.,
>>
>>  alter table pgbench_accounts alter column filler type text using
>> gen_random_uuid()::text
>>
>> After creating the test database, run the pgbench as follows. The
>> number of transactions executed during benchmark is almost same
>> between each benchmark because -R option is used.
>>
>>   pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared
>>
>> checkpoint_timeout is 5min, so it's expected that checkpoint was
>> executed at least two times during the benchmark.
>>
>> Regards,
>>
>> --
>> Fujii Masao
>>
>>
>> --
>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers
>
> It'd be interesting to check avg cpu usage as well.

Yep, but I forgot to collect those info...

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

02 September 2014, 12:11:46

Hello,

>It'd be interesting to check avg cpu usage as well

I have collected average CPU utilization numbers by collecting sar output at interval of 10 seconds for following benchmark:

Benchmark:

Scale : 16
Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime 550,250,250,200,200

Warmup time : 1 sec
Measurement time : 900 sec
Number of tx types : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit : false

Checkpoint segments:1024
Checkpoint timeout:5 mins

Average % of CPU utilization at user level for multiple blocks compression:

Compression Off = 3.34133

Snappy = 3.41044

LZ4 = 3.59556

Pglz = 3.66422

The numbers show the average CPU utilization is in the following order pglz > LZ4 > Snappy > No compression

Attached is the graph which gives plot of % CPU utilization versus time elapsed for each of the compression algorithms.

Also, the overall CPU utilization during tests is very low i.e below 10% . CPU remained idle for large(~90) percentage of time. I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results.

Thank you,

On Wed, Aug 27, 2014 at 9:16 PM, Arthur Silva <arthurprs@gmail.com> wrote:

Em 26/08/2014 09:16, "Fujii Masao" <masao.fujii@gmail.com> escreveu:

>
> On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> > Hello,
> > Thank you for comments.
> >
> >>Could you tell me where the patch for "single block in one run" is?
> > Please find attached patch for single block compression in one run.
>
> Thanks! I ran the benchmark using pgbench and compared the results.
> I'd like to share the results.
>
> [RESULT]
> Amount of WAL generated during the benchmark. Unit is MB.
>
> Multiple Single
> off 202.0 201.5
> on 6051.0 6053.0
> pglz 3543.0 3567.0
> lz4 3344.0 3485.0
> snappy 3354.0 3449.5
>
> Latency average during the benchmark. Unit is ms.
>
> Multiple Single
> off 19.1 19.0
> on 55.3 57.3
> pglz 45.0 45.9
> lz4 44.2 44.7
> snappy 43.4 43.3
>
> These results show that FPW compression is really helpful for decreasing
> the WAL volume and improving the performance.
>
> The compression ratio by lz4 or snappy is better than that by pglz. But
> it's difficult to conclude which lz4 or snappy is best, according to these
> results.
>
> ISTM that compression-of-multiple-pages-at-a-time approach can compress
> WAL more than compression-of-single-... does.
>
> [HOW TO BENCHMARK]
> Create pgbench database with scall factor 1000.
>
> Change the data type of the column "filler" on each pgbench table
> from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
> gen_random_uuid() in order to avoid empty column, e.g.,
>
> alter table pgbench_accounts alter column filler type text using
> gen_random_uuid()::text
>
> After creating the test database, run the pgbench as follows. The
> number of transactions executed during benchmark is almost same
> between each benchmark because -R option is used.
>
> pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared
>
> checkpoint_timeout is 5min, so it's expected that checkpoint was
> executed at least two times during the benchmark.
>
> Regards,
>
> --
> Fujii Masao
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
It'd be interesting to check avg cpu usage as well.

Attachment

CPU_utilization_user.png

Re: [REVIEW] Re: Compression of full-page-writes

From

Arthur Silva

Date:

02 September 2014, 13:30:58

On Tue, Sep 2, 2014 at 9:11 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:

Hello,

>It'd be interesting to check avg cpu usage as well

I have collected average CPU utilization numbers by collecting sar output at interval of 10 seconds for following benchmark:

Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Benchmark:
Scale : 16
Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime 550,250,250,200,200
Warmup time : 1 sec
Measurement time : 900 sec
Number of tx types : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit : false

Checkpoint segments:1024
Checkpoint timeout:5 mins

Average % of CPU utilization at user level for multiple blocks compression:

Compression Off = 3.34133
Snappy = 3.41044
LZ4 = 3.59556
Pglz = 3.66422

The numbers show the average CPU utilization is in the following order pglz > LZ4 > Snappy > No compression
Attached is the graph which gives plot of % CPU utilization versus time elapsed for each of the compression algorithms.
Also, the overall CPU utilization during tests is very low i.e below 10% . CPU remained idle for large(~90) percentage of time. I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results.

Thank you,

On Wed, Aug 27, 2014 at 9:16 PM, Arthur Silva <arthurprs@gmail.com> wrote:

Em 26/08/2014 09:16, "Fujii Masao" <masao.fujii@gmail.com> escreveu:

>
> On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> > Hello,
> > Thank you for comments.
> >
> >>Could you tell me where the patch for "single block in one run" is?
> > Please find attached patch for single block compression in one run.
>
> Thanks! I ran the benchmark using pgbench and compared the results.
> I'd like to share the results.
>
> [RESULT]
> Amount of WAL generated during the benchmark. Unit is MB.
>
> Multiple Single
> off 202.0 201.5
> on 6051.0 6053.0
> pglz 3543.0 3567.0
> lz4 3344.0 3485.0
> snappy 3354.0 3449.5
>
> Latency average during the benchmark. Unit is ms.
>
> Multiple Single
> off 19.1 19.0
> on 55.3 57.3
> pglz 45.0 45.9
> lz4 44.2 44.7
> snappy 43.4 43.3
>
> These results show that FPW compression is really helpful for decreasing
> the WAL volume and improving the performance.
>
> The compression ratio by lz4 or snappy is better than that by pglz. But
> it's difficult to conclude which lz4 or snappy is best, according to these
> results.
>
> ISTM that compression-of-multiple-pages-at-a-time approach can compress
> WAL more than compression-of-single-... does.
>
> [HOW TO BENCHMARK]
> Create pgbench database with scall factor 1000.
>
> Change the data type of the column "filler" on each pgbench table
> from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
> gen_random_uuid() in order to avoid empty column, e.g.,
>
> alter table pgbench_accounts alter column filler type text using
> gen_random_uuid()::text
>
> After creating the test database, run the pgbench as follows. The
> number of transactions executed during benchmark is almost same
> between each benchmark because -R option is used.
>
> pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared
>
> checkpoint_timeout is 5min, so it's expected that checkpoint was
> executed at least two times during the benchmark.
>
> Regards,
>
> --
> Fujii Masao
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
It'd be interesting to check avg cpu usage as well.

Is there any reason to default to LZ4-HC? Shouldn't we try the default as well? LZ4-default is known for its near realtime speeds in exchange for a few % of compression, which sounds optimal for this use case.

Also, we might want to compile these libraries with -O3 instead of the default -O2. They're finely tuned to work with all possible compiler optimizations w/ hints and other tricks, this is specially true for LZ4, not sure for snappy.

In my virtual machine LZ4 w/ -O3 compression runs at twice the speed (950MB/s) of -O2 (450MB/s) @ (61.79%), LZ4-HC seems unaffected though (58MB/s) @ (60.27%).

Yes, that's right, almost 1GB/s! And the compression ratio is only 1,5% short compared to LZ4-HC.

Re: [REVIEW] Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

02 September 2014, 13:38:01

On Tue, Sep 02, 2014 at 10:30:11AM -0300, Arthur Silva wrote:
> On Tue, Sep 2, 2014 at 9:11 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> 
> > Hello,
> >
> > >It'd be interesting to check avg cpu usage as well
> >
> > I have collected average CPU utilization numbers by collecting sar output
> > at interval of 10 seconds  for following benchmark:
> >
> > Server specifications:
> > Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> > RAM: 32GB
> > Disk : HDD      450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> > 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
> >
> > Benchmark:
> >
> > Scale : 16
> > Command  :java JR  /home/postgres/jdbcrunner-1.2/scripts/tpcc.js
> >  -sleepTime 550,250,250,200,200
> >
> > Warmup time          : 1 sec
> > Measurement time     : 900 sec
> > Number of tx types   : 5
> > Number of agents     : 16
> > Connection pool size : 16
> > Statement cache size : 40
> > Auto commit          : false
> >
> >
> > Checkpoint segments:1024
> > Checkpoint timeout:5 mins
> >
> >
> > Average % of CPU utilization at user level for multiple blocks compression:
> >
> > Compression Off  =  3.34133
> >
> >  Snappy = 3.41044
> >
> > LZ4  = 3.59556
> >
> >  Pglz = 3.66422
> >
> >
> > The numbers show the average CPU utilization is in the following order
> > pglz > LZ4 > Snappy > No compression
> > Attached is the graph which gives plot of % CPU utilization versus time
> > elapsed for each of the compression algorithms.
> > Also, the overall CPU utilization during tests is very low i.e below 10% .
> > CPU remained idle for large(~90) percentage of time. I will repeat the
> > above tests with high load on CPU and using the benchmark given by
> > Fujii-san and post the results.
> >
> >
> > Thank you,
> >
> >
> >
> > On Wed, Aug 27, 2014 at 9:16 PM, Arthur Silva <arthurprs@gmail.com> wrote:
> >
> >>
> >> Em 26/08/2014 09:16, "Fujii Masao" <masao.fujii@gmail.com> escreveu:
> >>
> >> >
> >> > On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasyed90@gmail.com>
> >> wrote:
> >> > > Hello,
> >> > > Thank you for comments.
> >> > >
> >> > >>Could you tell me where the patch for "single block in one run" is?
> >> > > Please find attached patch for single block compression in one run.
> >> >
> >> > Thanks! I ran the benchmark using pgbench and compared the results.
> >> > I'd like to share the results.
> >> >
> >> > [RESULT]
> >> > Amount of WAL generated during the benchmark. Unit is MB.
> >> >
> >> >                 Multiple                Single
> >> >     off            202.0                201.5
> >> >     on            6051.0                6053.0
> >> >     pglz            3543.0                3567.0
> >> >     lz4            3344.0                3485.0
> >> >     snappy            3354.0                3449.5
> >> >
> >> > Latency average during the benchmark. Unit is ms.
> >> >
> >> >                 Multiple                Single
> >> >     off            19.1                19.0
> >> >     on            55.3                57.3
> >> >     pglz            45.0                45.9
> >> >     lz4            44.2                44.7
> >> >     snappy            43.4                43.3
> >> >
> >> > These results show that FPW compression is really helpful for decreasing
> >> > the WAL volume and improving the performance.
> >> >
> >> > The compression ratio by lz4 or snappy is better than that by pglz. But
> >> > it's difficult to conclude which lz4 or snappy is best, according to
> >> these
> >> > results.
> >> >
> >> > ISTM that compression-of-multiple-pages-at-a-time approach can compress
> >> > WAL more than compression-of-single-... does.
> >> >
> >> > [HOW TO BENCHMARK]
> >> > Create pgbench database with scall factor 1000.
> >> >
> >> > Change the data type of the column "filler" on each pgbench table
> >> > from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
> >> > gen_random_uuid() in order to avoid empty column, e.g.,
> >> >
> >> >  alter table pgbench_accounts alter column filler type text using
> >> > gen_random_uuid()::text
> >> >
> >> > After creating the test database, run the pgbench as follows. The
> >> > number of transactions executed during benchmark is almost same
> >> > between each benchmark because -R option is used.
> >> >
> >> >   pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared
> >> >
> >> > checkpoint_timeout is 5min, so it's expected that checkpoint was
> >> > executed at least two times during the benchmark.
> >> >
> >> > Regards,
> >> >
> >> > --
> >> > Fujii Masao
> >> >
> >> >
> >> > --
> >> > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> >> > To make changes to your subscription:
> >> > http://www.postgresql.org/mailpref/pgsql-hackers
> >>
> >> It'd be interesting to check avg cpu usage as well.
> >>
> >
> >
> Is there any reason to default to LZ4-HC? Shouldn't we try the default as
> well? LZ4-default is known for its near realtime speeds in exchange for a
> few % of compression, which sounds optimal for this use case.
> 
> Also, we might want to compile these libraries with -O3 instead of the
> default -O2. They're finely tuned to work with all possible compiler
> optimizations w/ hints and other tricks, this is specially true for LZ4,
> not sure for snappy.
> 
> In my virtual machine LZ4 w/ -O3 compression runs at twice the speed
> (950MB/s) of -O2 (450MB/s) @ (61.79%), LZ4-HC seems unaffected though
> (58MB/s) @ (60.27%).
> 
> Yes, that's right, almost 1GB/s! And the compression ratio is only 1,5%
> short compared to LZ4-HC.

Hi,

I agree completely. For day-to-day use we should use LZ4-default. For read-only
tables, it might be nice to "archive" them with LZ4-HC for the higher compression
would increase read speed and reduce storage space needs. I believe that LZ4-HC
is only slower to compress and the decompression is unaffected.

Regards,
Ken

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

02 September 2014, 13:39:41

On 2014-09-02 08:37:42 -0500, ktm@rice.edu wrote:
> I agree completely. For day-to-day use we should use LZ4-default. For read-only
> tables, it might be nice to "archive" them with LZ4-HC for the higher compression
> would increase read speed and reduce storage space needs. I believe that LZ4-HC
> is only slower to compress and the decompression is unaffected.

This is about the write ahead log, not relations

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

11 September 2014, 05:46:30

>I will repeat the above tests with high load on CPU and using the benchmark
given by Fujii-san and post the results.

Average % of CPU usage at user level for each of the compression algorithm
are as follows.

Compression Multiple Single

Off 81.1338 81.1267
LZ4 81.0998 81.1695
Snappy: 80.9741 80.9703
Pglz : 81.2353 81.2753
<http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png>
<http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png>
The numbers show CPU utilization of Snappy is the least. The CPU utilization
in increasing order is
pglz > No compression > LZ4 > Snappy

The variance of average CPU utilization numbers is very low. However ,
snappy seems to be best when it comes to lesser utilization of CPU.

As per the measurement results posted till date

LZ4 outperforms snappy and pglz in terms of compression ratio and
performance. However , CPU utilization numbers show snappy utilizes least
amount of CPU . Difference is not much though.

As there has been no consensus yet about which compression algorithm to
adopt, is it better to make this decision independent of the FPW compression
patch as suggested earlier in this thread?. FPW compression can be done
using built in compression pglz as it shows considerable performance over
uncompressed WAL and good compression ratio
Also, the patch to compress multiple blocks at once gives better compression
as compared to single block. ISTM that performance overhead introduced by
multiple blocks compression is slightly higher than single block compression
which can be tested again after modifying the patch to use pglz . Hence,
this patch can be built using multiple blocks compression.

Thoughts?

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: [REVIEW] Re: Compression of full-page-writes

From

Arthur Silva

Date:

11 September 2014, 12:37:54

I agree that there's no reason to fix an algorithm to it, unless maybe it's pglz. There's some initial talk about implementing pluggable compression algorithms for TOAST and I guess the same must be taken into consideration for the WAL.

--
Arthur Silva

On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed <rahilasyed.90@gmail.com> wrote:

>I will repeat the above tests with high load on CPU and using the benchmark
given by Fujii-san and post the results.

Average % of CPU usage at user level for each of the compression algorithm
are as follows.

Compression Multiple Single

Off 81.1338 81.1267
LZ4 81.0998 81.1695
Snappy: 80.9741 80.9703
Pglz : 81.2353 81.2753

<http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png>
<http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png>

The numbers show CPU utilization of Snappy is the least. The CPU utilization
in increasing order is
pglz > No compression > LZ4 > Snappy

The variance of average CPU utilization numbers is very low. However ,
snappy seems to be best when it comes to lesser utilization of CPU.

As per the measurement results posted till date

LZ4 outperforms snappy and pglz in terms of compression ratio and
performance. However , CPU utilization numbers show snappy utilizes least
amount of CPU . Difference is not much though.

As there has been no consensus yet about which compression algorithm to
adopt, is it better to make this decision independent of the FPW compression
patch as suggested earlier in this thread?. FPW compression can be done
using built in compression pglz as it shows considerable performance over
uncompressed WAL and good compression ratio
Also, the patch to compress multiple blocks at once gives better compression
as compared to single block. ISTM that performance overhead introduced by
multiple blocks compression is slightly higher than single block compression
which can be tested again after modifying the patch to use pglz . Hence,
this patch can be built using multiple blocks compression.

Thoughts?

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [REVIEW] Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

11 September 2014, 13:01:17

On Thu, Sep 11, 2014 at 09:37:07AM -0300, Arthur Silva wrote:
> I agree that there's no reason to fix an algorithm to it, unless maybe it's
> pglz. There's some initial talk about implementing pluggable compression
> algorithms for TOAST and I guess the same must be taken into consideration
> for the WAL.
> 
> --
> Arthur Silva
> 
> 
> On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed <rahilasyed.90@gmail.com>
> wrote:
> 
> > >I will repeat the above tests with high load on CPU and using the
> > benchmark
> > given by Fujii-san and post the results.
> >
> > Average % of CPU usage at user level for each of the compression algorithm
> > are as follows.
> >
> > Compression        Multiple            Single
> >
> > Off                        81.1338            81.1267
> > LZ4                      81.0998            81.1695
> > Snappy:                80.9741             80.9703
> > Pglz :                    81.2353            81.2753
> >
> > <
> > http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png
> > >
> > <
> > http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png
> > >
> >
> > The numbers show CPU utilization of Snappy is the least. The CPU
> > utilization
> > in increasing order is
> > pglz > No compression > LZ4 > Snappy
> >
> > The variance of average CPU utilization numbers is very low. However ,
> > snappy seems to be best when it comes to lesser utilization of CPU.
> >
> > As per the measurement results posted till date
> >
> > LZ4 outperforms snappy and pglz in terms of compression ratio and
> > performance. However , CPU utilization numbers show snappy utilizes least
> > amount of CPU . Difference is not much though.
> >
> > As there has been no consensus yet about which compression algorithm to
> > adopt, is it better to make this decision independent of the FPW
> > compression
> > patch as suggested earlier in this thread?. FPW compression can be done
> > using built in compression pglz as it shows considerable performance over
> > uncompressed WAL and good compression ratio
> > Also, the patch to compress multiple blocks at once gives better
> > compression
> > as compared to single block. ISTM that performance overhead introduced by
> > multiple blocks compression is slightly higher than single block
> > compression
> > which can be tested again after modifying the patch to use pglz .  Hence,
> > this patch can be built using multiple blocks compression.
> >
> > Thoughts?
> >

Hi,

The big (huge) win for lz4 (not the HC variant) is the enormous compression
and decompression speed. It compresses quite a bit faster (33%) than snappy
and decompresses twice as fast as snappy.

Regards,
Ken

Re: [REVIEW] Re: Compression of full-page-writes

From

Mitsumasa KONDO

Date:

11 September 2014, 13:33:49

2014-09-11 22:01 GMT+09:00 ktm@rice.edu <ktm@rice.edu>:

On Thu, Sep 11, 2014 at 09:37:07AM -0300, Arthur Silva wrote:
> I agree that there's no reason to fix an algorithm to it, unless maybe it's
> pglz.

Yes, it seems difficult to judge only the algorithm performance.

We have to start to consider source code maintenance, quality and the other factors..

The big (huge) win for lz4 (not the HC variant) is the enormous compression
and decompression speed. It compresses quite a bit faster (33%) than snappy
and decompresses twice as fast as snappy.

Show us the evidence. Postgres members showed the test result and them consideration.

It's very objective comparing.

Best Regards,

Mitsumasa KONDO

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

11 September 2014, 16:55:28

On Thu, Sep 11, 2014 at 1:46 AM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
>>I will repeat the above tests with high load on CPU and using the benchmark
> given by Fujii-san and post the results.
>
> Average % of CPU usage at user level for each of the compression algorithm
> are as follows.
>
> Compression        Multiple            Single
>
> Off                        81.1338            81.1267
> LZ4                      81.0998            81.1695
> Snappy:                80.9741             80.9703
> Pglz :                    81.2353            81.2753
>
> <http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png>
> <http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png>
>
> The numbers show CPU utilization of Snappy is the least. The CPU utilization
> in increasing order is
> pglz > No compression > LZ4 > Snappy
>
> The variance of average CPU utilization numbers is very low. However ,
> snappy seems to be best when it comes to lesser utilization of CPU.
>
> As per the measurement results posted till date
>
> LZ4 outperforms snappy and pglz in terms of compression ratio and
> performance. However , CPU utilization numbers show snappy utilizes least
> amount of CPU . Difference is not much though.
>
> As there has been no consensus yet about which compression algorithm to
> adopt, is it better to make this decision independent of the FPW compression
> patch as suggested earlier in this thread?. FPW compression can be done
> using built in compression pglz as it shows considerable performance over
> uncompressed WAL and good compression ratio
> Also, the patch to compress multiple blocks at once gives better compression
> as compared to single block. ISTM that performance overhead introduced by
> multiple blocks compression is slightly higher than single block compression
> which can be tested again after modifying the patch to use pglz .  Hence,
> this patch can be built using multiple blocks compression.

I advise supporting pglz only for the initial patch, and adding
support for the others later if it seems worthwhile.  The approach
seems to work well enough with pglz that it's worth doing even if we
never add the other algorithms.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

11 September 2014, 16:58:12

On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
> I advise supporting pglz only for the initial patch, and adding
> support for the others later if it seems worthwhile.  The approach
> seems to work well enough with pglz that it's worth doing even if we
> never add the other algorithms.

That approach is fine with me. Note though that I am pretty strongly
against adding support for more than one algorithm at the same time. So,
if we gain lz4 support - which I think is definitely where we should go
- we should drop pglz support for the WAL.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Bruce Momjian

Date:

11 September 2014, 16:59:05

On Thu, Sep 11, 2014 at 12:55:21PM -0400, Robert Haas wrote:
> I advise supporting pglz only for the initial patch, and adding
> support for the others later if it seems worthwhile.  The approach
> seems to work well enough with pglz that it's worth doing even if we
> never add the other algorithms.

+1

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

11 September 2014, 17:04:49

On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
>> I advise supporting pglz only for the initial patch, and adding
>> support for the others later if it seems worthwhile.  The approach
>> seems to work well enough with pglz that it's worth doing even if we
>> never add the other algorithms.
>
> That approach is fine with me. Note though that I am pretty strongly
> against adding support for more than one algorithm at the same time.

What if one algorithm compresses better and the other algorithm uses
less CPU time?

I don't see a compelling need for an option if we get a new algorithm
that strictly dominates what we've already got in all parameters, and
it may well be that, as respects pglz, that's achievable.  But ISTM
that it need not be true in general.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

11 September 2014, 17:17:49

On 2014-09-11 13:04:43 -0400, Robert Haas wrote:
> On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
> >> I advise supporting pglz only for the initial patch, and adding
> >> support for the others later if it seems worthwhile.  The approach
> >> seems to work well enough with pglz that it's worth doing even if we
> >> never add the other algorithms.
> >
> > That approach is fine with me. Note though that I am pretty strongly
> > against adding support for more than one algorithm at the same time.
> 
> What if one algorithm compresses better and the other algorithm uses
> less CPU time?

Then we make a choice for our users. A configuration option about an
aspect of postgres that darned view people will understand with for the
marginal differences between snappy and lz4 doesn't make sense.

> I don't see a compelling need for an option if we get a new algorithm
> that strictly dominates what we've already got in all parameters, and
> it may well be that, as respects pglz, that's achievable.  But ISTM
> that it need not be true in general.

If you look at the results lz4 is pretty much there. Sure, there's
algorithms which have a much better compression - but the time overhead
is so large it just doesn't make sense for full page compression.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

11 September 2014, 18:14:19

On Thu, Sep 11, 2014 at 1:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-09-11 13:04:43 -0400, Robert Haas wrote:
>> On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
>> >> I advise supporting pglz only for the initial patch, and adding
>> >> support for the others later if it seems worthwhile.  The approach
>> >> seems to work well enough with pglz that it's worth doing even if we
>> >> never add the other algorithms.
>> >
>> > That approach is fine with me. Note though that I am pretty strongly
>> > against adding support for more than one algorithm at the same time.
>>
>> What if one algorithm compresses better and the other algorithm uses
>> less CPU time?
>
> Then we make a choice for our users. A configuration option about an
> aspect of postgres that darned view people will understand with for the
> marginal differences between snappy and lz4 doesn't make sense.

Maybe.  Let's get the basic patch done first; then we can argue about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

11 September 2014, 18:30:28

On Thu, Sep 11, 2014 at 06:58:06PM +0200, Andres Freund wrote:
> On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
> > I advise supporting pglz only for the initial patch, and adding
> > support for the others later if it seems worthwhile.  The approach
> > seems to work well enough with pglz that it's worth doing even if we
> > never add the other algorithms.
> 
> That approach is fine with me. Note though that I am pretty strongly
> against adding support for more than one algorithm at the same time. So,
> if we gain lz4 support - which I think is definitely where we should go
> - we should drop pglz support for the WAL.
> 
> Greetings,
> 
> Andres Freund
> 

+1

Regards,
Ken

Re: [REVIEW] Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

11 September 2014, 18:35:20

On Thu, Sep 11, 2014 at 07:17:42PM +0200, Andres Freund wrote:
> On 2014-09-11 13:04:43 -0400, Robert Haas wrote:
> > On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
> > >> I advise supporting pglz only for the initial patch, and adding
> > >> support for the others later if it seems worthwhile.  The approach
> > >> seems to work well enough with pglz that it's worth doing even if we
> > >> never add the other algorithms.
> > >
> > > That approach is fine with me. Note though that I am pretty strongly
> > > against adding support for more than one algorithm at the same time.
> > 
> > What if one algorithm compresses better and the other algorithm uses
> > less CPU time?
> 
> Then we make a choice for our users. A configuration option about an
> aspect of postgres that darned view people will understand with for the
> marginal differences between snappy and lz4 doesn't make sense.
> 
> > I don't see a compelling need for an option if we get a new algorithm
> > that strictly dominates what we've already got in all parameters, and
> > it may well be that, as respects pglz, that's achievable.  But ISTM
> > that it need not be true in general.
> 
> If you look at the results lz4 is pretty much there. Sure, there's
> algorithms which have a much better compression - but the time overhead
> is so large it just doesn't make sense for full page compression.
> 
> Greetings,
> 
> Andres Freund
> 

In addition, you can leverage the the presence of a higher-compression
version of lz4 (lz4hc) that can utilize the same decompression engine
that could possibly be applied to static tables as a REINDEX option
or even slowly growing tables that would benefit from the better
compression as well as the increased decompression speed available.

Regards,
Ken

Re: [REVIEW] Re: Compression of full-page-writes

From

Heikki Linnakangas

Date:

12 September 2014, 19:38:25

On 09/02/2014 09:52 AM, Fujii Masao wrote:
> [RESULT]
> Throughput in the benchmark.
>
>                              Multiple                    Single
>          off                    2162.6                    2164.5
>          on                    891.8                    895.6
>          pglz                    1037.2                    1042.3
>          lz4                    1084.7                    1091.8
>          snappy                    1058.4                    1073.3

Most of the CPU overhead of writing full pages is because of CRC 
calculation. Compression helps because then you have less data to CRC.

It's worth noting that there are faster CRC implementations out there 
than what we use. The Slicing-by-4 algorithm was discussed years ago, 
but was not deemed worth it back then IIRC because we typically 
calculate CRC over very small chunks of data, and the benefit of 
Slicing-by-4 and many other algorithms only show up when you work on 
larger chunks. But a full-page image is probably large enough to benefit.

What I'm trying to say is that this should be compared with the idea of 
just switching the CRC implementation. That would make the 'on' case 
faster, and and the benefit of compression smaller. I wouldn't be 
surprised if it made the 'on' case faster than compressed cases.

I don't mean that we should abandon this patch - compression makes the 
WAL smaller which has all kinds of other benefits, even if it makes the 
raw TPS throughput of the system worse. But I'm just saying that these 
TPS comparisons should be taken with a grain of salt. We probably should 
consider switching to a faster CRC algorithm again, regardless of what 
we do with compression.

- Heikki

Re: [REVIEW] Re: Compression of full-page-writes

From

Abhijit Menon-Sen

Date:

12 September 2014, 19:54:12

At 2014-09-12 22:38:01 +0300, hlinnakangas@vmware.com wrote:
>
> We probably should consider switching to a faster CRC algorithm again,
> regardless of what we do with compression.

As it happens, I'm already working on resurrecting a patch that Andres
posted in 2010 to switch to zlib's faster CRC implementation.

-- Abhijit

CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)

From

Heikki Linnakangas

Date:

12 September 2014, 20:03:19

On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
> At 2014-09-12 22:38:01 +0300, hlinnakangas@vmware.com wrote:
>>
>> We probably should consider switching to a faster CRC algorithm again,
>> regardless of what we do with compression.
>
> As it happens, I'm already working on resurrecting a patch that Andres
> posted in 2010 to switch to zlib's faster CRC implementation.

As it happens, I also wrote an implementation of Slice-by-4 the other
day :-). Haven't gotten around to post it, but here it is.

What algorithm does zlib use for CRC calculation?

- Heikki

Attachment

slice-by-4.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Ants Aasma

Date:

12 September 2014, 20:17:23

On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I don't mean that we should abandon this patch - compression makes the WAL
> smaller which has all kinds of other benefits, even if it makes the raw TPS
> throughput of the system worse. But I'm just saying that these TPS
> comparisons should be taken with a grain of salt. We probably should
> consider switching to a faster CRC algorithm again, regardless of what we do
> with compression.

CRC is a pretty awfully slow algorithm for checksums. We should
consider switching it out for something more modern. CityHash,
MurmurHash3 and xxhash look like pretty good candidates, being around
an order of magnitude faster than CRC. I'm hoping to investigate
substituting the WAL checksum algorithm 9.5.

Given the room for improvement in this area I think it would make
sense to just short-circuit the CRC calculations for testing this
patch to see if the performance improvement is due to less data being
checksummed.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)

From

Andres Freund

Date:

12 September 2014, 20:22:30

On 2014-09-12 23:03:00 +0300, Heikki Linnakangas wrote:
> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
> >At 2014-09-12 22:38:01 +0300, hlinnakangas@vmware.com wrote:
> >>
> >>We probably should consider switching to a faster CRC algorithm again,
> >>regardless of what we do with compression.
> >
> >As it happens, I'm already working on resurrecting a patch that Andres
> >posted in 2010 to switch to zlib's faster CRC implementation.
> 
> As it happens, I also wrote an implementation of Slice-by-4 the other day
> :-). Haven't gotten around to post it, but here it is.
> 
> What algorithm does zlib use for CRC calculation?

Also slice-by-4, with a manually unrolled loop doing 32bytes at once, using
individual slice-by-4's. IIRC I tried and removing that slowed things
down overall. What it also did was move crc to a function. I'm not sure
why I did it that way, but it really might be beneficial - if you look
at profiles today there's sometimes icache/decoding stalls...

Hm. Let me look:
http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de

Ick, there's quite some debugging leftovers ;)

I think it might be a good idea to also switch the polynom at the same
time. I really really think we should, when the hardware supports, use
the polynom that's available in SSE4.2. It has similar properties, can
implemented in software just the same...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

12 September 2014, 20:28:02

On 2014-09-12 23:17:12 +0300, Ants Aasma wrote:
> On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > I don't mean that we should abandon this patch - compression makes the WAL
> > smaller which has all kinds of other benefits, even if it makes the raw TPS
> > throughput of the system worse. But I'm just saying that these TPS
> > comparisons should be taken with a grain of salt. We probably should
> > consider switching to a faster CRC algorithm again, regardless of what we do
> > with compression.
> 
> CRC is a pretty awfully slow algorithm for checksums. We should
> consider switching it out for something more modern. CityHash,
> MurmurHash3 and xxhash look like pretty good candidates, being around
> an order of magnitude faster than CRC. I'm hoping to investigate
> substituting the WAL checksum algorithm 9.5.

I think that might not be a bad plan. But it'll involve *far* more
effort and arguing to change to fundamentally different algorithms. So
personally I'd just go with slice-by-4. that's relatively
uncontroversial I think. Then maybe switch the polynom so we can use the
CRC32 instruction.

> Given the room for improvement in this area I think it would make
> sense to just short-circuit the CRC calculations for testing this
> patch to see if the performance improvement is due to less data being
> checksummed.

FWIW, I don't think it's 'bad' that less data provides speedups. I don't
really see a need to see that get that out of the benchmarks.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

12 September 2014, 20:31:30

On 2014-09-12 22:38:01 +0300, Heikki Linnakangas wrote:
> It's worth noting that there are faster CRC implementations out there than
> what we use. The Slicing-by-4 algorithm was discussed years ago, but was not
> deemed worth it back then IIRC because we typically calculate CRC over very
> small chunks of data, and the benefit of Slicing-by-4 and many other
> algorithms only show up when you work on larger chunks. But a full-page
> image is probably large enough to benefit.

I've recently pondered moving things around so the CRC sum can be
computed over the whole data instead of the individual chain elements. I
think, regardless of the checksum algorithm and implementation we end up
with, that might end up as a noticeable benefit.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

12 September 2014, 20:39:44

On Fri, Sep 12, 2014 at 11:17:12PM +0300, Ants Aasma wrote:
> On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > I don't mean that we should abandon this patch - compression makes the WAL
> > smaller which has all kinds of other benefits, even if it makes the raw TPS
> > throughput of the system worse. But I'm just saying that these TPS
> > comparisons should be taken with a grain of salt. We probably should
> > consider switching to a faster CRC algorithm again, regardless of what we do
> > with compression.
> 
> CRC is a pretty awfully slow algorithm for checksums. We should
> consider switching it out for something more modern. CityHash,
> MurmurHash3 and xxhash look like pretty good candidates, being around
> an order of magnitude faster than CRC. I'm hoping to investigate
> substituting the WAL checksum algorithm 9.5.
> 
> Given the room for improvement in this area I think it would make
> sense to just short-circuit the CRC calculations for testing this
> patch to see if the performance improvement is due to less data being
> checksummed.
> 
> Regards,
> Ants Aasma

+1 for xxhash -

version    speed on 64-bits      speed on 32-bits
-------    ----------------      ----------------
XXH64      13.8 GB/s             1.9 GB/s
XXH32       6.8 GB/s             6.0 GB/s

Here is a blog about its performance as a hash function:

http://fastcompression.blogspot.com/2014/07/xxhash-wider-64-bits.html

Regards,
Ken

Re: [REVIEW] Re: Compression of full-page-writes

From

Arthur Silva

Date:

13 September 2014, 03:59:53

<p dir="ltr">That's not entirely true. CRC-32C beats pretty much everything with the same length quality-wise and has
bothhardware implementations and highly optimized software versions.<div class="gmail_quote">Em 12/09/2014 17:18, "Ants
Aasma"<<a href="mailto:ants@cybertec.at">ants@cybertec.at</a>> escreveu:<br type="attribution" /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Fri, Sep 12, 2014 at 10:38
PM,Heikki Linnakangas<br /> <<a href="mailto:hlinnakangas@vmware.com">hlinnakangas@vmware.com</a>> wrote:<br />
>I don't mean that we should abandon this patch - compression makes the WAL<br /> > smaller which has all kinds
ofother benefits, even if it makes the raw TPS<br /> > throughput of the system worse. But I'm just saying that
theseTPS<br /> > comparisons should be taken with a grain of salt. We probably should<br /> > consider switching
toa faster CRC algorithm again, regardless of what we do<br /> > with compression.<br /><br /> CRC is a pretty
awfullyslow algorithm for checksums. We should<br /> consider switching it out for something more modern. CityHash,<br
/>MurmurHash3 and xxhash look like pretty good candidates, being around<br /> an order of magnitude faster than CRC.
I'mhoping to investigate<br /> substituting the WAL checksum algorithm 9.5.<br /><br /> Given the room for improvement
inthis area I think it would make<br /> sense to just short-circuit the CRC calculations for testing this<br /> patch
tosee if the performance improvement is due to less data being<br /> checksummed.<br /><br /> Regards,<br /> Ants
Aasma<br/> --<br /> Cybertec Schönig & Schönig GmbH<br /> Gröhrmühlgasse 26<br /> A-2700 Wiener Neustadt<br /> Web:
<ahref="http://www.postgresql-support.de" target="_blank">http://www.postgresql-support.de</a><br /><br /><br /> --<br
/>Sent via pgsql-hackers mailing list (<a
href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your
subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers"
target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/></blockquote></div>

Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)

From

Arthur Silva

Date:

13 September 2014, 04:20:41

<p dir="ltr"><br /> Em 12/09/2014 17:23, "Andres Freund" <<a
href="mailto:andres@2ndquadrant.com">andres@2ndquadrant.com</a>>escreveu:<br /> ><br /> > On 2014-09-12
23:03:00+0300, Heikki Linnakangas wrote:<br /> > > On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:<br /> >
>>At 2014-09-12 22:38:01 +0300, <a href="mailto:hlinnakangas@vmware.com">hlinnakangas@vmware.com</a> wrote:<br />
>> >><br /> > > >>We probably should consider switching to a faster CRC algorithm again,<br />
>> >>regardless of what we do with compression.<br /> > > ><br /> > > >As it happens, I'm
alreadyworking on resurrecting a patch that Andres<br /> > > >posted in 2010 to switch to zlib's faster CRC
implementation.<br/> > ><br /> > > As it happens, I also wrote an implementation of Slice-by-4 the other
day<br/> > > :-). Haven't gotten around to post it, but here it is.<br /> > ><br /> > > What
algorithmdoes zlib use for CRC calculation?<br /> ><br /> > Also slice-by-4, with a manually unrolled loop doing
32bytesat once, using<br /> > individual slice-by-4's. IIRC I tried and removing that slowed things<br /> > down
overall.What it also did was move crc to a function. I'm not sure<br /> > why I did it that way, but it really might
bebeneficial - if you look<br /> > at profiles today there's sometimes icache/decoding stalls...<br /> ><br />
>Hm. Let me look:<br /> > <a
href="http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de">http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de</a><br
/>><br /> > Ick, there's quite some debugging leftovers ;)<br /> ><br /> > I think it might be a good idea
toalso switch the polynom at the same<br /> > time. I really really think we should, when the hardware supports,
use<br/> > the polynom that's available in SSE4.2. It has similar properties, can<br /> > implemented in software
justthe same...<br /> ><br /> > Greetings,<br /> ><br /> > Andres Freund<br /> ><br /> > --<br />
> Andres Freund                     <a href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a><br /> >
 PostgreSQLDevelopment, 24x7 Support, Training & Services<br /> ><br /> ><br /> > --<br /> > Sent via
pgsql-hackersmailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br /> >
Tomake changes to your subscription:<br /> > <a
href="http://www.postgresql.org/mailpref/pgsql-hackers">http://www.postgresql.org/mailpref/pgsql-hackers</a><p
dir="ltr">ThisGoogle library is worth a look <a
href="https://code.google.com/p/crcutil/">https://code.google.com/p/crcutil/</a>as it has some extremely optimized
versions.

Re: [REVIEW] Re: Compression of full-page-writes

From

Ants Aasma

Date:

13 September 2014, 05:52:47

On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs@gmail.com> wrote:
> That's not entirely true. CRC-32C beats pretty much everything with the same
> length quality-wise and has both hardware implementations and highly
> optimized software versions.

For better or for worse CRC is biased by detecting all single bit
errors, the detection capability of larger errors is slightly
diminished. The quality of the other algorithms I mentioned is also
very good, while producing uniformly varying output. CRC has exactly
one hardware implementation in general purpose CPU's and Intel has a
patent on the techniques they used to implement it. The fact that AMD
hasn't yet implemented this instruction shows that this patent is
non-trivial to work around. The hardware CRC is about as fast as
xxhash. The highly optimized software CRCs are an order of magnitude
slower and require large cache trashing lookup tables.

If we choose to stay with CRC we must accept that we can only solve
the performance issues for Intel CPUs and provide slight alleviation
for others.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

13 September 2014, 16:44:51

On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs@gmail.com> wrote:
> > That's not entirely true. CRC-32C beats pretty much everything with the same
> > length quality-wise and has both hardware implementations and highly
> > optimized software versions.
> 
> For better or for worse CRC is biased by detecting all single bit
> errors, the detection capability of larger errors is slightly
> diminished. The quality of the other algorithms I mentioned is also
> very good, while producing uniformly varying output.

There's also much more literature about the various CRCs in comparison
to some of these hash allgorithms. Pretty much everything tests how well
they're suited for hashtables, but that's not really what we need
(although it might not hurt *at all* to have something faster there...).

I do think we need to think about the types of errors we really have to
detect. It's not all that clear that either the typical guarantees/tests
for CRCs nor for checksums (smhasher, whatever) are very
representative...

> CRC has exactly
> one hardware implementation in general purpose CPU's and Intel has a
> patent on the techniques they used to implement it. The fact that AMD
> hasn't yet implemented this instruction shows that this patent is
> non-trivial to work around.

I think AMD has implemeded SSE4.2 with bulldozer. It's still only recent
x86 though. So I think there's good reasons for moving away from it.

How one could get patents on exposing hardware CRC implementations -
hard to find a computing device without one - as a instruction is beyond
me...

I think it's pretty clear by now that we should move to lz4 for a couple
things - which bundles xxhash with it. So that has one argument for it.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Tom Lane

Date:

13 September 2014, 16:55:45

Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
>> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs@gmail.com> wrote:
>>> That's not entirely true. CRC-32C beats pretty much everything with the same
>>> length quality-wise and has both hardware implementations and highly
>>> optimized software versions.

>> For better or for worse CRC is biased by detecting all single bit
>> errors, the detection capability of larger errors is slightly
>> diminished. The quality of the other algorithms I mentioned is also
>> very good, while producing uniformly varying output.

> There's also much more literature about the various CRCs in comparison
> to some of these hash allgorithms.

Indeed.  CRCs have well-understood properties for error detection.
Have any of these new algorithms been analyzed even a hundredth as
thoroughly?  No.  I'm unimpressed by evidence-free claims that
something else is "also very good".

Now, CRCs are designed for detecting the sorts of short burst errors
that are (or were, back in the day) common on phone lines.  You could
certainly make an argument that that's not the type of threat we face
for PG data.  However, I've not seen anyone actually make such an
argument, let alone demonstrate that some other algorithm would be better.
To start with, you'd need to explain precisely what other error pattern
is more important to defend against, and why.
        regards, tom lane

Re: [REVIEW] Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

13 September 2014, 19:32:25

On Sat, Sep 13, 2014 at 12:55:33PM -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs@gmail.com> wrote:
> >>> That's not entirely true. CRC-32C beats pretty much everything with the same
> >>> length quality-wise and has both hardware implementations and highly
> >>> optimized software versions.
> 
> >> For better or for worse CRC is biased by detecting all single bit
> >> errors, the detection capability of larger errors is slightly
> >> diminished. The quality of the other algorithms I mentioned is also
> >> very good, while producing uniformly varying output.
> 
> > There's also much more literature about the various CRCs in comparison
> > to some of these hash allgorithms.
> 
> Indeed.  CRCs have well-understood properties for error detection.
> Have any of these new algorithms been analyzed even a hundredth as
> thoroughly?  No.  I'm unimpressed by evidence-free claims that
> something else is "also very good".
> 
> Now, CRCs are designed for detecting the sorts of short burst errors
> that are (or were, back in the day) common on phone lines.  You could
> certainly make an argument that that's not the type of threat we face
> for PG data.  However, I've not seen anyone actually make such an
> argument, let alone demonstrate that some other algorithm would be better.
> To start with, you'd need to explain precisely what other error pattern
> is more important to defend against, and why.
> 
>             regards, tom lane
> 

Here is a blog on the development of xxhash:

http://fastcompression.blogspot.com/2012/04/selecting-checksum-algorithm.html

Regards,
Ken

Re: [REVIEW] Re: Compression of full-page-writes

From

Arthur Silva

Date:

14 September 2014, 00:51:44

On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
>> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs@gmail.com> wrote:
>>> That's not entirely true. CRC-32C beats pretty much everything with the same
>>> length quality-wise and has both hardware implementations and highly
>>> optimized software versions.

>> For better or for worse CRC is biased by detecting all single bit
>> errors, the detection capability of larger errors is slightly
>> diminished. The quality of the other algorithms I mentioned is also
>> very good, while producing uniformly varying output.

> There's also much more literature about the various CRCs in comparison
> to some of these hash allgorithms.

Indeed. CRCs have well-understood properties for error detection.
Have any of these new algorithms been analyzed even a hundredth as
thoroughly? No. I'm unimpressed by evidence-free claims that
something else is "also very good".

Now, CRCs are designed for detecting the sorts of short burst errors
that are (or were, back in the day) common on phone lines. You could
certainly make an argument that that's not the type of threat we face
for PG data. However, I've not seen anyone actually make such an
argument, let alone demonstrate that some other algorithm would be better.
To start with, you'd need to explain precisely what other error pattern
is more important to defend against, and why.

regards, tom lane

Mysql went this way as well, changing the CRC polynomial in 5.6.

What we are looking for here is uniqueness thus better error detection. Not avalanche effect, nor cryptographically secure, nor bit distribution.

As far as I'm aware CRC32C is unbeaten collision wise and time proven.

I couldn't find tests with xxhash and crc32 on the same hardware so I spent some time putting together a benchmark (see attachment, to run it just start run.sh)

I included a crc32 implementation using ssr4.2 instructions (which works on pretty much any Intel processor built after 2008 and AMD built after 2012),

a portable Slice-By-8 software implementation and xxhash since it's the fastest software 32bit hash I know of.

Here're the results running the test program on my i5-4200M

crc sb8: 90444623
elapsed: 0.513688s
speed: 1.485220 GB/s

crc hw: 90444623
elapsed: 0.048327s
speed: 15.786877 GB/s

xxhash: 7f4a8d5
elapsed: 0.182100s
speed: 4.189663 GB/s

The hardware version is insanely and works on the majority of Postgres setups and the fallback software implementations is 2.8x slower than the fastest 32bit hash around.

Hopefully it'll be useful in the discussion.

Attachment

bench.zip

Re: [REVIEW] Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

14 September 2014, 01:28:10

On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote:
> On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> > >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs@gmail.com>
> > wrote:
> > >>> That's not entirely true. CRC-32C beats pretty much everything with
> > the same
> > >>> length quality-wise and has both hardware implementations and highly
> > >>> optimized software versions.
> >
> > >> For better or for worse CRC is biased by detecting all single bit
> > >> errors, the detection capability of larger errors is slightly
> > >> diminished. The quality of the other algorithms I mentioned is also
> > >> very good, while producing uniformly varying output.
> >
> > > There's also much more literature about the various CRCs in comparison
> > > to some of these hash allgorithms.
> >
> > Indeed.  CRCs have well-understood properties for error detection.
> > Have any of these new algorithms been analyzed even a hundredth as
> > thoroughly?  No.  I'm unimpressed by evidence-free claims that
> > something else is "also very good".
> >
> > Now, CRCs are designed for detecting the sorts of short burst errors
> > that are (or were, back in the day) common on phone lines.  You could
> > certainly make an argument that that's not the type of threat we face
> > for PG data.  However, I've not seen anyone actually make such an
> > argument, let alone demonstrate that some other algorithm would be better.
> > To start with, you'd need to explain precisely what other error pattern
> > is more important to defend against, and why.
> >
> >                         regards, tom lane
> >
> 
> Mysql went this way as well, changing the CRC polynomial in 5.6.
> 
> What we are looking for here is uniqueness thus better error detection. Not
> avalanche effect, nor cryptographically secure, nor bit distribution.
> As far as I'm aware CRC32C is unbeaten collision wise and time proven.
> 
> I couldn't find tests with xxhash and crc32 on the same hardware so I spent
> some time putting together a benchmark (see attachment, to run it just
> start run.sh)
> 
> I included a crc32 implementation using ssr4.2 instructions (which works on
> pretty much any Intel processor built after 2008 and AMD built after 2012),
> a portable Slice-By-8 software implementation and xxhash since it's the
> fastest software 32bit hash I know of.
> 
> Here're the results running the test program on my i5-4200M
> 
> crc sb8: 90444623
> elapsed: 0.513688s
> speed: 1.485220 GB/s
> 
> crc hw: 90444623
> elapsed: 0.048327s
> speed: 15.786877 GB/s
> 
> xxhash: 7f4a8d5
> elapsed: 0.182100s
> speed: 4.189663 GB/s
> 
> The hardware version is insanely and works on the majority of Postgres
> setups and the fallback software implementations is 2.8x slower than the
> fastest 32bit hash around.
> 
> Hopefully it'll be useful in the discussion.

Thank you for running this sample benchmark. It definitely shows that the
hardware version of the CRC is very fast, unfortunately it is really only
available on x64 Intel/AMD processors which leaves all the rest lacking.
For current 64-bit hardware, it might be instructive to also try using
the XXH64 version and just take one half of the hash. It should come in
at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
CRC. Also, while I understand that CRC has a very venerable history and
is well studied for transmission type errors, I have been unable to find
any research on its applicability to validating file/block writes to a
disk drive. While it is to quote you "unbeaten collision wise", xxhash,
both the 32-bit and 64-bit version are its equal. Since there seems to
be a lack of research on disk based error detection versus CRC polynomials,
it seems likely that any of the proposed hash functions are on an equal
footing in this regard. As Andres commented up-thread, xxhash comes along
for "free" with lz4.

Regards,
Ken

Re: [REVIEW] Re: Compression of full-page-writes

From

Arthur Silva

Date:

14 September 2014, 01:43:29

On Sat, Sep 13, 2014 at 10:27 PM, ktm@rice.edu <ktm@rice.edu> wrote:

On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote:
> On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> > >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs@gmail.com>
> > wrote:
> > >>> That's not entirely true. CRC-32C beats pretty much everything with
> > the same
> > >>> length quality-wise and has both hardware implementations and highly
> > >>> optimized software versions.
> >
> > >> For better or for worse CRC is biased by detecting all single bit
> > >> errors, the detection capability of larger errors is slightly
> > >> diminished. The quality of the other algorithms I mentioned is also
> > >> very good, while producing uniformly varying output.
> >
> > > There's also much more literature about the various CRCs in comparison
> > > to some of these hash allgorithms.
> >
> > Indeed. CRCs have well-understood properties for error detection.
> > Have any of these new algorithms been analyzed even a hundredth as
> > thoroughly? No. I'm unimpressed by evidence-free claims that
> > something else is "also very good".
> >
> > Now, CRCs are designed for detecting the sorts of short burst errors
> > that are (or were, back in the day) common on phone lines. You could
> > certainly make an argument that that's not the type of threat we face
> > for PG data. However, I've not seen anyone actually make such an
> > argument, let alone demonstrate that some other algorithm would be better.
> > To start with, you'd need to explain precisely what other error pattern
> > is more important to defend against, and why.
> >
> > regards, tom lane
> >
>
> Mysql went this way as well, changing the CRC polynomial in 5.6.
>
> What we are looking for here is uniqueness thus better error detection. Not
> avalanche effect, nor cryptographically secure, nor bit distribution.
> As far as I'm aware CRC32C is unbeaten collision wise and time proven.
>
> I couldn't find tests with xxhash and crc32 on the same hardware so I spent
> some time putting together a benchmark (see attachment, to run it just
> start run.sh)
>
> I included a crc32 implementation using ssr4.2 instructions (which works on
> pretty much any Intel processor built after 2008 and AMD built after 2012),
> a portable Slice-By-8 software implementation and xxhash since it's the
> fastest software 32bit hash I know of.
>
> Here're the results running the test program on my i5-4200M
>
> crc sb8: 90444623
> elapsed: 0.513688s
> speed: 1.485220 GB/s
>
> crc hw: 90444623
> elapsed: 0.048327s
> speed: 15.786877 GB/s
>
> xxhash: 7f4a8d5
> elapsed: 0.182100s
> speed: 4.189663 GB/s
>
> The hardware version is insanely and works on the majority of Postgres
> setups and the fallback software implementations is 2.8x slower than the
> fastest 32bit hash around.
>
> Hopefully it'll be useful in the discussion.

Thank you for running this sample benchmark. It definitely shows that the
hardware version of the CRC is very fast, unfortunately it is really only
available on x64 Intel/AMD processors which leaves all the rest lacking.
For current 64-bit hardware, it might be instructive to also try using
the XXH64 version and just take one half of the hash. It should come in
at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
CRC. Also, while I understand that CRC has a very venerable history and
is well studied for transmission type errors, I have been unable to find
any research on its applicability to validating file/block writes to a
disk drive. While it is to quote you "unbeaten collision wise", xxhash,
both the 32-bit and 64-bit version are its equal. Since there seems to
be a lack of research on disk based error detection versus CRC polynomials,
it seems likely that any of the proposed hash functions are on an equal
footing in this regard. As Andres commented up-thread, xxhash comes along
for "free" with lz4.

Regards,
Ken

For the sake of completeness the results for xxhash64 in my machine

xxhash64
speed: 7.365398 GB/s

Which is indeed very fast.

Re: [REVIEW] Re: Compression of full-page-writes

From

Claudio Freire

Date:

14 September 2014, 02:19:54

On Sat, Sep 13, 2014 at 10:27 PM, ktm@rice.edu <ktm@rice.edu> wrote:
>> Here're the results running the test program on my i5-4200M
>>
>> crc sb8: 90444623
>> elapsed: 0.513688s
>> speed: 1.485220 GB/s
>>
>> crc hw: 90444623
>> elapsed: 0.048327s
>> speed: 15.786877 GB/s
>>
>> xxhash: 7f4a8d5
>> elapsed: 0.182100s
>> speed: 4.189663 GB/s
>>
>> The hardware version is insanely and works on the majority of Postgres
>> setups and the fallback software implementations is 2.8x slower than the
>> fastest 32bit hash around.
>>
>> Hopefully it'll be useful in the discussion.
>
> Thank you for running this sample benchmark. It definitely shows that the
> hardware version of the CRC is very fast, unfortunately it is really only
> available on x64 Intel/AMD processors which leaves all the rest lacking.
> For current 64-bit hardware, it might be instructive to also try using
> the XXH64 version and just take one half of the hash. It should come in
> at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
> CRC. Also, while I understand that CRC has a very venerable history and
> is well studied for transmission type errors, I have been unable to find
> any research on its applicability to validating file/block writes to a
> disk drive. While it is to quote you "unbeaten collision wise", xxhash,
> both the 32-bit and 64-bit version are its equal. Since there seems to
> be a lack of research on disk based error detection versus CRC polynomials,
> it seems likely that any of the proposed hash functions are on an equal
> footing in this regard. As Andres commented up-thread, xxhash comes along
> for "free" with lz4.


Bear in mind that

a) taking half of the CRC will invalidate all error detection
capability research, and it may also invalidate its properties,
depending on the CRC itself.

b) bit corruption as is the target kind of error for CRC are resurging
in SSDs, as can be seen in table 4 of a link that I think appeared on
this same list:
https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

I would totally forget of taking half of whatever CRC. That's looking
for pain, in that it will invalidate all existing and future research
on that hash/CRC type.

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

14 September 2014, 15:21:22

On 2014-09-13 20:27:51 -0500, ktm@rice.edu wrote:
> > 
> > What we are looking for here is uniqueness thus better error detection. Not
> > avalanche effect, nor cryptographically secure, nor bit distribution.
> > As far as I'm aware CRC32C is unbeaten collision wise and time proven.
> > 
> > I couldn't find tests with xxhash and crc32 on the same hardware so I spent
> > some time putting together a benchmark (see attachment, to run it just
> > start run.sh)
> > 
> > I included a crc32 implementation using ssr4.2 instructions (which works on
> > pretty much any Intel processor built after 2008 and AMD built after 2012),
> > a portable Slice-By-8 software implementation and xxhash since it's the
> > fastest software 32bit hash I know of.
> > 
> > Here're the results running the test program on my i5-4200M
> > 
> > crc sb8: 90444623
> > elapsed: 0.513688s
> > speed: 1.485220 GB/s
> > 
> > crc hw: 90444623
> > elapsed: 0.048327s
> > speed: 15.786877 GB/s
> > 
> > xxhash: 7f4a8d5
> > elapsed: 0.182100s
> > speed: 4.189663 GB/s
> > 
> > The hardware version is insanely and works on the majority of Postgres
> > setups and the fallback software implementations is 2.8x slower than the
> > fastest 32bit hash around.
> > 
> > Hopefully it'll be useful in the discussion.

Note that all these numbers aren't fully relevant to the use case
here. For the WAL - which is what we're talking about and the only place
where CRC32 is used with high throughput - the individual parts of a
record are pretty darn small on average. So performance of checksumming
small amounts of data is more relevant. Mind, that's not likely to go
for CRC32, especially not slice-by-8. The cache fooprint of the large
tables is likely going to be noticeable in non micro benchmarks.

> Also, while I understand that CRC has a very venerable history and
> is well studied for transmission type errors, I have been unable to find
> any research on its applicability to validating file/block writes to a
> disk drive.

Which incidentally doesn't really match what the CRC is used for
here. It's used for individual WAL records. Usually these are pretty
small, far smaller than disk/postgres' blocks on average. There's a
couple scenarios where they can get large, true, but most of them are
small.
The primary reason they're important is to correctly detect the end of
the WAL. To ensure we're interpreting half written records, or records
from before the WAL file was overwritten.

> While it is to quote you "unbeaten collision wise", xxhash,
> both the 32-bit and 64-bit version are its equal.

Aha? You take that from the smhasher results?

> Since there seems to be a lack of research on disk based error
> detection versus CRC polynomials, it seems likely that any of the
> proposed hash functions are on an equal footing in this regard. As
> Andres commented up-thread, xxhash comes along for "free" with lz4.

This is pure handwaving.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

14 September 2014, 17:23:46

On Sun, Sep 14, 2014 at 05:21:10PM +0200, Andres Freund wrote:
> On 2014-09-13 20:27:51 -0500, ktm@rice.edu wrote:
> 
> > Also, while I understand that CRC has a very venerable history and
> > is well studied for transmission type errors, I have been unable to find
> > any research on its applicability to validating file/block writes to a
> > disk drive.
> 
> Which incidentally doesn't really match what the CRC is used for
> here. It's used for individual WAL records. Usually these are pretty
> small, far smaller than disk/postgres' blocks on average. There's a
> couple scenarios where they can get large, true, but most of them are
> small.
> The primary reason they're important is to correctly detect the end of
> the WAL. To ensure we're interpreting half written records, or records
> from before the WAL file was overwritten.
> 
> 
> > While it is to quote you "unbeaten collision wise", xxhash,
> > both the 32-bit and 64-bit version are its equal.
> 
> Aha? You take that from the smhasher results?

Yes.

> 
> > Since there seems to be a lack of research on disk based error
> > detection versus CRC polynomials, it seems likely that any of the
> > proposed hash functions are on an equal footing in this regard. As
> > Andres commented up-thread, xxhash comes along for "free" with lz4.
> 
> This is pure handwaving.

Yes. But without research to support the use of CRC32 in this same
environment, it is handwaving in the other direction. :)

Regards,
Ken

Re: [REVIEW] Re: Compression of full-page-writes

From

Arthur Silva

Date:

14 September 2014, 23:43:24

<div dir="ltr"><p dir="ltr"><br /> Em 14/09/2014 12:21, "Andres Freund" <<a href="mailto:andres@2ndquadrant.com"
target="_blank">andres@2ndquadrant.com</a>>escreveu:<br /> ><br /> > On 2014-09-13 20:27:51 -0500, <a
href="mailto:ktm@rice.edu"target="_blank">ktm@rice.edu</a> wrote:<br /> > > ><br /> > > > What we are
lookingfor here is uniqueness thus better error detection. Not<br /> > > > avalanche effect, nor
cryptographicallysecure, nor bit distribution.<br /> > > > As far as I'm aware CRC32C is unbeaten collision
wiseand time proven.<br /> > > ><br /> > > > I couldn't find tests with xxhash and crc32 on the same
hardwareso I spent<br /> > > > some time putting together a benchmark (see attachment, to run it just<br />
>> > start run.sh)<br /> > > ><br /> > > > I included a crc32 implementation using ssr4.2
instructions(which works on<br /> > > > pretty much any Intel processor built after 2008 and AMD built after
2012),<br/> > > > a portable Slice-By-8 software implementation and xxhash since it's the<br /> > > >
fastestsoftware 32bit hash I know of.<br /> > > ><br /> > > > Here're the results running the test
programon my i5-4200M<br /> > > ><br /> > > > crc sb8: 90444623<br /> > > > elapsed:
0.513688s<br/> > > > speed: 1.485220 GB/s<br /> > > ><br /> > > > crc hw: 90444623<br />
>> > elapsed: 0.048327s<br /> > > > speed: 15.786877 GB/s<br /> > > ><br /> > > >
xxhash:7f4a8d5<br /> > > > elapsed: 0.182100s<br /> > > > speed: 4.189663 GB/s<br /> > >
><br/> > > > The hardware version is insanely and works on the majority of Postgres<br /> > > >
setupsand the fallback software implementations is 2.8x slower than the<br /> > > > fastest 32bit hash
around.<br/> > > ><br /> > > > Hopefully it'll be useful in the discussion.<br /> ><br /> >
Notethat all these numbers aren't fully relevant to the use case<br /> > here. For the WAL - which is what we're
talkingabout and the only place<br /> > where CRC32 is used with high throughput - the individual parts of a<br />
>record are pretty darn small on average. So performance of checksumming<br /> > small amounts of data is more
relevant.Mind, that's not likely to go<br /> > for CRC32, especially not slice-by-8. The cache fooprint of the
large<br/> > tables is likely going to be noticeable in non micro benchmarks.<br /> ><p dir="ltr">Indeed, the
smallinput sizes is something I was missing. Something more cache friendly would be better, it's just a matter of
findinga better candidate.<p dir="ltr"> Although I find it highly unlikely that the 4kb extra table of sb8 brings its
performancedown to sb4 level, even considering the small inputs and cache misses.<p>For what's worth mysql, cassandra,
kafka,ext4, xfx all use crc32c checksums in their WAL/Journals.<br /><p dir="ltr">> > Also, while I understand
thatCRC has a very venerable history and<br /> > > is well studied for transmission type errors, I have been
unableto find<br /> > > any research on its applicability to validating file/block writes to a<br /> > >
diskdrive.<br /> ><br /> > Which incidentally doesn't really match what the CRC is used for<br /> > here. It's
usedfor individual WAL records. Usually these are pretty<br /> > small, far smaller than disk/postgres' blocks on
average.There's a<br /> > couple scenarios where they can get large, true, but most of them are<br /> > small.<br
/>> The primary reason they're important is to correctly detect the end of<br /> > the WAL. To ensure we're
interpretinghalf written records, or records<br /> > from before the WAL file was overwritten.<br /> ><br />
><br/> > > While it is to quote you "unbeaten collision wise", xxhash,<br /> > > both the 32-bit and
64-bitversion are its equal.<br /> ><br /> > Aha? You take that from the smhasher results?<br /> ><br /> >
>Since there seems to be a lack of research on disk based error<br /> > > detection versus CRC polynomials, it
seemslikely that any of the<br /> > > proposed hash functions are on an equal footing in this regard. As<br />
>> Andres commented up-thread, xxhash comes along for "free" with lz4.<br /> ><br /> > This is pure
handwaving.<br/> ><br /> > Greetings,<br /> ><br /> > Andres Freund<br /> ><br /> > --<br /> >
 AndresFreund                     <a href="http://www.2ndQuadrant.com/"
target="_blank">http://www.2ndQuadrant.com/</a><br/> >  PostgreSQL Development, 24x7 Support, Training &
Services<br/></div>

Re: [REVIEW] Re: Compression of full-page-writes

From

Craig Ringer

Date:

15 September 2014, 04:27:24

On 09/14/2014 09:27 AM, ktm@rice.edu wrote:
> Thank you for running this sample benchmark. It definitely shows that the
> hardware version of the CRC is very fast, unfortunately it is really only
> available on x64 Intel/AMD processors which leaves all the rest lacking.

We're talking about something that'd land in 9.5 at best, and going by
the adoption rates I see, get picked up slowly over the next couple of
years by users.

Given that hardware support is already widespread now, I'm not at all
convinced that this is a problem. In mid-2015 we'd be talking about 4+
year old AMD CPUs and Intel CPUs that're 6+ years old.

In a quick search around I did find one class of machine I have access
to that doesn't have SSE 4.2 support. Well, two if you count the POWER7
boxes. It is a type of pre-OpenStack slated-for-retirement RackSpace
server with an Opteron 2374.

People on older, slower hardware won't get a big performance boost when
adopting a new PostgreSQL major release on their old gear. This doesn't
greatly upset me.

It'd be another thing if we were talking about something where people
without the required support would be unable to run the Pg release or
take a massive performance hit, but that doesn't appear to be the case here.

So I'm all for taking advantage of the hardware support.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Heikki Linnakangas

Date:

15 September 2014, 10:58:26

On 09/15/2014 02:42 AM, Arthur Silva wrote:
> Em 14/09/2014 12:21, "Andres Freund" <andres@2ndquadrant.com> escreveu:
>>
>> On 2014-09-13 20:27:51 -0500, ktm@rice.edu wrote:
>>>>
>>>> What we are looking for here is uniqueness thus better error
> detection. Not
>>>> avalanche effect, nor cryptographically secure, nor bit distribution.
>>>> As far as I'm aware CRC32C is unbeaten collision wise and time proven.
>>>>
>>>> I couldn't find tests with xxhash and crc32 on the same hardware so I
> spent
>>>> some time putting together a benchmark (see attachment, to run it just
>>>> start run.sh)
>>>>
>>>> I included a crc32 implementation using ssr4.2 instructions (which
> works on
>>>> pretty much any Intel processor built after 2008 and AMD built after
> 2012),
>>>> a portable Slice-By-8 software implementation and xxhash since it's
> the
>>>> fastest software 32bit hash I know of.
>>>>
>>>> Here're the results running the test program on my i5-4200M
>>>>
>>>> crc sb8: 90444623
>>>> elapsed: 0.513688s
>>>> speed: 1.485220 GB/s
>>>>
>>>> crc hw: 90444623
>>>> elapsed: 0.048327s
>>>> speed: 15.786877 GB/s
>>>>
>>>> xxhash: 7f4a8d5
>>>> elapsed: 0.182100s
>>>> speed: 4.189663 GB/s
>>>>
>>>> The hardware version is insanely and works on the majority of Postgres
>>>> setups and the fallback software implementations is 2.8x slower than
> the
>>>> fastest 32bit hash around.
>>>>
>>>> Hopefully it'll be useful in the discussion.
>>
>> Note that all these numbers aren't fully relevant to the use case
>> here. For the WAL - which is what we're talking about and the only place
>> where CRC32 is used with high throughput - the individual parts of a
>> record are pretty darn small on average. So performance of checksumming
>> small amounts of data is more relevant. Mind, that's not likely to go
>> for CRC32, especially not slice-by-8. The cache fooprint of the large
>> tables is likely going to be noticeable in non micro benchmarks.
>
> Indeed, the small input sizes is something I was missing. Something more
> cache friendly would be better, it's just a matter of finding a better
> candidate.

It's worth noting that the extra tables that slicing-by-4 requires are 
and *in addition to* the lookup table we already have. And slicing-by-8 
builds on the slicing-by-4 lookup tables. Our current algorithm uses a 
1kB lookup table, slicing-by-4 a 4kB, and slicing-by-8 8kB. But the 
first 1kB of the slicing-by-4 lookup table is identical to the current 
1kB lookup table, and the first 4kB of the slicing-by-8 are identical to 
the slicing-by-4 tables.

It would be pretty straightforward to use the current algorithm when the 
WAL record is very small, and slicing-by-4 or slicing-by-8 for larger 
records (like FPWs), where the larger table is more likely to pay off. I 
have no idea where the break-even point is with the current algorithm 
vs. slicing-by-4 and a cold cache, but maybe we can get a handle on that 
with some micro-benchmarking.

Although this is complicated by the fact that slicing-by-4 or -8 might 
well be a win even with very small records, if you generate a lot of them.

- Heikki

Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)

From

Amit Kapila

Date:

16 September 2014, 10:13:15

On Sat, Sep 13, 2014 at 1:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
>> At 2014-09-12 22:38:01 +0300, hlinnakangas@vmware.com wrote:
>>> We probably should consider switching to a faster CRC algorithm again,
>>> regardless of what we do with compression.
>>
>> As it happens, I'm already working on resurrecting a patch that Andres
>> posted in 2010 to switch to zlib's faster CRC implementation.
>
> As it happens, I also wrote an implementation of Slice-by-4 the other day :-).

> Haven't gotten around to post it, but here it is.

Incase we are using the implementation for everything that uses

COMP_CRC32() macro, won't it give problem for older version

databases. I have created a database with Head code and then

tried to start server after applying this patch it gives below error:

FATAL: incorrect checksum in control file

In general, the idea sounds quite promising. To see how it performs

on small to medium size data, I have used attached test which is

written be you (with some additional tests) during performance test

of WAL reduction patch in 9.4.

Performance Data

------------------------------

Non-default settings

autovacuum = off

checkpoint_segments = 256

checkpoint_timeout = 20 min

HEAD -

testname | wal_generated | duration

-----------------------------------------+---------------+------------------

two short fields, no change | 583802008 | 11.6727559566498

two short fields, no change | 580888024 | 11.8558299541473

two short fields, no change | 580889680 | 11.5449349880219

two short fields, one changed | 620646400 | 11.6657111644745

two short fields, one changed | 620667904 | 11.6010649204254

two short fields, one changed | 622079320 | 11.6774570941925

two short fields, both changed | 620649656 | 12.0892491340637

two short fields, both changed | 620648360 | 12.1650269031525

two short fields, both changed | 620653952 | 12.2125108242035

one short and one long field, no change | 329018192 | 4.74178600311279

one short and one long field, no change | 329021664 | 4.71507883071899

one short and one long field, no change | 330326496 | 4.84932994842529

ten tiny fields, all changed | 701358488 | 14.236780166626

ten tiny fields, all changed | 701355328 | 14.0777900218964

ten tiny fields, all changed | 701358272 | 14.1000919342041

hundred tiny fields, all changed | 315656568 | 6.99316620826721

hundred tiny fields, all changed | 314875488 | 6.85715913772583

hundred tiny fields, all changed | 315263768 | 6.94613790512085

hundred tiny fields, half changed | 314878360 | 6.89090895652771

hundred tiny fields, half changed | 314877216 | 7.05924606323242

hundred tiny fields, half changed | 314881816 | 6.93445992469788

hundred tiny fields, half nulled | 236244136 | 6.43347096443176

hundred tiny fields, half nulled | 236248104 | 6.30539107322693

hundred tiny fields, half nulled | 236501040 | 6.33403086662292

9 short and 1 long, short changed | 262373616 | 4.24646091461182

9 short and 1 long, short changed | 262375136 | 4.49821400642395

9 short and 1 long, short changed | 262379840 | 4.38264393806458

(27 rows)

Patched -

testname | wal_generated | duration

-----------------------------------------+---------------+------------------

two short fields, no change | 580897400 | 10.6518769264221

two short fields, no change | 581779816 | 10.7118690013885

two short fields, no change | 581013224 | 10.8294110298157

two short fields, one changed | 620646264 | 10.8309078216553

two short fields, one changed | 620652872 | 10.8480410575867

two short fields, one changed | 620812376 | 10.9162290096283

two short fields, both changed | 620651792 | 10.9025599956512

two short fields, both changed | 620652304 | 10.7771129608154

two short fields, both changed | 620649960 | 11.0185468196869

one short and one long field, no change | 329022000 | 3.88278198242188

one short and one long field, no change | 329023656 | 4.01899003982544

one short and one long field, no change | 329022992 | 3.91587209701538

ten tiny fields, all changed | 701353296 | 12.7748699188232

ten tiny fields, all changed | 701354848 | 12.761589050293

ten tiny fields, all changed | 701356520 | 12.6703131198883

hundred tiny fields, all changed | 314879424 | 6.25606894493103

hundred tiny fields, all changed | 314878416 | 6.32905578613281

hundred tiny fields, all changed | 314878464 | 6.28877377510071

hundred tiny fields, half changed | 314874808 | 6.25019288063049

hundred tiny fields, half changed | 314881296 | 6.41510701179504

hundred tiny fields, half changed | 314881320 | 6.42809700965881

hundred tiny fields, half nulled | 236248928 | 5.9281849861145

hundred tiny fields, half nulled | 236251768 | 5.91391110420227

hundred tiny fields, half nulled | 236247288 | 5.94086098670959

9 short and 1 long, short changed | 262374536 | 3.77700018882751

9 short and 1 long, short changed | 262377504 | 3.81636500358582

9 short and 1 long, short changed | 262378880 | 3.84033012390137

(27 rows)

The patched version gives better results in all cases

(in range of 10~15%), though this is not the perfect test, however

it gives fair idea that the patch is quite promising. I think to test

the benefit from crc calculation for full page, we can have some

checkpoint during each test (may be after insert). Let me know

what other kind of tests do you think are required to see the

gain/loss from this patch.

I think the main difference in this patch and what Andres has

developed sometime back was code for manually unrolled loop

doing 32bytes at once, so once Andres or Abhijit will post an

updated version, we can do some performance tests to see

if there is any additional gain.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

wal-update-testsuite.sh

Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)

From

Andres Freund

Date:

16 September 2014, 10:28:16

On 2014-09-16 15:43:06 +0530, Amit Kapila wrote:
> On Sat, Sep 13, 2014 at 1:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com>
> wrote:
> > On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
> >> At 2014-09-12 22:38:01 +0300, hlinnakangas@vmware.com wrote:
> >>> We probably should consider switching to a faster CRC algorithm again,
> >>> regardless of what we do with compression.
> >>
> >> As it happens, I'm already working on resurrecting a patch that Andres
> >> posted in 2010 to switch to zlib's faster CRC implementation.
> >
> > As it happens, I also wrote an implementation of Slice-by-4 the other day
> :-).
> > Haven't gotten around to post it, but here it is.
> 
> Incase we are using the implementation for everything that uses
> COMP_CRC32() macro, won't it give problem for older version
> databases.  I have created a database with Head code and then
> tried to start server after applying this patch it gives below error:
> FATAL:  incorrect checksum in control file

That's indicative of a bug. This really shouldn't cause such problems -
at least my version was compatible with the current definition, and IIRC
Heikki's should be the same in theory. If I read it right.

> In general, the idea sounds quite promising.  To see how it performs
> on small to medium size data, I have used attached test which is
> written be you (with some additional tests) during performance test
> of WAL reduction patch in 9.4.

Yes, we should really do this.

> The patched version gives better results in all cases
> (in range of 10~15%), though this is not the perfect test, however
> it gives fair idea that the patch is quite promising.  I think to test
> the benefit from crc calculation for full page, we can have some
> checkpoint during each test (may be after insert).  Let me know
> what other kind of tests do you think are required to see the
> gain/loss from this patch.

I actually think we don't really need this. It's pretty evident that
slice-by-4 is a clear improvement.

> I think the main difference in this patch and what Andres has
> developed sometime back was code for manually unrolled loop
> doing 32bytes at once, so once Andres or Abhijit will post an
> updated version, we can do some performance tests to see
> if there is any additional gain.

If Heikki's version works I see little need to use my/Abhijit's
patch. That version has part of it under the zlib license. If Heikki's
version is a 'clean room', then I'd say we go with it. It looks really
quite similar though... We can make minor changes like additional
unrolling without problems lateron.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)

From

Heikki Linnakangas

Date:

16 September 2014, 10:49:45

On 09/16/2014 01:28 PM, Andres Freund wrote:
> On 2014-09-16 15:43:06 +0530, Amit Kapila wrote:
>> On Sat, Sep 13, 2014 at 1:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com>
>> wrote:
>>> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
>>>> At 2014-09-12 22:38:01 +0300, hlinnakangas@vmware.com wrote:
>>>>> We probably should consider switching to a faster CRC algorithm again,
>>>>> regardless of what we do with compression.
>>>>
>>>> As it happens, I'm already working on resurrecting a patch that Andres
>>>> posted in 2010 to switch to zlib's faster CRC implementation.
>>>
>>> As it happens, I also wrote an implementation of Slice-by-4 the other day
>> :-).
>>> Haven't gotten around to post it, but here it is.
>>
>> Incase we are using the implementation for everything that uses
>> COMP_CRC32() macro, won't it give problem for older version
>> databases.  I have created a database with Head code and then
>> tried to start server after applying this patch it gives below error:
>> FATAL:  incorrect checksum in control file
>
> That's indicative of a bug. This really shouldn't cause such problems -
> at least my version was compatible with the current definition, and IIRC
> Heikki's should be the same in theory. If I read it right.
>
>> In general, the idea sounds quite promising.  To see how it performs
>> on small to medium size data, I have used attached test which is
>> written be you (with some additional tests) during performance test
>> of WAL reduction patch in 9.4.
>
> Yes, we should really do this.
>
>> The patched version gives better results in all cases
>> (in range of 10~15%), though this is not the perfect test, however
>> it gives fair idea that the patch is quite promising.  I think to test
>> the benefit from crc calculation for full page, we can have some
>> checkpoint during each test (may be after insert).  Let me know
>> what other kind of tests do you think are required to see the
>> gain/loss from this patch.
>
> I actually think we don't really need this. It's pretty evident that
> slice-by-4 is a clear improvement.
>
>> I think the main difference in this patch and what Andres has
>> developed sometime back was code for manually unrolled loop
>> doing 32bytes at once, so once Andres or Abhijit will post an
>> updated version, we can do some performance tests to see
>> if there is any additional gain.
>
> If Heikki's version works I see little need to use my/Abhijit's
> patch. That version has part of it under the zlib license. If Heikki's
> version is a 'clean room', then I'd say we go with it. It looks really
> quite similar though... We can make minor changes like additional
> unrolling without problems lateron.

I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as 
reference - you can probably see the similarity. Any implementation is 
going to look more or less the same, though; there aren't that many ways 
to write the implementation.

- Heikki

Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)

From

Andres Freund

Date:

16 September 2014, 10:57:30

On 2014-09-16 13:49:20 +0300, Heikki Linnakangas wrote:
> I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
> reference - you can probably see the similarity. Any implementation is going
> to look more or less the same, though; there aren't that many ways to write
> the implementation.

True.

I think I see what's the problem causing Amit's test to fail. Amit, did
you use the powerpc machine?

Heikki, you swap bytes unconditionally - afaics that's wrong on big
endian systems. My patch had:

+ static inline uint32 swab32(const uint32 x);
+ static inline uint32 swab32(const uint32 x){
+     return ((x & (uint32)0x000000ffUL) << 24) |
+         ((x & (uint32)0x0000ff00UL) <<  8) |
+         ((x & (uint32)0x00ff0000UL) >>  8) |
+         ((x & (uint32)0xff000000UL) >> 24);
+ }
+ 
+ #if defined __BIG_ENDIAN__
+ #define cpu_to_be32(x)
+ #else
+ #define cpu_to_be32(x) swab32(x)
+ #endif


I guess yours needs something similar. I personally like the cpu_to_be*
naming - it imo makes it pretty clear what happens.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)

From

Amit Kapila

Date:

16 September 2014, 12:20:47

On Tue, Sep 16, 2014 at 4:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-09-16 13:49:20 +0300, Heikki Linnakangas wrote:
> > I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
> > reference - you can probably see the similarity. Any implementation is going
> > to look more or less the same, though; there aren't that many ways to write
> > the implementation.
>
> True.
>
> I think I see what's the problem causing Amit's test to fail. Amit, did
> you use the powerpc machine?

Yes.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

19 September 2014, 11:38:28

Hello,

>Maybe. Let's get the basic patch done first; then we can argue about that

Please find attached patch to compress FPW using pglz compression.
All backup blocks in WAL record are compressed at once before inserting it
into WAL buffers . Full_page_writes GUC has been modified to accept three
values
1. On
2. Compress
3. Off
FPW are compressed when full_page_writes is set to compress. FPW generated
forcibly during online backup even when full_page_writes is off are also
compressed. When full_page_writes is set on FPW are not compressed.
Benckmark:
Server Specification:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
Checkpoint segments: 1024
Checkpoint timeout: 5 mins
pgbench -c 64 -j 64 -r -T 900 -M prepared
Scale factor: 1000
WAL generated (MB) Throughput
(tps) Latency(ms)
On 9235.43
979.03 65.36
Compress(pglz) 6518.68
1072.34 59.66
Off 501.04 1135.17
56.34

The results show around 30 percent decrease in WAL volume due to
compression of FPW.
compress_fpw_v1.patch
<http://postgresql.1045698.n5.nabble.com/file/n5819645/compress_fpw_v1.patch>

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5819645.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

19 September 2014, 14:05:46

>Please find attached patch to compress FPW using pglz compression.
Please refer the updated patch attached. The earlier patch added few
duplicate lines of code in guc.c file.

compress_fpw_v1.patch
<http://postgresql.1045698.n5.nabble.com/file/n5819659/compress_fpw_v1.patch>  


Thank you,
Rahila Syed



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5819659.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: [REVIEW] Re: Compression of full-page-writes

From

Tom Lane

Date:

19 September 2014, 14:42:28

Rahila Syed <rahilasyed.90@gmail.com> writes:
> Please find attached patch to compress FPW using pglz compression.

Patch not actually attached AFAICS (no, a link is not good enough).
        regards, tom lane

Re: [REVIEW] Re: Compression of full-page-writes

From

Sawada Masahiko

Date:

19 September 2014, 15:55:56

On Fri, Sep 19, 2014 at 11:05 PM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
>
>>Please find attached patch to compress FPW using pglz compression.
> Please refer the updated patch attached. The earlier patch added few
> duplicate lines of code in guc.c file.
>
> compress_fpw_v1.patch
> <http://postgresql.1045698.n5.nabble.com/file/n5819659/compress_fpw_v1.patch>
>
>

I got patching failed to HEAD.
Detail is following.

Hunk #3 FAILED at 142.
1 out of 3 hunks FAILED -- saving rejects to file
src/backend/access/rmgrdesc/xlogdesc.c.rej

Regards,

-------
Sawada Masahiko

Re: [REVIEW] Re: Compression of full-page-writes

From

Alvaro Herrera

Date:

21 September 2014, 22:23:31

Tom Lane wrote:
> Rahila Syed <rahilasyed.90@gmail.com> writes:
> > Please find attached patch to compress FPW using pglz compression.
> 
> Patch not actually attached AFAICS (no, a link is not good enough).

Well, from Rahila's point of view the patch is actually attached, but
she's posting from the Nabble interface, which mangles it and turns into
a link instead.  Not her fault, really -- but the end result is the
same: to properly submit a patch, you need to send an email to the
pgsql-hackers@postgresql.org mailing list, not join a group/forum from
some intermediary newsgroup site that mirrors the list.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

22 September 2014, 09:45:53

<p dir="ltr"><br /> Hello All,<p dir="ltr"> <br /> >Well, from Rahila's point of view the patch is actually
attached,but <br /> >she's posting from the Nabble interface, which mangles it and turns into <br /> >a link
instead.<pdir="ltr">Yes.<p dir="ltr"> <p dir="ltr">>but the end result is the <br /> >same: to properly submit a
patch,you need to send an email to the <br /> > mailing list, not join a group/forum from <br /> >some
intermediarynewsgroup site that mirrors the list.<br />  <p dir="ltr">Thank you. I will take care of it henceforth.  <p
dir="ltr">Pleasefind attached the patch to compress FPW.  Patch submitted by Fujii-san earlier in the thread is used to
mergecompression GUC with full_page_writes.<p dir="ltr"> <p dir="ltr">I am reposting the measurement numbers.<p
dir="ltr">ServerSpecification: <br /> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos <br />
RAM:32GB <br /> Disk : HDWD     450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos <br /> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s,
10,000rpm <p dir="ltr">Checkpoint segments: 1024 <br /> Checkpoint timeout: 5 mins <p dir="ltr">pgbench -c 64 -j 64 -r
-T900 -M prepared <br /> Scale factor: 1000 <p dir="ltr">                                WAL generated (MB)
          Throughput(tps)              Latency(ms) <br /> On                             9235.43                      
       979.03                                           65.36 <br /> Compress(pglz)       6518.68                      
        1072.34                                         59.66 <br /> Off                             501.04            
                    1135.17                                        56.34 <p dir="ltr">The results show  around 30
percentdecrease in WAL volume due to compression of FPW.<p dir="ltr">Thank you ,<p dir="ltr">Rahila Syed<p
dir="ltr">TomLane wrote:<br /> > Rahila Syed <<a href="mailto:rahilasyed.90@gmail.com">rahilasyed</a><a
href="mailto:rahilasyed.90@gmail.com">.90@</a><ahref="mailto:rahilasyed.90@gmail.com">gmail.com</a>> writes:<br />
>> Please find attached patch to compress FPW using pglz compression.<br /> ><br /> > Patch not actually
attachedAFAICS (no, a link is not good enough).<p dir="ltr">Well, from Rahila's point of view the patch is actually
attached,but<br /> she's posting from the Nabble interface, which mangles it and turns into<br /> a link instead.  Not
herfault, really -- but the end result is the<br /> same: to properly submit a patch, you need to send an email to
the<br/><a href="mailto:pgsql-hackers@postgresql.org">pgsql</a><a href="mailto:pgsql-hackers@postgresql.org">-</a><a
href="mailto:pgsql-hackers@postgresql.org">hackers</a><ahref="mailto:pgsql-hackers@postgresql.org">@</a><a
href="mailto:pgsql-hackers@postgresql.org">postgresql.org</a>mailinglist, not join a group/forum from<br /> some
intermediarynewsgroup site that mirrors the list.<p dir="ltr">--<br /> Álvaro Herrera               <a
href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a><br/> PostgreSQL Development, 24x7 Support, Training
&Services<br /><br />

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

22 September 2014, 10:39:53

Hello,

>Please find attached the patch to compress FPW.

Sorry I had forgotten to attach. Please find the patch attached.

Thank you,

Rahila Syed

From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Rahila Syed
Sent: Monday, September 22, 2014 3:16 PM
To: Alvaro Herrera
Cc: Rahila Syed; PostgreSQL-development; Tom Lane
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

Hello All,

>Well, from Rahila's point of view the patch is actually attached, but
>she's posting from the Nabble interface, which mangles it and turns into
>a link instead.

Yes.

>but the end result is the
>same: to properly submit a patch, you need to send an email to the
> mailing list, not join a group/forum from
>some intermediary newsgroup site that mirrors the list.

Thank you. I will take care of it henceforth.

Please find attached the patch to compress FPW. Patch submitted by Fujii-san earlier in the thread is used to merge compression GUC with full_page_writes.

I am reposting the measurement numbers.

Server Specification:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDWD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Checkpoint segments: 1024
Checkpoint timeout: 5 mins

pgbench -c 64 -j 64 -r -T 900 -M prepared
Scale factor: 1000

WAL generated (MB) Throughput (tps) Latency(ms)
On 9235.43 979.03 65.36
Compress(pglz) 6518.68 1072.34 59.66
Off 501.04 1135.17 56.34

The results show around 30 percent decrease in WAL volume due to compression of FPW.

Thank you ,

Rahila Syed

Tom Lane wrote:
> Rahila Syed <rahilasyed .90@gmail.com> writes:
> > Please find attached patch to compress FPW using pglz compression.
>
> Patch not actually attached AFAICS (no, a link is not good enough).

Well, from Rahila's point of view the patch is actually attached, but
she's posting from the Nabble interface, which mangles it and turns into
a link instead. Not her fault, really -- but the end result is the
same: to properly submit a patch, you need to send an email to the
pgsql -hackers @postgresql.orgmailing list, not join a group/forum from
some intermediary newsgroup site that mirrors the list.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

______________________________________________________________________
Disclaimer:This email and any attachments are sent in strictest confidence for the sole use of the addressee and may contain legally privileged, confidential, and proprietary data. If you are not the intended recipient, please advise the sender by replying promptly to this email and then delete and destroy this email and any attachments without any further use, copying or forwarding

Attachment

compress_fpw_v1.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Florian Weimer

Date:

23 September 2014, 17:15:34

* Ants Aasma:

> CRC has exactly one hardware implementation in general purpose CPU's

I'm pretty sure that's not true.  Many general purpose CPUs have CRC
circuity, and there must be some which also expose them as
instructions.

> and Intel has a patent on the techniques they used to implement
> it. The fact that AMD hasn't yet implemented this instruction shows
> that this patent is non-trivial to work around.

I think you're jumping to conclusions.  Intel and AMD have various
cross-licensing deals.  AMD faces other constraints which can make
implementing the instruction difficult.

Re: [REVIEW] Re: Compression of full-page-writes

From

Ants Aasma

Date:

24 September 2014, 14:57:05

On Tue, Sep 23, 2014 at 8:15 PM, Florian Weimer <fw@deneb.enyo.de> wrote:
> * Ants Aasma:
>
>> CRC has exactly one hardware implementation in general purpose CPU's
>
> I'm pretty sure that's not true.  Many general purpose CPUs have CRC
> circuity, and there must be some which also expose them as
> instructions.

I must eat my words here, indeed AMD processors starting from
Bulldozer do implement the CRC32 instruction. However, according to
Agner Fog, AMD's implementation has a 6 cycle latency and more
importantly a throughput of 1/6 per cycle. While Intel's
implementation on all CPUs except the new Atom has 3 cycle latency and
1 instruction/cycle throughput. This means that there still is a
significant handicap for AMD platforms, not to mention Power or Sparc
with no hardware support. Some ARM's implement CRC32, but I haven't
researched what their performance is.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Re: [REVIEW] Re: Compression of full-page-writes

From

andres@anarazel.de (Andres Freund)

Date:

29 September 2014, 12:36:33

Hi,

On 2014-09-22 10:39:32 +0000, Syed, Rahila wrote:
> >Please find attached the patch to compress FPW.

I've given this a quick look and noticed some things:
1) I don't think it's a good idea to put the full page write compression  into struct XLogRecord.

2) You've essentially removed a lot of checks about the validity of bkp  blocks in xlogreader. I don't think that's
acceptable.

3) You have both FullPageWritesStr() and full_page_writes_str().

4) I don't like FullPageWritesIsNeeded(). For one it, at least to me,  sounds grammatically wrong. More importantly
whenreading it I'm  thinking of it being about the LSN check. How about instead directly  checking whatever !=
FULL_PAGE_WRITES_OFF?

5) CompressBackupBlockPagesAlloc is declared static but not defined as  such.

6) You call CompressBackupBlockPagesAlloc() from two places. Neither is  IIRC within a critical section. So you imo
shouldremove the outOfMem  handling and revert to palloc() instead of using malloc directly. One  thing worthy of note
isthat I don't think you currently can  "legally" check fullPageWrites == FULL_PAGE_WRITES_ON when calling it  only
duringstartup as fullPageWrites can be changed at runtime.
 

7) Unless I miss something CompressBackupBlock should be plural, right?  ATM it compresses all the blocks?

8) I don't tests like  "if (fpw <= FULL_PAGE_WRITES_COMPRESS)". That  relies on the, less than intuitive, ordering of
FULL_PAGE_WRITES_COMPRESS(=1) before FULL_PAGE_WRITES_ON (=2).
 

9) I think you've broken the case where we first think 1 block needs to  be backed up, and another doesn't. If we then
detect,after the  START_CRIT_SECTION(), that we need to "goto begin;" orig_len will  still have it's old content.
 


I think that's it for now. Imo it'd be ok to mark this patch as returned
with feedback and deal with it during the next fest.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

29 September 2014, 15:02:55

On Mon, Sep 29, 2014 at 8:36 AM, Andres Freund <andres@anarazel.de> wrote:
> 1) I don't think it's a good idea to put the full page write compression
>    into struct XLogRecord.

Why not, and where should that be put?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

29 September 2014, 15:21:01

On 2014-09-29 11:02:49 -0400, Robert Haas wrote:
> On Mon, Sep 29, 2014 at 8:36 AM, Andres Freund <andres@anarazel.de> wrote:
> > 1) I don't think it's a good idea to put the full page write compression
> >    into struct XLogRecord.
> 
> Why not, and where should that be put?

Hah. I knew that somebody would pick that comment up ;)

I think it shouldn't be there because it looks trivial to avoid putting
it there. There's no runtime and nearly no code complexity reduction
gained by adding a field to struct XLogRecord. The best way to do that
depends a bit on how my complaint about the removed error checking
during reading the backup block data is resolved.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Heikki Linnakangas

Date:

29 September 2014, 15:27:27

On 09/29/2014 06:02 PM, Robert Haas wrote:
> On Mon, Sep 29, 2014 at 8:36 AM, Andres Freund <andres@anarazel.de> wrote:
>> 1) I don't think it's a good idea to put the full page write compression
>>     into struct XLogRecord.
>
> Why not, and where should that be put?

It should be a flag in BkpBlock.

- Heikki

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

29 September 2014, 15:29:23

On 2014-09-29 18:27:01 +0300, Heikki Linnakangas wrote:
> On 09/29/2014 06:02 PM, Robert Haas wrote:
> >On Mon, Sep 29, 2014 at 8:36 AM, Andres Freund <andres@anarazel.de> wrote:
> >>1) I don't think it's a good idea to put the full page write compression
> >>    into struct XLogRecord.
> >
> >Why not, and where should that be put?
> 
> It should be a flag in BkpBlock.

Doesn't work with the current approach (which I don't really like
much). The backup blocks are all compressed together. *Including* all
the struct BkpBlocks. Then the field in struct XLogRecord is used to
decide whether to decompress the whole thing or to take it verbatim.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)

From

Robert Haas

Date:

06 October 2014, 15:04:24

On Tue, Sep 16, 2014 at 6:49 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>>>> As it happens, I also wrote an implementation of Slice-by-4 the other
>>>> day
>>>
>> If Heikki's version works I see little need to use my/Abhijit's
>> patch. That version has part of it under the zlib license. If Heikki's
>> version is a 'clean room', then I'd say we go with it. It looks really
>> quite similar though... We can make minor changes like additional
>> unrolling without problems lateron.
>
>
> I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
> reference - you can probably see the similarity. Any implementation is going
> to look more or less the same, though; there aren't that many ways to write
> the implementation.

So, it seems like the status of this patch is:

1. It probably has a bug, since Amit's testing seemed to show that it
wasn't returning the same results as unpatched master.

2. The performance tests showed a significant win on an important workload.

3. It's not in any CommitFest anywhere.

Given point #2, it's seems like we ought to find a way to keep this
from sliding into oblivion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

09 October 2014, 15:40:21

Hello,

Thank you for review.

>1) I don't think it's a good idea to put the full page write compression   into struct XLogRecord.

Full page write compression information can be stored in varlena struct of
compressed blocks as done for toast data in pluggable compression support
patch. If I understand correctly, it can be done similar to the manner in
which compressed Datum is modified to contain information about compression
algorithm in pluggable compression support patch.    

>2) You've essentially removed a lot of checks about the validity of bkp   blocks in xlogreader. I don't think that's
acceptable

To ensure this, the raw size stored in first four byte of compressed datum
can be used to perform error checking for backup blocks
Currently, the error checking for size of backup blocks happens individually
for each block.
If backup blocks are compressed together , it can happen once for the entire
set of backup blocks in a WAL record. The total raw size of compressed
blocks can be checked against the total size stored in WAL record header. 

>3) You have both FullPageWritesStr() and full_page_writes_str().

full_page_writes_str() is true/false version of FullPageWritesStr macro. It
is implemented for backward compatibility with pg_xlogdump


>4)I don't like FullPageWritesIsNeeded(). For one it, at least to me,   sounds grammatically wrong. More importantly
whenreading it I'm   thinking of it being about the LSN check. How about instead directly   checking whatever !=
FULL_PAGE_WRITES_OFF?
 

I will modify this.

>5) CompressBackupBlockPagesAlloc is declared static but not defined as   such. 
>7) Unless I miss something CompressBackupBlock should be plural, right?   ATM it compresses all the blocks? 
I will correct these.

>6)You call CompressBackupBlockPagesAlloc() from two places. Neither is > IIRC within a critical section. So you imo
shouldremove the outOfMem >  handling and revert to palloc() instead of using malloc directly. 
 

Yes neither is in critical section. outOfMem handling is done in order to
proceed without compression of FPW in case sufficient memory is not
available for compression.


Thank you,
Rahila Syed



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5822391.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

17 October 2014, 04:52:56

Hello,

Please find the updated patch attached.

>1) I don't think it's a good idea to put the full page write compression
into struct XLogRecord.

1. The compressed blocks is of varlena type. Hence, VARATT_IS_COMPRESSED can be used to detect if the datum is compressed. But, it can give false positive when blocks are not compressed because uncompressed blocks in WAL record are not of type varlena. If I understand correctly, VARATT_IS_COMPRESSED looks for particular bit pattern in the datum which when found it returns true irrespective of type of datum.

2. BkpBlock header of the first block in a WAL record can be copied as it is followed by compressed data including block corresponding to first header and remaining headers and blocks. This header can then be used to store flag indicating if the blocks are compressed or not. This seems to be a feasible option but will increase few bytes equivalent to sizeof(BkpBlock) in record when compared to the method of compressing all blocks and headers.

Also , the full page write compression currently stored in WAL record occupies 1 byte of padding hence does not increase the overall size. But at the same timecompression attribute is related to backup up blocks hence it makes more sense to have it in BkpBlock header. Although, the attached patch does not include this yet as it will be better to get consensus first.

Thoughts?

>2) You've essentially removed a lot of checks about the validity of bkp
blocks in xlogreader. I don't think that's acceptable

Check to see if size of compressed blocks agrees with the total size stored on WAL record header is added in the patch attached. This serves as a check to validate length of record.

>3) You have both FullPageWritesStr() and full_page_writes_str().

This has not changed for now reason being full_page_writes_str() is true/false version of FullPageWritesStr macro. It
is implemented for backward compatibility with pg_xlogdump.

>4)I don't like FullPageWritesIsNeeded(). For one it, at least to me,

>7) Unless I miss something CompressBackupBlock should be plural, right?
ATM it compresses all the blocks?
>8) I don't tests like "if (fpw <= FULL_PAGE_WRITES_COMPRESS)". That
relies on the, less than intuitive, ordering of
FULL_PAGE_WRITES_COMPRESS (=1) before FULL_PAGE_WRITES_ON (=2).
>9) I think you've broken the case where we first think 1 block needs to
be backed up, and another doesn't. If we then detect, after the
START_CRIT_SECTION(), that we need to "goto begin;" orig_len will
still have it's old content.

I have corrected these in the patch attached.

>5) CompressBackupBlockPagesAlloc is declared static but not defined as
such.

Have made it global now in order to be able to access it from PostgresMain.

>6) You call CompressBackupBlockPagesAlloc() from two places. Neither is
IIRC within a critical section. So you imo should remove the outOfMem
handling and revert to palloc() instead of using malloc directly.

This has not been changed in the current patch reason being outOfMem handling is done in order to

proceed without compression of FPW in case sufficient memory is not
available for compression.

>One
> thing worthy of note is that I don't think you currently can
> "legally" check fullPageWrites == FULL_PAGE_WRITES_ON when calling it
> only during startup as fullPageWrites can be changed at runtime

In the attached patch, this check is also added in PostgresMain on SIGHUP after processing postgresql.conf file.

Thank you,

Rahila Syed

On Mon, Sep 29, 2014 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2014-09-22 10:39:32 +0000, Syed, Rahila wrote:
> >Please find attached the patch to compress FPW.

I've given this a quick look and noticed some things:
1) I don't think it's a good idea to put the full page write compression
into struct XLogRecord.

2) You've essentially removed a lot of checks about the validity of bkp
blocks in xlogreader. I don't think that's acceptable.

3) You have both FullPageWritesStr() and full_page_writes_str().

4) I don't like FullPageWritesIsNeeded(). For one it, at least to me,
sounds grammatically wrong. More importantly when reading it I'm
thinking of it being about the LSN check. How about instead directly
checking whatever != FULL_PAGE_WRITES_OFF?

5) CompressBackupBlockPagesAlloc is declared static but not defined as
such.

6) You call CompressBackupBlockPagesAlloc() from two places. Neither is
IIRC within a critical section. So you imo should remove the outOfMem
handling and revert to palloc() instead of using malloc directly. One
thing worthy of note is that I don't think you currently can
"legally" check fullPageWrites == FULL_PAGE_WRITES_ON when calling it
only during startup as fullPageWrites can be changed at runtime.

7) Unless I miss something CompressBackupBlock should be plural, right?
ATM it compresses all the blocks?

8) I don't tests like "if (fpw <= FULL_PAGE_WRITES_COMPRESS)". That
relies on the, less than intuitive, ordering of
FULL_PAGE_WRITES_COMPRESS (=1) before FULL_PAGE_WRITES_ON (=2).

9) I think you've broken the case where we first think 1 block needs to
be backed up, and another doesn't. If we then detect, after the
START_CRIT_SECTION(), that we need to "goto begin;" orig_len will
still have it's old content.

I think that's it for now. Imo it'd be ok to mark this patch as returned
with feedback and deal with it during the next fest.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

compress_fpw_v2.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

27 October 2014, 13:20:10

On Fri, Oct 17, 2014 at 1:52 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
> Please find the updated patch attached.

Thanks for updating the patch! Here are the comments.

The patch isn't applied to the master cleanly.

I got the following compiler warnings.

xlog.c:930: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code

The compilation of the document failed with the following error message.

openjade:config.sgml:2188:12:E: end tag for element "TERM" which is not open
make[3]: *** [HTML.index] Error 1

Only backend calls CompressBackupBlocksPagesAlloc when SIGHUP is sent.
Why does only backend need to do that? What about other processes which
can write FPW, e.g., autovacuum?

Do we release the buffers for compressed data when fpw is changed from
"compress" to "on"?

+    if (uncompressedPages == NULL)
+    {
+        uncompressedPages = (char *)malloc(XLR_TOTAL_BLCKSZ);
+        if (uncompressedPages == NULL)
+            outOfMem = 1;
+    }

The memory is always (i.e., even when fpw=on) allocated to uncompressedPages,
but not to compressedPages. Why? I guess that the test of fpw needs to be there.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

28 October 2014, 07:56:46

Hello Fujii-san,

Thank you for your comments.

>The patch isn't applied to the master cleanly.
>The compilation of the document failed with the following error message.
>openjade:config.sgml:2188:12:E: end tag for element "TERM" which is not open
>make[3]: *** [HTML.index] Error 1
>xlog.c:930: warning: ISO C90 forbids mixed declarations and code
>xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
>xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code

Please find attached patch with these rectified.

>Only backend calls CompressBackupBlocksPagesAlloc when SIGHUP is sent.
>Why does only backend need to do that? What about other processes which can write FPW, e.g., autovacuum?
I had overlooked this. I will correct it.

>Do we release the buffers for compressed data when fpw is changed from "compress" to "on"?
The current code does not do this.

>The memory is always (i.e., even when fpw=on) allocated to uncompressedPages, but not to compressedPages. Why? I guess
thatthe test of fpw needs to be there 
uncompressedPages is also used to store the decompression output at the time of recovery. Hence, memory for
uncompressedPagesneeds to be allocated even if fpw=on which is not the case for compressedPages. 

Thank you,
Rahila Syed

-----Original Message-----
From: Fujii Masao [mailto:masao.fujii@gmail.com]
Sent: Monday, October 27, 2014 6:50 PM
To: Rahila Syed
Cc: Andres Freund; Syed, Rahila; Alvaro Herrera; Rahila Syed; PostgreSQL-development; Tom Lane
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Fri, Oct 17, 2014 at 1:52 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
> Please find the updated patch attached.

Thanks for updating the patch! Here are the comments.

The patch isn't applied to the master cleanly.

I got the following compiler warnings.

xlog.c:930: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code

The compilation of the document failed with the following error message.

openjade:config.sgml:2188:12:E: end tag for element "TERM" which is not open
make[3]: *** [HTML.index] Error 1

Only backend calls CompressBackupBlocksPagesAlloc when SIGHUP is sent.
Why does only backend need to do that? What about other processes which can write FPW, e.g., autovacuum?

Do we release the buffers for compressed data when fpw is changed from "compress" to "on"?

+    if (uncompressedPages == NULL)
+    {
+        uncompressedPages = (char *)malloc(XLR_TOTAL_BLCKSZ);
+        if (uncompressedPages == NULL)
+            outOfMem = 1;
+    }

The memory is always (i.e., even when fpw=on) allocated to uncompressedPages, but not to compressedPages. Why? I guess
thatthe test of fpw needs to be there. 

Regards,

--
Fujii Masao

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

compress_fpw_v2.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

28 October 2014, 11:00:31

On Tue, Oct 28, 2014 at 4:54 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
> Hello Fujii-san,
>
> Thank you for your comments.
>
>>The patch isn't applied to the master cleanly.
>>The compilation of the document failed with the following error message.
>>openjade:config.sgml:2188:12:E: end tag for element "TERM" which is not open
>>make[3]: *** [HTML.index] Error 1
>>xlog.c:930: warning: ISO C90 forbids mixed declarations and code
>>xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
>>xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
>
> Please find attached patch with these rectified.
>
>>Only backend calls CompressBackupBlocksPagesAlloc when SIGHUP is sent.
>>Why does only backend need to do that? What about other processes which can write FPW, e.g., autovacuum?
> I had overlooked this. I will correct it.
>
>>Do we release the buffers for compressed data when fpw is changed from "compress" to "on"?
> The current code does not do this.

Don't we need to do that?

>>The memory is always (i.e., even when fpw=on) allocated to uncompressedPages, but not to compressedPages. Why? I
guessthat the test of fpw needs to be there
 
> uncompressedPages is also used to store the decompression output at the time of recovery. Hence, memory for
uncompressedPagesneeds to be allocated even if fpw=on which is not the case for compressedPages.
 

You don't need to make the processes except the startup process allocate
the memory for uncompressedPages when fpw=on. Only the startup process
uses it for the WAL decompression.

BTW, what happens if the memory allocation for uncompressedPages for
the recovery fails? Which would prevent the recovery at all, so PANIC should
happen in that case?

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

28 October 2014, 15:21:47

>>>Do we release the buffers for compressed data when fpw is changed from
"compress" to "on"? 
>> The current code does not do this.
>Don't we need to do that? 
Yes this needs to be done in order to avoid memory leak when compression is
turned off at runtime while the backend session is running.

>You don't need to make the processes except the startup process allocate 
>the memory for uncompressedPages when fpw=on. Only the startup process 
>uses it for the WAL decompression
I see. fpw != on check can be put at the time of memory allocation of
uncompressedPages in the backend code . And at the time of recovery
uncompressedPages can be allocated separately if not already allocated.

>BTW, what happens if the memory allocation for uncompressedPages for 
>the recovery fails? 
The current code does not handle this. This will be rectified.

>Which would prevent the recovery at all, so PANIC should 
>happen in that case? 
IIUC, instead of reporting  PANIC , palloc can be used to allocate memory
for uncompressedPages at the time of recovery which will throw ERROR and
abort startup process in case of failure.


Thank you,
Rahila Syed



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5824613.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

04 November 2014, 05:04:05

Hello ,

Please find updated patch with the review comments given above implemented.

The compressed data now includes all backup blocks and their headers except the header of first backup block in WAL record. The first backup block header in WAL record is used to store the compression information. This is done in order to avoid adding compression information in WAL record header.

Memory allocation on SIGHUP in autovacuum is remaining. Working on it.

Thank you,

Rahila Syed

On Tue, Oct 28, 2014 at 8:51 PM, Rahila Syed <rahilasyed.90@gmail.com> wrote:

>>>Do we release the buffers for compressed data when fpw is changed from
"compress" to "on"?
>> The current code does not do this.
>Don't we need to do that?
Yes this needs to be done in order to avoid memory leak when compression is
turned off at runtime while the backend session is running.

>You don't need to make the processes except the startup process allocate
>the memory for uncompressedPages when fpw=on. Only the startup process
>uses it for the WAL decompression
I see. fpw != on check can be put at the time of memory allocation of
uncompressedPages in the backend code . And at the time of recovery
uncompressedPages can be allocated separately if not already allocated.

>BTW, what happens if the memory allocation for uncompressedPages for
>the recovery fails?
The current code does not handle this. This will be rectified.

>Which would prevent the recovery at all, so PANIC should
>happen in that case?
IIUC, instead of reporting PANIC , palloc can be used to allocate memory
for uncompressedPages at the time of recovery which will throw ERROR and
abort startup process in case of failure.

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5824613.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

compress_fpw_v3.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

06 November 2014, 09:18:31

On Tue, Nov 4, 2014 at 2:03 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello ,
>
> Please find updated patch with the review comments given above implemented

Hunk #3 FAILED at 692.
1 out of 3 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlogreader.c.rej

The patch was not applied to the master cleanly. Could you update the patch?

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

08 November 2014, 21:41:10

Hello,

>The patch was not applied to the master cleanly. Could you update the patch?

Please find attached updated and rebased patch to compress FPW. Review comments given above have been implemented.

Thank you,

Rahila Syed

On Thu, Nov 6, 2014 at 2:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Nov 4, 2014 at 2:03 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello ,
>
> Please find updated patch with the review comments given above implemented

Hunk #3 FAILED at 692.
1 out of 3 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlogreader.c.rej

The patch was not applied to the master cleanly. Could you update the patch?

Regards,

--
Fujii Masao

Attachment

compress_fpw_v4.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

09 November 2014, 13:32:29

On Sun, Nov 9, 2014 at 6:41 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
>>The patch was not applied to the master cleanly. Could you update the
>> patch?
> Please find attached updated and rebased patch to compress FPW. Review
> comments given above have been implemented.

Thanks for updating the patch! Will review it.

BTW, I got the following compiler warnings.

xlogreader.c:755: warning: assignment from incompatible pointer type
autovacuum.c:1412: warning: implicit declaration of function
'CompressBackupBlocksPagesAlloc'
xlogreader.c:755: warning: assignment from incompatible pointer type

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

10 November 2014, 08:27:03

On Sun, Nov 9, 2014 at 10:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sun, Nov 9, 2014 at 6:41 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>> Hello,
>>
>>>The patch was not applied to the master cleanly. Could you update the
>>> patch?
>> Please find attached updated and rebased patch to compress FPW. Review
>> comments given above have been implemented.
>
> Thanks for updating the patch! Will review it.
>
> BTW, I got the following compiler warnings.
>
> xlogreader.c:755: warning: assignment from incompatible pointer type
> autovacuum.c:1412: warning: implicit declaration of function
> 'CompressBackupBlocksPagesAlloc'
> xlogreader.c:755: warning: assignment from incompatible pointer type
I have been looking at this patch, here are some comments:
1) This documentation change is incorrect:
- <term><varname>full_page_writes</varname> (<type>boolean</type>)
+ <term><varname>full_page_writes</varname> (<type>enum</type>)</term>
<indexterm>
<primary><varname>full_page_writes</> configuration parameter</primary>
</indexterm>
- </term>
The termination of block term was correctly places before.
2) This patch defines FullPageWritesStr and full_page_writes_str, but both do more or less the same thing.
3) This patch is touching worker_spi.c and calling CompressBackupBlocksPagesAlloc directly. Why is that necessary? Doesn't a bgworker call InitXLOGAccess once it connects to a database?
4) Be careful as well of whitespaces (code lines should have as well a maximum of 80 characters):
+ * If compression is set on replace the rdata nodes of backup blocks added in the loop
+ * above by single rdata node that contains compressed backup blocks and their headers
+ * except the header of first block which is used to store the information about compression.
+ */
5) GetFullPageWriteGUC or something similar is necessary, but I think that for consistency with doPageWrites its value should be fetched in XLogInsert and then passed as an extra argument in XLogRecordAssemble. Thinking more about this, I think that it would be cleaner to simply have a bool flag tracking if compression is active or not, something like doPageCompression, that could be fetched using GetFullPageWriteInfo. Thinking more about it, we could directly track forcePageWrites and fullPageWrites, but that would make back-patching more difficult with not that much gain.
6) Not really a complaint, but note that this patch is using two bits that were unused up to now to store the compression status of a backup block. This is actually safe as long as the maximum page is not higher than 32k, which is the limit authorized by --with-blocksize btw. I think that this deserves a comment at the top of the declaration of BkpBlock.
!       unsigned        hole_offset:15, /* number of bytes before "hole" */
!                               flags:2,                /* state of a backup block, see below */
!                               hole_length:15; /* number of bytes in "hole" */

7) Some code in RestoreBackupBlock:

+       char *uncompressedPages;
+
+       uncompressedPages = (char *)palloc(XLR_TOTAL_BLCKSZ);
[...]
+       /* Check if blocks in WAL record are compressed */
+       if (bkpb.flag_compress == BKPBLOCKS_COMPRESSED)
+       {
+               /* Checks to see if decompression is successful is made inside the function */
+               pglz_decompress((PGLZ_Header *) blk, uncompressedPages);
+               blk = uncompressedPages;
+       }
uncompressedPages is pallocd'd all the time but you actually just need to do that when the block is compressed.

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

25 November 2014, 06:34:09

On Tue, Nov 25, 2014 at 3:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> For now here are the patches either way, so feel free to comment.
And of course the patches are incorrect...
--
Michael

So, Here are reworked patches for the whole set, with the following changes:
- Found why replay was failing, xlogreader.c took into account BLCKSZ
- hole while it should have taken into account the compressed data
length when fetching a compressed block image.
- Reworked pglz portion to have it return status errors instead of
simple booleans. pglz stuff is as well moved to src/common as Alvaro
suggested.

I am planning to run some tests to check how much compression can
reduce WAL size with this new set of patches. I have been however able
to check that those patches pass installcheck-world with a standby
replaying the changes behind. Feel free to play with those patches...
Regards,
--
Michael

>        if (!fullPageWrites)
>        {
>               WALInsertLockAcquireExclusive();
>                Insert->fullPageWrites = fullPageWrites;
>                WALInsertLockRelease();
>        }
>

As fullPageWrites is not a boolean isnt it better to change the if condition as fullPageWrites == FULL_PAGE_WRITES_OFF? As it is done in the if condition above, this seems to be a miss.

>doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);

IIUC, doPageWrites is true when fullPageWrites is either 'on' or 'compress'
Considering Insert -> fullPageWrites is an int now, I think its better to explicitly write the above as ,

doPageWrites = (Insert -> fullPageWrites != FULL_PAGE_WRITES_OFF || Insert->forcePageWrites)

The patch attached has the above changes. Also, it initializes doPageCompression in InitXLOGAccess as per earlier discussion.

I have attached the changes separately as changes.patch.

Thank you,

Rahila Syed

Attachment

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

28 November 2014, 06:48:33

So, I have been doing some more tests with this patch. I think the
compression numbers are in line with the previous tests.

Configuration
==========

3 sets are tested:
- HEAD (a5eb85e) + fpw = on
- patch + fpw = on
- patch + fpw = compress
With the following configuration:
shared_buffers=512MB
checkpoint_segments=1024
checkpoint_timeout = 5min
fsync=off

WAL quantity
===========
pgbench -s 30 -i (455MB of data)
pgbench -c 32 -j 32 -t 45000 -M prepared (roughly 11 min of run on
laptop, two checkpoints kick in)

1) patch + fdw = compress
tps = 2086.893948 (including connections establishing)
tps = 2087.031543 (excluding connections establishing)
start LSN: 0/19000090
stop LSN: 0/49F73D78
difference: 783MB

2) patch + fdw = on
start LSN: 0/1B000090
stop LSN: 0/8F4E1BD0
difference: 1861 MB
tps = 2106.812454 (including connections establishing)
tps = 2106.953329 (excluding connections establishing)

3) HEAD + fdw = on
start LSN: 0/1B0000C8
stop LSN:
difference:

WAL replay performance
===================
Then tested replay time of a standby after replaying WAL files
generated by previous pgbench runs and by tracking "redo start" and
"redo stop". Goal here is to check for the same amount of activity how
much block decompression plays on replay. The replay includes the
pgbench initialization phase.

1) patch + fdw = compress
1-1) Try 1.
2014-11-28 14:09:27.287 JST: LOG:  redo starts at 0/3000380
2014-11-28 14:10:19.836 JST: LOG:  redo done at 0/49F73E18
Result: 52.549
1-2) Try 2.
2014-11-28 14:15:04.196 JST: LOG:  redo starts at 0/3000380
2014-11-28 14:15:56.238 JST: LOG:  redo done at 0/49F73E18
Result: 52.042
1-3) Try 3
2014-11-28 14:20:27.186 JST: LOG:  redo starts at 0/3000380
2014-11-28 14:21:19.350 JST: LOG:  redo done at 0/49F73E18
Result: 52.164
2) patch + fdw = on
2-1) Try 1
2014-11-28 14:42:54.670 JST: LOG:  redo starts at 0/3000750
2014-11-28 14:43:56.221 JST: LOG:  redo done at 0/8F4E1BD0
Result: 61.5s
2-2) Try 2
2014-11-28 14:46:03.198 JST: LOG:  redo starts at 0/3000750
2014-11-28 14:47:03.545 JST: LOG:  redo done at 0/8F4E1BD0
Result: 60.3s
2-3) Try 3
2014-11-28 14:50:26.896 JST: LOG:  redo starts at 0/3000750
2014-11-28 14:51:30.950 JST: LOG:  redo done at 0/8F4E1BD0
Result: 64.0s
3) HEAD + fdw = on
3-1) Try 1
2014-11-28 15:21:48.153 JST: LOG:  redo starts at 0/3000750
2014-11-28 15:22:53.864 JST: LOG:  redo done at 0/8FFFFFA8
Result: 65.7s
3-2) Try 2
2014-11-28 15:27:16.271 JST: LOG:  redo starts at 0/3000750
2014-11-28 15:28:20.677 JST: LOG:  redo done at 0/8FFFFFA8
Result: 64.4s
3-3) Try 3
2014-11-28 15:36:30.434 JST: LOG:  redo starts at 0/3000750
2014-11-28 15:37:33.208 JST: LOG:  redo done at 0/8FFFFFA8
Result: 62.7s

So we are getting an equivalent amount of WAL when compression is not
enabled with both HEAD and the patch, aka a reduction of 55% at
constant number of transactions with pgbench. The difference seems to
be some noise. Note that basically as the patch adds a uint16 in
XLogRecordBlockImageHeader to store the length of the block compressed
and achieve a double level of compression (1st level being the removal
of the page hole), the records are 2 bytes longer per block image, it
does not seem to be much a problem in those tests. Regarding the WAL
replay, compressed blocks need extra CPU for decompression in exchange
of having less WAL to replay in quantity, this is actually reducing by
~15% the replay time, so the replay plays in favor of putting the load
on the CPU. Also, I haven't seen any difference with or without the
patch when compression is disabled.

Updated patches attached, I found a couple of issues with the code
this morning (issues more or less pointed out as well by Rahila
earlier) before running those tests.
Regards,

Regards,
--
Michael

On Fri, Dec 5, 2014 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 4, 2014 at 5:36 AM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
> The only scenario in which a user would not want to compress forcibly
> written pages is when CPU utilization is high.

Or if they think the code to compress full pages is buggy.

Yeah, especially if in the future we begin to add support for other compression algorithm.

> But according to measurements
> done earlier the CPU utilization of compress=’on’ and ‘off’ are not
> significantly different.

If that's really true, we could consider having no configuration any
time, and just compressing always. But I'm skeptical that it's
actually true.

So am I. Data is the thing that matters for us.

Speaking of which, I have been working more on the set of patches to add support for this feature and attached are updated patches, with the following changes:

- Addition of a new GUC parameter wal_compression, being a complete switch to control compression of WAL. Default is off. We could extend this parameter later if we decide to add support for new algorithms or new modes, let's say a record-level compression. Parameter is PGC_POSTMASTER. We could make it PGC_SIGHUP but that would be better as a future optimization, and would need a new WAL record type similar to full_page_writes. (Actually, I see no urgency in making it SIGHUP..)

- full_page_writes is moved back to its original state

- Correction of a couple of typos and comments.

Regards,

Michael

Attachment

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

05 December 2014, 14:10:22

I attempted quick review and could not come up with much except this

+   /*
+    * Calculate the amount of FPI data in the record. Each backup block
+    * takes up BLCKSZ bytes, minus the "hole" length.
+    *
+    * XXX: We peek into xlogreader's private decoded backup blocks for the
+    * hole_length. It doesn't seem worth it to add an accessor macro for
+    * this.
+    */
+   fpi_len = 0;
+   for (block_id = 0; block_id <= record->max_block_id; block_id++)
+   {
+       if (XLogRecHasCompressedBlockImage(record, block_id))
+           fpi_len += BLCKSZ - record->blocks[block_id].compress_len;


IIUC, fpi_len in case of compressed block image should be

fpi_len = record->blocks[block_id].compress_len;


Thank you,
Rahila Syed 




--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5829403.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

05 December 2014, 15:06:28

On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed.90@gmail.com> wrote:

I attempted quick review and could not come up with much except this

+ /*
+ * Calculate the amount of FPI data in the record. Each backup block
+ * takes up BLCKSZ bytes, minus the "hole" length.
+ *
+ * XXX: We peek into xlogreader's private decoded backup blocks for the
+ * hole_length. It doesn't seem worth it to add an accessor macro for
+ * this.
+ */
+ fpi_len = 0;
+ for (block_id = 0; block_id <= record->max_block_id; block_id++)
+ {
+ if (XLogRecHasCompressedBlockImage(record, block_id))
+ fpi_len += BLCKSZ - record->blocks[block_id].compress_len;

IIUC, fpi_len in case of compressed block image should be

fpi_len = record->blocks[block_id].compress_len;

Yep, true. Patches need a rebase btw as Heikki fixed a commit related to the stats of pg_xlogdump.

Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

05 December 2014, 15:10:23

On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
I attempted quick review and could not come up with much except this

+ /*
+ * Calculate the amount of FPI data in the record. Each backup block
+ * takes up BLCKSZ bytes, minus the "hole" length.
+ *
+ * XXX: We peek into xlogreader's private decoded backup blocks for the
+ * hole_length. It doesn't seem worth it to add an accessor macro for
+ * this.
+ */
+ fpi_len = 0;
+ for (block_id = 0; block_id <= record->max_block_id; block_id++)
+ {
+ if (XLogRecHasCompressedBlockImage(record, block_id))
+ fpi_len += BLCKSZ - record->blocks[block_id].compress_len;

IIUC, fpi_len in case of compressed block image should be

fpi_len = record->blocks[block_id].compress_len;
Yep, true. Patches need a rebase btw as Heikki fixed a commit related to the stats of pg_xlogdump.

In any case, any opinions to switch this patch as "Ready for committer"?
--

Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

05 December 2014, 15:18:08

On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
> On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
> 
> >
> >
> >
> > On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed.90@gmail.com>
> > wrote:
> >
> >> I attempted quick review and could not come up with much except this
> >>
> >> +   /*
> >> +    * Calculate the amount of FPI data in the record. Each backup block
> >> +    * takes up BLCKSZ bytes, minus the "hole" length.
> >> +    *
> >> +    * XXX: We peek into xlogreader's private decoded backup blocks for
> >> the
> >> +    * hole_length. It doesn't seem worth it to add an accessor macro for
> >> +    * this.
> >> +    */
> >> +   fpi_len = 0;
> >> +   for (block_id = 0; block_id <= record->max_block_id; block_id++)
> >> +   {
> >> +       if (XLogRecHasCompressedBlockImage(record, block_id))
> >> +           fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
> >>
> >>
> >> IIUC, fpi_len in case of compressed block image should be
> >>
> >> fpi_len = record->blocks[block_id].compress_len;
> >>
> > Yep, true. Patches need a rebase btw as Heikki fixed a commit related to
> > the stats of pg_xlogdump.
> >
> 
> In any case, any opinions to switch this patch as "Ready for committer"?

Needing a rebase is a obvious conflict to that... But I guess some wider
looks afterwards won't hurt.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

05 December 2014, 17:08:37

On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
>>If that's really true, we could consider having no configuration any
>>time, and just compressing always.  But I'm skeptical that it's
>>actually true.
>
> I was referring to this for CPU utilization:
> http://www.postgresql.org/message-id/1410414381339-5818552.post@n5.nabble.com
> <http://>
>
> The above tests were performed on machine with configuration as follows
> Server specifications:
> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk : HDD      450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

I think that measurement methodology is not very good for assessing
the CPU overhead, because you are only measuring the percentage CPU
utilization, not the absolute amount of CPU utilization.  It's not
clear whether the duration of the tests was the same for all the
configurations you tried - in which case the number of transactions
might have been different - or whether the number of operations was
exactly the same - in which case the runtime might have been
different.  Either way, it could obscure an actual difference in
absolute CPU usage per transaction.  It's unlikely that both the
runtime and the number of transactions were identical for all of your
tests, because that would imply that the patch makes no difference to
performance; if that were true, you wouldn't have bothered writing
it....

What I would suggest is instrument the backend with getrusage() at
startup and shutdown and have it print the difference in user time and
system time.  Then, run tests for a fixed number of transactions and
see how the total CPU usage for the run differs.

Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
footprint for updates, and we found that compression was pretty darn
expensive there in terms of CPU time.  So I am suspicious of the
finding that it is free here.  It's not impossible that there's some
effect which causes us to recoup more CPU time than we spend
compressing in this case that did not apply in that case, but the
projects are awfully similar, so I tend to doubt it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

06 December 2014, 14:07:48

On Sat, Dec 6, 2014 at 12:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
> On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>
> >
> >
> >
> > On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed.90@gmail.com>
> > wrote:
> >
> >> I attempted quick review and could not come up with much except this
> >>
> >> + /*
> >> + * Calculate the amount of FPI data in the record. Each backup block
> >> + * takes up BLCKSZ bytes, minus the "hole" length.
> >> + *
> >> + * XXX: We peek into xlogreader's private decoded backup blocks for
> >> the
> >> + * hole_length. It doesn't seem worth it to add an accessor macro for
> >> + * this.
> >> + */
> >> + fpi_len = 0;
> >> + for (block_id = 0; block_id <= record->max_block_id; block_id++)
> >> + {
> >> + if (XLogRecHasCompressedBlockImage(record, block_id))
> >> + fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
> >>
> >>
> >> IIUC, fpi_len in case of compressed block image should be
> >>
> >> fpi_len = record->blocks[block_id].compress_len;
> >>
> > Yep, true. Patches need a rebase btw as Heikki fixed a commit related to
> > the stats of pg_xlogdump.
> >
>
> In any case, any opinions to switch this patch as "Ready for committer"?

Needing a rebase is a obvious conflict to that... But I guess some wider
looks afterwards won't hurt.

Here are rebased versions, which are patches 1 and 2. And I am switching as well the patch to "Ready for Committer". The important point to consider for this patch is the use of the additional 2-bytes as uint16 in the block information structure to save the length of a compressed block, which may be compressed without its hole to achieve a double level of compression (image compressed without its hole). We may use a simple flag on one or two bits using for example a bit from hole_length, but in this case we would need to always compress images with their hole included, something more expensive as the compression would take more time.

Robert wrote:
> What I would suggest is instrument the backend with getrusage() at
> startup and shutdown and have it print the difference in user time and
> system time. Then, run tests for a fixed number of transactions and
> see how the total CPU usage for the run differs.

That's a nice idea, which is done with patch 3 as a simple hack calling twice getrusage at the beginning of PostgresMain and before proc_exit, calculating the difference time and logging it for each process (used as well log_line_prefix with %p).

Then I just did a small test with a load of a pgbench-scale-100 database on fresh instances:
1) Compression = on:
Stop LSN: 0/487E49B8
getrusage: proc 11163: LOG: user diff: 63.071127, system diff: 10.898386
pg_xlogdump: FPI size: 122296653 [90.52%]

2) Compression = off

Stop LSN: 0/4E54EB88
Result: proc 11648: LOG: user diff: 43.855212, system diff: 7.857965
pg_xlogdump: FPI size: 204359192 [94.10%]

And the CPU consumption is showing quite some difference... I'd expect as well pglz_compress to show up high in a perf profile for this case (don't have the time to do that now, but a perf record -a -g would be fine I guess).

Regards,

Michael

Attachment

Re: Compression of full-page-writes

From

Simon Riggs

Date:

08 December 2014, 02:30:32

> On Thu, Dec 4, 2014 at 8:37 PM, Michael Paquier wrote
> I pondered something that Andres mentioned upthread: we may not do the
>compression in WAL record only for blocks, but also at record level. Hence
>joining the two ideas together I think that we should definitely have
>a different
>GUC to control the feature, consistently for all the images. Let's call it
>wal_compression, with the following possible values:
>- on, meaning that a maximum of compression is done, for this feature
>basically full_page_writes = on.
>- full_page_writes, meaning that full page writes are compressed
>- off, default value, to disable completely the feature.
>This would let room for another mode: 'record', to completely compress
>a record. For now though, I think that a simple on/off switch would be
>fine for this patch. Let's keep things simple.

+1 for a separate parameter for compression

Some changed thoughts to the above

* parameter should be SUSET - it doesn't *need* to be set only at
server start since all records are independent of each other

* ideally we'd like to be able to differentiate the types of usage.
which then allows the user to control the level of compression
depending upon the type of action. My first cut at what those settings
should be are ALL > LOGICAL > PHYSICAL > VACUUM.

VACUUM - only compress while running vacuum commands
PHYSICAL - only compress while running physical DDL commands (ALTER
TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
typically be used for logical decoding
LOGICAL - compress FPIs for record types that change tables
ALL - all user commands
(each level includes all prior levels)

* name should not be wal_compression - we're not compressing all wal
records, just fpis. There is no evidence that we even want to compress
other record types, nor that our compression mechanism is effective at
doing so. Simple => keep name as compress_full_page_writes
Though perhaps we should have it called wal_compression_level

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Compression of full-page-writes

From

Michael Paquier

Date:

08 December 2014, 02:46:23

On Mon, Dec 8, 2014 at 11:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> * parameter should be SUSET - it doesn't *need* to be set only at
> server start since all records are independent of each other
Check.

> * ideally we'd like to be able to differentiate the types of usage.
> which then allows the user to control the level of compression
> depending upon the type of action. My first cut at what those settings
> should be are ALL > LOGICAL > PHYSICAL > VACUUM.
> VACUUM - only compress while running vacuum commands
> PHYSICAL - only compress while running physical DDL commands (ALTER
> TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
> typically be used for logical decoding
> LOGICAL - compress FPIs for record types that change tables
> ALL - all user commands
> (each level includes all prior levels)

Well, that's clearly an optimization so I don't think this should be
done for a first shot but those are interesting fresh ideas.
Technically speaking, note that we would need to support such things
with a new API to switch a new context flag in registered_buffers of
xloginsert.c for each block, and decide if the block is compressed
based on this context flag, and the compression level wanted.

> * name should not be wal_compression - we're not compressing all wal
> records, just fpis. There is no evidence that we even want to compress
> other record types, nor that our compression mechanism is effective at
> doing so. Simple => keep name as compress_full_page_writes
> Though perhaps we should have it called wal_compression_level

I don't really like those new names, but I'd prefer
wal_compression_level if we go down that road with 'none' as default
value. We may still decide in the future to support compression at the
record level instead of context level, particularly if we have an API
able to do palloc_return_null_at_oom, so the idea of WAL compression
is not related only to FPIs IMHO.
Regards,
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

08 December 2014, 06:42:36

>The important point to consider for this patch is the use of the additional 2-bytes as uint16 in the block information structure to save the length of a compressed

>block, which may be compressed without its hole to achieve a double level of compression (image compressed without its hole). We may use a simple flag on

>one or two bits using for example a bit from hole_length, but in this case we would need to always compress images with their hole included, something more

>expensive as the compression would take more time.

As you have mentioned here the idea to use bits from existing fields rather than adding additional 2 bytes in header,

FWIW elaborating slightly on the way it was done in the initial patches,

We can use the following struct

unsigned hole_offset:15,
compress_flag:2,
hole_length:15;

Here compress_flag can be 0 or 1 depending on status of compression. We can reduce the compress_flag to just 1 bit flag.

IIUC, the purpose of adding compress_len field in the latest patch is to store length of compressed blocks which is used at the time of decoding the blocks.

With this approach, length of compressed block can be stored in hole_length as,

hole_length = BLCKSZ - compress_len.

Thus, hole_length can serve the purpose of storing length of a compressed block without the need of additional 2-bytes. In DecodeXLogRecord, hole_length can be used for tracking the length of data received in cases of both compressed as well as uncompressed blocks.

As you already mentioned, this will need compressing images with hole but we can MemSet hole to 0 in order to make compression of hole less expensive and effective.

Thank you,

Rahila Syed

On Sat, Dec 6, 2014 at 7:37 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sat, Dec 6, 2014 at 12:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
> On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>
> >
> >
> >
> > On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed.90@gmail.com>
> > wrote:
> >
> >> I attempted quick review and could not come up with much except this
> >>
> >> + /*
> >> + * Calculate the amount of FPI data in the record. Each backup block
> >> + * takes up BLCKSZ bytes, minus the "hole" length.
> >> + *
> >> + * XXX: We peek into xlogreader's private decoded backup blocks for
> >> the
> >> + * hole_length. It doesn't seem worth it to add an accessor macro for
> >> + * this.
> >> + */
> >> + fpi_len = 0;
> >> + for (block_id = 0; block_id <= record->max_block_id; block_id++)
> >> + {
> >> + if (XLogRecHasCompressedBlockImage(record, block_id))
> >> + fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
> >>
> >>
> >> IIUC, fpi_len in case of compressed block image should be
> >>
> >> fpi_len = record->blocks[block_id].compress_len;
> >>
> > Yep, true. Patches need a rebase btw as Heikki fixed a commit related to
> > the stats of pg_xlogdump.
> >
>
> In any case, any opinions to switch this patch as "Ready for committer"?

Needing a rebase is a obvious conflict to that... But I guess some wider
looks afterwards won't hurt.

Here are rebased versions, which are patches 1 and 2. And I am switching as well the patch to "Ready for Committer". The important point to consider for this patch is the use of the additional 2-bytes as uint16 in the block information structure to save the length of a compressed block, which may be compressed without its hole to achieve a double level of compression (image compressed without its hole). We may use a simple flag on one or two bits using for example a bit from hole_length, but in this case we would need to always compress images with their hole included, something more expensive as the compression would take more time.

Robert wrote:
> What I would suggest is instrument the backend with getrusage() at
> startup and shutdown and have it print the difference in user time and
> system time. Then, run tests for a fixed number of transactions and
> see how the total CPU usage for the run differs.
That's a nice idea, which is done with patch 3 as a simple hack calling twice getrusage at the beginning of PostgresMain and before proc_exit, calculating the difference time and logging it for each process (used as well log_line_prefix with %p).

Then I just did a small test with a load of a pgbench-scale-100 database on fresh instances:
1) Compression = on:
Stop LSN: 0/487E49B8
getrusage: proc 11163: LOG: user diff: 63.071127, system diff: 10.898386
pg_xlogdump: FPI size: 122296653 [90.52%]
2) Compression = off
Stop LSN: 0/4E54EB88
Result: proc 11648: LOG: user diff: 43.855212, system diff: 7.857965
pg_xlogdump: FPI size: 204359192 [94.10%]
And the CPU consumption is showing quite some difference... I'd expect as well pglz_compress to show up high in a perf profile for this case (don't have the time to do that now, but a perf record -a -g would be fine I guess).
Regards,
--
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

08 December 2014, 06:49:01

On Mon, Dec 8, 2014 at 3:42 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>
>>The important point to consider for this patch is the use of the additional
>> 2-bytes as uint16 in the block information structure to save the length of a
>> compressed
>>block, which may be compressed without its hole to achieve a double level
>> of compression (image compressed without its hole). We may use a simple flag
>> on
>>one or two bits using for example a bit from hole_length, but in this case
>> we would need to always compress images with their hole included, something
>> more
>  >expensive as the compression would take more time.
> As you have mentioned here the idea to use bits from existing fields rather
> than adding additional 2 bytes in header,
> FWIW elaborating slightly on the way it was done in the initial patches,
> We can use the following struct
>
>      unsigned    hole_offset:15,
>                  compress_flag:2,
>                 hole_length:15;
>
> Here  compress_flag can be 0 or 1 depending on status of compression. We can
> reduce the compress_flag to just 1 bit flag.
Just adding that this is fine as the largest page size that can be set is 32k.

> IIUC, the purpose of adding compress_len field in the latest patch is to
> store length of compressed blocks which is used at the time of decoding the
> blocks.
>
> With this approach, length of compressed block can be stored in hole_length
> as,
>
>  hole_length = BLCKSZ - compress_len.
>
> Thus, hole_length can serve the purpose of storing length of a compressed
> block without the need of additional 2-bytes.  In DecodeXLogRecord,
> hole_length can be used for tracking the length of data received in cases of
> both compressed as well as uncompressed blocks.
>
> As you already mentioned, this will need compressing images with hole but
> we can MemSet hole to 0 in order to make compression of hole less expensive
> and effective.

Thanks for coming back to this point in more details, this is very
important. The additional 2 bytes used make compression less expensive
by ignoring the hole, for a bit more data in each record. Using uint16
is as well a cleaner code style, more in-line wit hte other fields,
but that's a personal opinion ;)

Doing a switch from one approach to the other is easy enough though,
so let's see what others think.
Regards,
-- 
Michael

Re: Compression of full-page-writes

From

Simon Riggs

Date:

08 December 2014, 09:47:25

On 8 December 2014 at 11:46, Michael Paquier <michael.paquier@gmail.com> wrote:

>> * ideally we'd like to be able to differentiate the types of usage.
>> which then allows the user to control the level of compression
>> depending upon the type of action. My first cut at what those settings
>> should be are ALL > LOGICAL > PHYSICAL > VACUUM.
>> VACUUM - only compress while running vacuum commands
>> PHYSICAL - only compress while running physical DDL commands (ALTER
>> TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
>> typically be used for logical decoding
>> LOGICAL - compress FPIs for record types that change tables
>> ALL - all user commands
>> (each level includes all prior levels)
>
> Well, that's clearly an optimization so I don't think this should be
> done for a first shot but those are interesting fresh ideas.

It is important that we offer an option that retains user performance.
I don't see that as an optimisation, but as an essential item.

The current feature will reduce WAL volume, at the expense of
foreground user performance. Worse, that will all happen around time
of new checkpoint,  so I expect this will have a large impact.
Presumably testing has been done to show the impact on user response
times? If not, we need that.

The most important distinction is between foreground and background tasks.

If you think the above is too complex, then we should make the
parameter into a USET, but set it to on in VACUUM, CLUSTER and
autovacuum.

> Technically speaking, note that we would need to support such things
> with a new API to switch a new context flag in registered_buffers of
> xloginsert.c for each block, and decide if the block is compressed
> based on this context flag, and the compression level wanted.
>
>> * name should not be wal_compression - we're not compressing all wal
>> records, just fpis. There is no evidence that we even want to compress
>> other record types, nor that our compression mechanism is effective at
>> doing so. Simple => keep name as compress_full_page_writes
>> Though perhaps we should have it called wal_compression_level
>
> I don't really like those new names, but I'd prefer
> wal_compression_level if we go down that road with 'none' as default
> value. We may still decide in the future to support compression at the
> record level instead of context level, particularly if we have an API
> able to do palloc_return_null_at_oom, so the idea of WAL compression
> is not related only to FPIs IMHO.

We may yet decide, but the pglz implementation is not effective on
smaller record lengths. Nor has any testing been done to show that is
even desirable.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Compression of full-page-writes

From

Robert Haas

Date:

08 December 2014, 19:09:24

On Sun, Dec 7, 2014 at 9:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> * parameter should be SUSET - it doesn't *need* to be set only at
> server start since all records are independent of each other

Why not USERSET?  There's no point in trying to prohibit users from
doing things that will cause bad performance because they can do that
anyway.

> * ideally we'd like to be able to differentiate the types of usage.
> which then allows the user to control the level of compression
> depending upon the type of action. My first cut at what those settings
> should be are ALL > LOGICAL > PHYSICAL > VACUUM.
>
> VACUUM - only compress while running vacuum commands
> PHYSICAL - only compress while running physical DDL commands (ALTER
> TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
> typically be used for logical decoding
> LOGICAL - compress FPIs for record types that change tables
> ALL - all user commands
> (each level includes all prior levels)

Interesting idea, but what evidence do we have that a simple on/off
switch isn't good enough?

> * name should not be wal_compression - we're not compressing all wal
> records, just fpis. There is no evidence that we even want to compress
> other record types, nor that our compression mechanism is effective at
> doing so. Simple => keep name as compress_full_page_writes

Quite right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Compression of full-page-writes

From

Andres Freund

Date:

08 December 2014, 19:22:01

On 2014-12-08 14:09:19 -0500, Robert Haas wrote:
> > records, just fpis. There is no evidence that we even want to compress
> > other record types, nor that our compression mechanism is effective at
> > doing so. Simple => keep name as compress_full_page_writes
> 
> Quite right.

I don't really agree with this. There's lots of records which can be
quite big where compression could help a fair bit. Most prominently
HEAP2_MULTI_INSERT + INIT_PAGE. During initial COPY that's the biggest
chunk of WAL. And these are big and repetitive enough that compression
is very likely to be beneficial.

I still think that just compressing the whole record if it's above a
certain size is going to be better than compressing individual
parts. Michael argued thta that'd be complicated because of the varying
size of the required 'scratch space'. I don't buy that argument
though. It's easy enough to simply compress all the data in some fixed
chunk size. I.e. always compress 64kb in one go. If there's more
compress that independently.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

Robert Haas

Date:

08 December 2014, 19:37:52

On Mon, Dec 8, 2014 at 2:21 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-12-08 14:09:19 -0500, Robert Haas wrote:
>> > records, just fpis. There is no evidence that we even want to compress
>> > other record types, nor that our compression mechanism is effective at
>> > doing so. Simple => keep name as compress_full_page_writes
>>
>> Quite right.
>
> I don't really agree with this. There's lots of records which can be
> quite big where compression could help a fair bit. Most prominently
> HEAP2_MULTI_INSERT + INIT_PAGE. During initial COPY that's the biggest
> chunk of WAL. And these are big and repetitive enough that compression
> is very likely to be beneficial.
>
> I still think that just compressing the whole record if it's above a
> certain size is going to be better than compressing individual
> parts. Michael argued thta that'd be complicated because of the varying
> size of the required 'scratch space'. I don't buy that argument
> though. It's easy enough to simply compress all the data in some fixed
> chunk size. I.e. always compress 64kb in one go. If there's more
> compress that independently.

I agree that idea is worth considering.  But I think we should decide
which way is better and then do just one or the other.  I can't see
the point in adding wal_compress=full_pages now and then offering an
alternative wal_compress=big_records in 9.5.

I think it's also quite likely that there may be cases where
context-aware compression strategies can be employed.  For example,
the prefix/suffix compression of updates that Amit did last cycle
exploit the likely commonality between the old and new tuple.  We
might have cases like that where there are meaningful trade-offs to be
made between CPU and I/O, or other reasons to have user-exposed knobs.
I think we'll be much happier if those are completely separate GUCs,
so we can say things like compress_gin_wal=true and
compress_brin_effort=3.14 rather than trying to have a single
wal_compress GUC and assuming that we can shoehorn all future needs
into it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Compression of full-page-writes

From

Heikki Linnakangas

Date:

08 December 2014, 20:33:57

On 12/08/2014 09:21 PM, Andres Freund wrote:
> I still think that just compressing the whole record if it's above a
> certain size is going to be better than compressing individual
> parts. Michael argued thta that'd be complicated because of the varying
> size of the required 'scratch space'. I don't buy that argument
> though. It's easy enough to simply compress all the data in some fixed
> chunk size. I.e. always compress 64kb in one go. If there's more
> compress that independently.

Doing it in fixed-size chunks doesn't help - you have to hold onto the 
compressed data until it's written to the WAL buffers.

But you could just allocate a "large enough" scratch buffer, and give up 
if it doesn't fit. If the compressed data doesn't fit in e.g. 3 * 8kb, 
it didn't compress very well, so there's probably no point in 
compressing it anyway. Now, an exception to that might be a record that 
contains something else than page data, like a commit record with 
millions of subxids, but I think we could live with not compressing 
those, even though it would be beneficial to do so.

- Heikki

Re: Compression of full-page-writes

From

Michael Paquier

Date:

08 December 2014, 22:02:50

On Tue, Dec 9, 2014 at 5:33 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 12/08/2014 09:21 PM, Andres Freund wrote:
>>
>> I still think that just compressing the whole record if it's above a
>> certain size is going to be better than compressing individual
>> parts. Michael argued thta that'd be complicated because of the varying
>> size of the required 'scratch space'. I don't buy that argument
>> though. It's easy enough to simply compress all the data in some fixed
>> chunk size. I.e. always compress 64kb in one go. If there's more
>> compress that independently.
>
>
> Doing it in fixed-size chunks doesn't help - you have to hold onto the
> compressed data until it's written to the WAL buffers.
>
> But you could just allocate a "large enough" scratch buffer, and give up if
> it doesn't fit. If the compressed data doesn't fit in e.g. 3 * 8kb, it
> didn't compress very well, so there's probably no point in compressing it
> anyway. Now, an exception to that might be a record that contains something
> else than page data, like a commit record with millions of subxids, but I
> think we could live with not compressing those, even though it would be
> beneficial to do so.
Another thing to consider is the possibility to control at GUC level
what is the maximum size of a record we allow to compress.
-- 
Michael

Re: Compression of full-page-writes

From

Simon Riggs

Date:

08 December 2014, 22:19:10

On 9 December 2014 at 04:09, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Dec 7, 2014 at 9:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> * parameter should be SUSET - it doesn't *need* to be set only at
>> server start since all records are independent of each other
>
> Why not USERSET?  There's no point in trying to prohibit users from
> doing things that will cause bad performance because they can do that
> anyway.

Yes, I think USERSET would work fine for this.

>> * ideally we'd like to be able to differentiate the types of usage.
>> which then allows the user to control the level of compression
>> depending upon the type of action. My first cut at what those settings
>> should be are ALL > LOGICAL > PHYSICAL > VACUUM.
>>
>> VACUUM - only compress while running vacuum commands
>> PHYSICAL - only compress while running physical DDL commands (ALTER
>> TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
>> typically be used for logical decoding
>> LOGICAL - compress FPIs for record types that change tables
>> ALL - all user commands
>> (each level includes all prior levels)
>
> Interesting idea, but what evidence do we have that a simple on/off
> switch isn't good enough?

Yes, I think that was overcooked. What I'm thinking is that in the
long run we might have groups of parameters attached to different
types of action, so we wouldn't need, for example, two parameters for
work_mem and maintenance_work_mem. We'd just have work_mem and then a
scheme that has different values of work_mem for different action
types.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Compression of full-page-writes

From

Simon Riggs

Date:

08 December 2014, 22:28:06

On 9 December 2014 at 04:21, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-12-08 14:09:19 -0500, Robert Haas wrote:
>> > records, just fpis. There is no evidence that we even want to compress
>> > other record types, nor that our compression mechanism is effective at
>> > doing so. Simple => keep name as compress_full_page_writes
>>
>> Quite right.
>
> I don't really agree with this. There's lots of records which can be
> quite big where compression could help a fair bit. Most prominently
> HEAP2_MULTI_INSERT + INIT_PAGE. During initial COPY that's the biggest
> chunk of WAL. And these are big and repetitive enough that compression
> is very likely to be beneficial.

Yes, you're right there. I was forgetting those aren't FPIs. However
they are close enough that it wouldn't necessarily effect the naming
of a parameter that controls such compression.

> I still think that just compressing the whole record if it's above a
> certain size is going to be better than compressing individual
> parts.

I think its OK to think it, but we should measure it.

For now then, I remove my objection to a commit of this patch based
upon parameter naming/rethinking. We have a fine tradition of changing
the names after the release is mostly wrapped, so lets pick a name in
a few months time when the dust has settled on what's in.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Compression of full-page-writes

From

Amit Kapila

Date:

09 December 2014, 05:16:01

On Mon, Dec 8, 2014 at 3:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 8 December 2014 at 11:46, Michael Paquier <michael.paquier@gmail.com> wrote:
> > I don't really like those new names, but I'd prefer
> > wal_compression_level if we go down that road with 'none' as default
> > value. We may still decide in the future to support compression at the
> > record level instead of context level, particularly if we have an API
> > able to do palloc_return_null_at_oom, so the idea of WAL compression
> > is not related only to FPIs IMHO.
>
> We may yet decide, but the pglz implementation is not effective on
> smaller record lengths. Nor has any testing been done to show that is
> even desirable.
>

It's even much worse for non-compressible (or less-compressible)

WAL data. I am not clear here that how a simple on/off switch

could address such cases because the data could be sometimes

dependent on which table user is doing operations (means schema or

data in some tables are more prone for compression in which case

it can give us benefits). I think may be we should think something on

lines what Robert has touched in one of his e-mails (context-aware

compression strategy).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

Michael Paquier

Date:

10 December 2014, 09:19:44

On Tue, Dec 9, 2014 at 2:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 8, 2014 at 3:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 8 December 2014 at 11:46, Michael Paquier <michael.paquier@gmail.com> wrote:
> > I don't really like those new names, but I'd prefer
> > wal_compression_level if we go down that road with 'none' as default
> > value. We may still decide in the future to support compression at the
> > record level instead of context level, particularly if we have an API
> > able to do palloc_return_null_at_oom, so the idea of WAL compression
> > is not related only to FPIs IMHO.
>
> We may yet decide, but the pglz implementation is not effective on
> smaller record lengths. Nor has any testing been done to show that is
> even desirable.
>

It's even much worse for non-compressible (or less-compressible)
WAL data. I am not clear here that how a simple on/off switch
could address such cases because the data could be sometimes
dependent on which table user is doing operations (means schema or
data in some tables are more prone for compression in which case
it can give us benefits). I think may be we should think something on
lines what Robert has touched in one of his e-mails (context-aware
compression strategy).

So, I have been doing some measurements using the patch compressing FPWs and had a look at the transaction latency using pgbench -P 1 with those parameters on my laptop:
shared_buffers=512MB
checkpoint_segments=1024
checkpoint_timeout = 5min
fsync=off

A checkpoint was executed just before a 20-min run, so 3 checkpoints at least kicked in during each measurement, roughly that:
pgbench -i -s 100
psql -c 'checkpoint;'
date > ~/report.txt
pgbench -P 1 -c 16 -j 16 -T 1200 2>> ~/report.txt &

1) Compression of FPW:
latency average: 9.007 ms
latency stddev: 25.527 ms
tps = 1775.614812 (including connections establishing)

Here is the latency when a checkpoint that wrote 28% of the buffers begun (570s):
progress: 568.0 s, 2000.9 tps, lat 8.098 ms stddev 23.799
progress: 569.0 s, 1873.9 tps, lat 8.442 ms stddev 22.837
progress: 570.2 s, 1622.4 tps, lat 9.533 ms stddev 24.027
progress: 571.0 s, 1633.4 tps, lat 10.302 ms stddev 27.331
progress: 572.1 s, 1588.4 tps, lat 9.908 ms stddev 25.728
progress: 573.1 s, 1579.3 tps, lat 10.186 ms stddev 25.782

All the other checkpoints have the same profile, giving that the transaction latency increases by roughly 1.5~2ms to 10.5~11ms.

2) No compression of FPW:
latency average: 8.507 ms
latency stddev: 25.052 ms
tps = 1870.368880 (including connections establishing)

Here is the latency for a checkpoint that wrote 28% of buffers:
progress: 297.1 s, 1997.9 tps, lat 8.112 ms stddev 24.288
progress: 298.1 s, 1990.4 tps, lat 7.806 ms stddev 21.849
progress: 299.0 s, 1986.9 tps, lat 8.366 ms stddev 22.896
progress: 300.0 s, 1648.1 tps, lat 9.728 ms stddev 25.811
progress: 301.0 s, 1806.5 tps, lat 8.646 ms stddev 24.187
progress: 302.1 s, 1810.9 tps, lat 8.960 ms stddev 24.201
progress: 303.0 s, 1831.9 tps, lat 8.623 ms stddev 23.199
progress: 304.0 s, 1951.2 tps, lat 8.149 ms stddev 22.871

Here is another one that began around 600s (20% of buffers):
progress: 594.0 s, 1738.8 tps, lat 9.135 ms stddev 25.140
progress: 595.0 s, 893.2 tps, lat 18.153 ms stddev 67.186
progress: 596.1 s, 1671.0 tps, lat 9.470 ms stddev 25.691
progress: 597.1 s, 1580.3 tps, lat 10.189 ms stddev 26.430
progress: 598.0 s, 1570.9 tps, lat 10.089 ms stddev 23.684
progress: 599.2 s, 1657.0 tps, lat 9.385 ms stddev 23.794
progress: 600.0 s, 1665.5 tps, lat 10.280 ms stddev 25.857
progress: 601.1 s, 1571.7 tps, lat 9.851 ms stddev 25.341
progress: 602.1 s, 1577.7 tps, lat 10.056 ms stddev 25.331
progress: 603.0 s, 1600.1 tps, lat 10.329 ms stddev 25.429
progress: 604.0 s, 1593.8 tps, lat 10.004 ms stddev 26.816
Not sure what happened here, the burst has been a bit higher.

However roughly the latency was never higher than 10.5ms for the non-compression case. With those measurements I am getting more or less 1ms of latency difference between the compression and non-compression cases when checkpoint show up. Note that fsync is disabled.

Also, I am still planning to hack a patch able to compress directly records with a scratch buffer up 32k and see the difference with what I got here. For now, the results are attached.

Comments welcome.

Michael

Attachment

fpw_results.tar.gz

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

10 December 2014, 14:10:59

>What I would suggest is instrument the backend with getrusage() at
>startup and shutdown and have it print the difference in user time and
>system time. Then, run tests for a fixed number of transactions and
>see how the total CPU usage for the run differs.

Folllowing are the numbers obtained on tests with absolute CPU usage, fixed number of transactions and longer duration with latest fpw compression patch

pgbench command : pgbench -r -t 250000 -M prepared

To ensure that data is not highly compressible, empty filler columns were altered using

alter table pgbench_accounts alter column filler type text using

gen_random_uuid()::text

checkpoint_segments = 1024

checkpoint_timeout = 5min

fsync = on

The tests ran for around 30 mins.Manual checkpoint was run before each test.

Compression WAL generated %compression Latency-avg CPU usage (seconds) TPS Latency stddev

on 1531.4 MB ~35 % 7.351 ms user diff: 562.67s system diff: 41.40s 135.96 13.759 ms

off 2373.1 MB 6.781 ms user diff: 354.20s system diff: 39.67s 147.40 14.152 ms

The compression obtained is quite high close to 35 %.

CPU usage at user level when compression is on is quite noticeably high as compared to that when compression is off. But gain in terms of reduction of WAL is also high.

Server specifications:

Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,

Rahila Syed

On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
>>If that's really true, we could consider having no configuration any
>>time, and just compressing always. But I'm skeptical that it's
>>actually true.
>
> I was referring to this for CPU utilization:
> http://www.postgresql.org/message-id/1410414381339-5818552.post@n5.nabble.com
> <http://>
>
> The above tests were performed on machine with configuration as follows
> Server specifications:
> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

I think that measurement methodology is not very good for assessing
the CPU overhead, because you are only measuring the percentage CPU
utilization, not the absolute amount of CPU utilization. It's not
clear whether the duration of the tests was the same for all the
configurations you tried - in which case the number of transactions
might have been different - or whether the number of operations was
exactly the same - in which case the runtime might have been
different. Either way, it could obscure an actual difference in
absolute CPU usage per transaction. It's unlikely that both the
runtime and the number of transactions were identical for all of your
tests, because that would imply that the patch makes no difference to
performance; if that were true, you wouldn't have bothered writing
it....

What I would suggest is instrument the backend with getrusage() at
startup and shutdown and have it print the difference in user time and
system time. Then, run tests for a fixed number of transactions and
see how the total CPU usage for the run differs.

Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
footprint for updates, and we found that compression was pretty darn
expensive there in terms of CPU time. So I am suspicious of the
finding that it is free here. It's not impossible that there's some
effect which causes us to recoup more CPU time than we spend
compressing in this case that did not apply in that case, but the
projects are awfully similar, so I tend to doubt it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Bruce Momjian

Date:

10 December 2014, 14:25:16

On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> The tests ran for around 30 mins.Manual checkpoint was run before each test.
> 
> Compression   WAL generated    %compression    Latency-avg   CPU usage
> (seconds)                                          TPS              Latency
> stddev               
> 
> 
> on                  1531.4 MB          ~35 %                  7.351 ms     
>   user diff: 562.67s     system diff: 41.40s              135.96            
>   13.759 ms
> 
> 
> off                  2373.1 MB                                     6.781 ms    
>       user diff: 354.20s      system diff: 39.67s            147.40            
>   14.152 ms
> 
> The compression obtained is quite high close to 35 %.
> CPU usage at user level when compression is on is quite noticeably high as
> compared to that when compression is off. But gain in terms of reduction of WAL
> is also high.

I am sorry but I can't understand the above results due to wrapping. 
Are you saying compression was twice as slow?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: [REVIEW] Re: Compression of full-page-writes

From

Arthur Silva

Date:

10 December 2014, 14:40:27

On Wed, Dec 10, 2014 at 12:10 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:

>What I would suggest is instrument the backend with getrusage() at
>startup and shutdown and have it print the difference in user time and
>system time. Then, run tests for a fixed number of transactions and
>see how the total CPU usage for the run differs.

Folllowing are the numbers obtained on tests with absolute CPU usage, fixed number of transactions and longer duration with latest fpw compression patch

pgbench command : pgbench -r -t 250000 -M prepared

To ensure that data is not highly compressible, empty filler columns were altered using

alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text

checkpoint_segments = 1024
checkpoint_timeout = 5min
fsync = on

The tests ran for around 30 mins.Manual checkpoint was run before each test.

Compression WAL generated %compression Latency-avg CPU usage (seconds) TPS Latency stddev

on 1531.4 MB ~35 % 7.351 ms   user diff: 562.67s system diff: 41.40s   135.96   13.759 ms

off 2373.1 MB   6.781 ms   user diff: 354.20s system diff: 39.67s   147.40   14.152 ms

The compression obtained is quite high close to 35 %.
CPU usage at user level when compression is on is quite noticeably high as compared to that when compression is off. But gain in terms of reduction of WAL is also high.

Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,

Rahila Syed

On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
>>If that's really true, we could consider having no configuration any
>>time, and just compressing always. But I'm skeptical that it's
>>actually true.
>
> I was referring to this for CPU utilization:
> http://www.postgresql.org/message-id/1410414381339-5818552.post@n5.nabble.com
> <http://>
>
> The above tests were performed on machine with configuration as follows
> Server specifications:
> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

I think that measurement methodology is not very good for assessing
the CPU overhead, because you are only measuring the percentage CPU
utilization, not the absolute amount of CPU utilization. It's not
clear whether the duration of the tests was the same for all the
configurations you tried - in which case the number of transactions
might have been different - or whether the number of operations was
exactly the same - in which case the runtime might have been
different. Either way, it could obscure an actual difference in
absolute CPU usage per transaction. It's unlikely that both the
runtime and the number of transactions were identical for all of your
tests, because that would imply that the patch makes no difference to
performance; if that were true, you wouldn't have bothered writing
it....

What I would suggest is instrument the backend with getrusage() at
startup and shutdown and have it print the difference in user time and
system time. Then, run tests for a fixed number of transactions and
see how the total CPU usage for the run differs.

Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
footprint for updates, and we found that compression was pretty darn
expensive there in terms of CPU time. So I am suspicious of the
finding that it is free here. It's not impossible that there's some
effect which causes us to recoup more CPU time than we spend
compressing in this case that did not apply in that case, but the
projects are awfully similar, so I tend to doubt it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

This can be improved in the future by using other algorithms.

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

11 December 2014, 07:56:48

>I am sorry but I can't understand the above results due to wrapping.
>Are you saying compression was twice as slow?

CPU usage at user level (in seconds) for compression set 'on' is 562 secs

while that for compression set 'off' is 354 secs. As per the readings, it takes little less than double CPU time to compress.

However , the total time taken to run 250000 transactions for each of the scenario is as follows,

compression = 'on' : 1838 secs

= 'off' : 1701 secs

Different is around 140 secs.

Thank you,

Rahila Syed

On Wed, Dec 10, 2014 at 7:55 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> The tests ran for around 30 mins.Manual checkpoint was run before each test.
>
> Compression WAL generated %compression Latency-avg CPU usage
> (seconds) TPS Latency
> stddev
>
>
> on 1531.4 MB ~35 % 7.351 ms
>   user diff: 562.67s system diff: 41.40s   135.96
>   13.759 ms
>
>
> off 2373.1 MB   6.781 ms
>   user diff: 354.20s system diff: 39.67s   147.40
>   14.152 ms
>
> The compression obtained is quite high close to 35 %.
> CPU usage at user level when compression is on is quite noticeably high as
> compared to that when compression is off. But gain in terms of reduction of WAL
> is also high.

I am sorry but I can't understand the above results due to wrapping.
Are you saying compression was twice as slow?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

Re: [REVIEW] Re: Compression of full-page-writes

From

Bruce Momjian

Date:

11 December 2014, 16:35:05

On Thu, Dec 11, 2014 at 01:26:38PM +0530, Rahila Syed wrote:
> >I am sorry but I can't understand the above results due to wrapping.
> >Are you saying compression was twice as slow?
> 
> CPU usage at user level (in seconds)  for compression set 'on' is 562 secs
> while that for compression  set 'off' is 354 secs. As per the readings, it
> takes little less than double CPU time to compress.
> However , the total time  taken to run 250000 transactions for each of the
> scenario is as follows,
> 
> compression = 'on'  : 1838 secs
>             = 'off' : 1701 secs
> 
> 
> Different is around 140 secs.

OK, so the compression took 2x the cpu and was 8% slower.  The only
benefit is WAL files are 35% smaller?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

12 December 2014, 00:37:17

On Fri, Dec 12, 2014 at 1:34 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Thu, Dec 11, 2014 at 01:26:38PM +0530, Rahila Syed wrote:
> >I am sorry but I can't understand the above results due to wrapping.
> >Are you saying compression was twice as slow?
>
> CPU usage at user level (in seconds) for compression set 'on' is 562 secs
> while that for compression set 'off' is 354 secs. As per the readings, it
> takes little less than double CPU time to compress.
> However , the total time taken to run 250000 transactions for each of the
> scenario is as follows,
>
> compression = 'on' : 1838 secs
> = 'off' : 1701 secs
>
>
> Different is around 140 secs.

OK, so the compression took 2x the cpu and was 8% slower. The only
benefit is WAL files are 35% smaller?

That depends as well on the compression algorithm used. I am far from being a specialist in this area, but I guess that there are things consuming less CPU for a lower rate of compression and that there are no magic solutions. A correct answer would be to either change the compression algorithm present in core to something that is more compliant to the FPW compression, or to add hooks to allow people to plug in the compression algorithm they want for the compression and decompression calls. In any case and for any type of compression (be it different algo, record-level compression or FPW compression), what we have here is a tradeoff, and a switch for people who care more about I/O than CPU usage. And we would still face in any case CPU bursts at checkpoints because I can't imagine FPWs not being compressed even if we do something at record level (thinking so what we have here is the light-compression version).

Regards,

Michael

Re: Compression of full-page-writes

From

Michael Paquier

Date:

12 December 2014, 03:33:31

On Tue, Dec 9, 2014 at 4:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Dec 7, 2014 at 9:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > * parameter should be SUSET - it doesn't *need* to be set only at
> > server start since all records are independent of each other
>
> Why not USERSET?  There's no point in trying to prohibit users from
> doing things that will cause bad performance because they can do that
> anyway.

Using SUSET or USERSET has a small memory cost: we should
unconditionally palloc the buffers containing the compressed data
until WAL is written out. We could always call an equivalent of
InitXLogInsert when this parameter is updated but that would be
bug-prone IMO and it does not plead in favor of code simplicity.
Regards,
-- 
Michael

Re: Compression of full-page-writes

From

Robert Haas

Date:

12 December 2014, 13:23:46

On Thu, Dec 11, 2014 at 10:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 4:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Dec 7, 2014 at 9:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > * parameter should be SUSET - it doesn't *need* to be set only at
>> > server start since all records are independent of each other
>>
>> Why not USERSET?  There's no point in trying to prohibit users from
>> doing things that will cause bad performance because they can do that
>> anyway.
>
> Using SUSET or USERSET has a small memory cost: we should
> unconditionally palloc the buffers containing the compressed data
> until WAL is written out. We could always call an equivalent of
> InitXLogInsert when this parameter is updated but that would be
> bug-prone IMO and it does not plead in favor of code simplicity.

I don't understand what you're saying here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

12 December 2014, 13:28:09

On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> compression = 'on'  : 1838 secs
>>             = 'off' : 1701 secs
>>
>> Different is around 140 secs.
>
> OK, so the compression took 2x the cpu and was 8% slower.  The only
> benefit is WAL files are 35% smaller?

Compression didn't take 2x the CPU.  It increased user CPU from 354.20
s to 562.67 s over the course of the run, so it took about 60% more
CPU.

But I wouldn't be too discouraged by that.  At least AIUI, there are
quite a number of users for whom WAL volume is a serious challenge,
and they might be willing to pay that price to have less of it.  Also,
we have talked a number of times before about incorporating Snappy or
LZ4, which I'm guessing would save a fair amount of CPU -- but the
decision was made to leave that out of the first version, and just use
pg_lz, to keep the initial patch simple.  I think that was a good
decision.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

12 December 2014, 14:06:02

On 2014-12-12 08:27:59 -0500, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
> >> compression = 'on'  : 1838 secs
> >>             = 'off' : 1701 secs
> >>
> >> Different is around 140 secs.
> >
> > OK, so the compression took 2x the cpu and was 8% slower.  The only
> > benefit is WAL files are 35% smaller?
> 
> Compression didn't take 2x the CPU.  It increased user CPU from 354.20
> s to 562.67 s over the course of the run, so it took about 60% more
> CPU.
> 
> But I wouldn't be too discouraged by that.  At least AIUI, there are
> quite a number of users for whom WAL volume is a serious challenge,
> and they might be willing to pay that price to have less of it.

And it might actually result in *higher* performance in a good number of
cases if the the WAL flushes are a significant part of the cost.

IIRC he test used a single process - that's probably not too
representative...

> Also,
> we have talked a number of times before about incorporating Snappy or
> LZ4, which I'm guessing would save a fair amount of CPU -- but the
> decision was made to leave that out of the first version, and just use
> pg_lz, to keep the initial patch simple.  I think that was a good
> decision.

Agreed.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

Michael Paquier

Date:

12 December 2014, 14:15:35

On Fri, Dec 12, 2014 at 10:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 11, 2014 at 10:33 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Dec 9, 2014 at 4:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Sun, Dec 7, 2014 at 9:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> > * parameter should be SUSET - it doesn't *need* to be set only at
>>> > server start since all records are independent of each other
>>>
>>> Why not USERSET?  There's no point in trying to prohibit users from
>>> doing things that will cause bad performance because they can do that
>>> anyway.
>>
>> Using SUSET or USERSET has a small memory cost: we should
>> unconditionally palloc the buffers containing the compressed data
>> until WAL is written out. We could always call an equivalent of
>> InitXLogInsert when this parameter is updated but that would be
>> bug-prone IMO and it does not plead in favor of code simplicity.
>
> I don't understand what you're saying here.
I just meant that the scratch buffers used to store temporarily the
compressed and uncompressed data should be palloc'd all the time, even
if the switch is off.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Bruce Momjian

Date:

12 December 2014, 14:18:24

On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
> >> compression = 'on'  : 1838 secs
> >>             = 'off' : 1701 secs
> >>
> >> Different is around 140 secs.
> >
> > OK, so the compression took 2x the cpu and was 8% slower.  The only
> > benefit is WAL files are 35% smaller?
> 
> Compression didn't take 2x the CPU.  It increased user CPU from 354.20
> s to 562.67 s over the course of the run, so it took about 60% more
> CPU.
> 
> But I wouldn't be too discouraged by that.  At least AIUI, there are
> quite a number of users for whom WAL volume is a serious challenge,
> and they might be willing to pay that price to have less of it.  Also,
> we have talked a number of times before about incorporating Snappy or
> LZ4, which I'm guessing would save a fair amount of CPU -- but the
> decision was made to leave that out of the first version, and just use
> pg_lz, to keep the initial patch simple.  I think that was a good
> decision.

Well, the larger question is why wouldn't we just have the user compress
the entire WAL file before archiving --- why have each backend do it? 
Is it the write volume we are saving?  I though this WAL compression
gave better performance in some cases.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

12 December 2014, 14:22:39

On 2014-12-12 09:18:01 -0500, Bruce Momjian wrote:
> On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote:
> > On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > >> compression = 'on'  : 1838 secs
> > >>             = 'off' : 1701 secs
> > >>
> > >> Different is around 140 secs.
> > >
> > > OK, so the compression took 2x the cpu and was 8% slower.  The only
> > > benefit is WAL files are 35% smaller?
> > 
> > Compression didn't take 2x the CPU.  It increased user CPU from 354.20
> > s to 562.67 s over the course of the run, so it took about 60% more
> > CPU.
> > 
> > But I wouldn't be too discouraged by that.  At least AIUI, there are
> > quite a number of users for whom WAL volume is a serious challenge,
> > and they might be willing to pay that price to have less of it.  Also,
> > we have talked a number of times before about incorporating Snappy or
> > LZ4, which I'm guessing would save a fair amount of CPU -- but the
> > decision was made to leave that out of the first version, and just use
> > pg_lz, to keep the initial patch simple.  I think that was a good
> > decision.
> 
> Well, the larger question is why wouldn't we just have the user compress
> the entire WAL file before archiving --- why have each backend do it? 
> Is it the write volume we are saving?  I though this WAL compression
> gave better performance in some cases.

Err. Streaming?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Bruce Momjian

Date:

12 December 2014, 14:24:40

On Fri, Dec 12, 2014 at 03:22:24PM +0100, Andres Freund wrote:
> On 2014-12-12 09:18:01 -0500, Bruce Momjian wrote:
> > On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote:
> > > On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > > >> compression = 'on'  : 1838 secs
> > > >>             = 'off' : 1701 secs
> > > >>
> > > >> Different is around 140 secs.
> > > >
> > > > OK, so the compression took 2x the cpu and was 8% slower.  The only
> > > > benefit is WAL files are 35% smaller?
> > > 
> > > Compression didn't take 2x the CPU.  It increased user CPU from 354.20
> > > s to 562.67 s over the course of the run, so it took about 60% more
> > > CPU.
> > > 
> > > But I wouldn't be too discouraged by that.  At least AIUI, there are
> > > quite a number of users for whom WAL volume is a serious challenge,
> > > and they might be willing to pay that price to have less of it.  Also,
> > > we have talked a number of times before about incorporating Snappy or
> > > LZ4, which I'm guessing would save a fair amount of CPU -- but the
> > > decision was made to leave that out of the first version, and just use
> > > pg_lz, to keep the initial patch simple.  I think that was a good
> > > decision.
> > 
> > Well, the larger question is why wouldn't we just have the user compress
> > the entire WAL file before archiving --- why have each backend do it? 
> > Is it the write volume we are saving?  I though this WAL compression
> > gave better performance in some cases.
> 
> Err. Streaming?

Well, you can already set up SSL for compression while streaming.  In
fact, I assume many are already using SSL for streaming as the majority
of SSL overhead is from connection start.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

12 December 2014, 14:30:34

On 2014-12-12 09:24:27 -0500, Bruce Momjian wrote:
> On Fri, Dec 12, 2014 at 03:22:24PM +0100, Andres Freund wrote:
> > > Well, the larger question is why wouldn't we just have the user compress
> > > the entire WAL file before archiving --- why have each backend do it? 
> > > Is it the write volume we are saving?  I though this WAL compression
> > > gave better performance in some cases.
> > 
> > Err. Streaming?
> 
> Well, you can already set up SSL for compression while streaming.  In
> fact, I assume many are already using SSL for streaming as the majority
> of SSL overhead is from connection start.

That's not really true. The overhead of SSL during streaming is
*significant*. Both the kind of compression it does (which is far more
expensive than pglz or lz4) and the encyrption itself. In many cases
it's prohibitively expensive - there's even a fair number on-list
reports about this.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

Michael Paquier

Date:

12 December 2014, 14:35:25

On Fri, Dec 12, 2014 at 11:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 12, 2014 at 9:15 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> I just meant that the scratch buffers used to store temporarily the
>> compressed and uncompressed data should be palloc'd all the time, even
>> if the switch is off.
>
> If they're fixed size, you can just put them on the heap as static globals.
> static char space_for_stuff[65536];
Well sure :)

> Or whatever you need.
> I don't think that's a cost worth caring about.
OK, I thought it was.
-- 
Michael

Re: Compression of full-page-writes

From

Robert Haas

Date:

12 December 2014, 14:37:37

On Fri, Dec 12, 2014 at 9:15 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> I just meant that the scratch buffers used to store temporarily the
> compressed and uncompressed data should be palloc'd all the time, even
> if the switch is off.

If they're fixed size, you can just put them on the heap as static globals.

static char space_for_stuff[65536];

Or whatever you need.

I don't think that's a cost worth caring about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Compression of full-page-writes

From

Robert Haas

Date:

12 December 2014, 14:40:49

On Fri, Dec 12, 2014 at 9:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> I don't think that's a cost worth caring about.
> OK, I thought it was.

Space on the heap that never gets used is basically free.  The OS
won't actually allocate physical memory unless the pages are actually
accessed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

12 December 2014, 14:42:34

Hello,

>Well, the larger question is why wouldn't we just have the user compress
>the entire WAL file before archiving --- why have each backend do it?
>Is it the write volume we are saving?

IIUC, the idea here is to not only save the on disk size of WAL but to reduce the overhead of flushing WAL records to disk in servers with heavy write operations. So yes improving the performance by saving write volume is a part of the requirement.

Thank you,

Rahila Syed

On Fri, Dec 12, 2014 at 7:48 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
> >> compression = 'on' : 1838 secs
> >> = 'off' : 1701 secs
> >>
> >> Different is around 140 secs.
> >
> > OK, so the compression took 2x the cpu and was 8% slower. The only
> > benefit is WAL files are 35% smaller?
>
> Compression didn't take 2x the CPU. It increased user CPU from 354.20
> s to 562.67 s over the course of the run, so it took about 60% more
> CPU.
>
> But I wouldn't be too discouraged by that. At least AIUI, there are
> quite a number of users for whom WAL volume is a serious challenge,
> and they might be willing to pay that price to have less of it. Also,
> we have talked a number of times before about incorporating Snappy or
> LZ4, which I'm guessing would save a fair amount of CPU -- but the
> decision was made to leave that out of the first version, and just use
> pg_lz, to keep the initial patch simple. I think that was a good
> decision.

Well, the larger question is why wouldn't we just have the user compress
the entire WAL file before archiving --- why have each backend do it?
Is it the write volume we are saving? I though this WAL compression
gave better performance in some cases.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

Re: [REVIEW] Re: Compression of full-page-writes

From

Bruce Momjian

Date:

12 December 2014, 14:46:22

On Fri, Dec 12, 2014 at 03:27:33PM +0100, Andres Freund wrote:
> On 2014-12-12 09:24:27 -0500, Bruce Momjian wrote:
> > On Fri, Dec 12, 2014 at 03:22:24PM +0100, Andres Freund wrote:
> > > > Well, the larger question is why wouldn't we just have the user compress
> > > > the entire WAL file before archiving --- why have each backend do it? 
> > > > Is it the write volume we are saving?  I though this WAL compression
> > > > gave better performance in some cases.
> > > 
> > > Err. Streaming?
> > 
> > Well, you can already set up SSL for compression while streaming.  In
> > fact, I assume many are already using SSL for streaming as the majority
> > of SSL overhead is from connection start.
> 
> That's not really true. The overhead of SSL during streaming is
> *significant*. Both the kind of compression it does (which is far more
> expensive than pglz or lz4) and the encyrption itself. In many cases
> it's prohibitively expensive - there's even a fair number on-list
> reports about this.

Well, I am just trying to understand when someone would benefit from WAL
compression.  Are we saying it is only useful for non-SSL streaming?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

12 December 2014, 14:50:53

On Wed, Dec 10, 2014 at 11:25 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> The tests ran for around 30 mins.Manual checkpoint was run before each test.
>
> Compression WAL generated %compression Latency-avg CPU usage
> (seconds) TPS Latency
> stddev
>
>
> on 1531.4 MB ~35 % 7.351 ms
>   user diff: 562.67s system diff: 41.40s   135.96
>   13.759 ms
>
>
> off 2373.1 MB   6.781 ms
>   user diff: 354.20s system diff: 39.67s   147.40
>   14.152 ms
>
> The compression obtained is quite high close to 35 %.
> CPU usage at user level when compression is on is quite noticeably high as
> compared to that when compression is off. But gain in terms of reduction of WAL
> is also high.

I am sorry but I can't understand the above results due to wrapping.
Are you saying compression was twice as slow?

I got curious to see how the compression of an entire record would perform and how it compares for small WAL records, and here are some numbers based on the patch attached, this patch compresses the whole record including the block headers, letting only XLogRecord out of it with a flag indicating that the record is compressed (note that this patch contains a portion for replay untested, still this patch gives an idea on how much compression of the whole record affects user CPU in this test case). It uses a buffer of 4 * BLCKSZ, if the record is longer than that compression is simply given up. Those tests are using the hack upthread calculating user and system CPU using getrusage() when a backend.

Here is the simple test case I used with 512MB of shared_buffers and small records, filling up a bunch of buffers, dirtying them and them compressing FPWs with a checkpoint.

#!/bin/bash
psql <<EOF
SELECT pg_backend_pid();
CREATE TABLE aa (a int);
CREATE TABLE results (phase text, position pg_lsn);
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
ALTER TABLE aa SET (FILLFACTOR = 50);
INSERT INTO results VALUES ('pre-insert', pg_current_xlog_location());
INSERT INTO aa VALUES (generate_series(1,7000000)); -- 484MB
SELECT pg_size_pretty(pg_relation_size('aa'::regclass));
SELECT pg_prewarm('aa'::regclass);
CHECKPOINT;
INSERT INTO results VALUES ('pre-update', pg_current_xlog_location());
UPDATE aa SET a = 7000000 + a;
CHECKPOINT;
INSERT INTO results VALUES ('post-update', pg_current_xlog_location());
SELECT * FROM results;
EOF

Note that autovacuum and fsync are off.

=# select phase, user_diff, system_diff,
pg_size_pretty(pre_update - pre_insert),
pg_size_pretty(post_update - pre_update) from results;
       phase        | user_diff | system_diff | pg_size_pretty | pg_size_pretty
--------------------+-----------+-------------+----------------+----------------
Compression FPW    | 42.990799 |    0.868179 | 429 MB         | 567 MB
No compression     | 25.688731 |    1.236551 | 429 MB         | 727 MB
Compression record | 56.376750 |    0.769603 | 429 MB         | 566 MB
(3 rows)

If we do record-level compression, we'll need to be very careful in defining a lower-bound to not eat unnecessary CPU resources, perhaps something that should be controlled with a GUC. I presume that this stands true as well for the upper bound.

Regards,

Michael

Attachment

20141212_record_level_compression.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

12 December 2014, 14:53:39

On 2014-12-12 09:46:13 -0500, Bruce Momjian wrote:
> On Fri, Dec 12, 2014 at 03:27:33PM +0100, Andres Freund wrote:
> > On 2014-12-12 09:24:27 -0500, Bruce Momjian wrote:
> > > On Fri, Dec 12, 2014 at 03:22:24PM +0100, Andres Freund wrote:
> > > > > Well, the larger question is why wouldn't we just have the user compress
> > > > > the entire WAL file before archiving --- why have each backend do it? 
> > > > > Is it the write volume we are saving?  I though this WAL compression
> > > > > gave better performance in some cases.
> > > > 
> > > > Err. Streaming?
> > > 
> > > Well, you can already set up SSL for compression while streaming.  In
> > > fact, I assume many are already using SSL for streaming as the majority
> > > of SSL overhead is from connection start.
> > 
> > That's not really true. The overhead of SSL during streaming is
> > *significant*. Both the kind of compression it does (which is far more
> > expensive than pglz or lz4) and the encyrption itself. In many cases
> > it's prohibitively expensive - there's even a fair number on-list
> > reports about this.
> 
> Well, I am just trying to understand when someone would benefit from WAL
> compression.  Are we saying it is only useful for non-SSL streaming?

No, not at all. It's useful in a lot more situations:

* The amount of WAL in pg_xlog can make up a significant portion of a database's size. Especially in large OLTP
databases.Compressing archives doesn't help with that.
 
* The original WAL volume itself can be quite problematic because at some point its exhausting the underlying IO
subsystem.Both due to the pure write rate and to the fsync()s regularly required.
 
* ssl compression can often not be used for WAL streaming because it's too slow as it's uses a much more expensive
algorithm.Which is why we even have a GUC to disable it.
 

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

12 December 2014, 15:04:43

On 2014-12-12 23:50:43 +0900, Michael Paquier wrote:
> I got curious to see how the compression of an entire record would perform
> and how it compares for small WAL records, and here are some numbers based
> on the patch attached, this patch compresses the whole record including the
> block headers, letting only XLogRecord out of it with a flag indicating
> that the record is compressed (note that this patch contains a portion for
> replay untested, still this patch gives an idea on how much compression of
> the whole record affects user CPU in this test case). It uses a buffer of 4
> * BLCKSZ, if the record is longer than that compression is simply given up.
> Those tests are using the hack upthread calculating user and system CPU
> using getrusage() when a backend.
> 
> Here is the simple test case I used with 512MB of shared_buffers and small
> records, filling up a bunch of buffers, dirtying them and them compressing
> FPWs with a checkpoint.
> #!/bin/bash
> psql <<EOF
> SELECT pg_backend_pid();
> CREATE TABLE aa (a int);
> CREATE TABLE results (phase text, position pg_lsn);
> CREATE EXTENSION IF NOT EXISTS pg_prewarm;
> ALTER TABLE aa SET (FILLFACTOR = 50);
> INSERT INTO results VALUES ('pre-insert', pg_current_xlog_location());
> INSERT INTO aa VALUES (generate_series(1,7000000)); -- 484MB
> SELECT pg_size_pretty(pg_relation_size('aa'::regclass));
> SELECT pg_prewarm('aa'::regclass);
> CHECKPOINT;
> INSERT INTO results VALUES ('pre-update', pg_current_xlog_location());
> UPDATE aa SET a = 7000000 + a;
> CHECKPOINT;
> INSERT INTO results VALUES ('post-update', pg_current_xlog_location());
> SELECT * FROM results;
> EOF
> 
> Note that autovacuum and fsync are off.
> =# select phase, user_diff, system_diff,
> pg_size_pretty(pre_update - pre_insert),
> pg_size_pretty(post_update - pre_update) from results;
>        phase        | user_diff | system_diff | pg_size_pretty |
> pg_size_pretty
> --------------------+-----------+-------------+----------------+----------------
>  Compression FPW    | 42.990799 |    0.868179 | 429 MB         | 567 MB
>  No compression     | 25.688731 |    1.236551 | 429 MB         | 727 MB
>  Compression record | 56.376750 |    0.769603 | 429 MB         | 566 MB
> (3 rows)
> If we do record-level compression, we'll need to be very careful in
> defining a lower-bound to not eat unnecessary CPU resources, perhaps
> something that should be controlled with a GUC. I presume that this stands
> true as well for the upper bound.

Record level compression pretty obviously would need a lower boundary
for when to use compression. It won't be useful for small heapam/btree
records, but it'll be rather useful for large multi_insert, clean or
similar records...

Greetings,

Andres Freund

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

12 December 2014, 16:09:04

On Fri, Dec 12, 2014 at 10:04 AM, Andres Freund <andres@anarazel.de> wrote:
>> Note that autovacuum and fsync are off.
>> =# select phase, user_diff, system_diff,
>> pg_size_pretty(pre_update - pre_insert),
>> pg_size_pretty(post_update - pre_update) from results;
>>        phase        | user_diff | system_diff | pg_size_pretty |
>> pg_size_pretty
>> --------------------+-----------+-------------+----------------+----------------
>>  Compression FPW    | 42.990799 |    0.868179 | 429 MB         | 567 MB
>>  No compression     | 25.688731 |    1.236551 | 429 MB         | 727 MB
>>  Compression record | 56.376750 |    0.769603 | 429 MB         | 566 MB
>> (3 rows)
>> If we do record-level compression, we'll need to be very careful in
>> defining a lower-bound to not eat unnecessary CPU resources, perhaps
>> something that should be controlled with a GUC. I presume that this stands
>> true as well for the upper bound.
>
> Record level compression pretty obviously would need a lower boundary
> for when to use compression. It won't be useful for small heapam/btree
> records, but it'll be rather useful for large multi_insert, clean or
> similar records...

Unless I'm missing something, this test is showing that FPW
compression saves 298MB of WAL for 17.3 seconds of CPU time, as
against master.  And compressing the whole record saves a further 1MB
of WAL for a further 13.39 seconds of CPU time.  That makes
compressing the whole record sound like a pretty terrible idea - even
if you get more benefit by reducing the lower boundary, you're still
burning a ton of extra CPU time for almost no gain on the larger
records.  Ouch!

(Of course, I'm assuming that Michael's patch is reasonably efficient,
which might not be true.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

12 December 2014, 16:12:15

On 2014-12-12 11:08:52 -0500, Robert Haas wrote:
> Unless I'm missing something, this test is showing that FPW
> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
> against master.  And compressing the whole record saves a further 1MB
> of WAL for a further 13.39 seconds of CPU time.  That makes
> compressing the whole record sound like a pretty terrible idea - even
> if you get more benefit by reducing the lower boundary, you're still
> burning a ton of extra CPU time for almost no gain on the larger
> records.  Ouch!

Well, that test pretty much doesn't have any large records besides FPWs
afaics. So it's unsurprising that it's not beneficial.

Greetings,

Andres Freund

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

12 December 2014, 16:15:52

On Fri, Dec 12, 2014 at 11:12 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2014-12-12 11:08:52 -0500, Robert Haas wrote:
>> Unless I'm missing something, this test is showing that FPW
>> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
>> against master.  And compressing the whole record saves a further 1MB
>> of WAL for a further 13.39 seconds of CPU time.  That makes
>> compressing the whole record sound like a pretty terrible idea - even
>> if you get more benefit by reducing the lower boundary, you're still
>> burning a ton of extra CPU time for almost no gain on the larger
>> records.  Ouch!
>
> Well, that test pretty much doesn't have any large records besides FPWs
> afaics. So it's unsurprising that it's not beneficial.

"Not beneficial" is rather an understatement.  It's actively harmful,
and not by a small margin.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

12 December 2014, 16:19:47

On 2014-12-12 11:15:46 -0500, Robert Haas wrote:
> On Fri, Dec 12, 2014 at 11:12 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2014-12-12 11:08:52 -0500, Robert Haas wrote:
> >> Unless I'm missing something, this test is showing that FPW
> >> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
> >> against master.  And compressing the whole record saves a further 1MB
> >> of WAL for a further 13.39 seconds of CPU time.  That makes
> >> compressing the whole record sound like a pretty terrible idea - even
> >> if you get more benefit by reducing the lower boundary, you're still
> >> burning a ton of extra CPU time for almost no gain on the larger
> >> records.  Ouch!
> >
> > Well, that test pretty much doesn't have any large records besides FPWs
> > afaics. So it's unsurprising that it's not beneficial.
> 
> "Not beneficial" is rather an understatement.  It's actively harmful,
> and not by a small margin.

Sure, but that's just because it's too simplistic. I don't think it
makes sense to make any inference about the worthyness of the general
approach from the, nearly obvious, fact that compressing every tiny
record is a bad idea.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Bruce Momjian

Date:

12 December 2014, 18:04:50

On Fri, Dec 12, 2014 at 05:19:42PM +0100, Andres Freund wrote:
> On 2014-12-12 11:15:46 -0500, Robert Haas wrote:
> > On Fri, Dec 12, 2014 at 11:12 AM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2014-12-12 11:08:52 -0500, Robert Haas wrote:
> > >> Unless I'm missing something, this test is showing that FPW
> > >> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
> > >> against master.  And compressing the whole record saves a further 1MB
> > >> of WAL for a further 13.39 seconds of CPU time.  That makes
> > >> compressing the whole record sound like a pretty terrible idea - even
> > >> if you get more benefit by reducing the lower boundary, you're still
> > >> burning a ton of extra CPU time for almost no gain on the larger
> > >> records.  Ouch!
> > >
> > > Well, that test pretty much doesn't have any large records besides FPWs
> > > afaics. So it's unsurprising that it's not beneficial.
> > 
> > "Not beneficial" is rather an understatement.  It's actively harmful,
> > and not by a small margin.
> 
> Sure, but that's just because it's too simplistic. I don't think it
> makes sense to make any inference about the worthyness of the general
> approach from the, nearly obvious, fact that compressing every tiny
> record is a bad idea.

Well, it seems we need to see some actual cases where compression does
help before moving forward.  I thought Amit had some amazing numbers for
WAL compression --- has that changed?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: [REVIEW] Re: Compression of full-page-writes

From

Simon Riggs

Date:

12 December 2014, 18:51:22

On 12 December 2014 at 18:04, Bruce Momjian <bruce@momjian.us> wrote:

> Well, it seems we need to see some actual cases where compression does
> help before moving forward.  I thought Amit had some amazing numbers for
> WAL compression --- has that changed?

For background processes, like VACUUM, then WAL compression will be
helpful. The numbers show that only applies to FPWs.

I remain concerned about the cost in foreground processes, especially
since the cost will be paid immediately after checkpoint, making our
spikes worse.

What I don't understand is why we aren't working on double buffering,
since that cost would be paid in a background process and would be
evenly spread out across a checkpoint. Plus we'd be able to remove
FPWs altogether, which is like 100% compression.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Robert Haas

Date:

12 December 2014, 21:40:09

On Fri, Dec 12, 2014 at 1:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> What I don't understand is why we aren't working on double buffering,
> since that cost would be paid in a background process and would be
> evenly spread out across a checkpoint. Plus we'd be able to remove
> FPWs altogether, which is like 100% compression.

The previous patch to implement that - by somebody at vmware - was an
epic fail.  I'm not opposed to seeing somebody try again, but it's a
tricky problem.  When the double buffer fills up, then you've got to
finish flushing the pages whose images are stored in the buffer to
disk before you can overwrite it, which acts like a kind of
mini-checkpoint.  That problem might be solvable, but let's use this
thread to discuss this patch, not some other patch that someone might
have chosen to write but didn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

12 December 2014, 22:25:47

On Sat, Dec 13, 2014 at 1:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 12, 2014 at 10:04 AM, Andres Freund <andres@anarazel.de> wrote:
>>> Note that autovacuum and fsync are off.
>>> =# select phase, user_diff, system_diff,
>>> pg_size_pretty(pre_update - pre_insert),
>>> pg_size_pretty(post_update - pre_update) from results;
>>>        phase        | user_diff | system_diff | pg_size_pretty |
>>> pg_size_pretty
>>> --------------------+-----------+-------------+----------------+----------------
>>>  Compression FPW    | 42.990799 |    0.868179 | 429 MB         | 567 MB
>>>  No compression     | 25.688731 |    1.236551 | 429 MB         | 727 MB
>>>  Compression record | 56.376750 |    0.769603 | 429 MB         | 566 MB
>>> (3 rows)
>>> If we do record-level compression, we'll need to be very careful in
>>> defining a lower-bound to not eat unnecessary CPU resources, perhaps
>>> something that should be controlled with a GUC. I presume that this stands
>>> true as well for the upper bound.
>>
>> Record level compression pretty obviously would need a lower boundary
>> for when to use compression. It won't be useful for small heapam/btree
>> records, but it'll be rather useful for large multi_insert, clean or
>> similar records...
>
> Unless I'm missing something, this test is showing that FPW
> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
> against master.  And compressing the whole record saves a further 1MB
> of WAL for a further 13.39 seconds of CPU time.  That makes
> compressing the whole record sound like a pretty terrible idea - even
> if you get more benefit by reducing the lower boundary, you're still
> burning a ton of extra CPU time for almost no gain on the larger
> records.  Ouch!
>
> (Of course, I'm assuming that Michael's patch is reasonably efficient,
> which might not be true.)
Note that I was curious about the worst-case ever, aka how much CPU
pg_lzcompress would use if everything is compressed, even the smallest
records. So we'll surely need a lower-bound. I think that doing some
tests with a lower bound set as a multiple of SizeOfXLogRecord would
be fine, but in this case what we'll see is a result similar to what
FPW compression does.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Claudio Freire

Date:

12 December 2014, 22:31:10

On Fri, Dec 12, 2014 at 7:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Dec 13, 2014 at 1:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Dec 12, 2014 at 10:04 AM, Andres Freund <andres@anarazel.de> wrote:
>>>> Note that autovacuum and fsync are off.
>>>> =# select phase, user_diff, system_diff,
>>>> pg_size_pretty(pre_update - pre_insert),
>>>> pg_size_pretty(post_update - pre_update) from results;
>>>>        phase        | user_diff | system_diff | pg_size_pretty |
>>>> pg_size_pretty
>>>> --------------------+-----------+-------------+----------------+----------------
>>>>  Compression FPW    | 42.990799 |    0.868179 | 429 MB         | 567 MB
>>>>  No compression     | 25.688731 |    1.236551 | 429 MB         | 727 MB
>>>>  Compression record | 56.376750 |    0.769603 | 429 MB         | 566 MB
>>>> (3 rows)
>>>> If we do record-level compression, we'll need to be very careful in
>>>> defining a lower-bound to not eat unnecessary CPU resources, perhaps
>>>> something that should be controlled with a GUC. I presume that this stands
>>>> true as well for the upper bound.
>>>
>>> Record level compression pretty obviously would need a lower boundary
>>> for when to use compression. It won't be useful for small heapam/btree
>>> records, but it'll be rather useful for large multi_insert, clean or
>>> similar records...
>>
>> Unless I'm missing something, this test is showing that FPW
>> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
>> against master.  And compressing the whole record saves a further 1MB
>> of WAL for a further 13.39 seconds of CPU time.  That makes
>> compressing the whole record sound like a pretty terrible idea - even
>> if you get more benefit by reducing the lower boundary, you're still
>> burning a ton of extra CPU time for almost no gain on the larger
>> records.  Ouch!
>>
>> (Of course, I'm assuming that Michael's patch is reasonably efficient,
>> which might not be true.)
> Note that I was curious about the worst-case ever, aka how much CPU
> pg_lzcompress would use if everything is compressed, even the smallest
> records. So we'll surely need a lower-bound. I think that doing some
> tests with a lower bound set as a multiple of SizeOfXLogRecord would
> be fine, but in this case what we'll see is a result similar to what
> FPW compression does.


In general, lz4 (and pg_lz is similar to lz4) compresses very poorly
anything below about 128b in length. Of course there are outliers,
with some very compressible stuff, but with regular text or JSON data,
it's quite unlikely to compress at all with smaller input. Compression
is modest up to about 1k when it starts to really pay off.

That's at least my experience with lots JSON-ish, text-ish and CSV
data sets, compressible but not so much in small bits.

Re: Compression of full-page-writes

From

Amit Kapila

Date:

13 December 2014, 10:51:07

On Tue, Dec 9, 2014 at 10:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Dec 8, 2014 at 3:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> > On 8 December 2014 at 11:46, Michael Paquier <michael.paquier@gmail.com> wrote:
> > > I don't really like those new names, but I'd prefer
> > > wal_compression_level if we go down that road with 'none' as default
> > > value. We may still decide in the future to support compression at the
> > > record level instead of context level, particularly if we have an API
> > > able to do palloc_return_null_at_oom, so the idea of WAL compression
> > > is not related only to FPIs IMHO.
> >
> > We may yet decide, but the pglz implementation is not effective on
> > smaller record lengths. Nor has any testing been done to show that is
> > even desirable.
> >
>
> It's even much worse for non-compressible (or less-compressible)
> WAL data.

To check the actual effect, I have ran few tests with the patch

(0001-Move-pg_lzcompress.c-to-src-common,

0002-Support-compression-for-full-page-writes-in-WAL) and the

data shows that for worst case (9 short and 1 long, short changed)

there is dip of ~56% in runtime where the compression is less (~20%)

and a ~35% of dip in runtime for the small record size

(two short fields, no change) where compression is ~28%. For best

case (one short and one long field, no change), the compression is

more than 2 times and there is an improvement in runtime of ~4%.

Note that in worst case, I am using random string due to which the

compression is less and it seems to me that worst is not by far the

worst because we see some compression in that case as well. I

think this might not be the best test to measure the effect of this

patch, but still it has data for various compression ratio's which

could indicate the value of this patch. Test case used to take

below data is attached with this mail.

Seeing this data, one way to mitigate the cases where it can cause

performance impact is to have a table level compression flag which

we have discussed last year during development of WAL compression

for Update operation as well.

Performance Data

-----------------------------

m/c configuration -

IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB

Non-default parameters -

checkpoint_segments - 256

checkpoint_timeout - 15 min

wal_compression=off

testname | wal_generated | duration

-----------------------------------------+---------------+------------------

two short fields, no change | 540055720 | 12.1288201808929

two short fields, no change | 542911816 | 11.8804960250854

two short fields, no change | 540063400 | 11.7856659889221

two short fields, one changed | 540055792 | 11.9835240840912

two short fields, one changed | 540056624 | 11.9008920192719

two short fields, one changed | 540059560 | 12.064150094986

two short fields, both changed | 581813832 | 10.2909409999847

two short fields, both changed | 579823384 | 12.4431331157684

two short fields, both changed | 579896448 | 12.5214929580688

one short and one long field, no change | 320058048 | 5.04950094223022

one short and one long field, no change | 321150040 | 5.24907302856445

one short and one long field, no change | 320055072 | 5.07368278503418

ten tiny fields, all changed | 620765680 | 14.2868521213531

ten tiny fields, all changed | 620681176 | 14.2786719799042

ten tiny fields, all changed | 620684600 | 14.216343164444

hundred tiny fields, all changed | 306317512 | 6.98173499107361

hundred tiny fields, all changed | 308039000 | 7.03955984115601

hundred tiny fields, all changed | 307117016 | 7.11708188056946

hundred tiny fields, half changed | 306483392 | 7.06978106498718

hundred tiny fields, half changed | 309336056 | 7.07678198814392

hundred tiny fields, half changed | 306317432 | 7.02817606925964

hundred tiny fields, half nulled | 219931376 | 6.29952597618103

hundred tiny fields, half nulled | 221001240 | 6.34559392929077

hundred tiny fields, half nulled | 219933072 | 6.36759996414185

9 short and 1 long, short changed | 253761248 | 4.37235498428345

9 short and 1 long, short changed | 253763040 | 4.34973502159119

9 short and 1 long, short changed | 253760280 | 4.34902000427246

(27 rows)

wal_compression = on

testname | wal_generated | duration

-----------------------------------------+---------------+------------------

two short fields, no change | 420569264 | 18.1419389247894

two short fields, no change | 423401960 | 16.0569458007812

two short fields, no change | 420568240 | 15.9060699939728

two short fields, one changed | 420769880 | 15.4179458618164

two short fields, one changed | 420769768 | 15.8254570960999

two short fields, one changed | 420771760 | 15.7606999874115

two short fields, both changed | 464684816 | 15.6395478248596

two short fields, both changed | 460885392 | 16.4674611091614

two short fields, both changed | 460908256 | 16.5107719898224

one short and one long field, no change | 86536912 | 4.87007188796997

one short and one long field, no change | 85008896 | 4.87805414199829

one short and one long field, no change | 85016024 | 4.91748881340027

ten tiny fields, all changed | 461562256 | 16.7471029758453

ten tiny fields, all changed | 461924064 | 19.1157128810883

ten tiny fields, all changed | 461526872 | 18.746591091156

hundred tiny fields, all changed | 188909640 | 8.3099319934845

hundred tiny fields, all changed | 191173832 | 8.34689402580261

hundred tiny fields, all changed | 190272920 | 8.3138701915741

hundred tiny fields, half changed | 189411656 | 8.24592804908752

hundred tiny fields, half changed | 188907888 | 8.23570513725281

hundred tiny fields, half changed | 191874520 | 8.23411083221436

hundred tiny fields, half nulled | 106529504 | 7.44415497779846

hundred tiny fields, half nulled | 103855064 | 7.48734498023987

hundred tiny fields, half nulled | 103858984 | 7.45094799995422

9 short and 1 long, short changed | 210281512 | 6.79501819610596

9 short and 1 long, short changed | 210285808 | 6.79907608032227

9 short and 1 long, short changed | 211485728 | 6.79275107383728

(27 rows)

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

wal-update-testsuite.sh

Re: [REVIEW] Re: Compression of full-page-writes

From

Simon Riggs

Date:

13 December 2014, 11:04:42

On 12 December 2014 at 21:40, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 12, 2014 at 1:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> What I don't understand is why we aren't working on double buffering,
>> since that cost would be paid in a background process and would be
>> evenly spread out across a checkpoint. Plus we'd be able to remove
>> FPWs altogether, which is like 100% compression.
>
> The previous patch to implement that - by somebody at vmware - was an
> epic fail.  I'm not opposed to seeing somebody try again, but it's a
> tricky problem.  When the double buffer fills up, then you've got to
> finish flushing the pages whose images are stored in the buffer to
> disk before you can overwrite it, which acts like a kind of
> mini-checkpoint.  That problem might be solvable, but let's use this
> thread to discuss this patch, not some other patch that someone might
> have chosen to write but didn't.

No, I think its relevant.

WAL compression looks to me like a short term tweak, not the end game.

On that basis, we should go for simple and effective, user-settable
compression of FPWs and not spend too much Valuable Committer Time on
it.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

13 December 2014, 14:36:48

On Fri, Dec 12, 2014 at 11:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
> On Wed, Dec 10, 2014 at 11:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>
>> On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
>> > The tests ran for around 30 mins.Manual checkpoint was run before each
>> > test.
>> >
>> > Compression   WAL generated    %compression    Latency-avg   CPU usage
>> > (seconds)                                          TPS
>> > Latency
>> > stddev
>> >
>> >
>> > on                  1531.4 MB          ~35 %                  7.351 ms
>> >   user diff: 562.67s     system diff: 41.40s              135.96
>> >   13.759 ms
>> >
>> >
>> > off                  2373.1 MB                                     6.781
>> > ms
>> >       user diff: 354.20s      system diff: 39.67s            147.40
>> >   14.152 ms
>> >
>> > The compression obtained is quite high close to 35 %.
>> > CPU usage at user level when compression is on is quite noticeably high
>> > as
>> > compared to that when compression is off. But gain in terms of reduction
>> > of WAL
>> > is also high.
>>
>> I am sorry but I can't understand the above results due to wrapping.
>> Are you saying compression was twice as slow?
>
>
> I got curious to see how the compression of an entire record would perform
> and how it compares for small WAL records, and here are some numbers based
> on the patch attached, this patch compresses the whole record including the
> block headers, letting only XLogRecord out of it with a flag indicating that
> the record is compressed (note that this patch contains a portion for replay
> untested, still this patch gives an idea on how much compression of the
> whole record affects user CPU in this test case). It uses a buffer of 4 *
> BLCKSZ, if the record is longer than that compression is simply given up.
> Those tests are using the hack upthread calculating user and system CPU
> using getrusage() when a backend.
>
> Here is the simple test case I used with 512MB of shared_buffers and small
> records, filling up a bunch of buffers, dirtying them and them compressing
> FPWs with a checkpoint.
> #!/bin/bash
> psql <<EOF
> SELECT pg_backend_pid();
> CREATE TABLE aa (a int);
> CREATE TABLE results (phase text, position pg_lsn);
> CREATE EXTENSION IF NOT EXISTS pg_prewarm;
> ALTER TABLE aa SET (FILLFACTOR = 50);
> INSERT INTO results VALUES ('pre-insert', pg_current_xlog_location());
> INSERT INTO aa VALUES (generate_series(1,7000000)); -- 484MB
> SELECT pg_size_pretty(pg_relation_size('aa'::regclass));
> SELECT pg_prewarm('aa'::regclass);
> CHECKPOINT;
> INSERT INTO results VALUES ('pre-update', pg_current_xlog_location());
> UPDATE aa SET a = 7000000 + a;
> CHECKPOINT;
> INSERT INTO results VALUES ('post-update', pg_current_xlog_location());
> SELECT * FROM results;
> EOF
Re-using this test case, I have produced more results by changing the
fillfactor of the table:
=# select test || ', ffactor ' || ffactor, pg_size_pretty(post_update
- pre_update), user_diff, system_diff from results;
           ?column?            | pg_size_pretty | user_diff | system_diff
-------------------------------+----------------+-----------+-------------
 FPW on + 2 bytes, ffactor 50  | 582 MB         | 42.391894 |    0.807444
 FPW on + 2 bytes, ffactor 20  | 229 MB         | 14.330304 |    0.729626
 FPW on + 2 bytes, ffactor 10  | 117 MB         |  7.335442 |    0.570996
 FPW off + 2 bytes, ffactor 50 | 746 MB         | 25.330391 |    1.248503
 FPW off + 2 bytes, ffactor 20 | 293 MB         | 10.537475 |    0.755448
 FPW off + 2 bytes, ffactor 10 | 148 MB         |  5.762775 |    0.763761
 HEAD, ffactor 50              | 746 MB         | 25.181729 |    1.133433
 HEAD, ffactor 20              | 293 MB         |  9.962242 |    0.765970
 HEAD, ffactor 10              | 148 MB         |  5.693426 |    0.775371
 Record, ffactor 50            | 582 MB         | 54.904374 |    0.678204
 Record, ffactor 20            | 229 MB         | 19.798268 |    0.807220
 Record, ffactor 10            | 116 MB         |  9.401877 |    0.668454
(12 rows)

The following tests are run:
- "Record" means the record-level compression
- "HEAD" is postgres at 1c5c70df
- "FPW off" is HEAD + patch with switch set to off
- "FPW on" is HEAD + patch with switch set to on
The gain in compression has a linear profile with the length of page
hole. There was visibly some noise in the tests: you can see that the
CPU of "FPW off" is a bit higher than HEAD.

Something to be aware of btw is that this patch introduces an
additional 8 bytes per block image in WAL as it contains additional
information to control the compression. In this case this is the
uint16 compress_len present in XLogRecordBlockImageHeader. In the case
of the measurements done, knowing that 63638 FPWs have been written,
there is a difference of a bit less than 500k in WAL between HEAD and
"FPW off" in favor of HEAD. The gain with compression is welcome,
still for the default there is a small price to track down if a block
is compressed or not. This patch still takes advantage of it by not
compressing the hole present in page and reducing CPU work a bit.

Attached are as well updated patches, switching wal_compression to
USERSET and cleaning up things related to this switch from
PGC_POSTMASTER. I am attaching as well the results I got, feel free to
have a look.
Regards,
--
Michael

On Tue, Dec 16, 2014 at 8:35 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Dec 16, 2014 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Dec 13, 2014 at 9:36 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Something to be aware of btw is that this patch introduces an
>>> additional 8 bytes per block image in WAL as it contains additional
>>> information to control the compression. In this case this is the
>>> uint16 compress_len present in XLogRecordBlockImageHeader. In the case
>>> of the measurements done, knowing that 63638 FPWs have been written,
>>> there is a difference of a bit less than 500k in WAL between HEAD and
>>> "FPW off" in favor of HEAD. The gain with compression is welcome,
>>> still for the default there is a small price to track down if a block
>>> is compressed or not. This patch still takes advantage of it by not
>>> compressing the hole present in page and reducing CPU work a bit.
>>
>> That sounds like a pretty serious problem to me.
> OK. If that's so much a problem, I'll switch back to the version using
> 1 bit in the block header to identify if a block is compressed or not.
> This way, when switch will be off the record length will be the same
> as HEAD.
And here are attached fresh patches reducing the WAL record size to what it is in head when the compression switch is off. Looking at the logic in xlogrecord.h, the block header stores the hole length and hole offset. I changed that a bit to store the length of raw block, with hole or compressed as the 1st uint16. The second uint16 is used to store the hole offset, same as HEAD when compression switch is off. When compression is on, a special value 0xFFFF is saved (actually only filling 1 in the 16th bit is fine...). Note that this forces to fill in the hole with zeros and to compress always BLCKSZ worth of data.
Those patches pass make check-world, even WAL replay on standbys.

I have done as well measurements using this patch set, with the following things that can be noticed:
- When compression switch is off, the same quantity of WAL as HEAD is produced
- pglz is very bad at compressing page hole. I mean, really bad. Have a look at the user CPU particularly when pages are empty and you'll understand... Other compression algorithms would be better here. Tests are done with various values of fillfactor, 10 means that after the update 80% of the page is empty, at 50% the page is more or less completely full.

Here are the results, with 5 test cases:
- FPW on + 2 bytes, compression switch is on, using 2 additional bytes in block header, resulting in WAL records longer as 8 more bytes are used per block with lower CPU usage as page holes are not compressed by pglz.
- FPW off + 2 bytes, same as previous, with compression switch to on.
- FPW on + 0 bytes, compression switch to on, the same block header size as HEAD is used, at the cost of compressing page holes filled with zeros
- FPW on + 0 bytes, compression switch to off, same as previous
- HEAD, unpatched master (except with hack to calculate user and system CPU)
- Record, the record-level compression, with compression lower-bound set at 0.

=# select test || ', ffactor ' || ffactor, pg_size_pretty(post_update - pre_update), user_diff, system_diff from results;
?column? | pg_size_pretty | user_diff | system_diff
-------------------------------+----------------+-----------+-------------
FPW on + 2 bytes, ffactor 50 | 582 MB | 42.391894 | 0.807444
FPW on + 2 bytes, ffactor 20 | 229 MB | 14.330304 | 0.729626
FPW on + 2 bytes, ffactor 10 | 117 MB | 7.335442 | 0.570996
FPW off + 2 bytes, ffactor 50 | 746 MB | 25.330391 | 1.248503
FPW off + 2 bytes, ffactor 20 | 293 MB | 10.537475 | 0.755448
FPW off + 2 bytes, ffactor 10 | 148 MB | 5.762775 | 0.763761
FPW on + 0 bytes, ffactor 50 | 585 MB | 54.115496 | 0.924891
FPW on + 0 bytes, ffactor 20 | 234 MB | 26.270404 | 0.755862
FPW on + 0 bytes, ffactor 10 | 122 MB | 19.540131 | 0.800981
FPW off + 0 bytes, ffactor 50 | 746 MB | 25.102241 | 1.110677
FPW off + 0 bytes, ffactor 20 | 293 MB | 9.889374 | 0.749884
FPW off + 0 bytes, ffactor 10 | 148 MB | 5.286767 | 0.682746
HEAD, ffactor 50 | 746 MB | 25.181729 | 1.133433
HEAD, ffactor 20 | 293 MB | 9.962242 | 0.765970
HEAD, ffactor 10 | 148 MB | 5.693426 | 0.775371
Record, ffactor 50 | 582 MB | 54.904374 | 0.678204
Record, ffactor 20 | 229 MB | 19.798268 | 0.807220
Record, ffactor 10 | 116 MB | 9.401877 | 0.668454
(18 rows)

Attached are as well the results of the measurements, and the test case used.
Regards,
--
Michael

Attachment

Re: [REVIEW] Re: Compression of full-page-writes

From

Alvaro Herrera

Date:

16 December 2014, 14:24:47

Michael Paquier wrote:

> And here are attached fresh patches reducing the WAL record size to what it
> is in head when the compression switch is off. Looking at the logic in
> xlogrecord.h, the block header stores the hole length and hole offset. I
> changed that a bit to store the length of raw block, with hole or
> compressed as the 1st uint16. The second uint16 is used to store the hole
> offset, same as HEAD when compression switch is off. When compression is
> on, a special value 0xFFFF is saved (actually only filling 1 in the 16th
> bit is fine...). Note that this forces to fill in the hole with zeros and
> to compress always BLCKSZ worth of data.

Why do we compress the hole?  This seems pointless, considering that we
know it's all zeroes.  Is it possible to compress the head and tail of
page separately?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

16 December 2014, 14:30:45

On Tue, Dec 16, 2014 at 11:24 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Michael Paquier wrote:

> And here are attached fresh patches reducing the WAL record size to what it
> is in head when the compression switch is off. Looking at the logic in
> xlogrecord.h, the block header stores the hole length and hole offset. I
> changed that a bit to store the length of raw block, with hole or
> compressed as the 1st uint16. The second uint16 is used to store the hole
> offset, same as HEAD when compression switch is off. When compression is
> on, a special value 0xFFFF is saved (actually only filling 1 in the 16th
> bit is fine...). Note that this forces to fill in the hole with zeros and
> to compress always BLCKSZ worth of data.

Why do we compress the hole? This seems pointless, considering that we
know it's all zeroes. Is it possible to compress the head and tail of
page separately?

This would take 2 additional bytes at minimum in the block header, resulting in 8 additional bytes in record each time a FPW shows up. IMO it is important to check the length of things obtained when replaying WAL, that's something the current code of HEAD does quite well.

Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

16 December 2014, 15:00:32

On Tue, Dec 16, 2014 at 11:30 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Tue, Dec 16, 2014 at 11:24 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Michael Paquier wrote:

> And here are attached fresh patches reducing the WAL record size to what it
> is in head when the compression switch is off. Looking at the logic in
> xlogrecord.h, the block header stores the hole length and hole offset. I
> changed that a bit to store the length of raw block, with hole or
> compressed as the 1st uint16. The second uint16 is used to store the hole
> offset, same as HEAD when compression switch is off. When compression is
> on, a special value 0xFFFF is saved (actually only filling 1 in the 16th
> bit is fine...). Note that this forces to fill in the hole with zeros and
> to compress always BLCKSZ worth of data.

Why do we compress the hole? This seems pointless, considering that we
know it's all zeroes. Is it possible to compress the head and tail of
page separately?
This would take 2 additional bytes at minimum in the block header, resulting in 8 additional bytes in record each time a FPW shows up. IMO it is important to check the length of things obtained when replaying WAL, that's something the current code of HEAD does quite well.

Actually, the original length of the compressed block in saved in PGLZ_Header, so if we are fine to not check the size of the block decompressed when decoding WAL we can do without the hole filled with zeros, and use only 1 bit to see if the block is compressed or not.
--

Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Merlin Moncure

Date:

16 December 2014, 15:12:57

On Mon, Dec 15, 2014 at 5:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Dec 16, 2014 at 5:14 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> OTOH, Our built in compressor as we all know is a complete dog in
>> terms of cpu when stacked up against some more modern implementations.
>> All that said, as long as there is a clean path to migrating to
>> another compression alg should one materialize, that problem can be
>> nicely decoupled from this patch as Robert pointed out.
> I am curious to see some numbers about that. Has anyone done such
> comparison measurements?

I don't, but I can make some.  There are some numbers on the web but
it's better to make some new ones because IIRC some light optimization
had gone into plgz of late.

Compressing *one* file with lz4 and a quick/n/dirty plgz i hacked out
of the source (borrowing heavily from
https://github.com/maropu/pglz_bench/blob/master/pglz_bench.cpp),  I
tested the results:

lz4 real time:  0m0.032s
pglz real time: 0m0.281s

mmoncure@mernix2 ~/src/lz4/lz4-r125 $ ls -lh test.*
-rw-r--r-- 1 mmoncure mmoncure 2.7M Dec 16 09:04 test.lz4
-rw-r--r-- 1 mmoncure mmoncure 2.5M Dec 16 09:01 test.pglz

A better test would examine all manner of different xlog files in a
fashion closer to how your patch would need to compress them but the
numbers here tell a fairly compelling story: similar compression
results for around 9x the cpu usage.  Be advised that compression alg
selection is one of those types of discussions that tends to spin off
into outer space; that's not something you have to solve today.  Just
try and make things so that they can be switched out if things
change....

merlin

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

16 December 2014, 16:34:47

On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

Actually, the original length of the compressed block in saved in PGLZ_Header, so if we are fine to not check the size of the block decompressed when decoding WAL we can do without the hole filled with zeros, and use only 1 bit to see if the block is compressed or not.

And.. After some more hacking, I have been able to come up with a patch that is able to compress blocks without the page hole, and that keeps the WAL record length the same as HEAD when compression switch is off. The numbers are pretty good, CPU is saved in the same proportions as previous patches when compression is enabled, and there is zero delta with HEAD when compression switch is off.

Here are the actual numbers:
           test_name           | pg_size_pretty | user_diff | system_diff
-------------------------------+----------------+-----------+-------------
FPW on + 2 bytes, ffactor 50 | 582 MB         | 42.391894 |    0.807444
FPW on + 2 bytes, ffactor 20 | 229 MB         | 14.330304 |    0.729626
FPW on + 2 bytes, ffactor 10 | 117 MB         | 7.335442 |    0.570996
FPW off + 2 bytes, ffactor 50 | 746 MB         | 25.330391 |    1.248503
FPW off + 2 bytes, ffactor 20 | 293 MB         | 10.537475 |    0.755448
FPW off + 2 bytes, ffactor 10 | 148 MB         | 5.762775 |    0.763761
FPW on + 0 bytes, ffactor 50 | 582 MB         | 42.174297 |    0.790596
FPW on + 0 bytes, ffactor 20 | 229 MB         | 14.424233 |    0.770459
FPW on + 0 bytes, ffactor 10 | 117 MB         | 7.057195 |    0.584806
FPW off + 0 bytes, ffactor 50 | 746 MB         | 25.261998 |    1.054516
FPW off + 0 bytes, ffactor 20 | 293 MB         | 10.589888 |    0.860207
FPW off + 0 bytes, ffactor 10 | 148 MB         | 5.827191 |    0.874285
HEAD, ffactor 50              | 746 MB         | 25.181729 |    1.133433
HEAD, ffactor 20              | 293 MB         | 9.962242 |    0.765970
HEAD, ffactor 10              | 148 MB         | 5.693426 |    0.775371
Record, ffactor 50            | 582 MB         | 54.904374 |    0.678204
Record, ffactor 20            | 229 MB         | 19.798268 |    0.807220
Record, ffactor 10            | 116 MB         | 9.401877 |    0.668454
(18 rows)

The new tests of this patch are "FPW off + 0 bytes". Patches as well as results are attached.

Regards,

Michael

Attachment

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

17 December 2014, 00:34:31

On Wed, Dec 17, 2014 at 12:12 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

Compressing *one* file with lz4 and a quick/n/dirty plgz i hacked out
of the source (borrowing heavily from
https://github.com/maropu/pglz_bench/blob/master/pglz_bench.cpp), I
tested the results:

lz4 real time: 0m0.032s
pglz real time: 0m0.281s

mmoncure@mernix2 ~/src/lz4/lz4-r125 $ ls -lh test.*
-rw-r--r-- 1 mmoncure mmoncure 2.7M Dec 16 09:04 test.lz4
-rw-r--r-- 1 mmoncure mmoncure 2.5M Dec 16 09:01 test.pglz

A better test would examine all manner of different xlog files in a
fashion closer to how your patch would need to compress them but the
numbers here tell a fairly compelling story: similar compression
results for around 9x the cpu usage.

Yes that could be better... Thanks for some real numbers that's really informative.

Be advised that compression alg
selection is one of those types of discussions that tends to spin off
into outer space; that's not something you have to solve today. Just
try and make things so that they can be switched out if things
change....

One way to get around that would be a set of hooks to allow people to set up the compression algorithm they want:

- One for buffer compression

- One for buffer decompression

- Perhaps one to set up the size of the buffer used for compression/decompression scratch buffer. In the case of pglz, this needs for example to be PGLZ_MAX_OUTPUT(buffer_size). The part actually tricky is that as xlogreader.c is used by pg_xlogdump, we cannot directly use a hook in it, but we may as well be able to live with a fixed maximum size like BLCKSZ * 2 by default, this would let largely enough room for all kinds of compression algos IMO...

Regards,

Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

17 December 2014, 14:33:25

Hello,

>Patches as well as results are attached.

I had a look at code. I have few minor points,

+ bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;

+ if (is_compressed)

{

- rdt_datas_last->data = page;

- rdt_datas_last->len = BLCKSZ;

+ /* compressed block information */

+ bimg.length = compress_len;

+ bimg.extra_data = hole_offset;

+ bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;

For consistency with the existing code , how about renaming the macro XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of BKPBLOCK_HAS_IMAGE.

+ blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;

Here , I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be more indicative of the fact that lower 15 bits of extra_data field comprises of hole_offset value. This suggestion is also just to achieve consistency with the existing BKPBLOCK_FORK_MASK for fork_flags field.

And comment typo

+ * First try to compress block, filling in the page hole with zeros

+ * to improve the compression of the whole. If the block is considered

+ * as incompressible, complete the block header information as if

+ * nothing happened.

As hole is no longer being compressed, this needs to be changed.

Thank you,

Rahila Syed

On Tue, Dec 16, 2014 at 10:04 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
Actually, the original length of the compressed block in saved in PGLZ_Header, so if we are fine to not check the size of the block decompressed when decoding WAL we can do without the hole filled with zeros, and use only 1 bit to see if the block is compressed or not.
And.. After some more hacking, I have been able to come up with a patch that is able to compress blocks without the page hole, and that keeps the WAL record length the same as HEAD when compression switch is off. The numbers are pretty good, CPU is saved in the same proportions as previous patches when compression is enabled, and there is zero delta with HEAD when compression switch is off.

Here are the actual numbers:
           test_name           | pg_size_pretty | user_diff | system_diff
-------------------------------+----------------+-----------+-------------
FPW on + 2 bytes, ffactor 50 | 582 MB         | 42.391894 |    0.807444
FPW on + 2 bytes, ffactor 20 | 229 MB         | 14.330304 |    0.729626
FPW on + 2 bytes, ffactor 10 | 117 MB         | 7.335442 |    0.570996
FPW off + 2 bytes, ffactor 50 | 746 MB         | 25.330391 |    1.248503
FPW off + 2 bytes, ffactor 20 | 293 MB         | 10.537475 |    0.755448
FPW off + 2 bytes, ffactor 10 | 148 MB         | 5.762775 |    0.763761
FPW on + 0 bytes, ffactor 50 | 582 MB         | 42.174297 |    0.790596
FPW on + 0 bytes, ffactor 20 | 229 MB         | 14.424233 |    0.770459
FPW on + 0 bytes, ffactor 10 | 117 MB         | 7.057195 |    0.584806
FPW off + 0 bytes, ffactor 50 | 746 MB         | 25.261998 |    1.054516
FPW off + 0 bytes, ffactor 20 | 293 MB         | 10.589888 |    0.860207
FPW off + 0 bytes, ffactor 10 | 148 MB         | 5.827191 |    0.874285
HEAD, ffactor 50              | 746 MB         | 25.181729 |    1.133433
HEAD, ffactor 20              | 293 MB         | 9.962242 |    0.765970
HEAD, ffactor 10              | 148 MB         | 5.693426 |    0.775371
Record, ffactor 50            | 582 MB         | 54.904374 |    0.678204
Record, ffactor 20            | 229 MB         | 19.798268 |    0.807220
Record, ffactor 10            | 116 MB         | 9.401877 |    0.668454
(18 rows)

The new tests of this patch are "FPW off + 0 bytes". Patches as well as results are attached.
Regards,
--
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

18 December 2014, 04:05:49

On Wed, Dec 17, 2014 at 1:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
> On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>
>> Actually, the original length of the compressed block in saved in
>> PGLZ_Header, so if we are fine to not check the size of the block
>> decompressed when decoding WAL we can do without the hole filled with zeros,
>> and use only 1 bit to see if the block is compressed or not.
>
> And.. After some more hacking, I have been able to come up with a patch that
> is able to compress blocks without the page hole, and that keeps the WAL
> record length the same as HEAD when compression switch is off. The numbers
> are pretty good, CPU is saved in the same proportions as previous patches
> when compression is enabled, and there is zero delta with HEAD when
> compression switch is off.
>
> Here are the actual numbers:
>            test_name           | pg_size_pretty | user_diff | system_diff
> -------------------------------+----------------+-----------+-------------
>  FPW on + 2 bytes, ffactor 50  | 582 MB         | 42.391894 |    0.807444
>  FPW on + 2 bytes, ffactor 20  | 229 MB         | 14.330304 |    0.729626
>  FPW on + 2 bytes, ffactor 10  | 117 MB         |  7.335442 |    0.570996
>  FPW off + 2 bytes, ffactor 50 | 746 MB         | 25.330391 |    1.248503
>  FPW off + 2 bytes, ffactor 20 | 293 MB         | 10.537475 |    0.755448
>  FPW off + 2 bytes, ffactor 10 | 148 MB         |  5.762775 |    0.763761
>  FPW on + 0 bytes, ffactor 50  | 582 MB         | 42.174297 |    0.790596
>  FPW on + 0 bytes, ffactor 20  | 229 MB         | 14.424233 |    0.770459
>  FPW on + 0 bytes, ffactor 10  | 117 MB         |  7.057195 |    0.584806
>  FPW off + 0 bytes, ffactor 50 | 746 MB         | 25.261998 |    1.054516
>  FPW off + 0 bytes, ffactor 20 | 293 MB         | 10.589888 |    0.860207
>  FPW off + 0 bytes, ffactor 10 | 148 MB         |  5.827191 |    0.874285
>  HEAD, ffactor 50              | 746 MB         | 25.181729 |    1.133433
>  HEAD, ffactor 20              | 293 MB         |  9.962242 |    0.765970
>  HEAD, ffactor 10              | 148 MB         |  5.693426 |    0.775371
>  Record, ffactor 50            | 582 MB         | 54.904374 |    0.678204
>  Record, ffactor 20            | 229 MB         | 19.798268 |    0.807220
>  Record, ffactor 10            | 116 MB         |  9.401877 |    0.668454
> (18 rows)
>
> The new tests of this patch are "FPW off + 0 bytes". Patches as well as
> results are attached.

I think that neither pg_control nor xl_parameter_change need to have the info
about WAL compression because each backup block has that entry.

Will review the remaining part later.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

18 December 2014, 04:12:11

On Thu, Dec 18, 2014 at 1:05 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Dec 17, 2014 at 1:34 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> I think that neither pg_control nor xl_parameter_change need to have the info
> about WAL compression because each backup block has that entry.
>
> Will review the remaining part later.
I got into wondering the utility of this part earlier this morning as
that's some remnant of when wal_compression was set as PGC_POSTMASTER.
Will remove.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

18 December 2014, 05:21:19

On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:

I had a look at code. I have few minor points,

Thanks!

+ bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
+
+ if (is_compressed)
{
- rdt_datas_last->data = page;
- rdt_datas_last->len = BLCKSZ;
+ /* compressed block information */
+ bimg.length = compress_len;
+ bimg.extra_data = hole_offset;
+ bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;

For consistency with the existing code , how about renaming the macro XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of BKPBLOCK_HAS_IMAGE.

OK, why not...

+ blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;
Here , I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be more indicative of the fact that lower 15 bits of extra_data field comprises of hole_offset value. This suggestion is also just to achieve consistency with the existing BKPBLOCK_FORK_MASK for fork_flags field.

Yeah that seems clearer, let's define it as ~XLR_BLCK_COMPRESSED_MASK though.

And comment typo
+ * First try to compress block, filling in the page hole with zeros
+ * to improve the compression of the whole. If the block is considered
+ * as incompressible, complete the block header information as if
+ * nothing happened.

As hole is no longer being compressed, this needs to be changed.

Fixed. As well as an additional comment block down.

A couple of things noticed on the fly:

- Fixed pg_xlogdump being not completely correct to report the FPW information

- A couple of typos and malformed sentences fixed

- Added an assertion to check that the hole offset value does not the bit used for compression status

- Reworked docs, mentioning as well that wal_compression is off by default.

- Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by Fujii-san)

Regards,

Michael

Attachment

20141218_fpw_compression_v8.tar.gz

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

18 December 2014, 08:27:09

On Thu, Dec 18, 2014 at 2:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
> On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed <rahilasyed90@gmail.com>
> wrote:
>>
>> I had a look at code. I have few minor points,
>
> Thanks!
>
>> +           bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
>> +
>> +           if (is_compressed)
>>             {
>> -               rdt_datas_last->data = page;
>> -               rdt_datas_last->len = BLCKSZ;
>> +               /* compressed block information */
>> +               bimg.length = compress_len;
>> +               bimg.extra_data = hole_offset;
>> +               bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;
>>
>> For consistency with the existing code , how about renaming the macro
>> XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of
>> BKPBLOCK_HAS_IMAGE.
>
> OK, why not...
>
>>
>> +               blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;
>> Here , I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be
>> more indicative of the fact that lower 15 bits of extra_data field comprises
>> of hole_offset value. This suggestion is also just to achieve consistency
>> with the existing BKPBLOCK_FORK_MASK for fork_flags field.
>
> Yeah that seems clearer, let's define it as ~XLR_BLCK_COMPRESSED_MASK
> though.
>
>> And comment typo
>> +            * First try to compress block, filling in the page hole with
>> zeros
>> +            * to improve the compression of the whole. If the block is
>> considered
>> +            * as incompressible, complete the block header information as
>> if
>> +            * nothing happened.
>>
>> As hole is no longer being compressed, this needs to be changed.
>
> Fixed. As well as an additional comment block down.
>
> A couple of things noticed on the fly:
> - Fixed pg_xlogdump being not completely correct to report the FPW
> information
> - A couple of typos and malformed sentences fixed
> - Added an assertion to check that the hole offset value does not the bit
> used for compression status
> - Reworked docs, mentioning as well that wal_compression is off by default.
> - Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by
> Fujii-san)

Thanks!

+                else
+                    memcpy(compression_scratch, page, page_len);

I don't think the block image needs to be copied to scratch buffer here.
We can try to compress the "page" directly.

+#include "utils/pg_lzcompress.h"#include "utils/memutils.h"

pg_lzcompress.h should be after meutils.h.

+/* Scratch buffer used to store block image to-be-compressed */
+static char compression_scratch[PGLZ_MAX_BLCKSZ];

Isn't it better to allocate the memory for compression_scratch in
InitXLogInsert()
like hdr_scratch?

+        uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));

Why don't we allocate the buffer for uncompressed page only once and
keep reusing it like XLogReaderState->readBuf? The size of uncompressed
page is at most BLCKSZ, so we can allocate the memory for it even before
knowing the real size of each block image.

-                printf(" (FPW); hole: offset: %u, length: %u\n",
-                       record->blocks[block_id].hole_offset,
-                       record->blocks[block_id].hole_length);
+                if (record->blocks[block_id].is_compressed)
+                    printf(" (FPW); hole offset: %u, compressed length %u\n",
+                           record->blocks[block_id].hole_offset,
+                           record->blocks[block_id].bkp_len);
+                else
+                    printf(" (FPW); hole offset: %u, length: %u\n",
+                           record->blocks[block_id].hole_offset,
+                           record->blocks[block_id].bkp_len);

We need to consider what info about FPW we want pg_xlogdump to report.
I'd like to calculate how much bytes FPW was compressed, from the report
of pg_xlogdump. So I'd like to see also the both length of uncompressed FPW
and that of compressed one in the report.

In pg_config.h, the comment of BLCKSZ needs to be updated? Because
the maximum size of BLCKSZ can be affected by not only itemid but also
XLogRecordBlockImageHeader.
    bool        has_image;
+    bool        is_compressed;

Doesn't ResetDecoder need to reset is_compressed?

+#wal_compression = off            # enable compression of full-page writes

Currently wal_compression compresses only FPW, so isn't it better to place
it after full_page_writes in postgresql.conf.sample?

+    uint16        extra_data;    /* used to store offset of bytes in
"hole", with
+                             * last free bit used to check if block is
+                             * compressed */

At least to me, defining something like the following seems more easy to
read.
   uint16    hole_offset:15,                   is_compressed:1

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

18 December 2014, 10:32:05

>Isn't it better to allocate the memory for compression_scratch in
>InitXLogInsert()
>like hdr_scratch?

I think making compression_scratch a statically allocated global variable is the result of following discussion earlier,

http://www.postgresql.org/message-id/CA+TgmoazNBuwnLS4bpwyqgqteEznOAvy7KWdBm0A2-tBARn_aQ@mail.gmail.com

Thank you,

Rahila Syed

On Thu, Dec 18, 2014 at 1:57 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Dec 18, 2014 at 2:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
> On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed <rahilasyed90@gmail.com>
> wrote:
>>
>> I had a look at code. I have few minor points,
>
> Thanks!
>
>> + bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
>> +
>> + if (is_compressed)
>> {
>> - rdt_datas_last->data = page;
>> - rdt_datas_last->len = BLCKSZ;
>> + /* compressed block information */
>> + bimg.length = compress_len;
>> + bimg.extra_data = hole_offset;
>> + bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;
>>
>> For consistency with the existing code , how about renaming the macro
>> XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of
>> BKPBLOCK_HAS_IMAGE.
>
> OK, why not...
>
>>
>> + blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;
>> Here , I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be
>> more indicative of the fact that lower 15 bits of extra_data field comprises
>> of hole_offset value. This suggestion is also just to achieve consistency
>> with the existing BKPBLOCK_FORK_MASK for fork_flags field.
>
> Yeah that seems clearer, let's define it as ~XLR_BLCK_COMPRESSED_MASK
> though.
>
>> And comment typo
>> + * First try to compress block, filling in the page hole with
>> zeros
>> + * to improve the compression of the whole. If the block is
>> considered
>> + * as incompressible, complete the block header information as
>> if
>> + * nothing happened.
>>
>> As hole is no longer being compressed, this needs to be changed.
>
> Fixed. As well as an additional comment block down.
>
> A couple of things noticed on the fly:
> - Fixed pg_xlogdump being not completely correct to report the FPW
> information
> - A couple of typos and malformed sentences fixed
> - Added an assertion to check that the hole offset value does not the bit
> used for compression status
> - Reworked docs, mentioning as well that wal_compression is off by default.
> - Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by
> Fujii-san)

Thanks!

+ else
+ memcpy(compression_scratch, page, page_len);

I don't think the block image needs to be copied to scratch buffer here.
We can try to compress the "page" directly.

+#include "utils/pg_lzcompress.h"
#include "utils/memutils.h"

pg_lzcompress.h should be after meutils.h.

+/* Scratch buffer used to store block image to-be-compressed */
+static char compression_scratch[PGLZ_MAX_BLCKSZ];

Isn't it better to allocate the memory for compression_scratch in
InitXLogInsert()
like hdr_scratch?

+ uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));

Why don't we allocate the buffer for uncompressed page only once and
keep reusing it like XLogReaderState->readBuf? The size of uncompressed
page is at most BLCKSZ, so we can allocate the memory for it even before
knowing the real size of each block image.

- printf(" (FPW); hole: offset: %u, length: %u\n",
- record->blocks[block_id].hole_offset,
- record->blocks[block_id].hole_length);
+ if (record->blocks[block_id].is_compressed)
+ printf(" (FPW); hole offset: %u, compressed length %u\n",
+ record->blocks[block_id].hole_offset,
+ record->blocks[block_id].bkp_len);
+ else
+ printf(" (FPW); hole offset: %u, length: %u\n",
+ record->blocks[block_id].hole_offset,
+ record->blocks[block_id].bkp_len);

We need to consider what info about FPW we want pg_xlogdump to report.
I'd like to calculate how much bytes FPW was compressed, from the report
of pg_xlogdump. So I'd like to see also the both length of uncompressed FPW
and that of compressed one in the report.

In pg_config.h, the comment of BLCKSZ needs to be updated? Because
the maximum size of BLCKSZ can be affected by not only itemid but also
XLogRecordBlockImageHeader.

bool has_image;
+ bool is_compressed;

Doesn't ResetDecoder need to reset is_compressed?

+#wal_compression = off # enable compression of full-page writes

Currently wal_compression compresses only FPW, so isn't it better to place
it after full_page_writes in postgresql.conf.sample?

+ uint16 extra_data; /* used to store offset of bytes in
"hole", with
+ * last free bit used to check if block is
+ * compressed */

At least to me, defining something like the following seems more easy to
read.

uint16 hole_offset:15,
is_compressed:1

Regards,

--
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

18 December 2014, 10:39:23

On Thu, Dec 18, 2014 at 7:31 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>Isn't it better to allocate the memory for compression_scratch in
>>InitXLogInsert()
>>like hdr_scratch?
>
> I think making compression_scratch a statically allocated global variable
> is the result of  following discussion earlier,
>
> http://www.postgresql.org/message-id/CA+TgmoazNBuwnLS4bpwyqgqteEznOAvy7KWdBm0A2-tBARn_aQ@mail.gmail.com
   /*    * Permanently allocate readBuf.  We do it this way, rather than just    * making a static array, for two
reasons:(1) no need to waste the    * storage in most instantiations of the backend; (2) a static char array    * isn't
guaranteedto have any particular alignment, whereas palloc()    * will provide MAXALIGN'd storage.    */
 

The above source code comment in XLogReaderAllocate() makes me think that
it's better to avoid using a static array. The point (1) seems less important in
this case because most processes need the buffer for WAL compression,
though.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

18 December 2014, 10:40:48

On Thu, Dec 18, 2014 at 7:31 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>Isn't it better to allocate the memory for compression_scratch in
>>InitXLogInsert()
>>like hdr_scratch?
>
> I think making compression_scratch a statically allocated global variable
> is the result of  following discussion earlier,
> http://www.postgresql.org/message-id/CA+TgmoazNBuwnLS4bpwyqgqteEznOAvy7KWdBm0A2-tBARn_aQ@mail.gmail.com
Yep, in this case the OS does not request this memory as long as it is
not touched, like when wal_compression is off all the time in the
backend. Robert mentioned that upthread.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

18 December 2014, 15:19:38

On Thu, Dec 18, 2014 at 5:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Thanks!
Thanks for your input.

> +                else
> +                    memcpy(compression_scratch, page, page_len);
>
> I don't think the block image needs to be copied to scratch buffer here.
> We can try to compress the "page" directly.
Check.

> +#include "utils/pg_lzcompress.h"
>  #include "utils/memutils.h"
>
> pg_lzcompress.h should be after meutils.h.
Oops.

> +/* Scratch buffer used to store block image to-be-compressed */
> +static char compression_scratch[PGLZ_MAX_BLCKSZ];
>
> Isn't it better to allocate the memory for compression_scratch in
> InitXLogInsert()
> like hdr_scratch?
Because the OS would not touch it if wal_compression is never used,
but now that you mention it, it may be better to get that in the
context of xlog_insert..

> +        uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));
>
> Why don't we allocate the buffer for uncompressed page only once and
> keep reusing it like XLogReaderState->readBuf? The size of uncompressed
> page is at most BLCKSZ, so we can allocate the memory for it even before
> knowing the real size of each block image.
OK, this would save some cycles. I was trying to make process allocate
a minimum of memory only when necessary.

> -                printf(" (FPW); hole: offset: %u, length: %u\n",
> -                       record->blocks[block_id].hole_offset,
> -                       record->blocks[block_id].hole_length);
> +                if (record->blocks[block_id].is_compressed)
> +                    printf(" (FPW); hole offset: %u, compressed length %u\n",
> +                           record->blocks[block_id].hole_offset,
> +                           record->blocks[block_id].bkp_len);
> +                else
> +                    printf(" (FPW); hole offset: %u, length: %u\n",
> +                           record->blocks[block_id].hole_offset,
> +                           record->blocks[block_id].bkp_len);
>
> We need to consider what info about FPW we want pg_xlogdump to report.
> I'd like to calculate how much bytes FPW was compressed, from the report
> of pg_xlogdump. So I'd like to see also the both length of uncompressed FPW
> and that of compressed one in the report.
OK, so let's add a parameter in the decoder for the uncompressed
length. Sounds fine?

> In pg_config.h, the comment of BLCKSZ needs to be updated? Because
> the maximum size of BLCKSZ can be affected by not only itemid but also
> XLogRecordBlockImageHeader.
Check.

>      bool        has_image;
> +    bool        is_compressed;
>
> Doesn't ResetDecoder need to reset is_compressed?
Check.

> +#wal_compression = off            # enable compression of full-page writes
> Currently wal_compression compresses only FPW, so isn't it better to place
> it after full_page_writes in postgresql.conf.sample?
Check.

> +    uint16        extra_data;    /* used to store offset of bytes in
> "hole", with
> +                             * last free bit used to check if block is
> +                             * compressed */
> At least to me, defining something like the following seems more easy to
> read.
>     uint16    hole_offset:15,
>                     is_compressed:1
Check++.

Updated patches addressing all those things are attached.
Regards,
--
Michael

Attachment

20141219_fpw_compression_v9.tar.gz

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

24 December 2014, 04:28:55

Hello,

>Updated patches addressing all those things are attached.

Below are some performance numbers using latest patch 20141219_fpw_compression_v9 . Compression looks promising with reduced impact on CPU usage, tps and runtime.

pgbench command : pgbench -c 16 -j 16 -r -t 250000 -M prepared

To ensure that data is not highly compressible, empty filler columns were altered using

alter table pgbench_accounts alter column filler type text using

gen_random_uuid()::text

checkpoint_segments = 1024

checkpoint_timeout = 5min

fsync = on

Compression On Off

WAL generated 24558983188(~24.56 GB) 35931217248 (~ 35.93GB)

Runtime 5987.0 s 5825.0 s

TPS tps = 668.05 tps = 686.69

Latency average 23.935 ms 23.211 ms

Latency stddev 80.619 ms 80.141 ms

CPU usage(user) 916.64s 614.76s

CPU usage(system) 54.96s 64.14s

IO(average writes to disk) 10.43 MB . 12.5 MB

IO(total writes to disk) 64268.94 MB 72920 MB

Reduction in WAL is around 32 %. Reduction in total IO writes to disk is around 12%. Impact on runtime , tps and latency is very less.

CPU usage when compression is on is increased by 49% which is lesser as compared to earlier measurements.

Server specifications:

Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,

Rahila Syed

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

24 December 2014, 11:44:27

On Fri, Dec 19, 2014 at 12:19 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Dec 18, 2014 at 5:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Thanks!
> Thanks for your input.
>
>> +                else
>> +                    memcpy(compression_scratch, page, page_len);
>>
>> I don't think the block image needs to be copied to scratch buffer here.
>> We can try to compress the "page" directly.
> Check.
>
>> +#include "utils/pg_lzcompress.h"
>>  #include "utils/memutils.h"
>>
>> pg_lzcompress.h should be after meutils.h.
> Oops.
>
>> +/* Scratch buffer used to store block image to-be-compressed */
>> +static char compression_scratch[PGLZ_MAX_BLCKSZ];
>>
>> Isn't it better to allocate the memory for compression_scratch in
>> InitXLogInsert()
>> like hdr_scratch?
> Because the OS would not touch it if wal_compression is never used,
> but now that you mention it, it may be better to get that in the
> context of xlog_insert..
>
>> +        uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));
>>
>> Why don't we allocate the buffer for uncompressed page only once and
>> keep reusing it like XLogReaderState->readBuf? The size of uncompressed
>> page is at most BLCKSZ, so we can allocate the memory for it even before
>> knowing the real size of each block image.
> OK, this would save some cycles. I was trying to make process allocate
> a minimum of memory only when necessary.
>
>> -                printf(" (FPW); hole: offset: %u, length: %u\n",
>> -                       record->blocks[block_id].hole_offset,
>> -                       record->blocks[block_id].hole_length);
>> +                if (record->blocks[block_id].is_compressed)
>> +                    printf(" (FPW); hole offset: %u, compressed length %u\n",
>> +                           record->blocks[block_id].hole_offset,
>> +                           record->blocks[block_id].bkp_len);
>> +                else
>> +                    printf(" (FPW); hole offset: %u, length: %u\n",
>> +                           record->blocks[block_id].hole_offset,
>> +                           record->blocks[block_id].bkp_len);
>>
>> We need to consider what info about FPW we want pg_xlogdump to report.
>> I'd like to calculate how much bytes FPW was compressed, from the report
>> of pg_xlogdump. So I'd like to see also the both length of uncompressed FPW
>> and that of compressed one in the report.
> OK, so let's add a parameter in the decoder for the uncompressed
> length. Sounds fine?
>
>> In pg_config.h, the comment of BLCKSZ needs to be updated? Because
>> the maximum size of BLCKSZ can be affected by not only itemid but also
>> XLogRecordBlockImageHeader.
> Check.
>
>>      bool        has_image;
>> +    bool        is_compressed;
>>
>> Doesn't ResetDecoder need to reset is_compressed?
> Check.
>
>> +#wal_compression = off            # enable compression of full-page writes
>> Currently wal_compression compresses only FPW, so isn't it better to place
>> it after full_page_writes in postgresql.conf.sample?
> Check.
>
>> +    uint16        extra_data;    /* used to store offset of bytes in
>> "hole", with
>> +                             * last free bit used to check if block is
>> +                             * compressed */
>> At least to me, defining something like the following seems more easy to
>> read.
>>     uint16    hole_offset:15,
>>                     is_compressed:1
> Check++.
>
> Updated patches addressing all those things are attached.

Thanks for updating the patch!

Firstly I'm thinking to commit the
0001-Move-pg_lzcompress.c-to-src-common.patch.

pg_lzcompress.h still exists in include/utils, but it should be moved to
include/common?

Do we really need PGLZ_Status? I'm not sure whether your categorization of
the result status of compress/decompress functions is right or not. For example,
pglz_decompress() can return PGLZ_INCOMPRESSIBLE status, but which seems
invalid logically... Maybe this needs to be revisited when we introduce other
compression algorithms and create the wrapper function for those compression
and decompression functions. Anyway making pg_lzdecompress return
the boolean value seems enough.

I updated 0001-Move-pg_lzcompress.c-to-src-common.patch accordingly.
Barring objections, I will push the attached patch firstly.

Regards,

--
Fujii Masao

Attachment

0001-Move-pg_lzcompress.c-to-src-common.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

24 December 2014, 12:03:34

On Wed, Dec 24, 2014 at 8:44 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Dec 19, 2014 at 12:19 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> Firstly I'm thinking to commit the
> 0001-Move-pg_lzcompress.c-to-src-common.patch.
>
> pg_lzcompress.h still exists in include/utils, but it should be moved to
> include/common?
You are right. This is a remnant of first version of this patch where
pglz was added in port/ and not common/.

> Do we really need PGLZ_Status? I'm not sure whether your categorization of
> the result status of compress/decompress functions is right or not. For example,
> pglz_decompress() can return PGLZ_INCOMPRESSIBLE status, but which seems
> invalid logically... Maybe this needs to be revisited when we introduce other
> compression algorithms and create the wrapper function for those compression
> and decompression functions. Anyway making pg_lzdecompress return
> the boolean value seems enough.
Returning only a boolean is fine for me (that's what my first patch
did), especially if we add at some point hooks for compression and
decompression calls.
Regards,
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

25 December 2014, 13:10:19

On Wed, Dec 24, 2014 at 9:03 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Returning only a boolean is fine for me (that's what my first patch
> did), especially if we add at some point hooks for compression and
> decompression calls.
Here is a patch rebased on current HEAD (60838df) for the core feature
with the APIs of pglz using booleans as return values.
--
Michael

Attachment

20141225_fpw_compression_v10.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

26 December 2014, 03:31:35

On Thu, Dec 25, 2014 at 10:10 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Dec 24, 2014 at 9:03 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Returning only a boolean is fine for me (that's what my first patch
>> did), especially if we add at some point hooks for compression and
>> decompression calls.
> Here is a patch rebased on current HEAD (60838df) for the core feature
> with the APIs of pglz using booleans as return values.
After the revert of 1st patch moving pglz to src/common, I have
reworked both patches, resulting in the attached.

For pglz, the dependency to varlena has been removed to make the code
able to run independently on both frontend and backend sides. In order
to do that the APIs of pglz_compress and pglz_decompress have been
changed a bit:
- pglz_compress returns the number of bytes compressed.
- pglz_decompress takes as additional argument the compressed length
of the buffer, and returns the number of bytes decompressed instead of
a simple boolean for consistency with the compression API.
PGLZ_Header is not modified to keep the on-disk format intact.

The WAL compression patch is realigned based on those changes.
Regards,
--
Michael

Attachment

20141226_fpw_compression_v11.tar.gz

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

26 December 2014, 06:24:53

On Fri, Dec 26, 2014 at 12:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Dec 25, 2014 at 10:10 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Dec 24, 2014 at 9:03 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Returning only a boolean is fine for me (that's what my first patch
>>> did), especially if we add at some point hooks for compression and
>>> decompression calls.
>> Here is a patch rebased on current HEAD (60838df) for the core feature
>> with the APIs of pglz using booleans as return values.
> After the revert of 1st patch moving pglz to src/common, I have
> reworked both patches, resulting in the attached.
>
> For pglz, the dependency to varlena has been removed to make the code
> able to run independently on both frontend and backend sides. In order
> to do that the APIs of pglz_compress and pglz_decompress have been
> changed a bit:
> - pglz_compress returns the number of bytes compressed.
> - pglz_decompress takes as additional argument the compressed length
> of the buffer, and returns the number of bytes decompressed instead of
> a simple boolean for consistency with the compression API.
> PGLZ_Header is not modified to keep the on-disk format intact.

pglz_compress() and pglz_decompress() still use PGLZ_Header, so the frontend
which uses those functions needs to handle PGLZ_Header. But it basically should
be handled via the varlena macros. That is, the frontend still seems to need to
understand the varlena datatype. I think we should avoid that. Thought?

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

26 December 2014, 07:16:29

On Fri, Dec 26, 2014 at 3:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> pglz_compress() and pglz_decompress() still use PGLZ_Header, so the frontend
> which uses those functions needs to handle PGLZ_Header. But it basically should
> be handled via the varlena macros. That is, the frontend still seems to need to
> understand the varlena datatype. I think we should avoid that. Thought?
Hm, yes it may be wiser to remove it and make the data passed to pglz
for varlena 8 bytes shorter..
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

28 December 2014, 13:57:26

On Fri, Dec 26, 2014 at 4:16 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Dec 26, 2014 at 3:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> pglz_compress() and pglz_decompress() still use PGLZ_Header, so the frontend
>> which uses those functions needs to handle PGLZ_Header. But it basically should
>> be handled via the varlena macros. That is, the frontend still seems to need to
>> understand the varlena datatype. I think we should avoid that. Thought?
> Hm, yes it may be wiser to remove it and make the data passed to pglz
> for varlena 8 bytes shorter..

OK, here is the result of this work, made of 3 patches.

The first two patches move pglz stuff to src/common and make it a frontend utility entirely independent on varlena and its related metadata.
- Patch 1 is a simple move of pglz to src/common, with PGLZ_Header still present. There is nothing amazing here, and that's the broken version that has been reverted in 966115c.
- The real stuff comes with patch 2, that implements the removal of PGLZ_Header, changing the APIs of compression and decompression to pglz to not have anymore toast metadata, this metadata being now localized in tuptoaster.c. Note that this patch protects the on-disk format (tested with pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of compression and decompression look like with this patch, simply performing operations from a source to a destination:
extern int32 pglz_compress(const char *source, int32 slen, char *dest,
const PGLZ_Strategy *strategy);
extern int32 pglz_decompress(const char *source, char *dest,
int32 compressed_size, int32 raw_size);
The return value of those functions is the number of bytes written in the destination buffer, and 0 if operation failed. This is aimed to make backend as well more pluggable. The reason why patch 2 exists (it could be merged with patch 1), is to facilitate the review and the changes made to pglz to make it an entirely independent facility.

Patch 3 is the FPW compression, changed to fit with those changes. Note that as PGLZ_Header contains the raw size of the compressed data, and that it does not exist, it is necessary to store the raw length of the block image directly in the block image header with 2 additional bytes. Those 2 bytes are used only if wal_compression is set to true thanks to a boolean flag, so if wal_compression is disabled, the WAL record length is exactly the same as HEAD, and there is no penalty in the default case. Similarly to previous patches, the block image is compressed without its hole.

To finish, here are some results using the same test as here with the hack on getrusage to get the system and user CPU diff on a single backend execution:
http://www.postgresql.org/message-id/CAB7nPqSc97o-UE5paxfMUKWcxE_JioyxO1M4A0pMnmYqAnec2g@mail.gmail.com

From

Amit Kapila

Date:

01 January 2015, 05:11:00

On Thu, Jan 1, 2015 at 2:39 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Dec 30, 2014 at 01:27:44PM +0100, Andres Freund wrote:
> > On 2014-12-30 21:23:38 +0900, Michael Paquier wrote:
> > > On Tue, Dec 30, 2014 at 6:21 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > > > On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
> > > >> Speeding up the CRC calculation obviously won't help with the WAL volume
> > > >> per se, ie. you still generate the same amount of WAL that needs to be
> > > >> shipped in replication. But then again, if all you want to do is to
> > > >> reduce the volume, you could just compress the whole WAL stream.
> > > >
> > > > Was this point addressed?
> > > Compressing the whole record is interesting for multi-insert records,
> > > but as we need to keep the compressed data in a pre-allocated buffer
> > > until WAL is written, we can only compress things within a given size
> > > range. The point is, even if we define a lower bound, compression is
> > > going to perform badly with an application that generates for example
> > > many small records that are just higher than the lower bound...
> > > Unsurprisingly for small records this was bad:
> >
> > So why are you bringing it up? That's not an argument for anything,
> > except not doing it in such a simplistic way.
>
> I still don't understand the value of adding WAL compression, given the
> high CPU usage and minimal performance improvement. The only big
> advantage is WAL storage, but again, why not just compress the WAL file
> when archiving.
>
> I thought we used to see huge performance benefits from WAL compression,
> but not any more?

I think there can be performance benefit for the cases when the data

is compressible, but it would be loss otherwise. The main thing is

that the current compression algorithm (pg_lz) used is not so

favorable for non-compresible data.

>Has the UPDATE WAL compression removed that benefit?

Good question, I think there might be some impact due to that, but in

general for page level compression still there will be much more to

compress.

In general, I think this idea has merit with respect to compressible data,

and to save for the cases where it will not perform well, there is a on/off

switch for this feature and in future if PostgreSQL has some better

compression method, we can consider the same as well. One thing

that we need to think is whether user's can decide with ease when to

enable this global switch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

Michael Paquier

Date:

01 January 2015, 06:33:46

On Thu, Jan 1, 2015 at 2:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 1, 2015 at 2:39 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> > So why are you bringing it up? That's not an argument for anything,
>> > except not doing it in such a simplistic way.
>>
>> I still don't understand the value of adding WAL compression, given the
>> high CPU usage and minimal performance improvement.  The only big
>> advantage is WAL storage, but again, why not just compress the WAL file
>> when archiving.
When doing some tests with pgbench for a fixed number of transactions,
I also noticed a reduction in replay time as well, see here for
example some results here:
http://www.postgresql.org/message-id/CAB7nPqRv6RaSx7hTnp=g3dYqOu++FeL0UioYqPLLBdbhAyB_jQ@mail.gmail.com

>> I thought we used to see huge performance benefits from WAL compression,
>> but not any more?
>
> I think there can be performance benefit for the cases when the data
> is compressible, but it would be loss otherwise.  The main thing is
> that the current compression algorithm (pg_lz) used is not so
> favorable for non-compresible data.
Yes definitely. Switching to a different algorithm would be the next
step forward. We have been discussing mainly about lz4 that has a
friendly license, I think that it would be worth studying other things
as well once we have all the infrastructure in place.

>>Has the UPDATE WAL compression removed that benefit?
>
> Good question,  I think there might be some impact due to that, but in
> general for page level compression still there will be much more to
> compress.
That may be a good thing to put a number on. We could try to patch a
build with a revert of a3115f0d and measure a bit that the difference
in WAL size that it creates. Thoughts?

> In general, I think this idea has merit with respect to compressible data,
> and to save for the cases where it will not perform well, there is a on/off
> switch for this feature and in future if PostgreSQL has some better
> compression method, we can consider the same as well.  One thing
> that we need to think is whether user's can decide with ease when to
> enable this global switch.
The opposite is true as well, we shouldn't force the user to have data
compressed even if the switch is disabled.
-- 
Michael

Re: Compression of full-page-writes

From

Bruce Momjian

Date:

01 January 2015, 20:29:57

On Thu, Jan  1, 2015 at 10:40:53AM +0530, Amit Kapila wrote:
> Good question,  I think there might be some impact due to that, but in
> general for page level compression still there will be much more to
> compress. 
> 
> In general, I think this idea has merit with respect to compressible data,
> and to save for the cases where it will not perform well, there is a on/off
> switch for this feature and in future if PostgreSQL has some better
> compression method, we can consider the same as well.  One thing
> that we need to think is whether user's can decide with ease when to
> enable this global switch.

Yes, that is the crux of my concern.  I am worried about someone who
assumes compressions == good, and then enables it.  If we can't clearly
know when it is good, it is even harder for users to know.  If we think
it isn't generally useful until a new compression algorithm is used,
perhaps we need to wait until the we implement this.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Compression of full-page-writes

From

Amit Kapila

Date:

02 January 2015, 03:10:03

On Thu, Jan 1, 2015 at 12:03 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Jan 1, 2015 at 2:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Jan 1, 2015 at 2:39 AM, Bruce Momjian <bruce@momjian.us> wrote:
> >> > So why are you bringing it up? That's not an argument for anything,
> >> > except not doing it in such a simplistic way.
> >>
> >> I still don't understand the value of adding WAL compression, given the
> >> high CPU usage and minimal performance improvement. The only big
> >> advantage is WAL storage, but again, why not just compress the WAL file
> >> when archiving.
> When doing some tests with pgbench for a fixed number of transactions,
> I also noticed a reduction in replay time as well, see here for
> example some results here:
> http://www.postgresql.org/message-id/CAB7nPqRv6RaSx7hTnp=g3dYqOu++FeL0UioYqPLLBdbhAyB_jQ@mail.gmail.com
>
> >> I thought we used to see huge performance benefits from WAL compression,
> >> but not any more?
> >
> > I think there can be performance benefit for the cases when the data
> > is compressible, but it would be loss otherwise. The main thing is
> > that the current compression algorithm (pg_lz) used is not so
> > favorable for non-compresible data.
> Yes definitely. Switching to a different algorithm would be the next
> step forward. We have been discussing mainly about lz4 that has a
> friendly license, I think that it would be worth studying other things
> as well once we have all the infrastructure in place.
>
> >>Has the UPDATE WAL compression removed that benefit?
> >
> > Good question, I think there might be some impact due to that, but in
> > general for page level compression still there will be much more to
> > compress.
> That may be a good thing to put a number on. We could try to patch a
> build with a revert of a3115f0d and measure a bit that the difference
> in WAL size that it creates. Thoughts?
>

You can do that, but what inference you want to deduce from it?

I think there can be some improvement in performance as well as

compression depending on the tests (if your tests involves lot of

Updates, then you might see some better results), however the

results will be more or less on similar lines.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

Amit Kapila

Date:

02 January 2015, 03:19:49

On Fri, Jan 2, 2015 at 1:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Jan 1, 2015 at 10:40:53AM +0530, Amit Kapila wrote:
> > Good question, I think there might be some impact due to that, but in
> > general for page level compression still there will be much more to
> > compress.
> >
> > In general, I think this idea has merit with respect to compressible data,
> > and to save for the cases where it will not perform well, there is a on/off
> > switch for this feature and in future if PostgreSQL has some better
> > compression method, we can consider the same as well. One thing
> > that we need to think is whether user's can decide with ease when to
> > enable this global switch.
>
> Yes, that is the crux of my concern. I am worried about someone who
> assumes compressions == good, and then enables it. If we can't clearly
> know when it is good, it is even harder for users to know.

I think it might have been better if this switch is a relation level switch

as whether the data is compressible or not can be based on schema

and data in individual tables, but I think your concern is genuine.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Compression of full-page-writes

From

Andres Freund

Date:

02 January 2015, 12:01:15

On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
> I still don't understand the value of adding WAL compression, given the
> high CPU usage and minimal performance improvement.  The only big
> advantage is WAL storage, but again, why not just compress the WAL file
> when archiving.

before: pg_xlog is 800GB
after: pg_xlog is 600GB.

I'm damned sure that many people would be happy with that, even if the
*per backend* overhead is a bit higher. And no, compression of archives
when archiving helps *zap* with that (streaming, wal_keep_segments,
checkpoint_timeout). As discussed before.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

"ktm@rice.edu"

Date:

02 January 2015, 16:16:17

On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote:
> On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
> > I still don't understand the value of adding WAL compression, given the
> > high CPU usage and minimal performance improvement.  The only big
> > advantage is WAL storage, but again, why not just compress the WAL file
> > when archiving.
> 
> before: pg_xlog is 800GB
> after: pg_xlog is 600GB.
> 
> I'm damned sure that many people would be happy with that, even if the
> *per backend* overhead is a bit higher. And no, compression of archives
> when archiving helps *zap* with that (streaming, wal_keep_segments,
> checkpoint_timeout). As discussed before.
> 
> Greetings,
> 
> Andres Freund
> 

+1

On an I/O constrained system assuming 50:50 table:WAL I/O, in the case
above you can process 100GB of transaction data at the cost of a bit
more CPU.

Regards,
Ken

Re: Compression of full-page-writes

From

Bruce Momjian

Date:

02 January 2015, 16:53:04

On Fri, Jan  2, 2015 at 10:15:57AM -0600, ktm@rice.edu wrote:
> On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote:
> > On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
> > > I still don't understand the value of adding WAL compression, given the
> > > high CPU usage and minimal performance improvement.  The only big
> > > advantage is WAL storage, but again, why not just compress the WAL file
> > > when archiving.
> > 
> > before: pg_xlog is 800GB
> > after: pg_xlog is 600GB.
> > 
> > I'm damned sure that many people would be happy with that, even if the
> > *per backend* overhead is a bit higher. And no, compression of archives
> > when archiving helps *zap* with that (streaming, wal_keep_segments,
> > checkpoint_timeout). As discussed before.
> > 
> > Greetings,
> > 
> > Andres Freund
> > 
> 
> +1
> 
> On an I/O constrained system assuming 50:50 table:WAL I/O, in the case
> above you can process 100GB of transaction data at the cost of a bit
> more CPU.

OK, so given your stats, the feature give a 12.5% reduction in I/O.  If
that is significant, shouldn't we see a performance improvement?  If we
don't see a performance improvement, is I/O reduction worthwhile?  Is it
valuable in that it gives non-database applications more I/O to use?  Is
that all?

I suggest we at least document that this feature as mostly useful for
I/O reduction, and maybe say CPU usage and performance might be
negatively impacted.

OK, here is the email I remember from Fujii Masao this same thread that
showed a performance improvement for WAL compression:
http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hNr7Ly516kfT5ffqf4pp1qnHg@mail.gmail.com

Why are we not seeing the 33% compression and 15% performance
improvement he saw?  What am I missing here?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Compression of full-page-writes

From

Andres Freund

Date:

02 January 2015, 16:56:16

On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote:
> Why are we not seeing the 33% compression and 15% performance
> improvement he saw?  What am I missing here?

To see performance improvements something needs to be the bottleneck. If
WAL writes/flushes aren't that in the tested scenario, you won't see a
performance benefit. Amdahl's law and all that.

I don't understand your negativity about the topic.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

Bruce Momjian

Date:

02 January 2015, 17:06:41

On Fri, Jan  2, 2015 at 05:55:52PM +0100, Andres Freund wrote:
> On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote:
> > Why are we not seeing the 33% compression and 15% performance
> > improvement he saw?  What am I missing here?
> 
> To see performance improvements something needs to be the bottleneck. If
> WAL writes/flushes aren't that in the tested scenario, you won't see a
> performance benefit. Amdahl's law and all that.
> 
> I don't understand your negativity about the topic.

I remember the initial post from Masao in August 2013 showing a
performance boost, so I assumed, while we had the concurrent WAL insert
performance improvement in 9.4, this was going to be our 9.5 WAL
improvement.   While the WAL insert performance improvement required no
tuning and was never a negative, I now see the compression patch as
something that has negatives, so has to be set by the user, and only
wins in certain cases.  I am disappointed, and am trying to figure out
how this became such a marginal win for 9.5.  :-(

My negativity is not that I don't want it, but I want to understand why
it isn't better than I remembered.  You are basically telling me it was
always a marginal win.  :-(  Boohoo!

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Compression of full-page-writes

From

Andres Freund

Date:

02 January 2015, 17:11:41

On 2015-01-02 12:06:33 -0500, Bruce Momjian wrote:
> On Fri, Jan  2, 2015 at 05:55:52PM +0100, Andres Freund wrote:
> > On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote:
> > > Why are we not seeing the 33% compression and 15% performance
> > > improvement he saw?  What am I missing here?
> > 
> > To see performance improvements something needs to be the bottleneck. If
> > WAL writes/flushes aren't that in the tested scenario, you won't see a
> > performance benefit. Amdahl's law and all that.
> > 
> > I don't understand your negativity about the topic.
> 
> I remember the initial post from Masao in August 2013 showing a
> performance boost, so I assumed, while we had the concurrent WAL insert
> performance improvement in 9.4, this was going to be our 9.5 WAL
> improvement.

I don't think it makes sense to compare features/improvements that way.

> While the WAL insert performance improvement required no tuning and
> was never a negative

It's actually a negative in some cases.

> , I now see the compression patch as something that has negatives, so
> has to be set by the user, and only wins in certain cases.  I am
> disappointed, and am trying to figure out how this became such a
> marginal win for 9.5.  :-(

I find the notion that a multi digit space reduction is a "marginal win"
pretty ridiculous and way too narrow focused. Our WAL volume is a
*significant* problem in the field. And it mostly consists out of FPWs
spacewise.

> My negativity is not that I don't want it, but I want to understand why
> it isn't better than I remembered.  You are basically telling me it was
> always a marginal win.  :-(  Boohoo!

No, I didn't. I told you that *IN ONE BENCHMARK* wal writes apparently
are not the bottleneck.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Compression of full-page-writes

From

Bruce Momjian

Date:

02 January 2015, 17:16:05

On Fri, Jan  2, 2015 at 06:11:29PM +0100, Andres Freund wrote:
> > My negativity is not that I don't want it, but I want to understand why
> > it isn't better than I remembered.  You are basically telling me it was
> > always a marginal win.  :-(  Boohoo!
> 
> No, I didn't. I told you that *IN ONE BENCHMARK* wal writes apparently
> are not the bottleneck.

What I have not seen is any recent benchmarks that show it as a win,
while the original email did, so I was confused.  I tried to explain
exactly how I viewed things  --- you can not like it, but that is how I
look for upcoming features, and where we should focus our time.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Compression of full-page-writes

From

Claudio Freire

Date:

02 January 2015, 17:18:20

On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> , I now see the compression patch as something that has negatives, so
>> has to be set by the user, and only wins in certain cases.  I am
>> disappointed, and am trying to figure out how this became such a
>> marginal win for 9.5.  :-(
>
> I find the notion that a multi digit space reduction is a "marginal win"
> pretty ridiculous and way too narrow focused. Our WAL volume is a
> *significant* problem in the field. And it mostly consists out of FPWs
> spacewise.

One thing I'd like to point out, is that in cases where WAL I/O is an
issue (ie: WAL archiving), usually people already compress the
segments during archiving. I know I do, and I know it's recommended on
the web, and by some consultants.

So, I wouldn't want this FPW compression, which is desirable in
replication scenarios if you can spare the CPU cycles (because of
streaming), adversely affecting WAL compression during archiving.

Has anyone tested the compressability of WAL segments with FPW compression on?

AFAIK, both pglz and lz4 output should still be compressible with
deflate, but I've never tried.

Re: Compression of full-page-writes

From

Bruce Momjian

Date:

02 January 2015, 17:25:08

On Fri, Jan  2, 2015 at 02:18:12PM -0300, Claudio Freire wrote:
> On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> , I now see the compression patch as something that has negatives, so
> >> has to be set by the user, and only wins in certain cases.  I am
> >> disappointed, and am trying to figure out how this became such a
> >> marginal win for 9.5.  :-(
> >
> > I find the notion that a multi digit space reduction is a "marginal win"
> > pretty ridiculous and way too narrow focused. Our WAL volume is a
> > *significant* problem in the field. And it mostly consists out of FPWs
> > spacewise.
> 
> One thing I'd like to point out, is that in cases where WAL I/O is an
> issue (ie: WAL archiving), usually people already compress the
> segments during archiving. I know I do, and I know it's recommended on
> the web, and by some consultants.
> 
> So, I wouldn't want this FPW compression, which is desirable in
> replication scenarios if you can spare the CPU cycles (because of
> streaming), adversely affecting WAL compression during archiving.

To be specific, desirable in streaming replication scenarios that don't
use SSL compression.  (What percentage is that?)  It is something we
should mention in the docs for this feature?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Compression of full-page-writes

From

Stephen Frost

Date:

02 January 2015, 18:05:55

* Bruce Momjian (bruce@momjian.us) wrote:
> To be specific, desirable in streaming replication scenarios that don't
> use SSL compression.  (What percentage is that?)  It is something we
> should mention in the docs for this feature?

Considering how painful the SSL rengeotiation problems were and the CPU
overhead, I'd be surprised if many high-write-volume replication
environments use SSL at all.

There's a lot of win to be had from compression of FPWs, but it's like
most compression in that there are trade-offs to be had and environments
where it won't be a win, but I believe those cases to be the minority.
Thanks,
    Stephen

Re: Compression of full-page-writes

From

Michael Paquier

Date:

03 January 2015, 11:55:23

On Sat, Jan 3, 2015 at 1:52 AM, Bruce Momjian <bruce@momjian.us> wrote:
> I suggest we at least document that this feature as mostly useful for
> I/O reduction, and maybe say CPU usage and performance might be
> negatively impacted.
FWIW, that's mentioned in the documentation included in the patch..
-- 
Michael

Re: Compression of full-page-writes

From

Fujii Masao

Date:

05 January 2015, 09:07:29

On Sat, Jan 3, 2015 at 1:52 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Jan  2, 2015 at 10:15:57AM -0600, ktm@rice.edu wrote:
>> On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote:
>> > On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
>> > > I still don't understand the value of adding WAL compression, given the
>> > > high CPU usage and minimal performance improvement.  The only big
>> > > advantage is WAL storage, but again, why not just compress the WAL file
>> > > when archiving.
>> >
>> > before: pg_xlog is 800GB
>> > after: pg_xlog is 600GB.
>> >
>> > I'm damned sure that many people would be happy with that, even if the
>> > *per backend* overhead is a bit higher. And no, compression of archives
>> > when archiving helps *zap* with that (streaming, wal_keep_segments,
>> > checkpoint_timeout). As discussed before.
>> >
>> > Greetings,
>> >
>> > Andres Freund
>> >
>>
>> +1
>>
>> On an I/O constrained system assuming 50:50 table:WAL I/O, in the case
>> above you can process 100GB of transaction data at the cost of a bit
>> more CPU.
>
> OK, so given your stats, the feature give a 12.5% reduction in I/O.  If
> that is significant, shouldn't we see a performance improvement?  If we
> don't see a performance improvement, is I/O reduction worthwhile?  Is it
> valuable in that it gives non-database applications more I/O to use?  Is
> that all?
>
> I suggest we at least document that this feature as mostly useful for
> I/O reduction, and maybe say CPU usage and performance might be
> negatively impacted.
>
> OK, here is the email I remember from Fujii Masao this same thread that
> showed a performance improvement for WAL compression:
>
>         http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hNr7Ly516kfT5ffqf4pp1qnHg@mail.gmail.com
>
> Why are we not seeing the 33% compression and 15% performance
> improvement he saw?

Because the benchmarks I and Michael used are very difffernet.
I just used pgbench, but he used his simple test SQLs (see
http://www.postgresql.org/message-id/CAB7nPqSc97o-UE5paxfMUKWcxE_JioyxO1M4A0pMnmYqAnec2g@mail.gmail.com).

Furthermore, the data type of pgbench_accounts.filler column is character(84)
and its content is empty, so pgbench_accounts is very compressible. This is
one of the reasons I could see good performance improvement and high compression
ratio.

Regards,

-- 
Fujii Masao

Re: Compression of full-page-writes

From

Fujii Masao

Date:

05 January 2015, 09:12:54

On Sat, Jan 3, 2015 at 2:24 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Jan  2, 2015 at 02:18:12PM -0300, Claudio Freire wrote:
>> On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> , I now see the compression patch as something that has negatives, so
>> >> has to be set by the user, and only wins in certain cases.  I am
>> >> disappointed, and am trying to figure out how this became such a
>> >> marginal win for 9.5.  :-(
>> >
>> > I find the notion that a multi digit space reduction is a "marginal win"
>> > pretty ridiculous and way too narrow focused. Our WAL volume is a
>> > *significant* problem in the field. And it mostly consists out of FPWs
>> > spacewise.
>>
>> One thing I'd like to point out, is that in cases where WAL I/O is an
>> issue (ie: WAL archiving), usually people already compress the
>> segments during archiving. I know I do, and I know it's recommended on
>> the web, and by some consultants.
>>
>> So, I wouldn't want this FPW compression, which is desirable in
>> replication scenarios if you can spare the CPU cycles (because of
>> streaming), adversely affecting WAL compression during archiving.
>
> To be specific, desirable in streaming replication scenarios that don't
> use SSL compression.  (What percentage is that?)  It is something we
> should mention in the docs for this feature?

Even if SSL is used in replication, FPW compression is useful. It can reduce
the amount of I/O in the standby side. Sometimes I've seen walreceiver's I/O had
become a performance bottleneck especially in synchronous replication cases.
FPW compression can be useful for those cases, for example.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

05 January 2015, 13:30:04

On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
> On Fri, Dec 26, 2014 at 4:16 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> On Fri, Dec 26, 2014 at 3:24 PM, Fujii Masao <masao.fujii@gmail.com>
>> wrote:
>>> pglz_compress() and pglz_decompress() still use PGLZ_Header, so the
>>> frontend
>>> which uses those functions needs to handle PGLZ_Header. But it basically
>>> should
>>> be handled via the varlena macros. That is, the frontend still seems to
>>> need to
>>> understand the varlena datatype. I think we should avoid that. Thought?
>> Hm, yes it may be wiser to remove it and make the data passed to pglz
>> for varlena 8 bytes shorter..
>
> OK, here is the result of this work, made of 3 patches.

Thanks for updating the patches!

> The first two patches move pglz stuff to src/common and make it a frontend
> utility entirely independent on varlena and its related metadata.
> - Patch 1 is a simple move of pglz to src/common, with PGLZ_Header still
> present. There is nothing amazing here, and that's the broken version that
> has been reverted in 966115c.

The patch 1 cannot be applied to the master successfully because of
recent change.

> - The real stuff comes with patch 2, that implements the removal of
> PGLZ_Header, changing the APIs of compression and decompression to pglz to
> not have anymore toast metadata, this metadata being now localized in
> tuptoaster.c. Note that this patch protects the on-disk format (tested with
> pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of
> compression and decompression look like with this patch, simply performing
> operations from a source to a destination:
> extern int32 pglz_compress(const char *source, int32 slen, char *dest,
>                           const PGLZ_Strategy *strategy);
> extern int32 pglz_decompress(const char *source, char *dest,
>                           int32 compressed_size, int32 raw_size);
> The return value of those functions is the number of bytes written in the
> destination buffer, and 0 if operation failed.

So it's guaranteed that 0 is never returned in success case? I'm not sure
if that case can really happen, though.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

06 January 2015, 02:09:33

On Mon, Jan 5, 2015 at 10:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier wrote:
> The patch 1 cannot be applied to the master successfully because of
> recent change.
Yes, that's caused by ccb161b. Attached are rebased versions.

>> - The real stuff comes with patch 2, that implements the removal of
>> PGLZ_Header, changing the APIs of compression and decompression to pglz to
>> not have anymore toast metadata, this metadata being now localized in
>> tuptoaster.c. Note that this patch protects the on-disk format (tested with
>> pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of
>> compression and decompression look like with this patch, simply performing
>> operations from a source to a destination:
>> extern int32 pglz_compress(const char *source, int32 slen, char *dest,
>>                           const PGLZ_Strategy *strategy);
>> extern int32 pglz_decompress(const char *source, char *dest,
>>                           int32 compressed_size, int32 raw_size);
>> The return value of those functions is the number of bytes written in the
>> destination buffer, and 0 if operation failed.
>
> So it's guaranteed that 0 is never returned in success case? I'm not sure
> if that case can really happen, though.
This is an inspiration from lz4 APIs. Wouldn't it be buggy for a
compression algorithm to return a size of 0 bytes as compressed or
decompressed length btw? We could as well make it return a negative
value when a failure occurs if you feel more comfortable with it.
--
Michael

Attachment

20150105_fpw_compression_v13.tar.gz

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

06 January 2015, 15:51:17

Hello,

>Yes, that's caused by ccb161b. Attached are rebased versions. 

Following are some comments, 

>uint16  hole_offset:15, /* number of bytes in "hole" */
Typo in description of hole_offset
           
>        for (block_id = 0; block_id <= record->max_block_id; block_id++)
>-       {
>-               if (XLogRecHasBlockImage(record, block_id))
>-                       fpi_len += BLCKSZ -
record->blocks[block_id].hole_length;
>-       }
>+               fpi_len += record->blocks[block_id].bkp_len;

IIUC, if condition, /if(XLogRecHasBlockImage(record, block_id))/ is
incorrectly removed from the above for loop.

>typedef struct XLogRecordCompressedBlockImageHeader
I am trying to understand the purpose behind declaration of the above
struct. IIUC, it is defined in order to introduce new field uint16 
raw_length and it has been declared as a separate struct from
XLogRecordBlockImageHeader to not affect the size of WAL record when
compression is off.
I wonder if it is ok to simply memcpy the uint16 raw_length in the
hdr_scratch when compression is on
and not have a separate header struct for it neither declare it in existing
header. raw_length can be a locally defined variable is XLogRecordAssemble
or it can be a field in registered_buffer struct like compressed_page.
I think this can simplify the code. 
Am I missing something obvious?   

> /*
>  * Fill in the remaining fields in the XLogRecordBlockImageHeader
>  * struct and add new entries in the record chain.
> */  
>   bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;

This code line seems to be misplaced with respect to the above comment. 
Comment indicates filling of XLogRecordBlockImageHeader fields while
fork_flags is a field of XLogRecordBlockHeader.
Is it better to place the code close to following condition? if (needs_backup) {

>+  *the original length of the
>+ * block without its page hole being deducible from the compressed data
>+ * itself.
IIUC, this comment before XLogRecordBlockImageHeader seems to be no longer
valid as original length is not deducible from compressed data and rather
stored in header.


Thank you,
Rahila Syed



--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5833025.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

07 January 2015, 04:02:36

On Wed, Jan 7, 2015 at 12:51 AM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
> Following are some comments,
Thanks for the feedback.

>>uint16  hole_offset:15, /* number of bytes in "hole" */
> Typo in description of hole_offset
Fixed. That's "before hole".

>>        for (block_id = 0; block_id <= record->max_block_id; block_id++)
>>-       {
>>-               if (XLogRecHasBlockImage(record, block_id))
>>-                       fpi_len += BLCKSZ -
> record->blocks[block_id].hole_length;
>>-       }
>>+               fpi_len += record->blocks[block_id].bkp_len;
>
> IIUC, if condition, /if(XLogRecHasBlockImage(record, block_id))/ is
> incorrectly removed from the above for loop.
Fixed.

>>typedef struct XLogRecordCompressedBlockImageHeader
> I am trying to understand the purpose behind declaration of the above
> struct. IIUC, it is defined in order to introduce new field uint16
> raw_length and it has been declared as a separate struct from
> XLogRecordBlockImageHeader to not affect the size of WAL record when
> compression is off.
> I wonder if it is ok to simply memcpy the uint16 raw_length in the
> hdr_scratch when compression is on
> and not have a separate header struct for it neither declare it in existing
> header. raw_length can be a locally defined variable is XLogRecordAssemble
> or it can be a field in registered_buffer struct like compressed_page.
> I think this can simplify the code.
> Am I missing something obvious?
You are missing nothing. I just introduced this structure for a matter
of readability to show the two-byte difference between non-compressed
and compressed header information. It is true that doing it my way
makes the structures duplicated, so let's simply add the
compression-related information as an extra structure added after
XLogRecordBlockImageHeader if the block is compressed. I hope this
addresses your concerns.

>> /*
>>  * Fill in the remaining fields in the XLogRecordBlockImageHeader
>>  * struct and add new entries in the record chain.
>> */
>
>>   bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
>
> This code line seems to be misplaced with respect to the above comment.
> Comment indicates filling of XLogRecordBlockImageHeader fields while
> fork_flags is a field of XLogRecordBlockHeader.
> Is it better to place the code close to following condition?
>   if (needs_backup)
>   {
Yes, this comment should not be here. I replaced it with the comment in HEAD.


>>+  *the original length of the
>>+ * block without its page hole being deducible from the compressed data
>>+ * itself.
> IIUC, this comment before XLogRecordBlockImageHeader seems to be no longer
> valid as original length is not deducible from compressed data and rather
> stored in header.
Aah, true. This was originally present in the header of PGLZ that has
been removed to make it available for frontends.

Updated patches are attached.
Regards,
--
Michael

Attachment

20150107_fpw_compression_v14.tar.gz

Re: Compression of full-page-writes

From

Rahila Syed

Date:

08 January 2015, 14:59:35

Hello,

Below are performance numbers in case of synchronous replication with and
without fpw compression using latest version of patch(version 14). The patch
helps improve performance considerably.

Both master and standby are on the same machine in order to get numbers
independent of network overhead.
The compression patch helps to increase tps by 10% . It also helps reduce
I/O to disk , latency and total runtime for a fixed number of transactions
as shown below.
The compression of WAL is quite high around 40%.

pgbench scale :1000
pgbench command : pgbench  -c 16 -j 16 -r -t 250000 -M prepared

To ensure that data is not highly compressible, empty filler columns were
altered using
alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text

checkpoint_segments = 1024
checkpoint_timeout =  5min
fsync = on

Compression                    on
off

WAL generated             23037180520(~23.04MB)
38196743704(~38.20MB)

TPS                             264.18
239.34

Latency average            60.541  ms                               66.822
ms

Latency stddev              126.567 ms                              130.434
ms

Total writes to disk         145045.310 MB                        192357.250
MB

Runtime                       15141.0 s
16712.0 s


Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD      450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed



--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5833315.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: Compression of full-page-writes

From

Michael Paquier

Date:

09 January 2015, 07:48:59

On Thu, Jan 8, 2015 at 11:59 PM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
> Below are performance numbers in case of synchronous replication with and
> without fpw compression using latest version of patch(version 14). The patch
> helps improve performance considerably.
> Both master and standby are on the same machine in order to get numbers
> independent of network overhead.
So this test can be used to evaluate how shorter records influence
performance since the master waits for flush confirmation from the
standby, right?

> The compression patch helps to increase tps by 10% . It also helps reduce
> I/O to disk , latency and total runtime for a fixed number of transactions
> as shown below.
> The compression of WAL is quite high around 40%.
>
> Compression                    on
> off
>
> WAL generated             23037180520(~23.04MB)
> 38196743704(~38.20MB)
Isn't that GB and not MB?

> TPS                             264.18                    239.34
>
> Latency average            60.541  ms           66.822
> ms
>
> Latency stddev              126.567 ms           130.434
> ms
>
> Total writes to disk         145045.310 MB     192357.250MB
> Runtime                       15141.0 s  16712.0 s
How many FPWs have been generated and how many dirty buffers have been
flushed for the 3 checkpoints of each test?

Any data about the CPU activity?
-- 
Michael

Re: Compression of full-page-writes

From

Rahila Syed

Date:

09 January 2015, 12:49:55

>So this test can be used to evaluate how shorter records influence 
>performance since the master waits for flush confirmation from the 
>standby, right? 

Yes. This test can help measure performance improvement due to reduced I/O
on standby as master waits for WAL records flush on standby.

>Isn't that GB and not MB? 
Yes. That is a typo. It should be GB.

>How many FPWs have been generated and how many dirty buffers have been 
>flushed for the 3 checkpoints of each test? 

>Any data about the CPU activity? 
Above data is not available for this run . I will rerun the tests to gather
above data.

Thank you,
Rahila Syed





--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5833389.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: Compression of full-page-writes

From

Michael Paquier

Date:

10 January 2015, 08:10:30

On Fri, Jan 9, 2015 at 9:49 PM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
>>So this test can be used to evaluate how shorter records influence
>>performance since the master waits for flush confirmation from the
>>standby, right?
>
> Yes. This test can help measure performance improvement due to reduced I/O
> on standby as master waits for WAL records flush on standby.
It may be interesting to run such tests with more concurrent
connections at the same time, like 32 or 64.
-- 
Michael

Re: Compression of full-page-writes

From

Robert Haas

Date:

14 January 2015, 16:47:26

On Fri, Jan 2, 2015 at 11:52 AM, Bruce Momjian <bruce@momjian.us> wrote:
> OK, so given your stats, the feature give a 12.5% reduction in I/O.  If
> that is significant, shouldn't we see a performance improvement?  If we
> don't see a performance improvement, is I/O reduction worthwhile?  Is it
> valuable in that it gives non-database applications more I/O to use?  Is
> that all?
>
> I suggest we at least document that this feature as mostly useful for
> I/O reduction, and maybe say CPU usage and performance might be
> negatively impacted.
>
> OK, here is the email I remember from Fujii Masao this same thread that
> showed a performance improvement for WAL compression:
>
>         http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hNr7Ly516kfT5ffqf4pp1qnHg@mail.gmail.com
>
> Why are we not seeing the 33% compression and 15% performance
> improvement he saw?  What am I missing here?

Bruce, some database workloads are I/O bound and others are CPU bound.
Any patch that reduces I/O by using CPU is going to be a win when the
system is I/O bound and a loss when it is CPU bound.  I'm not really
sure what else to say about that; it seems pretty obvious.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Compression of full-page-writes

From

Michael Paquier

Date:

15 January 2015, 07:21:25

Marking this patch as returned with feedback for this CF, moving it to
the next one. I doubt that there will be much progress here for the
next couple of days, so let's try at least to get something for this
release cycle.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

05 February 2015, 08:50:15

On Tue, Jan 6, 2015 at 11:09 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jan 5, 2015 at 10:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier wrote:
>> The patch 1 cannot be applied to the master successfully because of
>> recent change.
> Yes, that's caused by ccb161b. Attached are rebased versions.
>
>>> - The real stuff comes with patch 2, that implements the removal of
>>> PGLZ_Header, changing the APIs of compression and decompression to pglz to
>>> not have anymore toast metadata, this metadata being now localized in
>>> tuptoaster.c. Note that this patch protects the on-disk format (tested with
>>> pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of
>>> compression and decompression look like with this patch, simply performing
>>> operations from a source to a destination:
>>> extern int32 pglz_compress(const char *source, int32 slen, char *dest,
>>>                           const PGLZ_Strategy *strategy);
>>> extern int32 pglz_decompress(const char *source, char *dest,
>>>                           int32 compressed_size, int32 raw_size);
>>> The return value of those functions is the number of bytes written in the
>>> destination buffer, and 0 if operation failed.
>>
>> So it's guaranteed that 0 is never returned in success case? I'm not sure
>> if that case can really happen, though.
> This is an inspiration from lz4 APIs. Wouldn't it be buggy for a
> compression algorithm to return a size of 0 bytes as compressed or
> decompressed length btw? We could as well make it return a negative
> value when a failure occurs if you feel more comfortable with it.

I feel that's better. Attached is the updated version of the patch.
I changed the pg_lzcompress and pg_lzdecompress so that they return -1
when failure happens. Also I applied some cosmetic changes to the patch
(e.g., shorten the long name of the newly-added macros).
Barring any objection, I will commit this.

Regards,

--
Fujii Masao

Attachment

move_pg_lzcompress_to_common_v2.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

05 February 2015, 14:07:30

Hello,

>/*
>+    * We recheck the actual size even if pglz_compress() report success,
>+    * because it might be satisfied with having saved as little as one byte
>+    * in the compressed data.
>+    */
>+   *len = (uint16) compressed_len;
>+   if (*len >= orig_len - 1)
>+       return false;
>+   return true;
>+}

As per latest code ,when compression is 'on' we introduce additional 2 bytes in the header of each block image for
storingraw_length of the compressed block.  
In order to achieve compression while accounting for these two additional bytes, we must ensure that compressed length
isless than original length - 2. 
So , IIUC the above condition should rather be

If (*len >= orig_len -2 )
                return false;
return true;

The attached patch contains this. It also has a cosmetic change-  renaming compressBuf to uncompressBuf as it is used
tostore uncompressed page. 

Thank you,
Rahila Syed

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
Sent: Wednesday, January 07, 2015 9:32 AM
To: Rahila Syed
Cc: PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Wed, Jan 7, 2015 at 12:51 AM, Rahila Syed <rahilasyed.90@gmail.com> wrote:
> Following are some comments,
Thanks for the feedback.

>>uint16  hole_offset:15, /* number of bytes in "hole" */
> Typo in description of hole_offset
Fixed. That's "before hole".

>>        for (block_id = 0; block_id <= record->max_block_id; block_id++)
>>-       {
>>-               if (XLogRecHasBlockImage(record, block_id))
>>-                       fpi_len += BLCKSZ -
> record->blocks[block_id].hole_length;
>>-       }
>>+               fpi_len += record->blocks[block_id].bkp_len;
>
> IIUC, if condition, /if(XLogRecHasBlockImage(record, block_id))/ is
> incorrectly removed from the above for loop.
Fixed.

>>typedef struct XLogRecordCompressedBlockImageHeader
> I am trying to understand the purpose behind declaration of the above
> struct. IIUC, it is defined in order to introduce new field uint16
> raw_length and it has been declared as a separate struct from
> XLogRecordBlockImageHeader to not affect the size of WAL record when
> compression is off.
> I wonder if it is ok to simply memcpy the uint16 raw_length in the
> hdr_scratch when compression is on and not have a separate header
> struct for it neither declare it in existing header. raw_length can be
> a locally defined variable is XLogRecordAssemble or it can be a field
> in registered_buffer struct like compressed_page.
> I think this can simplify the code.
> Am I missing something obvious?
You are missing nothing. I just introduced this structure for a matter of readability to show the two-byte difference
betweennon-compressed and compressed header information. It is true that doing it my way makes the structures
duplicated,so let's simply add the compression-related information as an extra structure added after
XLogRecordBlockImageHeaderif the block is compressed. I hope this addresses your concerns. 

>> /*
>>  * Fill in the remaining fields in the XLogRecordBlockImageHeader
>>  * struct and add new entries in the record chain.
>> */
>
>>   bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
>
> This code line seems to be misplaced with respect to the above comment.
> Comment indicates filling of XLogRecordBlockImageHeader fields while
> fork_flags is a field of XLogRecordBlockHeader.
> Is it better to place the code close to following condition?
>   if (needs_backup)
>   {
Yes, this comment should not be here. I replaced it with the comment in HEAD.


>>+  *the original length of the
>>+ * block without its page hole being deducible from the compressed
>>+ data
>>+ * itself.
> IIUC, this comment before XLogRecordBlockImageHeader seems to be no
> longer valid as original length is not deducible from compressed data
> and rather stored in header.
Aah, true. This was originally present in the header of PGLZ that has been removed to make it available for frontends.

Updated patches are attached.
Regards,
--
Michael

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

Support-compression-for-full-page-writes-in-WAL_v15.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

05 February 2015, 18:42:50

Fujii Masao wrote:
> I wrote
>> This is an inspiration from lz4 APIs. Wouldn't it be buggy for a
>> compression algorithm to return a size of 0 bytes as compressed or
>> decompressed length btw? We could as well make it return a negative
>> value when a failure occurs if you feel more comfortable with it.
>
> I feel that's better. Attached is the updated version of the patch.
> I changed the pg_lzcompress and pg_lzdecompress so that they return -1
> when failure happens. Also I applied some cosmetic changes to the patch
> (e.g., shorten the long name of the newly-added macros).
> Barring any objection, I will commit this.

I just had a look at your updated version, ran some sanity tests, and
things look good from me. The new names of the macros at the top of
tuptoaster.c are clearer as well.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

05 February 2015, 19:15:47

On Thu, Feb 5, 2015 at 11:06 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
>>/*
>>+    * We recheck the actual size even if pglz_compress() report success,
>>+    * because it might be satisfied with having saved as little as one byte
>>+    * in the compressed data.
>>+    */
>>+   *len = (uint16) compressed_len;
>>+   if (*len >= orig_len - 1)
>>+       return false;
>>+   return true;
>>+}
>
> As per latest code ,when compression is 'on' we introduce additional 2 bytes in the header of each block image for
storingraw_length of the compressed block. 
> In order to achieve compression while accounting for these two additional bytes, we must ensure that compressed
lengthis less than original length - 2. 
> So , IIUC the above condition should rather be
>
> If (*len >= orig_len -2 )
>                 return false;
> return true;
> The attached patch contains this. It also has a cosmetic change-  renaming compressBuf to uncompressBuf as it is used
tostore uncompressed page. 

Agreed on both things.

Just looking at your latest patch after some time to let it cool down,
I noticed a couple of things.

 #define MaxSizeOfXLogRecordBlockHeader \
     (SizeOfXLogRecordBlockHeader + \
-     SizeOfXLogRecordBlockImageHeader + \
+     SizeOfXLogRecordBlockImageHeader, \
+     SizeOfXLogRecordBlockImageCompressionInfo + \
There is a comma here instead of a sum sign. We should really sum up
all those sizes to evaluate the maximum size of a block header.

+     * Permanently allocate readBuf uncompressBuf.  We do it this way,
+     * rather than just making a static array, for two reasons:
This comment is just but weird, "readBuf AND uncompressBuf" is more appropriate.

+     * We recheck the actual size even if pglz_compress() report success,
+     * because it might be satisfied with having saved as little as one byte
+     * in the compressed data. We add two bytes to store raw_length  with the
+     * compressed image. So for compression to be effective
compressed_len should
+     * be atleast < orig_len - 2
This comment block should be reworked, and misses a dot at its end. I
rewrote it like that, hopefully that's clearer:
+       /*
+        * We recheck the actual size even if pglz_compress() reports
success and see
+        * if at least 2 bytes of length have been saved, as this
corresponds to the
+        * additional amount of data stored in WAL record for a compressed block
+        * via raw_length.
+        */

In any case, those things have been introduced by what I did in
previous versions... And attached is a new patch.
--
Michael

Attachment

Support-compression-for-full-page-writes-in-WAL_v16.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

06 February 2015, 11:03:33

On Fri, Feb 6, 2015 at 4:15 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 5, 2015 at 11:06 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
>>>/*
>>>+    * We recheck the actual size even if pglz_compress() report success,
>>>+    * because it might be satisfied with having saved as little as one byte
>>>+    * in the compressed data.
>>>+    */
>>>+   *len = (uint16) compressed_len;
>>>+   if (*len >= orig_len - 1)
>>>+       return false;
>>>+   return true;
>>>+}
>>
>> As per latest code ,when compression is 'on' we introduce additional 2 bytes in the header of each block image for
storingraw_length of the compressed block.

>> In order to achieve compression while accounting for these two additional bytes, we must ensure that compressed
lengthis less than original length - 2.

>> So , IIUC the above condition should rather be
>>
>> If (*len >= orig_len -2 )
>>                 return false;

"2" should be replaced with the macro variable indicating the size of
extra header for compressed backup block.

Do we always need extra two bytes for compressed backup block?
ISTM that extra bytes are not necessary when the hole length is zero.
In this case the length of the original backup block (i.e., uncompressed)
must be BLCKSZ, so we don't need to save the original size in
the extra bytes.

Furthermore, when fpw compression is disabled and the hole length
is zero, we seem to be able to save one byte from the header of
backup block. Currently we use 4 bytes for the header, 2 bytes for
the length of backup block, 15 bits for the hole offset and 1 bit for
the flag indicating whether block is compressed or not. But in that case,
the length of backup block doesn't need to be stored because it must
be BLCKSZ. Shouldn't we optimize the header in this way? Thought?

+                int page_len = BLCKSZ - hole_length;
+                char *scratch_buf;
+                if (hole_length != 0)
+                {
+                    scratch_buf = compression_scratch;
+                    memcpy(scratch_buf, page, hole_offset);
+                    memcpy(scratch_buf + hole_offset,
+                           page + (hole_offset + hole_length),
+                           BLCKSZ - (hole_length + hole_offset));
+                }
+                else
+                    scratch_buf = page;
+
+                /* Perform compression of block */
+                if (XLogCompressBackupBlock(scratch_buf,
+                                            page_len,
+                                            regbuf->compressed_page,
+                                            &compress_len))
+                {
+                    /* compression is done, add record */
+                    is_compressed = true;
+                }

You can refactor XLogCompressBackupBlock() and move all the
above code to it for more simplicity.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

06 February 2015, 12:30:29

On Fri, Feb 6, 2015 at 3:03 PM, Fujii Masao wrote:
> Do we always need extra two bytes for compressed backup block?
> ISTM that extra bytes are not necessary when the hole length is zero.
> In this case the length of the original backup block (i.e., uncompressed)
> must be BLCKSZ, so we don't need to save the original size in
> the extra bytes.

Yes, we would need a additional bit to identify that. We could steal
it from length in XLogRecordBlockImageHeader.

> Furthermore, when fpw compression is disabled and the hole length
> is zero, we seem to be able to save one byte from the header of
> backup block. Currently we use 4 bytes for the header, 2 bytes for
> the length of backup block, 15 bits for the hole offset and 1 bit for
> the flag indicating whether block is compressed or not. But in that case,
> the length of backup block doesn't need to be stored because it must
> be BLCKSZ. Shouldn't we optimize the header in this way? Thought?

If we do it, that's something to tackle even before this patch on
HEAD, because you could use the 16th bit of the first 2 bytes of
XLogRecordBlockImageHeader to do necessary sanity checks, to actually
not reduce record by 1 byte, but 2 bytes as hole-related data is not
necessary. I imagine that a patch optimizing that wouldn't be that
hard to write as well.

> +                int page_len = BLCKSZ - hole_length;
> +                char *scratch_buf;
> +                if (hole_length != 0)
> +                {
> +                    scratch_buf = compression_scratch;
> +                    memcpy(scratch_buf, page, hole_offset);
> +                    memcpy(scratch_buf + hole_offset,
> +                           page + (hole_offset + hole_length),
> +                           BLCKSZ - (hole_length + hole_offset));
> +                }
> +                else
> +                    scratch_buf = page;
> +
> +                /* Perform compression of block */
> +                if (XLogCompressBackupBlock(scratch_buf,
> +                                            page_len,
> +                                            regbuf->compressed_page,
> +                                            &compress_len))
> +                {
> +                    /* compression is done, add record */
> +                    is_compressed = true;
> +                }
>
> You can refactor XLogCompressBackupBlock() and move all the
> above code to it for more simplicity.

Sure.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

06 February 2015, 12:48:49

On Fri, Feb 6, 2015 at 4:30 PM, Michael Paquier wrote:
> On Fri, Feb 6, 2015 at 3:03 PM, Fujii Masao wrote:
>> Do we always need extra two bytes for compressed backup block?
>> ISTM that extra bytes are not necessary when the hole length is zero.
>> In this case the length of the original backup block (i.e., uncompressed)
>> must be BLCKSZ, so we don't need to save the original size in
>> the extra bytes.
>
> Yes, we would need a additional bit to identify that. We could steal
> it from length in XLogRecordBlockImageHeader.
>
>> Furthermore, when fpw compression is disabled and the hole length
>> is zero, we seem to be able to save one byte from the header of
>> backup block. Currently we use 4 bytes for the header, 2 bytes for
>> the length of backup block, 15 bits for the hole offset and 1 bit for
>> the flag indicating whether block is compressed or not. But in that case,
>> the length of backup block doesn't need to be stored because it must
>> be BLCKSZ. Shouldn't we optimize the header in this way? Thought?
>
> If we do it, that's something to tackle even before this patch on
> HEAD, because you could use the 16th bit of the first 2 bytes of
> XLogRecordBlockImageHeader to do necessary sanity checks, to actually
> not reduce record by 1 byte, but 2 bytes as hole-related data is not
> necessary. I imagine that a patch optimizing that wouldn't be that
> hard to write as well.

Actually, as Heikki pointed me out... A block image is 8k and pages
without holes are rare, so it may be not worth sacrificing code
simplicity for record reduction at the order of 0.1% or smth like
that, and the current patch is light because it keeps things simple.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

06 February 2015, 14:35:27

>In any case, those things have been introduced by what I did in previous versions... And attached is a new patch.
Thank you for feedback.

>    /* allocate scratch buffer used for compression of block images */
>+    if (compression_scratch == NULL)
>+        compression_scratch = MemoryContextAllocZero(xloginsert_cxt,
>+                                                     BLCKSZ);
 >}
The compression patch can  use the latest interface MemoryContextAllocExtended to proceed without compression when
sufficientmemory is not available for  
scratch buffer.
The attached patch introduces OutOfMem flag which is set on when MemoryContextAllocExtended returns NULL .

Thank you,
Rahila Syed

-----Original Message-----
From: Michael Paquier [mailto:michael.paquier@gmail.com]
Sent: Friday, February 06, 2015 12:46 AM
To: Syed, Rahila
Cc: PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Thu, Feb 5, 2015 at 11:06 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
>>/*
>>+    * We recheck the actual size even if pglz_compress() report success,
>>+    * because it might be satisfied with having saved as little as one byte
>>+    * in the compressed data.
>>+    */
>>+   *len = (uint16) compressed_len;
>>+   if (*len >= orig_len - 1)
>>+       return false;
>>+   return true;
>>+}
>
> As per latest code ,when compression is 'on' we introduce additional 2 bytes in the header of each block image for
storingraw_length of the compressed block. 
> In order to achieve compression while accounting for these two additional bytes, we must ensure that compressed
lengthis less than original length - 2. 
> So , IIUC the above condition should rather be
>
> If (*len >= orig_len -2 )
>                 return false;
> return true;
> The attached patch contains this. It also has a cosmetic change-  renaming compressBuf to uncompressBuf as it is used
tostore uncompressed page. 

Agreed on both things.

Just looking at your latest patch after some time to let it cool down, I noticed a couple of things.

 #define MaxSizeOfXLogRecordBlockHeader \
     (SizeOfXLogRecordBlockHeader + \
-     SizeOfXLogRecordBlockImageHeader + \
+     SizeOfXLogRecordBlockImageHeader, \
+     SizeOfXLogRecordBlockImageCompressionInfo + \
There is a comma here instead of a sum sign. We should really sum up all those sizes to evaluate the maximum size of a
blockheader. 

+     * Permanently allocate readBuf uncompressBuf.  We do it this way,
+     * rather than just making a static array, for two reasons:
This comment is just but weird, "readBuf AND uncompressBuf" is more appropriate.

+     * We recheck the actual size even if pglz_compress() report success,
+     * because it might be satisfied with having saved as little as one byte
+     * in the compressed data. We add two bytes to store raw_length  with the
+     * compressed image. So for compression to be effective
compressed_len should
+     * be atleast < orig_len - 2
This comment block should be reworked, and misses a dot at its end. I rewrote it like that, hopefully that's clearer:
+       /*
+        * We recheck the actual size even if pglz_compress() reports
success and see
+        * if at least 2 bytes of length have been saved, as this
corresponds to the
+        * additional amount of data stored in WAL record for a compressed block
+        * via raw_length.
+        */

In any case, those things have been introduced by what I did in previous versions... And attached is a new patch.
--
Michael

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

Support-compression-for-full-page-writes-in-WAL_v17.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

06 February 2015, 15:21:30

On Fri, Feb 6, 2015 at 6:35 PM, Syed, Rahila wrote:
> The compression patch can  use the latest interface MemoryContextAllocExtended to proceed without compression when
sufficientmemory is not available for
 
> scratch buffer.
> The attached patch introduces OutOfMem flag which is set on when MemoryContextAllocExtended returns NULL .

TBH, I don't think that brings much as this allocation is done once
and process would surely fail before reaching the first code path
doing a WAL record insertion. In any case, OutOfMem is useless, you
could simply check if compression_scratch is NULL when assembling a
record.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

09 February 2015, 06:18:15

On Fri, Feb 6, 2015 at 3:42 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Fujii Masao wrote:
>> I wrote
>>> This is an inspiration from lz4 APIs. Wouldn't it be buggy for a
>>> compression algorithm to return a size of 0 bytes as compressed or
>>> decompressed length btw? We could as well make it return a negative
>>> value when a failure occurs if you feel more comfortable with it.
>>
>> I feel that's better. Attached is the updated version of the patch.
>> I changed the pg_lzcompress and pg_lzdecompress so that they return -1
>> when failure happens. Also I applied some cosmetic changes to the patch
>> (e.g., shorten the long name of the newly-added macros).
>> Barring any objection, I will commit this.
>
> I just had a look at your updated version, ran some sanity tests, and
> things look good from me. The new names of the macros at the top of
> tuptoaster.c are clearer as well.

Thanks for the review! Pushed!

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

09 February 2015, 13:28:17

Hello,

>> Do we always need extra two bytes for compressed backup block?
>> ISTM that extra bytes are not necessary when the hole length is zero.
>> In this case the length of the original backup block (i.e.,
>> uncompressed) must be BLCKSZ, so we don't need to save the original
>> size in the extra bytes.

>Yes, we would need a additional bit to identify that. We could steal it from length in XLogRecordBlockImageHeader.

This is implemented in the attached patch by dividing length field as follows,
    uint16    length:15,
        with_hole:1;

>"2" should be replaced with the macro variable indicating the size of
>extra header for compressed backup block.
Macro SizeOfXLogRecordBlockImageCompressionInfo is used instead of 2

>You can refactor XLogCompressBackupBlock() and move all the
>above code to it for more simplicity
This is also implemented in the patch attached.

Thank you,
Rahila Syed


-----Original Message-----
From: Michael Paquier [mailto:michael.paquier@gmail.com]
Sent: Friday, February 06, 2015 6:00 PM
To: Fujii Masao
Cc: Syed, Rahila; PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Fri, Feb 6, 2015 at 3:03 PM, Fujii Masao wrote:
> Do we always need extra two bytes for compressed backup block?
> ISTM that extra bytes are not necessary when the hole length is zero.
> In this case the length of the original backup block (i.e.,
> uncompressed) must be BLCKSZ, so we don't need to save the original
> size in the extra bytes.

Yes, we would need a additional bit to identify that. We could steal it from length in XLogRecordBlockImageHeader.

> Furthermore, when fpw compression is disabled and the hole length is
> zero, we seem to be able to save one byte from the header of backup
> block. Currently we use 4 bytes for the header, 2 bytes for the length
> of backup block, 15 bits for the hole offset and 1 bit for the flag
> indicating whether block is compressed or not. But in that case, the
> length of backup block doesn't need to be stored because it must be
> BLCKSZ. Shouldn't we optimize the header in this way? Thought?

If we do it, that's something to tackle even before this patch on HEAD, because you could use the 16th bit of the first
2bytes of XLogRecordBlockImageHeader to do necessary sanity checks, to actually not reduce record by 1 byte, but 2
bytesas hole-related data is not necessary. I imagine that a patch optimizing that wouldn't be that hard to write as
well.

> +                int page_len = BLCKSZ - hole_length;
> +                char *scratch_buf;
> +                if (hole_length != 0)
> +                {
> +                    scratch_buf = compression_scratch;
> +                    memcpy(scratch_buf, page, hole_offset);
> +                    memcpy(scratch_buf + hole_offset,
> +                           page + (hole_offset + hole_length),
> +                           BLCKSZ - (hole_length + hole_offset));
> +                }
> +                else
> +                    scratch_buf = page;
> +
> +                /* Perform compression of block */
> +                if (XLogCompressBackupBlock(scratch_buf,
> +                                            page_len,
> +                                            regbuf->compressed_page,
> +                                            &compress_len))
> +                {
> +                    /* compression is done, add record */
> +                    is_compressed = true;
> +                }
>
> You can refactor XLogCompressBackupBlock() and move all the above code
> to it for more simplicity.

Sure.
--
Michael

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

Support-compression-for-full-page-writes-in-WAL_v17.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

10 February 2015, 00:58:03

On Mon, Feb 9, 2015 at 10:27 PM, Syed, Rahila wrote:
> (snip)

Thanks for showing up here! I have not tested the test the patch,
those comments are based on what I read from v17.

>>> Do we always need extra two bytes for compressed backup block?
>>> ISTM that extra bytes are not necessary when the hole length is zero.
>>> In this case the length of the original backup block (i.e.,
>>> uncompressed) must be BLCKSZ, so we don't need to save the original
>>> size in the extra bytes.
>
>>Yes, we would need a additional bit to identify that. We could steal it from length in XLogRecordBlockImageHeader.
>
> This is implemented in the attached patch by dividing length field as follows,
>         uint16  length:15,
>                 with_hole:1;

IMO, we should add details about how this new field is used in the
comments on top of XLogRecordBlockImageHeader, meaning that when a
page hole is present we use the compression info structure and when
there is no hole, we are sure that the FPW raw length is BLCKSZ
meaning that the two bytes of the CompressionInfo stuff is
unnecessary.

>
>>"2" should be replaced with the macro variable indicating the size of
>>extra header for compressed backup block.
> Macro SizeOfXLogRecordBlockImageCompressionInfo is used instead of 2
>
>>You can refactor XLogCompressBackupBlock() and move all the
>>above code to it for more simplicity
> This is also implemented in the patch attached.

This portion looks correct to me.

A couple of other comments:
1) Nitpicky but, code format is sometimes strange.
For example here you should not have a space between the function
definition and the variable declarations:
+{
+
+    int orig_len = BLCKSZ - hole_length;
This is as well incorrect in two places:
if(hole_length != 0)
There should be a space between the if and its condition in parenthesis.
2) For correctness with_hole should be set even for uncompressed
pages. I think that we should as well use it for sanity checks in
xlogreader.c when decoding records.

Regards,
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

10 February 2015, 07:46:13

Hello,

A bug had been introduced in the latest versions of the patch. The order of parameters passed to pglz_decompress was
wrong.
Please find attached patch with following correction,

Original code,
+        if (pglz_decompress(block_image, record->uncompressBuf,
+                            bkpb->bkp_len, bkpb->bkp_uncompress_len) == 0)
Correction
+        if (pglz_decompress(block_image, bkpb->bkp_len,
+                            record->uncompressBuf, bkpb->bkp_uncompress_len) == 0)


>For example here you should not have a space between the function definition and the variable declarations:
>+{
>+
>+    int orig_len = BLCKSZ - hole_length;
>This is as well incorrect in two places:
>if(hole_length != 0)
>There should be a space between the if and its condition in parenthesis.

Also corrected above code format mistakes.

Thank you,
Rahila Syed

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Syed, Rahila
Sent: Monday, February 09, 2015 6:58 PM
To: Michael Paquier; Fujii Masao
Cc: PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

Hello,

>> Do we always need extra two bytes for compressed backup block?
>> ISTM that extra bytes are not necessary when the hole length is zero.
>> In this case the length of the original backup block (i.e.,
>> uncompressed) must be BLCKSZ, so we don't need to save the original
>> size in the extra bytes.

>Yes, we would need a additional bit to identify that. We could steal it from length in XLogRecordBlockImageHeader.

This is implemented in the attached patch by dividing length field as follows,
    uint16    length:15,
        with_hole:1;

>"2" should be replaced with the macro variable indicating the size of
>extra header for compressed backup block.
Macro SizeOfXLogRecordBlockImageCompressionInfo is used instead of 2

>You can refactor XLogCompressBackupBlock() and move all the above code
>to it for more simplicity
This is also implemented in the patch attached.

Thank you,
Rahila Syed


-----Original Message-----
From: Michael Paquier [mailto:michael.paquier@gmail.com]
Sent: Friday, February 06, 2015 6:00 PM
To: Fujii Masao
Cc: Syed, Rahila; PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Fri, Feb 6, 2015 at 3:03 PM, Fujii Masao wrote:
> Do we always need extra two bytes for compressed backup block?
> ISTM that extra bytes are not necessary when the hole length is zero.
> In this case the length of the original backup block (i.e.,
> uncompressed) must be BLCKSZ, so we don't need to save the original
> size in the extra bytes.

Yes, we would need a additional bit to identify that. We could steal it from length in XLogRecordBlockImageHeader.

> Furthermore, when fpw compression is disabled and the hole length is
> zero, we seem to be able to save one byte from the header of backup
> block. Currently we use 4 bytes for the header, 2 bytes for the length
> of backup block, 15 bits for the hole offset and 1 bit for the flag
> indicating whether block is compressed or not. But in that case, the
> length of backup block doesn't need to be stored because it must be
> BLCKSZ. Shouldn't we optimize the header in this way? Thought?

If we do it, that's something to tackle even before this patch on HEAD, because you could use the 16th bit of the first
2bytes of XLogRecordBlockImageHeader to do necessary sanity checks, to actually not reduce record by 1 byte, but 2
bytesas hole-related data is not necessary. I imagine that a patch optimizing that wouldn't be that hard to write as
well.

> +                int page_len = BLCKSZ - hole_length;
> +                char *scratch_buf;
> +                if (hole_length != 0)
> +                {
> +                    scratch_buf = compression_scratch;
> +                    memcpy(scratch_buf, page, hole_offset);
> +                    memcpy(scratch_buf + hole_offset,
> +                           page + (hole_offset + hole_length),
> +                           BLCKSZ - (hole_length + hole_offset));
> +                }
> +                else
> +                    scratch_buf = page;
> +
> +                /* Perform compression of block */
> +                if (XLogCompressBackupBlock(scratch_buf,
> +                                            page_len,
> +                                            regbuf->compressed_page,
> +                                            &compress_len))
> +                {
> +                    /* compression is done, add record */
> +                    is_compressed = true;
> +                }
>
> You can refactor XLogCompressBackupBlock() and move all the above code
> to it for more simplicity.

Sure.
--
Michael

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence for the sole use of the addressee and may
containlegally privileged, confidential, and proprietary data. If you are not the intended recipient, please advise the
senderby replying promptly to this email and then delete and destroy this email and any attachments without any further
use,copying or forwarding. 

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

Support-compression-for-full-page-writes-in-WAL_v17.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

11 February 2015, 14:03:33

>IMO, we should add details about how this new field is used in the comments on top of XLogRecordBlockImageHeader,
meaningthat when a page hole is present we use the compression info structure and when there is no hole, we are sure
thatthe FPW raw length is BLCKSZ meaning that the two bytes of the CompressionInfo stuff is unnecessary. 
This comment is included in the patch attached.

> For correctness with_hole should be set even for uncompressed pages. I think that we should as well use it for sanity
checksin xlogreader.c when decoding records. 
This change is made in the attached patch. Following sanity checks have been added in xlogreader.c

if (!(blk->with_hole) && blk->hole_offset != 0 || blk->with_hole && blk->hole_offset <= 0))

if (blk->with_hole && blk->bkp_len >= BLCKSZ)

if (!(blk->with_hole) && blk->bkp_len != BLCKSZ)

Thank you,
Rahila Syed



______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

Support-compression-for-full-page-writes-in-WAL_v18.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

12 February 2015, 07:26:54

On Wed, Feb 11, 2015 at 11:03 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:

>IMO, we should add details about how this new field is used in the comments on top of XLogRecordBlockImageHeader, meaning that when a page hole is present we use the compression info structure and when there is no hole, we are sure that the FPW raw length is BLCKSZ meaning that the two bytes of the CompressionInfo stuff is unnecessary.
This comment is included in the patch attached.

> For correctness with_hole should be set even for uncompressed pages. I think that we should as well use it for sanity checks in xlogreader.c when decoding records.
This change is made in the attached patch. Following sanity checks have been added in xlogreader.c

if (!(blk->with_hole) && blk->hole_offset != 0 || blk->with_hole && blk->hole_offset <= 0))

if (blk->with_hole && blk->bkp_len >= BLCKSZ)

if (!(blk->with_hole) && blk->bkp_len != BLCKSZ)

Cool, thanks!

This patch fails to compile:
xlogreader.c:1049:46: error: extraneous ')' after condition, expected a statement
blk->with_hole && blk->hole_offset <= 0))

Note as well that at least clang does not like much how the sanity check with with_hole are done. You should place parentheses around the '&&' expressions. Also, I would rather define with_hole == 0 or with_hole == 1 explicitly int those checks.

There is a typo:

s/true,see/true, see/

[nitpicky]Be as well aware of the 80-character limit per line that is usually normally by comment blocks.[/]

+ * "with_hole" is used to identify the presence of hole in a block.
+ * As mentioned above, length of block cannnot be more than 15-bit long.
+ * So, the free bit in the length field is used by "with_hole" to identify presence of
+ * XLogRecordBlockImageCompressionInfo. If hole is not present ,the raw size of
+ * a compressed block is equal to BLCKSZ therefore XLogRecordBlockImageCompressionInfo
+ * for the corresponding compressed block need not be stored in header.
+ * If hole is present raw size is stored.

I would rewrite this paragraph as follows, fixing the multiple typos:

"with_hole" is used to identify the presence of a hole in a block image. As the length of a block cannot be more than 15-bit long, the extra bit in the length field is used for this identification purpose. If the block image has no hole, it is ensured that the raw size of a compressed block image is equal to BLCKSZ, hence the contents of XLogRecordBlockImageCompressionInfo are not necessary.

+ /* Followed by the data related to compression if block is compressed */

This comment needs to be updated to "if block image is compressed and has a hole".

+   lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h) and
+   XLogRecordBlockImageHeader where page hole offset and length is limited to 15-bit
+   length (see src/include/access/xlogrecord.h).

80-character limit...

Regards

Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

12 February 2015, 11:08:39

Thank you for comments. Please find attached the updated patch.

>This patch fails to compile:
>xlogreader.c:1049:46: error: extraneous ')' after condition, expected a statement
> blk->with_hole && blk->hole_offset <= 0))

This has been rectified.

>Note as well that at least clang does not like much how the sanity check with with_hole are done. You should place parentheses around the '&&' expressions. Also, I would rather define with_hole == 0 or with_hole == 1 explicitly int those checks

The expressions are modified accordingly.

>There is a typo:

>s/true,see/true, see/

>[nitpicky]Be as well aware of the 80-character limit per line that is usually normally by comment blocks.[/]

Have corrected the typos and changed the comments as mentioned. Also , realigned certain lines to meet the 80-char limit.

Thank you,

Rahila Syed

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

Support-compression-for-full-page-writes-in-WAL_v18.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

13 February 2015, 05:47:17

On Thu, Feb 12, 2015 at 8:08 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
>
>
>
> Thank you for comments. Please find attached the updated patch.
>
>
>
> >This patch fails to compile:
> >xlogreader.c:1049:46: error: extraneous ')' after condition, expected a statement
> > blk->with_hole && blk->hole_offset <= 0))
>
> This has been rectified.
>
>
>
> >Note as well that at least clang does not like much how the sanity check with with_hole are done. You should place parentheses around the '&&' expressions. Also, I would rather define with_hole == 0 or with_hole == 1 explicitly int those checks
>
> The expressions are modified accordingly.
>
>
>
> >There is a typo:
>
> >s/true,see/true, see/
>
> >[nitpicky]Be as well aware of the 80-character limit per line that is usually normally by comment blocks.[/]
>
>
>
> Have corrected the typos and changed the comments as mentioned. Also , realigned certain lines to meet the 80-char limit.

Thanks for the updated patch.

+ /* leave if data cannot be compressed */
+ if (compressed_len == 0)
+ return false;
This should be < 0, pglz_compress returns -1 when compression fails.

+ if (pglz_decompress(block_image, bkpb->bkp_len, record->uncompressBuf,
+ bkpb->bkp_uncompress_len) == 0)
Similarly, this should be < 0.

Regarding the sanity checks that have been added recently. I think that they are useful but I am suspecting as well that only a check on the record CRC is done because that's reliable enough and not doing those checks accelerates a bit replay. So I am thinking that we should simply replace them by assertions.

I have as well re-run my small test case, with the following results (scripts and results attached)
=# select test, user_diff,system_diff, pg_size_pretty(pre_update - pre_insert),
pg_size_pretty(post_update - pre_update) from results;
test | user_diff | system_diff | pg_size_pretty | pg_size_pretty
---------+-----------+-------------+----------------+----------------
FPW on | 46.134564 | 0.823306 | 429 MB | 566 MB
FPW on | 16.307575 | 0.798591 | 171 MB | 229 MB
FPW on | 8.325136 | 0.848390 | 86 MB | 116 MB
FPW off | 29.992383 | 1.100458 | 440 MB | 746 MB
FPW off | 12.237578 | 1.027076 | 171 MB | 293 MB
FPW off | 6.814926 | 0.931624 | 86 MB | 148 MB
HEAD | 26.590816 | 1.159255 | 440 MB | 746 MB
HEAD | 11.620359 | 0.990851 | 171 MB | 293 MB
HEAD | 6.300401 | 0.904311 | 86 MB | 148 MB
(9 rows)
The level of compression reached is the same as previous mark, 566MB for the case of fillfactor=50 (CAB7nPqSc97o-UE5paxfMUKWcxE_JioyxO1M4A0pMnmYqAnec2g@mail.gmail.com) with a similar CPU usage.

Once we get those small issues fixes, I think that it is with having a committer look at this patch, presumably Fujii-san.
Regards,
--
Michael

Attachment

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

16 February 2015, 11:30:38

Hello,

Thank you for reviewing and testing the patch.

>+ /* leave if data cannot be compressed */
>+ if (compressed_len == 0)
>+ return false;
>This should be < 0, pglz_compress returns -1 when compression fails.
>
>+ if (pglz_decompress(block_image, bkpb->bkp_len, record->uncompressBuf,
>+ bkpb->bkp_uncompress_len) == 0)
>Similarly, this should be < 0.

These have been corrected in the attached.

>Regarding the sanity checks that have been added recently. I think that they are useful but I am suspecting as well that only a check on the record CRC is done because that's reliable enough and not doing those checks accelerates a bit replay. So I am thinking that we should simply replace >them by assertions.

Removing the checks makes sense as CRC ensures correctness . Moreover ,as error message for invalid length of record is present in the code , messages for invalid block length can be redundant.

Checks have been replaced by assertions in the attached patch.

Following if condition in XLogCompressBackupBlock has been modified as follows

+ * We recheck the actual size even if pglz_compress() reports success and

+ * see if at least 2 bytes of length have been saved, as this corresponds

+ * to the additional amount of data stored in WAL record for a compressed

+ * block via raw_length when block contains hole..

+ */

+ *len = (uint16) compressed_len;

+ if (*len >= orig_len - SizeOfXLogRecordBlockImageCompressionInfo)

+ return false;

+ return true;

Current

if ((hole_length != 0) &&

+ (*len >= orig_len - SizeOfXLogRecordBlockImageCompressionInfo))

+ return false;

+return true

This is because the extra information raw_length is included only if compressed block has hole in it.

>Once we get those small issues fixes, I think that it is with having a committer look at this patch, presumably Fujii-san

Agree. I will mark this patch as ready for committer

Thank you,

Rahila Syed

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

Support-compression-for-full-page-writes-in-WAL_v19.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

16 February 2015, 11:55:28

On Mon, Feb 16, 2015 at 8:30 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:

Regarding the sanity checks that have been added recently. I think that they are useful but I am suspecting as well that only a check on the record CRC is done because that's reliable enough and not doing those checks accelerates a bit replay. So I am thinking that we should simply replace >them by assertions.
Removing the checks makes sense as CRC ensures correctness . Moreover ,as error message for invalid length of record is present in the code , messages for invalid block length can be redundant.
Checks have been replaced by assertions in the attached patch.

After more thinking, we may as well simply remove them, an error with CRC having high chances to complain before reaching this point...

Current
if ((hole_length != 0) &&
+ (*len >= orig_len - SizeOfXLogRecordBlockImageCompressionInfo))
+ return false;
+return true

This makes sense.

Nitpicking 1:
+ Assert(!(blk->with_hole == 1 && blk->hole_offset <= 0));

Double-space here.

Nitpicking 2:
char * page

This should be rewritten as char *page, the "*" being assigned with the variable name.
--

Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

16 February 2015, 11:55:49

On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote:
> - * As a trivial form of data compression, the XLOG code is aware that
> - * PG data pages usually contain an unused "hole" in the middle, which
> - * contains only zero bytes.  If hole_length > 0 then we have removed
> - * such a "hole" from the stored data (and it's not counted in the
> - * XLOG record's CRC, either).  Hence, the amount of block data actually
> - * present is BLCKSZ - hole_length bytes.
> + * Block images are able to do several types of compression:
> + * - When wal_compression is off, as a trivial form of compression, the
> + * XLOG code is aware that PG data pages usually contain an unused "hole"
> + * in the middle, which contains only zero bytes.  If length < BLCKSZ
> + * then we have removed such a "hole" from the stored data (and it is
> + * not counted in the XLOG record's CRC, either).  Hence, the amount
> + * of block data actually present is "length" bytes.  The hole "offset"
> + * on page is defined using "hole_offset".
> + * - When wal_compression is on, block images are compressed using a
> + * compression algorithm without their hole to improve compression
> + * process of the page. "length" corresponds in this case to the length
> + * of the compressed block. "hole_offset" is the hole offset of the page,
> + * and the length of the uncompressed block is defined by "raw_length",
> + * whose data is included in the record only when compression is enabled
> + * and "with_hole" is set to true, see below.
> + *
> + * "is_compressed" is used to identify if a given block image is compressed
> + * or not. Maximum page size allowed on the system being 32k, the hole
> + * offset cannot be more than 15-bit long so the last free bit is used to
> + * store the compression state of block image. If the maximum page size
> + * allowed is increased to a value higher than that, we should consider
> + * increasing this structure size as well, but this would increase the
> + * length of block header in WAL records with alignment.
> + *
> + * "with_hole" is used to identify the presence of a hole in a block image.
> + * As the length of a block cannot be more than 15-bit long, the extra bit in
> + * the length field is used for this identification purpose. If the block image
> + * has no hole, it is ensured that the raw size of a compressed block image is
> + * equal to BLCKSZ, hence the contents of XLogRecordBlockImageCompressionInfo
> + * are not necessary.
>   */
>  typedef struct XLogRecordBlockImageHeader
>  {
> -    uint16        hole_offset;    /* number of bytes before "hole" */
> -    uint16        hole_length;    /* number of bytes in "hole" */
> +    uint16    length:15,            /* length of block data in record */
> +            with_hole:1;        /* status of hole in the block */
> +
> +    uint16    hole_offset:15,    /* number of bytes before "hole" */
> +        is_compressed:1;    /* compression status of image */
> +
> +    /* Followed by the data related to compression if block is compressed */
>  } XLogRecordBlockImageHeader;

Yikes, this is ugly.

I think we should change the xlog format so that the block_id (which
currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't
the block id but something like XLR_CHUNK_ID. Which is used as is for
XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to
XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the
block id following the chunk id.

Yes, that'll increase the amount of data for a backup block by 1 byte,
but I think that's worth it. I'm pretty sure we will be happy about the
added extensibility pretty soon.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

16 February 2015, 12:08:20

On 2015-02-16 20:55:20 +0900, Michael Paquier wrote:
> On Mon, Feb 16, 2015 at 8:30 PM, Syed, Rahila <Rahila.Syed@nttdata.com>
> wrote:
> 
> >
> > Regarding the sanity checks that have been added recently. I think that
> > they are useful but I am suspecting as well that only a check on the record
> > CRC is done because that's reliable enough and not doing those checks
> > accelerates a bit replay. So I am thinking that we should simply replace
> > >them by assertions.
> >
> > Removing the checks makes sense as CRC ensures correctness . Moreover ,as
> > error message for invalid length of record is present in the code ,
> > messages for invalid block length can be redundant.
> >
> > Checks have been replaced by assertions in the attached patch.
> >
> 
> After more thinking, we may as well simply remove them, an error with CRC
> having high chances to complain before reaching this point...

Surely not. The existing code explicitly does it like   if (blk->has_data && blk->data_len == 0)
report_invalid_record(state,                            "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
                   (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
 
these cross checks are important. And I see no reason to deviate from
that. The CRC sum isn't foolproof - we intentionally do checks at
several layers. And, as you can see from some other locations, we
actually try to *not* fatally error out when hitting them at times - so
an Assert also is wrong.

Heikki:       /* cross-check that the HAS_DATA flag is set iff data_length > 0 */       if (blk->has_data &&
blk->data_len== 0)               report_invalid_record(state,                         "BKPBLOCK_HAS_DATA set, but no
dataincluded at %X/%X",                                                         (uint32) (state->ReadRecPtr >> 32),
(uint32)state->ReadRecPtr);       if (!blk->has_data && blk->data_len != 0)               report_invalid_record(state,
             "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
         (unsigned int) blk->data_len,                                  (uint32) (state->ReadRecPtr >> 32), (uint32)
state->ReadRecPtr);
those look like they're missing a goto err; to me.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

18 February 2015, 14:26:53

Hello,

>I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a
actualblock id) isn't the block id but something like XLR_CHUNK_ID. Which is used as is for
XLR_CHUNK_ID_DATA_SHORT/LONG,but for backup blocks can be set to to >XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
XLR_CHUNK_BKP_REFERENCE...The BKP blocks will then follow, storing the block id following the chunk id. 

>Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we
willbe happy about the added extensibility pretty soon. 

To clarify my understanding of the above change,

Instead of a block id to reference different fragments of an xlog record , a single byte field "chunk_id"  should be
used. chunk_id  will be same as XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments.  
But for block references, it will take store following values in order to store information about the backup blocks.
#define XLR_CHUNK_BKP_COMPRESSED  0x01
#define XLR_CHUNK_BKP_WITH_HOLE     0x02
...

The new xlog format should look like follows,

Fixed-size header (XLogRecord struct)
Chunk_id(add a field before id field in XLogRecordBlockHeader struct)
XLogRecordBlockHeader
Chunk_idXLogRecordBlockHeader         ...                               ...
Chunk_id ( rename id field of the XLogRecordDataHeader struct)
XLogRecordDataHeader[Short|Long] block datablock data...main data

I will post a patch based on this.

Thank you,
Rahila Syed

-----Original Message-----
From: Andres Freund [mailto:andres@2ndquadrant.com]
Sent: Monday, February 16, 2015 5:26 PM
To: Syed, Rahila
Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote:
> - * As a trivial form of data compression, the XLOG code is aware that
> - * PG data pages usually contain an unused "hole" in the middle,
> which
> - * contains only zero bytes.  If hole_length > 0 then we have removed
> - * such a "hole" from the stored data (and it's not counted in the
> - * XLOG record's CRC, either).  Hence, the amount of block data
> actually
> - * present is BLCKSZ - hole_length bytes.
> + * Block images are able to do several types of compression:
> + * - When wal_compression is off, as a trivial form of compression,
> + the
> + * XLOG code is aware that PG data pages usually contain an unused "hole"
> + * in the middle, which contains only zero bytes.  If length < BLCKSZ
> + * then we have removed such a "hole" from the stored data (and it is
> + * not counted in the XLOG record's CRC, either).  Hence, the amount
> + * of block data actually present is "length" bytes.  The hole "offset"
> + * on page is defined using "hole_offset".
> + * - When wal_compression is on, block images are compressed using a
> + * compression algorithm without their hole to improve compression
> + * process of the page. "length" corresponds in this case to the
> + length
> + * of the compressed block. "hole_offset" is the hole offset of the
> + page,
> + * and the length of the uncompressed block is defined by
> + "raw_length",
> + * whose data is included in the record only when compression is
> + enabled
> + * and "with_hole" is set to true, see below.
> + *
> + * "is_compressed" is used to identify if a given block image is
> + compressed
> + * or not. Maximum page size allowed on the system being 32k, the
> + hole
> + * offset cannot be more than 15-bit long so the last free bit is
> + used to
> + * store the compression state of block image. If the maximum page
> + size
> + * allowed is increased to a value higher than that, we should
> + consider
> + * increasing this structure size as well, but this would increase
> + the
> + * length of block header in WAL records with alignment.
> + *
> + * "with_hole" is used to identify the presence of a hole in a block image.
> + * As the length of a block cannot be more than 15-bit long, the
> + extra bit in
> + * the length field is used for this identification purpose. If the
> + block image
> + * has no hole, it is ensured that the raw size of a compressed block
> + image is
> + * equal to BLCKSZ, hence the contents of
> + XLogRecordBlockImageCompressionInfo
> + * are not necessary.
>   */
>  typedef struct XLogRecordBlockImageHeader  {
> -    uint16        hole_offset;    /* number of bytes before "hole" */
> -    uint16        hole_length;    /* number of bytes in "hole" */
> +    uint16    length:15,            /* length of block data in record */
> +            with_hole:1;        /* status of hole in the block */
> +
> +    uint16    hole_offset:15,    /* number of bytes before "hole" */
> +        is_compressed:1;    /* compression status of image */
> +
> +    /* Followed by the data related to compression if block is
> +compressed */
>  } XLogRecordBlockImageHeader;

Yikes, this is ugly.

I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a
actualblock id) isn't the block id but something like XLR_CHUNK_ID. Which is used as is for
XLR_CHUNK_ID_DATA_SHORT/LONG,but for backup blocks can be set to to XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
XLR_CHUNK_BKP_REFERENCE...The BKP blocks will then follow, storing the block id following the chunk id. 

Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we
willbe happy about the added extensibility pretty soon. 

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

19 February 2015, 06:40:41

On Mon, Feb 16, 2015 at 8:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote:
>> - * As a trivial form of data compression, the XLOG code is aware that
>> - * PG data pages usually contain an unused "hole" in the middle, which
>> - * contains only zero bytes.  If hole_length > 0 then we have removed
>> - * such a "hole" from the stored data (and it's not counted in the
>> - * XLOG record's CRC, either).  Hence, the amount of block data actually
>> - * present is BLCKSZ - hole_length bytes.
>> + * Block images are able to do several types of compression:
>> + * - When wal_compression is off, as a trivial form of compression, the
>> + * XLOG code is aware that PG data pages usually contain an unused "hole"
>> + * in the middle, which contains only zero bytes.  If length < BLCKSZ
>> + * then we have removed such a "hole" from the stored data (and it is
>> + * not counted in the XLOG record's CRC, either).  Hence, the amount
>> + * of block data actually present is "length" bytes.  The hole "offset"
>> + * on page is defined using "hole_offset".
>> + * - When wal_compression is on, block images are compressed using a
>> + * compression algorithm without their hole to improve compression
>> + * process of the page. "length" corresponds in this case to the length
>> + * of the compressed block. "hole_offset" is the hole offset of the page,
>> + * and the length of the uncompressed block is defined by "raw_length",
>> + * whose data is included in the record only when compression is enabled
>> + * and "with_hole" is set to true, see below.
>> + *
>> + * "is_compressed" is used to identify if a given block image is compressed
>> + * or not. Maximum page size allowed on the system being 32k, the hole
>> + * offset cannot be more than 15-bit long so the last free bit is used to
>> + * store the compression state of block image. If the maximum page size
>> + * allowed is increased to a value higher than that, we should consider
>> + * increasing this structure size as well, but this would increase the
>> + * length of block header in WAL records with alignment.
>> + *
>> + * "with_hole" is used to identify the presence of a hole in a block image.
>> + * As the length of a block cannot be more than 15-bit long, the extra bit in
>> + * the length field is used for this identification purpose. If the block image
>> + * has no hole, it is ensured that the raw size of a compressed block image is
>> + * equal to BLCKSZ, hence the contents of XLogRecordBlockImageCompressionInfo
>> + * are not necessary.
>>   */
>>  typedef struct XLogRecordBlockImageHeader
>>  {
>> -     uint16          hole_offset;    /* number of bytes before "hole" */
>> -     uint16          hole_length;    /* number of bytes in "hole" */
>> +     uint16  length:15,                      /* length of block data in record */
>> +                     with_hole:1;            /* status of hole in the block */
>> +
>> +     uint16  hole_offset:15, /* number of bytes before "hole" */
>> +             is_compressed:1;        /* compression status of image */
>> +
>> +     /* Followed by the data related to compression if block is compressed */
>>  } XLogRecordBlockImageHeader;
>
> Yikes, this is ugly.
>
> I think we should change the xlog format so that the block_id (which
> currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't
> the block id but something like XLR_CHUNK_ID. Which is used as is for
> XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to
> XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
> XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the
> block id following the chunk id.
> Yes, that'll increase the amount of data for a backup block by 1 byte,
> but I think that's worth it. I'm pretty sure we will be happy about the
> added extensibility pretty soon.

Yeah, that would help for readability and does not cost much compared
to BLCKSZ. Still could you explain what kind of extensibility you have
in mind except code readability? It is hard to make a nice picture
with only the paper and the pencils, and the current patch approach
has been taken to minimize the record length, particularly for users
who do not care about WAL compression.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

23 February 2015, 08:28:14

Hello,

Attached is a patch which has following changes,

As suggested above block ID in xlog structs has been replaced by chunk ID.

Chunk ID is used to distinguish between different types of xlog record fragments.

Like,
XLR_CHUNK_ID_DATA_SHORT
XLR_CHUNK_ID_DATA_LONG
XLR_CHUNK_BKP_COMPRESSED
XLR_CHUNK_BKP_WITH_HOLE

In block references, block ID follows the chunk ID. Here block ID retains its functionality.

This approach increases data by 1 byte for each block reference in an xlog record. This approach separates ID referring different fragments of xlog record from the actual block ID which is used to refer block references in xlog record.

Following are WAL numbers for each scenario,

WAL

FPW compression on 121.652 MB

FPW compression off 148.998 MB

HEAD 148.764 MB

Compression remains nearly same as before. There is some difference in WAL between HEAD and HEAD+patch+compression OFF. This difference corresponds to 1 byte increase with each block reference of xlog record.

Thank you,

Rahila Syed

On Wed, Feb 18, 2015 at 7:53 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:

Hello,

>I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't the block id but something like XLR_CHUNK_ID. Which is used as is for XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to >XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED, XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the block id following the chunk id.

>Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we will be happy about the added extensibility pretty soon.

To clarify my understanding of the above change,

Instead of a block id to reference different fragments of an xlog record , a single byte field "chunk_id" should be used. chunk_id will be same as XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments.
But for block references, it will take store following values in order to store information about the backup blocks.
#define XLR_CHUNK_BKP_COMPRESSED 0x01
#define XLR_CHUNK_BKP_WITH_HOLE 0x02
...

The new xlog format should look like follows,

Fixed-size header (XLogRecord struct)
Chunk_id(add a field before id field in XLogRecordBlockHeader struct)
XLogRecordBlockHeader
Chunk_id
XLogRecordBlockHeader
...
...
Chunk_id ( rename id field of the XLogRecordDataHeader struct)
XLogRecordDataHeader[Short|Long]
block data
block data
...
main data

I will post a patch based on this.

Thank you,
Rahila Syed

-----Original Message-----
From: Andres Freund [mailto:andres@2ndquadrant.com]
Sent: Monday, February 16, 2015 5:26 PM
To: Syed, Rahila
Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote:
> - * As a trivial form of data compression, the XLOG code is aware that
> - * PG data pages usually contain an unused "hole" in the middle,
> which
> - * contains only zero bytes. If hole_length > 0 then we have removed
> - * such a "hole" from the stored data (and it's not counted in the
> - * XLOG record's CRC, either). Hence, the amount of block data
> actually
> - * present is BLCKSZ - hole_length bytes.
> + * Block images are able to do several types of compression:
> + * - When wal_compression is off, as a trivial form of compression,
> + the
> + * XLOG code is aware that PG data pages usually contain an unused "hole"
> + * in the middle, which contains only zero bytes. If length < BLCKSZ
> + * then we have removed such a "hole" from the stored data (and it is
> + * not counted in the XLOG record's CRC, either). Hence, the amount
> + * of block data actually present is "length" bytes. The hole "offset"
> + * on page is defined using "hole_offset".
> + * - When wal_compression is on, block images are compressed using a
> + * compression algorithm without their hole to improve compression
> + * process of the page. "length" corresponds in this case to the
> + length
> + * of the compressed block. "hole_offset" is the hole offset of the
> + page,
> + * and the length of the uncompressed block is defined by
> + "raw_length",
> + * whose data is included in the record only when compression is
> + enabled
> + * and "with_hole" is set to true, see below.
> + *
> + * "is_compressed" is used to identify if a given block image is
> + compressed
> + * or not. Maximum page size allowed on the system being 32k, the
> + hole
> + * offset cannot be more than 15-bit long so the last free bit is
> + used to
> + * store the compression state of block image. If the maximum page
> + size
> + * allowed is increased to a value higher than that, we should
> + consider
> + * increasing this structure size as well, but this would increase
> + the
> + * length of block header in WAL records with alignment.
> + *
> + * "with_hole" is used to identify the presence of a hole in a block image.
> + * As the length of a block cannot be more than 15-bit long, the
> + extra bit in
> + * the length field is used for this identification purpose. If the
> + block image
> + * has no hole, it is ensured that the raw size of a compressed block
> + image is
> + * equal to BLCKSZ, hence the contents of
> + XLogRecordBlockImageCompressionInfo
> + * are not necessary.
> */
> typedef struct XLogRecordBlockImageHeader {
> - uint16 hole_offset; /* number of bytes before "hole" */
> - uint16 hole_length; /* number of bytes in "hole" */
> + uint16 length:15, /* length of block data in record */
> + with_hole:1; /* status of hole in the block */
> +
> + uint16 hole_offset:15, /* number of bytes before "hole" */
> + is_compressed:1; /* compression status of image */
> +
> + /* Followed by the data related to compression if block is
> +compressed */
> } XLogRecordBlockImageHeader;

Yikes, this is ugly.

I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't the block id but something like XLR_CHUNK_ID. Which is used as is for XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED, XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the block id following the chunk id.

Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we will be happy about the added extensibility pretty soon.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Support-compression-for-full-page-writes-in-WAL_v20.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

23 February 2015, 12:22:15

On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
> Attached is a patch which has following changes,
>
> As suggested above block ID in xlog structs has been replaced by chunk ID.
> Chunk ID is used to distinguish between different types of xlog record
> fragments.
> Like,
> XLR_CHUNK_ID_DATA_SHORT
> XLR_CHUNK_ID_DATA_LONG
> XLR_CHUNK_BKP_COMPRESSED
> XLR_CHUNK_BKP_WITH_HOLE
>
> In block references, block ID follows the chunk ID. Here block ID retains
> its functionality.
> This approach increases data by 1 byte for each block reference in an xlog
> record. This approach separates ID referring different fragments of xlog
> record from the actual block ID which is used to refer  block references in
> xlog record.

I've not read this logic yet, but ISTM there is a bug in that new WAL format
because I got the following error and the startup process could not replay
any WAL records when I set up replication and enabled wal_compression.

LOG:  record with invalid length at 0/30000B0
LOG:  record with invalid length at 0/3000518
LOG:  Invalid block length in record 0/30005A0
LOG:  Invalid block length in record 0/3000D60
...

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

24 February 2015, 07:03:54

On Mon, Feb 23, 2015 at 9:22 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
> Attached is a patch which has following changes,
>
> As suggested above block ID in xlog structs has been replaced by chunk ID.
> Chunk ID is used to distinguish between different types of xlog record
> fragments.
> Like,
> XLR_CHUNK_ID_DATA_SHORT
> XLR_CHUNK_ID_DATA_LONG
> XLR_CHUNK_BKP_COMPRESSED
> XLR_CHUNK_BKP_WITH_HOLE
>
> In block references, block ID follows the chunk ID. Here block ID retains
> its functionality.
> This approach increases data by 1 byte for each block reference in an xlog
> record. This approach separates ID referring different fragments of xlog
> record from the actual block ID which is used to refer block references in
> xlog record.

I've not read this logic yet, but ISTM there is a bug in that new WAL format
because I got the following error and the startup process could not replay
any WAL records when I set up replication and enabled wal_compression.

LOG: record with invalid length at 0/30000B0
LOG: record with invalid length at 0/3000518
LOG: Invalid block length in record 0/30005A0
LOG: Invalid block length in record 0/3000D60

Looking at this code, I think that it is really confusing to move the data related to the status of the backup block out of XLogRecordBlockImageHeader to the chunk ID itself that may *not* include a backup block at all as it is conditioned by the presence of BKPBLOCK_HAS_IMAGE. I would still prefer the idea of having the backup block data in its dedicated header with bits stolen from the existing fields, perhaps by rewriting it to something like that:

typedef struct XLogRecordBlockImageHeader {
uint32 length:15,

hole_length:15,

is_compressed:1,

is_hole:1;

} XLogRecordBlockImageHeader;

Now perhaps I am missing something and this is really "ugly" ;)

+#define XLR_CHUNK_ID_DATA_SHORT                255
+#define XLR_CHUNK_ID_DATA_LONG         254
+#define XLR_CHUNK_BKP_COMPRESSED       0x01
+#define XLR_CHUNK_BKP_WITH_HOLE                0x02

Wouldn't we need a XLR_CHUNK_ID_BKP_HEADER or equivalent? The idea between this chunk_id stuff it to be able to make the difference between a short header, a long header and a backup block header by looking at the first byte.

The comments on top of XLogRecordBlockImageHeader are still mentioning the old parameters like with_hole or is_compressed that you have removed.

It seems as well that there is some noise:
- lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h).
+ lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h)

Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

24 February 2015, 09:47:02

Hello ,

>I've not read this logic yet, but ISTM there is a bug in that new WAL format because I got the following error and the
startupprocess could not replay any WAL records when I set up replication and enabled wal_compression. 

>LOG:  record with invalid length at 0/30000B0
>LOG:  record with invalid length at 0/3000518
>LOG:  Invalid block length in record 0/30005A0
>LOG:  Invalid block length in record 0/3000D60 ...

Please fine attached patch which replays WAL records.

Thank you,
Rahila Syed

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Fujii Masao
Sent: Monday, February 23, 2015 5:52 PM
To: Rahila Syed
Cc: PostgreSQL-development; Andres Freund; Michael Paquier
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
> Attached is a patch which has following changes,
>
> As suggested above block ID in xlog structs has been replaced by chunk ID.
> Chunk ID is used to distinguish between different types of xlog record
> fragments.
> Like,
> XLR_CHUNK_ID_DATA_SHORT
> XLR_CHUNK_ID_DATA_LONG
> XLR_CHUNK_BKP_COMPRESSED
> XLR_CHUNK_BKP_WITH_HOLE
>
> In block references, block ID follows the chunk ID. Here block ID
> retains its functionality.
> This approach increases data by 1 byte for each block reference in an
> xlog record. This approach separates ID referring different fragments
> of xlog record from the actual block ID which is used to refer  block
> references in xlog record.

I've not read this logic yet, but ISTM there is a bug in that new WAL format because I got the following error and the
startupprocess could not replay any WAL records when I set up replication and enabled wal_compression. 

LOG:  record with invalid length at 0/30000B0
LOG:  record with invalid length at 0/3000518
LOG:  Invalid block length in record 0/30005A0
LOG:  Invalid block length in record 0/3000D60 ...

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

Support-compression-for-full-page-writes-in-WAL_v20.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

24 February 2015, 14:37:07

On 2015-02-24 16:03:41 +0900, Michael Paquier wrote:
> Looking at this code, I think that it is really confusing to move the data
> related to the status of the backup block out of XLogRecordBlockImageHeader
> to the chunk ID itself that may *not* include a backup block at all as it
> is conditioned by the presence of BKPBLOCK_HAS_IMAGE.

What's the problem here? We could actually now easily remove
BKPBLOCK_HAS_IMAGE and replace it by a chunk id.

> the idea of having the backup block data in its dedicated header with bits
> stolen from the existing fields, perhaps by rewriting it to something like
> that:
> typedef struct XLogRecordBlockImageHeader {
> uint32 length:15,
>      hole_length:15,
>      is_compressed:1,
>      is_hole:1;
> } XLogRecordBlockImageHeader;
> Now perhaps I am missing something and this is really "ugly" ;)

I think it's fantastically ugly. We'll also likely want different
compression formats and stuff in the not too far away future. This will
just end up being a pain.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

26 February 2015, 08:43:24

On Tue, Feb 24, 2015 at 6:46 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
> Hello ,
>
>>I've not read this logic yet, but ISTM there is a bug in that new WAL format because I got the following error and
thestartup process could not replay any WAL records when I set up replication and enabled wal_compression.

>
>>LOG:  record with invalid length at 0/30000B0
>>LOG:  record with invalid length at 0/3000518
>>LOG:  Invalid block length in record 0/30005A0
>>LOG:  Invalid block length in record 0/3000D60 ...
>
> Please fine attached patch which replays WAL records.

Even this patch doesn't work fine. The standby emit the following
error messages.

LOG:  invalid block_id 255 at 0/30000B0
LOG:  record with invalid length at 0/30017F0
LOG:  invalid block_id 255 at 0/3001878
LOG:  record with invalid length at 0/30027D0
LOG:  record with invalid length at 0/3002E58
...

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

26 February 2015, 21:54:11

Hello,

>Even this patch doesn't work fine. The standby emit the following
>error messages.

Yes this bug remains unsolved. I am still working on resolving this.

Following chunk IDs have been added in the attached patch as suggested upthread.
+#define XLR_CHUNK_BLOCK_REFERENCE 0x10
+#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04
+#define XLR_CHUNK_BLOCK_HAS_DATA 0x08

XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.

XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE

and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.

Thank you,

Rahila Syed

Attachment

Support-compression-of-full-page-writes-in-WAL_v21.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

26 February 2015, 23:01:07

On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
>>Even this patch doesn't work fine. The standby emit the following
>>error messages.
>
> Yes this bug remains unsolved. I am still working on resolving this.
>
> Following chunk IDs have been added in the attached patch as suggested
> upthread.
> +#define XLR_CHUNK_BLOCK_REFERENCE  0x10
> +#define XLR_CHUNK_BLOCK_HAS_IMAGE  0x04
> +#define XLR_CHUNK_BLOCK_HAS_DATA   0x08
>
> XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
> XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
> and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.

Before sending a new version, be sure that this get fixed by for
example building up a master with a standby replaying WAL, and running
make installcheck-world or similar. If the standby does not complain
at all, you have good chances to not have bugs. You could also build
with WAL_DEBUG to check record consistency.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

27 February 2015, 03:44:36

On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>>Even this patch doesn't work fine. The standby emit the following
>>>error messages.
>>
>> Yes this bug remains unsolved. I am still working on resolving this.
>>
>> Following chunk IDs have been added in the attached patch as suggested
>> upthread.
>> +#define XLR_CHUNK_BLOCK_REFERENCE  0x10
>> +#define XLR_CHUNK_BLOCK_HAS_IMAGE  0x04
>> +#define XLR_CHUNK_BLOCK_HAS_DATA   0x08
>>
>> XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
>> XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
>> and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.
>
> Before sending a new version, be sure that this get fixed by for
> example building up a master with a standby replaying WAL, and running
> make installcheck-world or similar. If the standby does not complain
> at all, you have good chances to not have bugs. You could also build
> with WAL_DEBUG to check record consistency.

It would be good to get those problems fixed first. Could you send an
updated patch? I'll look into it in more details. For the time being I
am switching this patch to "Waiting on Author".
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

02 March 2015, 10:13:26

On Fri, Feb 27, 2015 at 12:44 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>>>Even this patch doesn't work fine. The standby emit the following
>>>>error messages.
>>>
>>> Yes this bug remains unsolved. I am still working on resolving this.
>>>
>>> Following chunk IDs have been added in the attached patch as suggested
>>> upthread.
>>> +#define XLR_CHUNK_BLOCK_REFERENCE  0x10
>>> +#define XLR_CHUNK_BLOCK_HAS_IMAGE  0x04
>>> +#define XLR_CHUNK_BLOCK_HAS_DATA   0x08
>>>
>>> XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
>>> XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
>>> and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.
>>
>> Before sending a new version, be sure that this get fixed by for
>> example building up a master with a standby replaying WAL, and running
>> make installcheck-world or similar. If the standby does not complain
>> at all, you have good chances to not have bugs. You could also build
>> with WAL_DEBUG to check record consistency.

+1

When I test the WAL or replication related features, I usually run
"make installcheck" and pgbench against the master at the same time
after setting up the replication environment.
typedef struct XLogRecordBlockHeader{
+    uint8        chunk_id;        /* xlog fragment id */    uint8        id;                /* block reference ID */

Seems this increases the header size of WAL record even if no backup block
image is included. Right? Isn't it better to add the flag info about backup
block image into XLogRecordBlockImageHeader rather than XLogRecordBlockHeader?
Originally we borrowed one or two bits from its existing fields to minimize
the header size, but we can just add new flag field if we prefer
the extensibility and readability of the code.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

02 March 2015, 20:17:56

Hello,

>When I test the WAL or replication related features, I usually run
>"make installcheck" and pgbench against the master at the same time
>after setting up the replication environment.

I will conduct these tests before sending updated version.

>Seems this increases the header size of WAL record even if no backup block image is included. Right?
Yes, this increases the header size of WAL record by 1 byte for every block reference even if it has no backup block image.

>Isn't it better to add the flag info about backup block image into XLogRecordBlockImageHeader rather than XLogRecordBlockHeader
Yes , this will make the code extensible,readable and will save couple of bytes per record.

But the current approach is to provide a chunk ID identifying different xlog record fragments like main data , block references etc.

Currently , block ID is used to identify record fragments which can be either XLR_BLOCK_ID_DATA_SHORT , XLR_BLOCK_ID_DATA_LONG or actual block ID.

This can be replaced by chunk ID to separate it from block ID. Block ID can be used to number the block fragments whereas chunk ID can be used to distinguish between main data fragments and block references. Chunk ID of block references can contain information about presence of data, image , hole and compression.

Chunk ID for main data fragments remains as it is . This approach provides for readability and extensibility.

Thank you,

Rahila Syed

On Mon, Mar 2, 2015 at 3:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Feb 27, 2015 at 12:44 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>>>>Even this patch doesn't work fine. The standby emit the following
>>>>error messages.
>>>
>>> Yes this bug remains unsolved. I am still working on resolving this.
>>>
>>> Following chunk IDs have been added in the attached patch as suggested
>>> upthread.
>>> +#define XLR_CHUNK_BLOCK_REFERENCE 0x10
>>> +#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04
>>> +#define XLR_CHUNK_BLOCK_HAS_DATA 0x08
>>>
>>> XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
>>> XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
>>> and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.
>>
>> Before sending a new version, be sure that this get fixed by for
>> example building up a master with a standby replaying WAL, and running
>> make installcheck-world or similar. If the standby does not complain
>> at all, you have good chances to not have bugs. You could also build
>> with WAL_DEBUG to check record consistency.

+1

When I test the WAL or replication related features, I usually run
"make installcheck" and pgbench against the master at the same time
after setting up the replication environment.

typedef struct XLogRecordBlockHeader
{
+ uint8 chunk_id; /* xlog fragment id */
uint8 id; /* block reference ID */

Seems this increases the header size of WAL record even if no backup block
image is included. Right? Isn't it better to add the flag info about backup
block image into XLogRecordBlockImageHeader rather than XLogRecordBlockHeader?
Originally we borrowed one or two bits from its existing fields to minimize
the header size, but we can just add new flag field if we prefer
the extensibility and readability of the code.

Regards,

--
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

02 March 2015, 23:59:34

On Tue, Mar 3, 2015 at 5:17 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
>>When I test the WAL or replication related features, I usually run
>>"make installcheck" and pgbench against the master at the same time
>>after setting up the replication environment.
> I will conduct these tests before sending updated version.
>
>>Seems this increases the header size of WAL record even if no backup block
>> image is included. Right?
> Yes, this increases the header size of WAL record by 1 byte for every block
> reference even if it has no backup block image.
>
>>Isn't it better to add the flag info about backup block image into
>> XLogRecordBlockImageHeader rather than XLogRecordBlockHeader
> Yes , this will make the code extensible,readable and  will save couple of
> bytes per record.
>  But the current approach is to provide a chunk ID identifying different
> xlog record fragments like main data , block references etc.
> Currently , block ID is used to identify record fragments which can be
> either XLR_BLOCK_ID_DATA_SHORT , XLR_BLOCK_ID_DATA_LONG or actual block ID.
> This can be replaced by chunk ID to separate it from block ID. Block ID can
> be used to number the block fragments whereas chunk ID can be used to
> distinguish between main data fragments and block references. Chunk ID of
> block references can contain information about presence of data, image ,
> hole and compression.
> Chunk ID for main data fragments remains as it is . This approach provides
> for readability and extensibility.

Already mentioned upthread, but I agree with Fujii-san here: adding
information related to the state of a block image in
XLogRecordBlockHeader makes little sense because we are not sure to
have a block image, perhaps there is only data associated to it, and
that we should control that exclusively in XLogRecordBlockImageHeader
and let the block ID alone for now. Hence we'd better have 1 extra
int8 in XLogRecordBlockImageHeader with now 2 flags:
- Is block compressed or not?
- Does block have a hole?
Perhaps this will not be considered as ugly, and this leaves plenty of
room for storing a version number for compression.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Andres Freund

Date:

03 March 2015, 00:24:18

On 2015-03-03 08:59:30 +0900, Michael Paquier wrote:
> Already mentioned upthread, but I agree with Fujii-san here: adding
> information related to the state of a block image in
> XLogRecordBlockHeader makes little sense because we are not sure to
> have a block image, perhaps there is only data associated to it, and
> that we should control that exclusively in XLogRecordBlockImageHeader
> and let the block ID alone for now.

This argument doesn't make much sense to me. The flag byte could very
well indicate 'block reference without image following' vs 'block
reference with data + hole following' vs 'block reference with
compressed data following'.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

03 March 2015, 00:35:02

On Tue, Mar 3, 2015 at 9:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-03-03 08:59:30 +0900, Michael Paquier wrote:
>> Already mentioned upthread, but I agree with Fujii-san here: adding
>> information related to the state of a block image in
>> XLogRecordBlockHeader makes little sense because we are not sure to
>> have a block image, perhaps there is only data associated to it, and
>> that we should control that exclusively in XLogRecordBlockImageHeader
>> and let the block ID alone for now.
>
> This argument doesn't make much sense to me. The flag byte could very
> well indicate 'block reference without image following' vs 'block
> reference with data + hole following' vs 'block reference with
> compressed data following'.

Information about the state of a block is decoupled with its
existence, aka in the block header, we should control if:
- record has data
- record has a block
And in the block image header, we control if the block is:
- compressed or not
- has a hole or not.
Are you willing to sacrifice bytes in the block header to control if a
block is compressed or has a hole even if the block has only data but
no image?
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

03 March 2015, 15:42:09

Hello,

>It would be good to get those problems fixed first. Could you send an updated patch?

Please find attached updated patch with WAL replay error fixed. The patch follows chunk ID approach of xlog format.

Following are brief measurement numbers.
                                                      WAL
FPW compression on           122.032 MB

FPW compression off           155.239 MB

HEAD                                          155.236 MB


Thank you,
Rahila Syed


______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment

Support-compression-of-full-page-writes_v22.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

04 March 2015, 03:18:01

On Tue, Mar 3, 2015 at 9:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Mar 3, 2015 at 9:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2015-03-03 08:59:30 +0900, Michael Paquier wrote:
>>> Already mentioned upthread, but I agree with Fujii-san here: adding
>>> information related to the state of a block image in
>>> XLogRecordBlockHeader makes little sense because we are not sure to
>>> have a block image, perhaps there is only data associated to it, and
>>> that we should control that exclusively in XLogRecordBlockImageHeader
>>> and let the block ID alone for now.
>>
>> This argument doesn't make much sense to me. The flag byte could very
>> well indicate 'block reference without image following' vs 'block
>> reference with data + hole following' vs 'block reference with
>> compressed data following'.
>
> Information about the state of a block is decoupled with its
> existence, aka in the block header, we should control if:
> - record has data
> - record has a block
> And in the block image header, we control if the block is:
> - compressed or not
> - has a hole or not.

Are there any other flag bits that we should or are planning to add into
WAL header newly, except the above two? If yes and they are required by even
a block which doesn't have an image, I will change my mind and agree to
add something like chunk ID to a block header. But I guess the answer of the
question is No. Since the flag bits now we are thinking to add are required
only by a block having an image, adding them into a block header (instead of
block image header) seems a waste of bytes in WAL. So I concur with Michael.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

04 March 2015, 06:32:09

On Wed, Mar 4, 2015 at 12:41 AM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
> Please find attached updated patch with WAL replay error fixed. The patch follows chunk ID approach of xlog format.

(Review done independently of the chunk_id stuff being good or not,
already gave my opinion on the matter).
 * readRecordBufSize is set to the new buffer size.
- *
+
The patch has some noise diffs.

You may want to change the values of BKPBLOCK_WILL_INIT and
BKPBLOCK_SAME_REL to respectively 0x01 and 0x02.

+               uint8   chunk_id = 0;
+               chunk_id |= XLR_CHUNK_BLOCK_REFERENCE;

Why not simply that:
chunk_id = XLR_CHUNK_BLOCK_REFERENCE;

+#define XLR_CHUNK_ID_DATA_SHORT                255
+#define XLR_CHUNK_ID_DATA_LONG         254
Why aren't those just using one bit as well? This seems inconsistent
with the rest.

+     if ((blk->with_hole == 0 && blk->hole_offset != 0) ||
+         (blk->with_hole == 1 && blk->hole_offset <= 0))
In xlogreader.c blk->with_hole is defined as a boolean but compared
with an integer, could you remove the ==0 and ==1 portions for
clarity?

-                       goto err;
+                               goto err;               }       }
-       if (remaining != datatotal)
This gathers incorrect code alignment and unnecessary diffs.
typedef struct XLogRecordBlockHeader{
+       /* Chunk ID precedes */
+       uint8           id;
What prevents the declaration of chunk_id as an int8 here instead of
this comment? This is confusing.

> Following are brief measurement numbers.
>                                                       WAL
> FPW compression on           122.032 MB
> FPW compression off           155.239 MB
> HEAD                                   155.236 MB

What is the test run in this case? How many block images have been
generated in WAL for each case? You could gather some of those numbers
with pg_xlogdump --stat for example.
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

04 March 2015, 11:03:16

Hello,

>Are there any other flag bits that we should or are planning to add into WAL header newly, except the above two? If
yesand they are required by even a block which doesn't have an image, I will change my mind and agree to add something
likechunk ID to a block header.  
>But I guess the answer of the question is No. Since the flag bits now we are thinking to add are required only by a
blockhaving an image, adding them into a block header (instead of block image header) seems a waste of bytes in WAL. So
Iconcur with Michael. 
I agree.
As per my understanding, this change of xlog format was to provide for future enhancement which would need flags
relevantto entire block. 
But as mentioned, currently the flags being added are related to block image only. Hence for this patch it makes sense
toadd a field to XLogRecordImageHeader rather than block header.  
This will also save bytes per WAL record.

Thank you,
Rahila Syed



______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Re: [REVIEW] Re: Compression of full-page-writes

From

"Syed, Rahila"

Date:

05 March 2015, 12:15:41

Hello,

Please find attached  a patch. As discussed, flag to denote compression and presence of hole in block image has been
addedin XLogRecordImageHeader rather than block header.   

Following are WAL numbers based on attached  test script  posted by Michael earlier in the thread.

                                                      WAL generated
FPW compression on           122.032 MB

FPW compression off           155.223 MB

HEAD                                          155.236 MB

Compression : 21 %
Number of block images generated in WAL :   63637


Thank you,
Rahila Syed


______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
>> Please find attached  a patch. As discussed, flag to denote compression and presence of hole in block image has been
addedin XLogRecordImageHeader rather than block header. 

Thanks for updating the patch! Attached is the refactored version of the patch.

Regards,

--
Fujii Masao

Attachment

Support-compression-of-full-page-writes-in-WAL_v24.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

09 March 2015, 12:08:57

On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
>>> Please find attached  a patch. As discussed, flag to denote compression and presence of hole in block image has
beenadded in XLogRecordImageHeader rather than block header.
 
>
> Thanks for updating the patch! Attached is the refactored version of the patch.

Cool. Thanks!

I have some minor comments:

+        The default value is <literal>off</>
Dot at the end of this sentence.

+        Turning this parameter on can reduce the WAL volume without
"Turning <value>on</> this parameter

+        but at the cost of some extra CPU time by the compression during
+        WAL logging and the decompression during WAL replay."
Isn't a verb missing here, for something like that:
"but at the cost of some extra CPU spent on the compression during WAL
logging and on the decompression during WAL replay."

+ * This can reduce the WAL volume, but at some extra cost of CPU time
+ * by the compression during WAL logging.
Er, similarly "some extra cost of CPU spent on the compression...".

+       if (blk->bimg_info & BKPIMAGE_HAS_HOLE &&
+                       (blk->hole_offset == 0 ||
+                        blk->hole_length == 0 ||
I think that extra parenthesis should be used for the first expression
with BKPIMAGE_HAS_HOLE.

+                               if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED &&
+                                       blk->bimg_len == BLCKSZ)
+                               {
Same here.

+                               /*
+                                * cross-check that hole_offset == 0
and hole_length == 0
+                                * if the HAS_HOLE flag is set.
+                                */
I think that you mean here that this happens when the flag is *not* set.

+       /*
+        * If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED,
+        * an XLogRecordBlockCompressHeader follows
+        */
Maybe a "struct" should be added for "an XLogRecordBlockCompressHeader
struct". And a dot at the end of the sentence should be added?

Regards,
-- 
Michael

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

10 March 2015, 12:55:19

On Mon, Mar 9, 2015 at 9:08 PM, Michael Paquier wrote:
> On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao wrote:
>> Thanks for updating the patch! Attached is the refactored version of the patch.

Fujii-san and I had a short chat about tuning a bit the PGLZ strategy
which is now PGLZ_strategy_default in the patch (at least 25% of
compression, etc.). In particular min_input_size which is not set at
32B is too low, and knowing that the minimum fillfactor of a relation
page is 10% this looks really too low.

For example, using the extension attached to this email able to
compress and decompress bytea strings that I have developed after pglz
has been moved to libpqcommon (contains as well a function able to get
a relation page without its hole, feel free to use it), I am seeing
that we can gain quite a lot of space even with some incompressible
data like UUID or some random float data (pages are compressed without
their hole):
1) Float table:
=# create table float_tab (id float);
CREATE TABLE
=# insert into float_tab select random() from generate_series(1, 20);
INSERT 0 20
=# SELECT bytea_size(compress_data(page)) AS compress_size,
bytea_size(page) AS raw_size_no_hole FROM
get_raw_page('float_tab'::regclass, 0, false);
-[ RECORD 1 ]----+----
compress_size    | 329
raw_size_no_hole | 744
=# SELECT bytea_size(compress_data(page)) AS compress_size,
bytea_size(page) AS raw_size_no_hole FROM
get_raw_page('float_tab'::regclass, 0, false);
-[ RECORD 1 ]----+-----
compress_size    | 1753
raw_size_no_hole | 4344
So that's more or less 60% saved...
2) UUID table
=# SELECT bytea_size(compress_data(page)) AS compress_size,
bytea_size(page) AS raw_size_no_hole FROM
get_raw_page('uuid_tab'::regclass, 0, false);
-[ RECORD 1 ]----+----
compress_size    | 590
raw_size_no_hole | 904
=# insert into uuid_tab select gen_random_uuid() from generate_series(1, 100);
INSERT 0 100
=# SELECT bytea_size(compress_data(page)) AS compress_size,
bytea_size(page) AS raw_size_no_hole FROM
get_raw_page('uuid_tab'::regclass, 0, false);
-[ RECORD 1 ]----+-----
compress_size    | 3338
raw_size_no_hole | 5304
And in this case we are close to 40% saved...

At least, knowing that with the header there are at least 24B used on
a page, what about increasing min_input_size to something like 128B or
256B? I don't think that this is a blocker for this patch as most of
the relation pages are going to have far more data than that so they
will be unconditionally compressed, but there is definitely something
we could do in this area later on, perhaps even we could do
improvement with the other parameters like the compression rate. So
that's something to keep in mind...
--
Michael

Attachment

compress_test.tar.gz

Re: [REVIEW] Re: Compression of full-page-writes

From

Rahila Syed

Date:

10 March 2015, 22:08:59

Hello,

>I have some minor comments

The comments have been implemented in the attached patch.

>I think that extra parenthesis should be used for the first expression
>with BKPIMAGE_HAS_HOLE.

Parenthesis have been added to improve code readability.

Thank you,

Rahila Syed

On Mon, Mar 9, 2015 at 5:38 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
>>> Please find attached a patch. As discussed, flag to denote compression and presence of hole in block image has been added in XLogRecordImageHeader rather than block header.
>
> Thanks for updating the patch! Attached is the refactored version of the patch.

Cool. Thanks!

I have some minor comments:

+ The default value is <literal>off</>
Dot at the end of this sentence.

+ Turning this parameter on can reduce the WAL volume without
"Turning <value>on</> this parameter

+ but at the cost of some extra CPU time by the compression during
+ WAL logging and the decompression during WAL replay."
Isn't a verb missing here, for something like that:
"but at the cost of some extra CPU spent on the compression during WAL
logging and on the decompression during WAL replay."

+ * This can reduce the WAL volume, but at some extra cost of CPU time
+ * by the compression during WAL logging.
Er, similarly "some extra cost of CPU spent on the compression...".

+ if (blk->bimg_info & BKPIMAGE_HAS_HOLE &&
+ (blk->hole_offset == 0 ||
+ blk->hole_length == 0 ||
I think that extra parenthesis should be used for the first expression
with BKPIMAGE_HAS_HOLE.

+ if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED &&
+ blk->bimg_len == BLCKSZ)
+ {
Same here.

+ /*
+ * cross-check that hole_offset == 0
and hole_length == 0
+ * if the HAS_HOLE flag is set.
+ */
I think that you mean here that this happens when the flag is *not* set.

+ /*
+ * If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED,
+ * an XLogRecordBlockCompressHeader follows
+ */
Maybe a "struct" should be added for "an XLogRecordBlockCompressHeader
struct". And a dot at the end of the sentence should be added?

Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Support-compression-full-page-writes-in-WAL_v25.patch

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

11 March 2015, 03:19:43

On Mon, Mar 9, 2015 at 9:08 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila <Rahila.Syed@nttdata.com> wrote:
>>>> Please find attached  a patch. As discussed, flag to denote compression and presence of hole in block image has
beenadded in XLogRecordImageHeader rather than block header.
 
>>
>> Thanks for updating the patch! Attached is the refactored version of the patch.
>
> Cool. Thanks!
>
> I have some minor comments:

Thanks for the comments!

> +        Turning this parameter on can reduce the WAL volume without
> "Turning <value>on</> this parameter

That tag is not used in other place in config.sgml, so I'm not sure if
that's really necessary.

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Fujii Masao

Date:

11 March 2015, 06:57:31

On Wed, Mar 11, 2015 at 7:08 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hello,
>
>>I have some minor comments
>
> The comments have been implemented in the attached patch.

Thanks for updating the patch! I just changed a bit and finally pushed it.
Thanks everyone involved in this patch!

Regards,

-- 
Fujii Masao

Re: [REVIEW] Re: Compression of full-page-writes

From

Michael Paquier

Date:

11 March 2015, 07:01:09

On Wed, Mar 11, 2015 at 3:57 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Mar 11, 2015 at 7:08 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
>> Hello,
>>
>>>I have some minor comments
>>
>> The comments have been implemented in the attached patch.
>
> Thanks for updating the patch! I just changed a bit and finally pushed it.
> Thanks everyone involved in this patch!

Woohoo! Thanks!
-- 
Michael