Thread: pglz performance

pglz performance

From
Andrey Borodin
Date:
Hi hackers!

I was reviewing Paul Ramsey's TOAST patch[0] and noticed that there is a lot of room for improvement in the performance of pglz compression and decompression.

With Vladimir we started to investigate ways to boost byte copying and eventually created a test suite[1] to investigate the performance of compression and decompression.
This is an extension with a single function test_pglz() which performs tests for different:
1. Data payloads
2. Compression implementations
3. Decompression implementations

Currently we test mostly decompression improvements against two WALs and one data file taken from a pgbench-generated database. Any suggestions on more relevant data payloads are very welcome.
My laptop tests show that our decompression implementation [2] can be from 15% to 50% faster.
Also I've noted that compression is extremely slow, ~30 times slower than decompression. I believe we can do something about it.

We focus only on boosting existing codec without any considerations of other compression algorithms.

Any comments are much appreciated.

Most important questions are:
1. What are relevant data sets?
2. What are relevant CPUs? I have only Xeon-based servers and a few laptops/desktops with Intel CPUs.
3. If compression is 30 times slower, should we better focus on compression instead of decompression?

Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/flat/CANP8%2BjKcGj-JYzEawS%2BCUZnfeGKq4T5LswcswMP4GUHeZEP1ag%40mail.gmail.com
[1] https://github.com/x4m/test_pglz
[2] https://www.postgresql.org/message-id/C2D8E5D5-3E83-469B-8751-1C7877C2A5F2%40yandex-team.ru


Re: pglz performance

From
Michael Paquier
Date:
On Mon, May 13, 2019 at 07:45:59AM +0500, Andrey Borodin wrote:
> I was reviewing Paul Ramsey's TOAST patch[0] and noticed that there
> is a big room for improvement in performance of pglz compression and
> decompression.

Yes, I believe so too.  pglz is a huge CPU consumer when it comes to
compression compared to more modern algos like lz4.

> With Vladimir we started to investigate ways to boost byte copying
> and eventually created test suit[1] to investigate performance of
> compression and decompression.  This is and extension with single
> function test_pglz() which performs tests for different:
> 1. Data payloads
> 2. Compression implementations
> 3. Decompression implementations

Cool.  I got something rather similar in my wallet of plugins:
https://github.com/michaelpq/pg_plugins/tree/master/compress_test
This is something I worked on mainly for FPW compression in WAL.

> Currently we test mostly decompression improvements against two WALs
> and one data file taken from pgbench-generated database. Any
> suggestion on more relevant data payloads are very welcome.

Text strings made of random data and variable length?  For any test of
this kind I think that it is good to focus on the performance of the
low-level calls, even going as far as a simple C wrapper on top of the
pglz APIs to test only the performance and not have extra PG-related
overhead like palloc() which can be a barrier.  Focusing on string
lengths from 1kB up to 16kB may be a good range of sizes, and it is
important to keep the same uncompressed strings for performance
comparison.

> My laptop tests show that our decompression implementation [2] can
> be from 15% to 50% faster.  Also I've noted that compression is
> extremely slow, ~30 times slower than decompression. I believe we
> can do something about it.

That's nice.

> We focus only on boosting existing codec without any considerations
> of other compression algorithms.

There is this as well.  A couple of algorithms have a license
compatible with Postgres, but it may be simpler to just improve
pglz.  A 10%~20% improvement is something worth doing.

> Most important questions are:
> 1. What are relevant data sets?
> 2. What are relevant CPUs? I have only XEON-based servers and few
> laptops\desktops with intel CPUs
> 3. If compression is 30 times slower, should we better focus on
> compression instead of decompression?

Decompression can matter a lot for mostly-read workloads and
compression can become a bottleneck for heavy-insert loads, so
improving compression or decompression should be two separate
problems, not two problems linked.  Any improvement in one or the
other, or even both, is nice to have.
--
Michael

Attachment

Re: pglz performance

From
Andrey Borodin
Date:

> On 13 May 2019, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:
>
>> Currently we test mostly decompression improvements against two WALs
>> and one data file taken from pgbench-generated database. Any
>> suggestion on more relevant data payloads are very welcome.
>
> Text strings made of random data and variable length?
Like a text corpus?

>  For any test of
> this kind I think that it is good to focus on the performance of the
> low-level calls, even going as far as a simple C wrapper on top of the
> pglz APIs to test only the performance and not have extra PG-related
> overhead like palloc() which can be a barrier.
Our test_pglz extension measures only the time of the actual compression: it does a warmup run, and all allocations are done before measurement.

>  Focusing on strings of
> lengths of 1kB up to 16kB may be an idea of size, and it is important
> to keep the same uncompressed strings for performance comparison.
We intentionally avoid using generated data and thus keep the test files committed in the git repo.
We also check that the decompressed data matches the source of the compression. All tests are run 5 times.

We use a PG extension only for simplicity of deploying the benchmarks to our PG clusters.
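
To make the methodology concrete, here is a minimal sketch of such a measurement loop (our own illustration, not the actual test_pglz code; the decompressor signature is simplified):

#include <string.h>
#include <time.h>

/*
 * Sketch of the benchmark loop described above (illustration only, not the
 * test_pglz source): one warmup call, then "runs" timed calls, followed by
 * a check that the output matches the original data.  Returns nanoseconds
 * per decompressed byte, or -1 on a mismatch.
 */
typedef int (*decompress_fn) (const char *src, int srclen, char *dst, int dstlen);

static double
bench_decompress(decompress_fn decomp,
                 const char *compressed, int compressed_len,
                 const char *original, int original_len,
                 char *scratch, int runs)
{
    struct timespec t0, t1;
    double      elapsed_ns;

    decomp(compressed, compressed_len, scratch, original_len);   /* warmup */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < runs; i++)
        decomp(compressed, compressed_len, scratch, original_len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (memcmp(scratch, original, original_len) != 0)
        return -1.0;                      /* decompressed data is corrupted */

    elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / runs / original_len;
}

The per-byte figures in the result listings below are of this kind.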


Here are some test results.

Currently we test on 4 payloads:
1. WAL from cluster initialization
2. two WALs from pgbench -i -s 10
3. a data file taken from pgbench -i -s 10

We use these decompressors:
1. pglz_decompress_vanilla - taken from PG source code
2. pglz_decompress_hacked - uses sliced memcpy to imitate byte-by-byte pglz decompression (see the sketch after this list)
3. pglz_decompress_hacked4, pglz_decompress_hacked8, pglz_decompress_hackedX - use memcpy only if the match is no shorter than X bytes. We need to determine the best X if this approach is used.
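
For illustration, here is a minimal sketch of the "sliced memcpy" idea behind pglz_decompress_hacked (our own simplification, not the patch itself): a back-reference of length len that starts off bytes behind the output position is copied in slices of at most off bytes, so the source bytes are always fully written before they are read.

#include <string.h>
#include <stddef.h>

/*
 * Sketch only: copy a pglz back-reference of "len" bytes that starts "off"
 * bytes behind the current output position "dp".  Capping each slice at
 * "off" bytes keeps source and destination non-overlapping within one
 * memcpy() call, so overlapping (self-referencing) matches still expand
 * exactly as a byte-by-byte loop would.
 */
static void
copy_match_sliced(unsigned char *dp, size_t off, size_t len)
{
    const unsigned char *sp = dp - off;

    while (len > 0)
    {
        size_t  slice = (len < off) ? len : off;

        memcpy(dp, sp, slice);
        dp += slice;
        sp += slice;
        len -= slice;
    }
}

The hackedX variants simply fall back to a plain byte loop when len is below X and only use memcpy() for longer matches.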

I used three platforms:
1. Server: Supermicro SYS-1027R-N3RF (1U, 2P; 2x Intel Xeon E5-2660, 16x DDR3 ECC REG, 10x 2.5" SAS), Ubuntu 14, PG 9.6.
2. Desktop: Intel Core i7-6700 CPU @ 3.40GHz, Ubuntu 18, PG 12devel.
3. Laptop: MacBook Pro 15 (2015), 2.2 GHz Core i7 (i7-4770HQ), macOS, PG 12devel.
Owners of AMD and ARM devices are welcome.

Server results (less is better):
NOTICE:  00000: Time to decompress one byte in ns:
NOTICE:  00000: Payload 000000010000000000000001
NOTICE:  00000: Decompressor pglz_decompress_hacked result 0.647235
NOTICE:  00000: Decompressor pglz_decompress_hacked4 result 0.671029
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 0.699949
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 0.739586
NOTICE:  00000: Decompressor pglz_decompress_hacked32 result 0.787926
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 1.147282
NOTICE:  00000: Payload 000000010000000000000006
NOTICE:  00000: Decompressor pglz_decompress_hacked result 0.201774
NOTICE:  00000: Decompressor pglz_decompress_hacked4 result 0.211859
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 0.212610
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 0.214601
NOTICE:  00000: Decompressor pglz_decompress_hacked32 result 0.221813
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 0.706005
NOTICE:  00000: Payload 000000010000000000000008
NOTICE:  00000: Decompressor pglz_decompress_hacked result 1.370132
NOTICE:  00000: Decompressor pglz_decompress_hacked4 result 1.388991
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 1.388502
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 1.529455
NOTICE:  00000: Decompressor pglz_decompress_hacked32 result 1.520813
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 1.433527
NOTICE:  00000: Payload 16398
NOTICE:  00000: Decompressor pglz_decompress_hacked result 0.606943
NOTICE:  00000: Decompressor pglz_decompress_hacked4 result 0.623044
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 0.624118
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 0.620987
NOTICE:  00000: Decompressor pglz_decompress_hacked32 result 0.621183
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 1.365318

Comment: pglz_decompress_hacked is unconditionally optimal. In most cases it is 2x better than the current implementation.
On 000000010000000000000008 it is only marginally better. pglz_decompress_hacked8 is a few percent worse than pglz_decompress_hacked.

Desktop results:
NOTICE:  Time to decompress one byte in ns:
NOTICE:  Payload 000000010000000000000001
NOTICE:  Decompressor pglz_decompress_hacked result 0.396454
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.429249
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.436413
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.478077
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.491488
NOTICE:  Decompressor pglz_decompress_vanilla result 0.695527
NOTICE:  Payload 000000010000000000000006
NOTICE:  Decompressor pglz_decompress_hacked result 0.110710
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.115669
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.127637
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.120544
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.117981
NOTICE:  Decompressor pglz_decompress_vanilla result 0.399446
NOTICE:  Payload 000000010000000000000008
NOTICE:  Decompressor pglz_decompress_hacked result 0.647402
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.691891
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.693834
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.776815
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.777960
NOTICE:  Decompressor pglz_decompress_vanilla result 0.721192
NOTICE:  Payload 16398
NOTICE:  Decompressor pglz_decompress_hacked result 0.337654
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.355452
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.351224
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.362548
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.356456
NOTICE:  Decompressor pglz_decompress_vanilla result 0.837042

Comment: identical to Server results.

Laptop results:
NOTICE:  Time to decompress one byte in ns:
NOTICE:  Payload 000000010000000000000001
NOTICE:  Decompressor pglz_decompress_hacked result 0.661469
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.638366
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.664377
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.696135
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.634825
NOTICE:  Decompressor pglz_decompress_vanilla result 0.676560
NOTICE:  Payload 000000010000000000000006
NOTICE:  Decompressor pglz_decompress_hacked result 0.213921
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.224864
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.229394
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.218141
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.220954
NOTICE:  Decompressor pglz_decompress_vanilla result 0.242412
NOTICE:  Payload 000000010000000000000008
NOTICE:  Decompressor pglz_decompress_hacked result 1.053417
NOTICE:  Decompressor pglz_decompress_hacked4 result 1.063704
NOTICE:  Decompressor pglz_decompress_hacked8 result 1.007211
NOTICE:  Decompressor pglz_decompress_hacked16 result 1.145089
NOTICE:  Decompressor pglz_decompress_hacked32 result 1.079702
NOTICE:  Decompressor pglz_decompress_vanilla result 1.051557
NOTICE:  Payload 16398
NOTICE:  Decompressor pglz_decompress_hacked result 0.251690
NOTICE:  Decompressor pglz_decompress_hacked4 result 0.268125
NOTICE:  Decompressor pglz_decompress_hacked8 result 0.269248
NOTICE:  Decompressor pglz_decompress_hacked16 result 0.277880
NOTICE:  Decompressor pglz_decompress_hacked32 result 0.270290
NOTICE:  Decompressor pglz_decompress_vanilla result 0.705652

Comment: decompression time on WAL segments is statistically indistinguishable between the hacked and original versions.
Hacked decompression of the data file is 2x faster.

We are going to try these tests on Cascade Lake processors too.

Best regards, Andrey Borodin.


Re: pglz performance

From
Andrey Borodin
Date:

> On 15 May 2019, at 15:06, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Owners of AMD and ARM devices are welcome.

Yandex hardware RND guys gave me an ARM server and a Power9 server. They are looking for AMD and some new Intel boxes.

Meanwhile I made some enhancements to the test suite:
1. I've added a Shakespeare payload: a concatenation of the works of this prominent poet.
2. For each payload we compute a "sliced time" - the time to decompress the payload if it were sliced into 2Kb or 8Kb pieces.
3. For each decompressor we compute a "score": (the sum of the times to decompress each payload, whole and sliced by 2Kb and 8Kb) * 5 runs

I've attached full test logs, meanwhile here's results for different platforms.

Intel Server
NOTICE:  00000: Decompressor pglz_decompress_hacked result 10.346763
NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 11.192078
NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 11.957727
NOTICE:  00000: Decompressor pglz_decompress_vanilla result 14.262256

ARM Server
NOTICE:  Decompressor pglz_decompress_hacked result 12.966668
NOTICE:  Decompressor pglz_decompress_hacked8 result 13.004935
NOTICE:  Decompressor pglz_decompress_hacked16 result 13.043015
NOTICE:  Decompressor pglz_decompress_vanilla result 18.239242

Power9 Server
NOTICE:  Decompressor pglz_decompress_hacked result 10.992974
NOTICE:  Decompressor pglz_decompress_hacked8 result 11.747443
NOTICE:  Decompressor pglz_decompress_hacked16 result 11.026342
NOTICE:  Decompressor pglz_decompress_vanilla result 16.375315

Intel laptop
NOTICE:  Decompressor pglz_decompress_hacked result 9.445808
NOTICE:  Decompressor pglz_decompress_hacked8 result 9.105360
NOTICE:  Decompressor pglz_decompress_hacked16 result 9.621833
NOTICE:  Decompressor pglz_decompress_vanilla result 10.661968

From these results pglz_decompress_hacked looks best.

Best regards, Andrey Borodin.


Attachment

Re: pglz performance

From
Michael Paquier
Date:
On Thu, May 16, 2019 at 10:13:22PM +0500, Andrey Borodin wrote:
> Meanwhile I made some enhancements to test suit:
> Intel Server
> NOTICE:  00000: Decompressor pglz_decompress_hacked result 10.346763
> NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 11.192078
> NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 11.957727
> NOTICE:  00000: Decompressor pglz_decompress_vanilla result 14.262256
>
> ARM Server
> NOTICE:  Decompressor pglz_decompress_hacked result 12.966668
> NOTICE:  Decompressor pglz_decompress_hacked8 result 13.004935
> NOTICE:  Decompressor pglz_decompress_hacked16 result 13.043015
> NOTICE:  Decompressor pglz_decompress_vanilla result 18.239242
>
> Power9 Server
> NOTICE:  Decompressor pglz_decompress_hacked result 10.992974
> NOTICE:  Decompressor pglz_decompress_hacked8 result 11.747443
> NOTICE:  Decompressor pglz_decompress_hacked16 result 11.026342
> NOTICE:  Decompressor pglz_decompress_vanilla result 16.375315
>
> Intel laptop
> NOTICE:  Decompressor pglz_decompress_hacked result 9.445808
> NOTICE:  Decompressor pglz_decompress_hacked8 result 9.105360
> NOTICE:  Decompressor pglz_decompress_hacked16 result 9.621833
> NOTICE:  Decompressor pglz_decompress_vanilla result 10.661968
>
> From these results pglz_decompress_hacked looks best.

That's nice.

From the numbers you are presenting here, all of them are much better
than the original, and there is not much difference between any of the
patched versions.  Having a 20%~30% improvement with a patch is very
nice.

After that comes the simplicity and the future maintainability of what
is proposed.  I am not much into accepting a patch which has a 1%~2%
impact for some hardwares and makes pglz much more complex and harder
to understand.  But I am really eager to see a patch with at least a
10% improvement which remains simple, even more if it simplifies the
logic used in pglz.
--
Michael

Attachment

Re: pglz performance

From
Andrey Borodin
Date:

> On 17 May 2019, at 6:44, Michael Paquier <michael@paquier.xyz> wrote:
>
> That's nice.
>
> From the numbers you are presenting here, all of them are much better
> than the original, and there is not much difference between any of the
> patched versions.  Having a 20%~30% improvement with a patch is very
> nice.
>
> After that comes the simplicity and the future maintainability of what
> is proposed.  I am not much into accepting a patch which has a 1%~2%
> impact for some hardwares and makes pglz much more complex and harder
> to understand.  But I am really eager to see a patch with at least a
> 10% improvement which remains simple, even more if it simplifies the
> logic used in pglz.

Here are patches for both winning versions. I'll place them on the CF.
My gut feeling is that pglz_decompress_hacked8 should be better, but on most architectures the benchmarks show the opposite.


Best regards, Andrey Borodin.



Attachment

Re: pglz performance

From
Gasper Zejn
Date:
On 16. 05. 19 19:13, Andrey Borodin wrote:
>
>> On 15 May 2019, at 15:06, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>
>> Owners of AMD and ARM devices are welcome.

I've tested according to instructions at the test repo
https://github.com/x4m/test_pglz

Test_pglz is at a97f63b and postgres at 6ba500.

Hardware is desktop AMD Ryzen 5 2600, 32GB RAM

Decompressor score (sum of all times):

NOTICE:  Decompressor pglz_decompress_hacked result 6.988909
NOTICE:  Decompressor pglz_decompress_hacked8 result 7.562619
NOTICE:  Decompressor pglz_decompress_hacked16 result 8.316957
NOTICE:  Decompressor pglz_decompress_vanilla result 10.725826


Attached is the full test run, if needed.

Kind regards,

Gasper

> Yandex hardware RND guys gave me ARM server and Power9 server. They are looking for AMD and some new Intel boxes.
>
> Meanwhile I made some enhancements to test suit:
> 1. I've added Shakespeare payload: concatenation of works of this prominent poet.
> 2. For each payload compute "sliced time" - time to decompress payload if it was sliced by 2Kb pieces or 8Kb pieces.
> 3. For each decompressor we compute "score": (sum of time to decompress each payload, each payload sliced by 2Kb and 8Kb) * 5 times
>
> I've attached full test logs, meanwhile here's results for different platforms.
>
> Intel Server
> NOTICE:  00000: Decompressor pglz_decompress_hacked result 10.346763
> NOTICE:  00000: Decompressor pglz_decompress_hacked8 result 11.192078
> NOTICE:  00000: Decompressor pglz_decompress_hacked16 result 11.957727
> NOTICE:  00000: Decompressor pglz_decompress_vanilla result 14.262256
>
> ARM Server
> NOTICE:  Decompressor pglz_decompress_hacked result 12.966668
> NOTICE:  Decompressor pglz_decompress_hacked8 result 13.004935
> NOTICE:  Decompressor pglz_decompress_hacked16 result 13.043015
> NOTICE:  Decompressor pglz_decompress_vanilla result 18.239242
>
> Power9 Server
> NOTICE:  Decompressor pglz_decompress_hacked result 10.992974
> NOTICE:  Decompressor pglz_decompress_hacked8 result 11.747443
> NOTICE:  Decompressor pglz_decompress_hacked16 result 11.026342
> NOTICE:  Decompressor pglz_decompress_vanilla result 16.375315
>
> Intel laptop
> NOTICE:  Decompressor pglz_decompress_hacked result 9.445808
> NOTICE:  Decompressor pglz_decompress_hacked8 result 9.105360
> NOTICE:  Decompressor pglz_decompress_hacked16 result 9.621833
> NOTICE:  Decompressor pglz_decompress_vanilla result 10.661968
>
> From these results pglz_decompress_hacked looks best.
>
> Best regards, Andrey Borodin.
>

Attachment

Re: pglz performance

From
Andrey Borodin
Date:

> On 17 May 2019, at 18:40, Gasper Zejn <zejn@owca.info> wrote:
>
> I've tested according to instructions at the test repo
> https://github.com/x4m/test_pglz
>
> Test_pglz is at a97f63b and postgres at 6ba500.
>
> Hardware is desktop AMD Ryzen 5 2600, 32GB RAM
>
> Decompressor score (summ of all times):
>
> NOTICE:  Decompressor pglz_decompress_hacked result 6.988909
> NOTICE:  Decompressor pglz_decompress_hacked8 result 7.562619
> NOTICE:  Decompressor pglz_decompress_hacked16 result 8.316957
> NOTICE:  Decompressor pglz_decompress_vanilla result 10.725826

Thanks, Gasper! Basically we observe the same reduction to ~0.65x of the vanilla time here.

That's very good that we have independent scores.

I'm still somewhat unsure that the score is fair: on payload 000000010000000000000008 vanilla decompression is sometimes slower than hacked only by a few percent, and this is especially visible on AMD. Degradation for 000000010000000000000008 sliced by 8Kb reaches 10%.

I think this is because 000000010000000000000008 has the highest entropy. It is almost random and matches are very short, but present.
000000010000000000000008
Entropy = 4.360546 bits per byte.
000000010000000000000006
Entropy = 1.450059 bits per byte.
000000010000000000000001
Entropy = 2.944235 bits per byte.
shakespeare.txt
Entropy = 3.603659 bits per byte
16398
Entropy = 1.897640 bits per byte.
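
(For reference, these figures can be reproduced with a plain Shannon-entropy computation over the file contents; a minimal sketch, our own helper rather than anything from test_pglz:)

#include <math.h>
#include <stddef.h>

/* Shannon entropy of a byte buffer, in bits per byte (0.0 .. 8.0). */
static double
entropy_bits_per_byte(const unsigned char *buf, size_t len)
{
    size_t      counts[256] = {0};
    double      h = 0.0;

    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;

    for (int c = 0; c < 256; c++)
    {
        double  p;

        if (counts[c] == 0)
            continue;
        p = (double) counts[c] / (double) len;
        h -= p * log2(p);
    }
    return h;
}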

Best regards, Andrey Borodin.


Re: pglz performance

From
Binguo Bao
Date:
Hi hackers!
I am a student participating in GSoC 2019. I am looking forward to working with you all and learning from you.
My project would aim to provide the ability to de-TOAST a fully TOAST'd and compressed field using an iterator.
For more details, please take a look at my proposal[0]. Any suggestions or comments about my immature ideas would be much appreciated:)

I've implemented the first step of the project: segmented pglz compression, which provides the ability to get a subset of the raw data without decompressing the entire field.
I've also done some tests[1] of the compressor. The test results are as follows:
NOTICE:  Test summary:
NOTICE:  Payload 000000010000000000000001
NOTICE:       Decompressor name      |    Compression time (ns/bit)  | Decompression time (ns/bit) | ratio  
NOTICE:   pglz_decompress_hacked     |            23.747444           |          0.578344         | 0.159809
NOTICE:   pglz_decompress_hacked8    |            23.764193           |          0.677800         | 0.159809
NOTICE:   pglz_decompress_hacked16   |            23.740351           |          0.704730         | 0.159809
NOTICE:   pglz_decompress_vanilla    |            23.797917           |          1.227868         | 0.159809
NOTICE:   pglz_decompress_hacked_seg |            12.261808           |          0.625634         | 0.184952

Comment: compression speed increased by nearly 100%, while the compression ratio got worse by 15%.

NOTICE:  Payload 000000010000000000000001 sliced by 2Kb
NOTICE:   pglz_decompress_hacked     |            12.616956           |          0.621223         | 0.156953
NOTICE:   pglz_decompress_hacked8    |            12.583685           |          0.756741         | 0.156953
NOTICE:   pglz_decompress_hacked16   |            12.512636           |          0.774980         | 0.156953
NOTICE:   pglz_decompress_vanilla    |            12.493062           |          1.262820         | 0.156953
NOTICE:   pglz_decompress_hacked_seg |            11.986554           |          0.622654         | 0.159590
NOTICE:  Payload 000000010000000000000001 sliced by 4Kb
NOTICE:   pglz_decompress_hacked     |            15.514469           |          0.565565         | 0.154213
NOTICE:   pglz_decompress_hacked8    |            15.529144           |          0.699675         | 0.154213
NOTICE:   pglz_decompress_hacked16   |            15.514040           |          0.721145         | 0.154213
NOTICE:   pglz_decompress_vanilla    |            15.558958           |          1.237237         | 0.154213
NOTICE:   pglz_decompress_hacked_seg |            14.650309           |          0.563228         | 0.153652
NOTICE:  Payload 000000010000000000000006
NOTICE:       Decompressor name      |  Compression time (ns/bit)  | Decompression time (ns/bit) | ratio  
NOTICE:   pglz_decompress_hacked     |            8.610177           |          0.153577         | 0.052294
NOTICE:   pglz_decompress_hacked8    |            8.566785           |          0.168002         | 0.052294
NOTICE:   pglz_decompress_hacked16   |            8.643126           |          0.167537         | 0.052294
NOTICE:   pglz_decompress_vanilla    |            8.574498           |          0.930738         | 0.052294
NOTICE:   pglz_decompress_hacked_seg |            7.394731           |          0.171924         | 0.056081
NOTICE:  Payload 000000010000000000000006 sliced by 2Kb
NOTICE:   pglz_decompress_hacked     |            6.724060           |          0.295043         | 0.065541
NOTICE:   pglz_decompress_hacked8    |            6.623018           |          0.318527         | 0.065541
NOTICE:   pglz_decompress_hacked16   |            6.898034           |          0.318360         | 0.065541
NOTICE:   pglz_decompress_vanilla    |            6.712711           |          1.045430         | 0.065541
NOTICE:   pglz_decompress_hacked_seg |            6.630743           |          0.302589         | 0.068471
NOTICE:  Payload 000000010000000000000006 sliced by 4Kb
NOTICE:   pglz_decompress_hacked     |            6.624067           |          0.220942         | 0.058865
NOTICE:   pglz_decompress_hacked8    |            6.659424           |          0.240183         | 0.058865
NOTICE:   pglz_decompress_hacked16   |            6.763864           |          0.240564         | 0.058865
NOTICE:   pglz_decompress_vanilla    |            6.743574           |          0.985348         | 0.058865
NOTICE:   pglz_decompress_hacked_seg |            6.613123           |          0.227582         | 0.060330
NOTICE:  Payload 000000010000000000000008
NOTICE:       Decompressor name      |  Compression time (ns/bit)  | Decompression time (ns/bit) | ratio  
NOTICE:   pglz_decompress_hacked     |            52.425957           |          1.050544         | 0.498941
NOTICE:   pglz_decompress_hacked8    |            52.204561           |          1.261592         | 0.498941
NOTICE:   pglz_decompress_hacked16   |            52.328491           |          1.466751         | 0.498941
NOTICE:   pglz_decompress_vanilla    |            52.465308           |          1.341271         | 0.498941
NOTICE:   pglz_decompress_hacked_seg |            31.896341           |          1.113260         | 0.600998
NOTICE:  Payload 000000010000000000000008 sliced by 2Kb
NOTICE:   pglz_decompress_hacked     |            30.620611           |          0.768542         | 0.351941
NOTICE:   pglz_decompress_hacked8    |            30.557334           |          0.907421         | 0.351941
NOTICE:   pglz_decompress_hacked16   |            32.064903           |          1.208913         | 0.351941
NOTICE:   pglz_decompress_vanilla    |            30.489886           |          1.014197         | 0.351941
NOTICE:   pglz_decompress_hacked_seg |            27.145243           |          0.774193         | 0.352868
NOTICE:  Payload 000000010000000000000008 sliced by 4Kb
NOTICE:   pglz_decompress_hacked     |            36.567903           |          1.054633         | 0.514047
NOTICE:   pglz_decompress_hacked8    |            36.459124           |          1.267731         | 0.514047
NOTICE:   pglz_decompress_hacked16   |            36.791718           |          1.479650         | 0.514047
NOTICE:   pglz_decompress_vanilla    |            36.241913           |          1.303136         | 0.514047
NOTICE:   pglz_decompress_hacked_seg |            31.526327           |          1.059926         | 0.526875
NOTICE:  Payload 16398
NOTICE:       Decompressor name      |  Compression time (ns/bit)  | Decompression time (ns/bit) | ratio  
NOTICE:   pglz_decompress_hacked     |            9.508625           |          0.435190         | 0.071816
NOTICE:   pglz_decompress_hacked8    |            9.546987           |          0.473871         | 0.071816
NOTICE:   pglz_decompress_hacked16   |            9.534496           |          0.471662         | 0.071816
NOTICE:   pglz_decompress_vanilla    |            9.559053           |          1.352561         | 0.071816
NOTICE:   pglz_decompress_hacked_seg |            8.479486           |          0.441536         | 0.073232
NOTICE:  Payload 16398 sliced by 2Kb
NOTICE:   pglz_decompress_hacked     |            6.808167           |          0.326570         | 0.082775
NOTICE:   pglz_decompress_hacked8    |            6.790743           |          0.361720         | 0.082775
NOTICE:   pglz_decompress_hacked16   |            6.886097           |          0.364549         | 0.082775
NOTICE:   pglz_decompress_vanilla    |            6.918429           |          1.191265         | 0.082775
NOTICE:   pglz_decompress_hacked_seg |            6.752811           |          0.340805         | 0.085705
NOTICE:  Payload 16398 sliced by 4Kb
NOTICE:   pglz_decompress_hacked     |            7.244472           |          0.261872         | 0.076860
NOTICE:   pglz_decompress_hacked8    |            7.290275           |          0.295988         | 0.076860
NOTICE:   pglz_decompress_hacked16   |            7.340706           |          0.294683         | 0.076860
NOTICE:   pglz_decompress_vanilla    |            7.429289           |          1.151645         | 0.076860
NOTICE:   pglz_decompress_hacked_seg |            7.054166           |          0.267896         | 0.078325
NOTICE:  Payload shakespeare.txt
NOTICE:       Decompressor name      |  Compression time (ns/bit)  | Decompression time (ns/bit) | ratio  
NOTICE:   pglz_decompress_hacked     |            25.998753           |          1.345542         | 0.281363
NOTICE:   pglz_decompress_hacked8    |            26.121630           |          1.917667         | 0.281363
NOTICE:   pglz_decompress_hacked16   |            26.139312           |          2.101329         | 0.281363
NOTICE:   pglz_decompress_vanilla    |            26.155571           |          2.082123         | 0.281363
NOTICE:   pglz_decompress_hacked_seg |            16.792089           |          1.951269         | 0.436558

Comment: In this case, the compression ratio has dropped dramatically.
 
NOTICE:  Payload shakespeare.txt sliced by 2Kb
NOTICE:   pglz_decompress_hacked     |            14.992793           |          1.923663         | 0.436270
NOTICE:   pglz_decompress_hacked8    |            14.982428           |          2.695319         | 0.436270
NOTICE:   pglz_decompress_hacked16   |            15.211803           |          2.846615         | 0.436270
NOTICE:   pglz_decompress_vanilla    |            15.113214           |          2.580098         | 0.436270
NOTICE:   pglz_decompress_hacked_seg |            15.120852           |          1.922596         | 0.439199
NOTICE:  Payload shakespeare.txt sliced by 4Kb
NOTICE:   pglz_decompress_hacked     |            18.083400           |          1.687598         | 0.366936
NOTICE:   pglz_decompress_hacked8    |            18.185038           |          2.395928         | 0.366936
NOTICE:   pglz_decompress_hacked16   |            18.096120           |          2.554812         | 0.366936
NOTICE:   pglz_decompress_vanilla    |            18.435380           |          2.329129         | 0.366936
NOTICE:   pglz_decompress_hacked_seg |            18.103267           |          1.705517         | 0.368400
NOTICE:  

Decompressor score (sum of all times):
NOTICE:  Decompressor pglz_decompress_hacked     result 11.288848
NOTICE:  Decompressor pglz_decompress_hacked8    result 14.438165
NOTICE:  Decompressor pglz_decompress_hacked16   result 15.716280
NOTICE:  Decompressor pglz_decompress_vanilla    result 21.034867
NOTICE:  Decompressor pglz_decompress_hacked_seg result 12.090609
NOTICE:  

Compressor score (sum of all times):
NOTICE:  compressor pglz_compress_vanilla result 276.776671
NOTICE:  compressor pglz_compress_hacked_seg result 222.407850

There are some questions now:
1. The compression algorithm is no longer compatible with the original compression algorithm.
2. If the idea works, we need to test more data; what kind of data would be most appropriate?
Any comments are much appreciated.

Re: pglz performance

From
Andrey Borodin
Date:

> On 18 May 2019, at 11:44, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
Hi!
Here's a rebased version of the patches.

Best regards, Andrey Borodin.

Attachment

Re: pglz performance

From
Andrey Borodin
Date:

> On 13 May 2019, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:
>
> Decompression can matter a lot for mostly-read workloads and
> compression can become a bottleneck for heavy-insert loads, so
> improving compression or decompression should be two separate
> problems, not two problems linked.  Any improvement in one or the
> other, or even both, is nice to have.

Here's a patch hacked up by Vladimir for compression.

Key differences (as far as I can see; maybe Vladimir will post a more complete list of optimizations):
1. Use functions instead of function-like macros: not surprisingly, they are easier to optimize and impose fewer constraints on the compiler.
2. More compact hash table: use indexes instead of pointers.
3. More robust segment comparison: like memcmp, but returning the index of the first differing byte (sketched below).
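
A minimal sketch of what point 3 means (our illustration, not Vladimir's code; a real implementation would likely compare a word at a time):

#include <stddef.h>

/*
 * Sketch: like memcmp(), but instead of a sign it returns the index of the
 * first differing byte, i.e. the length of the common prefix.  During match
 * search this directly gives the usable match length.
 */
static size_t
common_prefix_length(const unsigned char *a, const unsigned char *b, size_t maxlen)
{
    size_t      i = 0;

    while (i < maxlen && a[i] == b[i])
        i++;
    return i;
}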

On a weighted mix of different data (the same mix as used for the decompression tests), the overall speedup is x1.43 on my machine.

The current implementation is integrated into the test_pglz suite for benchmarking purposes[0].

Best regards, Andrey Borodin.

[0] https://github.com/x4m/test_pglz

Attachment

Re: pglz performance

From
Konstantin Knizhnik
Date:

On 27.06.2019 21:33, Andrey Borodin wrote:
>
>> On 13 May 2019, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:
>>
>> Decompression can matter a lot for mostly-read workloads and
>> compression can become a bottleneck for heavy-insert loads, so
>> improving compression or decompression should be two separate
>> problems, not two problems linked.  Any improvement in one or the
>> other, or even both, is nice to have.
> Here's patch hacked by Vladimir for compression.
>
> Key differences (as far as I see, maybe Vladimir will post more complete list of optimizations):
> 1. Use functions instead of macro-functions: not surprisingly it's easier to optimize them and provide less constraints for compiler to optimize.
> 2. More compact hash table: use indexes instead of pointers.
> 3. More robust segment comparison: like memcmp, but return index of first different byte
>
> In weighted mix of different data (same as for compression), overall speedup is x1.43 on my machine.
>
> Current implementation is integrated into test_pglz suit for benchmarking purposes[0].
>
> Best regards, Andrey Borodin.
>
> [0] https://github.com/x4m/test_pglz

It took me some time to understand that your memcpy optimization is
correct ;)
I have tested different ways of optimizing this fragment of code, but
failed to outperform your implementation!
Results on my computer are similar to yours:

Decompressor score (sum of all times):
NOTICE:  Decompressor pglz_decompress_hacked result 6.627355
NOTICE:  Decompressor pglz_decompress_hacked_unrolled result 7.497114
NOTICE:  Decompressor pglz_decompress_hacked8 result 7.412944
NOTICE:  Decompressor pglz_decompress_hacked16 result 7.792978
NOTICE:  Decompressor pglz_decompress_vanilla result 10.652603

Compressor score (sum of all times):
NOTICE:  Compressor pglz_compress_vanilla result 116.970005
NOTICE:  Compressor pglz_compress_hacked result 89.706105


But ...  below are results for lz4:

Decompressor score (sum of all times):
NOTICE:  Decompressor lz4_decompress result 3.660066
Compressor score (sum of all times):
NOTICE:  Compressor lz4_compress result 10.288594

There is a 2x advantage in decompression speed and a 10x advantage in
compression speed.
So maybe instead of "hacking" the pglz algorithm we should better switch
to lz4?


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: pglz performance

From
Tomas Vondra
Date:
On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:
>
>
>On 27.06.2019 21:33, Andrey Borodin wrote:
>>
>>>On 13 May 2019, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:
>>>
>>>Decompression can matter a lot for mostly-read workloads and
>>>compression can become a bottleneck for heavy-insert loads, so
>>>improving compression or decompression should be two separate
>>>problems, not two problems linked.  Any improvement in one or the
>>>other, or even both, is nice to have.
>>Here's patch hacked by Vladimir for compression.
>>
>>Key differences (as far as I see, maybe Vladimir will post more complete list of optimizations):
>>1. Use functions instead of macro-functions: not surprisingly it's easier to optimize them and provide less constraints for compiler to optimize.
>>2. More compact hash table: use indexes instead of pointers.
>>3. More robust segment comparison: like memcmp, but return index of first different byte
>>
>>In weighted mix of different data (same as for compression), overall speedup is x1.43 on my machine.
>>
>>Current implementation is integrated into test_pglz suit for benchmarking purposes[0].
>>
>>Best regards, Andrey Borodin.
>>
>>[0] https://github.com/x4m/test_pglz
>
>It takes me some time to understand that your memcpy optimization is 
>correct;)
>I have tested different ways of optimizing this fragment of code, but 
>failed to outperform your implementation!
>Results at my computer is simlar with yours:
>
>Decompressor score (summ of all times):
>NOTICE:  Decompressor pglz_decompress_hacked result 6.627355
>NOTICE:  Decompressor pglz_decompress_hacked_unrolled result 7.497114
>NOTICE:  Decompressor pglz_decompress_hacked8 result 7.412944
>NOTICE:  Decompressor pglz_decompress_hacked16 result 7.792978
>NOTICE:  Decompressor pglz_decompress_vanilla result 10.652603
>
>Compressor score (summ of all times):
>NOTICE:  Compressor pglz_compress_vanilla result 116.970005
>NOTICE:  Compressor pglz_compress_hacked result 89.706105
>
>
>But ...  below are results for lz4:
>
>Decompressor score (summ of all times):
>NOTICE:  Decompressor lz4_decompress result 3.660066
>Compressor score (summ of all times):
>NOTICE:  Compressor lz4_compress result 10.288594
>
>There is 2 times advantage in decompress speed and 10 times advantage 
>in compress speed.
>So may be instead of "hacking" pglz algorithm we should better switch 
>to lz4?
>

I think we should just bite the bullet and add an initdb option to pick
the compression algorithm. That's been discussed repeatedly, but we never
ended up actually doing that. See for example [1].

If there's anyone willing to put some effort into getting this feature
over the line, I'm willing to do reviews & commit. It's a seemingly
small change with rather insane potential impact.

But even if we end up doing that, it still makes sense to optimize the
hell out of pglz, because existing systems will still use that
(pg_upgrade can't switch from one compression algorithm to another).

regards

[1] https://www.postgresql.org/message-id/flat/55341569.1090107%402ndquadrant.com

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Andrey Borodin
Date:
Thanks for looking into this!

> On 2 Aug 2019, at 19:43, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:
>>
>> It takes me some time to understand that your memcpy optimization is correct;)
Seems that comments are not explanatory enough... will try to fix.

>> I have tested different ways of optimizing this fragment of code, but failed to outperform your implementation!
JFYI, we tried optimizations with memcpy with a const size (optimized into assembly instead of a call), unrolling the literal loop, and some others. None of these worked better.

>> But ...  below are results for lz4:
>>
>> Decompressor score (summ of all times):
>> NOTICE:  Decompressor lz4_decompress result 3.660066
>> Compressor score (summ of all times):
>> NOTICE:  Compressor lz4_compress result 10.288594
>>
>> There is 2 times advantage in decompress speed and 10 times advantage in compress speed.
>> So may be instead of "hacking" pglz algorithm we should better switch to lz4?
>>
>
> I think we should just bite the bullet and add initdb option to pick
> compression algorithm. That's been discussed repeatedly, but we never
> ended up actually doing that. See for example [1].
>
> If there's anyone willing to put some effort into getting this feature
> over the line, I'm willing to do reviews & commit. It's a seemingly
> small change with rather insane potential impact.
>
> But even if we end up doing that, it still makes sense to optimize the
> hell out of pglz, because existing systems will still use that
> (pg_upgrade can't switch from one compression algorithm to another).

We have some kind of "roadmap" for "extensible pglz". We plan to provide an implementation at the November CF.

Currently, pglz starts with an empty cache map: there are no prior 4k bytes before the start. We can add an imaginary prefix to any data with common substrings: this will enhance the compression ratio.
It is hard to decide on a training data set for this "common prefix". So we want to produce an extension with an aggregate function which produces some "adapted common prefix" from the user's data.
Then we can "reserve" a few negative bytes for "decompression commands". Such a command can instruct the database on which common prefix to use.
But a system command can also say "invoke decompression from extension".

Thus, the user will be able to train database compression on his data and substitute pglz compression with a custom compression method seamlessly.

This would make a hard-wired choice of compression unnecessary, but seems overly hacky. And there would be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for his foot.

Best regards, Andrey Borodin.


Re: pglz performance

From
Andres Freund
Date:
Hi,

On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
> We have some kind of "roadmap" of "extensible pglz". We plan to provide implementation on Novembers CF.

I don't understand why it's a good idea to improve the compression side
of pglz. There are plenty of other people who have spent a lot of time
developing better compression algorithms.


> Currently, pglz starts with empty cache map: there is no prior 4k bytes before start. We can add imaginary prefix to any data with common substrings: this will enhance compression ratio.
> It is hard to decide on training data set for this "common prefix". So we want to produce extension with aggregate function which produces some "adapted common prefix" from users's data.
> Then we can "reserve" few negative bytes for "decompression commands". This command can instruct database on which common prefix to use.
> But also system command can say "invoke decompression from extension".
> 
> Thus, user will be able to train database compression on his data and substitute pglz compression with custom compression method seamlessly.
> 
> This will make hard-choosen compression unneeded, but seems overly hacky. But there will be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun user wants for his foot.

I think this is way too complicated, and will provide not particularly
much benefit for the majority users.

In fact, I'll argue that we should flat out reject any such patch until
we have at least one decent default compression algorithm in
core. You're trying to work around a poor compression algorithm with
complicated dictionary improvement, that require user interaction, and
only will work in a relatively small subset of the cases, and will very
often increase compression times.

Greetings,

Andres Freund



Re: pglz performance

From
Tomas Vondra
Date:
On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
>> We have some kind of "roadmap" of "extensible pglz". We plan to
>> provide implementation on Novembers CF.
>
>I don't understand why it's a good idea to improve the compression side
>of pglz. There's plenty other people that spent a lot of time
>developing better compression algorithms.
>

Isn't it beneficial for existing systems, which will be stuck with pglz
even if we end up adding other algorithms?

>
>> Currently, pglz starts with empty cache map: there is no prior 4k
>> bytes before start. We can add imaginary prefix to any data with
>> common substrings: this will enhance compression ratio.  It is hard
>> to decide on training data set for this "common prefix". So we want
>> to produce extension with aggregate function which produces some
>> "adapted common prefix" from users's data.  Then we can "reserve" few
>> negative bytes for "decompression commands". This command can
>> instruct database on which common prefix to use.  But also system
>> command can say "invoke decompression from extension".
>>
>> Thus, user will be able to train database compression on his data and
>> substitute pglz compression with custom compression method
>> seamlessly.
>>
>> This will make hard-choosen compression unneeded, but seems overly
>> hacky. But there will be no need to have lz4, zstd, brotli, lzma and
>> others in core. Why not provide e.g. "time series compression"? Or
>> "DNA compression"? Whatever gun user wants for his foot.
>
>I think this is way too complicated, and will provide not particularly
>much benefit for the majority users.
>

I agree with this. I do see value in the feature, but probably not as a
drop-in replacement for the default compression algorithm. I'd compare
it to the "custom compression methods" patch that was submitted some
time ago.

>In fact, I'll argue that we should flat out reject any such patch until
>we have at least one decent default compression algorithm in core.
>You're trying to work around a poor compression algorithm with
>complicated dictionary improvement, that require user interaction, and
>only will work in a relatively small subset of the cases, and will very
>often increase compression times.
>

I wouldn't be so strict, I guess. But I do agree that an algorithm that
requires additional steps (training, ...) is unlikely to be a good
candidate for the default instance compression algorithm.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Andres Freund
Date:
Hi,

On 2019-08-02 19:00:39 +0200, Tomas Vondra wrote:
> On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
> > > We have some kind of "roadmap" of "extensible pglz". We plan to
> > > provide implementation on Novembers CF.
> > 
> > I don't understand why it's a good idea to improve the compression side
> > of pglz. There's plenty other people that spent a lot of time
> > developing better compression algorithms.
> > 
> 
> Isn't it beneficial for existing systems, that will be stuck with pglz
> even if we end up adding other algorithms?

Why would they be stuck continuing to *compress* with pglz? As we
fully retoast on write anyway we can just gradually switch over to the
better algorithm. Decompression speed is another story, of course.


Here's what I had a few years back:

https://www.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de
see also
https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de

I think we should refresh something like that patch, and:
- make the compression algorithm GUC an enum, rename
- add --with-system-lz4
- obviously refresh the copy of lz4
- drop snappy

Greetings,

Andres Freund



Re: pglz performance

From
Tomas Vondra
Date:
On Fri, Aug 02, 2019 at 10:12:58AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-08-02 19:00:39 +0200, Tomas Vondra wrote:
>> On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote:
>> > Hi,
>> >
>> > On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
>> > > We have some kind of "roadmap" of "extensible pglz". We plan to
>> > > provide implementation on Novembers CF.
>> >
>> > I don't understand why it's a good idea to improve the compression side
>> > of pglz. There's plenty other people that spent a lot of time
>> > developing better compression algorithms.
>> >
>>
>> Isn't it beneficial for existing systems, that will be stuck with pglz
>> even if we end up adding other algorithms?
>
>Why would they be stuck continuing to *compress* with pglz? As we
>fully retoast on write anyway we can just gradually switch over to the
>better algorithm. Decompression speed is another story, of course.
>

Hmmm, I don't remember the details of those patches so I didn't realize
it allows incremental recompression. If that's possible, that would mean 
existing systems can start using it. Which is good.

Another question is whether we'd actually want to include the code in
core directly, or use system libraries (and if some packagers might
decide to disable that, for whatever reason).

But yeah, I agree you may have a point about optimizing pglz compression.

>
>Here's what I had a few years back:
>
>https://www.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de
>see also
>https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de
>
>I think we should refresh something like that patch, and:
>- make the compression algorithm GUC an enum, rename
>- add --with-system-lz4
>- obviously refresh the copy of lz4
>- drop snappy
>

That's a reasonable plan, I guess.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Andres Freund
Date:
Hi,

On 2019-08-02 19:52:39 +0200, Tomas Vondra wrote:
> Hmmm, I don't remember the details of those patches so I didn't realize
> it allows incremental recompression. If that's possible, that would mean
> existing systems can start using it. Which is good.

That depends on what you mean by "incremental"? A single toasted
datum can only have one compression type, because we only update it
as a whole anyway. But different datums can be compressed differently.


> Another question is whether we'd actually want to include the code in
> core directly, or use system libraries (and if some packagers might
> decide to disable that, for whatever reason).

I'd personally say we should have an included version, and a
--with-system-... flag that uses the system one.

Greetings,

Andres Freund



Re: pglz performance

From
Tomas Vondra
Date:
On Fri, Aug 02, 2019 at 11:20:03AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-08-02 19:52:39 +0200, Tomas Vondra wrote:
>> Hmmm, I don't remember the details of those patches so I didn't realize
>> it allows incremental recompression. If that's possible, that would mean
>> existing systems can start using it. Which is good.
>
>That depends on what do you mean by "incremental"? A single toasted
>datum can only have one compression type, because we only update them
>all in one anyway. But different datums can be compressed differently.
>

I meant different toast values using different compression algorithm,
sorry for the confusion.

>
>> Another question is whether we'd actually want to include the code in
>> core directly, or use system libraries (and if some packagers might
>> decide to disable that, for whatever reason).
>
>I'd personally say we should have an included version, and a
>--with-system-... flag that uses the system one.
>

OK. I'd say to require a system library, but that's a minor detail.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Konstantin Knizhnik
Date:

On 02.08.2019 21:20, Andres Freund wrote:
>> Another question is whether we'd actually want to include the code in
>> core directly, or use system libraries (and if some packagers might
>> decide to disable that, for whatever reason).
> I'd personally say we should have an included version, and a
> --with-system-... flag that uses the system one.
+1




Re: pglz performance

From
Petr Jelinek
Date:
Hi,

On 02/08/2019 21:48, Tomas Vondra wrote:
> On Fri, Aug 02, 2019 at 11:20:03AM -0700, Andres Freund wrote:
> 
>>
>>> Another question is whether we'd actually want to include the code in
>>> core directly, or use system libraries (and if some packagers might
>>> decide to disable that, for whatever reason).
>>
>> I'd personally say we should have an included version, and a
>> --with-system-... flag that uses the system one.
>>
> 
> OK. I'd say to require a system library, but that's a minor detail.
> 

Same here.

Just so that we don't idly talk, what do you think about the attached?
It:
- adds new GUC compression_algorithm with possible values of pglz 
(default) and lz4 (if lz4 is compiled in), requires SIGHUP
- adds --with-lz4 configure option (default yes, so the configure option 
is actually --without-lz4) that enables the lz4, it's using system library
- uses the compression_algorithm for both TOAST and WAL compression (if on)
- supports slicing for lz4 as well (pglz was already supported)
- supports reading old TOAST values
- adds 1 byte header to the compressed data where we currently store the 
algorithm kind, that leaves us with 254 more to add :) (that's an extra 
overhead compared to the current state)
- changes the rawsize in TOAST header to 31 bits via bit packing
- uses the extra bit to differentiate between old and new format
- supports reading from table which has different rows stored with 
different algorithm (so that the GUC itself can be freely changed)

Simple docs and a TAP test included.
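
To illustrate the header layout described above, here is a rough sketch of the bit packing (names and exact layout are our own illustration of the idea, not taken from the patch):

#include <stdint.h>

/*
 * Illustration only: use the top bit of the old 32-bit rawsize field to mark
 * the new format, keep a 31-bit raw size, and add one byte identifying the
 * compression method (0 = pglz, 1 = lz4, 254 values left for the future).
 */
#define TOAST_COMPRESS_NEW_FORMAT   0x80000000U
#define TOAST_COMPRESS_RAWSIZE_MASK 0x7FFFFFFFU

typedef struct
{
    uint32_t    info;           /* new-format flag | 31-bit raw size */
    uint8_t     method;         /* only present in the new format */
} toast_compress_header_sketch;

static inline uint32_t
sketch_rawsize(uint32_t info)
{
    return info & TOAST_COMPRESS_RAWSIZE_MASK;
}

static inline int
sketch_is_new_format(uint32_t info)
{
    return (info & TOAST_COMPRESS_NEW_FORMAT) != 0;
}

Reading code would check the flag first and fall back to the old interpretation of the field when it is clear, which is how old TOAST values keep working.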

I did some basic performance testing (it's not really my thing though, 
so I would appreciate if somebody did more).
I get about a 7x performance improvement on data load with lz4 compared to pglz 
on my dataset, but strangely only a tiny decompression improvement. Perhaps 
more importantly, I also did before-patch and after-patch tests with pglz 
and the performance difference with my data set was <1%.

Note that this will just link against lz4, it does not add lz4 into 
PostgreSQL code-base.

The issues I know of:
- the pg_decompress function really ought to throw error in the default 
branch but that file is also used in front-end so not sure how to do that
- the TAP test probably does not work with all possible configurations 
(but that's why it needs to be set in PG_TEST_EXTRA like for example ssl)
- we don't really have any automated test for reading old TOAST format, 
no idea how to do that
- I expect my changes to configure.in are not the greatest as I have 
pretty much zero experience with autoconf

-- 
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/

Attachment

Re: pglz performance

From
Andrey Borodin
Date:

> On 2 Aug 2019, at 21:39, Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
>> We have some kind of "roadmap" of "extensible pglz". We plan to provide implementation on Novembers CF.
>
> I don't understand why it's a good idea to improve the compression side
> of pglz. There's plenty other people that spent a lot of time developing
> better compression algorithms.
Improving the compression side of pglz covers two different projects:
1. Faster compression with less code and the same compression ratio (the patch in this thread).
2. Better compression ratio with at least the same compression speed on uncompressed values.
Why do I want to do a patch for 2? Because it's interesting.
Will 1 or 2 be reviewed or committed? I have no idea.
Will many users benefit from 1 or 2? Yes, clearly. Unless we force everyone to stop compressing with pglz.

>> Currently, pglz starts with empty cache map: there is no prior 4k bytes before start. We can add imaginary prefix to
anydata with common substrings: this will enhance compression ratio. 
>> It is hard to decide on training data set for this "common prefix". So we want to produce extension with aggregate
functionwhich produces some "adapted common prefix" from users's data. 
>> Then we can "reserve" few negative bytes for "decompression commands". This command can instruct database on which
commonprefix to use. 
>> But also system command can say "invoke decompression from extension".
>>
>> Thus, user will be able to train database compression on his data and substitute pglz compression with custom
compression method seamlessly. 
>>
>> This will make hard-choosen compression unneeded, but seems overly hacky. But there will be no need to have lz4,
zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever
gun user wants for his foot. 
>
> I think this is way too complicated, and will provide not particularly
> much benefit for the majority users.
>
> In fact, I'll argue that we should flat out reject any such patch until
> we have at least one decent default compression algorithm in
> core. You're trying to work around a poor compression algorithm with
> complicated dictionary improvement
OK. The idea of something plugged into pglz seemed odd even to me.
But looks like it restarted lz4 discussion :)

> , that require user interaction, and
> only will work in a relatively small subset of the cases, and will very
> often increase compression times.
No, surely, if the "common prefix" implementation increases compression times I will not even post a patch.
BTW, lz4 also supports a "common prefix", so let's do that too?
Here's a link to the Zstd dictionary builder; it is also compatible with lz4:
https://github.com/facebook/zstd#the-case-for-small-data-compression
We actually have small datums.
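
For reference, lz4's dictionary ("common prefix") API is quite small. A
minimal sketch of compressing and decompressing against a shared prefix
could look like the following (error handling omitted; the wrapper names
are illustrative, only the LZ4_* calls are the real library API):

#include <lz4.h>

/* Compress src using dict as a shared prefix ("common prefix"). */
static int
compress_with_dict(const char *dict, int dictlen,
                   const char *src, int srclen,
                   char *dst, int dstcap)
{
    LZ4_stream_t *stream = LZ4_createStream();
    int         nbytes;

    LZ4_loadDict(stream, dict, dictlen);
    nbytes = LZ4_compress_fast_continue(stream, src, dst, srclen, dstcap, 1);
    LZ4_freeStream(stream);
    return nbytes;              /* 0 on failure */
}

/* Decompress data that was compressed against the same dictionary. */
static int
decompress_with_dict(const char *dict, int dictlen,
                     const char *src, int srclen,
                     char *dst, int dstcap)
{
    return LZ4_decompress_safe_usingDict(src, dst, srclen, dstcap,
                                         dict, dictlen);
}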

> 4 авг. 2019 г., в 5:41, Petr Jelinek <petr@2ndquadrant.com> написал(а):
>
> Just so that we don't idly talk, what do you think about the attached?
> It:
> - adds new GUC compression_algorithm with possible values of pglz (default) and lz4 (if lz4 is compiled in), requires
SIGHUP
> - adds --with-lz4 configure option (default yes, so the configure option is actually --without-lz4) that enables the
lz4, it's using system library 
> - uses the compression_algorithm for both TOAST and WAL compression (if on)
> - supports slicing for lz4 as well (pglz was already supported)
> - supports reading old TOAST values
> - adds 1 byte header to the compressed data where we currently store the algorithm kind, that leaves us with 254 more
to add :) (that's an extra overhead compared to the current state) 
> - changes the rawsize in TOAST header to 31 bits via bit packing
> - uses the extra bit to differentiate between old and new format
> - supports reading from table which has different rows stored with different algorithm (so that the GUC itself can be
freely changed) 
That's cool. I suggest defaulting to lz4 if it is available. You cannot start a cluster on non-lz4 binaries which used
lz4 once. 
Do we plan for the possibility of compression algorithms as extensions? Or will all algorithms be packed into that byte in
core?
What about lz4 "common prefix"? System or user-defined. If lz4 is compiled in we can even offer in-system training,
just make sure that trained prefixes will make their way to standbys. 

Best regards, Andrey Borodin.


Re: pglz performance

From
Tomas Vondra
Date:
On Sun, Aug 04, 2019 at 02:41:24AM +0200, Petr Jelinek wrote:
>Hi,
>
>On 02/08/2019 21:48, Tomas Vondra wrote:
>>On Fri, Aug 02, 2019 at 11:20:03AM -0700, Andres Freund wrote:
>>
>>>
>>>>Another question is whether we'd actually want to include the code in
>>>>core directly, or use system libraries (and if some packagers might
>>>>decide to disable that, for whatever reason).
>>>
>>>I'd personally say we should have an included version, and a
>>>--with-system-... flag that uses the system one.
>>>
>>
>>OK. I'd say to require a system library, but that's a minor detail.
>>
>
>Same here.
>
>Just so that we don't idly talk, what do you think about the attached?
>It:
>- adds new GUC compression_algorithm with possible values of pglz 
>(default) and lz4 (if lz4 is compiled in), requires SIGHUP
>- adds --with-lz4 configure option (default yes, so the configure 
>option is actually --without-lz4) that enables the lz4, it's using 
>system library
>- uses the compression_algorithm for both TOAST and WAL compression (if on)
>- supports slicing for lz4 as well (pglz was already supported)
>- supports reading old TOAST values
>- adds 1 byte header to the compressed data where we currently store 
>the algorithm kind, that leaves us with 254 more to add :) (that's an 
>extra overhead compared to the current state)
>- changes the rawsize in TOAST header to 31 bits via bit packing
>- uses the extra bit to differentiate between old and new format
>- supports reading from table which has different rows stored with 
>different algorithm (so that the GUC itself can be freely changed)
>

Cool.

>Simple docs and a TAP test included.
>
>I did some basic performance testing (it's not really my thing though, 
>so I would appreciate if somebody did more).
>I get about 7x perf improvement on data load with lz4 compared to pglz 
>on my dataset but strangely only tiny decompression improvement. 
>Perhaps more importantly I also did before patch and after patch tests 
>with pglz and the performance difference with my data set was <1%.
>
>Note that this will just link against lz4, it does not add lz4 into 
>PostgreSQL code-base.
>

WFM, although I think Andres wanted to do both (link against system and
add lz4 code as a fallback). I think the question is what happens when
you run with lz4 for a while, and then switch to binaries without lz4
support. Or when you try to replicate from lz4-enabled instance to an
instance without it. Especially for physical replication, but I suppose
it may affect logical replication with binary protocol?


A very brief review:


1) I wonder what "runstatedir" is about.


2) This seems rather suspicious, and obviously the comment is now
entirely bogus:

 /* Check that off_t can represent 2**63 - 1 correctly.
    We can't simply define LARGE_OFF_T to be 9223372036854775807,
    since some C++ compilers masquerading as C compilers
    incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))


3) I can't really build without lz4:

config.status: linking src/makefiles/Makefile.linux to src/Makefile.port
pg_lzcompress.c: In function ‘pg_compress_bound’:
pg_lzcompress.c:892:22: error: ‘SIZEOF_PG_COMPRESS_HEADER’ undeclared (first use in this function)
    return slen + 4 + SIZEOF_PG_COMPRESS_HEADER;
                      ^~~~~~~~~~~~~~~~~~~~~~~~~
pg_lzcompress.c:892:22: note: each undeclared identifier is reported only once for each function it appears in


4) I did a simple test with physical replication, with lz4 enabled on
both sides (well, can't build without lz4 anyway, per previous point).
It immediately failed like this:

 FATAL:  failed to restore block image
 CONTEXT:  WAL redo at 0/5000A40 for Btree/INSERT_LEAF: off 138
 LOG:  startup process (PID 15937) exited with exit code 1

This is a simple UPDATE on a trivial table:

 create table t (a int primary key);
 insert into t select i from generate_series(1,1000) s(i);
 update t set a = a - 100000 where random () < 0.1;

with some checkpoints to force FPW (and wal_compression=on, of course).

I haven't tried `make check-world` but I suppose some of the TAP tests
should fail because of this. And if not, we need to improve coverage.


5) I wonder why compression_algorithm is defined as PGC_SIGHUP. Why not
allow users to set it per session? I suppose we might have a separate
option for WAL compression_algorithm.


6) It seems a bit strange that pg_compress/pg_decompress are now defined
in pglz_compress.{c,h}. Maybe we should invent src/common/compress.c?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Petr Jelinek
Date:
Hi,

On 04/08/2019 11:57, Andrey Borodin wrote:
> 
> 
>> 2 авг. 2019 г., в 21:39, Andres Freund <andres@anarazel.de> написал(а):
>>
>> On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
>>> We have some kind of "roadmap" of "extensible pglz". We plan to provide implementation on Novembers CF.
>>
>> I don't understand why it's a good idea to improve the compression side
>> of pglz. There's plenty other people that spent a lot of time developing
>> better compression algorithms.
> Improving compression side of pglz has two different projects:
> 1. Faster compression with less code and same compression ratio (patch in this thread).
> 2. Better compression ratio with at least same compression speed of uncompressed values.
> Why I want to do patch for 2? Because it's interesting.
> Will 1 or 2 be reviewed or committed? I have no idea.
> Will many users benefit from 1 or 2? Yes, clearly. Unless we force everyone to stop compressing with pglz.
> 

FWIW I agree.

>> Just so that we don't idly talk, what do you think about the attached?
>> It:
>> - adds new GUC compression_algorithm with possible values of pglz (default) and lz4 (if lz4 is compiled in),
requires SIGHUP
 
>> - adds --with-lz4 configure option (default yes, so the configure option is actually --without-lz4) that enables the
lz4, it's using system library
 
>> - uses the compression_algorithm for both TOAST and WAL compression (if on)
>> - supports slicing for lz4 as well (pglz was already supported)
>> - supports reading old TOAST values
>> - adds 1 byte header to the compressed data where we currently store the algorithm kind, that leaves us with 254
more to add :) (that's an extra overhead compared to the current state)
 
>> - changes the rawsize in TOAST header to 31 bits via bit packing
>> - uses the extra bit to differentiate between old and new format
>> - supports reading from table which has different rows stored with different algorithm (so that the GUC itself can
be freely changed)
 
> That's cool. I suggest defaulting to lz4 if it is available. You cannot start cluster on non-lz4 binaries which used
lz4 once.
 
> Do we plan the possibility of compression algorithm as extension? Or will all algorithms be packed into that byte in
core?

What I wrote does not expect extensions providing new compression. We'd 
have to somehow reserve compression ids for specific extensions and that 
seems like a lot of extra complexity for little benefit. I don't see 
much benefit in having more than say 3 generic compressors (I could 
imagine adding zstd). If you are thinking about data type specific 
compression then I think this is the wrong layer.

> What about lz4 "common prefix"? System or user-defined. If lz4 is compiled in we can even offer in-system training,
just make sure that trained prefixes will make their way to standbys.
 
> 

I definitely don't plan to work on the common prefix. But I don't see why that 
could not be added later.

-- 
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: pglz performance

From
Petr Jelinek
Date:
Hi,

On 04/08/2019 13:52, Tomas Vondra wrote:
> On Sun, Aug 04, 2019 at 02:41:24AM +0200, Petr Jelinek wrote:
>> Hi,
>>
>> On 02/08/2019 21:48, Tomas Vondra wrote:
>>> On Fri, Aug 02, 2019 at 11:20:03AM -0700, Andres Freund wrote:
>>>
>>>>
>>>>> Another question is whether we'd actually want to include the code in
>>>>> core directly, or use system libraries (and if some packagers might
>>>>> decide to disable that, for whatever reason).
>>>>
>>>> I'd personally say we should have an included version, and a
>>>> --with-system-... flag that uses the system one.
>>>>
>>>
>>> OK. I'd say to require a system library, but that's a minor detail.
>>>
>>
>> Same here.
>>
>> Just so that we don't idly talk, what do you think about the attached?
>> It:
>> - adds new GUC compression_algorithm with possible values of pglz 
>> (default) and lz4 (if lz4 is compiled in), requires SIGHUP
>> - adds --with-lz4 configure option (default yes, so the configure 
>> option is actually --without-lz4) that enables the lz4, it's using 
>> system library
>> - uses the compression_algorithm for both TOAST and WAL compression 
>> (if on)
>> - supports slicing for lz4 as well (pglz was already supported)
>> - supports reading old TOAST values
>> - adds 1 byte header to the compressed data where we currently store 
>> the algorithm kind, that leaves us with 254 more to add :) (that's an 
>> extra overhead compared to the current state)
>> - changes the rawsize in TOAST header to 31 bits via bit packing
>> - uses the extra bit to differentiate between old and new format
>> - supports reading from table which has different rows stored with 
>> different algorithm (so that the GUC itself can be freely changed)
>>
> 
> Cool.
> 
>> Simple docs and a TAP test included.
>>
>> I did some basic performance testing (it's not really my thing though, 
>> so I would appreciate if somebody did more).
>> I get about 7x perf improvement on data load with lz4 compared to pglz 
>> on my dataset but strangely only tiny decompression improvement. 
>> Perhaps more importantly I also did before patch and after patch tests 
>> with pglz and the performance difference with my data set was <1%.
>>
>> Note that this will just link against lz4, it does not add lz4 into 
>> PostgreSQL code-base.
>>
> 
> WFM, although I think Andres wanted to do both (link against system and
> add lz4 code as a fallback). I think the question is what happens when
> you run with lz4 for a while, and then switch to binaries without lz4
> support. Or when you try to replicate from lz4-enabled instance to an
> instance without it. Especially for physical replication, but I suppose
> it may affect logical replication with binary protocol?
> 

I generally prefer having a system library, we don't include for example 
ICU either.

> 
> A very brief review:
> 
> 
> 1) I wonder what "runstatedir" is about.
> 

No idea, that stuff is generated by autoconf from configure.in.

> 
> 2) This seems rather suspicious, and obviously the comment is now
> entirely bogus:
> 
> /* Check that off_t can represent 2**63 - 1 correctly.
>     We can't simply define LARGE_OFF_T to be 9223372036854775807,
>     since some C++ compilers masquerading as C compilers
>     incorrectly reject 9223372036854775807.  */
> -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
> +#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) 
> << 31))
> 

Same as above. TBH I am not sure why we even include configure in git 
repo given that different autoconf versions can build different outputs.

> 
> 3) I can't really build without lz4:
> 
> config.status: linking src/makefiles/Makefile.linux to src/Makefile.port
> pg_lzcompress.c: In function ‘pg_compress_bound’:
> pg_lzcompress.c:892:22: error: ‘SIZEOF_PG_COMPRESS_HEADER’ undeclared 
> (first use in this function)
>     return slen + 4 + SIZEOF_PG_COMPRESS_HEADER;
>                       ^~~~~~~~~~~~~~~~~~~~~~~~~
> pg_lzcompress.c:892:22: note: each undeclared identifier is reported 
> only once for each function it appears in
> 

Okay, that's just a problem of SIZEOF_PG_COMPRESS_HEADER being defined 
inside the HAVE_LZ4 ifdef while it should be defined above the ifdef.

> 
> 4) I did a simple test with physical replication, with lz4 enabled on
> both sides (well, can't build without lz4 anyway, per previous point).
> It immediately failed like this:
> 
> FATAL:  failed to restore block image
> CONTEXT:  WAL redo at 0/5000A40 for Btree/INSERT_LEAF: off 138
> LOG:  startup process (PID 15937) exited with exit code 1
> 
> This is a simple UPDATE on a trivial table:
> 
> create table t (a int primary key);
> insert into t select i from generate_series(1,1000) s(i);
> update t set a = a - 100000 where random () < 0.1;
> 
> with some checkpoints to force FPW (and wal_compression=on, of course).
> 
> I haven't tried `make check-world` but I suppose some of the TAP tests
> should fail because of this. And if not, we need to improve coverage.
> 

FWIW I did run check-world without problems, will have to look into this.

> 
> 5) I wonder why compression_algorithm is defined as PGC_SIGHUP. Why not
> to allow users to set it per session? I suppose we might have a separate
> option for WAL compression_algorithm.
>

Yeah I was thinking we might want to change wal_compression to enum as 
well. Although that complicates the code quite a bit (the caller has to 
decide the algorithm instead of the compression system doing it).

> 
> 6) It seems a bit strange that pg_compress/pg_decompress are now defined
> in pglz_compress.{c,h}. Maybe we should invent src/common/compress.c?
> 

It does not seem worth it to invent a new module for like 20 lines of 
wrapper code.

-- 
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: pglz performance

From
Tomas Vondra
Date:
On Sun, Aug 04, 2019 at 05:53:26PM +0200, Petr Jelinek wrote:
>
> ...
>
>>
>>4) I did a simple test with physical replication, with lz4 enabled on
>>both sides (well, can't build without lz4 anyway, per previous point).
>>It immediately failed like this:
>>
>>FATAL:  failed to restore block image
>>CONTEXT:  WAL redo at 0/5000A40 for Btree/INSERT_LEAF: off 138
>>LOG:  startup process (PID 15937) exited with exit code 1
>>
>>This is a simple UPDATE on a trivial table:
>>
>>create table t (a int primary key);
>>insert into t select i from generate_series(1,1000) s(i);
>>update t set a = a - 100000 where random () < 0.1;
>>
>>with some checkpoints to force FPW (and wal_compression=on, of course).
>>
>>I haven't tried `make check-world` but I suppose some of the TAP tests
>>should fail because of this. And if not, we need to improve coverage.
>>
>
>FWIW I did run check-world without problems, will have to look into this.
>

Not sure if we bother to set wal_compression=on for check-world (I don't
think we do, but I may be missing something), so maybe check-world does
not really test wal compression.

IMO the issue is that RestoreBlockImage() still calls pglz_decompress
directly, instead of going through pg_decompress().


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Andres Freund
Date:
Hi,

On 2019-08-04 02:41:24 +0200, Petr Jelinek wrote:
> Same here.
> 
> Just so that we don't idly talk, what do you think about the attached?

Cool!

> It:
> - adds new GUC compression_algorithm with possible values of pglz (default)
> and lz4 (if lz4 is compiled in), requires SIGHUP

As Tomas remarked, I think it shouldn't be SIGHUP but USERSET. And I
think lz4 should be preferred, if available.  I could see us using a
list style guc, so we could set it to lz4, pglz, and the first available
one would be used.
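
A standalone sketch of that "first available one wins" idea, outside of
the actual GUC machinery (the function and buffer names are made up for
illustration; HAVE_LZ4 is the build flag mentioned elsewhere in this
thread):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#ifdef HAVE_LZ4
#define LZ4_AVAILABLE true
#else
#define LZ4_AVAILABLE false
#endif

/* Return the first algorithm in a comma-separated list that this build supports. */
static const char *
pick_compression(const char *setting)
{
    char        buf[128];
    char       *tok;
    char       *save = NULL;

    snprintf(buf, sizeof(buf), "%s", setting);
    for (tok = strtok_r(buf, ", ", &save); tok != NULL;
         tok = strtok_r(NULL, ", ", &save))
    {
        if (strcmp(tok, "lz4") == 0 && LZ4_AVAILABLE)
            return "lz4";
        if (strcmp(tok, "pglz") == 0)
            return "pglz";
    }
    return "pglz";              /* pglz is always compiled in */
}

/* pick_compression("lz4, pglz") yields "lz4" on lz4-enabled builds, "pglz" otherwise. */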

> - adds 1 byte header to the compressed data where we currently store the
> algorithm kind, that leaves us with 254 more to add :) (that's an extra
> overhead compared to the current state)

Hm. Why do we need an additional byte?  IIRC my patch added that only
for the case we would run out of space for compression formats without
extending any sizes?


> - changes the rawsize in TOAST header to 31 bits via bit packing
> - uses the extra bit to differentiate between old and new format

Hm. Wouldn't it be easier to just use a different vartag for this?


> - I expect my changes to configure.in are not the greatest as I don't have
> pretty much zero experience with autoconf

FWIW the configure output changes are likely because you used a modified
version of autoconf. Unfortunately debian/ubuntu ship one with vendor
patches.

Greetings,

Andres Freund



Re: pglz performance

From
Petr Jelinek
Date:
Hi,

On 04/08/2019 21:20, Andres Freund wrote:
> On 2019-08-04 02:41:24 +0200, Petr Jelinek wrote:
>> Same here.
>>
>> Just so that we don't idly talk, what do you think about the attached?
> 
> Cool!
> 
>> It:
>> - adds new GUC compression_algorithm with possible values of pglz (default)
>> and lz4 (if lz4 is compiled in), requires SIGHUP
> 
> As Tomas remarked, I think it shouldn't be SIGHUP but USERSET. And I
> think lz4 should be preferred, if available.  I could see us using a
> list style guc, so we could set it to lz4, pglz, and the first available
> one would be used.
> 

Sounds reasonable.

>> - adds 1 byte header to the compressed data where we currently store the
>> algorithm kind, that leaves us with 254 more to add :) (that's an extra
>> overhead compared to the current state)
> 
> Hm. Why do we need an additional byte?  IIRC my patch added that only
> for the case we would run out of space for compression formats without
> extending any sizes?
> 

Yeah your patch worked differently (I didn't actually use any code from 
it). The main reason why I add the byte is that I am storing the 
algorithm in the compressed value itself, not in varlena header. I was 
mainly trying to not have every caller care about storing and loading 
the compression algorithm. I also can't say I particularly like that 
hack in your patch.

However if we'd want to have separate GUCs for TOAST and WAL then we'll 
have to do that anyway so maybe it does not matter anymore (we can't use 
similar hack there AFAICS though).

> 
>> - changes the rawsize in TOAST header to 31 bits via bit packing
>> - uses the extra bit to differentiate between old and new format
> 
> Hm. Wouldn't it be easier to just use a different vartag for this?
> 

That would only work for external TOAST pointers right? The compressed 
varlena can also be stored inline and potentially in index tuple.

> 
>> - I expect my changes to configure.in are not the greatest as I don't have
>> pretty much zero experience with autoconf
> 
> FWIW the configure output changes are likely because you used a modified
> version of autoconf. Unfortunately debian/ubuntu ship one with vendor
> patches.
> 

Yeah, Ubuntu here, that explains.

-- 
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: pglz performance

From
Andres Freund
Date:
Hi,

On 2019-08-04 17:53:26 +0200, Petr Jelinek wrote:
> > 5) I wonder why compression_algorithm is defined as PGC_SIGHUP. Why not
> > to allow users to set it per session? I suppose we might have a separate
> > option for WAL compression_algorithm.
> > 
> 
> Yeah I was thinking we might want to change wal_compression to enum as well.
> Although that complicates the code quite a bit (the caller has to decide
> algorithm instead compression system doing it).

Isn't that basically required anyway? The WAL record will need to carry
information about the type of compression used, independent of
PGC_SIGHUP/PGC_USERSET, unless you want to make it an initdb option or
something super heavyweight like that.

We could just have the wal_compression assign hook set the compression
callback and a compression type integer or such if we want to avoid
doing that kind of thing at runtime.

Greetings,

Andres Freund



Re: pglz performance

From
Petr Jelinek
Date:
Hi,

On 05/08/2019 00:15, Andres Freund wrote:
> Hi,
> 
> On 2019-08-04 17:53:26 +0200, Petr Jelinek wrote:
>>> 5) I wonder why compression_algorithm is defined as PGC_SIGHUP. Why not
>>> to allow users to set it per session? I suppose we might have a separate
>>> option for WAL compression_algorithm.
>>>
>>
>> Yeah I was thinking we might want to change wal_compression to enum as well.
>> Although that complicates the code quite a bit (the caller has to decide
>> algorithm instead compression system doing it).
> 
> Isn't that basically required anyway? The WAL record will need to carry
> information about the type of compression used, independent of
> PGC_SIGHUP/PGC_USERSET, unless you want to make it an initdb option or
> something super heavyweight like that.
> 

It carries that information inside the compressed value, like I said in 
the other reply, that's why the extra byte.

-- 
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: pglz performance

From
Michael Paquier
Date:
On Fri, Aug 02, 2019 at 07:52:39PM +0200, Tomas Vondra wrote:
> On Fri, Aug 02, 2019 at 10:12:58AM -0700, Andres Freund wrote:
>> Why would they be stuck continuing to *compress* with pglz? As we
>> fully retoast on write anyway we can just gradually switch over to the
>> better algorithm. Decompression speed is another story, of course.
>
> Hmmm, I don't remember the details of those patches so I didn't realize
> it allows incremental recompression. If that's possible, that would mean
> existing systems can start using it. Which is good.

It may become a problem on some platforms though (Windows?), so
patches to improve either the compression or decompression of pglz are
not that crazy, as they are still likely going to be used, and for
read-mostly workloads switching to a new algo may not be worth the extra
cost, so it is not like we are going to drop it completely either.  My
take, still the same as upthread, is that it mostly depends on the amount
of complexity each patch introduces compared to the performance gain.

> Another question is whether we'd actually want to include the code in
> core directly, or use system libraries (and if some packagers might
> decide to disable that, for whatever reason).

Linking to system libraries would make our maintenance much easier,
and when it comes to have a copy of something else in the tree we
would be stuck with more maintenance around it.  These tend to rot
easily.  After that comes the case where the compression algo is present
in the binary on one server but not on another, in which case we have an
automatic ERROR in case of a mismatching algo, or a FATAL when
decompressing FPWs at recovery when wal_compression is used.
--
Michael

Attachment

Re: pglz performance

From
Andres Freund
Date:
Hi,

On 2019-08-05 00:19:28 +0200, Petr Jelinek wrote:
> It carries that information inside the compressed value, like I said in the
> other reply, that's why the extra byte.

I'm not convinced that that is a good plan - imo the reference to the
compressed data should carry that information.

I.e. in the case of toast, at least toast pointers should hold enough
information to determine the compression algorithm. And in the case of
WAL, the WAL record should contain that.

Consider e.g. adding support for slice fetching of datums compressed
with some algorithm - we should be able to determine whether that's
possible without fetching the datum (so we can take a non-exceptional
path for datums compressed otherwise). Similarly, for WAL, we should be
able to detect whether an incompatible compression format is used,
without having to invoke a generic compression routine that then fails
in some way. Or adding compression reporting for WAL to xlogdump.

I also don't particularly like baking in the assumption that we don't
support tuples larger than 1GB in further places. To me it seems likely
that we're going to have to fix that, and it's hard enough already... I
know that my patch did that too...

For external datums I suggest encoding the compression method as a
distinct VARTAG_ONDISK_COMPRESSED, and then have that include the
compression method as a field.

For in-line compressed values (so VARATT_IS_4B_C), doing something
roughly like you did, indicating the type of metadata following using
the high bit sounds reasonable. But I think I'd make it so that if the
highbit is set, the struct is instead entirely different, keeping a full
4 byte byte length, and including the compression type header inside the
struct. Perhaps I'd combine the compression type with the high-bit-set
part?  So when the high bit is set, it'd be something like

{
    int32        vl_len_;        /* varlena header (do not touch directly!) */

    /*
     * Actually only 7 bit, the high bit determines whether this
     * is the old compression header (if unset), or this type of header
     * (if set).
     */
    uint8 type;

    /*
     * Stored as uint8[4], to avoid unnecessary alignment padding.
     */
    uint8 length[4];

    char va_data[FLEXIBLE_ARRAY_MEMBER];
}

I think it's worth spending some effort trying to get this right - we'll
be bound by design choices for a while.


It's kinda annoying that toast datums aren't better designed. Creating
them from scratch, I'd make it something like:

1) variable-width integer describing the "physical length", so that
   tuple deforming can quickly determine length - all the ifs necessary
   to determine lengths are a bottleneck. I'd probably just use a 127bit
   encoding, if the high bit is set, there's a following length byte.

2) type of toasted datum, probably also in a variable width encoding,
   starting at 1byte. Not that I think it's likely we'd overrun 256
   types - but it's cheap enough to just declare the high bit as a
   length extension bit.

These are always stored unaligned. So there's no need to deal with
padding bytes having to be zero to determine whether we're dealing with
a 1byte datum etc.

Then, type dependant:

For In-line uncompressed datums
3a) alignment padding, amount determined by 2) above, i.e. we'd just
    have different types for different amounts of alignment. Probably
    using some heuristic to use unaligned when either dealing with data
    that doesn't need alignment, or when the datum is fairly small, so
    copying to get the data as unaligned won't be a significant penalty.
4a) data

For in-line compressed datums
3b) compression metadata {varint rawsize, varint compression algorithm}
4b) unaligned compressed data - there's no benefit in keeping it aligned

For External toast for uncompressed data:
3d) {toastrelid, valueid, varint rawsize}

For External toast for compressed data:
3e) {valueid, toastrelid, varint compression_algorithm, varint rawsize, varint extsize}


That'd make it a lot more extensible, easier to understand, faster to
decode in a lot of cases, remove a lot of arbitrary limits.  Yes, it'd
increase the header size for small datums to two bytes, but I think
that'd be more than bought back by the other improvements.
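
As an aside, the variable-width integer in 1) and 2) above (one byte for
small values, the high bit signalling that another byte follows) is
essentially a base-128 varint. A minimal standalone sketch, just to make
that encoding concrete (illustrative code, not from any patch):

#include <stddef.h>
#include <stdint.h>

/*
 * Decode a varint: 7 payload bits per byte, high bit set means another
 * byte follows.  Returns the number of bytes consumed, or 0 on overrun.
 */
static size_t
varint_decode(const uint8_t *buf, size_t buflen, uint64_t *value)
{
    uint64_t    result = 0;
    int         shift = 0;
    size_t      i;

    for (i = 0; i < buflen && shift < 64; i++)
    {
        result |= (uint64_t) (buf[i] & 0x7F) << shift;
        if ((buf[i] & 0x80) == 0)
        {
            *value = result;
            return i + 1;
        }
        shift += 7;
    }
    return 0;
}

/* Encode a varint; lengths below 128 cost a single byte. */
static size_t
varint_encode(uint64_t value, uint8_t *buf)
{
    size_t      n = 0;

    do
    {
        uint8_t     b = value & 0x7F;

        value >>= 7;
        if (value != 0)
            b |= 0x80;
        buf[n++] = b;
    } while (value != 0);

    return n;
}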


Greetings,

Andres Freund



Re: pglz performance

From
Andres Freund
Date:
Hi,

On 2019-08-05 16:04:46 +0900, Michael Paquier wrote:
> On Fri, Aug 02, 2019 at 07:52:39PM +0200, Tomas Vondra wrote:
> > On Fri, Aug 02, 2019 at 10:12:58AM -0700, Andres Freund wrote:
> >> Why would they be stuck continuing to *compress* with pglz? As we
> >> fully retoast on write anyway we can just gradually switch over to the
> >> better algorithm. Decompression speed is another story, of course.
> > 
> > Hmmm, I don't remember the details of those patches so I didn't realize
> > it allows incremental recompression. If that's possible, that would mean
> > existing systems can start using it. Which is good.
> 
> It may become a problem on some platforms though (Windows?), so
> patches to improve either the compression or decompression of pglz are
> not that much crazy as they are still likely going to be used, and for
> read-mostly switching to a new algo may not be worth the extra cost so
> it is not like we are going to drop it completely either.

What's the platform dependency that you're thinking of? And how's
compression speed relevant to "read mostly"?  Switching would just
happen whenever tuple fields are changed. And it'll not have an
additional cost, because all it does is reduce the cost of a toast write
that'd otherwise have happened with pglz.


> Linking to system libraries would make our maintenance much easier,
> and when it comes to have a copy of something else in the tree we
> would be stuck with more maintenance around it.  These tend to rot
> easily.

I don't think it's really our experience that they "rot easily".


> After that comes the case where the compression algo is not
> in the binary across one server to another, in which case we have an
> automatic ERROR in case of a mismatching algo, or FATAL for
> deompression of FPWs at recovery when wal_compression is used.

Huh? That's a failure case that only exists if you don't include it in
the tree (with the option to use an out-of-tree lib)?

Greetings,

Andres Freund



Re: pglz performance

From
Peter Eisentraut
Date:
On 2019-06-24 10:44, Andrey Borodin wrote:
>> 18 мая 2019 г., в 11:44, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>>
> Hi! 
> Here's rebased version of patches.
> 
> Best regards, Andrey Borodin.

I think this is the most recent patch for the CF entry
<https://commitfest.postgresql.org/24/2119/>.

What about the two patches?  Which one is better?

Have you also considered using memmove() to deal with the overlap issue?

Benchmarks have been posted in this thread.  Where is the benchmarking
tool?  Should we include that in the source somehow?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Andrey Borodin
Date:
Hi, Peter! Thanks for looking into this.

> 4 сент. 2019 г., в 14:09, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> написал(а):
>
> On 2019-06-24 10:44, Andrey Borodin wrote:
>>> 18 мая 2019 г., в 11:44, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>>>
>> Hi!
>> Here's rebased version of patches.
>>
>> Best regards, Andrey Borodin.
>
> I think this is the most recent patch for the CF entry
> <https://commitfest.postgresql.org/24/2119/>.
>
> What about the two patches?  Which one is better?
In our observations pglz_decompress_hacked.patch is best for most of the tested platforms.
The difference is that pglz_decompress_hacked8.patch will not apply the optimization if the decompressed match is not greater than
8 bytes. This optimization was suggested by Tom, that's why we benchmarked it specifically. 

> Have you also considered using memmove() to deal with the overlap issue?
Yes, memmove() resolves the ambiguity of copying overlapping regions in a way that is not compatible with pglz. In the proposed
patch we never copy overlapping regions. 

> Benchmarks have been posted in this thread.  Where is the benchmarking
> tool?  Should we include that in the source somehow?

The benchmarking tool is here [0]. Well, the code of the benchmarking tool does not adhere to our standards in some places, so we
did not consider its inclusion in core. 
However, the most questionable part of the benchmarking is the choice of test data. It's about 100Mb of useless WALs, a datafile and
valuable Shakespeare writings. 

Best regards, Andrey Borodin.


[0] https://github.com/x4m/test_pglz




Re: pglz performance

From
Peter Eisentraut
Date:
On 2019-09-04 11:22, Andrey Borodin wrote:
>> What about the two patches?  Which one is better?
> On our observations pglz_decompress_hacked.patch is best for most of tested platforms.
> Difference is that pglz_decompress_hacked8.patch will not appply optimization if decompressed match is not greater
than 8 bytes. This optimization was suggested by Tom, that's why we benchmarked it specifically.
 

The patches attached to the message I was replying to are named

0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch
0001-Use-memcpy-in-pglz-decompression.patch

Are those the same ones?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Andrey Borodin
Date:

> 4 сент. 2019 г., в 17:40, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> написал(а):
>
> On 2019-09-04 11:22, Andrey Borodin wrote:
>>> What about the two patches?  Which one is better?
>> On our observations pglz_decompress_hacked.patch is best for most of tested platforms.
>> Difference is that pglz_decompress_hacked8.patch will not appply optimization if decompressed match is not greater
than 8 bytes. This optimization was suggested by Tom, that's why we benchmarked it specifically. 
>
> The patches attached to the message I was replying to are named
>
> 0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch
> 0001-Use-memcpy-in-pglz-decompression.patch
>
> Are those the same ones?

Yes. Sorry for this confusion.

The only difference of 0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch is that it falls back to the byte loop
if len is <= 8. 

Best regards, Andrey Borodin.


Re: pglz performance

From
Alvaro Herrera
Date:
I just noticed we had two CF items pointing to this thread,

https://commitfest.postgresql.org/24/2119/
https://commitfest.postgresql.org/24/2180/

so I marked the newer one as withdrawn.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Oleg Bartunov
Date:
On Wed, Sep 4, 2019 at 12:22 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi, Peter! Thanks for looking into this.
>
> > 4 сент. 2019 г., в 14:09, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> написал(а):
> >
> > On 2019-06-24 10:44, Andrey Borodin wrote:
> >>> 18 мая 2019 г., в 11:44, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
> >>>
> >> Hi!
> >> Here's rebased version of patches.
> >>
> >> Best regards, Andrey Borodin.
> >
> > I think this is the most recent patch for the CF entry
> > <https://commitfest.postgresql.org/24/2119/>.
> >
> > What about the two patches?  Which one is better?
> On our observations pglz_decompress_hacked.patch is best for most of tested platforms.
> Difference is that pglz_decompress_hacked8.patch will not appply optimization if decompressed match is not greater
than 8 bytes. This optimization was suggested by Tom, that's why we benchmarked it specifically. 
>
> > Have you also considered using memmove() to deal with the overlap issue?
> Yes, memmove() resolves ambiguity of copying overlapping regions in a way that is not compatible with pglz. In
proposed patch we never copy overlapping regions. 
>
> > Benchmarks have been posted in this thread.  Where is the benchmarking
> > tool?  Should we include that in the source somehow?
>
> Benchmarking tool is here [0]. Well, code of the benchmarking tool do not adhere to our standards in some places, we
did not consider its inclusion in core. 
> However, most questionable part of benchmarking is choice of test data. It's about 100Mb of useless WALs, datafile
and valuable Shakespeare writings. 

Why not use the 'Silesia compression corpus'
(http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia), which is used by
lzbench (https://github.com/inikep/lzbench)?  Teodor and I remember
that testing on non-English texts could be very important.


>
> Best regards, Andrey Borodin.
>
>
> [0] https://github.com/x4m/test_pglz
>
>
>


--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: pglz performance

From
Petr Jelinek
Date:
Hi,

On 05/08/2019 09:26, Andres Freund wrote:
> Hi,
> 
> On 2019-08-05 00:19:28 +0200, Petr Jelinek wrote:
>> It carries that information inside the compressed value, like I said in the
>> other reply, that's why the extra byte.
> 
> I'm not convinced that that is a good plan - imo the reference to the
> compressed data should carry that information.
> 
> I.e. in the case of toast, at least toast pointers should hold enough
> information to determine the compression algorithm. And in the case of
> WAL, the WAL record should contain that.
> 
Point taken.

> 
> For external datums I suggest encoding the compression method as a
> distinct VARTAG_ONDISK_COMPRESSED, and then have that include the
> compression method as a field.

So the new reads/writes will use this and reads of old format won't 
change? Sounds fine.

> 
> For in-line compressed values (so VARATT_IS_4B_C), doing something
> roughly like you did, indicating the type of metadata following using
> the high bit sounds reasonable. But I think I'd make it so that if the
> highbit is set, the struct is instead entirely different, keeping a full
> 4 byte byte length, and including the compression type header inside the
> struct. Perhaps I'd combine the compression type with the high-bit-set
> part?  So when the high bit is set, it'd be something like
> 
> {
>      int32        vl_len_;        /* varlena header (do not touch directly!) */
> 
>      /*
>       * Actually only 7 bit, the high bit determines whether this
>       * is the old compression header (if unset), or this type of header
>       * (if set).
>       */
>      uint8 type;
> 
>      /*
>       * Stored as uint8[4], to avoid unnecessary alignment padding.
>       */
>      uint8[4] length;
> 
>      char va_data[FLEXIBLE_ARRAY_MEMBER];
> }
> 

Won't this break backwards compatibility on big-endian (if I understand 
correctly what you are trying to achieve here)?

> I think it's worth spending some effort trying to get this right - we'll
> be bound by design choices for a while.
> 

Sure, however I am not in the business of redesigning TOAST from scratch 
right now, even if I do agree that the current header is far from ideal.

-- 
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: pglz performance

From
Peter Eisentraut
Date:
On 2019-09-04 14:45, Andrey Borodin wrote:
>> On 2019-09-04 11:22, Andrey Borodin wrote:
>>>> What about the two patches?  Which one is better?
>>> On our observations pglz_decompress_hacked.patch is best for most of tested platforms.
>>> Difference is that pglz_decompress_hacked8.patch will not appply optimization if decompressed match is not greater
than 8 bytes. This optimization was suggested by Tom, that's why we benchmarked it specifically.
 
>>
>> The patches attached to the message I was replying to are named
>>
>> 0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch
>> 0001-Use-memcpy-in-pglz-decompression.patch
>>
>> Are those the same ones?
> 
> Yes. Sorry for this confusion.
> 
> The only difference of 0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch is that it fallbacks to byte-loop
if len is <= 8. 
 

After reviewing this thread and more testing, I think
0001-Use-memcpy-in-pglz-decompression.patch appears to be a solid win
and we should move ahead with it.

I don't, however, fully understand the code changes, and I think this
could use more and better comments.  In particular, I wonder about

off *= 2;

This is new logic that isn't explained anywhere.

This whole function is commented a bit strangely.  It begins with
"Otherwise", but there is nothing before it.  And what does "from OUTPUT
to OUTPUT" mean?  There is no "output" variable.  We should make this
match the code better.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Andrey Borodin
Date:
Oleg, Peter, thanks for looking into this!

I hope to benchmark decompression on Silesian corpus soon.

PFA v2 with better comments.

> 27 сент. 2019 г., в 14:41, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> написал(а):
>
> After reviewing this thread and more testing, I think
> 0001-Use-memcpy-in-pglz-decompression.patch appears to be a solid win
> and we should move ahead with it.
>
> I don't, however, fully understand the code changes, and I think this
> could use more and better comments.  In particular, I wonder about
>
> off *= 2;
I've changed this to
off += off;

>
> This is new logic that isn't explained anywhere.
>
> This whole function is commented a bit strangely.  It begins with
> "Otherwise", but there is nothing before it.  And what does "from OUTPUT
> to OUTPUT" mean?  There is no "output" variable.  We should make this
> match the code better.


I've added small example to illustrate what is going on.

Thanks!

--
Andrey Borodin
Open source RDBMS development team leader
Yandex.Cloud

Attachment

Re: pglz performance

From
Andrey Borodin
Date:

> 28 сент. 2019 г., в 10:29, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>
> I hope to benchmark decompression on Silesian corpus soon.

I've done it. And results are quite controversial.
The dataset adds 12 payloads to our 5. The payloads have relatively high entropy. In many cases pglz cannot compress them at
all, so decompression is a no-op and the data is stored as is. 

Decompressor pglz_decompress_hacked result 48.281747
Decompressor pglz_decompress_hacked8 result 33.868779
Decompressor pglz_decompress_vanilla result 42.510165

Tested on Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz

With Silesian corpus pglz_decompress_hacked is actually decreasing performance on high-entropy data.
Meanwhile pglz_decompress_hacked8 is still faster than usual pglz_decompress.
In spite of this benchmarks, I think that pglz_decompress_hacked8 is safer option.

I've updated test suite [0] and anyone interested can verify benchmarks.

--
Andrey Borodin
Open source RDBMS development team leader
Yandex.Cloud

[0] https://github.com/x4m/test_pglz


Re: pglz performance

From
Andrey Borodin
Date:

> 21 окт. 2019 г., в 14:09, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>
> With Silesian corpus pglz_decompress_hacked is actually decreasing performance on high-entropy data.
> Meanwhile pglz_decompress_hacked8 is still faster than usual pglz_decompress.
> In spite of this benchmarks, I think that pglz_decompress_hacked8 is safer option.

Here's v3 which takes into account recent benchmarks with the Silesian Corpus and has better comments.

Thanks!


--
Andrey Borodin
Open source RDBMS development team leader
Yandex.Cloud

Attachment

Re: pglz performance

From
Peter Eisentraut
Date:
On 2019-10-25 07:05, Andrey Borodin wrote:
>> 21 окт. 2019 г., в 14:09, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>>
>> With Silesian corpus pglz_decompress_hacked is actually decreasing performance on high-entropy data.
>> Meanwhile pglz_decompress_hacked8 is still faster than usual pglz_decompress.
>> In spite of this benchmarks, I think that pglz_decompress_hacked8 is safer option.
> 
> Here's v3 which takes into account recent benchmarks with Silesian Corpus and have better comments.

Your message from 21 October appears to say that this change makes the 
performance worse.  So I don't know how to proceed with this.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Alvaro Herrera
Date:
On 2019-Nov-01, Peter Eisentraut wrote:

> On 2019-10-25 07:05, Andrey Borodin wrote:
> > > 21 окт. 2019 г., в 14:09, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
> > > 
> > > With Silesian corpus pglz_decompress_hacked is actually decreasing performance on high-entropy data.
> > > Meanwhile pglz_decompress_hacked8 is still faster than usual pglz_decompress.
> > > In spite of this benchmarks, I think that pglz_decompress_hacked8 is safer option.
> > 
> > Here's v3 which takes into account recent benchmarks with Silesian Corpus and have better comments.
> 
> Your message from 21 October appears to say that this change makes the
> performance worse.  So I don't know how to proceed with this.

As I understand that report, in these results "less is better", so the
hacked8 variant shows better performance (33.8) than current (42.5).
The "hacked" variant shows worse performance (48.2) that the current
code.  The "in spite" phrase seems to have been a mistake.

I am surprised that there is so much variability in the performance
numbers, though, based on such small tweaks of the code.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Tomas Vondra
Date:
On Fri, Nov 01, 2019 at 12:48:28PM -0300, Alvaro Herrera wrote:
>On 2019-Nov-01, Peter Eisentraut wrote:
>
>> On 2019-10-25 07:05, Andrey Borodin wrote:
>> > > 21 окт. 2019 г., в 14:09, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>> > >
>> > > With Silesian corpus pglz_decompress_hacked is actually decreasing performance on high-entropy data.
>> > > Meanwhile pglz_decompress_hacked8 is still faster than usual pglz_decompress.
>> > > In spite of this benchmarks, I think that pglz_decompress_hacked8 is safer option.
>> >
>> > Here's v3 which takes into account recent benchmarks with Silesian Corpus and have better comments.
>>
>> Your message from 21 October appears to say that this change makes the
>> performance worse.  So I don't know how to proceed with this.
>
>As I understand that report, in these results "less is better", so the
>hacked8 variant shows better performance (33.8) than current (42.5).
>The "hacked" variant shows worse performance (48.2) that the current
>code.  The "in spite" phrase seems to have been a mistake.
>
>I am surprised that there is so much variability in the performance
>numbers, though, based on such small tweaks of the code.
>

I'd try running the benchmarks to verify the numbers, and maybe do some
additional tests, but it's not clear to me which patches I should use.

I think the last patches with 'hacked' and 'hacked8' in the name are a
couple of months old, and the recent posts attach just a single patch.
Andrey, can you post current versions of both patches?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Andrey Borodin
Date:

> 1 нояб. 2019 г., в 18:48, Alvaro Herrera <alvherre@2ndquadrant.com> написал(а):
>
> On 2019-Nov-01, Peter Eisentraut wrote:
>
>> On 2019-10-25 07:05, Andrey Borodin wrote:
>>>> 21 окт. 2019 г., в 14:09, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
>>>>
>>>> With Silesian corpus pglz_decompress_hacked is actually decreasing performance on high-entropy data.
>>>> Meanwhile pglz_decompress_hacked8 is still faster than usual pglz_decompress.
>>>> In spite of this benchmarks, I think that pglz_decompress_hacked8 is safer option.
>>>
>>> Here's v3 which takes into account recent benchmarks with Silesian Corpus and have better comments.
>>
>> Your message from 21 October appears to say that this change makes the
>> performance worse.  So I don't know how to proceed with this.
>
> As I understand that report, in these results "less is better", so the
> hacked8 variant shows better performance (33.8) than current (42.5).
> The "hacked" variant shows worse performance (48.2) that the current
> code.
This is correct. Thanks, Álvaro.

> The "in spite" phrase seems to have been a mistake.
Yes. Sorry, I actually thought that "in spite" is a contradiction of "despite" and means "In view of".

> I am surprised that there is so much variability in the performance
> numbers, though, based on such small tweaks of the code.
Silesian Corpus is very different from WALs and PG data files. Data files are rich in long sequences of the same byte. These
sequences are long, and thus unrolled very effectively by the memcpy method. 
But the Silesian corpus is rich in short matches of a few bytes.

> 1 нояб. 2019 г., в 19:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а):
> I'd try running the benchmarks to verify the numbers, and maybe do some
> additional tests, but it's not clear to me which patches should I use.
Cool, thanks!

> I think the last patches with 'hacked' and 'hacked8' in the name are a
> couple of months old, and the recent posts attach just a single patch.
> Andrey, can you post current versions of both patches?
PFA two patches:
v4-0001-Use-memcpy-in-pglz-decompression.patch (known as 'hacked' in test_pglz extension)
v4-0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch (known as 'hacked8')

Best regards, Andrey Borodin.

Attachment

Re: pglz performance

From
Tels
Date:
Hello Andrey,

On 2019-11-02 12:30, Andrey Borodin wrote:
>> 1 нояб. 2019 г., в 18:48, Alvaro Herrera <alvherre@2ndquadrant.com> 
>> написал(а):
> PFA two patches:
> v4-0001-Use-memcpy-in-pglz-decompression.patch (known as 'hacked' in
> test_pglz extension)
> v4-0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch (known
> as 'hacked8')

Looking at the patches, it seems only the case of a match is changed. 
But when we observe a literal byte, this is copied byte-by-byte with:

  else
   {
   /*
    * An unset control bit means LITERAL BYTE. So we just
    * copy one from INPUT to OUTPUT.
    */
   *dp++ = *sp++;
   }

Maybe we can optimize this, too. For instance, you could just increase a 
counter:

  else
   {
   /*
   * An unset control bit means LITERAL BYTE. We count
   * these and copy them later.
   */
   literal_bytes ++;
   }

and in the case of:

   if (ctrl & 1)
     {
     /* First copy all the literal bytes */
     if (literal_bytes > 0)
       {
       memcpy(dp, sp, literal_bytes);
       sp += literal_bytes;
       dp += literal_bytes;
       literal_bytes = 0;
       }

(Code untested!)

The same would need to be done at the very end, if the input ends 
without any new CTRL-byte.

Whether that gains us anything depends on how common literal bytes are. 
It might be that highly compressible input has almost none, while input 
that is a mix of incompressible strings and compressible ones might have 
longer stretches. One example would be something like an SHA-256 that 
is repeated twice. The first instance would be incompressible, the 
second one would be just a copy. This might not happen that often in 
practical inputs, though.

I wonder if you agree and what would happen if you try this variant on 
your corpus tests.

Best regards,

Tels



Re: pglz performance

From
Andrey Borodin
Date:
Hi Tels!
Thanks for your interest in fast decompression.

> 3 нояб. 2019 г., в 12:24, Tels <nospam-pg-abuse@bloodgate.com> написал(а):
>
> I wonder if you agree and what would happen if you try this variant on your corpus tests.

I've tried some different optimizations for literals, for example loop unrolling [0] and bulk copying of literals.
These approaches were bringing some performance improvement, but with noise. Statistically they were somewhere better,
somewhere worse; a net win, but that "net win" depends on which data and which platforms we consider important. 

The proposed patch clearly makes decompression faster on any dataset and platform.
I believe improving pglz further is viable, but optimizations like a common data prefix seem more promising to me.
Also, I think we actually need real codecs like lz4, zstd and brotli instead of our own invented wheel.

If you have some spare time, pull requests to test_pglz are welcome; let's benchmark more micro-optimizations, it
brings a lot of fun :) 


--
Andrey Borodin
Open source RDBMS development team leader
Yandex.Cloud

[0] https://github.com/x4m/test_pglz/blob/master/pg_lzcompress_hacked.c#L166


Re: pglz performance

From
Peter Eisentraut
Date:
On 2019-11-01 16:48, Alvaro Herrera wrote:
> As I understand that report, in these results "less is better", so the
> hacked8 variant shows better performance (33.8) than current (42.5).
> The "hacked" variant shows worse performance (48.2) that the current
> code.

Which appears to be the opposite of, or at least inconsistent with, 
results earlier in the thread.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Peter Eisentraut
Date:
On 2019-11-01 17:59, Tomas Vondra wrote:
> I'd try running the benchmarks to verify the numbers, and maybe do some
> additional tests, but it's not clear to me which patches should I use.
> 
> I think the last patches with 'hacked' and 'hacked8' in the name are a
> couple of months old, and the recent posts attach just a single patch.
> Andrey, can you post current versions of both patches?

OK, waiting on some independent verification of benchmark numbers.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Michael Paquier
Date:
On Wed, Nov 06, 2019 at 09:04:25AM +0100, Peter Eisentraut wrote:
> OK, waiting on some independent verification of benchmark numbers.

Still waiting for these after 19 days, so the patch has been marked as
returned with feedback.
--
Michael

Attachment

Re: pglz performance

From
Andrey Borodin
Date:

> 25 нояб. 2019 г., в 13:03, Michael Paquier <michael@paquier.xyz> написал(а):
>
> On Wed, Nov 06, 2019 at 09:04:25AM +0100, Peter Eisentraut wrote:
>> OK, waiting on some independent verification of benchmark numbers.
>
> Still waiting for these after 19 days, so the patch has been marked as
> returned with feedback.

I think status Needs Review describes what is going on better. It's not like something is awaited from my side.

Thanks.

Best regards, Andrey Borodin.


Re: pglz performance

From
Michael Paquier
Date:
On Mon, Nov 25, 2019 at 01:21:27PM +0500, Andrey Borodin wrote:
> I think status Needs Review describes what is going on better. It's
> not like something is awaited from my side.

Indeed.  You are right so I have moved the patch instead, with "Needs
review".  The patch status was actually incorrect in the CF app, as it
was marked as waiting on author.

@Tomas: updated versions of the patches have been sent by Andrey.
--
Michael

Attachment

Re: pglz performance

From
Tomas Vondra
Date:
On Mon, Nov 25, 2019 at 05:29:40PM +0900, Michael Paquier wrote:
>On Mon, Nov 25, 2019 at 01:21:27PM +0500, Andrey Borodin wrote:
>> I think status Needs Review describes what is going on better. It's
>> not like something is awaited from my side.
>
>Indeed.  You are right so I have moved the patch instead, with "Needs
>review".  The patch status was actually incorrect in the CF app, as it
>was marked as waiting on author.
>
>@Tomas: updated versions of the patches have been sent by Andrey.

I've done benchmarks on the two last patches, using the data sets from
test_pglz repository [1], but using three simple queries:

1) prefix - first 100 bytes of the value

   SELECT length(substr(value, 0, 100)) FROM t

2) infix - 100 bytes from the middle

   SELECT length(substr(value, test_length/2, 100)) FROM t

3) suffix - last 100 bytes

   SELECT length(substr(value, test_length - 100, 100)) FROM t

See the two attached scripts, implementing this benchmark. The test
itself did a 60-second pgbench runs (single client) measuring tps on two
different machines.

patch 1: v4-0001-Use-memcpy-in-pglz-decompression.patch
patch 2: v4-0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch

The results (compared to master) from the first machine (i5-2500k CPU)
look like this:

                                   patch 1        |         patch 2
dataset                   prefix   infix  suffix | prefix   infix  suffix
-------------------------------------------------------------------------
000000010000000000000001     99%    134%    161% |   100%    126%    152%
000000010000000000000006     99%    260%    287% |   100%    257%    279%
000000010000000000000008    100%    100%    100% |   100%     95%     91%
16398                       100%    168%    221% |   100%    159%    215%
shakespeare.txt             100%    138%    141% |   100%    116%    117%
mr                           99%    120%    128% |   100%    107%    108%
dickens                     100%    129%    132% |   100%    100%    100%
mozilla                     100%    119%    120% |   100%    102%    104%
nci                         100%    149%    141% |   100%    143%    135%
ooffice                      99%    121%    123% |   100%     97%     98%
osdb                        100%     99%     99% |   100%    100%     99%
reymont                      99%    130%    132% |   100%    106%    107%
samba                       100%    126%    132% |   100%    105%    111%
sao                         100%    100%     99% |   100%    100%    100%
webster                     100%    127%    127% |   100%    106%    106%
x-ray                        99%     99%     99% |   100%    100%    100%
xml                         100%    144%    144% |   100%    130%    128%

and on the other one (xeon e5-2620v4) look like this:

                                  patch 1        |          patch 2
dataset                   prefix  infix  suffix | prefix  infix   suffix
------------------------------------------------------------------------
000000010000000000000001     98%   147%    170% |    98%   132%     159%
000000010000000000000006    100%   340%    314% |    98%   334%     355%
000000010000000000000008     99%   100%    105% |    99%    99%     101%
16398                       101%   153%    205% |    99%   148%     201%
shakespeare.txt             100%   147%    149% |    99%   117%     118%
mr                          100%   131%    139% |    99%   112%     108%
dickens                     100%   143%    143% |    99%   103%     102%
mozilla                     100%   122%    122% |    99%   105%     106%
nci                         100%   151%    135% |   100%   135%     125%
ooffice                      99%   127%    129% |    98%   101%     102%
osdb                        102%   100%    101% |   102%   100%      99%
reymont                     101%   142%    143% |   100%   108%     108%
samba                       100%   132%    136% |    99%   109%     112%
sao                          99%   101%    100% |    99%   100%     100%
webster                     100%   132%    129% |   100%   106%     106%
x-ray                        99%   101%    100% |    90%   101%     101%
xml                         100%   147%    148% |   100%   127%     125%

In general, I think the results for both patches seem clearly a win, but
maybe patch 1 is a bit better, especially on the newer (xeon) CPU. So I'd
probably go with that one.

[1] https://github.com/x4m/test_pglz

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: pglz performance

From
Peter Eisentraut
Date:
On 2019-11-26 10:43, Tomas Vondra wrote:
> In general, I think the results for both patches seem clearly a win, but
> maybe patch 1 is a bit better, especially on the newer (xeon) CPU. So I'd
> probably go with that one.

Patch 1 is also the simpler patch, so it seems clearly preferable.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Tomas Vondra
Date:
On Tue, Nov 26, 2019 at 08:17:13PM +0100, Peter Eisentraut wrote:
>On 2019-11-26 10:43, Tomas Vondra wrote:
>>In general, I think the results for both patches seem clearly a win, but
>>maybe patch 1 is a bit better, especially on the newer (xeon) CPU. So I'd
>>probably go with that one.
>
>Patch 1 is also the simpler patch, so it seems clearly preferable.
>

Yeah, although the difference is minimal. We could probably construct a
benchmark where #2 wins, but I think these queries are fairly realistic.
So I'd just go with #1.

Code-wise I think the patches are mostly fine, although the comments
might need some proof-reading.

1) I wasn't really sure what a "nibble" is, but maybe it's just me and
it's a well-known term.

2) First byte use lower -> First byte uses lower

3) nibble contain upper -> nibble contains upper

4) to preven possible uncertanity -> to prevent possible uncertainty

5) I think we should briefly explain why memmove would be incompatible
with pglz, it's not quite clear to me.

6) I'm pretty sure the comment in the 'while (off < len)' branch will be
badly mangled by pgindent.

7) The last change moving "copy" to the next line seems unnecessary.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Michael Paquier
Date:
On Tue, Nov 26, 2019 at 09:05:59PM +0100, Tomas Vondra wrote:
> Yeah, although the difference is minimal. We could probably construct a
> benchmark where #2 wins, but I think these queries are fairly realistic.
> So I'd just go with #1.

Nice results.  Using your benchmarks it indeed looks like patch 1 is a
winner here.

> Code-wise I think the patches are mostly fine, although the comments
> might need some proof-reading.
>
> 1) I wasn't really sure what a "nibble" is, but maybe it's just me and
> it's a well-known term.
>
> 2) First byte use lower -> First byte uses lower
>
> 3) nibble contain upper -> nibble contains upper
>
> 4) to preven possible uncertanity -> to prevent possible uncertainty
>
> 5) I think we should briefly explain why memmove would be incompatible
> with pglz, it's not quite clear to me.
>
> 6) I'm pretty sure the comment in the 'while (off < len)' branch will be
> badly mangled by pgindent.
>
> 7) The last change moving "copy" to the next line seems unnecessary.

Patch 1 has a typo as well here:
+   * When offset is smaller than lengh - source and
s/lengh/length/

Okay, if we are reaching a conclusion here, Tomas or Peter, are you
planning to finish brushing the patch and potentially commit it?
--
Michael

Attachment

Re: pglz performance

From
Tomas Vondra
Date:
On Wed, Nov 27, 2019 at 05:01:47PM +0900, Michael Paquier wrote:
>On Tue, Nov 26, 2019 at 09:05:59PM +0100, Tomas Vondra wrote:
>> Yeah, although the difference is minimal. We could probably construct a
>> benchmark where #2 wins, but I think these queries are fairly realistic.
>> So I'd just go with #1.
>
>Nice results.  Using your benchmarks it indeed looks like patch 1 is a
>winner here.
>
>> Code-wise I think the patches are mostly fine, although the comments
>> might need some proof-reading.
>>
>> 1) I wasn't really sure what a "nibble" is, but maybe it's just me and
>> it's a well-known term.
>>
>> 2) First byte use lower -> First byte uses lower
>>
>> 3) nibble contain upper -> nibble contains upper
>>
>> 4) to preven possible uncertanity -> to prevent possible uncertainty
>>
>> 5) I think we should briefly explain why memmove would be incompatible
>> with pglz, it's not quite clear to me.
>>
>> 6) I'm pretty sure the comment in the 'while (off < len)' branch will be
>> badly mangled by pgindent.
>>
>> 7) The last change moving "copy" to the next line seems unnecessary.
>
>Patch 1 has a typo as well here:
>+   * When offset is smaller than lengh - source and
>s/lengh/length/
>
>Okay, if we are reaching a conclusion here, Tomas or Peter, are you
>planning to finish brushing the patch and potentially commit it?

I'd like some feedback from Andrey regarding the results and the memmove
comment, and then I'll polish and push.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Andrey Borodin
Date:
Hi Tomas!

Thanks for benchmarking this!

> On 26 Nov 2019, at 14:43, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Nov 25, 2019 at 05:29:40PM +0900, Michael Paquier wrote:
>> On Mon, Nov 25, 2019 at 01:21:27PM +0500, Andrey Borodin wrote:
>>> I think status Needs Review describes what is going on better. It's
>>> not like something is awaited from my side.
>>
>> Indeed.  You are right so I have moved the patch instead, with "Needs
>> review".  The patch status was actually incorrect in the CF app, as it
>> was marked as waiting on author.
>>
>> @Tomas: updated versions of the patches have been sent by Andrey.
>
> I've done benchmarks on the two last patches, using the data sets from
> test_pglz repository [1], but using three simple queries:
>
> 1) prefix - first 100 bytes of the value
>
>  SELECT length(substr(value, 0, 100)) FROM t
>
> 2) infix - 100 bytes from the middle
>
>  SELECT length(substr(value, test_length/2, 100)) FROM t
>
> 3) suffix - last 100 bytes
>
>  SELECT length(substr(value, test_length - 100, 100)) FROM t
>
> See the two attached scripts, implementing this benchmark. The test
> itself did a 60-second pgbench runs (single client) measuring tps on two
> different machines.
>
> patch 1: v4-0001-Use-memcpy-in-pglz-decompression.patch
> patch 2: v4-0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch
>
> The results (compared to master) from the first machine (i5-2500k CPU)
> look like this:
>
>                                  patch 1        |         patch 2
> dataset                   prefix   infix  suffix | prefix   infix  suffix
> -------------------------------------------------------------------------
> 000000010000000000000001     99%    134%    161% |   100%    126%    152%
> 000000010000000000000006     99%    260%    287% |   100%    257%    279%
> 000000010000000000000008    100%    100%    100% |   100%     95%     91%
> 16398                       100%    168%    221% |   100%    159%    215%
> shakespeare.txt             100%    138%    141% |   100%    116%    117%
> mr                           99%    120%    128% |   100%    107%    108%
> dickens                     100%    129%    132% |   100%    100%    100%
> mozilla                     100%    119%    120% |   100%    102%    104%
> nci                         100%    149%    141% |   100%    143%    135%
> ooffice                      99%    121%    123% |   100%     97%     98%
> osdb                        100%     99%     99% |   100%    100%     99%
> reymont                      99%    130%    132% |   100%    106%    107%
> samba                       100%    126%    132% |   100%    105%    111%
> sao                         100%    100%     99% |   100%    100%    100%
> webster                     100%    127%    127% |   100%    106%    106%
> x-ray                        99%     99%     99% |   100%    100%    100%
> xml                         100%    144%    144% |   100%    130%    128%
>
> and on the other one (xeon e5-2620v4) looks like this:
>
>                                 patch 1        |          patch 2
> dataset                   prefix  infix  suffix | prefix  infix   suffix
> ------------------------------------------------------------------------
> 000000010000000000000001     98%   147%    170% |    98%   132%     159%
> 000000010000000000000006    100%   340%    314% |    98%   334%     355%
> 000000010000000000000008     99%   100%    105% |    99%    99%     101%
> 16398                       101%   153%    205% |    99%   148%     201%
> shakespeare.txt             100%   147%    149% |    99%   117%     118%
> mr                          100%   131%    139% |    99%   112%     108%
> dickens                     100%   143%    143% |    99%   103%     102%
> mozilla                     100%   122%    122% |    99%   105%     106%
> nci                         100%   151%    135% |   100%   135%     125%
> ooffice                      99%   127%    129% |    98%   101%     102%
> osdb                        102%   100%    101% |   102%   100%      99%
> reymont                     101%   142%    143% |   100%   108%     108%
> samba                       100%   132%    136% |    99%   109%     112%
> sao                          99%   101%    100% |    99%   100%     100%
> webster                     100%   132%    129% |   100%   106%     106%
> x-ray                        99%   101%    100% |    90%   101%     101%
> xml                         100%   147%    148% |   100%   127%     125%
>
> In general, I think the results for both patches seem clearly a win, but
> maybe patch 1 is a bit better, especially on the newer (xeon) CPU. So I'd
> probably go with that one.



From my POV there are two interesting new points in your benchmarks:
1. They are more or less end-to-end benchmarks with the whole system involved.
2. They provide a per-payload breakdown.

The prefix experiment is mostly related to reading from the page cache and not directly connected with decompression. It's a
bit strange that we observe a 1% degradation in certain experiments, but I believe it's noise.
Infix and Suffix results are correlated. We observe no impact of the patch on compressed data.

test_pglz also includes slicing by 2Kb and 8Kb. This was done to imitate toasting. But as far as I understand, in your
test the data payload will be inserted into a toast table too, won't it? If so, I agree that patch 1 looks like a better
option.

> On 27 Nov 2019, at 1:05, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Code-wise I think the patches are mostly fine, although the comments
> might need some proof-reading.
>
> 1) I wasn't really sure what a "nibble" is, but maybe it's just me and
> it's a well-known term.
I took the word from the pg_lzcompress.c comments:
 *          The offset is in the upper nibble of T1 and in T2.
 *          The length is in the lower nibble of T1.
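To make that layout concrete, a rough sketch of reading such a two-byte tag could look like this (simplified; the function name is mine, and the length-extension byte is only noted in a comment):

static void
decode_match_tag(const unsigned char *tag, int *len, int *off)
{
    *len = (tag[0] & 0x0f) + 3;             /* lower nibble of T1: 3..18 */
    *off = ((tag[0] & 0xf0) << 4) | tag[1]; /* upper nibble of T1 plus T2 */
    /* when *len == 18, one more byte extends the length by 0..255 */
}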
>
> 2) First byte use lower -> First byte uses lower
>
> 3) nibble contain upper -> nibble contains upper
>
> 4) to preven possible uncertanity -> to prevent possible uncertainty
>
> 5) I think we should briefly explain why memmove would be incompatible
> with pglz, it's not quite clear to me.
Here's the example
+                     * Consider input: 112341234123412341234
+                     * At byte 5       here ^ we have match with length 16 and
+                     * offset 4.       11234M(len=16, off=4)
If we simply memmove() these 16 bytes we will produce 112341234XXXXXXXXXXXX, where the series of X is 12 undefined bytes
that were at bytes [6:18].
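A tiny standalone demo (just an illustration, not the patch code) shows the difference between byte-by-byte copying and a single big move:

#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* first 5 literal bytes decompressed; the zeroed rest stands in for output not produced yet */
    char        correct[32] = "11234";
    char        wrong[32] = "11234";
    int         off = 4;
    int         len = 16;
    int         i;

    /* byte by byte: bytes written earlier become sources for later iterations */
    for (i = 0; i < len; i++)
        correct[5 + i] = correct[5 + i - off];

    /* one big move reads 12 source bytes that have not been produced yet */
    memmove(wrong + 5, wrong + 5 - off, len);

    printf("%.21s\n", correct);   /* 112341234123412341234 */
    printf("%.21s\n", wrong);     /* 112341234, then whatever was in the buffer */
    return 0;
}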

>
> 6) I'm pretty sure the comment in the 'while (off < len)' branch will be
> badly mangled by pgindent.
I think I can just write it without a line limit and then run pgindent. Will try to do it this evening. Also, I will try
to write more about memmove.
>
> 7) The last change moving "copy" to the next line seems unnecessary.


Oh, it looks like I had been rewording this comment and eventually came back to the same text... Yes, this change is absolutely
unnecessary.

Thanks!

Best regards, Andrey Borodin.


Re: pglz performance

From
Tomas Vondra
Date:
On Wed, Nov 27, 2019 at 05:47:25PM +0500, Andrey Borodin wrote:
>Hi Tomas!
>
>Thanks for benchmarking this!
>
>> On 26 Nov 2019, at 14:43, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Nov 25, 2019 at 05:29:40PM +0900, Michael Paquier wrote:
>>> On Mon, Nov 25, 2019 at 01:21:27PM +0500, Andrey Borodin wrote:
>>>> I think status Needs Review describes what is going on better. It's
>>>> not like something is awaited from my side.
>>>
>>> Indeed.  You are right so I have moved the patch instead, with "Needs
>>> review".  The patch status was actually incorrect in the CF app, as it
>>> was marked as waiting on author.
>>>
>>> @Tomas: updated versions of the patches have been sent by Andrey.
>>
>> I've done benchmarks on the two last patches, using the data sets from
>> test_pglz repository [1], but using three simple queries:
>>
>> 1) prefix - first 100 bytes of the value
>>
>>  SELECT length(substr(value, 0, 100)) FROM t
>>
>> 2) infix - 100 bytes from the middle
>>
>>  SELECT length(substr(value, test_length/2, 100)) FROM t
>>
>> 3) suffix - last 100 bytes
>>
>>  SELECT length(substr(value, test_length - 100, 100)) FROM t
>>
>> See the two attached scripts, implementing this benchmark. The test
>> itself did a 60-second pgbench runs (single client) measuring tps on two
>> different machines.
>>
>> patch 1: v4-0001-Use-memcpy-in-pglz-decompression.patch
>> patch 2: v4-0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch
>>
>> The results (compared to master) from the first machine (i5-2500k CPU)
>> look like this:
>>
>>                                  patch 1        |         patch 2
>> dataset                   prefix   infix  suffix | prefix   infix  suffix
>> -------------------------------------------------------------------------
>> 000000010000000000000001     99%    134%    161% |   100%    126%    152%
>> 000000010000000000000006     99%    260%    287% |   100%    257%    279%
>> 000000010000000000000008    100%    100%    100% |   100%     95%     91%
>> 16398                       100%    168%    221% |   100%    159%    215%
>> shakespeare.txt             100%    138%    141% |   100%    116%    117%
>> mr                           99%    120%    128% |   100%    107%    108%
>> dickens                     100%    129%    132% |   100%    100%    100%
>> mozilla                     100%    119%    120% |   100%    102%    104%
>> nci                         100%    149%    141% |   100%    143%    135%
>> ooffice                      99%    121%    123% |   100%     97%     98%
>> osdb                        100%     99%     99% |   100%    100%     99%
>> reymont                      99%    130%    132% |   100%    106%    107%
>> samba                       100%    126%    132% |   100%    105%    111%
>> sao                         100%    100%     99% |   100%    100%    100%
>> webster                     100%    127%    127% |   100%    106%    106%
>> x-ray                        99%     99%     99% |   100%    100%    100%
>> xml                         100%    144%    144% |   100%    130%    128%
>>
>> and on the other one (xeon e5-2620v4) looks like this:
>>
>>                                 patch 1        |          patch 2
>> dataset                   prefix  infix  suffix | prefix  infix   suffix
>> ------------------------------------------------------------------------
>> 000000010000000000000001     98%   147%    170% |    98%   132%     159%
>> 000000010000000000000006    100%   340%    314% |    98%   334%     355%
>> 000000010000000000000008     99%   100%    105% |    99%    99%     101%
>> 16398                       101%   153%    205% |    99%   148%     201%
>> shakespeare.txt             100%   147%    149% |    99%   117%     118%
>> mr                          100%   131%    139% |    99%   112%     108%
>> dickens                     100%   143%    143% |    99%   103%     102%
>> mozilla                     100%   122%    122% |    99%   105%     106%
>> nci                         100%   151%    135% |   100%   135%     125%
>> ooffice                      99%   127%    129% |    98%   101%     102%
>> osdb                        102%   100%    101% |   102%   100%      99%
>> reymont                     101%   142%    143% |   100%   108%     108%
>> samba                       100%   132%    136% |    99%   109%     112%
>> sao                          99%   101%    100% |    99%   100%     100%
>> webster                     100%   132%    129% |   100%   106%     106%
>> x-ray                        99%   101%    100% |    90%   101%     101%
>> xml                         100%   147%    148% |   100%   127%     125%
>>
>> In general, I think the results for both patches seem clearly a win, but
>> maybe patch 1 is a bit better, especially on the newer (xeon) CPU. So I'd
>> probably go with that one.
>
>
>
>From my POV there are two interesting new points in your benchmarks:
>1. They are more or less end-to-end benchmarks with the whole system involved.
>2. They provide a per-payload breakdown.
>

Yes. I was considering using the test_pglz extension first, but in the
end I decided an end-to-end test is easier to do and more relevant.

>Prefix experiment is mostly related to reading from page cache and not
>directly connected with decompression. It's a bit strange that we
>observe 1% degradation in certain experiments, but I believe it's a
>noise.
>

Yes, I agree it's probably noise - it's not always a degradation, there
are cases where it actually improves by ~1%. Perhaps more runs would
even this out, or maybe it's due to different bin layout or something.

I should have some results from a test with a longer (10-minute) run soon,
but I don't think this is a massive issue.

>Infix and Suffix results are correlated. We observe no impact of the
>patch on compressed data.
>

TBH I have not looked at which data sets are compressible etc. so I
can't really comment on this.

FWIW the reason why I did the prefix/infix/suffix is primarily that I
was involved in some recent patches tweaking TOAST slicing, so I wanted
to see if this happens to negatively affect it somehow. And it does not.

>test_pglz also includes slicing by 2Kb and 8Kb. This was done to
>imitate toasting. But as far as I understand, in your test data payload
>will be inserted into toast table too, won't it? If so, I agree that
>patch 1 looks like a better option.
>

Yes, the tests simply do whatever PostgreSQL would do when loading and
storing this data, including TOASTing.

>> On 27 Nov 2019, at 1:05, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Code-wise I think the patches are mostly fine, although the comments
>> might need some proof-reading.
>>
>> 1) I wasn't really sure what a "nibble" is, but maybe it's just me and
>> it's a well-known term.
>I took the word from the pg_lzcompress.c comments:
> *          The offset is in the upper nibble of T1 and in T2.
> *          The length is in the lower nibble of T1.

Aha, good. I hadn't noticed that word before, so I assumed it was
introduced by those patches.  And the first thing I thought of was the
"Nibbles" video game [1]. Which obviously left me a bit puzzled ;-)

But it seems to be a well-known term; I just had never heard it before.

[1] https://en.wikipedia.org/wiki/Nibbles_(video_game)

>>
>> 2) First byte use lower -> First byte uses lower
>>
>> 3) nibble contain upper -> nibble contains upper
>>
>> 4) to preven possible uncertanity -> to prevent possible uncertainty
>>
>> 5) I think we should briefly explain why memmove would be incompatible
>> with pglz, it's not quite clear to me.
>Here's the example
>+ * Consider input: 112341234123412341234
>+ * At byte 5       here ^ we have match with length 16 and
>+ * offset 4.       11234M(len=16, off=4)
>
>If we simply memmove() these 16 bytes we will produce
>112341234XXXXXXXXXXXX, where the series of X is 12 undefined bytes that
>were at bytes [6:18].
>

OK, thanks.

>>
>> 6) I'm pretty sure the comment in the 'while (off < len)' branch will be
>> badly mangled by pgindent.
>
>I think I can just write it without line limit and then run pgindent.
>Will try to do it this evening. Also, I will try to write more about
>memmove.
>
>>
>> 7) The last change moving "copy" to the next line seems unnecessary.
>
>Oh, looks like I had been rewording this comment, and eventually came
>to the same text..Yes, this change is absolutely unnecessary.
>
>Thanks!
>

Good. I'll wait for an updated version of the patch and then try to get
it pushed by the end of the CF.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Andrey Borodin
Date:

> On 27 Nov 2019, at 20:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>>>
>>> 6) I'm pretty sure the comment in the 'while (off < len)' branch will be
>>> badly mangled by pgindent.
>>
>> I think I can just write it without line limit and then run pgindent.
>> Will try to do it this evening. Also, I will try to write more about
>> memmove.
Well, yes, I could not make pgindent format some parts of that comment, so I gave up and left only simple text.
>>
>>>
>>> 7) The last change moving "copy" to the next line seems unnecessary.
>>
>> Oh, looks like I had been rewording this comment, and eventually came
>> to the same text..Yes, this change is absolutely unnecessary.
>>
>> Thanks!
>>
>
> Good. I'll wait for an updated version of the patch and then try to get
> it pushed by the end of the CF.

PFA v5.
Thanks!

Best regards, Andrey Borodin.


Attachment

Re: pglz performance

From
Alvaro Herrera
Date:
On 2019-Nov-27, Andrey Borodin wrote:

> 
> 
> > On 27 Nov 2019, at 20:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > 
> >>> 
> >>> 6) I'm pretty sure the comment in the 'while (off < len)' branch will be
> >>> badly mangled by pgindent.
> >> 
> >> I think I can just write it without line limit and then run pgindent.
> >> Will try to do it this evening. Also, I will try to write more about
> >> memmove.
> Well, yes, I could not make pgindent format some parts of that
> comment, gave up and left only simple text.

Please don't.  The way to keep pgindent from messing with the comment
is to surround it with dashes: start it with /*--------- and end it with *------- */
Just make sure that you use well-aligned lines shorter than 80, for
cleanliness; but whatever you do, if you use the dashes then pgindent
won't touch it.

(I think the closing dash line is not necessary, but it looks better for
things to be symmetrical.)
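For example, a comment written like this survives pgindent untouched (the content below is just filler based on the example upthread):

/*----------
 * Consider input: 112341234123412341234
 * At byte 5 we have a match with length 16 and offset 4:
 *      11234M(len=16, off=4)
 *----------
 */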

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pglz performance

From
Michael Paquier
Date:
On Wed, Nov 27, 2019 at 04:28:18PM +0100, Tomas Vondra wrote:
> Yes. I was considering using the test_pglz extension first, but in the
> end I decided an end-to-end test is easier to do and more relevant.

I actually got something in this area in one of my trees:
https://github.com/michaelpq/pg_plugins/tree/master/compress_test

> Good. I'll wait for an updated version of the patch and then try to get
> it pushed by the end of the CF.

Sounds like a plan.  Thanks.
--
Michael

Attachment

Re: pglz performance

From
Tomas Vondra
Date:
On Wed, Nov 27, 2019 at 11:27:49PM +0500, Andrey Borodin wrote:
>
>
>> On 27 Nov 2019, at 20:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>>>>
>>>> 6) I'm pretty sure the comment in the 'while (off < len)' branch will be
>>>> badly mangled by pgindent.
>>>
>>> I think I can just write it without line limit and then run pgindent.
>>> Will try to do it this evening. Also, I will try to write more about
>>> memmove.
>Well, yes, I could not make pgindent format some parts of that comment, gave up and left only simple text.
>>>
>>>>
>>>> 7) The last change moving "copy" to the next line seems unnecessary.
>>>
>>> Oh, looks like I had been rewording this comment, and eventually came
>>> to the same text..Yes, this change is absolutely unnecessary.
>>>
>>> Thanks!
>>>
>>
>> Good. I'll wait for an updated version of the patch and then try to get
>> it pushed by the end of the CF.
>

OK, pushed, with some minor cosmetic tweaks on the comments (essentially
using the formatting trick pointed out by Alvaro), and removing one
unnecessary change in pglz_maximum_compressed_size.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pglz performance

From
Andrey Borodin
Date:

> On 29 Nov 2019, at 3:43, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> OK, pushed, with some minor cosmetic tweaks on the comments (essentially
> using the formatting trick pointed out by Alvaro), and removing one
> unnecessary change in pglz_maximum_compressed_size.

Cool, thanks!

> pglz_maximum_compressed_size
It was an artifact of pgindent.

Best regards, Andrey Borodin.


Re: pglz performance

From
Michael Paquier
Date:
On Fri, Nov 29, 2019 at 10:18:18AM +0500, Andrey Borodin wrote:
>> On 29 Nov 2019, at 3:43, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> OK, pushed, with some minor cosmetic tweaks on the comments (essentially
>> using the formatting trick pointed out by Alvaro), and removing one
>> unnecessary change in pglz_maximum_compressed_size.
>
> Cool, thanks!

Yippee.  Thanks.
--
Michael

Attachment

Re: pglz performance

From
"Andrey M. Borodin"
Date:
Hi Petr!

> On 4 Aug 2019, at 05:41, Petr Jelinek <petr@2ndquadrant.com> wrote:
>
> Just so that we don't idly talk, what do you think about the attached?
> It:
> - adds new GUC compression_algorithm with possible values of pglz (default) and lz4 (if lz4 is compiled in), requires SIGHUP
> - adds --with-lz4 configure option (default yes, so the configure option is actually --without-lz4) that enables the lz4, it's using system library

Do you plan to work on lz4 aiming at PostgreSQL 13 or 14?
Maybe let's register it in CF 2020-07?

Best regards, Andrey Borodin.