Re: pglz performance - Mailing list pgsql-hackers

From Andrey Borodin
Subject Re: pglz performance
Date
Msg-id DBB2A9E5-29FD-40BF-AC60-BD990FBF142F@yandex-team.ru
In response to Re: pglz performance  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: pglz performance
List pgsql-hackers
Thanks for looking into this!

> On 2 Aug 2019, at 19:43, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:
>>
>> It takes me some time to understand that your memcpy optimization is correct;)
It seems the comments are not explanatory enough... I will try to fix that.

>> I have tested different ways of optimizing this fragment of code, but failed to outperform your implementation!
JFYI, we tried optimizations with memcpy with a constant size (optimized into assembly instead of a call), unrolling
the literal loop, and some others. None of these worked better.

>> But ...  below are results for lz4:
>>
>> Decompressor score (summ of all times):
>> NOTICE:  Decompressor lz4_decompress result 3.660066
>> Compressor score (summ of all times):
>> NOTICE:  Compressor lz4_compress result 10.288594
>>
>> That is a 2x advantage in decompression speed and a 10x advantage in compression speed.
>> So maybe instead of "hacking" the pglz algorithm we should switch to lz4?
>>
>
> I think we should just bite the bullet and add initdb option to pick
> compression algorithm. That's been discussed repeatedly, but we never
> ended up actually doing that. See for example [1].
>
> If there's anyone willing to put some effort into getting this feature
> over the line, I'm willing to do reviews & commit. It's a seemingly
> small change with rather insane potential impact.
>
> But even if we end up doing that, it still makes sense to optimize the
> hell out of pglz, because existing systems will still use that
> (pg_upgrade can't switch from one compression algorithm to another).

We have some kind of "roadmap" for an "extensible pglz". We plan to provide an implementation for the November CF.

Currently, pglz starts with an empty cache map: there are no prior 4k bytes before the start of the data. We can add
an imaginary prefix of common substrings to any data; this will improve the compression ratio.
It is hard to decide on a training data set for this "common prefix". So we want to produce an extension with an
aggregate function which produces an "adapted common prefix" from the user's data.
Then we can "reserve" a few negative bytes for "decompression commands". Such a command can instruct the database
which common prefix to use.
But a system command can also say "invoke decompression from an extension".
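
To illustrate the "imaginary prefix" idea only, here is a minimal sketch using Python's zlib preset dictionary
(the zdict argument) as a stand-in. This is not pglz and not the planned implementation; COMMON_PREFIX, the sample
row, and the helper names are made up for the example.

```python
import zlib

# Hypothetical "common prefix": substrings we assume recur in the user's data.
COMMON_PREFIX = b'{"id": , "name": "", "email": "", "created_at": ""}'


def compress(data: bytes, prefix: bytes = b"") -> bytes:
    """Compress data, optionally priming the history window with a prefix."""
    if prefix:
        c = zlib.compressobj(zdict=prefix)
    else:
        c = zlib.compressobj()
    return c.compress(data) + c.flush()


def decompress(blob: bytes, prefix: bytes = b"") -> bytes:
    """Decompress; the same prefix used for compression must be supplied."""
    if prefix:
        d = zlib.decompressobj(zdict=prefix)
    else:
        d = zlib.decompressobj()
    return d.decompress(blob) + d.flush()


# A short sample value, as a stand-in for one small datum.
row = (b'{"id": 42, "name": "alice", "email": "alice@example.com", '
       b'"created_at": "2019-08-02"}')

plain = compress(row)                   # starts with an empty history
primed = compress(row, COMMON_PREFIX)   # starts with the shared prefix

print(len(row), len(plain), len(primed))
```

The gain should be largest for short values, where the compressor otherwise has no history to match against. The
prefix itself is never stored with the data, so both sides must agree on it.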

Thus, a user will be able to train database compression on their own data and seamlessly substitute a custom
compression method for pglz.
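
A rough sketch of how such "decompression commands" could be dispatched. The command codes, the two-byte header,
the PREFIXES and EXTENSIONS registries, and the toy byte-reversing "codec" are all assumptions made up for
illustration; nothing here reflects an actual pglz format.

```python
import zlib

# Hypothetical command codes; no concrete encoding has been defined.
CMD_PLAIN = 0      # ordinary stream, no prefix
CMD_PREFIX = 1     # argument selects a trained common prefix
CMD_EXTENSION = 2  # argument selects a decompressor provided by an extension

# Hypothetical registries, keyed by a one-byte id.
PREFIXES = {7: b'"status": "active", "country": "'}
EXTENSIONS = {3: lambda payload: payload[::-1]}  # toy reversible "codec"


def encode_plain(data: bytes) -> bytes:
    return bytes([CMD_PLAIN, 0]) + zlib.compress(data)


def encode_with_prefix(data: bytes, prefix_id: int) -> bytes:
    c = zlib.compressobj(zdict=PREFIXES[prefix_id])
    return bytes([CMD_PREFIX, prefix_id]) + c.compress(data) + c.flush()


def encode_with_extension(data: bytes, ext_id: int) -> bytes:
    # The toy codec is its own inverse, so encoding just reverses the bytes.
    return bytes([CMD_EXTENSION, ext_id]) + data[::-1]


def decode(blob: bytes) -> bytes:
    """Read the command header and route the payload accordingly."""
    cmd, arg, payload = blob[0], blob[1], blob[2:]
    if cmd == CMD_PLAIN:
        return zlib.decompress(payload)
    if cmd == CMD_PREFIX:
        d = zlib.decompressobj(zdict=PREFIXES[arg])
        return d.decompress(payload) + d.flush()
    if cmd == CMD_EXTENSION:
        return EXTENSIONS[arg](payload)
    raise ValueError(f"unknown decompression command: {cmd}")


data = b'{"status": "active", "country": "NL"}'
for blob in (encode_plain(data),
             encode_with_prefix(data, 7),
             encode_with_extension(data, 3)):
    assert decode(blob) == data
```

The point is only that the reader of a compressed value needs nothing but the header to pick the right path, so
new prefixes or methods can be added without changing how existing values are read.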

This will make a hard-coded choice of compression algorithm unnecessary, but it seems overly hacky. On the other
hand, there will be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series
compression"? Or "DNA compression"? Whatever gun the user wants for his foot.

Best regards, Andrey Borodin.

