Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id 8b623131-aa85-3c0a-0f01-f5188c038ff9@2ndquadrant.com
In response to Re: [HACKERS] Custom compression methods  (konstantin knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
On 12/02/2017 09:24 PM, konstantin knizhnik wrote:
> 
> On Dec 2, 2017, at 6:04 PM, Tomas Vondra wrote:
> 
>> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> ...
>>
>> Other algorithms (e.g. zstd) got significantly better compression (25%)
>> compared to pglz, but in exchange for longer compression times. I'm sure
>> we could lower the compression level to make it faster, but that will of
>> course hurt the compression ratio.
>>
>> I don't think switching to a different compression algorithm is a way
>> forward - it was proposed and explored repeatedly in the past, and every
>> time it failed for a number of reasons, most of which are still valid.
>>
>>
>> Firstly, it's going to be quite hard (or perhaps impossible) to
>> find an algorithm that is "universally better" than pglz. Some
>> algorithms do work better for text documents, some for binary
>> blobs, etc. I don't think there's a win-win option.
>>
>> Sure, there are workloads where pglz performs poorly (I've seen
>> such cases too), but IMHO that's more an argument for the custom
>> compression method approach. pglz gives you good default
>> compression in most cases, and you can change it for columns where
>> it matters, and where a different space/time trade-off makes
>> sense.
>>
>>
>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
>>
>>
> 
> Maybe it will be interesting for you to see the following results
> of applying page-level compression (CFS in PgPro-EE) to pgbench
> data:
> 

I don't follow. If I understand what CFS does correctly (and I'm mostly
guessing here, because I haven't seen the code published anywhere, and I
assume it's proprietary), it essentially compresses whole 8kB blocks.

I don't know whether it first reorganizes the data into some columnar-like
format (which tends to compress better), which would make it somewhat
similar to page-level compression in Oracle.

But it's clearly a very different approach from what the patch aims to
improve (compressing individual varlena values).
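
If anyone wants to see why the two things are hard to compare directly,
here's a trivial standalone sketch (not CFS code, not the patch, just
libzstd on fake data - the tuple layout and all numbers are made up for
the example). It compresses 256 short 32-byte "tuples" one by one, and
then the same 8kB as a single block:

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(void)
{
    /* fake "page": 256 short tuples, 32 bytes each (8kB total),
     * vaguely pgbench-like - a sequential id plus constant filler */
    enum { NTUPLES = 256, TUPLEN = 32, PAGESZ = NTUPLES * TUPLEN };
    static char page[PAGESZ];

    for (int i = 0; i < NTUPLES; i++)
        snprintf(page + i * TUPLEN, TUPLEN, "id=%06d filler=xxxxxxxxxxxxxx", i);

    size_t  bound = ZSTD_compressBound(PAGESZ);
    void   *buf = malloc(bound);
    size_t  per_value = 0;

    /* compress each tuple separately (roughly the situation of
     * per-value compression of short varlenas) */
    for (int i = 0; i < NTUPLES; i++)
    {
        size_t n = ZSTD_compress(buf, bound, page + i * TUPLEN, TUPLEN, 1);
        if (ZSTD_isError(n))
            return 1;
        per_value += n;
    }

    /* compress the whole 8kB block at once (page-level compression) */
    size_t whole = ZSTD_compress(buf, bound, page, PAGESZ, 1);
    if (ZSTD_isError(whole))
        return 1;

    printf("raw %d, per-value total %zu, whole block %zu\n",
           PAGESZ, per_value, whole);

    free(buf);
    return 0;
}

I'd expect the per-value total to end up close to (or even above) the raw
size, simply because 32 bytes is too little for a general-purpose
compressor to work with, while the whole-block variant compresses nicely.
Which is exactly why I don't think the CFS numbers tell us much about
per-value compression methods.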

> 
> All algorithms (except zlib) were used with the best-speed option: using
> a higher compression level usually has a fairly small impact on the
> compression ratio (<30%), but can increase the compression time
> significantly (several times). Certainly pgbench is not the best
> candidate for testing compression algorithms: it generates a lot of
> artificial and redundant data. But we measured it also on real customer
> data, and zstd still seems to be the best compression method: it
> provides good compression with the smallest CPU overhead.
> 

I think this really depends on the dataset, and drawing conclusions based
on a single test is somewhat crazy, especially when it's synthetic pgbench
data with lots of inherent redundancy (sequential IDs, ...).
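
FWIW, the simplest way to get an idea of the level/ratio/time trade-off on
a particular dataset is to run the compressor on a raw dump of the data,
outside the database. Just a sketch (the file name and the chosen levels
are arbitrary, and this only looks at the compression side):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zstd.h>

/* usage: ./zstd-levels <file with a raw dump of your data> */
int main(int argc, char **argv)
{
    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <datafile>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f)
        return 1;

    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    fseek(f, 0, SEEK_SET);

    char *src = malloc(len);
    if (fread(src, 1, len, f) != (size_t) len)
        return 1;
    fclose(f);

    size_t  bound = ZSTD_compressBound(len);
    char   *dst = malloc(bound);
    int     levels[] = {1, 3, 9, 19};

    for (int i = 0; i < 4; i++)
    {
        clock_t start = clock();
        size_t  n = ZSTD_compress(dst, bound, src, len, levels[i]);
        double  secs = (double) (clock() - start) / CLOCKS_PER_SEC;

        if (ZSTD_isError(n))
            return 1;

        printf("level %2d: %ld -> %zu bytes (ratio %.2f), %.3f s\n",
               levels[i], len, n, (double) len / n, secs);
    }

    free(src);
    free(dst);
    return 0;
}

Obviously clock() only measures single-process CPU time and this ignores
decompression entirely, but it's usually enough to see whether the higher
levels are worth it for a given dataset.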

My takeaway from the results is rather that page-level compression may be
very beneficial in some cases, although I wonder how much of that could be
gained by simply using a compressed filesystem (thus making it transparent
to PostgreSQL).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

