Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id CAF4Au4z-7_Ya4qO=PBK37dyY-Up+CqT0v_ucqXUV8Zanu4eNLw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Custom compression methods  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Sat, Dec 2, 2017 at 6:04 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>>> Honestly, if we can give everybody a 4% space reduction by
>>> switching to lz4, I think that's totally worth doing -- but let's
>>> not make people choose it, let's make it the default going forward,
>>> and keep pglz support around so we don't break pg_upgrade
>>> compatibility (and so people can continue to choose it if for some
>>> reason it works better in their use case). That kind of improvement
>>> is nothing special in a specific workload, but TOAST is a pretty
>>> general-purpose mechanism. I have become, through a few bitter
>>> experiences, a strong believer in the value of trying to reduce our
>>> on-disk footprint, and knocking 4% off the size of every TOAST
>>> table in the world does not sound worthless to me -- even though
>>> context-aware compression can doubtless do a lot better.
>>
>> +1. It's also a lot faster, and I've seen way way to many workloads
>> with 50%+ time spent in pglz.
>>
>
> TBH the 4% figure is something I mostly made up (I'm fake news!). On the
> mailing list archive (which I believe is pretty compressible) I observed
> something like 2.5% size reduction with lz4 compared to pglz, at least
> with the compression levels I've used ...

Nikita Glukhove tested compression on real  json data:

Delicious bookmarks (js):

json 1322MB
jsonb 1369MB
jsonbc 931MB 1.5x
jsonb+lz4d 404MB 3.4x

Citus customer reviews (jr):

json 1391MB
jsonb 1574MB
jsonbc 622MB 2.5x
jsonb+lz4d 601MB 2.5x

I also attached a plot with wired tiger size (zstd compression) in Mongodb.
Nikita has more numbers about compression.


>
> Other algorithms (e.g. zstd) got significantly better compression (25%)
> compared to pglz, but in exchange for longer compression. I'm sure we
> could lower compression level to make it faster, but that will of course
> hurt the compression ratio.
>
> I don't think switching to a different compression algorithm is a way
> forward - it was proposed and explored repeatedly in the past, and every
> time it failed for a number of reasons, most of which are still valid.
>
>
> Firstly, it's going to be quite hard (or perhaps impossible) to find an
> algorithm that is "universally better" than pglz. Some algorithms do
> work better for text documents, some for binary blobs, etc. I don't
> think there's a win-win option.
>
> Sure, there are workloads where pglz performs poorly (I've seen such
> cases too), but IMHO that's more an argument for the custom compression
> method approach. pglz gives you good default compression in most cases,
> and you can change it for columns where it matters, and where a
> different space/time trade-off makes sense.
>
>
> Secondly, all the previous attempts ran into some legal issues, i.e.
> licensing and/or patents. Maybe the situation changed since then (no
> idea, haven't looked into that), but in the past the "pluggable"
> approach was proposed as a way to address this.

I don't think so. Pluggable means that now we have more data types,
which don't fit to
the old compression scheme of TOAST and we need better flexibility. I
see in future we
could avoid decompression of the whole toast just to get on key from
document, so we
first slice data and compress each slice separately.

>
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: [HACKERS] Custom compression methods
Next
From: chenhj
Date:
Subject: Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WALfiles