Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers

From: Aleksander Alekseev
Subject: Re: [PATCH] Compression dictionaries for JSONB
Msg-id: CAJ7c6TN0b+iBBO5yZm+Tqj-RBzuKAOppdcfvmqz0s2NVztY19Q@mail.gmail.com
In response to: Re: [PATCH] Compression dictionaries for JSONB (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

> > So to clarify, are we talking about tuple-level compression? Or
> > perhaps page-level compression?
>
> Tuple level.
>
> What I think we should do is basically this:
>
> When we compress datums, we know the table being targeted. If there's a
> pg_attribute parameter indicating we should, we can pass a prebuilt
> dictionary to the LZ4/zstd [de]compression functions.
>
> It's possible we'd need to use a somewhat extended header for such
> compressed datums, to reference the dictionary "id" to be used when
> decompressing, if the compression algorithms don't already have that in
> one of their headers, but that's entirely doable.
>
> A quick demo of the effect size:
> [...]
> Here's the results:
>
>                lz4       zstd   uncompressed
> no dict    1328794     982497        3898498
> dict        375070     267194
>
> I'd say the effect of the dictionary is pretty impressive. And remember,
> this is with the dictionary having been trained on a subset of the data.

I see your point: building dictionaries on a training set is too
beneficial to ignore. Can't argue with that.
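
For reference, here is a minimal sketch of the zstd calls in question,
assuming the dictionary blob was trained beforehand (e.g. with
ZDICT_trainFromBuffer()) and is available as dict/dict_size; error
handling via ZSTD_isError() is omitted, compile with -lzstd:

/* Sketch only, not actual patch code. */
#include <zstd.h>

static size_t
compress_with_dict(void *dst, size_t dst_capacity,
                   const void *src, size_t src_size,
                   const void *dict, size_t dict_size)
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    /* The prebuilt dictionary is what lets short, similar datums
     * compress well despite having little internal redundancy. */
    size_t n = ZSTD_compress_usingDict(cctx, dst, dst_capacity,
                                       src, src_size,
                                       dict, dict_size,
                                       ZSTD_CLEVEL_DEFAULT);
    ZSTD_freeCCtx(cctx);
    return n;
}

static size_t
decompress_with_dict(void *dst, size_t dst_capacity,
                     const void *src, size_t src_size,
                     const void *dict, size_t dict_size)
{
    ZSTD_DCtx *dctx = ZSTD_createDCtx();
    /* Decompression needs the exact dictionary that was used at
     * compression time, hence the dictionary "id" in the extended
     * datum header you describe. */
    size_t n = ZSTD_decompress_usingDict(dctx, dst, dst_capacity,
                                         src, src_size,
                                         dict, dict_size);
    ZSTD_freeDCtx(dctx);
    return n;
}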

What puzzles me, though, is: what prevents us from doing this at the
page level, as suggested previously?

The more similar data you compress together, the more space and disk
I/O you save. Additionally, you don't have to compress/decompress the
data every time you access it: everything in shared buffers stays
uncompressed. Not to mention that you don't have to care what's in
pg_attribute, that the schema may change, etc. There is a table and a
dictionary for this table that you refresh from time to time. Very
simple.
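
To illustrate the first point, a toy sketch (not a benchmark)
comparing per-datum vs. per-page compression of similar documents,
again assuming zstd and -lzstd:

/* Toy illustration: N similar (here identical) JSON documents
 * compressed one by one vs. as a single page-sized block.
 * Error handling via ZSTD_isError() is omitted. */
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void)
{
    const char *doc =
        "{\"id\": 42, \"name\": \"example\", \"tags\": [\"a\", \"b\"]}";
    enum { NDOCS = 64 };
    char page[8192] = "";
    char out[16384];
    size_t per_doc_total = 0;

    for (int i = 0; i < NDOCS; i++) {
        /* Tuple-level: each datum compressed in isolation. */
        per_doc_total += ZSTD_compress(out, sizeof(out),
                                       doc, strlen(doc), 3);
        strcat(page, doc);          /* build the "page" as we go */
    }

    /* Page-level: all the similar datums compressed together. */
    size_t page_total = ZSTD_compress(out, sizeof(out),
                                      page, strlen(page), 3);

    printf("per-tuple: %zu bytes, per-page: %zu bytes\n",
           per_doc_total, page_total);
    return 0;
}

For inputs like this the one-block variant wins by a wide margin,
since the compressor can reference repetitions across documents,
which is exactly the cross-tuple redundancy page-level compression
would exploit without needing any dictionary training.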

Of course, the disadvantage here is that we are not saving memory,
unlike with tuple-level compression. But we are saving a lot of CPU
cycles and doing fewer disk I/Os. I would argue that saving CPU cycles
is generally preferable: CPUs are still often the bottleneck, while
memory becomes more and more available, e.g. there are relatively
affordable (for a company, not an individual) 1 TB RAM instances, etc.

So it seems to me that page-level compression would be simpler and
more beneficial in the long run (10+ years). Don't you agree?

-- 
Best regards,
Aleksander Alekseev


