Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2023-02-05
Msg-id: 20230205190602.m7kgendxow6j2a3y@alap3.anarazel.de
In response to: Re: [PATCH] Compression dictionaries for JSONB (Aleksander Alekseev <aleksander@timescale.com>)
List: pgsql-hackers
Hi,

On 2023-02-05 20:05:51 +0300, Aleksander Alekseev wrote:
> > I don't think we'd want much of the infrastructure introduced in the
> > patch for type agnostic cross-row compression. A dedicated "dictionary"
> > type as a wrapper around other types IMO is the wrong direction. This
> > should be a relation-level optimization option, possibly automatic, not
> > something visible to every user of the table.
>
> So to clarify, are we talking about tuple-level compression? Or
> perhaps page-level compression?

Tuple level.

What I think we should do is basically this:

When we compress datums, we know the table being targeted. If there's a
pg_attribute parameter indicating that a dictionary should be used, we can
pass a prebuilt dictionary to the LZ4/zstd [de]compression functions.

It's possible we'd need to use a somewhat extended header for such
compressed datums, to reference the dictionary "id" to be used when
decompressing, if the compression algorithms don't already have that in
one of their headers, but that's entirely doable.
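
To sketch what I mean (standalone C against libzstd, not actual PostgreSQL
code - the header layout, the names and the dictionary-lookup callback are
all made up for illustration): each compressed datum carries a small extended
header with the dictionary id, compression goes through a prebuilt
ZSTD_CDict, and decompression looks the matching ZSTD_DDict up by that id.
The CDict/DDict would be built once per dictionary, e.g. with
ZSTD_createCDict()/ZSTD_createDDict(), and cached.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

typedef struct DictCompressedHdr
{
    uint32_t    rawsize;    /* uncompressed length, for allocation */
    uint32_t    dict_id;    /* which dictionary to use when decompressing */
} DictCompressedHdr;

/* compress src with a prebuilt dictionary, prefixing the extended header */
static void *
dict_compress(const void *src, size_t srclen, uint32_t dict_id,
              const ZSTD_CDict *cdict, size_t *outlen)
{
    size_t      bound = ZSTD_compressBound(srclen);
    char       *out = malloc(sizeof(DictCompressedHdr) + bound);
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    DictCompressedHdr hdr = {(uint32_t) srclen, dict_id};
    size_t      clen;

    clen = ZSTD_compress_usingCDict(cctx,
                                    out + sizeof(hdr), bound,
                                    src, srclen,
                                    cdict);
    ZSTD_freeCCtx(cctx);
    if (ZSTD_isError(clen))
    {
        free(out);
        return NULL;
    }
    memcpy(out, &hdr, sizeof(hdr));
    *outlen = sizeof(hdr) + clen;
    return out;
}

/* decompress; lookup_ddict() stands in for "fetch cached dictionary by id" */
static void *
dict_decompress(const void *cdata, size_t cdatalen,
                const ZSTD_DDict *(*lookup_ddict) (uint32_t dict_id),
                size_t *rawlen)
{
    DictCompressedHdr hdr;
    ZSTD_DCtx  *dctx = ZSTD_createDCtx();
    char       *raw;
    size_t      dlen;

    memcpy(&hdr, cdata, sizeof(hdr));
    raw = malloc(hdr.rawsize);
    dlen = ZSTD_decompress_usingDDict(dctx,
                                      raw, hdr.rawsize,
                                      (const char *) cdata + sizeof(hdr),
                                      cdatalen - sizeof(hdr),
                                      lookup_ddict(hdr.dict_id));
    ZSTD_freeDCtx(dctx);
    if (ZSTD_isError(dlen))
    {
        free(raw);
        return NULL;
    }
    *rawlen = dlen;
    return raw;
}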


A quick demo of the effect size:

# create data to train dictionary with, use subset to increase realism
mkdir /tmp/pg_proc_as_json/;
CREATE EXTENSION adminpack;
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true)
FROM pg_proc pp
LIMIT 2000;


# build dictionary
zstd --train -o /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

# create more data
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true) FROM pg_proc pp;


# compress without dictionary
lz4 -k -m /tmp/pg_proc_as_json/*.raw
zstd -k /tmp/pg_proc_as_json/*.raw

# measure size
cat /tmp/pg_proc_as_json/*.raw|wc -c; cat /tmp/pg_proc_as_json/*.lz4|wc -c; cat /tmp/pg_proc_as_json/*.zst|wc -c


# compress with dictionary
rm -f /tmp/pg_proc_as_json/*.{lz4,zst};
lz4 -k -D /tmp/pg_proc_as_json.dict -m /tmp/pg_proc_as_json/*.raw
zstd -k -D /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

# measure size again, same command as above

Here are the results:

               lz4       zstd    uncompressed
no dict    1328794     982497        3898498
dict        375070     267194

I'd say the effect of the dictionary is pretty impressive. And remember,
this is with the dictionary having been trained on a subset of the data.


As a comparison, here's all the data compressed at once:

               lz4       zstd
no dict     180231     104913
dict        179814     106444

Unsurprisingly the dictionary doesn't help much, because the compression
algorithm can "natively" see the duplication.
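
(For reference, a comparable "at once" measurement can be had by
concatenating all the files and compressing the result as a single stream,
e.g.:

cat /tmp/pg_proc_as_json/*.raw | lz4 | wc -c
cat /tmp/pg_proc_as_json/*.raw | lz4 -D /tmp/pg_proc_as_json.dict | wc -c
cat /tmp/pg_proc_as_json/*.raw | zstd | wc -c
cat /tmp/pg_proc_as_json/*.raw | zstd -D /tmp/pg_proc_as_json.dict | wc -c
)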


- Andres


