Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date:
Msg-id: 20230205190602.m7kgendxow6j2a3y@alap3.anarazel.de
In response to: Re: [PATCH] Compression dictionaries for JSONB (Aleksander Alekseev <aleksander@timescale.com>)
Responses: Re: [PATCH] Compression dictionaries for JSONB
           Re: [PATCH] Compression dictionaries for JSONB
List: pgsql-hackers
Hi,

On 2023-02-05 20:05:51 +0300, Aleksander Alekseev wrote:
> > I don't think we'd want much of the infrastructure introduced in the
> > patch for type agnostic cross-row compression. A dedicated "dictionary"
> > type as a wrapper around other types IMO is the wrong direction. This
> > should be a relation-level optimization option, possibly automatic, not
> > something visible to every user of the table.
>
> So to clarify, are we talking about tuple-level compression? Or
> perhaps page-level compression?

Tuple level.

What I think we should do is basically this: When we compress datums, we know
the table being targeted. If there's a pg_attribute parameter indicating we
should, we can pass a prebuilt dictionary to the LZ4/zstd [de]compression
functions.

It's possible we'd need to use a somewhat extended header for such compressed
datums, to reference the dictionary "id" to be used when decompressing, if the
compression algorithms don't already have that in one of their headers, but
that's entirely doable.


A quick demo of the effect size:

# create data to train dictionary with, use subset to increase realism
mkdir /tmp/pg_proc_as_json/;

CREATE EXTENSION adminpack;
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true)
FROM pg_proc pp
LIMIT 2000;

# build dictionary
zstd --train -o /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

# create more data
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true)
FROM pg_proc pp;

# compress without dictionary
lz4 -k -m /tmp/pg_proc_as_json/*.raw
zstd -k /tmp/pg_proc_as_json/*.raw

# measure size
cat /tmp/pg_proc_as_json/*.raw|wc -c; cat /tmp/pg_proc_as_json/*.lz4|wc -c; cat /tmp/pg_proc_as_json/*.zst|wc -c

# compress with dictionary
rm -f /tmp/pg_proc_as_json/*.{lz4,zst};
lz4 -k -D /tmp/pg_proc_as_json.dict -m /tmp/pg_proc_as_json/*.raw
zstd -k -D /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

then measured the sizes again the same way.

Here are the results:

            lz4        zstd       uncompressed
no dict     1328794    982497     3898498
dict         375070    267194

I'd say the effect of the dictionary is pretty impressive. And remember, this
is with the dictionary having been trained on a subset of the data.

As a comparison, here's all the data compressed at once:

            lz4        zstd
no dict     180231     104913
dict        179814     106444

Unsurprisingly the dictionary doesn't help much, because the compression
algorithm can "natively" see the duplication.

- Andres
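
P.S. To make "pass a prebuilt dictionary to the LZ4/zstd [de]compression
functions" a bit more concrete, here's a rough, untested standalone sketch
using zstd's plain dictionary API (ZSTD_compress_usingDict() /
ZSTD_decompress_usingDict()). The DictCompressedHeader struct, the dict_id
field and the function names are made up purely for illustration; how the
dictionary bytes are looked up, and how such a header would fit into the
existing compressed-varlena format, is exactly the part that'd need designing.

/*
 * Rough sketch, not the patch: per-datum compression with a prebuilt zstd
 * dictionary, storing a dictionary id in front of the compressed bytes so
 * decompression knows which dictionary to look up.
 *
 * build: cc dict_sketch.c -lzstd
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <zstd.h>

typedef struct DictCompressedHeader
{
	uint32_t	dict_id;		/* which dictionary to use when decompressing */
	uint32_t	raw_size;		/* uncompressed size of the datum */
	/* compressed bytes follow */
} DictCompressedHeader;

/* compress raw_size bytes at raw, using the given dictionary */
static void *
compress_with_dict(const void *raw, size_t raw_size,
				   const void *dict, size_t dict_size,
				   uint32_t dict_id, size_t *out_size)
{
	size_t		bound = ZSTD_compressBound(raw_size);
	char	   *out = malloc(sizeof(DictCompressedHeader) + bound);
	DictCompressedHeader *hdr = (DictCompressedHeader *) out;
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();
	size_t		csize;

	csize = ZSTD_compress_usingDict(cctx,
									out + sizeof(DictCompressedHeader), bound,
									raw, raw_size,
									dict, dict_size,
									3 /* compression level */ );
	ZSTD_freeCCtx(cctx);

	if (ZSTD_isError(csize))
	{
		free(out);
		return NULL;
	}

	hdr->dict_id = dict_id;
	hdr->raw_size = (uint32_t) raw_size;
	*out_size = sizeof(DictCompressedHeader) + csize;
	return out;
}

/* decompress a datum produced by compress_with_dict() */
static void *
decompress_with_dict(const void *compressed, size_t compressed_size,
					 const void *dict, size_t dict_size)
{
	const DictCompressedHeader *hdr = compressed;
	const char *payload = (const char *) compressed + sizeof(DictCompressedHeader);
	char	   *out = malloc(hdr->raw_size);
	ZSTD_DCtx  *dctx = ZSTD_createDCtx();
	size_t		dsize;

	/* a real implementation would look the dictionary up by hdr->dict_id */
	dsize = ZSTD_decompress_usingDict(dctx,
									  out, hdr->raw_size,
									  payload,
									  compressed_size - sizeof(DictCompressedHeader),
									  dict, dict_size);
	ZSTD_freeDCtx(dctx);

	if (ZSTD_isError(dsize) || dsize != hdr->raw_size)
	{
		free(out);
		return NULL;
	}
	return out;
}

int
main(void)
{
	/* reuse the dictionary trained in the demo above */
	FILE	   *f = fopen("/tmp/pg_proc_as_json.dict", "rb");
	static char dict[1 << 20];
	size_t		dict_size = f ? fread(dict, 1, sizeof(dict), f) : 0;
	const char *json = "{\"proname\": \"boolin\", \"pronamespace\": \"pg_catalog\"}";
	size_t		csize;
	void	   *compressed = compress_with_dict(json, strlen(json) + 1,
												dict, dict_size, 42, &csize);
	char	   *roundtrip = decompress_with_dict(compressed, csize,
												 dict, dict_size);

	if (f)
		fclose(f);
	printf("%zu -> %zu bytes, roundtrip ok: %d\n",
		   strlen(json) + 1, csize, strcmp(json, roundtrip) == 0);
	return 0;
}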