Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date:
Msg-id: 20230205190602.m7kgendxow6j2a3y@alap3.anarazel.de
In response to: Re: [PATCH] Compression dictionaries for JSONB (Aleksander Alekseev <aleksander@timescale.com>)
Responses: Re: [PATCH] Compression dictionaries for JSONB
           Re: [PATCH] Compression dictionaries for JSONB
List: pgsql-hackers
Hi,

On 2023-02-05 20:05:51 +0300, Aleksander Alekseev wrote:
> > I don't think we'd want much of the infrastructure introduced in the
> > patch for type agnostic cross-row compression. A dedicated "dictionary"
> > type as a wrapper around other types IMO is the wrong direction. This
> > should be a relation-level optimization option, possibly automatic, not
> > something visible to every user of the table.
>
> So to clarify, are we talking about tuple-level compression? Or
> perhaps page-level compression?

Tuple level.

What I think we should do is basically this: When we compress datums, we know
the table being targeted. If there's a pg_attribute parameter indicating we
should, we can pass a prebuilt dictionary to the LZ4/zstd [de]compression
functions.

It's possible we'd need to use a somewhat extended header for such compressed
datums, to reference the dictionary "id" to be used when decompressing, if the
compression algorithms don't already have that in one of their headers, but
that's entirely doable.


A quick demo of the effect size:

# create data to train dictionary with, use subset to increase realism
mkdir /tmp/pg_proc_as_json/;

CREATE EXTENSION adminpack;
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true)
FROM pg_proc pp
LIMIT 2000;

# build dictionary
zstd --train -o /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

# create more data
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true)
FROM pg_proc pp;

# compress without dictionary
lz4 -k -m /tmp/pg_proc_as_json/*.raw
zstd -k /tmp/pg_proc_as_json/*.raw

# measure size
cat /tmp/pg_proc_as_json/*.raw|wc -c; cat /tmp/pg_proc_as_json/*.lz4|wc -c; cat /tmp/pg_proc_as_json/*.zst|wc -c

# compress with dictionary
rm -f /tmp/pg_proc_as_json/*.{lz4,zst};
lz4 -k -D /tmp/pg_proc_as_json.dict -m /tmp/pg_proc_as_json/*.raw
zstd -k -D /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

then measured the sizes again the same way.

Here are the results:

            lz4        zstd       uncompressed
no dict     1328794    982497     3898498
dict         375070    267194

I'd say the effect of the dictionary is pretty impressive. And remember, this
is with the dictionary having been trained on a subset of the data.

As a comparison, here's all the data compressed at once:

            lz4        zstd
no dict     180231     104913
dict        179814     106444

Unsurprisingly the dictionary doesn't help much, because the compression
algorithm can "natively" see the duplication.

- Andres
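
P.S. To make "pass a prebuilt dictionary to the LZ4/zstd [de]compression
functions" a bit more concrete, here's a rough, untested standalone sketch
using zstd's plain dictionary API (ZSTD_compress_usingDict() /
ZSTD_decompress_usingDict()). The DictCompressedHeader struct, the dict_id
field and the function names are made up purely for illustration; how the
dictionary bytes are looked up, and how such a header would fit into the
existing compressed-varlena format, is exactly the part that'd need designing.

/*
 * Rough sketch, not the patch: per-datum compression with a prebuilt zstd
 * dictionary, storing a dictionary id in front of the compressed bytes so
 * decompression knows which dictionary to look up.
 *
 * build: cc dict_sketch.c -lzstd
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <zstd.h>

typedef struct DictCompressedHeader
{
	uint32_t	dict_id;		/* which dictionary to use when decompressing */
	uint32_t	raw_size;		/* uncompressed size of the datum */
	/* compressed bytes follow */
} DictCompressedHeader;

/* compress raw_size bytes at raw, using the given dictionary */
static void *
compress_with_dict(const void *raw, size_t raw_size,
				   const void *dict, size_t dict_size,
				   uint32_t dict_id, size_t *out_size)
{
	size_t		bound = ZSTD_compressBound(raw_size);
	char	   *out = malloc(sizeof(DictCompressedHeader) + bound);
	DictCompressedHeader *hdr = (DictCompressedHeader *) out;
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();
	size_t		csize;

	csize = ZSTD_compress_usingDict(cctx,
									out + sizeof(DictCompressedHeader), bound,
									raw, raw_size,
									dict, dict_size,
									3 /* compression level */ );
	ZSTD_freeCCtx(cctx);

	if (ZSTD_isError(csize))
	{
		free(out);
		return NULL;
	}

	hdr->dict_id = dict_id;
	hdr->raw_size = (uint32_t) raw_size;
	*out_size = sizeof(DictCompressedHeader) + csize;
	return out;
}

/* decompress a datum produced by compress_with_dict() */
static void *
decompress_with_dict(const void *compressed, size_t compressed_size,
					 const void *dict, size_t dict_size)
{
	const DictCompressedHeader *hdr = compressed;
	const char *payload = (const char *) compressed + sizeof(DictCompressedHeader);
	char	   *out = malloc(hdr->raw_size);
	ZSTD_DCtx  *dctx = ZSTD_createDCtx();
	size_t		dsize;

	/* a real implementation would look the dictionary up by hdr->dict_id */
	dsize = ZSTD_decompress_usingDict(dctx,
									  out, hdr->raw_size,
									  payload,
									  compressed_size - sizeof(DictCompressedHeader),
									  dict, dict_size);
	ZSTD_freeDCtx(dctx);

	if (ZSTD_isError(dsize) || dsize != hdr->raw_size)
	{
		free(out);
		return NULL;
	}
	return out;
}

int
main(void)
{
	/* reuse the dictionary trained in the demo above */
	FILE	   *f = fopen("/tmp/pg_proc_as_json.dict", "rb");
	static char dict[1 << 20];
	size_t		dict_size = f ? fread(dict, 1, sizeof(dict), f) : 0;
	const char *json = "{\"proname\": \"boolin\", \"pronamespace\": \"pg_catalog\"}";
	size_t		csize;
	void	   *compressed = compress_with_dict(json, strlen(json) + 1,
												dict, dict_size, 42, &csize);
	char	   *roundtrip = decompress_with_dict(compressed, csize,
												 dict, dict_size);

	if (f)
		fclose(f);
	printf("%zu -> %zu bytes, roundtrip ok: %d\n",
		   strlen(json) + 1, csize, strcmp(json, roundtrip) == 0);
	return 0;
}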