Home > mailing lists

Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers

From	Nikita Malakhov
Subject	Re: [PATCH] Compression dictionaries for JSONB
Date	April 18, 2023 16:12:08
Msg-id	CAN-LCVP2KodiE7bTBBmiLszn1Vg8jCOKUcz_2pb=J9mTqG8o2w@mail.gmail.com Whole thread
In response to	Re: [PATCH] Compression dictionaries for JSONB (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses	Re: [PATCH] Compression dictionaries for JSONB Re: [PATCH] Compression dictionaries for JSONB
List	pgsql-hackers

Tree view

Hi,

I don't think it's a good idea to interfere with the storage strategies. Dictionary

should be a kind of storage option, like a compression, but not the strategy

declining all others.

>> While thinking about how a user interface could look like it occured
>> to me that what we are discussing could be merely a new STORAGE
>> strategy. Currently there are PLAIN, MAIN, EXTERNAL and EXTENDED.
>> Let's call a new strategy DICTIONARY, with typstorage = d.

>I think "AT_AC SET COMPRESSION lz4 {[WITH | WITHOUT] DICTIONARY}",
>"AT_AC SET COMPRESSION lz4-dictionary", or "AT_AC SET
>compression_dictionary = on" would be better from a design
>perspective.

Agree with Matthias on above.

About the TOAST pointer:

>The toast pointer must store enough info about the compression used to
>decompress the datum, which implies it needs to store the compression
>algorithm used, and a reference to the compression dictionary (if
>any). I think the idea about introducing a new toast pointer type (in
>the custom toast patch) wasn't bad per se, and that change would allow
>us to carry more or different info in the header.

The External TOAST pointer is very limited to the amount of service data

it could keep, that's why we introduced the Custom TOAST pointers in the

Pluggable TOAST. But keep in mind that changing the TOAST pointer

structure requires a lot of quite heavy modifications in the core - along with

some obvious places like insert/update/delete datum there is very serious

issue with logical replication.

The Pluggable TOAST was rejected, but we have a lot of improvements

based on changing the TOAST pointer structure.

On Tue, Apr 18, 2023 at 6:40 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Tue, 18 Apr 2023 at 17:28, Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi Andres,
>
> > As I said, I don't think we should extend dictionaries. For this to work we'll
> > likely need a new / extended compressed toast datum header of some form, with
> > a reference to the dictionary. That'd likely be needed even with updatable
> > dictionaries, as we IIRC don't know which column a toasted datum is for, but
> > we need to know, to identify the dictionary. As we need that field anyway, we
> > can easily have multiple dictionaries.
>
> So I summarized the requirements we agreed on so far and ended up with
> the following list:
>
> * This is going to be a PostgreSQL feature, not an extension, not a
> bunch of hooks, etc;
> * We are not going to support lazy/partial decompression since this is
> too complicated in a general case and Postgres is not a specialized
> document-oriented DBMS (there is a normalization after all);
> * This should be a relation-level optimization option, not something
> visible to every user of the table (not a custom type, etc);
> * This is going to be an attribute-level compression;
> * The dictionaries should be created automatically (maybe not in a PoC
> but in the final implementation) since people are not good at it;
> * We are going to be using the existing compression algorithms like
> LZ4/ZSTD, not to invent new ones;
> * When created, a dictionary version is immutable, i.e. no new entries
> can be added. New version of a dictionary can be created when the data
> evolves. The compressed data stores the dictionary version used for
> compression. A dictionary version can't be deleted while data exists
> that uses this version of a dictionary;
> * Dictionaries are created automatically from sampled data during
> ANALIZE. We compare the efficiency of a new dictionary vs the
> efficiency of the old one (or the lack of such) on sampled data and
> depending on the results decide whether it's worth creating a new
> version of a dictionary;
> * This is going to work for multiple types: TEXT, JSON, JSONB, XML,
> BYTEA etc. Ideally for user-defined types too;

Any type with typlen < 0 should work, right?

> Hopefully I didn't miss anything.
>
> While thinking about how a user interface could look like it occured
> to me that what we are discussing could be merely a new STORAGE
> strategy. Currently there are PLAIN, MAIN, EXTERNAL and EXTENDED.
> Let's call a new strategy DICTIONARY, with typstorage = d.

The use of dictionaries should be dependent on only the use of a
compression method that supports pre-computed compression
dictionaries. I think storage=MAIN + compression dictionaries should
be supported, to make sure there is no expensive TOAST lookup for the
attributes of the tuple; but that doesn't seem to be an option with
that design.

> When user wants a given attribute to be compressed, he/she says:
>
> ALTER TABLE foo ALTER COLUMN bar SET STORAGE DICTIONARY;
>
> And the compression algorithms is chosen as usual:
>
> ALTER TABLE foo ALTER COLUMN bar SET COMPRESSION lz4;
>
> When there are no dictionaries yet, DICTIONARY works the same as
> EXTENDED. When a dictionary is trained the data is compressed using
> the latest version of this dictionary. For visibility we are going to
> need some sort of pg_stat_dictionaries view that shows the existing
> dictionaries, how much space they consume, etc.

I think "AT_AC SET COMPRESSION lz4 {[WITH | WITHOUT] DICTIONARY}",
"AT_AC SET COMPRESSION lz4-dictionary", or "AT_AC SET
compression_dictionary = on" would be better from a design
perspective.

> If we choose this approach there are a couple of questions/notes that
> come to mind:
>
> * The default compression algorithm is PGLZ and unlike LZ4 it doesn't
> support training dictionaries yet. This should be straightforward to
> implement though, or alternatively shared dictionaries could work only
> for LZ4;

Didn't we get zstd support recently as well?

> * Currently users have control of toast_tuple_target but not
> TOAST_TUPLE_THRESHOLD. Which means for tuples smaller than 1/4 of the
> page size shared dictionaries are not going to be triggered. Which is
> not necessarily a bad thing. Alternatively we could give the users
> toast_tuple_threshold setting. This shouldn't necessarily be part of
> this CF entry discussion however, we can always discuss it separately;

That makes a lot of sense, but as you said handling that separately
would probably be better and easier to review.

> * Should we allow setting DICTIONARY storage strategy for a given
> type, i.e. CREATE TYPE baz STORAGE = DICTIONARY? I suggest we forbid
> it in the first implementation, just for the sake of simplicity.

Can we specify a default compression method for each postgresql type,
just like how we specify the default storage? If not, then the setting
could realistically be in conflict with a default_toast_compression
setting, assuming that dictionary support is not a requirement for
column compression methods.

> * It looks like we won't be able to share a dictionary between
> multiple columns. Which again is not necessarily a bad thing: data in
> these columns can be completely different (e.g. BYTEA and XML),
> columns can be dropped independently, etc.

Yes

> If a user is interested in
> sharing a dictionary between several columns he/she can join these
> columns in a single JSONB column.

It is unreasonable to expect this to be possible, due to e.g.
partitioning resulting in columns that share compressable patters to
be on different physical tables.

> * TOAST currently doesn't support ZSTD. IMO this is not a big deal and
> adding the corresponding support can be discussed separately.
> * If memory serves, there were not so many free bits left in TOAST
> pointers. The pointers don't store a storage strategy though so
> hopefully this will not be a problem. We'll see.

The toast pointer must store enough info about the compression used to
decompress the datum, which implies it needs to store the compression
algorithm used, and a reference to the compression dictionary (if
any). I think the idea about introducing a new toast pointer type (in
the custom toast patch) wasn't bad per se, and that change would allow
us to carry more or different info in the header.

Kind regards,

Matthias van de Meent

Regards,

Nikita Malakhov

Postgres Professional

The Russian Postgres Company

https://postgrespro.ru/

pgsql-hackers by date:

From: Tom Lane
Date: 18 April 2023, 16:06:09
Subject: Re: constants for tar header offsets

From: Robert Haas
Date: 18 April 2023, 16:23:59
Subject: Re: Request for comment on setting binary format output per session

Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers

Previous

Next