Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers
From: Nikita Malakhov
Subject: Re: [PATCH] Compression dictionaries for JSONB
Msg-id: CAN-LCVOn68NnZ-CUc56XfmS+HHK_PoOS3y1RsLbnMhASU3HMbg@mail.gmail.com
In response to: Re: [PATCH] Compression dictionaries for JSONB (Aleksander Alekseev <aleksander@timescale.com>)
List: pgsql-hackers
Hi hackers!
I've got a part question, part proposal for the future development of this
feature:
What if we used the pg_dict table not to store the dictionaries themselves but
only their metadata, with the actual dictionaries stored in separate tables, as
is done with TOAST tables (i.e. pg_dict.<dictionary 1 entry> --> pg_dict_16385 table)?
Thus we could kill several birds with one stone. Concurrent dictionary
updates, which currently look like a very serious issue, would no longer affect
each other or overall DB performance. We would get around the SQL statement
size restriction as well as the dictionary size restriction. We could deal
effectively with versioned entries inside dictionaries, and even with versions
of the dictionaries themselves. We could reuse the mechanism for duplicated
JSON parts. And we could provide an API for working with dictionaries and
dictionary tables, which later could be useful for working with JSON schemas
as well (maybe with some extension).
Overall structure could look like this:
pg_dict
|
|---- dictionary 1 meta
| |--name
| |--size
| |--etc
| |--dictionary table name (i.e. pg_dict_16385)
| |
| |----> pg_dict_16385
|
|---- dictionary 2 meta
| |--name
| |--size
| |--etc
| |--dictionary table name (i.e. pg_dict_16386)
| |
| |----> pg_dict_16386
...
where a dictionary table could look like:
pg_dict_16385
|
|---- key 1
| |-value
|
|---- key 2
| |-value
...
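The two-level layout above can be modeled with a small toy example (plain Python, purely illustrative of the proposed pg_dict -> pg_dict_NNNN indirection; the dictionary names and contents are made up, and nothing here is patch code):

```python
# Toy model of the proposed layout: pg_dict holds only metadata, and
# each dictionary's entries live in their own "table", analogous to
# per-relation TOAST tables. Purely illustrative, not PostgreSQL code.

# pg_dict: one metadata row per dictionary
pg_dict = {
    1: {"name": "jsonb_keys_a", "size": 2, "table": "pg_dict_16385"},
    2: {"name": "jsonb_keys_b", "size": 1, "table": "pg_dict_16386"},
}

# per-dictionary entry tables
tables = {
    "pg_dict_16385": {1: "customer_name", 2: "customer_address"},
    "pg_dict_16386": {1: "order_items"},
}

def dict_lookup(dict_id, key):
    """Resolve an entry through the metadata indirection."""
    meta = pg_dict[dict_id]
    return tables[meta["table"]][key]

# An update to one dictionary's table never touches another
# dictionary's table, which is the point w.r.t. concurrent updates.
tables["pg_dict_16386"][2] = "order_total"

print(dict_lookup(1, 2))   # customer_address
print(dict_lookup(2, 2))   # order_total
```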
And with a special DICT API we would have the means to access, cache, and
store our dictionaries in a uniform way from different levels. This implementation
also looks like a very valuable addition to our JSONb Toaster.
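Such a DICT API surface could look roughly like the following sketch (hypothetical Python, with a simple in-memory cache in front of the backing dictionary table; every name here is invented for illustration):

```python
# Hypothetical sketch of a uniform DICT API with a per-backend cache
# in front of a backing dictionary table. All names are invented for
# illustration; nothing here is patch code.

class DictAPI:
    def __init__(self, storage):
        self.storage = storage   # backing "dictionary table" (key -> value)
        self.cache = {}          # in-memory cache of entries

    def get(self, key):
        """Look up an entry, consulting the cache first."""
        if key not in self.cache:
            self.cache[key] = self.storage[key]
        return self.cache[key]

    def put(self, key, value):
        """Store a new entry and keep the cache coherent."""
        self.storage[key] = value
        self.cache[key] = value

storage = {1: "common_json_key"}
api = DictAPI(storage)
api.put(2, "another_key")
print(api.get(1), api.get(2))   # common_json_key another_key
```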
JSON schema processing is a very promising feature, and we have to keep up
with major competitors like Oracle, which are already working on it.
On Mon, Aug 1, 2022 at 2:25 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
Hi hackers,
> So far we seem to have a consensus to:
>
> 1. Use bytea instead of NameData to store dictionary entries;
>
> 2. Assign monotonically ascending IDs to the entries instead of using
> Oids, as it is done with pg_class.relnatts. In order to do this we
> should either add a corresponding column to pg_type, or add a new
> catalog table, e.g. pg_dict_meta. Personally I don't have a strong
> opinion on what is better. Thoughts?
>
> Both changes should be straightforward to implement and also are a
> good exercise to newcomers.
>
> I invite anyone interested to join this effort as a co-author! (since,
> honestly, rewriting the same feature over and over again alone is
> quite boring :D).
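The "monotonically ascending IDs" scheme quoted above can be illustrated with a tiny per-dictionary counter, similar in spirit to how attribute numbers work via pg_class.relnatts (toy Python, not patch code):

```python
# Toy illustration of per-dictionary monotonically ascending entry
# IDs, as opposed to globally allocated Oids. Similar in spirit to
# attribute numbering via pg_class.relnatts. Not patch code.

class Dictionary:
    def __init__(self):
        self.next_id = 1     # per-dictionary counter, not a global Oid
        self.entries = {}    # id -> bytea-like entry

    def add_entry(self, data: bytes) -> int:
        entry_id = self.next_id
        self.entries[entry_id] = data
        self.next_id += 1
        return entry_id

d = Dictionary()
ids = [d.add_entry(b"customer"), d.add_entry(b"address")]
print(ids)   # [1, 2]
```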
cfbot complained that v5 doesn't apply anymore. Here is the rebased
version of the patch.
> Good point. This was not a problem for ZSON since the dictionary size
> was limited to 2**16 entries, the dictionary was immutable, and the
> dictionaries had versions. For compression dictionaries we removed the
> 2**16 entries limit and also decided to get rid of versions. The idea
> was that you can simply continue adding new entries, but no one
> thought about the fact that this will consume the memory required to
> decompress the document indefinitely.
>
> Maybe we should return to the idea of limited dictionary size and
> versions. Objections?
> [ ...]
> You are right. Another reason to return to the idea of dictionary versions.
Since no one objected so far and/or proposed a better idea I assume
this can be added to the list of TODOs as well.
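The memory argument for versions quoted above can be sketched as follows: if each compressed document records the dictionary version it was encoded with, a decompressor only needs that one frozen version in memory, and old versions can eventually be dropped once no document references them, instead of a single dictionary growing indefinitely (toy Python, illustrative only):

```python
# Toy sketch of versioned compression dictionaries. A compressed
# document records the dictionary version it was encoded with, so
# decompression needs only that one bounded version, and unreferenced
# old versions can be dropped. Illustrative only, not patch code.

versions = {
    1: {1: b"customer_name"},                      # frozen version 1
    2: {1: b"customer_name", 2: b"order_items"},   # frozen version 2
}

def compress(doc_keys, version):
    """Replace known keys by their entry IDs under a fixed version."""
    table = {v: k for k, v in versions[version].items()}
    return version, [table.get(k, k) for k in doc_keys]

def decompress(version, payload):
    """Restore keys using exactly the version the document names."""
    table = versions[version]
    return [table.get(x, x) for x in payload]

v, payload = compress([b"customer_name", b"order_items"], 2)
print(v, payload)              # 2 [1, 2]
print(decompress(v, payload))  # [b'customer_name', b'order_items']
```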
--
Best regards,
Aleksander Alekseev