Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers

From Nikita Malakhov
Subject Re: [PATCH] Compression dictionaries for JSONB
Msg-id CAN-LCVOn68NnZ-CUc56XfmS+HHK_PoOS3y1RsLbnMhASU3HMbg@mail.gmail.com
In response to Re: [PATCH] Compression dictionaries for JSONB  (Aleksander Alekseev <aleksander@timescale.com>)
List pgsql-hackers
Hi hackers!

I've got a partly question, partly proposal for the future development of this
feature: what if we used the pg_dict table not to store the dictionaries
themselves, but only their metadata, with each actual dictionary stored in a
separate table, as is done with TOAST tables (e.g. pg_dict.<dictionary 1
entry> --> the pg_dict_16385 table)?

This would kill several birds with one stone:
- concurrent dictionary updates, which look like a very serious issue right
  now, would no longer affect each other or overall DB performance;
- we would get around the SQL statement size restriction;
- we could effectively deal with versioned entries within a dictionary, with
  dictionary versions, and with the dictionary size restriction;
- we could use it for duplicated JSON parts;
- we could provide an API for working with dictionaries and dictionary
  tables, which could later be useful for working with JSON schemas as well
  (maybe with some extension).

The overall structure could look like this:
pg_dict
   |
   |---- dictionary 1 meta
   |           |--name
   |           |--size
   |           |--etc
   |           |--dictionary table name (e.g. pg_dict_16385)
   |                  |
   |                  |----> pg_dict_16385
   |
   |---- dictionary 2 meta
   |           |--name
   |           |--size
   |           |--etc
   |           |--dictionary table name (e.g. pg_dict_16386)
   |                  |
   |                  |----> pg_dict_16386
  ...

where a dictionary table could look like this:
pg_dict_16385
   |
   |---- key 1
   |        |-value 
   |
   |---- key 2
   |        |-value 
  ...
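
Translated into SQL, a purely hypothetical sketch of this layout could look
like the one below. To be clear, the table names, column names, and types
here are made up only to illustrate the idea; none of them exist in the
patch:

    -- Hypothetical catalog that stores only dictionary metadata;
    -- all names and columns below are illustrative, not from the patch.
    CREATE TABLE pg_dict (
        dictid      oid  PRIMARY KEY,  -- dictionary identifier
        dictname    name NOT NULL,     -- dictionary name
        dictsize    int4 NOT NULL,     -- current number of entries
        dictversion int4 NOT NULL,     -- dictionary version
        dictrelid   oid  NOT NULL      -- backing table, e.g. pg_dict_16385
    );

    -- One backing table per dictionary, created on demand, similar to
    -- the way a TOAST table is paired with its main table.
    CREATE TABLE pg_dict_16385 (
        entryid int4  PRIMARY KEY,  -- monotonically ascending entry ID
        entry   bytea NOT NULL      -- the dictionary entry itself
    );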

And with a special DICT API we would have a uniform means to access, cache,
and store our dictionaries from different levels. In this form it also looks
like a very valuable addition to our JSONB Toaster.
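
For illustration only, usage of such an API at the SQL level might look like
this (the function names and signatures are hypothetical; nothing like them
exists yet):

    -- Hypothetical DICT API calls; names and signatures are made up.
    SELECT dict_create('jsonb_keys_dict');             -- create metadata row and backing table
    SELECT dict_add_entry(16385, '\x6b657931'::bytea); -- add an entry, returns its ID
    SELECT dict_get_entry(16385, 1);                   -- fetch (and cache) an entry by ID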

JSON schema processing is a very promising feature, and we have to keep up
with major competitors like Oracle, which is already working on it.

On Mon, Aug 1, 2022 at 2:25 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
Hi hackers,

> So far we seem to have a consensus to:
>
> 1. Use bytea instead of NameData to store dictionary entries;
>
> 2. Assign monotonically ascending IDs to the entries instead of using
> Oids, as it is done with pg_class.relnatts. In order to do this we
> should either add a corresponding column to pg_type, or add a new
> catalog table, e.g. pg_dict_meta. Personally I don't have a strong
> opinion on what is better. Thoughts?
>
> Both changes should be straightforward to implement and also are a
> good exercise to newcomers.
>
> I invite anyone interested to join this effort as a co-author! (since,
> honestly, rewriting the same feature over and over again alone is
> quite boring :D).

cfbot complained that v5 doesn't apply anymore. Here is the rebased
version of the patch.

> Good point. This was not a problem for ZSON since the dictionary size
> was limited to 2**16 entries, the dictionary was immutable, and the
> dictionaries had versions. For compression dictionaries we removed the
> 2**16 entries limit and also decided to get rid of versions. The idea
> was that you can simply continue adding new entries, but no one
> thought about the fact that this will consume the memory required to
> decompress the document indefinitely.
>
> Maybe we should return to the idea of limited dictionary size and
> versions. Objections?
> [ ...]
> You are right. Another reason to return to the idea of dictionary versions.

Since no one objected so far and/or proposed a better idea I assume
this can be added to the list of TODOs as well.

--
Best regards,
Aleksander Alekseev


--
Regards,
Nikita Malakhov
