Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers

From Nikita Malakhov
Subject Re: [PATCH] Compression dictionaries for JSONB
Msg-id CAN-LCVOn68NnZ-CUc56XfmS+HHK_PoOS3y1RsLbnMhASU3HMbg@mail.gmail.com
In response to Re: [PATCH] Compression dictionaries for JSONB  (Aleksander Alekseev <aleksander@timescale.com>)
List pgsql-hackers
Hi hackers!

I've got a partly question, partly proposal for the future development of this
feature: what if we used the pg_dict table not to store the dictionaries
themselves, but only their metadata, with each actual dictionary stored in a
separate table, as is done with TOAST tables (e.g. pg_dict.<dictionary 1
entry> --> the pg_dict_16385 table)?

This would kill several birds with one stone:
- concurrent dictionary updates, which look like a very serious issue right
  now, would no longer affect each other or overall DB performance;
- we would get around the SQL statement size restriction;
- we could effectively deal with versioned entries within a dictionary, with
  dictionary versions, and with the dictionary size restriction;
- we could use it for duplicated JSON parts;
- we could provide an API for working with dictionaries and dictionary
  tables, which could later be useful for working with JSON schemas as well
  (maybe with some extension).

The overall structure could look like this:
pg_dict
   |
   |---- dictionary 1 meta
   |           |--name
   |           |--size
   |           |--etc
   |           |--dictionary table name (e.g. pg_dict_16385)
   |                  |
   |                  |----> pg_dict_16385
   |
   |---- dictionary 2 meta
   |           |--name
   |           |--size
   |           |--etc
   |           |--dictionary table name (e.g. pg_dict_16386)
   |                  |
   |                  |----> pg_dict_16386
  ...

where a dictionary table could look like this:
pg_dict_16385
   |
   |---- key 1
   |        |-value 
   |
   |---- key 2
   |        |-value 
  ...
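
Translated into SQL, a purely hypothetical sketch of this layout could look
like the one below. To be clear, the table names, column names, and types
here are made up only to illustrate the idea; none of them exist in the
patch:

    -- Hypothetical catalog that stores only dictionary metadata;
    -- all names and columns below are illustrative, not from the patch.
    CREATE TABLE pg_dict (
        dictid      oid  PRIMARY KEY,  -- dictionary identifier
        dictname    name NOT NULL,     -- dictionary name
        dictsize    int4 NOT NULL,     -- current number of entries
        dictversion int4 NOT NULL,     -- dictionary version
        dictrelid   oid  NOT NULL      -- backing table, e.g. pg_dict_16385
    );

    -- One backing table per dictionary, created on demand, similar to
    -- the way a TOAST table is paired with its main table.
    CREATE TABLE pg_dict_16385 (
        entryid int4  PRIMARY KEY,  -- monotonically ascending entry ID
        entry   bytea NOT NULL      -- the dictionary entry itself
    );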

And with a special DICT API we would have a uniform means to access, cache,
and store our dictionaries from different levels. In this form it also looks
like a very valuable addition to our JSONB Toaster.
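
For illustration only, usage of such an API at the SQL level might look like
this (the function names and signatures are hypothetical; nothing like them
exists yet):

    -- Hypothetical DICT API calls; names and signatures are made up.
    SELECT dict_create('jsonb_keys_dict');             -- create metadata row and backing table
    SELECT dict_add_entry(16385, '\x6b657931'::bytea); -- add an entry, returns its ID
    SELECT dict_get_entry(16385, 1);                   -- fetch (and cache) an entry by ID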

JSON schema processing is a very promising feature, and we have to keep up
with major competitors like Oracle, which is already working on it.

On Mon, Aug 1, 2022 at 2:25 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
Hi hackers,

> So far we seem to have a consensus to:
>
> 1. Use bytea instead of NameData to store dictionary entries;
>
> 2. Assign monotonically ascending IDs to the entries instead of using
> Oids, as it is done with pg_class.relnatts. In order to do this we
> should either add a corresponding column to pg_type, or add a new
> catalog table, e.g. pg_dict_meta. Personally I don't have a strong
> opinion on what is better. Thoughts?
>
> Both changes should be straightforward to implement and also are a
> good exercise to newcomers.
>
> I invite anyone interested to join this effort as a co-author! (since,
> honestly, rewriting the same feature over and over again alone is
> quite boring :D).

cfbot complained that v5 doesn't apply anymore. Here is the rebased
version of the patch.

> Good point. This was not a problem for ZSON since the dictionary size
> was limited to 2**16 entries, the dictionary was immutable, and the
> dictionaries had versions. For compression dictionaries we removed the
> 2**16 entries limit and also decided to get rid of versions. The idea
> was that you can simply continue adding new entries, but no one
> thought about the fact that this will consume the memory required to
> decompress the document indefinitely.
>
> Maybe we should return to the idea of limited dictionary size and
> versions. Objections?
> [ ...]
> You are right. Another reason to return to the idea of dictionary versions.

Since no one objected so far and/or proposed a better idea I assume
this can be added to the list of TODOs as well.

--
Best regards,
Aleksander Alekseev


--
Regards,
Nikita Malakhov
