Thread: [PATCH] Compression dictionaries for JSONB

[PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

This is a follow-up thread to `RFC: compression dictionaries for JSONB` [1]. I would like to share my current progress in order to get early feedback. The patch is currently in a draft state but implements the basic functionality. I did my best to account for all the great feedback I previously got from Alvaro and Matthias.

Usage example:

```
CREATE TYPE mydict AS DICTIONARY OF jsonb ('aaa', 'bbb');

SELECT '{"aaa":"bbb"}'::mydict;
     mydict
----------------
 {"aaa": "bbb"}

SELECT ('{"aaa":"bbb"}'::mydict) -> 'aaa';
 ?column?
----------
 "bbb"
```

Here `mydict` works as a transparent replacement for `jsonb`. However, its internal representation differs. The provided dictionary entries ('aaa', 'bbb') are stored in the new catalog table:

```
SELECT * FROM pg_dict;
  oid  | dicttypid | dictentry
-------+-----------+-----------
 39476 |     39475 | aaa
 39477 |     39475 | bbb
(2 rows)
```

When `mydict` sees 'aaa' in the document, it replaces it with the corresponding code, in this case, 39476. For more details regarding the compression algorithm and the chosen compromises, please see the comments in the patch.
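
As a quick illustration of the effect on storage, the same value can be compared in both representations (a sketch only; the exact numbers depend on the data, the dictionary, and the patch version):

```
SELECT pg_column_size('{"aaa":"bbb"}'::jsonb)  AS jsonb_size,
       pg_column_size('{"aaa":"bbb"}'::mydict) AS mydict_size;
```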

In pg_type `mydict` has typtype = TYPTYPE_DICT. It works the same way as TYPTYPE_BASE with the only difference that the corresponding `<type>_in` (pg_type.typinput) and `<another-type>_<type>` (pg_cast.castfunc) procedures receive the dictionary Oid as the `typmod` argument. This way the procedures can distinguish `mydict1` from `mydict2` and use the proper compression dictionary.

The approach of giving `typmod` an alternative role is arguably a bit hacky, but it was the least invasive way to implement the feature that I found. I'm open to alternative suggestions.
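
For illustration, the new type can be inspected like any other (a sketch only; the actual typtype letter and function names are whatever the patch defines):

```
SELECT typname, typtype, typinput, typoutput
FROM pg_type
WHERE typname = 'mydict';
```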

Current limitations (todo):
- ALTER TYPE is not implemented
- Tests and documentation are missing
- Autocomplete is missing

Future work (out of scope of this patch):
- Support types other than JSONB: TEXT, XML, etc
- Automatically updated dictionaries, e.g. during VACUUM
- Alternative compression algorithms. Note that this will not require any further changes in the catalog, only the values we write to pg_type and pg_cast will differ.

Open questions:
- Dictionary entries are currently stored as NameData, the same type that is used for enums. Are we OK with the accompanying limitations? Any alternative suggestions?
- All in all, am I moving the right direction?

Your feedback is very much welcomed!

[1]: https://postgr.es/m/CAJ7c6TPx7N-bVw0dZ1ASCDQKZJHhBYkT6w4HV1LzfS%2BUUTUfmA%40mail.gmail.com

--
Best regards,
Aleksander Alekseev
Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Zhihong Yu
Date:


Hi,
For src/backend/catalog/pg_dict.c, please add license header.

+           elog(ERROR, "skipbytes > decoded_size - outoffset");

Include the values for skipbytes, decoded_size and outoffset.

Cheers

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Zhihong,

Many thanks for your feedback!

> For src/backend/catalog/pg_dict.c, please add license header.

Fixed.

> +           elog(ERROR, "skipbytes > decoded_size - outoffset");
>
> Include the values for skipbytes, decoded_size and outoffset.

In fact, this code should never be executed, and if it somehow is,
this information will not help us much in debugging the issue. I made
corresponding changes to the error message and added comments.

Here is the 2nd version of the patch:

- Includes changes named above
- Fixes a warning reported by cfbot
- Fixes some FIXME's
- The patch includes some simple tests now
- A proper commit message was added

Please note that this is still a draft. Feedback is welcome.

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

> Here is the 2nd version of the patch:
>
> - Includes changes named above
> - Fixes a warning reported by cfbot
> - Fixes some FIXME's
> - The patch includes some simple tests now
> - A proper commit message was added

Here is the rebased version of the patch. Changes compared to v2 are minimal.

> Open questions:
> - Dictionary entries are currently stored as NameData, the same type that is
>   used for enums. Are we OK with the accompanying limitations? Any alternative
>   suggestions?
> - All in all, am I moving the right direction?

I would like to receive a little bit more feedback before investing more time
into this effort. This will allow me, if necessary, to alter the overall design
more easily.

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Jacob Champion
Date:
On Wed, Jun 1, 2022 at 1:44 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> This is a follow-up thread to `RFC: compression dictionaries for JSONB` [1]. I would like to share my current
> progress in order to get early feedback. The patch is currently in a draft state but implements the basic
> functionality. I did my best to account for all the great feedback I previously got from Alvaro and Matthias.

I'm coming up to speed with this set of threads -- the following is
not a complete review by any means, and please let me know if I've
missed some of the history.

> SELECT * FROM pg_dict;
>   oid  | dicttypid | dictentry
> -------+-----------+-----------
>  39476 |     39475 | aaa
>  39477 |     39475 | bbb
> (2 rows)

I saw there was some previous discussion about dictionary size. It
looks like this approach would put all dictionaries into a shared OID
pool. Since I don't know what a "standard" use case is, is there any
risk of OID exhaustion for larger deployments with many dictionaries?
Or is 2**32 so comparatively large that it's not really a serious
concern?

> When `mydict` sees 'aaa' in the document, it replaces it with the corresponding code, in this case, 39476. For more
> details regarding the compression algorithm and the chosen compromises, please see the comments in the patch.

I see the algorithm description, but I'm curious to know whether it's
based on some other existing compression scheme, for the sake of
comparison. It seems like it shares similarities with the Snappy
scheme?

Could you talk more about what the expected ratios and runtime
characteristics are? Best I can see is that compression runtime is
something like O(n * e * log d) where n is the length of the input, e
is the maximum length of a dictionary entry, and d is the number of
entries in the dictionary. Since e and d are constant for a given
static dictionary, how well the dictionary is constructed is
presumably important.

> In pg_type `mydict` has typtype = TYPTYPE_DICT. It works the same way as TYPTYPE_BASE with the only difference that
> the corresponding `<type>_in` (pg_type.typinput) and `<another-type>_<type>` (pg_cast.castfunc) procedures receive
> the dictionary Oid as the `typmod` argument. This way the procedures can distinguish `mydict1` from `mydict2` and
> use the proper compression dictionary.
>
> The approach of giving `typmod` an alternative role is arguably a bit hacky, but it was the least invasive way to
> implement the feature that I found. I'm open to alternative suggestions.

Haven't looked at this closely enough to develop an opinion yet.

> Current limitations (todo):
> - ALTER TYPE is not implemented

That reminds me. How do people expect to generate a "good" dictionary
in practice? Would they somehow get the JSONB representations out of
Postgres and run a training program over the blobs? I see some
reference to training functions in the prior threads but don't see any
breadcrumbs in the code.

> - Alternative compression algorithms. Note that this will not require any further changes in the catalog, only the
> values we write to pg_type and pg_cast will differ.

Could you expand on this? I.e. why would alternative algorithms not
need catalog changes? It seems like the only schemes that could be
used with pg_catalog.pg_dict are those that expect to map a byte
string to a number. Is that general enough to cover other standard
compression algorithms?

> Open questions:
> - Dictionary entries are currently stored as NameData, the same type that is used for enums. Are we OK with the
> accompanying limitations? Any alternative suggestions?

It does feel a little weird to have a hard limit on the entry length,
since that also limits the compression ratio. But it also limits the
compression runtime, so maybe it's a worthwhile tradeoff.

It also seems strange to use a dictionary of C strings to compress
binary data; wouldn't we want to be able to compress zero bytes too?

Hope this helps,
--Jacob



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Jacob,

Many thanks for your feedback!

> I saw there was some previous discussion about dictionary size. It
> looks like this approach would put all dictionaries into a shared OID
> pool. Since I don't know what a "standard" use case is, is there any
> risk of OID exhaustion for larger deployments with many dictionaries?
> Or is 2**32 so comparatively large that it's not really a serious
> concern?

I agree, this is a drawback of the current implementation. To be honest,
I simply followed the example of how ENUMs are implemented. I'm not 100% sure
if we should be worried here (apparently, freed OIDs are reused). I'm OK with
using a separate sequence if someone could second this. This is the first time
I'm altering the catalog so I'm not certain what the best practices are.

> I see the algorithm description, but I'm curious to know whether it's
> based on some other existing compression scheme, for the sake of
> comparison. It seems like it shares similarities with the Snappy
> scheme?
>
> Could you talk more about what the expected ratios and runtime
> characteristics are? Best I can see is that compression runtime is
> something like O(n * e * log d) where n is the length of the input, e
> is the maximum length of a dictionary entry, and d is the number of
> entries in the dictionary. Since e and d are constant for a given
> static dictionary, how well the dictionary is constructed is
> presumably important.

The algorithm is almost identical to the one I used in ZSON extension [1]
except the fact that ZSON uses 16-bit codes. In docs/benchmark.md you will find
approximate ratios to expect, etc. The reasons why this particular algorithm
was chosen are:

1. It was extensively tested in the past and seems to work OK for existing
   ZSON users.
2. It doesn't use any knowledge regarding the data structure and thus can be
   reused for TEXT/XML/etc as-is.
3. Previously we agreed that at some point users will be able to change the
   algorithm (the same way as they can do it for TOAST now) so which algorithm
   will be used in the first implementation is not that important. I simply
   chose the already existing one.

> > Current limitations (todo):
> > - ALTER TYPE is not implemented
>
> That reminds me. How do people expect to generate a "good" dictionary
> in practice? Would they somehow get the JSONB representations out of
> Postgres and run a training program over the blobs? I see some
> reference to training functions in the prior threads but don't see any
> breadcrumbs in the code.

So far we agreed that in the first implementation it will be done manually.
In the future it will be possible to update the dictionaries automatically
during VACUUM. The idea of something similar to zson_learn() procedure, as
I recall, didn't get much support, so we probably will not have it, or at least
it is not a priority.

> > - Alternative compression algorithms. Note that this will not require any
> > further changes in the catalog, only the values we write to pg_type and
> > pg_cast will differ.
>
> Could you expand on this? I.e. why would alternative algorithms not
> need catalog changes? It seems like the only schemes that could be
> used with pg_catalog.pg_dict are those that expect to map a byte
> string to a number. Is that general enough to cover other standard
> compression algorithms?

Sure. When creating a new dictionary pg_type and pg_cast are modified like this:

 =# CREATE TYPE mydict AS DICTIONARY OF JSONB ('abcdef', 'ghijkl');
CREATE TYPE
 =# SELECT * FROM pg_type WHERE typname = 'mydict';
-[ RECORD 1 ]--+---------------
oid            | 16397
typname        | mydict
typnamespace   | 2200
...
typarray       | 16396
typinput       | dictionary_in
typoutput      | dictionary_out
...

=# SELECT c.*, p.proname FROM pg_cast AS c
     LEFT JOIN pg_proc AS p
     ON p.oid = c.castfunc
      WHERE c.castsource = 16397 or c.casttarget = 16397;
-[ RECORD 1 ]-----------------
oid         | 16400
castsource  | 3802
casttarget  | 16397
castfunc    | 9866
castcontext | a
castmethod  | f
proname     | jsonb_dictionary
-[ RECORD 2 ]-----------------
oid         | 16401
castsource  | 16397
casttarget  | 3802
castfunc    | 9867
castcontext | i
castmethod  | f
proname     | dictionary_jsonb
-[ RECORD 3 ]-----------------
oid         | 16402
castsource  | 16397
casttarget  | 17
castfunc    | 9868
castcontext | e
castmethod  | f
proname     | dictionary_bytea

In order to add a new algorithm you simply need to provide alternatives
to dictionary_in / dictionary_out / jsonb_dictionary / dictionary_jsonb and
specify them in the catalog instead. The catalog schema will remain the same.
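
For example, a hypothetical alternative implementation could be wired up through
the usual CREATE FUNCTION / CREATE CAST machinery (the function and library names
below are invented purely for illustration):

```
-- Hypothetical sketch: an alternative compression algorithm registered by
-- pointing the cast at a different function; the catalog schema stays the same.
-- (Assumes the previously registered jsonb -> mydict cast was dropped first.)
CREATE FUNCTION my_algo_jsonb_to_dict(jsonb, integer, boolean) RETURNS mydict
    AS 'my_extension', 'my_algo_jsonb_to_dict'
    LANGUAGE C STRICT IMMUTABLE;

CREATE CAST (jsonb AS mydict)
    WITH FUNCTION my_algo_jsonb_to_dict(jsonb, integer, boolean)
    AS ASSIGNMENT;
```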

> It also seems strange to use a dictionary of C strings to compress
> binary data; wouldn't we want to be able to compress zero bytes too?

That's a good point. Again, here I simply followed the example of the ENUMs
implementation. Since compression dictionaries are intended to be used with
text-like types such as JSONB, (and also JSON, TEXT and XML in the future),
choosing Name type seemed to be a reasonable compromise. Dictionary entries are
most likely going to store JSON keys, common words used in the TEXT, etc.
However, I'm fine with any alternative scheme if somebody experienced with the
PostgreSQL catalog could second this.

[1]: https://github.com/afiskon/zson

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Jacob Champion
Date:
On Thu, Jun 2, 2022 at 6:30 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> > I saw there was some previous discussion about dictionary size. It
> > looks like this approach would put all dictionaries into a shared OID
> > pool. Since I don't know what a "standard" use case is, is there any
> > risk of OID exhaustion for larger deployments with many dictionaries?
> > Or is 2**32 so comparatively large that it's not really a serious
> > concern?
>
> I agree, this is a drawback of the current implementation. To be honest,
> I simply followed the example of how ENUMs are implemented. I'm not 100% sure
> if we should be worried here (apparently, freed OIDs are reused). I'm OK with
> using a separate sequence if someone could second this. This is the first time
> I'm altering the catalog so I'm not certain what the best practices are.

I think reuse should be fine (if a bit slower, but offhand that
doesn't seem like an important bottleneck). Users may be unamused to
find that one large dictionary has prevented the creation of any new
entries in other dictionaries, though. But again, I have no intuition
for the size of a production-grade compression dictionary, and maybe
it's silly to assume that normal use would ever reach the OID limit.

> > I see the algorithm description, but I'm curious to know whether it's
> > based on some other existing compression scheme, for the sake of
> > comparison. It seems like it shares similarities with the Snappy
> > scheme?
> >
> > Could you talk more about what the expected ratios and runtime
> > characteristics are? Best I can see is that compression runtime is
> > something like O(n * e * log d) where n is the length of the input, e
> > is the maximum length of a dictionary entry, and d is the number of
> > entries in the dictionary. Since e and d are constant for a given
> > static dictionary, how well the dictionary is constructed is
> > presumably important.
>
> The algorithm is almost identical to the one I used in ZSON extension [1]
> except the fact that ZSON uses 16-bit codes. In docs/benchmark.md you will find
> approximate ratios to expect, etc.

That's assuming a machine-trained dictionary, though, which isn't part
of the proposal now. Is there a performance/ratio sample for a "best
practice" hand-written dictionary?

> > That reminds me. How do people expect to generate a "good" dictionary
> > in practice? Would they somehow get the JSONB representations out of
> > Postgres and run a training program over the blobs? I see some
> > reference to training functions in the prior threads but don't see any
> > breadcrumbs in the code.
>
> So far we agreed that in the first implementation it will be done manually.
> In the future it will be possible to update the dictionaries automatically
> during VACUUM. The idea of something similar to zson_learn() procedure, as
> I recall, didn't get much support, so we probably will not have it, or at least
> it is not a priority.

Hm... I'm skeptical that a manually-constructed set of compression
dictionaries would be maintainable over time or at scale. But I'm not
the target audience so I will let others weigh in here instead.

> > > - Alternative compression algorithms. Note that this will not require any
> > > further changes in the catalog, only the values we write to pg_type and
> > > pg_cast will differ.
> >
> > Could you expand on this? I.e. why would alternative algorithms not
> > need catalog changes? It seems like the only schemes that could be
> > used with pg_catalog.pg_dict are those that expect to map a byte
> > string to a number. Is that general enough to cover other standard
> > compression algorithms?
>
> Sure. When creating a new dictionary pg_type and pg_cast are modified like this:
>
>  =# CREATE TYPE mydict AS DICTIONARY OF JSONB ('abcdef', 'ghijkl');
> CREATE TYPE
>  =# SELECT * FROM pg_type WHERE typname = 'mydict';
> -[ RECORD 1 ]--+---------------
> oid            | 16397
> typname        | mydict
> typnamespace   | 2200
> ...
> typarray       | 16396
> typinput       | dictionary_in
> typoutput      | dictionary_out
> ...
>
> =# SELECT c.*, p.proname FROM pg_cast AS c
>      LEFT JOIN pg_proc AS p
>      ON p.oid = c.castfunc
>       WHERE c.castsource = 16397 or c.casttarget = 16397;
> -[ RECORD 1 ]-----------------
> oid         | 16400
> castsource  | 3802
> casttarget  | 16397
> castfunc    | 9866
> castcontext | a
> castmethod  | f
> proname     | jsonb_dictionary
> -[ RECORD 2 ]-----------------
> oid         | 16401
> castsource  | 16397
> casttarget  | 3802
> castfunc    | 9867
> castcontext | i
> castmethod  | f
> proname     | dictionary_jsonb
> -[ RECORD 3 ]-----------------
> oid         | 16402
> castsource  | 16397
> casttarget  | 17
> castfunc    | 9868
> castcontext | e
> castmethod  | f
> proname     | dictionary_bytea
>
> In order to add a new algorithm you simply need to provide alternatives
> to dictionary_in / dictionary_out / jsonb_dictionary / dictionary_jsonb and
> specify them in the catalog instead. The catalog schema will remain the same.

The catalog schemas for pg_type and pg_cast would. But would the
current pg_dict schema be generally applicable to other cross-table
compression schemes? It seems narrowly tailored -- which is not a
problem for a proof of concept patch; I'm just not seeing how other
standard compression schemes might make use of an OID-to-NameData map.
My naive understanding is that they have their own dictionary
structures.

(You could of course hack in any general structure you needed by
treating pg_dict like a list of chunks, but that seems wasteful and
slow, especially given the 63-byte chunk limit, and even more likely
to exhaust the shared OID pool. I think LZMA dictionaries can be huge,
as one example.)

> > It also seems strange to use a dictionary of C strings to compress
> > binary data; wouldn't we want to be able to compress zero bytes too?
>
> That's a good point. Again, here I simply followed the example of the ENUMs
> implementation. Since compression dictionaries are intended to be used with
> text-like types such as JSONB, (and also JSON, TEXT and XML in the future),
> choosing Name type seemed to be a reasonable compromise. Dictionary entries are
> most likely going to store JSON keys, common words used in the TEXT, etc.
> However, I'm fine with any alternative scheme if somebody experienced with the
> PostgreSQL catalog could second this.

I think Matthias back in the first thread was hoping for the ability
to compress duplicated JSON objects as well; it seems like that
wouldn't be possible with the current scheme. (Again I have no
intuition for which use cases are must-haves.) I'm wondering if
pg_largeobject would be an alternative catalog to draw inspiration
from... specifically the use of bytea as the stored value, and of a
two-column primary key.
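
For reference, pg_largeobject stores bytea pages keyed by (loid, pageno). A
dictionary catalog along those lines might look roughly like this (a sketch
only, not a concrete proposal):

```
CREATE TABLE pg_dict (
    dicttypid oid   NOT NULL,  -- the dictionary (type) this entry belongs to
    entryid   int4  NOT NULL,  -- per-dictionary entry id
    entry     bytea NOT NULL,  -- arbitrary bytes, zero bytes included
    PRIMARY KEY (dicttypid, entryid)
);
```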

But take all my suggestions with a dash of salt :D I'm new to this space.

Thanks!
--Jacob



Re: [PATCH] Compression dictionaries for JSONB

From
Matthias van de Meent
Date:
On Fri, 13 May 2022 at 10:09, Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi hackers,
>
> > Here is the 2nd version of the patch:
> >
> > - Includes changes named above
> > - Fixes a warning reported by cfbot
> > - Fixes some FIXME's
> > - The patch includes some simple tests now
> > - A proper commit message was added
>
> Here is the rebased version of the patch. Changes compared to v2 are minimal.
>
> > Open questions:
> > - Dictionary entries are currently stored as NameData, the same type that is
> >   used for enums. Are we OK with the accompanying limitations? Any alternative
> >   suggestions?
> > - All in all, am I moving the right direction?
>
> I would like to receive a little bit more feedback before investing more time
> into this effort. This will allow me, if necessary, to alter the overall design
> more easily.

Sorry for the delayed reply. After the last thread, I've put some time
in looking into the "pluggable toaster" patches, which appear to want
to provide related things: compressing typed data using an extensible
API. I think that that API is a better approach to increase the
compression ratio for JSONB.

That does not mean that I think that the basis of this patch is
incorrect, just that the current API (through new entries in the
pg_type and pg_cast catalogs) is not the right direction if/when
we're going to have a pluggable toaster API. The bulk of the patch
should still be usable, but I think that the way it interfaces with
the CREATE TABLE (column ...) APIs would need reworking to build on
top of the APIs of the "pluggable toaster" patches (so, creating
toasters instead of types). I think that would allow for an overall
better user experience and better performance due to decreased need
for fully decompressed type casting.

Kind regards,

Matthias van de Meent.



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Matthias,

> The bulk of the patch
> should still be usable, but I think that the way it interfaces with
> the CREATE TABLE (column ...) APIs would need reworking to build on
> top of the APIs of the "pluggable toaster" patches (so, creating
> toasters instead of types). I think that would allow for an overall
> better user experience and better performance due to decreased need
> for fully decompressed type casting.

Many thanks for the feedback.

The "pluggable TOASTer" patch looks very interesting indeed. I'm
currently trying to make heads or tails of it and trying to figure
out if it can be used as a base for compression dictionaries,
especially for implementing the partial decompression. Hopefully I
will be able to contribute to it and to the dependent patch [1] in the
upcoming CF, at least as a tester/reviewer. Focusing our efforts on
[1] for now seems to be a good strategy.

My current impression of your idea is somewhat mixed at this point though.

Teodor's goal is to allow creating _extensions_ that implement
alternative TOAST strategies, which use alternative compression
algorithms and/or use the knowledge of the binary representation of
the particular type. For sure, this would be a nice thing to have.
However, during the discussion of the "compression dictionaries" RFC
the consensus was reached that the community wants to see it as a
_built_in_ functionality rather than an extension. Otherwise we could
simply add ZSON to /contrib/ as it was originally proposed.

So if we are going to keep "compression dictionaries" a built-in
functionality, putting artificial constraints on its particular
implementation, or adding artificial dependencies between two rather
complicated patches, is arguably a controversial idea. Especially
considering the fact that it was shown that the feature can be
implemented without these dependencies, in a very non-invasive way.

These are just my initial thoughts I would like to share though. I may
change my mind after diving deeper into a "pluggable TOASTer" patch.

I cc:'ed Teodor in case he would like to share his insights on the topic.

[1]: https://commitfest.postgresql.org/38/3479/

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Matthias,

> These are just my initial thoughts I would like to share though. I may
> change my mind after diving deeper into a "pluggable TOASTer" patch.

I familiarized myself with the "pluggable TOASTer" thread and joined
the discussion [1].

I'm afraid so far I failed to understand your suggestion to base
"compression dictionaries" patch on "pluggable TOASTer", considering
the fair amount of push-back it got from the community, not to mention
a somewhat raw state of the patchset. It's true that Teodor and I are
trying to address similar problems. This however doesn't mean that
there should be a dependency between these patches.

Also, I completely agree with Tomas [2]:

> My main point is that we should not be making too many radical
> changes at once - it makes it much harder to actually get anything done.

IMO the patches don't depend on each other but rather complement each
other. The user can switch between different TOAST methods, and the
compression dictionaries can work on top of different TOAST methods.
Although there is also a high-level idea (according to the
presentations) to share common data between different TOASTed values,
similarly to what compression dictionaries do, by looking at the
current feedback and considering the overall complexity and the amount
of open questions (e.g. interaction with different TableAMs, etc), I
seriously doubt that this particular part of "pluggable TOASTer" will
end-up in the core.

[1]: https://postgr.es/m/CAJ7c6TOMPiRs-CZ%3DA9hyzxOyqHhKXxLD8qCF5%2BGJuLjQBzOX4A%40mail.gmail.com
[2]: https://postgr.es/m/9ef14537-b33b-c63a-9938-e2b413db0a4c%40enterprisedb.com

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Simon Riggs
Date:
On Thu, 2 Jun 2022 at 14:30, Aleksander Alekseev
<aleksander@timescale.com> wrote:

> > I saw there was some previous discussion about dictionary size. It
> > looks like this approach would put all dictionaries into a shared OID
> > pool. Since I don't know what a "standard" use case is, is there any
> > risk of OID exhaustion for larger deployments with many dictionaries?
> > Or is 2**32 so comparatively large that it's not really a serious
> > concern?
>
> I agree, this is a drawback of the current implementation. To be honest,
> I simply followed the example of how ENUMs are implemented. I'm not 100% sure
> if we should be worried here (apparently, freed OIDs are reused). I'm OK with
> using a separate sequence if someone could second this. This is the first time
> I'm altering the catalog so I'm not certain what the best practices are.

The goal of this patch is great, thank you for working on this (and ZSON).

The approach chosen has a few downsides that I'm not happy with yet.

* Assigning OIDs for each dictionary entry is not a great idea. I
don't see why you would need to do that; just assign monotonically
ascending keys for each dictionary, as we do for AttrNums.

* There is a limit on SQL statement size, which will effectively limit
the size of dictionaries. The examples are unrealistically small, so
this isn't obvious as a limitation, but it would be in practice. It
would be better to specify a filename, which can be read in when the
DDL executes (a possible syntax is sketched after this list). This can
be put into pg_dump output in a similar way to how COPY data for a
table is handled, so once read in it stays static.

* The dictionaries are only allowed for certain datatypes. This should
not be specifically limited by this patch, i.e. user defined types
should not be rejected.

* Dictionaries have no versioning. Any list of data items changes over
time, so how do we express that? Enums were also invented as static
lists originally, then had to be modified later to accommodate
additions and revisions, so let's think about that now, even if we
don't add all of the commands in one go. Currently we would have to
create a whole new dictionary if even one word changes. Ideally, we
want the dictionary to have a top-level name and then have multiple
versions over time. Let's agree how we are going to do these things, so
we can make sure the design and code allow for those future
enhancements.
i.e. how will we do ALTER TABLE ... UPGRADE DICTIONARY without causing
a table rewrite?

* Does the order of entries in the dictionary allow us to express a
priority? i.e. to allow Huffman coding.
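
For the filename-based point above, a hypothetical syntax, purely as a sketch
(not part of the patch):

```
CREATE TYPE mydict AS DICTIONARY OF jsonb
    FROM '/path/to/dictionary_entries.txt';
```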

Thanks for your efforts - this is a very important patch.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Simon,

Many thanks for your feedback!

I'm going to submit an updated version of the patch in a bit. I just
wanted to reply to some of your questions / comments.

> Dictionaries have no versioning. [...]

> Does the order of entries in the dictionary allow us to express a priority? i.e. to allow Huffman coding. [...]

This is something we discussed in the RFC thread. I got the impression
that a consensus was reached:

1. To simply use 32-bit codes in the compressed documents, instead of
16-bit ones as it was done in ZSON;
2. Not to use any sort of variable-length coding;
3. Not to use dictionary versions. New codes can be added to the
existing dictionaries by executing ALTER TYPE mydict ADD ENTRY, as
sketched below. (This may also answer your comment regarding the limit
on SQL statement size.)
4. The compression scheme can be altered in the future if needed.
Every compressed document stores algorithm_version (1 byte).
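
A sketch of the ALTER TYPE syntax mentioned in item 3 (the exact grammar is
hypothetical at this point):

```
-- Hypothetical: extending an existing dictionary without recreating it
ALTER TYPE mydict ADD ENTRY 'ccc';
```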

Does this plan of action sound OK to you? At this point it is not too
difficult to make design changes.

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Matthias van de Meent
Date:
Hi Alexander,

On Fri, 17 Jun 2022 at 17:04, Aleksander Alekseev
<aleksander@timescale.com> wrote:
>> These are just my initial thoughts I would like to share though. I may
>> change my mind after diving deeper into a "pluggable TOASTer" patch.
>
> I familiarized myself with the "pluggable TOASTer" thread and joined
> the discussion [1].
>
> I'm afraid so far I failed to understand your suggestion to base
> "compression dictionaries" patch on "pluggable TOASTer", considering
> the fair amount of push-back it got from the community, not to mention
> a somewhat raw state of the patchset. It's true that Teodor and I are
> trying to address similar problems. This however doesn't mean that
> there should be a dependency between these patches.

The reason I think this is better implemented as a pluggable toaster
is because casts are necessarily opaque and require O(sizeofdata)
copies or processing. The toaster infrastructure that is proposed in
[0] seems to improve on the O(sizeofdata) requirement for toast, but
that will not work with casts.

> Also, I completely agree with Tomas [2]:
>
>> My main point is that we should not be making too many radical
>> changes at once - it makes it much harder to actually get anything done.
>
> IMO the patches don't depend on each other but rather complement each
> other. The user can switch between different TOAST methods, and the
> compression dictionaries can work on top of different TOAST methods.

I don't think that is possible (or at least, not as performant). To
treat type X' as type X and use it as a stored medium instead, you
must have either the whole binary representation of X, or have access
to the internals of type X. I find it difficult to believe that casts
can be done without a full detoast (or otherwise without deep
knowledge about internal structure of the data type such as 'type A is
binary compatible with type X'), and as such I think this feature
'compression dictionaries' is competing with the 'pluggable toaster'
feature, if the one is used on top of the other. That is, the
dictionary is still created like in the proposed patches (though
preferably without the 64-byte NAMELEN limit), but the usage will be
through "TOASTER my_dict_enabled_toaster".

Additionally, I don't think we've ever accepted two different
implementations of the same concept, at least not without first having
good arguments why both competing implementations have obvious
benefits over the other, and both implementations being incompatible.

> Although there is also a high-level idea (according to the
> presentations) to share common data between different TOASTed values,
> similarly to what compression dictionaries do, by looking at the
> current feedback and considering the overall complexity and the amount
> of open questions (e.g. interaction with different TableAMs, etc), I
> seriously doubt that this particular part of "pluggable TOASTer" will
> end-up in the core.

Yes, and that's why I think that this is where this dictionary
infrastructure could provide value, as an alternative or extension to
the proposed jsonb toaster in the 'pluggable toaster' thread.

Kind regards,

Matthias van de Meent



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Matthias,

> > Although there is also a high-level idea (according to the
> > presentations) to share common data between different TOASTed values,
> > similarly to what compression dictionaries do, by looking at the
> > current feedback and considering the overall complexity and the amount
> > of open questions (e.g. interaction with different TableAMs, etc), I
> > seriously doubt that this particular part of "pluggable TOASTer" will
> > end-up in the core.
>
> Yes, and that's why I think that this is where this dictionary
> infrastructure could provide value, as an alternative or extension to
> the proposed jsonb toaster in the 'pluggable toaster' thread.

OK, I see your point now. And I think this is a very good point.
Basing "Compression dictionaries" on the API provided by "pluggable
TOASTer" can also be less hacky than what I'm currently doing with
`typmod` argument. I'm going to switch the implementation at some
point, unless anyone will object to the idea.

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

> OK, I see your point now. And I think this is a very good point.
> Basing "Compression dictionaries" on the API provided by "pluggable
> TOASTer" can also be less hacky than what I'm currently doing with
> `typmod` argument. I'm going to switch the implementation at some
> point, unless anyone will object to the idea.

Here is the rebased patch. I reworked the memory management a bit but
other than that there are no new changes.

So far we seem to have a consensus to:

1. Use bytea instead of NameData to store dictionary entries;

2. Assign monotonically ascending IDs to the entries instead of using
Oids, as it is done with pg_class.relnatts. In order to do this we
should either add a corresponding column to pg_type, or add a new
catalog table, e.g. pg_dict_meta (one possible shape is sketched below).
Personally I don't have a strong opinion on what is better. Thoughts?
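
For illustration, one possible shape of such a pg_dict_meta catalog (a sketch
only, not part of the patch; column names are provisional):

```
CREATE TABLE pg_dict_meta (
    dicttypid  oid  PRIMARY KEY,  -- the dictionary type
    dictlastid int4 NOT NULL      -- last entry id assigned in this dictionary,
                                  -- similar in spirit to pg_class.relnatts
);
```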

Both changes should be straightforward to implement and also are a
good exercise to newcomers.

I invite anyone interested to join this effort as a co-author! (since,
honestly, rewriting the same feature over and over again alone is
quite boring :D).

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

> I invite anyone interested to join this effort as a co-author!

Here is v5. Same as v4 but with a fixed compiler warning (thanks,
cfbot). Sorry for the noise.

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Nikita Malakhov
Date:
Hi hackers!

Aleksander, please point me in the right direction if it was mentioned before, I have a few questions:

1) It is not clear for me, how do you see the life cycle of such a dictionary? If it is meant to keep growing without 
cleaning up/rebuilding it could affect performance in an undesirable way, along with keeping unused data without 
any means to get rid of them.
Also, I agree with Simon Riggs, using OIDs from the general pool for dictionary entries is a bad idea.

2) From (1) follows another question - I haven't seen any means for getting rid of unused keys (or any other means 
for dictionary cleanup). How could it be done?

3) Is the possible scenario legal - by some means a dictionary does not contain some keys for entries? What happens then?

4) If one dictionary is used by several tables - I see future issues in concurrent dictionary updates. This will for sure
affect performance and can cause unpredictable behavior for queries.

If you have any questions on Pluggable TOAST don't hesitate to ask me and on JSONB Toaster you can ask Nikita Glukhov. 

Thank you!

Regards,
Nikita Malakhov
Postgres Professional 

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Nikita,

> Aleksander, please point me in the right direction if it was mentioned before, I have a few questions:

Thanks for your feedback. These are good questions indeed.

> 1) It is not clear for me, how do you see the life cycle of such a dictionary? If it is meant to keep growing without
> cleaning up/rebuilding it could affect performance in an undesirable way, along with keeping unused data without
> any means to get rid of them.
> 2) From (1) follows another question - I haven't seen any means for getting rid of unused keys (or any other means
> for dictionary cleanup). How could it be done?

Good point. This was not a problem for ZSON since the dictionary size
was limited to 2**16 entries, the dictionary was immutable, and the
dictionaries had versions. For compression dictionaries we removed the
2**16 entries limit and also decided to get rid of versions. The idea
was that you can simply continue adding new entries, but no one
considered that the memory required to decompress the documents will
then grow indefinitely.

Maybe we should return to the idea of limited dictionary size and
versions. Objections?

> 4) If one dictionary is used by several tables - I see future issues in concurrent dictionary updates. This will for sure
> affect performance and can cause unpredictable behavior for queries.

You are right. Another reason to return to the idea of dictionary versions.

> Also, I agree with Simon Riggs, using OIDs from the general pool for dictionary entries is a bad idea.

Yep, we agreed to stop using OIDs for this; however, this has not been
changed in the patch yet. Please don't hesitate to join the
effort if you want to. I wouldn't mind taking a short break from this
patch.

> 3) Is the possible scenario legal - by some means a dictionary does not contain some keys for entries? What happens then?

No, we should either forbid removing dictionary entries or check that
all the existing documents are not using the entries being removed.

> If you have any questions on Pluggable TOAST don't hesitate to ask me and on JSONB Toaster you can ask Nikita Glukhov.

Will do! Thanks for working on this and I'm looking forward to the
next version of the patch for the next round of review.

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Nikita Malakhov
Date:
Hi hackers!

Aleksander, I've carefully gone over discussion and still have some questions to ask -

1) Is there any means of measuring overhead of dictionaries over vanilla implementation? IMO it is a must because
JSON is a widely used functionality. Also, as it was mentioned before, to check the dictionary value must be detoasted;

2) Storing dictionaries in one table. As I wrote before, this will surely lead to locks and waits while inserting and updating
dictionaries, and could cause serious performance issues. And vacuuming this table will lead to locks for all tables using
dictionaries until vacuum is complete;

3) JSON documents in production environments could be very complex and use thousands of keys, so creating a dictionary
directly in an SQL statement is not a very good approach; it's another reason to have means for creating dictionaries as
separate tables and/or passing them as files or so;

4) Suggested mechanics, if put on top of the TOAST, could not benefit from knowledge of internal JSON structure, which
is seen as an important drawback in spite of extensive research work done on working with JSON schema (storing, validating,
etc.), and also it cannot recognize and help to compress duplicated parts of a JSON document;

5) A small test issue: what if a dictionary-compressed JSON value has a key which is equal to an OID used in the dictionary for some other key?

In Pluggable TOAST we suggest that as an improvement compression should be put inside the Toaster as an option, 
thus the Toaster could have maximum benefits from knowledge of data internal structure (and in future use JSON Schema).
For use in a special Toaster for the JSON datatype, compression dictionaries seem to be a very valuable addition, but now I
have to agree that this feature in its current state is competing with Pluggable TOAST.

Thank you!

Regards,
Nikita Malakhov
Postgres Professional 



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Nikita,

Thanks for your feedback!

> Aleksander, I've carefully gone over discussion and still have some questions to ask -
>
> 1) Is there any means of measuring overhead of dictionaries over vanilla implementation? IMO it is a must because
> JSON is a widely used functionality. Also, as it was mentioned before, to check the dictionary value must be
> detoasted;

Not sure what overhead you have in mind. The patch doesn't affect the
vanilla JSONB implementation.

> 2) Storing dictionaries in one table. As I wrote before, this will surely lead to locks and waits while inserting and
> updating dictionaries, and could cause serious performance issues. And vacuuming this table will lead to locks for all
> tables using
> dictionaries until vacuum is complete;

I believe this is true to some degree. But doesn't the same generally
apply to the rest of catalog tables?

I'm not that concerned about inserting/updating since this is a rare
operation. Vacuuming shouldn't be such a problem unless the user
creates/deletes dictionaries all the time.

Am I missing something?

> 3) JSON documents in production environments could be very complex and use thousands of keys, so creating a dictionary
> directly in an SQL statement is not a very good approach; it's another reason to have means for creating dictionaries
> as separate tables and/or passing them as files or so;

Yes, it was proposed to update dictionaries automatically e.g. during
the VACUUM of the table that contains compressed documents. This is
simply out of scope of this particular patch. It was argued that the
manual update should be supported too, which is implemented in this
patch.

> 4) Suggested mechanics, if put on top of the TOAST, could not benefit from knowledge of internal JSON structure,
> which is seen as an important drawback in spite of extensive research work done on working with JSON schema (storing,
> validating, etc.), and also it cannot recognize and help to compress duplicated parts of a JSON document;

Could you please elaborate on this a bit and/or maybe give an example? ...

> In Pluggable TOAST we suggest that as an improvement compression should be put inside the Toaster as an option,
> thus the Toaster could have maximum benefits from knowledge of data internal structure (and in future use JSON
> Schema).

... The current implementation doesn't use knowledge of the JSONB format,
that's true. This is because previously we agreed there is no "one
size fits all" compression method, thus several are going to be
supported eventually. The current algorithm was chosen merely as the
one that is going to work well enough for any data type, not just
JSONB. Nothing prevents an alternative compression method from using
the knowledge of JSONB structure.

As, I believe, Matthias pointed out above, only partial decompression
would be a challenge. This is indeed something that would be better to
implement somewhere closer to the TOAST level. Other than that I'm not
sure what you mean.

> 5) A small test issue: what if a dictionary-compressed JSON value has a key which is equal to an OID used in the
> dictionary for some other key?

Again, I'm having difficulties understanding the case you are
describing. Could you give a specific example?

> For use in a special Toaster for the JSON datatype, compression dictionaries seem to be a very valuable addition, but
> now I have to agree that this feature in its current state is competing with Pluggable TOAST.

I disagree with the word "competing" here. Again, Matthias had a very
good point about this above.

In short, pluggable TOAST is a low-level internal mechanism, but it
doesn't provide a good interface for the end user and has several open
issues. The most important one IMO is how it is supposed to work with
pluggable AMs in the general case. "Compression dictionaries" have a
good user interface, and the implementation is not that important. The
current implementation uses casts, as the only option available at the
moment. But nothing prevents it from using Pluggable TOAST if this
will produce cleaner code (I believe it will) and allow
delivering partial decompression (this is yet to be figured out).

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Simon Riggs
Date:
On Sun, 17 Jul 2022 at 19:15, Nikita Malakhov <hukutoc@gmail.com> wrote:

> we suggest that as an improvement compression should be put inside the Toaster as an option,
> thus the Toaster could have maximum benefits from knowledge of data internal structure (and in future use JSON
> Schema).

Very much agreed.

> For use in a special Toaster for the JSON datatype, compression dictionaries seem to be a very valuable addition, but
> now I have to agree that this feature in its current state is competing with Pluggable TOAST.

But I don't understand this.

Why does storing a compression dictionary in the catalog prevent that
dictionary from being used within the toaster?

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: [PATCH] Compression dictionaries for JSONB

From
Matthias van de Meent
Date:
On Wed, 27 Jul 2022 at 09:36, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> On Sun, 17 Jul 2022 at 19:15, Nikita Malakhov <hukutoc@gmail.com> wrote:
>
> > For use in a special Toaster for the JSON datatype, compression dictionaries seem to be a very valuable addition,
> > but now I have to agree that this feature in its current state is competing with Pluggable TOAST.
>
> But I don't understand this.
>
> Why does storing a compression dictionary in the catalog prevent that
> dictionary from being used within the toaster?

The point is not that compression dictionaries in the catalog are bad
- I think it makes a lot of sense - but that the typecast-based usage
of those dictionaries in user tables (like the UI provided by zson)
effectively competes with the toaster: It tries to store the data in a
more compressed manner than the toaster currently can because it has
additional knowledge about the values being toasted.

The main difference between casting and toasting, however, is that
casting is fairly expensive because it has a significantly higher memory
overhead: both the fully decompressed and the compressed values are
stored in memory at the same time at some point when you cast a value,
while only the decompressed value is stored in full in memory when
(de)toasting.

And, considering that there is an open proposal for extending the
toaster mechanism, I think that it is not specifically efficient to
work with the relatively expensive typecast-based infrastructure if
this dictionary compression can instead be added using the proposed
extensible toasting mechanism at relatively low overhead.


Kind regards,

Matthias van de Meent



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

> So far we seem to have a consensus to:
>
> 1. Use bytea instead of NameData to store dictionary entries;
>
> 2. Assign monotonically ascending IDs to the entries instead of using
> Oids, as it is done with pg_class.relnatts. In order to do this we
> should either add a corresponding column to pg_type, or add a new
> catalog table, e.g. pg_dict_meta. Personally I don't have a strong
> opinion on what is better. Thoughts?
>
> Both changes should be straightforward to implement and also are a
> good exercise to newcomers.
>
> I invite anyone interested to join this effort as a co-author! (since,
> honestly, rewriting the same feature over and over again alone is
> quite boring :D).

cfbot complained that v5 doesn't apply anymore. Here is the rebased
version of the patch.

> Good point. This was not a problem for ZSON since the dictionary size
> was limited to 2**16 entries, the dictionary was immutable, and the
> dictionaries had versions. For compression dictionaries we removed the
> 2**16 entries limit and also decided to get rid of versions. The idea
> was that you can simply continue adding new entries, but no one
> considered that the memory required to decompress the documents will
> then grow indefinitely.
>
> Maybe we should return to the idea of limited dictionary size and
> versions. Objections?
> [ ...]
> You are right. Another reason to return to the idea of dictionary versions.

Since no one objected so far and/or proposed a better idea I assume
this can be added to the list of TODOs as well.

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Nikita Malakhov
Date:
Hi hackers!

I've got a partly question, partly proposal for the future development of this
feature:
What if we use the pg_dict table not to store dictionaries themselves but to store
dictionary metadata, with the actual dictionaries stored in separate tables, as is done
with TOAST tables (i.e. pg_dict.<dictionary 1 entry> --> pg_dict_16385 table)?
This way we could kill several birds with one stone: we deal with concurrent
dictionary updates (which looks like a very serious issue for now), since dictionaries
would not affect each other or overall DB performance while in use; we get around the
SQL statement size restriction; we could effectively deal with versions of dictionary
entries and even dictionary versions, as well as the dictionary size restriction; we
could use it for duplicated JSON parts; and we could even provide an API to work with
dictionaries and dictionary tables, which later could be usable for working with JSON
schemas as well (maybe with some extension).

Overall structure could look like this:
pg_dict
   |
   |---- dictionary 1 meta
   |           |--name
   |           |--size
   |           |--etc
   |           |--dictionary table name (i.e. pg_dict_16385)
   |                  |
   |                  |----> pg_dict_16385
   |
   |---- dictionary 2 meta
   |           |--name
   |           |--size
   |           |--etc
   |           |--dictionary table name (i.e. pg_dict_16386)
   |                  |
   |                  |----> pg_dict_16386
  ...

where dictionary table could look like
pg_dict_16385
   |
   |---- key 1
   |        |-value 
   |
   |---- key 2
   |        |-value 
  ...

And with a special DICT API we would have means to access, cache, store our
dictionaries in a uniform way from different levels. In this implementation it also
looks as a very valuable addition for our JSONb Toaster.

JSON schema processing is a very promising feature and we have to keep up
with major competitors like Oracle which are already working on it.

On Mon, Aug 1, 2022 at 2:25 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
Hi hackers,

> So far we seem to have a consensus to:
>
> 1. Use bytea instead of NameData to store dictionary entries;
>
> 2. Assign monotonically ascending IDs to the entries instead of using
> Oids, as it is done with pg_class.relnatts. In order to do this we
> should either add a corresponding column to pg_type, or add a new
> catalog table, e.g. pg_dict_meta. Personally I don't have a strong
> opinion on what is better. Thoughts?
>
> Both changes should be straightforward to implement and also are a
> good exercise to newcomers.
>
> I invite anyone interested to join this effort as a co-author! (since,
> honestly, rewriting the same feature over and over again alone is
> quite boring :D).

cfbot complained that v5 doesn't apply anymore. Here is the rebased
version of the patch.

> Good point. This was not a problem for ZSON since the dictionary size
> was limited to 2**16 entries, the dictionary was immutable, and the
> dictionaries had versions. For compression dictionaries we removed the
> 2**16 entries limit and also decided to get rid of versions. The idea
> was that you can simply continue adding new entries, but no one
> thought about the fact that this will consume the memory required to
> decompress the document indefinitely.
>
> Maybe we should return to the idea of limited dictionary size and
> versions. Objections?
> [ ...]
> You are right. Another reason to return to the idea of dictionary versions.

Since no one objected so far and/or proposed a better idea I assume
this can be added to the list of TODOs as well.

--
Best regards,
Aleksander Alekseev


--
Regards,
Nikita Malakhov

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

Here is the rebased version of the patch.

> I invite anyone interested to join this effort as a co-author! (since,
> honestly, rewriting the same feature over and over again alone is
> quite boring :D).

> Overall structure could look like this:
> pg_dict
>    |
>    |---- dictionary 1 meta
>    |           |--name
>    |           |--size
>    |           |--etc
>    |           |--dictionary table name (i.e. pg_dict_16385)
>    |                  |
>    |                  |----> pg_dict_16385
>    |
>    |---- dictionary 2 meta
>    |           |--name
>    |           |--size
>    |           |--etc
>    |           |--dictionary table name (i.e. pg_dict_16386)
>    |                  |
>    |                  |----> pg_dict_16386

For the record, Nikita and I agreed offlist that Nikita will join this
effort as a co-author in order to implement the suggested improvements
(and perhaps some improvements that were not suggested yet). Meanwhile
I'm going to keep the current version of the patch up to date with the
`master` branch.

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

> For the record, Nikita and I agreed offlist that Nikita will join this
> effort as a co-author in order to implement the suggested improvements
> (and perhaps some improvements that were not suggested yet). Meanwhile
> I'm going to keep the current version of the patch up to date with the
> `master` branch.

Here is an updated patch with added Meson support.

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

> For the record, Nikita and I agreed offlist that Nikita will join this
> effort as a co-author in order to implement the suggested improvements
> (and perhaps some improvements that were not suggested yet). Meanwhile
> I'm going to keep the current version of the patch up to date with the
> `master` branch.

8272749e added a few more arguments to CastCreate(). Here is the rebased patch.

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

> 8272749e added a few more arguments to CastCreate(). Here is the rebased patch.

After merging afbfc029 [1] the patch needed a rebase. PFA v10.

The patch is still in a PoC state and this is exactly why comments and
suggestions from the community are most welcome! Particularly I would
like to know:

1. Would you call it a wanted feature considering the existence of
Pluggable TOASTer patchset which (besides other things) tries to
introduce type-aware TOASTers for EXTERNAL attributes? I know what
Simon's [2] and Nikita's latest answers were, and I know my personal
opinion on this [3][4], but I would like to hear from the rest of the
community.

2. How should we make sure a dictionary will not consume all the
available memory? Limiting the amount of dictionary entries to pow(2,
16) and having dictionary versions seems to work OK for ZSON. However
it was pointed out that this may be an unwanted limitation for the
in-core implementation.

[1]:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c727f511;hp=afbfc02983f86c4d71825efa6befd547fe81a926
[2]: https://www.postgresql.org/message-id/CANbhV-HpCF852WcZuU0wyh1jMU4p6XLbV6rCRkZpnpeKQ9OenQ%40mail.gmail.com
[3]: https://www.postgresql.org/message-id/CAJ7c6TN-N3%3DPSykmOjmW1EAf9YyyHFDHEznX-5VORsWUvVN-5w%40mail.gmail.com
[4]: https://www.postgresql.org/message-id/CAJ7c6TO2XTTk3cu5w6ePHfhYQkoNpw7u1jeqHf%3DGwn%2BoWci8eA%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev

Attachment

Re: [PATCH] Compression dictionaries for JSONB

From
Alvaro Herrera
Date:
This patch came up at the developer meeting in Brussels yesterday.
https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2023_Developer_Meeting#v16_Patch_Triage

First, as far as I can tell, there is a large overlap between this patch
and "Pluggable toaster" patch.  The approaches are completely different,
but they seem to be trying to fix the same problem: the fact that the
default TOAST stuff isn't good enough for JSONB.  I think before asking
developers of both patches to rebase over and over, we should take a
step back and decide which one we dislike the less, and how to fix that
one into a shape that we no longer dislike.

(Don't get me wrong.  I'm all for having better JSONB compression.
However, for one thing, both patches require action from the user to set
up a compression mechanism by hand.  Perhaps it would be even better if
the system determines that a JSONB column uses a different compression
implementation, without the user doing anything explicitly; or maybe we
want to give the user *some* agency for specific columns if they want,
but not force them into it for every single jsonb column.)

Now, I don't think either of these patches can get to a committable
shape in time for v16 -- even assuming we had an agreed design, which
AFAICS we don't.  But I encourage people to continue discussion and try
to find consensus.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)



Re: [PATCH] Compression dictionaries for JSONB

From
Pavel Borisov
Date:
On Fri, 3 Feb 2023 at 14:04, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> This patch came up at the developer meeting in Brussels yesterday.
> https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2023_Developer_Meeting#v16_Patch_Triage
>
> First, as far as I can tell, there is a large overlap between this patch
> and "Pluggable toaster" patch.  The approaches are completely different,
> but they seem to be trying to fix the same problem: the fact that the
> default TOAST stuff isn't good enough for JSONB.  I think before asking
> developers of both patches to rebase over and over, we should take a
> step back and decide which one we dislike the less, and how to fix that
> one into a shape that we no longer dislike.
>
> (Don't get me wrong.  I'm all for having better JSONB compression.
> However, for one thing, both patches require action from the user to set
> up a compression mechanism by hand.  Perhaps it would be even better if
> the system determines that a JSONB column uses a different compression
> implementation, without the user doing anything explicitly; or maybe we
> want to give the user *some* agency for specific columns if they want,
> but not force them into it for every single jsonb column.)
>
> Now, I don't think either of these patches can get to a committable
> shape in time for v16 -- even assuming we had an agreed design, which
> AFAICS we don't.  But I encourage people to continue discussion and try
> to find consensus.
>
Hi, Alvaro!

I'd like to give my +1 in favor of implementing a pluggable toaster
interface first. Then we can work on custom toast engines for
different scenarios, not limited to JSON(b).

For example, I find it useful to decrease WAL overhead on the
replication of TOAST updates. It is quite a pain now that we need to
rewrite all toast chunks at any TOAST update. Also, it could be good
for implementing undo access methods etc., etc. Now, these kinds of
activities in extensions face the fact that core has only one TOAST
which is quite inefficient in many scenarios.

So overall I value the extensibility part of this activity as the most
important one and will be happy to see it completed first.

Kind regards,
Pavel Borisov,
Supabase.



Re: [PATCH] Compression dictionaries for JSONB

From
Andres Freund
Date:
Hi,

On 2023-02-03 14:39:31 +0400, Pavel Borisov wrote:
> On Fri, 3 Feb 2023 at 14:04, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > This patch came up at the developer meeting in Brussels yesterday.
> > https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2023_Developer_Meeting#v16_Patch_Triage
> >
> > First, as far as I can tell, there is a large overlap between this patch
> > and "Pluggable toaster" patch.  The approaches are completely different,
> > but they seem to be trying to fix the same problem: the fact that the
> > default TOAST stuff isn't good enough for JSONB.  I think before asking
> > developers of both patches to rebase over and over, we should take a
> > step back and decide which one we dislike the less, and how to fix that
> > one into a shape that we no longer dislike.
> >
> > (Don't get me wrong.  I'm all for having better JSONB compression.
> > However, for one thing, both patches require action from the user to set
> > up a compression mechanism by hand.  Perhaps it would be even better if
> > the system determines that a JSONB column uses a different compression
> > implementation, without the user doing anything explicitly; or maybe we
> > want to give the user *some* agency for specific columns if they want,
> > but not force them into it for every single jsonb column.)
> >
> > Now, I don't think either of these patches can get to a committable
> > shape in time for v16 -- even assuming we had an agreed design, which
> > AFAICS we don't.  But I encourage people to continue discussion and try
> > to find consensus.
> >
> Hi, Alvaro!
>
> I'd like to give my +1 in favor of implementing a pluggable toaster
> interface first. Then we can work on custom toast engines for
> different scenarios, not limited to JSON(b).

I don't think the approaches in either of these threads is
promising. They add a lot of complexity, require implementation effort
for each type, manual work by the administrator for column, etc.


One of the major justifications for work in this area is the cross-row
redundancy for types like jsonb. I think there's ways to improve that
across types, instead of requiring per-type work. We could e.g. use
compression dictionaries to achieve much higher compression
rates. Training of the dictionairy could even happen automatically by
analyze, if we wanted to.  It's unlikely to get you everything a very
sophisticated per-type compression is going to give you, but it's going
to be a lot better than today, and it's going to work across types.


> For example, I find it useful to decrease WAL overhead on the
> replication of TOAST updates. It is quite a pain now that we need to
> rewrite all toast chunks at any TOAST update. Also, it could be good
> for implementing undo access methods etc., etc. Now, these kinds of
> activities in extensions face the fact that core has only one TOAST
> which is quite inefficient in many scenarios.
>
> So overall I value the extensibility part of this activity as the most
> important one and will be happy to see it completed first.

I think the complexity will just make improving toast in-core harder,
without much benefit.


Regards,

Andres



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi,

> I don't think the approaches in either of these threads is
> promising. They add a lot of complexity, require implementation effort
> for each type, manual work by the administrator for column, etc.

I would like to point out that compression dictionaries don't require
per-type work.

Current implementation is artificially limited to JSONB because it's a
PoC. I was hoping to get more feedback from the community before
proceeding further. Internally it uses type-agnostic compression and
doesn't care whether it compresses JSON(B), XML, TEXT, BYTEA or
arrays. This choice was explicitly done in order to support types
other than JSONB.

> One of the major justifications for work in this area is the cross-row
> redundancy for types like jsonb. I think there's ways to improve that
> across types, instead of requiring per-type work.

To be fair, there are advantages in using type-aware compression. The
compression algorithm can be more efficient than a general one and in
theory one can implement lazy decompression, e.g. the one that
decompresses only the accessed fields of a JSONB document.

I agree though that particularly for PostgreSQL this is not
necessarily the right path, especially considering the accompanying
complexity.

If the user cares about the disk space consumption why storing JSONB
in a relational DBMS in the first place? We already have a great
solution for compacting the data, it was invented in the 70s and is
called normalization.

Since PostgreSQL is not a specified document-oriented DBMS I think we
better focus our (far from being infinite) resources on something more
people would benefit from: AIO/DIO [1] or perhaps getting rid of
freezing [2], to name a few examples.

> [...]
> step back and decide which one we dislike the less, and how to fix that
> one into a shape that we no longer dislike.

For the sake of completeness, doing neither type-aware TOASTing nor
compression dictionaries and leaving this area to the extension
authors (e.g. ZSON) is also a possible choice, for the same reasons
named above. However having a built-in type-agnostic dictionary
compression IMO is a too attractive idea to completely ignore it.
Especially considering the fact that the implementation was proven to
be fairly simple and there was even no need to rebase the patch since
November :)

I know that there were concerns [3] regarding the typmod hack. I don't
like it either and 100% open to suggestions here. This is merely a
current implementation detail used in a PoC, not a fundamental design
decision.

[1]: https://postgr.es/m/20210223100344.llw5an2aklengrmn%40alap3.anarazel.de
[2]: https://postgr.es/m/CAJ7c6TOk1mx4KfF0AHkvXi%2BpkdjFqwTwvRE-JmdczZMAYnRQ0w%40mail.gmail.com
[3]: https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2023_Developer_Meeting#v16_Patch_Triage

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Andres Freund
Date:
Hi,

On 2023-02-05 13:41:17 +0300, Aleksander Alekseev wrote:
> > I don't think the approaches in either of these threads is
> > promising. They add a lot of complexity, require implementation effort
> > for each type, manual work by the administrator for column, etc.
> 
> I would like to point out that compression dictionaries don't require
> per-type work.
> 
> Current implementation is artificially limited to JSONB because it's a
> PoC. I was hoping to get more feedback from the community before
> proceeding further. Internally it uses type-agnostic compression and
> doesn't care whether it compresses JSON(B), XML, TEXT, BYTEA or
> arrays. This choice was explicitly done in order to support types
> other than JSONB.

I don't think we'd want much of the infrastructure introduced in the
patch for type agnostic cross-row compression. A dedicated "dictionary"
type as a wrapper around other types IMO is the wrong direction. This
should be a relation-level optimization option, possibly automatic, not
something visible to every user of the table.

I assume that manually specifying dictionary entries is a consequence of
the prototype state?  I don't think this is something humans are very
good at, just analyzing the data to see what's useful to dictionarize
seems more promising.

I also suspect that we'd have to spend a lot of effort to make
compression/decompression fast if we want to handle dictionaries
ourselves, rather than using the dictionary support in libraries like
lz4/zstd.


> > One of the major justifications for work in this area is the cross-row
> > redundancy for types like jsonb. I think there's ways to improve that
> > across types, instead of requiring per-type work.
> 
> To be fair, there are advantages in using type-aware compression. The
> compression algorithm can be more efficient than a general one and in
> theory one can implement lazy decompression, e.g. the one that
> decompresses only the accessed fields of a JSONB document.

> I agree though that particularly for PostgreSQL this is not
> necessarily the right path, especially considering the accompanying
> complexity.

I agree with both those paragraphs.


> above. However having a built-in type-agnostic dictionary compression
> IMO is a too attractive idea to completely ignore it.  Especially
> considering the fact that the implementation was proven to be fairly
> simple and there was even no need to rebase the patch since November
> :)

I don't think a prototype-y patch not needing a rebase two months is a
good measure of complexity :)

Greetings,

Andres Freund



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi,

> I assume that manually specifying dictionary entries is a consequence of
> the prototype state?  I don't think this is something humans are very
> good at, just analyzing the data to see what's useful to dictionarize
> seems more promising.

No, humans are not good at it. The idea was to automate the process
and build the dictionaries automatically e.g. during the VACUUM.

> I don't think we'd want much of the infrastructure introduced in the
> patch for type agnostic cross-row compression. A dedicated "dictionary"
> type as a wrapper around other types IMO is the wrong direction. This
> should be a relation-level optimization option, possibly automatic, not
> something visible to every user of the table.

So to clarify, are we talking about tuple-level compression? Or
perhaps page-level compression?

Implementing page-level compression should be *relatively*
straightforward. As an example this was previously done for InnoDB.
Basically InnoDB compresses the entire page, then rounds the result to
1K, 2K, 4K, 8K, etc and stores the result in a corresponding fork
("fork" in PG terminology), similarly to how a SLAB allocator works.
Additionally a page_id -> fork_id map should be maintained, probably
in yet another fork, similarly to visibility map. A compressed page
can change the fork after being modified since this may change the
size of a compressed page. The buffer manager is unaffected and deals
only with uncompressed pages. (I'm not an expert in InnoDB and this is
my very rough understanding of how its compression works.)

I believe this can be implemented as a TAM. Whether this would be a
"dictionary" compression is debatable but it gives the users similar
benefits, give or take. The advantage is that users shouldn't define
any dictionaries manually, nor should DBMS during VACUUM or somehow
else.

> I also suspect that we'd have to spend a lot of effort to make
> compression/decompression fast if we want to handle dictionaries
> ourselves, rather than using the dictionary support in libraries like
> lz4/zstd.

That's a reasonable concern, can't argue with that.

> I don't think a prototype-y patch not needing a rebase two months is a
> good measure of complexity :)

It's worth noting that I also invested quite some time into reviewing
type-aware TOASTers :) I just choose to keep my personal opinion about
the complexity of that patch to myself this time since obviously I'm a
bit biased. However if you are curious it's all in the corresponding
thread.

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Andres Freund
Date:
Hi,

On 2023-02-05 20:05:51 +0300, Aleksander Alekseev wrote:
> > I don't think we'd want much of the infrastructure introduced in the
> > patch for type agnostic cross-row compression. A dedicated "dictionary"
> > type as a wrapper around other types IMO is the wrong direction. This
> > should be a relation-level optimization option, possibly automatic, not
> > something visible to every user of the table.
>
> So to clarify, are we talking about tuple-level compression? Or
> perhaps page-level compression?

Tuple level.

What I think we should do is basically this:

When we compress datums, we know the table being targeted. If there's a
pg_attribute parameter indicating we should, we can pass a prebuilt
dictionary to the LZ4/zstd [de]compression functions.

It's possible we'd need to use a somewhat extended header for such
compressed datums, to reference the dictionary "id" to be used when
decompressing, if the compression algorithms don't already have that in
one of their headers, but that's entirely doable.


A quick demo of the effect size:

# create data to train dictionary with, use subset to increase realism
mkdir /tmp/pg_proc_as_json/;
CREATE EXTENSION adminpack;
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true)
FROM pg_proc pp
LIMIT 2000;


# build dictionary
zstd --train -o /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

# create more data
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true) FROM pg_proc pp;


# compress without dictionary
lz4 -k -m /tmp/pg_proc_as_json/*.raw
zstd -k /tmp/pg_proc_as_json/*.raw

# measure size
cat /tmp/pg_proc_as_json/*.raw|wc -c; cat /tmp/pg_proc_as_json/*.lz4|wc -c; cat /tmp/pg_proc_as_json/*.zst|wc -c


# compress with dictionary
rm -f /tmp/pg_proc_as_json/*.{lz4,zst};
lz4 -k -D /tmp/pg_proc_as_json.dict -m /tmp/pg_proc_as_json/*.raw
zstd -k -D /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

did the same with zstd.

Here's the results:

               lz4       zstd    uncompressed
no dict    1328794     982497        3898498
dict        375070     267194

I'd say the effect of the dictionary is pretty impressive. And remember,
this is with the dictionary having been trained on a subset of the data.


As a comparison, here's all the data compressed compressed at once:

               lz4       zstd
no dict     180231     104913
dict        179814     106444

Unsurprisingly the dictionary doesn't help much, because the compression
algorithm can "natively" see the duplication.


- Andres



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi,

> > So to clarify, are we talking about tuple-level compression? Or
> > perhaps page-level compression?
>
> Tuple level.
>
> What I think we should do is basically this:
>
> When we compress datums, we know the table being targeted. If there's a
> pg_attribute parameter indicating we should, we can pass a prebuilt
> dictionary to the LZ4/zstd [de]compression functions.
>
> It's possible we'd need to use a somewhat extended header for such
> compressed datums, to reference the dictionary "id" to be used when
> decompressing, if the compression algorithms don't already have that in
> one of their headers, but that's entirely doable.
>
> A quick demo of the effect size:
> [...]
> Here's the results:
>
>                lz4       zstd   uncompressed
> no dict    1328794     982497        3898498
> dict        375070     267194
>
> I'd say the effect of the dictionary is pretty impressive. And remember,
> this is with the dictionary having been trained on a subset of the data.

I see your point regarding the fact that creating dictionaries on a
training set is too beneficial to neglect it. Can't argue with this.

What puzzles me though is: what prevents us from doing this on a page
level as suggested previously?

More similar data you compress the more space and disk I/O you save.
Additionally you don't have to compress/decompress the data every time
you access it. Everything that's in shared buffers is uncompressed.
Not to mention the fact that you don't care what's in pg_attribute,
the fact that schema may change, etc. There is a table and a
dictionary for this table that you refresh from time to time. Very
simple.

Of course the disadvantage here is that we are not saving the memory,
unlike the case of tuple-level compression. But we are saving a lot of
CPU cycles and doing less disk IOs. I would argue that saving CPU
cycles is generally more preferable. CPUs are still often a bottleneck
while the memory becomes more and more available, e.g there are
relatively affordable (for a company, not an individual) 1 TB RAM
instances, etc.

So it seems to me that doing page-level compression would be simpler
and more beneficial in the long run (10+ years). Don't you agree?

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Matthias van de Meent
Date:
On Mon, 6 Feb 2023 at 15:03, Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi,
>
> I see your point regarding the fact that creating dictionaries on a
> training set is too beneficial to neglect it. Can't argue with this.
>
> What puzzles me though is: what prevents us from doing this on a page
> level as suggested previously?

The complexity of page-level compression is significant, as pages are
currently a base primitive of our persistency and consistency scheme.
TOAST builds on top of these low level primitives and has access to
catalogs, but smgr doesn't do that and can't have that, respectively,
because it need to be accessible and usable without access to the
catalogs during replay in recovery.

I would like to know how you envision we would provide consistency
when page-level compression would be implemented - wouldn't it
increase WAL overhead (and WAL synchronization overhead) when writing
out updated pages to a new location due to it changing compressed
size?

> More similar data you compress the more space and disk I/O you save.
> Additionally you don't have to compress/decompress the data every time
> you access it. Everything that's in shared buffers is uncompressed.
> Not to mention the fact that you don't care what's in pg_attribute,
> the fact that schema may change, etc. There is a table and a
> dictionary for this table that you refresh from time to time. Very
> simple.

You cannot "just" refresh a dictionary used once to compress an
object, because you need it to decompress the object too.

Additionally, I don't think block-level compression is related to this
thread in a meaningful way: TOAST and datatype -level compression
reduce the on-page size of attributes, and would benefit from improved
compression regardless of the size of pages when stored on disk, but a
page will always use 8kB when read into memory. A tuple that uses less
space on pages will thus always be the better option when you're
optimizing for memory usage, while also reducing storage size.

> Of course the disadvantage here is that we are not saving the memory,
> unlike the case of tuple-level compression. But we are saving a lot of
> CPU cycles

Do you have any indication for how attribute-level compares against
page-level compression in cpu cycles?

> and doing less disk IOs.

Less IO bandwidth, but I doubt it uses less operations, as each page
would still need to be read; which currently happens on a page-by-page
IO operation. 10 page read operations use 10 syscalls to read data
from disk - 10 IO ops.

> I would argue that saving CPU
> cycles is generally more preferable. CPUs are still often a bottleneck
> while the memory becomes more and more available, e.g there are
> relatively affordable (for a company, not an individual) 1 TB RAM
> instances, etc.

But not all systems have that 1TB RAM, and we cannot expect all users
to increase their RAM.

> So it seems to me that doing page-level compression would be simpler
> and more beneficial in the long run (10+ years). Don't you agree?

Page-level compression can not compress patterns that have a length of
more than 1 page. TOAST is often used to store values larger than 8kB,
which we'd prefer to compress to the greatest extent possible. So, a
value-level compression method specialized to the type of the value
does make a lot of sense, too.

I'm not trying to say that compressing pages doesn't make sense or is
useless, I just don't think that we should ignore attribute-level
compression just because page-level compression could at some point be
implemented too.


Kind regards,

Matthias van de Meent



Re: [PATCH] Compression dictionaries for JSONB

From
Andres Freund
Date:
Hi,

On 2023-02-06 16:16:41 +0100, Matthias van de Meent wrote:
> On Mon, 6 Feb 2023 at 15:03, Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > Hi,
> >
> > I see your point regarding the fact that creating dictionaries on a
> > training set is too beneficial to neglect it. Can't argue with this.
> >
> > What puzzles me though is: what prevents us from doing this on a page
> > level as suggested previously?
> 
> The complexity of page-level compression is significant, as pages are
> currently a base primitive of our persistency and consistency scheme.

+many

It's also not all a panacea performance-wise, datum-level decompression can
often be deferred much longer than page level decompression. For things like
json[b], you'd hopefully normally have some "pre-filtering" based on proper
columns, before you need to dig into the json datum.

It's also not necessarily that good, compression ratio wise. Particularly for
wider datums you're not going to be able to remove much duplication, because
there's only a handful of tuples. Consider the case of json keys - the
dictionary will often do better than page level compression, because it'll
have the common keys in the dictionary, which means the "full" keys never will
have to appear on a page, whereas page-level compression will have the keys on
it, at least once.

Of course you can use a dictionary for page-level compression too, but the
gains when it works well will often be limited, because in most OLTP usable
page-compression schemes I'm aware of, you can't compress a page all that far
down, because you need a small number of possible "compressed page sizes".


> > More similar data you compress the more space and disk I/O you save.
> > Additionally you don't have to compress/decompress the data every time
> > you access it. Everything that's in shared buffers is uncompressed.
> > Not to mention the fact that you don't care what's in pg_attribute,
> > the fact that schema may change, etc. There is a table and a
> > dictionary for this table that you refresh from time to time. Very
> > simple.
> 
> You cannot "just" refresh a dictionary used once to compress an
> object, because you need it to decompress the object too.

Right. That's what I was trying to refer to when mentioning that we might need
to add a bit of additional information to the varlena header for datums
compressed with a dictionary.

Greetings,

Andres Freund



Re: [PATCH] Compression dictionaries for JSONB

From
Nikita Malakhov
Date:
Hi,

On updating dictionary -

>You cannot "just" refresh a dictionary used once to compress an
>object, because you need it to decompress the object too.

and when you have many - updating an existing dictionary requires
going through all objects compressed with it in the whole database.
It's a very tricky question how to implement this feature correctly.
Also, there are some thoughts on using JSON schema to optimize
storage for JSON objects.
(That's applied to the TOAST too, so at first glance we've decided
to forbid dropping or changing TOAST implementations already
registered in a particular database.)

In my experience, in modern world, even with fast SSD storage
arrays, with large database (about 40-50 Tb) we had disk access
as a bottleneck more often than CPU, except for the cases with
a lot of parallel execution threads for a single query (Oracle).

On Mon, Feb 6, 2023 at 10:33 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-02-06 16:16:41 +0100, Matthias van de Meent wrote:
> On Mon, 6 Feb 2023 at 15:03, Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > Hi,
> >
> > I see your point regarding the fact that creating dictionaries on a
> > training set is too beneficial to neglect it. Can't argue with this.
> >
> > What puzzles me though is: what prevents us from doing this on a page
> > level as suggested previously?
>
> The complexity of page-level compression is significant, as pages are
> currently a base primitive of our persistency and consistency scheme.

+many

It's also not all a panacea performance-wise, datum-level decompression can
often be deferred much longer than page level decompression. For things like
json[b], you'd hopefully normally have some "pre-filtering" based on proper
columns, before you need to dig into the json datum.

It's also not necessarily that good, compression ratio wise. Particularly for
wider datums you're not going to be able to remove much duplication, because
there's only a handful of tuples. Consider the case of json keys - the
dictionary will often do better than page level compression, because it'll
have the common keys in the dictionary, which means the "full" keys never will
have to appear on a page, whereas page-level compression will have the keys on
it, at least once.

Of course you can use a dictionary for page-level compression too, but the
gains when it works well will often be limited, because in most OLTP usable
page-compression schemes I'm aware of, you can't compress a page all that far
down, because you need a small number of possible "compressed page sizes".


> > More similar data you compress the more space and disk I/O you save.
> > Additionally you don't have to compress/decompress the data every time
> > you access it. Everything that's in shared buffers is uncompressed.
> > Not to mention the fact that you don't care what's in pg_attribute,
> > the fact that schema may change, etc. There is a table and a
> > dictionary for this table that you refresh from time to time. Very
> > simple.
>
> You cannot "just" refresh a dictionary used once to compress an
> object, because you need it to decompress the object too.

Right. That's what I was trying to refer to when mentioning that we might need
to add a bit of additional information to the varlena header for datums
compressed with a dictionary.

Greetings,

Andres Freund


--
Regards,

--
Nikita Malakhov
Postgres Professional 

Re: [PATCH] Compression dictionaries for JSONB

From
Alvaro Herrera
Date:
On 2023-Feb-05, Aleksander Alekseev wrote:

> Since PostgreSQL is not a specified document-oriented DBMS I think we
> better focus our (far from being infinite) resources on something more
> people would benefit from: AIO/DIO [1] or perhaps getting rid of
> freezing [2], to name a few examples.

For what it's worth -- one of the reasons Postgres is successful, at
least in my opinion, is that each developer does more or less what they
see fit (or what their employer sees fit), without following any sort of
grand plan or roadmap.  This has allowed us to expand in many directions
simultaneously.  There's a group working on AIO; others are interested
in improving partitioning, or logical replication, adding new SQL
features, and so on.  I don't think we should stop thinking about TOAST
(or more precisely JSON compression) just because we want to have all
these other things.  Not being a document database didn't stop us from
adding JSON many years back and JSONB stuff later.  When we did, it was
an enormous enabler of new use cases.

Everyone, from customers of large Postgres support companies, to those
of small or one-man Postgres support shops, to individual users doing
stuff on their free time, benefits from everything that happens in the
Postgres development group.  Let's keep improving Postgres for everyone.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La espina, desde que nace, ya pincha" (Proverbio africano)



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi,

> > The complexity of page-level compression is significant, as pages are
> > currently a base primitive of our persistency and consistency scheme.
>
> +many
>
> It's also not all a panacea performance-wise, datum-level decompression can
> often be deferred much longer than page level decompression. For things like
> json[b], you'd hopefully normally have some "pre-filtering" based on proper
> columns, before you need to dig into the json datum.

This is actually a good point.

> It's also not necessarily that good, compression ratio wise. Particularly for
> wider datums you're not going to be able to remove much duplication, because
> there's only a handful of tuples. Consider the case of json keys - the
> dictionary will often do better than page level compression, because it'll
> have the common keys in the dictionary, which means the "full" keys never will
> have to appear on a page, whereas page-level compression will have the keys on
> it, at least once.

To clarify, what I meant was applying an idea of compression with
shared dictionaries to the pages instead of tuples. Just to make sure
we are on the same page.

> Page-level compression can not compress patterns that have a length of
> more than 1 page. TOAST is often used to store values larger than 8kB,
> which we'd prefer to compress to the greatest extent possible. So, a
> value-level compression method specialized to the type of the value
> does make a lot of sense, too.

Let's not forget that TOAST table is a table too. Page-level
compression applies to it as well as to a regular one.

> Of course you can use a dictionary for page-level compression too, but the
> gains when it works well will often be limited, because in most OLTP usable
> page-compression schemes I'm aware of, you can't compress a page all that far
> down, because you need a small number of possible "compressed page sizes".

That's true. However compressing an 8 KB page to, let's say, 1 KB, is
not a bad result as well.

In any case, there seems to be advantages and disadvantages of either
approach. Personally I don't care that much which one to choose. In
fact, although my own patch proposed attribute-level compression, not
tuple-level one, it is arguably closer to tuple-level approach than
page-level one. So to a certain extent I would be contradicting myself
by trying to prove that page-level compression is the way to go. Also
Matthias has a reasonable concern that page-level compression may have
implications for the WAL size. (Maybe it will not but I'm not ready to
prove it right now, nor am I convinced this is necessarily true.)

So, let's focus on tuple-level compression then.

> > > More similar data you compress the more space and disk I/O you save.
> > > Additionally you don't have to compress/decompress the data every time
> > > you access it. Everything that's in shared buffers is uncompressed.
> > > Not to mention the fact that you don't care what's in pg_attribute,
> > > the fact that schema may change, etc. There is a table and a
> > > dictionary for this table that you refresh from time to time. Very
> > > simple.
> >
> > You cannot "just" refresh a dictionary used once to compress an
> > object, because you need it to decompress the object too.
>
> Right. That's what I was trying to refer to when mentioning that we might need
> to add a bit of additional information to the varlena header for datums
> compressed with a dictionary.

> [...]
> and when you have many - updating an existing dictionary requires
> going through all objects compressed with it in the whole database.
> It's a very tricky question how to implement this feature correctly.

Yep, that's one of the challenges.

One approach would be to extend the existing dictionary. Not sure if
ZSTD / LZ4 support this, they probably don't. In any case this is a
sub-optimal approach because the dictionary will grow indefinitely.

We could create a dictionary once per table and forbid modifying it.
Users will have to re-create and refill a table manually if he/she
wants to update the dictionary by using `INSERT INTO .. SELECT ..`.
Although this is a possible solution I don't think this is what Andres
meant above by being invisible to the user. Also it would mean that
the new dictionary should be learned on the old table before creating
the new one with a new dictionary which is awkward.

This is why we need something like dictionary versions. A dictionary
can't be erased as long as there is data that uses this version of a
dictionary. The old data should be decompressed and compressed again
with the most recent dictionary, e.g. during VACUUM or perhaps VACUUM
FULL. This is an idea I ended up using in ZSON.

There may be alternative solutions, but I don't think I'm aware of
such. (There are JSON Schema, Protobuf etc, but they don't work for
general-purpose compression algorithms and/or arbitrary data types.)

> Let's keep improving Postgres for everyone.

Amen.

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Andres,

> > So to clarify, are we talking about tuple-level compression? Or
> > perhaps page-level compression?
>
> Tuple level.

> although my own patch proposed attribute-level compression, not
> tuple-level one, it is arguably closer to tuple-level approach than
> page-level one

Just wanted to make sure that by tuple-level we mean the same thing.

When saying tuple-level do you mean that the entire tuple should be
compressed as one large binary (i.e. similarly to page-level
compression but more granularly), or every single attribute should be
compressed separately (similarly to how TOAST does this)?

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Andres Freund
Date:
Hi,

On February 9, 2023 2:50:57 AM PST, Aleksander Alekseev <aleksander@timescale.com> wrote:
>Hi Andres,
>
>> > So to clarify, are we talking about tuple-level compression? Or
>> > perhaps page-level compression?
>>
>> Tuple level.
>
>> although my own patch proposed attribute-level compression, not
>> tuple-level one, it is arguably closer to tuple-level approach than
>> page-level one
>
>Just wanted to make sure that by tuple-level we mean the same thing.
>
>When saying tuple-level do you mean that the entire tuple should be
>compressed as one large binary (i.e. similarly to page-level
>compression but more granularly), or every single attribute should be
>compressed separately (similarly to how TOAST does this)?

Good point - should have been clearer. I meant attribute wise compression. Like we do today, except that we would use a
dictionaryto increase compression rates. 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [PATCH] Compression dictionaries for JSONB

From
Nikita Malakhov
Date:
Hi!

If I understand Andres' message correctly - the proposition is to
make use of compression dictionaries automatic, possibly just setting
a parameter when the table is created, something like
CREATE TABLE t ( ..., t JSONB USE DICTIONARY);
The question is then how to create such dictionaries automatically
and extend them while data is being added to the table. Because
it is not something unusual when after a time circumstances change
and a rather small table is started to be loaded with huge amounts
of data.

I prefer extending a dictionary over re-creating it because while
dictionary is recreated we leave users two choices - to wait until
dictionary creation is over or to use the old version (say, kept as
as a snapshot while a new one is created). Keeping many versions
simultaneously does not make sense and would extend DB size.

Also, compressing small data with a large dictionary (the case for
one-for-many tables dictionary), I think, would add some considerable
overhead to the INSERT/UPDATE commands, so the most reasonable
choice is a per-table dictionary.

Am I right?

Any ideas on how to create and extend such dictionaries automatically?

On Thu, Feb 9, 2023 at 2:01 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On February 9, 2023 2:50:57 AM PST, Aleksander Alekseev <aleksander@timescale.com> wrote:
>Hi Andres,
>
>> > So to clarify, are we talking about tuple-level compression? Or
>> > perhaps page-level compression?
>>
>> Tuple level.
>
>> although my own patch proposed attribute-level compression, not
>> tuple-level one, it is arguably closer to tuple-level approach than
>> page-level one
>
>Just wanted to make sure that by tuple-level we mean the same thing.
>
>When saying tuple-level do you mean that the entire tuple should be
>compressed as one large binary (i.e. similarly to page-level
>compression but more granularly), or every single attribute should be
>compressed separately (similarly to how TOAST does this)?

Good point - should have been clearer. I meant attribute wise compression. Like we do today, except that we would use a dictionary to increase compression rates.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


--
Regards,
Nikita Malakhov
Postgres Professional 

Re: [PATCH] Compression dictionaries for JSONB

From
Andres Freund
Date:
Hi,

On 2023-02-10 21:22:14 +0300, Nikita Malakhov wrote:
> If I understand Andres' message correctly - the proposition is to
> make use of compression dictionaries automatic, possibly just setting
> a parameter when the table is created, something like
> CREATE TABLE t ( ..., t JSONB USE DICTIONARY);

I definitely wouldn't make it automatic initially, and then later see how well
that works.

Whether automatic or not, it probably makes sense to integrate building
dictionaries with analyze. We can build a dictionaries from sampled datums, we
can check how efficient a dictionary is when compressing sampled datums, we
can compare the efficiency of a new dictionary with the existing dictionary to
see whether it's worth a new one.


> The question is then how to create such dictionaries automatically
> and extend them while data is being added to the table. Because
> it is not something unusual when after a time circumstances change
> and a rather small table is started to be loaded with huge amounts
> of data.

It doesn't really make sense to create the dictionaries with small tables,
anyway. For them to be efficient, you need a reasonable amount of data to
build a dictionary from.


> I prefer extending a dictionary over re-creating it because while
> dictionary is recreated we leave users two choices - to wait until
> dictionary creation is over or to use the old version (say, kept as
> as a snapshot while a new one is created). Keeping many versions
> simultaneously does not make sense and would extend DB size.

I don't think you really can extend dictionaries. The references into the
dictionary are as small as possible, based on the contents of the
dictonary. And you normally want to keep the dictionary size bounded for that
reason alone, but also for [de]compression speed.

So you'd need build a new dictionary, and use that going forward. And yes,
you'd not be able to delete the old dictionary, because there will still be
references to the old one.

We could add a command to scan the data to see if an old dictionary is still
used, or even remove references to it. But I don't think it's a particularly
important: It only makes sense to create the initial directory once some data
has accumulated, and creating further dictionaries only makes sense once the
table is a good bit larger. At that point the size of another dictionary to be
stored isn't relevant in relation (otherwise you'd just give up after building
a new dictionary and evaluating its effectiveness).


> Also, compressing small data with a large dictionary (the case for
> one-for-many tables dictionary), I think, would add some considerable
> overhead to the INSERT/UPDATE commands, so the most reasonable
> choice is a per-table dictionary.

Likely even per-column, but I can see some advantages in either appraoch.


> Any ideas on how to create and extend such dictionaries automatically?

As I said, I don't think we should extend dictionaries. For this to work we'll
likely need a new / extended compressed toast datum header of some form, with
a reference to the dictionary. That'd likely be needed even with updatable
dictionaries, as we IIRC don't know which column a toasted datum is for, but
we need to know, to identify the dictionary. As we need that field anyway, we
can easily have multiple dictionaries.

Greetings,

Andres Freund



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Andres,

> As I said, I don't think we should extend dictionaries. For this to work we'll
> likely need a new / extended compressed toast datum header of some form, with
> a reference to the dictionary. That'd likely be needed even with updatable
> dictionaries, as we IIRC don't know which column a toasted datum is for, but
> we need to know, to identify the dictionary. As we need that field anyway, we
> can easily have multiple dictionaries.

So I summarized the requirements we agreed on so far and ended up with
the following list:

* This is going to be a PostgreSQL feature, not an extension, not a
bunch of hooks, etc;
* We are not going to support lazy/partial decompression since this is
too complicated in a general case and Postgres is not a specialized
document-oriented DBMS (there is a normalization after all);
* This should be a relation-level optimization option, not something
visible to every user of the table (not a custom type, etc);
* This is going to be an attribute-level compression;
* The dictionaries should be created automatically (maybe not in a PoC
but in the final implementation) since people are not good at it;
* We are going to be using the existing compression algorithms like
LZ4/ZSTD, not to invent new ones;
* When created, a dictionary version is immutable, i.e. no new entries
can be added. New version of a dictionary can be created when the data
evolves. The compressed data stores the dictionary version used for
compression. A dictionary version can't be deleted while data exists
that uses this version of a dictionary;
* Dictionaries are created automatically from sampled data during
ANALIZE. We compare the efficiency of a new dictionary vs the
efficiency of the old one (or the lack of such) on sampled data and
depending on the results decide whether it's worth creating a new
version of a dictionary;
* This is going to work for multiple types: TEXT, JSON, JSONB, XML,
BYTEA etc. Ideally for user-defined types too;

Hopefully I didn't miss anything.

While thinking about how a user interface could look like it occured
to me that what we are discussing could be merely a new STORAGE
strategy. Currently there are PLAIN, MAIN, EXTERNAL and EXTENDED.
Let's call a new strategy DICTIONARY, with typstorage = d.

When user wants a given attribute to be compressed, he/she says:

ALTER TABLE foo ALTER COLUMN bar SET STORAGE DICTIONARY;

And the compression algorithms is chosen as usual:

ALTER TABLE foo ALTER COLUMN bar SET COMPRESSION lz4;

When there are no dictionaries yet, DICTIONARY works the same as
EXTENDED. When a dictionary is trained the data is compressed using
the latest version of this dictionary. For visibility we are going to
need some sort of pg_stat_dictionaries view that shows the existing
dictionaries, how much space they consume, etc.

If we choose this approach there are a couple of questions/notes that
come to mind:

* The default compression algorithm is PGLZ and unlike LZ4 it doesn't
support training dictionaries yet. This should be straightforward to
implement though, or alternatively shared dictionaries could work only
for LZ4;
* Currently users have control of toast_tuple_target but not
TOAST_TUPLE_THRESHOLD. Which means for tuples smaller than 1/4 of the
page size shared dictionaries are not going to be triggered. Which is
not necessarily a bad thing. Alternatively we could give the users
toast_tuple_threshold setting. This shouldn't necessarily be part of
this CF entry discussion however, we can always discuss it separately;
* Should we allow setting DICTIONARY storage strategy for a given
type, i.e. CREATE TYPE baz STORAGE = DICTIONARY? I suggest we forbid
it in the first implementation, just for the sake of simplicity.
* It looks like we won't be able to share a dictionary between
multiple columns. Which again is not necessarily a bad thing: data in
these columns can be completely different (e.g. BYTEA and XML),
columns can be dropped independently, etc. If a user is interested in
sharing a dictionary between several columns he/she can join these
columns in a single JSONB column.
* TOAST currently doesn't support ZSTD. IMO this is not a big deal and
adding the corresponding support can be discussed separately.
* If memory serves, there were not so many free bits left in TOAST
pointers. The pointers don't store a storage strategy though so
hopefully this will not be a problem. We'll see.

Please let me know what you think about all this. I'm going to prepare
an updated patch for the next CF so I could use early feedback.

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Matthias van de Meent
Date:
On Tue, 18 Apr 2023 at 17:28, Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi Andres,
>
> > As I said, I don't think we should extend dictionaries. For this to work we'll
> > likely need a new / extended compressed toast datum header of some form, with
> > a reference to the dictionary. That'd likely be needed even with updatable
> > dictionaries, as we IIRC don't know which column a toasted datum is for, but
> > we need to know, to identify the dictionary. As we need that field anyway, we
> > can easily have multiple dictionaries.
>
> So I summarized the requirements we agreed on so far and ended up with
> the following list:
>
> * This is going to be a PostgreSQL feature, not an extension, not a
> bunch of hooks, etc;
> * We are not going to support lazy/partial decompression since this is
> too complicated in a general case and Postgres is not a specialized
> document-oriented DBMS (there is a normalization after all);
> * This should be a relation-level optimization option, not something
> visible to every user of the table (not a custom type, etc);
> * This is going to be an attribute-level compression;
> * The dictionaries should be created automatically (maybe not in a PoC
> but in the final implementation) since people are not good at it;
> * We are going to be using the existing compression algorithms like
> LZ4/ZSTD, not to invent new ones;
> * When created, a dictionary version is immutable, i.e. no new entries
> can be added. New version of a dictionary can be created when the data
> evolves. The compressed data stores the dictionary version used for
> compression. A dictionary version can't be deleted while data exists
> that uses this version of a dictionary;
> * Dictionaries are created automatically from sampled data during
> ANALYZE. We compare the efficiency of a new dictionary vs the
> efficiency of the old one (or the lack of such) on sampled data and
> depending on the results decide whether it's worth creating a new
> version of a dictionary;
> * This is going to work for multiple types: TEXT, JSON, JSONB, XML,
> BYTEA etc. Ideally for user-defined types too;

Any type with typlen < 0 should work, right?

> Hopefully I didn't miss anything.
>
> While thinking about how a user interface could look like it occurred
> to me that what we are discussing could be merely a new STORAGE
> strategy. Currently there are PLAIN, MAIN, EXTERNAL and EXTENDED.
> Let's call a new strategy DICTIONARY, with typstorage = d.

The use of dictionaries should be dependent on only the use of a
compression method that supports pre-computed compression
dictionaries. I think storage=MAIN + compression dictionaries should
be supported, to make sure there is no expensive TOAST lookup for the
attributes of the tuple; but that doesn't seem to be an option with
that design.

> When a user wants a given attribute to be compressed, he/she says:
>
> ALTER TABLE foo ALTER COLUMN bar SET STORAGE DICTIONARY;
>
> And the compression algorithm is chosen as usual:
>
> ALTER TABLE foo ALTER COLUMN bar SET COMPRESSION lz4;
>
> When there are no dictionaries yet, DICTIONARY works the same as
> EXTENDED. When a dictionary is trained the data is compressed using
> the latest version of this dictionary. For visibility we are going to
> need some sort of pg_stat_dictionaries view that shows the existing
> dictionaries, how much space they consume, etc.

I think "AT_AC SET COMPRESSION lz4 {[WITH | WITHOUT] DICTIONARY}",
"AT_AC SET COMPRESSION lz4-dictionary", or "AT_AC SET
compression_dictionary = on" would be better from a design
perspective.

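Spelled out, the "AT_AC" shorthand above would correspond to full statements roughly like the following; all three spellings are proposals only, none of this syntax exists:

```
ALTER TABLE foo ALTER COLUMN bar SET COMPRESSION lz4 WITH DICTIONARY;
ALTER TABLE foo ALTER COLUMN bar SET COMPRESSION "lz4-dictionary";
ALTER TABLE foo ALTER COLUMN bar SET (compression_dictionary = on);
```
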
> If we choose this approach there are a couple of questions/notes that
> come to mind:
>
> * The default compression algorithm is PGLZ and unlike LZ4 it doesn't
> support training dictionaries yet. This should be straightforward to
> implement though, or alternatively shared dictionaries could work only
> for LZ4;

Didn't we get zstd support recently as well?

> * Currently users have control of toast_tuple_target but not
> TOAST_TUPLE_THRESHOLD. Which means for tuples smaller than 1/4 of the
> page size shared dictionaries are not going to be triggered. Which is
> not necessarily a bad thing. Alternatively we could give the users
> toast_tuple_threshold setting. This shouldn't necessarily be part of
> this CF entry discussion however, we can always discuss it separately;

That makes a lot of sense, but as you said handling that separately
would probably be better and easier to review.

> * Should we allow setting DICTIONARY storage strategy for a given
> type, i.e. CREATE TYPE baz STORAGE = DICTIONARY? I suggest we forbid
> it in the first implementation, just for the sake of simplicity.

Can we specify a default compression method for each postgresql type,
just like how we specify the default storage? If not, then the setting
could realistically be in conflict with a default_toast_compression
setting, assuming that dictionary support is not a requirement for
column compression methods.

> * It looks like we won't be able to share a dictionary between
> multiple columns. Which again is not necessarily a bad thing: data in
> these columns can be completely different (e.g. BYTEA and XML),
> columns can be dropped independently, etc.

Yes

> If a user is interested in
> sharing a dictionary between several columns he/she can join these
> columns in a single JSONB column.

It is unreasonable to expect this to be possible, since e.g.
partitioning can result in columns that share compressible patterns
ending up in different physical tables.

> * TOAST currently doesn't support ZSTD. IMO this is not a big deal and
> adding the corresponding support can be discussed separately.
> * If memory serves, there were not so many free bits left in TOAST
> pointers. The pointers don't store a storage strategy though so
> hopefully this will not be a problem. We'll see.

The toast pointer must store enough info about the compression used to
decompress the datum, which implies it needs to store the compression
algorithm used, and a reference to the compression dictionary (if
any). I think the idea about introducing a new toast pointer type (in
the custom toast patch) wasn't bad per se, and that change would allow
us to carry more or different info in the header.

Kind regards,

Matthias van de Meent



Re: [PATCH] Compression dictionaries for JSONB

From
Nikita Malakhov
Date:
Hi,

I don't think it's a good idea to interfere with the storage strategies. A dictionary
should be a kind of storage option, like compression, not a strategy that
excludes all the others.

>> While thinking about how a user interface could look like it occurred
>> to me that what we are discussing could be merely a new STORAGE
>> strategy. Currently there are PLAIN, MAIN, EXTERNAL and EXTENDED.
>> Let's call a new strategy DICTIONARY, with typstorage = d.

>I think "AT_AC SET COMPRESSION lz4 {[WITH | WITHOUT] DICTIONARY}",
>"AT_AC SET COMPRESSION lz4-dictionary", or "AT_AC SET
>compression_dictionary = on" would be better from a design
>perspective.

Agree with Matthias on above.

About the TOAST pointer:

>The toast pointer must store enough info about the compression used to
>decompress the datum, which implies it needs to store the compression
>algorithm used, and a reference to the compression dictionary (if
>any). I think the idea about introducing a new toast pointer type (in
>the custom toast patch) wasn't bad per se, and that change would allow
>us to carry more or different info in the header.

The External TOAST pointer is very limited in the amount of service data
it can keep; that's why we introduced the Custom TOAST pointers in the
Pluggable TOAST. But keep in mind that changing the TOAST pointer
structure requires a lot of quite heavy modifications in the core - along with
some obvious places like inserting/updating/deleting a datum, there is a very
serious issue with logical replication.
The Pluggable TOAST was rejected, but we have a lot of improvements
based on changing the TOAST pointer structure.


--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Matthias, Nikita,

Many thanks for the feedback!

> Any type with typlen < 0 should work, right?

Right.

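For reference, "typlen < 0" corresponds to the varlena types, which can be listed like this:

```
-- variable-length (varlena) base types, i.e. the candidates for this feature
SELECT typname FROM pg_type WHERE typlen = -1 AND typtype = 'b' ORDER BY typname;
```
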
> The use of dictionaries should be dependent on only the use of a
> compression method that supports pre-computed compression
> dictionaries. I think storage=MAIN + compression dictionaries should
> be supported, to make sure there is no expensive TOAST lookup for the
> attributes of the tuple; but that doesn't seem to be an option with
> that design.

> I don't think it's a good idea to interfere with the storage strategies. A dictionary
> should be a kind of storage option, like compression, not a strategy that
> excludes all the others.

My reasoning behind this proposal was as follows.

Let's not forget that MAIN attributes *can* be stored in a TOAST table
as a last resort, and also that EXTENDED attributes are compressed
in-place first and are stored in a TOAST table *only* if this is needed
to fit a tuple in toast_tuple_target bytes (which, additionally, the
user can change). So whether in practice it's going to be advantageous
to distinguish MAIN+dict.compressed and EXTENDED+dict.compressed
attributes seems to be debatable.

Basically the only difference between MAIN and EXTENDED is the
priority the four-stage TOASTing algorithm gives to the corresponding
attributes. I would assume if the user wants dictionary compression,
the attribute should be highly compressible and thus always EXTENDED.
(We seem to use MAIN for types that are not that well compressible.)

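For reference, the per-column storage strategy and compression method being discussed can be inspected directly; foo is the example table from upthread:

```
-- attstorage: p = PLAIN, e = EXTERNAL, m = MAIN, x = EXTENDED
-- attcompression: empty = default, p = pglz, l = lz4
SELECT attname, attstorage, attcompression
FROM pg_attribute
WHERE attrelid = 'foo'::regclass AND attnum > 0 AND NOT attisdropped;
```
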
This being said, if the majority believes we should introduce a new
entity and keep storage strategies as is, I'm fine with that. This
perhaps is not going to be the most convenient interface for the user.
On the flip side it's going to be flexible. It's all about compromise.

> I think "AT_AC SET COMPRESSION lz4 {[WITH | WITHOUT] DICTIONARY}",
> "AT_AC SET COMPRESSION lz4-dictionary", or "AT_AC SET
> compression_dictionary = on" would be better from a design
> perspective.

> Agree with Matthias on above.

OK, unless someone will object, we have a consensus here.

> Didn't we get zstd support recently as well?

Unfortunately, it is not used for TOAST. In fact I vaguely recall that
ZSTD support for TOAST may have been explicitly rejected. Don't quote
me on that however...

I think it's going to be awkward to support PGLZ/LZ4 for COMPRESSION
and LZ4/ZSTD for dictionary compression. As a user personally I would
prefer having one set of compression algorithms that can be used with
TOAST.

Perhaps for PoC we could focus on LZ4, and maybe PGLZ, if we choose to
use PGLZ for compression dictionaries too. We can always discuss ZSTD
separately.

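For context, a quick sketch of what SET COMPRESSION accepts today (pglz always, lz4 only in builds configured with --with-lz4, and no zstd):

```
ALTER TABLE foo ALTER COLUMN bar SET COMPRESSION pglz;
ALTER TABLE foo ALTER COLUMN bar SET COMPRESSION lz4;
```
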
> Can we specify a default compression method for each postgresql type,
> just like how we specify the default storage? If not, then the setting
> could realistically be in conflict with a default_toast_compression
> setting, assuming that dictionary support is not a requirement for
> column compression methods.

No, only STORAGE can be specified [1].

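To illustrate the current state of things: STORAGE is the only per-type knob of this kind, while the compression method is a per-column or GUC-level setting. The mytype_in/mytype_out functions below are placeholders:

```
-- per-type: STORAGE can be specified, COMPRESSION cannot
CREATE TYPE mytype (INPUT = mytype_in, OUTPUT = mytype_out, STORAGE = extended);
-- the compression method comes from the column definition or from this GUC
SET default_toast_compression = 'lz4';
```
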
> The toast pointer must store enough info about the compression used to
> decompress the datum, which implies it needs to store the compression
> algorithm used, and a reference to the compression dictionary (if
> any). I think the idea about introducing a new toast pointer type (in
> the custom toast patch) wasn't bad per se, and that change would allow
> us to carry more or different info in the header.

> The Pluggable TOAST was rejected, but we have a lot of improvements
> based on changing the TOAST pointer structure.

Interestingly it looks like we ended up working on TOAST improvement
after all. I'm almost certain that we will have to modify TOAST
pointers to a certain degree in order to make it work. Hopefully it's
not going to be too invasive.

[1]: https://www.postgresql.org/docs/current/sql-createtype.html
-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Nikita,

> The External TOAST pointer is very limited in the amount of service data
> it can keep; that's why we introduced the Custom TOAST pointers in the
> Pluggable TOAST. But keep in mind that changing the TOAST pointer
> structure requires a lot of quite heavy modifications in the core - along with
> some obvious places like inserting/updating/deleting a datum, there is a very
> serious issue with logical replication.
> The Pluggable TOAST was rejected, but we have a lot of improvements
> based on changing the TOAST pointer structure.

Now I see what you meant [1]. I agree that we should focus on
refactoring TOAST pointers first. So I suggest we continue discussing
this in a corresponding thread and return to this one later.

[1]: https://www.postgresql.org/message-id/CAJ7c6TPSvR2rKpoVX5TSXo_kMxXF%2B-SxLtrpPaMf907tX%3DnVCw%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Nikita Malakhov
Date:
Hi,

I think I should open a new thread related to TOAST pointer refactoring
based on Pluggable TOAST, and to the COPY and the
looping-while-retrieving-a-new-TOAST-value-OID issues.


--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: [PATCH] Compression dictionaries for JSONB

From
Nikita Malakhov
Date:
Hi hackers!

As discussed above, I've created a new thread on the Extension of the TOAST pointer subject.
Please check and comment.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi hackers,

I would like to continue discussing compression dictionaries.

> So I summarized the requirements we agreed on so far and ended up with
> the following list: [...]

Again, here is the summary of our current agreements, at least as I
understand them. Please feel free to correct me where I'm wrong.

We are going to focus on supporting the:

```
SET COMPRESSION lz4 [WITH|WITHOUT] DICTIONARY
```

... syntax for now. From the UI perspective the rest of the agreements
didn't change compared to the previous summary.

In the [1] discussion (cc: Robert) we agreed to use va_tag != 18 for
the on-disk TOAST pointer representation to make TOAST pointers
extendable. If va_tag has a different value (currently it's always
18), the TOAST pointer is followed by a UTF-8-like varint bitmask.
This bitmask determines the rest of the content of the TOAST pointer
and its overall size. This will allow extending TOAST pointers to
include dictionary_id and also extending them in the future, e.g. to
support ZSTD and other compression algorithms, use 64-bit TOAST
pointers, etc.

Several things occurred to me:

- Does anyone believe that va_tag should be part of the UTF-8-like
bitmask in order to save a byte or two?

- The described approach means that compression dictionaries are not
going to be used when data is compressed in-place (i.e. within a
tuple), since no TOAST pointer is involved in this case. Also we will
be unable to add additional compression algorithms here. Does anyone
have problems with this? Should we use the reserved compression
algorithm id instead as a marker of an extended TOAST?

- It would be nice to decompose the feature into several independent
patches, e.g. modify TOAST first, then add compression dictionaries
without automatic update of the dictionaries, then add the automatic
update. I find it difficult to imagine, however, how to modify TOAST
pointers and test the code properly without a dependency on a larger
feature. Could anyone think of a trivial test case for extendable
TOAST? Maybe something we could add to src/test/modules similarly to
how we test SLRU, background workers, etc.

[1]:
https://www.postgresql.org/message-id/flat/CAN-LCVMq2X%3Dfhx7KLxfeDyb3P%2BBXuCkHC0g%3D9GF%2BJD4izfVa0Q%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Shubham Khanna
Date:
On Wed, Jan 17, 2024 at 4:16 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi hackers,
>
> > 8272749e added a few more arguments to CastCreate(). Here is the rebased patch.
>
> After merging afbfc029 [1] the patch needed a rebase. PFA v10.
>
> The patch is still in a PoC state and this is exactly why comments and
> suggestions from the community are most welcome! Particularly I would
> like to know:
>
> 1. Would you call it a wanted feature considering the existence of
> Pluggable TOASTer patchset which (besides other things) tries to
> introduce type-aware TOASTers for EXTERNAL attributes? I know what
> Simon's [2] and Nikita's latest answers were, and I know my personal
> opinion on this [3][4], but I would like to hear from the rest of the
> community.
>
> 2. How should we make sure a dictionary will not consume all the
> available memory? Limiting the amount of dictionary entries to pow(2,
> 16) and having dictionary versions seems to work OK for ZSON. However
> it was pointed out that this may be an unwanted limitation for the
> in-core implementation.
>
> [1]:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c727f511;hp=afbfc02983f86c4d71825efa6befd547fe81a926
> [2]: https://www.postgresql.org/message-id/CANbhV-HpCF852WcZuU0wyh1jMU4p6XLbV6rCRkZpnpeKQ9OenQ%40mail.gmail.com
> [3]: https://www.postgresql.org/message-id/CAJ7c6TN-N3%3DPSykmOjmW1EAf9YyyHFDHEznX-5VORsWUvVN-5w%40mail.gmail.com
> [4]: https://www.postgresql.org/message-id/CAJ7c6TO2XTTk3cu5w6ePHfhYQkoNpw7u1jeqHf%3DGwn%2BoWci8eA%40mail.gmail.com

I tried to apply the patch but it is failing on HEAD. It is giving
the following error:
patching file src/test/regress/expected/dict.out
patching file src/test/regress/expected/oidjoins.out
patching file src/test/regress/expected/opr_sanity.out
patching file src/test/regress/parallel_schedule
Hunk #1 FAILED at 111.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/parallel_schedule.rej
patching file src/test/regress/sql/dict.sql
Please send a rebased version of the patch.

Thanks and Regards,
Shubham Khanna.



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi Shubham,

> I tried to apply the patch but it is failing on HEAD. It is giving
> the following error:

Yes, it has for a while now. Until we reach any agreement regarding
questions (1) and (2) I personally don't see the point in submitting
rebased patches. We can continue the discussion, but marking the CF
entry as RwF for now would be helpful.

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Aleksander Alekseev
Date:
Hi again,

> Yes, it has for a while now. Until we reach any agreement regarding
> questions (1) and (2) I personally don't see the point in submitting
> rebased patches. We can continue the discussion, but marking the CF
> entry as RwF for now would be helpful.

Sorry, what I actually meant were the following questions:

"""
Several things occurred to me:

- Does anyone believe that va_tag should be part of the UTF-8-like
bitmask in order to save a byte or two?

- The described approach means that compression dictionaries are not
going to be used when data is compressed in-place (i.e. within a
tuple), since no TOAST pointer is involved in this case. Also we will
be unable to add additional compression algorithms here. Does anyone
have problems with this? Should we use the reserved compression
algorithm id instead as a marker of an extended TOAST?

- It would be nice to decompose the feature into several independent
patches, e.g. modify TOAST first, then add compression dictionaries
without automatic update of the dictionaries, then add the automatic
update. I find it difficult to imagine, however, how to modify TOAST
pointers and test the code properly without a dependency on a larger
feature. Could anyone think of a trivial test case for extendable
TOAST? Maybe something we could add to src/test/modules similarly to
how we test SLRU, background workers, etc.
"""

Since there has not been much activity since then (for 3 months) I
don't really see how to proceed further.

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] Compression dictionaries for JSONB

From
Nikita Malakhov
Date:
Hi,

Aleksander, there was a quite straightforward answer regarding Pluggable TOAST
in another thread - the Pluggable TOAST feature is not desired by the community,
and advanced TOAST mechanics would be accepted as part of the extended
functionality of problematic datatypes, on a par with in and out functions. So what I am
actually doing now is re-writing the JSONB TOAST improvements to be called as separate
functions without the Pluggable TOAST API. This takes some time because there is a large
and complex code base left by Nikita Glukhov, who has lost interest in this work
for various reasons.

I, personally, think that these two features could benefit from each other, but they could
be adapted to each other after I introduce the JSONB Toaster in the v17 master branch.

If you don't mind, please check the thread on extending the TOAST pointer - it is important
for improving the TOAST mechanics.



--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: [PATCH] Compression dictionaries for JSONB

From
vignesh C
Date:
On Wed, 17 Jan 2024 at 19:52, Aleksander Alekseev
<aleksander@timescale.com> wrote:
> Yes, it has for a while now. Until we reach any agreement regarding
> questions (1) and (2) I personally don't see the point in submitting
> rebased patches. We can continue the discussion, but marking the CF
> entry as RwF for now would be helpful.

Thanks. I have updated the status to "Returned with feedback". Feel
free to create a new entry after we agree on the approach to take it
forward.

Regards,
Vignesh