Thread: RFC: compression dictionaries for JSONB

RFC: compression dictionaries for JSONB

From: Aleksander Alekseev
Hi hackers,

== Background ==

This is a follow-up thread to `Add ZSON extension to /contrib/` [1].
The ZSON extension introduces a new type called ZSON, which is 100%
compatible with JSONB but uses a shared dictionary of strings most
frequently used by given JSONB documents for compression. See the
thread for more details.

According to the feedback I got, the community generally liked the
idea of adding an across-rows and across-tables compression capability
to JSONB. What the community didn't like was:

1. Introducing a new data type in order to achieve this;
2. Updating compression dictionaries manually;
3. Some implementation details of ZSON, such as the limited dictionary
size (2 ** 16 entries) and extensive use of the gettimeofday() system
call;

There was also a request for proof of the usefulness of this feature
in practice.

To be honest, I don't have solid proof that many users require this
feature, or exactly how many users that would be. ZSON was
originally developed because a customer of Postgres Professional
requested it back in 2016. People approach me with questions from time
to time. E.g. one user asked me recently how the extension can be
compiled on Windows [2].

Andrew Dunstan reported that 2nd Quadrant (now EDB) has a fork of ZSON
with some enhancements [3]. Konstantin Knizhnik reported that he
worked on a similar extension [4]. Unfortunately, Andrew and
Konstantin didn't give any more details (hopefully they will). But all
in all, this indicates some demand.

Speaking of performance, some time ago I benchmarked ZSON on data
similar to what the customer had [5]. The benchmark showed a ~10%
performance improvement in terms of TPS and ~50% saved disk space. The
extension saved memory as well, which follows from the implementation;
the exact amount of saved memory was not measured. This benchmark
shouldn't be considered proof that all users will necessarily benefit
from such a feature, but it indicates that some users could.

== Proposal ==

The proposal is to add support for compression dictionaries to JSONB.

In order to do this, the SQL syntax should be modified. The proposed
syntax is based on Matthias van de Meent's idea [6]:

```
CREATE TYPE <type-name> AS JSONB_DICTIONARY (
  learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },
  autoupdate = false, -- true by default
  -- optional: other arguments, min/max string lengths, etc
);
```

Basically, this is an equivalent of zson_learn [7]. It will create an
id -> string dictionary in the PostgreSQL catalog. When the user
chooses `autoupdate = true`, the dictionary will be updated
automatically by PostgreSQL (e.g. during the VACUUM). This is the
default value. The dictionary can also be updated manually:

```
SELECT jsonb_update_dictionary("type-name");
```

Other than that, the type works like a regular one. All the usual
ALTER TYPE / DROP TYPE semantics are applicable. All the operators
available to JSONB are also available to <type-name>.
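
For illustration, usage could look roughly like this (the type, table
and column names are made up, and the exact syntax is of course up for
discussion):

```
CREATE TYPE doc_dictionary AS JSONB_DICTIONARY (
  learn_by = { {"docs", "payload"} }
);

CREATE TABLE docs(id serial PRIMARY KEY, payload doc_dictionary);

INSERT INTO docs(payload)
VALUES ('{"customer_id": 42, "status": "paid"}');

-- the usual JSONB operators keep working
SELECT payload -> 'status'
FROM docs
WHERE payload @> '{"customer_id": 42}';
```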

Internally, <type-name> is represented similarly to JSONB. However, the
strings from the dictionary are replaced with varints. This idea was
borrowed from Tomas Vondra [8]. The dictionary size is limited to
2**28 entries. The limit can be easily extended in the future if
necessary. Also <type-name> stores the version of the dictionary used
to compress the data. All in all, this is similar to how ZSON works.

The first implementation always decompresses <type-name> entirely.
Partial compression and decompression can always be added
transparently to the user.

== Looking for feedback ==

I would appreciate your feedback on this RFC.

Is anything missing in the description of the feature? Do you think
users need it? Can you think of a better user interface? Are there any
corner cases worth considering? Any other comments and questions are
welcome too!

I would like to implement this once a consensus is reached on how the
feature should look (and whether we need it). Any help (from
co-authors, REVIEWERS!!!, technical writers, ...) would be much
appreciated.

== Links ==

[1]: https://www.postgresql.org/message-id/flat/CAJ7c6TP3fCC9TNKJBQAcEf4c%3DL7XQZ7QvuUayLgjhNQMD_5M_A%40mail.gmail.com
[2]: https://github.com/postgrespro/zson/issues?q=is%3Aissue+is%3Aclosed
[3]: https://www.postgresql.org/message-id/6f3944ad-6924-5fed-580c-e72477733f04%40dunslane.net
[4]: https://github.com/postgrespro/jsonb_schema
[5]: https://github.com/postgrespro/zson/blob/master/docs/benchmark.md
[6]: https://www.postgresql.org/message-id/CAEze2WheMusc73UZ5TpfiAGQ%3DrRwSSgr0y3j9DEVAQgQFwneRA%40mail.gmail.com
[7]: https://github.com/postgrespro/zson#usage
[8]: https://www.postgresql.org/message-id/77356556-0634-5cde-f55e-cce739dc09b9%40enterprisedb.com

-- 
Best regards,
Aleksander Alekseev
Open-Source PostgreSQL Contributor at Timescale



Re: RFC: compression dictionaries for JSONB

From: Matthias van de Meent
On Fri, 8 Oct 2021 at 11:47, Aleksander Alekseev
<aleksander@timescale.com> wrote:
> This is a follow-up thread to `Add ZSON extension to /contrib/` [1].
> The ZSON extension introduces a new type called ZSON, which is 100%
> compatible with JSONB but uses a shared dictionary of strings most
> frequently used by given JSONB documents for compression. See the
> thread for more details.

Great to see that you're still working on this! It would be wonderful
if we could get this into Postgres. I hope you can provide some
clarifications on my questions and comments.

> According to the feedback I got, the community generally liked the
> idea of adding an across-rows and across-tables compression capability
> to JSONB. What the community didn't like was:
>
> 1. Introducing a new data type in order to achieve this;
> 2. Updating compression dictionaries manually;

Well, I for one would like access to manually add entries to the
dictionary. What I'm not interested in is being required to manually
update the dictionary; the ability to manually insert into the
dictionary, however, is much appreciated.

> 3. Some implementation details of ZSON, such as limited dictionary
> size (2 ** 16 entries) and an extensive usage of gettimeofday() system
> call;
>
> There was also a request for proof of the usefulness of this feature
> in practice.

More compact JSONB is never a bad idea: one reason to stick with JSON
over JSONB is that JSON can use significantly less space than JSONB if
stored properly. So improving the disk usage of JSONB is worthwhile.

>
> == Proposal ==
>
> The proposal is to add support for compression dictionaries to JSONB.
>
> In order to do this, the SQL syntax should be modified. The proposed
> syntax is based on Matthias van de Meent's idea [6]:

Seems fine

> ```
> CREATE TYPE <type-name> AS JSONB_DICTIONARY (
>   learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },

I'm having trouble understanding how this learn_by field would be used:

If stored as strings, the references would go out of date when tables
or columns are renamed or dropped. Similarly, you'd want to update the
dictionary with common values in columns of that type, generally not
columns of arbitrary other types. And you can't know the names of
tables and columns in advance, so this would add a maintenance burden
for the user whenever they add / change / remove a column of the
dictionary type. Instead of storing 'update using data from table X,
column Y' in the type, I think that adding it as a column option would
be the better choice.
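
To illustrate, a hypothetical column-level option (the option name is
made up, only to show the shape of such an interface):

```
ALTER TABLE table_a
    ALTER COLUMN column_a SET (dictionary_learn = on);
```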

I agree with an option for auto-update, though I don't think we have
enough information to determine the default value (I'd err on the side
of caution, going with 'off').

>   autoupdate = false, -- true by default
>   -- optional: other arguments, min/max string lengths, etc
> );
> ```

For dump/restore I think it would be very useful to allow export &
import of these dictionaries, so that restored databases don't have
the problem of starting cold.

As such, `ALTER TYPE jsondict ADD ENTRY entry_value` would probably
be useful, and maybe even `CREATE TYPE dict AS JSONB_DICTIONARY
('"entry_1"', '"entry_two"', '"entry_three"') WITH (option =
optional)`.
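
A dump could then recreate a warm dictionary with a sequence roughly
like this (a sketch only, using the syntax above; the entry values are
made up):

```
CREATE TYPE dict AS JSONB_DICTIONARY ();
ALTER TYPE dict ADD ENTRY '"customer_id"';
ALTER TYPE dict ADD ENTRY '"order_status"';
```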

> Basically, this is an equivalent of zson_learn [7]. It will create an
> id -> string dictionary in the PostgreSQL catalog. When the user
> chooses `autoupdate = true`, the dictionary will be updated
> automatically by PostgreSQL (e.g. during the VACUUM). This is the
> default value. The dictionary can also be updated manually:
>
> ```
> SELECT jsonb_update_dictionary("type-name");
> ```

I'm a bit on the fence about this. We do use this for sequences, but
alternatively we might want to use ALTER TYPE jsondict;

> Other than that, the type works like a regular one. All the usual
> ALTER TYPE / DROP TYPE semantics are applicable. All the operators
> available to JSONB are also available to <type-name>.
>
> Internally, <type-name> is represented similarly to JSONB. However, the
> strings from the dictionary are replaced with varints.

How do you propose to differentiate actual integers from these keyed
strings, and/or actual strings from varints? Replacing _all_ strings
doesn't seem like such a great idea.

Related comment below.

> This idea was
> borrowed from Tomas Vondra [8]. The dictionary size is limited to
> 2**28 entries. The limit can be easily extended in the future if
> necessary. Also <type-name> stores the version of the dictionary used
> to compress the data. All in all, this is similar to how ZSON works.

I appreciate this idea, but using that varint implementation is not a
choice I'd make. In the jsonb storage format, we already encode the
length of each value, so varint shouldn't be necessary here. Next, as
everything in jsonb storage is 4-byte aligned, a uint32 should
suffice, or if we're being adventurous, we might even fit a uint29
identifier in the length field instead (at the cost of full backwards
incompatibility).

Lastly, we don't have a good format for varint now (numeric is close,
but has significant overhead), so I'd say we should go with a
fixed-size integer and accept that limitation.

My own suggestion would be updating the JSONB on-disk format with the
following:

```
 /* values stored in the type bits */
 #define JENTRY_ISSTRING 0x00000000
 #define JENTRY_ISNUMERIC 0x10000000
 #define JENTRY_ISBOOL_FALSE 0x20000000
 #define JENTRY_ISBOOL_TRUE 0x30000000
 #define JENTRY_ISNULL 0x40000000
 #define JENTRY_ISCONTAINER 0x50000000 /* array or object */
+#define JENTRY_ISSYMBOL 0x60000000 /* Lookup in dictionary */
```

And then store the symbol in the JEntry (either in the
JENTRY_OFFLENMASK or in the actual referred content), whilst maybe
using some bits in this for e.g. type hints (whether the item in the
dictionary is an array, object, string or numeric).

I really would like this to support non-string types, because jsonb
structures can grow quite large, even with only small strings: e.g.
`{..., "tags": {"customer": "blabla"}}` could be dictionaried to
`{..., "tags": {'1: '2}}`, but potentially also to `{..., "tags": '1}`.
Of these, the second would be more efficient overall for storage and
retrieval.

> The first implementation always decompresses <type-name> entirely.
> Partial compression and decompression can always be added
> transparently to the user.

Are you talking about the TOAST compression and decompression, or are
you talking about a different compression scheme? If a different
scheme, is it only replacing the strings in the jsonb-tree with their
dictionary identifiers, and replacing the symbols in the jsonb-tree
with text (all through the JSONB internals), or are you proposing an
actual compression scheme over the stored jsonb bytes (effectively
wrapping the jsonb IO functions)?

Overall, I'm glad to see this take off, but I do want some
clarifications regarding the direction that this is going towards.


Kind regards,

Matthias



Re: RFC: compression dictionaries for JSONB

From: Alvaro Herrera
On 2021-Oct-08, Matthias van de Meent wrote:

> On Fri, 8 Oct 2021 at 11:47, Aleksander Alekseev
> <aleksander@timescale.com> wrote:

> > In order to do this, the SQL syntax should be modified. The proposed
> > syntax is based on Matthias van de Meent's idea [6]:
> 
> Seems fine
> 
> > ```
> > CREATE TYPE <type-name> AS JSONB_DICTIONARY (
> >   learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },

Actually, why is it a JSONB_DICTIONARY and not like

CREATE TYPE name AS DICTIONARY (
  base_type = JSONB, ...
);

so that it is possible to use the infrastructure for other things?  For
example, perhaps PostGIS geometries could benefit from it -- or even
text or xml columns.
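
For instance, something along these lines (hypothetical, just to show
the generality):

CREATE TYPE log_dict AS DICTIONARY (
  base_type = TEXT, ...
);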

The pg_type entry would have to provide some support procedure that
makes use of the dictionary in some way.  This seems better than tying
the SQL object to a specific type.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
#error "Operator lives in the wrong universe"
  ("Use of cookies in real-time system development", M. Gleixner, M. Mc Guire)



Re: RFC: compression dictionaries for JSONB

From: Matthias van de Meent
On Fri, 8 Oct 2021 at 17:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Oct-08, Matthias van de Meent wrote:
>
> > On Fri, 8 Oct 2021 at 11:47, Aleksander Alekseev
> > <aleksander@timescale.com> wrote:
>
> > > In order to do this, the SQL syntax should be modified. The proposed
> > > syntax is based on Matthias van de Meent's idea [6]:
> >
> > Seems fine
> >
> > > ```
> > > CREATE TYPE <type-name> AS JSONB_DICTIONARY (
> > >   learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },
>
> Actually, why is it a JSONB_DICTIONARY and not like
>
> CREATE TYPE name AS DICTIONARY (
>   base_type = JSONB, ...
> );

That's a good point, but if we're extending this syntax to allow other
types to be included, then I'd instead extend the syntax to something
like the below, so that the type of the dictionary entries is required
by the syntax:

CREATE TYPE name AS DICTIONARY OF jsonb [ ( ...entries ) ] [ WITH (
...options ) ];

> so that it is possible to use the infrastructure for other things?  For
> example, perhaps PostGIS geometries could benefit from it -- or even
> text or xml columns.
>
> The pg_type entry would have to provide some support procedure that
> makes use of the dictionary in some way.  This seems better than tying
> the SQL object to a specific type.

Agreed, but this might mean that much more effort would be required to
get such a useful quality-of-life feature committed.

Kind regards,

Matthias



Re: RFC: compression dictionaries for JSONB

From: Alvaro Herrera
On 2021-Oct-08, Matthias van de Meent wrote:

> That's a good point, but if we're extending this syntax to allow other
> types to be included, then I'd instead extend the syntax to something
> like the below, so that the type of the dictionary entries is required
> by the syntax:
> 
> CREATE TYPE name AS DICTIONARY OF jsonb [ ( ...entries ) ] [ WITH (
> ...options ) ];

I don't think this gives you any guarantees of the sort you seem to
expect.  See CREATE AGGREGATE as a precedent where there are some
options in the parenthesized options list you cannot omit.

> > The pg_type entry would have to provide some support procedure that
> > makes use of the dictionary in some way.  This seems better than tying
> > the SQL object to a specific type.
> 
> Agreed, but this might mean that much more effort would be required to
> get such a useful quality-of-life feature committed.

I don't understand what you mean by that.  I'm not saying that the patch
has to provide support for any additional datatypes.  Its only
obligation would be to provide a new column in pg_type which is zero for
all rows except jsonb, and in that row it is the OID of a
jsonb_dictionary() function that's called from all the right places and
receives all the right arguments.
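
In other words, something like this (the column and function names are
hypothetical, only to illustrate the catalog shape):

SELECT typname, typdictsupport::regproc
  FROM pg_type
 WHERE typdictsupport <> 0;
-- typname | typdictsupport
-- jsonb   | jsonb_dictionary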

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso"
                             (Carlos Caszeli)



Re: RFC: compression dictionaries for JSONB

From: Matthias van de Meent
On Fri, 8 Oct 2021 at 21:21, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Oct-08, Matthias van de Meent wrote:
>
> > That's a good point, but if we're extending this syntax to allow other
> > types to be included, then I'd instead extend the syntax to something
> > like the below, so that the type of the dictionary entries is required
> > by the syntax:
> >
> > CREATE TYPE name AS DICTIONARY OF jsonb [ ( ...entries ) ] [ WITH (
> > ...options ) ];
>
> I don't think this gives you any guarantees of the sort you seem to
> expect.  See CREATE AGGREGATE as a precedent where there are some
> options in the parenthesized options list you cannot omit.

Bikeshedding on syntax:
I guess? I don't really like 'required options' patterns. If you're
required to use/specify an option, then it's not optional, and should
thus not be included in the group of 'options'.

> > > The pg_type entry would have to provide some support procedure that
> > > makes use of the dictionary in some way.  This seems better than tying
> > > the SQL object to a specific type.
> >
> > Agreed, but this might mean that much more effort would be required to
> > get such a useful quality-of-life feature committed.
>
> I don't understand what you mean by that.  I'm not saying that the patch
> has to provide support for any additional datatypes.  Its only
> obligation would be to provide a new column in pg_type which is zero for
> all rows except jsonb, and in that row it is the OID of a
> jsonb_dictionary() function that's called from all the right places and
> receives all the right arguments.

This seems feasible to do, but I still have limited knowledge of the
intricacies of the type system, and as such I don't see how this part
would function:

I was expecting something more along the lines of how array types
work: Type _A is an array type, containing elements of Type A. Its
element type is recorded in pg_type.typelem. No special functions are
defined on base types to enable their respective array types; that
part is handled by the generic array infrastructure. The same goes for
domain types.
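
For reference, this is what the element-type link looks like in the
catalog today; a dictionary type would presumably need an analogous
link to its base type:

```
SELECT typname, typelem::regtype FROM pg_type WHERE typname = '_int4';
-- typname | typelem
-- _int4   | integer
```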

Now that I think about it, we should still provide the information on
_how_ to find the type functions for the dictionaried type: Arrays and
domains are generic, but dictionaries will require deep understanding
of the underlying type.

So, yes, you are correct, there should be one more function, which
would supply the necessary pg_type functions that CREATE TYPE
DICTIONARY can then register in the pg_type entry of the dictionary
type. The alternative would initially be hardcoding this for the base
types that have dictionary support, which definitely would be possible
for a first iteration, but wouldn't be great.


Kind regards,

Matthias



Re: RFC: compression dictionaries for JSONB

From: Aleksander Alekseev
Matthias, Alvaro,

Many thanks for your comments and suggestions!

> Well, I for one would like access to manually add entries to the
> dictionary. What I'm not interested in is being required to manually
> update the dictionary; the ability to manually insert into the
> dictionary, however, is much appreciated.

Sure, I see no reason why we can't do it. This would also simplify
splitting the task into smaller ones. We could introduce only manually
updated dictionaries first, and then automate filling them for the
users who need this.

> If stored as strings, the references would go out of date when tables
> or columns are renamed or dropped. Similarly, you'd want to update the
> dictionary with common values in columns of that type, generally not
> columns of arbitrary other types. And you can't know the names of
> tables and columns in advance, so this would add a maintenance burden
> for the user whenever they add / change / remove a column of the
> dictionary type. Instead of storing 'update using data from table X,
> column Y' in the type, I think that adding it as a column option would
> be the better choice.

Agree, add / change / remove of a column should be handled
automatically. Just to clarify, by column option do you mean syntax
like ALTER TABLE ... ALTER COLUMN ... etc, right? I didn't think of
extending this part of the syntax. That would be a better choice
indeed.

> I'm a bit on the fence about this. We do use this for sequences, but
> alternatively we might want to use ALTER TYPE jsondict;

Agree, ALTER TYPE seems to be a better choice than SELECT function().
This would make the interface more consistent.
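
Something like this, perhaps (hypothetical syntax, just to sketch the
idea):

```
ALTER TYPE mydict UPDATE DICTIONARY;
-- instead of: SELECT jsonb_update_dictionary('mydict');
```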

> Overall, I'm glad to see this take off, but I do want some
> clarifications regarding the direction that this is going towards.
> [...]
> Actually, why is it a JSONB_DICTIONARY and not like:
> CREATE TYPE name AS DICTIONARY (
>  base_type = JSONB, ...
> );
> so that it is possible to use the infrastructure for other things?  For
> example, perhaps PostGIS geometries could benefit from it -- or even
> text or xml columns.

So the question is whether we want to extend the capabilities of a
single type, i.e. JSONB, or to add functionality that would work for
various types. I see the following pros and cons of both approaches.

Modifying JSONB may at some point allow partially decompressing only
the parts of the document that are needed for a given query. However,
the compression scheme will probably be less efficient. There could
also be difficulties with respect to backward compatibility, and this
is going to work only with JSONB.

An alternative approach, CREATE TYPE ... AS DICTIONARY OF <type> or
something like it, would work not only for JSONB, but also for TEXT,
XML, arrays, and PostGIS. By the way, this was suggested before [1].
Another advantage is that, all things being equal, the compression
scheme could be more efficient. The feature will not affect existing
types. The main disadvantage is that implementing partial
decompression would be very difficult and/or impractical.

Personally, I would say that the 2nd option, CREATE TYPE ... AS
DICTIONARY OF <type>, seems to be more useful. To my knowledge, not
many users would care much about partial decompression, and this is
the only real advantage of the 1st option I see. Also, this is how
ZSON is implemented. It doesn't care about the underlying type and
treats it as a BLOB. Thus the proofs of usefulness I provided above
are not quite valid for the 1st option. Probably unrelated, but 2nd
option would be even easier for me to implement since I already solved
a similar task.

All in all, I suggest focusing on the 2nd option with universal
compression dictionaries. Naturally, the focus will be on JSONB first.
But we will be able to extend this functionality for other types as
well.

Thoughts?

[1]: https://www.postgresql.org/message-id/CANP8%2BjLT8r03LJsw%3DdUSFxBh5pRB%2BUCKvS3BUT-dd4JPRDb3tg%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev



Re: RFC: compression dictionaries for JSONB

From: Matthias van de Meent
On Mon, 11 Oct 2021 at 15:25, Aleksander Alekseev
<aleksander@timescale.com> wrote:
> Agree, add / change / remove of a column should be handled
> automatically. Just to clarify, by column option do you mean syntax
> like ALTER TABLE ... ALTER COLUMN ... etc, right?

Correct, either SET (option) or maybe using typmod (which is hack-ish,
but could save on some bytes of storage)

> I didn't think of
> extending this part of the syntax. That would be a better choice
> indeed.

> > Overall, I'm glad to see this take off, but I do want some
> > clarifications regarding the direction that this is going towards.
> > [...]
> > Actually, why is it a JSONB_DICTIONARY and not like:
> > CREATE TYPE name AS DICTIONARY (
> >  base_type = JSONB, ...
> > );
> > so that it is possible to use the infrastructure for other things?  For
> > example, perhaps PostGIS geometries could benefit from it -- or even
> > text or xml columns.
>
> So the question is whether we want to extend the capabilities of a
> single type, i.e. JSONB, or to add functionality that would work for
> various types. I see the following pros and cons of both approaches.
>
> Modifying JSONB may at some point allow partially decompressing only
> the parts of the document that are needed for a given query. However,
> the compression scheme will probably be less efficient. There could
> also be difficulties with respect to backward compatibility, and this
> is going to work only with JSONB.

Assuming this above is option 1. If I understand correctly, this
option was 'adapt the data type so that it understands how to handle a
shared dictionary, decreasing storage requirements'.

> An alternative approach, CREATE TYPE ... AS DICTIONARY OF <type> or
> something like it, would work not only for JSONB, but also for TEXT,
> XML, arrays, and PostGIS. By the way, this was suggested before [1].
> Another advantage is that, all things being equal, the compression
> scheme could be more efficient. The feature will not affect existing
> types. The main disadvantage is that implementing partial
> decompression would be very difficult and/or impractical.

Assuming this was the 2nd option. If I understand correctly, this
option is effectively 'adapt or wrap TOAST to understand and handle
dictionaries for dictionary encoding common values'.

> Personally, I would say that the 2nd option, CREATE TYPE ... AS
> DICTIONARY OF <type>, seems to be more useful. To my knowledge, not
> many users would care much about partial decompression, and this is
> the only real advantage of the 1st option I see. Also, this is how
> ZSON is implemented. It doesn't care about the underlying type and
> treats it as a BLOB. Thus the proofs of usefulness I provided above
> are not quite valid for the 1st option. Probably unrelated, but 2nd
> option would be even easier for me to implement since I already solved
> a similar task.
>
> All in all, I suggest focusing on the 2nd option with universal
> compression dictionaries. Naturally, the focus will be on JSONB first.
> But we will be able to extend this functionality for other types as
> well.
>
> Thoughts?

I think that a 'universal dictionary encoder' would be useful, but
that a data type might also have good reason to implement their
replacement methods by themselves for better overall performance (such
as maintaining partial detoast support in dictionaried items, or
overall lower memory footprint, or ...). As such, I'd really
appreciate it if Option 1 is not ruled out by any implementation of
Option 2.

Kind regards,

Matthias van de Meent



Re: RFC: compression dictionaries for JSONB

From: Aleksander Alekseev
Hi Matthias,

> Assuming this above is option 1. If I understand correctly, this
> option was 'adapt the data type so that it understands how to handle a
> shared dictionary, decreasing storage requirements'.
> [...]
> Assuming this was the 2nd option. If I understand correctly, this
> option is effectively 'adapt or wrap TOAST to understand and handle
> dictionaries for dictionary encoding common values'.

Yes, exactly.

> I think that a 'universal dictionary encoder' would be useful, but
> that a data type might also have good reason to implement their
> replacement methods by themselves for better overall performance (such
> as maintaining partial detoast support in dictionaried items, or
> overall lower memory footprint, or ...). As such, I'd really
> appreciate it if Option 1 is not ruled out by any implementation of
> Option 2.

I agree, having the benefits of both approaches in one feature would be
great. However, I'm having some difficulties imagining what the
implementation would look like in light of the pros and cons stated
above. I could use some help here.

One approach I can think of is introducing a new entity, let's call it
"dictionary compression method". The idea is similar to access methods
and tableam's. There is a set of callbacks the dictionary compression
method should implement, some are mandatory, some can be set to NULL.
Users can specify the compression method for the dictionary:

```
CREATE TYPE name AS DICTIONARY OF JSONB (
  compression_method = 'jsonb_best_compression'
  -- compression_methods = 'jsonb_fastest_partial_decompression'
  -- if not specified, some default compression method is used
);
```
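
For comparison, this is how table access methods are registered today;
a "dictionary compression method" could presumably be wired up through
a similar handler-function mechanism (no such command exists yet, of
course):

```
CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
```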

JSONB is maybe not the best example of the type for which people may
need multiple compression methods in practice. But I can imagine how
overwriting a compression method for, let's say, arrays in an
extension could be beneficial depending on the application.

This approach will make an API well-defined and, more importantly,
extendable. In the future, we could add additional (optional) methods
for particular scenarios, like partial decompression.

Does it sound like a reasonable approach?

-- 
Best regards,
Aleksander Alekseev



Re: RFC: compression dictionaries for JSONB

From: Matthias van de Meent
On Wed, 13 Oct 2021 at 11:48, Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi Matthias,
>
> > Assuming this above is option 1. If I understand correctly, this
> > option was 'adapt the data type so that it understands how to handle a
> > shared dictionary, decreasing storage requirements'.
> > [...]
> > Assuming this was the 2nd option. If I understand correctly, this
> > option is effectively 'adapt or wrap TOAST to understand and handle
> > dictionaries for dictionary encoding common values'.
>
> Yes, exactly.
>
> > I think that a 'universal dictionary encoder' would be useful, but
> > that a data type might also have good reason to implement their
> > replacement methods by themselves for better overall performance (such
> > as maintaining partial detoast support in dictionaried items, or
> > overall lower memory footprint, or ...). As such, I'd really
> > appreciate it if Option 1 is not ruled out by any implementation of
> > Option 2.
>
> I agree, having the benefits of both approaches in one feature would be
> great. However, I'm having some difficulties imagining what the
> implementation would look like in light of the pros and cons stated
> above. I could use some help here.
>
> One approach I can think of is introducing a new entity, let's call it
> "dictionary compression method". The idea is similar to access methods
> and tableam's. There is a set of callbacks the dictionary compression
> method should implement, some are mandatory, some can be set to NULL.

You might also want to look into the 'pluggable compression support'
[0] and 'Custom compression methods' [1] threads for inspiration, as
that seems very similar to what was originally proposed there. (†)

One important difference from those discussed at [0][1] is that the
compression proposed here is at the type level, while the compression
proposed in both 'Pluggable compression support' and 'Custom
compression methods' is at the column / table / server level.

> Users can specify the compression method for the dictionary:
>
> ```
> CREATE TYPE name AS DICTIONARY OF JSONB (
>   compression_method = 'jsonb_best_compression'
>   -- compression_methods = 'jsonb_fastest_partial_decompression'
>   -- if not specified, some default compression method is used
> );
> ```
>
> JSONB is maybe not the best example of the type for which people may
> need multiple compression methods in practice. But I can imagine how
> overwriting a compression method for, let's say, arrays in an
> extension could be beneficial depending on the application.
>
> This approach will make an API well-defined and, more importantly,
> extendable. In the future, we could add additional (optional) methods
> for particular scenarios, like partial decompression.
>
> Does it sound like a reasonable approach?

Yes, I think that's doable.


Kind regards,

Matthias

(†): 'Custom compression methods' eventually got committed in an
entirely different state by way of commit bbe0a81db, where LZ4 is
now a TOAST compression option that can be configured at the column /
system level. This is a hard-coded compression method, so no
infrastructure (or at least, no API) is available for custom
compression methods in that code.

[0] https://www.postgresql.org/message-id/flat/20130614230142.GC19641%40awork2.anarazel.de
[1] https://www.postgresql.org/message-id/flat/20170907194236.4cefce96@wp.localdomain



Re: RFC: compression dictionaries for JSONB

From: Aleksander Alekseev
Hi hackers,

Many thanks for all your great feedback!

Please see the follow-up thread '[PATCH] Compression dictionaries for JSONB':

https://postgr.es/m/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev