Thread: RFC: compression dictionaries for JSONB
Hi hackers,

== Background ==

This is a follow-up thread to `Add ZSON extension to /contrib/` [1]. The ZSON extension introduces a new type called ZSON, which is 100% compatible with JSONB but uses a shared dictionary of strings most frequently used by given JSONB documents for compression. See the thread for more details.

According to the feedback I got, the community generally liked the idea of adding an across-rows and across-tables compression capability to JSONB. What the community didn't like was:

1. Introducing a new data type in order to achieve this;
2. Updating compression dictionaries manually;
3. Some implementation details of ZSON, such as the limited dictionary size (2 ** 16 entries) and extensive usage of the gettimeofday() system call.

There was also a request for proof of the usefulness of this feature in practice. To be honest, I don't have solid proof that many users require this feature, or how many users that would be exactly. ZSON was originally developed because a customer of Postgres Professional requested it back in 2016. People approach me with questions from time to time; e.g. one user recently asked me how the extension can be compiled on Windows [2]. Andrew Dunstan reported that 2ndQuadrant (now EDB) has a fork of ZSON with some enhancements [3]. Konstantin Knizhnik reported that he worked on a similar extension [4]. Unfortunately, Andrew and Konstantin didn't give any more details (hopefully they will). But all in all, this indicates some demand.

Speaking of performance, some time ago I benchmarked ZSON on data similar to what that customer had [5]. The benchmark showed ~10% performance improvement in terms of TPS and ~50% saved disk space. The extension saved memory as well, which follows from the implementation; the exact amount of saved memory was not measured. This benchmark shouldn't be considered proof that all users will necessarily benefit from such a feature, but it indicates that some users could.

== Proposal ==

The proposal is to add support for compression dictionaries to JSONB.

In order to do this, the SQL syntax should be modified. The proposed syntax is based on Matthias van de Meent's idea [6]:

```
CREATE TYPE <type-name> AS JSONB_DICTIONARY (
    learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },
    autoupdate = false, -- true by default
    -- optional: other arguments, min/max string lengths, etc
);
```

Basically, this is an equivalent of zson_learn [7]. It will create an id -> string dictionary in the PostgreSQL catalog. When the user chooses `autoupdate = true`, the dictionary will be updated automatically by PostgreSQL (e.g. during VACUUM). This is the default value. The dictionary can also be updated manually:

```
SELECT jsonb_update_dictionary("type-name");
```

Other than that, the type works like a regular one. All the usual ALTER TYPE / DROP TYPE semantics are applicable. All the operators available to JSONB are also available to <type-name>.

Internally, <type-name> is represented similarly to JSONB. However, the strings from the dictionary are replaced with varints. This idea was borrowed from Tomas Vondra [8]. The dictionary size is limited to 2**28 entries. The limit can be easily extended in the future if necessary. Also, <type-name> stores the version of the dictionary used to compress the data. All in all, this is similar to how ZSON works.

The first implementation always decompresses <type-name> entirely. Partial compression and decompression can always be added later, transparently to the user.
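For illustration, here is a minimal usage sketch of the proposed type (the table, column and type names are hypothetical, assuming the syntax above is accepted as-is):

```
CREATE TYPE order_document AS JSONB_DICTIONARY (
    learn_by = { {"orders", "payload"} },
    autoupdate = false
);

CREATE TABLE orders(id bigint, payload order_document);

-- all the usual JSONB operators keep working on the dictionary-backed type
SELECT payload -> 'customer'
FROM orders
WHERE payload @> '{"status": "shipped"}';
```

From the user's point of view the column behaves exactly like JSONB; only the on-disk representation changes.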
== Looking for feedback ==

I would appreciate your feedback on this RFC. Is anything missing in the description of the feature? Do you think users need it? Can you think of a better user interface? Are there any corner cases worth considering? Any other comments and questions are welcome too!

I would like to implement this once consensus is reached on how the feature should look (and whether we need it at all). Any help (from co-authors, REVIEWERS!!!, technical writers, ...) would be much appreciated.

== Links ==

[1]: https://www.postgresql.org/message-id/flat/CAJ7c6TP3fCC9TNKJBQAcEf4c%3DL7XQZ7QvuUayLgjhNQMD_5M_A%40mail.gmail.com
[2]: https://github.com/postgrespro/zson/issues?q=is%3Aissue+is%3Aclosed
[3]: https://www.postgresql.org/message-id/6f3944ad-6924-5fed-580c-e72477733f04%40dunslane.net
[4]: https://github.com/postgrespro/jsonb_schema
[5]: https://github.com/postgrespro/zson/blob/master/docs/benchmark.md
[6]: https://www.postgresql.org/message-id/CAEze2WheMusc73UZ5TpfiAGQ%3DrRwSSgr0y3j9DEVAQgQFwneRA%40mail.gmail.com
[7]: https://github.com/postgrespro/zson#usage
[8]: https://www.postgresql.org/message-id/77356556-0634-5cde-f55e-cce739dc09b9%40enterprisedb.com

-- 
Best regards,
Aleksander Alekseev
Open-Source PostgreSQL Contributor at Timescale
On Fri, 8 Oct 2021 at 11:47, Aleksander Alekseev <aleksander@timescale.com> wrote:

> This is a follow-up thread to `Add ZSON extension to /contrib/` [1].
> The ZSON extension introduces a new type called ZSON, which is 100%
> compatible with JSONB but uses a shared dictionary of strings most
> frequently used by given JSONB documents for compression. See the
> thread for more details.

Great to see that you're still working on this! It would be great if we could get this into postgres. As such, I hope you can provide some clarifications on my questions and comments.

> According to the feedback I got, the community generally liked the
> idea of adding an across-rows and across-tables compression capability
> to JSONB. What the community didn't like was:
>
> 1. Introducing a new data type in order to achieve this;
> 2. Updating compression dictionaries manually;

Well, I for one would like access to manually add entries to the dictionary. What I'm not interested in is being required to manually update the dictionary; the ability to manually insert into the dictionary, however, is much appreciated.

> 3. Some implementation details of ZSON, such as the limited dictionary
> size (2 ** 16 entries) and extensive usage of the gettimeofday() system
> call.
>
> There was also a request for proof of the usefulness of this feature
> in practice.

More compact JSONB is never a bad idea: one reason to stick with JSON over JSONB is that JSON can use significantly less space than JSONB, if stored properly. So, improving the disk usage of JSONB is definitely worthwhile.

> == Proposal ==
>
> The proposal is to add support for compression dictionaries to JSONB.
>
> In order to do this, the SQL syntax should be modified. The proposed
> syntax is based on Matthias van de Meent's idea [6]:

Seems fine.

> ```
> CREATE TYPE <type-name> AS JSONB_DICTIONARY (
>     learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },

I'm having trouble understanding how this learn_by field would be used: if stored as strings, they would go out of date when tables or columns are renamed or dropped. Similarly, you'd want to update the dictionary with common values in columns of that type, generally not columns of arbitrary other types. You can't know the names of tables and columns in advance, so this would add a maintenance burden for the user when they add / change / remove a column of the dictionary type. Instead of storing 'use update data from table X column Y' in the type, I think that adding it as a column option would be the better choice.

I agree with an option for auto-update, though I don't think we have enough information to determine the default value (I'd err on the side of caution, going with 'off').

>     autoupdate = false, -- true by default
>     -- optional: other arguments, min/max string lengths, etc
> );
> ```

For dump/restore I think it would be very useful to allow export & import of these dictionaries, so that restored databases don't have the problem of starting cold. As such, `ALTER TYPE jsondict ADD ENTRY entry_value` would probably be useful, and maybe even `CREATE TYPE dict AS JSONB_DICTIONARY ('"entry_1"', '"entry_two"', '"entry_three"') WITH (option = optional)`.

> Basically, this is an equivalent of zson_learn [7]. It will create an
> id -> string dictionary in the PostgreSQL catalog. When the user
> chooses `autoupdate = true`, the dictionary will be updated
> automatically by PostgreSQL (e.g. during VACUUM). This is the
> default value.
> The dictionary can also be updated manually:
>
> ```
> SELECT jsonb_update_dictionary("type-name");
> ```

I'm a bit on the fence about this. We do use this for sequences, but alternatively we might want to use ALTER TYPE jsondict;

> Other than that, the type works like a regular one. All the usual
> ALTER TYPE / DROP TYPE semantics are applicable. All the operators
> available to JSONB are also available to <type-name>.
>
> Internally, <type-name> is represented similarly to JSONB. However, the
> strings from the dictionary are replaced with varints.

How do you propose to differentiate actual integers from these keyed strings, and / or actual strings from varints? Replacing _all_ strings doesn't seem like such a great idea. Related comment below.

> This idea was
> borrowed from Tomas Vondra [8]. The dictionary size is limited to
> 2**28 entries. The limit can be easily extended in the future if
> necessary. Also, <type-name> stores the version of the dictionary used
> to compress the data. All in all, this is similar to how ZSON works.

I appreciate this idea, but using that varint implementation is not a choice I'd make. In the jsonb storage format, we already encode the length of each value, so a varint shouldn't be necessary here. Next, as everything in jsonb storage is 4-byte aligned, a uint32 should suffice, or if we're being adventurous, we might even fit a uint29 identifier in the length field instead (at the cost of full backwards incompatibility). Lastly, we don't have a good format for varint now (numeric is close, but has significant overhead), so I'd say we should go with a fixed-size integer and accept that limitation.

My own suggestion would be updating the JSONB on-disk format with the following:

```
/* values stored in the type bits */
#define JENTRY_ISSTRING      0x00000000
#define JENTRY_ISNUMERIC     0x10000000
#define JENTRY_ISBOOL_FALSE  0x20000000
#define JENTRY_ISBOOL_TRUE   0x30000000
#define JENTRY_ISNULL        0x40000000
#define JENTRY_ISCONTAINER   0x50000000 /* array or object */
+#define JENTRY_ISSYMBOL     0x60000000 /* Lookup in dictionary */
```

And then store the symbol in the JEntry (either in the JENTRY_OFFLENMASK or in the actual referred content), whilst maybe using some bits in this for e.g. type hints (whether the item in the dictionary is an array, object, string or numeric).

I really would like this to support non-string types, because jsonb structures can grow quite large, even with only small strings: e.g. `{..., "tags": {"customer": "blabla"}}` could be dictionaried to `{..., "tags": {'1: '2}}`, but potentially also to `{..., "tags": '1}`. Of these, the second would be more efficient overall for storage and retrieval.

> The first implementation always decompresses <type-name> entirely.
> Partial compression and decompression can always be added later,
> transparently to the user.

Are you talking about the TOAST compression and decompression, or about a different compression scheme? If a different scheme, is it only replacing the strings in the jsonb-tree with their dictionary identifiers, and replacing the symbols in the jsonb-tree with text (all through the JSONB internals), or are you proposing an actual compression scheme over the stored jsonb bytes (effectively wrapping the jsonb IO functions)?

Overall, I'm glad to see this take off, but I do want some clarifications regarding the direction that this is going towards.

Kind regards,

Matthias
On 2021-Oct-08, Matthias van de Meent wrote:

> On Fri, 8 Oct 2021 at 11:47, Aleksander Aleksеev
> <aleksander@timescale.com> wrote:
>
> > In order to do this, the SQL syntax should be modified. The proposed
> > syntax is based on Matthias van de Meent's idea [6]:
>
> Seems fine.
>
> > ```
> > CREATE TYPE <type-name> AS JSONB_DICTIONARY (
> >     learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },

Actually, why is it a JSONB_DICTIONARY and not like

CREATE TYPE name AS DICTIONARY (
    base_type = JSONB, ...
);

so that it is possible to use the infrastructure for other things? For example, perhaps PostGIS geometries could benefit from it -- or even text or xml columns.

The pg_type entry would have to provide some support procedure that makes use of the dictionary in some way. This seems better than tying the SQL object to a specific type.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
#error "Operator lives in the wrong universe"
  ("Use of cookies in real-time system development", M. Gleixner, M. Mc Guire)
On Fri, 8 Oct 2021 at 17:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Oct-08, Matthias van de Meent wrote:
>
> > > In order to do this, the SQL syntax should be modified. The proposed
> > > syntax is based on Matthias van de Meent's idea [6]:
> >
> > Seems fine.
> >
> > > ```
> > > CREATE TYPE <type-name> AS JSONB_DICTIONARY (
> > >     learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },
>
> Actually, why is it a JSONB_DICTIONARY and not like
>
> CREATE TYPE name AS DICTIONARY (
>     base_type = JSONB, ...
> );

That's a good point, but if we're extending this syntax to allow the inclusion of other types, then I'd instead extend the syntax to the one below, so that the type of the dictionary entries is required in the syntax:

CREATE TYPE name AS DICTIONARY OF jsonb [ ( ...entries ) ] [ WITH ( ...options ) ];

> so that it is possible to use the infrastructure for other things? For
> example, perhaps PostGIS geometries could benefit from it -- or even
> text or xml columns.
>
> The pg_type entry would have to provide some support procedure that
> makes use of the dictionary in some way. This seems better than tying
> the SQL object to a specific type.

Agreed, but this might mean that much more effort would be required to get such a useful quality-of-life feature committed.

Kind regards,

Matthias
On 2021-Oct-08, Matthias van de Meent wrote:

> That's a good point, but if we're extending this syntax to allow the
> inclusion of other types, then I'd instead extend the syntax to the
> one below, so that the type of the dictionary entries is required in
> the syntax:
>
> CREATE TYPE name AS DICTIONARY OF jsonb [ ( ...entries ) ] [ WITH ( ...options ) ];

I don't think this gives you any guarantees of the sort you seem to expect. See CREATE AGGREGATE as a precedent where there are some options in the parenthesized options list you cannot omit.

> > The pg_type entry would have to provide some support procedure that
> > makes use of the dictionary in some way. This seems better than tying
> > the SQL object to a specific type.
>
> Agreed, but this might mean that much more effort would be required to
> get such a useful quality-of-life feature committed.

I don't understand what you mean by that. I'm not saying that the patch has to provide support for any additional datatypes. Its only obligation would be to provide a new column in pg_type which is zero for all rows except jsonb, and in that row it is the OID of a jsonb_dictionary() function that's called from all the right places and receives all the right arguments.

-- 
Álvaro Herrera           Valdivia, Chile  —  https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso" (Carlos Caszeli)
On Fri, 8 Oct 2021 at 21:21, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Oct-08, Matthias van de Meent wrote:
>
> > That's a good point, but if we're extending this syntax to allow the
> > inclusion of other types, then I'd instead extend the syntax to the
> > one below, so that the type of the dictionary entries is required in
> > the syntax:
> >
> > CREATE TYPE name AS DICTIONARY OF jsonb [ ( ...entries ) ] [ WITH ( ...options ) ];
>
> I don't think this gives you any guarantees of the sort you seem to
> expect. See CREATE AGGREGATE as a precedent where there are some
> options in the parenthesized options list you cannot omit.

Bikeshedding on syntax, I guess? I don't really like 'required options' patterns. If you're required to use/specify an option, then it's not optional, and should thus not be included in the group of 'options'.

> > > The pg_type entry would have to provide some support procedure that
> > > makes use of the dictionary in some way. This seems better than tying
> > > the SQL object to a specific type.
> >
> > Agreed, but this might mean that much more effort would be required to
> > get such a useful quality-of-life feature committed.
>
> I don't understand what you mean by that. I'm not saying that the patch
> has to provide support for any additional datatypes. Its only
> obligation would be to provide a new column in pg_type which is zero for
> all rows except jsonb, and in that row it is the OID of a
> jsonb_dictionary() function that's called from all the right places and
> receives all the right arguments.

This seems feasible to do, but I still have limited knowledge of the intricacies of the type system, and as such I don't see how this part would function. I was expecting something more along the lines of how array types seem to work: type _A is an array type, containing elements of type A, and its contained type is recorded in pg_type (typelem for arrays, typbasetype for domains). No special functions are defined on base types to allow their respective array types; that part is handled by the array infrastructure. Same for domain types.

Now that I think about it, we should still provide the information on _how_ to find the type functions for the dictionaried type: arrays and domains are generic, but dictionaries will require a deep understanding of the underlying type. So, yes, you are correct, there should be one more function, which would supply the necessary pg_type functions that CREATE TYPE ... AS DICTIONARY can then register in the pg_type entry of the dictionary type.

The alternative would initially be hardcoding this for the base types that have dictionary support, which definitely would be possible for a first iteration, but wouldn't be great.

Kind regards,

Matthias
Matthias, Alvaro,

Many thanks for your comments and suggestions!

> Well, I for one would like access to manually add entries to the
> dictionary. What I'm not interested in is being required to manually
> update the dictionary; the ability to manually insert into the
> dictionary, however, is much appreciated.

Sure, I see no reason why we can't do it. This would also simplify splitting the task into smaller ones. We could introduce only manually updated dictionaries first, and then automate filling them for the users who need this.

> If stored as strings, they would go out of date when tables or columns
> are renamed or dropped.
> Similarly, you'd want to update the dictionary with common values in
> columns of that type, generally not columns of arbitrary other types.
> You can't know the names of tables and columns in advance, so this
> would add a maintenance burden for the user when they add / change /
> remove a column of the dictionary type. Instead of storing 'use update
> data from table X column Y' in the type, I think that adding it as a
> column option would be the better choice.

Agreed, adding / changing / removing a column should be handled automatically. Just to clarify, by column option you mean syntax like ALTER TABLE ... ALTER COLUMN ..., right? I didn't think of extending this part of the syntax. That would be a better choice indeed.

> I'm a bit on the fence about this. We do use this for sequences, but
> alternatively we might want to use ALTER TYPE jsondict;

Agreed, ALTER TYPE seems to be a better choice than SELECT function(). This would make the interface more consistent.

> Overall, I'm glad to see this take off, but I do want some
> clarifications regarding the direction that this is going towards.
> [...]
> Actually, why is it a JSONB_DICTIONARY and not like:
> CREATE TYPE name AS DICTIONARY (
>     base_type = JSONB, ...
> );
> so that it is possible to use the infrastructure for other things? For
> example, perhaps PostGIS geometries could benefit from it -- or even
> text or xml columns.

So the question is whether we want to extend the capabilities of a single type, i.e. JSONB, or to add functionality that would work for various types. I see the following pros and cons of both approaches.

Modifying JSONB may at some point allow partially decompressing only the parts of the document that need to be decompressed for a given query. However, the compression scheme will probably be less efficient. There could also be difficulties with respect to backward compatibility, and this is going to work only with JSONB.

An alternative approach, CREATE TYPE ... AS DICTIONARY OF <type> or something like this, would work not only for JSONB but also for TEXT, XML, arrays, and PostGIS. By the way, this was suggested before [1]. Another advantage here is that, all things being equal, the compression scheme could be more efficient. The feature will not affect existing types. The main disadvantage is that implementing partial decompression would be very difficult and/or impractical.

Personally, I would say that the 2nd option, CREATE TYPE ... AS DICTIONARY OF <type>, seems to be more useful. To my knowledge, not many users would care much about partial decompression, and this is the only real advantage of the 1st option I see. Also, this is how ZSON is implemented: it doesn't care about the underlying type and treats it as a BLOB. Thus the proofs of usefulness I provided above are not quite valid for the 1st option.
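For illustration only, here is a hypothetical sketch of how the 2nd option might look for a plain TEXT column, following the DICTIONARY OF syntax discussed above (the type name, seeded entries, and option are made up, not settled syntax):

```
CREATE TYPE log_line AS DICTIONARY OF text
    ('duration: ', 'connection received: ', 'ERROR:')  -- pre-seeded entries
    WITH (autoupdate = true);

CREATE TABLE server_log(ts timestamptz, message log_line);
```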
Probably unrelated, but the 2nd option would also be easier for me to implement, since I have already solved a similar task.

All in all, I suggest focusing on the 2nd option with universal compression dictionaries. Naturally, the focus will be on JSONB first. But we will be able to extend this functionality to other types as well.

Thoughts?

[1]: https://www.postgresql.org/message-id/CANP8%2BjLT8r03LJsw%3DdUSFxBh5pRB%2BUCKvS3BUT-dd4JPRDb3tg%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev
On Mon, 11 Oct 2021 at 15:25, Aleksander Alekseev <aleksander@timescale.com> wrote:

> Agreed, adding / changing / removing a column should be handled
> automatically. Just to clarify, by column option you mean syntax
> like ALTER TABLE ... ALTER COLUMN ..., right?

Correct, either SET (option) or maybe using the typmod (which is hack-ish, but could save some bytes of storage).

> I didn't think of
> extending this part of the syntax. That would be a better choice
> indeed.

> > Overall, I'm glad to see this take off, but I do want some
> > clarifications regarding the direction that this is going towards.
> > [...]
> > Actually, why is it a JSONB_DICTIONARY and not like:
> > CREATE TYPE name AS DICTIONARY (
> >     base_type = JSONB, ...
> > );
> > so that it is possible to use the infrastructure for other things? For
> > example, perhaps PostGIS geometries could benefit from it -- or even
> > text or xml columns.
>
> So the question is whether we want to extend the capabilities of a single
> type, i.e. JSONB, or to add functionality that would work for various
> types. I see the following pros and cons of both approaches.
>
> Modifying JSONB may at some point allow partially decompressing only
> the parts of the document that need to be decompressed for a given
> query. However, the compression scheme will probably be less
> efficient. There could also be difficulties with respect to backward
> compatibility, and this is going to work only with JSONB.

Assuming the above is option 1: if I understand correctly, this option is 'adapt the data type so that it understands how to handle a shared dictionary, decreasing storage requirements'.

> An alternative approach, CREATE TYPE ... AS DICTIONARY OF <type> or
> something like this, would work not only for JSONB but also for TEXT,
> XML, arrays, and PostGIS. By the way, this was suggested before [1].
> Another advantage here is that, all things being equal, the compression
> scheme could be more efficient. The feature will not affect existing
> types. The main disadvantage is that implementing partial
> decompression would be very difficult and/or impractical.

Assuming this is the 2nd option: if I understand correctly, this option is effectively 'adapt or wrap TOAST to understand and handle dictionaries for dictionary encoding common values'.

> Personally, I would say that the 2nd option, CREATE TYPE ... AS
> DICTIONARY OF <type>, seems to be more useful. To my knowledge, not
> many users would care much about partial decompression, and this is
> the only real advantage of the 1st option I see. Also, this is how
> ZSON is implemented: it doesn't care about the underlying type and
> treats it as a BLOB. Thus the proofs of usefulness I provided above
> are not quite valid for the 1st option. Probably unrelated, but the
> 2nd option would also be easier for me to implement, since I have
> already solved a similar task.
>
> All in all, I suggest focusing on the 2nd option with universal
> compression dictionaries. Naturally, the focus will be on JSONB first.
> But we will be able to extend this functionality to other types as
> well.
>
> Thoughts?

I think that a 'universal dictionary encoder' would be useful, but that a data type might also have good reason to implement its replacement methods by itself for better overall performance (such as maintaining partial detoast support in dictionaried items, or overall lower memory footprint, or ...). As such, I'd really appreciate it if Option 1 is not ruled out by any implementation of Option 2.
Kind regards, Matthias van de Meent
Hi Matthias,

> Assuming the above is option 1: if I understand correctly, this
> option is 'adapt the data type so that it understands how to handle a
> shared dictionary, decreasing storage requirements'.
> [...]
> Assuming this is the 2nd option: if I understand correctly, this
> option is effectively 'adapt or wrap TOAST to understand and handle
> dictionaries for dictionary encoding common values'.

Yes, exactly.

> I think that a 'universal dictionary encoder' would be useful, but
> that a data type might also have good reason to implement its
> replacement methods by itself for better overall performance (such
> as maintaining partial detoast support in dictionaried items, or
> overall lower memory footprint, or ...). As such, I'd really
> appreciate it if Option 1 is not ruled out by any implementation of
> Option 2.

I agree, having the benefits of the two approaches in one feature would be great. However, I'm having some difficulties imagining what the implementation would look like in light of the pros and cons stated above. I could use some help here.

One approach I can think of is introducing a new entity, let's call it a "dictionary compression method". The idea is similar to access methods and table AMs. There is a set of callbacks the dictionary compression method should implement; some are mandatory, some can be set to NULL. Users can specify the compression method for the dictionary:

```
CREATE TYPE name AS DICTIONARY OF JSONB (
    compression_method = 'jsonb_best_compression'
    -- compression_method = 'jsonb_fastest_partial_decompression'
    -- if not specified, some default compression method is used
);
```

JSONB is maybe not the best example of a type for which people may need multiple compression methods in practice. But I can imagine how overriding a compression method for, let's say, arrays in an extension could be beneficial depending on the application.

This approach will make the API well-defined and, more importantly, extendable. In the future, we could add additional (optional) methods for particular scenarios, like partial decompression.

Does it sound like a reasonable approach?

-- 
Best regards,
Aleksander Alekseev
On Wed, 13 Oct 2021 at 11:48, Aleksander Alekseev <aleksander@timescale.com> wrote:
>
> Hi Matthias,
>
> > Assuming the above is option 1: if I understand correctly, this
> > option is 'adapt the data type so that it understands how to handle a
> > shared dictionary, decreasing storage requirements'.
> > [...]
> > Assuming this is the 2nd option: if I understand correctly, this
> > option is effectively 'adapt or wrap TOAST to understand and handle
> > dictionaries for dictionary encoding common values'.
>
> Yes, exactly.
>
> > I think that a 'universal dictionary encoder' would be useful, but
> > that a data type might also have good reason to implement its
> > replacement methods by itself for better overall performance (such
> > as maintaining partial detoast support in dictionaried items, or
> > overall lower memory footprint, or ...). As such, I'd really
> > appreciate it if Option 1 is not ruled out by any implementation of
> > Option 2.
>
> I agree, having the benefits of the two approaches in one feature would
> be great. However, I'm having some difficulties imagining what the
> implementation would look like in light of the pros and cons stated
> above. I could use some help here.
>
> One approach I can think of is introducing a new entity, let's call it
> a "dictionary compression method". The idea is similar to access methods
> and table AMs. There is a set of callbacks the dictionary compression
> method should implement; some are mandatory, some can be set to NULL.

You might also want to look into the 'Pluggable compression support' [0] and 'Custom compression methods' [1] threads for inspiration, as that seems very similar to what was originally proposed there. (†)

One important difference from what was discussed at [0][1] is that the compression proposed here is at the type level, while the compression proposed in both 'Pluggable compression support' and 'Custom compression methods' is at the column / table / server level.

> Users can specify the compression method for the dictionary:
>
> ```
> CREATE TYPE name AS DICTIONARY OF JSONB (
>     compression_method = 'jsonb_best_compression'
>     -- compression_method = 'jsonb_fastest_partial_decompression'
>     -- if not specified, some default compression method is used
> );
> ```
>
> JSONB is maybe not the best example of a type for which people may
> need multiple compression methods in practice. But I can imagine how
> overriding a compression method for, let's say, arrays in an
> extension could be beneficial depending on the application.
>
> This approach will make the API well-defined and, more importantly,
> extendable. In the future, we could add additional (optional) methods
> for particular scenarios, like partial decompression.
>
> Does it sound like a reasonable approach?

Yes, I think that's doable.

Kind regards,

Matthias

(†): 'Custom compression methods' eventually got committed in an entirely different state by way of commit bbe0a81db, where LZ4 is now a TOAST compression option that can be configured at the column / system level. This is a hard-coded compression method, so no infrastructure (or at least, API) is available for custom compression methods in that code.

[0] https://www.postgresql.org/message-id/flat/20130614230142.GC19641%40awork2.anarazel.de
[1] https://www.postgresql.org/message-id/flat/20170907194236.4cefce96@wp.localdomain
Hi hackers,

Many thanks for all your great feedback! Please see the follow-up thread '[PATCH] Compression dictionaries for JSONB':

https://postgr.es/m/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev