Re: Add ZSON extension to /contrib/ - Mailing list pgsql-hackers
From | Matthias van de Meent |
---|---|
Subject | Re: Add ZSON extension to /contrib/ |
Date | |
Msg-id | CAEze2WheMusc73UZ5TpfiAGQ=rRwSSgr0y3j9DEVAQgQFwneRA@mail.gmail.com Whole thread Raw |
In response to | Re: Add ZSON extension to /contrib/ (Magnus Hagander <magnus@hagander.net>) |
Responses |
Re: Add ZSON extension to /contrib/
|
List | pgsql-hackers |
On Tue, 25 May 2021 at 13:32, Magnus Hagander <magnus@hagander.net> wrote: > > On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev > <aleksander@timescale.com> wrote: > > > > Hi hackers, > > > > Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. The extension introduces the new ZSON type,which is 100% compatible with JSONB but uses a shared dictionary of strings most frequently used in given JSONB documentsfor compression. These strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4) applies if the documentis large enough by common PostgreSQL logic. Under certain conditions (many large documents), this saves disk space,memory and increases the overall performance. More details can be found in README on GitHub. > > > > The extension was accepted warmly and instantaneously I got several requests to submit it to /contrib/ so people usingAmazon RDS and similar services could enjoy it too. Do note that e.g. postgis is not in contrib, but is available in e.g. RDS. > > Back then I was not sure if the extension is mature enough and if it lacks any additional features required to solvethe real-world problems of the users. Time showed, however, that people are happy with the extension as it is. Therewere several minor issues discovered, but they were fixed back in 2017. The extension never experienced any compatibilityproblems with the next major release of PostgreSQL. > > > > So my question is if the community may consider adding ZSON to /contrib/. If this is the case I will add this threadto the nearest CF and submit a corresponding patch. I like the idea of the ZSON type, but I'm somewhat disappointed by its current limitations: - There is only one active shared dictionary (as a user I would want distinct dictionaries for each use case, similar to ENUM: each ENUM type has their own limit of 2**31 (?) values) - There is no provided method to manually specify the dictionary (only "zson_learn", which constructs a new dictionary) - You cannot add to the dictionary (equiv. to ALTER TYPE enum_type ADD VALUE), you must create a new one. Apart from that, I noticed the following more technical points, for if you submit it as-is as a patch: - Each dictionary uses a lot of memory, regardless of the number of actual stored keys. For 32-bit systems the base usage of a dictionary without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be almost 1MB, and for 64-bit it would be 1.7MB. That is significantly more than I'd want to install. - You call gettimeofday() in both dict_get and in get_current_dict_id. These functions can be called in short and tight loops (for small GSON fields), in which case it would add significant overhead through the implied syscalls. - The compression method you've chosen seems to extract most common strings from the JSONB table, and then use that as a pre-built dictionary for doing some dictionary encoding on the on-disk format of the jsonb structure. Although I fully understand that this makes the system quite easy to reason about, it does mean that you're deTOASTing the full GSON field, and that the stored bytestring will not be structured / doesn't work well with current debuggers. > If the extension is mature enough, why make it an extension in > contrib, and not instead either enhance the existing jsonb type with > it or make it a built-in type? I don't think that this datatype (that supplies a basic but effective compression algorithm over JSONB) is fit for core as-is. I have also thought about building a similar type, but one that would be more like ENUM: An extension on the JSONB datatype, which has some list of common 'well-known' values that will be substituted, and to which later more substitutable values can be added (e.g. CREATE TYPE ... AS JSONB_DICTIONARY ('"commonly_used_key"', '"very_long_string_that_appears_often"', '[{"structure": "that", "appears": "often"}]') or something similar). That would leave JSONB just as a JSONB_DICTIONARY type without any substitutable values. These specialized JSONB types could then be used as a specification for table columns, custom types, et cetera. Some of the reasons I've not yet built such type is me not being familiar with the jsonb- and enum-code (which I suspect to be critical for an efficient implementation of such type), although whilst researching I've noticed that it is possible to use most of the JSONB infrastructure / read older jsonb values, as there are still some JEntry type masks available which could flag such substitutions. With regards, Matthias van de Meent
pgsql-hackers by date: