Re: Add ZSON extension to /contrib/ - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Add ZSON extension to /contrib/
Date
Msg-id CAEze2WheMusc73UZ5TpfiAGQ=rRwSSgr0y3j9DEVAQgQFwneRA@mail.gmail.com
Whole thread Raw
In response to Re: Add ZSON extension to /contrib/  (Magnus Hagander <magnus@hagander.net>)
Responses Re: Add ZSON extension to /contrib/
List pgsql-hackers
On Tue, 25 May 2021 at 13:32, Magnus Hagander <magnus@hagander.net> wrote:
>
> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > Hi hackers,
> >
> > Back in 2016 while being at PostgresPro I developed the ZSON extension [1]. The extension introduces the new ZSON
type,which is 100% compatible with JSONB but uses a shared dictionary of strings most frequently used in given JSONB
documentsfor compression. These strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4) applies if the
documentis large enough by common PostgreSQL logic. Under certain conditions (many large documents), this saves disk
space,memory and increases the overall performance. More details can be found in README on GitHub. 
> >
> > The extension was accepted warmly and instantaneously I got several requests to submit it to /contrib/ so people
usingAmazon RDS and similar services could enjoy it too. 

Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.

> > Back then I was not sure if the extension is mature enough and if it lacks any additional features required to
solvethe real-world problems of the users. Time showed, however, that people are happy with the extension as it is.
Therewere several minor issues discovered, but they were fixed back in 2017. The extension never experienced any
compatibilityproblems with the next major release of PostgreSQL. 
> >
> > So my question is if the community may consider adding ZSON to /contrib/. If this is the case I will add this
threadto the nearest CF and submit a corresponding patch. 

I like the idea of the ZSON type, but I'm somewhat disappointed by its
current limitations:

- There is only one active shared dictionary (as a user I would want
distinct dictionaries for each use case, similar to ENUM: each ENUM
type has their own limit of 2**31 (?) values)
- There is no provided method to manually specify the dictionary (only
"zson_learn", which constructs a new dictionary)
- You cannot add to the dictionary (equiv. to ALTER TYPE enum_type ADD
VALUE), you must create a new one.

Apart from that, I noticed the following more technical points, for if
you submit it as-is as a patch:

- Each dictionary uses a lot of memory, regardless of the number of
actual stored keys. For 32-bit systems the base usage of a dictionary
without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
more than I'd want to install.
- You call gettimeofday() in both dict_get and in get_current_dict_id.
These functions can be called in short and tight loops (for small GSON
fields), in which case it would add significant overhead through the
implied syscalls.
- The compression method you've chosen seems to extract most common
strings from the JSONB table, and then use that as a pre-built
dictionary for doing some dictionary encoding on the on-disk format of
the jsonb structure. Although I fully understand that this makes the
system quite easy to reason about, it does mean that you're deTOASTing
the full GSON field, and that the stored bytestring will not be
structured / doesn't work well with current debuggers.

> If the extension is mature enough, why make it an extension in
> contrib, and not instead either enhance the existing jsonb type with
> it or make it a built-in type?

I don't think that this datatype (that supplies a basic but effective
compression algorithm over JSONB) is fit for core as-is.

I have also thought about building a similar type, but one that would
be more like ENUM: An extension on the JSONB datatype, which has some
list of common 'well-known' values that will be substituted, and to
which later more substitutable values can be added (e.g. CREATE TYPE
... AS JSONB_DICTIONARY  ('"commonly_used_key"',
'"very_long_string_that_appears_often"', '[{"structure": "that",
"appears": "often"}]') or something similar). That would leave JSONB
just as a JSONB_DICTIONARY type without any substitutable values.

These specialized JSONB types could then be used as a specification
for table columns, custom types, et cetera. Some of the reasons I've
not yet built such type is me not being familiar with the jsonb- and
enum-code (which I suspect to be critical for an efficient
implementation of such type), although whilst researching I've noticed
that it is possible to use most of the JSONB infrastructure / read
older jsonb values, as there are still some JEntry type masks
available which could flag such substitutions.

With regards,

Matthias van de Meent



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Add ZSON extension to /contrib/
Next
From: Tom Lane
Date:
Subject: Re: CALL versus procedures with output-only arguments