Thread: Add ZSON extension to /contrib/

Add ZSON extension to /contrib/

From
Aleksander Alekseev
Date:
Hi hackers,

Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The extension introduces a new ZSON type, which is 100% compatible with JSONB but uses a shared dictionary of the strings most frequently used in given JSONB documents for compression. These strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4) compression applies if the document is large enough, per the usual PostgreSQL logic. Under certain conditions (many large documents), this saves disk space and memory and increases overall performance. More details can be found in the README on GitHub.
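
For those who haven't tried it, typical usage looks roughly like this (an illustrative sketch based on the README; the table and column names are made up):

```
CREATE EXTENSION zson;

-- build a shared dictionary from the strings found in existing documents
SELECT zson_learn('{{"mytable", "mydoc"}}');

-- store documents as ZSON; casts to and from JSONB are provided
ALTER TABLE mytable ALTER COLUMN mydoc TYPE zson;
SELECT (mydoc::jsonb) -> 'some_key' FROM mytable;
```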

The extension was received warmly, and I immediately got several requests to submit it to /contrib/ so that people using Amazon RDS and similar services could enjoy it too. Back then I was not sure whether the extension was mature enough, or whether it lacked additional features required to solve users' real-world problems. Time showed, however, that people are happy with the extension as it is. Several minor issues were discovered, but they were fixed back in 2017. The extension has never had compatibility problems with a new major release of PostgreSQL.

So my question is whether the community would consider adding ZSON to /contrib/. If so, I will add this thread to the nearest CF and submit a corresponding patch.

[1]: https://github.com/postgrespro/zson

--
Best regards,
Aleksander Alekseev
Open-Source PostgreSQL Contributor at Timescale

Re: Add ZSON extension to /contrib/

From
Magnus Hagander
Date:
On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi hackers,
>
> Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The extension introduces a new ZSON type,
> which is 100% compatible with JSONB but uses a shared dictionary of the strings most frequently used in given JSONB
> documents for compression. These strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4) compression
> applies if the document is large enough, per the usual PostgreSQL logic. Under certain conditions (many large
> documents), this saves disk space and memory and increases overall performance. More details can be found in the
> README on GitHub.
>
> The extension was received warmly, and I immediately got several requests to submit it to /contrib/ so that people
> using Amazon RDS and similar services could enjoy it too. Back then I was not sure whether the extension was mature
> enough, or whether it lacked additional features required to solve users' real-world problems. Time showed, however,
> that people are happy with the extension as it is. Several minor issues were discovered, but they were fixed back in
> 2017. The extension has never had compatibility problems with a new major release of PostgreSQL.
>
> So my question is whether the community would consider adding ZSON to /contrib/. If so, I will add this thread to
> the nearest CF and submit a corresponding patch.

If the extension is mature enough, why make it an extension in
contrib, and not instead either enhance the existing jsonb type with
it or make it a built-in type?

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Add ZSON extension to /contrib/

From
Andrew Dunstan
Date:
On 5/25/21 6:55 AM, Aleksander Alekseev wrote:
> Hi hackers,
>
> Back in 2016, while at PostgresPro, I developed the ZSON extension
> [1]. The extension introduces a new ZSON type, which is 100%
> compatible with JSONB but uses a shared dictionary of the strings
> most frequently used in given JSONB documents for compression. These
> strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4)
> compression applies if the document is large enough, per the usual
> PostgreSQL logic. Under certain conditions (many large documents),
> this saves disk space and memory and increases overall performance.
> More details can be found in the README on GitHub.
>
> The extension was received warmly, and I immediately got several
> requests to submit it to /contrib/ so that people using Amazon RDS
> and similar services could enjoy it too. Back then I was not sure
> whether the extension was mature enough, or whether it lacked
> additional features required to solve users' real-world problems.
> Time showed, however, that people are happy with the extension as it
> is. Several minor issues were discovered, but they were fixed back in
> 2017. The extension has never had compatibility problems with a new
> major release of PostgreSQL.
>
> So my question is whether the community would consider adding ZSON
> to /contrib/. If so, I will add this thread to the nearest CF and
> submit a corresponding patch.
>
> [1]: https://github.com/postgrespro/zson
>
We (2ndQuadrant, now part of EDB) made some enhancements to ZSON a few
years ago, and I have permission to contribute those if this proposal is
adopted. From the README:

1. There is an option to make zson_learn only process object keys,
rather than field values.

```
select zson_learn('{{table1,col1}}',true);
```

2. Strings with an octet-length of less than 3 are not processed.
A string is encoded as 2 bytes, and another byte is needed to hold
the length of the skipped bytes that follow, so encoding values
shorter than 3 bytes would be a net loss.

3. There is a new function to create a dictionary directly from an
array of text, rather than using the learning code:

```
select zson_create_dictionary(array['word1','word2']::text[]);
```

4. There is a function to augment the current dictionary from an array of text:

```
select zson_extend_dictionary(array['value1','value2','value3']::text[]);
```

This is particularly useful for adding common field prefixes or values. A good
example of field prefixes is URL values where the first part of the URL is
fairly constrained but the last part is not.
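
For example, to seed the dictionary with common URL prefixes (an
illustrative sketch; the prefixes are invented):

```
select zson_extend_dictionary(array[
    'https://api.example.com/v1/',
    'https://api.example.com/v2/'
]::text[]);
```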


cheers

andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Add ZSON extension to /contrib/

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
>> Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The extension introduces a new ZSON type,
>> which is 100% compatible with JSONB but uses a shared dictionary of the strings most frequently used in given JSONB
>> documents for compression.

> If the extension is mature enough, why make it an extension in
> contrib, and not instead either enhance the existing jsonb type with
> it or make it a built-in type?

IMO we have too d*mn many JSON types already.  If we can find a way
to shoehorn this optimization into JSONB, that'd be great.  Otherwise
I do not think it's worth the added user confusion.

Also, even if ZSON was "100% compatible with JSONB" back in 2016,
a whole lot of features have been added since then.  Having to
duplicate all that code again for a different data type is not
something I want to see us doing.  So that's an independent reason
for wanting to hide this under the existing type not make a new one.

            regards, tom lane



Re: Add ZSON extension to /contrib/

From
Matthias van de Meent
Date:
On Tue, 25 May 2021 at 13:32, Magnus Hagander <magnus@hagander.net> wrote:
>
> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > Hi hackers,
> >
> > Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The extension introduces a new ZSON
> > type, which is 100% compatible with JSONB but uses a shared dictionary of the strings most frequently used in
> > given JSONB documents for compression. These strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4)
> > compression applies if the document is large enough, per the usual PostgreSQL logic. Under certain conditions
> > (many large documents), this saves disk space and memory and increases overall performance. More details can be
> > found in the README on GitHub.
> >
> > The extension was received warmly, and I immediately got several requests to submit it to /contrib/ so that
> > people using Amazon RDS and similar services could enjoy it too.

Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.

> > Back then I was not sure whether the extension was mature enough, or whether it lacked additional features
> > required to solve users' real-world problems. Time showed, however, that people are happy with the extension as
> > it is. Several minor issues were discovered, but they were fixed back in 2017. The extension has never had
> > compatibility problems with a new major release of PostgreSQL.
> >
> > So my question is whether the community would consider adding ZSON to /contrib/. If so, I will add this thread
> > to the nearest CF and submit a corresponding patch.

I like the idea of the ZSON type, but I'm somewhat disappointed by its
current limitations:

- There is only one active shared dictionary (as a user I would want
distinct dictionaries for each use case, similar to ENUM: each ENUM
type has its own limit of 2**31 (?) values)
- There is no provided method to manually specify the dictionary (only
"zson_learn", which constructs a new dictionary)
- You cannot add to the dictionary (equiv. to ALTER TYPE enum_type ADD
VALUE), you must create a new one.

Apart from that, I noticed the following more technical points, for if
you submit it as-is as a patch:

- Each dictionary uses a lot of memory, regardless of the number of
actual stored keys. For 32-bit systems the base usage of a dictionary
without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
more than I'd want to install.
- You call gettimeofday() in both dict_get and in get_current_dict_id.
These functions can be called in short and tight loops (for small ZSON
fields), in which case it would add significant overhead through the
implied syscalls.
- The compression method you've chosen seems to extract the most common
strings from the JSONB table, and then use that as a pre-built
dictionary for doing some dictionary encoding on the on-disk format of
the jsonb structure. Although I fully understand that this makes the
system quite easy to reason about, it does mean that you're deTOASTing
the full ZSON field, and that the stored bytestring will not be
structured / doesn't work well with current debuggers.

> If the extension is mature enough, why make it an extension in
> contrib, and not instead either enhance the existing jsonb type with
> it or make it a built-in type?

I don't think that this datatype (that supplies a basic but effective
compression algorithm over JSONB) is fit for core as-is.

I have also thought about building a similar type, but one that would
be more like ENUM: An extension on the JSONB datatype, which has some
list of common 'well-known' values that will be substituted, and to
which later more substitutable values can be added (e.g. CREATE TYPE
... AS JSONB_DICTIONARY  ('"commonly_used_key"',
'"very_long_string_that_appears_often"', '[{"structure": "that",
"appears": "often"}]') or something similar). That would leave JSONB
just as a JSONB_DICTIONARY type without any substitutable values.

These specialized JSONB types could then be used as a specification
for table columns, custom types, et cetera. One of the reasons I've
not yet built such a type is that I'm not familiar with the jsonb and
enum code (which I suspect is critical for an efficient
implementation of such a type), although while researching I noticed
that it is possible to use most of the JSONB infrastructure and read
older jsonb values, as there are still some JEntry type masks
available which could flag such substitutions.

With regards,

Matthias van de Meent



Re: Add ZSON extension to /contrib/

From
Andrew Dunstan
Date:
On 5/25/21 4:10 PM, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
>> <aleksander@timescale.com> wrote:
>>> Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The extension introduces a new ZSON
>>> type, which is 100% compatible with JSONB but uses a shared dictionary of the strings most frequently used in
>>> given JSONB documents for compression.

>> If the extension is mature enough, why make it an extension in
>> contrib, and not instead either enhance the existing jsonb type with
>> it or make it a built-in type?
> IMO we have too d*mn many JSON types already.  If we can find a way
> to shoehorn this optimization into JSONB, that'd be great.  Otherwise
> I do not think it's worth the added user confusion.
>
> Also, even if ZSON was "100% compatible with JSONB" back in 2016,
> a whole lot of features have been added since then.  Having to
> duplicate all that code again for a different data type is not
> something I want to see us doing.  So that's an independent reason
> for wanting to hide this under the existing type not make a new one.



I take your point. However, there isn't really any duplication. It's
handled by this:


    CREATE FUNCTION jsonb_to_zson(jsonb)
        RETURNS zson
        AS 'MODULE_PATHNAME'
        LANGUAGE C STRICT IMMUTABLE;

    CREATE FUNCTION zson_to_jsonb(zson)
        RETURNS jsonb
        AS 'MODULE_PATHNAME'
        LANGUAGE C STRICT IMMUTABLE;

    CREATE CAST (jsonb AS zson) WITH FUNCTION jsonb_to_zson(jsonb) AS
    ASSIGNMENT;
    CREATE CAST (zson AS jsonb) WITH FUNCTION zson_to_jsonb(zson) AS
    IMPLICIT;
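
With these casts in place, the existing jsonb operators apply after a cast
back, e.g. (a minimal sketch; t and zson_col are hypothetical):

```
SELECT (zson_col::jsonb) ->> 'name' FROM t;
```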


cheers


andrew


-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Add ZSON extension to /contrib/

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 5/25/21 4:10 PM, Tom Lane wrote:
>> Also, even if ZSON was "100% compatible with JSONB" back in 2016,
>> a whole lot of features have been added since then.  Having to
>> duplicate all that code again for a different data type is not
>> something I want to see us doing.  So that's an independent reason
>> for wanting to hide this under the existing type not make a new one.

> I take your point. However, there isn't really any duplication. It's
> handled by [ creating a pair of casts ]

If that were an adequate solution then nobody would be unhappy about
json vs jsonb.  I don't think it really is satisfactory:

* does nothing for user confusion (except maybe make it worse)

* not terribly efficient

* doesn't cover all cases, notably indexes.

            regards, tom lane



Re: Add ZSON extension to /contrib/

From
Tom Lane
Date:
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> I like the idea of the ZSON type, but I'm somewhat disappointed by its
> current limitations:

I've not read the code, so maybe this thought is completely off-point,
but I wonder if anything could be learned from PostGIS.  AIUI they
have developed the infrastructure needed to have auxiliary info
(particularly, spatial reference data) attached to a geometry column,
without duplicating it in every value of the column.  Seems like that
is a close analog of what's needed here.

            regards, tom lane



Re: Add ZSON extension to /contrib/

From
Andrew Dunstan
Date:
On 5/25/21 4:31 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 5/25/21 4:10 PM, Tom Lane wrote:
>>> Also, even if ZSON was "100% compatible with JSONB" back in 2016,
>>> a whole lot of features have been added since then.  Having to
>>> duplicate all that code again for a different data type is not
>>> something I want to see us doing.  So that's an independent reason
>>> for wanting to hide this under the existing type not make a new one.
>> I take your point. However, there isn't really any duplication. It's
>> handled by [ creating a pair of casts ]
> If that were an adequate solution then nobody would be unhappy about
> json vs jsonb.  I don't think it really is satisfactory:
>
> * does nothing for user confusion (except maybe make it worse)
>
> * not terribly efficient
>
> * doesn't cover all cases, notably indexes.
>
>             


Quite so. To some extent it's a toy. But at least one of our customers
has found it useful, and judging by Aleksander's email they aren't
alone. Your ideas downthread are probably a useful pointer to how we
might fruitfully proceed.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Add ZSON extension to /contrib/

From
Stephen Frost
Date:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> > I like the idea of the ZSON type, but I'm somewhat disappointed by its
> > current limitations:
>
> I've not read the code, so maybe this thought is completely off-point,
> but I wonder if anything could be learned from PostGIS.  AIUI they
> have developed the infrastructure needed to have auxiliary info
> (particularly, spatial reference data) attached to a geometry column,
> without duplicating it in every value of the column.  Seems like that
> is a close analog of what's needed here.

Err, not exactly the same - there aren't *that* many SRIDs and therefore
they can be stuffed into the typemod (my, probably wrong, recollection
was that I actually pushed Paul in that direction due to being
frustrated with CHECK constraints they had been using previously...).

Not something you could do with a dictionary like what's contemplated
here.  I do agree that each jsonb/zson/whatever column should really be
able to have its own dictionary though and maybe you could shove *which*
of those dictionaries a given column uses into the typemod for that
column...  In an ideal world, however, we wouldn't make a user have to
actually do that; instead we'd just build our own magically for them
when they use jsonb.
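
Something along these lines, syntax-wise (purely hypothetical; no such
syntax exists today):

```
-- hypothetical: the typemod names which dictionary the column uses
CREATE TABLE events (payload zson(2));  -- dictionary with ID 2
```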

Thanks,

Stephen


Re: Add ZSON extension to /contrib/

From
Aleksander Alekseev
Date:
Hi hackers,

Many thanks for your feedback, I very much appreciate it!

> If the extension is mature enough, why make it an extension in
> contrib, and not instead either enhance the existing jsonb type with
> it or make it a built-in type?

> IMO we have too d*mn many JSON types already.  If we can find a way
> to shoehorn this optimization into JSONB, that'd be great.  Otherwise
> I do not think it's worth the added user confusion.

Magnus, Tom,

My reasoning is that if the problem can be solved with an extension
there is little reason to modify the core. This seems to be in the
spirit of PostgreSQL. If the community reaches the consensus to modify
the core to introduce a similar feature, we could discuss this as
well. It sounds like a lot of unnecessary work to me though (see
below).

> * doesn't cover all cases, notably indexes.

Tom,

Not sure if I follow. What cases do you have in mind?

> Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.

Matthias,

Good point. I suspect that PostGIS is an exception though...

> I like the idea of the ZSON type, but I'm somewhat disappointed by its
> current limitations

Several people suggested various enhancements right after learning
about ZSON. Time showed, however, that none of the real-world users
really need e.g. more than one common dictionary per database. I
suspect this is because no one has more than 2**16 repeatable unique
strings (one dictionary limitation) in their documents. Thus there is
no benefit in having separate dictionaries and corresponding extra
complexity.

> - Each dictionary uses a lot of memory, regardless of the number of
> actual stored keys. For 32-bit systems the base usage of a dictionary
> without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
> almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
> more than I'd want to install.

You are probably right on this one, this part could be optimized. I
will address this if we agree on submitting the patch.

> - You call gettimeofday() in both dict_get and in get_current_dict_id.
> These functions can be called in short and tight loops (for small ZSON
> fields), in which case it would add significant overhead through the
> implied syscalls.

I must admit, I'm not an expert in this area. My understanding is that
gettimeofday() is implemented as a single virtual memory access on
modern operating systems, e.g. via vDSO on Linux, thus it's very cheap.
I'm not that sure about other supported platforms though. Probably
worth investigating.

> It does mean that you're deTOASTing
> the full ZSON field, and that the stored bytestring will not be
> structured / doesn't work well with current debuggers.

Unfortunately, I'm not very well aware of debugging tools in this
context. Could you please name the debuggers I should take into
account?

> We (2ndQuadrant, now part of EDB) made some enhancements to ZSON a few years ago, and I have permission to
> contribute those if this proposal is adopted.

Andrew,

That's great, and personally I very much like the enhancements you've
made. Purely out of curiosity, did they end up as part of
2ndQuadrant / EDB products? I will be happy to accept a pull request
with these enhancements regardless of how the story with this proposal
ends.

> Quite so. To some extent it's a toy. But at least one of our customers
> has found it useful, and judging by Aleksander's email they aren't
> alone.

Indeed, this is an extremely simple extension, ~500 effective lines of
code in C. It addresses a somewhat specific scenario which, to my
regret, doesn't seem to be uncommon. A pain-killer of a sort. In an
ideal world, people would simply normalize their data.

-- 
Best regards,
Aleksander Alekseev



Re: Add ZSON extension to /contrib/

From
Konstantin Knizhnik
Date:

On 25.05.2021 13:55, Aleksander Alekseev wrote:
> Hi hackers,
>
> Back in 2016, while at PostgresPro, I developed the ZSON extension
> [1]. The extension introduces a new ZSON type, which is 100%
> compatible with JSONB but uses a shared dictionary of the strings
> most frequently used in given JSONB documents for compression. These
> strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4)
> compression applies if the document is large enough, per the usual
> PostgreSQL logic. Under certain conditions (many large documents),
> this saves disk space and memory and increases overall performance.
> More details can be found in the README on GitHub.
>
> The extension was received warmly, and I immediately got several
> requests to submit it to /contrib/ so that people using Amazon RDS
> and similar services could enjoy it too. Back then I was not sure
> whether the extension was mature enough, or whether it lacked
> additional features required to solve users' real-world problems.
> Time showed, however, that people are happy with the extension as it
> is. Several minor issues were discovered, but they were fixed back in
> 2017. The extension has never had compatibility problems with a new
> major release of PostgreSQL.
>
> So my question is whether the community would consider adding ZSON
> to /contrib/. If so, I will add this thread to the nearest CF and
> submit a corresponding patch.
>
> [1]: https://github.com/postgrespro/zson
>
> -- 
> Best regards,
> Aleksander Alekseev
> Open-Source PostgreSQL Contributor at Timescale


Yet another approach to the same problem:

https://github.com/postgrespro/jsonb_schema

Instead of compressing JSONs we can try to automatically detect the JSON
schema (the names and types of JSON fields) and store it separately from
the values. This approach is more similar to the one used in schema-less
databases. It is most efficient when there are many JSON records with the
same schema and the sizes of keys are comparable to the sizes of values.
On the IMDB data set it reduced the database size by about 1.7 times.
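
To illustrate the idea (a hand-written sketch, not the extension's actual
interface):

```
-- documents sharing one shape: {"name": <text>, "year": <int>}
CREATE TABLE films (doc jsonb);
INSERT INTO films VALUES
    ('{"name": "Alien", "year": 1979}'),
    ('{"name": "Heat",  "year": 1995}');
-- the schema (key names and types) is identical for every row, so it
-- could be stored once, leaving only the values, e.g. ('Alien', 1979),
-- to be stored per document
```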





Re: Add ZSON extension to /contrib/

From
Matthias van de Meent
Date:
On Wed, 26 May 2021 at 12:49, Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi hackers,
>
> Many thanks for your feedback, I very much appreciate it!
>
> > If the extension is mature enough, why make it an extension in
> > contrib, and not instead either enhance the existing jsonb type with
> > it or make it a built-in type?
>
> > IMO we have too d*mn many JSON types already.  If we can find a way
> > to shoehorn this optimization into JSONB, that'd be great.  Otherwise
> > I do not think it's worth the added user confusion.
>
> Magnus, Tom,
>
> My reasoning is that if the problem can be solved with an extension
> there is little reason to modify the core. This seems to be in the
> spirit of PostgreSQL. If the community reaches the consensus to modify
> the core to introduce a similar feature, we could discuss this as
> well. It sounds like a lot of unnecessary work to me though (see
> below).
>
> > * doesn't cover all cases, notably indexes.
>
> Tom,
>
> Not sure if I follow. What cases do you have in mind?
>
> > Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.
>
> Matthias,
>
> Good point. I suspect that PostGIS is an exception though...

Quite a few other non-/common/ extensions are available in RDS[0],
some of which are HLL (from citusdata), pglogical (from 2ndQuadrant)
and orafce (from Pavel Stehule, orafce et al.).

> > I like the idea of the ZSON type, but I'm somewhat disappointed by its
> > current limitations
>
> Several people suggested various enhancements right after learning
> about ZSON. Time showed, however, that none of the real-world users
> really need e.g. more than one common dictionary per database. I
> suspect this is because no one has more than 2**16 repeatable unique
> strings (one dictionary limitation) in their documents. Thus there is
> no benefit in having separate dictionaries and corresponding extra
> complexity.

IMO the main benefit of having different dictionaries is that you
could have a small dictionary for small and very structured JSONB
fields (e.g. some time-series data), and a large one for large /
unstructured JSONB fields, without having the significant performance
impact of having that large and varied dictionary on the
small&structured field. Although a binary search is log(n) and thus
still quite cheap even for large dictionaries, the extra size is
certainly not free, and you'll be touching more memory in the process.

> > - Each dictionary uses a lot of memory, regardless of the number of
> > actual stored keys. For 32-bit systems the base usage of a dictionary
> > without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
> > almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
> > more than I'd want to install.
>
> You are probably right on this one, this part could be optimized. I
> will address this if we agree on submitting the patch.
>
> > - You call gettimeofday() in both dict_get and in get_current_dict_id.
> > These functions can be called in short and tight loops (for small ZSON
> > fields), in which case it would add significant overhead through the
> > implied syscalls.
>
> I must admit, I'm not an expert in this area. My understanding is that
> gettimeofday() is implemented as a single virtual memory access on
> modern operating systems, e.g. via vDSO on Linux, thus it's very cheap.
> I'm not that sure about other supported platforms though. Probably
> worth investigating.

Yes, but vDSO does not necessarily work on all systems: e.g. in 2017,
a lot of EC2 [1] ran on Xen with vDSO not working for gettimeofday.
I'm uncertain whether this issue persists with their new KVM/Nitro
hypervisor.

> > It does mean that you're deTOASTing
> > the full ZSON field, and that the stored bytestring will not be
> > structured / doesn't work well with current debuggers.
>
> Unfortunately, I'm not very well aware of debugging tools in this
> context. Could you please name the debuggers I should take into
> account?

Hmm, I was mistaken in that regard. I was under the impression that at
least one of pageinspect, pg_filedump and pg_hexedit did support
column value introspection, which they apparently do not. pg_filedump
(and thus pg_hexedit) has some introspection, but none specialized
for jsonb (yet).

The point I tried to make was that introspection of ZSON would be even
more difficult, since it adds a non-standard compression method
which makes introspection effectively impossible (the algorithm can
replace things other than the strings it should replace, so it will be
difficult to recover structure from the encoded string).

With regards,

Matthias van de Meent

[0]
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.FeatureSupport.Extensions.13x
[1] https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/



Re: Add ZSON extension to /contrib/

From
Bruce Momjian
Date:
On Tue, May 25, 2021 at 01:55:13PM +0300, Aleksander Alekseev wrote:
> Hi hackers,
> 
> Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The
> extension introduces a new ZSON type, which is 100% compatible with JSONB but
> uses a shared dictionary of the strings most frequently used in given JSONB
> documents for compression. These strings are replaced with integer IDs.
> Afterward, PGLZ (and now LZ4) compression applies if the document is large
> enough, per the usual PostgreSQL logic. Under certain conditions (many large
> documents), this saves disk space and memory and increases overall
> performance. More details can be found in the README on GitHub.

I think this is interesting because it is one of the few cases that
allow compression outside of a single column.  Here is a list of
compression options:

    https://momjian.us/main/blogs/pgblog/2020.html#April_27_2020
    
    1. single field
    2. across rows in a single page
    3. across rows in a single column
    4. across all columns and rows in a table
    5. across tables in a database
    6. across databases

While standard Postgres does #1, ZSON allows 2-5, assuming the data is
in the ZSON data type.  I think this cross-field compression has great
potential for cases where the data is not relational, or hasn't had time
to be structured relationally.  It also opens questions of how to do
this cleanly in a relational system.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Add ZSON extension to /contrib/

From
Andrew Dunstan
Date:
On 5/26/21 5:29 PM, Bruce Momjian wrote:
> On Tue, May 25, 2021 at 01:55:13PM +0300, Aleksander Alekseev wrote:
>> Hi hackers,
>>
>> Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The
>> extension introduces a new ZSON type, which is 100% compatible with JSONB but
>> uses a shared dictionary of the strings most frequently used in given JSONB
>> documents for compression. These strings are replaced with integer IDs.
>> Afterward, PGLZ (and now LZ4) compression applies if the document is large
>> enough, per the usual PostgreSQL logic. Under certain conditions (many large
>> documents), this saves disk space and memory and increases overall
>> performance. More details can be found in the README on GitHub.
> I think this is interesting because it is one of the few cases that
> allow compression outside of a single column.  Here is a list of
> compression options:
>
>     https://momjian.us/main/blogs/pgblog/2020.html#April_27_2020
>     
>     1. single field
>     2. across rows in a single page
>     3. across rows in a single column
>     4. across all columns and rows in a table
>     5. across tables in a database
>     6. across databases
>
> While standard Postgres does #1, ZSON allows 2-5, assuming the data is
> in the ZSON data type.  I think this cross-field compression has great
> potential for cases where the data is not relational, or hasn't had time
> to be structured relationally.  It also opens questions of how to do
> this cleanly in a relational system.
>

I think we're going to get the best bang for the buck on doing 2, 3, and
4. If it's confined to a single table then we can put a dictionary in
something like a fork. Maybe given partitioning we want to be able to do
multi-table dictionaries, but that's less certain.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Add ZSON extension to /contrib/

From
Tomas Vondra
Date:

On 5/26/21 6:43 PM, Matthias van de Meent wrote:
> On Wed, 26 May 2021 at 12:49, Aleksander Alekseev
> <aleksander@timescale.com> wrote:
>>
>> Hi hackers,
>>
>> Many thanks for your feedback, I very much appreciate it!
>>
>>> If the extension is mature enough, why make it an extension in
>>> contrib, and not instead either enhance the existing jsonb type with
>>> it or make it a built-in type?
>>
>>> IMO we have too d*mn many JSON types already.  If we can find a way
>>> to shoehorn this optimization into JSONB, that'd be great.  Otherwise
>>> I do not think it's worth the added user confusion.
>>
>> Magnus, Tom,
>>
>> My reasoning is that if the problem can be solved with an extension
>> there is little reason to modify the core. This seems to be in the
>> spirit of PostgreSQL. If the community reaches the consensus to modify
>> the core to introduce a similar feature, we could discuss this as
>> well. It sounds like a lot of unnecessary work to me though (see
>> below).
>>
>>> * doesn't cover all cases, notably indexes.
>>
>> Tom,
>>
>> Not sure if I follow. What cases do you have in mind?
>>
>>> Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.
>>
>> Matthias,
>>
>> Good point. I suspect that PostGIS is an exception though...
> 
> Quite a few other non-/common/ extensions are available in RDS[0],
> some of which are HLL (from citusdata), pglogical (from 2ndQuadrant)
> and orafce (from Pavel Stehule, orafce et al.).
> 
>>> I like the idea of the ZSON type, but I'm somewhat disappointed by its
>>> current limitations
>>
>> Several people suggested various enhancements right after learning
>> about ZSON. Time showed, however, that none of the real-world users
>> really need e.g. more than one common dictionary per database. I
>> suspect this is because no one has more than 2**16 repeatable unique
>> strings (one dictionary limitation) in their documents. Thus there is
>> no benefit in having separate dictionaries and corresponding extra
>> complexity.
> 
> IMO the main benefit of having different dictionaries is that you
> could have a small dictionary for small and very structured JSONB
> fields (e.g. some time-series data), and a large one for large /
> unstructured JSONB fields, without having the significant performance
> impact of having that large and varied dictionary on the
> small&structured field. Although a binary search is log(n) and thus
> still quite cheap even for large dictionaries, the extra size is
> certainly not free, and you'll be touching more memory in the process.
> 

I'm sure we can think of various other arguments for allowing separate
dictionaries. For example, what if you drop a column? With one huge
dictionary you're bound to keep the data forever. With per-column dicts
you can just drop the dict and free disk space / memory.

I also find it hard to believe that no one needs 2**16 strings. I mean,
65k is not that much, really. To give an example, I've been toying with
storing the bitcoin blockchain in a database - one way to do that is
storing each block as a single JSONB document. But each "item" (e.g. a
transaction) is identified by a unique hash, so that means (tens of)
thousands of unique strings *per document*.

Yes, it's a bit silly and extreme, and maybe the compression would not
help much in this case. But it shows that 2**16 is damn easy to hit.

In other words, this seems like a nice example of survivor bias, where
we only look at cases for which the existing limitations are acceptable,
ignoring the (many) remaining cases eliminated by those limitations.
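
For anyone who wants to check their own data against that limit, a rough
query along these lines works (an illustrative sketch that only counts
top-level keys; the table and column names are made up):

```
SELECT count(DISTINCT key) AS distinct_keys
FROM blocks, LATERAL jsonb_object_keys(doc) AS key;
```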

>>> - Each dictionary uses a lot of memory, regardless of the number of
>>> actual stored keys. For 32-bit systems the base usage of a dictionary
>>> without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
>>> almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
>>> more than I'd want to install.
>>
>> You are probably right on this one, this part could be optimized. I
>> will address this if we agree on submitting the patch.
>>

I'm sure it can be optimized, but I also think this focuses too much
on the base memory usage.

What I care about is the end result, i.e. how much disk space / memory I
save at the end. I don't care if it's 1MB or 1.7MB if using the
compression saves me e.g. 50% of disk space. And it's completely
irrelevant if I can't use the feature because of limitations stemming
from the "single dictionary" design (in which case I'll save the
0.7MB, but I also lose the 50% disk space savings - not a great trade
off, if you ask me).

>>> - You call gettimeofday() in both dict_get and in get_current_dict_id.
>>> These functions can be called in short and tight loops (for small ZSON
>>> fields), in which case it would add significant overhead through the
>>> implied syscalls.
>>
>> I must admit, I'm not an expert in this area. My understanding is that
>> gettimeofday() is implemented as a single virtual memory access on
>> modern operating systems, e.g. via vDSO on Linux, thus it's very cheap.
>> I'm not that sure about other supported platforms though. Probably
>> worth investigating.
> 
> Yes, but vDSO does not necessarily work on all systems: e.g. in 2017,
> a lot of EC2 [1] ran on Xen with vDSO not working for gettimeofday.
> I'm uncertain whether this issue persists with their new KVM/Nitro
> hypervisor.
> 

Yeah. Better not to call gettimeofday() very often. I have no idea why
the code even does that, though - it seems to be deciding whether it's
OK to use a cached dictionary based on the timestamp, but that seems
rather dubious. But it's hard to say, because there are almost no
useful comments *anywhere* in the code.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add ZSON extension to /contrib/

From
Tomas Vondra
Date:

On 5/26/21 2:49 AM, Stephen Frost wrote:
> Greetings,
> 
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
>>> I like the idea of the ZSON type, but I'm somewhat disappointed by its
>>> current limitations:
>>
>> I've not read the code, so maybe this thought is completely off-point,
>> but I wonder if anything could be learned from PostGIS.  AIUI they
>> have developed the infrastructure needed to have auxiliary info
>> (particularly, spatial reference data) attached to a geometry column,
>> without duplicating it in every value of the column.  Seems like that
>> is a close analog of what's needed here.
> 
> Err, not exactly the same- there aren't *that* many SRIDs and therefore
> they can be stuffed into the typemod (my, probably wrong, recollection
> was that I actually pushed Paul in that direction due to being
> frustrated with CHECK constraints they had been using previously..).
> 
> Not something you could do with a dictionary as what's contempalted
> here.  I do agree that each jsonb/zson/whatever column should really be
> able to have its own dictionary though and maybe you could shove *which*
> of those dictionaries a given column uses into the typemod for that
> column...  In an ideal world, however, we wouldn't make a user have to
> actually do that though and instead we'd just build our own magically
> for them when they use jsonb.
> 

I think doing this properly will require inventing new infrastructure to
associate custom parameters with a column (and/or data type). In
principle it seems quite similar to 911e702077, which introduced opclass
parameters and thereby allowed implementing the new BRIN opclasses in
PG14.

Even if we eventually decide not to add zson to contrib (or core), it
seems like this infrastructure would make zson more usable in practice.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add ZSON extension to /contrib/

From
Tomas Vondra
Date:

On 5/27/21 4:15 AM, Andrew Dunstan wrote:
> 
> On 5/26/21 5:29 PM, Bruce Momjian wrote:
>> On Tue, May 25, 2021 at 01:55:13PM +0300, Aleksander Alekseev wrote:
>>> Hi hackers,
>>>
>>> Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The
>>> extension introduces a new ZSON type, which is 100% compatible with JSONB but
>>> uses a shared dictionary of the strings most frequently used in given JSONB
>>> documents for compression. These strings are replaced with integer IDs.
>>> Afterward, PGLZ (and now LZ4) compression applies if the document is large
>>> enough, per the usual PostgreSQL logic. Under certain conditions (many large
>>> documents), this saves disk space and memory and increases overall
>>> performance. More details can be found in the README on GitHub.
>> I think this is interesting because it is one of the few cases that
>> allow compression outside of a single column.  Here is a list of
>> compression options:
>>
>>     https://momjian.us/main/blogs/pgblog/2020.html#April_27_2020
>>     
>>     1. single field
>>     2. across rows in a single page
>>     3. across rows in a single column
>>     4. across all columns and rows in a table
>>     5. across tables in a database
>>     6. across databases
>>
>> While standard Postgres does #1, ZSON allows 2-5, assuming the data is
>> in the ZSON data type.  I think this cross-field compression has great
>> potential for cases where the data is not relational, or hasn't had time
>> to be structured relationally.  It also opens questions of how to do
>> this cleanly in a relational system.
>>
> 
> I think we're going to get the best bang for the buck on doing 2, 3, and
> 4. If it's confined to a single table then we can put a dictionary in
> something like a fork.

Agreed.

> Maybe given partitioning we want to be able to do multi-table
> dictionaries, but that's less certain.
> 

Yeah. I think it'll have many of the same issues/complexity as global
indexes, and the gains are likely limited. At least assuming the
partitions are sufficiently large, but tiny partitions are inefficient
in general, I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add ZSON extension to /contrib/

From
Andrew Dunstan
Date:
On 5/28/21 6:35 AM, Tomas Vondra wrote:
>
>>
>> IMO the main benefit of having different dictionaries is that you
>> could have a small dictionary for small and very structured JSONB
>> fields (e.g. some time-series data), and a large one for large /
>> unstructured JSONB fields, without having the significant performance
>> impact of having that large and varied dictionary on the
>> small&structured field. Although a binary search is log(n) and thus
>> still quite cheap even for large dictionaries, the extra size is
>> certainly not free, and you'll be touching more memory in the process.
>>
> I'm sure we can think of various other arguments for allowing separate
> dictionaries. For example, what if you drop a column? With one huge
> dictionary you're bound to keep the data forever. With per-column dicts
> you can just drop the dict and free disk space / memory.
>
> I also find it hard to believe that no one needs 2**16 strings. I mean,
> 65k is not that much, really. To give an example, I've been toying with
> storing the bitcoin blockchain in a database - one way to do that is
> storing each block as a single JSONB document. But each "item" (e.g. a
> transaction) is identified by a unique hash, so that means (tens of)
> thousands of unique strings *per document*.
>
> Yes, it's a bit silly and extreme, and maybe the compression would not
> help much in this case. But it shows that 2**16 is damn easy to hit.
>
> In other words, this seems like a nice example of survivor bias, where
> we only look at cases for which the existing limitations are acceptable,
> ignoring the (many) remaining cases eliminated by those limitations.
>
>

I don't think we should lightly discard the use of 2-byte keys though.
Maybe we could use a scheme similar to what we use for text lengths,
where the first bit indicates whether we have a 1-byte or 4-byte length
indicator. Many dictionaries will have fewer than 2^15-1 entries, so
they would use exclusively the smaller keys.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Add ZSON extension to /contrib/

From
Tomas Vondra
Date:

On 5/28/21 4:22 PM, Andrew Dunstan wrote:
> 
> On 5/28/21 6:35 AM, Tomas Vondra wrote:
>>
>>>
>>> IMO the main benefit of having different dictionaries is that you
>>> could have a small dictionary for small and very structured JSONB
>>> fields (e.g. some time-series data), and a large one for large /
>>> unstructured JSONB fields, without having the significant performance
>>> impact of having that large and varied dictionary on the
>>> small&structured field. Although a binary search is log(n) and thus
>>> still quite cheap even for large dictionaries, the extra size is
>>> certainly not free, and you'll be touching more memory in the process.
>>>
>> I'm sure we can think of various other arguments for allowing separate
>> dictionaries. For example, what if you drop a column? With one huge
>> dictionary you're bound to keep the data forever. With per-column dicts
>> you can just drop the dict and free disk space / memory.
>>
>> I also find it hard to believe that no one needs 2**16 strings. I mean,
>> 65k is not that much, really. To give an example, I've been toying with
>> storing the bitcoin blockchain in a database - one way to do that is
>> storing each block as a single JSONB document. But each "item" (e.g. a
>> transaction) is identified by a unique hash, so that means (tens of)
>> thousands of unique strings *per document*.
>>
>> Yes, it's a bit silly and extreme, and maybe the compression would not
>> help much in this case. But it shows that 2**16 is damn easy to hit.
>>
>> In other words, this seems like a nice example of survivor bias, where
>> we only look at cases for which the existing limitations are acceptable,
>> ignoring the (many) remaining cases eliminated by those limitations.
>>
>>
> 
> I don't think we should lightly discard the use of 2-byte keys though.
> Maybe we could use a scheme similar to what we use for text lengths,
> where the first bit indicates whether we have a 1-byte or 4-byte length
> indicator. Many dictionaries will have fewer than 2^15-1 entries, so
> they would use exclusively the smaller keys.
> 

I didn't mean to discard that, of course. I'm sure a lot of data sets
may be perfectly fine with 64k keys, and it may be worth optimizing
that as a special case. All I'm saying is that if we start from the
position that this limit is perfectly fine and no one is going to hit
it in practice, it may be due to people not even trying it on
documents with more keys.

That being said, I still don't think the 1MB vs. 1.7MB figure is
particularly meaningful, because it's for an "empty" dictionary, which
is something you'll not have in practice. And once you start adding
keys, the difference will get less and less significant.

However, if we care about efficiency for "small" JSON documents, it's
probably worth using something like varint [1], which is 1-4B depending
on the value.

[1] https://learnmeabitcoin.com/technical/varint


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add ZSON extension to /contrib/

From
"Joel Jacobson"
Date:
On Tue, May 25, 2021, at 22:10, Tom Lane wrote:
Magnus Hagander <magnus@hagander.net> writes:
> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev
>> Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The extension introduces a new ZSON type, which is 100% compatible with JSONB but uses a shared dictionary of the strings most frequently used in given JSONB documents for compression.

> If the extension is mature enough, why make it an extension in
> contrib, and not instead either enhance the existing jsonb type with
> it or make it a built-in type?

IMO we have too d*mn many JSON types already.  If we can find a way
to shoehorn this optimization into JSONB, that'd be great.  Otherwise
I do not think it's worth the added user confusion.

I think the json situation is unfortunate.

If the json type were carefully designed from scratch,
with all the experience accumulated over the years of working with json and jsonb,
the end result would probably be quite different.

For instance, I remember Marko Tiikkaja implemented his own json type many years ago when we worked together at the same company, needing json before PostgreSQL had support for it. Some of the ideas in his interface felt more natural to me than the built-in json type we later got.

While zson improves on efficiency, there are probably lots of other improvements in the interface that could be made as well.

Instead of trying to fix the existing built-in json type, I think it would be better to package the built-in functionality as a "json" extension that comes pre-installed, similar to how "plpgsql" comes pre-installed.

Users who feel they are unhappy with the entire json/jsonb types could then install "zson" or some other competing json type instead. This would allow the life-cycles of legacy/deprecated versions to overlap with future versions.

Uninstallable Pre-Installed Extensions as a concept could perhaps be a feasible alternative to shoehorning in functionality/optimizations in general, and also a way to avoid GUCs.

The biggest downside I see is the risk for confusion among users, since there can then be multiple competing implementations providing the same functionality. It's nice to have built-ins when all users love the built-ins.

/Joel

Re: Add ZSON extension to /contrib/

From
Aleksander Alekseev
Date:
Hi hackers,

Many thanks for the feedback and all the great suggestions!

I decided to add the patch to the nearest commitfest. You will find it in the attachment.

Differences from the GitHub version:

- Code formatting changed;
- More comments added to the code;
- SGML documentation added;
- Plus several minor changes;

I very much like the ideas:

- To use varint, as Tomas suggested
- Make dictionaries variable in size
- Somehow avoid calling gettimeofday()
- Improvements by 2ndQuadrant that Andrew named

However, I would like to decompose the task into 1) deciding whether the extension is worth adding to /contrib/ and 2) improving it. Since there are people who already use ZSON, the extension should be backward-compatible with the current ZSON format anyway. Also, every improvement deserves its own discussion, testing, and benchmarking. Thus I believe the suggested approach will simplify the job for reviewers, and also save us time if the patch is declined. If the patch is accepted, I will be delighted to submit follow-up patches!

If you have any other ideas on how the extension can be improved in the future, please don't hesitate to name them in this thread. Also, I would appreciate some code review.

--
Best regards,
Aleksander Alekseev

Re: Add ZSON extension to /contrib/

From
Peter Eisentraut
Date:
On 04.06.21 17:09, Aleksander Alekseev wrote:
> I decided to add the patch to the nearest commitfest.

With respect to the commit fest submission, I don't think there is 
consensus right now to add this.  I think people would prefer that this 
dictionary facility be somehow made available in the existing JSON 
types.  Also, I sense that there is still some volatility about some of 
the details of how this extension should work and its scope.  I think 
this is served best as an external extension for now.



Re: Add ZSON extension to /contrib/

From
Tomas Vondra
Date:
On 7/3/21 12:34 PM, Peter Eisentraut wrote:
> On 04.06.21 17:09, Aleksander Alekseev wrote:
>> I decided to add the patch to the nearest commitfest.
> 
> With respect to the commit fest submission, I don't think there is 
> consensus right now to add this.  I think people would prefer that this 
> dictionary facility be somehow made available in the existing JSON 
> types.  Also, I sense that there is still some volatility about some of 
> the details of how this extension should work and its scope.  I think 
> this is served best as an external extension for now.

I agree there's a lot of open questions to figure out, but I think this 
"column-level compression" capability has a lot of potential. Not just 
for structured documents like JSON, but maybe even for scalar types.

I don't think the question of whether this should be built into jsonb, a
separate built-in type, a contrib type, or something external is the one
we need to answer first.

The first thing I'd like to see is some "proof" that it's actually 
useful in practice - there were some claims about people/customers using 
it and being happy with the benefits, but there were no actual examples 
of data sets that are expected to benefit, compression ratios etc. And 
considering that [1] went unnoticed for 5 years, I have my doubts about 
it being used very widely. (I may be wrong and maybe people are just not 
casting jsonb to zson.)

I've tried to use this on the one large non-synthetic JSONB dataset I
had at hand at the moment, which is the bitcoin blockchain. That's ~1TB
with JSONB, and when I tried using ZSON instead there was no measurable
benefit; in fact the database was a bit larger. But I admit btc data is
rather strange, because it contains a lot of randomness (all the tx and
block IDs are random-looking hashes, etc.), and there are a lot of them
in each document. So maybe that's simply a data set that can't benefit
from zson in principle.

I also suspect the zson_extract_strings() function is pretty inefficient,
and I ran into various issues with the btc blocks, which have very many
keys, often far more than the 10k limit.

In any case, I think having a clear example(s) of practical data sets 
that benefit from using zson would be very useful, both to guide the 
development and to show what the potential gains are.

The other thing is that a lot of the stuff seems to be manual (e.g. the 
learning), and not really well integrated with the core. IMO improving 
this by implementing the necessary infrastructure would help all the 
possible cases (built-in type, contrib, external extension).


regards

[1] 
https://github.com/postgrespro/zson/commit/02db084ea3b94d9e68fd912dea97094634fcdea5

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add ZSON extension to /contrib/

From
Aleksander Alekseev
Date:
Hi hackers,

Many thanks for all the great feedback!

Please see the follow-up thread `RFC: compression dictionaries for JSONB`:

https://www.postgresql.org/message-id/CAJ7c6TPx7N-bVw0dZ1ASCDQKZJHhBYkT6w4HV1LzfS%2BUUTUfmA%40mail.gmail.com

--
Best regards,
Aleksander Alekseev
Open-Source PostgreSQL Contributor at Timescale