From: Tomas Vondra
Subject: Re: Add ZSON extension to /contrib/
Msg-id: d524cc65-6d45-2ffd-fc9b-eb60b964d02e@enterprisedb.com
In response to: Re: Add ZSON extension to /contrib/ (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
Responses: Re: Add ZSON extension to /contrib/ (Aleksander Alekseev <aleksander@timescale.com>)
List: pgsql-hackers
On 7/3/21 12:34 PM, Peter Eisentraut wrote:
> On 04.06.21 17:09, Aleksander Alekseev wrote:
>> I decided to add the patch to the nearest commitfest.
> 
> With respect to the commit fest submission, I don't think there is 
> consensus right now to add this.  I think people would prefer that this 
> dictionary facility be somehow made available in the existing JSON 
> types.  Also, I sense that there is still some volatility about some of 
> the details of how this extension should work and its scope.  I think 
> this is served best as an external extension for now.

I agree there are a lot of open questions to figure out, but I think 
this "column-level compression" capability has a lot of potential - not 
just for structured documents like JSON, but maybe even for scalar types.

I don't think the question of whether this should be built into jsonb, 
be a separate built-in type, live in contrib, or remain an external 
extension is the one we need to answer first.

The first thing I'd like to see is some "proof" that it's actually 
useful in practice - there were claims about people/customers using it 
and being happy with the benefits, but no actual examples of data sets 
that are expected to benefit, compression ratios achieved, etc. And 
considering that [1] went unnoticed for 5 years, I have my doubts about 
it being used very widely. (I may be wrong, and maybe people are simply 
not casting jsonb to zson.)
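
To be concrete, the kind of experiment I have in mind looks roughly 
like this, following the zson README (the "documents" table and "body" 
column are just placeholders, and the ALTER ... USING cast is how I'd 
expect the conversion to work, not something I've verified here):

  -- measure the size before the conversion
  SELECT pg_size_pretty(pg_total_relation_size('documents'));

  CREATE EXTENSION zson;

  -- build a dictionary from a sample of the existing documents
  SELECT zson_learn('{{"documents", "body"}}');

  -- rewrite the column using the learned dictionary
  ALTER TABLE documents ALTER COLUMN body TYPE zson USING body::zson;

  -- and measure the size after the rewrite
  SELECT pg_size_pretty(pg_total_relation_size('documents'));

Before/after numbers from a few such experiments on realistic data sets 
would go a long way.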

I've tried to use this on the one large non-synthetic JSONB data set I 
had at hand, which is the bitcoin blockchain. That's ~1TB with JSONB, 
and when I tried using ZSON instead there was no measurable benefit - in 
fact, the database was a bit larger. But I admit btc data is rather 
peculiar, because it contains a lot of randomness (all the tx and block 
IDs are random-looking hashes, etc.), and there are a lot of them in 
each document. So maybe that's simply a data set that can't benefit from 
zson in principle.

I also suspect zson_extract_strings() is pretty inefficient, and I ran 
into various issues with the btc blocks, which have very many keys - 
often far more than the 10k limit.
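
(A quick way to eyeball that, with made-up table/column names again - 
note that jsonb_object_keys only sees top-level keys, so deeply nested 
documents would need a recursive variant:

  -- documents with the most top-level keys
  SELECT id, count(*) AS n_keys
    FROM (SELECT id, jsonb_object_keys(body) FROM documents) AS t
   GROUP BY id
   ORDER BY n_keys DESC
   LIMIT 10;
)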

In any case, I think having clear examples of practical data sets that 
benefit from using zson would be very useful, both to guide the 
development and to show what the potential gains are.

The other thing is that a lot of the workflow seems to be manual (e.g. 
the learning step), and not really well integrated with the core. IMO 
implementing the necessary infrastructure to improve this would help all 
the possible variants (built-in type, contrib, external extension).
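
Right now the best one can probably do is wire the re-learning up 
externally, e.g. with something like pg_cron (assuming a recent pg_cron 
with named jobs; since zson versions its dictionaries, values encoded 
with an older dictionary should remain decodable after a re-learn):

  -- hypothetical: rebuild the dictionary every night at 3am
  SELECT cron.schedule('zson-relearn', '0 3 * * *',
                       $$SELECT zson_learn('{{"documents", "body"}}')$$);

That works, but it's exactly the kind of plumbing users shouldn't have 
to build themselves.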


regards

[1] 
https://github.com/postgrespro/zson/commit/02db084ea3b94d9e68fd912dea97094634fcdea5

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


