From Tomas Vondra
Subject Re: Add ZSON extension to /contrib/
Date
Msg-id 77356556-0634-5cde-f55e-cce739dc09b9@enterprisedb.com
In response to Re: Add ZSON extension to /contrib/  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers

On 5/28/21 4:22 PM, Andrew Dunstan wrote:
> 
> On 5/28/21 6:35 AM, Tomas Vondra wrote:
>>
>>>
>>> IMO the main benefit of having different dictionaries is that you
>>> could have a small dictionary for small and very structured JSONB
>>> fields (e.g. some time-series data), and a large one for large /
>>> unstructured JSONB fields, without having the significant performance
>>> impact of having that large and varied dictionary on the
>>> small&structured field. Although a binary search is log(n) and thus
>>> still quite cheap even for large dictionaries, the extra size is
>>> certainly not free, and you'll be touching more memory in the process.
>>>
>> I'm sure we can think of various other arguments for allowing separate
>> dictionaries. For example, what if you drop a column? With one huge
>> dictionary you're bound to keep the data forever. With per-column dicts
>> you can just drop the dict and free disk space / memory.
>>
>> I also find it hard to believe that no one needs 2**16 strings. I mean,
>> 65k is not that much, really. To give an example, I've been toying with
>> storing bitcoin blockchain in a database - one way to do that is storing
>> each block as a single JSONB document. But each "item" (eg. transaction)
>> is identified by a unique hash, so that means (tens of) thousands of
>> unique strings *per document*.
>>
>> Yes, it's a bit silly and extreme, and maybe the compression would not
>> help much in this case. But it shows that 2**16 is damn easy to hit.
>>
>> In other words, this seems like a nice example of survivor bias, where
>> we only look at cases for which the existing limitations are acceptable,
>> ignoring the (many) remaining cases eliminated by those limitations.
>>
>>
> 
> I don't think we should lightly discard the use of 2 byte keys though.
> Maybe we could use a scheme similar to what we use for text lengths,
> where the first bit indicates whether we have a 1 byte or 4 byte length
> indicator. Many dictionaries will have fewer than 2^15-1 entries, so they
> would use exclusively the smaller keys.
> 

I didn't mean to discard that, of course. I'm sure a lot of data sets
will be perfectly fine with 64k keys, and it may well be worth
optimizing for that as a special case. All I'm saying is that if we
start from the position that this limit is perfectly fine and that no
one is going to hit it in practice, it may simply be because people
haven't tried it on documents with more keys.

That being said, I still don't think the 1MB vs. 1.7MB figure is
particularly meaningful, because it's for an "empty" dictionary, which
is something you won't have in practice. And once you start adding
keys, the difference will get less and less significant.

However, if we care about efficiency for "small" JSON documents, it's
probably worth using something like a varint [1], which takes 1-4 bytes
depending on the value.

[1] https://learnmeabitcoin.com/technical/varint
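
Just to illustrate the idea, a minimal sketch of such an encoding (a
LEB128-style varint - one byte for keys below 128, up to five bytes for
the full 32-bit range; not exactly the encoding described in [1], just
an example of the general approach):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Encode 'value' as a LEB128-style varint: 7 payload bits per byte, high
 * bit set on every byte except the last. Keys below 128 need one byte,
 * the full 32-bit range at most five. Returns the number of bytes
 * written ('out' must have room for 5 bytes).
 */
static size_t
varint_encode(uint32_t value, uint8_t *out)
{
	size_t		n = 0;

	while (value >= 0x80)
	{
		out[n++] = (uint8_t) ((value & 0x7F) | 0x80);
		value >>= 7;
	}
	out[n++] = (uint8_t) value;
	return n;
}

/* Decode a varint written by varint_encode(), returning bytes consumed. */
static size_t
varint_decode(const uint8_t *in, uint32_t *value)
{
	size_t		n = 0;
	int			shift = 0;
	uint32_t	v = 0;

	do
	{
		v |= (uint32_t) (in[n] & 0x7F) << shift;
		shift += 7;
	} while (in[n++] & 0x80);

	*value = v;
	return n;
}

int
main(void)
{
	uint32_t	keys[] = {5, 300, 70000, 5000000};
	uint8_t		buf[5];

	for (int i = 0; i < 4; i++)
	{
		uint32_t	decoded;
		size_t		len = varint_encode(keys[i], buf);

		varint_decode(buf, &decoded);
		printf("key %u -> %zu byte(s), decodes back to %u\n",
			   keys[i], len, decoded);
	}
	return 0;
}

With dictionaries of up to a few thousand entries the vast majority of
keys would still fit in 1-2 bytes, while nothing would prevent a
dictionary from growing past 2^16 entries.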


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


