Re: Add ZSON extension to /contrib/ - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: Add ZSON extension to /contrib/
Date
Msg-id 09ec9e70-d901-40a6-cbca-b0e375eebb73@dunslane.net
Whole thread Raw
In response to Re: Add ZSON extension to /contrib/  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Add ZSON extension to /contrib/
List pgsql-hackers
On 5/28/21 6:35 AM, Tomas Vondra wrote:
>
>>
>> IMO the main benefit of having different dictionaries is that you
>> could have a small dictionary for small and very structured JSONB
>> fields (e.g. some time-series data), and a large one for large /
>> unstructured JSONB fields, without having the significant performance
>> impact of having that large and varied dictionary on the
>> small&structured field. Although a binary search is log(n) and thus
>> still quite cheap even for large dictionaries, the extra size is
>> certainly not free, and you'll be touching more memory in the process.
>>
> I'm sure we can think of various other arguments for allowing separate
> dictionaries. For example, what if you drop a column? With one huge
> dictionary you're bound to keep the data forever. With per-column dicts
> you can just drop the dict and free disk space / memory.
>
> I also find it hard to believe that no one needs 2**16 strings. I mean,
> 65k is not that much, really. To give an example, I've been toying with
> storing bitcoin blockchain in a database - one way to do that is storing
> each block as a single JSONB document. But each "item" (eg. transaction)
> is identified by a unique hash, so that means (tens of) thousands of
> unique strings *per document*.
>
> Yes, it's a bit silly and extreme, and maybe the compression would not
> help much in this case. But it shows that 2**16 is damn easy to hit.
>
> In other words, this seems like a nice example of survivor bias, where
> we only look at cases for which the existing limitations are acceptable,
> ignoring the (many) remaining cases eliminated by those limitations.
>
>

I don't think we should lightly discard the use of 2 byte keys though.
Maybe we could use a scheme similar to what we use for text lengths,
where the first bit indicates whether we have a 1 byte or 4 byte length
indicator. Many dictionaries will have less that 2^15-1 entries, so they
would use exclusively the smaller keys.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




pgsql-hackers by date:

Previous
From: Johannes Graën
Date:
Subject: Degression (PG10 > 11, 12 or 13)
Next
From: Mark Dilger
Date:
Subject: Re: Command statistics system (cmdstats)