On 5/28/21 6:35 AM, Tomas Vondra wrote:
>
>>
>> IMO the main benefit of having different dictionaries is that you
>> could have a small dictionary for small and very structured JSONB
>> fields (e.g. some time-series data), and a large one for large /
>> unstructured JSONB fields, without having the significant performance
>> impact of having that large and varied dictionary on the
>> small&structured field. Although a binary search is log(n) and thus
>> still quite cheap even for large dictionaries, the extra size is
>> certainly not free, and you'll be touching more memory in the process.
>>
> I'm sure we can think of various other arguments for allowing separate
> dictionaries. For example, what if you drop a column? With one huge
> dictionary you're bound to keep the data forever. With per-column dicts
> you can just drop the dict and free disk space / memory.
>
> I also find it hard to believe that no one needs 2**16 strings. I mean,
> 65k is not that much, really. To give an example, I've been toying with
> storing bitcoin blockchain in a database - one way to do that is storing
> each block as a single JSONB document. But each "item" (eg. transaction)
> is identified by a unique hash, so that means (tens of) thousands of
> unique strings *per document*.
>
> Yes, it's a bit silly and extreme, and maybe the compression would not
> help much in this case. But it shows that 2**16 is damn easy to hit.
>
> In other words, this seems like a nice example of survivor bias, where
> we only look at cases for which the existing limitations are acceptable,
> ignoring the (many) remaining cases eliminated by those limitations.
>
>
I don't think we should lightly discard the use of 2-byte keys, though.
Maybe we could use a scheme similar to what we use for text lengths,
where the first bit indicates whether we have a 1-byte or 4-byte length
indicator. Many dictionaries will have fewer than 2^15-1 entries, so they
would use the smaller keys exclusively.
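
To make that concrete, here's a minimal sketch in plain C of how such a
variable-width key might look (hypothetical function names, not the actual
varlena macros): the high bit of the first byte selects between a 2-byte key
carrying a 15-bit dictionary id and a 4-byte key carrying a 31-bit id, so
dictionaries with fewer than 2^15 entries only ever pay two bytes per
reference.

/*
 * Hypothetical variable-width dictionary key encoding, shown only to
 * illustrate the idea; this is not existing PostgreSQL code.
 *
 * First byte, high bit clear -> 2-byte key, 15-bit dictionary id
 * First byte, high bit set   -> 4-byte key, 31-bit dictionary id
 */
#include <stdint.h>
#include <assert.h>

static inline int
dict_key_encode(uint32_t id, uint8_t *out)
{
    if (id < (1u << 15))
    {
        /* short form: two bytes, high bit clear */
        out[0] = (uint8_t) (id >> 8);
        out[1] = (uint8_t) (id & 0xFF);
        return 2;
    }
    assert(id < (1u << 31));
    /* long form: four bytes, high bit set */
    out[0] = (uint8_t) (0x80 | (id >> 24));
    out[1] = (uint8_t) ((id >> 16) & 0xFF);
    out[2] = (uint8_t) ((id >> 8) & 0xFF);
    out[3] = (uint8_t) (id & 0xFF);
    return 4;
}

static inline int
dict_key_decode(const uint8_t *in, uint32_t *id)
{
    if ((in[0] & 0x80) == 0)
    {
        *id = ((uint32_t) in[0] << 8) | in[1];
        return 2;
    }
    *id = ((uint32_t) (in[0] & 0x7F) << 24) |
          ((uint32_t) in[1] << 16) |
          ((uint32_t) in[2] << 8) |
          in[3];
    return 4;
}

As with the varlena short-header trick, the decoder only has to look at the
first byte to know how wide the key is, so small dictionaries never pay for
the wide form.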
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com