Re: Add ZSON extension to /contrib/ - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Add ZSON extension to /contrib/
Date:
Msg-id: 77356556-0634-5cde-f55e-cce739dc09b9@enterprisedb.com
In response to: Re: Add ZSON extension to /contrib/ (Andrew Dunstan <andrew@dunslane.net>)
List: pgsql-hackers
On 5/28/21 4:22 PM, Andrew Dunstan wrote:
>
> On 5/28/21 6:35 AM, Tomas Vondra wrote:
>>
>>>
>>> IMO the main benefit of having different dictionaries is that you
>>> could have a small dictionary for small and very structured JSONB
>>> fields (e.g. some time-series data), and a large one for large /
>>> unstructured JSONB fields, without having the significant performance
>>> impact of having that large and varied dictionary on the
>>> small&structured field. Although a binary search is log(n) and thus
>>> still quite cheap even for large dictionaries, the extra size is
>>> certainly not free, and you'll be touching more memory in the process.
>>>
>> I'm sure we can think of various other arguments for allowing separate
>> dictionaries. For example, what if you drop a column? With one huge
>> dictionary you're bound to keep the data forever. With per-column dicts
>> you can just drop the dict and free disk space / memory.
>>
>> I also find it hard to believe that no one needs 2**16 strings. I mean,
>> 65k is not that much, really. To give an example, I've been toying with
>> storing the bitcoin blockchain in a database - one way to do that is
>> storing each block as a single JSONB document. But each "item" (e.g. a
>> transaction) is identified by a unique hash, so that means (tens of)
>> thousands of unique strings *per document*.
>>
>> Yes, it's a bit silly and extreme, and maybe the compression would not
>> help much in this case. But it shows that 2**16 is damn easy to hit.
>>
>> In other words, this seems like a nice example of survivor bias, where
>> we only look at cases for which the existing limitations are acceptable,
>> ignoring the (many) remaining cases eliminated by those limitations.
>>
>
> I don't think we should lightly discard the use of 2-byte keys, though.
> Maybe we could use a scheme similar to what we use for text lengths,
> where the first bit indicates whether we have a 1-byte or 4-byte length
> indicator. Many dictionaries will have fewer than 2^15-1 entries, so
> they would use exclusively the smaller keys.
>

I didn't mean to discard that, of course. I'm sure a lot of data sets may
be perfectly fine with 64k keys, and it may be worth optimizing that as a
special case. All I'm saying is that if we start from the position that
this limit is perfectly fine and no one is going to hit it in practice,
it may be due to people not even trying it on documents with more keys.

That being said, I still don't think the 1MB vs. 1.7MB figure is
particularly meaningful, because it's for an "empty" dictionary, which is
something you won't have in practice. And once you start adding keys, the
difference will get less and less significant.

However, if we care about efficiency for "small" JSON documents, it's
probably worth using something like varint [1], which is 1-4B depending
on the value.

[1] https://learnmeabitcoin.com/technical/varint

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
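
PS: To make the varint idea concrete, here is a minimal sketch (not code
from the ZSON patch, function names are just illustrative) of a
continuation-bit varint for 32-bit dictionary key IDs. It's the LEB128
flavor rather than the exact encoding from [1], but the trade-off is the
same: keys below 2^7 take one byte, below 2^14 two bytes, below 2^21
three bytes, below 2^28 four bytes. The high bit of each byte plays the
same "short or long" role as the flag bit in the varlena-style scheme
Andrew describes.

#include <stdint.h>
#include <stddef.h>

/*
 * Write "key" into "buf" (at least 5 bytes available), 7 data bits per
 * byte; the high bit of each byte means "more bytes follow".  Returns
 * the number of bytes written.
 */
static size_t
varint_encode(uint32_t key, uint8_t *buf)
{
	size_t		n = 0;

	while (key >= 0x80)
	{
		buf[n++] = (uint8_t) ((key & 0x7F) | 0x80);
		key >>= 7;
	}
	buf[n++] = (uint8_t) key;
	return n;
}

/*
 * Read one varint from "buf", storing the number of bytes consumed in
 * *consumed.
 */
static uint32_t
varint_decode(const uint8_t *buf, size_t *consumed)
{
	uint32_t	key = 0;
	int			shift = 0;
	size_t		n = 0;
	uint8_t		b;

	do
	{
		b = buf[n++];
		key |= (uint32_t) (b & 0x7F) << shift;
		shift += 7;
	} while (b & 0x80);

	*consumed = n;
	return key;
}

So a small "hot" set of keys decodes with a single byte read, while the
same format can still address dictionaries well past 2**16 entries.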