Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers
From | Jacob Champion
---|---
Subject | Re: [PATCH] Compression dictionaries for JSONB
Date |
Msg-id | CAAWbhmgtHaVdh4cDHcDWOd7m4KL6RfBcdtYgy5WDSKN=Sucb5w@mail.gmail.com
In response to | [PATCH] Compression dictionaries for JSONB (Aleksander Alekseev <aleksander@timescale.com>)
Responses | Re: [PATCH] Compression dictionaries for JSONB
List | pgsql-hackers
On Wed, Jun 1, 2022 at 1:44 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
> This is a follow-up thread to `RFC: compression dictionaries for JSONB` [1]. I would like to share my current progress in order to get early feedback. The patch is currently in a draft state but implements the basic functionality. I did my best to account for all the great feedback I previously got from Alvaro and Matthias.

I'm coming up to speed with this set of threads -- the following is not a complete review by any means, and please let me know if I've missed some of the history.

> SELECT * FROM pg_dict;
>   oid  | dicttypid | dictentry
> -------+-----------+-----------
>  39476 |     39475 | aaa
>  39477 |     39475 | bbb
> (2 rows)

I saw there was some previous discussion about dictionary size. It looks like this approach would put all dictionaries into a shared OID pool. Since I don't know what a "standard" use case is, is there any risk of OID exhaustion for larger deployments with many dictionaries? Or is 2**32 so comparatively large that it's not really a serious concern?

> When `mydict` sees 'aaa' in the document, it replaces it with the corresponding code, in this case - 39476. For more details regarding the compression algorithm and chosen compromises please see the comments in the patch.

I see the algorithm description, but I'm curious to know whether it's based on some other existing compression scheme, for the sake of comparison. It seems like it shares similarities with the Snappy scheme? Could you talk more about what the expected ratios and runtime characteristics are? Best I can see is that compression runtime is something like O(n * e * log d), where n is the length of the input, e is the maximum length of a dictionary entry, and d is the number of entries in the dictionary. Since e and d are constant for a given static dictionary, how well the dictionary is constructed is presumably important.
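To check my reading, here is roughly the shape of loop I'm picturing -- purely a hypothetical sketch, not the patch's code (the names and structure are mine): every input position tries candidate prefixes up to the longest dictionary entry and binary-searches a sorted entry array, which is where the n, e, and log d factors come from.

```c
/*
 * Hypothetical sketch, not the patch's code: greedy longest-prefix match
 * against a dictionary sorted by entry text.  For an input of length n it
 * tries up to e candidate lengths per position and does a log(d) binary
 * search per candidate, i.e. roughly O(n * e * log d) comparisons overall.
 */
#include <string.h>

typedef struct DictEntry
{
	const char *key;			/* entry text; array is sorted by key */
	size_t		keylen;
	unsigned	code;			/* e.g. the entry's OID */
} DictEntry;

/* Return the code of the longest entry matching a prefix of buf, or 0. */
static unsigned
longest_match(const DictEntry *dict, int nentries,
			  const char *buf, size_t buflen,
			  size_t maxentrylen, size_t *matchlen)
{
	unsigned	best = 0;

	*matchlen = 0;
	for (size_t len = 1; len <= maxentrylen && len <= buflen; len++)	/* e */
	{
		int			lo = 0;
		int			hi = nentries - 1;

		while (lo <= hi)		/* log d */
		{
			int			mid = (lo + hi) / 2;
			int			cmp = strncmp(buf, dict[mid].key, len);

			if (cmp == 0 && dict[mid].keylen == len)
			{
				best = dict[mid].code;	/* remember the longest hit so far */
				*matchlen = len;
				break;
			}
			if (cmp <= 0)
				hi = mid - 1;
			else
				lo = mid + 1;
		}
	}
	return best;
}
```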
> In pg_type `mydict` has typtype = TYPTYPE_DICT. It works the same way as TYPTYPE_BASE with only one difference: the corresponding `<type>_in` (pg_type.typinput) and `<another-type>_<type>` (pg_cast.castfunc) procedures receive the dictionary Oid as a `typmod` argument. This way the procedures can distinguish `mydict1` from `mydict2` and use the proper compression dictionary.
>
> The approach with an alternative `typmod` role is arguably a bit hacky, but it was the least invasive way to implement the feature I've found. I'm open to alternative suggestions.

I haven't looked at this closely enough to develop an opinion yet.

> Current limitations (todo):
> - ALTER TYPE is not implemented

That reminds me: how do people expect to generate a "good" dictionary in practice? Would they somehow get the JSONB representations out of Postgres and run a training program over the blobs? I see some reference to training functions in the prior threads but don't see any breadcrumbs in the code.

> - Alternative compression algorithms. Note that this will not require any further changes in the catalog, only the values we write to pg_type and pg_cast will differ.

Could you expand on this? That is, why would alternative algorithms not need catalog changes? It seems like the only schemes that could be used with pg_catalog.pg_dict are those that expect to map a byte string to a number. Is that general enough to cover other standard compression algorithms?

> Open questions:
> - Dictionary entries are currently stored as NameData, the same type that is used for enums. Are we OK with the accompanying limitations? Any alternative suggestions?

It does feel a little weird to have a hard limit on the entry length, since that also limits the compression ratio. But it also limits the compression runtime, so maybe it's a worthwhile tradeoff. It also seems strange to use a dictionary of C strings to compress binary data; wouldn't we want to be able to compress zero bytes too?
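For reference, this is the definition I'm looking at: NAMEDATALEN defaults to 64 and names are handled as NUL-terminated C strings, so each entry gets at most 63 usable bytes and can't contain an embedded zero byte.

```c
/* src/include/pg_config_manual.h */
#define NAMEDATALEN 64

/* src/include/c.h */
typedef struct nameData
{
	char		data[NAMEDATALEN];
} NameData;

typedef NameData *Name;
```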
Hope this helps,
--Jacob