Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers
From: Aleksander Alekseev
Subject: Re: [PATCH] Compression dictionaries for JSONB
Msg-id: CAJ7c6TMThnxjn6sM7s8y5Yib4+BDhvGGFhW=qK8R+YMiKTwvBQ@mail.gmail.com
In response to: Re: [PATCH] Compression dictionaries for JSONB (Jacob Champion <jchampion@timescale.com>)
List: pgsql-hackers

Hi Jacob,

Many thanks for your feedback!

> I saw there was some previous discussion about dictionary size. It
> looks like this approach would put all dictionaries into a shared OID
> pool. Since I don't know what a "standard" use case is, is there any
> risk of OID exhaustion for larger deployments with many dictionaries?
> Or is 2**32 so comparatively large that it's not really a serious
> concern?

I agree, this is a drawback of the current implementation. To be honest,
I simply followed the example of how ENUMs are implemented. I'm not 100%
sure if we should be worried here (apparently, freed OIDs are reused).
I'm OK with using a separate sequence if someone could second this. This
is the first time I'm altering the catalog, so I'm not certain what the
best practices are.

> I see the algorithm description, but I'm curious to know whether it's
> based on some other existing compression scheme, for the sake of
> comparison. It seems like it shares similarities with the Snappy
> scheme?
>
> Could you talk more about what the expected ratios and runtime
> characteristics are? Best I can see is that compression runtime is
> something like O(n * e * log d) where n is the length of the input, e
> is the maximum length of a dictionary entry, and d is the number of
> entries in the dictionary. Since e and d are constant for a given
> static dictionary, how well the dictionary is constructed is
> presumably important.

The algorithm is almost identical to the one I used in the ZSON
extension [1], except that ZSON uses 16-bit codes. In docs/benchmark.md
you will find the approximate ratios to expect, etc.

The reasons why this particular algorithm was chosen are:

1. It was extensively tested in the past and seems to work OK for
   existing ZSON users.
2. It doesn't use any knowledge of the data structure and thus can be
   reused for TEXT/XML/etc as-is.
3. Previously we agreed that at some point users will be able to change
   the algorithm (the same way they can for TOAST now), so which
   algorithm is used in the first implementation is not that important.
   I simply chose the already existing one.

> > Current limitations (todo):
> > - ALTER TYPE is not implemented
>
> That reminds me. How do people expect to generate a "good" dictionary
> in practice? Would they somehow get the JSONB representations out of
> Postgres and run a training program over the blobs? I see some
> reference to training functions in the prior threads but don't see any
> breadcrumbs in the code.

So far we agreed that in the first implementation this will be done
manually. In the future it will be possible to update the dictionaries
automatically during VACUUM. The idea of something similar to the
zson_learn() procedure, as I recall, didn't get much support, so we
probably will not have it, or at least it is not a priority.

> > - Alternative compression algorithms. Note that this will not require any
> > further changes in the catalog, only the values we write to pg_type and
> > pg_cast will differ.
>
> Could you expand on this? I.e. why would alternative algorithms not
> need catalog changes? It seems like the only schemes that could be
> used with pg_catalog.pg_dict are those that expect to map a byte
> string to a number. Is that general enough to cover other standard
> compression algorithms?

Sure. When creating a new dictionary, pg_type and pg_cast are modified
like this:

=# CREATE TYPE mydict AS DICTIONARY OF JSONB ('abcdef', 'ghijkl');
CREATE TYPE

=# SELECT * FROM pg_type WHERE typname = 'mydict';
-[ RECORD 1 ]--+---------------
oid            | 16397
typname        | mydict
typnamespace   | 2200
...
typarray       | 16396
typinput       | dictionary_in
typoutput      | dictionary_out
...

=# SELECT c.*, p.proname FROM pg_cast AS c
   LEFT JOIN pg_proc AS p ON p.oid = c.castfunc
   WHERE c.castsource = 16397 or c.casttarget = 16397;
-[ RECORD 1 ]-----------------
oid         | 16400
castsource  | 3802
casttarget  | 16397
castfunc    | 9866
castcontext | a
castmethod  | f
proname     | jsonb_dictionary
-[ RECORD 2 ]-----------------
oid         | 16401
castsource  | 16397
casttarget  | 3802
castfunc    | 9867
castcontext | i
castmethod  | f
proname     | dictionary_jsonb
-[ RECORD 3 ]-----------------
oid         | 16402
castsource  | 16397
casttarget  | 17
castfunc    | 9868
castcontext | e
castmethod  | f
proname     | dictionary_bytea

In order to add a new algorithm, you simply need to provide alternatives
to dictionary_in / dictionary_out / jsonb_dictionary / dictionary_jsonb
and specify them in the catalog instead. The catalog schema will remain
the same.
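For what it's worth, given the cast contexts above ('a' for jsonb ->
mydict, 'i' for mydict -> jsonb, 'e' for mydict -> bytea), I expect
typical usage to look roughly like this (the table and column names
below are made up for the example, and the exact behaviour may still
change as the patch evolves):

CREATE TABLE docs(payload mydict);

-- the assignment cast ('a') compresses a JSONB value when it is stored
INSERT INTO docs VALUES ('{"abcdef": "ghijkl"}'::jsonb);

-- the implicit cast ('i') back to jsonb lets the usual operators work
SELECT payload -> 'abcdef' FROM docs;

-- the explicit cast ('e') exposes the raw compressed bytes
SELECT payload::bytea FROM docs;

So from the user's point of view a compression dictionary behaves like a
regular type, and all the algorithm-specific logic lives in the functions
referenced by these catalog entries.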
> It also seems strange to use a dictionary of C strings to compress
> binary data; wouldn't we want to be able to compress zero bytes too?

That's a good point. Again, here I simply followed the example of the
ENUMs implementation. Since compression dictionaries are intended to be
used with text-like types such as JSONB (and, in the future, also JSON,
TEXT and XML), choosing the Name type seemed like a reasonable
compromise. Dictionary entries are most likely going to store JSON keys,
common words used in TEXT documents, etc. However, I'm fine with any
alternative scheme if somebody experienced with the PostgreSQL catalog
could second this.

[1]: https://github.com/afiskon/zson

--
Best regards,
Aleksander Alekseev