
From Aleksander Alekseev
Subject Re: [PATCH] Compression dictionaries for JSONB
Date
Msg-id CAJ7c6TMThnxjn6sM7s8y5Yib4+BDhvGGFhW=qK8R+YMiKTwvBQ@mail.gmail.com
In response to Re: [PATCH] Compression dictionaries for JSONB  (Jacob Champion <jchampion@timescale.com>)
List pgsql-hackers
Hi Jacob,

Many thanks for your feedback!

> I saw there was some previous discussion about dictionary size. It
> looks like this approach would put all dictionaries into a shared OID
> pool. Since I don't know what a "standard" use case is, is there any
> risk of OID exhaustion for larger deployments with many dictionaries?
> Or is 2**32 so comparatively large that it's not really a serious
> concern?

I agree, this is a drawback of the current implementation. To be honest,
I simply followed the example of how ENUMs are implemented. I'm not 100% sure
whether we should be worried here (apparently, freed OIDs are reused). I'm OK
with using a separate sequence if someone seconds this. This is the first time
I'm altering the catalog, so I'm not certain what the best practices are.

> I see the algorithm description, but I'm curious to know whether it's
> based on some other existing compression scheme, for the sake of
> comparison. It seems like it shares similarities with the Snappy
> scheme?
>
> Could you talk more about what the expected ratios and runtime
> characteristics are? Best I can see is that compression runtime is
> something like O(n * e * log d) where n is the length of the input, e
> is the maximum length of a dictionary entry, and d is the number of
> entries in the dictionary. Since e and d are constant for a given
> static dictionary, how well the dictionary is constructed is
> presumably important.

The algorithm is almost identical to the one I used in the ZSON extension [1],
except that ZSON uses 16-bit codes. In docs/benchmark.md you will find the
approximate compression ratios to expect, etc. A rough sketch of the general
idea follows the list below. This particular algorithm was chosen for the
following reasons:

1. It was extensively tested in the past and seems to work OK for existing
   ZSON users.
2. It doesn't rely on any knowledge of the data structure and thus can be
   reused for TEXT/XML/etc. as-is.
3. Previously we agreed that at some point users will be able to change the
   algorithm (the same way they can for TOAST now), so which algorithm is used
   in the first implementation is not that important. I simply chose the
   already existing one.
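
To illustrate the general idea, here is a rough standalone sketch of this kind
of compression: greedy longest-match substitution of dictionary entries with
fixed-size codes. It is only a toy with made-up names and a linear-scan lookup,
not code from the patch (the real code of course deals with JSONB values and
the entries stored in the catalog, not C string literals):

/*
 * Toy illustration of the compression scheme: greedy substitution of
 * dictionary entries with fixed-size codes.  All names here are made up
 * for the example.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct DictEntry
{
    const char *text;
    uint32_t    code;
} DictEntry;

/* a tiny hypothetical dictionary */
static const DictEntry dict[] = {
    {"\"city\": ", 1},
    {"\"country\": ", 2},
    {"\"name\": ", 3},
};
#define DICT_SIZE   (sizeof(dict) / sizeof(dict[0]))
#define ESCAPE_BYTE 0xFF        /* marks an encoded dictionary code */

/* return the longest dictionary entry that is a prefix of in[0..inlen) */
static const DictEntry *
longest_match(const char *in, size_t inlen)
{
    const DictEntry *best = NULL;
    size_t      best_len = 0;

    /* linear scan for clarity; sorted entries allow a binary search */
    for (size_t i = 0; i < DICT_SIZE; i++)
    {
        size_t      len = strlen(dict[i].text);

        if (len <= inlen && len > best_len &&
            memcmp(in, dict[i].text, len) == 0)
        {
            best = &dict[i];
            best_len = len;
        }
    }
    return best;
}

/*
 * Compress inlen bytes from 'in' into 'out': dictionary entries are
 * replaced by ESCAPE_BYTE followed by a 32-bit code, everything else is
 * copied verbatim.  (A real implementation also has to escape literal
 * 0xFF bytes in the input.)  Returns the number of bytes written.
 */
static size_t
compress(const char *in, size_t inlen, unsigned char *out)
{
    size_t      ip = 0, op = 0;

    while (ip < inlen)
    {
        const DictEntry *e = longest_match(in + ip, inlen - ip);

        if (e != NULL)
        {
            out[op++] = ESCAPE_BYTE;
            memcpy(out + op, &e->code, sizeof(uint32_t));
            op += sizeof(uint32_t);
            ip += strlen(e->text);
        }
        else
            out[op++] = (unsigned char) in[ip++];
    }
    return op;
}

int
main(void)
{
    const char *doc =
        "{\"name\": \"Alice\", \"city\": \"Oslo\", \"country\": \"NO\"}";
    unsigned char buf[256];
    size_t      n = compress(doc, strlen(doc), buf);

    printf("original: %zu bytes, compressed: %zu bytes\n", strlen(doc), n);
    return 0;
}

With the entries kept sorted, the per-position lookup is roughly O(e log d),
which is where the O(n * e * log d) estimate you gave comes from.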

> > Current limitations (todo):
> > - ALTER TYPE is not implemented
>
> That reminds me. How do people expect to generate a "good" dictionary
> in practice? Would they somehow get the JSONB representations out of
> Postgres and run a training program over the blobs? I see some
> reference to training functions in the prior threads but don't see any
> breadcrumbs in the code.

So far we agreed that in the first implementation this will be done manually.
In the future it will be possible to update the dictionaries automatically
during VACUUM. The idea of something similar to a zson_learn() procedure, as
I recall, didn't get much support, so we probably will not have it, or at least
it is not a priority.

> > - Alternative compression algorithms. Note that this will not require any
> > further changes in the catalog, only the values we write to pg_type and
> > pg_cast will differ.
>
> Could you expand on this? I.e. why would alternative algorithms not
> need catalog changes? It seems like the only schemes that could be
> used with pg_catalog.pg_dict are those that expect to map a byte
> string to a number. Is that general enough to cover other standard
> compression algorithms?

Sure. When a new dictionary is created, pg_type and pg_cast are modified like this:

=# CREATE TYPE mydict AS DICTIONARY OF JSONB ('abcdef', 'ghijkl');
CREATE TYPE
=# SELECT * FROM pg_type WHERE typname = 'mydict';
-[ RECORD 1 ]--+---------------
oid            | 16397
typname        | mydict
typnamespace   | 2200
...
typarray       | 16396
typinput       | dictionary_in
typoutput      | dictionary_out
...

=# SELECT c.*, p.proname FROM pg_cast AS c
     LEFT JOIN pg_proc AS p ON p.oid = c.castfunc
     WHERE c.castsource = 16397 OR c.casttarget = 16397;
-[ RECORD 1 ]-----------------
oid         | 16400
castsource  | 3802
casttarget  | 16397
castfunc    | 9866
castcontext | a
castmethod  | f
proname     | jsonb_dictionary
-[ RECORD 2 ]-----------------
oid         | 16401
castsource  | 16397
casttarget  | 3802
castfunc    | 9867
castcontext | i
castmethod  | f
proname     | dictionary_jsonb
-[ RECORD 3 ]-----------------
oid         | 16402
castsource  | 16397
casttarget  | 17
castfunc    | 9868
castcontext | e
castmethod  | f
proname     | dictionary_bytea

In order to add a new algorithm, you only need to provide alternatives to
dictionary_in / dictionary_out / jsonb_dictionary / dictionary_jsonb and
specify them in the catalog instead. The catalog schema itself remains the same.
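
For example, a replacement for jsonb_dictionary is just another function with
the same shape registered as the cast function. A bare skeleton (the function
name and body below are made up, only the shape matters) could look like this:

/*
 * Hypothetical skeleton of an alternative jsonb -> dictionary compression
 * function.  Any function of this shape can be registered in pg_proc and
 * referenced from pg_cast.castfunc; the catalog schema doesn't change.
 */
#include "postgres.h"
#include "fmgr.h"
#include "utils/jsonb.h"

/* PG_FUNCTION_INFO_V1 is only needed if this lives in an extension */

Datum
jsonb_dictionary_alt(PG_FUNCTION_ARGS)
{
    Jsonb      *jb = PG_GETARG_JSONB_P(0);

    /*
     * Compress the value with the alternative algorithm and build the
     * dictionary type's representation here.
     */

    PG_RETURN_POINTER(jb);      /* placeholder: returns the input unchanged */
}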

> It also seems strange to use a dictionary of C strings to compress
> binary data; wouldn't we want to be able to compress zero bytes too?

That's a good point. Again, here I simply followed the example of the ENUMs
implementation. Since compression dictionaries are intended to be used with
text-like types such as JSONB (and also JSON, TEXT and XML in the future),
choosing the Name type seemed to be a reasonable compromise. Dictionary entries
are most likely going to store JSON keys, common words used in the text, etc.
However, I'm fine with any alternative scheme if somebody experienced with the
PostgreSQL catalog could second it.

[1]: https://github.com/afiskon/zson

-- 
Best regards,
Aleksander Alekseev


