Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers

From: Jacob Champion
Subject: Re: [PATCH] Compression dictionaries for JSONB
Msg-id: CAAWbhmgtHaVdh4cDHcDWOd7m4KL6RfBcdtYgy5WDSKN=Sucb5w@mail.gmail.com
In response to: [PATCH] Compression dictionaries for JSONB (Aleksander Alekseev <aleksander@timescale.com>)
Responses: Re: [PATCH] Compression dictionaries for JSONB (Aleksander Alekseev <aleksander@timescale.com>)
List: pgsql-hackers
On Wed, Jun 1, 2022 at 1:44 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
> This is a follow-up thread to `RFC: compression dictionaries for JSONB`
> [1]. I would like to share my current progress in order to get early
> feedback. The patch is currently in a draft state but implements the
> basic functionality. I did my best to account for all the great feedback
> I previously got from Alvaro and Matthias.

I'm coming up to speed with this set of threads -- the following is
not a complete review by any means, and please let me know if I've
missed some of the history.

> SELECT * FROM pg_dict;
>   oid  | dicttypid | dictentry
> -------+-----------+-----------
>  39476 |     39475 | aaa
>  39477 |     39475 | bbb
> (2 rows)

I saw there was some previous discussion about dictionary size. It
looks like this approach would put all dictionaries into a shared OID
pool. Since I don't know what a "standard" use case is, is there any
risk of OID exhaustion for larger deployments with many dictionaries?
Or is 2**32 so comparatively large that it's not really a serious
concern?
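
(Back-of-the-envelope, with made-up numbers: a thousand dictionaries of a
hundred thousand entries each would consume about 10^8 OIDs, a couple of
percent of the roughly 4.3 billion available. Those OIDs do come out of
the same cluster-wide counter as everything else, though, so churn from
repeatedly rebuilding dictionaries might matter more than the
steady-state count.)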

> When `mydict` sees 'aaa' in the document, it replaces it with the
> corresponding code, in this case - 39476. For more details regarding the
> compression algorithm and chosen compromises please see the comments in
> the patch.

I see the algorithm description, but I'm curious to know whether it's
based on some other existing compression scheme, for the sake of
comparison. It seems like it shares similarities with the Snappy
scheme?

Could you talk more about what the expected ratios and runtime
characteristics are? Best I can see is that compression runtime is
something like O(n * e * log d) where n is the length of the input, e
is the maximum length of a dictionary entry, and d is the number of
entries in the dictionary. Since e and d are constant for a given
static dictionary, how well the dictionary is constructed is
presumably important.
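
To check that I'm reasoning about the right loop shape, here is a
standalone toy of the matching strategy I'm assuming (the entry contents,
the MAX_ENTRY_LEN cap, and the sorted-array-plus-bsearch lookup are my own
stand-ins, not the patch's code): at every input position it tries each
candidate length up to the maximum entry length and binary-searches the
dictionary, which is where the per-position e * log d cost would come
from.

/* toy_match.c: illustration of the cost model above, not patch code */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRY_LEN 64        /* "e": cap on dictionary entry length */

typedef struct DictEntry
{
    const char *key;            /* dictionary entry */
    unsigned int code;          /* stand-in for the entry's OID */
} DictEntry;

/* hypothetical dictionary, kept sorted by key so bsearch() works */
static const DictEntry dict[] = {
    {"aaa", 39476},
    {"bbb", 39477},
};

static int
dict_cmp(const void *a, const void *b)
{
    return strcmp(((const DictEntry *) a)->key,
                  ((const DictEntry *) b)->key);
}

int
main(void)
{
    const char *input = "xxxaaayyybbb";
    size_t      n = strlen(input);
    size_t      pos = 0;

    while (pos < n)
    {
        char        buf[MAX_ENTRY_LEN + 1];
        const DictEntry *best = NULL;
        size_t      best_len = 0;
        size_t      len;

        /* try every candidate length: the e * log d work per position */
        for (len = 1; len <= MAX_ENTRY_LEN && pos + len <= n; len++)
        {
            DictEntry   probe = {buf, 0};
            DictEntry  *hit;

            memcpy(buf, input + pos, len);
            buf[len] = '\0';

            hit = bsearch(&probe, dict,
                          sizeof(dict) / sizeof(dict[0]),
                          sizeof(DictEntry), dict_cmp);
            if (hit)
            {
                best = hit;     /* remember the longest match so far */
                best_len = len;
            }
        }

        if (best)
        {
            printf("[%u]", best->code);     /* emit the code */
            pos += best_len;
        }
        else
        {
            putchar(input[pos]);            /* emit the literal byte */
            pos++;
        }
    }
    putchar('\n');
    return 0;
}

If the patch instead walks a prefix tree or hashes fixed-size chunks, the
constants change quite a bit, which is why I'm asking.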

> In pg_type `mydict` has typtype = TYPTYPE_DICT. It works the same way as
> TYPTYPE_BASE with the only difference that the corresponding `<type>_in`
> (pg_type.typinput) and `<another-type>_<type>` (pg_cast.castfunc)
> procedures receive the dictionary Oid as a `typmod` argument. This way
> the procedures can distinguish `mydict1` from `mydict2` and use the
> proper compression dictionary.
>
> The approach with the alternative `typmod` role is arguably a bit hacky,
> but it was the least invasive way to implement the feature I've found.
> I'm open to alternative suggestions.

Haven't looked at this closely enough to develop an opinion yet.
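That said, here is my current reading of the typmod handoff, written out
as a stub so you can tell me where I'm off (the function name and the
placeholder body are mine, not the patch's): the dictionary's OID arrives
in the argument slot where a typmod normally would, so one C-level input
function can serve both `mydict1` and `mydict2`.

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(mydict_in);

/*
 * Hypothetical input function for a dictionary type.  The third argument,
 * normally a typmod, is reinterpreted as the OID of the compression
 * dictionary to use.
 */
Datum
mydict_in(PG_FUNCTION_ARGS)
{
    char       *str = PG_GETARG_CSTRING(0);
    Oid         typioparam = PG_GETARG_OID(1);
    int32       typmod = PG_GETARG_INT32(2);
    Oid         dictoid = (Oid) typmod;

    (void) typioparam;

    /* real code would look up pg_dict rows for dictoid and compress str */
    elog(DEBUG1, "compressing with dictionary %u", dictoid);

    /* placeholder: hand back the uncompressed text */
    PG_RETURN_TEXT_P(cstring_to_text(str));
}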

> Current limitations (todo):
> - ALTER TYPE is not implemented

That reminds me. How do people expect to generate a "good" dictionary
in practice? Would they somehow get the JSONB representations out of
Postgres and run a training program over the blobs? I see some
reference to training functions in the prior threads but don't see any
breadcrumbs in the code.

> - Alternative compression algorithms. Note that this will not require
> any further changes in the catalog, only the values we write to pg_type
> and pg_cast will differ.

Could you expand on this? I.e. why would alternative algorithms not
need catalog changes? It seems like the only schemes that could be
used with pg_catalog.pg_dict are those that expect to map a byte
string to a number. Is that general enough to cover other standard
compression algorithms?

> Open questions:
> - Dictionary entries are currently stored as NameData, the same type
> that is used for enums. Are we OK with the accompanying limitations? Any
> alternative suggestions?

It does feel a little weird to have a hard limit on the entry length
(63 bytes with the default NAMEDATALEN), since that also limits the
compression ratio. But it also limits the compression runtime, so maybe
it's a worthwhile tradeoff.

It also seems strange to use a dictionary of C strings to compress
binary data; wouldn't we want to be able to compress zero bytes too?

Hope this helps,
--Jacob


