Re: RFC: compression dictionaries for JSONB - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: RFC: compression dictionaries for JSONB
Date
Msg-id CAEze2Wiev85+Oj7a0jXNSkBpTeocMx3bzoO7HMF6fUkG6FOAwA@mail.gmail.com
Whole thread Raw
In response to Re: RFC: compression dictionaries for JSONB  (Aleksander Alekseev <aleksander@timescale.com>)
Responses Re: RFC: compression dictionaries for JSONB  (Aleksander Alekseev <aleksander@timescale.com>)
List pgsql-hackers
On Mon, 11 Oct 2021 at 15:25, Aleksander Alekseev
<aleksander@timescale.com> wrote:
> Agree, add / change / remove of a column should be handled
> automatically. Just to clarify, by column option do you mean syntax
> like ALTER TABLE ... ALTER COLUMN ... etc, right?

Correct, either SET (option) or maybe using typmod (which is hack-ish,
but could save on some bytes of storage)

> I didn't think of
> extending this part of the syntax. That would be a better choice
> indeed.

> > Overall, I'm glad to see this take off, but I do want some
> > clarifications regarding the direction that this is going towards.
> > [...]
> > Actually, why is it a JSONB_DICTIONARY and not like:
> > CREATE TYPE name AS DICTIONARY (
> >  base_type = JSONB, ...
> > );
> > so that it is possible to use the infrastructure for other things?  For
> > example, perhaps PostGIS geometries could benefit from it -- or even
> > text or xml columns.
>
> So the question is if we want to extend the capabilities of a single
> type, i.e. JSONB, or to add a functionality that would work for the
> various types. I see the following pros and cons of both approaches.
>
> Modifying JSONB may at some point allow to partially decompress only
> the parts of the document that need to be decompressed for a given
> query. However, the compression schema will be probably less
> efficient. There could also be difficulties in respect of backward
> compatibility, and this is going to work only with JSONB.

Assuming this above is option 1. If I understand correctly, this
option was 'adapt the data type so that it understands how to handle a
shared dictionary, decreasing storage requirements'.

> An alternative approach, CREATE TYPE ... AS DICTIONARY OF <type> or
> something like this would work not only for JSONB, but also for TEXT,
> XML, arrays, and PostGIS. By the way, this was suggested before [1].
> Another advantage here is that all things being equal the compression
> schema could be more efficient. The feature will not affect existing
> types. The main disadvantage is that implementing a partial
> decompression would be very difficult and/or impractical.

Assuming this was the 2nd option. If I understand correctly, this
option is effectively 'adapt or wrap TOAST to understand and handle
dictionaries for dictionary encoding common values'.

> Personally, I would say that the 2nd option, CREATE TYPE ... AS
> DICTIONARY OF <type>, seems to be more useful. To my knowledge, not
> many users would care much about partial decompression, and this is
> the only real advantage of the 1st option I see. Also, this is how
> ZSON is implemented. It doesn't care about the underlying type and
> treats it as a BLOB. Thus the proofs of usefulness I provided above
> are not quite valid for the 1st option. Probably unrelated, but 2nd
> option would be even easier for me to implement since I already solved
> a similar task.
>
> All in all, I suggest focusing on the 2nd option with universal
> compression dictionaries. Naturally, the focus will be on JSONB first.
> But we will be able to extend this functionality for other types as
> well.
>
> Thoughts?

I think that an 'universal dictionary encoder' would be useful, but
that a data type might also have good reason to implement their
replacement methods by themselves for better overall performance (such
as maintaining partial detoast support in dictionaried items, or
overall lower memory footprint, or ...). As such, I'd really
appreciate it if Option 1 is not ruled out by any implementation of
Option 2.

Kind regards,

Matthias van de Meent



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: storing an explicit nonce
Next
From: Mark Dilger
Date:
Subject: Re: BUG #17212: pg_amcheck fails on checking temporary relations