Re: RFC: compression dictionaries for JSONB - Mailing list pgsql-hackers

From Aleksander Alekseev
Subject Re: RFC: compression dictionaries for JSONB
Date
Msg-id CAJ7c6TOB9KUv1PpUFPN1c0vQ7WLHK5-t=p8jXWcTHoffT40jiQ@mail.gmail.com
Whole thread Raw
In response to Re: RFC: compression dictionaries for JSONB  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: RFC: compression dictionaries for JSONB  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List pgsql-hackers
Matthias, Alvaro,

Many thanks for your comments and suggestions!

> Well, I for one would like access to manually add entries to the
> dictionary. What I'm not interested in is being required to manually
> update the dictionary; but the ability to manually insert into the
> dictionary however is much appreciated.

Sure, I see no reason why we can't do it. This would also simplify
splitting the task into smaller ones. We could introduce only manually
updated dictionaries first, and then automate filling them for the
users who need this.

> If stored as strings, they would go out of date when tables or columns
> are renamed or dropped.
> Similarly, you'd want to update the dictionary with common values in
> columns of that type; generally not columns of arbitrary other types.
> You can't in advance know the names of tables and columns, so that
> would add a burden of maintenance to the user when they add / change /
> remove a column of the dictionary type. Instead of storing 'use update
> data from table X column Y' in the type, I think that adding it as a
> column option would be the better choice.

Agree, add / change / remove of a column should be handled
automatically. Just to clarify, by column option do you mean syntax
like ALTER TABLE ... ALTER COLUMN ... etc, right? I didn't think of
extending this part of the syntax. That would be a better choice
indeed.

> I'm a bit on the fence about this. We do use this for sequences, but
> alternatively we might want to use ALTER TYPE jsondict;

Agree, ALTER TYPE seems to be a better choice than SELECT function().
This would make the interface more consistent.

> Overall, I'm glad to see this take off, but I do want some
> clarifications regarding the direction that this is going towards.
> [...]
> Actually, why is it a JSONB_DICTIONARY and not like:
> CREATE TYPE name AS DICTIONARY (
>  base_type = JSONB, ...
> );
> so that it is possible to use the infrastructure for other things?  For
> example, perhaps PostGIS geometries could benefit from it -- or even
> text or xml columns.

So the question is if we want to extend the capabilities of a single
type, i.e. JSONB, or to add a functionality that would work for the
various types. I see the following pros and cons of both approaches.

Modifying JSONB may at some point allow to partially decompress only
the parts of the document that need to be decompressed for a given
query. However, the compression schema will be probably less
efficient. There could also be difficulties in respect of backward
compatibility, and this is going to work only with JSONB.

An alternative approach, CREATE TYPE ... AS DICTIONARY OF <type> or
something like this would work not only for JSONB, but also for TEXT,
XML, arrays, and PostGIS. By the way, this was suggested before [1].
Another advantage here is that all things being equal the compression
schema could be more efficient. The feature will not affect existing
types. The main disadvantage is that implementing a partial
decompression would be very difficult and/or impractical.

Personally, I would say that the 2nd option, CREATE TYPE ... AS
DICTIONARY OF <type>, seems to be more useful. To my knowledge, not
many users would care much about partial decompression, and this is
the only real advantage of the 1st option I see. Also, this is how
ZSON is implemented. It doesn't care about the underlying type and
treats it as a BLOB. Thus the proofs of usefulness I provided above
are not quite valid for the 1st option. Probably unrelated, but 2nd
option would be even easier for me to implement since I already solved
a similar task.

All in all, I suggest focusing on the 2nd option with universal
compression dictionaries. Naturally, the focus will be on JSONB first.
But we will be able to extend this functionality for other types as
well.

Thoughts?

[1]: https://www.postgresql.org/message-id/CANP8%2BjLT8r03LJsw%3DdUSFxBh5pRB%2BUCKvS3BUT-dd4JPRDb3tg%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev



pgsql-hackers by date:

Previous
From: Greg Nancarrow
Date:
Subject: Re: Drop replslot after pgstat_shutdown cause assert coredump
Next
From: Yura Sokolov
Date:
Subject: Re: Double partition lock in bufmgr