Home > mailing lists

Re: RFC: compression dictionaries for JSONB - Mailing list pgsql-hackers

From	Aleksander Alekseev
Subject	Re: RFC: compression dictionaries for JSONB
Date	October 11, 2021 13:25:07
Msg-id	CAJ7c6TOB9KUv1PpUFPN1c0vQ7WLHK5-t=p8jXWcTHoffT40jiQ@mail.gmail.com Whole thread Raw
In response to	Re: RFC: compression dictionaries for JSONB (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses	Re: RFC: compression dictionaries for JSONB
List	pgsql-hackers

Tree view

Matthias, Alvaro,

Many thanks for your comments and suggestions!

> Well, I for one would like access to manually add entries to the
> dictionary. What I'm not interested in is being required to manually
> update the dictionary; but the ability to manually insert into the
> dictionary however is much appreciated.

Sure, I see no reason why we can't do it. This would also simplify
splitting the task into smaller ones. We could introduce only manually
updated dictionaries first, and then automate filling them for the
users who need this.

> If stored as strings, they would go out of date when tables or columns
> are renamed or dropped.
> Similarly, you'd want to update the dictionary with common values in
> columns of that type; generally not columns of arbitrary other types.
> You can't in advance know the names of tables and columns, so that
> would add a burden of maintenance to the user when they add / change /
> remove a column of the dictionary type. Instead of storing 'use update
> data from table X column Y' in the type, I think that adding it as a
> column option would be the better choice.

Agree, add / change / remove of a column should be handled
automatically. Just to clarify, by column option do you mean syntax
like ALTER TABLE ... ALTER COLUMN ... etc, right? I didn't think of
extending this part of the syntax. That would be a better choice
indeed.

> I'm a bit on the fence about this. We do use this for sequences, but
> alternatively we might want to use ALTER TYPE jsondict;

Agree, ALTER TYPE seems to be a better choice than SELECT function().
This would make the interface more consistent.

> Overall, I'm glad to see this take off, but I do want some
> clarifications regarding the direction that this is going towards.
> [...]
> Actually, why is it a JSONB_DICTIONARY and not like:
> CREATE TYPE name AS DICTIONARY (
>  base_type = JSONB, ...
> );
> so that it is possible to use the infrastructure for other things?  For
> example, perhaps PostGIS geometries could benefit from it -- or even
> text or xml columns.

So the question is if we want to extend the capabilities of a single
type, i.e. JSONB, or to add a functionality that would work for the
various types. I see the following pros and cons of both approaches.

Modifying JSONB may at some point allow to partially decompress only
the parts of the document that need to be decompressed for a given
query. However, the compression schema will be probably less
efficient. There could also be difficulties in respect of backward
compatibility, and this is going to work only with JSONB.

An alternative approach, CREATE TYPE ... AS DICTIONARY OF <type> or
something like this would work not only for JSONB, but also for TEXT,
XML, arrays, and PostGIS. By the way, this was suggested before [1].
Another advantage here is that all things being equal the compression
schema could be more efficient. The feature will not affect existing
types. The main disadvantage is that implementing a partial
decompression would be very difficult and/or impractical.

Personally, I would say that the 2nd option, CREATE TYPE ... AS
DICTIONARY OF <type>, seems to be more useful. To my knowledge, not
many users would care much about partial decompression, and this is
the only real advantage of the 1st option I see. Also, this is how
ZSON is implemented. It doesn't care about the underlying type and
treats it as a BLOB. Thus the proofs of usefulness I provided above
are not quite valid for the 1st option. Probably unrelated, but 2nd
option would be even easier for me to implement since I already solved
a similar task.

All in all, I suggest focusing on the 2nd option with universal
compression dictionaries. Naturally, the focus will be on JSONB first.
But we will be able to extend this functionality for other types as
well.

Thoughts?

[1]: https://www.postgresql.org/message-id/CANP8%2BjLT8r03LJsw%3DdUSFxBh5pRB%2BUCKvS3BUT-dd4JPRDb3tg%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev

pgsql-hackers by date:

From: Greg Nancarrow
Date: 11 October 2021, 13:15:40
Subject: Re: Drop replslot after pgstat_shutdown cause assert coredump

From: Yura Sokolov
Date: 11 October 2021, 13:31:17
Subject: Re: Double partition lock in bufmgr

Re: RFC: compression dictionaries for JSONB - Mailing list pgsql-hackers

Previous

Next