Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers

From: Aleksander Alekseev
Subject: Re: [PATCH] Compression dictionaries for JSONB
Msg-id: CAJ7c6TOx878pUmmj2ibRiF-JaOYLvYHAA27ebdANgTDOs58PvA@mail.gmail.com
In response to: Re: [PATCH] Compression dictionaries for JSONB (Nikita Malakhov <hukutoc@gmail.com>)
List: pgsql-hackers

Hi Nikita,

Thanks for your feedback!

> Aleksander, I've carefully gone over the discussion and still have some questions to ask -
>
> 1) Is there any means of measuring the overhead of dictionaries over the vanilla implementation? IMO it is a must
> because JSON is widely used functionality. Also, as it was mentioned before, to check the dictionary, the value
> must be detoasted;

I'm not sure what overhead you have in mind. The patch doesn't affect the
vanilla JSONB implementation.
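
If the question is about measuring the end-to-end difference, a rough
comparison can be done with stock tools. Here is a minimal sketch; the
"mydict" type name and the cast to jsonb are assumptions made for the
illustration, not necessarily the exact syntax of the patch:

  -- Footprint and scan-cost comparison; "mydict" is a hypothetical
  -- dictionary-compressed type.
  CREATE TABLE t_plain (doc jsonb);
  CREATE TABLE t_dict  (doc mydict);
  -- ... load the same set of documents into both tables ...

  -- On-disk footprint:
  SELECT pg_total_relation_size('t_plain') AS plain_bytes,
         pg_total_relation_size('t_dict')  AS dict_bytes;

  -- Access/decompression cost, e.g. with \timing enabled in psql:
  SELECT count(*) FROM t_plain WHERE doc ? 'some_key';
  SELECT count(*) FROM t_dict  WHERE doc::jsonb ? 'some_key';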

> 2) Storing dictionaries in one table. As I wrote before, this will surely lead to locks and waits while inserting
> and updating dictionaries, and could cause serious performance issues. And vacuuming this table will lead to locks
> for all tables using dictionaries until vacuum is complete;

I believe this is true to some degree. But doesn't the same generally
apply to the rest of the catalog tables?

I'm not that concerned about inserting/updating since this is a rare
operation. Vacuuming shouldn't be such a problem unless the user
creates/deletes dictionaries all the time.

Am I missing something?

> 3) JSON documents in production environments could be very complex and use thousands of keys, so creating a
> dictionary directly in a SQL statement is not a very good approach; it's another reason to have means for creating
> dictionaries as separate tables and/or passing them as files or so;

Yes, it was proposed to update dictionaries automatically, e.g. during
VACUUM of the table that contains the compressed documents. This is
simply out of scope for this particular patch. It was argued that a
manual update should be supported too, and that is what this patch
implements.
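
For reference, a minimal sketch of the manual workflow; "mydict" and the
grammar below are shown only for illustration and may differ from the
exact syntax implemented in the patch:

  -- Illustrative only; not necessarily the patch's exact grammar.
  CREATE TYPE mydict AS DICTIONARY OF jsonb ('status', 'customer_id', 'items');

  -- When new frequent keys show up in production data, the dictionary is
  -- extended by a similar manual statement; an automatic update (e.g.
  -- during VACUUM) is out of scope for this patch.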

> 4) The suggested mechanics, if put on top of TOAST, could not benefit from knowledge of the internal JSON structure,
> which is seen as an important drawback in spite of the extensive research work done on working with JSON schema
> (storing, validating, etc.), and also it cannot recognize and help to compress duplicated parts of a JSON document;

Could you please elaborate on this a bit and/or maybe give an example? ...

> In Pluggable TOAST we suggest that, as an improvement, compression should be put inside the Toaster as an option,
> thus the Toaster could have maximum benefit from knowledge of the data's internal structure (and in the future use
> JSON Schema).

... The current implementation doesn't use knowledge of the JSONB
format, that's true. This is because we previously agreed that there is
no "one size fits all" compression method, so several are going to be
supported eventually. The current algorithm was chosen merely as one
that should work well enough for any data type, not just JSONB. Nothing
prevents an alternative compression method from using knowledge of the
JSONB structure.

As Matthias pointed out above, I believe, the only real challenge would
be partial decompression. This is indeed something that would be better
implemented somewhere closer to the TOAST level. Other than that, I'm
not sure what you mean.

> 5) A small test issue - what if a dictionaried JSON has a key which is equal to an OID used in the dictionary for
> some other key?

Again, I'm having difficulties understanding the case you are
describing. Could you give a specific example?

> For use in a special Toaster for the JSON datatype, compression dictionaries seem to be a very valuable addition,
> but now I have to agree that this feature in its current state is competing with Pluggable TOAST.

I disagree with the word "competing" here. Again, Matthias had a very
good point about this above.

In short, pluggable TOAST is a low-level internal mechanism, but it
doesn't provide a good interface for the end user and has several open
issues. The most important one IMO is how it is supposed to work with
pluggable AMs in the general case. "Compression dictionaries" have a
good user interface, and the implementation is not that important. The
current implementation uses casts, as that is the only option available
at the moment. But nothing prevents it from using Pluggable TOAST if
that produces cleaner code (I believe it will) and allows delivering
partial decompression (this is yet to be figured out).
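
To make the intended user experience concrete, here is a hedged sketch
of how the cast-based approach looks from SQL, again assuming the
hypothetical "mydict" type and casts between it and jsonb:

  -- Illustration only; assumes the "mydict" type from the sketch above.
  CREATE TABLE orders (doc mydict);
  INSERT INTO orders VALUES ('{"status": "new", "items": [1, 2, 3]}');
  SELECT (doc::jsonb)->>'status' FROM orders;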

-- 
Best regards,
Aleksander Alekseev


