Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers

From Aleksander Alekseev
Subject Re: [PATCH] Compression dictionaries for JSONB
Date
Msg-id CAJ7c6TNgq3O9SVXcpUXs0gVuBzfD_22SGZmCKUC4dj84nc8j7w@mail.gmail.com
In response to Re: [PATCH] Compression dictionaries for JSONB  (Andres Freund <andres@anarazel.de>)
Responses Re: [PATCH] Compression dictionaries for JSONB
List pgsql-hackers
Hi,

> I assume that manually specifying dictionary entries is a consequence of
> the prototype state?  I don't think this is something humans are very
> good at, just analyzing the data to see what's useful to dictionarize
> seems more promising.

No, humans are not good at it. The idea was to automate the process
and build the dictionaries automatically, e.g. during VACUUM.

> I don't think we'd want much of the infrastructure introduced in the
> patch for type agnostic cross-row compression. A dedicated "dictionary"
> type as a wrapper around other types IMO is the wrong direction. This
> should be a relation-level optimization option, possibly automatic, not
> something visible to every user of the table.

So to clarify, are we talking about tuple-level compression? Or
perhaps page-level compression?

Implementing page-level compression should be *relatively*
straightforward. As an example, this was previously done in InnoDB.
Basically, InnoDB compresses the entire page, rounds the result up to
1K, 2K, 4K, 8K, etc., and stores it in a corresponding fork
("fork" in PG terminology), similarly to how a SLAB allocator works.
Additionally, a page_id -> fork_id map should be maintained, probably
in yet another fork, similarly to the visibility map. A compressed page
can move to a different fork after being modified, since modification
may change its compressed size. The buffer manager is unaffected and
deals only with uncompressed pages. (I'm not an expert in InnoDB and
this is my very rough understanding of how its compression works.)

I believe this can be implemented as a TAM. Whether this counts as
"dictionary" compression is debatable, but it gives users similar
benefits, give or take. The advantage is that users wouldn't have to
define any dictionaries manually, nor would the DBMS have to build
them during VACUUM or otherwise.

> I also suspect that we'd have to spend a lot of effort to make
> compression/decompression fast if we want to handle dictionaries
> ourselves, rather than using the dictionary support in libraries like
> lz4/zstd.

That's a reasonable concern, can't argue with that.

> I don't think a prototype-y patch not needing a rebase two months is a
> good measure of complexity :)

It's worth noting that I also invested quite some time into reviewing
type-aware TOASTers :) I just chose to keep my personal opinion about
the complexity of that patch to myself this time, since I'm obviously
a bit biased. However, if you are curious, it's all in the
corresponding thread.

-- 
Best regards,
Aleksander Alekseev


