RFC: compression dictionaries for JSONB - Mailing list pgsql-hackers

From Aleksander Alekseev
Subject RFC: compression dictionaries for JSONB
Date
Msg-id CAJ7c6TPx7N-bVw0dZ1ASCDQKZJHhBYkT6w4HV1LzfS+UUTUfmA@mail.gmail.com
Whole thread Raw
Responses Re: RFC: compression dictionaries for JSONB
List pgsql-hackers
Hi hackers,

== Background ==

This is a follow-up thread to `Add ZSON extension to /contrib/` [1].
The ZSON extension introduces a new type called ZSON, which is 100%
compatible with JSONB but uses a shared dictionary of strings most
frequently used by given JSONB documents for compression. See the
thread for more details.

According to the feedback I got, the community generally liked the
idea of adding an across-rows and across-tables compression capability
to JSONB. What the community didn't like was:

1. Introducing a new data type in order to archive this;
2. Updating compression dictionaries manually;
3. Some implementation details of ZSON, such as limited dictionary
size (2 ** 16 entries) and an extensive usage of gettimeofday() system
call;

There was also a request for proof of the usefulness of this feature
in practice.

To be honest with you I don't have solid proof that many users require
this feature, and how many users that would be exactly. ZSON was
originally developed because a customer of Postgres Professional
requested it back in 2016. People approach me with questions from time
to time. E.g. one user asked me recently how the extension can be
compiled on Windows [2].

Andrew Dunstan reported that 2nd Quadrant (now EDB) has a fork of ZSON
with some enhancements [3]. Konstantin Knizhnik reported that he
worked on a similar extension [4]. Unfortunately, Andrew and
Konstantin didn't give any more details (hopefully they will). But all
in all, this indicates some demand.

Speaking of performance, some time ago I benchmarked ZSON on data
similar to the one that the customer had [5]. The benchmark showed
~10% performance improvements in terms of TPS and ~50% of saved disk
space. The extension saved the memory as well, which was known from
the implementation. The exact amount of saved memory was not measured.
This benchmark shouldn't be considered as proof that all users will
necessarily benefit from such a feature. But it indicates that some
users could.

== Proposal ==

The proposal is to add the support of compression dictionaries to JSONB.

In order to do this, the SQL syntax should be modified. The proposed
syntax is based on Matthias van de Meent's idea [6]:

```
CREATE TYPE <type-name> AS JSONB_DICTIONARY (
  learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... },
  autoupdate = false, -- true by default
  -- optional: other arguments, min/max string lengths, etc
);
```

Basically, this is an equivalent of zson_learn [7]. It will create an
id -> string dictionary in the PostgreSQL catalog. When the user
chooses `autoupdate = true`, the dictionary will be updated
automatically by PostgreSQL (e.g. during the VACUUM). This is the
default value. The dictionary can also be updated manually:

```
SELECT jsonb_update_dictionary("type-name");
```

Other than that, the type works like a regular one. All the usual
ALTER TYPE / DROP TYPE semantics are applicable. All the operators
available to JSONB are also available to <type-name>.

Internally <type-name> is represented similar to JSONB. However, the
strings from the dictionary are replaced with varints. This idea was
borrowed from Tomas Vondra [8]. The dictionary size is limited to
2**28 entries. The limit can be easily extended in the future if
necessary. Also <type-name> stores the version of the dictionary used
to compress the data. All in all, this is similar to how ZSON works.

The first implementation always decompresses <type-name> entirely.
Partial compression and decompression can always be added
transparently to the user.

== Looking for a feedback ===

I would appreciate your feedback on this RFC.

Is anything missing in the description of the feature? Do you think
users need it? Can you think of a better user interface? Are there any
corner cases worth considering? Any other comments and questions are
welcome too!

I would like to implement this when the consensus will be reached on
how the feature should look like (and whether we need it). Any help
(from co-authors, REVIEWERS!!!, technical writers, ...) would be much
appreciated.

== Links ==

[1]: https://www.postgresql.org/message-id/flat/CAJ7c6TP3fCC9TNKJBQAcEf4c%3DL7XQZ7QvuUayLgjhNQMD_5M_A%40mail.gmail.com
[2]: https://github.com/postgrespro/zson/issues?q=is%3Aissue+is%3Aclosed
[3]: https://www.postgresql.org/message-id/6f3944ad-6924-5fed-580c-e72477733f04%40dunslane.net
[4]: https://github.com/postgrespro/jsonb_schema
[5]: https://github.com/postgrespro/zson/blob/master/docs/benchmark.md
[6]: https://www.postgresql.org/message-id/CAEze2WheMusc73UZ5TpfiAGQ%3DrRwSSgr0y3j9DEVAQgQFwneRA%40mail.gmail.com
[7]: https://github.com/postgrespro/zson#usage
[8]: https://www.postgresql.org/message-id/77356556-0634-5cde-f55e-cce739dc09b9%40enterprisedb.com

-- 
Best regards,
Aleksander Alekseev
Open-Source PostgreSQL Contributor at Timescale



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Improve the HINT message of the ALTER command for postgres_fdw
Next
From: Aleksander Alekseev
Date:
Subject: Re: Add ZSON extension to /contrib/