RFC: compression dictionaries for JSONB - Mailing list pgsql-hackers
From | Aleksander Alekseev |
---|---|
Subject | RFC: compression dictionaries for JSONB |
Date | |
Msg-id | CAJ7c6TPx7N-bVw0dZ1ASCDQKZJHhBYkT6w4HV1LzfS+UUTUfmA@mail.gmail.com Whole thread Raw |
Responses |
Re: RFC: compression dictionaries for JSONB
|
List | pgsql-hackers |
Hi hackers, == Background == This is a follow-up thread to `Add ZSON extension to /contrib/` [1]. The ZSON extension introduces a new type called ZSON, which is 100% compatible with JSONB but uses a shared dictionary of strings most frequently used by given JSONB documents for compression. See the thread for more details. According to the feedback I got, the community generally liked the idea of adding an across-rows and across-tables compression capability to JSONB. What the community didn't like was: 1. Introducing a new data type in order to archive this; 2. Updating compression dictionaries manually; 3. Some implementation details of ZSON, such as limited dictionary size (2 ** 16 entries) and an extensive usage of gettimeofday() system call; There was also a request for proof of the usefulness of this feature in practice. To be honest with you I don't have solid proof that many users require this feature, and how many users that would be exactly. ZSON was originally developed because a customer of Postgres Professional requested it back in 2016. People approach me with questions from time to time. E.g. one user asked me recently how the extension can be compiled on Windows [2]. Andrew Dunstan reported that 2nd Quadrant (now EDB) has a fork of ZSON with some enhancements [3]. Konstantin Knizhnik reported that he worked on a similar extension [4]. Unfortunately, Andrew and Konstantin didn't give any more details (hopefully they will). But all in all, this indicates some demand. Speaking of performance, some time ago I benchmarked ZSON on data similar to the one that the customer had [5]. The benchmark showed ~10% performance improvements in terms of TPS and ~50% of saved disk space. The extension saved the memory as well, which was known from the implementation. The exact amount of saved memory was not measured. This benchmark shouldn't be considered as proof that all users will necessarily benefit from such a feature. But it indicates that some users could. == Proposal == The proposal is to add the support of compression dictionaries to JSONB. In order to do this, the SQL syntax should be modified. The proposed syntax is based on Matthias van de Meent's idea [6]: ``` CREATE TYPE <type-name> AS JSONB_DICTIONARY ( learn_by = { {"table_a", "column_a"}, {"table_b", "column_b"}, ... }, autoupdate = false, -- true by default -- optional: other arguments, min/max string lengths, etc ); ``` Basically, this is an equivalent of zson_learn [7]. It will create an id -> string dictionary in the PostgreSQL catalog. When the user chooses `autoupdate = true`, the dictionary will be updated automatically by PostgreSQL (e.g. during the VACUUM). This is the default value. The dictionary can also be updated manually: ``` SELECT jsonb_update_dictionary("type-name"); ``` Other than that, the type works like a regular one. All the usual ALTER TYPE / DROP TYPE semantics are applicable. All the operators available to JSONB are also available to <type-name>. Internally <type-name> is represented similar to JSONB. However, the strings from the dictionary are replaced with varints. This idea was borrowed from Tomas Vondra [8]. The dictionary size is limited to 2**28 entries. The limit can be easily extended in the future if necessary. Also <type-name> stores the version of the dictionary used to compress the data. All in all, this is similar to how ZSON works. The first implementation always decompresses <type-name> entirely. Partial compression and decompression can always be added transparently to the user. == Looking for a feedback === I would appreciate your feedback on this RFC. Is anything missing in the description of the feature? Do you think users need it? Can you think of a better user interface? Are there any corner cases worth considering? Any other comments and questions are welcome too! I would like to implement this when the consensus will be reached on how the feature should look like (and whether we need it). Any help (from co-authors, REVIEWERS!!!, technical writers, ...) would be much appreciated. == Links == [1]: https://www.postgresql.org/message-id/flat/CAJ7c6TP3fCC9TNKJBQAcEf4c%3DL7XQZ7QvuUayLgjhNQMD_5M_A%40mail.gmail.com [2]: https://github.com/postgrespro/zson/issues?q=is%3Aissue+is%3Aclosed [3]: https://www.postgresql.org/message-id/6f3944ad-6924-5fed-580c-e72477733f04%40dunslane.net [4]: https://github.com/postgrespro/jsonb_schema [5]: https://github.com/postgrespro/zson/blob/master/docs/benchmark.md [6]: https://www.postgresql.org/message-id/CAEze2WheMusc73UZ5TpfiAGQ%3DrRwSSgr0y3j9DEVAQgQFwneRA%40mail.gmail.com [7]: https://github.com/postgrespro/zson#usage [8]: https://www.postgresql.org/message-id/77356556-0634-5cde-f55e-cce739dc09b9%40enterprisedb.com -- Best regards, Aleksander Alekseev Open-Source PostgreSQL Contributor at Timescale
pgsql-hackers by date: