Thread: ZSON, PostgreSQL extension for compressing JSONB
Hello,

I've just uploaded the ZSON extension to GitHub:

https://github.com/afiskon/zson

ZSON learns on your common JSONB documents and creates a dictionary of strings that are frequently used across all documents. After that you can use the ZSON type to compress documents using this dictionary. When the documents' schema evolves and compression becomes inefficient, you can re-learn on new documents. New documents will be compressed with the new dictionary; old documents will be decompressed using the old one.

In some cases ZSON can save half of your disk space and give you about 10% more TPS. Everything depends on your data and workload, though. Memory is saved as well. For more details see README.md.

Please don't hesitate to ask any questions. Any feedback and pull requests are welcome too!

--
Best regards,
Aleksander Alekseev
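P.S. For anyone who wants to try it right away, the basic workflow looks roughly like this. This is only a sketch based on the README; the table and column names (mytable, mycol) are made up, and I'm assuming the zson_learn function and the jsonb casts as described there:

CREATE EXTENSION zson;

-- Build a dictionary from the JSONB documents already stored in mytable.mycol.
SELECT zson_learn('{{"mytable", "mycol"}}');

-- Use the zson type instead of jsonb. Documents are encoded with the current
-- dictionary on write and can be cast back to jsonb on read.
CREATE TABLE mytable_compressed (doc zson);
INSERT INTO mytable_compressed SELECT mycol::zson FROM mytable;

-- Cast back to jsonb to use the usual operators.
SELECT (doc::jsonb) -> 'some_key' FROM mytable_compressed;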
I like this, seeing that the keys of JSON docs are replicated in every record.
It makes my old-school DBA sense start to itch.
On Fri, Sep 30, 2016 at 8:58 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> Hello,
>
> I've just uploaded the ZSON extension to GitHub:
>
> https://github.com/afiskon/zson
>
> ZSON learns on your common JSONB documents and creates a dictionary
> of strings that are frequently used across all documents. After that you
> can use the ZSON type to compress documents using this dictionary. When
> the documents' schema evolves and compression becomes inefficient, you can
> re-learn on new documents. New documents will be compressed with the new
> dictionary; old documents will be decompressed using the old one.
>
> In some cases ZSON can save half of your disk space and give you about
> 10% more TPS. Everything depends on your data and workload, though.
> Memory is saved as well. For more details see README.md.
>
> Please don't hesitate to ask any questions. Any feedback and pull
> requests are welcome too!
>
> --
> Best regards,
> Aleksander Alekseev
On 30 September 2016 at 16:58, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> I've just uploaded the ZSON extension to GitHub:
>
> https://github.com/afiskon/zson
>
> ZSON learns on your common JSONB documents and creates a dictionary
> of strings that are frequently used across all documents. After that you
> can use the ZSON type to compress documents using this dictionary. When
> the documents' schema evolves and compression becomes inefficient, you can
> re-learn on new documents. New documents will be compressed with the new
> dictionary; old documents will be decompressed using the old one.
>
> In some cases ZSON can save half of your disk space and give you about
> 10% more TPS. Everything depends on your data and workload, though.
> Memory is saved as well. For more details see README.md.
>
> Please don't hesitate to ask any questions. Any feedback and pull
> requests are welcome too!

Very good. Oleg had mentioned that dictionary compression was being considered.

It would be useful to be able to define compression dictionaries for many use cases.

Will you be submitting this to core?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 4, 2016 at 4:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 30 September 2016 at 16:58, Aleksander Alekseev
> <a.alekseev@postgrespro.ru> wrote:
>> I've just uploaded the ZSON extension to GitHub:
>>
>> https://github.com/afiskon/zson
>>
>> ZSON learns on your common JSONB documents and creates a dictionary
>> of strings that are frequently used across all documents. After that you
>> can use the ZSON type to compress documents using this dictionary. When
>> the documents' schema evolves and compression becomes inefficient, you can
>> re-learn on new documents. New documents will be compressed with the new
>> dictionary; old documents will be decompressed using the old one.
>>
>> In some cases ZSON can save half of your disk space and give you about
>> 10% more TPS. Everything depends on your data and workload, though.
>> Memory is saved as well. For more details see README.md.
>>
>> Please don't hesitate to ask any questions. Any feedback and pull
>> requests are welcome too!
>
> Very good. Oleg had mentioned that dictionary compression was being considered.
>
> It would be useful to be able to define compression dictionaries for
> many use cases.

Dictionary compression is a different project; we'll publish it after testing.

> Will you be submitting this to core?
>
> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello, Simon,

Thanks for your interest in this project!

> Will you be submitting this to core?

I could align ZSON with the PostgreSQL code style; I only need to run pgindent and write a few comments. Do you think the community would be interested in adding it to /contrib/? Or does ZSON solve too specific a problem for that?

--
Best regards,
Aleksander Alekseev
@Aleksander
Almost everyone wants lower storage use and some kind of compression. Could this be made to retrain automatically during ANALYZE (does that make sense?), and create a new dictionary only if it changes compared to the last one?
On Tue, Oct 4, 2016 at 5:34 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> Hello, Simon,
>
> Thanks for your interest in this project!
>
>> Will you be submitting this to core?
>
> I could align ZSON with the PostgreSQL code style; I only need to run
> pgindent and write a few comments. Do you think the community would be
> interested in adding it to /contrib/? Or does ZSON solve too specific a
> problem for that?
>
> --
> Best regards,
> Aleksander Alekseev
On Wed, Oct 5, 2016 at 12:34 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> I could align ZSON with the PostgreSQL code style; I only need to run
> pgindent and write a few comments. Do you think the community would be
> interested in adding it to /contrib/? Or does ZSON solve too specific a
> problem for that?

I find the references to pglz quite troubling, particularly reading that this data type visibly uses its own compression logic.
--
Michael
On 10/4/16, Dorian Hoxha <dorian.hoxha@gmail.com> wrote:
> On Tue, Oct 4, 2016 at 5:34 PM, Aleksander Alekseev
> <a.alekseev@postgrespro.ru> wrote:
>> Hello, Simon,
>>
>> Thanks for your interest in this project!
>>
>>> Will you be submitting this to core?
>>
>> I could align ZSON with the PostgreSQL code style; I only need to run
>> pgindent and write a few comments. Do you think the community would be
>> interested in adding it to /contrib/? Or does ZSON solve too specific a
>> problem for that?
>
> Almost everyone wants lower storage use and some kind of compression.
> Could this be made to retrain automatically during ANALYZE (does that make
> sense?), and create a new dictionary only if it changes compared to the
> last one?

ANALYZE does not walk through the whole table and does not change its records. Creating a new dictionary for compression purposes would mean updating the rows of the original table to replace entries with references to the dictionary.

--
Best regards,
Vitaly Burovoy
> Almost everyone wants lower storage use and some kind of compression.
> Could this be made to retrain automatically during ANALYZE (does that make
> sense?), and create a new dictionary only if it changes compared to the
> last one?

It's an interesting idea. However, I doubt it could be automated in a one-size-fits-all manner. Some users would like to do re-learning during ANALYZE, others during VACUUM, once a month, or using triggers and some sort of heuristics. That said, I see no reason not to accept pull requests with implementations of different re-learning automation strategies. It's just not a priority for me personally.

--
Best regards,
Aleksander Alekseev
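P.S. As an illustration of the "once a month" strategy: with an external scheduler such as the pg_cron extension (just one possible choice; the table and column names below are made up, and zson_learn's signature is the one from the README), a periodic re-learning job could look roughly like this:

-- Re-learn at 03:00 on the first day of every month. New documents will be
-- compressed with the fresh dictionary; old rows keep decompressing with the
-- dictionary they were written with.
SELECT cron.schedule(
    '0 3 1 * *',
    $$SELECT zson_learn('{{"mytable", "mycol"}}')$$
);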
> > I could align ZSON with the PostgreSQL code style; I only need to run
> > pgindent and write a few comments. Do you think the community would be
> > interested in adding it to /contrib/? Or does ZSON solve too specific a
> > problem for that?
>
> I find the references to pglz quite troubling, particularly reading
> that this data type visibly uses its own compression logic.

ZSON just replaces frequently used strings with 16-bit codes. It also adds PGLZ_HINT_SIZE (32 by default; it could also be 0) zero bytes at the beginning to make it more likely that the data will be compressed using PGLZ. After all, who would use ZSON for small documents? That's all. I hope this explains the references to PGLZ.

--
Best regards,
Aleksander Alekseev
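P.S. A quick way to check whether this pays off on a particular dataset is to compare per-row on-disk sizes directly; pg_column_size() reports the stored (TOAST-compressed) size. The table and column names here are hypothetical:

-- The same documents stored as jsonb vs. zson.
SELECT avg(pg_column_size(mycol)) AS avg_jsonb_bytes FROM mytable;
SELECT avg(pg_column_size(doc))   AS avg_zson_bytes  FROM mytable_compressed;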
On 4 October 2016 at 16:34, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> Hello, Simon,
>
> Thanks for your interest in this project!
>
>> Will you be submitting this to core?
>
> I could align ZSON with the PostgreSQL code style; I only need to run
> pgindent and write a few comments. Do you think the community would be
> interested in adding it to /contrib/? Or does ZSON solve too specific a
> problem for that?

CREATE COMPRESSION DICTIONARY public.simple_dict (
    WORDS = my_json_schema
);

Then use the dictionary by referencing the DictionaryId within the datatype modifier, e.g. ZSON(357).

That way we can use this for TEXT, XML, JSON etc., as ZTEXT, ZXML, ZJSON.

So it seems like a generally useful thing to me.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> > I could align ZSON with the PostgreSQL code style; I only need to run
> > pgindent and write a few comments. Do you think the community would be
> > interested in adding it to /contrib/? Or does ZSON solve too specific a
> > problem for that?
>
> CREATE COMPRESSION DICTIONARY public.simple_dict (
>     WORDS = my_json_schema
> );
>
> Then use the dictionary by referencing the DictionaryId within the
> datatype modifier, e.g. ZSON(357).
>
> That way we can use this for TEXT, XML, JSON etc., as ZTEXT, ZXML, ZJSON.
>
> So it seems like a generally useful thing to me.

Good idea! What about an evolving schema for JSON/JSONB/XML, though? For instance, adding/removing keys in new versions of the application. UPDATE COMPRESSION DICTIONARY?

--
Best regards,
Aleksander Alekseev
On 5 October 2016 at 16:58, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> What about an evolving schema for JSON/JSONB/XML, though? For instance,
> adding/removing keys in new versions of the application. UPDATE
> COMPRESSION DICTIONARY?

You can add to a dictionary, but not remove things. I'm not sure that's a big issue.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
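P.S. In that append-only model, dictionary evolution might look something like this; this is purely illustrative syntax continuing the sketch above (nothing like it exists yet, and my_json_schema_v2 is a placeholder):

-- Existing codes keep their meaning, so rows encoded against the old
-- dictionary stay readable without a rewrite; only new entries are appended.
ALTER COMPRESSION DICTIONARY public.simple_dict (
    ADD WORDS = my_json_schema_v2
);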
Hello, Eduardo,

> Why do you use dictionary compression and not zlib/lz4/bzip/any other?

Internally, PostgreSQL already has an LZ77-family algorithm, PGLZ. I didn't try to replace it, only to supplement it. PGLZ compresses every piece of data (JSONB documents in this case) independently. What I did was remove the redundancy that exists between documents, which PGLZ can't compress away, since every single document usually uses each key and similar strings (string tags in arrays, etc.) only once.

> Compress/decompress speed?

In my observations, PGLZ has characteristics similar to gzip. I didn't benchmark ZSON encoding/decoding separately from the DBMS, because the end user is interested only in TPS, which depends on IO, the number of documents we can fit into memory, and other factors.

> As I understand, PostgreSQL must decompress before use.

Only if you try to read the document's fields. For deleting a tuple, doing vacuum, etc. there is no need to decompress the data.

> Some compression algorithms (dictionary transforms where a token is a word)
> allow searching for tokens/words directly in the compressed data, by
> translating the search token into its dictionary entry and looking for it
> in the compressed stream. On top of that, replace, substring, and other
> word-level string manipulation algorithms can be implemented.

Unfortunately, I doubt that the current ZSON implementation can use these ideas. However, I must agree that it's a very interesting field of research. I don't think anyone has tried to do something like this in PostgreSQL yet.

> My passion is compression; do you mind if I try other algorithms? For
> that, certain dict id numbers (>1024 or >1<<16 or <128 for example) could
> indicate which compression algorithm is used, or the zson_header would
> have to be changed to store that information. That way, each document
> could be compressed with the best compressor (for size or decompression
> speed) at idle times or on request.

By all means! Naturally, if you find a better encoding I would be happy to merge the corresponding code into ZSON's repository.

> Thanks for sharing and your time.

Thanks for the feedback and for sharing your thoughts!

--
Best regards,
Aleksander Alekseev