Thread: Add ZSON extension to /contrib/
Hi hackers,
Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The extension introduces a new ZSON type, which is 100% compatible with JSONB but uses a shared dictionary of the strings most frequently found in the given JSONB documents for compression. These strings are replaced with integer IDs. Afterward, PGLZ (and now LZ4) compression is applied if the document is large enough, per the usual PostgreSQL logic. Under certain conditions (many large documents), this saves disk space and memory and increases overall performance. More details can be found in the README on GitHub.
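For those who haven't tried it, the typical workflow looks roughly like this (a minimal sketch with placeholder table/column names; the exact zson_learn arguments are described in the README):

```
CREATE EXTENSION zson;

-- Build the shared dictionary from the most frequent strings found in the
-- existing documents (placeholder table/column names).
SELECT zson_learn('{{mytable,mycol}}');

-- Store the documents as zson from now on; the extension provides casts
-- between jsonb and zson, so existing data can be converted in place.
ALTER TABLE mytable ALTER COLUMN mycol TYPE zson USING mycol::zson;
```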
The extension was accepted warmly, and almost instantly I got several requests to submit it to /contrib/ so that people using Amazon RDS and similar services could enjoy it too. Back then I was not sure whether the extension was mature enough and whether it lacked any additional features required to solve the real-world problems of its users. Time showed, however, that people are happy with the extension as it is. There were several minor issues discovered, but they were fixed back in 2017. The extension never experienced any compatibility problems with a new major release of PostgreSQL.
So my question is whether the community would consider adding ZSON to /contrib/. If so, I will add this thread to the nearest CF and submit a corresponding patch.
[1]: https://github.com/postgrespro/zson
--
Best regards,
Aleksander Alekseev
Open-Source PostgreSQL Contributor at Timescale
On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
> So my question is whether the community would consider adding ZSON to /contrib/. If so, I will add this thread to the nearest CF and submit a corresponding patch.

If the extension is mature enough, why make it an extension in contrib, and not instead either enhance the existing jsonb type with it or make it a built-in type?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On 5/25/21 6:55 AM, Aleksander Alekseev wrote:
> So my question is whether the community would consider adding ZSON to /contrib/. If so, I will add this thread to the nearest CF and submit a corresponding patch.

We (2ndQuadrant, now part of EDB) made some enhancements to Zson a few years ago, and I have permission to contribute those if this proposal is adopted. From the readme:

1. There is an option to make zson_learn only process object keys, rather than field values.

```
select zson_learn('{{table1,col1}}',true);
```

2. Strings with an octet-length less than 3 are not processed. Since strings are encoded as 2 bytes and then there needs to be another byte with the length of the following skipped bytes, encoding values of less than 3 bytes is going to be a net loss.

3. There is a new function to create a dictionary directly from an array of text, rather than using the learning code:

```
select zson_create_dictionary(array['word1','word2']::text[]);
```

4. There is a function to augment the current dictionary from an array of text:

```
select zson_extend_dictionary(array['value1','value2','value3']::text[]);
```

This is particularly useful for adding common field prefixes or values. A good example of field prefixes is URL values where the first part of the URL is fairly constrained but the last part is not.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Magnus Hagander <magnus@hagander.net> writes:
> If the extension is mature enough, why make it an extension in contrib, and not instead either enhance the existing jsonb type with it or make it a built-in type?

IMO we have too d*mn many JSON types already. If we can find a way to shoehorn this optimization into JSONB, that'd be great. Otherwise I do not think it's worth the added user confusion.

Also, even if ZSON was "100% compatible with JSONB" back in 2016, a whole lot of features have been added since then. Having to duplicate all that code again for a different data type is not something I want to see us doing. So that's an independent reason for wanting to hide this under the existing type, not make a new one.

regards, tom lane
On Tue, 25 May 2021 at 13:32, Magnus Hagander <magnus@hagander.net> wrote:
> On Tue, May 25, 2021 at 12:55 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
> > The extension was accepted warmly, and almost instantly I got several requests to submit it to /contrib/ so that people using Amazon RDS and similar services could enjoy it too.

Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.

> > So my question is whether the community would consider adding ZSON to /contrib/. If so, I will add this thread to the nearest CF and submit a corresponding patch.

I like the idea of the ZSON type, but I'm somewhat disappointed by its current limitations:

- There is only one active shared dictionary (as a user I would want distinct dictionaries for each use case, similar to ENUM: each ENUM type has its own limit of 2**31 (?) values).
- There is no provided method to manually specify the dictionary (only "zson_learn", which constructs a new dictionary).
- You cannot add to the dictionary (equiv. to ALTER TYPE enum_type ADD VALUE); you must create a new one.

Apart from that, I noticed the following more technical points, for if you submit it as-is as a patch:

- Each dictionary uses a lot of memory, regardless of the number of actually stored keys. For 32-bit systems the base usage of a dictionary without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be almost 1MB, and for 64-bit it would be 1.7MB. That is significantly more than I'd want to install.
- You call gettimeofday() in both dict_get and get_current_dict_id. These functions can be called in short and tight loops (for small ZSON fields), in which case it would add significant overhead through the implied syscalls.
- The compression method you've chosen extracts the most common strings from the JSONB table and then uses them as a pre-built dictionary for dictionary-encoding the on-disk format of the jsonb structure. Although I fully understand that this makes the system quite easy to reason about, it does mean that you're deTOASTing the full ZSON field, and that the stored bytestring will not be structured / doesn't work well with current debuggers.

> If the extension is mature enough, why make it an extension in contrib, and not instead either enhance the existing jsonb type with it or make it a built-in type?

I don't think that this datatype (which supplies a basic but effective compression algorithm over JSONB) is fit for core as-is.

I have also thought about building a similar type, but one that would be more like ENUM: an extension of the JSONB datatype which has some list of common 'well-known' values that will be substituted, and to which more substitutable values can later be added (e.g. CREATE TYPE ... AS JSONB_DICTIONARY ('"commonly_used_key"', '"very_long_string_that_appears_often"', '[{"structure": "that", "appears": "often"}]') or something similar). That would leave JSONB just as a JSONB_DICTIONARY type without any substitutable values. These specialized JSONB types could then be used as a specification for table columns, custom types, et cetera.

One of the reasons I've not yet built such a type is that I'm not familiar with the jsonb and enum code (which I suspect to be critical for an efficient implementation of such a type), although while researching I've noticed that it is possible to use most of the JSONB infrastructure / read older jsonb values, as there are still some JEntry type masks available which could flag such substitutions.

With regards,

Matthias van de Meent
On 5/25/21 4:10 PM, Tom Lane wrote:
> Also, even if ZSON was "100% compatible with JSONB" back in 2016, a whole lot of features have been added since then. Having to duplicate all that code again for a different data type is not something I want to see us doing. So that's an independent reason for wanting to hide this under the existing type, not make a new one.

I take your point. However, there isn't really any duplication. It's handled by this:

CREATE FUNCTION jsonb_to_zson(jsonb)
    RETURNS zson
    AS 'MODULE_PATHNAME'
    LANGUAGE C STRICT IMMUTABLE;

CREATE FUNCTION zson_to_jsonb(zson)
    RETURNS jsonb
    AS 'MODULE_PATHNAME'
    LANGUAGE C STRICT IMMUTABLE;

CREATE CAST (jsonb AS zson)
    WITH FUNCTION jsonb_to_zson(jsonb) AS ASSIGNMENT;

CREATE CAST (zson AS jsonb)
    WITH FUNCTION zson_to_jsonb(zson) AS IMPLICIT;

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> I take your point. However, there isn't really any duplication. It's handled by [ creating a pair of casts ]

If that were an adequate solution then nobody would be unhappy about json vs jsonb. I don't think it really is satisfactory:

* does nothing for user confusion (except maybe make it worse)

* not terribly efficient

* doesn't cover all cases, notably indexes.

regards, tom lane
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> I like the idea of the ZSON type, but I'm somewhat disappointed by its current limitations:

I've not read the code, so maybe this thought is completely off-point, but I wonder if anything could be learned from PostGIS. AIUI they have developed the infrastructure needed to have auxiliary info (particularly, spatial reference data) attached to a geometry column, without duplicating it in every value of the column. Seems like that is a close analog of what's needed here.

regards, tom lane
On 5/25/21 4:31 PM, Tom Lane wrote:
> If that were an adequate solution then nobody would be unhappy about json vs jsonb. I don't think it really is satisfactory:
>
> * does nothing for user confusion (except maybe make it worse)
>
> * not terribly efficient
>
> * doesn't cover all cases, notably indexes.

Quite so. To some extent it's a toy. But at least one of our customers has found it useful, and judging by Aleksander's email they aren't alone.

Your ideas downthread are probably a useful pointer to how we might fruitfully proceed.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I've not read the code, so maybe this thought is completely off-point, but I wonder if anything could be learned from PostGIS. AIUI they have developed the infrastructure needed to have auxiliary info (particularly, spatial reference data) attached to a geometry column, without duplicating it in every value of the column. Seems like that is a close analog of what's needed here.

Err, not exactly the same - there aren't *that* many SRIDs and therefore they can be stuffed into the typemod (my, probably wrong, recollection is that I actually pushed Paul in that direction due to being frustrated with the CHECK constraints they had been using previously..). Not something you could do with a dictionary as is contemplated here.

I do agree that each jsonb/zson/whatever column should really be able to have its own dictionary though, and maybe you could shove *which* of those dictionaries a given column uses into the typemod for that column... In an ideal world, however, we wouldn't make a user have to actually do that, and instead we'd just build our own magically for them when they use jsonb.

Thanks,

Stephen
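[For context on the PostGIS mechanism mentioned above: the SRID is carried in the column's type modifier rather than in each stored value, roughly as below; an illustration only, not something ZSON does today.]

```
-- PostGIS example: the spatial reference ID (4326) lives in the column's
-- typmod, so it is recorded once in the catalog instead of in every value.
CREATE TABLE roads (
    id   bigint PRIMARY KEY,
    geom geometry(LineString, 4326)
);
```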
Hi hackers,

Many thanks for your feedback, I very much appreciate it!

> If the extension is mature enough, why make it an extension in contrib, and not instead either enhance the existing jsonb type with it or make it a built-in type?

> IMO we have too d*mn many JSON types already. If we can find a way to shoehorn this optimization into JSONB, that'd be great. Otherwise I do not think it's worth the added user confusion.

Magnus, Tom,

My reasoning is that if the problem can be solved with an extension there is little reason to modify the core. This seems to be in the spirit of PostgreSQL. If the community reaches the consensus to modify the core to introduce a similar feature, we could discuss this as well. It sounds like a lot of unnecessary work to me though (see below).

> * doesn't cover all cases, notably indexes.

Tom,

Not sure if I follow. What cases do you have in mind?

> Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.

Matthias,

Good point. I suspect that PostGIS is an exception though...

> I like the idea of the ZSON type, but I'm somewhat disappointed by its current limitations

Several people suggested various enhancements right after learning about ZSON. Time showed, however, that none of the real-world users really need e.g. more than one common dictionary per database. I suspect this is because no one has more than 2**16 repeatable unique strings (one dictionary limitation) in their documents. Thus there is no benefit in having separate dictionaries and the corresponding extra complexity.

> - Each dictionary uses a lot of memory, regardless of the number of actually stored keys. For 32-bit systems the base usage of a dictionary without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be almost 1MB, and for 64-bit it would be 1.7MB. That is significantly more than I'd want to install.

You are probably right on this one; this part could be optimized. I will address this if we agree on submitting the patch.

> - You call gettimeofday() in both dict_get and get_current_dict_id. These functions can be called in short and tight loops (for small ZSON fields), in which case it would add significant overhead through the implied syscalls.

I must admit, I'm not an expert in this area. My understanding is that gettimeofday() is implemented as a single virtual memory access on modern operating systems, e.g. VDSO on Linux, thus it's very cheap. I'm not that sure about other supported platforms though. Probably worth investigating.

> It does mean that you're deTOASTing the full ZSON field, and that the stored bytestring will not be structured / doesn't work well with current debuggers.

Unfortunately, I'm not very well aware of debugging tools in this context. Could you please name the debuggers I should take into account?

> We (2ndQuadrant, now part of EDB) made some enhancements to Zson a few years ago, and I have permission to contribute those if this proposal is adopted.

Andrew,

That's great, and personally I very much like the enhancements you've made. Purely out of curiosity, did they end up as part of 2ndQuadrant / EDB products? I will be happy to accept a pull request with these enhancements regardless of how the story with this proposal ends.

> Quite so. To some extent it's a toy. But at least one of our customers has found it useful, and judging by Aleksander's email they aren't alone.

Indeed, this is an extremely simple extension, ~500 effective lines of code in C. It addresses a somewhat specific scenario which, to my regret, doesn't seem to be uncommon. A pain-killer of a sort. In an ideal world, people would simply normalize their data.

--
Best regards,
Aleksander Alekseev
On 25.05.2021 13:55, Aleksander Alekseev wrote:
> So my question is whether the community would consider adding ZSON to /contrib/. If so, I will add this thread to the nearest CF and submit a corresponding patch.

Yet another approach to the same problem: https://github.com/postgrespro/jsonb_schema

Instead of compressing JSONs, we can try to automatically detect the JSON schema (names and types of JSON fields) and store it separately from the values. This approach is more similar to the one used in schema-less databases. It is most efficient when there are many JSON records with the same schema and the sizes of keys are comparable with the sizes of values. On the IMDB data set it reduces database size by about 1.7 times.
On Wed, 26 May 2021 at 12:49, Aleksander Alekseev <aleksander@timescale.com> wrote:
> Good point. I suspect that PostGIS is an exception though...

Quite a few other non-/common/ extensions are available in RDS[0], some of which are HLL (from citusdata), pglogical (from 2ndQuadrant) and orafce (from Pavel Stehule, orafce et al.).

> Several people suggested various enhancements right after learning about ZSON. Time showed, however, that none of the real-world users really need e.g. more than one common dictionary per database. I suspect this is because no one has more than 2**16 repeatable unique strings (one dictionary limitation) in their documents. Thus there is no benefit in having separate dictionaries and the corresponding extra complexity.

IMO the main benefit of having different dictionaries is that you could have a small dictionary for small and very structured JSONB fields (e.g. some time-series data), and a large one for large / unstructured JSONB fields, without the significant performance impact of having that large and varied dictionary on the small & structured field. Although a binary search is log(n) and thus still quite cheap even for large dictionaries, the extra size is certainly not free, and you'll be touching more memory in the process.

> I must admit, I'm not an expert in this area. My understanding is that gettimeofday() is implemented as a single virtual memory access on modern operating systems, e.g. VDSO on Linux, thus it's very cheap. I'm not that sure about other supported platforms though. Probably worth investigating.

Yes, but vDSO does not necessarily work on all systems: e.g. in 2017, a lot of EC2 [1] was run on Xen with vDSO not working for gettimeofday. I'm uncertain if this issue persists for their new KVM/Nitro hypervisor.

> Unfortunately, I'm not very well aware of debugging tools in this context. Could you please name the debuggers I should take into account?

Hmm, I was mistaken in that regard. I was under the impression that at least one of pageinspect, pg_filedump and pg_hexedit did support column value introspection, which they apparently do not. pg_filedump (and thus pg_hexedit) have some introspection, but none specialized for jsonb (yet).

The point I tried to make was that introspection of ZSON would be even more difficult due to it adding a non-standard compression method, which makes introspection effectively impossible (the algorithm can replace things other than the strings it should replace, so it will be difficult to recover structure from the encoded string).

With regards,

Matthias van de Meent

[0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.FeatureSupport.Extensions.13x
[1] https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
On Tue, May 25, 2021 at 01:55:13PM +0300, Aleksander Alekseev wrote:
> Back in 2016, while at PostgresPro, I developed the ZSON extension [1]. The extension introduces a new ZSON type, which is 100% compatible with JSONB but uses a shared dictionary of the strings most frequently found in the given JSONB documents for compression.

I think this is interesting because it is one of the few cases that allow compression outside of a single column. Here is a list of compression options:

https://momjian.us/main/blogs/pgblog/2020.html#April_27_2020

1. single field
2. across rows in a single page
3. across rows in a single column
4. across all columns and rows in a table
5. across tables in a database
6. across databases

While standard Postgres does #1, ZSON allows 2-5, assuming the data is in the ZSON data type. I think this cross-field compression has great potential for cases where the data is not relational, or hasn't had time to be structured relationally. It also opens questions of how to do this cleanly in a relational system.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On 5/26/21 5:29 PM, Bruce Momjian wrote:
> While standard Postgres does #1, ZSON allows 2-5, assuming the data is in the ZSON data type. I think this cross-field compression has great potential for cases where the data is not relational, or hasn't had time to be structured relationally. It also opens questions of how to do this cleanly in a relational system.

I think we're going to get the best bang for the buck on doing 2, 3, and 4. If it's confined to a single table then we can put a dictionary in something like a fork.

Maybe given partitioning we want to be able to do multi-table dictionaries, but that's less certain.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 5/26/21 6:43 PM, Matthias van de Meent wrote:
> IMO the main benefit of having different dictionaries is that you could have a small dictionary for small and very structured JSONB fields (e.g. some time-series data), and a large one for large / unstructured JSONB fields, without the significant performance impact of having that large and varied dictionary on the small & structured field. Although a binary search is log(n) and thus still quite cheap even for large dictionaries, the extra size is certainly not free, and you'll be touching more memory in the process.

I'm sure we can think of various other arguments for allowing separate dictionaries. For example, what if you drop a column? With one huge dictionary you're bound to keep the data forever. With per-column dicts you can just drop the dict and free disk space / memory.

I also find it hard to believe that no one needs 2**16 strings. I mean, 65k is not that much, really. To give an example, I've been toying with storing the bitcoin blockchain in a database - one way to do that is storing each block as a single JSONB document. But each "item" (e.g. transaction) is identified by a unique hash, so that means (tens of) thousands of unique strings *per document*.

Yes, it's a bit silly and extreme, and maybe the compression would not help much in this case. But it shows that 2**16 is damn easy to hit.

In other words, this seems like a nice example of survivor bias, where we only look at cases for which the existing limitations are acceptable, ignoring the (many) remaining cases eliminated by those limitations.

>>> - Each dictionary uses a lot of memory, regardless of the number of actually stored keys. For 32-bit systems the base usage of a dictionary without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be almost 1MB, and for 64-bit it would be 1.7MB. That is significantly more than I'd want to install.
>>
>> You are probably right on this one; this part could be optimized. I will address this if we agree on submitting the patch.

I'm sure it can be optimized, but I also think it's focusing on the base memory usage too much. What I care about is the end result, i.e. how much disk space / memory I save at the end. I don't care if it's 1MB or 1.7MB if using the compression saves me e.g. 50% of disk space. And it's completely irrelevant if I can't use the feature because of limitations stemming from the "single dictionary" design (in which case I save the 0.7MB, but I also lose the 50% disk space savings - not a great trade-off, if you ask me).

>>> - You call gettimeofday() in both dict_get and get_current_dict_id. These functions can be called in short and tight loops (for small ZSON fields), in which case it would add significant overhead through the implied syscalls.
>>
>> I must admit, I'm not an expert in this area. My understanding is that gettimeofday() is implemented as a single virtual memory access on modern operating systems, e.g. VDSO on Linux, thus it's very cheap. I'm not that sure about other supported platforms though. Probably worth investigating.
>
> Yes, but vDSO does not necessarily work on all systems: e.g. in 2017, a lot of EC2 [1] was run on Xen with vDSO not working for gettimeofday. I'm uncertain if this issue persists for their new KVM/Nitro hypervisor.

Yeah. Better not to call gettimeofday() very often. I have no idea why the code even does that, though - it seems to be deciding whether it's OK to use the cached dictionary based on the timestamp, but that seems rather dubious. But it's hard to say, because there are about no useful comments *anywhere* in the code.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
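[A quick, hypothetical illustration of the point that 2**16 distinct strings is easy to exceed: a single document keyed by hash-like strings.]

```
-- Builds one JSONB document with 100,000 distinct hash keys - already well
-- past the 2**16 entries a single ZSON dictionary can hold.
SELECT jsonb_object_agg(md5(i::text), i) AS doc
FROM generate_series(1, 100000) AS i;
```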
On 5/26/21 2:49 AM, Stephen Frost wrote:
> I do agree that each jsonb/zson/whatever column should really be able to have its own dictionary though, and maybe you could shove *which* of those dictionaries a given column uses into the typemod for that column... In an ideal world, however, we wouldn't make a user have to actually do that, and instead we'd just build our own magically for them when they use jsonb.

I think doing this properly will require inventing new infrastructure to associate some custom parameters with a column (and/or data type). In principle it seems quite similar to 911e702077, which introduced opclass parameters and allowed implementing the new BRIN opclasses in PG14.

Even if we eventually decide not to add zson into contrib (or core), it seems like this infrastructure would make zson more usable in practice.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/27/21 4:15 AM, Andrew Dunstan wrote:
> I think we're going to get the best bang for the buck on doing 2, 3, and 4. If it's confined to a single table then we can put a dictionary in something like a fork.

Agreed.

> Maybe given partitioning we want to be able to do multi-table dictionaries, but that's less certain.

Yeah. I think it'll have many of the same issues/complexity as global indexes, and the gains are likely limited. At least assuming the partitions are sufficiently large, but tiny partitions are inefficient in general, I think.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/28/21 6:35 AM, Tomas Vondra wrote:
> I also find it hard to believe that no one needs 2**16 strings. I mean, 65k is not that much, really. [...]
>
> In other words, this seems like a nice example of survivor bias, where we only look at cases for which the existing limitations are acceptable, ignoring the (many) remaining cases eliminated by those limitations.

I don't think we should lightly discard the use of 2-byte keys though. Maybe we could use a scheme similar to what we use for text lengths, where the first bit indicates whether we have a 1-byte or 4-byte length indicator. Many dictionaries will have fewer than 2^15-1 entries, so they would use exclusively the smaller keys.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 5/28/21 4:22 PM, Andrew Dunstan wrote:
> I don't think we should lightly discard the use of 2-byte keys though. Maybe we could use a scheme similar to what we use for text lengths, where the first bit indicates whether we have a 1-byte or 4-byte length indicator. Many dictionaries will have fewer than 2^15-1 entries, so they would use exclusively the smaller keys.

I didn't mean to discard that, of course. I'm sure a lot of data sets may be perfectly fine with 64k keys, and it may be worth optimizing that as a special case. All I'm saying is that if we start from the position that this limit is perfectly fine and no one is going to hit it in practice, it may be due to people not even trying it on documents with more keys.

That being said, I still don't think the 1MB vs. 1.7MB figure is particularly meaningful, because it's for an "empty" dictionary, which is something you'll not have in practice. And once you start adding keys, the difference will get less and less significant.

However, if we care about efficiency for "small" JSON documents, it's probably worth using something like varint [1], which is 1-4B depending on the value.

[1] https://learnmeabitcoin.com/technical/varint

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 25, 2021, at 22:10, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > If the extension is mature enough, why make it an extension in contrib, and not instead either enhance the existing jsonb type with it or make it a built-in type?
>
> IMO we have too d*mn many JSON types already. If we can find a way to shoehorn this optimization into JSONB, that'd be great. Otherwise I do not think it's worth the added user confusion.
I think the json situation is unfortunate.
If the json type were carefully designed from scratch, with all the experience accumulated over the years of working with json and jsonb, I think the end result would probably be quite different.
For instance, I remember Marko Tiikkaja implemented his own json type many years ago when we worked together at the same company, needing json before PostgreSQL had support for it. I remember thinking that some ideas in his interface felt more natural than those in the built-in json type we later got.
While zson improves on efficiency, there are probably lots of other improvements in the interface that could be made as well.
Instead of trying to fix the existing built-in json type, I think it would be better to package the built-in functionality as a "json" extension, that would come pre-installed, similar to how "plpgsql" comes pre-installed.
Users who feel they are unhappy with the entire json/jsonb types could then install "zson" or some other competing json type instead. This would allow the life-cycles of legacy/deprecated versions to overlap with future versions.
Uninstallable Pre-Installed Extensions as a concept could perhaps be a feasible alternative to shoehorning in functionality/optimizations in general, and also a way to avoid GUCs.
The biggest downside I see is the risk for confusion among users, since there can then be multiple competing implementations providing the same functionality. It's nice to have built-ins when all users love the built-ins.
/Joel
Hi hackers,
Many thanks for the feedback and all the great suggestions!
I decided to add the patch to the nearest commitfest. You will find it in the attachment.
Differences from the GitHub version:
- Code formatting changed;
- More comments added to the code;
- SGML documentation added;
- Plus several minor changes;
I very much like the ideas:
- To use varint, as Tomas suggested
- Make dictionaries variable in size
- Somehow avoid calling gettimeofday()
- Improvements by 2ndQuadrant that Andrew named
However, I would like to decompose the task into 1) deciding whether the extension is worth adding to /contrib/ and 2) improving it. Since there are people who already use ZSON, the extension should be backward-compatible with the current ZSON format anyway. Also, every improvement deserves its own discussion, testing, and benchmarking. Thus I believe the suggested approach will simplify the job for reviewers, and also save us time if the patch is declined. If the patch is accepted, I will be delighted to submit follow-up patches!
If you have any other ideas on how the extension can be improved in the future, please don't hesitate to name them in this thread. Also, I would appreciate some code review.
--
Best regards,
Aleksander Alekseev
On 04.06.21 17:09, Aleksander Alekseev wrote:
> I decided to add the patch to the nearest commitfest.

With respect to the commit fest submission, I don't think there is consensus right now to add this. I think people would prefer that this dictionary facility be somehow made available in the existing JSON types. Also, I sense that there is still some volatility about some of the details of how this extension should work and its scope. I think this is served best as an external extension for now.
On 7/3/21 12:34 PM, Peter Eisentraut wrote:
> With respect to the commit fest submission, I don't think there is consensus right now to add this. I think people would prefer that this dictionary facility be somehow made available in the existing JSON types. Also, I sense that there is still some volatility about some of the details of how this extension should work and its scope. I think this is served best as an external extension for now.

I agree there's a lot of open questions to figure out, but I think this "column-level compression" capability has a lot of potential. Not just for structured documents like JSON, but maybe even for scalar types.

I don't think the question whether this should be built into jsonb, a separate built-in type, a contrib type or something external is the one we need to answer first. The first thing I'd like to see is some "proof" that it's actually useful in practice - there were some claims about people/customers using it and being happy with the benefits, but there were no actual examples of data sets that are expected to benefit, compression ratios etc. And considering that [1] went unnoticed for 5 years, I have my doubts about it being used very widely. (I may be wrong and maybe people are just not casting jsonb to zson.)

I've tried to use this on the one large non-synthetic JSONB dataset I had at hand at the moment, which is the bitcoin blockchain. That's ~1TB with JSONB, and when I tried using ZSON instead there was no measurable benefit; in fact the database was a bit larger. But I admit btc data is rather strange, because it contains a lot of randomness (all the tx and block IDs are random-looking hashes, etc.), and there's a lot of them in each document. So maybe that's simply a data set that can't benefit from zson on principle. I also suspect the zson_extract_strings() is pretty inefficient, and I ran into various issues with the btc blocks, which have very many keys, often far more than the 10k limit.

In any case, I think having a clear example(s) of practical data sets that benefit from using zson would be very useful, both to guide the development and to show what the potential gains are.

The other thing is that a lot of the stuff seems to be manual (e.g. the learning), and not really well integrated with the core. IMO improving this by implementing the necessary infrastructure would help all the possible cases (built-in type, contrib, external extension).

regards

[1] https://github.com/postgrespro/zson/commit/02db084ea3b94d9e68fd912dea97094634fcdea5

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
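[A rough way to collect such before/after numbers on a copy of one's own data - a sketch only, assuming the zson extension is installed, a dictionary has already been learned, and a hypothetical docs(id, doc jsonb) table.]

```
-- Copy the documents into a zson column and compare on-disk sizes.
CREATE TABLE docs_zson AS
    SELECT id, doc::zson AS doc FROM docs;

SELECT pg_size_pretty(pg_total_relation_size('docs'))      AS jsonb_size,
       pg_size_pretty(pg_total_relation_size('docs_zson')) AS zson_size;
```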
Hi hackers,
Many thanks for all the great feedback!
Please see the follow-up thread `RFC: compression dictionaries for JSONB`:
https://www.postgresql.org/message-id/CAJ7c6TPx7N-bVw0dZ1ASCDQKZJHhBYkT6w4HV1LzfS%2BUUTUfmA%40mail.gmail.com
--
Best regards,
Aleksander Alekseev
Open-Source PostgreSQL Contributor at Timescale