Thread: jsonb format is pessimal for toast compression
I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. This interacts poorly with the code in pglz_compress() that gives up if it's found nothing compressible in the first first_success_by bytes of a value-to-be-compressed. (first_success_by is 1024 in the default set of compression parameters.) As an example, here's gdb's report of the bitwise representation of the example JSON value in the bug thread: 0x2ab85ac: 0x20000005 0x00000004 0x50003098 0x0000309f 0x2ab85bc: 0x000030ae 0x000030b8 0x000030cf 0x000030da 0x2ab85cc: 0x000030df 0x000030ee 0x00003105 0x6b6e756a 0x2ab85dc: 0x400000de 0x00000034 0x00000068 0x0000009c 0x2ab85ec: 0x000000d0 0x00000104 0x00000138 0x0000016c 0x2ab85fc: 0x000001a0 0x000001d4 0x00000208 0x0000023c 0x2ab860c: 0x00000270 0x000002a4 0x000002d8 0x0000030c 0x2ab861c: 0x00000340 0x00000374 0x000003a8 0x000003dc 0x2ab862c: 0x00000410 0x00000444 0x00000478 0x000004ac 0x2ab863c: 0x000004e0 0x00000514 0x00000548 0x0000057c 0x2ab864c: 0x000005b0 0x000005e4 0x00000618 0x0000064c 0x2ab865c: 0x00000680 0x000006b4 0x000006e8 0x0000071c 0x2ab866c: 0x00000750 0x00000784 0x000007b8 0x000007ec 0x2ab867c: 0x00000820 0x00000854 0x00000888 0x000008bc 0x2ab868c: 0x000008f0 0x00000924 0x00000958 0x0000098c 0x2ab869c: 0x000009c0 0x000009f4 0x00000a28 0x00000a5c 0x2ab86ac: 0x00000a90 0x00000ac4 0x00000af8 0x00000b2c 0x2ab86bc: 0x00000b60 0x00000b94 0x00000bc8 0x00000bfc 0x2ab86cc: 0x00000c30 0x00000c64 0x00000c98 0x00000ccc 0x2ab86dc: 0x00000d00 0x00000d34 0x00000d68 0x00000d9c 0x2ab86ec: 0x00000dd0 0x00000e04 0x00000e38 0x00000e6c 0x2ab86fc: 0x00000ea0 0x00000ed4 0x00000f08 0x00000f3c 0x2ab870c: 0x00000f70 0x00000fa4 0x00000fd8 0x0000100c 0x2ab871c: 0x00001040 0x00001074 0x000010a8 0x000010dc 0x2ab872c: 0x00001110 0x00001144 0x00001178 0x000011ac 0x2ab873c: 0x000011e0 0x00001214 0x00001248 0x0000127c 0x2ab874c: 0x000012b0 0x000012e4 0x00001318 0x0000134c 0x2ab875c: 0x00001380 0x000013b4 0x000013e8 0x0000141c 0x2ab876c: 0x00001450 0x00001484 0x000014b8 0x000014ec 0x2ab877c: 0x00001520 0x00001554 0x00001588 0x000015bc 0x2ab878c: 0x000015f0 0x00001624 0x00001658 0x0000168c 0x2ab879c: 0x000016c0 0x000016f4 0x00001728 0x0000175c 0x2ab87ac: 0x00001790 0x000017c4 0x000017f8 0x0000182c 0x2ab87bc: 0x00001860 0x00001894 0x000018c8 0x000018fc 0x2ab87cc: 0x00001930 0x00001964 0x00001998 0x000019cc 0x2ab87dc: 0x00001a00 0x00001a34 0x00001a68 0x00001a9c 0x2ab87ec: 0x00001ad0 0x00001b04 0x00001b38 0x00001b6c 0x2ab87fc: 0x00001ba0 0x00001bd4 0x00001c08 0x00001c3c 0x2ab880c: 0x00001c70 0x00001ca4 0x00001cd8 0x00001d0c 0x2ab881c: 0x00001d40 0x00001d74 0x00001da8 0x00001ddc 0x2ab882c: 0x00001e10 0x00001e44 0x00001e78 0x00001eac 0x2ab883c: 0x00001ee0 0x00001f14 0x00001f48 0x00001f7c 0x2ab884c: 0x00001fb0 0x00001fe4 0x00002018 0x0000204c 0x2ab885c: 0x00002080 0x000020b4 0x000020e8 0x0000211c 0x2ab886c: 0x00002150 0x00002184 0x000021b8 0x000021ec 0x2ab887c: 0x00002220 0x00002254 0x00002288 0x000022bc 0x2ab888c: 0x000022f0 0x00002324 0x00002358 0x0000238c 0x2ab889c: 0x000023c0 0x000023f4 0x00002428 0x0000245c 0x2ab88ac: 0x00002490 0x000024c4 0x000024f8 0x0000252c 0x2ab88bc: 0x00002560 0x00002594 0x000025c8 0x000025fc 0x2ab88cc: 0x00002630 0x00002664 0x00002698 0x000026cc 0x2ab88dc: 0x00002700 0x00002734 0x00002768 0x0000279c 0x2ab88ec: 0x000027d0 0x00002804 0x00002838 
0x0000286c 0x2ab88fc: 0x000028a0 0x000028d4 0x00002908 0x0000293c 0x2ab890c: 0x00002970 0x000029a4 0x000029d8 0x00002a0c 0x2ab891c: 0x00002a40 0x00002a74 0x00002aa8 0x00002adc 0x2ab892c: 0x00002b10 0x00002b44 0x00002b78 0x00002bac 0x2ab893c: 0x00002be0 0x00002c14 0x00002c48 0x00002c7c 0x2ab894c: 0x00002cb0 0x00002ce4 0x00002d18 0x32343231 0x2ab895c: 0x74653534 0x74656577 0x33746577 0x77673534 0x2ab896c: 0x74657274 0x33347477 0x72777120 0x20717771 0x2ab897c: 0x65727771 0x20777120 0x66647372 0x73616b6c 0x2ab898c: 0x33353471 0x71772035 0x72777172 0x71727771 0x2ab899c: 0x77203277 0x72777172 0x71727771 0x33323233 0x2ab89ac: 0x6b207732 0x20657773 0x73616673 0x73207372 0x2ab89bc: 0x64736664 0x32343231 0x74653534 0x74656577 0x2ab89cc: 0x33746577 0x77673534 0x74657274 0x33347477 0x2ab89dc: 0x72777120 0x20717771 0x65727771 0x20777120 0x2ab89ec: 0x66647372 0x73616b6c 0x33353471 0x71772035 0x2ab89fc: 0x72777172 0x71727771 0x77203277 0x72777172 0x2ab8a0c: 0x71727771 0x33323233 0x6b207732 0x20657773 0x2ab8a1c: 0x73616673 0x73207372 0x64736664 0x32343231 0x2ab8a2c: 0x74653534 0x74656577 0x33746577 0x77673534 0x2ab8a3c: 0x74657274 0x33347477 0x72777120 0x20717771 0x2ab8a4c: 0x65727771 0x20777120 0x66647372 0x73616b6c 0x2ab8a5c: 0x33353471 0x71772035 0x72777172 0x71727771 0x2ab8a6c: 0x77203277 0x72777172 0x71727771 0x33323233 0x2ab8a7c: 0x6b207732 0x20657773 0x73616673 0x73207372 0x2ab8a8c: 0x64736664 0x32343231 0x74653534 0x74656577 0x2ab8a9c: 0x33746577 0x77673534 0x74657274 0x33347477 0x2ab8aac: 0x72777120 0x20717771 0x65727771 0x20777120 0x2ab8abc: 0x66647372 0x73616b6c 0x33353471 0x71772035 0x2ab8acc: 0x72777172 0x71727771 0x77203277 0x72777172 0x2ab8adc: 0x71727771 0x33323233 0x6b207732 0x20657773 0x2ab8aec: 0x73616673 0x73207372 0x64736664 0x32343231 0x2ab8afc: 0x74653534 0x74656577 0x33746577 0x77673534 ... 0x2abb61c: 0x74657274 0x33347477 0x72777120 0x20717771 0x2abb62c: 0x65727771 0x20777120 0x66647372 0x73616b6c 0x2abb63c: 0x33353471 0x71772035 0x72777172 0x71727771 0x2abb64c: 0x77203277 0x72777172 0x71727771 0x33323233 0x2abb65c: 0x6b207732 0x20657773 0x73616673 0x73207372 0x2abb66c: 0x64736664 0x537a6962 0x41706574 0x73756220 0x2abb67c: 0x73656e69 0x74732073 0x45617065 0x746e6576 0x2abb68c: 0x656d6954 0x34313032 0x2d38302d 0x32203730 0x2abb69c: 0x33323a31 0x2e33333a 0x62393434 0x6f4c7a69 0x2abb6ac: 0x69746163 0x61506e6f 0x74736972 0x736e6172 0x2abb6bc: 0x69746361 0x61446e6f 0x30326574 0x302d3431 0x2abb6cc: 0x37302d38 0x3a313220 0x333a3332 0x34342e33

There is plenty of compressible data once we get into the repetitive strings in the payload part --- but that starts at offset 944, and up to that point there is nothing that pg_lzcompress can get a handle on. There are, by definition, no sequences of 4 or more repeated bytes in that area. I think in principle pg_lzcompress could decide to compress the 3-byte sequences consisting of the high-order 24 bits of each offset; but it doesn't choose to do so, probably because of the way its lookup hash table works:

* pglz_hist_idx -
*
* Computes the history table slot for the lookup by the next 4
* characters in the input.
*
* NB: because we use the next 4 characters, we are not guaranteed to
* find 3-character matches; they very possibly will be in the wrong
* hash list. This seems an acceptable tradeoff for spreading out the
* hash keys more.

For jsonb header data, the "next 4 characters" are *always* different, so only a chance hash collision can result in a match.
There is therefore a pretty good chance that no compression will occur before it gives up because of first_success_by. I'm not sure if there is any easy fix for this. We could possibly change the default first_success_by value, but I think that'd just be postponing the problem to larger jsonb objects/arrays, and it would hurt performance for genuinely incompressible data. A somewhat painful, but not yet out-of-the-question, alternative is to change the jsonb on-disk representation. Perhaps the JEntry array could be defined as containing element lengths instead of element ending offsets. Not sure though if that would break binary searching for JSON object keys. regards, tom lane
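To make the incompressibility concrete, here is a minimal standalone sketch (not PostgreSQL source; the 256-entry count and 0x34 stride merely mirror the dump above) that scans a strictly-increasing offset array for any repeated 4-byte window of the kind pglz's history hash could latch onto:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int
    main(void)
    {
        uint32_t    offsets[256];   /* 256 entries * 4 bytes = 1024 bytes, i.e. first_success_by */
        const unsigned char *buf = (const unsigned char *) offsets;
        size_t      len = sizeof(offsets);
        int         repeats = 0;

        /* strictly increasing ending offsets with a fixed 0x34-byte stride */
        for (int i = 0; i < 256; i++)
            offsets[i] = (uint32_t) (i + 1) * 0x34;

        /* brute force: does any 4-byte window recur anywhere earlier? */
        for (size_t i = 1; i + 4 <= len; i++)
            for (size_t j = 0; j < i; j++)
                if (memcmp(buf + i, buf + j, 4) == 0)
                    repeats++;

        printf("repeated 4-byte windows in %zu header bytes: %d\n", len, repeats);
        return 0;
    }

For this buffer it prints 0 (byte order is assumed little-endian, as in the gdb dump above), which is exactly the situation described: there is nothing for the history lookup to match before the early-exit threshold is reached.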
Apologies if this is a ridiculous suggestion, but I believe that swapping out the compression algorithm (for Snappy, for example) has been discussed in the past. I wonder if that algorithm is sufficiently different that it would produce a better result, and if that might not be preferable to some of the other options.

On Thu, Aug 7, 2014 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I looked into the issue reported in bug #11109. The problem appears to be
> that jsonb's on-disk format is designed in such a way that the leading
> portion of any JSON array or object will be fairly incompressible, because
> it consists mostly of a strictly-increasing series of integer offsets.
> This interacts poorly with the code in pglz_compress() that gives up if
> it's found nothing compressible in the first first_success_by bytes of a
> value-to-be-compressed. (first_success_by is 1024 in the default set of
> compression parameters.)
> [snip]
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I looked into the issue reported in bug #11109. The problem appears to be
> that jsonb's on-disk format is designed in such a way that the leading
> portion of any JSON array or object will be fairly incompressible, because
> it consists mostly of a strictly-increasing series of integer offsets.
> This interacts poorly with the code in pglz_compress() that gives up if
> it's found nothing compressible in the first first_success_by bytes of a
> value-to-be-compressed. (first_success_by is 1024 in the default set of
> compression parameters.)

I haven't looked at this in any detail, so take this with a grain of salt, but what about teaching pglz_compress about using an offset farther into the data, if the incoming data is quite a bit larger than 1k? This is just a test to see if it's worthwhile to keep going, no? I wonder if this might even be able to be provided as a type-specific option, to avoid changing the behavior for types other than jsonb in this regard. (I'm imagining a boolean saying "pick a random sample", or perhaps a function which can be called that'll return "here's where you wanna test if this thing is gonna compress at all")

I'm rather disinclined to change the on-disk format because of this specific test, that feels a bit like the tail wagging the dog to me, especially as I do hope that some day we'll figure out a way to use a better compression algorithm than pglz.

Thanks,

Stephen
On Fri, Aug 8, 2014 at 10:48 AM, Stephen Frost <sfrost@snowman.net> wrote:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I looked into the issue reported in bug #11109. The problem appears to be
> that jsonb's on-disk format is designed in such a way that the leading
> portion of any JSON array or object will be fairly incompressible, because
> it consists mostly of a strictly-increasing series of integer offsets.
> This interacts poorly with the code in pglz_compress() that gives up if
> it's found nothing compressible in the first first_success_by bytes of a
> value-to-be-compressed. (first_success_by is 1024 in the default set of
> compression parameters.)

I haven't looked at this in any detail, so take this with a grain of
salt, but what about teaching pglz_compress about using an offset
farther into the data, if the incoming data is quite a bit larger than
1k? This is just a test to see if it's worthwhile to keep going, no? I
wonder if this might even be able to be provided as a type-specific
option, to avoid changing the behavior for types other than jsonb in
this regard.
+1 for offset. Or sample the data at the beginning, middle and end. Obviously one could always come up with a worst case, but.
(I'm imagining a boolean saying "pick a random sample", or perhaps a
function which can be called that'll return "here's where you wanna test
if this thing is gonna compress at all")
I'm rather disinclined to change the on-disk format because of this
specific test, that feels a bit like the tail wagging the dog to me,
especially as I do hope that some day we'll figure out a way to use a
better compression algorithm than pglz.
Thanks,
Stephen
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 08/07/2014 11:17 PM, Tom Lane wrote: > I looked into the issue reported in bug #11109. The problem appears to be > that jsonb's on-disk format is designed in such a way that the leading > portion of any JSON array or object will be fairly incompressible, because > it consists mostly of a strictly-increasing series of integer offsets. > This interacts poorly with the code in pglz_compress() that gives up if > it's found nothing compressible in the first first_success_by bytes of a > value-to-be-compressed. (first_success_by is 1024 in the default set of > compression parameters.) [snip] > There is plenty of compressible data once we get into the repetitive > strings in the payload part --- but that starts at offset 944, and up to > that point there is nothing that pg_lzcompress can get a handle on. There > are, by definition, no sequences of 4 or more repeated bytes in that area. > I think in principle pg_lzcompress could decide to compress the 3-byte > sequences consisting of the high-order 24 bits of each offset; but it > doesn't choose to do so, probably because of the way its lookup hash table > works: > > * pglz_hist_idx - > * > * Computes the history table slot for the lookup by the next 4 > * characters in the input. > * > * NB: because we use the next 4 characters, we are not guaranteed to > * find 3-character matches; they very possibly will be in the wrong > * hash list. This seems an acceptable tradeoff for spreading out the > * hash keys more. > > For jsonb header data, the "next 4 characters" are *always* different, so > only a chance hash collision can result in a match. There is therefore a > pretty good chance that no compression will occur before it gives up > because of first_success_by. > > I'm not sure if there is any easy fix for this. We could possibly change > the default first_success_by value, but I think that'd just be postponing > the problem to larger jsonb objects/arrays, and it would hurt performance > for genuinely incompressible data. A somewhat painful, but not yet > out-of-the-question, alternative is to change the jsonb on-disk > representation. Perhaps the JEntry array could be defined as containing > element lengths instead of element ending offsets. Not sure though if > that would break binary searching for JSON object keys. > > Ouch. Back when this structure was first presented at pgCon 2013, I wondered if we shouldn't extract the strings into a dictionary, because of key repetition, and convinced myself that this shouldn't be necessary because in significant cases TOAST would take care of it. Maybe we should have pglz_compress() look at the *last* 1024 bytes if it can't find anything worth compressing in the first, for values larger than a certain size. It's worth noting that this is a fairly pathological case. AIUI the example you constructed has an array with 100k string elements. I don't think that's typical. So I suspect that unless I've misunderstood the statement of the problem we're going to find that almost all the jsonb we will be storing is still compressible. cheers andrew
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> I looked into the issue reported in bug #11109. The problem appears to be
>> that jsonb's on-disk format is designed in such a way that the leading
>> portion of any JSON array or object will be fairly incompressible, because
>> it consists mostly of a strictly-increasing series of integer offsets.
>> This interacts poorly with the code in pglz_compress() that gives up if
>> it's found nothing compressible in the first first_success_by bytes of a
>> value-to-be-compressed. (first_success_by is 1024 in the default set of
>> compression parameters.)

> I haven't looked at this in any detail, so take this with a grain of
> salt, but what about teaching pglz_compress about using an offset
> farther into the data, if the incoming data is quite a bit larger than
> 1k? This is just a test to see if it's worthwhile to keep going, no?

Well, the point of the existing approach is that it's a *nearly free* test to see if it's worthwhile to keep going; there's just one if-test added in the outer loop of the compression code. (cf commit ad434473ebd2, which added that along with some other changes.) AFAICS, what we'd have to do to do it as you suggest would be to execute compression on some subset of the data and then throw away that work entirely. I do not find that attractive, especially when for most datatypes there's no particular reason to look at one subset instead of another.

> I'm rather disinclined to change the on-disk format because of this
> specific test, that feels a bit like the tail wagging the dog to me,
> especially as I do hope that some day we'll figure out a way to use a
> better compression algorithm than pglz.

I'm unimpressed by that argument too, for a number of reasons:

1. The real problem here is that jsonb is emitting quite a bit of fundamentally-nonrepetitive data, even when the user-visible input is very repetitive. That's a compression-unfriendly transformation by anyone's measure. Assuming that some future replacement for pg_lzcompress() will nonetheless be able to compress the data strikes me as mostly wishful thinking. Besides, we'd more than likely have a similar early-exit rule in any substitute implementation, so that we'd still be at risk even if it usually worked.

2. Are we going to ship 9.4 without fixing this? I definitely don't see replacing pg_lzcompress as being on the agenda for 9.4, whereas changing jsonb is still within the bounds of reason.

Considering all the hype that's built up around jsonb, shipping a design with a fundamental performance handicap doesn't seem like a good plan to me. We could perhaps band-aid around it by using different compression parameters for jsonb, although that would require some painful API changes since toast_compress_datum() doesn't know what datatype it's operating on.

regards, tom lane
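For reference, the shape of that nearly-free test is roughly the following (a sketch with illustrative names only, not the actual pglz_compress() variables); the point is that it is a single comparison made while compressing the beginning of the input, not a separate trial pass over some other part of the datum:

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch only: give up if nothing has matched by the time we have
     * consumed first_success_by bytes of input. */
    static bool
    give_up_early(int32_t bytes_consumed, bool found_any_match,
                  int32_t first_success_by)
    {
        return !found_any_match && bytes_consumed >= first_success_by;
    }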
Andrew Dunstan <andrew@dunslane.net> writes: > On 08/07/2014 11:17 PM, Tom Lane wrote: >> I looked into the issue reported in bug #11109. The problem appears to be >> that jsonb's on-disk format is designed in such a way that the leading >> portion of any JSON array or object will be fairly incompressible, because >> it consists mostly of a strictly-increasing series of integer offsets. > Ouch. > Back when this structure was first presented at pgCon 2013, I wondered > if we shouldn't extract the strings into a dictionary, because of key > repetition, and convinced myself that this shouldn't be necessary > because in significant cases TOAST would take care of it. That's not really the issue here, I think. The problem is that a relatively minor aspect of the representation, namely the choice to store a series of offsets rather than a series of lengths, produces nonrepetitive data even when the original input is repetitive. > Maybe we should have pglz_compress() look at the *last* 1024 bytes if it > can't find anything worth compressing in the first, for values larger > than a certain size. Not possible with anything like the current implementation, since it's just an on-the-fly status check not a trial compression. > It's worth noting that this is a fairly pathological case. AIUI the > example you constructed has an array with 100k string elements. I don't > think that's typical. So I suspect that unless I've misunderstood the > statement of the problem we're going to find that almost all the jsonb > we will be storing is still compressible. Actually, the 100K-string example I constructed *did* compress. Larry's example that's not compressing is only about 12kB. AFAICS, the threshold for trouble is in the vicinity of 256 array or object entries (resulting in a 1kB JEntry array). That doesn't seem especially high. There is a probabilistic component as to whether the early-exit case will actually fire, since any chance hash collision will probably result in some 3-byte offset prefix getting compressed. But the fact that a beta tester tripped over this doesn't leave me with a warm feeling about the odds that it won't happen much in the field. regards, tom lane
On 08/08/2014 11:18 AM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 08/07/2014 11:17 PM, Tom Lane wrote: >>> I looked into the issue reported in bug #11109. The problem appears to be >>> that jsonb's on-disk format is designed in such a way that the leading >>> portion of any JSON array or object will be fairly incompressible, because >>> it consists mostly of a strictly-increasing series of integer offsets. > >> Back when this structure was first presented at pgCon 2013, I wondered >> if we shouldn't extract the strings into a dictionary, because of key >> repetition, and convinced myself that this shouldn't be necessary >> because in significant cases TOAST would take care of it. > That's not really the issue here, I think. The problem is that a > relatively minor aspect of the representation, namely the choice to store > a series of offsets rather than a series of lengths, produces > nonrepetitive data even when the original input is repetitive. It would certainly be worth validating that changing this would fix the problem. I don't know how invasive that would be - I suspect (without looking very closely) not terribly much. > 2. Are we going to ship 9.4 without fixing this? I definitely don't see > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing > jsonb is still within the bounds of reason. > > Considering all the hype that's built up around jsonb, shipping a design > with a fundamental performance handicap doesn't seem like a good plan > to me. We could perhaps band-aid around it by using different compression > parameters for jsonb, although that would require some painful API changes > since toast_compress_datum() doesn't know what datatype it's operating on. > > Yeah, it would be a bit painful, but after all finding out this sort of thing is why we have betas. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > On 08/08/2014 11:18 AM, Tom Lane wrote: >> That's not really the issue here, I think. The problem is that a >> relatively minor aspect of the representation, namely the choice to store >> a series of offsets rather than a series of lengths, produces >> nonrepetitive data even when the original input is repetitive. > It would certainly be worth validating that changing this would fix the > problem. > I don't know how invasive that would be - I suspect (without looking > very closely) not terribly much. I took a quick look and saw that this wouldn't be that easy to get around. As I'd suspected upthread, there are places that do random access into a JEntry array, such as the binary search in findJsonbValueFromContainer(). If we have to add up all the preceding lengths to locate the corresponding value part, we lose the performance advantages of binary search. AFAICS that's applied directly to the on-disk representation. I'd thought perhaps there was always a transformation step to build a pointer list, but nope. regards, tom lane
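A simplified sketch of the access pattern at issue (this is not the real JsonbContainer/JEntry layout, just the shape of the problem): with ending offsets, the i-th element's data can be located in O(1) during each probe of the binary search in findJsonbValueFromContainer(), whereas with bare lengths every probe would first have to sum the preceding entries.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical, simplified container; field names are illustrative. */
    typedef struct
    {
        uint32_t   *entries;        /* ending offsets (or, alternatively, lengths) */
        size_t      nentries;
        const char *data;           /* start of the variable-length payload */
    } FakeContainer;

    /* offsets variant: the start of element i is the previous ending offset, O(1) */
    static const char *
    elem_start_from_offsets(const FakeContainer *c, size_t i)
    {
        uint32_t    start = (i == 0) ? 0 : c->entries[i - 1];

        return c->data + start;
    }

    /* lengths variant: the start of element i needs an O(i) summation per probe */
    static const char *
    elem_start_from_lengths(const FakeContainer *c, size_t i)
    {
        uint32_t    start = 0;

        for (size_t j = 0; j < i; j++)
            start += c->entries[j];
        return c->data + start;
    }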
On Fri, Aug 8, 2014 at 8:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm rather disinclined to change the on-disk format because of this
> specific test, that feels a bit like the tail wagging the dog to me,
> especially as I do hope that some day we'll figure out a way to use a
> better compression algorithm than pglz.

I'm unimpressed by that argument too, for a number of reasons:
1. The real problem here is that jsonb is emitting quite a bit of
fundamentally-nonrepetitive data, even when the user-visible input is very
repetitive. That's a compression-unfriendly transformation by anyone's
measure. Assuming that some future replacement for pg_lzcompress() will
nonetheless be able to compress the data strikes me as mostly wishful
thinking. Besides, we'd more than likely have a similar early-exit rule
in any substitute implementation, so that we'd still be at risk even if
it usually worked.
Would an answer be to switch the location of the jsonb "header" data to the end of the field as opposed to the beginning of the field? That would allow pglz to see what it wants to see early on and go to work when possible?
Add an offset at the top of the field to show where to look - but then it would be the same in terms of functionality outside of that? Or pretty close?
John
John W Higgins <wishdev@gmail.com> writes: > Would an answer be to switch the location of the jsonb "header" data to the > end of the field as opposed to the beginning of the field? That would allow > pglz to see what it wants to see early on and go to work when possible? Hm, might work. Seems a bit odd, but it would make pglz_compress happier. OTOH, the big-picture issue here is that jsonb is generating noncompressible data in the first place. Putting it somewhere else in the Datum doesn't change the fact that we're going to have bloated storage, even if we dodge the early-exit problem. (I suspect the compression disadvantage vs text/plain-json that I showed yesterday is coming largely from that offset array.) But I don't currently see how to avoid that and still preserve the fast binary-search key lookup property, which is surely a nice thing to have. regards, tom lane
On 08/08/2014 12:04 PM, John W Higgins wrote: > > Would an answer be to switch the location of the jsonb "header" data > to the end of the field as opposed to the beginning of the field? That > would allow pglz to see what it wants to see early on and go to work > when possible? > > Add an offset at the top of the field to show where to look - but then > it would be the same in terms of functionality outside of that? Or > pretty close? > That might make building up jsonb structures piece by piece as we do difficult. cheers andrew
On 08/08/2014 11:54 AM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 08/08/2014 11:18 AM, Tom Lane wrote: >>> That's not really the issue here, I think. The problem is that a >>> relatively minor aspect of the representation, namely the choice to store >>> a series of offsets rather than a series of lengths, produces >>> nonrepetitive data even when the original input is repetitive. >> It would certainly be worth validating that changing this would fix the >> problem. >> I don't know how invasive that would be - I suspect (without looking >> very closely) not terribly much. > I took a quick look and saw that this wouldn't be that easy to get around. > As I'd suspected upthread, there are places that do random access into a > JEntry array, such as the binary search in findJsonbValueFromContainer(). > If we have to add up all the preceding lengths to locate the corresponding > value part, we lose the performance advantages of binary search. AFAICS > that's applied directly to the on-disk representation. I'd thought > perhaps there was always a transformation step to build a pointer list, > but nope. > > It would be interesting to know what the performance hit would be if we calculated the offsets/pointers on the fly, especially if we could cache it somehow. The main benefit of binary search is in saving on comparisons, especially of strings, ISTM, and that could still be available - this would just be a bit of extra arithmetic. cheers andrew
> value-to-be-compressed. (first_success_by is 1024 in the default set of
> compression parameters.)

Curious idea: we could swap the JEntry array and the values: values at the beginning of the type will be caught by pg_lzcompress. But we will need to know the offset of the JEntry array, so the header will grow to 8 bytes (actually, it will be a varlena header!)

-- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> Curious idea: we could swap the JEntry array and the values: values at the
> beginning of the type will be caught by pg_lzcompress. But we will need to
> know the offset of the JEntry array, so the header will grow to 8 bytes
> (actually, it will be a varlena header!)

Maybe I wasn't clear: the jsonb type will start with the string collection instead of the JEntry array, and the JEntry array will be placed at the end of the object/array. So pg_lzcompress will find repeatable 4-byte pieces in the first 1024 bytes of the jsonb.

-- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Fri, Aug 8, 2014 at 11:02:26AM -0400, Tom Lane wrote: > 2. Are we going to ship 9.4 without fixing this? I definitely don't see > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing > jsonb is still within the bounds of reason. FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4 betas and report the problem JSONB columns. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 08/08/2014 08:02 AM, Tom Lane wrote:
> 2. Are we going to ship 9.4 without fixing this? I definitely don't see
> replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> jsonb is still within the bounds of reason.
>
> Considering all the hype that's built up around jsonb, shipping a design
> with a fundamental performance handicap doesn't seem like a good plan
> to me. We could perhaps band-aid around it by using different compression
> parameters for jsonb, although that would require some painful API changes
> since toast_compress_datum() doesn't know what datatype it's operating on.

I would rather ship late than ship a noncompressible JSONB.

Once we ship 9.4, many users are going to load 100's of GB into JSONB fields. Even if we fix the compressibility issue in 9.5, those users won't be able to fix the compression without rewriting all their data, which could be prohibitive. And we'll be in a position where we have to support the 9.4 JSONB format/compression technique for years so that users aren't blocked from upgrading.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Aug 8, 2014 at 9:14 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Curious idea: we could swap the JEntry array and the values: values at the
> beginning of the type will be caught by pg_lzcompress. But we will need to
> know the offset of the JEntry array, so the header will grow to 8 bytes
> (actually, it will be a varlena header!)

Maybe I wasn't clear: the jsonb type will start with the string collection instead of the JEntry array, and the JEntry array will be placed at the end of the object/array. So pg_lzcompress will find repeatable 4-byte pieces in the first 1024 bytes of the jsonb.
Another idea I have: storing an offset in each JEntry is not necessary to get the benefit of binary search. Namely, what if we store offsets in every 8th JEntry and lengths in the others? The speed of binary search will be about the same: the overhead is only calculating offsets within the 8-entry chunk. But the lengths will probably repeat.
------
With best regards,
Alexander Korotkov.
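A sketch of how lookups could work under that hybrid layout (hypothetical code, with an arbitrary stride of 8 to match the proposal; a real implementation would also need a flag bit to tell the two kinds of entry apart):

    #include <stdint.h>
    #include <stddef.h>

    #define STRIDE 8

    /*
     * Assumed layout: entries[k] holds an ending offset when k is a
     * multiple of STRIDE, and the element's length otherwise.
     */
    static uint32_t
    elem_end_offset(const uint32_t *entries, size_t i)
    {
        size_t      anchor = i - (i % STRIDE);  /* nearest entry storing an offset */
        uint32_t    pos = entries[anchor];      /* ending offset of element 'anchor' */

        for (size_t k = anchor + 1; k <= i; k++)
            pos += entries[k];                  /* the remaining entries are lengths */
        return pos;
    }

Locating any element then costs at most STRIDE - 1 additions on top of the array access, so binary search stays cheap, while the bulk of the entries become small, frequently repeated length values that pglz can actually compress.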
On Fri, Aug 8, 2014 at 7:35 PM, Andrew Dunstan <andrew@dunslane.net> wrote: >> I took a quick look and saw that this wouldn't be that easy to get around. >> As I'd suspected upthread, there are places that do random access into a >> JEntry array, such as the binary search in findJsonbValueFromContainer(). >> If we have to add up all the preceding lengths to locate the corresponding >> value part, we lose the performance advantages of binary search. AFAICS >> that's applied directly to the on-disk representation. I'd thought >> perhaps there was always a transformation step to build a pointer list, >> but nope. > > It would be interesting to know what the performance hit would be if we > calculated the offsets/pointers on the fly, especially if we could cache it > somehow. The main benefit of binary search is in saving on comparisons, > especially of strings, ISTM, and that could still be available - this would > just be a bit of extra arithmetic. I don't think binary search is the main problem here. Objects are usually reasonably sized, while arrays are more likely to be huge. To make matters worse, jsonb -> int goes from O(1) to O(n). Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On 08/08/2014 06:17 AM, Tom Lane wrote:
> I looked into the issue reported in bug #11109. The problem appears to be
> that jsonb's on-disk format is designed in such a way that the leading
> portion of any JSON array or object will be fairly incompressible, because
> it consists mostly of a strictly-increasing series of integer offsets.

How hard and how expensive would it be to teach pg_lzcompress to apply a delta filter on suitable data? So that instead of the integers, their deltas would be fed to the "real" compressor.

-- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
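The transform being suggested is simple and exactly invertible; a sketch (hypothetical helper functions, not an existing pg_lzcompress option) applied to the 32-bit words of the header region would look like this:

    #include <stdint.h>
    #include <stddef.h>

    /* Replace each word with its difference from the previous word. */
    static void
    delta_encode_u32(uint32_t *words, size_t n)
    {
        uint32_t    prev = 0;

        for (size_t i = 0; i < n; i++)
        {
            uint32_t    cur = words[i];

            words[i] = cur - prev;      /* unsigned wraparound is fine */
            prev = cur;
        }
    }

    /* Inverse transform: running sum restores the original words. */
    static void
    delta_decode_u32(uint32_t *words, size_t n)
    {
        uint32_t    prev = 0;

        for (size_t i = 0; i < n; i++)
        {
            prev += words[i];
            words[i] = prev;
        }
    }

For a fixed-stride offset array the encoded output degenerates into a run of identical words, which a generic LZ-style compressor handles easily; the catch, as the question implies, is knowing which parts of a datum count as "suitable data" for the filter.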
On Fri, Aug 8, 2014 at 12:41 PM, Ants Aasma <ants@cybertec.at> wrote: > I don't think binary search is the main problem here. Objects are > usually reasonably sized, while arrays are more likely to be huge. To > make matters worse, jsonb -> int goes from O(1) to O(n). I don't think it's true that arrays are more likely to be huge. That regression would be bad, but jsonb -> int is not the most compelling operator by far. The indexable operators (in particular, @>) don't support subscripting arrays like that, and with good reason. -- Peter Geoghegan
On Fri, Aug 8, 2014 at 12:06 PM, Josh Berkus <josh@agliodbs.com> wrote: > One we ship 9.4, many users are going to load 100's of GB into JSONB > fields. Even if we fix the compressability issue in 9.5, those users > won't be able to fix the compression without rewriting all their data, > which could be prohibitive. And we'll be in a position where we have > to support the 9.4 JSONB format/compression technique for years so that > users aren't blocked from upgrading. FWIW, if we take the delicious JSON data as representative, a table storing that data as jsonb is 1374 MB in size. Whereas an equivalent table with the data typed using the original json datatype (but with white space differences more or less ignored, because it was created using a jsonb -> json cast), the same data is 1352 MB. Larry's complaint is valid; this is a real problem, and I'd like to fix it before 9.4 is out. However, let us not lose sight of the fact that JSON data is usually a poor target for TOAST compression. With idiomatic usage, redundancy is very much more likely to appear across rows, and not within individual Datums. Frankly, we aren't doing a very good job there, and doing better requires an alternative strategy. -- Peter Geoghegan
I was not complaining; I think JSONB is awesome.
But I am one of those people who would like to put 100's of GB (or more) JSON files into Postgres and I am concerned about file size and possible future changes to the format.
On Fri, Aug 8, 2014 at 7:10 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Aug 8, 2014 at 12:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
> One we ship 9.4, many users are going to load 100's of GB into JSONB
> fields. Even if we fix the compressability issue in 9.5, those users
> won't be able to fix the compression without rewriting all their data,
> which could be prohibitive. And we'll be in a position where we have
> to support the 9.4 JSONB format/compression technique for years so that
> users aren't blocked from upgrading.

FWIW, if we take the delicious JSON data as representative, a table
storing that data as jsonb is 1374 MB in size. Whereas an equivalent
table with the data typed using the original json datatype (but with
white space differences more or less ignored, because it was created
using a jsonb -> json cast), the same data is 1352 MB.
Larry's complaint is valid; this is a real problem, and I'd like to
fix it before 9.4 is out. However, let us not lose sight of the fact
that JSON data is usually a poor target for TOAST compression. With
idiomatic usage, redundancy is very much more likely to appear across
rows, and not within individual Datums. Frankly, we aren't doing a
very good job there, and doing better requires an alternative
strategy.
--
Peter Geoghegan
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > >> I looked into the issue reported in bug #11109. The problem appears to be > >> that jsonb's on-disk format is designed in such a way that the leading > >> portion of any JSON array or object will be fairly incompressible, because > >> it consists mostly of a strictly-increasing series of integer offsets. > >> This interacts poorly with the code in pglz_compress() that gives up if > >> it's found nothing compressible in the first first_success_by bytes of a > >> value-to-be-compressed. (first_success_by is 1024 in the default set of > >> compression parameters.) > > > I haven't looked at this in any detail, so take this with a grain of > > salt, but what about teaching pglz_compress about using an offset > > farther into the data, if the incoming data is quite a bit larger than > > 1k? This is just a test to see if it's worthwhile to keep going, no? > > Well, the point of the existing approach is that it's a *nearly free* > test to see if it's worthwhile to keep going; there's just one if-test > added in the outer loop of the compression code. (cf commit ad434473ebd2, > which added that along with some other changes.) AFAICS, what we'd have > to do to do it as you suggest would to execute compression on some subset > of the data and then throw away that work entirely. I do not find that > attractive, especially when for most datatypes there's no particular > reason to look at one subset instead of another. Ah, I see- we were using the first block as it means we can reuse the work done on it if we decide to continue with the compression. Makes sense. We could possibly arrange to have the amount attempted depend on the data type, but you point out that we can't do that without teaching lower components about types, which is less than ideal. What about considering how large the object is when we are analyzing if it compresses well overall? That is- for a larger object, make a larger effort to compress it. There's clearly a pessimistic case which could arise from that, but it may be better than the current situation. There's a clear risk that such an algorithm may well be very type specific, meaning that we make things worse for some types (eg: bytea's which end up never compressing well we'd likely spend more CPU time trying than we do today). > 1. The real problem here is that jsonb is emitting quite a bit of > fundamentally-nonrepetitive data, even when the user-visible input is very > repetitive. That's a compression-unfriendly transformation by anyone's > measure. Assuming that some future replacement for pg_lzcompress() will > nonetheless be able to compress the data strikes me as mostly wishful > thinking. Besides, we'd more than likely have a similar early-exit rule > in any substitute implementation, so that we'd still be at risk even if > it usually worked. I agree that jsonb ends up being nonrepetitive in part, which is why I've been trying to push the discussion in the direction of making it more likely for the highly-compressible data to be considered rather than the start of the jsonb object. I don't care for our compression algorithm having to be catered to in this regard in general though as the exact same problem could, and likely does, exist in some real life bytea-using PG implementations. I disagree that another algorithm wouldn't be able to manage better on this data than pglz. 
pglz, from my experience, is notoriously bad at certain data sets which other algorithms are not as poorly impacted by.

> 2. Are we going to ship 9.4 without fixing this? I definitely don't see
> replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> jsonb is still within the bounds of reason.

I'd really hate to ship 9.4 without a fix for this, but I have a similarly hard time with shipping 9.4 without the binary search component.

> Considering all the hype that's built up around jsonb, shipping a design
> with a fundamental performance handicap doesn't seem like a good plan
> to me. We could perhaps band-aid around it by using different compression
> parameters for jsonb, although that would require some painful API changes
> since toast_compress_datum() doesn't know what datatype it's operating on.

I don't like the idea of shipping with this handicap either. Perhaps another option would be a new storage type which basically says "just compress it, no matter what"? We'd be able to make that the default for jsonb columns too, no?

Again- I'll admit this is shooting from the hip, but I wanted to get these thoughts out and I won't have much more time tonight.

Thanks!

Stephen
* Bruce Momjian (bruce@momjian.us) wrote: > On Fri, Aug 8, 2014 at 11:02:26AM -0400, Tom Lane wrote: > > 2. Are we going to ship 9.4 without fixing this? I definitely don't see > > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing > > jsonb is still within the bounds of reason. > > FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4 > betas and report the problem JSONB columns. That is *not* a good solution.. Thanks, Stephen
* Josh Berkus (josh@agliodbs.com) wrote:
> On 08/08/2014 08:02 AM, Tom Lane wrote:
> > 2. Are we going to ship 9.4 without fixing this? I definitely don't see
> > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> > jsonb is still within the bounds of reason.
> >
> > Considering all the hype that's built up around jsonb, shipping a design
> > with a fundamental performance handicap doesn't seem like a good plan
> > to me. We could perhaps band-aid around it by using different compression
> > parameters for jsonb, although that would require some painful API changes
> > since toast_compress_datum() doesn't know what datatype it's operating on.
>
> I would rather ship late than ship a noncompressible JSONB.
>
> Once we ship 9.4, many users are going to load 100's of GB into JSONB
> fields. Even if we fix the compressibility issue in 9.5, those users
> won't be able to fix the compression without rewriting all their data,
> which could be prohibitive. And we'll be in a position where we have
> to support the 9.4 JSONB format/compression technique for years so that
> users aren't blocked from upgrading.

Would you accept removing the binary-search capability from jsonb just to make it compressible? I certainly wouldn't. I'd hate to ship late also, but I'd be willing to support that if we can find a good solution to keep both compressibility and binary search (and provided it doesn't delay us many months...).

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes: > What about considering how large the object is when we are analyzing if > it compresses well overall? Hmm, yeah, that's a possibility: we could redefine the limit at which we bail out in terms of a fraction of the object size instead of a fixed limit. However, that risks expending a large amount of work before we bail, if we have a very large incompressible object --- which is not exactly an unlikely case. Consider for example JPEG images stored as bytea, which I believe I've heard of people doing. Another issue is that it's not real clear that that fixes the problem for any fractional size we'd want to use. In Larry's example of a jsonb value that fails to compress, the header size is 940 bytes out of about 12K, so we'd be needing to trial-compress about 10% of the object before we reach compressible data --- and I doubt his example is worst-case. >> 1. The real problem here is that jsonb is emitting quite a bit of >> fundamentally-nonrepetitive data, even when the user-visible input is very >> repetitive. That's a compression-unfriendly transformation by anyone's >> measure. > I disagree that another algorithm wouldn't be able to manage better on > this data than pglz. pglz, from my experience, is notoriously bad a > certain data sets which other algorithms are not as poorly impacted by. Well, I used to be considered a compression expert, and I'm going to disagree with you here. It's surely possible that other algorithms would be able to get some traction where pglz fails to get any, but that doesn't mean that presenting them with hard-to-compress data in the first place is a good design decision. There is no scenario in which data like this is going to be friendly to a general-purpose compression algorithm. It'd be necessary to have explicit knowledge that the data consists of an increasing series of four-byte integers to be able to do much with it. And then such an assumption would break down once you got past the header ... > Perhaps another options would be a new storage type which basically says > "just compress it, no matter what"? We'd be able to make that the > default for jsonb columns too, no? Meh. We could do that, but it would still require adding arguments to toast_compress_datum() that aren't there now. In any case, this is a band-aid solution; and as Josh notes, once we ship 9.4 we are going to be stuck with jsonb's on-disk representation pretty much forever. regards, tom lane
On 08/08/2014 08:45 PM, Tom Lane wrote: >> Perhaps another options would be a new storage type which basically says >> "just compress it, no matter what"? We'd be able to make that the >> default for jsonb columns too, no? > Meh. We could do that, but it would still require adding arguments to > toast_compress_datum() that aren't there now. In any case, this is a > band-aid solution; and as Josh notes, once we ship 9.4 we are going to > be stuck with jsonb's on-disk representation pretty much forever. > Yeah, and almost any other solution is likely to mean non-jsonb users potentially paying a penalty for fixing this for jsonb. So if we can adjust the jsonb layout to fix this problem I think we should do so. cheers andrew
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > What about considering how large the object is when we are analyzing if
> > it compresses well overall?
>
> Hmm, yeah, that's a possibility: we could redefine the limit at which
> we bail out in terms of a fraction of the object size instead of a fixed
> limit. However, that risks expending a large amount of work before we
> bail, if we have a very large incompressible object --- which is not
> exactly an unlikely case. Consider for example JPEG images stored as
> bytea, which I believe I've heard of people doing. Another issue is
> that it's not real clear that that fixes the problem for any fractional
> size we'd want to use. In Larry's example of a jsonb value that fails
> to compress, the header size is 940 bytes out of about 12K, so we'd be
> needing to trial-compress about 10% of the object before we reach
> compressible data --- and I doubt his example is worst-case.

Agreed- I tried to allude to that in my prior mail, there's clearly a concern that we'd make things worse in certain situations. Then again, at least for that case, we could recommend changing the storage type to EXTERNAL.

> >> 1. The real problem here is that jsonb is emitting quite a bit of
> >> fundamentally-nonrepetitive data, even when the user-visible input is very
> >> repetitive. That's a compression-unfriendly transformation by anyone's
> >> measure.
>
> > I disagree that another algorithm wouldn't be able to manage better on
> > this data than pglz. pglz, from my experience, is notoriously bad at
> > certain data sets which other algorithms are not as poorly impacted by.
>
> Well, I used to be considered a compression expert, and I'm going to
> disagree with you here. It's surely possible that other algorithms would
> be able to get some traction where pglz fails to get any, but that doesn't
> mean that presenting them with hard-to-compress data in the first place is
> a good design decision. There is no scenario in which data like this is
> going to be friendly to a general-purpose compression algorithm. It'd
> be necessary to have explicit knowledge that the data consists of an
> increasing series of four-byte integers to be able to do much with it.
> And then such an assumption would break down once you got past the
> header ...

I've wondered previously as to if we, perhaps, missed the boat pretty badly by not providing an explicitly versioned per-type compression capability, such that we wouldn't be stuck with one compression algorithm for all types, and would be able to version compression types in a way that would allow us to change them over time, provided the newer code always understands how to decode X-4 (or whatever) versions back.

I do agree that it'd be great to represent every type in a highly compressible way for the sake of the compression algorithm, but I've not seen any good suggestions for how to make that happen and I've got a hard time seeing how we could completely change the jsonb storage format, retain the capabilities it has today, make it highly compressible, and get 9.4 out this calendar year.

I expect we could trivially add padding into the jsonb header to make it compress better, for the sake of this particular check, but then we're going to always be compressing jsonb, even when the user data isn't actually terribly good for compression, spending a good bit of CPU time while we're at it.
> > Perhaps another options would be a new storage type which basically says
> > "just compress it, no matter what"? We'd be able to make that the
> > default for jsonb columns too, no?
>
> Meh. We could do that, but it would still require adding arguments to
> toast_compress_datum() that aren't there now. In any case, this is a
> band-aid solution; and as Josh notes, once we ship 9.4 we are going to
> be stuck with jsonb's on-disk representation pretty much forever.

I agree that we need to avoid changing jsonb's on-disk representation. Have I missed where a good suggestion has been made about how to do that which preserves the binary-search capabilities and doesn't make the code much more difficult? Trying to move the header to the end just for the sake of this doesn't strike me as a good solution as it'll make things quite a bit more complicated. Is there a way we could interleave the likely-compressible user data in with the header instead? I've not looked, but it seems like that's the only reasonable approach to address this issue in this manner. If that's simply done, then great, but it strikes me as unlikely to be..

I'll just throw out a bit of a counter-point to all this also though - we don't try to focus on making our on-disk representation of data, generally, very compressible, even though there are filesystems, such as ZFS, which might benefit from certain rearrangements of our on-disk formats (no, I don't have any specific recommendations in this vein, but I certainly don't see anyone else asking after it or asking for us to be concerned about it). Compression is great, and I'd hate to see us have a format that won't just work with it even though it might be beneficial in many cases, but I feel the fault here is with the compression algorithm and the decisions made as part of that operation, and not really with this particular data structure.

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes: > I agree that we need to avoid changing jsonb's on-disk representation. ... post-release, I assume you mean. > Have I missed where a good suggestion has been made about how to do that > which preserves the binary-search capabilities and doesn't make the code > much more difficult? We don't have one yet, but we've only been thinking about this for a few hours. > Trying to move the header to the end just for the > sake of this doesn't strike me as a good solution as it'll make things > quite a bit more complicated. Is there a way we could interleave the > likely-compressible user data in with the header instead? Yeah, I was wondering about that too, but I don't immediately see how to do it without some sort of preprocessing step when we read the object (which'd be morally equivalent to converting a series of lengths into a pointer array). Binary search isn't going to work if the items it's searching in aren't all the same size. Having said that, I am not sure that a preprocessing step is a deal-breaker. It'd be O(N), but with a pretty darn small constant factor, and for plausible sizes of objects I think the binary search might still dominate. Worth investigation perhaps. regards, tom lane
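To make the preprocessing idea above concrete, here is a minimal sketch of turning a series of per-element lengths back into a pointer array with one prefix-sum pass. The names and types are made up for illustration and are not from any posted patch; error handling is omitted.

/*
 * Illustrative sketch only: if JEntry headers stored lengths instead of
 * end offsets, an iterator could rebuild the offset array once, up front.
 */
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t JEntryLen;     /* assumed: per-element length */

static uint32_t *
build_offsets(const JEntryLen *lengths, int nelems)
{
    uint32_t   *offsets = malloc((nelems + 1) * sizeof(uint32_t));
    uint32_t    off = 0;

    for (int i = 0; i < nelems; i++)
    {
        offsets[i] = off;       /* start of element i */
        off += lengths[i];
    }
    offsets[nelems] = off;      /* total payload size */
    return offsets;             /* O(N) once; binary search stays O(log N) */
}

Whether that one O(N) pass actually matters relative to the subsequent binary searches is exactly the open question raised above.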
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > I agree that we need to avoid changing jsonb's on-disk representation. > > ... post-release, I assume you mean. Yes. > > Have I missed where a good suggestion has been made about how to do that > > which preserves the binary-search capabilities and doesn't make the code > > much more difficult? > > We don't have one yet, but we've only been thinking about this for a few > hours. Fair enough. > > Trying to move the header to the end just for the > > sake of this doesn't strike me as a good solution as it'll make things > > quite a bit more complicated. Is there a way we could interleave the > > likely-compressible user data in with the header instead? > > Yeah, I was wondering about that too, but I don't immediately see how to > do it without some sort of preprocessing step when we read the object > (which'd be morally equivalent to converting a series of lengths into a > pointer array). Binary search isn't going to work if the items it's > searching in aren't all the same size. > > Having said that, I am not sure that a preprocessing step is a > deal-breaker. It'd be O(N), but with a pretty darn small constant factor, > and for plausible sizes of objects I think the binary search might still > dominate. Worth investigation perhaps. For my part, I'm less concerned about a preprocessing step which happens when we store the data and more concerned about ensuring that we're able to extract data quickly. Perhaps that's simply because I'm used to writes being more expensive than reads, but I'm not alone in that regard either. I doubt I'll have time in the next couple of weeks to look into this and if we're going to want this change for 9.4, we really need someone working on it sooner than later. (to the crowd)- do we have any takers for this investigation? Thanks, Stephen
On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Stephen Frost <sfrost@snowman.net> writes:
> > What about considering how large the object is when we are analyzing if
> > it compresses well overall?
>
> Hmm, yeah, that's a possibility: we could redefine the limit at which
> we bail out in terms of a fraction of the object size instead of a fixed
> limit. However, that risks expending a large amount of work before we
> bail, if we have a very large incompressible object --- which is not
> exactly an unlikely case. Consider for example JPEG images stored as
> bytea, which I believe I've heard of people doing. Another issue is
> that it's not real clear that that fixes the problem for any fractional
> size we'd want to use. In Larry's example of a jsonb value that fails
> to compress, the header size is 940 bytes out of about 12K, so we'd be
> needing to trial-compress about 10% of the object before we reach
> compressible data --- and I doubt his example is worst-case.
>
> >> 1. The real problem here is that jsonb is emitting quite a bit of
> >> fundamentally-nonrepetitive data, even when the user-visible input is very
> >> repetitive. That's a compression-unfriendly transformation by anyone's
> >> measure.
>
> > I disagree that another algorithm wouldn't be able to manage better on
> > this data than pglz. pglz, from my experience, is notoriously bad a
> > certain data sets which other algorithms are not as poorly impacted by.
>
> Well, I used to be considered a compression expert, and I'm going to
> disagree with you here. It's surely possible that other algorithms would
> be able to get some traction where pglz fails to get any,
During my previous work in this area, I have seen that some algorithms
use skipping logic, which can be useful for incompressible data followed
by compressible data, or in general as well. One such technique could
be: if we don't find any match in the first 4 bytes, skip 4 bytes;
if we don't find a match in the next 8 bytes either, skip 8 bytes,
and keep doing the same until we find the first match, at which point
we go back to the beginning of the data. We could follow this logic
until we have actually examined a total of first_success_by bytes.
There can be caveats in this particular skipping scheme, but I just
wanted to mention the skipping idea in general, as a way to reduce the
number of situations where we bail out even though there is a lot of
compressible data.
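As an illustration of that skipping idea, here is one possible reading of the scheme as a sketch. This is not the actual pglz_compress() code; find_match_at() is a hypothetical stand-in for pglz's history-table lookup.

/*
 * Sketch of a skip-doubling bail-out heuristic, not the real pglz code.
 */
#include <stdbool.h>
#include <limits.h>

extern bool find_match_at(const char *data, int len, int pos);  /* assumed */

static bool
worth_compressing(const char *data, int len, int first_success_by)
{
    int     pos = 0;
    int     skip = 4;           /* start by skipping 4 bytes */
    int     examined = 0;

    while (pos < len && examined < first_success_by)
    {
        if (find_match_at(data, len, pos))
            return true;        /* compressible data found; the caller
                                 * would go back to the start and compress */
        pos += skip;
        examined += skip;
        if (skip < INT_MAX / 2)
            skip *= 2;          /* 4, 8, 16, ... as described above */
    }
    return false;               /* bail out, store uncompressed */
}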
On Fri, Aug 8, 2014 at 08:25:04PM -0400, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > On Fri, Aug 8, 2014 at 11:02:26AM -0400, Tom Lane wrote: > > > 2. Are we going to ship 9.4 without fixing this? I definitely don't see > > > replacing pg_lzcompress as being on the agenda for 9.4, whereas changing > > > jsonb is still within the bounds of reason. > > > > FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4 > > betas and report the problem JSONB columns. > > That is *not* a good solution.. If you change the JSONB binary format, and we can't read the old format, that is the only option. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
akapila wrote > On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane < > tgl@.pa > > wrote: >> >> Stephen Frost < > sfrost@ > > writes: >> > What about considering how large the object is when we are analyzing if >> > it compresses well overall? >> >> Hmm, yeah, that's a possibility: we could redefine the limit at which >> we bail out in terms of a fraction of the object size instead of a fixed >> limit. However, that risks expending a large amount of work before we >> bail, if we have a very large incompressible object --- which is not >> exactly an unlikely case. Consider for example JPEG images stored as >> bytea, which I believe I've heard of people doing. Another issue is >> that it's not real clear that that fixes the problem for any fractional >> size we'd want to use. In Larry's example of a jsonb value that fails >> to compress, the header size is 940 bytes out of about 12K, so we'd be >> needing to trial-compress about 10% of the object before we reach >> compressible data --- and I doubt his example is worst-case. >> >> >> 1. The real problem here is that jsonb is emitting quite a bit of >> >> fundamentally-nonrepetitive data, even when the user-visible input is > very >> >> repetitive. That's a compression-unfriendly transformation by >> anyone's >> >> measure. >> >> > I disagree that another algorithm wouldn't be able to manage better on >> > this data than pglz. pglz, from my experience, is notoriously bad a >> > certain data sets which other algorithms are not as poorly impacted by. >> >> Well, I used to be considered a compression expert, and I'm going to >> disagree with you here. It's surely possible that other algorithms would >> be able to get some traction where pglz fails to get any, > > During my previous work in this area, I had seen that some algorithms > use skipping logic which can be useful for incompressible data followed > by compressible data or in general as well. Random thought from the sideline... This particular data type has the novel (within PostgreSQL) design of both a (feature oriented - and sizeable) header and a payload. Is there some way to add that model into the storage system so that, at a higher level, separate attempts are made to compress each section and then the compressed (or not) results and written out adjacently and with a small header indicating the length of the stored header and other meta data like whether each part is compressed and even the type that data represents? For reading back into memory the header-payload generic type is populated and then the header and payload decompressed - as needed - then the two parts are fed into the appropriate type constructor that understands and accepts the two pieces. Just hoping to spark an idea here - I don't know enough about the internals to even guess how close I am to something feasible. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/jsonb-format-is-pessimal-for-toast-compression-tp5814162p5814299.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Bruce, * Bruce Momjian (bruce@momjian.us) wrote: > On Fri, Aug 8, 2014 at 08:25:04PM -0400, Stephen Frost wrote: > > * Bruce Momjian (bruce@momjian.us) wrote: > > > FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4 > > > betas and report the problem JSONB columns. > > > > That is *not* a good solution.. > > If you change the JSONB binary format, and we can't read the old format, > that is the only option. Apologies- I had thought you were suggesting this for a 9.4 -> 9.5 conversion, not for just 9.4beta to 9.4. Adding that to pg_upgrade to address folks upgrading from betas would certainly be fine. Thanks, Stephen
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Stephen Frost <sfrost@snowman.net> writes: >> Trying to move the header to the end just for the sake of this >> doesn't strike me as a good solution as it'll make things quite >> a bit more complicated. Why is that? How much harder would it be to add a single offset field to the front to point to the part we're shifting to the end? It is not all that unusual to put a directory at the end, like in the .zip file format. >> Is there a way we could interleave the likely-compressible user >> data in with the header instead? > > Yeah, I was wondering about that too, but I don't immediately see > how to do it without some sort of preprocessing step when we read > the object (which'd be morally equivalent to converting a series > of lengths into a pointer array). That sounds far more complex and fragile than just moving the indexes to the end. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
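A sketch of what the zip-style layout being suggested could look like, purely for discussion. The struct and field names are hypothetical; this is not the 9.4 jsonb format.

/*
 * Hypothetical on-disk layout for the "directory at the end" idea.
 */
#include <stdint.h>

typedef struct JsonbTrailingDirHeader
{
    uint32_t    vl_len_;        /* varlena header, as today */
    uint32_t    header_word;    /* element count + flags, as today */
    uint32_t    dir_offset;     /* NEW: byte offset of the JEntry directory,
                                 * which now lives after the user data */
    /* compressible key/value data follows immediately ...            */
    /* ... and the strictly-increasing JEntry offsets are stored at   */
    /* dir_offset, at the very end of the datum, zip-style            */
} JsonbTrailingDirHeader;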
Kevin Grittner <kgrittn@ymail.com> writes:
>> Stephen Frost <sfrost@snowman.net> writes:
>>> Trying to move the header to the end just for the sake of this
>>> doesn't strike me as a good solution as it'll make things quite
>>> a bit more complicated.

> Why is that? How much harder would it be to add a single offset
> field to the front to point to the part we're shifting to the end?
> It is not all that unusual to put a directory at the end, like in
> the .zip file format.

Yeah, I was wondering that too. Arguably, directory-at-the-end would be easier to work with for on-the-fly creation, not that we do any such thing at the moment. I think the main thing that's bugging Stephen is that doing that just to make pglz_compress happy seems like a kluge (and I have to agree).

Here's a possibly more concrete thing to think about: we may very well someday want to support JSONB object field or array element extraction without reading all blocks of a large toasted JSONB value, if the value is stored external without compression. We already went to the trouble of creating analogous logic for substring extraction from a long uncompressed text or bytea value, so I think this is a plausible future desire. With the current format you could imagine grabbing the first TOAST chunk, and then if you see the header is longer than that you can grab the remainder of the header without any wasted I/O, and for the array-subscripting case you'd now have enough info to fetch the element value from the body of the JSONB without any wasted I/O. With directory-at-the-end you'd have to read the first chunk just to get the directory pointer, and this would most likely not give you any of the directory proper; but at least you'd know exactly how big the directory is before you go to read it in. The former case is probably slightly better. However, if you're doing an object key lookup not an array element fetch, neither of these formats are really friendly at all, because each binary-search probe probably requires bringing in one or two toast chunks from the body of the JSONB value so you can look at the key text. I'm not sure if there's a way to redesign the format to make that less painful/expensive --- but certainly, having the key texts scattered through the JSONB value doesn't seem like a great thing from this standpoint.

regards, tom lane
On Sat, Aug 9, 2014 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kevin Grittner <kgrittn@ymail.com> writes:
>>> Stephen Frost <sfrost@snowman.net> writes:
>>>> Trying to move the header to the end just for the sake of this
>>>> doesn't strike me as a good solution as it'll make things quite
>>>> a bit more complicated.
>
>> Why is that? How much harder would it be to add a single offset
>> field to the front to point to the part we're shifting to the end?
>> It is not all that unusual to put a directory at the end, like in
>> the .zip file format.
>
> Yeah, I was wondering that too. Arguably, directory-at-the-end would
> be easier to work with for on-the-fly creation, not that we do any
> such thing at the moment. I think the main thing that's bugging Stephen
> is that doing that just to make pglz_compress happy seems like a kluge
> (and I have to agree).
>
> Here's a possibly more concrete thing to think about: we may very well
> someday want to support JSONB object field or array element extraction
> without reading all blocks of a large toasted JSONB value, if the value is
> stored external without compression. We already went to the trouble of
> creating analogous logic for substring extraction from a long uncompressed
> text or bytea value, so I think this is a plausible future desire. With
> the current format you could imagine grabbing the first TOAST chunk, and
> then if you see the header is longer than that you can grab the remainder
> of the header without any wasted I/O, and for the array-subscripting case
> you'd now have enough info to fetch the element value from the body of
> the JSONB without any wasted I/O. With directory-at-the-end you'd
> have to read the first chunk just to get the directory pointer, and this
> would most likely not give you any of the directory proper; but at least
> you'd know exactly how big the directory is before you go to read it in.
> The former case is probably slightly better. However, if you're doing an
> object key lookup not an array element fetch, neither of these formats are
> really friendly at all, because each binary-search probe probably requires
> bringing in one or two toast chunks from the body of the JSONB value so
> you can look at the key text. I'm not sure if there's a way to redesign
> the format to make that less painful/expensive --- but certainly, having
> the key texts scattered through the JSONB value doesn't seem like a great
> thing from this standpoint.

I think that's a good point.

On the general topic, I don't think it's reasonable to imagine that we're going to come up with a single heuristic that works well for every kind of input data. What pglz is doing - assuming that if the beginning of the data is incompressible then the rest probably is too - is fundamentally reasonable, notwithstanding the fact that it doesn't happen to work out well for JSONB. We might be able to tinker with that general strategy in some way that seems to fix this case and doesn't appear to break others, but there's some risk in that, and there's no obvious reason in my mind why PGLZ should be required to fly blind. So I think it would be a better idea to arrange some method by which JSONB (and perhaps other data types) can provide compression hints to pglz.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... I think it would be a better idea to arrange some method by > which JSONB (and perhaps other data types) can provide compression > hints to pglz. I agree with that as a long-term goal, but not sure if it's sane to push into 9.4. What we could conceivably do now is (a) add a datatype OID argument to toast_compress_datum, and (b) hard-wire the selection of a different compression-parameters struct if it's JSONBOID. The actual fix would then be to increase the first_success_by field of this alternate struct. I had been worrying about API-instability risks associated with (a), but on reflection it seems unlikely that any third-party code calls toast_compress_datum directly, and anyway it's not something we'd be back-patching to before 9.4. The main objection to (b) is that it wouldn't help for domains over jsonb (and no, I don't want to insert a getBaseType call there to fix that). A longer-term solution would be to make this some sort of type property that domains could inherit, like typstorage is already. (Somebody suggested dealing with this by adding more typstorage values, but I don't find that attractive; typstorage is known in too many places.) We'd need some thought about exactly what we want to expose, since the specific knobs that pglz_compress has today aren't necessarily good for the long term. This is all kinda ugly really, but since I'm not hearing brilliant ideas for redesigning jsonb's storage format, maybe this is the best we can do for now. regards, tom lane
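A rough sketch of (a) plus (b), assuming backend context. The two-argument toast_compress_datum() is the proposed change, not the current API, and the strategy field values are assumed to mirror the stock defaults apart from first_success_by.

/*
 * Sketch only: hard-wired per-type compression parameters for 9.4.
 */
#include "postgres.h"
#include "catalog/pg_type.h"            /* for JSONBOID */
#include "utils/pg_lzcompress.h"        /* PGLZ_Strategy, as of 9.4 */
#include <limits.h>

/* identical to the default strategy except that it never gives up early */
static const PGLZ_Strategy jsonb_pglz_strategy = {
    32,                         /* min_input_size (assumed default) */
    INT_MAX,                    /* max_input_size */
    25,                         /* min_comp_rate */
    INT_MAX,                    /* first_success_by: don't bail on the header */
    128,                        /* match_size_good */
    10                          /* match_size_drop */
};

static const PGLZ_Strategy *
strategy_for_type(Oid typid)
{
    if (typid == JSONBOID)
        return &jsonb_pglz_strategy;
    return PGLZ_strategy_default;
}

/*
 * toast_compress_datum(value) would become toast_compress_datum(value, typid)
 * and pass strategy_for_type(typid) down to pglz_compress().
 */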
On Mon, Aug 11, 2014 at 12:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I think that's a good point. I think that there may be something to be said for the current layout. Having adjacent keys and values could take better advantage of CPU cache characteristics. I've heard of approaches to improving B-Tree locality that forced keys and values to be adjacent on individual B-Tree pages [1], for example. I've heard of this more than once. And FWIW, I believe based on earlier research of user requirements in this area that very large jsonb datums are not considered all that compelling. Document database systems have considerable limitations here. > On the general topic, I don't think it's reasonable to imagine that > we're going to come up with a single heuristic that works well for > every kind of input data. What pglz is doing - assuming that if the > beginning of the data is incompressible then the rest probably is too > - is fundamentally reasonable, nonwithstanding the fact that it > doesn't happen to work out well for JSONB. We might be able to tinker > with that general strategy in some way that seems to fix this case and > doesn't appear to break others, but there's some risk in that, and > there's no obvious reason in my mind why PGLZ should be require to fly > blind. So I think it would be a better idea to arrange some method by > which JSONB (and perhaps other data types) can provide compression > hints to pglz. If there is to be any effort to make jsonb a more effective target for compression, I imagine that that would have to target redundancy between JSON documents. With idiomatic usage, we can expect plenty of it. [1] http://www.vldb.org/conf/1999/P7.pdf , "We also forced each key and child pointer to be adjacent to each other physically" -- Peter Geoghegan
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > ... I think it would be a better idea to arrange some method by
> > which JSONB (and perhaps other data types) can provide compression
> > hints to pglz.
>
> I agree with that as a long-term goal, but not sure if it's sane to
> push into 9.4.

Agreed.

> What we could conceivably do now is (a) add a datatype OID argument to
> toast_compress_datum, and (b) hard-wire the selection of a different
> compression-parameters struct if it's JSONBOID. The actual fix would
> then be to increase the first_success_by field of this alternate struct.

Isn't the offset-to-compressible-data variable though, depending on the number of keys, etc? Would we be increasing first_success_by based off of some function which inspects the object?

> I had been worrying about API-instability risks associated with (a),
> but on reflection it seems unlikely that any third-party code calls
> toast_compress_datum directly, and anyway it's not something we'd
> be back-patching to before 9.4.

Agreed.

> The main objection to (b) is that it wouldn't help for domains over jsonb
> (and no, I don't want to insert a getBaseType call there to fix that).

While not ideal, that seems like an acceptable compromise for 9.4 to me.

> A longer-term solution would be to make this some sort of type property
> that domains could inherit, like typstorage is already. (Somebody
> suggested dealing with this by adding more typstorage values, but
> I don't find that attractive; typstorage is known in too many places.)

Think that was me, and having it be something which domains can inherit makes sense. Would we be able to use this approach to introduce type (and domains inherited from that type) specific compression algorithms, perhaps? Or even get to a point where we could have a chunk-based compression scheme for certain types of objects (such as JSONB) where we keep track of which keys exist at which points in the compressed object, allowing us to skip to the specific chunk which contains the requested key, similar to what we do with uncompressed data?

> We'd need some thought about exactly what we want to expose, since
> the specific knobs that pglz_compress has today aren't necessarily
> good for the long term.

Agreed.

> This is all kinda ugly really, but since I'm not hearing brilliant
> ideas for redesigning jsonb's storage format, maybe this is the
> best we can do for now.

This would certainly be an improvement over what's going on now, and I love the idea of possibly being able to expand this in the future to do more. What I'd hate to see is having all of this and only ever using it to say "skip ahead another 1k for JSONB".

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> What we could conceivably do now is (a) add a datatype OID argument to >> toast_compress_datum, and (b) hard-wire the selection of a different >> compression-parameters struct if it's JSONBOID. The actual fix would >> then be to increase the first_success_by field of this alternate struct. > Isn't the offset-to-compressable-data variable though, depending on the > number of keys, etc? Would we be increasing first_success_by based off > of some function which inspects the object? Given that this is a short-term hack, I'd be satisfied with setting it to INT_MAX. If we got more ambitious, we could consider improving the cutoff logic so that it gives up at "x% of the object or n bytes, whichever comes first"; but I'd want to see some hard evidence that that was useful before adding any more cycles to pglz_compress. regards, tom lane
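The "x% of the object or n bytes, whichever comes first" test mentioned above could look something like the following hypothetical helper; it is not the actual pglz_compress() logic, and the parameter names are made up.

/*
 * Sketch: give up only after examining the smaller of a fixed byte budget
 * and a fraction of the whole input, if nothing compressible was found.
 */
#include <stdbool.h>

static inline bool
should_give_up(int bytes_examined, int matches_found,
               int source_len, int abort_fraction_pct, int abort_at_bytes)
{
    int     limit = source_len / 100 * abort_fraction_pct;

    if (limit > abort_at_bytes)
        limit = abort_at_bytes;         /* whichever comes first */
    return matches_found == 0 && bytes_examined >= limit;
}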
* Peter Geoghegan (pg@heroku.com) wrote: > If there is to be any effort to make jsonb a more effective target for > compression, I imagine that that would have to target redundancy > between JSON documents. With idiomatic usage, we can expect plenty of > it. While I certainly agree, that's a rather different animal to address and doesn't hold a lot of relevance to the current problem. Or, to put it another way, I don't think anyone is going to be surprised that two rows containing the same data (even if they're inserted in the same transaction and have the same visibility information) are compressed together in some fashion. We've got a clear example of someone, quite reasonably, expecting their JSONB object to be compressed using the normal TOAST mechanism, and we're failing to do that in cases where it's actually a win to do so. That's the focus of this discussion and what needs to be addressed before 9.4 goes out. Thanks, Stephen
On Mon, Aug 11, 2014 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote: > We've got a clear example of someone, quite reasonably, expecting their > JSONB object to be compressed using the normal TOAST mechanism, and > we're failing to do that in cases where it's actually a win to do so. > That's the focus of this discussion and what needs to be addressed > before 9.4 goes out. Sure. I'm not trying to minimize that. We should fix it, certainly. However, it does bear considering that JSON data, with each document stored in a row is not an effective target for TOAST compression in general, even as text. -- Peter Geoghegan
On Fri, Aug 8, 2014 at 10:50 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> How hard and how expensive would it be to teach pg_lzcompress to
> apply a delta filter on suitable data ?
>
> So that instead of integers their deltas will be fed to the "real"
> compressor

Has anyone given this more thought? I know this might not be 9.4 material, but to me it sounds like the most promising approach, if it's workable. This isn't a made up thing, the 7z and LZMA formats also have an optional delta filter. Of course with JSONB the problem is figuring out which parts to apply the delta filter to, and which parts not.

This would also help with integer arrays, containing for example foreign key values to a serial column. There's bound to be some redundancy, as nearby serial values are likely to end up close together. In one of my past projects we used to store large arrays of integer fkeys, deliberately sorted for duplicate elimination.

For an ideal case comparison, intar2 could be as large as intar1 when compressed with a 4-byte wide delta filter:

create table intar1 as select array(select 1::int from generate_series(1,1000000)) a;
create table intar2 as select array(select generate_series(1,1000000)::int) a;

In PostgreSQL 9.3 the sizes are:

select pg_column_size(a) from intar1;
 45810
select pg_column_size(a) from intar2;
 4000020

So a factor of 87 difference.

Regards,
Marti
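For reference, a minimal sketch of the kind of 4-byte-wide delta filter being discussed, applied before handing the buffer to the general-purpose compressor. This illustrates the 7z/LZMA-style filter; it is not a pg_lzcompress API.

/*
 * Encode/decode are exact inverses; sorted or serial values become small,
 * highly repetitive deltas that an LZ-family compressor handles well.
 */
#include <stdint.h>
#include <stddef.h>

static void
delta_encode_u32(uint32_t *v, size_t n)
{
    uint32_t    prev = 0;

    for (size_t i = 0; i < n; i++)
    {
        uint32_t    cur = v[i];

        v[i] = cur - prev;      /* store the difference to the previous value */
        prev = cur;
    }
}

static void
delta_decode_u32(uint32_t *v, size_t n)
{
    uint32_t    prev = 0;

    for (size_t i = 0; i < n; i++)
    {
        v[i] += prev;           /* running sum restores the original values */
        prev = v[i];
    }
}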
On Mon, Aug 11, 2014 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> ... I think it would be a better idea to arrange some method by >> which JSONB (and perhaps other data types) can provide compression >> hints to pglz. > > I agree with that as a long-term goal, but not sure if it's sane to > push into 9.4. > > What we could conceivably do now is (a) add a datatype OID argument to > toast_compress_datum, and (b) hard-wire the selection of a different > compression-parameters struct if it's JSONBOID. The actual fix would > then be to increase the first_success_by field of this alternate struct. I think it would be perfectly sane to do that for 9.4. It may not be perfect, but neither is what we have now. > A longer-term solution would be to make this some sort of type property > that domains could inherit, like typstorage is already. (Somebody > suggested dealing with this by adding more typstorage values, but > I don't find that attractive; typstorage is known in too many places.) > We'd need some thought about exactly what we want to expose, since > the specific knobs that pglz_compress has today aren't necessarily > good for the long term. What would really be ideal here is if the JSON code could inform the toast compression code "this many initial bytes are likely incompressible, just pass them through without trying, and then start compressing at byte N", where N is the byte following the TOC. But I don't know that there's a reasonable way to implement that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
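Robert's hint could be as small as a per-datatype callback reporting how many leading bytes to exempt from the trial compression. The following sketch is hypothetical; none of these names exist in PostgreSQL.

/*
 * Hypothetical "compression hint" a datatype could hand to the toaster:
 * skip the known-incompressible prefix (e.g. a jsonb offset TOC) and only
 * start counting against first_success_by after it.
 */
#include <stddef.h>

typedef struct ToastCompressionHint
{
    size_t      skip_prefix;    /* bytes at the front that are likely
                                 * incompressible; pass them through */
} ToastCompressionHint;

/* A jsonb-specific provider could fill in the hint from its header. */
static void
jsonb_compression_hint(size_t toc_bytes, ToastCompressionHint *hint)
{
    hint->skip_prefix = toc_bytes;      /* size of the JEntry array */
}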
On Mon, Aug 11, 2014 at 01:44:05PM -0700, Peter Geoghegan wrote: > On Mon, Aug 11, 2014 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote: > > We've got a clear example of someone, quite reasonably, expecting their > > JSONB object to be compressed using the normal TOAST mechanism, and > > we're failing to do that in cases where it's actually a win to do so. > > That's the focus of this discussion and what needs to be addressed > > before 9.4 goes out. > > Sure. I'm not trying to minimize that. We should fix it, certainly. > However, it does bear considering that JSON data, with each document > stored in a row is not an effective target for TOAST compression in > general, even as text. Seems we have two issues: 1) the header makes testing for compression likely to fail 2) use of pointers rather than offsets reduces compression potential I understand we are focusing on #1, but how much does compression reduce the storage size with and without #2? Seems we need to know that answer before deciding if it is worth reducing the ability to do fast lookups with #2. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Tue, Aug 12, 2014 at 8:00 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Aug 11, 2014 at 01:44:05PM -0700, Peter Geoghegan wrote:
>> On Mon, Aug 11, 2014 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> > We've got a clear example of someone, quite reasonably, expecting their
>> > JSONB object to be compressed using the normal TOAST mechanism, and
>> > we're failing to do that in cases where it's actually a win to do so.
>> > That's the focus of this discussion and what needs to be addressed
>> > before 9.4 goes out.
>>
>> Sure. I'm not trying to minimize that. We should fix it, certainly.
>> However, it does bear considering that JSON data, with each document
>> stored in a row is not an effective target for TOAST compression in
>> general, even as text.
>
> Seems we have two issues:
>
> 1) the header makes testing for compression likely to fail
> 2) use of pointers rather than offsets reduces compression potential

I do think the best solution for 2 is what's been proposed already, to do delta-coding of the pointers in chunks (ie, 1 pointer, 15 deltas, repeat). But it does make binary search quite more complex.

Alternatively, it could be somewhat compressed as follows:

Segment = 1 pointer head, 15 deltas
Pointer head = pointers[0]
delta[i] = pointers[i] - pointers[0] for i in 1..15
(delta to segment head, not previous value)

Now, you can have 4 types of segments. 8, 16, 32, 64 bits, which is the size of the deltas. You achieve between 8x and 1x compression, and even when 1x (no compression), you make it easier for pglz to find something compressible.

Accessing it is also simple, if you have a segment index (tough part here). Replace the 15 for something that makes such segment index very compact ;)
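A sketch of decoding under that segment scheme. The layout and names are illustrative only; the 64-bit delta case is omitted since jsonb offsets are 32-bit, and little-endian byte order is assumed for brevity.

/*
 * Every 16th offset is stored in full, the other 15 as deltas relative to
 * the segment head, using the narrowest width that fits.  Reconstructing
 * one offset touches only its own segment.
 */
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 16

typedef struct OffsetSegment
{
    uint32_t    head;           /* offsets[k * SEG_SIZE], stored in full */
    uint8_t     delta_width;    /* 1, 2 or 4 bytes per delta */
    /* followed by SEG_SIZE - 1 deltas of delta_width bytes each */
} OffsetSegment;

static uint32_t
segment_get_offset(const OffsetSegment *seg, const uint8_t *deltas, int i)
{
    int         slot = i % SEG_SIZE;
    uint32_t    d = 0;

    if (slot == 0)
        return seg->head;
    /* deltas[0] corresponds to element 1 of the segment */
    memcpy(&d, deltas + (size_t) (slot - 1) * seg->delta_width,
           seg->delta_width);
    return seg->head + d;
}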
Bruce Momjian <bruce@momjian.us> writes:
> Seems we have two issues:
> 1) the header makes testing for compression likely to fail
> 2) use of pointers rather than offsets reduces compression potential
> I understand we are focusing on #1, but how much does compression reduce
> the storage size with and without #2? Seems we need to know that answer
> before deciding if it is worth reducing the ability to do fast lookups
> with #2.

That's a fair question. I did a very very simple hack to replace the item offsets with item lengths -- turns out that that mostly requires removing some code that changes lengths to offsets ;-). I then loaded up Larry's example of a noncompressible JSON value, and compared pg_column_size() which is just about the right thing here since it reports datum size after compression. Remembering that the textual representation is 12353 bytes:

json:                     382 bytes
jsonb, using offsets:   12593 bytes
jsonb, using lengths:     406 bytes

So this confirms my suspicion that the choice of offsets not lengths is what's killing compressibility. If it used lengths, jsonb would be very nearly as compressible as the original text.

Hack attached in case anyone wants to collect more thorough statistics. We'd not actually want to do it like this because of the large expense of recomputing the offsets on-demand all the time. (It does pass the regression tests, for what that's worth.)

regards, tom lane

diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 04f35bf..2297504 100644
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
*************** convertJsonbArray(StringInfo buffer, JEn
*** 1378,1385 ****
      errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
             JENTRY_POSMASK)));
-     if (i > 0)
-         meta = (meta & ~JENTRY_POSMASK) | totallen;
      copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
      metaoffset += sizeof(JEntry);
  }
--- 1378,1383 ----
*************** convertJsonbObject(StringInfo buffer, JE
*** 1430,1437 ****
      errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
             JENTRY_POSMASK)));
-     if (i > 0)
-         meta = (meta & ~JENTRY_POSMASK) | totallen;
      copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
      metaoffset += sizeof(JEntry);
--- 1428,1433 ----
*************** convertJsonbObject(StringInfo buffer, JE
*** 1445,1451 ****
      errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
             JENTRY_POSMASK)));
-     meta = (meta & ~JENTRY_POSMASK) | totallen;
      copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
      metaoffset += sizeof(JEntry);
  }
--- 1441,1446 ----
*************** uniqueifyJsonbObject(JsonbValue *object)
*** 1592,1594 ****
--- 1587,1600 ----
      object->val.object.nPairs = res + 1 - object->val.object.pairs;
  }
  }
+
+ uint32
+ jsonb_get_offset(const JEntry *ja, int index)
+ {
+     uint32 off = 0;
+     int i;
+
+     for (i = 0; i < index; i++)
+         off += JBE_LEN(ja, i);
+     return off;
+ }
diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h
index 5f2594b..c9b18e1 100644
*** a/src/include/utils/jsonb.h
--- b/src/include/utils/jsonb.h
*************** typedef uint32 JEntry;
*** 153,162 ****
  * Macros for getting the offset and length of an element. Note multiple
  * evaluations and access to prior array element.
  */
! #define JBE_ENDPOS(je_) ((je_) & JENTRY_POSMASK)
! #define JBE_OFF(ja, i)  ((i) == 0 ? 0 : JBE_ENDPOS((ja)[i - 1]))
! #define JBE_LEN(ja, i)  ((i) == 0 ? JBE_ENDPOS((ja)[i]) \
!                          : JBE_ENDPOS((ja)[i]) - JBE_ENDPOS((ja)[i - 1]))
 /*
  * A jsonb array or object node, within a Jsonb Datum.
--- 153,163 ----
  * Macros for getting the offset and length of an element. Note multiple
  * evaluations and access to prior array element.
  */
! #define JBE_LENFLD(je_) ((je_) & JENTRY_POSMASK)
! #define JBE_OFF(ja, i)  jsonb_get_offset(ja, i)
! #define JBE_LEN(ja, i)  JBE_LENFLD((ja)[i])
!
! extern uint32 jsonb_get_offset(const JEntry *ja, int index);
 /*
  * A jsonb array or object node, within a Jsonb Datum.
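To spell out why the two encodings compress so differently, consider the JEntry words for an array of equal-sized elements. The values below are purely illustrative.

/*
 * End offsets form a strictly increasing series, so no 4-byte pattern ever
 * repeats within the first_success_by window ...
 */
#include <stdint.h>

static const uint32_t jentries_as_offsets[] = {
    0x040, 0x080, 0x0c0, 0x100, 0x140, 0x180, 0x1c0, 0x200 /* ... */
};

/*
 * ... whereas per-element lengths are nearly constant and trivially
 * compressible by any LZ-family algorithm.
 */
static const uint32_t jentries_as_lengths[] = {
    0x40, 0x40, 0x40, 0x40, 0x40, 0x40, 0x40, 0x40 /* ... */
};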
I wrote: > That's a fair question. I did a very very simple hack to replace the item > offsets with item lengths -- turns out that that mostly requires removing > some code that changes lengths to offsets ;-). I then loaded up Larry's > example of a noncompressible JSON value, and compared pg_column_size() > which is just about the right thing here since it reports datum size after > compression. Remembering that the textual representation is 12353 bytes: > json: 382 bytes > jsonb, using offsets: 12593 bytes > jsonb, using lengths: 406 bytes Oh, one more result: if I leave the representation alone, but change the compression parameters to set first_success_by to INT_MAX, this value takes up 1397 bytes. So that's better, but still more than a 3X penalty compared to using lengths. (Admittedly, this test value probably is an outlier compared to normal practice, since it's a hundred or so repetitions of the same two strings.) regards, tom lane
On 08/13/2014 09:01 PM, Tom Lane wrote: > I wrote: >> That's a fair question. I did a very very simple hack to replace the item >> offsets with item lengths -- turns out that that mostly requires removing >> some code that changes lengths to offsets ;-). I then loaded up Larry's >> example of a noncompressible JSON value, and compared pg_column_size() >> which is just about the right thing here since it reports datum size after >> compression. Remembering that the textual representation is 12353 bytes: >> json: 382 bytes >> jsonb, using offsets: 12593 bytes >> jsonb, using lengths: 406 bytes > Oh, one more result: if I leave the representation alone, but change > the compression parameters to set first_success_by to INT_MAX, this > value takes up 1397 bytes. So that's better, but still more than a > 3X penalty compared to using lengths. (Admittedly, this test value > probably is an outlier compared to normal practice, since it's a hundred > or so repetitions of the same two strings.) > > What does changing to lengths do to the speed of other operations? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > On 08/13/2014 09:01 PM, Tom Lane wrote: >>> That's a fair question. I did a very very simple hack to replace the item >>> offsets with item lengths -- turns out that that mostly requires removing >>> some code that changes lengths to offsets ;-). > What does changing to lengths do to the speed of other operations? This was explicitly *not* an attempt to measure the speed issue. To do a fair trial of that, you'd have to work a good bit harder, methinks. Examining each of N items would involve O(N^2) work with the patch as posted, but presumably you could get it down to less in most or all cases --- in particular, sequential traversal could be done with little added cost. But it'd take a lot more hacking. regards, tom lane
On 08/14/2014 04:01 AM, Tom Lane wrote: > I wrote: >> That's a fair question. I did a very very simple hack to replace the item >> offsets with item lengths -- turns out that that mostly requires removing >> some code that changes lengths to offsets ;-). I then loaded up Larry's >> example of a noncompressible JSON value, and compared pg_column_size() >> which is just about the right thing here since it reports datum size after >> compression. Remembering that the textual representation is 12353 bytes: > >> json: 382 bytes >> jsonb, using offsets: 12593 bytes >> jsonb, using lengths: 406 bytes > > Oh, one more result: if I leave the representation alone, but change > the compression parameters to set first_success_by to INT_MAX, this > value takes up 1397 bytes. So that's better, but still more than a > 3X penalty compared to using lengths. (Admittedly, this test value > probably is an outlier compared to normal practice, since it's a hundred > or so repetitions of the same two strings.) For comparison, here's a patch that implements the scheme that Alexander Korotkov suggested, where we store an offset every 8th element, and a length in the others. It compresses Larry's example to 525 bytes. Increasing the "stride" from 8 to 16 entries, it compresses to 461 bytes. A nice thing about this patch is that it's on-disk compatible with the current format, hence initdb is not required. (The current comments claim that the first element in an array always has the JENTRY_ISFIRST flags set; that is wrong, there is no such flag. I removed the flag in commit d9daff0e, but apparently failed to update the comment and the accompanying JBE_ISFIRST macro. Sorry about that, will fix. This patch uses the bit that used to be JENTRY_ISFIRST to mark entries that store a length instead of an end offset.). - Heikki
Attachment
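The lookup under such a stride scheme might look roughly like this sketch. The flag name and mask value are assumptions for illustration, not the patch's actual code.

/*
 * Entries whose flag bit is set store a length; the others store an end
 * offset, and at least every STRIDE-th entry is an offset.  Finding the
 * start of element i therefore walks back at most STRIDE-1 entries.
 */
#include <stdint.h>

#define STRIDE              8
#define JENTRY_POSMASK      0x0FFFFFFF      /* low bits: offset or length */
#define JENTRY_HAS_LEN      0x10000000      /* hypothetical flag bit */

typedef uint32_t JEntry;

static uint32_t
jbe_offset(const JEntry *ja, int i)
{
    uint32_t    off = 0;
    int         j = i;

    /* accumulate lengths until we hit an entry that stores an end offset */
    while (j > 0 && (ja[j - 1] & JENTRY_HAS_LEN))
    {
        off += ja[j - 1] & JENTRY_POSMASK;
        j--;
    }
    if (j > 0)
        off += ja[j - 1] & JENTRY_POSMASK;  /* end offset of element j-1 */
    return off;
}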
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > For comparison, here's a patch that implements the scheme that Alexander > Korotkov suggested, where we store an offset every 8th element, and a > length in the others. It compresses Larry's example to 525 bytes. > Increasing the "stride" from 8 to 16 entries, it compresses to 461 bytes. > A nice thing about this patch is that it's on-disk compatible with the > current format, hence initdb is not required. TBH, I think that's about the only nice thing about it :-(. It's conceptually a mess. And while I agree that this way avoids creating a big-O performance issue for large arrays/objects, I think the micro performance is probably going to be not so good. The existing code is based on the assumption that JBE_OFF() and JBE_LEN() are negligibly cheap; but with a solution like this, it's guaranteed that one or the other is going to be not-so-cheap. I think if we're going to do anything to the representation at all, we need to refactor the calling code; at least fixing the JsonbIterator logic so that it tracks the current data offset rather than expecting to able to compute it at no cost. The difficulty in arguing about this is that unless we have an agreed-on performance benchmark test, it's going to be a matter of unsupported opinions whether one solution is faster than another. Have we got anything that stresses key lookup and/or array indexing? regards, tom lane
On Wed, Aug 13, 2014 at 09:01:43PM -0400, Tom Lane wrote: > I wrote: > > That's a fair question. I did a very very simple hack to replace the item > > offsets with item lengths -- turns out that that mostly requires removing > > some code that changes lengths to offsets ;-). I then loaded up Larry's > > example of a noncompressible JSON value, and compared pg_column_size() > > which is just about the right thing here since it reports datum size after > > compression. Remembering that the textual representation is 12353 bytes: > > > json: 382 bytes > > jsonb, using offsets: 12593 bytes > > jsonb, using lengths: 406 bytes > > Oh, one more result: if I leave the representation alone, but change > the compression parameters to set first_success_by to INT_MAX, this > value takes up 1397 bytes. So that's better, but still more than a > 3X penalty compared to using lengths. (Admittedly, this test value > probably is an outlier compared to normal practice, since it's a hundred > or so repetitions of the same two strings.) Uh, can we get compression for actual documents, rather than duplicate strings? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian <bruce@momjian.us> writes: > Uh, can we get compression for actual documents, rather than duplicate > strings? [ shrug... ] What's your proposed set of "actual documents"? I don't think we have any corpus of JSON docs that are all large enough to need compression. This gets back to the problem of what test case are we going to consider while debating what solution to adopt. regards, tom lane
On Thu, Aug 14, 2014 at 12:22:46PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Uh, can we get compression for actual documents, rather than duplicate
> > strings?
>
> [ shrug... ] What's your proposed set of "actual documents"?
> I don't think we have any corpus of JSON docs that are all large
> enough to need compression.
>
> This gets back to the problem of what test case are we going to consider
> while debating what solution to adopt.

Uh, we just need one 12k JSON document from somewhere. Clearly this is something we can easily get.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +
On Thu, Aug 14, 2014 at 11:52 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Thu, Aug 14, 2014 at 12:22:46PM -0400, Tom Lane wrote: >> Bruce Momjian <bruce@momjian.us> writes: >> > Uh, can we get compression for actual documents, rather than duplicate >> > strings? >> >> [ shrug... ] What's your proposed set of "actual documents"? >> I don't think we have any corpus of JSON docs that are all large >> enough to need compression. >> >> This gets back to the problem of what test case are we going to consider >> while debating what solution to adopt. > > Uh, we just one need one 12k JSON document from somewhere. Clearly this > is something we can easily get. it's trivial to make a large json[b] document: select length(to_json(array(select row(a.*) from pg_attribute a))::TEXT); select
Bruce Momjian <bruce@momjian.us> writes:
> On Thu, Aug 14, 2014 at 12:22:46PM -0400, Tom Lane wrote:
>> This gets back to the problem of what test case are we going to consider
>> while debating what solution to adopt.

> Uh, we just one need one 12k JSON document from somewhere. Clearly this
> is something we can easily get.

I would put little faith in a single document as being representative.

To try to get some statistics about a real-world case, I looked at the delicio.us dataset that someone posted awhile back (1252973 JSON docs). These have a minimum length (in text representation) of 604 bytes and a maximum length of 5949 bytes, which means that they aren't going to tell us all that much about large JSON docs, but this is better than no data at all.

Since documents of only a couple hundred bytes aren't going to be subject to compression, I made a table of four columns each containing the same JSON data, so that each row would be long enough to force the toast logic to try to do something. (Note that none of these documents are anywhere near big enough to hit the refuses-to-compress problem.) Given that, I get the following statistics for pg_column_size():

                                min     max     avg
JSON (text) representation      382     1155    526.5
HEAD's JSONB representation     493     1485    695.1
all-lengths representation      440     1257    615.3

So IOW, on this dataset the existing JSONB representation creates about 32% bloat compared to just storing the (compressed) user-visible text, and switching to all-lengths would about halve that penalty.

Maybe this is telling us it's not worth changing the representation, and we should just go do something about the first_success_by threshold and be done. I'm hesitant to draw such conclusions on the basis of a single use-case though, especially one that doesn't really have that much use for compression in the first place. Do we have other JSON corpuses to look at?

regards, tom lane
On Thu, Aug 14, 2014 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe this is telling us it's not worth changing the representation, > and we should just go do something about the first_success_by threshold > and be done. I'm hesitant to draw such conclusions on the basis of a > single use-case though, especially one that doesn't really have that > much use for compression in the first place. Do we have other JSON > corpuses to look at? Yes. Pavel posted some representative JSON data a while back: http://pgsql.cz/data/data.dump.gz (it's a plain dump) -- Peter Geoghegan
On Thu, Aug 14, 2014 at 01:57:14PM -0400, Tom Lane wrote: > Maybe this is telling us it's not worth changing the representation, > and we should just go do something about the first_success_by threshold > and be done. I'm hesitant to draw such conclusions on the basis of a > single use-case though, especially one that doesn't really have that > much use for compression in the first place. Do we have other JSON > corpuses to look at? Yes, that is what I was expecting --- once the whitespace and syntax sugar is gone in JSONB, I was unclear how much compression would help. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 08/14/2014 11:13 AM, Bruce Momjian wrote: > On Thu, Aug 14, 2014 at 01:57:14PM -0400, Tom Lane wrote: >> Maybe this is telling us it's not worth changing the representation, >> and we should just go do something about the first_success_by threshold >> and be done. I'm hesitant to draw such conclusions on the basis of a >> single use-case though, especially one that doesn't really have that >> much use for compression in the first place. Do we have other JSON >> corpuses to look at? > > Yes, that is what I was expecting --- once the whitespace and syntax > sugar is gone in JSONB, I was unclear how much compression would help. I thought the destruction case was when we have enough top-level keys that the offsets are more than 1K total, though, yes? So we need to test that set ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
I did quick test on the same bookmarks to test performance of 9.4beta2 and 9.4beta2+patch
The query was the same we used in pgcon presentation:
SELECT count(*) FROM jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb;

                 table size | time (ms)
9.4beta2:           1374 Mb | 1160
9.4beta2+patch:     1373 Mb | 1213

Yes, performance degrades, but not much. There is also small win in table size, but bookmarks are not big, so it's difficult to say about compression.
Oleg
On Thu, Aug 14, 2014 at 9:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bruce Momjian <bruce@momjian.us> writes:
> On Thu, Aug 14, 2014 at 12:22:46PM -0400, Tom Lane wrote:
>> This gets back to the problem of what test case are we going to consider
>> while debating what solution to adopt.
> Uh, we just one need one 12k JSON document from somewhere. Clearly this
> is something we can easily get.

I would put little faith in a single document as being representative.
To try to get some statistics about a real-world case, I looked at the
delicio.us dataset that someone posted awhile back (1252973 JSON docs).
These have a minimum length (in text representation) of 604 bytes and
a maximum length of 5949 bytes, which means that they aren't going to
tell us all that much about large JSON docs, but this is better than
no data at all.
Since documents of only a couple hundred bytes aren't going to be subject
to compression, I made a table of four columns each containing the same
JSON data, so that each row would be long enough to force the toast logic
to try to do something. (Note that none of these documents are anywhere
near big enough to hit the refuses-to-compress problem.) Given that,
I get the following statistics for pg_column_size():
min max avg
JSON (text) representation 382 1155 526.5
HEAD's JSONB representation 493 1485 695.1
all-lengths representation 440 1257 615.3
So IOW, on this dataset the existing JSONB representation creates about
32% bloat compared to just storing the (compressed) user-visible text,
and switching to all-lengths would about halve that penalty.
Maybe this is telling us it's not worth changing the representation,
and we should just go do something about the first_success_by threshold
and be done. I'm hesitant to draw such conclusions on the basis of a
single use-case though, especially one that doesn't really have that
much use for compression in the first place. Do we have other JSON
corpuses to look at?
regards, tom lane
I attached a json file of approximately 513K. It contains two repetitions of a single json structure. The values are quasi-random. It might make a decent test case of meaningfully sized data.
best
On Thu, Aug 14, 2014 at 2:25 PM, Oleg Bartunov <obartunov@gmail.com> wrote:
I did quick test on the same bookmarks to test performance of 9.4beta2 and 9.4beta2+patch
The query was the same we used in pgcon presentation:
SELECT count(*) FROM jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb;

                 table size | time (ms)
9.4beta2:           1374 Mb | 1160
9.4beta2+patch:     1373 Mb | 1213

Yes, performance degrades, but not much. There is also small win in table size, but bookmarks are not big, so it's difficult to say about compression.

Oleg

On Thu, Aug 14, 2014 at 9:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bruce Momjian <bruce@momjian.us> writes:
> On Thu, Aug 14, 2014 at 12:22:46PM -0400, Tom Lane wrote:
>> This gets back to the problem of what test case are we going to consider
>> while debating what solution to adopt.
> Uh, we just one need one 12k JSON document from somewhere. Clearly this
> is something we can easily get.

I would put little faith in a single document as being representative.
On Thu, Aug 14, 2014 at 3:49 PM, Larry White <ljw1001@gmail.com> wrote:
> I attached a json file of approximately 513K. It contains two repetitions of
> a single json structure. The values are quasi-random. It might make a decent
> test case of meaningfully sized data.

I have a 59M (in plain SQL; 10M compressed, 51M on-disk table size) collection of real-world JSON data.

This data is mostly counters and ancillary info stored in json for the flexibility more than anything else, since it's otherwise quite structured: most values share a lot with each other (in key names), but there's not much redundancy within single rows.

Value length stats (in text format):

min: 14
avg: 427
max: 23239

If anyone's interested, contact me personally (I have to anonymize the info a bit first, since it's production data, and it's too big to attach on the ML).
On Thu, Aug 14, 2014 at 4:24 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Aug 14, 2014 at 3:49 PM, Larry White <ljw1001@gmail.com> wrote:
>> I attached a json file of approximately 513K. It contains two repetitions of
>> a single json structure. The values are quasi-random. It might make a decent
>> test case of meaningfully sized data.
>
> I have a 59M (in plain SQL; 10M compressed, 51M on-disk table size)
> collection of real-world JSON data.
>
> This data is mostly counters and ancillary info stored in json for the
> flexibility more than anything else, since it's otherwise quite
> structured: most values share a lot with each other (in key names),
> but there's not much redundancy within single rows.
>
> Value length stats (in text format):
>
> min: 14
> avg: 427
> max: 23239
>
> If anyone's interested, contact me personally (I have to anonymize the
> info a bit first, since it's production data, and it's too big to
> attach on the ML).

Oh, that one has a 13k toast, not very interesting. But I've got another (very similar) 47M table, 40M toast, with length distribution:

min: 19
avg: 474
max: 20370

Not sure why it's got a bigger toast given a similar distribution. Tells you just how meaningless min/avg/max stats are :(
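[As an aside, distribution numbers like these can be pulled with a simple query; the table and column names below are placeholders rather than the actual schema being described.]

-- Hypothetical: length of the text form of each stored JSON document.
SELECT min(length(doc::text)) AS min,
       round(avg(length(doc::text))) AS avg,
       max(length(doc::text)) AS max
FROM json_docs;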
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Aug 14, 2014 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe this is telling us it's not worth changing the representation,
>> and we should just go do something about the first_success_by threshold
>> and be done. I'm hesitant to draw such conclusions on the basis of a
>> single use-case though, especially one that doesn't really have that
>> much use for compression in the first place. Do we have other JSON
>> corpuses to look at?

> Yes. Pavel posted some representative JSON data a while back:
> http://pgsql.cz/data/data.dump.gz (it's a plain dump)

I did some quick stats on that. 206560 rows

                                        min    max       avg
external text representation            220    172685    880.3
JSON representation (compressed text)   224    78565     541.3
pg_column_size, JSONB HEAD repr.        225    82540     639.0
pg_column_size, all-lengths repr.       225    66794     531.1

So in this data, there definitely is some scope for compression: just compressing the text gets about 38% savings. The all-lengths hack is able to beat that slightly, but the all-offsets format is well behind at 27%.

Not sure what to conclude. It looks from both these examples like we're talking about a 10 to 20 percent size penalty for JSON objects that are big enough to need compression. Is that beyond our threshold of pain? I'm not sure, but there is definitely room to argue that the extra I/O costs will swamp any savings we get from faster access to individual fields or array elements.

regards, tom lane
On 15/08/14 09:47, Tom Lane wrote:
> I did some quick stats on that. 206560 rows
>
>                                         min    max       avg
> external text representation            220    172685    880.3
> JSON representation (compressed text)   224    78565     541.3
> pg_column_size, JSONB HEAD repr.        225    82540     639.0
> pg_column_size, all-lengths repr.       225    66794     531.1
>
> So in this data, there definitely is some scope for compression:
> just compressing the text gets about 38% savings. The all-lengths
> hack is able to beat that slightly, but the all-offsets format is
> well behind at 27%.
>
> Not sure what to conclude. It looks from both these examples like
> we're talking about a 10 to 20 percent size penalty for JSON objects
> that are big enough to need compression. Is that beyond our threshold
> of pain? I'm not sure, but there is definitely room to argue that the
> extra I/O costs will swamp any savings we get from faster access to
> individual fields or array elements.
>
> regards, tom lane

Curious, would adding the standard deviation help in characterising the distribution of data values?

Also you might like to consider additionally using the median value, and possibly the 25% & 75% (or some such) values. I assume the 'avg' in your table refers to the arithmetic mean. Sometimes the median is a better measure of 'normal' than the arithmetic mean, and it can be useful to note the difference between the two!

Graphing the values may also be useful. You might have 2, or more, distinct populations, which might show up as several distinct peaks - in which case, this might suggest changes to the algorithm.

Many moons ago, I did a 400 level statistics course at University, of which I've forgotten most. However, I'm aware of other potentially useful measures, but I suspect that they would be too esoteric for the current problem!

Cheers,
Gavin
So, here's a destruction test case:

200,000 JSON values (plus 2 key columns)
Average width 4K (+/- 1K)
183 keys per JSON value
keys 10 to 30 characters
105 float values
70 integer values
8 text and date values
no nesting

The "jsonic" table is JSON
The "jsonbish" table is JSONB

(I can't share this data set, but it makes a good test case.)

And, we see the effect:

postgres=# select pg_size_pretty(pg_total_relation_size('jsonic'));
 pg_size_pretty
----------------
 394 MB
(1 row)

postgres=# select pg_size_pretty(pg_total_relation_size('jsonbish'));
 pg_size_pretty
----------------
 1147 MB
(1 row)

So, pretty bad; JSONB is 200% larger than JSON.

I don't think having 183 top-level keys is all that unreasonable of a use case. Some folks will be migrating from Mongo, Redis or Couch to PostgreSQL, and might have a whole denormalized schema in JSON.

BTW, I find this peculiar:

postgres=# select pg_size_pretty(pg_relation_size('jsonic'));
 pg_size_pretty
----------------
 383 MB
(1 row)

postgres=# select pg_size_pretty(pg_relation_size('jsonbish'));
 pg_size_pretty
----------------
 11 MB
(1 row)

Next up: Tom's patch and indexing!

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
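[A flat, many-keyed data set of roughly this shape can be approximated with generate_series; the generator below is only a sketch: it produces float values for every key, rather than the mix of floats, integers, text and dates described above, and the table/column names are invented to match the queries that appear later in the thread.]

-- Hypothetical: 200,000 rows, each a flat object with 183 numeric keys.
CREATE TABLE jsonbish AS
SELECT id,
       ('{' || string_agg(format('"key_%s": %s', k,
                                 round((random() * 1000)::numeric, 2)), ', ')
            || '}')::jsonb AS row_to_json
FROM generate_series(1, 200000) AS id
CROSS JOIN generate_series(1, 183) AS k
GROUP BY id;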
Josh Berkus <josh@agliodbs.com> writes:
> So, here's a destruction test case:
> 200,000 JSON values (plus 2 key columns)
> Average width 4K (+/- 1K)
> 183 keys per JSON value

Is that 183 keys exactly each time, or is 183 the average? If so, what's the min/max number of keys? I ask because 183 would be below the threshold where I'd expect the no-compression behavior to kick in.

> And, we see the effect:

> postgres=# select pg_size_pretty(pg_total_relation_size('jsonic'));
>  pg_size_pretty
> ----------------
>  394 MB
> (1 row)

> postgres=# select pg_size_pretty(pg_total_relation_size('jsonbish'));
>  pg_size_pretty
> ----------------
>  1147 MB
> (1 row)

> So, pretty bad; JSONB is 200% larger than JSON.

Ouch. But it's not clear how much of this is from the first_success_by threshold and how much is from having poor compression even though we escaped that trap.

> BTW, I find this peculiar:

> postgres=# select pg_size_pretty(pg_relation_size('jsonic'));
>  pg_size_pretty
> ----------------
>  383 MB
> (1 row)

> postgres=# select pg_size_pretty(pg_relation_size('jsonbish'));
>  pg_size_pretty
> ----------------
>  11 MB
> (1 row)

pg_relation_size is just the main data fork; it excludes TOAST. So what we can conclude is that most of the data got toasted out-of-line in jsonb, while very little did in json. That probably just comes from the average datum size being close to the push-out-of-line threshold, so that worse compression puts it over the edge.

It would be useful to see min/max/avg of pg_column_size() in both these cases.

regards, tom lane
On 08/14/2014 04:02 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> So, here's a destruction test case:
>> 200,000 JSON values (plus 2 key columns)
>> Average width 4K (+/- 1K)
>> 183 keys per JSON value
>
> Is that 183 keys exactly each time, or is 183 the average?

Each time exactly.

> It would be useful to see min/max/avg of pg_column_size() in both
> these cases.

Well, this is 9.4, so I can do better than that. How about quartiles?

 thetype |    colsize_distribution
---------+----------------------------
 json    | {1777,1803,1890,1940,4424}
 jsonb   | {5902,5926,5978,6002,6208}

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
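[For the record, quartile arrays like these can be produced with 9.4's ordered-set aggregates; something along these lines would do it, though this query is a guess at the method, not the one actually used.]

-- Hypothetical: {min, 25%, median, 75%, max} of the stored column size.
SELECT 'jsonb' AS thetype,
       percentile_disc(ARRAY[0, 0.25, 0.5, 0.75, 1]::float8[])
         WITHIN GROUP (ORDER BY pg_column_size(row_to_json)) AS colsize_distribution
FROM jsonbish;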
Josh Berkus <josh@agliodbs.com> writes:
> On 08/14/2014 04:02 PM, Tom Lane wrote:
>> It would be useful to see min/max/avg of pg_column_size() in both
>> these cases.

> Well, this is 9.4, so I can do better than that. How about quartiles?

>  thetype |    colsize_distribution
> ---------+----------------------------
>  json    | {1777,1803,1890,1940,4424}
>  jsonb   | {5902,5926,5978,6002,6208}

OK. That matches with the observation about being mostly toasted or not --- the threshold for pushing out-of-line would be something a little under 2KB depending on the other columns you had in the table.

What's more, it looks like the jsonb data is pretty much never getting compressed --- the min is too high for that. So I'm guessing that this example is mostly about the first_success_by threshold preventing any compression from happening. Please, before looking at my other patch, try this: in src/backend/utils/adt/pg_lzcompress.c, change line 221 thusly:

-		1024,		/* Give up if no compression in the first 1KB */
+		INT_MAX,	/* Give up if no compression in the first 1KB */

then reload the jsonb data and give us the same stats on that.

regards, tom lane
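[A rough way to see whether values are being compressed at all, independent of the relation-level numbers, is to compare the stored column size against the length of the text form. This is only an approximation, since the jsonb binary form is not the same size as its text rendering, and the names are again placeholders.]

-- Hypothetical: stored (possibly compressed/toasted) size vs. text length.
SELECT round(avg(pg_column_size(row_to_json))) AS avg_stored_bytes,
       round(avg(octet_length(row_to_json::text))) AS avg_text_bytes
FROM jsonbish;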
> What's more, it looks like the jsonb data is pretty much never getting
> compressed --- the min is too high for that. So I'm guessing that this
> example is mostly about the first_success_by threshold preventing any
> compression from happening. Please, before looking at my other patch,
> try this: in src/backend/utils/adt/pg_lzcompress.c, change line 221
> thusly:
>
> -		1024,		/* Give up if no compression in the first 1KB */
> +		INT_MAX,	/* Give up if no compression in the first 1KB */
>
> then reload the jsonb data and give us the same stats on that.

That helped things, but not as much as you'd think:

postgres=# select pg_size_pretty(pg_total_relation_size('jsonic'));
 pg_size_pretty
----------------
 394 MB
(1 row)

postgres=# select pg_size_pretty(pg_total_relation_size('jsonbish'));
 pg_size_pretty
----------------
 801 MB
(1 row)

What I find really strange is that the column size distribution is exactly the same:

 thetype |    colsize_distribution
---------+----------------------------
 json    | {1777,1803,1890,1940,4424}
 jsonb   | {5902,5926,5978,6002,6208}

Shouldn't the lower end stuff be smaller?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 08/14/2014 04:47 PM, Josh Berkus wrote:
>  thetype |    colsize_distribution
> ---------+----------------------------
>  json    | {1777,1803,1890,1940,4424}
>  jsonb   | {5902,5926,5978,6002,6208}

Just realized my query was counting the whole row size instead of just the column size. Here's just the JSON column:

Before changing to INT_MAX:

 thetype |    colsize_distribution
---------+----------------------------
 json    | {1741,1767,1854,1904,2292}
 jsonb   | {3551,5866,5910,5958,6168}

After:

 thetype |    colsize_distribution
---------+----------------------------
 json    | {1741,1767,1854,1904,2292}
 jsonb   | {3515,3543,3636,3690,4038}

So that did improve things, just not as much as we'd like.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
> Before changing to INT_MAX:
>
>  thetype |    colsize_distribution
> ---------+----------------------------
>  json    | {1741,1767,1854,1904,2292}
>  jsonb   | {3551,5866,5910,5958,6168}
>
> After:
>
>  thetype |    colsize_distribution
> ---------+----------------------------
>  json    | {1741,1767,1854,1904,2292}
>  jsonb   | {3515,3543,3636,3690,4038}
>
> So that did improve things, just not as much as we'd like.

And with Tom's test patch:

postgres=# select pg_size_pretty(pg_total_relation_size('jsonic'));
 pg_size_pretty
----------------
 394 MB
(1 row)

postgres=# select pg_size_pretty(pg_total_relation_size('jsonbish'));
 pg_size_pretty
----------------
 541 MB
(1 row)

 thetype |    colsize_distribution
---------+----------------------------
 json    | {1741,1767,1854,1904,2292}
 jsonb   | {2037,2114,2288,2348,2746}

Since that improved things a *lot*, just +40% instead of +200%, I thought I'd test some select queries. I decided to test a GIN lookup and value extraction, since indexed lookup is really what I care about.

9.4b2 no patches:

postgres=# explain analyze select row_to_json -> 'kt1_total_sum' from jsonbish where row_to_json @> '{ "rpt_per_dt" : "2003-06-30" }';
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on jsonbish  (cost=29.55..582.92 rows=200 width=18) (actual time=20.814..2845.454 rows=100423 loops=1)
   Recheck Cond: (row_to_json @> '{"rpt_per_dt": "2003-06-30"}'::jsonb)
   Heap Blocks: exact=1471
   ->  Bitmap Index Scan on jsonbish_row_to_json_idx  (cost=0.00..29.50 rows=200 width=0) (actual time=20.551..20.551 rows=100423 loops=1)
         Index Cond: (row_to_json @> '{"rpt_per_dt": "2003-06-30"}'::jsonb)
 Planning time: 0.102 ms
 Execution time: 2856.179 ms

9.4b2 TL patch:

postgres=# explain analyze select row_to_json -> 'kt1_total_sum' from jsonbish where row_to_json @> '{ "rpt_per_dt" : "2003-06-30" }';
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on jsonbish  (cost=29.55..582.92 rows=200 width=18) (actual time=24.071..5201.687 rows=100423 loops=1)
   Recheck Cond: (row_to_json @> '{"rpt_per_dt": "2003-06-30"}'::jsonb)
   Heap Blocks: exact=1471
   ->  Bitmap Index Scan on jsonbish_row_to_json_idx  (cost=0.00..29.50 rows=200 width=0) (actual time=23.779..23.779 rows=100423 loops=1)
         Index Cond: (row_to_json @> '{"rpt_per_dt": "2003-06-30"}'::jsonb)
 Planning time: 0.098 ms
 Execution time: 5214.212 ms

... so, an 80% increase in lookup and extraction time for swapping offsets for lengths. That's actually all extraction time; I tried removing the extraction from the query, and without it both queries are close enough to be statistically insignificant.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
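[The plans above refer to an index named jsonbish_row_to_json_idx; presumably it was built as an ordinary GIN index on the jsonb column, roughly like the statement below. The default jsonb_ops operator class supports the @> operator used in these queries; this is an assumption, since the index definition isn't shown in the thread.]

-- Hypothetical: GIN index backing the containment searches shown above.
CREATE INDEX jsonbish_row_to_json_idx ON jsonbish USING gin (row_to_json);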
Josh Berkus <josh@agliodbs.com> writes:
> And with Tom's test patch:
> ...
> Since that improved things a *lot*, just +40% instead of +200%, I
> thought I'd test some select queries.

That test patch is not meant to be fast, its ambition was only to see what the effects on storage size would be. So I find this unsurprising:

> ... so, an 80% increase in lookup and extraction time for swapping
> offsets for lengths.

We can certainly reduce that. The question was whether it would be worth the effort to try. At this point, with three different test data sets having shown clear space savings, I think it is worth the effort. I'll poke into it tomorrow or over the weekend, unless somebody beats me to it.

regards, tom lane
On 08/14/2014 07:24 PM, Tom Lane wrote:
> We can certainly reduce that. The question was whether it would be
> worth the effort to try. At this point, with three different test
> data sets having shown clear space savings, I think it is worth
> the effort. I'll poke into it tomorrow or over the weekend, unless
> somebody beats me to it.

Note that I specifically created that data set to be a worst case: many top-level keys, no nesting, and small values. However, I don't think it's an unrealistic worst case.

Interestingly, even on the unpatched, 1GB table case, the *index* on the JSONB is only 60MB. Which shows just how terrific the improvement in GIN index size/performance is.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> On 08/14/2014 07:24 PM, Tom Lane wrote:
>> We can certainly reduce that. The question was whether it would be
>> worth the effort to try. At this point, with three different test
>> data sets having shown clear space savings, I think it is worth
>> the effort. I'll poke into it tomorrow or over the weekend, unless
>> somebody beats me to it.

> Note that I specifically created that data set to be a worst case: many
> top-level keys, no nesting, and small values. However, I don't think
> it's an unrealistic worst case.

> Interestingly, even on the unpatched, 1GB table case, the *index* on the
> JSONB is only 60MB. Which shows just how terrific the improvement in
> GIN index size/performance is.

I've been poking at this, and I think the main explanation for your result is that with more JSONB documents being subject to compression, we're spending more time in pglz_decompress. There's no free lunch in that department: if you want compressed storage it's gonna cost ya to decompress. The only way I can get decompression and TOAST access to not dominate the profile on cases of this size is to ALTER COLUMN SET STORAGE PLAIN. However, when I do that, I do see my test patch running about 25% slower overall than HEAD on an "explain analyze select jfield -> 'key' from table" type of query with 200-key documents with narrow fields (see attached perl script that generates the test data).

It seems difficult to improve much on that for this test case. I put some logic into findJsonbValueFromContainer to calculate the offset sums just once, not once per binary-search iteration, but that only improved matters 5% at best. I still think it'd be worth modifying the JsonbIterator code to avoid repetitive offset calculations, but that's not too relevant to this test case.

Having said all that, I think this test is something of a contrived worst case. More realistic cases are likely to have many fewer keys (so that speed of the binary search loop is less of an issue) or else to have total document sizes large enough that inline PLAIN storage isn't an option, meaning that detoast+decompression costs will dominate.

regards, tom lane

#! /usr/bin/perl

for (my $i = 0; $i < 100000; $i++)
{
    print "{";
    for (my $k = 1; $k <= 200; $k++)
    {
        print ", " if $k > 1;
        printf "\"k%d\": %d", $k, int(rand(10));
    }
    print "}\n";
}
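[For anyone wanting to repeat this, output from a generator like that could be loaded and exercised along these lines; the table name, file name and the explicit STORAGE setting are filled in here as assumptions based on the description above, not taken from the original post.]

-- Hypothetical: load one JSON document per line and run the extraction test.
CREATE TABLE t200 (jfield jsonb);
ALTER TABLE t200 ALTER COLUMN jfield SET STORAGE PLAIN;  -- inline, uncompressed storage
\copy t200 (jfield) FROM 'testdata.json'

EXPLAIN ANALYZE SELECT jfield -> 'k100' FROM t200;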
<p dir="ltr">I'm still getting up to speed on postgres development but I'd like to leave an opinion. <p dir="ltr">We shouldadd some sort of versionning to the jsonb format. This can be explored in the future in many ways.<p dir="ltr">As forthe current problem, we should explore the directory at the end option. It should improve compression and keep good accessperformance. <p dir="ltr">A 4 byte header is sufficient to store the directory offset and some versionning bits.<br/><div class="gmail_quote">Em 15/08/2014 17:39, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>>escreveu:<br type="attribution" /><blockquote class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Josh Berkus <<a href="mailto:josh@agliodbs.com">josh@agliodbs.com</a>>writes:<br /> > On 08/14/2014 07:24 PM, Tom Lane wrote:<br />>> We can certainly reduce that. The question was whether it would be<br /> >> worth the effort to try. Atthis point, with three different test<br /> >> data sets having shown clear space savings, I think it is worth<br/> >> the effort. I'll poke into it tomorrow or over the weekend, unless<br /> >> somebody beats meto it.<br /><br /> > Note that I specifically created that data set to be a worst case: many<br /> > top-level keys,no nesting, and small values. However, I don't think<br /> > it's an unrealistic worst case.<br /><br /> > Interestingly,even on the unpatched, 1GB table case, the *index* on the<br /> > JSONB is only 60MB. Which shows justhow terrific the improvement in<br /> > GIN index size/performance is.<br /><br /> I've been poking at this, and Ithink the main explanation for your result<br /> is that with more JSONB documents being subject to compression, we're<br/> spending more time in pglz_decompress. There's no free lunch in that<br /> department: if you want compressedstorage it's gonna cost ya to<br /> decompress. The only way I can get decompression and TOAST access to not<br/> dominate the profile on cases of this size is to ALTER COLUMN SET STORAGE<br /> PLAIN. However, when I do that,I do see my test patch running about 25%<br /> slower overall than HEAD on an "explain analyze select jfield -> 'key'<br/> from table" type of query with 200-key documents with narrow fields (see<br /> attached perl script that generatesthe test data).<br /><br /> It seems difficult to improve much on that for this test case. I put some<br /> logicinto findJsonbValueFromContainer to calculate the offset sums just<br /> once not once per binary-search iteration,but that only improved matters<br /> 5% at best. I still think it'd be worth modifying the JsonbIterator code<br/> to avoid repetitive offset calculations, but that's not too relevant to<br /> this test case.<br /><br /> Havingsaid all that, I think this test is something of a contrived worst<br /> case. More realistic cases are likely tohave many fewer keys (so that<br /> speed of the binary search loop is less of an issue) or else to have total<br /> documentsizes large enough that inline PLAIN storage isn't an option,<br /> meaning that detoast+decompression costs willdominate.<br /><br /> regards, tom lane<br /><br /><br /><br /> --<br /> Sent via pgsql-hackersmailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br /> To makechanges to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-hackers" target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></blockquote></div>
On 08/15/2014 01:38 PM, Tom Lane wrote:
> I've been poking at this, and I think the main explanation for your result
> is that with more JSONB documents being subject to compression, we're
> spending more time in pglz_decompress. There's no free lunch in that
> department: if you want compressed storage it's gonna cost ya to
> decompress. The only way I can get decompression and TOAST access to not
> dominate the profile on cases of this size is to ALTER COLUMN SET STORAGE
> PLAIN. However, when I do that, I do see my test patch running about 25%
> slower overall than HEAD on an "explain analyze select jfield -> 'key'
> from table" type of query with 200-key documents with narrow fields (see
> attached perl script that generates the test data).

Ok, that probably falls under the heading of "acceptable tradeoffs" then.

> Having said all that, I think this test is something of a contrived worst
> case. More realistic cases are likely to have many fewer keys (so that
> speed of the binary search loop is less of an issue) or else to have total
> document sizes large enough that inline PLAIN storage isn't an option,
> meaning that detoast+decompression costs will dominate.

This was intended to be a worst case. However, I don't think that it's the last time we'll see the case of having 100 to 200 keys each with short values. That case was actually from some XML data which I'd already converted into a regular table (hence every row having 183 keys), but if JSONB had been available when I started the project, I might have chosen to store it as JSONB instead.

It occurs to me that the matching data from a personals website would very much fit the pattern of having between 50 and 200 keys, each of which has a short value. So we don't need to *optimize* for that case, but it also shouldn't be disastrously slow or 300% of the size of comparable TEXT.

Mind you, I don't find +80% to be disastrously slow (especially not with a space savings of 60%), so maybe that's good enough.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Arthur Silva <arthurprs@gmail.com> writes:
> We should add some sort of versioning to the jsonb format. This can be
> explored in the future in many ways.

If we end up making an incompatible change to the jsonb format, I would support taking the opportunity to stick a version ID in there. But I don't want to force a dump/reload cycle *only* to do that.

> As for the current problem, we should explore the directory at the end
> option. It should improve compression and keep good access performance.

Meh. Pushing the directory to the end is just a band-aid, and since it would still force a dump/reload, it's not a very enticing band-aid. The only thing it'd really fix is the first_success_by issue, which we could fix *without* a dump/reload by using different compression parameters for jsonb. Moving the directory to the end, by itself, does nothing to fix the problem that the directory contents aren't compressible --- and we now have pretty clear evidence that that is a significant issue. (See for instance Josh's results that increasing first_success_by did very little for the size of his dataset.)

I think the realistic alternatives at this point are either to switch to all-lengths as in my test patch, or to use the hybrid approach of Heikki's test patch. IMO the major attraction of Heikki's patch is that it'd be upward compatible with existing beta installations, ie no initdb required (but thus, no opportunity to squeeze in a version identifier either). It's not showing up terribly well in the performance tests I've been doing --- it's about halfway between HEAD and my patch on that extract-a-key-from-a-PLAIN-stored-column test. But, just as with my patch, there are things that could be done to micro-optimize it by touching a bit more code.

I did some quick stats comparing compressed sizes for the delicio.us data, printing quartiles as per Josh's lead:

all-lengths     {440,569,609,655,1257}
Heikki's patch  {456,582,624,671,1274}
HEAD            {493,636,684,744,1485}

(As before, this is pg_column_size of the jsonb within a table whose rows are wide enough to force tuptoaster.c to try to compress the jsonb; otherwise many of these values wouldn't get compressed.) These documents don't have enough keys to trigger the first_success_by issue, so that HEAD doesn't look too awful, but still there's about an 11% gain from switching from offsets to lengths. Heikki's method captures much of that but not all.

Personally I'd prefer to go to the all-lengths approach, but a large part of that comes from a subjective assessment that the hybrid approach is too messy. Others might well disagree.

In case anyone else wants to do measurements on some more data sets, attached is a copy of Heikki's patch updated to apply against git tip.

regards, tom lane

diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index 04f35bf..47b2998 100644 *** a/src/backend/utils/adt/jsonb_util.c --- b/src/backend/utils/adt/jsonb_util.c *************** convertJsonbArray(StringInfo buffer, JEn *** 1378,1385 **** errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", JENTRY_POSMASK))); ! if (i > 0) meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } --- 1378,1387 ---- errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", JENTRY_POSMASK))); !
if (i % JBE_STORE_LEN_STRIDE == 0) meta = (meta & ~JENTRY_POSMASK) | totallen; + else + meta |= JENTRY_HAS_LEN; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } *************** convertJsonbObject(StringInfo buffer, JE *** 1430,1440 **** errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", JENTRY_POSMASK))); ! if (i > 0) meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); convertJsonbValue(buffer, &meta, &pair->value, level); len = meta & JENTRY_POSMASK; totallen += len; --- 1432,1445 ---- errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", JENTRY_POSMASK))); ! if (i % JBE_STORE_LEN_STRIDE == 0) meta = (meta & ~JENTRY_POSMASK) | totallen; + else + meta |= JENTRY_HAS_LEN; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); + /* put value */ convertJsonbValue(buffer, &meta, &pair->value, level); len = meta & JENTRY_POSMASK; totallen += len; *************** convertJsonbObject(StringInfo buffer, JE *** 1445,1451 **** errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", JENTRY_POSMASK))); ! meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } --- 1450,1456 ---- errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", JENTRY_POSMASK))); ! meta |= JENTRY_HAS_LEN; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } *************** uniqueifyJsonbObject(JsonbValue *object) *** 1592,1594 **** --- 1597,1635 ---- object->val.object.nPairs = res + 1 - object->val.object.pairs; } } + + uint32 + jsonb_get_offset(const JEntry *ja, int index) + { + uint32 off = 0; + int i; + + /* + * Each absolute entry contains the *end* offset. Start offset of this + * entry is equal to the end offset of the previous entry. + */ + for (i = index - 1; i >= 0; i--) + { + off += JBE_POSFLD(ja[i]); + if (!JBE_HAS_LEN(ja[i])) + break; + } + return off; + } + + uint32 + jsonb_get_length(const JEntry *ja, int index) + { + uint32 off; + uint32 len; + + if (JBE_HAS_LEN(ja[index])) + len = JBE_POSFLD(ja[index]); + else + { + off = jsonb_get_offset(ja, index); + len = JBE_POSFLD(ja[index]) - off; + } + + return len; + } diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h index 91e3e14..10a07bb 100644 *** a/src/include/utils/jsonb.h --- b/src/include/utils/jsonb.h *************** typedef struct JsonbValue JsonbValue; *** 102,112 **** * to JB_FSCALAR | JB_FARRAY. * * To encode the length and offset of the variable-length portion of each ! * node in a compact way, the JEntry stores only the end offset within the ! * variable-length portion of the container node. For the first JEntry in the ! * container's JEntry array, that equals to the length of the node data. The ! * begin offset and length of the rest of the entries can be calculated using ! * the end offset of the previous JEntry in the array. * * Overall, the Jsonb struct requires 4-bytes alignment. Within the struct, * the variable-length portion of some node types is aligned to a 4-byte --- 102,113 ---- * to JB_FSCALAR | JB_FARRAY. * * To encode the length and offset of the variable-length portion of each ! * node in a compact way, the JEntry stores either the length of the element, ! * or its end offset within the variable-length portion of the container node. ! 
* Entries that store a length are marked with the JENTRY_HAS_LEN flag, other ! * entries store an end offset. The begin offset and length of each entry ! * can be calculated by scanning backwards to the previous entry storing an ! * end offset, and adding up the lengths of the elements in between. * * Overall, the Jsonb struct requires 4-bytes alignment. Within the struct, * the variable-length portion of some node types is aligned to a 4-byte *************** typedef struct JsonbValue JsonbValue; *** 120,134 **** /* * Jentry format. * ! * The least significant 28 bits store the end offset of the entry (see ! * JBE_ENDPOS, JBE_OFF, JBE_LEN macros below). The next three bits ! * are used to store the type of the entry. The most significant bit ! * is unused, and should be set to zero. */ typedef uint32 JEntry; #define JENTRY_POSMASK 0x0FFFFFFF #define JENTRY_TYPEMASK 0x70000000 /* values stored in the type bits */ #define JENTRY_ISSTRING 0x00000000 --- 121,136 ---- /* * Jentry format. * ! * The least significant 28 bits store the end offset or the length of the ! * entry, depending on whether the JENTRY_HAS_LEN flag is set (see ! * JBE_ENDPOS, JBE_OFF, JBE_LEN macros below). The other three bits ! * are used to store the type of the entry. */ typedef uint32 JEntry; #define JENTRY_POSMASK 0x0FFFFFFF #define JENTRY_TYPEMASK 0x70000000 + #define JENTRY_HAS_LEN 0x80000000 /* values stored in the type bits */ #define JENTRY_ISSTRING 0x00000000 *************** typedef uint32 JEntry; *** 146,160 **** #define JBE_ISBOOL_TRUE(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISBOOL_TRUE) #define JBE_ISBOOL_FALSE(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISBOOL_FALSE) #define JBE_ISBOOL(je_) (JBE_ISBOOL_TRUE(je_) || JBE_ISBOOL_FALSE(je_)) /* ! * Macros for getting the offset and length of an element. Note multiple ! * evaluations and access to prior array element. */ ! #define JBE_ENDPOS(je_) ((je_) & JENTRY_POSMASK) ! #define JBE_OFF(ja, i) ((i) == 0 ? 0 : JBE_ENDPOS((ja)[i - 1])) ! #define JBE_LEN(ja, i) ((i) == 0 ? JBE_ENDPOS((ja)[i]) \ ! : JBE_ENDPOS((ja)[i]) - JBE_ENDPOS((ja)[i - 1])) /* * A jsonb array or object node, within a Jsonb Datum. --- 148,170 ---- #define JBE_ISBOOL_TRUE(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISBOOL_TRUE) #define JBE_ISBOOL_FALSE(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISBOOL_FALSE) #define JBE_ISBOOL(je_) (JBE_ISBOOL_TRUE(je_) || JBE_ISBOOL_FALSE(je_)) + #define JBE_HAS_LEN(je_) (((je_) & JENTRY_HAS_LEN) != 0) /* ! * Macros for getting the offset and length of an element. */ ! #define JBE_POSFLD(je_) ((je_) & JENTRY_POSMASK) ! #define JBE_OFF(ja, i) jsonb_get_offset(ja, i) ! #define JBE_LEN(ja, i) jsonb_get_length(ja, i) ! ! /* ! * Store an absolute end offset every JBE_STORE_LEN_STRIDE elements (for an ! * array) or key/value pairs (for an object). Others are stored as lengths. ! */ ! #define JBE_STORE_LEN_STRIDE 8 ! ! extern uint32 jsonb_get_offset(const JEntry *ja, int index); ! extern uint32 jsonb_get_length(const JEntry *ja, int index); /* * A jsonb array or object node, within a Jsonb Datum.
On Fri, Aug 15, 2014 at 8:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I agree that versioning might sound silly at this point, but let's keep it in mind.

Row-level compression is very slow itself, so it sounds odd to me to pay a 25% performance penalty everywhere for the sake of a better compression ratio in the dictionary area.
Consider, for example, an optimization that stuffs integers (up to 28 bits) inside the JEntry itself. That alone would save 8 bytes for each integer.

On 08/15/2014 04:19 PM, Tom Lane wrote:
> Personally I'd prefer to go to the all-lengths approach, but a large
> part of that comes from a subjective assessment that the hybrid approach
> is too messy. Others might well disagree.
>
> In case anyone else wants to do measurements on some more data sets,
> attached is a copy of Heikki's patch updated to apply against git tip.

Note that this is not 100% comparable because I'm running it against git clone, and the earlier tests were against beta2. However, the Heikki patch looks like a bust on this dataset -- see below.

postgres=# select pg_size_pretty(pg_total_relation_size('jsonic'));
 pg_size_pretty
----------------
 394 MB
(1 row)

postgres=# select pg_size_pretty(pg_total_relation_size('jsonbish'));
 pg_size_pretty
----------------
 542 MB

Extraction Test:

postgres=# explain analyze select row_to_json -> 'kt1_total_sum' from jsonbish where row_to_json @> '{ "rpt_per_dt" : "2003-06-30" }';
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on jsonbish  (cost=29.55..582.92 rows=200 width=18) (actual time=22.742..5281.823 rows=100423 loops=1)
   Recheck Cond: (row_to_json @> '{"rpt_per_dt": "2003-06-30"}'::jsonb)
   Heap Blocks: exact=1471
   ->  Bitmap Index Scan on jsonbish_row_to_json_idx  (cost=0.00..29.50 rows=200 width=0) (actual time=22.445..22.445 rows=100423 loops=1)
         Index Cond: (row_to_json @> '{"rpt_per_dt": "2003-06-30"}'::jsonb)
 Planning time: 0.095 ms
 Execution time: 5292.047 ms
(7 rows)

So, that extraction test is about 1% *slower* than the basic Tom Lane lengths-only patch, and still 80% slower than original JSONB. And it's the same size as the lengths-only version. Huh?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > On 08/15/2014 04:19 PM, Tom Lane wrote: >> Personally I'd prefer to go to the all-lengths approach, but a large >> part of that comes from a subjective assessment that the hybrid approach >> is too messy. Others might well disagree. > ... So, that extraction test is about 1% *slower* than the basic Tom Lane > lengths-only patch, and still 80% slower than original JSONB. And it's > the same size as the lengths-only version. Since it's looking like this might be the direction we want to go, I took the time to flesh out my proof-of-concept patch. The attached version takes care of cosmetic issues (like fixing the comments), and includes code to avoid O(N^2) penalties in findJsonbValueFromContainer and JsonbIteratorNext. I'm not sure whether those changes will help noticeably on Josh's test case; for me, they seemed worth making, but they do not bring the code back to full speed parity with the all-offsets version. But as we've been discussing, it seems likely that those costs would be swamped by compression and I/O considerations in most scenarios with large documents; and of course for small documents it hardly matters. Even if we don't go this way, there are parts of this patch that would need to get committed. I found for instance that convertJsonbArray and convertJsonbObject have insufficient defenses against overflowing the overall length field for the array or object. For my own part, I'm satisfied with the patch as attached (modulo the need to teach pg_upgrade about the incompatibility). There remains the question of whether to take this opportunity to add a version ID to the binary format. I'm not as excited about that idea as I originally was; having now studied the code more carefully, I think that any expansion would likely happen by adding more type codes and/or commandeering the currently-unused high-order bit of JEntrys. We don't need a version ID in the header for that. Moreover, if we did have such an ID, it would be notationally painful to get it to most of the places that might need it. regards, tom lane diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c index 2fd87fc..456011a 100644 *** a/src/backend/utils/adt/jsonb.c --- b/src/backend/utils/adt/jsonb.c *************** jsonb_from_cstring(char *json, int len) *** 196,207 **** static size_t checkStringLen(size_t len) { ! if (len > JENTRY_POSMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("string too long to represent as jsonb string"), errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.", ! JENTRY_POSMASK))); return len; } --- 196,207 ---- static size_t checkStringLen(size_t len) { ! if (len > JENTRY_LENMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("string too long to represent as jsonb string"), errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.", ! JENTRY_LENMASK))); return len; } diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index 04f35bf..e47eaea 100644 *** a/src/backend/utils/adt/jsonb_util.c --- b/src/backend/utils/adt/jsonb_util.c *************** *** 26,40 **** * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits * reserved for that in the JsonbContainer.header field. * ! 
* (the total size of an array's elements is also limited by JENTRY_POSMASK, * but we're not concerned about that here) */ #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK)) #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK)) ! static void fillJsonbValue(JEntry *array, int index, char *base_addr, JsonbValue *result); ! static bool equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b); static int compareJsonbScalarValue(JsonbValue *a, JsonbValue *b); static Jsonb *convertToJsonb(JsonbValue *val); static void convertJsonbValue(StringInfo buffer, JEntry *header, JsonbValue *val, int level); --- 26,41 ---- * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits * reserved for that in the JsonbContainer.header field. * ! * (the total size of an array's elements is also limited by JENTRY_LENMASK, * but we're not concerned about that here) */ #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK)) #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK)) ! static void fillJsonbValue(JEntry *children, int index, ! char *base_addr, uint32 offset, JsonbValue *result); ! static bool equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b); static int compareJsonbScalarValue(JsonbValue *a, JsonbValue *b); static Jsonb *convertToJsonb(JsonbValue *val); static void convertJsonbValue(StringInfo buffer, JEntry *header, JsonbValue *val, int level); *************** JsonbValueToJsonb(JsonbValue *val) *** 108,113 **** --- 109,135 ---- } /* + * Get the offset of the variable-length portion of a Jsonb node within + * the variable-length-data part of its container. The node is identified + * by index within the container's JEntry array. + * + * We do this by adding up the lengths of all the previous nodes' + * variable-length portions. It's best to avoid using this function when + * iterating through all the nodes in a container, since that would result + * in O(N^2) work. + */ + uint32 + getJsonbOffset(const JEntry *ja, int index) + { + uint32 off = 0; + int i; + + for (i = 0; i < index; i++) + off += JBE_LEN(ja, i); + return off; + } + + /* * BT comparator worker function. Returns an integer less than, equal to, or * greater than zero, indicating whether a is less than, equal to, or greater * than b. Consistent with the requirements for a B-Tree operator class *************** findJsonbValueFromContainer(JsonbContain *** 279,324 **** if (flags & JB_FARRAY & container->header) { char *base_addr = (char *) (children + count); int i; for (i = 0; i < count; i++) { ! fillJsonbValue(children, i, base_addr, result); if (key->type == result->type) { if (equalsJsonbScalarValue(key, result)) return result; } } } else if (flags & JB_FOBJECT & container->header) { /* Since this is an object, account for *Pairs* of Jentrys */ char *base_addr = (char *) (children + count * 2); uint32 stopLow = 0, ! stopMiddle; ! /* Object key past by caller must be a string */ Assert(key->type == jbvString); /* Binary search on object/pair keys *only* */ ! while (stopLow < count) { int index; int difference; JsonbValue candidate; /* ! * Note how we compensate for the fact that we're iterating ! * through pairs (not entries) throughout. */ - stopMiddle = stopLow + (count - stopLow) / 2; - index = stopMiddle * 2; candidate.type = jbvString; ! 
candidate.val.string.val = base_addr + JBE_OFF(children, index); candidate.val.string.len = JBE_LEN(children, index); difference = lengthCompareJsonbStringValue(&candidate, key); --- 301,370 ---- if (flags & JB_FARRAY & container->header) { char *base_addr = (char *) (children + count); + uint32 offset = 0; int i; for (i = 0; i < count; i++) { ! fillJsonbValue(children, i, base_addr, offset, result); if (key->type == result->type) { if (equalsJsonbScalarValue(key, result)) return result; } + + offset += JBE_LEN(children, i); } } else if (flags & JB_FOBJECT & container->header) { /* Since this is an object, account for *Pairs* of Jentrys */ char *base_addr = (char *) (children + count * 2); + uint32 *offsets; + uint32 lastoff; + int lastoffpos; uint32 stopLow = 0, ! stopHigh = count; ! /* Object key passed by caller must be a string */ Assert(key->type == jbvString); + /* + * We use a cache to avoid redundant getJsonbOffset() computations + * inside the search loop. Note that count may well be zero at this + * point; to avoid an ugly special case for initializing lastoff and + * lastoffpos, we allocate one extra array element. + */ + offsets = (uint32 *) palloc((count * 2 + 1) * sizeof(uint32)); + offsets[0] = lastoff = 0; + lastoffpos = 0; + /* Binary search on object/pair keys *only* */ ! while (stopLow < stopHigh) { + uint32 stopMiddle; int index; int difference; JsonbValue candidate; + stopMiddle = stopLow + (stopHigh - stopLow) / 2; + /* ! * Compensate for the fact that we're searching through pairs (not ! * entries). */ index = stopMiddle * 2; + /* Update the offsets cache through at least index+1 */ + while (lastoffpos <= index) + { + lastoff += JBE_LEN(children, lastoffpos); + offsets[++lastoffpos] = lastoff; + } + candidate.type = jbvString; ! candidate.val.string.val = base_addr + offsets[index]; candidate.val.string.len = JBE_LEN(children, index); difference = lengthCompareJsonbStringValue(&candidate, key); *************** findJsonbValueFromContainer(JsonbContain *** 326,333 **** if (difference == 0) { /* Found our key, return value */ ! fillJsonbValue(children, index + 1, base_addr, result); return result; } else --- 372,382 ---- if (difference == 0) { /* Found our key, return value */ ! fillJsonbValue(children, index + 1, ! base_addr, offsets[index + 1], ! result); + pfree(offsets); return result; } else *************** findJsonbValueFromContainer(JsonbContain *** 335,343 **** if (difference < 0) stopLow = stopMiddle + 1; else ! count = stopMiddle; } } } /* Not found */ --- 384,394 ---- if (difference < 0) stopLow = stopMiddle + 1; else ! stopHigh = stopMiddle; } } + + pfree(offsets); } /* Not found */ *************** getIthJsonbValueFromContainer(JsonbConta *** 368,374 **** result = palloc(sizeof(JsonbValue)); ! fillJsonbValue(container->children, i, base_addr, result); return result; } --- 419,427 ---- result = palloc(sizeof(JsonbValue)); ! fillJsonbValue(container->children, i, base_addr, ! getJsonbOffset(container->children, i), ! result); return result; } *************** getIthJsonbValueFromContainer(JsonbConta *** 377,387 **** * A helper function to fill in a JsonbValue to represent an element of an * array, or a key or value of an object. * * A nested array or object will be returned as jbvBinary, ie. it won't be * expanded. */ static void ! 
fillJsonbValue(JEntry *children, int index, char *base_addr, JsonbValue *result) { JEntry entry = children[index]; --- 430,446 ---- * A helper function to fill in a JsonbValue to represent an element of an * array, or a key or value of an object. * + * The node's JEntry is at children[index], and its variable-length data + * is at base_addr + offset. We make the caller determine the offset since + * in many cases the caller can amortize the work across multiple children. + * * A nested array or object will be returned as jbvBinary, ie. it won't be * expanded. */ static void ! fillJsonbValue(JEntry *children, int index, ! char *base_addr, uint32 offset, ! JsonbValue *result) { JEntry entry = children[index]; *************** fillJsonbValue(JEntry *children, int ind *** 392,405 **** else if (JBE_ISSTRING(entry)) { result->type = jbvString; ! result->val.string.val = base_addr + JBE_OFF(children, index); ! result->val.string.len = JBE_LEN(children, index); Assert(result->val.string.len >= 0); } else if (JBE_ISNUMERIC(entry)) { result->type = jbvNumeric; ! result->val.numeric = (Numeric) (base_addr + INTALIGN(JBE_OFF(children, index))); } else if (JBE_ISBOOL_TRUE(entry)) { --- 451,464 ---- else if (JBE_ISSTRING(entry)) { result->type = jbvString; ! result->val.string.val = base_addr + offset; ! result->val.string.len = JBE_LENFLD(entry); Assert(result->val.string.len >= 0); } else if (JBE_ISNUMERIC(entry)) { result->type = jbvNumeric; ! result->val.numeric = (Numeric) (base_addr + INTALIGN(offset)); } else if (JBE_ISBOOL_TRUE(entry)) { *************** fillJsonbValue(JEntry *children, int ind *** 415,422 **** { Assert(JBE_ISCONTAINER(entry)); result->type = jbvBinary; ! result->val.binary.data = (JsonbContainer *) (base_addr + INTALIGN(JBE_OFF(children, index))); ! result->val.binary.len = JBE_LEN(children, index) - (INTALIGN(JBE_OFF(children, index)) - JBE_OFF(children, index)); } } --- 474,482 ---- { Assert(JBE_ISCONTAINER(entry)); result->type = jbvBinary; ! /* Remove alignment padding from data pointer and len */ ! result->val.binary.data = (JsonbContainer *) (base_addr + INTALIGN(offset)); ! result->val.binary.len = JBE_LENFLD(entry) - (INTALIGN(offset) - offset); } } *************** recurse: *** 668,680 **** * a full conversion */ val->val.array.rawScalar = (*it)->isScalar; ! (*it)->i = 0; /* Set state for next call */ (*it)->state = JBI_ARRAY_ELEM; return WJB_BEGIN_ARRAY; case JBI_ARRAY_ELEM: ! if ((*it)->i >= (*it)->nElems) { /* * All elements within array already processed. Report this --- 728,741 ---- * a full conversion */ val->val.array.rawScalar = (*it)->isScalar; ! (*it)->curIndex = 0; ! (*it)->curDataOffset = 0; /* Set state for next call */ (*it)->state = JBI_ARRAY_ELEM; return WJB_BEGIN_ARRAY; case JBI_ARRAY_ELEM: ! if ((*it)->curIndex >= (*it)->nElems) { /* * All elements within array already processed. Report this *************** recurse: *** 686,692 **** return WJB_END_ARRAY; } ! fillJsonbValue((*it)->children, (*it)->i++, (*it)->dataProper, val); if (!IsAJsonbScalar(val) && !skipNested) { --- 747,758 ---- return WJB_END_ARRAY; } ! fillJsonbValue((*it)->children, (*it)->curIndex, ! (*it)->dataProper, (*it)->curDataOffset, ! val); ! ! (*it)->curDataOffset += JBE_LEN((*it)->children, (*it)->curIndex); ! (*it)->curIndex++; if (!IsAJsonbScalar(val) && !skipNested) { *************** recurse: *** 712,724 **** * v->val.object.pairs is not actually set, because we aren't * doing a full conversion */ ! 
(*it)->i = 0; /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; return WJB_BEGIN_OBJECT; case JBI_OBJECT_KEY: ! if ((*it)->i >= (*it)->nElems) { /* * All pairs within object already processed. Report this to --- 778,791 ---- * v->val.object.pairs is not actually set, because we aren't * doing a full conversion */ ! (*it)->curIndex = 0; ! (*it)->curDataOffset = 0; /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; return WJB_BEGIN_OBJECT; case JBI_OBJECT_KEY: ! if ((*it)->curIndex >= (*it)->nElems) { /* * All pairs within object already processed. Report this to *************** recurse: *** 732,738 **** else { /* Return key of a key/value pair. */ ! fillJsonbValue((*it)->children, (*it)->i * 2, (*it)->dataProper, val); if (val->type != jbvString) elog(ERROR, "unexpected jsonb type as object key"); --- 799,807 ---- else { /* Return key of a key/value pair. */ ! fillJsonbValue((*it)->children, (*it)->curIndex * 2, ! (*it)->dataProper, (*it)->curDataOffset, ! val); if (val->type != jbvString) elog(ERROR, "unexpected jsonb type as object key"); *************** recurse: *** 745,752 **** /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; ! fillJsonbValue((*it)->children, ((*it)->i++) * 2 + 1, ! (*it)->dataProper, val); /* * Value may be a container, in which case we recurse with new, --- 814,829 ---- /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; ! (*it)->curDataOffset += JBE_LEN((*it)->children, ! (*it)->curIndex * 2); ! ! fillJsonbValue((*it)->children, (*it)->curIndex * 2 + 1, ! (*it)->dataProper, (*it)->curDataOffset, ! val); ! ! (*it)->curDataOffset += JBE_LEN((*it)->children, ! (*it)->curIndex * 2 + 1); ! (*it)->curIndex++; /* * Value may be a container, in which case we recurse with new, *************** convertJsonbArray(StringInfo buffer, JEn *** 1340,1353 **** int totallen; uint32 header; ! /* Initialize pointer into conversion buffer at this level */ offset = buffer->len; padBufferToInt(buffer); /* ! * Construct the header Jentry, stored in the beginning of the variable- ! * length payload. */ header = val->val.array.nElems | JB_FARRAY; if (val->val.array.rawScalar) --- 1417,1431 ---- int totallen; uint32 header; ! /* Remember where variable-length data starts for this array */ offset = buffer->len; + /* Align to 4-byte boundary (any padding counts as part of my data) */ padBufferToInt(buffer); /* ! * Construct the header Jentry and store it in the beginning of the ! * variable-length payload. */ header = val->val.array.nElems | JB_FARRAY; if (val->val.array.rawScalar) *************** convertJsonbArray(StringInfo buffer, JEn *** 1358,1364 **** } appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! /* reserve space for the JEntries of the elements. */ metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.array.nElems); totallen = 0; --- 1436,1443 ---- } appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! ! /* Reserve space for the JEntries of the elements. */ metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.array.nElems); totallen = 0; *************** convertJsonbArray(StringInfo buffer, JEn *** 1368,1391 **** int len; JEntry meta; convertJsonbValue(buffer, &meta, elem, level + 1); - len = meta & JENTRY_POSMASK; - totallen += len; ! if (totallen > JENTRY_POSMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! 
JENTRY_POSMASK))); - if (i > 0) - meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } totallen = buffer->len - offset; /* Initialize the header of this node, in the container's JEntry array */ *pheader = JENTRY_ISCONTAINER | totallen; } --- 1447,1485 ---- int len; JEntry meta; + /* + * Convert element, producing a JEntry and appending its + * variable-length data to buffer + */ convertJsonbValue(buffer, &meta, elem, level + 1); ! /* ! * Bail out if total variable-length data exceeds what will fit in a ! * JEntry length field. We check this in each iteration, not just ! * once at the end, to forestall possible integer overflow. ! */ ! len = JBE_LENFLD(meta); ! totallen += len; ! if (totallen > JENTRY_LENMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_LENMASK))); copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } + /* Total data size is everything we've appended to buffer */ totallen = buffer->len - offset; + /* Check length again, since we didn't include the metadata above */ + if (totallen > JENTRY_LENMASK) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", + JENTRY_LENMASK))); + /* Initialize the header of this node, in the container's JEntry array */ *pheader = JENTRY_ISCONTAINER | totallen; } *************** convertJsonbArray(StringInfo buffer, JEn *** 1393,1457 **** static void convertJsonbObject(StringInfo buffer, JEntry *pheader, JsonbValue *val, int level) { - uint32 header; int offset; int metaoffset; int i; int totallen; ! /* Initialize pointer into conversion buffer at this level */ offset = buffer->len; padBufferToInt(buffer); ! /* Initialize header */ header = val->val.object.nPairs | JB_FOBJECT; appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! /* reserve space for the JEntries of the keys and values */ metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.object.nPairs * 2); totallen = 0; for (i = 0; i < val->val.object.nPairs; i++) { ! JsonbPair *pair = &val->val.object.pairs[i]; ! int len; ! JEntry meta; ! /* put key */ convertJsonbScalar(buffer, &meta, &pair->key); ! len = meta & JENTRY_POSMASK; totallen += len; - if (totallen > JENTRY_POSMASK) - ereport(ERROR, - (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), - errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", - JENTRY_POSMASK))); - - if (i > 0) - meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); ! convertJsonbValue(buffer, &meta, &pair->value, level); ! len = meta & JENTRY_POSMASK; ! totallen += len; ! if (totallen > JENTRY_POSMASK) ! ereport(ERROR, ! (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), ! errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_POSMASK))); - meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } totallen = buffer->len - offset; *pheader = JENTRY_ISCONTAINER | totallen; } --- 1487,1570 ---- static void convertJsonbObject(StringInfo buffer, JEntry *pheader, JsonbValue *val, int level) { int offset; int metaoffset; int i; int totallen; + uint32 header; ! 
/* Remember where variable-length data starts for this object */ offset = buffer->len; + /* Align to 4-byte boundary (any padding counts as part of my data) */ padBufferToInt(buffer); ! /* ! * Construct the header Jentry and store it in the beginning of the ! * variable-length payload. ! */ header = val->val.object.nPairs | JB_FOBJECT; appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! /* Reserve space for the JEntries of the keys and values. */ metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.object.nPairs * 2); totallen = 0; for (i = 0; i < val->val.object.nPairs; i++) { ! JsonbPair *pair = &val->val.object.pairs[i]; ! int len; ! JEntry meta; ! /* ! * Convert key, producing a JEntry and appending its variable-length ! * data to buffer ! */ convertJsonbScalar(buffer, &meta, &pair->key); ! len = JBE_LENFLD(meta); totallen += len; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); ! /* ! * Convert value, producing a JEntry and appending its variable-length ! * data to buffer ! */ ! convertJsonbValue(buffer, &meta, &pair->value, level + 1); ! len = JBE_LENFLD(meta); ! totallen += len; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); + + /* + * Bail out if total variable-length data exceeds what will fit in a + * JEntry length field. We check this in each iteration, not just + * once at the end, to forestall possible integer overflow. But it + * should be sufficient to check once per iteration, since + * JENTRY_LENMASK is several bits narrower than int. + */ + if (totallen > JENTRY_LENMASK) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("total size of jsonb object elements exceeds the maximum of %u bytes", + JENTRY_LENMASK))); } + /* Total data size is everything we've appended to buffer */ totallen = buffer->len - offset; + /* Check length again, since we didn't include the metadata above */ + if (totallen > JENTRY_LENMASK) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("total size of jsonb object elements exceeds the maximum of %u bytes", + JENTRY_LENMASK))); + + /* Initialize the header of this node, in the container's JEntry array */ *pheader = JENTRY_ISCONTAINER | totallen; } diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h index 91e3e14..b9a4314 100644 *** a/src/include/utils/jsonb.h --- b/src/include/utils/jsonb.h *************** typedef struct JsonbValue JsonbValue; *** 83,91 **** * buffer is accessed, but they can also be deep copied and passed around. * * Jsonb is a tree structure. Each node in the tree consists of a JEntry ! * header, and a variable-length content. The JEntry header indicates what ! * kind of a node it is, e.g. a string or an array, and the offset and length ! * of its variable-length portion within the container. * * The JEntry and the content of a node are not stored physically together. * Instead, the container array or object has an array that holds the JEntrys --- 83,91 ---- * buffer is accessed, but they can also be deep copied and passed around. * * Jsonb is a tree structure. Each node in the tree consists of a JEntry ! * header and a variable-length content (possibly of zero size). The JEntry ! * header indicates what kind of a node it is, e.g. a string or an array, ! * and includes the length of its variable-length portion. * * The JEntry and the content of a node are not stored physically together. 
* Instead, the container array or object has an array that holds the JEntrys *************** typedef struct JsonbValue JsonbValue; *** 95,133 **** * hold its JEntry. Hence, no JEntry header is stored for the root node. It * is implicitly known that the root node must be an array or an object, * so we can get away without the type indicator as long as we can distinguish ! * the two. For that purpose, both an array and an object begins with a uint32 * header field, which contains an JB_FOBJECT or JB_FARRAY flag. When a naked * scalar value needs to be stored as a Jsonb value, what we actually store is * an array with one element, with the flags in the array's header field set * to JB_FSCALAR | JB_FARRAY. * - * To encode the length and offset of the variable-length portion of each - * node in a compact way, the JEntry stores only the end offset within the - * variable-length portion of the container node. For the first JEntry in the - * container's JEntry array, that equals to the length of the node data. The - * begin offset and length of the rest of the entries can be calculated using - * the end offset of the previous JEntry in the array. - * * Overall, the Jsonb struct requires 4-bytes alignment. Within the struct, * the variable-length portion of some node types is aligned to a 4-byte * boundary, while others are not. When alignment is needed, the padding is * in the beginning of the node that requires it. For example, if a numeric * node is stored after a string node, so that the numeric node begins at * offset 3, the variable-length portion of the numeric node will begin with ! * one padding byte. */ /* * Jentry format. * ! * The least significant 28 bits store the end offset of the entry (see ! * JBE_ENDPOS, JBE_OFF, JBE_LEN macros below). The next three bits ! * are used to store the type of the entry. The most significant bit ! * is unused, and should be set to zero. */ typedef uint32 JEntry; ! #define JENTRY_POSMASK 0x0FFFFFFF #define JENTRY_TYPEMASK 0x70000000 /* values stored in the type bits */ --- 95,126 ---- * hold its JEntry. Hence, no JEntry header is stored for the root node. It * is implicitly known that the root node must be an array or an object, * so we can get away without the type indicator as long as we can distinguish ! * the two. For that purpose, both an array and an object begin with a uint32 * header field, which contains an JB_FOBJECT or JB_FARRAY flag. When a naked * scalar value needs to be stored as a Jsonb value, what we actually store is * an array with one element, with the flags in the array's header field set * to JB_FSCALAR | JB_FARRAY. * * Overall, the Jsonb struct requires 4-bytes alignment. Within the struct, * the variable-length portion of some node types is aligned to a 4-byte * boundary, while others are not. When alignment is needed, the padding is * in the beginning of the node that requires it. For example, if a numeric * node is stored after a string node, so that the numeric node begins at * offset 3, the variable-length portion of the numeric node will begin with ! * one padding byte so that the actual numeric data is 4-byte aligned. */ /* * Jentry format. * ! * The least significant 28 bits store the data length of the entry (see ! * JBE_LENFLD and JBE_LEN macros below). The next three bits store the type ! * of the entry. The most significant bit is reserved for future use, and ! * should be set to zero. */ typedef uint32 JEntry; ! 
#define JENTRY_LENMASK 0x0FFFFFFF #define JENTRY_TYPEMASK 0x70000000 /* values stored in the type bits */ *************** typedef uint32 JEntry; *** 148,160 **** #define JBE_ISBOOL(je_) (JBE_ISBOOL_TRUE(je_) || JBE_ISBOOL_FALSE(je_)) /* ! * Macros for getting the offset and length of an element. Note multiple ! * evaluations and access to prior array element. */ ! #define JBE_ENDPOS(je_) ((je_) & JENTRY_POSMASK) ! #define JBE_OFF(ja, i) ((i) == 0 ? 0 : JBE_ENDPOS((ja)[i - 1])) ! #define JBE_LEN(ja, i) ((i) == 0 ? JBE_ENDPOS((ja)[i]) \ ! : JBE_ENDPOS((ja)[i]) - JBE_ENDPOS((ja)[i - 1])) /* * A jsonb array or object node, within a Jsonb Datum. --- 141,150 ---- #define JBE_ISBOOL(je_) (JBE_ISBOOL_TRUE(je_) || JBE_ISBOOL_FALSE(je_)) /* ! * Macros for getting the data length of a JEntry. */ ! #define JBE_LENFLD(je_) ((je_) & JENTRY_LENMASK) ! #define JBE_LEN(ja, i) JBE_LENFLD((ja)[i]) /* * A jsonb array or object node, within a Jsonb Datum. *************** typedef struct JsonbIterator *** 287,306 **** { /* Container being iterated */ JsonbContainer *container; ! uint32 nElems; /* Number of elements in children array (will be ! * nPairs for objects) */ bool isScalar; /* Pseudo-array scalar value? */ ! JEntry *children; /* Current item in buffer (up to nElems, but must * 2 for objects) */ ! int i; ! /* ! * Data proper. This points just past end of children array. ! * We use the JBE_OFF() macro on the Jentrys to find offsets of each ! * child in this area. ! */ ! char *dataProper; /* Private state */ JsonbIterState state; --- 277,294 ---- { /* Container being iterated */ JsonbContainer *container; ! uint32 nElems; /* Number of elements in children array (will ! * be nPairs for objects) */ bool isScalar; /* Pseudo-array scalar value? */ ! JEntry *children; /* JEntrys for child nodes */ ! /* Data proper. This points just past end of children array */ ! char *dataProper; /* Current item in buffer (up to nElems, but must * 2 for objects) */ ! int curIndex; ! /* Data offset corresponding to current item */ ! uint32 curDataOffset; /* Private state */ JsonbIterState state; *************** extern Datum gin_consistent_jsonb_path(P *** 344,349 **** --- 332,338 ---- extern Datum gin_triconsistent_jsonb_path(PG_FUNCTION_ARGS); /* Support functions */ + extern uint32 getJsonbOffset(const JEntry *ja, int index); extern int compareJsonbContainers(JsonbContainer *a, JsonbContainer *b); extern JsonbValue *findJsonbValueFromContainer(JsonbContainer *sheader, uint32 flags,
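To illustrate the arithmetic the lengths-only format above relies on: a child's starting offset is no longer stored anywhere, so it has to be recovered by summing the lengths of all earlier children, which is what the new getJsonbOffset() does. Below is a minimal standalone sketch of that computation; JEntryModel, MODEL_LEN and model_offset are simplified stand-ins, not the real PostgreSQL definitions.

#include <stdio.h>
#include <stdint.h>

/* Stand-in for a JEntry: low 28 bits hold the child's data length. */
typedef uint32_t JEntryModel;

#define MODEL_LENMASK	0x0FFFFFFF
#define MODEL_LEN(ja, i)	((ja)[(i)] & MODEL_LENMASK)

/*
 * Starting offset of child 'index' within the container's variable-length
 * data area: the sum of the lengths of all earlier children.
 */
static uint32_t
model_offset(const JEntryModel *ja, int index)
{
	uint32_t	off = 0;
	int			i;

	for (i = 0; i < index; i++)
		off += MODEL_LEN(ja, i);
	return off;
}

int
main(void)
{
	/* three children whose data lengths are 5, 7 and 4 bytes */
	JEntryModel children[] = {5, 7, 4};
	int			i;

	for (i = 0; i < 3; i++)
		printf("child %d: offset %u, length %u\n", i,
			   (unsigned) model_offset(children, i),
			   (unsigned) MODEL_LEN(children, i));
	return 0;
}

Doing that sum from scratch for every child would be O(N^2) over a large container, which is why the patch keeps running offsets in the iterator (curDataOffset) and lets fillJsonbValue() take the offset from the caller instead of recomputing it.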
On 08/20/2014 08:29 AM, Tom Lane wrote: > Since it's looking like this might be the direction we want to go, I took > the time to flesh out my proof-of-concept patch. The attached version > takes care of cosmetic issues (like fixing the comments), and includes > code to avoid O(N^2) penalties in findJsonbValueFromContainer and > JsonbIteratorNext OK, will test. This means we need a beta3, no? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > This means we need a beta3, no? If we change the on-disk format, I'd say so. So we don't want to wait around too long before deciding. regards, tom lane
On 08/20/2014 08:29 AM, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: >> On 08/15/2014 04:19 PM, Tom Lane wrote: >>> Personally I'd prefer to go to the all-lengths approach, but a large >>> part of that comes from a subjective assessment that the hybrid approach >>> is too messy. Others might well disagree. > >> ... So, that extraction test is about 1% *slower* than the basic Tom Lane >> lengths-only patch, and still 80% slower than original JSONB. And it's >> the same size as the lengths-only version. > > Since it's looking like this might be the direction we want to go, I took > the time to flesh out my proof-of-concept patch. The attached version > takes care of cosmetic issues (like fixing the comments), and includes > code to avoid O(N^2) penalties in findJsonbValueFromContainer and > JsonbIteratorNext. I'm not sure whether those changes will help > noticeably on Josh's test case; for me, they seemed worth making, but > they do not bring the code back to full speed parity with the all-offsets > version. But as we've been discussing, it seems likely that those costs > would be swamped by compression and I/O considerations in most scenarios > with large documents; and of course for small documents it hardly matters. Table sizes and extraction times are unchanged from the prior patch based on my workload. We should be comparing all-lengths vs length-and-offset maybe using another workload as well ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
What data are you using right now Josh?
There's the github archive http://www.githubarchive.org/
Here's some sample data https://gist.github.com/igrigorik/2017462
--
Arthur Silva
On Wed, Aug 20, 2014 at 6:09 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/20/2014 08:29 AM, Tom Lane wrote:
>> Josh Berkus <josh@agliodbs.com> writes:
>>> On 08/15/2014 04:19 PM, Tom Lane wrote:
>>>> Personally I'd prefer to go to the all-lengths approach, but a large
>>>> part of that comes from a subjective assessment that the hybrid approach
>>>> is too messy. Others might well disagree.
>>
>>> ... So, that extraction test is about 1% *slower* than the basic Tom Lane
>>> lengths-only patch, and still 80% slower than original JSONB. And it's
>>> the same size as the lengths-only version.
>>
>> Since it's looking like this might be the direction we want to go, I took
>> the time to flesh out my proof-of-concept patch. The attached version
>> takes care of cosmetic issues (like fixing the comments), and includes
>> code to avoid O(N^2) penalties in findJsonbValueFromContainer and
>> JsonbIteratorNext. I'm not sure whether those changes will help
>> noticeably on Josh's test case; for me, they seemed worth making, but
>> they do not bring the code back to full speed parity with the all-offsets
>> version. But as we've been discussing, it seems likely that those costs
>> would be swamped by compression and I/O considerations in most scenarios
>> with large documents; and of course for small documents it hardly matters.
>
> Table sizes and extraction times are unchanged from the prior patch
> based on my workload.
>
> We should be comparing all-lengths vs length-and-offset maybe using
> another workload as well ...
On 08/20/2014 03:42 PM, Arthur Silva wrote: > What data are you using right now Josh? The same data as upthread. Can you test the three patches (9.4 head, 9.4 with Tom's cleanup of Heikki's patch, and 9.4 with Tom's latest lengths-only) on your workload? I'm concerned that my workload is unusual and don't want us to make this decision based entirely on it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Aug 21, 2014 at 6:20 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/20/2014 03:42 PM, Arthur Silva wrote:
>> What data are you using right now Josh?
>
> The same data as upthread.
>
> Can you test the three patches (9.4 head, 9.4 with Tom's cleanup of
> Heikki's patch, and 9.4 with Tom's latest lengths-only) on your workload?
>
> I'm concerned that my workload is unusual and don't want us to make this
> decision based entirely on it.
Here's my test results so far with the github archive data.
It's important to keep in mind that the PushEvent objects that I use in the queries contain only a small number of keys (8, to be precise), so these tests don't really stress the changed code.
Anyway, in this dataset (with the small objects) the all-lengths patch provides only small compression savings, but the overhead is also minimal.
----------------
Test data: 610MB of Json -- 341969 items
Index size (jsonb_ops): 331MB
Test query 1: SELECT data->'url', data->'actor' FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 1 items: 169732
Test query 2: SELECT data FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 2 items:
----------------
HEAD (aka, all offsets) EXTENDED
Size: 374MB
Toast Size: 145MB
Test query 1 runtime: 680ms
Test query 2 runtime: 405ms
----------------
HEAD (aka, all offsets) EXTERNAL
Size: 366MB
Toast Size: 333MB
Test query 1 runtime: 505ms
Test query 2 runtime: 350ms
----------------
All Lengths (Tom Lane patch) EXTENDED
Size: 379MB
Toast Size: 108MB
Test query 1 runtime: 720ms
Test query 2 runtime: 420ms
----------------
All Lengths (Tom Lane patch) EXTERNAL
Size: 366MB
Toast Size: 333MB
Test query 1 runtime: 525ms
Test query 2 runtime: 355ms
--
Arthur Silva
On 08/16/2014 02:19 AM, Tom Lane wrote: > I think the realistic alternatives at this point are either to > switch to all-lengths as in my test patch, or to use the hybrid approach > of Heikki's test patch. IMO the major attraction of Heikki's patch > is that it'd be upward compatible with existing beta installations, > ie no initdb required (but thus, no opportunity to squeeze in a version > identifier either). It's not showing up terribly well in the performance > tests I've been doing --- it's about halfway between HEAD and my patch on > that extract-a-key-from-a-PLAIN-stored-column test. But, just as with my > patch, there are things that could be done to micro-optimize it by > touching a bit more code. > > I did some quick stats comparing compressed sizes for the delicio.us > data, printing quartiles as per Josh's lead: > > all-lengths {440,569,609,655,1257} > Heikki's patch {456,582,624,671,1274} > HEAD {493,636,684,744,1485} > > (As before, this is pg_column_size of the jsonb within a table whose rows > are wide enough to force tuptoaster.c to try to compress the jsonb; > otherwise many of these values wouldn't get compressed.) These documents > don't have enough keys to trigger the first_success_by issue, so that > HEAD doesn't look too awful, but still there's about an 11% gain from > switching from offsets to lengths. Heikki's method captures much of > that but not all. > > Personally I'd prefer to go to the all-lengths approach, but a large > part of that comes from a subjective assessment that the hybrid approach > is too messy. Others might well disagree. It's not too pretty, no. But it would be nice to not have to make a tradeoff between lookup speed and compressibility. Yet another idea is to store all lengths, but add an additional array of offsets to JsonbContainer. The array would contain the offset of, say, every 16th element. It would be very small compared to the lengths array, but would greatly speed up random access on a large array/object. - Heikki
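As a concrete sketch of that idea (the stride of 16 and every name below are hypothetical, not taken from any posted patch): keep the per-child lengths as in the all-lengths format, plus a small side array holding the starting offset of every 16th child, so that a random access jumps to the nearest stored offset and then adds at most 15 lengths.

#include <stdio.h>
#include <stdint.h>

#define STRIDE 16				/* hypothetical sampling interval */

/*
 * Starting offset of child 'index', given the per-child data lengths plus
 * a 'sampled' array holding the starting offset of every STRIDE'th child.
 */
static uint32_t
child_offset(const uint32_t *lengths, const uint32_t *sampled, int index)
{
	uint32_t	off = sampled[index / STRIDE];	/* jump to nearest sample */
	int			i;

	for (i = index - index % STRIDE; i < index; i++)
		off += lengths[i];		/* at most STRIDE - 1 additions */
	return off;
}

int
main(void)
{
	uint32_t	lengths[100];
	uint32_t	sampled[100 / STRIDE + 1];
	uint32_t	off = 0;
	int			i;

	for (i = 0; i < 100; i++)
	{
		if (i % STRIDE == 0)
			sampled[i / STRIDE] = off;	/* record every STRIDE'th offset */
		lengths[i] = 10 + i % 3;		/* arbitrary example lengths */
		off += lengths[i];
	}

	printf("offset of child 57 = %u\n",
		   (unsigned) child_offset(lengths, sampled, 57));
	return 0;
}

The side array costs only one word per STRIDE children, so the bulk of the entry data would still be the small, repetitive lengths that compress well.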
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 08/16/2014 02:19 AM, Tom Lane wrote: >> I think the realistic alternatives at this point are either to >> switch to all-lengths as in my test patch, or to use the hybrid approach >> of Heikki's test patch. ... >> Personally I'd prefer to go to the all-lengths approach, but a large >> part of that comes from a subjective assessment that the hybrid approach >> is too messy. Others might well disagree. > It's not too pretty, no. But it would be nice to not have to make a > tradeoff between lookup speed and compressibility. > Yet another idea is to store all lengths, but add an additional array of > offsets to JsonbContainer. The array would contain the offset of, say, > every 16th element. It would be very small compared to the lengths > array, but would greatly speed up random access on a large array/object. That does nothing to address my basic concern about the patch, which is that it's too complicated and therefore bug-prone. Moreover, it'd lose on-disk compatibility which is really the sole saving grace of the proposal. My feeling about it at this point is that the apparent speed gain from using offsets is illusory: in practically all real-world cases where there are enough keys or array elements for it to matter, costs associated with compression (or rather failure to compress) will dominate any savings we get from offset-assisted lookups. I agree that the evidence for this opinion is pretty thin ... but the evidence against it is nonexistent. regards, tom lane
On 08/26/2014 07:51 AM, Tom Lane wrote:
> My feeling about it at this point is that the apparent speed gain from
> using offsets is illusory: in practically all real-world cases where there
> are enough keys or array elements for it to matter, costs associated with
> compression (or rather failure to compress) will dominate any savings we
> get from offset-assisted lookups. I agree that the evidence for this
> opinion is pretty thin ... but the evidence against it is nonexistent.

Well, I have shown one test case which shows where lengths is a net
penalty. However, for that to be the case, you have to have the
following conditions *all* be true:

* lots of top-level keys
* short values
* rows which are on the borderline for TOAST
* table which fits in RAM

... so that's a "special case" and if it's sub-optimal, no biggie. Also,
it's not like it's an order-of-magnitude slower.

Anyway, I called for feedback on my blog, and have gotten some:
http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > Anyway, I called for feedback on by blog, and have gotten some: > http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( regards, tom lane
On 08/26/2014 11:40 AM, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: >> Anyway, I called for feedback on my blog, and have gotten some: >> http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html > > I was hoping you'd get some useful data from that, but so far it seems > like a rehash of points made in the on-list thread :-( > > regards, tom lane yah, me too. :-( Unfortunately even the outside commenters don't seem to understand that storage size *is* related to speed; it's exchanging I/O speed for CPU speed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > On 08/26/2014 11:40 AM, Tom Lane wrote: >> I was hoping you'd get some useful data from that, but so far it seems >> like a rehash of points made in the on-list thread :-( > Unfortunately even the outside commentors don't seem to understand that > storage size *is* related to speed, it's exchanging I/O speed for CPU speed. Yeah, exactly. Given current hardware trends, data compression is becoming more of a win not less as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. I wish it were cache-friendly too, per the upthread tangent about having to fetch keys from all over the place within a large JSON object. ... and while I was typing that sentence, lightning struck. The existing arrangement of object subfields with keys and values interleaved is just plain dumb. We should rearrange that as all the keys in order, then all the values in the same order. Then the keys are naturally adjacent in memory and object-key searches become much more cache-friendly: you probably touch most of the key portion of the object, but none of the values portion, until you know exactly what part of the latter to fetch. This approach might complicate the lookup logic marginally but I bet not very much; and it will be a huge help if we ever want to do smart access to EXTERNAL (non-compressed) JSON values. I will go prototype that just to see how much code rearrangement is required. regards, tom lane
On 2014-08-26 15:01:27 -0400, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > On 08/26/2014 11:40 AM, Tom Lane wrote: > >> I was hoping you'd get some useful data from that, but so far it seems > >> like a rehash of points made in the on-list thread :-( > > > Unfortunately even the outside commenters don't seem to understand that > > storage size *is* related to speed; it's exchanging I/O speed for CPU speed. > > Yeah, exactly. Given current hardware trends, data compression is > becoming more of a win not less as time goes on: CPU cycles are cheap > even compared to main memory access, let alone mass storage. So I'm > thinking we want to adopt a compression-friendly data format even if > it measures out as a small loss currently. On the other hand, the majority of databases these days fit into main memory due to its increasing size, and postgres is more often CPU- than IO-bound. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 26 August 2014 11:34, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/26/2014 07:51 AM, Tom Lane wrote:
>> My feeling about it at this point is that the apparent speed gain from
>> using offsets is illusory: in practically all real-world cases where there
>> are enough keys or array elements for it to matter, costs associated with
>> compression (or rather failure to compress) will dominate any savings we
>> get from offset-assisted lookups. I agree that the evidence for this
>> opinion is pretty thin ... but the evidence against it is nonexistent.
>
> Well, I have shown one test case which shows where lengths is a net
> penalty. However, for that to be the case, you have to have the
> following conditions *all* be true:
>
> * lots of top-level keys
> * short values
> * rows which are on the borderline for TOAST
> * table which fits in RAM
>
> ... so that's a "special case" and if it's sub-optimal, no biggie. Also,
> it's not like it's an order-of-magnitude slower.
>
> Anyway, I called for feedback on my blog, and have gotten some:
> http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html
It would be really interesting to see your results with column STORAGE EXTERNAL for that benchmark. I think it is important to separate out the slowdown due to decompression now being needed vs. the slowdown inherent in the new format; we can always switch off compression on a per-column basis using STORAGE EXTERNAL.
My JSON data has smallish objects with a small number of keys; it barely compresses at all with the patch and shows results similar to Arthur's data. Across ~500K rows I get:
encoded=# select count(properties->>'submitted_by') from compressed;
count
--------
431948
(1 row)
Time: 250.512 ms
encoded=# select count(properties->>'submitted_by') from uncompressed;
count
--------
431948
(1 row)
Time: 218.552 ms
Laurence
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-08-26 15:01:27 -0400, Tom Lane wrote: >> Yeah, exactly. Given current hardware trends, data compression is >> becoming more of a win not less as time goes on: CPU cycles are cheap >> even compared to main memory access, let alone mass storage. So I'm >> thinking we want to adopt a compression-friendly data format even if >> it measures out as a small loss currently. > On the other hand the majority of databases these day fit into main > memory due to its increasing sizes and postgres is more often CPU than > IO bound. Well, better data compression helps make that true ;-). And don't forget cache effects; actual main memory is considered slow these days. regards, tom lane
On Tue, Aug 26, 2014 at 4:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >> On 08/26/2014 11:40 AM, Tom Lane wrote: >>> I was hoping you'd get some useful data from that, but so far it seems >>> like a rehash of points made in the on-list thread :-( > >> Unfortunately even the outside commentors don't seem to understand that >> storage size *is* related to speed, it's exchanging I/O speed for CPU speed. > > Yeah, exactly. Given current hardware trends, data compression is > becoming more of a win not less as time goes on: CPU cycles are cheap > even compared to main memory access, let alone mass storage. So I'm > thinking we want to adopt a compression-friendly data format even if > it measures out as a small loss currently. > > I wish it were cache-friendly too, per the upthread tangent about having > to fetch keys from all over the place within a large JSON object. What about my earlier proposal? An in-memory compressed representation would greatly help cache locality, more so if you pack keys as you mentioned.
On 2014-08-26 15:17:13 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-08-26 15:01:27 -0400, Tom Lane wrote: > >> Yeah, exactly. Given current hardware trends, data compression is > >> becoming more of a win not less as time goes on: CPU cycles are cheap > >> even compared to main memory access, let alone mass storage. So I'm > >> thinking we want to adopt a compression-friendly data format even if > >> it measures out as a small loss currently. > > > On the other hand the majority of databases these day fit into main > > memory due to its increasing sizes and postgres is more often CPU than > > IO bound. > > Well, better data compression helps make that true ;-). People disable toast compression though because it results in better performance :(. Part of that could be fixed by a faster compression method, part of it by decompressing less often. But still. > And don't forget cache effects; actual main memory is considered slow > these days. Right. But that plays the other way round too. Compressed datums need to be copied to be accessed uncompressed. Whereas at least in comparison to inline compressed datums that's not necessary. Anyway, that's just to say that I don't really agree that CPU overhead is a worthy price to pay for storage efficiency if the gains are small. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Aug 26, 2014 at 12:27 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Anyway, that's just to say that I don't really agree that CPU overhead > is a worthy price to pay for storage efficiency if the gains are small. +1 -- Peter Geoghegan
On 08/26/2014 12:27 PM, Andres Freund wrote: > Anyway, that's just to say that I don't really agree that CPU overhead > is a worthy price to pay for storage efficiency if the gains are small. But in this case the gains aren't small; we're talking up to 60% smaller storage. Testing STORAGE EXTENDED soon. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
I wrote: > I wish it were cache-friendly too, per the upthread tangent about having > to fetch keys from all over the place within a large JSON object. > ... and while I was typing that sentence, lightning struck. The existing > arrangement of object subfields with keys and values interleaved is just > plain dumb. We should rearrange that as all the keys in order, then all > the values in the same order. Then the keys are naturally adjacent in > memory and object-key searches become much more cache-friendly: you > probably touch most of the key portion of the object, but none of the > values portion, until you know exactly what part of the latter to fetch. > This approach might complicate the lookup logic marginally but I bet not > very much; and it will be a huge help if we ever want to do smart access > to EXTERNAL (non-compressed) JSON values. > I will go prototype that just to see how much code rearrangement is > required. This looks pretty good from a coding point of view. I have not had time yet to see if it affects the speed of the benchmark cases we've been trying. I suspect that it won't make much difference in them. I think if we do decide to make an on-disk format change, we should seriously consider including this change. The same concept could be applied to offset-based storage of course, although I rather doubt that we'd make that combination of choices since it would be giving up on-disk compatibility for benefits that are mostly in the future. Attached are two patches: one is a "delta" against the last jsonb-lengths patch I posted, and the other is a "merged" patch showing the total change from HEAD, for ease of application. regards, tom lane diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index e47eaea..4e7fe67 100644 *** a/src/backend/utils/adt/jsonb_util.c --- b/src/backend/utils/adt/jsonb_util.c *************** *** 26,33 **** * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits * reserved for that in the JsonbContainer.header field. * ! * (the total size of an array's elements is also limited by JENTRY_LENMASK, ! * but we're not concerned about that here) */ #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK)) #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK)) --- 26,33 ---- * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits * reserved for that in the JsonbContainer.header field. * ! * (The total size of an array's or object's elements is also limited by ! * JENTRY_LENMASK, but we're not concerned about that here.) */ #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK)) #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK)) *************** findJsonbValueFromContainer(JsonbContain *** 294,303 **** { JEntry *children = container->children; int count = (container->header & JB_CMASK); ! JsonbValue *result = palloc(sizeof(JsonbValue)); Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0); if (flags & JB_FARRAY & container->header) { char *base_addr = (char *) (children + count); --- 294,309 ---- { JEntry *children = container->children; int count = (container->header & JB_CMASK); ! 
JsonbValue *result; Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0); + /* Quick out without a palloc cycle if object/array is empty */ + if (count <= 0) + return NULL; + + result = palloc(sizeof(JsonbValue)); + if (flags & JB_FARRAY & container->header) { char *base_addr = (char *) (children + count); *************** findJsonbValueFromContainer(JsonbContain *** 323,329 **** char *base_addr = (char *) (children + count * 2); uint32 *offsets; uint32 lastoff; ! int lastoffpos; uint32 stopLow = 0, stopHigh = count; --- 329,335 ---- char *base_addr = (char *) (children + count * 2); uint32 *offsets; uint32 lastoff; ! int i; uint32 stopLow = 0, stopHigh = count; *************** findJsonbValueFromContainer(JsonbContain *** 332,379 **** /* * We use a cache to avoid redundant getJsonbOffset() computations ! * inside the search loop. Note that count may well be zero at this ! * point; to avoid an ugly special case for initializing lastoff and ! * lastoffpos, we allocate one extra array element. */ ! offsets = (uint32 *) palloc((count * 2 + 1) * sizeof(uint32)); ! offsets[0] = lastoff = 0; ! lastoffpos = 0; /* Binary search on object/pair keys *only* */ while (stopLow < stopHigh) { uint32 stopMiddle; - int index; int difference; JsonbValue candidate; stopMiddle = stopLow + (stopHigh - stopLow) / 2; - /* - * Compensate for the fact that we're searching through pairs (not - * entries). - */ - index = stopMiddle * 2; - - /* Update the offsets cache through at least index+1 */ - while (lastoffpos <= index) - { - lastoff += JBE_LEN(children, lastoffpos); - offsets[++lastoffpos] = lastoff; - } - candidate.type = jbvString; ! candidate.val.string.val = base_addr + offsets[index]; ! candidate.val.string.len = JBE_LEN(children, index); difference = lengthCompareJsonbStringValue(&candidate, key); if (difference == 0) { ! /* Found our key, return value */ ! fillJsonbValue(children, index + 1, ! base_addr, offsets[index + 1], result); pfree(offsets); --- 338,383 ---- /* * We use a cache to avoid redundant getJsonbOffset() computations ! * inside the search loop. The entire cache can be filled immediately ! * since we expect to need the last offset for value access. (This ! * choice could lose if the key is not present, but avoiding extra ! * logic inside the search loop probably makes up for that.) */ ! offsets = (uint32 *) palloc(count * sizeof(uint32)); ! lastoff = 0; ! for (i = 0; i < count; i++) ! { ! offsets[i] = lastoff; ! lastoff += JBE_LEN(children, i); ! } ! /* lastoff now has the offset of the first value item */ /* Binary search on object/pair keys *only* */ while (stopLow < stopHigh) { uint32 stopMiddle; int difference; JsonbValue candidate; stopMiddle = stopLow + (stopHigh - stopLow) / 2; candidate.type = jbvString; ! candidate.val.string.val = base_addr + offsets[stopMiddle]; ! candidate.val.string.len = JBE_LEN(children, stopMiddle); difference = lengthCompareJsonbStringValue(&candidate, key); if (difference == 0) { ! /* Found our key, return corresponding value */ ! int index = stopMiddle + count; ! ! /* navigate to appropriate offset */ ! for (i = count; i < index; i++) ! lastoff += JBE_LEN(children, i); ! ! fillJsonbValue(children, index, ! 
base_addr, lastoff, result); pfree(offsets); *************** recurse: *** 730,735 **** --- 734,740 ---- val->val.array.rawScalar = (*it)->isScalar; (*it)->curIndex = 0; (*it)->curDataOffset = 0; + (*it)->curValueOffset = 0; /* not actually used */ /* Set state for next call */ (*it)->state = JBI_ARRAY_ELEM; return WJB_BEGIN_ARRAY; *************** recurse: *** 780,785 **** --- 785,792 ---- */ (*it)->curIndex = 0; (*it)->curDataOffset = 0; + (*it)->curValueOffset = getJsonbOffset((*it)->children, + (*it)->nElems); /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; return WJB_BEGIN_OBJECT; *************** recurse: *** 799,805 **** else { /* Return key of a key/value pair. */ ! fillJsonbValue((*it)->children, (*it)->curIndex * 2, (*it)->dataProper, (*it)->curDataOffset, val); if (val->type != jbvString) --- 806,812 ---- else { /* Return key of a key/value pair. */ ! fillJsonbValue((*it)->children, (*it)->curIndex, (*it)->dataProper, (*it)->curDataOffset, val); if (val->type != jbvString) *************** recurse: *** 814,828 **** /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; ! (*it)->curDataOffset += JBE_LEN((*it)->children, ! (*it)->curIndex * 2); ! ! fillJsonbValue((*it)->children, (*it)->curIndex * 2 + 1, ! (*it)->dataProper, (*it)->curDataOffset, val); (*it)->curDataOffset += JBE_LEN((*it)->children, ! (*it)->curIndex * 2 + 1); (*it)->curIndex++; /* --- 821,834 ---- /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; ! fillJsonbValue((*it)->children, (*it)->curIndex + (*it)->nElems, ! (*it)->dataProper, (*it)->curValueOffset, val); (*it)->curDataOffset += JBE_LEN((*it)->children, ! (*it)->curIndex); ! (*it)->curValueOffset += JBE_LEN((*it)->children, ! (*it)->curIndex + (*it)->nElems); (*it)->curIndex++; /* *************** convertJsonbObject(StringInfo buffer, JE *** 1509,1514 **** --- 1515,1524 ---- /* Reserve space for the JEntries of the keys and values. */ metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.object.nPairs * 2); + /* + * Iterate over the keys, then over the values, since that is the ordering + * we want in the on-disk representation. + */ totallen = 0; for (i = 0; i < val->val.object.nPairs; i++) { *************** convertJsonbObject(StringInfo buffer, JE *** 1529,1534 **** --- 1539,1561 ---- metaoffset += sizeof(JEntry); /* + * Bail out if total variable-length data exceeds what will fit in a + * JEntry length field. We check this in each iteration, not just + * once at the end, to forestall possible integer overflow. + */ + if (totallen > JENTRY_LENMASK) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("total size of jsonb object elements exceeds the maximum of %u bytes", + JENTRY_LENMASK))); + } + for (i = 0; i < val->val.object.nPairs; i++) + { + JsonbPair *pair = &val->val.object.pairs[i]; + int len; + JEntry meta; + + /* * Convert value, producing a JEntry and appending its variable-length * data to buffer */ *************** convertJsonbObject(StringInfo buffer, JE *** 1543,1551 **** /* * Bail out if total variable-length data exceeds what will fit in a * JEntry length field. We check this in each iteration, not just ! * once at the end, to forestall possible integer overflow. But it ! * should be sufficient to check once per iteration, since ! * JENTRY_LENMASK is several bits narrower than int. */ if (totallen > JENTRY_LENMASK) ereport(ERROR, --- 1570,1576 ---- /* * Bail out if total variable-length data exceeds what will fit in a * JEntry length field. We check this in each iteration, not just ! 
* once at the end, to forestall possible integer overflow. */ if (totallen > JENTRY_LENMASK) ereport(ERROR, diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h index b9a4314..f9472af 100644 *** a/src/include/utils/jsonb.h --- b/src/include/utils/jsonb.h *************** typedef uint32 JEntry; *** 149,156 **** /* * A jsonb array or object node, within a Jsonb Datum. * ! * An array has one child for each element. An object has two children for ! * each key/value pair. */ typedef struct JsonbContainer { --- 149,160 ---- /* * A jsonb array or object node, within a Jsonb Datum. * ! * An array has one child for each element, stored in array order. ! * ! * An object has two children for each key/value pair. The keys all appear ! * first, in key sort order; then the values appear, in an order matching the ! * key order. This arrangement keeps the keys compact in memory, making a ! * search for a particular key more cache-friendly. */ typedef struct JsonbContainer { *************** typedef struct JsonbContainer *** 162,169 **** } JsonbContainer; /* flags for the header-field in JsonbContainer */ ! #define JB_CMASK 0x0FFFFFFF ! #define JB_FSCALAR 0x10000000 #define JB_FOBJECT 0x20000000 #define JB_FARRAY 0x40000000 --- 166,173 ---- } JsonbContainer; /* flags for the header-field in JsonbContainer */ ! #define JB_CMASK 0x0FFFFFFF /* mask for count field */ ! #define JB_FSCALAR 0x10000000 /* flag bits */ #define JB_FOBJECT 0x20000000 #define JB_FARRAY 0x40000000 *************** struct JsonbValue *** 238,255 **** (jsonbval)->type <= jbvBool) /* ! * Pair within an Object. * ! * Pairs with duplicate keys are de-duplicated. We store the order for the ! * benefit of doing so in a well-defined way with respect to the original ! * observed order (which is "last observed wins"). This is only used briefly ! * when originally constructing a Jsonb. */ struct JsonbPair { JsonbValue key; /* Must be a jbvString */ JsonbValue value; /* May be of any type */ ! uint32 order; /* preserves order of pairs with equal keys */ }; /* Conversion state used when parsing Jsonb from text, or for type coercion */ --- 242,261 ---- (jsonbval)->type <= jbvBool) /* ! * Key/value pair within an Object. * ! * This struct type is only used briefly while constructing a Jsonb; it is ! * *not* the on-disk representation. ! * ! * Pairs with duplicate keys are de-duplicated. We store the originally ! * observed pair ordering for the purpose of removing duplicates in a ! * well-defined way (which is "last observed wins"). */ struct JsonbPair { JsonbValue key; /* Must be a jbvString */ JsonbValue value; /* May be of any type */ ! uint32 order; /* Pair's index in original sequence */ }; /* Conversion state used when parsing Jsonb from text, or for type coercion */ *************** typedef struct JsonbIterator *** 284,295 **** /* Data proper. This points just past end of children array */ char *dataProper; ! /* Current item in buffer (up to nElems, but must * 2 for objects) */ int curIndex; /* Data offset corresponding to current item */ uint32 curDataOffset; /* Private state */ JsonbIterState state; --- 290,308 ---- /* Data proper. This points just past end of children array */ char *dataProper; ! /* Current item in buffer (up to nElems) */ int curIndex; /* Data offset corresponding to current item */ uint32 curDataOffset; + /* + * If the container is an object, we want to return keys and values + * alternately; so curDataOffset points to the current key, and + * curValueOffset points to the current value. 
+ */ + uint32 curValueOffset; + /* Private state */ JsonbIterState state; diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c index 2fd87fc..456011a 100644 *** a/src/backend/utils/adt/jsonb.c --- b/src/backend/utils/adt/jsonb.c *************** jsonb_from_cstring(char *json, int len) *** 196,207 **** static size_t checkStringLen(size_t len) { ! if (len > JENTRY_POSMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("string too long to represent as jsonb string"), errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.", ! JENTRY_POSMASK))); return len; } --- 196,207 ---- static size_t checkStringLen(size_t len) { ! if (len > JENTRY_LENMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("string too long to represent as jsonb string"), errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.", ! JENTRY_LENMASK))); return len; } diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index 04f35bf..4e7fe67 100644 *** a/src/backend/utils/adt/jsonb_util.c --- b/src/backend/utils/adt/jsonb_util.c *************** *** 26,40 **** * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits * reserved for that in the JsonbContainer.header field. * ! * (the total size of an array's elements is also limited by JENTRY_POSMASK, ! * but we're not concerned about that here) */ #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK)) #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK)) ! static void fillJsonbValue(JEntry *array, int index, char *base_addr, JsonbValue *result); ! static bool equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b); static int compareJsonbScalarValue(JsonbValue *a, JsonbValue *b); static Jsonb *convertToJsonb(JsonbValue *val); static void convertJsonbValue(StringInfo buffer, JEntry *header, JsonbValue *val, int level); --- 26,41 ---- * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits * reserved for that in the JsonbContainer.header field. * ! * (The total size of an array's or object's elements is also limited by ! * JENTRY_LENMASK, but we're not concerned about that here.) */ #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK)) #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK)) ! static void fillJsonbValue(JEntry *children, int index, ! char *base_addr, uint32 offset, JsonbValue *result); ! static bool equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b); static int compareJsonbScalarValue(JsonbValue *a, JsonbValue *b); static Jsonb *convertToJsonb(JsonbValue *val); static void convertJsonbValue(StringInfo buffer, JEntry *header, JsonbValue *val, int level); *************** static void convertJsonbArray(StringInfo *** 42,48 **** static void convertJsonbObject(StringInfo buffer, JEntry *header, JsonbValue *val, int level); static void convertJsonbScalar(StringInfo buffer, JEntry *header, JsonbValue *scalarVal); ! static int reserveFromBuffer(StringInfo buffer, int len); static void appendToBuffer(StringInfo buffer, const char *data, int len); static void copyToBuffer(StringInfo buffer, int offset, const char *data, int len); static short padBufferToInt(StringInfo buffer); --- 43,49 ---- static void convertJsonbObject(StringInfo buffer, JEntry *header, JsonbValue *val, int level); static void convertJsonbScalar(StringInfo buffer, JEntry *header, JsonbValue *scalarVal); ! 
static int reserveFromBuffer(StringInfo buffer, int len); static void appendToBuffer(StringInfo buffer, const char *data, int len); static void copyToBuffer(StringInfo buffer, int offset, const char *data, int len); static short padBufferToInt(StringInfo buffer); *************** JsonbValueToJsonb(JsonbValue *val) *** 108,113 **** --- 109,135 ---- } /* + * Get the offset of the variable-length portion of a Jsonb node within + * the variable-length-data part of its container. The node is identified + * by index within the container's JEntry array. + * + * We do this by adding up the lengths of all the previous nodes' + * variable-length portions. It's best to avoid using this function when + * iterating through all the nodes in a container, since that would result + * in O(N^2) work. + */ + uint32 + getJsonbOffset(const JEntry *ja, int index) + { + uint32 off = 0; + int i; + + for (i = 0; i < index; i++) + off += JBE_LEN(ja, i); + return off; + } + + /* * BT comparator worker function. Returns an integer less than, equal to, or * greater than zero, indicating whether a is less than, equal to, or greater * than b. Consistent with the requirements for a B-Tree operator class *************** compareJsonbContainers(JsonbContainer *a *** 201,207 **** * * If the two values were of the same container type, then there'd * have been a chance to observe the variation in the number of ! * elements/pairs (when processing WJB_BEGIN_OBJECT, say). They're * either two heterogeneously-typed containers, or a container and * some scalar type. * --- 223,229 ---- * * If the two values were of the same container type, then there'd * have been a chance to observe the variation in the number of ! * elements/pairs (when processing WJB_BEGIN_OBJECT, say). They're * either two heterogeneously-typed containers, or a container and * some scalar type. * *************** findJsonbValueFromContainer(JsonbContain *** 272,333 **** { JEntry *children = container->children; int count = (container->header & JB_CMASK); ! JsonbValue *result = palloc(sizeof(JsonbValue)); Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0); if (flags & JB_FARRAY & container->header) { char *base_addr = (char *) (children + count); int i; for (i = 0; i < count; i++) { ! fillJsonbValue(children, i, base_addr, result); if (key->type == result->type) { if (equalsJsonbScalarValue(key, result)) return result; } } } else if (flags & JB_FOBJECT & container->header) { /* Since this is an object, account for *Pairs* of Jentrys */ char *base_addr = (char *) (children + count * 2); uint32 stopLow = 0, ! stopMiddle; ! /* Object key past by caller must be a string */ Assert(key->type == jbvString); /* Binary search on object/pair keys *only* */ ! while (stopLow < count) { ! int index; int difference; JsonbValue candidate; ! /* ! * Note how we compensate for the fact that we're iterating ! * through pairs (not entries) throughout. ! */ ! stopMiddle = stopLow + (count - stopLow) / 2; ! ! index = stopMiddle * 2; candidate.type = jbvString; ! candidate.val.string.val = base_addr + JBE_OFF(children, index); ! candidate.val.string.len = JBE_LEN(children, index); difference = lengthCompareJsonbStringValue(&candidate, key); if (difference == 0) { ! /* Found our key, return value */ ! fillJsonbValue(children, index + 1, base_addr, result); return result; } else --- 294,386 ---- { JEntry *children = container->children; int count = (container->header & JB_CMASK); ! 
JsonbValue *result; Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0); + /* Quick out without a palloc cycle if object/array is empty */ + if (count <= 0) + return NULL; + + result = palloc(sizeof(JsonbValue)); + if (flags & JB_FARRAY & container->header) { char *base_addr = (char *) (children + count); + uint32 offset = 0; int i; for (i = 0; i < count; i++) { ! fillJsonbValue(children, i, base_addr, offset, result); if (key->type == result->type) { if (equalsJsonbScalarValue(key, result)) return result; } + + offset += JBE_LEN(children, i); } } else if (flags & JB_FOBJECT & container->header) { /* Since this is an object, account for *Pairs* of Jentrys */ char *base_addr = (char *) (children + count * 2); + uint32 *offsets; + uint32 lastoff; + int i; uint32 stopLow = 0, ! stopHigh = count; ! /* Object key passed by caller must be a string */ Assert(key->type == jbvString); + /* + * We use a cache to avoid redundant getJsonbOffset() computations + * inside the search loop. The entire cache can be filled immediately + * since we expect to need the last offset for value access. (This + * choice could lose if the key is not present, but avoiding extra + * logic inside the search loop probably makes up for that.) + */ + offsets = (uint32 *) palloc(count * sizeof(uint32)); + lastoff = 0; + for (i = 0; i < count; i++) + { + offsets[i] = lastoff; + lastoff += JBE_LEN(children, i); + } + /* lastoff now has the offset of the first value item */ + /* Binary search on object/pair keys *only* */ ! while (stopLow < stopHigh) { ! uint32 stopMiddle; int difference; JsonbValue candidate; ! stopMiddle = stopLow + (stopHigh - stopLow) / 2; candidate.type = jbvString; ! candidate.val.string.val = base_addr + offsets[stopMiddle]; ! candidate.val.string.len = JBE_LEN(children, stopMiddle); difference = lengthCompareJsonbStringValue(&candidate, key); if (difference == 0) { ! /* Found our key, return corresponding value */ ! int index = stopMiddle + count; ! ! /* navigate to appropriate offset */ ! for (i = count; i < index; i++) ! lastoff += JBE_LEN(children, i); ! ! fillJsonbValue(children, index, ! base_addr, lastoff, ! result); + pfree(offsets); return result; } else *************** findJsonbValueFromContainer(JsonbContain *** 335,343 **** if (difference < 0) stopLow = stopMiddle + 1; else ! count = stopMiddle; } } } /* Not found */ --- 388,398 ---- if (difference < 0) stopLow = stopMiddle + 1; else ! stopHigh = stopMiddle; } } + + pfree(offsets); } /* Not found */ *************** getIthJsonbValueFromContainer(JsonbConta *** 368,374 **** result = palloc(sizeof(JsonbValue)); ! fillJsonbValue(container->children, i, base_addr, result); return result; } --- 423,431 ---- result = palloc(sizeof(JsonbValue)); ! fillJsonbValue(container->children, i, base_addr, ! getJsonbOffset(container->children, i), ! result); return result; } *************** getIthJsonbValueFromContainer(JsonbConta *** 377,387 **** * A helper function to fill in a JsonbValue to represent an element of an * array, or a key or value of an object. * * A nested array or object will be returned as jbvBinary, ie. it won't be * expanded. */ static void ! fillJsonbValue(JEntry *children, int index, char *base_addr, JsonbValue *result) { JEntry entry = children[index]; --- 434,450 ---- * A helper function to fill in a JsonbValue to represent an element of an * array, or a key or value of an object. * + * The node's JEntry is at children[index], and its variable-length data + * is at base_addr + offset. 
We make the caller determine the offset since + * in many cases the caller can amortize the work across multiple children. + * * A nested array or object will be returned as jbvBinary, ie. it won't be * expanded. */ static void ! fillJsonbValue(JEntry *children, int index, ! char *base_addr, uint32 offset, ! JsonbValue *result) { JEntry entry = children[index]; *************** fillJsonbValue(JEntry *children, int ind *** 392,405 **** else if (JBE_ISSTRING(entry)) { result->type = jbvString; ! result->val.string.val = base_addr + JBE_OFF(children, index); ! result->val.string.len = JBE_LEN(children, index); Assert(result->val.string.len >= 0); } else if (JBE_ISNUMERIC(entry)) { result->type = jbvNumeric; ! result->val.numeric = (Numeric) (base_addr + INTALIGN(JBE_OFF(children, index))); } else if (JBE_ISBOOL_TRUE(entry)) { --- 455,468 ---- else if (JBE_ISSTRING(entry)) { result->type = jbvString; ! result->val.string.val = base_addr + offset; ! result->val.string.len = JBE_LENFLD(entry); Assert(result->val.string.len >= 0); } else if (JBE_ISNUMERIC(entry)) { result->type = jbvNumeric; ! result->val.numeric = (Numeric) (base_addr + INTALIGN(offset)); } else if (JBE_ISBOOL_TRUE(entry)) { *************** fillJsonbValue(JEntry *children, int ind *** 415,422 **** { Assert(JBE_ISCONTAINER(entry)); result->type = jbvBinary; ! result->val.binary.data = (JsonbContainer *) (base_addr + INTALIGN(JBE_OFF(children, index))); ! result->val.binary.len = JBE_LEN(children, index) - (INTALIGN(JBE_OFF(children, index)) - JBE_OFF(children, index)); } } --- 478,486 ---- { Assert(JBE_ISCONTAINER(entry)); result->type = jbvBinary; ! /* Remove alignment padding from data pointer and len */ ! result->val.binary.data = (JsonbContainer *) (base_addr + INTALIGN(offset)); ! result->val.binary.len = JBE_LENFLD(entry) - (INTALIGN(offset) - offset); } } *************** recurse: *** 668,680 **** * a full conversion */ val->val.array.rawScalar = (*it)->isScalar; ! (*it)->i = 0; /* Set state for next call */ (*it)->state = JBI_ARRAY_ELEM; return WJB_BEGIN_ARRAY; case JBI_ARRAY_ELEM: ! if ((*it)->i >= (*it)->nElems) { /* * All elements within array already processed. Report this --- 732,746 ---- * a full conversion */ val->val.array.rawScalar = (*it)->isScalar; ! (*it)->curIndex = 0; ! (*it)->curDataOffset = 0; ! (*it)->curValueOffset = 0; /* not actually used */ /* Set state for next call */ (*it)->state = JBI_ARRAY_ELEM; return WJB_BEGIN_ARRAY; case JBI_ARRAY_ELEM: ! if ((*it)->curIndex >= (*it)->nElems) { /* * All elements within array already processed. Report this *************** recurse: *** 686,692 **** return WJB_END_ARRAY; } ! fillJsonbValue((*it)->children, (*it)->i++, (*it)->dataProper, val); if (!IsAJsonbScalar(val) && !skipNested) { --- 752,763 ---- return WJB_END_ARRAY; } ! fillJsonbValue((*it)->children, (*it)->curIndex, ! (*it)->dataProper, (*it)->curDataOffset, ! val); ! ! (*it)->curDataOffset += JBE_LEN((*it)->children, (*it)->curIndex); ! (*it)->curIndex++; if (!IsAJsonbScalar(val) && !skipNested) { *************** recurse: *** 697,704 **** else { /* ! * Scalar item in array, or a container and caller didn't ! * want us to recurse into it. */ return WJB_ELEM; } --- 768,775 ---- else { /* ! * Scalar item in array, or a container and caller didn't want ! * us to recurse into it. */ return WJB_ELEM; } *************** recurse: *** 712,724 **** * v->val.object.pairs is not actually set, because we aren't * doing a full conversion */ ! 
(*it)->i = 0; /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; return WJB_BEGIN_OBJECT; case JBI_OBJECT_KEY: ! if ((*it)->i >= (*it)->nElems) { /* * All pairs within object already processed. Report this to --- 783,798 ---- * v->val.object.pairs is not actually set, because we aren't * doing a full conversion */ ! (*it)->curIndex = 0; ! (*it)->curDataOffset = 0; ! (*it)->curValueOffset = getJsonbOffset((*it)->children, ! (*it)->nElems); /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; return WJB_BEGIN_OBJECT; case JBI_OBJECT_KEY: ! if ((*it)->curIndex >= (*it)->nElems) { /* * All pairs within object already processed. Report this to *************** recurse: *** 732,738 **** else { /* Return key of a key/value pair. */ ! fillJsonbValue((*it)->children, (*it)->i * 2, (*it)->dataProper, val); if (val->type != jbvString) elog(ERROR, "unexpected jsonb type as object key"); --- 806,814 ---- else { /* Return key of a key/value pair. */ ! fillJsonbValue((*it)->children, (*it)->curIndex, ! (*it)->dataProper, (*it)->curDataOffset, ! val); if (val->type != jbvString) elog(ERROR, "unexpected jsonb type as object key"); *************** recurse: *** 745,752 **** /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; ! fillJsonbValue((*it)->children, ((*it)->i++) * 2 + 1, ! (*it)->dataProper, val); /* * Value may be a container, in which case we recurse with new, --- 821,835 ---- /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; ! fillJsonbValue((*it)->children, (*it)->curIndex + (*it)->nElems, ! (*it)->dataProper, (*it)->curValueOffset, ! val); ! ! (*it)->curDataOffset += JBE_LEN((*it)->children, ! (*it)->curIndex); ! (*it)->curValueOffset += JBE_LEN((*it)->children, ! (*it)->curIndex + (*it)->nElems); ! (*it)->curIndex++; /* * Value may be a container, in which case we recurse with new, *************** reserveFromBuffer(StringInfo buffer, int *** 1209,1216 **** buffer->len += len; /* ! * Keep a trailing null in place, even though it's not useful for us; ! * it seems best to preserve the invariants of StringInfos. */ buffer->data[buffer->len] = '\0'; --- 1292,1299 ---- buffer->len += len; /* ! * Keep a trailing null in place, even though it's not useful for us; it ! * seems best to preserve the invariants of StringInfos. */ buffer->data[buffer->len] = '\0'; *************** convertToJsonb(JsonbValue *val) *** 1284,1291 **** /* * Note: the JEntry of the root is discarded. Therefore the root ! * JsonbContainer struct must contain enough information to tell what ! * kind of value it is. */ res = (Jsonb *) buffer.data; --- 1367,1374 ---- /* * Note: the JEntry of the root is discarded. Therefore the root ! * JsonbContainer struct must contain enough information to tell what kind ! * of value it is. */ res = (Jsonb *) buffer.data; *************** convertJsonbValue(StringInfo buffer, JEn *** 1315,1324 **** return; /* ! * A JsonbValue passed as val should never have a type of jbvBinary, ! * and neither should any of its sub-components. Those values will be ! * produced by convertJsonbArray and convertJsonbObject, the results of ! * which will not be passed back to this function as an argument. */ if (IsAJsonbScalar(val)) --- 1398,1407 ---- return; /* ! * A JsonbValue passed as val should never have a type of jbvBinary, and ! * neither should any of its sub-components. Those values will be produced ! * by convertJsonbArray and convertJsonbObject, the results of which will ! * not be passed back to this function as an argument. 
*/ if (IsAJsonbScalar(val)) *************** convertJsonbArray(StringInfo buffer, JEn *** 1340,1353 **** int totallen; uint32 header; ! /* Initialize pointer into conversion buffer at this level */ offset = buffer->len; padBufferToInt(buffer); /* ! * Construct the header Jentry, stored in the beginning of the variable- ! * length payload. */ header = val->val.array.nElems | JB_FARRAY; if (val->val.array.rawScalar) --- 1423,1437 ---- int totallen; uint32 header; ! /* Remember where variable-length data starts for this array */ offset = buffer->len; + /* Align to 4-byte boundary (any padding counts as part of my data) */ padBufferToInt(buffer); /* ! * Construct the header Jentry and store it in the beginning of the ! * variable-length payload. */ header = val->val.array.nElems | JB_FARRAY; if (val->val.array.rawScalar) *************** convertJsonbArray(StringInfo buffer, JEn *** 1358,1364 **** } appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! /* reserve space for the JEntries of the elements. */ metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.array.nElems); totallen = 0; --- 1442,1449 ---- } appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! ! /* Reserve space for the JEntries of the elements. */ metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.array.nElems); totallen = 0; *************** convertJsonbArray(StringInfo buffer, JEn *** 1368,1391 **** int len; JEntry meta; convertJsonbValue(buffer, &meta, elem, level + 1); - len = meta & JENTRY_POSMASK; - totallen += len; ! if (totallen > JENTRY_POSMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_POSMASK))); - if (i > 0) - meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } totallen = buffer->len - offset; /* Initialize the header of this node, in the container's JEntry array */ *pheader = JENTRY_ISCONTAINER | totallen; } --- 1453,1491 ---- int len; JEntry meta; + /* + * Convert element, producing a JEntry and appending its + * variable-length data to buffer + */ convertJsonbValue(buffer, &meta, elem, level + 1); ! /* ! * Bail out if total variable-length data exceeds what will fit in a ! * JEntry length field. We check this in each iteration, not just ! * once at the end, to forestall possible integer overflow. ! */ ! len = JBE_LENFLD(meta); ! totallen += len; ! if (totallen > JENTRY_LENMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_LENMASK))); copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } + /* Total data size is everything we've appended to buffer */ totallen = buffer->len - offset; + /* Check length again, since we didn't include the metadata above */ + if (totallen > JENTRY_LENMASK) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", + JENTRY_LENMASK))); + /* Initialize the header of this node, in the container's JEntry array */ *pheader = JENTRY_ISCONTAINER | totallen; } *************** convertJsonbArray(StringInfo buffer, JEn *** 1393,1457 **** static void convertJsonbObject(StringInfo buffer, JEntry *pheader, JsonbValue *val, int level) { - uint32 header; int offset; int metaoffset; int i; int totallen; ! 
/* Initialize pointer into conversion buffer at this level */ offset = buffer->len; padBufferToInt(buffer); ! /* Initialize header */ header = val->val.object.nPairs | JB_FOBJECT; appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! /* reserve space for the JEntries of the keys and values */ metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.object.nPairs * 2); totallen = 0; for (i = 0; i < val->val.object.nPairs; i++) { ! JsonbPair *pair = &val->val.object.pairs[i]; ! int len; ! JEntry meta; ! /* put key */ convertJsonbScalar(buffer, &meta, &pair->key); ! len = meta & JENTRY_POSMASK; totallen += len; - if (totallen > JENTRY_POSMASK) - ereport(ERROR, - (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), - errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", - JENTRY_POSMASK))); - - if (i > 0) - meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); ! convertJsonbValue(buffer, &meta, &pair->value, level); ! len = meta & JENTRY_POSMASK; ! totallen += len; ! ! if (totallen > JENTRY_POSMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), ! errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_POSMASK))); - meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } totallen = buffer->len - offset; *pheader = JENTRY_ISCONTAINER | totallen; } --- 1493,1595 ---- static void convertJsonbObject(StringInfo buffer, JEntry *pheader, JsonbValue *val, int level) { int offset; int metaoffset; int i; int totallen; + uint32 header; ! /* Remember where variable-length data starts for this object */ offset = buffer->len; + /* Align to 4-byte boundary (any padding counts as part of my data) */ padBufferToInt(buffer); ! /* ! * Construct the header Jentry and store it in the beginning of the ! * variable-length payload. ! */ header = val->val.object.nPairs | JB_FOBJECT; appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! /* Reserve space for the JEntries of the keys and values. */ metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.object.nPairs * 2); + /* + * Iterate over the keys, then over the values, since that is the ordering + * we want in the on-disk representation. + */ totallen = 0; for (i = 0; i < val->val.object.nPairs; i++) { ! JsonbPair *pair = &val->val.object.pairs[i]; ! int len; ! JEntry meta; ! /* ! * Convert key, producing a JEntry and appending its variable-length ! * data to buffer ! */ convertJsonbScalar(buffer, &meta, &pair->key); ! len = JBE_LENFLD(meta); totallen += len; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); ! /* ! * Bail out if total variable-length data exceeds what will fit in a ! * JEntry length field. We check this in each iteration, not just ! * once at the end, to forestall possible integer overflow. ! */ ! if (totallen > JENTRY_LENMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), ! errmsg("total size of jsonb object elements exceeds the maximum of %u bytes", ! JENTRY_LENMASK))); ! } ! for (i = 0; i < val->val.object.nPairs; i++) ! { ! JsonbPair *pair = &val->val.object.pairs[i]; ! int len; ! JEntry meta; ! ! /* ! * Convert value, producing a JEntry and appending its variable-length ! * data to buffer ! */ ! convertJsonbValue(buffer, &meta, &pair->value, level + 1); ! ! len = JBE_LENFLD(meta); ! 
totallen += len; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); + + /* + * Bail out if total variable-length data exceeds what will fit in a + * JEntry length field. We check this in each iteration, not just + * once at the end, to forestall possible integer overflow. + */ + if (totallen > JENTRY_LENMASK) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("total size of jsonb object elements exceeds the maximum of %u bytes", + JENTRY_LENMASK))); } + /* Total data size is everything we've appended to buffer */ totallen = buffer->len - offset; + /* Check length again, since we didn't include the metadata above */ + if (totallen > JENTRY_LENMASK) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("total size of jsonb object elements exceeds the maximum of %u bytes", + JENTRY_LENMASK))); + + /* Initialize the header of this node, in the container's JEntry array */ *pheader = JENTRY_ISCONTAINER | totallen; } diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h index 91e3e14..f9472af 100644 *** a/src/include/utils/jsonb.h --- b/src/include/utils/jsonb.h *************** typedef struct JsonbValue JsonbValue; *** 83,91 **** * buffer is accessed, but they can also be deep copied and passed around. * * Jsonb is a tree structure. Each node in the tree consists of a JEntry ! * header, and a variable-length content. The JEntry header indicates what ! * kind of a node it is, e.g. a string or an array, and the offset and length ! * of its variable-length portion within the container. * * The JEntry and the content of a node are not stored physically together. * Instead, the container array or object has an array that holds the JEntrys --- 83,91 ---- * buffer is accessed, but they can also be deep copied and passed around. * * Jsonb is a tree structure. Each node in the tree consists of a JEntry ! * header and a variable-length content (possibly of zero size). The JEntry ! * header indicates what kind of a node it is, e.g. a string or an array, ! * and includes the length of its variable-length portion. * * The JEntry and the content of a node are not stored physically together. * Instead, the container array or object has an array that holds the JEntrys *************** typedef struct JsonbValue JsonbValue; *** 95,133 **** * hold its JEntry. Hence, no JEntry header is stored for the root node. It * is implicitly known that the root node must be an array or an object, * so we can get away without the type indicator as long as we can distinguish ! * the two. For that purpose, both an array and an object begins with a uint32 * header field, which contains an JB_FOBJECT or JB_FARRAY flag. When a naked * scalar value needs to be stored as a Jsonb value, what we actually store is * an array with one element, with the flags in the array's header field set * to JB_FSCALAR | JB_FARRAY. * - * To encode the length and offset of the variable-length portion of each - * node in a compact way, the JEntry stores only the end offset within the - * variable-length portion of the container node. For the first JEntry in the - * container's JEntry array, that equals to the length of the node data. The - * begin offset and length of the rest of the entries can be calculated using - * the end offset of the previous JEntry in the array. - * * Overall, the Jsonb struct requires 4-bytes alignment. Within the struct, * the variable-length portion of some node types is aligned to a 4-byte * boundary, while others are not. 
When alignment is needed, the padding is * in the beginning of the node that requires it. For example, if a numeric * node is stored after a string node, so that the numeric node begins at * offset 3, the variable-length portion of the numeric node will begin with ! * one padding byte. */ /* * Jentry format. * ! * The least significant 28 bits store the end offset of the entry (see ! * JBE_ENDPOS, JBE_OFF, JBE_LEN macros below). The next three bits ! * are used to store the type of the entry. The most significant bit ! * is unused, and should be set to zero. */ typedef uint32 JEntry; ! #define JENTRY_POSMASK 0x0FFFFFFF #define JENTRY_TYPEMASK 0x70000000 /* values stored in the type bits */ --- 95,126 ---- * hold its JEntry. Hence, no JEntry header is stored for the root node. It * is implicitly known that the root node must be an array or an object, * so we can get away without the type indicator as long as we can distinguish ! * the two. For that purpose, both an array and an object begin with a uint32 * header field, which contains an JB_FOBJECT or JB_FARRAY flag. When a naked * scalar value needs to be stored as a Jsonb value, what we actually store is * an array with one element, with the flags in the array's header field set * to JB_FSCALAR | JB_FARRAY. * * Overall, the Jsonb struct requires 4-bytes alignment. Within the struct, * the variable-length portion of some node types is aligned to a 4-byte * boundary, while others are not. When alignment is needed, the padding is * in the beginning of the node that requires it. For example, if a numeric * node is stored after a string node, so that the numeric node begins at * offset 3, the variable-length portion of the numeric node will begin with ! * one padding byte so that the actual numeric data is 4-byte aligned. */ /* * Jentry format. * ! * The least significant 28 bits store the data length of the entry (see ! * JBE_LENFLD and JBE_LEN macros below). The next three bits store the type ! * of the entry. The most significant bit is reserved for future use, and ! * should be set to zero. */ typedef uint32 JEntry; ! #define JENTRY_LENMASK 0x0FFFFFFF #define JENTRY_TYPEMASK 0x70000000 /* values stored in the type bits */ *************** typedef uint32 JEntry; *** 148,166 **** #define JBE_ISBOOL(je_) (JBE_ISBOOL_TRUE(je_) || JBE_ISBOOL_FALSE(je_)) /* ! * Macros for getting the offset and length of an element. Note multiple ! * evaluations and access to prior array element. */ ! #define JBE_ENDPOS(je_) ((je_) & JENTRY_POSMASK) ! #define JBE_OFF(ja, i) ((i) == 0 ? 0 : JBE_ENDPOS((ja)[i - 1])) ! #define JBE_LEN(ja, i) ((i) == 0 ? JBE_ENDPOS((ja)[i]) \ ! : JBE_ENDPOS((ja)[i]) - JBE_ENDPOS((ja)[i - 1])) /* * A jsonb array or object node, within a Jsonb Datum. * ! * An array has one child for each element. An object has two children for ! * each key/value pair. */ typedef struct JsonbContainer { --- 141,160 ---- #define JBE_ISBOOL(je_) (JBE_ISBOOL_TRUE(je_) || JBE_ISBOOL_FALSE(je_)) /* ! * Macros for getting the data length of a JEntry. */ ! #define JBE_LENFLD(je_) ((je_) & JENTRY_LENMASK) ! #define JBE_LEN(ja, i) JBE_LENFLD((ja)[i]) /* * A jsonb array or object node, within a Jsonb Datum. * ! * An array has one child for each element, stored in array order. ! * ! * An object has two children for each key/value pair. The keys all appear ! * first, in key sort order; then the values appear, in an order matching the ! * key order. This arrangement keeps the keys compact in memory, making a ! * search for a particular key more cache-friendly. 
*/ typedef struct JsonbContainer { *************** typedef struct JsonbContainer *** 172,179 **** } JsonbContainer; /* flags for the header-field in JsonbContainer */ ! #define JB_CMASK 0x0FFFFFFF ! #define JB_FSCALAR 0x10000000 #define JB_FOBJECT 0x20000000 #define JB_FARRAY 0x40000000 --- 166,173 ---- } JsonbContainer; /* flags for the header-field in JsonbContainer */ ! #define JB_CMASK 0x0FFFFFFF /* mask for count field */ ! #define JB_FSCALAR 0x10000000 /* flag bits */ #define JB_FOBJECT 0x20000000 #define JB_FARRAY 0x40000000 *************** struct JsonbValue *** 248,265 **** (jsonbval)->type <= jbvBool) /* ! * Pair within an Object. * ! * Pairs with duplicate keys are de-duplicated. We store the order for the ! * benefit of doing so in a well-defined way with respect to the original ! * observed order (which is "last observed wins"). This is only used briefly ! * when originally constructing a Jsonb. */ struct JsonbPair { JsonbValue key; /* Must be a jbvString */ JsonbValue value; /* May be of any type */ ! uint32 order; /* preserves order of pairs with equal keys */ }; /* Conversion state used when parsing Jsonb from text, or for type coercion */ --- 242,261 ---- (jsonbval)->type <= jbvBool) /* ! * Key/value pair within an Object. * ! * This struct type is only used briefly while constructing a Jsonb; it is ! * *not* the on-disk representation. ! * ! * Pairs with duplicate keys are de-duplicated. We store the originally ! * observed pair ordering for the purpose of removing duplicates in a ! * well-defined way (which is "last observed wins"). */ struct JsonbPair { JsonbValue key; /* Must be a jbvString */ JsonbValue value; /* May be of any type */ ! uint32 order; /* Pair's index in original sequence */ }; /* Conversion state used when parsing Jsonb from text, or for type coercion */ *************** typedef struct JsonbIterator *** 287,306 **** { /* Container being iterated */ JsonbContainer *container; ! uint32 nElems; /* Number of elements in children array (will be ! * nPairs for objects) */ bool isScalar; /* Pseudo-array scalar value? */ ! JEntry *children; ! /* Current item in buffer (up to nElems, but must * 2 for objects) */ ! int i; /* ! * Data proper. This points just past end of children array. ! * We use the JBE_OFF() macro on the Jentrys to find offsets of each ! * child in this area. */ ! char *dataProper; /* Private state */ JsonbIterState state; --- 283,307 ---- { /* Container being iterated */ JsonbContainer *container; ! uint32 nElems; /* Number of elements in children array (will ! * be nPairs for objects) */ bool isScalar; /* Pseudo-array scalar value? */ ! JEntry *children; /* JEntrys for child nodes */ ! /* Data proper. This points just past end of children array */ ! char *dataProper; ! /* Current item in buffer (up to nElems) */ ! int curIndex; ! ! /* Data offset corresponding to current item */ ! uint32 curDataOffset; /* ! * If the container is an object, we want to return keys and values ! * alternately; so curDataOffset points to the current key, and ! * curValueOffset points to the current value. */ ! uint32 curValueOffset; /* Private state */ JsonbIterState state; *************** extern Datum gin_consistent_jsonb_path(P *** 344,349 **** --- 345,351 ---- extern Datum gin_triconsistent_jsonb_path(PG_FUNCTION_ARGS); /* Support functions */ + extern uint32 getJsonbOffset(const JEntry *ja, int index); extern int compareJsonbContainers(JsonbContainer *a, JsonbContainer *b); extern JsonbValue *findJsonbValueFromContainer(JsonbContainer *sheader, uint32 flags,
Tom, here are the results with github data (8 top level keys) only. Here's a sample object: https://gist.github.com/igrigorik/2017462
All-Lengths + Cache-Aware EXTERNAL
Query 1: 516ms
Query 2: 350ms
The difference is small but it's definitely faster, which makes sense since cache line misses are probably slightly reduced.
As in the previous runs, I ran the query a dozen times and took the average after excluding runs with a high deviation.
HEAD (aka, all offsets) EXTERNAL
Test query 1 runtime: 505ms
Test query 2 runtime: 350ms
All Lengths (Tom Lane patch) EXTERNAL
Test query 1 runtime: 525ms
Test query 2 runtime: 355ms
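(The exact Query 1 / Query 2 used in these runs aren't included in the thread; as a rough sketch of this kind of timing run, assuming a hypothetical github_events table with a jsonb doc column and made-up key names:)

\timing on
-- Hypothetical table and keys; wrap in count() so timing isn't dominated by client output.
SELECT count(doc ->> 'type'), count(doc ->> 'repository') FROM github_events;
-- Repeat a dozen times and average, discarding runs with high deviation, as described above.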
--
Arthur Silva
On Tue, Aug 26, 2014 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> I wish it were cache-friendly too, per the upthread tangent about having
> to fetch keys from all over the place within a large JSON object.
> ... and while I was typing that sentence, lightning struck. The existing
> arrangement of object subfields with keys and values interleaved is just
> plain dumb. We should rearrange that as all the keys in order, then all
> the values in the same order. Then the keys are naturally adjacent in
> memory and object-key searches become much more cache-friendly: you
> probably touch most of the key portion of the object, but none of the
> values portion, until you know exactly what part of the latter to fetch.
> This approach might complicate the lookup logic marginally but I bet not
> very much; and it will be a huge help if we ever want to do smart access
> to EXTERNAL (non-compressed) JSON values.
> I will go prototype that just to see how much code rearrangement is
> required.
This looks pretty good from a coding point of view. I have not had time
yet to see if it affects the speed of the benchmark cases we've been
trying. I suspect that it won't make much difference in them. I think
if we do decide to make an on-disk format change, we should seriously
consider including this change.
The same concept could be applied to offset-based storage of course,
although I rather doubt that we'd make that combination of choices since
it would be giving up on-disk compatibility for benefits that are mostly
in the future.
Attached are two patches: one is a "delta" against the last jsonb-lengths
patch I posted, and the other is a "merged" patch showing the total change
from HEAD, for ease of application.
regards, tom lane
On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva <arthurprs@gmail.com> wrote: > The difference is small but I's definitely faster, which makes sense since > cache line misses are probably slightly reduced. > As in the previous runs, I ran the query a dozen times and took the average > after excluding runs with a high deviation. I'm not surprised that it hasn't beaten HEAD. I haven't studied the problem in detail, but I don't think that the "cache awareness" of the new revision is necessarily a distinct advantage. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > I'm not surprised that it hasn't beaten HEAD. I haven't studied the > problem in detail, but I don't think that the "cache awareness" of the > new revision is necessarily a distinct advantage. I doubt it's a significant advantage in the current state of the code; I'm happy if it's not a loss. I was looking ahead to someday fetching key values efficiently from large EXTERNAL (ie out-of-line-but-not-compressed) JSON values, analogously to the existing optimization for fetching text substrings from EXTERNAL text values. As mentioned upthread, the current JSONB representation would be seriously unfriendly to such a thing. regards, tom lane
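(For readers unfamiliar with the text optimization Tom alludes to: with EXTERNAL, i.e. uncompressed out-of-line, storage, substr() on a text column fetches only the TOAST chunks it needs instead of detoasting the whole value. A minimal sketch, with assumed table and column names:)

-- Assumed names; shown only to illustrate the existing EXTERNAL-text optimization.
ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;
-- With EXTERNAL storage, only the TOAST chunks covering the first 100 characters are fetched.
SELECT substr(body, 1, 100) FROM docs;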
It won't be faster by any means, but it should definitely be incorporated if any format changes are made (like Tom already suggested).
I think it's important we gather at least 2 more things before making any calls:
* Josh tests w/ cache aware patch, which should confirm cache aware is indeed preferred
* Tests with toast hacked to use lz4 instead, which might ease any decisions
--
Arthur Silva
I'm attaching a quick-n-dirty patch that uses lz4 compression instead of pglz in case someone wants to experiment with it. It seems to work in my test env; I'll run more tests when I get home.
PS: gotta love gmail's fixed default of top-posting...
On 08/12/2014 10:58 AM, Robert Haas wrote: > What would really be ideal here is if the JSON code could inform the > toast compression code "this many initial bytes are likely > incompressible, just pass them through without trying, and then start > compressing at byte N", where N is the byte following the TOC. But I > don't know that there's a reasonable way to implement that. Sorry for being late to the party. Anyhow, this strikes me as a good basic direction of thought. But I think we should put the burden on the data type, not on toast. To do that data types could have an optional toast_hint_values() function, which the toast code can call with the actual datum at hand and its default parameter array. The hint values function then can modify that parameter array, telling toast how much to skip, how hard to try (or not at all) and so on. A data type specific function should know much better how to figure out how compressible a particular datum may be. Certainly nothing for 9.4, but it might require changing the toast API in a different way than just handing it an oid and hard-coding the JSONBOID case into toast for 9.4. If we are going to change the API, we might as well do it right. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
On 08/08/2014 10:21 AM, Andrew Dunstan wrote: > > On 08/07/2014 11:17 PM, Tom Lane wrote: >> I looked into the issue reported in bug #11109. The problem appears to be >> that jsonb's on-disk format is designed in such a way that the leading >> portion of any JSON array or object will be fairly incompressible, because >> it consists mostly of a strictly-increasing series of integer offsets. >> This interacts poorly with the code in pglz_compress() that gives up if >> it's found nothing compressible in the first first_success_by bytes of a >> value-to-be-compressed. (first_success_by is 1024 in the default set of >> compression parameters.) > > [snip] > >> There is plenty of compressible data once we get into the repetitive >> strings in the payload part --- but that starts at offset 944, and up to >> that point there is nothing that pg_lzcompress can get a handle on. There >> are, by definition, no sequences of 4 or more repeated bytes in that area. >> I think in principle pg_lzcompress could decide to compress the 3-byte >> sequences consisting of the high-order 24 bits of each offset; but it >> doesn't choose to do so, probably because of the way its lookup hash table >> works: >> >> * pglz_hist_idx - >> * >> * Computes the history table slot for the lookup by the next 4 >> * characters in the input. >> * >> * NB: because we use the next 4 characters, we are not guaranteed to >> * find 3-character matches; they very possibly will be in the wrong >> * hash list. This seems an acceptable tradeoff for spreading out the >> * hash keys more. >> >> For jsonb header data, the "next 4 characters" are *always* different, so >> only a chance hash collision can result in a match. There is therefore a >> pretty good chance that no compression will occur before it gives up >> because of first_success_by. >> >> I'm not sure if there is any easy fix for this. We could possibly change >> the default first_success_by value, but I think that'd just be postponing >> the problem to larger jsonb objects/arrays, and it would hurt performance >> for genuinely incompressible data. A somewhat painful, but not yet >> out-of-the-question, alternative is to change the jsonb on-disk >> representation. Perhaps the JEntry array could be defined as containing >> element lengths instead of element ending offsets. Not sure though if >> that would break binary searching for JSON object keys. >> >> > > > Ouch. > > Back when this structure was first presented at pgCon 2013, I wondered > if we shouldn't extract the strings into a dictionary, because of key > repetition, and convinced myself that this shouldn't be necessary > because in significant cases TOAST would take care of it. > > Maybe we should have pglz_compress() look at the *last* 1024 bytes if it > can't find anything worth compressing in the first, for values larger > than a certain size. > > It's worth noting that this is a fairly pathological case. AIUI the > example you constructed has an array with 100k string elements. I don't > think that's typical. So I suspect that unless I've misunderstood the > statement of the problem we're going to find that almost all the jsonb > we will be storing is still compressible. I also think that a substantial part of the problem of coming up with a "representative" data sample is because the size of the incompressible data at the beginning is somewhat tied to the overall size of the datum itself. 
This may or may not be true in any particular use case, but as a general rule of thumb I would assume that the larger the JSONB document, the larger the offset array at the beginning. Would changing 1024 to a fraction of the datum length for the time being give us enough room to come up with a proper solution for 9.5? Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
On 08/08/2014 11:18 AM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 08/07/2014 11:17 PM, Tom Lane wrote: >>> I looked into the issue reported in bug #11109. The problem appears to be >>> that jsonb's on-disk format is designed in such a way that the leading >>> portion of any JSON array or object will be fairly incompressible, because >>> it consists mostly of a strictly-increasing series of integer offsets. > >> Ouch. > >> Back when this structure was first presented at pgCon 2013, I wondered >> if we shouldn't extract the strings into a dictionary, because of key >> repetition, and convinced myself that this shouldn't be necessary >> because in significant cases TOAST would take care of it. > > That's not really the issue here, I think. The problem is that a > relatively minor aspect of the representation, namely the choice to store > a series of offsets rather than a series of lengths, produces > nonrepetitive data even when the original input is repetitive. This is only because the input data was exact copies of the same strings over and over again. PGLZ can very well compress slightly less identical strings of varying lengths too. Not as well, but well enough. But I suspect such input data would make it fail again, even with lengths. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
On Sep 4, 2014, at 7:26 PM, Jan Wieck <jan@wi3ck.info> wrote: > This is only because the input data was exact copies of the same strings over and over again. PGLZ can very well compress slightly less identical strings of varying lengths too. Not as well, but well enough. But I suspect such input data would make it fail again, even with lengths. We had a bit of discussion about JSONB compression at PDXPUG Day this morning. Josh polled the room, and about half thought we should apply the patch for better compression, while the other half seemed to want faster access operations. (Some folks no doubt voted for both.) But in the ensuing discussion, I started to think that maybe we should leave it as it is, for two reasons: 1. There has been a fair amount of discussion about ways to better deal with this in future releases, such as hints to TOAST about how to compress, or the application of different compression algorithms (or pluggable compression). I'm assuming that leaving it as-is does not remove those possibilities. 2. The major advantage of JSONB is fast access operations. If those are not as important for a given use case as storage space, there's still the JSON type, which *does* compress reasonably well. IOW, we already have a JSON alternative that compresses well. So why make the same (or similar) trade-offs with JSONB? Just my $0.02. I would like to see some consensus on this, soon, though, as I am eager to get 9.4 and JSONB, regardless of the outcome! Best, David
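(A quick way to check that trade-off against one's own data is to compare the total on-disk footprint, including TOAST, of the same documents stored as json and as jsonb; the table names below are assumptions:)

-- Hypothetical tables holding the same documents as json and as jsonb.
SELECT pg_size_pretty(pg_total_relation_size('docs_json'))  AS json_size,
       pg_size_pretty(pg_total_relation_size('docs_jsonb')) AS jsonb_size;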
So, I finally got time to test Tom's latest patch on this.

TLDR: we want to go with Tom's latest patch and release beta3.

Figures:

So I tested HEAD against the latest lengths patch. Per Arthur Silva, I checked uncompressed times for JSONB against compressed times. This changed the picture considerably.

TABLE SIZES
-----------

HEAD

      ?column?       | pg_size_pretty
---------------------+----------------
 json text format    | 393 MB
 jsonb: compressed   | 1147 MB
 jsonb: uncompressed | 1221 MB

PATCHED

      ?column?       | pg_size_pretty
---------------------+----------------
 json text format    | 394 MB
 jsonb: compressed   | 525 MB
 jsonb: uncompressed | 1200 MB

EXTRACTION TIMES
----------------

HEAD

Q1 (search via GIN index followed by extracting 100,000 values from rows):
jsonb compressed: 4000
jsonb uncompressed: 3250

Q2 (seq scan and extract 200,000 values from rows):
json: 11700
jsonb compressed: 3150
jsonb uncompressed: 2700

PATCHED

Q1:
jsonb compressed: 6750
jsonb uncompressed: 3350

Q2:
json: 11796
jsonb compressed: 4700
jsonb uncompressed: 2650

----------------------

Conclusion: with Tom's patch, JSONB is 55% smaller when compressed (EXTENDED). Extraction times are 50% to 70% slower, but this appears to be almost entirely due to decompression overhead. When not compressing (EXTERNAL), extraction times for the patched version are statistically the same as HEAD, and file sizes are similar to HEAD.

USER REACTION
-------------

I polled at both PDXpgDay and at FOSS4G, asking some ~80 Postgres users how they would feel about a compression vs. extraction time tradeoff. The audience was evenly split. However, with the current patch, the user can choose. Users who know enough for performance tuning can set JSONB columns to EXTERNAL, and get the same performance as the unpatched version.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
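(The tuning knob referred to above is per-column storage; a minimal sketch, with assumed table and column names, and note that the setting only affects newly stored values:)

-- EXTENDED (the default) allows out-of-line storage with compression;
-- EXTERNAL stores out of line but skips compression.
ALTER TABLE t_json ALTER COLUMN data SET STORAGE EXTERNAL;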
* Josh Berkus (josh@agliodbs.com) wrote: > TLDR: we want to go with Tom's latest patch and release beta3. Having not even read the rest- yes please. We really need to get beta3 out and figure out when we're going to actually release 9.4... Admittedly, the last month has been good and we've been fixing issues, but it'd really be good to wrap 9.4 up. > Conclusion: with Tom's patch, compressed JSONB is 55% smaller when > compressed (EXTENDED). Extraction times are 50% to 70% slower, but this > appears to be almost entirely due to decompression overhead. When not > compressing (EXTERNAL), extraction times for patch versions are > statistically the same as HEAD, and file sizes are similar to HEAD. Not really a surprise. > I polled at both PDXpgDay and at FOSS4G, asking some ~~ 80 Postgres > users how they would feel about a compression vs. extraction time > tradeoff. The audience was evenly split. Also not a surprise. > However, with the current patch, the user can choose. Users who know > enough for performance tuning can set JSONB columns to EXTERNAL, and the > the same performance as the unpatched version. Agreed. Thanks, Stephen
The compression ratio difference is exaggerated in this case, but it does support the conclusion that Tom's patch alleviates the extraction penalty.
In my tests with the github archive data the savings <-> performance-penalty tradeoff was fine, but I'm not confident in those results since there were only 8 top level keys.
For comparison, some twitter api objects[1] have 30+ top level keys. If I have time in the next couple of days I'll run some tests with the public twitter firehose data.
[1] https://dev.twitter.com/rest/reference/get/statuses/home_timeline
On 09/11/2014 06:56 PM, Arthur Silva wrote: > > In my testings with the github archive data the savings <-> > performance-penalty was fine, but I'm not confident in those results > since there were only 8 top level keys. Well, we did want to see that the patch doesn't create a regression with data which doesn't fall into the problem case area, and your test did that nicely. > For comparison, some twitter api objects[1] have 30+ top level keys. If > I have time in the next couple of days I'll conduct some testings with > the public twitter fire-hose data. Yah, if we have enough time for me to get the Mozilla Socorro test environment working, I can also test with Mozilla crash data. That has some deep nesting and very large values. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote: > So, I finally got time to test Tom's latest patch on this. > > TLDR: we want to go with Tom's latest patch and release beta3. > > Figures: > > So I tested HEAD against the latest lengths patch. Per Arthur Silva, I > checked uncompressed times for JSONB against compressed times. This > changed the picture considerably. Did you -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Sep 12, 2014 at 1:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote: >> So, I finally got time to test Tom's latest patch on this. >> >> TLDR: we want to go with Tom's latest patch and release beta3. >> >> Figures: >> >> So I tested HEAD against the latest lengths patch. Per Arthur Silva, I >> checked uncompressed times for JSONB against compressed times. This >> changed the picture considerably. > > Did you Blah. Did you test Heikki's patch from here? http://www.postgresql.org/message-id/53EC8194.4020804@vmware.com Tom didn't like it, but I thought it was rather clever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09/12/2014 10:00 AM, Robert Haas wrote: > On Fri, Sep 12, 2014 at 1:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> So, I finally got time to test Tom's latest patch on this. >>> >>> TLDR: we want to go with Tom's latest patch and release beta3. >>> >>> Figures: >>> >>> So I tested HEAD against the latest lengths patch. Per Arthur Silva, I >>> checked uncompressed times for JSONB against compressed times. This >>> changed the picture considerably. >> >> Did you > > Blah. > > Did you test Heikki's patch from here? > > http://www.postgresql.org/message-id/53EC8194.4020804@vmware.com > > Tom didn't like it, but I thought it was rather clever. > Yes, I posted the results for that a couple weeks ago; Tom had posted a cleaned-up version of that patch, but materially it made no difference in sizes or extraction times compared with Tom's lengths-only patch. Same for Arthur's tests. It's certainly possible that there is a test case for which Heikki's approach is superior, but if so we haven't seen it. And since it's approach is also more complicated, sticking with the simpler lengths-only approach seems like the way to go. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Sep 12, 2014 at 1:11 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 09/12/2014 10:00 AM, Robert Haas wrote: >> On Fri, Sep 12, 2014 at 1:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote: >>>> So, I finally got time to test Tom's latest patch on this. >>>> >>>> TLDR: we want to go with Tom's latest patch and release beta3. >>>> >>>> Figures: >>>> >>>> So I tested HEAD against the latest lengths patch. Per Arthur Silva, I >>>> checked uncompressed times for JSONB against compressed times. This >>>> changed the picture considerably. >>> >>> Did you >> >> Blah. >> >> Did you test Heikki's patch from here? >> >> http://www.postgresql.org/message-id/53EC8194.4020804@vmware.com >> >> Tom didn't like it, but I thought it was rather clever. > > Yes, I posted the results for that a couple weeks ago; Tom had posted a > cleaned-up version of that patch, but materially it made no difference > in sizes or extraction times compared with Tom's lengths-only patch. > Same for Arthur's tests. > > It's certainly possible that there is a test case for which Heikki's > approach is superior, but if so we haven't seen it. And since it's > approach is also more complicated, sticking with the simpler > lengths-only approach seems like the way to go. Huh, OK. I'm slightly surprised, but that's why we benchmark these things. Thanks for following up on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Sep 12, 2014 at 1:11 PM, Josh Berkus <josh@agliodbs.com> wrote: >> It's certainly possible that there is a test case for which Heikki's >> approach is superior, but if so we haven't seen it. And since it's >> approach is also more complicated, sticking with the simpler >> lengths-only approach seems like the way to go. > Huh, OK. I'm slightly surprised, but that's why we benchmark these things. The argument for Heikki's patch was never that it would offer better performance; it's obvious (at least to me) that it won't. The argument was that it'd be upward-compatible with what we're doing now, so that we'd not have to force an on-disk compatibility break with 9.4beta2. regards, tom lane
On 09/12/2014 08:52 PM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Sep 12, 2014 at 1:11 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> It's certainly possible that there is a test case for which Heikki's >>> approach is superior, but if so we haven't seen it. And since it's >>> approach is also more complicated, sticking with the simpler >>> lengths-only approach seems like the way to go. > >> Huh, OK. I'm slightly surprised, but that's why we benchmark these things. > > The argument for Heikki's patch was never that it would offer better > performance; it's obvious (at least to me) that it won't. Performance was one argument for sure. It's not hard to come up with a case where the all-lengths approach is much slower: take a huge array with, say, a million elements, and fetch the last element in a tight loop, and do that in a PL/pgSQL function without storing the datum to disk, so that it doesn't get toasted; storing offsets makes that faster. Not a very common thing to do in real life, although something like that might come up if you do a lot of JSON processing in PL/pgSQL. IOW, something like this:

do $$
declare
  ja jsonb;
  i int4;
begin
  select json_agg(g) into ja from generate_series(1, 100000) g;
  for i in 1..100000 loop
    perform ja ->> 90000;
  end loop;
end;
$$;

should perform much better with current git master or "my patch" than with the all-lengths patch. I'm OK with going for the all-lengths approach anyway; it's simpler, and working with huge arrays is hopefully not that common. But it's not a completely open-and-shut case. - Heikki
On 09/12/2014 01:30 PM, Heikki Linnakangas wrote: > > Performance was one argument for sure. It's not hard to come up with a > case where the all-lengths approach is much slower: take a huge array > with, say, million elements, and fetch the last element in a tight loop. > And do that in a PL/pgSQL function without storing the datum to disk, so > that it doesn't get toasted. Not a very common thing to do in real life, > although something like that might come up if you do a lot of json > processing in PL/pgSQL. but storing offsets makes that faster. While I didn't post the results (because they were uninteresting), I did specifically test the "last element" in a set of 200 elements for all-lengths vs. original offsets for JSONB, and the results were not statistically different. I did not test against your patch; is there some reason why your patch would be faster for the "last element" case than the original offsets version? If not, I think the corner case is so obscure as to be not worth optimizing for. I can't imagine that more than a tiny minority of our users are going to have thousands of keys per datum. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Sep 15, 2014 at 2:12 PM, Josh Berkus <josh@agliodbs.com> wrote: > If not, I think the corner case is so obscure as to be not worth > optimizing for. I can't imagine that more than a tiny minority of our > users are going to have thousands of keys per datum. Worst case is cost scaling linearly with the number of keys, so how expensive it gets depends on how many keys there are. It would have an effect only on uncompressed jsonb, since compressed jsonb already pays a linear cost for decompression. I'd suggest testing performance with a large number of small keys in uncompressed form. It's bound to show a noticeable regression there. Now, a large number could be 200 or 2000, or even 20k. I'd guess several values should be tested to find the shape of the curve.
On 09/15/2014 10:23 AM, Claudio Freire wrote: > Now, large small keys could be 200 or 2000, or even 20k. I'd guess > several should be tested to find the shape of the curve. Well, we know that it's not noticeable with 200, and that it is noticeable with 100K. It's only worth testing further if we think that having more than 200 top-level keys in one JSONB value is going to be a use case for more than 0.1% of our users. I personally do not. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Sep 15, 2014 at 4:09 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 09/15/2014 10:23 AM, Claudio Freire wrote: >> Now, large small keys could be 200 or 2000, or even 20k. I'd guess >> several should be tested to find the shape of the curve. > > Well, we know that it's not noticeable with 200, and that it is > noticeable with 100K. It's only worth testing further if we think that > having more than 200 top-level keys in one JSONB value is going to be a > use case for more than 0.1% of our users. I personally do not. Yes, but bear in mind that the worst case is exactly at the use case jsonb was designed to speed up: element access within relatively big json documents. Having them uncompressed is to be expected because people using jsonb will often favor speed over compactness if it's a tradeoff (otherwise they'd use plain json). So while you're right that it's perhaps above what would be a common use case, the range "somewhere between 200 and 100K" for the tipping point seems overly imprecise to me.
On 09/15/2014 12:15 PM, Claudio Freire wrote: > So while you're right that it's perhaps above what would be a common > use case, the range "somewhere between 200 and 100K" for the tipping > point seems overly imprecise to me. Well, then, you know how to solve that. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Sep 15, 2014 at 4:17 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 09/15/2014 12:15 PM, Claudio Freire wrote: >> So while you're right that it's perhaps above what would be a common >> use case, the range "somewhere between 200 and 100K" for the tipping >> point seems overly imprecise to me. > > Well, then, you know how to solve that. I was hoping testing with other numbers was a simple hitting a key for someone else. But sure. I'll set something up.
On 09/15/2014 12:25 PM, Claudio Freire wrote: > On Mon, Sep 15, 2014 at 4:17 PM, Josh Berkus <josh@agliodbs.com> wrote: >> On 09/15/2014 12:15 PM, Claudio Freire wrote: >>> So while you're right that it's perhaps above what would be a common >>> use case, the range "somewhere between 200 and 100K" for the tipping >>> point seems overly imprecise to me. >> >> Well, then, you know how to solve that. > > > I was hoping testing with other numbers was a simple hitting a key for > someone else. Nope. My test case has a fixed size. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
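(A minimal sketch of how a test case with a configurable number of top-level keys could be generated to probe that tipping point; table, column, and key names are made up:)

-- Build 10k rows, each holding one object with 2000 small top-level keys.
CREATE TABLE wide_docs (doc jsonb);
ALTER TABLE wide_docs ALTER COLUMN doc SET STORAGE EXTERNAL;  -- the uncompressed case
INSERT INTO wide_docs
SELECT d.doc
FROM (SELECT ('{' || string_agg(format('"k%s":%s', g, g), ',') || '}')::jsonb AS doc
      FROM generate_series(1, 2000) g) AS d,
     generate_series(1, 10000);
-- Time access to a single key; vary the 2000 above to find the shape of the curve.
SELECT count(doc ->> 'k1999') FROM wide_docs;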
On Mon, Sep 15, 2014 at 3:09 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 09/15/2014 10:23 AM, Claudio Freire wrote: >> Now, large small keys could be 200 or 2000, or even 20k. I'd guess >> several should be tested to find the shape of the curve. > > Well, we know that it's not noticeable with 200, and that it is > noticeable with 100K. It's only worth testing further if we think that > having more than 200 top-level keys in one JSONB value is going to be a > use case for more than 0.1% of our users. I personally do not. FWIW, I have written one (1) application that uses JSONB and it has one sub-object (not the top-level object) that in the most typical configuration contains precisely 270 keys. Now, granted, that is not the top-level object, if that distinction is actually relevant here, but color me just a bit skeptical of this claim anyway. This was just a casual thing I did for my own use, not anything industrial strength, so it's hard to believe I'm stressing the system more than 99.9% of users will. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09/15/2014 02:16 PM, Robert Haas wrote: > On Mon, Sep 15, 2014 at 3:09 PM, Josh Berkus <josh@agliodbs.com> wrote: >> On 09/15/2014 10:23 AM, Claudio Freire wrote: >>> Now, large small keys could be 200 or 2000, or even 20k. I'd guess >>> several should be tested to find the shape of the curve. >> >> Well, we know that it's not noticeable with 200, and that it is >> noticeable with 100K. It's only worth testing further if we think that >> having more than 200 top-level keys in one JSONB value is going to be a >> use case for more than 0.1% of our users. I personally do not. > > FWIW, I have written one (1) application that uses JSONB and it has > one sub-object (not the top-level object) that in the most typical > configuration contains precisely 270 keys. Now, granted, that is not > the top-level object, if that distinction is actually relevant here, > but color me just a bit skeptical of this claim anyway. This was just > a casual thing I did for my own use, not anything industrial strength, > so it's hard to believe I'm stressing the system more than 99.9% of > users will. Actually, having the keys all at the same level *is* relevant for the issue we're discussing. If those 270 keys are organized in a tree, it's not the same as having them all on one level (and not as problematic). -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus <josh@agliodbs.com> wrote: > Actually, having the keys all at the same level *is* relevant for the > issue we're discussing. If those 270 keys are organized in a tree, it's > not the same as having them all on one level (and not as problematic). I believe Robert meant that the 270 keys are not at the top level, but are at some level (in other words, some object has 270 pairs). That is equivalent to having them at the top level for the purposes of this discussion. FWIW, I am slightly concerned about weighing use cases around very large JSON documents too heavily. Having enormous jsonb documents just isn't going to work out that well, but neither will equivalent designs in popular document database systems for similar reasons. For example, the maximum BSON document size supported by MongoDB is 16 megabytes, and that seems to be something that their users don't care too much about. Having 270 pairs in an object isn't unreasonable, but it isn't going to be all that common either. -- Peter Geoghegan
I couldn't get my hands on the twitter data but I'm generating my own. The json template is http://paste2.org/wJ1dfcjw and data was generated with http://www.json-generator.com/. It has 35 top level keys, just in case someone is wondering.
I generated 10000 random objects and inserted them repeatedly until I got 320k rows.
Tom's lengths+cache aware: 455ms
HEAD: 440ms
This is a realistic-ish workload in my opinion and Tom's patch performs within 4% of HEAD.
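The exact setup wasn't posted; the following is a minimal sketch of an equivalent benchmark, assuming the table and column names used in the test queries below (t_json, data), a staging table t_pool (name made up), and a file generated.json holding the 10000 generated objects one per line:

CREATE TABLE t_pool (data jsonb);
CREATE TABLE t_json (data jsonb);
\copy t_pool (data) FROM 'generated.json'

-- Re-insert the pool 32 times to reach ~320k rows.
INSERT INTO t_json SELECT p.data FROM generate_series(1, 32) g, t_pool p;

-- The query timed for this first run wasn't specified; a plausible stand-in
-- hitting two of the 35 top-level keys (key names assumed):
\timing on
SELECT data #> '{name}', data #> '{email}' FROM t_json;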
Due to the overall lengths I couldn't really test compressibility, so I re-ran the test. This time I inserted an array of 2 objects in each row, as in: [obj, obj];
The objects were taken in sequence from the 10000 pool so contents match in both tests.
Test query: SELECT data #> '{0, name}', data #> '{0, email}', data #> '{1, name}', data #> '{1, email}' FROM t_json
Test storage: EXTENDED
HEAD: 17mb table + 878mb toast
HEAD size quartiles: {2015,2500,2591,2711,3483}
HEAD query runtime: 15s
Tom's: 220mb table + 580mb toast
Tom's size quartiles: {1665,1984,2061,2142.25,2384}
Tom's query runtime: 13s
This is an intriguing edge case where Tom's patch actually outperforms the base implementation for 3~4kb jsons.
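For reference, a sketch of how numbers like the above can be collected; the exact queries weren't posted, and t_json/data are the names from the test query above:

-- Heap vs. TOAST size:
SELECT pg_size_pretty(pg_relation_size('t_json')) AS heap_size,
       pg_size_pretty(pg_relation_size((SELECT reltoastrelid FROM pg_class
                                        WHERE relname = 't_json'))) AS toast_size;

-- Per-row compressed datum sizes (min, quartiles, max) via pg_column_size():
SELECT percentile_cont(ARRAY[0, 0.25, 0.5, 0.75, 1])
       WITHIN GROUP (ORDER BY pg_column_size(data)::float8)
FROM t_json;

-- EXTENDED (compressible, can be moved out of line) is jsonb's default
-- storage, but it can be set explicitly before loading:
ALTER TABLE t_json ALTER COLUMN data SET STORAGE EXTENDED;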
On 09/16/2014 07:44 AM, Peter Geoghegan wrote: > FWIW, I am slightly concerned about weighing use cases around very > large JSON documents too heavily. Having enormous jsonb documents just > isn't going to work out that well, but neither will equivalent designs > in popular document database systems for similar reasons. For example, > the maximum BSON document size supported by MongoDB is 16 megabytes, > and that seems to be something that their users don't care too much > about. Having 270 pairs in an object isn't unreasonable, but it isn't > going to be all that common either. Also, at a certain size the fact that Pg must rewrite the whole document for any change to it starts to introduce other practical changes. Anyway - this is looking like the change will go in, and with it a catversion bump. Introduction of a jsonb version/flags byte might be worthwhile at the same time. It seems likely that there'll be more room for improvement in jsonb, possibly even down to using different formats for different data. Is it worth paying a byte per value to save on possible upgrade pain? -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Actually, having the keys all at the same level *is* relevant for the >> issue we're discussing. If those 270 keys are organized in a tree, it's >> not the same as having them all on one level (and not as problematic). > > I believe Robert meant that the 270 keys are not at the top level, but > are at some level (in other words, some object has 270 pairs). That is > equivalent to having them at the top level for the purposes of this > discussion. Yes, that's exactly what I meant. > FWIW, I am slightly concerned about weighing use cases around very > large JSON documents too heavily. Having enormous jsonb documents just > isn't going to work out that well, but neither will equivalent designs > in popular document database systems for similar reasons. For example, > the maximum BSON document size supported by MongoDB is 16 megabytes, > and that seems to be something that their users don't care too much > about. Having 270 pairs in an object isn't unreasonable, but it isn't > going to be all that common either. The JSON documents in this case were not particularly large. These objects were < 100kB; they just had a lot of keys. I'm a little baffled by the apparent theme that people think that (object size) / (# of keys) will tend to be large. Maybe there will be some instances where that's the case, but it's not what I'd expect. I would expect people to use JSON to serialize structured data in situations where normalizing would be unwieldy. For example, pick your favorite Facebook or Smartphone game - Plants vs. Zombies, Farmville, Candy Crush Saga, whatever. Or even a traditional board game like chess. Think about what the game state looks like as an abstract object. Almost without exception, you've got some kind of game board with a bunch of squares and then you have a bunch of pieces (plants, crops, candies, pawns) that are positioned on those squares. Now you want to store this in a database. You're certainly not going to have a table column per square, and EAV would be stupid, so what's left? You could use an array, but an array of strings might not be descriptive enough; for a square in Farmville, for example, you might need to know the type of crop, and whether it was fertilized with special magic fertilizer, and when it's going to be ready to harvest, and when it'll wither if not harvested. So a JSON is a pretty natural structure: an array of arrays of objects. If you have a 30x30 farm, you'll have 900 keys. If you have a 50x50 farm, which probably means you're spending real money to buy imaginary plants, you'll have 2500 keys. (For the record, I have no actual knowledge of how any of these games are implemented under the hood. I'm just speculating on how I would have done it.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
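For concreteness, a toy version of the board shape described above (all values invented): an array of rows, each row an array of per-square objects, with one path lookup into it.

SELECT '[
  [ {"crop": "corn",  "fertilized": true,  "ready_at": "2014-09-17"},
    {"crop": "wheat", "fertilized": false, "ready_at": "2014-09-20"} ],
  [ {"crop": null},
    {"crop": "pumpkin", "fertilized": true, "ready_at": "2014-09-18"} ]
]'::jsonb #> '{1,1,crop}';   -- square at row 1, column 1 => "pumpkin"

Scaled up to a 30x30 board, a single such datum holds 900 per-square objects and a few thousand scalar values, which is exactly the territory where the compression and access-cost tradeoffs in this thread start to matter.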
On 09/16/2014 06:31 AM, Robert Haas wrote: > On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan <pg@heroku.com> wrote: >> On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> Actually, having the keys all at the same level *is* relevant for the >>> issue we're discussing. If those 270 keys are organized in a tree, it's >>> not the same as having them all on one level (and not as problematic). >> >> I believe Robert meant that the 270 keys are not at the top level, but >> are at some level (in other words, some object has 270 pairs). That is >> equivalent to having them at the top level for the purposes of this >> discussion. > > Yes, that's exactly what I meant. > >> FWIW, I am slightly concerned about weighing use cases around very >> large JSON documents too heavily. Having enormous jsonb documents just >> isn't going to work out that well, but neither will equivalent designs >> in popular document database systems for similar reasons. For example, >> the maximum BSON document size supported by MongoDB is 16 megabytes, >> and that seems to be something that their users don't care too much >> about. Having 270 pairs in an object isn't unreasonable, but it isn't >> going to be all that common either. Well, I can only judge from the use cases I personally have, none of which involve more than 100 keys at any level for most rows. So far I've seen some people argue hypothetical use cases involving hundreds of keys per level, but nobody who *actually* has such a use case. Also, note that we currently don't know where the "last value" extraction becomes a performance problem at this stage, except that it's somewhere between 200 and 100,000. Also, we don't have a test which shows the hybrid approach (Heikki's patch) performing better with 1000's of keys. Basically, if someone is going to make a serious case for Heikki's hybrid approach over the simpler lengths-only approach, then please post some test data showing the benefit ASAP, since I can't demonstrate it. Otherwise, let's get beta 3 out the door so we can get the 9.4 release train moving again. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Sep 16, 2014 at 12:47 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 09/16/2014 06:31 AM, Robert Haas wrote: >> On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus <josh@agliodbs.com> wrote: >>>> Actually, having the keys all at the same level *is* relevant for the >>>> issue we're discussing. If those 270 keys are organized in a tree, it's >>>> not the same as having them all on one level (and not as problematic). >>> >>> I believe Robert meant that the 270 keys are not at the top level, but >>> are at some level (in other words, some object has 270 pairs). That is >>> equivalent to having them at the top level for the purposes of this >>> discussion. >> >> Yes, that's exactly what I meant. >> >>> FWIW, I am slightly concerned about weighing use cases around very >>> large JSON documents too heavily. Having enormous jsonb documents just >>> isn't going to work out that well, but neither will equivalent designs >>> in popular document database systems for similar reasons. For example, >>> the maximum BSON document size supported by MongoDB is 16 megabytes, >>> and that seems to be something that their users don't care too much >>> about. Having 270 pairs in an object isn't unreasonable, but it isn't >>> going to be all that common either. > > Well, I can only judge from the use cases I personally have, none of > which involve more than 100 keys at any level for most rows. So far > I've seen some people argue hypotetical use cases involving hundreds of > keys per level, but nobody who *actually* has such a use case. I already told you that I did, and that it was the one and only app I had written for JSONB. > Also, > note that we currently don't know where the "last value" extraction > becomes a performance problem at this stage, except that it's somewhere > between 200 and 100,000. Also, we don't have a test which shows the > hybrid approach (Heikki's patch) performing better with 1000's of keys. Fair point. > Basically, if someone is going to make a serious case for Heikki's > hybrid approach over the simpler lengths-only approach, then please post > some test data showing the benefit ASAP, since I can't demonstrate it. > Otherwise, let's get beta 3 out the door so we can get the 9.4 release > train moving again. I don't personally care about this enough to spend more time on it. I told you my extremely limited experience because it seems to contradict your broader experience. If you don't care, you don't care. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09/16/2014 09:54 AM, Robert Haas wrote: > On Tue, Sep 16, 2014 at 12:47 PM, Josh Berkus <josh@agliodbs.com> wrote: >> On 09/16/2014 06:31 AM, Robert Haas wrote: >>> On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan <pg@heroku.com> wrote: >>>> On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus <josh@agliodbs.com> wrote: >>>>> Actually, having the keys all at the same level *is* relevant for the >>>>> issue we're discussing. If those 270 keys are organized in a tree, it's >>>>> not the same as having them all on one level (and not as problematic). >>>> >>>> I believe Robert meant that the 270 keys are not at the top level, but >>>> are at some level (in other words, some object has 270 pairs). That is >>>> equivalent to having them at the top level for the purposes of this >>>> discussion. >>> >>> Yes, that's exactly what I meant. >>> >>>> FWIW, I am slightly concerned about weighing use cases around very >>>> large JSON documents too heavily. Having enormous jsonb documents just >>>> isn't going to work out that well, but neither will equivalent designs >>>> in popular document database systems for similar reasons. For example, >>>> the maximum BSON document size supported by MongoDB is 16 megabytes, >>>> and that seems to be something that their users don't care too much >>>> about. Having 270 pairs in an object isn't unreasonable, but it isn't >>>> going to be all that common either. >> >> Well, I can only judge from the use cases I personally have, none of >> which involve more than 100 keys at any level for most rows. So far >> I've seen some people argue hypotetical use cases involving hundreds of >> keys per level, but nobody who *actually* has such a use case. > > I already told you that I did, and that it was the only and only app I > had written for JSONB. Ah, ok, I thought yours was a test case. Did you check how it performed on the two patches at all? My tests with 185 keys didn't show any difference, including for a "last key" case. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 09/16/2014 07:47 PM, Josh Berkus wrote: > On 09/16/2014 06:31 AM, Robert Haas wrote: >> On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus <josh@agliodbs.com> wrote: >>>> Actually, having the keys all at the same level *is* relevant for the >>>> issue we're discussing. If those 270 keys are organized in a tree, it's >>>> not the same as having them all on one level (and not as problematic). >>> >>> I believe Robert meant that the 270 keys are not at the top level, but >>> are at some level (in other words, some object has 270 pairs). That is >>> equivalent to having them at the top level for the purposes of this >>> discussion. >> >> Yes, that's exactly what I meant. >> >>> FWIW, I am slightly concerned about weighing use cases around very >>> large JSON documents too heavily. Having enormous jsonb documents just >>> isn't going to work out that well, but neither will equivalent designs >>> in popular document database systems for similar reasons. For example, >>> the maximum BSON document size supported by MongoDB is 16 megabytes, >>> and that seems to be something that their users don't care too much >>> about. Having 270 pairs in an object isn't unreasonable, but it isn't >>> going to be all that common either. > > Well, I can only judge from the use cases I personally have, none of > which involve more than 100 keys at any level for most rows. So far > I've seen some people argue hypotetical use cases involving hundreds of > keys per level, but nobody who *actually* has such a use case. Also, > note that we currently don't know where the "last value" extraction > becomes a performance problem at this stage, except that it's somewhere > between 200 and 100,000. Also, we don't have a test which shows the > hybrid approach (Heikki's patch) performing better with 1000's of keys. > > Basically, if someone is going to make a serious case for Heikki's > hybrid approach over the simpler lengths-only approach, then please post > some test data showing the benefit ASAP, since I can't demonstrate it. > Otherwise, let's get beta 3 out the door so we can get the 9.4 release > train moving again. Are you looking for someone with a real life scenario, or just synthetic test case? The latter is easy to do. See attached test program. It's basically the same I posted earlier. 
Here are the results from my laptop with Tom's jsonb-lengths-merged.patch:

postgres=# select * from testtimes ;
 elem | duration_ms
------+-------------
   11 |    0.289508
   12 |    0.288122
   13 |    0.290558
   14 |    0.287889
   15 |    0.286303
   17 |    0.290415
   19 |    0.289829
   21 |    0.289783
   23 |    0.287104
   25 |    0.289834
   28 |    0.290735
   31 |    0.291844
   34 |    0.293454
   37 |    0.293866
   41 |    0.291217
   45 |    0.289243
   50 |    0.290385
   55 |    0.292085
   61 |    0.290892
   67 |    0.292335
   74 |    0.292561
   81 |    0.291416
   89 |    0.295714
   98 |     0.29844
  108 |    0.297421
  119 |    0.299471
  131 |    0.299877
  144 |    0.301604
  158 |    0.303365
  174 |    0.304203
  191 |    0.303596
  210 |    0.306526
  231 |    0.304189
  254 |    0.307782
  279 |    0.307372
  307 |    0.306873
  338 |    0.310471
  372 |      0.3151
  409 |    0.320354
  450 |     0.32038
  495 |    0.322127
  545 |    0.323256
  600 |    0.330419
  660 |    0.334226
  726 |    0.336951
  799 |     0.34108
  879 |    0.347746
  967 |    0.354275
 1064 |    0.356696
 1170 |    0.366906
 1287 |    0.375352
 1416 |    0.392952
 1558 |    0.392907
 1714 |    0.402157
 1885 |    0.412384
 2074 |    0.425958
 2281 |    0.435415
 2509 |     0.45301
 2760 |    0.469983
 3036 |    0.487329
 3340 |    0.505505
 3674 |    0.530412
 4041 |    0.552585
 4445 |    0.581815
 4890 |    0.610509
 5379 |    0.642885
 5917 |    0.680395
 6509 |    0.713849
 7160 |    0.757561
 7876 |    0.805225
 8664 |    0.856142
 9530 |    0.913255
(72 rows)

That's up to 9530 elements - it's pretty easy to extrapolate from there to higher counts, it's O(n).

With unpatched git master, the runtime is flat, regardless of which element is queried, at about 0.29 ms. With jsonb-with-offsets-and-lengths-2.patch, there's no difference that I could measure.

The difference starts to be meaningful at around 500 entries. In practice, I doubt anyone's going to notice until you start talking about tens of thousands of entries.

I'll leave it up to the jury to decide if we care or not. It seems like a fairly unusual use case, where you push around large enough arrays or objects to notice. Then again, I'm sure *someone* will do it. People do strange things, and they find ways to abuse the features that the original developers didn't think of.

- Heikki
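The attached test program isn't reproduced in the archive; the following is only a rough sketch of an equivalent test (all names invented): build a single object with a growing number of keys, fetch its last key repeatedly, and record the average time per fetch.

CREATE TABLE testtimes (elem int, duration_ms float8);

DO $$
DECLARE
  n   int := 10;
  doc jsonb;
  t0  timestamptz;
BEGIN
  WHILE n < 10000 LOOP
    n := greatest(n + 1, (n * 1.1)::int);
    -- one object with n keys: {"key1": 1, ..., "key<n>": n}
    SELECT json_object_agg('key' || i, i)::jsonb INTO doc
    FROM generate_series(1, n) i;
    t0 := clock_timestamp();
    FOR j IN 1..1000 LOOP
      PERFORM doc -> ('key' || n);   -- extract the last key
    END LOOP;
    -- total seconds over 1000 fetches == average ms per fetch
    INSERT INTO testtimes
      VALUES (n, extract(epoch FROM clock_timestamp() - t0));
  END LOOP;
END $$;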
On Tue, Sep 16, 2014 at 3:12 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > I'll leave it up to the jury to decide if we care or not. It seems like a > fairly unusual use case, where you push around large enough arrays or > objects to notice. Then again, I'm sure *someone* will do it. People do > strange things, and they find ways to abuse the features that the original > developers didn't think of. Again, it's not abusing the feature. It's using it. Jsonb is supposed to be fast for this.
On Tue, Sep 16, 2014 at 1:11 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> Well, I can only judge from the use cases I personally have, none of >>> which involve more than 100 keys at any level for most rows. So far >>> I've seen some people argue hypotetical use cases involving hundreds of >>> keys per level, but nobody who *actually* has such a use case. >> >> I already told you that I did, and that it was the only and only app I >> had written for JSONB. > > Ah, ok, I thought yours was a test case. Did you check how it performed > on the two patches at all? My tests with 185 keys didn't show any > difference, including for a "last key" case. No, I didn't test it. But I think Heikki's test results pretty much tell us everything there is to see here. This isn't really that complicated; I've read a few papers on index compression over the years and they seem to often use techniques that have the same general flavor as what Heikki did here, adding complexity in the data format to gain other advantages. So I don't think we should be put off. Basically, I think that if we make a decision to use Tom's patch rather than Heikki's patch, we're deciding that the initial decision, by the folks who wrote the original jsonb code, to make array access less than O(n) was misguided. While that could be true, I'd prefer to bet that those folks knew what they were doing. The only reason we're even considering changing it is that the array of lengths doesn't compress well, and we've got an approach that fixes that problem while preserving the advantages of fast lookup. We should have a darn fine reason to say no to that approach, and "it didn't benefit my particular use case" is not it. In practice, I'm not very surprised that the impact doesn't seem too bad when you're running SQL queries from the client. There's so much other overhead, for de-TOASTing and client communication and even just planner and executor costs, that this gets lost in the noise. But think about a PL/pgsql procedure, say, where somebody might loop over all of the elements in an array. If those operations go from O(1) to O(n), then the loop goes from O(n) to O(n^2). I will bet you a beverage of your choice that somebody will find that behavior within a year of release and be dismayed by it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
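To make the PL/pgsql concern concrete, here is a hypothetical function (not from the thread) that subscripts every element by index; if each arr -> i lookup costs O(n) under a lengths-only format, the loop is O(n^2), whereas the set-returning form scans the array once.

CREATE OR REPLACE FUNCTION jsonb_array_sum(arr jsonb) RETURNS numeric AS $$
DECLARE
  total numeric := 0;
BEGIN
  FOR i IN 0 .. jsonb_array_length(arr) - 1 LOOP
    total := total + (arr ->> i)::numeric;   -- one random access per iteration
  END LOOP;
  RETURN total;
END;
$$ LANGUAGE plpgsql;

-- Linear alternative, one pass over the array:
-- SELECT sum(v::numeric) FROM jsonb_array_elements_text(arr) t(v);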
Heikki, Robert: On 09/16/2014 11:12 AM, Heikki Linnakangas wrote: > Are you looking for someone with a real life scenario, or just synthetic > test case? The latter is easy to do. > > See attached test program. It's basically the same I posted earlier. > Here are the results from my laptop with Tom's jsonb-lengths-merged.patch: Thanks for that! > postgres=# select * from testtimes ; > elem | duration_ms > ------+------------- > 3674 | 0.530412 > 4041 | 0.552585 > 4445 | 0.581815 This looks like the level at which the difference gets to be really noticeable. Note that this is completely swamped by the difference between compressed vs. uncompressed though. > With unpatched git master, the runtime is flat, regardless of which > element is queried, at about 0.29 ms. With > jsonb-with-offsets-and-lengths-2.patch, there's no difference that I > could measure. OK, thanks. > The difference starts to be meaningful at around 500 entries. In > practice, I doubt anyone's going to notice until you start talking about > tens of thousands of entries. > > I'll leave it up to the jury to decide if we care or not. It seems like > a fairly unusual use case, where you push around large enough arrays or > objects to notice. Then again, I'm sure *someone* will do it. People do > strange things, and they find ways to abuse the features that the > original developers didn't think of. Right, but the question is whether it's worth having a more complex code and data structure in order to support what certainly *seems* to be a fairly obscure use-case, that is, more than 4000 keys at the same level. And it's not like it stops working or becomes completely unresponsive at that level; it's just double the response time. On 09/16/2014 12:20 PM, Robert Haas wrote: > Basically, I think that if we make a decision to use Tom's patch > rather than Heikki's patch, we're deciding that the initial decision, > by the folks who wrote the original jsonb code, to make array access > less than O(n) was misguided. While that could be true, I'd prefer to > bet that those folks knew what they were doing. The only way reason > we're even considering changing it is that the array of lengths > doesn't compress well, and we've got an approach that fixes that > problem while preserving the advantages of fast lookup. We should > have a darn fine reason to say no to that approach, and "it didn't > benefit my particular use case" is not it. Do you feel that way *as a code maintainer*? That is, if you ended up maintaining the JSONB code, would you still feel that it's worth the extra complexity? Because that will be the main cost here. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Sep 16, 2014 at 3:24 PM, Josh Berkus <josh@agliodbs.com> wrote: > Do you feel that way *as a code maintainer*? That is, if you ended up > maintaining the JSONB code, would you still feel that it's worth the > extra complexity? Because that will be the main cost here. I feel that Heikki doesn't have a reputation for writing or committing unmaintainable code. I haven't reviewed the patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 16/09/14 21:20, Robert Haas wrote: > In practice, I'm not very surprised that the impact doesn't seem too > bad when you're running SQL queries from the client. There's so much > other overhead, for de-TOASTing and client communication and even just > planner and executor costs, that this gets lost in the noise. But > think about a PL/pgsql procedure, say, where somebody might loop over > all of the elements in array. If those operations go from O(1) to > O(n), then the loop goes from O(n) to O(n^2). I will bet you a > beverage of your choice that somebody will find that behavior within a > year of release and be dismayed by it. > As somebody who did see a server melt (quite literally that time, unfortunately) thanks to the CPU overhead of operations on varlena arrays, +1 (in fact +many). Especially if we are trying to promote the json improvements in 9.4 as "best of both worlds" kind of thing. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 09/16/2014 10:37 PM, Robert Haas wrote: > On Tue, Sep 16, 2014 at 3:24 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Do you feel that way *as a code maintainer*? That is, if you ended up >> maintaining the JSONB code, would you still feel that it's worth the >> extra complexity? Because that will be the main cost here. > > I feel that Heikki doesn't have a reputation for writing or committing > unmaintainable code. > > I haven't reviewed the patch. The patch I posted was not pretty, but I'm sure it could be refined to something sensible. There are many possible variations of the basic scheme of storing mostly lengths, but an offset for every N elements. I replaced the length with offset on some element and used a flag bit to indicate which it is. Perhaps a simpler approach would be to store lengths, but also store a separate smaller array of offsets, after the lengths array. I can write a patch if we want to go that way. - Heikki
On Tue, Sep 16, 2014 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 16, 2014 at 1:11 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> Well, I can only judge from the use cases I personally have, none of
>>> which involve more than 100 keys at any level for most rows. So far
>>> I've seen some people argue hypotetical use cases involving hundreds of
>>> keys per level, but nobody who *actually* has such a use case.
>>
>> I already told you that I did, and that it was the only and only app I
>> had written for JSONB.
>
> Ah, ok, I thought yours was a test case. Did you check how it performed
> on the two patches at all? My tests with 185 keys didn't show any
> difference, including for a "last key" case.
No, I didn't test it. But I think Heikki's test results pretty much
tell us everything there is to see here. This isn't really that
complicated; I've read a few papers on index compression over the
years and they seem to often use techniques that have the same general
flavor as what Heikki did here, adding complexity in the data format
to gain other advantages. So I don't think we should be put off.
I second this reasoning. Even though I ran a couple of very realistic test cases that support all-lengths, I do feel that the hybrid approach would be better as it covers all bases. To put things in perspective, Tom's latest patch isn't much simpler either.
Since it would still be a breaking change we should consider changing the layout to key-key-key-value-value-value as it seems to pay off.
Basically, I think that if we make a decision to use Tom's patch
rather than Heikki's patch, we're deciding that the initial decision,
by the folks who wrote the original jsonb code, to make array access
less than O(n) was misguided. While that could be true, I'd prefer to
bet that those folks knew what they were doing. The only way reason
we're even considering changing it is that the array of lengths
doesn't compress well, and we've got an approach that fixes that
problem while preserving the advantages of fast lookup. We should
have a darn fine reason to say no to that approach, and "it didn't
benefit my particular use case" is not it.
In practice, I'm not very surprised that the impact doesn't seem too
bad when you're running SQL queries from the client. There's so much
other overhead, for de-TOASTing and client communication and even just
planner and executor costs, that this gets lost in the noise. But
think about a PL/pgsql procedure, say, where somebody might loop over
all of the elements in array. If those operations go from O(1) to
O(n), then the loop goes from O(n) to O(n^2). I will bet you a
beverage of your choice that somebody will find that behavior within a
year of release and be dismayed by it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
----- Quote from Robert Haas (robertmhaas@gmail.com), on 16.09.2014 at 22:20 ----- > > In practice, I'm not very surprised that the impact doesn't seem too > bad when you're running SQL queries from the client. There's so much > other overhead, for de-TOASTing and client communication and even just > planner and executor costs, that this gets lost in the noise. But > think about a PL/pgsql procedure, say, where somebody might loop over > all of the elements in array. If those operations go from O(1) to > O(n), then the loop goes from O(n) to O(n^2). I will bet you a > beverage of your choice that somebody will find that behavior within a > year of release and be dismayed by it. > Hi, I can imagine a situation exactly like that. We could use a jsonb object to represent sparse vectors in the database, where the key is the dimension and the value is the value. So they could easily grow to thousands of dimensions. Once you have that in the database it is easy to go and write some simple numeric computations on these vectors; let's say you want a dot product of 2 sparse vectors. If random access inside one vector goes from O(1) to O(n), then the dot product computation over vectors with n and m dimensions goes to something like O(n*m), so not pretty. I am not saying that the DB is the right place to do this type of computation, but it is sometimes convenient to have it also in the DB. Regards, luben
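A sketch of the sparse-vector dot product described above, with vectors stored as jsonb objects of {"dimension": value} (the function name is made up); each b ->> k lookup is where O(1) versus O(n) element access makes the difference:

CREATE OR REPLACE FUNCTION sparse_dot(a jsonb, b jsonb) RETURNS float8 AS $$
DECLARE
  result float8 := 0;
  k text;
  v text;
BEGIN
  FOR k, v IN SELECT key, value FROM jsonb_each_text(a) LOOP
    IF b ? k THEN                                    -- does b have this dimension?
      result := result + v::float8 * (b ->> k)::float8;
    END IF;
  END LOOP;
  RETURN result;
END;
$$ LANGUAGE plpgsql;

-- SELECT sparse_dot('{"3": 1.5, "7": 2.0}', '{"7": 4.0, "9": 1.0}');  -- 8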
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 09/16/2014 10:37 PM, Robert Haas wrote: >> On Tue, Sep 16, 2014 at 3:24 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> Do you feel that way *as a code maintainer*? That is, if you ended up >>> maintaining the JSONB code, would you still feel that it's worth the >>> extra complexity? Because that will be the main cost here. >> I feel that Heikki doesn't have a reputation for writing or committing >> unmaintainable code. >> I haven't reviewed the patch. > The patch I posted was not pretty, but I'm sure it could be refined to > something sensible. We're somewhat comparing apples and oranges here, in that I pushed my approach to something that I think is of committable quality (and which, not incidentally, fixes some existing bugs that we'd need to fix in any case); while Heikki's patch was just proof-of-concept. It would be worth pushing Heikki's patch to committable quality so that we had a more complete understanding of just what the complexity difference really is. > There are many possible variations of the basic scheme of storing mostly > lengths, but an offset for every N elements. I replaced the length with > offset on some element and used a flag bit to indicate which it is. Aside from the complexity issue, a demerit of Heikki's solution is that it eats up a flag bit that we may well wish we had back later. On the other hand, there's definitely something to be said for not breaking pg_upgrade-ability of 9.4beta databases. > Perhaps a simpler approach would be to store lengths, but also store a > separate smaller array of offsets, after the lengths array. That way would also give up on-disk compatibility, and I'm not sure it's any simpler in practice than your existing solution. regards, tom lane
On 09/16/2014 08:45 PM, Tom Lane wrote: > We're somewhat comparing apples and oranges here, in that I pushed my > approach to something that I think is of committable quality (and which, > not incidentally, fixes some existing bugs that we'd need to fix in any > case); while Heikki's patch was just proof-of-concept. It would be worth > pushing Heikki's patch to committable quality so that we had a more > complete understanding of just what the complexity difference really is. Is anyone actually working on this? If not, I'm voting for the all-lengths patch so that we can get 9.4 out the door. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 09/18/2014 07:53 PM, Josh Berkus wrote: > On 09/16/2014 08:45 PM, Tom Lane wrote: >> We're somewhat comparing apples and oranges here, in that I pushed my >> approach to something that I think is of committable quality (and which, >> not incidentally, fixes some existing bugs that we'd need to fix in any >> case); while Heikki's patch was just proof-of-concept. It would be worth >> pushing Heikki's patch to committable quality so that we had a more >> complete understanding of just what the complexity difference really is. > > Is anyone actually working on this? > > If not, I'm voting for the all-lengths patch so that we can get 9.4 out > the door. I'll try to write a more polished patch tomorrow. We'll then see what it looks like, and can decide if we want it. - Heikki
On 09/18/2014 09:27 PM, Heikki Linnakangas wrote: > On 09/18/2014 07:53 PM, Josh Berkus wrote: >> On 09/16/2014 08:45 PM, Tom Lane wrote: >>> We're somewhat comparing apples and oranges here, in that I pushed my >>> approach to something that I think is of committable quality (and which, >>> not incidentally, fixes some existing bugs that we'd need to fix in any >>> case); while Heikki's patch was just proof-of-concept. It would be worth >>> pushing Heikki's patch to committable quality so that we had a more >>> complete understanding of just what the complexity difference really is. >> >> Is anyone actually working on this? >> >> If not, I'm voting for the all-lengths patch so that we can get 9.4 out >> the door. > > I'll try to write a more polished patch tomorrow. We'll then see what it > looks like, and can decide if we want it. Ok, here are two patches. One is a refined version of my earlier patch, and the other implements the separate offsets array approach. They are both based on Tom's jsonb-lengths-merged.patch, so they include all the whitespace fixes etc. he mentioned. There is no big difference in terms of code complexity between the patches. IMHO the separate offsets array is easier to understand, but it makes for more complicated accessor macros to find the beginning of the variable-length data. Unlike Tom's patch, these patches don't cache any offsets when doing a binary search. Doesn't seem worth it, when the access time is O(1) anyway. Both of these patches have a #define JB_OFFSET_STRIDE for the "stride size". For the separate offsets array, the offsets array has one element for every JB_OFFSET_STRIDE children. For the other patch, every JB_OFFSET_STRIDE child stores the end offset, while others store the length. A smaller value makes random access faster, at the cost of compressibility / on-disk size. I haven't done any measurements to find the optimal value, the values in the patches are arbitrary. I think we should bite the bullet and break compatibility with 9.4beta2 format, even if we go with "my patch". In a jsonb object, it makes sense to store all the keys first, like Tom did, because of cache benefits, and the future possibility to do smart EXTERNAL access. Also, even if we can make the on-disk format compatible, it's weird that you can get different runtime behavior with datums created with a beta version. Seems more clear to just require a pg_dump + restore. Tom: You mentioned earlier that your patch fixes some existing bugs. What were they? There were a bunch of whitespace and comment fixes that we should apply in any case, but I couldn't see any actual bugs. I think we should apply those fixes separately, to make sure we don't forget about them, and to make it easier to review these patches. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > Tom: You mentioned earlier that your patch fixes some existing bugs. > What were they? What I remember at the moment (sans caffeine) is that the routines for assembling jsonb values out of field data were lacking some necessary tests for overflow of the size/offset fields. If you like I can apply those fixes separately, but I think they were sufficiently integrated with other changes in the logic that it wouldn't really help much for patch reviewability. regards, tom lane
On 09/19/2014 07:07 AM, Tom Lane wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> writes: >> Tom: You mentioned earlier that your patch fixes some existing bugs. >> What were they? > > What I remember at the moment (sans caffeine) is that the routines for > assembling jsonb values out of field data were lacking some necessary > tests for overflow of the size/offset fields. If you like I can apply > those fixes separately, but I think they were sufficiently integrated with > other changes in the logic that it wouldn't really help much for patch > reviewability. Where are we on this? Do we have a patch ready for testing? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Sep 19, 2014 at 5:40 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > I think we should bite the bullet and break compatibility with 9.4beta2 > format, even if we go with "my patch". In a jsonb object, it makes sense to > store all the keys first, like Tom did, because of cache benefits, and the > future possibility to do smart EXTERNAL access. Also, even if we can make > the on-disk format compatible, it's weird that you can get different runtime > behavior with datums created with a beta version. Seems more clear to just > require a pg_dump + restore. I vote for going with your patch, and breaking compatibility for the reasons stated here (though I'm skeptical of the claims about cache benefits, FWIW). -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > On Fri, Sep 19, 2014 at 5:40 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> I think we should bite the bullet and break compatibility with 9.4beta2 >> format, even if we go with "my patch". In a jsonb object, it makes sense to >> store all the keys first, like Tom did, because of cache benefits, and the >> future possibility to do smart EXTERNAL access. Also, even if we can make >> the on-disk format compatible, it's weird that you can get different runtime >> behavior with datums created with a beta version. Seems more clear to just >> require a pg_dump + restore. > I vote for going with your patch, and breaking compatibility for the > reasons stated here (though I'm skeptical of the claims about cache > benefits, FWIW). I'm also skeptical of that, but I think the potential for smart EXTERNAL access is a valid consideration. I've not had time to read Heikki's updated patch yet --- has anyone else compared the two patches for code readability? If they're fairly close on that score, then I'd agree his approach is the best solution. (I will look at his code, but I'm not sure I'm the most unbiased observer.) regards, tom lane
On 09/15/2014 09:46 PM, Craig Ringer wrote: > On 09/16/2014 07:44 AM, Peter Geoghegan wrote: >> FWIW, I am slightly concerned about weighing use cases around very >> large JSON documents too heavily. Having enormous jsonb documents just >> isn't going to work out that well, but neither will equivalent designs >> in popular document database systems for similar reasons. For example, >> the maximum BSON document size supported by MongoDB is 16 megabytes, >> and that seems to be something that their users don't care too much >> about. Having 270 pairs in an object isn't unreasonable, but it isn't >> going to be all that common either. > > Also, at a certain size the fact that Pg must rewrite the whole document > for any change to it starts to introduce other practical changes. > > Anyway - this is looking like the change will go in, and with it a > catversion bump. Introduction of a jsonb version/flags byte might be > worthwhile at the same time. It seems likely that there'll be more room > for improvement in jsonb, possibly even down to using different formats > for different data. > > Is it worth paying a byte per value to save on possible upgrade pain? > This comment seems to have drowned in the discussion. If there indeed has to be a catversion bump in the process of this, then I agree with Craig. Jan -- Jan Wieck Senior Software Engineer http://slony.info
On Tue, Sep 23, 2014 at 10:02 PM, Jan Wieck <jan@wi3ck.info> wrote: >> Is it worth paying a byte per value to save on possible upgrade pain? >> > > This comment seems to have drowned in the discussion. > > If there indeed has to be a catversion bump in the process of this, then I > agree with Craig. -1. We already have a reserved bit. -- Peter Geoghegan
Jan Wieck <jan@wi3ck.info> writes: > On 09/15/2014 09:46 PM, Craig Ringer wrote: >> Anyway - this is looking like the change will go in, and with it a >> catversion bump. Introduction of a jsonb version/flags byte might be >> worthwhile at the same time. It seems likely that there'll be more room >> for improvement in jsonb, possibly even down to using different formats >> for different data. >> >> Is it worth paying a byte per value to save on possible upgrade pain? > If there indeed has to be a catversion bump in the process of this, then > I agree with Craig. FWIW, I don't really. To begin with, it wouldn't be a byte per value, it'd be four bytes, because we need word-alignment of the jsonb contents so there's noplace to squeeze in an ID byte for free. Secondly, as I wrote in <15378.1408548595@sss.pgh.pa.us>: : There remains the : question of whether to take this opportunity to add a version ID to the : binary format. I'm not as excited about that idea as I originally was; : having now studied the code more carefully, I think that any expansion : would likely happen by adding more type codes and/or commandeering the : currently-unused high-order bit of JEntrys. We don't need a version ID : in the header for that. Moreover, if we did have such an ID, it would be : notationally painful to get it to most of the places that might need it. Heikki's patch would eat up the high-order JEntry bits, but the other points remain. regards, tom lane
On 09/24/2014 08:16 AM, Tom Lane wrote: > Jan Wieck <jan@wi3ck.info> writes: >> On 09/15/2014 09:46 PM, Craig Ringer wrote: >>> Anyway - this is looking like the change will go in, and with it a >>> catversion bump. Introduction of a jsonb version/flags byte might be >>> worthwhile at the same time. It seems likely that there'll be more room >>> for improvement in jsonb, possibly even down to using different formats >>> for different data. >>> >>> Is it worth paying a byte per value to save on possible upgrade pain? > >> If there indeed has to be a catversion bump in the process of this, then >> I agree with Craig. > > FWIW, I don't really. To begin with, it wouldn't be a byte per value, > it'd be four bytes, because we need word-alignment of the jsonb contents > so there's noplace to squeeze in an ID byte for free. Secondly, as I > wrote in <15378.1408548595@sss.pgh.pa.us>: > > : There remains the > : question of whether to take this opportunity to add a version ID to the > : binary format. I'm not as excited about that idea as I originally was; > : having now studied the code more carefully, I think that any expansion > : would likely happen by adding more type codes and/or commandeering the > : currently-unused high-order bit of JEntrys. We don't need a version ID > : in the header for that. Moreover, if we did have such an ID, it would be > : notationally painful to get it to most of the places that might need it. > > Heikki's patch would eat up the high-order JEntry bits, but the other > points remain. If we don't need to be backwards-compatible with the 9.4beta on-disk format, we don't necessarily need to eat the high-order JEntry bit. You can just assume that every nth element is stored as an offset, and the rest as lengths. Although it would be nice to have the flag for it explicitly. There are also a few free bits in the JsonbContainer header that can be used as a version ID in the future. So I don't think we need to change the format to add an explicit version ID field. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 09/24/2014 08:16 AM, Tom Lane wrote: >> Heikki's patch would eat up the high-order JEntry bits, but the other >> points remain. > If we don't need to be backwards-compatible with the 9.4beta on-disk > format, we don't necessarily need to eat the high-order JEntry bit. You > can just assume that that every nth element is stored as an offset, and > the rest as lengths. Although it would be nice to have the flag for it > explicitly. If we go with this approach, I think that we *should* eat the high bit for it. The main reason I want to do that is that it avoids having to engrave the value of N on stone tablets. I think that we should use a pretty large value of N --- maybe 32 or so --- and having the freedom to change it later based on experience seems like a good thing. regards, tom lane
On 2014-09-19 15:40:14 +0300, Heikki Linnakangas wrote: > On 09/18/2014 09:27 PM, Heikki Linnakangas wrote: > >I'll try to write a more polished patch tomorrow. We'll then see what it > >looks like, and can decide if we want it. > > Ok, here are two patches. One is a refined version of my earlier patch, and > the other implements the separate offsets array approach. They are both > based on Tom's jsonb-lengths-merged.patch, so they include all the > whitespace fixes etc. he mentioned. > > There is no big difference in terms of code complexity between the patches. > IMHO the separate offsets array is easier to understand, but it makes for > more complicated accessor macros to find the beginning of the > variable-length data. I personally am pretty clearly in favor of Heikki's version. I think it could stand to slightly expand the reasoning behind the mixed length/offset format; it's not immediately obvious why the offsets are problematic for compression. Otherwise, based on a cursory look, it looks good. But independent of which version is chosen, we *REALLY* need to make the decision soon. This issue has held up the next beta (like jsonb has blocked previous beta) for *weeks*. Personally it doesn't make me very happy that Heikki and Tom had to be the people stepping up to fix this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 09/25/2014 09:01 AM, Andres Freund wrote: > But independent of which version is chosen, we *REALLY* need to make the > decision soon. This issue has held up the next beta (like jsonb has > blocked previous beta) for *weeks*. Yes, please! -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote: > But independent of which version is chosen, we *REALLY* need to make the > decision soon. This issue has held up the next beta (like jsonb has > blocked previous beta) for *weeks*. > > Personally it doesn't make me very happy that Heikki and Tom had to be > the people stepping up to fix this. I think there are a few reasons this has been delayed, aside from the scheduling ones:
1. compression issues were a surprise, and we are wondering if there are any other surprises
2. pg_upgrade makes future data format changes problematic
3. 9.3 multi-xact bugs spooked us into being more careful
I am not sure what we can do to increase our speed based on these items. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 09/25/2014 10:14 AM, Bruce Momjian wrote: > On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote: >> But independent of which version is chosen, we *REALLY* need to make the >> decision soon. This issue has held up the next beta (like jsonb has >> blocked previous beta) for *weeks*. >> >> Personally it doesn't make me very happy that Heikki and Tom had to be >> the people stepping up to fix this. > > I think there are a few reasons this has been delayed, aside from the > scheduling ones: > > 1. compression issues were a surprise, and we are wondering if > there are any other surprises > 2. pg_upgrade makes future data format changes problematic > 3. 9.3 multi-xact bugs spooked us into being more careful > > I am not sure what we can do to increase our speed based on these items. Alternately, this is delayed because: 1. We have one tested patch to fix the issue. 2. However, people are convinced that there's a better patch possible. 3. But nobody is working on this better patch except "in their spare time". Given this, I once again vote for releasing based on Tom's lengths-only patch, which is done, tested, and ready to go. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2014-09-25 10:18:24 -0700, Josh Berkus wrote: > On 09/25/2014 10:14 AM, Bruce Momjian wrote: > > On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote: > >> But independent of which version is chosen, we *REALLY* need to make the > >> decision soon. This issue has held up the next beta (like jsonb has > >> blocked previous beta) for *weeks*. > >> > >> Personally it doesn't make me very happy that Heikki and Tom had to be > >> the people stepping up to fix this. > > > > I think there are a few reasons this has been delayed, aside from the > > scheduling ones: > > > > 1. compression issues were a surprise, and we are wondering if > > there are any other surprises > > 2. pg_upgrade makes future data format changes problematic > > 3. 9.3 multi-xact bugs spooked us into being more careful > > > > I am not sure what we can do to increase our speed based on these items. > > Alternately, this is delayed because: > > 1. We have one tested patch to fix the issue. > > 2. However, people are convinced that there's a better patch possible. > > 3. But nobody is working on this better patch except "in their spare time". > > Given this, I once again vote for releasing based on Tom's lengths-only > patch, which is done, tested, and ready to go. Heikki's patch is there and polished. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 09/25/2014 10:20 AM, Andres Freund wrote: > On 2014-09-25 10:18:24 -0700, Josh Berkus wrote: >> On 09/25/2014 10:14 AM, Bruce Momjian wrote: >>> On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote: >>>> But independent of which version is chosen, we *REALLY* need to make the >>>> decision soon. This issue has held up the next beta (like jsonb has >>>> blocked previous beta) for *weeks*. >>>> >>>> Personally it doesn't make me very happy that Heikki and Tom had to be >>>> the people stepping up to fix this. >>> >>> I think there are a few reasons this has been delayed, aside from the >>> scheduling ones: >>> >>> 1. compression issues were a surprise, and we are wondering if >>> there are any other surprises >>> 2. pg_upgrade makes future data format changes problematic >>> 3. 9.3 multi-xact bugs spooked us into being more careful >>> >>> I am not sure what we can do to increase our speed based on these items. >> >> Alternately, this is delayed because: >> >> 1. We have one tested patch to fix the issue. >> >> 2. However, people are convinced that there's a better patch possible. >> >> 3. But nobody is working on this better patch except "in their spare time". >> >> Given this, I once again vote for releasing based on Tom's lengths-only >> patch, which is done, tested, and ready to go. > > Heikki's patch is there and polished. If Heikki says it's ready, I'll test. So far he's said that it wasn't done yet. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2014-09-25 10:25:24 -0700, Josh Berkus wrote: > If Heikki says it's ready, I'll test. So far he's said that it wasn't > done yet. http://www.postgresql.org/message-id/541C242E.3030004@vmware.com Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 09/25/2014 10:26 AM, Andres Freund wrote: > On 2014-09-25 10:25:24 -0700, Josh Berkus wrote: >> If Heikki says it's ready, I'll test. So far he's said that it wasn't >> done yet. > > http://www.postgresql.org/message-id/541C242E.3030004@vmware.com Yeah, and that didn't include some of Tom's bug fixes apparently, per the succeeding message. Which is why I asked Heikki if he was done, to which he has not replied. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2014-09-25 10:29:51 -0700, Josh Berkus wrote: > On 09/25/2014 10:26 AM, Andres Freund wrote: > > On 2014-09-25 10:25:24 -0700, Josh Berkus wrote: > >> If Heikki says it's ready, I'll test. So far he's said that it wasn't > >> done yet. > > > > http://www.postgresql.org/message-id/541C242E.3030004@vmware.com > > Yeah, and that didn't include some of Tom's bug fixes apparently, per > the succeeding message. Which is why I asked Heikki if he was done, to > which he has not replied. Well, Heikki said he doesn't see any fixes in Tom's patch. But either way, this isn't anything that should prevent you from testing. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Josh Berkus <josh@agliodbs.com> writes: > On 09/25/2014 10:26 AM, Andres Freund wrote: >> On 2014-09-25 10:25:24 -0700, Josh Berkus wrote: >>> If Heikki says it's ready, I'll test. So far he's said that it wasn't >>> done yet. >> http://www.postgresql.org/message-id/541C242E.3030004@vmware.com > Yeah, and that didn't include some of Tom's bug fixes apparently, per > the succeeding message. Which is why I asked Heikki if he was done, to > which he has not replied. I took a quick look at the two patches Heikki posted. I find the "separate offsets array" approach unappealing. It takes more space than the other approaches, and that space will be filled with data that we already know will not be at all compressible. Moreover, AFAICS we'd have to engrave the stride on stone tablets, which as I already mentioned I'd really like to not do. The "offsets-and-lengths" patch seems like the approach we ought to compare to my patch, but it looks pretty unfinished to me: AFAICS it includes logic to understand offsets sprinkled into a mostly-lengths array, but no logic that would actually *store* any such offsets, which means it's going to act just like my patch for performance purposes. In the interests of pushing this forward, I will work today on trying to finish and review Heikki's offsets-and-lengths patch so that we have something we can do performance testing on. I doubt that the performance testing will tell us anything we don't expect, but we should do it anyway. regards, tom lane
On 09/25/2014 11:22 AM, Tom Lane wrote: > In the interests of pushing this forward, I will work today on > trying to finish and review Heikki's offsets-and-lengths patch > so that we have something we can do performance testing on. > I doubt that the performance testing will tell us anything we > don't expect, but we should do it anyway. OK. I'll spend some time trying to get Socorro with JSONB working so that I'll have a second test case. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
BTW, it seems like there is consensus that we ought to reorder the items in a jsonb object to have keys first and then values, independently of the other issues under discussion. This means we *will* be breaking on-disk compatibility with 9.4beta2, which means pg_upgrade will need to be taught to refuse an upgrade if the database contains any jsonb columns. Bruce, do you have time to crank out a patch for that? regards, tom lane
On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote: > BTW, it seems like there is consensus that we ought to reorder the items > in a jsonb object to have keys first and then values, independently of the > other issues under discussion. This means we *will* be breaking on-disk > compatibility with 9.4beta2, which means pg_upgrade will need to be taught > to refuse an upgrade if the database contains any jsonb columns. Bruce, > do you have time to crank out a patch for that? Yes, I can do that easily. Tell me when you want it --- I just need a catalog version number to trigger on. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2014-09-25 14:46:18 -0400, Bruce Momjian wrote: > On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote: > > BTW, it seems like there is consensus that we ought to reorder the items > > in a jsonb object to have keys first and then values, independently of the > > other issues under discussion. This means we *will* be breaking on-disk > > compatibility with 9.4beta2, which means pg_upgrade will need to be taught > > to refuse an upgrade if the database contains any jsonb columns. Bruce, > > do you have time to crank out a patch for that? > > Yes, I can do that easily. Tell me when you want it --- I just need a > catalog version number to trigger on. Do you plan to make it conditional on jsonb being used in the database? That'd not be bad to reduce the pain for testers that haven't used jsonb. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 25, 2014 at 09:00:07PM +0200, Andres Freund wrote: > On 2014-09-25 14:46:18 -0400, Bruce Momjian wrote: > > On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote: > > > BTW, it seems like there is consensus that we ought to reorder the items > > > in a jsonb object to have keys first and then values, independently of the > > > other issues under discussion. This means we *will* be breaking on-disk > > > compatibility with 9.4beta2, which means pg_upgrade will need to be taught > > > to refuse an upgrade if the database contains any jsonb columns. Bruce, > > > do you have time to crank out a patch for that? > > > > Yes, I can do that easily. Tell me when you want it --- I just need a > > catalog version number to trigger on. > > Do you plan to make it conditional on jsonb being used in the database? > That'd not be bad to reduce the pain for testers that haven't used jsonb. Yes, I already have code that scans pg_attribute looking for columns with problematic data types, outputs them to a file, and then throws an error. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
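For illustration, here is a minimal standalone sketch of the kind of check Bruce describes, written against plain libpq; the query text and the function names are assumptions for this sketch, not the contents of the actual pg_upgrade patch.

/*
 * Minimal sketch (not pg_upgrade's actual code) of the check described
 * above: scan pg_attribute for user columns of type jsonb, list them,
 * and fail so the upgrade can be refused.  Written against plain libpq;
 * pg_upgrade's real helpers and reporting conventions differ.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void
check_for_jsonb_columns(PGconn *conn)
{
	const char *query =
		"SELECT n.nspname, c.relname, a.attname "
		"  FROM pg_catalog.pg_class c, "
		"       pg_catalog.pg_namespace n, "
		"       pg_catalog.pg_attribute a "
		" WHERE c.oid = a.attrelid "
		"   AND NOT a.attisdropped "
		"   AND a.atttypid = 'pg_catalog.jsonb'::pg_catalog.regtype "
		"   AND c.relnamespace = n.oid "
		"   AND n.nspname NOT IN ('pg_catalog', 'information_schema')";
	PGresult   *res = PQexec(conn, query);
	int			ntups,
				i;

	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
		PQclear(res);
		exit(1);
	}

	ntups = PQntuples(res);
	for (i = 0; i < ntups; i++)
		fprintf(stderr, "  %s.%s.%s\n",
				PQgetvalue(res, i, 0),
				PQgetvalue(res, i, 1),
				PQgetvalue(res, i, 2));

	if (ntups > 0)
	{
		fprintf(stderr, "database contains jsonb columns in the pre-release "
				"on-disk format; dump and restore them instead of upgrading\n");
		PQclear(res);
		exit(1);
	}

	PQclear(res);
}

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}
	check_for_jsonb_columns(conn);
	PQfinish(conn);
	return 0;
}

In the real tool this scan would only run when the old cluster's catalog version predates the format change, as discussed further down-thread.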
Bruce Momjian wrote: > 3. 9.3 multi-xact bugs spooked us into being more careful Uh. Multixact changes in 9.3 were infinitely more invasive than the jsonb changes will ever be. a) they touched basic visibility design and routines, which are complex, understood by very few people, and have remained mostly unchanged for ages; b) they changed on-disk format for an underlying support structure, requiring pg_upgrade to handle the conversion; c) they added new catalog infrastructure to keep track of required freezing; d) they introduced new uint32 counters subject to wraparound; e) they introduced a novel user of slru.c with 5-char long filenames; f) they messed with tuple locking protocol and EvalPlanQual logic for traversing update chains. Maybe I'm forgetting others. JSONB has none of these properties. As far as I can see, the only hairy issue here (other than getting Josh Berkus to actually test the proposed patches) is that JSONB is changing on-disk format; but we're avoiding most issues there by dictating that people with existing JSONB databases need to pg_dump them, i.e. there is no conversion step being written for pg_upgrade. It's good to be careful; it's even better to be more careful. I too have learned a lesson there. Anyway I have no opinion on the JSONB stuff, other than considering that ignoring performance for large arrays and large objects seems to run counter to the whole point of JSONB in the first place (and of course failing to compress is part of that, too.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
I wrote: > The "offsets-and-lengths" patch seems like the approach we ought to > compare to my patch, but it looks pretty unfinished to me: AFAICS it > includes logic to understand offsets sprinkled into a mostly-lengths > array, but no logic that would actually *store* any such offsets, > which means it's going to act just like my patch for performance > purposes. > In the interests of pushing this forward, I will work today on > trying to finish and review Heikki's offsets-and-lengths patch > so that we have something we can do performance testing on. > I doubt that the performance testing will tell us anything we > don't expect, but we should do it anyway. I've now done that, and attached is what I think would be a committable version. Having done this work, I no longer think that this approach is significantly messier code-wise than the all-lengths version, and it does have the merit of not degrading on very large objects/arrays. So at the moment I'm leaning to this solution not the all-lengths one. To get a sense of the compression effects of varying the stride distance, I repeated the compression measurements I'd done on 14 August with Pavel's geometry data (<24077.1408052877@sss.pgh.pa.us>). The upshot of that was min max avg external text representation 220 172685 880.3 JSON representation (compressed text) 224 78565 541.3 pg_column_size, JSONB HEAD repr. 225 82540 639.0 pg_column_size, all-lengths repr. 225 66794 531.1 Here's what I get with this patch and different stride distances: JB_OFFSET_STRIDE = 8 225 68551 559.7 JB_OFFSET_STRIDE = 16 225 67601 552.3 JB_OFFSET_STRIDE = 32 225 67120 547.4 JB_OFFSET_STRIDE = 64 225 66886 546.9 JB_OFFSET_STRIDE = 128 225 66879 546.9 JB_OFFSET_STRIDE = 256 225 66846 546.8 So at least for that test data, 32 seems like the sweet spot. We are giving up a couple percent of space in comparison to the all-lengths version, but this is probably an acceptable tradeoff for not degrading on very large arrays. I've not done any speed testing. regards, tom lane diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c index 2fd87fc..9beebb3 100644 *** a/src/backend/utils/adt/jsonb.c --- b/src/backend/utils/adt/jsonb.c *************** jsonb_from_cstring(char *json, int len) *** 196,207 **** static size_t checkStringLen(size_t len) { ! if (len > JENTRY_POSMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("string too long to represent as jsonb string"), errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.", ! JENTRY_POSMASK))); return len; } --- 196,207 ---- static size_t checkStringLen(size_t len) { ! if (len > JENTRY_OFFLENMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("string too long to represent as jsonb string"), errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.", ! JENTRY_OFFLENMASK))); return len; } diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index 04f35bf..f157df3 100644 *** a/src/backend/utils/adt/jsonb_util.c --- b/src/backend/utils/adt/jsonb_util.c *************** *** 26,40 **** * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits * reserved for that in the JsonbContainer.header field. * ! * (the total size of an array's elements is also limited by JENTRY_POSMASK, ! 
* but we're not concerned about that here) */ #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK)) #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK)) ! static void fillJsonbValue(JEntry *array, int index, char *base_addr, JsonbValue *result); ! static bool equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b); static int compareJsonbScalarValue(JsonbValue *a, JsonbValue *b); static Jsonb *convertToJsonb(JsonbValue *val); static void convertJsonbValue(StringInfo buffer, JEntry *header, JsonbValue *val, int level); --- 26,41 ---- * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits * reserved for that in the JsonbContainer.header field. * ! * (The total size of an array's or object's elements is also limited by ! * JENTRY_OFFLENMASK, but we're not concerned about that here.) */ #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK)) #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK)) ! static void fillJsonbValue(JsonbContainer *container, int index, ! char *base_addr, uint32 offset, JsonbValue *result); ! static bool equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b); static int compareJsonbScalarValue(JsonbValue *a, JsonbValue *b); static Jsonb *convertToJsonb(JsonbValue *val); static void convertJsonbValue(StringInfo buffer, JEntry *header, JsonbValue *val, int level); *************** static void convertJsonbArray(StringInfo *** 42,48 **** static void convertJsonbObject(StringInfo buffer, JEntry *header, JsonbValue *val, int level); static void convertJsonbScalar(StringInfo buffer, JEntry *header, JsonbValue *scalarVal); ! static int reserveFromBuffer(StringInfo buffer, int len); static void appendToBuffer(StringInfo buffer, const char *data, int len); static void copyToBuffer(StringInfo buffer, int offset, const char *data, int len); static short padBufferToInt(StringInfo buffer); --- 43,49 ---- static void convertJsonbObject(StringInfo buffer, JEntry *header, JsonbValue *val, int level); static void convertJsonbScalar(StringInfo buffer, JEntry *header, JsonbValue *scalarVal); ! static int reserveFromBuffer(StringInfo buffer, int len); static void appendToBuffer(StringInfo buffer, const char *data, int len); static void copyToBuffer(StringInfo buffer, int offset, const char *data, int len); static short padBufferToInt(StringInfo buffer); *************** JsonbValueToJsonb(JsonbValue *val) *** 108,113 **** --- 109,166 ---- } /* + * Get the offset of the variable-length portion of a Jsonb node within + * the variable-length-data part of its container. The node is identified + * by index within the container's JEntry array. + */ + uint32 + getJsonbOffset(const JsonbContainer *jc, int index) + { + uint32 offset = 0; + int i; + + /* + * Start offset of this entry is equal to the end offset of the previous + * entry. Walk backwards to the most recent entry stored as an end + * offset, returning that offset plus any lengths in between. + */ + for (i = index - 1; i >= 0; i--) + { + offset += JBE_OFFLENFLD(jc->children[i]); + if (JBE_HAS_OFF(jc->children[i])) + break; + } + + return offset; + } + + /* + * Get the length of the variable-length portion of a Jsonb node. + * The node is identified by index within the container's JEntry array. + */ + uint32 + getJsonbLength(const JsonbContainer *jc, int index) + { + uint32 off; + uint32 len; + + /* + * If the length is stored directly in the JEntry, just return it. 
+ * Otherwise, get the begin offset of the entry, and subtract that from + * the stored end+1 offset. + */ + if (JBE_HAS_OFF(jc->children[index])) + { + off = getJsonbOffset(jc, index); + len = JBE_OFFLENFLD(jc->children[index]) - off; + } + else + len = JBE_OFFLENFLD(jc->children[index]); + + return len; + } + + /* * BT comparator worker function. Returns an integer less than, equal to, or * greater than zero, indicating whether a is less than, equal to, or greater * than b. Consistent with the requirements for a B-Tree operator class *************** compareJsonbContainers(JsonbContainer *a *** 201,207 **** * * If the two values were of the same container type, then there'd * have been a chance to observe the variation in the number of ! * elements/pairs (when processing WJB_BEGIN_OBJECT, say). They're * either two heterogeneously-typed containers, or a container and * some scalar type. * --- 254,260 ---- * * If the two values were of the same container type, then there'd * have been a chance to observe the variation in the number of ! * elements/pairs (when processing WJB_BEGIN_OBJECT, say). They're * either two heterogeneously-typed containers, or a container and * some scalar type. * *************** findJsonbValueFromContainer(JsonbContain *** 272,295 **** { JEntry *children = container->children; int count = (container->header & JB_CMASK); ! JsonbValue *result = palloc(sizeof(JsonbValue)); Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0); if (flags & JB_FARRAY & container->header) { char *base_addr = (char *) (children + count); int i; for (i = 0; i < count; i++) { ! fillJsonbValue(children, i, base_addr, result); if (key->type == result->type) { if (equalsJsonbScalarValue(key, result)) return result; } } } else if (flags & JB_FOBJECT & container->header) --- 325,357 ---- { JEntry *children = container->children; int count = (container->header & JB_CMASK); ! JsonbValue *result; Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0); + /* Quick out without a palloc cycle if object/array is empty */ + if (count <= 0) + return NULL; + + result = palloc(sizeof(JsonbValue)); + if (flags & JB_FARRAY & container->header) { char *base_addr = (char *) (children + count); + uint32 offset = 0; int i; for (i = 0; i < count; i++) { ! fillJsonbValue(container, i, base_addr, offset, result); if (key->type == result->type) { if (equalsJsonbScalarValue(key, result)) return result; } + + JBE_ADVANCE_OFFSET(offset, children[i]); } } else if (flags & JB_FOBJECT & container->header) *************** findJsonbValueFromContainer(JsonbContain *** 297,332 **** /* Since this is an object, account for *Pairs* of Jentrys */ char *base_addr = (char *) (children + count * 2); uint32 stopLow = 0, ! stopMiddle; ! /* Object key past by caller must be a string */ Assert(key->type == jbvString); /* Binary search on object/pair keys *only* */ ! while (stopLow < count) { ! int index; int difference; JsonbValue candidate; ! /* ! * Note how we compensate for the fact that we're iterating ! * through pairs (not entries) throughout. ! */ ! stopMiddle = stopLow + (count - stopLow) / 2; ! ! index = stopMiddle * 2; candidate.type = jbvString; ! candidate.val.string.val = base_addr + JBE_OFF(children, index); ! candidate.val.string.len = JBE_LEN(children, index); difference = lengthCompareJsonbStringValue(&candidate, key); if (difference == 0) { ! /* Found our key, return value */ ! 
fillJsonbValue(children, index + 1, base_addr, result); return result; } --- 359,393 ---- /* Since this is an object, account for *Pairs* of Jentrys */ char *base_addr = (char *) (children + count * 2); uint32 stopLow = 0, ! stopHigh = count; ! /* Object key passed by caller must be a string */ Assert(key->type == jbvString); /* Binary search on object/pair keys *only* */ ! while (stopLow < stopHigh) { ! uint32 stopMiddle; int difference; JsonbValue candidate; ! stopMiddle = stopLow + (stopHigh - stopLow) / 2; candidate.type = jbvString; ! candidate.val.string.val = ! base_addr + getJsonbOffset(container, stopMiddle); ! candidate.val.string.len = getJsonbLength(container, stopMiddle); difference = lengthCompareJsonbStringValue(&candidate, key); if (difference == 0) { ! /* Found our key, return corresponding value */ ! int index = stopMiddle + count; ! ! fillJsonbValue(container, index, base_addr, ! getJsonbOffset(container, index), ! result); return result; } *************** findJsonbValueFromContainer(JsonbContain *** 335,341 **** if (difference < 0) stopLow = stopMiddle + 1; else ! count = stopMiddle; } } } --- 396,402 ---- if (difference < 0) stopLow = stopMiddle + 1; else ! stopHigh = stopMiddle; } } } *************** getIthJsonbValueFromContainer(JsonbConta *** 368,374 **** result = palloc(sizeof(JsonbValue)); ! fillJsonbValue(container->children, i, base_addr, result); return result; } --- 429,437 ---- result = palloc(sizeof(JsonbValue)); ! fillJsonbValue(container, i, base_addr, ! getJsonbOffset(container, i), ! result); return result; } *************** getIthJsonbValueFromContainer(JsonbConta *** 377,389 **** * A helper function to fill in a JsonbValue to represent an element of an * array, or a key or value of an object. * * A nested array or object will be returned as jbvBinary, ie. it won't be * expanded. */ static void ! fillJsonbValue(JEntry *children, int index, char *base_addr, JsonbValue *result) { ! JEntry entry = children[index]; if (JBE_ISNULL(entry)) { --- 440,459 ---- * A helper function to fill in a JsonbValue to represent an element of an * array, or a key or value of an object. * + * The node's JEntry is at container->children[index], and its variable-length + * data is at base_addr + offset. We make the caller determine the offset + * since in many cases the caller can amortize that work across multiple + * children. When it can't, it can just call getJsonbOffset(). + * * A nested array or object will be returned as jbvBinary, ie. it won't be * expanded. */ static void ! fillJsonbValue(JsonbContainer *container, int index, ! char *base_addr, uint32 offset, ! JsonbValue *result) { ! JEntry entry = container->children[index]; if (JBE_ISNULL(entry)) { *************** fillJsonbValue(JEntry *children, int ind *** 392,405 **** else if (JBE_ISSTRING(entry)) { result->type = jbvString; ! result->val.string.val = base_addr + JBE_OFF(children, index); ! result->val.string.len = JBE_LEN(children, index); Assert(result->val.string.len >= 0); } else if (JBE_ISNUMERIC(entry)) { result->type = jbvNumeric; ! result->val.numeric = (Numeric) (base_addr + INTALIGN(JBE_OFF(children, index))); } else if (JBE_ISBOOL_TRUE(entry)) { --- 462,475 ---- else if (JBE_ISSTRING(entry)) { result->type = jbvString; ! result->val.string.val = base_addr + offset; ! result->val.string.len = getJsonbLength(container, index); Assert(result->val.string.len >= 0); } else if (JBE_ISNUMERIC(entry)) { result->type = jbvNumeric; ! 
result->val.numeric = (Numeric) (base_addr + INTALIGN(offset)); } else if (JBE_ISBOOL_TRUE(entry)) { *************** fillJsonbValue(JEntry *children, int ind *** 415,422 **** { Assert(JBE_ISCONTAINER(entry)); result->type = jbvBinary; ! result->val.binary.data = (JsonbContainer *) (base_addr + INTALIGN(JBE_OFF(children, index))); ! result->val.binary.len = JBE_LEN(children, index) - (INTALIGN(JBE_OFF(children, index)) - JBE_OFF(children, index)); } } --- 485,494 ---- { Assert(JBE_ISCONTAINER(entry)); result->type = jbvBinary; ! /* Remove alignment padding from data pointer and length */ ! result->val.binary.data = (JsonbContainer *) (base_addr + INTALIGN(offset)); ! result->val.binary.len = getJsonbLength(container, index) - ! (INTALIGN(offset) - offset); } } *************** recurse: *** 668,680 **** * a full conversion */ val->val.array.rawScalar = (*it)->isScalar; ! (*it)->i = 0; /* Set state for next call */ (*it)->state = JBI_ARRAY_ELEM; return WJB_BEGIN_ARRAY; case JBI_ARRAY_ELEM: ! if ((*it)->i >= (*it)->nElems) { /* * All elements within array already processed. Report this --- 740,754 ---- * a full conversion */ val->val.array.rawScalar = (*it)->isScalar; ! (*it)->curIndex = 0; ! (*it)->curDataOffset = 0; ! (*it)->curValueOffset = 0; /* not actually used */ /* Set state for next call */ (*it)->state = JBI_ARRAY_ELEM; return WJB_BEGIN_ARRAY; case JBI_ARRAY_ELEM: ! if ((*it)->curIndex >= (*it)->nElems) { /* * All elements within array already processed. Report this *************** recurse: *** 686,692 **** return WJB_END_ARRAY; } ! fillJsonbValue((*it)->children, (*it)->i++, (*it)->dataProper, val); if (!IsAJsonbScalar(val) && !skipNested) { --- 760,772 ---- return WJB_END_ARRAY; } ! fillJsonbValue((*it)->container, (*it)->curIndex, ! (*it)->dataProper, (*it)->curDataOffset, ! val); ! ! JBE_ADVANCE_OFFSET((*it)->curDataOffset, ! (*it)->children[(*it)->curIndex]); ! (*it)->curIndex++; if (!IsAJsonbScalar(val) && !skipNested) { *************** recurse: *** 697,704 **** else { /* ! * Scalar item in array, or a container and caller didn't ! * want us to recurse into it. */ return WJB_ELEM; } --- 777,784 ---- else { /* ! * Scalar item in array, or a container and caller didn't want ! * us to recurse into it. */ return WJB_ELEM; } *************** recurse: *** 712,724 **** * v->val.object.pairs is not actually set, because we aren't * doing a full conversion */ ! (*it)->i = 0; /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; return WJB_BEGIN_OBJECT; case JBI_OBJECT_KEY: ! if ((*it)->i >= (*it)->nElems) { /* * All pairs within object already processed. Report this to --- 792,807 ---- * v->val.object.pairs is not actually set, because we aren't * doing a full conversion */ ! (*it)->curIndex = 0; ! (*it)->curDataOffset = 0; ! (*it)->curValueOffset = getJsonbOffset((*it)->container, ! (*it)->nElems); /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; return WJB_BEGIN_OBJECT; case JBI_OBJECT_KEY: ! if ((*it)->curIndex >= (*it)->nElems) { /* * All pairs within object already processed. Report this to *************** recurse: *** 732,738 **** else { /* Return key of a key/value pair. */ ! fillJsonbValue((*it)->children, (*it)->i * 2, (*it)->dataProper, val); if (val->type != jbvString) elog(ERROR, "unexpected jsonb type as object key"); --- 815,823 ---- else { /* Return key of a key/value pair. */ ! fillJsonbValue((*it)->container, (*it)->curIndex, ! (*it)->dataProper, (*it)->curDataOffset, ! 
val); if (val->type != jbvString) elog(ERROR, "unexpected jsonb type as object key"); *************** recurse: *** 745,752 **** /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; ! fillJsonbValue((*it)->children, ((*it)->i++) * 2 + 1, ! (*it)->dataProper, val); /* * Value may be a container, in which case we recurse with new, --- 830,844 ---- /* Set state for next call */ (*it)->state = JBI_OBJECT_KEY; ! fillJsonbValue((*it)->container, (*it)->curIndex + (*it)->nElems, ! (*it)->dataProper, (*it)->curValueOffset, ! val); ! ! JBE_ADVANCE_OFFSET((*it)->curDataOffset, ! (*it)->children[(*it)->curIndex]); ! JBE_ADVANCE_OFFSET((*it)->curValueOffset, ! (*it)->children[(*it)->curIndex + (*it)->nElems]); ! (*it)->curIndex++; /* * Value may be a container, in which case we recurse with new, *************** iteratorFromContainer(JsonbContainer *co *** 795,805 **** break; case JB_FOBJECT: - - /* - * Offset reflects that nElems indicates JsonbPairs in an object. - * Each key and each value contain Jentry metadata just the same. - */ it->dataProper = (char *) it->children + it->nElems * sizeof(JEntry) * 2; it->state = JBI_OBJECT_START; --- 887,892 ---- *************** reserveFromBuffer(StringInfo buffer, int *** 1209,1216 **** buffer->len += len; /* ! * Keep a trailing null in place, even though it's not useful for us; ! * it seems best to preserve the invariants of StringInfos. */ buffer->data[buffer->len] = '\0'; --- 1296,1303 ---- buffer->len += len; /* ! * Keep a trailing null in place, even though it's not useful for us; it ! * seems best to preserve the invariants of StringInfos. */ buffer->data[buffer->len] = '\0'; *************** convertToJsonb(JsonbValue *val) *** 1284,1291 **** /* * Note: the JEntry of the root is discarded. Therefore the root ! * JsonbContainer struct must contain enough information to tell what ! * kind of value it is. */ res = (Jsonb *) buffer.data; --- 1371,1378 ---- /* * Note: the JEntry of the root is discarded. Therefore the root ! * JsonbContainer struct must contain enough information to tell what kind ! * of value it is. */ res = (Jsonb *) buffer.data; *************** convertToJsonb(JsonbValue *val) *** 1298,1307 **** /* * Subroutine of convertJsonb: serialize a single JsonbValue into buffer. * ! * The JEntry header for this node is returned in *header. It is filled in ! * with the length of this value, but if it is stored in an array or an ! * object (which is always, except for the root node), it is the caller's ! * responsibility to adjust it with the offset within the container. * * If the value is an array or an object, this recurses. 'level' is only used * for debugging purposes. --- 1385,1394 ---- /* * Subroutine of convertJsonb: serialize a single JsonbValue into buffer. * ! * The JEntry header for this node is returned in *header. It is filled in ! * with the length of this value and appropriate type bits. If we wish to ! * store an end offset rather than a length, it is the caller's responsibility ! * to adjust for that. * * If the value is an array or an object, this recurses. 'level' is only used * for debugging purposes. *************** convertJsonbValue(StringInfo buffer, JEn *** 1315,1324 **** return; /* ! * A JsonbValue passed as val should never have a type of jbvBinary, ! * and neither should any of its sub-components. Those values will be ! * produced by convertJsonbArray and convertJsonbObject, the results of ! * which will not be passed back to this function as an argument. */ if (IsAJsonbScalar(val)) --- 1402,1411 ---- return; /* ! 
* A JsonbValue passed as val should never have a type of jbvBinary, and ! * neither should any of its sub-components. Those values will be produced ! * by convertJsonbArray and convertJsonbObject, the results of which will ! * not be passed back to this function as an argument. */ if (IsAJsonbScalar(val)) *************** convertJsonbValue(StringInfo buffer, JEn *** 1334,1457 **** static void convertJsonbArray(StringInfo buffer, JEntry *pheader, JsonbValue *val, int level) { ! int offset; ! int metaoffset; int i; int totallen; uint32 header; ! /* Initialize pointer into conversion buffer at this level */ ! offset = buffer->len; padBufferToInt(buffer); /* ! * Construct the header Jentry, stored in the beginning of the variable- ! * length payload. */ ! header = val->val.array.nElems | JB_FARRAY; if (val->val.array.rawScalar) { ! Assert(val->val.array.nElems == 1); Assert(level == 0); header |= JB_FSCALAR; } appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! /* reserve space for the JEntries of the elements. */ ! metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.array.nElems); totallen = 0; ! for (i = 0; i < val->val.array.nElems; i++) { JsonbValue *elem = &val->val.array.elems[i]; int len; JEntry meta; convertJsonbValue(buffer, &meta, elem, level + 1); ! len = meta & JENTRY_POSMASK; totallen += len; ! if (totallen > JENTRY_POSMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_POSMASK))); ! if (i > 0) ! meta = (meta & ~JENTRY_POSMASK) | totallen; ! copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); ! metaoffset += sizeof(JEntry); } ! totallen = buffer->len - offset; ! /* Initialize the header of this node, in the container's JEntry array */ *pheader = JENTRY_ISCONTAINER | totallen; } static void convertJsonbObject(StringInfo buffer, JEntry *pheader, JsonbValue *val, int level) { ! uint32 header; ! int offset; ! int metaoffset; int i; int totallen; ! /* Initialize pointer into conversion buffer at this level */ ! offset = buffer->len; padBufferToInt(buffer); ! /* Initialize header */ ! header = val->val.object.nPairs | JB_FOBJECT; appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! /* reserve space for the JEntries of the keys and values */ ! metaoffset = reserveFromBuffer(buffer, sizeof(JEntry) * val->val.object.nPairs * 2); totallen = 0; ! for (i = 0; i < val->val.object.nPairs; i++) { ! JsonbPair *pair = &val->val.object.pairs[i]; ! int len; ! JEntry meta; ! /* put key */ convertJsonbScalar(buffer, &meta, &pair->key); ! len = meta & JENTRY_POSMASK; totallen += len; ! if (totallen > JENTRY_POSMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), ! errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_POSMASK))); ! if (i > 0) ! meta = (meta & ~JENTRY_POSMASK) | totallen; ! copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); ! metaoffset += sizeof(JEntry); ! convertJsonbValue(buffer, &meta, &pair->value, level); ! len = meta & JENTRY_POSMASK; totallen += len; ! if (totallen > JENTRY_POSMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), ! errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_POSMASK))); ! meta = (meta & ~JENTRY_POSMASK) | totallen; ! copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); ! metaoffset += sizeof(JEntry); } ! 
totallen = buffer->len - offset; *pheader = JENTRY_ISCONTAINER | totallen; } --- 1421,1620 ---- static void convertJsonbArray(StringInfo buffer, JEntry *pheader, JsonbValue *val, int level) { ! int base_offset; ! int jentry_offset; int i; int totallen; uint32 header; + int nElems = val->val.array.nElems; ! /* Remember where in the buffer this array starts. */ ! base_offset = buffer->len; + /* Align to 4-byte boundary (any padding counts as part of my data) */ padBufferToInt(buffer); /* ! * Construct the header Jentry and store it in the beginning of the ! * variable-length payload. */ ! header = nElems | JB_FARRAY; if (val->val.array.rawScalar) { ! Assert(nElems == 1); Assert(level == 0); header |= JB_FSCALAR; } appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! ! /* Reserve space for the JEntries of the elements. */ ! jentry_offset = reserveFromBuffer(buffer, sizeof(JEntry) * nElems); totallen = 0; ! for (i = 0; i < nElems; i++) { JsonbValue *elem = &val->val.array.elems[i]; int len; JEntry meta; + /* + * Convert element, producing a JEntry and appending its + * variable-length data to buffer + */ convertJsonbValue(buffer, &meta, elem, level + 1); ! ! len = JBE_OFFLENFLD(meta); totallen += len; ! /* ! * Bail out if total variable-length data exceeds what will fit in a ! * JEntry length field. We check this in each iteration, not just ! * once at the end, to forestall possible integer overflow. ! */ ! if (totallen > JENTRY_OFFLENMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_OFFLENMASK))); ! /* ! * Convert each JB_OFFSET_STRIDE'th length to an offset. ! */ ! if ((i % JB_OFFSET_STRIDE) == 0) ! meta = (meta & JENTRY_TYPEMASK) | totallen | JENTRY_HAS_OFF; ! ! copyToBuffer(buffer, jentry_offset, (char *) &meta, sizeof(JEntry)); ! jentry_offset += sizeof(JEntry); } ! /* Total data size is everything we've appended to buffer */ ! totallen = buffer->len - base_offset; ! /* Check length again, since we didn't include the metadata above */ ! if (totallen > JENTRY_OFFLENMASK) ! ereport(ERROR, ! (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), ! errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", ! JENTRY_OFFLENMASK))); ! ! /* Initialize the header of this node in the container's JEntry array */ *pheader = JENTRY_ISCONTAINER | totallen; } static void convertJsonbObject(StringInfo buffer, JEntry *pheader, JsonbValue *val, int level) { ! int base_offset; ! int jentry_offset; int i; int totallen; + uint32 header; + int nPairs = val->val.object.nPairs; ! /* Remember where in the buffer this object starts. */ ! base_offset = buffer->len; + /* Align to 4-byte boundary (any padding counts as part of my data) */ padBufferToInt(buffer); ! /* ! * Construct the header Jentry and store it in the beginning of the ! * variable-length payload. ! */ ! header = nPairs | JB_FOBJECT; appendToBuffer(buffer, (char *) &header, sizeof(uint32)); ! /* Reserve space for the JEntries of the keys and values. */ ! jentry_offset = reserveFromBuffer(buffer, sizeof(JEntry) * nPairs * 2); + /* + * Iterate over the keys, then over the values, since that is the ordering + * we want in the on-disk representation. + */ totallen = 0; ! for (i = 0; i < nPairs; i++) { ! JsonbPair *pair = &val->val.object.pairs[i]; ! int len; ! JEntry meta; ! /* ! * Convert key, producing a JEntry and appending its variable-length ! * data to buffer ! */ convertJsonbScalar(buffer, &meta, &pair->key); ! 
len = JBE_OFFLENFLD(meta); totallen += len; ! /* ! * Bail out if total variable-length data exceeds what will fit in a ! * JEntry length field. We check this in each iteration, not just ! * once at the end, to forestall possible integer overflow. ! */ ! if (totallen > JENTRY_OFFLENMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), ! errmsg("total size of jsonb object elements exceeds the maximum of %u bytes", ! JENTRY_OFFLENMASK))); ! /* ! * Convert each JB_OFFSET_STRIDE'th length to an offset. ! */ ! if ((i % JB_OFFSET_STRIDE) == 0) ! meta = (meta & JENTRY_TYPEMASK) | totallen | JENTRY_HAS_OFF; ! copyToBuffer(buffer, jentry_offset, (char *) &meta, sizeof(JEntry)); ! jentry_offset += sizeof(JEntry); ! } ! for (i = 0; i < nPairs; i++) ! { ! JsonbPair *pair = &val->val.object.pairs[i]; ! int len; ! JEntry meta; ! ! /* ! * Convert value, producing a JEntry and appending its variable-length ! * data to buffer ! */ ! convertJsonbValue(buffer, &meta, &pair->value, level + 1); ! ! len = JBE_OFFLENFLD(meta); totallen += len; ! /* ! * Bail out if total variable-length data exceeds what will fit in a ! * JEntry length field. We check this in each iteration, not just ! * once at the end, to forestall possible integer overflow. ! */ ! if (totallen > JENTRY_OFFLENMASK) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), ! errmsg("total size of jsonb object elements exceeds the maximum of %u bytes", ! JENTRY_OFFLENMASK))); ! /* ! * Convert each JB_OFFSET_STRIDE'th length to an offset. ! */ ! if (((i + nPairs) % JB_OFFSET_STRIDE) == 0) ! meta = (meta & JENTRY_TYPEMASK) | totallen | JENTRY_HAS_OFF; ! ! copyToBuffer(buffer, jentry_offset, (char *) &meta, sizeof(JEntry)); ! jentry_offset += sizeof(JEntry); } ! /* Total data size is everything we've appended to buffer */ ! totallen = buffer->len - base_offset; + /* Check length again, since we didn't include the metadata above */ + if (totallen > JENTRY_OFFLENMASK) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("total size of jsonb object elements exceeds the maximum of %u bytes", + JENTRY_OFFLENMASK))); + + /* Initialize the header of this node in the container's JEntry array */ *pheader = JENTRY_ISCONTAINER | totallen; } diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h index 91e3e14..b89e4cb 100644 *** a/src/include/utils/jsonb.h --- b/src/include/utils/jsonb.h *************** typedef struct JsonbValue JsonbValue; *** 83,91 **** * buffer is accessed, but they can also be deep copied and passed around. * * Jsonb is a tree structure. Each node in the tree consists of a JEntry ! * header, and a variable-length content. The JEntry header indicates what ! * kind of a node it is, e.g. a string or an array, and the offset and length ! * of its variable-length portion within the container. * * The JEntry and the content of a node are not stored physically together. * Instead, the container array or object has an array that holds the JEntrys --- 83,91 ---- * buffer is accessed, but they can also be deep copied and passed around. * * Jsonb is a tree structure. Each node in the tree consists of a JEntry ! * header and a variable-length content (possibly of zero size). The JEntry ! * header indicates what kind of a node it is, e.g. a string or an array, ! * and provides the length of its variable-length portion. * * The JEntry and the content of a node are not stored physically together. 
* Instead, the container array or object has an array that holds the JEntrys *************** typedef struct JsonbValue JsonbValue; *** 95,134 **** * hold its JEntry. Hence, no JEntry header is stored for the root node. It * is implicitly known that the root node must be an array or an object, * so we can get away without the type indicator as long as we can distinguish ! * the two. For that purpose, both an array and an object begins with a uint32 * header field, which contains an JB_FOBJECT or JB_FARRAY flag. When a naked * scalar value needs to be stored as a Jsonb value, what we actually store is * an array with one element, with the flags in the array's header field set * to JB_FSCALAR | JB_FARRAY. * - * To encode the length and offset of the variable-length portion of each - * node in a compact way, the JEntry stores only the end offset within the - * variable-length portion of the container node. For the first JEntry in the - * container's JEntry array, that equals to the length of the node data. The - * begin offset and length of the rest of the entries can be calculated using - * the end offset of the previous JEntry in the array. - * * Overall, the Jsonb struct requires 4-bytes alignment. Within the struct, * the variable-length portion of some node types is aligned to a 4-byte * boundary, while others are not. When alignment is needed, the padding is * in the beginning of the node that requires it. For example, if a numeric * node is stored after a string node, so that the numeric node begins at * offset 3, the variable-length portion of the numeric node will begin with ! * one padding byte. */ /* ! * Jentry format. * ! * The least significant 28 bits store the end offset of the entry (see ! * JBE_ENDPOS, JBE_OFF, JBE_LEN macros below). The next three bits ! * are used to store the type of the entry. The most significant bit ! * is unused, and should be set to zero. */ typedef uint32 JEntry; ! #define JENTRY_POSMASK 0x0FFFFFFF #define JENTRY_TYPEMASK 0x70000000 /* values stored in the type bits */ #define JENTRY_ISSTRING 0x00000000 --- 95,146 ---- * hold its JEntry. Hence, no JEntry header is stored for the root node. It * is implicitly known that the root node must be an array or an object, * so we can get away without the type indicator as long as we can distinguish ! * the two. For that purpose, both an array and an object begin with a uint32 * header field, which contains an JB_FOBJECT or JB_FARRAY flag. When a naked * scalar value needs to be stored as a Jsonb value, what we actually store is * an array with one element, with the flags in the array's header field set * to JB_FSCALAR | JB_FARRAY. * * Overall, the Jsonb struct requires 4-bytes alignment. Within the struct, * the variable-length portion of some node types is aligned to a 4-byte * boundary, while others are not. When alignment is needed, the padding is * in the beginning of the node that requires it. For example, if a numeric * node is stored after a string node, so that the numeric node begins at * offset 3, the variable-length portion of the numeric node will begin with ! * one padding byte so that the actual numeric data is 4-byte aligned. */ /* ! * JEntry format. * ! * The least significant 28 bits store either the data length of the entry, ! * or its end+1 offset from the start of the variable-length portion of the ! * containing object. The next three bits store the type of the entry, and ! * the high-order bit tells whether the least significant bits store a length ! * or an offset. ! * ! 
* The reason for the offset-or-length complication is to compromise between ! * access speed and data compressibility. In the initial design each JEntry ! * always stored an offset, but this resulted in JEntry arrays with horrible ! * compressibility properties, so that TOAST compression of a JSONB did not ! * work well. Storing only lengths would greatly improve compressibility, ! * but it makes random access into large arrays expensive (O(N) not O(1)). ! * So what we do is store an offset in every JB_OFFSET_STRIDE'th JEntry and ! * a length in the rest. This results in reasonably compressible data (as ! * long as the stride isn't too small). We may have to examine as many as ! * JB_OFFSET_STRIDE JEntrys in order to find out the offset or length of any ! * given item, but that's still O(1) no matter how large the container is. ! * ! * We could avoid eating a flag bit for this purpose if we were to store ! * the stride in the container header, or if we were willing to treat the ! * stride as an unchangeable constant. Neither of those options is very ! * attractive though. */ typedef uint32 JEntry; ! #define JENTRY_OFFLENMASK 0x0FFFFFFF #define JENTRY_TYPEMASK 0x70000000 + #define JENTRY_HAS_OFF 0x80000000 /* values stored in the type bits */ #define JENTRY_ISSTRING 0x00000000 *************** typedef uint32 JEntry; *** 138,144 **** #define JENTRY_ISNULL 0x40000000 #define JENTRY_ISCONTAINER 0x50000000 /* array or object */ ! /* Note possible multiple evaluations */ #define JBE_ISSTRING(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISSTRING) #define JBE_ISNUMERIC(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISNUMERIC) #define JBE_ISCONTAINER(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISCONTAINER) --- 150,158 ---- #define JENTRY_ISNULL 0x40000000 #define JENTRY_ISCONTAINER 0x50000000 /* array or object */ ! /* Access macros. Note possible multiple evaluations */ ! #define JBE_OFFLENFLD(je_) ((je_) & JENTRY_OFFLENMASK) ! #define JBE_HAS_OFF(je_) (((je_) & JENTRY_HAS_OFF) != 0) #define JBE_ISSTRING(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISSTRING) #define JBE_ISNUMERIC(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISNUMERIC) #define JBE_ISCONTAINER(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISCONTAINER) *************** typedef uint32 JEntry; *** 147,166 **** #define JBE_ISBOOL_FALSE(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISBOOL_FALSE) #define JBE_ISBOOL(je_) (JBE_ISBOOL_TRUE(je_) || JBE_ISBOOL_FALSE(je_)) /* ! * Macros for getting the offset and length of an element. Note multiple ! * evaluations and access to prior array element. */ ! #define JBE_ENDPOS(je_) ((je_) & JENTRY_POSMASK) ! #define JBE_OFF(ja, i) ((i) == 0 ? 0 : JBE_ENDPOS((ja)[i - 1])) ! #define JBE_LEN(ja, i) ((i) == 0 ? JBE_ENDPOS((ja)[i]) \ ! : JBE_ENDPOS((ja)[i]) - JBE_ENDPOS((ja)[i - 1])) /* * A jsonb array or object node, within a Jsonb Datum. * ! * An array has one child for each element. An object has two children for ! * each key/value pair. */ typedef struct JsonbContainer { --- 161,194 ---- #define JBE_ISBOOL_FALSE(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISBOOL_FALSE) #define JBE_ISBOOL(je_) (JBE_ISBOOL_TRUE(je_) || JBE_ISBOOL_FALSE(je_)) + /* Macro for advancing an offset variable to the next JEntry */ + #define JBE_ADVANCE_OFFSET(offset, je) \ + do { \ + JEntry je_ = (je); \ + if (JBE_HAS_OFF(je_)) \ + (offset) = JBE_OFFLENFLD(je_); \ + else \ + (offset) += JBE_OFFLENFLD(je_); \ + } while(0) + /* ! * We store an offset, not a length, every JB_OFFSET_STRIDE children. ! 
* Caution: this macro should only be referenced when creating a JSONB ! * value. When examining an existing value, pay attention to the HAS_OFF ! * bits instead. This allows changes in the offset-placement heuristic ! * without breaking on-disk compatibility. */ ! #define JB_OFFSET_STRIDE 32 /* * A jsonb array or object node, within a Jsonb Datum. * ! * An array has one child for each element, stored in array order. ! * ! * An object has two children for each key/value pair. The keys all appear ! * first, in key sort order; then the values appear, in an order matching the ! * key order. This arrangement keeps the keys compact in memory, making a ! * search for a particular key more cache-friendly. */ typedef struct JsonbContainer { *************** typedef struct JsonbContainer *** 172,179 **** } JsonbContainer; /* flags for the header-field in JsonbContainer */ ! #define JB_CMASK 0x0FFFFFFF ! #define JB_FSCALAR 0x10000000 #define JB_FOBJECT 0x20000000 #define JB_FARRAY 0x40000000 --- 200,207 ---- } JsonbContainer; /* flags for the header-field in JsonbContainer */ ! #define JB_CMASK 0x0FFFFFFF /* mask for count field */ ! #define JB_FSCALAR 0x10000000 /* flag bits */ #define JB_FOBJECT 0x20000000 #define JB_FARRAY 0x40000000 *************** struct JsonbValue *** 248,265 **** (jsonbval)->type <= jbvBool) /* ! * Pair within an Object. * ! * Pairs with duplicate keys are de-duplicated. We store the order for the ! * benefit of doing so in a well-defined way with respect to the original ! * observed order (which is "last observed wins"). This is only used briefly ! * when originally constructing a Jsonb. */ struct JsonbPair { JsonbValue key; /* Must be a jbvString */ JsonbValue value; /* May be of any type */ ! uint32 order; /* preserves order of pairs with equal keys */ }; /* Conversion state used when parsing Jsonb from text, or for type coercion */ --- 276,295 ---- (jsonbval)->type <= jbvBool) /* ! * Key/value pair within an Object. * ! * This struct type is only used briefly while constructing a Jsonb; it is ! * *not* the on-disk representation. ! * ! * Pairs with duplicate keys are de-duplicated. We store the originally ! * observed pair ordering for the purpose of removing duplicates in a ! * well-defined way (which is "last observed wins"). */ struct JsonbPair { JsonbValue key; /* Must be a jbvString */ JsonbValue value; /* May be of any type */ ! uint32 order; /* Pair's index in original sequence */ }; /* Conversion state used when parsing Jsonb from text, or for type coercion */ *************** typedef struct JsonbIterator *** 287,306 **** { /* Container being iterated */ JsonbContainer *container; ! uint32 nElems; /* Number of elements in children array (will be ! * nPairs for objects) */ bool isScalar; /* Pseudo-array scalar value? */ ! JEntry *children; ! /* Current item in buffer (up to nElems, but must * 2 for objects) */ ! int i; /* ! * Data proper. This points just past end of children array. ! * We use the JBE_OFF() macro on the Jentrys to find offsets of each ! * child in this area. */ ! char *dataProper; /* Private state */ JsonbIterState state; --- 317,341 ---- { /* Container being iterated */ JsonbContainer *container; ! uint32 nElems; /* Number of elements in children array (will ! * be nPairs for objects) */ bool isScalar; /* Pseudo-array scalar value? */ ! JEntry *children; /* JEntrys for child nodes */ ! /* Data proper. This points to the beginning of the variable-length data */ ! char *dataProper; ! /* Current item in buffer (up to nElems) */ ! int curIndex; ! ! 
/* Data offset corresponding to current item */ ! uint32 curDataOffset; /* ! * If the container is an object, we want to return keys and values ! * alternately; so curDataOffset points to the current key, and ! * curValueOffset points to the current value. */ ! uint32 curValueOffset; /* Private state */ JsonbIterState state; *************** extern Datum gin_consistent_jsonb_path(P *** 344,349 **** --- 379,386 ---- extern Datum gin_triconsistent_jsonb_path(PG_FUNCTION_ARGS); /* Support functions */ + extern uint32 getJsonbOffset(const JsonbContainer *jc, int index); + extern uint32 getJsonbLength(const JsonbContainer *jc, int index); extern int compareJsonbContainers(JsonbContainer *a, JsonbContainer *b); extern JsonbValue *findJsonbValueFromContainer(JsonbContainer *sheader, uint32 flags,
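To make the JEntry comments in the patch above easier to follow, here is a self-contained toy sketch of the offsets-and-lengths scheme; the bit layout mirrors the patch (28-bit offset/length field, high bit flags a stored offset), but the surrounding scaffolding is a simplified illustration rather than the jsonb code itself.

/*
 * Toy illustration of the offsets-and-lengths JEntry scheme: every
 * STRIDE'th entry stores an end+1 offset (flagged with HAS_OFF), the
 * others store lengths.  To find a child's start offset, walk backwards
 * summing lengths until a stored offset is hit.
 */
#include <stdio.h>
#include <stdint.h>

#define OFFLENMASK 0x0FFFFFFF
#define HAS_OFF    0x80000000
#define STRIDE     32

/* Build entries for n children whose data lengths are given in lens[] */
static void
encode(const uint32_t *lens, int n, uint32_t *entries)
{
	uint32_t	total = 0;
	int			i;

	for (i = 0; i < n; i++)
	{
		total += lens[i];
		if (i % STRIDE == 0)
			entries[i] = (total & OFFLENMASK) | HAS_OFF; /* store end offset */
		else
			entries[i] = lens[i] & OFFLENMASK;           /* store length */
	}
}

/* Start offset of child 'index': walk back to the nearest stored offset */
static uint32_t
child_offset(const uint32_t *entries, int index)
{
	uint32_t	offset = 0;
	int			i;

	for (i = index - 1; i >= 0; i--)
	{
		offset += entries[i] & OFFLENMASK;
		if (entries[i] & HAS_OFF)
			break;				/* that value was already an absolute offset */
	}
	return offset;
}

int
main(void)
{
	uint32_t	lens[100],
				entries[100],
				expected = 0;
	int			i;

	for (i = 0; i < 100; i++)
		lens[i] = 7 + (i % 5);	/* arbitrary child sizes */
	encode(lens, 100, entries);

	for (i = 0; i < 100; i++)
	{
		if (child_offset(entries, i) != expected)
			printf("mismatch at %d\n", i);
		expected += lens[i];
	}
	printf("offset of child 50 = %u\n", (unsigned) child_offset(entries, 50));
	return 0;
}

The point of the scheme shows up in the entries array itself: most entries hold small, repetitive length values, which pglz can compress, while the occasional absolute offsets cap any lookup at STRIDE additions.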
On 09/25/2014 08:10 PM, Tom Lane wrote: > I wrote: >> The "offsets-and-lengths" patch seems like the approach we ought to >> compare to my patch, but it looks pretty unfinished to me: AFAICS it >> includes logic to understand offsets sprinkled into a mostly-lengths >> array, but no logic that would actually *store* any such offsets, >> which means it's going to act just like my patch for performance >> purposes. > >> In the interests of pushing this forward, I will work today on >> trying to finish and review Heikki's offsets-and-lengths patch >> so that we have something we can do performance testing on. >> I doubt that the performance testing will tell us anything we >> don't expect, but we should do it anyway. > > I've now done that, and attached is what I think would be a committable > version. Having done this work, I no longer think that this approach > is significantly messier code-wise than the all-lengths version, and > it does have the merit of not degrading on very large objects/arrays. > So at the moment I'm leaning to this solution not the all-lengths one. > > To get a sense of the compression effects of varying the stride distance, > I repeated the compression measurements I'd done on 14 August with Pavel's > geometry data (<24077.1408052877@sss.pgh.pa.us>). The upshot of that was > > min max avg > > external text representation 220 172685 880.3 > JSON representation (compressed text) 224 78565 541.3 > pg_column_size, JSONB HEAD repr. 225 82540 639.0 > pg_column_size, all-lengths repr. 225 66794 531.1 > > Here's what I get with this patch and different stride distances: > > JB_OFFSET_STRIDE = 8 225 68551 559.7 > JB_OFFSET_STRIDE = 16 225 67601 552.3 > JB_OFFSET_STRIDE = 32 225 67120 547.4 > JB_OFFSET_STRIDE = 64 225 66886 546.9 > JB_OFFSET_STRIDE = 128 225 66879 546.9 > JB_OFFSET_STRIDE = 256 225 66846 546.8 > > So at least for that test data, 32 seems like the sweet spot. > We are giving up a couple percent of space in comparison to the > all-lengths version, but this is probably an acceptable tradeoff > for not degrading on very large arrays. > > I've not done any speed testing. I'll do some tomorrow. I should have some different DBs to test on, too. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
All, So these results have become a bit complex. So spreadsheet time. https://docs.google.com/spreadsheets/d/1Mokpx3EqlbWlFDIkF9qzpM7NneN9z-QOXWSzws3E-R4 Some details: The Length-and-Offset test was performed using a more recent 9.4 checkout than the other two tests. This was regrettable, and due to a mistake with git; the results tell me that there have been some other changes. I added two new datasets: errlog2 is a simple, 4-column error log in JSON format, with 2 small values and 2 large values in each datum. It was there to check if any of our changes affected the performance or size of such simple structures (answer: no). processed_b is a synthetic version of Mozilla Socorro's crash dumps, about 900,000 of them, with nearly identical JSON on each row. These are large json values (around 4KB each) with a broad mix of values and 5 levels of nesting. However, no level has very many keys; at most, the top level has up to 40 keys. Unlike the other data sets, I can provide a copy of processed_b for the asking. So, some observations: * Data sizes with lengths-and-offsets are slightly (3%) larger than all-lengths for the pathological case (jsonbish) and unaffected for other cases. * Even large, complex JSON (processed_b) gets better compression with the two patches than with head, although only slightly better (16%). * This better compression for processed_b leads to slightly slower extraction (6-7%), and surprisingly slower extraction for length-and-offset than for all-lengths (about 2%). * In the pathological case, length-and-offset was notably faster on Q1 than all-lengths (24%), and somewhat slower on Q2 (8%). I think this shows me that I don't understand what JSON keys are "at the end". * Notably, length-and-offset when uncompressed (EXTERNAL) was faster on Q1 than head! This was surprising enough that I retested it. Overall, I'm satisfied with the performance of the length-and-offset patch. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 09/26/2014 06:20 PM, Josh Berkus wrote: > Overall, I'm satisfied with the performance of the length-and-offset > patch. Oh, also ... no bugs found. So, can we get Beta3 out now? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > So, can we get Beta3 out now? If nobody else steps up and says they want to do some performance testing, I'll push the latest lengths+offsets patch tomorrow. Are any of the other open items listed at https://wiki.postgresql.org/wiki/PostgreSQL_9.4_Open_Items things that we must-fix-before-beta3? regards, tom lane
Bruce Momjian <bruce@momjian.us> writes: > On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote: >> BTW, it seems like there is consensus that we ought to reorder the items >> in a jsonb object to have keys first and then values, independently of the >> other issues under discussion. This means we *will* be breaking on-disk >> compatibility with 9.4beta2, which means pg_upgrade will need to be taught >> to refuse an upgrade if the database contains any jsonb columns. Bruce, >> do you have time to crank out a patch for that? > Yes, I can do that easily. Tell me when you want it --- I just need a > catalog version number to trigger on. Done --- 201409291 is the cutover point. regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Bruce Momjian <bruce@momjian.us> writes: > > On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote: > >> BTW, it seems like there is consensus that we ought to reorder the items > >> in a jsonb object to have keys first and then values, independently of the > >> other issues under discussion. This means we *will* be breaking on-disk > >> compatibility with 9.4beta2, which means pg_upgrade will need to be taught > >> to refuse an upgrade if the database contains any jsonb columns. Bruce, > >> do you have time to crank out a patch for that? > > > Yes, I can do that easily. Tell me when you want it --- I just need a > > catalog version number to trigger on. > > Done --- 201409291 is the cutover point. Just to clarify- the commit bumped the catversion to 201409292, so version <= 201409291 has the old format while version > 201409291 has the new format. There was no 201409291, so I suppose it doesn't matter too much, but technically 'version >= 201409291' wouldn't be accurate. I'm guessing this all makes sense for how pg_upgrade works, but I found it a bit surprising that the version mentioned as the cutover point wasn't the catversion committed. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> Done --- 201409291 is the cutover point. > Just to clarify- the commit bumped the catversion to 201409292, so > version <= 201409291 has the old format while version > 201409291 has > the new format. There was no 201409291, so I suppose it doesn't matter > too much, but technically 'version >= 201409291' wouldn't be accurate. Nope. See my response to Andrew: ...1 is the cutover commit Bruce should use, because that's what it is in 9.4. regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > >> Done --- 201409291 is the cutover point. > > > Just to clarify- the commit bumped the catversion to 201409292, so > > version <= 201409291 has the old format while version > 201409291 has > > the new format. There was no 201409291, so I suppose it doesn't matter > > too much, but technically 'version >= 201409291' wouldn't be accurate. > > Nope. See my response to Andrew: ...1 is the cutover commit Bruce > should use, because that's what it is in 9.4. Yup, makes sense. Thanks! Stephen
On Mon, Sep 29, 2014 at 12:19 AM, Josh Berkus <josh@agliodbs.com> wrote: > On 09/26/2014 06:20 PM, Josh Berkus wrote: >> Overall, I'm satisfied with the performance of the length-and-offset >> patch. > > Oh, also ... no bugs found. > > So, can we get Beta3 out now? > > -- > Josh Berkus > PostgreSQL Experts Inc. > http://pgexperts.com What's the call on the stride length? Are we going to keep it hardcoded?
On 09/29/2014 11:49 AM, Arthur Silva wrote: > What's the call on the stride length? Are we going to keep it hardcoded? Please, yes. The complications caused by a variable stride length would be horrible. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Arthur Silva <arthurprs@gmail.com> writes: > What's the call on the stride length? Are we going to keep it hardcoded? At the moment it's 32, but we could change it without forcing a new initdb. I ran a simple test that seemed to show 32 was a good choice, but if anyone else wants to try other cases, go for it. regards, tom lane
On Mon, Sep 29, 2014 at 12:30:40PM -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote: > >> BTW, it seems like there is consensus that we ought to reorder the items > >> in a jsonb object to have keys first and then values, independently of the > >> other issues under discussion. This means we *will* be breaking on-disk > >> compatibility with 9.4beta2, which means pg_upgrade will need to be taught > >> to refuse an upgrade if the database contains any jsonb columns. Bruce, > >> do you have time to crank out a patch for that? > > > Yes, I can do that easily. Tell me when you want it --- I just need a > > catalog version number to trigger on. > > Done --- 201409291 is the cutover point. Attached patch applied to head, and backpatched to 9.4. I think we need to keep this in all future pg_upgrade versions in case someone from the beta tries to jump versions, e.g. 9.4 beta1 to 9.5. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
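As a sketch of how the catversion gate fits in (the committed pg_upgrade patch itself is not quoted in this thread), something along these lines decides whether the jsonb scan needs to run at all; the constant value is the cutover point discussed above, while the names, struct, and sample values are assumptions for illustration only.

/*
 * Hypothetical sketch of gating the jsonb-format check on the old cluster's
 * catalog version.  201409291 is the cutover point named up-thread; per
 * Stephen's clarification, catversion <= 201409291 means the old format and
 * catversion > 201409291 means the new one.  Names here are illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define JSONB_FORMAT_CHANGE_CAT_VER 201409291

typedef struct
{
	uint32_t	cat_ver;		/* catalog version from pg_controldata */
} ClusterInfo;

/*
 * Clusters initialized at or before the cutover catversion may still hold
 * jsonb values in the 9.4beta on-disk format, so they must be scanned for
 * jsonb columns (and the upgrade refused if any are found).
 */
static bool
needs_jsonb_format_check(const ClusterInfo *old_cluster)
{
	return old_cluster->cat_ver <= JSONB_FORMAT_CHANGE_CAT_VER;
}

int
main(void)
{
	ClusterInfo beta = {201409100};		/* hypothetical pre-cutover catversion */
	ClusterInfo final = {201409292};	/* catversion after the format change */

	printf("beta cluster needs check: %d\n", needs_jsonb_format_check(&beta));
	printf("new cluster needs check: %d\n", needs_jsonb_format_check(&final));
	return 0;
}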