Re: jsonb format is pessimal for toast compression - Mailing list pgsql-hackers
From | Larry White |
---|---|
Subject | Re: jsonb format is pessimal for toast compression |
Date | |
Msg-id | CAMdbzVgYBGtOaGQpo8tU6PC7zL5T2gqT6kme40BHpzEi3uWyOQ@mail.gmail.com Whole thread Raw |
In response to | jsonb format is pessimal for toast compression (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
<div dir="ltr">Apologies if this is a ridiculous suggestion, but I believe that swapping out the compression algorithm (forSnappy, for example) has been discussed in the past. I wonder if that algorithm is sufficiently different that it wouldproduce a better result, and if that might not be preferable to some of the other options. </div><div class="gmail_extra"><br/><br /><div class="gmail_quote">On Thu, Aug 7, 2014 at 11:17 PM, Tom Lane <span dir="ltr"><<ahref="mailto:tgl@sss.pgh.pa.us" target="_blank">tgl@sss.pgh.pa.us</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> I looked into theissue reported in bug #11109. The problem appears to be<br /> that jsonb's on-disk format is designed in such a way thatthe leading<br /> portion of any JSON array or object will be fairly incompressible, because<br /> it consists mostlyof a strictly-increasing series of integer offsets.<br /> This interacts poorly with the code in pglz_compress() thatgives up if<br /> it's found nothing compressible in the first first_success_by bytes of a<br /> value-to-be-compressed. (first_success_by is 1024 in the default set of<br /> compression parameters.)<br /><br /> As anexample, here's gdb's report of the bitwise representation of the<br /> example JSON value in the bug thread:<br /><br/> 0x2ab85ac: 0x20000005 0x00000004 0x50003098 0x0000309f<br /> 0x2ab85bc: 0x000030ae 0x000030b8 0x000030cf 0x000030da<br /> 0x2ab85cc: 0x000030df 0x000030ee 0x00003105 0x6b6e756a<br/> 0x2ab85dc: 0x400000de 0x00000034 0x00000068 0x0000009c<br /> 0x2ab85ec: 0x000000d0 0x00000104 0x00000138 0x0000016c<br /> 0x2ab85fc: 0x000001a0 0x000001d4 0x00000208 0x0000023c<br /> 0x2ab860c: 0x00000270 0x000002a4 0x000002d8 0x0000030c<br /> 0x2ab861c: 0x00000340 0x00000374 0x000003a8 0x000003dc<br /> 0x2ab862c: 0x00000410 0x00000444 0x00000478 0x000004ac<br /> 0x2ab863c: 0x000004e0 0x00000514 0x00000548 0x0000057c<br/> 0x2ab864c: 0x000005b0 0x000005e4 0x00000618 0x0000064c<br /> 0x2ab865c: 0x00000680 0x000006b4 0x000006e8 0x0000071c<br /> 0x2ab866c: 0x00000750 0x00000784 0x000007b8 0x000007ec<br /> 0x2ab867c: 0x00000820 0x00000854 0x00000888 0x000008bc<br /> 0x2ab868c: 0x000008f0 0x00000924 0x00000958 0x0000098c<br /> 0x2ab869c: 0x000009c0 0x000009f4 0x00000a28 0x00000a5c<br /> 0x2ab86ac: 0x00000a90 0x00000ac4 0x00000af8 0x00000b2c<br/> 0x2ab86bc: 0x00000b60 0x00000b94 0x00000bc8 0x00000bfc<br /> 0x2ab86cc: 0x00000c30 0x00000c64 0x00000c98 0x00000ccc<br /> 0x2ab86dc: 0x00000d00 0x00000d34 0x00000d68 0x00000d9c<br /> 0x2ab86ec: 0x00000dd0 0x00000e04 0x00000e38 0x00000e6c<br /> 0x2ab86fc: 0x00000ea0 0x00000ed4 0x00000f08 0x00000f3c<br /> 0x2ab870c: 0x00000f70 0x00000fa4 0x00000fd8 0x0000100c<br /> 0x2ab871c: 0x00001040 0x00001074 0x000010a8 0x000010dc<br/> 0x2ab872c: 0x00001110 0x00001144 0x00001178 0x000011ac<br /> 0x2ab873c: 0x000011e0 0x00001214 0x00001248 0x0000127c<br /> 0x2ab874c: 0x000012b0 0x000012e4 0x00001318 0x0000134c<br /> 0x2ab875c: 0x00001380 0x000013b4 0x000013e8 0x0000141c<br /> 0x2ab876c: 0x00001450 0x00001484 0x000014b8 0x000014ec<br /> 0x2ab877c: 0x00001520 0x00001554 0x00001588 0x000015bc<br /> 0x2ab878c: 0x000015f0 0x00001624 0x00001658 0x0000168c<br/> 0x2ab879c: 0x000016c0 0x000016f4 0x00001728 0x0000175c<br /> 0x2ab87ac: 0x00001790 0x000017c4 0x000017f8 0x0000182c<br /> 0x2ab87bc: 0x00001860 0x00001894 0x000018c8 0x000018fc<br /> 0x2ab87cc: 0x00001930 0x00001964 0x00001998 0x000019cc<br /> 0x2ab87dc: 0x00001a00 0x00001a34 0x00001a68 0x00001a9c<br /> 0x2ab87ec: 0x00001ad0 0x00001b04 0x00001b38 0x00001b6c<br /> 0x2ab87fc: 0x00001ba0 0x00001bd4 0x00001c08 0x00001c3c<br/> 0x2ab880c: 0x00001c70 0x00001ca4 0x00001cd8 0x00001d0c<br /> 0x2ab881c: 0x00001d40 0x00001d74 0x00001da8 0x00001ddc<br /> 0x2ab882c: 0x00001e10 0x00001e44 0x00001e78 0x00001eac<br /> 0x2ab883c: 0x00001ee0 0x00001f14 0x00001f48 0x00001f7c<br /> 0x2ab884c: 0x00001fb0 0x00001fe4 0x00002018 0x0000204c<br /> 0x2ab885c: 0x00002080 0x000020b4 0x000020e8 0x0000211c<br /> 0x2ab886c: 0x00002150 0x00002184 0x000021b8 0x000021ec<br/> 0x2ab887c: 0x00002220 0x00002254 0x00002288 0x000022bc<br /> 0x2ab888c: 0x000022f0 0x00002324 0x00002358 0x0000238c<br /> 0x2ab889c: 0x000023c0 0x000023f4 0x00002428 0x0000245c<br /> 0x2ab88ac: 0x00002490 0x000024c4 0x000024f8 0x0000252c<br /> 0x2ab88bc: 0x00002560 0x00002594 0x000025c8 0x000025fc<br /> 0x2ab88cc: 0x00002630 0x00002664 0x00002698 0x000026cc<br /> 0x2ab88dc: 0x00002700 0x00002734 0x00002768 0x0000279c<br/> 0x2ab88ec: 0x000027d0 0x00002804 0x00002838 0x0000286c<br /> 0x2ab88fc: 0x000028a0 0x000028d4 0x00002908 0x0000293c<br /> 0x2ab890c: 0x00002970 0x000029a4 0x000029d8 0x00002a0c<br /> 0x2ab891c: 0x00002a40 0x00002a74 0x00002aa8 0x00002adc<br /> 0x2ab892c: 0x00002b10 0x00002b44 0x00002b78 0x00002bac<br /> 0x2ab893c: 0x00002be0 0x00002c14 0x00002c48 0x00002c7c<br /> 0x2ab894c: 0x00002cb0 0x00002ce4 0x00002d18 0x32343231<br/> 0x2ab895c: 0x74653534 0x74656577 0x33746577 0x77673534<br /> 0x2ab896c: 0x74657274 0x33347477 0x72777120 0x20717771<br /> 0x2ab897c: 0x65727771 0x20777120 0x66647372 0x73616b6c<br /> 0x2ab898c: 0x33353471 0x71772035 0x72777172 0x71727771<br /> 0x2ab899c: 0x77203277 0x72777172 0x71727771 0x33323233<br /> 0x2ab89ac: 0x6b207732 0x20657773 0x73616673 0x73207372<br /> 0x2ab89bc: 0x64736664 0x32343231 0x74653534 0x74656577<br/> 0x2ab89cc: 0x33746577 0x77673534 0x74657274 0x33347477<br /> 0x2ab89dc: 0x72777120 0x20717771 0x65727771 0x20777120<br /> 0x2ab89ec: 0x66647372 0x73616b6c 0x33353471 0x71772035<br /> 0x2ab89fc: 0x72777172 0x71727771 0x77203277 0x72777172<br /> 0x2ab8a0c: 0x71727771 0x33323233 0x6b207732 0x20657773<br /> 0x2ab8a1c: 0x73616673 0x73207372 0x64736664 0x32343231<br /> 0x2ab8a2c: 0x74653534 0x74656577 0x33746577 0x77673534<br/> 0x2ab8a3c: 0x74657274 0x33347477 0x72777120 0x20717771<br /> 0x2ab8a4c: 0x65727771 0x20777120 0x66647372 0x73616b6c<br /> 0x2ab8a5c: 0x33353471 0x71772035 0x72777172 0x71727771<br /> 0x2ab8a6c: 0x77203277 0x72777172 0x71727771 0x33323233<br /> 0x2ab8a7c: 0x6b207732 0x20657773 0x73616673 0x73207372<br /> 0x2ab8a8c: 0x64736664 0x32343231 0x74653534 0x74656577<br /> 0x2ab8a9c: 0x33746577 0x77673534 0x74657274 0x33347477<br/> 0x2ab8aac: 0x72777120 0x20717771 0x65727771 0x20777120<br /> 0x2ab8abc: 0x66647372 0x73616b6c 0x33353471 0x71772035<br /> 0x2ab8acc: 0x72777172 0x71727771 0x77203277 0x72777172<br /> 0x2ab8adc: 0x71727771 0x33323233 0x6b207732 0x20657773<br /> 0x2ab8aec: 0x73616673 0x73207372 0x64736664 0x32343231<br /> 0x2ab8afc: 0x74653534 0x74656577 0x33746577 0x77673534<br /> ...<br /> 0x2abb61c: 0x74657274 0x33347477 0x72777120 0x20717771<br /> 0x2abb62c: 0x65727771 0x20777120 0x66647372 0x73616b6c<br /> 0x2abb63c: 0x33353471 0x71772035 0x72777172 0x71727771<br /> 0x2abb64c: 0x77203277 0x72777172 0x71727771 0x33323233<br /> 0x2abb65c: 0x6b207732 0x20657773 0x73616673 0x73207372<br /> 0x2abb66c: 0x64736664 0x537a6962 0x41706574 0x73756220<br /> 0x2abb67c: 0x73656e69 0x74732073 0x45617065 0x746e6576<br /> 0x2abb68c: 0x656d6954 0x34313032 0x2d38302d 0x32203730<br/> 0x2abb69c: 0x33323a31 0x2e33333a 0x62393434 0x6f4c7a69<br /> 0x2abb6ac: 0x69746163 0x61506e6f 0x74736972 0x736e6172<br /> 0x2abb6bc: 0x69746361 0x61446e6f 0x30326574 0x302d3431<br /> 0x2abb6cc: 0x37302d38 0x3a313220 0x333a3332 0x34342e33<br /><br /> Thereis plenty of compressible data once we get into the repetitive<br /> strings in the payload part --- but that startsat offset 944, and up to<br /> that point there is nothing that pg_lzcompress can get a handle on. There<br /> are,by definition, no sequences of 4 or more repeated bytes in that area.<br /> I think in principle pg_lzcompress coulddecide to compress the 3-byte<br /> sequences consisting of the high-order 24 bits of each offset; but it<br /> doesn'tchoose to do so, probably because of the way its lookup hash table<br /> works:<br /><br /> * pglz_hist_idx -<br/> *<br /> * Computes the history table slot for the lookup by the next 4<br /> * charactersin the input.<br /> *<br /> * NB: because we use the next 4 characters, we are not guaranteed to<br /> * find3-character matches; they very possibly will be in the wrong<br /> * hash list. This seems an acceptable tradeoff forspreading out the<br /> * hash keys more.<br /><br /> For jsonb header data, the "next 4 characters" are *always* different,so<br /> only a chance hash collision can result in a match. There is therefore a<br /> pretty good chance thatno compression will occur before it gives up<br /> because of first_success_by.<br /><br /> I'm not sure if there isany easy fix for this. We could possibly change<br /> the default first_success_by value, but I think that'd just be postponing<br/> the problem to larger jsonb objects/arrays, and it would hurt performance<br /> for genuinely incompressibledata. A somewhat painful, but not yet<br /> out-of-the-question, alternative is to change the jsonb on-disk<br/> representation. Perhaps the JEntry array could be defined as containing<br /> element lengths instead of elementending offsets. Not sure though if<br /> that would break binary searching for JSON object keys.<br /><br /> regards, tom lane<br /></blockquote></div><br /></div>
pgsql-hackers by date: