jsonb format is pessimal for toast compression - Mailing list pgsql-hackers

I looked into the issue reported in bug #11109.  The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
This interacts poorly with the code in pglz_compress() that gives up if
it's found nothing compressible in the first first_success_by bytes of a
value-to-be-compressed.  (first_success_by is 1024 in the default set of
compression parameters.)

As an example, here's gdb's report of the bitwise representation of the
example JSON value in the bug thread:

0x2ab85ac:      0x20000005      0x00000004      0x50003098      0x0000309f
0x2ab85bc:      0x000030ae      0x000030b8      0x000030cf      0x000030da
0x2ab85cc:      0x000030df      0x000030ee      0x00003105      0x6b6e756a
0x2ab85dc:      0x400000de      0x00000034      0x00000068      0x0000009c
0x2ab85ec:      0x000000d0      0x00000104      0x00000138      0x0000016c
0x2ab85fc:      0x000001a0      0x000001d4      0x00000208      0x0000023c
0x2ab860c:      0x00000270      0x000002a4      0x000002d8      0x0000030c
0x2ab861c:      0x00000340      0x00000374      0x000003a8      0x000003dc
0x2ab862c:      0x00000410      0x00000444      0x00000478      0x000004ac
0x2ab863c:      0x000004e0      0x00000514      0x00000548      0x0000057c
0x2ab864c:      0x000005b0      0x000005e4      0x00000618      0x0000064c
0x2ab865c:      0x00000680      0x000006b4      0x000006e8      0x0000071c
0x2ab866c:      0x00000750      0x00000784      0x000007b8      0x000007ec
0x2ab867c:      0x00000820      0x00000854      0x00000888      0x000008bc
0x2ab868c:      0x000008f0      0x00000924      0x00000958      0x0000098c
0x2ab869c:      0x000009c0      0x000009f4      0x00000a28      0x00000a5c
0x2ab86ac:      0x00000a90      0x00000ac4      0x00000af8      0x00000b2c
0x2ab86bc:      0x00000b60      0x00000b94      0x00000bc8      0x00000bfc
0x2ab86cc:      0x00000c30      0x00000c64      0x00000c98      0x00000ccc
0x2ab86dc:      0x00000d00      0x00000d34      0x00000d68      0x00000d9c
0x2ab86ec:      0x00000dd0      0x00000e04      0x00000e38      0x00000e6c
0x2ab86fc:      0x00000ea0      0x00000ed4      0x00000f08      0x00000f3c
0x2ab870c:      0x00000f70      0x00000fa4      0x00000fd8      0x0000100c
0x2ab871c:      0x00001040      0x00001074      0x000010a8      0x000010dc
0x2ab872c:      0x00001110      0x00001144      0x00001178      0x000011ac
0x2ab873c:      0x000011e0      0x00001214      0x00001248      0x0000127c
0x2ab874c:      0x000012b0      0x000012e4      0x00001318      0x0000134c
0x2ab875c:      0x00001380      0x000013b4      0x000013e8      0x0000141c
0x2ab876c:      0x00001450      0x00001484      0x000014b8      0x000014ec
0x2ab877c:      0x00001520      0x00001554      0x00001588      0x000015bc
0x2ab878c:      0x000015f0      0x00001624      0x00001658      0x0000168c
0x2ab879c:      0x000016c0      0x000016f4      0x00001728      0x0000175c
0x2ab87ac:      0x00001790      0x000017c4      0x000017f8      0x0000182c
0x2ab87bc:      0x00001860      0x00001894      0x000018c8      0x000018fc
0x2ab87cc:      0x00001930      0x00001964      0x00001998      0x000019cc
0x2ab87dc:      0x00001a00      0x00001a34      0x00001a68      0x00001a9c
0x2ab87ec:      0x00001ad0      0x00001b04      0x00001b38      0x00001b6c
0x2ab87fc:      0x00001ba0      0x00001bd4      0x00001c08      0x00001c3c
0x2ab880c:      0x00001c70      0x00001ca4      0x00001cd8      0x00001d0c
0x2ab881c:      0x00001d40      0x00001d74      0x00001da8      0x00001ddc
0x2ab882c:      0x00001e10      0x00001e44      0x00001e78      0x00001eac
0x2ab883c:      0x00001ee0      0x00001f14      0x00001f48      0x00001f7c
0x2ab884c:      0x00001fb0      0x00001fe4      0x00002018      0x0000204c
0x2ab885c:      0x00002080      0x000020b4      0x000020e8      0x0000211c
0x2ab886c:      0x00002150      0x00002184      0x000021b8      0x000021ec
0x2ab887c:      0x00002220      0x00002254      0x00002288      0x000022bc
0x2ab888c:      0x000022f0      0x00002324      0x00002358      0x0000238c
0x2ab889c:      0x000023c0      0x000023f4      0x00002428      0x0000245c
0x2ab88ac:      0x00002490      0x000024c4      0x000024f8      0x0000252c
0x2ab88bc:      0x00002560      0x00002594      0x000025c8      0x000025fc
0x2ab88cc:      0x00002630      0x00002664      0x00002698      0x000026cc
0x2ab88dc:      0x00002700      0x00002734      0x00002768      0x0000279c
0x2ab88ec:      0x000027d0      0x00002804      0x00002838      0x0000286c
0x2ab88fc:      0x000028a0      0x000028d4      0x00002908      0x0000293c
0x2ab890c:      0x00002970      0x000029a4      0x000029d8      0x00002a0c
0x2ab891c:      0x00002a40      0x00002a74      0x00002aa8      0x00002adc
0x2ab892c:      0x00002b10      0x00002b44      0x00002b78      0x00002bac
0x2ab893c:      0x00002be0      0x00002c14      0x00002c48      0x00002c7c
0x2ab894c:      0x00002cb0      0x00002ce4      0x00002d18      0x32343231
0x2ab895c:      0x74653534      0x74656577      0x33746577      0x77673534
0x2ab896c:      0x74657274      0x33347477      0x72777120      0x20717771
0x2ab897c:      0x65727771      0x20777120      0x66647372      0x73616b6c
0x2ab898c:      0x33353471      0x71772035      0x72777172      0x71727771
0x2ab899c:      0x77203277      0x72777172      0x71727771      0x33323233
0x2ab89ac:      0x6b207732      0x20657773      0x73616673      0x73207372
0x2ab89bc:      0x64736664      0x32343231      0x74653534      0x74656577
0x2ab89cc:      0x33746577      0x77673534      0x74657274      0x33347477
0x2ab89dc:      0x72777120      0x20717771      0x65727771      0x20777120
0x2ab89ec:      0x66647372      0x73616b6c      0x33353471      0x71772035
0x2ab89fc:      0x72777172      0x71727771      0x77203277      0x72777172
0x2ab8a0c:      0x71727771      0x33323233      0x6b207732      0x20657773
0x2ab8a1c:      0x73616673      0x73207372      0x64736664      0x32343231
0x2ab8a2c:      0x74653534      0x74656577      0x33746577      0x77673534
0x2ab8a3c:      0x74657274      0x33347477      0x72777120      0x20717771
0x2ab8a4c:      0x65727771      0x20777120      0x66647372      0x73616b6c
0x2ab8a5c:      0x33353471      0x71772035      0x72777172      0x71727771
0x2ab8a6c:      0x77203277      0x72777172      0x71727771      0x33323233
0x2ab8a7c:      0x6b207732      0x20657773      0x73616673      0x73207372
0x2ab8a8c:      0x64736664      0x32343231      0x74653534      0x74656577
0x2ab8a9c:      0x33746577      0x77673534      0x74657274      0x33347477
0x2ab8aac:      0x72777120      0x20717771      0x65727771      0x20777120
0x2ab8abc:      0x66647372      0x73616b6c      0x33353471      0x71772035
0x2ab8acc:      0x72777172      0x71727771      0x77203277      0x72777172
0x2ab8adc:      0x71727771      0x33323233      0x6b207732      0x20657773
0x2ab8aec:      0x73616673      0x73207372      0x64736664      0x32343231
0x2ab8afc:      0x74653534      0x74656577      0x33746577      0x77673534
...
0x2abb61c:      0x74657274      0x33347477      0x72777120      0x20717771
0x2abb62c:      0x65727771      0x20777120      0x66647372      0x73616b6c
0x2abb63c:      0x33353471      0x71772035      0x72777172      0x71727771
0x2abb64c:      0x77203277      0x72777172      0x71727771      0x33323233
0x2abb65c:      0x6b207732      0x20657773      0x73616673      0x73207372
0x2abb66c:      0x64736664      0x537a6962      0x41706574      0x73756220
0x2abb67c:      0x73656e69      0x74732073      0x45617065      0x746e6576
0x2abb68c:      0x656d6954      0x34313032      0x2d38302d      0x32203730
0x2abb69c:      0x33323a31      0x2e33333a      0x62393434      0x6f4c7a69
0x2abb6ac:      0x69746163      0x61506e6f      0x74736972      0x736e6172
0x2abb6bc:      0x69746361      0x61446e6f      0x30326574      0x302d3431
0x2abb6cc:      0x37302d38      0x3a313220      0x333a3332      0x34342e33

There is plenty of compressible data once we get into the repetitive
strings in the payload part --- but that starts at offset 944, and up to
that point there is nothing that pg_lzcompress can get a handle on.  There
are, by definition, no sequences of 4 or more repeated bytes in that area.
I think in principle pg_lzcompress could decide to compress the 3-byte
sequences consisting of the high-order 24 bits of each offset; but it
doesn't choose to do so, probably because of the way its lookup hash table
works:
* pglz_hist_idx -**        Computes the history table slot for the lookup by the next 4*        characters in the
input.**NB: because we use the next 4 characters, we are not guaranteed to* find 3-character matches; they very
possiblywill be in the wrong* hash list.  This seems an acceptable tradeoff for spreading out the* hash keys more.
 

For jsonb header data, the "next 4 characters" are *always* different, so
only a chance hash collision can result in a match.  There is therefore a
pretty good chance that no compression will occur before it gives up
because of first_success_by.

I'm not sure if there is any easy fix for this.  We could possibly change
the default first_success_by value, but I think that'd just be postponing
the problem to larger jsonb objects/arrays, and it would hurt performance
for genuinely incompressible data.  A somewhat painful, but not yet
out-of-the-question, alternative is to change the jsonb on-disk
representation.  Perhaps the JEntry array could be defined as containing
element lengths instead of element ending offsets.  Not sure though if
that would break binary searching for JSON object keys.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Reporting the commit LSN at commit time
Next
From: Craig Ringer
Date:
Subject: Re: Reporting the commit LSN at commit time