Re: jsonb format is pessimal for toast compression - Mailing list pgsql-hackers
From:           Andrew Dunstan
Subject:        Re: jsonb format is pessimal for toast compression
Date:
Msg-id:         53E4DCFA.9020305@dunslane.net
In response to: jsonb format is pessimal for toast compression (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: jsonb format is pessimal for toast compression
                Re: jsonb format is pessimal for toast compression
List:           pgsql-hackers
On 08/07/2014 11:17 PM, Tom Lane wrote:
> I looked into the issue reported in bug #11109. The problem appears to be
> that jsonb's on-disk format is designed in such a way that the leading
> portion of any JSON array or object will be fairly incompressible, because
> it consists mostly of a strictly-increasing series of integer offsets.
> This interacts poorly with the code in pglz_compress() that gives up if
> it's found nothing compressible in the first first_success_by bytes of a
> value-to-be-compressed. (first_success_by is 1024 in the default set of
> compression parameters.)

[snip]

> There is plenty of compressible data once we get into the repetitive
> strings in the payload part --- but that starts at offset 944, and up to
> that point there is nothing that pg_lzcompress can get a handle on. There
> are, by definition, no sequences of 4 or more repeated bytes in that area.
> I think in principle pg_lzcompress could decide to compress the 3-byte
> sequences consisting of the high-order 24 bits of each offset; but it
> doesn't choose to do so, probably because of the way its lookup hash table
> works:
>
>  * pglz_hist_idx -
>  *
>  *		Computes the history table slot for the lookup by the next 4
>  *		characters in the input.
>  *
>  * NB: because we use the next 4 characters, we are not guaranteed to
>  * find 3-character matches; they very possibly will be in the wrong
>  * hash list. This seems an acceptable tradeoff for spreading out the
>  * hash keys more.
>
> For jsonb header data, the "next 4 characters" are *always* different, so
> only a chance hash collision can result in a match. There is therefore a
> pretty good chance that no compression will occur before it gives up
> because of first_success_by.
>
> I'm not sure if there is any easy fix for this. We could possibly change
> the default first_success_by value, but I think that'd just be postponing
> the problem to larger jsonb objects/arrays, and it would hurt performance
> for genuinely incompressible data. A somewhat painful, but not yet
> out-of-the-question, alternative is to change the jsonb on-disk
> representation. Perhaps the JEntry array could be defined as containing
> element lengths instead of element ending offsets. Not sure though if
> that would break binary searching for JSON object keys.
>

Ouch.

Back when this structure was first presented at pgCon 2013, I wondered if we
shouldn't extract the strings into a dictionary, because of key repetition,
and convinced myself that this shouldn't be necessary because in significant
cases TOAST would take care of it.

Maybe we should have pglz_compress() look at the *last* 1024 bytes if it
can't find anything worth compressing in the first, for values larger than a
certain size.

It's worth noting that this is a fairly pathological case. AIUI the example
you constructed has an array with 100k string elements. I don't think that's
typical. So I suspect that, unless I've misunderstood the statement of the
problem, we're going to find that almost all the jsonb we will be storing is
still compressible.

cheers

andrew
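To see the incompressibility claim concretely, here is a small standalone C
sketch (not PostgreSQL code; the 10-byte element width and little-endian
4-byte offsets are illustrative assumptions, not the exact jsonb layout). It
fills a 1024-byte buffer -- the default first_success_by window -- with
strictly increasing offsets and then counts repeated 4-byte windows, the unit
that pglz's history hash keys on:

/*
 * Standalone sketch: emulate a jsonb-style header of strictly increasing
 * offsets and look for any repeated 4-byte sequence within the first
 * 1024 bytes, which is the kind of match pglz_find_match() could exploit.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define HEADER_BYTES 1024		/* mimics the default first_success_by */

int
main(void)
{
	uint8_t		buf[HEADER_BYTES];
	uint32_t	offset = 0;
	int			repeats = 0;

	/* Serialize strictly increasing end offsets, 4 bytes each (host endian). */
	for (int i = 0; i < HEADER_BYTES / 4; i++)
	{
		offset += 10;			/* assume every element is 10 bytes long */
		memcpy(buf + i * 4, &offset, 4);
	}

	/* Brute-force search for any 4-byte window that occurs more than once. */
	for (int i = 0; i + 4 <= HEADER_BYTES; i++)
		for (int j = i + 1; j + 4 <= HEADER_BYTES; j++)
			if (memcmp(buf + i, buf + j, 4) == 0)
				repeats++;

	printf("repeated 4-byte windows in %d header bytes: %d\n",
		   HEADER_BYTES, repeats);
	return 0;
}

On this constructed input it reports zero repeated windows, so the match
finder has nothing to latch onto before pglz gives up at first_success_by,
even though the payload past the header would compress well.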