From: Tom Lane
Subject: Re: jsonb format is pessimal for toast compression
Msg-id: 10350.1407511111@sss.pgh.pa.us
In response to: Re: jsonb format is pessimal for toast compression (Andrew Dunstan <andrew@dunslane.net>)
List: pgsql-hackers
Andrew Dunstan <andrew@dunslane.net> writes:
> On 08/07/2014 11:17 PM, Tom Lane wrote:
>> I looked into the issue reported in bug #11109.  The problem appears to be
>> that jsonb's on-disk format is designed in such a way that the leading
>> portion of any JSON array or object will be fairly incompressible, because
>> it consists mostly of a strictly-increasing series of integer offsets.

> Ouch.

> Back when this structure was first presented at pgCon 2013, I wondered 
> if we shouldn't extract the strings into a dictionary, because of key 
> repetition, and convinced myself that this shouldn't be necessary 
> because in significant cases TOAST would take care of it.

That's not really the issue here, I think.  The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.
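
To make that concrete, here is a toy illustration in C (not the actual
jsonb JEntry code; the sizes and names are made up for the example) of
how a lengths representation stays byte-for-byte repetitive while an
offsets representation never repeats, even for identical input elements:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* 1000 identical 4-byte elements, one 32-bit header word each */
        uint32_t offsets[1000], lengths[1000];

        for (int i = 0; i < 1000; i++)
        {
            lengths[i] = 4;                       /* "len = 4", over and over */
            offsets[i] = (uint32_t) (i + 1) * 4;  /* a new value every time */
        }

        /* pglz hunts for repeated byte sequences: the lengths array is
         * one 4-byte pattern repeated, the offsets array never repeats */
        printf("lengths[0..2] = %u %u %u\n",
               lengths[0], lengths[1], lengths[2]);
        printf("offsets[0..2] = %u %u %u\n",
               offsets[0], offsets[1], offsets[2]);
        return 0;
    }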

> Maybe we should have pglz_compress() look at the *last* 1024 bytes if it 
> can't find anything worth compressing in the first, for values larger 
> than a certain size.

Not possible with anything like the current implementation, since it's
just an on-the-fly status check, not a trial compression.
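
For anyone without the pglz source in front of them, the check in
question is shaped roughly like this (a sketch only; SAMPLE_LIMIT and
keep_compressing are illustrative names, not the real pglz identifiers):

    #include <stdbool.h>

    #define SAMPLE_LIMIT 1024    /* analogue of pglz's first-1kB window */

    static bool
    keep_compressing(int bytes_consumed, int bytes_emitted)
    {
        /*
         * Bail out if the sample window has gone by with no savings.
         * The check can only see bytes already consumed, i.e. the
         * leading ones; sampling the *last* 1kB instead would require
         * a separate trial-compression pass over the tail first.
         */
        if (bytes_consumed >= SAMPLE_LIMIT && bytes_emitted >= bytes_consumed)
            return false;
        return true;
    }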

> It's worth noting that this is a fairly pathological case. AIUI the 
> example you constructed has an array with 100k string elements. I don't 
> think that's typical. So I suspect that unless I've misunderstood the 
> statement of the problem we're going to find that almost all the jsonb 
> we will be storing is still compressible.

Actually, the 100K-string example I constructed *did* compress.  Larry's
example that's not compressing is only about 12kB.  AFAICS, the threshold
for trouble is in the vicinity of 256 array or object entries (resulting
in a 1kB JEntry array).  That doesn't seem especially high.  There is a
probabilistic component as to whether the early-exit case will actually
fire, since any chance hash collision will probably result in some 3-byte
offset prefix getting compressed.  But the fact that a beta tester tripped
over this doesn't leave me with a warm feeling about the odds that it
won't happen much in the field.
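
For the record, the arithmetic behind that threshold (assuming the
4-byte-per-JEntry layout under discussion):

    #include <stdio.h>

    int main(void)
    {
        const int jentry_size = 4;   /* sizeof(JEntry): one uint32 */
        const int entries = 256;

        /* 256 * 4 = 1024 bytes of strictly increasing offsets, which
         * is exactly the window the early-exit check gets to sample */
        printf("%d bytes of JEntry headers\n", entries * jentry_size);
        return 0;
    }
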
        regards, tom lane


