Re: jsonb format is pessimal for toast compression - Mailing list pgsql-hackers

From: Andrew Dunstan
Subject: Re: jsonb format is pessimal for toast compression
Date:
Msg-id: 53E4DCFA.9020305@dunslane.net
In response to: jsonb format is pessimal for toast compression (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 08/07/2014 11:17 PM, Tom Lane wrote:
> I looked into the issue reported in bug #11109.  The problem appears to be
> that jsonb's on-disk format is designed in such a way that the leading
> portion of any JSON array or object will be fairly incompressible, because
> it consists mostly of a strictly-increasing series of integer offsets.
> This interacts poorly with the code in pglz_compress() that gives up if
> it's found nothing compressible in the first first_success_by bytes of a
> value-to-be-compressed.  (first_success_by is 1024 in the default set of
> compression parameters.)

[snip]

> There is plenty of compressible data once we get into the repetitive
> strings in the payload part --- but that starts at offset 944, and up to
> that point there is nothing that pg_lzcompress can get a handle on.  There
> are, by definition, no sequences of 4 or more repeated bytes in that area.
> I think in principle pg_lzcompress could decide to compress the 3-byte
> sequences consisting of the high-order 24 bits of each offset; but it
> doesn't choose to do so, probably because of the way its lookup hash table
> works:
>
>   * pglz_hist_idx -
>   *
>   *        Computes the history table slot for the lookup by the next 4
>   *        characters in the input.
>   *
>   * NB: because we use the next 4 characters, we are not guaranteed to
>   * find 3-character matches; they very possibly will be in the wrong
>   * hash list.  This seems an acceptable tradeoff for spreading out the
>   * hash keys more.
>
> For jsonb header data, the "next 4 characters" are *always* different, so
> only a chance hash collision can result in a match.  There is therefore a
> pretty good chance that no compression will occur before it gives up
> because of first_success_by.
>
> I'm not sure if there is any easy fix for this.  We could possibly change
> the default first_success_by value, but I think that'd just be postponing
> the problem to larger jsonb objects/arrays, and it would hurt performance
> for genuinely incompressible data.  A somewhat painful, but not yet
> out-of-the-question, alternative is to change the jsonb on-disk
> representation.  Perhaps the JEntry array could be defined as containing
> element lengths instead of element ending offsets.  Not sure though if
> that would break binary searching for JSON object keys.
>
>             


Ouch.
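
The alternative Tom floats at the end (storing element lengths in the
JEntry array instead of end offsets) would at least give pglz something
repetitive to chew on, since an array of similar-sized elements would
produce lots of identical entries. The catch is the one he notes: with
end offsets you can seek straight to any element, while with bare
lengths you have to sum everything in front of it. A rough sketch of
the difference, using a made-up, simplified entry layout rather than
the real JEntry format:

/*
 * Made-up, simplified layout (not the real JEntry format): an array of
 * uint32 entries describing N variable-length elements, followed by
 * the element data.  End offsets give O(1) random access; plain
 * lengths force an O(i) prefix-sum walk to find element i's start.
 */
#include <stddef.h>
#include <stdint.h>

/* entries[i] holds element i's end offset: element i occupies
 * [entries[i-1], entries[i]) within the data area */
static size_t
element_start_from_offsets(const uint32_t *end_offsets, int i)
{
    return (i == 0) ? 0 : end_offsets[i - 1];   /* O(1) */
}

/* entries[i] holds element i's length: its start is the sum of all
 * preceding lengths */
static size_t
element_start_from_lengths(const uint32_t *lengths, int i)
{
    size_t start = 0;

    for (int k = 0; k < i; k++)
        start += lengths[k];                    /* O(i) */
    return start;
}

The binary search over an object's keys needs each probed key's start,
so with lengths it would either have to do that walk per probe or
reconstruct the offsets first.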

Back when this structure was first presented at pgCon 2013, I wondered 
if we shouldn't extract the strings into a dictionary, because of key 
repetition, and convinced myself that this shouldn't be necessary 
because in significant cases TOAST would take care of it.

Maybe we should have pglz_compress() look at the *last* 1024 bytes if it 
can't find anything worth compressing in the first, for values larger 
than a certain size.
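
Something along these lines, sketched outside the backend; the names
and thresholds here (window_has_repeats, looks_compressible,
PROBE_WINDOW, TAIL_PROBE_MIN) are invented for illustration and are
not the real pglz entry points:

/*
 * Illustration only: probe a head window and, for big inputs, a tail
 * window for repeated 4-byte sequences before deciding that a value
 * is not worth compressing.
 */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PROBE_WINDOW    1024    /* same spirit as first_success_by */
#define TAIL_PROBE_MIN  8192    /* only probe the tail of big values */

/* does any 4-byte sequence occur twice within buf[0..len)?  O(n^2) toy scan */
static bool
window_has_repeats(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i++)
        for (size_t j = i + 1; j + 4 <= len; j++)
            if (memcmp(buf + i, buf + j, 4) == 0)
                return true;
    return false;
}

/*
 * Check the head window first; if that finds nothing and the value is
 * large, look at the last PROBE_WINDOW bytes too, where the jsonb
 * payload (rather than the offset header) will typically live.
 */
static bool
looks_compressible(const unsigned char *src, size_t slen)
{
    size_t  head = (slen < PROBE_WINDOW) ? slen : PROBE_WINDOW;

    if (window_has_repeats(src, head))
        return true;

    if (slen >= TAIL_PROBE_MIN &&
        window_has_repeats(src + slen - PROBE_WINDOW, PROBE_WINDOW))
        return true;

    return false;
}

The real change would of course have to live in pglz_compress()'s own
match loop rather than in a separate pre-scan, but the point is the
same: don't write off a large datum just because its leading kilobyte
happens to be incompressible.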

It's worth noting that this is a fairly pathological case. AIUI the 
example you constructed has an array with 100k string elements. I don't 
think that's typical. So I suspect that, unless I've misunderstood the 
statement of the problem, we're going to find that almost all the jsonb 
we will be storing is still compressible.
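
For anyone who wants to poke at the incompressible-header claim
outside the backend, here's a toy mock-up. It fakes the JEntry array
as a plain run of increasing uint32 end offsets (ignoring the type
bits and the real jsonb container header) and checks whether any
4-byte window repeats within the first kilobyte; with strictly
increasing small offsets none does, which is exactly what starves
pglz's "next 4 characters" hash of matches:

/*
 * Toy demonstration, not backend code: build a JEntry-like header of
 * strictly increasing 4-byte end offsets for an array of short string
 * elements and look for a repeated 4-byte window in the first 1KB.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NELEMS  100000          /* as in the example's 100k elements */
#define WINDOW  1024            /* the default first_success_by */

int
main(void)
{
    static uint32_t offsets[NELEMS];
    uint32_t    end = 0;

    /* pretend every element is a 6-byte string */
    for (int i = 0; i < NELEMS; i++)
    {
        end += 6;
        offsets[i] = end;       /* strictly increasing end offsets */
    }

    const unsigned char *buf = (const unsigned char *) offsets;
    bool        found = false;

    for (int i = 0; i + 4 <= WINDOW && !found; i++)
        for (int j = i + 1; j + 4 <= WINDOW; j++)
            if (memcmp(buf + i, buf + j, 4) == 0)
            {
                found = true;
                break;
            }

    printf("repeated 4-byte window in first %d bytes: %s\n",
           WINDOW, found ? "yes" : "no");
    return 0;
}

(The payload past that header compresses fine, per your analysis; the 
problem is purely that pglz gives up before it gets there.)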

cheers

andrew


