Re: jsonb format is pessimal for toast compression - Mailing list pgsql-hackers

From Tom Lane
Subject Re: jsonb format is pessimal for toast compression
Date Thu, 14 Aug 2014
Msg-id 14580.1408039034@sss.pgh.pa.us
In response to Re: jsonb format is pessimal for toast compression  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
Bruce Momjian <bruce@momjian.us> writes:
> On Thu, Aug 14, 2014 at 12:22:46PM -0400, Tom Lane wrote:
>> This gets back to the problem of what test case are we going to consider
>> while debating what solution to adopt.

> Uh, we just need one 12k JSON document from somewhere.  Clearly this
> is something we can easily get.

I would put little faith in a single document as being representative.

To try to get some statistics about a real-world case, I looked at the
delicio.us dataset that someone posted a while back (1252973 JSON docs).
These have a minimum length (in text representation) of 604 bytes and
a maximum length of 5949 bytes, which means that they aren't going to
tell us all that much about large JSON docs, but this is better than
no data at all.
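
(For concreteness, those numbers came from something on the order of
the following, assuming the corpus is loaded one document per row into
a hypothetical table delicious(js text) -- both names are made up:)

    SELECT count(*),                    -- 1252973 docs
           min(length(js)) AS min_len,  -- 604 bytes
           max(length(js)) AS max_len   -- 5949 bytes
    FROM delicious;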

Since documents of only a few hundred bytes aren't going to be subject
to compression (the toast logic doesn't even consider tuples shorter
than about 2kB), I made a table of four columns each containing the
same JSON data, so that each row would be long enough to force the
toast logic to try to do something.  (Note that none of these documents
are anywhere near big enough to hit the refuses-to-compress problem.)
Given that,
I get the following statistics for pg_column_size():
                                 min     max      avg
JSON (text) representation       382    1155    526.5
HEAD's JSONB representation      493    1485    695.1
all-lengths representation       440    1257    615.3
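
(For reference, a sketch of that setup, reusing the hypothetical
delicious(js text) table from above; the json columns measure the
compressed text representation, swapping in jsonb measures HEAD's
binary form, and the all-lengths numbers of course require a patched
build:)

    -- Four copies of each document per row, so every row is wide
    -- enough that the toaster will try to compress it:
    CREATE TABLE jsontest AS
        SELECT js::json AS j1, js::json AS j2,
               js::json AS j3, js::json AS j4
        FROM delicious;

    -- Per-column on-disk size, after whatever the toaster did:
    SELECT min(pg_column_size(j1)),
           max(pg_column_size(j1)),
           avg(pg_column_size(j1))
    FROM jsontest;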

So IOW, on this dataset the existing JSONB representation creates about
32% bloat compared to just storing the (compressed) user-visible text
(695.1/526.5 = 1.32x), and switching to all-lengths would about halve
that penalty (615.3/526.5 = 1.17x).

Maybe this is telling us it's not worth changing the representation,
and we should just go do something about the first_success_by threshold
and be done.  I'm hesitant to draw such conclusions on the basis of a
single use-case though, especially one that doesn't really have that
much use for compression in the first place.  Do we have other JSON
corpuses to look at?
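
(Relatedly, a quick way to check whether the toaster compressed a
given value or bailed out early -- the behavior first_success_by
controls -- is to compare stored size against raw text length; again
just a sketch against the hypothetical table above:)

    -- pg_column_size() reports the on-disk, possibly-compressed size;
    -- octet_length() of the text form is the uncompressed size.
    -- Stored size ~= raw size means pglz gave up on that value.
    SELECT octet_length(j1::text) AS raw_bytes,
           pg_column_size(j1)     AS stored_bytes
    FROM jsontest
    LIMIT 10;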
        regards, tom lane


