Re: Very Limited Toast Compression on JSONB (9.4 beta 2) - Mailing list pgsql-general

From: Jeff Janes
Subject: Re: Very Limited Toast Compression on JSONB (9.4 beta 2)
Date:
Msg-id: CAMkU=1z8R3wKt4P8ozyoTv00j7m0GwXYOT4xqGVzK5CYk_Fx3w@mail.gmail.com
In response to: Re: Very Limited Toast Compression on JSONB (9.4 beta 2) (Larry White <ljw1001@gmail.com>)
Responses: Re: Very Limited Toast Compression on JSONB (9.4 beta 2) (Larry White <ljw1001@gmail.com>)
List: pgsql-general
On Thu, Jul 31, 2014 at 11:36 PM, Larry White <ljw1001@gmail.com> wrote:

On Fri, Aug 1, 2014 at 2:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thursday, July 31, 2014, Larry White <ljw1001@gmail.com> wrote:
Hi, 

I'm running an experiment on 9.4 beta 2.

I put 275,000 identical JSON files into a table using JSONB (one per row). Each raw text file is 251K in size, so the total uncompressed size is 69 GB. The column storage is set to EXTENDED. There are other toastable columns in the table, but none holds more than 36 bytes of data.

My TOAST table is 66 GB. I would have expected at least that much reduction just from JSONB being a binary format.
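For reference, a minimal sketch of how such a table could be set up and its TOAST table measured; the table and column names here are hypothetical, since the thread doesn't show the actual schema:

    -- hypothetical table mirroring the experiment
    CREATE TABLE docs (id serial PRIMARY KEY, body jsonb);
    ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTENDED;  -- already the default for jsonb

    -- after loading the rows, check the size of the table's TOAST relation
    SELECT pg_size_pretty(pg_relation_size(reltoastrelid))
    FROM pg_class
    WHERE relname = 'docs';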

If I compress one of these JSON files outside of Postgres, it goes from 251K to 1K.

That is an astonishing amount of compression.  Are you really compressing one of the files in isolation, rather than co-compressing many and then pro-rating the result?

Yes, I should have explained. These are generated JSON files for testing, and there is massive repetition in them, which is why they compress so well outside of Postgres. (Basically, there is a repeating array of the same string.) I did compress just the one.
 

Can you provide an example of the data, and the command line you used to compress it?

Compressed on a Mac with the Compress UI option. Here's a brief sample from the file. You can see why it compresses so well:
"{\"junk\":[\"124245etweetwet345gwtretwt43 qwrqwq qwre qw
rsdflkas\",\"q4535 wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535 

and so on. 
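That shape is easy to regenerate for testing. A hypothetical SQL generator along those lines (the string and repeat count are made up to mimic the sample, not taken from the actual files):

    -- build one highly repetitive document: an array of the same string, repeated
    SELECT ('{"junk":['
            || rtrim(repeat('"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas",', 2000), ',')
            || ']}')::jsonb;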

If I take that example (and cap off the array and hash right after the end of what you show, and remove the escapes of the double quote marks), then it does not compress, but only because it is not long enough to trigger a compression attempt.
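The threshold is easy to observe with pg_column_size(), which reports the stored (possibly compressed) size of a value. A sketch, assuming the default TOAST tuple threshold of roughly 2 kB:

    -- values land in a table so the toaster gets a chance to compress them
    CREATE TEMP TABLE t (v text);
    INSERT INTO t VALUES (repeat('ab', 500)),   -- ~1 kB: below the threshold
                         (repeat('ab', 5000));  -- ~10 kB: above it
    SELECT length(v) AS raw_bytes,
           pg_column_size(v) AS stored_bytes
    FROM t;
    -- expected: the short row is stored as-is, while the long row
    -- compresses to a small fraction of its raw size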

If I repeat the array portion 4 more times to make the whole thing long enough for compression to be used, it compresses nicely. Not 100-fold (but then again, neither does bzip2 or gzip on the data I just described), but text and json compress 10-fold and jsonb 5-fold.
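The exact test isn't shown, but a hypothetical reconstruction of that comparison looks like this (actual sizes will vary by version and build):

    -- store the same repetitive document as text, json, and jsonb
    CREATE TEMP TABLE cmp AS
    SELECT doc::text AS t, doc::json AS j, doc::jsonb AS b
    FROM (SELECT '{"junk":['
                 || rtrim(repeat('"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas",', 500), ',')
                 || ']}' AS doc) s;

    SELECT pg_column_size(t) AS text_size,
           pg_column_size(j) AS json_size,
           pg_column_size(b) AS jsonb_size
    FROM cmp;
    -- per the ratios above, expect text and json to shrink about 10-fold
    -- and jsonb (on 9.4 beta 2) only about 5-fold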

Cheers,

Jeff
