Thread: Very Limited Toast Compression on JSONB (9.4 beta 2)

Very Limited Toast Compression on JSONB (9.4 beta 2)

From
Larry White
Date:
Hi, 

I'm running an experiment on 9.4 beta 2.

I put 275,000 identical JSON files into a table using JSONB (one per row).  Each raw text file is 251K in size, so the total uncompressed is 69GB. The column storage is set to EXTENDED.  There are other toastable columns in the table, but none have more than 36 bytes of data in them.

My Toast table is 66GB. I would have expected to get that much (or more) compression just from JSONB being a binary format.
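
(For reference, a sketch of the kind of query that reports the TOAST table size; 'docs' below is an illustrative table name, not the actual schema:)

    SELECT pg_size_pretty(pg_total_relation_size(reltoastrelid)) AS toast_size
    FROM pg_class
    WHERE relname = 'docs';   -- stand-in for the real table name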

If I compress one of these JSON files outside of Postgres, it goes from 251K to 1K. So each file should theoretically fit on a single row in the toast table. In total, that's well under a GB when compressed outside of PG.

Any guesses as to why there is so little compression of this data or how I might remedy the situation?

Thanks much for your help.


Larry



Re: Very Limited Toast Compression on JSONB (9.4 beta 2)

From
Adrian Klaver
Date:
On 07/31/2014 01:44 PM, Larry White wrote:
> Hi,
>
> I'm running an experiment on 9.4 beta 2.
>
> I put 275,000 identical JSON files into a table using JSONB (one per
> row).  Each raw text file is 251K in size, so the total uncompressed is
> 69GB. The column storage is set to EXTENDED.  There are other toastable
> columns in the table, but none have more than 36 bytes of data in them.
>
> My Toast table is 66GB. I would have expected to get that much (or more)
> compression just from JSONB being a binary format.
>
> If I compress one of these JSON files outside of Postgres, it goes from
> 251K to 1K. So each file should theoretically fit on a single row in the
> toast table. In total, that's well under a GB when compressed
> outside of PG.
>
> Any guesses as to why there is so little compression of this data or how
> I might remedy the situation?

Are you sure the column storage is EXTENDED and not EXTERNAL?
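
For what it's worth, psql's \d+ output includes a Storage column for each attribute, or it can be read from the catalogs; a sketch, with 'docs' standing in for the real table name:

    SELECT attname, attstorage   -- 'x' = EXTENDED, 'e' = EXTERNAL, 'm' = MAIN, 'p' = PLAIN
    FROM pg_attribute
    WHERE attrelid = 'docs'::regclass
      AND attnum > 0 AND NOT attisdropped;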

>
> Thanks much for your help.
>
>
> Larry
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Very Limited Toast Compression on JSONB (9.4 beta 2)

From
Larry White
Date:
Yes. It was EXTENDED. 

As a further test, I dropped the table and rebuilt it, explicitly changing the storage from EXTENDED to EXTERNAL, and got a TOAST table of exactly the same size. So there was no compression at all with storage set to EXTENDED.
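
(Roughly what that change looks like, with 'docs' and 'doc' as stand-in names; note that ALTER ... SET STORAGE only affects values stored afterwards, hence the drop and reload:)

    ALTER TABLE docs ALTER COLUMN doc SET STORAGE EXTERNAL;  -- out of line, no compression attempted
    ALTER TABLE docs ALTER COLUMN doc SET STORAGE EXTENDED;  -- out of line allowed, compression allowed (the jsonb default)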


On Thu, Jul 31, 2014 at 11:51 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
Are you sure the column storage is EXTENDED and not EXTERNAL?

--
Adrian Klaver
adrian.klaver@aklaver.com

Very Limited Toast Compression on JSONB (9.4 beta 2)

From
Jeff Janes
Date:
On Thursday, July 31, 2014, Larry White <ljw1001@gmail.com> wrote:
Hi, 

I'm running an experiment on 9.4 beta 2.

I put 275,000 identical JSON files into a table using JSONB (one per row).  Each raw text file is 251K in size, so the total uncompressed is 69GB. The column storage is set to EXTENDED.  There are other toastable columns in the table, but none have more than 36 bytes of data in them.

My Toast table is 66GB. I would have expected to get that much (or more) compression just from JSONB being a binary format.

If I compress one of these JSON files outside of Postgres, it goes from 251K to 1K.

That is an astonishing amount of compression.  Are you really compressing one of the files in isolation, rather than co-compressing many and then pro-rating the result?

Can you provide an example of the data, and the command line you used to compress it?

 
So each file should theoretically fit on a single row in the toast table. In total, that's well under a GB when compressed outside of PG.

Any guesses as to why there is so little compression of this data or how I might remedy the situation?

PostgreSQL's built in tuple compression is generally not very good.  It is good at compressing long strings of identical bytes, but not good at compressing the type of thing you are likely to find in JSON (unless your JSON had long strings of spaces to reflect indentation of deeply nested structures, which JSON probably wouldn't do and which JSONB certainly wouldn't).  It was designed to be very fast and to be unencumbered with the patent issues common at the time it was written.  It was not designed to give the best possible compression ratios.   
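
A rough way to see the difference on a scratch table (illustrative only; exact numbers will vary):

    CREATE TEMP TABLE t (v text);
    INSERT INTO t VALUES (repeat('a', 100000));          -- long run of identical bytes
    INSERT INTO t SELECT string_agg(md5(i::text), ' ')
                  FROM generate_series(1, 3000) i;       -- mixed text, closer to real JSON content
    SELECT length(v) AS raw_chars, pg_column_size(v) AS stored_bytes FROM t;
    -- pg_column_size() shows the stored (post-compression) size: the first row shrinks
    -- to a small fraction of its length, while the second compresses little if at all.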

It also compresses each row independently. Most of the compression opportunities in a column of JSON data would probably be between rows, when the same keys show up over and over again, not within a row. But it can't capture those opportunities.

Cheers,

Jeff

Re: Very Limited Toast Compression on JSONB (9.4 beta 2)

From
Larry White
Date:

On Fri, Aug 1, 2014 at 2:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thursday, July 31, 2014, Larry White <ljw1001@gmail.com> wrote:
Hi, 

I'm running an experiment on 9.4 beta 2.

I put 275,000 identical JSON files into a table using JSONB (one per row).  Each raw text file is 251K in size, so the total uncompressed is 69GB. The column storage is set to EXTENDED.  There are other toastable columns in the table, but none have more than 36 bytes of data in them.

My Toast table is 66GB. I would have expected to get that much (or more) compression just from JSONB being a binary format.

If I compress one of these JSON files outside of Postgres, it goes from 251K to 1K.

That is an astonishing amount of compression.  Are you really compressing one of the files in isolation, rather than co-compressing many and then pro-rating the result?

Yes, I should have explained. These are generated JSON files for testing, and there is massive repetition in them, which is why they compress so well outside of Postgres. (Basically there is a repeating array of the same string.) I did compress just the one file. 
 

Can you provide an example of the data, and the command line you used to compress it?

Compressed on a Mac with the Compress UI option. Here's a brief sample from the file. You can see why it compresses so well:
"{\"junk\":[\"124245etweetwet345gwtretwt43 qwrqwq qwre qw
rsdflkas\",\"q4535 wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535 

and so on. 

So each file should theoretically fit on a single row in the toast table. In total, that's well under a GB when compressed outside of PG.

Any guesses as to why there is so little compression of this data or how I might remedy the situation?

PostgreSQL's built in tuple compression is generally not very good.  It is good at compressing long strings of identical bytes, but not good at compressing the type of thing you are likely to find in JSON (unless your JSON had long strings of spaces to reflect indentation of deeply nested structures, which JSON probably wouldn't do and which JSONB certainly wouldn't).  It was designed to be very fast and to be unencumbered with the patent issues common at the time it was written.  It was not designed to give the best possible compression ratios.   

It also compresses each row independently. Most of the compression opportunities in a column of JSON data would probably be between rows, when the same keys show up over and over again, not within a row. But it can't capture those opportunities.

I'm not expecting miracles with real data, but as far as I can tell, there is zero compression happening. I'm wondering if it is disabled for JSONB for some reason.
 

Cheers,

Jeff

Re: Very Limited Toast Compression on JSONB (9.4 beta 2)

From
Larry White
Date:
There is no TOAST compression on JSON or JSONB data in 9.4 beta 2. I'm not sure about other versions.  I'm also not sure if this is a bug or by design, but if it is by design, I think the documentation should be updated. 

Here is a summary of my results inserting 10,000 highly compressible JSON docs of 251K each.

Column Type - Storage  - TOAST table size
JSONB       - EXTERNAL - 2448 MB
JSONB       - EXTENDED - 2448 MB
JSON        - EXTENDED - 2504 MB
TEXT        - EXTERNAL - 2409 MB
TEXT        - EXTENDED -   40 MB
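
(A sketch of how such a comparison can be set up and measured; the table names are illustrative and this is not the original test script:)

    CREATE TABLE docs_jsonb (doc jsonb);
    CREATE TABLE docs_json  (doc json);
    CREATE TABLE docs_text  (doc text);
    ALTER TABLE docs_jsonb ALTER COLUMN doc SET STORAGE EXTERNAL;  -- switched per run
    -- ... load the same 10,000 documents into each table ...
    SELECT relname,
           pg_size_pretty(pg_total_relation_size(reltoastrelid)) AS toast_size
    FROM pg_class
    WHERE relname IN ('docs_jsonb', 'docs_json', 'docs_text');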




Re: Very Limited Toast Compression on JSONB (9.4 beta 2)

From
Jeff Janes
Date:
On Thu, Jul 31, 2014 at 11:36 PM, Larry White <ljw1001@gmail.com> wrote:

On Fri, Aug 1, 2014 at 2:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thursday, July 31, 2014, Larry White <ljw1001@gmail.com> wrote:
Hi, 

I'm running an experiment on 9.4 beta 2.

I put 275,000 identical JSON files into a table using JSONB (one per row).  Each raw text file is 251K in size, so the total uncompressed is 69GB. The column storage is set to EXTENDED.  There are other toastable columns in the table, but none have more than 36 bytes of data in them.

My Toast table is 66GB. I would have expected to get that much (or more) compression just from JSONB being a binary format.

If I compress one of these JSON files outside of Postgres, it goes from 251K to 1K.

That is an astonishing amount of compression.  Are you really compressing one of the files in isolation, rather than co-compressing many and then pro-rating the result?

Yes, I should have explained. These are generated JSON files for testing and there is massive repetition in them, which is why they compress so well outside of Postgres. (Basically there is a repeating array of the same string) I did compress just the one. 
 

Can you provide an example of the data, and the command line you used to compress it?

Compressed on a Mac with the Compress UI option. Here's a brief sample from the file. You can see why it compresses so well:
"{\"junk\":[\"124245etweetwet345gwtretwt43 qwrqwq qwre qw
rsdflkas\",\"q4535 wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535 

and so on. 

If I take that example (and cap off the array and hash right after the end of what you show, and remove the escapes of the double quote marks) then it does not compress, but only because it is not long enough to trigger the compression attempts.

If I repeat the array portion 4 more times to make the whole thing long enough for compression to be used, it compresses nicely.  Not 100-fold (but then again, neither does bzip2 or gzip on the data I just described), but text and json compress 10-fold and jsonb 5-fold.
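
A rough reconstruction of that comparison, for anyone who wants to repeat it (temp tables; the generated document below only approximates the sample quoted above):

    CREATE TEMP TABLE t_text  (v text);
    CREATE TEMP TABLE t_json  (v json);
    CREATE TEMP TABLE t_jsonb (v jsonb);

    -- one document: the two sample strings repeated many times inside a "junk" array
    INSERT INTO t_text
    SELECT '{"junk":['
           || rtrim(repeat('"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas","q4535 wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs sdfsd",', 300), ',')
           || ']}';
    INSERT INTO t_json  SELECT v::json  FROM t_text;
    INSERT INTO t_jsonb SELECT v::jsonb FROM t_text;

    -- pg_column_size() reports the size actually stored, after any TOAST compression
    SELECT 'text'  AS col_type, length(v) AS raw_chars, pg_column_size(v) AS stored_bytes FROM t_text
    UNION ALL
    SELECT 'json',  length(v::text), pg_column_size(v) FROM t_json
    UNION ALL
    SELECT 'jsonb', length(v::text), pg_column_size(v) FROM t_jsonb;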

Cheers,

Jeff

Re: Very Limited Toast Compression on JSONB (9.4 beta 2)

From
Larry White
Date:
Jeff,

Thank you for your help. This is a Postgres bug, but I don't think I'd have figured it out without your help.

What is happening is that if PG can, after compression, fit the entire 'document' into one row/page in the TOAST table, it does so. However, if the document is too big to fit in one row after compression, it does no compression at all. This is why it worked for you, but not for me.

I create my test file (in part) with this loop:

        // each iteration appends the same two ~50-character strings to the payload's "junk" array
        for (int j = 0; j < 110; j++) {
            mediumPayload.getJunk().add("124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas");
            mediumPayload.getJunk().add("q4535 wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs sdfsd");
        }

If the loop runs 110 times as shown, it compresses.
If the loop runs 111 times, it does not:

With 110 iterations:
Extended    8192 bytes (one page)
External    66 MB

With 111 iterations:
Extended    69 MB
External    69 MB
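
(One way to confirm that the stored values really aren't being compressed, again with 'docs' and 'doc' as stand-in names:)

    SELECT octet_length(doc::text) AS serialized_bytes,  -- roughly the raw document size
           pg_column_size(doc)     AS stored_bytes       -- what actually sits in the TOAST table
    FROM docs
    LIMIT 3;
    -- stored_bytes close to serialized_bytes means the value was stored out of line
    -- without compression; a compressed value would be much smaller.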

Hopefully they can fix this before the GA release. 



Re: Very Limited Toast Compression on JSONB (9.4 beta 2)

From
Larry White
Date:
Reported as bug #11109.

