Re: jsonb format is pessimal for toast compression - Mailing list pgsql-hackers

From David G Johnston
Subject Re: jsonb format is pessimal for toast compression
Date
Msg-id 1407557341622-5814299.post@n5.nabble.com
In response to Re: jsonb format is pessimal for toast compression  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
akapila wrote
> On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane <tgl@.pa> wrote:
>>
>> Stephen Frost <sfrost@> writes:
>> > What about considering how large the object is when we are analyzing if
>> > it compresses well overall?
>>
>> Hmm, yeah, that's a possibility: we could redefine the limit at which
>> we bail out in terms of a fraction of the object size instead of a fixed
>> limit.  However, that risks expending a large amount of work before we
>> bail, if we have a very large incompressible object --- which is not
>> exactly an unlikely case.  Consider for example JPEG images stored as
>> bytea, which I believe I've heard of people doing.  Another issue is
>> that it's not real clear that that fixes the problem for any fractional
>> size we'd want to use.  In Larry's example of a jsonb value that fails
>> to compress, the header size is 940 bytes out of about 12K, so we'd be
>> needing to trial-compress about 10% of the object before we reach
>> compressible data --- and I doubt his example is worst-case.
>>
>> >> 1. The real problem here is that jsonb is emitting quite a bit of
>> >> fundamentally-nonrepetitive data, even when the user-visible input is very
>> >> repetitive.  That's a compression-unfriendly transformation by anyone's
>> >> measure.
>>
>> > I disagree that another algorithm wouldn't be able to manage better on
>> > this data than pglz.  pglz, from my experience, is notoriously bad at
>> > certain data sets which other algorithms are not as poorly impacted by.
>>
>> Well, I used to be considered a compression expert, and I'm going to
>> disagree with you here.  It's surely possible that other algorithms would
>> be able to get some traction where pglz fails to get any,
> 
> During my previous work in this area, I saw that some algorithms use
> skipping logic, which can be useful when incompressible data is followed
> by compressible data, and perhaps more generally as well.
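To make the "bail out after a fraction of the input" idea above a bit more
concrete, here is a rough sketch in C.  It is not actual PostgreSQL code:
the function name, the parameter names, and the 10% fraction are all
invented for illustration, and pglz's real bail-out logic (a fixed
first_success_by byte count) is obviously more involved.

#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check: give up on compression once we have examined a
 * fraction of the input without saving anything, instead of giving up
 * after a fixed byte count.
 */
bool
should_give_up(size_t bytes_examined, size_t bytes_saved, size_t input_len)
{
    /* Strawman: scan up to 10% of the input before declaring it incompressible. */
    size_t      limit = input_len / 10;

    /*
     * The concern upthread still applies: for a large incompressible value
     * (a JPEG stored as bytea, say) this burns work on the first 10% of it,
     * and in the jsonb case the ~940-byte header can itself be close to 10%
     * of a 12 kB value, so the fraction is hard to pick.
     */
    return bytes_examined >= limit && bytes_saved == 0;
}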

Random thought from the sideline...

This particular data type has a design that is novel within PostgreSQL: a
sizeable, feature-oriented header followed by a payload.  Is there some way
to teach the storage system about that model, so that, at a higher level,
each section is compressed separately and the compressed (or not) results
are written out adjacently, preceded by a small header giving the length of
the stored type header plus other metadata, such as whether each part is
compressed and what type the data represents?  When reading the value back
into memory, the generic header-plus-payload container would be populated,
the header and payload decompressed as needed, and the two parts fed into
the appropriate type constructor that understands and accepts them.
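
In struct form, the kind of container I have in mind might look something
like the sketch below.  Every name here is invented purely for illustration;
nothing like this exists in the tree, and the field layout is just a guess
at what would be needed.

#include <stdint.h>

#define SPLIT_HEADER_COMPRESSED   0x01
#define SPLIT_PAYLOAD_COMPRESSED  0x02

/*
 * Hypothetical prefix for a "split" datum: the type's header and its
 * payload are compressed (or not) independently, and this small prefix
 * records how to reassemble them on the way back in.
 */
typedef struct SplitDatumPrefix
{
    uint32_t    rawheaderlen;      /* uncompressed length of the type header */
    uint32_t    storedheaderlen;   /* bytes actually stored for the header */
    uint32_t    rawpayloadlen;     /* uncompressed length of the payload */
    uint32_t    storedpayloadlen;  /* bytes actually stored for the payload */
    uint8_t     typecode;          /* which type's constructor gets the two parts */
    uint8_t     flags;             /* SPLIT_*_COMPRESSED bits above */
    /* stored header bytes follow immediately, then stored payload bytes */
} SplitDatumPrefix;

On the read side, the toast code would decompress each part only if its
flag is set and then hand the two buffers to the type's constructor.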

Just hoping to spark an idea here - I don't know enough about the internals
to even guess how close I am to something feasible.

David J.



