On 08/08/2014 11:18 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 08/07/2014 11:17 PM, Tom Lane wrote:
>>> I looked into the issue reported in bug #11109. The problem appears to be
>>> that jsonb's on-disk format is designed in such a way that the leading
>>> portion of any JSON array or object will be fairly incompressible, because
>>> it consists mostly of a strictly-increasing series of integer offsets.
>
>> Back when this structure was first presented at PGCon 2013, I wondered
>> if we shouldn't extract the strings into a dictionary, because of key
>> repetition, and convinced myself that this shouldn't be necessary
>> because in significant cases TOAST would take care of it.
>
> That's not really the issue here, I think. The problem is that a
> relatively minor aspect of the representation, namely the choice to store
> a series of offsets rather than a series of lengths, produces
> nonrepetitive data even when the original input is repetitive.

It would certainly be worth validating that changing this would fix the
problem.

I don't know how invasive that would be - I suspect (without looking
very closely) not terribly much.

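One rough way to sanity-check the offsets-versus-lengths theory outside the server is sketched below. The assumptions are loud ones: zlib stands in for pg_lzcompress (a different algorithm with different thresholds), the header is modelled as bare little-endian uint32 words rather than the real jsonb entry layout, and the element count and sizes are made up. It only illustrates why running offsets look nonrepetitive to a compressor while lengths do not.

```python
import struct
import zlib


def header_bytes(lengths, use_offsets):
    """Pack one uint32 per element: either its running end offset or its length.

    This is a simplified stand-in for a container header, NOT the actual
    jsonb on-disk format.
    """
    words = []
    running = 0
    for n in lengths:
        running += n
        words.append(running if use_offsets else n)
    return struct.pack("<%dI" % len(words), *words)


# Hypothetical input: 1000 elements of 8 bytes each (highly repetitive data).
lengths = [8] * 1000

as_offsets = header_bytes(lengths, use_offsets=True)   # 8, 16, 24, ...
as_lengths = header_bytes(lengths, use_offsets=False)  # 8, 8, 8, ...

# zlib is an assumption here; it is not what TOAST uses.
offsets_compressed = len(zlib.compress(as_offsets))
lengths_compressed = len(zlib.compress(as_lengths))

print("offsets:", len(as_offsets), "->", offsets_compressed)
print("lengths:", len(as_lengths), "->", lengths_compressed)
```

Both headers are the same size before compression; the interesting number is how much smaller the lengths variant gets. A real validation would of course have to be done against pg_lzcompress with the actual jsonb layout.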
> 2. Are we going to ship 9.4 without fixing this? I definitely don't see
> replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> jsonb is still within the bounds of reason.
>
> Considering all the hype that's built up around jsonb, shipping a design
> with a fundamental performance handicap doesn't seem like a good plan
> to me. We could perhaps band-aid around it by using different compression
> parameters for jsonb, although that would require some painful API changes
> since toast_compress_datum() doesn't know what datatype it's operating on.
>
>
Yeah, it would be a bit painful, but after all, finding out this sort of
thing is why we have betas.

cheers

andrew