Re: jsonb format is pessimal for toast compression - Mailing list pgsql-hackers
From: Stephen Frost
Subject: Re: jsonb format is pessimal for toast compression
Msg-id: 20140809001505.GN16422@tamriel.snowman.net
In response to: Re: jsonb format is pessimal for toast compression (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: jsonb format is pessimal for toast compression
List: pgsql-hackers
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> I looked into the issue reported in bug #11109.  The problem appears to be
> >> that jsonb's on-disk format is designed in such a way that the leading
> >> portion of any JSON array or object will be fairly incompressible, because
> >> it consists mostly of a strictly-increasing series of integer offsets.
> >> This interacts poorly with the code in pglz_compress() that gives up if
> >> it's found nothing compressible in the first first_success_by bytes of a
> >> value-to-be-compressed.  (first_success_by is 1024 in the default set of
> >> compression parameters.)
>
> > I haven't looked at this in any detail, so take this with a grain of
> > salt, but what about teaching pglz_compress about using an offset
> > farther into the data, if the incoming data is quite a bit larger than
> > 1k?  This is just a test to see if it's worthwhile to keep going, no?
>
> Well, the point of the existing approach is that it's a *nearly free*
> test to see if it's worthwhile to keep going; there's just one if-test
> added in the outer loop of the compression code.  (cf commit ad434473ebd2,
> which added that along with some other changes.)  AFAICS, what we'd have
> to do to do it as you suggest would be to execute compression on some subset
> of the data and then throw away that work entirely.  I do not find that
> attractive, especially when for most datatypes there's no particular
> reason to look at one subset instead of another.

Ah, I see- we were using the first block as it means we can reuse the work
done on it if we decide to continue with the compression.  Makes sense.

We could possibly arrange to have the amount attempted depend on the data
type, but you point out that we can't do that without teaching lower
components about types, which is less than ideal.  What about considering
how large the object is when we are analyzing if it compresses well
overall?
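(The incompressibility of a leading offset run can be illustrated with a small experiment.  This is only a sketch: zlib stands in for pglz here, both being LZ77-family compressors, and the header below is a synthetic stand-in for a jsonb-style offset array, not the real on-disk layout.)

```python
import struct
import zlib

# Synthetic stand-in for a jsonb-like header: a strictly-increasing
# series of 4-byte little-endian offsets (assumed layout, for
# illustration only), followed by highly repetitive user data.
offsets = b"".join(struct.pack("<I", i * 17) for i in range(256))  # 1024 bytes
payload = b'{"key": "value"}' * 64                                 # 1024 bytes

# An LZ-family compressor finds few long repeats in the offset run,
# but collapses the repetitive JSON text dramatically.
print("offset header :", len(zlib.compress(offsets)), "of", len(offsets), "bytes")
print("repeated json :", len(zlib.compress(payload)), "of", len(payload), "bytes")
```

With a 1024-byte early-exit window, a pglz-style compressor would see only the poorly-compressing header before deciding whether to give up on the whole datum.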
That is- for a larger object, make a larger effort to compress it.
There's clearly a pessimistic case which could arise from that, but it
may be better than the current situation.  There's a clear risk that such
an algorithm may well be very type specific, meaning that we make things
worse for some types (eg: byteas which end up never compressing well,
where we'd likely spend more CPU time trying than we do today).

> 1. The real problem here is that jsonb is emitting quite a bit of
> fundamentally-nonrepetitive data, even when the user-visible input is very
> repetitive.  That's a compression-unfriendly transformation by anyone's
> measure.  Assuming that some future replacement for pg_lzcompress() will
> nonetheless be able to compress the data strikes me as mostly wishful
> thinking.  Besides, we'd more than likely have a similar early-exit rule
> in any substitute implementation, so that we'd still be at risk even if
> it usually worked.

I agree that jsonb ends up being nonrepetitive in part, which is why I've
been trying to push the discussion in the direction of making it more
likely for the highly-compressible data to be considered rather than the
start of the jsonb object.  I don't care for our compression algorithm
having to be catered to in this regard in general though, as the exact
same problem could, and likely does, exist in some real-life bytea-using
PG implementations.  I disagree that another algorithm wouldn't be able
to manage better on this data than pglz.  pglz, from my experience, is
notoriously bad at certain data sets which other algorithms are not as
poorly impacted by.

> 2. Are we going to ship 9.4 without fixing this?  I definitely don't see
> replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> jsonb is still within the bounds of reason.

I'd really hate to ship 9.4 without a fix for this, but I have a
similarly hard time with shipping 9.4 without the binary search
component.
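(One way to express the size-dependent idea floated above is to let the early-exit threshold grow with the input rather than fixing it at 1024 bytes.  This is a hypothetical sketch of such a heuristic, not PostgreSQL code; the name early_exit_threshold, the divisor, and the cap are all made up for illustration.)

```python
def early_exit_threshold(src_len: int,
                         base: int = 1024,
                         divisor: int = 16,
                         cap: int = 16 * 1024) -> int:
    """Scale a pglz-style first_success_by window with input size.

    Larger inputs get a proportionally longer chance to show
    compressible data before the compressor gives up, bounded by a
    cap so the wasted effort on truly incompressible data stays small.
    All parameters here are illustrative assumptions.
    """
    return min(max(base, src_len // divisor), cap)

# A 1 MB datum would get a 16 KiB window instead of 1 KiB,
# comfortably reaching past a jsonb-style offset header.
print(early_exit_threshold(1 << 20))
```

The pessimistic case mentioned above is bounded by the cap: at worst the compressor scans cap bytes of incompressible data before bailing out, rather than a fixed 1024.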
> Considering all the hype that's built up around jsonb, shipping a design
> with a fundamental performance handicap doesn't seem like a good plan
> to me.  We could perhaps band-aid around it by using different compression
> parameters for jsonb, although that would require some painful API changes
> since toast_compress_datum() doesn't know what datatype it's operating on.

I don't like the idea of shipping with this handicap either.  Perhaps
another option would be a new storage type which basically says "just
compress it, no matter what"?  We'd be able to make that the default for
jsonb columns too, no?

Again- I'll admit this is shooting from the hip, but I wanted to get
these thoughts out and I won't have much more time tonight.

Thanks!

Stephen