Re: Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date:
Msg-id: CA+Tgmoa-b7oHNMyngY=0EoTAXbDZ+JYd437cBqnUFTLuZqYCQg@mail.gmail.com
In response to: Re: Performance Improvement by reducing WAL for Update Operation (Greg Smith <greg@2ndQuadrant.com>)
Responses: Re: Performance Improvement by reducing WAL for Update Operation (Amit Kapila <amit.kapila16@gmail.com>)
           Re: Performance Improvement by reducing WAL for Update Operation (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers

On Mon, Jul 22, 2013 at 2:31 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On the Mac, the only case that seems to have a slowdown now is "hundred
> tiny fields, half nulled".  It would be nice to understand just what is
> going on with that one.  I got some ugly results in "two short fields,
> no change" too, along with a couple of other weird results, but I think
> those were testing procedure issues that can be ignored.  The pgbench
> throttle work I did recently highlights that I can't really make a Mac
> quiet/consistent for benchmarking very well.  Note that I ran all of the
> Mac tests with assertions on, to try and catch platform specific bugs.
> The Linux ones used the default build parameters.

Amit has been asking me to look at this patch for a while, so I finally did. While I agree that it would be nice to get the CPU overhead down low enough that we can turn this on by default and forget about it, I'm not convinced that it's without value even if we can't. Fundamentally, this patch trades away some CPU in exchange for decreased I/O. The testing thus far is all about whether the CPU overhead can be made trivial, which is a good question to ask, because if the answer is yes, then rather than trading something for something else, we just get something for free. Win! But even if that doesn't pan out, I think the fallback position should not be "OK, well, if we can't get decreased I/O for free then forget it" but rather "OK, if we can't get decreased I/O for free then let's get decreased I/O in exchange for increased CPU usage".

I spent a little time running the tests from Heikki's script under perf. On all three "two short fields" tests and also on the "ten tiny fields, all changed" test, we spend about 1% of the CPU time in pglz_delta_encode. I don't see any evidence that it's actually compressing anything at all; it appears to be falling out where we test the input length against the strategy, presumably because the default strategy (which we largely copy here) doesn't try to compress input data of less than 32 bytes. Given that this code isn't actually compressing anything in these cases, I'm a bit confused by Greg's report of substantial gains on the "ten tiny fields, all changed" test; how can we win if we're not compressing?

I studied the "hundred tiny fields, half nulled" test case in some detail. Some thoughts:

- There is a comment "TODO: It would be nice to behave like the history and the source strings were concatenated, so that you could compress using the new data, too." If we're not already doing that, then how are we managing to compress WAL by more than one-quarter in the "hundred tiny fields, all changed" case? It looks to me like the patch IS doing that, and I'm not sure it's a good idea, especially because it's using pglz_hist_add_no_recycle rather than pglz_add_hist: we verify that hlen + slen < 2 * PGLZ_HISTORY_SIZE, but that doesn't seem good enough. On the "hundred tiny fields, half nulled" test case, removing that line reduces compression somewhat but also saves on CPU cycles.

- pglz_find_match() is happy to walk all the way down even a really, really long bucket chain. It has some code that reduces good_match each time through, but it fails to produce a non-zero decrement once good_match * good_drop < 100. So if we're searching an enormously deep bucket many times in a row, and there are actually no matches, we'll go flying down the whole linked list every time.
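To make the stalled decrement concrete, here is a tiny standalone sketch (not the pglz source; the starting values are invented for illustration) showing how a percentage-based reduction bottoms out under integer division:

```c
#include <stdio.h>

int
main(void)
{
    /* Invented values; the real ones come from the compression strategy. */
    int         good_match = 10;
    int         good_drop = 8;      /* percent */
    int         i;

    for (i = 0; i < 5; i++)
    {
        printf("pass %d: good_match = %d\n", i, good_match);

        /* Once good_match * good_drop < 100, this subtracts zero forever. */
        good_match -= good_match * good_drop / 100;

        /*
         * The experiment mentioned in the next paragraph would instead
         * force some progress, e.g.
         * good_match -= Max(good_match * good_drop / 100, 1);
         */
    }
    return 0;
}
```

With those starting values the decrement is zero on every pass, so good_match never shrinks and the bucket chain gets walked in full each time.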
I tried mandating a minimum decrease of 1 and didn't observe that it made any difference in the run time of this test case, but it still seems odd. For the record, it's not new behavior with this patch; pglz_compress() has the same issue as things stand today. I wonder if we ought to decrease the good match length by a constant rather than a percentage at each step.

- pglz_delta_encode() seems to spend about 50% of its CPU time loading up the history data. It would be nice to find a way to reduce that effort. I thought about maybe only making a history entry for, say, every fourth input position rather than every one, but a quick test seems to suggest that's a big fail, probably because it's too easy to skip over the position where we would have made "the right" match via some short match. But maybe there's some more sophisticated strategy here that would work better. For example, see:

http://en.wikipedia.org/wiki/Rabin_fingerprint

The basic idea is that you use a rolling hash function to divide up the history data into chunks of a given average size. So we scan the history data, compute a rolling hash value at each point, and each time the bottom N bits are zero, we consider that to be the end of a chunk. We enter all the chunks into a hash table. The chunk size will vary, but on average, given a reasonably well-behaved rolling hash function (the pglz one probably doesn't qualify), a boundary will turn up about every 2^N bytes, so perhaps for this purpose we'd choose N to be between 3 and 5. Then, we scan the input we want to compress and divide it into chunks in the same way. Chunks that don't exist in the history data get copied to the output, while those that do get replaced with a reference to their position in the history data.

I'm not 100% certain that something like this would be better than trying to leverage the pglz machinery, but I think it might be worth trying. One possible advantage is that you make many fewer hash-table entries, which reduces both the cost of setting up the hash table and the cost of probing it. Another is that if you find a hit in the hash table, you needn't search any further: you are done. This is related to the point that, for the most part, the processing is chunk-at-a-time rather than character-at-a-time, which might be more efficient. On the other hand, the compression ratio might stink, or it might suck for some other reason: I just don't know.
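To make the chunking idea a bit more concrete, here is a minimal standalone sketch. It uses a toy shift/XOR rolling hash rather than a true Rabin fingerprint, the constants and example data are invented, and it only shows where content-defined boundaries fall, not the full delta encoder:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define CHUNK_BITS  4                       /* average chunk ~ 2^4 = 16 bytes */
#define CHUNK_MASK  ((1u << CHUNK_BITS) - 1)
#define WINDOW      8                       /* bytes in the rolling window */

/*
 * Return the length of the next content-defined chunk: roll a small
 * shift/XOR hash over the data and declare a boundary whenever the low
 * CHUNK_BITS bits of the hash are zero.
 */
static size_t
next_chunk_boundary(const unsigned char *data, size_t len)
{
    uint32_t    hash = 0;
    size_t      i;

    for (i = 0; i < len; i++)
    {
        hash = (hash << 1) ^ data[i];       /* roll the new byte in */
        if (i >= WINDOW)
            hash ^= (uint32_t) data[i - WINDOW] << WINDOW;  /* roll the old byte out */

        if (i + 1 >= WINDOW && (hash & CHUNK_MASK) == 0)
            return i + 1;                   /* boundary after this byte */
    }
    return len;                             /* final (possibly short) chunk */
}

int
main(void)
{
    /* Invented example data standing in for the history (the old tuple). */
    const char *history = "hundred tiny fields, half nulled; hundred tiny fields, half nulled";
    const unsigned char *p = (const unsigned char *) history;
    size_t      remaining = strlen(history);
    size_t      off = 0;

    /*
     * A real delta encoder would put each history chunk into a hash table,
     * chunk the new tuple the same way, and emit a literal for chunks not
     * found in the table or a (history offset, length) reference for those
     * that are.  Here we only print where the boundaries fall.
     */
    while (remaining > 0)
    {
        size_t      clen = next_chunk_boundary(p + off, remaining);

        printf("chunk at offset %zu, length %zu\n", off, clen);
        off += clen;
        remaining -= clen;
    }
    return 0;
}
```

Because boundaries are chosen by content rather than position, the same field values produce the same chunks wherever they sit in the tuple, which is what makes the hash-table lookup against the history cheap; the average chunk size here (2^CHUNK_BITS bytes) plays the role of the 2^N figure above.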
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company