Re: Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers
From | Heikki Linnakangas
---|---
Subject | Re: Performance Improvement by reducing WAL for Update Operation
Date |
Msg-id | 52E8D3E6.3020505@vmware.com
In response to | Re: Performance Improvement by reducing WAL for Update Operation (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses | Re: Performance Improvement by reducing WAL for Update Operation
List | pgsql-hackers
On 01/28/2014 07:01 PM, Heikki Linnakangas wrote:
> On 01/27/2014 07:03 PM, Amit Kapila wrote:
>> I have tried to improve algorithm in another way so that we can get
>> benefit of same chunks during find match (something similar to lz).
>> The main change is to consider chunks at fixed boundary (4 byte)
>> and after finding match, try to find if there is a longer match than
>> current chunk. While finding longer match, it still takes care that
>> next bigger match should be at chunk boundary. I am not
>> completely sure about the chunk boundary may be 8 or 16 can give
>> better results.
>
> Since you're only putting a value in the history every 4 bytes, you
> wouldn't need to calculate the hash in a rolling fashion. You could just
> take next four bytes, calculate hash, put it in history table. Then next
> four bytes, calculate hash, and so on. Might save some cycles when
> building the history table...

On a closer look, you're putting a chunk in the history table only every four bytes, but you're *also* checking the history table for a match only every four bytes. That completely destroys the shift-resistance of the algorithm. For example, if the new tuple is an exact copy of the old tuple except for one additional byte at the beginning, the algorithm would fail to recognize that. It would be good to add a test case like that to the test suite.

You can skip bytes when building the history table, or when finding matches, but not both. Or you could skip N bytes, then check for matches at each of the next four byte positions, then skip again, and so forth, as long as you always check four consecutive bytes (because the hash function is calculated from four bytes).

I couldn't resist the challenge, and started hacking on this. First, some observations on your latest patch (pgrb_delta_encoding_v5.patch):

1. There are a lot of comments and code that refer to "chunks", which seem obsolete.
For example, the ck_size field in PGRB_HistEntry is always set to a constant, 4, except maybe at the very end of the history string. The algorithm has nothing to do with Rabin-Karp anymore.

2. The 'hindex' field in PGRB_HistEntry is unused. Also, ck_start_pos is redundant with the index of the entry in the array, because the array is filled in order. That leaves only the 'next' field, and it can be represented as an int16 rather than a pointer. So we really only need a simple int16 array as the history entries array.

3. You're not gaining much by calculating the hash in a rolling fashion. A single rolling step requires two multiplications and two sums, plus shifting the variables around. Calculating the hash from scratch requires three multiplications and three sums.

4. Given that we're not doing the Rabin-Karp variable-length chunk thing, we could use a cheaper hash function to begin with, like the one used in pglz. The multiply-by-prime method probably gives fewer collisions than pglz's shift/xor method, but I don't think that matters as much as computation speed. No one has mentioned or measured the effect of collisions in this thread; that either means it's a non-issue, or that no one has yet realized how big a problem it is. I'm guessing it's not a problem, and if it is, it's mitigated by only trying to find matches every N bytes: collisions would make finding matches slower, and that's exactly what skipping helps with.

After addressing the above, we're pretty much back to the pglz approach. I kept the change to only find matches every four bytes; that does make some difference. And I like having this new encoding code in a separate file, not mingled with the pglz stuff; it's sufficiently different that that's better.

I haven't done much testing with this, so take it with a grain of salt. I don't know if this is better or worse than the other patches that have been floated around, but I thought I might as well share it.

- Heikki
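To illustrate the scheme discussed above, here is a minimal, self-contained sketch (not the patch's actual code; all names such as `hash4` and `matched_bytes` are hypothetical, and the hash constants are illustrative rather than pglz's exact ones). It builds history entries from every 4th position of the old tuple using a simple shift/xor hash and an int16 chain array, but probes at *every* byte offset of the new tuple; probing at every offset is what keeps the scheme shift-resistant, so a new tuple with one extra byte at the front still matches the old one.

```c
/* Hypothetical sketch of delta-match finding: history entries every 4
 * bytes of the old tuple, probes at every byte of the new tuple. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <assert.h>

#define HIST_SIZE 4096          /* hash buckets; must be a power of two */
#define CHUNK     4             /* bytes hashed per history entry */
#define MAX_OLD   2048          /* cap so the chain array can be fixed-size */

/* pglz-style shift/xor hash of four bytes (illustrative constants). */
static uint32_t
hash4(const unsigned char *p)
{
    return ((p[0] << 6) ^ (p[1] << 4) ^ (p[2] << 2) ^ p[3]) & (HIST_SIZE - 1);
}

/* Count how many bytes of 'newbuf' can be covered by matches against
 * 'oldbuf'.  A real encoder would emit copy/literal instructions here. */
static int
matched_bytes(const unsigned char *oldbuf, int oldlen,
              const unsigned char *newbuf, int newlen)
{
    int16_t head[HIST_SIZE];    /* bucket -> most recent position, -1 = empty */
    int16_t next[MAX_OLD];      /* per-position chain links */
    int     i, covered = 0;

    assert(oldlen <= MAX_OLD);
    memset(head, -1, sizeof(head));

    /* Build history: one entry every CHUNK bytes of the old tuple. */
    for (i = 0; i + CHUNK <= oldlen; i += CHUNK)
    {
        uint32_t h = hash4(&oldbuf[i]);
        next[i] = head[h];
        head[h] = (int16_t) i;
    }

    /* Probe at every byte offset of the new tuple (shift-resistant). */
    i = 0;
    while (i + CHUNK <= newlen)
    {
        int16_t pos = head[hash4(&newbuf[i])];
        int     best = 0;

        for (; pos >= 0; pos = next[pos])
        {
            int len = 0;

            /* Verify the candidate (hash collisions are possible),
             * then greedily extend the match. */
            while (i + len < newlen && pos + len < oldlen &&
                   newbuf[i + len] == oldbuf[pos + len])
                len++;
            if (len >= CHUNK && len > best)
                best = len;
        }
        if (best > 0)
        {
            covered += best;    /* would emit a copy instruction */
            i += best;
        }
        else
            i++;                /* would emit a literal byte */
    }
    return covered;
}
```

With this shape, an extra byte at the start of the new tuple only costs one literal: the probe at the next offset lines up with the history entry at position 0 of the old tuple and the rest matches in full. Probing only every 4 bytes instead, as the patch does, would miss that alignment entirely.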