Re: Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Performance Improvement by reducing WAL for Update Operation
Date
Msg-id CAA4eK1+nzr+18p43=L-2UQMPmTCJr=G2bPZB81X-y14-QZ0CpQ@mail.gmail.com
In response to Re: Performance Improvement by reducing WAL for Update Operation  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Performance Improvement by reducing WAL for Update Operation  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Nov 27, 2013 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 27, 2013 at 12:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> The basic idea is that you use a rolling hash function to divide up
>>> the history data into chunks of a given average size.  So we scan the
>>> history data, compute a rolling hash value at each point, and each
>>> time the bottom N bits are zero, we consider that to be the end of a
>>> chunk.  We enter all the chunks into a hash table.  The chunk size
>>> will vary, but on the average, given a reasonably well-behaved rolling
>>> hash function (the pglz one probably doesn't qualify) it'll happen
>>> every 2^N bytes, so perhaps for this purpose we'd choose N to be
>>> between 3 and 5.  Then, we scan the input we want to compress and
>>> divide it into chunks in the same way.  Chunks that don't exist in the
>>> history data get copied to the output, while those that do get
>>> replaced with a reference to their position in the history data.
>>
>> I think this idea looks better than the current one and it will definitely
>> improve some of the cases, but I am not sure we can win in all cases.
>> We have tried a similar idea (reduce the size of the hash and
>> eventually the comparison) by adding every 10th byte to the history
>> lookup table rather than every byte. It improved most cases but not
>> all of them (the "hundred tiny fields, all changed" and
>>  "hundred tiny fields, half changed" tests were still slow).
>> Patch and results are at link (refer approach-1):
>> http://www.postgresql.org/message-id/001f01ce1c14$d3af0770$7b0d1650$@kapila@huawei.com
>
> What you did there will, I think, tend to miss a lot of compression
> opportunities.  Suppose for example that the old tuple is
> ABCDEFGHIJKLMNOP and the new tuple is xABCDEFGHIJKLMNOP.  After
> copying one literal byte we'll proceed to copy 9 more, missing the
> fact that there was a long match available after the first byte.

That is right, but the idea behind trying it was to see if we could
reduce CPU usage at the cost of some compression; however, it didn't
completely eliminate that problem.

> The
> advantage of the fingerprinting technique is that it's supposed to be
> resistant to that sort of thing.

Okay, one question that arises here is whether it can be better in terms
of CPU usage compared to using the hash function for every 10th byte.
If you feel it can improve the situation, I can try a prototype
implementation of the same to check the results.
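Something on the lines of the below rough sketch is what I am thinking of
prototyping (only to illustrate the chunking idea; the rolling hash, the
mask value, and all names are placeholders and not taken from the existing
pglz code):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define CHUNK_MASK  0x0F        /* bottom 4 bits zero => chunk boundary,
                                 * giving ~16-byte average chunks */

typedef struct
{
    uint32_t    start;          /* offset of chunk within the input */
    uint32_t    len;            /* length of chunk */
} Chunk;

/*
 * Split 'data' into chunks: every time the bottom bits of the rolling
 * hash are zero, end the current chunk.  The caller supplies a chunk
 * array big enough for the worst case (one chunk per byte).
 */
static int
chunkify(const unsigned char *data, uint32_t len, Chunk *chunks)
{
    uint32_t    hash = 0;
    uint32_t    chunk_start = 0;
    int         nchunks = 0;
    uint32_t    i;

    for (i = 0; i < len; i++)
    {
        /*
         * Toy rolling hash over the last 4 bytes: shifting left by 8 in a
         * 32-bit value drops the oldest byte automatically.
         */
        hash = (hash << 8) | data[i];

        if ((hash & CHUNK_MASK) == 0)
        {
            chunks[nchunks].start = chunk_start;
            chunks[nchunks].len = i - chunk_start + 1;
            nchunks++;
            chunk_start = i + 1;
        }
    }

    /* trailing partial chunk, if any */
    if (chunk_start < len)
    {
        chunks[nchunks].start = chunk_start;
        chunks[nchunks].len = len - chunk_start;
        nchunks++;
    }

    return nchunks;
}

int
main(void)
{
    const char *history = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    Chunk       chunks[64];
    int         n = chunkify((const unsigned char *) history,
                             (uint32_t) strlen(history), chunks);
    int         i;

    for (i = 0; i < n; i++)
        printf("chunk %d: offset %u, len %u\n",
               i, chunks[i].start, chunks[i].len);

    /*
     * The next step (not shown) would be to enter the history chunks into
     * a hash table keyed by their contents, chunkify the new tuple the
     * same way, and for each new-tuple chunk either emit a reference to
     * the matching history chunk or copy it to the output as a literal.
     */
    return 0;
}

If this is roughly what you had in mind, I can extend it with the hash
table lookup and run the same test cases as before.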


>> Now the tough question is what are the possible options for this patch
>> and which one to pick:
>> a. optimize encoding technique, so that it can improve results in most
>> cases even if not all.
>> b. have a table level option or guc to enable/disable WAL compression.
>> c. use some heuristics to check whether the chances of compression are
>> good, and perform compression only in that case.
>>     1. apply this optimization for tuple size > 128 and < 2000
>>     2. apply this optimization if the number of modified columns is less
>> than 25% (some threshold number) of the total columns.
>>         I think we can get the modified columns from the target entry and
>> use them if triggers haven't changed the tuple. I remember
>>         earlier there were concerns that this value can't be trusted
>> completely, but I think using it as a heuristic is not a
>> problem, even if this number is not right in some cases.
>> d. forget about this optimization and reject the patch.
>> I think doing options 'b' and 'c' together can make this
>> optimization usable in the cases where it is actually useful.
>
> I agree that we probably want to do (b), and I suspect we want both a
> GUC and a reloption, assuming that can be done relatively cleanly.
>
> However, I think we should explore (a) more before we explore (c).

Sure, but the scope to explore (a) is a bit bigger. We have the below
options to explore (a):
1. Try to optimize the existing algorithm used in the patch; we have
already tried this, but of course we can spend some more time to see if
anything more can be tried out.
2. Try the fingerprint technique suggested by you above.
3. Try some other standard methods like vcdiff, lz4, etc.
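Also, for reference on (c), whenever we come back to it, the heuristic
check I was thinking of is roughly like the below (just a sketch; the
function and macro names are made up, only the thresholds are the ones
mentioned above):

#include <stdbool.h>

/* thresholds from the proposal above; exact values are open for discussion */
#define WAL_COMPRESS_MIN_TUPLE_SIZE     128
#define WAL_COMPRESS_MAX_TUPLE_SIZE     2000
#define WAL_COMPRESS_MAX_MODIFIED_PCT   25

/*
 * Decide whether attempting delta compression of the update WAL record
 * looks worthwhile for this tuple.
 */
static bool
wal_update_compression_worthwhile(int old_tuple_len,
                                  int modified_cols,
                                  int total_cols)
{
    /* 1. apply the optimization only for tuples between 128 and 2000 bytes */
    if (old_tuple_len <= WAL_COMPRESS_MIN_TUPLE_SIZE ||
        old_tuple_len >= WAL_COMPRESS_MAX_TUPLE_SIZE)
        return false;

    /* 2. skip it if too large a fraction of the columns was modified */
    if (total_cols > 0 &&
        modified_cols * 100 > total_cols * WAL_COMPRESS_MAX_MODIFIED_PCT)
        return false;

    return true;
}

The modified-columns count would come from the target entry as mentioned
above, treating the tuple as fully modified when triggers have changed it.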

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


