Thread: Re: Performance Improvement by reducing WAL for Update Operation

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:

On 11 January 2013 15:57, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>>>> I've moved this to the next CF. I'm planning to review this one first.
>>> Thank you.
>> Just reviewing the patch now, making more sense with comments added.
> Making more sense, but not yet making complete sense.
> I'd like you to revisit the patch comments since some of them are
> completely unreadable.

 

I have modified most of the comments in the code.

The changes in attached patch are as below:

 

1. Introduced the term Encoded WAL Tuple (EWT) to refer to the delta-encoded tuple for the update operation.

   It could be renamed to one of the below:

   a. WAL Encoded Tuple (WET)

   b. Delta Encoded WAL Tuple (DEWT)

   c. Delta WAL Encoded Tuple (DWET)

   d. any others?

 

2. I have kept the wording related to compression in the modified docs, but I have tried not to copy parts verbatim.

    IMO this is required, as there are some changes w.r.t. LZ compression, such as for new data.

 

3. There is a small coding change, as it had been overwritten by one of my previous patches:

    the calculation of the approximate length for the encoded WAL tuple.

    Previous patch:

    if ((bp + (2 * new_tup_bitmaplen)) - bstart >= result_max)

    New patch:

    if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)

   The previous patch's calculation was valid only if we had been using the LZ format exactly.

 

 

With Regards,

Amit Kapila.

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Monday, January 21, 2013 9:32 PM Amit Kapila wrote:
On 11 January 2013 15:57, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>>>>> I've moved this to the next CF. I'm planning to review this one first.
>>>> Thank you.
>>> Just reviewing the patch now, making more sense with comments added.
>> Making more sense, but not yet making complete sense.
>> I'd like you to revisit the patch comments since some of them are
>> completely unreadable.

> I have modified most of the comments in the code.
> The changes in the attached patch are as below:
 

Rebased the patch as per HEAD.
 
With Regards,
Amit Kapila.

Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 28.01.2013 15:39, Amit Kapila wrote:
> Rebased the patch as per HEAD.

I don't like the way heap_delta_encode has intimate knowledge of how the
lz compression works. It feels like a violent punch through the
abstraction layers.

Ideally, you would just pass the old and new tuple to pglz as char *,
and pglz code would find the common parts. But I guess that's too slow,
as that's what I originally suggested and you rejected that approach.
But even if that's not possible on performance grounds, we don't need to
completely blow up the abstraction. pglz can still do the encoding - the
caller just needs to pass it the attribute boundaries to consider for
matches, so that it doesn't need to scan them byte by byte.
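Roughly, the API shape I have in mind is something like the below; this
is only a sketch, and the parameter names are illustrative rather than
exactly what the attached patch uses:

    /*
     * Compress 'source' (the new tuple data) against 'history' (the old
     * tuple data), considering only the attribute start offsets in
     * 'offsets' as match candidates, so the encoder need not scan the
     * history byte by byte.  Returns false if encoding does not pay off.
     */
    bool
    pglz_compress_with_history(const char *source, int32 slen,
                               const char *history, int32 hlen,
                               const int32 *offsets, int noffsets,
                               PGLZ_Header *dest,
                               const PGLZ_Strategy *strategy);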

I came up with the attached patch. I wrote it to demonstrate the API,
I'm not 100% sure the result after decoding is correct.

- Heikki

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tuesday, January 29, 2013 2:53 AM Heikki Linnakangas wrote:
> On 28.01.2013 15:39, Amit Kapila wrote:
> > Rebased the patch as per HEAD.
> 
> I don't like the way heap_delta_encode has intimate knowledge of how
> the lz compression works. It feels like a violent punch through the
> abstraction layers.
> 
> Ideally, you would just pass the old and new tuple to pglz as char *,
> and pglz code would find the common parts. But I guess that's too slow,
> as that's what I originally suggested and you rejected that approach.
> But even if that's not possible on performance grounds, we don't need
> to completely blow up the abstraction. pglz can still do the encoding -
> the caller just needs to pass it the attribute boundaries to consider
> for matches, so that it doesn't need to scan them byte by byte.
> 
> I came up with the attached patch. I wrote it to demonstrate the API,
> I'm not 100% sure the result after decoding is correct.

I have checked the patch code and found a few problems.

1. The history should be the old tuple; in that case the call below needs
to be changed:

    /*
    return pglz_compress_with_history((char *) oldtup->t_data, oldtup->t_len,
                                      (char *) newtup->t_data, newtup->t_len,
                                      offsets, noffsets, (PGLZ_Header *) encdata,
                                      &strategy);
    */
    return pglz_compress_with_history((char *) newtup->t_data, newtup->t_len,
                                      (char *) oldtup->t_data, oldtup->t_len,
                                      offsets, noffsets, (PGLZ_Header *) encdata,
                                      &strategy);

2. The offset array should hold the beginning offset of each column. In
that case the code below needs to be changed:

    offsets[noffsets++] = off;
    off = att_addlength_pointer(off, thisatt->attlen, tp + off);

    if (thisatt->attlen <= 0)
        slow = true;            /* can't use attcacheoff anymore */

    /* offsets[noffsets++] = off; */

Apart from this, some of the test cases are failing, which I need to check.

I have debugged the new code, and it appears to me that it will not be as
efficient as the current approach of the patch.
It needs to build a hash table for the history reference and comparison,
which can add overhead compared to the existing approach. I am collecting
the performance and WAL reduction data.

Can there be another way in which the current patch code can be made
better, so that we don't need to change the encoding approach? I have a
feeling that this approach might not be equally good performance-wise.

With Regards,
Amit Kapila.




Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 29.01.2013 11:58, Amit Kapila wrote:
> Can there be another way in which the current patch code can be made
> better, so that we don't need to change the encoding approach? I have a
> feeling that this approach might not be equally good performance-wise.

The point is that I don't want heap_delta_encode() to know the
internals of pglz compression. You could probably make my patch more
like yours in behavior by also passing an array of offsets in the new
tuple to check, and only checking for matches at those offsets.

- Heikki



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
> On 29.01.2013 11:58, Amit Kapila wrote:
> > Can there be another way in which the current patch code can be made
> > better, so that we don't need to change the encoding approach? I have
> > a feeling that this approach might not be equally good performance-wise.
>
> The point is that I don't want heap_delta_encode() to know the
> internals of pglz compression. You could probably make my patch more
> like yours in behavior by also passing an array of offsets in the new
> tuple to check, and only checking for matches at those offsets.

I think it makes sense, because if we have the offsets of both the new
and old tuple, we can internally use memcmp to compare columns and use
the same algorithm for encoding, roughly as in the sketch below.
I will change the patch according to this suggestion.
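As a standalone sketch of that idea, assuming the attribute start
offsets of both tuples are already computed (all names below are
illustrative, not the patch's):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical output routines of the encoder. */
    extern void emit_match(int32_t hist_off, int32_t len);
    extern void emit_literal(const char *data, int32_t len);

    /*
     * For each column, emit either a match against the old tuple or the
     * new tuple's bytes as literals.
     */
    static void
    delta_encode_by_columns(const char *olddata, const int32_t *oldoffs,
                            int32_t oldlen,
                            const char *newdata, const int32_t *newoffs,
                            int32_t newlen, int natts)
    {
        int         i;

        for (i = 0; i < natts; i++)
        {
            int32_t     oldsz = ((i + 1 < natts) ? oldoffs[i + 1] : oldlen) - oldoffs[i];
            int32_t     newsz = ((i + 1 < natts) ? newoffs[i + 1] : newlen) - newoffs[i];

            if (oldsz == newsz &&
                memcmp(olddata + oldoffs[i], newdata + newoffs[i], newsz) == 0)
                emit_match(oldoffs[i], newsz);              /* unchanged column */
            else
                emit_literal(newdata + newoffs[i], newsz);  /* changed column */
        }
    }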

With Regards,
Amit Kapila.





Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
> On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
> > The point is that I don't want heap_delta_encode() to know the
> > internals of pglz compression. You could probably make my patch more
> > like yours in behavior by also passing an array of offsets in the new
> > tuple to check, and only checking for matches at those offsets.
>
> I think it makes sense, because if we have the offsets of both the new
> and old tuple, we can internally use memcmp to compare columns and use
> the same algorithm for encoding.
> I will change the patch according to this suggestion.

I have modified the patch as per the above suggestion.
Apart from passing the new and old tuple offsets, I have passed the
bitmap length also, as we need to copy the bitmap of the new tuple as-is
into the Encoded WAL Tuple; a sketch of the resulting call shape follows.
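For reference, a hedged sketch of the call shape I have in mind; the
parameter names are illustrative and may differ from the attached patch:

    /*
     * Encode 'newtup' as a delta against 'oldtup'.  Both offset arrays
     * give attribute start offsets so columns can be compared with
     * memcmp(); 'new_bitmaplen' lets the new tuple's null bitmap be
     * copied verbatim into the Encoded WAL Tuple.
     */
    bool
    heap_delta_encode(HeapTuple oldtup, int32 *old_offsets, int n_old,
                      HeapTuple newtup, int32 *new_offsets, int n_new,
                      int32 new_bitmaplen, char *encdata);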

Please see if such an API design is okay?

I shall update the README and send the performance/WAL reduction data
for the modified patch tomorrow.

With Regards,
Amit Kapila.


Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:
> On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
> > I think it makes sense, because if we have the offsets of both the
> > new and old tuple, we can internally use memcmp to compare columns
> > and use the same algorithm for encoding.
> > I will change the patch according to this suggestion.
>
> I have modified the patch as per the above suggestion.
> Apart from passing the new and old tuple offsets, I have passed the
> bitmap length also, as we need to copy the bitmap of the new tuple
> as-is into the Encoded WAL Tuple.
>
> Please see if such an API design is okay?
>
> I shall update the README and send the performance/WAL reduction data
> for the modified patch tomorrow.

Updated patch including comments and README is attached with this mail.
This patch contains exactly the same design and behavior as the previous one.
It takes care of Heikki's API design suggestion.

The performance data looks similar; as it is not complete, I shall send
it tomorrow.

With Regards,
Amit Kapila.

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thursday, January 31, 2013 6:44 PM Amit Kapila wrote:
> On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:
> > I have modified the patch as per the above suggestion.
> > Apart from passing the new and old tuple offsets, I have passed the
> > bitmap length also, as we need to copy the bitmap of the new tuple
> > as-is into the Encoded WAL Tuple.
>
> Updated patch including comments and README is attached with this mail.
> This patch contains exactly the same design and behavior as the
> previous one. It takes care of Heikki's API design suggestion.
>
> The performance data looks similar; as it is not complete, I shall
> send it tomorrow.

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):

1. With the original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250-byte record pgbench there is a max WAL reduction of 35% with
not much performance difference.
3. With a 500-byte and above record size in pgbench there is an improvement
in both performance and WAL reduction.

As the record size increases there is a gain in performance, and the WAL
size is reduced as well.

Performance data for synchronous_commit = on is in progress; I shall post
it once it is done.
I am expecting it to be the same as the previous.

With Regards,
Amit Kapila.

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Friday, February 01, 2013 6:37 PM Amit Kapila wrote:
> On Thursday, January 31, 2013 6:44 PM Amit Kapila wrote:
> > Updated patch including comments and README is attached with this
> > mail. This patch contains exactly the same design and behavior as
> > the previous one. It takes care of Heikki's API design suggestion.
>
> Performance data for the patch is attached with this mail.
> Conclusions from the readings (these are the same as for my previous
> patch):
>
> 1. With the original pgbench there is a max 7% WAL reduction with not
> much performance difference.
> 2. With 250-byte record pgbench there is a max WAL reduction of 35%
> with not much performance difference.
> 3. With a 500-byte and above record size in pgbench there is an
> improvement in both performance and WAL reduction.
>
> As the record size increases there is a gain in performance, and the
> WAL size is reduced as well.
>
> Performance data for synchronous_commit = on is in progress; I shall
> post it once it is done.
> I am expecting it to be the same as the previous.

Please find the performance readings for synchronous_commit = on.

Each run is taken for 20 min.

Conclusions from the readings with synchronous_commit = on:

1. With the original pgbench there is a max 2% WAL reduction with not much
performance difference.
2. With 500-byte record pgbench there is a max WAL reduction of 3% with
not much performance difference.
3. With an 1800-byte record size in pgbench there is both an improvement
in performance (approx 3%) and a WAL reduction (44%).

As the record size increases there is a very good reduction in WAL size.

Please provide your feedback.

With Regards,
Amit Kapila.

Re: Performance Improvement by reducing WAL for Update Operation

From
Craig Ringer
Date:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
>> Performance data for the patch is attached with this mail.
>> Conclusions from the readings (these are the same as for my previous
>> patch):
>>
>> 1. With the original pgbench there is a max 7% WAL reduction with not
>> much performance difference.
>> 2. With 250-byte record pgbench there is a max WAL reduction of 35%
>> with not much performance difference.
>> 3. With a 500-byte and above record size in pgbench there is an
>> improvement in both performance and WAL reduction.
>>
>> As the record size increases there is a gain in performance, and the
>> WAL size is reduced as well.
>>
>> Performance data for synchronous_commit = on is in progress; I shall
>> post it once it is done.
>> I am expecting it to be the same as the previous.
> Please find the performance readings for synchronous_commit = on.
>
> Each run is taken for 20 min.
>
> Conclusions from the readings with synchronous_commit = on:
>
> 1. With the original pgbench there is a max 2% WAL reduction with not
> much performance difference.
> 2. With 500-byte record pgbench there is a max WAL reduction of 3%
> with not much performance difference.
> 3. With an 1800-byte record size in pgbench there is both an
> improvement in performance (approx 3%) and a WAL reduction (44%).
>
> As the record size increases there is a very good reduction in WAL size.

The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
> On 02/05/2013 11:53 PM, Amit Kapila wrote:
> > [...]
>
> The stats look fairly sane. I'm a little concerned about the apparent
> trend of falling TPS in the patched vs original tests for the 1-client
> test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
> 0.4% case made other config changes too. Nonetheless, it might be wise
> to check with really big records and see if the trend continues.

For bigger records (~2000 bytes), the data goes into TOAST, for which we
don't do this optimization.
This optimization is mainly for medium-size records.


With Regards,
Amit Kapila.




Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 04.03.2013 06:39, Amit Kapila wrote:
> On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
>> The stats look fairly sane. I'm a little concerned about the apparent
>> trend of falling TPS in the patched vs original tests for the 1-client
>> test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
>> 0.4% case made other config changes too. Nonetheless, it might be wise
>> to check with really big records and see if the trend continues.
>
> For bigger records (~2000 bytes), the data goes into TOAST, for which
> we don't do this optimization.
> This optimization is mainly for medium-size records.

I've been investigating the pglz option further, and doing performance
comparisons of the pglz approach and this patch. I'll begin with some
numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):

                 testname                 | wal_generated |     duration

-----------------------------------------+---------------+------------------
  two short fields, no change             |    1245525360 | 9.94613695144653
  two short fields, one changed           |    1245536528 |  10.146910905838
  two short fields, both changed          |    1245523160 | 11.2332470417023
  one short and one long field, no change |    1054926504 | 5.90477800369263
  ten tiny fields, all changed            |    1411774608 | 13.4536008834839
  hundred tiny fields, all changed        |     635739680 | 7.57448387145996
  hundred tiny fields, half changed       |     636930560 | 7.56888699531555
  hundred tiny fields, half nulled        |     573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:

                 testname                 | wal_generated |     duration

-----------------------------------------+---------------+------------------
  two short fields, no change             |    1249722112 | 13.0558869838715
  two short fields, one changed           |    1246145408 | 12.9947438240051
  two short fields, both changed          |    1245951056 | 13.0262880325317
  one short and one long field, no change |     678480664 | 5.70031690597534
  ten tiny fields, all changed            |    1328873920 | 20.0167419910431
  hundred tiny fields, all changed        |     638149416 | 14.4236788749695
  hundred tiny fields, half changed       |     635560504 | 14.8770561218262
  hundred tiny fields, half nulled        |     558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:

                 testname                 | wal_generated |     duration

-----------------------------------------+---------------+------------------
  two short fields, no change             |    1245519008 | 11.6702048778534
  two short fields, one changed           |    1245756904 | 11.3233819007874
  two short fields, both changed          |    1249711088 | 11.6836447715759
  one short and one long field, no change |     664741392 | 6.44810795783997
  ten tiny fields, all changed            |    1328085568 | 13.9679481983185
  hundred tiny fields, all changed        |     635974088 | 9.15514206886292
  hundred tiny fields, half changed       |     636309040 | 9.13769292831421
  hundred tiny fields, half nulled        |     496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE
is timed. Duration is the time spent in the UPDATE (lower is better),
and wal_generated is the amount of WAL generated by the updates (lower
is better).

The summary is that Amit's patch is a small win in terms of CPU usage,
in the best case where the table has few columns, with one large column
that is not updated. In all other cases it just adds overhead. In terms
of WAL size, you get a big gain in the same best case scenario.

Attached is a different version of this patch, which uses the pglz
algorithm to spot the similarities between the old and new tuple,
instead of having explicit knowledge of where the column boundaries are.
This has the advantage that it will spot similarities, and be able to
compress, in more cases. For example, you can see a reduction in WAL
size in the "hundred tiny fields, half nulled" test case above.

The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default, this
probably just isn't worth it.

The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function, it
goes further than that, and contains some further micro-optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more. One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for speed.
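To illustrate the stride idea, here is a self-contained toy, with
simplified types rather than the actual pglz macros:

    #include <stdint.h>
    #include <string.h>

    #define HIST_SIZE 4096          /* toy lookup table size */
    #define STRIDE    10            /* index only every 10th position */

    /*
     * Toy sketch: insert only every STRIDE-th input position into the
     * history lookup table.  Fewer insertions cost fewer cycles, at the
     * price of missing some match candidates (less compressibility).
     */
    static void
    build_sparse_history(const unsigned char *input, int len,
                         int16_t hist[HIST_SIZE])
    {
        int         pos;

        memset(hist, 0xFF, sizeof(int16_t) * HIST_SIZE);    /* -1 = empty */
        for (pos = 0; pos + 2 < len; pos += STRIDE)
        {
            /* simple 3-byte hash, similar in spirit to pglz's */
            uint32_t    h = (input[pos] << 6) ^ (input[pos + 1] << 3) ^ input[pos + 2];

            hist[h % HIST_SIZE] = (int16_t) pos;
        }
    }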

If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to be
brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.

PS. I haven't done much testing of WAL redo, so it's quite possible that
the encoding is actually buggy, or that decoding is slow. But I don't
think there's anything so fundamentally wrong that it would affect the
performance results much.

- Heikki

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
> I've been investigating the pglz option further, and doing performance
> comparisons of the pglz approach and this patch. I'll begin with some
> numbers:
>
> [...]

For some of the tests, it doesn't even execute the main part of the
compression/encoding.
The reason is that the length of the tuple is less than the strategy min
length, so it returns from the check below in pglz_delta_encode():

    if (strategy->match_size_good <= 0 ||
        slen < strategy->min_input_size ||
        slen > strategy->max_input_size)
        return false;

The tests for which it doesn't execute the encoding are:

two short fields, no change
two short fields, one changed
two short fields, both changed
ten tiny fields, all changed


For the above cases, the reason for the difference in timings between
both approaches and the original could be that this check is done only
after some processing. So I think if we check the length in
log_heap_update, there should not be a timing difference for the above
test scenarios. I can check that once.

This optimization helps only when the tuple length is greater than
128~200 bytes and up to 1800 bytes (until it turns to TOAST); otherwise
it could result in overhead without any major WAL reduction.
In fact, I think one of my initial patches had a check to perform the
optimization only if the tuple length is greater than 128 bytes.

I shall try to run both patches for cases where the tuple length is
greater than 128~200 bytes, as this optimization has benefits in those
cases; a minimal sketch of such a guard follows.
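A minimal sketch of such a guard; the helper name and the exact bounds
are placeholders, not the patch:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Placeholder guard that log_heap_update() could apply before
     * attempting delta encoding: skip tuples small enough that the
     * encoding overhead likely outweighs the WAL saving, and tuples
     * large enough that they would be TOASTed anyway.
     */
    static bool
    wal_update_worth_encoding(uint32_t oldlen, uint32_t newlen)
    {
        const uint32_t min_len = 128;   /* lower bound under discussion */
        const uint32_t max_len = 2000;  /* roughly where TOAST kicks in */

        return oldlen >= min_len && newlen >= min_len && newlen <= max_len;
    }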

> In each test, a table is created with a large number of identical rows,
> and fillfactor=50. Then a full-table UPDATE is performed, and the
> UPDATE is timed. Duration is the time spent in the UPDATE (lower is
> better), and wal_generated is the amount of WAL generated by the
> updates (lower is better).
> 
> The summary is that Amit's patch is a small win in terms of CPU usage,
> in the best case where the table has few columns, with one large column
> that is not updated. In all other cases it just adds overhead. In terms
> of WAL size, you get a big gain in the same best case scenario.
> 
> Attached is a different version of this patch, which uses the pglz
> algorithm to spot the similarities between the old and new tuple,
> instead of having explicit knowledge of where the column boundaries
> are.
> This has the advantage that it will spot similarities, and be able to
> compress, in more cases. For example, you can see a reduction in WAL
> size in the "hundred tiny fields, half nulled" test case above.
> 
> The attached patch also just adds overhead in most cases, but the
> overhead is much smaller in the worst case. I think that's the right
> tradeoff here - we want to avoid scenarios where performance falls off
> the cliff. That said, if you usually just get a slowdown, we certainly
> can't make this the default, and if we can't turn it on by default,
> this probably just isn't worth it.

As I mentioned, for smaller tuples it can be overhead without any major
benefit in WAL reduction, so I think before doing the encoding it should
ensure that the tuple length is greater than some threshold length.
Yes, it can miss some cases, as your test has shown for (hundred tiny
fields, half nulled), but we might be able to safely enable it by
default.

> The attached patch contains the variable-hash-size changes I posted in
> the "Optimizing pglz compressor". But in the delta encoding function,
> it goes further than that, and contains some further micro-
> optimizations:
> the hash is calculated in a rolling fashion, and it uses a specialized
> version of the pglz_hist_add macro that knows that the input can't
> exceed 4096 bytes. Those changes shaved off some cycles, but you could
> probably do more. 

> One idea is to only add every 10 bytes or so to the
> history lookup table; that would sacrifice some compressibility for
> speed.

Do you mean to say: roll the hash for 10 bytes, then call
pglz_hist_add_no_recycle, and do the same before pglz_find_match?

I shall try doing this in the tests.

> If you could squeeze pglz_delta_encode function to be cheap enough that
> we could enable this by default, this would be pretty cool patch. Or at
> least, the overhead in the cases that you get no compression needs to
> be brought down, to about 2-5 % at most I think. If it can't be done
> easily, I feel that this probably needs to be dropped.

Agreed. Though it gives a benefit for some of the cases, it should not
degrade much for any of the other cases.

One more thing: any compression technique has some overhead, so it
should be used selectively rather than in every case. In that regard, I
think we should do this optimization only when it has a better chance of
a win (for example, based on the length of the tuple, or some other
criteria; otherwise the WAL tuple can be logged as-is). What is your
opinion?

> PS. I haven't done much testing of WAL redo, so it's quite possible
> that the encoding is actually buggy, or that decoding is slow. But I
> don't think there's anything so fundamentally wrong that it would
> affect the performance results much.

I also don't think it will have any problems, but I can run some tests
to verify.

With Regards,
Amit Kapila.




Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
On 2013-03-05 23:26:59 +0200, Heikki Linnakangas wrote:
> [...]
>
> If you could squeeze pglz_delta_encode function to be cheap enough
> that we could enable this by default, this would be pretty cool patch.
> Or at least, the overhead in the cases that you get no compression
> needs to be brought down, to about 2-5 % at most I think. If it can't
> be done easily, I feel that this probably needs to be dropped.

While this is exciting stuff - and I find Heikki's approach more
interesting and applicable to more cases - I think this is clearly not
9.3 material anymore. There are loads of tradeoffs here which require a
substantial amount of benchmarking, and it's not the kind of change that
can be backed out easily during 9.3's lifecycle.

And I have to say I find 2-5% performance overhead too high...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
> I've been investigating the pglz option further, and doing performance
> comparisons of the pglz approach and this patch. I'll begin with some
> numbers:
>
> [...]
>
> In each test, a table is created with a large number of identical
> rows, and fillfactor=50. Then a full-table UPDATE is performed, and
> the UPDATE is timed. Duration is the time spent in the UPDATE (lower
> is better), and wal_generated is the amount of WAL generated by the
> updates (lower is better).

Based on your patch, I have tried some more optimizations:

Fixed a bug in your patch (pglz-with-micro-optimizations-2):
1. There were some problems in recovery due to the wrong length of the
old tuple being passed in decode, which I have corrected.

Approach-1 (pglz-with-micro-optimizations-2_roll10_32)
1. Move the strategy min length (32) check into log_heap_update.
2. Rolling 10 for the hash, as suggested by you, is added.

Approach-2 (pglz-with-micro-optimizations-2_roll10_32_1hashkey)
1. This is done on top of the Approach-1 changes.
2. Used 1 byte of data as the hash key.

Approach-3
(pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal)
1. This is done on top of the Approach-1 and Approach-2 changes.
2. Instead of copying each literal byte as it is found not to match the
history, copy them all in a batch.

Data for all the above approaches is in the attached file
"test_readings" (apart from your tests, I have added one more test,
"hundred tiny fields, first 10 changed").

Summary -
After the changes of Approach-1, the CPU utilization for all except 2
tests ("hundred tiny fields, all changed", "hundred tiny fields, half
changed") is either the same or less. The best-case CPU utilization has
decreased (which is better), but the WAL reduction is a little bit less
(which is as per expectation, due to the 10-byte consecutive rollups).

The Approach-2 modification was done to see if there is any overhead in
the hash calculation.
Approach-2 and Approach-3 don't result in any improvements.

I have investigated the reason for the CPU utilization in those 2 tests:
there is nothing to compress in the new tuple, and the algorithm only
comes to know that after it has processed 75% (the compression ratio) of
the tuple bytes.
I think any compression algorithm will have this drawback: if the data
is not compressible, it can consume time in spite of the fact that it
will not be able to compress the data.
I think most updates will update only some part of the tuple, which will
always yield positive results.

Apart from the above tests, I have run your patch against my old tests;
it yields quite positive results. The WAL reduction is more as compared
to my patch, and the CPU utilization is almost similar, or my patch is
slightly better.
The results are in the attached file "pgbench_pg_lz_mod".

All the above data is for synchronous_commit = off. I can collect the
data for synchronous_commit = on and the performance of recovery.

Any further suggestions?


With Regards,
Amit Kapila.

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Friday, March 08, 2013 9:22 PM Amit Kapila wrote:
> Based on your patch, I have tried some more optimizations:
>
> Approach-1 (pglz-with-micro-optimizations-2_roll10_32)
> 1. Move the strategy min length (32) check into log_heap_update.
> 2. Rolling 10 for the hash, as suggested by you, is added.
>
> [...]
>
> All the above data is for synchronous_commit = off. I can collect the
> data for synchronous_commit = on and the performance of recovery.

Data for synchronous_commit = on is as follows:

Find the data for Heikki's tests in the file "test_readings_on.txt".

The results and observations are the same as for synchronous_commit =
off. In short, Approach-1 as mentioned in the above mail seems to be
best.

Find the data for the pgbench-based tests used previously in
"pgbench_pg_lz_mod_sync_commit_on.htm".
This has been done for Heikki's original patch and Approach-1.
It shows that there is a very minor CPU dip (0.1%) in some cases and a
WAL reduction of 2~3%.
The WAL reduction is not much, as fewer operations are performed.

Recovery Performance
----------------------
pgbench org:

./pgbench -i -s 75 -F 80 postgres
./pgbench -c 4 -j 4 -T 600 postgres

pgbench 1800 (rec size = 1800):

./pgbench -i -s 10 -F 80 postgres
./pgbench -c 4 -j 4 -T 600 postgres

Recovery benchmark:

                 postgres org      postgres pglz optimization
                 Recovery (sec)    Recovery (sec)
pgbench org           11                11
pgbench 1800          16                11

This shows that with your patch the recovery performance is also
improved.



There is one more defect in recovery, which is fixed in the attached
patch pglz-with-micro-optimizations-3.patch.
In pglz_find_match(), the comparison was going beyond maxlen, due to
which the encoded data was not properly written to WAL.
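The essence of the fix, as a simplified sketch rather than the actual
pglz_find_match() code:

    /*
     * Simplified: extend a match only while 'len' is below 'maxlen',
     * instead of comparing bytes beyond it as the buggy code did.
     */
    static int
    match_length_bounded(const char *input, const char *hist, int maxlen)
    {
        int         len = 0;

        while (len < maxlen && input[len] == hist[len])
            len++;
        return len;
    }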

Finally, as per my work following on from your patch, the best patch
will be the one with the recovery defects fixed plus the Approach-1
changes.

With Regards,
Amit Kapila.

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wednesday, March 13, 2013 5:50 PM Amit Kapila wrote:
> On Friday, March 08, 2013 9:22 PM Amit Kapila wrote:
> > Based on your patch, I have tried some more optimizations:
> > [...]

Based on the numbers provided by Daniel for compression methods, I tried
the Snappy algorithm for the encoding; it addresses most of your
concerns, in that it should not degrade performance for the majority of
cases.

postgres original:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1232916160 | 34.0338308811188
 two short fields, one changed           |    1232909704 | 32.8722319602966
 two short fields, both changed          |    1236770128 | 35.445415019989
 one short and one long field, no change |    1053000144 | 23.2983899116516
 ten tiny fields, all changed            |    1397452584 | 40.2718069553375
 hundred tiny fields, first 10 changed   |     622082664 | 21.7642788887024
 hundred tiny fields, all changed        |     626461528 | 20.964781999588
 hundred tiny fields, half changed       |     621900472 | 21.6473519802094
 hundred tiny fields, half nulled        |     557714752 | 19.0088789463043
(9 rows)


postgres encode wal using snappy:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1232915128 | 34.6910920143127
 two short fields, one changed           |    1238902520 | 34.2287850379944
 two short fields, both changed          |    1233882056 | 35.3292708396912
 one short and one long field, no change |     733095168 | 20.3494939804077
 ten tiny fields, all changed            |    1314959744 | 38.969575881958
 hundred tiny fields, first 10 changed   |     483275136 | 19.6973309516907
 hundred tiny fields, all changed        |     481755280 | 19.7665288448334
 hundred tiny fields, half changed       |     488693616 | 19.7246761322021
 hundred tiny fields, half nulled        |     483425712 | 18.6299569606781
(9 rows)

The changes are to call snappy compress and decompress for the encoding
and decoding in the patch.
I am doing the encoding only for tuple lengths greater than 32, as for
too-small tuples it might not make much sense to encode.

On my machine, while using snappy compress/decompress, it was giving
stack corruption for the first 4 bytes, so I put in the below fix to
proceed. I am looking into the reason for it.
1. snappy_compress - increment the encoded data buffer by 4 bytes before
the encoding starts.
2. snappy_uncompress - decrement the 4-byte increment done during
compress.
3. snappy_uncompressed_length - decrement the 4-byte increment done
during compress.

For the LZ compression patch, there was a small problem in identifying
the max length, which I have corrected in the separate patch
'pglz-with-micro-optimizations-4.patch'.


In my opinion, there can be the following ways forward for this patch:
1. Use LZ compression, but provide a way for the user to avoid it for
cases where much compression is not possible. I see this as a viable
way, because most updates will update only a few columns and the rest of
the data will be the same.
2. Use the snappy APIs; does anyone know of a standard library for
snappy?
3. Provide multiple compression methods, so that depending on the usage,
the user can choose the appropriate one.

Feedback?

With Regards,
Amit Kapila.


Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
> On 04.03.2013 06:39, Amit Kapila wrote:
> > On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
> >> On 02/05/2013 11:53 PM, Amit Kapila wrote:
> >>>> Performance data for the patch is attached with this mail.
> >>>> Conclusions from the readings (these are same as my previous
> patch):
> >>>>
> The attached patch also just adds overhead in most cases, but the
> overhead is much smaller in the worst case. I think that's the right
> tradeoff here - we want to avoid scenarios where performance falls off
> the cliff. That said, if you usually just get a slowdown, we certainly
> can't make this the default, and if we can't turn it on by default,
> this probably just isn't worth it.
> 
> The attached patch contains the variable-hash-size changes I posted in
> the "Optimizing pglz compressor". But in the delta encoding function,
> it goes further than that, and contains some further micro-
> optimizations:
> the hash is calculated in a rolling fashion, and it uses a specialized
> version of the pglz_hist_add macro that knows that the input can't
> exceed 4096 bytes. Those changes shaved off some cycles, but you could
> probably do more. One idea is to only add every 10 bytes or so to the
> history lookup table; that would sacrifice some compressibility for
> speed.
> 
> If you could squeeze pglz_delta_encode function to be cheap enough that
> we could enable this by default, this would be pretty cool patch. Or at
> least, the overhead in the cases that you get no compression needs to
> be brought down, to about 2-5 % at most I think. If it can't be done
> easily, I feel that this probably needs to be dropped.

After trying some more on optimizing pglz_delta_encode(), I found that if we
use new data also in history, then the results of compression
and cpu utilization are much better. 

In addition to the pg lz micro optimization changes, the following changes
are done in the modified patch:

1. The unmatched new data is also added to the history, which can be
referenced later.
2. To incorporate this change in the LZ algorithm, 1 extra control bit is
needed to indicate whether data is from the old or the new tuple (see the
sketch below).
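
As a rough illustration of where that extra bit fits, below is a sketch of
a decode step. The tag layout and all names here are assumptions made for
the sake of illustration; the patch's actual format differs in detail.

#define ITEM_IS_MATCH   0x01    /* else: a literal byte follows */
#define MATCH_FROM_NEW  0x02    /* the extra control bit: copy source is the
                                 * already-decoded new data, not the old tuple */

/*
 * Decode one item of a delta-encoded tuple (hypothetical layout: matches
 * carry a 2-byte offset plus a 1-byte length). Returns the advanced output
 * pointer; *sp_p is advanced past the consumed input.
 */
static char *
decode_one_item(unsigned char ctrl, const unsigned char **sp_p,
                char *dp, const char *old_tup)
{
    const unsigned char *sp = *sp_p;

    if (ctrl & ITEM_IS_MATCH)
    {
        int         off = (sp[0] << 8) | sp[1];
        int         len = sp[2];
        const char *src = (ctrl & MATCH_FROM_NEW) ? dp - off : old_tup + off;

        sp += 3;
        while (len-- > 0)       /* byte-wise copy tolerates overlap */
            *dp++ = *src++;
    }
    else
        *dp++ = (char) *sp++;   /* literal byte copied as-is */

    *sp_p = sp;
    return dp;
}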

Performance Data
-----------------

Head code:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1232908016 | 36.3914430141449
 two short fields, one changed           |    1232904040 | 36.5231261253357
 two short fields, both changed          |    1235215048 | 37.7455959320068
 one short and one long field, no change |    1051394568 | 24.418487071991
 ten tiny fields, all changed            |    1395189872 | 43.2316210269928
 hundred tiny fields, first 10 changed   |     622156848 | 21.9155580997467
 hundred tiny fields, all changed        |     625962056 | 22.3296411037445
 hundred tiny fields, half changed       |     621901128 | 21.3881061077118
 hundred tiny fields, half nulled        |     557708096 | 19.4633228778839


pglz-with-micro-optimization-compress-using-newdata-1:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1235992768 | 37.3365149497986
 two short fields, one changed           |    1240979256 | 36.897796869278
 two short fields, both changed          |    1236079976 | 38.4273149967194
 one short and one long field, no change |     651010944 | 20.9490079879761
 ten tiny fields, all changed            |    1315606864 | 42.5771369934082
 hundred tiny fields, first 10 changed   |     459134432 | 17.4556930065155
 hundred tiny fields, all changed        |     456506680 | 17.8865270614624
 hundred tiny fields, half changed       |     454784456 | 18.0130441188812
 hundred tiny fields, half nulled        |     486675784 | 18.6600229740143


Observation
---------------
1. It yielded compression in more cases (refer all cases of hundred tiny
fields)
2. CPU utilization is also better.


Performance data for pgbench related scenarios is attached in document
(pgbench_lz_opt_compress_using_newdata.htm)

1. Better reduction in WAL
2. TPS increase can be observed once record size is >= 250
3. There is a small performance penalty for a single thread (0.04~3.45), but
when the penalty is 3.45 for a single thread, the TPS improvement for 8
threads is high.

Do you think this matches the conditions you have in mind for proceeding
further with this patch?


Thanks to Hari Babu for helping in implementation of this idea and taking
performance data.


With Regards,
Amit Kapila.

Re: Performance Improvement by reducing WAL for Update Operation

From
Hari Babu
Date:
On Friday, June 07, 2013 5:07 PM Amit Kapila wrote:
>On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
>> On 04.03.2013 06:39, Amit Kapila wrote:
>> > On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
>> >> On 02/05/2013 11:53 PM, Amit Kapila wrote:
>> >>>> Performance data for the patch is attached with this mail.
>> >>>> Conclusions from the readings (these are same as my previous
>> patch):
>> >>>>
>> 
>> The attached patch also just adds overhead in most cases, but the 
>> overhead is much smaller in the worst case. I think that's the right 
>> tradeoff here - we want to avoid scenarios where performance falls off 
>> the cliff. That said, if you usually just get a slowdown, we certainly 
>> can't make this the default, and if we can't turn it on by default, 
>> this probably just isn't worth it.
>> 
>> The attached patch contains the variable-hash-size changes I posted in 
>> the "Optimizing pglz compressor". But in the delta encoding function, 
>> it goes further than that, and contains some further micro-
>> optimizations:
>> the hash is calculated in a rolling fashion, and it uses a specialized 
>> version of the pglz_hist_add macro that knows that the input can't 
>> exceed 4096 bytes. Those changes shaved off some cycles, but you could 
>> probably do more. One idea is to only add every 10 bytes or so to the 
>> history lookup table; that would sacrifice some compressibility for 
>> speed.
>> 
>> If you could squeeze pglz_delta_encode function to be cheap enough 
>> that we could enable this by default, this would be pretty cool patch. 
>> Or at least, the overhead in the cases that you get no compression 
>> needs to be brought down, to about 2-5 % at most I think. If it can't 
>> be done easily, I feel that this probably needs to be dropped.

>After trying some more on optimizing pglz_delta_encode(), I found that if
>we use new data also in history, then the results of compression
>and cpu utilization are much better.

>In addition to the pg lz micro optimization changes, following changes are
>done in modified patch

>1. The unmatched new data is also added to the history which can be
>referenced later.
>2. To incorporate this change in the LZ algorithm, 1 extra control bit is
>needed to indicate if data is from old or new tuple

The patch is rebased to use the new PG LZ algorithm optimization changes
which got committed recently.

Performance Data
-----------------

Head code:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1232911016 | 35.1784930229187
 two short fields, one changed           |    1240322016 | 35.0436308383942
 two short fields, both changed          |    1235318352 | 35.4989421367645
 one short and one long field, no change |    1042332336 | 23.4457180500031
 ten tiny fields, all changed            |    1395194136 | 41.9023628234863
 hundred tiny fields, first 10 changed   |     626725984 | 21.2999589443207
 hundred tiny fields, all changed        |     621899224 | 21.6676609516144
 hundred tiny fields, half changed       |     623998272 | 21.2745981216431
 hundred tiny fields, half nulled        |     557714088 | 19.5902800559998


pglz-with-micro-optimization-compress-using-newdata-2:

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 two short fields, no change             |    1232903384 | 35.0115969181061
 two short fields, one changed           |    1232906960 | 34.3333759307861
 two short fields, both changed          |    1232903520 | 35.7665238380432
 one short and one long field, no change |     649647992 | 19.4671010971069
 ten tiny fields, all changed            |    1314957136 | 39.9727990627289
 hundred tiny fields, first 10 changed   |     458684024 | 17.8197758197784
 hundred tiny fields, all changed        |     461028464 | 17.3083391189575
 hundred tiny fields, half changed       |     456528696 | 17.1769199371338
 hundred tiny fields, half nulled        |     480548936 | 18.81720495224
 

Observation
---------------
1. It yielded compression in more cases (refer all cases of hundred tiny
fields)
2. CPU utilization is also better.


Performance data for pgbench related scenarios is attached in document
(pgbench_lz_opt_compress_using_newdata-2.htm)

1. Better reduction in WAL
2. TPS increase can be observed once record size is >= 250
3. There is a small performance penalty for a single thread (0.36~3.23), but
when the penalty is 3.23 for a single thread, the TPS improvement for 8
threads is high.

Please suggest any further proceedings on this patch.

Regards,
Hari babu.

Re: Performance Improvement by reducing WAL for Update Operation

From
Mike Blackwell
Date:
I can't comment on further direction for the patch, but since it was marked as Needs Review in the CF app I took a quick look at it.

It patches and compiles clean against the current Git HEAD, and 'make check' runs successfully.

Does it need documentation for the GUC variable 'wal_update_compression_ratio'?

__________________________________________________________________________________
Mike Blackwell | Technical Analyst, Distribution Services/Rollout Management | RR Donnelley
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
Mike.Blackwell@rrd.com
http://www.rrdonnelley.com






Re: Performance Improvement by reducing WAL for Update Operation

From
Josh Berkus
Date:
On 07/08/2013 02:21 PM, Mike Blackwell wrote:
> I can't comment on further direction for the patch, but since it was marked
> as Needs Review in the CF app I took a quick look at it.
> 
> It patches and compiles clean against the current Git HEAD, and 'make
> check' runs successfully.
> 
> Does it need documentation for the GUC variable
> 'wal_update_compression_ratio'?

Yes.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tuesday, July 09, 2013 2:52 AM Mike Blackwell wrote:

> I can't comment on further direction for the patch, but since it was marked as Needs Review in the CF app I took a quick look at it.

Thanks for looking into it.

Last time, Heikki found test scenarios where the original patch was not performing well. He also proposed a different approach for WAL encoding, sent a modified patch which has a comparatively smaller negative performance impact, and asked us to check whether the patch can reduce the performance impact for the scenarios he mentioned. After that, I found that with some modifications to his approach (using new tuple data for encoding), it eliminates the negative performance impact and gives WAL reduction in more cases.

I think the first thing to verify is whether the results posted can be validated in some other environment set up by another person. The testcase used is posted at the link below:
http://www.postgresql.org/message-id/51366323.8070606@vmware.com


> It patches and compiles clean against the current Git HEAD, and 'make check' runs successfully.

> Does it need documentation for the GUC variable 'wal_update_compression_ratio'?

This variable has been added to test the patch with different compression ratios during development testing. It has not been decided whether this variable will remain a permanent part of the patch, so currently there is no documentation for it. However, if the decision comes out that it needs to be part of the patch, documentation for it can be added.

With Regards,
Amit Kapila.




Re: Performance Improvement by reducing WAL for Update Operation

From
Mike Blackwell
Date:
The only environment I have available at the moment is a virtual box.  That's probably not going to be very helpful for performance testing. 

__________________________________________________________________________________
Mike Blackwell | Technical Analyst, Distribution Services/Rollout Management | RR Donnelley
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
Mike.Blackwell@rrd.com
http://www.rrdonnelley.com





Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
> On Wednesday, July 10, 2013 6:32 AM Mike Blackwell wrote:

> The only environment I have available at the moment is a virtual box.  That's probably not going to be very helpful for performance testing.

It's okay. Anyway, thanks for doing the basic test of the patch.

With Regards,
Amit Kapila.




Re: Performance Improvement by reducing WAL for Update Operation

From
Greg Smith
Date:
On 7/9/13 12:09 AM, Amit Kapila wrote:
>    I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person.
>    The testcase used is posted at below link:
>    http://www.postgresql.org/message-id/51366323.8070606@vmware.com

That seems easy enough to do here, Heikki's test script is excellent. 
The latest patch Hari posted on July 2 has one hunk that doesn't apply 
anymore now.  Inside src/backend/utils/adt/pg_lzcompress.c the patch 
tries to change this code:

-               if (hent)
+               if (hentno != INVALID_ENTRY)

But that line looks like this now:
                if (hent != INVALID_ENTRY_PTR)

Definitions of those:

#define INVALID_ENTRY                   0
#define INVALID_ENTRY_PTR               (&hist_entries[INVALID_ENTRY])

I'm not sure if different error handling may be needed here now due to the
commit that changed this, or if the patch wasn't referring to the right 
type of error originally.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Stephen Frost
Date:
Greg,

* Greg Smith (greg@2ndQuadrant.com) wrote:
> That seems easy enough to do here, Heikki's test script is
> excellent. The latest patch Hari posted on July 2 has one hunk that
> doesn't apply anymore now.  Inside
> src/backend/utils/adt/pg_lzcompress.c the patch tries to change this
> code:
>
> -               if (hent)
> +               if (hentno != INVALID_ENTRY)

hentno certainly doesn't make much sense here- it's only used at the top
of the function to keep things a bit cleaner when extracting the address
into hent from hist_entries:
   hentno = hstart[pglz_hist_idx(input, end, mask)];
   hent = &hist_entries[hentno];

Indeed, as the referenced conditional is inside the following loop:

while (hent != INVALID_ENTRY_PTR)

and, since hentno == 0 implies hent == INVALID_ENTRY_PTR, the
conditional would never fail (which is what was happening prior to
Heikki committing the fix for this, changing the conditional to what is
below).

> But that line looks like this now:
>
>                 if (hent != INVALID_ENTRY_PTR)

Right, this is correct- it's useful to check the new value for hent
after it's been updated by:

hent = hent->next;

and see if it's possible to drop out early.

> I'm not sure if different error handling may be needed here now due to
> the commit that changed this, or if the patch wasn't referring to
> the right type of error originally.

I've not looked at anything regarding this beyond this email, but I'm
pretty confident that the change Heikki committed was the correct one.
Thanks,
    Stephen

Re: Performance Improvement by reducing WAL for Update Operation

From
Hari Babu
Date:
On Friday, July 19, 2013 4:11 AM Greg Smith wrote:
>On 7/9/13 12:09 AM, Amit Kapila wrote:
>>    I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person.
>>    The testcase used is posted at below link:
>>    http://www.postgresql.org/message-id/51366323.8070606@vmware.com

>That seems easy enough to do here, Heikki's test script is excellent.
>The latest patch Hari posted on July 2 has one hunk that doesn't apply
>anymore now.

The HEAD code change from Heikki is correct.
During the patch rebase to the latest PG LZ optimization code, the above code change was missed.

Apart from the above change, some more changes were done in the patch; those are:

1. Corrected some comments in the code.
2. Added a validity check, as the source and history lengths combined cannot be more than or equal to 8192 (a sketch of this check follows below).
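
The second change amounts to a guard of roughly this shape (the function
and parameter names here are illustrative, not the patch's):

#include <stdbool.h>

/*
 * Illustrative form of the added validity check: give up on delta
 * encoding when the combined history and source lengths reach 8192,
 * beyond what the encoder's offsets can address.
 */
static bool
delta_encode_lengths_ok(int hlen, int slen)
{
    return hlen + slen < 8192;
}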

Thanks for the review, please find the latest patch attached in the mail.

Regards,
Hari babu.


Re: Performance Improvement by reducing WAL for Update Operation

From
Greg Smith
Date:
The v3 patch applies perfectly here now.  Attached is a spreadsheet with
test results from two platforms, a Mac laptop and a Linux server.  I
used systems with high disk speed because that seemed like a worst case
for this improvement.  The actual improvement for shrinking WAL should
be even better on a system with slower disks.

There are enough problems with the "hundred tiny fields" results that I
think this is not quite ready for commit yet.  More comments on that below.
This seems close though, close enough that I am planning to follow up
to see if the slow disk results are better.

Reviewing the wal-update-testsuite.sh test program, I think the only
case missing that would be useful to add is "ten tiny fields, one
changed".  I think that one is interesting to highlight because it's
what benchmark programs like pgbench do very often.

The GUC added by the program looks like this:

postgres=# show wal_update_compression_ratio ;
  wal_update_compression_ratio
------------------------------
  25

Is it possible to add a setting here that disables the feature altogether?
That always makes it easier to consider a commit, knowing people can
roll back the change if it makes performance worse.  That would make
performance testing easier too.  wal-update-testsuite.sh takes as long as
13 minutes; it's long enough that I'd like the easier-to-automate
comparison that GUC disabling adds.  If that's not practical to do given the
intrusiveness of the code, it's not really necessary.  I haven't looked
at the change enough to be sure how hard this is.

On the Mac, the only case that seems to have a slowdown now is "hundred
tiny fields, half nulled".  It would be nice to understand just what is
going on with that one.  I got some ugly results in "two short fields,
no change" too, along with a couple of other weird results, but I think
those were testing procedure issues that can be ignored.  The pgbench
throttle work I did recently highlights that I can't really make a Mac
quiet/consistent for benchmarking very well.  Note that I ran all of the
Mac tests with assertions on, to try and catch platform specific bugs.
The Linux ones used the default build parameters.

On Linux "hundred tiny fields, half nulled" was also by far the worst
performing one, with a >30% increase in duration despite the 14% drop in
WAL.  Exactly what's going on there really needs to be investigated
before this seems safe to commit.  All of the "hundred tiny fields"
cases seem pretty bad on Linux, with the rest of them running about a
11% duration increase.

This doesn't seem ready to commit for this CF, but the number of problem
cases is getting pretty small now.  Now that I've gotten more familiar
with the test programs and the feature, I can run more performance tests
on this at any time really.  If updates addressing the trouble cases are
ready from Amit or Hari before the next CF, send them out and I can look
at them without waiting until that one starts.  This is a very promising
looking performance feature.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
On 2013-07-19 10:40:01 +0530, Hari Babu wrote:
> 
> On Friday, July 19, 2013 4:11 AM Greg Smith wrote:
> >On 7/9/13 12:09 AM, Amit Kapila wrote:
> >>    I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person.
> >>    The testcase used is posted at below link:
> >>    http://www.postgresql.org/message-id/51366323.8070606@vmware.com
> 
> >That seems easy enough to do here, Heikki's test script is excellent. 
> >The latest patch Hari posted on July 2 has one hunk that doesn't apply 
> >anymore now.
> 
> The Head code change from Heikki is correct.
> During the patch rebase to latest PG LZ optimization code, the above code change is missed.
> 
> Apart from the above changed some more changes are done in the patch, those are. 

FWIW I don't like this approach very much:

* I'd be very surprised if this doesn't make WAL replay of update heavy workloads slower by at least factor of 2.

* It makes data recovery from WAL *noticeably* harder since data
  corruption now is carried forwards and you need the old data to
  decode new data

* It makes changeset extraction either more expensive or it would have to be disabled there.

I think my primary issue is that philosophically/architecturally I am of
the opinion that a WAL record should make sense on its own without
depending on heap data. And this patch loses that.

Greetings,

Andres

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance Improvement by reducing WAL for Update Operation

From
Greg Smith
Date:
On 7/22/13 2:57 PM, Andres Freund wrote:
> * I'd be very surprised if this doesn't make WAL replay of update heavy
>    workloads slower by at least factor of 2.

I was thinking about what a benchmark of WAL replay would look like last 
year.  I don't think that data is captured very well yet, and it should be.

My idea was to break the benchmark into two pieces.  One would take a 
base backup, then run a series of tests and archive the resulting
WAL.  I doubt you can make a useful benchmark here without a usefully
populated database; that's why the base backup step is needed.

The first useful result then is to measure how long commit/archiving 
took and the WAL volume, which is what's done by the test harness for 
this program.  Then the resulting backup would be set up for replay.
Tarring up the backup and WAL archive could even give you a repeatable
test set for runs where only replay changes are happening.  Then the
main number that's useful, total replay time, would be measured.

The main thing I wanted this for wasn't for code changes; it was to 
benchmark configuration changes.  I'd like to be able to answer 
questions like "which I/O scheduler is best for a standby" in a way that 
has real test data behind it.  The same approach should useful for 
answering your concerns about the replay performance impact of this 
change too though.

> * It makes changeset extraction either more expensive or it would have
>    to be disabled there.

That argues that, if this is committed at all, the ability to turn it off
that I was asking about would be necessary.  It sounds like this *could* work like
how minimal WAL archiving levels allow optimizations that are disabled 
at higher ones--like the COPY into a truncated/new table cheat.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tuesday, July 23, 2013 12:02 AM Greg Smith wrote:
> The v3 patch applies perfectly here now.  Attached is a spreadsheet
> with test results from two platforms, a Mac laptop and a Linux server.
> I used systems with high disk speed because that seemed like a worst
> case for this improvement.  The actual improvement for shrinking WAL
> should be even better on a system with slower disks.

You are absolutely right.
To mimic it on our system, we configured a RAMFS for the database, and it shows similar results.
> There are enough problems with the "hundred tiny fields" results that I
> think this not quite ready for commit yet.  More comments on that
> below.
>   This seems close though, close enough that I am planning to follow up
> to see if the slow disk results are better.

Thanks for going the extra mile to try for slower disks.

> Reviewing the wal-update-testsuite.sh test program, I think the only
> case missing that would be useful to add is "ten tiny fields, one
> changed".  I think that one is interesting to highlight because it's
> what benchmark programs like pgbench do very often.
> The GUC added by the program looks like this:
>
> postgres=# show wal_update_compression_ratio ;
>   wal_update_compression_ratio
> ------------------------------
>   25
>
> Is possible to add a setting here that disables the feature altogether?

Yes, it can be done in one of the 2 ways below:
1. Provide a new configuration parameter (wal_update_compression) to turn this feature on/off (a sketch of this option follows below).
2. Give the user an option to configure it at the table level.
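
For illustration, option 1 could take roughly the shape of an ordinary
boolean GUC; the variable name, group, and description below are
assumptions, not settled choices:

/* hypothetical GUC variable, off by default */
bool        wal_update_compression = false;

/* a matching entry for the ConfigureNamesBool array in guc.c */
{
    {"wal_update_compression", PGC_USERSET, WAL_SETTINGS,
        gettext_noop("Enables delta encoding of the new tuple in UPDATE WAL records."),
        NULL
    },
    &wal_update_compression,
    false,
    NULL, NULL, NULL
},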

>   That always makes it easier to consider a commit, knowing people can
> roll back the change if it makes performance worse.  That would make
> performance testing easier too.  wal-update-testsuit.sh takes as long
> as
> 13 minutes, it's long enough that I'd like the easier to automate
> comparison GUC disabling adds.  If that's not practical to do given the
> intrusiveness of the code, it's not really necessary.  I haven't looked
> at the change enough to be sure how hard this is.
>
> On the Mac, the only case that seems to have a slowdown now is "hundred
> tiny fields, half nulled".  It would be nice to understand just what is
> going on with that one.  I got some ugly results in "two short fields,
> no change" too, along with a couple of other weird results, but I think
> those were testing procedure issues that can be ignored.  The pgbench
> throttle work I did recently highlights that I can't really make a Mac
> quiet/consistent for benchmarking very well.  Note that I ran all of
> the Mac tests with assertions on, to try and catch platform specific
> bugs.
> The Linux ones used the default build parameters.
>
> On Linux "hundred tiny fields, half nulled" was also by far the worst
> performing one, with a >30% increase in duration despite the 14% drop
> in WAL.  Exactly what's going on there really needs to be investigated
> before this seems safe to commit.  All of the "hundred tiny fields"
> cases seem pretty bad on Linux, with the rest of them running about a
> 11% duration increase.

The main benefit of this patch is to reduce WAL to improve I/O time,
but I think for faster I/O systems, the calculation to reduce WAL is an overhead.
I will check how to optimize that calculation, but I think this feature should be
considered with a configuration knob, as it can improve many cases.

With Regards,
Amit Kapila.




Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tuesday, July 23, 2013 12:27 AM Andres Freund wrote:
> On 2013-07-19 10:40:01 +0530, Hari Babu wrote:
> >
> > On Friday, July 19, 2013 4:11 AM Greg Smith wrote:
> > >On 7/9/13 12:09 AM, Amit Kapila wrote:
> > >>    I think the first thing to verify is whether the results posted
> can be validated in some other environment setup by another person.
> > >>    The testcase used is posted at below link:
> > >>    http://www.postgresql.org/message-
> id/51366323.8070606@vmware.com
> >
> > >That seems easy enough to do here, Heikki's test script is
> excellent.
> > >The latest patch Hari posted on July 2 has one hunk that doesn't
> apply
> > >anymore now.
> >
> > The Head code change from Heikki is correct.
> > During the patch rebase to latest PG LZ optimization code, the above
> code change is missed.
> >
> > Apart from the above changed some more changes are done in the patch,
> those are.
> 
> FWIW I don't like this approach very much:
> 
> * I'd be very surprised if this doesn't make WAL replay of update heavy
>   workloads slower by at least factor of 2.
   Yes, if you just consider the cost of replay, but it involves other
operations as well, like (for the standby case) transfer of WAL, write of
WAL, read from WAL and then apply.
   Among them, most operations will benefit from the reduced WAL size,
except apply, where you need to decode.

> * It makes data recovery from WAL *noticeably* harder since data
>   corruption now is carried forwards and you need the old data to
>   decode new data

   This is one of the reasons why this optimization is done only when the
new row goes on the same page.

> * It makes changeset extraction either more expensive or it would have
>   to be disabled there.

   I think, if there is any such implication, we can probably have the
option to disable it.

> I think my primary issue is that philosophically/architecturally I am
> of
> the opinion that a wal record should make sense of it's own without
> depending on heap data. And this patch looses that.

Is the main worry about corruption getting propagated?

With Regards,
Amit Kapila.




Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
On 2013-07-23 18:59:11 +0530, Amit Kapila wrote:
> > * I'd be very surprised if this doesn't make WAL replay of update heavy
> >   workloads slower by at least factor of 2.
> 
>     Yes, if you just consider the cost of replay, but it involves other
> operations as well
>     like for standby case transfer of WAL, Write of WAL, Read from WAL and
> then apply.
>     So among them most operation's will be benefited from reduced WAL size,
> except apply where you need to decode.

I still think it's rather unlikely that they offset those. I've seen wal
replay be a major bottleneck more than once...

> > * It makes data recovery from WAL *noticeably* harder since data
> >   corruption now is carried forwards and you need the old data to
> > decode
> >   new data
>   
>    This is one of the reasons why this optimization is done only when the
> new row goes in same page.

That doesn't help all that much. It somewhat eases recovering data if
full_page_writes are on, but it's really hard to stitch together all
changes if the corruption occurred within a 1h-long checkpoint...

> > * It makes changeset extraction either more expensive or it would have
> >   to be disabled there.

>     I think, if there is any such implication, we can probably have the
> option of disable it

That can just be done on wal_level = logical; that's not the
problem. It's certainly not without precedent that we have wal_level
dependent optimizations.
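
As a rough illustration, such wal_level-dependent gating could be as simple
as the sketch below; the enum and names are hypothetical, not the server's
actual definitions:

#include <stdbool.h>

typedef enum
{
    WAL_LEVEL_SKETCH_MINIMAL,
    WAL_LEVEL_SKETCH_ARCHIVE,
    WAL_LEVEL_SKETCH_HOT_STANDBY,
    WAL_LEVEL_SKETCH_LOGICAL
} WalLevelSketch;

/* delta encoding off once logical decoding needs self-contained records */
static bool
delta_encoding_allowed(WalLevelSketch wal_level)
{
    return wal_level < WAL_LEVEL_SKETCH_LOGICAL;
}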

> > I think my primary issue is that philosophically/architecturally I am
> > of
> > the opinion that a wal record should make sense of it's own without
> > depending on heap data. And this patch looses that.
> 
> Is the main worry about corruption getting propagated?

Not really. It "feels" wrong to me architecturally. That's subjective, I
know.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tuesday, July 23, 2013 7:06 PM Andres Freund wrote:
> On 2013-07-23 18:59:11 +0530, Amit Kapila wrote:
> > > * I'd be very surprised if this doesn't make WAL replay of update
> heavy
> > >   workloads slower by at least factor of 2.
> >
> >     Yes, if you just consider the cost of replay, but it involves
> other
> > operations as well
> >     like for standby case transfer of WAL, Write of WAL, Read from
> WAL and
> > then apply.
> >     So among them most operation's will be benefited from reduced WAL
> size,
> > except apply where you need to decode.
> 
> I still think it's rather unlikely that they offset those. I've seen
> wal
> replay be a major bottleneck more than once...
>
> > > * It makes data recovery from WAL *noticeably* harder since data
> > >   corruption now is carried forwards and you need the old data to
> > > decode
> > >   new data
> >
> >    This is one of the reasons why this optimization is done only when
> the
> > new row goes in same page.
> 
> That doesn't help all that much. It somewhat eases recovering data if
> full_page_writes are on, but it's realy hard to stitch together all
> changes if the corruption occured within a 1h long checkpoint...
I think once a record is corrupted on a page, the user has to reconstruct
that page; it might be difficult to reconstruct just one record. Also, this
optimization will not carry any corruption forward from one page to another.
> > > * It makes changeset extraction either more expensive or it would
> have
> > >   to be disabled there.
> 
> >     I think, if there is any such implication, we can probably have
> the
> > option of disable it
> 
> That can just be done on wal_level = logical, that's not the
> problem. It's certainly not with precedence that we have wal_level
> dependent optimizations.


With Regards,
Amit Kapila.




Re: Performance Improvement by reducing WAL for Update Operation

From
Haribabu kommi
Date:
On 23 July 2013 17:35 Amit Kapila wrote:
>On Tuesday, July 23, 2013 12:02 AM Greg Smith wrote:
>> The v3 patch applies perfectly here now.  Attached is a spreadsheet 
>> with test results from two platforms, a Mac laptop and a Linux server.
>> I used systems with high disk speed because that seemed like a worst 
>> case for this improvement.  The actual improvement for shrinking WAL 
>> should be even better on a system with slower disks.

>You are absolutely right. 
>To mimic it on our system, by configuring RAMFS for database, it shows similar results.
 
>> Is possible to add a setting here that disables the feature altogether?

>Yes, it can be done in below 2 ways:
>1. Provide a new configuration parameter (wal_update_compression) to turn on/off this feature.
>2. At table level user can be given option to configure

>The main benefit of this patch is to reduce WAL for improving time in I/O,
>but I think for faster I/O systems, the calculation to reduce WAL has
>overhead.

>I will check how to optimize that calculation, but I think this feature
>should be considered with a configuration knob as it can improve many cases.
 

I tried to improve the performance of this feature on faster I/O systems,
where the calculation to reduce the WAL is an overhead, but with no success.

This optimization is still beneficial on systems where the I/O is a
bottleneck. To support those use cases I have added a configuration
parameter "wal_update_optimization", which is off by default. The user can
enable or disable this optimization for update operations as needed. During
replay, a WAL record can easily be identified as an encoded WAL tuple or not
by checking its flags (a sketch of that check follows below).
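
A minimal sketch of such a replay-side check; the flag bit, struct, and
names are assumptions for illustration, not the patch's actual identifiers:

#include <stdbool.h>
#include <stdint.h>

#define XLH_UPDATE_TUPLE_ENCODED  0x80  /* assumed flag bit for encoded tuples */

/* assumed minimal slice of the update WAL record */
typedef struct xl_heap_update_sketch
{
    uint8_t     flags;
    /* ... rest of the record ... */
} xl_heap_update_sketch;

/* replay branches on this: pglz delta decode vs. plain tuple copy */
static bool
tuple_is_delta_encoded(const xl_heap_update_sketch *xlrec)
{
    return (xlrec->flags & XLH_UPDATE_TUPLE_ENCODED) != 0;
}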

Please let me know your suggestions on the same.

Regards,
Hari babu.

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Mon, Jul 22, 2013 at 2:31 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On the Mac, the only case that seems to have a slowdown now is "hundred tiny
> fields, half nulled".  It would be nice to understand just what is going on
> with that one.  I got some ugly results in "two short fields, no change"
> too, along with a couple of other weird results, but I think those were
> testing procedure issues that can be ignored.  The pgbench throttle work I
> did recently highlights that I can't really make a Mac quiet/consistent for
> benchmarking very well.  Note that I ran all of the Mac tests with
> assertions on, to try and catch platform specific bugs. The Linux ones used
> the default build parameters.

Amit has been asking me to look at this patch for a while, so I
finally did.  While I agree that it would be nice to get the CPU
overhead down low enough that we can turn this on by default and
forget about it, I'm not convinced that it's without value even if we
can't.  Fundamentally, this patch trades away some CPU in exchanged
for decrease I/O.  The testing thus far is all about whether the CPU
overhead can be made trivial, which is a good question to ask, because
if the answer is yes, then rather than trading something for something
else, we just get something for free.  Win!  But even if that doesn't
pan out, I think the fallback position should not be "OK, well, if we
can't get decreased I/O for free then forget it" but rather "OK, if we
can't get decreased I/O for free then let's get decreased I/O in
exchange for increased CPU usage".

I spent a little time running the tests from Heikki's script under
perf.  On all three "two short fields" tests and also on the "ten tiny
fields, all changed" test, we spend about 1% of the CPU time in
pglz_delta_encode.  I don't see any evidence that it's actually
compressing anything at all; it appears to be falling out where we
test the input length against the strategy, presumably because the
default strategy (which we largely copy here) doesn't try to compress
input data of less than 32 bytes.   Given that this code isn't
actually compressing anything in these cases, I'm a bit confused by
Greg's report of substantial gains on "ten tiny fields, all changed"
test; how can we win if we're not compressing?

I studied the "hundred tiny fields, half nulled" test case in some
detail.  Some thoughts:

- There is a comment "TODO: It would be nice to behave like the
history and the source strings were concatenated, so that you could
compress using the new data, too."  If we're not already doing that,
then how are we managing to compress WAL by more than one-quarter in
the "hundred tiny fields, all changed" case?  It looks to me like the
patch IS doing that, and I'm not sure it's a good idea, especially
because it's using pglz_hist_add_no_recycle rather than pglz_hist_add:
we verify that hlen + slen < 2 * PGLZ_HISTORY_SIZE but that doesn't
seem good enough. On the "hundred tiny fields, half nulled" test case,
removing that line reduces compression somewhat but also saves on CPU
cycles.

- pglz_find_match() is happy to walk all the way down even a really,
really long bucket chain.  It has some code that reduces good_match
each time through, but it fails to produce a non-zero decrement once
good_match * good_drop < 100.  So if we're searching an enormously
deep bucket many times in a row, and there are actually no matches,
we'll go flying down the whole linked list every time.  I tried
mandating a minimum decrease of 1 and didn't observe that it made any
difference in the run time of this test case, but it still seems odd.
For the record, it's not new behavior with this patch; pglz_compress()
has the same issue as things stand today.  I wonder if we ought to
decrease the good match length by a constant rather than a percentage
at each step.
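
For reference, here is a sketch of the decrement in question, modeled on
the shape it has in pglz_find_match() but simplified; the minimum-decrement
tweak is the one described above:

/*
 * With pure integer math, (good_match * good_drop) / 100 becomes 0 once
 * good_match * good_drop falls below 100, so good_match stops shrinking;
 * forcing a minimum decrement keeps the loop getting less patient.
 */
static int
relax_good_match(int good_match, int good_drop)
{
    int         dec = (good_match * good_drop) / 100;

    if (dec < 1)
        dec = 1;            /* the proposed minimum decrement */

    return good_match - dec;
}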

- pglz_delta_encode() seems to spend about 50% of its CPU time loading
up the history data.  It would be nice to find a way to reduce that
effort.  I thought about maybe only making a history entry for, say,
every fourth input position rather than every one, but a quick test
seems to suggest that's a big fail, probably because it's too easy to
skip over the position where we would have made "the right" match via
some short match.  But maybe there's some more sophisticated strategy
here that would work better.  For example, see:

http://en.wikipedia.org/wiki/Rabin_fingerprint

The basic idea is that you use a rolling hash function to divide up
the history data into chunks of a given average size.  So we scan the
history data, compute a rolling hash value at each point, and each
time the bottom N bits are zero, we consider that to be the end of a
chunk.  We enter all the chunks into a hash table.  The chunk size
will vary, but on the average, given a reasonably well-behaved rolling
hash function (the pglz one probably doesn't qualify) it'll happen
every 2^N bytes, so perhaps for this purpose we'd choose N to be
between 3 and 5.  Then, we scan the input we want to compress and
divide it into chunks in the same way.  Chunks that don't exist in the
history data get copied to the output, while those that do get
replaced with a reference to their position in the history data.
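
To make the chunking idea concrete, here is a minimal sketch in C. The
window size, multiplier, and boundary mask are illustrative choices, and a
production version would want a true Rabin fingerprint rather than this
simple polynomial rolling hash:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROLL_WINDOW   8     /* bytes contributing to the rolling hash */
#define BOUNDARY_BITS 4     /* cut when low 4 bits are zero: ~16-byte chunks */

/*
 * Print content-defined chunk boundaries for a buffer. A real
 * implementation would enter the chunks into a hash table instead.
 */
static void
emit_chunks(const unsigned char *data, size_t len)
{
    uint32_t    hash = 0;
    uint32_t    outpow = 1;     /* 31^ROLL_WINDOW, to drop the oldest byte */
    size_t      chunk_start = 0;

    for (int i = 0; i < ROLL_WINDOW; i++)
        outpow *= 31;

    for (size_t i = 0; i < len; i++)
    {
        hash = hash * 31 + data[i];                  /* slide new byte in */
        if (i >= ROLL_WINDOW)
            hash -= outpow * data[i - ROLL_WINDOW];  /* slide old byte out */

        /* a boundary is wherever the low bits happen to be zero */
        if ((hash & ((1u << BOUNDARY_BITS) - 1)) == 0 || i == len - 1)
        {
            printf("chunk [%zu..%zu]\n", chunk_start, i);
            chunk_start = i + 1;
        }
    }
}

int
main(void)
{
    const char *s = "hundred tiny fields, half nulled";

    emit_chunks((const unsigned char *) s, strlen(s));
    return 0;
}

Because boundaries depend only on nearby content, an insertion near the
front of the new tuple shifts chunk positions but not chunk contents, so
later chunks still hit the history table.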

I'm not 100% certain that something like this would be better than
trying to leverage the pglz machinery, but I think it might be worth
trying.  One possible advantage is that you make many fewer hash-table
entries, which reduces both the cost of setting up the hash table and
the cost of probing it; another is that if you find a hit in the hash
table, you needn't search any further: you are done; this is related
to the point that, for the most part, the processing is
chunk-at-a-time rather than character-at-a-time, which might be more
efficient.  On the other hand, the compression ratio might stink, or
it might suck for some other reason: I just don't know.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tue, Nov 26, 2013 at 8:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 22, 2013 at 2:31 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>
> I spent a little time running the tests from Heikki's script under
> perf.  On all three "two short fields" tests and also on the "ten tiny
> fields, all changed" test, we spend about 1% of the CPU time in
> pglz_delta_encode.  I don't see any evidence that it's actually
> compressing anything at all; it appears to be falling out where we
> test the input length against the strategy, presumably because the
> default strategy (which we largely copy here) doesn't try to compress
> input data of less than 32 bytes.   Given that this code isn't
> actually compressing anything in these cases, I'm a bit confused by
> Greg's report of substantial gains on "ten tiny fields, all changed"
> test; how can we win if we're not compressing?

I think it is mainly due to variation of test data on the Mac; if we see
the results on Linux, there is not much difference in results.
Also, the results posted by Heikki and by me at the links below don't show
such inconsistency for the "ten tiny fields, all changed" case.

http://www.postgresql.org/message-id/51366323.8070606@vmware.com
http://www.postgresql.org/message-id/001f01ce1c14$d3af0770$7b0d1650$@kapila@huawei.com
(Refer test_readings.txt)


> I studied the "hundred tiny fields, half nulled" test case in some
> detail.  Some thoughts:
>
> - There is a comment "TODO: It would be nice to behave like the
> history and the source strings were concatenated, so that you could
> compress using the new data, too."  If we're not already doing that,
> then how are we managing to compress WAL by more than one-quarter in
> the "hundred tiny fields, all changed" case?

The algorithm is not doing concatenation of history and source strings;
the hash table is formed just with history data, and then later,
if a match is not found, the new data is added to the history, so this can
be the reason for the above result.

> It looks to me like the
> patch IS doing that, and I'm not sure it's a good idea, especially
> because it's using pglz_hist_add_no_recycle rather than pglz_add_hist:
> we verify that hlen + slen < 2 * PGLZ_HISTORY_SIZE but that doesn't
> seem good enough. On the "hundred tiny fields, half nulled" test case,
> removing that line reduces compression somewhat but also saves on CPU
> cycles.

> - pglz_find_match() is happy to walk all the way down even a really,
> really long bucket chain.  It has some code that reduces good_match
> each time through, but it fails to produce a non-zero decrement once
> good_match * good_drop < 100.  So if we're searching an enormously
> deep bucket many times in a row, and there are actually no matches,
> we'll go flying down the whole linked list every time.  I tried
> mandating a minimum decrease of 1 and didn't observe that it made any
> difference in the run time of this test case, but it still seems odd.
> For the record, it's not new behavior with this patch; pglz_compress()
> has the same issue as things stand today.  I wonder if we ought to
> decrease the good match length by a constant rather than a percentage
> at each step.
>
> - pglz_delta_encode() seems to spend about 50% of its CPU time loading
> up the history data.  It would be nice to find a way to reduce that
> effort.  I thought about maybe only making a history entry for, say,
> every fourth input position rather than every one, but a quick test
> seems to suggest that's a big fail, probably because it's too easy to
> skip over the position where we would have made "the right" match via
> some short match.  But maybe there's some more sophisticated strategy
> here that would work better.  For example, see:
>
> http://en.wikipedia.org/wiki/Rabin_fingerprint
>
> The basic idea is that you use a rolling hash function to divide up
> the history data into chunks of a given average size.  So we scan the
> history data, compute a rolling hash value at each point, and each
> time the bottom N bits are zero, we consider that to be the end of a
> chunk.  We enter all the chunks into a hash table.  The chunk size
> will vary, but on the average, given a reasonably well-behaved rolling
> hash function (the pglz one probably doesn't qualify) it'll happen
> every 2^N bytes, so perhaps for this purpose we'd choose N to be
> between 3 and 5.  Then, we scan the input we want to compress and
> divide it into chunks in the same way.  Chunks that don't exist in the
> history data get copied to the output, while those that do get
> replaced with a reference to their position in the history data.
>
> I'm not 100% certain that something like this would be better than
> trying to leverage the pglz machinery, but I think it might be worth
> trying.  One possible advantage is that you make many fewer hash-table
> entries, which reduces both the cost of setting up the hash table and
> the cost of probing it; another is that if you find a hit in the hash
> table, you needn't search any further: you are done; this is related
> to the point that, for the most part, the processing is
> chunk-at-a-time rather than character-at-a-time, which might be more
> efficient.  On the other hand, the compression ratio might stink, or
> it might suck for some other reason: I just don't know.

I think this idea looks better than the current one and it will definitely
improve some of the cases, but I am not sure we can win in all cases.
We have tried a similar idea (reducing the size of the hash and
eventually the comparisons) by adding only every 10th byte to the history
lookup table rather than every byte. It improved most cases but not
all ("hundred tiny fields, all changed" and "hundred tiny fields, half
changed" tests were still slow).
The patch and results are at the link below (refer to approach-1):
http://www.postgresql.org/message-id/001f01ce1c14$d3af0770$7b0d1650$@kapila@huawei.com

Now the tough question is what the possible options for this patch are,
and which one to pick:
a. optimize the encoding technique, so that it can improve results in most
cases even if not all.
b. have a table-level option or GUC to enable/disable WAL compression.
c. use some heuristics to check whether the chances of compression are good,
and only then perform compression (a sketch of these heuristics follows
below):
    1. apply this optimization for tuple size > 128 and < 2000
    2. apply this optimization if the number of modified columns is less
than 25% (some threshold number) of the total columns.
       I think we can get the modified columns from the target entry and
use that if triggers haven't changed the tuple. I remember there were
concerns earlier that this value can't be trusted completely, but I think
using it as a heuristic is not a problem, even if the number is not right
in some cases.
d. forget about this optimization and reject the patch.

I think doing options 'b' and 'c' together can make this optimization
usable in cases where it is actually useful.
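
A minimal sketch of how such a heuristic gate could look (the function and
parameter names are hypothetical, not from the patch):

#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical heuristic gate for option 'c' above. The thresholds mirror
 * the ones listed; nmodified would come from the target list when no
 * trigger has changed the tuple.
 */
static bool
wal_update_compression_worthwhile(size_t old_tup_len, int natts, int nmodified)
{
    /* Heuristic 1: only mid-sized tuples are likely to pay off. */
    if (old_tup_len <= 128 || old_tup_len >= 2000)
        return false;

    /* Heuristic 2: skip when more than 25% of the columns changed. */
    if (nmodified * 4 > natts)
        return false;

    return true;
}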

Opinions/Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Wed, Nov 27, 2013 at 12:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> - There is a comment "TODO: It would be nice to behave like the
>> history and the source strings were concatenated, so that you could
>> compress using the new data, too."  If we're not already doing that,
>> then how are we managing to compress WAL by more than one-quarter in
>> the "hundred tiny fields, all changed" case?
>
> Algorithm is not doing concatenation of history and source strings,
> the hash table is formed just with history data and then later
> if match is not found then it is added to history, so this can be the
> reason for the above result.

From the compressor's point of view, that's pretty much equivalent to
behaving as if those strings were concatenated.

The point is that there's a difference between using the old tuple's
history entries to compress the new tuple and using the new tuple's
own history to compress it.  The former is delta-compression, which is
what we're supposedly doing here.  The latter is just plain
compression.  That doesn't *necessarily* make it a bad idea, but they
are clearly two different things.

>> The basic idea is that you use a rolling hash function to divide up
>> the history data into chunks of a given average size.  So we scan the
>> history data, compute a rolling hash value at each point, and each
>> time the bottom N bits are zero, we consider that to be the end of a
>> chunk.  We enter all the chunks into a hash table.  The chunk size
>> will vary, but on the average, given a reasonably well-behaved rolling
>> hash function (the pglz one probably doesn't qualify) it'll happen
>> every 2^N bytes, so perhaps for this purpose we'd choose N to be
>> between 3 and 5.  Then, we scan the input we want to compress and
>> divide it into chunks in the same way.  Chunks that don't exist in the
>> history data get copied to the output, while those that do get
>> replaced with a reference to their position in the history data.
>
> I think this idea looks better than the current one and it will definitely
> improve some of the cases, but not sure we can win in all cases.
> We have tried a similar idea (reduce size of hash and
> eventually comparison) by adding every 10 bytes to the history
> lookup table rather than every byte. It improved most cases but not
> all cases ("hundred tiny fields, all changed",
>  "hundred tiny fields, half changed" tests were still slow).
> Patch and results are at link (refer approach-1):
> http://www.postgresql.org/message-id/001f01ce1c14$d3af0770$7b0d1650$@kapila@huawei.com

What you did there will, I think, tend to miss a lot of compression
opportunities.  Suppose for example that the old tuple is
ABCDEFGHIJKLMNOP and the new tuple is xABCDEFGHIJKLMNOP.  After
copying one literal byte we'll proceed to copy 9 more, missing the
fact that there was a long match available after the first byte.  The
advantage of the fingerprinting technique is that it's supposed to be
resistant to that sort of thing.
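
To make the fingerprinting idea concrete, here is a minimal, self-contained
sketch of such a chunker; the constants and names are invented for
illustration, and the hash is a simple polynomial one rather than anything
tuned:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define WINDOW     4                /* bytes covered by the rolling hash */
#define MASK_BITS  4                /* average chunk ~ 2^4 = 16 bytes */
#define MASK       ((1u << MASK_BITS) - 1)
#define PRIME      11u
#define PRIME3     (11u * 11u * 11u)    /* PRIME^3, to drop the oldest byte */

/*
 * Record chunk boundaries for data/len; boundaries[i] is the offset one
 * past the end of chunk i.  Returns the number of chunks.
 */
static int
chunkify(const unsigned char *data, size_t len, size_t *boundaries)
{
    uint32_t    hash = 0;
    size_t      chunk_start = 0;
    size_t      i;
    int         nchunks = 0;

    if (len < WINDOW)
    {
        if (len > 0)
            boundaries[nchunks++] = len;
        return nchunks;
    }

    for (i = 0; i < WINDOW; i++)    /* prime the window */
        hash = hash * PRIME + data[i];

    for (i = WINDOW - 1; i < len; i++)
    {
        /* bottom MASK_BITS bits zero: cut the chunk after byte i */
        if ((hash & MASK) == 0)
            boundaries[nchunks++] = chunk_start = i + 1;

        if (i + 1 < len)            /* roll the window forward one byte */
            hash = (hash - data[i - (WINDOW - 1)] * PRIME3) * PRIME +
                data[i + 1];
    }
    if (chunk_start < len)          /* trailing partial chunk */
        boundaries[nchunks++] = len;
    return nchunks;
}

int
main(void)
{
    const char *s = "aaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccc";
    size_t      bounds[64];
    int         n = chunkify((const unsigned char *) s, strlen(s), bounds);
    int         i;

    for (i = 0; i < n; i++)
        printf("chunk %d ends at offset %zu\n", i, bounds[i]);
    return 0;
}

Because the cut decision depends only on the bytes inside the rolling
window, inserting a byte near the start of the tuple shifts the chunk
boundaries only locally, which is the shift-resistance I'm after.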

> Now the tough question is what are the possible options for this patch
> and which one to pick:
> a. optimize the encoding technique, so that it can improve results in most
> cases even if not all.
> b. have a table-level option or GUC to enable/disable WAL compression.
> c. use some heuristics to check if the chances of compression are good,
> then only perform compression.
>     1. apply this optimization for tuple size > 128 and < 2000
>     2. apply this optimization if the number of modified columns is less
> than 25% (some threshold number) of total columns.
>         I think we can get modified columns from the target entry and use
> it if triggers haven't changed that tuple. I remember
>         earlier there were concerns that this value can't be trusted
> completely, but I think using it as a heuristic is not a
>         problem, even if this number is not right in some cases.
> d. forget about this optimization and reject the patch.
> I think doing options 'b' and 'c' together can make this
> optimization usable in cases where it is actually useful.

I agree that we probably want to do (b), and I suspect we want both a
GUC and a reloption, assuming that can be done relatively cleanly.

However, I think we should explore (a) more before we explore (c).   I
think there's a good chance that we can reduce the CPU overhead of
this enough to feel comfortable having it enabled by default.  If we
proceed with heuristics as in approach (c), I don't think that's the
end of the world, but I think there will be more corner cases where we
lose and have to fiddle things manually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Nov 27, 2013 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 27, 2013 at 12:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> The basic idea is that you use a rolling hash function to divide up
>>> the history data into chunks of a given average size.  So we scan the
>>> history data, compute a rolling hash value at each point, and each
>>> time the bottom N bits are zero, we consider that to be the end of a
>>> chunk.  We enter all the chunks into a hash table.  The chunk size
>>> will vary, but on the average, given a reasonably well-behaved rolling
>>> hash function (the pglz one probably doesn't qualify) it'll happen
>>> every 2^N bytes, so perhaps for this purpose we'd choose N to be
>>> between 3 and 5.  Then, we scan the input we want to compress and
>>> divide it into chunks in the same way.  Chunks that don't exist in the
>>> history data get copied to the output, while those that do get
>>> replaced with a reference to their position in the history data.
>>
>> I think this idea looks better than the current one and it will definitely
>> improve some of the cases, but not sure we can win in all cases.
>> We have tried a similar idea (reduce size of hash and
>> eventually comparison) by adding every 10 bytes to the history
>> lookup table rather than every byte. It improved most cases but not
>> all cases ("hundred tiny fields, all changed",
>>  "hundred tiny fields, half changed" tests were still slow).
>> Patch and results are at link (refer approach-1):
>> http://www.postgresql.org/message-id/001f01ce1c14$d3af0770$7b0d1650$@kapila@huawei.com
>
> What you did there will, I think, tend to miss a lot of compression
> opportunities.  Suppose for example that the old tuple is
> ABCDEFGHIJKLMNOP and the new tuple is xABCDEFGHIJKLMNOP.  After
> copying one literal byte we'll proceed to copy 9 more, missing the
> fact that there was a long match available after the first byte.

That is right, but the idea behind trying that out was to see if we could
reduce CPU usage at the cost of compression ratio;
however, we found that it didn't completely eliminate that problem.

> The
> advantage of the fingerprinting technique is that it's supposed to be
> resistant to that sort of thing.

Okay, one question that arises here is whether it can be better in terms
of CPU usage compared to when we used the hash function for every 10th
byte. If you feel that it can improve the situation, I can try a
prototype implementation of the same to check the results.


>> Now the tough question is what are the possible options for this patch
>> and which one to pick:
>> a. optimize the encoding technique, so that it can improve results in most
>> cases even if not all.
>> b. have a table-level option or GUC to enable/disable WAL compression.
>> c. use some heuristics to check if the chances of compression are good,
>> then only perform compression.
>>     1. apply this optimization for tuple size > 128 and < 2000
>>     2. apply this optimization if the number of modified columns is less
>> than 25% (some threshold number) of total columns.
>>         I think we can get modified columns from the target entry and use
>> it if triggers haven't changed that tuple. I remember
>>         earlier there were concerns that this value can't be trusted
>> completely, but I think using it as a heuristic is not a
>>         problem, even if this number is not right in some cases.
>> d. forget about this optimization and reject the patch.
>> I think doing options 'b' and 'c' together can make this
>> optimization usable in cases where it is actually useful.
>
> I agree that we probably want to do (b), and I suspect we want both a
> GUC and a reloption, assuming that can be done relatively cleanly.
>
> However, I think we should explore (a) more before we explore (c).

Sure, but to explore (a), the scope is a bit bigger. We have below
options to explore (a):
1. try to optimize the existing algorithm as used in the patch, which we
have tried, but of course we can spend some more time to see if anything
more can be tried out.
2. try the fingerprint technique as suggested by you above.
3. try some other standard methods like vcdiff, lz4 etc.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Wed, Nov 27, 2013 at 9:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Sure, but to explore (a), the scope is a bit bigger. We have below
> options to explore (a):
> 1. try to optimize the existing algorithm as used in the patch, which we
> have tried, but of course we can spend some more time to see if anything
> more can be tried out.
> 2. try the fingerprint technique as suggested by you above.
> 3. try some other standard methods like vcdiff, lz4 etc.

Well, obviously, I'm hot on idea #2 and think that would be worth
spending some time on.  If we can optimize the algorithm used in the
patch some more (option #1), that would be fine, too, but the code
looks pretty tight to me, so I'm not sure how successful that's likely
to be.  But if you have an idea, sure.

As to #3, I took a look at lz4 and snappy but neither seems to have an
API for delta compression.  vcdiff is a commonly used output format for
delta compression but doesn't directly address the question of what
algorithm to use to find matches between the old and new input; and I
suspect that when you go searching for algorithms to do that
efficiently, it's going to bring you right back to #2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Haribabu kommi
Date:
On 29 November 2013 03:05 Robert Haas wrote:
> On Wed, Nov 27, 2013 at 9:31 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > Sure, but to explore (a), the scope is a bit bigger. We have below
> > options to explore (a):
> > 1. try to optimize the existing algorithm as used in the patch, which we
> > have tried, but of course we can spend some more time to see if anything
> > more can be tried out.
> > 2. try the fingerprint technique as suggested by you above.
> > 3. try some other standard methods like vcdiff, lz4 etc.
>
> Well, obviously, I'm hot on idea #2 and think that would be worth
> spending some time on.  If we can optimize the algorithm used in the
> patch some more (option #1), that would be fine, too, but the code
> looks pretty tight to me, so I'm not sure how successful that's likely
> to be.  But if you have an idea, sure.

I tried modifying the existing patch to support a dynamic rollup as follows:
the rollup kicks in for every 32 bytes of mismatch between the old and new
tuple, and it resets back whenever a match is found.

1. pglz-with-micro-optimization-compress-using-newdata-5:

Adds all old tuple data to the history and then checks for matches from the
new tuple. For every 32-byte mismatch, it checks for a match only once every
2 bytes. It repeats this until it finds a match or reaches the end of data.

2. pglz-with-micro-optimization-compress-using-newdata_snappy_hash-1:

Adds only the first byte of old tuple data to the history and then checks
for a match from the new tuple. If a match is found, then the next unmatched
byte from the old tuple is added to the history and the process repeats.

If no match is found, then the next byte of the old tuple is added to the
history, followed by the unmatched byte from the new tuple data.

In this case the performance is good, but if there are any forward references
in the new data relative to the old data then it will not compress the data.

Eg- old data - 12345     abcdefgh
    New data - abcdefgh  56789

The updated patches and performance data are attached in the mail.
Please let me know your suggestions.

Regards,
Hari babu.


Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Mon, Dec 2, 2013 at 7:40 PM, Haribabu kommi
<haribabu.kommi@huawei.com> wrote:
> On 29 November 2013 03:05 Robert Haas wrote:
>> On Wed, Nov 27, 2013 at 9:31 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>
> I tried modifying the existing patch to support a dynamic rollup as follows:
> the rollup kicks in for every 32 bytes of mismatch between the old and new
> tuple, and it resets back whenever a match is found.
>
> 1. pglz-with-micro-optimization-compress-using-newdata-5:
>
> Adds all old tuple data to the history and then checks for matches from the
> new tuple. For every 32-byte mismatch, it checks for a match only once every
> 2 bytes. It repeats this until it finds a match or reaches the end of data.
>
> 2. pglz-with-micro-optimization-compress-using-newdata_snappy_hash-1:
>
> Adds only the first byte of old tuple data to the history and then checks
> for a match from the new tuple. If a match is found, then the next unmatched
> byte from the old tuple is added to the history and the process repeats.
>
> If no match is found, then the next byte of the old tuple is added to the
> history, followed by the unmatched byte from the new tuple data.
>
> In this case the performance is good, but if there are any forward references
> in the new data relative to the old data then it will not compress the data.

The performance data still has the same problem, i.e. on fast disks
(tempfs data) it is low.
I am already doing a chunk-wise implementation to see if it can improve
the situation; please wait for that, and then we can decide what is the
best way to proceed.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Fri, Nov 29, 2013 at 3:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 27, 2013 at 9:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Sure, but to explore (a), the scope is a bit bigger. We have below
>> options to explore (a):
>> 1. try to optimize the existing algorithm as used in the patch, which we
>> have tried, but of course we can spend some more time to see if anything
>> more can be tried out.
>> 2. try the fingerprint technique as suggested by you above.
>> 3. try some other standard methods like vcdiff, lz4 etc.
>
> Well, obviously, I'm hot on idea #2 and think that would be worth
> spending some time on.  If we can optimize the algorithm used in the
> patch some more (option #1), that would be fine, too, but the code
> looks pretty tight to me, so I'm not sure how successful that's likely
> to be.  But if you have an idea, sure.

I have been experimenting with chunk-wise delta encoding (using a
technique similar to the Rabin fingerprint method) for the last few days,
and here are the results of my investigation.

Performance Data
----------------------------
Non-default settings:
autovacuum =off
checkpoint_segments =128
checkpoint_timeout = 10min

unpatched

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |    1054921328 | 25.5855557918549
 hundred tiny fields, all changed        |     634483328 | 20.8992719650269
 hundred tiny fields, half changed       |     635948640 | 19.8670389652252
 hundred tiny fields, half nulled        |     571388552 | 18.9413228034973


lz-delta-encoding

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     662984384 | 21.7335519790649
 hundred tiny fields, all changed        |     633944320 | 24.1207830905914
 hundred tiny fields, half changed       |     633944344 | 24.4657719135284
 hundred tiny fields, half nulled        |     492200208 | 22.0337791442871


rabin-delta-encoding

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     662235752 | 20.1823079586029
 hundred tiny fields, all changed        |     633950080 | 22.0473308563232
 hundred tiny fields, half changed       |     633950880 | 21.8351459503174
 hundred tiny fields, half nulled        |     508943072 | 20.9554698467255


Results Summarization
-------------------------------------
1. With the chunk-wise approach, WAL reduction is almost the same as with
LZ, barring the half-nulled case, which can be improved.
2. With the chunk-wise approach, CPU usage is reduced to 50% in most cases
where there is little or no compression; still, there is a 5~10% overhead
for cases where the data is not compressible. I think there will certainly
be a small overhead of forming the hash table and scanning it to conclude
that the data is non-compressible.
3. I have not tested the other tests, which will anyway return from the top
of the encoding function due to tuple length less than 32.

Main reasons of improvement
---------------------------------------------
1. Fewer hash entries for the old tuple and fewer calculations during
compression of the new tuple.
2. The memset of the hash-table data structure covers a smaller size.
3. Don't copy into the output buffer until we find a match.

Further Actions
------------------------
1. Need to decide if this reduction in CPU usage is acceptable, and whether
we need an enable/disable flag at the table level.
2. We can do further micro-optimisations in the chunk-wise approach, like
improving the hash function.
3. Some code improvements are pending, like for cases where the data to be
compressed is non-contiguous.

Attached files
---------------------
1. pgrb_delta_encoding_v1 - In heaptuple.c, there is a parameter
rabin_fingerprint_comp; set it to true for chunk-wise delta encoding and to
false for LZ encoding. By default it is true. I wanted to provide a better
way to enable both modes and tried as well, but ended up with this way.
2. wal-update-testsuite.sh - test script developed by Heikki to test this
patch.

Note -
a. Performance data is taken on my laptop; it needs to be tested on
some better m/c.
b. The attached patch is just a prototype of the chunk-wise concept; the
code needs to be improved, and decode handling/testing is pending.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Haribabu kommi
Date:
On 05 December 2013 21:16 Amit Kapila wrote:
> Note -
> a. Performance data is taken on my laptop; it needs to be tested on
> some better m/c.
> b. The attached patch is just a prototype of the chunk-wise concept; the
> code needs to be improved, and decode handling/testing is pending.

I ran the performance test on a Linux machine and attached the results in the mail.

Regards,
Hari babu.

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Fri, Dec 6, 2013 at 12:10 PM, Haribabu kommi
<haribabu.kommi@huawei.com> wrote:
> On 05 December 2013 21:16 Amit Kapila wrote:
>> Note -
>> a. Performance data is taken on my laptop; it needs to be tested on
>> some better m/c.
>> b. The attached patch is just a prototype of the chunk-wise concept; the
>> code needs to be improved, and decode handling/testing is pending.
>
> I ran the performance test on a Linux machine and attached the results in the mail.

This test doesn't make much sense for comparison, as in chunk-wise delta
encoding I am not doing compression using the new tuple; the reason is that
I want to check how good/bad it is as compared to the LZ approach for cases
when the data is non-compressible.
So could you please try to take readings by using the patch
pgrb_delta_encoding_v1 attached in my previous mail.

For LZ delta encoding - in heaptuple.c, there is a parameter
rabin_fingerprint_comp; set it to false, compile the code and take
readings. This will do LZ compression.

For chunk-wise delta encoding - set rabin_fingerprint_comp to true,
compile the code and take readings. This will operate chunk-wise.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Haribabu kommi
Date:
On 06 December 2013 12:29 Amit Kapila wrote:
> On Fri, Dec 6, 2013 at 12:10 PM, Haribabu kommi
> <haribabu.kommi@huawei.com> wrote:
> > On 05 December 2013 21:16 Amit Kapila wrote:
> >> Note -
> >> a. Performance data is taken on my laptop; it needs to be tested on
> >> some better m/c.
> >> b. The attached patch is just a prototype of the chunk-wise concept;
> >> the code needs to be improved, and decode handling/testing is pending.
> >
> > I ran the performance test on a Linux machine and attached the results
> > in the mail.
>
> This test doesn't make much sense for comparison, as in chunk-wise delta
> encoding I am not doing compression using the new tuple; the reason is
> that I want to check how good/bad it is as compared to the LZ approach
> for cases when the data is non-compressible.
> So could you please try to take readings by using the patch
> pgrb_delta_encoding_v1 attached in my previous mail.
>
> For LZ delta encoding - in heaptuple.c, there is a parameter
> rabin_fingerprint_comp; set it to false, compile the code and take
> readings. This will do LZ compression.
>
> For chunk-wise delta encoding - set rabin_fingerprint_comp to true,
> compile the code and take readings. This will operate chunk-wise.

I ran the performance test on the above patches, including another patch
which does a snappy hash instead of the normal hash in the LZ algorithm.
The performance readings and the patch with snappy hash (not including new
data in compression) are attached in the mail.

The chunk-wise approach is giving good performance in most of the scenarios.

Regards,
Hari babu.

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Fri, Dec 6, 2013 at 3:39 PM, Haribabu kommi
<haribabu.kommi@huawei.com> wrote:
> On 06 December 2013 12:29 Amit Kapila wrote:
>> >> Note -
>> >> a. Performance data is taken on my laptop; it needs to be tested on
>> >> some better m/c.
>> >> b. The attached patch is just a prototype of the chunk-wise concept;
>> >> the code needs to be improved, and decode handling/testing is pending.
>> >
>> > I ran the performance test on a Linux machine and attached the results
>> > in the mail.
>>
>
> I ran the performance test on the above patches, including another patch
> which does a snappy hash instead of the normal hash in the LZ algorithm.
> The performance readings and the patch with snappy hash (not including
> new data in compression) are attached in the mail.
   Thanks for taking the data.

> The chunk-wise approach is giving good performance in most of the scenarios.

   Agreed; a summarization of the data for LZ/chunk-wise encoding,
especially for non-compressible (hundred tiny fields, all changed/half
changed) or less compressible data (hundred tiny fields, half nulled),
w.r.t. CPU usage is as below:

   a. For hard disk, there is an overhead of 7~16% with LZ delta encoding
       and there is an overhead of 5~8% with chunk-wise encoding.

   b. For tempfs (which means operating on RAM as disk), there is an
       overhead of 19~26% with LZ delta encoding and there is an overhead
       of 9~18% with chunk-wise encoding.

There might be some variation in the data (in your last mail the overhead
of the chunk-wise method for tempfs was < 12%),
but in general the data suggests that chunk-wise encoding has less
overhead than LZ encoding for non-compressible data,
and for others it is better or equal.

Now, I think we have the below options for this patch:
a. If the performance overhead for the worst case is acceptable (we can
try to reduce it some more, but I don't think it will be something
drastic), then this can be done without any flag/option.
b. Have it with a table-level option to enable/disable WAL compression.
c. Drop this patch, as for worst cases there is some performance overhead.
d. Go back and work more on it, if there are any further suggestions
for improvement.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Peter Eisentraut
Date:
This patch fails the regression tests; see attachment.

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thu, Dec 12, 2013 at 3:43 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> This patch fails the regression tests; see attachment.
   Thanks for reporting the diffs. The reason for the failures is that
decoding for the tuple is still not done, as mentioned in the Notes section
of the mail
(http://www.postgresql.org/message-id/CAA4eK1JeUbY16uwrDA2TaBkk+rLRL3Giyyqy1mVh_6CThmDR8w@mail.gmail.com)

   However, to keep the sanity of the patch, I will do that and post an
updated patch; but I think the main idea behind the new approach at this
point is to get feedback on whether such an optimization is acceptable
for worst-case scenarios and, if not, whether we can get this done
with a table-level or GUC option.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Thu, Dec 12, 2013 at 12:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Dec 12, 2013 at 3:43 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> This patch fails the regression tests; see attachment.
>
>    Thanks for reporting the diffs. The reason for the failures is that
> decoding for the tuple is still not done, as mentioned in the Notes
> section of the mail
>    (http://www.postgresql.org/message-id/CAA4eK1JeUbY16uwrDA2TaBkk+rLRL3Giyyqy1mVh_6CThmDR8w@mail.gmail.com)
>
>    However, to keep the sanity of the patch, I will do that and post an
> updated patch; but I think the main idea behind the new approach at this
>    point is to get feedback on whether such an optimization is acceptable
> for worst-case scenarios and, if not, whether we can get this done
>    with a table-level or GUC option.

I don't understand why lack of decoding support should cause
regression tests to fail.  I thought decoding was only being done
during WAL replay, a case not exercised by the regression tests.

A few other comments:

+#define PGRB_HKEY_PRIME            11            /* prime number used for
rolling hash */
+#define PGRB_HKEY_SQUARE_PRIME     11 * 11       /* prime number
used for rolling hash */
+#define PGRB_HKEY_CUBE_PRIME       11 * 11 * 11  /* prime
number used for rolling hash */

11 * 11 can't accurately be described as a prime number.  Nor can 11 *
11 * 11.  Please adjust the comment.  Also, why 11?

It doesn't appear that pglz_hist_idx is changed except for whitespace;
please revert that hunk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thu, Dec 12, 2013 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 12, 2013 at 12:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Dec 12, 2013 at 3:43 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>> This patch fails the regression tests; see attachment.
>>
>>    However, to keep the sanity of patch, I will do that and post an
>> updated patch, but I think the main idea behind new approach at this
>>    point is to get feedback on if such an optimization is acceptable
>> for worst case scenarios and if not whether we can get this done
>>    with table level or GUC option.
>
> I don't understand why lack of decoding support should cause
> regression tests to fail.  I thought decoding was only being done
> during WAL replay, a case not exercised by the regression tests.

   I had mentioned decoding, as the message in the regression failures can
   come from the decompress/decode functions. Today, I debugged the
   regression failures and found that they are coming from the
   pglz_decompress() function, and the reason is that the optimizations
   done in pglz_find_match() to reduce the length of 'maxlen' had a
   problem due to which compression of data was not happening properly.
   I have corrected the calculation for 'maxlen'; now compression is
   happening properly and the regression tests pass.
   This problem was observed previously as well, and we had corrected it in
   one of the versions of the patch, which I forgot to take care of while
   preparing the combined patch of chunk-wise and byte-by-byte encoding.


> A few other comments:
>
> +#define PGRB_HKEY_PRIME            11            /* prime number used for
> rolling hash */
> +#define PGRB_HKEY_SQUARE_PRIME     11 * 11       /* prime number
> used for rolling hash */
> +#define PGRB_HKEY_CUBE_PRIME       11 * 11 * 11  /* prime
> number used for rolling hash */
>
> 11 * 11 can't accurately be described as a prime number.  Nor can 11 *
> 11 * 11.  Please adjust the comment.

   Fixed.

>Also, why 11?

  I have tried with 11, 31, and 101 to see which generates the better
chunks, and I found 11 chooses better chunks (with 31 and 101, there was
almost no chunking; the whole data was considered as one chunk).
  The data I used was the test data of the current test case which we are
using to evaluate this patch. It contains mostly repetitive data, so maybe
we need to test with other kinds of data as well to verify whether the
currently used number is okay.

> It doesn't appear that pglz_hist_idx is changed except for whitespace;
> please revert that hunk.

Fixed.


Thanks for review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Fri, Dec 6, 2013 at 6:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>    Agreed; a summarization of the data for LZ/chunk-wise encoding,
> especially for non-compressible (hundred tiny fields, all changed/half
> changed) or less compressible data (hundred tiny fields, half nulled),
> w.r.t. CPU usage is as below:
>
>    a. For hard disk, there is an overhead of 7~16% with LZ delta encoding
>        and there is an overhead of 5~8% with chunk-wise encoding.
>
>    b. For tempfs (which means operating on RAM as disk), there is an
>        overhead of 19~26% with LZ delta encoding and there is an overhead
>        of 9~18% with chunk-wise encoding.
>
> There might be some variation of data (in your last mail the overhead
> for chunkwise method for Tempfs was < 12%),
> but in general the data suggests that chunk wise encoding has less
> overhead than LZ encoding for non-compressible data
> and for others it is better or equal.
>
> Now, I think we have the below options for this patch:
> a. If the performance overhead for the worst case is acceptable (we can
> try to reduce it some more, but I don't think it will be something
> drastic), then this can be done without any flag/option.
> b. Have it with a table-level option to enable/disable WAL compression.
> c. Drop this patch, as for worst cases there is some performance overhead.
> d. Go back and work more on it, if there are any further suggestions
> for improvement.

Based on the data posted previously for both approaches
(lz_delta, chunk_wise_encoding) and the above options, I have improved
the last version of the patch by keeping the chunk-wise approach and
providing a table-level option to the user.

Changes in this version of patch:
--------------------------------------------------
1. Implement decoding; it is almost similar to pglz_decompress, as
    the format used to store the encoded data is not changed much
    (a toy sketch of the idea follows after this list).

2. Provide a new reloption to specify Wal compression
    for update operation on table
    Create table tbl(c1 char(100)) With (compress_wal = true);

    Alternative options:
    a. compress_wal can take input as operation, e.g. 'insert', 'update',
    b. use alternate syntax:
        Create table tbl(c1 char(100))  Compress Wal For Update;
    c. anything better?

3. Fixed the below 2 defects in encoding:
    a. In function pgrb_find_match(), if the last byte of a chunk matches,
        it considers the whole chunk a match.
    b. If there is no match, it copies the chunk as-is to the encoded data;
        while copying, it was ignoring the last byte.
    Due to the defect fixes, the data can vary, but I don't think there can
    be any major change.
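
As referenced in point 1 above, here is a toy sketch of the decode
direction. The tag layout is deliberately simplified to fixed two-byte tags
and there is no input validation; the real format stays much closer to pglz:

#include <stddef.h>
#include <string.h>

/*
 * Toy decoder for a delta-encoded tuple.  Like pglz_decompress, the
 * input is a sequence of control bytes, each governing up to 8 items:
 * a set bit means "copy (len, off) from the history" -- which for
 * delta encoding is the old tuple -- and a clear bit means one
 * literal byte.
 */
size_t
toy_delta_decode(const unsigned char *src, size_t src_len,
                 const char *history, char *dst)
{
    size_t      sp = 0;
    size_t      dp = 0;

    while (sp < src_len)
    {
        unsigned char ctrl = src[sp++];
        int         bit;

        for (bit = 0; bit < 8 && sp < src_len; bit++, ctrl >>= 1)
        {
            if (ctrl & 1)
            {
                size_t  len = src[sp++];    /* simplified one-byte length */
                size_t  off = src[sp++];    /* simplified one-byte offset */

                memcpy(dst + dp, history + off, len);
                dp += len;
            }
            else
                dst[dp++] = (char) src[sp++];   /* literal byte */
        }
    }
    return dp;                      /* reconstructed tuple length */
}

The one conceptual difference from pglz_decompress is the copy source:
pglz copies from the output produced so far, while here the copy source is
the old tuple passed in as the history.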

Points to consider
-----------------------------

1. As the current algorithm stores the entry for same chunks at the head
   of the list, it will always find the last-but-one chunk (we don't store
   the last 4 bytes) for a long matching string during the match phase in
   encoding (pgrb_delta_encode).

   We can improve this either by storing same chunks at the end of the list
   instead of at the head, or by trying to find a good_match technique as
   used in the lz algorithm. The good_match technique can have overhead in
   some of the cases when there is actually no match.

2. Another optimization that we can do in pgrb_find_match(), is that
currently if
    it doesn't find the first chunk (chunk got by hash index) matching, it
    continues to find the match in other chunks. I am not sure if there is any
    benefit to search for other chunks if first one is not matching.

3. We can move code from pg_lzcompress.c to some new file pg_rbcompress.c;
    if we want to move it, then we need to either duplicate some common
    macros like pglz_out_tag, or keep them common but maybe change the name.

4. Decide on min and max chunksize. (currently kept as 2 and 4 respectively).
    The point to consider is that if we keep bigger chunk sizes, then it can
    save us on CPU cycles, but less reduction in Wal, on the other side if we
    keep it small it can have better reduction in Wal but consume more CPU
    cycles.

5. Kept a GUC variable 'wal_update_compression_ratio' for test purposes; we
   can remove it before commit.

7. Docs need to be updated; tab completion needs some work.

8. We can extend Alter Table to set the compress option for a table.


Thoughts/Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
> 2. Provide a new reloption to specify Wal compression
>     for update operation on table
>     Create table tbl(c1 char(100)) With (compress_wal = true);
>
>     Alternative options:
>     a. compress_wal can take input as operation, e.g. 'insert', 'update',
>     b. use alternate syntax:
>         Create table tbl(c1 char(100))  Compress Wal For Update;
>     c. anything better?

I think WITH (compress_wal = true) is pretty good.  I don't understand
your point about taking the operation as input, because this only
applies to updates.  But we could try to work "update" into the name
of the setting somehow, so as to be less likely to conflict with
future additions, like maybe wal_compress_update.  I think the
alternate syntax you propose is clearly worse, because it would
involve adding new keywords, something we try to avoid.

The only possible enhancement I can think of here is to make the
setting an integer rather than a Boolean, defined as the minimum
acceptable compression ratio.  A setting of 0 means always compress; a
setting of 100 means never compress; intermediate values define the
least acceptable ratio.  But to be honest, I think that's overkill;
I'd be inclined to hard-code the default value of 25 in the patch and
make it a #define.  The only real advantage of requiring a minimum 25%
compression percentage is that we can bail out on compression
three-quarters of the way through the tuple if we're getting nowhere.
That's fine for what it is, but the idea that users are going to see
much benefit from twaddling that number seems very dubious to me.
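
To spell out the bail-out, a minimal sketch with the hard-coded #define
(the names here are mine, not from the patch):

#include <stdbool.h>
#include <stddef.h>

#define PGRB_MIN_COMPRESS_PCT   25      /* required minimum saving */

/*
 * Return true when encoding should be abandoned: the output already
 * produced ('emitted') cannot come in under the largest acceptable
 * size for the whole tuple, no matter how well the rest compresses.
 */
bool
pgrb_encoding_hopeless(size_t src_len, size_t emitted)
{
    size_t  budget = src_len * (100 - PGRB_MIN_COMPRESS_PCT) / 100;

    return emitted >= budget;
}

Checked after each emitted tag or literal run, this gives up once the
output reaches 75% of the source length, i.e. once the required 25% saving
is provably out of reach.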

> Points to consider
> -----------------------------
>
> 1. As the current algorithm store the entry for same chunks at head of list,
>    it will always find last but one chunk (we don't store last 4 bytes) for
>    long matching string during match phase in encoding (pgrb_delta_encode).
>
>    We can improve it either by storing same chunks at end of list instead of at
>    head or by trying to find a good_match technique used in lz algorithm.
>    Finding good_match technique can have overhead in some of the cases
>    when there is actually no match.

I don't see what the good_match thing has to do with anything in the
Rabin algorithm.  But I do think there might be a bug here, which is
that, unless I'm misinterpreting something, hp is NOT the end of the
chunk.  After calling pgrb_hash_init(), we've looked at the first FOUR
bytes of the input.  If we find that we have a zero hash value at that
point, shouldn't the chunk size be 4, not 1?  And similarly if we find
it after sucking in one more byte, shouldn't the chunk size be 5, not
2?  Right now, we're deciding where the chunks should end based on the
data in the chunk plus the following 3 bytes, and that seems wonky.  I
would expect us to include all of those bytes in the chunk.

> 2. Another optimization that we can do in pgrb_find_match(), is that
> currently if
>     it doesn't find the first chunk (chunk got by hash index) matching, it
>     continues to find the match in other chunks. I am not sure if there is any
>     benefit to search for other chunks if first one is not matching.

Well, if you took that out, I suspect it would hurt the compression
ratio.  Unless the CPU savings are substantial, I'd leave it alone.

> 3. We can move code from pg_lzcompress.c to some new file pg_rbcompress.c,
>     if we want to move, then we need to either duplicate some common macros
>     like pglz_out_tag or keep it common, but might be change the name.

+1 for a new file.

> 4. Decide on min and max chunksize. (currently kept as 2 and 4 respectively).
>     The point to consider is that if we keep bigger chunk sizes, then it can
>     save us on CPU cycles, but less reduction in Wal, on the other side if we
>     keep it small it can have better reduction in Wal but consume more CPU
>     cycles.

Whoa.  That seems way too small.  Since PGRB_PATTERN_AFTER_BITS is 4,
the average length of a chunk is about 16 bytes.  It makes little
sense to have the maximum chunk size be 25% of the expected chunk
length.  I'd recommend making the maximum chunk length something like
4 * PGRB_CONST_NUM, and the minimum chunk length maybe something like
4.

> 5. Kept a GUC variable 'wal_update_compression_ratio' for test purposes;
>    we can remove it before commit.

Let's remove it now.

> 7. Docs need to be updated; tab completion needs some work.

Tab completion can be skipped for now, but documentation is important.

> 8. We can extend Alter Table to set compress option for table.

I don't understand what you have in mind here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Fri, Jan 10, 2014 at 9:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 2. Provide a new reloption to specify Wal compression
>>     for update operation on table
>>     Create table tbl(c1 char(100)) With (compress_wal = true);
>>
>>     Alternative options:
>>     a. compress_wal can take input as operation, e.g. 'insert', 'update',
>>     b. use alternate syntax:
>>         Create table tbl(c1 char(100))  Compress Wal For Update;
>>     c. anything better?
>
> I think WITH (compress_wal = true) is pretty good.  I don't understand
> your point about taking the operation as input, because this only
> applies to updates.

Yes, currently this applies to update; what I have in mind is that
in future, if someone wants to use WAL compression for any other
operation like 'full_page_writes', then it can be easily extended.

To be honest, I have not evaluated whether such a flag or compression
would make sense for full page writes, but I think it should be possible
while doing full page write (BkpBlock has RelFileNode) to check such a
flag if it's present.

> But we could try to work "update" into the name
> of the setting somehow, so as to be less likely to conflict with
> future additions, like maybe wal_compress_update.  I think the
> alternate syntax you propose is clearly worse, because it would
> involve adding new keywords, something we try to avoid.

Yes, this would be better than the current one; I will include it in the
next version of the patch, unless there is any other better idea than
this one.

>
>> Points to consider
>> -----------------------------
>>
>> 1. As the current algorithm store the entry for same chunks at head of list,
>>    it will always find last but one chunk (we don't store last 4 bytes) for
>>    long matching string during match phase in encoding (pgrb_delta_encode).
>>
>>    We can improve it either by storing same chunks at end of list instead of at
>>    head or by trying to find a good_match technique used in lz algorithm.
>>    Finding good_match technique can have overhead in some of the cases
>>    when there is actually no match.
>
> I don't see what the good_match thing has to do with anything in the
> Rabin algorithm.

The case for which I have mentioned it is where most of the data is
repetitive and the modified tuple is almost the same.

For example

orignal tuple
aaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

modified tuple
ccaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb


Now let us see what will happen as per the current algorithm.

Step 1: Form the hash table (let's consider 4-byte chunks just for the
        purpose of explanation):
        a. First chunk after the first 4 a's (aaaa) with hash value 1024.
        b. After that all the chunks will be 4 b's (bbbb) and will have the
           same hash value (let's call it H2), so they will be mapped to
           the same bucket (let us call it the 'B2' bucket) and hence form
           a list. Every new chunk with the same hash value is added to the
           front of the list, so if we go by this, B2 has a front element
           with hash value H2 and location 58 (last but 8 bytes, as we
           don't include the last 4 bytes in our algorithm).

Step 2: Perform encoding for the modified tuple by using the hash table
        formed in Step 1:
        a. First chunk after the first 4 bytes (ccaa) with hash value 1056.
        b. Try to find a match in the hash table; no match, so proceed.
        c. Next chunk with 4 b's and hash value H2; try to find a match.
           It will find a match in the B2 bucket (at the first element).
        d. Okay, the hash value matched, so it will try to match each byte.
           If all bytes match, it will consider the chunk a matched_chunk.
        e. Now if the chunk matches, then we try to match consecutive bytes
           after the chunk (in the hope of finding a longer match), and in
           this case it will find a match of 8 bytes and consider match_len
           as 8.
        f. It will increment the modified tuple pointer (dp) to point to
           the byte next to match_len.
        g. Again, it will consider the next 4 b's as a chunk and repeat
           steps c~f.
 

So here, what is happening is that steps c~f are getting repeated, whereas
if we had added the same chunks at the end of the list in step 1b, then we
could have found the matching string in one go (c~f).

The reason for adding the same chunk at the head of the list is that it
uses the same technique as pglz_hist_add. In pglz, it will not repeat steps
c~f, as it has the concept of good_match which leads to getting this done
in one go.

That being said, I am really not sure how much real-world data falls into
the above category and whether we should try to optimize based on the
above example, but yes, it would save some CPU cycles in the current test
we are using.

>
>But I do think there might be a bug here, which is
> that, unless I'm misinterpreting something, hp is NOT the end of the
> chunk.  After calling pgrb_hash_init(), we've looked at the first FOUR
> bytes of the input.  If we find that we have a zero hash value at that
> point, shouldn't the chunk size be 4, not 1?  And similarly if we find
> it after sucking in one more byte, shouldn't the chunk size be 5, not
> 2?  Right now, we're deciding where the chunks should end based on the
> data in the chunk plus the following 3 bytes, and that seems wonky.  I
> would expect us to include all of those bytes in the chunk.

It depends on how we define a chunk; basically the chunk size will be based
on the byte for which we consider the hindex. The hindex for any byte is
calculated considering that byte and the following 3 bytes, so
after calling pgrb_hash_init(), even though we have looked at 4 bytes,
still the hindex is for the first byte and that's why it considers the
chunk size as 1, not 4.

Isn't it similar to how the current pglz works? Basically it also
uses the next 4 bytes to calculate the index (pglz_hist_idx) but still
does byte-by-byte comparison; here if we try to map it to Rabin's
delta encoding, then the chunk size is always 1.

If we follow the same logic to define a chunk both for encoding and match,
will there be any problem?

I have tried to keep the implementation closer to the previous LZ delta
encoding, but if you see benefit in including the supporting bytes
(the next 3 bytes) to define a chunk, then I can try to change it.

>> 2. Another optimization that we can do in pgrb_find_match(), is that
>> currently if
>>     it doesn't find the first chunk (chunk got by hash index) matching, it
>>     continues to find the match in other chunks. I am not sure if there is any
>>     benefit to search for other chunks if first one is not matching.
>
> Well, if you took that out, I suspect it would hurt the compression
> ratio.

True, this is the reason I have kept it, but I was not sure what kinds
of scenarios it can benefit and whether such scenarios would be
more common for updates.

> Unless the CPU savings are substantial, I'd leave it alone.

Okay, let's leave it as it is.

>> 3. We can move code from pg_lzcompress.c to some new file pg_rbcompress.c,
>>     if we want to move, then we need to either duplicate some common macros
>>     like pglz_out_tag or keep it common, but might be change the name.
>
> +1 for a new file.

Okay, will take care of it in next version.

>
>> 4. Decide on min and max chunksize. (currently kept as 2 and 4 respectively).
>>     The point to consider is that if we keep bigger chunk sizes, then it can
>>     save us on CPU cycles, but less reduction in Wal, on the other side if we
>>     keep it small it can have better reduction in Wal but consume more CPU
>>     cycles.
>
> Whoa.  That seems way too small.  Since PGRB_PATTERN_AFTER_BITS is 4,
> the average length of a chunk is about 16 bytes.  It makes little
> sense to have the maximum chunk size be 25% of the expected chunk
> length.  I'd recommend making the maximum chunk length something like
> 4 * PGRB_CONST_NUM, and the minimum chunk length maybe something like
> 4.

Agreed, but I think for some strings (where it doesn't find the special bit
pattern) it will create long chunks, which can affect the reduction.
AFAIR, the current test which we are using to evaluate the
performance of this patch has strings where it will do so, but on
the other side, this might not be the case for real update strings.

I will make these modifications and report here if it affects the results.

>> 5. Kept a GUC variable 'wal_update_compression_ratio' for test purposes;
>>    we can remove it before commit.
>
> Let's remove it now.

Sure, will remove in next version.

>
>> 7. Docs need to be updated; tab completion needs some work.
>
> Tab completion can be skipped for now, but documentation is important.

Agreed; I left it out of this version of the patch so that we can conclude
on the syntax, and then accordingly I can mention it in the docs.

>> 8. We can extend Alter Table to set compress option for table.
>
> I don't understand what you have in mind here.

Let us say a user created a table without this new option (compress_wal)
and later wants to enable it; we can provide something similar to what we
provide for other storage parameters:
Alter Table Set (compress_wal=true)

One more point to note: in the version of the patch I posted, the default
value for compress_wal is true, just for ease of testing; we might want to
change it to false for backward compatibility.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yes, currently this applies to update, what I have in mind is that
> in future if some one wants to use WAL compression for any other
> operation like 'full_page_writes', then it can be easily extendible.
>
> To be honest, I have not evaluated whether such a flag or compression
> would make sense for full page writes, but I think it should be possible
> while doing full page write (BkpBlock has RelFileNode) to check such a
> flag if it's present.

Makes sense.

> The reason of adding the same chunk in head of list is that it uses same
> technique as pglz_hist_add. Now in pglz, it will not have repeat steps
> from c~f, as it has concept of good_match which leads to get this done in
> one go.
>
> Being said above, I am really not sure, how much real world data falls
> in above category and should we try to optimize based on above example,
> but yes it will save some CPU cycles in current test we are using.

In the Rabin algorithm, we shouldn't try to find a longer match.  The
match should end at the chunk end, period.  Otherwise, you lose the
shift-resistant property of the algorithm.

>>But I do think there might be a bug here, which is
>> that, unless I'm misinterpreting something, hp is NOT the end of the
>> chunk.  After calling pgrb_hash_init(), we've looked at the first FOUR
>> bytes of the input.  If we find that we have a zero hash value at that
>> point, shouldn't the chunk size be 4, not 1?  And similarly if we find
>> it after sucking in one more byte, shouldn't the chunk size be 5, not
>> 2?  Right now, we're deciding where the chunks should end based on the
>> data in the chunk plus the following 3 bytes, and that seems wonky.  I
>> would expect us to include all of those bytes in the chunk.
>
> It depends on how we define chunk, basically chunk size will be based
> on the byte for which we consider hindex. The hindex for any byte is
> calculated considering that byte and the following 3 bytes, so
> after calling pgrb_hash_init(), even though we have looked at 4 bytes
> but still the hindex is for first byte and thats why it consider
> chunk size as 1, not 4.
>
> Isn't it similar to how current pglz works, basically it also
> uses next 4 bytes to calculate index (pglz_hist_idx) but still
> does byte by byte comparison, here if we try to map to rabin's
> delta encoding then always chunk size is 1.

I don't quite understand this.  The point of the Rabin algorithm is to
split the old tuple up into chunks and then search for those chunks in the
new tuple.  For example, suppose the old tuple is
abcdefghijklmnopqrstuvwxyz.  It might get split like this: abcdefg
hijklmnopqrstuvw xyz.  If any of those three chunks appear in the new
tuple, then we'll use them for compression.  If not, we'll just copy
the literal bytes.  If the chunks appear in the new tuple reordered or
shifted or with stuff inserted between one chunk and the next, we'll
still find them.  Unless I'm confused, which is possible, what you're
doing is essentially looking at the string and splitting it in those
three places, but then recording the chunks as being three bytes
shorter than they really are.  I don't see how that can be right.
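
For concreteness, a toy sketch of the match phase as I understand it; all
structures and names here are hypothetical, and a linear search stands in
for the real hash-table probe:

#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* a chunk is a (start offset, length) slice of a tuple */
typedef struct
{
    size_t      off;
    size_t      len;
} Chunk;

static int
find_chunk(const char *old_data, const Chunk *old, int nold,
           const char *candidate, size_t len)
{
    int         i;

    for (i = 0; i < nold; i++)
        if (old[i].len == len &&
            memcmp(old_data + old[i].off, candidate, len) == 0)
            return i;
    return -1;
}

/*
 * For each chunk of the new tuple, emit either a reference into the
 * old tuple or the literal bytes.  Matches end at chunk boundaries.
 */
static void
toy_delta_encode(const char *old_data, const Chunk *old, int nold,
                 const char *new_data, const Chunk *newc, int nnew)
{
    int         i;

    for (i = 0; i < nnew; i++)
    {
        int     m = find_chunk(old_data, old, nold,
                               new_data + newc[i].off, newc[i].len);

        if (m >= 0)
            printf("copy %zu bytes from old offset %zu\n",
                   old[m].len, old[m].off);
        else
            printf("literal \"%.*s\"\n", (int) newc[i].len,
                   new_data + newc[i].off);
    }
}

int
main(void)
{
    const char *old_data = "abcdefghijklmnopqrstuvwxyz";
    const char *new_data = "xyzabcdefg12345";
    Chunk       old[] = {{0, 7}, {7, 16}, {23, 3}};   /* abcdefg | hij..w | xyz */
    Chunk       newc[] = {{0, 3}, {3, 7}, {10, 5}};   /* xyz | abcdefg | 12345 */

    toy_delta_encode(old_data, old, 3, new_data, newc, 3);
    return 0;
}

Run on that example it emits two copies and one literal even though the
chunks are reordered, which is exactly the shift- and reorder-resistance
at stake.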

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tue, Jan 14, 2014 at 2:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Yes, currently this applies to update, what I have in mind is that
>> in future if some one wants to use WAL compression for any other
>> operation like 'full_page_writes', then it can be easily extendible.
>>
>> To be honest, I have not evaluated whether such a flag or compression
>> would make sense for full page writes, but I think it should be possible
>> while doing full page write (BkpBlock has RelFileNode) to check such a
>> flag if it's present.
>
> Makes sense.
   So shall I change it to a string instead of a bool and keep the name as
compress_wal or compress_wal_for_opr?

>> The reason of adding the same chunk in head of list is that it uses same
>> technique as pglz_hist_add. Now in pglz, it will not have repeat steps
>> from c~f, as it has concept of good_match which leads to get this done in
>> one go.
>>
>> Being said above, I am really not sure, how much real world data falls
>> in above category and should we try to optimize based on above example,
>> but yes it will save some CPU cycles in current test we are using.
>
> In the Rabin algorithm, we shouldn't try to find a longer match.  The
> match should end at the chunk end, period.  Otherwise, you lose the
> shift-resistant property of the algorithm.
   Okay, it will work well for cases when most chunks in the tuple are due
to a special pattern in it, but it will lose out on CPU cycles in cases
where most of the chunks are due to the maximum chunk boundary and most of
the new tuple matches the old tuple. The reason is that if the algorithm
has some such property of finding longer matches than chunk boundaries,
then it can save us from calculating the hash again and again when we try
to find a match in the old tuple.
   However, I think it is better to go with Rabin's algorithm instead of
adding optimizations based on our own assumptions, because it is difficult
to predict real-world tuple data.
 

>>
>> Isn't it similar to how current pglz works, basically it also
>> uses next 4 bytes to calculate index (pglz_hist_idx) but still
>> does byte by byte comparison, here if we try to map to rabin's
>> delta encoding then always chunk size is 1.
>
> I don't quite understand this.  The point of the Rabin algorithm is to
> split the old tuple up into chunks and then search for those chunks in the
> new tuple.  For example, suppose the old tuple is
> abcdefghijklmnopqrstuvwxyz.  It might get split like this: abcdefg
> hijklmnopqrstuvw xyz.  If any of those three chunks appear in the new
> tuple, then we'll use them for compression.  If not, we'll just copy
> the literal bytes.  If the chunks appear in the new tuple reordered or
> shifted or with stuff inserted between one chunk and the next, we'll
> still find them.  Unless I'm confused, which is possible, what you're
> doing is essentially looking at the string and splitting it in those
> three places, but then recording the chunks as being three bytes
> shorter than they really are.  I don't see how that can be right.
 Today, spending some more time on the algorithm again, I got the bug you
are pointing to, and you are right in saying that the chunk is shorter. I
think it should not be difficult to address this issue without affecting
most of the algorithm; let me try to handle it.
 


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Tue, Jan 14, 2014 at 1:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jan 14, 2014 at 2:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Yes, currently this applies to update, what I have in mind is that
>>> in future if some one wants to use WAL compression for any other
>>> operation like 'full_page_writes', then it can be easily extendible.
>>>
>>> To be honest, I have not evaluated whether such a flag or compression
>>> would make sense for full page writes, but I think it should be possible
>>> while doing full page write (BkpBlock has RelFileNode) to check such a
>>> flag if it's present.
>>
>> Makes sense.
>
>    So shall I change it to a string instead of a bool and keep the name as
>    compress_wal or compress_wal_for_opr?

No.  If we add full-page-write compression in the future, that can be
a separate option.  But I doubt we'd want to set that at the table
level anyway; there's no particular reason that would be good for some
tables and bad for others (whereas in this case there is such a
reason).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Fri, Jan 10, 2014 at 9:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 2. Provide a new reloption to specify Wal compression
>>     for update operation on table
>>     Create table tbl(c1 char(100)) With (compress_wal = true);
>>
>>     Alternative options:
>>     a. compress_wal can take input as operation, e.g. 'insert', 'update',
>>     b. use alternate syntax:
>>         Create table tbl(c1 char(100))  Compress Wal For Update;
>>     c. anything better?
>
> I think WITH (compress_wal = true) is pretty good.  I don't understand
> your point about taking the operation as input, because this only
> applies to updates.  But we could try to work "update" into the name
> of the setting somehow, so as to be less likely to conflict with
> future additions, like maybe wal_compress_update.  I think the
> alternate syntax you propose is clearly worse, because it would
> involve adding new keywords, something we try to avoid.

   Changed name to wal_compress_update in attached version of
   patch.

>
>> Points to consider
>> -----------------------------
>>
>> 1. As the current algorithm store the entry for same chunks at head of list,
>>    it will always find last but one chunk (we don't store last 4 bytes) for
>>    long matching string during match phase in encoding (pgrb_delta_encode).
>>
>>    We can improve it either by storing same chunks at end of list instead of at
>>    head or by trying to find a good_match technique used in lz algorithm.
>>    Finding good_match technique can have overhead in some of the cases
>>    when there is actually no match.
>
> I don't see what the good_match thing has to do with anything in the
> Rabin algorithm.  But I do think there might be a bug here, which is
> that, unless I'm misinterpreting something, hp is NOT the end of the
> chunk.  After calling pgrb_hash_init(), we've looked at the first FOUR
> bytes of the input.  If we find that we have a zero hash value at that
> point, shouldn't the chunk size be 4, not 1?  And similarly if we find
> it after sucking in one more byte, shouldn't the chunk size be 5, not
> 2?  Right now, we're deciding where the chunks should end based on the
> data in the chunk plus the following 3 bytes, and that seems wonky.  I
> would expect us to include all of those bytes in the chunk.

Okay, I have modified the patch to consider the data plus the following
3 bytes as part of the chunk. To resolve it, after calling pgrb_hash_init(), we
need to initialize the chunk size as 4 and, once we find the chunk, advance
the next chunk start position by adding the chunk size to it. Similarly, during
the match phase, while copying unmatched data, make sure to copy the 3
bytes ahead of the current position, as those will not be considered for a new
chunk.


> In the Rabin algorithm, we shouldn't try to find a longer match.  The
> match should end at the chunk end, period.  Otherwise, you lose the
> shift-resistant property of the algorithm.

Okay, for now I have commented out the code in pgrb_find_match() which
tries to find a longer match after the chunk boundary. The reason for just
commenting it out rather than removing it is that I fear it might have a negative
impact on WAL reduction, at least for the cases where most of the data
is repetitive. I have done some performance tests and the data is at the end of
the mail; if you are okay with it, then I will remove this code altogether.

>> 2. Another optimization that we can do in pgrb_find_match(), is that
>> currently if
>>     it doesn't find the first chunk (chunk got by hash index) matching, it
>>     continues to find the match in other chunks. I am not sure if there is any
>>     benefit to search for other chunks if first one is not matching.
>
> Well, if you took that out, I suspect it would hurt the compression
> ratio.  Unless the CPU savings are substantial, I'd leave it alone.

I kept this code intact.

>> 3. We can move code from pg_lzcompress.c to some new file pg_rbcompress.c,
>>     if we want to move, then we need to either duplicate some common macros
>>     like pglz_out_tag or keep it common, but might be change the name.
>
> +1 for a new file.

Done, after moving code to new file, it looks better.

>> 4. Decide on min and max chunksize. (currently kept as 2 and 4 respectively).
>>     The point to consider is that if we keep bigger chunk sizes, then it can
>>     save us on CPU cycles, but less reduction in Wal, on the other side if we
>>     keep it small it can have better reduction in Wal but consume more CPU
>>     cycles.
>
> Whoa.  That seems way too small.  Since PGRB_PATTERN_AFTER_BITS is 4,
> the average length of a chunk is about 16 bytes.  It makes little
> sense to have the maximum chunk size be 25% of the expected chunk
> length.  I'd recommend making the maximum chunk length something like
> 4 * PGRB_CONST_NUM, and the minimum chunk length maybe something like
> 4.

Okay changed as per suggestion.

>> 5. kept an guc variable 'wal_update_compression_ratio', for test purpose, we
>>    can remove it before commit.
>
> Let's remove it now.

Done.

>> 7. docs needs to be updated, tab completion needs some work.
>
> Tab completion can be skipped for now, but documentation is important.

Updated Create Table documentation.

Performance Data
-----------------------------
Non-default settings:
autovacuum = off
checkpoint_segments = 128
checkpoint_timeout = 10min

Unpatched
-------------------
                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |    1054923224 |  33.101135969162

After pgrb_delta_encoding_v4
---------------------------------------------

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     877859144 | 30.6749138832092


Temporary Changes
(Revert Max Chunksize = 4 and logic of finding longer match)
---------------------------------------------------------------------------------------------

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     677337304 | 25.4048750400543


Summary of test results:
1. If we don't try to find a longer match, then the encoded tuple will have
    more tags, which increases its overall length.

Note -
a. I have taken data for just one case, to check whether the effect of the
    changes is acceptable.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Jan 15, 2014 at 5:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 10, 2014 at 9:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Performance Data
> -----------------------------
> Non-default settings:
> autovacuum =off
> checkpoint_segments =128
> checkpoint_timeout = 10min
>
> Unpatched
> -------------------
>                 testname                             | wal_generated |
>     duration
> ----------------------------------------------------------+----------------------+------------------
>  one short and one long field, no change |    1054923224 |  33.101135969162
>
> After pgrb_delta_encoding_v4
> ---------------------------------------------
>
>                 testname                             | wal_generated |
>     duration
> ----------------------------------------------------------+----------------------+------------------
>  one short and one long field, no change |     877859144 | 30.6749138832092
>
>
> Temporary Changes
> (Revert Max Chunksize = 4 and logic of finding longer match)
> ---------------------------------------------------------------------------------------------
>
>                  testname                            | wal_generated |
>     duration
> ----------------------------------------------------------+----------------------+------------------
>  one short and one long field, no change |     677337304 | 25.4048750400543
>

Sorry, a minor correction to my last mail: the last data set (Temporary Changes)
was obtained just by reverting the logic of finding a longer match in pgrb_find_match().

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Unpatched
> -------------------
>                 testname                             | wal_generated |
>     duration
> ----------------------------------------------------------+----------------------+------------------
>  one short and one long field, no change |    1054923224 |  33.101135969162
>
> After pgrb_delta_encoding_v4
> ---------------------------------------------
>
>                 testname                             | wal_generated |
>     duration
> ----------------------------------------------------------+----------------------+------------------
>  one short and one long field, no change |     877859144 | 30.6749138832092
>
>
> Temporary Changes
> (Revert Max Chunksize = 4 and logic of finding longer match)
> ---------------------------------------------------------------------------------------------
>
>                  testname                            | wal_generated |
>     duration
> ----------------------------------------------------------+----------------------+------------------
>  one short and one long field, no change |     677337304 | 25.4048750400543

Sure, but watch me not care.

If we're interested in taking advantage of the internal
compressibility of tuples, we can do a lot better than this patch.  We
can compress the old tuple and the new tuple.  We can compress
full-page images.  We can compress inserted tuples.  But that's not
the point of this patch.

The point of *this* patch is to exploit the fact that the old and new
tuples are likely to be very similar, NOT to squeeze out every ounce
of compression from other sources.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thu, Jan 16, 2014 at 12:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Unpatched
>> -------------------
>>                 testname                             | wal_generated |
>>     duration
>> ----------------------------------------------------------+----------------------+------------------
>>  one short and one long field, no change |    1054923224 |  33.101135969162
>>
>> After pgrb_delta_encoding_v4
>> ---------------------------------------------
>>
>>                 testname                             | wal_generated |
>>     duration
>> ----------------------------------------------------------+----------------------+------------------
>>  one short and one long field, no change |     877859144 | 30.6749138832092
>>
>>
>> Temporary Changes
>> (Revert Max Chunksize = 4 and logic of finding longer match)
>> ---------------------------------------------------------------------------------------------
>>
>>                  testname                            | wal_generated |
>>     duration
>> ----------------------------------------------------------+----------------------+------------------
>>  one short and one long field, no change |     677337304 | 25.4048750400543
>
> Sure, but watch me not care.
>
> If we're interested in taking advantage of the internal
> compressibility of tuples, we can do a lot better than this patch.  We
> can compress the old tuple and the new tuple.  We can compress
> full-page images.  We can compress inserted tuples.  But that's not
> the point of this patch.
>
> The point of *this* patch is to exploit the fact that the old and new
> tuples are likely to be very similar, NOT to squeeze out every ounce
> of compression from other sources.
Okay, got your point. Another minor thing: in the latest patch which I sent
yesterday, I have modified it such that, during the formation of chunks, if there
is data at the end of the string which doesn't have the special pattern and is less
than the max chunk size, we also consider that as a chunk (see the sketch below).
The reason for doing this is that if we have, say, a 104-byte string which contains
no special bit pattern, then it will just have one 64-byte chunk and will leave the
remaining bytes, which might miss the chance of compressing that data.
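
A rough sketch of that trailing-chunk handling (hypothetical names;
first_chunk_len stands for the boundary finder and emit_chunk for adding a
chunk to the history table):

#include <stddef.h>

#define MAX_CHUNK 64   /* hypothetical cap, per the 104-byte example */

extern size_t first_chunk_len(const unsigned char *buf, size_t len);
extern void emit_chunk(size_t start, size_t clen);

/* A 104-byte patternless string yields chunks of 64 + 40 bytes rather
 * than leaving the last 40 bytes out of the history table. */
static void
split_into_chunks(const unsigned char *buf, size_t len)
{
    size_t pos = 0;

    while (pos < len)
    {
        size_t clen = first_chunk_len(buf + pos, len - pos);

        if (clen > MAX_CHUNK)
            clen = MAX_CHUNK;      /* cap patternless runs */
        emit_chunk(pos, clen);     /* short trailing chunk included */
        pos += clen;
    }
}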
 



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Thu, Jan 16, 2014 at 12:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 16, 2014 at 12:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Unpatched
>>> -------------------
>>>                 testname                             | wal_generated |
>>>     duration
>>> ----------------------------------------------------------+----------------------+------------------
>>>  one short and one long field, no change |    1054923224 |  33.101135969162
>>>
>>> After pgrb_delta_encoding_v4
>>> ---------------------------------------------
>>>
>>>                 testname                             | wal_generated |
>>>     duration
>>> ----------------------------------------------------------+----------------------+------------------
>>>  one short and one long field, no change |     877859144 | 30.6749138832092
>>>
>>>
>>> Temporary Changes
>>> (Revert Max Chunksize = 4 and logic of finding longer match)
>>> ---------------------------------------------------------------------------------------------
>>>
>>>                  testname                            | wal_generated |
>>>     duration
>>> ----------------------------------------------------------+----------------------+------------------
>>>  one short and one long field, no change |     677337304 | 25.4048750400543
>>
>> Sure, but watch me not care.
>>
>> If we're interested in taking advantage of the internal
>> compressibility of tuples, we can do a lot better than this patch.  We
>> can compress the old tuple and the new tuple.  We can compress
>> full-page images.  We can compress inserted tuples.  But that's not
>> the point of this patch.
>>
>> The point of *this* patch is to exploit the fact that the old and new
>> tuples are likely to be very similar, NOT to squeeze out every ounce
>> of compression from other sources.
>
>    Okay, got your point.
>    Another minor thing is that in latest patch which I have sent yesterday,
>    I have modified it such that while formation of chunks if there is a data
>    at end of string which doesn't have special pattern and is less than max
>    chunk size, we also consider that as a chunk. The reason of doing this
>    was that let us say if we have 104 bytes string which contains no special
>    bit pattern, then it will just have one 64 byte chunk and will leave the
>    remaining bytes, which might miss the chance of doing compression for
>    that data.

Yeah, that sounds right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Thu, Jan 16, 2014 at 12:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>    Okay, got your point.
>    Another minor thing is that in latest patch which I have sent yesterday,
>    I have modified it such that while formation of chunks if there is a data
>    at end of string which doesn't have special pattern and is less than max
>    chunk size, we also consider that as a chunk. The reason of doing this
>    was that let us say if we have 104 bytes string which contains no special
>    bit pattern, then it will just have one 64 byte chunk and will leave the
>    remaining bytes, which might miss the chance of doing compression for
>    that data.

I ran Heikki's test suite on latest master and latest master plus
pgrb_delta_encoding_v4.patch on a PPC64 machine, but the results
didn't look too good.  The only tests where the WAL volume changed by
more than half a percent were the "one short and one long field, no
change" test, where it dropped by 17%, but at the expense of an
increase in duration of 38%; and the "hundred tiny fields, half
nulled" test, where it dropped by 2% without a change in runtime.
Unfortunately, some of the tests where WAL didn't change significantly
took a runtime hit - in particular, "hundred tiny fields, half
changed" slowed down by 10% and "hundred tiny fields, all changed" by
8%.  I've attached the full results in OpenOffice format.

Profiling the "one short and one long field, no change" test turns up
the following:

    51.38%     postgres  pgrb_delta_encode
    23.58%     postgres  XLogInsert
     2.54%     postgres  heap_update
     1.09%     postgres  LWLockRelease
     0.90%     postgres  LWLockAcquire
     0.89%     postgres  palloc0
     0.88%     postgres  log_heap_update
     0.84%     postgres  HeapTupleSatisfiesMVCC
     0.75%     postgres  ExecModifyTable
     0.73%     postgres  hash_search_with_hash_value

Yipes.  That's a lot more than I remember this costing before.  And I
don't understand why I'm seeing such a large time hit on this test
where you actually saw a significant time *reduction*.  One
possibility is that you may have been running with a default
checkpoint_segments value or one that's low enough to force
checkpointing activity during the test.  I ran with
checkpoint_segments=300.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Mon, Jan 20, 2014 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I ran Heikki's test suit on latest master and latest master plus
> pgrb_delta_encoding_v4.patch on a PPC64 machine, but the results
> didn't look too good.  The only tests where the WAL volume changed by
> more than half a percent were the "one short and one long field, no
> change" test, where it dropped by 17%, but at the expense of an
> increase in duration of 38%; and the "hundred tiny fields, half
> nulled" test, where it dropped by 2% without a change in runtime.

> Unfortunately, some of the tests where WAL didn't change significantly
> took a runtime hit - in particular, "hundred tiny fields, half
> changed" slowed down by 10% and "hundred tiny fields, all changed" by
> 8%.

I think this part of the result is positive, as with earlier approaches the
dip here was > 20%. Refer to the result posted at this link:
http://www.postgresql.org/message-id/51366323.8070606@vmware.com


>  I've attached the full results in OpenOffice format.

> Profiling the "one short and one long field, no change" test turns up
> the following:
>
>     51.38%     postgres  pgrb_delta_encode
>     23.58%     postgres  XLogInsert
>      2.54%     postgres  heap_update
>      1.09%     postgres  LWLockRelease
>      0.90%     postgres  LWLockAcquire
>      0.89%     postgres  palloc0
>      0.88%     postgres  log_heap_update
>      0.84%     postgres  HeapTupleSatisfiesMVCC
>      0.75%     postgres  ExecModifyTable
>      0.73%     postgres  hash_search_with_hash_value
>
> Yipes.  That's a lot more than I remember this costing before.  And I
> don't understand why I'm seeing such a large time hit on this test
> where you actually saw a significant time *reduction*.  One
> possibility is that you may have been running with a default
> checkpoint_segments value or one that's low enough to force
> checkpointing activity during the test.  I ran with
> checkpoint_segments=300.

I ran with checkpoint_segments = 128, and when I ran with v4, I also
saw a similar WAL reduction to what you are seeing, except that in my case
the runtimes for both are almost the same (I think in your case disk writes are
fast, so the CPU overhead is more visible).
I think the major difference in the above test is due to the below part of the code:

pgrb_find_match()
{
..
+ /* if (match_chunk)
+ {
+ while (*ip == *hp)
+ {
+ matchlen++;
+ ip++;
+ hp++;
+ }
+ } */
}

Basically, if we don't go for the longer match, then for a test where most of the
data ("one short and one long field, no change") is similar, it has to do the below
extra steps with no advantage:
a. copying extra tags
b. calculating the rolling hash
c. finding the match
I think the major cost here is due to 'a', but the others might also not be free.
To confirm the theory, if we run the test by just un-commenting the above
code, there should be a significant change in both WAL reduction and
runtime for this test.

I have one idea to avoid the overhead of step a), which is to combine
the tags: don't write a tag until we find any un-matching data. When
un-matched data is found, combine all the previously matched data and
write it as one tag (see the sketch below).
This should eliminate the overhead due to step a.
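
A rough sketch of the tag-combining idea (hypothetical helpers, not the
patch's code: match_at() reports a chunk match against the old tuple,
emit_tag() writes one (offset, length) reference, emit_literal() copies one
new byte):

#include <stddef.h>

extern int match_at(const unsigned char *np, size_t nlen,
                    const unsigned char *op, size_t olen,
                    size_t *off, size_t *len);
extern void emit_tag(size_t old_off, size_t run_len);
extern void emit_literal(unsigned char b);

static void
encode_with_combined_tags(const unsigned char *newp, size_t newlen,
                          const unsigned char *oldp, size_t oldlen)
{
    size_t i = 0;

    while (i < newlen)
    {
        size_t off, len, run_off, run_len;

        if (!match_at(newp + i, newlen - i, oldp, oldlen, &off, &len))
        {
            emit_literal(newp[i++]);
            continue;
        }

        /* Keep extending while consecutive chunks also match at the
         * adjacent old offset, then write everything as one tag
         * instead of one tag per matched chunk. */
        run_off = off;
        run_len = len;
        i += len;
        while (i < newlen &&
               match_at(newp + i, newlen - i, oldp, oldlen, &off, &len) &&
               off == run_off + run_len)
        {
            run_len += len;
            i += len;
        }
        emit_tag(run_off, run_len);
    }
}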

Can we think of any way in which, in spite of doing longer matches, we
can retain the sanctity of this approach?
One way could be to check whether the match after the chunk is long enough
that it matches the rest of the string, but I think that can create problems
in some other cases.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Peter Geoghegan
Date:
On Mon, Nov 25, 2013 at 6:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> But even if that doesn't
> pan out, I think the fallback position should not be "OK, well, if we
> can't get decreased I/O for free then forget it" but rather "OK, if we
> can't get decreased I/O for free then let's get decreased I/O in
> exchange for increased CPU usage".

While I haven't been following the development of this patch, I will
note that on the face of it the latter seem like a trade-off I'd be
quite willing to make.


-- 
Peter Geoghegan



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Tue, Jan 21, 2014 at 2:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 20, 2014 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I ran Heikki's test suit on latest master and latest master plus
>> pgrb_delta_encoding_v4.patch on a PPC64 machine, but the results
>> didn't look too good.  The only tests where the WAL volume changed by
>> more than half a percent were the "one short and one long field, no
>> change" test, where it dropped by 17%, but at the expense of an
>> increase in duration of 38%; and the "hundred tiny fields, half
>> nulled" test, where it dropped by 2% without a change in runtime.
>
>> Unfortunately, some of the tests where WAL didn't change significantly
>> took a runtime hit - in particular, "hundred tiny fields, half
>> changed" slowed down by 10% and "hundred tiny fields, all changed" by
>> 8%.
>
> I think this part of result is positive, as with earlier approaches here the
> dip was > 20%. Refer the result posted at link:
> http://www.postgresql.org/message-id/51366323.8070606@vmware.com
>
>
>>  I've attached the full results in OpenOffice format.
>
>> Profiling the "one short and one long field, no change" test turns up
>> the following:
>>
>>     51.38%     postgres  pgrb_delta_encode
>>     23.58%     postgres  XLogInsert
>>      2.54%     postgres  heap_update
>>      1.09%     postgres  LWLockRelease
>>      0.90%     postgres  LWLockAcquire
>>      0.89%     postgres  palloc0
>>      0.88%     postgres  log_heap_update
>>      0.84%     postgres  HeapTupleSatisfiesMVCC
>>      0.75%     postgres  ExecModifyTable
>>      0.73%     postgres  hash_search_with_hash_value
>>
>> Yipes.  That's a lot more than I remember this costing before.  And I
>> don't understand why I'm seeing such a large time hit on this test
>> where you actually saw a significant time *reduction*.  One
>> possibility is that you may have been running with a default
>> checkpoint_segments value or one that's low enough to force
>> checkpointing activity during the test.  I ran with
>> checkpoint_segments=300.
>
> I ran with checkpoint_segments = 128 and when I ran with v4, I also
> see similar WAL reduction as you are seeing, except that in my case
> runtime for both are almost similar (i think in your case disk writes are
> fast, so CPU overhead is more visible).
> I think the major difference in above test is due to below part of code:
>
> pgrb_find_match()
> {
> ..
> + /* if (match_chunk)
> + {
> + while (*ip == *hp)
> + {
> + matchlen++;
> + ip++;
> + hp++;
> + }
> + } */
> }
>
> Basically if we don't go for longer match, then for test where most data
> ("one short and one long field, no change") is similar, it has to do below
> extra steps with no advantage:
> a. copy extra tags
> b. calculation for rolling hash
> c. finding the match
> I think here major cost is due to 'a', but others might also not be free.
> To confirm the theory, if we run the test by just un-commenting above
> code, there can be significant change in both WAL reduction and
> runtime for this test.
>
> I have one idea to avoid the overhead of step a) which is to combine
> the tags, means don't write the tag until it founds any un-matching data.
> When any un-matched data is found, then combine all the previously
> matched data and write it as one tag.
> This should eliminate the overhead due to step a.

I think that's a good thing to try.  Can you code it up?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Jan 22, 2014 at 12:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 21, 2014 at 2:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Jan 20, 2014 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I ran Heikki's test suit on latest master and latest master plus
>>> pgrb_delta_encoding_v4.patch on a PPC64 machine, but the results
>>> didn't look too good.  The only tests where the WAL volume changed by
>>> more than half a percent were the "one short and one long field, no
>>> change" test, where it dropped by 17%, but at the expense of an
>>> increase in duration of 38%; and the "hundred tiny fields, half
>>> nulled" test, where it dropped by 2% without a change in runtime.
>>
>>> Unfortunately, some of the tests where WAL didn't change significantly
>>> took a runtime hit - in particular, "hundred tiny fields, half
>>> changed" slowed down by 10% and "hundred tiny fields, all changed" by
>>> 8%.
>>
>> I think this part of result is positive, as with earlier approaches here the
>> dip was > 20%. Refer the result posted at link:
>> http://www.postgresql.org/message-id/51366323.8070606@vmware.com
>>
>> Basically if we don't go for longer match, then for test where most data
>> ("one short and one long field, no change") is similar, it has to do below
>> extra steps with no advantage:
>> a. copy extra tags
>> b. calculation for rolling hash
>> c. finding the match
>> I think here major cost is due to 'a', but others might also not be free.
>> To confirm the theory, if we run the test by just un-commenting above
>> code, there can be significant change in both WAL reduction and
>> runtime for this test.
>>
>> I have one idea to avoid the overhead of step a) which is to combine
>> the tags, means don't write the tag until it founds any un-matching data.
>> When any un-matched data is found, then combine all the previously
>> matched data and write it as one tag.
>> This should eliminate the overhead due to step a.
>
> I think that's a good thing to try.  Can you code it up?

I have tried to improve the algorithm in another way, so that we can get
the benefit of same chunks during match finding (something similar to lz).
The main change is to consider chunks at a fixed boundary (4 bytes)
and, after finding a match, to try to find whether there is a longer match than
the current chunk. While finding the longer match, it still takes care that
the next bigger match should be at a chunk boundary. I am not
completely sure about the chunk boundary; maybe 8 or 16 can give
better results.

I think we can now run this patch once on a high-end m/c.

Below is the data on my laptop.

Non-Default Settings
checkpoint_segments = 128
checkpoint_timeout = 15min
autovacuum = off

Before Patch

              testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |    1054922336 | 25.4784970283508
 one short and one long field, no change |    1054914728 | 45.9248871803284
 one short and one long field, no change |    1054911288 | 42.0877709388733
 hundred tiny fields, all changed        |     633946880 | 21.4810841083527
 hundred tiny fields, all changed        |     633943520 | 29.5192229747772
 hundred tiny fields, all changed        |     633943944 | 38.1980679035187
 hundred tiny fields, half changed       |     633946784 | 36.0654091835022
 hundred tiny fields, half changed       |     638136544 |  36.231675863266
 hundred tiny fields, half changed       |     633944072 | 30.7445759773254
 hundred tiny fields, half nulled        |     570130888 | 28.6964628696442
 hundred tiny fields, half nulled        |     569755584 | 32.7119750976562
 hundred tiny fields, half nulled        |     569760312 | 32.4714169502258
(12 rows)


After Patch

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     662239704 | 22.8768830299377
 one short and one long field, no change |     662896760 |  22.466646194458
 one short and one long field, no change |     662878736 | 17.6034708023071
 hundred tiny fields, all changed        |     633946192 | 24.5791938304901
 hundred tiny fields, all changed        |     634161120 | 25.7798039913177
 hundred tiny fields, all changed        |     633946416 |  23.761885881424
 hundred tiny fields, half changed       |     633945512 | 24.7001428604126
 hundred tiny fields, half changed       |     633947944 | 25.2069280147552
 hundred tiny fields, half changed       |     633946480 | 26.6489980220795
 hundred tiny fields, half nulled        |     492199720 | 28.7052059173584
 hundred tiny fields, half nulled        |     492194576 | 26.6311559677124
 hundred tiny fields, half nulled        |     492449408 | 25.2788209915161
(12 rows)


With the above modifications, I could see ~37% WAL reduction for the best case
"one short and one long field, no change" and ~13% for
"hundred tiny fields, half nulled". The duration fluctuates quite a bit in most
runs, so running it on a better m/c may give us a clearer picture.

Any suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Mon, Jan 27, 2014 at 12:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think that's a good thing to try.  Can you code it up?
>
> I have tried to improve algorithm in another way so that we can get
> benefit of same chunks during find match (something similar to lz).
> The main change is to consider chunks at fixed boundary (4 byte)
> and after finding match, try to find if there is a longer match than
> current chunk. While finding longer match, it still takes care that
> next bigger match should be at chunk boundary. I am not
> completely sure about the chunk boundary may be 8 or 16 can give
> better results.
>
> I think now we can once run with this patch on high end m/c.

Here are the results I got on the community PPC64 box.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 01/27/2014 07:03 PM, Amit Kapila wrote:
> I have tried to improve algorithm in another way so that we can get
> benefit of same chunks during find match (something similar to lz).
> The main change is to consider chunks at fixed boundary (4 byte)
> and after finding match, try to find if there is a longer match than
> current chunk. While finding longer match, it still takes care that
> next bigger match should be at chunk boundary. I am not
> completely sure about the chunk boundary may be 8 or 16 can give
> better results.

Since you're only putting a value in the history every 4 bytes, you 
wouldn't need to calculate the hash in a rolling fashion. You could just 
take next four bytes, calculate hash, put it in history table. Then next 
four bytes, calculate hash, and so on. Might save some cycles when 
building the history table...
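
Something like this, perhaps (a sketch only; hist_insert and the hash
function are hypothetical):

#include <stddef.h>
#include <stdint.h>

extern void hist_insert(uint32_t hash, size_t pos);

/* Non-rolling build: hash each aligned 4-byte group from scratch; no
 * rolling state needs to be carried between iterations. */
static void
build_history_every_4(const unsigned char *oldp, size_t oldlen)
{
    size_t i;

    for (i = 0; i + 4 <= oldlen; i += 4)
    {
        uint32_t h = (oldp[i] << 6) ^ (oldp[i + 1] << 4) ^
                     (oldp[i + 2] << 2) ^ oldp[i + 3];

        hist_insert(h, i);
    }
}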

- Heikki



Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 01/28/2014 07:01 PM, Heikki Linnakangas wrote:
> On 01/27/2014 07:03 PM, Amit Kapila wrote:
>> I have tried to improve algorithm in another way so that we can get
>> benefit of same chunks during find match (something similar to lz).
>> The main change is to consider chunks at fixed boundary (4 byte)
>> and after finding match, try to find if there is a longer match than
>> current chunk. While finding longer match, it still takes care that
>> next bigger match should be at chunk boundary. I am not
>> completely sure about the chunk boundary may be 8 or 16 can give
>> better results.
>
> Since you're only putting a value in the history every 4 bytes, you
> wouldn't need to calculate the hash in a rolling fashion. You could just
> take next four bytes, calculate hash, put it in history table. Then next
> four bytes, calculate hash, and so on. Might save some cycles when
> building the history table...

On a closer look, you're putting a chunk in the history table only every
four bytes, but you're *also* checking the history table for a match
only every four bytes. That completely destroys the shift-resistance of
the algorithm. For example, if the new tuple is an exact copy of the old
tuple, except for one additional byte in the beginning, the algorithm
would fail to recognize that. It would be good to add a test case like
that in the test suite.

You can skip bytes when building the history table, or when finding
matches, but not both. Or you could skip N bytes, and then check for
matches for the next four bytes, then skip again and so forth, as long
as you always check four consecutive bytes (because the hash function is
calculated from four bytes).
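
Schematically, the "skip on one side only" rule looks like this (hypothetical
helpers; the real code differs):

#include <stddef.h>

extern void hist_add(const unsigned char *p);     /* index 4 bytes at p */
extern int hist_probe(const unsigned char *p);    /* any match for p[0..3]? */

/* Build the history from the old tuple at every byte offset, but probe
 * for matches in the new tuple only every fourth byte.  Skipping on
 * both sides would lose shift-resistance: one extra byte at the start
 * of the new tuple would misalign every probe against the history. */
static void
build_then_probe(const unsigned char *oldp, size_t oldlen,
                 const unsigned char *newp, size_t newlen)
{
    size_t i;

    for (i = 0; i + 4 <= oldlen; i++)       /* every byte: cheap */
        hist_add(oldp + i);

    for (i = 0; i + 4 <= newlen; i += 4)    /* every 4th byte: expensive */
    {
        if (hist_probe(newp + i))
        {
            /* extend the match and emit a tag here */
        }
    }
}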

I couldn't resist the challenge, and started hacking this. First, some
observations from your latest patch (pgrb_delta_encoding_v5.patch):

1. There are a lot of comments and code that refers to "chunks", which
seem obsolete. For example, ck_size field in PGRB_HistEntry is always
set to a constant, 4, except maybe at the very end of the history
string. The algorithm has nothing to do with Rabin-Karp anymore.

2. The 'hindex' field in PGRB_HistEntry is unused. Also, ck_start_pos is
redundant with the index of the entry in the array, because the array is
filled in order. That only leaves us just the 'next' field, and that can
be represented as a int16 rather than a pointer. So, we really only need
a simple int16 array as the history entries array.

3. You're not gaining much by calculating the hash in a rolling fashion.
A single rolling step requires two multiplications and two sums, plus
shifting the variables around. Calculating the hash from scratch
requires three multiplications and three sums.

4. Given that we're not doing the Rabin-Karp variable-length chunk
thing, we could use a cheaper hash function to begin with. Like, the one
used in pglz. The multiply-by-prime method probably gives fewer
collisions than pglz's shift/xor method, but I don't think that matters
as much as computation speed. No-one has mentioned or measured the
effect of collisions in this thread; that either means that it's a
non-issue or that no-one's just realized how big a problem it is yet.
I'm guessing that it's not a problem, and if it is, it's mitigated by
only trying to find matches every N bytes; collisions would make finding
matches slower, and that's exactly what skipping helps with.

After addressing the above, we're pretty much back to the PGLZ approach. I
kept the change to only find matches every four bytes; that does make
some difference. And I like having this new encoding code in a separate
file, not mingled with pglz stuff; it's sufficiently different that
that's better. I haven't done all that much testing with this, so take it
with a grain of salt.

I don't know if this is better or worse than the other patches that have
been floated around, but I thought I might as well share it..

- Heikki

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Jan 29, 2014 at 3:41 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/28/2014 07:01 PM, Heikki Linnakangas wrote:
>>
>> On 01/27/2014 07:03 PM, Amit Kapila wrote:
>>>
>>> I have tried to improve algorithm in another way so that we can get
>>> benefit of same chunks during find match (something similar to lz).
>>> The main change is to consider chunks at fixed boundary (4 byte)
>>> and after finding match, try to find if there is a longer match than
>>> current chunk. While finding longer match, it still takes care that
>>> next bigger match should be at chunk boundary. I am not
>>> completely sure about the chunk boundary may be 8 or 16 can give
>>> better results.
>>
>>
>> Since you're only putting a value in the history every 4 bytes, you
>> wouldn't need to calculate the hash in a rolling fashion. You could just
>> take next four bytes, calculate hash, put it in history table. Then next
>> four bytes, calculate hash, and so on. Might save some cycles when
>> building the history table...

First of all thanks for looking into patch.

Yes, this is right; we can save cycles by not doing the rolling during hash
calculation, and I was working to improve the patch along those lines. Earlier
it was there because of Rabin's delta encoding, where we need to check
for a special match after each byte.

>
> On a closer look, you're putting a chunk in the history table only every
> four bytes, but you're *also* checking the history table for a match only
> every four bytes. That completely destroys the shift-resistence of the
> algorithm.

You are right that it will lose the shift-resistance, and even Robert has
pointed this out to me; that's why he wants to maintain the property of special
bytes at chunk boundaries as mentioned in Rabin encoding. The only
real reason to shift to a fixed size was to improve CPU usage, and I
thought most Update cases would modify fixed-length columns,
but that might not be true.

> For example, if the new tuple is an exact copy of the old tuple,
> except for one additional byte in the beginning, the algorithm would fail to
> recognize that. It would be good to add a test case like that in the test
> suite.
>
> You can skip bytes when building the history table, or when finding matches,
> but not both. Or you could skip N bytes, and then check for matches for the
> next four bytes, then skip again and so forth, as long as you always check
> four consecutive bytes (because the hash function is calculated from four
> bytes).

Can we do something like:

Build Phase
a. Calculate the hash and add the entry in the history table at every 4 bytes.

Match Phase
a. Calculate the hash in a rolling fashion and try to find a match at every byte.
b. When a match is found, then skip only in chunks, something like I was
   doing in the find-match function:

+ /* consider only complete chunk matches. */
+ if (history_chunk_size == 0)
+     thislen += PGRB_MIN_CHUNK_SIZE;
+ }

Will this address the concern?

The main reason to process in chunks as much as possible is to save
CPU cycles. For example, if we build the hash table byte-by-byte, then even
for the best case where most of the tuple matches, it will have reasonable
overhead due to the formation of the hash table. (A rough sketch of the
rolling probe follows.)
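
A sketch of the rolling probe in the proposed match phase (hypothetical hash
and hist_lookup; the multiplier 31 and its cube 29791 are stand-ins):

#include <stddef.h>
#include <stdint.h>

extern int hist_lookup(uint32_t hash);      /* any entry for this hash? */

static void
probe_every_byte(const unsigned char *newp, size_t newlen)
{
    uint32_t h = 0;
    size_t i;

    if (newlen < 4)
        return;

    /* hash of the first window: b0*31^3 + b1*31^2 + b2*31 + b3 */
    for (i = 0; i < 4; i++)
        h = h * 31 + newp[i];

    for (i = 4;; i++)
    {
        if (hist_lookup(h))
        {
            /* match candidate covers newp[i-4 .. i-1]; verify and
             * extend it here */
        }
        if (i >= newlen)
            break;
        /* roll: drop newp[i - 4], mix in newp[i] (mod 2^32 arithmetic) */
        h = (h - newp[i - 4] * 29791u) * 31 + newp[i];
    }
}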

>
> I couldn't resist the challenge, and started hacking this. First, some
> observations from your latest patch (pgrb_delta_encoding_v5.patch):
>
> 1. There are a lot of comments and code that refers to "chunks", which seem
> obsolete. For example, ck_size field in PGRB_HistEntry is always set to a
> constant, 4, except maybe at the very end of the history string. The
> algorithm has nothing to do with Rabin-Karp anymore.
>
> 2. The 'hindex' field in PGRB_HistEntry is unused. Also, ck_start_pos is
> redundant with the index of the entry in the array, because the array is
> filled in order. That only leaves us just the 'next' field, and that can be
> represented as a int16 rather than a pointer. So, we really only need a
> simple int16 array as the history entries array.
>
> 3. You're not gaining much by calculating the hash in a rolling fashion. A
> single rolling step requires two multiplications and two sums, plus shifting
> the variables around. Calculating the hash from scratch requires three
> multiplications and three sums.
>
> 4. Given that we're not doing the Rabin-Karp variable-length chunk thing, we
> could use a cheaper hash function to begin with. Like, the one used in pglz.
> The multiply-by-prime method probably gives fewer collisions than pglz's
> shift/xor method, but I don't think that matters as much as computation
> speed. No-one has mentioned or measured the effect of collisions in this
> thread; that either means that it's a non-issue or that no-one's just
> realized how big a problem it is yet. I'm guessing that it's not a problem,
> and if it is, it's mitigated by only trying to find matches every N bytes;
> collisions would make finding matches slower, and that's exactly what
> skipping helps with.
>
> After addressing the above, we're pretty much back to PGLZ approach.

Here, during the match phase, I think we can avoid copying literal bytes until
a match is found; that can save cycles for cases when the old and new
tuples are mostly different.

> I kept
> the change to only find matches every four bytes, that does make some
> difference. And I like having this new encoding code in a separate file, not
> mingled with pglz stuff, it's sufficiently different that that's better. I
> haven't done all much testing with this, so take it with a grain of salt.
>
> I don't know if this is better or worse than the other patches that have
> been floated around, but I though I might as well share it..

Thanks for sharing the patch. I can take the data and compare it with the
existing approach, if you think the explanation I have given above for
changing the algorithm is okay.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 01/29/2014 02:21 PM, Amit Kapila wrote:
> On Wed, Jan 29, 2014 at 3:41 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> For example, if the new tuple is an exact copy of the old tuple,
>> except for one additional byte in the beginning, the algorithm would fail to
>> recognize that. It would be good to add a test case like that in the test
>> suite.
>>
>> You can skip bytes when building the history table, or when finding matches,
>> but not both. Or you could skip N bytes, and then check for matches for the
>> next four bytes, then skip again and so forth, as long as you always check
>> four consecutive bytes (because the hash function is calculated from four
>> bytes).
>
> Can we do something like:
> Build Phase
> a. Calculate the hash and add the entry in history table at every 4 bytes.
>
> Match Phase
> a. Calculate the hash in rolling fashion and try to find match at every byte.

Sure, that'll work. However, I believe it's cheaper to add entries to 
the history table at every byte, and check for a match every 4 bytes. I 
think you'll get more or less the same level of compression either way, 
but adding to the history table is cheaper than checking for matches, 
and we'd rather do the cheap thing more often than the expensive thing.

> b. When match is found then skip only in chunks, something like I was
>      doing in find match function
> +
> + /* consider only complete chunk matches. */
> + if (history_chunk_size == 0)
> + thislen += PGRB_MIN_CHUNK_SIZE;
> + }
>
> Will this address the concern?

Hmm, so when checking if a match is truly a match, you compare the 
strings four bytes at a time rather than byte-by-byte? That might work, 
but I don't think that's a hot spot currently. In the profiling I did, 
with a "nothing matches" test case, all the cycles were spent in the 
history building, and finding matches. Finding out how long a match is 
was negligible. Of course, it might be a different story with input 
where the encoding helps and you have matches, but I think we were doing 
pretty well in those cases already.

> The main reason to process in chunks as much as possible is to save
> cpu cycles. For example if we build hash table byte-by-byte, then even
> for best case where most of tuple has a match, it will have reasonable
> overhead due to formation of hash table.

Hmm. One very simple optimization we could do is to just compare the two 
strings byte by byte, before doing anything else, to find any common 
prefix they might have. Then output a tag for the common prefix, and run 
the normal algorithm on the rest of the strings. In many real-world 
tables, the 1-2 first columns are a key that never changes, so that 
might work pretty well in practice. Maybe it would also be worthwhile to 
do the same for any common suffix the tuples might have.

That would fail to find matches where you e.g. update the last column to 
have the same value as the first column, and change nothing else, but 
that's ok. We're not aiming for the best possible compression, just 
trying to avoid WAL-logging data that wasn't changed.
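
A minimal standalone sketch of the prefix/suffix idea (not the patch):

#include <stddef.h>

/*
 * Find the longest common prefix and suffix of the old and new tuples,
 * keeping the two regions from overlapping.  Only the middle of the new
 * tuple then needs delta encoding; the prefix and suffix become two
 * cheap tags.
 */
static void
common_affixes(const unsigned char *oldp, size_t oldlen,
               const unsigned char *newp, size_t newlen,
               size_t *prefixlen, size_t *suffixlen)
{
    size_t minlen = (oldlen < newlen) ? oldlen : newlen;
    size_t p = 0, s = 0;

    while (p < minlen && oldp[p] == newp[p])
        p++;
    while (s < minlen - p &&
           oldp[oldlen - 1 - s] == newp[newlen - 1 - s])
        s++;

    *prefixlen = p;
    *suffixlen = s;
}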

> Here during match phase, I think we can avoid copying literal bytes until
> a match is found, that can save cycles for cases when old and new
> tuple are mostly different.

I think the extra if's in the loop will actually cost you more cycles 
than you save. You could perhaps have two copies of the main 
match-finding loop though. First, loop without outputting anything, 
until you find the first match. Then, output anything up to that point 
as literals. Then fall into the second loop, which outputs any 
non-matches byte by byte.
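
In outline, something like this (hypothetical helpers: probe_match() checks
the history, emit_match() writes a tag and returns the bytes consumed):

#include <stddef.h>

extern int probe_match(const unsigned char *p, size_t n);
extern size_t emit_match(const unsigned char *p);
extern void emit_literal(unsigned char b);
extern void emit_literal_run(const unsigned char *p, size_t n);

static void
two_phase_encode(const unsigned char *newp, size_t newlen)
{
    size_t i = 0;

    /* Phase 1: scan without emitting anything until the first match,
     * then flush everything before it as a single literal run. */
    while (i < newlen && !probe_match(newp + i, newlen - i))
        i++;
    emit_literal_run(newp, i);

    /* Phase 2: from here on, emit non-matching bytes one at a time. */
    while (i < newlen)
    {
        if (probe_match(newp + i, newlen - i))
            i += emit_match(newp + i);
        else
            emit_literal(newp[i++]);
    }
}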

- Heikki



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/29/2014 02:21 PM, Amit Kapila wrote:
>> b. When match is found then skip only in chunks, something like I was
>>      doing in find match function
>> +
>> + /* consider only complete chunk matches. */
>> + if (history_chunk_size == 0)
>> + thislen += PGRB_MIN_CHUNK_SIZE;
>> + }
>>
>> Will this address the concern?
>
>
> Hmm, so when checking if a match is truly a match, you compare the strings
> four bytes at a time rather than byte-by-byte? That might work, but I don't
> think that's a hot spot currently. In the profiling I did, with a "nothing
> matches" test case, all the cycles were spent in the history building, and
> finding matches. Finding out how long a match is was negligible. Of course,
> it might be a different story with input where the encoding helps and you
> have matches, but I think we were doing pretty well in those cases already.

I think the way you have improved the forming of history tables is damn good (using
very few instructions) and we might not need to proceed chunk-wise, but I still
think it might give us benefits when all or most of the old and new tuples
match. Also, for the nothing-or-lesser-match case, I think we can skip more
frequently till we find the first match.
If we don't find any match for the first 4 bytes, then skip 4 bytes, and if we don't
find a match again for the next 8 bytes, then skip 8 bytes, and keep on doing the
same until we find the first match (see the sketch below). There is a chance that
we miss some bytes for compression, but it should not affect much, as we are doing
this only till we find the first match, and during the match phase we always find
the longest match. I have added this concept in the new version of the patch, and
it was easier to add this logic after I implemented your suggestion of breaking the
main match loop into 2 loops.
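
The doubling skip could look roughly like this (a sketch; probe_match and
the cap of 64 are hypothetical):

#include <stddef.h>

extern int probe_match(const unsigned char *p, size_t n);

/* Skip 4 bytes after a failed probe, then 8, then 16, ... until the
 * first match; the caller switches to normal matching from there. */
static size_t
find_first_match(const unsigned char *newp, size_t newlen)
{
    size_t i = 0;
    size_t skip = 4;

    while (i + 4 <= newlen)
    {
        if (probe_match(newp + i, newlen - i))
            return i;
        i += skip;
        if (skip < 64)          /* cap the skip distance */
            skip *= 2;
    }
    return newlen;              /* no match found anywhere */
}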


>
>> The main reason to process in chunks as much as possible is to save
>> cpu cycles. For example if we build hash table byte-by-byte, then even
>> for best case where most of tuple has a match, it will have reasonable
>> overhead due to formation of hash table.
>
>
> Hmm. One very simple optimization we could do is to just compare the two
> strings byte by byte, before doing anything else, to find any common prefix
> they might have. Then output a tag for the common prefix, and run the normal
> algorithm on the rest of the strings. In many real-world tables, the 1-2
> first columns are a key that never changes, so that might work pretty well
> in practice. Maybe it would also be worthwhile to do the same for any common
> suffix the tuples might have.

Is it possible to do this for both prefix and suffix together? Basically, the
question I have in mind is what will be the deciding factor for switching from
the hash table mechanism to string comparison mode for the suffix. Do we
switch when we find a long enough match?

Can we do this optimization after the basic version is acceptable?

>> Here during match phase, I think we can avoid copying literal bytes until
>> a match is found, that can save cycles for cases when old and new
>> tuple are mostly different.
>
>
> I think the extra if's in the loop will actually cost you more cycles than
> you save. You could perhaps have two copies of the main match-finding loop
> though. First, loop without outputting anything, until you find the first
> match. Then, output anything up to that point as literals. Then fall into
> the second loop, which outputs any non-matches byte by byte.

This is certainly a better way of implementing it. I have changed the patch
accordingly, and I have modified it to address your comments, except
for removing hindex from the history entry structure. I believe that the way you
have done it in the patch back-to-pglz-like-delta-encoding-1 is better, and I will
change it after understanding the logic completely.

Few observations in patch (back-to-pglz-like-delta-encoding-1):

1.
+#define pgrb_hash_unroll(_p, hindex) \
+ hindex = hindex ^ ((_p)[0] << 8)

shouldn't it shift by 6 rather than by 8.

2.
+ if (bp - bstart >= result_max)
+ return false;

I think for the nothing-or-lesser-match case it will traverse the whole tuple.
Can we optimize it such that if there is no match till 75%, we bail out?
Ideally, I think if we don't find any match in the first 50 to 60%, we should
come out.

3. pg_rbcompress.h is missing.

I am still working on the patch back-to-pglz-like-delta-encoding-1 to see if
it works well for all cases, but I thought of sharing with you what I have
done till now.

After basic verification of  back-to-pglz-like-delta-encoding-1, I will
take the data with both the patches and report the same.

Please let me know your thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thu, Jan 30, 2014 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>
> Few observations in patch (back-to-pglz-like-delta-encoding-1):
>
> 1.
> +#define pgrb_hash_unroll(_p, hindex) \
> + hindex = hindex ^ ((_p)[0] << 8)
>
> shouldn't it shift by 6 rather than by 8.
>
> 2.
> + if (bp - bstart >= result_max)
> + return false;
>
> I think for nothing or lesser match case it will traverse whole tuple.
> Can we optimize such that if there is no match till 75%, we can bail out.
> Ideally, I think if we don't find any match in first 50 to 60%, we should
> come out.
>
> 3. pg_rbcompress.h is missing.
>
> I am still working on patch back-to-pglz-like-delta-encoding-1 to see if
> it works well for all cases, but thought of sharing what I have done till
> now to you.
>
> After basic verification of  back-to-pglz-like-delta-encoding-1, I will
> take the data with both the patches and report the same.

Apart from the above, the only other thing which I found problematic is the
below code in the find-match function:

+ while (*ip == *hp && thislen < maxlen)
+ thislen++;

It should be
while (*ip++ == *hp++ && thislen < maxlen)

Please confirm.
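
For reference, an index-based form sidesteps the pointer side effects
entirely (a sketch, not the patch's code):

static int
match_len(const unsigned char *ip, const unsigned char *hp, int maxlen)
{
    int thislen = 0;

    /* Compare by index so nothing moves on the failing comparison. */
    while (thislen < maxlen && ip[thislen] == hp[thislen])
        thislen++;
    return thislen;
}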

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thu, Jan 30, 2014 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>
> After basic verification of  back-to-pglz-like-delta-encoding-1, I will
> take the data with both the patches and report the same.

I have corrected the problems reported in back-to-pglz-like-delta-encoding-1
and removed hindex from pgrb_delta_encoding_v6 and attached are
new versions of both patches.

I/O Reduction Data
-----------------------------
Non-Default settings
autovacuum = off
checkpoint_segments = 256
checkpoint_timeout = 15min

Unpatched
------------------
                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |    1054917904 | 14.6407959461212
 one short and one long field, no change |    1054917840 | 14.2938411235809
 one short and one long field, no change |    1054916032 | 14.6062371730804
 hundred tiny fields, all changed        |     633950304 | 15.6165988445282
 hundred tiny fields, all changed        |     633943184 | 15.7330548763275
 hundred tiny fields, all changed        |     633943536 | 16.2008850574493
 hundred tiny fields, half changed       |     633946056 | 15.9042718410492
 hundred tiny fields, half changed       |     633949992 | 15.9494590759277
 hundred tiny fields, half changed       |     633948448 | 17.1421928405762
 hundred tiny fields, half nulled        |     569757992 | 16.0392069816589
 hundred tiny fields, half nulled        |     569758848 | 15.7891688346863
 hundred tiny fields, half nulled        |     569755144 | 16.2466349601746

Patch pgrb_delta_encoding_v7
------------------------------------------------
                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     662240016 | 12.0052649974823
 one short and one long field, no change |     662570640 | 11.5202040672302
 one short and one long field, no change |     662231656 | 12.2640421390533
 hundred tiny fields, all changed        |     633947296 | 17.0527350902557
 hundred tiny fields, all changed        |     633945824 | 17.1216440200806
 hundred tiny fields, all changed        |     633948904 | 16.8881120681763
 hundred tiny fields, half changed       |     633944656 | 18.0734100341797
 hundred tiny fields, half changed       |     633944472 | 17.0183899402618
 hundred tiny fields, half changed       |     633945112 | 16.6483509540558
 hundred tiny fields, half nulled        |     499946000 | 18.9340658187866
 hundred tiny fields, half nulled        |     499952408 | 18.7714779376984
 hundred tiny fields, half nulled        |     499953432 |  18.690948009491
(12 rows)


Patch back-to-pglz-like-delta-encoding-2
----------------------------------------------------------
                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     662242872 | 12.7399699687958
 one short and one long field, no change |     662233440 | 12.7010321617126
 one short and one long field, no change |     663938992 | 13.1172158718109
 hundred tiny fields, all changed        |     635451832 |  17.918673992157
 hundred tiny fields, all changed        |     633946736 | 17.1329951286316
 hundred tiny fields, all changed        |     633943480 | 17.0818238258362
 hundred tiny fields, half changed       |     634762208 | 17.0016329288483
 hundred tiny fields, half changed       |     633946560 | 17.3154718875885
 hundred tiny fields, half changed       |     633943240 | 17.1657249927521
 hundred tiny fields, half nulled        |     492017488 | 27.3930599689484
 hundred tiny fields, half nulled        |     492016776 | 26.7517058849335
 hundred tiny fields, half nulled        |     493848424 | 26.6423358917236
(12 rows)


Observations
--------------------
1. With both the patches, WAL reduction is similar, i.e. ~37% for
    "one short and one long field, no change" and 12% for
    "hundred tiny fields, half nulled"
2. With pgrb_delta_encoding_v7, there is ~19% CPU reduction for best
    case "one short and one long field, no change".
3. With pgrb_delta_encoding_v7, there is approximately 8~9% overhead
    for cases where there is no match
4. With pgrb_delta_encoding_v7, there is approximately 15~18% overhead
    for "hundred tiny fields, half nulled" case
5. With back-to-pglz-like-delta-encoding-2, the data is mostly similar except
    for "hundred tiny fields, half nulled" where CPU overhead is much more.

The case ("hundred tiny fields, half nulled") where CPU overhead is visible
is due to repetitive data and if take some random or different data, it will not
be there. I think the main reason for overhead is that we store last offset
of matching data in history at front, so during match, it has to traverse back
many times to find longest possible match and in real world it won't be the
case that most of history entries contain same hash index, so it should not
effect.
Finally if any user is concerned much about CPU overhead due to it, there
is a table level knob which he can use to avoid it.
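
To make the traversal cost concrete, below is a minimal standalone sketch
(illustrative names, not the patch's code) of such a head-inserted history
chain; with highly repetitive data every chunk lands in the same bucket, so
each match attempt walks the whole chain, while with random data the chains
stay short and the lookup is cheap.

#include <stdint.h>
#include <stdio.h>

#define HIST_BUCKETS 4096

typedef struct HistEntry
{
    struct HistEntry *next;     /* older entry with the same hash */
    int         pos;            /* offset of this chunk in the old tuple */
} HistEntry;

static HistEntry *hist_heads[HIST_BUCKETS];

static void
hist_add(HistEntry *e, uint32_t hash, int pos)
{
    e->pos = pos;
    e->next = hist_heads[hash % HIST_BUCKETS];  /* newest entry at the front */
    hist_heads[hash % HIST_BUCKETS] = e;
}

/* Walk the whole chain to find the longest match for 'src'. */
static int
hist_longest_match(uint32_t hash, const char *old, int old_len,
                   const char *src, int src_len, int *match_pos)
{
    int         best = 0;
    HistEntry  *e;

    for (e = hist_heads[hash % HIST_BUCKETS]; e != NULL; e = e->next)
    {
        int         len = 0;

        while (len < src_len && e->pos + len < old_len &&
               old[e->pos + len] == src[len])
            len++;
        if (len > best)
        {
            best = len;
            *match_pos = e->pos;
        }
    }
    return best;
}

int
main(void)
{
    static HistEntry entries[4];
    const char *old = "aaaaaaaaaaaaaaaa";  /* repetitive: all chunks collide */
    int         i, pos = 0, len;

    for (i = 0; i < 4; i++)
        hist_add(&entries[i], 7, i * 4);   /* four chunks, one bucket */

    /* the lookup now walks all four entries before answering */
    len = hist_longest_match(7, old, 16, "aaaaaaaa", 8, &pos);
    printf("longest match: %d bytes at offset %d\n", len, pos);
    return 0;
}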

Please let me know your suggestions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Fri, Jan 31, 2014 at 12:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 30, 2014 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>
>> After basic verification of  back-to-pglz-like-delta-encoding-1, I will
>> take the data with both the patches and report the same.
>
> I have corrected the problems reported in back-to-pglz-like-delta-encoding-1
> and removed hindex from pgrb_delta_encoding_v6 and attached are
> new versions of both patches.
>
> I/O Reduction Data
> -----------------------------
> Non-Default settings
> autovacuum = off
> checkpoint_segments = 256
> checkpoint_timeout = 15min
>
> Observations
> --------------------
> 1. With both patches, WAL reduction is similar, i.e., ~37% for
>     "one short and one long field, no change" and 12% for
>     "hundred tiny fields, half nulled".
> 2. With pgrb_delta_encoding_v7, there is ~19% CPU reduction for the best
>     case, "one short and one long field, no change".
> 3. With pgrb_delta_encoding_v7, there is approximately 8~9% overhead
>     for cases where there is no match.
> 4. With pgrb_delta_encoding_v7, there is approximately 15~18% overhead
>     for the "hundred tiny fields, half nulled" case.
> 5. With back-to-pglz-like-delta-encoding-2, the data is mostly similar except
>     for "hundred tiny fields, half nulled", where the CPU overhead is much higher.
>
> The case ("hundred tiny fields, half nulled") where CPU overhead is visible
> is due to repetitive data; if we take some random or different data, it will
> not be there.

To verify this theory, I have added one new test which is quite similar to
"hundred tiny fields, half nulled"; the difference is that it has a
non-repetitive string, and the results are as below:

Unpatched
--------------
                       testname                       | wal_generated |     duration
------------------------------------------------------+---------------+------------------
 nine short and one long field, thirty percent change |     698912496 | 12.1819660663605
 nine short and one long field, thirty percent change |     698906048 | 11.9409539699554
 nine short and one long field, thirty percent change |     698910904 | 11.9367880821228

Patch pgrb_delta_encoding_v7
------------------------------------------------

                       testname                       | wal_generated |     duration
------------------------------------------------------+---------------+------------------
 nine short and one long field, thirty percent change |     559840840 | 11.6027710437775
 nine short and one long field, thirty percent change |     559829440 | 11.8239741325378
 nine short and one long field, thirty percent change |     560141352 | 11.6789472103119

Patch back-to-pglz-like-delta-encoding-2
----------------------------------------------------------

                       testname                       | wal_generated |     duration
------------------------------------------------------+---------------+------------------
 nine short and one long field, thirty percent change |     544391432 | 12.3666560649872
 nine short and one long field, thirty percent change |     544378616 | 11.8833730220795
 nine short and one long field, thirty percent change |     544376888 | 11.9487581253052
(3 rows)


The basic idea of the new test is that some part of the tuple is unchanged and
the other part is changed; here the unchanged part contains a random string
rather than a repetitive set of chars.
The new test is added with other tests in attached file.

Observation
-------------------
LZ-like delta encoding has more WAL reduction and chunk-wise encoding
has slightly better CPU usage, but overall both are almost similar.

> I think the main reason for the overhead is that we store the last offset
> of matching data at the front of the history chain, so during a match, it has
> to traverse back many times to find the longest possible match. In the real
> world it won't be the case that most history entries share the same hash
> index, so this should not have much effect.

If we want to improve CPU usage for cases like "hundred tiny fields,
half nulled" (which I think is not important), forming the history table by
traversing from the end rather than the beginning can serve the purpose.
I have not tried it, but I think it can certainly help.

Do you think the overall data is acceptable?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Fri, Jan 31, 2014 at 1:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 31, 2014 at 12:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Jan 30, 2014 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas
>>> <hlinnakangas@vmware.com> wrote:
>>>
>>> After basic verification of  back-to-pglz-like-delta-encoding-1, I will
>>> take the data with both the patches and report the same.
>>
>> I have corrected the problems reported in back-to-pglz-like-delta-encoding-1
>> and removed hindex from pgrb_delta_encoding_v6 and attached are
>> new versions of both patches.
>>
>> I/O Reduction Data
>> -----------------------------
>> Non-Default settings
>> autovacuum = off
>> checkpoint_segments = 256
>> checkpoint_timeout = 15min
>>
>> Observations
>> --------------------
>> 1. With both patches, WAL reduction is similar, i.e., ~37% for
>>     "one short and one long field, no change" and 12% for
>>     "hundred tiny fields, half nulled".
>> 2. With pgrb_delta_encoding_v7, there is ~19% CPU reduction for the best
>>     case, "one short and one long field, no change".
>> 3. With pgrb_delta_encoding_v7, there is approximately 8~9% overhead
>>     for cases where there is no match.
>> 4. With pgrb_delta_encoding_v7, there is approximately 15~18% overhead
>>     for the "hundred tiny fields, half nulled" case.
>> 5. With back-to-pglz-like-delta-encoding-2, the data is mostly similar except
>>     for "hundred tiny fields, half nulled", where the CPU overhead is much higher.
>>
>> I think the main reason for the overhead is that we store the last offset
>> of matching data at the front of the history chain, so during a match, it has
>> to traverse back many times to find the longest possible match. In the real
>> world it won't be the case that most history entries share the same hash
>> index, so this should not have much effect.
>
> If we want to improve CPU usage for cases like "hundred tiny fields,
> half nulled" (which I think is not important), forming the history table by
> traversing from the end rather than the beginning can serve the purpose.
> I have not tried it, but I think it can certainly help.

I have implemented the above idea of forming the history table by traversing
the old tuple from the end instead of from the beginning, and have done some
optimizations in find-match for breaking the loop early based on a good-match
concept similar to pglz. The advantage of this is that we can find longer
matches quickly, due to which even for the "hundred tiny fields,
half nulled" case there is now no CPU overhead, without any
significant effect on any other case.
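
The shape of that early break is roughly as below; this is a standalone
sketch with assumed names and an assumed threshold, not the patch's actual
find-match code:

#include <stdio.h>

typedef struct HistEntry
{
    struct HistEntry *next;
    int         pos;
} HistEntry;

#define GOOD_MATCH_LEN 32       /* assumed threshold; pglz calls this idea
                                 * match_size_good */

static int
common_length(const char *a, const char *b, int maxlen)
{
    int         len = 0;

    while (len < maxlen && a[len] == b[len])
        len++;
    return len;
}

/*
 * Walk the chain, but stop as soon as a "good enough" match is found,
 * instead of always searching for the absolute longest one.
 */
static int
find_match(HistEntry *chain, const char *old, const char *src,
           int maxlen, int *best_pos)
{
    int         best = 0;
    HistEntry  *e;

    for (e = chain; e != NULL; e = e->next)
    {
        int         len = common_length(old + e->pos, src, maxlen);

        if (len > best)
        {
            best = len;
            *best_pos = e->pos;
            if (best >= GOOD_MATCH_LEN)
                break;          /* good match: stop walking the chain */
        }
    }
    return best;
}

int
main(void)
{
    HistEntry   e2 = {NULL, 0};
    HistEntry   e1 = {&e2, 8};
    const char *old = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnop";
    int         pos = 0;
    int         len = find_match(&e1, old, old + 8, 34, &pos);

    /* the 34-byte match at the head entry ends the search immediately */
    printf("match of %d bytes at offset %d\n", len, pos);
    return 0;
}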

Please find the updated patch attached with mail and new
data as below:

Non-Default settings
---------------------------------
autovacuum = off
checkpoint_segments = 256
checkpoint_timeout = 15min

Unpatched

                       testname                       | wal_generated |     duration
------------------------------------------------------+---------------+------------------
 one short and one long field, no change              |    1055025424 | 14.3506939411163
 one short and one long field, no change              |    1056580160 | 18.1261160373688
 one short and one long field, no change              |    1054914792 |  15.104973077774
 hundred tiny fields, all changed                     |     636948992 | 16.3172590732574
 hundred tiny fields, all changed                     |     633943680 |  16.308168888092
 hundred tiny fields, all changed                     |     636516776 | 16.4316298961639
 hundred tiny fields, half changed                    |     633948288 | 16.5795118808746
 hundred tiny fields, half changed                    |     636068648 | 16.2913551330566
 hundred tiny fields, half changed                    |     635848432 | 15.9602961540222
 hundred tiny fields, half nulled                     |     569758744 | 15.9501180648804
 hundred tiny fields, half nulled                     |     569760112 | 15.9422838687897
 hundred tiny fields, half nulled                     |     570609712 | 16.5659689903259
 nine short and one long field, thirty % change       |     698908824 | 12.7938749790192
 nine short and one long field, thirty % change       |     698905400 | 12.0160901546478
 nine short and one long field, thirty % change       |     698909720 | 12.2999179363251


After pgrb_delta_encoding_v8.patch
----------------------------------------------------------
                       testname                       | wal_generated |     duration
------------------------------------------------------+---------------+------------------
 one short and one long field, no change              |     680203392 | 12.4820687770844
 one short and one long field, no change              |     677340120 | 11.8634090423584
 one short and one long field, no change              |     677333288 | 11.9269840717316
 hundred tiny fields, all changed                     |     633950264 | 16.7694170475006
 hundred tiny fields, all changed                     |     635496520 | 16.9294109344482
 hundred tiny fields, all changed                     |     633942832 | 18.0690770149231
 hundred tiny fields, half changed                    |     633948024 | 17.0814690589905
 hundred tiny fields, half changed                    |     633947488 | 17.0073189735413
 hundred tiny fields, half changed                    |     633949224 | 17.0454230308533
 hundred tiny fields, half nulled                     |     499950184 | 16.3303508758545
 hundred tiny fields, half nulled                     |     499952888 | 15.7197980880737
 hundred tiny fields, half nulled                     |     499958120 | 15.7198679447174
 nine short and one long field, thirty % change       |     559831384 | 12.0672481060028
 nine short and one long field, thirty % change       |     559829472 | 11.8555760383606
 nine short and one long field, thirty % change       |     559832760 | 11.9470820426941

Observations are almost the same as previously, except for the
"hundred tiny fields, half nulled" case, which I have updated below:

>> Observations
>> --------------------
>> 1. With both patches, WAL reduction is similar, i.e., ~37% for
>>     "one short and one long field, no change" and 12% for
>>     "hundred tiny fields, half nulled".
>> 2. With pgrb_delta_encoding_v7, there is ~19% CPU reduction for the best
>>     case, "one short and one long field, no change".
>> 3. With pgrb_delta_encoding_v7, there is approximately 8~9% overhead
>>     for cases where there is no match.
>> 4. With pgrb_delta_encoding_v7, there is approximately 15~18% overhead
>>     for the "hundred tiny fields, half nulled" case.

Now there is approximately a 1.4~5% CPU gain for the
"hundred tiny fields, half nulled" case.

>> 5. With back-to-pglz-like-delta-encoding-2, the data is mostly similar except
>>     for "hundred tiny fields, half nulled", where the CPU overhead is much higher.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Now there is approximately 1.4~5% CPU gain for
> "hundred tiny fields, half nulled" case

I don't want to advocate too strongly for this patch because, number
one, Amit is a colleague and more importantly, number two, I can't
claim to be an expert on compression.  But that having been said, I
think these numbers are starting to look awfully good.  The only
remaining regressions are in the cases where a large fraction of the
tuple turns over, and they're not that big even then.  The two *worst*
tests now seem to be "hundred tiny fields, all changed" and "hundred
tiny fields, half changed".  For the "all changed" case, the median
unpatched time is 16.3172590732574 and the median patched time is
16.9294109344482, a <4% loss; for the "half changed" case, the median
unpatched time is 16.5795118808746 and the median patched time is
17.0454230308533, a <3% loss.  Both cases show minimal change in WAL
volume.

Meanwhile, in friendlier cases, like "one short and one long field, no
change", we're seeing big improvements.  That particular case shows a
speedup of 21% and a WAL reduction of 36%.  That's a pretty big deal,
and I think not unrepresentative of many real-world workloads.  Some
might well do better, having either more or longer unchanged fields.
Assuming that the logic isn't buggy, a point in need of further study,
I'm starting to feel like we want to have this.  And I might even be
tempted to remove the table-level off switch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Bruce Momjian
Date:
On Tue, Feb  4, 2014 at 01:28:38PM -0500, Robert Haas wrote:
> Meanwhile, in friendlier cases, like "one short and one long field, no
> change", we're seeing big improvements.  That particular case shows a
> speedup of 21% and a WAL reduction of 36%.  That's a pretty big deal,
> and I think not unrepresentative of many real-world workloads.  Some
> might well do better, having either more or longer unchanged fields.
> Assuming that the logic isn't buggy, a point in need of further study,
> I'm starting to feel like we want to have this.  And I might even be
> tempted to remove the table-level off switch.

Does this feature relate to compression of WAL page images at all?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
On 2014-02-04 14:09:57 -0500, Bruce Momjian wrote:
> On Tue, Feb  4, 2014 at 01:28:38PM -0500, Robert Haas wrote:
> > Meanwhile, in friendlier cases, like "one short and one long field, no
> > change", we're seeing big improvements.  That particular case shows a
> > speedup of 21% and a WAL reduction of 36%.  That's a pretty big deal,
> > and I think not unrepresentative of many real-world workloads.  Some
> > might well do better, having either more or longer unchanged fields.
> > Assuming that the logic isn't buggy, a point in need of further study,
> > I'm starting to feel like we want to have this.  And I might even be
> > tempted to remove the table-level off switch.
> 
> Does this feature relate to compression of WAL page images at all?

No.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance Improvement by reducing WAL for Update Operation

From
Bruce Momjian
Date:
On Tue, Feb  4, 2014 at 08:11:18PM +0100, Andres Freund wrote:
> On 2014-02-04 14:09:57 -0500, Bruce Momjian wrote:
> > On Tue, Feb  4, 2014 at 01:28:38PM -0500, Robert Haas wrote:
> > > Meanwhile, in friendlier cases, like "one short and one long field, no
> > > change", we're seeing big improvements.  That particular case shows a
> > > speedup of 21% and a WAL reduction of 36%.  That's a pretty big deal,
> > > and I think not unrepresentative of many real-world workloads.  Some
> > > might well do better, having either more or longer unchanged fields.
> > > Assuming that the logic isn't buggy, a point in need of further study,
> > > I'm starting to feel like we want to have this.  And I might even be
> > > tempted to remove the table-level off switch.
> > 
> > Does this feature relate to compression of WAL page images at all?
> 
> No.

I guess it bothers me we are working on compressing row change sets
while the majority(?) of WAL is page images.  I know we had a page image
compression patch that got stalled.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Performance Improvement by reducing WAL for Update Operation

From
Peter Geoghegan
Date:
On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Does this feature relate to compression of WAL page images at all?
>
> No.

So the obvious question is: where, if anywhere, do the two efforts
(this patch, and Fujii's patch) overlap? Does Fujii have any concerns
about this patch as it relates to his effort to compress FPIs?

-- 
Peter Geoghegan



Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
On February 4, 2014 10:50:10 PM CET, Peter Geoghegan <pg@heroku.com> wrote:
>On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund <andres@2ndquadrant.com>
>wrote:
>>> Does this feature relate to compression of WAL page images at all?
>>
>> No.
>
>So the obvious question is: where, if anywhere, do the two efforts
>(this patch, and Fujii's patch) overlap? Does Fujii have any concerns
>about this patch as it relates to his effort to compress FPIs?

I think there's zero overlap. They're completely complementary features.
It's not like normal WAL records have an irrelevant volume.

Andres

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance Improvement by reducing WAL for Update Operation

From
Peter Geoghegan
Date:
On Tue, Feb 4, 2014 at 1:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I think there's zero overlap. They're completely complementary features.
> It's not like normal WAL records have an irrelevant volume.

I'd have thought so too, but I would not like to assume. Like many
people commenting on this thread, I don't know very much about
compression.


-- 
Peter Geoghegan



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Now there is approximately 1.4~5% CPU gain for
>> "hundred tiny fields, half nulled" case

> Assuming that the logic isn't buggy, a point in need of further study,
> I'm starting to feel like we want to have this.  And I might even be
> tempted to remove the table-level off switch.

I have tried to stress the worst case more, as you are thinking of
removing the table-level switch, and found that even if we increase the
data by approx. 8 times ("ten long fields, all changed", each field contains
80 bytes of data), the CPU overhead is still < 5%, which clearly shows that
the overhead doesn't increase much even if the length of unmatched data
is increased by a much larger factor.
So the data for the worst case adds more weight to your statement
("remove table-level switch"); however, there is no harm in keeping the
table-level option with default as 'true', and if some users are really sure
the updates in their system will have nothing in common, then they can
set this new option to 'false'.

Below is the data for the new case "ten long fields, all changed" added
in the attached script file:

Unpatched
           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    3473999520 | 45.0375978946686
 ten long fields, all changed |    3473999864 | 45.2536928653717
 ten long fields, all changed |    3474006880 | 45.1887288093567


After pgrb_delta_encoding_v8.patch
----------------------------------------------------------
          testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    3474006456 | 47.5744359493256
 ten long fields, all changed |    3474000136 | 47.3830440044403
 ten long fields, all changed |    3474002688 | 46.9923310279846



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 02/04/2014 11:58 PM, Andres Freund wrote:
> On February 4, 2014 10:50:10 PM CET, Peter Geoghegan <pg@heroku.com> wrote:
>> On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund <andres@2ndquadrant.com>
>> wrote:
>>>> Does this feature relate to compression of WAL page images at all?
>>>
>>> No.
>>
>> So the obvious question is: where, if anywhere, do the two efforts
>> (this patch, and Fujii's patch) overlap? Does Fujii have any concerns
>> about this patch as it relates to his effort to compress FPIs?
>
> I think there's zero overlap. They're completely complementary features.
> It's not like normal WAL records have an irrelevant volume.

Correct. Compressing a full-page image happens on the first update after 
a checkpoint, and the diff between old and new tuple is not used in that 
case.

Compressing full page images makes a difference if you're doing random 
updates across a large table, so that you only update each buffer 1-2 
times. This patch will have no effect in that case. And when you update 
the same page many times between checkpoints, the full-page image is 
insignificant, and this patch has a big effect.

- Heikki



Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 02/05/2014 07:54 AM, Amit Kapila wrote:
> On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Now there is approximately 1.4~5% CPU gain for
>>> "hundred tiny fields, half nulled" case
>
>> Assuming that the logic isn't buggy, a point in need of further study,
>> I'm starting to feel like we want to have this.  And I might even be
>> tempted to remove the table-level off switch.
>
> I have tried to stress the worst case more, as you are thinking of
> removing the table-level switch, and found that even if we increase the
> data by approx. 8 times ("ten long fields, all changed", each field contains
> 80 bytes of data), the CPU overhead is still < 5%, which clearly shows that
> the overhead doesn't increase much even if the length of unmatched data
> is increased by a much larger factor.
> So the data for the worst case adds more weight to your statement
> ("remove table-level switch"); however, there is no harm in keeping the
> table-level option with default as 'true', and if some users are really sure
> the updates in their system will have nothing in common, then they can
> set this new option to 'false'.
>
> Below is the data for the new case "ten long fields, all changed" added
> in the attached script file:

That's not the worst case, by far.

First, note that the skipping while scanning new tuple is only performed
in the first loop. That means that as soon as you have a single match,
you fall back to hashing every byte. So for the worst case, put one
4-byte field as the first column, and don't update it.

Also, I suspect the runtimes in your test were dominated by I/O. When I
scale down the number of rows involved so that the whole test fits in
RAM, I get much bigger differences with and without the patch. You might
also want to turn off full_page_writes, to make the effect clear with
less data.

So, I came up with the attached worst case test, modified from your
latest test suite.

unpatched:


                testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
  ten long fields, all but one changed |     343385312 | 2.20806908607483
  ten long fields, all but one changed |     336263592 | 2.18997097015381
  ten long fields, all but one changed |     336264504 | 2.17843413352966
(3 rows)


pgrb_delta_encoding_v8.patch:

                testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
  ten long fields, all but one changed |     338356944 | 3.33501315116882
  ten long fields, all but one changed |     344059272 | 3.37364101409912
  ten long fields, all but one changed |     336257840 | 3.36244201660156
(3 rows)

So with this test, the overhead is very significant.

With the skipping logic, another kind of "worst case" case is that you
have a lot of similarity between the old and new tuple, but you miss it
because you skip. For example, if you change the first few columns, but
leave a large text column at the end of the tuple unchanged.

- Heikki

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 01/30/2014 08:53 AM, Amit Kapila wrote:
> On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 01/29/2014 02:21 PM, Amit Kapila wrote:
>>> The main reason to process in chunks as much as possible is to save
>>> cpu cycles. For example if we build hash table byte-by-byte, then even
>>> for best case where most of tuple has a match, it will have reasonable
>>> overhead due to formation of hash table.
>>
>> Hmm. One very simple optimization we could do is to just compare the two
>> strings byte by byte, before doing anything else, to find any common prefix
>> they might have. Then output a tag for the common prefix, and run the normal
>> algorithm on the rest of the strings. In many real-world tables, the 1-2
>> first columns are a key that never changes, so that might work pretty well
>> in practice. Maybe it would also be worthwhile to do the same for any common
>> suffix the tuples might have.
>
> Is it possible to do for both prefix and suffix together, basically
> the question I
> have in mind is what will be deciding factor for switching from hash table
> mechanism to string comparison mode for suffix. Do we switch when we find
> long enough match?

I think you got it backwards. You don't switch from hash table mechanism
to string comparison. You do the prefix/suffix comparison *first*, and
run the hash table algorithm only on the "middle" part, between the
common prefix and suffix.

> Can we do this optimization after the basic version is acceptable?

I would actually suggest doing that first. Perhaps even ditch the whole
history table approach and do *only* the scan for prefix and suffix.
That's very cheap, and already covers a large fraction of UPDATEs that
real applications do. In particular, it's optimal for the case that you
update only a single column, something like "UPDATE foo SET bar = bar + 1".

I'm pretty sure the overhead of that would be negligible, so we could
always enable it. There are certainly a lot of scenarios where
prefix/suffix detection alone wouldn't help, but so what.
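
To illustrate, a minimal standalone sketch of such a prefix/suffix scan
(hypothetical code, not the attached patch):

#include <stdio.h>

static void
find_prefix_suffix(const char *oldtup, int oldlen,
                   const char *newtup, int newlen,
                   int *prefixlen, int *suffixlen)
{
    int         minlen = (oldlen < newlen) ? oldlen : newlen;
    int         pre = 0;
    int         suf = 0;

    while (pre < minlen && oldtup[pre] == newtup[pre])
        pre++;

    /* don't let the suffix overlap the prefix */
    while (suf < minlen - pre &&
           oldtup[oldlen - suf - 1] == newtup[newlen - suf - 1])
        suf++;

    *prefixlen = pre;
    *suffixlen = suf;
}

int
main(void)
{
    const char *oldtup = "id=42|name=alice|counter=7|padding";
    const char *newtup = "id=42|name=alice|counter=8|padding";
    int         pre, suf;

    find_prefix_suffix(oldtup, 34, newtup, 34, &pre, &suf);
    printf("prefix=%d suffix=%d\n", pre, suf);  /* 33 of 34 bytes matched */
    return 0;
}

Only the middle part between the common prefix and suffix then needs to be
emitted or delta-encoded (here a single byte).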

Attached is a quick patch for that, if you want to test it.

- Heikki

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/30/2014 08:53 AM, Amit Kapila wrote:
>>
>> Is it possible to do for both prefix and suffix together, basically
>> the question I
>> have in mind is what will be deciding factor for switching from hash table
>> mechanism to string comparison mode for suffix. Do we switch when we find
>> long enough match?
>
>
> I think you got it backwards. You don't switch from hash table mechanism to
> string comparison. You do the prefix/suffix comparison *first*, and run the
> hash table algorithm only on the "middle" part, between the common prefix
> and suffix.
>
>
>> Can we do this optimization after the basic version is acceptable?
>
>
> I would actually suggest doing that first. Perhaps even ditch the whole
> history table approach and do *only* the scan for prefix and suffix. That's
> very cheap, and already covers a large fraction of UPDATEs that real
> applications do. In particular, it's optimal for the case that you update
> only a single column, something like "UPDATE foo SET bar = bar + 1".
>
> I'm pretty sure the overhead of that would be negligible, so we could always
> enable it. There are certainly a lot of scenarios where prefix/suffix
> detection alone wouldn't help, but so what.
>
> Attached is a quick patch for that, if you want to test it.

I have done one test where there is a large suffix match, but
not large enough that it can compress more than 75% of the string.
The CPU overhead with wal-update-prefix-suffix-encode-1.patch is
not much, but there is no I/O reduction either. However, for the same
case there is both significant WAL reduction and CPU gain with
pgrb_delta_encoding_v8.patch.

I have updated "ten long fields, all changed" such that there is a large
suffix match. The updated script is attached.

Unpatched
           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    1760986528 | 28.3700430393219
 ten long fields, all changed |    1760981320 |   28.53244805336
 ten long fields, all changed |    1764294992 | 28.6722140312195
(3 rows)


wal-update-prefix-suffix-encode-1.patch
           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |    1760986016 | 29.4183659553528
 ten long fields, all changed |    1760981904 | 29.7636449337006
 ten long fields, all changed |    1762436104 |  29.508908033371
(3 rows)

pgrb_delta_encoding_v8.patch

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |     733969304 |  23.916286945343
 ten long fields, all changed |     733977040 | 23.6019561290741
 ten long fields, all changed |     737384632 | 24.2645490169525


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 02/05/2014 04:48 PM, Amit Kapila wrote:
> I have done one test where there is a large suffix match, but
> not large enough that it can compress more than 75% of the string.
> The CPU overhead with wal-update-prefix-suffix-encode-1.patch is
> not much, but there is no I/O reduction either.

Hmm, it's supposed to compress if you save at least 25%, not 75%. 
Apparently I got that backwards in the patch...

- Heikki



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Feb 5, 2014 at 5:13 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 02/05/2014 07:54 AM, Amit Kapila wrote:
>>
>> On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>>>
>>> On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila <amit.kapila16@gmail.com>
>>> wrote:
>>>>
>>>> Now there is approximately 1.4~5% CPU gain for
>>>> "hundred tiny fields, half nulled" case
>>
>>
>>> Assuming that the logic isn't buggy, a point in need of further study,
>>> I'm starting to feel like we want to have this.  And I might even be
>>> tempted to remove the table-level off switch.
>>
>>
>> I have tried to stress the worst case more, as you are thinking of
>> removing the table-level switch, and found that even if we increase the
>> data by approx. 8 times ("ten long fields, all changed", each field
>> contains
>> 80 bytes of data), the CPU overhead is still < 5%, which clearly shows that
>> the overhead doesn't increase much even if the length of unmatched data
>> is increased by a much larger factor.
>> So the data for the worst case adds more weight to your statement
>> ("remove table-level switch"); however, there is no harm in keeping the
>> table-level option with default as 'true' and if some users are really
>> sure
>> the updates in their system will have nothing in common, then they can
>> set this new option to 'false'.
>>
>> Below is the data for the new case "ten long fields, all changed" added
>> in the attached script file:
>
>
> That's not the worst case, by far.
>
> First, note that the skipping while scanning new tuple is only performed in
> the first loop. That means that as soon as you have a single match, you fall
> back to hashing every byte. So for the worst case, put one 4-byte field as
> the first column, and don't update it.
>
> Also, I suspect the runtimes in your test were dominated by I/O. When I
> scale down the number of rows involved so that the whole test fits in RAM, I
> get much bigger differences with and without the patch. You might also want
> to turn off full_page_writes, to make the effect clear with less data.
>
> So with this test, the overhead is very significant.
>
> With the skipping logic, another kind of "worst case" case is that you have
> a lot of similarity between the old and new tuple, but you miss it because
> you skip.

This is exactly the reason why I have not kept the skipping logic in the
second pass (loop), but I think maybe it would have been better to keep it,
just not as aggressive as in the first pass. The basic idea I had in mind is
that if we get a match, then there is a high chance that we get matches in
consecutive positions.

I think we should see this patch as an I/O reduction feature rather than in
terms of CPU gain/overhead, because the I/O reduction in WAL has other
benefits, like transfer for replication, archiving, and recovery; basically,
wherever there is a disk read operation, the I/O reduction will amount to
less data read, which can be beneficial in many ways.

Some time back, I was reading an article on the benefits of compression
in databases where the benefits are shown something like what
I said above (at least that is what I understood from it). The link to that
article is:
http://db2guys.wordpress.com/2013/08/23/compression/

Another thing is that I think it might be difficult to get negligible
overhead for data which is very small or non-compressible; that's
why it is preferable to have compression for a table enabled via a
switch.

Is it viable to see here what the best way is to get I/O reduction
for most cases, and provide a switch so that for the worst cases
the user can turn it off?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 02/05/2014 04:48 PM, Amit Kapila wrote:
>>
>> I have done one test where there is a large suffix match, but
>> not large enough that it can compress more than 75% of the string.
>> The CPU overhead with wal-update-prefix-suffix-encode-1.patch is
>> not much, but there is no I/O reduction either.
>
>
> Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently
> I got that backwards in the patch...

Okay, I think that is right; maybe I can change that check to see the
difference, but in general isn't it going to lose compression in many more
cases, like if there is less than a 25% match in prefix/suffix, but
more than a 50% match in the middle of the string?

While debugging, I noticed that it compresses less than the history table
approach for general cases where the update is done internally, as for
Truncate table.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Peter Geoghegan
Date:
On Wed, Feb 5, 2014 at 12:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> I think there's zero overlap. They're completely complementary features.
>> It's not like normal WAL records have an irrelevant volume.
>
>
> Correct. Compressing a full-page image happens on the first update after a
> checkpoint, and the diff between old and new tuple is not used in that case.

Uh, I really just meant that one thing that might overlap is
considerations around the choice of compression algorithm. I think
that there was some useful discussion of that on the other thread as
well.


-- 
Peter Geoghegan



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Wed, Feb 5, 2014 at 6:59 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Attached is a quick patch for that, if you want to test it.

But if we really just want to do prefix/suffix compression, this is a
crappy and expensive way to do it.  We needn't force everything
through the pglz tag format just because we elide a common prefix or
suffix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 02/05/2014 04:48 PM, Amit Kapila wrote:
>>
>> I have done one test where there is a large suffix match, but
>> not large enough that it can compress more than 75% of the string.
>> The CPU overhead with wal-update-prefix-suffix-encode-1.patch is
>> not much, but there is no I/O reduction either.
>
>
> Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently
> I got that backwards in the patch...

So if I understand the code correctly, the new check should be

if (prefixlen + suffixlen < (slen * need_rate) / 100)
    return false;

rather than

if (slen - prefixlen - suffixlen > (slen * need_rate) / 100)
return false;
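
For a concrete example, take slen = 400 and need_rate = 25: the first form
compresses whenever prefixlen + suffixlen >= 100, i.e., at least a 25% match,
whereas the second form gives up unless the unmatched middle is at most 100
bytes, i.e., it effectively demands at least a 75% match.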

Please confirm; otherwise any validation of this might not be useful.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Wed, Feb 5, 2014 at 6:43 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> So, I came up with the attached worst case test, modified from your latest
> test suite.
>
> unpatched:
>
>
>                testname               | wal_generated |     duration
> --------------------------------------+---------------+------------------
>  ten long fields, all but one changed |     343385312 | 2.20806908607483
>  ten long fields, all but one changed |     336263592 | 2.18997097015381
>  ten long fields, all but one changed |     336264504 | 2.17843413352966
> (3 rows)
>
>
> pgrb_delta_encoding_v8.patch:
>
>                testname               | wal_generated |     duration
> --------------------------------------+---------------+------------------
>  ten long fields, all but one changed |     338356944 | 3.33501315116882
>  ten long fields, all but one changed |     344059272 | 3.37364101409912
>  ten long fields, all but one changed |     336257840 | 3.36244201660156
> (3 rows)
>
> So with this test, the overhead is very significant.

Yuck.  Well that sucks.

> With the skipping logic, another kind of "worst case" case is that you have
> a lot of similarity between the old and new tuple, but you miss it because
> you skip. For example, if you change the first few columns, but leave a
> large text column at the end of the tuple unchanged.

I suspect there's no way to have our cake and eat it, too.  Most of
the work that Amit has done on this patch in the last few revs is to
cut back CPU overhead in the cases where the patch can't help because
the tuple has been radically modified.  If we're trying to get maximum
compression, we need to go the other way: for example, we could just
feed both the old and new tuples through pglz (or snappy, or
whatever).  That would allow us to take advantage not only of
similarity between the old and new tuples but also internal
duplication within either the old or the new tuple, but it would also
cost more CPU.  The concern with minimizing overhead in cases where
the compression doesn't help has thus far pushed us in the opposite
direction, namely passing over compression opportunities that a more
aggressive algorithm could find in order to keep the overhead low.

Off-hand, I'm wondering why we shouldn't apply the same skipping
algorithm that Amit is using at the beginning of the string for the
rest of it as well.  It might be a little too aggressive (maybe the
skip distance shouldn't increase by quite as much as doubling every
time, or not beyond 16/32 bytes?) but I don't see why the general
principle isn't sound wherever we are in the tuple.
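
A standalone sketch of what such a capped, doubling skip could look like
(illustrative only; the patch's actual loop differs):

#include <stdbool.h>
#include <stdio.h>

#define MAX_SKIP 16             /* assumed cap, per the 16/32-byte idea */

/* stand-in for a history-table probe; real code would hash a chunk */
static bool
match_at(const char *tup, int pos)
{
    return tup[pos] == 'x';
}

static void
skipping_scan(const char *tup, int len)
{
    int         pos = 0;
    int         skip = 1;

    while (pos < len)
    {
        printf("probe at %d\n", pos);
        if (match_at(tup, pos))
        {
            skip = 1;           /* found a match: back to byte-by-byte */
            pos++;
        }
        else
        {
            pos += skip;
            if (skip < MAX_SKIP)
                skip *= 2;      /* double the stride, up to the cap */
        }
    }
}

int
main(void)
{
    /*
     * Probes land at 0, 1, 3, 7, 15 and then jump past the matching x's
     * at 16..19: the "missed similarity" hazard quoted above.
     */
    skipping_scan("yyyyyyyyyyyyyyyyxxxx", 20);
    return 0;
}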

Unfortunately, despite changing things to make a history entry only
every 4th character, building the history is still pretty expensive.
By the time we even begin looking at the tuple we're gonna compress,
we've already spent something like half the total effort, and of
course we have to go further than that before we know whether our
attempt to compress is actually going anywhere.  I think that's the
central problem here.  pglz has several safeguards to ensure that it
doesn't do too much work in vain: we abort if we find nothing
compressible within first_success_by bytes, or if we emit enough total
output to be certain that we won't meet the need_rate threshold.
Those safeguards are a lot less effective here because they can't be
applied until *after* we've already paid the cost of building the
history.  If we could figure out some way to apply those guards, or
other guards, earlier in the algorithm, we could do a better job
mitigating the worst-case scenarios, but I don't have a good idea.
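
For reference, those two guards look roughly like the standalone sketch
below (simplified; the real logic lives in pg_lzcompress.c):

#include <stdbool.h>
#include <stdio.h>

/*
 * Returns false once compression should be abandoned: either nothing
 * compressible was found early in the input, or the output is already
 * too large to ever save need_rate percent.
 */
static bool
guards_ok(bool found_match, int input_consumed, int output_emitted,
          int slen, int first_success_by, int need_rate)
{
    if (!found_match && input_consumed >= first_success_by)
        return false;

    if (output_emitted > slen - (slen * need_rate) / 100)
        return false;

    return true;
}

int
main(void)
{
    /* no match within the first 32 input bytes: abort */
    printf("%d\n", guards_ok(false, 40, 10, 100, 32, 25));
    /* 80 output bytes for a 100-byte source can't reach a 25% saving */
    printf("%d\n", guards_ok(true, 50, 80, 100, 32, 25));
    return 0;
}

In the delta-encode case, both guards can fire only after the history over
the old tuple has already been built, which is exactly the cost noted above.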

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Gavin Flower
Date:
On 06/02/14 16:59, Robert Haas wrote:
> On Wed, Feb 5, 2014 at 6:43 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> So, I came up with the attached worst case test, modified from your latest
>> test suite.
>>
>> unpatched:
>>
>>
>>                 testname               | wal_generated |     duration
>> --------------------------------------+---------------+------------------
>>   ten long fields, all but one changed |     343385312 | 2.20806908607483
>>   ten long fields, all but one changed |     336263592 | 2.18997097015381
>>   ten long fields, all but one changed |     336264504 | 2.17843413352966
>> (3 rows)
>>
>>
>> pgrb_delta_encoding_v8.patch:
>>
>>                 testname               | wal_generated |     duration
>> --------------------------------------+---------------+------------------
>>   ten long fields, all but one changed |     338356944 | 3.33501315116882
>>   ten long fields, all but one changed |     344059272 | 3.37364101409912
>>   ten long fields, all but one changed |     336257840 | 3.36244201660156
>> (3 rows)
>>
>> So with this test, the overhead is very significant.
> Yuck.  Well that sucks.
>
>> With the skipping logic, another kind of "worst case" case is that you have
>> a lot of similarity between the old and new tuple, but you miss it because
>> you skip. For example, if you change the first few columns, but leave a
>> large text column at the end of the tuple unchanged.
> I suspect there's no way to have our cake and eat it, too.  Most of
> the work that Amit has done on this patch in the last few revs is to
> cut back CPU overhead in the cases where the patch can't help because
> the tuple has been radically modified.  If we're trying to get maximum
> compression, we need to go the other way: for example, we could just
> feed both the old and new tuples through pglz (or snappy, or
> whatever).  That would allow us to take advantage not only of
> similarity between the old and new tuples but also internal
> duplication within either the old or the new tuple, but it would also
> cost more CPU.  The concern with minimizing overhead in cases where
> the compression doesn't help has thus far pushed us in the opposite
> direction, namely passing over compression opportunities that a more
> aggressive algorithm could find in order to keep the overhead low.
>
> Off-hand, I'm wondering why we shouldn't apply the same skipping
> algorithm that Amit is using at the beginning of the string for the
> rest of it as well.  It might be a little too aggressive (maybe the
> skip distance shouldn't increase by quite as much as doubling every
> time, or not beyond 16/32 bytes?) but I don't see why the general
> principle isn't sound wherever we are in the tuple.
>
> Unfortunately, despite changing things to make a history entry only
> every 4th character, building the history is still pretty expensive.
> By the time we even begin looking at the tuple we're gonna compress,
> we've already spent something like half the total effort, and of
> course we have to go further than that before we know whether our
> attempt to compress is actually going anywhere.  I think that's the
> central problem here.  pglz has several safeguards to ensure that it
> doesn't do too much work in vain: we abort if we find nothing
> compressible within first_success_by bytes, or if we emit enough total
> output to be certain that we won't meet the need_rate threshold.
> Those safeguards are a lot less effective here because they can't be
> applied until *after* we've already paid the cost of building the
> history.  If we could figure out some way to apply those guards, or
> other guards, earlier in the algorithm, we could do a better job
> mitigating the worst-case scenarios, but I don't have a good idea.
>
Surely the weighting should be done according to the relative scarcity
of processing power vs I/O bandwidth? I get the impression that
different workloads and hardware configurations may favour conserving
either processor or I/O resources.  Would it be feasible to have
different logic, depending on the trade-offs identified?


Cheers,
Gavin



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thu, Feb 6, 2014 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 02/05/2014 04:48 PM, Amit Kapila wrote:
>>>
>>> I have done one test where there is a large suffix match, but
>>> not large enough that it can compress more than 75% of the string.
>>> The CPU overhead with wal-update-prefix-suffix-encode-1.patch is
>>> not much, but there is no I/O reduction either.
>>
>>
>> Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently
>> I got that backwards in the patch...
>
> So If I understand the code correctly, the new check should be
>
> if (prefixlen + suffixlen < (slen * need_rate) / 100)
>     return false;
>
> rather than
>
> if (slen - prefixlen - suffixlen > (slen * need_rate) / 100)
> return false;

Considering the above change as correct, I have tried to see the worst-case
overhead for this patch by constructing the new tuple such that after
a 25% or so suffix/prefix match there is a small change in the tuple,
keeping the rest of the tuple the same as the old tuple; this shows overhead
for this patch as well.

The updated test script is attached.

Unpatched
             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     348843824 | 5.56866788864136
 ten long fields, 8 bytes changed |     348844800 | 5.84434294700623
 ten long fields, 8 bytes changed |     350500000 | 5.92329406738281
(3 rows)



wal-update-prefix-suffix-encode-1.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     348845624 | 6.92243480682373
 ten long fields, 8 bytes changed |     348847000 | 8.35828399658203
 ten long fields, 8 bytes changed |     350204752 | 7.61826491355896
(3 rows)

One minor point: can we avoid emitting the prefix tag if prefixlen is 0?

+ /* output prefix as a tag */
+ pgrb_out_tag(ctrlp, ctrlb, ctrl, bp, prefixlen, hlen);



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Feb 5, 2014 at 8:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 5, 2014 at 5:13 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 02/05/2014 07:54 AM, Amit Kapila wrote:
>>
>> That's not the worst case, by far.
>>
>> First, note that the skipping while scanning new tuple is only performed in
>> the first loop. That means that as soon as you have a single match, you fall
>> back to hashing every byte. So for the worst case, put one 4-byte field as
>> the first column, and don't update it.
>>
>> Also, I suspect the runtimes in your test were dominated by I/O. When I
>> scale down the number of rows involved so that the whole test fits in RAM, I
>> get much bigger differences with and without the patch. You might also want
>> to turn off full_page_writes, to make the effect clear with less data.
>>
>> So with this test, the overhead is very significant.
>>
>> With the skipping logic, another kind of "worst case" case is that you have
>> a lot of similarity between the old and new tuple, but you miss it because
>> you skip.
>
> This is exactly the reason why I have not kept skipping logic in second
> pass(loop), but I think may be it would have been better to keep it not
> as aggressive as in first pass.

I have tried to merge pass-1 and pass-2 while keeping the skipping logic the
same, and it has reduced the overhead to a good extent, but not completely,
for the new case you have added. This change is to check whether it can
reduce the overhead; if we want to proceed, maybe we can limit the skip
factor so that the chance of skipping some matching data is reduced.

A new version of the patch is attached with this mail.

Unpatched

           testname           | wal_generated |     duration
------------------------------+---------------+------------------
 ten long fields, all changed |     348842856 | 6.93688106536865
 ten long fields, all changed |     348843672 | 7.53063702583313
 ten long fields, all changed |     352662344 | 7.76640701293945
(3 rows)


pgrb_delta_encoding_v8.patch
             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, but one changed |     348848144 | 9.22694897651672
 ten long fields, but one changed |     348841376 | 9.11818099021912
 ten long fields, but one changed |     352963488 | 8.37875485420227
(3 rows)


pgrb_delta_encoding_v9.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, but one changed |     350166320 | 8.84561610221863
 ten long fields, but one changed |     348840728 | 8.45299792289734
 ten long fields, but one changed |     348846656 | 8.34846496582031
(3 rows)


It appears to me that it could be a good idea to merge both patches
(prefix-suffix encoding + delta encoding) in such a way that if we
get reasonable compression (50% or so) with prefix-suffix, then we
can return without doing delta encoding, and if the compression is less
than that, we can do delta encoding for the rest of the tuple. The reason
I think it will be good is that by doing just prefix-suffix we might miss
many cases where good compression is possible.
If you think it is a viable way, then I can merge both patches and
check the results.
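
The control flow I have in mind is roughly as below; this is a standalone
sketch with stand-in encoders, and neither patch is structured exactly
like this:

#include <stdbool.h>
#include <stdio.h>

/* trivial stand-ins for the real encoders in the two patches */
static void
emit_prefix_suffix_tags(int prefixlen, int suffixlen)
{
    printf("prefix tag: %d bytes, suffix tag: %d bytes\n",
           prefixlen, suffixlen);
}

static bool
delta_encode_middle(int midlen)
{
    printf("delta-encoding %d middle bytes against the old tuple\n", midlen);
    return true;
}

static bool
encode_update(int newlen, int prefixlen, int suffixlen)
{
    emit_prefix_suffix_tags(prefixlen, suffixlen);

    /* prefix/suffix alone already saves 50% or so: skip delta encoding */
    if (prefixlen + suffixlen >= newlen / 2)
        return true;

    /* otherwise, delta-encode only the middle part */
    return delta_encode_middle(newlen - prefixlen - suffixlen);
}

int
main(void)
{
    encode_update(200, 30, 20); /* 25% match: falls through to delta */
    encode_update(200, 80, 40); /* 60% match: prefix/suffix alone is enough */
    return 0;
}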



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thu, Feb 6, 2014 at 5:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Feb 6, 2014 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Considering above change as correct, I have tried to see the worst
> case overhead for this patch by having new tuple such that after
> 25% or so of suffix/prefix match, there is a small change in tuple
> and kept rest of tuple same as old tuple and it shows overhead
> for this patch as well.
>
> Updated test script is attached.
>
> Unpatched
>              testname             | wal_generated |     duration
> ----------------------------------+---------------+------------------
>  ten long fields, 8 bytes changed |     348843824 | 5.56866788864136
>  ten long fields, 8 bytes changed |     348844800 | 5.84434294700623
>  ten long fields, 8 bytes changed |     350500000 | 5.92329406738281
> (3 rows)
>
>
>
> wal-update-prefix-suffix-encode-1.patch
>
>              testname             | wal_generated |     duration
> ----------------------------------+---------------+------------------
>  ten long fields, 8 bytes changed |     348845624 | 6.92243480682373
>  ten long fields, 8 bytes changed |     348847000 | 8.35828399658203
>  ten long fields, 8 bytes changed |     350204752 | 7.61826491355896
> (3 rows)
>
> One minor point: can we avoid emitting the prefix tag if prefixlen is 0?
>
> + /* output prefix as a tag */
> + pgrb_out_tag(ctrlp, ctrlb, ctrl, bp, prefixlen, hlen);

I think generating the out tag for suffix/prefix has one bug, i.e., it doesn't
consider the max length of 273 bytes (PGLZ_MAX_MATCH), which
is mandatory for the LZ format.
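
The fix would be to split a long match into multiple tags, along the lines
of the standalone sketch below (the real emission happens via pgrb_out_tag,
and the names here are illustrative):

#include <stdio.h>

#define PGLZ_MAX_MATCH 273

/* stand-in for the real tag emitter */
static void
emit_tag(int off, int len)
{
    printf("tag: offset=%d length=%d\n", off, len);
}

/* Split a long match into chunks the LZ tag format can represent. */
static void
emit_match(int off, int matchlen)
{
    while (matchlen > 0)
    {
        int         thislen = (matchlen > PGLZ_MAX_MATCH) ?
                                PGLZ_MAX_MATCH : matchlen;

        emit_tag(off, thislen);
        off += thislen;
        matchlen -= thislen;
    }
}

int
main(void)
{
    emit_match(0, 600);         /* 600-byte prefix -> tags of 273 + 273 + 54 */
    return 0;
}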

One more point about this patch: in the function pgrb_delta_encode(),
is it mandatory to return false at the end in the below check?

if (result_size > result_max)
    return false;

I mean to say that before starting to copy literal bytes we have
already ensured that the compression saves at least 25%, so maybe
we can avoid this check. I have tried to take the data after removing
this check and found that it reduces the overhead and improves
WAL reduction as well. The data is as below (compare this with the data
in the above mail for the unpatched version):

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     300705552 | 6.51416897773743
 ten long fields, 8 bytes changed |     300703816 | 6.85267090797424
 ten long fields, 8 bytes changed |     300701840 | 7.15832996368408
(3 rows)

If we want to go with this approach, then I think apart from the above
points there is no major change required (maybe some comments,
function names, etc. can be improved).

> But if we really just want to do prefix/suffix compression, this is a
> crappy and expensive way to do it.  We needn't force everything
> through the pglz tag format just because we elide a common prefix or
> suffix.

Here, are you bothered about the below code, where the patch is
doing a byte-by-byte copy after the prefix/suffix match?

/* output bytes between prefix and suffix as literals */
dp = &source[prefixlen];
dend = &source[slen - suffixlen];
while (dp < dend)
{
    pgrb_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
    dp++;    /* Do not do this ++ in the line above! */
}

I think if we want to change the LZ format, it will be a bit more work,
and verification of the decoding will have to be done much more strenuously.

Note - During the performance test, I have focused mainly on the worst case,
because we already know that this idea is good for the best and average cases.
However, if we decide that this is better and good to proceed with, I can take
the data for the other cases as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Bruce Momjian
Date:
On Wed, Feb  5, 2014 at 10:57:57AM -0800, Peter Geoghegan wrote:
> On Wed, Feb 5, 2014 at 12:50 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> >> I think there's zero overlap. They're completely complementary features.
> >> It's not like normal WAL records have an irrelevant volume.
> >
> >
> > Correct. Compressing a full-page image happens on the first update after a
> > checkpoint, and the diff between old and new tuple is not used in that case.
> 
> Uh, I really just meant that one thing that might overlap is
> considerations around the choice of compression algorithm. I think
> that there was some useful discussion of that on the other thread as
> well.

Yes, that was my point.  I thought the compression of full-page images
was a huge win and that the compression was pretty straightforward, except
for the compression algorithm.  If the compression algorithm issue is
resolved, can we move forward with the full-page compression patch?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tue, Feb 11, 2014 at 10:07 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Feb  5, 2014 at 10:57:57AM -0800, Peter Geoghegan wrote:
>> On Wed, Feb 5, 2014 at 12:50 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>> >> I think there's zero overlap. They're completely complementary features.
>> >> It's not like normal WAL records have an irrelevant volume.
>> >
>> >
>> > Correct. Compressing a full-page image happens on the first update after a
>> > checkpoint, and the diff between old and new tuple is not used in that case.
>>
>> Uh, I really just meant that one thing that might overlap is
>> considerations around the choice of compression algorithm. I think
>> that there was some useful discussion of that on the other thread as
>> well.
>
> Yes, that was my point.  I thought the compression of full-page images
> was a huge win and that the compression was pretty straightforward, except
> for the compression algorithm.  If the compression algorithm issue is
> resolved,

By issue, I assume you mean which compression algorithm is
best for this patch.
For this patch, we currently have two algorithms for which results have been
posted. As far as I understand Heikki is pretty sure that the latest algorithm
(compression using prefix-suffix match in old and new tuple) used for this
patch is better than the other algorithm in terms of CPU gain or overhead.
The performance data taken by me for the worst case for this algorithm
shows there is a CPU overhead for this algorithm as well.

OTOH the other algorithm (compression using the old tuple as history) can be
a bigger win in terms of I/O reduction in more cases.

In short, it is still not decided which algorithm to choose and whether
it can be enabled by default or whether it is better to have a table-level
switch to enable/disable it.

So I think the decision to be taken here is about below points:
1.  Are we okay with I/O reduction at the expense of CPU for *worst* cases
     and I/O reduction without impacting CPU (better overall tps) for
     *favourable* cases?
2.  If we are not okay with worst case behaviour, then can we provide
     a table-level switch, so that it can be decided by user?
3.  If none of above, then is there any other way to mitigate the worst
     case behaviour or shall we just reject this patch and move on.

Given a choice, I would like to go with option-2, because I think
for most cases an UPDATE statement will have the same data in the old and
new tuples except for some part of the tuple (generally columns having large
text data are not modified), so we will mostly end up in favourable cases;
and surely for worst cases we don't want the user to suffer CPU overhead,
so a table-level switch is also required.

I think one might argue here that for some users it is not feasible to
decide whether their tuple data for UPDATE is going to be similar
or completely different, and they are not at all ready for any risk of
CPU overhead but would be happy to see I/O reduction, in which
case it is difficult to decide what the value of the table-level
switch should be. Here I think the only answer is "nothing is free" in
this world, so either make sure about the application's behaviour for
UPDATE statements before going to production or just don't enable this
switch and be happy with the current behaviour.

On the other side, there will be users who are pretty certain about
their usage of the UPDATE statement or at least are ready to evaluate
their application if they can get such a huge gain, so it would be quite
a useful feature for such users.

>can we move forward with the full-page compression patch?

In my opinion, it is not certain that whatever compression algorithm got
decided for this patch (if any) can be directly used for full-page
compression; some ideas could be used, or maybe the algorithm could be
tweaked a bit to make it usable for full-page compression.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Bruce Momjian
Date:
On Wed, Feb 12, 2014 at 10:02:32AM +0530, Amit Kapila wrote:
> By issue, I assume you mean which compression algorithm is
> best for this patch.
> For this patch, we currently have two algorithms for which results have been
> posted. As far as I understand Heikki is pretty sure that the latest algorithm
> (compression using prefix-suffix match in old and new tuple) used for this
> patch is better than the other algorithm in terms of CPU gain or overhead.
> The performance data taken by me for the worst case for this algorithm
> shows there is a CPU overhead for this algorithm as well.
> 
> OTOH the other algorithm (compression using the old tuple as history) can be
> a bigger win in terms of I/O reduction in more cases.
> 
> In short, it is still not decided which algorithm to choose and whether
> it can be enabled by default or whether it is better to have a table-level
> switch to enable/disable it.
> 
> So I think the decision to be taken here is about below points:
> 1.  Are we okay with I/O reduction at the expense of CPU for *worst* cases
>      and I/O reduction without impacting CPU (better overall tps) for
>      *favourable* cases?
> 2.  If we are not okay with worst case behaviour, then can we provide
>      a table-level switch, so that it can be decided by user?
> 3.  If none of above, then is there any other way to mitigate the worst
>      case behaviour or shall we just reject this patch and move on.
> 
> Given a choice, I would like to go with option-2, because I think
> for most cases an UPDATE statement will have the same data in the old and
> new tuples except for some part of the tuple (generally columns having large
> text data are not modified), so we will mostly end up in favourable cases;
> and surely for worst cases we don't want the user to suffer CPU overhead,
> so a table-level switch is also required.

I think 99.9% of users are never going to adjust this so we had better
choose something we are happy to enable for effectively everyone.  In my
reading, prefix/suffix seemed safe for everyone.  We can always revisit
this if we think of something better later, as WAL format changes are not
a problem for pg_upgrade.

I also think that knowing when to adjust a user-tunable setting is so
hard for users that it is almost not worth the user-interface complexity
it adds.

I suggest we go with always-on prefix/suffix mode, then add some check
so the worst case is avoided by just giving up on compression.

As I said previously, I think compressing the page images is the next
big win in this area.

> I think one might argue here that for some users it is not feasible to
> decide whether their tuple data for UPDATE is going to be similar
> or completely different, and they are not at all ready for any risk of
> CPU overhead but would be happy to see I/O reduction, in which
> case it is difficult to decide what the value of the table-level
> switch should be. Here I think the only answer is "nothing is free" in
> this world, so either make sure about the application's behaviour for
> UPDATE statements before going to production or just don't enable this
> switch and be happy with the current behaviour.

Again, can't we do a minimal attempt at prefix/suffix compression so
there is no measurable overhead?

> On the other side, there will be users who are pretty certain about
> their usage of the UPDATE statement or at least are ready to evaluate
> their application if they can get such a huge gain, so it would be quite
> a useful feature for such users.
> 
> >can we move forward with the full-page compression patch?
> 
> In my opinion, it is not certain that whatever compression algorithm got
> decided for this patch (if any) can be directly used for full-page
> compression; some ideas could be used, or maybe the algorithm could be
> tweaked a bit to make it usable for full-page compression.

Thanks, I understand that now.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Feb 12, 2014 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Feb 12, 2014 at 10:02:32AM +0530, Amit Kapila wrote:
>
> I think 99.9% of users are never going to adjust this so we had better
> choose something we are happy to enable for effectively everyone.  In my
> reading, prefix/suffix seemed safe for everyone.  We can always revisit
> this if we think of something better later, as WAL format changes are not
> a problem for pg_upgrade.

Agreed.

> I also think that knowing when to adjust a user-tunable setting is so
> hard for users that it is almost not worth the user-interface complexity it adds.
>
> I suggest we go with always-on prefix/suffix mode, then add some check
> so the worst case is avoided by just giving up on compression.
>
> As I said previously, I think compressing the page images is the next
> big win in this area.
>
>> I think one might argue here that for some users it is not feasible to
>> decide whether their tuple data for UPDATE is going to be similar
>> or completely different, and they are not at all ready for any risk of
>> CPU overhead but would be happy to see I/O reduction, in which
>> case it is difficult to decide what the value of the table-level
>> switch should be. Here I think the only answer is "nothing is free" in
>> this world, so either make sure about the application's behaviour for
>> UPDATE statements before going to production or just don't enable this
>> switch and be happy with the current behaviour.
>
> Again, can't we do a minimal attempt at prefix/suffix compression so
> there is no measurable overhead?

Yes, currently it is set at 25%, which means there must be at least a
25% prefix/suffix match before we consider the tuple for compression;
that check is pretty fast and adds almost no overhead. The worst case
here is the other way around, i.e., when the string has a 25%
prefix/suffix match but no match after that, or at least none in the
next few bytes.

For example, consider below 2 cases:

Case-1

old tuple
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

new tuple
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbaaaaaaaaaaaaaaaaaaaaaaaaa

Here there is a suffix match for 25% of the string, but no match after
that, so we have to copy the remaining 75% of the bytes as-is,
byte-by-byte. With somewhat longer tuples (800 bytes), the performance
data I took shows around ~11% CPU overhead. As this is a fabricated test
just to see how much extra CPU the worst scenario consumes, in reality
users might not see this, at least with synchronous commit on, because
there is always some I/O involved at the end of a transaction (unless
there is some error in between or the user rolls back the transaction,
the chances of which are very low).


The first thing that comes to mind after seeing the above scenario is:
why not increase the minimum limit of 25%, since comparing prefix/suffix
has almost negligible overhead? I tried increasing it to 35% or more,
but then it starts falling over on the other side, e.g., for cases where
there is a 34% match we still return without compressing.

One improvement which can be done here is that after the prefix/suffix
match, instead of copying byte-by-byte as per the LZ format, we could
directly copy the whole remaining part of the tuple. I think that would
require a format different from LZ, which is also not too difficult to
do, but the question is whether we really need such a change to handle
the above kind of worst case.
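
To illustrate, a rough sketch of such an encoder (all names here are
illustrative, not the patch's; the real patch emits the pglz
control-byte format instead):

#include <string.h>

/*
 * Sketch only: after finding the common prefix and suffix, copy the
 * changed middle of the new tuple with one memcpy instead of emitting
 * per-byte LZ literals.
 */
static int
encode_prefix_suffix(const char *olddata, int oldlen,
                     const char *newdata, int newlen, char *dest)
{
    int     minlen = (oldlen < newlen) ? oldlen : newlen;
    int     prefixlen = 0;
    int     suffixlen = 0;
    int     midlen;

    while (prefixlen < minlen && olddata[prefixlen] == newdata[prefixlen])
        prefixlen++;
    while (suffixlen < minlen - prefixlen &&
           olddata[oldlen - suffixlen - 1] == newdata[newlen - suffixlen - 1])
        suffixlen++;

    midlen = newlen - prefixlen - suffixlen;
    memcpy(dest, &prefixlen, sizeof(int));               /* prefix length */
    memcpy(dest + sizeof(int), &suffixlen, sizeof(int)); /* suffix length */
    memcpy(dest + 2 * sizeof(int), newdata + prefixlen, midlen);
    return (int) (2 * sizeof(int)) + midlen;             /* encoded size */
}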


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Claudio Freire
Date:
On Thu, Feb 13, 2014 at 1:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Here one of the improvements which can be done is that after prefix-suffix
> match, instead of going byte-by-byte copy as per LZ format we can directly
> copy all the remaining part of tuple but I think that would require us to use
> some different format than LZ which is also not too difficult to do, but the
> question is do we really need such a change to handle the above kind of
> worst case.


Why use LZ at all? Why not *only* prefix/suffix?



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thu, Feb 13, 2014 at 10:07 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Feb 13, 2014 at 1:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Here one of the improvements which can be done is that after prefix-suffix
>> match, instead of going byte-by-byte copy as per LZ format we can directly
>> copy all the remaining part of tuple but I think that would require us to use
>> some different format than LZ which is also not too difficult to do, but the
>> question is do we really need such a change to handle the above kind of
>> worst case.
>
>
> Why use LZ at all?

We are just using the LZ *format* to represent the compressed string.
Here is some text copied from pg_lzcompress.c to explain exactly what
we are using:

"the first byte after the header tells what to dothe next 8 times. We call this the control byte.

An unset bit in the control byte means, that one uncompressed
byte follows, which is copied from input to output.
A set bit in the control byte means, that a tag of 2-3 bytes
follows. A tag contains information to copy some bytes, that
are already in the output buffer, to the current location in
the output."

> Why not *only* prefix/suffix?

To represent a prefix/suffix match, we at least need a way to record
the offset and length of the matched bytes, and the length of the
unmatched bytes we have copied. I agree that a simpler format could be
devised if we just want to do prefix-suffix matching, but that would
require much more testing of decoding during recovery to ensure
everything is fine; the advantage of the LZ format is that we don't
need to bother about decoding, as it works without much change to the
LZ decode routine.
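
For illustration, a minimal sketch of how a decoder walks this
control-byte format (simplified, not the actual pg_lzcompress.c code;
the 3-byte extended tags for matches longer than 17 bytes are omitted):

/*
 * Sketch only: each control byte governs the next 8 items. A clear bit
 * means one literal byte follows; a set bit means a 2-byte tag that
 * copies "len" already-output bytes from "off" positions back.
 */
static void
lz_decode_sketch(const unsigned char *src, int srclen, unsigned char *dest)
{
    const unsigned char *sp = src;
    const unsigned char *srcend = src + srclen;
    unsigned char *dp = dest;

    while (sp < srcend)
    {
        unsigned char ctrl = *sp++;
        int           bit;

        for (bit = 0; bit < 8 && sp < srcend; bit++, ctrl >>= 1)
        {
            if (ctrl & 1)
            {
                int     len = (sp[0] & 0x0f) + 3;
                int     off = ((sp[0] & 0xf0) << 4) | sp[1];

                sp += 2;
                while (len-- > 0)       /* ranges may overlap */
                {
                    *dp = dp[-off];
                    dp++;
                }
            }
            else
                *dp++ = *sp++;          /* literal byte */
        }
    }
}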

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Tue, Feb 11, 2014 at 11:37 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Yes, that was my point.  I thought the compression of full-page images
> was a huge win and that the compression was pretty straightforward, except
> for the compression algorithm.  If the compression algorithm issue is
> resolved, can we move forward with the full-page compression patch?

Discussion of the full-page compression patch properly belongs on that
thread rather than this one.  However, based on what we've discovered
so far here, I won't be very surprised if that patch turns out to have
serious problems with CPU consumption.  The evidence from this thread
suggests that making even relatively lame attempts at compression is
extremely costly in terms of CPU overhead.  Now, the issues with
straight-up compression are somewhat different than for delta
compression and, in particular, it's easier to bail out of straight-up
compression sooner if things aren't working out.  But even with all
that, I expect it to be not too difficult to find cases where some
compression is achieved but with a dramatic increase in runtime on
CPU-bound workloads.  Which is basically the same problem this patch
has.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Mon, Feb 10, 2014 at 10:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think if we want to change the LZ format, it will be a bit more work,
> and decoding will have to be verified much more rigorously.

I don't think it'll be that big of a deal.  And anyway, the evidence
here suggests that we still need more speed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Bruce Momjian
Date:
On Thu, Feb 13, 2014 at 10:20:46AM +0530, Amit Kapila wrote:
> > Why not *only* prefix/suffix?
> 
> To represent a prefix/suffix match, we at least need a way to record
> the offset and length of the matched bytes, and the length of the
> unmatched bytes we have copied. I agree that a simpler format could be
> devised if we just want to do prefix-suffix matching, but that would
> require much more testing of decoding during recovery to ensure
> everything is fine; the advantage of the LZ format is that we don't
> need to bother about decoding, as it works without much change to the
> LZ decode routine.

Based on the numbers I think prefix/suffix-only needs to be explored. 
Consider if you just change one field of a row --- prefix/suffix would
find all the matching parts.  If you change the first and last fields,
you get no compression at all, but your prefix/suffix test isn't going
to get that either.

As I understand it, the only case where prefix/suffix with LZ compression
is a win over prefix/suffix-only is when you change two middle fields, and
there are common fields unchanged between them.  If we are looking at
11% CPU overhead for that, it isn't worth it.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thu, Feb 13, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 10, 2014 at 10:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think if we want to change LZ format, it will be bit more work and
>> verification for decoding has to be done much more strenuously.
>
> I don't think it'll be that big of a deal.  And anyway, the evidence
> here suggests that we still need more speed.

Okay. I did one small hack in the patch (for the unmatched part, copy
it directly to the destination buffer instead of pushing it through LZ,
i.e., memcpy the unchanged data into the destination buffer) to find out
whether a format change, or doing memcpy instead of byte-by-byte copy,
can give us any benefit, and found that it can, though maybe not a very
large one. We cannot change it like this if we have to make a change to
the format; this is just a quick hack to see whether such a change can
give us a benefit.

The data fluctuates as this is a purely CPU-bound test, so I ran the
same test five times and took the best data for all 3 patches.
An explanation of the changes in the 2 patches other than master is
given after the data:

Performance Data
-----------------------------
Non-Default settings
checkpoint_segments = 128
checkpoint_timeout     = 15 min
full_page_writes = off

Unpatched
             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     348847264 | 5.30486917495728
 ten long fields, 8 bytes changed |     348848384 | 5.42504191398621
 ten long fields, 8 bytes changed |     348841384 | 5.59665489196777
(3 rows)

wal-update-prefix-suffix-encode-2.patch

            testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     300706992 | 5.83324003219604
 ten long fields, 8 bytes changed |     303039200 |  5.8794629573822
 ten long fields, 8 bytes changed |     300707256 | 6.04627680778503
(3 rows)

wal-update-prefix-suffix-encode-3.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     271815824 | 4.74523997306824
 ten long fields, 8 bytes changed |     273221608 | 5.36515283584595
 ten long fields, 8 bytes changed |     271818664 | 5.76620006561279
(3 rows)

Changes in wal-update-prefix-suffix-encode-2.patch
1. Remove the check at the end of pgrb_delta_encode() that tests whether
the encoded buffer holds more than 75% of the tuple data; before starting
to copy literal bytes we have already ensured that the compression ratio
is at least 25%, so there should not be any harm in avoiding this check.

Changes in wal-update-prefix-suffix-encode-3.patch
1. Kept the change from wal-update-prefix-suffix-encode-2.patch
2. Changed the copying of unmatched literal bytes to memcpy

Considering the median data for all patches, there is a CPU overhead of
8.37% with version-2 and a CPU gain of 1.11% with version-3 of the
patch. There is a small catch here: even if we do change the LZ format
for prefix-suffix encoding, the CPU data shown above with memcpy might
not stay the same; rather, it will depend on whether we can come up with
a good format that gives the same benefit that direct memcpy gives.

One of the ideas for a change in format:

Tag for prefix/suffix match:
  12 bits - offset
  12 bits - length
Value for unmatched data:
  1 or 2 bytes for length, depending on the length of data (the first
  bit can indicate whether we need 1 byte or 2 bytes)
  data
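
A sketch of how such tags could be packed (hypothetical; the exact byte
layout is illustrative only):

/*
 * Sketch only: pack a prefix/suffix match tag as a 12-bit offset plus a
 * 12-bit length into 3 bytes, and an unmatched-data header as 1 or 2
 * length bytes (high bit of the first byte set when 2 bytes are used).
 */
static char *
out_match_tag(char *bp, unsigned int off, unsigned int len)
{
    bp[0] = (char) (off >> 4);                          /* offset bits 11..4 */
    bp[1] = (char) (((off & 0x0f) << 4) | (len >> 8));  /* offset low, length high */
    bp[2] = (char) (len & 0xff);                        /* length bits 7..0 */
    return bp + 3;
}

static char *
out_literal_header(char *bp, unsigned int len)
{
    if (len < 0x80)
        *bp++ = (char) len;                 /* 1-byte length */
    else
    {
        *bp++ = (char) (0x80 | (len >> 8)); /* 2-byte length, high bit set */
        *bp++ = (char) (len & 0xff);
    }
    return bp;
}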

Now considering the above format, let us see how much difference in
size it would create compared to the LZ format. For example, consider
the data of the current worst case:
Suffix match ~ 200 bytes
unmatched data ~ 600 bytes

To represent the suffix match, both formats take the same number of
bytes. For the unmatched data, the LZ format needs one control bit per
uncompressed byte, which is ~75 extra bytes here, whereas the above
changed format takes only 2 bytes; and the more uncompressed data there
is, the more extra bytes the LZ format takes. However, for a few
unchanged bytes (<64), I think the LZ format will use a smaller number
of bits, but in that case we get compression anyway, so losing a few
bits should not matter.

I think a CPU overhead of less than 5% for the worst case could have
been considered acceptable, and this is a bit higher than that; but do
you think it is so high that it deserves a change in format?

One more idea I have in mind but have not yet tried for the
prefix-suffix match is to use a minimum compression ratio of 30% rather
than 25%; I am not sure whether that can reduce the worst-case overhead
to less than 5% without losing on any other case.

The test used is the same as provided in the mail:
http://www.postgresql.org/message-id/CAA4eK1+k5-Jo3SLHFuSK2Y59TL+zctVVBFGwXawH6KhrLnW6=w@mail.gmail.com

Patches for v-2 and v-3 are attached.


Below is the data for 5 runs with all the patches; this is just to show
the fluctuation in the data:

Unpatched

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     348844424 |  6.0697078704834
 ten long fields, 8 bytes changed |     348845440 | 6.25980114936829
 ten long fields, 8 bytes changed |     348846632 | 6.28065395355225
(3 rows)

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     352182832 | 7.78950119018555
 ten long fields, 8 bytes changed |     348841592 | 6.33335590362549
 ten long fields, 8 bytes changed |     348842592 | 5.47767996788025
(3 rows)


             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     352481368 | 6.10013723373413
 ten long fields, 8 bytes changed |     348845216 | 6.23139500617981
 ten long fields, 8 bytes changed |     348846328 | 7.20329117774963
(3 rows)

            testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     352780032 | 5.71489500999451
 ten long fields, 8 bytes changed |     348848256 | 6.01294183731079
 ten long fields, 8 bytes changed |     348845640 | 5.97938108444214
(3 rows)

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     348847264 | 5.30486917495728
 ten long fields, 8 bytes changed |     348848384 | 5.42504191398621
 ten long fields, 8 bytes changed |     348841384 | 5.59665489196777
(3 rows)

wal-update-prefix-suffix-encode-2.patch

            testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     300706992 | 5.83324003219604
 ten long fields, 8 bytes changed |     303039200 |  5.8794629573822
 ten long fields, 8 bytes changed |     300707256 | 6.04627680778503
(3 rows)


             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     300703744 | 7.27797102928162
 ten long fields, 8 bytes changed |     300701984 |  7.3160879611969
 ten long fields, 8 bytes changed |     300700360 | 7.88055396080017
(3 rows)

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     300705024 | 7.86505889892578
 ten long fields, 8 bytes changed |     300702544 | 7.78658819198608
 ten long fields, 8 bytes changed |     300700128 | 6.14991092681885
(3 rows)

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     300700520 | 6.61981701850891
 ten long fields, 8 bytes changed |     301010008 | 6.38593101501465
 ten long fields, 8 bytes changed |     300705136 | 6.31078720092773
(3 rows)

            testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     300705512 | 5.61318206787109
 ten long fields, 8 bytes changed |     300703776 |  6.2267439365387
 ten long fields, 8 bytes changed |     300701240 |  6.4169659614563
(3 rows)

wal-update-prefix-suffix-encode-3.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     271821064 | 6.24568295478821
 ten long fields, 8 bytes changed |     271818992 | 6.68939399719238
 ten long fields, 8 bytes changed |     271816880 | 6.63792490959167
(3 rows)

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     271819992 | 5.78784203529358
 ten long fields, 8 bytes changed |     271822232 | 4.71433019638062
 ten long fields, 8 bytes changed |     271820128 | 5.84002709388733
(3 rows)

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     271815824 | 4.74523997306824
 ten long fields, 8 bytes changed |     273221608 | 5.36515283584595
 ten long fields, 8 bytes changed |     271818664 | 5.76620006561279
(3 rows)

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     271818872 | 5.49491405487061
 ten long fields, 8 bytes changed |     271816776 | 6.59977793693542
 ten long fields, 8 bytes changed |     271822752 |  5.1178731918335
(3 rows)

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     275747216 | 6.48244714736938
 ten long fields, 8 bytes changed |     274589280 | 5.66005206108093
 ten long fields, 8 bytes changed |     271818400 | 5.08064913749695
(3 rows)


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
Hi,


Some quick review comments:

On 2014-02-13 18:14:54 +0530, Amit Kapila wrote:
> +    /*
> +     * EWT can be generated for all new tuple versions created by Update
> +     * operation. Currently we do it when both the old and new tuple versions
> +     * are on same page, because during recovery if the page containing old
> +     * tuple is corrupt, it should not cascade that corruption to other pages.
> +     * Under the general assumption that for long runs most updates tend to
> +     * create new tuple version on same page, there should not be significant
> +     * impact on WAL reduction or performance.
> +     *
> +     * We should not generate EWT when we need to backup the whole block in
> +     * WAL as in that case there is no saving by reduced WAL size.
> +     */
> +
> +    if (RelationIsEnabledForWalCompression(reln) &&
> +        (oldbuf == newbuf) &&
> +        !XLogCheckBufferNeedsBackup(newbuf))
> +    {
> +        uint32        enclen;

You should note that the check for RelationIsEnabledForWalCompression()
here is racy and that that's ok because the worst that can happen is
that a delta is uselessly generated.
>      xlrec.target.node = reln->rd_node;
>      xlrec.target.tid = oldtup->t_self;
>      xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
> @@ -6619,6 +6657,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
>      xlrec.newtid = newtup->t_self;
>      if (new_all_visible_cleared)
>          xlrec.flags |= XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED;
> +    if (compressed)
> +        xlrec.flags |= XLOG_HEAP_DELTA_ENCODED;

I think this also needs to unset XLOG_HEAP_CONTAINS_NEW_TUPLE and be
conditional on !need_tuple_data.



>  /*
> + * Determine whether the buffer referenced has to be backed up. Since we don't
> + * yet have the insert lock, fullPageWrites and forcePageWrites could change
> + * later, but will not cause any problem because this function is used only to
> + * identify whether EWT is required for update.
> + */
> +bool
> +XLogCheckBufferNeedsBackup(Buffer buffer)
> +{

Should note very, very boldly that this can only be used in contexts
where a race is acceptable.

> diff --git a/src/backend/utils/adt/pg_rbcompress.c b/src/backend/utils/adt/pg_rbcompress.c
> new file mode 100644
> index 0000000..877ccd7
> --- /dev/null
> +++ b/src/backend/utils/adt/pg_rbcompress.c
> @@ -0,0 +1,355 @@
> +/* ----------
> + * pg_rbcompress.c -
> + *
> + *        This is a delta encoding scheme specific to PostgreSQL and designed
> + *        to compress similar tuples. It can be used as it is or extended for
> + *        other purposes in PostgreSQL if required.
> + *
> + *        Currently, this just checks for a common prefix and/or suffix, but
> + *        the output format is similar to the LZ format used in pg_lzcompress.c.
> + *
> + * Copyright (c) 1999-2014, PostgreSQL Global Development Group
> + *
> + * src/backend/utils/adt/pg_rbcompress.c
> + * ----------
> + */

This needs significantly more explanations about the algorithm and the
reasoning behind it.


> +static const PGRB_Strategy strategy_default_data = {
> +    32,                            /* Data chunks less than 32 bytes are not
> +                                 * compressed */
> +    INT_MAX,                    /* No upper limit on what we'll try to
> +                                 * compress */
> +    35,                            /* Require 25% compression rate, or not worth
> +                                 * it */
> +};

compression rate looks like it's mismatch between comment and code.

> +/* ----------
> + * pgrb_out_ctrl -
> + *
> + *        Outputs the last and allocates a new control byte if needed.
> + * ----------
> + */
> +#define pgrb_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
> +do { \
> +    if ((__ctrl & 0xff) == 0)                                                \
> +    {                                                                        \
> +        *(__ctrlp) = __ctrlb;                                                \
> +        __ctrlp = (__buf)++;                                                \
> +        __ctrlb = 0;                                                        \
> +        __ctrl = 1;                                                            \
> +    }                                                                        \
> +} while (0)
> +

double underscore variables are reserved for the compiler and os.

> +/* ----------
> + * pgrb_out_literal -
> + *
> + *        Outputs a literal byte to the destination buffer including the
> + *        appropriate control bit.
> + * ----------
> + */
> +#define pgrb_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
> +do { \
> +    pgrb_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf);                                \
> +    *(_buf)++ = (unsigned char)(_byte);                                        \
> +    _ctrl <<= 1;                                                            \
> +} while (0)
> +
> +
> +/* ----------
> + * pgrb_out_tag -
> + *
> + *        Outputs a backward reference tag of 2-4 bytes (depending on
> + *        offset and length) to the destination buffer including the
> + *        appropriate control bit.
> + * ----------
> + */
> +#define pgrb_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
> +do { \
> +    pgrb_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf);                                \
> +    _ctrlb |= _ctrl;                                                        \
> +    _ctrl <<= 1;                                                            \
> +    if (_len > 17)                                                            \
> +    {                                                                        \
> +        (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f);        \
> +        (_buf)[1] = (unsigned char)(((_off) & 0xff));                        \
> +        (_buf)[2] = (unsigned char)((_len) - 18);                            \
> +        (_buf) += 3;                                                        \
> +    } else {                                                                \
> +        (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
> +        (_buf)[1] = (unsigned char)((_off) & 0xff);                            \
> +        (_buf) += 2;                                                        \
> +    }                                                                        \
> +} while (0)
> +

What's the reason to use macros here? Just use inline functions when
dealing with file-local stuff.

> +/* ----------
> + * pgrb_delta_encode - find common prefix/suffix between inputs and encode.
> + *
> + *    source is the input data to be compressed
> + *    slen is the length of source data
> + *  history is the data which is used as reference for compression
> + *    hlen is the length of history data
> + *    The encoded result is written to dest, and its length is returned in
> + *    finallen.
> + *    The return value is TRUE if compression succeeded,
> + *    FALSE if not; in the latter case the contents of dest
> + *    are undefined.
> + *    ----------
> + */
> +bool
> +pgrb_delta_encode(const char *source, int32 slen,
> +                  const char *history, int32 hlen,
> +                  char *dest, uint32 *finallen,
> +                  const PGRB_Strategy *strategy)
> +{
> +    unsigned char *bp = ((unsigned char *) dest);
> +    unsigned char *bstart = bp;
> +    const char *dp = source;
> +    const char *dend = source + slen;
> +    const char *hp = history;
> +    unsigned char ctrl_dummy = 0;
> +    unsigned char *ctrlp = &ctrl_dummy;
> +    unsigned char ctrlb = 0;
> +    unsigned char ctrl = 0;
> +    int32        result_size;
> +    int32        result_max;
> +    int32        need_rate;
> +    int            prefixlen;
> +    int            suffixlen;
> +
> +    /*
> +     * Tuples of length greater than PGRB_HISTORY_SIZE are not allowed for
> +     * delta encode as this is the maximum size of history offset.
> +     * XXX: still true?
> +     */

Why didn't you define a maximum tuple size in the strategy definition
above then?

> +    if (hlen >= PGRB_HISTORY_SIZE || hlen < PGRB_MIN_MATCH)
> +        return false;
> +
> +    /*
> +     * Our fallback strategy is the default.
> +     */
> +    if (strategy == NULL)
> +        strategy = PGRB_strategy_default;
>
> +    /*
> +     * If the strategy forbids compression (at all or if source chunk size out
> +     * of range), fail.
> +     */
> +    if (slen < strategy->min_input_size ||
> +        slen > strategy->max_input_size)
> +        return false;
> +
> +    need_rate = strategy->min_comp_rate;
> +    if (need_rate < 0)
> +        need_rate = 0;
> +    else if (need_rate > 99)
> +        need_rate = 99;

Is there really need for all this stuff here? This is so specific to the
usecase that I have significant doubts that all the pglz boiler plate
makes much sense.

> +    /*
> +     * Compute the maximum result size allowed by the strategy, namely the
> +     * input size minus the minimum wanted compression rate.  This had better
> +     * be <= slen, else we might overrun the provided output buffer.
> +     */
> +    /*if (slen > (INT_MAX / 100))
> +    {
> +        /* Approximate to avoid overflow */
> +        /*result_max = (slen / 100) * (100 - need_rate);
> +    }
> +    else
> +    {
> +        result_max = (slen * (100 - need_rate)) / 100;
> +    }*/

err?

> +--
> +-- Test to update continuos and non continuos columns
> +--

*continuous

I have to admit, I have serious doubts about this approach. I have a
very hard time believing this won't cause performance regression in many
common cases... More importantly I don't think doing the compression on
this level is that interesting. I know Heikki argued for it, but I think
extending the bitmap that's computed for HOT to cover all columns and
doing this on a column level sounds much more sensible to me.

Greetings,

Andres Freund



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Sat, Feb 15, 2014 at 8:21 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
>
> Some quick review comments:
Thanks for the review. I shall handle/reply to the comments with the
updated version, in which I am planning to fix a bug (right now I am
preparing a test to reproduce it) in this code.
Bug:
A tag can handle a maximum length of 273 bytes, but this patch is not
considering that.

> I have to admit, I have serious doubts about this approach. I have a
> very hard time believing this won't cause performance regression in many
> common cases...

Actually, till now I was mostly focusing on the worst case (i.e., at
the boundary of the compression ratio), thinking that most other cases
would do well. However, I shall produce data for much more common cases
as well. Please let me know if you have anything specific in mind where
this will not work well.

>More importantly I don't think doing the compression on
> this level is that interesting. I know Heikki argued for it, but I think
> extending the bitmap that's computed for HOT to cover all columns and
> doing this on a column level sounds much more sensible to me.

Previously we tried to do this at column boundaries, but the main
problem turned out to be in the worst cases, where we spend time
extracting values from tuples based on column boundaries and only later
find that the data is not compressible.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
On 2014-02-15 21:01:07 +0530, Amit Kapila wrote:
> >More importantly I don't think doing the compression on
> > this level is that interesting. I know Heikki argued for it, but I think
> > extending the bitmap that's computed for HOT to cover all columns and
> > doing this on a column level sounds much more sensible to me.
> 
> Previously we tried to do this at column boundaries, but the main
> problem turned out to be in the worst cases, where we spend time
> extracting values from tuples based on column boundaries and only later
> find that the data is not compressible.

I think that hugely depends on how you implement it. I think you'd need
to have a loop traversing over the both tuples at the same time on the
level of heap_deform_tuple(). If you'd use the result to get rid of
HeapSatisfiesHOTandKeyUpdate() at the same time I am pretty sure you
wouldn't see very high overhead.
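
For illustration, such a loop might look roughly like this (a sketch
only; mark_column_changed() is a hypothetical helper, detoasting and
HOT/key bookkeeping are omitted, and the usual heapam headers are
assumed):

/*
 * Sketch only: deform both tuple versions once, then compare column by
 * column; the same pass could feed both a changed-column bitmap and the
 * HOT/key-update checks.
 */
static void
compare_columns_sketch(HeapTuple oldtup, HeapTuple newtup, TupleDesc tupdesc)
{
    Datum       old_values[MaxHeapAttributeNumber];
    bool        old_isnull[MaxHeapAttributeNumber];
    Datum       new_values[MaxHeapAttributeNumber];
    bool        new_isnull[MaxHeapAttributeNumber];
    int         attnum;

    heap_deform_tuple(oldtup, tupdesc, old_values, old_isnull);
    heap_deform_tuple(newtup, tupdesc, new_values, new_isnull);

    for (attnum = 0; attnum < tupdesc->natts; attnum++)
    {
        Form_pg_attribute att = tupdesc->attrs[attnum];

        if (old_isnull[attnum] != new_isnull[attnum] ||
            (!old_isnull[attnum] &&
             !datumIsEqual(old_values[attnum], new_values[attnum],
                           att->attbyval, att->attlen)))
            mark_column_changed(attnum);    /* hypothetical */
    }
}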

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I'm pretty sure the overhead of that would be negligible, so we could always
> enable it. There are certainly a lot of scenarios where prefix/suffix
> detection alone wouldn't help, but so what.
>
> Attached is a quick patch for that, if you want to test it.

I have updated the patch to correct a few problems, address the review
comments from Andres, and do some optimizations to reduce the CPU
overhead for the worst case. Let me first start with the performance
data for this patch.

Performance Data
-----------------------------
Non-Default settings

autovacuum = off
checkpoint_segments = 256
checkpoint_timeout = 15min
full_page_writes = off


Unpatched

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     573502736 | 9.70863103866577
 one short and one long field, no change |     573504920 | 10.1023359298706
 one short and one long field, no change |     573498936 | 9.84194612503052
 hundred tiny fields, all changed        |     364891128 | 13.9618089199066
 hundred tiny fields, all changed        |     364888088 | 13.4061119556427
 hundred tiny fields, all changed        |     367753480 |  13.433109998703
 hundred tiny fields, half changed       |     364892928 | 13.5090639591217
 hundred tiny fields, half changed       |     364890384 | 13.5632100105286
 hundred tiny fields, half changed       |     364888136 | 13.6033401489258
 hundred tiny fields, half nulled        |     300702272 | 13.7366359233856
 hundred tiny fields, half nulled        |     300703656 | 14.5007920265198
 hundred tiny fields, half nulled        |     300705216 | 13.9954152107239
 9 short and 1 long, short changed       |     396987760 |  9.5885021686554
 9 short and 1 long, short changed       |     396988864 | 9.11789703369141
 9 short and 1 long, short changed       |     396985728 | 9.52586102485657
(15 rows)

wal-update-prefix-suffix-encode-4.patch

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     156854192 | 6.74417304992676
 one short and one long field, no change |     156279384 | 6.61455297470093
 one short and one long field, no change |     156277824 | 6.84297394752502
 hundred tiny fields, all changed        |     364893056 | 13.9131588935852
 hundred tiny fields, all changed        |     364890912 | 13.1628270149231
 hundred tiny fields, all changed        |     364889424 | 13.7095680236816
 hundred tiny fields, half changed       |     364895592 | 13.6322529315948
 hundred tiny fields, half changed       |     365172160 | 14.0036828517914
 hundred tiny fields, half changed       |     364892400 | 13.5247440338135
 hundred tiny fields, half nulled        |     206833992 | 12.4429869651794
 hundred tiny fields, half nulled        |     208443760 | 12.1079058647156
 hundred tiny fields, half nulled        |     205858280 | 12.7899498939514
 9 short and 1 long, short changed       |     236516832 | 8.36392688751221
 9 short and 1 long, short changed       |     236515744 | 8.46648907661438
 9 short and 1 long, short changed       |     236518336 | 8.02749991416931
(15 rows)

There is major WAL reduction and CPU gain for the best and average
cases, and for cases where there is no WAL reduction (as the updated
tuple has different data), there is no CPU overhead.

The test script (wal-update-testsuite.sh) used to collect the above data is attached.

Now for the worst case, where the tuple has the same data up to the
compression-ratio boundary, I have tried keeping the compression rate at
25% and at 30%, and observed that the overhead is quite minimal at 30%.
Performance data for the same is below:


Case - 1 : Change some bytes just after 30% of tuple
Unpatched

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     352055792 |  5.4274320602417
 ten long fields, 8 bytes changed |     352050536 | 6.44699001312256
 ten long fields, 8 bytes changed |     352057880 | 5.78993391990662
(3 rows)

wal-update-prefix-suffix-encode-4.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     281447616 | 5.79180097579956
 ten long fields, 8 bytes changed |     281451096 | 5.63260507583618
 ten long fields, 8 bytes changed |     281445728 | 5.56671595573425
(3 rows)

Case - 2 : Change some bytes just before 30% of tuple
Unpatched

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     350873408 | 6.44963002204895
 ten long fields, 8 bytes changed |     348842888 | 6.33179092407227
 ten long fields, 8 bytes changed |     348848488 | 6.66787099838257
(3 rows)

wal-update-prefix-suffix-encode-4.patch

             testname             | wal_generated |     duration
----------------------------------+---------------+------------------
 ten long fields, 8 bytes changed |     352660656 | 8.03470301628113
 ten long fields, 8 bytes changed |     348843208 | 6.36861610412598
 ten long fields, 8 bytes changed |     348844728 | 6.56955599784851
(3 rows)

Keeping the compression rate at 30%: for the case when the match is up
to ~29%, there is about ~2% CPU overhead (considering the median data),
and when there is a match up to ~31%, there is a WAL reduction of 20%
and no CPU overhead.

Now if we keep the compression rate at 25%, it will perform better when
the match is up to 24%, but worse when the match is up to 26%.

I have attached separate scripts for both (25% & 30%) boundary tests
(wal-update-testsuite-ten-long-8-bytes-changed-at-30-percent-boundary.sh &
wal-update-testsuite-ten-long-8-bytes-changed-at-25-percent-boundary.sh).
You can change the value of PGDE_MIN_COMP_RATE in the patch to run the
test; currently it is 30. If you want to run the 25% boundary test,
change it to 25.

Note - Performance data for the worst case was fluctuating, so I ran
each test 5 times and took the result that occurred the most number of
times.

About the main changes in this patch:

1. An unnecessary tag was getting added to the encoded tuple even when
there is no match for prefix/suffix.
2. The maximum tag length was not considered; changed the code to split
the tag if the length is greater than 273 bytes (the max tag length
supported by the format; see the sketch after this list).
3. The check for whether the prefix/suffix length has achieved the
compression ratio was wrong. Changed the code for the same.
4. Even after we decide that the prefix/suffix match has achieved the
compression ratio, at the end of the encode function it was returning
based on the max size, which I think is not required as the buffer has
sufficient space, and it was causing overhead for worst cases. If we
just want to be extra careful, we might want to keep a check on the max
buffer size passed to the encode function.
5. Changed the file names to pg_decompress.c/.h (de - delta encoding)
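
As referenced in point 2, a minimal sketch of such tag splitting
(hypothetical; it assumes the patch's pgrb_out_tag() call, a prefix
match of prefixlen bytes with history offset hlen, and PostgreSQL's
Min() macro):

/*
 * Sketch only: an LZ tag encodes at most 273 bytes (PGLZ_MAX_MATCH), so
 * a longer prefix match is emitted as a series of tags. The offset can
 * stay hlen because output and history positions advance in step.
 */
int32   remaining = prefixlen;

while (remaining > 0)
{
    int32   taglen = Min(remaining, 273);   /* PGLZ_MAX_MATCH */

    pgrb_out_tag(ctrlp, ctrlb, ctrl, bp, taglen, hlen);
    remaining -= taglen;
}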

Fixes for review comments by Andres

> +
> +     if (RelationIsEnabledForWalCompression(reln) &&
> +             (oldbuf == newbuf) &&
> +             !XLogCheckBufferNeedsBackup(newbuf))
> +     {
> +             uint32          enclen;

>You should note that the check for RelationIsEnabledForWalCompression()
>here is racy and that that's ok because the worst that can happen is
>that a delta is uselessly generated.

I think we might not even keep this switch, as the performance data
seems to be okay, but yes, even if we keep it, it should not do any
harm.


> +     if (compressed)
> +             xlrec.flags |= XLOG_HEAP_DELTA_ENCODED;

> I think this also needs to unset XLOG_HEAP_CONTAINS_NEW_TUPLE and be
> conditional on !need_tuple_data.

I could not understand this point; from the above sentence it seems you
want to unset XLOG_HEAP_CONTAINS_NEW_TUPLE when !need_tuple_data, but I
am not sure. Could you explain a bit more?


> +bool
> +XLogCheckBufferNeedsBackup(Buffer buffer)
> +{

> Should note very, very boldly that this can only be used in contexts
> where a race is acceptable.

Yes, this is racy; however, in the worst case it will either do
encoding when it is not required or skip encoding when it could save
WAL, and neither case does any harm.

> + *
> + * Copyright (c) 1999-2014, PostgreSQL Global Development Group
> + *
> + * src/backend/utils/adt/pg_rbcompress.c
> + * ----------
> + */

> This needs significantly more explanations about the algorithm and the
> reasoning behind it.

Agreed. I have added more explanation and the reasoning for choosing
the algorithm.


> +static const PGRB_Strategy strategy_default_data = {
> +     32,                                                     /* Data chunks less than 32 bytes are not
> +                                                              * compressed */
> +     INT_MAX,                                        /* No upper limit on what we'll try to
> +                                                              * compress */
> +     35,                                                     /* Require 25% compression rate, or not worth
> +                                                              * it */
> +};

> compression rate looks like it's mismatch between comment and code.

Corrected. Now I have removed the strategy structure itself and instead
use #defines for the values, as we don't require multiple strategies for
this encoding.

> +/* ----------
> + * pgrb_out_ctrl -
> + *
> + *           Outputs the last and allocates a new control byte if needed.
> + * ----------
> + */
> +#define pgrb_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
> +do { \
> +     if ((__ctrl & 0xff) == 0)                       \
> +     {                                               \
> +             *(__ctrlp) = __ctrlb;                   \
> +             __ctrlp = (__buf)++;                    \
> +             __ctrlb = 0;                            \
> +             __ctrl = 1;                             \
> +     }                                               \
> +} while (0)
> +

> double underscore variables are reserved for the compiler and os.

These macros are mostly the same as in pg_lz, as we have not changed
the format of the encoded buffer. There are a couple of other places,
like like.c and valid.h, where double underscores are used in macros.
However, I think there is no compelling need to use them, and it is not
the recommended way to name variables either. I am not sure why they
were used originally in pg_lzcompress.c, so for now let's keep them as
they are; I will change these names if we decide to go with this version
of the patch. Right now the main decision is about the performance data;
once that is done, I will change it along with some other similar
changes.


> +#define pgrb_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
> +do { \
> +     pgrb_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf);        \
> +     *(_buf)++ = (unsigned char)(_byte);             \
> +     _ctrl <<= 1;                                    \
> +} while (0)

> What's the reason to use macros here? Just use inline functions when
> dealing with file-local stuff.

Again, the same reason as above: it is basically copied from pg_lz, as
the encoding format is the same.


> +
> +     /*
> +      * Tuples of length greater than PGRB_HISTORY_SIZE are not allowed for
> +      * delta encode as this is the maximum size of history offset.
> +      * XXX: still true?
> +      */

> Why didn't you define a maximum tuple size in the strategy definition
> above then?

Now I have removed the strategy structure itself and instead use
#defines for the values, as we don't require multiple strategies for
this encoding.


> +     need_rate = strategy->min_comp_rate;
> +     if (need_rate < 0)
> +             need_rate = 0;
> +     else if (need_rate > 99)
> +             need_rate = 99;

> Is there really need for all this stuff here? This is so specific to the
> usecase that I have significant doubts that all the pglz boiler plate
> makes much sense.

Agreed. I have removed these extra checks.

> +     else
> +     {
> +             result_max = (slen * (100 - need_rate)) / 100;
> +     }*/

> err?

Fixed.

> +--
> +-- Test to update continuos and non continuos columns
> +--

> *continuous

Fixed.

>> Previously we tried to do this at column boundaries, but the main
>> problem turned out to be in the worst cases, where we spend time
>> extracting values from tuples based on column boundaries and only later
>> find that the data is not compressible.

> I think that hugely depends on how you implement it. I think you'd need
> to have a loop traversing over the both tuples at the same time on the
> level of heap_deform_tuple(). If you'd use the result to get rid of
> HeapSatisfiesHOTandKeyUpdate() at the same time I am pretty sure you
> wouldn't see very high overhead

The case where it can have more overhead is when we compress and later
find that it's not a HOT update; then we have to log the new tuple as
is, and the cycles spent on compression are wasted. We always have to
find out whether it is a HOT update or not, but we might choose to give
up on tuple compression partway through based on the compression ratio,
in which case it might still have overhead. It sounds like, for the best
and average cases, this strategy might work even better than the methods
tried so far, but we can't be sure about the negative scenarios.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 02/16/2014 01:51 PM, Amit Kapila wrote:
> On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> >I'm pretty sure the overhead of that would be negligible, so we could always
>> >enable it. There are certainly a lot of scenarios where prefix/suffix
>> >detection alone wouldn't help, but so what.
>> >
>> >Attached is a quick patch for that, if you want to test it.
> I have updated the patch to correct a few problems, address the review
> comments from Andres, and do some optimizations to reduce the CPU
> overhead for the worst case.

Thanks. I have to agree with Robert though that using the pglz encoding
when we're just checking for a common prefix/suffix is a pretty crappy
way of going about it [1].

As the patch stands, it includes the NULL bitmap when checking for a
common prefix. That's probably not a good idea, because it defeats the
prefix detection in the common case that you update a field from NULL
to not-NULL or vice versa.

Attached is a rewritten version, which does the prefix/suffix tests
directly in heapam.c, and adds the prefix/suffix lengths directly as
fields in the WAL record. Could you take one more look at this
version, to check whether I've missed anything?
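
For reference, the core of such a prefix/suffix test could look roughly
like this (a sketch, not the actual heapam.c code; it compares the raw
tuple data after the header and NULL bitmap, and assumes PostgreSQL's
Min() macro):

/*
 * Sketch only: find the common prefix and suffix of the old and new
 * tuple data, to be stored as prefix/suffix length fields in the WAL
 * update record; only the middle part then needs to be logged.
 */
char   *oldp = (char *) oldtup->t_data + oldtup->t_data->t_hoff;
char   *newp = (char *) newtup->t_data + newtup->t_data->t_hoff;
int     oldlen = oldtup->t_len - oldtup->t_data->t_hoff;
int     newlen = newtup->t_len - newtup->t_data->t_hoff;
int     minlen = Min(oldlen, newlen);
int     prefixlen = 0;
int     suffixlen = 0;

while (prefixlen < minlen && oldp[prefixlen] == newp[prefixlen])
    prefixlen++;
while (suffixlen < minlen - prefixlen &&
       oldp[oldlen - suffixlen - 1] == newp[newlen - suffixlen - 1])
    suffixlen++;
/* WAL payload is then newp[prefixlen .. newlen - suffixlen - 1] */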

This ought to be tested with the new logical decoding stuff, as it
modifies the WAL update record format which the logical decoding stuff
also relies on, but I don't know anything about that.

[1]
http://www.postgresql.org/message-id/CA+TgmoZSTdQdKU7DHcLjChvMBrh1_YFOUSE+fuxESEVnc4jEgg@mail.gmail.com

- Heikki

Attachment

Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
On 2014-03-03 16:27:05 +0200, Heikki Linnakangas wrote:
> Thanks. I have to agree with Robert though that using the pglz encoding when
> we're just checking for a common prefix/suffix is a pretty crappy way of
> going about it [1].
> 
> As the patch stands, it includes the NULL bitmap when checking for a common
> prefix. That's probably not a good idea, because it defeats the prefix
> detection in the common case that you update a field from NULL to not-NULL
> or vice versa.
> 
> Attached is a rewritten version, which does the prefix/suffix tests directly
> in heapam.c, and adds the prefix/suffix lengths directly as fields in the
> WAL record. Could you take one more look at this version, to check
> whether I've missed anything?

Have you rerun the benchmarks? I'd guess the CPU overhead of this
version is lower than earlier versions, but seeing it tested won't be a
bad idea.

> This ought to be tested with the new logical decoding stuff, as it modifies
> the WAL update record format, on which the logical decoding stuff also
> relies; but I don't know anything about that.

Hm, I think all it needs to do is disable delta encoding if
need_tuple_data is set (which is dependent on wal_level=logical).
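
In sketch form, with the names assumed rather than taken from the patch:

    /* Hypothetical guard: attempt the delta encoding only when logical
     * decoding does not need the complete new tuple in WAL (and, as an
     * additional assumption here, when old and new tuple share a page). */
    if (oldbuf == newbuf && !need_tuple_data)
        compute_prefix_suffix(oldtup, newtup, &prefixlen, &suffixlen);
    else
        prefixlen = suffixlen = 0;      /* log the full new tuple */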

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Mon, Mar 3, 2014 at 9:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hm, I think all it needs to do is disable delta encoding if
> need_tuple_data is set (which is dependent on wal_level=logical).

Why does it need to do that?  The logical decoding stuff should be
able to reverse out the delta encoding.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
On 2014-03-03 10:35:03 -0500, Robert Haas wrote:
> On Mon, Mar 3, 2014 at 9:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Hm, I think all it needs to do is disable delta encoding if
> > need_tuple_data is set (which is dependent on wal_level=logical).
> 
> Why does it need to do that?  The logical decoding stuff should be
> able to reverse out the delta encoding.

Against what should it perform the delta? Unless I misunderstand how the
patch works, it computes the delta against the old tuple in the heap
page?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Mon, Mar 3, 2014 at 10:38 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-03-03 10:35:03 -0500, Robert Haas wrote:
>> On Mon, Mar 3, 2014 at 9:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Hm, I think all it needs to do is disable delta encoding if
>> > need_tuple_data is set (which is dependent on wal_level=logical).
>>
>> Why does it need to do that?  The logical decoding stuff should be
>> able to reverse out the delta encoding.
>
> Against what should it perform the delta? Unless I misunderstand how the
> patch works, it computes the delta against the old tuple in the heap
> page?

Oh, maybe I need more caffeine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 03/03/2014 04:57 PM, Andres Freund wrote:
> On 2014-03-03 16:27:05 +0200, Heikki Linnakangas wrote:
>> Attached is a rewritten version, which does the prefix/suffix tests directly
>> in heapam.c, and adds the prefix/suffix lengths directly as fields in the
>> WAL record. Could you take one more look at this version, to check
>> whether I've missed anything?
>
> Have you rerun the benchmarks?

No.

> I'd guess the CPU overhead of this version is lower than earlier
> versions,

That's what I would expect too.

> but seeing it tested won't be a bad idea.

Agreed. Amit, do you have the test setup at hand? Can you check the 
performance of this one more time?

Also, I removed the GUC and table-level options, on the assumption that 
this is cheap enough, even when it's not helping, that we don't need to 
make it configurable.

>> This ought to be tested with the new logical decoding stuff, as it modifies
>> the WAL update record format, on which the logical decoding stuff also
>> relies; but I don't know anything about that.
>
> Hm, I think all it needs to do is disable delta encoding if
> need_tuple_data is set (which is dependent on wal_level=logical).

That's a pity, but we can live with it. If we did this at a higher level 
and checked which columns have been modified, we could include just the 
modified fields in the record, which should be enough for logical 
decoding. It might even be more useful for logical decoding to know 
exactly which fields were changed.

- Heikki



Re: Performance Improvement by reducing WAL for Update Operation

From
Andres Freund
Date:
On 2014-03-04 12:43:48 +0200, Heikki Linnakangas wrote:
> >>This ought to be tested with the new logical decoding stuff, as it modifies
> >>the WAL update record format, on which the logical decoding stuff also
> >>relies; but I don't know anything about that.
> >
> >Hm, I think all it needs to do is disable delta encoding if
> >need_tuple_data is set (which is dependent on wal_level=logical).
> 
> That's a pity, but we can live with it.

Agreed. This is hardly the first optimization that only works for some
wal_levels.

> If we did this at a higher level and
> checked which columns have been modified, we could include just the modified
> fields in the record, which should be enough for logical decoding. It
> might even be more useful for logical decoding to know exactly which
> fields were changed.

Yea, I argued that way elsewhere in this thread. I do think we're going
to need per-column info for further features in the near future. It's a
bit absurd that we're computing various sets of changed columns (HOT,
key, identity) plus the prefix/suffix with this patchset.
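
In sketch form, the single-pass idea might look like this; attr_equals()
stands in for a per-attribute comparison such as heapam.c's
heap_tuple_attr_equals(), and the attribute sets are illustrative:

    Bitmapset  *changed = NULL;
    int         attnum;

    /* One comparison pass over both tuples... */
    for (attnum = 1; attnum <= natts; attnum++)
    {
        if (!attr_equals(tupdesc, attnum, oldtup, newtup))
            changed = bms_add_member(changed, attnum);
    }

    /* ...from which every "changed columns" answer is derived. */
    hot_update       = !bms_overlap(changed, hot_attrs);
    key_changed      = bms_overlap(changed, key_attrs);
    identity_changed = bms_overlap(changed, id_attrs);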

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Mon, Mar 3, 2014 at 7:57 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 02/16/2014 01:51 PM, Amit Kapila wrote:
>>
>> On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com>  wrote:
>
> Thanks. I have to agree with Robert though that using the pglz encoding when
> we're just checking for a common prefix/suffix is a pretty crappy way of
> going about it [1].
>
> As the patch stands, it includes the NULL bitmap when checking for a common
> prefix. That's probably not a good idea, because it defeats the prefix
> detection in the common case that you update a field from NULL to not-NULL
> or vice versa.
>
> Attached is a rewritten version, which does the prefix/suffix tests directly
> in heapam.c, and adds the prefix/suffix lengths directly as fields in the
> WAL record. Could you take one more look at this version, to check
> whether I've missed anything?

I have verified the patch and found a few minor points:
1.
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+  HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);

Declarations for the above functions are not required now.

2.
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)

Here, I think we can change the comment to avoid the term
EWT (Encoded WAL Tuple), as we have now changed the compression
mechanism and the term is not used anywhere else.

One question:
+ rdata[1].data = (char *) &xlrec;
Earlier, the record header seems to have been stored as the first
segment rdata[0]; what's the reason for changing it?


I have verified the patch by doing crash recovery for the below
scenarios, and it worked fine:
a. no change in old and new tuple
b. all changed in new tuple
c. half changed (update half of the values to NULLs) in new tuple
d. only prefix same in new tuple
e. only suffix same in new tuple
f. prefix-suffix same, other column values changed in new tuple.

Performance Data
----------------------------

Non-Default settings

autovacuum = off
checkpoint_segments = 256
checkpoint_timeout = 15min
full_page_writes = off

Unpatched

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     573506704 | 9.56587505340576
 one short and one long field, no change |     575351216 | 9.97713398933411
 one short and one long field, no change |     573501848 | 9.76377606391907
 hundred tiny fields, all changed        |     364894056 | 13.3053929805756
 hundred tiny fields, all changed        |     364891536 | 13.3533811569214
 hundred tiny fields, all changed        |     364889264 | 13.3041989803314
 hundred tiny fields, half changed       |     365411920 | 14.1831648349762
 hundred tiny fields, half changed       |     365918216 | 13.6393811702728
 hundred tiny fields, half changed       |     366456552 | 13.6420011520386
 hundred tiny fields, half nulled        |     300705288 | 12.8859741687775
 hundred tiny fields, half nulled        |     301665624 | 12.6988201141357
 hundred tiny fields, half nulled        |     300700504 | 13.3536100387573
 9 short and 1 long, short changed       |     396983080 | 8.83671307563782
 9 short and 1 long, short changed       |     396987976 | 9.23769211769104
 9 short and 1 long, short changed       |     396984080 | 9.45178604125977
(15 rows)


wal-update-prefix-suffix-5.patch

                testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
 one short and one long field, no change |     156278832 | 6.69434094429016
 one short and one long field, no change |     156277352 | 6.70855903625488
 one short and one long field, no change |     156280040 | 6.70657396316528
 hundred tiny fields, all changed        |     364895152 | 13.6677348613739
 hundred tiny fields, all changed        |     364892256 | 12.7107839584351
 hundred tiny fields, all changed        |     364890424 | 13.7760601043701
 hundred tiny fields, half changed       |     365970360 | 13.1902158260345
 hundred tiny fields, half changed       |     364895120 | 13.5730090141296
 hundred tiny fields, half changed       |     367031168 | 13.7023210525513
 hundred tiny fields, half nulled        |     204418576 | 12.1997199058533
 hundred tiny fields, half nulled        |     204422880 | 11.4583330154419
 hundred tiny fields, half nulled        |     204417464 | 12.0228970050812
 9 short and 1 long, short changed       |     220466016 | 8.14843511581421
 9 short and 1 long, short changed       |     220471168 | 8.03712797164917
 9 short and 1 long, short changed       |     220464464 | 8.55907511711121
(15 rows)


The conclusion is that the patch shows good WAL reduction and performance
improvement for favourable cases, without CPU overhead for non-favourable
cases.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Tue, Mar 4, 2014 at 4:13 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Agreed. Amit, do you have the test setup at hand, can you check the
> performance of this one more time?

Are you expecting more performance numbers than I have posted?
Is there anything more left for the patch that you are expecting?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 03/04/2014 01:58 PM, Amit Kapila wrote:
> On Mon, Mar 3, 2014 at 7:57 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 02/16/2014 01:51 PM, Amit Kapila wrote:
>>>
>>> On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas
>>> <hlinnakangas@vmware.com>  wrote:
>>
>> Thanks. I have to agree with Robert though that using the pglz encoding when
>> we're just checking for a common prefix/suffix is a pretty crappy way of
>> going about it [1].
>>
>> As the patch stands, it includes the NULL bitmap when checking for a common
>> prefix. That's probably not a good idea, because it defeats the prefix
>> detection in the common case that you update a field from NULL to not-NULL
>> or vice versa.
>>
>> Attached is a rewritten version, which does the prefix/suffix tests directly
>> in heapam.c, and adds the prefix/suffix lengths directly as fields in the
>> WAL record. Could you take one more look at this version, to check
>> whether I've missed anything?
>
> I have verified the patch and found a few minor points:
> ...

Fixed those.

> One question:
> + rdata[1].data = (char *) &xlrec;
> Earlier, the record header seems to have been stored as the first
> segment rdata[0]; what's the reason for changing it?

I found the code easier to read that way. The order of rdata entries 
used to be:

0: xl_heap_update struct
1: full-page reference to oldbuf (no data)
2: xl_heap_header_len struct for the new tuple
3-7: logical decoding stuff

The prefix/suffix fields made that order a bit awkward, IMHO. They are 
logically part of the header, even though they're not part of the struct 
(they are documented in comments inside the struct). So they ought to 
stay together with the xl_heap_update struct. Another option would've 
been to move them after the xl_heap_header_len struct.
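
In sketch form, the reordered chain looks like this (simplified, first
entries only; buffer_std and error handling omitted):

    rdata[0].data = NULL;               /* full-page reference to oldbuf */
    rdata[0].len = 0;
    rdata[0].buffer = oldbuf;
    rdata[0].next = &rdata[1];

    rdata[1].data = (char *) &xlrec;    /* xl_heap_update, with the
                                         * prefix/suffix lengths kept
                                         * right next to it */
    rdata[1].len = SizeOfHeapUpdate;
    rdata[1].buffer = InvalidBuffer;
    rdata[1].next = &rdata[2];          /* xl_heap_header_len, etc. */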

Note that this doesn't affect the on-disk format of the WAL record, 
because the moved rdata entry is just a full-page reference, with no 
payload of its own.

> I have verified the patch by doing crash recovery for the below
> scenarios, and it worked fine:
> a. no change in old and new tuple
> b. all changed in new tuple
> c. half changed (update half of the values to NULLs) in new tuple
> d. only prefix same in new tuple
> e. only suffix same in new tuple
> f. prefix-suffix same, other column values changed in new tuple.

Thanks!

> The conclusion is that the patch shows good WAL reduction and performance
> improvement for favourable cases, without CPU overhead for non-favourable
> cases.

Ok, great. Committed!

I left out the regression tests. It was good to have them while 
developing this, but I don't think there's a lot of value in including 
them permanently in the regression suite. Low-level things like the 
alignment-sensitive test are fragile, and can easily stop testing the 
thing they're supposed to test, depending on the platform and future 
changes in the code. And the current algorithm doesn't care much about 
alignment anyway.

- Heikki



Re: Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Wed, Mar 12, 2014 at 5:30 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Ok, great. Committed!

Awesome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company