Re: Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Performance Improvement by reducing WAL for Update Operation
Date
Msg-id 51366323.8070606@vmware.com
In response to Re: Performance Improvement by reducing WAL for Update Operation  (Amit Kapila <amit.kapila@huawei.com>)
Responses Re: Performance Improvement by reducing WAL for Update Operation  (Amit Kapila <amit.kapila@huawei.com>)
Re: Performance Improvement by reducing WAL for Update Operation  (Andres Freund <andres@2ndquadrant.com>)
Re: Performance Improvement by reducing WAL for Update Operation  (Amit Kapila <amit.kapila@huawei.com>)
Re: Performance Improvement by reducing WAL for Update Operation  (Amit Kapila <amit.kapila@huawei.com>)
List pgsql-hackers
On 04.03.2013 06:39, Amit Kapila wrote:
> On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
>> On 02/05/2013 11:53 PM, Amit Kapila wrote:
>>>> Performance data for the patch is attached with this mail.
>>>> Conclusions from the readings (these are same as my previous patch):
>>>>
>>>> 1. With original pgbench there is a max 7% WAL reduction with not much
>>>> performance difference.
>>>> 2. With 250 record pgbench there is a max WAL reduction of 35% with not
>>>> much performance difference.
>>>> 3. With 500 and above record size in pgbench there is an improvement in
>>>> both performance and WAL reduction.
>>>>
>>>> If the record size increases there is a gain in performance and the WAL
>>>> size is reduced as well.
>>>>
>>>> Performance data for synchronous_commit = on is in progress; I shall
>>>> post it once it is done.
>>>> I am expecting it to be the same as before.
>>> Please find the performance readings for synchronous_commit = on.
>>>
>>> Each run is taken for 20 min.
>>>
>>> Conclusions from the readings with synchronous commit on mode:
>>>
>>> 1. With original pgbench there is a max 2% WAL reduction with not much
>>> performance difference.
>>> 2. With 500 record pgbench there is a max WAL reduction of 3% with not
>>> much performance difference.
>>> 3. With 1800 record size in pgbench there is both an improvement in
>>> performance (approx 3%) and a WAL reduction (44%).
>>>
>>> If the record size increases there is a very good reduction in WAL
>>> size.
>>
>> The stats look fairly sane. I'm a little concerned about the apparent
>> trend of falling TPS in the patched vs original tests for the 1-client
>> test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
>> 0.4% case made other config changes too. Nonetheless, it might be wise
>> to check with really big records and see if the trend continues.
>
> For bigger records (~2000), the data goes into TOAST, for which we don't do
> this optimization.
> This optimization is mainly for medium-size records.

I've been investigating the pglz option further, and doing performance
comparisons of the pglz approach and this patch. I'll begin with some numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):

                 testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
  two short fields, no change             |    1245525360 | 9.94613695144653
  two short fields, one changed           |    1245536528 |  10.146910905838
  two short fields, both changed          |    1245523160 | 11.2332470417023
  one short and one long field, no change |    1054926504 | 5.90477800369263
  ten tiny fields, all changed            |    1411774608 | 13.4536008834839
  hundred tiny fields, all changed        |     635739680 | 7.57448387145996
  hundred tiny fields, half changed       |     636930560 | 7.56888699531555
  hundred tiny fields, half nulled        |     573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:

                 testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
  two short fields, no change             |    1249722112 | 13.0558869838715
  two short fields, one changed           |    1246145408 | 12.9947438240051
  two short fields, both changed          |    1245951056 | 13.0262880325317
  one short and one long field, no change |     678480664 | 5.70031690597534
  ten tiny fields, all changed            |    1328873920 | 20.0167419910431
  hundred tiny fields, all changed        |     638149416 | 14.4236788749695
  hundred tiny fields, half changed       |     635560504 | 14.8770561218262
  hundred tiny fields, half nulled        |     558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:

                 testname                 | wal_generated |     duration
-----------------------------------------+---------------+------------------
  two short fields, no change             |    1245519008 | 11.6702048778534
  two short fields, one changed           |    1245756904 | 11.3233819007874
  two short fields, both changed          |    1249711088 | 11.6836447715759
  one short and one long field, no change |     664741392 | 6.44810795783997
  ten tiny fields, all changed            |    1328085568 | 13.9679481983185
  hundred tiny fields, all changed        |     635974088 | 9.15514206886292
  hundred tiny fields, half changed       |     636309040 | 9.13769292831421
  hundred tiny fields, half nulled        |     496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE
is timed. Duration is the time spent in the UPDATE (lower is better),
and wal_generated is the amount of WAL generated by the updates (lower
is better).
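
To make the setup concrete, here is a rough libpq sketch of that kind of
measurement. This is not the script that produced the numbers above: the
table name, row count, filler width and connection settings are made up for
illustration, and pg_current_xlog_insert_location() is the 9.3-era name of
the function (later releases call it pg_current_wal_insert_lsn()).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK &&
        PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "\"%s\" failed: %s", sql, PQerrorMessage(conn));
        exit(1);
    }
    PQclear(res);
}

/* current WAL insert location, as a malloc'd string such as "0/30210D8" */
static char *
wal_location(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "SELECT pg_current_xlog_insert_location()");
    char       *loc;

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        exit(1);
    }
    loc = strdup(PQgetvalue(res, 0, 0));
    PQclear(res);
    return loc;
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* default connection parameters */
    struct timespec t0, t1;
    char       *wal_before, *wal_after;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* a table full of identical rows, with free space left on every page */
    run(conn, "DROP TABLE IF EXISTS waltest");
    run(conn, "CREATE TABLE waltest (id int, filler text) WITH (fillfactor = 50)");
    run(conn, "INSERT INTO waltest SELECT g, repeat('x', 500) "
              "FROM generate_series(1, 100000) g");
    run(conn, "VACUUM waltest");

    /* time a full-table UPDATE and note how far the WAL insert pointer moved */
    wal_before = wal_location(conn);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run(conn, "UPDATE waltest SET id = id + 1");
    clock_gettime(CLOCK_MONOTONIC, &t1);
    wal_after = wal_location(conn);

    printf("duration: %.3f s, WAL insert location %s .. %s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
           wal_before, wal_after);

    free(wal_before);
    free(wal_after);
    PQfinish(conn);
    return 0;
}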

The summary is that Amit's patch is a small win in terms of CPU usage,
in the best case where the table has few columns, with one large column
that is not updated. In all other cases it just adds overhead. In terms
of WAL size, you get a big gain in the same best case scenario.

Attached is a different version of this patch, which uses the pglz
algorithm to spot the similarities between the old and new tuples,
instead of relying on explicit knowledge of where the column boundaries
are. This has the advantage that it can find redundancy, and compress,
in more cases. For example, you can see a reduction in WAL size in the
"hundred tiny fields, half nulled" test case above.
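
To illustrate the idea, here is a toy delta encoder of that kind. It is not
the attached patch and does not use the real pglz entry points; the output
format, constants and function names are invented for the example. It walks
the new tuple and, wherever a long enough match against the old tuple exists,
emits a copy reference instead of literal bytes -- no knowledge of column
boundaries is needed:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MIN_MATCH 4             /* don't bother encoding very short matches */

/*
 * Encode 'new' as a series of instructions against 'old':
 *   'C' <2-byte offset> <2-byte length>  -- copy bytes from the old tuple
 *   'L' <byte>                           -- literal byte from the new tuple
 * Offsets and lengths are assumed to fit in 16 bits, which holds for the
 * small tuples this example is about.  Returns the encoded size.
 */
static size_t
delta_encode(const uint8_t *old, size_t oldlen,
             const uint8_t *new, size_t newlen,
             uint8_t *out)
{
    size_t      outpos = 0;
    size_t      i = 0;

    while (i < newlen)
    {
        size_t      best_len = 0;
        size_t      best_off = 0;

        /* brute-force search: longest match of new[i..] anywhere in old */
        for (size_t j = 0; j < oldlen; j++)
        {
            size_t      len = 0;

            while (j + len < oldlen && i + len < newlen &&
                   old[j + len] == new[i + len])
                len++;
            if (len > best_len)
            {
                best_len = len;
                best_off = j;
            }
        }

        if (best_len >= MIN_MATCH)
        {
            out[outpos++] = 'C';
            out[outpos++] = (uint8_t) (best_off >> 8);
            out[outpos++] = (uint8_t) best_off;
            out[outpos++] = (uint8_t) (best_len >> 8);
            out[outpos++] = (uint8_t) best_len;
            i += best_len;
        }
        else
        {
            out[outpos++] = 'L';
            out[outpos++] = new[i++];
        }
    }
    return outpos;
}

int
main(void)
{
    const char *oldtup = "id=42|name=alice|balance=100|filler=xxxxxxxxxxxxxxxx";
    const char *newtup = "id=42|name=alice|balance=250|filler=xxxxxxxxxxxxxxxx";
    uint8_t     buf[1024];
    size_t      n;

    n = delta_encode((const uint8_t *) oldtup, strlen(oldtup),
                     (const uint8_t *) newtup, strlen(newtup), buf);
    printf("new tuple: %zu bytes, delta: %zu bytes\n", strlen(newtup), n);
    return 0;
}

The real code replaces the brute-force search here with pglz's history hash
table; keeping that part cheap is what the rest of this mail is about.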

The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default, this
probably just isn't worth it.

The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor" thread. But the delta encoding function
goes further than that, with some additional micro-optimizations: the
hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows the input can't exceed
4096 bytes. Those changes shaved off some cycles, but you could probably
do more. One idea is to add only every 10th byte or so to the history
lookup table; that would sacrifice some compressibility for speed.
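
As a sketch of those two ideas in isolation (again, not the patch itself; the
hash function, table size and the 10-byte stride below are arbitrary choices
for the example): the history-table hash is updated incrementally as the
input pointer advances, and only every Nth position is remembered:

#include <stdint.h>
#include <stddef.h>

#define HASH_BITS   10
#define HASH_SIZE   (1 << HASH_BITS)
#define HASH_MASK   (HASH_SIZE - 1)
#define HIST_STRIDE 10              /* remember only every 10th position */

/*
 * Shift-and-xor hash, updated with the next input byte as the window moves
 * forward by one; contributions from older bytes fall out of the masked
 * range after a few steps, so the hash effectively covers the last few
 * bytes without ever being recomputed from scratch.
 */
#define ROLL_HASH(h, p) \
    ((((h) << 3) ^ (p)[2]) & HASH_MASK)

static void
build_history(const uint8_t *input, size_t len, int16_t *hist_start)
{
    int         hval;
    size_t      i;

    /* positions are stored as int16: fine, since the input is capped at 4K */
    if (len < 3 || len > 4096)
        return;

    /* seed the hash from the first three bytes */
    hval = ((input[0] << 6) ^ (input[1] << 3) ^ input[2]) & HASH_MASK;

    for (i = 0; i + 3 <= len; i++)
    {
        /* sparse history: index only every HIST_STRIDE'th position */
        if (i % HIST_STRIDE == 0)
            hist_start[hval] = (int16_t) i;

        /* roll the hash forward by one byte instead of recomputing it */
        if (i + 3 < len)
            hval = ROLL_HASH(hval, input + i + 1);
    }
}

int
main(void)
{
    const char  data[] = "hello hello hello hello hello hello hello";
    int16_t     hist_start[HASH_SIZE] = {0};

    build_history((const uint8_t *) data, sizeof(data) - 1, hist_start);
    return 0;
}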

If you could squeeze the pglz_delta_encode function down to be cheap
enough that we could enable this by default, this would be a pretty cool
patch. At the very least, the overhead in the cases where you get no
compression needs to be brought down to about 2-5% at most, I think. If
that can't be done easily, I feel this patch probably needs to be dropped.

PS. I haven't done much testing of WAL redo, so it's quite possible that
the encoding is actually buggy, or that decoding is slow. But I don't
think there's anything so fundamentally wrong that it would affect the
performance results much.

- Heikki
