Thread: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
Sent: Monday, August 27, 2012 5:58 PM
To: Amit kapila
On 27.08.2012 15:18, Amit kapila wrote:
>>> I have implemented the WAL Reduction Patch for the case of HOT Update as
>>> pointed out by Simon and Robert. In this patch it only goes for Optimized
>>> WAL in case of HOT Update with other restrictions same as in previous patch.
>>
>>> The performance numbers for this patch are attached in this mail. It has
>>> improved by 90% if the page has fillfactor 80.
>>
>>> Now going forward I have following options:
>>> a. Upload the patch in Open CF for WAL Reduction which contains
>>> reduction for HOT and non-HOT updates.
>>> b. Upload the patch in Open CF for WAL Reduction which contains
>>> reduction for HOT updates.
>>> c. Upload both the patches as different versions.
>> Let's do it for HOT updates only. Simon & Robert made good arguments on
>> why this is a bad idea for non-HOT updates.
>Okay, I shall do it that way.
>So now I shall send information about all the testing I have done for this
>Patch and then Upload it in CF.
Rebased version of patch based on latest code.
With Regards,
Amit Kapila.
Attachment
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On 24.09.2012 13:57, Amit kapila wrote:
> Rebased version of patch based on latest code.

When HOT was designed, we decided that heap_update needs to compare the old and new attributes directly, with memcmp(), to determine whether any of the indexed columns have changed. It was not deemed feasible to pass down that information from the executor. I don't remember the details of why that was, but you seem to be trying the same thing in this patch, passing the bitmap of modified cols from the executor to heap_update(). I'm pretty sure that won't work, for the same reasons we didn't do it for HOT.

I still feel that it would probably be better to use a generic delta encoding scheme, instead of inventing one. How about VCDIFF (http://tools.ietf.org/html/rfc3284), for example? Or you could reuse the LZ compressor that we already have in the source tree. You can use LZ for delta compression by initializing the history buffer of the algorithm with the old tuple, and then compressing the new tuple as usual. Or you could still use the knowledge of where the attributes begin and end and which attributes were updated, and do the encoding similar to how you did in the patch, but use LZ as the output format. That way the decoding would be the same as LZ decompression.

- Heikki
> On Tuesday, September 25, 2012 7:30 PM Heikki Linnakangas wrote:
> On 24.09.2012 13:57, Amit kapila wrote:
> > Rebased version of patch based on latest code.
>
> When HOT was designed, we decided that heap_update needs to compare the
> old and new attributes directly, with memcmp(), to determine whether any
> of the indexed columns have changed. It was not deemed feasible to
> pass down that information from the executor. I don't remember the
> details of why that was, but you seem to be trying the same thing in this
> patch, and pass the bitmap of modified cols from the executor to
> heap_update(). I'm pretty sure that won't work, for the same reasons we
> didn't do it for HOT.

I think the reason for not relying on the modified-columns bitmap is that there are cases where it does not give correct information; for example, BEFORE triggers can change columns beyond those recorded as modified, which is why HOT needs to do the comparison. In our case we have taken care of such cases by skipping the optimization, rather than relying on the modified columns. If you feel the comparison is a must, we can do it the same way as we identify columns for HOT?

> I still feel that it would probably be better to use a generic delta
> encoding scheme, instead of inventing one. How about VCDIFF
> (http://tools.ietf.org/html/rfc3284), for example? Or you could reuse
> the LZ compressor that we already have in the source tree. You can use
> LZ for delta compression by initializing the history buffer of the
> algorithm with the old tuple, and then compressing the new tuple as
> usual. Or you could still use the knowledge of where the attributes
> begin and end and which attributes were updated, and do the encoding
> similar to how you did in the patch, but use LZ as the output format.
> That way the decoding would be the same as LZ decompression.

Can you please explain why you think that doing LZ compression on top of the encoding is better, given that we have already reduced the amount of WAL for update by storing only the changed column information?
a. is it to further reduce the size of WAL
b. storing diff WAL in some standard format
c. or does it give any other kind of benefit

With Regards,
Amit Kapila.
On Mon, Sep 24, 2012 at 10:57:02AM +0000, Amit kapila wrote:
> Rebased version of patch based on latest code.

I like the direction you're taking with this patch; the gains are striking, especially considering the isolation of the changes.

You cannot assume executor-unmodified columns are also unmodified from heap_update()'s perspective. Expansion in one column may instigate TOAST compression of a logically-unmodified column, and that counts as a change for xlog delta purposes. You do currently skip the optimization for relations having a TOAST table, but TOAST compression can still apply. Observe this with text columns of storage mode PLAIN. I see two ways out: skip the new behavior when need_toast=true, or compare all inline column data, not just what the executor modified. One can probably construct a benchmark favoring either choice. I'd lean toward the latter; wide tuples are the kind this change can most help. If the marginal advantage of ignoring known-unmodified columns proves important, we can always bring it back after designing a way to track which columns changed in the toaster.

Given that, why not treat the tuple as an opaque series of bytes and not worry about datum boundaries? When several narrow columns change together, say a sequence of sixteen smallint columns, you will use fewer binary delta commands by representing the change with a single 32-byte substitution. If an UPDATE changes just part of a long datum, the delta encoding algorithm will still be able to save considerable space. That case arises in many forms: changing one word in a long string, changing one element in a long array, changing one field of a composite-typed column. Granted, this makes the choice of delta encoding algorithm more important.

Like Heikki, I'm left wondering why your custom delta encoding is preferable to an encoding from the literature. Your encoding has much in common with VCDIFF, even sharing two exact command names.
If a custom encoding is the right thing, code comments or a README section should at least discuss the advantages over an established alternative. Idle thought: it might pay off to use 1-byte sizes and offsets most of the time. Tuples shorter than 256 bytes are common; for longer tuples, we can afford wider offsets.

The benchmarks you posted upthread were helpful. I think benchmarking with fsync=off is best if you don't have a battery-backed write controller or SSD. Otherwise, fsync time dominates a pgbench run.

Please benchmark recovery. To do so, set up WAL archiving and take a base backup from a fresh cluster. Run pgbench for awhile. Finally, observe the elapsed time to recover your base backup to the end of archived WAL.

> *** a/src/backend/access/common/heaptuple.c
> --- b/src/backend/access/common/heaptuple.c

> + /*
> +  * encode_xlog_update
> +  *  Forms a diff tuple from old and new tuple with the modified columns.
> +  *
> +  * att - attribute list.
> +  * oldtup - pointer to the old tuple.
> +  * heaptup - pointer to the modified tuple.
> +  * wal_tup - pointer to the wal record which needs to be formed from old
> +  *           and new tuples by using the modified columns list.
> +  * modifiedCols - modified columns list by the update command.
> +  */
> + void
> + encode_xlog_update(Form_pg_attribute *att, HeapTuple oldtup,
> +                    HeapTuple heaptup, HeapTuple wal_tup,
> +                    Bitmapset *modifiedCols)

This name is too generic for an extern function. Maybe "heap_delta_encode"?

> + void
> + decode_xlog_update(HeapTupleHeader htup, uint32 old_tup_len, char *data,
> +                    uint32 *new_tup_len, char *waldata, uint32 wal_len)

Likewise, maybe "heap_delta_decode" here.

> *** a/src/backend/access/heap/heapam.c
> --- b/src/backend/access/heap/heapam.c
> ***************
> *** 71,77 ****
>   #include "utils/syscache.h"
>   #include "utils/tqual.h"
>
> -
>   /* GUC variable */
>   bool synchronize_seqscans = true;
>

Spurious whitespace change.
> ***************
> *** 3195,3204 **** l2:
>   /* XLOG stuff */
>   if (RelationNeedsWAL(relation))
>   {
> !     XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
> !                                         newbuf, heaptup,
> !                                         all_visible_cleared,
> !                                         all_visible_cleared_new);
>
>       if (newbuf != buffer)
>       {
> --- 3203,3233 ----
>   /* XLOG stuff */
>   if (RelationNeedsWAL(relation))
>   {
> !     XLogRecPtr recptr;
> !
> !     /*
> !      * Apply the xlog diff update algorithm only for hot updates.
> !      */
> !     if (modifiedCols && use_hot_update)

Why HOT in particular? I understand the arguments upthread for skipping the optimization when the update crosses pages, but the other condition for HOT (no changes to indexed columns) seems irrelevant here. Why not retest "newbuf == buffer", instead? In any event, the comment should memorialize the rationale behind any excluded cases, not merely restate the obvious fact that the code excludes them.

For the record, I think that if this pays off for intra-page updates, we should eventually extend it to cross-page updates under full_page_writes=on. If we were already logging deltas for all updates, I doubt we would adopt a proposal to add complete-tuple logging as a disaster recovery aid. When something corrupts a block, all bets are off.

> ***************
> *** 5239,5252 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
>   } tbuf;
>   xl_heap_header xlhdr;
>   int hsize;
> ! uint32 newlen;
>   Size freespace;
>
>   /*
>    * The visibility map may need to be fixed even if the heap page is
>    * already up-to-date.
>    */
> ! if (xlrec->all_visible_cleared)
>   {
>       Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
>       BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
> --- 5276,5293 ----
>   } tbuf;
>   xl_heap_header xlhdr;
>   int hsize;
> ! uint32 new_tup_len = 0;

This variable rename looks spurious.
>   Size freespace;
>
> + /* Initialize the buffer, used to frame the new tuple */
> + MemSet((char *) &tbuf.hdr, 0, sizeof(HeapTupleHeaderData));
> + hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
> +
>   /*
>    * The visibility map may need to be fixed even if the heap page is
>    * already up-to-date.
>    */
> ! if (xlrec->new_all_visible_cleared & 0x0F)
>   {
>       Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
>       BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
> ***************
> *** 5266,5272 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
>   }
>
>   /* Deal with old tuple version */
> -
>   buffer = XLogReadBuffer(xlrec->target.node,
>                           ItemPointerGetBlockNumber(&(xlrec->target.tid)),
>                           false);

Spurious whitespace change.

> --- 5307,5312 ----
> ***************
> *** 5291,5296 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
> --- 5331,5359 ----
>
>   htup = (HeapTupleHeader) PageGetItem(page, lp);
>
> + if (xlrec->diff_update)
> + {
> +     char *data = (char *) &tbuf.hdr + htup->t_hoff;
> +     uint32 old_tup_len;
> +     uint32 wal_len;
> +     char *waldata = (char *) xlrec + hsize + htup->t_hoff
> +                     - offsetof(HeapTupleHeaderData, t_bits);
> +
> +     wal_len = record->xl_len - hsize;
> +     Assert(wal_len <= MaxHeapTupleSize);
> +
> +     wal_len -= (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits));
> +
> +     old_tup_len = ItemIdGetLength(lp) - htup->t_hoff;
> +
> +     /* copy exactly the tuple header present in the WAL to new tuple */
> +     memcpy((char *) &tbuf.hdr + offsetof(HeapTupleHeaderData, t_bits),
> +            (char *) xlrec + hsize,
> +            (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits)));
> +
> +     decode_xlog_update(htup, old_tup_len, data, &new_tup_len, waldata, wal_len);

I think the above code should appear later, with treatment of the new tuple. encode_xlog_update() and decode_xlog_update() should be essentially-inverse APIs.
Instead, you have encode_xlog_update() working with HeapTuple arguments while decode_xlog_update() works with opaque pointers. encode_xlog_update() clones the header, but decode_xlog_update() leaves that to its caller. Those decisions are convenient enough for these heapam.c callers, but I think heap_xlog_update() should work harder so the decode_xlog_update() argument list need not appear ad hoc.

> ***************
> *** 5317,5322 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
> --- 5380,5386 ----
>   */
>   if (samepage)
>       goto newsame;
> +
>   PageSetLSN(page, lsn);
>   PageSetTLI(page, ThisTimeLineID);
>   MarkBufferDirty(buffer);

Spurious whitespace change.

> *** a/src/backend/executor/nodeModifyTable.c
> --- b/src/backend/executor/nodeModifyTable.c
> ***************
> *** 554,559 **** lreplace:;
> --- 557,571 ----
>   if (resultRelationDesc->rd_att->constr)
>       ExecConstraints(resultRelInfo, slot, estate);
>
> + /* check whether the xlog diff update can be applied or not? */
> + if ((resultRelationDesc->rd_toastoid == InvalidOid)
> +     && (tuple_bf_trigger == tuple)
> +     && (tuple->t_len > MinHeapTupleSizeForDiffUpdate))

Having the executor apply these tests introduces a modularity violation. If any of these restrictions are to remain, a comment at the code enforcing them should give rationale for each.

> *** a/src/include/access/heapam_xlog.h
> --- b/src/include/access/heapam_xlog.h
> ***************
> *** 142,149 **** typedef struct xl_heap_update
>   {
>       xl_heaptid target; /* deleted tuple id */
>       ItemPointerData newtid; /* new inserted tuple id */
> !     bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
> !     bool new_all_visible_cleared; /* same for the page of newtid */
>       /* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
>   } xl_heap_update;
>
> --- 142,155 ----
>   {
>       xl_heaptid target; /* deleted tuple id */
>       ItemPointerData newtid; /* new inserted tuple id */
> !     bool diff_update; /* optimized update or not */
> !     /*
> !      * To keep the structure size same all_visible_cleared is merged with
> !      * new_all_visible_cleared.
> !      */
> !     bool new_all_visible_cleared; /* MSB 4 bits tells PD_ALL_VISIBLE was
> !                                      cleared of new page and rest 4 bits
> !                                      for the old page */

In place of these two fields, store three flags in a uint8 field.

>       /* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
>   } xl_heap_update;
>
> *** a/src/include/access/htup_details.h
> --- b/src/include/access/htup_details.h
> ***************
> *** 528,533 **** struct MinimalTupleData
> --- 528,546 ----
>   HeapTupleHeaderSetOid((tuple)->t_data, (oid))
>
>
> + /* WAL Diff update options */
> + #define HEAP_UPDATE_WAL_OPT_COPY 0
> + #define HEAP_UPDATE_WAL_OPT_ADD 1
> + #define HEAP_UPDATE_WAL_OPT_IGN 2
> + #define HEAP_UPDATE_WAL_OPT_PAD 3

These defines can remain private to the file implementing the encoding.

> +
> + /*
> +  * Minimum tuple length required by the tuple during update operation for doing
> +  * WAL optimization of update operation.
> +  */
> + #define MinHeapTupleSizeForDiffUpdate 128

It's not at all clear to me what threshold to use, and 128 feels high. If you want to select a threshold, I suggest benchmarking through a binary search of small tuple sizes. That being said, though I have no doubt the algorithm will lose when updating a single one-byte column, it will also finish darn quickly. Might it be enough to just run the delta algorithm every time but discard any diff wider than the complete new tuple?

Thanks,
nm
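Noah's closing suggestion (run the delta encoder unconditionally and keep the result only when it beats full-tuple logging) can be sketched as follows. This is an illustrative Python sketch, not PostgreSQL code: `wal_update_payload` and `naive_delta` are hypothetical names, and the toy encoder stands in for whatever delta algorithm is ultimately chosen.

```python
def wal_update_payload(old_tuple: bytes, new_tuple: bytes, make_delta):
    """Always run the delta encoder, but log the delta only when it is
    strictly smaller than the complete new tuple; otherwise fall back
    to full-tuple logging.  No fixed size threshold (such as 128 bytes)
    is needed up front."""
    delta = make_delta(old_tuple, new_tuple)
    if len(delta) < len(new_tuple):
        return ("delta", delta)      # replay will also need the old tuple
    return ("full", new_tuple)       # the delta did not pay off


def naive_delta(old: bytes, new: bytes) -> bytes:
    """Toy encoder: a 4-byte offset of the first differing byte,
    followed by the remainder of the new tuple."""
    p = 0
    while p < min(len(old), len(new)) and old[p] == new[p]:
        p += 1
    return p.to_bytes(4, "little") + new[p:]
```

With this shape, an update touching a few trailing bytes yields a tiny delta, while a completely rewritten row falls back to full-tuple logging automatically, so even a pathological one-byte-column update costs only the (cheap) failed encoding attempt.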
Noah Misch <noah@leadboat.com> writes:
> You cannot assume executor-unmodified columns are also unmodified from
> heap_update()'s perspective. Expansion in one column may instigate TOAST
> compression of a logically-unmodified column, and that counts as a change for
> xlog delta purposes.

Um ... what about BEFORE triggers?

Frankly, I think that expecting the executor to tell you which columns have been modified is a non-starter. We have a solution for HOT and it's silly to do the same thing differently just a few lines away.

			regards, tom lane
On Thursday, September 27, 2012 10:19 AM Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > You cannot assume executor-unmodified columns are also unmodified from
> > heap_update()'s perspective. Expansion in one column may instigate TOAST
> > compression of a logically-unmodified column, and that counts as a change
> > for xlog delta purposes.
>
> Um ... what about BEFORE triggers?

This optimization will not apply in case a BEFORE trigger updates the tuple.

> Frankly, I think that expecting the executor to tell you which columns
> have been modified is a non-starter. We have a solution for HOT and
> it's silly to do the same thing differently just a few lines away.

My apprehension is that it can hurt the performance advantage if we compare all attributes to check which have been modified, and that too under a buffer exclusive lock. In case of HOT, only the index attributes get compared. I agree that doing things differently at two nearby places is not good. So I will do it the same way as for HOT, then take the performance data again, and if there is no big impact we can do it that way.

With Regards,
Amit Kapila.
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On 25.09.2012 18:27, Amit Kapila wrote:
> If you feel it is must to do the comparison, we can do it in same way as we
> identify for HOT?

Yeah. (But as discussed, I think it would be even better to just treat the old and new tuple as an opaque chunk of bytes, and run them through a generic delta algorithm.)

> Can you please explain me why you think that after doing encoding doing LZ
> compression on it is better, as already we have reduced the amount of WAL
> for update by only storing changed column information?
>
> a. is it to further reduce the size of WAL
> b. storing diff WAL in some standard format
> c. or does it give any other kind of benefit

Potentially all of those. I don't know if it'd be better or worse, but my gut feeling is that it would be simpler, and produce even more compact WAL.

Attached is a simple patch to apply LZ compression to update WAL records. I modified the LZ compressor so that it can optionally use a separate "history" data, and the same history data must then be passed to the decompression function. That makes it work as a pretty efficient delta encoder, when you use the old tuple as the history data.
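Heikki's trick of seeding the compressor's history buffer with the old tuple can be demonstrated with zlib's preset-dictionary feature, which behaves the same way. This is only an analogy sketch: PostgreSQL's pglz is a different, homegrown LZ variant, and the sample tuples here are invented.

```python
import zlib

old_tuple = b"(1, 'alice', 'engineering', 'active', 5000)"
new_tuple = b"(1, 'alice', 'engineering', 'active', 5500)"

# Plain compression of the new tuple, with no knowledge of the old one.
plain = zlib.compress(new_tuple, 9)

# Delta compression: the old tuple becomes the preset dictionary, i.e.
# the initial history window, so matches against it cost almost nothing.
comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, zdict=old_tuple)
delta = comp.compress(new_tuple) + comp.flush()

# Decompression must be given the same history to reconstruct the tuple.
decomp = zlib.decompressobj(zlib.MAX_WBITS, zdict=old_tuple)
restored = decomp.decompress(delta) + decomp.flush()

assert restored == new_tuple
assert len(delta) < len(plain)   # the old-tuple history pays off
```

The appeal of this design is exactly what Heikki describes: the encoder is an off-the-shelf compressor, and replay is ordinary decompression with the old tuple supplied as history.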
I ran some performance tests with the modified version of pgbench that you posted earlier:

Current PostgreSQL master
-------------------------
tps = 941.601924 (excluding connections establishing)
pg_xlog_location_diff = 721227944

pglz_wal_update_records.patch
-----------------------------
tps = 1039.792527 (excluding connections establishing)
pg_xlog_location_diff = 419395208

pglz_wal_update_records.patch, COMPRESS_ONLY
--------------------------------------------
tps = 1009.682002 (excluding connections establishing)
pg_xlog_location_diff = 422505104

Amit's wal_update_changes_hot_update.patch
------------------------------------------
tps = 1092.703883 (excluding connections establishing)
pg_xlog_location_diff = 436031544

The COMPRESS_ONLY result is with the attached patch, but it just uses LZ to compress the new tuple, without taking advantage of the old tuple. The pg_xlog_location_diff value is the amount of WAL generated during the pgbench run. Attached is also the shell script I used to run these tests.

The conclusion is that there isn't very much difference among the patches. They all squeeze the WAL to about the same size, and the increase in TPS is roughly the same.

I think more performance testing is required. The modified pgbench test isn't necessarily very representative of a real-life application. The gain (or loss) of this patch is going to depend a lot on how many columns are updated, and in what ways. Need to test more scenarios, with many different database schemas.

The LZ approach has the advantage that it can take advantage of all kinds of similarities between old and new tuple. For example, if you swap the values of two columns, LZ will encode that efficiently. Or if you insert a character in the middle of a long string. On the flipside, it's probably more expensive.
Then again, you have to do a memcmp() to detect which columns have changed with your approach, and that's not free either. That was not yet included in the patch version I tested.

Another consideration is that when you compress the record more, you have less data to calculate CRC for. CRC calculation tends to be quite expensive, so even quite aggressive compression might be a win. Yet another consideration is that the compression/encoding is done while holding a lock on the buffer. For the sake of concurrency, you want to keep the duration the lock is held as short as possible.

- Heikki
Attachment
> On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> On 25.09.2012 18:27, Amit Kapila wrote:
> > If you feel it is must to do the comparison, we can do it in same way
> > as we identify for HOT?
>
> Yeah. (But as discussed, I think it would be even better to just treat
> the old and new tuple as an opaque chunk of bytes, and run them through
> a generic delta algorithm).

Thank you for the modified patch.

> The conclusion is that there isn't very much difference among the
> patches. They all squeeze the WAL to about the same size, and the
> increase in TPS is roughly the same.
>
> I think more performance testing is required. The modified pgbench test
> isn't necessarily very representative of a real-life application. The
> gain (or loss) of this patch is going to depend a lot on how many
> columns are updated, and in what ways. Need to test more scenarios,
> with many different database schemas.
>
> The LZ approach has the advantage that it can take advantage of all
> kinds of similarities between old and new tuple. For example, if you
> swap the values of two columns, LZ will encode that efficiently. Or if
> you insert a character in the middle of a long string. On the flipside,
> it's probably more expensive. Then again, you have to do a memcmp() to
> detect which columns have changed with your approach, and that's not
> free either. That was not yet included in the patch version I tested.
> Another consideration is that when you compress the record more, you
> have less data to calculate CRC for. CRC calculation tends to be quite
> expensive, so even quite aggressive compression might be a win. Yet
> another consideration is that the compression/encoding is done while
> holding a lock on the buffer. For the sake of concurrency, you want to
> keep the duration the lock is held as short as possible.

Now I shall do the various tests for the following and post them here:
a. The attached patch in the mode where it takes advantage of the history tuple.
b. Changing the logic for modified-column calculation to use memcmp().

With Regards,
Amit Kapila.
> On Thursday, September 27, 2012 9:12 AM Noah Misch wrote:
> On Mon, Sep 24, 2012 at 10:57:02AM +0000, Amit kapila wrote:
> > Rebased version of patch based on latest code.
>
> I like the direction you're taking with this patch; the gains are
> striking, especially considering the isolation of the changes.

Thank you for a detailed review of the patch.

> You cannot assume executor-unmodified columns are also unmodified from
> heap_update()'s perspective. Expansion in one column may instigate TOAST
> compression of a logically-unmodified column, and that counts as a change
> for xlog delta purposes. You do currently skip the optimization for
> relations having a TOAST table, but TOAST compression can still apply.
> Observe this with text columns of storage mode PLAIN. I see two ways out:
> skip the new behavior when need_toast=true, or compare all inline column
> data, not just what the executor modified. One can probably construct a
> benchmark favoring either choice. I'd lean toward the latter; wide tuples
> are the kind this change can most help. If the marginal advantage of
> ignoring known-unmodified columns proves important, we can always bring it
> back after designing a way to track which columns changed in the toaster.

You are right that either choice can benefit; we should also see which approach gives better results for most scenarios. In most update cases I have observed, the change in values does not increase the length of the value by much. OTOH I am not sure; maybe there are many more scenarios which change the length of the updated value and can lead to the situation you explain above.

> Given that, why not treat the tuple as an opaque series of bytes and not
> worry about datum boundaries? When several narrow columns change together,
> say a sequence of sixteen smallint columns, you will use fewer binary
> delta commands by representing the change with a single 32-byte
> substitution. If an UPDATE changes just part of a long datum, the delta
> encoding algorithm will still be able to save considerable space. That
> case arises in many forms: changing one word in a long string, changing
> one element in a long array, changing one field of a composite-typed
> column. Granted, this makes the choice of delta encoding algorithm more
> important.
>
> Like Heikki, I'm left wondering why your custom delta encoding is
> preferable to an encoding from the literature. Your encoding has much in
> common with VCDIFF, even sharing two exact command names. If a custom
> encoding is the right thing, code comments or a README section should at
> least discuss the advantages over an established alternative. Idle
> thought: it might pay off to use 1-byte sizes and offsets most of the
> time. Tuples shorter than 256 bytes are common; for longer tuples, we can
> afford wider offsets.

My apprehension was that it could affect performance if we do more work while holding the lock. If we use any standard technique like LZ or VCDIFF, it has the overhead of comparison and other things pertaining to its algorithm. However, using the updated patch by Heikki, I can run the various performance tests both for the update operation as well as recovery.

With Regards,
Amit Kapila.
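To make Noah's 1-byte sizes/offsets idea concrete, here is a hypothetical toy COPY/ADD delta format in Python, simplified to a common-prefix/suffix diff (the real patch and VCDIFF are more general). All names and command byte values are invented for illustration.

```python
import struct

CMD_COPY, CMD_ADD = 0x01, 0x02   # hypothetical command bytes

def put_num(n: int) -> bytes:
    """1-byte size/offset in the common case; 0xFF escape + 4 bytes otherwise."""
    return bytes([n]) if n < 0xFF else b"\xff" + struct.pack("<I", n)

def encode(old: bytes, new: bytes) -> bytes:
    # Find the longest shared prefix and (non-overlapping) shared suffix.
    p, limit = 0, min(len(old), len(new))
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    out = bytearray()
    if p:
        out += bytes([CMD_COPY]) + put_num(0) + put_num(p)            # old prefix
    mid = new[p:len(new) - s]
    if mid:
        out += bytes([CMD_ADD]) + put_num(len(mid)) + mid             # new bytes
    if s:
        out += bytes([CMD_COPY]) + put_num(len(old) - s) + put_num(s) # old suffix
    return bytes(out)

def decode(old: bytes, delta: bytes) -> bytes:
    out, i = bytearray(), 0
    def num():
        nonlocal i
        if delta[i] != 0xFF:
            v = delta[i]; i += 1
        else:
            v = struct.unpack_from("<I", delta, i + 1)[0]; i += 5
        return v
    while i < len(delta):
        cmd = delta[i]; i += 1
        if cmd == CMD_COPY:
            off, n = num(), num()
            out += old[off:off + n]
        else:                        # CMD_ADD: literal bytes follow
            n = num()
            out += delta[i:i + n]; i += n
    return bytes(out)
```

For a tuple shorter than 255 bytes every size and offset fits in a single byte; only larger tuples pay for the 4-byte escape, which matches Noah's observation that short tuples are the common case.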
> On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:
> > On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> > On 25.09.2012 18:27, Amit Kapila wrote:
> > > If you feel it is must to do the comparison, we can do it in same way
> > > as we identify for HOT?
> >
> > Yeah. (But as discussed, I think it would be even better to just treat
> > the old and new tuple as an opaque chunk of bytes, and run them through
> > a generic delta algorithm).
>
> Thank you for the modified patch.
>
> > The conclusion is that there isn't very much difference among the
> > patches. They all squeeze the WAL to about the same size, and the
> > increase in TPS is roughly the same.
> >
> > I think more performance testing is required. The modified pgbench test
> > isn't necessarily very representative of a real-life application. The
> > gain (or loss) of this patch is going to depend a lot on how many
> > columns are updated, and in what ways. Need to test more scenarios,
> > with many different database schemas.

I have done a few and am planning to do more.

> Now I shall do the various tests for following and post it here:
> a. Attached Patch in the mode where it takes advantage of history tuple
> b. By changing the logic for modified column calculation to use memcmp()

Attached documents contain data for the following scenarios for both the 'a' (LZ compression) and 'b' (modified WAL) patches:

1. Using a fixed string (last few bytes are random) to update the column values.
   Total record length = 1800
   Updated columns length = 250
2. Using a random string to update the column values.
   Total record length = 1800
   Updated columns length = 250

Observations -
1. With both patches the performance increase is very good.
2. Almost the same performance increase with both patches, with slightly more for the LZ compression patch.
3. TPS varies with the LZ patch, but if we take the average it is equivalent to the other patch.

Other performance tests I am planning to conduct:
1. Using a bigger random string to update the column values.
   Total record length = 1800
   Updated columns length = 250
2. Using a fixed string (last few bytes are random) to update the column values.
   Total record length = 1800
   Updated columns length = 50, 100, 500, 750, 1000, 1500, 1800
3. Recovery performance test as suggested by Noah.
4. Complete testing of the LZ compression patch using the test cases defined for the original patch.

Kindly suggest more performance test cases which can make the findings concrete, or in case you feel the above is sufficient then please confirm.

With Regards,
Amit Kapila.
On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:
> > On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:
> > > On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> > > On 25.09.2012 18:27, Amit Kapila wrote:
> > > > If you feel it is must to do the comparison, we can do it in same
> > > > way as we identify for HOT?
>
> > Now I shall do the various tests for following and post it here:
> > a. Attached Patch in the mode where it takes advantage of history
> >    tuple
> > b. By changing the logic for modified column calculation to use
> >    memcmp()
>
> Attached documents contain data for following scenarios for both 'a' (LZ
> compression patch) and 'b' (modified wal patch) patches:
>
> 1. Using fixed string (last few bytes are random) to update the column
>    values.
>    Total record length = 1800
>    Updated columns length = 250
> 2. Using random string to update the column values
>    Total record length = 1800
>    Updated columns length = 250
>
> Observations -
> 1. With both patches performance increase is very good.
> 2. Almost same performance increase with both patches, with slightly
>    more for LZ compression patch.
> 3. TPS is varying with LZ patch, but if we take average it is
>    equivalent to other patch.
>
> Other Performance tests I am planning to conduct
> 1. Using bigger random string to update the column values
>    Total record length = 1800
>    Updated columns length = 250
> 2. Using fixed string (last few bytes are random) to update the column
>    values.
>    Total record length = 1800
>    Updated columns length = 50, 100, 500, 750, 1000, 1500, 1800

1. Please find the results (pgbench_test.htm) for point 2, where there is one fixed-column update (last few bytes are random) and the second column update is a 32-byte random string. The results for 50 and 100 are still in progress; the others are attached with this mail.

2. Attached is the pgbench test code for modifications of 25 and 250 bytes in a record of total length 1800. For the other record-size modification tests, the schema is changed accordingly.

3. Added a random string generation function for updating some column data, used from the 250-byte record modification test onwards:

CREATE OR REPLACE FUNCTION random_text_md5_v2(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$
  SELECT upper(
    substring(
      (SELECT string_agg(md5(random()::TEXT), '')
       FROM generate_series(1, CEIL($1 / 32.)::integer)),
      $1)
  );
$$;

4. Observations:
   a. When the size of the updated value is small, the performance is almost the same for both patches.
   b. When the size of the updated value is large, the performance with the LZ patch is better.

> 3. Recovery performance test as suggested by Noah

Still not started.

> 4. Complete testing for LZ compression patch using testcases defined for
>    original patch

a. During testing of the LZ patch, a few issues were found related to updated records containing NULLs. Working on a fix.

Any comments/suggestions regarding performance/functionality tests?

With Regards,
Amit Kapila.
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On 03.10.2012 19:03, Amit Kapila wrote:
> Any comments/suggestions regarding performance/functionality test?

Hmm. Doing a lot of UPDATEs concurrently can be limited by the
WALInsertLock, which each inserter holds while copying the WAL record to
the buffer. Reducing the size of the WAL records, by compression or delta
encoding, alleviates that bottleneck: when WAL records are smaller, the
lock needs to be held for a shorter duration. That improves throughput,
even if individual backends need to do more CPU work to compress the
records, because that work can be done in parallel. I suspect much of the
benefit you're seeing in these tests might be because of that effect.

As it happens, I've been working on making WAL insertion scale better in
general:
http://archives.postgresql.org/message-id/5064779A.3050407@vmware.com.
That should also help most when inserting large WAL records. The question
is: assuming we commit the xloginsert-scale patch, how much benefit is
there left from the compression? It will surely still help to reduce the
size of WAL, which can certainly help if you're limited by the WAL I/O,
but I suspect the results from the pgbench tests you run might look quite
different.

So, could you rerun these tests with the xloginsert-scale patch applied?
Reducing the WAL size might still be a good idea even if the patch
doesn't have much effect on TPS, but I'd like to make sure that the
compression doesn't hurt performance. Also, it would be a good idea to
repeat the tests with just a single client; we don't want to hurt the
performance in that scenario either.

- Heikki
On Thursday, October 04, 2012 12:54 PM Heikki Linnakangas wrote:
> On 03.10.2012 19:03, Amit Kapila wrote:
>> Any comments/suggestions regarding performance/functionality test?
>
> Hmm. Doing a lot of UPDATEs concurrently can be limited by the
> WALInsertLock, which each inserter holds while copying the WAL record to
> the buffer. Reducing the size of the WAL records, by compression or
> delta encoding, alleviates that bottleneck: when WAL records are
> smaller, the lock needs to be held for a shorter duration. That improves
> throughput, even if individual backends need to do more CPU work to
> compress the records, because that work can be done in parallel. I
> suspect much of the benefit you're seeing in these tests might be
> because of that effect.
>
> As it happens, I've been working on making WAL insertion scale better in
> general:
> http://archives.postgresql.org/message-id/5064779A.3050407@vmware.com.
> That should also help most when inserting large WAL records. The
> question is: assuming we commit the xloginsert-scale patch, how much
> benefit is there left from the compression? It will surely still help to
> reduce the size of WAL, which can certainly help if you're limited by
> the WAL I/O, but I suspect the results from the pgbench tests you run
> might look quite different.
>
> So, could you rerun these tests with the xloginsert-scale patch applied?

I shall take care of running the performance tests with the
xloginsert-scale patch as well, for both the single-thread and
multi-thread cases.

With Regards,
Amit Kapila.
On Wednesday, October 03, 2012 9:33 PM Amit Kapila wrote:
> On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:
>> On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:
>>> On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
>>> On 25.09.2012 18:27, Amit kapila wrote:
>>>> If you feel it is a must to do the comparison, we can do it in the
>>>> same way as we identify for HOT?
>>
>> Now I shall do the various tests for the following and post them here:
>> a. The attached patch in the mode where it takes advantage of the
>>    history tuple
>> b. By changing the logic for modified column calculation to use memcmp()
>
> 1. Please find the results (pgbench_test.htm) for point 2, where one
>    update is to a fixed column (last few bytes random) and the second is
>    a 32-byte random string. The runs for 50 and 100 are still going; the
>    others are attached with this mail.

The results for updated record sizes 50 and 100 are attached with this mail.

Observations
a. The performance is comparable for both approaches.

>> 4. Complete testing for the LZ compression patch using the test cases
>>    defined for the original patch
> a. During testing of the LZ patch, a few issues were found when the
>    updated record contains NULLs. Working on a fix.

The problems were:
a. The offset calculation during compression was based on the input buffer
   [new tuple] and the old tuple [history]. The offset should be relative
   to the end of the history. In normal LZ compression the input buffer
   and the history are always adjacent, so there is no problem there.
b. The new tuple contents should not be added to the history buffer, as
   the addresses of the new tuple and the history differ, which would make
   the offset calculation go wrong.

A patch containing fixes for the above problems is attached with this mail.

With Regards,
Amit Kapila.
Attachment
On Thursday, October 04, 2012 8:03 PM Heikki Linnakangas wrote:
> So, could you rerun these tests with the xloginsert-scale patch applied?
> Reducing the WAL size might still be a good idea even if the patch
> doesn't have much effect on TPS, but I'd like to make sure that the
> compression doesn't hurt performance.

Please find the readings of the LZ patch along with the xlog-scale patch.
The comparison for update operations is between:
  base code + xlog scale patch
  base code + xlog scale patch + update WAL optimization (LZ compression)

The readings have been taken with the below configurations:

pgbench_xlog_scale_50 -
  a. Updated record size 50, total record size 1800
  b. Threads 8, 1, 2
  c. synchronous_commit - off, on
pgbench_xlog_scale_250 -
  a. Updated record size 250, total record size 1800
  b. Threads 8, 1, 2
  c. synchronous_commit - off, on
pgbench_xlog_scale_500 -
  a. Updated record size 500, total record size 1800
  b. Threads 8, 1, 2
  c. synchronous_commit - off, on

Observations
--------------
a. There is still a good performance improvement from the update WAL
   optimization on top of the xlog scaling patch.
b. There is a slight performance dip for 1 thread (only with
   synchronous_commit = off) with the update WAL optimization
   (LZ compression), but for 2 threads there is a performance increase.

With Regards,
Amit Kapila.
Sorry, forgot to attach the performance data. It's attached in this mail.

________________________________________
From: Amit kapila
Sent: Saturday, October 06, 2012 7:34 PM
To: 'Heikki Linnakangas'; noah@leadboat.com
Cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Attachment
On Saturday, October 06, 2012 7:34 PM Amit kapila wrote:
> Please find the readings of the LZ patch along with the xlog-scale patch.
> The comparison for update operations is between:
>   base code + xlog scale patch
>   base code + xlog scale patch + update WAL optimization (LZ compression)

Please find attached the recovery performance data and an updated patch
that handles unaligned access of PGLZ_Header in decompression by copying
the header part into a local, aligned variable.

Recovery performance
----------------------------
1. Recovery performance is also better with the LZ compression patch.

Please do let me know if any more data or tests are required for this patch.

With Regards,
Amit Kapila.
Attachment
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Amit kapila wrote:
> Rebased version of patch based on latest code.

Uhm, how can this patch change a caller of PageAddItem() by adding one
more argument, yet not touch bufpage.c at all? Are you sure this
compiles?

The email subject has a WIP tag; is that still the patch status? If so,
I assume it's okay to mark this Returned with Feedback and expect a
later version to be posted.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, October 24, 2012 12:15 AM Alvaro Herrera wrote:
> Amit kapila wrote:
>> Rebased version of patch based on latest code.
>
> Uhm, how can this patch change a caller of PageAddItem() by adding one
> more argument, yet not touch bufpage.c at all? Are you sure this
> compiles?

It compiles; the same is confirmed even with the latest HEAD. Could you
please point out what you feel is done wrong in the patch?

> The email subject has a WIP tag; is that still the patch status? If so,
> I assume it's okay to mark this Returned with Feedback and expect a
> later version to be posted.

The WIP tag is from the original mail-chain discussion. The current
status is as follows: I have updated the patch with all bug fixes, and
the performance results were posted. Noah has also taken performance
data. He believes there is a discrepancy in the performance data, but in
my view the reason is just the way I have posted the data. Currently
there is no clear feedback I can work on, so I would be very thankful if
you could wait for some conclusion of the discussion.

With Regards,
Amit Kapila.