Re: [WIP] Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers

From Noah Misch
Subject Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Msg-id 20120927034136.GA25353@tornado.leadboat.com
In response to Re: [WIP] Performance Improvement by reducing WAL for Update Operation  (Amit kapila <amit.kapila@huawei.com>)
List pgsql-hackers
On Mon, Sep 24, 2012 at 10:57:02AM +0000, Amit kapila wrote:
> Rebased version of patch based on latest code.

I like the direction you're taking with this patch; the gains are striking,
especially considering the isolation of the changes.

You cannot assume executor-unmodified columns are also unmodified from
heap_update()'s perspective.  Expansion in one column may instigate TOAST
compression of a logically-unmodified column, and that counts as a change for
xlog delta purposes.  You do currently skip the optimization for relations
having a TOAST table, but TOAST compression can still apply.  Observe this
with text columns of storage mode PLAIN.  I see two ways out: skip the new
behavior when need_toast=true, or compare all inline column data, not just
what the executor modified.  One can probably construct a benchmark favoring
either choice.  I'd lean toward the latter; wide tuples are the kind this
change can most help.  If the marginal advantage of ignoring known-unmodified
columns proves important, we can always bring it back after designing a way to
track which columns changed in the toaster.

Given that, why not treat the tuple as an opaque series of bytes and not worry
about datum boundaries?  When several narrow columns change together, say a
sequence of sixteen smallint columns, you will use fewer binary delta commands
by representing the change with a single 32-byte substitution.  If an UPDATE
changes just part of a long datum, the delta encoding algorithm will still be
able to save considerable space.  That case arises in many forms:  changing
one word in a long string, changing one element in a long array, changing one
field of a composite-typed column.  Granted, this makes the choice of delta
encoding algorithm more important.
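Even the crudest byte-oriented scheme illustrates the point.  A sketch, for
illustration only (the helper name and StringInfo usage are mine, and this is
plain prefix/suffix matching, not your command set):

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Encode newp relative to oldp as <prefix length, suffix length,
 * substituted middle bytes>, ignoring datum boundaries entirely.  Sixteen
 * adjacent modified smallint columns come out as one 32-byte substitution.
 */
static void
delta_encode_opaque(const char *oldp, uint32 oldlen,
                    const char *newp, uint32 newlen,
                    StringInfo buf)
{
    uint32      prefixlen = 0;
    uint32      suffixlen = 0;

    while (prefixlen < oldlen && prefixlen < newlen &&
           oldp[prefixlen] == newp[prefixlen])
        prefixlen++;
    while (suffixlen < oldlen - prefixlen &&
           suffixlen < newlen - prefixlen &&
           oldp[oldlen - suffixlen - 1] == newp[newlen - suffixlen - 1])
        suffixlen++;

    appendBinaryStringInfo(buf, (char *) &prefixlen, sizeof(prefixlen));
    appendBinaryStringInfo(buf, (char *) &suffixlen, sizeof(suffixlen));
    appendBinaryStringInfo(buf, newp + prefixlen,
                           newlen - prefixlen - suffixlen);
}

The decoder just memcpy()s the prefix and suffix from the old tuple and the
middle from the WAL record.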

Like Heikki, I'm left wondering why your custom delta encoding is preferable
to an encoding from the literature.  Your encoding has much in common with
VCDIFF, even sharing two exact command names.  If a custom encoding is the
right thing, code comments or a README section should at least discuss the
advantages over an established alternative.  Idle thought: it might pay off to
use 1-byte sizes and offsets most of the time.  Tuples shorter than 256 bytes
are common; for longer tuples, we can afford wider offsets.
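Concretely, the command byte could spare a flag bit to select the width;
sketch only (names and layout invented, same headers as the sketch above):

/* Low bits carry the command; one bit selects the length width. */
#define DELTA_CMD_MASK      0x3F
#define DELTA_LEN_IS_WIDE   0x80    /* set: uint32 length follows, else uint8 */

static void
delta_emit_cmd(StringInfo buf, uint8 cmd, uint32 len)
{
    if (len <= 255)
    {
        uint8       len8 = (uint8) len;

        appendStringInfoChar(buf, (char) cmd);
        appendBinaryStringInfo(buf, (char *) &len8, sizeof(len8));
    }
    else
    {
        appendStringInfoChar(buf, (char) (cmd | DELTA_LEN_IS_WIDE));
        appendBinaryStringInfo(buf, (char *) &len, sizeof(len));
    }
}

The decoder checks DELTA_LEN_IS_WIDE and consumes one or four length bytes
accordingly.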

The benchmarks you posted upthread were helpful.  I think benchmarking with
fsync=off is best unless you have a battery-backed write controller or SSD;
otherwise, fsync time dominates the pgbench run and masks any WAL savings.
Please benchmark recovery.  To
do so, set up WAL archiving and take a base backup from a fresh cluster.  Run
pgbench for a while.  Finally, observe the elapsed time to recover your base
backup to the end of archived WAL.


> *** a/src/backend/access/common/heaptuple.c
> --- b/src/backend/access/common/heaptuple.c

> + /*
> +  * encode_xlog_update
> +  *        Forms a diff tuple from old and new tuple with the modified columns.
> +  *
> +  *        att - attribute list.
> +  *        oldtup - pointer to the old tuple.
> +  *        heaptup - pointer to the modified tuple.
> +  *        wal_tup - pointer to the wal record which needs to be formed from old
> +  *                  and new tuples by using the modified columns list.
> +  *        modifiedCols - modified columns list by the update command.
> +  */
> + void
> + encode_xlog_update(Form_pg_attribute *att, HeapTuple oldtup,
> +                    HeapTuple heaptup, HeapTuple wal_tup,
> +                    Bitmapset *modifiedCols)

This name is too generic for an extern function.  Maybe "heap_delta_encode"?

> + void
> + decode_xlog_update(HeapTupleHeader htup, uint32 old_tup_len, char *data,
> +                    uint32 *new_tup_len, char *waldata, uint32 wal_len)

Likewise, maybe "heap_delta_decode" here.

> *** a/src/backend/access/heap/heapam.c
> --- b/src/backend/access/heap/heapam.c
> ***************
> *** 71,77 ****
>   #include "utils/syscache.h"
>   #include "utils/tqual.h"
>   
> - 
>   /* GUC variable */
>   bool        synchronize_seqscans = true;
>   

Spurious whitespace change.

> ***************
> *** 3195,3204 **** l2:
>       /* XLOG stuff */
>       if (RelationNeedsWAL(relation))
>       {
> !         XLogRecPtr    recptr = log_heap_update(relation, buffer, oldtup.t_self,
> !                                              newbuf, heaptup,
> !                                              all_visible_cleared,
> !                                              all_visible_cleared_new);
>   
>           if (newbuf != buffer)
>           {
> --- 3203,3233 ----
>       /* XLOG stuff */
>       if (RelationNeedsWAL(relation))
>       {
> !         XLogRecPtr    recptr;
> ! 
> !         /*
> !          * Apply the xlog diff update algorithm only for hot updates.
> !          */
> !         if (modifiedCols && use_hot_update)

Why HOT in particular?  I understand the arguments upthread for skipping the
optimization when the update crosses pages, but the other condition for HOT
(no changes to indexed columns) seems irrelevant here.  Why not retest "newbuf
== buffer", instead?

In any event, the comment should memorialize the rationale behind any
excluded cases, not merely restate the obvious fact that the code excludes
them.

For the record, I think that if this pays off for intra-page updates, we
should eventually extend it to cross-page updates under full_page_writes=on.
If we were already logging deltas for all updates, I doubt we would adopt a
proposal to add complete-tuple logging as a disaster recovery aid.  When
something corrupts a block, all bets are off.

> ***************
> *** 5239,5252 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
>       }            tbuf;
>       xl_heap_header xlhdr;
>       int            hsize;
> !     uint32        newlen;
>       Size        freespace;
>   
>       /*
>        * The visibility map may need to be fixed even if the heap page is
>        * already up-to-date.
>        */
> !     if (xlrec->all_visible_cleared)
>       {
>           Relation    reln = CreateFakeRelcacheEntry(xlrec->target.node);
>           BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
> --- 5276,5293 ----
>       }            tbuf;
>       xl_heap_header xlhdr;
>       int            hsize;
> !     uint32        new_tup_len = 0;

This variable rename looks spurious.

>       Size        freespace;
>   
> +     /* Initialize the buffer, used to frame the new tuple */
> +     MemSet((char *) &tbuf.hdr, 0, sizeof(HeapTupleHeaderData));
> +     hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
> + 
>       /*
>        * The visibility map may need to be fixed even if the heap page is
>        * already up-to-date.
>        */
> !     if (xlrec->new_all_visible_cleared & 0x0F)
>       {
>           Relation    reln = CreateFakeRelcacheEntry(xlrec->target.node);
>           BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
> ***************
> *** 5266,5272 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
>       }
>   
>       /* Deal with old tuple version */
> - 
>       buffer = XLogReadBuffer(xlrec->target.node,
>                               ItemPointerGetBlockNumber(&(xlrec->target.tid)),
>                               false);

Spurious whitespace change.

> --- 5307,5312 ----
> ***************
> *** 5291,5296 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
> --- 5331,5359 ----
>   
>       htup = (HeapTupleHeader) PageGetItem(page, lp);
>   
> +     if (xlrec->diff_update)
> +     {
> +         char       *data = (char *) &tbuf.hdr + htup->t_hoff;
> +         uint32        old_tup_len;
> +         uint32        wal_len;
> +         char       *waldata = (char *) xlrec + hsize + htup->t_hoff
> +         - offsetof(HeapTupleHeaderData, t_bits);
> + 
> +         wal_len = record->xl_len - hsize;
> +         Assert(wal_len <= MaxHeapTupleSize);
> + 
> +         wal_len -= (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits));
> + 
> +         old_tup_len = ItemIdGetLength(lp) - htup->t_hoff;
> + 
> +         /* copy exactly the tuple header present in the WAL to new tuple */
> +         memcpy((char *) &tbuf.hdr + offsetof(HeapTupleHeaderData, t_bits),
> +                (char *) xlrec + hsize,
> +                (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits)));
> + 
> +         decode_xlog_update(htup, old_tup_len, data, &new_tup_len, waldata, wal_len);

I think the above code should appear later, with treatment of the new tuple.

encode_xlog_update() and decode_xlog_update() should be essentially-inverse
APIs.  Instead, you have encode_xlog_update() working with HeapTuple arguments
while decode_xlog_update() works with opaque pointers.  encode_xlog_update()
clones the header, but decode_xlog_update() leaves that to its caller.  Those
decisions are convenient enough for these heapam.c callers, but I think
heap_xlog_update() should work harder so the decode_xlog_update() argument
list need not appear ad hoc.
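Perhaps something like (signatures illustrative only):

extern bool heap_delta_encode(TupleDesc tupdesc, HeapTuple oldtup,
                              HeapTuple newtup, HeapTuple wal_tup);
extern void heap_delta_decode(char *encdata, uint32 enclen,
                              HeapTuple oldtup, HeapTuple newtup);

That way, encode and decode both traffic in whole tuples, and any header
cloning happens on a consistent side of the API.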

> ***************
> *** 5317,5322 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
> --- 5380,5386 ----
>        */
>       if (samepage)
>           goto newsame;
> + 
>       PageSetLSN(page, lsn);
>       PageSetTLI(page, ThisTimeLineID);
>       MarkBufferDirty(buffer);

Spurious whitespace change.

> *** a/src/backend/executor/nodeModifyTable.c
> --- b/src/backend/executor/nodeModifyTable.c

> ***************
> *** 554,559 **** lreplace:;
> --- 557,571 ----
>           if (resultRelationDesc->rd_att->constr)
>               ExecConstraints(resultRelInfo, slot, estate);
>   
> +         /* check whether the xlog diff update can be applied or not? */
> +         if ((resultRelationDesc->rd_toastoid == InvalidOid)
> +             && (tuple_bf_trigger == tuple)
> +             && (tuple->t_len > MinHeapTupleSizeForDiffUpdate))

Having the executor apply these tests introduces a modularity violation.

If any of these restrictions are to remain, a comment at the code enforcing
them should give the rationale for each.

> *** a/src/include/access/heapam_xlog.h
> --- b/src/include/access/heapam_xlog.h
> ***************
> *** 142,149 **** typedef struct xl_heap_update
>   {
>       xl_heaptid    target;            /* deleted tuple id */
>       ItemPointerData newtid;        /* new inserted tuple id */
> !     bool        all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
> !     bool        new_all_visible_cleared;        /* same for the page of newtid */
>       /* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
>   } xl_heap_update;
>   
> --- 142,155 ----
>   {
>       xl_heaptid    target;            /* deleted tuple id */
>       ItemPointerData newtid;        /* new inserted tuple id */
> !     bool    diff_update;        /* optimized update or not */
> !     /*
> !      * To keep the structure size same all_visible_cleared is merged with
> !      * new_all_visible_cleared.
> !      */
> !     bool    new_all_visible_cleared; /* MSB 4 bits tells PD_ALL_VISIBLE    was
> !                                         cleared of new page and rest 4 bits
> !                                         for the old page */

In place of these two fields, store three flags in a uint8 field.
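For example (names and bit assignments arbitrary):

#define XL_HEAP_UPDATE_DIFF                     0x01    /* delta-encoded */
#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED      0x02    /* old page */
#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED  0x04    /* new page */

with a single "uint8 flags;" member replacing both boolean fields.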

>       /* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
>   } xl_heap_update;
>   
> *** a/src/include/access/htup_details.h
> --- b/src/include/access/htup_details.h
> ***************
> *** 528,533 **** struct MinimalTupleData
> --- 528,546 ----
>           HeapTupleHeaderSetOid((tuple)->t_data, (oid))
>   
>   
> + /* WAL Diff update options */
> + #define HEAP_UPDATE_WAL_OPT_COPY 0
> + #define HEAP_UPDATE_WAL_OPT_ADD  1
> + #define HEAP_UPDATE_WAL_OPT_IGN  2
> + #define HEAP_UPDATE_WAL_OPT_PAD  3

These defines can remain private to the file implementing the encoding.

> + 
> + /*
> +  * Minimum tuple length required by the tuple during update operation for doing
> +  * WAL optimization of update operation.
> +  */
> + #define MinHeapTupleSizeForDiffUpdate 128

It's not at all clear to me what threshold to use, and 128 feels high.  If you
want to select a threshold, I suggest benchmarking through a binary search of
small tuple sizes.  That being said, though I have no doubt the algorithm will
lose when updating a single one-byte column, it will also finish darn quickly.
Might it be enough to just run the delta algorithm every time but discard any
diff wider than the complete new tuple?
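Roughly (a sketch; assumes the encode function, whatever its final name,
returns false when it has nothing useful to offer):

    /* Always attempt the delta; keep it only when it actually saves space. */
    if (!heap_delta_encode(tupdesc, &oldtup, heaptup, wal_tup) ||
        wal_tup->t_len >= heaptup->t_len)
    {
        /* log the complete new tuple, as the code does today */
    }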

Thanks,
nm


