Re: [WIP] Performance Improvement by reducing WAL for Update Operation - Mailing list pgsql-hackers

From:           Noah Misch
Subject:        Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Msg-id:         20120927034136.GA25353@tornado.leadboat.com
In response to: Re: [WIP] Performance Improvement by reducing WAL for Update Operation (Amit kapila <amit.kapila@huawei.com>)
Responses:      Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
                Re: [WIP] Performance Improvement by reducing WAL for Update Operation
List:           pgsql-hackers
On Mon, Sep 24, 2012 at 10:57:02AM +0000, Amit kapila wrote:
> Rebased version of patch based on latest code.

I like the direction you're taking with this patch; the gains are striking, especially considering the isolation of the changes.

You cannot assume executor-unmodified columns are also unmodified from heap_update()'s perspective.  Expansion in one column may instigate TOAST compression of a logically-unmodified column, and that counts as a change for xlog delta purposes.  You do currently skip the optimization for relations having a TOAST table, but TOAST compression can still apply.  Observe this with text columns of storage mode PLAIN.  I see two ways out: skip the new behavior when need_toast=true, or compare all inline column data, not just what the executor modified.  One can probably construct a benchmark favoring either choice.  I'd lean toward the latter; wide tuples are the kind this change can most help.  If the marginal advantage of ignoring known-unmodified columns proves important, we can always bring it back after designing a way to track which columns changed in the toaster.

Given that, why not treat the tuple as an opaque series of bytes and not worry about datum boundaries?  When several narrow columns change together, say a sequence of sixteen smallint columns, you will use fewer binary delta commands by representing the change with a single 32-byte substitution.  If an UPDATE changes just part of a long datum, the delta encoding algorithm will still be able to save considerable space.  That case arises in many forms: changing one word in a long string, changing one element in a long array, changing one field of a composite-typed column.  (A sketch of this byte-oriented approach follows the quoted hunk below.)

Granted, this makes the choice of delta encoding algorithm more important.  Like Heikki, I'm left wondering why your custom delta encoding is preferable to an encoding from the literature.  Your encoding has much in common with VCDIFF, even sharing two exact command names.  If a custom encoding is the right thing, code comments or a README section should at least discuss the advantages over an established alternative.

Idle thought: it might pay off to use 1-byte sizes and offsets most of the time.  Tuples shorter than 256 bytes are common; for longer tuples, we can afford wider offsets.

The benchmarks you posted upthread were helpful.  I think benchmarking with fsync=off is best if you don't have a battery-backed write controller or SSD.  Otherwise, fsync time dominates a pgbench run.

Please benchmark recovery.  To do so, set up WAL archiving and take a base backup from a fresh cluster.  Run pgbench for a while.  Finally, observe the elapsed time to recover your base backup to the end of archived WAL.

> *** a/src/backend/access/common/heaptuple.c
> --- b/src/backend/access/common/heaptuple.c

> + /*
> +  * encode_xlog_update
> +  *		Forms a diff tuple from old and new tuple with the modified columns.
> +  *
> +  * att - attribute list.
> +  * oldtup - pointer to the old tuple.
> +  * heaptup - pointer to the modified tuple.
> +  * wal_tup - pointer to the wal record which needs to be formed from old
> +  		and new tuples by using the modified columns list.
> +  * modifiedCols - modified columns list by the update command.
> +  */
> + void
> + encode_xlog_update(Form_pg_attribute *att, HeapTuple oldtup,
> + 				   HeapTuple heaptup, HeapTuple wal_tup,
> + 				   Bitmapset *modifiedCols)

This name is too generic for an extern function.  Maybe "heap_delta_encode"?
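To illustrate the byte-oriented idea (this is a standalone sketch of the general approach, not the patch's encoding; TupleDelta and every other identifier here are made up): treat both tuple versions as byte arrays, match the longest shared prefix and suffix, and represent everything in between as one substitution.

#include <stdint.h>

/*
 * Hypothetical illustration only.  Sixteen adjacent modified smallint
 * columns cost a single 32-byte substitution here, instead of sixteen
 * per-datum delta commands.
 */
typedef struct TupleDelta
{
	uint32_t	prefix_len;		/* bytes shared with the old tuple's start */
	uint32_t	suffix_len;		/* bytes shared with the old tuple's end */
	uint32_t	subst_len;		/* length of the replacement bytes */
	const char *subst;			/* replacement bytes, inside the new tuple */
} TupleDelta;

static void
heap_delta_encode(const char *oldtup, uint32_t oldlen,
				  const char *newtup, uint32_t newlen,
				  TupleDelta *delta)
{
	uint32_t	maxshared = (oldlen < newlen) ? oldlen : newlen;
	uint32_t	prefix = 0;
	uint32_t	suffix = 0;

	/* Longest common prefix of the two byte arrays. */
	while (prefix < maxshared && oldtup[prefix] == newtup[prefix])
		prefix++;
	/* Longest common suffix of whatever the prefix did not consume. */
	while (suffix < maxshared - prefix &&
		   oldtup[oldlen - 1 - suffix] == newtup[newlen - 1 - suffix])
		suffix++;

	/*
	 * Per the idle thought above, prefix_len and suffix_len would each fit
	 * in one byte whenever the tuple is shorter than 256 bytes.
	 */
	delta->prefix_len = prefix;
	delta->suffix_len = suffix;
	delta->subst_len = newlen - prefix - suffix;
	delta->subst = newtup + prefix;
}

Decoding is the obvious inverse: copy prefix_len bytes from the old tuple, append subst, then copy the old tuple's last suffix_len bytes.  A real encoder would still want VCDIFF-style commands for the general case; the point is only that datum boundaries need not drive the command stream.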
> + void
> + decode_xlog_update(HeapTupleHeader htup, uint32 old_tup_len, char *data,
> + 				   uint32 *new_tup_len, char *waldata, uint32 wal_len)

Likewise, maybe "heap_delta_decode" here.

> *** a/src/backend/access/heap/heapam.c
> --- b/src/backend/access/heap/heapam.c
> ***************
> *** 71,77 ****
>   #include "utils/syscache.h"
>   #include "utils/tqual.h"
>   
> - 
>   /* GUC variable */
>   bool		synchronize_seqscans = true;
>   

Spurious whitespace change.

> ***************
> *** 3195,3204 **** l2:
>   		/* XLOG stuff */
>   		if (RelationNeedsWAL(relation))
>   		{
> ! 			XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
> ! 												 newbuf, heaptup,
> ! 												 all_visible_cleared,
> ! 												 all_visible_cleared_new);
>   
>   			if (newbuf != buffer)
>   			{
> --- 3203,3233 ----
>   		/* XLOG stuff */
>   		if (RelationNeedsWAL(relation))
>   		{
> ! 			XLogRecPtr	recptr;
> ! 
> ! 			/*
> ! 			 * Apply the xlog diff update algorithm only for hot updates.
> ! 			 */
> ! 			if (modifiedCols && use_hot_update)

Why HOT in particular?  I understand the arguments upthread for skipping the optimization when the update crosses pages, but the other condition for HOT (no changes to indexed columns) seems irrelevant here.  Why not retest "newbuf == buffer" instead (see the sketch after the quoted hunks below)?  In any event, the comment should memorialize the rationale behind any excluded cases, not merely restate the obvious fact that the code excludes them.

For the record, I think that if this pays off for intra-page updates, we should eventually extend it to cross-page updates under full_page_writes=on.  If we were already logging deltas for all updates, I doubt we would adopt a proposal to add complete-tuple logging as a disaster recovery aid.  When something corrupts a block, all bets are off.

> ***************
> *** 5239,5252 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
>   	}			tbuf;
>   	xl_heap_header xlhdr;
>   	int			hsize;
> ! 	uint32		newlen;
>   	Size		freespace;
>   
>   	/*
>   	 * The visibility map may need to be fixed even if the heap page is
>   	 * already up-to-date.
>   	 */
> ! 	if (xlrec->all_visible_cleared)
>   	{
>   		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
>   		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
> --- 5276,5293 ----
>   	}			tbuf;
>   	xl_heap_header xlhdr;
>   	int			hsize;
> ! 	uint32		new_tup_len = 0;

This variable rename looks spurious.

>   	Size		freespace;
>   
> + 	/* Initialize the buffer, used to frame the new tuple */
> + 	MemSet((char *) &tbuf.hdr, 0, sizeof(HeapTupleHeaderData));
> + 	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
> + 
>   	/*
>   	 * The visibility map may need to be fixed even if the heap page is
>   	 * already up-to-date.
>   	 */
> ! 	if (xlrec->new_all_visible_cleared & 0x0F)
>   	{
>   		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
>   		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
> ***************
> *** 5266,5272 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
>   	}
>   
>   	/* Deal with old tuple version */
> - 
>   	buffer = XLogReadBuffer(xlrec->target.node,
>   							ItemPointerGetBlockNumber(&(xlrec->target.tid)),
>   							false);

Spurious whitespace change.
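To be concrete about the retest suggested above, the gate amounts to no more than the following (a standalone sketch; the helper and the stub typedefs are hypothetical, with parameter names taken from the quoted hunk):

#include <stdbool.h>
#include <stddef.h>

/* Stub types so the fragment stands alone; PostgreSQL defines the real ones. */
typedef int Buffer;
typedef struct Bitmapset Bitmapset;

/*
 * Delta-encode iff we know which columns changed and both tuple versions
 * stay on the same page; whether indexed columns changed (the other HOT
 * condition) does not matter here.
 */
static inline bool
can_delta_encode_update(Bitmapset *modifiedCols, Buffer newbuf, Buffer buffer)
{
	return modifiedCols != NULL && newbuf == buffer;
}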
> --- 5307,5312 ----
> ***************
> *** 5291,5296 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
> --- 5331,5359 ----
>   
>   		htup = (HeapTupleHeader) PageGetItem(page, lp);
>   
> + 		if (xlrec->diff_update)
> + 		{
> + 			char	   *data = (char *) &tbuf.hdr + htup->t_hoff;
> + 			uint32		old_tup_len;
> + 			uint32		wal_len;
> + 			char	   *waldata = (char *) xlrec + hsize + htup->t_hoff
> + 							- offsetof(HeapTupleHeaderData, t_bits);
> + 
> + 			wal_len = record->xl_len - hsize;
> + 			Assert(wal_len <= MaxHeapTupleSize);
> + 
> + 			wal_len -= (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits));
> + 
> + 			old_tup_len = ItemIdGetLength(lp) - htup->t_hoff;
> + 
> + 			/* copy exactly the tuple header present in the WAL to new tuple */
> + 			memcpy((char *) &tbuf.hdr + offsetof(HeapTupleHeaderData, t_bits),
> + 				   (char *) xlrec + hsize,
> + 				   (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits)));
> + 
> + 			decode_xlog_update(htup, old_tup_len, data, &new_tup_len, waldata, wal_len);

I think the above code should appear later, with the treatment of the new tuple.

encode_xlog_update() and decode_xlog_update() should be essentially-inverse APIs.  Instead, you have encode_xlog_update() working with HeapTuple arguments while decode_xlog_update() works with opaque pointers.  encode_xlog_update() clones the header, but decode_xlog_update() leaves that to its caller.  Those decisions are convenient enough for these heapam.c callers, but I think heap_xlog_update() should work harder so the decode_xlog_update() argument list need not appear ad hoc.

> ***************
> *** 5317,5322 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
> --- 5380,5386 ----
>   	 */
>   	if (samepage)
>   		goto newsame;
> + 
>   	PageSetLSN(page, lsn);
>   	PageSetTLI(page, ThisTimeLineID);
>   	MarkBufferDirty(buffer);

Spurious whitespace change.

> *** a/src/backend/executor/nodeModifyTable.c
> --- b/src/backend/executor/nodeModifyTable.c
> ***************
> *** 554,559 **** lreplace:;
> --- 557,571 ----
>   		if (resultRelationDesc->rd_att->constr)
>   			ExecConstraints(resultRelInfo, slot, estate);
>   
> + 		/* check whether the xlog diff update can be applied or not? */
> + 		if ((resultRelationDesc->rd_toastoid == InvalidOid)
> + 			&& (tuple_bf_trigger == tuple)
> + 			&& (tuple->t_len > MinHeapTupleSizeForDiffUpdate))

Having the executor apply these tests introduces a modularity violation.  If any of these restrictions are to remain, a comment at the code enforcing them should give the rationale for each.

> *** a/src/include/access/heapam_xlog.h
> --- b/src/include/access/heapam_xlog.h
> ***************
> *** 142,149 **** typedef struct xl_heap_update
>   {
>   	xl_heaptid	target;			/* deleted tuple id */
>   	ItemPointerData newtid;		/* new inserted tuple id */
> ! 	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
> ! 	bool		new_all_visible_cleared;	/* same for the page of newtid */
>   	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
>   } xl_heap_update;
>   
> --- 142,155 ----
>   {
>   	xl_heaptid	target;			/* deleted tuple id */
>   	ItemPointerData newtid;		/* new inserted tuple id */
> ! 	bool		diff_update;	/* optimized update or not */
> ! 	/*
> ! 	 * To keep the structure size same all_visible_cleared is merged with
> ! 	 * new_all_visible_cleared.
> ! 	 */
> ! 	bool		new_all_visible_cleared;	/* MSB 4 bits tells PD_ALL_VISIBLE
> ! 											   was cleared of new page and rest
> ! 											   4 bits for the old page */

In place of these two fields, store three flags in a uint8 field.
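Something along these lines, assuming invented flag names (the patch defines none of them):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names; a single uint8 replaces the two bool fields. */
#define XLH_UPDATE_DIFF                     0x01	/* tuple data is a delta */
#define XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED  0x02	/* PD_ALL_VISIBLE cleared, old page */
#define XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED  0x04	/* same, for the page of newtid */

static inline uint8_t
make_update_flags(bool diff, bool old_cleared, bool new_cleared)
{
	return (diff ? XLH_UPDATE_DIFF : 0)
		| (old_cleared ? XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED : 0)
		| (new_cleared ? XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED : 0);
}

That keeps xl_heap_update no larger than before, leaves five spare bits for later, and avoids the nibble-packing in the quoted hunk.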
>   	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
>   } xl_heap_update;
>   
> *** a/src/include/access/htup_details.h
> --- b/src/include/access/htup_details.h
> ***************
> *** 528,533 **** struct MinimalTupleData
> --- 528,546 ----
>   	HeapTupleHeaderSetOid((tuple)->t_data, (oid))
>   
>   
> + /* WAL Diff update options */
> + #define HEAP_UPDATE_WAL_OPT_COPY	0
> + #define HEAP_UPDATE_WAL_OPT_ADD	1
> + #define HEAP_UPDATE_WAL_OPT_IGN	2
> + #define HEAP_UPDATE_WAL_OPT_PAD	3

These defines can remain private to the file implementing the encoding.

> + 
> + /*
> +  * Minimum tuple length required by the tuple during update operation for doing
> +  * WAL optimization of update operation.
> +  */
> + #define MinHeapTupleSizeForDiffUpdate	128

It's not at all clear to me what threshold to use, and 128 feels high.  If you want to select a threshold, I suggest benchmarking through a binary search of small tuple sizes.  That being said, though I have no doubt the algorithm will lose when updating a single one-byte column, it will also finish darn quickly.  Might it be enough to just run the delta algorithm every time but discard any diff wider than the complete new tuple?

Thanks,
nm