Thread: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
Sent: Monday, August 27, 2012 5:58 PM
To: Amit kapila
On 27.08.2012 15:18, Amit kapila wrote:
>>> I have implemented the WAL Reduction Patch for the case of HOT Update as
>>> pointed out by Simon and Robert. In this patch it only goes for Optimized
>>> WAL in case of HOT Update with other restrictions same as in previous patch.
>>
>>> The performance numbers for this patch are attached in this mail. It has
>>> improved by 90% if the page has fillfactor 80.
>>
>>> Now going forward I have following options:
>>> a. Upload the patch in Open CF for WAL Reduction which contains
>>> reduction for HOT and non-HOT updates.
>>> b. Upload the patch in Open CF for WAL Reduction which contains
>>> reduction for HOT updates.
>>> c. Upload both the patches as different versions.
>> Let's do it for HOT updates only. Simon & Robert made good arguments on
>> why this is a bad idea for non-HOT updates.
>Okay, I shall do it that way.
>So now I shall send information about all the testing I have done for this
>Patch and then Upload it in CF.
Rebased version of patch based on latest code.
With Regards,
Amit Kapila.
Attachment
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On 24.09.2012 13:57, Amit kapila wrote:
> Rebased version of patch based on latest code.

When HOT was designed, we decided that heap_update needs to compare the old and new attributes directly, with memcmp(), to determine whether any of the indexed columns have changed. It was not deemed feasible to pass down that information from the executor. I don't remember the details of why that was, but you seem to be trying the same thing in this patch, passing the bitmap of modified cols from the executor to heap_update(). I'm pretty sure that won't work, for the same reasons we didn't do it for HOT.

I still feel that it would probably be better to use a generic delta encoding scheme, instead of inventing one. How about VCDIFF (http://tools.ietf.org/html/rfc3284), for example? Or you could reuse the LZ compressor that we already have in the source tree. You can use LZ for delta compression by initializing the history buffer of the algorithm with the old tuple, and then compressing the new tuple as usual. Or you could still use the knowledge of where the attributes begin and end and which attributes were updated, and do the encoding similar to how you did in the patch, but use LZ as the output format. That way the decoding would be the same as LZ decompression.

- Heikki
> On Tuesday, September 25, 2012 7:30 PM Heikki Linnakangas wrote:
> On 24.09.2012 13:57, Amit kapila wrote:
> > Rebased version of patch based on latest code.
>
> When HOT was designed, we decided that heap_update needs to compare the
> old and new attributes directly, with memcmp(), to determine whether any
> of the indexed columns have changed. It was not deemed feasible to
> pass down that information from the executor. I don't remember the
> details of why that was, but you seem to be trying the same thing in this
> patch, and pass the bitmap of modified cols from the executor to
> heap_update(). I'm pretty sure that won't work, for the same reasons we
> didn't do it for HOT.

I think the reason for not relying on the modified-columns bitmap is that there are cases where it does not give correct information; for example, BEFORE triggers can change columns beyond those recorded as modified, which is why HOT needs to do the comparison. In our case we have taken care of such cases by skipping the optimization, rather than relying on the modified columns. If you feel the comparison is a must, we can do it the same way as we identify columns for HOT?

> I still feel that it would probably be better to use a generic delta
> encoding scheme, instead of inventing one. How about VCDIFF
> (http://tools.ietf.org/html/rfc3284), for example? Or you could reuse
> the LZ compressor that we already have in the source tree. You can use
> LZ for delta compression by initializing the history buffer of the
> algorithm with the old tuple, and then compressing the new tuple as
> usual. Or you could still use the knowledge of where the attributes
> begin and end and which attributes were updated, and do the encoding
> similar to how you did in the patch, but use LZ as the output format.
> That way the decoding would be the same as LZ decompression.

Can you please explain why you think that doing LZ compression on top of the encoding is better, given that we have already reduced the amount of WAL for update by storing only the changed column information?
a. is it to further reduce the size of WAL
b. storing diff WAL in some standard format
c. or does it give any other kind of benefit

With Regards,
Amit Kapila.
On Mon, Sep 24, 2012 at 10:57:02AM +0000, Amit kapila wrote:
> Rebased version of patch based on latest code.

I like the direction you're taking with this patch; the gains are striking, especially considering the isolation of the changes.

You cannot assume executor-unmodified columns are also unmodified from heap_update()'s perspective. Expansion in one column may instigate TOAST compression of a logically-unmodified column, and that counts as a change for xlog delta purposes. You do currently skip the optimization for relations having a TOAST table, but TOAST compression can still apply. Observe this with text columns of storage mode PLAIN. I see two ways out: skip the new behavior when need_toast=true, or compare all inline column data, not just what the executor modified. One can probably construct a benchmark favoring either choice. I'd lean toward the latter; wide tuples are the kind this change can most help. If the marginal advantage of ignoring known-unmodified columns proves important, we can always bring it back after designing a way to track which columns changed in the toaster.

Given that, why not treat the tuple as an opaque series of bytes and not worry about datum boundaries? When several narrow columns change together, say a sequence of sixteen smallint columns, you will use fewer binary delta commands by representing the change with a single 32-byte substitution. If an UPDATE changes just part of a long datum, the delta encoding algorithm will still be able to save considerable space. That case arises in many forms: changing one word in a long string, changing one element in a long array, changing one field of a composite-typed column. Granted, this makes the choice of delta encoding algorithm more important.

Like Heikki, I'm left wondering why your custom delta encoding is preferable to an encoding from the literature. Your encoding has much in common with VCDIFF, even sharing two exact command names.
If a custom encoding is the right thing, code comments or a README section should at least discuss the advantages over an established alternative. Idle thought: it might pay off to use 1-byte sizes and offsets most of the time. Tuples shorter than 256 bytes are common; for longer tuples, we can afford wider offsets.

The benchmarks you posted upthread were helpful. I think benchmarking with fsync=off is best if you don't have a battery-backed write controller or SSD. Otherwise, fsync time dominates a pgbench run.

Please benchmark recovery. To do so, set up WAL archiving and take a base backup from a fresh cluster. Run pgbench for awhile. Finally, observe the elapsed time to recover your base backup to the end of archived WAL.

> *** a/src/backend/access/common/heaptuple.c
> --- b/src/backend/access/common/heaptuple.c

> + /*
> +  * encode_xlog_update
> +  *  Forms a diff tuple from old and new tuple with the modified columns.
> +  *
> +  * att - attribute list.
> +  * oldtup - pointer to the old tuple.
> +  * heaptup - pointer to the modified tuple.
> +  * wal_tup - pointer to the wal record which needs to be formed from old
> +  *           and new tuples by using the modified columns list.
> +  * modifiedCols - modified columns list by the update command.
> +  */
> + void
> + encode_xlog_update(Form_pg_attribute *att, HeapTuple oldtup,
> +                    HeapTuple heaptup, HeapTuple wal_tup,
> +                    Bitmapset *modifiedCols)

This name is too generic for an extern function. Maybe "heap_delta_encode"?

> + void
> + decode_xlog_update(HeapTupleHeader htup, uint32 old_tup_len, char *data,
> +                    uint32 *new_tup_len, char *waldata, uint32 wal_len)

Likewise, maybe "heap_delta_decode" here.

> *** a/src/backend/access/heap/heapam.c
> --- b/src/backend/access/heap/heapam.c
> ***************
> *** 71,77 ****
>   #include "utils/syscache.h"
>   #include "utils/tqual.h"
>
> -
>   /* GUC variable */
>   bool synchronize_seqscans = true;
>

Spurious whitespace change.
> ***************
> *** 3195,3204 **** l2:
>   /* XLOG stuff */
>   if (RelationNeedsWAL(relation))
>   {
> !     XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
> !                                         newbuf, heaptup,
> !                                         all_visible_cleared,
> !                                         all_visible_cleared_new);
>
>       if (newbuf != buffer)
>       {
> --- 3203,3233 ----
>   /* XLOG stuff */
>   if (RelationNeedsWAL(relation))
>   {
> !     XLogRecPtr recptr;
> !
> !     /*
> !      * Apply the xlog diff update algorithm only for hot updates.
> !      */
> !     if (modifiedCols && use_hot_update)

Why HOT in particular? I understand the arguments upthread for skipping the optimization when the update crosses pages, but the other condition for HOT (no changes to indexed columns) seems irrelevant here. Why not retest "newbuf == buffer", instead? In any event, the comment should memorialize the rationale behind any excluded cases, not merely restate the obvious fact that the code excludes them.

For the record, I think that if this pays off for intra-page updates, we should eventually extend it to cross-page updates under full_page_writes=on. If we were already logging deltas for all updates, I doubt we would adopt a proposal to add complete-tuple logging as a disaster recovery aid. When something corrupts a block, all bets are off.

> ***************
> *** 5239,5252 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
>   } tbuf;
>   xl_heap_header xlhdr;
>   int hsize;
> ! uint32 newlen;
>   Size freespace;
>
>   /*
>    * The visibility map may need to be fixed even if the heap page is
>    * already up-to-date.
>    */
> ! if (xlrec->all_visible_cleared)
>   {
>       Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
>       BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
> --- 5276,5293 ----
>   } tbuf;
>   xl_heap_header xlhdr;
>   int hsize;
> ! uint32 new_tup_len = 0;

This variable rename looks spurious.
>   Size freespace;
>
> + /* Initialize the buffer, used to frame the new tuple */
> + MemSet((char *) &tbuf.hdr, 0, sizeof(HeapTupleHeaderData));
> + hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
> +
>   /*
>    * The visibility map may need to be fixed even if the heap page is
>    * already up-to-date.
>    */
> ! if (xlrec->new_all_visible_cleared & 0x0F)
>   {
>       Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
>       BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
> ***************
> *** 5266,5272 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
>   }
>
>   /* Deal with old tuple version */
> -
>   buffer = XLogReadBuffer(xlrec->target.node,
>                           ItemPointerGetBlockNumber(&(xlrec->target.tid)),
>                           false);

Spurious whitespace change.

> --- 5307,5312 ----
> ***************
> *** 5291,5296 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
> --- 5331,5359 ----
>
>   htup = (HeapTupleHeader) PageGetItem(page, lp);
>
> + if (xlrec->diff_update)
> + {
> +     char *data = (char *) &tbuf.hdr + htup->t_hoff;
> +     uint32 old_tup_len;
> +     uint32 wal_len;
> +     char *waldata = (char *) xlrec + hsize + htup->t_hoff
> +                     - offsetof(HeapTupleHeaderData, t_bits);
> +
> +     wal_len = record->xl_len - hsize;
> +     Assert(wal_len <= MaxHeapTupleSize);
> +
> +     wal_len -= (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits));
> +
> +     old_tup_len = ItemIdGetLength(lp) - htup->t_hoff;
> +
> +     /* copy exactly the tuple header present in the WAL to new tuple */
> +     memcpy((char *) &tbuf.hdr + offsetof(HeapTupleHeaderData, t_bits),
> +            (char *) xlrec + hsize,
> +            (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits)));
> +
> +     decode_xlog_update(htup, old_tup_len, data, &new_tup_len, waldata, wal_len);

I think the above code should appear later, with treatment of the new tuple. encode_xlog_update() and decode_xlog_update() should be essentially-inverse APIs.
Instead, you have encode_xlog_update() working with HeapTuple arguments while decode_xlog_update() works with opaque pointers. encode_xlog_update() clones the header, but decode_xlog_update() leaves that to its caller. Those decisions are convenient enough for these heapam.c callers, but I think heap_xlog_update() should work harder so the decode_xlog_update() argument list need not appear ad hoc.

> ***************
> *** 5317,5322 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
> --- 5380,5386 ----
>   */
>   if (samepage)
>       goto newsame;
> +
>   PageSetLSN(page, lsn);
>   PageSetTLI(page, ThisTimeLineID);
>   MarkBufferDirty(buffer);

Spurious whitespace change.

> *** a/src/backend/executor/nodeModifyTable.c
> --- b/src/backend/executor/nodeModifyTable.c
> ***************
> *** 554,559 **** lreplace:;
> --- 557,571 ----
>   if (resultRelationDesc->rd_att->constr)
>       ExecConstraints(resultRelInfo, slot, estate);
>
> + /* check whether the xlog diff update can be applied or not? */
> + if ((resultRelationDesc->rd_toastoid == InvalidOid)
> +     && (tuple_bf_trigger == tuple)
> +     && (tuple->t_len > MinHeapTupleSizeForDiffUpdate))

Having the executor apply these tests introduces a modularity violation. If any of these restrictions are to remain, a comment at the code enforcing them should give rationale for each.

> *** a/src/include/access/heapam_xlog.h
> --- b/src/include/access/heapam_xlog.h
> ***************
> *** 142,149 **** typedef struct xl_heap_update
>   {
>       xl_heaptid target; /* deleted tuple id */
>       ItemPointerData newtid; /* new inserted tuple id */
> !     bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
> !     bool new_all_visible_cleared; /* same for the page of newtid */
>       /* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
>   } xl_heap_update;
>
> --- 142,155 ----
>   {
>       xl_heaptid target; /* deleted tuple id */
>       ItemPointerData newtid; /* new inserted tuple id */
> !     bool diff_update; /* optimized update or not */
> !     /*
> !      * To keep the structure size same all_visible_cleared is merged with
> !      * new_all_visible_cleared.
> !      */
> !     bool new_all_visible_cleared; /* MSB 4 bits tells PD_ALL_VISIBLE was
> !                                      cleared of new page and rest 4 bits
> !                                      for the old page */

In place of these two fields, store three flags in a uint8 field.

>       /* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
>   } xl_heap_update;
>
> *** a/src/include/access/htup_details.h
> --- b/src/include/access/htup_details.h
> ***************
> *** 528,533 **** struct MinimalTupleData
> --- 528,546 ----
>   HeapTupleHeaderSetOid((tuple)->t_data, (oid))
>
>
> + /* WAL Diff update options */
> + #define HEAP_UPDATE_WAL_OPT_COPY 0
> + #define HEAP_UPDATE_WAL_OPT_ADD 1
> + #define HEAP_UPDATE_WAL_OPT_IGN 2
> + #define HEAP_UPDATE_WAL_OPT_PAD 3

These defines can remain private to the file implementing the encoding.

> +
> + /*
> +  * Minimum tuple length required by the tuple during update operation for doing
> +  * WAL optimization of update operation.
> +  */
> + #define MinHeapTupleSizeForDiffUpdate 128

It's not at all clear to me what threshold to use, and 128 feels high. If you want to select a threshold, I suggest benchmarking through a binary search of small tuple sizes. That being said, though I have no doubt the algorithm will lose when updating a single one-byte column, it will also finish darn quickly. Might it be enough to just run the delta algorithm every time but discard any diff wider than the complete new tuple?

Thanks,
nm
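Noah's closing suggestion (run the delta encoder unconditionally and keep the result only when it beats full-tuple logging) can be sketched as follows. This is an illustrative Python sketch, not PostgreSQL code: `wal_update_payload` and `naive_delta` are hypothetical names, and the toy encoder stands in for whatever delta algorithm is ultimately chosen.

```python
def wal_update_payload(old_tuple: bytes, new_tuple: bytes, make_delta):
    """Always run the delta encoder, but log the delta only when it is
    strictly smaller than the complete new tuple; otherwise fall back
    to full-tuple logging.  No fixed size threshold (such as 128 bytes)
    is needed up front."""
    delta = make_delta(old_tuple, new_tuple)
    if len(delta) < len(new_tuple):
        return ("delta", delta)      # replay will also need the old tuple
    return ("full", new_tuple)       # the delta did not pay off


def naive_delta(old: bytes, new: bytes) -> bytes:
    """Toy encoder: a 4-byte offset of the first differing byte,
    followed by the remainder of the new tuple."""
    p = 0
    while p < min(len(old), len(new)) and old[p] == new[p]:
        p += 1
    return p.to_bytes(4, "little") + new[p:]
```

With this shape, an update touching a few trailing bytes yields a tiny delta, while a completely rewritten row falls back to full-tuple logging automatically, so even a pathological one-byte-column update costs only the (cheap) failed encoding attempt.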
Noah Misch <noah@leadboat.com> writes:
> You cannot assume executor-unmodified columns are also unmodified from
> heap_update()'s perspective. Expansion in one column may instigate TOAST
> compression of a logically-unmodified column, and that counts as a change for
> xlog delta purposes.

Um ... what about BEFORE triggers?

Frankly, I think that expecting the executor to tell you which columns have been modified is a non-starter. We have a solution for HOT and it's silly to do the same thing differently just a few lines away.

			regards, tom lane
On Thursday, September 27, 2012 10:19 AM Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > You cannot assume executor-unmodified columns are also unmodified from
> > heap_update()'s perspective. Expansion in one column may instigate TOAST
> > compression of a logically-unmodified column, and that counts as a change
> > for xlog delta purposes.
>
> Um ... what about BEFORE triggers?

This optimization will not apply in case a BEFORE trigger updates the tuple.

> Frankly, I think that expecting the executor to tell you which columns
> have been modified is a non-starter. We have a solution for HOT and
> it's silly to do the same thing differently just a few lines away.

My apprehension is that it can hurt the performance advantage if we compare all attributes to check which have been modified, and that too under a buffer exclusive lock. In case of HOT, only the index attributes get compared. I agree that doing things differently at two nearby places is not good. So I will do it the same way as for HOT, then take the performance data again, and if there is no big impact we can do it that way.

With Regards,
Amit Kapila.
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On 25.09.2012 18:27, Amit Kapila wrote:
> If you feel it is must to do the comparison, we can do it in same way as we
> identify for HOT?

Yeah. (But as discussed, I think it would be even better to just treat the old and new tuple as an opaque chunk of bytes, and run them through a generic delta algorithm.)

> Can you please explain me why you think that after doing encoding doing LZ
> compression on it is better, as already we have reduced the amount of WAL
> for update by only storing changed column information?
>
> a. is it to further reduce the size of WAL
> b. storing diff WAL in some standard format
> c. or does it give any other kind of benefit

Potentially all of those. I don't know if it'd be better or worse, but my gut feeling is that it would be simpler, and produce even more compact WAL.

Attached is a simple patch to apply LZ compression to update WAL records. I modified the LZ compressor so that it can optionally use a separate "history" data, and the same history data must then be passed to the decompression function. That makes it work as a pretty efficient delta encoder, when you use the old tuple as the history data.
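Heikki's trick of seeding the compressor's history buffer with the old tuple can be demonstrated with zlib's preset-dictionary feature, which behaves the same way. This is only an analogy sketch: PostgreSQL's pglz is a different, homegrown LZ variant, and the sample tuples here are invented.

```python
import zlib

old_tuple = b"(1, 'alice', 'engineering', 'active', 5000)"
new_tuple = b"(1, 'alice', 'engineering', 'active', 5500)"

# Plain compression of the new tuple, with no knowledge of the old one.
plain = zlib.compress(new_tuple, 9)

# Delta compression: the old tuple becomes the preset dictionary, i.e.
# the initial history window, so matches against it cost almost nothing.
comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, zdict=old_tuple)
delta = comp.compress(new_tuple) + comp.flush()

# Decompression must be given the same history to reconstruct the tuple.
decomp = zlib.decompressobj(zlib.MAX_WBITS, zdict=old_tuple)
restored = decomp.decompress(delta) + decomp.flush()

assert restored == new_tuple
assert len(delta) < len(plain)   # the old-tuple history pays off
```

The appeal of this design is exactly what Heikki describes: the encoder is an off-the-shelf compressor, and replay is ordinary decompression with the old tuple supplied as history.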
I ran some performance tests with the modified version of pgbench that you posted earlier:

Current PostgreSQL master
-------------------------
tps = 941.601924 (excluding connections establishing)
pg_xlog_location_diff = 721227944

pglz_wal_update_records.patch
-----------------------------
tps = 1039.792527 (excluding connections establishing)
pg_xlog_location_diff = 419395208

pglz_wal_update_records.patch, COMPRESS_ONLY
--------------------------------------------
tps = 1009.682002 (excluding connections establishing)
pg_xlog_location_diff = 422505104

Amit's wal_update_changes_hot_update.patch
------------------------------------------
tps = 1092.703883 (excluding connections establishing)
pg_xlog_location_diff = 436031544

The COMPRESS_ONLY result is with the attached patch, but it just uses LZ to compress the new tuple, without taking advantage of the old tuple. The pg_xlog_location_diff value is the amount of WAL generated during the pgbench run. Attached is also the shell script I used to run these tests.

The conclusion is that there isn't very much difference among the patches. They all squeeze the WAL to about the same size, and the increase in TPS is roughly the same.

I think more performance testing is required. The modified pgbench test isn't necessarily very representative of a real-life application. The gain (or loss) of this patch is going to depend a lot on how many columns are updated, and in what ways. Need to test more scenarios, with many different database schemas.

The LZ approach has the advantage that it can take advantage of all kinds of similarities between old and new tuple. For example, if you swap the values of two columns, LZ will encode that efficiently. Or if you insert a character in the middle of a long string. On the flipside, it's probably more expensive.
Then again, you have to do a memcmp() to detect which columns have changed with your approach, and that's not free either. That was not yet included in the patch version I tested.

Another consideration is that when you compress the record more, you have less data to calculate CRC for. CRC calculation tends to be quite expensive, so even quite aggressive compression might be a win. Yet another consideration is that the compression/encoding is done while holding a lock on the buffer. For the sake of concurrency, you want to keep the duration the lock is held as short as possible.

- Heikki
Attachment
> On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> On 25.09.2012 18:27, Amit Kapila wrote:
> > If you feel it is must to do the comparison, we can do it in same way
> > as we identify for HOT?
>
> Yeah. (But as discussed, I think it would be even better to just treat
> the old and new tuple as an opaque chunk of bytes, and run them through
> a generic delta algorithm).

Thank you for the modified patch.

> The conclusion is that there isn't very much difference among the
> patches. They all squeeze the WAL to about the same size, and the
> increase in TPS is roughly the same.
>
> I think more performance testing is required. The modified pgbench test
> isn't necessarily very representative of a real-life application. The
> gain (or loss) of this patch is going to depend a lot on how many
> columns are updated, and in what ways. Need to test more scenarios,
> with many different database schemas.
>
> The LZ approach has the advantage that it can take advantage of all
> kinds of similarities between old and new tuple. For example, if you
> swap the values of two columns, LZ will encode that efficiently. Or if
> you insert a character in the middle of a long string. On the flipside,
> it's probably more expensive. Then again, you have to do a memcmp() to
> detect which columns have changed with your approach, and that's not
> free either. That was not yet included in the patch version I tested.
> Another consideration is that when you compress the record more, you
> have less data to calculate CRC for. CRC calculation tends to be quite
> expensive, so even quite aggressive compression might be a win. Yet
> another consideration is that the compression/encoding is done while
> holding a lock on the buffer. For the sake of concurrency, you want to
> keep the duration the lock is held as short as possible.

Now I shall do the various tests for the following and post them here:
a. The attached patch in the mode where it takes advantage of the history tuple.
b. Changing the logic for modified-column calculation to use memcmp().

With Regards,
Amit Kapila.
> On Thursday, September 27, 2012 9:12 AM Noah Misch wrote:
> On Mon, Sep 24, 2012 at 10:57:02AM +0000, Amit kapila wrote:
> > Rebased version of patch based on latest code.
>
> I like the direction you're taking with this patch; the gains are
> striking, especially considering the isolation of the changes.

Thank you for a detailed review of the patch.

> You cannot assume executor-unmodified columns are also unmodified from
> heap_update()'s perspective. Expansion in one column may instigate TOAST
> compression of a logically-unmodified column, and that counts as a change
> for xlog delta purposes. You do currently skip the optimization for
> relations having a TOAST table, but TOAST compression can still apply.
> Observe this with text columns of storage mode PLAIN. I see two ways out:
> skip the new behavior when need_toast=true, or compare all inline column
> data, not just what the executor modified. One can probably construct a
> benchmark favoring either choice. I'd lean toward the latter; wide tuples
> are the kind this change can most help. If the marginal advantage of
> ignoring known-unmodified columns proves important, we can always bring it
> back after designing a way to track which columns changed in the toaster.

You are right that either choice can benefit; we should also see which approach gives better results for most scenarios. In most update cases I have observed, the change in values does not increase the length of the value by much. OTOH I am not sure; maybe there are many more scenarios which change the length of the updated value and can lead to the situation you explain above.

> Given that, why not treat the tuple as an opaque series of bytes and not
> worry about datum boundaries? When several narrow columns change together,
> say a sequence of sixteen smallint columns, you will use fewer binary
> delta commands by representing the change with a single 32-byte
> substitution. If an UPDATE changes just part of a long datum, the delta
> encoding algorithm will still be able to save considerable space. That
> case arises in many forms: changing one word in a long string, changing
> one element in a long array, changing one field of a composite-typed
> column. Granted, this makes the choice of delta encoding algorithm more
> important.
>
> Like Heikki, I'm left wondering why your custom delta encoding is
> preferable to an encoding from the literature. Your encoding has much in
> common with VCDIFF, even sharing two exact command names. If a custom
> encoding is the right thing, code comments or a README section should at
> least discuss the advantages over an established alternative. Idle
> thought: it might pay off to use 1-byte sizes and offsets most of the
> time. Tuples shorter than 256 bytes are common; for longer tuples, we can
> afford wider offsets.

My apprehension was that it could affect performance if we do more work while holding the lock. If we use any standard technique like LZ or VCDIFF, it has the overhead of comparison and other things pertaining to its algorithm. However, using the updated patch by Heikki, I can run the various performance tests both for the update operation as well as recovery.

With Regards,
Amit Kapila.
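To make Noah's 1-byte sizes/offsets idea concrete, here is a hypothetical toy COPY/ADD delta format in Python, simplified to a common-prefix/suffix diff (the real patch and VCDIFF are more general). All names and command byte values are invented for illustration.

```python
import struct

CMD_COPY, CMD_ADD = 0x01, 0x02   # hypothetical command bytes

def put_num(n: int) -> bytes:
    """1-byte size/offset in the common case; 0xFF escape + 4 bytes otherwise."""
    return bytes([n]) if n < 0xFF else b"\xff" + struct.pack("<I", n)

def encode(old: bytes, new: bytes) -> bytes:
    # Find the longest shared prefix and (non-overlapping) shared suffix.
    p, limit = 0, min(len(old), len(new))
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    out = bytearray()
    if p:
        out += bytes([CMD_COPY]) + put_num(0) + put_num(p)            # old prefix
    mid = new[p:len(new) - s]
    if mid:
        out += bytes([CMD_ADD]) + put_num(len(mid)) + mid             # new bytes
    if s:
        out += bytes([CMD_COPY]) + put_num(len(old) - s) + put_num(s) # old suffix
    return bytes(out)

def decode(old: bytes, delta: bytes) -> bytes:
    out, i = bytearray(), 0
    def num():
        nonlocal i
        if delta[i] != 0xFF:
            v = delta[i]; i += 1
        else:
            v = struct.unpack_from("<I", delta, i + 1)[0]; i += 5
        return v
    while i < len(delta):
        cmd = delta[i]; i += 1
        if cmd == CMD_COPY:
            off, n = num(), num()
            out += old[off:off + n]
        else:                        # CMD_ADD: literal bytes follow
            n = num()
            out += delta[i:i + n]; i += n
    return bytes(out)
```

For a tuple shorter than 255 bytes every size and offset fits in a single byte; only larger tuples pay for the 4-byte escape, which matches Noah's observation that short tuples are the common case.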
> On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:
> > On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> > On 25.09.2012 18:27, Amit Kapila wrote:
> > > If you feel it is must to do the comparison, we can do it in same way
> > > as we identify for HOT?
> >
> > Yeah. (But as discussed, I think it would be even better to just treat
> > the old and new tuple as an opaque chunk of bytes, and run them through
> > a generic delta algorithm).
>
> Thank you for the modified patch.
>
> > The conclusion is that there isn't very much difference among the
> > patches. They all squeeze the WAL to about the same size, and the
> > increase in TPS is roughly the same.
> >
> > I think more performance testing is required. The modified pgbench test
> > isn't necessarily very representative of a real-life application. The
> > gain (or loss) of this patch is going to depend a lot on how many
> > columns are updated, and in what ways. Need to test more scenarios,
> > with many different database schemas.

I have done a few and am planning to do more.

> Now I shall do the various tests for following and post it here:
> a. Attached Patch in the mode where it takes advantage of history tuple
> b. By changing the logic for modified column calculation to use memcmp()

Attached documents contain data for the following scenarios for both the 'a' (LZ compression) and 'b' (modified WAL) patches:

1. Using a fixed string (last few bytes are random) to update the column values.
   Total record length = 1800
   Updated columns length = 250
2. Using a random string to update the column values.
   Total record length = 1800
   Updated columns length = 250

Observations -
1. With both patches the performance increase is very good.
2. Almost the same performance increase with both patches, with slightly more for the LZ compression patch.
3. TPS varies with the LZ patch, but if we take the average it is equivalent to the other patch.

Other performance tests I am planning to conduct:
1. Using a bigger random string to update the column values.
   Total record length = 1800
   Updated columns length = 250
2. Using a fixed string (last few bytes are random) to update the column values.
   Total record length = 1800
   Updated columns length = 50, 100, 500, 750, 1000, 1500, 1800
3. Recovery performance test as suggested by Noah.
4. Complete testing of the LZ compression patch using the test cases defined for the original patch.

Kindly suggest more performance test cases which can make the findings concrete, or in case you feel the above is sufficient then please confirm.

With Regards,
Amit Kapila.
On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:
> > On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:
> > > On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> > > On 25.09.2012 18:27, Amit Kapila wrote:
> > > > If you feel it is must to do the comparison, we can do it in same
> > > > way as we identify for HOT?
>
> > Now I shall do the various tests for following and post it here:
> > a. Attached Patch in the mode where it takes advantage of history
> >    tuple
> > b. By changing the logic for modified column calculation to use
> >    memcmp()
>
> Attached documents contain data for following scenarios for both 'a' (LZ
> compression patch) and 'b' (modified wal patch) patches:
>
> 1. Using fixed string (last few bytes are random) to update the column
>    values.
>    Total record length = 1800
>    Updated columns length = 250
> 2. Using random string to update the column values
>    Total record length = 1800
>    Updated columns length = 250
>
> Observations -
> 1. With both patches performance increase is very good.
> 2. Almost same performance increase with both patches, with slightly
>    more for LZ compression patch.
> 3. TPS is varying with LZ patch, but if we take average it is
>    equivalent to other patch.
>
> Other Performance tests I am planning to conduct
> 1. Using bigger random string to update the column values
>    Total record length = 1800
>    Updated columns length = 250
> 2. Using fixed string (last few bytes are random) to update the column
>    values.
>    Total record length = 1800
>    Updated columns length = 50, 100, 500, 750, 1000, 1500, 1800

1. Please find the results (pgbench_test.htm) for point 2, where there is one fixed-column update (last few bytes are random) and the second column update is a 32-byte random string. The results for 50 and 100 are still in progress; the others are attached with this mail.

2. Attached is the pgbench test code for modifications of 25 and 250 bytes in a record of total length 1800. For the other record-size modification tests, the schema is changed accordingly.

3. Added a random string generation function for updating some column data, used from the 250-byte record modification test onwards:

CREATE OR REPLACE FUNCTION random_text_md5_v2(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$
  SELECT upper(
    substring(
      (SELECT string_agg(md5(random()::TEXT), '')
       FROM generate_series(1, CEIL($1 / 32.)::integer)),
      $1)
  );
$$;

4. Observations:
   a. When the size of the updated value is small, the performance is almost the same for both patches.
   b. When the size of the updated value is large, the performance with the LZ patch is better.

> 3. Recovery performance test as suggested by Noah

Still not started.

> 4. Complete testing for LZ compression patch using testcases defined for
>    original patch

a. During testing of the LZ patch, a few issues were found related to updated records containing NULLs. Working on a fix.

Any comments/suggestions regarding performance/functionality tests?

With Regards,
Amit Kapila.
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On 03.10.2012 19:03, Amit Kapila wrote:
> Any comments/suggestions regarding performance/functionality test?

Hmm. Doing a lot of UPDATEs concurrently can be limited by the
WALInsertLock, which each inserter holds while copying the WAL record to
the buffer. Reducing the size of the WAL records, by compression or delta
encoding, alleviates that bottleneck: when WAL records are smaller, the
lock needs to be held for a shorter duration. That improves throughput,
even if individual backends need to do more CPU work to compress the
records, because that work can be done in parallel. I suspect much of the
benefit you're seeing in these tests might be because of that effect.

As it happens, I've been working on making WAL insertion scale better in
general:
http://archives.postgresql.org/message-id/5064779A.3050407@vmware.com.
That should also help most when inserting large WAL records. The question
is: assuming we commit the xloginsert-scale patch, how much benefit is
there left from the compression? It will surely still help to reduce the
size of WAL, which can certainly help if you're limited by the WAL I/O,
but I suspect the results from the pgbench tests you run might look quite
different.

So, could you rerun these tests with the xloginsert-scale patch applied?
Reducing the WAL size might still be a good idea even if the patch
doesn't have much effect on TPS, but I'd like to make sure that the
compression doesn't hurt performance. Also, it would be a good idea to
repeat the tests with just a single client; we don't want to hurt the
performance in that scenario either.

- Heikki
On Thursday, October 04, 2012 12:54 PM Heikki Linnakangas wrote:
> On 03.10.2012 19:03, Amit Kapila wrote:
>> Any comments/suggestions regarding performance/functionality test?
>
> Hmm. Doing a lot of UPDATEs concurrently can be limited by the
> WALInsertLock, which each inserter holds while copying the WAL record to
> the buffer. Reducing the size of the WAL records, by compression or
> delta encoding, alleviates that bottleneck: when WAL records are
> smaller, the lock needs to be held for a shorter duration. That improves
> throughput, even if individual backends need to do more CPU work to
> compress the records, because that work can be done in parallel. I
> suspect much of the benefit you're seeing in these tests might be
> because of that effect.
>
> As it happens, I've been working on making WAL insertion scale better in
> general:
> http://archives.postgresql.org/message-id/5064779A.3050407@vmware.com.
> That should also help most when inserting large WAL records. The
> question is: assuming we commit the xloginsert-scale patch, how much
> benefit is there left from the compression? It will surely still help to
> reduce the size of WAL, which can certainly help if you're limited by
> the WAL I/O, but I suspect the results from the pgbench tests you run
> might look quite different.
>
> So, could you rerun these tests with the xloginsert-scale patch applied?

I shall take care of running the performance tests with the
xloginsert-scale patch as well, for both the single-thread and
multi-thread cases.

With Regards,
Amit Kapila.
On Wednesday, October 03, 2012 9:33 PM Amit Kapila wrote:
> On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:
>> On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:
>>> On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
>>> On 25.09.2012 18:27, Amit kapila wrote:
>>>> If you feel it is a must to do the comparison, we can do it in the
>>>> same way as we identify for HOT?
>>
>> Now I shall do the various tests for the following and post them here:
>> a. The attached patch in the mode where it takes advantage of the
>>    history tuple
>> b. By changing the logic for modified column calculation to use memcmp()
>
> 1. Please find the results (pgbench_test.htm) for point 2, where one
>    update is to a fixed column (last few bytes random) and the second is
>    a 32-byte random string. The runs for 50 and 100 are still going; the
>    others are attached with this mail.

The results for updated record sizes 50 and 100 are attached with this mail.

Observations
a. The performance is comparable for both approaches.

>> 4. Complete testing for the LZ compression patch using the test cases
>>    defined for the original patch
> a. During testing of the LZ patch, a few issues were found when the
>    updated record contains NULLs. Working on a fix.

The problems were:
a. The offset calculation during compression was based on the input buffer
   [new tuple] and the old tuple [history]. The offset should be relative
   to the end of the history. In normal LZ compression the input buffer
   and the history are always adjacent, so there is no problem there.
b. The new tuple contents should not be added to the history buffer, as
   the addresses of the new tuple and the history differ, which would make
   the offset calculation go wrong.

A patch containing fixes for the above problems is attached with this mail.

With Regards,
Amit Kapila.
Attachment
On Thursday, October 04, 2012 8:03 PM Heikki Linnakangas wrote:
> So, could you rerun these tests with the xloginsert-scale patch applied?
> Reducing the WAL size might still be a good idea even if the patch
> doesn't have much effect on TPS, but I'd like to make sure that the
> compression doesn't hurt performance.

Please find the readings of the LZ patch along with the xlog-scale patch.
The comparison for update operations is between:
  base code + xlog scale patch
  base code + xlog scale patch + update WAL optimization (LZ compression)

The readings have been taken with the below configurations:

pgbench_xlog_scale_50 -
  a. Updated record size 50, total record size 1800
  b. Threads 8, 1, 2
  c. synchronous_commit - off, on
pgbench_xlog_scale_250 -
  a. Updated record size 250, total record size 1800
  b. Threads 8, 1, 2
  c. synchronous_commit - off, on
pgbench_xlog_scale_500 -
  a. Updated record size 500, total record size 1800
  b. Threads 8, 1, 2
  c. synchronous_commit - off, on

Observations
--------------
a. There is still a good performance improvement from the update WAL
   optimization on top of the xlog scaling patch.
b. There is a slight performance dip for 1 thread (only with
   synchronous_commit = off) with the update WAL optimization
   (LZ compression), but for 2 threads there is a performance increase.

With Regards,
Amit Kapila.
Sorry, forgot to attach the performance data. It's attached in this mail.

________________________________________
From: Amit kapila
Sent: Saturday, October 06, 2012 7:34 PM
To: 'Heikki Linnakangas'; noah@leadboat.com
Cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Attachment
On Saturday, October 06, 2012 7:34 PM Amit kapila wrote:
> Please find the readings of the LZ patch along with the xlog-scale patch.
> The comparison for update operations is between:
>   base code + xlog scale patch
>   base code + xlog scale patch + update WAL optimization (LZ compression)

Please find attached the recovery performance data and an updated patch
that handles unaligned access of PGLZ_Header in decompression by copying
the header part into a local, aligned variable.

Recovery performance
----------------------------
1. Recovery performance is also better with the LZ compression patch.

Please do let me know if any more data or tests are required for this patch.

With Regards,
Amit Kapila.
Attachment
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Amit kapila wrote:
> Rebased version of patch based on latest code.

Uhm, how can this patch change a caller of PageAddItem() by adding one
more argument, yet not touch bufpage.c at all? Are you sure this
compiles?

The email subject has a WIP tag; is that still the patch status? If so,
I assume it's okay to mark this Returned with Feedback and expect a
later version to be posted.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, October 24, 2012 12:15 AM Alvaro Herrera wrote:
> Amit kapila wrote:
>> Rebased version of patch based on latest code.
>
> Uhm, how can this patch change a caller of PageAddItem() by adding one
> more argument, yet not touch bufpage.c at all? Are you sure this
> compiles?

It compiles; the same is confirmed even with the latest HEAD. Could you
please point out what you feel is done wrong in the patch?

> The email subject has a WIP tag; is that still the patch status? If so,
> I assume it's okay to mark this Returned with Feedback and expect a
> later version to be posted.

The WIP tag is from the original mail-chain discussion. The current
status is as follows: I have updated the patch with all bug fixes, and
the performance results were posted. Noah has also taken performance
data. He believes there is a discrepancy in the performance data, but in
my view the reason is just the way I have posted the data. Currently
there is no clear feedback I can work on, so I would be very thankful if
you could wait for some conclusion of the discussion.

With Regards,
Amit Kapila.