Reduction in WAL for UPDATEs - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Reduction in WAL for UPDATEs |
Date | |
Msg-id | 1175022068.4386.162.camel@silverbirch.site Whole thread Raw |
Responses |
Re: Reduction in WAL for UPDATEs
|
List | pgsql-hackers |
It seems possible to reduce overall WAL volume by roughly 25% on common workloads by optimising the way UPDATE statements generate WAL. I've got a practical proposal that looks like it can be completed in a few more days work and fits very well with HOT UPDATEs. This is likely to have beneficial performance effects for many workloads, as well as having clear benefits for archive/backup by reducing overall WAL volume. Background ---------- A recent profile of the sources of WAL on a major OLTP benchmark: - records is the count of WAL records of that type - % records is the overall distribution by record type - avg space is per record - % space is the overall distribution by WAL volume tag | records | % records | avg space | % space ------------------------+---------+-----------+-----------+---------HEAP_UPDATE | 347822 | 25.91 | 298.72| 53.30BTREE_INSERT_LEAF | 586577 | 43.70 | 68.01 | 20.47HEAP_INSERT | 184712 | 13.76| 122.91 | 11.65BTREE_SPLIT_R | 1435 | 0.11 | 7086.18 | 5.22HEAP_LOCK | 159959| 11.92 | 58.77 | 4.82HEAP_UPDATE+INIT | 12157 | 0.91 | 414.92 | 2.59XACT_COMMIT | 29531 | 2.20 | 40.00 | 0.61BTREE_DELETE | 2806 | 0.21 | 379.45 | 0.55HEAP_DELETE | 13276 | 0.99 | 46.16 | 0.31BTREE_SPLIT_L | 75 | 0.01 | 7234.32| 0.28HEAP_INSERT+INIT | 2305 | 0.17 | 126.69 | 0.15BTREE_INSERT_UPPER | 1497 | 0.11 | 73.96 | 0.06BTREE_SPLIT_L_ROOT | 1 | 0.00 | 6882.00 | 0.00CLOG_ZEROPAGE | 1 | 0.00 | 32.00 | 0.00XLOG_CHECKPOINT_ONLINE | 6 | 0.00 | 72.00 | 0.00BTREE_NEWROOT | 1 | 0.00 | 76.00 | 0.00XACT_ABORT | 155 | 0.01 | 40.00 | 0.00 (17 rows) [Thanks to Greg] UPDATEs clearly generate most of the WAL. In this benchmark and in many other cases, around 10-25% of the data on a row changes at each UPDATE, which would be sufficient to reduce the overall WAL volume by 30-40%. Not all of that potential is practically realisable, see below. [Note also that this profile was performed *before* Heikki's recent optimisation of WAL generated during b-tree splits.] Concept ------- Previously, we've discussed the possibility that WAL can be reduced when performing UPDATEs. Currently, when we UPDATE a row we send the whole new row to WAL, whether or not the columns have changed. If we send only the changed data to WAL then we would need to be able to reconstruct the new tuple during recovery. The changed data is referred to as a "tuple diff" or just diff. Reconstruction would read the old tuple and apply the diff to construct a new tuple, which would then be applied to the data block. This is only practical when the old and new tuple are on the same block, since otherwise this would be likely to increase the I/O required to perform recovery - which is already I/O bound. So this fits nicely with HOT and builds upon the knowledge that with HOT most UPDATEs occur on the same block. Reconstruction is complicated by the absence of a relcache that accurately describes the TupleDesc at the exact point of the UPDATE. However, the tupledesc is available when we *create* the diff, so we can take advantage of that to make sure we optimise for NULL <-> not NULL transitions during UPDATE. So we use the tupledesc to construct a tupledesc-agnostic format which is a set of byte-by-byte instructions rather than being specified in terms of attributes or custom types. The diff instructions allow us to express operations very simply. e.g. CREATE TABLE foo (col1 integer, col2 integer, col3 varchar(50), col4 varhcar(50)); INSERT INTO foo (1, 1, repeat('abc',50)); UPDATE foo SET col2 = 100 WHERE col1 = 1; will generate diff instructions (assuming 4 byte alignment for now) COPY 4 (bytes from old to new tuple) IGNORE 4 (bytes on old tuple) ADD 4 (bytes from new tuple) COPY 156 (bytes from old to new tuple) These diff instructions are based upon the concept of delta instructions from the VCDIFF standard. http://www.faqs.org/rfcs/rfc3284.html This is tailored to this specific implementation, so although we use delta instructions we don't use the VCDIFF format since it caters for longer data which in our case will already have been TOASTed away. With a terse instruction set the diff format can encode the diff instructions in a small number of bytes, considerably reducing the WAL volume yet without requiring complex diffing algorithms. The simplicity of the diff algorithm is important because this introduces additional CPU and potentially contention also, since the diff is calculated while the block is locked. As a result, it is proposed that the diff would only be calculated when the new tuple length is in excess of a hard-coded limit, probably 128 bytes - a very simple heuristic. That seems high, but the current WAL volume is disproportionately generated by very wide tables and this would give a balance between overhead of generating the diff and the overhead from the WAL volume itself. Discussion Points: - diff starts at 128 bytes? - (no additional parameters suggested) - diff still taken whether or not we do full_page_writes Implementation -------------- Implement a new XLOG_HEAP2_UPDATE_DIFF xlog record and appropriate handling, rather than attempting to augment the existing XLOG_HEAP_UPDATE code since that also handles MOVE. I'm expecting there to be about a 100 suggestions for how to optimise the encoding, so please remember the balance between encoding time/effect and the WAL compression ratio. The good thing about the use of the diff format is that it neatly separates creating the diff from decoding it. Initially, my suggestion is to do column-by-column byte comparison to establish which bytes have changed. Its possible to enhance that to do sub-attribute diffs, but I'm not suggesting that in this initial proposal. Alignment is not an issue for creating or using the diff. When there is no BEFORE trigger, we can optimise that further by examining only the columns mentioned in the SET clause. That would further speed up the diff, but its fairly fiddly and I haven't spent time yet looking into that possibility. (Suggestions welcome!). No performance testing has been performed as yet since I haven't written much of the code so far. Diff Format ----------- The diff algorithm requirements seem to be: - accuracy (must be non-lossy) - speed - space saving - cope well with PostgreSQL var length changing values and NULLs Looking at VCDIFF format and thinking some, I need the following primitives in the diff storage format: - COPY - copy the next N bytes from old to new record - ADD - add the Record byte at the current new record location - IGNORE - ignore the next N bytes of the old record, nothing copied The format is similar, but not exactly same as HRL encoded bitmap data using a header and a body array. A header has length, number of instructions and a set of indicator bits. The indicator bit specifies whether the instruction is an ADD or not. ADD 1 set in header bit uint8 num bytes to insert into new tupledata bytes COPY 0 set in header bit uint8 lo 7 bits number of bytes to COPY from old to new hi bit = 1 to indicate COPY IGNORE 0 set in header bituint8 lo 7 bits number of bytes to IGNORE in old tuple hi bit = 0 to indicate IGNORE In the above example shown earlier we would store only 4 instruction bytes, plus 4 changed data bytes, plus 8 byte diff header = 16 bytes, rather than the 156 bytes. Note that we are diffing the tuple body, though storing the tuple header in full. We can tweak that some more in the future, maybe. Note also hat the format is very terse and the words ADD, COPY and IGNORE are conceptual only. Adjacent columns that have similar instructions can be represented by a single instruction, i.e. 4 integer columns that will be COPYd to the new tuple can be represented as COPY 16. As a result of that, I'm proposing that the header has sufficient instruction bits for at most 32 instructions. That helps keep the format simple, but it also allows us to introduce a further heuristic: skip calculating complex diffs because they probably aren't going to save much space anyhow. Very long columns may need multiple instructions to represent them, but that wastes very little space and should not present a problem. Comments, with a view to immediate implementation? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
pgsql-hackers by date: