Thread: [WIP] Performance Improvement by reducing WAL for Update Operation
Problem statement:
-----------------------------------
Reduce the WAL size of update operations to improve performance.
Advantages:
---------------------
1. Observed increase in performance with pgbench when the server is running with synchronous_commit = off.
a. with pgbench (tpc_b) - 13%
b. with modified pgbench (where the modified columns are much smaller than the whole row) - 83%
2. WAL size is reduced.
Design/Implementation:
------------------------------
Currently the change applies only to fixed-length columns in simple tables, and the tuple must not contain NULLs.
This is a proof of concept; the design and implementation will need to change based on the final design chosen to handle the other scenarios.
Update operation:
-----------------------------
1. Check whether the table is simple (no TOAST, no BEFORE UPDATE triggers).
2. Works only for tuples containing no NULLs.
3. Identify the modified columns from the target entry.
4. Based on the modified column list, check whether any variable-length columns are modified; if so, this optimization is not applied.
5. Identify the offset and length of each modified column and store them in an optimized WAL tuple in the following format (a standalone sketch of this encoding follows the format below).
Note: The WAL update header is modified to denote whether this WAL optimization was applied.
WAL update header + Tuple header (no change from previous format) +
[offset(2bytes)] [length(2 bytes)] [changed data value]
[offset(2bytes)] [length(2 bytes)] [changed data value]
....
....
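
For illustration, a minimal standalone C sketch of an encoder that emits the [offset][length][value] triples described above; the names (ChangedColumn, encode_update_delta) are invented for this sketch and are not from the patch.

#include <stdint.h>
#include <string.h>

typedef struct ChangedColumn
{
    uint16_t    offset;         /* byte offset of the column within the tuple data */
    uint16_t    length;         /* byte length of the (fixed-length) column */
} ChangedColumn;

/*
 * Append one [offset][length][value] triple per changed column to "dest",
 * taking the changed values from the new tuple's data area. Returns the
 * number of bytes written, i.e. the size of the optimized WAL tuple body.
 */
static size_t
encode_update_delta(const char *new_tuple_data,
                    const ChangedColumn *cols, int ncols,
                    char *dest)
{
    size_t      pos = 0;
    int         i;

    for (i = 0; i < ncols; i++)
    {
        memcpy(dest + pos, &cols[i].offset, sizeof(uint16_t));
        pos += sizeof(uint16_t);
        memcpy(dest + pos, &cols[i].length, sizeof(uint16_t));
        pos += sizeof(uint16_t);
        memcpy(dest + pos, new_tuple_data + cols[i].offset, cols[i].length);
        pos += cols[i].length;
    }
    return pos;
}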
Recovery:
----------------
The following steps apply only when the tuple is optimized.
6. The old tuple is required to form the new tuple (even when the old tuple itself needs no modification).
7. Form the new tuple based on the format specified in point 5.
8. Once the new tuple is formed, follow the existing behavior.
Frame the new tuple from the old tuple and the WAL record (a sketch of this procedure follows the list below):
1. The length of data to be copied from the old tuple is calculated as the difference between the offset in the WAL record and the current old-tuple offset. (Initially, the old-tuple offset is zero.)
2. Once that old tuple data is copied, increase the old-tuple offset by the copied length.
3. Get the length and value of the modified column from the WAL record and copy the value into the new tuple.
4. Increase the old-tuple offset by the modified column's length.
5. Repeat this procedure until the end of the WAL record is reached.
6. Copy any remaining old tuple data that is left over.
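
As a companion to the encoder sketch above, an equally hypothetical decoder mirroring steps 1-6: "delta"/"delta_len" is the [offset][length][value] stream from the WAL record, and "old_data" is the old tuple's data area. For fixed-length columns the old and new values have the same length, which is what step 4 relies on.

#include <stdint.h>
#include <string.h>

static size_t
decode_update_delta(const char *old_data, size_t old_len,
                    const char *delta, size_t delta_len,
                    char *new_data)
{
    size_t      dpos = 0;       /* read position in the delta stream */
    size_t      old_off = 0;    /* copy position in the old tuple (step 1: starts at zero) */
    size_t      npos = 0;       /* write position in the new tuple */

    while (dpos < delta_len)
    {
        uint16_t    offset;
        uint16_t    length;

        memcpy(&offset, delta + dpos, sizeof(uint16_t));
        dpos += sizeof(uint16_t);
        memcpy(&length, delta + dpos, sizeof(uint16_t));
        dpos += sizeof(uint16_t);

        /* steps 1-2: copy the unchanged bytes preceding this column */
        memcpy(new_data + npos, old_data + old_off, offset - old_off);
        npos += offset - old_off;
        old_off = offset;

        /* step 3: copy the changed value from the WAL record */
        memcpy(new_data + npos, delta + dpos, length);
        npos += length;
        dpos += length;

        /* step 4: skip the old value of the modified column */
        old_off += length;
    }

    /* step 6: copy any remaining old tuple data */
    memcpy(new_data + npos, old_data + old_off, old_len - old_off);
    npos += old_len - old_off;

    return npos;
}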
Test results:
----------------------
1. Each pgbench test was run for 10 minutes.
2. The pgbench result for tpc-b is attached to this mail as pgbench_org.
3. The modified pgbench result (where the modified columns are much smaller than the whole row) for tpc-b is attached to this mail as pgbench_1800_300.
Modified pgbench code:
---------------------------------------
1. The table schemas were modified by adding extra fields to increase the record size to 1800 bytes.
2. The tpc_b benchmark script was changed to perform only update operations.
3. The update operation was changed to update 3 columns totaling 300 bytes out of the total row size of 1800 bytes.
4. The NULL value insertions were removed from table initialization.
I am working on a solution to handle the other scenarios: variable-length columns, tuples containing NULLs, and handling of BEFORE triggers.
Please provide suggestions/objections.
On 03.08.2012 14:46, Amit kapila wrote:
> Currently the change applies only to fixed-length columns in simple tables,
> and the tuple must not contain NULLs.
>
> This is a proof of concept; the design and implementation will need to
> change based on the final design chosen to handle the other scenarios.
>
> Update operation:
> -----------------------------
> 1. Check whether the table is simple (no TOAST, no BEFORE UPDATE triggers).
> 2. Works only for tuples containing no NULLs.
> 3. Identify the modified columns from the target entry.
> 4. Based on the modified column list, check whether any variable-length
> columns are modified; if so, this optimization is not applied.
> 5. Identify the offset and length of each modified column and store them
> in an optimized WAL tuple in the following format.
> Note: The WAL update header is modified to denote whether this WAL
> optimization was applied.
> WAL update header + Tuple header (no change from previous format) +
> [offset(2bytes)] [length(2 bytes)] [changed data value]
> [offset(2bytes)] [length(2 bytes)] [changed data value]
> ....

The performance will need to be re-verified after you fix these limitations. Those limitations need to be fixed before this can be applied.

It would be nice to use some well-known binary delta algorithm for this, rather than invent our own. OTOH, we have more knowledge of the attribute boundaries, so a custom algorithm might work better.

In any case, I'd like to see the code to do the delta encoding/decoding put into separate functions, outside of heapam.c. It would be good for readability, and we might want to reuse this in other places too.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
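
For illustration, one possible shape for the separated encode/decode API suggested here; the signatures and file placement are purely hypothetical, not from any posted patch.

#include <stddef.h>

/*
 * Hypothetically placed in a file of their own (e.g. a tuple_delta.c),
 * outside heapam.c, so that other WAL producers could reuse them.
 */
extern size_t tuple_delta_encode(const char *old_data, const char *new_data,
                                 size_t data_len, char *delta_out);
extern size_t tuple_delta_decode(const char *old_data, size_t old_len,
                                 const char *delta, size_t delta_len,
                                 char *new_data_out);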
From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
Sent: Saturday, August 04, 2012 1:33 AM

> The performance will need to be re-verified after you fix these
> limitations. Those limitations need to be fixed before this can be applied.

Yes, I agree that the solution should fix these limitations, and the performance numbers will need to be re-verified. Currently in my mind the work to be done is as follows:
1. A solution which can handle variable-length columns and NULLs.
2. Handling of BEFORE triggers.
3. Whether the solution for fixed-length columns can be the same as for variable-length columns and NULLs.
4. Make the final code patch which addresses all of the above.
Please suggest if there are more things that need to be handled.

For the 3rd point, currently the solution for fixed-length columns cannot handle the case of variable-length columns and NULLs. The reason is that for fixed-length columns there is no need for a diff technique between the old and new tuple, whereas for the other cases it will be required. For fixed-length columns, if we just note the OFFSET, LENGTH, VALUE of the changed columns of the new tuple in WAL, that is sufficient to replay the WAL. However, to handle the other cases we need a diff mechanism.

Can we do something like this: if the changed columns are fixed-length and contain no NULLs, store the [OFFSET, LENGTH, VALUE] format in WAL, and for the other cases store a diff format? This has the advantage that updates touching only fixed-length columns don't pay the penalty of diffing the new and old tuple. We could also do the whole work in two parts: one for fixed-length columns, and a second to handle the other cases.

> It would be nice to use some well-known binary delta algorithm for this,
> rather than invent our own. OTOH, we have more knowledge of the
> attribute boundaries, so a custom algorithm might work better.

I shall work on this and post after the initial work.

> In any case, I'd like to see the code to do the delta encoding/decoding put
> into separate functions, outside of heapam.c. It would be good for
> readability, and we might want to reuse this in other places too.

Agreed. I shall take care of doing it in the suggested way.

With Regards,
Amit Kapila.
On 3 August 2012 12:46, Amit kapila <amit.kapila@huawei.com> wrote:
> Frame the new tuple from the old tuple and the WAL record:

Sounds good. I'd suggest we do this only when the saving is large enough for benefit, rather than do this every time.

You don't mention whether or not the old and the new tuple are on the same data block. Personally, I think it will be important to ensure the above, otherwise recovery will require much additional code for that case. And that code will be prone to race conditions and performance issues.

Please also bear in mind that Andres will be looking to include the PK columns in every WAL record for BDR. That could be an option, but I doubt there is much value in excluding PK columns. I think I'd want them to be there for debugging purposes so we can prove this code is correct in production, since otherwise this could be a source of data loss bugs.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Simon Riggs [mailto:simon@2ndQuadrant.com]
Sent: Thursday, August 09, 2012 12:36 PM

>> Frame the new tuple from the old tuple and the WAL record:

> Sounds good.

Thanks.

> I'd suggest we do this only when the saving is large enough for
> benefit, rather than do this every time.

Do you mean to say that we apply it only when the length of the updated values of the tuple is less than some threshold (1/3 or 2/3, etc.) of the total length?

> You don't mention whether or not the old and the new tuple are on the
> same data block.

WAL reduction is done even when the old and new tuples are on different data blocks.

> Personally, I think it will be important to ensure the above,
> otherwise recovery will require much additional code for that case.

Currently also, recovery handles the case when old and new are on different pages, in that it has to read the old page to get the old tuple. The modifications need to ensure handling of the following cases:
a. When there is a backup block and the old and new tuples are on different pages. Currently it doesn't read the old page; however, the new implementation needs to read the old page in this case as well.
b. When changes are already applied on the page [line: if (XLByteLE(lsn, PageGetLSN(page))); function: heap_xlog_update]. Currently it doesn't read the old page; however, the new implementation needs to read the old page in this case as well.

> And that code will be prone to race conditions and performance issues.

Are you referring to performance issues because we may now need to read the old page in some cases where it was not read before? If yes, then as mentioned above, those two cases are not very common, and the benefit to update operations on a running server is good enough, as it reduces the WAL volume. If you mean something other than the above, please let me know.

> Please also bear in mind that Andres will be looking to include the PK
> columns in every WAL record for BDR. That could be an option, but I
> doubt there is much value in excluding PK columns.

Agreed. However, once the implementation by Andres is done, I can merge both codes and take the performance data again, based on which we can take a decision.

With Regards,
Amit Kapila.
On 9 August 2012 09:49, Amit Kapila <amit.kapila@huawei.com> wrote:
>> I'd suggest we do this only when the saving is large enough for
>> benefit, rather than do this every time.
>
> Do you mean to say that we apply it only when the length of the updated
> values of the tuple is less than some threshold (1/3 or 2/3, etc.) of the
> total length?

Some heuristic, yes, similar to TOAST's minimum threshold. To attempt removal in all cases would not be worth it, so we need a fast-path way of saying let's just take all of the columns.

>> You don't mention whether or not the old and the new tuple are on the
>> same data block.
>
> WAL reduction is done even when the old and new tuples are on different
> data blocks.

That makes me feel nervous. I doubt the marginal gain is worth it. Most updates don't cross blocks.

>> Please also bear in mind that Andres will be looking to include the PK
>> columns in every WAL record for BDR. That could be an option, but I
>> doubt there is much value in excluding PK columns.
>
> Agreed. However, once the implementation by Andres is done, I can merge
> both codes and take the performance data again, based on which we can take
> a decision.

It won't happen like that because there won't be a single point where Andres is done. If you agree, then it's worth doing it that way to begin with, rather than requiring us to revisit the same section of code twice.

One huge point that needs to be thought through is how we prove this code actually works on the WAL/recovery side. A normal regression test won't prove that, and we don't have a framework in place for that. If you think about what you'll need to do to prove you haven't made some fatal corruption of WAL, it's going to look a lot like logical replication tests. Worst case here is that mistakes on this patch will show up as Andres' mistakes. So there is a stronger connection to Andres' work than it first appears.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
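
For illustration, the kind of fast-path threshold test being discussed, loosely modeled on TOAST's minimum threshold; the 1/3 cutoff merely echoes the "1/3 or 2/3" values floated above and is an invented placeholder, not a tuned value.

#include <stdbool.h>
#include <stddef.h>

static bool
wal_update_delta_worthwhile(size_t changed_len, size_t tuple_len)
{
    /* If most of the row changed anyway, just log the whole tuple. */
    return changed_len < tuple_len / 3;
}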
On 09.08.2012 12:18, Simon Riggs wrote:
> On 9 August 2012 09:49, Amit Kapila <amit.kapila@huawei.com> wrote:
>> WAL reduction is done even when the old and new tuples are on different
>> data blocks.
>
> That makes me feel nervous. I doubt the marginal gain is worth it.
> Most updates don't cross blocks.

That was my first instinctive reaction too. But if the mechanism works just as well for cross-page updates, it seems a bit strange to not use it.

One argument would be that if for some reason the old block is corrupt or lost, you would not be able to recover the new version of the tuple from the WAL alone. At the moment, it's nice that the WAL record contains all the information required to reconstruct the new tuple, regardless of the old data block contents. But then again, full-page writes cover that too. There will be a full-page image of the old block in the WAL anyway.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On 9 August 2012 11:30, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> That was my first instinctive reaction too. But if the mechanism works just
> as well for cross-page updates, it seems a bit strange to not use it.
>
> One argument would be that if for some reason the old block is corrupt or
> lost, you would not be able to recover the new version of the tuple from
> the WAL alone. At the moment, it's nice that the WAL record contains all
> the information required to reconstruct the new tuple, regardless of the
> old data block contents.

Exactly. If we lose the first block in a checkpoint, we could lose all updates to rows in that page and all other pages linked to it over a whole checkpoint duration. Basically, page corruption will propagate from block to block if we do this.

Given the marginal gain because of a low percentage of cross-block updates, I'm not keen. The percentage is low because HOT tries hard to keep things on the same block - even for non-HOT updates (which is the case, even though it sounds weird).

> But then again, full-page writes cover that too. There
> will be a full-page image of the old block in the WAL anyway.

Right, but we're planning to remove that, so it's not a safe assumption to use when building new code.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Simon Riggs
Sent: Thursday, August 09, 2012 2:49 PM

>>> I'd suggest we do this only when the saving is large enough for
>>> benefit, rather than do this every time.
>
>> Do you mean to say that we apply it only when the length of the updated
>> values of the tuple is less than some threshold (1/3 or 2/3, etc.) of
>> the total length?

> Some heuristic, yes, similar to TOAST's minimum threshold. To attempt
> removal in all cases would not be worth it, so we need a fast-path way
> of saying let's just take all of the columns.

Yes, it has to be done. Currently I have two ideas to take care of this:
a. Based on the number of updated columns.
b. Based on the length of the updated values.
If you have any other idea, or favor one of the above, let me know your opinion.

>>> You don't mention whether or not the old and the new tuple are on the
>>> same data block.
>
>> WAL reduction is done even when the old and new tuples are on different
>> data blocks.

> That makes me feel nervous. I doubt the marginal gain is worth it.
> Most updates don't cross blocks.

How can it be proved whether the gain is marginal or substantial enough to handle this case? One way is to test after the modification. I have updated the pgbench tpc_b case:
1. The schema is such that it contains rows of length 1800.
2. tpc_b only has updates.
3. The length of the updated column values is 300.
4. All tables have 100% fillfactor.
5. Vacuum is OFF.
So in such a run, I think many updates should be across blocks. But I am not sure, nor have I verified it in any way. The above run gave a good performance improvement.

>>> Please also bear in mind that Andres will be looking to include the PK
>>> columns in every WAL record for BDR. That could be an option, but I
>>> doubt there is much value in excluding PK columns.
>
>> Agreed. However, once the implementation by Andres is done, I can merge
>> both codes and take the performance data again, based on which we can
>> take a decision.

> It won't happen like that because there won't be a single point where
> Andres is done. If you agree, then it's worth doing it that way to
> begin with, rather than requiring us to revisit the same section of
> code twice.

This optimization is to reduce the amount of WAL, and definitely adding anything extra will have some impact. However, if there is no better way than including the PK in WAL, then I don't have any problem with it.

> One huge point that needs to be thought through is how we prove this
> code actually works on the WAL/recovery side. A normal regression test
> won't prove that, and we don't have a framework in place for that.

My initial idea to validate recovery:
1. Manual tests:
a. Generate enough scenarios for the update operation.
b. For each scenario, make sure replay happens properly.
2. Community review.

With Regards,
Amit Kapila.
On 09.08.2012 14:11, Simon Riggs wrote:
> Given the marginal gain because of a low percentage of cross-block
> updates, I'm not keen. The percentage is low because HOT tries hard to
> keep things on the same block - even for non-HOT updates (which is the
> case, even though it sounds weird).

That depends entirely on the workload. If you do a bulk update that updates every row on the table, most are going to be cross-block updates, and the WAL size does matter.

>> But then again, full-page writes cover that too. There
>> will be a full-page image of the old block in the WAL anyway.
>
> Right, but we're planning to remove that, so it's not a safe assumption
> to use when building new code.

I don't think we're going to get rid of full-page images any time soon. I guess you could easily check if full-page writes are enabled, though, and only do it for cross-page updates if it is.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On 9 August 2012 12:17, Amit Kapila <amit.kapila@huawei.com> wrote:
> This optimization is to reduce the amount of WAL, and definitely adding
> anything extra will have some impact.

Of course. The question is "How much impact?". Each tweak has progressively less and less gain. This isn't a binary choice. Squeezing the last ounce of performance at the expense of all other concerns is not a sensible goal, IMHO, nor do we attempt that elsewhere.

Given we're making no attempt to remove full page writes, which is clearly the biggest source of WAL volume currently, micro optimisation of other factors seems unwarranted at this stage.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
Sent: Thursday, August 09, 2012 4:59 PM

>>> But then again, full-page writes cover that too. There
>>> will be a full-page image of the old block in the WAL anyway.
>
>> Right, but we're planning to remove that, so it's not a safe assumption
>> to use when building new code.

> I don't think we're going to get rid of full-page images any time soon.
> I guess you could easily check if full-page writes are enabled, though,
> and only do it for cross-page updates if it is.

According to my understanding, you are talking about corruption due to partial page writes, which can be handled by the full-page image in WAL. Correct me if I misunderstood. Based on that, even if the full-page image is removed, the same protection would be maintained by double buffer writes [an alternative solution to full-page writes for some of the paths] for the case of corrupt-page handling.

With Regards,
Amit Kapila.
On 09.08.2012 15:56, Amit Kapila wrote:
>> I don't think we're going to get rid of full-page images any time soon.
>> I guess you could easily check if full-page writes are enabled, though,
>> and only do it for cross-page updates if it is.
>
> According to my understanding, you are talking about corruption due to
> partial page writes, which can be handled by the full-page image in WAL.
> Correct me if I misunderstood.

I meant corruption caused by anything, like disk failure, bugs, cosmic rays, etc. The point is that currently the WAL record contains all the information required to reconstruct the new tuple. With a diff method, that's no longer the case, so if the old tuple gets corrupt for whatever reason, that error will be propagated to the new tuple.

It's not an issue as long as everything works correctly, but some redundancy is nice when you're trying to resurrect a corrupt database. That's what we're talking about here. That said, I don't think it's a big deal for this patch, at least not as long as full-page writes are enabled.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
From: Simon Riggs [mailto:simon@2ndQuadrant.com]
Sent: Thursday, August 09, 2012 5:29 PM

>> This optimization is to reduce the amount of WAL, and definitely adding
>> anything extra will have some impact.

> Of course. The question is "How much impact?". Each tweak has
> progressively less and less gain. This isn't a binary choice.
> Squeezing the last ounce of performance at the expense of all other
> concerns is not a sensible goal, IMHO, nor do we attempt that
> elsewhere.

> Given we're making no attempt to remove full page writes, which is
> clearly the biggest source of WAL volume currently, micro optimisation
> of other factors seems unwarranted at this stage.

The WAL reduction I am pointing at concerns update operation performance; full-page writes have no direct correlation with update operations, except for the case of the first update of a page after a checkpoint.

With Regards,
Amit Kapila.
On Thu, Aug 9, 2012 at 9:09 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> I meant corruption caused by anything, like disk failure, bugs, cosmic
> rays, etc. The point is that currently the WAL record contains all the
> information required to reconstruct the new tuple. With a diff method,
> that's no longer the case, so if the old tuple gets corrupt for whatever
> reason, that error will be propagated to the new tuple.
>
> It's not an issue as long as everything works correctly, but some
> redundancy is nice when you're trying to resurrect a corrupt database.
> That's what we're talking about here. That said, I don't think it's a big
> deal for this patch, at least not as long as full-page writes are enabled.

So suppose that the following sequence of events occurs:

1. Tuple A on page 1 is updated. The new version, tuple B, is placed on page 2.
2. The table is vacuumed, removing tuple A.
3. Page 1 is written durably to disk.
4. Crash.

If reconstructing tuple B requires possession of tuple A, it seems that we are now screwed. No?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09.08.2012 19:39, Robert Haas wrote:
> So suppose that the following sequence of events occurs:
>
> 1. Tuple A on page 1 is updated. The new version, tuple B, is placed on page 2.
> 2. The table is vacuumed, removing tuple A.
> 3. Page 1 is written durably to disk.
> 4. Crash.
>
> If reconstructing tuple B requires possession of tuple A, it seems
> that we are now screwed.

Not with full_page_writes=on, as crash recovery will restore the old page contents. But you're right, with full_page_writes=off you are screwed.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, Aug 9, 2012 at 12:43 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> So suppose that the following sequence of events occurs:
>>
>> 1. Tuple A on page 1 is updated. The new version, tuple B, is placed on
>> page 2.
>> 2. The table is vacuumed, removing tuple A.
>> 3. Page 1 is written durably to disk.
>> 4. Crash.
>>
>> If reconstructing tuple B requires possession of tuple A, it seems
>> that we are now screwed.
>
> Not with full_page_writes=on, as crash recovery will restore the old page
> contents. But you're right, with full_page_writes=off you are screwed.

I think the property that recovery only needs to worry about each block individually is one that we want to preserve. Supporting this optimization only when full_page_writes=off seems ugly, and I also agree with Simon's objection upthread: the current design minimizes the chances of corruption propagating from block to block. Even if the proposed design is bullet-proof as of this moment (at least with full_page_writes=on), it seems very possible that it could get accidentally broken by future code changes, leading to hard-to-find data corruption bugs. It might also complicate other things that we will want to do down the line, like parallelizing recovery.

In the pgbench testing I've done, almost all of the updates are HOT, provided you run the test long enough to reach steady state, so restricting this optimization to HOT updates shouldn't hurt that case (or similar real-world cases) very much. Of course there are probably also real-world cases where HOT applies only seldom, and those cases won't get the benefit of this, but you can't win them all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Robert Haas [mailto:robertmhaas@gmail.com]
Sent: Thursday, August 09, 2012 11:18 PM

> I think the property that recovery only needs to worry about each
> block individually is one that we want to preserve. Supporting this
> optimization only when full_page_writes=off seems ugly,

I think recovery needs to worry about multiple blocks as well in some cases. Please see the case below and correct me if I am wrong. I think there can already be problems with full_page_writes=off during crash recovery:
1. Tuple A on page 1 is updated. The new version, tuple B, is placed on page 2.
2. Page 1 is partially written to disk.
3. During recovery, it can appear that there is no need to update XMAX and other related fields in the old tuple, as the page LSN is greater than the WAL record's LSN.
4. There can then also be other problems related to tuple visibility.

> and I also
> agree with Simon's objection upthread: the current design minimizes
> the chances of corruption propagating from block to block. Even if
> the proposed design is bullet-proof as of this moment (at least with
> full_page_writes=on), it seems very possible that it could get
> accidentally broken by future code changes, leading to hard-to-find
> data corruption bugs. It might also complicate other things that we
> will want to do down the line, like parallelizing recovery.

I can see the problem in case we remove the full-page-writes concept and replace it with some other equivalent concept which doesn't have the current flexibility.

With Regards,
Amit Kapila.
On Fri, Aug 10, 2012 at 1:25 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> I think the property that recovery only needs to worry about each
>> block individually is one that we want to preserve. Supporting this
>> optimization only when full_page_writes=off seems ugly,
>
> I think recovery needs to worry about multiple blocks as well in some
> cases. Please see the case below and correct me if I am wrong. I think
> there can already be problems with full_page_writes=off during crash
> recovery:
> 1. Tuple A on page 1 is updated. The new version, tuple B, is placed on
> page 2.
> 2. Page 1 is partially written to disk.
> 3. During recovery, it can appear that there is no need to update XMAX
> and other related fields in the old tuple, as the page LSN is greater
> than the WAL record's LSN.
> 4. There can then also be other problems related to tuple visibility.

Well, you're only supposed to turn full_page_writes=off if partial page writes are impossible on your system. If you turn off full_page_writes on a system where partial page writes are impossible, then you've intentionally broken crash recovery, and you get to keep both pieces.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thursday, August 30, 2012 11:23 PM Robert Haas [mailto:robertmhaas@gmail.com] wrote:
> Well, you're only supposed to turn full_page_writes=off if partial
> page writes are impossible on your system. If you turn off
> full_page_writes on a system where partial page writes are impossible,

I think you mean to say "full_page_writes on a system where partial page writes are possible", because if partial page writes are impossible then the user should keep full_page_writes = off.

> then you've intentionally broken crash recovery, and you get to keep
> both pieces.

Robert, broadly I got your and Simon's idea that we should reduce the WAL only in the case where the update happens on the same page. I have implemented the final patch, which does the WAL optimization only when the updated tuple is on the same page (a minimal sketch of such a guard follows below). We have also observed that with fillfactor 80 the performance improvement is good.

With Regards,
Amit Kapila.
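
For illustration, a minimal sketch of the final same-page guard described above, combined with the threshold test sketched earlier in the thread; all names here are invented, and Buffer is a stand-in for PostgreSQL's buffer identifier type, not the real declaration.

#include <stdbool.h>
#include <stddef.h>

typedef int Buffer;             /* stand-in for the real type; illustrative only */

static bool
use_wal_update_delta(Buffer oldbuf, Buffer newbuf,
                     bool has_before_triggers, bool has_nulls,
                     size_t changed_len, size_t tuple_len)
{
    if (oldbuf != newbuf)       /* cross-page update: log the whole new tuple */
        return false;
    if (has_before_triggers || has_nulls)
        return false;
    /* apply the delta only when the change is a small fraction of the row */
    return changed_len < tuple_len / 3;
}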