Thread: [WIP] Performance Improvement by reducing WAL for Update Operation

[WIP] Performance Improvement by reducing WAL for Update Operation

From
Amit kapila
Date:

Problem statement:

-----------------------------------

Reduce the WAL size of an update operation to improve performance.

Advantages:

---------------------
        1. Observed increase in performance with pgbench when the server is running with synchronous_commit = off.
                a. with pgbench (tpc_b) - 13%
                b. with modified pgbench (where the modified columns are smaller than the whole row) - 83%

        2. WAL size is reduced

Design/Implementation:

------------------------------

Currently the change applies only to fixed-length columns of simple tables, and the tuple must not contain NULLs.

This is a proof of concept; the design and implementation will need to change based on the final design required for handling other scenarios.

Update operation:
-----------------------------
1. Check whether the table is simple (no TOAST, no BEFORE UPDATE triggers).
2. Works only for tuples without NULLs.
3. Identify the modified columns from the target entry.
4. Based on the modified column list, check whether any variable-length column is modified; if so, this optimization is not applied.
5. Identify the offset and length of the modified columns and store them as an optimized WAL tuple in the following format.
   Note: the WAL update header is modified to denote whether the WAL update optimization was done.
        WAL update header + Tuple header(no change from previous format) +
                [offset(2bytes)] [length(2 bytes)] [changed data value]
                [offset(2bytes)] [length(2 bytes)] [changed data value]
  ....
  ....
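
As an illustration of the entry format in point 5, here is a minimal Python sketch. The function name `encode_update`, the little-endian byte order and the sample tuple layout are assumptions made for this example; the WAL update header and the tuple header are left out.

```python
import struct

def encode_update(new_tuple: bytes, changed: list[tuple[int, int]]) -> bytes:
    """Build [offset(2 bytes)][length(2 bytes)][value] entries.

    `changed` holds (offset, length) pairs of the modified fixed-length
    columns within the tuple data, sorted by offset (hypothetical input).
    """
    out = bytearray()
    for offset, length in changed:
        # Assumed little-endian encoding of the two 2-byte fields.
        out += struct.pack("<HH", offset, length)
        # Only the changed column value is copied into the record.
        out += new_tuple[offset:offset + length]
    return bytes(out)

# Sample 32-byte tuple in which one 4-byte column at offset 8 changes.
old = b"A" * 8 + b"BBBB" + b"C" * 20
new = b"A" * 8 + b"zzzz" + b"C" * 20
record = encode_update(new, [(8, 4)])
print(len(record), "bytes of entries for a", len(new), "byte tuple")
```

The sketch only covers the case described above: fixed-length columns, no NULLs, and column offsets that are identical in the old and new tuple.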

Recovery: 

----------------
        The following steps apply only if the tuple was optimized.

6. The old tuple is required to form the new tuple (even if the old tuple itself requires no modification).
7. Form the new tuple based on the format specified in point 5.
8. Once the new tuple is framed, follow the existing behavior.

Frame the new tuple from old tuple and WAL record:

1. The length of the data that needs to be copied from the old tuple is calculated as
   the difference between the offset in the WAL record and the old tuple offset
   (initially, the old tuple offset is zero).
2. Once the old tuple data is copied, increase the old tuple offset by the
   copied length.
3. Get the length and value of the modified column from the WAL record and copy it into the new tuple.
4. Increase the old tuple offset by the modified column length.
5. Repeat this procedure until the end of the WAL record is reached.
6. Any remaining old tuple data is then copied.
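
The six steps above can be sketched in Python as follows. The name `rebuild_tuple` and the little-endian 2-byte fields are assumptions for illustration; the input is only the list of [offset][length][value] entries, without any WAL or tuple header.

```python
import struct

def rebuild_tuple(old_tuple: bytes, entries: bytes) -> bytes:
    """Form the new tuple from the old tuple and the logged entries."""
    new_tuple = bytearray()
    old_off = 0                                   # step 1: starts at zero
    pos = 0
    while pos < len(entries):                     # step 5: until record ends
        offset, length = struct.unpack_from("<HH", entries, pos)
        pos += 4
        # Steps 1-2: copy the unchanged bytes that precede this column
        # and advance the old tuple offset by the copied length.
        new_tuple += old_tuple[old_off:offset]
        old_off = offset
        # Step 3: copy the modified column value from the record.
        new_tuple += entries[pos:pos + length]
        pos += length
        # Step 4: skip the old value of the modified column.
        old_off += length
    # Step 6: copy whatever is left of the old tuple.
    new_tuple += old_tuple[old_off:]
    return bytes(new_tuple)

old = b"A" * 8 + b"BBBB" + b"C" * 20
entries = struct.pack("<HH", 8, 4) + b"zzzz"
print(rebuild_tuple(old, entries))
```

With an empty entry list the function returns the old tuple unchanged, which matches the note in point 6 that the old tuple is needed even when it requires no modification.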


Test results:

----------------------
1. The pgbench test was run for 10 minutes.

2. The pgbench result for tpc-b is attached to this mail as pgbench_org.

3. The modified pgbench (where the modified columns are smaller than the whole row) result for tpc-b is attached to this mail as pgbench_1800_300.

Modified pgbench code:

---------------------------------------
1. The table schema is modified by adding extra fields to increase the record size to 1800 bytes.
2. The tpc_b benchmark suite is changed to do only update operations.
3. The update operation is changed to update 3 columns with 300 bytes out of a total size of 1800 bytes.
4. During initialization of the tables, the NULL value insertions are removed.
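
As a rough back-of-the-envelope check of what this workload logs per update, the following Python sketch compares the tuple data with the [offset][length][value] entries. It assumes the 300 bytes are the combined size of the 3 updated columns and ignores the WAL update header and tuple header, which are logged in both cases.

```python
ROW_BYTES = 1800        # total row size in the modified pgbench
CHANGED_BYTES = 300     # combined size of the updated columns (assumed reading)
CHANGED_COLUMNS = 3
ENTRY_OVERHEAD = 2 + 2  # 2-byte offset + 2-byte length per changed column

# Bytes of tuple data logged when only the changed columns are written.
delta_payload = CHANGED_BYTES + CHANGED_COLUMNS * ENTRY_OVERHEAD
saving = 1 - delta_payload / ROW_BYTES

print(delta_payload, "bytes instead of", ROW_BYTES)
print(f"tuple data reduced by {saving:.1%}")
```

This is only the reduction in logged tuple data; it says nothing by itself about throughput, which is what the attached pgbench results measure.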


I am working on a solution to handle other scenarios such as variable-length columns, tuples containing NULLs, and BEFORE triggers.

Please provide suggestions/objections.

With Regards,
Amit Kapila.

Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 03.08.2012 14:46, Amit kapila wrote:
> Currently the change is done only for fixed length columns for simple tables and the tuple should not contain NULLS.
>
> This is a Proof of concept, the design and implementation needs to be changed based on final design required for
> handling other scenario's
>
> Update operation:
> -----------------------------
> 1. Check for the simple table or not.(No toast, No before update triggers)
> 2. Works only for not null tuples.
> 3. Identify the modified columns from the target entry.
> 4. Based on the modified column list, check for any variable length columns are modified, if so this optimization is
> not applied.
> 5. Identify the offset and length for the modified columns and store it as an optimized WAL tuple in the following
> format.
>     Note: Wal update header is modified to denote whether wal update optimization is done or not.
>          WAL update header + Tuple header(no change from previous format) +
>                  [offset(2bytes)] [length(2 bytes)] [changed data value]
>                  [offset(2bytes)] [length(2 bytes)] [changed data value]
>    ....
>    ....

The performance will need to be re-verified after you fix these 
limitations. Those limitations need to be fixed before this can be applied.

It would be nice to use some well-known binary delta algorithm for this, 
rather than invent our own. OTOH, we have more knowledge of the 
attribute boundaries, so a custom algorithm might work better. In any 
case, I'd like to see the code to do the delta encoding/decoding to be 
put into separate functions, outside of heapam.c. It would be good for 
readability, and we might want to reuse this in other places too.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com] 
Sent: Saturday, August 04, 2012 1:33 AM
On 03.08.2012 14:46, Amit kapila wrote:
>> Currently the change is done only for fixed length columns for simple
>> tables and the tuple should not contain NULLS.
>>
>> This is a Proof of concept, the design and implementation needs to be
>> changed based on final design required for handling other scenario's
>>
>> Update operation:
>> -----------------------------
>> 1. Check for the simple table or not.(No toast, No before update
>> triggers)
>> 2. Works only for not null tuples.
>> 3. Identify the modified columns from the target entry.
>> 4. Based on the modified column list, check for any variable length
>> columns are modified, if so this optimization is not applied.
>> 5. Identify the offset and length for the modified columns and store it
>> as an optimized WAL tuple in the following format.
>>     Note: Wal update header is modified to denote whether wal update
>> optimization is done or not.
>>          WAL update header + Tuple header(no change from previous format) +
>>                  [offset(2bytes)] [length(2 bytes)] [changed data value]
>>                  [offset(2bytes)] [length(2 bytes)] [changed data value]
>>    ....
>>    ....

> The performance will need to be re-verified after you fix these 
> limitations. Those limitations need to be fixed before this can be
> applied.

Yes, I agree that the solution should fix these limitations and that the
performance numbers need to be re-verified.
Currently, the work I have in mind is as follows:

1. A solution that can handle variable-length columns and NULLs.
2. Handling of BEFORE triggers.
3. Whether the solution for fixed-length columns can be the same as the one
for variable-length columns and NULLs.
4. A final patch that addresses all of the above.

Please suggest if there is anything more that needs to be handled.

For the 3rd point, the current solution for fixed-length columns cannot
handle variable-length columns and NULLs. The reason is that fixed-length
columns need no diff between the old and new tuple, whereas the other cases
do.
For fixed-length columns, recording the OFFSET, LENGTH, VALUE of the changed
columns of the new tuple in WAL is sufficient to replay the WAL. To handle
the other cases we need a diff mechanism.

Could we do the following: if the changed columns are fixed length and don't
contain NULLs, store the [OFFSET, LENGTH, VALUE] format in WAL, and for
other cases store the diff format?

The advantage is that updates touching only fixed-length columns don't have
to pay the penalty of a diff between the new and old tuple. We can also do
the whole work in 2 parts, one for fixed-length columns and a second to
handle the other cases.


> It would be nice to use some well-known binary delta algorithm for this, 
> rather than invent our own. OTOH, we have more knowledge of the 
> attribute boundaries, so a custom algorithm might work better. 

I shall work on this and post after initial work.

> In any case, I'd like to see the code to do the delta encoding/decoding to
> be put into separate functions, outside of heapam.c. It would be good for 
> readability, and we might want to reuse this in other places too.

Agreed. I shall take care of doing it in suggested way.


With Regards,
Amit Kapila.




Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Simon Riggs
Date:
On 3 August 2012 12:46, Amit kapila <amit.kapila@huawei.com> wrote:

> Frame the new tuple from old tuple and WAL record:

Sounds good.

I'd suggest we do this only when the saving is large enough for
benefit, rather than do this every time.

You don't mention whether or not the old and the new tuple are on the
same data block.

Personally, I think it will be important to ensure the above,
otherwise recovery will require much additional code for that case.
And that code will be prone to race conditions and performance issues.

Please also bear in mind that Andres will be looking to include the PK
columns in every WAL record for BDR. That could be an option, but I
doubt there is much value in excluding PK columns. I think I'd want
them to be there for debugging purposes so we can prove this code is
correct in production, since otherwise this could be a source of data
loss bugs.

-- Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
From: Simon Riggs [mailto:simon@2ndQuadrant.com] 
Sent: Thursday, August 09, 2012 12:36 PM
On 3 August 2012 12:46, Amit kapila <amit.kapila@huawei.com> wrote:

>> Frame the new tuple from old tuple and WAL record:

> Sounds good.

Thanks.

> I'd suggest we do this only when the saving is large enough for
> benefit, rather than do this every time.

Do you mean the case where the length of the updated values of the tuple is
less than some threshold (1/3 or 2/3, etc.) of the total length?


> You don't mention whether or not the old and the new tuple are on the
> same data block.

WAL reduction is done even when the old and new tuples are on different
data blocks.

> Personally, I think it will be important to ensure the above,
> otherwise recovery will require much additional code for that case.

Recovery already handles the case where the old and new tuples are on
different pages, in which it has to read the old page to get the old tuple.
The modification needs to handle the following cases:
 a. When there is a backup block and the old and new tuples are on different
    pages. Currently it doesn't read the old page; the new implementation
    needs to read the old page for this case as well.
 b. When the changes are already applied on the page [line: if (XLByteLE(lsn,
    PageGetLSN(page))); function: heap_xlog_update]. Currently it doesn't
    read the old page; the new implementation needs to read the old page for
    this case as well.

> And that code will be prone to race conditions and performance issues.

Do you mean performance issues because we may now need to read the old page
in some cases where earlier it was not read? If so, then as mentioned above,
I think the above 2 cases are not very common, while the benefit to update
operations on a running server is good, as the WAL volume is reduced. If you
mean something other than the above, please let me know.


> Please also bear in mind that Andres will be looking to include the PK
> columns in every WAL record for BDR. That could be an option, but I
> doubt there is much value in excluding PK columns. 

Agreed. However, once the implementation by Andres is done I can merge both
codes and take the performance data again, based on which we can take a decision.


With Regards,
Amit Kapila.



Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Simon Riggs
Date:
On 9 August 2012 09:49, Amit Kapila <amit.kapila@huawei.com> wrote:

>> I'd suggest we do this only when the saving is large enough for
>> benefit, rather than do this every time.
>   Do you mean to say that when length of updated values of tuple is less
> than some threshold(1/3 or 2/3, etc..) value of
>   total length?

Some heuristic, yes, similar to TOAST's minimum threshold. To attempt
removal of rows in all cases would not be worth it, so we need a fast
path way of saying let's just take all of the columns.
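
A minimal Python sketch of such a fast-path check is shown below. The function name `worth_delta_encoding` and the 0.5 cut-off are hypothetical; the thread only mentions 1/3 or 2/3 as possible thresholds and does not settle on a value.

```python
ENTRY_OVERHEAD = 4  # 2-byte offset + 2-byte length per changed column

def worth_delta_encoding(tuple_len: int, changed_lengths: list[int],
                         max_fraction: float = 0.5) -> bool:
    """Return True if logging only the changed columns saves enough.

    `max_fraction` is an assumed tunable: the encoded entries must not
    exceed this fraction of the full tuple length.
    """
    encoded = sum(length + ENTRY_OVERHEAD for length in changed_lengths)
    return encoded <= tuple_len * max_fraction

# 300 changed bytes in an 1800-byte row: use the reduced format.
print(worth_delta_encoding(1800, [100, 100, 100]))
# 60 changed bytes in a 100-byte row: log all of the columns instead.
print(worth_delta_encoding(100, [60]))
```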

>> You don't mention whether or not the old and the new tuple are on the
>> same data block.
>
>   WAL reduction is done for the case even when old and new are on different
> data blocks as well.

That makes me feel nervous. I doubt the marginal gain is worth it.
Most updates don't cross blocks.

>> Please also bear in mind that Andres will be looking to include the PK
>> columns in every WAL record for BDR. That could be an option, but I
>> doubt there is much value in excluding PK columns.
>
>   Agreed. However once the implementation by Andres is done I can merge both
> codes and
>   take the performance data again, based on which we can take decision.

It won't happen like that because there won't be a single point where
Andres is done. If you agree, then it's worth doing it that way to
begin with, rather than requiring us to revisit the same section of
code twice.

One huge point that needs to be thought through is how we prove this
code actually works on WAL/recovery side. A normal regression test
won't prove that and we don't have a framework in place for that.

If you think about what you'll need to do to prove you haven't made
some fatal corruption of WAL, it's going to look a lot like logical
replication tests. Worst case here is that mistakes on this patch will
show up as Andres' mistakes. So there is a stronger connection to
Andres' work than it first appears.

-- Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 09.08.2012 12:18, Simon Riggs wrote:
> On 9 August 2012 09:49, Amit Kapila<amit.kapila@huawei.com>  wrote:
>
>>    WAL reduction is done for the case even when old and new are on different
>> data blocks as well.
>
> That makes me feel nervous. I doubt the marginal gain is worth it.
> Most updates don't cross blocks.

That was my first instinctive reaction too. But if the mechanism works 
just as well for cross-page updates, seems a bit strange to not use it.

One argument would be that if for some reason the old block is corrupt 
or lost, you would not be able to recover the new version of the tuple 
from the WAL alone. At the moment, it's nice that the WAL record 
contains all the information required to reconstruct the new tuple, 
regardless of the old data block contents. But then again, full-page 
writes cover that too. There will be a full-page image of the old block 
in the WAL anyway.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Simon Riggs
Date:
On 9 August 2012 11:30, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 09.08.2012 12:18, Simon Riggs wrote:
>>
>> On 9 August 2012 09:49, Amit Kapila<amit.kapila@huawei.com>  wrote:
>>
>>>    WAL reduction is done for the case even when old and new are on
>>> different
>>> data blocks as well.
>>
>>
>> That makes me feel nervous. I doubt the marginal gain is worth it.
>> Most updates don't cross blocks.
>
>
> That was my first instinctive reaction too. But if the mechanism works just
> as well for cross-page updates, seems a bit strange to not use it.
>
> One argument would be that if for some reason the old block is corrupt or
> lost, you would not be able to recover the new version of the tuple from the
> WAL alone. At the moment, it's nice that the WAL record contains all the
> information required to reconstruct the new tuple, regardless of the old
> data block contents.

Exactly. If we lose the first block in a checkpoint, we could lose all
updates to rows in that page and all other pages linked to it over a
whole checkpoint duration. Basically, page corruption will propagate
from block to block if we do this.

Given the marginal gain because of a low percentage of cross-block
updates, I'm not keen. Low percentage because HOT tries hard to keep
things on same block - even for non-HOT updates (which is the case,
even though it sounds weird).

> But then again, full-page writes cover that too. There
> will be a full-page image of the old block in the WAL anyway.

Right, but we're planning to remove that, so it's not a safe assumption
to use when building new code.

-- Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Simon Riggs
Sent: Thursday, August 09, 2012 2:49 PM
On 9 August 2012 09:49, Amit Kapila <amit.kapila@huawei.com> wrote:

>>> I'd suggest we do this only when the saving is large enough for
>>> benefit, rather than do this every time.
>>   Do you mean to say that when length of updated values of tuple is less
>> than some threshold(1/3 or 2/3, etc..) value of
>>   total length?

> Some heuristic, yes, similar to TOAST's minimum threshold. To attempt
> removal of rows in all cases would not be worth it, so we need a fast
> path way of saying lets just take all of the columns.

Yes, it has to be done. Currently I have 2 ideas to take care of this:
 a. Based on the number of updated columns
 b. Based on the length of the updated values
If you have any other idea, or favor one of the above, let me know your
opinion.

>>> You don't mention whether or not the old and the new tuple are on the
>>> same data block.
>
>>   WAL reduction is done for the case even when old and new are on
>> different data blocks as well.

> That makes me feel nervous. I doubt the marginal gain is worth it.
> Most updates don't cross blocks.

How can we prove whether the gain from handling this case is marginal or
substantial?

One way is to test after the modification.
I have updated the pgbench tpc_b case:
1. The schema is such that rows are 1800 bytes long.
2. tpc_b only has updates.
3. The length of the updated column values is 300.
4. All tables have 100% fill factor.
5. Vacuum is OFF.

In such a run, I think many updates should be across blocks, but I am not
sure, and I have not verified it in any way.
The above run has given a good performance improvement.



>>> Please also bear in mind that Andres will be looking to include the PK
>>> columns in every WAL record for BDR. That could be an option, but I
>>> doubt there is much value in excluding PK columns.
>
>>   Agreed. However once the implementation by Andres is done I can merge
>> both codes and
>>   take the performance data again, based on which we can take decision.

> It won't happen like that because there won't be a single point where
> Andres is done. If you agree, then its worth doing it that way to
> begin with, rather than requiring us to revisit the same section of
> code twice.

This optimization is meant to reduce the amount of WAL, and adding anything
extra will definitely have some impact.
However, if there is no better way than including the PK in WAL, then I
don't have any problem.

> One huge point that needs to be thought through is how we prove this
> code actually works on WAL/recovery side. A normal regression test
> won't prove that and we don't have a framework in place for that.

My initial idea to validate recovery:
1. Manual test:
   a. Generate enough scenarios for the update operation.
   b. For each scenario, make sure replay happens properly.
2. Community review.



With Regards,
Amit Kapila.



Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 09.08.2012 14:11, Simon Riggs wrote:
> Given the marginal gain because of a low percentage of cross-block
> updates, I'm not keen. Low percentage because HOT tries hard to keep
> things on same block - even for non-HOT updates (which is the case,
> even though it sounds weird).

That depends entirely on the workload. If you do a bulk update that 
updates every row on the table, most are going to be cross-block 
updates, and the WAL size does matter.

>> But then again, full-page writes cover that too. There
>> will be a full-page image of the old block in the WAL anyway.
>
> Right, but we're planning to remove that, so its not a safe assumption
> to use when building new code.

I don't think we're going to get rid of full-page images any time soon. 
I guess you could easily check if full-page writes are enabled, though, 
and only do it for cross-page updates if it is.
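
The check suggested here amounts to a small decision rule, sketched below in Python. The function name `may_delta_encode` and its two boolean inputs are hypothetical stand-ins for whatever the update path actually knows at the time the WAL record is built.

```python
def may_delta_encode(same_page: bool, full_page_writes: bool) -> bool:
    """Decide whether an update may log only the changed columns.

    Same-page updates need no other block during replay. Cross-page
    updates are allowed only when full-page writes are enabled, so that
    a full-page image can restore the old block.
    """
    return same_page or full_page_writes

# Print the full decision table for the two inputs.
for same_page in (True, False):
    for fpw in (True, False):
        print(f"same_page={same_page} full_page_writes={fpw} "
              f"-> {may_delta_encode(same_page, fpw)}")
```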

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Simon Riggs
Date:
On 9 August 2012 12:17, Amit Kapila <amit.kapila@huawei.com> wrote:

> This optimization is to reduce the amount of WAL and definitely adding
> anything extra will have some impact.

Of course. The question is "How much impact?". Each tweak has
progressively less and less gain. This isn't a binary choice.

Squeezing the last ounce of performance at the expense of all other
concerns is not a sensible goal, IMHO, nor do we attempt that
elsewhere.

Given we're making no attempt to remove full page writes, which is
clearly the biggest source of WAL volume currently, micro optimisation
of other factors seems unwarranted at this stage.

-- Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com] 
Sent: Thursday, August 09, 2012 4:59 PM
On 09.08.2012 14:11, Simon Riggs wrote:

>>> But then again, full-page writes cover that too. There
>>> will be a full-page image of the old block in the WAL anyway.
>
>> Right, but we're planning to remove that, so its not a safe assumption
>> to use when building new code.

> I don't think we're going to get rid of full-page images any time soon. 
> I guess you could easily check if full-page writes are enabled, though, 
> and only do it for cross-page updates if it is.

As I understand it, you are talking about corruption due to partial page
writes, which can be handled by the full-page image in WAL. Correct me if I
misunderstood.
Based on that, even if full-page images are removed, corrupt-page handling
will be covered by double buffer write (an alternative solution to full-page
writes for some of the paths).

With Regards,
Amit Kapila.




Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 09.08.2012 15:56, Amit Kapila wrote:
> From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
> Sent: Thursday, August 09, 2012 4:59 PM
> On 09.08.2012 14:11, Simon Riggs wrote:
>
>>>> But then again, full-page writes cover that too. There
>>>> will be a full-page image of the old block in the WAL anyway.
>>
>>> Right, but we're planning to remove that, so its not a safe assumption
>>> to use when building new code.
>
>> I don't think we're going to get rid of full-page images any time soon.
>> I guess you could easily check if full-page writes are enabled, though,
>> and only do it for cross-page updates if it is.
>
> According to my understanding you are talking about corruption due to
> partial page writes which can be handled by full-page image of WAL. Correct
> me if I misunderstood.

I meant corruption caused by anything, like disk failure, bugs, cosmic 
rays, etc. The point is that currently the WAL record contains all the 
information required to reconstruct the old tuple. With a diff method, 
that's no longer the case, so if the old tuple gets corrupt for whatever 
reason, that error will be propagated to the new tuple.

It's not an issue as long as everything works correctly, but some 
redundancy is nice when you're trying to resurrect a corrupt database. 
That's what we're talking about here. That said, I don't think it's a 
big deal for this patch, at least not as long as full-page writes are 
enabled.
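
To make the propagation concrete, here is a small Python sketch. It assumes the [offset][length][value] entry format from the original proposal with little-endian 2-byte fields; `rebuild_tuple` is a hypothetical helper, not code from the patch.

```python
import struct

def rebuild_tuple(old_tuple: bytes, entries: bytes) -> bytes:
    """Apply [offset][length][value] entries on top of the old tuple."""
    out, old_off, pos = bytearray(), 0, 0
    while pos < len(entries):
        offset, length = struct.unpack_from("<HH", entries, pos)
        pos += 4
        out += old_tuple[old_off:offset]   # unchanged bytes from old tuple
        out += entries[pos:pos + length]   # changed column value from WAL
        pos += length
        old_off = offset + length
    return bytes(out + old_tuple[old_off:])

old = b"A" * 8 + b"BBBB" + b"C" * 20
entries = struct.pack("<HH", 8, 4) + b"zzzz"
expected = rebuild_tuple(old, entries)

# Flip the first byte of the old tuple to mimic on-disk corruption.
corrupt_old = b"?" + old[1:]
damaged = rebuild_tuple(corrupt_old, entries)

# The logged column is still correct, but the bad byte is carried over.
print(damaged[8:12] == b"zzzz", damaged == expected)
```

A WAL record that carries the whole new tuple would not read the old tuple at all, which is the redundancy being discussed.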

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
From: Simon Riggs [mailto:simon@2ndQuadrant.com] 
Sent: Thursday, August 09, 2012 5:29 PM
On 9 August 2012 12:17, Amit Kapila <amit.kapila@huawei.com> wrote:

>> This optimization is to reduce the amount of WAL and definitely adding
>> anything extra will have some impact.

> Of course. The question is "How much impact?". Each tweak has
> progressively less and less gain. This isn't a binary choice.

> Squeezing the last ounce of performance at the expense of all other
> concerns is not a sensible goal, IMHO, nor do we attempt that
> elsewhere.

> Given we're making no attempt to remove full page writes, which is
> clearly the biggest source of WAL volume currently, micro optimisation
> of other factors seems unwarranted at this stage.

My point about WAL reduction concerns the performance of the update
operation; full-page writes have no direct correlation with the update
operation, except for the first update of a page after a checkpoint.

With Regards,
Amit Kapila.




Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Thu, Aug 9, 2012 at 9:09 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I meant corruption caused by anything, like disk failure, bugs, cosmic rays,
> etc. The point is that currently the WAL record contains all the information
> required to reconstruct the old tuple. With a diff method, that's no longer
> the case, so if the old tuple gets corrupt for whatever reason, that error
> will be propagated to the new tuple.
>
> It's not an issue as long as everything works correctly, but some redundancy
> is nice when you're trying to resurrect a corrupt database. That's what
> we're talking about here. That said, I don't think it's a big deal for this
> patch, at least not as long as full-page writes are enabled.

So suppose that the following sequence of events occurs:

1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on page 2.
2. The table is vacuumed, removing tuple A.
3. Page 1 is written durably to disk.
4. Crash.

If reconstructing tuple B requires possession of tuple A, it seems
that we are now screwed.

No?
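
The sequence can be walked through with a toy Python model. Everything here is a simplification made for illustration: pages are dictionaries, `replay` is a hypothetical redo loop that ignores LSN checks and vacuum's own WAL record, and no full-page image of page 1 is assumed to be available.

```python
import struct

def rebuild_tuple(old_tuple: bytes, entries: bytes) -> bytes:
    """Apply [offset][length][value] entries on top of the old tuple."""
    out, old_off, pos = bytearray(), 0, 0
    while pos < len(entries):
        offset, length = struct.unpack_from("<HH", entries, pos)
        pos += 4
        out += old_tuple[old_off:offset]
        out += entries[pos:pos + length]
        pos += length
        old_off = offset + length
    return bytes(out + old_tuple[old_off:])

def replay(disk: dict, wal: list) -> None:
    """Toy redo loop: rebuild each new tuple from the old one on disk."""
    for old_page, old_slot, new_page, new_slot, entries in wal:
        old_tuple = disk[old_page].get(old_slot)
        if old_tuple is None:
            raise LookupError(f"tuple {old_slot} is gone from page {old_page}")
        disk[new_page][new_slot] = rebuild_tuple(old_tuple, entries)

tuple_a = b"A" * 8 + b"BBBB" + b"C" * 20
disk = {1: {"A": tuple_a}, 2: {}}   # durable page contents

# 1. Tuple A is updated; B goes to page 2, only the delta reaches WAL.
wal = [(1, "A", 2, "B", struct.pack("<HH", 8, 4) + b"zzzz")]
# 2-3. Vacuum removes A and page 1 is written durably; page 2 is not.
disk[1].pop("A")
# 4. Crash, followed by recovery.
try:
    replay(disk, wal)
    outcome = "recovered"
except LookupError as exc:
    outcome = f"cannot rebuild B: {exc}"
print(outcome)
```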

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Heikki Linnakangas
Date:
On 09.08.2012 19:39, Robert Haas wrote:
> On Thu, Aug 9, 2012 at 9:09 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> I meant corruption caused by anything, like disk failure, bugs, cosmic rays,
>> etc. The point is that currently the WAL record contains all the information
>> required to reconstruct the old tuple. With a diff method, that's no longer
>> the case, so if the old tuple gets corrupt for whatever reason, that error
>> will be propagated to the new tuple.
>>
>> It's not an issue as long as everything works correctly, but some redundancy
>> is nice when you're trying to resurrect a corrupt database. That's what
>> we're talking about here. That said, I don't think it's a big deal for this
>> patch, at least not as long as full-page writes are enabled.
>
> So suppose that the following sequence of events occurs:
>
> 1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on page 2.
> 2. The table is vacuumed, removing tuple A.
> 3. Page 1 is written durably to disk.
> 4. Crash.
>
> If reconstructing tuple B requires possession of tuple A, it seems
> that we are now screwed.

Not with full_page_writes=on, as crash recovery will restore the old 
page contents. But you're right, with full_page_writes=off you are screwed.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Thu, Aug 9, 2012 at 12:43 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> So suppose that the following sequence of events occurs:
>>
>> 1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
>> page 2.
>> 2. The table is vacuumed, removing tuple A.
>> 3. Page 1 is written durably to disk.
>> 4. Crash.
>>
>> If reconstructing tuple B requires possession of tuple A, it seems
>> that we are now screwed.
>
> Not with full_page_writes=on, as crash recovery will restore the old page
> contents. But you're right, with full_page_writes=off you are screwed.

I think the property that recovery only needs to worry about each
block individually is one that we want to preserve.  Supporting this
optimization only when full_page_writes=off seems ugly, and I also
agree with Simon's objection upthread: the current design minimizes
the chances of corruption propagating from block to block.  Even if
the proposed design is bullet-proof as of this moment (at least with
full_page_writes=on) it seems very possible that it could get
accidentally broken by future code changes, leading to hard-to-find
data corruption bugs.  It might also complicate other things that we
will want to do down the line, like parallelizing recovery.

In the pgbench testing I've done, almost all of the updates are HOT,
provided you run the test long enough to reach steady state, so
restricting this optimization to HOT updates shouldn't hurt that case
(or similar real-world cases) very much.  Of course there are probably
also real-world cases where HOT applies only seldom, and those cases
won't get the benefit of this, but you can't win them all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
From: Robert Haas [mailto:robertmhaas@gmail.com] 
Sent: Thursday, August 09, 2012 11:18 PM
On Thu, Aug 9, 2012 at 12:43 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>>> So suppose that the following sequence of events occurs:
>>
>>> 1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
>>> page 2.
>>> 2. The table is vacuumed, removing tuple A.
>>> 3. Page 1 is written durably to disk.
>>> 4. Crash.
>>
>>> If reconstructing tuple B requires possession of tuple A, it seems
>>> that we are now screwed.
>
>> Not with full_page_writes=on, as crash recovery will restore the old page
>> contents. But you're right, with full_page_writes=off you are screwed.

> I think the property that recovery only needs to worry about each
> block individually is one that we want to preserve.  Supporting this
> optimization only when full_page_writes=off seems ugly, 

I think recovery needs to worry about multiple blocks in some cases as well.
Please see the case below and correct me if I am wrong.
I think that even today there can be problems with crash recovery when
full_page_writes=off.
1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
page 2.
2. Page 1 is partially written to disk.
3. During recovery, it can appear that there is no need to update XMAX
and related fields in the old tuple, because the page LSN is not older
than the WAL record's LSN.
4. As a result, there can be further problems related to tuple visibility.
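
To make step 3 concrete, here is a small sketch of the LSN check that
redo performs. It is a toy model, not PostgreSQL code: the Page and
WalRecord classes and the redo() function are hypothetical names, and a
page is reduced to an LSN plus an xmax per tuple.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # LSN of the last WAL record whose effects the page claims to contain.
    lsn: int = 0
    # Tuple name -> xmax (0 means "not updated or deleted").
    xmax: dict = field(default_factory=dict)


@dataclass
class WalRecord:
    lsn: int
    tuple_name: str
    xmax: int


def redo(page: Page, rec: WalRecord) -> bool:
    """Apply rec to page unless the page LSN says it is already applied."""
    if page.lsn >= rec.lsn:
        return False  # skipped: page looks up to date
    page.xmax[rec.tuple_name] = rec.xmax
    page.lsn = rec.lsn
    return True


# The update of tuple A by transaction 100 was logged at LSN 50.
rec = WalRecord(lsn=50, tuple_name="A", xmax=100)

# Intact old page: LSN 40, so redo applies the record.
intact = Page(lsn=40, xmax={"A": 0})
intact_applied = redo(intact, rec)

# Torn page (assumed): the part holding the new LSN reached disk,
# the part holding tuple A did not.
torn = Page(lsn=50, xmax={"A": 0})
torn_applied = redo(torn, rec)
```

In this model the torn page keeps xmax = 0 for tuple A after recovery,
which is the visibility problem described in step 4.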



> and I also
> agree with Simon's objection upthread: the current design minimizes
> the chances of corruption propagating from block to block.  Even if
> the proposed design is bullet-proof as of this moment (at least with
> full_page_writes=on) it seems very possible that it could get
> accidentally broken by future code changes, leading to hard-to-find
> data corruption bugs. It might also complicate other things that we
> will want to do down the line, like parallelizing recovery.

I can see the problem in case we remove the full-page-writes concept and
replace it with some other equivalent concept which doesn't have the
current flexibility.


With Regards,
Amit Kapila.



Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Robert Haas
Date:
On Fri, Aug 10, 2012 at 1:25 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> I think the property that recovery only needs to worry about each
>> block individually is one that we want to preserve.  Supporting this
>> optimization only when full_page_writes=off seems ugly,
>
> I think recovery needs to worry about multiple blocks as well in some cases.
> Please see below case and correct me if I am wrong.
> I think currently also there can be problems in case of full_page_writes=off
> for crash recovery.
> 1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
> page 2.
> 2. Page 1 is Partially written to disk.
> 3. During recovery, it can so appear that there is no need to update XMAX
> and other related things in Old tuple
>    as LSN is greater than WAL lsn.
> 4. Now also there can be other problems related to tuple visibility.

Well, you're only supposed to turn full_page_writes=off if partial
page writes are impossible on your system.  If you turn off
full_page_writes on a system where partial page writes are impossible,
then you've intentionally broken crash recovery, and you get to keep
both pieces.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From
Amit Kapila
Date:
On Thursday, August 30, 2012 11:23 PM Robert Haas
[mailto:robertmhaas@gmail.com] wrote:
On Fri, Aug 10, 2012 at 1:25 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>>> I think the property that recovery only needs to worry about each
>>> block individually is one that we want to preserve.  Supporting this
>>> optimization only when full_page_writes=off seems ugly,
>
>> I think recovery needs to worry about multiple blocks as well in some cases.
>> Please see below case and correct me if I am wrong.
>> I think currently also there can be problems in case of full_page_writes=off
>> for crash recovery.
>> 1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
>> page 2.
>> 2. Page 1 is Partially written to disk.
>> 3. During recovery, it can so appear that there is no need to update XMAX
>> and other related things in Old tuple
>>    as LSN is greater than WAL lsn.
>> 4. Now also there can be other problems related to tuple visibility.

> Well, you're only supposed to turn full_page_writes=off if partial
> page writes are impossible on your system.  If you turn off
> full_page_writes on a system where partial page writes are impossible,
I think you mean to say "full_page_writes on a system where partial page
writes are possible", because if partial page writes are impossible then
the user can keep full_page_writes = off.

> then you've intentionally broken crash recovery, and you get to keep
> both pieces.
Robert, broadly I understand your and Simon's point that we should reduce
WAL only when the update stays on the same page. I have implemented the
final patch, which applies the WAL optimization only when the updated
tuple is on the same page as the old one. We have also observed that the
performance improvement is good with fillfactor 80.
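
As a rough illustration of the same-page rule (a sketch only, not the
patch itself), the following uses the [offset(2 bytes)] [length(2 bytes)]
[changed data value] layout from the first mail in this thread. The
function names are hypothetical, the comparison is done byte by byte
rather than per column, tuples are assumed to have equal length (fixed
length columns only), and the WAL update header and tuple header are
left out.

```python
import struct


def encode_delta(old: bytes, new: bytes) -> bytes:
    """Encode new as runs of [offset][length][data] relative to old."""
    if len(old) != len(new):
        raise ValueError("sketch handles equal-length tuples only")
    out = bytearray()
    i, n = 0, len(new)
    while i < n:
        if old[i] == new[i]:
            i += 1
            continue
        start = i
        while i < n and old[i] != new[i]:
            i += 1
        # 2-byte offset, 2-byte length, then the changed bytes.
        out += struct.pack("<HH", start, i - start) + new[start:i]
    return bytes(out)


def apply_delta(old: bytes, delta: bytes) -> bytes:
    """Rebuild the new tuple from the old tuple plus the delta."""
    buf = bytearray(old)
    pos = 0
    while pos < len(delta):
        offset, length = struct.unpack_from("<HH", delta, pos)
        pos += 4
        buf[offset:offset + length] = delta[pos:pos + length]
        pos += length
    return bytes(buf)


def wal_payload(old: bytes, new: bytes, old_block: int, new_block: int):
    """Use the delta only when old and new tuple share a block."""
    if old_block == new_block:
        return ("delta", encode_delta(old, new))
    return ("full", new)


old_tuple = b"AAAABBBBCCCC"
new_tuple = b"AAAAXXBBCCCC"
same_page = wal_payload(old_tuple, new_tuple, old_block=1, new_block=1)
other_page = wal_payload(old_tuple, new_tuple, old_block=1, new_block=2)
rebuilt = apply_delta(old_tuple, same_page[1])
```

Because the delta is produced only when both versions are on one block,
replay needs the old tuple from that same block and never has to read a
second page to rebuild the new version.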

With Regards,
Amit Kapila.