Thread: page corruption on 8.3+ that makes it to standby

page corruption on 8.3+ that makes it to standby

From
Jeff Davis
Date:
I reported a problem here:

http://archives.postgresql.org/pgsql-bugs/2010-07/msg00173.php

Perhaps I used a poor subject line, but I believe it's a serious issue.
That reproducible sequence seems like an obvious bug to me on 8.3+, and
what's worse, the corruption propagates to the standby as I found out
today (through a test, fortunately).

The only mitigating factor is that it doesn't actually lose data, and
you can fix it (I believe) with zero_damaged_pages (or careful use of
dd).

There are two fixes that I can see:

1. Have log_newpage() and heap_xlog_newpage() only call PageSetLSN() and
PageSetTLI() if the page is not new. This seems slightly awkward because
most WAL replay stuff doesn't have to worry about zero pages, but in
this case I think it does.

2. Have copy_relation_data() initialize new pages. I don't like this
because (a) it's not really the job of SET TABLESPACE to clean up zero
pages; and (b) it could be an index with different special size, etc.,
and it doesn't seem like a good place to figure that out.

Comments?

Regards,Jeff Davis



Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Tue, Jul 27, 2010 at 2:06 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> I reported a problem here:
>
> http://archives.postgresql.org/pgsql-bugs/2010-07/msg00173.php
>
> Perhaps I used a poor subject line, but I believe it's a serious issue.
> That reproducible sequence seems like an obvious bug to me on 8.3+, and
> what's worse, the corruption propagates to the standby as I found out
> today (through a test, fortunately).

I think that the problem is not so much your choice of subject line as
your misfortune to discover this bug when Tom and Heikki were both on
vacation.

> The only mitigating factor is that it doesn't actually lose data, and
> you can fix it (I believe) with zero_damaged_pages (or careful use of
> dd).
>
> There are two fixes that I can see:
>
> 1. Have log_newpage() and heap_xlog_newpage() only call PageSetLSN() and
> PageSetTLI() if the page is not new. This seems slightly awkward because
> most WAL replay stuff doesn't have to worry about zero pages, but in
> this case I think it does.
>
> 2. Have copy_relation_data() initialize new pages. I don't like this
> because (a) it's not really the job of SET TABLESPACE to clean up zero
> pages; and (b) it could be an index with different special size, etc.,
> and it doesn't seem like a good place to figure that out.

It appears to me that all of the callers of log_newpage() other than
copy_relation_data() do so with pages that they've just constructed,
and which therefore can't be new.  So maybe we could just modify
copy_relation_data to check PageIsNew(buf), or something like that,
and only call log_newpage() if that returns true.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: page corruption on 8.3+ that makes it to standby

From
Jeff Davis
Date:
On Tue, 2010-07-27 at 15:50 -0400, Robert Haas wrote:
> > 1. Have log_newpage() and heap_xlog_newpage() only call PageSetLSN() and
> > PageSetTLI() if the page is not new. This seems slightly awkward because
> > most WAL replay stuff doesn't have to worry about zero pages, but in
> > this case I think it does.
> >
> > 2. Have copy_relation_data() initialize new pages. I don't like this
> > because (a) it's not really the job of SET TABLESPACE to clean up zero
> > pages; and (b) it could be an index with different special size, etc.,
> > and it doesn't seem like a good place to figure that out.
> 
> It appears to me that all of the callers of log_newpage() other than
> copy_relation_data() do so with pages that they've just constructed,
> and which therefore can't be new.  So maybe we could just modify
> copy_relation_data to check PageIsNew(buf), or something like that,
> and only call log_newpage() if that returns true.

My first concern with that idea was that it may create an inconsistency
between the primary and the standby. The primary could have a bunch of
zero pages that never make it to the standby.

However, it looks like all WAL recovery stuff passes true for "init"
when calling XLogReadBuffer(), so I think it's safe.

I'll test it and submit a patch.

Regards,Jeff Davis



Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Tue, Jul 27, 2010 at 5:08 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2010-07-27 at 15:50 -0400, Robert Haas wrote:
>> > 1. Have log_newpage() and heap_xlog_newpage() only call PageSetLSN() and
>> > PageSetTLI() if the page is not new. This seems slightly awkward because
>> > most WAL replay stuff doesn't have to worry about zero pages, but in
>> > this case I think it does.
>> >
>> > 2. Have copy_relation_data() initialize new pages. I don't like this
>> > because (a) it's not really the job of SET TABLESPACE to clean up zero
>> > pages; and (b) it could be an index with different special size, etc.,
>> > and it doesn't seem like a good place to figure that out.
>>
>> It appears to me that all of the callers of log_newpage() other than
>> copy_relation_data() do so with pages that they've just constructed,
>> and which therefore can't be new.  So maybe we could just modify
>> copy_relation_data to check PageIsNew(buf), or something like that,
>> and only call log_newpage() if that returns true.
>
> My first concern with that idea was that it may create an inconsistency
> between the primary and the standby. The primary could have a bunch of
> zero pages that never make it to the standby.

Maybe I'm slow on the uptake here, but don't the pages start out
all-zeroes on the standby just as they do on the primary? The only way
it seems like this would be a problem is if a page that previously
contained data on the primary was subsequently zeroed without writing
a WAL record - or am I confused?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: page corruption on 8.3+ that makes it to standby

From
Jeff Davis
Date:
On Tue, 2010-07-27 at 17:18 -0400, Robert Haas wrote:
> > My first concern with that idea was that it may create an inconsistency
> > between the primary and the standby. The primary could have a bunch of
> > zero pages that never make it to the standby.
> 
> Maybe I'm slow on the uptake here, but don't the pages start out
> all-zeroes on the standby just as they do on the primary? The only way
> it seems like this would be a problem is if a page that previously
> contained data on the primary was subsequently zeroed without writing
> a WAL record - or am I confused?

The case I was concerned about is when you have a table on the primary
with a bunch of zero pages at the end. Then you SET TABLESPACE, and none
of the copied pages (or even the fact that they exist) would be sent to
the standby, but they would exist on the primary. And later pages may
have data, so the standby may see page N but not N-1.

Generally, most of the code is not expecting to read or write past the
end of the file, unless it's doing an extension.

However, I think everything is fine during recovery, because it looks
like it's designed to create zero pages as needed. So your idea seems
safe to me, although I do still have some doubts because of my lack of
knowledge in this area; particularly hot standby conflict
detection/resolution.

My idea was different: still log the zero page, just don't set LSN or
TLI for a zero page in log_newpage() or heap_xlog_newpage(). This isn't
as clean as your idea, but I'm a little more confident that it is
correct.

Regards,Jeff Davis




Re: page corruption on 8.3+ that makes it to standby

From
Jeff Davis
Date:
On Tue, 2010-07-27 at 15:23 -0700, Jeff Davis wrote:
> On Tue, 2010-07-27 at 17:18 -0400, Robert Haas wrote:
> > > My first concern with that idea was that it may create an inconsistency
> > > between the primary and the standby. The primary could have a bunch of
> > > zero pages that never make it to the standby.
> >
> > Maybe I'm slow on the uptake here, but don't the pages start out
> > all-zeroes on the standby just as they do on the primary? The only way
> > it seems like this would be a problem is if a page that previously
> > contained data on the primary was subsequently zeroed without writing
> > a WAL record - or am I confused?
>
> The case I was concerned about is when you have a table on the primary
> with a bunch of zero pages at the end. Then you SET TABLESPACE, and none
> of the copied pages (or even the fact that they exist) would be sent to
> the standby, but they would exist on the primary. And later pages may
> have data, so the standby may see page N but not N-1.
>
> Generally, most of the code is not expecting to read or write past the
> end of the file, unless it's doing an extension.
>
> However, I think everything is fine during recovery, because it looks
> like it's designed to create zero pages as needed. So your idea seems
> safe to me, although I do still have some doubts because of my lack of
> knowledge in this area; particularly hot standby conflict
> detection/resolution.
>
> My idea was different: still log the zero page, just don't set LSN or
> TLI for a zero page in log_newpage() or heap_xlog_newpage(). This isn't
> as clean as your idea, but I'm a little more confident that it is
> correct.
>

Both potential fixes attached and both appear to work.

fix1 -- Only call PageSetLSN/TLI inside log_newpage() and
heap_xlog_newpage() if the page is not zeroed.

fix2 -- Don't call log_newpage() at all if the page is not zeroed.

Please review. I don't have a strong opinion about which one should be
applied.

Regards,
    Jeff Davis

Attachment

Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Wed, Jul 28, 2010 at 12:23 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2010-07-27 at 15:23 -0700, Jeff Davis wrote:
>> On Tue, 2010-07-27 at 17:18 -0400, Robert Haas wrote:
>> > > My first concern with that idea was that it may create an inconsistency
>> > > between the primary and the standby. The primary could have a bunch of
>> > > zero pages that never make it to the standby.
>> >
>> > Maybe I'm slow on the uptake here, but don't the pages start out
>> > all-zeroes on the standby just as they do on the primary? The only way
>> > it seems like this would be a problem is if a page that previously
>> > contained data on the primary was subsequently zeroed without writing
>> > a WAL record - or am I confused?
>>
>> The case I was concerned about is when you have a table on the primary
>> with a bunch of zero pages at the end. Then you SET TABLESPACE, and none
>> of the copied pages (or even the fact that they exist) would be sent to
>> the standby, but they would exist on the primary. And later pages may
>> have data, so the standby may see page N but not N-1.
>>
>> Generally, most of the code is not expecting to read or write past the
>> end of the file, unless it's doing an extension.
>>
>> However, I think everything is fine during recovery, because it looks
>> like it's designed to create zero pages as needed. So your idea seems
>> safe to me, although I do still have some doubts because of my lack of
>> knowledge in this area; particularly hot standby conflict
>> detection/resolution.
>>
>> My idea was different: still log the zero page, just don't set LSN or
>> TLI for a zero page in log_newpage() or heap_xlog_newpage(). This isn't
>> as clean as your idea, but I'm a little more confident that it is
>> correct.
>>
>
> Both potential fixes attached and both appear to work.
>
> fix1 -- Only call PageSetLSN/TLI inside log_newpage() and
> heap_xlog_newpage() if the page is not zeroed.
>
> fix2 -- Don't call log_newpage() at all if the page is not zeroed.
>
> Please review. I don't have a strong opinion about which one should be
> applied.

Looks good.  I still prefer fix2; it seems cleaner to me.  It has
another advantage, too, which is that copy_relation_data() is used
ONLY by ALTER TABLE SET TABLESPACE.  So if I stick to patching that
function, that's the only thing I can possibly break, whereas
log_newpage() is used in a bunch of other places.  I don't think
either fix is going to break anything at all, but considering that
this is going to need back-patching, I'd rather be conservative.

Speaking of back-patching, the subject line describes this as an 8.3+
problem, but it looks to me like this goes all the way back to 8.0.
The code is a little different in 8.2 and prior, but ISTM it's
vulnerable to the same issue.  Don't we need a modified version of
this patch for the 8.0 - 8.2 branches?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: page corruption on 8.3+ that makes it to standby

From
Simon Riggs
Date:
On Tue, 2010-07-27 at 21:23 -0700, Jeff Davis wrote:

> Both potential fixes attached and both appear to work.
> 
> fix1 -- Only call PageSetLSN/TLI inside log_newpage() and
> heap_xlog_newpage() if the page is not zeroed.
> 
> fix2 -- Don't call log_newpage() at all if the page is not zeroed.
> 
> Please review. I don't have a strong opinion about which one should be
> applied.

ISTM we should just fix an uninitialized page first, using code from
VACUUM similar to
 if (PageIsNew(page)) {   ereport(WARNING,(errmsg("relation \"%s\" page %u is uninitialized --- fixing",
   relname, blkno)));   PageInit(page, BufferGetPageSize(buf), 0); }
 

then continue as before.

We definitely shouldn't do anything that leaves standby different to
primary.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services



Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Wed, Jul 28, 2010 at 7:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2010-07-27 at 21:23 -0700, Jeff Davis wrote:
>
>> Both potential fixes attached and both appear to work.
>>
>> fix1 -- Only call PageSetLSN/TLI inside log_newpage() and
>> heap_xlog_newpage() if the page is not zeroed.
>>
>> fix2 -- Don't call log_newpage() at all if the page is not zeroed.
>>
>> Please review. I don't have a strong opinion about which one should be
>> applied.
>
> ISTM we should just fix an uninitialized page first, using code from
> VACUUM similar to
>
>  if (PageIsNew(page))
>  {
>    ereport(WARNING,
>        (errmsg("relation \"%s\" page %u is uninitialized --- fixing",
>                                                relname, blkno)));
>    PageInit(page, BufferGetPageSize(buf), 0);
>  }
>
> then continue as before.

As Jeff Davis pointed out upthread, you don't know that 0 is the
correct special size for the relation being copied.  In the VACUUM
path, that code is only applied to heaps, but that's not necessarily
the case here.

> We definitely shouldn't do anything that leaves standby different to
> primary.

Obviously.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: page corruption on 8.3+ that makes it to standby

From
Jeff Davis
Date:
On Wed, 2010-07-28 at 06:40 -0400, Robert Haas wrote:
> > fix1 -- Only call PageSetLSN/TLI inside log_newpage() and
> > heap_xlog_newpage() if the page is not zeroed.
> >
> > fix2 -- Don't call log_newpage() at all if the page is not zeroed.
> >
> > Please review. I don't have a strong opinion about which one should be
> > applied.
> 
> Looks good.  I still prefer fix2; it seems cleaner to me.  It has
> another advantage, too, which is that copy_relation_data() is used
> ONLY by ALTER TABLE SET TABLESPACE.  So if I stick to patching that
> function, that's the only thing I can possibly break, whereas
> log_newpage() is used in a bunch of other places.  I don't think
> either fix is going to break anything at all, but considering that
> this is going to need back-patching, I'd rather be conservative.
> 

Sounds good to me.

However, when Simon said "We definitely shouldn't do anything that
leaves standby different to primary." you said "obviously". Fix2 can
leave a difference between the two, because zeroed pages at the end of
the heap file on the primary will not be sent to the standby (the
standby will only create the zeroed pages if a higher block number is
sent; which won't be the case if the zeroed pages are at the end).

As we discussed before, that looks inconsequential, but I just want to
make sure that it's understood.

> Speaking of back-patching, the subject line describes this as an 8.3+
> problem, but it looks to me like this goes all the way back to 8.0.
> The code is a little different in 8.2 and prior, but ISTM it's
> vulnerable to the same issue.  Don't we need a modified version of
> this patch for the 8.0 - 8.2 branches?

That sounds right. I just saw that the code in question was introduced
in 8.3.

Regards,Jeff Davis



Re: page corruption on 8.3+ that makes it to standby

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> However, when Simon said "We definitely shouldn't do anything that
> leaves standby different to primary." you said "obviously". Fix2 can
> leave a difference between the two, because zeroed pages at the end of
> the heap file on the primary will not be sent to the standby (the
> standby will only create the zeroed pages if a higher block number is
> sent; which won't be the case if the zeroed pages are at the end).

> As we discussed before, that looks inconsequential, but I just want to
> make sure that it's understood.

I understand it, and I don't like it one bit.  I haven't caught up on
this thread yet, but I think the only acceptable solution is one that
leaves the slave in the *same* state as the master.  Not a state that
we hope will behave equivalently.  I can think of too many corner cases
where it might not.  (In fact, having a zeroed page in a relation is
already a corner case in itself, so the amount of testing you'd get for
such behaviors is epsilon squared.  You don't want to take that bet.)
        regards, tom lane


Re: page corruption on 8.3+ that makes it to standby

From
Jeff Davis
Date:
On Wed, 2010-07-28 at 12:36 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > However, when Simon said "We definitely shouldn't do anything that
> > leaves standby different to primary." you said "obviously". Fix2 can
> > leave a difference between the two, because zeroed pages at the end of
> > the heap file on the primary will not be sent to the standby (the
> > standby will only create the zeroed pages if a higher block number is
> > sent; which won't be the case if the zeroed pages are at the end).
> 
> > As we discussed before, that looks inconsequential, but I just want to
> > make sure that it's understood.
> 
> I understand it, and I don't like it one bit.  I haven't caught up on
> this thread yet, but I think the only acceptable solution is one that
> leaves the slave in the *same* state as the master.  Not a state that
> we hope will behave equivalently.  I can think of too many corner cases
> where it might not.  (In fact, having a zeroed page in a relation is
> already a corner case in itself, so the amount of testing you'd get for
> such behaviors is epsilon squared.  You don't want to take that bet.)
> 

Ok, sounds like my original fix (fix1) is the way to go then. Log zero
pages, but don't set LSN/TLI if it's a zero page (in log_newpage or
heap_xlog_newpage).

Regards,Jeff Davis



Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Wed, Jul 28, 2010 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> However, when Simon said "We definitely shouldn't do anything that
>> leaves standby different to primary." you said "obviously". Fix2 can
>> leave a difference between the two, because zeroed pages at the end of
>> the heap file on the primary will not be sent to the standby (the
>> standby will only create the zeroed pages if a higher block number is
>> sent; which won't be the case if the zeroed pages are at the end).
>
>> As we discussed before, that looks inconsequential, but I just want to
>> make sure that it's understood.
>
> I understand it, and I don't like it one bit.  I haven't caught up on
> this thread yet, but I think the only acceptable solution is one that
> leaves the slave in the *same* state as the master.  Not a state that
> we hope will behave equivalently.  I can think of too many corner cases
> where it might not.  (In fact, having a zeroed page in a relation is
> already a corner case in itself, so the amount of testing you'd get for
> such behaviors is epsilon squared.  You don't want to take that bet.)

I might be missing something here, but I don't see how you're going to
manage that.  In Jeff's original example, he crashes the database
after extending the relation but before initializing and writing the
new page.  I believe that at that point no XLOG has been written yet,
so the relation has been extended but there is no WAL to be sent to
the standby.  So now you have the exact situation you're concerned
about - the relation has been extended on the master but not on the
standby.  As far as I can see, this is an unavoidable consequence of
the fact that we don't XLOG the act of extending the relation.
Worrying about it only in the specific context of ALTER TABLE .. SET
TABLESPACE seems backwards; if there are any bugs there, we're in for
it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: page corruption on 8.3+ that makes it to standby

From
Jeff Davis
Date:
On Wed, 2010-07-28 at 13:18 -0400, Robert Haas wrote:
> In Jeff's original example, he crashes the database
> after extending the relation but before initializing and writing the
> new page.  I believe that at that point no XLOG has been written yet,
> so the relation has been extended but there is no WAL to be sent to
> the standby.  So now you have the exact situation you're concerned
> about - the relation has been extended on the master but not on the
> standby.  As far as I can see, this is an unavoidable consequence of
> the fact that we don't XLOG the act of extending the relation.
> Worrying about it only in the specific context of ALTER TABLE .. SET
> TABLESPACE seems backwards; if there are any bugs there, we're in for
> it.

That's a very good point. Now I'm leaning more toward your fix.

Regards,Jeff Davis



Re: page corruption on 8.3+ that makes it to standby

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jul 28, 2010 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I understand it, and I don't like it one bit. �I haven't caught up on
>> this thread yet, but I think the only acceptable solution is one that
>> leaves the slave in the *same* state as the master.

> I might be missing something here, but I don't see how you're going to
> manage that.  In Jeff's original example, he crashes the database
> after extending the relation but before initializing and writing the
> new page.  I believe that at that point no XLOG has been written yet,
> so the relation has been extended but there is no WAL to be sent to
> the standby.  So now you have the exact situation you're concerned
> about - the relation has been extended on the master but not on the
> standby.

You're right that we cannot prevent that situation --- or at least,
the cure would be worse than the disease.  (The cure would be to
XLOG the extension action, obviously, but then out-of-disk-space
has to be a PANIC condition.)  However, it doesn't follow that it's
a good idea to make copy_relation_data *intentionally* make the slave
and master different.

I've caught up on the thread now, and I think that fix2 (skip logging
the page) is extremely dangerous and has little if anything in its
favor.  fix1 seems reasonable given the structure of the page validity
checks.

However, what about Jeff's original comment

: On second thought, why are PageSetLSN and PageSetTLI being called from
: log_newpage(), anyway?

I think it is appropriate to be setting the LSN/TLI in the case of a
page that's been constructed by the caller as part of the WAL-logged
action, but doing so in copy_relation_data seems rather questionable.
We certainly didn't change the source page so changing its LSN seems
rather wrong --- wouldn't it be better to just copy the source pages
with their original LSNs?  So perhaps the best fix is to add a bool
parameter to log_newpage telling it whether to update LSN/TLI, and
have copy_relation_data pass false while the other callers pass true.
(Although I guess we'd need to propagate that flag in the WAL record,
so maybe this is more trouble than its worth.)
        regards, tom lane


Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Wed, Jul 28, 2010 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jul 28, 2010 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I understand it, and I don't like it one bit.  I haven't caught up on
>>> this thread yet, but I think the only acceptable solution is one that
>>> leaves the slave in the *same* state as the master.
>
>> I might be missing something here, but I don't see how you're going to
>> manage that.  In Jeff's original example, he crashes the database
>> after extending the relation but before initializing and writing the
>> new page.  I believe that at that point no XLOG has been written yet,
>> so the relation has been extended but there is no WAL to be sent to
>> the standby.  So now you have the exact situation you're concerned
>> about - the relation has been extended on the master but not on the
>> standby.
>
> You're right that we cannot prevent that situation --- or at least,
> the cure would be worse than the disease.  (The cure would be to
> XLOG the extension action, obviously, but then out-of-disk-space
> has to be a PANIC condition.)

Not to mention that performance would probably be atrocious.

> However, it doesn't follow that it's
> a good idea to make copy_relation_data *intentionally* make the slave
> and master different.
>
> I've caught up on the thread now, and I think that fix2 (skip logging
> the page) is extremely dangerous and has little if anything in its
> favor.

Why do you think that?  They will be different only in terms of
whether the uninitialized bytes are before or after the nominal EOF,
and we know we have to be indifferent to that case anyway.

> fix1 seems reasonable given the structure of the page validity
> checks.
>
> However, what about Jeff's original comment
>
> : On second thought, why are PageSetLSN and PageSetTLI being called from
> : log_newpage(), anyway?
>
> I think it is appropriate to be setting the LSN/TLI in the case of a
> page that's been constructed by the caller as part of the WAL-logged
> action, but doing so in copy_relation_data seems rather questionable.
> We certainly didn't change the source page so changing its LSN seems
> rather wrong --- wouldn't it be better to just copy the source pages
> with their original LSNs?  So perhaps the best fix is to add a bool
> parameter to log_newpage telling it whether to update LSN/TLI, and
> have copy_relation_data pass false while the other callers pass true.
> (Although I guess we'd need to propagate that flag in the WAL record,
> so maybe this is more trouble than its worth.)

It seems like if log_newpage() were to set the LSN/TLI before calling
XLogInsert() - or optionally not - then it wouldn't be necessary to
set them also in heap_xlog_newpage(); the memcpy operation would by
definition have copied the right information onto the page.  That
seems like it would be a cleaner design, but back-patching a change to
the interpretation of WAL records that might already be on someone's
disk seems dicey at best.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: page corruption on 8.3+ that makes it to standby

From
Jeff Davis
Date:
On Wed, 2010-07-28 at 14:50 -0400, Robert Haas wrote:
> It seems like if log_newpage() were to set the LSN/TLI before calling
> XLogInsert() - or optionally not - then it wouldn't be necessary to
> set them also in heap_xlog_newpage(); the memcpy operation would by
> definition have copied the right information onto the page.  That
> seems like it would be a cleaner design, but back-patching a change to
> the interpretation of WAL records that might already be on someone's
> disk seems dicey at best.

How do you set the LSN before XLogInsert()?

Regards,Jeff Davis



Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Wed, Jul 28, 2010 at 3:08 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2010-07-28 at 14:50 -0400, Robert Haas wrote:
>> It seems like if log_newpage() were to set the LSN/TLI before calling
>> XLogInsert() - or optionally not - then it wouldn't be necessary to
>> set them also in heap_xlog_newpage(); the memcpy operation would by
>> definition have copied the right information onto the page.  That
>> seems like it would be a cleaner design, but back-patching a change to
>> the interpretation of WAL records that might already be on someone's
>> disk seems dicey at best.
>
> How do you set the LSN before XLogInsert()?

Details, details...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: page corruption on 8.3+ that makes it to standby

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jul 28, 2010 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I've caught up on the thread now, and I think that fix2 (skip logging
>> the page) is extremely dangerous and has little if anything in its
>> favor.

> Why do you think that?  They will be different only in terms of
> whether the uninitialized bytes are before or after the nominal EOF,
> and we know we have to be indifferent to that case anyway.

(1) You're assuming that the page will be zeroes on the slave without
having forced it to be so.  A really obvious case where this fails
is where we're doing crash-and-restart on the master: a later action
could have modified the page away from the all-zero state.  (In
principle that's OK but I think this might break torn-page protection.)

(2) On filesystems that support holes, the page will not have storage,
whereas it (probably) does on the master.  This could lead to a
divergence in behavior later, ie slave runs out of disk space at a
different point than the master.

(3) The position of the nominal EOF can drive choices about which page
to put new tuples in, specifically thats where RelationGetBufferForTuple
will go if FSM has no information.  This could result in unexpected
divergence in behavior after the slave goes live compared to what the
master would have done.  Maybe that's OK but it seems better to avoid
it if we can, especially when you think about crash-and-restart on the
master as opposed to a separate slave.

Now as I said earlier, these are all tiny corners of a corner case, and
they *probably* shouldn't matter.  But I see no good reason to expose
ourselves to the possibility that there's some cases where they do
matter.  Especially when your argument for fix2 is a purely aesthetic
judgment that I don't agree with anyway.

>> I think it is appropriate to be setting the LSN/TLI in the case of a
>> page that's been constructed by the caller as part of the WAL-logged
>> action, but doing so in copy_relation_data seems rather questionable.
>> We certainly didn't change the source page so changing its LSN seems
>> rather wrong --- wouldn't it be better to just copy the source pages
>> with their original LSNs?

> It seems like if log_newpage() were to set the LSN/TLI before calling
> XLogInsert() - or optionally not - then it wouldn't be necessary to
> set them also in heap_xlog_newpage(); the memcpy operation would by
> definition have copied the right information onto the page.

Not possible because it is only after you've done XLogInsert that you
know what LSN was assigned to the WAL record.
        regards, tom lane


Re: page corruption on 8.3+ that makes it to standby

From
Tom Lane
Date:
I wrote:
>>> I think it is appropriate to be setting the LSN/TLI in the case of a
>>> page that's been constructed by the caller as part of the WAL-logged
>>> action, but doing so in copy_relation_data seems rather questionable.

BTW, I thought of an argument that explains why that's sane: it marks
the copied page as having been recently WAL-logged.  If we do some
action on the copied relation shortly after completing the
copy_relation_data transaction, we will see that its LSN is later than
the last checkpoint and know that we don't need to emit a full-page WAL
image for it, which is correct because in case of crash+restart the
HEAP_NEWPAGE record will provide the full-page image.  If we left the
source relation's page's LSN in there, we would frequently make the
wrong decision and emit an unnecessary extra full-page image.

So nevermind that distraction.  I'm back to thinking that fix1 is
the way to go.
        regards, tom lane


Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Wed, Jul 28, 2010 at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (1) You're assuming that the page will be zeroes on the slave without
> having forced it to be so.  A really obvious case where this fails
> is where we're doing crash-and-restart on the master: a later action
> could have modified the page away from the all-zero state.  (In
> principle that's OK but I think this might break torn-page protection.)

Hmm, yeah, that does seem like it has the potential to be bad.  I
think this is sufficient reason to go with fix #1.

> (2) On filesystems that support holes, the page will not have storage,
> whereas it (probably) does on the master.  This could lead to a
> divergence in behavior later, ie slave runs out of disk space at a
> different point than the master.

I can't get excited about this one.

> (3) The position of the nominal EOF can drive choices about which page
> to put new tuples in, specifically thats where RelationGetBufferForTuple
> will go if FSM has no information.  This could result in unexpected
> divergence in behavior after the slave goes live compared to what the
> master would have done.  Maybe that's OK but it seems better to avoid
> it if we can, especially when you think about crash-and-restart on the
> master as opposed to a separate slave.

You're still going to have that in the "normal" (not altering the
tablespace) extension case, which is presumably far more common.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: page corruption on 8.3+ that makes it to standby

From
Jeff Davis
Date:
On Wed, 2010-07-28 at 15:37 -0400, Tom Lane wrote:
> So nevermind that distraction.  I'm back to thinking that fix1 is
> the way to go.

Agreed.

It's uncontroversial to have a simple guard against corrupting an
uninitialized page, and uncontroversial is good for things that will be
back-patched.

Regards,Jeff Davis



Re: page corruption on 8.3+ that makes it to standby

From
Simon Riggs
Date:
On Wed, 2010-07-28 at 14:22 -0700, Jeff Davis wrote:
> On Wed, 2010-07-28 at 15:37 -0400, Tom Lane wrote:
> > So nevermind that distraction.  I'm back to thinking that fix1 is
> > the way to go.
> 
> Agreed.
> 
> It's uncontroversial to have a simple guard against corrupting an
> uninitialized page, and uncontroversial is good for things that will be
> back-patched.

Still don't understand why we would not initialize such pages. If we're
copying a relation we must know enough about it to init a page.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services



Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Thu, Jul 29, 2010 at 4:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, 2010-07-28 at 14:22 -0700, Jeff Davis wrote:
>> On Wed, 2010-07-28 at 15:37 -0400, Tom Lane wrote:
>> > So nevermind that distraction.  I'm back to thinking that fix1 is
>> > the way to go.
>>
>> Agreed.
>>
>> It's uncontroversial to have a simple guard against corrupting an
>> uninitialized page, and uncontroversial is good for things that will be
>> back-patched.
>
> Still don't understand why we would not initialize such pages. If we're
> copying a relation we must know enough about it to init a page.

Well, I don't see why we'd want to do that.  As Jeff Davis pointed
out, if someone asks to move a table to a different tablespace,
changing the contents as we go along seems a bit off-topic.  But the
bigger problem is you haven't explained how you think we could
determine what initialization ought to be performed.  There's no
index-AM API that says "initialize this page".  I suppose we could
invent one if there were some benefit, but we couldn't very well
back-patch such a thing to 8.0.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: page corruption on 8.3+ that makes it to standby

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 29, 2010 at 4:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Still don't understand why we would not initialize such pages. If we're
>> copying a relation we must know enough about it to init a page.

> Well, I don't see why we'd want to do that.  As Jeff Davis pointed
> out, if someone asks to move a table to a different tablespace,
> changing the contents as we go along seems a bit off-topic.  But the
> bigger problem is you haven't explained how you think we could
> determine what initialization ought to be performed.  There's no
> index-AM API that says "initialize this page".  I suppose we could
> invent one if there were some benefit, but we couldn't very well
> back-patch such a thing to 8.0.

Yeah.  And you really would have to get the AM involved.  Even if you
were willing to assume that you knew the special-space size for a
particular index type, it would not fly to assume that the special space
doesn't require initialization to some nonzero content.
        regards, tom lane


Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Wed, Jul 28, 2010 at 5:22 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2010-07-28 at 15:37 -0400, Tom Lane wrote:
>> So nevermind that distraction.  I'm back to thinking that fix1 is
>> the way to go.
>
> Agreed.
>
> It's uncontroversial to have a simple guard against corrupting an
> uninitialized page, and uncontroversial is good for things that will be
> back-patched.

Here's a version of Jeff's fix1 patch (with a trivial change to the
comment) that applies to HEAD, REL9_0_STABLE, REL8_4_STABLE, and
REL8_3_STABLE; a slightly modified version that applies to
REL8_2_STABLE; and another slightly modified version that applies to
REL8_1_STABLE and REL8_0_STABLE.  REL7_4_STABLE doesn't have
tablespaces, so the problem can't manifest there, I think.

I'm currently compiling and testing all of these.  When that's done,
should I go ahead and check this in, or wait until after beta4 wraps?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment

Re: page corruption on 8.3+ that makes it to standby

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Here's a version of Jeff's fix1 patch (with a trivial change to the
> comment) that applies to HEAD, REL9_0_STABLE, REL8_4_STABLE, and
> REL8_3_STABLE; a slightly modified version that applies to
> REL8_2_STABLE; and another slightly modified version that applies to
> REL8_1_STABLE and REL8_0_STABLE.  REL7_4_STABLE doesn't have
> tablespaces, so the problem can't manifest there, I think.

Looks sane to the eyeball.  I'm not sure if the oldest versions have the
same page-read-time header sanity checks that we have now, so it may be
that there is not a need for this patch all the way back; but it can't
hurt anything.

> I'm currently compiling and testing all of these.  When that's done,
> should I go ahead and check this in, or wait until after beta4 wraps?

Go ahead and commit, please.
        regards, tom lane


Re: page corruption on 8.3+ that makes it to standby

From
Robert Haas
Date:
On Thu, Jul 29, 2010 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Here's a version of Jeff's fix1 patch (with a trivial change to the
>> comment) that applies to HEAD, REL9_0_STABLE, REL8_4_STABLE, and
>> REL8_3_STABLE; a slightly modified version that applies to
>> REL8_2_STABLE; and another slightly modified version that applies to
>> REL8_1_STABLE and REL8_0_STABLE.  REL7_4_STABLE doesn't have
>> tablespaces, so the problem can't manifest there, I think.
>
> Looks sane to the eyeball.  I'm not sure if the oldest versions have the
> same page-read-time header sanity checks that we have now, so it may be
> that there is not a need for this patch all the way back; but it can't
> hurt anything.

It looks like they do.  I am able to reproduce the problem even on
8.0, and the patch does fix it.

>> I'm currently compiling and testing all of these.  When that's done,
>> should I go ahead and check this in, or wait until after beta4 wraps?
>
> Go ahead and commit, please.

Done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company