Thread: Re: [GENERAL] PANIC: heap_update_redo: no block

Re: [GENERAL] PANIC: heap_update_redo: no block

From
Tom Lane
Date:
[ redirecting to a more appropriate mailing list ]

"Alex bahdushka" <bahdushka@gmail.com> writes:

> LOG:  REDO @ D/19176610; LSN D/19176644: prev D/191765E8; xid 81148979: Heap - clean: rel 1663/16386/16559898; blk 0
> LOG:  REDO @ D/19176644; LSN D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel 1663/16386/16559898; tid 1/1; new 0/10
> PANIC:  heap_update_redo: no block: target blcknum: 1, relation(1663/16386/16559898) length: 1

I think what's happened here is that VACUUM FULL moved the only tuple
off page 1 of the relation, then truncated off page 1, and now
heap_update_redo is panicking because it can't find page 1 to replay the
move.  Curious that we've not seen a case like this before, because it
seems like a generic hazard for WAL replay.

The simplest fix would be to treat WAL records as no-ops if they refer
to nonexistent pages, but that seems much too prone to hide real failure
conditions.  Another thought is to remember that we ignored this record,
and then complain if we don't see a TRUNCATE that would've removed the
page.  That would be pretty complicated but not impossible.  Anyone have
a better idea?
        regards, tom lane
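
The second idea above (skip the record, remember that it was skipped, and complain unless a later truncation explains it) can be sketched as a toy model. This is only an illustration of the bookkeeping, not PostgreSQL code: the record dictionaries, the `replay` function, and the relation name used as a key are all hypothetical.

```python
# Toy model only: names here are hypothetical, not PostgreSQL functions.
from typing import Dict, List, Set, Tuple

PageRef = Tuple[str, int]  # (relation name, block number)


def replay(records: List[dict], rel_lengths: Dict[str, int]) -> Set[PageRef]:
    """Replay toy WAL records; return page references that were skipped
    and never explained by a later truncation."""
    unresolved: Set[PageRef] = set()
    for rec in records:
        rel = rec["rel"]
        if rec["type"] == "update":
            if rec["blk"] >= rel_lengths.get(rel, 0):
                # Page is missing: skip the record but remember that we did.
                unresolved.add((rel, rec["blk"]))
        elif rec["type"] == "truncate":
            rel_lengths[rel] = min(rel_lengths.get(rel, 0), rec["new_length"])
            # Truncation explains every skipped page at or beyond the new length.
            unresolved = {
                (r, b) for (r, b) in unresolved
                if not (r == rel and b >= rec["new_length"])
            }
    return unresolved


# The scenario from the report: the file has 1 block, the record wants block 1.
wal = [
    {"type": "update", "rel": "16559898", "blk": 1},
    {"type": "truncate", "rel": "16559898", "new_length": 1},
]
leftover_ok = replay(wal, {"16559898": 1})        # truncation follows
leftover_bad = replay(wal[:1], {"16559898": 1})   # no truncation ever seen
print(leftover_ok)
print(leftover_bad)
```

Anything left in the returned set at the end of replay is a reference to a page that should have existed, which is the case worth complaining about.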


Re: [GENERAL] PANIC: heap_update_redo: no block

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I think what's happened here is that VACUUM FULL moved the only tuple
> off page 1 of the relation, then truncated off page 1, and now
> heap_update_redo is panicking because it can't find page 1 to replay the
> move.  Curious that we've not seen a case like this before, because it
> seems like a generic hazard for WAL replay.

This sounds familiar

http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php


-- 
greg



Re: [GENERAL] PANIC: heap_update_redo: no block

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> This sounds familiar
> http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

Hm, I had totally forgotten about that todo item :-(.  Time to push it
back up the priority list.
        regards, tom lane


Re: [GENERAL] PANIC: heap_update_redo: no block

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I think what's happened here is that VACUUM FULL moved the only tuple
>> off page 1 of the relation, then truncated off page 1, and now
>> heap_update_redo is panicking because it can't find page 1 to replay the
>> move.  Curious that we've not seen a case like this before, because it
>> seems like a generic hazard for WAL replay.

> This sounds familiar
> http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

After further review I've concluded that there is not a systemic bug
here, but there are several nearby local bugs.  The reason it's not
a systemic bug is that this scenario is supposed to be handled by the
same mechanism that prevents torn-page writes: the first XLOG record
that touches a given page after a checkpoint is supposed to rewrite
the entire page, rather than update it incrementally.  Since XLOG replay
always begins at a checkpoint, this means we should always be able to
write a fresh copy of the page, even after relation deletion or
truncation.  Furthermore, during XLOG replay we are willing to create
a table (or even a whole tablespace or database directory) if it's not
there when touched.  The subsequent replay of the deletion or truncation
will get rid of any unwanted data again.

Therefore, there is no systemic bug --- unless you are running with
full_page_writes=off.  I assert that that GUC variable is broken and
must be removed.
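
A toy model may make the argument above concrete. It assumes a "disk" that is just a dictionary from block number to page contents, and hypothetical record dictionaries; none of the names come from the PostgreSQL source. A record that carries a full page image can recreate a page that was truncated away before the crash, while an incremental record cannot.

```python
# Toy model: a "disk" maps block number -> dict of tuples; names are hypothetical.
def redo(disk, record):
    blk = record["blk"]
    if "full_page" in record:
        # Full page image attached: overwrite (or recreate) the whole page.
        disk[blk] = dict(record["full_page"])
        return
    if blk not in disk:
        raise RuntimeError(f"no block: target blcknum: {blk}")
    # Incremental change: needs the existing page contents.
    disk[blk].pop(record["delete_slot"], None)


def replay(disk, records):
    for rec in records:
        if rec["type"] == "truncate":
            # Drop every block at or beyond the new length.
            for blk in [b for b in disk if b >= rec["new_length"]]:
                del disk[blk]
        else:
            redo(disk, rec)
    return disk


truncate = {"type": "truncate", "new_length": 1}
with_image = {"type": "update", "blk": 1, "full_page": {}}
without_image = {"type": "update", "blk": 1, "delete_slot": 1}

# On-disk state after the crash already reflects the truncation: only block 0.
recovered = replay({0: {10: "moved tuple"}}, [with_image, truncate])

try:
    replay({0: {10: "moved tuple"}}, [without_image, truncate])
    failed = False
except RuntimeError as exc:
    failed = True
    print("PANIC-like failure:", exc)
```

With the page image, block 1 is recreated and then removed again by the replayed truncation; without it, replay stops at the first record.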

There are, however, a bunch of local bugs, including these:

* On a symlink-less platform (ie, Windows), TablespaceCreateDbspace is
#ifdef'd to be a no-op.  This is wrong because it performs the essential
function of re-creating a tablespace or database directory if needed
during replay.  AFAICS the #if can just be removed and have the same
code with or without symlinks.

* log_heap_update decides that it can set XLOG_HEAP_INIT_PAGE instead
of storing the full destination page, if the destination contains only
the single tuple being moved.  This is fine, except it also resets the
buffer indicator for the *source* page, which is wrong --- that page
may still need to be re-generated from the xlog record.  This is the
proximate cause of the bug report that started this thread.

* btree_xlog_split passes extend=false to XLogReadBuffer for the left
sibling, which is silly because it is going to rewrite that whole page
from the xlog record anyway.  It should pass true so that there's no
complaint if the left sib page was later truncated away.  This accounts
for one of the bug reports mentioned in the message cited above.

* btree_xlog_delete_page passes extend=false for the target page,
which is likewise silly because it's going to init the page (not that
there was any useful data on it anyway).  This accounts for the other
bug report mentioned in the message cited above.
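
The difference the extend flag makes in the two btree cases can be shown with a small sketch. This is a simplified stand-in, not the real XLogReadBuffer: the `read_buffer` function, the list-of-pages relation, and the 8192-byte page size are assumptions made for illustration.

```python
PAGE_SIZE = 8192  # assumed block size for this toy model


def read_buffer(relation, blkno, extend):
    """Return the page at blkno, or None if it is missing and extend is False.
    With extend=True, pad the relation with zeroed pages up to blkno."""
    if blkno < len(relation):
        return relation[blkno]
    if not extend:
        return None
    while len(relation) <= blkno:
        relation.append(bytearray(PAGE_SIZE))
    return relation[blkno]


rel = [bytearray(PAGE_SIZE)]                 # relation truncated back to one page
missing = read_buffer(rel, 3, extend=False)  # caller would have to complain here
page = read_buffer(rel, 3, extend=True)      # relation grows to cover block 3
page[:] = b"\x01" * PAGE_SIZE                # stand-in for rewriting the page from the record
```

Since the redo routine overwrites the whole page anyway, extending is harmless here: the zeroed filler pages are either rewritten by later records or removed when the truncation is replayed.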

Clearly, we need to go through the xlog code with a fine tooth comb
and convince ourselves that all pages touched by any xlog record will
be properly reconstituted if they've later been truncated off.  I have
not yet examined any of the code except the above.

Notice that these are each, individually, pretty low-probability
scenarios, which is why we've not seen many bug reports.  If we had had
a systemic bug I'm sure we'd be seeing far more.
        regards, tom lane


Re: [GENERAL] PANIC: heap_update_redo: no block

From
Simon Riggs
Date:
On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> I think what's happened here is that VACUUM FULL moved the only tuple
> >> off page 1 of the relation, then truncated off page 1, and now
> >> heap_update_redo is panicking because it can't find page 1 to replay the
> >> move.  Curious that we've not seen a case like this before, because it
> >> seems like a generic hazard for WAL replay.
> 
> > This sounds familiar
> > http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

Yes, I remember that also.

> After further review I've concluded that there is not a systemic bug
> here, but there are several nearby local bugs.  

IMHO it's amazing to find so many bugs in a code review of existing
production code. Cool.

> The reason it's not
> a systemic bug is that this scenario is supposed to be handled by the
> same mechanism that prevents torn-page writes: the first XLOG record
> that touches a given page after a checkpoint is supposed to rewrite
> the entire page, rather than update it incrementally.  Since XLOG replay
> always begins at a checkpoint, this means we should always be able to
> write a fresh copy of the page, even after relation deletion or
> truncation.  Furthermore, during XLOG replay we are willing to create
> a table (or even a whole tablespace or database directory) if it's not
> there when touched.  The subsequent replay of the deletion or truncation
> will get rid of any unwanted data again.

That will all work, agreed.

> The subsequent replay of the deletion or truncation
> will get rid of any unwanted data again.

Trouble is, it is not a watertight assumption that there *will be* a
subsequent truncation, even if it is a strong one. If there is not a
later truncation, we will just ignore what we ought to now know is an
error and then try to continue as if the database was fine, which it
would not be.

The overall problem is that auto extension fails to take action or
provide notification with regard to file system corruptions. Clearly we
would like xlog replay to work even in the face of strong file
corruptions, but we should make attempts to identify this situation and
notify people that this has occurred.

I'd suggest both WARNING messages in the log and something more extreme
still: anyone touching a corrupt table should receive a NOTICE saying
"database recovery displayed errors for this table" "HINT: check the
database logfiles for specific messages". Indexes should have a log
WARNING saying "database recovery displayed errors for this index"
"HINT: use  REINDEX to rebuild this index".

So I guess I had better help if we agree this is beneficial.

> Therefore, there is no systemic bug --- unless you are running with
> full_page_writes=off.  I assert that that GUC variable is broken and
> must be removed.

On this analysis, I would agree for current production systems. But what
this says is something deeper: we must log full pages, not because we
fear a partial page write has occurred, but because the xlog mechanism
intrinsically depends upon the existence of those full pages after each
checkpoint.

The writing of full pages in this way is a serious performance issue
that it would be good to improve upon. Perhaps this is the spur to
discuss a new xlog format that would support higher performance logging
as well as log-mining for replication?

> There are, however, a bunch of local bugs, including these:

...

> Notice that these are each, individually, pretty low-probability
> scenarios, which is why we've not seen many bug reports.  

Most people don't file bug reports. If we have a recovery mode that
ignores file system corruption we'll get even fewer, because any errors
that occur will be blamed on gamma rays or some other excuse.

> a systemic bug 

Perhaps we do have one systemic problem: systems documentation.

The xlog code is distinct from other parts of the codebase in that it
has almost zero comments with it and the overall mechanisms are
relatively poorly documented in README form. Methinks there are very few
people who could attempt such a code review and even fewer who would
find any bugs by inspection. I'll think some more on that...

Best Regards, Simon Riggs



Re: [GENERAL] PANIC: heap_update_redo: no block

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
>> The subsequent replay of the deletion or truncation
>> will get rid of any unwanted data again.

> Trouble is, it is not a watertight assumption that there *will be* a
> subsequent truncation, even if it is a strong one.

Well, in fact we'll have correctly recreated the page, so I'm not
thinking that it's necessary or desirable to check this.  What's the
point?  "PANIC: we think your filesystem screwed up.  We don't know
exactly how or why, and we successfully rebuilt all our data, but
we're gonna refuse to start up anyway."  Doesn't seem like robust
behavior to me.  If you check the archives you'll find that we've
backed off panic-for-panic's-sake behaviors in replay several times
before, after concluding they made the system less robust rather than
more so.  This just seems like another one of the same.

> Perhaps we do have one systemic problem: systems documentation.

I agree on that ;-).  The xlog code is really poorly documented.
I'm going to try to improve the comments for at least the xlogutils
routines while I'm fixing this.
        regards, tom lane


Re: [GENERAL] PANIC: heap_update_redo: no block

From
"Jim C. Nasby"
Date:
On Tue, Mar 28, 2006 at 10:07:35AM -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
> >> The subsequent replay of the deletion or truncation
> >> will get rid of any unwanted data again.
> 
> > Trouble is, it is not a watertight assumption that there *will be* a
> > subsequent truncation, even if it is a strong one.
> 
> Well, in fact we'll have correctly recreated the page, so I'm not
> thinking that it's necessary or desirable to check this.  What's the
> point?  "PANIC: we think your filesystem screwed up.  We don't know
> exactly how or why, and we successfully rebuilt all our data, but
> we're gonna refuse to start up anyway."  Doesn't seem like robust
> behavior to me.  If you check the archives you'll find that we've
> backed off panic-for-panic's-sake behaviors in replay several times
> before, after concluding they made the system less robust rather than
> more so.  This just seems like another one of the same.

Would the suggestion made in
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01374.php help
in this regard? (Sorry, much of this is over my head, but not everyone
may have read that...)
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: [GENERAL] PANIC: heap_update_redo: no block

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Tue, Mar 28, 2006 at 10:07:35AM -0500, Tom Lane wrote:
>> Well, in fact we'll have correctly recreated the page, so I'm not
>> thinking that it's necessary or desirable to check this.

> Would the suggestion made in
> http://archives.postgresql.org/pgsql-hackers/2005-05/msg01374.php help
> in this regard?

That's exactly what we are debating: whether it's still necessary/useful
to make such a check, given that we now realize the failures are just
isolated bugs and not a systemic problem with truncated files.
        regards, tom lane


Re: [GENERAL] PANIC: heap_update_redo: no block

From
Simon Riggs
Date:
On Tue, 2006-03-28 at 10:07 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
> >> The subsequent replay of the deletion or truncation
> >> will get rid of any unwanted data again.
> 
> > Trouble is, it is not a watertight assumption that there *will be* a
> > subsequent truncation, even if it is a strong one.
> 
> Well, in fact we'll have correctly recreated the page, so I'm not
> thinking that it's necessary or desirable to check this.  What's the
> point?  

We recreated *a* page but we are shying away from exploring *why* we
needed to in the first place. If there was no later truncation then
there absolutely should have been a page there already and the fact
there wasn't one needs to be reported.

I don't want to write that code either, I just think we should.

> "PANIC: we think your filesystem screwed up.  We don't know
> exactly how or why, and we successfully rebuilt all our data, but
> we're gonna refuse to start up anyway."  Doesn't seem like robust
> behavior to me.  

Agreed, which is why I explicitly said we shouldn't do that.

grass_up_filesystem = on should be the only setting we support. You're
right that we can't know why it's wrong, but the sysadmin might.

> > Perhaps we do have one systemic problem: systems documentation.
> 
> I agree on that ;-).  The xlog code is really poorly documented.
> I'm going to try to improve the comments for at least the xlogutils
> routines while I'm fixing this.

I'll take a look also.

Best Regards, Simon Riggs



Re: [GENERAL] PANIC: heap_update_redo: no block

From
Tom Lane
Date:
I wrote:
> * log_heap_update decides that it can set XLOG_HEAP_INIT_PAGE instead
> of storing the full destination page, if the destination contains only
> the single tuple being moved.  This is fine, except it also resets the
> buffer indicator for the *source* page, which is wrong --- that page
> may still need to be re-generated from the xlog record.  This is the
> proximate cause of the bug report that started this thread.

I have to retract that particular bit of analysis: I had misread the
log_heap_update code.  It seems to be doing the right thing, and in any
case, given Alex's output

LOG:  REDO @ D/19176644; LSN D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel 1663/16386/16559898; tid 1/1; new 0/10

we can safely conclude that log_heap_update did not set the INIT_PAGE
bit, because the "new" tid doesn't have offset=1.  (The fact that the
WAL_DEBUG printout doesn't report the bit's state is an oversight I plan
to fix, but anyway we can be pretty sure it's not set here.)
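
The inference above can be checked mechanically against a debug line. The sketch below assumes only the "tid B/O; new B/O" layout visible in Alex's output; the regular expression and the function name are hypothetical helpers, and an offset of 1 is treated as a necessary condition for INIT_PAGE, not a sufficient one.

```python
import re

# Matches the "tid <blk>/<off>; new <blk>/<off>" part of a heap move/update line.
TID_RE = re.compile(r"tid (\d+)/(\d+);\s*new\s*(\d+)/(\d+)")


def init_page_possible(line: str) -> bool:
    """True only if the new tuple landed at offset 1, i.e. it could be the
    sole tuple on a freshly initialized destination page."""
    m = TID_RE.search(line)
    if m is None:
        raise ValueError("no tid pair found")
    new_offset = int(m.group(4))
    return new_offset == 1


alex_line = ("Heap - move: rel 1663/16386/16559898; "
             "tid 1/1; new 0/10")
print(init_page_possible(alex_line))
```

For the line in the report the new tuple sits at offset 10 of block 0, so the destination page already held other tuples and the INIT_PAGE shortcut could not have applied.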

What we should be seeing, and don't see, is an indication of a backup
block attached to this WAL record.  Furthermore, I don't see any
indication of a backup block attached to *any* of the WAL records in
Alex's printout.  The only conclusion I can draw is that he had
full_page_writes turned OFF, and as we have just realized that that
setting is completely unsafe, that is the explanation for his failure.

> Clearly, we need to go through the xlog code with a fine tooth comb
> and convince ourselves that all pages touched by any xlog record will
> be properly reconstituted if they've later been truncated off.  I have
> not yet examined any of the code except the above.

I've finished going through the xlog code looking for related problems,
and AFAICS this is the score:

* full_page_writes = OFF doesn't work.

* btree_xlog_split and btree_xlog_delete_page should pass TRUE not FALSE
to XLogReadBuffer for all pages that they are going to re-initialize.

* the recently-added gist xlog code is badly broken --- it pays no
attention whatever to preventing torn pages :-(.  It's not going to be
easy to fix, either, because the page split code assumes that a single
WAL record can describe changes to any number of pages, which is not
the case.

Everything else seems to be getting it right.
        regards, tom lane


Re: [GENERAL] PANIC: heap_update_redo: no block

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> What we should be seeing, and don't see, is an indication of a backup
> block attached to this WAL record.  Furthermore, I don't see any
> indication of a backup block attached to *any* of the WAL records in
> Alex's printout.  The only conclusion I can draw is that he had
> full_page_writes turned OFF, and as we have just realized that that
> setting is completely unsafe, that is the explanation for his failure.
>

This might be the answer. I tried the fill-checkpoint-vacuum-crash sequence
as you suggested, but still got a clean recovery. That's because, IMHO, even
after the checkpoint, the moved page will still be saved into WAL (since it
is new again with respect to the checkpoint) if full_page_writes is on.

Regards,
Qingqing




Re: [GENERAL] PANIC: heap_update_redo: no block

From
"Qingqing Zhou"
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote
>
> "Tom Lane" <tgl@sss.pgh.pa.us> wrote
>>
>> What we should be seeing, and don't see, is an indication of a backup
>> block attached to this WAL record.  Furthermore, I don't see any
>> indication of a backup block attached to *any* of the WAL records in
>> Alex's printout.  The only conclusion I can draw is that he had
>> full_page_writes turned OFF, and as we have just realized that that
>> setting is completely unsafe, that is the explanation for his failure.
>>
>

According to Alex, it seems the problem is not caused by full_page_writes
being OFF:

>
>> According to the discussion in pgsql-hackers, to finish this case, did you
>> turn off the full_page_writes parameter? I hope the answer is "yes" ...
>>
>
> If by off you mean full_page_writes = on then yes.


Regards,
Qingqing





Re: [GENERAL] PANIC: heap_update_redo: no block

From
Bruce Momjian
Date:
Tom Lane wrote:
> There are, however, a bunch of local bugs, including these:
> 
> * On a symlink-less platform (ie, Windows), TablespaceCreateDbspace is
> #ifdef'd to be a no-op.  This is wrong because it performs the essential
> function of re-creating a tablespace or database directory if needed
> during replay.  AFAICS the #if can just be removed and have the same
> code with or without symlinks.

FYI, Win32 in Win2k and XP has symlinks implemented using junction
points, and we use them.  It is just that pre-Win2k (NT4) does not have
them, and at this point pginstaller doesn't even support that platform.

--  Bruce Momjian   http://candle.pha.pa.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +