Thread: Re: [GENERAL] PANIC: heap_update_redo: no block
[ redirecting to a more appropriate mailing list ]

"Alex bahdushka" <bahdushka@gmail.com> writes:
> LOG: REDO @ D/19176610; LSN D/19176644: prev D/191765E8; xid 81148979: Heap - clean: rel 1663/16386/16559898; blk 0
> LOG: REDO @ D/19176644; LSN D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel 1663/16386/16559898; tid 1/1; new 0/10
> PANIC: heap_update_redo: no block: target blcknum: 1, relation(1663/16386/16559898) length: 1

I think what's happened here is that VACUUM FULL moved the only tuple
off page 1 of the relation, then truncated off page 1, and now
heap_update_redo is panicking because it can't find page 1 to replay
the move.  Curious that we've not seen a case like this before, because
it seems like a generic hazard for WAL replay.

The simplest fix would be to treat WAL records as no-ops if they refer
to nonexistent pages, but that seems much too prone to hide real
failure conditions.  Another thought is to remember that we ignored
this record, and then complain if we don't see a TRUNCATE that would've
removed the page.  That would be pretty complicated but not impossible.

Anyone have a better idea?

			regards, tom lane
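To make the failure mode concrete, here is a minimal standalone model
in C of the replay-time decision described above.  It is not PostgreSQL
source --- the Relation struct, the function name, and the message text
are simplified stand-ins --- but it shows how a WAL record that
references a truncated-away block leaves redo with nothing to update.

    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned int BlockNumber;

    typedef struct Relation
    {
        const char *name;       /* "spcNode/dbNode/relNode", for the message */
        BlockNumber nblocks;    /* current file length, in pages */
    } Relation;

    /*
     * Model of redo applying a WAL record that targets a specific block.
     * If VACUUM FULL truncated that block away before the crash, there
     * is no page left to update --- the situation the PANIC reports.
     */
    static void
    redo_touch_block(Relation *rel, BlockNumber target)
    {
        if (target >= rel->nblocks)
        {
            fprintf(stderr,
                    "PANIC: no block: target blcknum: %u, relation(%s) length: %u\n",
                    target, rel->name, rel->nblocks);
            exit(1);
        }
        /* ... otherwise apply the record to the existing page ... */
    }

    int
    main(void)
    {
        /* after the truncation, only block 0 remains ... */
        Relation rel = {"1663/16386/16559898", 1};

        /* ... but the Heap-move record still references block 1 */
        redo_touch_block(&rel, 1);
        return 0;
    }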
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I think what's happened here is that VACUUM FULL moved the only tuple
> off page 1 of the relation, then truncated off page 1, and now
> heap_update_redo is panicking because it can't find page 1 to replay
> the move.  Curious that we've not seen a case like this before,
> because it seems like a generic hazard for WAL replay.

This sounds familiar:
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> This sounds familiar
> http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

Hm, I had totally forgotten about that todo item :-(.  Time to push it
back up the priority list.

			regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I think what's happened here is that VACUUM FULL moved the only tuple
>> off page 1 of the relation, then truncated off page 1, and now
>> heap_update_redo is panicking because it can't find page 1 to replay
>> the move.  Curious that we've not seen a case like this before,
>> because it seems like a generic hazard for WAL replay.

> This sounds familiar
> http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

After further review I've concluded that there is not a systemic bug
here, but there are several nearby local bugs.

The reason it's not a systemic bug is that this scenario is supposed to
be handled by the same mechanism that prevents torn-page writes: the
first XLOG record that touches a given page after a checkpoint is
supposed to rewrite the entire page, rather than update it
incrementally.  Since XLOG replay always begins at a checkpoint, this
means we should always be able to write a fresh copy of the page, even
after relation deletion or truncation.  Furthermore, during XLOG replay
we are willing to create a table (or even a whole tablespace or
database directory) if it's not there when touched.  The subsequent
replay of the deletion or truncation will get rid of any unwanted data
again.

Therefore, there is no systemic bug --- unless you are running with
full_page_writes = off.  I assert that that GUC variable is broken and
must be removed.

There are, however, a bunch of local bugs, including these:

* On a symlink-less platform (i.e., Windows), TablespaceCreateDbspace
is #ifdef'd to be a no-op.  This is wrong because it performs the
essential function of re-creating a tablespace or database directory if
needed during replay.  AFAICS the #if can just be removed, leaving the
same code with or without symlinks.

* log_heap_update decides that it can set XLOG_HEAP_INIT_PAGE instead
of storing the full destination page, if the destination contains only
the single tuple being moved.  This is fine, except it also resets the
buffer indicator for the *source* page, which is wrong --- that page
may still need to be re-generated from the xlog record.  This is the
proximate cause of the bug report that started this thread.

* btree_xlog_split passes extend=false to XLogReadBuffer for the left
sibling, which is silly because it is going to rewrite that whole page
from the xlog record anyway.  It should pass true so that there's no
complaint if the left sib page was later truncated away.  This accounts
for one of the bug reports mentioned in the message cited above.

* btree_xlog_delete_page passes extend=false for the target page, which
is likewise silly because it's going to init the page (not that there
was any useful data on it anyway).  This accounts for the other bug
report mentioned in the message cited above.

Clearly, we need to go through the xlog code with a fine-tooth comb and
convince ourselves that all pages touched by any xlog record will be
properly reconstituted if they've later been truncated off.  I have not
yet examined any of the code except the above.

Notice that these are each, individually, pretty low-probability
scenarios, which is why we've not seen many bug reports.  If we had a
systemic bug I'm sure we'd be seeing far more.

			regards, tom lane
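A sketch of the torn-page defence being relied on here may help.  This
is a simplified stand-in, not the real xlog code: it assumes only that
each page carries the LSN of the last WAL record that touched it, and
that the redo pointer of the most recent checkpoint is known.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* position in the WAL stream */

    /* Stand-in for the page-header field holding the page's last LSN. */
    typedef struct PageHeaderData
    {
        XLogRecPtr pd_lsn;
    } PageHeaderData;

    /* Redo pointer of the most recent checkpoint. */
    static XLogRecPtr RedoRecPtr;

    /*
     * First-touch-after-checkpoint rule: if the page has not been
     * modified since the last checkpoint, attach a full copy of it (a
     * "backup block") to the WAL record rather than an incremental
     * diff.  Replay starts from the checkpoint, so every page it must
     * touch either is sane on disk or arrives whole inside some WAL
     * record --- which is why later deletion or truncation of the file
     * is survivable, and why full_page_writes = off breaks the scheme.
     */
    static bool
    needs_backup_block(const PageHeaderData *page, bool full_page_writes)
    {
        if (!full_page_writes)
            return false;       /* nothing protects this page now */
        return page->pd_lsn <= RedoRecPtr;
    }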
On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> I think what's happened here is that VACUUM FULL moved the only
> >> tuple off page 1 of the relation, then truncated off page 1, and
> >> now heap_update_redo is panicking because it can't find page 1 to
> >> replay the move.  Curious that we've not seen a case like this
> >> before, because it seems like a generic hazard for WAL replay.
>
> > This sounds familiar
> > http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

Yes, I remember that also.

> After further review I've concluded that there is not a systemic bug
> here, but there are several nearby local bugs.

IMHO it's amazing to find so many bugs in a code review of existing
production code.  Cool.

> The reason it's not a systemic bug is that this scenario is supposed
> to be handled by the same mechanism that prevents torn-page writes:
> the first XLOG record that touches a given page after a checkpoint is
> supposed to rewrite the entire page, rather than update it
> incrementally.  Since XLOG replay always begins at a checkpoint, this
> means we should always be able to write a fresh copy of the page,
> even after relation deletion or truncation.  Furthermore, during XLOG
> replay we are willing to create a table (or even a whole tablespace
> or database directory) if it's not there when touched.  The
> subsequent replay of the deletion or truncation will get rid of any
> unwanted data again.

That will all work, agreed.

> The subsequent replay of the deletion or truncation
> will get rid of any unwanted data again.

Trouble is, it is not a watertight assumption that there *will be* a
subsequent truncation, even if it is a strong one.  If there is not a
later truncation, we will just ignore what we now ought to know is an
error and then try to continue as if the database were fine, which it
would not be.

The overall problem is that auto extension fails to take action or
provide notification with regard to file system corruptions.  Clearly
we would like xlog replay to work even in the face of severe file
corruptions, but we should make attempts to identify this situation and
notify people that it has occurred.

I'd suggest both WARNING messages in the log and something more extreme
still: anyone touching a corrupt table should receive a NOTICE saying
"database recovery displayed errors for this table", with "HINT: check
the database logfiles for specific messages".  Indexes should get a log
WARNING saying "database recovery displayed errors for this index",
with "HINT: use REINDEX to rebuild this index".

So I guess I had better help, if we agree this is beneficial.

> Therefore, there is no systemic bug --- unless you are running with
> full_page_writes=off.  I assert that that GUC variable is broken and
> must be removed.

On this analysis, I would agree for current production systems.  But
what this says is something deeper: we must log full pages not because
we fear a partial page write has occurred, but because the xlog
mechanism intrinsically depends upon the existence of those full pages
after each checkpoint.

The writing of full pages in this way is a serious performance issue
that it would be good to improve upon.  Perhaps this is the spur to
discuss a new xlog format that would support higher-performance logging
as well as log-mining for replication?

> There are, however, a bunch of local bugs, including these:
...
> Notice that these are each, individually, pretty low-probability
> scenarios, which is why we've not seen many bug reports.

Most people don't file bug reports.  If we have a recovery mode that
ignores file system corruptions we'll get even fewer, because any
errors that do occur will be blamed on gamma rays or some other excuse.

> a systemic bug

Perhaps we do have one systemic problem: systems documentation.  The
xlog code is distinct from other parts of the codebase in that it has
almost zero comments, and the overall mechanisms are relatively poorly
documented in README form.  Methinks there are very few people who
could attempt such a code review, and even fewer who would find any
bugs by inspection.

I'll think some more on that...

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
>> The subsequent replay of the deletion or truncation
>> will get rid of any unwanted data again.

> Trouble is, it is not a watertight assumption that there *will be* a
> subsequent truncation, even if it is a strong one.

Well, in fact we'll have correctly recreated the page, so I'm not
thinking that it's necessary or desirable to check this.  What's the
point?  "PANIC: we think your filesystem screwed up.  We don't know
exactly how or why, and we successfully rebuilt all our data, but we're
gonna refuse to start up anyway."  Doesn't seem like robust behavior to
me.  If you check the archives you'll find that we've backed off
panic-for-panic's-sake behaviors in replay several times before, after
concluding they made the system less robust rather than more so.  This
just seems like another one of the same.

> Perhaps we do have one systemic problem: systems documentation.

I agree on that ;-).  The xlog code is really poorly documented.
I'm going to try to improve the comments for at least the xlogutils
routines while I'm fixing this.

			regards, tom lane
On Tue, Mar 28, 2006 at 10:07:35AM -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
> >> The subsequent replay of the deletion or truncation
> >> will get rid of any unwanted data again.
>
> > Trouble is, it is not a watertight assumption that there *will be* a
> > subsequent truncation, even if it is a strong one.
>
> Well, in fact we'll have correctly recreated the page, so I'm not
> thinking that it's necessary or desirable to check this.  What's the
> point?  "PANIC: we think your filesystem screwed up.  We don't know
> exactly how or why, and we successfully rebuilt all our data, but
> we're gonna refuse to start up anyway."  Doesn't seem like robust
> behavior to me.  If you check the archives you'll find that we've
> backed off panic-for-panic's-sake behaviors in replay several times
> before, after concluding they made the system less robust rather than
> more so.  This just seems like another one of the same.

Would the suggestion made in
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01374.php
help in this regard?  (Sorry, much of this is over my head, but not
everyone may have read that...)

-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Tue, Mar 28, 2006 at 10:07:35AM -0500, Tom Lane wrote: >> Well, in fact we'll have correctly recreated the page, so I'm not >> thinking that it's necessary or desirable to check this. > Would the suggestion made in > http://archives.postgresql.org/pgsql-hackers/2005-05/msg01374.php help > in this regard? That's exactly what we are debating: whether it's still necessary/useful to make such a check, given that we now realize the failures are just isolated bugs and not a systemic problem with truncated files. regards, tom lane
On Tue, 2006-03-28 at 10:07 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
> >> The subsequent replay of the deletion or truncation
> >> will get rid of any unwanted data again.
>
> > Trouble is, it is not a watertight assumption that there *will be* a
> > subsequent truncation, even if it is a strong one.
>
> Well, in fact we'll have correctly recreated the page, so I'm not
> thinking that it's necessary or desirable to check this.  What's the
> point?

We recreated *a* page, but we are shying away from exploring *why* we
needed to in the first place.  If there was no later truncation then
there absolutely should have been a page there already, and the fact
that there wasn't one needs to be reported.  I don't want to write that
code either; I just think we should.

> "PANIC: we think your filesystem screwed up.  We don't know
> exactly how or why, and we successfully rebuilt all our data, but
> we're gonna refuse to start up anyway."  Doesn't seem like robust
> behavior to me.

Agreed, which is why I explicitly said we shouldn't do that.
grass_up_filesystem = on should be the only setting we support.  You're
right that we can't know why it's wrong, but the sysadmin might.

> > Perhaps we do have one systemic problem: systems documentation.
>
> I agree on that ;-).  The xlog code is really poorly documented.
> I'm going to try to improve the comments for at least the xlogutils
> routines while I'm fixing this.

I'll take a look also.

Best Regards, Simon Riggs
I wrote:
> * log_heap_update decides that it can set XLOG_HEAP_INIT_PAGE instead
> of storing the full destination page, if the destination contains only
> the single tuple being moved.  This is fine, except it also resets the
> buffer indicator for the *source* page, which is wrong --- that page
> may still need to be re-generated from the xlog record.  This is the
> proximate cause of the bug report that started this thread.

I have to retract that particular bit of analysis: I had misread the
log_heap_update code.  It seems to be doing the right thing, and in any
case, given Alex's output

LOG: REDO @ D/19176644; LSN D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel 1663/16386/16559898; tid 1/1; new 0/10

we can safely conclude that log_heap_update did not set the INIT_PAGE
bit, because the "new" tid doesn't have offset=1.  (The fact that the
WAL_DEBUG printout doesn't report the bit's state is an oversight I
plan to fix, but anyway we can be pretty sure it's not set here.)

What we should be seeing, and don't see, is an indication of a backup
block attached to this WAL record.  Furthermore, I don't see any
indication of a backup block attached to *any* of the WAL records in
Alex's printout.  The only conclusion I can draw is that he had
full_page_writes turned OFF, and as we have just realized that that
setting is completely unsafe, that is the explanation for his failure.

> Clearly, we need to go through the xlog code with a fine-tooth comb
> and convince ourselves that all pages touched by any xlog record will
> be properly reconstituted if they've later been truncated off.  I have
> not yet examined any of the code except the above.

I've finished going through the xlog code looking for related problems,
and AFAICS this is the score:

* full_page_writes = OFF doesn't work.

* btree_xlog_split and btree_xlog_delete_page should pass TRUE not
FALSE to XLogReadBuffer for all pages that they are going to
re-initialize.

* the recently-added GiST xlog code is badly broken --- it pays no
attention whatever to preventing torn pages :-(.  It's not going to be
easy to fix, either, because the page-split code assumes that a single
WAL record can describe changes to any number of pages, which is not
the case.

Everything else seems to be getting it right.

			regards, tom lane
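For reference, a simplified model of the extend flag's semantics as
described in this thread --- the name, signature, and data structures
below are illustrative stand-ins, not the real XLogReadBuffer
declaration:

    #include <stdbool.h>
    #include <stdlib.h>

    #define BLCKSZ 8192

    typedef unsigned int BlockNumber;

    typedef struct Relation
    {
        BlockNumber nblocks;
        char      **pages;      /* one BLCKSZ image per block */
    } Relation;

    /*
     * With extend=true, a reference past end-of-file grows the file
     * with zeroed pages and hands back the requested one for the
     * caller to re-init; with extend=false the missing page is a
     * failure.  A redo routine that rebuilds the whole page from the
     * WAL record should therefore pass true, so that a page truncated
     * away after the record was written is simply recreated.
     */
    static char *
    xlog_read_buffer(Relation *rel, BlockNumber blkno, bool extend)
    {
        while (blkno >= rel->nblocks)
        {
            if (!extend)
                return NULL;    /* caller PANICs, as in this thread */
            rel->pages = realloc(rel->pages,
                                 (rel->nblocks + 1) * sizeof(char *));
            rel->pages[rel->nblocks] = calloc(1, BLCKSZ);
            rel->nblocks++;
        }
        return rel->pages[blkno];
    }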
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > What we should be seeing, and don't see, is an indication of a backup > block attached to this WAL record. Furthermore, I don't see any > indication of a backup block attached to *any* of the WAL records in > Alex's printout. The only conclusion I can draw is that he had > full_page_writes turned OFF, and as we have just realized that that > setting is completely unsafe, that is the explanation for his failure. > This might be the answer. I tried the fill-checkpoint-vacuum-crash sequence as you suggested, but still a neat recovery. That's because, IMHO, even after checkpoint, the moved page will still be saved into WAL (since it is new again to the checkpoint) if full_page_writes is on. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote > > "Tom Lane" <tgl@sss.pgh.pa.us> wrote >> >> What we should be seeing, and don't see, is an indication of a backup >> block attached to this WAL record. Furthermore, I don't see any >> indication of a backup block attached to *any* of the WAL records in >> Alex's printout. The only conclusion I can draw is that he had >> full_page_writes turned OFF, and as we have just realized that that >> setting is completely unsafe, that is the explanation for his failure. >> > According to Alex, seems the problem is not because of full_page_writes OFF > >> According to the discussion in pgsql-hackers, to finish this case, did >> you >> turn off the full_page_writes parameter? I hope the answer is "yes" ... >> > > If by off you mean full_page_writes = on then yes. Regards, Qingqing
Tom Lane wrote:
> There are, however, a bunch of local bugs, including these:
>
> * On a symlink-less platform (i.e., Windows), TablespaceCreateDbspace
> is #ifdef'd to be a no-op.  This is wrong because it performs the
> essential function of re-creating a tablespace or database directory
> if needed during replay.  AFAICS the #if can just be removed, leaving
> the same code with or without symlinks.

FYI, Win32 on Win2k and XP has symlinks implemented using junction
points, and we use them.  It is just that pre-Win2k (NT4) does not have
them, and at this point pginstaller doesn't even support that platform.

-- 
Bruce Momjian   http://candle.pha.pa.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +