Thread: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673
Hi All,

I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box. Does anyone know what this error means and how to recover from it? Any help will be very much appreciated.

Thanks,
Val

P.S. Here is the complete output I get:

Jan 5 10:36:29 db2 postgres[17111]: [2-1] LOG: database system was interrupted while in recovery at 2009-01-05 10:24:37 GMT
Jan 5 10:36:29 db2 postgres[17111]: [2-2] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
Jan 5 10:36:29 db2 postgres[17111]: [3-1] LOG: checkpoint record is at 122/D080660
Jan 5 10:36:29 db2 postgres[17111]: [4-1] LOG: redo record is at 122/D00060C; undo record is at 0/0; shutdown FALSE
Jan 5 10:36:29 db2 postgres[17111]: [5-1] LOG: next transaction ID: 2664007622; next OID: 521067
Jan 5 10:36:29 db2 postgres[17111]: [6-1] LOG: next MultiXactId: 1; next MultiXactOffset: 0
Jan 5 10:36:29 db2 postgres[17111]: [7-1] LOG: database system was not properly shut down; automatic recovery in progress
Jan 5 10:36:29 db2 postgres[17111]: [8-1] LOG: redo starts at 122/D00060C
Jan 5 10:36:29 db2 postgres[17112]: [2-1] LOG: incomplete startup packet
Jan 5 10:36:29 db2 postgres[17111]: [9-1] LOG: record with zero length at 122/E914B48
Jan 5 10:36:29 db2 postgres[17111]: [10-1] LOG: redo done at 122/E914B20
Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find parent key in "100924" for split pages 1606/1673
Jan 5 10:36:29 db2 postgres[17110]: [2-1] LOG: startup process (PID 17111) was terminated by signal 6
Jan 5 10:36:29 db2 postgres[17110]: [3-1] LOG: aborting startup due to startup process failure
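For anyone triaging a similar log, here is a minimal sketch in Python that pulls the PANIC details out of syslog-framed lines like those above and estimates how much WAL was replayed before the failure. The function names are made up for illustration. Two assumptions: the quoted relation name ("100924") is probably a relfilenode rather than a catalog name, since recovery runs without catalog access, and the byte distance is only computed when both WAL positions share the same xlogid (the part before the slash), because crossing xlogid boundaries in releases of this era needs extra care.

```python
import re

# Syslog-framed PostgreSQL line, as in the output quoted above:
#   "<date> <host> postgres[<pid>]: [<n>-<m>] <SEVERITY>: <message>"
LINE_RE = re.compile(
    r"postgres\[(?P<pid>\d+)\]: \[\d+-\d+\] (?P<severity>[A-Z]+):\s*(?P<message>.*)"
)

# The specific PANIC text from this report.
SPLIT_PANIC_RE = re.compile(
    r'failed to re-find parent key in "(?P<relation>[^"]+)" '
    r"for split pages (?P<left>\d+)/(?P<right>\d+)"
)


def find_split_panic(lines):
    """Return details of the first 're-find parent key' PANIC, or None."""
    for line in lines:
        framed = LINE_RE.search(line)
        if framed is None or framed.group("severity") != "PANIC":
            continue
        panic = SPLIT_PANIC_RE.search(framed.group("message"))
        if panic is not None:
            return {
                "pid": int(framed.group("pid")),
                "relation": panic.group("relation"),
                "left_page": int(panic.group("left")),
                "right_page": int(panic.group("right")),
            }
    return None


def wal_bytes_between(start, end):
    """Byte distance between two WAL positions written as 'xlogid/offset' (hex).

    Deliberately limited to positions within the same xlogid.
    """
    start_id, start_off = (int(part, 16) for part in start.split("/"))
    end_id, end_off = (int(part, 16) for part in end.split("/"))
    if start_id != end_id:
        raise ValueError("positions are in different xlogids")
    return end_off - start_off


sample = [
    "Jan 5 10:36:29 db2 postgres[17111]: [8-1] LOG: redo starts at 122/D00060C",
    "Jan 5 10:36:29 db2 postgres[17111]: [10-1] LOG: redo done at 122/E914B20",
    'Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find '
    'parent key in "100924" for split pages 1606/1673',
]

print(find_split_panic(sample))
# Distance from "redo starts" to the start of the last replayed record.
print(wal_bytes_between("122/D00060C", "122/E914B20"))  # 26297620 bytes, about 25 MiB
```

Note that the PANIC is reported after "redo done", which suggests the failure happens in end-of-recovery cleanup rather than while applying an individual record.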
val <valiouk@yahoo.co.uk> writes:
> I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box.

> Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Hmm ... I wonder if this is telling us that our patch here was incomplete?
http://archives.postgresql.org/pgsql-committers/2006-11/msg00004.php

At the time we thought this failure could only occur during _bt_pagedel but you have evidently got a case where a split is failing. It might just be garden-variety index corruption, or it might be a real bug.

Is this database sufficiently small and non-proprietary that you could send me a filesystem copy of it (a tarball of all of $PGDATA including the WAL files)?

regards, tom lane
> > I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box.
>
> > Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find parent key in "100924" for split pages 1606/1673
>
> Is this database sufficiently small and non-proprietary that you could send me a filesystem copy of it (a tarball of all of $PGDATA including the WAL files)?

I solved my problem by resetting the next transaction ID with the pg_resetxlog utility.

Sorry, I cannot send you the database since it is proprietary and also quite big, but if there is anything else I can do, just let me know.

thanks,
val
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673
From: Simon Riggs
On Mon, 2009-01-05 at 14:25 -0500, Tom Lane wrote:
> val <valiouk@yahoo.co.uk> writes:
> > I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box.
> >
> > Jan 5 10:36:29 db2 postgres[17111]: [11-1] PANIC: failed to re-find parent key in "100924" for split pages 1606/1673
>
> Hmm ... I wonder if this is telling us that our patch here was incomplete?
> http://archives.postgresql.org/pgsql-committers/2006-11/msg00004.php
>
> At the time we thought this failure could only occur during _bt_pagedel but you have evidently got a case where a split is failing. It might just be garden-variety index corruption, or it might be a real bug.

Did you catch this had occurred during recovery?

Can we downgrade the error from PANIC to LOG please? One corrupt index shouldn't prevent us from restarting the whole server. Plus, if we have to use pg_resetxlog to get us out of trouble it isn't going to help much with diagnosis. We can rebuild indexes once the server is up.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2009-01-05 at 14:25 -0500, Tom Lane wrote:
>> Hmm ... I wonder if this is telling us that our patch here was incomplete?
>> http://archives.postgresql.org/pgsql-committers/2006-11/msg00004.php
>>
>> At the time we thought this failure could only occur during _bt_pagedel but you have evidently got a case where a split is failing. It might just be garden-variety index corruption, or it might be a real bug.

> Did you catch this had occurred during recovery?

Yes, I did. Which is one of the reasons I think there might be a real bug there, but without any evidence to look at it's hard to do much about it now. (Also, our solution to the underlying problem is quite different now than it was in 8.1, so I'm doubtful that the bug still exists in current code even if it's real in 8.1.)

> Can we downgrade the error from PANIC to LOG please?

No, that seems utterly unsafe to me. We'd have a corrupt index and nothing to cause it to get repaired.

regards, tom lane
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673
From: Simon Riggs
On Thu, 2009-01-08 at 13:40 -0500, Tom Lane wrote:
> > Can we downgrade the error from PANIC to LOG please?
>
> No, that seems utterly unsafe to me. We'd have a corrupt index and nothing to cause it to get repaired.

We do exactly this with GIN and GIST indexes currently.

I'd rather have a database that works, but has a corrupt index than one that won't come up at all.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2009-01-08 at 13:40 -0500, Tom Lane wrote:
>> No, that seems utterly unsafe to me. We'd have a corrupt index and nothing to cause it to get repaired.

> We do exactly this with GIN and GIST indexes currently.

Which are not used in any system indexes.

> I'd rather have a database that works, but has a corrupt index than one that won't come up at all.

If the btree in question is a critical system index, your value of "work" is going to be pretty damn small.

regards, tom lane
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673
From: Simon Riggs
On Thu, 2009-01-08 at 14:19 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Thu, 2009-01-08 at 13:40 -0500, Tom Lane wrote:
> >> No, that seems utterly unsafe to me. We'd have a corrupt index and nothing to cause it to get repaired.
>
> > We do exactly this with GIN and GIST indexes currently.
>
> Which are not used in any system indexes.
>
> > I'd rather have a database that works, but has a corrupt index than one that won't come up at all.
>
> If the btree in question is a critical system index, your value of "work" is going to be pretty damn small.

Those are good points.

So if it's a system index we can throw a PANIC, else just LOG. Whilst a corrupt index is annoying in the extreme, a total server outage is not something we should allow. IMHO.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2009-01-08 at 14:19 -0500, Tom Lane wrote:
>> If the btree in question is a critical system index, your value of "work" is going to be pretty damn small.

> So if it's a system index we can throw a PANIC, else just LOG. Whilst a corrupt index is annoying in the extreme, a total server outage is not something we should allow. IMHO.

I think an appropriate solution would be to institute some mechanism that forces a reindex of the corrupted index at completion of recovery. Merely fooling around with message severity levels doesn't fix anything at all, it just opens the door to more trouble than you've already got.

Whether this is important enough to get done in the near future is a different discussion...

regards, tom lane
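To make the "force a reindex at completion of recovery" idea concrete, here is a toy model in Python. It is not PostgreSQL code and not how any release behaves; `ToyRecovery`, `IndexCorruption`, and the callback signatures are all invented for illustration. The sketch assumes a redo routine can signal that one index is unusable. Replay then continues, later records for that index are skipped, and every flagged index is rebuilt before recovery is declared finished.

```python
class IndexCorruption(Exception):
    """Raised by a (hypothetical) redo routine that cannot apply a record."""

    def __init__(self, index_id, detail):
        super().__init__(detail)
        self.index_id = index_id


class ToyRecovery:
    """Toy model: quarantine broken indexes during replay, rebuild at the end."""

    def __init__(self):
        self.needs_reindex = set()
        self.messages = []

    def replay(self, records, apply_record):
        """Replay (index_id, payload) records, quarantining failing indexes."""
        for index_id, payload in records:
            if index_id in self.needs_reindex:
                continue  # already known bad; it will be rebuilt from scratch
            try:
                apply_record(index_id, payload)
            except IndexCorruption as exc:
                self.needs_reindex.add(exc.index_id)
                self.messages.append(
                    f"WARNING: index {exc.index_id} marked for reindex: {exc}"
                )

    def finish(self, reindex):
        """Rebuild every quarantined index before the server would open."""
        rebuilt = []
        for index_id in sorted(self.needs_reindex):
            reindex(index_id)
            rebuilt.append(index_id)
        self.needs_reindex.clear()
        return rebuilt


def apply_record(index_id, payload):
    # Stand-in redo routine: one payload simulates the failure in this thread.
    if payload == "bad split":
        raise IndexCorruption(index_id, "failed to re-find parent key")


recovery = ToyRecovery()
recovery.replay(
    [(100924, "insert"), (100924, "bad split"), (100924, "insert"), (200000, "insert")],
    apply_record,
)
rebuilt = recovery.finish(reindex=lambda index_id: None)
print(rebuilt)  # [100924]
```

The model leaves out the hard part raised earlier in the thread: if the damaged btree is a system catalog index, the rebuild step itself may depend on the index it is trying to repair.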
Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673
From: Simon Riggs
On Thu, 2009-01-08 at 15:04 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Thu, 2009-01-08 at 14:19 -0500, Tom Lane wrote:
> >> If the btree in question is a critical system index, your value of "work" is going to be pretty damn small.
>
> > So if it's a system index we can throw a PANIC, else just LOG. Whilst a corrupt index is annoying in the extreme, a total server outage is not something we should allow. IMHO.
>
> I think an appropriate solution would be to institute some mechanism that forces a reindex of the corrupted index at completion of recovery. Merely fooling around with message severity levels doesn't fix anything at all, it just opens the door to more trouble than you've already got.

Well you know I agree on the longer term solution.

But with a down server, you just force people to do pg_resetxlog, which loses both the corruption (probably) and real, useful data (likely) and *then* they bring up the server. I don't see why we should force people to take a manual action and lose data to bring up the server. It's not like they'll just look at it and say how much of a shame it is it won't start. They will be bringing up the server, somehow, or they get the sack. IMHO.

I'll say no more though; it's not an argument.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> But with a down server, you just force people to do pg_resetxlog, which loses both the corruption (probably) and real, useful data (likely) and *then* they bring up the server. I don't see why we should force people to take a manual action and lose data to bring up the server.

That's all fine, but simply reducing the message level from PANIC to LOG remains an utterly unacceptable "solution". What will happen is that the server will start, the DBA will go back to sleep after ignoring (most likely, never even reading) the log message, and the corruption will get worse. The potential consequences of corruption in a pg_class index, for example, are just horrid.

Frankly I'd rather "rm -rf $PGDATA" and force someone to go back to their last backup than let them continue to run with a database that is known to be broken and the system didn't do anything more to warn them than emit a LOG message someplace. (No, I'm not seriously proposing that as a recovery technique. But it's no more irresponsible than ignoring a corruption condition.)

regards, tom lane