Thread: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

Hi All,

I have a database that refuses to start due to the aforementioned error. I
am running PostgreSQL 8.1.11 on a Debian Etch box.

Does anyone know what this error means and how to recover from it?

Any help will be very much appreciated.

Thanks,
Val

P.S. Here is the complete output I get

Jan  5 10:36:29 db2 postgres[17111]: [2-1] LOG:  database system was interrupted while in recovery at 2009-01-05 10:24:37 GMT
Jan  5 10:36:29 db2 postgres[17111]: [2-2] HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
Jan  5 10:36:29 db2 postgres[17111]: [3-1] LOG:  checkpoint record is at 122/D080660
Jan  5 10:36:29 db2 postgres[17111]: [4-1] LOG:  redo record is at 122/D00060C; undo record is at 0/0; shutdown FALSE
Jan  5 10:36:29 db2 postgres[17111]: [5-1] LOG:  next transaction ID: 2664007622; next OID: 521067
Jan  5 10:36:29 db2 postgres[17111]: [6-1] LOG:  next MultiXactId: 1; next MultiXactOffset: 0
Jan  5 10:36:29 db2 postgres[17111]: [7-1] LOG:  database system was not properly shut down; automatic recovery in progress
Jan  5 10:36:29 db2 postgres[17111]: [8-1] LOG:  redo starts at 122/D00060C
Jan  5 10:36:29 db2 postgres[17112]: [2-1] LOG:  incomplete startup packet
Jan  5 10:36:29 db2 postgres[17111]: [9-1] LOG:  record with zero length at 122/E914B48
Jan  5 10:36:29 db2 postgres[17111]: [10-1] LOG:  redo done at 122/E914B20
Jan  5 10:36:29 db2 postgres[17111]: [11-1] PANIC:  failed to re-find parent key in "100924" for split pages 1606/1673
Jan  5 10:36:29 db2 postgres[17110]: [2-1] LOG:  startup process (PID 17111) was terminated by signal 6
Jan  5 10:36:29 db2 postgres[17110]: [3-1] LOG:  aborting startup due to startup process failure
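
For anyone decoding the log above, here is a minimal Python sketch (not part of the original thread) that pulls the fields out of the PANIC line and measures how far redo got before stopping. The regular expression and the function names are illustrative assumptions. Reading "100924" as the index's relfilenode rather than its name is also an assumption, based on relation names generally not being available during WAL replay. The subtraction is only attempted when both WAL locations share the same log file ID (the part before the slash).

```python
import re

# Matches the message text from the log; the pattern is an assumption
# derived from this one sample line.
PANIC_RE = re.compile(
    r'failed to re-find parent key in "(?P<index>[^"]+)" '
    r'for split pages (?P<first>\d+)/(?P<second>\d+)'
)

def parse_panic(line):
    """Return (index_identifier, first_page, second_page), or None."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return m.group("index"), int(m.group("first")), int(m.group("second"))

def parse_lsn(text):
    """Split an 'X/Y' WAL location into two integers (both fields are hex)."""
    hi, lo = text.split("/")
    return int(hi, 16), int(lo, 16)

def bytes_replayed(start, end):
    """Bytes between two WAL locations that share the same log file ID."""
    start_hi, start_lo = parse_lsn(start)
    end_hi, end_lo = parse_lsn(end)
    if start_hi != end_hi:
        raise ValueError("locations are in different log files")
    return end_lo - start_lo

line = ('Jan  5 10:36:29 db2 postgres[17111]: [11-1] PANIC:  failed to re-find '
        'parent key in "100924" for split pages 1606/1673')
print(parse_panic(line))
print(bytes_replayed("122/D00060C", "122/E914B20"))
```

The second call uses the "redo starts at" and "redo done at" locations from the log, so it shows how much WAL was replayed before the PANIC.
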
val <valiouk@yahoo.co.uk> writes:
> I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box.

> Jan  5 10:36:29 db2 postgres[17111]: [11-1] PANIC:  failed to re-find parent key in "100924" for split pages 1606/1673

Hmm ... I wonder if this is telling us that our patch here was
incomplete?
http://archives.postgresql.org/pgsql-committers/2006-11/msg00004.php

At the time we thought this failure could only occur during _bt_pagedel
but you have evidently got a case where a split is failing.  It might
just be garden-variety index corruption, or it might be a real bug.

Is this database sufficiently small and non-proprietary that you could
send me a filesystem copy of it (a tarball of all of $PGDATA including
the WAL files)?

            regards, tom lane
> > I have a database that refuses to start due to the aforementioned error.
> > I am running PostgreSQL 8.1.11 on a Debian Etch box.
>
> > Jan  5 10:36:29 db2 postgres[17111]: [11-1] PANIC:  failed to re-find parent key in "100924" for split pages 1606/1673
>
> Is this database sufficiently small and non-proprietary that you could
> send me a filesystem copy of it (a tarball of all of $PGDATA including
> the WAL files)?

I solved my problem by resetting the next transaction ID with the pg_resetxlog
utility.

Sorry, I cannot send you the database since it is proprietary and is also
quite big, but if there is anything else I can do just let me know.

thanks,
val

Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

From: Simon Riggs
On Mon, 2009-01-05 at 14:25 -0500, Tom Lane wrote:
> val <valiouk@yahoo.co.uk> writes:
> > I have a database that refuses to start due to the aforementioned error. I am running PostgreSQL 8.1.11 on a Debian Etch box.
>
> > Jan  5 10:36:29 db2 postgres[17111]: [11-1] PANIC:  failed to re-find parent key in "100924" for split pages 1606/1673
>
> Hmm ... I wonder if this is telling us that our patch here was
> incomplete?
> http://archives.postgresql.org/pgsql-committers/2006-11/msg00004.php
>
> At the time we thought this failure could only occur during _bt_pagedel
> but you have evidently got a case where a split is failing.  It might
> just be garden-variety index corruption, or it might be a real bug.

Did you catch this had occurred during recovery?

Can we downgrade the error from PANIC to LOG please? One corrupt index
shouldn't prevent us from restarting the whole server. Plus, if we have
to use pg_resetxlog to get us out of trouble it isn't going to help much
with diagnosis. We can rebuild indexes once server is up.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2009-01-05 at 14:25 -0500, Tom Lane wrote:
>> Hmm ... I wonder if this is telling us that our patch here was
>> incomplete?
>> http://archives.postgresql.org/pgsql-committers/2006-11/msg00004.php
>>
>> At the time we thought this failure could only occur during _bt_pagedel
>> but you have evidently got a case where a split is failing.  It might
>> just be garden-variety index corruption, or it might be a real bug.

> Did you catch this had occurred during recovery?

Yes, I did.  Which is one of the reasons I think there might be a real
bug there, but without any evidence to look at it's hard to do much
about it now.  (Also, our solution to the underlying problem is quite
different now than it was in 8.1, so I'm doubtful that the bug still
exists in current code even if it's real in 8.1.)

> Can we downgrade the error from PANIC to LOG please?

No, that seems utterly unsafe to me.  We'd have a corrupt index and
nothing to cause it to get repaired.

            regards, tom lane

Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

From: Simon Riggs
On Thu, 2009-01-08 at 13:40 -0500, Tom Lane wrote:
>
> > Can we downgrade the error from PANIC to LOG please?
>
> No, that seems utterly unsafe to me.  We'd have a corrupt index and
> nothing to cause it to get repaired.

We do exactly this with GIN and GIST indexes currently.

I'd rather have a database that works, but has a corrupt index than one
that won't come up at all.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2009-01-08 at 13:40 -0500, Tom Lane wrote:
>> No, that seems utterly unsafe to me.  We'd have a corrupt index and
>> nothing to cause it to get repaired.

> We do exactly this with GIN and GIST indexes currently.

Which are not used in any system indexes.

> I'd rather have a database that works, but has a corrupt index than one
> that won't come up at all.

If the btree in question is a critical system index, your value of
"work" is going to be pretty damn small.

            regards, tom lane

Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

From: Simon Riggs
On Thu, 2009-01-08 at 14:19 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Thu, 2009-01-08 at 13:40 -0500, Tom Lane wrote:
> >> No, that seems utterly unsafe to me.  We'd have a corrupt index and
> >> nothing to cause it to get repaired.
>
> > We do exactly this with GIN and GIST indexes currently.
>
> Which are not used in any system indexes.
>
> > I'd rather have a database that works, but has a corrupt index than one
> > that won't come up at all.
>
> If the btree in question is a critical system index, your value of
> "work" is going to be pretty damn small.

Those are good points.

So if it's a system index we can throw a PANIC, else just LOG. Whilst a
corrupt index is annoying in the extreme, a total server outage is not
something we should allow. IMHO.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2009-01-08 at 14:19 -0500, Tom Lane wrote:
>> If the btree in question is a critical system index, your value of
>> "work" is going to be pretty damn small.

> So if it's a system index we can throw a PANIC, else just LOG. Whilst a
> corrupt index is annoying in the extreme, a total server outage is not
> something we should allow. IMHO.

I think an appropriate solution would be to institute some mechanism
that forces a reindex of the corrupted index at completion of recovery.
Merely fooling around with message severity levels doesn't fix anything
at all, it just opens the door to more trouble than you've already got.

Whether this is important enough to get done in the near future is
a different discussion...

            regards, tom lane
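
The mechanism proposed above can be pictured with a small simulation. This is a hypothetical Python sketch, not PostgreSQL code: `RecoveryState`, `note_corrupt_index`, and `finish_recovery` are invented names, and the `reindex` callback stands in for whatever would actually rebuild an index. It only illustrates the idea of remembering a damaged index during replay and repairing it when recovery completes, instead of either aborting or merely logging.

```python
class RecoveryState:
    """Tracks indexes found to be damaged while replaying WAL (hypothetical)."""

    def __init__(self):
        self.pending_reindex = []   # index identifiers, in discovery order

    def note_corrupt_index(self, index_id, reason):
        # Instead of PANIC (abort) or LOG alone (ignore), remember the index.
        if index_id not in self.pending_reindex:
            self.pending_reindex.append(index_id)
        print(f"LOG: index {index_id} marked for reindex: {reason}")

    def finish_recovery(self, reindex):
        """Rebuild every remembered index; return the ones that were rebuilt."""
        rebuilt = []
        for index_id in self.pending_reindex:
            reindex(index_id)       # caller-supplied rebuild routine
            rebuilt.append(index_id)
        self.pending_reindex = []
        return rebuilt


state = RecoveryState()
state.note_corrupt_index("100924", "failed to re-find parent key for split pages 1606/1673")
state.note_corrupt_index("100924", "seen again later in replay")
done = state.finish_recovery(lambda index_id: print(f"REINDEX {index_id}"))
print(done)
```

Because the rebuild is forced at the end of recovery, the outcome does not depend on anyone reading the log message.
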

Re: PANIC: failed to re-find parent key in "100924" for split pages 1606/1673

From: Simon Riggs
On Thu, 2009-01-08 at 15:04 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Thu, 2009-01-08 at 14:19 -0500, Tom Lane wrote:
> >> If the btree in question is a critical system index, your value of
> >> "work" is going to be pretty damn small.
>
> > So if it's a system index we can throw a PANIC, else just LOG. Whilst a
> > corrupt index is annoying in the extreme, a total server outage is not
> > something we should allow. IMHO.
>
> I think an appropriate solution would be to institute some mechanism
> that forces a reindex of the corrupted index at completion of recovery.
> Merely fooling around with message severity levels doesn't fix anything
> at all, it just opens the door to more trouble than you've already got.

Well you know I agree on the longer term solution.

But with a down server, you just force people to do pg_resetxlog, which
loses both the corruption (probably) and real, useful data (likely) and
*then* they bring up the server. I don't see why we should force people
to take a manual action and lose data to bring up the server. It's not
like they'll just look at it and say how much of a shame it is it won't
start. They will be bringing up the server, somehow, or they get the
sack. IMHO. I'll say no more though; it's not an argument.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> But with a down server, you just force people to do pg_resetxlog, which
> loses both the corruption (probably) and real, useful data (likely) and
> *then* they bring up the server. I don't see why we should force people
> to take a manual action and lose data to bring up the server.

That's all fine, but simply reducing the message level from PANIC to LOG
remains an utterly unacceptable "solution".  What will happen is that
the server will start, the DBA will go back to sleep after ignoring
(most likely, never even reading) the log message, and the corruption
will get worse.  The potential consequences of corruption in a pg_class
index, for example, are just horrid.  Frankly I'd rather "rm -rf $PGDATA"
and force someone to go back to their last backup than let them continue
to run with a database that is known to be broken and the system didn't
do anything more to warn them than emit a LOG message someplace.

(No, I'm not seriously proposing that as a recovery technique.  But it's
no more irresponsible than ignoring a corruption condition.)

            regards, tom lane