Thread: PANIC: btree_split_redo: lost left sibling?

PANIC: btree_split_redo: lost left sibling?

From
Andrew Sukow
Date:
Greetings,

Our postgres system crashed and upon restarting it our database had the following errors.  The error log was 4.5 gigs,
which is much larger than usual.  We looked online for information about lost left siblings and how to fix the data and
not lose the 400 million records we have.  Anyone have an idea what's the matter and what the fix is?

LOG:  database system was interrupted while in recovery at 2004-08-17 08:59:41 PDT
HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG:  checkpoint record is at 326/C007B778
LOG:  redo record is at 326/BD899570; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 46922114; next OID: 133662911
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 326/BD899570
PANIC:  btree_split_redo: lost left sibling
LOG:  startup process (PID 9038) was terminated by signal 6
LOG:  aborting startup due to startup process failure

Thanks,

Andrew



Re: PANIC: btree_split_redo: lost left sibling?

From
Bruce Momjian
Date:
Please provide a PostgreSQL version and operating system information.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: PANIC: btree_split_redo: lost left sibling?

From
Andrew Sukow
Date:
Gentoo
Postgres V 7.4.3
Freshly recompiled postgres and compiler

Thanks

Andrew



Re: PANIC: btree_split_redo: lost left sibling?

From
"Gavin M. Roy"
Date:
Note that I have had a few segfaults on gentoo, pg v7.4.3, amd64, kernel
2.6.5-gentoo-r1 as well.

Gavin



Re: PANIC: btree_split_redo: lost left sibling?

From
Tom Lane
Date:
Andrew Sukow <creoe@shaw.ca> writes:
> Our postgres system crashed and upon restarting it our database had the following errors.  The error log was 4.5 gigs,
> which is much larger than usual.  We looked online for information about lost left siblings and how to fix the data and
> not lose the 400 million records we have.  Anyone have an idea what's the matter and what the fix is?

> PANIC:  btree_split_redo: lost left sibling

Looking at the code, the most probable explanation seems to be that the
WAL contains a reference to a btree page that doesn't exist on disk
(i.e., the index file on disk is too short to contain that page number).
The code is panicking because it expects that page to exist already.
I have to agree with it; it would seem you are suffering from
filesystem misfeasance.  Are you close to being out of disk space
by any chance?

What I would suggest doing is modifying the error message (it's in
src/backend/access/nbtree/nbtxlog.c, about line 256 in 7.4) to report
the index's database OID and relfilenode and the block number it's failing to access.
Or if you built with debug enabled, you could gdb the core dump and
extract those numbers that way.  Knowing the file and the length it
needs to be, you could append zeroes to the file to make it long enough,
and then the replay should succeed.
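
A minimal sketch of the zero-padding step, not a tested recovery tool. It
assumes the default 8192-byte block size and the default layout in which an
index lives at $PGDATA/base/<database OID>/<relfilenode>, and it only handles a
block in the first 1 GB segment file. The function names and every number in
the usage note are hypothetical.

```python
import os

BLCKSZ = 8192          # assumption: default PostgreSQL block size
RELSEG_SIZE = 131072   # assumption: default number of blocks per 1 GB segment


def relation_path(pgdata, db_oid, relfilenode):
    """Path of a relation's first segment: $PGDATA/base/<db_oid>/<relfilenode>."""
    return os.path.join(pgdata, "base", str(db_oid), str(relfilenode))


def pad_to_block(path, block_number, block_size=BLCKSZ):
    """Append zero bytes until `path` is long enough to contain `block_number`.

    Returns the number of bytes appended (0 if the file was already long enough).
    """
    if block_number < 0 or block_number >= RELSEG_SIZE:
        raise ValueError("block is outside the first segment; not handled here")

    # Block N occupies bytes [N * block_size, (N + 1) * block_size).
    target = (block_number + 1) * block_size
    missing = target - os.path.getsize(path)
    if missing <= 0:
        return 0

    zero_block = bytes(block_size)
    with open(path, "ab") as f:  # append mode: existing pages are never touched
        remaining = missing
        while remaining > 0:
            chunk = min(remaining, block_size)
            f.write(zero_block[:chunk])
            remaining -= chunk
    return missing
```

With the postmaster stopped and a copy of the data directory set aside, a call
might look like pad_to_block(relation_path("/var/lib/postgresql/data", 17142,
16980), 5120), using the numbers reported by the modified error message.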

A quicker-and-dirtier solution is to pass extend = true instead of false
to the XLogReadBuffer just above this, but I counsel doing the file
extensions manually as sketched above, so that you will know exactly
which index(es) have got this problem.  If I were doing this I would
certainly want to manually REINDEX those indexes afterwards.  The
specific page that's being requested will be filled in correctly from
the WAL entry, but who knows what else is wrong elsewhere in the index?

BTW, what do you mean by "the error log was 4.5 gigs"?  What you showed
us was only 10 lines.

            regards, tom lane