Thread: PANIC: btree_split_redo: lost left sibling?
Greetings,

Our postgres system crashed, and upon restarting it our database reported the following errors. The error log was 4.5 gigs, which is much larger than usual. We looked online for information about lost left siblings and how to fix the data without losing the 400 million records we have. Does anyone have an idea what's the matter and what the fix is?

LOG: database system was interrupted while in recovery at 2004-08-17 08:59:41 PDT
HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG: checkpoint record is at 326/C007B778
LOG: redo record is at 326/BD899570; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 46922114; next OID: 133662911
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 326/BD899570
PANIC: btree_split_redo: lost left sibling
LOG: startup process (PID 9038) was terminated by signal 6
LOG: aborting startup due to startup process failure

Thanks,
Andrew
Please provide a PostgreSQL version and operating system information.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Gentoo
Postgres V 7.4.3
Freshly recompiled postgres and compiler

Thanks,
Andrew

----- Original Message -----
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Tuesday, August 17, 2004 10:09 am
Subject: Re: [GENERAL] PANIC: btree_split_redo: lost left sibling?

> Please provide a PostgreSQL version and operating system information.
Note that I have had a few segfaults on gentoo, pg v7.4.3, amd64, kernel 2.6.5-gentoo-r1 as well.

Gavin

Andrew Sukow wrote:
> Gentoo
> Postgres V 7.4.3
> Freshly recompiled postgres and compiler
Andrew Sukow <creoe@shaw.ca> writes:
> Our postgres system crashed and upon restarting it our database had the following errors. The error log was 4.5 gigs which is much larger than usual. We looked online for information about lost left siblings and how to fix the data and not lose the 400 million records we have. Anyone have an idea what's the matter and what the fix is?

> PANIC: btree_split_redo: lost left sibling

Looking at the code, the most probable explanation seems to be that the WAL log contains a reference to a btree page that doesn't exist on disk (i.e., the index file on disk is too short to contain that page number). The code is panicking because it expects that the page should exist already. I have to agree with it --- it would seem you are suffering from filesystem misfeasance. Are you close to being out of disk space by any chance?

What I would suggest doing is modifying the error message (it's in src/backend/access/nbtree/nbtxlog.c, about line 256 in 7.4) to report the index's DB/relfileno and the block number it's failing to access. Or if you built with debug enabled, you could gdb the core dump and extract those numbers that way. Knowing the file and the length it needs to be, you could append zeroes to the file to make it long enough, and then the replay should succeed.

A quicker-and-dirtier solution is to pass extend = true instead of false to the XLogReadBuffer call just above this, but I counsel doing the file extensions manually as sketched above, so that you will know exactly which index(es) have got this problem.

If I were doing this I would certainly want to manually REINDEX those indexes afterwards. The specific page that's being requested will be filled in correctly from the WAL entry, but who knows what else is wrong elsewhere in the index?

BTW, what do you mean by "the error log was 4.5 gigs"? What you showed us was only 10 lines.

regards, tom lane
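
[Editor's sketch] The "append zeroes to the file" step above could look roughly like the following Python. This is only an illustration under stated assumptions, not a tested recovery tool: it assumes a default build (8192-byte blocks, relation files split into 1 GB segments), and the names locate_block and zero_extend are hypothetical helpers, not part of PostgreSQL. The block number would come from the modified error message or the core dump. Stop the postmaster and copy the index file aside before touching it.

```python
import os

BLOCK_SIZE = 8192                              # assumption: default BLCKSZ
BLOCKS_PER_SEGMENT = (1 << 30) // BLOCK_SIZE   # assumption: 1 GB segment files


def locate_block(block_number, block_size=BLOCK_SIZE,
                 blocks_per_segment=BLOCKS_PER_SEGMENT):
    """Return (segment_number, minimum length in bytes of that segment file)
    needed for the given block to exist on disk."""
    segment, block_in_segment = divmod(block_number, blocks_per_segment)
    return segment, (block_in_segment + 1) * block_size


def zero_extend(path, min_length, block_size=BLOCK_SIZE):
    """Append zero bytes until the file is at least min_length bytes long.
    Existing contents are never modified. Returns the number of bytes added."""
    current = os.path.getsize(path)
    if current >= min_length:
        return 0
    missing = min_length - current
    zero_page = bytes(block_size)
    with open(path, "ab") as f:
        remaining = missing
        while remaining > 0:
            chunk = min(remaining, block_size)
            f.write(zero_page[:chunk])
            remaining -= chunk
        f.flush()
        os.fsync(f.fileno())   # make sure the extension reaches the disk
    return missing
```

For example, if the failing block were number 5, locate_block(5) gives segment 0 and a minimum length of 49152 bytes, so zero_extend would be run on the index's base file with that length. For a non-zero segment number the file to extend is the one with the matching numeric suffix (".1", ".2", and so on), and all earlier segments are expected to be full already. As the message says, the affected indexes should still be REINDEXed once replay succeeds.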