Thread: How to recover when can't start database
Hi! (Hope this is the right place to post) I crashed the postmaster and cannot start it anymore, with the error LOG: database system was interrupted while in recovery at 2005-04-01 11:04:33 CEST HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery. LOG: checkpoint record is at 5/6F00C540 LOG: redo record is at 5/6F000ABC; undo record is at 0/0; shutdown FALSE LOG: next transaction ID: 599824; next OID: 147679259 LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 5/6F000ABC PANIC: btree_split_redo: lost left sibling LOG: startup process (PID 5603) was terminated by signal 6 LOG: aborting startup due to startup process failure Is there a way to recover from that? I don't have a fresh backup, but loosing some couple of days won't be a problem. I use PG 8.0 on a linux box, with standard postgresq.conf (except some increased memory settings). thanks, L.B.
"L.Boldareva" <pg@pierro.dds.nl> writes: > LOG: database system was not properly shut down; automatic recovery in > progress > LOG: redo starts at 5/6F000ABC > PANIC: btree_split_redo: lost left sibling > LOG: startup process (PID 5603) was terminated by signal 6 Hmm. AFAICS that could only happen if a page split record is pointing at an "original" page that's not there anymore; that is, the page is past what the kernel says is the end of the file. Exactly how did you get into this state ... was there a system-level crash involved? ISTM this cannot happen unless the filesystem has dropped data. You could probably get it to start by changing the "false" to "true" in this call of XLogReadBuffer /* Left (original) sibling */ buffer = XLogReadBuffer(false, reln, leftsib); if (!BufferIsValid(buffer)) elog(PANIC, "btree_split_%s: lost left sibling", op); in src/backend/access/nbtree/nbtxlog.c (it's line 261 in CVS tip, possibly a little different in 8.0). Let us know if that helps. I'd be a bit suspicious of the contents of the index, if not the whole database, so an immediate dump,reinitdb,reload might be your most prudent course of action after you get it to start. Plan B would be to wipe out the WAL log with pg_resetxlog. This will allow you to start but the odds of having corrupt data afterwards would be about 100% ... you *must* dump and reload if you go that way. regards, tom lane
On Fri, 2005-04-01 at 04:02, L.Boldareva wrote: > Hi! > (Hope this is the right place to post) > > I crashed the postmaster and cannot start it anymore, with the error In addition to what everyone's posted already, I would suggest you spend some time figuring out what got you here in the first place. Bad hardware, misconfigured postgresql.conf (specifically fsync=false) or other possibilities. Under normal circumstances, a crashing postmaster should not cause these problems.
I have kind of fixed the problem (hopefully) I turned out I used plan B, dump/reload will be my next step then. What is meant by the "corrupt data", is this about the data or the things around it, like indexes, system tables? Here is how I got it crashed: I compiled a c-procedure and copied the .so file to its place exactly at a time when (quite unfortunately) another query was running, that used that library. I usually do pg_ctl reload right after that, and it seems to be enough, but not this time. The c-function contained code that would send a query to populate a table (most likely one of the 2 bad ones), but I am not sure this matters since there were 2 tables out of order, and only one at a time is touched by my script. That's it. I just checked, if that matters, that fsync = true on the config file. On Fri, 1 Apr 2005, Tom Lane wrote: > Hmm. AFAICS that could only happen if a page split record is pointing > at an "original" page that's not there anymore; that is, the page is > past what the kernel says is the end of the file. Something like that was mentioned in the WARNING message when I tried to drop the table, but I got that warning only once, further actions just raised the error about relid. Thank you for your help, LB > You could probably get it to start by changing the "false" to "true" > in this call of XLogReadBuffer > > /* Left (original) sibling */ > buffer = XLogReadBuffer(false, reln, leftsib); > if (!BufferIsValid(buffer)) > elog(PANIC, "btree_split_%s: lost left sibling", op); > > in src/backend/access/nbtree/nbtxlog.c (it's line 261 in CVS tip, > possibly a little different in 8.0). Let us know if that helps. > > I'd be a bit suspicious of the contents of the index, if not the > whole database, so an immediate dump,reinitdb,reload might be your > most prudent course of action after you get it to start. > > Plan B would be to wipe out the WAL log with pg_resetxlog. This will > allow you to start but the odds of having corrupt data afterwards would > be about 100% ... you *must* dump and reload if you go that way. > > regards, tom lane >
"L.Boldareva" <pg@pierro.dds.nl> writes: > Here is how I got it crashed: > I compiled a c-procedure and copied the .so file to its place exactly at a > time when (quite unfortunately) another query was running, that used that > library. Hmm, at worst that should only crash one backend, not result in bad data getting into the WAL log. If there was no filesystem crash then I have to suppose that the page number entered in the WAL page split record was wrong ... but how could that happen if the same page number had just been used successfully to read the page? regards, tom lane
On Fri, 1 Apr 2005, Tom Lane wrote: > "L.Boldareva" <pg@pierro.dds.nl> writes: > > Here is how I got it crashed: > > I compiled a c-procedure and copied the .so file to its place exactly at a > > time when (quite unfortunately) another query was running, that used that > > library. > > Hmm, at worst that should only crash one backend, not result in bad data > getting into the WAL log. If there was no filesystem crash then I have > to suppose that the page number entered in the WAL page split record was > wrong ... but how could that happen if the same page number had just > been used successfully to read the page? > > regards, tom lane I have no idea Tell me where to look, I'll have a look in the tar. I'd rather not reproduce the thing, but if someone is interested, I can try that, too (later). regards, LB >
Sorry if this post winds up as a duplicate on the list... --- In an effort to disable all of the foreign key restraints on a set of tables in my installation of PostgreSQL 8.0.1, I was fiddling with the pg_class table and managed to elicit behavior similar to what L. Boldareva ran into. (This is Red Hat AS 4; Opteron w/4GB...) pg_ctl can start and stop my postmaster, but I can't use psql to access a database. When I try, I get this on the command line: --- psql: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. --- ...and this in the log: --- 2005-06-07 16:44:37 EDTLOG: connection received: host=[local] port= 2005-06-07 16:44:37 EDTLOG: connection authorized: user=postgres database=caarray 2005-06-07 16:44:43 EDTLOG: server process (PID 14278) was terminated by signal 11 2005-06-07 16:44:43 EDTLOG: terminating any other active server processes 2005-06-07 16:44:43 EDTLOG: all server processes terminated; reinitializing 2005-06-07 16:44:43 EDTLOG: database system was interrupted at 2005-06-07 16:34:46 EDT 2005-06-07 16:44:43 EDTLOG: checkpoint record is at 0/81F18770 2005-06-07 16:44:43 EDTLOG: redo record is at 0/81F18770; undo record is at 0/0; shutdown TRUE 2005-06-07 16:44:43 EDTLOG: next transaction ID: 6196113; next OID: 6226772 2005-06-07 16:44:43 EDTLOG: database system was not properly shut down; automatic recovery in progress 2005-06-07 16:44:43 EDTLOG: record with zero length at 0/81F187AC 2005-06-07 16:44:43 EDTLOG: redo is not required 2005-06-07 16:44:43 EDTLOG: database system is ready --- I think I know exactly what caused this. Postmaster seemed to crash after I ran these two updates on pg_class: --- 2005-06-07 16:02:38 EDTLOG: statement: update pg_class set reltriggers=foo.c from (select relname,count(tgrelid) as c from pg_class,pg_trigger where pg_class.oid=tgrelid and relnamespace=2200 group by relname) foo; 2005-06-07 16:03:21 EDTLOG: statement: update pg_class set reltriggers=foo.c from (select relname,count(tgrelid) as c from pg_class,pg_trigger where pg_class.oid=tgrelid and relnamespace=2200 group by relname) foo where pg_class.relnamespace=2200; 2005-06-07 16:03:22 EDTLOG: connection received: host=[local] port= 2005-06-07 16:03:22 EDTLOG: connection authorized: user=postgres database=caarray 2005-06-07 16:03:22 EDTLOG: server process (PID 12845) was terminated by signal 11 2005-06-07 16:03:22 EDTLOG: terminating any other active server processes --- If I can't ever get this database back up and running, it's no big deal... this isn't something in production, and I wouldn't do this on a production machine. However, it would save me enough time to get this working that I'm trying this list. The only possible saving grace I seem to have is that immediately before I ran the updates, I made a copy of the pg_class table. So, if the only problem is the pg_class table, then I'm thinking I could just copy the valid data back in, and be more careful next time. But, right now, I can't even get the server to talk to me. Also, if there's an easier way to disable and re-enable the foreign key constraints, I'd certainly appreciate the tip as well. Any help appreciated. Colin Freas
"Colin E. Freas" <cef6@georgetown.edu> writes: > 2005-06-07 16:02:38 EDTLOG: statement: update pg_class set > reltriggers=foo.c from (select relname,count(tgrelid) as c from > pg_class,pg_trigger where pg_class.oid=tgrelid and relnamespace=2200 > group by relname) foo; I'm afraid that database is toast :-(. You managed to overwrite reltriggers in every row of pg_class, including all the system catalogs, including some that absolutely positively cannot have triggers ... like pg_trigger for instance. -> try to open pg_class ... hmm, it says it has triggers -> try to open pg_trigger to read triggers ... hmm, it says it has triggers -> try to open pg_trigger to read triggers ... hmm, it says it has triggers -> try to open pg_trigger to read triggers ... recurse until out of stack space. regards, tom lane
On Fri, Jun 10, 2005 at 12:34:15PM -0400, Tom Lane wrote: > "Colin E. Freas" <cef6@georgetown.edu> writes: > > 2005-06-07 16:02:38 EDTLOG: statement: update pg_class set > > reltriggers=foo.c from (select relname,count(tgrelid) as c from > > pg_class,pg_trigger where pg_class.oid=tgrelid and relnamespace=2200 > > group by relname) foo; > > I'm afraid that database is toast :-(. You managed to overwrite > reltriggers in every row of pg_class, including all the system > catalogs, including some that absolutely positively cannot have > triggers ... like pg_trigger for instance. > > -> try to open pg_class > ... hmm, it says it has triggers > -> try to open pg_trigger to read triggers > ... hmm, it says it has triggers > -> try to open pg_trigger to read triggers > ... hmm, it says it has triggers > -> try to open pg_trigger to read triggers > > ... recurse until out of stack space. IIRC Joe Conway had this problem some time ago, and he managed to get out of it. I don't recall details, maybe searching the archives ... -- Alvaro Herrera (<alvherre[a]surnet.cl>) "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)