Thread: db recovery (FATAL 2)
My database apparently crashed - I don't know how or why. It happened in the
middle of the night, so I wasn't around to troubleshoot it at the time. It
looks like it died during the scheduled vacuum.

Here's the log that gets generated when I attempt to bring it back up:

postmaster successfully started
DEBUG: database system shutdown was interrupted at 2002-05-07 09:35:35 EDT
DEBUG: CheckPoint record at (10, 1531023244)
DEBUG: Redo record at (10, 1531023244); Undo record at (10, 1531022908); Shutdown FALSE
DEBUG: NextTransactionId: 29939385; NextOid: 9729307
DEBUG: database system was not properly shut down; automatic recovery in progress...
DEBUG: redo starts at (10, 1531023308)
DEBUG: ReadRecord: record with zero len at (10, 1575756128)
DEBUG: redo done at (10, 1575756064)
FATAL 2: write(logfile 10 seg 93 off 15474688) failed: Success
/usr/bin/postmaster: Startup proc 1339 exited with status 512 - abort

Any suggestions? What are my options, other than doing a complete restore of
the DB from a dump (which is not really an option, as the backup is not as
recent as it should be)?

Thanks!

Bojan
Bojan Belovic <bbelovic@usa.net> writes:
> FATAL 2: write(logfile 10 seg 93 off 15474688) failed: Success

This would be what PG version?

Broad-jumping to conclusions, I'm going to guess (a) 7.1.0-7.1.2 and
(b) you are out of disk space for the WAL logs.

If so, you'll need to free up 16MB or so to restart the postmaster, and
you'd be well advised to update to 7.1.3 before trying another VACUUM.
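If you want to confirm the disk-space theory before anything else, something
along these lines will show it (assuming $PGDATA points at your data
directory; adjust the path to your installation):

df -h $PGDATA/pg_xlog        # free space on the filesystem holding the WAL
du -sh $PGDATA/pg_xlog       # how large the WAL directory itself has grown

			regards, tom lane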
Hi folks, this is a long email. I too experienced a data loss of 11 hrs
recently. I have the most recent PostgreSQL, 7.2.1, on RedHat 6.2, but my
case was a bit different, and I feel my wrong handling of the situation was
also responsible for it. I would be grateful if someone could tell me what
should have been done *instead* to prevent the data loss.

As far as I remember, the following is the post mortem:

The load average of my database server had reached 5.15 and my website had
become sluggish, so I decided to stop the postmaster and start it again (I
don't know if it was the right thing, but it was intuitive to me). So I did:

# su - postgres
# pg_ctl stop              <-- did not work; it said the postmaster could not be stopped
# pg_ctl stop -m immediate

It said the postmaster was stopped, but it was wrong: ps auxwww still showed
some processes running.

# pg_ctl -l /var/log/pgsql start

It said it started successfully (but in reality it had not). At this point
the postmaster was neither dead nor running, and essentially my live website
was down, so under pressure I decided to reboot the system and told my ISP to
do so. But even the reboot was not smooth; the unix admin at my ISP said some
process would not let the system reboot (and it was the postmaster). So he
had to power-cycle the machine, and the machine fscked on startup. As a
result I got messages similar to the ones Bojan has given below, and my
website was not connecting to the database; it kept saying "database in
recovery mode...". Then I did "pg_ctl stop", then start, but nothing worked
out. Since it was my production database I had to restore it in minimum
time, so I used my old backup that was 11 hrs old, and hence a major data
loss. (A sketch of the shutdown sequence I should probably have used is in
the P.S. at the end of this message.)

I strongly believe PostgreSQL is the best open source database around and is
*safe* unless fiddled with in a wrong manner. But there are problems in
using it. Due to the current lack of built-in failover and replication
solutions in PostgreSQL, people like me tend to become desperate - one
cannot keep a webserver down for long - and as a result we take wrong steps.
For mere mortals like me there should be a set of guidelines for safe
handling of the server (DOS AND DON'TS type) to prevent DATA LOSS.

I would also like suggestions on how to live with PostgreSQL, given its
current limitations around replication (or failover solutions), without
data loss. What I currently do is back up my database with pg_dump, but
there are problems with that. Because of the large size of my database,
pg_dump takes 20-30 mins and the server load increases, which means I cannot
run it very frequently on my production server. So in the worst case I still
lose data of a duration ranging from 1-24 hrs, depending on the frequency of
pg_dump. And for many of us even 1 hour of data is *quite* a loss.

I would also like comments on the usability of the USOGRES / RSERV
replication systems with PostgreSQL 7.2.1.

Hoping to get some tips from the intellectuals out here.

regds
mallah.

On Tuesday 07 May 2002 07:52 pm, Bojan Belovic wrote:
> My database apparently crashed - I don't know how or why. It happened in
> the middle of the night, so I wasn't around to troubleshoot it at the
> time. It looks like it died during the scheduled vacuum.
>
> Here's the log that gets generated when I attempt to bring it back up:
>
> postmaster successfully started
> DEBUG: database system shutdown was interrupted at 2002-05-07 09:35:35 EDT
> DEBUG: CheckPoint record at (10, 1531023244)
> DEBUG: Redo record at (10, 1531023244); Undo record at (10, 1531022908);
> Shutdown FALSE
> DEBUG: NextTransactionId: 29939385; NextOid: 9729307
> DEBUG: database system was not properly shut down; automatic recovery in
> progress...
> DEBUG: redo starts at (10, 1531023308)
> DEBUG: ReadRecord: record with zero len at (10, 1575756128)
> DEBUG: redo done at (10, 1575756064)
> FATAL 2: write(logfile 10 seg 93 off 15474688) failed: Success
> /usr/bin/postmaster: Startup proc 1339 exited with status 512 - abort
>
> Any suggestions? What are my options, other than doing a complete restore
> of the DB from a dump (which is not really an option as the backup is not
> as recent as it should be).
>
> Thanks!
>
> Bojan
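P.S. In hindsight, the sequence I should probably have tried looks something
like this (a sketch only - as I understand it, "-m fast" disconnects clients
but still shuts down cleanly, while "-m immediate" skips the clean shutdown
and forces crash recovery on the next start):

# su - postgres
# pg_ctl stop -m fast            <-- abort open transactions, shut down cleanly
# ps auxwww | grep postgres      <-- confirm every backend has really exited
# pg_ctl start -l /var/log/pgsql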
Addition to my previous question...

If postgres has trouble recovering the database from the log, is it possible
to skip the recovery step and lose some recent data changes? If the loss is
relatively small, I think it would be acceptable (say we lose all the
changes from the last few hours). Not sure if the vacuum makes this a bigger
problem?

Thanks!

--------
My database apparently crashed - I don't know how or why. It happened in the
middle of the night, so I wasn't around to troubleshoot it at the time. It
looks like it died during the scheduled vacuum.

Here's the log that gets generated when I attempt to bring it back up:

postmaster successfully started
DEBUG: database system shutdown was interrupted at 2002-05-07 09:35:35 EDT
DEBUG: CheckPoint record at (10, 1531023244)
DEBUG: Redo record at (10, 1531023244); Undo record at (10, 1531022908); Shutdown FALSE
DEBUG: NextTransactionId: 29939385; NextOid: 9729307
DEBUG: database system was not properly shut down; automatic recovery in progress...
DEBUG: redo starts at (10, 1531023308)
DEBUG: ReadRecord: record with zero len at (10, 1575756128)
DEBUG: redo done at (10, 1575756064)
FATAL 2: write(logfile 10 seg 93 off 15474688) failed: Success
/usr/bin/postmaster: Startup proc 1339 exited with status 512 - abort

Any suggestions? What are my options, other than doing a complete restore of
the DB from a dump (which is not really an option, as the backup is not as
recent as it should be)?

Thanks!

Bojan
You are correct, it's 7.1.2. However, the problem is not with disk space
(there are several gigs available), but there could be a problem with a bad
sector on one of the log files. If this is the case, and the log file is
corrupted, is there any way of recovering, even with a certain data loss?

Thanks!

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bojan Belovic <bbelovic@usa.net> writes:
> > FATAL 2: write(logfile 10 seg 93 off 15474688) failed: Success
>
> This would be what PG version?
>
> Broad-jumping to conclusions, I'm going to guess (a) 7.1.0-7.1.2
> and (b) you are out of disk space for the WAL logs.
>
> If so, you'll need to free up 16MB or so to restart the postmaster,
> and you'd be well advised to update to 7.1.3 before trying another
> VACUUM.
>
> 			regards, tom lane
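P.S. To test the bad-sector theory, I plan to run something like this once I
get a maintenance window (the device name is illustrative; badblocks'
default mode is a non-destructive read-only scan):

badblocks -sv /dev/sda1          <-- -s shows progress, -v is verbose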
Bojan Belovic <bbelovic@usa.net> writes:
> You are correct, it's 7.1.2. However, the problem is not with disk space
> (there are several gigs available), but there could be a problem with a
> bad sector on one of the log files. If this is the case, and the log file
> is corrupted, is there any way of recovering, even with a certain data
> loss?

Hm. It's complaining about a write, not a read, so there is no lost data
(yet), even if your theory is correct. You might first try copying the
entire $PGDATA/pg_xlog directory somewhere else.

If nothing else avails, see contrib/pg_resetxlog. But that should be your
last resort, not first.
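Something like this, say (run as the postgres user, with the postmaster
stopped; the destination path is only illustrative):

cp -a $PGDATA/pg_xlog /some/other/disk/pg_xlog.backup

			regards, tom lane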
It turns out it was a bad sector, and once the data was retrieved and the
drive replaced, postgres was able to go through the startup process
successfully. Given that it did not report any errors on recovery, I suppose
there was no data loss, but even if there was some damage to the log file,
it should be minor - the crash happened at 6am, with almost no activity, so
I'm not going to worry about that at all at this point.

Anyway, thank you very much for your help.

One quick question - you mentioned I should upgrade to 7.1.3 before I run
vacuum again. What are the known problems that "ask" for this?

Thanks again!

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bojan Belovic <bbelovic@usa.net> writes:
> > You are correct, it's 7.1.2. However, the problem is not with disk space
> > (there are several gigs available), but there could be a problem with a
> > bad sector on one of the log files. If this is the case, and the log
> > file is corrupted, is there any way of recovering, even with a certain
> > data loss?
>
> Hm. It's complaining about a write, not a read, so there is no lost
> data (yet), even if your theory is correct. You might first try copying
> the entire $PGDATA/pg_xlog directory somewhere else.
>
> If nothing else avails, see contrib/pg_resetxlog. But that should be
> your last resort, not first.
>
> 			regards, tom lane
Bojan Belovic <bbelovic@usa.net> writes:
> One quick question - you mentioned I should upgrade to 7.1.3 before I run
> vacuum again. What are the known problems that "ask" for this?

WAL growth. My original theory was that you'd run out of disk space because
of a VACUUM trying to do a huge amount of work. In 7.1.2 the WAL can grow
arbitrarily large during a long transaction...

There are some other not-unimportant bug fixes in 7.1.3 too, but that's the
one I was thinking of.

			regards, tom lane
Just to make sure - given that there is plenty of space available (the
database is slightly larger than 1GB and there is almost 10GB free), there
should be no problem with vacuum? Or should I upgrade regardless? (I
generally like to keep a stable system stable, unless I know there is a
specific reason why I should change things in the environment.)

Thanks a lot,

Bojan

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bojan Belovic <bbelovic@usa.net> writes:
> > One quick question - you mentioned I should upgrade to 7.1.3 before I
> > run vacuum again. What are the known problems that "ask" for this?
>
> WAL growth. My original theory was that you'd run out of disk space
> because of a VACUUM trying to do a huge amount of work. In 7.1.2
> the WAL can grow arbitrarily large during a long transaction...
>
> There are some other not-unimportant bug fixes in 7.1.3 too, but that's
> the one I was thinking of.
>
> 			regards, tom lane
My database ran out of disk space the other day (I usually monitor better,
but my wife is having chemo and that is more important). Anyway, I shut down
the server using the 'pg_ctl stop' command, moved about 6GB of indexes to
another drive, and fixed all the links. After I restarted the server,
everything looked fine until I got:

FATAL 2: open of /usr/local/pgsql/data/pg_clog/0000 failed: No such file or directory

This happens about every 30-40 minutes; then the server comes up and behaves
okay for a while until *POOF*.

I tried to dump my databases to one of my alternate servers, but it fails
because I have duplicate records in the primary key of miscellaneous tables.
I tried to get smart and changed the primary key to include the oid,
figuring that would make it unique, and it would also be easier to delete
one of the conflicting records. When I started the dump, I got the same
error. After many hours running this on a table with about 200M records:

SELECT oid, cnt
FROM (SELECT oid, count(oid) AS cnt FROM foo GROUP BY oid) AS bar
WHERE cnt > 1;

I discovered that I have a bunch (20,000+) of records that have duplicated
oid numbers. Is this because of the disk running out of space, or is it some
deeper, more evil problem? Also, how the %!$#! do I fix it without losing
the associated data? Most of the records are identical (maybe all), and when
I delete any 1, all of them disappear. I guess I could select distinct into
a temporary table, delete from the current table, and then insert from the
temporary table (see the P.S. below for a sketch), but this is gonna take a
long time.

- brian

Wm. Brian McCane                     | Life is full of doors that won't open
Search  http://recall.maxbaud.net/   | when you knock, equally spaced amid those
Usenet  http://freenews.maxbaud.net/ | that open when you don't want them to.
Auction http://www.sellit-here.com/  | - Roger Zelazny "Blood of Amber"
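P.S. Something like this is what I have in mind for the dedup (a sketch
only; "foo_dedup" and "mydb" are made-up names, and it assumes the
duplicated rows really are identical in every user column, so that DISTINCT
collapses them):

psql mydb <<'EOF'
BEGIN;
-- keep one representative of each distinct row
SELECT DISTINCT * INTO TEMP TABLE foo_dedup FROM foo;
-- clear the table (the duplicates can't be deleted one at a time)
DELETE FROM foo;
-- reload the unique rows
INSERT INTO foo SELECT * FROM foo_dedup;
COMMIT;
EOF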