Thread: Do I have a corrupted database?
I fear I have a corrupted database, and I'm not sure what to do. Environment: Windows Server 2003 8GB RAM Dual processor, quad core 2.6Ghz Postgres 8.2.3 (The IT dept wants to upgrade to 8.2.9, but they are asking me what to do about this corrupt database before they proceed) The database files and logs are stored on a SAN drive 2008-08-23 06:57:06 FATAL: could not create sigchld waiter thread: error code 1816 *** ack! 13 hour hole! What the...? 2008-08-23 20:00:27 ERROR: xlog flush request E0/293CF278 is not satisfied --- flushed only to E0/21B1B7F0 2008-08-23 20:00:27 CONTEXT: writing block 94218 of relation 16712/16713/16725 2008-08-23 20:04:36 DETAIL: Multiple failures --- write error may be permanent. 2008-08-23 20:04:36 ERROR: xlog flush request E0/4FC5BEB8 is not satisfied --- flushed only to E0/21B9E270 2008-08-23 20:04:36 CONTEXT: writing block 81033 of relation 16712/16713/16725 2008-08-23 20:04:36 STATEMENT: BEGIN TRANSACTION; ... just a normal SQL stored proc... 2008-08-23 20:04:36 DETAIL: Multiple failures --- write error may be permanent. 2008-08-23 20:04:36 ERROR: xlog flush request E0/314D8248 is not satisfied --- flushed only to E0/21B9E358 2008-08-23 20:04:36 CONTEXT: writing block 371418 of relation 16712/16713/16719 2008-08-23 20:04:36 STATEMENT: BEGIN TRANSACTION;... just a normal SQL stored proc... repeats for quite a while. A few days later, after a restart, we are seeing these showing up quite often: 2008-08-26 11:59:42 FATAL: the database system is starting up 2008-08-26 11:59:42 FATAL: the database system is starting up 2008-08-26 11:59:43 FATAL: the database system is starting up 2008-08-26 11:59:43 FATAL: the database system is starting up 2008-08-26 11:59:43 FATAL: the database system is starting up 2008-08-26 11:59:43 LOG: database system is ready 2008-08-26 11:59:55 PANIC: right sibling's left-link doesn't match 2008-08-26 11:59:55 STATEMENT: BEGIN TRANSACTION;INSERT INTO ...SQL scrubbed... This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information. 2008-08-26 11:59:55 LOG: server process (PID 2228) exited with exit code 3 2008-08-26 11:59:55 LOG: terminating any other active server processes 2008-08-26 11:59:55 LOG: all server processes terminated; reinitializing 2008-08-26 11:59:55 LOG: database system was interrupted at 2008-08-26 11:59:43 Pacific Daylight Time 2008-08-26 11:59:55 LOG: checkpoint record is at E2/F88B6C0 2008-08-26 11:59:55 LOG: redo record is at E2/F88B6C0; undo record is at 0/0; shutdown TRUE 2008-08-26 11:59:55 LOG: next transaction ID: 0/396816257; next OID: 58100 2008-08-26 11:59:55 LOG: next MultiXactId: 3; next MultiXactOffset: 5 2008-08-26 11:59:55 LOG: database system was not properly shut down; automatic recovery in progress 2008-08-26 11:59:55 LOG: redo starts at E2/F88B710 2008-08-26 11:59:55 LOG: record with zero length at E2/F984928 2008-08-26 11:59:55 LOG: redo done at E2/F9848F8 2008-08-26 11:59:55 FATAL: the database system is starting up 2008-08-26 11:59:56 FATAL: the database system is starting up 2008-08-26 11:59:56 FATAL: the database system is starting up 2008-08-26 11:59:56 FATAL: the database system is starting up 2008-08-26 11:59:56 FATAL: the database system is starting up 2008-08-26 11:59:56 FATAL: the database system is starting up 2008-08-26 11:59:56 FATAL: the database system is starting up 2008-08-26 11:59:56 LOG: database system is ready That section is repeating over and over. Oddly enough, the system actually seems to be running mostly. I need to do some diagnostics of our app to see what is going on at that layer and what is and isn't working. I found an article online with a similar problem, but no resolution: http://www.mydatabasesupport.com/forums/postgresql/399079-general-failing-recover-after-panic-shutdown.html
William Garrison wrote: > I fear I have a corrupted database, and I'm not sure what to do. First, make sure you have a recent backup. If your backups rotate, stop the rotation so that all currently available historical copies of the database are preserved from now on - just in case you need them. Now, if possible dump your database with pg_dump. Restore the dump to a test database instance and make sure that it all goes OK. Once that's done, so you know you have a decent recovery point to work from in case you make a mistake during your recovery efforts. After that I don't have all that much to offer, especially as you're using an operating system I don't have much experience with Pg on and you're using an (unspecified) SAN. Normally I'd ask if you'd verified your RAID array / tested your disks. In this case, I'm wondering if there's any chance there was a service interruption on the SAN that might've caused some sort of intermittent or partial writes. > 2008-08-23 20:00:27 ERROR: xlog flush request E0/293CF278 is not > satisfied --- flushed only to E0/21B1B7F0 > 2008-08-23 20:00:27 CONTEXT: writing block 94218 of relation > 16712/16713/16725 > 2008-08-23 20:04:36 DETAIL: Multiple failures --- write error may be > permanent. Yeah, I'm really wondering about the SAN and SAN connection. What sort of SAN is it? How is the host connected? Does it have any sort of logging and monitoring that might let you see if there was a problem around the time Pg was complaining? Have you checked the Windows error logs? -- Craig Ringer
Craig Ringer wrote: > William Garrison wrote: > >> I fear I have a corrupted database, and I'm not sure what to do. >> > > First, make sure you have a recent backup. If your backups rotate, stop > the rotation so that all currently available historical copies of the > database are preserved from now on - just in case you need them. > Since I made my post, we found that we can't do a pg_dump. :( Every time this error appears in the logs, postgres forcably closes any connections (including any running instances of pgadmin or pg_dump) when it runs this little recovery process. We have backups from some days ago plus transaction logs. We also have a snapshot of the file system, and I'm hoping to find a way to attach that onto another system. I've had trouble with that in the past. As for the SAN and the Windows event log: Our IT guy says the SAN reported no failures at the time. I don't know much about the SAN itself, I just know it uses dual fiber-channels and all the drives are in some RAID array. I think it also is hardened against nuclear strikes and has a built-in laser defense system. At the time of the problem, the Windows event log indicates no problems writing to the drives, or any other failures of any kind really. No other apps crashed, no unusual memory usage, plenty of disk space. So the cause is a complete mystery. :( So for now, I'm focused on repair. We tried to REINDEX each table, and we are getting duplicate key errors so the reindex fails. I can fix those records manually, but I was hoping to dump the database, find the duplicates using another system, then delete/repair the bad records and restore onto the production machine. But since the backup/restore isn't working, that isn't looking like a viable option. Are there any kind of repair tools for a postgres database? Any sort of routine where I can take it offline and run like pg_fsck --all and it will come back with a report or a repair procedure? > Now, if possible dump your database with pg_dump. Restore the dump to a > test database instance and make sure that it all goes OK. > > Once that's done, so you know you have a decent recovery point to work > from in case you make a mistake during your recovery efforts. > > After that I don't have all that much to offer, especially as you're > using an operating system I don't have much experience with Pg on and > you're using an (unspecified) SAN. > > Normally I'd ask if you'd verified your RAID array / tested your disks. > In this case, I'm wondering if there's any chance there was a service > interruption on the SAN that might've caused some sort of intermittent > or partial writes. > > >> 2008-08-23 20:00:27 ERROR: xlog flush request E0/293CF278 is not >> satisfied --- flushed only to E0/21B1B7F0 >> 2008-08-23 20:00:27 CONTEXT: writing block 94218 of relation >> 16712/16713/16725 >> 2008-08-23 20:04:36 DETAIL: Multiple failures --- write error may be >> permanent. >> > > Yeah, I'm really wondering about the SAN and SAN connection. What sort > of SAN is it? How is the host connected? Does it have any sort of > logging and monitoring that might let you see if there was a problem > around the time Pg was complaining? > > Have you checked the Windows error logs? > > -- > Craig Ringer > >
On Wed, Aug 27, 2008 at 01:45:43PM -0400, William Garrison wrote: > Since I made my post, we found that we can't do a pg_dump. :( Every > time this error appears in the logs, postgres forcably closes any > connections (including any running instances of pgadmin or pg_dump) when > it runs this little recovery process. We have backups from some days > ago plus transaction logs. We also have a snapshot of the file system, > and I'm hoping to find a way to attach that onto another system. I've > had trouble with that in the past. You're going to have to be more specific. What do you mean by "this error"? It is possible to startup postgresql such that it will not use any system indexes. > Are there any kind of repair tools for a postgres database? Any sort of > routine where I can take it offline and run like pg_fsck --all and it > will come back with a report or a repair procedure? There is no tools that do fixing, only the DB server itself. If you can't get it to work within postgresql, then pgfsck can attempt to do a raw data dump. It doesn't guarentee the integrity of the data but it may be able to get your data out. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Attachment
Wait... there really is a pgfsck...? I just made that up as an example of something I wanted. Great! And... how would I tell postgres to start without using any indexes?
Martijn van Oosterhout wrote:
Martijn van Oosterhout wrote:
On Wed, Aug 27, 2008 at 01:45:43PM -0400, William Garrison wrote:Are there any kind of repair tools for a postgres database? Any sort of routine where I can take it offline and run like pg_fsck --all and it will come back with a report or a repair procedure?You're going to have to be more specific. What do you mean by "this error"? It is possible to startup postgresql such that it will not use any system indexes. There is no tools that do fixing, only the DB server itself. If you can't get it to work within postgresql, then pgfsck can attempt to do a raw data dump. It doesn't guarentee the integrity of the data but it may be able to get your data out. Have a nice day,
William Garrison wrote: > Wait... there really is a pgfsck...? I just made that up as an example > of something I wanted. I had no idea either, but it does look like it: http://svana.org/kleptog/pgsql/pgfsck.html It's a perl script, so you may have a decent chance of getting it going on win32. It doesn't appear to officially support 8.3, but that doesn't look like it'll be an issue for you. Interesting that it doesn't appear to be on pgfoundry: http://pgfoundry.org/search/?type_of_search=soft&words=fsck&Search=Search http://pgfoundry.org/search/?type_of_search=soft&words=pgfsck&Search=Search > Great! And... how would I tell postgres to > start without using any indexes? http://www.postgresql.org/docs/8.2/static/app-postgres.html On Windows, you'd probably stop the PostgreSQL service then invoke the PostgreSQL server (postgres.exe) manually from a cmd.exe shell, using runas.exe to run it under the postgres user ID. If that's right, it'd be nice if you could reply with the exact command line you land up using so the documentation can be updated to show the appropriate one-liner for Windows users to put the server in recovery mode. -- Craig Ringer