corruption diag/recovery, pg_dump crash - Mailing list pgsql-general
From | Ed L. |
---|---|
Subject | corruption diag/recovery, pg_dump crash |
Date | |
Msg-id | 200312061430.37819.pgsql@bluepolka.net |
Responses | Re: corruption diag/recovery, pg_dump crash<br>Re: corruption diag/recovery, pg_dump crash<br>Re: corruption diag/recovery, pg_dump crash |
List | pgsql-general |
We are seeing what looks like pgsql data file corruption across multiple clusters on a RAID5 partition on a single Red Hat Linux 2.4 server running 7.3.4. The system has ~20 clusters installed with a mix of 7.2.3, 7.3.2, and 7.3.4 (mostly 7.3.4), 10GB RAM, 76GB on a RAID5, and dual CPUs, and is very busy with hundreds and sometimes > 1000 simultaneous connections. After ~250 days of continuous, flawless uptime, we recently began seeing major performance degradation accompanied by messages like the following:

    ERROR: Invalid page header in block NN of some_relation     (10-15 instances)
    ERROR: XLogFlush: request 38/5E659BA0 is not satisfied ...  (1 instance, repeated many times)

I think I've been able to repair most of the "Invalid page header" errors by rebuilding indices or truncating/reloading table data (rough commands are sketched at the end of this message). The XLogFlush error was occurring for a particular index, and a drop/reload has at least stopped that error.

Now a pg_dump error is occurring on one cluster, preventing a successful dump. Of course, it went unnoticed long enough for our good online backups to roll over, and the bazillion-dollar offline/offsite backup system wasn't working properly. Here's the pg_dump output, edited to protect the guilty:

    pg_dump: PANIC: open of .../data/pg_clog/04E5 failed: No such file or directory
    pg_dump: lost synchronization with server, resetting connection
    pg_dump: WARNING: Message from PostgreSQL backend:
             The Postmaster has informed me that some other backend
             died abnormally and possibly corrupted ... blah blah
    pg_dump: SQL command to dump the contents of table "sometable" failed:
             PQendcopy() failed.
    pg_dump: Error message from server: server closed the connection unexpectedly
             This probably means the server terminated abnormally
             before or while processing the request.
    pg_dump: The command was: COPY public.sometable ("key", ...) TO stdout;
    pg_dumpall: pg_dump failed on somedb, exiting

Why that 04E5 file is missing, I haven't a clue. I've attached an "ls -l" of the pg_clog dir.

Past list discussions suggest this may be an elusive hardware issue. We did find a message in /var/log/messages,

    kernel: ISR called reentrantly!!

which some here have found newsgroup reports connecting to some sort of RAID/BIOS issue. We've taken the machine offline and run extensive hardware diagnostics on the RAID controller, the filesystem (fsck), and RAM, and found no further indication of hardware failure. The machine had run flawlessly with these ~20 clusters for ~250 days until cratering yesterday amidst these errors and absurd system (disk) IO sluggishness. Since the reboot and upgrades, the machine continues to exhibit infrequent corruption (or at least corruption that is discovered infrequently).

On the advice of the hardware vendor's (Dell) support folks, we've upgraded our kernel (now 2.4.20-24.7bigmem), several drivers, and the RAID controller firmware, rebooted, etc. The disk IO sluggishness has largely diminished, but we're still seeing "Invalid page header" errors pop up anew, albeit infrequently. The XLogFlush error seems to have gone away with the reconstruction of an index.

The current plan is to recover as much data as possible, then do significant hardware replacements (along with more frequent planned reboots and more vigilant backups). Any clues/suggestions for recovering this data or fixing other issues would be greatly appreciated. TIA.
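For reference, here's roughly what the "rebuild or truncate/reload" repairs look like on our end. This is only a sketch: the index, table, and database names are placeholders, and the data-only dump may of course die on the same bad pages it is trying to route around:

    # rebuild a suspect index, or all indexes on a table (stock 7.3 REINDEX)
    psql somedb -c 'REINDEX INDEX some_index;'
    psql somedb -c 'REINDEX TABLE some_relation;'

    # or: salvage the table's rows, truncate, and reload
    pg_dump -a -t sometable somedb > sometable_data.sql    # -a = data only
    psql somedb -c 'TRUNCATE TABLE sometable;'
    psql somedb -f sometable_data.sql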
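On the missing 04E5 clog segment: one last-resort hack I've seen suggested in past list threads is to stub in a zero-filled segment so the backends can read past it. As I understand it, zeroed status bits read back as "transaction in progress", so rows written by those transactions would look uncommitted; I'd only try this on a copy of the data directory, with the postmaster stopped. Assuming a full 256kB segment (the "..." below stands for the real path, elided as in the error above):

    # postmaster stopped; 32 x 8kB pages = one 256kB clog segment
    dd if=/dev/zero of=.../data/pg_clog/04E5 bs=8k count=32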
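And for the "recover as much data as possible" step, the plan is to dump table-by-table so one bad table doesn't abort the whole run, along these lines (the loop and file naming are just illustrative):

    # dump each public-schema table separately, logging the ones that fail
    for t in $(psql -t -A -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" somedb)
    do
        pg_dump -t "$t" somedb > "somedb_$t.dump" || echo "FAILED: $t" >> failed_tables.txt
    done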