Re: corruption diag/recovery, pg_dump crash - Mailing list pgsql-general

From Ed L.
Subject Re: corruption diag/recovery, pg_dump crash
Date
Msg-id 200312061445.40643.pgsql@bluepolka.net
In response to corruption diag/recovery, pg_dump crash  ("Ed L." <pgsql@bluepolka.net>)
List pgsql-general
It may be worth mentioning that the system has one 7.2.3 cluster, five 7.3.2
clusters, and twelve 7.3.4 clusters, all with data on the same
partition/device, and that all of the corruption has occurred on only five
of the twelve 7.3.4 clusters.

TIA.

On Saturday December 6 2003 2:30, Ed L. wrote:
> We are seeing what looks like pgsql data file corruption across multiple
> clusters on a RAID5 partition on a single Red Hat Linux 2.4 server running
> 7.3.4.  The system has ~20 clusters installed with a mix of 7.2.3, 7.3.2,
> and 7.3.4 (mostly 7.3.4), 10 GB RAM, 76 GB on a RAID5, and dual CPUs, and
> is very busy, with hundreds and sometimes > 1000 simultaneous connections.
> After ~250 days of continuous, flawless operation, we recently began seeing
> major performance degradation accompanied by messages like the following:
>
>     ERROR:  Invalid page header in block NN of some_relation
>     (10-15 instances)
>
>     ERROR:  XLogFlush: request 38/5E659BA0 is not satisfied ...
>     (1 instance repeated many times)
>
> I think I've been able to repair most of the "Invalid page header" errors
> by rebuilding indices or truncating/reloading table data.  The XLogFlush
> error was occurring for a particular index, and a drop/reload has at least
> stopped that error.  Now, a pg_dump error is occurring on one cluster,
> preventing a successful dump.  Of course, it's gone unnoticed long enough
> to roll over our good online backups, and the bazillion-dollar
> offline/offsite backup system wasn't working properly.  Here's the
> pg_dump output, edited to protect the guilty:
>
> pg_dump: PANIC:  open of .../data/pg_clog/04E5 failed: No such file or directory
> pg_dump: lost synchronization with server, resetting connection
> pg_dump: WARNING:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted ... blah blah
> pg_dump: SQL command to dump the contents of table "sometable" failed: PQendcopy() failed.
> pg_dump: Error message from server: server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> pg_dump: The command was: COPY public.sometable ("key", ...) TO stdout;
> pg_dumpall: pg_dump failed on somedb, exiting
>
> Why that 04E5 file is missing, I haven't a clue.  I've attached an "ls
> -l" for the pg_clog dir.
>
> Past list discussions suggest this may be an elusive hardware issue.  We
> did find a msg in /var/log/messages...
>
>     kernel: ISR called reentrantly!!
>
> which some here have found linked, in newsgroup reports, to some sort
> of RAID/BIOS issue.  We've taken the machine offline and conducted
> extensive hardware diagnostics on the RAID controller, filesystem (fsck),
> and RAM, and found no further indication of hardware failure.  The machine
> had run flawlessly for these ~20 clusters for ~250 days until cratering
> yesterday amidst these errors and absurd system (disk) IO sluggishness.
> Since the reboot and upgrades, the machine continues to exhibit infrequent
> (or infrequently discovered) corruption.  On the advice of the hardware
> vendor's (Dell) support folks, we've upgraded our kernel (now
> 2.4.20-24.7bigmem), several drivers, and the RAID controller firmware,
> rebooted, etc.  The disk IO sluggishness has largely diminished, but we're
> still seeing the Invalid page header error pop up anew, albeit
> infrequently.  The XLogFlush error seems to have gone away with the
> reconstruction of an index.
>
> Current plan is to get as much data recovered as possible, and then do
> significant hardware replacements (along with more frequent planned
> reboots and more vigilant backups).
>
> Any clues/suggestions for recovering this data or fixing other issues
> would be greatly appreciated.
>
> TIA.
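
For keeping track of which relations and blocks the "Invalid page header" errors quoted above have touched, something like the following might help. It is only a rough sketch: the regular expression is modelled on the one message format quoted in this thread, the function name damaged_blocks is made up for illustration, and the sample lines are hypothetical, not taken from the real logs.

```python
import re

# Pattern follows the message text quoted in the thread:
#   ERROR:  Invalid page header in block NN of some_relation
PAGE_ERR = re.compile(r"ERROR:\s+Invalid page header in block (\d+) of (\S+)")


def damaged_blocks(log_lines):
    """Map relation name -> sorted list of distinct block numbers reported."""
    found = {}
    for line in log_lines:
        match = PAGE_ERR.search(line)
        if match:
            block_no, relation = int(match.group(1)), match.group(2)
            found.setdefault(relation, set()).add(block_no)
    return {rel: sorted(blocks) for rel, blocks in found.items()}


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "ERROR:  Invalid page header in block 12 of rel_a",
        "ERROR:  Invalid page header in block 7 of rel_a",
        "ERROR:  Invalid page header in block 3 of rel_b",
    ]
    for relation, blocks in sorted(damaged_blocks(sample).items()):
        print(relation, blocks)
```

Feeding it the lines of each cluster's server log (any iterable of strings works, including an open file) would give a per-relation list to compare before and after the hardware changes.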


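On the missing pg_clog/04E5 file in the PANIC above: one way to judge whether that segment could plausibly exist is to work out which transaction IDs it would cover and compare that with the files actually present in pg_clog. The sketch below assumes the default 8 kB block size, 2 status bits per transaction, and 32 pages per segment file; none of that is stated in the thread, so treat the constants as assumptions and check them against the source for the version in question. The function names are made up for illustration.

```python
# Assumptions (not stated in the thread): default 8 kB block size, 2 status
# bits per transaction (4 per byte), and 32 pages per pg_clog segment file.
BLOCK_SIZE = 8192
XACTS_PER_BYTE = 4
PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = BLOCK_SIZE * XACTS_PER_BYTE * PAGES_PER_SEGMENT


def segment_xid_range(segment_name):
    """Return (first_xid, last_xid) covered by a pg_clog segment file name."""
    segment_no = int(segment_name, 16)
    first = segment_no * XACTS_PER_SEGMENT
    return first, first + XACTS_PER_SEGMENT - 1


def segment_for_xid(xid):
    """Return the 4-hex-digit pg_clog file name that would hold this xid."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)


if __name__ == "__main__":
    first, last = segment_xid_range("04E5")
    print("04E5 would cover xids %d..%d" % (first, last))
```

Under those assumptions, 04E5 would hold transaction IDs from roughly 1.31 billion upward. If the files listed in pg_clog are all far below that, the backend may be chasing a garbage xid read from a damaged page rather than a segment that was really lost, though that is a guess and not something this arithmetic can confirm.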