corruption diag/recovery, pg_dump crash - Mailing list pgsql-general

From Ed L.
Subject corruption diag/recovery, pg_dump crash
Msg-id 200312061430.37819.pgsql@bluepolka.net
List pgsql-general

We are seeing what looks like pgsql data file corruption across multiple
clusters on a RAID5 partition on a single redhat linux 2.4 server running
7.3.4.  System has ~20 clusters installed with a mix of 7.2.3, 7.3.2, and
7.3.4 (mostly 7.3.4), 10gb ram, 76gb on a RAID5, dual cpus, and very busy
with hundreds and sometimes > 1000 simultaneous connections.  After ~250
days of continuous, flawless uptime operations, we recently began seeing
major performance degradation accompanied by messages like the following:

    ERROR:  Invalid page header in block NN of some_relation (10-15 instances)

    ERROR:  XLogFlush: request 38/5E659BA0 is not satisfied ... (1 instance,
    repeated many times)

I think I've been able to repair most of the "Invalid page header" errors by
rebuilding indices or truncating and reloading table data.  The XLogFlush
error was occurring for a particular index, and a drop/reload has at least
stopped that error.  Now a pg_dump error is occurring on one cluster,
preventing a successful dump.  Of course, the corruption went unnoticed long
enough to roll over our good online backups, and the bazillion-dollar
offline/offsite backup system wasn't working properly.  Here's the pg_dump
output, edited to protect the guilty:

pg_dump: PANIC:  open of .../data/pg_clog/04E5 failed: No such file or
directory
pg_dump: lost synchronization with server, resetting connection
pg_dump: WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted ... blah blah
pg_dump: SQL command to dump the contents of table "sometable" failed:
PQendcopy() failed.
pg_dump: Error message from server: server closed the connection
unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
pg_dump: The command was: COPY public.sometable ("key", ...) TO stdout;
pg_dumpall: pg_dump failed on somedb, exiting

Why that 04E5 file is missing, I haven't a clue.  I've attached an "ls -l"
for the pg_clog dir.
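
As a rough aid for reading that PANIC, the sketch below maps a pg_clog
segment file name to the range of transaction IDs it would cover, and lists
gaps in a directory listing.  It assumes the usual clog layout of 0x100000
transactions per segment file (2 status bits per transaction, 8 kB pages, 32
pages per segment); the function names are made up for illustration and are
not part of PostgreSQL.

```python
# Assumption: each pg_clog segment file covers 0x100000 transaction IDs
# (2 bits per transaction, 8 kB pages, 32 pages per segment file).
XACTS_PER_SEGMENT = 0x100000


def segment_xid_range(name):
    """Return (first_xid, last_xid) covered by a segment file such as '04E5'."""
    first = int(name, 16) * XACTS_PER_SEGMENT
    return first, first + XACTS_PER_SEGMENT - 1


def segment_for_xid(xid):
    """Return the segment file name that would hold the status of this XID."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)


def missing_segments(names):
    """List segment names absent between the lowest and highest present ones."""
    present = {int(n, 16) for n in names}
    if not present:
        return []
    return ["%04X" % s
            for s in range(min(present), max(present) + 1)
            if s not in present]


if __name__ == "__main__":
    first, last = segment_xid_range("04E5")
    print("04E5 covers XIDs %d..%d" % (first, last))
```

Comparing that range with the file names actually present in pg_clog shows
whether the requested segment falls in a gap between existing files or lies
well outside them.  The latter may point at a bogus transaction ID stored in
a damaged page rather than at a file that was deleted.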

Past list discussions suggest this may be an elusive hardware issue.  We did
find a msg in /var/log/messages...

    kernel: ISR called reentrantly!!

which some here have found newsgroup reports linking to some sort of
RAID/BIOS issue.  We've taken the machine offline and run extensive hardware
diagnostics on the RAID controller, filesystem (fsck), and RAM, and found
no further indication of hardware failure.  The machine had run flawlessly
for these ~20 clusters for ~250 days until cratering yesterday amidst these
errors and absurd system (disk) IO sluggishness.  After the reboot and
upgrades, the machine continues to exhibit infrequent (or infrequently
discovered) corruption.  On the advice of the hardware vendor's (Dell)
support folks, we've upgraded our kernel (now 2.4.20-24.7bigmem), several
drivers, and the RAID controller firmware, rebooted, etc.  The disk IO
sluggishness has largely diminished, but we're still seeing the Invalid page
header error pop up anew, albeit infrequently.  The XLogFlush error seems to
have gone away with the reconstruction of an index.

Current plan is to get as much data recovered as possible, and then do
significant hardware replacements (along with more frequent planned reboots
and more vigilant backups).
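
One way to approach the "recover as much as possible" step is to dump each
table on its own, so that a single damaged table does not abort the whole
dump.  The sketch below is only an illustration under that assumption: it
calls pg_dump with its -t (table) and -f (output file) options once per
table and records which tables fail.  The helper name, the output file
naming, and the table list are hypothetical.

```python
import os
import subprocess


def dump_tables(dbname, tables, outdir=".", run=subprocess.run):
    """Dump each table to its own file; return (succeeded, failed) lists.

    `run` is injectable so the loop can be exercised without a server.
    """
    succeeded, failed = [], []
    for table in tables:
        # Hypothetical naming scheme: <outdir>/<dbname>.<table>.sql
        outfile = os.path.join(outdir, "%s.%s.sql" % (dbname, table))
        cmd = ["pg_dump", "-t", table, "-f", outfile, dbname]
        result = run(cmd)
        if result.returncode == 0:
            succeeded.append(table)
        else:
            failed.append(table)
    return succeeded, failed
```

Calling `dump_tables("somedb", ["sometable", "othertable"])` would leave one
file per table that dumped cleanly and a list of the tables that still need
manual attention.  Since a backend PANIC resets all connections, a pause or
retry between attempts may be needed in practice.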

Any clues/suggestions for recovering this data or fixing other issues would
be greatly appreciated.

TIA.
