Re: corruption diag/recovery, pg_dump crash - Mailing list pgsql-general
From: Ed L.
Subject: Re: corruption diag/recovery, pg_dump crash
Msg-id: 200312061445.40643.pgsql@bluepolka.net
In response to: corruption diag/recovery, pg_dump crash ("Ed L." <pgsql@bluepolka.net>)
List: pgsql-general
Maybe worth mentioning: the system has one 7.2.3 cluster, five 7.3.2
clusters, and twelve 7.3.4 clusters, all with data on the same
partition/device, and all corruption has occurred on only five of the
twelve 7.3.4 clusters.  TIA.

On Saturday December 6 2003 2:30, Ed L. wrote:
> We are seeing what looks like pgsql data file corruption across multiple
> clusters on a RAID5 partition on a single redhat linux 2.4 server running
> 7.3.4.  The system has ~20 clusters installed with a mix of 7.2.3, 7.3.2,
> and 7.3.4 (mostly 7.3.4), 10gb ram, 76gb on a RAID5, dual cpus, and is
> very busy with hundreds and sometimes > 1000 simultaneous connections.
> After ~250 days of continuous, flawless operation, we recently began
> seeing major performance degradation accompanied by messages like the
> following:
>
>     ERROR: Invalid page header in block NN of some_relation
>     (10-15 instances)
>
>     ERROR: XLogFlush: request 38/5E659BA0 is not satisfied ...
>     (1 instance, repeated many times)
>
> I think I've been able to repair most of the "Invalid page header" errors
> by rebuilding indices or truncating/reloading table data.  The XLogFlush
> error was occurring for a particular index, and a drop/reload has at
> least stopped that error.  Now, a pg_dump error is occurring on one
> cluster, preventing a successful dump.  Of course, it went unnoticed long
> enough to roll over our good online backups, and the bazillion-dollar
> offline/offsite backup system wasn't working properly.  Here's the
> pg_dump output, edited to protect the guilty:
>
>     pg_dump: PANIC: open of .../data/pg_clog/04E5 failed: No such file
>     or directory
>     pg_dump: lost synchronization with server, resetting connection
>     pg_dump: WARNING: Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted ... blah blah
>     pg_dump: SQL command to dump the contents of table "sometable"
>     failed: PQendcopy() failed.
>     pg_dump: Error message from server: server closed the connection
>     unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
>     pg_dump: The command was: COPY public.sometable ("key", ...) TO stdout;
>     pg_dumpall: pg_dump failed on somedb, exiting
>
> Why that 04E5 file is missing, I haven't a clue.  I've attached an
> "ls -l" for the pg_clog dir.
>
> Past list discussions suggest this may be an elusive hardware issue.
> We did find a msg in /var/log/messages...
>
>     kernel: ISR called reentrantly!!
>
> ...which some here have found newsgroup reports linking to some sort of
> raid/bios issue.  We've taken the machine offline and conducted extensive
> hardware diagnostics on the RAID controller, filesystem (fsck), and RAM,
> and found no further indication of hardware failure.  The machine had run
> flawlessly for these ~20 clusters for ~250 days until cratering yesterday
> amidst these errors and absurd system (disk) IO sluggishness.  Upon
> reboot and upgrades, the machine continues to exhibit infrequent
> corruption (or at least infrequently discovered corruption).  On the
> advice of the hardware vendor's (Dell) support folks, we've upgraded our
> kernel (now 2.4.20-24.7bigmem), several drivers, and the raid controller
> firmware, rebooted, etc.  The disk IO sluggishness has largely
> diminished, but we're still seeing the Invalid page header error pop up
> anew, albeit infrequently.  The XLogFlush error seems to have gone away
> with the reconstruction of an index.
> Current plan is to get as much data recovered as possible, and then do
> significant hardware replacements (along with more frequent planned
> reboots and more vigilant backups).
>
> Any clues/suggestions for recovering this data or fixing other issues
> would be greatly appreciated.
>
> TIA.
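Following up on the missing clog file: if 04E5 is truly gone, one
workaround I've seen suggested for this kind of thing is to stuff in a
dummy, zero-filled segment just to get past the PANIC long enough to dump.
A rough sketch only, not something I've verified, with $PGDATA standing in
for the real data directory:

    # Stop the postmaster and take a file-level copy of the cluster before
    # touching anything.
    pg_ctl -D $PGDATA stop
    tar czf /somewhere/safe/data-backup.tar.gz $PGDATA

    # Create a dummy clog segment of the right size (32 x 8k pages = 256kB)
    # so the backend finds something to read instead of PANICing.  All-zero
    # status bytes make the affected transactions look uncommitted, so rows
    # they touched may silently disappear -- sanity-check whatever gets
    # dumped afterwards.
    dd if=/dev/zero of=$PGDATA/pg_clog/04E5 bs=8k count=32

    pg_ctl -D $PGDATA start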
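For the "recover as much as possible" step, one way to go at it would be
table-by-table dumps, so a single crashing COPY doesn't take down the
whole dump, plus pulling readable rows out of the damaged table by key
range.  Again just a sketch; the goodtableN names, the scratch table, and
the key cutoff below are made up for illustration:

    # Rebuild suspect indexes first; that's what cleared most of the
    # "Invalid page header" errors here when the bad block was in an index.
    psql -d somedb -c 'REINDEX TABLE sometable;'

    # Dump table-by-table so one failure doesn't kill everything.
    for t in goodtable1 goodtable2 goodtable3; do
        pg_dump -t "$t" somedb > "$t.sql" || echo "dump of $t failed" >&2
    done

    # For the table whose COPY crashes the backend, stage the readable rows
    # into a scratch table by key range, skipping around the damaged block
    # (7.3 has no COPY-from-a-query, hence the intermediate table), then
    # dump the scratch table normally.
    psql -d somedb -c "CREATE TABLE sometable_salvage AS
                         SELECT * FROM sometable WHERE key < 1000000;"
    pg_dump -t sometable_salvage somedb > sometable_salvage.sql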