Thread: corruption diag/recovery, pg_dump crash

corruption diag/recovery, pg_dump crash

From: "Ed L."
We are seeing what looks like pgsql data file corruption across multiple
clusters on a RAID5 partition on a single Red Hat Linux 2.4 server running
7.3.4.  The system has ~20 clusters installed with a mix of 7.2.3, 7.3.2, and
7.3.4 (mostly 7.3.4), 10GB RAM, 76GB on RAID5, dual CPUs, and is very busy
with hundreds and sometimes > 1000 simultaneous connections.  After ~250
days of continuous, flawless operation, we recently began seeing major
performance degradation accompanied by messages like the following:

    ERROR:  Invalid page header in block NN of some_relation (10-15 instances)

    ERROR:  XLogFlush: request 38/5E659BA0 is not satisfied ... (1 instance
repeated many times)

I think I've been able to repair most of the "Invalid page header" errors by
rebuilding indices or truncating/reloading table data.  The XLogFlush error
was occurring for a particular index, and a drop/reload has at least stopped
that error.  Now, a pg_dump error is occurring on one cluster, preventing a
successful dump.  Of course, it went unnoticed long enough to roll over
our good online backups, and the bazillion-dollar offline/offsite backup
system wasn't working properly.  Here's the pg_dump output, edited to
protect the guilty:

pg_dump: PANIC:  open of .../data/pg_clog/04E5 failed: No such file or
directory
pg_dump: lost synchronization with server, resetting connection
pg_dump: WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted ... blah blah
pg_dump: SQL command to dump the contents of table "sometable" failed:
PQendcopy() failed.
pg_dump: Error message from server: server closed the connection
unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
pg_dump: The command was: COPY public.sometable ("key", ...) TO stdout;
pg_dumpall: pg_dump failed on somedb, exiting

Why that 04E5 file is missing, I haven't a clue.  I've attached an "ls -l"
for the pg_clog dir.

Past list discussions suggest this may be an elusive hardware issue.  We did
find a msg in /var/log/messages...

    kernel: ISR called reentrantly!!

which some here have found newsgroup reports connecting to some sort of
RAID/BIOS issue.  We've taken the machine offline and conducted extensive
hardware diagnostics on the RAID controller, filesystem (fsck), and RAM, and
found no further indication of hardware failure.  The machine had run
flawlessly for these ~20 clusters for ~250 days until cratering yesterday
amidst these errors and absurd system (disk) IO sluggishness.  Since the
reboot and upgrades, the machine continues to exhibit infrequent corruption
(or at least infrequently discovered corruption).  On the advice of the
hardware vendor's (Dell's) support folks, we've upgraded our kernel (now
2.4.20-24.7bigmem), several drivers, and the RAID controller firmware,
rebooted, etc.  The disk IO sluggishness has largely diminished, but we're
still seeing the Invalid page header errors pop up anew, albeit
infrequently.  The XLogFlush error seems to have gone away with the
reconstruction of an index.

Current plan is to get as much data recovered as possible, and then do
significant hardware replacements (along with more frequent planned reboots
and more vigilant backups).

Any clues/suggestions for recovering this data or fixing other issues would
be greatly appreciated.

TIA.


Re: corruption diag/recovery, pg_dump crash

From: "Ed L."
Maybe worth mentioning that the system has one 7.2.3 cluster, five 7.3.2
clusters, and twelve 7.3.4 clusters, all with data on the same
partition/device, and all corruption has occurred on only five of the twelve
7.3.4 clusters.

TIA.

On Saturday December 6 2003 2:30, Ed L. wrote:
> We are seeing what looks like pgsql data file corruption across multiple
> clusters on a RAID5 partition on a single Red Hat Linux 2.4 server running
> 7.3.4.
>
> [remainder of the original message quoted in full; trimmed]


Re: corruption diag/recovery, pg_dump crash

From: Martijn van Oosterhout
While I can't help you with most of your message, the pg_clog problem is an
easier one.  Basically, creating a file with that name, filled with 256KB of
zeros, will let postgres complete the dump.
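
Something along these lines should do it (just a sketch: the segment size
and the pgdba:pg ownership are taken from your attached pg_clog listing,
$PGDATA stands in for your .../data directory, and it's safest to create
the file with the postmaster stopped):

$ # 256KB of zeros = 262144 bytes, the same size as your existing segments;
$ # a zero-filled segment satisfies the clog lookup, so the COPY that
$ # currently PANICs on the missing file can run to completion
$ dd if=/dev/zero of=$PGDATA/pg_clog/04E5 bs=262144 count=1
$ chown pgdba:pg $PGDATA/pg_clog/04E5
$ chmod 600 $PGDATA/pg_clog/04E5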

*HOWEVER*, what this means is that one of the tuple headers in the database
refers to a nonexistent transaction.  So there is definitely some kind of
corruption going on there.

Hope this helps,

On Sat, Dec 06, 2003 at 02:30:37PM -0700, Ed L. wrote:
> pg_dump: PANIC:  open of .../data/pg_clog/04E5 failed: No such file or
> directory
>
> Why that 04E5 file is missing, I haven't a clue.  I've attached an "ls -l"
> for the pg_clog dir.
>
> [remainder of the original message quoted in full; trimmed]

> total 64336
> -rw-------    1 pgdba   pg        262144 Aug 12 18:39 0000
> -rw-------    1 pgdba   pg        262144 Aug 14 11:56 0001
> [... 244 more segments, 0002 through 00F5, each 262144 bytes, trimmed ...]
> -rw-------    1 pgdba   pg        262144 Dec  6 11:07 00F6
> -rw-------    1 pgdba   pg        114688 Dec  6 16:10 00F7



--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> "All that is needed for the forces of evil to triumph is for enough good
> men to do nothing." - Edmond Burke
> "The penalty good people pay for not being interested in politics is to be
> governed by people worse than themselves." - Plato


Re: corruption diag/recovery, pg_dump crash

From: Tom Lane
"Ed L." <pgsql@bluepolka.net> writes:
> Here's the pg_dump output, edited to=20
> protect the guilty:

> pg_dump: PANIC:  open of .../data/pg_clog/04E5 failed: No such file or=20
> directory

Given that this is far away from the range of valid clog segment names,
it seems safe to say that it's a symptom of a corrupted tuple header
(specifically, a whacked-out transaction ID number in some tuple
header).

You could probably track down the bad row (if there's only one or a few)
by expedients like seeing how far "SELECT ... FROM sometable LIMIT n"
will go without crashing.  Once you have identified where the bad row is
located, you could try to repair it, or just zero out the whole page if
you're willing to lose the other rows on the same page.  I would be
interested to see a pg_filedump dump of the corrupted page, if you go as
far as finding it.
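
For instance, something along these lines (a rough sketch only; the table
name and row counts are placeholders, and you'd want a filesystem copy of
the relation file before zeroing anything):

-- binary-search for the largest LIMIT that still succeeds; a plain
-- seqscan returns rows in physical order, so the ctid of the last row
-- returned puts you at or just before the damaged page
SELECT ctid FROM sometable LIMIT 100000;   -- backend crashes
SELECT ctid FROM sometable LIMIT 50000;    -- succeeds
SELECT ctid FROM sometable LIMIT 75000;    -- ...and so on

and, once the bad block number is known (postmaster stopped; 8192 is the
default block size):

$ dd if=/dev/zero of=<relation file> bs=8192 seek=<bad block> count=1 conv=notrunc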

(There are previous discussions of coping with corrupted data in the
mailing list archives.  Searching for references to pg_filedump should
turn up some useful threads.)

            regards, tom lane

Re: corruption diag/recovery, pg_dump crash

From: "Ed L."
On Monday December 8 2003 6:55, Ed L. wrote:
> On Saturday December 6 2003 4:43, Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > Here's the pg_dump output, edited to=20
> > > protect the guilty:
> > >
> > > pg_dump: PANIC:  open of .../data/pg_clog/04E5 failed: No such file
> > > or=20 directory
> >
> > Given that this is far away from the range of valid clog segment names,
> > it seems safe to say that it's a symptom of a corrupted tuple header
> > (specifically, a whacked-out transaction ID number in some tuple
> > header).
>
> I moved PGDATA to a new system due to catastrophic hardware failures
> (media and data errors on RAID5 + operator error when a tech pulled a
> hotswap disk without failing the drive first).  Now I am finally getting
> a good look at the corruption (which appears to have moved around during
> the scp):
>
>  $ psql -c "\d misc"
> ERROR:  _mdfd_getrelnfd: cannot open relation pg_depend_depender_index:
> No such file or directory

And note this from .../data/base/28607376:

$ oid2name -d mydb -t pg_depend_depender_index
Oid of table pg_depend_depender_index from database "mydb":
---------------------------------
16622  = pg_depend_depender_index
$ ls -l 16622
ls: 16622: No such file or directory

Any clues as to first steps at recovery?  Recovering from backup is
unfortunately not a very viable option.

Ed


Re: corruption diag/recovery, pg_dump crash

From: Tom Lane
"Ed L." <pgsql@bluepolka.net> writes:
> Now I am finally getting a good look at
> the corruption (which appears to have moved around during the scp):

Hm.  I don't see anything particularly exceptionable in pg_class page 11
--- rather a lot of dead tuples, but that's not proof of corruption.
To judge by your SELECT results, there are *no* live tuples in pg_class
between pages 11 and 543, and a bad page header in page 543.  What do
you see if you ask pg_filedump to dump all that page range?  (It'd be
a bit much to send to the list, but you can send it to me off-list.)
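
Something like this ought to do it (assuming pg_class still has its default
relfilenode, 1259, under .../data/base/28607376; check your pg_filedump's
help output for the exact switches in your version):

$ # page headers plus item details for blocks 11 through 543 of pg_class
$ pg_filedump -i -R 11 543 .../data/base/28607376/1259 > pgclass.dump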

            regards, tom lane

Re: corruption diag/recovery, pg_dump crash

From: Tom Lane
Ed Loehr <ed@LoehrTech.com> writes:
> This is pg_class; look at the ascii names on the right.  I notice that one
> name (misc_doctors) is repeated twice.

Sure, but one of those rows is committed dead.  Looks like a perfectly
ordinary case of a not-yet-vacuumed update to me.

            regards, tom lane

Re: corruption diag/recovery, pg_dump crash

From: Ed Loehr
On Monday December 8 2003 8:23, you wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > Now I am finally getting a good look at
> > the corruption (which appears to have moved around during the scp):
>
> Hm.  I don't see anything particularly exceptionable in pg_class page 11
> --- rather a lot of dead tuples, but that's not proof of corruption.
> To judge by your SELECT results, there are *no* live tuples in pg_class
> between pages 11 and 543, and a bad page header in page 543.  What do
> you see if you ask pg_filedump to dump all that page range?  (It'd be
> a bit much to send to the list, but you can send it to me off-list.)

This is pg_class; look at the ascii names on the right.  I notice that one
name (misc_doctors) is repeated twice.  We also have an error dumping that
table in which the end of the dump gets two BLANK tuples in the output,
causing the load to fail due to missing columns.  Is it possible that we
have two pg_class tuples with the same relname, and if so, is that
corruption?

Will send full dump...

TIA


Ed