Corruption Debug Help. - Mailing list pgsql-admin

From Matthew Sellers
Subject Corruption Debug Help.
Msg-id CACMbGu3Xx6P2TwjUuG0-tzTtiU2MV31jbqmsZ=YywSoiX4jXvg@mail.gmail.com
List pgsql-admin
Hi All,

I believe I may have hit a Postgres bug and am eager for a bit of
feedback.  It seems we may have had some type of catalog corruption,
as outlined in the events pasted below.  I am including our
observations of the problem, and am asking the list whether there are
any further diagnostics or root cause analysis I can perform.


# During normal database operations I received this error while a cron
job issued a COPY on a temporary table. Further SELECTs on this table
yielded the same error. This table is toast :-)

2011-10-26 20:32:25.603 CDT helios 172.20.45.57(34663)ERROR:  could
not read block 355 in file "base/16421/286173855": read only 0 of 8192
bytes
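
To make the numbers in that message concrete, here is a minimal sketch of the arithmetic, assuming the 8192-byte block size that the error itself reports and that block 355 lives in the relation's first segment file. The helper names are hypothetical, not anything from Postgres. A read that returns 0 bytes at that offset suggests the file ends at or before the start of block 355.

```python
BLOCK_SIZE = 8192  # bytes per block, as reported in the error message


def block_byte_range(block_number, block_size=BLOCK_SIZE):
    """Return the (start, end) byte offsets of a block within its file."""
    start = block_number * block_size
    return start, start + block_size


def blocks_in_file(file_size, block_size=BLOCK_SIZE):
    """Return how many whole blocks a file of file_size bytes holds."""
    return file_size // block_size


start, end = block_byte_range(355)
print(start, end)  # 2908160 2916352
```

Comparing `blocks_in_file(os.path.getsize(path))` for base/16421/286173855 against 356 would show how many blocks short the file is, if that is indeed what happened.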

# Next we attempted to configure a hot standby server to replicate and
test possible corruption issues.  After rsyncing $PG_HOME and starting
up the read-only slave, I received this error. The file 'global/11595'
does not exist on the slave or the master, further supporting the
theory of data corruption.


2011-10-31 09:31:03.682 CDT  LOG:  streaming replication successfully
connected to primary
2011-10-31 09:31:04.976 CDT postgres [local]FATAL:  could not open
file "global/11595": Permission denied
2011-10-31 09:31:21.981 CDT postgres [local]FATAL:  could not access
status of transaction 65536
2011-10-31 09:31:21.981 CDT postgres [local]DETAIL:  Could not read
from file "pg_clog/0000" at offset 16384: Success.
2011-10-31 10:55:48.800 CDT helios [local]FATAL:  could not access
status of transaction 65536
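
The pg_clog numbers above are at least self-consistent, which can be checked with a short sketch. It assumes the stock layout as I understand it: 2 status bits per transaction, 8192-byte pages, and 32 pages per segment file. The constant names mirror the Postgres source, but the function itself is hypothetical.

```python
BLCKSZ = 8192                                        # bytes per clog page
CLOG_BITS_PER_XACT = 2                               # status bits per transaction
CLOG_XACTS_PER_BYTE = 8 // CLOG_BITS_PER_XACT        # 4 transactions per byte
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE   # 32768 transactions per page
SLRU_PAGES_PER_SEGMENT = 32                          # pages per segment file


def clog_location(xid):
    """Return (segment file name, byte offset of the page) for a transaction id."""
    page = xid // CLOG_XACTS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    offset = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
    return "%04X" % segment, offset


print(clog_location(65536))  # ('0000', 16384)
```

Transaction 65536 falls on the third page of segment 0000, matching the offset in the log. The trailing "Success" presumably means the read came back short without an OS error, i.e. pg_clog/0000 on the standby may be smaller than 24576 bytes.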


As a final test we ran a pg_dump on the master, which completed
successfully, and we are currently restoring the dump to another machine.
The restore has not yielded any errors so far but is far from complete
given my database size.  I am running Postgres 9.0.4 on high-end hardware
(machine + SAN) and have no indication of hardware-related data loss,
so next I'm digging into understanding the inner workings of the
PostgreSQL on-disk format.

If anyone can suggest how to properly diagnose this type of issue it
would be greatly appreciated.

Thanks!
Matt
