On Mon, 2009-09-14 at 23:17 +0000, john martin wrote:
> All of a sudden we started seeing page header errors in certain queries.
Was there any particular event that marked the onset of these issues?
Anything in the system logs (dmesg / syslog etc) around that time?
For SATA disks: does smartctl (from the smartmontools package) show
anything interesting about the disk(s)? Ignore the overall "health
status"; it's a foul lie. Rely on the error log plus the vendor
attributes: reallocated sector count, current pending sector count,
offline uncorrectable sector count, etc.
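If it helps, here is a rough sketch of that attribute check in Python. It
assumes you have already saved the text printed by `smartctl -A` for the
drive; the function name, the watched set, and the sample table are made up
for illustration, and attribute names can vary by vendor, so treat it as a
starting point rather than a definitive check.

```python
import re

# Attribute names as smartctl commonly prints them for ATA/SATA drives.
# Assumption: your drive uses these names; some vendors label them differently.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def suspicious_attributes(smartctl_output):
    """Return {name: raw_value} for watched attributes whose raw value is > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        if name not in WATCHED:
            continue
        # The raw value may carry trailing annotations; take the leading digits.
        match = re.match(r"\d+", fields[9])
        if match and int(match.group()) > 0:
            found[name] = int(match.group())
    return found


# Made-up sample in the layout of `smartctl -A`, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
"""

print(suspicious_attributes(SAMPLE))
```

Any non-zero raw value here is worth a closer look, alongside the drive's
own error log (`smartctl -l error`).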
Was Pg forcibly killed and restarted, or the machine hard-reset? (This
_shouldn't_ cause data corruption, but might give some starting point
for looking for a bug).
> I am urging the community to investigate the possibility that it may not be
> hardware related, especially since it was first reported at least 5 years
> back.
If anything, the fact that it was first reported 5 years back makes it
_more_ likely to be hardware related. Bad hardware eats/scrambles some
of your data; Pg goes "oh crap, that page is garbage". People aren't
constantly getting their data eaten, though, despite the age of the
initial reports.
It's not turning up often. It's not turning up in cases where hardware
issues can be ruled out. There doesn't seem to be a strong pattern
associating the issue with a particular CPU / disk controller / drive etc
that would suggest Pg triggering a hardware bug, or a bug in Pg
triggered by a hardware quirk. People who hit it generally can't
reproduce it or trigger it again on demand.
Either it's a *really* rare and quirky bug that's really hard to
trigger, or it's a variety of hardware / disk issues.
If it's a really rare, quirky, hard-to-trigger bug, where do you even
start looking without *some* idea what happened to trigger the issue? Do
you have any idea what might've started it in your case?
*** DID YOU TAKE COPIES OF YOUR DATA FILES BEFORE "FIXING" THEM *** ?
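For next time, a minimal sketch of taking such a copy, in Python. It assumes
the server is fully stopped before you copy, and the function name and both
paths are hypothetical; any file-level copy tool you trust does the same job.

```python
import shutil
from pathlib import Path


def snapshot_data_dir(data_dir, backup_dir):
    """Copy data_dir to backup_dir, refusing to overwrite an existing copy."""
    src = Path(data_dir)
    dst = Path(backup_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such data directory: {src}")
    if dst.exists():
        # Never clobber an earlier snapshot; it may be the only intact evidence.
        raise FileExistsError(f"refusing to overwrite existing copy: {dst}")
    # copytree copies file contents and metadata; symlinks are kept as links.
    shutil.copytree(src, dst, symlinks=True)
    return dst
```

Called as, say, `snapshot_data_dir("/path/to/data", "/path/to/data.before-fix")`,
it leaves the damaged files untouched so they can still be examined after
any repair attempt.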
--
Craig Ringer