Andy Osborne <andy@sift.co.uk> writes:
> Tom Lane wrote:
>>> FATAL 2: open of /u0/pgdata/pg_clog/0726 failed: No such file or directory
>> What range of file names do you actually see in pg_clog?
> Currently 0000 to 00D6. I don't know what it was last night.
Not any greater, for sure. (FYI, each segment covers one million
transactions.)
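To make the arithmetic concrete, here is a minimal sketch of how a transaction ID maps to a pg_clog segment file name. It assumes the default 8 kB block size, two status bits per transaction, and 32 pages per segment file; the function names are mine, purely for illustration, not anything in the backend.

```python
# Sketch only: assumes default 8 kB blocks, 2 status bits per transaction
# (4 transactions per byte) and 32 pages per pg_clog segment file.
BLCKSZ = 8192
XACTS_PER_BYTE = 4
PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = BLCKSZ * XACTS_PER_BYTE * PAGES_PER_SEGMENT  # 1,048,576


def segment_name(xid):
    """Return the 4-hex-digit segment file name that would hold this xid."""
    return format(xid // XACTS_PER_SEGMENT, "04X")


def segment_xid_range(name):
    """Return (first_xid, last_xid) covered by a segment file name."""
    first = int(name, 16) * XACTS_PER_SEGMENT
    return first, first + XACTS_PER_SEGMENT - 1


if __name__ == "__main__":
    print(segment_xid_range("00D6"))  # highest segment actually present
    print(segment_xid_range("0726"))  # segment named in the FATAL message
```

Under those assumptions, segment 00D6 starts at about 224 million transactions, while 0726 would start at about 1.9 billion. A request for 0726 therefore most likely came from a bogus transaction ID rather than from a segment that ever existed.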
> the next backup was running when the database crashed. Any
> attempt to access the table crashed it again. I don't know if
> it helps, but a select * from news where (conditional on a field
> with an index) was ok but if the where was not indexed and resulted
> in a table scan, it crashed it.
This is consistent with one page of the table being corrupted.
> While I wouldn't rule out data corruption, the kernel message
> ring has no errors for the md driver, scsi host adapter or the
> disks, which I would expect if we had bad blocks appearing on a
> disk or somesuch.
Some of the cases that I've seen look like completely unrelated data
(not even Postgres stuff, just bits of text files) was written into
a page of a Postgres table. This could possibly be a kernel bug,
along the lines of getting confused about which buffer belongs to
which file. But with no way to reproduce it, it's hard to pin blame.
>> You didn't happen to make a physical copy of the news table before
>> dropping it, did you? It'd be interesting to examine the remains.
> Sadly, no I didn't. This is one of our live database servers
> and I was under a lot of pressure to get it back quickly. If
> it does it again, what can I do to provide the most useful
> feedback?
If the database isn't unreasonably large, perhaps you could take a
tarball dump of the whole $PGDATA directory tree while the postmaster
is stopped? That would document the situation for examination at leisure.
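As a rough sketch of that step, something like the following would do. The function name and the archive naming are hypothetical. The check for postmaster.pid is only a heuristic I am assuming for "postmaster is running": a stale pid file can survive a crash, so you would need to verify that by hand.

```python
import os
import tarfile
import time


def snapshot_data_dir(pgdata, dest_dir):
    """Write a gzipped tarball of the whole data directory into dest_dir.

    Hypothetical helper. Refuses to run while postmaster.pid exists,
    since a copy taken under a live postmaster would not be consistent.
    """
    if os.path.exists(os.path.join(pgdata, "postmaster.pid")):
        raise RuntimeError("postmaster.pid present; stop the postmaster first")
    # Top-level name inside the archive is the data directory's own name.
    base = os.path.basename(os.path.normpath(pgdata))
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = os.path.join(dest_dir, "%s-%s.tar.gz" % (base, stamp))
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(pgdata, arcname=base)
    return archive
```

You would call it as snapshot_data_dir(os.environ["PGDATA"], "/some/other/disk"), with the destination outside $PGDATA so the archive does not try to include itself.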
regards, tom lane