Re: Race-condition with failed block-write? - Mailing list pgsql-bugs

From Tom Lane
Subject Re: Race-condition with failed block-write?
Date
Msg-id 25482.1126634646@sss.pgh.pa.us
In response to Race-condition with failed block-write?  (Arjen van der Meijden <acm@tweakers.net>)
Responses Re: Race-condition with failed block-write?  (Arjen van der Meijden <acm@tweakers.net>)
List pgsql-bugs
Arjen van der Meijden <acm@tweakers.net> writes:
> On 13-9-2005 16:25, Tom Lane wrote:
>> The first thing you ought to find out is which table
>> 1663/2013826/9975789 is, and look to see if the corrupted LSN value is
>> already present on disk in that block.

> Well, it's an index, not a table. It was the index
> "pg_class_relname_nsp_index" on pg_class(relname, relnamespace).

Ah.  So you've reindexed pg_class at some point.  Reindexing it again
would likely get you out of this.

> Using pg_filedump I extracted the LSN for block 21 and indeed, that was
> already 67713428 instead of something below 2E73E53C. It wasn't that
> block alone though, here are a few LSN-lines from it:

>   LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
>   LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
>   LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>   LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
>   LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
>   LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>   LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)

logid is the high-order half of the LSN, so there's nothing wrong with
those other pages --- it's only the first one you show there that seems
to be past the current end of WAL.
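
To make the ordering concrete, here is a minimal sketch that compares the LSNs quoted above as (logid, recoff) pairs, with logid as the high-order half. The function names and the end-of-WAL value are assumptions made up for illustration; the real end of WAL has to come from the server itself.

```python
# Sketch only: order LSNs given as (logid, recoff) pairs, as printed by pg_filedump.
# Function names and ASSUMED_END_OF_WAL are hypothetical, not part of any PostgreSQL tool.

def lsn_to_int(logid, recoff):
    """Combine the two 32-bit halves into one 64-bit integer; logid is the high half."""
    return (logid << 32) | recoff


def pages_past_end_of_wal(page_lsns, end_of_wal):
    """Return the page LSNs that sort strictly after the given end-of-WAL position."""
    end = lsn_to_int(*end_of_wal)
    return [lsn for lsn in page_lsns if lsn_to_int(*lsn) > end]


# The page LSNs quoted in the message above.
PAGE_LSNS = [
    (41, 0x676F5174),
    (25, 0x3C6C5504),
    (41, 0x2EA8A270),
    (41, 0x2EA88190),
    (1, 0x68E2F660),
    (41, 0x2EA8A270),
    (1, 0x68E2F6A4),
]

# Hypothetical end-of-WAL position, chosen only to illustrate the comparison.
ASSUMED_END_OF_WAL = (41, 0x30000000)

suspect = pages_past_end_of_wal(PAGE_LSNS, ASSUMED_END_OF_WAL)
for logid, recoff in suspect:
    print(f"logid {logid} recoff 0x{recoff:08x} is past the assumed end of WAL")
```

Note that a page with logid 25 sorts before any page with logid 41 regardless of recoff, which is why a large recoff on its own does not indicate a problem.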

> On that day I did some active query-tuning, and a few times, when the
> selects took too long, I issued immediate shutdowns.
> There were no warnings about broken records in the log afterwards,
> so I don't believe anything got damaged.

I have a feeling something may have gone wrong here, though it's hard to
say what.  If the bogus pages in the other tables all have LSNs close to
this one then that makes it less likely that this is a random corruption
event --- what would be more plausible is that end of WAL really was
that high and somehow the WAL counter got reset back during one of those
forced restarts.

Can you show us ls -l output for the pg_xlog directory?  I'm interested
to see the file names and mod dates there.
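
When reading such a listing, the segment file names can be mapped back to LSN ranges. The sketch below assumes the naming scheme used by PostgreSQL releases of this era (24 hex digits: 8 for the timeline, 8 for the log id, 8 for the segment number) and the default 16 MB segment size. The helper names and the example file name are hypothetical, not taken from the reporter's system.

```python
import re

# Sketch only: decode a pg_xlog segment file name into the LSN where it starts.
# Assumes 24 hex digits (timeline, log id, segment) and 16 MB segments (the default).

DEFAULT_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_wal_segment_name(name):
    """Split a segment file name into (timeline, logid, segment) integers."""
    if not re.fullmatch(r"[0-9A-Fa-f]{24}", name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segment_start_lsn(name, segment_size=DEFAULT_SEGMENT_SIZE):
    """Return (logid, recoff) of the first byte stored in the segment."""
    _timeline, logid, segment = parse_wal_segment_name(name)
    return logid, segment * segment_size


# Hypothetical file name, used only to show the arithmetic.
example = "000000010000002900000067"
logid, recoff = segment_start_lsn(example)
print(f"{example} starts at logid {logid} recoff 0x{recoff:08x}")
```

Comparing the highest segment present in pg_xlog with the LSNs found on the suspect pages would show whether WAL was ever written that far.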

            regards, tom lane
