Re: Race-condition with failed block-write? - Mailing list pgsql-bugs

From Arjen van der Meijden
Subject Re: Race-condition with failed block-write?
Msg-id 43271D19.7030701@tweakers.net
In response to Re: Race-condition with failed block-write?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On 13-9-2005 20:04, Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>
>>On 13-9-2005 16:25, Tom Lane wrote:
>>
>>Well, its an index, not a table. It was the index:
>>"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).
>
> Ah.  So you've reindexed pg_class at some point.  Reindexing it again
> would likely get you out of this.

Unless reindexing is part of other commands, I didn't do that. The last
time 'grep' was able to find a reference to something being reindexed
was in June; something (maybe me, but I doubt it, since I'd also have
reindexed the user tables, I suppose) was reindexing all system tables
back then. Besides, it's not just the index on pg_class; pg_class itself
(and pg_index) have wrong LSNs as well.

>>Using pg_filedump I extracted the LSN for block 21 and indeed, that was
>>already 67713428 instead of something below 2E73E53C. It wasn't that
>>block alone though, here are a few LSN-lines from it:
>
>
>>  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
>>  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
>>  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>>  LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
>>  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
>>  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>>  LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)
>
>
> logid is the high-order half of the LSN, so there's nothing wrong with
> those other pages --- it's only the first one you show there that seems
> to be past the current end of WAL.

There were 3 blocks out of 40 with an LSN like the first one above in that
index file, so with high-order 41 and recoff 0x67[67]something.
In the pg_class file there were 6 blocks, of which 5 had LSNs like the
first one above from that index. For pg_index there were 3 blocks, with 1 wrong.
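
To make the comparison concrete, here is a minimal Python sketch that reads pg_filedump-style "LSN:" lines and orders them as (logid, recoff) pairs, with logid as the high-order half. The function names are hypothetical, and WAL_END is an assumed bound (the top of segment 0x2E in logid 41) picked for illustration only; the real end of WAL would have to come from the server itself.

```python
def parse_lsn_line(line):
    """Extract (logid, recoff) from a pg_filedump-style 'LSN:' line."""
    fields = line.split()
    # Expected layout: LSN: logid <decimal> recoff <hex> ...
    logid = int(fields[fields.index("logid") + 1])
    recoff = int(fields[fields.index("recoff") + 1], 16)
    return logid, recoff


def is_past_wal_end(lsn, wal_end):
    """Tuple comparison: logid is compared first, recoff only breaks ties."""
    return lsn > wal_end


# Assumption for illustration: end of WAL lies somewhere in logid 41 (0x29),
# no higher than the top of segment 0x2E.
WAL_END = (41, 0x2EFFFFFF)

lines = [
    "  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)",
    "  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)",
    "  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)",
    "  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)",
]

suspect = [parse_lsn_line(l) for l in lines
           if is_past_wal_end(parse_lsn_line(l), WAL_END)]
for logid, recoff in suspect:
    print(f"past assumed end of WAL: {logid:X}/{recoff:08X}")
```

Under that assumed bound, only the page with logid 41 and recoff 0x676f5174 is flagged. The page with logid 1 has a larger recoff, but its lower logid places it well before the end of WAL.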

>>On that day I did some active query-tuning, but a few times it took too
>>long, so I issued immediate shut downs when the selects took too long.
>>There were no warnings about broken records afterwards in the log
>>though, so I don't believe anything got damaged afterwards.
>
> I have a feeling something may have gone wrong here, though it's hard to
> say what.  If the bogus pages in the other tables all have LSNs close to
> this one then that makes it less likely that this is a random corruption
> event --- what would be more plausible is that end of WAL really was
> that high and somehow the WAL counter got reset back during one of those
> forced restarts.
>
> Can you show us ls -l output for the pg_xlog directory?  I'm interested
> to see the file names and mod dates there.

Here you go:

l /var/lib/postgresql/data/pg_xlog/
total 145M
drwx------  3 postgres postgres 4.0K Sep  1 12:37 .
drwx------  8 postgres postgres 4.0K Sep 13 20:31 ..
-rw-------  1 postgres postgres  16M Sep 13 19:25 00000001000000290000002E
-rw-------  1 postgres postgres  16M Sep  1 12:36 000000010000002900000067
-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000068
-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000069
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006A
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006B
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006C
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006D
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006E
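
For reading the file names above, a small Python sketch of how a segment name maps to an LSN range may help. It assumes the naming layout of this server generation: 8 hex digits of timeline, 8 of logid, 8 of segment number, with 16 MB segments as the file sizes suggest. The helper name segment_range is hypothetical.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MB, matching the file sizes in the listing


def segment_range(name):
    """Split a 24-hex-digit WAL segment name into
    (timeline, logid, first_recoff, last_recoff)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character segment file name")
    timeline = int(name[0:8], 16)
    logid = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    start = seg * SEGMENT_SIZE
    return timeline, logid, start, start + SEGMENT_SIZE - 1


for name in ("00000001000000290000002E", "000000010000002900000067"):
    tli, logid, lo, hi = segment_range(name)
    print(f"{name}: timeline {tli}, logid {logid}, "
          f"recoff 0x{lo:08X}..0x{hi:08X}")
```

Under these assumptions, the segment ending in 2E covers recoff 0x2E000000 to 0x2EFFFFFF in logid 41 (0x29), and the one ending in 67 covers 0x67000000 to 0x67FFFFFF, the range that contains 0x676f5174.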

During the data load it was warning about too-frequent checkpoints, but I do
hope that's mostly a performance issue, not a stability one?

Best regards,

Arjen van der Meijden
