Re: Race-condition with failed block-write? - Mailing list pgsql-bugs
From | Arjen van der Meijden |
---|---|
Subject | Re: Race-condition with failed block-write? |
Date | |
Msg-id | 43271D19.7030701@tweakers.net |
In response to | Re: Race-condition with failed block-write? (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Race-condition with failed block-write? |
List | pgsql-bugs |
On 13-9-2005 20:04, Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>
>>On 13-9-2005 16:25, Tom Lane wrote:
>>
>>Well, it's an index, not a table. It was the index:
>>"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).
>
> Ah. So you've reindexed pg_class at some point. Reindexing it again
> would likely get you out of this.

Unless reindexing is part of other commands, I didn't do that. The last
time 'grep' was able to find a reference to something being reindexed
was in June; something (maybe me, but I doubt it, I'd also reindex the
user tables, I suppose) was reindexing all system tables back then.
Besides, it's not just the index on pg_class; pg_class itself (and
pg_index) have wrong LSNs as well.

>>Using pg_filedump I extracted the LSN for block 21 and indeed, that was
>>already 67713428 instead of something below 2E73E53C. It wasn't that
>>block alone though, here are a few LSN-lines from it:
>
>
>>  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
>>  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
>>  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>>  LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
>>  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
>>  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
>>  LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)
>
>
> logid is the high-order half of the LSN, so there's nothing wrong with
> those other pages --- it's only the first one you show there that seems
> to be past the current end of WAL.

There were 3 blocks out of 40 with an LSN like the first one above in
that index file, so with high-order 41 and recoff 0x67[67]something.
In the pg_class file there were 6 blocks, of which 5 had LSNs like the
above in that index. And for pg_index 3 blocks, with 1 wrong.

>>On that day I did some active query-tuning, but a few times it took too
>>long, so I issued immediate shutdowns when the selects took too long.
>>There were no warnings about broken records afterwards in the log
>>though, so I don't believe anything got damaged afterwards.
>
> I have a feeling something may have gone wrong here, though it's hard to
> say what. If the bogus pages in the other tables all have LSNs close to
> this one then that makes it less likely that this is a random corruption
> event --- what would be more plausible is that end of WAL really was
> that high and somehow the WAL counter got reset back during one of those
> forced restarts.
>
> Can you show us ls -l output for the pg_xlog directory? I'm interested
> to see the file names and mod dates there.

Here you go:

l /var/lib/postgresql/data/pg_xlog/
total 145M
drwx------  3 postgres postgres 4.0K Sep  1 12:37 .
drwx------  8 postgres postgres 4.0K Sep 13 20:31 ..
-rw-------  1 postgres postgres  16M Sep 13 19:25 00000001000000290000002E
-rw-------  1 postgres postgres  16M Sep  1 12:36 000000010000002900000067
-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000068
-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000069
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006A
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006B
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006C
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006D
-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006E

During data-load it was warning about too frequent checkpoints, but I do
hope that's mostly performance-related, not stability?

Best regards,

Arjen van der Meijden
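
To make the relation between the page LSNs and the pg_xlog file names easier to follow, here is a rough Python sketch. It is not part of the original exchange. It assumes the 8.x-era naming scheme (timeline, logid and segment number, 8 hex digits each), 16 MB segments as suggested by the file sizes in the listing, and that the most recently modified file (00000001000000290000002E) is the segment currently being written. All function names are made up for illustration.

```python
# Sketch only: assumes WAL file names of the form TTTTTTTTLLLLLLLLSSSSSSSS
# (timeline, logid, segment; 8 hex digits each) and 16 MB segments.
SEGMENT_SIZE = 16 * 1024 * 1024


def parse_wal_filename(name):
    """Split a WAL segment file name into (timeline, logid, segment)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segment_of(logid, recoff):
    """Return the (logid, segment) pair that contains the given LSN."""
    return logid, recoff // SEGMENT_SIZE


def wal_filename(timeline, logid, recoff):
    """Name of the segment file that would hold the given LSN."""
    log, seg = segment_of(logid, recoff)
    return "%08X%08X%08X" % (timeline, log, seg)


def is_beyond(logid, recoff, current_file):
    """True if the LSN lies in a later segment than current_file."""
    _, cur_log, cur_seg = parse_wal_filename(current_file)
    return segment_of(logid, recoff) > (cur_log, cur_seg)


# Assumption: the file with the newest mod date is the current segment.
CURRENT_FILE = "00000001000000290000002E"

# Page LSNs copied from the pg_filedump lines quoted in the message.
PAGE_LSNS = [
    (41, 0x676F5174),
    (25, 0x3C6C5504),
    (41, 0x2EA8A270),
    (41, 0x2EA88190),
    (1, 0x68E2F660),
    (1, 0x68E2F6A4),
]

for logid, recoff in PAGE_LSNS:
    print("%2d/%08X -> %s beyond current segment: %s" % (
        logid, recoff,
        wal_filename(1, logid, recoff),
        is_beyond(logid, recoff, CURRENT_FILE),
    ))
```

Under these assumptions only the first LSN (logid 41, recoff 0x676f5174) falls beyond the current segment. It maps to 000000010000002900000067, which is the file in the listing dated Sep 1 12:36.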