Re: Race-condition with failed block-write? - Mailing list pgsql-bugs

From Arjen van der Meijden
Subject Re: Race-condition with failed block-write?
Date
Msg-id 43270FAA.20301@tweakers.net
In response to Re: Race-condition with failed block-write?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On 13-9-2005 16:25, Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>
> It's highly unlikely that that query has anything to do with it, since
> it's not touching anything but system catalogs and not trying to write
> them either.

Indeed, other things trigger it as well.

> The first thing you ought to find out is which table
> 1663/2013826/9975789 is, and look to see if the corrupted LSN value is
> already present on disk in that block.

Well, it's an index, not a table. It was the index:
"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).
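
For anyone repeating that lookup, here is a minimal sketch of how the
identifier from the error can be split up. It assumes the three numbers
are tablespace OID / database OID / relfilenode, and that the relation
is found by matching pg_class.relfilenode while connected to the
database with that OID. The names RelFileNode, parse_relfilenode and
lookup_query are made up for this illustration.

```python
from typing import NamedTuple


class RelFileNode(NamedTuple):
    """Hypothetical holder for the three parts of the identifier."""
    tablespace_oid: int
    database_oid: int
    relfilenode: int


def parse_relfilenode(text: str) -> RelFileNode:
    """Split 'tablespace/database/relfilenode' into its integer parts."""
    parts = text.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated numbers, got {text!r}")
    spc, db, rel = (int(p) for p in parts)
    return RelFileNode(spc, db, rel)


def lookup_query(node: RelFileNode) -> str:
    """Build a catalog query to run in the database whose OID is
    node.database_oid; relkind 'i' in the result means an index."""
    return (
        "SELECT relname, relkind FROM pg_class "
        f"WHERE relfilenode = {node.relfilenode};"
    )


node = parse_relfilenode("1663/2013826/9975789")
print(lookup_query(node))
```

The query only makes sense when run inside the database identified by
the middle number, since pg_class is per-database.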

Using pg_filedump I extracted the LSN for block 21 and indeed, that was
already 67713428 instead of something below 2E73E53C. It wasn't that
block alone though; here are a few LSN-lines from it:

  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)

I tried other files and each one I tried only had LSNs of 0.
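
To check a whole file rather than eyeballing it, something like the
following rough sketch could be used. It parses LSN lines in the format
shown above and lists the ones that lie beyond a given flush point. It
assumes that logid is printed in decimal and recoff in hex, and that an
LSN compares as the pair (logid, recoff). The flush point used here is
an assumption too: only the 2E73E53C offset is mentioned above, and
logid 41 is a guess for illustration. The function names are made up.

```python
import re
from typing import List, Tuple

Lsn = Tuple[int, int]  # (logid, recoff)

# Matches lines such as: "LSN:  logid     41 recoff 0x676f5174 ..."
LSN_LINE = re.compile(r"LSN:\s+logid\s+(\d+)\s+recoff\s+0x([0-9a-fA-F]+)")


def parse_lsns(dump_text: str) -> List[Lsn]:
    """Extract (logid, recoff) pairs from pg_filedump-style output."""
    return [(int(m.group(1)), int(m.group(2), 16))
            for m in LSN_LINE.finditer(dump_text)]


def ahead_of(lsns: List[Lsn], flushed: Lsn) -> List[Lsn]:
    """Return the LSNs that are greater than the flushed position."""
    return [lsn for lsn in lsns if lsn > flushed]


SAMPLE = """
  LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
  LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
  LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
  LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)
"""

# Assumed flush point: the logid (41) is a guess, only the offset is known.
FLUSHED = (41, 0x2E73E53C)

for logid, recoff in ahead_of(parse_lsns(SAMPLE), FLUSHED):
    print(f"ahead of flush point: {logid:X}/{recoff:08X}")
```

Under that assumed flush point, every logid 41 line in the sample is
ahead of it, while the logid 25 and logid 1 lines are not.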

When trying (\d indexname in psql) to determine which table that index
belonged to, I noticed I got the errors again, but for another file
(pg_index this time). Another try (oid2name ...) after that hit yet
another file (the pg_class table). All those files were last changed
somewhere around August 25, so there are no new changes.

On that day I did some active query-tuning, and a few times, when the
selects took too long, I issued immediate shutdowns. There were no
warnings about broken records afterwards in the log though, so I don't
believe anything got damaged by that.

After that I loaded some fresh data from a production-database using
either pg_restore or psql < some-file-from-pg_dump.sql (I don't know
which one anymore). A few days later I shut down that postgres,
installed 8.1-beta and used that (in another directory of course). This
8.0.3 only came back up because of a reboot and hasn't been used since
that reboot.

I guess, during that reloading those system tables got mixed up?

> If it is, then we've probably
> not got much chance of finding out how it got there.  If it is *not* on
> disk, but you have a repeatable way of causing this to happen starting
> from a clean postmaster start, then that's pretty interesting --- but
> I don't know any way of figuring it out short of groveling through the
> code with a debugger.  If you're not already pretty familiar with the PG
> code, coaching you remotely isn't going to work very well :-(.  I'd be
> glad to look into it if you can get me access to the machine though.

Well, I can very probably give you that access. But as you say, finding
out what went wrong is very hard to do.

Best regards,

Arjen van der Meijden
