Re: Race-condition with failed block-write? - Mailing list pgsql-bugs

From Tom Lane
Subject Re: Race-condition with failed block-write?
Date
Msg-id 25897.1126637591@sss.pgh.pa.us
Whole thread Raw
In response to Re: Race-condition with failed block-write?  (Arjen van der Meijden <acm@tweakers.net>)
Responses Re: Race-condition with failed block-write?  (Arjen van der Meijden <acm@tweakers.net>)
List pgsql-bugs
Arjen van der Meijden <acm@tweakers.net> writes:
> Unless reindexing is part of other commands, I didn't do that. The last
> time 'grep' was able to find an reference to something being reindexed
> was in June, something (maybe me, but I doubt it, I'd also reindex the
> user-tables, I suppose) was reindexing all system tables back then.

Well, that would have done it.  But if you have such complete logs, do
you have the postmaster log from the whole period?  Look for sequences
approximately like this:

2005-09-13 14:48:33.450 EDT 25857 LOG:  database system was interrupted at 2005-09-13 14:42:08 EDT
2005-09-13 14:48:33.450 EDT 25857 LOG:  checkpoint record is at 0/12B65A40
2005-09-13 14:48:33.450 EDT 25857 LOG:  redo record is at 0/12B65A40; undo record is at 0/0; shutdown FALSE
2005-09-13 14:48:33.450 EDT 25857 LOG:  next transaction ID: 67789; next OID: 240494
2005-09-13 14:48:33.450 EDT 25857 LOG:  next MultiXactId: 1; next MultiXactOffset: 0
2005-09-13 14:48:33.450 EDT 25857 LOG:  database system was not properly shut down; automatic recovery in progress
2005-09-13 14:48:33.452 EDT 25857 LOG:  record with zero length at 0/12B65A88
2005-09-13 14:48:33.452 EDT 25857 LOG:  redo is not required
2005-09-13 14:48:33.474 EDT 25857 LOG:  database system is ready

and note whether there's any one that shows a WAL end address (here
0/12B65A88) that's less than the one before.

> l /var/lib/postgresql/data/pg_xlog/
> total 145M
> drwx------  3 postgres postgres 4.0K Sep  1 12:37 .
> drwx------  8 postgres postgres 4.0K Sep 13 20:31 ..
> -rw-------  1 postgres postgres  16M Sep 13 19:25 00000001000000290000002E
> -rw-------  1 postgres postgres  16M Sep  1 12:36 000000010000002900000067
> -rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000068
> -rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000069
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006A
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006B
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006C
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006D
> -rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006E

This is definitely ugly: there's no way all those higher-numbered
segments would be in place if the WAL counter hadn't been up to that
range.  So I think we are definitely looking at a WAL-counter-went-
backward problem, not some random data corruption.

Judging from the file dates, the WAL counter was still OK (ie, above
67xxx) on Sep 1, and the problem must have occurred since then.

            regards, tom lane

pgsql-bugs by date:

Previous
From: Arjen van der Meijden
Date:
Subject: Re: Race-condition with failed block-write?
Next
From: Arjen van der Meijden
Date:
Subject: Re: Race-condition with failed block-write?