Re: Race-condition with failed block-write? - Mailing list pgsql-bugs

From Arjen van der Meijden
Subject Re: Race-condition with failed block-write?
Date
Msg-id 432732A2.9060300@tweakers.net
Whole thread Raw
In response to Re: Race-condition with failed block-write?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Race-condition with failed block-write?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On 13-9-2005 20:53, Tom Lane wrote:
> Arjen van der Meijden <acm@tweakers.net> writes:
>
>>Unless reindexing is part of other commands, I didn't do that. The last
>>time 'grep' was able to find an reference to something being reindexed
>>was in June, something (maybe me, but I doubt it, I'd also reindex the
>>user-tables, I suppose) was reindexing all system tables back then.
>
>
> Well, that would have done it.  But if you have such complete logs, do
> you have the postmaster log from the whole period?
I have logs from the time I started using it seriously indeed. It was
just never rotated or something. I did adjust it to have proper prefixes
only at 23th of august, so everything before that is a bit less easy to
read.
Besides that it contains lots of "debug" stuff, like logged queries and
their statistics.

It is a lot of text and the file is currently 52MB. But without all
those repeated failure-messages its about 862KB. So I can e-mail a
compressed version to you, if you're interested in it.

> Look for sequences
> approximately like this:
>
> and note whether there's any one that shows a WAL end address (here
> 0/12B65A88) that's less than the one before.

I take it you mean something like this:
[ - 2005-08-23 15:29:26 CEST @] LOG:  checkpoint record is at 29/2E73E410
[ - 2005-08-23 15:37:27 CEST @] LOG:  checkpoint record is at 29/2E73E410
[ - 2005-08-25 10:26:27 CEST @] LOG:  checkpoint record is at 29/2E8D26B4
[ - 2005-08-25 10:47:59 CEST @] LOG:  checkpoint record is at 29/2E8D26F0
[ - 2005-09-05 11:24:08 CEST @] LOG:  checkpoint record is at 29/2E73E44C
[ - 2005-09-09 10:27:20 CEST @] LOG:  checkpoint record is at 29/2E73E44C

The last start-up at 25th of August has a higher second value than the
start-up at the 5th of September. Which was a start since it was closed
at sep 1.

>>l /var/lib/postgresql/data/pg_xlog/
>>total 145M
>>drwx------  3 postgres postgres 4.0K Sep  1 12:37 .
>>drwx------  8 postgres postgres 4.0K Sep 13 20:31 ..
>>-rw-------  1 postgres postgres  16M Sep 13 19:25 00000001000000290000002E
>>-rw-------  1 postgres postgres  16M Sep  1 12:36 000000010000002900000067
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000068
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 000000010000002900000069
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006A
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006B
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006C
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006D
>>-rw-------  1 postgres postgres  16M Aug 25 11:40 00000001000000290000006E
>
>
> This is definitely ugly: there's no way all those higher-numbered
> segments would be in place if the WAL counter hadn't been up to that
> range.  So I think we are definitely looking at a WAL-counter-went-
> backward problem, not some random data corruption.
>
> Judging from the file dates, the WAL counter was still OK (ie, above
> 67xxx) on Sep 1, and the problem must have occurred since then.

That's odd. But it must have been my hard attempts to shut it down then.
These are all the lines there are for Sep 1:

[ - 2005-09-01 12:36:50 CEST @] LOG:  received fast shutdown request
[ - 2005-09-01 12:36:50 CEST @] LOG:  shutting down
[ - 2005-09-01 12:36:50 CEST @] LOG:  database system is shut down
[ - 2005-09-01 12:37:01 CEST @] LOG:  received smart shutdown request
[ - 2005-09-01 12:37:01 CEST @] LOG:  shutting down
[ - 2005-09-01 12:37:01 CEST @] LOG:  database system is shut down
[ - 2005-09-05 11:24:08 CEST @] LOG:  database system was shut down at
2005-09-01 12:37:01 CEST
[ - 2005-09-05 11:24:08 CEST @] LOG:  checkpoint record is at 29/2E73E44C
[ - 2005-09-05 11:24:08 CEST @] LOG:  redo record is at 29/2E73E44C;
undo record is at 0/0; shutdown TRUE
[ - 2005-09-05 11:24:08 CEST @] LOG:  next transaction ID: 166361; next
OID: 55768251
[ - 2005-09-05 11:24:08 CEST @] LOG:  database system is ready

I'm not sure why I needed two shut down-calls at that time. It may have
been due to a psql on a terminal I'd forgot to close first.

Best regards,

Arjen

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: Race-condition with failed block-write?
Next
From: Tom Lane
Date:
Subject: Re: Race-condition with failed block-write?