Re: Invalid headers and xlog flush failures - Mailing list pgsql-general

From Bricklen Anderson
Subject Re: Invalid headers and xlog flush failures
Date
Msg-id 42026BC2.6030601@PresiNET.com
In response to Re: Invalid headers and xlog flush failures  (Bricklen Anderson <BAnderson@PresiNET.com>)
List pgsql-general
Bricklen Anderson wrote:
> Tom Lane wrote:
>
>> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>>
>>> Tom Lane wrote:
>>>
>>>> I would have suggested that maybe this represented on-disk data
>>>> corruption, but the appearance of two different but not-too-far-apart
>>>> WAL offsets in two different pages suggests that indeed the end of WAL
>>>> was up around segment 972 or 973 at one time.
>>
>>
>>
>>> Nope, never touched pg_resetxlog.
>>> My pg_xlog list ranges from 000000010000007300000041 to
>>> 0000000100000073000000FE, with no breaks. There are also these:
>>> 000000010000007400000000 to 00000001000000740000000B
>>
>>
>>
>> That seems like rather a lot of files; do you have checkpoint_segments
>> set to a large value, like 100?  The pg_controldata dump shows that the
>> latest checkpoint record is in the 73/41 file, so presumably the active
>> end of WAL isn't exceedingly far past that.  You've got 200 segments
>> prepared for future activity, which is a bit over the top IMHO.
>>
>> But anyway, the evidence seems pretty clear that in fact end of WAL is
>> in the 73 range, and so those page LSNs with 972 and 973 have to be
>> bogus.  I'm back to thinking about dropped bits in RAM or on disk.
>> IIRC these numbers are all hex, so the extra "9" could come from just
>> two bits getting turned on that should not be.  Might be time to run
>> memtest86 and/or badblocks.
>>
>>             regards, tom lane
>
>
> Yes, checkpoint_segments is set to 100, although I can set that lower if
> you feel that that is more appropriate. Currently, the system receives
> around 5-8 million inserts per day (across 3 primary tables), so I was
> leaning towards the "more is better" philosophy.
>
> We ran e2fsck with the badblocks option last week and it didn't turn
> anything up, nor did a couple of passes with memtest. I will run a
> full-scale memtest and post any interesting results.
>
> I've also read that kill -9 on the postmaster is "not a good thing". I
> honestly can't vouch for whether that happened around the time this
> database was first created. It's possible, since this db started its
> life as a development db at 8r3, was bumped to 8r5, then moved to 8
> final, where it has become a dev-final db.
>
> Assuming that the memtest passes cleanly, as does another run of
> badblocks, do you have any more suggestions on how I should proceed?
> Should I run for a while with zero_damaged_pages set to true and accept
> the data loss, or just recreate the whole db from scratch?
>
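
As a sanity check on the "two bits" observation above: OR-ing 0x900 into
the segment number (i.e. turning on bits 8 and 11) is exactly what takes
0x073 to 0x973. In bash:

printf '%x\n' $(( 0x073 | 0x900 ))    # prints 973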

memtest86+ ran for over 15 hours with no errors reported.
e2fsck -c completed with no errors reported.
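
For the record, that check amounted to the following (the partition name
is a placeholder for the actual data volume, which was unmounted first):

# read-only badblocks scan via e2fsck, against the unmounted partition
e2fsck -c /dev/sda1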

Any ideas on what I should try next? Considering that this db is not in
production yet, I _do_ have the liberty to rebuild the database if
necessary. Do you have any further recommendations?
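
For concreteness, the two routes I'm weighing look roughly like this
(database, table, and path names are placeholders, not my actual setup):

# Option 1: accept the data loss. zero_damaged_pages is superuser-only;
# any page with a broken header is zeroed (in memory) as it is read, so a
# full pass over the suspect table, e.g. a VACUUM, flushes out the damage.
psql mydb <<'EOF'
SET zero_damaged_pages = on;
VACUUM VERBOSE damaged_table;
EOF

# Option 2: dump whatever is salvageable and rebuild from scratch.
pg_dumpall > /tmp/cluster.sql        # may still trip over the damaged pages
initdb -D /path/to/new_data
pg_ctl start -D /path/to/new_data
psql -f /tmp/cluster.sql template1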

thanks again,

Bricklen
