Re: Invalid headers and xlog flush failures - Mailing list pgsql-general
| From | Bricklen Anderson |
|---|---|
| Subject | Re: Invalid headers and xlog flush failures |
| Date | |
| Msg-id | 42026BC2.6030601@PresiNET.com |
| In response to | Re: Invalid headers and xlog flush failures (Bricklen Anderson <BAnderson@PresiNET.com>) |
| Responses | Re: Invalid headers and xlog flush failures; Re: Invalid headers and xlog flush failures |
| List | pgsql-general |
Bricklen Anderson wrote:
> Tom Lane wrote:
>> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>>> Tom Lane wrote:
>>>> I would have suggested that maybe this represented on-disk data
>>>> corruption, but the appearance of two different but not-too-far-apart
>>>> WAL offsets in two different pages suggests that indeed the end of WAL
>>>> was up around segment 972 or 973 at one time.
>>
>>> Nope, never touched pg_resetxlog.
>>> My pg_xlog list ranges from 000000010000007300000041 to
>>> 0000000100000073000000FE, with no breaks. There are also these:
>>> 000000010000007400000000 to 00000001000000740000000B
>>
>> That seems like rather a lot of files; do you have checkpoint_segments
>> set to a large value, like 100? The pg_controldata dump shows that the
>> latest checkpoint record is in the 73/41 file, so presumably the active
>> end of WAL isn't exceedingly far past that. You've got 200 segments
>> prepared for future activity, which is a bit over the top IMHO.
>>
>> But anyway, the evidence seems pretty clear that in fact end of WAL is
>> in the 73 range, and so those page LSNs with 972 and 973 have to be
>> bogus. I'm back to thinking about dropped bits in RAM or on disk.
>> IIRC these numbers are all hex, so the extra "9" could come from just
>> two bits getting turned on that should not be. Might be time to run
>> memtest86 and/or badblocks.
>>
>> regards, tom lane
>
> Yes, checkpoint_segments is set to 100, although I can set it lower if
> you feel that is more appropriate. Currently, the system receives
> around 5-8 million inserts per day (across 3 primary tables), so I was
> leaning towards the "more is better" philosophy.
>
> We ran e2fsck with the badblocks option last week and didn't turn
> anything up, along with a couple of passes with memtest. I will run a
> full-scale memtest and post any interesting results.
>
> I've also read that kill -9 on the postmaster is "not a good thing". I
> honestly can't say whether that happened around the time this database
> was first created. It's possible, since this db started its life as a
> development db at 8r3, was bumped to 8r5, then on to 8 final, where it
> has become a dev-final db.
>
> Assuming that the memtest passes cleanly, as does another run of
> badblocks, do you have any more suggestions on how I should proceed?
> Should I run for a while with zero_damaged_pages set to true and accept
> the data loss, or just recreate the whole db from scratch?

memtest86+ ran for over 15 hours with no errors reported. e2fsck -c
completed with no errors reported.

Any ideas on what I should try next? Considering that this db is not in
production yet, I _do_ have the liberty to rebuild the database if
necessary. Do you have any further recommendations?

thanks again,

Bricklen
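To make the hex point from the quoted reply concrete: the suspect segment numbers differ from the plausible ones only in a leading "9", and a hex 9 is just two bits. A quick check in psql (the literals below are illustrative stand-ins, not the actual page LSNs from this database) shows that XORing 973 against 073 leaves exactly two bits set:

```sql
-- x'...' literals are bit strings in PostgreSQL, and # is bitwise XOR.
-- Compare the bogus-looking segment number (hex 973) with a plausible one (hex 073).
SELECT x'973' # x'073' AS flipped_bits;
--  flipped_bits
-- --------------
--  100100000000   -- only two bits differ, consistent with a couple of
--                 -- flipped bits in RAM or on disk
```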
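If the hardware checks keep coming back clean and salvage is preferred over a rebuild, the zero_damaged_pages route mentioned above would look roughly like the sketch below. This is only an outline: the table name is a placeholder, and every page that gets zeroed means lost rows.

```sql
-- Sketch of a zero_damaged_pages salvage pass (superuser setting; placeholder table name).
SET zero_damaged_pages = on;          -- damaged page headers are zeroed in memory
                                      -- (with a WARNING) instead of raising an error
SELECT count(*) FROM suspect_table;   -- full sequential scan so every page gets read
SET zero_damaged_pages = off;
-- The zeroing is not forced to disk, so dump/reload or recreate the affected
-- tables afterwards rather than trusting the on-disk copy.
```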