Re: Invalid headers and xlog flush failures - Mailing list pgsql-general

From Bricklen Anderson
Subject Re: Invalid headers and xlog flush failures
Msg-id 42039146.7020808@PresiNET.com
In response to Re: Invalid headers and xlog flush failures  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Tom Lane wrote:
> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>
>>>Tom Lane wrote:
>>>
>>>>But anyway, the evidence seems pretty clear that in fact end of WAL is
>>>>in the 73 range, and so those page LSNs with 972 and 973 have to be
>>>>bogus.  I'm back to thinking about dropped bits in RAM or on disk.
>
>
>>memtest86+ ran for over 15 hours with no errors reported.
>>e2fsck -c completed with no errors reported.
>
>
> Hmm ... that's not proof your hardware is ok, but it at least puts the
> ball back in play.
>
>
>>Any ideas on what I should try next? Considering that this db is not
>>in production yet, I _do_ have the liberty to rebuild the database if
>>necessary. Do you have any further recommendations?
>
>
> If the database isn't too large, I'd suggest saving aside a physical
> copy (eg, cp or tar dump taken with postmaster stopped) for forensic
> purposes, and then rebuilding so you can get on with your own work.
>
> One bit of investigation that might be worth doing is to look at every
> single 8K page in the database files and collect information about the
> LSN fields, which are the first 8 bytes of each page.
Do you mean this line from pg_filedump's results:

LSN:  logid     56 recoff 0x3f4be440      Special  8176 (0x1ff0)

If so, I've set up a shell script that loops over all of the files and emits that line.
It's not particularly elegant, but it worked. Again, that's assuming that it was the correct line.
I'll write a perl script to parse out the LSN values to see if any are greater than 116 (which I
believe is the decimal value of hex 74?).

In case anyone wants the script that I ran to get the LSN:
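#!/bin/sh
# For every file in the database directory, dump each block header with
# pg_filedump and record the line that carries the page LSN.

for FILE in /var/postgres/data/base/17235/*; do
         i=0
         echo "$FILE" >> test_file
         while true; do
                 # -R $i limits the dump to block $i; grep fails once we run past the last block
                 str=`pg_filedump -R $i "$FILE" | grep LSN`
                 if [ "$?" -ne 0 ]; then
                         break
                 fi
                 echo "$FILE: $str" >> LSN_out
                 i=$((i+1))
         done
done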
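
As a rough sketch of the parsing step mentioned above (done here in shell with awk rather than perl), something like the following could pick out the suspicious pages from LSN_out. It assumes the pg_filedump line format shown earlier, with the logid value printed in decimal right after the word "logid". The function name filter_lsn and the threshold argument are made up for illustration.

```shell
#!/bin/sh
# Print every line from stdin whose logid exceeds a threshold.
# Assumption: lines look like the pg_filedump output quoted above,
#   <file>: LSN:  logid     56 recoff 0x3f4be440      Special  8176 (0x1ff0)
# and logid is a decimal number. filter_lsn is a hypothetical helper name.

filter_lsn() {
    # $1 = threshold in decimal (for example 116, which is hex 74)
    awk -v max="$1" '
        {
            # Locate the "logid" token and compare the field that follows it.
            for (i = 1; i < NF; i++) {
                if ($i == "logid" && $(i + 1) + 0 > max + 0) {
                    print
                    break
                }
            }
        }'
}

# Example: filter_lsn 116 < LSN_out
```

Scanning for the "logid" token rather than a fixed column keeps the filter working even if the file name prefix contains spaces.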

> In a non-broken database all of these should be less than or equal to the current ending
> WAL offset (which you can get with pg_controldata if the postmaster is
> stopped).  We know there are at least two bad pages, but are there more?
> Is there any pattern to the bad LSN values?  Also it would be useful to
> look at each bad page in some detail to see if there's any evidence of
> corruption extending beyond the LSN value.
>
>             regards, tom lane

NB. I've recreated the database, and saved off the old directory (all 350 gigs of it) so I can dig
into it further.


Thanks again for your help, Tom.

Cheers,

Bricklen
