Tom Lane wrote:
> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>
>>>Tom Lane wrote:
>>>
>>>>But anyway, the evidence seems pretty clear that in fact end of WAL is
>>>>in the 73 range, and so those page LSNs with 972 and 973 have to be
>>>>bogus. I'm back to thinking about dropped bits in RAM or on disk.
>
>
>>memtest86+ ran for over 15 hours with no errors reported.
>>e2fsck -c completed with no errors reported.
>
>
> Hmm ... that's not proof your hardware is ok, but it at least puts the
> ball back in play.
>
>
>>Any ideas on what I should try next? Considering that this db is not
>>in production yet, I _do_ have the liberty to rebuild the database if
>>necessary. Do you have any further recommendations?
>
>
> If the database isn't too large, I'd suggest saving aside a physical
> copy (eg, cp or tar dump taken with postmaster stopped) for forensic
> purposes, and then rebuilding so you can get on with your own work.
>
> One bit of investigation that might be worth doing is to look at every
> single 8K page in the database files and collect information about the
> LSN fields, which are the first 8 bytes of each page.
Do you mean this line from pg_filedump's results:
LSN: logid 56 recoff 0x3f4be440 Special 8176 (0x1ff0)
If so, I've set up a shell script that looped all of the files and emitted that line.
It's not particularly elegant, but it worked. Again, that's assuming that it was the correct line.
I'll write a perl script to parse out the LSN values to see if any are greater than 116 (which I
believe is the hex of 74?).
In case anyone wants the script that I ran to get the LSN:
#!/bin/sh
for FILE in /var/postgres/data/base/17235/*; do
i=0
echo $FILE >> test_file;
while [ 1==1 ]; do
str=`pg_filedump -R $i $FILE | grep LSN`;
if [ "$?" -eq "1" ]; then
break
fi
echo "$FILE: $str" >> LSN_out;
i=$((i+1));
done
done
> In a non-broken database all of these should be less than or equal to the current ending
> WAL offset (which you can get with pg_controldata if the postmaster is
> stopped). We know there are at least two bad pages, but are there more?
> Is there any pattern to the bad LSN values? Also it would be useful to
> look at each bad page in some detail to see if there's any evidence of
> corruption extending beyond the LSN value.
>
> regards, tom lane
NB. I've recreated the database, and saved off the old directory (all 350 gigs of it) so I can dig
into it further.
Thanks again for you help, Tom.
Cheers,
Bricklen