Thread: Invalid headers and xlog flush failures

Invalid headers and xlog flush failures

From: Bricklen Anderson
Date:
Hi all,

Around a month ago I came across some apparent corruption in one of our databases.
Version: PostgreSQL 8 (originally 8r3, now at 8.0.1), on a Debian box.

The messages that we were originally getting in our syslog were about invalid page headers. After
googling around, then dumping the page with pg_filedump, I decided to drop and recreate the affected
table. This seemed to work for a while, until this message cropped up in the syslog during a heavy load:

Feb  1 11:17:49 dev94 postgres[4959]: [470-2] 2005-02-01 11:17:49 PST> CONTEXT:  writing block 47272
of relation 1663/17235/57800
Feb  1 11:17:49 dev94 postgres[4959]: [471-1] 2005-02-01 11:17:49 PST> WARNING:  could not write
block 47272 of 1663/17235/57800
Feb  1 11:17:49 dev94 postgres[4959]: [471-2] 2005-02-01 11:17:49 PST> DETAIL:  Multiple failures
--- write error may be permanent.
Feb  1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR:  xlog flush request
972/FC932854 is not satisfied --- flushed only to 73/86D2640

This maps to an index. I reindexed it (and several other tables), and 3 hours later restarted my
load process. Shortly after that, the same thing happened again (with different numbers this time):

Feb  1 14:36:05 dev94 postgres[12887]: [626-2] 2005-02-01 14:36:05 PST> CONTEXT:  writing block 7502
of relation 1663/17235/151565
Feb  1 14:36:05 dev94 postgres[12887]: [627-1] 2005-02-01 14:36:05 PST> WARNING:  could not write
block 7502 of 1663/17235/151565
Feb  1 14:36:05 dev94 postgres[12887]: [627-2] 2005-02-01 14:36:05 PST> DETAIL:  Multiple failures
--- write error may be permanent.
Feb  1 14:36:06 dev94 postgres[12887]: [628-1] 2005-02-01 14:36:06 PST> ERROR:  xlog flush request
973/3EF36C2C is not satisfied --- flushed only to 73/419878B4

Both sets are repeated continuously through the syslog.
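For reference, the relation in those messages can be identified from the 1663/17235/57800 path: 1663
is the default tablespace, 17235 the database OID, and 57800 the relfilenode. A lookup along these
lines should show which relation it is (this sketch assumes 17235 is the OID of the dev17 database
queried below):

$ psql -d dev17 -c "SELECT relname, relkind FROM pg_class WHERE relfilenode = 57800;"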

I pursued some references to XID wraparound, but that didn't seem likely from what I could see
(unless I'm misreading the numbers):
SELECT datname, age(datfrozenxid) FROM pg_database where datname='dev17';
 datname |    age
---------+------------
 dev17   | 1074008776


Here is a pg_filedump of 151565:
$ pg_filedump -i -f -R 7502 /var/postgres/data/base/17235/151565

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 3.0
*
* File: /var/postgres/data/base/17235/151565
* Options used: -i -f -R 7502
*
* Dump created on: Tue Feb  1 14:34:14 2005
*******************************************************************

Block 7502 ********************************************************
<Header> -----
  Block Offset: 0x03a9c000         Offsets: Lower     988 (0x03dc)
  Block: Size 8192  Version    2            Upper    3336 (0x0d08)
  LSN:  logid    115 recoff 0x39e855f4      Special  8176 (0x1ff0)
  Items:  242                   Free Space: 2348
  Length (including item array): 992

  Error: Invalid header information.

   0000: 73000000 f455e839 01000000 dc03080d  s....U.9........
   0010: f01f0220 cc912800 e0912800 f4912800  ... ..(...(...(.
   0020: 08922800 1c922800 30922800 44922800  ..(...(.0.(.D.(.
<snipped>

<Data> ------
  Item   1 -- Length:   20  Offset: 4556 (0x11cc)  Flags: USED
   Block Id: 9016  linp Index: 2  Size: 20
   Has Nulls: 0  Has Varwidths: 16384

   11cc: 00003823 02001440 0b000000 022000cf  ..8#...@..... ..
   11dc: 66f06500                             f.e.

  Item   2 -- Length:   20  Offset: 4576 (0x11e0)  Flags: USED
   Block Id: 9571  linp Index: 8  Size: 20
   Has Nulls: 0  Has Varwidths: 16384

   11e0: 00006325 08001440 0b000000 022000cf  ..c%...@..... ..
   11f0: 66f06400                             f.d.

  Item   3 -- Length:   20  Offset: 4596 (0x11f4)  Flags: USED
   Block Id: 9571  linp Index: 3  Size: 20
   Has Nulls: 0  Has Varwidths: 16384

   11f4: 00006325 03001440 0b000000 022000cf  ..c%...@..... ..
   1204: 66f06400                             f.d.
<snipped>

<Special Section> -----
  BTree Index Section:
   Flags: 0x0001 (LEAF)
   Blocks: Previous (1314)  Next (1958)  Level (0)

   1ff0: 22050000 a6070000 00000000 01000000  "...............

*** End of Requested Range Encountered. Last Block Read: 7502 ***


Can anyone suggest what I should try next? If you need more information, I'll happily supply what
I can.


Inline are the changes I made to pg_filedump to get it to compile and work; as a result, I can't
attest to pg_filedump's accuracy (the changes themselves could be the source of those invalid
header messages).
#########################################
--- pg_filedump.c.old       2004-02-23 12:58:58.000000000 -0800
+++ ../pg_filedump-3.0/pg_filedump.c.new 2005-01-31 09:24:36.000000000
-0800
@@ -742,8 +742,8 @@

           printf ("  XID: min (%u)  CMIN|XMAX: %u  CMAX|XVAC: %u\n"
                   "  Block Id: %u  linp Index: %u   Attributes: %d Size: %d\n",
-                 htup->t_xmin, htup->t_field2.t_cmin,
-                 htup->t_field3.t_cmax,
+                 htup->t_choice.t_heap.t_xmin,htup->t_choice.t_heap.t_cmin,
+                 htup->t_choice.t_heap.t_field4.t_cmax,
                   ((uint32)
                    ((htup->t_ctid.ip_blkid.bi_hi << 16) | (uint16) htup->t_ctid.ip_blkid.bi_lo)),
htup->t_ctid.ip_posid,
@@ -768,8 +768,8 @@
             strcat (flagString, "HASCOMPRESSED|");
           if (infoMask & HEAP_HASOID)
             strcat (flagString, "HASOID|");
-         if (infoMask & HEAP_XMAX_IS_XMIN)
-           strcat (flagString, "XMAX_IS_XMIN|");
+         //if (infoMask & HEAP_XMAX_IS_XMIN)
+         //  strcat (flagString, "XMAX_IS_XMIN|");
           if (infoMask & HEAP_XMAX_UNLOGGED)
             strcat (flagString, "XMAX_UNLOGGED|");
           if (infoMask & HEAP_XMIN_COMMITTED)
@@ -1042,7 +1042,7 @@
               "   Previous Checkpoint Record: Log File (%u) Offset (0x%08x)\n"
               "  Last Checkpoint Record Redo: Log File (%u) Offset (0x%08x)\n"
               "                |-       Undo: Log File (%u) Offset (0x%08x)\n"
-             "                |- StartUp ID: %u\n"
+             //"                |- StartUp ID: %u\n"
               "                |-   Next XID: %u\n"
               "                |-   Next OID: %u\n"
               "                 -       Time: %s"
@@ -1064,7 +1064,7 @@
               controlData->prevCheckPoint.xlogid,
               controlData->prevCheckPoint.xrecoff,checkPoint->redo.xlogid,
               checkPoint->redo.xrecoff, checkPoint->undo.xlogid,
-             checkPoint->undo.xrecoff, checkPoint->ThisStartUpID,
+             checkPoint->undo.xrecoff, /*checkPoint->ThisStartUpID,*/
               checkPoint->nextXid, checkPoint->nextOid,
               ctime (&checkPoint->time), controlData->blcksz,
               controlData->relseg_size, controlData->nameDataLen,
#########################################

The bottom three chunks obviously aren't very useful, but a cursory examination of the relevant
headers didn't make it clear what they should be changed to.

Sorry for the long-winded email, and thanks for taking the time to read this far!

Cheers,

Bricklen

Re: Invalid headers and xlog flush failures

From: Tom Lane
Date:
Bricklen Anderson <BAnderson@PresiNET.com> writes:
> Feb  1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR:  xlog flush request
> 972/FC932854 is not satisfied --- flushed only to 73/86D2640

Hmm, have you perhaps played any games with pg_resetxlog in this database?

I would have suggested that maybe this represented on-disk data
corruption, but the appearance of two different but not-too-far-apart
WAL offsets in two different pages suggests that indeed the end of WAL
was up around segment 972 or 973 at one time.  And now it's evidently
ending at 73.  Not good.  What file names do you see in pg_xlog/, and
what does pg_controldata show?

            regards, tom lane

Re: Invalid headers and xlog flush failures

From: Bricklen Anderson
Date:
Tom Lane wrote:
> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>
>>Feb  1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR:  xlog flush request
>>972/FC932854 is not satisfied --- flushed only to 73/86D2640
>
>
> Hmm, have you perhaps played any games with pg_resetxlog in this database?
>
> I would have suggested that maybe this represented on-disk data
> corruption, but the appearance of two different but not-too-far-apart
> WAL offsets in two different pages suggests that indeed the end of WAL
> was up around segment 972 or 973 at one time.  And now it's evidently
> ending at 73.  Not good.  What file names do you see in pg_xlog/, and
> what does pg_controldata show?
>
>             regards, tom lane

Hi Tom,

Nope, never touched pg_resetxlog.
My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks.
There are also these: 000000010000007400000000 to 00000001000000740000000B
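(In those file names, the first 8 hex digits are the timeline, the next 8 the log id, and the last
8 the segment, so they cover log 0x73 segments 0x41 through 0xFE, plus log 0x74 segments 0x00
through 0x0B.)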

$ pg_controldata
pg_control version number:            74
Catalog version number:               200411041
Database system identifier:           4738750823096876774
Database cluster state:               in production
pg_control last modified:             Wed 02 Feb 2005 12:38:22 AM PST
Current log file ID:                  115
Next log file segment:                66
Latest checkpoint location:           73/419A4BDC
Prior checkpoint location:            73/419A4B80
Latest checkpoint's REDO location:    73/419A4BDC
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          4161807
Latest checkpoint's NextOID:          176864
Time of latest checkpoint:            Wed 02 Feb 2005 12:38:22 AM PST
Database block size:                  8192
Blocks per segment of large relation: 131072
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           en_CA
LC_CTYPE:                             en_CA
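(Note that pg_controldata prints the current log file ID and next segment in decimal; 115/66 is
73/42 in hex, which lines up with the latest checkpoint location of 73/419A4BDC.)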


Re: Invalid headers and xlog flush failures

From: Tom Lane
Date:
Bricklen Anderson <BAnderson@PresiNET.com> writes:
> Tom Lane wrote:
>> I would have suggested that maybe this represented on-disk data
>> corruption, but the appearance of two different but not-too-far-apart
>> WAL offsets in two different pages suggests that indeed the end of WAL
>> was up around segment 972 or 973 at one time.

> Nope, never touched pg_resetxlog.
> My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks.
> There are also these: 000000010000007400000000 to 00000001000000740000000B

That seems like rather a lot of files; do you have checkpoint_segments
set to a large value, like 100?  The pg_controldata dump shows that the
latest checkpoint record is in the 73/41 file, so presumably the active
end of WAL isn't exceedingly far past that.  You've got 200 segments
prepared for future activity, which is a bit over the top IMHO.

But anyway, the evidence seems pretty clear that in fact end of WAL is
in the 73 range, and so those page LSNs with 972 and 973 have to be
bogus.  I'm back to thinking about dropped bits in RAM or on disk.
IIRC these numbers are all hex, so the extra "9" could come from just
two bits getting turned on that should not be.  Might be time to run
memtest86 and/or badblocks.
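(For reference on the two-bit point: 0x073 is 0000 0111 0011 in binary and 0x973 is 1001 0111 0011;
only bits 8 and 11 differ.)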

            regards, tom lane

Re: Invalid headers and xlog flush failures

From: Bricklen Anderson
Date:
Tom Lane wrote:
> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>
>>Tom Lane wrote:
>>
>>>I would have suggested that maybe this represented on-disk data
>>>corruption, but the appearance of two different but not-too-far-apart
>>>WAL offsets in two different pages suggests that indeed the end of WAL
>>>was up around segment 972 or 973 at one time.
>
>
>>Nope, never touched pg_resetxlog.
>>My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks.
>>There are also these: 000000010000007400000000 to 00000001000000740000000B
>
>
> That seems like rather a lot of files; do you have checkpoint_segments
> set to a large value, like 100?  The pg_controldata dump shows that the
> latest checkpoint record is in the 73/41 file, so presumably the active
> end of WAL isn't exceedingly far past that.  You've got 200 segments
> prepared for future activity, which is a bit over the top IMHO.
>
> But anyway, the evidence seems pretty clear that in fact end of WAL is
> in the 73 range, and so those page LSNs with 972 and 973 have to be
> bogus.  I'm back to thinking about dropped bits in RAM or on disk.
> IIRC these numbers are all hex, so the extra "9" could come from just
> two bits getting turned on that should not be.  Might be time to run
> memtest86 and/or badblocks.
>
>             regards, tom lane

Yes, checkpoint_segments is set to 100, although I can set that lower if you feel that is more
appropriate. Currently, the system receives around 5-8 million inserts per day (across 3 primary
tables), so I was leaning towards the "more is better" philosophy.

We ran e2fsck with the badblocks option last week and didn't turn anything up, along with a couple
of passes with memtest. I will run a full-scale memtest and post any interesting results.

I've also read that kill -9 on the postmaster is "not a good thing". I honestly can't vouch for
whether that occurred around the time this database was initially created. It's possible, since
this db started its life as a development db at 8r3, was bumped to 8r5, then on to 8 final, where
it has become a dev-final db.

Assuming that the memtest passes cleanly, as does another run of badblocks, do you have any more
suggestions on how I should proceed? Should I run for a while with zero_damaged_pages set to true
and accept the data loss, or just recreate the whole db from scratch?

Thanks again for your help.


Cheers,

Bricklen


Re: Invalid headers and xlog flush failures

From: Bricklen Anderson
Date:
Bricklen Anderson wrote:
> Tom Lane wrote:
>
>> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>>
>>> Tom Lane wrote:
>>>
>>>> I would have suggested that maybe this represented on-disk data
>>>> corruption, but the appearance of two different but not-too-far-apart
>>>> WAL offsets in two different pages suggests that indeed the end of WAL
>>>> was up around segment 972 or 973 at one time.
>>
>>
>>
>>> Nope, never touched pg_resetxlog.
>>> My pg_xlog list ranges from 000000010000007300000041 to
>>> 0000000100000073000000FE, with no breaks. There are also these:
>>> 000000010000007400000000 to 00000001000000740000000B
>>
>>
>>
>> That seems like rather a lot of files; do you have checkpoint_segments
>> set to a large value, like 100?  The pg_controldata dump shows that the
>> latest checkpoint record is in the 73/41 file, so presumably the active
>> end of WAL isn't exceedingly far past that.  You've got 200 segments
>> prepared for future activity, which is a bit over the top IMHO.
>>
>> But anyway, the evidence seems pretty clear that in fact end of WAL is
>> in the 73 range, and so those page LSNs with 972 and 973 have to be
>> bogus.  I'm back to thinking about dropped bits in RAM or on disk.
>> IIRC these numbers are all hex, so the extra "9" could come from just
>> two bits getting turned on that should not be.  Might be time to run
>> memtest86 and/or badblocks.
>>
>>             regards, tom lane
>
>
> Yes, checkpoint_segments is set to 100, although I can set that lower if
> you feel that that is more appropriate. Currently, the system receives
> around 5-8 million inserts per day (across 3 primary tables), so I was
> leaning towards the "more is better" philosophy.
>
> We ran e2fsck with badblocks option last week and didn't turn anything
> up, along with a couple of passes with memtest. I will run a full-scale
> memtest and post any interesting results.
>
> I've also read that kill -9 postmaster is "not a good thing". I honestly
> can't vouch for whether or not this may or may not have occurred around
> the time of the initial creation of this database. It's possible, since
> this db started its life as a development db at 8r3 then was bumped to
> 8r5, then on to 8 final where it has become a dev-final db.
>
> Assuming that the memtest passes cleanly, as does another run of
> badblocks, do you have any more suggestions on how I should proceed?
> Should I run for a while with zero_damaged_pages set to true and accept
> the data loss, or just recreate the whole db from scratch?
>

memtest86+ ran for over 15 hours with no errors reported.
e2fsck -c completed with no errors reported.

Any ideas on what I should try next? Considering that this db is not in production yet, I _do_ have
the liberty to rebuild the database if necessary. Do you have any further recommendations?

thanks again,

Bricklen

Re: Invalid headers and xlog flush failures

From: Tom Lane
Date:
Bricklen Anderson <BAnderson@PresiNET.com> writes:
>> Tom Lane wrote:
>>> But anyway, the evidence seems pretty clear that in fact end of WAL is
>>> in the 73 range, and so those page LSNs with 972 and 973 have to be
>>> bogus.  I'm back to thinking about dropped bits in RAM or on disk.

> memtest86+ ran for over 15 hours with no errors reported.
> e2fsck -c completed with no errors reported.

Hmm ... that's not proof your hardware is ok, but it at least puts the
ball back in play.

> Any ideas on what I should try next? Considering that this db is not
> in production yet, I _do_ have the liberty to rebuild the database if
> necessary. Do you have any further recommendations?

If the database isn't too large, I'd suggest saving aside a physical
copy (eg, cp or tar dump taken with postmaster stopped) for forensic
purposes, and then rebuilding so you can get on with your own work.

One bit of investigation that might be worth doing is to look at every
single 8K page in the database files and collect information about the
LSN fields, which are the first 8 bytes of each page.  In a non-broken
database all of these should be less than or equal to the current ending
WAL offset (which you can get with pg_controldata if the postmaster is
stopped).  We know there are at least two bad pages, but are there more?
Is there any pattern to the bad LSN values?  Also it would be useful to
look at each bad page in some detail to see if there's any evidence of
corruption extending beyond the LSN value.
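A rough sketch of such a scan (assuming the 8K block size shown by pg_controldata and a
little-endian box; the first hex word printed for each block is the LSN's xlogid, so anything
above 00000074 here would be suspect):

#!/bin/sh
# Print the page LSN (first 8 bytes: xlogid, then xrecoff) of every
# 8K block of every relation file in the database directory.
for FILE in /var/postgres/data/base/17235/*; do
        SIZE=`stat -c %s "$FILE"`
        BLOCKS=$(( SIZE / 8192 ))
        i=0
        while [ $i -lt $BLOCKS ]; do
                LSN=`dd if="$FILE" bs=8192 skip=$i count=1 2>/dev/null \
                        | od -A n -N 8 -t x4`
                echo "$FILE block $i: $LSN"
                i=$((i+1))
        done
done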

            regards, tom lane

Re: Invalid headers and xlog flush failures

From: Alban Hertroys
Date:
Bricklen Anderson wrote:
> Any ideas on what I should try next? Considering that this db is not in
> production yet, I _do_ have the liberty to rebuild the database if
> necessary. Do you have any further recommendations?

I recall reading something in this ML about problems with the way that
ext3 recovers a dirty file system; could it be related?

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl

Re: Invalid headers and xlog flush failures

From: Bricklen Anderson
Date:
Alban Hertroys wrote:
> Bricklen Anderson wrote:
>
>> Any ideas on what I should try next? Considering that this db is not
>> in production yet, I _do_ have the liberty to rebuild the database if
>> necessary. Do you have any further recommendations?
>
>
> I recall reading something in this ML about problems with the way that
> Ext3 FS recovers a dirty file system, could it be related?
>
I really have no idea, but we _are_ running an ext3 fs on this particular server.


Re: Invalid headers and xlog flush failures

From: Bricklen Anderson
Date:
Tom Lane wrote:
> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>
>>>Tom Lane wrote:
>>>
>>>>But anyway, the evidence seems pretty clear that in fact end of WAL is
>>>>in the 73 range, and so those page LSNs with 972 and 973 have to be
>>>>bogus.  I'm back to thinking about dropped bits in RAM or on disk.
>
>
>>memtest86+ ran for over 15 hours with no errors reported.
>>e2fsck -c completed with no errors reported.
>
>
> Hmm ... that's not proof your hardware is ok, but it at least puts the
> ball back in play.
>
>
>>Any ideas on what I should try next? Considering that this db is not
>>in production yet, I _do_ have the liberty to rebuild the database if
>>necessary. Do you have any further recommendations?
>
>
> If the database isn't too large, I'd suggest saving aside a physical
> copy (eg, cp or tar dump taken with postmaster stopped) for forensic
> purposes, and then rebuilding so you can get on with your own work.
>
> One bit of investigation that might be worth doing is to look at every
> single 8K page in the database files and collect information about the
> LSN fields, which are the first 8 bytes of each page.
Do you mean this line from pg_filedump's results:

LSN:  logid     56 recoff 0x3f4be440      Special  8176 (0x1ff0)

If so, I've set up a shell script that looped over all of the files and emitted that line.
It's not particularly elegant, but it worked (again, assuming that was the correct line).
I'll write a perl script to parse out the LSN values to see if any are greater than 116 (decimal
for hex 74, I believe?); see the filter sketch after the script below.

In case anyone wants the script that I ran to get the LSN:
#!/bin/sh
# Loop over every relation file, dumping each block's LSN line with
# pg_filedump until it runs past the last block of the file.
for FILE in /var/postgres/data/base/17235/*; do
         i=0
         echo "$FILE" >> test_file
         while true; do
                 str=`pg_filedump -R $i "$FILE" | grep LSN`
                 if [ $? -ne 0 ]; then
                         break
                 fi
                 echo "$FILE: $str" >> LSN_out
                 i=$((i+1))
         done
done
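A quick filter over that output would do the check; logid is printed in decimal by pg_filedump, so
anything above 116 (hex 74) is suspect. Something along these lines should work on the LSN_out
format produced above:

$ awk '$3 == "logid" && $4 > 116' LSN_out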

> In a non-broken database all of these should be less than or equal to the current ending
> WAL offset (which you can get with pg_controldata if the postmaster is
> stopped).  We know there are at least two bad pages, but are there more?
> Is there any pattern to the bad LSN values?  Also it would be useful to
> look at each bad page in some detail to see if there's any evidence of
> corruption extending beyond the LSN value.
>
>             regards, tom lane

NB. I've recreated the database, and saved off the old directory (all 350 gigs of it) so I can dig
into it further.


Thanks again for your help, Tom.

Cheers,

Bricklen