Invalid headers and xlog flush failures - Mailing list pgsql-general

From Bricklen Anderson
Subject Invalid headers and xlog flush failures
Date
Msg-id 420103A4.5060104@PresiNET.com
Whole thread Raw
Responses Re: Invalid headers and xlog flush failures
List pgsql-general
Hi all,

I recently came across some apparent corruption in one of our databases around a month ago.
version: postgresql 8 (originally 8r3, now at 8.0.1), debian box

The messages that we were originally getting in our syslog were about invalid page headers. After
googling around, then dumping the page with pg_filedump, I decided to drop and recreate the affected
table. This seemed to work for a while, until this message cropped up in the syslog during a heavy load:

Feb  1 11:17:49 dev94 postgres[4959]: [470-2] 2005-02-01 11:17:49 PST> CONTEXT:  writing block 47272
of relation 1663/17235/57800
Feb  1 11:17:49 dev94 postgres[4959]: [471-1] 2005-02-01 11:17:49 PST> WARNING:  could not write
block 47272 of 1663/17235/57800
Feb  1 11:17:49 dev94 postgres[4959]: [471-2] 2005-02-01 11:17:49 PST> DETAIL:  Multiple failures
--- write error may be permanent.
Feb  1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR:  xlog flush request
972/FC932854 is not satisfied --- flushed only to 73/86D2640

This maps to an index. I reindexed it (and several other tables), and a 3 hours later, restarted my
load process. Shortly after that, the same thing happened again (with different numbers this time):

Feb  1 14:36:05 dev94 postgres[12887]: [626-2] 2005-02-01 14:36:05 PST> CONTEXT:  writing block 7502
of relation 1663/17235/151565
Feb  1 14:36:05 dev94 postgres[12887]: [627-1] 2005-02-01 14:36:05 PST> WARNING:  could not write
block 7502 of 1663/17235/151565
Feb  1 14:36:05 dev94 postgres[12887]: [627-2] 2005-02-01 14:36:05 PST> DETAIL:  Multiple failures
--- write error may be permanent.
Feb  1 14:36:06 dev94 postgres[12887]: [628-1] 2005-02-01 14:36:06 PST> ERROR:  xlog flush request
973/3EF36C2C is not satisfied --- flushed only to 73/419878B4

Both sets are repeated continuously through the syslog.

I pursued some references to XID wraparound, but that didn't seem likely from what I could see
(unless I'm misreading the numbers)
SELECT datname, age(datfrozenxid) FROM pg_database where datname='dev17';
  datname |    age
---------+------------
  dev17  | 1074008776


Here is a pg_filedump of 151565:
$pg_filedump -i -f -R 7502 /var/postgres/data/base/17235/151565

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 3.0
*
* File: /var/postgres/data/base/17235/151565
* Options used: -i -f -R 7502
*
* Dump created on: Tue Feb  1 14:34:14 2005
*******************************************************************

Block 7502 ********************************************************
<Header> -----
  Block Offset: 0x03a9c000         Offsets: Lower     988 (0x03dc)
  Block: Size 8192  Version    2            Upper    3336 (0x0d08)
  LSN:  logid    115 recoff 0x39e855f4      Special  8176 (0x1ff0)
  Items:  242                   Free Space: 2348
  Length (including item array): 992

  Error: Invalid header information.

   0000: 73000000 f455e839 01000000 dc03080d  s....U.9........
   0010: f01f0220 cc912800 e0912800 f4912800  ... ..(...(...(.
   0020: 08922800 1c922800 30922800 44922800  ..(...(.0.(.D.(.
<snipped>

<Data> ------
  Item   1 -- Length:   20  Offset: 4556 (0x11cc)  Flags: USED
   Block Id: 9016  linp Index: 2  Size: 20
   Has Nulls: 0  Has Varwidths: 16384

   11cc: 00003823 02001440 0b000000 022000cf  ..8#...@..... ..
   11dc: 66f06500                             f.e.

  Item   2 -- Length:   20  Offset: 4576 (0x11e0)  Flags: USED
   Block Id: 9571  linp Index: 8  Size: 20
   Has Nulls: 0  Has Varwidths: 16384

   11e0: 00006325 08001440 0b000000 022000cf  ..c%...@..... ..
   11f0: 66f06400                             f.d.

  Item   3 -- Length:   20  Offset: 4596 (0x11f4)  Flags: USED
   Block Id: 9571  linp Index: 3  Size: 20
   Has Nulls: 0  Has Varwidths: 16384

   11f4: 00006325 03001440 0b000000 022000cf  ..c%...@..... ..
   1204: 66f06400                             f.d.
<snipped>

<Special Section> -----
  BTree Index Section:
   Flags: 0x0001 (LEAF)
   Blocks: Previous (1314)  Next (1958)  Level (0)

   1ff0: 22050000 a6070000 00000000 01000000  "...............

*** End of Requested Range Encountered. Last Block Read: 7502 ***


Can anyone suggest what I should try next, or if you need more information, I'll happily supply what
I can.


Inline are the changes I made to pg_filedump to get it to compile and work, as such I can't attest
to pg_filedump's accuracy (which could be the source of those invalid header messages)
#########################################
--- pg_filedump.c.old       2004-02-23 12:58:58.000000000 -0800
+++ ../pg_filedump-3.0/pg_filedump.c.new 2005-01-31 09:24:36.000000000
-0800
@@ -742,8 +742,8 @@

           printf ("  XID: min (%u)  CMIN|XMAX: %u  CMAX|XVAC: %u\n"
                   "  Block Id: %u  linp Index: %u   Attributes: %d Size: %d\n",
-                 htup->t_xmin, htup->t_field2.t_cmin,
-                 htup->t_field3.t_cmax,
+                 htup->t_choice.t_heap.t_xmin,htup->t_choice.t_heap.t_cmin,
+                 htup->t_choice.t_heap.t_field4.t_cmax,
                   ((uint32)
                    ((htup->t_ctid.ip_blkid.bi_hi << 16) | (uint16) htup->t_ctid.ip_blkid.bi_lo)),
htup->t_ctid.ip_posid,
@@ -768,8 +768,8 @@
             strcat (flagString, "HASCOMPRESSED|");
           if (infoMask & HEAP_HASOID)
             strcat (flagString, "HASOID|");
-         if (infoMask & HEAP_XMAX_IS_XMIN)
-           strcat (flagString, "XMAX_IS_XMIN|");
+         //if (infoMask & HEAP_XMAX_IS_XMIN)
+         //  strcat (flagString, "XMAX_IS_XMIN|");
           if (infoMask & HEAP_XMAX_UNLOGGED)
             strcat (flagString, "XMAX_UNLOGGED|");
           if (infoMask & HEAP_XMIN_COMMITTED)
@@ -1042,7 +1042,7 @@
               "   Previous Checkpoint Record: Log File (%u) Offset (0x%08x)\n"
               "  Last Checkpoint Record Redo: Log File (%u) Offset (0x%08x)\n"
               "                |-       Undo: Log File (%u) Offset (0x%08x)\n"
-             "                |- StartUp ID: %u\n"
+             //"                |- StartUp ID: %u\n"
               "                |-   Next XID: %u\n"
               "                |-   Next OID: %u\n"
               "                 -       Time: %s"
@@ -1064,7 +1064,7 @@
               controlData->prevCheckPoint.xlogid,
               controlData->prevCheckPoint.xrecoff,checkPoint->redo.xlogid,
               checkPoint->redo.xrecoff, checkPoint->undo.xlogid,
-             checkPoint->undo.xrecoff, checkPoint->ThisStartUpID,
+             checkPoint->undo.xrecoff, /*checkPoint->ThisStartUpID,*/
               checkPoint->nextXid, checkPoint->nextOid,
               ctime (&checkPoint->time), controlData->blcksz,
               controlData->relseg_size, controlData->nameDataLen,
#########################################

Obviously, the bottom 3 chunks aren't very useful, but a cursory examination of the relevant headers
didn't make it obvious what these should be changed to.

Sorry for the long-winded email, and thanks for taking the time to read this far!

Cheers,

Bricklen

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: simple case syntax oddity in 8.0.0
Next
From: Tom Lane
Date:
Subject: Re: basic pg lock question