Invalid headers and xlog flush failures - Mailing list pgsql-general
From | Bricklen Anderson |
---|---|
Subject | Invalid headers and xlog flush failures |
Date | |
Msg-id | 420103A4.5060104@PresiNET.com Whole thread Raw |
Responses |
Re: Invalid headers and xlog flush failures
|
List | pgsql-general |
Hi all, I recently came across some apparent corruption in one of our databases around a month ago. version: postgresql 8 (originally 8r3, now at 8.0.1), debian box The messages that we were originally getting in our syslog were about invalid page headers. After googling around, then dumping the page with pg_filedump, I decided to drop and recreate the affected table. This seemed to work for a while, until this message cropped up in the syslog during a heavy load: Feb 1 11:17:49 dev94 postgres[4959]: [470-2] 2005-02-01 11:17:49 PST> CONTEXT: writing block 47272 of relation 1663/17235/57800 Feb 1 11:17:49 dev94 postgres[4959]: [471-1] 2005-02-01 11:17:49 PST> WARNING: could not write block 47272 of 1663/17235/57800 Feb 1 11:17:49 dev94 postgres[4959]: [471-2] 2005-02-01 11:17:49 PST> DETAIL: Multiple failures --- write error may be permanent. Feb 1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR: xlog flush request 972/FC932854 is not satisfied --- flushed only to 73/86D2640 This maps to an index. I reindexed it (and several other tables), and a 3 hours later, restarted my load process. Shortly after that, the same thing happened again (with different numbers this time): Feb 1 14:36:05 dev94 postgres[12887]: [626-2] 2005-02-01 14:36:05 PST> CONTEXT: writing block 7502 of relation 1663/17235/151565 Feb 1 14:36:05 dev94 postgres[12887]: [627-1] 2005-02-01 14:36:05 PST> WARNING: could not write block 7502 of 1663/17235/151565 Feb 1 14:36:05 dev94 postgres[12887]: [627-2] 2005-02-01 14:36:05 PST> DETAIL: Multiple failures --- write error may be permanent. Feb 1 14:36:06 dev94 postgres[12887]: [628-1] 2005-02-01 14:36:06 PST> ERROR: xlog flush request 973/3EF36C2C is not satisfied --- flushed only to 73/419878B4 Both sets are repeated continuously through the syslog. I pursued some references to XID wraparound, but that didn't seem likely from what I could see (unless I'm misreading the numbers) SELECT datname, age(datfrozenxid) FROM pg_database where datname='dev17'; datname | age ---------+------------ dev17 | 1074008776 Here is a pg_filedump of 151565: $pg_filedump -i -f -R 7502 /var/postgres/data/base/17235/151565 ******************************************************************* * PostgreSQL File/Block Formatted Dump Utility - Version 3.0 * * File: /var/postgres/data/base/17235/151565 * Options used: -i -f -R 7502 * * Dump created on: Tue Feb 1 14:34:14 2005 ******************************************************************* Block 7502 ******************************************************** <Header> ----- Block Offset: 0x03a9c000 Offsets: Lower 988 (0x03dc) Block: Size 8192 Version 2 Upper 3336 (0x0d08) LSN: logid 115 recoff 0x39e855f4 Special 8176 (0x1ff0) Items: 242 Free Space: 2348 Length (including item array): 992 Error: Invalid header information. 0000: 73000000 f455e839 01000000 dc03080d s....U.9........ 0010: f01f0220 cc912800 e0912800 f4912800 ... ..(...(...(. 0020: 08922800 1c922800 30922800 44922800 ..(...(.0.(.D.(. <snipped> <Data> ------ Item 1 -- Length: 20 Offset: 4556 (0x11cc) Flags: USED Block Id: 9016 linp Index: 2 Size: 20 Has Nulls: 0 Has Varwidths: 16384 11cc: 00003823 02001440 0b000000 022000cf ..8#...@..... .. 11dc: 66f06500 f.e. Item 2 -- Length: 20 Offset: 4576 (0x11e0) Flags: USED Block Id: 9571 linp Index: 8 Size: 20 Has Nulls: 0 Has Varwidths: 16384 11e0: 00006325 08001440 0b000000 022000cf ..c%...@..... .. 11f0: 66f06400 f.d. Item 3 -- Length: 20 Offset: 4596 (0x11f4) Flags: USED Block Id: 9571 linp Index: 3 Size: 20 Has Nulls: 0 Has Varwidths: 16384 11f4: 00006325 03001440 0b000000 022000cf ..c%...@..... .. 1204: 66f06400 f.d. <snipped> <Special Section> ----- BTree Index Section: Flags: 0x0001 (LEAF) Blocks: Previous (1314) Next (1958) Level (0) 1ff0: 22050000 a6070000 00000000 01000000 "............... *** End of Requested Range Encountered. Last Block Read: 7502 *** Can anyone suggest what I should try next, or if you need more information, I'll happily supply what I can. Inline are the changes I made to pg_filedump to get it to compile and work, as such I can't attest to pg_filedump's accuracy (which could be the source of those invalid header messages) ######################################### --- pg_filedump.c.old 2004-02-23 12:58:58.000000000 -0800 +++ ../pg_filedump-3.0/pg_filedump.c.new 2005-01-31 09:24:36.000000000 -0800 @@ -742,8 +742,8 @@ printf (" XID: min (%u) CMIN|XMAX: %u CMAX|XVAC: %u\n" " Block Id: %u linp Index: %u Attributes: %d Size: %d\n", - htup->t_xmin, htup->t_field2.t_cmin, - htup->t_field3.t_cmax, + htup->t_choice.t_heap.t_xmin,htup->t_choice.t_heap.t_cmin, + htup->t_choice.t_heap.t_field4.t_cmax, ((uint32) ((htup->t_ctid.ip_blkid.bi_hi << 16) | (uint16) htup->t_ctid.ip_blkid.bi_lo)), htup->t_ctid.ip_posid, @@ -768,8 +768,8 @@ strcat (flagString, "HASCOMPRESSED|"); if (infoMask & HEAP_HASOID) strcat (flagString, "HASOID|"); - if (infoMask & HEAP_XMAX_IS_XMIN) - strcat (flagString, "XMAX_IS_XMIN|"); + //if (infoMask & HEAP_XMAX_IS_XMIN) + // strcat (flagString, "XMAX_IS_XMIN|"); if (infoMask & HEAP_XMAX_UNLOGGED) strcat (flagString, "XMAX_UNLOGGED|"); if (infoMask & HEAP_XMIN_COMMITTED) @@ -1042,7 +1042,7 @@ " Previous Checkpoint Record: Log File (%u) Offset (0x%08x)\n" " Last Checkpoint Record Redo: Log File (%u) Offset (0x%08x)\n" " |- Undo: Log File (%u) Offset (0x%08x)\n" - " |- StartUp ID: %u\n" + //" |- StartUp ID: %u\n" " |- Next XID: %u\n" " |- Next OID: %u\n" " - Time: %s" @@ -1064,7 +1064,7 @@ controlData->prevCheckPoint.xlogid, controlData->prevCheckPoint.xrecoff,checkPoint->redo.xlogid, checkPoint->redo.xrecoff, checkPoint->undo.xlogid, - checkPoint->undo.xrecoff, checkPoint->ThisStartUpID, + checkPoint->undo.xrecoff, /*checkPoint->ThisStartUpID,*/ checkPoint->nextXid, checkPoint->nextOid, ctime (&checkPoint->time), controlData->blcksz, controlData->relseg_size, controlData->nameDataLen, ######################################### Obviously, the bottom 3 chunks aren't very useful, but a cursory examination of the relevant headers didn't make it obvious what these should be changed to. Sorry for the long-winded email, and thanks for taking the time to read this far! Cheers, Bricklen
pgsql-general by date: