Thread: Invalid headers and xlog flush failures
Hi all,

I recently came across some apparent corruption in one of our databases around a month ago.

Version: PostgreSQL 8 (originally 8r3, now at 8.0.1), on a Debian box.

The messages that we were originally getting in our syslog were about invalid page headers. After googling around, then dumping the page with pg_filedump, I decided to drop and recreate the affected table. This seemed to work for a while, until this message cropped up in the syslog during a heavy load:

Feb 1 11:17:49 dev94 postgres[4959]: [470-2] 2005-02-01 11:17:49 PST> CONTEXT: writing block 47272 of relation 1663/17235/57800
Feb 1 11:17:49 dev94 postgres[4959]: [471-1] 2005-02-01 11:17:49 PST> WARNING: could not write block 47272 of 1663/17235/57800
Feb 1 11:17:49 dev94 postgres[4959]: [471-2] 2005-02-01 11:17:49 PST> DETAIL: Multiple failures --- write error may be permanent.
Feb 1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR: xlog flush request 972/FC932854 is not satisfied --- flushed only to 73/86D2640

This maps to an index. I reindexed it (and several other tables), and 3 hours later restarted my load process. Shortly after that, the same thing happened again (with different numbers this time):

Feb 1 14:36:05 dev94 postgres[12887]: [626-2] 2005-02-01 14:36:05 PST> CONTEXT: writing block 7502 of relation 1663/17235/151565
Feb 1 14:36:05 dev94 postgres[12887]: [627-1] 2005-02-01 14:36:05 PST> WARNING: could not write block 7502 of 1663/17235/151565
Feb 1 14:36:05 dev94 postgres[12887]: [627-2] 2005-02-01 14:36:05 PST> DETAIL: Multiple failures --- write error may be permanent.
Feb 1 14:36:06 dev94 postgres[12887]: [628-1] 2005-02-01 14:36:06 PST> ERROR: xlog flush request 973/3EF36C2C is not satisfied --- flushed only to 73/419878B4

Both sets are repeated continuously through the syslog. I pursued some references to XID wraparound, but that didn't seem likely from what I could see (unless I'm misreading the numbers):

SELECT datname, age(datfrozenxid) FROM pg_database WHERE datname='dev17';
 datname |    age
---------+------------
 dev17   | 1074008776

Here is a pg_filedump of 151565:

$ pg_filedump -i -f -R 7502 /var/postgres/data/base/17235/151565

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 3.0
*
* File: /var/postgres/data/base/17235/151565
* Options used: -i -f -R 7502
*
* Dump created on: Tue Feb 1 14:34:14 2005
*******************************************************************

Block 7502 ********************************************************
<Header> -----
 Block Offset: 0x03a9c000         Offsets: Lower     988 (0x03dc)
 Block: Size 8192  Version    2            Upper    3336 (0x0d08)
 LSN:  logid    115 recoff 0x39e855f4      Special  8176 (0x1ff0)
 Items:  242                      Free Space: 2348
 Length (including item array): 992

 Error: Invalid header information.

  0000: 73000000 f455e839 01000000 dc03080d  s....U.9........
  0010: f01f0220 cc912800 e0912800 f4912800  ... ..(...(...(.
  0020: 08922800 1c922800 30922800 44922800  ..(...(.0.(.D.(.

<snipped>

<Data> ------
 Item   1 -- Length:   20  Offset: 4556 (0x11cc)  Flags: USED
  Block Id: 9016  linp Index: 2  Size: 20
  Has Nulls: 0  Has Varwidths: 16384

  11cc: 00003823 02001440 0b000000 022000cf  ..8#...@..... ..
  11dc: 66f06500                             f.e.

 Item   2 -- Length:   20  Offset: 4576 (0x11e0)  Flags: USED
  Block Id: 9571  linp Index: 8  Size: 20
  Has Nulls: 0  Has Varwidths: 16384

  11e0: 00006325 08001440 0b000000 022000cf  ..c%...@..... ..
  11f0: 66f06400                             f.d.
 Item   3 -- Length:   20  Offset: 4596 (0x11f4)  Flags: USED
  Block Id: 9571  linp Index: 3  Size: 20
  Has Nulls: 0  Has Varwidths: 16384

  11f4: 00006325 03001440 0b000000 022000cf  ..c%...@..... ..
  1204: 66f06400                             f.d.

<snipped>

<Special Section> -----
 BTree Index Section:
  Flags: 0x0001 (LEAF)
  Blocks:  Previous (1314)  Next (1958)  Level (0)

  1ff0: 22050000 a6070000 00000000 01000000  "...............

*** End of Requested Range Encountered. Last Block Read: 7502 ***

Can anyone suggest what I should try next? If you need more information, I'll happily supply what I can.

Inline are the changes I made to pg_filedump to get it to compile and work, so I can't attest to pg_filedump's accuracy (which could be the source of those invalid header messages).

#########################################
--- pg_filedump.c.old	2004-02-23 12:58:58.000000000 -0800
+++ ../pg_filedump-3.0/pg_filedump.c.new	2005-01-31 09:24:36.000000000 -0800
@@ -742,8 +742,8 @@
       printf (" XID: min (%u) CMIN|XMAX: %u CMAX|XVAC: %u\n"
 	      " Block Id: %u linp Index: %u Attributes: %d Size: %d\n",
-	      htup->t_xmin, htup->t_field2.t_cmin,
-	      htup->t_field3.t_cmax,
+	      htup->t_choice.t_heap.t_xmin, htup->t_choice.t_heap.t_cmin,
+	      htup->t_choice.t_heap.t_field4.t_cmax,
 	      ((uint32) ((htup->t_ctid.ip_blkid.bi_hi << 16) |
 			 (uint16) htup->t_ctid.ip_blkid.bi_lo)),
 	      htup->t_ctid.ip_posid,
@@ -768,8 +768,8 @@
 	strcat (flagString, "HASCOMPRESSED|");
       if (infoMask & HEAP_HASOID)
 	strcat (flagString, "HASOID|");
-      if (infoMask & HEAP_XMAX_IS_XMIN)
-	strcat (flagString, "XMAX_IS_XMIN|");
+      //if (infoMask & HEAP_XMAX_IS_XMIN)
+      //  strcat (flagString, "XMAX_IS_XMIN|");
       if (infoMask & HEAP_XMAX_UNLOGGED)
 	strcat (flagString, "XMAX_UNLOGGED|");
       if (infoMask & HEAP_XMIN_COMMITTED)
@@ -1042,7 +1042,7 @@
 	  " Previous Checkpoint Record: Log File (%u) Offset (0x%08x)\n"
 	  " Last Checkpoint Record Redo: Log File (%u) Offset (0x%08x)\n"
 	  " |- Undo: Log File (%u) Offset (0x%08x)\n"
-	  " |- StartUp ID: %u\n"
+	  //" |- StartUp ID: %u\n"
 	  " |- Next XID: %u\n"
 	  " |- Next OID: %u\n"
 	  " - Time: %s"
@@ -1064,7 +1064,7 @@
 	  controlData->prevCheckPoint.xlogid, controlData->prevCheckPoint.xrecoff,
 	  checkPoint->redo.xlogid, checkPoint->redo.xrecoff, checkPoint->undo.xlogid,
-	  checkPoint->undo.xrecoff, checkPoint->ThisStartUpID,
+	  checkPoint->undo.xrecoff, /*checkPoint->ThisStartUpID,*/
 	  checkPoint->nextXid, checkPoint->nextOid, ctime (&checkPoint->time),
 	  controlData->blcksz, controlData->relseg_size, controlData->nameDataLen,
#########################################

Obviously, the bottom 3 chunks aren't very useful, but a cursory examination of the relevant headers didn't make it obvious what these should be changed to.

Sorry for the long-winded email, and thanks for taking the time to read this far!

Cheers,

Bricklen
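[Editorial note: one way to read the dump above is to decode the first 8 bytes of the page by hand. In the block header hex, the bytes 73000000 f455e839 are the page LSN that pg_filedump prints as "logid 115 recoff 0x39e855f4"; decoding them as two little-endian 32-bit integers (x86 byte order assumed) confirms the correspondence. This becomes relevant later in the thread when page LSNs are compared against the end of WAL. A minimal sketch, not part of the original email:]

# Decode the first 8 bytes of the dumped page as its LSN (logid, recoff),
# assuming a little-endian machine. Expected output: 115 0x39e855f4
import struct

first_eight = bytes.fromhex("73000000f455e839")
logid, recoff = struct.unpack("<II", first_eight)
print(logid, hex(recoff))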
Bricklen Anderson <BAnderson@PresiNET.com> writes:
> Feb 1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR: xlog flush request
> 972/FC932854 is not satisfied --- flushed only to 73/86D2640

Hmm, have you perhaps played any games with pg_resetxlog in this database?

I would have suggested that maybe this represented on-disk data
corruption, but the appearance of two different but not-too-far-apart
WAL offsets in two different pages suggests that indeed the end of WAL
was up around segment 972 or 973 at one time. And now it's evidently
ending at 73. Not good. What file names do you see in pg_xlog/, and
what does pg_controldata show?

			regards, tom lane
Tom Lane wrote:
> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>> Feb 1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR: xlog flush request
>> 972/FC932854 is not satisfied --- flushed only to 73/86D2640
>
> Hmm, have you perhaps played any games with pg_resetxlog in this database?
>
> I would have suggested that maybe this represented on-disk data
> corruption, but the appearance of two different but not-too-far-apart
> WAL offsets in two different pages suggests that indeed the end of WAL
> was up around segment 972 or 973 at one time. And now it's evidently
> ending at 73. Not good. What file names do you see in pg_xlog/, and
> what does pg_controldata show?
>
> 			regards, tom lane

Hi Tom,

Nope, never touched pg_resetxlog.

My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks. There are also these: 000000010000007400000000 to 00000001000000740000000B

$ pg_controldata
pg_control version number:            74
Catalog version number:               200411041
Database system identifier:           4738750823096876774
Database cluster state:               in production
pg_control last modified:             Wed 02 Feb 2005 12:38:22 AM PST
Current log file ID:                  115
Next log file segment:                66
Latest checkpoint location:           73/419A4BDC
Prior checkpoint location:            73/419A4B80
Latest checkpoint's REDO location:    73/419A4BDC
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          4161807
Latest checkpoint's NextOID:          176864
Time of latest checkpoint:            Wed 02 Feb 2005 12:38:22 AM PST
Database block size:                  8192
Blocks per segment of large relation: 131072
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           en_CA
LC_CTYPE:                             en_CA
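[Editorial note: it may help to see how the pg_controldata output above relates to the pg_xlog file names Bricklen lists. With 16 MB WAL segments and the 8.0-era file-naming convention (timeline, log ID, and segment number, each as eight hex digits), the checkpoint location 73/419A4BDC lands in segment 0x41 of log file 0x73, i.e. the file 000000010000007300000041. A small illustrative calculation, not from the thread:]

# Map an (hex logid, hex recoff) WAL location onto an 8.0-style segment
# file name. Values are taken from the pg_controldata output above;
# the naming convention assumed here is timeline/log/segment, 8 hex digits each.
logid, recoff = 0x73, 0x419A4BDC
timeline = 1
seg_size = 16 * 1024 * 1024          # "Bytes per WAL segment: 16777216"
segment = recoff // seg_size         # 0x419A4BDC // 0x1000000 = 0x41
print("%08X%08X%08X" % (timeline, logid, segment))   # 000000010000007300000041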
Bricklen Anderson <BAnderson@PresiNET.com> writes:
> Tom Lane wrote:
>> I would have suggested that maybe this represented on-disk data
>> corruption, but the appearance of two different but not-too-far-apart
>> WAL offsets in two different pages suggests that indeed the end of WAL
>> was up around segment 972 or 973 at one time.

> Nope, never touched pg_resetxlog.
> My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks.
> There are also these: 000000010000007400000000 to 00000001000000740000000B

That seems like rather a lot of files; do you have checkpoint_segments
set to a large value, like 100? The pg_controldata dump shows that the
latest checkpoint record is in the 73/41 file, so presumably the active
end of WAL isn't exceedingly far past that. You've got 200 segments
prepared for future activity, which is a bit over the top IMHO.

But anyway, the evidence seems pretty clear that in fact end of WAL is
in the 73 range, and so those page LSNs with 972 and 973 have to be
bogus. I'm back to thinking about dropped bits in RAM or on disk.
IIRC these numbers are all hex, so the extra "9" could come from just
two bits getting turned on that should not be. Might be time to run
memtest86 and/or badblocks.

			regards, tom lane
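[Editorial note: Tom's bit-flip arithmetic can be made concrete. Treating the log IDs as hex, 0x973 differs from 0x73 only in the extra leading digit 9, which is binary 1001, i.e. two set bits. A quick check, added here for illustration and not part of the thread:]

# Illustration of the two-bit-flip hypothesis: the bogus log ID 0x973 and
# the plausible 0x73 differ only in the extra hex digit 9 (binary 1001).
bogus_logid = 0x973      # log ID implied by the corrupt page LSNs
real_logid  = 0x73       # log ID where WAL actually ends
diff = bogus_logid ^ real_logid
print(hex(diff))                  # 0x900
print(bin(diff).count("1"))       # 2 -- two bits turned on that should not be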
Tom Lane wrote:
> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>> Tom Lane wrote:
>>> I would have suggested that maybe this represented on-disk data
>>> corruption, but the appearance of two different but not-too-far-apart
>>> WAL offsets in two different pages suggests that indeed the end of WAL
>>> was up around segment 972 or 973 at one time.
>
>> Nope, never touched pg_resetxlog.
>> My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks.
>> There are also these: 000000010000007400000000 to 00000001000000740000000B
>
> That seems like rather a lot of files; do you have checkpoint_segments
> set to a large value, like 100? The pg_controldata dump shows that the
> latest checkpoint record is in the 73/41 file, so presumably the active
> end of WAL isn't exceedingly far past that. You've got 200 segments
> prepared for future activity, which is a bit over the top IMHO.
>
> But anyway, the evidence seems pretty clear that in fact end of WAL is
> in the 73 range, and so those page LSNs with 972 and 973 have to be
> bogus. I'm back to thinking about dropped bits in RAM or on disk.
> IIRC these numbers are all hex, so the extra "9" could come from just
> two bits getting turned on that should not be. Might be time to run
> memtest86 and/or badblocks.
>
> 			regards, tom lane

Yes, checkpoint_segments is set to 100, although I can set that lower if you feel that that is more appropriate. Currently, the system receives around 5-8 million inserts per day (across 3 primary tables), so I was leaning towards the "more is better" philosophy.

We ran e2fsck with the badblocks option last week and didn't turn anything up, along with a couple of passes with memtest. I will run a full-scale memtest and post any interesting results.

I've also read that kill -9 on the postmaster is "not a good thing". I honestly can't vouch for whether or not this may have occurred around the time of the initial creation of this database. It's possible, since this db started its life as a development db at 8r3, then was bumped to 8r5, then on to 8 final, where it has become a dev-final db.

Assuming that the memtest passes cleanly, as does another run of badblocks, do you have any more suggestions on how I should proceed? Should I run for a while with zero_damaged_pages set to true and accept the data loss, or just recreate the whole db from scratch?

Thanks again for your help.

Cheers,

Bricklen
Bricklen Anderson wrote:
> Tom Lane wrote:
>> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>>> Tom Lane wrote:
>>>> I would have suggested that maybe this represented on-disk data
>>>> corruption, but the appearance of two different but not-too-far-apart
>>>> WAL offsets in two different pages suggests that indeed the end of WAL
>>>> was up around segment 972 or 973 at one time.
>>
>>> Nope, never touched pg_resetxlog.
>>> My pg_xlog list ranges from 000000010000007300000041 to
>>> 0000000100000073000000FE, with no breaks. There are also these:
>>> 000000010000007400000000 to 00000001000000740000000B
>>
>> That seems like rather a lot of files; do you have checkpoint_segments
>> set to a large value, like 100? The pg_controldata dump shows that the
>> latest checkpoint record is in the 73/41 file, so presumably the active
>> end of WAL isn't exceedingly far past that. You've got 200 segments
>> prepared for future activity, which is a bit over the top IMHO.
>>
>> But anyway, the evidence seems pretty clear that in fact end of WAL is
>> in the 73 range, and so those page LSNs with 972 and 973 have to be
>> bogus. I'm back to thinking about dropped bits in RAM or on disk.
>> IIRC these numbers are all hex, so the extra "9" could come from just
>> two bits getting turned on that should not be. Might be time to run
>> memtest86 and/or badblocks.
>>
>> 			regards, tom lane
>
> Yes, checkpoint_segments is set to 100, although I can set that lower if
> you feel that that is more appropriate. Currently, the system receives
> around 5-8 million inserts per day (across 3 primary tables), so I was
> leaning towards the "more is better" philosophy.
>
> We ran e2fsck with the badblocks option last week and didn't turn anything
> up, along with a couple of passes with memtest. I will run a full-scale
> memtest and post any interesting results.
>
> I've also read that kill -9 on the postmaster is "not a good thing". I
> honestly can't vouch for whether or not this may have occurred around
> the time of the initial creation of this database. It's possible, since
> this db started its life as a development db at 8r3, then was bumped to
> 8r5, then on to 8 final, where it has become a dev-final db.
>
> Assuming that the memtest passes cleanly, as does another run of
> badblocks, do you have any more suggestions on how I should proceed?
> Should I run for a while with zero_damaged_pages set to true and accept
> the data loss, or just recreate the whole db from scratch?

memtest86+ ran for over 15 hours with no errors reported.

e2fsck -c completed with no errors reported.

Any ideas on what I should try next? Considering that this db is not in production yet, I _do_ have the liberty to rebuild the database if necessary. Do you have any further recommendations?

thanks again,

Bricklen
Bricklen Anderson <BAnderson@PresiNET.com> writes:
>> Tom Lane wrote:
>>> But anyway, the evidence seems pretty clear that in fact end of WAL is
>>> in the 73 range, and so those page LSNs with 972 and 973 have to be
>>> bogus. I'm back to thinking about dropped bits in RAM or on disk.

> memtest86+ ran for over 15 hours with no errors reported.
> e2fsck -c completed with no errors reported.

Hmm ... that's not proof your hardware is ok, but it at least puts the ball back in play.

> Any ideas on what I should try next? Considering that this db is not
> in production yet, I _do_ have the liberty to rebuild the database if
> necessary. Do you have any further recommendations?

If the database isn't too large, I'd suggest saving aside a physical copy (e.g., a cp or tar dump taken with the postmaster stopped) for forensic purposes, and then rebuilding so you can get on with your own work.

One bit of investigation that might be worth doing is to look at every single 8K page in the database files and collect information about the LSN fields, which are the first 8 bytes of each page. In a non-broken database all of these should be less than or equal to the current ending WAL offset (which you can get with pg_controldata if the postmaster is stopped). We know there are at least two bad pages, but are there more? Is there any pattern to the bad LSN values? Also it would be useful to look at each bad page in some detail to see if there's any evidence of corruption extending beyond the LSN value.

			regards, tom lane
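[Editorial note: the survey Tom suggests can be scripted directly against the data files. Below is a minimal sketch, not from the thread, assuming 8 KB blocks, a little-endian machine, the 8.0-era page layout in which the LSN occupies the first 8 bytes of each page as two 32-bit values (logid, recoff), and that the postmaster is stopped while scanning. The directory path and the WAL-end values are placeholders taken from earlier in the thread; the real end-of-WAL location should come from pg_controldata.]

# Scan every 8K page in a database directory and flag pages whose LSN
# is past the current end of WAL. WAL_END and the data directory below
# are example values, not authoritative.
import os
import struct

BLCKSZ = 8192
WAL_END = (0x73, 0x419A4BDC)   # (logid, recoff) -- placeholder, take from pg_controldata

def scan_relation(path):
    """Yield (block number, logid, recoff) for pages whose LSN exceeds WAL_END."""
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLCKSZ)
            if len(page) < BLCKSZ:
                break
            logid, recoff = struct.unpack("<II", page[:8])
            if (logid, recoff) > WAL_END:     # tuple compare matches LSN ordering
                yield blkno, logid, recoff
            blkno += 1

if __name__ == "__main__":
    datadir = "/var/postgres/data/base/17235"   # database directory from the thread
    for name in sorted(os.listdir(datadir)):
        path = os.path.join(datadir, name)
        if not os.path.isfile(path):
            continue
        for blkno, logid, recoff in scan_relation(path):
            print("%s block %d: LSN %X/%08X is past end of WAL" % (name, blkno, logid, recoff))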
Bricklen Anderson wrote:
> Any ideas on what I should try next? Considering that this db is not in
> production yet, I _do_ have the liberty to rebuild the database if
> necessary. Do you have any further recommendations?

I recall reading something in this ML about problems with the way that Ext3 FS recovers a dirty file system; could it be related?

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl
Alban Hertroys wrote:
> Bricklen Anderson wrote:
>> Any ideas on what I should try next? Considering that this db is not
>> in production yet, I _do_ have the liberty to rebuild the database if
>> necessary. Do you have any further recommendations?
>
> I recall reading something in this ML about problems with the way that
> Ext3 FS recovers a dirty file system; could it be related?

I really have no idea, but we _are_ running an ext3 fs on this particular server.
Tom Lane wrote:
> Bricklen Anderson <BAnderson@PresiNET.com> writes:
>>> Tom Lane wrote:
>>>> But anyway, the evidence seems pretty clear that in fact end of WAL is
>>>> in the 73 range, and so those page LSNs with 972 and 973 have to be
>>>> bogus. I'm back to thinking about dropped bits in RAM or on disk.
>
>> memtest86+ ran for over 15 hours with no errors reported.
>> e2fsck -c completed with no errors reported.
>
> Hmm ... that's not proof your hardware is ok, but it at least puts the
> ball back in play.
>
>> Any ideas on what I should try next? Considering that this db is not
>> in production yet, I _do_ have the liberty to rebuild the database if
>> necessary. Do you have any further recommendations?
>
> If the database isn't too large, I'd suggest saving aside a physical
> copy (e.g., a cp or tar dump taken with the postmaster stopped) for
> forensic purposes, and then rebuilding so you can get on with your own work.
>
> One bit of investigation that might be worth doing is to look at every
> single 8K page in the database files and collect information about the
> LSN fields, which are the first 8 bytes of each page.

Do you mean this line from pg_filedump's results?

 LSN:  logid     56 recoff 0x3f4be440      Special  8176 (0x1ff0)

If so, I've set up a shell script that loops over all of the files and emits that line. It's not particularly elegant, but it worked. Again, that's assuming it was the correct line. I'll write a perl script to parse out the LSN values to see if any are greater than 116 (which I believe is the decimal form of hex 74?).

In case anyone wants it, here is the script that I ran to get the LSNs:

#!/bin/sh
for FILE in /var/postgres/data/base/17235/*; do
  i=0
  echo $FILE >> test_file;
  while [ 1==1 ]; do
    str=`pg_filedump -R $i $FILE | grep LSN`;
    if [ "$?" -eq "1" ]; then
      break
    fi
    echo "$FILE: $str" >> LSN_out;
    i=$((i+1));
  done
done

> In a non-broken database all of these should be less than or equal to
> the current ending WAL offset (which you can get with pg_controldata if
> the postmaster is stopped). We know there are at least two bad pages,
> but are there more? Is there any pattern to the bad LSN values? Also it
> would be useful to look at each bad page in some detail to see if
> there's any evidence of corruption extending beyond the LSN value.
>
> 			regards, tom lane

NB: I've recreated the database and saved off the old directory (all 350 gigs of it) so I can dig into it further.

Thanks again for your help, Tom.

Cheers,

Bricklen
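[Editorial note: the parsing step Bricklen describes could look like the following sketch (in Python rather than Perl; an addition for illustration, not from the thread). It assumes the LSN_out line format produced by the shell script above, and flags any page whose logid exceeds 0x74 (116 decimal), the highest log file present in pg_xlog. Note that the shell script does not record block numbers, so this only identifies which relation files contain suspect pages.]

# Parse LSN_out lines of the form
#   /var/postgres/data/base/17235/151565:  LSN:  logid    115 recoff 0x39e855f4 ...
# and report entries whose logid is past the end-of-WAL log file.
import re

WAL_END_LOGID = 0x74    # 116 decimal; taken from the pg_xlog file names earlier in the thread

pattern = re.compile(r"^(?P<file>\S+): .*logid\s+(?P<logid>\d+)\s+recoff\s+(?P<recoff>0x[0-9a-fA-F]+)")

with open("LSN_out") as f:
    for line in f:
        m = pattern.search(line)
        if not m:
            continue
        logid = int(m.group("logid"))
        if logid > WAL_END_LOGID:
            print("suspect page in %s (logid %d, recoff %s)" % (m.group("file"), logid, m.group("recoff")))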