Thread: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
"Jim C. Nasby"
Date:
I've got a customer who was having a problem with a backend running away with memory. It would hit 46G before finally being running the box completely out of memory. It didn't appear to be related to hashjoin/agg or pending triggers, so I had them recompile with debug and assert turned on. A bit after restarting we get: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89) Looking through the code I see that's something to do with indexes, but I'm not sure what. Is this likely a corrupted index? If so, is there some way I could identify which index? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes: > TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89) > Looking through the code I see that's something to do with indexes, but > I'm not sure what. Is this likely a corrupted index? Sounds that way. > If so, is there some way I could identify which index? gdb the core file and do "p rel->rd_rel->relname" in the _bt_search stack frame (which is not going to be the top of stack but should be close to the top). regards, tom lane
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
"Jim C. Nasby"
Date:
On Wed, Oct 26, 2005 at 04:48:22PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89) > > > Looking through the code I see that's something to do with indexes, but > > I'm not sure what. Is this likely a corrupted index? > > Sounds that way. Is this something that should be logged better than it is? Is there any value in trying to save the index/table that's busted? This is 8.0.3, btw. > > If so, is there some way I could identify which index? > > gdb the core file and do "p rel->rd_rel->relname" in the _bt_search > stack frame (which is not going to be the top of stack but should be > close to the top). Yeah, looks like no core files yet (they would be in $PGDATA somewhere, correct?), so I had them hack their startup script to include ulimit -c unlimited. Luckily this is pretty reproducable, so hopefully we'll have a core soon. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Wed, Oct 26, 2005 at 04:48:22PM -0400, Tom Lane wrote: >> Sounds that way. > Is this something that should be logged better than it is? We don't even know what it is yet, so that question seems a bit premature. > Is there any value in trying to save the index/table that's busted? It'd probably be worth making a copy for forensic purposes before you REINDEX it. regards, tom lane
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Martijn van Oosterhout
Date:
On Wed, Oct 26, 2005 at 04:03:41PM -0500, Jim C. Nasby wrote: > Is this something that should be logged better than it is? Is there any > value in trying to save the index/table that's busted? Unfortuatly, it's one of those Assert(expr) lines, they don't generally give the opportunity for extra debug data. They're usually not even compiled in... > Yeah, looks like no core files yet (they would be in $PGDATA somewhere, > correct?), so I had them hack their startup script to include ulimit -c > unlimited. Luckily this is pretty reproducable, so hopefully we'll have > a core soon. Well, you know the query that generates it? EXPLAIN should tell you the indexes it's using, so you could just REINDEX those... Or REINDEX them all... :) Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
"Jim C. Nasby"
Date:
On Wed, Oct 26, 2005 at 11:14:17PM +0200, Martijn van Oosterhout wrote: > On Wed, Oct 26, 2005 at 04:03:41PM -0500, Jim C. Nasby wrote: > > Is this something that should be logged better than it is? Is there any > > value in trying to save the index/table that's busted? > > Unfortuatly, it's one of those Assert(expr) lines, they don't generally > give the opportunity for extra debug data. They're usually not even > compiled in... > > > Yeah, looks like no core files yet (they would be in $PGDATA somewhere, > > correct?), so I had them hack their startup script to include ulimit -c > > unlimited. Luckily this is pretty reproducable, so hopefully we'll have > > a core soon. > > Well, you know the query that generates it? EXPLAIN should tell you the > indexes it's using, so you could just REINDEX those... Well, since query logging only logs when a query finishes... > Or REINDEX them all... :) 242G database. That would take quite a while... :) -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Wed, Oct 26, 2005 at 11:14:17PM +0200, Martijn van Oosterhout wrote: >> Well, you know the query that generates it? > Well, since query logging only logs when a query finishes... Don't forget to look at debug_query_string when you get the core file. regards, tom lane
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
"Jim C. Nasby"
Date:
On Wed, Oct 26, 2005 at 05:47:15PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > On Wed, Oct 26, 2005 at 11:14:17PM +0200, Martijn van Oosterhout wrote: > >> Well, you know the query that generates it? > > > Well, since query logging only logs when a query finishes... > > Don't forget to look at debug_query_string when you get the core file. Didn't know about that, it will certainly help. There's nothing in the FAQ's about what to look for in a coredump is there? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
"Jim C. Nasby"
Date:
On Wed, Oct 26, 2005 at 04:48:22PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89) > > > Looking through the code I see that's something to do with indexes, but > > I'm not sure what. Is this likely a corrupted index? > > Sounds that way. > > > If so, is there some way I could identify which index? > > gdb the core file and do "p rel->rd_rel->relname" in the _bt_search > stack frame (which is not going to be the top of stack but should be > close to the top). Reproduced the crash, but still no core file... where exactly should it have been put? Would in be outside of $PGDATA? Is there any way to verify the ulimit setings that the backend is running under? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes: > Reproduced the crash, but still no core file... where exactly should it > have been put? Would in be outside of $PGDATA? In 8.0 I'd expect to find it in $PGDATA/base/DBOID/core (or possibly core.NNNN). There are some platforms like Darwin that tend to put core files in a fixed directory such as /cores, though. regards, tom lane
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
"Jim C. Nasby"
Date:
On Wed, Oct 26, 2005 at 06:38:45PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > Reproduced the crash, but still no core file... where exactly should it > > have been put? Would in be outside of $PGDATA? > > In 8.0 I'd expect to find it in $PGDATA/base/DBOID/core (or possibly > core.NNNN). There are some platforms like Darwin that tend to put > core files in a fixed directory such as /cores, though. Grr... /etc/profile had ulimit -S -c 0 > /dev/null 2>&1 Is there any way to verify what limits are in place for a running backend? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
"Jim C. Nasby"
Date:
On Wed, Oct 26, 2005 at 06:06:19PM -0500, Jim C. Nasby wrote: > On Wed, Oct 26, 2005 at 06:38:45PM -0400, Tom Lane wrote: > > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > > Reproduced the crash, but still no core file... where exactly should it > > > have been put? Would in be outside of $PGDATA? > > > > In 8.0 I'd expect to find it in $PGDATA/base/DBOID/core (or possibly > > core.NNNN). There are some platforms like Darwin that tend to put > > core files in a fixed directory such as /cores, though. > > Grr... /etc/profile had > ulimit -S -c 0 > /dev/null 2>&1 > > Is there any way to verify what limits are in place for a running > backend? Also, is an assert guaranteed to dump core? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes: >> Is there any way to verify what limits are in place for a running >> backend? Damifino. > Also, is an assert guaranteed to dump core? Yup ... at least, it will call abort(). regards, tom lane
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes: > Is there anything in particular you'd like to see from the index file? I > made a copy of it before reindexing... Could you send me the whole file (off-list)? regards, tom lane
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Wed, Oct 26, 2005 at 09:29:23PM -0400, Tom Lane wrote: >> Could you send me the whole file (off-list)? > Ok, will send URL as soon as I have it from client. Well, the answer is that there's nothing wrong with that index except that four consecutive pages near the end (32K total) have been zeroed out :-( In the situation where _bt_search descends into one of these pages, we'd have the following behavior: * P_RIGHTMOST() appears true because btpo_next is zero, therefore _bt_moveright does nothing. * P_ISLEAF() appears false because the flags word is zero, so we don't fall out of the loop. * _bt_binsrch finds high < low and returns low (ie, 1) rather than complaining. * If Asserts are on then the internal ItemIdIsUsed assert in PageGetItem triggers, resulting in the behavior mentioned in $subject. * If not, then we stupidly fetch a zero block number out of the nonexistent item, and iterate to page zero, ie, the metapage. The above observations still hold true for the metapage, so it's an infinite loop --- or would be except that we're building a stack entry each time around the loop, and so we gradually exhaust memory. This matches the other behavior Jim saw. Bottom line is that index searches probably ought to have some non-Assert defenses against zeroed-out pages. Obviously we can't expect to catch every flavor of data corruption, but this particular one has been seen before... BTW, Jim, any thoughts about how the index got corrupted? Have you had any crashes on that machine lately? regards, tom lane
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
"Jim C. Nasby"
Date:
On Thu, Oct 27, 2005 at 11:53:01PM -0400, Tom Lane wrote: > BTW, Jim, any thoughts about how the index got corrupted? Have you > had any crashes on that machine lately? Write-through cache on drive array that's not battery backed. Plus, the backend has been crashing on a sig 11 about once a week for who-knows-how-long. You do the math... BTW, now that the backend is compiled with --enable-debug I hope to find out the reason for the random crashes. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Tom Lane
Date:
Gavin Sherry <swm@linuxworld.com.au> writes: > Definately. I've seen faulty hardware somehow zero blocks where I would > have expected random data. I wonder if we can test with PageIsNew(), which > is very inexpensive. The question is: what do we do when we detect this? I think erroring out with a message along the line of "corrupted data in index foo" is plenty. The recovery action is easy: reindex. So all we have to do is not crash and deliver a useful error message. We know that the page has already gotten by PageHeaderIsValid, so either it's in reasonable condition or it's all zeroes. So a quick check for some nonzero header fields is enough ... PageIsNew is as good as anything. The next question is what's the minimal number of places we have to add it to to cover all paths in the index modules? > That being said, is there any reason where don't extend the file with a > PageInit()'d block instead of a zero'd file? Yeah: this scenario is exactly it. PageInit doesn't make for a valid index page. regards, tom lane
Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", File: "nbtsearch.c", Line: 89)
From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Thu, Oct 27, 2005 at 11:53:01PM -0400, Tom Lane wrote: >> BTW, Jim, any thoughts about how the index got corrupted? Have you >> had any crashes on that machine lately? > Write-through cache on drive array that's not battery backed. Plus, the > backend has been crashing on a sig 11 about once a week for > who-knows-how-long. You do the math... My guess is that this could not have been caused by a backend crash. I did some analysis and found that one of the four pages had been a level-1 internal page, not a leaf page (there is a sibling link from a level-1 page and a downlink from the level-2 root, QED). It's very hard to see how to explain the condition of the index as a result of a single backend crash, and even if you posit one crash for each page, it's tough to believe that there wouldn't be additional corruption elsewhere (eg, from incomplete page split operations). But AFAICT there's nothing at all wrong anywhere else. And if there were multiple crashes, why are the affected pages contiguous? I think the data was dropped at the disk or filesystem level. If you have had power failures then that's certainly not hard to believe. > BTW, now that the backend is compiled with --enable-debug I hope to find > out the reason for the random crashes. Yeah, that is an interesting question. regards, tom lane
On Thu, 27 Oct 2005, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > On Wed, Oct 26, 2005 at 09:29:23PM -0400, Tom Lane wrote: > >> Could you send me the whole file (off-list)? > > > Ok, will send URL as soon as I have it from client. > > Well, the answer is that there's nothing wrong with that index except > that four consecutive pages near the end (32K total) have been zeroed > out :-( [snip] > Bottom line is that index searches probably ought to have some > non-Assert defenses against zeroed-out pages. Obviously we can't > expect to catch every flavor of data corruption, but this particular > one has been seen before... Definately. I've seen faulty hardware somehow zero blocks where I would have expected random data. I wonder if we can test with PageIsNew(), which is very inexpensive. The question is: what do we do when we detect this? > > BTW, Jim, any thoughts about how the index got corrupted? Have you > had any crashes on that machine lately? Have spoken with Jim on IRC, he says that there have been several crashes recently due to a faulty disk array. I guess the zeroing could be an outcome of the faulty disk. I wonder if the crash the faulty disk resulted in could have been caused some where around mdextend() where we create a zero'd page but before we could have written out the initialised page. If this happened 4 times in a row it could account for the problem. It does seem a bit unlikely thought. That being said, is there any reason where don't extend the file with a PageInit()'d block instead of a zero'd file? Thanks, Gavin
On Fri, Oct 28, 2005 at 02:26:31PM +1000, Gavin Sherry wrote: > Have spoken with Jim on IRC, he says that there have been several crashes > recently due to a faulty disk array. I guess the zeroing could be an > outcome of the faulty disk. I wonder if the crash the faulty disk resulted > in could have been caused some where around mdextend() where we create a > zero'd page but before we could have written out the initialised page. Just to clarify, there's no evidence that the array is faulty. I do know that they were using write-back with a non-battery-backed cache though. What has been happening is periodic random crashes, around 1 a week. I now have a good core for one, as well as an assert: TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno && shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c", Line: 308) I haven't looked at that code yet, so I have no idea what that actually means. Let me know what info y'all would like to see out of the core. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > What has been happening is periodic random crashes, around 1 a week. I > now have a good core for one, as well as an assert: > TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno && > shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: > "slru.c", Line: 308) > I haven't looked at that code yet, so I have no idea what that actually > means. Let me know what info y'all would like to see out of the core. The whole contents of *shared and the local variables of SimpleLruReadPage would be good for starters. Also, what PG version is this exactly, again? regards, tom lane