Thread: terminated by signal 6 problem
I was sent a log file for a production system that has received several
ABORT signals while under heavy load. Here's a snippet from the logs:

-----------
LOG: recycled transaction log file "0000000A0000004B"
LOG: recycled transaction log file "0000000A0000004D"
LOG: recycled transaction log file "0000000A0000004E"
LOG: recycled transaction log file "0000000A0000004F"
WARNING: specified item offset is too large
CONTEXT: COPY cymi_transaction_data, line 174: "3448545602 62365 39 4 0.00"
PANIC: failed to add item to the page for "pk_transaction_data"
CONTEXT: COPY cymi_transaction_data, line 174: "3448545602 62365 39 4 0.00"
STATEMENT: COPY cymi_transaction_data FROM STDIN;
LOG: server process (PID 11667) was terminated by signal 6
LOG: terminating any other active server processes
-----------

The server information as I was given it:
Sun server V250, 1 GHz Processor, 1 GB RAM, 4 x 72GB

I'm requesting more data -- it isn't yet clear whether I can get a core
dump, or even whether postgres is compiled with debug symbols on that
machine.

The "WARNING: specified item offset is too large" seems to happen each
time there is an ABORT, leading me to think it might be bad RAM.

Any thoughts or specific data requests? I can send the full log off-list
if needed.

Thanks,

Joe
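For reference on how that snippet hangs together: the WARNING, the PANIC, and
the signal 6 look like three stages of one failed index insert rather than
separate events (note the identical CONTEXT lines). The page-level code
refuses to place the index tuple at the requested offset and warns; the btree
insert path treats that refusal as unrecoverable and PANICs; and a PANIC
aborts the backend, which is the SIGABRT (signal 6) the postmaster then
reports before restarting everything. A minimal sketch of that chain, with
stand-in function names -- not the actual server source:

/*
 * Illustrative sketch only: mimics how the WARNING, the PANIC, and the
 * "terminated by signal 6" line chain together.
 */
#include <stdio.h>
#include <stdlib.h>

#define INVALID_OFFSET 0

/* stand-in for the generic page-management routine */
static int
page_add_item(int offset, int max_offset)
{
    if (offset > max_offset + 1)
    {
        fprintf(stderr, "WARNING:  specified item offset is too large\n");
        return INVALID_OFFSET;
    }
    return offset;                      /* item placed at the requested slot */
}

/* stand-in for the btree insert path */
static void
bt_insert_on_page(int offset, int max_offset, const char *index_name)
{
    if (page_add_item(offset, max_offset) == INVALID_OFFSET)
    {
        fprintf(stderr, "PANIC:  failed to add item to the page for \"%s\"\n",
                index_name);
        abort();                        /* -> "terminated by signal 6" */
    }
}

int
main(void)
{
    /* ask for a slot well past the end of the page */
    bt_insert_on_page(42, 17, "pk_transaction_data");
    return 0;
}

In other words, whatever is actually going wrong happens at the offset check;
everything after it is the server shutting itself down defensively.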
I have seen something similar when running under heavy load with very
frequent insert+delete+vacuum. What happens is that adding another item
to an index page in the btree access method fails. It seems to me that
the decision to add an item to a page and the real work of actually
adding it are not atomic, so that under certain race conditions two
backends make the same decision while one of them would have to split
the page.

Restarting the whole postmaster also looks like overreacting, but it
apparently fixes the problem, since the index is just fine and retrying
the insert afterwards works.


Jan

On 8/10/2004 6:50 PM, Joe Conway wrote:
> I was sent a log file for a production system that has received several
> ABORT signals while under heavy load. Here's a snippet from the logs:
>
> -----------
> LOG: recycled transaction log file "0000000A0000004B"
> LOG: recycled transaction log file "0000000A0000004D"
> LOG: recycled transaction log file "0000000A0000004E"
> LOG: recycled transaction log file "0000000A0000004F"
> WARNING: specified item offset is too large
> CONTEXT: COPY cymi_transaction_data, line 174: "3448545602 62365 39 4 0.00"
> PANIC: failed to add item to the page for "pk_transaction_data"
> CONTEXT: COPY cymi_transaction_data, line 174: "3448545602 62365 39 4 0.00"
> STATEMENT: COPY cymi_transaction_data FROM STDIN;
> LOG: server process (PID 11667) was terminated by signal 6
> LOG: terminating any other active server processes
> -----------
>
> The server information as I was given it:
> Sun server V250, 1 GHz Processor, 1 GB RAM, 4 x 72GB
>
> I'm requesting more data -- it isn't yet clear whether I can get a core
> dump, or even whether postgres is compiled with debug symbols on that
> machine.
>
> The "WARNING: specified item offset is too large" seems to happen each
> time there is an ABORT, leading me to think it might be bad RAM.
>
> Any thoughts or specific data requests? I can send the full log off-list
> if needed.
>
> Thanks,
>
> Joe

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
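A rough sketch of the race being hypothesized here, in plain C -- purely
illustrative, not PostgreSQL code; Page, item_fits, and add_item are made-up
stand-ins. If the "does it fit, or do we split" decision and the actual insert
were not covered by a single lock, two backends could both conclude the item
fits, and whichever gets there second would find the page full, which in the
real server would surface exactly as the PANIC in Joe's log:

#include <stdio.h>

/* stand-in for an index page; only the bookkeeping that matters here */
typedef struct
{
    int free_space;                 /* bytes left on the page */
} Page;

/* the "decision": does the new item fit on this page? */
static int
item_fits(const Page *page, int item_size)
{
    return page->free_space >= item_size;
}

/* the "real work": actually place the item */
static int
add_item(Page *page, int item_size)
{
    if (page->free_space < item_size)
        return 0;                   /* fails -> PANIC in the real server */
    page->free_space -= item_size;
    return 1;
}

int
main(void)
{
    Page page = { 24 };             /* room for one more item, not two */
    int  item = 20;

    /* the interleaving that would occur if no lock covered both steps */
    int a_fits = item_fits(&page, item);    /* backend A: "it fits" */
    int b_fits = item_fits(&page, item);    /* backend B: "it fits" */

    if (a_fits)
        add_item(&page, item);              /* backend A wins the space */
    if (b_fits && !add_item(&page, item))   /* backend B now fails */
        printf("backend B: failed to add item to the page\n");

    return 0;
}

Run as written, the simulated backend B reports the failure; the question, as
the follow-ups below discuss, is whether the real code actually leaves such a
window open.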
Jan Wieck wrote:
> I have seen something similar when running under heavy load with very
> frequent insert+delete+vacuum.

This fits the profile of this application perfectly.

> What happens is that adding another item to an index page in the btree
> access method fails. It seems to me that the decision to add an item to
> a page and the real work of actually adding it are not atomic, so that
> under certain race conditions two backends make the same decision while
> one of them would have to split the page.

Hmmm, that sounds like a somewhat serious issue. Any thoughts on a fix,
or even a reliable workaround?

> Restarting the whole postmaster also looks like overreacting, but it
> apparently fixes the problem, since the index is just fine and retrying
> the insert afterwards works.

But it is disruptive to an application that collects data 24 x 365 at
this kind of load. I've been discussing with the developers ways to
eliminate the delete and most of the vacuum load (in favor of "work"
tables and TRUNCATE), so maybe this issue will force that change.

Thanks,

Joe
Jan Wieck <JanWieck@Yahoo.com> writes: > I have seen similar when running under heavy load with high frequent > insert+delete+vacuum. What happens is that adding another item to an > index page in the btree access method fails. It seems to me that the > decision to add an item to a page and the real work of actually adding > it are not atomic, so that under certain race conditions two backends > make the same decision while one would have to split the page. Sure it is. _bt_insertonpg is holding an exclusive lock on the page the entire time. We've seen reports like this once or twice before, so I think that there may indeed be some corner-case bug involved, but it's not going to be possible to find it without a test case ... or at least a debuggable core dump from the PANIC. regards, tom lane
Tom Lane wrote:
> We've seen reports like this once or twice before, so I think there may
> indeed be some corner-case bug involved, but it's not going to be
> possible to find it without a test case ... or at least a debuggable
> core dump from the PANIC.

I just talked to the guy who sent me the log file. He's working with his
contact on the customer side to see whether there is a core dump from
the last PANIC. If none turns up, they have agreed to make adjustments
so that a core dump is produced the next time this happens. They have
also agreed to run memory and disk integrity checks.

The installation was compiled with --enable-debug and the binaries have
not been stripped, so if I can get a core dump from them it ought to be
usable.

Joe