Thread: terminated by signal 6 problem

terminated by signal 6 problem

From
Joe Conway
Date:
I was sent a log file for a production system that has received several 
ABORT signals while under heavy load. Here's a snippet from the logs:

-----------
LOG:  recycled transaction log file "0000000A0000004B"
LOG:  recycled transaction log file "0000000A0000004D"
LOG:  recycled transaction log file "0000000A0000004E"
LOG:  recycled transaction log file "0000000A0000004F"
WARNING:  specified item offset is too large
CONTEXT:  COPY cymi_transaction_data, line 174: "3448545602 62365 39 4 0.00"
PANIC:  failed to add item to the page for "pk_transaction_data"
CONTEXT:  COPY cymi_transaction_data, line 174: "3448545602 62365 39 4 0.00"
STATEMENT:  COPY cymi_transaction_data FROM STDIN;
LOG:  server process (PID 11667) was terminated by signal 6
LOG:  terminating any other active server processes
-----------

The server information as I was given it:
Sun server V250, 1 GHz processor, 1 GB RAM, 4 x 72 GB disks

I'm requesting more data -- not yet clear whether I can get a core dump 
or even if postgres is compiled with debug symbols on that machine.

The "WARNING:  specified item offset is too large" seems to happen each 
time there is an ABORT, leading me to think it might be bad RAM.

Any thoughts or specific data requests? I can send the full log off-list 
if needed.

Thanks,

Joe



Re: terminated by signal 6 problem

From
Jan Wieck
Date:
I have seen similar failures when running under heavy load with highly 
frequent insert+delete+vacuum activity. What happens is that adding 
another item to an index page in the btree access method fails. It seems 
to me that the decision to add an item to a page and the real work of 
actually adding it are not atomic, so that under certain race conditions 
two backends make the same decision when one of them would have to split 
the page.

Restarting the whole postmaster also looks like an overreaction, but it 
apparently clears the problem: the index is just fine, and retrying the 
insert afterwards works.


Jan

On 8/10/2004 6:50 PM, Joe Conway wrote:

> I was sent a log file for a production system that has received several 
> ABORT signals while under heavy load. Here's a snippet from the logs:
> 
> -----------
> LOG:  recycled transaction log file "0000000A0000004B"
> LOG:  recycled transaction log file "0000000A0000004D"
> LOG:  recycled transaction log file "0000000A0000004E"
> LOG:  recycled transaction log file "0000000A0000004F"
> WARNING:  specified item offset is too large
> CONTEXT:  COPY cymi_transaction_data, line 174: "3448545602 62365 39 4 0.00"
> PANIC:  failed to add item to the page for "pk_transaction_data"
> CONTEXT:  COPY cymi_transaction_data, line 174: "3448545602 62365 39 4 0.00"
> STATEMENT:  COPY cymi_transaction_data FROM STDIN;
> LOG:  server process (PID 11667) was terminated by signal 6
> LOG:  terminating any other active server processes
> -----------
> 
> The server information as I was given it:
> Sun server V250, 1 GHz processor, 1 GB RAM, 4 x 72 GB disks
> 
> I'm requesting more data -- not yet clear whether I can get a core dump 
> or even if postgres is compiled with debug symbols on that machine.
> 
> The "WARNING:  specified item offset is too large" seems to happen each 
> time there is an ABORT, leading me to think it might be bad RAM.
> 
> Any thoughts or specific data requests? I can send the full log off-list 
> if needed.
> 
> Thanks,
> 
> Joe
> 
> 


-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: terminated by signal 6 problem

From
Joe Conway
Date:
Jan Wieck wrote:
> I have seen similar failures when running under heavy load with highly 
> frequent insert+delete+vacuum activity. What happens is that adding 
> another item to an

This fits the profile of this application perfectly.

> index page in the btree access method fails. It seems to me that the 
> decision to add an item to a page and the real work of actually adding 
> it are not atomic, so that under certain race conditions two backends 
> make the same decision when one of them would have to split the page.

Hmmm, sounds like a somewhat serious issue. Any thoughts on a fix, or 
even a reliable workaround?

> Restarting the whole postmaster also looks like an overreaction, but it 
> apparently clears the problem: the index is just fine, and retrying the 
> insert afterwards works.

But it is disruptive to an application that collects data 24 x 365 at 
this kind of load. I've been discussing with the developers ways to 
eliminate the delete and most of the vacuum load (in favor of "work" 
tables and TRUNCATE), so maybe this issue will force that change.
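
One way to read that workaround, as a hypothetical sketch (table names 
are made up, not the application's real schema): land each batch in a 
scratch table and discard it with TRUNCATE instead of deleting rows from 
a hot table, so there are no dead tuples to vacuum.

```sql
-- Hypothetical sketch; table names are illustrative only.
CREATE TABLE work_batch (LIKE cymi_transaction_data);

-- Each cycle: bulk-load, process, then discard the batch.
COPY work_batch FROM STDIN;
-- process/forward rows from work_batch here
TRUNCATE work_batch;   -- immediate, leaves no dead tuples behind
```

TRUNCATE reclaims the space at once, so the steady DELETE + VACUUM load 
on the hot table goes away entirely.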

Thanks,

Joe


Re: terminated by signal 6 problem

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> I have seen similar failures when running under heavy load with highly 
> frequent insert+delete+vacuum activity. What happens is that adding 
> another item to an index page in the btree access method fails. It 
> seems to me that the decision to add an item to a page and the real 
> work of actually adding it are not atomic, so that under certain race 
> conditions two backends make the same decision when one of them would 
> have to split the page.

Sure it is.  _bt_insertonpg is holding an exclusive lock on the page
the entire time.

We've seen reports like this once or twice before, so I think that there
may indeed be some corner-case bug involved, but it's not going to be
possible to find it without a test case ... or at least a debuggable
core dump from the PANIC.
        regards, tom lane


Re: terminated by signal 6 problem

From
Joe Conway
Date:
Tom Lane wrote:
> We've seen reports like this once or twice before, so I think that there
> may indeed be some corner-case bug involved, but it's not going to be
> possible to find it without a test case ... or at least a debuggable
> core dump from the PANIC.

I just talked to the guy who sent the log file to me. He's working with 
his contact on the customer side to see if there is a core dump from the 
last PANIC. They agreed to make adjustments so that next time there will 
be a core dump if none is found this time. They also agreed to run 
memory and disk integrity checks.

The installation was compiled with --enable-debug, and the binaries have 
not been stripped, so if I can get a core dump from them it ought to be 
usable.

Joe