Thread: dblink crash on PPC

dblink crash on PPC

From
Andrew Dunstan
Date:
Something odd is happening on buildfarm member wombat, a PPC970MP box 
running Gentoo. We're getting dblink test failures. On the one I looked 
at more closely I saw this:

[4ddf2c59.7aec:153] LOG:  disconnection: session time: 0:00:00.444 user=markwkm database=contrib_regression host=[local]

and then:

[4ddf2c4e.79d4:2] LOG:  server process (PID 31468) was terminated by signal 11: Segmentation fault
[4ddf2c4e.79d4:3] LOG:  terminating any other active server processes

which makes it look like something is failing badly in the backend cleanup code. (7aec = hex(31468))
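The PID match above can be checked mechanically. What follows is a minimal sketch, assuming the bracketed prefix comes from a log_line_prefix of '[%c:%l] ', where %c is the session ID (hex backend start time, a dot, hex PID) and %l is the per-session log line number. The helper name decode_prefix is hypothetical and not part of PostgreSQL or the buildfarm client.

```python
import re
from datetime import datetime, timezone

# Assumed prefix shape: [<hex start time>.<hex pid>:<line number>]
PREFIX_RE = re.compile(r"^\[([0-9a-f]+)\.([0-9a-f]+):(\d+)\]")


def decode_prefix(line):
    """Decode the session start time, PID and line number from a log line.

    Hypothetical helper; it assumes log_line_prefix = '[%c:%l] '.
    """
    match = PREFIX_RE.match(line)
    if match is None:
        raise ValueError("line does not start with a [%c:%l] prefix")
    start_hex, pid_hex, line_no = match.groups()
    return {
        # %c encodes the process start time as seconds since the Unix epoch
        "start": datetime.fromtimestamp(int(start_hex, 16), tz=timezone.utc),
        # ... and the process ID, both in hexadecimal
        "pid": int(pid_hex, 16),
        "line": int(line_no),
    }


session = decode_prefix("[4ddf2c59.7aec:153] LOG:  disconnection: session time: 0:00:00.444")
postmaster = decode_prefix("[4ddf2c4e.79d4:2] LOG:  server process (PID 31468) was terminated by signal 11")
print(session["pid"], postmaster["pid"])
```

Under that assumption, the disconnecting session decodes to the same PID that the postmaster reports as killed by signal 11, which is what ties the crash to that backend's exit.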

We don't seem to have a backtrace, which is sad.

This seems to be happening on the 9.0 branch too.

I wonder what it could be?

cheers

andrew





Re: dblink crash on PPC

From
Robert Haas
Date:
On Fri, May 27, 2011 at 8:44 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
> Something odd is happening on buildfarm member wombat, a PPC970MP box
> running Gentoo. We're getting dblink test failures. On the one I looked at
> more closely I saw this:
>
> [4ddf2c59.7aec:153] LOG:  disconnection: session time: 0:00:00.444
> user=markwkm database=contrib_regression host=[local]
>
> and then:
>
> [4ddf2c4e.79d4:2] LOG:  server process (PID 31468) was terminated by signal
> 11: Segmentation fault
> [4ddf2c4e.79d4:3] LOG:  terminating any other active server processes
>
> which makes it look like something is failing badly in the backend cleanup
> code. (7aec = hex(31468))
>
> We don't seem to have a backtrace, which is sad.
>
> This seems to be happening on the 9.0 branch too.
>
> I wonder what it could be?

Around when did it start failing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: dblink crash on PPC

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote: 
> Andrew Dunstan <andrew@dunslane.net> wrote:
>>
>> Something odd is happening on buildfarm member wombat, a PPC970MP
>> box running Gentoo. We're getting dblink test failures. On the
>> one I looked at more closely I saw this:
>>
>> [4ddf2c59.7aec:153] LOG:  disconnection: session time:
>> 0:00:00.444
>> user=markwkm database=contrib_regression host=[local]
>>
>> and then:
>>
>> [4ddf2c4e.79d4:2] LOG:  server process (PID 31468) was terminated
>> by signal 11: Segmentation fault
>> [4ddf2c4e.79d4:3] LOG:  terminating any other active server
>> processes
>>
>> which makes it look like something is failing badly in the
>> backend cleanup code. (7aec = hex(31468))
>>
>> We don't seem to have a backtrace, which is sad.
>>
>> This seems to be happening on the 9.0 branch too.
>>
>> I wonder what it could be?
> 
> Around when did it start failing?

According to the buildfarm logs the first failure was roughly 1 day
10 hours 40 minutes before this post.

Keep in mind that PPC is a platform with weak memory ordering....

-Kevin


Re: dblink crash on PPC

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Robert Haas <robertmhaas@gmail.com> wrote: 
>> Around when did it start failing?
> According to the buildfarm logs the first failure was roughly 1 day
> 10 hours 40 minutes before this post.

See
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=wombat&br=HEAD

The problem here is that wombat had been offline for about a month
before that, so it could have broken at any time in the past month.
It's also not unlikely that the hiatus signals a change in the
underlying hardware or software, which might have been the real
cause.  (Mark?)

> Keep in mind that PPC is a platform with weak memory ordering....

grebe, which is also a PPC64 machine, isn't showing the bug.  And I just
failed to reproduce the problem on a RHEL6 PPC64 box.  About to go try
it on RHEL5, which has a gcc version much closer to what wombat says
it's using, but I'm not very hopeful about that.  I think the more
likely thing to be keeping in mind is that Gentoo is a platform with
poor quality control.
        regards, tom lane


Re: dblink crash on PPC

From
Tom Lane
Date:
I wrote:
> grebe, which is also a PPC64 machine, isn't showing the bug.  And I just
> failed to reproduce the problem on a RHEL6 PPC64 box.  About to go try
> it on RHEL5, which has a gcc version much closer to what wombat says
> it's using, but I'm not very hopeful about that.

Nope, no luck there either.  It's going to be hard to make any progress
on this without investigation on wombat itself.
        regards, tom lane


Re: dblink crash on PPC

From
Steve Singer
Date:
On 11-05-27 12:35 PM, Tom Lane wrote:
>
> grebe, which is also a PPC64 machine, isn't showing the bug.  And I just
> failed to reproduce the problem on a RHEL6 PPC64 box.  About to go try
> it on RHEL5, which has a gcc version much closer to what wombat says
> it's using, but I'm not very hopeful about that.  I think the more
> likely thing to be keeping in mind is that Gentoo is a platform with
> poor quality control.
>
>             regards, tom lane
>

As another data point, the dblink regression tests work fine for me on a 
PPC32 debian (squeeze,gcc 4.4.5) based system.



Re: dblink crash on PPC

From
Greg Stark
Date:
On Fri, May 27, 2011 at 10:06 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
> As another data point, the dblink regression tests work fine for me on a
> PPC32 debian (squeeze,gcc 4.4.5) based system.

Given that it's dblink my guess is that it's picking up the wrong
version of libpq somehow.

-- 
greg


Re: dblink crash on PPC

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> On Fri, May 27, 2011 at 10:06 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
>> As another data point, the dblink regression tests work fine for me on a
>> PPC32 debian (squeeze,gcc 4.4.5) based system.

> Given that it's dblink my guess is that it's picking up the wrong
> version of libpq somehow.

Maybe, but then why does the test only crash during backend exit, and
not while it's exercising dblink?
        regards, tom lane