Re: signal 11 on AIX: 7.4.2 - Mailing list pgsql-hackers

From Jan Wieck
Subject Re: signal 11 on AIX: 7.4.2
Date
Msg-id 414B72BF.1000402@Yahoo.com
In response to Re: signal 11 on AIX: 7.4.2  (Jan Wieck <JanWieck@Yahoo.com>)
Responses Re: signal 11 on AIX: 7.4.2  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 4/19/2004 1:18 PM, Jan Wieck wrote:

> Tom Lane wrote:
> 
>> Andrew Sullivan <ajs@crankycanuck.ca> writes:
>>> On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote:
>>>> I can see from your trace that you are using the getaddrinfo code from
>>>> libc, but where is configure finding a header that declares struct
>>>> addrinfo?
>> 
>>> Hrm, I can't seem to tell.  I see this in config.log, but it isn't
>>> telling me where it found it.  Am I looking in the wrong place?
>> 
>> What you'd need to do is determine which system headers are being
>> #include'd by that config test, and then look through them to find
>> struct addrinfo.
> 
> judging by gdb's structure printing, the crashed postgres instance used 
> the non-43 compatible 64-bit version of the structure. What I don't 
> really get is that the whole exercise seems to have scribbled over the 
> stack. The hints pointer originating from the on-stack structure in 
> parse_hba is somehow pointing into the blue.

This issue is still open and it is hitting us more and more often. So I 
would like to add some of what we have done since, in the hope of 
getting some more ideas.

The "scribbled over the stack" part turned out not to be true. The stack 
dump looks fine when compiled with -O0. The problem persists in 7.4.5.

I have tried to isolate the getaddrinfo() calls by writing a program 
that makes the same getaddrinfo() calls the postmaster makes during 
startup, then keeps 100-200 child processes running in a fork()/wait() 
loop, with every child making the same getaddrinfo() calls a starting 
backend would perform while parsing pg_hba.conf. This program does not 
crash.
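
A minimal sketch of what such a test harness might look like, not the 
actual program. It is a simplified single round rather than a 
continuous loop, and the function names resolve_numeric() and 
stress_lookups(), the hints used, and the addresses passed in are all 
assumptions made for illustration:

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: one numeric-host lookup, roughly what parsing a
 * pg_hba.conf address entry would need. Returns 0 on success, otherwise
 * the getaddrinfo() error code. */
static int resolve_numeric(const char *host)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* assumption: allow IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;  /* never consult a resolver */

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc == 0)
        freeaddrinfo(res);
    return rc;
}

/* Hypothetical driver: fork nchildren processes, let each one repeat the
 * lookup `iterations` times, then reap them all. Returns the number of
 * children that could not be started, exited non-zero, or were killed
 * by a signal (for example SIGSEGV). */
static int stress_lookups(int nchildren, int iterations, const char *host)
{
    int started = 0;
    int bad = 0;
    int i;

    for (i = 0; i < nchildren; i++)
    {
        pid_t pid = fork();

        if (pid < 0)
            break;                    /* could not fork any more children */
        if (pid == 0)
        {
            int j;

            for (j = 0; j < iterations; j++)
                if (resolve_numeric(host) != 0)
                    _exit(1);
            _exit(0);
        }
        started++;
    }

    for (i = 0; i < started; i++)
    {
        int status;

        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            bad++;
    }
    return bad + (nchildren - started);
}
```

Calling stress_lookups(200, 1000, "127.0.0.1") from a main() and 
checking for a non-zero return would be one way to run it; a child 
dying in memmove() would show up as a signal-terminated child.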

So far we have not received a libc with debug symbols from IBM. So I 
only know that getaddrinfo() calls getaddrinfo2(), which calls 
memmove(), and that one crashes with a SIGSEGV. All the call arguments 
to getaddrinfo() look absolutely fine. I hope to get that libc soon, to 
see what exactly that memmove() tries to access.

The problem comes and goes. Either I can trigger a core dump at will by 
running a shell script that makes 100 psql -c "select version()" calls, 
or it is next to impossible to crash it at all.

There are numerous reports on the net about getaddrinfo() causing grief 
on AIX, and it seems to be IPv6 related. For the moment we intend to 
replace the call with a slightly limited implementation using 
inet_aton() in getaddrinfo_all() whenever AI_NUMERICHOST is set. This 
will cost us IPv6 support, as hba.c can no longer parse those 
pg_hba.conf lines, so it is not a satisfactory workaround for 
PostgreSQL. But I will make that patch available tomorrow night in case 
someone else finds it useful.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

