Thread: signal 11 on AIX: 7.4.2

signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
We've had a backend crash with sig 11 during connection.  My guess is
there's something up with (maybe) the IPv6 support on AIX.  I seem to
recall something similar recently, but I can't find the post in the
archives.  Suggestions?


oxrslive=# SELECT version();
                                  version
------------------------------------------------------------------------------
 PostgreSQL 7.4.2 on powerpc-ibm-aix5.1.0.0, compiled by GCC 2.9-aix51-020209
(1 row)

GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "powerpc-ibm-aix5.1.0.0"...
Core was generated by `postgres'.
Program terminated with signal 11, Segmentation fault.
#0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
(gdb) bt
#0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
#1  0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
#2  0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
#3  0x1005860c in getaddrinfo_all (hostname=0x34e0 "",
    servname=0x74696f <Address 0x74696f out of bounds>, hintp=0xf03a2e80,
    result=0x74696f) at ip.c:78
#4  0x101f9330 in parse_hba (line=0x202ae198, port=0x202a6988,
    found_p=0x2ff1f810 "", error_p=0x2ff1f811 "") at hba.c:669
#5  0x101f96bc in check_hba (port=0x202a6988) at hba.c:793
#6  0x101fa934 in hba_getauthmethod (port=0x202b6f3c) at hba.c:1574
#7  0x101fad5c in ClientAuthentication (port=0x202a6988) at auth.c:415
#8  0x10004674 in BackendFork (port=0x202a6988) at postmaster.c:2444
#9  0x100040b8 in BackendStartup (port=0x202a6988) at postmaster.c:2207
#10 0x10002538 in ServerLoop () at postmaster.c:1119
#11 0x10001f8c in PostmasterMain (argc=1, argv=0x20270698) at postmaster.c:897
#12 0x100005f0 in main (argc=1, argv=0x2ff22b8c) at main.c:214
(gdb) 

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca


Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
On Thu, Apr 15, 2004 at 01:07:33PM -0400, Andrew Sullivan wrote:
> We've had a backend crash with sig 11 during connection.  

By the way, I failed to mention, but sig 11 is segfault on AIX.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca


Re: signal 11 on AIX: 7.4.2

From
Tom Lane
Date:
Andrew Sullivan <ajs@crankycanuck.ca> writes:
> We've had a backend crash with sig 11 during connection.  My guess is
> there's something up with (maybe) the IPv6 support on AIX.

> (gdb) bt
> #0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
> #1  0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
> #2  0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
> #3  0x1005860c in getaddrinfo_all (hostname=0x34e0 "", 
>     servname=0x74696f <Address 0x74696f out of bounds>, hintp=0xf03a2e80, 
>     result=0x74696f) at ip.c:78
> #4  0x101f9330 in parse_hba (line=0x202ae198, port=0x202a6988, 
>     found_p=0x2ff1f810 "", error_p=0x2ff1f811 "") at hba.c:669

Hm, a crash inside the system-supplied getaddrinfo routine would suggest
that there's something wrong with the values we are passing into it.
The most likely bet is that we don't agree with libc about the layout of
"struct addrinfo".  The configure script goes out of its way to be
paranoid about this, because we've seen it get confused by add-on
libbind installations (see also the head comment in
src/include/getaddrinfo.h) ... but I'll bet that AIX has found another
way to trip it up.

I can see from your trace that you are using the getaddrinfo code from
libc, but where is configure finding a header that declares struct
addrinfo?
        regards, tom lane


Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
(Sorry, had a mail problem here this weekend.)

On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote:
> 
> I can see from your trace that you are using the getaddrinfo code from
> libc, but where is configure finding a header that declares struct
> addrinfo?

Hrm, I can't seem to tell.  I see this in config.log, but it isn't
telling me where it found it.  Am I looking in the wrong place?  (I
expect so):

configure:10245: $? = 0
configure:10248: test -s conftest.o
configure:10251: $? = 0
configure:10261: result: yes
configure:10272: checking for struct addrinfo
configure:10303: gcc -c -O2 -fno-strict-aliasing -g
-I/path/to/readline-4.2/include/
-I/path/to/zlib-1.1.4/include/ 
conftest.c >&5

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what 
you told them to.  That actually seems sort of quaint now.    --J.D. Baldwin


Re: signal 11 on AIX: 7.4.2

From
Tom Lane
Date:
Andrew Sullivan <ajs@crankycanuck.ca> writes:
> On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote:
>> I can see from your trace that you are using the getaddrinfo code from
>> libc, but where is configure finding a header that declares struct
>> addrinfo?

> Hrm, I can't seem to tell.  I see this in config.log, but it isn't
> telling me where it found it.  Am I looking in the wrong place?

What you'd need to do is determine which system headers are being
#include'd by that config test, and then look through them to find
struct addrinfo.

A shortcut is just to grep through /usr/include and its subdirectories
for addrinfo.  If you only find one definition, then you don't really
need to worry too much.  But if there's more than one you need to
determine which is getting used.
        regards, tom lane


Re: signal 11 on AIX: 7.4.2

From
Alvaro Herrera
Date:
On Mon, Apr 19, 2004 at 11:18:07AM -0400, Tom Lane wrote:

> A shortcut is just to grep through /usr/include and its subdirectories
> for addrinfo.  If you only find one definition, then you don't really
> need to worry too much.  But if there's more than one you need to
> determine which is getting used.

Maybe an easier way is to examine the output of cpp src/include/c.h.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"En las profundidades de nuestro inconsciente hay una obsesiva necesidad
de un universo lógico y coherente. Pero el universo real se halla siempre
un paso más allá de la lógica" (Irulan)


Re: signal 11 on AIX: 7.4.2

From
Jan Wieck
Date:
Tom Lane wrote:

> Andrew Sullivan <ajs@crankycanuck.ca> writes:
>> On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote:
>>> I can see from your trace that you are using the getaddrinfo code from
>>> libc, but where is configure finding a header that declares struct
>>> addrinfo?
> 
>> Hrm, I can't seem to tell.  I see this in config.log, but it isn't
>> telling me where it found it.  Am I looking in the wrong place?
> 
> What you'd need to do is determine which system headers are being
> #include'd by that config test, and then look through them to find
> struct addrinfo.

Judging by gdb's structure printing, the crashed postgres instance used
the non-43 compatible 64-bit version of the structure. What I don't
really get is that the whole exercise seems to have scribbled over the
stack. The hints pointer originating from the on-stack structure in
parse_hba is somehow pointing into the blue.


Jan

> 
> A shortcut is just to grep through /usr/include and its subdirectories
> for addrinfo.  If you only find one definition, then you don't really
> need to worry too much.  But if there's more than one you need to
> determine which is getting used.
> 
>             regards, tom lane
> 


-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
On Mon, Apr 19, 2004 at 11:18:07AM -0400, Tom Lane wrote:
> 
> What you'd need to do is determine which system headers are being
> #include'd by that config test, and then look through them to find
> struct addrinfo.

Well, I have this in /usr/include/netdb.h:

struct addrinfo {
        int              ai_flags;      /* AI_PASSIVE, AI_CANONNAME, AI_NUMERICHOST */
        int              ai_family;     /* PF_xxx */
        int              ai_socktype;   /* SOCK_xxx */
        int              ai_protocol;   /* 0 or IPPROTO_xxx */
        size_t           ai_addrlen;    /* length of ai_addr */
        char            *ai_canonname;  /* canonical name for hostname */
        struct sockaddr *ai_addr;       /* binary address */
        struct addrinfo *ai_next;       /* next structure in list */
};

Using the cpp trick that Alvaro Herrera suggested, I see that file
mentioned in the output, and this a little way along:

struct addrinfo {
        int              ai_flags;
        int              ai_family;
        int              ai_socktype;
        int              ai_protocol;
        size_t           ai_addrlen;
        char            *ai_canonname;
        struct sockaddr *ai_addr;
        struct addrinfo *ai_next;
};

So it looks like that must be the one.  Dunno if this helps.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca


Re: signal 11 on AIX: 7.4.2

From
Bruce Momjian
Date:
Has this been resolved?

---------------------------------------------------------------------------

Andrew Sullivan wrote:
> On Mon, Apr 19, 2004 at 11:18:07AM -0400, Tom Lane wrote:
> > 
> > What you'd need to do is determine which system headers are being
> > #include'd by that config test, and then look through them to find
> > struct addrinfo.
> 
> Well, I have this in /usr/include/netdb.h:
> 
> struct addrinfo {
>         int              ai_flags;      /* AI_PASSIVE, AI_CANONNAME, AI_NUMERICHOST */
>         int              ai_family;     /* PF_xxx */
>         int              ai_socktype;   /* SOCK_xxx */
>         int              ai_protocol;   /* 0 or IPPROTO_xxx */
>         size_t           ai_addrlen;    /* length of ai_addr */
>         char            *ai_canonname;  /* canonical name for hostname */
>         struct sockaddr *ai_addr;       /* binary address */
>         struct addrinfo *ai_next;       /* next structure in list */
> };
> 
> Using the cpp trick that Alvaro Herrera suggested, I see that file
> mentioned in the output, and this a little way along:
> 
> struct addrinfo {
>         int              ai_flags;       
>         int              ai_family;      
>         int              ai_socktype;    
>         int              ai_protocol;    
>         size_t           ai_addrlen;     
>         char            *ai_canonname;   
>         struct sockaddr *ai_addr;        
>         struct addrinfo *ai_next;        
> };
> 
> So it looks like that must be the one.  Dunno if this helps.
> 
> A
> 
> -- 
> Andrew Sullivan  | ajs@crankycanuck.ca
> 
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
On Mon, Apr 26, 2004 at 03:19:21PM -0400, Bruce Momjian wrote:
> 
> Has this been resolved?

Not as far as I know.  Unfortunately, the problem happened in an
environment I Can't Play With, and I haven't been able to reproduce
it elsewhere.  I've been trying some alternative approaches to
causing it today, and so far no luck.

Jan is, AFAIK, similarly mystified about what happened.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca


Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
On Wed, Apr 28, 2004 at 03:56:55PM -0400, Andrew Sullivan wrote:
> On Mon, Apr 26, 2004 at 03:19:21PM -0400, Bruce Momjian wrote:
> > 
> > Has this been resolved?

> it elsewhere.  I've been trying some alternative approaches to
> causing it today, and so far no luck.

On the weekend, we ran a set of tests on the offending system to see
if we could re-create it.  We set up the triggering conditions just
as they'd been when it happened, and alas, no segfault.  So although
this was pretty much regularly reproducible when it actually
happened, it's now a note to the Journal of Irreproducible Results. 
I hate when that happens.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca


Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
On Mon, May 10, 2004 at 11:59:40AM -0400, Andrew Sullivan wrote:
> 
> On the weekend, we ran a set of tests on the offending system to see
> if we could re-create it.  We set up the triggering conditions just
> as they'd been when it happened, and alas, no segfault.  So although
> this was pretty much regularly reproducible when it actually
> happened, it's now a note to the Journal of Irreproducible Results. 
> I hate when that happens.

I hate it even more when the symptom comes back inexplicably.  We had
it again.  For the record, here's what gdb says (there are some
high-bit characters in here; dunno how they'll come through in mail):

(gdb) bt
#0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
#1  0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
#2  0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
#3  0x10058668 in WriteControlFile () at xlog.c:2121
#4  0x101f8f78 in init_execution_state (src=0x202acd8c "",
    argOidVect=0x7308710b, nargs=4, rettype=539520040, haspolyarg=-104 '\230')
    at functions.c:121
#5  0x101f9304 in init_sql_fcache (finfo=0xdeadbeef) at functions.c:250
#6  0x101fa57c in set_tz (tz=0x7308710b <Address 0x7308710b out of bounds>)
    at variable.c:261
#7  0x101fa9a4 in assign_timezone (value=0x202ad398 "", doit=-1 '�',
    interactive=-8 '�') at variable.c:584
#8  0x1000466c in PostgresMain (argc=1, argv=0x2002cf38, username=0x1 "")
    at postgres.c:2560
#9  0x100040b0 in PostgresMain (argc=537240896, argv=0xdeadbeef,
    username=0xdeadbeef <Address 0xdeadbeef out of bounds>) at postgres.c:2307
#10 0x10002530 in exec_parse_message (query_string=0x20000a24 "",
    stmt_name=0x5 "", paramTypes=0x0, numParams=0) at postgres.c:1216
#11 0x10001f84 in exec_simple_query (
    query_string=0x2005a540 '�' <repeats 40 times>) at postgres.c:980
#12 0x100005f0 in main (argc=1, argv=0xdeadbeef) at main.c:228


-- 
Andrew Sullivan  | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what 
you told them to.  That actually seems sort of quaint now.    --J.D. Baldwin


Re: signal 11 on AIX: 7.4.2

From
Bruce Momjian
Date:
Andrew Sullivan wrote:
> On Mon, May 10, 2004 at 11:59:40AM -0400, Andrew Sullivan wrote:
> > 
> > On the weekend, we ran a set of tests on the offending system to see
> > if we could re-create it.  We set up the triggering conditions just
> > as they'd been when it happened, and alas, no segfault.  So although
> > this was pretty much regularly reproducible when it actually
> > happened, it's now a note to the Journal of Irreproducible Results. 
> > I hate when that happens.
> 
> I hate it even more when the symptom comes back inexplicably.  We had
> it again.  For the record, here's what gdb says (there are some
> high-bit characters in here; dunno how they'll come through in mail):
> 
> (gdb) bt
> #0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
> #1  0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
> #2  0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
> #3  0x10058668 in WriteControlFile () at xlog.c:2121
> #4  0x101f8f78 in init_execution_state (src=0x202acd8c "", 
>     argOidVect=0x7308710b, nargs=4, rettype=539520040, haspolyarg=-104 '\230')
>     at functions.c:121
> #5  0x101f9304 in init_sql_fcache (finfo=0xdeadbeef) at functions.c:250
> #6  0x101fa57c in set_tz (tz=0x7308710b <Address 0x7308710b out of bounds>)
>     at variable.c:261
> #7  0x101fa9a4 in assign_timezone (value=0x202ad398 "", doit=-1 '�', 
>     interactive=-8 '�') at variable.c:584
> #8  0x1000466c in PostgresMain (argc=1, argv=0x2002cf38, username=0x1 "")
>     at postgres.c:2560
> #9  0x100040b0 in PostgresMain (argc=537240896, argv=0xdeadbeef, 
>     username=0xdeadbeef <Address 0xdeadbeef out of bounds>) at postgres.c:2307
> #10 0x10002530 in exec_parse_message (query_string=0x20000a24 "", 
>     stmt_name=0x5 "", paramTypes=0x0, numParams=0) at postgres.c:1216
> #11 0x10001f84 in exec_simple_query (
>     query_string=0x2005a540 '�' <repeats 40 times>) at postgres.c:980
> #12 0x100005f0 in main (argc=1, argv=0xdeadbeef) at main.c:228

Well, the bad news is that this backtrace isn't very useful.  It states
the query you sent was 40 0xff's, and it says you called
assign_timezone, which called set_tz, which then shows it calling
init_sql_fcache() (impossible), which later calls WriteControlFile()
(impossible), which calls getaddrinfo() (impossible).

My only guess is that getaddrinfo in your libc has a bug somehow that is
corrupting the stack (hence the improper backtrace), then crashing.

As to the cause, I assume this is not reproducible, right?  Is there
something unusual about your DNS setup or something that might have
changed recently that caused getaddrinfo() to do something new?

Of course, the memmove() might be causing the problem and the
getaddrinfo is a corrupt part of the backtrace too.

 


Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
On Thu, Jun 17, 2004 at 01:12:10PM -0400, Bruce Momjian wrote:
> Well, the bad news is that this backtrace isn't very useful. 

No kidding.  It's pretty frustrating.

> My only guess is that getaddrinfo in your libc has a bug somehow that is
> corrupting the stack (hence the improper backtrace), then crashing.

It could be libc on AIX, I suppose, but it strikes me as sort of odd
that nobody else ever sees this.  Unless nobody else is using AIX
5.1, which is of course possible.

One hypothesis is that this is happening at start up time (this core
dump didn't show up in the data/ area, but in the init directory,
however, which makes that theory a little suspect).

> As to the cause, I assume this is not reproducible, right?  Is there

Well, it's reproduced itself a few times, but it isn't reproducible at
will, and we have no clue what is causing it.

> something unusual about your DNS setup or something that might have
> changed recently that caused getaddrinfo() to do something new?

Nothing has changed recently, but we started having this not long
after promoting an RS/6000 to production on AIX 5.1.  Before that we
were all-Solaris.  We have never managed to tickle this on a test
machine.  It's pretty tough to guess what might be going on, at least
for me.  If there are any AIX gurus around, I'd sure like to talk to
them.  (I do have a budget to pay such gurus, BTW!)

> Of course, the memmove() might be causing the problem and the
> getaddrinfo is a corrupt part of the backtrace too.

Yeah, which is why it's so frustrating.  If I could see what it was
doing when it did it, I'd be able to tell.  But without knowing why
it's happening, there's no way to sit up for 6 weeks while I wait for
it to happen.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.     --Dennis Ritchie


Re: signal 11 on AIX: 7.4.2

From
Bruce Momjian
Date:
Andrew Sullivan wrote:
> On Thu, Jun 17, 2004 at 01:12:10PM -0400, Bruce Momjian wrote:
>  
> > Well, the bad news is that this backtrace isn't very useful. 
> 
> No kidding.  It's pretty frustrating.
> 
> > My only guess is that getaddrinfo in your libc has a bug somehow that is
> > corrupting the stack (hence the improper backtrace), then crashing.
> 
> It could be libc on AIX, I suppose, but it strikes me as sort of odd
> that nobody else ever sees this.  Unless nobody else is using AIX
> 5.1, which is of course possible.
> 
> One hypothesis is that this is happening at start up time (this core
> dump didn't show up in the data/ area, but in the init directory,
> however, which makes that theory a little suspect).

When you say "init" directory, what do you mean?  /bin?

 


Re: signal 11 on AIX: 7.4.2

From
"Zeugswetter Andreas SB SD"
Date:
> > My only guess is that getaddrinfo in your libc has a bug somehow that is
> > corrupting the stack (hence the improper backtrace), then crashing.
>
> It could be libc on AIX, I suppose, but it strikes me as sort of odd
> that nobody else ever sees this.  Unless nobody else is using AIX
> 5.1, which is of course possible.

I can confirm that getaddrinfo on AIX 4.3.2 is at least a bit *funny*.
It seems not to honour nsorder and only does DNS, even though the manual says:
"Should there be any discrepancies between this description and the POSIX
description, the POSIX description takes precedence."
The function does return multiple entries; often the first is not the best.

Log is:
LOG:  could not translate service "5432" to address: Host not found
WARNING:  could not create listen socket for "*"
LOG:  could not bind socket for statistics collector: Can't assign requested address
LOG:  disabling statistics collector for lack of working socket

This area probably needs a fix/workaround on AIX :-(

Andreas


Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
On Thu, Jun 17, 2004 at 06:06:12PM -0400, Bruce Momjian wrote:
> 
> When you say "init" directory, what do you mean?  /bin?

No.  The place where the init scripts (which cause postgres to start)
live.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism.                --Brad Holland


Re: signal 11 on AIX: 7.4.2

From
Christopher Browne
Date:
Quoth pgman@candle.pha.pa.us (Bruce Momjian):
> Andrew Sullivan wrote:
>> On Thu, Jun 17, 2004 at 01:12:10PM -0400, Bruce Momjian wrote:
>>  
>> > Well, the bad news is that this backtrace isn't very useful. 
>> 
>> No kidding.  It's pretty frustrating.
>> 
>> > My only guess is that getaddrinfo in your libc has a bug somehow that is
>> > corrupting the stack (hence the improper backtrace), then crashing.
>> 
>> It could be libc on AIX, I suppose, but it strikes me as sort of odd
>> that nobody else ever sees this.  Unless nobody else is using AIX
>> 5.1, which is of course possible.
>> 
>> One hypothesis is that this is happening at start up time (this
>> core dump didn't show up in the data/ area, but in the init
>> directory, however, which makes that theory a little suspect).
>
> When you say "init" directory, what do you mean?  /bin?

No, it's a directory with various "init-like" scripts.

In "premium hosting environments," root access is restricted to the
site operators, so PostgreSQL doesn't get started up from /etc/init.d.

Instead, PostgreSQL and other services get invoked by custom "init
scripts" in a custom "init directory."
-- 
let name="cbbrowne" and tld="ntlug.org" in name ^ "@" ^ tld;;
http://www.ntlug.org/~cbbrowne/sap.html
"I am a bomb technician. If you see me running, try to keep up..."


Re: signal 11 on AIX: 7.4.2

From
Jan Wieck
Date:
On 4/19/2004 1:18 PM, Jan Wieck wrote:

> Tom Lane wrote:
> 
>> Andrew Sullivan <ajs@crankycanuck.ca> writes:
>>> On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote:
>>>> I can see from your trace that you are using the getaddrinfo code from
>>>> libc, but where is configure finding a header that declares struct
>>>> addrinfo?
>> 
>>> Hrm, I can't seem to tell.  I see this in config.log, but it isn't
>>> telling me where it found it.  Am I looking in the wrong place?
>> 
>> What you'd need to do is determine which system headers are being
>> #include'd by that config test, and then look through them to find
>> struct addrinfo.
> 
> Judging by gdb's structure printing, the crashed postgres instance used
> the non-43 compatible 64-bit version of the structure. What I don't
> really get is that the whole exercise seems to have scribbled over the
> stack. The hints pointer originating from the on-stack structure in
> parse_hba is somehow pointing into the blue.

This issue is still not closed, and it is hitting us more and more. So I
would like to add some more of what we have done, in the hope of getting
some more ideas.

The "scribbled over the stack" part turned out to be not true. The stack 
dump is fine if compiled with -O0. The problem persists in 7.4.5.

I have tried to isolate the getaddrinfo() calls by writing a program 
that does the getaddrinfo() calls done during PM startup, then keeps 
100-200 child processes in a fork()/wait() loop and every child process 
does the same getaddrinfo() calls a starting backend would perform 
during the pg_hba parsing. This program does not crash.

So far we did not get a libc from IBM that has debug symbols. So I only 
know that getaddrinfo() calls getaddrinfo2(), which calls memmove() and 
that one crashes with a SIGSEGV. All the call arguments to getaddrinfo() 
look absolutely fine. I hope to get that libc any time soon to see what 
exactly that memmove tries to access.

The problem comes and goes. So either I can cause a coredump just on the 
snap by running a shellscript that does 100 psql -c "select version()" 
calls, or it is next to impossible to crash it at all.

There are numerous reports on the net about getaddrinfo() causing grief 
on AIX and it seems to be IPV6 related. For the moment we intend to 
replace the call with a slightly limited implementation using 
inet_aton() in getaddrinfo_all() whenever AI_NUMERICHOST is set. This 
will lose us the IPV6 support as hba.c can't parse those pg_hba.conf 
lines any more. So it is not a satisfactory workaround for PostgreSQL. 
But I will make that patch available tomorrow night in the event someone
else finds it useful.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: signal 11 on AIX: 7.4.2

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> The problem comes and goes. So either I can cause a coredump just on the 
> snap by running a shellscript that does 100 psql -c "select version()" 
> calls, or it is next to impossible to crash it at all.

Hmm, that's really bizarre.  It seems like the only satisfactory
explanation for that would involve some external condition that varies
over time.  I'm wondering about DNS lookup results in particular.
What values are you asking getaddrinfo to look up, and might those
involve consulting DNS?  If so, try to correlate the crash probability
with changes in your DNS zone contents ...
        regards, tom lane


Re: signal 11 on AIX: 7.4.2

From
Jan Wieck
Date:
On 9/17/2004 7:32 PM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> The problem comes and goes. So either I can cause a coredump just on the 
>> snap by running a shellscript that does 100 psql -c "select version()" 
>> calls, or it is next to impossible to crash it at all.
> 
> Hmm, that's really bizarre.  It seems like the only satisfactory
> explanation for that would involve some external condition that varies
> over time.  I'm wondering about DNS lookup results in particular.
> What values are you asking getaddrinfo to look up, and might those
> involve consulting DNS?  If so, try to correlate the crash probability
> with changes in your DNS zone contents ...
> 
>             regards, tom lane

Except for one "localhost", one "/tmp/.s.PGSQL..." and the "543x" lookup 
during the postmaster start, all lookups are IP addresses with 
AI_NUMERICHOST set. And we have checked with tcpdump that the box really 
does not issue DNS lookups.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
On Fri, Sep 17, 2004 at 07:32:30PM -0400, Tom Lane wrote:

> involve consulting DNS?  If so, try to correlate the crash probability
> with changes in your DNS zone contents ...

No changes.  The systems in question have no access to DNS. 
/etc/hosts only.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.    --Philip Greenspun


Re: signal 11 on AIX: 7.4.2

From
Andrew Sullivan
Date:
On Sat, Sep 18, 2004 at 06:06:05AM -0400, Jan Wieck wrote:
> On 9/17/2004 7:32 PM, Tom Lane wrote:
> >over time.  I'm wondering about DNS lookup results in particular.
> 
> Except for one "localhost", one "/tmp/.s.PGSQL..." and the "543x" lookup 
> during the postmaster start, all lookups are IP addresses with 
> AI_NUMERICHOST set. And we have checked with tcpdump that the box really 
> does not issue DNS lookups.

Just for the sake of posterity, it appears that this is actually a
libc problem on AIX.  In particular, there's a patched libc fileset
which was released to solve a problem where getaddrinfo() returns an
error on valid input.  IBM's AIX support was unwilling to give us
libraries with debug symbols built in, but they did point me at a new
fileset for libc.  We've been running a test load which fairly
consistently produced sig 11s before, and haven't seen one since.  So
we don't have a perfect explanation, but it looks like this is the
cause.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
The plural of anecdote is not data.    --Roger Brinner