Thread: signal 11 on AIX: 7.4.2
We've had a backend crash with sig 11 during connection. My guess is there's something up with (maybe) the IPv6 support on AIX. I seem to recall something similar recently, but I can't find the post in the archives. Suggestions?

oxrslive=# SELECT version();
                                   version
------------------------------------------------------------------------------
 PostgreSQL 7.4.2 on powerpc-ibm-aix5.1.0.0, compiled by GCC 2.9-aix51-020209
(1 row)

GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "powerpc-ibm-aix5.1.0.0"...
Core was generated by `postgres'.
Program terminated with signal 11, Segmentation fault.
#0 0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
(gdb) bt
#0 0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
#1 0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
#2 0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
#3 0x1005860c in getaddrinfo_all (hostname=0x34e0 "", servname=0x74696f <Address 0x74696f out of bounds>, hintp=0xf03a2e80, result=0x74696f) at ip.c:78
#4 0x101f9330 in parse_hba (line=0x202ae198, port=0x202a6988, found_p=0x2ff1f810 "", error_p=0x2ff1f811 "") at hba.c:669
#5 0x101f96bc in check_hba (port=0x202a6988) at hba.c:793
#6 0x101fa934 in hba_getauthmethod (port=0x202b6f3c) at hba.c:1574
#7 0x101fad5c in ClientAuthentication (port=0x202a6988) at auth.c:415
#8 0x10004674 in BackendFork (port=0x202a6988) at postmaster.c:2444
#9 0x100040b8 in BackendStartup (port=0x202a6988) at postmaster.c:2207
#10 0x10002538 in ServerLoop () at postmaster.c:1119
#11 0x10001f8c in PostmasterMain (argc=1, argv=0x20270698) at postmaster.c:897
#12 0x100005f0 in main (argc=1, argv=0x2ff22b8c) at main.c:214
(gdb)

A
-- Andrew Sullivan | ajs@crankycanuck.ca
On Thu, Apr 15, 2004 at 01:07:33PM -0400, Andrew Sullivan wrote: > We've had a backend crash with sig 11 during connection. By the way, I failed to mention, but sig 11 is segfault on AIX. A -- Andrew Sullivan | ajs@crankycanuck.ca
Andrew Sullivan <ajs@crankycanuck.ca> writes: > We've had a backend crash with sig 11 during connection. My guess is > there's something up with (maybe) the IPv6 support on AIX. > (gdb) bt > #0 0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o) > #1 0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o) > #2 0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o) > #3 0x1005860c in getaddrinfo_all (hostname=0x34e0 "", > servname=0x74696f <Address 0x74696f out of bounds>, hintp=0xf03a2e80, > result=0x74696f) at ip.c:78 > #4 0x101f9330 in parse_hba (line=0x202ae198, port=0x202a6988, > found_p=0x2ff1f810 "", error_p=0x2ff1f811 "") at hba.c:669 Hm, a crash inside the system-supplied getaddrinfo routine would suggest that there's something wrong with the values we are passing into it. The most likely bet is that we don't agree with libc about the layout of "struct addrinfo". The configure script goes out of its way to be paranoid about this, because we've seen it get confused by add-on libbind installations (see also the head comment in src/include/getaddrinfo.h) ... but I'll bet that AIX has found another way to trip it up. I can see from your trace that you are using the getaddrinfo code from libc, but where is configure finding a header that declares struct addrinfo? regards, tom lane
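A minimal sketch of how the layout question above could be checked, assuming a POSIX system whose <netdb.h> declares struct addrinfo. The idea is to compile it with the same compiler and flags as the PostgreSQL build and compare the printed size and offsets against a build that uses only the vendor headers. The helper name dump_addrinfo_layout is hypothetical and is not part of PostgreSQL or its configure script.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Print one row (name, offset, size) for a member of struct addrinfo. */
#define SHOW(out, member) \
    fprintf((out), "%-14s offset=%3lu size=%2lu\n", #member, \
            (unsigned long) offsetof(struct addrinfo, member), \
            (unsigned long) sizeof(((struct addrinfo *) 0)->member))

/*
 * Hypothetical helper: dump the struct addrinfo layout that this
 * compilation unit sees. Returns the number of member rows printed.
 */
int
dump_addrinfo_layout(FILE *out)
{
    assert(out != NULL);

    fprintf(out, "sizeof(struct addrinfo) = %lu\n",
            (unsigned long) sizeof(struct addrinfo));

    /* The eight members required by POSIX; their order may vary by platform. */
    SHOW(out, ai_flags);
    SHOW(out, ai_family);
    SHOW(out, ai_socktype);
    SHOW(out, ai_protocol);
    SHOW(out, ai_addrlen);
    SHOW(out, ai_canonname);
    SHOW(out, ai_addr);
    SHOW(out, ai_next);
    return 8;
}
```

Calling dump_addrinfo_layout(stdout) from a small main() prints the total size followed by one row per member. If two builds on the same machine disagree on any offset or on the total size, that would support the header-mismatch theory; identical output would argue against it.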
(Sorry, had a mail problem here this weekend.) On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote: > > I can see from your trace that you are using the getaddrinfo code from > libc, but where is configure finding a header that declares struct > addrinfo? Hrm, I can't seem to tell. I see this in config.log, but it isn't telling me where it found it. Am I looking in the wrong place? (I expect so): configure:10245: $? = 0 configure:10248: test -s conftest.o configure:10251: $? = 0 configure:10261: result: yes configure:10272: checking for struct addrinfo configure:10303: gcc -c -O2 -fno-strict-aliasing -g -I/path/to/readline-4.2/include/ -I/path/to/zlib-1.1.4/include/ conftest.c >&5 A -- Andrew Sullivan | ajs@crankycanuck.ca I remember when computers were frustrating because they *did* exactly what you told them to. That actually seems sort of quaint now. --J.D. Baldwin
Andrew Sullivan <ajs@crankycanuck.ca> writes: > On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote: >> I can see from your trace that you are using the getaddrinfo code from >> libc, but where is configure finding a header that declares struct >> addrinfo? > Hrm, I can't seem to tell. I see this in config.log, but it isn't > telling me where it found it. Am I looking in the wrong place? What you'd need to do is determine which system headers are being #include'd by that config test, and then look through them to find struct addrinfo. A shortcut is just to grep through /usr/include and its subdirectories for addrinfo. If you only find one definition, then you don't really need to worry too much. But if there's more than one you need to determine which is getting used. regards, tom lane
On Mon, Apr 19, 2004 at 11:18:07AM -0400, Tom Lane wrote: > A shortcut is just to grep through /usr/include and its subdirectories > for addrinfo. If you only find one definition, then you don't really > need to worry too much. But if there's more than one you need to > determine which is getting used. Maybe an easier way is to examine the output of cpp src/include/c.h. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "In the depths of our unconscious there is an obsessive need for a logical and coherent universe. But the real universe is always one step beyond logic" (Irulan)
Tom Lane wrote: > Andrew Sullivan <ajs@crankycanuck.ca> writes: >> On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote: >>> I can see from your trace that you are using the getaddrinfo code from >>> libc, but where is configure finding a header that declares struct >>> addrinfo? > >> Hrm, I can't seem to tell. I see this in config.log, but it isn't >> telling me where it found it. Am I looking in the wrong place? > > What you'd need to do is determine which system headers are being > #include'd by that config test, and then look through them to find > struct addrinfo. Judging by gdb's structure printing, the crashed postgres instance used the non-43 compatible 64-bit version of the structure. What I don't really get is that the whole exercise seems to have scribbled over the stack. The hints pointer originating from the on-stack structure in parse_hba is somehow pointing into the blue. Jan > > A shortcut is just to grep through /usr/include and its subdirectories > for addrinfo. If you only find one definition, then you don't really > need to worry too much. But if there's more than one you need to > determine which is getting used. > > regards, tom lane -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Mon, Apr 19, 2004 at 11:18:07AM -0400, Tom Lane wrote:
> What you'd need to do is determine which system headers are being
> #include'd by that config test, and then look through them to find
> struct addrinfo.

Well, I have this in /usr/include/netdb.h:

struct addrinfo {
        int ai_flags;              /* AI_PASSIVE, AI_CANONNAME, AI_NUMERICHOST */
        int ai_family;             /* PF_xxx */
        int ai_socktype;           /* SOCK_xxx */
        int ai_protocol;           /* 0 or IPPROTO_xxx */
        size_t ai_addrlen;         /* length of ai_addr */
        char *ai_canonname;        /* canonical name for hostname */
        struct sockaddr *ai_addr;  /* binary address */
        struct addrinfo *ai_next;  /* next structure in list */
};

Using the cpp trick that Alvaro Herrera suggested, I see that file mentioned in the output, and this a little way along:

struct addrinfo {
        int ai_flags;
        int ai_family;
        int ai_socktype;
        int ai_protocol;
        size_t ai_addrlen;
        char *ai_canonname;
        struct sockaddr *ai_addr;
        struct addrinfo *ai_next;
};

So it looks like that must be the one. Dunno if this helps.

A
-- Andrew Sullivan | ajs@crankycanuck.ca
Has this been resolved? --------------------------------------------------------------------------- Andrew Sullivan wrote: > So it looks like that must be the one. Dunno if this helps. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, Apr 26, 2004 at 03:19:21PM -0400, Bruce Momjian wrote: > > Has this been resolved? Not as far as I know. Unfortunately, the problem happened in an environment I Can't Play With, and I haven't been able to reproduce it elsewhere. I've been trying some alternative approaches to causing it today, and so far no luck. Jan is, AFAIK, similarly mystified about what happened. A -- Andrew Sullivan | ajs@crankycanuck.ca
On Wed, Apr 28, 2004 at 03:56:55PM -0400, Andrew Sullivan wrote: > On Mon, Apr 26, 2004 at 03:19:21PM -0400, Bruce Momjian wrote: > > > > Has this been resolved? > it elsewhere. I've been trying some alternative approaches to > causing it today, and so far no luck. On the weekend, we ran a set of tests on the offending system to see if we could re-create it. We set up the triggering conditions just as they'd been when it happened, and alas, no segfault. So although this was pretty much regularly reproducible when it actually happened, it's now a note to the Journal of Irreproducible Results. I hate when that happens. A -- Andrew Sullivan | ajs@crankycanuck.ca
On Mon, May 10, 2004 at 11:59:40AM -0400, Andrew Sullivan wrote:
> On the weekend, we ran a set of tests on the offending system to see
> if we could re-create it. We set up the triggering conditions just
> as they'd been when it happened, and alas, no segfault. So although
> this was pretty much regularly reproducible when it actually
> happened, it's now a note to the Journal of Irreproducible Results.
> I hate when that happens.

I hate it even more when the symptom comes back inexplicably. We had it again. For the record, here's what gdb says (there are some high-bit characters in here; dunno how they'll come through in mail):

(gdb) bt
#0 0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
#1 0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
#2 0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
#3 0x10058668 in WriteControlFile () at xlog.c:2121
#4 0x101f8f78 in init_execution_state (src=0x202acd8c "", argOidVect=0x7308710b, nargs=4, rettype=539520040, haspolyarg=-104 '\230') at functions.c:121
#5 0x101f9304 in init_sql_fcache (finfo=0xdeadbeef) at functions.c:250
#6 0x101fa57c in set_tz (tz=0x7308710b <Address 0x7308710b out of bounds>) at variable.c:261
#7 0x101fa9a4 in assign_timezone (value=0x202ad398 "", doit=-1 '�', interactive=-8 '�') at variable.c:584
#8 0x1000466c in PostgresMain (argc=1, argv=0x2002cf38, username=0x1 "") at postgres.c:2560
#9 0x100040b0 in PostgresMain (argc=537240896, argv=0xdeadbeef, username=0xdeadbeef <Address 0xdeadbeef out of bounds>) at postgres.c:2307
#10 0x10002530 in exec_parse_message (query_string=0x20000a24 "", stmt_name=0x5 "", paramTypes=0x0, numParams=0) at postgres.c:1216
#11 0x10001f84 in exec_simple_query (query_string=0x2005a540 '�' <repeats 40 times>) at postgres.c:980
#12 0x100005f0 in main (argc=1, argv=0xdeadbeef) at main.c:228

-- Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what you told them to. That actually seems sort of quaint now. --J.D. Baldwin
Andrew Sullivan wrote: > On Mon, May 10, 2004 at 11:59:40AM -0400, Andrew Sullivan wrote: > > > > On the weekend, we ran a set of tests on the offending system to see > > if we could re-create it. We set up the triggering conditions just > > as they'd been when it happened, and alas, no segfault. So although > > this was pretty much regularly reproducible when it actually > > happened, it's now a note to the Journal of Irreproducible Results. > > I hate when that happens. > > I hate it even more when the symptom comes back inexplicably. We had > it again. For the record, here's what gdb says (there are some > high-bit characters in here; dunno how they'll come though in mail): > > (gdb) bt > #0 0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o) > #1 0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o) > #2 0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o) > #3 0x10058668 in WriteControlFile () at xlog.c:2121 > #4 0x101f8f78 in init_execution_state (src=0x202acd8c "", > argOidVect=0x7308710b, nargs=4, rettype=539520040, haspolyarg=-104 '\230') > at functions.c:121 > #5 0x101f9304 in init_sql_fcache (finfo=0xdeadbeef) at functions.c:250 > #6 0x101fa57c in set_tz (tz=0x7308710b <Address 0x7308710b out of bounds>) > at variable.c:261 > #7 0x101fa9a4 in assign_timezone (value=0x202ad398 "", doit=-1 '�', > interactive=-8 '�') at variable.c:584 > #8 0x1000466c in PostgresMain (argc=1, argv=0x2002cf38, username=0x1 "") > at postgres.c:2560 > #9 0x100040b0 in PostgresMain (argc=537240896, argv=0xdeadbeef, > username=0xdeadbeef <Address 0xdeadbeef out of bounds>) at postgres.c:2307 > #10 0x10002530 in exec_parse_message (query_string=0x20000a24 "", > stmt_name=0x5 "", paramTypes=0x0, numParams=0) at postgres.c:1216 > #11 0x10001f84 in exec_simple_query ( > query_string=0x2005a540 '�' <repeats 40 times>) at postgres.c:980 > #12 0x100005f0 in main (argc=1, argv=0xdeadbeef) at main.c:228 Well, the bad news is that this backtrace isn't very useful. 
It states the query you sent was 40 0xff's, and it says you called assign_timezone, which called set_tz, which then shows it calling init_sql_fcache() (impossible), which later calls WriteControlFile() (impossible), which calls getaddrinfo() (impossible). My only guess is that getaddrinfo in your libc has a bug somehow that is corrupting the stack (hence the improper backtrace), then crashing. As to the cause, I assume this is not reproducible, right? Is there something unusual about your DNS setup or something that might have changed recently that caused getaddrinfo() to do something new? Of course, the memmove() might be causing the problem and the getaddrinfo is a corrupt part of the backtrace too. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Thu, Jun 17, 2004 at 01:12:10PM -0400, Bruce Momjian wrote: > Well, the bad news is that this backtrace isn't very useful. No kidding. It's pretty frustrating. > My only guess is that getaddrinfo in your libc has a bug somehow that is > corrupting the stack (hence the improper backtrace), then crashing. It could be libc on AIX, I suppose, but it strikes me as sort of odd that nobody else ever sees this. Unless nobody else is using AIX 5.1, which is of course possible. One hypothesis is that this is happening at start up time (this core dump didn't show up in the data/ area, but in the init directory, however, which makes that theory a little suspect). > As to the cause, I assume this is not reproducible, right? Is there Well, it's reproduced itself a few times, but it isn't reproducible at will, and we have no clue what is causing it. > something unusual about your DNS setup or something that might have > changed recently that caused getaddrinfo() to do something new? Nothing has changed recently, but we started having this not long after promoting an RS/6000 to production on AIX 5.1. Before that we were all-Solaris. We have never managed to tickle this on a test machine. It's pretty tough to guess what might be going on, at least for me. If there are any AIX gurus around, I'd sure like to talk to them. (I do have a budget to pay such gurus, BTW!) > Of course, the memmove() might be causing the problem and the > getaddrinfo is a corrupt part of the backtrace too. Yeah, which is why it's so frustrating. If I could see what it was doing when it did it, I'd be able to tell. But without knowing why it's happening, there's no way to sit up for 6 weeks while I wait for it to happen. A -- Andrew Sullivan | ajs@crankycanuck.ca This work was visionary and imaginative, and goes to show that visionary and imaginative work need not end up well. --Dennis Ritchie
Andrew Sullivan wrote: > On Thu, Jun 17, 2004 at 01:12:10PM -0400, Bruce Momjian wrote: > > > Well, the bad news is that this backtrace isn't very useful. > > No kidding. It's pretty frustrating. > > > My only guess is that getaddrinfo in your libc has a bug somehow that is > > corrupting the stack (hence the improper backtrace), then crashing. > > It could be libc on AIX, I suppose, but it strikes me as sort of odd > that nobody else ever sees this. Unless nobody else is using AIX > 5.1, which is of course possible. > > One hypothesis is that this is happening at start up time (this core > dump didn't show up in the data/ area, but in the init directory, > however, which makes that theory a little suspect). When you say "init" directory, what do you mean? /bin? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> > My only guess is that getaddrinfo in your libc has a bug somehow that is > > corrupting the stack (hence the improper backtrace), then crashing. > > It could be libc on AIX, I suppose, but it strikes me as sort of odd > that nobody else ever sees this. Unless nobody else is using AIX > 5.1, which is of course possible. I can confirm that AIX 4.3.2 getaddrinfo is at least a bit *funny*. getaddrinfo seems not to honour nsorder and only does DNS, even though the manual says: "Should there be any discrepancies between this description and the POSIX description, the POSIX description takes precedence." The function does return multiple entries, and often the first is not the best. The log is: LOG: could not translate service "5432" to address: Host not found WARNING: could not create listen socket for "*" LOG: could not bind socket for statistics collector: Can't assign requested address LOG: disabling statistics collector for lack of working socket This area probably needs a fix/workaround on AIX :-( Andreas
On Thu, Jun 17, 2004 at 06:06:12PM -0400, Bruce Momjian wrote: > > When you say "init" directory, what do you mean? /bin? No. The place where the init scripts (which cause postgres to start) live. A -- Andrew Sullivan | ajs@crankycanuck.ca In the future this spectacle of the middle classes shocking the avant- garde will probably become the textbook definition of Postmodernism. --Brad Holland
Quoth pgman@candle.pha.pa.us (Bruce Momjian): > Andrew Sullivan wrote: >> On Thu, Jun 17, 2004 at 01:12:10PM -0400, Bruce Momjian wrote: >> >> > Well, the bad news is that this backtrace isn't very useful. >> >> No kidding. It's pretty frustrating. >> >> > My only guess is that getaddrinfo in your libc has a bug somehow that is >> > corrupting the stack (hence the improper backtrace), then crashing. >> >> It could be libc on AIX, I suppose, but it strikes me as sort of odd >> that nobody else ever sees this. Unless nobody else is using AIX >> 5.1, which is of course possible. >> >> One hypothesis is that this is happening at start up time (this >> core dump didn't show up in the data/ area, but in the init >> directory, however, which makes that theory a little suspect). > > When you say "init" directory, what do you mean? /bin? No, it's a directory with various "init-like" scripts. In "premium hosting environments," root access is restricted to the site operators, so PostgreSQL doesn't get started up from /etc/init.d. Instead, PostgreSQL and other services get invoked by custom "init scripts" in a custom "init directory." -- let name="cbbrowne" and tld="ntlug.org" in name ^ "@" ^ tld;; http://www.ntlug.org/~cbbrowne/sap.html "I am a bomb technician. If you see me running, try to keep up..."
On 4/19/2004 1:18 PM, Jan Wieck wrote: > Tom Lane wrote: > >> Andrew Sullivan <ajs@crankycanuck.ca> writes: >>> On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote: >>>> I can see from your trace that you are using the getaddrinfo code from >>>> libc, but where is configure finding a header that declares struct >>>> addrinfo? >> >>> Hrm, I can't seem to tell. I see this in config.log, but it isn't >>> telling me where it found it. Am I looking in the wrong place? >> >> What you'd need to do is determine which system headers are being >> #include'd by that config test, and then look through them to find >> struct addrinfo. > > judging by gdb's structure printing, the crashed postgres instance used > the non-43 compatible 64-bit version of the structure. What I don't > really get is that the whole exercise seems to have scribbled over the > stack. The hints pointer originating from the on-stack structure in > parse_hba is somehow pointing into the blue. This issue is still not closed and it is hitting us more and more. So I would like to add some more of what we have done, in the hope of getting some more ideas. The "scribbled over the stack" part turned out not to be true. The stack dump is fine if compiled with -O0. The problem persists in 7.4.5. I have tried to isolate the getaddrinfo() calls by writing a program that does the getaddrinfo() calls done during PM startup, then keeps 100-200 child processes in a fork()/wait() loop, and every child process does the same getaddrinfo() calls a starting backend would perform during the pg_hba parsing. This program does not crash. So far we did not get a libc from IBM that has debug symbols. So I only know that getaddrinfo() calls getaddrinfo2(), which calls memmove(), and that one crashes with a SIGSEGV. All the call arguments to getaddrinfo() look absolutely fine. I hope to get that libc any time soon to see what exactly that memmove tries to access. The problem comes and goes. So either I can cause a coredump just on the snap by running a shellscript that does 100 psql -c "select version()" calls, or it is next to impossible to crash it at all. There are numerous reports on the net about getaddrinfo() causing grief on AIX and it seems to be IPv6 related. For the moment we intend to replace the call with a slightly limited implementation using inet_aton() in getaddrinfo_all() whenever AI_NUMERICHOST is set. This will lose us the IPv6 support, as hba.c can't parse those pg_hba.conf lines any more. So it is not a satisfactory workaround for PostgreSQL. But I will make that patch available tomorrow night in the event someone else finds it useful. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
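The inet_aton() workaround described above might look roughly like the following. This is only a sketch under assumptions and is not the patch mentioned in the message: the function name numeric_ipv4_lookup and its signature are hypothetical, and it accepts IPv4 numeric input only, which is exactly why IPv6 lines in pg_hba.conf would stop parsing.

```c
#define _DEFAULT_SOURCE          /* ask glibc to declare inet_aton() */
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical stand-in for a numeric-host-only lookup: fill *out from
 * an IPv4 address string and a port, without going through getaddrinfo().
 * Returns 0 on success, -1 if host is not a valid IPv4 address.
 */
int
numeric_ipv4_lookup(const char *host, unsigned short port,
                    struct sockaddr_in *out)
{
    struct in_addr addr;

    assert(host != NULL && out != NULL);

    /* inet_aton() returns 0 for anything it cannot parse, including IPv6. */
    if (inet_aton(host, &addr) == 0)
        return -1;

    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);     /* network byte order */
    out->sin_addr = addr;            /* already in network byte order */
    return 0;
}
```

A caller would branch on AI_NUMERICHOST and fall back to the regular getaddrinfo() path for everything else. Passing "::1" returns -1 here, which illustrates the IPv6 limitation noted in the message.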
Jan Wieck <JanWieck@Yahoo.com> writes: > The problem comes and goes. So either I can cause a coredump just on the > snap by running a shellscript that does 100 psql -c "select version()" > calls, or it is next to impossible to crash it at all. Hmm, that's really bizarre. It seems like the only satisfactory explanation for that would involve some external condition that varies over time. I'm wondering about DNS lookup results in particular. What values are you asking getaddrinfo to look up, and might those involve consulting DNS? If so, try to correlate the crash probability with changes in your DNS zone contents ... regards, tom lane
On 9/17/2004 7:32 PM, Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >> The problem comes and goes. So either I can cause a coredump just on the >> snap by running a shellscript that does 100 psql -c "select version()" >> calls, or it is next to impossible to crash it at all. > > Hmm, that's really bizarre. It seems like the only satisfactory > explanation for that would involve some external condition that varies > over time. I'm wondering about DNS lookup results in particular. > What values are you asking getaddrinfo to look up, and might those > involve consulting DNS? If so, try to correlate the crash probability > with changes in your DNS zone contents ... > > regards, tom lane Except for one "localhost", one "/tmp/.s.PGSQL..." and the "543x" lookup during the postmaster start, all lookups are IP addresses with AI_NUMERICHOST set. And we have checked with tcpdump that the box really does not issue DNS lookups. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
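For readers who want to try something similar, here is a simplified sketch of a fork()/wait() harness that repeats numeric lookups of the kind listed above. It is not the test program from the thread: the function names, the fixed "127.0.0.1"/"5432" inputs, and the exit-status convention are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <netdb.h>
#include <unistd.h>

/* One numeric lookup, as a backend might do for a pg_hba.conf address. */
static int
lookup_once(const char *ip, const char *service)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;     /* never consult DNS */

    rc = getaddrinfo(ip, service, &hints, &res);
    if (rc == 0)
        freeaddrinfo(res);
    return rc;
}

/*
 * Fork nchildren processes that each run `iterations` lookups.
 * Returns how many children failed: fork error, lookup error,
 * or death by signal (for example SIGSEGV).
 */
int
stress_numeric_lookups(int nchildren, int iterations)
{
    int failed = 0;
    int started = 0;
    int i;

    assert(nchildren >= 0 && iterations >= 0);

    for (i = 0; i < nchildren; i++)
    {
        pid_t pid = fork();

        if (pid < 0)
        {
            failed++;
            continue;
        }
        if (pid == 0)
        {
            int j;

            for (j = 0; j < iterations; j++)
                if (lookup_once("127.0.0.1", "5432") != 0)
                    _exit(1);
            _exit(0);
        }
        started++;
    }

    /* Reap every child; a signal death shows up as !WIFEXITED. */
    while (started-- > 0)
    {
        int status;

        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed++;
    }
    return failed;
}
```

A nonzero return from stress_numeric_lookups(200, 1000), say, would mean at least one child hit an error or crashed. As the thread notes, a harness like this did not reproduce the crash, so a clean run proves little.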
On Fri, Sep 17, 2004 at 07:32:30PM -0400, Tom Lane wrote: > involve consulting DNS? If so, try to correlate the crash probability > with changes in your DNS zone contents ... No changes. The systems in question have no access to DNS. /etc/hosts only. A -- Andrew Sullivan | ajs@crankycanuck.ca The fact that technology doesn't work is no bar to success in the marketplace. --Philip Greenspun
On Sat, Sep 18, 2004 at 06:06:05AM -0400, Jan Wieck wrote: > On 9/17/2004 7:32 PM, Tom Lane wrote: > >over time. I'm wondering about DNS lookup results in particular. > > Except for one "localhost", one "/tmp/.s.PGSQL..." and the "543x" lookup > during the postmaster start, all lookups are IP addresses with > AI_NUMERICHOST set. And we have checked with tcpdump that the box really > does not issue DNS lookups. Just for the sake of posterity, it appears that this is actually a libc problem on AIX. In particular, there's a patched libc fileset which was released to solve a problem where getaddrinfo() returns an error on valid input. IBM's AIX support was unwilling to give us libraries with debug symbols built in, but they did point me at a new fileset for libc. We've been running a test load which fairly consistently produced sig 11s before, and haven't seen one since. So we don't have a perfect explanation, but it looks like this is the cause. A -- Andrew Sullivan | ajs@crankycanuck.ca The plural of anecdote is not data. --Roger Brinner