Thread: Fwd: Bug#249083: postgresql: Postgres SIGSEGV if wins in nsswitch.conf
Hi PostgreSQL developers! Does anybody happen to use Postgresql with winbind authentication? We got the bug report below where this seems to crash. Is this problem known? Thanks and have a nice day! Martin ----- Forwarded message from Cory Dodt <cdodt@turing> ----- Subject: Bug#249083: postgresql: Postgres SIGSEGV if wins in nsswitch.conf Reply-To: Cory Dodt <cdodt@turing>, 249083@bugs.debian.org From: Cory Dodt <cdodt@turing> To: Debian Bug Tracking System <submit@bugs.debian.org> Date: Fri, 14 May 2004 14:56:11 -0700 X-Spam-Status: No, hits=3D1.6 required=3D4.0 tests=3DBAYES_44,NO_DNS_FOR_FR= OM=20 autolearn=3Dno version=3D2.61 Package: postgresql Version: 7.4.2-4 Severity: normal [system information] When configured to look up names with winbind (3.0.2a-1/sarge), postmaster crashes with the following: 2004-05-14 14:50:14 [8725] LOG: could not create IPv6 socket: Address fami= ly not supported by protocol 2004-05-14 14:50:14 [8725] LOG: authentication file token too long, skippi= ng: "=98.=ED=F1 Segmentation fault -------------------- ... the nsswitch.conf that goes with the above contains: hosts: wins files dns However if I change the name lookup order to: hosts: files dns wins pg starts up normally and the authentication file token error disappears. (The IPV6 error remains.) I tried to get a stack trace; unfortunately the stack is corrupted but this much (truncated) looks for real: ... #8 0x40ecb020 in ?? () from /lib/libnss_wins.so.2 ... #35 0x4028efce in __errno_location () from /lib/libc.so.6=20 Winbind name lookup works fine normally, still I wouldn't be too much surpr= ised if this was in fact a winbind bug. ----- End forwarded message ----- --=20 Martin Pitt Debian GNU/Linux Developer martin@piware.de mpitt@debian.org http://www.piware.de http://www.debian.org
Martin Pitt wrote: > Hi PostgreSQL developers! > > Does anybody happen to use Postgresql with winbind authentication? We > got the bug report below where this seems to crash. > > Is this problem known? > Subject: Bug#249083: postgresql: Postgres SIGSEGV if wins in nsswitch.conf > Reply-To: Cory Dodt <cdodt@turing>, 249083@bugs.debian.org > To: Debian Bug Tracking System <submit@bugs.debian.org> > Date: Fri, 14 May 2004 14:56:11 -0700 > Package: postgresql > Version: 7.4.2-4 > Severity: normal > > [system information] > > When configured to look up names with winbind (3.0.2a-1/sarge), postmaster > crashes with the following: > > 2004-05-14 14:50:14 [8725] LOG: could not create IPv6 socket: Address family not supported by protocol > 2004-05-14 14:50:14 [8725] LOG: authentication file token too long, skipping: ".íñ > Segmentation fault Duplicated this on Fedora 2 (RPM), but I don't get even that much logging. I can start single-user /usr/bin/postgres though. > -------------------- > ... the nsswitch.conf that goes with the above contains: > hosts: wins files dns > > However if I change the name lookup order to: > hosts: files dns wins > > pg starts up normally and the authentication file token error disappears. > (The IPV6 error remains.) Don't worry about the IPV6 - that's normal. If I run gdb --exec=/usr/bin/postmaster ... Program received signal SIGSEGV, Segmentation fault. 0x008805c7 in getc () from /lib/tls/libc.so.6 (gdb) bt #0 0x008805c7 in getc () from /lib/tls/libc.so.6 #1 0x0812089b in ?? () #2 0xfeefd090 in ?? () #3 0x004612d0 in ?? () from /lib/libnss_wins.so.2 #4 0x00000000 in ?? () Not much I know, but I don't have time this afternoon to setup and compile on my workstation. I can say that the contents of pg_hba.conf don't seem to matter (i.e. local=trust), so I'm guessing it's looking up the localhost name(s). The tail of strace /usr/bin/postmaster seems to bear this out: unlink("/var/lib/pgsql/data/global/pgstat.stat") = 0 gettimeofday({1085493152, 677972}, NULL) = 0 open("/etc/resolv.conf", O_RDONLY) = 4 fstat64(4, {st_mode=S_IFREG|0644, st_size=43, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf642e000 read(4, "search archonet.com\nnameserver 1"..., 4096) = 43 read(4, "", 4096) = 0 close(4) = 0 munmap(0xf642e000, 4096) = 0 open("/etc/ld.so.cache", O_RDONLY) = 4 fstat64(4, {st_mode=S_IFREG|0644, st_size=59776, ...}) = 0 old_mmap(NULL, 59776, PROT_READ, MAP_PRIVATE, 4, 0) = 0xf641f000 close(4) = 0 open("/lib/tls/i686/libnss_wins.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) stat64("/lib/tls/i686", 0xfef90bc4) = -1 ENOENT (No such file or directory) open("/lib/tls/libnss_wins.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) stat64("/lib/tls", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 open("/lib/i686/libnss_wins.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) stat64("/lib/i686", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 open("/lib/libnss_wins.so.2", O_RDONLY) = 4 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0@\276\1"..., 512) = 512 fstat64(4, {st_mode=S_IFREG|0755, st_size=733664, ...}) = 0 old_mmap(NULL, 806248, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0xe82000 old_mmap(0xf2d000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xaa000) = 0xf2d000 old_mmap(0xf36000, 68968, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf36000 close(4) = 0 open("/usr/lib/libldap.so.2", O_RDONLY) = 4 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\360\226"..., 512) = 512 fstat64(4, {st_mode=S_IFREG|0755, st_size=203068, ...}) = 0 old_mmap(0x971000, 200116, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x111000 old_mmap(0x141000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x30000) = 0x141000 close(4) = 0 open("/usr/lib/liblber.so.2", O_RDONLY) = 4 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000\344"..., 512) = 512 fstat64(4, {st_mode=S_IFREG|0755, st_size=48528, ...}) = 0 old_mmap(0x95c000, 46160, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x142000 old_mmap(0x14d000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xb000) = 0x14d000 close(4) = 0 open("/usr/lib/libsasl2.so.2", O_RDONLY) = 4 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0p\217\224"..., 512) = 512 fstat64(4, {st_mode=S_IFREG|0755, st_size=82288, ...}) = 0 old_mmap(0x946000, 79676, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x14e000 old_mmap(0x161000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x13000) = 0x161000 close(4) = 0 munmap(0xf641f000, 59776) = 0 time(NULL) = 1085493152 gettimeofday({1085493152, 694560}, NULL) = 0 stat64("/etc/samba/smb.conf", {st_mode=S_IFREG|0644, st_size=10677, ...}) = 0 uname({sys="Linux", node="client17.archonet.com", ...}) = 0 --- SIGSEGV (Segmentation fault) @ 0 (0) --- In the output above, my workstation is "client17". My smb.conf hasn't been altered from the default. > I tried to get a stack trace; unfortunately the stack is corrupted but this > much (truncated) looks for real: > ... > #8 0x40ecb020 in ?? () from /lib/libnss_wins.so.2 > ... > #35 0x4028efce in __errno_location () from /lib/libc.so.6 For the record, my Fedora setup is slightly behind the release version of core-2. -- Richard Huxton Archonet Ltd
Martin Pitt <martin@piware.de> writes: > 2004-05-14 14:50:14 [8725] LOG: authentication file token too long, skippi= > ng: "=98.=ED=F1 > Segmentation fault Looking at the only place this message is produced, in src/backend/libpq/hba.c, it appears that we are printing a string buffer that is not guaranteed null-terminated. The segfault might be due to that. I would suggest adding more paranoia along these lines: if (buf >= end_buf) { + *buf = '\0'; ereport(LOG, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("authentication file token too long, skipping: \"%s\"", buf))); /* Discard remainder of line */ while ((c = getc(fp)) != EOF && c != '\n') ; - buf[0] = '\0'; break; } This won't fix the underlying problem (where is the junk data coming from?) but it might at least let you get further in your investigation. regards, tom lane
I said: > I would suggest adding more paranoia along these lines: Actually the correct patch is as per attached. Even without a core dump, the original code would not print the token that was really causing the problem :-(. Please apply this patch and then tell us what you see with winbind. regards, tom lane Index: hba.c =================================================================== RCS file: /cvsroot/pgsql-server/src/backend/libpq/hba.c,v retrieving revision 1.116.2.1 diff -c -r1.116.2.1 hba.c *** hba.c 5 Dec 2003 15:50:39 -0000 1.116.2.1 --- hba.c 25 May 2004 19:06:54 -0000 *************** *** 124,137 **** if (buf >= end_buf) { ereport(LOG, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("authentication file token too long, skipping: \"%s\"", ! buf))); /* Discard remainder of line */ while ((c = getc(fp)) != EOF && c != '\n') ; - buf[0] = '\0'; break; } --- 124,137 ---- if (buf >= end_buf) { + *buf = '\0'; ereport(LOG, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("authentication file token too long, skipping: \"%s\"", ! start_buf))); /* Discard remainder of line */ while ((c = getc(fp)) != EOF && c != '\n') ; break; }
Hi Tom! On 2004-05-25 15:13 -0400, Tom Lane wrote: > Actually the correct patch is as per attached. Even without a core > dump, the original code would not print the token that was really > causing the problem :-(. Please apply this patch and then tell us > what you see with winbind. Thank you for this patch. I applied it and prepared an inofficial package for the original submitter to test. I hoped to get a quick answer from him, but up to now I didn't. I will report back when I have any news. Have a nice day! Martin --=20 Martin Pitt Debian GNU/Linux Developer martin@piware.de mpitt@debian.org http://www.piware.de http://www.debian.org
Hi again! On 2004-05-25 15:13 -0400, Tom Lane wrote: > I said: > > I would suggest adding more paranoia along these lines: >=20 > Actually the correct patch is as per attached. Even without a core > dump, the original code would not print the token that was really > causing the problem :-(. Please apply this patch and then tell us > what you see with winbind. The submitter of this bug eventually reported back (there was a problem with his email address) and tested the updated package (with your patch); unfortunately it seems that the patch did not improve the output very much :-( He wrote: ---------------------------------------------------- Started by /etc/init.d/postgresql: 2004-06-07 14:24:59 [6455] LOG: authentication file token too long, skippi= ng: "(garbage)" I piped the garbage through xxd and got: 986e 0241 9131 f840 I ran it again from the command line, as user postgres and got, instead: 98e3 ed40 9101 e340 These garbage strings are repeatable. In either case, I did get the seg fault. ---------------------------------------------------- Does that say anything to anybody? Thanks and have a nice day! Martin --=20 Martin Pitt Debian GNU/Linux Developer martin@piware.de mpitt@debian.org http://www.piware.de http://www.debian.org
Martin Pitt <martin@piware.de> writes: > The submitter of this bug eventually reported back (there was a > problem with his email address) and tested the updated package (with > your patch); unfortunately it seems that the patch did not improve the > output very much :-( He wrote: I didn't expect the failure to go away, only the consequent core dump. We still need to learn why winbind is returning bad data. To that end, could we *see* the "garbage" error report? I don't know that it will tell anything, but certainly I have zero to go on at the moment. regards, tom lane
Hi Tom! On 2004-06-08 1:24 -0400, Tom Lane wrote: > I didn't expect the failure to go away, only the consequent core dump. I also hoped that the missing string termination caused the segfault, but obviously it didn't (it was only a read operation after all). > We still need to learn why winbind is returning bad data. To that end, > could we *see* the "garbage" error report? I don't know that it will > tell anything, but certainly I have zero to go on at the moment. The submitter already filtered the garbage part (the actual token) through xxd, i. e. converted it to hex (I quoted this part of his message) to avoid problems with different encodings and mail filters. For your convenience I wrote these hex sequences into two files 'token1' and 'token2' and tar.gz'ed them. Thanks in advance and have a nice day! Martin P.S. No need to CC me, I'm subscribed (see Mail-Followup-To). -- Martin Pitt Debian GNU/Linux Developer martin@piware.de mpitt@debian.org http://www.piware.de http://www.debian.org
Attachment
Martin Pitt <martin@piware.de> writes: > For your convenience I wrote these hex sequences into two files > 'token1' and 'token2' and tar.gz'ed them. Mph ... no obvious pattern here. $ od -t x1 token1 0000000 98 6e 2 41 91 31 f8 40 0000010 $ od -t x1 token2 0000000 98 e3 ed 40 91 1 e3 40 0000010 Can you try again to get a debugger stack trace? Maybe with the patch there'll be a more sensible stack... regards, tom lane
Hi! On 2004-06-08 11:18 -0400, Tom Lane wrote: > Can you try again to get a debugger stack trace? Maybe with the patch > there'll be a more sensible stack... I am now able to reproduce this bug. I installed package 'winbind' and changed the hosts line in /etc/nsswitch.conf to=20 hosts: wins files dns (i. e. prepended wins). I recompiled postgresql with debugging and without stripping and tried to get a stack trace. Something really seems to mess up the stack, but running postmaster under electric fence seems to improve it (and it should also narrow down the error): ---------------- snip --------------------- postgres@donald:/usr/lib/postgresql/bin$ gdb ./postmaster GNU gdb 6.1-debian Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain condition= s. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-linux"...Using host libthread_db library "= /lib/tls/i686/cmov/libthread_db.so.1". (gdb) efence Enabled Electric Fence (gdb) set args -D /var/lib/postgres/data (gdb) r Starting program: /usr/lib/postgresql/bin/postmaster -D /var/lib/postgres/d= ata [Thread debugging using libthread_db enabled] [New Thread 1078114272 (LWP 2961)] Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens. 2004-06-08 19:27:43 [2961] LOG: konnte IPv6-Socket nicht erstellen: Die Ad= ressfamilie wird von der Protokollfamilie nicht unterst=FCtzt Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1078114272 (LWP 2961)] 0x402e675e in getc () from /lib/tls/i686/cmov/libc.so.6 (gdb) bt #0 0x402e675e in getc () from /lib/tls/i686/cmov/libc.so.6 #1 0x0814142d in next_token (fp=3D0xbfffde4c, buf=3D0xbfffde54 "", bufsz= =3D1109025003) at hba.c:102 #2 0x4217ecb5 in str_list_make () from /lib/libnss_wins.so.2 #3 0x421310bc in dyn_CACHEDIR () from /lib/libnss_wins.so.2 #4 0x42139591 in lp_load () from /lib/libnss_wins.so.2 #5 0xbfffe6f4 in ?? () #6 0x00000400 in ?? () #7 0x421c3020 in ?? () from /lib/libnss_wins.so.2 #8 0x000003ff in ?? () #9 0x00000000 in ?? () #10 0x0000b000 in ?? () #11 0x403553d9 in mprotect () from /lib/tls/i686/cmov/libc.so.6 #12 0x40019ecc in Page_DenyAccess () from /usr/lib/libefence.so.0.0 Previous frame inner to this frame (corrupt stack?) (gdb) ---------------- snip --------------------- The bufsz parameter of next_token really seems to be corrupted, but line 102 is=20 while ((c =3D getc(fp)) !=3D EOF && (pg_isblank(c) || c =3D=3D ',')= ) ; so the function already crashes while skipping the whitespace and bufsz does not yet come into real play yet (apart from determining end_buf, which is not yet used up to this point). I would like to debug this further (if you cannot reproduce this), but I grepped the whole source tree for an invocation of next_token[_expand] and found nothing. Where the heck this is called from? Looking at the stacktrace it seems to be kind of a callback from libnss_wins, but somewhere this must be set!? So who calls next_token and who sets the file, buffer and bufsz parameters? Can you make any sense of this? Thanks for any idea! Martin --=20 Martin Pitt Debian GNU/Linux Developer martin@piware.de mpitt@debian.org http://www.piware.de http://www.debian.org
Martin Pitt <martin@piware.de> writes: > (gdb) bt > #0 0x402e675e in getc () from /lib/tls/i686/cmov/libc.so.6 > #1 0x0814142d in next_token (fp=3D0xbfffde4c, buf=3D0xbfffde54 "", bufsz= > =3D1109025003) at hba.c:102 > #2 0x4217ecb5 in str_list_make () from /lib/libnss_wins.so.2 > #3 0x421310bc in dyn_CACHEDIR () from /lib/libnss_wins.so.2 > #4 0x42139591 in lp_load () from /lib/libnss_wins.so.2 > #5 0xbfffe6f4 in ?? () > #6 0x00000400 in ?? () > #7 0x421c3020 in ?? () from /lib/libnss_wins.so.2 > #8 0x000003ff in ?? () > #9 0x00000000 in ?? () > #10 0x0000b000 in ?? () > #11 0x403553d9 in mprotect () from /lib/tls/i686/cmov/libc.so.6 > #12 0x40019ecc in Page_DenyAccess () from /usr/lib/libefence.so.0.0 > Previous frame inner to this frame (corrupt stack?) > (gdb) > So who calls next_token and who sets the file, buffer and bufsz > parameters? Can you make any sense of this? Either gdb is lying to you completely (certainly possible if stack is corrupt) or something in libnss_wins is taking a wild jump that happens to end up in next_token. The only suggestion I have is to try to identify where control is headed before the stack clobber occurs. Set breakpoints and see how far it gets before crashing, narrow down the range, rinse and repeat. I'd suggest starting in and around getaddrinfo() and getnameinfo(), since that seems like the only area where winbind would introduce different behavior. regards, tom lane