Thread: BUG #2246: Bad malloc interactions: ecpg, openssl
The following bug has been logged online: Bug reference: 2246 Logged by: Andy Klosterman Email address: andrew5@ece.cmu.edu PostgreSQL version: 8.1.0 Operating system: Debian testing: Linux nc3 2.4.27-2-386 #1 Wed Nov 30 21:38:51 JST 2005 i686 GNU/Linux Description: Bad malloc interactions: ecpg, openssl Details: Before going into a full description and figuring out some example code for this situation, I'm fishing for interest in tracking it down and fixing it (or not). On a program that I (pre-)compile with ecpg and connect to a remote Postgres instance over an SSL connection (as set up in pg_hba.conf with appropriate certificates installed) my application prematurely terminates with the following error: *** glibc detected *** corrupted double-linked list: 0x0807c830 *** Abort. (Without an SSL connection (as set in pg_hba.conf) the program executes just fine. This leads me to cast suspicion on SSL libraries.) The back trace from gdb looks like this (which doesn't appear to be too informative, but looks like an exception stack): #0 0x401bc851 in kill () from /lib/libc.so.6 #1 0x4014a309 in pthread_kill () from /lib/libpthread.so.0 #2 0x4014a6c0 in raise () from /lib/libpthread.so.0 #3 0x401bc606 in raise () from /lib/libc.so.6 #4 0x401bd971 in abort () from /lib/libc.so.6 #5 0x401ef930 in __fsetlocking () from /lib/libc.so.6 #6 0x401f52b9 in malloc_usable_size () from /lib/libc.so.6 #7 0x401f5395 in malloc_usable_size () from /lib/libc.so.6 #8 0x401f5a43 in malloc_trim () from /lib/libc.so.6 #9 0x401f5d51 in free () from /lib/libc.so.6 #10 0x4052ce6c in zcfree () from /usr/lib/libz.so.1 #11 0x4052f83f in inflateEnd () from /usr/lib/libz.so.1 #12 0x4040f262 in COMP_rle () from /usr/lib/i686/cmov/libcrypto.so.0.9.8 #13 0x0807e680 in ?? () #14 0x00000000 in ?? () After a bit of digging around online, I discovered the MALLOC_CHECK_ environment variable and how it changes the behavior of malloc (man 3 malloc). 
The above back trace was without MALLOC_CHECK_ in the environment (e.g., unsetenv MALLOC_CHECK_). Running with MALLOC_CHECK_ equal to 2 or 1 allows my program to run to completion. With MALLOC_CHECK_ set to 0 (which is supposed to ignore corruption), I get a segfault. Running inside gdb gets me the following back trace: #0 0x403d6f73 in ASN1_template_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8 #1 0x403d6e0d in ASN1_primitive_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8 #2 0x403d7023 in ASN1_item_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8 #3 0x403d0c07 in X509_CERT_AUX_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8 #4 0x403d077a in X509_CINF_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8 #5 0x403d6e35 in ASN1_primitive_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8 #6 0x403d7023 in ASN1_item_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8 #7 0x403d0927 in X509_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8 #8 0x402d16f3 in pqsecure_destroy () from /usr/lib/libpq.so.4 #9 0x402c387a in PQconninfoFree () from /usr/lib/libpq.so.4 #10 0x402c39c3 in PQfinish () from /usr/lib/libpq.so.4 #11 0x4002f41b in ECPGget_connection () from /usr/lib/libecpg.so.5 #12 0x40030223 in ECPGdisconnect () from /usr/lib/libecpg.so.5 #13 0x0804a113 in DBDisconnect (arg_connection=0x8054faf "client_correctness") at client_test.pgcc:215 #14 0x0804a64e in DoCorrectnessChecks () at client_test.pgcc:278 #15 0x0804aaa1 in main (argc=7, argv=0xbffffa84) at client_test.pgcc:523 PURE SPECULATION: It looks like there is either trouble in the interaction between Postgres and the SSL library or just a bit of trouble within the SSL library. SPECULATION: Another possibility is that I misunderstand some aspect of multi-threaded interactions with Postgres (I open uniquely named connections to the DB for each thread of my test program). 
Maybe I need to have a "lock" around the code that makes DB connections and make sure that only one happens at a time (might be better handled within Postgres/SSL if that is the case). PROCEEDING FURTHER: If there is any desire on the part of any developers to pursue this further, I'm open. As things stand right now, I have workarounds: 1. Don't use an SSL connection to the DB. 2. Do a "setenv MALLOC_CHECK_ 1" (or 2) and it works.
Andy Klosterman wrote: > Before going into a full description and figuring out some example code for > this situation, I'm fishing for interest in tracking it down and fixing > it (or not). Whenever there is a bug that causes a crash, there is interest in tracking it down and fixing it. Please do provide a test case. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
"Andy Klosterman" <andrew5@ece.cmu.edu> writes: > SPECULATION: Another possibility is that I misunderstand some aspect of > multi-threaded interactions with Postgres (I open uniquely named connections > to the DB for each thread of my test program). Maybe I need to have a > "lock" around the code that makes DB connections and make sure that only one > happens at a time (might be better handled within Postgres/SSL if that is > the case). There could be some re-entrancy problem in the SSL connection startup code --- if you add such a lock, does it get more reliable? Also, did you remember to build PG with --enable-thread-safety ? regards, tom lane
Andrew Klosterman <andrew5@ece.cmu.edu> writes: > I threw in a pthread mutex around the code making the database connections > for each of my threads. The problem is still there ("corrupted > double-linked list"). > Even tuning things down and instructing my code to only run a single > pthread manifests the problem over an SSL connection. Hmm. Based on that, the problem is starting to smell more like a garden-variety memory clobber, for instance malloc'ing a chunk smaller than the data that's later stuffed into it. It might be worth running the program under something like ElectricFence, which will catch the offender on-the-spot rather than only later when corruption of malloc's private data structures is detected. Looking back at your original message, I wonder if it could be the combination of ecpg and SSL that triggers it? I'd have thought that libpq/SSL alone would be pretty well wrung out, but ecpg is not so widely used. BTW, you did say this was i386 right? If it were a 64-bit architecture, I'd be about ready to bet money on the wrong-malloc-size-calculation theory. > Tracking down exactly what's tickling the problem in this case could be > tricky... Yeah :-(. If you aren't able to narrow it further by yourself, please try to put together a self-contained test case. regards, tom lane
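[Editor's illustration] The "wrong-malloc-size-calculation" theory mentioned above refers to a familiar class of bug: the size passed to malloc is computed from the wrong type or misses a terminator, so later writes run past the end of the chunk and damage malloc's bookkeeping. The sketch below is not taken from libpq or ecpg; the function names are hypothetical and only show the sizing idiom that avoids the problem on both 32-bit and 64-bit systems.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate a table of n string pointers, each initialised to NULL.
 * Sizing with "sizeof *table" follows the pointer's real element type.
 * Writing n * sizeof(int) instead would happen to work where pointers
 * are 4 bytes (i386) and silently under-allocate where they are 8. */
char **alloc_string_table(size_t n)
{
    char **table = calloc(n, sizeof *table);
    size_t i;

    if (table == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        table[i] = NULL;
    return table;
}

/* Duplicate a string. The "+ 1" reserves room for the terminating '\0';
 * forgetting it is the classic one-byte heap overrun. */
char *copy_string(const char *src)
{
    size_t len = strlen(src) + 1;
    char *dst = malloc(len);

    if (dst != NULL)
        memcpy(dst, src, len);
    return dst;
}
```

An overrun of this kind is usually reported far from where it happens, which is why a tool such as ElectricFence that faults at the offending write is suggested in the message above.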
Andrew Klosterman <andrew5@ece.cmu.edu> writes: > (gdb) bt > #0 0x401c3851 in kill () from /lib/libc.so.6 > #1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0 > #2 0x40139823 in memalign () from /usr/lib/libefence.so.0 > #3 0x401399ad in malloc () from /usr/lib/libefence.so.0 > #4 0x40139a10 in calloc () from /usr/lib/libefence.so.0 > #5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3 > #6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4 > #7 0x402ded88 in ?? () from /usr/lib/libpq.so.4 > #8 0x00000000 in ?? () Any chance of doing this with debug symbols? libpq does not call krb5_set_default_tgs_ktypes directly, so I don't think I believe the above backtrace. gdb is easily misled without debug symbols :-( I'm not sure if Debian does things the way Red Hat does, but on RH there are separate "debuginfo" RPMs corresponding to each regular RPM --- if you install the ones matching your libpq and libkrb5 RPMs you should be able to get better info. regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Andrew Klosterman <andrew5@ece.cmu.edu> writes: > > (gdb) bt > > #0 0x401c3851 in kill () from /lib/libc.so.6 > > #1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0 > > #2 0x40139823 in memalign () from /usr/lib/libefence.so.0 > > #3 0x401399ad in malloc () from /usr/lib/libefence.so.0 > > #4 0x40139a10 in calloc () from /usr/lib/libefence.so.0 > > #5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3 > > #6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4 > > #7 0x402ded88 in ?? () from /usr/lib/libpq.so.4 > > #8 0x00000000 in ?? () > > Any chance of doing this with debug symbols? libpq does not call > krb5_set_default_tgs_ktypes directly, so I don't think I believe the > above backtrace. gdb is easily misled without debug symbols :-( Hrmpf, I missed this bug-on-Debian report. I'll go check the archive for the rest. > I'm not sure if Debian does things the way Red Hat does, but on RH > there are separate "debuginfo" RPMs corresponding to each regular > RPM --- if you install the ones matching your libpq and libkrb5 > RPMs you should be able to get better info. We do have debugging .debs- for some things. We don't have them for everything and unfortunately we don't yet have them for Postgres. I'll talk to Martin about building some though so that in the future it's easier to debug these problems. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > We do have debugging .debs- for some things. We don't have them for > everything and unfortunately we don't yet have them for Postgres. I'll > talk to Martin about building some though so that in the future it's > easier to debug these problems. Hmm. Andrew, it seems your choices are to rebuild the relevant libraries from source, or to concentrate on developing a test case that other people can try. regards, tom lane
On Mon, 13 Feb 2006, Tom Lane wrote: > Andrew Klosterman <andrew5@ece.cmu.edu> writes: > > I threw in a pthread mutex around the code making the database connections > > for each of my threads. The problem is still there ("corrupted > > double-linked list"). > > > Even tuning things down and instructing my code to only run a single > > pthread manifests the problem over an SSL connection. > > Hmm. Based on that, the problem is starting to smell more like a > garden-variety memory clobber, for instance malloc'ing a chunk smaller > than the data that's later stuffed into it. It might be worth running > the program under something like ElectricFence, which will catch the > offender on-the-spot rather than only later when corruption of malloc's > private data structures is detected. > > Looking back at your original message, I wonder if it could be the > combination of ecpg and SSL that triggers it? I'd have thought that > libpq/SSL alone would be pretty well wrung out, but ecpg is not so > widely used. > > BTW, you did say this was i386 right? If it were a 64-bit architecture, > I'd be about ready to bet money on the wrong-malloc-size-calculation > theory. > > > Tracking down exactly what's tickling the problem in this case could be > > tricky... > > Yeah :-(. If you aren't able to narrow it further by yourself, please > try to put together a self-contained test case. > > regards, tom lane I just did the "electric fence" thing for you and this is what I get in gdb... Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens. ElectricFence Aborting: Allocating 0 bytes, probably a bug. Program received signal SIGILL, Illegal instruction. 
[Switching to Thread 16384 (LWP 24753)] 0x401c3851 in kill () from /lib/libc.so.6 (gdb) bt #0 0x401c3851 in kill () from /lib/libc.so.6 #1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0 #2 0x40139823 in memalign () from /usr/lib/libefence.so.0 #3 0x401399ad in malloc () from /usr/lib/libefence.so.0 #4 0x40139a10 in calloc () from /usr/lib/libefence.so.0 #5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3 #6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4 #7 0x402ded88 in ?? () from /usr/lib/libpq.so.4 #8 0x00000000 in ?? () Looks like something fishy going on between libpq and libkrb5. I'm especially suspicious since I'm not using kerberos for authentication at all. I am developing on i386 (more or less). # uname -m i686 --Andrew J. Klosterman andrew5@ece.cmu.edu
On Mon, 13 Feb 2006, Tom Lane wrote: > Andrew Klosterman <andrew5@ece.cmu.edu> writes: > > (gdb) bt > > #0 0x401c3851 in kill () from /lib/libc.so.6 > > #1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0 > > #2 0x40139823 in memalign () from /usr/lib/libefence.so.0 > > #3 0x401399ad in malloc () from /usr/lib/libefence.so.0 > > #4 0x40139a10 in calloc () from /usr/lib/libefence.so.0 > > #5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3 > > #6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4 > > #7 0x402ded88 in ?? () from /usr/lib/libpq.so.4 > > #8 0x00000000 in ?? () > > Any chance of doing this with debug symbols? libpq does not call > krb5_set_default_tgs_ktypes directly, so I don't think I believe the > above backtrace. gdb is easily misled without debug symbols :-( > > I'm not sure if Debian does things the way Red Hat does, but on RH > there are separate "debuginfo" RPMs corresponding to each regular > RPM --- if you install the ones matching your libpq and libkrb5 > RPMs you should be able to get better info. > > regards, tom lane I thought about that and did some quick checks of how to get debug symbols in libraries on Debian. I didn't come up with anything right away. I'll poke around and see what I can come up with. --Andrew J. Klosterman andrew5@ece.cmu.edu
On Wed, 8 Feb 2006, Tom Lane wrote: > "Andy Klosterman" <andrew5@ece.cmu.edu> writes: > > SPECULATION: Another possibility is that I misunderstand some aspect of > > multi-threaded interactions with Postgres (I open uniquely named connections > > to the DB for each thread of my test program). Maybe I need to have a > > "lock" around the code that makes DB connections and make sure that only one > > happens at a time (might be better handled within Postgres/SSL if that is > > the case). > > There could be some re-entrancy problem in the SSL connection startup > code --- if you add such a lock, does it get more reliable? Also, did > you remember to build PG with --enable-thread-safety ? > > regards, tom lane (I'm back after a bit of an illness. Much better now!) I threw in a pthread mutex around the code making the database connections for each of my threads. The problem is still there ("corrupted double-linked list"). Even tuning things down and instructing my code to only run a single pthread manifests the problem over an SSL connection. Everything is just fine without SSL. Other code I've written works just fine with (and without) threads connecting to the database with (and without) SSL. Tracking down exactly what's tickling the problem in this case could be tricky... I'm using the pre-built debian testing packages, not self-compiled code, for my postgres installation. From the information I can gather from the debian build logs (http://buildd.debian.org/build.php), everything was configured and built with threads enabled. --Andrew J. Klosterman andrew5@ece.cmu.edu
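[Editor's illustration] The pthread mutex described in the message above, wrapped around the code that opens each thread's connection, could be sketched roughly as follows. The actual test program is not shown in the thread, so everything here is an assumption: `connect_fn`, `connect_serialized` and `fake_connect` are hypothetical names, and the callback stands in for whatever routine contains the EXEC SQL CONNECT statement.

```c
#include <pthread.h>

/* Hypothetical callback type: wraps the routine that opens one database
 * connection. It returns 0 on success and non-zero on failure. */
typedef int (*connect_fn)(void *arg);

/* One process-wide lock shared by every thread that connects. */
static pthread_mutex_t connect_lock = PTHREAD_MUTEX_INITIALIZER;

/* Run do_connect while holding the lock, so that at most one thread is
 * inside connection startup at any given time. */
int connect_serialized(connect_fn do_connect, void *arg)
{
    int rc;

    if (pthread_mutex_lock(&connect_lock) != 0)
        return -1;
    rc = do_connect(arg);
    pthread_mutex_unlock(&connect_lock);
    return rc;
}

/* Stand-in for a real connect routine: it only counts how often it ran. */
int fake_connect(void *arg)
{
    int *calls = arg;

    (*calls)++;
    return 0;
}
```

As reported above, serialising the connects this way did not make the corruption go away, and the failure also shows up with a single thread, which points away from a re-entrancy problem.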
* Andrew Klosterman (andrew5@ece.cmu.edu) wrote: > (gdb) bt > #0 0x401c3851 in kill () from /lib/libc.so.6 > #1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0 > #2 0x40139823 in memalign () from /usr/lib/libefence.so.0 > #3 0x401399ad in malloc () from /usr/lib/libefence.so.0 > #4 0x40139a10 in calloc () from /usr/lib/libefence.so.0 > #5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3 > #6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4 > #7 0x402ded88 in ?? () from /usr/lib/libpq.so.4 > #8 0x00000000 in ?? () > > Looks like something fishy going on between libpq and libkrb5. I'm > especially suspicious since I'm not using kerberos for authentication at > all. Seems kind of unlikely... What exact (.deb) versions of libpq and Postgres are you using? You originally posted w/ 8.1.0 but perhaps on the client you had something more recent? Thanks, Stephen
* Andrew Klosterman (andrew5@ece.cmu.edu) wrote: > > Seems kind of unlikely... What exact (.deb) versions of libpq and > > Postgres are you using? You originally posted w/ 8.1.0 but perhaps on > > the client you had something more recent? > > Running "aptitude show X" where "X" is the package name, and applying > appropriate filtering gives the following results on my development > systems: > > Package: libpq-dev > Version: 8.1.0-3 > > Package: libpq3 > Version: 1:7.4.9-2 > > Package: libpq4 > Version: 8.1.0-3 > > Package: postgresql-8.1 > Version: 8.1.0-3 > > Package: postgresql-contrib-8.1 > Version: 8.1.0-3 > > Package: postgresql-server-dev-8.1 > Version: 8.1.0-3 > > Package: postgresql-client-8.1 > Version: 8.1.0-3 > > Package: postgresql-common > Version: 39 Hmm, alright, well, this is at least not the fault of the patch of mine which was included in Debian's 8.1.2-2 Postgres release. :) You might try compiling some debs with debugging enabled. This is (reasonably) straight-forward: (as root:) aptitude install build-essential debhelper cdbs bison perl libperl-dev \ tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \ libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \ gettext bzip2 fakeroot (as user:) apt-get source postgresql-8.1 cd postgresql-8.1-8.1.0 export DEB_BUILD_OPTIONS="nostrip" dpkg-buildpackage -uc -us -rfakeroot Should produce .debs in the parent directory which have debugging information. Another useful build option is "noopt", ie: export DEB_BUILD_OPTIONS="nostrip noopt", though that could make the error go disappear. It'd be terribly nice if you could do this and provide a gdb backtrace with debugging... :) Thanks, Stephen
On Mon, 13 Feb 2006, Stephen Frost wrote: > * Andrew Klosterman (andrew5@ece.cmu.edu) wrote: > > (gdb) bt > > #0 0x401c3851 in kill () from /lib/libc.so.6 > > #1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0 > > #2 0x40139823 in memalign () from /usr/lib/libefence.so.0 > > #3 0x401399ad in malloc () from /usr/lib/libefence.so.0 > > #4 0x40139a10 in calloc () from /usr/lib/libefence.so.0 > > #5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3 > > #6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4 > > #7 0x402ded88 in ?? () from /usr/lib/libpq.so.4 > > #8 0x00000000 in ?? () > > > > Looks like something fishy going on between libpq and libkrb5. I'm > > especially suspicious since I'm not using kerberos for authentication at > > all. > > Seems kind of unlikely... What exact (.deb) versions of libpq and > Postgres are you using? You originally posted w/ 8.1.0 but perhaps on > the client you had something more recent? > > Thanks, > > Stephen Running "aptitude show X" where "X" is the package name, and applying appropriate filtering gives the following results on my development systems: Package: libpq-dev Version: 8.1.0-3 Package: libpq3 Version: 1:7.4.9-2 Package: libpq4 Version: 8.1.0-3 Package: postgresql-8.1 Version: 8.1.0-3 Package: postgresql-contrib-8.1 Version: 8.1.0-3 Package: postgresql-server-dev-8.1 Version: 8.1.0-3 Package: postgresql-client-8.1 Version: 8.1.0-3 Package: postgresql-common Version: 39 (I frequently update and upgrade my installations...) --Andrew J. Klosterman andrew5@ece.cmu.edu
--On Monday, February 13, 2006 21:25:30 -0500 Stephen Frost <sfrost@snowman.net> wrote: > * Andrew Klosterman (andrew5@ece.cmu.edu) wrote: >> > Seems kind of unlikely... What exact (.deb) versions of libpq and >> > Postgres are you using? You originally posted w/ 8.1.0 but perhaps on >> > the client you had something more recent? > aptitude install build-essential debhelper cdbs bison perl libperl-dev \ > tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \ > libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \ > gettext bzip2 fakeroot You might want to add valgrind to this list. It analyzes code on assembler basis and does a lot of memory checking / undefined variables checking while the program runs. Fixed all SIGSEGV I ever encountered which were not infinite recursions. Kind regards, Jens Schicke -- Jens Schicke j.schicke@asco.de asco GmbH http://www.asco.de Mittelweg 7 Tel 0531/3906-127 38106 Braunschweig Fax 0531/3906-400
On Feb 13 04:01, Andrew Klosterman wrote: > I threw in a pthread mutex around the code making the database connections > for each of my threads. The problem is still there ("corrupted > double-linked list"). > ... > Program received signal SIGILL, Illegal instruction. > [Switching to Thread 16384 (LWP 24753)] > 0x401c3851 in kill () from /lib/libc.so.6 > (gdb) bt > #0 0x401c3851 in kill () from /lib/libc.so.6 > #1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0 > #2 0x40139823 in memalign () from /usr/lib/libefence.so.0 > #3 0x401399ad in malloc () from /usr/lib/libefence.so.0 > #4 0x40139a10 in calloc () from /usr/lib/libefence.so.0 > #5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3 > #6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4 > #7 0x402ded88 in ?? () from /usr/lib/libpq.so.4 > #8 0x00000000 in ?? () I met with some other thread-safety issues caused by libc used in Debian repos. For instance, getpwuid_r() is broken in Debian's current stable libc package and this causes a similar memory leak in the libpq code. IMHO, testing code with a newer libc version can be the solution. Otherwise, for an exact answer - as Tom said - we need libpq symbols in the backtrace. Regards.
* Andrew Klosterman (andrew5@ece.cmu.edu) wrote: > Alright, I have built a system with the symbols left into the binaries. [...] > Again, it is showing a bad malloc in what appears to be some code using > kerberos. But there's nothing in my setup that I can think of right now > that should induce a connection to be set up using kerberos. The Kerberos libraries are still called when support for them has been compiled in. They generally don't cause any problems though. For some reason the line numbers in the backtrace line up but the function names don't quite (perhaps inlining). Anyhow, the error is being reported down in 'krb5_init_context()' so either something strange is happening or it's actually a Kerberos bug after all. The reason the Kerberos libraries are called is to get the 'username' to use, which is determined prior to actually connecting to the backend (and finding out what authentication mechanism the backend thinks we should be trying). It's kind of a chicken-and-egg here because the backend decides what authentication mechanism to ask for based off the username (at least in part) through pg_hba.conf, so you can't find out the authentication method until you know the username so all methods to find the username have to be exhausted. You could avoid this by explicitly passing 'user=' into the connection parameters though... Would be interesting to know what happens then... Might also be interesting to look into the Kerberos libraries to see why they're attempting to malloc(0), perhaps there's a bug there when Kerberos isn't set up on the machine? Thanks, Stephen
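[Editor's illustration] For a program that talks to libpq directly, the suggestion above amounts to putting an explicit user= entry in the conninfo string. A minimal sketch of building such a string follows. `build_conninfo` and `is_plain_value` are hypothetical helpers, not part of libpq, and the sketch assumes values that need no quoting (no spaces, single quotes or backslashes); anything else is rejected rather than escaped.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if v is non-empty and can appear in a conninfo string
 * unquoted. Values with spaces, quotes or backslashes would need
 * libpq's quoting rules, which this sketch does not implement. */
static int is_plain_value(const char *v)
{
    return v != NULL && v[0] != '\0' && strpbrk(v, " '\\") == NULL;
}

/* Write "host=... dbname=... user=..." into buf.
 * Returns 0 on success, -1 on a bad argument or a too-small buffer. */
int build_conninfo(char *buf, size_t buflen,
                   const char *host, const char *dbname, const char *user)
{
    int n;

    if (buf == NULL || !is_plain_value(host) ||
        !is_plain_value(dbname) || !is_plain_value(user))
        return -1;
    n = snprintf(buf, buflen, "host=%s dbname=%s user=%s",
                 host, dbname, user);
    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

The resulting string would be handed to PQconnectdb(). Whether an ecpg program can take this route is a separate question, since ecpg makes the libpq call itself.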
* Andrew Klosterman (andrew5@ece.cmu.edu) wrote: > On Tue, 14 Feb 2006, Stephen Frost wrote: > <snip> > > It's kind of a chicken-and-egg here because the backend decides what > > authentication mechanism to ask for based off the username (at least in > > part) through pg_hba.conf, so you can't find out the authentication > > method until you know the username so all methods to find the username > > have to be exhausted. You could avoid this by explicitly passing > > 'user=' into the connection parameters though... Would be interesting > > to know what happens then... > > When asking about "explicitly passing 'user=' in to the connection > parameters" do you mean that the EXEC SQL CONNECT line that ecpg parses > should specify a username? Oh, I see now. You're not using PQconnectdb but rather PQsetdbLogin, or at least, that's what ECPG is using. This ends up meaning that you can't pass in your own conninfo string and during the PQsetdbLogin call, libpq calls connectOptions1 with an empty conninfo string, which makes libpq think there's no set username which in turn makes it ask the Kerberos libraries for a username... As an initial comment, it seems like it'd be a good thing to update ECPG to use PQconnectdb. It's possible this is exposed already in some way but I'm not familiar enough with ECPG to know. Another approach would be to have PQsetdbLogin build up a conninfo string and pass that into connectOptions1 instead of calling connectOptions1 with an empty string and then changing the values afterwards. That'd probably be too large of a change to get in as a bugfix though. An alternative might be to move the pg_fe_getauthname() call to connectOptions2 as it's actually a bit more work than one might expect and if that can be avoided then that's probably all to the good. I'm a little worried about if that would work for all the various ways to use libpq to connect to the database... Sorry I don't have a simple answer. 
:/ In the end it seems like the Kerberos libraries should be able to survive Kerberos not being configured or whatever is going on to make it try to malloc 0-bytes... Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > Another approach would be to have PQsetdbLogin build up a conninfo > string and pass that into connectOptions1 instead of calling > connectOptions1 with an empty string and then changing the values > afterwards. That'd probably be too large of a change to get in as a > bugfix though. An alternative might be to move the pg_fe_getauthname() > call to connectOptions2 as it's actually a bit more work than one might > expect and if that can be avoided then that's probably all to the good. Right offhand I like the idea of pushing it into connectOptions2 --- can you experiment with that? Seems like there is no reason to call Kerberos if the user supplies the name to connect as. > Sorry I don't have a simple answer. :/ In the end it seems like the > Kerberos libraries should be able to survive Kerberos not being > configured or whatever is going on to make it try to malloc 0-bytes... We may be spending too much time on this one point --- as long as Kerberos isn't *writing* into the zero-length alloc, there is nothing illegal immoral or fattening about malloc(0). Can you get ElectricFence to not abort right here but continue on to the real problem? regards, tom lane
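[Editor's illustration] The point above about malloc(0) can be made concrete. In standard C a zero-byte request may return either NULL or a unique pointer; both results are legal, the pointer must never be dereferenced, and it may be passed to free(). A caller therefore has to avoid treating NULL as an error when it asked for zero bytes. The helper name below is hypothetical and the sketch is not taken from the Kerberos or libpq sources.

```c
#include <stdlib.h>

/* Allocate n bytes and report through *failed whether the request
 * really failed. A NULL result for n == 0 is not a failure: malloc(0)
 * is allowed to return NULL or a unique, freeable pointer. */
void *alloc_or_null(size_t n, int *failed)
{
    void *p = malloc(n);

    *failed = (p == NULL && n != 0);
    return p;
}
```

A caller can free() the result in either case, because free(NULL) is defined to do nothing. ElectricFence's "Allocating 0 bytes, probably a bug" abort is a heuristic of that tool, not a statement that the program broke the C rules.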
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > Another approach would be to have PQsetdbLogin build up a conninfo > > string and pass that into connectOptions1 instead of calling > > connectOptions1 with an empty string and then changing the values > > afterwards. That'd probably be too large of a change to get in as a > > bugfix though. An alternative might be to move the pg_fe_getauthname() > > call to connectOptions2 as it's actually a bit more work than one might > > expect and if that can be avoided then that's probably all to the good. > > Right offhand I like the idea of pushing it into connectOptions2 --- can > you experiment with that? Seems like there is no reason to call > Kerberos if the user supplies the name to connect as. Sure thing, I'll take a look at this probably tomorrow night or Thursday evening. > > Sorry I don't have a simple answer. :/ In the end it seems like the > > Kerberos libraries should be able to survive Kerberos not being > > configured or whatever is going on to make it try to malloc 0-bytes... > > We may be spending too much time on this one point --- as long as > Kerberos isn't *writing* into the zero-length alloc, there is nothing > illegal immoral or fattening about malloc(0). Can you get ElectricFence > to not abort right here but continue on to the real problem? Good point. Stephen
Andrew Klosterman <andrew5@ece.cmu.edu> writes: > (gdb) print *conn > ... > allow_ssl_try = 1 '\001', wait_ssl_try = 0 '\0', ssl = 0x806d1d0, > peer = 0x807e430, > ... > *** glibc detected *** corrupted double-linked list: 0x0807e428 *** Hm, it looks like the problem is associated with whatever was allocated just before conn->peer (which is returned by SSL_get_peer_certificate called from open_client_SSL). Can you get efence or some other tool to produce a trace of malloc calls so we can determine what that is? regards, tom lane
On Tue, 14 Feb 2006, Jens-Wolfhard Schicke wrote: > --On Monday, February 13, 2006 21:25:30 -0500 Stephen Frost > <sfrost@snowman.net> wrote: > > > * Andrew Klosterman (andrew5@ece.cmu.edu) wrote: > >> > Seems kind of unlikely... What exact (.deb) versions of libpq and > >> > Postgres are you using? You originally posted w/ 8.1.0 but perhaps on > >> > the client you had something more recent? > > aptitude install build-essential debhelper cdbs bison perl libperl-dev \ > > tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \ > > libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \ > > gettext bzip2 fakeroot > You might want to add valgrind to this list. It analyzes code on assembler > basis and does a lot of memory checking / undefined variables checking > while the program runs. Fixed all SIGSEGV I ever encountered which were not > infinite recursions. > > Kind regards, > Jens Schicke I tried valgrind this morning. It detected problems in the depths of the code behind ECPGconnect() down through SSL_read() and inflate(). Also, there was trouble reported behind ECPGconnect() -> PQsetdbLogin() -> pqGetpwuid() -> getpwuid_r() -> _dl_open() -> into the depths of /lib/ld-2.3.5.so. Valgrind got so upset at the number of errors it found that it gave up. Nothing bad seemed to show up in the code that I wrote. But, while running under valgrind, the original program that manifests the error condition runs just fine and to completion (maybe the errors are just ignored in valgrind's replacement version of malloc as they are with the MALLOC_CHECK_ environment variable set). I'm moving on to try building the binaries without removing the symbols. Hopefully that will give more useful information... --Andrew J. Klosterman andrew5@ece.cmu.edu
On Tue, 14 Feb 2006, Andrew Klosterman wrote: > > We may be spending too much time on this one point --- as long as > > Kerberos isn't *writing* into the zero-length alloc, there is nothing > > illegal immoral or fattening about malloc(0). Can you get ElectricFence > > to not abort right here but continue on to the real problem? > > > > regards, tom lane > > Doing a "man efence" lets me know that setting the EF_ALLOW_MALLOC_0 > environment variable ought to let the program continue... I'll check into > that right now! > > > --Andrew J. Klosterman > andrew5@ece.cmu.edu Well, when ElectricFence is allowed to ignore malloc() of zero bytes, my program runs like a champ! Might be associated with the replacement malloc() that it installs to check for bugs, though. (back to digging some more...) --Andrew J. Klosterman andrew5@ece.cmu.edu
On Tue, 14 Feb 2006, Andrew Klosterman wrote:

> On Mon, 13 Feb 2006, Stephen Frost wrote:
>
> > Hmm, alright, well, this is at least not the fault of the patch of mine
> > which was included in Debian's 8.1.2-2 Postgres release. :) You might
> > try compiling some debs with debugging enabled. This is (reasonably)
> > straight-forward:
> >
> > (as root:)
> > aptitude install build-essential debhelper cdbs bison perl libperl-dev \
> >   tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
> >   libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
> >   gettext bzip2 fakeroot
> > (as user:)
> > apt-get source postgresql-8.1
> > cd postgresql-8.1-8.1.0
> > export DEB_BUILD_OPTIONS="nostrip"
> > dpkg-buildpackage -uc -us -rfakeroot
> >
> > Should produce .debs in the parent directory which have debugging
> > information. Another useful build option is "noopt", ie:
> > export DEB_BUILD_OPTIONS="nostrip noopt", though that could make the
> > error go disappear. It'd be terribly nice if you could do this and
> > provide a gdb backtrace with debugging... :)
> >
> > Thanks,
> >
> > Stephen
>
> Alright, I have built a system with the symbols left in the binaries.
>
> It still crashes with the "corrupted double-linked list" error.
>
> Running with ElectricFence the backtrace I get is:
>
> Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.
>
> ElectricFence Aborting: Allocating 0 bytes, probably a bug.
>
> Program received signal SIGILL, Illegal instruction.
> [Switching to Thread 16384 (LWP 1895)]
> 0x401c4851 in kill () from /lib/libc.so.6
> (gdb) bt
> #0  0x401c4851 in kill () from /lib/libc.so.6
> #1  0x40037dd5 in EF_Abort () from /usr/lib/libefence.so.0
> #2  0x40037823 in memalign () from /usr/lib/libefence.so.0
> #3  0x400379ad in malloc () from /usr/lib/libefence.so.0
> #4  0x40037a10 in calloc () from /usr/lib/libefence.so.0
> #5  0x404a282f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
> #6  0x402c9b26 in pg_krb5_init (PQerrormsg=0x0) at fe-auth.c:119
> #7  0x402ca304 in pg_fe_getauthname (PQerrormsg=0xbffff29c "l\031")
>     at fe-auth.c:176
> #8  0x402cc861 in conninfo_parse (conninfo=<value optimized out>,
>     errorMessage=0x4057afe8) at fe-connect.c:2719
> #9  0x402cc983 in connectOptions1 (conn=0x4057acdc, conninfo=0x0)
>     at fe-connect.c:362
> #10 0x402cda11 in PQsetdbLogin (pghost=0x40574ffc "nc3", pgport=0x0,
>     pgoptions=0x0, pgtty=0x0, dbName=0x40576ff8 "andrew5",
>     login=0xbffffc31 "andrew5", pwd=0xbffffc3c "testbed") at fe-connect.c:568
> #11 0x40030fe7 in ECPGconnect (lineno=191, c=0, name=0xbffffc22 "andrew5@nc3",
>     user=0xbffffc31 "andrew5", passwd=0x0,
>     connection_name=0xbffff8b0 "CorrectnessCheck", autocommit=0)
>     at connect.c:452
> #12 0x08049ecb in DBConnect (arg_connection=0xbffff964 "CorrectnessCheck")
>     at client_test.pgcc:191
> #13 0x0804a14f in DoCorrectnessChecks () at client_test.pgcc:231
> #14 0x0804aa08 in main (argc=9, argv=0xbffffa74) at client_test.pgcc:526
>
> Again, it is showing a bad malloc in what appears to be some code using
> kerberos. But there's nothing in my setup that I can think of right now
> that should induce a connection to be set up using kerberos.
>
> --Andrew J. Klosterman
> andrew5@ece.cmu.edu

With the debug binaries, I was able to step through the program and get to
what appears to be the function where it bails: line 1166 of
postgresql-8.1.0/src/interfaces/libpq/fe-secure.c where SSL_free() is
called.

Included below is a copy&paste of my GDB session. Within the function
that calls SSL_free(), being close_SSL(PGconn *conn), I inserted a
breakpoint. The value of *conn is printed out, which will hopefully
assist in any debugging...

(gdb) break fe-secure.c:1162
No source file named fe-secure.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (fe-secure.c:1162) pending.
(gdb) set args -t andrew5@nc3 -u andrew5 -p testbed -i 10
(gdb) run
Starting program: /.amd/flush/home/andrew5/projects/CVS-controlled/users/andrew5/thesis/code/database/metadata_server/test/client_test -t andrew5@nc3 -u andrew5 -p testbed -i 10
[Thread debugging using libthread_db enabled]
[New Thread 16384 (LWP 2103)]
Breakpoint 2 at 0x402d4bc0: file fe-secure.c, line 1162.
Pending breakpoint "fe-secure.c:1162" resolved
[Switching to Thread 16384 (LWP 2103)]

Breakpoint 2, close_SSL (conn=0x8059d00) at fe-secure.c:1162
1162    {
Current language: auto; currently c
(gdb) bt
#0  close_SSL (conn=0x8059d00) at fe-secure.c:1162
#1  0x402c6938 in closePGconn (conn=0x8059d00) at fe-connect.c:1976
#2  0x402c6a55 in PQfinish (conn=0x8059d00) at fe-connect.c:2021
#3  0x400308f9 in ecpg_finish (act=0x8059ca8) at connect.c:122
#4  0x40031707 in ECPGdisconnect (lineno=134585600,
    connection_name=0xbffff8a8 "CorrectnessCheck") at connect.c:540
#5  0x0804a036 in DBDisconnect (arg_connection=0xbffff954 "CorrectnessCheck")
    at client_test.pgcc:218
#6  0x0804a58a in DoCorrectnessChecks () at client_test.pgcc:282
#7  0x0804a9f8 in main (argc=9, argv=0xbffffa64) at client_test.pgcc:528
(gdb) list
1157    /*
1158     * Close SSL connection.
1159     */
1160    static void
1161    close_SSL(PGconn *conn)
1162    {
1163            if (conn->ssl)
1164            {
1165                    SSL_shutdown(conn->ssl);
1166                    SSL_free(conn->ssl);
(gdb) print *conn
$1 = {pghost = 0x80634c0 "nc3", pghostaddr = 0x0, pgport = 0x80634d0 "5432",
  pgunixsocket = 0x0, pgtty = 0x80634e0 "", connect_timeout = 0x0,
  pgoptions = 0x80634f0 "", dbName = 0x80634b0 "andrew5",
  pguser = 0x8063500 "andrew5", pgpass = 0x80634a0 "testbed",
  sslmode = 0x8063510 "prefer", krbsrvname = 0x8063520 "postgres",
  Pfdebug = 0x0, noticeHooks = {noticeRec = 0x40030bd0 <ECPGnoticeReceiver>,
    noticeRecArg = 0x8059ca8, noticeProc = 0x402c90c0 <defaultNoticeProcessor>,
    noticeProcArg = 0x0}, status = CONNECTION_OK, asyncStatus = PGASYNC_IDLE,
  xactStatus = PQTRANS_IDLE, queryclass = PGQUERY_SIMPLE,
  nonblocking = 0 '\0', copy_is_binary = 0 '\0', copy_already_done = 0,
  notifyHead = 0x0, notifyTail = 0x0, sock = 3, laddr = {addr = {
      ss_family = 2, __ss_align = 92410796,
      __ss_padding = '\0' <repeats 119 times>}, salen = 16}, raddr = {addr = {
      ss_family = 2, __ss_align = 58856364,
      __ss_padding = '\0' <repeats 119 times>}, salen = 16},
  pversion = 196608, sversion = 80100, addrlist = 0x0, addr_cur = 0x0,
  addrlist_family = 0, setenv_state = SETENV_STATE_IDLE, next_eo = 0x0,
  be_pid = 28824, be_key = 583752927, md5Salt = "\000\000\000",
  cryptSalt = "\000", pstatus = 0x807c330, client_encoding = 8,
  verbosity = PQERRORS_DEFAULT, lobjfuncs = 0x0, inBuffer = 0x805a028 "C",
  inBufSize = 16384, inStart = 18, inCursor = 18, inEnd = 18,
  outBuffer = 0x805e030 "X", outBufSize = 16384, outCount = 0,
---Type <return> to continue, or q <return> to quit---
  outMsgStart = 1, outMsgEnd = 5, result = 0x0, curTuple = 0x0,
  allow_ssl_try = 1 '\001', wait_ssl_try = 0 '\0', ssl = 0x806d1d0,
  peer = 0x807e430,
  peer_dn = "/C=US/ST=Pennsylvania/L=Pittsburgh/O=CMU/PDL/OU=andrew5/CN=nc3.pdl.cmu.local/emailAddress=andrew5@mailinator.com", '\0' <repeats 144 times>,
  peer_cn = "nc3.pdl.cmu.local", '\0' <repeats 15 times>, errorMessage = {
    data = 0x8062038 "", len = 0, maxlen = 256}, workBuffer = {
    data = 0x8062140 "COMMIT", len = 6, maxlen = 256}}
(gdb) s
1163            if (conn->ssl)
(gdb) s
1162    {
(gdb) s
1163            if (conn->ssl)
(gdb) s
1165                    SSL_shutdown(conn->ssl);
(gdb) s
1166                    SSL_free(conn->ssl);
(gdb) s
*** glibc detected *** corrupted double-linked list: 0x0807e428 ***

Program received signal SIGABRT, Aborted.
0x401bf851 in kill () from /lib/libc.so.6
(gdb)

--Andrew J. Klosterman
andrew5@ece.cmu.edu
On Tue, 14 Feb 2006, Stephen Frost wrote:

<snip>
> It's kind of a chicken-and-egg here because the backend decides what
> authentication mechanism to ask for based off the username (at least in
> part) through pg_hba.conf, so you can't find out the authentication
> method until you know the username so all methods to find the username
> have to be exhausted. You could avoid this by explicitly passing
> 'user=' into the connection parameters though... Would be interesting
> to know what happens then...

When asking about "explicitly passing 'user=' into the connection
parameters" do you mean that the EXEC SQL CONNECT line that ecpg parses
should specify a username?

My code is using the following statement when making a remote connection
that uses SSL.

EXEC SQL CONNECT TO :l_target AS :l_connection USER :l_user IDENTIFIED BY :l_passwd;

The target machine (hosting the database) has "ssl=on" in postgresql.conf.
Its pg_hba.conf (snippet below) has a line for the client machine from
which I am making the connection, specifying that an SSL connection
should be made.

# TYPE   DATABASE  USER     CIDR-ADDRESS      METHOD
hostssl  andrew5   andrew5  172.19.130.4/32   pam passwd

--Andrew J. Klosterman
andrew5@ece.cmu.edu
> We may be spending too much time on this one point --- as long as
> Kerberos isn't *writing* into the zero-length alloc, there is nothing
> illegal immoral or fattening about malloc(0). Can you get ElectricFence
> to not abort right here but continue on to the real problem?
>
> regards, tom lane

Doing a "man efence" lets me know that setting the EF_ALLOW_MALLOC_0
environment variable ought to let the program continue... I'll check into
that right now!

--Andrew J. Klosterman
andrew5@ece.cmu.edu
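As a side note on the malloc(0) point above, here is a minimal C sketch of why a zero-byte request is harmless by itself. It is not taken from the Kerberos or libpq sources, and the function name zero_length_alloc_roundtrip is made up for illustration. The C standard lets malloc(0) and calloc(0, n) return either NULL or a unique pointer, and either result may be handed to free() as long as nothing is read or written through it.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: request zero bytes two ways and release the results.
 * Returns 0 once the round trip has completed.
 */
int zero_length_alloc_roundtrip(void)
{
    void *a = malloc(0);              /* NULL or a unique pointer, both legal */
    void *b = calloc(0, sizeof(int)); /* same rule applies to calloc */

    /* The pointers are never dereferenced; that is the only real constraint. */
    free(a);                          /* free(NULL) is a no-op */
    free(b);
    return 0;
}
```

Run under ElectricFence, a call like this would presumably trip the same "Allocating 0 bytes" abort seen in this thread unless EF_ALLOW_MALLOC_0 is set, which is why that abort says little about the actual heap corruption.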
On Mon, 13 Feb 2006, Stephen Frost wrote:

> Hmm, alright, well, this is at least not the fault of the patch of mine
> which was included in Debian's 8.1.2-2 Postgres release. :) You might
> try compiling some debs with debugging enabled. This is (reasonably)
> straight-forward:
>
> (as root:)
> aptitude install build-essential debhelper cdbs bison perl libperl-dev \
>   tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
>   libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
>   gettext bzip2 fakeroot
> (as user:)
> apt-get source postgresql-8.1
> cd postgresql-8.1-8.1.0
> export DEB_BUILD_OPTIONS="nostrip"
> dpkg-buildpackage -uc -us -rfakeroot
>
> Should produce .debs in the parent directory which have debugging
> information. Another useful build option is "noopt", ie:
> export DEB_BUILD_OPTIONS="nostrip noopt", though that could make the
> error go disappear. It'd be terribly nice if you could do this and
> provide a gdb backtrace with debugging... :)
>
> Thanks,
>
> Stephen

Alright, I have built a system with the symbols left in the binaries.

It still crashes with the "corrupted double-linked list" error.

Running with ElectricFence the backtrace I get is:

Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.

ElectricFence Aborting: Allocating 0 bytes, probably a bug.

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 16384 (LWP 1895)]
0x401c4851 in kill () from /lib/libc.so.6
(gdb) bt
#0  0x401c4851 in kill () from /lib/libc.so.6
#1  0x40037dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2  0x40037823 in memalign () from /usr/lib/libefence.so.0
#3  0x400379ad in malloc () from /usr/lib/libefence.so.0
#4  0x40037a10 in calloc () from /usr/lib/libefence.so.0
#5  0x404a282f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6  0x402c9b26 in pg_krb5_init (PQerrormsg=0x0) at fe-auth.c:119
#7  0x402ca304 in pg_fe_getauthname (PQerrormsg=0xbffff29c "l\031")
    at fe-auth.c:176
#8  0x402cc861 in conninfo_parse (conninfo=<value optimized out>,
    errorMessage=0x4057afe8) at fe-connect.c:2719
#9  0x402cc983 in connectOptions1 (conn=0x4057acdc, conninfo=0x0)
    at fe-connect.c:362
#10 0x402cda11 in PQsetdbLogin (pghost=0x40574ffc "nc3", pgport=0x0,
    pgoptions=0x0, pgtty=0x0, dbName=0x40576ff8 "andrew5",
    login=0xbffffc31 "andrew5", pwd=0xbffffc3c "testbed") at fe-connect.c:568
#11 0x40030fe7 in ECPGconnect (lineno=191, c=0, name=0xbffffc22 "andrew5@nc3",
    user=0xbffffc31 "andrew5", passwd=0x0,
    connection_name=0xbffff8b0 "CorrectnessCheck", autocommit=0)
    at connect.c:452
#12 0x08049ecb in DBConnect (arg_connection=0xbffff964 "CorrectnessCheck")
    at client_test.pgcc:191
#13 0x0804a14f in DoCorrectnessChecks () at client_test.pgcc:231
#14 0x0804aa08 in main (argc=9, argv=0xbffffa74) at client_test.pgcc:526

Again, it is showing a bad malloc in what appears to be some code using
kerberos. But there's nothing in my setup that I can think of right now
that should induce a connection to be set up using kerberos.

--Andrew J. Klosterman
andrew5@ece.cmu.edu
> > Tracking down exactly what's tickling the problem in this case could be
> > tricky...
>
> Yeah :-(. If you aren't able to narrow it further by yourself, please
> try to put together a self-contained test case.
>
> regards, tom lane

Well, my attempt last night at putting together a test case that manifests
the error that I encountered was a total failure! The test code executes
flawlessly: no abnormal termination.

There must be something different between the two programs. But my
original is considerably more complex. I'll pursue other options for
debugging before returning to figuring out the difference between the
"real" code and the "test-case" code.

--Andrew J. Klosterman
andrew5@ece.cmu.edu
Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> With the debug binaries, I was able to step through the program and get to
> what appears to be the function where it bails: line 1166 of
> postgresql-8.1.0/src/interfaces/libpq/fe-secure.c where SSL_free() is
> called.

BTW, is the address that glibc says is corrupted consistent from run to
run? If so, you could narrow down the problem pretty quickly by setting
a hardware watchpoint on that address with gdb. Any hits that are not
from the malloc subroutines are probably the source of the problem.

regards, tom lane
On Wed, 15 Feb 2006, Tom Lane wrote:

> Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> > With the debug binaries, I was able to step through the program and get to
> > what appears to be the function where it bails: line 1166 of
> > postgresql-8.1.0/src/interfaces/libpq/fe-secure.c where SSL_free() is
> > called.
>
> BTW, is the address that glibc says is corrupted consistent from run to
> run? If so, you could narrow down the problem pretty quickly by setting
> a hardware watchpoint on that address with gdb. Any hits that are not
> from the malloc subroutines are probably the source of the problem.
>
> regards, tom lane

The address given by the error message is consistent. But, setting a
break/watch point for it has been troublesome.

A watchpoint can't be set until the memory is mapped in. I have narrowed
down the time that the memory is mapped in to being somewhere in a call to
PQconnectPoll() from within connectDBComplete() in
src/interfaces/libpq/fe-connect.c. With the watchpoint set, though, the
debugger isn't breaking the execution of the program until the error
manifests itself.

Digging around, I can't come up with a way to get information on the
arguments and return results from malloc() every time it is called.
"strace" only does system calls. The output I get from "ltrace" is not
useful and no options I can see appear to improve the situation.

So, I'm kinda stuck. This bug might be one that gets away...

--Andrew J. Klosterman
andrew5@ece.cmu.edu
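One glibc facility that may help with the "log every malloc() call" problem is mtrace() from <mcheck.h>. It records each allocation and free made by the process, library calls included, to the file named by the MALLOC_TRACE environment variable. The sketch below is only an assumption about how it could be wired into a test program, not something that was tried in this thread; run_with_malloc_trace and sample_workload are hypothetical names. On recent glibc releases (2.34 and later, as far as I know) the trace is only written when libc_malloc_debug.so is preloaded.

```c
#define _GNU_SOURCE
#include <mcheck.h>
#include <stdlib.h>

/* Hypothetical stand-in for the code under suspicion (connect, query, disconnect). */
void sample_workload(void)
{
    void *p = malloc(32);
    free(p);
}

/*
 * Hypothetical helper: trace every malloc/realloc/free made while body() runs.
 * Returns 0 on success, -1 if the log path could not be exported.
 */
int run_with_malloc_trace(const char *log_path, void (*body)(void))
{
    /* mtrace() takes the log file name from the MALLOC_TRACE variable. */
    if (setenv("MALLOC_TRACE", log_path, 1) != 0)
        return -1;

    mtrace();    /* start logging allocator calls */
    body();
    muntrace();  /* stop logging and close the log */
    return 0;
}
```

Calling run_with_malloc_trace("/tmp/malloc_trace.log", sample_workload) should leave a log with one line per allocation or free, giving the caller's address, the pointer involved and, for allocations, the size. Searching that log for the address glibc complains about would show which call handed it out and which call released it.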