Thread: BUG #2246: Bad malloc interactions: ecpg, openssl

BUG #2246: Bad malloc interactions: ecpg, openssl

From
"Andy Klosterman"
Date:
The following bug has been logged online:

Bug reference:      2246
Logged by:          Andy Klosterman
Email address:      andrew5@ece.cmu.edu
PostgreSQL version: 8.1.0
Operating system:   Debian testing: Linux nc3 2.4.27-2-386 #1 Wed Nov 30
21:38:51 JST 2005 i686 GNU/Linux
Description:        Bad malloc interactions: ecpg, openssl
Details:

Before going into a full description and figuring out some example code for
this situation, I'm fishing for interest in tracking it down and fixing
it (or not).

On a program that I (pre-)compile with ecpg and connect to a remote Postgres
instance over an SSL connection (as set up in pg_hba.conf with appropriate
certificates installed) my application prematurely terminates with the
following error:
*** glibc detected *** corrupted double-linked list: 0x0807c830 ***
Abort.

(Without an SSL connection (as set in pg_hba.conf) the program executes just
fine.  This leads me to cast suspicion on SSL libraries.)

The back trace from gdb looks like this (which doesn't appear to be too
informative, but looks like an exception stack):
    #0  0x401bc851 in kill () from /lib/libc.so.6
    #1  0x4014a309 in pthread_kill () from /lib/libpthread.so.0
    #2  0x4014a6c0 in raise () from /lib/libpthread.so.0
    #3  0x401bc606 in raise () from /lib/libc.so.6
    #4  0x401bd971 in abort () from /lib/libc.so.6
    #5  0x401ef930 in __fsetlocking () from /lib/libc.so.6
    #6  0x401f52b9 in malloc_usable_size () from /lib/libc.so.6
    #7  0x401f5395 in malloc_usable_size () from /lib/libc.so.6
    #8  0x401f5a43 in malloc_trim () from /lib/libc.so.6
    #9  0x401f5d51 in free () from /lib/libc.so.6
    #10 0x4052ce6c in zcfree () from /usr/lib/libz.so.1
    #11 0x4052f83f in inflateEnd () from /usr/lib/libz.so.1
    #12 0x4040f262 in COMP_rle () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
    #13 0x0807e680 in ?? ()
    #14 0x00000000 in ?? ()

After a bit of digging around online, I discovered the MALLOC_CHECK_
environment variable and how it changes the behavior of malloc (man 3
malloc).  The above back trace was without MALLOC_CHECK_ in the environment
(e.g., unsetenv MALLOC_CHECK_).

Running with MALLOC_CHECK_ equal to 2 or 1 allows my program to run to
completion.

With MALLOC_CHECK_ set to 0 (which is supposed to ignore corruption), I get
a segfault.  Running inside gdb gets me the following back trace:
    #0  0x403d6f73 in ASN1_template_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
    #1  0x403d6e0d in ASN1_primitive_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
    #2  0x403d7023 in ASN1_item_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
    #3  0x403d0c07 in X509_CERT_AUX_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
    #4  0x403d077a in X509_CINF_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
    #5  0x403d6e35 in ASN1_primitive_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
    #6  0x403d7023 in ASN1_item_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
    #7  0x403d0927 in X509_free () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
    #8  0x402d16f3 in pqsecure_destroy () from /usr/lib/libpq.so.4
    #9  0x402c387a in PQconninfoFree () from /usr/lib/libpq.so.4
    #10 0x402c39c3 in PQfinish () from /usr/lib/libpq.so.4
    #11 0x4002f41b in ECPGget_connection () from /usr/lib/libecpg.so.5
    #12 0x40030223 in ECPGdisconnect () from /usr/lib/libecpg.so.5
    #13 0x0804a113 in DBDisconnect (arg_connection=0x8054faf "client_correctness") at client_test.pgcc:215
    #14 0x0804a64e in DoCorrectnessChecks () at client_test.pgcc:278
    #15 0x0804aaa1 in main (argc=7, argv=0xbffffa84) at client_test.pgcc:523

PURE SPECULATION:  It looks like there is either trouble in the interaction
between Postgres and the SSL library or just a bit of trouble within the SSL
library.
SPECULATION: Another possibility is that I misunderstand some aspect of
multi-threaded interactions with Postgres (I open uniquely named connections
to the DB for each thread of my test program).  Maybe I need to have a
"lock" around the code that makes DB connections and make sure that only one
happens at a time (might be better handled within Postgres/SSL if that is
the case).

PROCEEDING FURTHER: If there is any desire on the part of any developers to
pursue this further, I'm open.  As things stand right now, I have
workarounds:
1. Don't use an SSL connection to the DB.
2. Do a "setenv MALLOC_CHECK_ 1" (or 2) and it works.

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Alvaro Herrera
Date:
Andy Klosterman wrote:

> Before going into a full description and figuring out some example code for
> this situation, I'm fishing for interest in tracking it down and fixing
> it (or not).

Whenever there is a bug that causes a crash, there is interest in
tracking it down and fixing it.  Please do provide a test case.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Tom Lane
Date:
"Andy Klosterman" <andrew5@ece.cmu.edu> writes:
> SPECULATION: Another possibility is that I misunderstand some aspect of
> multi-threaded interactions with Postgres (I open uniquely named connections
> to the DB for each thread of my test program).  Maybe I need to have a
> "lock" around the code that makes DB connections and make sure that only one
> happens at a time (might be better handled within Postgres/SSL if that is
> the case).

There could be some re-entrancy problem in the SSL connection startup
code --- if you add such a lock, does it get more reliable?  Also, did
you remember to build PG with --enable-thread-safety ?
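
For illustration, a minimal sketch of the kind of lock being asked about,
assuming POSIX threads.  The names locked_connect, connect_fn and
demo_connect are hypothetical and do not come from ecpg or libpq; the
callback stands in for whatever code actually opens the connection.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical callback type: stands in for the code that opens one
 * database connection (for ecpg, whatever sits behind EXEC SQL CONNECT). */
typedef int (*connect_fn)(const char *conn_name);

/* One process-wide lock, so that only one thread at a time is inside
 * connection startup (and therefore inside SSL startup). */
pthread_mutex_t connect_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialize calls to do_connect; returns whatever do_connect returns. */
int locked_connect(connect_fn do_connect, const char *conn_name)
{
    int rc;

    assert(do_connect != NULL);
    pthread_mutex_lock(&connect_lock);
    rc = do_connect(conn_name);
    pthread_mutex_unlock(&connect_lock);
    return rc;
}

/* Stand-in connect routine for demonstration; it only counts calls. */
int demo_connect_calls = 0;

int demo_connect(const char *conn_name)
{
    (void) conn_name;           /* unused in the stand-in */
    demo_connect_calls++;
    return 0;
}
```

Each thread would call locked_connect() with its own connection name.  If
the crash persists with the lock in place, a re-entrancy problem in
connection startup becomes a less likely explanation.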

            regards, tom lane

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Tom Lane
Date:
Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> I threw in a pthread mutex around the code making the database connections
> for each of my threads.  The problem is still there ("corrupted
> double-linked list").

> Even tuning things down and instructing my code to only run a single
> pthread manifests the problem over an SSL connection.

Hmm.  Based on that, the problem is starting to smell more like a
garden-variety memory clobber, for instance malloc'ing a chunk smaller
than the data that's later stuffed into it.  It might be worth running
the program under something like ElectricFence, which will catch the
offender on-the-spot rather than only later when corruption of malloc's
private data structures is detected.
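
As a generic illustration of that kind of clobber (not code taken from
libpq, ecpg or OpenSSL), here is a sketch of a string copy sized
correctly, with the usual mistake noted in the comments.  The function
name copy_string is made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy a C string into freshly allocated memory.
 * The classic clobber is malloc(strlen(src)) followed by strcpy(): the
 * terminating '\0' is then written one byte past the end of the chunk,
 * which can damage malloc's bookkeeping for the neighbouring chunk. */
char *copy_string(const char *src)
{
    size_t len;
    char *dst;

    assert(src != NULL);
    len = strlen(src);
    dst = malloc(len + 1);      /* +1 leaves room for the terminator */
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, len + 1);  /* copies the terminator as well */
    return dst;
}
```

With the off-by-one version, nothing fails at the point of the copy.  The
damage tends to surface only in a later malloc() or free(), which is why
a tool that traps at the offending write is useful.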

Looking back at your original message, I wonder if it could be the
combination of ecpg and SSL that triggers it?  I'd have thought that
libpq/SSL alone would be pretty well wrung out, but ecpg is not so
widely used.

BTW, you did say this was i386 right?  If it were a 64-bit architecture,
I'd be about ready to bet money on the wrong-malloc-size-calculation
theory.
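
One common form of a wrong size calculation that only bites on 64-bit
machines is a hard-coded pointer size.  The sketch below is a generic
example of that pattern, with a made-up function name (alloc_ptr_table);
it is not a claim about where the bug in this report lives.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a zeroed table of n pointers.
 * Writing calloc(n, 4) happens to work where pointers are 4 bytes wide
 * and under-allocates by half where they are 8 bytes wide. */
char **alloc_ptr_table(size_t n)
{
    assert(n > 0);
    return calloc(n, sizeof(char *));   /* not calloc(n, 4) */
}
```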

> Tracking down exactly what's tickling the problem in this case could be
> tricky...

Yeah :-(.  If you aren't able to narrow it further by yourself, please
try to put together a self-contained test case.

            regards, tom lane

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Tom Lane
Date:
Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> (gdb) bt
> #0  0x401c3851 in kill () from /lib/libc.so.6
> #1  0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
> #2  0x40139823 in memalign () from /usr/lib/libefence.so.0
> #3  0x401399ad in malloc () from /usr/lib/libefence.so.0
> #4  0x40139a10 in calloc () from /usr/lib/libefence.so.0
> #5  0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
> #6  0x402c8b3f in ?? () from /usr/lib/libpq.so.4
> #7  0x402ded88 in ?? () from /usr/lib/libpq.so.4
> #8  0x00000000 in ?? ()

Any chance of doing this with debug symbols?  libpq does not call
krb5_set_default_tgs_ktypes directly, so I don't think I believe the
above backtrace.  gdb is easily misled without debug symbols :-(

I'm not sure if Debian does things the way Red Hat does, but on RH
there are separate "debuginfo" RPMs corresponding to each regular
RPM --- if you install the ones matching your libpq and libkrb5
RPMs you should be able to get better info.

            regards, tom lane

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> > (gdb) bt
> > #0  0x401c3851 in kill () from /lib/libc.so.6
> > #1  0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
> > #2  0x40139823 in memalign () from /usr/lib/libefence.so.0
> > #3  0x401399ad in malloc () from /usr/lib/libefence.so.0
> > #4  0x40139a10 in calloc () from /usr/lib/libefence.so.0
> > #5  0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
> > #6  0x402c8b3f in ?? () from /usr/lib/libpq.so.4
> > #7  0x402ded88 in ?? () from /usr/lib/libpq.so.4
> > #8  0x00000000 in ?? ()
>
> Any chance of doing this with debug symbols?  libpq does not call
> krb5_set_default_tgs_ktypes directly, so I don't think I believe the
> above backtrace.  gdb is easily misled without debug symbols :-(

Hrmpf, I missed this bug-on-Debian report.  I'll go check the archive
for the rest.

> I'm not sure if Debian does things the way Red Hat does, but on RH
> there are separate "debuginfo" RPMs corresponding to each regular
> RPM --- if you install the ones matching your libpq and libkrb5
> RPMs you should be able to get better info.

We do have debugging .debs for some things.  We don't have them for
everything and unfortunately we don't yet have them for Postgres.  I'll
talk to Martin about building some though so that in the future it's
easier to debug these problems.

    Thanks,

        Stephen

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> We do have debugging .debs for some things.  We don't have them for
> everything and unfortunately we don't yet have them for Postgres.  I'll
> talk to Martin about building some though so that in the future it's
> easier to debug these problems.

Hmm.  Andrew, it seems your choices are to rebuild the relevant
libraries from source, or to concentrate on developing a test case
that other people can try.

            regards, tom lane

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Mon, 13 Feb 2006, Tom Lane wrote:

> Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> > I threw in a pthread mutex around the code making the database connections
> > for each of my threads.  The problem is still there ("corrupted
> > double-linked list").
>
> > Even tuning things down and instructing my code to only run a single
> > pthread manifests the problem over an SSL connection.
>
> Hmm.  Based on that, the problem is starting to smell more like a
> garden-variety memory clobber, for instance malloc'ing a chunk smaller
> than the data that's later stuffed into it.  It might be worth running
> the program under something like ElectricFence, which will catch the
> offender on-the-spot rather than only later when corruption of malloc's
> private data structures is detected.
>
> Looking back at your original message, I wonder if it could be the
> combination of ecpg and SSL that triggers it?  I'd have thought that
> libpq/SSL alone would be pretty well wrung out, but ecpg is not so
> widely used.
>
> BTW, you did say this was i386 right?  If it were a 64-bit architecture,
> I'd be about ready to bet money on the wrong-malloc-size-calculation
> theory.
>
> > Tracking down exactly what's tickling the problem in this case could be
> > tricky...
>
> Yeah :-(.  If you aren't able to narrow it further by yourself, please
> try to put together a self-contained test case.
>
>             regards, tom lane

I just did the "electric fence" thing for you and this is what I get in
gdb...

  Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.

ElectricFence Aborting: Allocating 0 bytes, probably a bug.

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 16384 (LWP 24753)]
0x401c3851 in kill () from /lib/libc.so.6
(gdb) bt
#0  0x401c3851 in kill () from /lib/libc.so.6
#1  0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2  0x40139823 in memalign () from /usr/lib/libefence.so.0
#3  0x401399ad in malloc () from /usr/lib/libefence.so.0
#4  0x40139a10 in calloc () from /usr/lib/libefence.so.0
#5  0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6  0x402c8b3f in ?? () from /usr/lib/libpq.so.4
#7  0x402ded88 in ?? () from /usr/lib/libpq.so.4
#8  0x00000000 in ?? ()

Looks like something fishy going on between libpq and libkrb5.  I'm
especially suspicious since I'm not using kerberos for authentication at
all.

I am developing on i386 (more or less).
# uname -m
i686

--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Mon, 13 Feb 2006, Tom Lane wrote:

> Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> > (gdb) bt
> > #0  0x401c3851 in kill () from /lib/libc.so.6
> > #1  0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
> > #2  0x40139823 in memalign () from /usr/lib/libefence.so.0
> > #3  0x401399ad in malloc () from /usr/lib/libefence.so.0
> > #4  0x40139a10 in calloc () from /usr/lib/libefence.so.0
> > #5  0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
> > #6  0x402c8b3f in ?? () from /usr/lib/libpq.so.4
> > #7  0x402ded88 in ?? () from /usr/lib/libpq.so.4
> > #8  0x00000000 in ?? ()
>
> Any chance of doing this with debug symbols?  libpq does not call
> krb5_set_default_tgs_ktypes directly, so I don't think I believe the
> above backtrace.  gdb is easily misled without debug symbols :-(
>
> I'm not sure if Debian does things the way Red Hat does, but on RH
> there are separate "debuginfo" RPMs corresponding to each regular
> RPM --- if you install the ones matching your libpq and libkrb5
> RPMs you should be able to get better info.
>
>             regards, tom lane

I thought about that and did some quick checks of how to get debug symbols
in libraries on Debian.  I didn't come up with anything right away.  I'll
poke around and see what I can come up with.

--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Wed, 8 Feb 2006, Tom Lane wrote:

> "Andy Klosterman" <andrew5@ece.cmu.edu> writes:
> > SPECULATION: Another possibility is that I misunderstand some aspect of
> > multi-threaded interactions with Postgres (I open uniquely named connections
> > to the DB for each thread of my test program).  Maybe I need to have a
> > "lock" around the code that makes DB connections and make sure that only one
> > happens at a time (might be better handled within Postgres/SSL if that is
> > the case).
>
> There could be some re-entrancy problem in the SSL connection startup
> code --- if you add such a lock, does it get more reliable?  Also, did
> you remember to build PG with --enable-thread-safety ?
>
>             regards, tom lane

(I'm back after a bit of an illness.  Much better now!)

I threw in a pthread mutex around the code making the database connections
for each of my threads.  The problem is still there ("corrupted
double-linked list").

Even tuning things down and instructing my code to only run a single
pthread manifests the problem over an SSL connection.  Everything is just
fine without SSL.  Other code I've written works just fine with (and
without) threads connecting to the database with (and without) SSL.
Tracking down exactly what's tickling the problem in this case could be
tricky...

I'm using the pre-built debian testing packages, not self-compiled code,
for my postgres installation.  From the information I can gather from the
debian build logs (http://buildd.debian.org/build.php), everything was
configured and built with threads enabled.

--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Stephen Frost
Date:
* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:
> (gdb) bt
> #0  0x401c3851 in kill () from /lib/libc.so.6
> #1  0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
> #2  0x40139823 in memalign () from /usr/lib/libefence.so.0
> #3  0x401399ad in malloc () from /usr/lib/libefence.so.0
> #4  0x40139a10 in calloc () from /usr/lib/libefence.so.0
> #5  0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
> #6  0x402c8b3f in ?? () from /usr/lib/libpq.so.4
> #7  0x402ded88 in ?? () from /usr/lib/libpq.so.4
> #8  0x00000000 in ?? ()
>
> Looks like something fishy going on between libpq and libkrb5.  I'm
> especially suspicious since I'm not using kerberos for authentication at
> all.

Seems kind of unlikely...  What exact (.deb) versions of libpq and
Postgres are you using?  You originally posted w/ 8.1.0 but perhaps on
the client you had something more recent?

    Thanks,

        Stephen

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Stephen Frost
Date:
* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:
> > Seems kind of unlikely...  What exact (.deb) versions of libpq and
> > Postgres are you using?  You originally posted w/ 8.1.0 but perhaps on
> > the client you had something more recent?
>
> Running "aptitude show X" where "X" is the package name, and applying
> appropriate filtering gives the following results on my development
> systems:
>
> Package: libpq-dev
> Version: 8.1.0-3
>
> Package: libpq3
> Version: 1:7.4.9-2
>
> Package: libpq4
> Version: 8.1.0-3
>
> Package: postgresql-8.1
> Version: 8.1.0-3
>
> Package: postgresql-contrib-8.1
> Version: 8.1.0-3
>
> Package: postgresql-server-dev-8.1
> Version: 8.1.0-3
>
> Package: postgresql-client-8.1
> Version: 8.1.0-3
>
> Package: postgresql-common
> Version: 39

Hmm, alright, well, this is at least not the fault of the patch of mine
which was included in Debian's 8.1.2-2 Postgres release. :)  You might
try compiling some debs with debugging enabled.  This is (reasonably)
straight-forward:

(as root:)
aptitude install build-essential debhelper cdbs bison perl libperl-dev \
    tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
    libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
    gettext bzip2 fakeroot
(as user:)
apt-get source postgresql-8.1
cd postgresql-8.1-8.1.0
export DEB_BUILD_OPTIONS="nostrip"
dpkg-buildpackage -uc -us -rfakeroot

Should produce .debs in the parent directory which have debugging
information.  Another useful build option is "noopt", ie:
export DEB_BUILD_OPTIONS="nostrip noopt", though that could make the
error disappear.  It'd be terribly nice if you could do this and
provide a gdb backtrace with debugging... :)

    Thanks,

        Stephen

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Mon, 13 Feb 2006, Stephen Frost wrote:

> * Andrew Klosterman (andrew5@ece.cmu.edu) wrote:
> > (gdb) bt
> > #0  0x401c3851 in kill () from /lib/libc.so.6
> > #1  0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
> > #2  0x40139823 in memalign () from /usr/lib/libefence.so.0
> > #3  0x401399ad in malloc () from /usr/lib/libefence.so.0
> > #4  0x40139a10 in calloc () from /usr/lib/libefence.so.0
> > #5  0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
> > #6  0x402c8b3f in ?? () from /usr/lib/libpq.so.4
> > #7  0x402ded88 in ?? () from /usr/lib/libpq.so.4
> > #8  0x00000000 in ?? ()
> >
> > Looks like something fishy going on between libpq and libkrb5.  I'm
> > especially suspicious since I'm not using kerberos for authentication at
> > all.
>
> Seems kind of unlikely...  What exact (.deb) versions of libpq and
> Postgres are you using?  You originally posted w/ 8.1.0 but perhaps on
> the client you had something more recent?
>
>     Thanks,
>
>         Stephen

Running "aptitude show X" where "X" is the package name, and applying
appropriate filtering gives the following results on my development
systems:

Package: libpq-dev
Version: 8.1.0-3

Package: libpq3
Version: 1:7.4.9-2

Package: libpq4
Version: 8.1.0-3

Package: postgresql-8.1
Version: 8.1.0-3

Package: postgresql-contrib-8.1
Version: 8.1.0-3

Package: postgresql-server-dev-8.1
Version: 8.1.0-3

Package: postgresql-client-8.1
Version: 8.1.0-3

Package: postgresql-common
Version: 39

(I frequently update and upgrade my installations...)

--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Jens-Wolfhard Schicke
Date:
--On Monday, February 13, 2006 21:25:30 -0500 Stephen Frost
<sfrost@snowman.net> wrote:

> * Andrew Klosterman (andrew5@ece.cmu.edu) wrote:
>> > Seems kind of unlikely...  What exact (.deb) versions of libpq and
>> > Postgres are you using?  You originally posted w/ 8.1.0 but perhaps on
>> > the client you had something more recent?
> aptitude install build-essential debhelper cdbs bison perl libperl-dev \
>     tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
>     libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
>     gettext bzip2 fakeroot
You might want to add valgrind to this list. It analyzes code on assembler
basis and does a lot of memory checking / undefined variables checking
while the program runs. Fixed all SIGSEGV I ever encountered which were not
infinite recursions.

Kind regards
Jens Schicke
--
Jens Schicke              j.schicke@asco.de
asco GmbH              http://www.asco.de
Mittelweg 7              Tel 0531/3906-127
38106 Braunschweig          Fax 0531/3906-400

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Volkan YAZICI
Date:
On Feb 13 04:01, Andrew Klosterman wrote:
> I threw in a pthread mutex around the code making the database connections
> for each of my threads.  The problem is still there ("corrupted
> double-linked list").
> ...
> Program received signal SIGILL, Illegal instruction.
> [Switching to Thread 16384 (LWP 24753)]
> 0x401c3851 in kill () from /lib/libc.so.6
> (gdb) bt
> #0  0x401c3851 in kill () from /lib/libc.so.6
> #1  0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
> #2  0x40139823 in memalign () from /usr/lib/libefence.so.0
> #3  0x401399ad in malloc () from /usr/lib/libefence.so.0
> #4  0x40139a10 in calloc () from /usr/lib/libefence.so.0
> #5  0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
> #6  0x402c8b3f in ?? () from /usr/lib/libpq.so.4
> #7  0x402ded88 in ?? () from /usr/lib/libpq.so.4
> #8  0x00000000 in ?? ()

I met with some other thread-safety issues caused by libc used in
Debian repos. For instance, getpwuid_r() is broken in Debian's
current stable libc package and this causes a similar memory leak
in the libpq code.

IMHO, testing code with a newer libc version can be the solution.
Otherwise, for an exact answer - as Tom said - we need libpq symbols
in the backtrace.


Regards.

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Stephen Frost
Date:
* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:
> Alright, I have built a system with the symbols left into the binaries.
[...]
> Again, it is showing a bad malloc in what appears to be some code using
> kerberos.  But there's nothing in my setup that I can think of right now
> that should induce a connection to be set up using kerberos.

The Kerberos libraries are still called when support for them has been
compiled in.  They generally don't cause any problems though.  For some
reason the line numbers in the backtrace line up but the function names
don't quite (perhaps inlining).  Anyhow, the error is being reported
down in 'krb5_init_context()' so either something strange is happening
or it's actually a Kerberos bug after all.  The reason the Kerberos
libraries are called is to get the 'username' to use, which is
determined prior to actually connecting to the backend (and finding
out what authentication mechanism the backend thinks we should be
trying).

It's kind of a chicken-and-egg here because the backend decides what
authentication mechanism to ask for based off the username (at least in
part) through pg_hba.conf, so you can't find out the authentication
method until you know the username so all methods to find the username
have to be exhausted.  You could avoid this by explicitly passing
'user=' into the connection parameters though...  Would be interesting
to know what happens then...

Might also be interesting to look into the Kerberos libraries to see why
they're attempting to malloc(0), perhaps there's a bug there when
Kerberos isn't set up on the machine?

    Thanks,

        Stephen

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Stephen Frost
Date:
* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:
> On Tue, 14 Feb 2006, Stephen Frost wrote:
> <snip>
> > It's kind of a chicken-and-egg here because the backend decides what
> > authentication mechanism to ask for based off the username (at least in
> > part) through pg_hba.conf, so you can't find out the authentication
> > method until you know the username so all methods to find the username
> > have to be exhausted.  You could avoid this by explicitly passing
> > 'user=' into the connection parameters though...  Would be interesting
> > to know what happens then...
>
> When asking about "explicitly passing 'user=' in to the connection
> parameters" do you mean that the EXEC SQL CONNECT line that ecpg parses
> should specify a username?

Oh, I see now.  You're not using PQconnectdb but rather PQsetdbLogin, or
at least, that's what ECPG is using.  This ends up meaning that you
can't pass in your own conninfo string and during the PQsetdbLogin call,
libpq calls connectOptions1 with an empty conninfo string, which makes
libpq think there's no set username which in turn makes it ask the
Kerberos libraries for a username...

As an initial comment, it seems like it'd be a good thing to update ECPG
to use PQconnectdb.  It's possible this is exposed already in some way
but I'm not familiar enough with ECPG to know.

Another approach would be to have PQsetdbLogin build up a conninfo
string and pass that into connectOptions1 instead of calling
connectOptions1 with an empty string and then changing the values
afterwards.  That'd probably be too large of a change to get in as a
bugfix though.  An alternative might be to move the pg_fe_getauthname()
call to connectOptions2 as it's actually a bit more work than one might
expect and if that can be avoided then that's probably all to the good.
I'm a little worried about if that would work for all the various ways
to use libpq to connect to the database...
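
To make the first approach concrete, here is a rough sketch of building a
conninfo string from separate parameters.  conninfo_append is a
hypothetical helper, not a libpq function.  It relies only on the
documented conninfo syntax, where a value may be wrapped in single quotes
with embedded single quotes and backslashes escaped by a backslash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append "key='value' " to the string already in buf, escaping any
 * single quote or backslash in value with a backslash.
 * Returns 0 on success, -1 if buf (capacity cap bytes) is too small. */
int conninfo_append(char *buf, size_t cap, const char *key, const char *value)
{
    size_t pos;
    const char *p;

    assert(buf != NULL && key != NULL && value != NULL);
    pos = strlen(buf);

    /* Worst case: every value byte escaped, plus "='", "' " and NUL. */
    if (pos + strlen(key) + 2 * strlen(value) + 5 > cap)
        return -1;

    for (p = key; *p != '\0'; p++)
        buf[pos++] = *p;
    buf[pos++] = '=';
    buf[pos++] = '\'';
    for (p = value; *p != '\0'; p++) {
        if (*p == '\'' || *p == '\\')
            buf[pos++] = '\\';
        buf[pos++] = *p;
    }
    buf[pos++] = '\'';
    buf[pos++] = ' ';
    buf[pos] = '\0';
    return 0;
}
```

Starting from an empty buffer, one call per non-NULL argument of
PQsetdbLogin would produce a string in which a user name supplied by the
caller is already present when the options are parsed.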

Sorry I don't have a simple answer. :/  In the end it seems like the
Kerberos libraries should be able to survive Kerberos not being
configured or whatever is going on to make it try to malloc 0-bytes...

    Thanks,

        Stephen

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> Another approach would be to have PQsetdbLogin build up a conninfo
> string and pass that into connectOptions1 instead of calling
> connectOptions1 with an empty string and then changing the values
> afterwards.  That'd probably be too large of a change to get in as a
> bugfix though.  An alternative might be to move the pg_fe_getauthname()
> call to connectOptions2 as it's actually a bit more work than one might
> expect and if that can be avoided then that's probably all to the good.

Right offhand I like the idea of pushing it into connectOptions2 --- can
you experiment with that?  Seems like there is no reason to call
Kerberos if the user supplies the name to connect as.
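
The idea can be sketched as follows.  The names choose_user,
default_user_fn and demo_fallback are hypothetical; in libpq the fallback
role is played by pg_fe_getauthname(), and this is only an outline of the
ordering, not the actual connectOptions2 code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fallback that works out a default user name.  Per the
 * discussion above, the real lookup may end up in the Kerberos libraries. */
typedef const char *(*default_user_fn)(void);

/* Return the user name to connect as: the caller's value if one was
 * supplied, otherwise whatever the fallback produces. */
const char *choose_user(const char *supplied, default_user_fn fallback)
{
    assert(fallback != NULL);
    if (supplied != NULL && supplied[0] != '\0')
        return supplied;        /* the fallback is never called */
    return fallback();
}

/* Stand-in fallback for demonstration; counts how often it runs. */
int demo_fallback_calls = 0;

const char *demo_fallback(void)
{
    demo_fallback_calls++;
    return "os_login_name";
}
```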

> Sorry I don't have a simple answer. :/  In the end it seems like the
> Kerberos libraries should be able to survive Kerberos not being
> configured or whatever is going on to make it try to malloc 0-bytes...

We may be spending too much time on this one point --- as long as
Kerberos isn't *writing* into the zero-length alloc, there is nothing
illegal immoral or fattening about malloc(0).  Can you get ElectricFence
to not abort right here but continue on to the real problem?
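
As a small illustration of why a zero-byte request is not an error in
itself: the C standard lets malloc(0) return either a null pointer or a
pointer that may be passed to free() but never dereferenced.  The helper
name alloc_may_be_empty below is made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n bytes, where n may be 0, and store the result in *out.
 * Returns 0 on success and -1 only for a genuine allocation failure.
 * A NULL result for n == 0 is treated as success; the caller must not
 * read or write through *out in that case, but may pass it to free(). */
int alloc_may_be_empty(size_t n, void **out)
{
    assert(out != NULL);
    *out = malloc(n);
    if (*out == NULL && n > 0)
        return -1;
    return 0;
}
```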

            regards, tom lane

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > Another approach would be to have PQsetdbLogin build up a conninfo
> > string and pass that into connectOptions1 instead of calling
> > connectOptions1 with an empty string and then changing the values
> > afterwards.  That'd probably be too large of a change to get in as a
> > bugfix though.  An alternative might be to move the pg_fe_getauthname()
> > call to connectOptions2 as it's actually a bit more work than one might
> > expect and if that can be avoided then that's probably all to the good.
>
> Right offhand I like the idea of pushing it into connectOptions2 --- can
> you experiment with that?  Seems like there is no reason to call
> Kerberos if the user supplies the name to connect as.

Sure thing, I'll take a look at this probably tomorrow night or Thursday
evening.

> > Sorry I don't have a simple answer. :/  In the end it seems like the
> > Kerberos libraries should be able to survive Kerberos not being
> > configured or whatever is going on to make it try to malloc 0-bytes...
>
> We may be spending too much time on this one point --- as long as
> Kerberos isn't *writing* into the zero-length alloc, there is nothing
> illegal immoral or fattening about malloc(0).  Can you get ElectricFence
> to not abort right here but continue on to the real problem?

Good point.

    Stephen

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Tom Lane
Date:
Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> (gdb) print *conn
> ...
>   allow_ssl_try = 1 '\001', wait_ssl_try = 0 '\0', ssl = 0x806d1d0,
>   peer = 0x807e430,
> ...
> *** glibc detected *** corrupted double-linked list: 0x0807e428 ***

Hm, it looks like the problem is associated with whatever was allocated
just before conn->peer (which is returned by SSL_get_peer_certificate
called from open_client_SSL).  Can you get efence or some other tool to
produce a trace of malloc calls so we can determine what that is?
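
A very rough sketch of what such a trace boils down to: record every
address handed out, then look up the recorded chunk sitting just below
the address of interest.  All names here (traced_malloc, chunk_before,
trace_log) are hypothetical, and the sketch only sees allocations made
through the wrapper.  Covering allocations made inside OpenSSL or libpq
would need a tool that interposes on malloc itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TRACE_MAX 1024

/* One recorded allocation: where it is and how many bytes were asked for. */
struct trace_entry {
    void  *addr;
    size_t size;
};

struct trace_entry trace_log[TRACE_MAX];
size_t trace_count = 0;

/* malloc() wrapper that remembers what it handed out. */
void *traced_malloc(size_t size)
{
    void *p = malloc(size);

    if (p != NULL && trace_count < TRACE_MAX) {
        trace_log[trace_count].addr = p;
        trace_log[trace_count].size = size;
        trace_count++;
    }
    return p;
}

/* Find the recorded chunk with the highest address below target, i.e.
 * the likely neighbour whose overrun would damage target's header.
 * Returns NULL if no recorded chunk lies below target. */
const struct trace_entry *chunk_before(const void *target)
{
    const struct trace_entry *best = NULL;
    size_t i;

    assert(target != NULL);
    for (i = 0; i < trace_count; i++) {
        uintptr_t a = (uintptr_t) trace_log[i].addr;

        if (a < (uintptr_t) target &&
            (best == NULL || a > (uintptr_t) best->addr))
            best = &trace_log[i];
    }
    return best;
}
```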

            regards, tom lane

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Tue, 14 Feb 2006, Jens-Wolfhard Schicke wrote:
> --On Monday, February 13, 2006 21:25:30 -0500 Stephen Frost
> <sfrost@snowman.net> wrote:
>
> > * Andrew Klosterman (andrew5@ece.cmu.edu) wrote:
> >> > Seems kind of unlikely...  What exact (.deb) versions of libpq and
> >> > Postgres are you using?  You originally posted w/ 8.1.0 but perhaps on
> >> > the client you had something more recent?
> > aptitude install build-essential debhelper cdbs bison perl libperl-dev \
> >     tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
> >     libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
> >     gettext bzip2 fakeroot
> You might want to add valgrind to this list. It analyzes code on assembler
> basis and does a lot of memory checking / undefined variables checking
> while the program runs. Fixed all SIGSEGV I ever encountered which were not
> infinite recursions.
>
> Kind regards
> Jens Schicke

I tried valgrind this morning.  It detected problems in the depths of the
code behind ECPGconnect() down through SSL_read() and inflate().  Also,
there was trouble reported behind ECPGconnect() -> PQsetdbLogin() ->
pqGetpwuid() -> getpwuid_r() -> _dl_open() -> into the depths of
/lib/ld-2.3.5.so.  Valgrind got so upset at the number of errors it found
that it gave up.  Nothing bad seemed to show up in the code that I wrote.

But, while running under valgrind, the original program that manifests the
error condition runs just fine and to completion (maybe valgrind's
replacement version of malloc simply ignores the errors, as happens with
the MALLOC_CHECK_ environment variable set).

I'm moving on to try building the binaries without removing the symbols.
Hopefully that will give more useful information...

--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Tue, 14 Feb 2006, Andrew Klosterman wrote:

> > We may be spending too much time on this one point --- as long as
> > Kerberos isn't *writing* into the zero-length alloc, there is nothing
> > illegal immoral or fattening about malloc(0).  Can you get ElectricFence
> > to not abort right here but continue on to the real problem?
> >
> >             regards, tom lane
>
> Doing a "man efence" lets me know that setting the EF_ALLOW_MALLOC_0
> environment variable ought to let the program continue...  I'll check into
> that right now!
>
>
> --Andrew J. Klosterman
> andrew5@ece.cmu.edu

Well, when ElectricFence is allowed to ignore malloc() of zero bytes, my
program runs like a champ!  Might be associated with the replacement
malloc() that it installs to check for bugs, though.
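
(A minimal sketch of the point about malloc(0), not taken from the Kerberos
code: the C standard lets a zero-byte request return either NULL or a unique
pointer, and either result may be passed to free() as long as nothing is
read or written through it.  The function name is made up for illustration.)

```c
#include <stdlib.h>

/*
 * Request zero bytes and release them again without touching the memory.
 * Returns 1 if this allocator handed back a non-NULL pointer, 0 if it
 * returned NULL.  Both outcomes are permitted by the C standard.
 */
int
malloc_zero_returns_pointer(void)
{
    void *p = malloc(0);    /* NULL or a unique pointer with no usable bytes */
    int   got_pointer = (p != NULL);

    /* Never dereference p here: there is nothing behind it to access. */
    free(p);                /* free(NULL) is defined to do nothing */
    return got_pointer;
}
```

ElectricFence in its default configuration aborts on such a request, which
is why EF_ALLOW_MALLOC_0 was needed to get past this point.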

(back to digging some more...)

--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Tue, 14 Feb 2006, Andrew Klosterman wrote:

> On Mon, 13 Feb 2006, Stephen Frost wrote:
>
> > Hmm, alright, well, this is at least not the fault of the patch of mine
> > which was included in Debian's 8.1.2-2 Postgres release. :)  You might
> > try compiling some debs with debugging enabled.  This is (reasonably)
> > straight-forward:
> >
> > (as root:)
> > aptitude install build-essential debhelper cdbs bison perl libperl-dev \
> >     tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
> >     libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
> >     gettext bzip2 fakeroot
> > (as user:)
> > apt-get source postgresql-8.1
> > cd postgresql-8.1-8.1.0
> > export DEB_BUILD_OPTIONS="nostrip"
> > dpkg-buildpackage -uc -us -rfakeroot
> >
> > Should produce .debs in the parent directory which have debugging
> > information.  Another useful build option is "noopt", ie:
> > export DEB_BUILD_OPTIONS="nostrip noopt", though that could make the
> > error disappear.  It'd be terribly nice if you could do this and
> > provide a gdb backtrace with debugging... :)
> >
> >     Thanks,
> >
> >         Stephen
>
> Alright, I have built a system with the symbols left in the binaries.
>
> It still crashes with the "corrupted double-linked list" error.
>
> Running with ElectricFence the backtrace I get is:
>
>   Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.
>
> ElectricFence Aborting: Allocating 0 bytes, probably a bug.
>
> Program received signal SIGILL, Illegal instruction.
> [Switching to Thread 16384 (LWP 1895)]
> 0x401c4851 in kill () from /lib/libc.so.6
> (gdb) bt
> #0  0x401c4851 in kill () from /lib/libc.so.6
> #1  0x40037dd5 in EF_Abort () from /usr/lib/libefence.so.0
> #2  0x40037823 in memalign () from /usr/lib/libefence.so.0
> #3  0x400379ad in malloc () from /usr/lib/libefence.so.0
> #4  0x40037a10 in calloc () from /usr/lib/libefence.so.0
> #5  0x404a282f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
> #6  0x402c9b26 in pg_krb5_init (PQerrormsg=0x0) at fe-auth.c:119
> #7  0x402ca304 in pg_fe_getauthname (PQerrormsg=0xbffff29c "l\031")
>     at fe-auth.c:176
> #8  0x402cc861 in conninfo_parse (conninfo=<value optimized out>,
>     errorMessage=0x4057afe8) at fe-connect.c:2719
> #9  0x402cc983 in connectOptions1 (conn=0x4057acdc, conninfo=0x0)
>     at fe-connect.c:362
> #10 0x402cda11 in PQsetdbLogin (pghost=0x40574ffc "nc3", pgport=0x0,
>     pgoptions=0x0, pgtty=0x0, dbName=0x40576ff8 "andrew5",
>     login=0xbffffc31 "andrew5", pwd=0xbffffc3c "testbed") at fe-connect.c:568
> #11 0x40030fe7 in ECPGconnect (lineno=191, c=0, name=0xbffffc22 "andrew5@nc3",
>     user=0xbffffc31 "andrew5", passwd=0x0,
>     connection_name=0xbffff8b0 "CorrectnessCheck", autocommit=0)
>     at connect.c:452
> #12 0x08049ecb in DBConnect (arg_connection=0xbffff964 "CorrectnessCheck")
>     at client_test.pgcc:191
> #13 0x0804a14f in DoCorrectnessChecks () at client_test.pgcc:231
> #14 0x0804aa08 in main (argc=9, argv=0xbffffa74) at client_test.pgcc:526
>
> Again, it is showing a bad malloc in what appears to be some code using
> kerberos.  But there's nothing in my setup that I can think of right now
> that should induce a connection to be set up using kerberos.
>
> --Andrew J. Klosterman
> andrew5@ece.cmu.edu

With the debug binaries, I was able to step through the program and get to
what appears to be the function where it bails:  line 1166 of
postgresql-8.1.0/src/interfaces/libpq/fe-secure.c where SSL_free() is
called.

Included below is a copy and paste of my GDB session.  I set a breakpoint
in close_SSL(PGconn *conn), the function that calls SSL_free(), and printed
the value of *conn there, which will hopefully assist in any debugging...

(gdb) break fe-secure.c:1162
No source file named fe-secure.c.
Make breakpoint pending on future shared library load? (y or [n]) y

Breakpoint 1 (fe-secure.c:1162) pending.
(gdb) set args -t andrew5@nc3 -u andrew5 -p testbed -i 10
(gdb) run
Starting program: /.amd/flush/home/andrew5/projects/CVS-controlled/users/andrew5/thesis/code/database/metadata_server/test/client_test -t andrew5@nc3 -u andrew5 -p testbed -i 10
[Thread debugging using libthread_db enabled]
[New Thread 16384 (LWP 2103)]
Breakpoint 2 at 0x402d4bc0: file fe-secure.c, line 1162.
Pending breakpoint "fe-secure.c:1162" resolved
[Switching to Thread 16384 (LWP 2103)]

Breakpoint 2, close_SSL (conn=0x8059d00) at fe-secure.c:1162
1162    {
Current language:  auto; currently c
(gdb) bt
#0  close_SSL (conn=0x8059d00) at fe-secure.c:1162
#1  0x402c6938 in closePGconn (conn=0x8059d00) at fe-connect.c:1976
#2  0x402c6a55 in PQfinish (conn=0x8059d00) at fe-connect.c:2021
#3  0x400308f9 in ecpg_finish (act=0x8059ca8) at connect.c:122
#4  0x40031707 in ECPGdisconnect (lineno=134585600,
    connection_name=0xbffff8a8 "CorrectnessCheck") at connect.c:540
#5  0x0804a036 in DBDisconnect (arg_connection=0xbffff954 "CorrectnessCheck")
    at client_test.pgcc:218
#6  0x0804a58a in DoCorrectnessChecks () at client_test.pgcc:282
#7  0x0804a9f8 in main (argc=9, argv=0xbffffa64) at client_test.pgcc:528
(gdb) list
1157    /*
1158     *      Close SSL connection.
1159     */
1160    static void
1161    close_SSL(PGconn *conn)
1162    {
1163            if (conn->ssl)
1164            {
1165                    SSL_shutdown(conn->ssl);
1166                    SSL_free(conn->ssl);
(gdb) print *conn
$1 = {pghost = 0x80634c0 "nc3", pghostaddr = 0x0, pgport = 0x80634d0 "5432",
  pgunixsocket = 0x0, pgtty = 0x80634e0 "", connect_timeout = 0x0,
  pgoptions = 0x80634f0 "", dbName = 0x80634b0 "andrew5",
  pguser = 0x8063500 "andrew5", pgpass = 0x80634a0 "testbed",
  sslmode = 0x8063510 "prefer", krbsrvname = 0x8063520 "postgres",
  Pfdebug = 0x0, noticeHooks = {noticeRec = 0x40030bd0 <ECPGnoticeReceiver>,
    noticeRecArg = 0x8059ca8,
    noticeProc = 0x402c90c0 <defaultNoticeProcessor>, noticeProcArg = 0x0},
  status = CONNECTION_OK, asyncStatus = PGASYNC_IDLE,
  xactStatus = PQTRANS_IDLE, queryclass = PGQUERY_SIMPLE,
  nonblocking = 0 '\0', copy_is_binary = 0 '\0', copy_already_done = 0,
  notifyHead = 0x0, notifyTail = 0x0, sock = 3, laddr = {addr = {
      ss_family = 2, __ss_align = 92410796,
      __ss_padding = '\0' <repeats 119 times>}, salen = 16}, raddr = {addr = {
      ss_family = 2, __ss_align = 58856364,
      __ss_padding = '\0' <repeats 119 times>}, salen = 16},
  pversion = 196608, sversion = 80100, addrlist = 0x0, addr_cur = 0x0,
  addrlist_family = 0, setenv_state = SETENV_STATE_IDLE, next_eo = 0x0,
  be_pid = 28824, be_key = 583752927, md5Salt = "\000\000\000",
  cryptSalt = "\000", pstatus = 0x807c330, client_encoding = 8,
  verbosity = PQERRORS_DEFAULT, lobjfuncs = 0x0, inBuffer = 0x805a028 "C",
  inBufSize = 16384, inStart = 18, inCursor = 18, inEnd = 18,
  outBuffer = 0x805e030 "X", outBufSize = 16384, outCount = 0,
  outMsgStart = 1, outMsgEnd = 5, result = 0x0, curTuple = 0x0,
  allow_ssl_try = 1 '\001', wait_ssl_try = 0 '\0', ssl = 0x806d1d0,
  peer = 0x807e430,
  peer_dn = "/C=US/ST=Pennsylvania/L=Pittsburgh/O=CMU/PDL/OU=andrew5/CN=nc3.pdl.cmu.local/emailAddress=andrew5@mailinator.com", '\0' <repeats 144 times>,
  peer_cn = "nc3.pdl.cmu.local", '\0' <repeats 15 times>, errorMessage = {
    data = 0x8062038 "", len = 0, maxlen = 256}, workBuffer = {
    data = 0x8062140 "COMMIT", len = 6, maxlen = 256}}
(gdb) s
1163            if (conn->ssl)
(gdb) s
1162    {
(gdb) s
1163            if (conn->ssl)
(gdb) s
1165                    SSL_shutdown(conn->ssl);
(gdb) s
1166                    SSL_free(conn->ssl);
(gdb) s
*** glibc detected *** corrupted double-linked list: 0x0807e428 ***

Program received signal SIGABRT, Aborted.
0x401bf851 in kill () from /lib/libc.so.6
(gdb)


--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Tue, 14 Feb 2006, Stephen Frost wrote:
<snip>
> It's kind of a chicken-and-egg here because the backend decides what
> authentication mechanism to ask for based off the username (at least in
> part) through pg_hba.conf, so you can't find out the authentication
> method until you know the username so all methods to find the username
> have to be exhausted.  You could avoid this by explicitly passing
> 'user=' into the connection parameters though...  Would be interesting
> to know what happens then...

When asking about "explicitly passing 'user=' into the connection
parameters", do you mean that the EXEC SQL CONNECT line that ecpg parses
should specify a username?

My code is using the following statement when making a remote connection
that uses SSL.

EXEC SQL CONNECT TO :l_target AS :l_connection
     USER :l_user IDENTIFIED BY :l_passwd;

The target machine (hosting the database) has "ssl=on" in postgresql.conf,
and its pg_hba.conf (snippet below) has a hostssl line for the client
machine from which I am connecting, so that an SSL connection should be
made.

# TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD
hostssl andrew5     andrew5     172.19.130.4/32       pam passwd

--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
> We may be spending too much time on this one point --- as long as
> Kerberos isn't *writing* into the zero-length alloc, there is nothing
> illegal immoral or fattening about malloc(0).  Can you get ElectricFence
> to not abort right here but continue on to the real problem?
>
>             regards, tom lane

Doing a "man efence" lets me know that setting the EF_ALLOW_MALLOC_0
environment variable ought to let the program continue...  I'll check into
that right now!


--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Mon, 13 Feb 2006, Stephen Frost wrote:

> Hmm, alright, well, this is at least not the fault of the patch of mine
> which was included in Debian's 8.1.2-2 Postgres release. :)  You might
> try compiling some debs with debugging enabled.  This is (reasonably)
> straight-forward:
>
> (as root:)
> aptitude install build-essential debhelper cdbs bison perl libperl-dev \
>     tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
>     libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
>     gettext bzip2 fakeroot
> (as user:)
> apt-get source postgresql-8.1
> cd postgresql-8.1-8.1.0
> export DEB_BUILD_OPTIONS="nostrip"
> dpkg-buildpackage -uc -us -rfakeroot
>
> Should produce .debs in the parent directory which have debugging
> information.  Another useful build option is "noopt", ie:
> export DEB_BUILD_OPTIONS="nostrip noopt", though that could make the
> error disappear.  It'd be terribly nice if you could do this and
> provide a gdb backtrace with debugging... :)
>
>     Thanks,
>
>         Stephen

Alright, I have built a system with the symbols left in the binaries.

It still crashes with the "corrupted double-linked list" error.

Running with ElectricFence the backtrace I get is:

  Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.

ElectricFence Aborting: Allocating 0 bytes, probably a bug.

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 16384 (LWP 1895)]
0x401c4851 in kill () from /lib/libc.so.6
(gdb) bt
#0  0x401c4851 in kill () from /lib/libc.so.6
#1  0x40037dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2  0x40037823 in memalign () from /usr/lib/libefence.so.0
#3  0x400379ad in malloc () from /usr/lib/libefence.so.0
#4  0x40037a10 in calloc () from /usr/lib/libefence.so.0
#5  0x404a282f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6  0x402c9b26 in pg_krb5_init (PQerrormsg=0x0) at fe-auth.c:119
#7  0x402ca304 in pg_fe_getauthname (PQerrormsg=0xbffff29c "l\031")
    at fe-auth.c:176
#8  0x402cc861 in conninfo_parse (conninfo=<value optimized out>,
    errorMessage=0x4057afe8) at fe-connect.c:2719
#9  0x402cc983 in connectOptions1 (conn=0x4057acdc, conninfo=0x0)
    at fe-connect.c:362
#10 0x402cda11 in PQsetdbLogin (pghost=0x40574ffc "nc3", pgport=0x0,
    pgoptions=0x0, pgtty=0x0, dbName=0x40576ff8 "andrew5",
    login=0xbffffc31 "andrew5", pwd=0xbffffc3c "testbed") at fe-connect.c:568
#11 0x40030fe7 in ECPGconnect (lineno=191, c=0, name=0xbffffc22 "andrew5@nc3",
    user=0xbffffc31 "andrew5", passwd=0x0,
    connection_name=0xbffff8b0 "CorrectnessCheck", autocommit=0)
    at connect.c:452
#12 0x08049ecb in DBConnect (arg_connection=0xbffff964 "CorrectnessCheck")
    at client_test.pgcc:191
#13 0x0804a14f in DoCorrectnessChecks () at client_test.pgcc:231
#14 0x0804aa08 in main (argc=9, argv=0xbffffa74) at client_test.pgcc:526

Again, it is showing a bad malloc in what appears to be some code using
kerberos.  But there's nothing in my setup that I can think of right now
that should induce a connection to be set up using kerberos.

--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
> > Tracking down exactly what's tickling the problem in this case could be
> > tricky...
>
> Yeah :-(.  If you aren't able to narrow it further by yourself, please
> try to put together a self-contained test case.
>
>             regards, tom lane

Well, my attempt last night at putting together a test case that manifests
the error that I encountered was a total failure!  The test code executes
flawlessly: no abnormal termination.

There must be something different between the two programs.  But my
original is considerably more complex.  I'll pursue other options for
debugging before returning to figuring out the difference between the
"real" code and the "test-case" code.

--Andrew J. Klosterman
andrew5@ece.cmu.edu

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Tom Lane
Date:
Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> With the debug binaries, I was able to step through the program and get to
> what appears to be the function where it bails:  line 1166 of
> postgresql-8.1.0/src/interfaces/libpq/fe-secure.c where SSL_free() is
> called.

BTW, is the address that glibc says is corrupted consistent from run to
run?  If so, you could narrow down the problem pretty quickly by setting
a hardware watchpoint on that address with gdb.  Any hits that are not
from the malloc subroutines are probably the source of the problem.

            regards, tom lane

Re: BUG #2246: Bad malloc interactions: ecpg, openssl

From
Andrew Klosterman
Date:
On Wed, 15 Feb 2006, Tom Lane wrote:

> Andrew Klosterman <andrew5@ece.cmu.edu> writes:
> > With the debug binaries, I was able to step through the program and get to
> > what appears to be the function where it bails:  line 1166 of
> > postgresql-8.1.0/src/interfaces/libpq/fe-secure.c where SSL_free() is
> > called.
>
> BTW, is the address that glibc says is corrupted consistent from run to
> run?  If so, you could narrow down the problem pretty quickly by setting
> a hardware watchpoint on that address with gdb.  Any hits that are not
> from the malloc subroutines are probably the source of the problem.
>
>             regards, tom lane

The address given by the error message is consistent.  But, setting a
break/watch point for it has been troublesome.

A watchpoint can't be set until the memory is mapped in.  I have narrowed
down the time that the memory is mapped in to being somewhere in a call to
PQconnectPoll() from within connectDBComplete() in
src/interfaces/libpq/fe-connect.c.  With the watchpoint set, though, the
debugger isn't breaking the execution of the program until the error
manifests itself.

Digging around, I can't come up with a way to get information on the
arguments and return results from malloc() every time it is called.
"strace" only does system calls.  The output I get from "ltrace" is not
useful and no options I can see appear to improve the situation.
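
(One possible way to get such a log is glibc's own mtrace() facility from
<mcheck.h>, which records the caller address, pointer, and size of each
malloc/realloc/free, including calls made inside shared libraries.  The
sketch below is untested against this program.  The helper names and the
log path are made up for illustration, and it assumes a glibc system.)

```c
#define _GNU_SOURCE
#include <mcheck.h>
#include <stdlib.h>

/*
 * Hypothetical helper: ask glibc to log allocator calls to trace_path.
 * mtrace() takes the log file name from the MALLOC_TRACE environment
 * variable, so that is set first.
 * Returns 0 on success, -1 if the variable could not be set.
 */
int
start_malloc_trace(const char *trace_path)
{
    if (setenv("MALLOC_TRACE", trace_path, 1) != 0)
        return -1;

    mtrace();       /* install glibc's tracing hooks */
    return 0;
}

/* Stop logging; later allocations are no longer recorded. */
void
stop_malloc_trace(void)
{
    muntrace();
}
```

Calling start_malloc_trace("/tmp/client_test.mtrace") at the top of main(),
before the first ECPGconnect(), should capture the allocations made during
connection setup.  Searching the log for the pointer just below 0x0807e428
would then show who allocated the neighbouring block.  Note that on recent
glibc releases (2.34 and later, to my knowledge) the trace is only written
when libc_malloc_debug.so is preloaded.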

So, I'm kinda stuck.  This bug might be one that gets away...

--Andrew J. Klosterman
andrew5@ece.cmu.edu