Thread: BUG #1736: endless loop in PQconnectdb

BUG #1736: endless loop in PQconnectdb

From
"Karsten Desler"
Date:
The following bug has been logged online:

Bug reference:      1736
Logged by:          Karsten Desler
Email address:      pgsql@soohrt.org
PostgreSQL version: 7.4.7
Operating system:   debian sarge
Description:        endless loop in PQconnectdb
Details:

I've got a pretty flaky tcpip connection to a Postgres 7.4.7 database server
and often times (once or twice a day) my program gets stuck in an endless
busy-loop in PGconnectdb.

An excerpt from a strace:
poll([{fd=389, events=POLLIN|POLLERR, revents=POLLIN|POLLERR|POLLHUP}], 1,
-1) = 1
recv(389, "", 1, 0)                     = 0
poll([{fd=389, events=POLLIN|POLLERR, revents=POLLIN|POLLERR|POLLHUP}], 1,
-1) = 1
recv(389, "", 1, 0)                     = 0
...

SSL is not involved. Sadly I can't say how far along in the connection
process the bug is triggered, but I could install a libpq3 with debugging
symbols and add a few strategically placed gdb watch/break points, if
needed.

Re: BUG #1736: endless loop in PQconnectdb

From
Tom Lane
Date:
"Karsten Desler" <pgsql@soohrt.org> writes:
> I've got a pretty flaky tcpip connection to a Postgres 7.4.7 database server
> and often times (once or twice a day) my program gets stuck in an endless
> busy-loop in PGconnectdb.

Hmm.  Maybe you have a test case for the proposed patch for bug #1467?
Please see the patch posted in pgsql-patches a couple days ago, and let
us know if it helps.

            regards, tom lane

Re: BUG #1736: endless loop in PQconnectdb

From
Bruce Momjian
Date:
Yes, please --- the patch is at:

    http://archives.postgresql.org/pgsql-patches/2005-06/msg00486.php

---------------------------------------------------------------------------

Tom Lane wrote:
> "Karsten Desler" <pgsql@soohrt.org> writes:
> > I've got a pretty flaky tcpip connection to a Postgres 7.4.7 database server
> > and often times (once or twice a day) my program gets stuck in an endless
> > busy-loop in PGconnectdb.
>
> Hmm.  Maybe you have a test case for the proposed patch for bug #1467?
> Please see the patch posted in pgsql-patches a couple days ago, and let
> us know if it helps.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: BUG #1736: endless loop in PQconnectdb

From
Karsten Desler
Date:
* Bruce Momjian wrote:
> Yes, please --- the patch is at:
>
>     http://archives.postgresql.org/pgsql-patches/2005-06/msg00486.php

Thanks.
I've applied the patch now, and I'll keep you posted.

Best regards,
 Karsten Desler

Re: BUG #1736: endless loop in PQconnectdb

From
Karsten Desler
Date:
* Karsten Desler wrote:
> * Bruce Momjian wrote:
> > Yes, please --- the patch is at:
> >
> >     http://archives.postgresql.org/pgsql-patches/2005-06/msg00486.php
>
> Thanks.
> I've applied the patch now, and I'll keep you posted.

It doesn't seem to have helped, and while poking around a little, I
found another annoyance. libpq seems to leak memory if I pass a dns name as
host in conninfo. It doesn't leak when I do the getaddrinfo myself and pass
an IP.

root     16580  0.0  0.2  4004 1304 ?        S    10:07   0:00 monitor xxx.xxx.xxx.xxx
root     22434  0.0  1.9 47328 9240 ?        S    Jun07   5:42 monitor xxx.xxx.xxx.yyy

==9980== 4648 bytes in 166 blocks are definitely lost in loss record 4 of 4
==9980==    at 0x1B90459D: malloc (vg_replace_malloc.c:130)
==9980==    by 0x1BC39E3B: (within /lib/tls/i686/cmov/libresolv-2.3.2.so)
==9980==    by 0x1BC38B92: __libc_res_nquery (in /lib/tls/i686/cmov/libresolv-2.3.2.so)
==9980==    by 0x1BC39289: (within /lib/tls/i686/cmov/libresolv-2.3.2.so)
==9980==    by 0x1BC38E8F: __libc_res_nsearch (in /lib/tls/i686/cmov/libresolv-2.3.2.so)
==9980==    by 0x1BDA907D: ???
==9980==    by 0x1B9F0A65: (within /lib/tls/i686/cmov/libc-2.3.2.so)
==9980==    by 0x1B9F1673: getaddrinfo (in /lib/tls/i686/cmov/libc-2.3.2.so)
==9980==    by 0x1B9259A1: getaddrinfo_all (in /usr/lib/libpq.so.3.1)
==9980==    by 0x1B916F3B: (within /usr/lib/libpq.so.3.1)
==9980==    by 0x1B9164E9: PQconnectStart (in /usr/lib/libpq.so.3.1)
==9980==    by 0x1B916471: PQconnectdb (in /usr/lib/libpq.so.3.1)

Best regards,
 Karsten Desler

Re: BUG #1736: endless loop in PQconnectdb

From
Bruce Momjian
Date:
Karsten Desler wrote:
> * Karsten Desler wrote:
> > * Bruce Momjian wrote:
> > > Yes, please --- the patch is at:
> > >
> > >     http://archives.postgresql.org/pgsql-patches/2005-06/msg00486.php
> >
> > Thanks.
> > I've applied the patch now, and I'll keep you posted.
>
> It doesn't seem to have helped, and while poking around a little, I
> found another annoyance. libpq seems to leak memory if I pass a dns name as
> host in conninfo. It doesn't leak when I do the getaddrinfo myself and pass
> an IP.
>
> root     16580  0.0  0.2  4004 1304 ?        S    10:07   0:00 monitor xxx.xxx.xxx.xxx
> root     22434  0.0  1.9 47328 9240 ?        S    Jun07   5:42 monitor xxx.xxx.xxx.yyy
>
> ==9980== 4648 bytes in 166 blocks are definitely lost in loss record 4 of 4
> ==9980==    at 0x1B90459D: malloc (vg_replace_malloc.c:130)
> ==9980==    by 0x1BC39E3B: (within /lib/tls/i686/cmov/libresolv-2.3.2.so)
> ==9980==    by 0x1BC38B92: __libc_res_nquery (in /lib/tls/i686/cmov/libresolv-2.3.2.so)
> ==9980==    by 0x1BC39289: (within /lib/tls/i686/cmov/libresolv-2.3.2.so)
> ==9980==    by 0x1BC38E8F: __libc_res_nsearch (in /lib/tls/i686/cmov/libresolv-2.3.2.so)
> ==9980==    by 0x1BDA907D: ???
> ==9980==    by 0x1B9F0A65: (within /lib/tls/i686/cmov/libc-2.3.2.so)
> ==9980==    by 0x1B9F1673: getaddrinfo (in /lib/tls/i686/cmov/libc-2.3.2.so)
> ==9980==    by 0x1B9259A1: getaddrinfo_all (in /usr/lib/libpq.so.3.1)
> ==9980==    by 0x1B916F3B: (within /usr/lib/libpq.so.3.1)
> ==9980==    by 0x1B9164E9: PQconnectStart (in /usr/lib/libpq.so.3.1)
> ==9980==    by 0x1B916471: PQconnectdb (in /usr/lib/libpq.so.3.1)

I think what you are seeing is that the getaddrinfo memory is placed in
the PGconn structure that isn't freed until PQclear is called.  Does
your test call PQclear()?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: BUG #1736: endless loop in PQconnectdb

From
Karsten Desler
Date:
* Bruce Momjian wrote:
> I think what you are seeing is that the getaddrinfo memory is placed in
> the PGconn structure that isn't freed until PQclear is called.  Does
> your test call PQclear()?

s/PQclear/PQfinish/
It does call PQclear on the result, and PQfinish on the connection.
The code is attached.

With postgres doing the dns lookup:
fubar:~# while true; do ps aux|grep -v grep|grep test; sleep 30; done
root      3245  3.6  0.2  4056 1352 pts/3    S+   10:37   0:01 ./test
root      3245  3.6  0.3  4056 1456 pts/3    S+   10:37   0:02 ./test
root      3245  3.7  0.3  4184 1560 pts/3    S+   10:37   0:03 ./test
root      3245  3.7  0.3  4312 1668 pts/3    R+   10:37   0:04 ./test
root      3245  3.6  0.3  4440 1760 pts/3    S+   10:37   0:05 ./test

with an output of:
called PQconnectdb: 0x804a008
called PQexec: 0x80dcbe0
calling PQclear: 0x80dcbe0
calling PQfinish: 0x804a008
...
called PQconnectdb: 0x804a008
called PQexec: 0x80dcea0
calling PQclear: 0x80dcea0
calling PQfinish: 0x804a008
...
called PQconnectdb: 0x804a008
called PQexec: 0x80dd620
calling PQclear: 0x80dd620
calling PQfinish: 0x804a008
...

and valgrind complaining about lost blocks:
==3290== 35224 bytes in 1258 blocks are definitely lost in loss record 8 of 8
==3290==    at 0x1B90459D: malloc (vg_replace_malloc.c:130)
==3290==    by 0x1BC38E3B: (within /lib/tls/i686/cmov/libresolv-2.3.2.so)
==3290==    by 0x1BC37B92: __libc_res_nquery (in /lib/tls/i686/cmov/libresolv-2.3.2.so)
==3290==    by 0x1BC38289: (within /lib/tls/i686/cmov/libresolv-2.3.2.so)
==3290==    by 0x1BC37E8F: __libc_res_nsearch (in /lib/tls/i686/cmov/libresolv-2.3.2.so)
==3290==    by 0x1BDA307D: ???
==3290==    by 0x1B9EFA65: (within /lib/tls/i686/cmov/libc-2.3.2.so)
==3290==    by 0x1B9F0673: getaddrinfo (in /lib/tls/i686/cmov/libc-2.3.2.so)
==3290==    by 0x1B925701: getaddrinfo_all (in /usr/lib/libpq.so.3.1)
==3290==    by 0x1B916F0B: (within /usr/lib/libpq.so.3.1)
==3290==    by 0x1B9164B9: PQconnectStart (in /usr/lib/libpq.so.3.1)
==3290==    by 0x1B916441: PQconnectdb (in /usr/lib/libpq.so.3.1)


With the IP in the host field:
fubar:~# while true; do ps aux|grep -v grep|grep test; sleep 30; done
root      3312  1.4  0.2  3872 1092 pts/3    S+   10:42   0:00 ./test
root      3312  1.6  0.2  3872 1092 pts/3    S+   10:42   0:00 ./test
root      3312  1.9  0.2  3872 1092 pts/3    S+   10:42   0:01 ./test
root      3312  2.0  0.2  3872 1092 pts/3    S+   10:42   0:01 ./test
root      3312  2.0  0.2  3872 1092 pts/3    S+   10:42   0:02 ./test

output:
called PQconnectdb: 0x804a008
called PQexec: 0x80525b8
calling PQclear: 0x80525b8
calling PQfinish: 0x804a008
...
called PQconnectdb: 0x804a008
called PQexec: 0x80525b8
calling PQclear: 0x80525b8
calling PQfinish: 0x804a008
...
called PQconnectdb: 0x804a008
called PQexec: 0x80525b8
calling PQclear: 0x80525b8
calling PQfinish: 0x804a008
...

and no leaking output from valgrind.

Best regards,
 Karsten Desler

Attachment

Re: BUG #1736: endless loop in PQconnectdb

From
Tom Lane
Date:
Karsten Desler <kdesler@soohrt.org> writes:
> * Bruce Momjian wrote:
>> I think what you are seeing is that the getaddrinfo memory is placed in
>> the PGconn structure that isn't freed until PQclear is called.  Does
>> your test call PQclear()?

> s/PQclear/PQfinish/
> It does call PQclear on the result, and PQfinish on the connection.

In that case I think there is no doubt that you've found a bug in
getaddrinfo/freeaddrinfo, and you ought to be reporting it to your
libc provider.  We do call freeaddrinfo on the result of getaddrinfo,
so if not everything is cleaned up, that's a library bug not ours.

You could check this by reducing the test case to getaddrinfo()
then freeaddrinfo() using the same parameters that fe-connect.c
passes.

            regards, tom lane

Re: BUG #1736: endless loop in PQconnectdb

From
Karsten Desler
Date:
* Tom Lane wrote:
> Karsten Desler <kdesler@soohrt.org> writes:
> > * Bruce Momjian wrote:
> >> I think what you are seeing is that the getaddrinfo memory is placed in
> >> the PGconn structure that isn't freed until PQclear is called.  Does
> >> your test call PQclear()?
>
> > s/PQclear/PQfinish/
> > It does call PQclear on the result, and PQfinish on the connection.
>
> In that case I think there is no doubt that you've found a bug in
> getaddrinfo/freeaddrinfo, and you ought to be reporting it to your
> libc provider.  We do call freeaddrinfo on the result of getaddrinfo,
> so if not everything is cleaned up, that's a library bug not ours.
>
> You could check this by reducing the test case to getaddrinfo()
> then freeaddrinfo() using the same parameters that fe-connect.c
> passes.

Indeed. Sorry for the noise.
The GNU libc 2.3.2 leaks ai->ai_canonname for every struct addrinfo
in the result list.

The original problem hasn't happened again (it seems like the faulty
ethernet switch, that was the cause for the flaky connection was
finally replaced). Anyway, if it happenes again, I'll notify you.

Regards,
 Karsten Desler