Thread: BUG #1736: endless loop in PQconnectdb
The following bug has been logged online: Bug reference: 1736 Logged by: Karsten Desler Email address: pgsql@soohrt.org PostgreSQL version: 7.4.7 Operating system: debian sarge Description: endless loop in PQconnectdb Details: I've got a pretty flaky tcpip connection to a Postgres 7.4.7 database server and often times (once or twice a day) my program gets stuck in an endless busy-loop in PGconnectdb. An excerpt from a strace: poll([{fd=389, events=POLLIN|POLLERR, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1 recv(389, "", 1, 0) = 0 poll([{fd=389, events=POLLIN|POLLERR, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1 recv(389, "", 1, 0) = 0 ... SSL is not involved. Sadly I can't say how far along in the connection process the bug is triggered, but I could install a libpq3 with debugging symbols and add a few strategically placed gdb watch/break points, if needed.
"Karsten Desler" <pgsql@soohrt.org> writes: > I've got a pretty flaky tcpip connection to a Postgres 7.4.7 database server > and often times (once or twice a day) my program gets stuck in an endless > busy-loop in PGconnectdb. Hmm. Maybe you have a test case for the proposed patch for bug #1467? Please see the patch posted in pgsql-patches a couple days ago, and let us know if it helps. regards, tom lane
Yes, please --- the patch is at: http://archives.postgresql.org/pgsql-patches/2005-06/msg00486.php --------------------------------------------------------------------------- Tom Lane wrote: > "Karsten Desler" <pgsql@soohrt.org> writes: > > I've got a pretty flaky tcpip connection to a Postgres 7.4.7 database server > > and often times (once or twice a day) my program gets stuck in an endless > > busy-loop in PGconnectdb. > > Hmm. Maybe you have a test case for the proposed patch for bug #1467? > Please see the patch posted in pgsql-patches a couple days ago, and let > us know if it helps. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
* Bruce Momjian wrote: > Yes, please --- the patch is at: > > http://archives.postgresql.org/pgsql-patches/2005-06/msg00486.php Thanks. I've applied the patch now, and I'll keep you posted. Best regards, Karsten Desler
* Karsten Desler wrote: > * Bruce Momjian wrote: > > Yes, please --- the patch is at: > > > > http://archives.postgresql.org/pgsql-patches/2005-06/msg00486.php > > Thanks. > I've applied the patch now, and I'll keep you posted. It doesn't seem to have helped, and while poking around a little, I found another annoyance. libpq seems to leak memory if I pass a dns name as host in conninfo. It doesn't leak when I do the getaddrinfo myself and pass an IP. root 16580 0.0 0.2 4004 1304 ? S 10:07 0:00 monitor xxx.xxx.xxx.xxx root 22434 0.0 1.9 47328 9240 ? S Jun07 5:42 monitor xxx.xxx.xxx.yyy ==9980== 4648 bytes in 166 blocks are definitely lost in loss record 4 of 4 ==9980== at 0x1B90459D: malloc (vg_replace_malloc.c:130) ==9980== by 0x1BC39E3B: (within /lib/tls/i686/cmov/libresolv-2.3.2.so) ==9980== by 0x1BC38B92: __libc_res_nquery (in /lib/tls/i686/cmov/libresolv-2.3.2.so) ==9980== by 0x1BC39289: (within /lib/tls/i686/cmov/libresolv-2.3.2.so) ==9980== by 0x1BC38E8F: __libc_res_nsearch (in /lib/tls/i686/cmov/libresolv-2.3.2.so) ==9980== by 0x1BDA907D: ??? ==9980== by 0x1B9F0A65: (within /lib/tls/i686/cmov/libc-2.3.2.so) ==9980== by 0x1B9F1673: getaddrinfo (in /lib/tls/i686/cmov/libc-2.3.2.so) ==9980== by 0x1B9259A1: getaddrinfo_all (in /usr/lib/libpq.so.3.1) ==9980== by 0x1B916F3B: (within /usr/lib/libpq.so.3.1) ==9980== by 0x1B9164E9: PQconnectStart (in /usr/lib/libpq.so.3.1) ==9980== by 0x1B916471: PQconnectdb (in /usr/lib/libpq.so.3.1) Best regards, Karsten Desler
Karsten Desler wrote: > * Karsten Desler wrote: > > * Bruce Momjian wrote: > > > Yes, please --- the patch is at: > > > > > > http://archives.postgresql.org/pgsql-patches/2005-06/msg00486.php > > > > Thanks. > > I've applied the patch now, and I'll keep you posted. > > It doesn't seem to have helped, and while poking around a little, I > found another annoyance. libpq seems to leak memory if I pass a dns name as > host in conninfo. It doesn't leak when I do the getaddrinfo myself and pass > an IP. > > root 16580 0.0 0.2 4004 1304 ? S 10:07 0:00 monitor xxx.xxx.xxx.xxx > root 22434 0.0 1.9 47328 9240 ? S Jun07 5:42 monitor xxx.xxx.xxx.yyy > > ==9980== 4648 bytes in 166 blocks are definitely lost in loss record 4 of 4 > ==9980== at 0x1B90459D: malloc (vg_replace_malloc.c:130) > ==9980== by 0x1BC39E3B: (within /lib/tls/i686/cmov/libresolv-2.3.2.so) > ==9980== by 0x1BC38B92: __libc_res_nquery (in /lib/tls/i686/cmov/libresolv-2.3.2.so) > ==9980== by 0x1BC39289: (within /lib/tls/i686/cmov/libresolv-2.3.2.so) > ==9980== by 0x1BC38E8F: __libc_res_nsearch (in /lib/tls/i686/cmov/libresolv-2.3.2.so) > ==9980== by 0x1BDA907D: ??? > ==9980== by 0x1B9F0A65: (within /lib/tls/i686/cmov/libc-2.3.2.so) > ==9980== by 0x1B9F1673: getaddrinfo (in /lib/tls/i686/cmov/libc-2.3.2.so) > ==9980== by 0x1B9259A1: getaddrinfo_all (in /usr/lib/libpq.so.3.1) > ==9980== by 0x1B916F3B: (within /usr/lib/libpq.so.3.1) > ==9980== by 0x1B9164E9: PQconnectStart (in /usr/lib/libpq.so.3.1) > ==9980== by 0x1B916471: PQconnectdb (in /usr/lib/libpq.so.3.1) I think what you are seeing is that the getaddrinfo memory is placed in the PGconn structure that isn't freed until PQclear is called. Does your test call PQclear()? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
* Bruce Momjian wrote: > I think what you are seeing is that the getaddrinfo memory is placed in > the PGconn structure that isn't freed until PQclear is called. Does > your test call PQclear()? s/PQclear/PQfinish/ It does call PQclear on the result, and PQfinish on the connection. The code is attached. With postgres doing the dns lookup: fubar:~# while true; do ps aux|grep -v grep|grep test; sleep 30; done root 3245 3.6 0.2 4056 1352 pts/3 S+ 10:37 0:01 ./test root 3245 3.6 0.3 4056 1456 pts/3 S+ 10:37 0:02 ./test root 3245 3.7 0.3 4184 1560 pts/3 S+ 10:37 0:03 ./test root 3245 3.7 0.3 4312 1668 pts/3 R+ 10:37 0:04 ./test root 3245 3.6 0.3 4440 1760 pts/3 S+ 10:37 0:05 ./test with an output of: called PQconnectdb: 0x804a008 called PQexec: 0x80dcbe0 calling PQclear: 0x80dcbe0 calling PQfinish: 0x804a008 ... called PQconnectdb: 0x804a008 called PQexec: 0x80dcea0 calling PQclear: 0x80dcea0 calling PQfinish: 0x804a008 ... called PQconnectdb: 0x804a008 called PQexec: 0x80dd620 calling PQclear: 0x80dd620 calling PQfinish: 0x804a008 ... and valgrind complaining about lost blocks: ==3290== 35224 bytes in 1258 blocks are definitely lost in loss record 8 of 8 ==3290== at 0x1B90459D: malloc (vg_replace_malloc.c:130) ==3290== by 0x1BC38E3B: (within /lib/tls/i686/cmov/libresolv-2.3.2.so) ==3290== by 0x1BC37B92: __libc_res_nquery (in /lib/tls/i686/cmov/libresolv-2.3.2.so) ==3290== by 0x1BC38289: (within /lib/tls/i686/cmov/libresolv-2.3.2.so) ==3290== by 0x1BC37E8F: __libc_res_nsearch (in /lib/tls/i686/cmov/libresolv-2.3.2.so) ==3290== by 0x1BDA307D: ??? ==3290== by 0x1B9EFA65: (within /lib/tls/i686/cmov/libc-2.3.2.so) ==3290== by 0x1B9F0673: getaddrinfo (in /lib/tls/i686/cmov/libc-2.3.2.so) ==3290== by 0x1B925701: getaddrinfo_all (in /usr/lib/libpq.so.3.1) ==3290== by 0x1B916F0B: (within /usr/lib/libpq.so.3.1) ==3290== by 0x1B9164B9: PQconnectStart (in /usr/lib/libpq.so.3.1) ==3290== by 0x1B916441: PQconnectdb (in /usr/lib/libpq.so.3.1) With the IP in the host field: fubar:~# while true; do ps aux|grep -v grep|grep test; sleep 30; done root 3312 1.4 0.2 3872 1092 pts/3 S+ 10:42 0:00 ./test root 3312 1.6 0.2 3872 1092 pts/3 S+ 10:42 0:00 ./test root 3312 1.9 0.2 3872 1092 pts/3 S+ 10:42 0:01 ./test root 3312 2.0 0.2 3872 1092 pts/3 S+ 10:42 0:01 ./test root 3312 2.0 0.2 3872 1092 pts/3 S+ 10:42 0:02 ./test output: called PQconnectdb: 0x804a008 called PQexec: 0x80525b8 calling PQclear: 0x80525b8 calling PQfinish: 0x804a008 ... called PQconnectdb: 0x804a008 called PQexec: 0x80525b8 calling PQclear: 0x80525b8 calling PQfinish: 0x804a008 ... called PQconnectdb: 0x804a008 called PQexec: 0x80525b8 calling PQclear: 0x80525b8 calling PQfinish: 0x804a008 ... and no leaking output from valgrind. Best regards, Karsten Desler
Attachment
Karsten Desler <kdesler@soohrt.org> writes: > * Bruce Momjian wrote: >> I think what you are seeing is that the getaddrinfo memory is placed in >> the PGconn structure that isn't freed until PQclear is called. Does >> your test call PQclear()? > s/PQclear/PQfinish/ > It does call PQclear on the result, and PQfinish on the connection. In that case I think there is no doubt that you've found a bug in getaddrinfo/freeaddrinfo, and you ought to be reporting it to your libc provider. We do call freeaddrinfo on the result of getaddrinfo, so if not everything is cleaned up, that's a library bug not ours. You could check this by reducing the test case to getaddrinfo() then freeaddrinfo() using the same parameters that fe-connect.c passes. regards, tom lane
* Tom Lane wrote: > Karsten Desler <kdesler@soohrt.org> writes: > > * Bruce Momjian wrote: > >> I think what you are seeing is that the getaddrinfo memory is placed in > >> the PGconn structure that isn't freed until PQclear is called. Does > >> your test call PQclear()? > > > s/PQclear/PQfinish/ > > It does call PQclear on the result, and PQfinish on the connection. > > In that case I think there is no doubt that you've found a bug in > getaddrinfo/freeaddrinfo, and you ought to be reporting it to your > libc provider. We do call freeaddrinfo on the result of getaddrinfo, > so if not everything is cleaned up, that's a library bug not ours. > > You could check this by reducing the test case to getaddrinfo() > then freeaddrinfo() using the same parameters that fe-connect.c > passes. Indeed. Sorry for the noise. The GNU libc 2.3.2 leaks ai->ai_canonname for every struct addrinfo in the result list. The original problem hasn't happened again (it seems like the faulty ethernet switch, that was the cause for the flaky connection was finally replaced). Anyway, if it happenes again, I'll notify you. Regards, Karsten Desler