Thread: Bug in libpq causes local clients to hang
Lately I've noticed that local (UNIX socket) clients using libpq4 8.1.9 (Debian 8.1.9-0etch1) and the same version of the server can hang forever waiting in poll(). The symptom is that the local client waits forever, using no CPU time, until it is interrupted by some event (such as attaching gdb or strace to it), after which it proceeds normally. From the server's perspective, such clients are in the state "<IDLE> in transaction" as reported via pg_stat_activity. I attached GDB to one such client, and the stack trace is as follows:

#0 0x00002b4f2f914d7f in poll () from /lib/libc.so.6
#1 0x00002b4f3038449f in PQmblen () from /usr/lib/libpq.so.4
#2 0x00002b4f30384580 in pqWaitTimed () from /usr/lib/libpq.so.4
#3 0x00002b4f30383e62 in PQgetResult () from /usr/lib/libpq.so.4
#4 0x00002b4f30383f3e in PQgetResult () from /usr/lib/libpq.so.4
#5 0x00002b4f3025f014 in dbd_st_execute () from /usr/lib/perl5/auto/DBD/Pg/Pg.so
#6 0x00002b4f302548b6 in XS_DBD__Pg__db_do () from /usr/lib/perl5/auto/DBD/Pg/Pg.so
#7 0x00002b4f2fd201f0 in XS_DBI_dispatch () from /usr/lib/perl5/auto/DBI/DBI.so
#8 0x00002b4f2f310b95 in Perl_pp_entersub () from /usr/lib/libperl.so.5.8
#9 0x00002b4f2f30f36e in Perl_runops_standard () from /usr/lib/libperl.so.5.8
#10 0x00002b4f2f2ba7dc in perl_run () from /usr/lib/libperl.so.5.8
#11 0x00000000004017ac in main ()

You'll note that I'm using the DBD::Pg Perl interface. So far I've never seen this happen with TCP connections, only with UNIX sockets. I see it with about 1 in 100 local client invocations. As a workaround I've configured my local clients to use TCP anyway, and this seems to solve the problem.

Is this something that might have been fixed in a post-8.1 version of libpq?

-jwb
On Sun, Mar 23, 2008 at 7:12 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> Lately I've noticed that local (UNIX socket) clients using libpq4
> 8.1.9 (Debian 8.1.9-0etch1) and the same version of the server can
> hang forever waiting in poll(). The symptom is that the local client
> waits forever, using no CPU time, until it is interrupted by some
> event (such as attaching gdb or strace to it), after which it proceeds
> normally. From the server's perspective, such clients are in the
> state "<IDLE> in transaction" as reported via pg_stat_activity. I
> attached GDB to one such client, and the stack trace is as follows:
>
> #0 0x00002b4f2f914d7f in poll () from /lib/libc.so.6
> #1 0x00002b4f3038449f in PQmblen () from /usr/lib/libpq.so.4
> #2 0x00002b4f30384580 in pqWaitTimed () from /usr/lib/libpq.so.4
> #3 0x00002b4f30383e62 in PQgetResult () from /usr/lib/libpq.so.4
> #4 0x00002b4f30383f3e in PQgetResult () from /usr/lib/libpq.so.4
> #5 0x00002b4f3025f014 in dbd_st_execute () from /usr/lib/perl5/auto/DBD/Pg/Pg.so
> #6 0x00002b4f302548b6 in XS_DBD__Pg__db_do () from /usr/lib/perl5/auto/DBD/Pg/Pg.so
> #7 0x00002b4f2fd201f0 in XS_DBI_dispatch () from /usr/lib/perl5/auto/DBI/DBI.so
> #8 0x00002b4f2f310b95 in Perl_pp_entersub () from /usr/lib/libperl.so.5.8
> #9 0x00002b4f2f30f36e in Perl_runops_standard () from /usr/lib/libperl.so.5.8
> #10 0x00002b4f2f2ba7dc in perl_run () from /usr/lib/libperl.so.5.8
> #11 0x00000000004017ac in main ()

Following up to myself, I note that a very similar issue was reported, with a very similar stack, only two days ago, with subject "ecpg program getting stuck" archived at http://groups.google.com/group/pgsql.general/browse_thread/thread/0b7ede57faad803e/9abfd7ab1b7e1d86

-jwb
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> You'll note that I'm using the DBD::Pg Perl interface. So far I've
> never seen this happen with TCP connections, only with UNIX sockets.

If it works over TCP and not over Unix socket, it's a kernel bug. The libpq code doesn't really know the difference after connection setup.

regards, tom lane
On Sun, Mar 23, 2008 at 8:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jeffrey Baker" <jwbaker@gmail.com> writes:
> > You'll note that I'm using the DBD::Pg Perl interface. So far I've
> > never seen this happen with TCP connections, only with UNIX sockets.
>
> If it works over TCP and not over Unix socket, it's a kernel bug.
> The libpq code doesn't really know the difference after connection
> setup.

The same thought occurred to me, but it could also be a race condition which the unix socket is fast enough to trigger but the TCP socket is not fast enough to trigger. I'm peeking around in the code but nothing jumps out yet.

-jwb
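For readers following the exchange above: both positions assume the client is simply blocked in poll() on a connected descriptor, and that this wait looks the same whether the descriptor is a Unix socket or a TCP socket. The sketch below is not libpq code and does not reproduce the hang. It only illustrates, under the assumption of a Linux host with Python 3, that the same poll()-based wait behaves identically on both socket types once they are connected. The helper names `wait_readable`, `tcp_loopback_pair`, and `demo` are hypothetical and exist only for this illustration.

```python
import select
import socket


def wait_readable(sock, timeout_ms):
    """Return True if sock becomes readable within timeout_ms, else False."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    # poll() returns an empty list when the timeout expires with no events.
    return bool(poller.poll(timeout_ms))


def tcp_loopback_pair():
    """Build a connected TCP socket pair over 127.0.0.1 (ephemeral port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def demo(pair):
    """Poll before and after the peer writes; return (idle, ready)."""
    client, server = pair
    try:
        idle = wait_readable(client, 50)     # nothing sent yet
        server.sendall(b"x")                 # peer produces one byte
        ready = wait_readable(client, 1000)  # data should now be pending
        return idle, ready
    finally:
        client.close()
        server.close()


# socket.socketpair() yields a connected AF_UNIX pair on Unix-like systems.
unix_result = demo(socket.socketpair())
tcp_result = demo(tcp_loopback_pair())
```

On a healthy kernel both calls are expected to report that the descriptor is not readable before the write and readable after it. A client stuck in poll() after the peer has already written would contradict that expectation, which is the behaviour under dispute in this thread.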
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> On Sun, Mar 23, 2008 at 8:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If it works over TCP and not over Unix socket, it's a kernel bug.
>> The libpq code doesn't really know the difference after connection
>> setup.

> The same thought occurred to me, but it could also be a race condition
> which the unix socket is fast enough to trigger but the TCP socket is
> not fast enough to trigger. I'm peeking around in the code but
> nothing jumps out yet.

Fairly hard to believe given that you're talking about communication between two sequential processes.

Anyway I'd suggest that the first thing to do is extract a reproducible test case. It'd be useful to see if it hangs on other platforms...

regards, tom lane
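Since the hang was reported at roughly 1 in 100 invocations, one way to approach the suggested test case is a small driver that runs the client many times and counts the runs that exceed a time limit. The following is a minimal sketch under stated assumptions, not something posted in the thread: `count_hangs` is a hypothetical helper, and the client command and time limit are placeholders to be replaced with a real script that connects over the Unix socket.

```python
import subprocess
import sys


def count_hangs(cmd, runs, timeout_s):
    """Run cmd `runs` times; count the runs that exceed timeout_s seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it has not exited within timeout_s.
            subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


# Stand-in commands so the sketch runs without a database:
# one exits immediately, the other sleeps past the limit.
quick = count_hangs([sys.executable, "-c", "pass"], runs=3, timeout_s=30)
stuck = count_hangs(
    [sys.executable, "-c", "import time; time.sleep(5)"],
    runs=2,
    timeout_s=0.5,
)
```

With a real client in place of the stand-ins, running the same driver once against the Unix socket and once against TCP, and then on another platform, would give comparable hang counts for each configuration.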
On Mon, Mar 24, 2008 at 9:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jeffrey Baker" <jwbaker@gmail.com> writes:
> > On Sun, Mar 23, 2008 at 8:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> If it works over TCP and not over Unix socket, it's a kernel bug.
> >> The libpq code doesn't really know the difference after connection
> >> setup.
>
> > The same thought occurred to me, but it could also be a race condition
> > which the unix socket is fast enough to trigger but the TCP socket is
> > not fast enough to trigger. I'm peeking around in the code but
> > nothing jumps out yet.
>
> Fairly hard to believe given that you're talking about communication
> between two sequential processes. Anyway I'd suggest that the first
> thing to do is extract a reproducible test case. It'd be useful
> to see if it hangs on other platforms...

The stack trace doesn't actually make sense, does it? I think that (at least) the PQmblen frame is spurious.

-jwb
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> The stack trace doesn't actually make sense, does it? I think that
> (at least) the PQmblen frame is spurious.

Yeah, it's obviously lying to some extent. I think you need a copy of libpq with debug symbols enabled to get a more accurate trace.

regards, tom lane