Thread: PGconn occasionally freezes after select() timeout
Dear programmers,

I am using the handy LISTEN/NOTIFY mechanism provided by PostgreSQL in my client application, written in C with libpq. It is a simple single-threaded, single-process client (thanks to NOTIFY!), which waits for the notification using a select() call with a finite timeout. After a timeout it performs one action, after a notification another.

The problem: most of the time everything works fine; hundreds of successful or even timed-out select() calls get handled without any problem. But from time to time (e.g. after several hours), the select() call returns with a timeout and then a request on the open PGconn (a simple query) gets stuck: the call hangs and never returns. A simplified code snippet:

    s = PQsocket(pg);
    for (;;) {
        FD_ZERO(&input_mask);
        FD_SET(s, &input_mask);
        rc = select(s + 1, &input_mask, NULL, NULL, select_timeout);
        if (rc == 0) {
            fprintf(stderr, "info: select timeout\n");
            pr = PQexec(pg, "SELECT now()");  /* <= does not return */
            /* etc. */
        }
        /* etc. */
    }

The client is connected to a remote SQL server via TCP/IP; SSL may be involved (not sure right now). I guess the problem could be caused by an occasional connection failure. However, I would expect PQexec() or the other functions to return an error in such a case. I tried to test for it using PQstatus(), but with no success.

The workaround, which works after a fashion, is to call PQreset() *always* after every timed-out select(). However, I am very unhappy with such a "solution": after the connection is reset, I need to LISTEN again, and I must expect that the DB state has changed during the reset. This is natural in the error-recovery case, after a connection or other failure. BUT since I need to reset the connection almost always, after _any_ timed-out select(), it is really harmful to my application.

Please, can you tell me what I am doing wrong? Thank you very much.

Best regards, Marek.
On Fri, Nov 13, 2009 at 9:22 AM, Marek Peca <marek@duch.cz> wrote:
> Please, can you tell me what I am doing wrong?

I have very similar code written in Perl, and I never observe failures like you see after a timeout on the select call. This code has been in production on fairly busy systems for several years now.
Dear Vick,

> I have very similar code written in Perl, and I never observe failures
> like you see after a timeout on the select call. This code has been
> in production on fairly busy systems for several years now.

Can you tell me which libpq and server versions you are running? I have constant problems with libpq 8.3.7, connecting remotely to a server of the same version. Are you connecting remotely, or through a failure-safe interface like localhost? I can imagine that this kind of error stays hidden if the connection link is good. Maybe I should try the Perl way, to see whether it shows the same problem in my configuration... oh no. Is there anybody familiar with this kind of error?

Many thanks, Marek.
Marek Peca <marek@duch.cz> writes:
> The problem: most of the time everything works fine; hundreds of
> successful or even timed-out select() calls get handled without any
> problem. But from time to time (e.g. after several hours), the select()
> call returns with a timeout and then a request on the open PGconn
> (a simple query) gets stuck: the call hangs and never returns.

What that sounds like is a network-level problem. In particular, if there's a NAT-capable router between your client and server machines, it's probably dropping the connection after a certain period of inactivity. You may be able to fix this within Postgres by adjusting the server's tcp_keepalives_idle setting. If the server is on a platform that doesn't support changing the keepalive interval, the only recourse is to fix the router.

regards, tom lane
On Fri, 13 Nov 2009, Tom Lane wrote:
> What that sounds like is a network-level problem. In particular, if
> there's a NAT-capable router between your client and server machines,
> it's probably dropping the connection after a certain period of
> inactivity.

Yes, it probably is. The connection takes a rather convoluted path, through several NATs and VPNs and some cruel 3rd-party routers, so an ugly intervention by such a router seems quite possible. (I am not sure about inactivity, though, since the traffic is constant, with no more than 10 minutes, and in practice around 10-20 seconds, between two NOTIFYs.)

> You may be able to fix this within Postgres by adjusting the server's
> tcp_keepalives_idle setting. If the server is on a platform that
> doesn't support changing the keepalive interval, the only recourse is to
> fix the router.

Thank you for the suggestion of tcp_keepalives_idle, I may try it. However, I cannot understand why PQexec() (or at least PQstatus()) hangs instead of returning some error. I know that a broken TCP connection may involve long timeouts, but it could return at least after several minutes, couldn't it? An ordinary telnet or ssh session will tell me that I am writing my characters to a TCP connection which died some time ago.

Thank you for your answers, Marek.
Marek Peca <marek@duch.cz> writes:
> However, I cannot understand why PQexec() (or at least PQstatus())
> hangs instead of returning some error. I know that a broken TCP
> connection may involve long timeouts, but it could return at least
> after several minutes, couldn't it?

"Several hours" might be more like it. How long have you waited?

In any case, that complaint should be directed to your kernel vendor, not to us. We do not control how long the TCP stack waits before declaring the connection dead.

regards, tom lane
> "Several hours" might be more like it.

Better than infinity, of course.

> How long have you waited?

Two days, for example.

> In any case, that complaint should be directed to your kernel vendor,
> not to us. We do not control how long the TCP stack waits before
> declaring the connection dead.

Well, this is what I wanted to hear. So if I call PQstatus() or PQexec(), libpq sends some data down the TCP pipe, and you expect that my broken underlying TCP/IP subsystem pretends that the connection is OK and the data have been sent. Right?

Thank you very much for your explanation. Marek
Marek Peca <marek@duch.cz> writes:
>> In any case, that complaint should be directed to your kernel vendor,
>> not to us. We do not control how long the TCP stack waits before
>> declaring the connection dead.

> Well, this is what I wanted to hear. So if I call PQstatus() or
> PQexec(), libpq sends some data down the TCP pipe, and you expect that
> my broken underlying TCP/IP subsystem pretends that the connection is
> OK and the data have been sent. Right?

Well, if you *never* get a failure, then yeah, I think you have a broken TCP stack. The default timeouts on this sort of thing are annoyingly long, but they aren't infinite.

regards, tom lane
On Fri, 13 Nov 2009, Tom Lane wrote:
> Well, if you *never* get a failure, then yeah, I think you have a broken
> TCP stack. The default timeouts on this sort of thing are annoyingly
> long, but they aren't infinite.

Yes, this is the case. Both ends are running on a GNU/Linux system, and I expect the core components are OK; however, as I said, there are some weird VPN relays in between, and maybe one of them pretends that the connection is OK while it has been broken for a long time.

Thank you for your answers. Now I am glad I can solve the problem and throw away the ugly workarounds. I hope my original code with NOTIFY and select() is all right, then.

Best regards, Marek
On Fri, Nov 13, 2009 at 10:07 AM, Marek Peca <marek@duch.cz> wrote:
> Dear Vick,
>
>> I have very similar code written in Perl, and I never observe failures
>> like you see after a timeout on the select call. This code has been
>> in production on fairly busy systems for several years now.
>
> can you tell me which libpq and server versions you are running?
> I have constant problems with libpq 8.3.7, connecting remotely to a
> server of the same version.

Right now we're running Pg 8.3.7, but we have been doing this since 8.1 days, if not earlier. Everything here is on a local LAN and runs FreeBSD.

Based on the other messages in this thread, my vote is that one of your VPN or NAT boxes is breaking the expectations of your network stack.