Thread: PGconn gets frozen ocassionally after select() timeout

PGconn gets frozen ocassionally after select() timeout

From
Marek Peca
Date:
Dear programmers,

I am using the handy LISTEN/NOTIFY mechanism provided by PostgreSQL in my
client application, written in C with libpq. It is a simple
single-threaded single-process client (thanks to the NOTIFIcation!),
waiting for the notification using select() call with finite timeout.
After timeout, it performs some action, after notification something else.

The problem: most of time, everything works fine, hundreds of successful
or even timed-out selects() get handled without any problem. But time to
time (eg. after several hours), the select() call returns with a timeout
and then, a request to the opened PQconn (simple query) gets stuck, the
call hangs and never returns.

The simplified code snippet is:

s = PQsocket(pg);
for (;;) {
   FD_ZERO(&input_mask);
   FD_SET(s, &input_mask);
   rc = select(s + 1, &input_mask, NULL, NULL, select_timeout);

   if (rc == 0) {
     fprintf(stderr, "info: select timeout\n");
     pr = PQexec(pg, "SELECT now()");  /********* <= does not return ****/
     /* etc. */

   }
   /* etc. */
}

The client is connected to remote SQL server via TCP/IP, SSL may have been
involved in (not sure right now). I guess that the problem could be caused
by ocassional connection failure. However, I would expect, that PQexec()
or other functions should return an error in such a case. I tried to test
it using PQstatus(), but no success.

The workaround, which works somehow, is to call PQreset() *always* after
every timed-out select(). However, I am very unhappy with such a
"solution", because after the connection is reset, I need to LISTEN again
and I must expect, that DB state has changed during the reset. This is
natural in error recovery case, after a connection or other failure. BUT
since I need to reset the connection almost always, after _any_ timed-out
select(), it is really harmful to my application.


Please, can you tell me, what am I doing wrong?
Thank you very much.


Best regards,
Marek.

Re: PGconn gets frozen ocassionally after select() timeout

From
Vick Khera
Date:
On Fri, Nov 13, 2009 at 9:22 AM, Marek Peca <marek@duch.cz> wrote:
> Please, can you tell me, what am I doing wrong?

I have very similar code written in Perl, and I never observe failures
like you see after a timeout on the select call.  This code has been
in production on farily busy systems for several years now.

Re: PGconn gets frozen ocassionally after select() timeout

From
Marek Peca
Date:
Dear Vick,

> I have very similar code written in Perl, and I never observe failures
> like you see after a timeout on the select call.  This code has been
> in production on farily busy systems for several years now.

can you tell me, which libpq and server versions you are running on?
I have constant problems with libpq 8.3.7, connecting remotely to server
of the same version.

Are you connecting remotely, or through failure-safe interfaces like
localhost? I can imagine, that this kind of error can hide if the
connection link is good.

Maybe I should try the Perl way, if it will have the same problem in my
configuration... oh no, is there anybody familiar with this kind of error?


Many thanks,
Marek.

Re: PGconn gets frozen ocassionally after select() timeout

From
Tom Lane
Date:
Marek Peca <marek@duch.cz> writes:
> The problem: most of time, everything works fine, hundreds of successful
> or even timed-out selects() get handled without any problem. But time to
> time (eg. after several hours), the select() call returns with a timeout
> and then, a request to the opened PQconn (simple query) gets stuck, the
> call hangs and never returns.

What that sounds like is a network-level problem.  In particular, if
there's a NAT-capable router between your client and server machines,
it's probably dropping the connection after a certain period of
inactivity.  You may be able to fix this within Postgres by adjusting
the server's tcp_keepalives_idle setting.  If the server is on a
platform that doesn't support changing the keepalive interval, the
only recourse is to fix the router.

            regards, tom lane

Re: PGconn gets frozen ocassionally after select() timeout

From
Marek Peca
Date:
On Fri, 13 Nov 2009, Tom Lane wrote:
> What that sounds like is a network-level problem.  In particular, if
> there's a NAT-capable router between your client and server machines,
> it's probably dropping the connection after a certain period of
> inactivity.

Yes, it probably is. The connection goes through a very fancy way,
including several NATs and VPNs and some cruel 3rd party routers. Yes, it
seems to me, that some ugly intervention of such a router is possible.
(Not sure about inactivity, since the traffic is constant and with delay
no longer than 10 minutes (in practice around 10-20 sec.) between two
NOTIFYes.)

> You may be able to fix this within Postgres by adjusting the server's
> tcp_keepalives_idle setting.  If the server is on a platform that
> doesn't support changing the keepalive interval, the only recourse is to
> fix the router.

Thank you for the suggestion of tcp_keepalives_idle, I may try it.

However: I can not get the point, why does the PQexec() (or PQstatus() at
least) hang, instead of returning some error? I know, that situation with
broken TCP connection may involve long timeouts, but it could return at
least after several minutes, couldn't it?

Ordinary telnet or ssh connection will tell me, that I am writing my
characters to TCP line, which has died some time ago.


Thank you for your answers,
Marek.

Re: PGconn gets frozen ocassionally after select() timeout

From
Tom Lane
Date:
Marek Peca <marek@duch.cz> writes:
> However: I can not get the point, why does the PQexec() (or PQstatus() at
> least) hang, instead of returning some error? I know, that situation with
> broken TCP connection may involve long timeouts, but it could return at
> least after several minutes, couldn't it?

"Several hours" might be more like it.  How long have you waited?
In any case, that complaint should be directed to your kernel vendor
not us.  We do not control how long the TCP stack waits before declaring
the connection dead.

            regards, tom lane

Re: PGconn gets frozen ocassionally after select() timeout

From
Marek Peca
Date:
> "Several hours" might be more like it.

Better than infinity, of course.

>  How long have you waited?

Two days, for example.

> In any case, that complaint should be directed to your kernel vendor
> not us.  We do not control how long the TCP stack waits before declaring
> the connection dead.

Well, this is what I wanted to hear. So if I call PQstatus() or PQexec(),
the libpq sends some data to the TCP pipe, and you expect, that my broken
underlying TCP/IP subsystem pretends, that the connection is OK and the
data have been sent. Right?


Thank you very much for your explanation.
Marek

Re: PGconn gets frozen ocassionally after select() timeout

From
Tom Lane
Date:
Marek Peca <marek@duch.cz> writes:
>> In any case, that complaint should be directed to your kernel vendor
>> not us.  We do not control how long the TCP stack waits before declaring
>> the connection dead.

> Well, this is what I wanted to hear. So if I call PQstatus() or PQexec(),
> the libpq sends some data to the TCP pipe, and you expect, that my broken
> underlying TCP/IP subsystem pretends, that the connection is OK and the
> data have been sent. Right?

Well, if you *never* get a failure, then yeah I think you have a broken
TCP stack.  The default timeouts on this sort of thing are annoyingly
long, but they aren't infinite.

            regards, tom lane

Re: PGconn gets frozen ocassionally after select() timeout

From
Marek Peca
Date:
On Fri, 13 Nov 2009, Tom Lane wrote:
> Well, if you *never* get a failure, then yeah I think you have a broken
> TCP stack.  The default timeouts on this sort of thing are annoyingly
> long, but they aren't infinite.

Yes, this is the case. Both ends are running on GNU/Linux system and I
expect that core components are OK, however, as I said, there are some
weird VPN relays, and maybe one of them pretends, that connection is OK,
while it is broken for a long time.

Thank you for your answers. Now I am glas I can solve the problem and
throw away ugly workarounds. I hope my original code with NOTIFY and
select() is all right, then.


Best regards,
Marek

Re: PGconn gets frozen ocassionally after select() timeout

From
Vick Khera
Date:
On Fri, Nov 13, 2009 at 10:07 AM, Marek Peca <marek@duch.cz> wrote:
> Dear Vick,
>
>> I have very similar code written in Perl, and I never observe failures
>> like you see after a timeout on the select call.  This code has been
>> in production on farily busy systems for several years now.
>
> can you tell me, which libpq and server versions you are running on?
> I have constant problems with libpq 8.3.7, connecting remotely to server of
> the same version.

Right now we're running Pg 8.3.7, but we have been doing this since
8.1 days if not earlier.  Everything here is on a local LAN and runs
FreeBSD.

Based on the other messages in this thread, I'll vote one of your VPN
or NAT boxes is breaking the expectations of your network stack.