Re: psycopg2 (async) socket timeout - Mailing list psycopg

From: Jan Urbański
Subject: Re: psycopg2 (async) socket timeout
Msg-id: 4D5AFA7D.4080202@wulczer.org
In response to: Re: psycopg2 (async) socket timeout (Marko Kreen <markokr@gmail.com>)
List: psycopg
On 15/02/11 21:55, Marko Kreen wrote:
> On Tue, Feb 15, 2011 at 3:32 PM, Jan Urbański <wulczer@wulczer.org> wrote:
>> * the app sends a keepalive, receives response
>
> Sort of true, except Postgres does not have app-level
> keepalive (except SELECT 1). The PQping mentioned
> earlier creates a new connection.

By this I meant that the app is connected using libpq with keepalives enabled.

>> * the connection is idle
>> * before the next keepalive is sent, you want to do a query
>> * the connection breaks silently
>> * you try sending the query
>> * libpq tries to write the query to the connection socket, does not
>> receive TCP confirmation
>
> The TCP keepalive should help for those cases, perhaps
> you are doing something wrong if you are not seeing the effect.

Well, for me it doesn't help; I'm not sure if it's my fault, or the kernel's, or it's just how TCP ought to work.

>> * the kernel starts retransmitting the data, using TCP's RTO algorithm
>> * you don't get notified about the failure until TCP gives up, which
>> might be a long time
>
> I'm not familiar with RTO, so cannot comment.
>
> Why would it stop keepalive from working?

Looking at the traffic in Wireshark I'm seeing TCP retransmissions and no keepalive traffic.

> The need for periodic query is exactly the thing that keepalive
> should fix. OTOH, if you have connections that are long time idle
> you could simply drop them.
>
> We have the (4m idle + 4x15sec ping) parameters as
> default and they work fine - dead connection is killed
> after 5m.
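To make the setup concrete, here is a minimal sketch of building the kind of libpq connection string discussed above, with client-side TCP keepalives enabled. The dbname/user values are hypothetical; the keepalive values are the aggressive ones used in my test below, not production settings.

```python
# Sketch: a libpq key=value connection string enabling TCP keepalives.
# These keepalive values are deliberately aggressive, for testing only.
keepalive_opts = {
    "keepalives": 1,            # turn TCP keepalives on for this connection
    "keepalives_idle": 4,       # seconds of idleness before the first probe
    "keepalives_interval": 1,   # seconds between unanswered probes
    "keepalives_count": 1,      # unanswered probes before the peer is declared dead
}

def make_dsn(dbname, user, extra):
    """Build a libpq key=value connection string from the given options."""
    parts = [f"dbname={dbname}", f"user={user}"]
    parts += [f"{k}={v}" for k, v in extra.items()]
    return " ".join(parts)

# Hypothetical database name and user:
dsn = make_dsn("test", "jan", keepalive_opts)
print(dsn)
# With psycopg2 installed, this string would be passed straight to
# psycopg2.connect(dsn); the keepalive parameters are handled by libpq.
```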
Hm, so my test is like this:

* I connect with psycopg2, enabling keepalives in the connection string with "keepalives_idle=4 keepalives_interval=1 keepalives_count=1"
* the test program sends a "select pg_sleep(6)" and then sleeps itself for 6 seconds, and does that in a loop
* each time, 4 seconds after the query is sent, I'm seeing TCP keepalive packets going to the server and the server responding
* each time, after the program has slept for 4 seconds, a keepalive is sent

To simulate a connectivity loss I add two rules to my firewall that block (with the iptables DROP target) communication from or to port 5432. Now there are three scenarios:

1. if I block the connection right after the test program goes to sleep, the response to the keepalive is not received and a connectivity loss is detected. The app sends a RST packet (which obviously does not reach the server) and when it wakes up and tries to send the query, psycopg complains about a broken connection. Important: the backend stays alive and PG shows the connection as "IDLE in transaction".

2. if I block the connection after the test program has already sent the keepalive, but before it sent the query, it actually goes ahead and tries to send the query, and then blocks because the kernel is retrying the TCP delivery of the query. Keepalives are *not* sent, and the process of TCP giving up can take quite some time (depending on the settings for the TCP timeout). The connection stays alive on the server anyway.

3. if I block the connection while it's waiting for the query to complete, a keepalive is sent, the connection is detected to be broken, the execute statement fails with a SystemError: null argument to internal routine (sic!), and the connection stays on the server.
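The test loop above can be sketched roughly like this (not the attached script itself, just a reconstruction from the description; the fake connection at the bottom is only there so the sketch runs without a server):

```python
import time

def keepalive_probe_loop(conn, cycles=3, sleep_s=6):
    """The test loop described above: run 'select pg_sleep(N)', then sleep
    locally for the same time, in a loop.  Since each phase lasts longer
    than keepalives_idle (4s in the test), a TCP keepalive probe should
    be visible on the wire during both phases."""
    cur = conn.cursor()
    for _ in range(cycles):
        cur.execute("select pg_sleep(%s)", (sleep_s,))  # server-side wait
        time.sleep(sleep_s)                             # client-side idle wait

# A tiny stand-in connection so the sketch can run without a server;
# with psycopg2 installed you would instead do:
#   conn = psycopg2.connect("dbname=test keepalives_idle=4 ...")
class _FakeCursor:
    def __init__(self, log):
        self.log = log
    def execute(self, sql, params=None):
        self.log.append(sql % params if params else sql)

class _FakeConn:
    def __init__(self):
        self.log = []
    def cursor(self):
        return _FakeCursor(self.log)

conn = _FakeConn()
keepalive_probe_loop(conn, cycles=2, sleep_s=0)
print(conn.log)
```

The firewall side of the test is just two iptables rules with the DROP target, one matching traffic to port 5432 and one matching traffic from it, inserted at the moment described in each scenario.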
I'm not really sure what's the deal with the SystemError, but my conclusions are:

* while a TCP retry is in action, it disables keepalives
* the backend stays alive on the server side anyway

It's possible that TCP retries take a few minutes and I'm simply not patient enough (of course I'm not using a keepalive interval of 1 second in production). So if all you want is to detect a broken connection a couple of minutes after it happened, you can have client-side keepalives tuned as Marko does it, and check that your TCP stack gives up a delivery attempt in less than a few minutes.

On the other hand, you probably *should* also use server-side keepalives, so the server can detect a broken connection and kill the backend; otherwise you will end up with lots of "IDLE in transaction" backends, which is Very Bad (they can block autovacuum, they still hold transaction locks, etc).

I'm going to do some more tests to see the default timeout for TCP delivery, and if it's really in the range of 5 minutes, I'll be happy. I still don't have any clue what's with the SystemError I'm getting; I might take a look if I find the time. Attached is the test script and the command I use to simulate a network outage.

And last but not least, txpostgres does not play that well with client-side keepalives, because while the connection is idle it's not watching its file descriptor, so error conditions on that descriptor will be detected only when you go and do a query. That is something I might fix in the future.

Gah, looking at all that TCP stuff always makes my head spin.

Cheers,
Jan
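As an aside, on Linux the "TCP gives up" time mentioned above is governed by the tcp_retries2 sysctl (the number of retransmissions before the kernel abandons delivery; with exponential backoff the default of 15 works out to well over 5 minutes). A small sketch for checking it, guarded so it degrades gracefully off Linux:

```python
import os

def tcp_retries2(path="/proc/sys/net/ipv4/tcp_retries2"):
    """Return the kernel's TCP retransmission limit, or None if the
    sysctl is not available (e.g. on a non-Linux system)."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read())

retries = tcp_retries2()
print("tcp_retries2 =", retries)
```

Lowering this value (or the per-socket equivalent, where available) is one way to make the stack give up a delivery attempt within the couple of minutes discussed above.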