Thread: psycopg2 (async) socket timeout

psycopg2 (async) socket timeout

From: Danny Milosavljevic
Hello,

is it possible to specify the timeout for the socket underlying a connection?

Alternatively, since I'm using the async interface anyway, is it
possible to proactively cancel a query that is "stuck" because the TCP
connection to the database is down?

So the specific case is:
- connect to the postgres database using psycopg2 while network is up
- run some queries, get the results fine etc
- send a query
- the network goes down before the result to this last query has been received
- neither a result nor an error callback gets called - as far as I can
see (using txpostgres.ConnectionPool)

What's the proper way to deal with that?

Regards,
   Danny

Re: psycopg2 (async) socket timeout

From: Daniele Varrazzo
On Thu, Feb 3, 2011 at 8:04 PM, Danny Milosavljevic
<danny.milo@gmail.com> wrote:
> Hello,
>
> is it possible to specify the timeout for the socket underlying a connection?

You can send a SET statement_timeout command to affect all the
commands subsequently sent on the connection:

http://www.postgresql.org/docs/9.0/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT
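
For example (a rough sketch using a blocking connection for brevity;
with the async interface you would issue the same SET through your
usual execute/poll cycle, and the dsn is just a placeholder):

import psycopg2

conn = psycopg2.connect("host='localhost' dbname='test'")
curs = conn.cursor()
curs.execute("SET statement_timeout = 30000")   # milliseconds
curs.execute("SELECT pg_sleep(60)")   # the server cancels this after 30s and psycopg2 raises an error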


> Alternatively, since I'm using the async interface anyway, is it
> possible proactively cancel a query that is "stuck" since the TCP
> connection to the database is down?

Jan has contributed a connection.cancel() method, available since psycopg 2.3

http://initd.org/psycopg/docs/connection.html#connection.cancel
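
A rough, untested sketch of using it from a timer thread with a
blocking connection (assumes psycopg >= 2.3; dsn is a placeholder):

import threading
import psycopg2

conn = psycopg2.connect("host='localhost' dbname='test'")
curs = conn.cursor()

timer = threading.Timer(10.0, conn.cancel)   # give the query 10 seconds
timer.start()
try:
    curs.execute("SELECT pg_sleep(60)")      # raises an error once the query is cancelled
finally:
    timer.cancel()

Note that cancel() delivers the request over a separate connection to
the server, so it won't help if the network to the server is gone.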


> So the specific case is:
> - connect to the postgres database using psycopg2 while network is up
> - run some queries, get the results fine etc
> - send a query
> - the network goes down before the result to this last query has been received
> - neither a result nor an error callback gets called - as far as I can
> see (using txpostgres.ConnectionPool)
>
> What's the proper way to deal with that?

Probably by canceling the query from the client: if the network is down
you wouldn't get the signal that the backend timed out. But I've never
dealt with this scenario myself, so I guess you should test whether any
of the above works for you :)

-- Daniele

Re: psycopg2 (async) socket timeout

From: Jan Urbański
----- Original message -----
> So the specific case is:
> - connect to the postgres database using psycopg2 while network is up
> - run some queries, get the results fine etc
> - send a query
> - the network goes down before the result to this last query has been
> received - neither a result nor an error callback gets called - as far
> as I can see (using txpostgres.ConnectionPool)
>
> What's the proper way to deal with that?

I'll try to reproduce this problem, AIUI you should have the Deferred
errback if the connection is lost, but perhaps it takes some time for
Twisted to detect it (actually it takes time for the kernel to detect
it). You might try playing with your TCP keepalive settings.

Another option is implementing a timeout with a callLater. The problem
there is that it requires additional code and txpostgres does not
support query cancellation (yet, it's on the roadmap).
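
The callLater part could look roughly like this (an untested sketch;
conn would be your txpostgres Connection):

from twisted.internet import error, reactor

def add_timeout(d, timeout):
    """Errback d ourselves if it hasn't fired within `timeout` seconds."""
    def on_timeout():
        if not d.called:
            d.errback(error.TimeoutError())
    delayed = reactor.callLater(timeout, on_timeout)
    def cancel_timer(result):
        # the deferred fired in time, so defuse the timer
        if delayed.active():
            delayed.cancel()
        return result
    return d.addBoth(cancel_timer)

# e.g.: add_timeout(conn.runQuery("SELECT ..."), 10)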

Cheers,
Jan

Re: psycopg2 (async) socket timeout

From: Mark Theisen
You could add connect_timeout to the dsn, e.g. dsn = "host='localhost' dbname='test' connect_timeout=60"
(http://www.postgresql.org/docs/8.4/static/libpq-connect.html#AEN33199).
I don't know if this will help with your situation, but it might.
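
In code that would be roughly (a sketch; psycopg2 raises
OperationalError if the connection attempt takes too long):

import psycopg2

dsn = "host='localhost' dbname='test' connect_timeout=60"
conn = psycopg2.connect(dsn)   # fails with OperationalError after 60 seconds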

Mark

----- Original Message -----
From: "Danny Milosavljevic" <danny.milo@gmail.com>
To: psycopg@postgresql.org
Sent: Thursday, February 3, 2011 2:04:39 PM
Subject: [psycopg] psycopg2 (async) socket timeout

Hello,

is it possible to specify the timeout for the socket underlying a connection?

Alternatively, since I'm using the async interface anyway, is it
possible proactively cancel a query that is "stuck" since the TCP
connection to the database is down?

So the specific case is:
- connect to the postgres database using psycopg2 while network is up
- run some queries, get the results fine etc
- send a query
- the network goes down before the result to this last query has been received
- neither a result nor an error callback gets called - as far as I can
see (using txpostgres.ConnectionPool)

What's the proper way to deal with that?

Regards,
   Danny


Re: psycopg2 (async) socket timeout

From: Danny Milosavljevic
Hi,

2011/2/9 Jan Urbański <wulczer@wulczer.org>:
> ----- Original message -----
> I'll try to reproduce this problem, AIUI you should have the Deferred
> errback if the connection is lost, but perhaps it takes some time for
> Twisted to detect it (actually it takes time for the kernel to detect
> it). You might try playing with your TCP keepalive settings.

I'm trying. No luck so far...

http://twistedmatrix.com/trac/wiki/FrequentlyAskedQuestions says "If
you rely on TCP timeouts, expect as much as two hours (the precise
amount is platform specific) to pass between when the disruption
occurs and when connectionLost is called". Oops.

Hmm, even when I connect, then just down the network interface and
only after that call runQuery, it is also never calling back anything
(well, I didn't wait more than half an hour per try so far).

But good point, although does this even work for async sockets -
where you are not actively reading, so nobody knows you want to
receive any data? If that worked, that would be the nicest fix. For
the not-so-nice fix, read on :-)

I've now started to do it the way Daniele and you suggested ("just
close it from the client"), so I modified the Connection to start a
timer which will fire if I don't defuse it early enough (and modified
ConnectionPool to check connections periodically and reconnect).

After I receive a response, I defuse the timer. If not, the timer
callback will be run. It will call the errback - which will call
connection.close().

As far as noticing the "disconnect" (well, potential disconnect) goes,
this works perfectly.
However, doing a connection.close() then doesn't seem to help much,
still investigating why... getting the following:

      File "/usr/lib/python2.6/site-packages/twisted/internet/selectreactor.py",
line 104, in doSelect
        [], timeout)
    exceptions.ValueError: file descriptor cannot be a negative integer (-1)

So it seems the FD of the closed connection to postgres is still in
the Twisted reactor?
Seems I am missing some calls to self.reactor.removeReader or -Writer,
maybe. Do those belong in Connection.close() ?

If I try to reconnect periodically, can I use the same txpostgres
Connection instance and just call connect() again?

> Another option is implementing a timeout with a callLater. The problem there is that it requires additional code

Yeah, did that now, see the end of this post for the code...

Since I'm trying to make the pg client more resilient against our
flaky network, what I want to do is just close the socket to the
server on timeout. What I don't want to do is send a cancellation
request over the broken socket, telling the server that the connection
is broken and would it please cancel the query :-)
(I hope the PostgreSQL server will notice soon enough when the client
doesn't answer)

> and txpostgres does not support query cancellation (yet, it's on the roadmap).

Yeah, but when you say "cancellation" do you mean "sending a -
cancellation - request to the server via the non-working connection"?
:)

Well, I'll be reading a bit more of the twisted reactor code now, I guess :-)

Cheers,
    Danny

The file "postgresto.py" which is all that extra code needed for
client-side impatience follows. For "Connection" it's mostly your code
with a few minimal changes: some connectTimeouter() calls sprinkled
in. For "ConnectionPool" it's a periodically-checking health checker
and reconnecter too now - please note that this is work in progress
and in no way stable yet:

#!/usr/bin/env python2

# Postgres with timeouts.

import sys
from txpostgres import txpostgres
from twisted.internet import reactor, defer
from twisted.python import log

class Connection(txpostgres.Connection):
    def connectTimeouter(self, d, timeout):
        """ connect a timeouter to a deferred """
        delayedCall = reactor.callLater(timeout, self.handleTimeout, d)
        d.addBoth(self.cancelTimeout, delayedCall)
        return d

    def _runQuery(self, *args, **kwargs):
        c = self.cursor()
        # pop our own option so it isn't passed on to cursor.execute()
        timeout = kwargs.pop("timeout", 10)
        d = c.execute(*args, **kwargs)
        return self.connectTimeouter(d, timeout).addCallback(lambda c: c.fetchall())

    def _runOperation(self, *args, **kwargs):
        c = self.cursor()
        timeout = kwargs.pop("timeout", 10)
        d = c.execute(*args, **kwargs)
        return self.connectTimeouter(d, timeout).addCallback(lambda _: None)

    def _runInteraction(self, interaction, *args, **kwargs):
        c = self.cursor()
        timeout = kwargs.pop("timeout", 10)
        d = self.connectTimeouter(c.execute("begin"), timeout)
        # we assume that the interaction does something on the database here,
        # so if the interaction times out, take it as a database timeout!
        d.addCallback(lambda _: self.connectTimeouter(
            defer.maybeDeferred(interaction, c, *args, **kwargs), timeout))

        def commitAndPassthrough(ret, cursor):
            e = cursor.execute("commit")
            return e.addCallback(lambda _: ret)

        def rollbackAndPassthrough(f, cursor):
            # maybeDeferred in case cursor.execute raises a synchronous exception
            e = defer.maybeDeferred(cursor.execute, "rollback")

            def just_panic(rf):
                log.err(rf)
                return defer.fail(txpostgres.RollbackFailed(self, f))

            # if the rollback failed, panic
            e.addErrback(just_panic)
            # reraise the original failure afterwards
            return e.addCallback(lambda _: f)

        d.addCallback(commitAndPassthrough, c)
        d.addErrback(rollbackAndPassthrough, c)
        return d

    def handleTimeout(self, d):
        """ handles the timeout since we DID time out """
        log.err("timed out")
        self.close()  # close the connection (maybe it was a connection problem...)

    def cancelTimeout(self, arg, delayedCall):
        """ cancels the timeout since we DID NOT time out """
        #print >>sys.stderr, "not timed out, OK"
        if delayedCall.active():
            delayedCall.cancel()
        return arg

    def isConnected(self):
        return self.pollable() is not None


class ConnectionPool(txpostgres.ConnectionPool):
    connectionFactory = Connection

    def __init__(self, *args, **kwargs):
        txpostgres.ConnectionPool.__init__(self, *args, **kwargs)
        self.reconnectionInterval = 10  # sec
        self.connectionAttempts = set()
        reactor.callLater(self.reconnectionInterval, self.reconnectIfNeeded)

    def reconnectIfNeeded(self):
        for connection in self.connections:
            # TODO don't try that too often...
            if not connection.isConnected() and connection not in self.connectionAttempts:
                log.msg("database connection was lost, trying again")
                self.connectionAttempts.add(connection)
                d = connection.connect(*self.connargs, **self.connkw)
                delayedCall = reactor.callLater(self.reconnectionInterval,
                                                self.handleTimeout, d, connection)
                d.addBoth(self.cancelTimeout, delayedCall, connection)
        reactor.callLater(self.reconnectionInterval, self.reconnectIfNeeded)

    def handleTimeout(self, d, connection):
        """ handles the timeout since we DID time out """
        log.err("reconnect timed out")
        self.connectionAttempts.discard(connection)
        d.errback(RuntimeError("reconnect timed out"))

    def cancelTimeout(self, arg, delayedCall, connection):
        """ cancels the timeout since we DID NOT time out """
        print >>sys.stderr, "reconnect DID NOT time out, OK"
        # does not necessarily mean that it worked.
        if delayedCall.active():
            delayedCall.cancel()
        self.connectionAttempts.discard(connection)
        return arg

Re: psycopg2 (async) socket timeout

From: Jan Urbański
On 14/02/11 19:59, Danny Milosavljevic wrote:
> Hi,
>
> 2011/2/9 Jan Urbański <wulczer@wulczer.org>:
>> ----- Original message -----
>> I'll try to reproduce this problem, AIUI you should have the Deferred
>> errback if the connection is lost, but perhaps it takes some time for
>> Twisted to detect it (actually it takes time for the kernel to detect
>> it). You might try playing with your TCP keepalive settings.
>
> I'm trying. No luck so far...
>
> http://twistedmatrix.com/trac/wiki/FrequentlyAskedQuestions says "If
> you rely on TCP timeouts, expect as much as two hours (the precise
> amount is platform specific) to pass between when the disruption
> occurs and when connectionLost is called". Oops.

Yup, default settings for TCP keepalives are quite high...

> Hmm, even when I connect, then just down the network interface and
> only after that call runQuery, it is also never calling back anything
> (well, I didn't wait more than half an hour per try so far).
>
> But good point, although does this even work for async sockets? -
> where you are not reading actively, that is, nobody knows you want to
> receive any data? If that worked, that would be the nicest fix. For
> the not-so-nice fix, read on :-)

AFAIK if you're connected through TCP and waiting for data from the
other side, and the other side decides to never send you anything (for
instance because it died and did not even send you a RST packet), you
have no way of detecting that short of trying to send something every
now and then and if there's no response assuming the connection's down.

So you actually *need* a heartbeat solution to be able to detect the
network dying... I think the best idea would be starting a timer every
time you start a query and cancelling it when it finishes, and
(important) setting the timeout of that timer only a little bit higher
than the query timeout setting on the server. This way, if your code
times out, the server won't keep on running your query.
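
Roughly, reusing an add_timeout helper like the one sketched earlier in
the thread (conn being a txpostgres Connection; the 10s/12s numbers are
just examples):

d = conn.runOperation("SET statement_timeout = 10000")            # server gives up after 10s
d.addCallback(lambda _: add_timeout(conn.runQuery("SELECT ..."), 12))  # client gives up after 12s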

> I've now started to do it the way Daniele and you suggested ("just
> close it from the client"), so I modified the Connection to start a
> timer which will fire if I don't defuse it early enough (and modified
> ConnectionPool to check connections periodically and reconnect).

Well something like that ;) I'd try doing it on the per-query level,
actually. Since you can't have more than one outstanding query, your
keepalive won't be sent until the current query finishes.

Actually, libpq recently got a feature called PQPing that just checks
the state of the connection. So you can have timeouts on your queries
and periodic PQPings when you're not running anything. Reminds me:
psycopg2 needs to support PQPing, but that should be easy.

> After I receive a response, I defuse the timer. If not, the timer
> callback will be run. It will call the errback - which will call
> connection.close().
>
> As far as noticing the "disconnect" (well, potential disconnect) goes,
> this works perfectly.
> However, doing a connection.close() then doesn't seem to help much,
> still investigating why... getting the following:
>
>       File "/usr/lib/python2.6/site-packages/twisted/internet/selectreactor.py",
> line 104, in doSelect
>         [], timeout)
>     exceptions.ValueError: file descriptor cannot be a negative integer (-1)
>
> So it seems the FD of the closed connection to postgres is still in
> the Twisted reactor?
> Seems I am missing some calls to self.reactor.removeReader or -Writer,
> maybe. Do those belong in Connection.close() ?

Ha, it always comes back to the ticket I filed when writing txpostgres:
http://twistedmatrix.com/trac/ticket/4539

Believe it or not, this problem seems to also prevent proper
LISTEN/NOTIFY implementation...

> If I try to reconnect periodically, can I use the same txpostgres
> Connection instance and just call connect() again?

I think you can, although recreating the Connection object should not be
a problem.

Jan

Re: psycopg2 (async) socket timeout

From: Daniele Varrazzo
On Mon, Feb 14, 2011 at 7:16 PM, Jan Urbański <wulczer@wulczer.org> wrote:

> Actually, libpq recently got a feature called PQPing that just checks
> the state of the connection. So you can have timeouts on your queries
> and periodic PQPings when you're not running anything. Reminds me:
> psycopg2 needs to support PQPing, but that should be easy.

That's tasty. It's too early now to include something only available
in PG 9.1, but a psycopg2.ping(dsn_or_params) would actually be
useful.

-- Daniele

Re: psycopg2 (async) socket timeout

From: Marko Kreen
On Thu, Feb 3, 2011 at 10:04 PM, Danny Milosavljevic
<danny.milo@gmail.com> wrote:
> is it possible to specify the timeout for the socket underlying a connection?
>
> Alternatively, since I'm using the async interface anyway, is it
> possible proactively cancel a query that is "stuck" since the TCP
> connection to the database is down?
>
> So the specific case is:
> - connect to the postgres database using psycopg2 while network is up
> - run some queries, get the results fine etc
> - send a query
> - the network goes down before the result to this last query has been received
> - neither a result nor an error callback gets called - as far as I can
> see (using txpostgres.ConnectionPool)
>
> What's the proper way to deal with that?

TCP keepalive.  By default the timeouts are quite high,
but they are tunable.

libpq supports keepalive tuning since 9.0, on older libpq
you can do it yourself:

  https://github.com/markokr/skytools/blob/master/python/skytools/psycopgwrapper.py#L153
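
Something along these lines (an untested sketch; the socket option
names are Linux-specific and the dsn is a placeholder):

import socket
import psycopg2

def tune_keepalive(conn, idle=240, interval=15, count=4):
    # fromfd() dups the fd, but the options apply to the same underlying socket
    s = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    finally:
        s.close()   # closes only the duplicated fd, not the connection

conn = psycopg2.connect("host='localhost' dbname='test'")
tune_keepalive(conn)

With libpq >= 9.0 you can get the same effect by putting
keepalives_idle, keepalives_interval and keepalives_count into the
connection string instead.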

--
marko

Re: psycopg2 (async) socket timeout

From: Jan Urbański
On 15/02/11 06:39, Marko Kreen wrote:
> On Thu, Feb 3, 2011 at 10:04 PM, Danny Milosavljevic
> <danny.milo@gmail.com> wrote:
>> is it possible to specify the timeout for the socket underlying a connection?
>>
>> Alternatively, since I'm using the async interface anyway, is it
>> possible proactively cancel a query that is "stuck" since the TCP
>> connection to the database is down?
>>
>> So the specific case is:
>> - connect to the postgres database using psycopg2 while network is up
>> - run some queries, get the results fine etc
>> - send a query
>> - the network goes down before the result to this last query has been received
>> - neither a result nor an error callback gets called - as far as I can
>> see (using txpostgres.ConnectionPool)
>>
>> What's the proper way to deal with that?
>
> TCP keepalive.  By default the timeouts are quite high,
> but they are tunable.
>
> libpq supports keepalive tuning since 9.0, on older libpq
> you can do it yourself:
>
>   https://github.com/markokr/skytools/blob/master/python/skytools/psycopgwrapper.py#L153

After doing lots of tests, it seems that keepalives are not the full
solution. They're useful if you want to detect the connection breaking
while it's idle, but they don't help in the case of:

* the connection is idle
* the app sends a keepalive, receives a response
* before the next keepalive is sent, you want to do a query
* the connection breaks silently
* you try sending the query
* libpq tries to write the query to the connection socket, does not
receive TCP confirmation
* the kernel starts retransmitting the data, using TCP's RTO algorithm
* you don't get notified about the failure until TCP gives up, which
might be a long time

So it seems to me that you need an application-level timeout also. I'm
thinking about supporting it in txpostgres, but will have to think
exactly how to do it and what would be the interface.

Alternatively, you can lower the kernel TCP retry parameters
(net.ipv4.tcp_retries1 and net.ipv4.tcp_retries2), which will make TCP
give up earlier. Unfortunately it seems that you can only set them
globally at the kernel level and not per connection, which IMHO is a bit
too scary. What bothers me is that the keepalive mechanism does not
come into play while you're doing TCP retries, but that's apparently how
TCP works (at least on Linux...).

If you want to detect the connection failing as soon as possible, and
not the next time you try to make a query, you need to regularly make
queries, IOW have a heartbeat. But all the things I wrote before still
apply, and without an app-level timeout or lowering the TCP retry
parameters it might take a lot of time to detect that the heartbeat failed.
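
The heartbeat itself is easy enough with Twisted; a rough sketch, with
pool being a txpostgres ConnectionPool (or the timeout-wrapping one
from Danny's mail):

from twisted.internet import task
from twisted.python import log

def heartbeat(pool):
    # a cheap query; if this errbacks, the link (or the server) is in trouble
    return pool.runQuery("SELECT 1").addErrback(log.err)

# task.LoopingCall(heartbeat, pool).start(30)   # every 30 seconds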

Cheers,
Jan

Re: psycopg2 (async) socket timeout

From: Marko Kreen
On Tue, Feb 15, 2011 at 3:32 PM, Jan Urbański <wulczer@wulczer.org> wrote:
> On 15/02/11 06:39, Marko Kreen wrote:
>> On Thu, Feb 3, 2011 at 10:04 PM, Danny Milosavljevic
>> <danny.milo@gmail.com> wrote:
>>> is it possible to specify the timeout for the socket underlying a connection?
>>>
>>> Alternatively, since I'm using the async interface anyway, is it
>>> possible proactively cancel a query that is "stuck" since the TCP
>>> connection to the database is down?
>>>
>>> So the specific case is:
>>> - connect to the postgres database using psycopg2 while network is up
>>> - run some queries, get the results fine etc
>>> - send a query
>>> - the network goes down before the result to this last query has been received
>>> - neither a result nor an error callback gets called - as far as I can
>>> see (using txpostgres.ConnectionPool)
>>>
>>> What's the proper way to deal with that?
>>
>> TCP keepalive.  By default the timeouts are quite high,
>> but they are tunable.
>>
>> libpq supports keepalive tuning since 9.0, on older libpq
>> you can do it yourself:
>>
>>   https://github.com/markokr/skytools/blob/master/python/skytools/psycopgwrapper.py#L153

Keepalive will help to detect if the TCP connection is down;
it will not help if the connection is up but the server app is unresponsive.

> After doing lots of tests, it seems that keepalives are not the full
> solution. They're useful if you want to detect the connection breaking
> while it's idle, but they don't help in the case of:
>
> * the the app sends a keepalive, receives response

Sort of true, except Postgres does not have an app-level
keepalive (except SELECT 1).  The PQping mentioned
earlier creates a new connection.

> * the connection is idle
> * before the next keepalive is sent, you want to do a query
> * the connection breaks silently
> * you try sending the query
> * libpq tries to write the query to the conncetion socket, does not
> receive TCP confirmation

The TCP keepalive should help for those cases, perhaps
you are doing something wrong if you are not seeing the effect.

> * the kernel starts retransmitting the data, using TCP's RTO algorithm
> * you don't get notified about the failure until the TCP gives up, which
> might be a long time

I'm not familiar with RTO, so cannot comment.

Why would it stop keepalive from working?

> So it seems to me that you need an application-level timeout also. I'm
> thinking about supporting it in txpostgres, but will have to think
> exactly how to do it and what would be the interface.
>
> Alternatively, you can lower the kernel TCP retry parameters
> (net.ipv4.tcp_retries1 and net.ipv4.tcp_retries2), which will make TCP
> give up earlier. Unfortunately it seems that you can only set the
> globally at the kernel level and not per connection, which IMHO is a bit
> too scary. What bothers me is that the keepalives mechanism does not
> come into play while you're doing TCP retries, but that's apparently how
> TCP works (at least on Linux...).
>
> If you want to detect the connection failing as soon as possible, and
> not the next time you try to make a query, you need to regularly make
> queries, IOW have a heartbeat. But all the things I wrote before still
> apply, and without an app-level timeout or lowering the TCP retry
> parameters it might take a lot of time to detect that the heartbeat failed.

The need for a periodic query is exactly the thing that keepalive
should fix.  OTOH, if you have connections that are idle for a long
time, you could simply drop them.

We have the (4m idle + 4x15sec ping) parameters as
default and they work fine - a dead connection is killed
after 5m.
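
With libpq >= 9.0 that maps to connection string parameters, roughly
(the dsn is a placeholder):

import psycopg2

conn = psycopg2.connect("host='localhost' dbname='test' "
                        "keepalives=1 keepalives_idle=240 "
                        "keepalives_interval=15 keepalives_count=4")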

--
marko

Re: psycopg2 (async) socket timeout

From: Jan Urbański
On 15/02/11 21:55, Marko Kreen wrote:
> On Tue, Feb 15, 2011 at 3:32 PM, Jan Urbański <wulczer@wulczer.org> wrote:
>> * the the app sends a keepalive, receives response
>
> Sort of true, except Postgres does not have app-level
> keepalive (except SELECT 1).  The PQping mentioned
> earlier creates new connection.

By this I meant that an app is connected using libpq with keepalives
enabled.

>> * the connection is idle
>> * before the next keepalive is sent, you want to do a query
>> * the connection breaks silently
>> * you try sending the query
>> * libpq tries to write the query to the conncetion socket, does not
>> receive TCP confirmation
>
> The TCP keepalive should help for those cases, perhaps
> you are doing something wrong if you are not seeing the effect.

Well for me it doesn't help, I'm not sure if it's my fault or the
kernel's or it's just how TCP ought to work.

>> * the kernel starts retransmitting the data, using TCP's RTO algorithm
>> * you don't get notified about the failure until the TCP gives up, which
>> might be a long time
>
> I'm not familiar with RTO, so cannot comment.
>
> Why would it stop keepalive from working?

Looking at the traffic in Wireshark I'm seeing TCP retransmissions and
no keepalive traffic.

> The need for periodic query is exactly the thing that keepalive
> should fix.  OTOH, if you have connections that are long time idle
> you could simply drop them.
>
> We have the (4m idle + 4x15sec ping) parameters as
> default and they work fine - dead connection is killed
> after 5m.

Hm, so my test is like this:

* I connect with psycopg2 enabling keepalives in the connection string,
using "keepalives_idle=4 keepalives_interval=1 keepalives_count=1"
* the test program sends a "select pg_sleep(6)" and then sleeps itself
for 6 seconds, and does that in a loop
* each time after the query is sent and 4 seconds elapse I'm seeing TCP
keepalive packets going to the server and the server responding
* each time after the program sleeps for 4 seconds, a keepalive is sent

To simulate a connectivity loss I'm adding two rules to my firewall that
block (the iptables DROP target) communication from or to port 5432.

Now there are three scenarios:

1. if I block the connection right after the test program goes to sleep,
the response to the keepalive is not received and a connectivity loss is
detected. The app sends a RST packet (that obviously does not reach the
server) and when it wakes up and tries to send the query, psycopg
complains about a broken connection. Important: the backend stays alive
and PG shows the connection as "IDLE in transaction".

2. if I block the connection after the test program has already sent the
keepalive, but before it sends the query, it goes ahead and tries
to send the query, and then blocks because the kernel is retrying the
TCP delivery of the query. Keepalives are *not* sent and the process of
TCP giving up can take quite some time (depends on the settings for TCP
timeout). The connection stays alive on the server anyway.

3. if I block the connection while it's waiting for the query to
complete, a keepalive is sent, the connection is detected to be broken,
the execute statement fails with a SystemError: null argument to
internal routine (sic!), and the connection stays on the server.

I'm not really sure what's the deal with the SystemError, but my
conclusions are:

* while TCP retry is in action, it disables keepalives
* the backend stays alive on the server side anyway

It's possible that TCP retries take a few minutes and I'm simply not
patient enough (of course I'm not using a keepalive interval of 1 second
in production). So if all you want is to detect a broken connection a
couple of minutes from the moment it happened, you can have client-side
keepalives tuned the way Marko does, and check that your TCP stack gives
up a delivery attempt in less than a few minutes.

On the other hand, you probably *should* also use server-side
keepalives, so the server can detect a broken connection and kill the
backend, otherwise you will end up with lots of "IDLE in transaction"
backends, which is Very Bad (can block autovacuum, still holds
transaction locks etc).
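
If the platform supports them, the tcp_keepalives_* settings can go
into postgresql.conf; assuming they are also settable per session
(I believe they are), the client can request them itself, e.g. with
curs being a cursor on a freshly opened connection:

curs.execute("SET tcp_keepalives_idle = 240")
curs.execute("SET tcp_keepalives_interval = 15")
curs.execute("SET tcp_keepalives_count = 4")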

I'm going to do some more tests to see the default timeout for TCP
delivery and if it's really in the range of 5 minutes, I'll be happy.

Now I don't have any clue what's with the SystemError I'm getting, might
take a look if I find the time. Attached is the test script and the
command I use to simulate network outage.

And last but not least, txpostgres does not play that well with
client-side keepalives, because while the connection is idle, it's not
watching its file descriptor, so error conditions on that descriptor
will be detected only when you go and do a query. That is something I
might fix in the future.

Gah, looking at all that TCP stuff always makes my head spin.

Cheers,
Jan


Re: psycopg2 (async) socket timeout

From: Danny Milosavljevic
Hi,

> Well something like that ;) I'd try doing it on the per-query level,
> actually. Since you can't have more than one outstanding query, your
> keepalive won't be sent until the current query finishes.

Hm, what would be the difference to how I did it? It's per query, isn't it?

I kinda know what you mean... PostgreSQL thinks that
connection=transaction and so you have only one outstanding operation
on the connection(=transaction), otherwise it won't be able to
serialize. But how would the implementation of the txpostgres
Connection code change? Would it?

> psycopg2 needs to support PQPing, but that should be easy.

>> However, doing a connection.close() then doesn't seem to help much,

After adding debugging messages to the reactor, found out why:

Both the connection and the cursor are still in the select() list of
the reactor.

So my current try is the following, and this works (so far) - no
exceptions raised by the reactor anymore... (note that I now pass the
cursor instance too)

    def handleTimeout(self, d, cursor):
        """ handles the timeout since we DID time out """
        log.err("timed out")
        # cursor.close() cannot be used while an async query is in progress :-(
        self.reactor.removeReader(cursor)
        self.reactor.removeWriter(cursor)
        self.close()  # close the connection (maybe it was a conn problem)
        if not d.called:
            d.errback(failure.Failure(RuntimeError()))

    def close(self):
        self.reactor.removeReader(self)
        self.reactor.removeWriter(self)
        return txpostgres.Connection.close(self)

I don't really understand why I need to remove the cursor from the
reactor as well. Shouldn't the cursor and the connection use the same
socket? Or does it mean that there are two different python objects in
the reactor which are actually using the same socket, but the reactor
doesn't know that? I feel that I'm so close to it, but unfortunately
the next time I try to create a new cursor, it will nag about async
operation still in progress...

I can't call cursor.close() since it is guarded by
EXC_IF_ASYNC_IN_PROGRESS in psycopg2 - and so cannot be called while
my query is (technically) still in progress. There's a PQcancel thing
on the connection (not the cursor!) which cancels the query by sending
a request to the server - that's not exactly what I want, though. I
planned to set statement_timeout for the server to automatically
abandon the query after a timeout and set my client timer to the same
timeout - so both would do the right thing without explicit
synchronization.

> Ha, it always comes back to the ticket I filed when writing txpostgres:
> http://twistedmatrix.com/trac/ticket/4539

Interesting... however, in my simple case, it seems to be fixed by
calling both "removeReader" and "removeWriter" to remove both the
cursor instance and the connection instance from the watchlist.

> Believe it or not, this problem seems to also prevent proper
> LISTEN/NOTIFY implementation...

If it helps, in my case I changed functions in
/usr/lib/python2.6/site-packages/twisted/internet/selectreactor.py
(addReader, addWriter, removeReader, removeWriter) to print what's
going on and found out about the cursor that way. What the ValueError
means is that some of the file-like objects were closed but are still in
the reactor - so the reactor doesn't know what it's supposed to do
with them: when is a closed FD ready? So one only has to make sure to
remove the file-like object early enough, before the reactor gets
control again.

Danny