Thread: libpq and connection failures

libpq and connection failures

From
jtv@xs4all.nl
Date:
Is there any reason why pqReadData() (interfaces/libpq/fe-misc.c), if
pqsecure_read() fails with an error that isn't special-cased as a
transient one, returns -1 but leaves the connection state at
CONNECTION_OK?  Was this meant to be caught by calling pqReadReady()
(which doesn't happen in this case because of the return)?

Some libpqxx users with postgres 8.0 installed have observed broken
backend connections leading to errors but not updating connection state
accordingly.  One user unplugged his network cable, leading to a long
timeout followed by the error message "could not receive data from
server," but libpqxx was unable to diagnose the situation because
PQstatus() blithely returned CONNECTION_OK.

Doesn't it make more sense in these cases to "goto definitelyFailed"
instead of "return -1" as it does now?

This occurs in two places BTW, so it may make sense to refactor the
function before changing it: the code blocks at labels retry3 and retry4
are completely identical AFAICS except for one small "if" clause (plus a
variable is replaced by 0 where it's provably equal to zero).  Each is
really a switch within a loop, disguised as a nested if through cunning
use of goto.


Jeroen




Re: libpq and connection failures

From
Tom Lane
Date:
jtv@xs4all.nl writes:
> Is there any reason why pqReadData() (interfaces/libpq/fe-misc.c), if
> pqsecure_read() fails with an error that isn't special-cased as a
> transient one, returns -1 but leaves the connection state at
> CONNECTION_OK?

I think it's probably better to have the default assumption be
"connection possibly recoverable" than have it be "summarily kill
connection at first hint of trouble".  The latter seems less robust
not more so.
        regards, tom lane


Re: libpq and connection failures

From
jtv@xs4all.nl
Date:
Tom Lane wrote:
> I think it's probably better to have the default assumption be
> "connection possibly recoverable" than have it be "summarily kill
> connection at first hint of trouble".  The latter seems less robust
> not more so.

Not after the connection failure has made its way into a PGresult, surely?Doesn't seem consistent with the design
choiceof aborting transactions
 
on error, for starters.  Are you saying that a session is still in usable
shape when you have no way of establishing whether the last command
succeeded?

If you're in an explicit transaction when this happens, it's in an unknown
state[*] so you have to abort anyway.  All the client hears is "there's
been an error of some sort, but the connection may or may not be fine,
thank you."  You don't necessarily know what level of transaction nesting
you're in though, so you may have to fire off aborts until you're pretty
sure you're out of all of them.  Frankly I'd rather call PQreset() just to
save myself the trouble.

(*) Yes, there are cases where the transaction is left in a reliable
state.  Such as on a read-only query, which the application could retry
(I'll assume it cares about the results or it wouldn't have queried) at
the cost of greater code complexity.  To me is one of the cases where
simplicity and clarity matter a damn sight more than optimizing out the
reconnect on the offchance that the application knows how to handle the
situation despite not receiving even the basic knowledge that the error
was something to do with the connection, not the query.  Tom, when I said
way back when that I wanted to do recovery and retry after a connection
was lost, weren't you the one who said "this scares the hell out of me"
because you couldn't be sure whether the last command committed?

I thought the mantra when it came to networking went "don't second-guess
the OS."  If you get negative bytes out of a socket, there are a few known
errno values that mean it's a transient thing.  Fine, if you identify more
of those then chuck them in there.  There are several cases of that in
there already.  But otherwise, why not assume that the system gave you an
error code because it decided it saw a failure?


Jeroen




Re: libpq and connection failures

From
Tom Lane
Date:
jtv@xs4all.nl writes:
> Tom Lane wrote:
>> I think it's probably better to have the default assumption be
>> "connection possibly recoverable" than have it be "summarily kill
>> connection at first hint of trouble".  The latter seems less robust
>> not more so.

> Not after the connection failure has made its way into a PGresult, surely?

Well, I've not had a whole lot of personal experience with flaky
connections to a database ... what do other people think would be
the most useful behavior?
        regards, tom lane


Re: libpq and connection failures

From
jtv@xs4all.nl
Date:
Tom Lane wrote:

>>> I think it's probably better to have the default assumption be
>>> "connection possibly recoverable" than have it be "summarily kill
>>> connection at first hint of trouble".  The latter seems less robust
>>> not more so.
>
>> Not after the connection failure has made its way into a PGresult,
>> surely?
>
> Well, I've not had a whole lot of personal experience with flaky
> connections to a database ... what do other people think would be
> the most useful behavior?

I can only speak for libpqxx and some of its users.  One has pulled a
network cable out and found that the error he ultimately received (after a
long timeout) looked like a query failure--and PQstatus still said
CONNECTION_OK, leaving libpqxx unable to diagnose the error type and throw
the appropriate exception type.  Searching for the error text brought us
to the scene of the crime.

I don't see how this state of affairs can be acceptable *at all* given
that PQstatus exists and purports to report a connection's status.  If
pulling out the network cable and not putting it back fails to meet our
definition of a broken connection, what meets it?

Another user ran into trouble because he couldn't connect to the database
due to an incorrectly formatted connection string, but didn't receive an
error.  Instead he got a PGconn that as far as I can make out, seemed
fully functional (non-NULL and CONNECTION_OK), but wasn't.  We happened to
discover what was going on because he printed out the result of PQdb()
without checking for a NULL result, and reported the resulting crash to
the libpqxx mailing list.


Jeroen