Thread: libpq and connection failures
Is there any reason why pqReadData() (interfaces/libpq/fe-misc.c), if pqsecure_read() fails with an error that isn't special-cased as a transient one, returns -1 but leaves the connection state at CONNECTION_OK? Was this meant to be caught by calling pqReadReady() (which doesn't happen in this case because of the return)? Some libpqxx users with postgres 8.0 installed have observed broken backend connections leading to errors but not updating connection state accordingly. One user unplugged his network cable, leading to a long timeout followed by the error message "could not receive data from server," but libpqxx was unable to diagnose the situation because PQstatus() blithely returned CONNECTION_OK. Doesn't it make more sense in these cases to "goto definitelyFailed" instead of "return -1" as it does now? This occurs in two places BTW, so it may make sense to refactor the function before changing it: the code blocks at labels retry3 and retry4 are completely identical AFAICS except for one small "if" clause (plus a variable is replaced by 0 where it's provably equal to zero). Each is really a switch within a loop, disguised as a nested if through cunning use of goto. Jeroen
jtv@xs4all.nl writes: > Is there any reason why pqReadData() (interfaces/libpq/fe-misc.c), if > pqsecure_read() fails with an error that isn't special-cased as a > transient one, returns -1 but leaves the connection state at > CONNECTION_OK? I think it's probably better to have the default assumption be "connection possibly recoverable" than have it be "summarily kill connection at first hint of trouble". The latter seems less robust not more so. regards, tom lane
Tom Lane wrote: > I think it's probably better to have the default assumption be > "connection possibly recoverable" than have it be "summarily kill > connection at first hint of trouble". The latter seems less robust > not more so. Not after the connection failure has made its way into a PGresult, surely?Doesn't seem consistent with the design choiceof aborting transactions on error, for starters. Are you saying that a session is still in usable shape when you have no way of establishing whether the last command succeeded? If you're in an explicit transaction when this happens, it's in an unknown state[*] so you have to abort anyway. All the client hears is "there's been an error of some sort, but the connection may or may not be fine, thank you." You don't necessarily know what level of transaction nesting you're in though, so you may have to fire off aborts until you're pretty sure you're out of all of them. Frankly I'd rather call PQreset() just to save myself the trouble. (*) Yes, there are cases where the transaction is left in a reliable state. Such as on a read-only query, which the application could retry (I'll assume it cares about the results or it wouldn't have queried) at the cost of greater code complexity. To me is one of the cases where simplicity and clarity matter a damn sight more than optimizing out the reconnect on the offchance that the application knows how to handle the situation despite not receiving even the basic knowledge that the error was something to do with the connection, not the query. Tom, when I said way back when that I wanted to do recovery and retry after a connection was lost, weren't you the one who said "this scares the hell out of me" because you couldn't be sure whether the last command committed? I thought the mantra when it came to networking went "don't second-guess the OS." If you get negative bytes out of a socket, there are a few known errno values that mean it's a transient thing. Fine, if you identify more of those then chuck them in there. There are several cases of that in there already. But otherwise, why not assume that the system gave you an error code because it decided it saw a failure? Jeroen
jtv@xs4all.nl writes: > Tom Lane wrote: >> I think it's probably better to have the default assumption be >> "connection possibly recoverable" than have it be "summarily kill >> connection at first hint of trouble". The latter seems less robust >> not more so. > Not after the connection failure has made its way into a PGresult, surely? Well, I've not had a whole lot of personal experience with flaky connections to a database ... what do other people think would be the most useful behavior? regards, tom lane
Tom Lane wrote: >>> I think it's probably better to have the default assumption be >>> "connection possibly recoverable" than have it be "summarily kill >>> connection at first hint of trouble". The latter seems less robust >>> not more so. > >> Not after the connection failure has made its way into a PGresult, >> surely? > > Well, I've not had a whole lot of personal experience with flaky > connections to a database ... what do other people think would be > the most useful behavior? I can only speak for libpqxx and some of its users. One has pulled a network cable out and found that the error he ultimately received (after a long timeout) looked like a query failure--and PQstatus still said CONNECTION_OK, leaving libpqxx unable to diagnose the error type and throw the appropriate exception type. Searching for the error text brought us to the scene of the crime. I don't see how this state of affairs can be acceptable *at all* given that PQstatus exists and purports to report a connection's status. If pulling out the network cable and not putting it back fails to meet our definition of a broken connection, what meets it? Another user ran into trouble because he couldn't connect to the database due to an incorrectly formatted connection string, but didn't receive an error. Instead he got a PGconn that as far as I can make out, seemed fully functional (non-NULL and CONNECTION_OK), but wasn't. We happened to discover what was going on because he printed out the result of PQdb() without checking for a NULL result, and reported the resulting crash to the libpqxx mailing list. Jeroen