Re: BUG #5837: PQstatus() fails to report lost connection - Mailing list pgsql-bugs
From | Murray S. Kucherawy |
---|---|
Subject | Re: BUG #5837: PQstatus() fails to report lost connection |
Date | |
Msg-id | F5833273385BB34F99288B3648C4F06F1341E73FC6@EXCH-C2.corp.cloudmark.com Whole thread Raw |
In response to | Re: BUG #5837: PQstatus() fails to report lost connection (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: BUG #5837: PQstatus() fails to report lost connection
(Robert Haas <robertmhaas@gmail.com>)
Re: BUG #5837: PQstatus() fails to report lost connection (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-bugs |
> -----Original Message----- > From: Robert Haas [mailto:robertmhaas@gmail.com] > Sent: Sunday, January 23, 2011 8:59 PM > To: Murray S. Kucherawy > Cc: Tom Lane; pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection >=20 > > As for the reply above, I disagree. =A0PQstatus(), as documented, > > doesn't say anything about certain conditions in which it won't report > > that the connection is dead, when it actually is, once the connection > > was already established and working. >=20 > I understand that gut feeling, but I think if you think about it a > little, you may realize that at some level it's an unreasonable > expectation. For example, suppose you were to connect to the server > (successfully), unplug the network cable, and then call PQstatus(). > There is literally no way for PQstatus() to know whether that > connection is still there. It's just returning some internal flag out > of the object. True, of course. But that's not the case here; PQsendRequest() was used to= try to contact the server to submit some SQL operation after the server ha= d been restarted. Enough of that operation failed so as to make PQresultSt= atus() return PQRES_FATAL_ERROR. There was enough I/O for libpq to figure = out that something has gone wrong, in fact enough to know that the cause wa= s administrative termination of the server for some reason (since the error= string returned is "FATAL: terminating connection due to administrator com= mand"). Basically it appears, then, that enough communication (or attempts= at it) took place to observe that the connection not usable anymore and ma= rk it as such, and even why. Again, I'm looking at this as a consumer of the API; I don't know, and to s= ome degree shouldn't care, what's going on inside. The results simply look= inconsistent. I don't expect libpq to do any kind of keep-alive work, but once something = has reported PGRES_FATAL_ERROR, I would expect the result returned by PQsta= tus() to be accurate. In this case, it isn't. It gives the impression tha= t an incomplete state change occurred or something like that. > Now, in this case, things are a bit less clear, because it's not as if > the remote side has given us no indication that the connection is > about to get closed. This doesn't strike me as awesome protocol > design, but 14 years on we're probably stuck with it. I think if you > want to have some special handling when an administrative shutdown > happens on the server, you should use PQresultErrorField() to check > PG_DIAG_SQLSTATE; or if you want FATAL and PANIC conditions to be > handled specially you can check PG_DIAG_SEVERITY. The model in OpenDBX appears to be to query and then iterate getting result= s until there are none; if there's ever an error, see if it's a kind of err= or that requires a reconnection attempt before continuing. Without assumin= g PGRES_FATAL_ERROR always means a reconnect is required, it seems like PQs= tatus() is the right test for the latter of those two "ifs", and that's wha= t they're doing. It doesn't really matter if the disconnect was a result o= f administrative action or not; the caller just needs to know if it has to = try to reconnect before continuing. > The description of PGRES_FATAL_ERROR in the documentation does pretty > much suck. I believe you'll get that if the backend returns either an > ERROR (in which case you need to retry the transaction) or a FATAL (in > which case you need to reset the connection), but that's not at all > clear either from the documentation or the naming of the constant > (which, alas, is hard to change at this point for > backward-compatibility reasons). Well my other suggestion would be to assume PGRES_FATAL_ERROR always means = the connection needs to be reset. But this blows that idea away; this woul= d cause a connection reset that wouldn't actually solve the problem when it= 's an ERROR as you described. > The description of PQstatus() looks correct to me. It says that "a > communications failure might result in the status changing to > CONNECTION_BAD prematurely". In the scenario you describe, no > communications failure has occurred. The server sent back an error > message, and closed the connection (though libpq hasn't noticed yet) > but there's no communications error anywhere until the client tries to > send a query over a connection that doesn't exist any more. Maybe we > could flesh that description out a bit to make it more clear that this > is really only going to tell you if TCP/IP has explicitly blown up in > your face, and not any other reason why things might not be working, > although I think there are hints of that there already. Given what you say here, this seems to be an exception for which there is c= urrently no detection mechanism short of looking for that one specific erro= r string indicating it was administrative action causing the error. I thin= k your suggested changes would certainly help, as well as some generalized = way to know that the connection handle is now unusable, either because of a= dministrative server action or because the connection became unreliable for= some other reason. I'd be fine with having to do PQstatus() || PQadminSta= tus() too, if such a thing were to appear. Better documentation of PGRES_F= ATAL_ERROR would help as well. -MSK
pgsql-bugs by date: