Re: BUG #5837: PQstatus() fails to report lost connection - Mailing list pgsql-bugs

From Murray S. Kucherawy
Subject Re: BUG #5837: PQstatus() fails to report lost connection
Date
Msg-id F5833273385BB34F99288B3648C4F06F1341E73FC6@EXCH-C2.corp.cloudmark.com
Whole thread Raw
In response to Re: BUG #5837: PQstatus() fails to report lost connection  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: BUG #5837: PQstatus() fails to report lost connection  (Robert Haas <robertmhaas@gmail.com>)
Re: BUG #5837: PQstatus() fails to report lost connection  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Sunday, January 23, 2011 8:59 PM
> To: Murray S. Kucherawy
> Cc: Tom Lane; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>=20
> > As for the reply above, I disagree. =A0PQstatus(), as documented,
> > doesn't say anything about certain conditions in which it won't report
> > that the connection is dead, when it actually is, once the connection
> > was already established and working.
>=20
> I understand that gut feeling, but I think if you think about it a
> little, you may realize that at some level it's an unreasonable
> expectation.  For example, suppose you were to connect to the server
> (successfully), unplug the network cable, and then call PQstatus().
> There is literally no way for PQstatus() to know whether that
> connection is still there.  It's just returning some internal flag out
> of the object.

True, of course.  But that's not the case here; PQsendRequest() was used to=
 try to contact the server to submit some SQL operation after the server ha=
d been restarted.  Enough of that operation failed so as to make PQresultSt=
atus() return PQRES_FATAL_ERROR.  There was enough I/O for libpq to figure =
out that something has gone wrong, in fact enough to know that the cause wa=
s administrative termination of the server for some reason (since the error=
 string returned is "FATAL: terminating connection due to administrator com=
mand").  Basically it appears, then, that enough communication (or attempts=
 at it) took place to observe that the connection not usable anymore and ma=
rk it as such, and even why.

Again, I'm looking at this as a consumer of the API; I don't know, and to s=
ome degree shouldn't care, what's going on inside.  The results simply look=
 inconsistent.

I don't expect libpq to do any kind of keep-alive work, but once something =
has reported PGRES_FATAL_ERROR, I would expect the result returned by PQsta=
tus() to be accurate.  In this case, it isn't.  It gives the impression tha=
t an incomplete state change occurred or something like that.

> Now, in this case, things are a bit less clear, because it's not as if
> the remote side has given us no indication that the connection is
> about to get closed.  This doesn't strike me as awesome protocol
> design, but 14 years on we're probably stuck with it. I think if you
> want to have some special handling when an administrative shutdown
> happens on the server, you should use PQresultErrorField() to check
> PG_DIAG_SQLSTATE; or if you want FATAL and PANIC conditions to be
> handled specially you can check PG_DIAG_SEVERITY.

The model in OpenDBX appears to be to query and then iterate getting result=
s until there are none; if there's ever an error, see if it's a kind of err=
or that requires a reconnection attempt before continuing.  Without assumin=
g PGRES_FATAL_ERROR always means a reconnect is required, it seems like PQs=
tatus() is the right test for the latter of those two "ifs", and that's wha=
t they're doing.  It doesn't really matter if the disconnect was a result o=
f administrative action or not; the caller just needs to know if it has to =
try to reconnect before continuing.

> The description of PGRES_FATAL_ERROR in the documentation does pretty
> much suck.  I believe you'll get that if the backend returns either an
> ERROR (in which case you need to retry the transaction) or a FATAL (in
> which case you need to reset the connection), but that's not at all
> clear either from the documentation or the naming of the constant
> (which, alas, is hard to change at this point for
> backward-compatibility reasons).

Well my other suggestion would be to assume PGRES_FATAL_ERROR always means =
the connection needs to be reset.  But this blows that idea away; this woul=
d cause a connection reset that wouldn't actually solve the problem when it=
's an ERROR as you described.

> The description of PQstatus() looks correct to me.  It says that "a
> communications failure might result in the status changing to
> CONNECTION_BAD prematurely".  In the scenario you describe, no
> communications failure has occurred.  The server sent back an error
> message, and closed the connection (though libpq hasn't noticed yet)
> but there's no communications error anywhere until the client tries to
> send a query over a connection that doesn't exist any more.  Maybe we
> could flesh that description out a bit to make it more clear that this
> is really only going to tell you if TCP/IP has explicitly blown up in
> your face, and not any other reason why things might not be working,
> although I think there are hints of that there already.

Given what you say here, this seems to be an exception for which there is c=
urrently no detection mechanism short of looking for that one specific erro=
r string indicating it was administrative action causing the error.  I thin=
k your suggested changes would certainly help, as well as some generalized =
way to know that the connection handle is now unusable, either because of a=
dministrative server action or because the connection became unreliable for=
 some other reason.  I'd be fine with having to do PQstatus() || PQadminSta=
tus() too, if such a thing were to appear.  Better documentation of PGRES_F=
ATAL_ERROR would help as well.

-MSK

pgsql-bugs by date:

Previous
From: "Murray S. Kucherawy"
Date:
Subject: Re: BUG #5837: PQstatus() fails to report lost connection
Next
From: "Darshana"
Date:
Subject: BUG #5844: maverick