Thread: BUG #5837: PQstatus() fails to report lost connection

BUG #5837: PQstatus() fails to report lost connection

From

"Murray S. Kucherawy"

Date:

14 January 2011, 00:32:33

The following bug has been logged online:

Bug reference:      5837
Logged by:          Murray S. Kucherawy
Email address:      msk@cloudmark.com
PostgreSQL version: 9.0.2
Operating system:   FreeBSD
Description:        PQstatus() fails to report lost connection
Details:

I'm accessing libpq via OpenDBX, but it appears to be doing the right thing.
 This is reproducible.

1) establish a connection to postgresql
2) initiate a query, collect results, etc.; all normal
3) while client is idle, restart the server
4) initiate the very same query as before
5) call PQgetResult(), returns non-NULL
6) call PQresultStatus(), returns PGRES_FATAL_ERROR
7) call PQstatus(), returns CONNECTION_OK

This causes the caller not to try PQreset() or equivalent, because the
connection is presumably fine.  However, of course, no further queries will
succeed.

A call to PQerrorMessage() returns the string "FATAL: terminating connection
due to administrator command", which would be expected.

I would hate to have to have the application pull that out each time and
look for this specific string to detect the dead connection.

What happens internally prior to (7) appears to be a bug.  I looked through
the "TODO" list and didn't see any mention of this.

Re: BUG #5837: PQstatus() fails to report lost connection

From

Robert Haas

Date:

22 January 2011, 22:51:41

On Thu, Jan 13, 2011 at 8:36 PM, Murray S. Kucherawy <msk@cloudmark.com> wr=
ote:
>
> The following bug has been logged online:
>
> Bug reference: =A0 =A0 =A05837
> Logged by: =A0 =A0 =A0 =A0 =A0Murray S. Kucherawy
> Email address: =A0 =A0 =A0msk@cloudmark.com
> PostgreSQL version: 9.0.2
> Operating system: =A0 FreeBSD
> Description: =A0 =A0 =A0 =A0PQstatus() fails to report lost connection
> Details:
>
> I'm accessing libpq via OpenDBX, but it appears to be doing the right thi=
ng.
> =A0This is reproducible.
>
> 1) establish a connection to postgresql
> 2) initiate a query, collect results, etc.; all normal
> 3) while client is idle, restart the server
> 4) initiate the very same query as before
> 5) call PQgetResult(), returns non-NULL
> 6) call PQresultStatus(), returns PGRES_FATAL_ERROR
> 7) call PQstatus(), returns CONNECTION_OK
>
> This causes the caller not to try PQreset() or equivalent, because the
> connection is presumably fine. =A0However, of course, no further queries =
will
> succeed.

I can reproduce this by hacking up src/test/examples/testlibpq to
loop, but I'm not totally sure what's causing the behavior.  I think
the problem may be that libpq only reads enough from the connection to
get the FATAL error.  It doesn't keep reading to see whether there's
an EOF afterward, and thus doesn't immediately realize that the
connection has been closed.

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #5837: PQstatus() fails to report lost connection

From

Tom Lane

Date:

22 January 2011, 23:44:14

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 13, 2011 at 8:36 PM, Murray S. Kucherawy <msk@cloudmark.com> wrote:
>> 1) establish a connection to postgresql
>> 2) initiate a query, collect results, etc.; all normal
>> 3) while client is idle, restart the server
>> 4) initiate the very same query as before
>> 5) call PQgetResult(), returns non-NULL
>> 6) call PQresultStatus(), returns PGRES_FATAL_ERROR
>> 7) call PQstatus(), returns CONNECTION_OK

> I can reproduce this by hacking up src/test/examples/testlibpq to
> loop, but I'm not totally sure what's causing the behavior.

What did you do exactly?  testlibpq.c just uses PQexec(), and AFAICS the
connection status does end up BAD if the backend is terminated before a
PQexec starts.

> I think
> the problem may be that libpq only reads enough from the connection to
> get the FATAL error.  It doesn't keep reading to see whether there's
> an EOF afterward, and thus doesn't immediately realize that the
> connection has been closed.

I think the OP's mistake is to assume that the first PQgetResult ought
to set this.  As you say, it hasn't discovered the EOF condition at the
time it returns that error-message result, and we are certainly not
going to add another kernel call to every query cycle to check for EOF.

The reason I don't see a problem when using PQexec is that PQexec will
internally do another PQgetResult, and it's the second one that will
fail and reset the connection status.  In a non-connection-termination
situation, the second internal PQgetResult call consumes the
ReadyForQuery message, and at that point we fall out of PQexec (without
any extra kernel call).  Here, though, there won't be a ReadyForQuery,
and it's the read() to try to collect one that discovers the loss of
connection.

In short: I don't think there's a bug here, just failure to understand
proper use of PQgetResult.

            regards, tom lane

Re: BUG #5837: PQstatus() fails to report lost connection

From

Robert Haas

Date:

24 January 2011, 00:26:59

On Sat, Jan 22, 2011 at 10:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What did you do exactly?  testlibpq.c just uses PQexec(), and AFAICS the
> connection status does end up BAD if the backend is terminated before a
> PQexec starts.

PFA.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

testlibpq-loop-hack

Re: BUG #5837: PQstatus() fails to report lost connection

From

Robert Haas

Date:

24 January 2011, 00:58:54

On Sun, Jan 23, 2011 at 10:59 PM, Murray S. Kucherawy <msk@cloudmark.com> w=
rote:
> As for the reply above, I disagree. =A0PQstatus(), as documented, doesn't=
 say anything about certain conditions in which it won't report that the co=
nnection is dead, when it actually is, once the connection was already esta=
blished and working.

I understand that gut feeling, but I think if you think about it a
little, you may realize that at some level it's an unreasonable
expectation.  For example, suppose you were to connect to the server
(successfully), unplug the network cable, and then call PQstatus().
There is literally no way for PQstatus() to know whether that
connection is still there.  It's just returning some internal flag out
of the object.  And even if we wanted it to do something else, what?
Send a keepalive packet of some sort and wait for a response?  Given
default TCP parameter settings, that could take many minutes to give
an answer - which is probably not what people want when they call
PQstatus() - and it would introduce a lot of otherwise unnecessary
network traffic for people who want to query the internal state of the
PGconn, not ping the server.

Now, in this case, things are a bit less clear, because it's not as if
the remote side has given us no indication that the connection is
about to get closed.  This doesn't strike me as awesome protocol
design, but 14 years on we're probably stuck with it. I think if you
want to have some special handling when an administrative shutdown
happens on the server, you should use PQresultErrorField() to check
PG_DIAG_SQLSTATE; or if you want FATAL and PANIC conditions to be
handled specially you can check PG_DIAG_SEVERITY.

> Moreover, the description of PQgetResult() doesn't say or illustrate anyt=
hing about proper use of it in this context, so how would a reader know he/=
she got it wrong? =A0The documentation I can find online of PQgetResult() d=
oesn't enumerate the conditions where PQstatus() gives a false indication o=
f whether or not a reconnect is required, nor does any part of the document=
ation I could find state that PGRES_FATAL_ERROR always implies the connecti=
on is no longer usable and must be re-established; "FATAL" could be referri=
ng to the transaction/request, not the connection.
>
> So, if this isn't a bug, then I think the documentation needs a bit of wo=
rk in this area.

The description of PGRES_FATAL_ERROR in the documentation does pretty
much suck.  I believe you'll get that if the backend returns either an
ERROR (in which case you need to retry the transaction) or a FATAL (in
which case you need to reset the connection), but that's not at all
clear either from the documentation or the naming of the constant
(which, alas, is hard to change at this point for
backward-compatibility reasons).

The description of PQstatus() looks correct to me.  It says that "a
communications failure might result in the status changing to
CONNECTION_BAD prematurely".  In the scenario you describe, no
communications failure has occurred.  The server sent back an error
message, and closed the connection (though libpq hasn't noticed yet)
but there's no communications error anywhere until the client tries to
send a query over a connection that doesn't exist any more.  Maybe we
could flesh that description out a bit to make it more clear that this
is really only going to tell you if TCP/IP has explicitly blown up in
your face, and not any other reason why things might not be working,
although I think there are hints of that there already.

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Murray S. Kucherawy"

Date:

24 January 2011, 05:15:48

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Saturday, January 22, 2011 7:44 PM
> To: Robert Haas
> Cc: Murray S. Kucherawy; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>=20
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Jan 13, 2011 at 8:36 PM, Murray S. Kucherawy <msk@cloudmark.com=
> wrote:
> >> 1) establish a connection to postgresql
> >> 2) initiate a query, collect results, etc.; all normal
> >> 3) while client is idle, restart the server
> >> 4) initiate the very same query as before
> >> 5) call PQgetResult(), returns non-NULL
> >> 6) call PQresultStatus(), returns PGRES_FATAL_ERROR
> >> 7) call PQstatus(), returns CONNECTION_OK
>=20
> I think the OP's mistake is to assume that the first PQgetResult ought
> to set this.  As you say, it hasn't discovered the EOF condition at the
> time it returns that error-message result, and we are certainly not
> going to add another kernel call to every query cycle to check for EOF.
>=20
> The reason I don't see a problem when using PQexec is that PQexec will
> internally do another PQgetResult, and it's the second one that will
> fail and reset the connection status.  In a non-connection-termination
> situation, the second internal PQgetResult call consumes the
> ReadyForQuery message, and at that point we fall out of PQexec (without
> any extra kernel call).  Here, though, there won't be a ReadyForQuery,
> and it's the read() to try to collect one that discovers the loss of
> connection.
>=20
> In short: I don't think there's a bug here, just failure to understand
> proper use of PQgetResult.

For the sake of context, I'm using opendbx which is a layer between my appl=
ication and any SQL backend it can support.  The code that hits this proble=
m is in there.  See http://www.linuxnetworks.de/doc/index.php/OpenDBX.  I'l=
l forward this thread to that code's maintainer.

As for the reply above, I disagree.  PQstatus(), as documented, doesn't say=
 anything about certain conditions in which it won't report that the connec=
tion is dead, when it actually is, once the connection was already establis=
hed and working.

I would also suggest that everything you're describing is internal details =
of libpq implementation and not stuff that should be visible to consumers o=
f libpq.  The level of knowledge described strikes me as the kind of thing =
your average libpq consumer shouldn't need to have; it's exactly why you wa=
nt to have a library providing this sort of service in the first place.

Moreover, the description of PQgetResult() doesn't say or illustrate anythi=
ng about proper use of it in this context, so how would a reader know he/sh=
e got it wrong?  The documentation I can find online of PQgetResult() doesn=
't enumerate the conditions where PQstatus() gives a false indication of wh=
ether or not a reconnect is required, nor does any part of the documentatio=
n I could find state that PGRES_FATAL_ERROR always implies the connection i=
s no longer usable and must be re-established; "FATAL" could be referring t=
o the transaction/request, not the connection.

So, if this isn't a bug, then I think the documentation needs a bit of work=
 in this area.

-MSK

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Murray S. Kucherawy"

Date:

24 January 2011, 05:15:50

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Sunday, January 23, 2011 8:59 PM
> To: Murray S. Kucherawy
> Cc: Tom Lane; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>=20
> > As for the reply above, I disagree. =A0PQstatus(), as documented,
> > doesn't say anything about certain conditions in which it won't report
> > that the connection is dead, when it actually is, once the connection
> > was already established and working.
>=20
> I understand that gut feeling, but I think if you think about it a
> little, you may realize that at some level it's an unreasonable
> expectation.  For example, suppose you were to connect to the server
> (successfully), unplug the network cable, and then call PQstatus().
> There is literally no way for PQstatus() to know whether that
> connection is still there.  It's just returning some internal flag out
> of the object.

True, of course.  But that's not the case here; PQsendRequest() was used to=
 try to contact the server to submit some SQL operation after the server ha=
d been restarted.  Enough of that operation failed so as to make PQresultSt=
atus() return PQRES_FATAL_ERROR.  There was enough I/O for libpq to figure =
out that something has gone wrong, in fact enough to know that the cause wa=
s administrative termination of the server for some reason (since the error=
 string returned is "FATAL: terminating connection due to administrator com=
mand").  Basically it appears, then, that enough communication (or attempts=
 at it) took place to observe that the connection not usable anymore and ma=
rk it as such, and even why.

Again, I'm looking at this as a consumer of the API; I don't know, and to s=
ome degree shouldn't care, what's going on inside.  The results simply look=
 inconsistent.

I don't expect libpq to do any kind of keep-alive work, but once something =
has reported PGRES_FATAL_ERROR, I would expect the result returned by PQsta=
tus() to be accurate.  In this case, it isn't.  It gives the impression tha=
t an incomplete state change occurred or something like that.

> Now, in this case, things are a bit less clear, because it's not as if
> the remote side has given us no indication that the connection is
> about to get closed.  This doesn't strike me as awesome protocol
> design, but 14 years on we're probably stuck with it. I think if you
> want to have some special handling when an administrative shutdown
> happens on the server, you should use PQresultErrorField() to check
> PG_DIAG_SQLSTATE; or if you want FATAL and PANIC conditions to be
> handled specially you can check PG_DIAG_SEVERITY.

The model in OpenDBX appears to be to query and then iterate getting result=
s until there are none; if there's ever an error, see if it's a kind of err=
or that requires a reconnection attempt before continuing.  Without assumin=
g PGRES_FATAL_ERROR always means a reconnect is required, it seems like PQs=
tatus() is the right test for the latter of those two "ifs", and that's wha=
t they're doing.  It doesn't really matter if the disconnect was a result o=
f administrative action or not; the caller just needs to know if it has to =
try to reconnect before continuing.

> The description of PGRES_FATAL_ERROR in the documentation does pretty
> much suck.  I believe you'll get that if the backend returns either an
> ERROR (in which case you need to retry the transaction) or a FATAL (in
> which case you need to reset the connection), but that's not at all
> clear either from the documentation or the naming of the constant
> (which, alas, is hard to change at this point for
> backward-compatibility reasons).

Well my other suggestion would be to assume PGRES_FATAL_ERROR always means =
the connection needs to be reset.  But this blows that idea away; this woul=
d cause a connection reset that wouldn't actually solve the problem when it=
's an ERROR as you described.

> The description of PQstatus() looks correct to me.  It says that "a
> communications failure might result in the status changing to
> CONNECTION_BAD prematurely".  In the scenario you describe, no
> communications failure has occurred.  The server sent back an error
> message, and closed the connection (though libpq hasn't noticed yet)
> but there's no communications error anywhere until the client tries to
> send a query over a connection that doesn't exist any more.  Maybe we
> could flesh that description out a bit to make it more clear that this
> is really only going to tell you if TCP/IP has explicitly blown up in
> your face, and not any other reason why things might not be working,
> although I think there are hints of that there already.

Given what you say here, this seems to be an exception for which there is c=
urrently no detection mechanism short of looking for that one specific erro=
r string indicating it was administrative action causing the error.  I thin=
k your suggested changes would certainly help, as well as some generalized =
way to know that the connection handle is now unusable, either because of a=
dministrative server action or because the connection became unreliable for=
 some other reason.  I'd be fine with having to do PQstatus() || PQadminSta=
tus() too, if such a thing were to appear.  Better documentation of PGRES_F=
ATAL_ERROR would help as well.

-MSK

Re: BUG #5837: PQstatus() fails to report lost connection

From

Robert Haas

Date:

24 January 2011, 07:55:19

On Mon, Jan 24, 2011 at 12:59 AM, Murray S. Kucherawy <msk@cloudmark.com> w=
rote:
> Well my other suggestion would be to assume PGRES_FATAL_ERROR always mean=
s the connection needs to be reset. =A0But this blows that idea away; this =
would cause a connection reset that wouldn't actually solve the problem whe=
n it's an ERROR as you described.

Well, actually it's the other way around: resetting the connection
would *work* in the case of an ERROR, but it'd be overkill.  Most of
the things that can go wrong are ERROR rather than FATAL, and you just
issue a ROLLBACK to balance out any BEGIN you previously issued, and
then do whatever it is you want to do next.  A full connection reset
would be pretty expensive in this context and, as you might guess,
there are a lot more things that cause ERROR than there are that cause
FATAL.

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #5837: PQstatus() fails to report lost connection

From

Tom Lane

Date:

24 January 2011, 11:34:36

Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jan 22, 2011 at 10:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What did you do exactly?  testlibpq.c just uses PQexec(), and AFAICS the
>> connection status does end up BAD if the backend is terminated before a
>> PQexec starts.

> PFA.

Yeah, that was my first cut at it too, but it's not terribly interesting
because it doesn't tell you anything about what PQstatus says after the
first failed PQexec.

            regards, tom lane

Re: BUG #5837: PQstatus() fails to report lost connection

From

Tom Lane

Date:

24 January 2011, 11:35:23

"Murray S. Kucherawy" <msk@cloudmark.com> writes:
> Given what you say here, this seems to be an exception for which there
> is currently no detection mechanism short of looking for that one
> specific error string indicating it was administrative action causing
> the error.

This is complete nonsense.  If you followed the documented method for
using PQgetResult -- ie, keep calling it till it returns NULL, rather
than assuming there will be a specific number of results --- then
PQstatus would show you the bad status afterwards.

            regards, tom lane

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Murray S. Kucherawy"

Date:

24 January 2011, 13:40:29

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Monday, January 24, 2011 7:30 AM
> To: Murray S. Kucherawy
> Cc: Robert Haas; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>=20
> "Murray S. Kucherawy" <msk@cloudmark.com> writes:
> > Given what you say here, this seems to be an exception for which there
> > is currently no detection mechanism short of looking for that one
> > specific error string indicating it was administrative action causing
> > the error.
>=20
> This is complete nonsense.  If you followed the documented method for
> using PQgetResult -- ie, keep calling it till it returns NULL, rather
> than assuming there will be a specific number of results --- then
> PQstatus would show you the bad status afterwards.

Your assertion here of "Read the result set to exhaustion, THEN check conne=
ction status to see if it was bad to begin with," can be politely described=
 as counter-intuitive.

Please, at a minimum, add some documentation about it.

Re: BUG #5837: PQstatus() fails to report lost connection

From

Tom Lane

Date:

24 January 2011, 13:41:46

"Murray S. Kucherawy" <msk@cloudmark.com> writes:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> This is complete nonsense.  If you followed the documented method for
>> using PQgetResult -- ie, keep calling it till it returns NULL, rather
>> than assuming there will be a specific number of results --- then
>> PQstatus would show you the bad status afterwards.

> Your assertion here of "Read the result set to exhaustion, THEN check connection status to see if it was bad to begin
with,"can be politely described as counter-intuitive. 

[ shrug... ]  I guess if your intuition is that PQstatus should be
expected to intuit the failure of a call that hasn't happened yet,
then it's counter-intuitive.

            regards, tom lane

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Kevin Grittner"

Date:

24 January 2011, 13:50:11

"Murray S. Kucherawy" <msk@cloudmark.com> wrote:

> Please, at a minimum, add some documentation about it.

Current documentation at:

http://www.postgresql.org/docs/9.0/interactive/libpq-async.html

says:

| PQgetResult must be called repeatedly until it returns a null
| pointer, indicating that the command is done.

What do you think would make this more clear?

-Kevin

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Murray S. Kucherawy"

Date:

24 January 2011, 14:35:57

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Monday, January 24, 2011 9:42 AM
> To: Murray S. Kucherawy
> Cc: Robert Haas; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>=20
> >> This is complete nonsense.  If you followed the documented method for
> >> using PQgetResult -- ie, keep calling it till it returns NULL, rather
> >> than assuming there will be a specific number of results --- then
> >> PQstatus would show you the bad status afterwards.
>=20
> > Your assertion here of "Read the result set to exhaustion, THEN check
> > connection status to see if it was bad to begin with," can be politely
> > described as counter-intuitive.
>=20
> [ shrug... ]  I guess if your intuition is that PQstatus should be
> expected to intuit the failure of a call that hasn't happened yet,
> then it's counter-intuitive.

You're mixing internal implementation details with what the API consumer se=
es.  I don't expect PQstatus() to do anything other than tell me that a con=
nection is no longer usable, which is what it's documented to do.  I'm fine=
 with checking it after an error has been returned by some query-related fu=
nction, which is what OpenDBX is doing.  I don't know why that's "complete =
nonsense".

Look at the order of operations here:

0) A previous connection exists.  The server is manually restarted.

1) A query is issued.  This succeeds.

2) A request for results is issued.  This returns non-NULL, apparently indi=
cating at least one result object came back.

3) PQresultStatus() is called, and it returns PGRES_FATAL_ERROR.

4) PQstatus() is checked, and indicates no problem.

4a) A call to PQresultErrorMessage() indicates that the library knows the c=
onnection was administratively reset.

As a consumer of the API, not knowing (or wanting to know) what's going on =
under-the-hood, this looks like a bug; the library certainly did know that =
the connection can no longer be used or 4a wouldn't happen, but PQstatus() =
is not saying so.  The documentation I see at http://www.postgresql.org/doc=
s/9.0/interactive/libpq-status.html doesn't lead me to believe that any of =
this logic is faulty.  That I have to iterate on step 2 above to get PQstat=
us() to eventually tell me the truth even though I know something has gone =
wrong already seems strange, and that's why I raised this issue.

Another possible explanation is that PQresultStatus() doesn't give a consis=
tent result until after PQgetResult() has returned NULL.  If that's the cas=
e, it should be documented, because right now it isn't.

-MSK

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Murray S. Kucherawy"

Date:

24 January 2011, 14:41:47

> -----Original Message-----
> From: Kevin Grittner [mailto:Kevin.Grittner@wicourts.gov]
> Sent: Monday, January 24, 2011 9:50 AM
> To: Murray S. Kucherawy; Tom Lane
> Cc: Robert Haas; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>=20
> "Murray S. Kucherawy" <msk@cloudmark.com> wrote:
>=20
> > Please, at a minimum, add some documentation about it.
>=20
> Current documentation at:
>=20
> http://www.postgresql.org/docs/9.0/interactive/libpq-async.html
>=20
> says:
>=20
> | PQgetResult must be called repeatedly until it returns a null
> | pointer, indicating that the command is done.
>=20
> What do you think would make this more clear?

"command is done" sounds like normal operation to me, but we're talking abo=
ut an obvious exception here.  The command (I assume that means the query t=
hat ultimately failed) can't even have started because the connection was g=
one.  That's why I find the description misleading.

Indeed, the function order you're talking about is PQgetResult(), which ret=
urns non-NULL in this condition (which looks like there's a result availabl=
e), but then PQresultStatus() returns PGRES_FATAL_ERROR.  It's not very obv=
ious to me that the right thing to do here is call PQgetResult() again just=
 to get PQstatus() to tell me the truth.  Now we're dealing with an excepti=
on rather than something normal.

So maybe something like this after the paragraph you cited would help:

"Note that after returning a PGresult object, PQresultStatus() could indica=
te a fatal error.  The caller should still iterate calling PQgetResult() to=
 completion (i.e. until it returns NULL) in order to allow libpq to process=
 the error information completely."

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Kevin Grittner"

Date:

24 January 2011, 14:47:32

"Murray S. Kucherawy" <msk@cloudmark.com> wrote:
> From: Kevin Grittner [mailto:Kevin.Grittner@wicourts.gov]

>> What do you think would make this more clear?

> So maybe something like this after the paragraph you cited would
> help:
>
> "Note that after returning a PGresult object, PQresultStatus()
> could indicate a fatal error.  The caller should still iterate
> calling PQgetResult() to completion (i.e. until it returns NULL)
> in order to allow libpq to process the error information
> completely."

A patch based on this suggestion attached for consideration by the
community.

-Kevin

Attachment

PQgetResult-doc.patch

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Murray S. Kucherawy"

Date:

24 January 2011, 15:02:07

> -----Original Message-----
> From: Kevin Grittner [mailto:Kevin.Grittner@wicourts.gov]
> Sent: Monday, January 24, 2011 10:47 AM
> To: Murray S. Kucherawy; Tom Lane
> Cc: Robert Haas; pgsql-bugs@postgresql.org
> Subject: RE: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>=20
> A patch based on this suggestion attached for consideration by the
> community.

That would work for me, thanks.

-MSK

Re: BUG #5837: PQstatus() fails to report lost connection

From

Robert Haas

Date:

25 January 2011, 11:32:50

On Mon, Jan 24, 2011 at 1:47 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> "Murray S. Kucherawy" <msk@cloudmark.com> wrote:
>> From: Kevin Grittner [mailto:Kevin.Grittner@wicourts.gov]
>
>>> What do you think would make this more clear?
>
>> So maybe something like this after the paragraph you cited would
>> help:
>>
>> "Note that after returning a PGresult object, PQresultStatus()
>> could indicate a fatal error. =A0The caller should still iterate
>> calling PQgetResult() to completion (i.e. until it returns NULL)
>> in order to allow libpq to process the error information
>> completely."
>
> A patch based on this suggestion attached for consideration by the
> community.

I think this patch would only be adding to the confusion.  When
PQgetResult() is called, we read enough data from the connection to
create and return one result object.  It's true that this doesn't
necessarily detect an EOF, but IIUC calling PQgetResult() again is
just ONE way that you could trigger another read against the socket,
not the only one.  I think it would also work to call
PQconsumeInput(), for example.

I think the real, underlying problem here is that Murray would like a
behavior change: when a FATAL or PANIC condition is reported by the
server, he'd like the client to immediately close the socket and set
its status to CONNECTION_BAD.  If we're going to clarify the
documentation, I think that the lack of that behavior is what we
should try to clarify.  For example, it would be good to mention that
a PGRES_FATAL_ERROR is might corresponding to the server levels ERROR,
FATAL, or PANIC, and that only in the latter two cases is the
connection presumably of no further use; and we could also mention
that libpq doesn't understand anything about the meanings of those
error levels, so that PQstatus() won't automatically barf just because
the server has sent a FATAL, *unless* it's read far enough in the
input stream to detect the problem.

Perhaps, as Murray says, this is exposing an implementation detail.
But in my experience when coding against libraries written by other
people it's often necessary to have at least a superficial
understanding of what operations they are performing under the hood.
I don't think there's any way we can eliminate that problem
completely.  What we CAN try to do is make sure that the documentation
clearly explains enough of what's going on under the hood to allow
people to code against it successfully, without (hopefully) going
overboard into irrelevant details that the user doesn't need to care
about.

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Kevin Grittner"

Date:

25 January 2011, 11:54:23

Robert Haas <robertmhaas@gmail.com> wrote:

> I think this patch would only be adding to the confusion.  When
> PQgetResult() is called, we read enough data from the connection
> to create and return one result object.  It's true that this
> doesn't necessarily detect an EOF, but IIUC calling PQgetResult()
> again is just ONE way that you could trigger another read against
> the socket, not the only one.  I think it would also work to call
> PQconsumeInput(), for example.

I find it hard to reconcile the above with this:

http://archives.postgresql.org/message-id/6493.1295882981@sss.pgh.pa.us

and the quote from our documentation referenced here:

http://archives.postgresql.org/message-id/4D3D67600200002500039B2C@gw.wicourts.gov

> I think the real, underlying problem here is that Murray would
> like a behavior change

More than that I think he wants to be able to read the manual and
know what will work, without spending loads of time getting in tune
with The Tao of Libpq.  Based on his initial reading of the docs he
expected different behavior; that can be fixed by changing the
behavior or changing the docs.

-Kevin

Re: BUG #5837: PQstatus() fails to report lost connection

From

Robert Haas

Date:

25 January 2011, 15:09:36

On Tue, Jan 25, 2011 at 10:54 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this patch would only be adding to the confusion. =A0When
>> PQgetResult() is called, we read enough data from the connection
>> to create and return one result object. =A0It's true that this
>> doesn't necessarily detect an EOF, but IIUC calling PQgetResult()
>> again is just ONE way that you could trigger another read against
>> the socket, not the only one. =A0I think it would also work to call
>> PQconsumeInput(), for example.
>
> I find it hard to reconcile the above with this:
>
> http://archives.postgresql.org/message-id/6493.1295882981@sss.pgh.pa.us
>
> and the quote from our documentation referenced here:
>
> http://archives.postgresql.org/message-id/4D3D67600200002500039B2C@gw.wic=
ourts.gov

IIUC, Tom's point was that doing it that way would detect the error,
not that it was the ONLY way to detect the error.

But it's easily testable.

>> I think the real, underlying problem here is that Murray would
>> like a behavior change
>
> More than that I think he wants to be able to read the manual and
> know what will work, without spending loads of time getting in tune
> with The Tao of Libpq. =A0Based on his initial reading of the docs he
> expected different behavior; that can be fixed by changing the
> behavior or changing the docs.

That is why I suggested the type of doc correction that I thought
would be most helpful and accurate.

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #5837: PQstatus() fails to report lost connection

From

"Murray S. Kucherawy"

Date:

25 January 2011, 17:20:54

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Tuesday, January 25, 2011 7:33 AM
> To: Kevin Grittner
> Cc: Murray S. Kucherawy; Tom Lane; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>=20
> I think the real, underlying problem here is that Murray would like a
> behavior change: when a FATAL or PANIC condition is reported by the
> server, he'd like the client to immediately close the socket and set
> its status to CONNECTION_BAD.  If we're going to clarify the
> documentation, I think that the lack of that behavior is what we
> should try to clarify.  For example, it would be good to mention that
> a PGRES_FATAL_ERROR is might corresponding to the server levels ERROR,
> FATAL, or PANIC, and that only in the latter two cases is the
> connection presumably of no further use; and we could also mention
> that libpq doesn't understand anything about the meanings of those
> error levels, so that PQstatus() won't automatically barf just because
> the server has sent a FATAL, *unless* it's read far enough in the
> input stream to detect the problem.

First and foremost, I think a behavior change would be the right thing to d=
o.  I can't think of any API that follows this odd model; to me it's like s=
aying select() has reported a socket that's both read-ready and in an excep=
tion state, but we won't actually set errno until you iterate on read() unt=
il EOF first.  Counter-intuitive behaviors in APIs lead to application prob=
lems or even user attrition in extreme cases.

That said, I've maintained complex APIs before (in fact, this is all the re=
sult of one of them) and I understand what this sort of a change might enta=
il.  Thus, the next-best option is to document the idiosyncrasy that's been=
 uncovered; i.e., make the proposed change to the documentation with perhap=
s an added hint as to why it's necessary.  You might want to demonstrate wh=
y after one iteration PGgetResult() and PQstatus() seem to indicate no prob=
lem while PGresultStatus() and PGresultErrorString() both already seem to k=
now the severity and even the nature of the error.

I think Kevin summed it up nicely.  Quite simply, the documentation doesn't=
 match the behavior in this case.  One or the other should be adjusted so t=
hey are in alignment, while disrupting both your installed base and user ex=
perience as little as possible.

-MSK

Re: BUG #5837: PQstatus() fails to report lost connection

From

Bruce Momjian

Date:

11 March 2011, 08:54:29

Robert Haas wrote:
> On Tue, Jan 25, 2011 at 10:54 AM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
> > Robert Haas <robertmhaas@gmail.com> wrote:
> >> I think this patch would only be adding to the confusion. ?When
> >> PQgetResult() is called, we read enough data from the connection
> >> to create and return one result object. ?It's true that this
> >> doesn't necessarily detect an EOF, but IIUC calling PQgetResult()
> >> again is just ONE way that you could trigger another read against
> >> the socket, not the only one. ?I think it would also work to call
> >> PQconsumeInput(), for example.
> >
> > I find it hard to reconcile the above with this:
> >
> > http://archives.postgresql.org/message-id/6493.1295882981@sss.pgh.pa.us
> >
> > and the quote from our documentation referenced here:
> >
> > http://archives.postgresql.org/message-id/4D3D67600200002500039B2C@gw.wicourts.gov
>
> IIUC, Tom's point was that doing it that way would detect the error,
> not that it was the ONLY way to detect the error.
>
> But it's easily testable.
>
> >> I think the real, underlying problem here is that Murray would
> >> like a behavior change
> >
> > More than that I think he wants to be able to read the manual and
> > know what will work, without spending loads of time getting in tune
> > with The Tao of Libpq. ?Based on his initial reading of the docs he
> > expected different behavior; that can be fixed by changing the
> > behavior or changing the docs.
>
> That is why I suggested the type of doc correction that I thought
> would be most helpful and accurate.

Doc patch attached and applied.  I used "should be called" instead of
"must".

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
new file mode 100644
index 49edc51..59b4011
*** a/doc/src/sgml/libpq.sgml
--- b/doc/src/sgml/libpq.sgml
*************** PGresult *PQgetResult(PGconn *conn);
*** 3846,3851 ****
--- 3846,3860 ----
         active and the necessary response data has not yet been read by
         <function>PQconsumeInput</function>.
        </para>
+
+       <note>
+        <para>
+         Even when <function>PQresultStatus</function> indicates a fatal
+         error, <function>PQgetResult</function> should be called until it
+         returns a null pointer to allow <application>libpq</> to
+         process the error information completely.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
     </variablelist>

Re: BUG #5837: PQstatus() fails to report lost connection

From

Robert Haas

Date:

11 March 2011, 10:32:56

On Fri, Mar 11, 2011 at 5:56 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Tue, Jan 25, 2011 at 10:54 AM, Kevin Grittner
>> <Kevin.Grittner@wicourts.gov> wrote:
>> > Robert Haas <robertmhaas@gmail.com> wrote:
>> >> I think this patch would only be adding to the confusion. ?When
>> >> PQgetResult() is called, we read enough data from the connection
>> >> to create and return one result object. ?It's true that this
>> >> doesn't necessarily detect an EOF, but IIUC calling PQgetResult()
>> >> again is just ONE way that you could trigger another read against
>> >> the socket, not the only one. ?I think it would also work to call
>> >> PQconsumeInput(), for example.
>> >
>> > I find it hard to reconcile the above with this:
>> >
>> > http://archives.postgresql.org/message-id/6493.1295882981@sss.pgh.pa.us
>> >
>> > and the quote from our documentation referenced here:
>> >
>> > http://archives.postgresql.org/message-id/4D3D67600200002500039B2C@gw.=
wicourts.gov
>>
>> IIUC, Tom's point was that doing it that way would detect the error,
>> not that it was the ONLY way to detect the error.
>>
>> But it's easily testable.
>>
>> >> I think the real, underlying problem here is that Murray would
>> >> like a behavior change
>> >
>> > More than that I think he wants to be able to read the manual and
>> > know what will work, without spending loads of time getting in tune
>> > with The Tao of Libpq. ?Based on his initial reading of the docs he
>> > expected different behavior; that can be fixed by changing the
>> > behavior or changing the docs.
>>
>> That is why I suggested the type of doc correction that I thought
>> would be most helpful and accurate.
>
> Doc patch attached and applied. =A0I used "should be called" instead of
> "must".

I notice that your patch has exactly the same conceptual flaw I
complained about with respect to the previous version.

But I'm not sure it's worth arguing about...

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #5837: PQstatus() fails to report lost connection

From

Bruce Momjian

Date:

11 March 2011, 10:46:35

Robert Haas wrote:
> On Fri, Mar 11, 2011 at 5:56 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > Robert Haas wrote:
> >> On Tue, Jan 25, 2011 at 10:54 AM, Kevin Grittner
> >> <Kevin.Grittner@wicourts.gov> wrote:
> >> > Robert Haas <robertmhaas@gmail.com> wrote:
> >> >> I think this patch would only be adding to the confusion. ?When
> >> >> PQgetResult() is called, we read enough data from the connection
> >> >> to create and return one result object. ?It's true that this
> >> >> doesn't necessarily detect an EOF, but IIUC calling PQgetResult()
> >> >> again is just ONE way that you could trigger another read against
> >> >> the socket, not the only one. ?I think it would also work to call
> >> >> PQconsumeInput(), for example.
> >> >
> >> > I find it hard to reconcile the above with this:
> >> >
> >> > http://archives.postgresql.org/message-id/6493.1295882981@sss.pgh.pa.us
> >> >
> >> > and the quote from our documentation referenced here:
> >> >
> >> > http://archives.postgresql.org/message-id/4D3D67600200002500039B2C@gw.wicourts.gov
> >>
> >> IIUC, Tom's point was that doing it that way would detect the error,
> >> not that it was the ONLY way to detect the error.
> >>
> >> But it's easily testable.
> >>
> >> >> I think the real, underlying problem here is that Murray would
> >> >> like a behavior change
> >> >
> >> > More than that I think he wants to be able to read the manual and
> >> > know what will work, without spending loads of time getting in tune
> >> > with The Tao of Libpq. ?Based on his initial reading of the docs he
> >> > expected different behavior; that can be fixed by changing the
> >> > behavior or changing the docs.
> >>
> >> That is why I suggested the type of doc correction that I thought
> >> would be most helpful and accurate.
> >
> > Doc patch attached and applied. ?I used "should be called" instead of
> > "must".
>
> I notice that your patch has exactly the same conceptual flaw I
> complained about with respect to the previous version.
>
> But I'm not sure it's worth arguing about...

You mentioned that other functions could be used, so I said "should"
rather than "must".  Other ideas?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BUG #5837: PQstatus() fails to report lost connection

From

Merlin Moncure

Date:

11 March 2011, 11:13:02

On Tue, Jan 25, 2011 at 12:01 PM, Murray S. Kucherawy <msk@cloudmark.com> w=
rote:
>> -----Original Message-----
>> From: Robert Haas [mailto:robertmhaas@gmail.com]
>> Sent: Tuesday, January 25, 2011 7:33 AM
>> To: Kevin Grittner
>> Cc: Murray S. Kucherawy; Tom Lane; pgsql-bugs@postgresql.org
>> Subject: Re: [BUGS] BUG #5837: PQstatus() fails to report lost connection
>>
>> I think the real, underlying problem here is that Murray would like a
>> behavior change: when a FATAL or PANIC condition is reported by the
>> server, he'd like the client to immediately close the socket and set
>> its status to CONNECTION_BAD. =A0If we're going to clarify the
>> documentation, I think that the lack of that behavior is what we
>> should try to clarify. =A0For example, it would be good to mention that
>> a PGRES_FATAL_ERROR is might corresponding to the server levels ERROR,
>> FATAL, or PANIC, and that only in the latter two cases is the
>> connection presumably of no further use; and we could also mention
>> that libpq doesn't understand anything about the meanings of those
>> error levels, so that PQstatus() won't automatically barf just because
>> the server has sent a FATAL, *unless* it's read far enough in the
>> input stream to detect the problem.
>
> First and foremost, I think a behavior change would be the right thing to=
 do. =A0I can't think of any API that follows this odd model; to me it's li=
ke saying select() has reported a socket that's both read-ready and in an e=
xception state, but we won't actually set errno until you iterate on read()=
 until EOF first. =A0Counter-intuitive behaviors in APIs lead to applicatio=
n problems or even user attrition in extreme cases.

libpq is odd and clunky on many levels; it's old, but it's also very
rich and useful. many people here would admit that given infinite time
and resources it could use a rewrite.  however. this is not high at
all on the priority list because at the end of the day you can make it
do just about anything you really need it to do, and the old api would
have to be maintained for a good while.

just wrap the behavior you don't like out -- libpq simply demands a
good set of wrappers unfortunately.

merlin