Re: BUG #17391: While using --with-ssl=openssl and PG_TEST_EXTRA='ssl' options, SSL tests fail on OpenBSD 7.0 - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #17391: While using --with-ssl=openssl and PG_TEST_EXTRA='ssl' options, SSL tests fail on OpenBSD 7.0
Date
Msg-id 2187263.1644616494@sss.pgh.pa.us
Whole thread Raw
In response to Re: BUG #17391: While using --with-ssl=openssl and PG_TEST_EXTRA='ssl' options, SSL tests fail on OpenBSD 7.0  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #17391: While using --with-ssl=openssl and PG_TEST_EXTRA='ssl' options, SSL tests fail on OpenBSD 7.0  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
I wrote:
>> Hmm, maybe.  It sure *looks* like we need to do

>> -    return conn->asyncStatus == PGASYNC_BUSY || conn->write_failed;
>> +    return conn->asyncStatus == PGASYNC_BUSY && !conn->write_failed;

> Dunno anything about the Windows angle, but after hacking together
> a reproducer for this, I'm fairly convinced that this is indeed a bug.

After further study, I now think that checking write_failed here is
the wrong thing entirely, because it's irrelevant to what we want to
do, which is keep reading until we've collected a server message or
seen read EOF.

What we do need to do though is check to see if we've closed the
socket.  That's because places like libpqwalreceiver.c assume
that if PQisBusy is true, then PQsocket() must be valid:

    while (PQisBusy(streamConn))
    {
        int            rc;

        rc = WaitLatchOrSocket(MyLatch,
                               WL_EXIT_ON_PM_DEATH | WL_SOCKET_READABLE |
                               WL_LATCH_SET,
                               PQsocket(streamConn),
                               0,
                               WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE);

It seems like the existing coding in PQisBusy could allow that to fail.
I scraped the buildfarm for instances of the "cannot wait on socket
event without a socket" error that latch.c would emit if that happened,
but found none going back three months.  That's perhaps because of the
other behavior I noted that if the walsender crashes, we'll probably
report write failure immediately in PQsendQuery.  So maybe this is
mostly hypothetical in practice, but I think what we actually want
here is as attached.

In any case, I still don't see a way for this error to result in
an infinite loop, so it doesn't seem like an explanation for the
Windows problems.

            regards, tom lane

diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 59121873d2..9afd4d88b4 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -1957,10 +1957,14 @@ PQisBusy(PGconn *conn)
     parseInput(conn);
 
     /*
-     * PQgetResult will return immediately in all states except BUSY, or if we
-     * had a write failure.
+     * PQgetResult will return immediately in all states except BUSY.  Also,
+     * if we've detected read EOF and dropped the connection, we can expect
+     * that PQgetResult will fail immediately.  Note that we do *not* check
+     * conn->write_failed here --- once that's become set, we know we have
+     * trouble, but we need to keep trying to read until we have a complete
+     * server message or detect read EOF.
      */
-    return conn->asyncStatus == PGASYNC_BUSY || conn->write_failed;
+    return conn->asyncStatus == PGASYNC_BUSY && conn->status != CONNECTION_BAD;
 }
 
 /*

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: ERROR: XX000: variable not found in subplan target list
Next
From: Troy Frericks
Date:
Subject: A bug with the TimeStampTZ data type and the 'AT TIME ZONE' clause