Re: dblink query interruptibility - Mailing list pgsql-hackers

From Noah Misch
Subject Re: dblink query interruptibility
Date
Msg-id 20240127023539.5f.nmisch@google.com
Whole thread Raw
In response to Re: dblink query interruptibility  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
On Thu, Jan 25, 2024 at 12:28:43PM +0900, Fujii Masao wrote:
> On Thu, Jan 25, 2024 at 5:45 AM Noah Misch <noah@leadboat.com> wrote:
> > On Thu, Jan 25, 2024 at 04:23:39AM +0900, Fujii Masao wrote:
> > > I found that this patch was committed at d3c5f37dd5 and changed the
> > > error message in postgres_fdw slightly. Here's an example:
> > >
> > > #1. Begin a new transaction.
> > > #2. Execute a query accessing to a foreign table, like SELECT * FROM
> > > <foreign table>
> > > #3. Terminate the *remote* session corresponding to the foreign table.
> > > #4. Commit the transaction, and then currently the following error
> > > message is output.
> > >
> > >     ERROR:  FATAL:  terminating connection due to administrator command
> > >     server closed the connection unexpectedly
> > >         This probably means the server terminated abnormally
> > >         before or while processing the request.
> > >    invalid socket
> > >
> > > Previously, before commit d3c5f37dd5, the error message at #4 did not
> > > include "invalid socket." Now, after the commit, it does.

Other clients have witnessed the extra "invalid socket" message:
https://dba.stackexchange.com/questions/335081/how-to-investigate-intermittent-postgres-connection-errors
https://stackoverflow.com/questions/77781358/pgbackrest-backup-error-with-exit-code-57
https://github.com/timescale/timescaledb/issues/102

> > > + /* Consume whatever data is available from the socket */
> > > + if (PQconsumeInput(conn) == 0)
> > > + {
> > > + /* trouble; expect PQgetResult() to return NULL */
> > > + break;
> > > + }
> > > + }
> > > +
> > > + /* Now we can collect and return the next PGresult */
> > > + return PQgetResult(conn);
> > >
> > > This code appears to cause the change. When the remote session ends,
> > > PQconsumeInput() returns 0 and marks conn->socket as invalid.
> > > Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
> > > and appending "invalid socket" to the error message.

What do you think of making PQconsumeInput() set PGASYNC_READY and
CONNECTION_BAD in this case?  Since libpq appended "server closed the
connection unexpectedly", it knows those indicators are correct.  That way,
PQgetResult() won't issue a pointless pqWait() call.

> > > I think the "invalid socket" message is unsuitable in this scenario,
> > > and PQgetResult() should not be called after PQconsumeInput() returns
> > > 0. Thought?
> >
> > The documentation is absolute about the necessity of PQgetResult():
> 
> The documentation looks unclear to me regarding what should be done
> when PQconsumeInput() returns 0. So I'm not sure if PQgetResult()
> must be called even in that case.

I agree PQconsumeInput() docs don't specify how to interpret it returning 0.

> As far as I read some functions like libpqrcv_PQgetResult() that use
> PQconsumeInput(), it appears that they basically report the error message
> using PQerrorMessage(), without calling PQgetResult(),
> when PQconsumeInput() returns 0.

libpqrcv_PQgetResult() is part of walreceiver, where any ERROR becomes FATAL.
Hence, it can't hurt anything by eagerly skipping to ERROR.  I designed
libpqsrv_exec() to mimic PQexec() as closely as possible, so it would be a
drop-in replacement for arbitrary callers.  Ideally, accepting interrupts
would be the only caller-visible difference.

I know of three ways PQconsumeInput() can return 0, along with my untested
estimates of how they work:

a. Protocol violation.  handleSyncLoss() sets PGASYNC_READY and
   CONNECTION_BAD.  PQgetResult() is optional.

b. Connection broken.  PQgetResult() is optional.

c. ENOMEM.  PGASYNC_ and CONNECTION_ status don't change.  Applications choose
   among (c1) free memory and retry, (c2) close the connection, or (c3) call
   PQgetResult() to break protocol sync and set PGASYNC_IDLE:

Comparing PQconsumeInput() with the PQgetResult() block under "while
(conn->asyncStatus == PGASYNC_BUSY)", there's a key difference that
PQgetResult() sets PGASYNC_IDLE on most errors, including ENOMEM.  That
prevents PQexec() subroutine PQexecFinish() from busy looping on ENOMEM, but I
suspect that also silently breaks protocol sync.  While we could change it
from (c3) to (c2) by dropping the connection via handleSyncLoss() or
equivalent, I'm not confident about that being better.

libpqsrv_exec() implements (c3) by way of calling PQgetResult() after
PQconsumeInput() fails.  If PQisBusy(), the same ENOMEM typically will repeat,
yielding (c3).  If memory became available in that brief time, PQgetResult()
may instead block.  That blocking is unwanted but unimportant.



pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Re: Request for comment on setting binary format output per session
Next
From: vignesh C
Date:
Subject: Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning