Re: Queries that should be canceled will get stuck on secure_write function - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Queries that should be canceled will get stuck on secure_write function
Msg-id: CA+TgmoY-5vO2qgGOSLqJDX0DjC34rXEOr+6ooBy2byQr4WJ89A@mail.gmail.com
In response to: Re: Queries that should be canceled will get stuck on secure_write function ("Andres Freund" <andres@anarazel.de>)
List: pgsql-hackers
On Fri, Aug 27, 2021 at 4:10 PM Andres Freund <andres@anarazel.de> wrote:
> On Fri, Aug 27, 2021, at 13:07, Robert Haas wrote:
> > On Fri, Aug 27, 2021 at 3:24 PM Andres Freund <andres@anarazel.de> wrote:
> > > I wonder if we could improve the situation somewhat by using the non-blocking
> > > pqcomm functions in a few select places. E.g. if elog.c's
> > > send_message_to_frontend() sent its message via a new pq_endmessage_noblock()
> > > (which'd use the existing pq_putmessage_noblock()) and used
> > > pq_flush_if_writable() instead of pq_flush(), we'd a) not block sending to the
> > > client before AbortCurrentTransaction(), b) be able to queue further
> > > error messages safely.
> >
> > pq_flush_if_writable() could succeed in sending only part of the data,
> > so I don't see how this works.
>
> All the data is buffered though, so I don't see what problem that causes?

OK, I guess I'm confused here.
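
If I'm reading you right, the proposal amounts to something like this in
send_message_to_frontend() - pq_endmessage_noblock() doesn't exist today,
so this is just a sketch of the idea, assuming the new function is
pq_endmessage() built on the existing pq_putmessage_noblock():

    StringInfoData msgbuf;

    pq_beginmessage(&msgbuf, 'E');   /* or 'N' for a notice */
    /* ... append severity, SQLSTATE, message text, etc. as we do now ... */
    pq_endmessage_noblock(&msgbuf);  /* hypothetical: copies the message
                                      * into the send buffer, enlarging it
                                      * rather than blocking on the socket */
    pq_flush_if_writable();          /* send what we can without waiting */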

If we're always buffering the data, then I suppose there's no risk of
injecting a protocol message into the middle of some other protocol
message, assuming that we don't have a non-local transfer of control
halfway through putting a message in the buffer. But there's still the
risk of getting out of step with the client. Suppose the client does
SELECT 1/0 and we send an ErrorResponse complaining about the division
by zero. But as we're trying to send that response, we block. Later, a
query cancel occurs. We can't queue another ErrorResponse because the
client will interpret that as the response to the next query, since
the division by zero error is the response to the current one. I don't
think that changing pq_flush() to pq_flush_if_writable() in elog.c or
anywhere else can fix that problem.
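
To spell out the sequence I'm worried about:

    C: Query "SELECT 1/0"
    S: ErrorResponse (division by zero)   <- blocks partway through the send
       (query cancel arrives while we're stuck here)
    S: ErrorResponse (query canceled)     <- if we queue this too, the client
                                             will count it against the *next*
                                             query, not the current one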

But that doesn't mean that it isn't a good idea. Any place where we're
doing a pq_flush() and know that another one will happen soon
afterward, before we wait for data from the client, can be changed to
pq_flush_if_writable() without harm, and it's beneficial to do so,
because, like you say, it avoids blocking in places that users may find
inconvenient - e.g. while holding locks, as Fujii-san said.

The comment here claims that "postgres.c will flush out waiting data
when control returns to the main loop" but the only pq_flush() call
that's directly present in postgres.c is in response to receiving a
Flush message, so I suppose this is actually talking about the
pq_flush() inside ReadyForQuery. It's not 100% clear to me that we do
that in all relevant cases, though. Suppose we hit an error while
processing some protocol message that does not set send_ready_for_query
= true, such as Describe ('D'). In that case, I think the flush in
elog.c is the only one. Perhaps we ought to change postgres.c so that
if we don't enter the block guarded by "if (send_ready_for_query)" we
call pq_flush() instead.
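
Concretely, something like this at the bottom of the main loop -
untested, and just a sketch of the idea:

    if (send_ready_for_query)
    {
        /* ... existing transaction-status reporting ... */
        ReadyForQuery(whereToSendOutput);   /* flushes internally */
        send_ready_for_query = false;
    }
    else if (whereToSendOutput == DestRemote)
    {
        /*
         * We aren't sending ReadyForQuery this time through, so push out
         * anything elog.c may have left in the send buffer, e.g. the
         * error from a failed Describe.
         */
        pq_flush();
    }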

-- 
Robert Haas
EDB: http://www.enterprisedb.com


