Thread: Queries that should be canceled will get stuck on secure_write function
Hi all,

Recently I hit a problem where the startup process of a standby gets stuck in a waiting state. Its backtrace shows that it is waiting for a backend process that conflicts with recovery to exit. The GUC parameters max_standby_streaming_delay and max_standby_archive_delay are both set to 30 seconds, but they don't take effect: the backend process stays alive, and its backtrace shows that it is waiting in secure_write() for the socket to become writable. Reading the code further, I found that ProcessClientWriteInterrupt() in secure_write() only processes interrupts when ProcDiePending is true; otherwise it does nothing. However, a snapshot conflict with recovery only sets QueryCancelPending to true, so the response to the signal can be delayed indefinitely if the corresponding client is stuck, thus blocking the recovery process.

I want to know why the interrupt is only handled when ProcDiePending is true. I think a query that is supposed to be canceled should also respond to the signal.

I also want to share a patch. It adds a GUC parameter, max_standby_client_write_delay: if a query is supposed to be canceled and the time it has been delayed by a client exceeds max_standby_client_write_delay, ProcDiePending is set to true so the query cannot be delayed indefinitely. What do you think of it? I hope to get your reply.

Thanks & Best Regards
Attachment
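To make the proposed escalation concrete, here is a minimal, self-contained sketch of the logic described in the original message. The struct fields stand in for the PostgreSQL globals QueryCancelPending and ProcDiePending, and max_delay stands in for the proposed max_standby_client_write_delay GUC; this is an illustrative model, not the actual backend code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the proposed escalation. The field names mirror the
 * PostgreSQL globals QueryCancelPending and ProcDiePending, but this
 * is a self-contained sketch, not backend code.
 */
typedef struct WriteState
{
    bool    query_cancel_pending;   /* stands in for QueryCancelPending */
    bool    proc_die_pending;       /* stands in for ProcDiePending */
    long    blocked_since;          /* when the write started blocking (seconds) */
} WriteState;

/*
 * If a cancel is pending and the client has kept the write blocked for
 * longer than max_delay seconds, promote the cancel to a terminate so
 * the backend can exit without needing to re-establish protocol sync.
 */
static void
maybe_promote_cancel(WriteState *st, long now, long max_delay)
{
    if (st->query_cancel_pending &&
        now - st->blocked_since > max_delay)
        st->proc_die_pending = true;
}
```

With a 30-second max_delay, a cancel pending for 31 seconds of blocked writing is promoted to a terminate, while one pending for 20 seconds is not.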
On Mon, Aug 23, 2021 at 4:15 AM 蔡梦娟(玊于) <mengjuan.cmj@alibaba-inc.com> wrote:
> I want to know why the interrupt is only handled when ProcDiePending is true, I think query which is supposed to be canceled also should respond to the signal.

Well, if we're halfway through sending a message to the client and we abort the write, we have no way of re-establishing protocol sync, right? It's OK to die without sending any more data to the client, since then the connection is closed anyway and the loss of sync doesn't matter, but continuing the session doesn't seem workable.

Your proposed patch actually seems to dodge this problem and I think perhaps we could consider something along those lines. It would be interesting to hear what Andres thinks about this idea, since I think he was the last one to rewrite that code.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Queries that should be canceled will get stuck on secure_write function
From: Alvaro Herrera
On 2021-Aug-23, Robert Haas wrote:
> On Mon, Aug 23, 2021 at 4:15 AM 蔡梦娟(玊于) <mengjuan.cmj@alibaba-inc.com> wrote:
> > I want to know why the interrupt is only handled when ProcDiePending
> > is true, I think query which is supposed to be canceled also should
> > respond to the signal.

Yeah, I agree.

> Well, if we're halfway through sending a message to the client and we
> abort the write, we have no way of re-establishing protocol sync,
> right? It's OK to die without sending any more data to the client,
> since then the connection is closed anyway and the loss of sync
> doesn't matter, but continuing the session doesn't seem workable.
>
> Your proposed patch actually seems to dodge this problem and I think
> perhaps we could consider something along those lines.

Do we actually need new GUCs, though? I think we should never let an unresponsive client dictate what the server does, because that opens the door for uncooperative or malicious clients to wreak serious havoc. I think the implementation should wait until time now+X to cancel the query, but if by time now+2X (or whatever we deem reasonable -- maybe now+1.1X) we're still waiting, then it's okay to just close the connection. This suggests a completely different implementation, though.

I wonder if it's possible to write a test for this. We would have to send a query and then hang the client somehow. I recently added a TAP test that uses SIGSTOP to a walsender ... can we use SIGSTOP with a background psql that's running SELECT pg_sleep() perhaps? (Or maybe it's sufficient to start background psql and not pump() it?)

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html
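Álvaro's two-stage idea above can be sketched as a small decision function. The 1.1 factor and all names here are illustrative assumptions taken from the message, not anything that exists in PostgreSQL: past the delay limit X we send a cancel, and past 1.1X we give up and close the connection.

```c
#include <assert.h>

/*
 * Sketch of the two-stage policy: X is the standby delay limit. Past X
 * we ask the backend to cancel its query; if it still has not responded
 * by 1.1 * X, we escalate to closing the connection. Illustrative only.
 */
typedef enum ConflictAction
{
    CONFLICT_WAIT,          /* still within the delay limit */
    CONFLICT_SEND_CANCEL,   /* past X: send a query cancel */
    CONFLICT_TERMINATE      /* past 1.1 * X: close the connection */
} ConflictAction;

static ConflictAction
conflict_action(double waited_secs, double x)
{
    if (waited_secs > 1.1 * x)
        return CONFLICT_TERMINATE;
    if (waited_secs > x)
        return CONFLICT_SEND_CANCEL;
    return CONFLICT_WAIT;
}
```

The point of the second threshold is that the server, not the client, bounds how long an unresponsive peer can hold up recovery.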
On Mon, Aug 23, 2021 at 10:45 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Do we actually need new GUCs, though? I think we should never let an
> unresponsive client dictate what the server does, because that opens the
> door for uncooperative or malicious clients to wreak serious havoc. I
> think the implementation should wait until time now+X to cancel the
> query, but if by time now+2X (or whatever we deem reasonable -- maybe
> now+1.1X) we're still waiting, then it's okay to just close the
> connection. This suggests a completely different implementation, though.

I don't quite understand what you mean by waiting until time now+X to cancel the query. We don't know a priori when query cancels are going to happen, but when they do happen we want to respond to them as quickly as we can. It seems reasonable to say that if we can't handle them within X amount of time we can resort to emergency measures, but that's basically what the patch does, except it uses a GUC instead of hardcoding X.

> I wonder if it's possible to write a test for this. We would have to
> send a query and then hang the client somehow. I recently added a TAP
> test that uses SIGSTOP to a walsender ... can we use SIGSTOP with a
> background psql that's running SELECT pg_sleep() perhaps?
> (Or maybe it's sufficient to start background psql and not pump() it?)

Starting a background process and not pumping it sounds promising, because it seems like it would be more likely to be portable. I think we'd want to be careful not to assume very much about how large the output buffer is, because I'm guessing that varies by platform and that it might in some cases be fairly large. Perhaps we could use pg_stat_activity to wait until we block in a ClientWrite state, although I wonder if we might find out that we sometimes block on a different wait state than what we expect to see.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Queries that should be canceled will get stuck on secure_write function
From: Alvaro Herrera
On 2021-Aug-23, Robert Haas wrote:
> On Mon, Aug 23, 2021 at 10:45 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Do we actually need new GUCs, though? I think we should never let an
> > unresponsive client dictate what the server does, because that opens the
> > door for uncooperative or malicious clients to wreak serious havoc. I
> > think the implementation should wait until time now+X to cancel the
> > query, but if by time now+2X (or whatever we deem reasonable -- maybe
> > now+1.1X) we're still waiting, then it's okay to just close the
> > connection. This suggests a completely different implementation, though.
>
> I don't quite understand what you mean by waiting until time now+X to
> cancel the query. We don't know a priori when query cancels are going
> to happen, but when they do happen we want to respond to them as
> quickly as we can. It seems reasonable to say that if we can't handle
> them within X amount of time we can resort to emergency measures, but
> that's basically what the patch does, except it uses a GUC instead of
> hardcoding X.

Aren't we talking about query cancellations that occur in response to a standby delay limit? Those aren't in response to user action. What I mean is that if the standby delay limit is exceeded, then we send a query cancel; we expect the standby to cancel its query at that time and then the primary can move on. But if the standby doesn't react, then we can have it terminate its connection.

I'm looking at the problem from the primary's point of view rather than the standby's point of view.

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
On Mon, Aug 23, 2021 at 11:26 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Aren't we talking about query cancellations that occur in response to a
> standby delay limit? Those aren't in response to user action.

Oh, you're right. But I guess a similar problem could also occur in response to pg_terminate_backend(), no?

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2021/08/24 0:26, Alvaro Herrera wrote:
> Aren't we talking about query cancellations that occur in response to a
> standby delay limit? Those aren't in response to user action. What I
> mean is that if the standby delay limit is exceeded, then we send a
> query cancel; we expect the standby to cancel its query at that time and
> then the primary can move on. But if the standby doesn't react, then we
> can have it terminate its connection.

+1

On 2021/08/24 3:45, Robert Haas wrote:
> On Mon, Aug 23, 2021 at 11:26 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Aren't we talking about query cancellations that occur in response to a
>> standby delay limit? Those aren't in response to user action.
>
> Oh, you're right. But I guess a similar problem could also occur in
> response to pg_terminate_backend(), no?

There seems to be no problem in that case, because pg_terminate_backend() causes a backend to set ProcDiePending to true in the die() signal handler, and ProcessClientWriteInterrupt() called by secure_write() handles ProcDiePending. No?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Yes, pg_terminate_backend() can terminate the connection successfully in this case because ProcDiePending is set to true and ProcessClientWriteInterrupt() can handle it.
Queries that exceed the standby delay limit can be terminated in this way, but what about other queries that should be canceled but get stuck in secure_write()? After all, there is currently no way to list all possible situations and then terminate these queries one by one.
------------------------------------------------------------------
From: Fujii Masao <masao.fujii@oss.nttdata.com>
Sent: Tuesday, August 24, 2021 13:15
To: Robert Haas <robertmhaas@gmail.com>; Alvaro Herrera <alvherre@alvh.no-ip.org>
Cc: 蔡梦娟(玊于) <mengjuan.cmj@alibaba-inc.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; Andres Freund <andres@anarazel.de>
Subject: Re: Queries that should be canceled will get stuck on secure_write function
On 2021/08/24 0:26, Alvaro Herrera wrote:
> Aren't we talking about query cancellations that occur in response to a
> standby delay limit? Those aren't in response to user action. What I
> mean is that if the standby delay limit is exceeded, then we send a
> query cancel; we expect the standby to cancel its query at that time and
> then the primary can move on. But if the standby doesn't react, then we
> can have it terminate its connection.
+1
On 2021/08/24 3:45, Robert Haas wrote:
> On Mon, Aug 23, 2021 at 11:26 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Aren't we talking about query cancellations that occur in response to a
>> standby delay limit? Those aren't in response to user action.
>
> Oh, you're right. But I guess a similar problem could also occur in
> response to pg_terminate_backend(), no?
There seems no problem in that case because pg_terminate_backend() causes
a backend to set ProcDiePending to true in the die() signal handler and
ProcessClientWriteInterrupt() called by secure_write() handles ProcDiePending.
No?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, Aug 24, 2021 at 1:15 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> > Oh, you're right. But I guess a similar problem could also occur in
> > response to pg_terminate_backend(), no?
>
> There seems no problem in that case because pg_terminate_backend() causes
> a backend to set ProcDiePending to true in the die() signal handler and
> ProcessClientWriteInterrupt() called by secure_write() handles ProcDiePending.
> No?

Hmm, maybe you're right. What about pg_cancel_backend()?

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2021/08/25 2:30, Robert Haas wrote:
> Hmm, maybe you're right. What about pg_cancel_backend()?

I was thinking that it's valid even if secure_write() doesn't react to pg_cancel_backend(), because it's basically called outside a transaction block, i.e., there is no longer a transaction to cancel in that case. But there can be some cases where secure_write() is called inside a transaction block, for example, when the query generates a NOTICE message. In these cases, secure_write() might need to react to the cancel request.

BTW, when an error happens, I found that a backend calls EmitErrorReport() to report the error to the client, and then calls AbortCurrentTransaction() to abort the transaction. If secure_write() called by EmitErrorReport() gets stuck, the backend gets stuck inside the transaction block and its locks keep being held unnecessarily. Isn't this problematic? Can we change the order of them?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, Aug 24, 2021 at 9:58 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> I was thinking that it's valid even if secure_write() doesn't react to
> pg_cancel_backend() because it's basically called outside transaction block,
> i.e., there is no longer transaction to cancel in that case. But there can be
> some cases where secure_write() is called inside transaction block,
> for example, when the query generates NOTICE message. In these cases,
> secure_write() might need to react to the cancel request.

Yeah. I think we could also be sending tuple data.

> BTW, when an error happens, I found that a backend calls EmitErrorReport()
> to report an error to a client, and then calls AbortCurrentTransaction()
> to abort the transaction. If secure_write() called by EmitErrorReport()
> gets stuck, a backend gets stuck inside transaction block and the locks
> keep being held unnecessarily. Isn't this problematic? Can we change
> the order of them?

I think there might be problems with that, like perhaps the ErrorData object can have pointers into the memory contexts that we'd be destroying in AbortCurrentTransaction().

More generally, I think it's hopeless to try to improve the situation very much by changing the place where secure_write() happens. It happens in so many different places, and it is clearly not going to be possible to make it so that in none of those places do we hold any important server resources. Therefore I think the only solution is to fix secure_write() itself ... and the trick is what to do there given that we have to be very careful not to do anything that might try to write another message while we are already in the middle of writing a message.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Queries that should be canceled will get stuck on secure_write function
From: Andres Freund
Hi,

On 2021-08-23 10:13:03 -0400, Robert Haas wrote:
> On Mon, Aug 23, 2021 at 4:15 AM 蔡梦娟(玊于) <mengjuan.cmj@alibaba-inc.com> wrote:
> > I want to know why the interrupt is only handled when ProcDiePending is true, I think query which is supposed to be canceled also should respond to the signal.
>
> Well, if we're halfway through sending a message to the client and we
> abort the write, we have no way of re-establishing protocol sync,
> right? It's OK to die without sending any more data to the client,
> since then the connection is closed anyway and the loss of sync
> doesn't matter, but continuing the session doesn't seem workable.

Right.

> Your proposed patch actually seems to dodge this problem and I think
> perhaps we could consider something along those lines. It would be
> interesting to hear what Andres thinks about this idea, since I think
> he was the last one to rewrite that code.

I think it's a reasonable idea. I have some quibbles with the implementation (new code should be in ProcessClientWriteInterrupt(), not secure_write()), and I suspect we should escalate more quickly to killing the connection, but those seem like details.

I think that if we want to tackle this, we need to do the same for secure_read() as well. secure_read() will process interrupts normally while DoingCommandRead, but if we're already in a command, we'd just as well be prevented from processing a !ProcDiePending interrupt.

Greetings,

Andres Freund
Re: Queries that should be canceled will get stuck on secure_write function
From: Andres Freund
Hi,

On 2021-08-27 08:27:38 -0400, Robert Haas wrote:
> On Tue, Aug 24, 2021 at 9:58 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> > to report an error to a client, and then calls AbortCurrentTransaction()
> > to abort the transaction. If secure_write() called by EmitErrorReport()
> > gets stuck, a backend gets stuck inside transaction block and the locks
> > keep being held unnecessarily. Isn't this problematic? Can we change
> > the order of them?
> ...
> More generally, I think it's hopeless to try to improve the situation
> very much by changing the place where secure_write() happens. It
> happens in so many different places, and it is clearly not going to be
> possible to make it so that in none of those places do we hold any
> important server resources. Therefore I think the only solution is to
> fix secure_write() itself ... and the trick is what to do there given
> that we have to be very careful not to do anything that might try to
> write another message while we are already in the middle of writing a
> message.

I wonder if we could improve the situation somewhat by using the non-blocking pqcomm functions in a few select places. E.g. if elog.c's send_message_to_frontend() sent its message via a new pq_endmessage_noblock() (which'd use the existing pq_putmessage_noblock()) and used pq_flush_if_writable() instead of pq_flush(), we'd a) not block sending to the client before AbortCurrentTransaction(), b) be able to queue further error messages safely.

I think this'd not violate the goal of putting pq_flush() into send_message_to_frontend():

    /*
     * This flush is normally not necessary, since postgres.c will flush out
     * waiting data when control returns to the main loop. But it seems best
     * to leave it here, so that the client has some clue what happened if the
     * backend dies before getting back to the main loop ... error/notice
     * messages should not be a performance-critical path anyway, so an extra
     * flush won't hurt much ...
     */
    pq_flush();

because the only situations where we'd not send the data out immediately would be when the socket buffer is already full. In which case the client wouldn't get the error immediately anyway?

Greetings,

Andres Freund
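Andres's point that a partial non-blocking flush is harmless rests on the send buffer keeping unsent bytes queued in order. Here is a toy model of such a buffer (the names are illustrative, not the real pqcomm API, which lives in src/backend/libpq/pqcomm.c): whatever the kernel does not accept stays in the buffer, so no protocol bytes are lost or reordered, and further messages can be queued behind the partially sent one.

```c
#include <assert.h>
#include <string.h>

#define SENDBUF_SIZE 64

/* Toy model of a libpq-style send buffer. */
typedef struct SendBuf
{
    char    data[SENDBUF_SIZE];
    int     len;    /* total bytes queued */
    int     sent;   /* bytes already handed to the kernel */
} SendBuf;

/* Queue a complete protocol message; never touches the socket. */
static void
buf_put(SendBuf *b, const char *msg, int n)
{
    memcpy(b->data + b->len, msg, n);
    b->len += n;
}

/*
 * Simulate a non-blocking flush against a socket that accepts at most
 * 'writable' bytes right now; the unsent tail stays in data[sent..len).
 */
static void
buf_flush_if_writable(SendBuf *b, int writable)
{
    int     pending = b->len - b->sent;
    int     n = pending < writable ? pending : writable;

    b->sent += n;
}
```

After a partial flush, queuing another message and flushing again delivers every byte exactly once and in order, which is why "all the data is buffered" answers the partial-send concern.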
On Fri, Aug 27, 2021 at 3:24 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if we could improve the situation somewhat by using the non-blocking
> pqcomm functions in a few select places. E.g. if elog.c's
> send_message_to_frontend() sent its message via a new pq_endmessage_noblock()
> (which'd use the existing pq_putmessage_noblock()) and used
> pq_flush_if_writable() instead of pq_flush(), we'd a) not block sending to the
> client before AbortCurrentTransaction(), b) able to queue further error
> messages safely.

pq_flush_if_writable() could succeed in sending only part of the data, so I don't see how this works.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Queries that should be canceled will get stuck on secure_write function
From: Andres Freund
Hi,

On Fri, Aug 27, 2021, at 13:07, Robert Haas wrote:
> On Fri, Aug 27, 2021 at 3:24 PM Andres Freund <andres@anarazel.de> wrote:
> > I wonder if we could improve the situation somewhat by using the non-blocking
> > pqcomm functions in a few select places. E.g. if elog.c's
> > send_message_to_frontend() sent its message via a new pq_endmessage_noblock()
> > (which'd use the existing pq_putmessage_noblock()) and used
> > pq_flush_if_writable() instead of pq_flush(), we'd a) not block sending to the
> > client before AbortCurrentTransaction(), b) able to queue further error
> > messages safely.
>
> pq_flush_if_writable() could succeed in sending only part of the data,
> so I don't see how this works.

All the data is buffered though, so I don't see what problem that causes?

Andres
I added a test to reproduce the problem; see the attachment for the details.
During the last sleep of the test, I used pstack to get the stack of the backend process, which is as follows:
#0 0x00007f6ebdd744d3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000007738d2 in WaitEventSetWait ()
#2 0x0000000000675aae in secure_write ()
#3 0x000000000067bfab in internal_flush ()
#4 0x000000000067c13a in internal_putbytes ()
#5 0x000000000067c217 in socket_putmessage ()
#6 0x0000000000497f36 in printtup ()
#7 0x00000000006301e0 in standard_ExecutorRun ()
#8 0x00000000007985fb in PortalRunSelect ()
#9 0x0000000000799968 in PortalRun ()
#10 0x0000000000795866 in exec_simple_query ()
#11 0x0000000000796cff in PostgresMain ()
#12 0x0000000000488339 in ServerLoop ()
#13 0x0000000000717bbc in PostmasterMain ()
#14 0x0000000000488f26 in main ()
Attachment
I changed the implementation for this problem:
a) if the cancel-query interrupt comes from the database for some reason, such as a recovery conflict, then handle it immediately, and treat the cancel request as a terminate request;
b) if the cancel-query interrupt comes from the client, then ignore it as before.
In the attached patch, I also add a TAP test that generates a recovery conflict on a standby while the backend process is stuck on a client write, and checks whether the backend can handle the cancel-query request due to the recovery conflict.
What do you think of it? I hope to get your reply.
Thanks & Best Regards
Attachment
On Fri, Aug 27, 2021 at 4:10 PM Andres Freund <andres@anarazel.de> wrote:
> On Fri, Aug 27, 2021, at 13:07, Robert Haas wrote:
> > On Fri, Aug 27, 2021 at 3:24 PM Andres Freund <andres@anarazel.de> wrote:
> > > I wonder if we could improve the situation somewhat by using the non-blocking
> > > pqcomm functions in a few select places. E.g. if elog.c's
> > > send_message_to_frontend() sent its message via a new pq_endmessage_noblock()
> > > (which'd use the existing pq_putmessage_noblock()) and used
> > > pq_flush_if_writable() instead of pq_flush(), we'd a) not block sending to the
> > > client before AbortCurrentTransaction(), b) able to queue further error
> > > messages safely.
> >
> > pq_flush_if_writable() could succeed in sending only part of the data,
> > so I don't see how this works.
>
> All the data is buffered though, so I don't see that problem that causes?

OK, I guess I'm confused here. If we're always buffering the data, then I suppose there's no risk of injecting a protocol message into the middle of some other protocol message, assuming that we don't have a non-local transfer of control halfway through putting a message in the buffer. But there's still the risk of getting out of step with the client. Suppose the client does SELECT 1/0 and we send an ErrorResponse complaining about the division by zero. But as we're trying to send that response, we block. Later, a query cancel occurs. We can't queue another ErrorResponse because the client will interpret that as the response to the next query, since the division by zero error is the response to the current one. I don't think that changing pq_flush() to pq_flush_if_writable() in elog.c or anywhere else can fix that problem. But that doesn't mean that it isn't a good idea.

Any place where we're doing a pq_flush() and know that another one will happen soon afterward, before we wait for data from the client, can be changed to pq_flush_if_writable() without harm, and it's beneficial to do so, because like you say, it avoids blocking in places that users may find inconvenient - e.g. while holding locks, as Fujii-san said.

The comment here claims that "postgres.c will flush out waiting data when control returns to the main loop" but the only pq_flush() call that's directly present in postgres.c is in response to receiving a Flush message, so I suppose this is actually talking about the pq_flush() inside ReadyForQuery. It's not 100% clear to me that we do that in all relevant cases, though. Suppose we hit an error while processing some protocol message that does not set send_ready_for_query = true, like for example Describe ('D'). I think in that case the flush in elog.c is the only one. Perhaps we ought to change postgres.c so that if we don't enter the block guarded by "if (send_ready_for_query)" we instead pq_flush().

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi all, I want to know your opinion on this patch. Or in what way do you think we should solve this problem?
------------------------------------------------------------------
From: 蔡梦娟(玊于) <mengjuan.cmj@alibaba-inc.com>
Sent: Thursday, September 9, 2021 17:38
To: Robert Haas <robertmhaas@gmail.com>; Andres Freund <andres@anarazel.de>; alvherre <alvherre@alvh.no-ip.org>; masao.fujii <masao.fujii@oss.nttdata.com>
Cc: pgsql-hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Queries that should be canceled will get stuck on secure_write function

> I changed the implementation about this problem:
> a) if the cancel query interrupt is from db for some reason, such as recovery conflict, then handle it immediately, and cancel request is treated as terminate request;
> b) if the cancel query interrupt is from client, then ignore as original way
> In the attached patch, I also add a tap test to generate a recovery conflict on a standby during the backend process is stuck on client write, check whether it can handle the cancel query request due to recovery conflict.
> what do you think of it, hope to get your reply
> Thanks & Best Regards
Attachment
Re: Queries that should be canceled will get stuck on secure_write function
From: Fujii Masao
On 2021/09/22 1:14, 蔡梦娟(玊于) wrote:
> Hi all, I want to know your opinion on this patch, or in what way do you think we should solve this problem?

I agree that something like the patch (i.e., introduction of promotion from a cancel request to a terminate one) is necessary for the fix. One concern about the patch is that the cancel request can be promoted to a terminate one even when secure_write() doesn't actually get stuck. Is this acceptable? Maybe I'm tempted to set up a duration until the promotion happens.... Or should we introduce a dedicated timer for communication on the socket?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Yes, it is more appropriate to set a duration after which we decide that secure_write() is stuck, but it is difficult to define how long that duration should be.
In my first patch, I added a GUC to allow the user to set the time; or it could be hardcoded, if a time deemed reasonable is provided?
------------------------------------------------------------------
> I agree that something like the patch (i.e., introduction of promotion
> from cancel request to terminate one) is necessary for the fix. One concern
> on the patch is that the cancel request can be promoted to the terminate one
> even when secure_write() doesn't actually get stuck. Is this acceptable?
> Maybe I'm tempted to set up the duration until the promotion happens....
> Or we should introduce the dedicated timer for communication on the socket?