Thread: awkward cancellation of parallel queries on standby.

From: Jeff Janes
When a parallel query gets cancelled on a standby due to max_standby_streaming_delay, it happens rather awkwardly: I get two errors stacked up, a query cancellation followed by a connection termination.

I use `pgbench -R 1 -T3600 -P5` on the master to generate a light but steady stream of HOT-pruning records, and then run `select sum(a.abalance*b.abalance) from pgbench_accounts a join pgbench_accounts b using (bid);` on the standby, not in a transaction block, as a long-running parallel query (scale factor of 20).

I also set max_standby_streaming_delay = 0.  That isn't necessary, but it saves wear and tear on my patience.
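For reference, the reproduction above can be sketched as shell commands. A working primary/standby pair and a pgbench database already initialized at scale factor 20 are assumed; `$STANDBY_PGDATA` is a placeholder for the standby's data directory, not something from the report.

```shell
# On the standby: cancel conflicting queries without delay (the report
# notes this isn't required, it just speeds up reproduction).
echo "max_standby_streaming_delay = 0" >> "$STANDBY_PGDATA/postgresql.conf"
pg_ctl -D "$STANDBY_PGDATA" reload

# On the primary: a light but steady stream of updates, which generates
# the HOT-pruning records that conflict with standby snapshots.
pgbench -R 1 -T 3600 -P 5 &

# On the standby: a long-running parallel query, run outside a
# transaction block.
psql -c "select sum(a.abalance*b.abalance)
         from pgbench_accounts a join pgbench_accounts b using (bid);"
```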

ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.
FATAL:  terminating connection due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.

This happens quite reliably.  In psql, these sometimes both show up immediately, and sometimes only the first one shows up immediately and then the second one appears upon the next communication to the backend.

I don't know if this is actually a problem.  It isn't for me as I don't do this kind of thing outside of testing, but it seems untidy and I can see it being frustrating from a catch-and-retry perspective and from a log-spam perspective.
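To illustrate the catch-and-retry angle: a client that retries on recovery-conflict cancellations has to cope with the stacked ERROR/FATAL pair. The sketch below is hypothetical and driver-free; with a real driver such as psycopg2 the exception to catch would be its error for SQLSTATE 40001 ("canceling statement due to conflict with recovery"), and since the connection may also be terminated, a real loop would have to reconnect rather than merely re-issue the query.

```python
import time


class RecoveryConflict(RuntimeError):
    """Stand-in for a driver's recovery-conflict error (hypothetical)."""


def run_with_retry(query_fn, retries=3, backoff=0.01):
    """Run query_fn, retrying after a recovery-conflict cancellation.

    Note the wrinkle from the report: after the ERROR the server may also
    send a FATAL and drop the connection, so a real retry loop must be
    prepared to reconnect, not just re-issue the query.
    """
    for attempt in range(retries):
        try:
            return query_fn()
        except RecoveryConflict:
            if attempt == retries - 1:
                raise  # out of attempts; propagate the conflict error
            time.sleep(backoff * (attempt + 1))  # simple linear backoff


# Demo: a "query" that is cancelled twice, then succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RecoveryConflict(
            "canceling statement due to conflict with recovery")
    return 42

result = run_with_retry(flaky_query)  # succeeds on the third attempt
```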

It looks like the backend gets signalled by the startup process, and then it signals the postmaster to signal the parallel workers, and then they ignore it for quite a long time (tens to hundreds of ms).  By the time they get around to responding, someone has decided to escalate things.  Which doesn't seem to be useful, because no one can do anything until the workers respond anyway.

This behavior seems to go back a long way, but the propensity for both messages to show up at the same time vs. in different round-trips changes from version to version.

Is this something we should do something about?

Cheers,

Jeff

Re: awkward cancellation of parallel queries on standby.

From: Kyotaro Horiguchi
At Sun, 26 Mar 2023 11:12:48 -0400, Jeff Janes <jeff.janes@gmail.com> wrote in 
> I don't know if this is actually a problem.  It isn't for me as I don't do
> this kind of thing outside of testing, but it seems untidy and I can see it
> being frustrating from a catch-and-retry perspective and from a log-spam
> perspective.
> 
> It looks like the backend gets signalled by the startup process, and then
> it signals the postmaster to signal the parallel workers, and then they
> ignore it for quite a long time (tens to hundreds of ms).  By the time they
> get around to responding, someone has decided to escalate things.  Which
> doesn't seem to be useful, because no one can do anything until the workers
> respond anyway.

I believe you are seeing autovacuum_naptime as the latency, since the
killed backend is running a busy query.  It seems to me that the
signals get processed pretty much instantly in most cases.  Detection
can take longer if a session is sitting idle in a transaction, but
that's just how we deal with that situation.  There could also be a
delay when the system load is quite high, but that's not really our
concern unless messages start going missing irregularly.

> This behavior seems to go back a long way, but the propensity for both
> messages to show up at the same time vs. in different round-trips changes
> from version to version.
> 
> Is this something we should do something about?

I can't say for certain about the version dependency, but the latency
you mentioned doesn't really seem to be an issue, so we don't need to
worry about it.  Regarding session cancellation, taking action might be
an option.  However, even if we check the transaction status in
PostgresMain, a cancellation is still possible if a conflicting
process tries to read a command right before the ongoing transaction
ends.  Although we might prevent cancellations in those final moments,
it seems like things could get complicated.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center