When a parallel query gets cancelled on a standby due to max_standby_streaming_delay, it happens rather awkwardly. I get two errors stacked up, a query cancellation followed by a connection termination.
To reproduce, I run `pgbench -R 1 -T3600 -P5` on the master to generate a light but steady stream of HOT pruning records. On the standby, outside of a transaction block, I then run `select sum(a.abalance*b.abalance) from pgbench_accounts a join pgbench_accounts b using (bid);` as a long-running parallel query (pgbench scale factor of 20).
I also set max_standby_streaming_delay = 0. That isn't necessary, but it saves wear and tear on my patience. The client then sees:
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
This happens quite reliably. In psql, sometimes both messages show up immediately, and sometimes only the first one shows up immediately and the second one appears on the next communication with the backend.
I don't know if this is actually a problem. It isn't for me as I don't do this kind of thing outside of testing, but it seems untidy and I can see it being frustrating from a catch-and-retry perspective and from a log-spam perspective.
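To make the catch-and-retry concern concrete, here is a rough sketch of what a client would have to do to cope with the stacked errors. It is only an illustration: `connect` and its `execute` method are hypothetical stand-ins for whatever driver is in use, and matching on the message text (rather than on an error code) is an assumption made to keep the example self-contained.

```python
import time

# Message texts taken from the errors quoted above.
CANCEL_MSG = "canceling statement due to conflict with recovery"
TERMINATE_MSG = "terminating connection due to conflict with recovery"


def classify(message):
    """Return 'retry', 'reconnect', or 'raise' for a server error message."""
    if TERMINATE_MSG in message:
        return "reconnect"
    if CANCEL_MSG in message:
        return "retry"
    return "raise"


def run_with_retry(connect, sql, max_attempts=5, pause=0.0):
    """Run sql, retrying when it is hit by a recovery conflict.

    connect() is a hypothetical factory returning an object whose
    execute(sql) raises an exception containing the server message.
    Real drivers expose errors differently.
    """
    conn = connect()
    for attempt in range(1, max_attempts + 1):
        try:
            return conn.execute(sql)
        except Exception as exc:
            action = classify(str(exc))
            if action == "raise" or attempt == max_attempts:
                raise
            if action == "reconnect":
                # The FATAL may only surface on the next round trip, so a
                # plain retry can fail once more before a new session helps.
                conn = connect()
            time.sleep(pause)
```

The point is that a single conflict can cost the client two failed attempts, a cancelled statement and then a dead connection, before the query gets a clean run.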
It looks like the backend gets signalled by the startup process, and then it signals the postmaster to signal the parallel workers, and then they ignore it for quite a long time (tens to hundreds of ms). By the time they get around to responding, someone has decided to escalate things. That doesn't seem useful, because no one can do anything until the workers respond anyway.
This behavior seems to go back a long way, but the propensity for both messages to show up at the same time vs. in different round-trips changes from version to version.
Is this something we should do something about?
Cheers,
Jeff