Hi,
Whenever you call CHECK_FOR_INTERRUPTS(), there are three flow-control
possibilities:
1. It doesn't return, because we ereport(FATAL) AKA "die".
2. It doesn't return, because we ereport(ERROR) AKA "cancel".
3. It returns, possibly having serviced various kinds of requests.
If we're in a critical section, it always returns.
Since commit 6ce0ed2813d, if we're in a HOLD_INTERRUPTS() section, it
always returns.
Since commit 2b3a8b20c2d, it we're in a HOLD_CANCEL_INTERRUPTS()
section, it either returns or does ereport(FATAL), but paths that
would ereport(ERROR) are deferred.
(That's not the whole story, as HandleParallelMessage() can
ereport(ERROR) without respecting QueryCancelHoldoffCount, but that's
probably OK WRT the protocol sync concerns of 2b3a8b20c2d because we
shouldn't be reading from the client during a parallel query.)
In recent years we've invented a new class of non-throwing interrupts
that process requests to perform work of some kind, but don't
necessarily die or cancel. So, while an "interrupt" used to mean
approximately a "queued error" of some kind, now it means
approximately a "queued task".
My question is: do we really need to suppress these non-ereporting
interrupts in all the places we currently do HOLD_INTERRUPTS()? The
reason I'm wondering about this is because the new ProcSignalBarrier
mechanism has to wait for any HOLD_INTERRUPTS() sections across all
backends to complete, and that possibly includes long cleanup loops
that perform disk I/O. While some future ProcSignalBarrier handler
might indeed not be safe during eg cleanup (perhaps because it can
ereport(ERROR)), it is free to return false to defer itself until the
next CFI.
Concretely, for example, where xact.c holds interrupts:
/* Prevent cancel/die interrupt while cleaning up */
HOLD_INTERRUPTS();
... or where dsm_detach does something similar, there is probably no
reason we should have to delay a ProcSignalBarrier just to accomplish
what the comment says. Presumably it really just wants to make sure
it doesn't lose control of the program counter via non-local return.
I get, though, that the current coding avoids a class of bug: we'd
have to make absolutely certain that so-called non-ereporting
interrupts really can't ereport, or chaos would ensue.
No patch yet, this is more of a problem statement, and a request for a
sanity check on my understanding of how we got here, namely that it's
really just a path dependency due to the way that interrupts have
evolved.