Thread: HOLD_INTERRUPTS() vs ProcSignalBarrier
Hi, Whenever you call CHECK_FOR_INTERRUPTS(), there are three flow-control possibilities: 1. It doesn't return, because we ereport(FATAL) AKA "die". 2. It doesn't return, because we ereport(ERROR) AKA "cancel". 3. It returns, possibly having serviced various kinds of requests. If we're in a critical section, it always returns. Since commit 6ce0ed2813d, if we're in a HOLD_INTERRUPTS() section, it always returns. Since commit 2b3a8b20c2d, it we're in a HOLD_CANCEL_INTERRUPTS() section, it either returns or does ereport(FATAL), but paths that would ereport(ERROR) are deferred. (That's not the whole story, as HandleParallelMessage() can ereport(ERROR) without respecting QueryCancelHoldoffCount, but that's probably OK WRT the protocol sync concerns of 2b3a8b20c2d because we shouldn't be reading from the client during a parallel query.) In recent years we've invented a new class of non-throwing interrupts that process requests to perform work of some kind, but don't necessarily die or cancel. So, while an "interrupt" used to mean approximately a "queued error" of some kind, now it means approximately a "queued task". My question is: do we really need to suppress these non-ereporting interrupts in all the places we currently do HOLD_INTERRUPTS()? The reason I'm wondering about this is because the new ProcSignalBarrier mechanism has to wait for any HOLD_INTERRUPTS() sections across all backends to complete, and that possibly includes long cleanup loops that perform disk I/O. While some future ProcSignalBarrier handler might indeed not be safe during eg cleanup (perhaps because it can ereport(ERROR)), it is free to return false to defer itself until the next CFI. Concretely, for example, where xact.c holds interrupts: /* Prevent cancel/die interrupt while cleaning up */ HOLD_INTERRUPTS(); ... or where dsm_detach does something similar, there is probably no reason we should have to delay a ProcSignalBarrier just to accomplish what the comment says. Presumably it really just wants to make sure it doesn't lose control of the program counter via non-local return. I get, though, that the current coding avoids a class of bug: we'd have to make absolutely certain that so-called non-ereporting interrupts really can't ereport, or chaos would ensue. No patch yet, this is more of a problem statement, and a request for a sanity check on my understanding of how we got here, namely that it's really just a path dependency due to the way that interrupts have evolved.
Hi, On 2022-05-25 14:47:41 +1200, Thomas Munro wrote: > My question is: do we really need to suppress these non-ereporting > interrupts in all the places we currently do HOLD_INTERRUPTS()? Most of those should be fairly short / only block on lwlocks, small amounts of IO. I'm not sure how much of an issue this is. Are there actually CFIs inside those HOLD_INTERRUPT sections? > The reason I'm wondering about this is because the new ProcSignalBarrier > mechanism has to wait for any HOLD_INTERRUPTS() sections across all backends > to complete, and that possibly includes long cleanup loops that perform disk > I/O. While some future ProcSignalBarrier handler might indeed not be safe > during eg cleanup (perhaps because it can ereport(ERROR)), it is free to > return false to defer itself until the next CFI. > > Concretely, for example, where xact.c holds interrupts: > > /* Prevent cancel/die interrupt while cleaning up */ > HOLD_INTERRUPTS(); > > ... or where dsm_detach does something similar, there is probably no > reason we should have to delay a ProcSignalBarrier just to accomplish > what the comment says. Presumably it really just wants to make sure > it doesn't lose control of the program counter via non-local return. I don't think that's quite it. There are elog(ERROR) reachable from within HOLD_INTERRUPTS() sections (it's not a critical section after all). I think it's more that there's no point in reacting to interrupts in those spots, because e.g. processing ProcDiePending requires aborting the currently active transaction. Greetings, Andres Freund
On Wed, May 25, 2022 at 3:08 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-05-25 14:47:41 +1200, Thomas Munro wrote: > > My question is: do we really need to suppress these non-ereporting > > interrupts in all the places we currently do HOLD_INTERRUPTS()? > > Most of those should be fairly short / only block on lwlocks, small amounts of > IO. I'm not sure how much of an issue this is. Are there actually CFIs inside > those HOLD_INTERRUPT sections? The concrete example I have in mind is the one created by me in 637668fb. That can reach a walkdir() that unlinks a ton of temporary files, and has a CFI() in it. Maybe that particular case should just be using HOLD_CANCEL_INTERRUPTS() instead, but that's not quite bulletproof enough (see note about parallel interrupts not respecting it), which made me start wondering about some other way to say "hold everything except non-ereturning interrupts".
On Tue, May 24, 2022 at 11:47 PM Thomas Munro <thomas.munro@gmail.com> wrote: > The concrete example I have in mind is the one created by me in > 637668fb. That can reach a walkdir() that unlinks a ton of temporary > files, and has a CFI() in it. Hmm, I missed that commit, and I have to say I'm a bit doubtful about it. I don't know what would be better, but using HOLD_INTERRUPTS() across that much code seems pretty sketch. > Maybe that particular case should just be using > HOLD_CANCEL_INTERRUPTS() instead, but that's not quite bulletproof > enough (see note about parallel interrupts not respecting it), which > made me start wondering about some other way to say "hold everything > except non-ereturning interrupts". Considering the current uses of HOLD_CANCEL_INTERRUPTS(), maybe we ought to make parallel interrupts respect that flag. It's not obvious that it's impossible to reach the parallel message handling stuff while we're in the middle of some wire protocol communication, but there's no loss either way. If the code is reachable, then it's incorrect not to hold off parallel message processing at that point, and if it's unreachable, then it doesn't matter whether we hold off parallel message processing at that point. In the case of the DSM cleanup code, we probably shouldn't be receiving parallel messages while we're trying to tear down a DSM segment, unless we've got multiple DSM segments around and the one we're tearing down is not the one being used by parallel query. However, that's an unlikely scenario, and probably won't cost much if it happens, and we can't really afford to lose control of the program counter anyway. -- Robert Haas EDB: http://www.enterprisedb.com