Thread: HOLD_INTERRUPTS() vs ProcSignalBarrier

HOLD_INTERRUPTS() vs ProcSignalBarrier

From
Thomas Munro
Date:
Hi,

Whenever you call CHECK_FOR_INTERRUPTS(), there are three flow-control
possibilities:

1.  It doesn't return, because we ereport(FATAL) AKA "die".
2.  It doesn't return, because we ereport(ERROR) AKA "cancel".
3.  It returns, possibly having serviced various kinds of requests.

If we're in a critical section, it always returns.

Since commit 6ce0ed2813d, if we're in a HOLD_INTERRUPTS() section, it
always returns.

Since commit 2b3a8b20c2d, it we're in a HOLD_CANCEL_INTERRUPTS()
section, it either returns or does ereport(FATAL), but paths that
would ereport(ERROR) are deferred.

(That's not the whole story, as HandleParallelMessage() can
ereport(ERROR) without respecting QueryCancelHoldoffCount, but that's
probably OK WRT the protocol sync concerns of 2b3a8b20c2d because we
shouldn't be reading from the client during a parallel query.)

In recent years we've invented a new class of non-throwing interrupts
that process requests to perform work of some kind, but don't
necessarily die or cancel.  So, while an "interrupt" used to mean
approximately a "queued error" of some kind, now it means
approximately a "queued task".

My question is: do we really need to suppress these non-ereporting
interrupts in all the places we currently do HOLD_INTERRUPTS()?  The
reason I'm wondering about this is because the new ProcSignalBarrier
mechanism has to wait for any HOLD_INTERRUPTS() sections across all
backends to complete, and that possibly includes long cleanup loops
that perform disk I/O.  While some future ProcSignalBarrier handler
might indeed not be safe during eg cleanup (perhaps because it can
ereport(ERROR)), it is free to return false to defer itself until the
next CFI.

Concretely, for example, where xact.c holds interrupts:

    /* Prevent cancel/die interrupt while cleaning up */
    HOLD_INTERRUPTS();

... or where dsm_detach does something similar, there is probably no
reason we should have to delay a ProcSignalBarrier just to accomplish
what the comment says.  Presumably it really just wants to make sure
it doesn't lose control of the program counter via non-local return.
I get, though, that the current coding avoids a class of bug: we'd
have to make absolutely certain that so-called non-ereporting
interrupts really can't ereport, or chaos would ensue.

No patch yet, this is more of a problem statement, and a request for a
sanity check on my understanding of how we got here, namely that it's
really just a path dependency due to the way that interrupts have
evolved.



Re: HOLD_INTERRUPTS() vs ProcSignalBarrier

From
Andres Freund
Date:
Hi,

On 2022-05-25 14:47:41 +1200, Thomas Munro wrote:
> My question is: do we really need to suppress these non-ereporting
> interrupts in all the places we currently do HOLD_INTERRUPTS()?

Most of those should be fairly short / only block on lwlocks, small amounts of
IO. I'm not sure how much of an issue this is. Are there actually CFIs inside
those HOLD_INTERRUPT sections?


> The reason I'm wondering about this is because the new ProcSignalBarrier
> mechanism has to wait for any HOLD_INTERRUPTS() sections across all backends
> to complete, and that possibly includes long cleanup loops that perform disk
> I/O.  While some future ProcSignalBarrier handler might indeed not be safe
> during eg cleanup (perhaps because it can ereport(ERROR)), it is free to
> return false to defer itself until the next CFI.
> 
> Concretely, for example, where xact.c holds interrupts:
> 
>     /* Prevent cancel/die interrupt while cleaning up */
>     HOLD_INTERRUPTS();
> 
> ... or where dsm_detach does something similar, there is probably no
> reason we should have to delay a ProcSignalBarrier just to accomplish
> what the comment says.  Presumably it really just wants to make sure
> it doesn't lose control of the program counter via non-local return.

I don't think that's quite it. There are elog(ERROR) reachable from within
HOLD_INTERRUPTS() sections (it's not a critical section after all). I think
it's more that there's no point in reacting to interrupts in those spots,
because e.g. processing ProcDiePending requires aborting the currently active
transaction.

Greetings,

Andres Freund



Re: HOLD_INTERRUPTS() vs ProcSignalBarrier

From
Thomas Munro
Date:
On Wed, May 25, 2022 at 3:08 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-05-25 14:47:41 +1200, Thomas Munro wrote:
> > My question is: do we really need to suppress these non-ereporting
> > interrupts in all the places we currently do HOLD_INTERRUPTS()?
>
> Most of those should be fairly short / only block on lwlocks, small amounts of
> IO. I'm not sure how much of an issue this is. Are there actually CFIs inside
> those HOLD_INTERRUPT sections?

The concrete example I have in mind is the one created by me in
637668fb.  That can reach a walkdir() that unlinks a ton of temporary
files, and has a CFI() in it.

Maybe that particular case should just be using
HOLD_CANCEL_INTERRUPTS() instead, but that's not quite bulletproof
enough (see note about parallel interrupts not respecting it), which
made me start wondering about some other way to say "hold everything
except non-ereturning interrupts".



Re: HOLD_INTERRUPTS() vs ProcSignalBarrier

From
Robert Haas
Date:
On Tue, May 24, 2022 at 11:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> The concrete example I have in mind is the one created by me in
> 637668fb.  That can reach a walkdir() that unlinks a ton of temporary
> files, and has a CFI() in it.

Hmm, I missed that commit, and I have to say I'm a bit doubtful about
it. I don't know what would be better, but using HOLD_INTERRUPTS()
across that much code seems pretty sketch.

> Maybe that particular case should just be using
> HOLD_CANCEL_INTERRUPTS() instead, but that's not quite bulletproof
> enough (see note about parallel interrupts not respecting it), which
> made me start wondering about some other way to say "hold everything
> except non-ereturning interrupts".

Considering the current uses of HOLD_CANCEL_INTERRUPTS(), maybe we
ought to make parallel interrupts respect that flag. It's not obvious
that it's impossible to reach the parallel message handling stuff
while we're in the middle of some wire protocol communication, but
there's no loss either way. If the code is reachable, then it's
incorrect not to hold off parallel message processing at that point,
and if it's unreachable, then it doesn't matter whether we hold off
parallel message processing at that point.

In the case of the DSM cleanup code, we probably shouldn't be
receiving parallel messages while we're trying to tear down a DSM
segment, unless we've got multiple DSM segments around and the one
we're tearing down is not the one being used by parallel query.
However, that's an unlikely scenario, and probably won't cost much if
it happens, and we can't really afford to lose control of the program
counter anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com