While working on the big interrupts refactoring, I noticed that
ExecAppendAsyncEventWait() doesn't include WL_LATCH_SET in the wait set.
As a result, it will not wake up on an interrupt like a query cancel
request or sinval catchup.
I'm surprised this has gone unnoticed for so long. It's pretty easy to
reproduce the issue with a foreign table over a view on "SELECT
pg_sleep(100)", for example. Did I miss a bug report?
At first glance the fix is very straightforward; we just need to call
AddWaitEventToSet() to wait for WL_LATCH_SET too. However, when I did
that the straightforward way, the 'postgres_fdw' test got stuck in a
busy-loop in ExecAppendAsyncGetNext().
postgresForeignAsyncConfigureWait() does this:
> else if (pendingAreq->requestor != areq->requestor)
> {
> /*
> * This is the case when the in-process request was made by another
> * Append. Note that it might be useless to process the request made
> * by that Append, because the query might not need tuples from that
> * Append anymore; so we avoid processing it to begin a fetch for the
> * given request if possible. If there are any child subplans of the
> * same parent that are ready for new requests, skip the given
> * request. Likewise, if there are any configured events other than
> * the postmaster death event, skip it. Otherwise, process the
> * in-process request, then begin a fetch to configure the event
> * below, because we might otherwise end up with no configured events
> * other than the postmaster death event.
> */
> if (!bms_is_empty(requestor->as_needrequest))
> return;
> if (GetNumRegisteredWaitEvents(set) > 1)
> return;
> process_pending_request(pendingAreq);
> fetch_more_data_begin(areq);
> }
When I added the WL_LATCH_SET event to the set before calling this, the
check for "GetNumRegisteredWaitEvents(set) > 1" was always true.
I don't understand what that comment is trying to say. What is an
"in-process request"? And why does postgres_fdw care what other async
subplans are registered in the Append node? If it is legitimate for it
to care, we probably should've abstracted that somehow, rather than
expose that "GetNumRegisteredWaitEvents(set) == 1" means that there are
no other events registered yet.
In any case, the attached patch works and seems like an easy fix for
stable branches at least.
--
Heikki Linnakangas
Neon (https://neon.tech)