Hi,

On 2024-02-08 19:20:35 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I might be missing something here, but leaving the concrete crash aside, why
> > is it ok for pgfdw_get_cleanup_result() etc to block during abort processing?
>
> It's not pretty, for sure. I thought briefly about postponing the
> cleanup until we next try to use the connection, but I fear the
> semantic side-effects of that would be catastrophic. We can't leave
> the remote's query sitting open long after the local transaction has
> been canceled --- that risks undetected deadlocks, at the least.
> I think all we can do is try to reduce the risk of failure during
> transaction cleanup.

I agree that we can't just delay cleanup until, potentially, much later, but I
don't think that means that we have to wait 30s for each connection,
one-by-one.

One way we could fix the issue at hand would be to make postgres_fdw reserve
one FD, for all connections, release it before the WaitLatchOrSocket() and
reacquire it after. That way we can make sure that there's an FD available
for the wait itself.
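
To make the idea concrete, here is a rough standalone sketch, not a patch.
All function names are made up for illustration, poll() stands in for
WaitLatchOrSocket(), and /dev/null is just a convenient thing to hold open.
It assumes the wait implementation needs a descriptor of its own (as an
epoll-based one would), which poll() itself does not.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical: one descriptor held on behalf of all connections. */
static int reserved_fd = -1;

/* Take the reserved slot if we don't hold it already; 0 on success. */
static int
reserve_fd(void)
{
    if (reserved_fd < 0)
        reserved_fd = open("/dev/null", O_RDONLY);
    return reserved_fd >= 0 ? 0 : -1;
}

/* Give the slot back so something else in the process can use it. */
static void
release_reserved_fd(void)
{
    if (reserved_fd >= 0)
    {
        close(reserved_fd);
        reserved_fd = -1;
    }
}

/*
 * Wait for sock to become readable.  The reserved descriptor is released
 * just before the wait and taken back right after, so a wait primitive
 * that allocates an FD internally would find one free.  poll() is only a
 * stand-in here.  Returns poll()'s result.
 */
static int
wait_with_reserved_fd(int sock, int timeout_ms)
{
    struct pollfd pfd = {.fd = sock, .events = POLLIN};
    int         rc;

    assert(sock >= 0);

    release_reserved_fd();
    rc = poll(&pfd, 1, timeout_ms);
    (void) reserve_fd();        /* may fail if another FD was taken meanwhile */

    return rc;
}
```

The reacquire can fail if something else grabbed the slot in between, so
real code would have to decide what to do in that case.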

OTOH, as waiting for connections one-by-one isn't good, perhaps we should just
rewrite the code to create one WES for all connections and wait in parallel,
processing cancels/aborts as they complete. That'd make the abort less slow
and it'd make the reserve-one-fd-for-postgres_fdw approach a bit less ugly.
But unfortunately that's a bit big for a bugfix...

Greetings,

Andres Freund