Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds - Mailing list pgsql-bugs

From Andres Freund
Subject Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds
Date
Msg-id 20240209000734.st2xwkfsrz7izbnx@awork3.anarazel.de
Whole thread Raw
In response to Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Hi,

On 2024-02-08 18:50:32 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think we ought to understand *why* we are getting the "Too many open
> > files". The AcquireExternalFD() in CreateWaitEventSet() should prevent
> > that.
>
> Actually, I think the AcquireExternalFD() in CreateWaitEventSet() is
> *causing* that and needs to be removed.
>
> What is happening in Alexander's new example is that we are doing
> AcquireExternalFD() for each postgres_fdw connection
> (cf. libpqsrv_connect in libpq-be-fe-helpers.h), and the example
> is tuned to bring that exactly up to the limit of what
> AcquireExternalFD() allows.  Then the next WaitLatchOrSocket call
> fails, because it does
>
>     WaitEventSet *set = CreateWaitEventSet(CurrentResourceOwner, 3);
>
> Then when pgfdw_abort_cleanup tries to clean up the connections'
> state, it needs to do WaitLatchOrSocket again, and that fails again,
> and we PANIC because we're already in abort state.
>
> Since WaitLatchOrSocket is going to free this WaitEventSet before it
> returns, it's not apparent to me why we need to count it as a
> long-lived FD: we could just as well assume that it can slide in under
> the NUM_RESERVED_FDS limit.

Well, the AcquireExternalFD() is in more general code, that's also used for
long-lived WaitEventSets - for those it's the right thing to count it as a
long lived FD.


> Or perhaps use ReserveExternalFD instead of AcquireExternalFD.  We'd need
> some API extension to tell latch.c to do that, but that doesn't seem hard.
> (Unless we could consider that all WaitEventSets should use
> ReserveExternalFD?  Not sure I want to argue for that though.)

Yea, I don't think we want that.


> I guess a third possibility is that WaitLatchOrSocket could just
> permanently hang onto the WaitEventSet once it's got one.

Right now we don't support changing the socket FD associated with the WES, so
that'd not easily work. We've been talking about adding support for that for a
while though.


I might be missing something here, but leaving the concrete crash aside, why
is it ok for pgfdw_get_cleanup_result() etc to block during abort processing?
If I read the code right, we can end up waiting for up to 2x30s for each
connection, and there can be many connections.  The code also has a bunch of
memory allocations, so if we are in abort processing after an out-of-memory
error, we can easily cause failures again as well.

Greetings,

Andres Freund



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds
Next
From: Thomas Munro
Date:
Subject: Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds