Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds
Date
Msg-id 1273950.1707436232@sss.pgh.pa.us
Whole thread Raw
In response to Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds  (Andres Freund <andres@anarazel.de>)
Responses Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
Andres Freund <andres@anarazel.de> writes:
> I think we ought to understand *why* we are getting the "Too many open
> files". The AcquireExternalFD() in CreateWaitEventSet() should prevent
> that.

Actually, I think the AcquireExternalFD() in CreateWaitEventSet() is
*causing* that and needs to be removed.

What is happening in Alexander's new example is that we are doing
AcquireExternalFD() for each postgres_fdw connection
(cf. libpqsrv_connect in libpq-be-fe-helpers.h), and the example
is tuned to bring that exactly up to the limit of what
AcquireExternalFD() allows.  Then the next WaitLatchOrSocket call
fails, because it does

    WaitEventSet *set = CreateWaitEventSet(CurrentResourceOwner, 3);

Then when pgfdw_abort_cleanup tries to clean up the connections'
state, it needs to do WaitLatchOrSocket again, and that fails again,
and we PANIC because we're already in abort state.

Since WaitLatchOrSocket is going to free this WaitEventSet before it
returns, it's not apparent to me why we need to count it as a
long-lived FD: we could just as well assume that it can slide in under
the NUM_RESERVED_FDS limit.  Or perhaps use ReserveExternalFD instead
of AcquireExternalFD.  We'd need some API extension to tell latch.c to
do that, but that doesn't seem hard.  (Unless we could consider that
all WaitEventSets should use ReserveExternalFD?  Not sure I want to
argue for that though.)

I guess a third possibility is that WaitLatchOrSocket could just
permanently hang onto the WaitEventSet once it's got one.

> One annoying bit is that AcquireExternalFD() failing emits the same error as
> if epoll_create1() itself failing, including the same errno.

It's the former.  I tend to agree now that maybe using the same error
text wasn't too smart.

            regards, tom lane



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds
Next
From: Andres Freund
Date:
Subject: Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds