Thread: Misleading "epoll_create1 failed: Too many open files"


From: Andres Freund

Hi,

I ran something which triggered the error in $subject. Except that it turns
out that
a) epoll_create1() was not being called
b) we didn't actually hit EMFILE or even max_safe_fds

The reason for the failure is that we have:
    if (!AcquireExternalFD())
    {
        /* treat this as though epoll_create1 itself returned EMFILE */
        elog(ERROR, "epoll_create1 failed: %m");
    }

and

bool
AcquireExternalFD(void)
{
    /*
     * We don't want more than max_safe_fds / 3 FDs to be consumed for
     * "external" FDs.
     */
    if (numExternalFDs < max_safe_fds / 3)
    {
        ReserveExternalFD();
        return true;
    }
    errno = EMFILE;
    return false;
}

I think it's rather confusing to claim that epoll_create1() failed when we
didn't even call it.

Why are we misattributing the failure to a system call that we didn't make?

The current behaviour was introduced in

commit 3d475515a15f70a4a3f36fbbba93db6877ff8346
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2020-02-24 17:28:33 -0500

    Account explicitly for long-lived FDs that are allocated outside fd.c.



I also wish we wouldn't report EMFILE when we didn't actually reach any hard
limit - that makes the system behaviour unnecessarily confusing. But that's
not quite so easy to fix.


How about making the error message something like
                elog(ERROR, "AcquireExternalFD, for epoll_create1, failed: %m");
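
To make the difference concrete, here is a minimal standalone sketch of the situation described above. It is not the PostgreSQL source: the `sketch_` names, the budget value of 30, and the use of `snprintf`/`strerror` in place of `elog` with `%m` are all assumptions made for illustration. It only shows that the soft budget can be exhausted, and `errno` set to `EMFILE`, without any system call having been made, and how the two wordings would read in that case.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for fd.c's state; the values are illustrative only. */
static int sketch_max_safe_fds = 30;
static int sketch_num_external_fds = 0;

/* Mirrors the quoted AcquireExternalFD(): a soft budget, not a kernel limit. */
static bool
sketch_acquire_external_fd(void)
{
    if (sketch_num_external_fds < sketch_max_safe_fds / 3)
    {
        sketch_num_external_fds++;
        return true;
    }
    errno = EMFILE;             /* faked: no system call has been made */
    return false;
}

/* Old wording: attributes the failure to a system call that never ran. */
static void
sketch_old_message(char *buf, size_t len, int saved_errno)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "epoll_create1 failed: %s", strerror(saved_errno));
}

/* Proposed wording: names the function that actually refused the request. */
static void
sketch_new_message(char *buf, size_t len, int saved_errno)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "AcquireExternalFD, for epoll_create1, failed: %s",
             strerror(saved_errno));
}
```

With the assumed budget of 30, the first ten calls to `sketch_acquire_external_fd()` succeed and the eleventh fails with `errno` set to `EMFILE`. Both messages end with the same system error text, but only the second one tells the reader that the refusal came from the bookkeeping layer rather than from the kernel.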

Greetings,

Andres Freund



Re: Misleading "epoll_create1 failed: Too many open files"

From: Tom Lane

Andres Freund <andres@anarazel.de> writes:
> I think it's rather confusing to claim that epoll_create1() failed when we
> didn't even call it.
> Why are we misattributing the failure to a system call that we didn't make?

I think the idea was that this mechanism is equivalent to an EMFILE
limit.  But if you feel a need to make a distinction, this seems fine:

>                 elog(ERROR, "AcquireExternalFD, for epoll_create1, failed: %m");

You should probably check all of 3d475515a, because I think
I applied the same idea in more than one place.

            regards, tom lane



Re: Misleading "epoll_create1 failed: Too many open files"

From: Tom Lane

Andres Freund <andres@anarazel.de> writes:
> On 2024-11-26 11:35:56 -0500, Tom Lane wrote:
>> You should probably check all of 3d475515a, because I think
>> I applied the same idea in more than one place.

> Yea, there's another equivalent message for kqueue a few lines below.

You should remove the "treat this..." comments, because those were
precisely about not making a distinction in the messages.
Otherwise okay with me.

            regards, tom lane



Re: Misleading "epoll_create1 failed: Too many open files"

From: Andres Freund

Hi,

On 2024-11-26 12:26:51 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2024-11-26 11:35:56 -0500, Tom Lane wrote:
> >> You should probably check all of 3d475515a, because I think
> >> I applied the same idea in more than one place.
> 
> > Yea, there's another equivalent message for kqueue a few lines below.
> 
> You should remove the "treat this..." comments, because those were
> precisely about not making a distinction in the messages.

Oops.

> Otherwise okay with me.

Thanks for checking. Pushed.

Greetings,

Andres Freund



Re: Misleading "epoll_create1 failed: Too many open files"

From: Robert Haas

On Tue, Nov 26, 2024 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the idea was that this mechanism is equivalent to an EMFILE
> limit.  But if you feel a need to make a distinction, this seems fine:

I think we should never, ever confuse an error return from a system
call with any other kind of problem that can happen. Not even write()
returns 0 => ENOSPC.

AFAIK, the rationale for conflating failure cases like this is that
either both failures are unlikely or, at least, the case where errno
wasn't actually set is unlikely. But the problem is that when
something weird happens, that's exactly when you need a clear and
unambiguous error report. I've had multiple extremely painful support
experiences that were made painful precisely because I couldn't
determine exactly what really happened. Did a system call really
return an unlikely error code? Or was it the
not-a-real-error-code-but-we-faked-one case which is also not supposed
to happen?

I find this kind of thing maddening every time it happens, and it
happens to me more often than you might think, because it often
happens that other people are able to answer the normal questions and
they send me the weird ones. Let's say twice a year I spend a couple
of days sweating blood trying to determine the root cause of some
bizarre malfunction because the person who wrote the code couldn't be
bothered to take 2 minutes to make the errors distinguishable.
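
As an illustration of the write() case mentioned above, the following sketch keeps a genuine system call failure separate from a short write instead of faking ENOSPC for the latter. The `sketch_` names, the enum, and the message wording are hypothetical and chosen for this example only; they are not taken from the PostgreSQL source.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

typedef enum
{
    SKETCH_WRITE_OK,
    SKETCH_WRITE_SYSCALL_ERROR, /* write() returned -1; errno is meaningful */
    SKETCH_WRITE_SHORT          /* write() succeeded but wrote too little */
} SketchWriteOutcome;

/* Classify the return value of write() without inventing an errno. */
static SketchWriteOutcome
sketch_classify_write(ssize_t rc, size_t requested)
{
    if (rc < 0)
        return SKETCH_WRITE_SYSCALL_ERROR;
    if ((size_t) rc != requested)
        return SKETCH_WRITE_SHORT;
    return SKETCH_WRITE_OK;
}

/* Build a message that differs between the two failure cases. */
static void
sketch_describe_write(char *buf, size_t len, ssize_t rc, size_t requested,
                      int saved_errno)
{
    assert(buf != NULL && len > 0);
    switch (sketch_classify_write(rc, requested))
    {
        case SKETCH_WRITE_SYSCALL_ERROR:
            snprintf(buf, len, "could not write: %s", strerror(saved_errno));
            break;
        case SKETCH_WRITE_SHORT:
            snprintf(buf, len, "could not write: wrote only %zd of %zu bytes",
                     rc, requested);
            break;
        case SKETCH_WRITE_OK:
            snprintf(buf, len, "wrote %zu bytes", requested);
            break;
    }
}
```

A caller would save `errno` immediately after `write()` returns and pass it in. A report produced this way says either which error code the kernel returned or that no error code was involved at all, which is the distinction being asked for here.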

--
Robert Haas
EDB: http://www.enterprisedb.com