Thread: Misleading "epoll_create1 failed: Too many open files"
Hi,

I ran something which triggered the error in $subject. Except that it turns
out that

a) epoll_create1() was not being called
b) we didn't actually hit EMFILE or even max_safe_fds

The reason for the failure is that we have:

    if (!AcquireExternalFD())
    {
        /* treat this as though epoll_create1 itself returned EMFILE */
        elog(ERROR, "epoll_create1 failed: %m");
    }

and

    bool
    AcquireExternalFD(void)
    {
        /*
         * We don't want more than max_safe_fds / 3 FDs to be consumed for
         * "external" FDs.
         */
        if (numExternalFDs < max_safe_fds / 3)
        {
            ReserveExternalFD();
            return true;
        }
        errno = EMFILE;
        return false;
    }

I think it's rather confusing to claim that epoll_create1() failed when we
didn't even call it.

Why are we misattributing the failure to a system call that we didn't make?

The current behaviour was introduced in

    commit 3d475515a15f70a4a3f36fbbba93db6877ff8346
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   2020-02-24 17:28:33 -0500

        Account explicitly for long-lived FDs that are allocated outside fd.c.

I also wish we wouldn't report EMFILE when we didn't actually reach any hard
limit - that makes the system behaviour unnecessarily confusing. But that's
not quite so easy to fix.

How about making the error message something like

    elog(ERROR, "AcquireExternalFD, for epoll_create1, failed: %m");

Greetings,

Andres Freund
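To make the proposal concrete, here is a minimal standalone sketch of the two failure paths producing different messages. It is not the backend code: acquire_external_fd() and format_fd_error() are hypothetical stand-ins, the counter values are made up for illustration, and strerror() stands in for elog()'s %m.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the backend's counters; values are illustrative. */
static int max_safe_fds = 30;
static int numExternalFDs = 0;

/*
 * Same shape as the AcquireExternalFD() quoted above: a soft limit of
 * max_safe_fds / 3, reported by setting errno to EMFILE even though no
 * system call was made.
 */
static bool
acquire_external_fd(void)
{
    if (numExternalFDs < max_safe_fds / 3)
    {
        numExternalFDs++;
        return true;
    }
    errno = EMFILE;
    return false;
}

/*
 * Format the message a caller would report.  If the reservation failed, name
 * the reservation step; only name the system call alone when it was actually
 * called and failed.  saved_errno is passed in explicitly so that nothing in
 * between can clobber errno.
 */
static int
format_fd_error(char *buf, size_t len, bool reserved,
                const char *syscall_name, int saved_errno)
{
    if (!reserved)
        return snprintf(buf, len, "AcquireExternalFD, for %s, failed: %s",
                        syscall_name, strerror(saved_errno));
    return snprintf(buf, len, "%s failed: %s",
                    syscall_name, strerror(saved_errno));
}
```

With these illustrative values the eleventh reservation fails, and the resulting message starts with "AcquireExternalFD, for epoll_create1, failed:" rather than blaming epoll_create1() alone.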
Andres Freund <andres@anarazel.de> writes:
> I think it's rather confusing to claim that epoll_create1() failed when we
> didn't even call it.
> Why are we misattributing the failure to a system call that we didn't make?

I think the idea was that this mechanism is equivalent to an EMFILE
limit. But if you feel a need to make a distinction, this seems fine:

> elog(ERROR, "AcquireExternalFD, for epoll_create1, failed: %m");

You should probably check all of 3d475515a, because I think
I applied the same idea in more than one place.

regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2024-11-26 11:35:56 -0500, Tom Lane wrote:
>> You should probably check all of 3d475515a, because I think
>> I applied the same idea in more than one place.

> Yea, there's another equivalent message for kqueue a few lines below.

You should remove the "treat this..." comments, because those were
precisely about not making a distinction in the messages.

Otherwise okay with me.

regards, tom lane
Hi,

On 2024-11-26 12:26:51 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2024-11-26 11:35:56 -0500, Tom Lane wrote:
> >> You should probably check all of 3d475515a, because I think
> >> I applied the same idea in more than one place.
>
> > Yea, there's another equivalent message for kqueue a few lines below.
>
> You should remove the "treat this..." comments, because those were
> precisely about not making a distinction in the messages.

Oops.

> Otherwise okay with me.

Thanks for checking. Pushed.

Greetings,

Andres Freund
On Tue, Nov 26, 2024 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the idea was that this mechanism is equivalent to an EMFILE
> limit. But if you feel a need to make a distinction, this seems fine:

I think we should never, ever confuse an error return from a system
call with any other kind of problem that can happen. Not even write()
returns 0 => ENOSPC.

AFAIK, the rationale for conflating failure cases like this is that
either both failures are unlikely or, at least, the case where errno
wasn't actually set is unlikely. But the problem is that when something
weird happens, that's exactly when you need a clear and unambiguous
error report. I've had multiple extremely painful support experiences
that were made painful precisely because I couldn't determine exactly
what really happened. Did a system call really return an unlikely error
code? Or was it the not-a-real-error-code-but-we-faked-one case which is
also not supposed to happen?

I find this kind of thing maddening every time it happens, and it
happens to me more often than you might think, because it often happens
that other people are able to answer the normal questions and they send
me the weird ones. Let's say twice a year I spend a couple of days
sweating blood trying to determine the root cause of some bizarre
malfunction because the person who wrote the code couldn't be bothered
to take 2 minutes to make the errors distinguishable.

--
Robert Haas
EDB: http://www.enterprisedb.com
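The write() case mentioned above can be kept distinguishable in the same way. The following is only a sketch: describe_write_result() is a hypothetical helper and the message wording is invented for illustration. It reports a genuine errno from the system call separately from a short write, instead of faking ENOSPC for the latter.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Describe the outcome of a write() call.
 *   rc          - value returned by write()
 *   requested   - number of bytes the caller asked to write
 *   saved_errno - errno captured immediately after the call; only
 *                 meaningful when rc < 0
 */
static void
describe_write_result(ssize_t rc, size_t requested, int saved_errno,
                      char *buf, size_t len)
{
    if (rc < 0)
    {
        /* The system call itself failed: errno is real, so report it. */
        snprintf(buf, len, "could not write: %s", strerror(saved_errno));
    }
    else if ((size_t) rc != requested)
    {
        /* Short write: no errno was set, so do not invent one. */
        snprintf(buf, len, "could not write: wrote only %zd of %zu bytes",
                 rc, requested);
    }
    else
        snprintf(buf, len, "ok");
}
```

A caller would capture errno right after write() and pass both values in, so a report of "wrote only 0 of 8192 bytes" can never be mistaken for the kernel actually returning ENOSPC.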