I've looked into Olivier Hubaut's recent reports of 'Too many open
files' errors on OS X. What I find is that on Darwin, where we are
using Posix semaphores rather than SysV semaphores, each Posix semaphore
is treated as an open file --- it shows up in "lsof" output, and more to
the point it appears to count against a process's ulimit -n limit.
This means that if you are running with, say, max_connections = 100,
that's 100+ open files in the postmaster and every active backend.
And it's 100+ open files that aren't accounted for in fd.c's estimate
of how many files it can open. Since the ulimit -n setting is by
default only 256 on this platform, it doesn't take much at all for us to
be bumping up against the ulimit -n limit. fd.c copes fine, since it
automatically closes other open files any time it gets an EMFILE error.
But code outside fd.c is likely to fail hard ... which is exactly the
symptom we saw in Olivier's report.
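To put rough numbers on that, here is a minimal sketch of the budget arithmetic. The function names and the "reserved" figure are made up for illustration and are not what fd.c actually does; the only real calls are getrlimit() and RLIMIT_NOFILE, which report the same soft limit that ulimit -n shows.

```c
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical helper: how many descriptors remain for ordinary files
 * once each Posix semaphore is charged as one open file.  reserved_fds
 * is an assumed allowance for stdin/stdout/stderr, sockets, and so on.
 */
long
usable_file_budget(long fd_limit, long n_semaphores, long reserved_fds)
{
    long        budget = fd_limit - n_semaphores - reserved_fds;

    return (budget > 0) ? budget : 0;
}

/*
 * Soft RLIMIT_NOFILE for this process (what "ulimit -n" reports).
 * Returns -1 on error or if the limit is unlimited.
 */
long
current_fd_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long) rl.rlim_cur;
}
```

With the default limit of 256, 100 semaphores, and an assumed 10 descriptors reserved for other uses, usable_file_budget(256, 100, 10) leaves 146 for everything else, well below what an estimate based on the raw limit would suggest.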

I plan to apply some band-aid fixes to make that code more robust;
for instance we can push all calls to opendir() into fd.c so that
EMFILE can be handled by closing other open files. (And why does
MoveOfflineLogs PANIC on this anyway? It's not critical code...)
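As a sketch of what such a wrapper might look like: the names opendir_with_retry and release_fd_fn below are hypothetical, and the callback stands in for whatever fd.c would use to close one of its cached files. Only opendir(), errno, EMFILE, and ENFILE are real.

```c
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical callback: close one cached descriptor and return 1,
 * or return 0 if there is nothing left to close.
 */
typedef int (*release_fd_fn) (void);

/*
 * Try opendir(); on descriptor exhaustion, ask the caller's pool to
 * give one descriptor back and try again.
 */
DIR *
opendir_with_retry(const char *path, release_fd_fn release_one)
{
    for (;;)
    {
        DIR        *dir = opendir(path);
        int         save_errno;

        if (dir != NULL)
            return dir;

        /* Only running out of descriptors is worth retrying. */
        if (errno != EMFILE && errno != ENFILE)
            return NULL;

        save_errno = errno;
        if (release_one == NULL || !release_one())
        {
            errno = save_errno; /* report the original failure */
            return NULL;
        }
    }
}
```

The caller still sees NULL with errno set if nothing can be freed, so it can decide for itself whether the failure deserves an ERROR or something stronger.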

However, it seems that the real problem here is that we are so far off
base about how many files we can open. I wonder whether we should stop
relying on sysconf() and instead try to make some direct probe of the
number of files we can open. I'm imagining that at some point during
postmaster startup we would call open() repeatedly until it fails, and
then save that count as the number-of-openable-files limit.
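A minimal sketch of such a probe, assuming /dev/null is available and that we cap the probe so it never tries to grab more than we would use anyway. The function name and the cap parameter are illustrative, not an existing interface.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Open /dev/null until open() fails or max_to_probe descriptors have
 * been obtained, then close them all again.  Returns the number of
 * descriptors that could be opened.
 */
int
probe_openable_files(int max_to_probe)
{
    int        *fds;
    int         count = 0;
    int         i;

    if (max_to_probe <= 0)
        return 0;

    fds = malloc(sizeof(int) * (size_t) max_to_probe);
    if (fds == NULL)
        return 0;

    while (count < max_to_probe)
    {
        int         fd = open("/dev/null", O_RDONLY);

        if (fd < 0)
            break;              /* typically EMFILE or ENFILE */
        fds[count++] = fd;
    }

    /* Give everything back; we only wanted the count. */
    for (i = 0; i < count; i++)
        close(fds[i]);
    free(fds);

    return count;
}
```

Because the probe runs in the process that already holds the semaphores, whatever they cost in descriptors is subtracted automatically. The result is only a snapshot, though, so some headroom would still need to be left for files opened outside fd.c.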

I also notice that OS X 10.3 seems to have working SysV semaphore
support. I am tempted to change template/darwin to use SysV where
available, instead of Posix semaphores. I wonder whether inheriting
100-or-so open file descriptors every time we launch a backend isn't
in itself a nasty performance hit, quite aside from its effect on how
many normal files we can open.
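For comparison, a SysV semaphore set is named by an integer id rather than a file descriptor, so creating one should not change the process's open-file count. The sketch below checks that on whatever platform it runs on; the helper names are hypothetical, and it assumes SysV IPC is enabled in the kernel.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Count descriptors in [0, max_fd) that are currently open. */
static int
count_open_fds(int max_fd)
{
    int         n = 0;
    int         fd;

    for (fd = 0; fd < max_fd; fd++)
    {
        if (fcntl(fd, F_GETFD) != -1)
            n++;
    }
    return n;
}

/*
 * Create a private one-element SysV semaphore set and report how many
 * descriptors that consumed.  Returns -1 if semget() fails, for
 * instance because SysV IPC is unavailable.
 */
int
sysv_sem_fd_cost(int max_fd)
{
    int         before = count_open_fds(max_fd);
    int         semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    int         after;

    if (semid < 0)
        return -1;

    after = count_open_fds(max_fd);
    semctl(semid, 0, IPC_RMID); /* remove the set again */

    return after - before;
}
```

A result of 0 from sysv_sem_fd_cost(256) would mean the semaphore set costs no descriptors, so there would be nothing extra for a forked backend to inherit.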

Comments anyone? There are a lot of unknowns here...

			regards, tom lane