Thread: Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> The default open file limit of 1024 is extremely low, given modern
> resources and kernel architectures. The reason that this hasn't changed
> is because doing so would break legacy programs that use the
> select(2) system call in hard to debug ways. So instead programs that
> want to opt-in to a higher open file limit are expected to bump their
> soft limit to their hard limit on startup. Details on this are very well
> explained in a blogpost by the systemd author[1].

On a handy Linux machine (running RHEL9):

$ ulimit -n
1024
$ ulimit -n -H
524288

I'm okay with believing that 1024 is unreasonably small, but that
doesn't mean I think half a million is a safe value.  (Remember that
that's *per backend*.)  Postgres has run OSes out of FDs in the past
and I don't believe we couldn't do it again.

Also, the argument you cite is completely recent-Linux-centric
and does not consider the likely effects on other platforms.
To take one example, on current macOS:

$ ulimit -n
4864
$ ulimit -n -H
unlimited

(Hm, so Apple wasn't impressed by the "let's not break select(2)"
argument.  But I digress.)

I'm afraid this patch will replace "you need to tune ulimit -n
to get best performance" with "you need to tune ulimit -n to
avoid crashing your system".  Does not sound like an improvement.

Maybe a sanity limit on how high we'll try to raise the ulimit
would help.

            regards, tom lane



I wrote:
> Maybe a sanity limit on how high we'll try to raise the ulimit
> would help.

Oh, I'd forgotten that we already have one: max_files_per_process.
Since that's only 1000 by default, this patch doesn't actually have
any effect (on Linux anyway) unless the DBA raises
max_files_per_process.  That alleviates my concern quite a bit.

... but not completely.  You didn't read all of Pid Eins' advice:

    If said program you hack on forks off foreign programs, make sure
    to reset the RLIMIT_NOFILE soft limit back to 1024 for them. Just
    because your program might be fine with fds >= 1024 it doesn't
    mean that those foreign programs might. And unfortunately
    RLIMIT_NOFILE is inherited down the process tree unless explicitly
    set.

I think we'd need to pay some attention to that in e.g. COPY FROM
PROGRAM.  I also wonder whether plperl, plpython, etc can be
guaranteed not to run any code that depends on select(2).

            regards, tom lane



From: Tomas Vondra
On 2/11/25 20:20, Tom Lane wrote:
> Jelte Fennema-Nio <postgres@jeltef.nl> writes:
>> The default open file limit of 1024 is extremely low, given modern
>> resources and kernel architectures. The reason that this hasn't changed
>> is because doing so would break legacy programs that use the
>> select(2) system call in hard to debug ways. So instead programs that
>> want to opt-in to a higher open file limit are expected to bump their
>> soft limit to their hard limit on startup. Details on this are very well
>> explained in a blogpost by the systemd author[1].
> 
> On a handy Linux machine (running RHEL9):
> 
> $ ulimit -n
> 1024
> $ ulimit -n -H
> 524288
> 
> I'm okay with believing that 1024 is unreasonably small, but that
> doesn't mean I think half a million is a safe value.  (Remember that
> that's *per backend*.)  Postgres has run OSes out of FDs in the past
> and I don't believe we couldn't do it again.
> 
> Also, the argument you cite is completely recent-Linux-centric
> and does not consider the likely effects on other platforms.
> To take one example, on current macOS:
> 
> $ ulimit -n
> 4864
> $ ulimit -n -H
> unlimited
> 
> (Hm, so Apple wasn't impressed by the "let's not break select(2)"
> argument.  But I digress.)
> 
> I'm afraid this patch will replace "you need to tune ulimit -n
> to get best performance" with "you need to tune ulimit -n to
> avoid crashing your system".  Does not sound like an improvement.
> 
> Maybe a sanity limit on how high we'll try to raise the ulimit
> would help.
> 

I agree the defaults may be pretty low for current systems, but do we
want to get into the business of picking a value and overriding whatever
value is set by the sysadmin? I don't think a high hard limit should be
seen as an implicit permission to just set it as the soft limit.

Imagine you're a sysadmin / DBA who picks a low soft limit (for whatever
reason - there may be other stuff running on the system, ...). And then
postgres starts and just feels like bumping the soft limit. Sure, the
sysadmin can lower the hard limit and then we'll respect that, but I
don't recall any other tool requiring this approach, and it would
definitely be quite surprising to me.

I did run into bottlenecks due to "too few file descriptors" during a
recent experiment with partitioning, which made it pretty trivial to
get into a situation where we start thrashing the VfdCache. I have a
half-written draft of a blog post about that somewhere.

But my conclusion was that it's damn difficult to even realize that's
happening, especially if you don't have access to the OS / perf, etc. So
my takeaway was we should improve that first, so that people have a
chance to realize they have this issue, and can do the tuning. The
improvements I thought about were:

- track hits/misses for the VfdCache (and add a system view for that)

- maybe have a wait event for opening/closing file descriptors

- show max_safe_fds value somewhere, not just max_files_per_process
  (which we may silently override and use a lower value)

Even if we decide to increase the soft limit somehow, I think it'd still
be useful to make this info available (or at least some of it).


regards

-- 
Tomas Vondra




Tomas Vondra <tomas@vondra.me> writes:
> I did run into bottlenecks due to "too few file descriptors" during a
> recent experiment with partitioning, which made it pretty trivial to
> get into a situation where we start thrashing the VfdCache. I have a
> half-written draft of a blog post about that somewhere.

> But my conclusion was that it's damn difficult to even realize that's
> happening, especially if you don't have access to the OS / perf, etc.

Yeah.  fd.c does its level best to keep going even with only a few FDs
available, and it's hard to tell that you have a performance problem
arising from that.  (Although I recall old war stories about Postgres
continuing to chug along just fine after it'd run the kernel out of
FDs, although every other service on the system was crashing left and
right, making it difficult e.g. even to log in.  That scenario is why
I'm resistant to pushing our allowed number of FDs to the moon...)

> So
> my takeaway was we should improve that first, so that people have a
> chance to realize they have this issue, and can do the tuning. The
> improvements I thought about were:

> - track hits/misses for the VfdCache (and add a system view for that)

I think what we actually would like to know is how often we have to
close an open FD in order to make room to open a different file.
Maybe that's the same thing you mean by "cache miss", but it doesn't
seem like quite the right terminology.  Anyway, +1 for adding some way
to discover how often that's happening.

> - maybe have a wait event for opening/closing file descriptors

Not clear that that helps, at least for this specific issue.

> - show max_safe_fds value somewhere, not just max_files_per_process
>   (which we may silently override and use a lower value)

Maybe we should just assign max_safe_fds back to max_files_per_process
after running set_max_safe_fds?  The existence of two variables is a
bit confusing anyhow.  I vaguely recall that we had a reason for
keeping them separate, but I can't think of the reasoning now.

            regards, tom lane



From: Andres Freund
Hi,

On 2025-02-11 21:04:25 +0100, Tomas Vondra wrote:
> I agree the defaults may be pretty low for current systems, but do we
> want to get into the business of picking a value and overriding whatever
> value is set by the sysadmin? I don't think a high hard limit should be
> seen as an implicit permission to just set it as the soft limit.

As documented in the links sent by Jelte, that's *explicitly* the reasoning
for the difference between default soft and hard limits. For safety, programs
should opt into using higher FD limits, rather than being opted into them.


> Imagine you're a sysadmin / DBA who picks a low soft limit (for whatever
> reason - there may be other stuff running on the system, ...). And then
> postgres starts and just feels like bumping the soft limit. Sure, the
> sysadmin can lower the hard limit and then we'll respect that, but I don't
> recall any other tool requiring this approach, and it would definitely be
> quite surprising to me.

https://codesearch.debian.net/search?q=setrlimit&literal=1

In a quick skim I found that at least gimp, openjdk, libreoffice, gcc, samba,
dbus increase the soft limit to something closer to the hard limit. And that
was at page 62 out of 1269.



> I did run into bottlenecks due to "too few file descriptors" during a
> recent experiment with partitioning, which made it pretty trivial to
> get into a situation where we start thrashing the VfdCache. I have a
> half-written draft of a blog post about that somewhere.
>
> But my conclusion was that it's damn difficult to even realize that's
> happening, especially if you don't have access to the OS / perf, etc. So my
> takeaway was we should improve that first, so that people have a chance to
> realize they have this issue, and can do the tuning. The improvements I
> thought about were:

Hm, that seems something orthogonal to me. I'm on board with that suggestion,
but I don't see why that should stop us from having code to adjust the rlimit.


My suggestion would be to redefine max_files_per_process as the number of
files we try to be able to open in backends. I.e. set_max_safe_fds() would
first count the number of already open fds (since those will largely be
inherited by child processes) and then check if we can open up to
max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
necessary.

That way we don't increase the number of FDs we use in the default
configuration drastically, but increasing the number only requires a config
change, it doesn't require also figuring out how to increase the settings in
whatever starts postgres. Which, e.g. in cloud environments, typically won't
be possible.

And when using something like io_uring for AIO, it'd allow opening
max_files_per_process files in addition to the files required for the
io_uring instances.

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> My suggestion would be to redefine max_files_per_process as the number of
> files we try to be able to open in backends. I.e. set_max_safe_fds() would
> first count the number of already open fds (since those will largely be
> inherited by child processes) and then check if we can open up to
> max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
> necessary.

Seems plausible.  IIRC we also want 10 or so FDs available as "slop"
for code that doesn't go through fd.c.

> And when using something like io_uring for AIO, it'd allow opening
> max_files_per_process files in addition to the files required for the
> io_uring instances.

Not following?  Surely we'd not be configuring that so early in
postmaster start?

            regards, tom lane



From: Andres Freund
Hi,

On 2025-02-11 16:18:37 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > And when using something like io_uring for AIO, it'd allow opening
> > max_files_per_process files in addition to the files required for the
> > io_uring instances.
>
> Not following?  Surely we'd not be configuring that so early in
> postmaster start?

The issue is that, with io_uring, we need to create one FD for each possible
child process, so that one backend can wait for completions for IO issued by
another backend [1]. Those io_uring instances need to be created in
postmaster, so they're visible to each backend. Obviously that makes it much
easier to run into an unadjusted soft RLIMIT_NOFILE, particularly if
max_connections is set to a higher value.

In the current version of the AIO patchset, the creation of those io_uring
instances does happen as part of a shmem init callback, as the io_uring
creation also sets up queues visible in shmem. And shmem init callbacks are
currently happening *before* postmaster's set_max_safe_fds() call:


    /*
     * Set up shared memory and semaphores.
     *
     * Note: if using SysV shmem and/or semas, each postmaster startup will
     * normally choose the same IPC keys.  This helps ensure that we will
     * clean up dead IPC objects if the postmaster crashes and is restarted.
     */
    CreateSharedMemoryAndSemaphores();

    /*
     * Estimate number of openable files.  This must happen after setting up
     * semaphores, because on some platforms semaphores count as open files.
     */
    set_max_safe_fds();


So the issue would actually be that we're currently doing set_max_safe_fds()
too late, not too early :/

Greetings,

Andres Freund

[1] Initially I tried to avoid that, by sharing a smaller number of io_uring
    instances across backends. Making that work was a fair bit of code *and*
    was considerably slower, due to now needing a lock around submission of
    IOs. Moving to one io_uring instance per backend fairly dramatically
    simplified the code while also speeding it up.



From: Tomas Vondra

On 2/11/25 21:18, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> I did run into bottlenecks due to "too few file descriptors" during a
>> recent experiment with partitioning, which made it pretty trivial to
>> get into a situation where we start thrashing the VfdCache. I have a
>> half-written draft of a blog post about that somewhere.
> 
>> But my conclusion was that it's damn difficult to even realize that's
>> happening, especially if you don't have access to the OS / perf, etc.
> 
> Yeah.  fd.c does its level best to keep going even with only a few FDs
> available, and it's hard to tell that you have a performance problem
> arising from that.  (Although I recall old war stories about Postgres
> continuing to chug along just fine after it'd run the kernel out of
> FDs, although every other service on the system was crashing left and
> right, making it difficult e.g. even to log in.  That scenario is why
> I'm resistant to pushing our allowed number of FDs to the moon...)
> 
>> So
>> my takeaway was we should improve that first, so that people have a
>> chance to realize they have this issue, and can do the tuning. The
>> improvements I thought about were:
> 
>> - track hits/misses for the VfdCache (and add a system view for that)
> 
> I think what we actually would like to know is how often we have to
> close an open FD in order to make room to open a different file.
> Maybe that's the same thing you mean by "cache miss", but it doesn't
> seem like quite the right terminology.  Anyway, +1 for adding some way
> to discover how often that's happening.
> 

We can count the evictions (i.e. closing a file so that we can open a
new one) too, but AFAICS that's about the same as counting "misses"
(opening a file after not finding it in the cache). After the cache
warms up, those counts should be about the same, I think.

Or am I missing something?

>> - maybe have a wait event for opening/closing file descriptors
> 
> Not clear that that helps, at least for this specific issue.
> 

I don't think Jelte described any specific issue, but the symptoms I've
observed were that a query was accessing a table with ~1000 relations
(partitions + indexes), thrashing the VfdCache, getting ~0% cache hits.
And the open/close calls were taking a lot of time (~25% CPU time).
That'd be very visible as a wait event, I believe.

>> - show max_safe_fds value somewhere, not just max_files_per_process
>>   (which we may silently override and use a lower value)
> 
> Maybe we should just assign max_safe_fds back to max_files_per_process
> after running set_max_safe_fds?  The existence of two variables is a
> bit confusing anyhow.  I vaguely recall that we had a reason for
> keeping them separate, but I can't think of the reasoning now.
> 

That might work. I don't know what the reasons were for not doing that
originally, but I suppose there were some.


regards

-- 
Tomas Vondra




From: Tomas Vondra

On 2/11/25 22:14, Andres Freund wrote:
> Hi,
> 
> On 2025-02-11 21:04:25 +0100, Tomas Vondra wrote:
>> I agree the defaults may be pretty low for current systems, but do we
>> want to get into the business of picking a value and overriding whatever
>> value is set by the sysadmin? I don't think a high hard limit should be
>> seen as an implicit permission to just set it as the soft limit.
> 
> As documented in the links sent by Jelte, that's *explicitly* the reasoning
> for the difference between default soft and hard limits. For safety, programs
> should opt into using higher FD limits, rather than being opted into them.
> 

OK, I guess I was mistaken in how I understood the hard/soft limits.

> 
>> Imagine you're a sysadmin / DBA who picks a low soft limit (for whatever
>> reason - there may be other stuff running on the system, ...). And then
>> postgres starts and just feels like bumping the soft limit. Sure, the
>> sysadmin can lower the hard limit and then we'll respect that, but I don't
>> recall any other tool requiring this approach, and it would definitely be
>> quite surprising to me.
> 
> https://codesearch.debian.net/search?q=setrlimit&literal=1
> 
> In a quick skim I found that at least gimp, openjdk, libreoffice, gcc, samba,
> dbus increase the soft limit to something closer to the hard limit. And that
> was at page 62 out of 1269.
> 

Ack

> 
> 
>> I did run into bottlenecks due to "too few file descriptors" during a
>> recent experiment with partitioning, which made it pretty trivial to
>> get into a situation where we start thrashing the VfdCache. I have a
>> half-written draft of a blog post about that somewhere.
>>
>> But my conclusion was that it's damn difficult to even realize that's
>> happening, especially if you don't have access to the OS / perf, etc. So my
>> takeaway was we should improve that first, so that people have a chance to
>> realize they have this issue, and can do the tuning. The improvements I
>> thought about were:
> 
> Hm, that seems something orthogonal to me. I'm on board with that suggestion,
> but I don't see why that should stop us from having code to adjust the rlimit.
> 

Right, it is somewhat orthogonal.

I was mentioning that mostly in the context of tuning the parameters we
already have (because how would you know you have this problem /
what would be a good value to set). I'm not demanding that we do nothing
until the monitoring bits get implemented, but being able to monitor
this seems pretty useful even if we adjust the soft limit based on some
heuristic.

> 
> My suggestion would be to redefine max_files_per_process as the number of
> files we try to be able to open in backends. I.e. set_max_safe_fds() would
> first count the number of already open fds (since those will largely be
> inherited by child processes) and then check if we can open up to
> max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
> necessary.
> 
> That way we don't increase the number of FDs we use in the default
> configuration drastically, but increasing the number only requires a config
> change, it doesn't require also figuring out how to increase the settings in
> whatever starts postgres. Which, e.g. in cloud environments, typically won't
> be possible.
> 

Seems reasonable, +1 to this. But that just makes the monitoring bits
more important, because how would you know you need to increase the GUC?


regards

-- 
Tomas Vondra




Andres Freund <andres@anarazel.de> writes:
> In the current version of the AIO patchset, the creation of those io_uring
> instances does happen as part of a shmem init callback, as the io_uring
> creation also sets up queues visible in shmem.

Hmm.

> So the issue would actually be that we're currently doing set_max_safe_fds()
> too late, not too early :/

Well, we'd rather set_max_safe_fds happen after semaphore creation,
so that it doesn't have to be explicitly aware of whether semaphores
consume FDs.  Could we have it be aware of how many FDs *will be*
needed for io_uring, but postpone creation of those until after we
jack up RLIMIT_NOFILE?

I guess the other way would be to have two rounds of RLIMIT_NOFILE
adjustment, before and after shmem creation.  That seems ugly but
shouldn't be very time-consuming.

            regards, tom lane



Tomas Vondra <tomas@vondra.me> writes:
> On 2/11/25 21:18, Tom Lane wrote:
>> I think what we actually would like to know is how often we have to
>> close an open FD in order to make room to open a different file.
>> Maybe that's the same thing you mean by "cache miss", but it doesn't
>> seem like quite the right terminology.  Anyway, +1 for adding some way
>> to discover how often that's happening.

> We can count the evictions (i.e. closing a file so that we can open a
> new one) too, but AFAICS that's about the same as counting "misses"
> (opening a file after not finding it in the cache). After the cache
> warms up, those counts should be about the same, I think.

Umm ... only if the set of files you want access to is quite static,
which doesn't seem like a great bet in the presence of temporary
tables and such.  I think if we don't explicitly count evictions
then we'll be presenting misleading results.

            regards, tom lane



From: Andres Freund
Hi,

On 2025-02-11 17:55:39 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So the issue would actually be that we're currently doing set_max_safe_fds()
> > too late, not too early :/
> 
> Well, we'd rather set_max_safe_fds happen after semaphore creation,
> so that it doesn't have to be explicitly aware of whether semaphores
> consume FDs.  Could we have it be aware of how many FDs *will be*
> needed for io_uring, but postpone creation of those until after we
> jack up RLIMIT_NOFILE?

Yes, that should be fairly trivial to do. It'd require a call from postmaster
into the aio subsystem, after set_max_safe_fds() is done, but I think that'd
be ok.  I'd be inclined to not make that a general mechanism for now.


> I guess the other way would be to have two rounds of RLIMIT_NOFILE
> adjustment, before and after shmem creation.  That seems ugly but
> shouldn't be very time-consuming.

Yea, something like that could also work.

At some point I was playing with calling AcquireExternalFD() the appropriate
number of times during shmem reservation. That turned out to work badly right
now, because it relies on set_max_safe_fds() having already run *and* insists
on numExternalFDs < max_safe_fds / 3.

One way to deal with this would be for AcquireExternalFD() to first increase
the rlimit, if necessary and possible, then start to close LRU files and only
then fail.

Arguably that'd have the advantage of e.g. postgres_fdw not stomping quite so
hard on VFDs, because we'd first increase the limit.

Of course it'd also mean we'd increase the limit more often. But I don't think
it's particularly expensive.

Greetings,

Andres Freund



From: Andres Freund
Hi,

Thanks for these patches!

On 2025-02-12 22:52:52 +0100, Jelte Fennema-Nio wrote:
> From 8e964db585989734a5f6c1449ffb4c62e1190a6a Mon Sep 17 00:00:00 2001
> From: Jelte Fennema-Nio <github-tech@jeltef.nl>
> Date: Tue, 11 Feb 2025 22:04:44 +0100
> Subject: [PATCH v2 1/3] Bump pgbench soft open file limit (RLIMIT_NOFILE) to
>  hard limit
> 
> The default open file limit of 1024 is extremely low, given modern
> resources and kernel architectures. The reason that this hasn't changed
> is because doing so would break legacy programs that use the select(2)
> system call in hard to debug ways. So instead programs that want to
> opt-in to a higher open file limit are expected to bump their soft limit
> to their hard limit on startup. Details on this are very well explained
> in a blogpost by the systemd author[1].
> 
> This makes pgbench bump its soft open file limit to the hard open file
> limit. This ensures users are not told to change their ulimit and then
> retry. Instead we now do that for them automatically.
> 
> [1]: https://0pointer.net/blog/file-descriptor-limits.html
> ---
>  src/bin/pgbench/pgbench.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
> index 5e1fcf59c61..ea7bf980bea 100644
> --- a/src/bin/pgbench/pgbench.c
> +++ b/src/bin/pgbench/pgbench.c
> @@ -6815,6 +6815,15 @@ main(int argc, char **argv)
>  #ifdef HAVE_GETRLIMIT
>                  if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
>                      pg_fatal("getrlimit failed: %m");
> +
> +                /*
> +                 * Bump the soft limit to the hard limit to not run into low
> +                 * file limits.
> +                 */
> +                rlim.rlim_cur = rlim.rlim_max;
> +                if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
> +                    pg_fatal("setrlimit failed: %m");
> +
>                  if (rlim.rlim_cur < nclients + 3)
>                  {
>                      pg_log_error("need at least %d open files, but system limit is %ld",

Why not do this only in the if (rlim.rlim_cur < nclients + 3) case?

Other than this I think we should just apply this.



> From e84092e3db7712db08dfc5f8824f692a53a1aa9e Mon Sep 17 00:00:00 2001
> From: Jelte Fennema-Nio <github-tech@jeltef.nl>
> Date: Tue, 11 Feb 2025 19:15:36 +0100
> Subject: [PATCH v2 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
>  when necessary
> 
> The default open file limit of 1024 on Linux is extremely low. The
> reason that this hasn't changed is because doing so would break
> legacy programs that use the select(2) system call in hard to debug
> ways. So instead programs that want to opt-in to a higher open file
> limit are expected to bump their soft limit to their hard limit on
> startup. Details on this are very well explained in a blogpost by the
> systemd author[1]. There's also a similar change done by the Go
> language[2].
> 
> This makes postmaster bump its soft open file limit when we realize
> that we'll run into the soft limit with the requested
> max_files_per_process GUC. We do so by slightly changing the meaning of
> the max_files_per_process GUC. The actual (not publicly exposed) limit
> is max_safe_fds, previously this would be set to:
> max_files_per_process - already_open_files - NUM_RESERVED_FDS
> After this change we now try to set max_safe_fds to
> max_files_per_process if the system allows that. This is deemed more
> natural for users to understand, because now the limit of files that
> they can open is actually what they configured in max_files_per_process.
> 
> Adding this infrastructure to change RLIMIT_NOFILE when needed is
> especially useful for the AIO work that Andres is doing, because
> io_uring consumes a lot of file descriptors. Even without looking at AIO
> there is a large number of reports from people that require changing
> their soft file limit before starting Postgres, sometimes falling back
> to lowering max_files_per_process when they fail to do so[3-8]. It's
> also not all that strange to fail at setting the soft open file limit
> because there are multiple places where one can configure such limits
> and usually only one of them is effective (which one depends on how
> Postgres is started). In cloud environments it's also often not possible
> for users to change the soft limit, because they don't control the way
> that Postgres is started.
> 
> One thing to note is that we temporarily restore the original soft
> limit when shell-ing out to other executables. This is done as a
> precaution in case those executables are using select(2).


> diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
> index 1ef1713c91a..bf17426039e 100644
> --- a/src/backend/access/transam/xlogarchive.c
> +++ b/src/backend/access/transam/xlogarchive.c
> @@ -158,6 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
>              (errmsg_internal("executing restore command \"%s\"",
>                               xlogRestoreCmd)));
>  
> +    RestoreOriginalOpenFileLimit();
>      fflush(NULL);
>      pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
>  

I think it might be better to have a precursor commit that wraps system(),
including the fflush(), pgstat_report_wait_start(), pgstat_report_wait_end().

It seems a bit silly to have a dance of 5 function calls that are repeated in
three places.


> +#ifdef HAVE_GETRLIMIT
> +/*
> + * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
> + * Returns true if successful, false otherwise.
> + */
> +static bool
> +IncreaseOpenFileLimit(int extra_files)
> +{

Why is this a relative instead of an absolute amount?


> +    struct rlimit rlim = custom_max_open_files;
> +
> +    /* If we're already at the max we reached our limit */
> +    if (rlim.rlim_cur == original_max_open_files.rlim_max)
> +        return false;
> +
> +    /* Otherwise try to increase the soft limit to what we need */
> +    rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
> +
> +    if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
> +    {
> +        ereport(WARNING, (errmsg("setrlimit failed: %m")));
> +        return false;
> +    }

Is a WARNING really appropriate here?  I guess it's just copied from elsewhere
in the file, but still...


> +/*
> + * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
> + *         was present at postmaster start.
> + *
> + * This should be called before spawning subprocesses that might use select(2)
> + * which can only handle file descriptors up to 1024.
> + */
> +void
> +RestoreOriginalOpenFileLimit(void)
> +{
> +#ifdef HAVE_GETRLIMIT
> +    if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
> +    {
> +        /* Not changed, so no need to call setrlimit at all */
> +        return;
> +    }
> +
> +    if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
> +    {
> +        ereport(WARNING, (errmsg("setrlimit failed: %m")));
> +    }
> +#endif
> +}

Hm. Does this actually work if we currently have more than
original_max_open_files.rlim_cur files open?

It'd be fine to do the system() with more files open, because it'll internally
do exec(), but of course setrlimit() doesn't know that ...


>  /*
>   * count_usable_fds --- count how many FDs the system will let us open,
>   *        and estimate how many are already open.
> @@ -969,7 +1051,6 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
>      int            j;
>  
>  #ifdef HAVE_GETRLIMIT
> -    struct rlimit rlim;
>      int            getrlimit_status;
>  #endif

All these ifdefs make this function rather hard to read.  It was kinda bad
before, but it does get even worse with this patch.  Not really sure what to
do about that though.


> @@ -977,9 +1058,11 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
>      fd = (int *) palloc(size * sizeof(int));
>  
>  #ifdef HAVE_GETRLIMIT
> -    getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
> +    getrlimit_status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
>      if (getrlimit_status != 0)
>          ereport(WARNING, (errmsg("getrlimit failed: %m")));
> +    else
> +        custom_max_open_files = original_max_open_files;
>  #endif                            /* HAVE_GETRLIMIT */

We were discussing calling set_max_fds() twice to deal with io_uring FDs
requiring a higher FD limit than before.  With the patch as-is, we'd
overwrite the saved original_max_open_files in that case...



>      /* dup until failure or probe limit reached */
> @@ -993,13 +1076,21 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
>           * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
>           * some platforms
>           */
> -        if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
> -            break;
> +        if (getrlimit_status == 0 && highestfd >= custom_max_open_files.rlim_cur - 1)
> +        {
> +            if (!IncreaseOpenFileLimit(max_to_probe - used))
> +                break;
> +        }
>  #endif
>  
>          thisfd = dup(2);
>          if (thisfd < 0)
>          {
> +#ifdef HAVE_GETRLIMIT
> +            if (errno == ENFILE && IncreaseOpenFileLimit(max_to_probe - used))
> +                continue;
> +#endif
> +

Hm. When is this second case here reachable and useful? Shouldn't we already
have increased the limit before getting here? If so IncreaseOpenFileLimit()
won't help, no?
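The distinction matters because EMFILE reports the per-process limit (which raising RLIMIT_NOFILE can fix), while ENFILE reports the system-wide file table being full, which setrlimit() cannot help with. A standalone sketch (function name made up) shows that exhausting the per-process soft limit yields EMFILE, not ENFILE:

```c
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Exhaust the per-process descriptor limit and report which errno
 * dup() fails with.  EMFILE means the per-process limit was hit;
 * ENFILE would mean the system-wide file table is full.
 */
static int
errno_when_soft_limit_hit(void)
{
	struct rlimit saved;
	struct rlimit rlim;
	int			fds[64];
	int			nfds = 0;
	int			observed = 0;

	if (getrlimit(RLIMIT_NOFILE, &saved) != 0)
		return -1;

	/* shrink the per-process limit so we hit it quickly */
	rlim = saved;
	rlim.rlim_cur = 16;
	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
		return -1;

	/* dup until the limit is exhausted */
	while (nfds < 64)
	{
		int			fd = dup(1);

		if (fd < 0)
		{
			observed = errno;
			break;
		}
		fds[nfds++] = fd;
	}

	setrlimit(RLIMIT_NOFILE, &saved);
	while (nfds > 0)
		close(fds[--nfds]);

	return observed;
}
```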


> @@ -1078,6 +1164,7 @@ set_max_safe_fds(void)
>           max_safe_fds, usable_fds, already_open);
>  }
>  
> +
>  /*
>   * Open a file with BasicOpenFilePerm() and pass default file mode for the
>   * fileMode parameter.

Unrelated change.


> From 44c196025cbf57a09f6aa1bc70646ed2537d9654 Mon Sep 17 00:00:00 2001
> From: Jelte Fennema-Nio <github-tech@jeltef.nl>
> Date: Wed, 12 Feb 2025 01:08:07 +0100
> Subject: [PATCH v2 3/3] Reflect the value of max_safe_fds in
>  max_files_per_process
> 
> It is currently hard to figure out if max_safe_fds is significantly
> lower than max_files_per_process. This starts reflecting the value of
> max_safe_fds in max_files_per_process after our limit detection. We
> still want to have two separate variables because for the bootstrap or
> standalone-backend cases their values differ on purpose.
> ---
>  src/backend/storage/file/fd.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
> index 04fb93be56d..05c0792e4e1 100644
> --- a/src/backend/storage/file/fd.c
> +++ b/src/backend/storage/file/fd.c
> @@ -1149,6 +1149,9 @@ set_max_safe_fds(void)
>  
>      max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
>  
> +    /* Update GUC variable to allow users to see the result */
> +    max_files_per_process = max_safe_fds;

If we want to reflect the value, shouldn't it show up as PGC_S_OVERRIDE or
such?

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2025-02-12 22:52:52 +0100, Jelte Fennema-Nio wrote:
>> +                /*
>> +                 * Bump the soft limit to the hard limit to not run into low
>> +                 * file limits.
>> +                 */
>> +                rlim.rlim_cur = rlim.rlim_max;
>> +                if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
>> +                    pg_fatal("setrlimit failed: %m");
>> +
>> if (rlim.rlim_cur < nclients + 3)

> Why not do this only in the if (rlim.rlim_cur < nclients + 3) case?

+1, otherwise you're introducing a potential failure mode for nothing.
It'd take a couple extra lines to deal with the scenario where
rlim_max is still too small, but that seems fine.
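The approach being suggested can be sketched as a small standalone helper (the name and return conventions are made up, not the actual pgbench code): touch the limit only when the current soft limit is insufficient, raise it only as far as needed, and report the case where even the hard limit is too small:

```c
#include <sys/resource.h>

/*
 * Ensure at least 'needed' descriptors are available.  Returns 0 on
 * success, -1 if even the hard limit is too small, and -2 if a
 * system call failed.
 */
static int
ensure_fd_limit(rlim_t needed)
{
	struct rlimit rlim;

	if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
		return -2;

	/* nothing to do: avoid introducing a new failure mode */
	if (rlim.rlim_cur >= needed)
		return 0;

	/* even the hard limit is too small */
	if (rlim.rlim_max != RLIM_INFINITY && rlim.rlim_max < needed)
		return -1;

	/* bump only as far as needed, not to rlim_max */
	rlim.rlim_cur = needed;
	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
		return -2;

	return 0;
}
```

In pgbench terms the call site would be something like ensure_fd_limit(nclients + 3) before opening the connections.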

> Other than this I think we should just apply this.

Agreed on the pgbench change.  I don't have an opinion yet on the
server changes.

            regards, tom lane



Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

From: Andres Freund

Hi,

On 2025-02-17 21:25:33 +0100, Jelte Fennema-Nio wrote:
> On Mon, 17 Feb 2025 at 18:24, Andres Freund <andres@anarazel.de> wrote:
> > Why not do this only in the if (rlim.rlim_cur < nclients + 3) case?
> 
> Done, I also changed it to not bump to rlim_max, but only to nclients
> + 3. The rest of the patches I'll update later. But response below.

I've pushed this, with one trivial modification: I added %m to the error
message in case setrlimit() fails. That's really unlikely to ever happen, but
...

Greetings,

Andres Freund