Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup - Mailing list pgsql-hackers
| From | Andres Freund | 
|---|---|
| Subject | Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup | 
| Date | |
| Msg-id | lj5q3h4zxwhfusvv2azjnwva74yjwthkrrsvcxrac67f636tpi@ewtgldqjfwes Whole thread Raw | 
| In response to | Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup (Tom Lane <tgl@sss.pgh.pa.us>) | 
| Responses | Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup | 
| List | pgsql-hackers | 
Hi, Thanks for these patches! On 2025-02-12 22:52:52 +0100, Jelte Fennema-Nio wrote: > From 8e964db585989734a5f6c1449ffb4c62e1190a6a Mon Sep 17 00:00:00 2001 > From: Jelte Fennema-Nio <github-tech@jeltef.nl> > Date: Tue, 11 Feb 2025 22:04:44 +0100 > Subject: [PATCH v2 1/3] Bump pgbench soft open file limit (RLIMIT_NOFILE) to > hard limit > > The default open file limit of 1024 is extremely low, given modern > resources and kernel architectures. The reason that this hasn't changed > is because doing so would break legacy programs that use the select(2) > system call in hard to debug ways. So instead programs that want to > opt-in to a higher open file limit are expected to bump their soft limit > to their hard limit on startup. Details on this are very well explained > in a blogpost by the systemd author[1]. > > This starts bumping pgbench its soft open file limit to the hard open > file limit. This makes sure users are not told to change their ulimit, > and then retry. Instead we now do that for them automatically. > > [1]: https://0pointer.net/blog/file-descriptor-limits.html > --- > src/bin/pgbench/pgbench.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c > index 5e1fcf59c61..ea7bf980bea 100644 > --- a/src/bin/pgbench/pgbench.c > +++ b/src/bin/pgbench/pgbench.c > @@ -6815,6 +6815,15 @@ main(int argc, char **argv) > #ifdef HAVE_GETRLIMIT > if (getrlimit(RLIMIT_NOFILE, &rlim) == -1) > pg_fatal("getrlimit failed: %m"); > + > + /* > + * Bump the soft limit to the hard limit to not run into low > + * file limits. > + */ > + rlim.rlim_cur = rlim.rlim_max; > + if (setrlimit(RLIMIT_NOFILE, &rlim) == -1) > + pg_fatal("setrlimit failed: %m"); > + > if (rlim.rlim_cur < nclients + 3) > { > pg_log_error("need at least %d open files, but system limit is %ld", Why not do this only in the if (rlim.rlim_cur < nclients + 3) case? Other than this I think we should just apply this. > From e84092e3db7712db08dfc5f8824f692a53a1aa9e Mon Sep 17 00:00:00 2001 > From: Jelte Fennema-Nio <github-tech@jeltef.nl> > Date: Tue, 11 Feb 2025 19:15:36 +0100 > Subject: [PATCH v2 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE) > when necessary > > The default open file limit of 1024 on Linux is extremely low. The > reason that this hasn't changed change is because doing so would break > legacy programs that use the select(2) system call in hard to debug > ways. So instead programs that want to opt-in to a higher open file > limit are expected to bump their soft limit to their hard limit on > startup. Details on this are very well explained in a blogpost by the > systemd author[1]. There's also a similar change done by the Go > language[2]. > > This starts bumping postmaster its soft open file limit when we realize > that we'll run into the soft limit with the requested > max_files_per_process GUC. We do so by slightly changing the meaning of > the max_files_per_process GUC. The actual (not publicly exposed) limit > is max_safe_fds, previously this would be set to: > max_files_per_process - already_open_files - NUM_RESERVED_FDS > After this change we now try to set max_safe_fds to > max_files_per_process if the system allows that. This is deemed more > natural to understand for users, because now the limit of files that > they can open is actually what they configured in max_files_per_process. > > Adding this infrastructure to change RLIMIT_NOFILE when needed is > especially useful for the AIO work that Andres is doing, because > io_uring consumes a lot of file descriptors. Even without looking at AIO > there is a large number of reports from people that require changing > their soft file limit before starting Postgres, sometimes falling back > to lowering max_files_per_process when they fail to do so[3-8]. It's > also not all that strange to fail at setting the soft open file limit > because there are multiple places where one can configure such limits > and usually only one of them is effective (which one depends on how > Postgres is started). In cloud environments its also often not possible > for user to change the soft limit, because they don't control the way > that Postgres is started. > > One thing to note is that we temporarily restore the original soft > limit when shell-ing out to other executables. This is done as a > precaution in case those executables are using select(2). > diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c > index 1ef1713c91a..bf17426039e 100644 > --- a/src/backend/access/transam/xlogarchive.c > +++ b/src/backend/access/transam/xlogarchive.c > @@ -158,6 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname, > (errmsg_internal("executing restore command \"%s\"", > xlogRestoreCmd))); > > + RestoreOriginalOpenFileLimit(); > fflush(NULL); > pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND); > I think it might be better to have a precursor commit that wraps system(), including the fflush(), pgstat_report_wait_start(), pgstat_report_wait_end(). It seems a bit silly to have a dance of 5 function calls that are repeated in three places. > +#ifdef HAVE_GETRLIMIT > +/* > + * Increases the open file limit (RLIMIT_NOFILE) by the requested amount. > + * Returns true if successful, false otherwise. > + */ > +static bool > +IncreaseOpenFileLimit(int extra_files) > +{ Why is this a relative instead of an absolute amount? > + struct rlimit rlim = custom_max_open_files; > + > + /* If we're already at the max we reached our limit */ > + if (rlim.rlim_cur == original_max_open_files.rlim_max) > + return false; > + > + /* Otherwise try to increase the soft limit to what we need */ > + rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max); > + > + if (setrlimit(RLIMIT_NOFILE, &rlim) != 0) > + { > + ereport(WARNING, (errmsg("setrlimit failed: %m"))); > + return false; > + } Is a WARNING really appropriate here? I guess it's just copied from elsewhere in the file, but still... > +/* > + * RestoreOriginalOpenFileLimit --- Restore the original open file limit that > + * was present at postmaster start. > + * > + * This should be called before spawning subprocesses that might use select(2) > + * which can only handle file descriptors up to 1024. > + */ > +void > +RestoreOriginalOpenFileLimit(void) > +{ > +#ifdef HAVE_GETRLIMIT > + if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur) > + { > + /* Not changed, so no need to call setrlimit at all */ > + return; > + } > + > + if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0) > + { > + ereport(WARNING, (errmsg("setrlimit failed: %m"))); > + } > +#endif > +} Hm. Does this actually work if we currently have more than original_max_open_files.rlim_cur files open? It'd be fine to do the system() with more files open, because it'll internally do exec(), but of course setrlimit() doesn't know that ... > /* > * count_usable_fds --- count how many FDs the system will let us open, > * and estimate how many are already open. > @@ -969,7 +1051,6 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open) > int j; > > #ifdef HAVE_GETRLIMIT > - struct rlimit rlim; > int getrlimit_status; > #endif All these ifdefs make this function rather hard to read. It was kinda bad before, but it does get even worse with this patch. Not really sure what to do about that though. > @@ -977,9 +1058,11 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open) > fd = (int *) palloc(size * sizeof(int)); > > #ifdef HAVE_GETRLIMIT > - getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim); > + getrlimit_status = getrlimit(RLIMIT_NOFILE, &original_max_open_files); > if (getrlimit_status != 0) > ereport(WARNING, (errmsg("getrlimit failed: %m"))); > + else > + custom_max_open_files = original_max_open_files; > #endif /* HAVE_GETRLIMIT */ We were discussing calling set_max_fds() twice to deal with io_uring FDs requiring a higher FD limit than before. With the patch as-is, we'd overwrite our original original_max_open_files in that case... > /* dup until failure or probe limit reached */ > @@ -993,13 +1076,21 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open) > * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on > * some platforms > */ > - if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1) > - break; > + if (getrlimit_status == 0 && highestfd >= custom_max_open_files.rlim_cur - 1) > + { > + if (!IncreaseOpenFileLimit(max_to_probe - used)) > + break; > + } > #endif > > thisfd = dup(2); > if (thisfd < 0) > { > +#ifdef HAVE_GETRLIMIT > + if (errno == ENFILE && IncreaseOpenFileLimit(max_to_probe - used)) > + continue; > +#endif > + Hm. When is this second case here reachable and useful? Shouldn't we already have increased the limit before getting here? If so IncreaseOpenFileLimit() won't help, no? > @@ -1078,6 +1164,7 @@ set_max_safe_fds(void) > max_safe_fds, usable_fds, already_open); > } > > + > /* > * Open a file with BasicOpenFilePerm() and pass default file mode for the > * fileMode parameter. Unrelated change. > From 44c196025cbf57a09f6aa1bc70646ed2537d9654 Mon Sep 17 00:00:00 2001 > From: Jelte Fennema-Nio <github-tech@jeltef.nl> > Date: Wed, 12 Feb 2025 01:08:07 +0100 > Subject: [PATCH v2 3/3] Reflect the value of max_safe_fds in > max_files_per_process > > It is currently hard to figure out if max_safe_fds is significantly > lower than max_files_per_process. This starts reflecting the value of > max_safe_fds in max_files_per_process after our limit detection. We > still want to have two separate variables because for the bootstrap or > standalone-backend cases their values differ on purpose. > --- > src/backend/storage/file/fd.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c > index 04fb93be56d..05c0792e4e1 100644 > --- a/src/backend/storage/file/fd.c > +++ b/src/backend/storage/file/fd.c > @@ -1149,6 +1149,9 @@ set_max_safe_fds(void) > > max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process); > > + /* Update GUC variable to allow users to see the result */ > + max_files_per_process = max_safe_fds; If we want to reflect the value, shouldn't it show up as PGC_S_OVERRIDE or such? Greetings, Andres Freund
pgsql-hackers by date: