Thread: Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
[ redirecting to -hackers ]

I wrote:
> =?utf-8?Q?R=C3=A9mi_Zara?= <remi_zara@mac.com> writes:
> Le 20 févr. 2020 à 12:15, Thomas Munro <thomas.munro@gmail.com> a écrit :
>>> Remi, any chance you could run gmake installcheck under
>>> contrib/postgres_fdw on that host, to see if this is repeatable?  Can
>>> you tell us about the relevant limits?  Maybe ulimit -n (for the user
>>> that runs the build farm), and also sysctl -a | grep descriptors,
>>> sysctl -a | grep maxfiles?

> I have a working NetBSD 8/ppc installation, will try to reproduce there.

Yup, it reproduces fine here.  I see

$ ulimit -a
...
nofiles       (-n descriptors) 128

which squares with the sysctl values:

proc.curproc.rlimit.descriptors.soft = 128
proc.curproc.rlimit.descriptors.hard = 1772
kern.maxfiles = 1772

and also with set_max_safe_fds' results:

2020-02-20 14:29:38.610 EST [2218] DEBUG:  max_safe_fds = 115, usable_fds = 125, already_open = 3

It seems fairly obvious now that I look at it, but: the epoll and kqueue
variants of CreateWaitEventSet are both *fundamentally* unsafe, because
they assume that they can always get a FD when they want one, which is
not a property that we generally want backend code to have.  The only
reason we've not seen this before with epoll is a lack of testing
under lots-of-FDs stress.

The fact that they'll likely leak those FDs on subsequent failures is
another not-very-nice property.

I think we ought to redesign this so that those FDs are handed out by
fd.c, which can ReleaseLruFile() and retry if it gets EMFILE or ENFILE.
fd.c could also be responsible for the resource tracking needed to
prevent leakage.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
I wrote:
> It seems fairly obvious now that I look at it, but: the epoll and kqueue
> variants of CreateWaitEventSet are both *fundamentally* unsafe, because
> they assume that they can always get a FD when they want one, which is
> not a property that we generally want backend code to have.  The only
> reason we've not seen this before with epoll is a lack of testing
> under lots-of-FDs stress.
> The fact that they'll likely leak those FDs on subsequent failures is
> another not-very-nice property.

Hmmm ... actually, there's a third problem, which is the implicit
assumption that we can have as many concurrently-active WaitEventSets
as we like.  We can't, if they depend on FDs --- that's a precious
resource.  It doesn't look like we actually ever have more than about
two per process right now, but I'm concerned about what will happen
as the usage of the feature increases.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Thomas Munro
Date:
On Fri, Feb 21, 2020 at 8:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > It seems fairly obvious now that I look at it, but: the epoll and kqueue
> > variants of CreateWaitEventSet are both *fundamentally* unsafe, because
> > they assume that they can always get a FD when they want one, which is
> > not a property that we generally want backend code to have.  The only
> > reason we've not seen this before with epoll is a lack of testing
> > under lots-of-FDs stress.
> > The fact that they'll likely leak those FDs on subsequent failures is
> > another not-very-nice property.
>
> Hmmm ... actually, there's a third problem, which is the implicit
> assumption that we can have as many concurrently-active WaitEventSets
> as we like.  We can't, if they depend on FDs --- that's a precious
> resource.  It doesn't look like we actually ever have more than about
> two per process right now, but I'm concerned about what will happen
> as the usage of the feature increases.

One thing I've been planning to do for 13 is to get rid of all the
temporary create/destroy WaitEventSets from the main backend loops.
My goal was cutting down on stupid system calls, but this puts a new
spin on it.  I have a patch set to do a bunch of that[1], but now I'm
thinking that perhaps I need to be even more aggressive about it and
set up the 'common' long lived WES up front at backend startup, rather
than doing it on demand, so that there is no chance of failure due to
lack of fds once you've started up.  I also recently figured out how
to handle some more places with the common WES.  I'll post a new patch
set over on that thread shortly.

That wouldn't mean that the postgres_fdw.sql can't fail on a ulimit -n
= 128 system, though, it might just mean that it's postgres_fdw's
socket() call that hits EMFILE rather than WES's kqueue() call while
running that test.  I suppose there are two kinds of system: those
where ulimit -n is higher than max_files_per_process (defaults, on
Linux: 1024 vs 1000) so you have more allowance for sockets and the
like, and those where it isn't, like coypu, where NUM_RESERVED_FDS is
the only thing ensuring we have some spare fds.  I don't know the
history but it looks like NUM_RESERVED_FDS was deliberately or
accidentally tuned to be just enough to be able to run the regression
tests (the interesting ones being the ones that use sockets, like
postgres_fdw.sql), but with a new long lived kqueue() fd in the
picture, it might have to be increased to cover it, no?

About the potential for leaks, Horiguchi-san realised this hazard and
posted a patch[2] to allow WaitEventSets to be cleaned up by the
resource owner machinery.  That's useful for the temporary
WaitEventSet objects that we'd genuinely need in the patch set that's
part of: that's for creating a query-lifetime WES to manage N
connections to remote shards, and it needs to be cleaned up on
failure.  For the temporary ones created by WaitLatch(), I suspect
they don't really belong in a resource owner: instead we should get
rid of it using my WaitMyLatch() patch set, and if there are any
places where we can't for some reason (I hope not), perhaps a
try/catch block should be used to fix that.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20191206.171211.1119526746053895900.horikyota.ntt%40gmail.com



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> One thing I've been planning to do for 13 is to get rid of all the
> temporary create/destroy WaitEventSets from the main backend loops.
> My goal was cutting down on stupid system calls, but this puts a new
> spin on it.  I have a patch set to do a bunch of that[1], but now I'm
> thinking that perhaps I need to be even more aggressive about it and
> set up the 'common' long lived WES up front at backend startup, rather
> than doing it on demand, so that there is no chance of failure due to
> lack of fds once you've started up.

+1

> That wouldn't mean that the postgres_fdw.sql can't fail on a ulimit -n
> = 128 system, though, it might just mean that it's postgres_fdw's
> socket() call that hits EMFILE rather than WES's kqueue() call while
> running that test.

Good point.

> I suppose there are two kinds of system: those
> where ulimit -n is higher than max_files_per_process (defaults, on
> Linux: 1024 vs 1000) so you have more allowance for sockets and the
> like, and those where it isn't, like coypu, where NUM_RESERVED_FDS is
> the only thing ensuring we have some spare fds.  I don't know the
> history but it looks like NUM_RESERVED_FDS was deliberately or
> accidentally tuned to be just enough to be able to run the regression
> tests (the interesting ones being the ones that use sockets, like
> postgres_fdw.sql), but with a new long lived kqueue() fd in the
> picture, it might have to be increased to cover it, no?

No.  NUM_RESERVED_FDS was set decades ago, long before any of those tests
existed, and it has never been changed AFAIK.  It is a bit striking that
we just started seeing it be insufficient with this patch.  Maybe that's
just happenstance, but I wonder whether there is a plain old FD leak
involved in addition to the design issue?  I'll take a closer look at
exactly what's open when we hit the error.

The point about possibly hitting EMFILE in libpq's socket() call is
an interesting one.  libpq of course can't do anything to recover
from that (and even if it could, there are lower levels such as a
possible DNS lookup that we're not going to be able to modify).
I'm speculating about having postgres_fdw ask fd.c to forcibly
free one LRU file before it attempts to open a new libpq connection.
That would prevent EMFILE (process-level exhaustion) and it would
also provide some small protection against ENFILE (system-wide
exhaustion), though of course there's no guarantee that someone
else doesn't snap up the FD you so graciously relinquished.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> ... like coypu, where NUM_RESERVED_FDS is
>> the only thing ensuring we have some spare fds.  I don't know the
>> history but it looks like NUM_RESERVED_FDS was deliberately or
>> accidentally tuned to be just enough to be able to run the regression
>> tests (the interesting ones being the ones that use sockets, like
>> postgres_fdw.sql), but with a new long lived kqueue() fd in the
>> picture, it might have to be increased to cover it, no?

> No.  NUM_RESERVED_FDS was set decades ago, long before any of those tests
> existed, and it has never been changed AFAIK.  It is a bit striking that
> we just started seeing it be insufficient with this patch.  Maybe that's
> just happenstance, but I wonder whether there is a plain old FD leak
> involved in addition to the design issue?  I'll take a closer look at
> exactly what's open when we hit the error.

Hmm ... looks like I'm wrong: we've been skating way too close to the edge
for awhile.  Here's a breakdown of the open FDs in the backend at the time
of the failure, excluding stdin/stdout/stderr (which set_max_safe_fds
accounted for) and FDs pointing to database files:

COMMAND    PID     USER   FD   TYPE             DEVICE SIZE/OFF      NODE NAME
postmaste 2657 postgres    3r  FIFO                0,8      0t0  20902158 pipe            postmaster_alive_fds[0]
postmaste 2657 postgres    4u   REG                0,9        0      3877 [eventpoll]        FeBeWaitSet's epoll_fd
postmaste 2657 postgres    7u  unix 0xffff880878e21880      0t0  20902664 socket        socket for a PGconn owned by
postgres_fdw
postmaste 2657 postgres    8u  IPv6           20902171      0t0       UDP localhost:40795->localhost:40795
pgStatSock
postmaste 2657 postgres    9u  unix 0xffff880876903c00      0t0  20902605 /tmp/.s.PGSQL.5432    MyProcPort->sock
postmaste 2657 postgres   10r  FIFO                0,8      0t0  20902606 pipe            selfpipe_readfd
postmaste 2657 postgres   11w  FIFO                0,8      0t0  20902606 pipe            selfpipe_writefd
postmaste 2657 postgres  105u  unix 0xffff880878e21180      0t0  20902647 socket        socket for a PGconn owned by
postgres_fdw
postmaste 2657 postgres  118u  unix 0xffff8804772c88c0      0t0  20902650 socket        socket for a PGconn owned by
postgres_fdw

One thing to notice is that there are only nine, though NUM_RESERVED_FDS
should have allowed ten.  That's because there are 116 open FDs pointing
at database files, though max_safe_fds is 115.  I'm not sure whether
that's OK or an off-by-one error in fd.c's accounting.  One of the 116
is pointing at a WAL segment, and I think we might not be sending that
through the normal VFD path, so it might be "expected".

But anyway, what this shows is that over time we've eaten enough of
the "reserved" FDs that only three are available for random usage like
postgres_fdw's, if the process's back is against the wall FD-wise.
The postgres_fdw regression test is using all three, meaning there's
exactly no daylight in that test.

Clearly, we gotta do something about that too.  Maybe bumping up
NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
accounting for permanently-eaten FDs would be a better idea.  And
in any case we can't suppose that there's a fixed upper limit on
the number of postgres_fdw connections.  I'm liking the idea I floated
earlier of letting postgres_fdw forcibly close the oldest LRU entry.

BTW, you don't need anything very exotic to provoke this error.
The results above are from a Linux box; I just did "ulimit -n 128"
before starting the postmaster.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
I wrote:
> Clearly, we gotta do something about that too.  Maybe bumping up
> NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
> accounting for permanently-eaten FDs would be a better idea.  And
> in any case we can't suppose that there's a fixed upper limit on
> the number of postgres_fdw connections.  I'm liking the idea I floated
> earlier of letting postgres_fdw forcibly close the oldest LRU entry.

A late-night glimmer of an idea: suppose we make fd.c keep count,
not just of the open FDs under its control, but also of open FDs
not under its control.  The latter count would include the initial
FDs (stdin/stdout/stderr), and FDs allocated by OpenTransientFile
et al, and we could provide functions for other callers to report
that they just allocated or released a FD.  So postgres_fdw could
report, when it opens or closes a PGconn, that the count of external
FDs should be increased or decreased.  fd.c would then forcibly
close VFDs as needed to keep NUM_RESERVED_FDS worth of slop.  We
still need slop, to provide some daylight for code that isn't aware
of this mechanism.  But we could certainly get all these known
long-lived FDs to be counted, and that would allow fd.c to reduce
the number of open VFDs enough to ensure that some slop remains.

This idea doesn't fix every possible problem.  For instance, if you
have a plperlu function that wants to open a bunch of files, I don't
see any easy way to ensure we release VFDs to make that possible.
But it's sure an improvement on the status quo.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
I wrote:
>> Clearly, we gotta do something about that too.  Maybe bumping up
>> NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
>> accounting for permanently-eaten FDs would be a better idea.  And
>> in any case we can't suppose that there's a fixed upper limit on
>> the number of postgres_fdw connections.  I'm liking the idea I floated
>> earlier of letting postgres_fdw forcibly close the oldest LRU entry.

> A late-night glimmer of an idea: suppose we make fd.c keep count,
> not just of the open FDs under its control, but also of open FDs
> not under its control.

Here's a draft patch that does it like that.  There are undoubtedly
more places that need to be taught to report their FD consumption;
one obvious candidate that I didn't touch is dblink.  But this is
enough to account for all the long-lived "extra" FDs that are currently
open in a default build, and it passes check-world even with ulimit -n
set to the minimum that set_max_safe_fds will allow.

One thing I noticed is that if you open enough postgres_fdw connections
to cause a failure, that tends to come out as an unintelligible-to-
the-layman "epoll_create1 failed: Too many open files" error.  That's
because after postgres_fdw has consumed the last available "external
FD" slot, it tries to use CreateWaitEventSet to wait for input from
the remote server, and now that needs to get another external FD.
I left that alone for the moment, because if you do rejigger the
WaitEventSet code to avoid dynamically creating/destroying epoll sockets,
it will stop being a problem.  If that doesn't happen, another
possibility is to reclassify CreateWaitEventSet as a caller that should
ignore "failure" from ReserveExternalFD --- but that would open us up
to problems if a lot of WaitEventSets get created, so it's not a great
answer.  It'd be okay perhaps if we added a distinction between
long-lived and short-lived WaitEventSets (ignoring "failure" only for
the latter).  But I didn't want to go there in this patch.

Thoughts?

            regards, tom lane

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 29c811a..27d509c 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postgres_fdw.h"
+#include "storage/fd.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -259,10 +260,30 @@ connect_pg_server(ForeignServer *server, UserMapping *user)

         keywords[n] = values[n] = NULL;

-        /* verify connection parameters and make connection */
+        /* verify the set of connection parameters */
         check_conn_params(keywords, values, user);

+        /*
+         * We must obey fd.c's limit on non-virtual file descriptors.  Assume
+         * that a PGconn represents one long-lived FD.  (Doing this here also
+         * ensures that VFDs are closed if needed to make room.)
+         */
+        if (!ReserveExternalFD())
+        {
+            ReleaseExternalFD();    /* yup, gotta do this here */
+            ereport(ERROR,
+                    (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+                     errmsg("could not connect to server \"%s\"",
+                            server->servername),
+                     errdetail("There are too many open files.")));
+        }
+
+        /* OK to make connection */
         conn = PQconnectdbParams(keywords, values, false);
+
+        if (!conn)
+            ReleaseExternalFD();    /* because the PG_CATCH block won't */
+
         if (!conn || PQstatus(conn) != CONNECTION_OK)
             ereport(ERROR,
                     (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
@@ -294,7 +315,10 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
     {
         /* Release PGconn data structure if we managed to create one */
         if (conn)
+        {
             PQfinish(conn);
+            ReleaseExternalFD();
+        }
         PG_RE_THROW();
     }
     PG_END_TRY();
@@ -312,6 +336,7 @@ disconnect_pg_server(ConnCacheEntry *entry)
     {
         PQfinish(entry->conn);
         entry->conn = NULL;
+        ReleaseExternalFD();
     }
 }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd527f2..b60e45a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -774,6 +774,7 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  * openLogFile is -1 or a kernel FD for an open log file segment.
  * openLogSegNo identifies the segment.  These variables are only used to
  * write the XLOG, and so will normally refer to the active segment.
+ * Note: call Reserve/ReleaseExternalFD to track consumption of this FD.
  */
 static int    openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
@@ -785,6 +786,9 @@ static XLogSegNo openLogSegNo = 0;
  * will be just past that page. readLen indicates how much of the current
  * page has been read into readBuf, and readSource indicates where we got
  * the currently open file from.
+ * Note: we could use Reserve/ReleaseExternalFD to track consumption of
+ * this FD too; but it doesn't currently seem worthwhile, since the XLOG is
+ * not read by general-purpose sessions.
  */
 static int    readFile = -1;
 static XLogSegNo readSegNo = 0;
@@ -2447,6 +2451,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
             /* create/use new log file */
             use_existent = true;
             openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+            (void) ReserveExternalFD();
         }

         /* Make sure we have the current logfile open */
@@ -2455,6 +2460,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
             XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
                             wal_segment_size);
             openLogFile = XLogFileOpen(openLogSegNo);
+            (void) ReserveExternalFD();
         }

         /* Add current page to the set of pending pages-to-dump */
@@ -2605,6 +2611,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                 XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
                                 wal_segment_size);
                 openLogFile = XLogFileOpen(openLogSegNo);
+                (void) ReserveExternalFD();
             }

             issue_xlog_fsync(openLogFile, openLogSegNo);
@@ -3811,6 +3818,7 @@ XLogFileClose(void)
     }

     openLogFile = -1;
+    ReleaseExternalFD();
 }

 /*
@@ -5224,6 +5232,11 @@ BootStrapXLOG(void)
     use_existent = false;
     openLogFile = XLogFileInit(1, &use_existent, false);

+    /*
+     * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
+     * close the file again in a moment.
+     */
+
     /* Write the first page with the initial record */
     errno = 0;
     pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 59dc4f3..5204c95 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -602,6 +602,9 @@ retry2:

     pg_freeaddrinfo_all(hints.ai_family, addrs);

+    /* Now that we have a long-lived socket, tell fd.c about it. */
+    (void) ReserveExternalFD();
+
     return;

 startup_failed:
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b3986be..2f68a06 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2554,9 +2554,14 @@ ClosePostmasterPorts(bool am_syslogger)
                 (errcode_for_file_access(),
                  errmsg_internal("could not close postmaster death monitoring pipe in child process: %m")));
     postmaster_alive_fds[POSTMASTER_FD_OWN] = -1;
+    /* Notify fd.c that we released one pipe FD. */
+    ReleaseExternalFD();
 #endif

-    /* Close the listen sockets */
+    /*
+     * Close the postmaster's listen sockets.  These aren't tracked by fd.c,
+     * so we don't call ReleaseExternalFD() here.
+     */
     for (i = 0; i < MAXLISTEN; i++)
     {
         if (ListenSocket[i] != PGINVALID_SOCKET)
@@ -2566,7 +2571,10 @@ ClosePostmasterPorts(bool am_syslogger)
         }
     }

-    /* If using syslogger, close the read side of the pipe */
+    /*
+     * If using syslogger, close the read side of the pipe.  We don't bother
+     * tracking this in fd.c, either.
+     */
     if (!am_syslogger)
     {
 #ifndef WIN32
@@ -4279,6 +4287,9 @@ BackendInitialize(Port *port)
     /* Save port etc. for ps status */
     MyProcPort = port;

+    /* Tell fd.c about the long-lived FD associated with the port */
+    (void) ReserveExternalFD();
+
     /*
      * PreAuthDelay is a debugging aid for investigating problems in the
      * authentication cycle: it can be set in postgresql.conf to allow time to
@@ -6442,6 +6453,20 @@ restore_backend_variables(BackendParameters *param, Port *port)
     strlcpy(pkglib_path, param->pkglib_path, MAXPGPATH);

     strlcpy(ExtraOptions, param->ExtraOptions, MAXPGPATH);
+
+    /*
+     * We need to restore fd.c's counts of externally-opened FDs; to avoid
+     * confusion, be sure to do this after restoring max_safe_fds.  (Note:
+     * BackendInitialize will handle this for port->sock.)
+     */
+#ifndef WIN32
+    if (postmaster_alive_fds[0] >= 0)
+        (void) ReserveExternalFD();
+    if (postmaster_alive_fds[1] >= 0)
+        (void) ReserveExternalFD();
+#endif
+    if (pgStatSock != PGINVALID_SOCKET)
+        (void) ReserveExternalFD();
 }


@@ -6584,6 +6609,10 @@ InitPostmasterDeathWatchHandle(void)
                 (errcode_for_file_access(),
                  errmsg_internal("could not create pipe to monitor postmaster death: %m")));

+    /* Notify fd.c that we've eaten two FDs for the pipe. */
+    (void) ReserveExternalFD();
+    (void) ReserveExternalFD();
+
     /*
      * Set O_NONBLOCK to allow testing for the fd's presence with a read()
      * call.
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index b2b69a7..cf7b535 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -562,6 +562,11 @@ SysLogger_Start(void)
      * This means the postmaster must continue to hold the read end of the
      * pipe open, so we can pass it down to the reincarnated syslogger. This
      * is a bit klugy but we have little choice.
+     *
+     * Also note that we don't bother counting the pipe FDs by calling
+     * Reserve/ReleaseExternalFD.  There's no real need to account for them
+     * accurately in the postmaster or syslogger process, and both ends of the
+     * pipe will wind up closed in all other postmaster children.
      */
 #ifndef WIN32
     if (syslogPipe[0] < 0)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b5f4df6..2374ae9 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -61,6 +61,12 @@
  * BasicOpenFile, it is solely the caller's responsibility to close the file
  * descriptor by calling close(2).
  *
+ * If a non-virtual file descriptor needs to be held open for any length of
+ * time, report it to this module by calling ReserveExternalFD and eventually
+ * ReleaseExternalFD, so that we can take it into account while deciding how
+ * many VFDs can be open.  This applies to FDs obtained with BasicOpenFile
+ * as well as those obtained without use of any fd.c API.
+ *
  *-------------------------------------------------------------------------
  */

@@ -103,8 +109,8 @@
 /*
  * We must leave some file descriptors free for system(), the dynamic loader,
  * and other code that tries to open files without consulting fd.c.  This
- * is the number left free.  (While we can be pretty sure we won't get
- * EMFILE, there's never any guarantee that we won't get ENFILE due to
+ * is the number left free.  (While we try fairly hard to prevent EMFILE
+ * errors, there's never any guarantee that we won't get ENFILE due to
  * other processes chewing up FDs.  So it's a bad idea to try to open files
  * without consulting fd.c.  Nonetheless we cannot control all code.)
  *
@@ -119,9 +125,12 @@

 /*
  * If we have fewer than this many usable FDs after allowing for the reserved
- * ones, choke.
+ * ones, choke.  (This value is chosen to work with "ulimit -n 64", but not
+ * much less than that.  Note that this value ensures numExternalFDs can be
+ * at least 16; as of this writing, the contrib/postgres_fdw regression tests
+ * will not pass unless that can grow to at least 14.)
  */
-#define FD_MINFREE                10
+#define FD_MINFREE                48

 /*
  * A number of platforms allow individual processes to open many more files
@@ -132,8 +141,8 @@
 int            max_files_per_process = 1000;

 /*
- * Maximum number of file descriptors to open for either VFD entries or
- * AllocateFile/AllocateDir/OpenTransientFile operations.  This is initialized
+ * Maximum number of file descriptors to open for operations that fd.c knows
+ * about (VFDs, AllocateFile etc, or "external" FDs).  This is initialized
  * to a conservative value, and remains that way indefinitely in bootstrap or
  * standalone-backend cases.  In normal postmaster operation, the postmaster
  * calls set_max_safe_fds() late in initialization to update the value, and
@@ -142,7 +151,7 @@ int            max_files_per_process = 1000;
  * Note: the value of max_files_per_process is taken into account while
  * setting this variable, and so need not be tested separately.
  */
-int            max_safe_fds = 32;    /* default if not changed */
+int            max_safe_fds = FD_MINFREE;    /* default if not changed */

 /* Whether it is safe to continue running after fsync() fails. */
 bool        data_sync_retry = false;
@@ -244,6 +253,11 @@ static int    maxAllocatedDescs = 0;
 static AllocateDesc *allocatedDescs = NULL;

 /*
+ * Number of open "external" FDs reported to Reserve/ReleaseExternalFD.
+ */
+static int    numExternalFDs = 0;
+
+/*
  * Number of temporary files opened during the current session;
  * this is used in generation of tempfile names.
  */
@@ -1025,6 +1039,67 @@ tryAgain:
     return -1;                    /* failure */
 }

+/*
+ * ReserveExternalFD - report external consumption of a file descriptor
+ *
+ * This should be used by callers that need to hold a file descriptor open
+ * over more than a short interval, but cannot use any of the other facilities
+ * provided by this module.  This just tracks the use of the FD and closes
+ * VFDs if needed to ensure we keep NUM_RESERVED_FDS FDs available.
+ *
+ * Returns true if OK, false if too many external FDs have been reserved.
+ *
+ * A "false" result may be treated like an EMFILE failure; to facilitate that,
+ * we set errno = EMFILE before returning.  (Note that ReleaseExternalFD has
+ * to be called even after a "false" result.)  Don't ignore a "false" result
+ * unless failure would be disastrous; an example is that WAL-file access may
+ * ignore it, since the alternative is session failure.  Also, it's very
+ * unwise to ignore a "false" result in code that could consume more than one
+ * FD per process.
+ *
+ * Note: as long as everybody plays nice so that NUM_RESERVED_FDS FDs remain
+ * available, it doesn't matter too much whether this is called before or
+ * after actually opening the FD; but doing so beforehand reduces the risk of
+ * an EMFILE failure if not everybody played nice.  In any case, it's solely
+ * caller's responsibility to keep the external-FD count in sync with reality.
+ */
+bool
+ReserveExternalFD(void)
+{
+    /*
+     * Release VFDs if needed to stay safe.  Because we do this before
+     * incrementing numExternalFDs, the final state will be as desired, i.e.,
+     * nfile + numAllocatedDescs + numExternalFDs <= max_safe_fds.
+     */
+    ReleaseLruFiles();
+
+    numExternalFDs++;
+
+    /*
+     * We don't want more than max_safe_fds / 3 FDs to be consumed this way.
+     * However, we leave it to the caller to decide whether it's better to
+     * fail or press on; that is, the caller may decide to treat that as a
+     * soft limit not a hard one.
+     */
+    if (numExternalFDs <= max_safe_fds / 3)
+        return true;
+    errno = EMFILE;
+    return false;
+}
+
+/*
+ * ReleaseExternalFD - report release of an external file descriptor
+ *
+ * This is guaranteed not to change errno, so it can be used in failure paths.
+ */
+void
+ReleaseExternalFD(void)
+{
+    Assert(numExternalFDs > 0);
+    numExternalFDs--;
+}
+
+
 #if defined(FDDEBUG)

 static void
@@ -1185,7 +1260,7 @@ ReleaseLruFile(void)
 static void
 ReleaseLruFiles(void)
 {
-    while (nfile + numAllocatedDescs >= max_safe_fds)
+    while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
     {
         if (!ReleaseLruFile())
             break;
@@ -2176,13 +2251,13 @@ reserveAllocatedDesc(void)

     /*
      * If the array hasn't yet been created in the current process, initialize
-     * it with FD_MINFREE / 2 elements.  In many scenarios this is as many as
+     * it with FD_MINFREE / 3 elements.  In many scenarios this is as many as
      * we will ever need, anyway.  We don't want to look at max_safe_fds
      * immediately because set_max_safe_fds() may not have run yet.
      */
     if (allocatedDescs == NULL)
     {
-        newMax = FD_MINFREE / 2;
+        newMax = FD_MINFREE / 3;
         newDescs = (AllocateDesc *) malloc(newMax * sizeof(AllocateDesc));
         /* Out of memory already?  Treat as fatal error. */
         if (newDescs == NULL)
@@ -2200,10 +2275,12 @@ reserveAllocatedDesc(void)
      *
      * We mustn't let allocated descriptors hog all the available FDs, and in
      * practice we'd better leave a reasonable number of FDs for VFD use.  So
-     * set the maximum to max_safe_fds / 2.  (This should certainly be at
-     * least as large as the initial size, FD_MINFREE / 2.)
+     * set the maximum to max_safe_fds / 3.  (This should certainly be at
+     * least as large as the initial size, FD_MINFREE / 3, so we aren't
+     * tightening the restriction here.)  Recall that "external" FDs are
+     * allowed to consume another third of max_safe_fds.
      */
-    newMax = max_safe_fds / 2;
+    newMax = max_safe_fds / 3;
     if (newMax > maxAllocatedDescs)
     {
         newDescs = (AllocateDesc *) realloc(allocatedDescs,
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 138bdec..ef1fab9 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -247,14 +247,17 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
     /*
      * Create new segment or open an existing one for attach.
      *
-     * Even though we're not going through fd.c, we should be safe against
-     * running out of file descriptors, because of NUM_RESERVED_FDS.  We're
-     * only opening one extra descriptor here, and we'll close it before
-     * returning.
+     * Even though we will close the FD before returning, it seems desirable
+     * to use Reserve/ReleaseExternalFD, to reduce the probability of EMFILE
+     * failure.  The fact that we won't hold the FD open long justifies
+     * ignoring a "false" return from ReserveExternalFD, though.
      */
+    (void) ReserveExternalFD();
+
     flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
     if ((fd = shm_open(name, flags, PG_FILE_MODE_OWNER)) == -1)
     {
+        ReleaseExternalFD();
         if (errno != EEXIST)
             ereport(elevel,
                     (errcode_for_dynamic_shared_memory(),
@@ -278,6 +281,7 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
             /* Back out what's already been done. */
             save_errno = errno;
             close(fd);
+            ReleaseExternalFD();
             errno = save_errno;

             ereport(elevel,
@@ -295,6 +299,7 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
         /* Back out what's already been done. */
         save_errno = errno;
         close(fd);
+        ReleaseExternalFD();
         shm_unlink(name);
         errno = save_errno;

@@ -323,6 +328,7 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
         /* Back out what's already been done. */
         save_errno = errno;
         close(fd);
+        ReleaseExternalFD();
         if (op == DSM_OP_CREATE)
             shm_unlink(name);
         errno = save_errno;
@@ -336,6 +342,7 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
     *mapped_address = address;
     *mapped_size = request_size;
     close(fd);
+    ReleaseExternalFD();

     return true;
 }
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index cbd4952..067c570 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
@@ -187,6 +188,9 @@ InitializeLatchSupport(void)
             /* Clean up, just for safety's sake; we'll set these below */
             selfpipe_readfd = selfpipe_writefd = -1;
             selfpipe_owner_pid = 0;
+            /* Keep fd.c's accounting straight */
+            ReleaseExternalFD();
+            ReleaseExternalFD();
         }
         else
         {
@@ -194,6 +198,7 @@ InitializeLatchSupport(void)
              * Postmaster didn't create a self-pipe ... or else we're in an
              * EXEC_BACKEND build, in which case it doesn't matter since the
              * postmaster's pipe FDs were closed by the action of FD_CLOEXEC.
+             * fd.c won't have state to clean up, either.
              */
             Assert(selfpipe_readfd == -1);
         }
@@ -228,6 +233,10 @@ InitializeLatchSupport(void)
     selfpipe_readfd = pipefd[0];
     selfpipe_writefd = pipefd[1];
     selfpipe_owner_pid = MyProcPid;
+
+    /* Tell fd.c avout these two long-lived FDs */
+    (void) ReserveExternalFD();
+    (void) ReserveExternalFD();
 #else
     /* currently, nothing to do here for Windows */
 #endif
@@ -604,24 +613,59 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     set->exit_on_postmaster_death = false;

 #if defined(WAIT_USE_EPOLL)
+    if (!ReserveExternalFD())
+    {
+        /* treat this as though epoll_create1 itself returned EMFILE */
+        ReleaseExternalFD();
+        elog(ERROR, "epoll_create1 failed: %m");
+    }
 #ifdef EPOLL_CLOEXEC
     set->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
     if (set->epoll_fd < 0)
+    {
+        ReleaseExternalFD();
         elog(ERROR, "epoll_create1 failed: %m");
+    }
 #else
     /* cope with ancient glibc lacking epoll_create1 (e.g., RHEL5) */
     set->epoll_fd = epoll_create(nevents);
     if (set->epoll_fd < 0)
+    {
+        ReleaseExternalFD();
         elog(ERROR, "epoll_create failed: %m");
+    }
     if (fcntl(set->epoll_fd, F_SETFD, FD_CLOEXEC) == -1)
+    {
+        int            save_errno = errno;
+
+        close(set->epoll_fd);
+        ReleaseExternalFD();
+        errno = save_errno;
         elog(ERROR, "fcntl(F_SETFD) failed on epoll descriptor: %m");
+    }
 #endif                            /* EPOLL_CLOEXEC */
 #elif defined(WAIT_USE_KQUEUE)
+    if (!ReserveExternalFD())
+    {
+        /* treat this as though kqueue itself returned EMFILE */
+        ReleaseExternalFD();
+        elog(ERROR, "kqueue failed: %m");
+    }
     set->kqueue_fd = kqueue();
     if (set->kqueue_fd < 0)
+    {
+        ReleaseExternalFD();
         elog(ERROR, "kqueue failed: %m");
+    }
     if (fcntl(set->kqueue_fd, F_SETFD, FD_CLOEXEC) == -1)
+    {
+        int            save_errno = errno;
+
+        close(set->kqueue_fd);
+        ReleaseExternalFD();
+        errno = save_errno;
         elog(ERROR, "fcntl(F_SETFD) failed on kqueue descriptor: %m");
+    }
     set->report_postmaster_not_running = false;
 #elif defined(WAIT_USE_WIN32)

@@ -655,8 +699,10 @@ FreeWaitEventSet(WaitEventSet *set)
 {
 #if defined(WAIT_USE_EPOLL)
     close(set->epoll_fd);
+    ReleaseExternalFD();
 #elif defined(WAIT_USE_KQUEUE)
     close(set->kqueue_fd);
+    ReleaseExternalFD();
 #elif defined(WAIT_USE_WIN32)
     WaitEvent  *cur_event;

diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 51e2ece..2abd6cd 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -35,6 +35,10 @@
  * Likewise, use AllocateDir/FreeDir, not opendir/closedir, to allocate
  * open directories (DIR*), and OpenTransientFile/CloseTransientFile for an
  * unbuffered file descriptor.
+ *
+ * If you really can't use any of the above, at least call ReserveExternalFD/
+ * ReleaseExternalFD to report any file descriptors that are held for any
+ * length of time.  Failure to do so risks unnecessary EMFILE errors.
  */
 #ifndef FD_H
 #define FD_H
@@ -120,7 +124,11 @@ extern int    CloseTransientFile(int fd);
 extern int    BasicOpenFile(const char *fileName, int fileFlags);
 extern int    BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);

- /* Make a directory with default permissions */
+/* Use these for other cases, and also for long-lived BasicOpenFile FDs */
+extern bool ReserveExternalFD(void);
+extern void ReleaseExternalFD(void);
+
+/* Make a directory with default permissions */
 extern int    MakePGDirectory(const char *directoryName);

 /* Miscellaneous support routines */

Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
I wrote:
> Here's a draft patch that does it like that.

On reflection, trying to make ReserveExternalFD serve two different
use-cases was pretty messy.  Here's a version that splits it into two
functions.  I also took the trouble to fix dblink.

            regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index c1155e3..a5f69fc 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -200,12 +200,27 @@ dblink_get_conn(char *conname_or_str,
         if (connstr == NULL)
             connstr = conname_or_str;
         dblink_connstr_check(connstr);
+
+        /*
+         * We must obey fd.c's limit on non-virtual file descriptors.  Assume
+         * that a PGconn represents one long-lived FD.  (Doing this here also
+         * ensures that VFDs are closed if needed to make room.)
+         */
+        if (!AcquireExternalFD())
+            ereport(ERROR,
+                    (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+                     errmsg("could not establish connection"),
+                     errdetail("There are too many open files.")));
+
+        /* OK to make connection */
         conn = PQconnectdb(connstr);
+
         if (PQstatus(conn) == CONNECTION_BAD)
         {
             char       *msg = pchomp(PQerrorMessage(conn));

             PQfinish(conn);
+            ReleaseExternalFD();
             ereport(ERROR,
                     (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
                      errmsg("could not establish connection"),
@@ -282,12 +297,26 @@ dblink_connect(PG_FUNCTION_ARGS)

     /* check password in connection string if not superuser */
     dblink_connstr_check(connstr);
+
+    /*
+     * We must obey fd.c's limit on non-virtual file descriptors.  Assume that
+     * a PGconn represents one long-lived FD.  (Doing this here also ensures
+     * that VFDs are closed if needed to make room.)
+     */
+    if (!AcquireExternalFD())
+        ereport(ERROR,
+                (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+                 errmsg("could not establish connection"),
+                 errdetail("There are too many open files.")));
+
+    /* OK to make connection */
     conn = PQconnectdb(connstr);

     if (PQstatus(conn) == CONNECTION_BAD)
     {
         msg = pchomp(PQerrorMessage(conn));
         PQfinish(conn);
+        ReleaseExternalFD();
         if (rconn)
             pfree(rconn);

@@ -312,7 +341,10 @@ dblink_connect(PG_FUNCTION_ARGS)
     else
     {
         if (pconn->conn)
+        {
             PQfinish(pconn->conn);
+            ReleaseExternalFD();
+        }
         pconn->conn = conn;
     }

@@ -346,6 +378,7 @@ dblink_disconnect(PG_FUNCTION_ARGS)
         dblink_conn_not_avail(conname);

     PQfinish(conn);
+    ReleaseExternalFD();
     if (rconn)
     {
         deleteConnection(conname);
@@ -780,7 +813,10 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
     {
         /* if needed, close the connection to the database */
         if (freeconn)
+        {
             PQfinish(conn);
+            ReleaseExternalFD();
+        }
     }
     PG_END_TRY();

@@ -1458,7 +1494,10 @@ dblink_exec(PG_FUNCTION_ARGS)
     {
         /* if needed, close the connection to the database */
         if (freeconn)
+        {
             PQfinish(conn);
+            ReleaseExternalFD();
+        }
     }
     PG_END_TRY();

@@ -2563,6 +2602,7 @@ createNewConnection(const char *name, remoteConn *rconn)
     if (found)
     {
         PQfinish(rconn->conn);
+        ReleaseExternalFD();
         pfree(rconn);

         ereport(ERROR,
@@ -2604,6 +2644,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn)
         if (!PQconnectionUsedPassword(conn))
         {
             PQfinish(conn);
+            ReleaseExternalFD();
             if (rconn)
                 pfree(rconn);

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 29c811a..74576d7 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postgres_fdw.h"
+#include "storage/fd.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -259,10 +260,27 @@ connect_pg_server(ForeignServer *server, UserMapping *user)

         keywords[n] = values[n] = NULL;

-        /* verify connection parameters and make connection */
+        /* verify the set of connection parameters */
         check_conn_params(keywords, values, user);

+        /*
+         * We must obey fd.c's limit on non-virtual file descriptors.  Assume
+         * that a PGconn represents one long-lived FD.  (Doing this here also
+         * ensures that VFDs are closed if needed to make room.)
+         */
+        if (!AcquireExternalFD())
+            ereport(ERROR,
+                    (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+                     errmsg("could not connect to server \"%s\"",
+                            server->servername),
+                     errdetail("There are too many open files.")));
+
+        /* OK to make connection */
         conn = PQconnectdbParams(keywords, values, false);
+
+        if (!conn)
+            ReleaseExternalFD();    /* because the PG_CATCH block won't */
+
         if (!conn || PQstatus(conn) != CONNECTION_OK)
             ereport(ERROR,
                     (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
@@ -294,7 +312,10 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
     {
         /* Release PGconn data structure if we managed to create one */
         if (conn)
+        {
             PQfinish(conn);
+            ReleaseExternalFD();
+        }
         PG_RE_THROW();
     }
     PG_END_TRY();
@@ -312,6 +333,7 @@ disconnect_pg_server(ConnCacheEntry *entry)
     {
         PQfinish(entry->conn);
         entry->conn = NULL;
+        ReleaseExternalFD();
     }
 }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd527f2..d19408b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -774,6 +774,7 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  * openLogFile is -1 or a kernel FD for an open log file segment.
  * openLogSegNo identifies the segment.  These variables are only used to
  * write the XLOG, and so will normally refer to the active segment.
+ * Note: call Reserve/ReleaseExternalFD to track consumption of this FD.
  */
 static int    openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
@@ -785,6 +786,9 @@ static XLogSegNo openLogSegNo = 0;
  * will be just past that page. readLen indicates how much of the current
  * page has been read into readBuf, and readSource indicates where we got
  * the currently open file from.
+ * Note: we could use Reserve/ReleaseExternalFD to track consumption of
+ * this FD too; but it doesn't currently seem worthwhile, since the XLOG is
+ * not read by general-purpose sessions.
  */
 static int    readFile = -1;
 static XLogSegNo readSegNo = 0;
@@ -2447,6 +2451,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
             /* create/use new log file */
             use_existent = true;
             openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+            ReserveExternalFD();
         }

         /* Make sure we have the current logfile open */
@@ -2455,6 +2460,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
             XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
                             wal_segment_size);
             openLogFile = XLogFileOpen(openLogSegNo);
+            ReserveExternalFD();
         }

         /* Add current page to the set of pending pages-to-dump */
@@ -2605,6 +2611,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                 XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
                                 wal_segment_size);
                 openLogFile = XLogFileOpen(openLogSegNo);
+                ReserveExternalFD();
             }

             issue_xlog_fsync(openLogFile, openLogSegNo);
@@ -3811,6 +3818,7 @@ XLogFileClose(void)
     }

     openLogFile = -1;
+    ReleaseExternalFD();
 }

 /*
@@ -5224,6 +5232,11 @@ BootStrapXLOG(void)
     use_existent = false;
     openLogFile = XLogFileInit(1, &use_existent, false);

+    /*
+     * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
+     * close the file again in a moment.
+     */
+
     /* Write the first page with the initial record */
     errno = 0;
     pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 59dc4f3..462b4d7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -602,6 +602,9 @@ retry2:

     pg_freeaddrinfo_all(hints.ai_family, addrs);

+    /* Now that we have a long-lived socket, tell fd.c about it. */
+    ReserveExternalFD();
+
     return;

 startup_failed:
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b3986be..cd61665 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2554,9 +2554,14 @@ ClosePostmasterPorts(bool am_syslogger)
                 (errcode_for_file_access(),
                  errmsg_internal("could not close postmaster death monitoring pipe in child process: %m")));
     postmaster_alive_fds[POSTMASTER_FD_OWN] = -1;
+    /* Notify fd.c that we released one pipe FD. */
+    ReleaseExternalFD();
 #endif

-    /* Close the listen sockets */
+    /*
+     * Close the postmaster's listen sockets.  These aren't tracked by fd.c,
+     * so we don't call ReleaseExternalFD() here.
+     */
     for (i = 0; i < MAXLISTEN; i++)
     {
         if (ListenSocket[i] != PGINVALID_SOCKET)
@@ -2566,7 +2571,10 @@ ClosePostmasterPorts(bool am_syslogger)
         }
     }

-    /* If using syslogger, close the read side of the pipe */
+    /*
+     * If using syslogger, close the read side of the pipe.  We don't bother
+     * tracking this in fd.c, either.
+     */
     if (!am_syslogger)
     {
 #ifndef WIN32
@@ -4279,6 +4287,9 @@ BackendInitialize(Port *port)
     /* Save port etc. for ps status */
     MyProcPort = port;

+    /* Tell fd.c about the long-lived FD associated with the port */
+    ReserveExternalFD();
+
     /*
      * PreAuthDelay is a debugging aid for investigating problems in the
      * authentication cycle: it can be set in postgresql.conf to allow time to
@@ -6442,6 +6453,20 @@ restore_backend_variables(BackendParameters *param, Port *port)
     strlcpy(pkglib_path, param->pkglib_path, MAXPGPATH);

     strlcpy(ExtraOptions, param->ExtraOptions, MAXPGPATH);
+
+    /*
+     * We need to restore fd.c's counts of externally-opened FDs; to avoid
+     * confusion, be sure to do this after restoring max_safe_fds.  (Note:
+     * BackendInitialize will handle this for port->sock.)
+     */
+#ifndef WIN32
+    if (postmaster_alive_fds[0] >= 0)
+        ReserveExternalFD();
+    if (postmaster_alive_fds[1] >= 0)
+        ReserveExternalFD();
+#endif
+    if (pgStatSock != PGINVALID_SOCKET)
+        ReserveExternalFD();
 }


@@ -6584,6 +6609,10 @@ InitPostmasterDeathWatchHandle(void)
                 (errcode_for_file_access(),
                  errmsg_internal("could not create pipe to monitor postmaster death: %m")));

+    /* Notify fd.c that we've eaten two FDs for the pipe. */
+    ReserveExternalFD();
+    ReserveExternalFD();
+
     /*
      * Set O_NONBLOCK to allow testing for the fd's presence with a read()
      * call.
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index b2b69a7..cf7b535 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -562,6 +562,11 @@ SysLogger_Start(void)
      * This means the postmaster must continue to hold the read end of the
      * pipe open, so we can pass it down to the reincarnated syslogger. This
      * is a bit klugy but we have little choice.
+     *
+     * Also note that we don't bother counting the pipe FDs by calling
+     * Reserve/ReleaseExternalFD.  There's no real need to account for them
+     * accurately in the postmaster or syslogger process, and both ends of the
+     * pipe will wind up closed in all other postmaster children.
      */
 #ifndef WIN32
     if (syslogPipe[0] < 0)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b5f4df6..34f7443 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -61,6 +61,12 @@
  * BasicOpenFile, it is solely the caller's responsibility to close the file
  * descriptor by calling close(2).
  *
+ * If a non-virtual file descriptor needs to be held open for any length of
+ * time, report it to fd.c by calling AcquireExternalFD or ReserveExternalFD
+ * (and eventually ReleaseExternalFD), so that we can take it into account
+ * while deciding how many VFDs can be open.  This applies to FDs obtained
+ * with BasicOpenFile as well as those obtained without use of any fd.c API.
+ *
  *-------------------------------------------------------------------------
  */

@@ -103,8 +109,8 @@
 /*
  * We must leave some file descriptors free for system(), the dynamic loader,
  * and other code that tries to open files without consulting fd.c.  This
- * is the number left free.  (While we can be pretty sure we won't get
- * EMFILE, there's never any guarantee that we won't get ENFILE due to
+ * is the number left free.  (While we try fairly hard to prevent EMFILE
+ * errors, there's never any guarantee that we won't get ENFILE due to
  * other processes chewing up FDs.  So it's a bad idea to try to open files
  * without consulting fd.c.  Nonetheless we cannot control all code.)
  *
@@ -119,9 +125,12 @@

 /*
  * If we have fewer than this many usable FDs after allowing for the reserved
- * ones, choke.
+ * ones, choke.  (This value is chosen to work with "ulimit -n 64", but not
+ * much less than that.  Note that this value ensures numExternalFDs can be
+ * at least 16; as of this writing, the contrib/postgres_fdw regression tests
+ * will not pass unless that can grow to at least 14.)
  */
-#define FD_MINFREE                10
+#define FD_MINFREE                48

 /*
  * A number of platforms allow individual processes to open many more files
@@ -132,8 +141,8 @@
 int            max_files_per_process = 1000;

 /*
- * Maximum number of file descriptors to open for either VFD entries or
- * AllocateFile/AllocateDir/OpenTransientFile operations.  This is initialized
+ * Maximum number of file descriptors to open for operations that fd.c knows
+ * about (VFDs, AllocateFile etc, or "external" FDs).  This is initialized
  * to a conservative value, and remains that way indefinitely in bootstrap or
  * standalone-backend cases.  In normal postmaster operation, the postmaster
  * calls set_max_safe_fds() late in initialization to update the value, and
@@ -142,7 +151,7 @@ int            max_files_per_process = 1000;
  * Note: the value of max_files_per_process is taken into account while
  * setting this variable, and so need not be tested separately.
  */
-int            max_safe_fds = 32;    /* default if not changed */
+int            max_safe_fds = FD_MINFREE;    /* default if not changed */

 /* Whether it is safe to continue running after fsync() fails. */
 bool        data_sync_retry = false;
@@ -244,6 +253,11 @@ static int    maxAllocatedDescs = 0;
 static AllocateDesc *allocatedDescs = NULL;

 /*
+ * Number of open "external" FDs reported to Reserve/ReleaseExternalFD.
+ */
+static int    numExternalFDs = 0;
+
+/*
  * Number of temporary files opened during the current session;
  * this is used in generation of tempfile names.
  */
@@ -1025,6 +1039,80 @@ tryAgain:
     return -1;                    /* failure */
 }

+/*
+ * AcquireExternalFD - attempt to reserve an external file descriptor
+ *
+ * This should be used by callers that need to hold a file descriptor open
+ * over more than a short interval, but cannot use any of the other facilities
+ * provided by this module.
+ *
+ * The difference between this and the underlying ReserveExternalFD function
+ * is that this will report failure (by setting errno and returning false)
+ * if "too many" external FDs are already reserved.  This should be used in
+ * any code where the total number of FDs to be reserved is not predictable
+ * and small.
+ */
+bool
+AcquireExternalFD(void)
+{
+    /*
+     * We don't want more than max_safe_fds / 3 FDs to be consumed for
+     * "external" FDs.
+     */
+    if (numExternalFDs < max_safe_fds / 3)
+    {
+        ReserveExternalFD();
+        return true;
+    }
+    errno = EMFILE;
+    return false;
+}
+
+/*
+ * ReserveExternalFD - report external consumption of a file descriptor
+ *
+ * This should be used by callers that need to hold a file descriptor open
+ * over more than a short interval, but cannot use any of the other facilities
+ * provided by this module.  This just tracks the use of the FD and closes
+ * VFDs if needed to ensure we keep NUM_RESERVED_FDS FDs available.
+ *
+ * Call this directly only in code where failure to reserve the FD would be
+ * fatal; for example, the WAL-writing code does so, since the alternative is
+ * session failure.  Also, it's very unwise to do so in code that could
+ * consume more than one FD per process.
+ *
+ * Note: as long as everybody plays nice so that NUM_RESERVED_FDS FDs remain
+ * available, it doesn't matter too much whether this is called before or
+ * after actually opening the FD; but doing so beforehand reduces the risk of
+ * an EMFILE failure if not everybody played nice.  In any case, it's solely
+ * caller's responsibility to keep the external-FD count in sync with reality.
+ */
+void
+ReserveExternalFD(void)
+{
+    /*
+     * Release VFDs if needed to stay safe.  Because we do this before
+     * incrementing numExternalFDs, the final state will be as desired, i.e.,
+     * nfile + numAllocatedDescs + numExternalFDs <= max_safe_fds.
+     */
+    ReleaseLruFiles();
+
+    numExternalFDs++;
+}
+
+/*
+ * ReleaseExternalFD - report release of an external file descriptor
+ *
+ * This is guaranteed not to change errno, so it can be used in failure paths.
+ */
+void
+ReleaseExternalFD(void)
+{
+    Assert(numExternalFDs > 0);
+    numExternalFDs--;
+}
+
+
 #if defined(FDDEBUG)

 static void
@@ -1185,7 +1273,7 @@ ReleaseLruFile(void)
 static void
 ReleaseLruFiles(void)
 {
-    while (nfile + numAllocatedDescs >= max_safe_fds)
+    while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
     {
         if (!ReleaseLruFile())
             break;
@@ -2176,13 +2264,13 @@ reserveAllocatedDesc(void)

     /*
      * If the array hasn't yet been created in the current process, initialize
-     * it with FD_MINFREE / 2 elements.  In many scenarios this is as many as
+     * it with FD_MINFREE / 3 elements.  In many scenarios this is as many as
      * we will ever need, anyway.  We don't want to look at max_safe_fds
      * immediately because set_max_safe_fds() may not have run yet.
      */
     if (allocatedDescs == NULL)
     {
-        newMax = FD_MINFREE / 2;
+        newMax = FD_MINFREE / 3;
         newDescs = (AllocateDesc *) malloc(newMax * sizeof(AllocateDesc));
         /* Out of memory already?  Treat as fatal error. */
         if (newDescs == NULL)
@@ -2200,10 +2288,12 @@ reserveAllocatedDesc(void)
      *
      * We mustn't let allocated descriptors hog all the available FDs, and in
      * practice we'd better leave a reasonable number of FDs for VFD use.  So
-     * set the maximum to max_safe_fds / 2.  (This should certainly be at
-     * least as large as the initial size, FD_MINFREE / 2.)
+     * set the maximum to max_safe_fds / 3.  (This should certainly be at
+     * least as large as the initial size, FD_MINFREE / 3, so we aren't
+     * tightening the restriction here.)  Recall that "external" FDs are
+     * allowed to consume another third of max_safe_fds.
      */
-    newMax = max_safe_fds / 2;
+    newMax = max_safe_fds / 3;
     if (newMax > maxAllocatedDescs)
     {
         newDescs = (AllocateDesc *) realloc(allocatedDescs,
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 138bdec..1972aec 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -247,14 +247,17 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
     /*
      * Create new segment or open an existing one for attach.
      *
-     * Even though we're not going through fd.c, we should be safe against
-     * running out of file descriptors, because of NUM_RESERVED_FDS.  We're
-     * only opening one extra descriptor here, and we'll close it before
-     * returning.
+     * Even though we will close the FD before returning, it seems desirable
+     * to use Reserve/ReleaseExternalFD, to reduce the probability of EMFILE
+     * failure.  The fact that we won't hold the FD open long justifies using
+     * ReserveExternalFD rather than AcquireExternalFD, though.
      */
+    ReserveExternalFD();
+
     flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
     if ((fd = shm_open(name, flags, PG_FILE_MODE_OWNER)) == -1)
     {
+        ReleaseExternalFD();
         if (errno != EEXIST)
             ereport(elevel,
                     (errcode_for_dynamic_shared_memory(),
@@ -278,6 +281,7 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
             /* Back out what's already been done. */
             save_errno = errno;
             close(fd);
+            ReleaseExternalFD();
             errno = save_errno;

             ereport(elevel,
@@ -295,6 +299,7 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
         /* Back out what's already been done. */
         save_errno = errno;
         close(fd);
+        ReleaseExternalFD();
         shm_unlink(name);
         errno = save_errno;

@@ -323,6 +328,7 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
         /* Back out what's already been done. */
         save_errno = errno;
         close(fd);
+        ReleaseExternalFD();
         if (op == DSM_OP_CREATE)
             shm_unlink(name);
         errno = save_errno;
@@ -336,6 +342,7 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
     *mapped_address = address;
     *mapped_size = request_size;
     close(fd);
+    ReleaseExternalFD();

     return true;
 }
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index cbd4952..046ca5c 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
@@ -187,6 +188,9 @@ InitializeLatchSupport(void)
             /* Clean up, just for safety's sake; we'll set these below */
             selfpipe_readfd = selfpipe_writefd = -1;
             selfpipe_owner_pid = 0;
+            /* Keep fd.c's accounting straight */
+            ReleaseExternalFD();
+            ReleaseExternalFD();
         }
         else
         {
@@ -194,6 +198,7 @@ InitializeLatchSupport(void)
              * Postmaster didn't create a self-pipe ... or else we're in an
              * EXEC_BACKEND build, in which case it doesn't matter since the
              * postmaster's pipe FDs were closed by the action of FD_CLOEXEC.
+             * fd.c won't have state to clean up, either.
              */
             Assert(selfpipe_readfd == -1);
         }
@@ -228,6 +233,10 @@ InitializeLatchSupport(void)
     selfpipe_readfd = pipefd[0];
     selfpipe_writefd = pipefd[1];
     selfpipe_owner_pid = MyProcPid;
+
+    /* Tell fd.c about these two long-lived FDs */
+    ReserveExternalFD();
+    ReserveExternalFD();
 #else
     /* currently, nothing to do here for Windows */
 #endif
@@ -604,24 +613,57 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     set->exit_on_postmaster_death = false;

 #if defined(WAIT_USE_EPOLL)
+    if (!AcquireExternalFD())
+    {
+        /* treat this as though epoll_create1 itself returned EMFILE */
+        elog(ERROR, "epoll_create1 failed: %m");
+    }
 #ifdef EPOLL_CLOEXEC
     set->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
     if (set->epoll_fd < 0)
+    {
+        ReleaseExternalFD();
         elog(ERROR, "epoll_create1 failed: %m");
+    }
 #else
     /* cope with ancient glibc lacking epoll_create1 (e.g., RHEL5) */
     set->epoll_fd = epoll_create(nevents);
     if (set->epoll_fd < 0)
+    {
+        ReleaseExternalFD();
         elog(ERROR, "epoll_create failed: %m");
+    }
     if (fcntl(set->epoll_fd, F_SETFD, FD_CLOEXEC) == -1)
+    {
+        int            save_errno = errno;
+
+        close(set->epoll_fd);
+        ReleaseExternalFD();
+        errno = save_errno;
         elog(ERROR, "fcntl(F_SETFD) failed on epoll descriptor: %m");
+    }
 #endif                            /* EPOLL_CLOEXEC */
 #elif defined(WAIT_USE_KQUEUE)
+    if (!AcquireExternalFD())
+    {
+        /* treat this as though kqueue itself returned EMFILE */
+        elog(ERROR, "kqueue failed: %m");
+    }
     set->kqueue_fd = kqueue();
     if (set->kqueue_fd < 0)
+    {
+        ReleaseExternalFD();
         elog(ERROR, "kqueue failed: %m");
+    }
     if (fcntl(set->kqueue_fd, F_SETFD, FD_CLOEXEC) == -1)
+    {
+        int            save_errno = errno;
+
+        close(set->kqueue_fd);
+        ReleaseExternalFD();
+        errno = save_errno;
         elog(ERROR, "fcntl(F_SETFD) failed on kqueue descriptor: %m");
+    }
     set->report_postmaster_not_running = false;
 #elif defined(WAIT_USE_WIN32)

@@ -655,8 +697,10 @@ FreeWaitEventSet(WaitEventSet *set)
 {
 #if defined(WAIT_USE_EPOLL)
     close(set->epoll_fd);
+    ReleaseExternalFD();
 #elif defined(WAIT_USE_KQUEUE)
     close(set->kqueue_fd);
+    ReleaseExternalFD();
 #elif defined(WAIT_USE_WIN32)
     WaitEvent  *cur_event;

diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 51e2ece..2085c62 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -35,6 +35,10 @@
  * Likewise, use AllocateDir/FreeDir, not opendir/closedir, to allocate
  * open directories (DIR*), and OpenTransientFile/CloseTransientFile for an
  * unbuffered file descriptor.
+ *
+ * If you really can't use any of the above, at least call AcquireExternalFD
+ * or ReserveExternalFD to report any file descriptors that are held for any
+ * length of time.  Failure to do so risks unnecessary EMFILE errors.
  */
 #ifndef FD_H
 #define FD_H
@@ -120,7 +124,12 @@ extern int    CloseTransientFile(int fd);
 extern int    BasicOpenFile(const char *fileName, int fileFlags);
 extern int    BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);

- /* Make a directory with default permissions */
+/* Use these for other cases, and also for long-lived BasicOpenFile FDs */
+extern bool AcquireExternalFD(void);
+extern void ReserveExternalFD(void);
+extern void ReleaseExternalFD(void);
+
+/* Make a directory with default permissions */
 extern int    MakePGDirectory(const char *directoryName);

 /* Miscellaneous support routines */

Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Thomas Munro
Date:
On Mon, Feb 24, 2020 at 12:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> On reflection, trying to make ReserveExternalFD serve two different
> use-cases was pretty messy.  Here's a version that splits it into two
> functions.  I also took the trouble to fix dblink.

+    /*
+     * We don't want more than max_safe_fds / 3 FDs to be consumed for
+     * "external" FDs.
+     */
+    if (numExternalFDs < max_safe_fds / 3)

This looks pretty reasonable to me.

I'll have a new patch set to create a common WES at startup over on
that other thread in a day or two.



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Thomas Munro
Date:
On Mon, Feb 24, 2020 at 7:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Feb 24, 2020 at 12:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > On reflection, trying to make ReserveExternalFD serve two different
> > use-cases was pretty messy.  Here's a version that splits it into two
> > functions.  I also took the trouble to fix dblink.
>
> +    /*
> +     * We don't want more than max_safe_fds / 3 FDs to be consumed for
> +     * "external" FDs.
> +     */
> +    if (numExternalFDs < max_safe_fds / 3)

I suppose there may be users who have set ulimit -n high enough to
support an FDW workload that connects to very many hosts, who will now
need to set max_files_per_process higher to avoid the new error now
that we're doing this accounting.  That doesn't seem to be a problem
in itself, but I wonder if the error message should make it clearer
that it's our limit they hit here.



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> I suppose there may be users who have set ulimit -n high enough to
> support an FDW workload that connects to very many hosts, who will now
> need to set max_files_per_process higher to avoid the new error now
> that we're doing this accounting.  That doesn't seem to be a problem
> in itself, but I wonder if the error message should make it clearer
> that it's our limit they hit here.

I struggled with the wording of that message, actually.  The patch
proposes

+            ereport(ERROR,
+                    (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+                     errmsg("could not connect to server \"%s\"",
+                            server->servername),
+                     errdetail("There are too many open files.")));

I wanted to say "The server has too many open files." but in context
it would seem to be talking about the remote server, so I'm not sure
how to fix that.

We could also consider a HINT, along the lines of "Raise the server's
max_files_per_process and/or \"ulimit -n\" limits."  This again has
the ambiguity about which server, and it also seems dangerously
platform-specific.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Mark Dilger
Date:

> On Feb 20, 2020, at 8:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> This idea doesn't fix every possible problem.  For instance, if you
> have a plperlu function that wants to open a bunch of files, I don't
> see any easy way to ensure we release VFDs to make that possible.
> But it's sure an improvement on the status quo.

I understand that you were using plperlu just as an example, but it got me thinking.  Could we ship a wrapper using
perl'stie() mechanism to call a new spi function to report when a file handle is opened and when it is closed?  Most
plperlufunctions would not participate, since developers will not necessarily know to use the wrapper, but at least
theycould learn about the wrapper and use it as a work-around if this becomes a problem for them.  Perhaps the same spi
functioncould be used by other procedural languages. 

I can't see this solution working unless the backend can cleanup properly under exceptional conditions, and decrement
thecounter of used file handles appropriately.  But that's the same requirement that postgres_fdw would also have,
right? Would the same mechanism work for both? 

I imagine something like <PgPerluSafe>::IO::File and <PgPerluSafe>::File::Temp which could be thin wrappers around
IO::Fileand File::Temp that automatically do the tie()ing for you. (Replace <PgPerluSafe> with whichever name seems
best.)

Is this too convoluted to be worth the bother?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Andres Freund
Date:
Hi,

On 2020-02-24 10:29:51 -0800, Mark Dilger wrote:
> > On Feb 20, 2020, at 8:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > 
> > This idea doesn't fix every possible problem.  For instance, if you
> > have a plperlu function that wants to open a bunch of files, I don't
> > see any easy way to ensure we release VFDs to make that possible.
> > But it's sure an improvement on the status quo.
> 
> I understand that you were using plperlu just as an example, but it
> got me thinking.  Could we ship a wrapper using perl's tie() mechanism
> to call a new spi function to report when a file handle is opened and
> when it is closed?  Most plperlu functions would not participate,
> since developers will not necessarily know to use the wrapper, but at
> least they could learn about the wrapper and use it as a work-around
> if this becomes a problem for them.  Perhaps the same spi function
> could be used by other procedural languages.

While we're thinking a bit outside of the box: We could just dup() a
bunch of fds within fd.c to protect fd.c's fd "quota". And then close
them when actually needing the fds.

Not really suggesting that we go for that, but it does have some appeal.



> I can't see this solution working unless the backend can cleanup properly under exceptional conditions, and decrement
thecounter of used file handles appropriately.  But that's the same requirement that postgres_fdw would also have,
right? Would the same mechanism work for both?
 

We can just throw an error, and all fdw state should get cleaned up. We
can't generally rely on that for plperl, as it IIRC can global state. So
I don't think they're in the same boat.

Greetings,

Andres Freund



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Feb 20, 2020, at 8:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This idea doesn't fix every possible problem.  For instance, if you
>> have a plperlu function that wants to open a bunch of files, I don't
>> see any easy way to ensure we release VFDs to make that possible.
>> But it's sure an improvement on the status quo.

> I understand that you were using plperlu just as an example, but it got
> me thinking.  Could we ship a wrapper using perl's tie() mechanism to
> call a new spi function to report when a file handle is opened and when
> it is closed?  Most plperlu functions would not participate, since
> developers will not necessarily know to use the wrapper, but at least
> they could learn about the wrapper and use it as a work-around if this
> becomes a problem for them.  Perhaps the same spi function could be used
> by other procedural languages.

Hmm.  I had thought briefly about getting plperl to do that automatically
and had concluded that I didn't see a way (though there might be one;
I'm not much of a Perl expert).  But if we accept that changes in the
plperl function's source code might be needed, it gets a lot easier,
for sure.

Anyway, the point of the current patch is to provide the mechanism and
use it in a couple of places where we know there's an issue.  Improving
the PLs is something that could be added later.

> I can't see this solution working unless the backend can cleanup
> properly under exceptional conditions, and decrement the counter of used
> file handles appropriately.  But that's the same requirement that
> postgres_fdw would also have, right?  Would the same mechanism work for
> both?

The hard part is to tie into whatever is responsible for closing the
kernel FD.  If you can ensure that the FD gets closed, you can put
the ReleaseExternalFD() call at the same place(s).

> Is this too convoluted to be worth the bother?

So far we've not seen field reports of PL functions running out of FDs;
and there's always the ad-hoc solution of making sure the server's
ulimit -n limit is sufficiently larger than max_files_per_process.
So I wouldn't put a lot of effort into it right now.  But it's nice
to have an idea about what to do if it does become a hot issue for
somebody.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Alvaro Herrera
Date:
On 2020-Feb-24, Tom Lane wrote:

> We could also consider a HINT, along the lines of "Raise the server's
> max_files_per_process and/or \"ulimit -n\" limits."  This again has
> the ambiguity about which server, and it also seems dangerously
> platform-specific.

Maybe talk about "the local server" to distinguish from the other one.

As for the platform dependencies, I see two main options: make the hint
platform-specific (i.e have #ifdef branches per platform) or make it
even more generic, such as "file descriptor limits".

A quick search suggests that current Windows itself doesn't typically
have such problems:
https://stackoverflow.com/questions/31108693/increasing-no-of-file-handles-in-windows-7-64-bit
https://docs.microsoft.com/es-es/archive/blogs/markrussinovich/pushing-the-limits-of-windows-handles

But the C runtime does:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/file-handling?view=vs-2019
I suppose we do use the C runtime.  There's a reference to
_setmaxstdio() being able to raise the limit from the default of 512 to
up to 8192 open files.  We don't currently invoke that.
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=vs-2019

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Feb-24, Tom Lane wrote:
>> We could also consider a HINT, along the lines of "Raise the server's
>> max_files_per_process and/or \"ulimit -n\" limits."  This again has
>> the ambiguity about which server, and it also seems dangerously
>> platform-specific.

> Maybe talk about "the local server" to distinguish from the other one.

OK by me.

> As for the platform dependencies, I see two main options: make the hint
> platform-specific (i.e have #ifdef branches per platform) or make it
> even more generic, such as "file descriptor limits".

I thought about platform-specific messages, but it's not clear to me
whether our translation infrastructure will find messages that are
inside #ifdefs ... anyone know?  If that does work, I'd be inclined
to mention ulimit -n on non-Windows and just say nothing about that
on Windows.  "File descriptor limits" seems too unhelpful here.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Andres Freund
Date:
Hi,

On 2020-02-24 16:14:53 -0300, Alvaro Herrera wrote:
> But the C runtime does:
> https://docs.microsoft.com/en-us/cpp/c-runtime-library/file-handling?view=vs-2019
> I suppose we do use the C runtime.  There's a reference to
> _setmaxstdio() being able to raise the limit from the default of 512 to
> up to 8192 open files.  We don't currently invoke that.
> https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=vs-2019

If we ever go for that, we should also consider raising the limit on
unix systems up to the hard limit when hitting the fd ceiling. I.e. get
the current limit with getrlimit(RLIMIT_NOFILE) and raise rlim_cur
[closer] to rlim_max with setrlimit.

Perhaps it'd even be worthwhile to just always raise the limit, if
possible, in set_max_safe_fds(), by max_safe_fds +
NUM_RESERVED_FDS. That way PLs, other shared libs, would have a more
usualy amount of FDs available. Rather than a fairly small number, but
only when the backend has been running for a while in the right
workload.

Greetings,

Andres Freund



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2020-02-24 16:14:53 -0300, Alvaro Herrera wrote:
>> I suppose we do use the C runtime.  There's a reference to
>> _setmaxstdio() being able to raise the limit from the default of 512 to
>> up to 8192 open files.  We don't currently invoke that.
>> https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=vs-2019

> If we ever go for that, we should also consider raising the limit on
> unix systems up to the hard limit when hitting the fd ceiling. I.e. get
> the current limit with getrlimit(RLIMIT_NOFILE) and raise rlim_cur
> [closer] to rlim_max with setrlimit.

I'm disinclined to think we should override the user's wishes in this way.
Especially given PG's proven ability to run the kernel completely out of
file descriptors ...

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
I wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> As for the platform dependencies, I see two main options: make the hint
>> platform-specific (i.e have #ifdef branches per platform) or make it
>> even more generic, such as "file descriptor limits".

> I thought about platform-specific messages, but it's not clear to me
> whether our translation infrastructure will find messages that are
> inside #ifdefs ... anyone know?

Oh, but of course it does.  So let's do

        errdetail("There are too many open files on the local server."),
#ifndef WIN32
        errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")
#else
        errhint("Raise the server's max_files_per_process setting.")
#endif

I don't think there's much point in telling Windows users about
_setmaxstdio() here.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Alvaro Herrera
Date:
On 2020-Feb-24, Tom Lane wrote:

> I wrote:

> > I thought about platform-specific messages, but it's not clear to me
> > whether our translation infrastructure will find messages that are
> > inside #ifdefs ... anyone know?
> 
> Oh, but of course it does.  So let's do
> 
>         errdetail("There are too many open files on the local server."),
> #ifndef WIN32
>         errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" limits.")
> #else
>         errhint("Raise the server's max_files_per_process setting.")
> #endif

WFM.

> I don't think there's much point in telling Windows users about
> _setmaxstdio() here.

Yeah, telling users to _setmaxstdio() themselves is useless, because
they can't do it; that's something *we* should do.  I think the 512
limit is a bit low; why not increase that a little bit?  Maybe just to
the Linux default of 1024.

Then again, that would be akin to setrlimit() on Linux.  Maybe we can
consider that a separate GUC, in a separate patch, with a
platform-specific default value that just corresponds to the OS's
default, and the user can set to whatever suits them; then we call
either _setmaxstdio() or setrlimit().

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Feb-24, Tom Lane wrote:
>> I don't think there's much point in telling Windows users about
>> _setmaxstdio() here.

> Yeah, telling users to _setmaxstdio() themselves is useless, because
> they can't do it; that's something *we* should do.  I think the 512
> limit is a bit low; why not increase that a little bit?  Maybe just to
> the Linux default of 1024.

> Then again, that would be akin to setrlimit() on Linux.  Maybe we can
> consider that a separate GUC, in a separate patch, with a
> platform-specific default value that just corresponds to the OS's
> default, and the user can set to whatever suits them; then we call
> either _setmaxstdio() or setrlimit().

Why not just drive it off max_files_per_process?  On Unix, that
largely exists to override the ulimit setting anyway.  With no
comparable knob on a Windows system, we might as well just say
that's what you set.

            regards, tom lane



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Alvaro Herrera
Date:
On 2020-Feb-24, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:

> > Then again, that would be akin to setrlimit() on Linux.  Maybe we can
> > consider that a separate GUC, in a separate patch, with a
> > platform-specific default value that just corresponds to the OS's
> > default, and the user can set to whatever suits them; then we call
> > either _setmaxstdio() or setrlimit().
> 
> Why not just drive it off max_files_per_process?  On Unix, that
> largely exists to override the ulimit setting anyway.  With no
> comparable knob on a Windows system, we might as well just say
> that's what you set.

That makes sense to me -- but if we do that, then maybe we should be
doing the setrlimit() dance on it too, on Linux^W^W where supported.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Feb-24, Tom Lane wrote:
>> Why not just drive it off max_files_per_process?  On Unix, that
>> largely exists to override the ulimit setting anyway.  With no
>> comparable knob on a Windows system, we might as well just say
>> that's what you set.

> That makes sense to me -- but if we do that, then maybe we should be
> doing the setrlimit() dance on it too, on Linux^W^W where supported.

Yeah, arguably we could try to setrlimit if max_files_per_process is
larger than the ulimit.  We should definitely not reduce the ulimit
if max_files_per_process is smaller, though, since the DBA might
intentionally be leaving daylight for purposes such as FD-hungry PL
functions.  On the whole I'm inclined to leave well enough alone on
the Unix side --- there's nothing there that the DBA can't set if
she wishes.

            regards, tom lane