Thread: File descriptors in exec'd subprocesses

File descriptors in exec'd subprocesses

From
Thomas Munro
Date:
While investigating code paths that use system() and popen() and
trying to write latch-multiplexing replacements, which I'll write
about separately, leaks of $SUBJECT became obvious.  On a FreeBSD box,
you can see the sockets, pipes and data and WAL files that the
subprocess inherits:

  create table t (x text);
  copy t from program 'procstat -f $$';
  select * from t;

(On Linux, you might try something like 'ls -slap /proc/self/fd'.  Not
sure how to do it on a Mac but note that 'lsof' is no good for this
purpose because the first thing it does is close all descriptors > 2;
maybe 'lsof -p $PPID' inside a shell wrapper so you're analysing the
shell process that sits between postgres and lsof, rather than lsof
itself?)

Since we've started assuming a few other bits of SUSv3 (POSIX 2008)
functionality, the standard that specified O_CLOEXEC, and since in
practice Linux, *BSD, macOS, AIX, Solaris, illumos all have it, I
think we can unconditionally just use it on all files we open.  That
is, if we were to make fallback code, it would be untested, and if we
were to do it with fcntl() always it would be a frequent extra system
call that we don't need to support a computer that doesn't exist.  For
sockets and pipes, much more rarely created, some systems have
non-standard extensions along the same lines, but I guess we should
stick with standards and call fcntl(FD_CLOEXEC) for now.

There is a place in fd.c that already referenced O_CLOEXEC (it wasn't
really using it, just making an assertion that flags don't collide),
with #ifdef around it, but that was only conditional because at the
time of commit 04cad8f7 we had a macOS 10.4 system (released 2005) in
the 'farm which obviously didn't know about POSIX 2008 interfaces.  We
can just remove that #ifdef.  (It's probably OK to remove the test of
O_DSYNC too but I'll think about that another time.)

On Windows, handles, at least as we create them, are not inherited so
the problem doesn't come up AFAICS.  I *think* if we were to use
Windows' own open(), that would be an issue, but we have our own
CreateFile() thing and it doesn't turn on inheritance IIUC.  So I just
gave O_CLOEXEC a zero definition there.  It would be interesting to
know what handles a subprocess sees.  If someone who knows how to
drive Windows could run a subprogram that just does the equivalent of
'sleep 60' they might be able to see that in one of those handle spy
tools, to visually check the above.  (If I'm wrong about that, it
might be possible for a subprocess to interfere with a
ProcSignalBarrier command to close all files, so I'd love to know
about it.)

We were already doing FD_CLOEXEC on the latch self-pipe with comment
"we surely do not want any child processes messing with them", so it's
not like this wasn't a well-known issue before, but I guess it just
never bothered anyone enough to do anything about the more general
problem.

With the attached, the test at the top of this email shows only in,
out, error, and one thing that procstat opened itself.

Are there any more descriptors we need to think about?

Attachment

Re: File descriptors in exec'd subprocesses

From
Thomas Munro
Date:
On Sun, Feb 5, 2023 at 1:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> SUSv3 (POSIX 2008)

Oh, oops, 2008 actually corresponds to SUSv4.  Hmm.



Re: File descriptors in exec'd subprocesses

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sun, Feb 5, 2023 at 1:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> SUSv3 (POSIX 2008)

> Oh, oops, 2008 actually corresponds to SUSv4.  Hmm.

Worst case, if we come across some allegedly-supported platform without
O_CLOEXEC, we #define that to zero.  Said platform is no worse off
than it was before.

            regards, tom lane



Re: File descriptors in exec'd subprocesses

From
Andres Freund
Date:
Hi,

Unsurprisingly I'm in favor of this.

On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:
>Are there any more descriptors we need to think about?

Postmaster's listen sockets? Saves a bunch of syscalls, at least. Logging collector pipe write end, in backends?

Greetings,

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: File descriptors in exec'd subprocesses

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:
>> Are there any more descriptors we need to think about?

> Postmaster's listen sockets?

I wonder whether O_CLOEXEC on that would be inherited by the
client-communication sockets, though.  That's fine ... unless you
are doing EXEC_BACKEND.

            regards, tom lane



Re: File descriptors in exec'd subprocesses

From
Andres Freund
Date:
Hi,

On 2023-02-05 11:06:13 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:
> >> Are there any more descriptors we need to think about?
> 
> > Postmaster's listen sockets?
> 
> I wonder whether O_CLOEXEC on that would be inherited by the
> client-communication sockets, though.

I'd be very suprised if it were.

<hack>

Nope, at least not on linux. Verified by looking at /proc/*/fdinfo/n
after adding SOCK_CLOEXEC to just the socket() call. 'flags' changes
from 02 -> 02000002 for the listen socket, but stays at 04002 for the
client socket. If I add SOCK_CLOEXEC to accept() (well, accept4()), it
does change from 04002 to 02004002.

Greetings,

Andres Freund



Re: File descriptors in exec'd subprocesses

From
Thomas Munro
Date:
On Mon, Feb 6, 2023 at 3:29 AM Andres Freund <andres@anarazel.de> wrote:
> On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:
> >Are there any more descriptors we need to think about?
>
> Postmaster's listen sockets? Saves a bunch of syscalls, at least.

Assuming you mean accepted sockets, yeah, I see how two save two
syscalls there, and since you nerd-sniped me into looking into the
SOCK_CLOEXEC landscape, I like it even more now that I've understood
that accept4() is rubber-stamped for the next revision of POSIX[1] and
is already accepted almost everywhere.  It's not just window dressing,
you need it to write multi-threaded programs that fork/exec without
worrying about the window between fd creation and fcntl(FD_CLOEXEC) in
another thread; hopefully one day we will care about that sort of
thing in some places too!  Here's a separate patch for that.

I *guess* we need HAVE_DECL_ACCEPT4 for the guarded availability
system (cf pwritev) when Apple gets the memo, but see below.  Hard to
say if AIX is still receiving memos (cf recent speculation in the
Register).  All other target OSes seem to have had this stuff for a
while.

Since client connections already do fcntl(FD_CLOEXEC), postgres_fdw
connections didn't have this problem.  It seems reasonable to want to
skip a couple of system calls there too; also, client programs might
also be interested in future-POSIX's atomic race-free close-on-exec
socket fabrication.  So here also is a patch to use SOCK_CLOEXEC on
that end too, if available.

But ... hmph, all we can do here is test for the existence of
SOCK_NONBLOCK and SOCK_CLOEXEC, since there is no new function to test
for.  Maybe we should just assume accept4() also exists if these exist
(it's hard to imagine that Apple or IBM would address atomicity on one
end but not the other of a socket), but predictions are so difficult,
especially about the future!  Anyone want to guess if it's better to
leave the meson/configure probe in for the accept4 end or just roll
with the macros?

> Logging collector pipe write end, in backends?

The write end of the logging pipe is already closed, after dup2'ing it
to STDOUT_FILENO to STDERR_FILENO, so archive commands and suchlike do
receive the handle, but they want them.  It's the intended and
documented behaviour that anything written to that will finish up in
the log.

As for pipe2(O_CLOEXEC), I see the point of it in a multi-threaded
application.  It's not terribly useful for us though, because we
usually want to close only one end, except in the case of the
self-pipe.  But the self-pipe is no longer used on the systems that
have pipe2()-from-the-future.

I haven't tested this under EXEC_BACKEND yet.

[1] https://www.austingroupbugs.net/view.php?id=411

Attachment

Re: File descriptors in exec'd subprocesses

From
Thomas Munro
Date:
I had missed one: the "watch" end of the postmaster pipe also needs FD_CLOEXEC.

Attachment

Re: File descriptors in exec'd subprocesses

From
Thomas Munro
Date:
Something bothered me about the previous versions:  Which layer should
add O_CLOEXEC, given that we have md.c -> PathNameOpenXXX() ->
BasicOpenFile() -> open()?  Previously I had md.c adding it, but on
reflection, it makes no sense to open a "File" (virtual file
descriptor) that is *not* O_CLOEXEC.  The client doesn't really know
when the raw descriptor is currently open and shouldn't really access
it; it would be strange to want it to survive a call to exec*().  I
think the 'highest' level API that we could consider requiring
O_CLOEXEC to be passed in explicitly, or not, is BasicOpenFile().
Done like that in this version.  This is the version I'm thinking of
committing, unless someone wants to argue for another level.

Another good choice would be to do it inside BasicOpenFile(), and then
the patch would be smaller again (xlog.c wouldn't need to mention it,
and there would perhaps be less risk that some long-lived descriptor
somewhere else has failed to request it), but perhaps that would be
presumptuous.  That function returns raw descriptors, and the caller,
perhaps an extension, might legitimately want to make an inheritable
descriptor for some reason, I guess?  Does anyone think I should move
it in there instead?

I realised that if we're going to use accept4() to cut down on
syscalls, we could also do the same for the postmaster pipe with
pipe2().

Here also is a tiny archeological cleanup to avoid creating
contradictory claims about whether all computers have O_CLOEXEC.

I toyed with the idea of a tiny Linux-only regression test using "COPY
fds FROM PROGRAM 'ls /proc/self/fd'" expecting 0, 1, 2, 3 (3 being
ls's opendir()), but that's probably a little too cute; and also
showed me that pg_regress.c leaks its log file, the fix for which is
to add "e" to its fdopen(), but that's another POSIX-next feature[1]
that seems a little harder to detect, and I gave up on that.

[1] https://wiki.freebsd.org/AtomicCloseOnExec

Attachment

Re: File descriptors in exec'd subprocesses

From
"Gregory Stark (as CFM)"
Date:
On Mon, 20 Feb 2023 at 23:04, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Done like that in this version.  This is the version I'm thinking of
> committing, unless someone wants to argue for another level.

FWIW the cfbot doesn't understand this patch series. I'm not sure why
but it's only trying to apply the first (the MacOS one) and it's
failing to apply even that.


-- 
Gregory Stark
As Commitfest Manager



Re: File descriptors in exec'd subprocesses

From
Thomas Munro
Date:
On Thu, Mar 2, 2023 at 9:49 AM Gregory Stark (as CFM)
<stark.cfm@gmail.com> wrote:
> On Mon, 20 Feb 2023 at 23:04, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > Done like that in this version.  This is the version I'm thinking of
> > committing, unless someone wants to argue for another level.
>
> FWIW the cfbot doesn't understand this patch series. I'm not sure why
> but it's only trying to apply the first (the MacOS one) and it's
> failing to apply even that.

Ah, it's because I committed one patch in the series.  I'll commit one
more, and then repost the rest, shortly.



Re: File descriptors in exec'd subprocesses

From
Thomas Munro
Date:
On Thu, Mar 2, 2023 at 9:57 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Mar 2, 2023 at 9:49 AM Gregory Stark (as CFM)
> <stark.cfm@gmail.com> wrote:
> > On Mon, 20 Feb 2023 at 23:04, Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Done like that in this version.  This is the version I'm thinking of
> > > committing, unless someone wants to argue for another level.
> >
> > FWIW the cfbot doesn't understand this patch series. I'm not sure why
> > but it's only trying to apply the first (the MacOS one) and it's
> > failing to apply even that.
>
> Ah, it's because I committed one patch in the series.  I'll commit one
> more, and then repost the rest, shortly.

I pushed the main patch, "Don't leak descriptors into subprograms.".
Here's a rebase of the POSIX-next stuff, but I'll sit on these for a
bit longer to see if the build farm agrees with my claim about the
ubiquity of O_CLOEXEC, and if anyone has comments on this stuff.

Attachment

Re: File descriptors in exec'd subprocesses

From
Thomas Munro
Date:
I pushed the libpq changes.  I'll leave the pipe2 and accept4 changes
on ice for now, maybe for a later cycle (unlike the committed patches,
they don't currently fix a known problem, they just avoid some
syscalls that are already fairly rare).  For the libpq change, the
build farm seems happy so far.  I was a little worried that there
could be ways that #ifdef SOCK_CLOEXEC could be true for a build that
might encounter a too-old kernel and break, but it looks like you'd
have to go so far back into EOL'd releases that even our zombie build
farm animals have it.  Only macOS and AIX don't have it yet, and this
should be fine with Apple's availability guards, which leaves just
AIX.  (AFAIK headers and kernels are always in sync at the major
version level on AIX, but if not presumably there'd have to be some
similar guard system?)