Thread: PG in container w/ pid namespace is init, process exits cause restart
Hi, A colleague debugged an issue where their postgres was occasionally crash-restarting under load. The cause turned out to be that a relatively complex archive_command was used, which could in some rare circumstances have a bash subshell pipeline not succeed. It wasn't at all obvious why that'd cause a crash though - the archive command handles the error. The issue turns out to be that postgres was in a container, with pid namespaces enabled. Because postgres was run directly in the container, without a parent process inside, it thus becomes pid 1. Which mostly works without a problem. Until, as the case here with the archive command, a sub-sub process exits while it still has a child. Then that child gets re-parented to postmaster (as init). Such a child is likely to have exited not just with 0 or 1, but something else. As the pid won't match anything in reaper(), we'll go to CleanupBackend(). Where any exit status but 0/1 will unconditionally trigger a restart: if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus)) { HandleChildCrash(pid, exitstatus, _("server process")); return; } This kind of thing is pretty hard to debug, because it's not easy to even figure out what the "crashing" pid belonged to. I wonder if we should work a bit harder to try to identify whether an exiting process was a "server process" before identifying it as such? And perhaps we ought to warn about postgres running as "init" unless we make that robust? Greetings, Andres Freund
On 2021-May-03, Andres Freund wrote: > The issue turns out to be that postgres was in a container, with pid > namespaces enabled. Because postgres was run directly in the container, > without a parent process inside, it thus becomes pid 1. Which mostly > works without a problem. Until, as the case here with the archive > command, a sub-sub process exits while it still has a child. Then that > child gets re-parented to postmaster (as init). Hah .. interesting. I think we should definitely make this work, since containerized stuff is going to become more and more prevalent. I also heard a story where things ran into trouble (I didn't get the whole story of *what* was the problem with that) because the datadir is /. I know -- nobody in their right mind would put the datadir in / -- but apparently in the container world that's not something as stupid as it sounds. That's of course not related to what you describe here code-wise, but the underlying reason is the same. > I wonder if we should work a bit harder to try to identify whether an > exiting process was a "server process" before identifying it as such? Well, we've never made any effort there because it just wasn't possible. Nobody ever had postmaster also be init .. until containers. Let's fix it. > And perhaps we ought to warn about postgres running as "init" unless we > make that robust? I guess we can do that in older releases, but do we really need it? As I understand, the only thing we need to do is verify that the dying PID is a backend PID, and not cause a crash cycle if it isn't. -- Álvaro Herrera Valdivia, Chile
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On 2021-May-03, Andres Freund wrote: >> The issue turns out to be that postgres was in a container, with pid >> namespaces enabled. Because postgres was run directly in the container, >> without a parent process inside, it thus becomes pid 1. Which mostly >> works without a problem. Until, as the case here with the archive >> command, a sub-sub process exits while it still has a child. Then that >> child gets re-parented to postmaster (as init). > Hah .. interesting. I think we should definitely make this work, since > containerized stuff is going to become more and more prevalent. How would we make it "work"? The postmaster can't possibly be expected to know the right thing to do with unexpected children. > I guess we can do that in older releases, but do we really need it? As > I understand, the only thing we need to do is verify that the dying PID > is a backend PID, and not cause a crash cycle if it isn't. I think that'd be a net reduction in reliability, not an improvement. In most scenarios it'd do little except mask bugs. And who's to say that ignoring unexpected child deaths is okay, anyway? We could hardly be sure that the dead process hadn't been connected to shared memory. Maybe we should put in a startup-time check, analogous to the can't-run-as-root test, that the postmaster mustn't be PID 1. regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > I also heard a story where things ran into trouble (I didn't get the > whole story of *what* was the problem with that) because the datadir is /. BTW, as far as that goes, I think the general recommendation is that the datadir shouldn't be a mount point, because bad things happen if you mount or unmount the drive while the postmaster is up. I could see enforcing that, if we could find a reasonably platform-independent way to do it. (Of course, / can't be unmounted, so I wonder exactly what bad thing happened in that story.) regards, tom lane
On 2021-May-03, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > I also heard a story where things ran into trouble (I didn't get the > > whole story of *what* was the problem with that) because the datadir is /. > > BTW, as far as that goes, I think the general recommendation is that > the datadir shouldn't be a mount point, because bad things happen if > you mount or unmount the drive while the postmaster is up. I could > see enforcing that, if we could find a reasonably platform-independent > way to do it. / is not a mount point; it's just that the container system binds (?) some different directory as / for the process to run into. I suppose it must be similar to chrooting to /, but I'm not sure if it's exactly that. > (Of course, / can't be unmounted, so I wonder exactly what bad thing > happened in that story.) It's not related to unmounting. I'll try to get the details. -- Álvaro Herrera Valdivia, Chile
Hi, On 2021-05-03 15:37:24 -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > On 2021-May-03, Andres Freund wrote: > >> The issue turns out to be that postgres was in a container, with pid > >> namespaces enabled. Because postgres was run directly in the container, > >> without a parent process inside, it thus becomes pid 1. Which mostly > >> works without a problem. Until, as the case here with the archive > >> command, a sub-sub process exits while it still has a child. Then that > >> child gets re-parented to postmaster (as init). > > > Hah .. interesting. I think we should definitely make this work, since > > containerized stuff is going to become more and more prevalent. > > How would we make it "work"? The postmaster can't possibly be expected > to know the right thing to do with unexpected children. Not saying that we should, but we could check if we're pid 1 / init, and just warn about children we don't know anything about. Which we could detect by iterating over BackendList/BackgroundWorkerList before crash-restarting in CleanupBackend(). Then we'd not loose reliability in the "normal" case, while not reducing reliability in the container case. I'm not quite sure I buy the reliability argument, tbh: The additional process exits we see as pid 1 are after all process exits that we'd not see if we weren't pid 1. And if we're not pid 1 then there really should never be any "unexpected children" - we know what processes postmaster itself forked after all. So where would unexpected children come from, except reparenting? > And who's to say that ignoring unexpected child deaths is okay, > anyway? We could hardly be sure that the dead process hadn't been > connected to shared memory. I don't think checking the exit status of unexpected children to see whether we should crash-restart out of that concern is meaningful: We don't know that the child didn't do anything bad with shared memory when they exited with exit(1), instead of exit(2). Random thought: I wonder if we ought to set madvise(MADV_DONTFORK) on shared memory in postmaster children, where available. Then we could be fairly certain that there aren't processes we don't know about that are attached to shared memory (unless there's some nasty shared_preload_library forking early during backend startup - but that's hard to get excited about). > > I guess we can do that in older releases, but do we really need it? As > > I understand, the only thing we need to do is verify that the dying PID > > is a backend PID, and not cause a crash cycle if it isn't. > > I think that'd be a net reduction in reliability, not an improvement. > In most scenarios it'd do little except mask bugs. Do you feel the same about having different logging between the "known" and "unknown" child processes? Personally I don't think it's of utmost importance to support running as pid 1. But we should at least print useful log messages about what processes exited... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2021-05-03 15:37:24 -0400, Tom Lane wrote: >> And who's to say that ignoring unexpected child deaths is okay, >> anyway? We could hardly be sure that the dead process hadn't been >> connected to shared memory. > I don't think checking the exit status of unexpected children to see > whether we should crash-restart out of that concern is meaningful: We > don't know that the child didn't do anything bad with shared memory when > they exited with exit(1), instead of exit(2). Hmm, by that argument, any unexpected child PID in reaper() ought to be grounds for a restart, regardless of its exit code. Which'd be fine by me. I'm on board with being more restrictive about this, not less so. > Do you feel the same about having different logging between the "known" > and "unknown" child processes? No objection to logging such cases more clearly, for sure. regards, tom lane
Hi, On 2021-05-03 15:25:53 -0400, Alvaro Herrera wrote: > I also heard a story where things ran into trouble (I didn't get the > whole story of *what* was the problem with that) because the datadir is /. > I know -- nobody in their right mind would put the datadir in / -- but > apparently in the container world that's not something as stupid as it > sounds. That's of course not related to what you describe here > code-wise, but the underlying reason is the same. It still seems pretty insane in the container world too. Postgres needs shared libraries (even if you managed to link postgres itself statically, something we do not support). Postgres needs to write to the data directory. Putting shared libraries inside the data directory seems like a bad idea. Using / for a single statically linked binary that e.g. just serves a bunch of hardcoded files is one thing. Putting actual data in / for something like postgres another. > > I wonder if we should work a bit harder to try to identify whether an > > exiting process was a "server process" before identifying it as such? > > Well, we've never made any effort there because it just wasn't possible. > Nobody ever had postmaster also be init .. until containers. Let's fix > it. > > And perhaps we ought to warn about postgres running as "init" unless we > > make that robust? > > I guess we can do that in older releases, but do we really need it? As > I understand, the only thing we need to do is verify that the dying PID > is a backend PID, and not cause a crash cycle if it isn't. I think there's a few more special cases when running as init, other than reparenting. E.g. I think the default signal handlers are different, the kernel kills the process in fewer cases etc. I am not opposed to adding support for it, but I think it'd need a bit of care. Given that we probably shouldn't just break things in a minor release by refusing to run as 1, a warning seems to be the easiest thing for now? Greetings, Andres Freund
On 2021-May-03, Andres Freund wrote: > Using / for a single statically linked binary that e.g. just serves a > bunch of hardcoded files is one thing. Putting actual data in / for > something like postgres another. Yeah, I just had a word with them and I had misunderstood what they were doing. They were attempting something completely insane and pointless, so I'm going to leave it at that. > I think there's a few more special cases when running as init, other > than reparenting. E.g. I think the default signal handlers are > different, the kernel kills the process in fewer cases etc. I am not > opposed to adding support for it, but I think it'd need a bit of care. Ok, we can leave that as future development then. > Given that we probably shouldn't just break things in a minor release by > refusing to run as 1, a warning seems to be the easiest thing for now? WFM. -- Álvaro Herrera Valdivia, Chile "Once again, thank you and all of the developers for your hard work on PostgreSQL. This is by far the most pleasant management experience of any database I've worked on." (Dan Harris) http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php
On 5/3/21 3:07 PM, Andres Freund wrote: > Hi, > > A colleague debugged an issue where their postgres was occasionally > crash-restarting under load. > > The cause turned out to be that a relatively complex archive_command was > used, which could in some rare circumstances have a bash subshell > pipeline not succeed. It wasn't at all obvious why that'd cause a crash > though - the archive command handles the error. > > The issue turns out to be that postgres was in a container, with pid > namespaces enabled. Because postgres was run directly in the container, > without a parent process inside, it thus becomes pid 1. Which mostly > works without a problem. Until, as the case here with the archive > command, a sub-sub process exits while it still has a child. Then that > child gets re-parented to postmaster (as init). > > Such a child is likely to have exited not just with 0 or 1, but > something else. As the pid won't match anything in reaper(), we'll go to > CleanupBackend(). Where any exit status but 0/1 will unconditionally > trigger a restart: > > if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus)) > { > HandleChildCrash(pid, exitstatus, _("server process")); > return; > } > > > This kind of thing is pretty hard to debug, because it's not easy to > even figure out what the "crashing" pid belonged to. > > I wonder if we should work a bit harder to try to identify whether an > exiting process was a "server process" before identifying it as such? > > And perhaps we ought to warn about postgres running as "init" unless we > make that robust? > Hmm, my initial reaction was if we detect very early on we're PID 1 then fork and do all our work in the child, and in the parent just wait until there are no more children. Not sure if that's feasible but I thought I'd throw it out there. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi, On 2021-05-03 16:20:43 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2021-05-03 15:37:24 -0400, Tom Lane wrote: > >> And who's to say that ignoring unexpected child deaths is okay, > >> anyway? We could hardly be sure that the dead process hadn't been > >> connected to shared memory. > > > I don't think checking the exit status of unexpected children to see > > whether we should crash-restart out of that concern is meaningful: We > > don't know that the child didn't do anything bad with shared memory when > > they exited with exit(1), instead of exit(2). > > Hmm, by that argument, any unexpected child PID in reaper() ought to be > grounds for a restart, regardless of its exit code. Which'd be fine by > me. I'm on board with being more restrictive about this, not less so. Are there any holes / races that could lead to this "legitimately" happening? To me the signal blocking looks like it should prevent that? I'm a bit worried that we'd find some harmless corner cases under adding a new instability. So personally I'd be inclined to just make it a warning, but ... Greetings, Andres Freund
Re: PG in container w/ pid namespace is init, process exits cause restart
From
ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: >> On 2021-May-03, Andres Freund wrote: >>> The issue turns out to be that postgres was in a container, with pid >>> namespaces enabled. Because postgres was run directly in the container, >>> without a parent process inside, it thus becomes pid 1. Which mostly >>> works without a problem. Until, as the case here with the archive >>> command, a sub-sub process exits while it still has a child. Then that >>> child gets re-parented to postmaster (as init). > >> Hah .. interesting. I think we should definitely make this work, since >> containerized stuff is going to become more and more prevalent. > > How would we make it "work"? The postmaster can't possibly be expected > to know the right thing to do with unexpected children. > >> I guess we can do that in older releases, but do we really need it? As >> I understand, the only thing we need to do is verify that the dying PID >> is a backend PID, and not cause a crash cycle if it isn't. > Maybe we should put in a startup-time check, analogous to the > can't-run-as-root test, that the postmaster mustn't be PID 1. Given that a number of minimal `init`s already exist specifically for the case of running a single application in a container, I don't think Postgres should to reinvent that wheel. A quick eyball of the output of `apt search container init` on a Debian Bullseyse system reveals at least four: - https://github.com/Yelp/dumb-init - https://github.com/krallin/tini - https://github.com/fpco/pid1 - https://github.com/openSUSE/catatonit The first one also explains why there's more to being PID 1 than just handling reparented children. - ilmari -- "The surreality of the universe tends towards a maximum" -- Skud's Law "Never formulate a law or axiom that you're not prepared to live with the consequences of." -- Skud's Meta-Law
Andres Freund <andres@anarazel.de> writes: > On 2021-05-03 16:20:43 -0400, Tom Lane wrote: >> Hmm, by that argument, any unexpected child PID in reaper() ought to be >> grounds for a restart, regardless of its exit code. Which'd be fine by >> me. I'm on board with being more restrictive about this, not less so. > Are there any holes / races that could lead to this "legitimately" > happening? To me the signal blocking looks like it should prevent that? If it did happen it would imply a bug in the postmaster's child-process bookkeeping. (Or, I guess, some preloaded module deciding that launching its own children was OK, whether or not it could find out whether they succeeded.) > I'm a bit worried that we'd find some harmless corner cases under adding > a new instability. So personally I'd be inclined to just make it a > warning, but ... Well, I wouldn't recommend adding such a check in released branches, but I'd be in favor of changing it in HEAD (or waiting till v15 opens). Meanwhile, it seems like we both thought of complaining if the postmaster's PID is 1. I'm not quite sure if there are any portability hazards from that, but on the whole it sounds like a good way to detect badly-configured containers. regards, tom lane
On Mon, 3 May 2021 at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > I also heard a story where things ran into trouble (I didn't get the > > whole story of *what* was the problem with that) because the datadir is /. > > BTW, as far as that goes, I think the general recommendation is that > the datadir shouldn't be a mount point, because bad things happen if > you mount or unmount the drive while the postmaster is up. I could > see enforcing that, if we could find a reasonably platform-independent > way to do it. I don't think the problem is unmounting -- on BSD you have to try really hard to unmount filesystems that have files open on them and afaik you can't do it on Linux at all (which I still claim is the original sin that led to the fsync issues). The problem was mounting filesystems if it happened late -- ie. After Postgres had started up. It was exacerbated by some startup scripts that would automatically run initdb if there was nothing present. Offhand I don't actually see anything special about the Postgres directory root being the mountpoint though. There's nothing stopping someone from mounting on top of some parent directory other than it being slightly harder to imagine someone creating the whole directory tree up from the postgres root rather than just running initdb. Fwiw, I have a suspicion that the right check for being init is whether `pid == ppid`. -- greg
Greg Stark <stark@mit.edu> writes: > On Mon, 3 May 2021 at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, as far as that goes, I think the general recommendation is that >> the datadir shouldn't be a mount point, because bad things happen if >> you mount or unmount the drive while the postmaster is up. I could >> see enforcing that, if we could find a reasonably platform-independent >> way to do it. > I don't think the problem is unmounting -- on BSD you have to try > really hard to unmount filesystems that have files open on them and > afaik you can't do it on Linux at all (which I still claim is the > original sin that led to the fsync issues). > The problem was mounting filesystems if it happened late -- ie. After > Postgres had started up. It was exacerbated by some startup scripts > that would automatically run initdb if there was nothing present. Yeah, at least that was the case that somebody (Joe Conway if memory serves) reported years ago. > Offhand I don't actually see anything special about the Postgres > directory root being the mountpoint though. I think one good reason not to do it is that a mount point directory ought to be root-owned. I don't recall the specific reasoning behind that practice, but it seems sound. Also, if the filesystem is one that likes having a lost+found directory, you have some finagling to do to keep initdb from complaining about that. > Fwiw, I have a suspicion that the right check for being init is > whether `pid == ppid`. Makes sense, and seems nicer than hard-coding an assumption that PID 1 is special. regards, tom lane
On Mon, May 3, 2021 at 3:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I guess we can do that in older releases, but do we really need it? As > > I understand, the only thing we need to do is verify that the dying PID > > is a backend PID, and not cause a crash cycle if it isn't. > > I think that'd be a net reduction in reliability, not an improvement. > In most scenarios it'd do little except mask bugs. And who's to say > that ignoring unexpected child deaths is okay, anyway? We could hardly > be sure that the dead process hadn't been connected to shared memory. This argument doesn't make any sense to me. In almost all cases, postgres is not init, and if a backend forks a child which stomps on shared memory and exits, the postmaster will not know that there is a problem and will not restart. In practice this is not a problem, because the core code is careful not to touch shared memory in children that it forks, and extensions written by reasonably smart people aren't going to do that either, because it's not very hard to figure out that it can't possibly work. So, in the rare case where postgres IS init, and it finds out that a descendent process which is not a direct child has exited, it should do the same thing that we do in all the other cases where a descendent process that is not a direct child has exited, viz. nothing. And if that's the wrong idea - I don't think it is - then we should fix it in all cases, not just the one where postgres is init. I don't have a view on whether it is reasonable or prudent to teach postgres to work as init, because I don't really know what's involved. But I think you're taking a position that is basically blind panic. If something happens that we normally wouldn't even know about, and because of an unusual circumstance we do know about it, we should not leap to the conclusion that it is something bad. All that does is make the system behavior less consistent, and thus harder for users. -- Robert Haas EDB: http://www.enterprisedb.com
On 5/4/21 1:43 PM, Tom Lane wrote: > Greg Stark <stark@mit.edu> writes: >> On Mon, 3 May 2021 at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> BTW, as far as that goes, I think the general recommendation is that >>> the datadir shouldn't be a mount point, because bad things happen if >>> you mount or unmount the drive while the postmaster is up. I could >>> see enforcing that, if we could find a reasonably platform-independent >>> way to do it. > >> I don't think the problem is unmounting -- on BSD you have to try >> really hard to unmount filesystems that have files open on them and >> afaik you can't do it on Linux at all (which I still claim is the >> original sin that led to the fsync issues). >> The problem was mounting filesystems if it happened late -- ie. After >> Postgres had started up. It was exacerbated by some startup scripts >> that would automatically run initdb if there was nothing present. > > Yeah, at least that was the case that somebody (Joe Conway if memory > serves) reported years ago. Guilty as charged ;-) Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, May 3, 2021 at 3:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think that'd be a net reduction in reliability, not an improvement. >> In most scenarios it'd do little except mask bugs. And who's to say >> that ignoring unexpected child deaths is okay, anyway? We could hardly >> be sure that the dead process hadn't been connected to shared memory. > This argument doesn't make any sense to me. In almost all cases, > postgres is not init, and if a backend forks a child which stomps on > shared memory and exits, the postmaster will not know that there is a > problem and will not restart. In practice this is not a problem, > because the core code is careful not to touch shared memory in > children that it forks, and extensions written by reasonably smart > people aren't going to do that either, because it's not very hard to > figure out that it can't possibly work. So, in the rare case where > postgres IS init, and it finds out that a descendent process which is > not a direct child has exited, it should do the same thing that we do > in all the other cases where a descendent process that is not a direct > child has exited, viz. nothing. And if that's the wrong idea - I don't > think it is - then we should fix it in all cases, not just the one > where postgres is init. You are arguing from assumptions not in evidence, specifically that if we reap a PID that isn't one we recognize, this must be what happened. I think it's *at least* as likely that the case implies some bug in the postmaster's child-process bookkeeping, in which case doing nothing is not a good answer. (The fact that that's what we do today doesn't make it right.) I don't wish to lobotomize our ability to detect such problems in order to support incompetently-configured containers. Independently of that, as was pointed out upthread, being init requires more than just ignoring unrecognized results from waitpid. We shouldn't take on that responsibility when there are perfectly good solutions out there already. regards, tom lane
On Tue, May 4, 2021 at 2:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > You are arguing from assumptions not in evidence, specifically that > if we reap a PID that isn't one we recognize, this must be what > happened. I think it's *at least* as likely that the case implies > some bug in the postmaster's child-process bookkeeping, ... It's hard to rule that out completely, but it doesn't seem incredibly likely to me. I would think that if we had such bugs they would result in system instability that is also not in evidence. > Independently of that, as was pointed out upthread, being init requires > more than just ignoring unrecognized results from waitpid. We shouldn't > take on that responsibility when there are perfectly good solutions out > there already. That's a separate point that should be judged on its own merits. I don't have an educated opinion on how hard it would be, or how valuable it would be. -- Robert Haas EDB: http://www.enterprisedb.com
On 5/3/21 5:13 PM, Dagfinn Ilmari Mannsåker wrote: > Tom Lane <tgl@sss.pgh.pa.us> writes: > > >> Maybe we should put in a startup-time check, analogous to the >> can't-run-as-root test, that the postmaster mustn't be PID 1. > Given that a number of minimal `init`s already exist specifically for > the case of running a single application in a container, I don't think > Postgres should to reinvent that wheel. A quick eyball of the output of > `apt search container init` on a Debian Bullseyse system reveals at > least four: > > - https://github.com/Yelp/dumb-init > - https://github.com/krallin/tini > - https://github.com/fpco/pid1 > - https://github.com/openSUSE/catatonit > > The first one also explains why there's more to being PID 1 than just > handling reparented children. > I looked at the first of these, and it seems perfectly sensible. So I agree all we really need to do is refuse to run as PID 1. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 5/3/21 5:13 PM, Dagfinn Ilmari Mannsåker wrote: >> Given that a number of minimal `init`s already exist specifically for >> the case of running a single application in a container, I don't think >> Postgres should to reinvent that wheel. A quick eyball of the output of >> `apt search container init` on a Debian Bullseyse system reveals at >> least four: >> >> - https://github.com/Yelp/dumb-init >> - https://github.com/krallin/tini >> - https://github.com/fpco/pid1 >> - https://github.com/openSUSE/catatonit >> >> The first one also explains why there's more to being PID 1 than just >> handling reparented children. > I looked at the first of these, and it seems perfectly sensible. So I > agree all we really need to do is refuse to run as PID 1. [ for the archives' sake ] I looked at the documentation for dumb-init, and it claims there are basically two things weird about init: 1. The kernel applies different signal handling rules to it. 2. It has to reap children it didn't spawn. Whether that list is exhaustive, I dunno ... it has an odor of Linux-specificity to me. Anyway, #2 is clearly no problem for the postmaster, since it's doing that anyway; quibbles about whether it *should* do that without complaining aside. We could imagine trying to handle #1, but that seems like the sort of dank system-specific corner that we'd regret having got into. If the behavior for init isn't consistent with our needs, or changes across platforms or kernel versions, things could get very messy indeed. I'm still thinking that we're best off refusing to do that and making people install one of these shims that's meant for the job. regards, tom lane
On Tue, May 04, 2021 at 01:35:50PM -0400, Greg Stark wrote: > Fwiw, I have a suspicion that the right check for being init is > whether `pid == ppid`. pryzbyj@pryzbyj:~$ ps -wwf 1 UID PID PPID C STIME TTY STAT TIME CMD root 1 0 0 2020 ? Ss 10:28 /sbin/init As I recall, on some OS, pid 0 is the "swapper". -- Justin
On Tue, May 4, 2021 at 4:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm still thinking that we're best off refusing to do > that and making people install one of these shims that's meant > for the job. I have to admit that I care less about the specific issue here than about the general issue of being open to hearing what the user needs actually are. I honestly have no idea whether it's sensible to want to run postgres as init. If people who know about container stuff say that's a dumb idea and you shouldn't do it, then IMHO your conclusion that we should simply disallow it is 100% correct. But if those people show up and say, no, it's actually super-convenient for postgres to run as init and using one of those shim things has significant downsides that are hard to mitigate, and if further we could do what they say they need with just a little bit of extra code, then IMHO your conclusion is 100% wrong. Now so far as I can see right now neither conclusion is crystal clear - opinions seem to be a bit mixed. So right now I don't really know what to think. I just don't want to fall into the trap of thinking that core developers are somehow in a better place to know the right answer than users. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > I have to admit that I care less about the specific issue here than > about the general issue of being open to hearing what the user needs > actually are. I honestly have no idea whether it's sensible to want to > run postgres as init. If people who know about container stuff say > that's a dumb idea and you shouldn't do it, then IMHO your conclusion > that we should simply disallow it is 100% correct. But if those people > show up and say, no, it's actually super-convenient for postgres to > run as init and using one of those shim things has significant > downsides that are hard to mitigate, and if further we could do what > they say they need with just a little bit of extra code, then IMHO > your conclusion is 100% wrong. Now so far as I can see right now > neither conclusion is crystal clear - opinions seem to be a bit mixed. > So right now I don't really know what to think. I just don't want to > fall into the trap of thinking that core developers are somehow in a > better place to know the right answer than users. I don't claim to have an opinion about how convenient it would be for users to not need an init shim. I do claim to have a qualified opinion about how hard it would be for us to support the case. It'd hobble our ability to detect child-process bookkeeping errors, and it'd put constraints on how we manage the postmaster's signal handling. Maybe those constraints will never matter, but that's a contract I don't really want to buy into for this seemingly-not-large benefit. regards, tom lane