Thread: PG in container w/ pid namespace is init, process exits cause restart

PG in container w/ pid namespace is init, process exits cause restart

From
Andres Freund
Date:
Hi,

A colleague debugged an issue where their postgres was occasionally
crash-restarting under load.

The cause turned out to be that a relatively complex archive_command was
used, which could in some rare circumstances have a bash subshell
pipeline not succeed.  It wasn't at all obvious why that'd cause a crash
though - the archive command handles the error.

The issue turns out to be that postgres was in a container, with pid
namespaces enabled. Because postgres was run directly in the container,
without a parent process inside, it thus becomes pid 1. Which mostly
works without a problem. Until, as the case here with the archive
command, a sub-sub process exits while it still has a child. Then that
child gets re-parented to postmaster (as init).

Such a child is likely to have exited not just with 0 or 1, but
something else. As the pid won't match anything in reaper(), we'll go to
CleanupBackend(). Where any exit status but 0/1 will unconditionally
trigger a restart:

    if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
    {
        HandleChildCrash(pid, exitstatus, _("server process"));
        return;
    }


This kind of thing is pretty hard to debug, because it's not easy to
even figure out what the "crashing" pid belonged to.

I wonder if we should work a bit harder to try to identify whether an
exiting process was a "server process" before identifying it as such?

And perhaps we ought to warn about postgres running as "init" unless we
make that robust?

Greetings,

Andres Freund



Re: PG in container w/ pid namespace is init, process exits cause restart

From
Alvaro Herrera
Date:
On 2021-May-03, Andres Freund wrote:

> The issue turns out to be that postgres was in a container, with pid
> namespaces enabled. Because postgres was run directly in the container,
> without a parent process inside, it thus becomes pid 1. Which mostly
> works without a problem. Until, as the case here with the archive
> command, a sub-sub process exits while it still has a child. Then that
> child gets re-parented to postmaster (as init).

Hah .. interesting.  I think we should definitely make this work, since
containerized stuff is going to become more and more prevalent.

I also heard a story where things ran into trouble (I didn't get the
whole story of *what* was the problem with that) because the datadir is /.
I know -- nobody in their right mind would put the datadir in / -- but
apparently in the container world that's not something as stupid as it
sounds.  That's of course not related to what you describe here
code-wise, but the underlying reason is the same.

> I wonder if we should work a bit harder to try to identify whether an
> exiting process was a "server process" before identifying it as such?

Well, we've never made any effort there because it just wasn't possible.
Nobody ever had postmaster also be init .. until containers.  Let's fix
it.

> And perhaps we ought to warn about postgres running as "init" unless we
> make that robust?

I guess we can do that in older releases, but do we really need it?  As
I understand, the only thing we need to do is verify that the dying PID
is a backend PID, and not cause a crash cycle if it isn't.

-- 
Álvaro Herrera       Valdivia, Chile



Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-May-03, Andres Freund wrote:
>> The issue turns out to be that postgres was in a container, with pid
>> namespaces enabled. Because postgres was run directly in the container,
>> without a parent process inside, it thus becomes pid 1. Which mostly
>> works without a problem. Until, as the case here with the archive
>> command, a sub-sub process exits while it still has a child. Then that
>> child gets re-parented to postmaster (as init).

> Hah .. interesting.  I think we should definitely make this work, since
> containerized stuff is going to become more and more prevalent.

How would we make it "work"?  The postmaster can't possibly be expected
to know the right thing to do with unexpected children.

> I guess we can do that in older releases, but do we really need it?  As
> I understand, the only thing we need to do is verify that the dying PID
> is a backend PID, and not cause a crash cycle if it isn't.

I think that'd be a net reduction in reliability, not an improvement.
In most scenarios it'd do little except mask bugs.  And who's to say
that ignoring unexpected child deaths is okay, anyway?  We could hardly
be sure that the dead process hadn't been connected to shared memory.

Maybe we should put in a startup-time check, analogous to the
can't-run-as-root test, that the postmaster mustn't be PID 1.

            regards, tom lane



Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> I also heard a story where things ran into trouble (I didn't get the
> whole story of *what* was the problem with that) because the datadir is /.

BTW, as far as that goes, I think the general recommendation is that
the datadir shouldn't be a mount point, because bad things happen if
you mount or unmount the drive while the postmaster is up.  I could
see enforcing that, if we could find a reasonably platform-independent
way to do it.

(Of course, / can't be unmounted, so I wonder exactly what bad thing
happened in that story.)

            regards, tom lane



Re: PG in container w/ pid namespace is init, process exits cause restart

From
Alvaro Herrera
Date:
On 2021-May-03, Tom Lane wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > I also heard a story where things ran into trouble (I didn't get the
> > whole story of *what* was the problem with that) because the datadir is /.
> 
> BTW, as far as that goes, I think the general recommendation is that
> the datadir shouldn't be a mount point, because bad things happen if
> you mount or unmount the drive while the postmaster is up.  I could
> see enforcing that, if we could find a reasonably platform-independent
> way to do it.

/ is not a mount point; it's just that the container system binds (?)
some different directory as / for the process to run into.  I suppose it
must be similar to chrooting to /, but I'm not sure if it's exactly
that.

> (Of course, / can't be unmounted, so I wonder exactly what bad thing
> happened in that story.)

It's not related to unmounting.  I'll try to get the details.

-- 
Álvaro Herrera       Valdivia, Chile



Re: PG in container w/ pid namespace is init, process exits cause restart

From
Andres Freund
Date:
Hi,

On 2021-05-03 15:37:24 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2021-May-03, Andres Freund wrote:
> >> The issue turns out to be that postgres was in a container, with pid
> >> namespaces enabled. Because postgres was run directly in the container,
> >> without a parent process inside, it thus becomes pid 1. Which mostly
> >> works without a problem. Until, as the case here with the archive
> >> command, a sub-sub process exits while it still has a child. Then that
> >> child gets re-parented to postmaster (as init).
>
> > Hah .. interesting.  I think we should definitely make this work, since
> > containerized stuff is going to become more and more prevalent.
>
> How would we make it "work"?  The postmaster can't possibly be expected
> to know the right thing to do with unexpected children.

Not saying that we should, but we could check if we're pid 1 / init, and
just warn about children we don't know anything about. Which we could
detect by iterating over BackendList/BackgroundWorkerList before
crash-restarting in CleanupBackend().  Then we'd not loose reliability
in the "normal" case, while not reducing reliability in the container
case.

I'm not quite sure I buy the reliability argument, tbh: The additional
process exits we see as pid 1 are after all process exits that we'd not
see if we weren't pid 1. And if we're not pid 1 then there really should
never be any "unexpected children" - we know what processes postmaster
itself forked after all. So where would unexpected children come from,
except reparenting?


> And who's to say that ignoring unexpected child deaths is okay,
> anyway?  We could hardly be sure that the dead process hadn't been
> connected to shared memory.

I don't think checking the exit status of unexpected children to see
whether we should crash-restart out of that concern is meaningful: We
don't know that the child didn't do anything bad with shared memory when
they exited with exit(1), instead of exit(2).


Random thought: I wonder if we ought to set madvise(MADV_DONTFORK) on
shared memory in postmaster children, where available. Then we could be
fairly certain that there aren't processes we don't know about that are
attached to shared memory (unless there's some nasty
shared_preload_library forking early during backend startup - but that's
hard to get excited about).


> > I guess we can do that in older releases, but do we really need it?  As
> > I understand, the only thing we need to do is verify that the dying PID
> > is a backend PID, and not cause a crash cycle if it isn't.
>
> I think that'd be a net reduction in reliability, not an improvement.
> In most scenarios it'd do little except mask bugs.

Do you feel the same about having different logging between the "known"
and "unknown" child processes?


Personally I don't think it's of utmost importance to support running as
pid 1. But we should at least print useful log messages about what
processes exited...


Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2021-05-03 15:37:24 -0400, Tom Lane wrote:
>> And who's to say that ignoring unexpected child deaths is okay,
>> anyway?  We could hardly be sure that the dead process hadn't been
>> connected to shared memory.

> I don't think checking the exit status of unexpected children to see
> whether we should crash-restart out of that concern is meaningful: We
> don't know that the child didn't do anything bad with shared memory when
> they exited with exit(1), instead of exit(2).

Hmm, by that argument, any unexpected child PID in reaper() ought to be
grounds for a restart, regardless of its exit code.  Which'd be fine by
me.  I'm on board with being more restrictive about this, not less so.

> Do you feel the same about having different logging between the "known"
> and "unknown" child processes?

No objection to logging such cases more clearly, for sure.

            regards, tom lane



Re: PG in container w/ pid namespace is init, process exits cause restart

From
Andres Freund
Date:
Hi,

On 2021-05-03 15:25:53 -0400, Alvaro Herrera wrote:
> I also heard a story where things ran into trouble (I didn't get the
> whole story of *what* was the problem with that) because the datadir is /.
> I know -- nobody in their right mind would put the datadir in / -- but
> apparently in the container world that's not something as stupid as it
> sounds.  That's of course not related to what you describe here
> code-wise, but the underlying reason is the same.

It still seems pretty insane in the container world too. Postgres needs
shared libraries (even if you managed to link postgres itself
statically, something we do not support). Postgres needs to write to the
data directory. Putting shared libraries inside the data directory seems
like a bad idea.

Using / for a single statically linked binary that e.g. just serves a
bunch of hardcoded files is one thing. Putting actual data in / for
something like postgres another.


> > I wonder if we should work a bit harder to try to identify whether an
> > exiting process was a "server process" before identifying it as such?
> 
> Well, we've never made any effort there because it just wasn't possible.
> Nobody ever had postmaster also be init .. until containers.  Let's fix
> it.

> > And perhaps we ought to warn about postgres running as "init" unless we
> > make that robust?
> 
> I guess we can do that in older releases, but do we really need it?  As
> I understand, the only thing we need to do is verify that the dying PID
> is a backend PID, and not cause a crash cycle if it isn't.

I think there's a few more special cases when running as init, other
than reparenting. E.g. I think the default signal handlers are
different, the kernel kills the process in fewer cases etc. I am not
opposed to adding support for it, but I think it'd need a bit of care.

Given that we probably shouldn't just break things in a minor release by
refusing to run as 1, a warning seems to be the easiest thing for now?

Greetings,

Andres Freund



Re: PG in container w/ pid namespace is init, process exits cause restart

From
Alvaro Herrera
Date:
On 2021-May-03, Andres Freund wrote:

> Using / for a single statically linked binary that e.g. just serves a
> bunch of hardcoded files is one thing. Putting actual data in / for
> something like postgres another.

Yeah, I just had a word with them and I had misunderstood what they were
doing.  They were attempting something completely insane and pointless,
so I'm going to leave it at that.

> I think there's a few more special cases when running as init, other
> than reparenting. E.g. I think the default signal handlers are
> different, the kernel kills the process in fewer cases etc. I am not
> opposed to adding support for it, but I think it'd need a bit of care.

Ok, we can leave that as future development then.

> Given that we probably shouldn't just break things in a minor release by
> refusing to run as 1, a warning seems to be the easiest thing for now?

WFM.

-- 
Álvaro Herrera       Valdivia, Chile
"Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of
any database I've worked on."                             (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php



Re: PG in container w/ pid namespace is init, process exits cause restart

From
Andrew Dunstan
Date:
On 5/3/21 3:07 PM, Andres Freund wrote:
> Hi,
>
> A colleague debugged an issue where their postgres was occasionally
> crash-restarting under load.
>
> The cause turned out to be that a relatively complex archive_command was
> used, which could in some rare circumstances have a bash subshell
> pipeline not succeed.  It wasn't at all obvious why that'd cause a crash
> though - the archive command handles the error.
>
> The issue turns out to be that postgres was in a container, with pid
> namespaces enabled. Because postgres was run directly in the container,
> without a parent process inside, it thus becomes pid 1. Which mostly
> works without a problem. Until, as the case here with the archive
> command, a sub-sub process exits while it still has a child. Then that
> child gets re-parented to postmaster (as init).
>
> Such a child is likely to have exited not just with 0 or 1, but
> something else. As the pid won't match anything in reaper(), we'll go to
> CleanupBackend(). Where any exit status but 0/1 will unconditionally
> trigger a restart:
>
>     if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
>     {
>         HandleChildCrash(pid, exitstatus, _("server process"));
>         return;
>     }
>
>
> This kind of thing is pretty hard to debug, because it's not easy to
> even figure out what the "crashing" pid belonged to.
>
> I wonder if we should work a bit harder to try to identify whether an
> exiting process was a "server process" before identifying it as such?
>
> And perhaps we ought to warn about postgres running as "init" unless we
> make that robust?
>

Hmm, my initial reaction was if we detect very early on we're PID 1 then
fork and do all our work in the child, and in the parent just wait until
there are no more children. Not sure if that's feasible but I thought
I'd throw it out there.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: PG in container w/ pid namespace is init, process exits cause restart

From
Andres Freund
Date:
Hi,

On 2021-05-03 16:20:43 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-05-03 15:37:24 -0400, Tom Lane wrote:
> >> And who's to say that ignoring unexpected child deaths is okay,
> >> anyway?  We could hardly be sure that the dead process hadn't been
> >> connected to shared memory.
> 
> > I don't think checking the exit status of unexpected children to see
> > whether we should crash-restart out of that concern is meaningful: We
> > don't know that the child didn't do anything bad with shared memory when
> > they exited with exit(1), instead of exit(2).
> 
> Hmm, by that argument, any unexpected child PID in reaper() ought to be
> grounds for a restart, regardless of its exit code.  Which'd be fine by
> me.  I'm on board with being more restrictive about this, not less so.

Are there any holes / races that could lead to this "legitimately"
happening? To me the signal blocking looks like it should prevent that?

I'm a bit worried that we'd find some harmless corner cases under adding
a new instability. So personally I'd be inclined to just make it a
warning, but ...

Greetings,

Andres Freund



Re: PG in container w/ pid namespace is init, process exits cause restart

From
ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> On 2021-May-03, Andres Freund wrote:
>>> The issue turns out to be that postgres was in a container, with pid
>>> namespaces enabled. Because postgres was run directly in the container,
>>> without a parent process inside, it thus becomes pid 1. Which mostly
>>> works without a problem. Until, as the case here with the archive
>>> command, a sub-sub process exits while it still has a child. Then that
>>> child gets re-parented to postmaster (as init).
>
>> Hah .. interesting.  I think we should definitely make this work, since
>> containerized stuff is going to become more and more prevalent.
>
> How would we make it "work"?  The postmaster can't possibly be expected
> to know the right thing to do with unexpected children.
>
>> I guess we can do that in older releases, but do we really need it?  As
>> I understand, the only thing we need to do is verify that the dying PID
>> is a backend PID, and not cause a crash cycle if it isn't.

> Maybe we should put in a startup-time check, analogous to the
> can't-run-as-root test, that the postmaster mustn't be PID 1.

Given that a number of minimal `init`s already exist specifically for
the case of running a single application in a container, I don't think
Postgres should to reinvent that wheel.  A quick eyball of the output of
`apt search container init` on a Debian Bullseyse system reveals at
least four:

 - https://github.com/Yelp/dumb-init
 - https://github.com/krallin/tini
 - https://github.com/fpco/pid1
 - https://github.com/openSUSE/catatonit

The first one also explains why there's more to being PID 1 than just
handling reparented children.

- ilmari
-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
 the consequences of."                              -- Skud's Meta-Law



Andres Freund <andres@anarazel.de> writes:
> On 2021-05-03 16:20:43 -0400, Tom Lane wrote:
>> Hmm, by that argument, any unexpected child PID in reaper() ought to be
>> grounds for a restart, regardless of its exit code.  Which'd be fine by
>> me.  I'm on board with being more restrictive about this, not less so.

> Are there any holes / races that could lead to this "legitimately"
> happening? To me the signal blocking looks like it should prevent that?

If it did happen it would imply a bug in the postmaster's child-process
bookkeeping.

(Or, I guess, some preloaded module deciding that launching its own
children was OK, whether or not it could find out whether they
succeeded.)

> I'm a bit worried that we'd find some harmless corner cases under adding
> a new instability. So personally I'd be inclined to just make it a
> warning, but ...

Well, I wouldn't recommend adding such a check in released branches,
but I'd be in favor of changing it in HEAD (or waiting till v15
opens).

Meanwhile, it seems like we both thought of complaining if the
postmaster's PID is 1.  I'm not quite sure if there are any
portability hazards from that, but on the whole it sounds like
a good way to detect badly-configured containers.

            regards, tom lane



On Mon, 3 May 2021 at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > I also heard a story where things ran into trouble (I didn't get the
> > whole story of *what* was the problem with that) because the datadir is /.
>
> BTW, as far as that goes, I think the general recommendation is that
> the datadir shouldn't be a mount point, because bad things happen if
> you mount or unmount the drive while the postmaster is up.  I could
> see enforcing that, if we could find a reasonably platform-independent
> way to do it.

I don't think the problem is unmounting -- on BSD you have to try
really hard to unmount filesystems that have files open on them and
afaik you can't do it on Linux at all (which I still claim is the
original sin that led to the fsync issues).

The problem was mounting filesystems if it happened late -- ie. After
Postgres had started up. It was exacerbated by some startup scripts
that would automatically run initdb if there was nothing present.

Offhand I don't actually see anything special about the Postgres
directory root being the mountpoint though. There's nothing stopping
someone from mounting on top of some parent directory other than it
being slightly harder to imagine someone creating the whole directory
tree up from the postgres root rather than just running initdb.

Fwiw, I have a suspicion that the right check for being init is
whether `pid == ppid`.

-- 
greg



Greg Stark <stark@mit.edu> writes:
> On Mon, 3 May 2021 at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, as far as that goes, I think the general recommendation is that
>> the datadir shouldn't be a mount point, because bad things happen if
>> you mount or unmount the drive while the postmaster is up.  I could
>> see enforcing that, if we could find a reasonably platform-independent
>> way to do it.

> I don't think the problem is unmounting -- on BSD you have to try
> really hard to unmount filesystems that have files open on them and
> afaik you can't do it on Linux at all (which I still claim is the
> original sin that led to the fsync issues).
> The problem was mounting filesystems if it happened late -- ie. After
> Postgres had started up. It was exacerbated by some startup scripts
> that would automatically run initdb if there was nothing present.

Yeah, at least that was the case that somebody (Joe Conway if memory
serves) reported years ago.

> Offhand I don't actually see anything special about the Postgres
> directory root being the mountpoint though.

I think one good reason not to do it is that a mount point directory
ought to be root-owned.  I don't recall the specific reasoning
behind that practice, but it seems sound.  Also, if the filesystem
is one that likes having a lost+found directory, you have some
finagling to do to keep initdb from complaining about that.

> Fwiw, I have a suspicion that the right check for being init is
> whether `pid == ppid`.

Makes sense, and seems nicer than hard-coding an assumption that
PID 1 is special.

            regards, tom lane



On Mon, May 3, 2021 at 3:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I guess we can do that in older releases, but do we really need it?  As
> > I understand, the only thing we need to do is verify that the dying PID
> > is a backend PID, and not cause a crash cycle if it isn't.
>
> I think that'd be a net reduction in reliability, not an improvement.
> In most scenarios it'd do little except mask bugs.  And who's to say
> that ignoring unexpected child deaths is okay, anyway?  We could hardly
> be sure that the dead process hadn't been connected to shared memory.

This argument doesn't make any sense to me. In almost all cases,
postgres is not init, and if a backend forks a child which stomps on
shared memory and exits, the postmaster will not know that there is a
problem and will not restart. In practice this is not a problem,
because the core code is careful not to touch shared memory in
children that it forks, and extensions written by reasonably smart
people aren't going to do that either, because it's not very hard to
figure out that it can't possibly work. So, in the rare case where
postgres IS init, and it finds out that a descendent process which is
not a direct child has exited, it should do the same thing that we do
in all the other cases where a descendent process that is not a direct
child has exited, viz. nothing. And if that's the wrong idea - I don't
think it is - then we should fix it in all cases, not just the one
where postgres is init.

I don't have a view on whether it is reasonable or prudent to teach
postgres to work as init, because I don't really know what's involved.
But I think you're taking a position that is basically blind panic. If
something happens that we normally wouldn't even know about, and
because of an unusual circumstance we do know about it, we should not
leap to the conclusion that it is something bad. All that does is make
the system behavior less consistent, and thus harder for users.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On 5/4/21 1:43 PM, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> On Mon, 3 May 2021 at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> BTW, as far as that goes, I think the general recommendation is that
>>> the datadir shouldn't be a mount point, because bad things happen if
>>> you mount or unmount the drive while the postmaster is up.  I could
>>> see enforcing that, if we could find a reasonably platform-independent
>>> way to do it.
> 
>> I don't think the problem is unmounting -- on BSD you have to try
>> really hard to unmount filesystems that have files open on them and
>> afaik you can't do it on Linux at all (which I still claim is the
>> original sin that led to the fsync issues).
>> The problem was mounting filesystems if it happened late -- ie. After
>> Postgres had started up. It was exacerbated by some startup scripts
>> that would automatically run initdb if there was nothing present.
> 
> Yeah, at least that was the case that somebody (Joe Conway if memory
> serves) reported years ago.


Guilty as charged ;-)

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, May 3, 2021 at 3:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think that'd be a net reduction in reliability, not an improvement.
>> In most scenarios it'd do little except mask bugs.  And who's to say
>> that ignoring unexpected child deaths is okay, anyway?  We could hardly
>> be sure that the dead process hadn't been connected to shared memory.

> This argument doesn't make any sense to me. In almost all cases,
> postgres is not init, and if a backend forks a child which stomps on
> shared memory and exits, the postmaster will not know that there is a
> problem and will not restart. In practice this is not a problem,
> because the core code is careful not to touch shared memory in
> children that it forks, and extensions written by reasonably smart
> people aren't going to do that either, because it's not very hard to
> figure out that it can't possibly work. So, in the rare case where
> postgres IS init, and it finds out that a descendent process which is
> not a direct child has exited, it should do the same thing that we do
> in all the other cases where a descendent process that is not a direct
> child has exited, viz. nothing. And if that's the wrong idea - I don't
> think it is - then we should fix it in all cases, not just the one
> where postgres is init.

You are arguing from assumptions not in evidence, specifically that
if we reap a PID that isn't one we recognize, this must be what
happened.  I think it's *at least* as likely that the case implies
some bug in the postmaster's child-process bookkeeping, in which
case doing nothing is not a good answer.  (The fact that that's
what we do today doesn't make it right.)  I don't wish to
lobotomize our ability to detect such problems in order to support
incompetently-configured containers.

Independently of that, as was pointed out upthread, being init requires
more than just ignoring unrecognized results from waitpid.  We shouldn't
take on that responsibility when there are perfectly good solutions out
there already.

            regards, tom lane



On Tue, May 4, 2021 at 2:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You are arguing from assumptions not in evidence, specifically that
> if we reap a PID that isn't one we recognize, this must be what
> happened.  I think it's *at least* as likely that the case implies
> some bug in the postmaster's child-process bookkeeping, ...

It's hard to rule that out completely, but it doesn't seem incredibly
likely to me. I would think that if we had such bugs they would result
in system instability that is also not in evidence.

> Independently of that, as was pointed out upthread, being init requires
> more than just ignoring unrecognized results from waitpid.  We shouldn't
> take on that responsibility when there are perfectly good solutions out
> there already.

That's a separate point that should be judged on its own merits. I
don't have an educated opinion on how hard it would be, or how
valuable it would be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: PG in container w/ pid namespace is init, process exits cause restart

From
Andrew Dunstan
Date:
On 5/3/21 5:13 PM, Dagfinn Ilmari Mannsåker wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>
>> Maybe we should put in a startup-time check, analogous to the
>> can't-run-as-root test, that the postmaster mustn't be PID 1.
> Given that a number of minimal `init`s already exist specifically for
> the case of running a single application in a container, I don't think
> Postgres should to reinvent that wheel.  A quick eyball of the output of
> `apt search container init` on a Debian Bullseyse system reveals at
> least four:
>
>  - https://github.com/Yelp/dumb-init
>  - https://github.com/krallin/tini
>  - https://github.com/fpco/pid1
>  - https://github.com/openSUSE/catatonit
>
> The first one also explains why there's more to being PID 1 than just
> handling reparented children.
>



I looked at the first of these, and it seems perfectly sensible. So I
agree all we really need to do is refuse to run as PID 1.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Andrew Dunstan <andrew@dunslane.net> writes:
> On 5/3/21 5:13 PM, Dagfinn Ilmari Mannsåker wrote:
>> Given that a number of minimal `init`s already exist specifically for
>> the case of running a single application in a container, I don't think
>> Postgres should to reinvent that wheel.  A quick eyball of the output of
>> `apt search container init` on a Debian Bullseyse system reveals at
>> least four:
>> 
>> - https://github.com/Yelp/dumb-init
>> - https://github.com/krallin/tini
>> - https://github.com/fpco/pid1
>> - https://github.com/openSUSE/catatonit
>> 
>> The first one also explains why there's more to being PID 1 than just
>> handling reparented children.

> I looked at the first of these, and it seems perfectly sensible. So I
> agree all we really need to do is refuse to run as PID 1.

[ for the archives' sake ]  I looked at the documentation for dumb-init,
and it claims there are basically two things weird about init:

1. The kernel applies different signal handling rules to it.

2. It has to reap children it didn't spawn.

Whether that list is exhaustive, I dunno ... it has an odor of
Linux-specificity to me.  Anyway, #2 is clearly no problem for
the postmaster, since it's doing that anyway; quibbles about
whether it *should* do that without complaining aside.  We could
imagine trying to handle #1, but that seems like the sort of dank
system-specific corner that we'd regret having got into.  If the
behavior for init isn't consistent with our needs, or changes
across platforms or kernel versions, things could get very messy
indeed.  I'm still thinking that we're best off refusing to do
that and making people install one of these shims that's meant
for the job.

            regards, tom lane



Re: PG in container w/ pid namespace is init, process exits cause restart

From
Justin Pryzby
Date:
On Tue, May 04, 2021 at 01:35:50PM -0400, Greg Stark wrote:
> Fwiw, I have a suspicion that the right check for being init is
> whether `pid == ppid`.

pryzbyj@pryzbyj:~$ ps -wwf 1
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
root         1     0  0  2020 ?        Ss    10:28 /sbin/init

As I recall, on some OS, pid 0 is the "swapper".

-- 
Justin



On Tue, May 4, 2021 at 4:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm still thinking that we're best off refusing to do
> that and making people install one of these shims that's meant
> for the job.

I have to admit that I care less about the specific issue here than
about the general issue of being open to hearing what the user needs
actually are. I honestly have no idea whether it's sensible to want to
run postgres as init. If people who know about container stuff say
that's a dumb idea and you shouldn't do it, then IMHO your conclusion
that we should simply disallow it is 100% correct. But if those people
show up and say, no, it's actually super-convenient for postgres to
run as init and using one of those shim things has significant
downsides that are hard to mitigate, and if further we could do what
they say they need with just a little bit of extra code, then IMHO
your conclusion is 100% wrong. Now so far as I can see right now
neither conclusion is crystal clear - opinions seem to be a bit mixed.
So right now I don't really know what to think. I just don't want to
fall into the trap of thinking that core developers are somehow in a
better place to know the right answer than users.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> I have to admit that I care less about the specific issue here than
> about the general issue of being open to hearing what the user needs
> actually are. I honestly have no idea whether it's sensible to want to
> run postgres as init. If people who know about container stuff say
> that's a dumb idea and you shouldn't do it, then IMHO your conclusion
> that we should simply disallow it is 100% correct. But if those people
> show up and say, no, it's actually super-convenient for postgres to
> run as init and using one of those shim things has significant
> downsides that are hard to mitigate, and if further we could do what
> they say they need with just a little bit of extra code, then IMHO
> your conclusion is 100% wrong. Now so far as I can see right now
> neither conclusion is crystal clear - opinions seem to be a bit mixed.
> So right now I don't really know what to think. I just don't want to
> fall into the trap of thinking that core developers are somehow in a
> better place to know the right answer than users.

I don't claim to have an opinion about how convenient it would be
for users to not need an init shim.  I do claim to have a qualified
opinion about how hard it would be for us to support the case.  It'd
hobble our ability to detect child-process bookkeeping errors, and
it'd put constraints on how we manage the postmaster's signal handling.
Maybe those constraints will never matter, but that's a contract I
don't really want to buy into for this seemingly-not-large benefit.

            regards, tom lane