Thread: kill -KILL: What happens?

kill -KILL: What happens?

From

David Fetter

Date:

13 January 2011, 13:38:18

Folks,

I've noticed over the years that we give people dire warnings never to
send a KILL signal to the postmaster, but I'm unsure as to what are
potential consequences of this, as in just exactly how this can result
in problems.  Is there some reference I can look to for explanations
of the mechanism(s) whereby the damage occurs?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 14:41:38

David Fetter <david@fetter.org> writes:
> I've noticed over the years that we give people dire warnings never to
> send a KILL signal to the postmaster, but I'm unsure as to what are
> potential consequences of this, as in just exactly how this can result
> in problems.  Is there some reference I can look to for explanations
> of the mechanism(s) whereby the damage occurs?

There's no risk of data corruption, if that's what you're thinking of.
It's just that you're then looking at having to manually clean up the
child processes and then restart the postmaster; a process that is not
only tedious but does offer the possibility of screwing yourself.

In particular the risk is that someone clueless enough to do this would
next decide that removing $PGDATA/postmaster.pid, rather than killing
all the existing children, is the quickest way to get the postmaster
restarted.  Once he's done that, his data will shortly be hosed beyond
recovery, because now he has two noncommunicating sets of backends
massaging the same files via separate sets of shared buffers.

The reason this sequence of events doesn't seem improbable is that the
error you get when you try to start a new postmaster, if there are still
old backends running, is

FATAL:  pre-existing shared memory block (key 5490001, ID 15609) is still in use
HINT:  If you're sure there are no old server processes still running, remove the shared memory block or just delete the file "postmaster.pid".

Maybe we should rewrite that HINT --- while it's *possible* that
removing the shmem block or deleting postmaster.pid is the right thing
to do, it's not exactly *likely*.  I think we need to put a bit more
emphasis on the "If ..." part.  Like "If you are prepared to swear on
your mother's grave that there are no old server processes still
running, consider removing postmaster.pid.  But first check for existing
processes again."
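As an illustration of the "still in use" condition in the FATAL message above: on Unix, a program can ask the kernel how many processes remain attached to a System V shared memory segment. The C sketch below shows only that general technique; whether the server performs its check exactly this way is an assumption, and `shm_attach_count` is a hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Hypothetical helper: return the number of processes currently attached
 * to the System V shared memory segment `shmid`, or -1 on error.
 */
long shm_attach_count(int shmid)
{
    struct shmid_ds info;

    if (shmctl(shmid, IPC_STAT, &info) != 0)
        return -1;
    return (long) info.shm_nattch;
}
```

A nonzero count for the old segment means orphaned backends are still attached, which is exactly the situation in which deleting postmaster.pid leads to two sets of backends working on the same files.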

(BTW, I notice that this interlock against starting a new postmaster
appears to be broken in HEAD, which is likely not unrelated to the fact
that the contents of postmaster.pid seem to be totally bollixed :-()
        regards, tom lane

Re: kill -KILL: What happens?

From

David Fetter

Date:

13 January 2011, 16:12:46

On Thu, Jan 13, 2011 at 10:41:28AM -0500, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > I've noticed over the years that we give people dire warnings never to
> > send a KILL signal to the postmaster, but I'm unsure as to what are
> > potential consequences of this, as in just exactly how this can result
> > in problems.  Is there some reference I can look to for explanations
> > of the mechanism(s) whereby the damage occurs?
> 
> There's no risk of data corruption, if that's what you're thinking of.
> It's just that you're then looking at having to manually clean up the
> child processes and then restart the postmaster; a process that is not
> only tedious but does offer the possibility of screwing yourself.

Does this mean that there's no cross-platform way to ensure that
killing a process results in its children's timely (i.e. before damage
can occur) death?  That such a way isn't practical from a performance
point of view?

> In particular the risk is that someone clueless enough to do this would
> next decide that removing $PGDATA/postmaster.pid, rather than killing
> all the existing children, is the quickest way to get the postmaster
> restarted.  Once he's done that, his data will shortly be hosed beyond
> recovery, because now he has two noncommunicating sets of backends
> massaging the same files via separate sets of shared buffers.

Right.

> The reason this sequence of events doesn't seem improbable is that the
> error you get when you try to start a new postmaster, if there are still
> old backends running, is
> 
> FATAL:  pre-existing shared memory block (key 5490001, ID 15609) is still in use
> HINT:  If you're sure there are no old server processes still running, remove the shared memory block or just delete the file "postmaster.pid".
> 
> Maybe we should rewrite that HINT --- while it's *possible* that
> removing the shmem block or deleting postmaster.pid is the right thing
> to do, it's not exactly *likely*.  I think we need to put a bit more
> emphasis on the "If ..." part.  Like "If you are prepared to swear on
> your mother's grave that there are no old server processes still
> running, consider removing postmaster.pid.  But first check for existing
> processes again."

Maybe the hint could give an OS-tailored way to check this...

> (BTW, I notice that this interlock against starting a new postmaster
> appears to be broken in HEAD, which is likely not unrelated to the
> fact that the contents of postmaster.pid seem to be totally bollixed
> :-()

D'oh!  Well, I hope knowing it's a problem gives some kind of glimmer
as to how to solve it :)

Is this worth writing tests for?

Cheers,
David.

Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 17:00:20

David Fetter <david@fetter.org> writes:
> On Thu, Jan 13, 2011 at 10:41:28AM -0500, Tom Lane wrote:
>> It's just that you're then looking at having to manually clean up the
>> child processes and then restart the postmaster; a process that is not
>> only tedious but does offer the possibility of screwing yourself.

> Does this mean that there's no cross-platform way to ensure that
> killing a process results in its children's timely (i.e. before damage
> can occur) death?  That such a way isn't practical from a performance
> point of view?

The simple, easy, cross-platform solution is this: don't kill -9 the
postmaster.  Send it one of the provisioned shutdown signals and let it
kill its children for you.

At least on Unix I don't believe there is any other solution.  You
could try looking at ps output but there's a fundamental race condition,
ie the postmaster could spawn another child just before you kill it,
whereupon the child is reassigned to init and there's no longer a good
way to tell that it came from that postmaster.
        regards, tom lane

Re: kill -KILL: What happens?

From

"Kevin Grittner"

Date:

13 January 2011, 17:08:08

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> At least on Unix I don't believe there is any other solution.  You
> could try looking at ps output but there's a fundamental race
> condition, ie the postmaster could spawn another child just before
> you kill it, whereupon the child is reassigned to init and there's
> no longer a good way to tell that it came from that postmaster.

Couldn't you run `ps auxf` and kill any postgres process which is
not functioning as postmaster (those are pretty easy to distinguish)
and which isn't the child of such a process?  Is there ever a reason
to allow such an orphan to run?

-Kevin

Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 17:38:35

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> At least on Unix I don't believe there is any other solution.  You
>> could try looking at ps output but there's a fundamental race
>> condition, ie the postmaster could spawn another child just before
>> you kill it, whereupon the child is reassigned to init and there's
>> no longer a good way to tell that it came from that postmaster.
> Couldn't you run `ps auxf` and kill any postgres process which is
> not functioning as postmaster (those are pretty easy to distinguish)
> and which isn't the child of such a process?  Is there ever a reason
> to allow such an orphan to run?

That's not terribly hard to do by hand, especially since the cautious
DBA could also do things like checking a process' CWD to verify which
postmaster it had belonged to.  I can't see automating it though.
We already have a perfectly good solution to the automated shutdown
problem.
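The manual CWD check mentioned above can be done on Linux by reading the /proc/<pid>/cwd symlink. The C sketch below assumes a Linux-style /proc filesystem and that a server process's working directory is its data directory, as the message implies; `process_cwd` is a hypothetical helper, not part of any PostgreSQL tool.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (Linux only): copy the current working directory of
 * process `pid` into `buf`.  Returns 0 on success, -1 on failure or if the
 * path may have been truncated.
 */
int process_cwd(pid_t pid, char *buf, size_t buflen)
{
    char link_path[64];
    ssize_t n;

    if (buflen < 2)
        return -1;
    snprintf(link_path, sizeof(link_path), "/proc/%ld/cwd", (long) pid);
    n = readlink(link_path, buf, buflen - 1);
    if (n < 0 || (size_t) n >= buflen - 1)
        return -1;
    buf[n] = '\0';              /* readlink() does not NUL-terminate */
    return 0;
}
```

Comparing the result with the data directory of the postmaster that was killed tells the DBA which orphans belonged to it. Reading another user's /proc entry usually requires being that user or root.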
        regards, tom lane

Re: kill -KILL: What happens?

From

"Kevin Grittner"

Date:

13 January 2011, 17:45:21

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I can't see automating it though.  We already have a perfectly
> good solution to the automated shutdown problem.

Oh, I totally agree with that.  I somehow thought we'd gotten off
into how someone could recover after shooting their foot.

-Kevin

Re: kill -KILL: What happens?

From

Florian Pflug

Date:

13 January 2011, 18:04:31

On Jan13, 2011, at 19:00 , Tom Lane wrote:
> At least on Unix I don't believe there is any other solution.  You
> could try looking at ps output but there's a fundamental race condition,
> ie the postmaster could spawn another child just before you kill it,
> whereupon the child is reassigned to init and there's no longer a good
> way to tell that it came from that postmaster.

Maybe I'm totally confused, but ...

Couldn't normal backends call PostmasterIsAlive and exit if not, just
like the startup process, the stats collector, autovacuum, bgwriter,
walwriter, walreceiver, walsender and the wal archiver already do?

I assumed they do, but now that I grepped the code it seems they don't.

best regards,
Florian Pflug

Re: kill -KILL: What happens?

From

David Fetter

Date:

13 January 2011, 18:14:55

On Thu, Jan 13, 2011 at 12:45:07PM -0600, Kevin Grittner wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  
> > I can't see automating it though.  We already have a perfectly
> > good solution to the automated shutdown problem.
>  
> Oh, I totally agree with that.  I somehow thought we'd gotten off
> into how someone could recover after shooting their foot.

I get that we can't prevent all pilot error, but I was hoping we could
bullet-proof this a little more, especially in light of a certain
extremely popular server OS's OOM killer's default behavior.

Yes, I get that that behavior is crazy, and stupid, and that people
should shut it off, but it *is* our problem if we let the postmaster
start (or continue) when it's set that way.

Cheers,
David.

Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 18:16:18

Florian Pflug <fgp@phlo.org> writes:
> Couldn't normal backends call PostmasterIsAlive and exit if not, just
> like the startup process, the stats collector, autovacuum, bgwriter,
> walwriter, walreceiver, walsender and the wal archiver already do?

> I assumed they do, but now that I grepped the code it seems they don't.

That's intentional: they keep going until the user closes the session or
someone sends them a signal to do otherwise.  The other various
background processes have to watch PostmasterIsAlive because there is no
session to close.

Frankly I'd prefer to get rid of PostmasterIsAlive, not extend its use.
It sucks because you don't get a signal on parent death.  With the
arrival of the latch code, having to check for PostmasterIsAlive
frequently is the only reason for an idle background process to consume
CPU at all.
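To make the polling cost concrete, here is a minimal C sketch of the simplest possible parent-death check: compare getppid() with the parent PID recorded at fork time. This is not the actual PostmasterIsAlive code; both helper names and the polling interval are assumptions for illustration.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 while the process that forked us is still
 * our parent, 0 once we have been re-parented (classically to init).
 */
int parent_still_alive(pid_t original_parent)
{
    return getppid() == original_parent;
}

/*
 * Hypothetical idle loop.  Nothing notifies us when the parent dies, so the
 * process must wake up every `interval_sec` seconds purely to ask again.
 * Returns the number of wakeups spent before the parent was found dead or
 * `max_polls` was reached.
 */
int wait_for_parent_death(pid_t original_parent, unsigned interval_sec,
                          int max_polls)
{
    int polls = 0;

    while (polls < max_polls && parent_still_alive(original_parent)) {
        sleep(interval_sec);
        polls++;
    }
    return polls;
}
```

Every iteration of that loop is a wakeup that an otherwise idle process would not need, which is the CPU cost complained about above.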

Another problem with the scheme is that it only works as long as the
background process is providing a *non critical* service.  Eventually we
are probably going to need some way for bgwriter/walwriter to stay alive
long enough to service orphaned backends, rather than disappearing
instantly if the postmaster goes away.
        regards, tom lane

Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 18:21:54

David Fetter <david@fetter.org> writes:
> I get that we can't prevent all pilot error, but I was hoping we could
> bullet-proof this a little more, especially in light of a certain
> extremely popular server OS's OOM killer's default behavior.

> Yes, I get that that behavior is crazy, and stupid, and that people
> should shut it off, but it *is* our problem if we let the postmaster
> start (or continue) when it's set that way.

Packagers who are paying attention have fixed that ;-)
        regards, tom lane

Re: kill -KILL: What happens?

From

Robert Haas

Date:

13 January 2011, 18:36:19

On Thu, Jan 13, 2011 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Frankly I'd prefer to get rid of PostmasterIsAlive, not extend its use.
> It sucks because you don't get a signal on parent death.  With the
> arrival of the latch code, having to check for PostmasterIsAlive
> frequently is the only reason for an idle background process to consume
> CPU at all.

What we really need is SIGPARENT.  I wonder if the Linux folks would
consider adding such a thing.  Might be useful to others as well.
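For reference, Linux already has something close to this: prctl(PR_SET_PDEATHSIG) asks the kernel to deliver a chosen signal to the caller when its parent dies. It is Linux-specific and not portable. The C sketch below is only an illustration; `request_parent_death_signal` is a hypothetical wrapper, and the getppid() comparison is there to catch a parent that died before the call took effect.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper (Linux only): ask for `signo` to be delivered to this
 * process when its parent dies.
 * Returns 0 if armed and the parent is still `expected_parent`,
 *         1 if armed but the parent had already changed,
 *        -1 if prctl() failed.
 */
int request_parent_death_signal(int signo, pid_t expected_parent)
{
    if (prctl(PR_SET_PDEATHSIG, (unsigned long) signo) != 0)
        return -1;
    /* The parent may have exited just before prctl() ran. */
    if (getppid() != expected_parent)
        return 1;
    return 0;
}
```

A child would call this right after fork(), passing the parent PID it recorded beforehand. The setting is not inherited across fork(), so each child has to arm it for itself.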

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 18:46:07

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 13, 2011 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Frankly I'd prefer to get rid of PostmasterIsAlive, not extend its use.
>> It sucks because you don't get a signal on parent death.  With the
>> arrival of the latch code, having to check for PostmasterIsAlive
>> frequently is the only reason for an idle background process to consume
>> CPU at all.

> What we really need is SIGPARENT.  I wonder if the Linux folks would
> consider adding such a thing.  Might be useful to others as well.

That's pretty much a dead-end idea unfortunately; it would never be
portable enough to let us change our system structure to rely on it.
Even more to the point, "go away when the postmaster does" isn't
really the behavior we want anyway.  "Go away when the last backend
does" is what we want.

I wonder whether we could have some sort of latch-like counter that
would count the number of active backends and deliver signals when the
count went to zero.  However, if the goal is to defend against random
applications of SIGKILL, there's probably no way to make this reliable
in userspace.

Another idea is to have a "postmaster minder" process that respawns the
postmaster when it's killed.  The hard part of that is that the minder
can't be connected to shared memory (else its OOM cross-section is just
as big as the postmaster's), and that makes it difficult for it to tell
when all the children have gone away.  I suppose it could be coded to
just retry every few seconds until success.  This doesn't improve the
behavior of background processes at all, though.
        regards, tom lane

Re: kill -KILL: What happens?

From

Robert Haas

Date:

13 January 2011, 18:54:07

On Thu, Jan 13, 2011 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jan 13, 2011 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Frankly I'd prefer to get rid of PostmasterIsAlive, not extend its use.
>>> It sucks because you don't get a signal on parent death.  With the
>>> arrival of the latch code, having to check for PostmasterIsAlive
>>> frequently is the only reason for an idle background process to consume
>>> CPU at all.
>
>> What we really need is SIGPARENT.  I wonder if the Linux folks would
>> consider adding such a thing.  Might be useful to others as well.
>
> That's pretty much a dead-end idea unfortunately; it would never be
> portable enough to let us change our system structure to rely on it.
> Even more to the point, "go away when the postmaster does" isn't
> really the behavior we want anyway.  "Go away when the last backend
> does" is what we want.

I'm not convinced.  I was thinking that we could simply treat it like
SIGQUIT, if it's available.  I doubt there's a real use case for
continuing to run queries after the postmaster and all the background
processes are dead.  Expedited death seems like much better behavior.
Even checking PostmasterIsAlive() once per query would be reasonable,
except that it'd add a system call to check for a condition that
almost never holds, which I'm not eager to do.

> I wonder whether we could have some sort of latch-like counter that
> would count the number of active backends and deliver signals when the
> count went to zero.  However, if the goal is to defend against random
> applications of SIGKILL, there's probably no way to make this reliable
> in userspace.

I don't think you can get there 100%.  We could, however, make a rule
that when a background process fails a PostmasterIsAlive() check, it
sends SIGQUIT to everyone it can find in the ProcArray, which would at
least ensure a timely exit in most real-world cases.

> Another idea is to have a "postmaster minder" process that respawns the
> postmaster when it's killed.  The hard part of that is that the minder
> can't be connected to shared memory (else its OOM cross-section is just
> as big as the postmaster's), and that makes it difficult for it to tell
> when all the children have gone away.  I suppose it could be coded to
> just retry every few seconds until success.  This doesn't improve the
> behavior of background processes at all, though.

It hardly seems worth it.  Given a reliable interlock against multiple
postmasters, the real concern is making sure that a half-dead
postmaster gets itself all-dead quickly so that the DBA can start up a
new one before he gets fired.


Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 19:01:31

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 13, 2011 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder whether we could have some sort of latch-like counter that
>> would count the number of active backends and deliver signals when the
>> count went to zero.  However, if the goal is to defend against random
>> applications of SIGKILL, there's probably no way to make this reliable
>> in userspace.

> I don't think you can get there 100%.  We could, however, make a rule
> that when a background process fails a PostmasterIsAlive() check, it
> sends SIGQUIT to everyone it can find in the ProcArray, which would at
> least ensure a timely exit in most real-world cases.

You're going in the wrong direction there: we're trying to have the
system remain sane when the postmaster crashes, not see how quickly
it can screw up every remaining session.

BTW, in Unix-land we could maybe rely on SysV semaphores' SEM_UNDO
feature to keep a trustworthy count of how many live processes there
are.  But I don't know whether there's anything comparable for Windows.
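A rough C sketch of the SEM_UNDO idea, for Unix only: each process adds 1 to a System V semaphore with SEM_UNDO set, and the kernel subtracts it again when the process exits, even after SIGKILL. All function names here are hypothetical, and this says nothing about how PostgreSQL would actually integrate such a counter.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__linux__)
/* Linux leaves the definition of union semun to the caller. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};
#endif

/* Create a private one-semaphore set, initialised to 0, to count processes. */
int live_counter_create(void)
{
    union semun arg;
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (semid < 0)
        return -1;
    arg.val = 0;
    if (semctl(semid, 0, SETVAL, arg) != 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    return semid;
}

/* Count the caller in.  SEM_UNDO makes the kernel take the +1 back when the
 * caller exits, whatever the cause, including SIGKILL. */
int live_counter_join(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };

    return semop(semid, &op, 1);
}

/* Current number of joined, still-living processes, or -1 on error. */
int live_counter_value(int semid)
{
    return semctl(semid, 0, GETVAL);
}

/* Demo: fork a child that joins, SIGKILL it, reap it, and return the count
 * observed afterwards (-1 if anything went wrong along the way). */
int count_after_sigkill(int semid)
{
    int ready[2];
    int before = live_counter_value(semid);
    int joined;
    char byte;
    pid_t pid;

    if (before < 0 || pipe(ready) != 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(ready[0]);
        if (live_counter_join(semid) != 0 || write(ready[1], "x", 1) != 1)
            _exit(1);
        for (;;)
            pause();            /* wait here until killed */
    }
    close(ready[1]);
    /* Block until the child reports that it has joined. */
    joined = read(ready[0], &byte, 1) == 1 &&
             live_counter_value(semid) == before + 1;
    close(ready[0]);
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return joined ? live_counter_value(semid) : -1;
}
```

With a counter like this, a background process could test for "no live processes remain" rather than "the postmaster is gone". The Windows question raised above is not addressed by this sketch.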
        regards, tom lane

Re: kill -KILL: What happens?

From

Aidan Van Dyk

Date:

13 January 2011, 19:02:00

On Thu, Jan 13, 2011 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not convinced.  I was thinking that we could simply treat it like
> SIGQUIT, if it's available.  I doubt there's a real use case for
> continuing to run queries after the postmaster and all the background
> processes are dead.  Expedited death seems like much better behavior.
> Even checking PostmasterIsAlive() once per query would be reasonable,
> except that it'd add a system call to check for a condition that
> almost never holds, which I'm not eager to do.

If postmaster has a few fds to spare, what about having it open a pipe
to every child it spawns.  It never has to read/write to it, but
postmaster closing will signal the client's fd.  The client just has
to pop the fd into whatever normal poll/select event handling it uses
to notice when the "parent's pipe" is closed.

A FIFO would allow postmaster to not need as many file handles, and
clients reading the fifo would notice when the writer (postmaster)
closes it.
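The child-side check for the pipe idea above could look roughly like the following C sketch. It assumes the parent created the pipe before forking and that the child closed its inherited copy of the write end; `life_sign_vanished` is a hypothetical name. Since nothing is ever written to the pipe, the read end becomes readable only at EOF, that is, when the last holder of the write end is gone.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: non-blocking check of the read end of a "life sign"
 * pipe.  Returns 1 if every holder of the write end is gone (EOF),
 * 0 if a holder is still alive, and -1 on error.
 */
int life_sign_vanished(int read_fd)
{
    if (read_fd < 0 || read_fd >= FD_SETSIZE)
        return -1;

    for (;;) {
        fd_set rfds;
        struct timeval tv = { 0, 0 };   /* zero timeout: check and return */
        char byte;
        ssize_t n;
        int rc;

        FD_ZERO(&rfds);
        FD_SET(read_fd, &rfds);
        rc = select(read_fd + 1, &rfds, NULL, NULL, &tv);
        if (rc < 0 && errno == EINTR)
            continue;
        if (rc < 0)
            return -1;
        if (rc == 0)
            return 0;                   /* not readable: writer still there */

        /* Readable.  Nothing is ever written, so this should be EOF. */
        n = read(read_fd, &byte, 1);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            return -1;
        return n == 0;
    }
}
```

In a real event loop the descriptor would simply be added to the fd set the process already waits on. If the child forgets to close its own copy of the write end, it will never see EOF.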

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: kill -KILL: What happens?

From

Robert Haas

Date:

13 January 2011, 19:09:47

On Thu, Jan 13, 2011 at 3:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jan 13, 2011 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I wonder whether we could have some sort of latch-like counter that
>>> would count the number of active backends and deliver signals when the
>>> count went to zero.  However, if the goal is to defend against random
>>> applications of SIGKILL, there's probably no way to make this reliable
>>> in userspace.
>
>> I don't think you can get there 100%.  We could, however, make a rule
>> that when a background process fails a PostmasterIsAlive() check, it
>> sends SIGQUIT to everyone it can find in the ProcArray, which would at
>> least ensure a timely exit in most real-world cases.
>
> You're going in the wrong direction there: we're trying to have the
> system remain sane when the postmaster crashes, not see how quickly
> it can screw up every remaining session.

I strongly believe you're in the minority on that one, for the same
reasons that I don't think most people would agree with your notion of
what should be the default shutdown mode.  A database that can't
accept new connections is a liability, not an asset.


Re: kill -KILL: What happens?

From

Florian Pflug

Date:

13 January 2011, 19:18:16

On Jan13, 2011, at 21:01 , Aidan Van Dyk wrote:
> On Thu, Jan 13, 2011 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not convinced.  I was thinking that we could simply treat it like
>> SIGQUIT, if it's available.  I doubt there's a real use case for
>> continuing to run queries after the postmaster and all the background
>> processes are dead.  Expedited death seems like much better behavior.
>> Even checking PostmasterIsAlive() once per query would be reasonable,
>> except that it'd add a system call to check for a condition that
>> almost never holds, which I'm not eager to do.
> 
> If postmaster has a few fds to spare, what about having it open a pipe
> to every child it spawns.  It never has to read/write to it, but
> postmaster closing will signal the client's fd.  The client just has
> to pop the fd into whatever normal poll/select event handling it uses
> to notice when the "parent's pipe" is closed.

I just started to experiment with that idea, and wrote a small test
program to check if that'd work. I'll post the results when I'm done.

best regards,
Florian Pflug

Re: kill -KILL: What happens?

From

"Kevin Grittner"

Date:

13 January 2011, 19:18:49

Robert Haas <robertmhaas@gmail.com> wrote:
> A database that can't accept new connections is a liability, not
> an asset.

+1

I have so far been unable to imagine a use case for the production
databases I use where I would prefer to see backends continue after
postmaster failure.

-Kevin

Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 19:37:35

Robert Haas <robertmhaas@gmail.com> writes:
> I strongly believe you're in the minority on that one, for the same
> reasons that I don't think most people would agree with your notion of
> what should be the default shutdown mode.  A database that can't
> accept new connections is a liability, not an asset.

Killing active sessions when it's not absolutely necessary is not an
asset.
        regards, tom lane

Re: kill -KILL: What happens?

From

Magnus Hagander

Date:

13 January 2011, 19:40:20

On Thu, Jan 13, 2011 at 21:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I strongly believe you're in the minority on that one, for the same
>> reasons that I don't think most people would agree with your notion of
>> what should be the default shutdown mode.  A database that can't
>> accept new connections is a liability, not an asset.
>
> Killing active sessions when it's not absolutely necessary is not an
> asset.

It certainly can be. Consider any connection pooling scenario, which
would represent the vast majority of larger deployments today - if you
don't kill the sessions, they will never go away.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 19:42:47

Aidan Van Dyk <aidan@highrise.ca> writes:
> If postmaster has a few fds to spare, what about having it open a pipe
> to every child it spawns.  It never has to read/write to it, but
> postmaster closing will signal the client's fd.  The client just has
> to pop the fd into whatever normal poll/select event handling it uses
> to notice when the "parent's pipe" is closed.

Hmm.  Or more generally: there's one FIFO.  The postmaster holds both
sides open.  Backends hold the write side open.  (They can close the
read side, but that would just be to free up a FD.)  Background children
close the write side.  Now a background process can use EOF on the read
side of the FIFO to tell it that postmaster and all backends have
exited.  You still don't get a signal, but at least the condition you're
testing for is the one we actually want and not an approximation.
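What a background process would do with the read side in this scheme can be sketched in C as follows. The point is that EOF arrives only when the last holder of the write side (postmaster or any backend) has gone away. `wait_for_last_writer` is a hypothetical name, and poll() is used here merely as one way to wait with a timeout.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to `timeout_ms` milliseconds for every holder
 * of the pipe's write side to disappear.  Returns 1 on EOF (no holders
 * left), 0 if at least one holder remains, -1 on error.
 * Note: the timeout starts over if a signal interrupts the wait.
 */
int wait_for_last_writer(int read_fd, int timeout_ms)
{
    struct pollfd pfd;
    char byte;
    ssize_t n;
    int rc;

    for (;;) {
        pfd.fd = read_fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0 && errno == EINTR)
            continue;
        if (rc < 0)
            return -1;
        if (rc == 0)
            return 0;           /* timed out: someone still holds it open */
        if (pfd.revents & POLLNVAL)
            return -1;          /* read_fd is not an open descriptor */

        /* Readable or hung up: a zero-length read confirms EOF. */
        n = read(read_fd, &byte, 1);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            return -1;
        return n == 0;
    }
}
```

Because descriptors inherited through fork() refer to the same pipe, the kernel in effect keeps the count of live holders, with no polling of individual PIDs needed.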
        regards, tom lane

Re: kill -KILL: What happens?

From

David Fetter

Date:

13 January 2011, 20:10:43

On Thu, Jan 13, 2011 at 09:18:06PM +0100, Florian Pflug wrote:
> On Jan13, 2011, at 21:01 , Aidan Van Dyk wrote:
> > On Thu, Jan 13, 2011 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> I'm not convinced.  I was thinking that we could simply treat it
> >> like SIGQUIT, if it's available.  I doubt there's a real use case
> >> for continuing to run queries after the postmaster and all the
> >> background processes are dead.  Expedited death seems like much
> >> better behavior.  Even checking PostmasterIsAlive() once per
> >> query would be reasonable, except that it'd add a system call to
> >> check for a condition that almost never holds, which I'm not
> >> eager to do.
> > 
> > If postmaster has a few fds to spare, what about having it open a
> > pipe to every child it spawns.  It never has to read/write to it,
> > but postmaster closing will signal the client's fd.  The client
> > just has to pop the fd into whatever normal poll/select event
> > handling it uses to notice when the "parent's pipe" is closed.
> 
> I just started to experiment with that idea, and wrote a small test
> program to check if that'd work. I'll post the results when I'm
> done.

Great! :)

Cheers,
David.

Re: kill -KILL: What happens?

From

Robert Haas

Date:

13 January 2011, 20:32:33

On Thu, Jan 13, 2011 at 3:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I strongly believe you're in the minority on that one, for the same
>> reasons that I don't think most people would agree with your notion of
>> what should be the default shutdown mode.  A database that can't
>> accept new connections is a liability, not an asset.
>
> Killing active sessions when it's not absolutely necessary is not an
> asset.

That's a highly arguable point and I certainly don't agree with it.  A
database with no postmaster and no background processes can't possibly
be expected to function in any sort of halfway reasonable way.  In
particular:

1. No checkpoints will occur, so the time required for recovery will
grow longer without bound.
2. All walsenders will exit, so no transactions will be replicated to standbys.
3. Transactions committed asynchronously won't be flushed to disk, and
are lost entirely unless enough other WAL activity occurs before the
last backend dies to force a WAL write.
4. Autovacuum won't run until the system is properly restarted, and to
make matters worse there's no statistics collector, so the information
that might trigger a later run will be lost also.
5. At some point, you'll run out of clean buffers, after which
performance will start to suck as backends have to do their own
writes.
6. At some probably later point, the fsync request queue will fill up,
after which performance will go into the toilet.  On 9.1devel, this
takes less than a minute of moderate activity on my MacOS X machine.

All in all, running for any significant period of time in this state
is likely a recipe for disaster, even if for some inexplicable reason
you don't care about the fact that the system won't accept any new
connections.


Re: kill -KILL: What happens?

From

David Fetter

Date:

13 January 2011, 21:00:11

On Thu, Jan 13, 2011 at 02:21:44PM -0500, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > I get that we can't prevent all pilot error, but I was hoping we
> > could bullet-proof this a little more, especially in light of a
> > certain extremely popular server OS's OOM killer's default
> > behavior.
> 
> > Yes, I get that that behavior is crazy, and stupid, and that
> > people should shut it off, but it *is* our problem if we let the
> > postmaster start (or continue) when it's set that way.
> 
> Packagers who are paying attention have fixed that ;-)

Are we privileging packaged over unpackaged?  Some distro over others?  ;)

Cheers,
David.

Re: kill -KILL: What happens?

From

Florian Pflug

Date:

13 January 2011, 22:00:12

On Jan13, 2011, at 21:42 , Tom Lane wrote:
> Aidan Van Dyk <aidan@highrise.ca> writes:
>> If postmaster has a few fds to spare, what about having it open a pipe
>> to every child it spawns.  It never has to read/write to it, but
>> postmaster closing will signal the client's fd.  The client just has
>> to pop the fd into whatever normal poll/select event handling it uses
>> to notice when the "parent's pipe" is closed.
>
> Hmm.  Or more generally: there's one FIFO.  The postmaster holds both
> sides open.  Backends hold the write side open.  (They can close the
> read side, but that would just be to free up a FD.)  Background children
> close the write side.  Now a background process can use EOF on the read
> side of the FIFO to tell it that postmaster and all backends have
> exited.  You still don't get a signal, but at least the condition you're
> testing for is the one we actually want and not an approximation.

I was thinking along similar lines, and put together a small test case to
prove that this actually works. The attached test program simulates the
interactions of a parent process (think postmaster), some utility processes
(think walwriter, bgwriter, ...) and some backends. It uses two pairs of
fds created with pipe(), called LifeSignParent and LifeSignParentBackends.

The writing end of the former is held open only in the parent process,
while the writing end of the latter is held open in the parent process and
all regular backend processes. Backend processes use select() to monitor
the reading end of the LifeSignParent fd pair. Since nothing is ever written
to the writing end, the fd becomes readable only when the parent exits,
because that is how select() signals EOF. Once that happens the backend
exits. The utility processes do the same, but monitor the reading end of
LifeSignParentBackends, and thus exit only after the parent and all regular
backends have died.
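
As a rough illustration of the select()-based check described above (this is
not the attached liveness.c, which is not reproduced here), a backend's side
might look like the sketch below. The names life_sign_vanished and
demo_backend_exits are invented for this example, and the parent simply
closes its write end to stand in for dying, since the kernel closes a
process's descriptors when it exits or is killed.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for the life sign to vanish.
 * Returns 1 if every writer has closed the pipe (EOF), 0 if at least one
 * writer is still alive, -1 on error.
 */
int life_sign_vanished(int read_fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    char c;
    ssize_t n;
    int rc;

    assert(read_fd >= 0 && read_fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(read_fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    rc = select(read_fd + 1, &rfds, NULL, NULL, &tv);
    if (rc <= 0)
        return rc;      /* 0: timed out, still alive; -1: error (incl. EINTR) */

    /* Nothing is ever written, so "readable" can only mean EOF. */
    n = read(read_fd, &c, 1);
    return (n == 0) ? 1 : (n < 0 ? -1 : 0);
}

/*
 * Hypothetical demo: fork one "backend" that polls the life sign and exits
 * once it vanishes. Returns 1 if the backend noticed, 0 otherwise.
 */
int demo_backend_exits(void)
{
    int fds[2], status;
    pid_t pid;

    if (pipe(fds) < 0)
        return 0;
    pid = fork();
    if (pid < 0)
        return 0;
    if (pid == 0)
    {
        int i, gone = 0;

        close(fds[1]);  /* a backend must not hold the write end open */
        for (i = 0; i < 100 && gone == 0; i++)
            gone = life_sign_vanished(fds[0], 100);
        _exit(gone == 1 ? 0 : 1);
    }
    close(fds[0]);
    close(fds[1]);      /* stands in for the parent dying */
    if (waitpid(pid, &status, 0) < 0)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A utility process would run the same loop against the read end of the second
pipe, whose write end is also held by every backend, so it would see EOF only
after the parent and all backends are gone.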

Since the lifesign checking uses select(), any place that already uses
select can easily check for vanishing life signs. CHECK_FOR_INTERRUPTS could
simply check the life sign once every few seconds.

If we want an absolutely reliable signal instead of checking in
CHECK_FOR_INTERRUPTS, every backend would need to launch a monitor subprocess
which monitors the life sign, and exits once it vanishes. The backend would
then get a SIGCHLD once the postmaster dies. Seems like overkill, though.

The whole thing won't work on Windows, since even if it's got a pipe() or
socketpair() call, with EXEC_BACKEND there's no way of transferring these
fds to the child processes. AFAIK, however, Windows has other means with
which such life signs can be implemented. For example, I seem to remember
that WaitForMultipleObjects() can be used to wait for process-related events.
But Windows really isn't my area of expertise...

I have tested this on the latest Ubuntu LTS release (10.04.1) as well as
Mac OS X 10.6.6, and it seems to work correctly on both systems. I'd be
happy to hear from anyone who has access to other systems on whether this
works or not. The expected output is

Launched utility 5095
Launched backend 5097
Launched utility 5096
Launched backend 5099
Launched backend 5098
Utility 5095 detected live parent or backend
Backend 5097 detected live parent
Utility 5096 detected live parent or backend
Backend 5099 detected live parent
Backend 5098 detected live parent
Parent exiting
Backend 5097 exiting after parent died
Backend 5098 exiting after parent died
Backend 5099 exiting after parent died
Utility 5096 exiting after parent and backends died
Utility 5095 exiting after parent and backends died

Everything after "Parent exiting" might be interleaved with a shell prompt,
of course.

best regards,
Florian Pflug

Attachment

liveness.c

Re: kill -KILL: What happens?

From

Jeff Davis

Date:

13 January 2011, 22:29:27

On Thu, 2011-01-13 at 11:14 -0800, David Fetter wrote:
> I get that we can't prevent all pilot error, but I was hoping we could
> bullet-proof this a little more, especially in light of a certain
> extremely popular server OS's OOM killer's default behavior.

That's a good point. I'm not sure how much action can reasonably be
taken, however.

> Yes, I get that that behavior is crazy, and stupid, and that people
> should shut it off, but it *is* our problem if we let the postmaster
> start (or continue) when it's set that way.

As an aside, linux has actually changed the heuristic:


http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a63d83f427fbce97a6cea0db2e64b0eb8435cd10

Regards,
    Jeff Davis

Re: kill -KILL: What happens?

From

David Fetter

Date:

13 January 2011, 22:32:27

On Thu, Jan 13, 2011 at 03:29:13PM -0800, Jeff Davis wrote:
> On Thu, 2011-01-13 at 11:14 -0800, David Fetter wrote:
> > I get that we can't prevent all pilot error, but I was hoping we
> > could bullet-proof this a little more, especially in light of a
> > certain extremely popular server OS's OOM killer's default
> > behavior.
> 
> That's a good point.  I'm not sure how much action can reasonably be
> taken, however.

We may find out from Florian's experiments :)

> > Yes, I get that that behavior is crazy, and stupid, and that
> > people should shut it off, but it *is* our problem if we let the
> > postmaster start (or continue) when it's set that way.
> 
> As an aside, linux has actually changed the heuristic:
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a63d83f427fbce97a6cea0db2e64b0eb8435cd10

Great!  In a decade or so, no more servers will be running with an
earlier kernel ;)

Cheers,
David.

Re: kill -KILL: What happens?

From

Tom Lane

Date:

13 January 2011, 23:32:30

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 13, 2011 at 3:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Killing active sessions when it's not absolutely necessary is not an
>> asset.

> That's a highly arguable point and I certainly don't agree with it.

Your examples appear to rely on the assumption that background processes
exit instantly when the postmaster dies.  Which they should not.
        regards, tom lane

Re: kill -KILL: What happens?

From

Florian Pflug

Date:

13 January 2011, 23:57:11

On Jan14, 2011, at 01:32 , Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jan 13, 2011 at 3:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Killing active sessions when it's not absolutely necessary is not an
>>> asset.
> 
>> That's a highly arguable point and I certainly don't agree with it.
> 
> Your examples appear to rely on the assumption that background processes
> exit instantly when the postmaster dies.  Which they should not.

Even if they stay around, no new connections will be possible once the
postmaster is gone. So this really comes down to what somebody perceives
to be a bigger problem - new connections failing or existing connections
being terminated.

I don't believe there's one right answer to that.

Assume postgres is driving a website, and the postmaster crashes shortly
after a pg_dump run started. You probably won't want your website to be
offline while pg_dump is finishing its backup.

If, on the other hand, your data warehousing database is running a
multi-hour query, you might prefer that query to finish, even at the price
of not being able to accept new connections.

So maybe there should be a GUC for this?

best regards,
Florian Pflug

Re: kill -KILL: What happens?

From

Tom Lane

Date:

14 January 2011, 00:10:38

Florian Pflug <fgp@phlo.org> writes:
> I don't believe there's one right answer to that.

Right.  Force-kill presumes there is only one right answer.

> Assume postgres is driving a website, and the postmaster crashes shortly
> after a pg_dump run started. You probably won't want your website to be
> offline while pg_dump is finishing its backup.

> If, on the other hand, your data warehousing database is running a
> multi-hour query, you might prefer that query to finish, even at the price
> of not being able to accept new connections.

> So maybe there should be a GUC for this?

No need (and rather inflexible anyway).  If you don't want an orphaned
backend to continue, you send it SIGTERM.
        regards, tom lane

Re: kill -KILL: What happens?

From

Robert Haas

Date:

14 January 2011, 00:19:50

On Thu, Jan 13, 2011 at 7:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jan 13, 2011 at 3:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Killing active sessions when it's not absolutely necessary is not an
>>> asset.
>
>> That's a highly arguable point and I certainly don't agree with it.
>
> Your examples appear to rely on the assumption that background processes
> exit instantly when the postmaster dies.  Which they should not.

But they do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: kill -KILL: What happens?

From

Robert Haas

Date:

14 January 2011, 00:20:29

On Thu, Jan 13, 2011 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Florian Pflug <fgp@phlo.org> writes:
>> I don't believe there's one right answer to that.
>
> Right.  Force-kill presumes there is only one right answer.
>
>> Assume postgres is driving a website, and the postmaster crashes shortly
>> after a pg_dump run started. You probably won't want your website to be
>> offline while pg_dump is finishing its backup.
>
>> If, on the other hand, your data warehousing database is running a
>> multi-hour query, you might prefer that query to finish, even at the price
>> of not being able to accept new connections.
>
>> So maybe there should be a GUC for this?
>
> No need (and rather inflexible anyway).  If you don't want an orphaned
> backend to continue, you send it SIGTERM.

It is not easy to make this work in such a way that you can ensure a
clean, automatic restart of PostgreSQL after a postmaster death.
Which is what at least some people want.

--
Robert Haas

Re: kill -KILL: What happens?

From

Tom Lane

Date:

14 January 2011, 00:28:43

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 13, 2011 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Florian Pflug <fgp@phlo.org> writes:
>>> So maybe there should be a GUC for this?

>> No need (and rather inflexible anyway).  If you don't want an orphaned
>> backend to continue, you send it SIGTERM.

> It is not easy to make this work in such a way that you can ensure a
> clean, automatic restart of PostgreSQL after a postmaster death.
> Which is what at least some people want.

True.  It strikes me also that the postmaster does provide some services
other than accepting new connections:

* ensuring that everybody gets killed if a backend crashes

* respawning autovac launcher and other processes that might exit
harmlessly

* is there still any cross-backend signaling that goes through the
postmaster?  We got rid of the sinval case, but I don't recall if
there's others.

While you could probably live without these in the scenario of "let my
honking big query finish before restarting", you would not want to do
without them in unattended operation.
        regards, tom lane

Re: kill -KILL: What happens?

From

Robert Haas

Date:

14 January 2011, 02:04:02

On Thu, Jan 13, 2011 at 8:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jan 13, 2011 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Florian Pflug <fgp@phlo.org> writes:
>>>> So maybe there should be a GUC for this?
>
>>> No need (and rather inflexible anyway).  If you don't want an orphaned
>>> backend to continue, you send it SIGTERM.
>
>> It is not easy to make this work in such a way that you can ensure a
>> clean, automatic restart of PostgreSQL after a postmaster death.
>> Which is what at least some people want.
>
> True.  It strikes me also that the postmaster does provide some services
> other than accepting new connections:
>
> * ensuring that everybody gets killed if a backend crashes
>
> * respawning autovac launcher and other processes that might exit
> harmlessly
>
> * is there still any cross-backend signaling that goes through the
> postmaster?  We got rid of the sinval case, but I don't recall if
> there's others.
>
> While you could probably live without these in the scenario of "let my
> honking big query finish before restarting", you would not want to do
> without them in unattended operation.

Yep.  I'm pretty doubtful that you're going to want them even in that
case, but you're surely not going to want them in unattended
operation.

--
Robert Haas

Re: kill -KILL: What happens?

From

Alvaro Herrera

Date:

14 January 2011, 14:58:20

Excerpts from Robert Haas's message of Fri Jan 14 00:03:53 -0300 2011:
> On Thu, Jan 13, 2011 at 8:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > True.  It strikes me also that the postmaster does provide some services
> > other than accepting new connections:
> >
> > * ensuring that everybody gets killed if a backend crashes

> > While you could probably live without these in the scenario of "let my
> > honking big query finish before restarting", you would not want to do
> > without them in unattended operation.
> 
> Yep.  I'm pretty doubtful that you're going to want them even in that
> case, but you're surely not going to want them in unattended
> operation.

I'm sure you don't want that.  The reason postmaster causes a restart of
all backends in case one of them crashes is that it could have left some
corrupted state behind.  If postmaster dies, and then another backend
crashes, then your backend running "your honking big query" could run
across corrupted state and then you'd be in serious trouble.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: kill -KILL: What happens?

From

"Kevin Grittner"

Date:

14 January 2011, 15:22:40

Alvaro Herrera <alvherre@commandprompt.com> wrote:
> If postmaster dies, and then another backend crashes, then your
> backend running "your honking big query" could run across
> corrupted state and then you'd be in serious trouble.

Worst of all, it could give bogus results without error.  I really
don't see a production use case for letting backends continue after
postmaster failure -- unless you only kinda, sorta care whether
committed data is actually retrievable or reported data is actually
accurate.

-Kevin

Re: kill -KILL: What happens?

From

Florian Pflug

Date:

14 January 2011, 15:29:07

On Jan14, 2011, at 17:22 , Kevin Grittner wrote:

> Alvaro Herrera <alvherre@commandprompt.com> wrote:
> 
>> If postmaster dies, and then another backend crashes, then your
>> backend running "your honking big query" could run across
>> corrupted state and then you'd be in serious trouble.
> 
> Worst of all, it could give bogus results without error.  I really
> don't see a production use case for letting backends continue after
> postmaster failure -- unless you only kinda, sorta care whether
> committed data is actually retrievable or reported data is actually
> accurate.

I gather that the behaviour we want is for normal backends to exit
once the postmaster is gone, and for utility processes (bgwriter, ...)
to exit once all the backends are gone.

The test program I posted in this thread proves that FIFOs and select()
can be used to implement this, if we're ready to check for EOF on the
socket in CHECK_FOR_INTERRUPTS() every few seconds. Is this a viable
route to take?

best regards,
Florian Pflug

Re: kill -KILL: What happens?

From

Robert Haas

Date:

14 January 2011, 15:45:25

On Fri, Jan 14, 2011 at 11:28 AM, Florian Pflug <fgp@phlo.org> wrote:
> I gather that the behaviour we want is for normal backends to exit
> once the postmaster is gone, and for utility processes (bgwriter, ...)
> to exit once all the backends are gone.
>
> The test program I posted in this thread proves that FIFOs and select()
> can be used to implement this, if we're ready to check for EOF on the
> socket in CHECK_FOR_INTERRUPTS() every few seconds. Is this a viable
> route to take?

I don't think there's much point in getting excited about the order in
which things exit.  If we're agreed (and we seem to be, modulo Tom)
that the backends should exit quickly if the postmaster dies, then
worrying about whether the utility processes exit slightly before or
slightly after that doesn't excite me very much.

-- 
Robert Haas

Re: kill -KILL: What happens?

From

Florian Pflug

Date:

15 January 2011, 10:27:18

On Jan14, 2011, at 17:45 , Robert Haas wrote:
> On Fri, Jan 14, 2011 at 11:28 AM, Florian Pflug <fgp@phlo.org> wrote:
>> I gather that the behaviour we want is for normal backends to exit
>> once the postmaster is gone, and for utility processes (bgwriter, ...)
>> to exit once all the backends are gone.
>> 
>> The test program I posted in this thread proves that FIFOs and select()
>> can be used to implement this, if we're ready to check for EOF on the
>> socket in CHECK_FOR_INTERRUPTS() every few seconds. Is this a viable
>> route to take?
> 
> I don't think there's much point in getting excited about the order in
> which things exit.  If we're agreed (and we seem to be, modulo Tom)
> that the backends should exit quickly if the postmaster dies, then
> worrying about whether the utility processes exit slightly before or
> slightly after that doesn't excite me very much.


Tom seems to think that as our utility processes gain importance, one day
we might require one to outlive all the backends, and that whatever solution
we adopt should allow us to arrange for that. Or at least this is how I
understood him.

That part can also easily be left out by using only one FIFO instead of
two, kept open for writing only in the postmaster.

best regards,
Florian Pflug

Re: kill -KILL: What happens?

From

Robert Haas

Date:

15 January 2011, 11:12:18

On Sat, Jan 15, 2011 at 6:27 AM, Florian Pflug <fgp@phlo.org> wrote:
> On Jan14, 2011, at 17:45 , Robert Haas wrote:
>> On Fri, Jan 14, 2011 at 11:28 AM, Florian Pflug <fgp@phlo.org> wrote:
>>> I gather that the behaviour we want is for normal backends to exit
>>> once the postmaster is gone, and for utility processes (bgwriter, ...)
>>> to exit once all the backends are gone.
>>>
>>> The test program I posted in this thread proves that FIFOs and select()
>>> can be used to implement this, if we're ready to check for EOF on the
>>> socket in CHECK_FOR_INTERRUPTS() every few seconds. Is this a viable
>>> route to take?
>>
>> I don't think there's much point in getting excited about the order in
>> which things exit.  If we're agreed (and we seem to be, modulo Tom)
>> that the backends should exit quickly if the postmaster dies, then
>> worrying about whether the utility processes exit slightly before or
>> slightly after that doesn't excite me very much.
>
> Tom seems to think that as our utility processes gain importance, one day
> we might require one to outlive all the backends, and that whatever solution
> we adopt should allow us to arrange for that. Or at least this is how I
> understood him.

Well, there's certainly ONE of those already: the logging collector.
But it already has its own solution to this problem.

--
Robert Haas

Re: kill -KILL: What happens?

From

Florian Pflug

Date:

15 January 2011, 14:45:06

On Jan14, 2011, at 17:45 , Robert Haas wrote:
> On Fri, Jan 14, 2011 at 11:28 AM, Florian Pflug <fgp@phlo.org> wrote:
>> I gather that the behaviour we want is for normal backends to exit
>> once the postmaster is gone, and for utility processes (bgwriter, ...)
>> to exit once all the backends are gone.
>>
>> The test program I posted in this thread proves that FIFOs and select()
>> can be used to implement this, if we're ready to check for EOF on the
>> socket in CHECK_FOR_INTERRUPTS() every few seconds. Is this a viable
>> route to take?
>
> I don't think there's much point in getting excited about the order in
> which things exit.  If we're agreed (and we seem to be, modulo Tom)
> that the backends should exit quickly if the postmaster dies, then
> worrying about whether the utility processes exit slightly before or
> slightly after that doesn't excite me very much.

I've realized that POSIX actually *does* provide a way to receive a signal -
the SIGIO machinery. I've modified my test case to do that. To simplify things,
I've removed support for multiple life sign objects.

The code now does the following:

The parent creates a pipe, sets its reading fd to O_NONBLOCK and O_ASYNC,
and registers a SIGIO handler. The SIGIO handler checks a global flag, and
simply sends a SIGTERM to its own pid if the flag is set.

Child processes close the pipe's writing end (called "giving up ownership
of the life sign" in the code) and set the global flag if they want to receive
a SIGTERM once the parent is gone. The parent's health state can additionally
be checked at any time by trying to read() from the pipe. read() returns
EAGAIN as long as the parent is still alive and EOF otherwise.
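
For readers without the attachment, here is a minimal sketch of that setup
for a single process. It is not the attached liveness.c; the names
arm_life_sign, parent_alive and exit_on_parent_death are invented for
illustration, and the F_SETOWN call is an assumption of mine about how the
SIGIO gets routed.

```c
#define _DEFAULT_SOURCE         /* glibc: expose O_ASYNC under strict -std */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Set by a child that wants a SIGTERM once the parent is gone. */
volatile sig_atomic_t exit_on_parent_death = 0;

/* SIGIO handler: kill() and getpid() are async-signal-safe. */
static void sigio_handler(int signo)
{
    (void) signo;
    if (exit_on_parent_death)
        kill(getpid(), SIGTERM);
}

/*
 * Hypothetical setup: install the handler and switch the pipe's read end
 * to non-blocking, signal-driven mode. Returns 0 on success, -1 on error.
 * Caveat (Linux, as far as I know): the owner set with F_SETOWN belongs to
 * the open file description, so processes sharing one inherited descriptor
 * share one owner. This sketch only covers a single process.
 */
int arm_life_sign(int read_fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigio_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;
    if (fcntl(read_fd, F_SETOWN, getpid()) < 0)
        return -1;
    flags = fcntl(read_fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(read_fd, F_SETFL, flags | O_NONBLOCK | O_ASYNC) < 0 ? -1 : 0;
}

/*
 * Probe the parent's health at any time. Requires O_NONBLOCK on read_fd.
 * Returns 1 while a writer (the parent) is alive, 0 on EOF, -1 on error
 * or unexpected data.
 */
int parent_alive(int read_fd)
{
    char c;
    ssize_t n;

    assert(read_fd >= 0);
    n = read(read_fd, &c, 1);
    if (n == 0)
        return 0;               /* EOF: every write end is closed */
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 1;               /* would block: parent still holds the pipe */
    return -1;
}
```

A child that wants to be terminated would close the write end, set
exit_on_parent_death to 1, and otherwise carry on; parent_alive() can be
called independently of the signal.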

I'm not sure how portable this is. It compiles and runs fine on both my linux
machine (Ubuntu 10.04.01 LTS) and my laptop (OSX 10.6.6).

In the EXEC_BACKEND case the pipe would need to be created with mkfifo() in
the data directory, but otherwise things should work the same. Haven't tried
that yet, though.
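
An untested-in-PostgreSQL sketch of what the mkfifo() variant could look
like follows. The function names and the idea of passing the path around
are assumptions for illustration only. The parent briefly opens a reader
first because opening a FIFO's write end with O_NONBLOCK fails with ENXIO
while no reader exists.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Parent side (hypothetical): create the FIFO at path and return a
 * descriptor that keeps its write end open, or -1 on error.
 */
int life_sign_create(const char *path)
{
    int rfd, wfd;

    if (mkfifo(path, 0600) < 0 && errno != EEXIST)
        return -1;
    rfd = open(path, O_RDONLY | O_NONBLOCK);    /* temporary reader */
    if (rfd < 0)
        return -1;
    wfd = open(path, O_WRONLY | O_NONBLOCK);
    close(rfd);
    return wfd;
}

/* Child side (hypothetical): open the read end by name after exec. */
int life_sign_open(const char *path)
{
    return open(path, O_RDONLY | O_NONBLOCK);
}

/*
 * Returns 1 while some process holds the write end, 0 on EOF, -1 on error
 * or unexpected data.
 */
int life_sign_alive(int rfd)
{
    char c;
    ssize_t n;

    assert(rfd >= 0);
    n = read(rfd, &c, 1);
    if (n == 0)
        return 0;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 1;
    return -1;
}
```

Because each child opens the FIFO itself, it gets its own open file
description rather than an inherited one.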

Code attached. The output should be

Launched backend 8636
Launched backend 8637
Launched backend 8638
Backend 8636 detected live parent
Backend 8637 detected live parent
Backend 8638 detected live parent
Backend 8636 detected live parent
Backend 8637 detected live parent
Backend 8638 detected live parent
Parent exiting
Backend 8637 exiting after parent died
Backend 8638 exiting after parent died
Backend 8636 exiting after parent died

if things work correctly.

best regards,
Florian Pflug

Attachment

liveness.c

Re: kill -KILL: What happens?

From

Robert Haas

Date:

07 May 2011, 01:50:58

On Sat, Jan 15, 2011 at 10:44 AM, Florian Pflug <fgp@phlo.org> wrote:
> On Jan14, 2011, at 17:45 , Robert Haas wrote:
>> On Fri, Jan 14, 2011 at 11:28 AM, Florian Pflug <fgp@phlo.org> wrote:
>>> I gather that the behaviour we want is for normal backends to exit
>>> once the postmaster is gone, and for utility processes (bgwriter, ...)
>>> to exit once all the backends are gone.
>>>
>>> The test program I posted in this thread proves that FIFOs and select()
>>> can be used to implement this, if we're ready to check for EOF on the
>>> socket in CHECK_FOR_INTERRUPTS() every few seconds. Is this a viable
>>> route to take?
>>
>> I don't think there's much point in getting excited about the order in
>> which things exit.  If we're agreed (and we seem to be, modulo Tom)
>> that the backends should exit quickly if the postmaster dies, then
>> worrying about whether the utility processes exit slightly before or
>> slightly after that doesn't excite me very much.
>
> I've realized that POSIX actually *does* provide a way to receive a signal -
> the SIGIO machinery. I've modified my test case to do that. To simplify things,
> I've removed support for multiple life sign objects.
>
> The code now does the following:
>
> The parent creates a pipe, sets its reading fd to O_NONBLOCK and O_ASYNC,
> and registers a SIGIO handler. The SIGIO handler checks a global flag, and
> simply sends a SIGTERM to its own pid if the flag is set.
>
> Child processes close the pipe's writing end (called "giving up ownership
> of the life sign" in the code) and set the global flag if they want to receive
> a SIGTERM once the parent is gone. The parent's health state can additionally
> be checked at any time by trying to read() from the pipe. read() returns
> EAGAIN as long as the parent is still alive and EOF otherwise.
>
> I'm not sure how portable this is. It compiles and runs fine on both my linux
> machine (Ubuntu 10.04.01 LTS) and my laptop (OSX 10.6.6).
>
> In the EXEC_BACKEND case the pipe would need to be created with mkfifo() in
> the data directory, but otherwise things should work the same. Haven't tried
> that yet, though.
>
> Code attached. The output should be
>
> Launched backend 8636
> Launched backend 8637
> Launched backend 8638
> Backend 8636 detected live parent
> Backend 8637 detected live parent
> Backend 8638 detected live parent
> Backend 8636 detected live parent
> Backend 8637 detected live parent
> Backend 8638 detected live parent
> Parent exiting
> Backend 8637 exiting after parent died
> Backend 8638 exiting after parent died
> Backend 8636 exiting after parent died
>
> if things work correctly.

Are you planning to develop this into a patch for 9.2?

--
Robert Haas

Re: kill -KILL: What happens?

From

Florian Pflug

Date:

27 May 2011, 09:07:57

On May7, 2011, at 03:50 , Robert Haas wrote:
> On Sat, Jan 15, 2011 at 10:44 AM, Florian Pflug <fgp@phlo.org> wrote:
>> I've realized that POSIX actually *does* provide a way to receive a signal -
>> the SIGIO machinery. I've modified my test case to do that. To simplify things,
>> I've removed support for multiple life sign objects.
>>
>> <snipped>
> Are you planning to develop this into a patch for 9.2?

Sorry for the extremely late answer - I received this mail while I was on
vacation, and then forgot to answer it once I came back :-(

Anyway, I'm glad to see that Peter Geoghegan has picked this up
and turned this into an actual patch.

Extremely cool!

best regards,
Florian Pflug

Re: kill -KILL: What happens?

From

Peter Geoghegan

Date:

27 May 2011, 11:05:23

On 27 May 2011 10:01, Florian Pflug <fgp@phlo.org> wrote:

> Anyway, I'm glad to see that Peter Geoghegan has picked this up
> and turned this into an actual patch.
>
> Extremely cool!

Thanks Florian.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services