Re: kill -KILL: What happens? - Mailing list pgsql-hackers

From David Fetter
Subject Re: kill -KILL: What happens?
Date
Msg-id 20110113171235.GA28078@fetter.org
In response to Re: kill -KILL: What happens?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: kill -KILL: What happens?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, Jan 13, 2011 at 10:41:28AM -0500, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > I've noticed over the years that we give people dire warnings never to
> > send a KILL signal to the postmaster, but I'm unsure as to what are
> > potential consequences of this, as in just exactly how this can result
> > in problems.  Is there some reference I can look to for explanations
> > of the mechanism(s) whereby the damage occurs?
> 
> There's no risk of data corruption, if that's what you're thinking of.
> It's just that you're then looking at having to manually clean up the
> child processes and then restart the postmaster; a process that is not
> only tedious but does offer the possibility of screwing yourself.

Does this mean that there's no cross-platform way to ensure that
killing a process results in its children's timely (i.e. before damage
can occur) death?  Or is there such a way, but one that isn't practical
from a performance point of view?
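
To make the question concrete, here is a minimal sketch of one Unix-only
way a child can at least notice that its parent was SIGKILLed: the
parent holds the write end of a pipe, and the child sees EOF on the read
end once the parent is gone.  This is an illustration, not a description
of what the postmaster or the backends actually do.  The function name
and the "r"/"d" status bytes are made up for the example, and it says
nothing about whether a busy child would look at the pipe soon enough.

```python
import os
import select
import signal


def child_notices_parent_death(timeout=5.0):
    """Return True if a child sees EOF on a pipe after its parent is SIGKILLed.

    Hypothetical demo: the forked "parent" and "child" are stand-ins,
    not real server processes.
    """
    report_r, report_w = os.pipe()        # child -> observer status channel
    parent = os.fork()
    if parent == 0:                       # stand-in parent process
        os.close(report_r)
        death_r, death_w = os.pipe()      # write end is held only by the parent
        if os.fork() == 0:                # stand-in child process
            os.close(death_w)
            os.write(report_w, b"r")      # ready: the parent may be killed now
            os.read(death_r, 1)           # returns b"" (EOF) once the parent is gone
            os.write(report_w, b"d")      # report that the death was noticed
            os._exit(0)
        os.close(death_r)
        signal.pause()                    # sleep until a signal arrives
        os._exit(0)
    os.close(report_w)
    try:
        if os.read(report_r, 1) != b"r":  # wait until the child is set up
            return False
        os.kill(parent, signal.SIGKILL)   # the parent gets no chance to clean up
        os.waitpid(parent, 0)
        ready, _, _ = select.select([report_r], [], [], timeout)
        return bool(ready) and os.read(report_r, 1) == b"d"
    finally:
        os.close(report_r)
```

The point of using a pipe is that the kernel closes the parent's
descriptors even on SIGKILL, so no signal handler is needed.  The catch
is that the child only reacts when it gets around to reading the pipe.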

> In particular the risk is that someone clueless enough to do this would
> next decide that removing $PGDATA/postmaster.pid, rather than killing
> all the existing children, is the quickest way to get the postmaster
> restarted.  Once he's done that, his data will shortly be hosed beyond
> recovery, because now he has two noncommunicating sets of backends
> massaging the same files via separate sets of shared buffers.

Right.

> The reason this sequence of events doesn't seem improbable is that the
> error you get when you try to start a new postmaster, if there are still
> old backends running, is
> 
> FATAL:  pre-existing shared memory block (key 5490001, ID 15609) is still in use
> HINT:  If you're sure there are no old server processes still running, remove the shared memory block or just delete the file "postmaster.pid".
> 
> Maybe we should rewrite that HINT --- while it's *possible* that
> removing the shmem block or deleting postmaster.pid is the right thing
> to do, it's not exactly *likely*.  I think we need to put a bit more
> emphasis on the "If ..." part.  Like "If you are prepared to swear on
> your mother's grave that there are no old server processes still
> running, consider removing postmaster.pid.  But first check for existing
> processes again."

Maybe the hint could give an OS-tailored way to check this...
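
As a rough idea of what such a check might look like on a Unix-like
system, the sketch below scans "ps" output for processes whose command
name is postgres or postmaster.  The function names are hypothetical,
and matching on the process title is only a heuristic I am assuming for
illustration: it would also match servers belonging to other data
directories on the same machine, and it would miss a renamed binary.

```python
import os
import subprocess


def find_postgres_pids(ps_output):
    """Return PIDs from 'pid args' lines whose command looks like a server process."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue                      # skip blank or malformed lines
        pid, args = int(parts[0]), parts[1]
        # First word of the command, without directory or a trailing ":"
        # (process titles such as "postgres: writer process").
        exe = os.path.basename(args.split()[0]).rstrip(":")
        if exe in ("postgres", "postmaster"):
            pids.append(pid)
    return pids


def list_running_postgres_pids():
    """Run ps and return candidate server PIDs (heuristic, Unix-like systems only)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_postgres_pids(out)
```

If list_running_postgres_pids() returns anything at all, that would be
a reason to look more closely before touching postmaster.pid.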

> (BTW, I notice that this interlock against starting a new postmaster
> appears to be broken in HEAD, which is likely not unrelated to the
> fact that the contents of postmaster.pid seem to be totally bollixed
> :-()

D'oh!  Well, I hope knowing it's a problem gives some kind of glimmer
as to how to solve it :)

Is this worth writing tests for?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

