Thread: Attempt to stop dead instance can stop a random process?

Attempt to stop dead instance can stop a random process?

From
"Kevin Grittner"
Date:
It appears that when pg_ctl gets a stop request for a given directory, it looks for a pid file in that directory and
signals that pid to stop.  It doesn't appear to check that the pid is for a PostgreSQL postmaster running out of the
given directory.  I think it should, although on a quick scan of the code, I didn't see a convenient way to do that.

I have some evidence that when we attempted to stop a PostgreSQL instance which (it turned out) had died without
cleaning up the pid file, it actually stopped another instance which was using a different data directory but had
wrapped around to the same pid.

I guess if we ran each instance under a different OS user we would be protected from this, but we hadn't thought
that was necessary.  Besides, we have other processes running under that OS login for maintenance or as part of the
recovery processing.
-Kevin
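
A minimal sketch of the kind of check asked for above. It is not something pg_ctl is known to do, and it is not portable: it assumes a Linux-style /proc filesystem, assumes the postmaster runs with its data directory as its working directory, and assumes the pid file is named postmaster.pid with the PID on its first line. The helper names (read_pid_file, pid_runs_in, safe_to_signal) are hypothetical.

```python
import os
from typing import Optional


def read_pid_file(data_dir: str) -> Optional[int]:
    """Return the PID on the first line of postmaster.pid, or None."""
    path = os.path.join(data_dir, "postmaster.pid")
    try:
        with open(path) as f:
            return int(f.readline().strip())
    except (OSError, ValueError):
        return None  # missing, unreadable, or malformed pid file


def pid_runs_in(pid: int, data_dir: str) -> bool:
    """True if process `pid` has `data_dir` as its working directory.

    Assumption: /proc/<pid>/cwd exists and links to the process's cwd
    (Linux-specific; other systems will simply get False here).
    """
    try:
        return os.path.samefile(f"/proc/{pid}/cwd", data_dir)
    except OSError:
        return False  # no such process, no /proc, or no permission


def safe_to_signal(data_dir: str) -> bool:
    """Report True only when the pid file names a process running in data_dir."""
    pid = read_pid_file(data_dir)
    return pid is not None and pid_runs_in(pid, data_dir)
```

A caller would send the stop signal only when safe_to_signal(data_dir) is true, and otherwise treat the pid file as stale. Erring on the side of False means a recycled PID belonging to some other instance is left alone.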




Re: Attempt to stop dead instance can stop a random process?

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> It appears that when pg_ctl gets a stop request for a given directory, it
> looks for a pid file in that directory and signals that pid to stop.  It
> doesn't appear to check that the pid is for a PostgreSQL postmaster running
> out of the given directory.  I think it should, although on a quick scan of
> the code, I didn't see a convenient way to do that.

[ shrug... ]  AFAICS there is no way to know that.

> I have some evidence that when we attempted to stop a PostgreSQL instance
> which (it turned out) had died without cleaning up the pid file, it actually
> stopped another instance which was using a different data directory but
> had wrapped around to the same pid.

The real question there is how come the postmaster died without removing
the pidfile.  It's not that easy to crash the postmaster ...
        regards, tom lane


Re: Attempt to stop dead instance can stop a random process?

From
"Kevin Grittner"
Date:
>>> On Fri, Aug 31, 2007 at  2:18 PM, in message <381.1188587883@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> It appears that when pg_ctl gets a stop request for a given directory, it
>> looks for a pid file in that directory and signals that pid to stop.  It
>> doesn't appear to check that the pid is for a PostgreSQL postmaster running
>> out of the given directory.  I think it should, although on a quick scan of
>> the code, I didn't see a convenient way to do that.
>
> [ shrug... ]  AFAICS there is no way to know that.
I sure couldn't see a way, but I was hoping that was just a matter of my own
ignorance.
>> I have some evidence that when we attempted to stop a PostgreSQL instance
>> which (it turned out) had died without cleaning up the pid file, it actually
>> stopped another instance which was using a different data directory but
>> had wrapped around to the same pid.
>
> The real question there is how come the postmaster died without removing
> the pidfile.  It's not that easy to crash the postmaster ...
Well, that's not due to a bug in PostgreSQL.  We're using a buggy LDAP
implementation (not my call) which can crash things.  The machine totally
locked up after logging distress messages from that daemon, and they cycled
power to get out of it.
The PostgreSQL issue here was a secondary problem in trying to get the
server back to normal.  So really, what I was suggesting was something to
improve the robustness of PostgreSQL in the face of severe challenges posed
by other issues.  I realize it's a very low volume issue; if it's not easy
to fix, probably not worth it.
Now to bug the people on the list of authorized contacts for Novell to open
a support case on the LDAP problems, and see how many of the 40 core dumps
I have from their daemon they want to see.
-Kevin



Re: Attempt to stop dead instance can stop a random process?

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Well, that's not due to a bug in PostgreSQL.  We're using a buggy LDAP
> implementation (not my call) which can crash things.  The machine totally
> locked up after logging distress messages from that daemon, and they cycled
> power to get out of it.

Hmm.  Do I correctly grasp the picture that you've got several Postgres
installations on the machine and they're all booted by startup scripts?

In this situation, it's actually not a bad idea to run each one under a
separate userid.  The problem is that in successive reboots, each
postmaster will typically get almost but not exactly the same PID as
last time, since the number of processes launched earlier in system
startup is mostly but not completely deterministic.  If you start all
the postmasters together, as you probably do, then there will be
occasions when one gets a PID that another one had in the previous boot
cycle.  That can lead to refusal to start up: if a postmaster sees a
postmaster lock file in its data directory, containing a PID that
belongs to another live process owned by the same userid, it has to
assume that that's a conflicting postmaster and it must respect the lock
file.  You can prevent that problem if each postmaster (data directory)
belongs to a different userid.
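
The rule described above can be sketched roughly as follows. This is a Python illustration of the described behavior, not the postmaster's actual code, and the function name is hypothetical. It relies on the POSIX convention that sending signal 0 to a process owned by another user fails with a permission error for an unprivileged caller, which is why separate userids make a recycled PID look harmless.

```python
import os


def lock_must_be_respected(lock_pid: int) -> bool:
    """Decide whether a PID found in a lock file should block startup."""
    if lock_pid <= 0 or lock_pid == os.getpid():
        return False  # malformed entry, or our own PID recycled after a reboot
    try:
        os.kill(lock_pid, 0)  # signal 0 checks existence; nothing is delivered
    except ProcessLookupError:
        return False  # no such process: the lock file is stale
    except PermissionError:
        return False  # live process, but owned by a different userid
    return True  # live process under our userid: assume a conflicting postmaster
```

With every instance under one userid, any live process that happens to hold the old PID lands in the final branch and startup is refused. With one userid per data directory, the same collision falls into the permission-error branch and the stale lock file can be ignored.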

(Some people prefer to fix this by having a startup script that forcibly
removes all the lockfiles before launching the postmasters.  I think
that's kinda risky, although if it's done in a separate script that
you'd have no reason to run by hand, it's probably OK.  Clueless folks
put the action right in the postgresql start script, meaning that a
thoughtless "service postgresql start" blows away the lock file...)

BTW, I would imagine that some scenario like this preceded the problem
that you actually reported, since had all the postmasters started
successfully, they'd all have written correct lockfiles.
        regards, tom lane


Re: Attempt to stop dead instance can stop a random process?

From
"Kevin Grittner"
Date:
>>> On Fri, Aug 31, 2007 at  3:10 PM, in message <1068.1188591013@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Well, that's not due to a bug in PostgreSQL.  We're using a buggy LDAP
>> implementation (not my call) which can crash things.  The machine totally
>> locked up after logging distress messages from that daemon, and they cycled
>> power to get out of it.
>
> Hmm.  Do I correctly grasp the picture that you've got several Postgres
> installations on the machine and they're all booted by startup scripts?
Several is an understatement.  This is the machine where we're running one
PostgreSQL instance per county in "warm standby" mode -- not to actually use
in recovery, but only to confirm that the backups are flowing back and
applying cleanly.  So, 72 instances, on ports 5401 to 5472.
> In this situation, it's actually not a bad idea to run each one under a
> separate userid.
OK, I'll see about getting that set up.
> (Some people prefer to fix this by having a startup script that forcibly
> removes all the lockfiles before launching the postmasters.  I think
> that's kinda risky, although if it's done in a separate script that
> you'd have no reason to run by hand, it's probably OK.
I don't like that idea much.  I'd rather add 72 new OS users.
> BTW, I would imagine that some scenario like this preceded the problem
> that you actually reported, since had all the postmasters started
> successfully, they'd all have written correct lockfiles.
Quite likely.  Most of the action happened before I arrived for the day.
Thanks.
-Kevin



Re: Attempt to stop dead instance can stop a random process?

From
tomas@tuxteam.de
Date:
On Fri, Aug 31, 2007 at 02:41:47PM -0500, Kevin Grittner wrote:

[...]

> > The real question there is how come the postmaster died without removing
> > the pidfile.  It's not that easy to crash the postmaster ...
>  
> Well, that's not due to a bug in PostgreSQL.  We're using a buggy LDAP
> implementation (not my call) which can crash things.  The machine totally
> locked up after logging distress messages from that daemon, and they cycled
> power to get out of it.

Hm. I've come to expect the OS removing all pidfiles early at bootup. A
pidfile from the "last incarnation" doesn't make much sense anyway,
right?

Regards
-- tomás



Re: Attempt to stop dead instance can stop a random process?

From
Tom Lane
Date:
tomas@tuxteam.de writes:
> Hm. I've come to expect the OS removing all pidfiles early at bootup.

If there's a script in your system that does that, then adding Postgres
lockfiles to it makes all kinds of sense.  Our problem as upstream
software is that this isn't something well-standardized that we could
plug into ...
        regards, tom lane


Re: Attempt to stop dead instance can stop a random process?

From
tomas@tuxteam.de
Date:
On Sat, Sep 01, 2007 at 12:57:35AM -0400, Tom Lane wrote:
> tomas@tuxteam.de writes:
> > Hm. I've come to expect the OS removing all pidfiles early at bootup.
> 
> If there's a script in your system that does that, then adding Postgres
> lockfiles to it makes all kinds of sense.  Our problem as upstream
> software is that this isn't something well-standardized that we could
> plug into ...

Right -- this becomes the distributor's job (when compiling from
sources, the distributor is the sysadmin). Upstream can only recommend.

Regards
-- tomás



Re: Attempt to stop dead instance can stop a random process?

From
"Kevin Grittner"
Date:
>>> On Fri, Aug 31, 2007 at  3:10 PM, in message <1068.1188591013@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Hmm.  Do I correctly grasp the picture that you've got several Postgres
> installations on the machine and they're all booted by startup scripts?
>
> In this situation, it's actually not a bad idea to run each one under a
> separate userid.  The problem is that in successive reboots, each
> postmaster will typically get almost but not exactly the same PID as
> last time, since the number of processes launched earlier in system
> startup is mostly but not completely deterministic.  If you start all
> the postmasters together, as you probably do, then there will be
> occasions when one gets a PID that another one had in the previous boot
> cycle.  That can lead to refusal to start up: if a postmaster sees a
> postmaster lock file in its data directory, containing a PID that
> belongs to another live process owned by the same userid, it has to
> assume that that's a conflicting postmaster and it must respect the lock
> file.  You can prevent that problem if each postmaster (data directory)
> belongs to a different userid.
I was thinking of submitting a patch to add a recommendation to this effect
to section 16.1 ("The PostgreSQL User Account") in the documentation.  Does
that seem appropriate to all?  I'm not sure whether it would be worth
changing 16.2 ("Creating a Database Cluster") to say "while logged into the
PostgreSQL user account which you have chosen for the cluster".
> (Some people prefer to fix this by having a startup script that forcibly
> removes all the lockfiles before launching the postmasters.  I think
> that's kinda risky, although if it's done in a separate script that
> you'd have no reason to run by hand, it's probably OK.  Clueless folks
> put the action right in the postgresql start script, meaning that a
> thoughtless "service postgresql start" blows away the lock file...)
Would it be a good idea to mention pid file cleanup strategies in section
16.3 ("Starting the Database Server") where pid files are discussed, or
isn't that something we should get into in the docs?
Is there anywhere in the documentation to describe common causes and
solutions for messages such as these (from the log file)?:
[2007-09-02 11:47:14.697 CDT] 7910 FATAL:  lock file "postmaster.pid" already exists
[2007-09-02 11:47:14.697 CDT] 7910 HINT:  Is another postmaster (PID 7760) running in data directory "/var/pgsql/data/county/dunn/data"?
[2007-09-02 14:45:28.541 CDT] 21735 FATAL:  lock file "/tmp/.s.PGSQL.5417.lock" already exists
[2007-09-02 14:45:28.541 CDT] 21735 HINT:  Is another postmaster (PID 7760) using socket file "/tmp/.s.PGSQL.5417"?
-Kevin
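
With many instances on one machine, one way to spot the situation behind messages like these is to look for data directories whose pid files name the same PID, since at most one of them can be right. A minimal sketch, assuming each data directory holds a postmaster.pid whose first line is the PID; the function name find_shared_pids is hypothetical.

```python
import os
from collections import defaultdict


def find_shared_pids(data_dirs):
    """Map each PID named by more than one pid file to those directories."""
    seen = defaultdict(list)
    for d in data_dirs:
        try:
            with open(os.path.join(d, "postmaster.pid")) as f:
                pid = int(f.readline().strip())
        except (OSError, ValueError):
            continue  # no pid file, or its first line is not a number
        seen[pid].append(d)
    # Keep only the PIDs that more than one directory claims.
    return {pid: dirs for pid, dirs in seen.items() if len(dirs) > 1}
```

Run over all the per-county data directories (the exact layout is an assumption based on the path in the log above), any PID that comes back is claimed by at least one stale pid file and is worth checking by hand before stopping or starting anything.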