Thread: Attempt to stop dead instance can stop a random process?
It appears that when pg_ctl gets a stop request for a given directory, it looks for a pid file in that directory and signals that pid to stop. It doesn't appear to check that the pid is for a PostgreSQL postmaster running out of the given directory. I think it should, although on a quick scan of the code, I didn't see a convenient way to do that.

I have some evidence that when we attempted to stop a PostgreSQL instance which (it turned out) had died without cleaning up the pid file, it actually stopped another instance which was using a different data directory but had wrapped around to the same pid.

I guess if we ran each instance under a different OS user we would be protected from this, but we hadn't thought that was necessary. Besides, we have other processes running under that OS login for maintenance or as part of the recovery processing.

-Kevin

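To make the hazard described above concrete, here is a minimal Python sketch of the lookup step. It is an illustration only, not pg_ctl's actual C code: the function name `read_pid_file` is made up for the example, and the sample file contents (PID 7760 and the data directory path) are borrowed from the log excerpt later in this thread. The sketch assumes the PID is stored on the first line of postmaster.pid.

```python
import os
import tempfile


def read_pid_file(data_dir):
    """Return the PID on the first line of postmaster.pid, or None.

    Hypothetical helper for illustration; it is not a PostgreSQL API.
    """
    path = os.path.join(data_dir, "postmaster.pid")
    try:
        with open(path) as f:
            first_line = f.readline().strip()
    except FileNotFoundError:
        return None  # no pid file: nothing to signal
    try:
        return int(first_line)
    except ValueError:
        return None  # unreadable pid file


# Demo with a throwaway directory standing in for a data directory.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "postmaster.pid"), "w") as f:
        f.write("7760\n/var/pgsql/data/county/dunn/data\n")
    print(read_pid_file(data_dir))  # prints 7760
```

The number is all that comes back. If the postmaster that wrote it is gone and the OS has since handed the same PID to another process, a signal sent to that number reaches the new owner, which is the scenario reported here.
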
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > It appears that when pg_ctl gets a stop request for a given directory, it l= > ooks for a pid file in that directory and signals that pid to stop. It doe= > sn't appear to check that the pid is for a PostgreSQL postmaster running ou= > t of the given directory. I think it should, although on a quick scan of t= > he code, I didn't see a convenient way to do that. [ shrug... ] AFAICS there is no way to know that. > I have some evidence that when we attempted to stop a PostgreSQL instance w= > hich (it turned out) had died without cleaning up the pid file, it actually= > stopped another instance which was using a different data directory but ha= > d wrapped around to the same pid. The real question there is how come the postmaster died without removing the pidfile. It's not that easy to crash the postmaster ... regards, tom lane
>>> On Fri, Aug 31, 2007 at 2:18 PM, in message <381.1188587883@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> It appears that when pg_ctl gets a stop request for a given directory, it
>> looks for a pid file in that directory and signals that pid to stop. It
>> doesn't appear to check that the pid is for a PostgreSQL postmaster running
>> out of the given directory. I think it should, although on a quick scan of
>> the code, I didn't see a convenient way to do that.
>
> [ shrug... ] AFAICS there is no way to know that.

I sure couldn't see a way, but I was hoping that was just a matter of my
own ignorance.

>> I have some evidence that when we attempted to stop a PostgreSQL instance
>> which (it turned out) had died without cleaning up the pid file, it actually
>> stopped another instance which was using a different data directory but had
>> wrapped around to the same pid.
>
> The real question there is how come the postmaster died without removing
> the pidfile. It's not that easy to crash the postmaster ...

Well, that's not due to a bug in PostgreSQL. We're using a buggy LDAP
implementation (not my call) which can crash things. The machine totally
locked up after logging distress messages from that daemon, and they cycled
power to get out of it. The PostgreSQL issue here was a secondary problem
in trying to get the server back to normal.

So really, what I was suggesting was something to improve the robustness of
PostgreSQL in the face of severe challenges posed by other issues. I realize
it's a very low volume issue; if it's not easy to fix, probably not worth it.

Now to bug the people on the list of authorized contacts for Novell to open
a support case on the LDAP problems, and see how many of the 40 core dumps
I have from their daemon they want to see.

-Kevin

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Well, that's not due to a bug in PostgreSQL. We're using a buggy LDAP > implementation (not my call) which can crash things. The machine totally > locked up after logging distress messages from that daemon, and they cycled > power to get out of it. Hmm. Do I correctly grasp the picture that you've got several Postgres installations on the machine and they're all booted by startup scripts? In this situation, it's actually not a bad idea to run each one under a separate userid. The problem is that in successive reboots, each postmaster will typically get almost but not exactly the same PID as last time, since the number of processes launched earlier in system startup is mostly but not completely deterministic. If you start all the postmasters together, as you probably do, then there will be occasions when one gets a PID that another one had in the previous boot cycle. That can lead to refusal to start up: if a postmaster sees a postmaster lock file in its data directory, containing a PID that belongs to another live process owned by the same userid, it has to assume that that's a conflicting postmaster and it must respect the lock file. You can prevent that problem if each postmaster (data directory) belongs to a different userid. (Some people prefer to fix this by having a startup script that forcibly removes all the lockfiles before launching the postmasters. I think that's kinda risky, although if it's done in a separate script that you'd have no reason to run by hand, it's probably OK. Clueless folks put the action right in the postgresql start script, meaning that a thoughtless "service postgresql start" blows away the lock file...) BTW, I would imagine that some scenario like this preceded the problem that you actually reported, since had all the postmasters started successfully, they'd all have written correct lockfiles. regards, tom lane
>>> On Fri, Aug 31, 2007 at 3:10 PM, in message <1068.1188591013@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Well, that's not due to a bug in PostgreSQL. We're using a buggy LDAP
>> implementation (not my call) which can crash things. The machine totally
>> locked up after logging distress messages from that daemon, and they cycled
>> power to get out of it.
>
> Hmm. Do I correctly grasp the picture that you've got several Postgres
> installations on the machine and they're all booted by startup scripts?

Several is an understatement. This is the machine where we're running one
PostgreSQL instance per county in "warm standby" mode -- not to actually use
in recovery, but only to confirm that the backups are flowing back and
applying cleanly. So, 72 instances, on ports 5401 to 5472.

> In this situation, it's actually not a bad idea to run each one under a
> separate userid.

OK, I'll see about getting that set up.

> (Some people prefer to fix this by having a startup script that forcibly
> removes all the lockfiles before launching the postmasters. I think
> that's kinda risky, although if it's done in a separate script that
> you'd have no reason to run by hand, it's probably OK.

I don't like that idea much. I'd rather add 72 new OS users.

> BTW, I would imagine that some scenario like this preceded the problem
> that you actually reported, since had all the postmasters started
> successfully, they'd all have written correct lockfiles.

Quite likely. Most of the action happened before I arrived for the day.

Thanks.

-Kevin

On Fri, Aug 31, 2007 at 02:41:47PM -0500, Kevin Grittner wrote:

[...]

> > The real question there is how come the postmaster died without removing
> > the pidfile. It's not that easy to crash the postmaster ...
>
> Well, that's not due to a bug in PostgreSQL. We're using a buggy LDAP
> implementation (not my call) which can crash things. The machine totally
> locked up after logging distress messages from that daemon, and they cycled
> power to get out of it.

Hm. I've come to expect the OS removing all pidfiles early at bootup. A
pidfile from the "last incarnation" doesn't make much sense anyway, right?

Regards
-- tomás

tomas@tuxteam.de writes:
> Hm. I've come to expect the OS removing all pidfiles early at bootup.

If there's a script in your system that does that, then adding Postgres
lockfiles to it makes all kinds of sense. Our problem as upstream
software is that this isn't something well-standardized that we could
plug into ...

			regards, tom lane

On Sat, Sep 01, 2007 at 12:57:35AM -0400, Tom Lane wrote:
> tomas@tuxteam.de writes:
> > Hm. I've come to expect the OS removing all pidfiles early at bootup.
>
> If there's a script in your system that does that, then adding Postgres
> lockfiles to it makes all kinds of sense. Our problem as upstream
> software is that this isn't something well-standardized that we could
> plug into ...

Right -- this becomes the distributor's job (when compiling from sources,
the distributor is the sysadmin). Upstream can only recommend.

Regards
-- tomás

>>> On Fri, Aug 31, 2007 at 3:10 PM, in message <1068.1188591013@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Hmm. Do I correctly grasp the picture that you've got several Postgres
> installations on the machine and they're all booted by startup scripts?
>
> In this situation, it's actually not a bad idea to run each one under a
> separate userid. The problem is that in successive reboots, each
> postmaster will typically get almost but not exactly the same PID as
> last time, since the number of processes launched earlier in system
> startup is mostly but not completely deterministic. If you start all
> the postmasters together, as you probably do, then there will be
> occasions when one gets a PID that another one had in the previous boot
> cycle. That can lead to refusal to start up: if a postmaster sees a
> postmaster lock file in its data directory, containing a PID that
> belongs to another live process owned by the same userid, it has to
> assume that that's a conflicting postmaster and it must respect the lock
> file. You can prevent that problem if each postmaster (data directory)
> belongs to a different userid.

I was thinking of submitting a patch to add a recommendation to this effect
to section 16.1 ("The PostgreSQL User Account") in the documentation. Does
that seem appropriate to all? I'm not sure whether it would be worth
changing 16.2 ("Creating a Database Cluster") to say "while logged into the
PostgreSQL user account which you have chosen for the cluster".

> (Some people prefer to fix this by having a startup script that forcibly
> removes all the lockfiles before launching the postmasters. I think
> that's kinda risky, although if it's done in a separate script that
> you'd have no reason to run by hand, it's probably OK. Clueless folks
> put the action right in the postgresql start script, meaning that a
> thoughtless "service postgresql start" blows away the lock file...)

Would it be a good idea to mention pid file cleanup strategies in section
16.3 ("Starting the Database Server") where pid files are discussed, or
isn't that something we should get into in the docs?

Is there anywhere in the documentation to describe common causes and
solutions for messages such as these (from the log file)?:

[2007-09-02 11:47:14.697 CDT] 7910 FATAL: lock file "postmaster.pid" already exists
[2007-09-02 11:47:14.697 CDT] 7910 HINT: Is another postmaster (PID 7760) running in data directory "/var/pgsql/data/county/dunn/data"?
[2007-09-02 14:45:28.541 CDT] 21735 FATAL: lock file "/tmp/.s.PGSQL.5417.lock" already exists
[2007-09-02 14:45:28.541 CDT] 21735 HINT: Is another postmaster (PID 7760) using socket file "/tmp/.s.PGSQL.5417"?

-Kevin