Thread: We should Axe /contrib/start-scripts
... for the simple reason that nobody is maintaining it. Wheeler just pointed out to me today that the OSX startup script hasn't been updated since 7.4 and contains misinformation and dangerous scripting. Other startup scripts there are equally dilapidated, and aren't used by the linux distros regardless. Unless someone volunteers to be *permanent* maintainer of this directory, I vote that we remove it from 8.5. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Josh Berkus wrote: > ... for the simple reason that nobody is maintaining it. Wheeler just > pointed out to me today that the OSX startup script hasn't been updated > since 7.4 and contains misinformation and dangerous scripting. > > Other startup scripts there are equally dilapidated, and aren't used by > the linux distros regardless. > IMHO, those scripts (at least the Linux one) has great value. There are numerous people that compile and install from source, and for them at least some of these scripts are used quite a bit. AFAIK, the Linux one is functional (yes, sub-optimal in some ways, but it works just fine)... -- Chander Ganesan Open Technology Group, Inc. One Copley Parkway, Suite 210 Morrisville, NC 27560 919-463-0999/877-258-8987 http://www.otg-nc.com
Chander Ganesan <chander@otg-nc.com> writes: > Josh Berkus wrote: >> ... for the simple reason that nobody is maintaining it. Wheeler just >> pointed out to me today that the OSX startup script hasn't been updated >> since 7.4 and contains misinformation and dangerous scripting. >> >> Other startup scripts there are equally dilapidated, and aren't used by >> the linux distros regardless. >> > IMHO, those scripts (at least the Linux one) has great value. There are > numerous people that compile and install from source, and for them at > least some of these scripts are used quite a bit. > AFAIK, the Linux one is functional (yes, sub-optimal in some ways, but > it works just fine)... There are lots of files in our distribution that don't get changed for years at a time. I think if there's something wrong with these, they should get fixed. I'm certainly not prepared to remove them on the basis of an unsubstantiated second-hand report of unspecified problems. (Personally, I use scripts based on start-scripts/osx/ for a number of services on my own machines, so if there's something wrong with them I'd definitely like to know what it is.) regards, tom lane
Tom, > (Personally, I use scripts based on start-scripts/osx/ for a number of > services on my own machines, so if there's something wrong with them > I'd definitely like to know what it is.) I quote: "# What to use to start up the postmaster (we do NOT use pg_ctl for this, # as it adds no value and can cause the postmaster to misrecognize a stale # lock file) DAEMON="$prefix/bin/postmaster" -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
On Aug 19, 2009, at 11:48 AM, Tom Lane wrote: > (Personally, I use scripts based on start-scripts/osx/ for a number of > services on my own machines, so if there's something wrong with them > I'd definitely like to know what it is.) +1. Please don't remove the start scripts. I use them on every system on which I install PostgreSQL. They're very handy for those of us who don't use distro packages. But do fix them if they need it. Thanks, David
Josh Berkus wrote: > > ... for the simple reason that nobody is maintaining it. Wheeler just > pointed out to me today that the OSX startup script hasn't been updated > since 7.4 and contains misinformation and dangerous scripting. > > Other startup scripts there are equally dilapidated, and aren't used by > the linux distros regardless. If they are unmaintained, why not replace them with a pointer to some script that is maintained? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Tom Lane wrote: > (Personally, I use scripts based on start-scripts/osx/ for a number of > services on my own machines, so if there's something wrong with them > I'd definitely like to know what it is.) What kind of "based on"? I mean, are there some changes of yours that could be applied to the contrib script? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Josh Berkus <josh@agliodbs.com> writes: > Tom, >> (Personally, I use scripts based on start-scripts/osx/ for a number of >> services on my own machines, so if there's something wrong with them >> I'd definitely like to know what it is.) > I quote: > "# What to use to start up the postmaster (we do NOT use pg_ctl for this, > # as it adds no value and can cause the postmaster to misrecognize a stale > # lock file) > DAEMON="$prefix/bin/postmaster" And? That statement was and remains perfectly correct. We don't use pg_ctl to start the postmaster in the Linux initscripts, either. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> (Personally, I use scripts based on start-scripts/osx/ for a number of >> services on my own machines, so if there's something wrong with them >> I'd definitely like to know what it is.) > What kind of "based on"? I mean, are there some changes of yours that > could be applied to the contrib script? No, I've just modified them to start other services (bind, sendmail, ...). I'm sure OSX startup scripts are documented elsewhere, but this is a resource right under my nose, so I used it. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >> we do NOT use pg_ctl for [postmaster start], as it adds no value >> and can cause the postmaster to misrecognize a stale lock file > And? That statement was and remains perfectly correct. Is this mentioned in the documentation somewhere that I've missed? I'm curious what the issues are, and why we can solve it in a bash script but not pg_ctl. -Kevin
Tom, >> "# What to use to start up the postmaster (we do NOT use pg_ctl for this, >> # as it adds no value and can cause the postmaster to misrecognize a stale >> # lock file) >> DAEMON="$prefix/bin/postmaster" > > And? That statement was and remains perfectly correct. We don't use > pg_ctl to start the postmaster in the Linux initscripts, either. Then WTF do we have pg_ctl, and why do our docs tell people to use it? -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Josh Berkus wrote: > Tom, > > >> "# What to use to start up the postmaster (we do NOT use pg_ctl for this, > >> # as it adds no value and can cause the postmaster to misrecognize a stale > >> # lock file) > >> DAEMON="$prefix/bin/postmaster" > > > > And? That statement was and remains perfectly correct. We don't use > > pg_ctl to start the postmaster in the Linux initscripts, either. > > Then WTF do we have pg_ctl, and why do our docs tell people to use it? I assume on boot we _know_ there can't be another postmaster running, while normal pg_ctl does not know that. Perhaps we just need to state that in the comment. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> we do NOT use pg_ctl for [postmaster start], as it adds no value >>> and can cause the postmaster to misrecognize a stale lock file >> And? That statement was and remains perfectly correct. > Is this mentioned in the documentation somewhere that I've missed? > I'm curious what the issues are, and why we can solve it in a bash > script but not pg_ctl. It's been covered repeatedly in the archives, but I'm not sure if it's in the docs anywhere. The problem is that after a system crash and reboot, an old postmaster.pid file might be left behind. The postmaster can only safely remove this lock file if it is *certain* that it doesn't represent another live postmaster process. Otherwise it is honor-bound to commit hara-kiri instead of starting up. It can tell whether or not the PID in the file belongs to a live process and whether that process belongs to the postgres userid (by attempting kill(PID, 0) and seeing what it gets). If not, it can remove the file with a clear conscience. However, because of the way that Unix startup works, it is very likely that successive system boots will assign nearly (but not necessarily exactly) the same PID that the postmaster had on the previous cycle. So there's a high probability of a false positive from this test. If the PID matches our own exactly, we can discount it as a false positive. If it matches our parent's exactly, we can also discount it (knowing that a postmaster would never launch another postmaster directly, and being able to get the parent's PID via getppid()). But further up the chain, we're out of luck, because there is no "get grandparent pid" operation in Unix. What this all leads to is that it's safe to launch a postmaster from an init script via something likesu - postgres sh -c "postmaster ..." The postmaster's parent process is a shell belonging to postgres, which it can discount via getppid(), and all further-up ancestors belong to root, so we can discount them via the kill test. So a false PID match cannot lead to failing to start. (You still have to be a bit careful about the form of the shell command, or there might be an intermediate postgres-owned shell process.) On the other hand, if you dosu - postgres sh -c "pg_ctl ..." then the postmaster's parent process is pg_ctl, and its grandparent is a postgres-owned shell process, and it cannot tell that postgres-owned shell process apart from a genuine conflicting postmaster. So a chance match of the shell process's PID to what is in the leftover postmaster.pid file will force it to refuse to start. And that chance match is not a low probability --- in my experience it's one in ten or worse, in a reasonably stable system environment. You can imagine various workarounds involving having pg_ctl pass down its parent's PID, but you'll still get screwed if the initscript author is careless about how many levels of postgres-owned shell process there are. The long and the short of it is that it's best to not use pg_ctl. As mentioned, it doesn't buy much of anything for an initscript anyway. These considerations don't apply to ordinary hand launching of the postmaster, for the primary reason that the chance of a false PID match is several orders of magnitude smaller when you're talking about a manual restart --- the likely postmaster PID now ranges over the whole PID space instead of being within a few counts of the same thing. So we don't need to discourage people from using pg_ctl for ordinary restarts. The whole thing is really only a problem for initscript authors (who all know about it by now ;-)) regards, tom lane
On Aug 19, 2009, at 2:03 PM, Tom Lane wrote: > These considerations don't apply to ordinary hand launching of the > postmaster, for the primary reason that the chance of a false PID > match > is several orders of magnitude smaller when you're talking about a > manual restart --- the likely postmaster PID now ranges over the whole > PID space instead of being within a few counts of the same thing. > So we > don't need to discourage people from using pg_ctl for ordinary > restarts. > The whole thing is really only a problem for initscript authors (who > all > know about it by now ;-)) Nice summary, Tom. Do the distro packagers know this, though? Do we have notes for distro packagers somewhere? Best, David
Should we add a comment to the startup scripts linking this email? http://archives.postgresql.org/message-id/28922.1250715832@sss.pgh.pa.us --------------------------------------------------------------------------- Tom Lane wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > > Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> we do NOT use pg_ctl for [postmaster start], as it adds no value > >>> and can cause the postmaster to misrecognize a stale lock file > > >> And? That statement was and remains perfectly correct. > > > Is this mentioned in the documentation somewhere that I've missed? > > I'm curious what the issues are, and why we can solve it in a bash > > script but not pg_ctl. > > It's been covered repeatedly in the archives, but I'm not sure if it's > in the docs anywhere. The problem is that after a system crash and > reboot, an old postmaster.pid file might be left behind. The postmaster > can only safely remove this lock file if it is *certain* that it doesn't > represent another live postmaster process. Otherwise it is honor-bound > to commit hara-kiri instead of starting up. It can tell whether or not > the PID in the file belongs to a live process and whether that process > belongs to the postgres userid (by attempting kill(PID, 0) and seeing > what it gets). If not, it can remove the file with a clear conscience. > However, because of the way that Unix startup works, it is very likely > that successive system boots will assign nearly (but not necessarily > exactly) the same PID that the postmaster had on the previous cycle. > So there's a high probability of a false positive from this test. > If the PID matches our own exactly, we can discount it as a false > positive. If it matches our parent's exactly, we can also discount it > (knowing that a postmaster would never launch another postmaster > directly, and being able to get the parent's PID via getppid()). > But further up the chain, we're out of luck, because there is no > "get grandparent pid" operation in Unix. > > What this all leads to is that it's safe to launch a postmaster from > an init script via something like > su - postgres sh -c "postmaster ..." > The postmaster's parent process is a shell belonging to postgres, > which it can discount via getppid(), and all further-up ancestors > belong to root, so we can discount them via the kill test. So a > false PID match cannot lead to failing to start. (You still have to > be a bit careful about the form of the shell command, or there might > be an intermediate postgres-owned shell process.) > > On the other hand, if you do > su - postgres sh -c "pg_ctl ..." > then the postmaster's parent process is pg_ctl, and its grandparent > is a postgres-owned shell process, and it cannot tell that > postgres-owned shell process apart from a genuine conflicting > postmaster. So a chance match of the shell process's PID to what is in > the leftover postmaster.pid file will force it to refuse to start. > And that chance match is not a low probability --- in my experience > it's one in ten or worse, in a reasonably stable system environment. > > You can imagine various workarounds involving having pg_ctl pass down > its parent's PID, but you'll still get screwed if the initscript author > is careless about how many levels of postgres-owned shell process there > are. The long and the short of it is that it's best to not use pg_ctl. > As mentioned, it doesn't buy much of anything for an initscript anyway. > > These considerations don't apply to ordinary hand launching of the > postmaster, for the primary reason that the chance of a false PID match > is several orders of magnitude smaller when you're talking about a > manual restart --- the likely postmaster PID now ranges over the whole > PID space instead of being within a few counts of the same thing. So we > don't need to discourage people from using pg_ctl for ordinary restarts. > The whole thing is really only a problem for initscript authors (who all > know about it by now ;-)) > > regards, tom lane > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"David E. Wheeler" <david@kineticode.com> writes: > Nice summary, Tom. Do the distro packagers know this, though? All the active ones I know of learned it the hard way, or were paying attention when someone else did. Still, it wouldn't be a bad thing for us to document it somewhere. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > The problem is that after a system crash and reboot, an old > postmaster.pid file might be left behind. The postmaster can only > safely remove this lock file if it is *certain* that it doesn't > represent another live postmaster process. Otherwise it is honor- > bound to commit hara-kiri instead of starting up. It can tell > whether or not the PID in the file belongs to a live process and > whether that process belongs to the postgres userid (by attempting > kill(PID, 0) and seeing what it gets). If not, it can remove the > file with a clear conscience. Right -- we did run into this in spades when our backup server, running dozens of instances of PostgreSQL in "warm standby" to confirm the integrity of the files received, crashed hard. I wasn't sure if this was the problem being addressed. One obvious solution, which we now rigorously observe, is to use a different OS user for each PostgreSQL instance. I assume that pg_ctl is safe in such an environment? > The long and the short of it is that it's best to not use pg_ctl. > As mentioned, it doesn't buy much of anything for an initscript > anyway. It must buy something in our environment, because our attempts to use the sample script with minimal modification were problematic. Unfortunately I forget the details, but our problems vanished when we switched to pg_ctl. (Well, except for that one unfortunate episode mentioned above.) > The whole thing is really only a problem for initscript authors (who > all know about it by now ;-)) Well, one of them (at least) didn't quite understand the whole issue until receiving your email. Thanks for the clear description. -Kevin
On Wed, Aug 19, 2009 at 10:03 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > What this all leads to is that it's safe to launch a postmaster from > an init script via something like > su - postgres sh -c "postmaster ..." Surely you don't want "-"? If you run postgres's .profile etc. then random user customization for the postgres user could interfere with your startup process. -- greg http://mit.edu/~gsstark/resume.pdf
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Right -- we did run into this in spades when our backup server, > running dozens of instances of PostgreSQL in "warm standby" to confirm > the integrity of the files received, crashed hard. I wasn't sure if > this was the problem being addressed. One obvious solution, which we > now rigorously observe, is to use a different OS user for each > PostgreSQL instance. I assume that pg_ctl is safe in such an > environment? Well, using a different user per instance is a good idea because then the safety analysis I gave holds rigorously for each instance. It doesn't get you out of the problem by itself, because the problem as described can happen with just one instance. > It must buy something in our environment, because our attempts to use > the sample script with minimal modification were problematic. > Unfortunately I forget the details, but our problems vanished when we > switched to pg_ctl. (Well, except for that one unfortunate episode > mentioned above.) Hmm. As stated, I would expect pg_ctl to make it worse. It would be interesting to have a closer look at your before-and-after scripts. regards, tom lane
Greg Stark <gsstark@mit.edu> writes: > On Wed, Aug 19, 2009 at 10:03 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> What this all leads to is that it's safe to launch a postmaster from >> an init script via something like >> su - postgres sh -c "postmaster ..." > Surely you don't want "-"? If you run postgres's .profile etc. then > random user customization for the postgres user could interfere with > your startup process. Well, people who put random things into the postgres user's .profile deserve what they get. But it is useful to be able to customize the postmaster's PATH and other variables. As far as I know, all the Linux distros do use "su -" or equivalent, and so do the contrib start-scripts. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Well, using a different user per instance is a good idea because > then the safety analysis I gave holds rigorously for each instance. > It doesn't get you out of the problem by itself, because the problem > as described can happen with just one instance. Oh, right -- it does let PostgreSQL automatically deal with the file left by a different instance, but could still fail on it's own file. >> It must buy something in our environment, because our attempts to >> use the sample script with minimal modification were problematic. >> Unfortunately I forget the details, but our problems vanished when >> we switched to pg_ctl. (Well, except for that one unfortunate >> episode mentioned above.) > > Hmm. As stated, I would expect pg_ctl to make it worse. It would > be interesting to have a closer look at your before-and-after > scripts. I don't remember whether the problem had anything to do with the lock files; it could have been that pg_ctl was doing something else which worked better for us than the direct invocation of postmaster, at lease with the options we were using. I'll experiment and see what happens in a test environment. I don't think we saved our failed attempts from years back. Also, I was quite green with Linux at the time, and it was my first initscript tinkering -- it's not outside the range of possibility that I did something dumb with the direct attempt. That said, knowing so little about what I was doing with it had me starting from the point of changing as little as I could manage from the sample to try to get it going. -Kevin
I wrote: > Oh, right -- it does let PostgreSQL automatically deal with the file > left by a different instance, but could still fail on it's own file. Er, it does let PostgreSQL automatically deal with a different instance using the PID matching what this instance left in its file, but could be a problem if something else under this same user ID, like a higher level script above pg_ctl, is now using the PID. <sigh /> -Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hmm. As stated, I would expect pg_ctl to make it worse. I've been playing with this, and I think the problem was that we wanted a non-zero exit from the script if the start failed. That's trivial with pg_ctl -w but not running postgres directly. I guess I could run pg_ctl status in a loop after the start. The reason is that we don't want certain other processes attempting to start until and unless the database they use has started successfully. -Kevin
Kevin Grittner wrote: > Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> Hmm. As stated, I would expect pg_ctl to make it worse. >> > > I've been playing with this, and I think the problem was that we > wanted a non-zero exit from the script if the start failed. That's > trivial with pg_ctl -w but not running postgres directly. I guess I > could run pg_ctl status in a loop after the start. > > The reason is that we don't want certain other processes attempting to > start until and unless the database they use has started successfully. > Have you looked at what the Fedora script does? Here's a snippet from my F11 system: $SU -l postgres -c "$PGENGINE/postmaster -p '$PGPORT' -D '$PGDATA' ${PGOPTS} &" >> "$PGLOG" 2>&1 < /dev/null sleep 2 pid=`pidof -s "$PGENGINE/postmaster"` if [$pid ] && [ -f "$PGDATA/postmaster.pid" ] then success "$PSQL_START" touch /var/lock/subsys/${NAME} head -n 1 "$PGDATA/postmaster.pid" > "/var/run/postmaster.${PGPORT}.pid" echo else failure "$PSQL_START" echo script_result=1 fi Doesn't seem that much harder than using pg_ctl. cheers andrew
Kevin Grittner wrote: > The reason is that we don't want certain other processes attempting to > start until and unless the database they use has started successfully. This is something we're not quite ready on, yet. We need some mechanism that allows scripts to verify not only that postmaster started, but also that it has finished recovery. You can sort-of do it by attempting a connection and checking the error message, but it's ugly. There was talk about a pg_ping utility years ago, but nobody got around to writing it ... -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Kevin Grittner wrote: > > >> The reason is that we don't want certain other processes attempting to >> start until and unless the database they use has started successfully. >> > > This is something we're not quite ready on, yet. We need some mechanism > that allows scripts to verify not only that postmaster started, but also > that it has finished recovery. You can sort-of do it by attempting a > connection and checking the error message, but it's ugly. There was > talk about a pg_ping utility years ago, but nobody got around to writing > it ... > Can't you use pg_controldata to see whether it is in recovery or not? Seems like you've got a way to see if it's running, seeing if it is in recovery should therefore be pretty straightforward, no? -- Chander Ganesan Open Technology Group, Inc. One Copley Parkway, Suite 210 Morrisville, NC 27560 919-463-0999/877-258-8987 http://www.otg-nc.com
Andrew Dunstan <andrew@dunslane.net> writes: > Here's a snippet from my F11 system: > $SU -l postgres -c "$PGENGINE/postmaster -p '$PGPORT' -D > '$PGDATA' ${PGOPTS} &" >> "$PGLOG" 2>&1 < /dev/null > sleep 2 > pid=`pidof -s "$PGENGINE/postmaster"` > if [ $pid ] && [ -f "$PGDATA/postmaster.pid" ] > then > success "$PSQL_START" Of course, this is a complete kluge --- it assumes the postmaster will create its pidfile in less than two seconds. And for that matter, it's not very proof against the case of a pre-existing postmaster. But in any case, it (intentionally) doesn't wait for the postmaster to be ready to accept connections, so it's not solving Kevin's problem. regards, tom lane
Chander Ganesan <chander@otg-nc.com> wrote: > Alvaro Herrera wrote: >> Kevin Grittner wrote: >> >>> The reason is that we don't want certain other processes >>> attempting to start until and unless the database they use has >>> started successfully. >> >> This is something we're not quite ready on, yet. We need some >> mechanism that allows scripts to verify not only that postmaster >> started, but also that it has finished recovery. You can sort-of >> do it by attempting a connection and checking the error message, >> but it's ugly. There was talk about a pg_ping utility years ago, >> but nobody got around to writing it ... >> > Can't you use pg_controldata to see whether it is in recovery or > not? Seems like you've got a way to see if it's running, seeing if > it is in recovery should therefore be pretty straightforward, no? Thanks Andrew, Alvaro, and Chander. You've given me some thoughts to toss around. Of course, any of these is going to be somewhat more complex than using "set -e" and the following lines: case $1 in start) echo -n "Starting PostgreSQL: " su - $PGUSER -c "$PGCTL start -w -D '$PGDATA' -l '$PGLOG'" echo "ok" ;; But that's OK, if it yields better results. :-) -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Thanks Andrew, Alvaro, and Chander. You've given me some thoughts to > toss around. Of course, any of these is going to be somewhat more > complex than using [ pg_ctl -w ] Yeah. I wonder if we shouldn't expend a bit more effort to make that way bulletproof. As I mentioned the other day, if there were a way for pg_ctl to pass down its parent's PID then we could have the postmaster exclude that as a false match, and then using pg_ctl would be just as safe as invoking the postmaster directly. The two ways I can see to do that are to add a command line switch to the postmaster, or to pass the PID as an environment variable, say "PG_GRANDPARENT_PID". The latter is a bit uglier but it would require touching much less code (and documentation). Thoughts? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Of course, this is a complete kluge --- it assumes the postmaster > will create its pidfile in less than two seconds. And for that > matter, it's not very proof against the case of a pre-existing > postmaster. But in any case, it (intentionally) doesn't wait for > the postmaster to be ready to accept connections, so it's not > solving Kevin's problem. Ah, well, it seems I don't have to spend a lot of time in close review of that script. Thanks Tom. To be a little more explicit, we're counting on the LSB dependencies to make sure that things start and stop in the right order, and wait until the time is right. (It seemed pointless to re-invent that wheel when Linux would do all the work of tracking dependencies and ordering things correctly if we just emit a meaningful exit code from each script....) Any thoughts on a best approach or a TODO item? -Kevin
Chander Ganesan wrote: > Alvaro Herrera wrote: > >This is something we're not quite ready on, yet. We need some mechanism > >that allows scripts to verify not only that postmaster started, but also > >that it has finished recovery. You can sort-of do it by attempting a > >connection and checking the error message, but it's ugly. There was > >talk about a pg_ping utility years ago, but nobody got around to writing > >it ... > Can't you use pg_controldata to see whether it is in recovery or > not? Seems like you've got a way to see if it's running, seeing if > it is in recovery should therefore be pretty straightforward, no? That's within my definition of "ugly", yes :-) My ideal tool would do something like $ pg_ping -h foo -p 5555 IN_RECOVERY $ echo $? 2 $ # sleep a bit ... $ pg_ping -h foo -p 5555 READY $ echo $? 0 -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Tom Lane <tgl@sss.pgh.pa.us> wrote: > The two ways I can see to do that are to add a command line switch > to the postmaster, or to pass the PID as an environment variable, > say "PG_GRANDPARENT_PID". The latter is a bit uglier but it would > require touching much less code (and documentation). > > Thoughts? You're thinking that pg_ctl would capture it's parent PID and pass it to the postmaster one way or the other? That seems like it covers the specific issue you were referencing up-thread. It has been bubbling around in my head that we have other processes which run under the same user ID for such things as vacuum and purge scripts, as well as rsync of backup files. These would still create some risk of a false match, right? Just a much smaller risk? I was tinkering with the idea of having the init script use lsof to pin it down more precisely. Does that sound remotely sane? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > You're thinking that pg_ctl would capture it's parent PID and pass it > to the postmaster one way or the other? That seems like it covers the > specific issue you were referencing up-thread. It has been bubbling > around in my head that we have other processes which run under the > same user ID for such things as vacuum and purge scripts, as well as > rsync of backup files. These would still create some risk of a false > match, right? Just a much smaller risk? Only if they are running at times when your postmaster(s) aren't ... realistically, unless you launch them from initscripts that start before your postmasters launch, I don't think there's going to be a problem. Still, just from a security point of view, it might be better if those don't run as the postgres operating-system user. Not sure if that's workable for rsync (since it has to be able to read the postgres files) but stuff like vacuum scripts could surely be run from a different userid. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> wrote: > That's within my definition of "ugly", yes :-) My ideal tool would > do something like > > $ pg_ping -h foo -p 5555 > IN_RECOVERY > $ echo $? > 2 > > $ # sleep a bit ... > > $ pg_ping -h foo -p 5555 > READY > $ echo $? > 0 Cool, but how would you do that without bypassing authentication? If you do bypass authentication, wouldn't that be sort of a big target for denial of service attacks? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Alvaro Herrera <alvherre@commandprompt.com> wrote: >> That's within my definition of "ugly", yes :-) My ideal tool would >> do something like >> >> $ pg_ping -h foo -p 5555 >> IN_RECOVERY >> $ echo $? >> 2 >> >> $ # sleep a bit ... >> >> $ pg_ping -h foo -p 5555 >> READY >> $ echo $? >> 0 > Cool, but how would you do that without bypassing authentication? The postmaster is already set up so that if it's not accepting connections, it'll tell you so before getting to the authentication stage. If anyone were to go to the trouble of inventing pg_ping, I'd be a bit inclined to add something to the postmaster protocol so that a ping packet didn't have to look just like a phony login attempt. But that's just so that people who log connection attempts won't get all flustered. The capability exists anyway. regards, tom lane
Kevin Grittner wrote: > Alvaro Herrera <alvherre@commandprompt.com> wrote: > > > That's within my definition of "ugly", yes :-) My ideal tool would > > do something like > > > > $ pg_ping -h foo -p 5555 > > IN_RECOVERY > > $ echo $? > > 2 > > > > $ # sleep a bit ... > > > > $ pg_ping -h foo -p 5555 > > READY > > $ echo $? > > 0 > > Cool, but how would you do that without bypassing authentication? If > you do bypass authentication, wouldn't that be sort of a big target > for denial of service attacks? The startup sequence lets you know that it's going to reject the connection before checking the auth credentials; see ProcessStartupPacket. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Tom Lane wrote: > But in > any case, it (intentionally) doesn't wait for the postmaster to be ready > to accept connections, so it's not solving Kevin's problem. > > > Maybe we need a --wait mode for "pg_ctl status" that would test connecting to the database the same way it does on start/restart. cheers andrew
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Only if they are running at times when your postmaster(s) aren't ... Well, those rsync scripts for pushing the PITR base backup and the WAL stream to other machines are crontab jobs which kick off once per minute. > Still, just from a security point of view, it might be better if > those don't run as the postgres operating-system user. Not sure if > that's workable for rsync (since it has to be able to read the > postgres files) We have some things running as that user which wouldn't have to be. You're right; we should change that. I don't see any way around it for generating the PITR base backup on the machine, though. > stuff like vacuum scripts could surely be run from a different > userid. My first thought was "they have to run as the database superuser." (In our case, that is the same as the OS user running the cluster.) But that's wrong, of course. They have to run as *a* database superuser. We could fix that. Still, this seems like it's not as deterministic as it should be. Is there any reasonable way to pin it down beyond the PID? Like also saving a start time into the postmaster.pid file and checking that it maches the start time of the PID found? Or, as I suggested before, checking that the potentially conflicting PID has some file open which would always be open if it was indeed a previous postmaster? It seems like there ought to be some way to get better traction on this.... -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> stuff like vacuum scripts could surely be run from a different >> userid. > My first thought was "they have to run as the database superuser." > (In our case, that is the same as the OS user running the cluster.) > But that's wrong, of course. They have to run as *a* database > superuser. We could fix that. Well, the database userid needn't be the same as the OS userid. Even if you're using ident auth, you could provide an identmap to allow the vacuum-user to log in as the database superuser. > Still, this seems like it's not as deterministic as it should be. Is > there any reasonable way to pin it down beyond the PID? Like also > saving a start time into the postmaster.pid file and checking that it > maches the start time of the PID found? How would you get the latter in a portable fashion? (Do not mention ps please ... and I don't want to hear about lsof either ...) regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> Still, this seems like it's not as deterministic as it should be. >> Is there any reasonable way to pin it down beyond the PID? Like >> also saving a start time into the postmaster.pid file and checking >> that it maches the start time of the PID found? > > How would you get the latter in a portable fashion? (Do not mention > ps please ... and I don't want to hear about lsof either ...) Unfortunately I don't know enough about the relevant API to begin to answer that. (I barely managed to frame the question....) You wouldn't object to using either of those in a Linux service script, though, would you? -Kevin
Kevin Grittner wrote: > You wouldn't object to using either of those in a Linux service > script, though, would you? Yeah, operating-system-specific init scripts do not need to be portable :-) Of course, they need to work across a wide range of Linux systems ... -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> How would you get the latter in a portable fashion? (Do not mention >> ps please ... and I don't want to hear about lsof either ...) > You wouldn't object to using either of those in a Linux service > script, though, would you? No, certainly not, but I was thinking in terms of something that we could put into the postmaster. In general I'd not recommend that an init script go messing with the contents of the postmaster.pid file, which it would have to do to have any of this logic in the script. regards, tom lane
Aidan Van Dyk <aidan@highrise.ca> writes: > Can postmaster keep a exclusive lock on its own pid file the entire time it's > running? That's been discussed, but file locking isn't all that portable or trustworthy :-( regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [090825 18:43]: > How would you get the latter in a portable fashion? (Do not mention > ps please ... and I don't want to hear about lsof either ...) Can postmaster keep a exclusive lock on its own pid file the entire time it's running? If you can open it and lock it before getting the PID out of it, you know it's not an active postmaster... It may be a pid that corresponds with something else, or it may be a pid of a racing postmaster, but since you have the exclusive lock, it's not going to get that lock... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Tom Lane wrote: > Aidan Van Dyk <aidan@highrise.ca> writes: > >> Can postmaster keep a exclusive lock on its own pid file the entire time it's >> running? >> > > That's been discussed, but file locking isn't all that portable or > trustworthy :-( > > regards, tom lane > What about having a special request via the socket file asking for its data directory (not over the network)? Or even to have it just update the timestamp on its pid file... -- Chander Ganesan Open Technology Group, Inc. One Copley Parkway, Suite 210 Morrisville, NC 27560 877-258-8987/919-463-0999 http://www.otg-nc.com
Tom Lane <tgl@sss.pgh.pa.us> wrote: > In general I'd not recommend that an init script go messing with the > contents of the postmaster.pid file, which it would have to do to > have any of this logic in the script. But LSB specifically provides the pidofproc function to extract the pid info. http://refspecs.freestandards.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptfunc.html This brings me back round to what I was looking at recently -- the possibility of trying to make an LSB-conforming init script for PostgreSQL. I'm having a lot of trouble, though, trying to get either the postmaster or pg_ctl to behave well with the start_daemon function implementations available to me. Is there a fundamental mismatch there, or am I probably just missing some crucial detail? (The killproc function seems to work just fine, however, as long as I use the -p switch and give it the right signal.) And there's the usual question: is there interest in such a script? -Kevin
Kevin Grittner wrote: > Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> In general I'd not recommend that an init script go messing with the >> contents of the postmaster.pid file, which it would have to do to >> have any of this logic in the script. >> > > But LSB specifically provides the pidofproc function to extract the > pid info. > I think Tom meant that you don't want to modify the contents of that file (or its timestamp). Reading from the file (and using the data there) is a whole different story, and is - in part - why the file exists in the first place. > > http://refspecs.freestandards.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptfunc.html > > This brings me back round to what I was looking at recently -- the > possibility of trying to make an LSB-conforming init script for > PostgreSQL. I'm having a lot of trouble, though, trying to get either > the postmaster or pg_ctl to behave well with the start_daemon function > implementations available to me. Is there a fundamental mismatch > there, or am I probably just missing some crucial detail? (The > killproc function seems to work just fine, however, as long as I use > the -p switch and give it the right signal.) > > And there's the usual question: is there interest in such a script? > The script is undoubtedly useful, if nothing more than to provide a template for Linux distros. However, I think by itself it is used quite broadly by admins that choose to install from source rather than using a pre-packaged distribution. Some time ago I started work on a HeartBeat/OpenAIS resource management script for PostgreSQL to integrate it more closely with HeartbeatV2 (to support resource monitoring on a standby, auto-setup of standby node, etc.) but haven't worked on it since hitting a couple walls that are somewhat related to this issue. Having a better (and foolproof) start/stop LSB script would definitely help that project when I get back to it.. -- Chander Ganesan Open Technology Group, Inc. One Copley Parkway, Suite 210 Morrisville, NC 27560 919-463-0999/877-258-8987 http://www.otg-nc.com
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > This brings me back round to what I was looking at recently -- the > possibility of trying to make an LSB-conforming init script for > PostgreSQL. I'm having a lot of trouble, though, trying to get either > the postmaster or pg_ctl to behave well with the start_daemon function > implementations available to me. Is there a fundamental mismatch > there, or am I probably just missing some crucial detail? start_daemon doesn't provide for switching to a non-root userid according to that spec, so it seems like *it's* missing a crucial detail. The functions seem woefully underspecified anyway, eg it's unclear if killproc is supposed to wait to see whether the daemon terminates. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > start_daemon doesn't provide for switching to a non-root userid > according to that spec, so it seems like *it's* missing a crucial > detail. Hmmm... I didn't see anything requiring that it only by run by root. Do you see something that suggests that it must be? I've been making my attempts as the owner of the cluster, assuming that the init script would su to that before calling that function. > it's unclear if killproc is supposed to wait to see whether the > daemon terminates. I think it's pretty clear, in the context of the whole spec, that it should give it *some* time to make the attempt, since it says that it should return zero if the daemon was successfully terminated and non-zero otherwise. I think that also suggests that it doesn't expect you to wait forever. You do kinda have to read between the lines and take the spec from a "holistic" perspective, I'll grant you that.... -Kevin
I wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> Thanks Andrew, Alvaro, and Chander. You've given me some thoughts to >> toss around. Of course, any of these is going to be somewhat more >> complex than using [ pg_ctl -w ] > Yeah. I wonder if we shouldn't expend a bit more effort to make that > way bulletproof. As I mentioned the other day, if there were a way for > pg_ctl to pass down its parent's PID then we could have the postmaster > exclude that as a false match, and then using pg_ctl would be just as > safe as invoking the postmaster directly. > The two ways I can see to do that are to add a command line switch > to the postmaster, or to pass the PID as an environment variable, > say "PG_GRANDPARENT_PID". The latter is a bit uglier but it would > require touching much less code (and documentation). Attached is a simple patch that uses the environment-variable approach. This is a whole lot more self-contained than what would be needed to pass the PID as an explicit switch, so I'm inclined to do it this way. You could argue that a bad guy could confuse matters by intentionally passing an existing postmaster's PID in this variable --- but someone with the ability to launch the postmaster can probably also remove an existing lockfile, so I don't think there's a credible security risk. Any objections? regards, tom lane Index: src/backend/utils/init/miscinit.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/init/miscinit.c,v retrieving revision 1.176 diff -c -r1.176 miscinit.c *** src/backend/utils/init/miscinit.c 12 Aug 2009 20:53:30 -0000 1.176 --- src/backend/utils/init/miscinit.c 26 Aug 2009 23:24:56 -0000 *************** *** 683,689 **** int len; int encoded_pid; pid_t other_pid; ! pid_t my_pid = getpid(); /* * We need a loop here because of race conditions. But don't loop forever --- 683,728 ---- int len; int encoded_pid; pid_t other_pid; ! pid_t my_pid, ! my_p_pid, ! my_gp_pid; ! const char *envvar; ! ! /* ! * If the PID in the lockfile is our own PID or our parent's or ! * grandparent's PID, then the file must be stale (probably left over from ! * a previous system boot cycle). We need to check this because of the ! * likelihood that a reboot will assign exactly the same PID as we had in ! * the previous reboot, or one that's only one or two counts larger and ! * hence the lockfile's PID now refers to an ancestor shell process. We ! * allow pg_ctl to pass down its parent shell PID (our grandparent PID) ! * via the environment variable PG_GRANDPARENT_PID; this is so that ! * launching the postmaster via pg_ctl can be just as reliable as ! * launching it directly. There is no provision for detecting ! * further-removed ancestor processes, but if the init script is written ! * carefully then all but the immediate parent shell will be root-owned ! * processes and so the kill test will fail with EPERM. Note that we ! * cannot get a false negative this way, because an existing postmaster ! * would surely never launch a competing postmaster or pg_ctl process ! * directly. ! */ ! my_pid = getpid(); ! ! #ifndef WIN32 ! my_p_pid = getppid(); ! #else ! /* ! * Windows hasn't got getppid(), but doesn't need it since it's not ! * using real kill() either... ! */ ! my_p_pid = 0; ! #endif ! ! envvar = getenv("PG_GRANDPARENT_PID"); ! if (envvar) ! my_gp_pid = atoi(envvar); ! else ! my_gp_pid = 0; /* * We need a loop here because of race conditions. But don't loop forever *************** *** 745,761 **** /* * Check to see if the other process still exists * ! * If the PID in the lockfile is our own PID or our parent's PID, then ! * the file must be stale (probably left over from a previous system ! * boot cycle). We need this test because of the likelihood that a ! * reboot will assign exactly the same PID as we had in the previous ! * reboot. Also, if there is just one more process launch in this ! * reboot than in the previous one, the lockfile might mention our ! * parent's PID. We can reject that since we'd never be launched ! * directly by a competing postmaster. We can't detect grandparent ! * processes unfortunately, but if the init script is written ! * carefully then all but the immediate parent shell will be ! * root-owned processes and so the kill test will fail with EPERM. * * We can treat the EPERM-error case as okay because that error * implies that the existing process has a different userid than we --- 784,794 ---- /* * Check to see if the other process still exists * ! * Per discussion above, my_pid, my_p_pid, and my_gp_pid can be ! * ignored as false matches. ! * ! * Normally kill() will fail with ESRCH if the given PID doesn't ! * exist. * * We can treat the EPERM-error case as okay because that error * implies that the existing process has a different userid than we *************** *** 772,789 **** * Unix socket file belonging to an instance of Postgres being run by * someone else, at least on machines where /tmp hasn't got a * stickybit.) - * - * Windows hasn't got getppid(), but doesn't need it since it's not - * using real kill() either... - * - * Normally kill() will fail with ESRCH if the given PID doesn't - * exist. */ ! if (other_pid != my_pid ! #ifndef WIN32 ! && other_pid != getppid() ! #endif ! ) { if (kill(other_pid, 0) == 0 || (errno != ESRCH && errno != EPERM)) --- 805,813 ---- * Unix socket file belonging to an instance of Postgres being run by * someone else, at least on machines where /tmp hasn't got a * stickybit.) */ ! if (other_pid != my_pid && other_pid != my_p_pid && ! other_pid != my_gp_pid) { if (kill(other_pid, 0) == 0 || (errno != ESRCH && errno != EPERM)) Index: src/bin/pg_ctl/pg_ctl.c =================================================================== RCS file: /cvsroot/pgsql/src/bin/pg_ctl/pg_ctl.c,v retrieving revision 1.111 diff -c -r1.111 pg_ctl.c *** src/bin/pg_ctl/pg_ctl.c 11 Jun 2009 14:49:07 -0000 1.111 --- src/bin/pg_ctl/pg_ctl.c 26 Aug 2009 23:24:56 -0000 *************** *** 672,677 **** --- 672,692 ---- unlimit_core_size(); #endif + /* + * If possible, tell the postmaster our parent shell's PID (see the + * comments in CreateLockFile() for motivation). Windows hasn't got + * getppid() unfortunately. + */ + #ifndef WIN32 + { + static char env_var[32]; + + snprintf(env_var, sizeof(env_var), "PG_GRANDPARENT_PID=%d", + (int) getppid()); + putenv(env_var); + } + #endif + exitcode = start_postmaster(); if (exitcode != 0) {
On Thu, Aug 27, 2009 at 12:32 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Attached is a simple patch that uses the environment-variable approach. > This is a whole lot more self-contained than what would be needed to > pass the PID as an explicit switch, so I'm inclined to do it this way. > You could argue that a bad guy could confuse matters by intentionally > passing an existing postmaster's PID in this variable --- but someone > with the ability to launch the postmaster can probably also remove an > existing lockfile, so I don't think there's a credible security risk. > > Any objections? So with this change you would have the startup script not remove the lock file? Postgres would refuse to start up if there was another process with the same uid running as long as it wasn't pg_ctl or the parent of pg_ctl? This could still fail if the startup script runs some other commands with & to background them and those commands happen to land with the pid of postgres? Or the startup script runs pg_ctl within a ( ) subshell? Or are you suggesting making this change but then still recommending the startup script you have now. So this would just be to make the problem less severe for people who use pg_ctl without being aware of the problem? -- greg http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes: > On Thu, Aug 27, 2009 at 12:32 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Attached is a simple patch that uses the environment-variable approach. > So with this change you would have the startup script not remove the > lock file? Huh? The startup script shouldn't *ever* remove the lock file. That's been true all along, and this doesn't change it. > This could still fail if the startup script runs some other commands > with & to background them and those commands happen to land with the > pid of postgres? Or the startup script runs pg_ctl within a ( ) > subshell? Yup, and that's been true all along too. This patch makes it possible to write a safe initscript that uses pg_ctl --- it doesn't make it impossible to write an unsafe one. In practice, the situations where people would need to write unsafe constructs have been largely eliminated anyway. Before we had a builtin syslogger process, people often wanted to do something like su - postgres -c "postmaster | logrotate" which is quite unsafe because there's probably an intermediate shell process. No need for that anymore. But notice this is just as unsafe whether you use pg_ctl or postmaster directly ... regards, tom lane
On Thu, Aug 27, 2009 at 1:01 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> So with this change you would have the startup script not remove the >> lock file? > > Huh? The startup script shouldn't *ever* remove the lock file. > That's been true all along, and this doesn't change it. I thought that was the whole difference between using pg_ctl to start up and the initscripts. Since the init script knows it's starting something up at boot time it knows any lock files are left over and might contain the pid of one of the other processes the init script starts. I did have another thought. It could compare the time from uptime to the timestamp on the lock file. If the server's been restarted since the time in the lock file then it must be stale. uhm. unless clock's been changed... -- greg http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes: > On Thu, Aug 27, 2009 at 1:01 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Huh? �The startup script shouldn't *ever* remove the lock file. >> That's been true all along, and this doesn't change it. > I thought that was the whole difference between using pg_ctl to start > up and the initscripts. Since the init script knows it's starting > something up at boot time it knows any lock files are left over and > might contain the pid of one of the other processes the init script > starts. An initscript knows no such thing. It's very standard to do, eg, "service postgresql stop" and "service postgresql start" to restart a service. If initscripts were coded as you suggest, all it would take is accidentally reversing the order of those commands to thoroughly scramble your database. > I did have another thought. It could compare the time from uptime to > the timestamp on the lock file. If the server's been restarted since > the time in the lock file then it must be stale. uhm. unless clock's > been changed... Yeah, you can't trust system clocks too much either :-(. I was actually having second thoughts about the idea of using file locking. The only environment in which I've heard of file locks not being trustworthy is NFS, and if you're running a DB on NFS you've probably got worse problems than this one. Notably, if you mistakenly try to start postmasters on two different machines against the same NFS-mounted directory, the PID-based interlock will certainly fail, while file locking might save you. So maybe we should take another look at that. Has anyone heard of other contexts in which file locks don't work? Has Windows got them? But in the meanwhile, I think this patch is a step forward and won't break anything that works now. regards, tom lane
Tom Lane wrote: > > I was actually having second thoughts about the idea of using file > locking. The only environment in which I've heard of file locks not > being trustworthy is NFS, and if you're running a DB on NFS you've > probably got worse problems than this one. Notably, if you mistakenly > try to start postmasters on two different machines against the same > NFS-mounted directory, the PID-based interlock will certainly fail, while > file locking might save you. So maybe we should take another look at > that. Has anyone heard of other contexts in which file locks don't > work? Has Windows got them? > > > Yes. But they are mandatory rather than advisory, I believe. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Has anyone heard of other contexts in which file locks don't >> work? Has Windows got them? > Yes. But they are mandatory rather than advisory, I believe. Probably wouldn't matter for our purposes? I guess what we'd need is a writer's lock that still allows other processes to read the file. regards, tom lane
On Thu, Aug 27, 2009 at 02:38, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> I did have another thought. It could compare the time from uptime to >> the timestamp on the lock file. If the server's been restarted since >> the time in the lock file then it must be stale. uhm. unless clock's >> been changed... > > Yeah, you can't trust system clocks too much either :-(. > > I was actually having second thoughts about the idea of using file > locking. The only environment in which I've heard of file locks not > being trustworthy is NFS, and if you're running a DB on NFS you've > probably got worse problems than this one. That is a bad generalization. A lot of people run their databases very successfully on NFS. It just requires that you have a good NFS server, know how to set it up, know how to set up your network, have a good NFS client and know how to set *it* up. Though I would assume that locks would be trustworthy in this case as well... > Notably, if you mistakenly > try to start postmasters on two different machines against the same > NFS-mounted directory, the PID-based interlock will certainly fail, while > file locking might save you. That's in no way limited to NFS though... The difference being that in a lot of other cases you just end up with a completely corrupt filesystem :) > So maybe we should take another look at > that. Has anyone heard of other contexts in which file locks don't > work? Has Windows got them? Certainly: http://msdn.microsoft.com/en-us/library/aa365202%28VS.85%29.aspx for example. -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/