Thread: How to shoot yourself in the foot: kill -9 postmaster
I have spent several days now puzzling over the corrupted WAL logfile that Scott Parish was kind enough to send me from a 7.1beta4 crash.  It looks a lot like two different series of transactions were getting written into the same logfile.  I'd been digging like mad in the WAL code to try to explain this as a buffer-management logic error, but after a fresh exchange of info it turns out that I was barking up the wrong tree.  There *were* two different series of transactions.

Specifically, here's what happened:

1. Scott (or actually his associate) shut down and restarted the postmaster using the /etc/rc.d/init.d/pgsql script that ships with our RPMs.  That script shuts down the old postmaster with

	killproc postmaster

It turns out that at least on Scott's machine (RedHat 6.1), the default kill level for the killproc function is kill -9.  (This is clearly a bad bug in the init script, but I digress.)

2. So, the old postmaster was killed with kill -9, but its child backends were still running.  The new postmaster will start up successfully because it'll think the old postmaster crashed, and so it will go through the usual recovery procedure.

3. Now we have two sets of backends running in different shmem blocks (7.0 might have choked on that part, but 7.1 doesn't care) and running different sets of transactions.  But they're writing to the same WAL log.  Result: guaranteed corruption of the log.

It actually took two iterations of this to expose the bug: the third attempted postmaster start went looking for the checkpoint record last written by the second one, which meanwhile had got overwritten by activity of the first backend set.

Now, killing the postmaster -9 and not cleaning up the backends has always been a good way to shoot yourself in the foot, but up to now the worst thing that was likely to happen to you was isolated corruption in specific tables.  In the brave new world of WAL the stakes are higher, because the system will refuse to start up if it finds a corrupted checkpoint record.  Clueless admins who resort to kill -9 as a routine admin tool *will* lose their databases.  Moreover, the init scripts that are running around now are dangerous weapons if used with 7.1.

I think we need a stronger interlock to prevent this scenario, but I'm unsure what it should be.  Ideas?

			regards, tom lane
At 3/5/2001 04:30 PM, you wrote:
>Now, killing the postmaster -9 and not cleaning up the backends has
>always been a good way to shoot yourself in the foot, but up to now the
>worst thing that was likely to happen to you was isolated corruption in
>specific tables.  In the brave new world of WAL the stakes are higher,
>because the system will refuse to start up if it finds a corrupted
>checkpoint record.  Clueless admins who resort to kill -9 as a routine
>admin tool *will* lose their databases.  Moreover, the init scripts
>that are running around now are dangerous weapons if used with 7.1.
>
>I think we need a stronger interlock to prevent this scenario, but I'm
>unsure what it should be.  Ideas?

Is there any way to see if the other (child) processes have a lock on the log file?

On a lot of systems, a daemon will record its PID in a file when it starts, so that it/'the admin' can do a 'shutdown' script with the PID listed.  Could child processes list themselves as child.PID in a configurable directory, and the starting process look for all of these and shut the "orphaned" child processes down?

Just thoughts...

Thomas
* Tom Lane <tgl@sss.pgh.pa.us> [010305 14:51] wrote:
>
> I think we need a stronger interlock to prevent this scenario, but I'm
> unsure what it should be.  Ideas?

Re having multiple postmasters active by accident:

The sysV IPC stuff has some hooks in it that may help you.  One idea is to check the 'struct shmid_ds' field 'shm_nattch': basically, at startup, if it's not 1 (or 0) then you have more than one postgresql instance messing with it and it should not proceed.

I'd also suggest looking into using sysV semaphores and the semundo stuff; afaik it can be used to track the number of consumers of a resource.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Tom Lane wrote:
> checkpoint record.  Clueless admins who resort to kill -9 as a routine
> admin tool *will* lose their databases.  Moreover, the init scripts
> that are running around now are dangerous weapons if used with 7.1.

Thanks for the heads-up, Tom.  Time to nix killproc and do something cleaner -- compatible, but cleaner.  I'll have to research what the defaults are for later RHs -- but, as 6.1 is one of my target platforms at this time, I have to fix that issue for sure.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
> Thanks for the heads-up, Tom.  Time to nix killproc and do something
> cleaner -- compatible, but cleaner.

As far as I could tell from the 6.1 scripts, it would work to do

	killproc postmaster -TERM

The problem is just that killproc has an overenthusiastic default...

			regards, tom lane
killproc should send a kill -15 to the process and wait a few seconds for it to exit.  If it does not, try kill -1, and if that doesn't kill it, then kill -9.

> Tom Lane wrote:
> > checkpoint record.  Clueless admins who resort to kill -9 as a routine
> > admin tool *will* lose their databases.  Moreover, the init scripts
> > that are running around now are dangerous weapons if used with 7.1.
>
> Thanks for the heads-up, Tom.  Time to nix killproc and do something
> cleaner -- compatible, but cleaner.  I'll have to research what the
> defaults are for later RHs -- but, as 6.1 is one of my target platforms
> at this time, I have to fix that issue for sure.
> --
> Lamar Owen
> WGCR Internet Radio
> 1 Peter 4:11

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Lamar Owen <lamar.owen@wgcr.org> writes:
> > Thanks for the heads-up, Tom.  Time to nix killproc and do something
> > cleaner -- compatible, but cleaner.
>
> As far as I could tell from the 6.1 scripts, it would work to do
>
> 	killproc postmaster -TERM

Yes, amazing it has a -9 default.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> killproc should send a kill -15 to the process and wait a few seconds for
> it to exit.  If it does not, try kill -1, and if that doesn't kill it,
> then kill -9.

Tell it to the Linux people ... this is their boot-script code we're talking about.

			regards, tom lane
Tom Lane wrote:
>
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > killproc should send a kill -15 to the process and wait a few seconds for
> > it to exit.  If it does not, try kill -1, and if that doesn't kill it,
> > then kill -9.
>
> Tell it to the Linux people ... this is their boot-script code we're
> talking about.

RedHat, in particular.  I can't vouch for any others.  On my RH 6.2 box, with initscripts-5.00-1 loaded, here's what killproc does if no killlevel is set (even though a default $killlevel is set to -9, it's not used in this code).  ($pid is the pid of the proc to kill, $base is the name of the proc, etc.)

	if [ "$notset" = "1" ] ; then
	    if ps h $pid >/dev/null 2>&1 ; then
		# TERM first, then KILL if not dead
		kill -TERM $pid
		usleep 100000
		if ps h $pid >/dev/null 2>&1 ; then
		    sleep 1
		    if ps h $pid >/dev/null 2>&1 ; then
			sleep 3
			if ps h $pid >/dev/null 2>&1 ; then
			    kill -KILL $pid
			fi
		    fi
		fi
	    fi
	    ps h $pid >/dev/null 2>&1
	    RC=$?
	    [ $RC -eq 0 ] && failure "$base shutdown" || success "$base shutdown"
	    RC=$((! $RC))
	# use specified level only
	else
	    if ps h $pid >/dev/null 2>&1 ; then
		kill $killlevel $pid
		RC=$?
		[ $RC -eq 0 ] && success "$base $killlevel" || failure "$base $killlevel"
	    fi
	fi

Is 6.1 this different from 6.2?  This code on the surface seems reasonable to me -- am I missing something?  The 6.2 code (found in /etc/rc.d/init.d/functions, for those who might not know where to find killproc) sets a default killlevel but never uses it -- ignorant but not stupid.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
> 	if [ "$notset" = "1" ] ; then
> 	    if ps h $pid >/dev/null 2>&1 ; then
> 		# TERM first, then KILL if not dead
> 		kill -TERM $pid
> 		usleep 100000
> 		if ps h $pid >/dev/null 2>&1 ; then
> 		    sleep 1
> 		    if ps h $pid >/dev/null 2>&1 ; then
> 			sleep 3
> 			if ps h $pid >/dev/null 2>&1 ; then
> 			    kill -KILL $pid
> 			fi
> 		    fi
> 		fi
> 	    fi

Yes, this seems like the proper way to do it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
>
> Now, killing the postmaster -9 and not cleaning up the backends has
> always been a good way to shoot yourself in the foot, but up to now the
> worst thing that was likely to happen to you was isolated corruption in
> specific tables.  In the brave new world of WAL the stakes are higher,
> because the system will refuse to start up if it finds a corrupted
> checkpoint record.  Clueless admins who resort to kill -9 as a routine
> admin tool *will* lose their databases.  Moreover, the init scripts
> that are running around now are dangerous weapons if used with 7.1.
>
> I think we need a stronger interlock to prevent this scenario, but I'm
> unsure what it should be.  Ideas?

Seems the simplest way is to inhibit starting postmaster if the pid file exists.  Another way is to use flock() if flock() is available.  We could flock() the pid file so that another postmaster could detect the lock of the file.

Regards,
Hiroshi Inoue
On Mon, Mar 05, 2001 at 08:55:41PM -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > killproc should send a kill -15 to the process and wait a few seconds for
> > it to exit.  If it does not, try kill -1, and if that doesn't kill it,
> > then kill -9.
>
> Tell it to the Linux people ... this is their boot-script code we're
> talking about.

Not to be a zealot, but this isn't _Linux_ boot-script code, it's _Red Hat_ boot-script code.  Red Hat would like for us all to confuse the two, but they jes' ain't the same.  (As a rule of thumb, where it works right, credit Linux; where it doesn't, blame Red Hat. :-)

Nathan Myers
ncm@zembu.com
Bruce Momjian wrote:
> > # TERM first, then KILL if not dead
> Yes, this seems like the proper way to do it.

Now to verify that 6.1 is the same....or different....  Hmmmm....  The mirrors of ftp.redhat.com (and, in fact, RedHat.com itself) no longer have the updates or the original for 6.1's initscripts-4.70 package.

Can a RedHat 6.1 user (using as close as possible to 6.1's release initscripts package) send me a copy of /etc/rc.d/init.d/functions, or verify how that initscripts package defines killproc?  I cannot at this moment locate my RH 6.1 SRPMS CD.  Found my RH _4_.1 CD, but that's just a _little_ old :-).
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Tom Lane wrote:
>> I think we need a stronger interlock to prevent this scenario, but I'm
>> unsure what it should be.  Ideas?

> Seems the simplest way is to inhibit starting postmaster
> if the pid file exists.

Then we're unable to recover from a crash without manual intervention.  The tricky part of this is not to give up the ability to restart when there *has* been a crash.

> Another way is to use flock() if flock() is available.
> We could flock() the pid file so that another postmaster
> could detect the lock of the file.

This would only work if every backend is holding flock on the file, which would mean they'd all have to keep it open all the time.  Kind of annoying to use up that many file descriptors on it.  Might be the best answer though; I haven't thought of anything I like better...

			regards, tom lane
Nathan Myers wrote:
> Not to be a zealot, but this isn't _Linux_ boot-script code, it's
> _Red Hat_ boot-script code.  Red Hat would like for us all to confuse
> the two, but they jes' ain't the same.  (As a rule of thumb, where it
> works right, credit Linux; where it doesn't, blame Red Hat. :-)

So we're going to credit Linux for PostgreSQL being shipped as part of the RedHat distribution since RH 5.0, then? :-0
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
> Is 6.1 this different from 6.2?

Scott sent me a copy of /etc/init.d/functions from his box, and it has largely the same behavior (I hadn't read the whole code to notice that it doesn't use the default killlevel...).  What's actually happening here is that the init script sends SIGTERM, and then SIGKILL four seconds later if the postmaster hasn't shut down yet.  Unfortunately, unless your clients are very short-lived, four seconds isn't going to be enough for a "polite" shutdown.  (It's pretty marginal even for an impolite one, since a checkpoint will take at least a couple of seconds.)

However, with an explicit kill level that doesn't happen: you get one signal of the specified value, no more.  Possibly it would be better for the init script to send SIGINT (forcibly disconnect clients) instead of SIGTERM, however.  So I'm now leaning to "killproc postmaster -INT".

			regards, tom lane
Lamar Owen <lamar.owen@wgcr.org> writes:
> Tom Lane wrote:
>> The tricky part of this is not to give up the ability to restart when
>> there *has* been a crash.

> But kill -9 effectively _is_ an admin-initiated crash.

Yeah, but only a partial crash.  If the admin finishes the job by killing the backends too, we're fine.  Postmaster down, backends alive is not a scenario we're currently prepared for.  We need a way to plug that gap.

			regards, tom lane
Tom Lane wrote:
> However, with an explicit kill level that doesn't happen: you get one
> signal of the specified value, no more.  Possibly it would be better for
> the init script to send SIGINT (forcibly disconnect clients) instead of
> SIGTERM, however.  So I'm now leaning to "killproc postmaster -INT".

Ok, since I can't seem to count on killproc's exact behavior, istm that I can:

	killproc postmaster -INT
	wait some number of seconds
	if postmaster still up
		killproc postmaster -TERM
		wait some number of seconds
		if postmaster STILL up
			killproc postmaster  # and let the grim reaper do its dirty work

After all, the system shutdown is relying on this script to properly and thoroughly shut things down, or it WILL do the 'kill -9 pid-of-postmaster' for you.

Now, what's a good delay here?  Or is there a better metric than a simple delay?  After all, I want to avoid the kill -9 unless we have an emergency hard lock situation -- what's a good indicator of the backend fleet of processes actually _doing_ something?  Or should I key on an indicator of processor speed (Linux does provide a nice bogus metric known as BogoMIPS for such a purpose)?  The last thing I want to do is wait too long on some platforms and not long enough on others.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
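To make the escalation concrete, it amounts to roughly the following (a minimal C sketch only, not the RPM script itself: the pidfile path, the 60-second waits, and the function name are illustrative assumptions, and the final kill -9 is deliberately left out per the discussion that follows):

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Poll with the null signal until the process exits or we give up. */
    static int wait_for_exit(pid_t pid, int seconds)
    {
        int i;
        for (i = 0; i < seconds; i++) {
            if (kill(pid, 0) != 0)
                return 1;           /* process is gone */
            sleep(1);               /* sleep 1 and loop, per Lamar's plan */
        }
        return 0;                   /* still alive */
    }

    int main(void)
    {
        /* Hypothetical pidfile location; the RPMs put PGDATA elsewhere
         * depending on configuration. */
        FILE *fp = fopen("/var/lib/pgsql/data/postmaster.pid", "r");
        long pid;

        if (fp == NULL || fscanf(fp, "%ld", &pid) != 1)
            return 1;               /* no pidfile: nothing to stop */
        fclose(fp);

        kill((pid_t) pid, SIGINT);  /* fast shutdown: disconnect clients */
        if (wait_for_exit((pid_t) pid, 60))
            return 0;

        kill((pid_t) pid, SIGTERM); /* next step in the escalation */
        if (wait_for_exit((pid_t) pid, 60))
            return 0;

        /* No SIGKILL here: see Tom's warnings about kill -9 above. */
        return 1;
    }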
Tom Lane wrote:
> The tricky part of this is not to give up the ability to restart when
> there *has* been a crash.

But kill -9 effectively _is_ an admin-initiated crash.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
> The last thing I want to do is
> wait too long on some platforms and not long enough on others.

The difficulty is to know how long the final checkpoint will take.  This depends on (at least) your hard disk speed and the number of dirty buffers, so I think you're going to have some difficulty estimating it with any reliability.  BogoMIPS won't help, for sure.  However, if you do SIGINT and then wait a few seconds, you can be fairly sure that all the extant backends are dead (if not frozen up...) and that the checkpoint is in progress.  That may be about the best you can do.

I do not agree that this script should take it on itself to kill -9 the postmaster.  Please note that the reason we're having this discussion at all is that the init script may be used for purposes other than system shutdown.  So the argument that "it's going to happen anyway" is wrong.

			regards, tom lane
> Ok, since I can't seem to count on killproc's exact behavior, istm that
> I can:
> 	killproc postmaster -INT
> 	wait some number of seconds
> 	if postmaster still up
> 		killproc postmaster -TERM
> 		wait some number of seconds
> 		if postmaster STILL up
> 			killproc postmaster  # and let the grim reaper do its dirty work
>
> After all, the system shutdown is relying on this script to properly and
> thoroughly shut things down, or it WILL do the 'kill -9
> pid-of-postmaster' for you.
>
> Now, what's a good delay here?  Or is there a better metric than a
> simple delay?  After all, I want to avoid the kill -9 unless we have an
> emergency hard lock situation -- what's a good indicator of the backend
> fleet of processes actually _doing_ something?  Or should I key on an
> indicator of processor speed (Linux does provide a nice bogus metric
> known as BogoMIPS for such a purpose)?  The last thing I want to do is
> wait too long on some platforms and not long enough on others.

In remembering how other databases handle it, I think you should use pg_ctl to shut it down.  You need to enable wait mode; not sure if that is the default or not.  That will wait for it to shut down before continuing.  I realize a hung shutdown would stop the kernel from shutting down.  You could put a sleep 100 in there and call a trap on a timeout.  Here is some shell code:

	TIME=60
	pg_ctl -w stop &
	BG="$!"; export BG
	(sleep "$TIME"; kill "$BG") &
	BG2="$!"; export BG2
	wait "$BG"
	if ! kill -0 "$BG2"
	then :	# watchdog already fired: pg_ctl was killed after $TIME seconds
	else kill "$BG2"
	fi

This will try a pg_ctl shutdown for 60 seconds, then kill pg_ctl.  You would then need a kill of your own.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
> Yeah, but only a partial crash.  If the admin finishes the job by
> killing the backends too, we're fine.  Postmaster down, backends alive
> is not a scenario we're currently prepared for.  We need a way to plug
> that gap.

Postmaster can easily enough find out if zombie backends are 'out there' during startup, right?  What can postmaster _do_ about it, though?  It won't necessarily be able to kill them -- but it also can't control them.  If it _can_ kill them, should it try?  After all, if those zombies are out there on this PGDATA there's going to be big trouble if we even try to start.  If we can't kill the zombies (that might still be doing something useful with their clients) from our starting postmaster, how can we possibly start up underneath running backends?

Should the backend look for the presence of its parent postmaster periodically and gracefully come down if postmaster goes away without the proper handshake?  A watchdog semaphore (or shared memory flag) that the backend resets and then checks periodically for it being set by its parent postmaster?

Should a set of backends detect a new postmaster coming up and try to 'sync up' with that postmaster, like the baroque GEMM handshake dance performed by 386 memory managers when Windows needs to start its own VMM?

Or should we spend that much time protecting Barney Fifes from their own single bullet? :-)

Just a nor'easter of a brainstorm....
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Tom Lane wrote:
> Please note that the reason we're having this discussion at
> all is that the init script may be used for purposes other than system
> shutdown.  So the argument that "it's going to happen anyway" is wrong.

Believe it or not, you just disproved your own statement that the initscript should not take it upon itself to issue the kill -9.  So, what if I issue '/etc/rc.d/init.d/postgresql restart' -- and backends don't go away during the 'stop' phase, while postmaster may actually have died?  Or is it even possible for postmaster to drop out with a running backend out there?

No, more is needed.  But I think a careful reap through the running backends to kill those that need killing if postmaster won't go down might be prudent.  Currently it is not possible to run multiple postmasters with the RPM install (I am working on that little problem, but it won't be for 7.1's RPMset yet), so all backends that are running on the RPM PGDATA location (which I am looking at making configurable as well) will belong to the one postmaster.  Of course, that would be an absolute last resort.

Oh well -- the real solution is elsewhere, anyway.  I just have to make sure it is not data-corruption broken.  And, if leaving the -9 out completely is the only solution, then, well, it's the only solution.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
> Tom Lane wrote:
>> Postmaster down, backends alive is not a scenario we're currently
>> prepared for.  We need a way to plug that gap.

> Postmaster can easily enough find out if zombie backends are 'out there'
> during startup, right?

If you think it's easy enough, enlighten the rest of us ;-).  Be sure your solution only finds leftover backends from the previous instance of the same postmaster, else it will prevent running multiple postmasters on one system.

> What can postmaster _do_ about it, though?  It
> won't necessarily be able to kill them -- but it also can't control
> them.  If it _can_ kill them, should it try?

I think refusal to start is sufficient.  They should go away by themselves as their clients disconnect, and forcing the issue doesn't seem like it will improve matters.  The admin can kill them (hopefully with just a SIGTERM ;-)) if he wants to move things along ... but I'd not like to see a newly-starting postmaster do that automatically.

> Should the backend look for the presence of its parent postmaster
> periodically and gracefully come down if postmaster goes away without
> the proper handshake?

Unless we checked just before every disk write, this wouldn't represent a safe failure mode.  The onus has to be on the newly-starting postmaster, I think, not on the old backends.

> Should a set of backends detect a new postmaster coming up and try to
> 'sync up' with that postmaster,

Nice try ;-).  How will you persuade the kernel that these processes are now children of the new postmaster?

			regards, tom lane
Bruce Momjian wrote:
> This will try a pg_ctl shutdown for 60 seconds, then kill pg_ctl.  You
> would then need a kill of your own.

I missed something somewhere: wasn't the consensus a few weeks ago that pg_ctl shouldn't be used for a system initscript?  Or did I black out that day? :-)  I certainly have no problem using pg_ctl for this purpose -- as I have been using pg_ctl to start postmaster all along (then why am I not using it to stop -- don't answer that :-))......
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
> I missed something somewhere: wasn't the consensus a few weeks ago that
> pg_ctl shouldn't be used for a system initscript?

I thought there was some concern about whether pg_ctl is really "ready for prime time".  But I don't recall the details either.

			regards, tom lane
> Bruce Momjian wrote:
> > This will try a pg_ctl shutdown for 60 seconds, then kill pg_ctl.  You
> > would then need a kill of your own.
>
> I missed something somewhere: wasn't the consensus a few weeks ago that
> pg_ctl shouldn't be used for a system initscript?  Or did I black out
> that day? :-)  I certainly have no problem using pg_ctl for this purpose
> -- as I have been using pg_ctl to start postmaster all along (then why
> am I not using it to stop -- don't answer that :-))......

I don't remember that discussion.  My guess was that you didn't want pg_ctl to hang forever.  My script handles that, I think.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Lamar Owen <lamar.owen@wgcr.org> writes:
> Tom Lane wrote:
>> Please note that the reason we're having this discussion at
>> all is that the init script may be used for purposes other than system
>> shutdown.  So the argument that "it's going to happen anyway" is wrong.

> Believe it or not, you just disproved your own statement that the
> initscript should not take it upon itself to issue the kill -9.

How?

> So, what if I issue '/etc/rc.d/init.d/postgresql restart' -- and
> backends don't go away during the 'stop' phase, while postmaster may
> actually have died?  Or is it even possible for postmaster to drop out
> with a running backend out there?

The postmaster will certainly not do so voluntarily.  If you kill -9 it, of course, that's the situation you're left with ... but your reasoning seems circular to me.  "I should kill -9 the postmaster to prevent the situation where I've kill -9'd the postmaster."

			regards, tom lane
Tom Lane wrote:
> Lamar Owen wrote:
> > Postmaster can easily enough find out if zombie backends are 'out there'
> > during startup, right?
> If you think it's easy enough, enlighten the rest of us ;-).

If postgres reported PGDATA on the command line it would be easy enough.

> > What can postmaster _do_ about it, though?  It
> > won't necessarily be able to kill them -- but it also can't control
> > them.  If it _can_ kill them, should it try?
> I think refusal to start is sufficient.  They should go away by
> themselves as their clients disconnect, and forcing the issue doesn't

????  I have misunderstood your previous statement about not wanting to force a manual crash recovery, then.

> > Should a set of backends detect a new postmaster coming up and try to
> > 'sync up' with that postmaster,
> Nice try ;-).  How will you persuade the kernel that these processes are
> now children of the new postmaster?

Yeah, that's the kicker.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
> Tom Lane wrote:
>> If you think it's easy enough, enlighten the rest of us ;-).

> If postgres reported PGDATA on the command line it would be easy enough.

In ps status you mean?  I don't think we are prepared to require ps status functionality to let the system start up... we'd lose a number of supported platforms that way.

>> I think refusal to start is sufficient.  They should go away by
>> themselves as their clients disconnect, and forcing the issue doesn't

> ????  I have misunderstood your previous statement about not wanting to
> force a manual crash recovery, then.

In the case of an actual crash and restart, postgres should come back up without help.  However, the situation here is not a crash, it is incomplete admin intervention.  I don't think that expecting the admin to complete his intervention is the same thing as manual crash recovery.  I especially don't think that we should second-guess what the admin wants us to do by auto-killing backends that are still serving clients.

			regards, tom lane
Tom Lane wrote:
> of course, that's the situation you're left with ... but your reasoning
> seems circular to me.  "I should kill -9 the postmaster to prevent the
> situation where I've kill -9'd the postmaster."

Ok, while the script can certainly be used from the command line, its primary purpose is system shutdown.  And, I am thinking kind of circuitously at this point -- I only now realize just how circuitously.  If I keep slapping my forehead like this, I'm going to be bald in a few years....

I don't want to reap the postmaster off -- I want to reap off the backends associated with that particular postmaster, allowing that postmaster to die on its own.  Duh.  Doing this in a safe manner is not going to be easy, given that the PGDATA is not on the command line to the backend as echoed by ps.  Although I could key on PPID for the backends....  I'll have to experiment.  But not tonight -- last week was more taxing than I thought. :-(.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Tom Lane wrote:
> Lamar Owen <lamar.owen@wgcr.org> writes:
> > Tom Lane wrote:
> >> If you think it's easy enough, enlighten the rest of us ;-).
> > If postgres reported PGDATA on the command line it would be easy enough.
> In ps status you mean?  I don't think we are prepared to require ps
> status functionality to let the system start up... we'd lose a number
> of supported platforms that way.

That is one downside.  A major downside.  Again, a lot of work to protect the Barney Fifes out there.

> In the case of an actual crash and restart, postgres should come back up
> without help.  However, the situation here is not a crash, it is
> incomplete admin intervention.  I don't think that expecting the admin

Is it a correct assumption that this is the only time postmaster might drop out?  But, thanks for the clarification, as I had misunderstood what you meant.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
> Is it a correct assumption that this is the only time postmaster might
> drop out?

Well, there's always the possibility of a bug leading to postmaster coredump.  Historically those have been pretty rare though.

In any case, I'm not sure that the init script is the place to be solving these problems.  We do need some internal mechanism to protect against a crashed or kill -9'd postmaster.

			regards, tom lane
Lamar Owen <lamar.owen@wgcr.org> writes:
> I don't want to reap the postmaster off -- I want to reap off the
> backends associated with that particular postmaster, allowing that
> postmaster to die on its own.  Duh.  Doing this in a safe manner is not
> going to be easy, given that the PGDATA is not on the command line to
> the backend as echoed by ps.  Although I could key on PPID for the
> backends....  I'll have to experiment.

PPID should work fine, actually.  Keep in mind though that SIGINT'ing the postmaster will already have sent a terminate signal to its children (barring postmaster breakage), and that if you wait around for awhile and then kill off remaining children, you may well accomplish nothing except to kill off the checkpoint process :-(

			regards, tom lane
Tom Lane wrote:
> Well, there's always the possibility of a bug leading to postmaster
> coredump.  Historically those have been pretty rare though.

I have never personally seen one, since 6.1.1.

> In any case, I'm not sure that the init script is the place to be
> solving these problems.

Well, I do kind of have the responsibility to allow the system to shut down.....  I'll have to double check -- there may be a timeout mechanism in the RedHat init to reap off shutdown scripts -- but I haven't yet found it.  Better to gracefully yank the plugs than have the grim reaper yank them in the wrong order for you, in any case.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
* Tom Lane <tgl@sss.pgh.pa.us> [010305 19:13] wrote:
> Lamar Owen <lamar.owen@wgcr.org> writes:
> > Tom Lane wrote:
> >> Postmaster down, backends alive is not a scenario we're currently
> >> prepared for.  We need a way to plug that gap.
>
> > Postmaster can easily enough find out if zombie backends are 'out there'
> > during startup, right?
>
> If you think it's easy enough, enlighten the rest of us ;-).  Be sure
> your solution only finds leftover backends from the previous instance of
> the same postmaster, else it will prevent running multiple postmasters
> on one system.

I'm sure some sort of encoding of the PGDATA directory along with the pids stored in the shm segment...

> > What can postmaster _do_ about it, though?  It
> > won't necessarily be able to kill them -- but it also can't control
> > them.  If it _can_ kill them, should it try?
>
> I think refusal to start is sufficient.  They should go away by
> themselves as their clients disconnect, and forcing the issue doesn't
> seem like it will improve matters.  The admin can kill them (hopefully
> with just a SIGTERM ;-)) if he wants to move things along ... but I'd
> not like to see a newly-starting postmaster do that automatically.

I agree; shooting down processes incorrectly should be left up to vendors' braindead scripts. :)

> > Should the backend look for the presence of its parent postmaster
> > periodically and gracefully come down if postmaster goes away without
> > the proper handshake?
>
> Unless we checked just before every disk write, this wouldn't represent
> a safe failure mode.  The onus has to be on the newly-starting
> postmaster, I think, not on the old backends.
>
> > Should a set of backends detect a new postmaster coming up and try to
> > 'sync up' with that postmaster,
>
> Nice try ;-).  How will you persuade the kernel that these processes are
> now children of the new postmaster?

Oh, easy, use ptrace. :)

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> I especially don't think that we should second-guess what the admin
> wants us to do by auto-killing backends that are still serving
> clients.

Sure.  But it would be nice anyway if pg_ctl could do this with a specific command line switch.

-- 
<< Not everything there is perfect, but the gardeners are certainly honored there >>
Dominique Quatravaux <dom@kilimandjaro.dyndns.org>
Lamar Owen writes:
> I missed something somewhere: wasn't the consensus a few weeks ago that
> pg_ctl shouldn't be used for a system initscript?

The consensus(?) was that there was some work to do in pg_ctl before it was robust enough to be used (for anything).  That work has been done.  An example Linux init.d script is at contrib/start-scripts/linux.

The only fault in that script that I can see is that it has no recipe for the case when the postmaster does not come down after 60 seconds.  But this is really no problem for the issue at hand, because if you do a normal runlevel switch then the postmaster will simply keep running, and during a system shutdown all the backends are going to die anyway.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
> Bruce Momjian writes:
> > This will try a pg_ctl shutdown for 60 seconds, then kill pg_ctl.  You
> > would then need a kill of your own.
>
> pg_ctl automatically times out after 60 seconds.

Oh, yea, that's right, I saw that in the documentation.  Forget my script.  Just run pg_ctl first, then kill the postmaster if it is still there.  Much safer than doing kill and checking, because pg_ctl knows when the system cleanly shuts down and exits.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Peter Eisentraut wrote:
> Lamar Owen writes:
> > I missed something somewhere: wasn't the consensus a few weeks ago that
> > pg_ctl shouldn't be used for a system initscript?
>
> The consensus(?) was that there was some work to do in pg_ctl before it
> was robust enough to be used (for anything).  That work has been done.

That was the detail I missed.

> case when the postmaster does not come down after 60 seconds.  But this is
> really no problem for the issue at hand because if you do a normal
> runlevel switch then the postmaster will simply keep running, and during a
> system shutdown all the backends are going to die anyway.

Only if each and every shutdown script succeeds in its task.  And I have to make sure that the RPM's shipping script successfully pulls down the system in an orderly fashion -- of course, I don't have to worry about the case where a postmaster is going to be started back up if we are in system shutdown -- but, as Tom also stated, I can't assume I'm in the system's death throes when called with the stop parameter.

And it _is_ possible for an admin to set up the runlevels such that a level is set aside where even networking isn't running (actually, that level already exists, and is called 'single user mode') -- or a run level for website maintenance where networking is still up, but the webserver and postgresql (and other associated) processes are to be shut down.  I personally use this -- I have set up runlevel 4 as a 'remote single user mode' of sorts where I still have sshd running (and the networking stack, obviously), but AOLserver, postgresql, and RealServer are shut down.  I then switch runlevels back to 3 to return to normal.  Much easier than manually stopping and restarting (in the correct order, as AOLserver is not a happy camper if postmaster drops out from underneath it) all the necessary pieces.

So I can't assume anything.  The default RPM installation used to automatically configure runlevels 3, 4, and 5 (not any more), but my script can't assume that the system is actually in that state by any means.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Alfred Perlstein <bright@wintelcom.net> writes:
> I'm sure some sort of encoding of the PGDATA directory along with
> the pids stored in the shm segment...

I thought about this too, but it strikes me as not very trustworthy.  The problem is that there's no guarantee that the new postmaster will even notice the old shmem segment: it might select a different shmem key.  (The 7.1 coding of shmem key selection makes this more likely than it used to be, but even under 7.0, it will certainly fail to work if I choose to start the new postmaster using a different port number than the old one had.  The shmem key is driven primarily by port number, not data directory...)

The interlock has to be tightly tied to the PGDATA directory, because what we're trying to protect is the files in and under that directory.  It seems that something based on file(s) in that directory is the way to go.

The best idea I've seen so far is Hiroshi's idea of having all the backends hold fcntl locks on the same file (probably postmaster.pid would do fine).  Then the new postmaster can test whether any backends are still alive by trying to lock the old postmaster.pid file.  Unfortunately, I read in the fcntl man page:

	Locks are not inherited by a child process in a fork(2) system call.

This makes the idea much less attractive than I originally thought: a new backend would not automatically inherit a lock on the postmaster.pid file from the postmaster, but would have to open/lock it for itself.  That means there's a window where the new backend exists but would be invisible to a hypothetical new postmaster.

We could work around this with the following, very ugly protocol:

1. Postmaster normally maintains an fcntl read lock on its postmaster.pid file.  Each spawned backend immediately opens and read-locks postmaster.pid, too, and holds that file open until it dies.  (Thus wasting a kernel FD per backend, which is one of the less attractive things about this.)  If the backend is unable to obtain read lock on postmaster.pid, then it complains and dies.  We must use read locks here so that all these processes can hold them separately.

2. If a newly started postmaster sees a pre-existing postmaster.pid file, it tries to obtain a *write* lock on that file.  If it fails, conclude that an old postmaster or backend is still alive; complain and quit.  If it succeeds, sit for say 1 second before deleting the file and creating a new one.  (The delay here is to allow any just-started old backends to fail to acquire read lock and quit.  A possible objection is that we have no way to guarantee 1 second is enough, though it ought to be plenty if the lock acquisition is just after the fork.)

One thing that worries me a little bit is that this means an fcntl read-lock request will exist inside the kernel for each active backend.  Does anyone know of any performance problems or hard kernel limits we might run into with large numbers of backends (lots and lots of fcntl locks)?  At least the locks are on a file that we don't actually touch in the normal course of business.

A small savings is that the backends don't actually need to open new FDs for the postmaster.pid file; they can use the one they inherit from the postmaster, even though they do need to lock it again.  I'm not sure how much that saves inside the kernel, but at least something.

There are also the usual set of concerns about portability of flock, though this time we're locking a plain file and not a socket, so it shouldn't be as much trouble as it was before.

Comments?  Does anyone see a better way to do it?

			regards, tom lane
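For concreteness, the two steps of the protocol look roughly like this (a minimal sketch, assuming postmaster.pid is already open on fd; the function names are made up for illustration, error handling is trimmed, and the one-second grace period is the same unproven assumption discussed above):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Step 1: postmaster and each spawned backend hold a shared (read)
     * lock on postmaster.pid for their whole lifetime.  Read locks can
     * be held by many processes at once. */
    static int acquire_read_lock(int fd)
    {
        struct flock fl;

        fl.l_type = F_RDLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;                   /* lock the whole file */
        return fcntl(fd, F_SETLK, &fl); /* -1 -> complain and die */
    }

    /* Step 2: a newly started postmaster probes the old pidfile with an
     * exclusive (write) lock; success means no old postmaster or backend
     * still holds a read lock on it. */
    static int old_processes_gone(int fd)
    {
        struct flock fl;

        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;
        if (fcntl(fd, F_SETLK, &fl) != 0)
            return 0;                   /* someone still holds a read lock */
        sleep(1);                       /* grace period for just-forked backends */
        return 1;                       /* safe to delete and recreate pidfile */
    }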
Lamar Owen writes:
> > case when the postmaster does not come down after 60 seconds.  But this is
> > really no problem for the issue at hand because if you do a normal
> > runlevel switch then the postmaster will simply keep running, and during a
> > system shutdown all the backends are going to die anyway.
>
> Only if each and every shutdown script succeeds in its task.  And I have
> to make sure that the RPM's shipping script successfully pulls down the
> system in an orderly fashion -- of course, I don't have to worry about
> the case where a postmaster is going to be started back up if we are in
> system shutdown -- but, as Tom also stated, I can't assume I'm in the
> system's death throes when called with the stop parameter.

Well, if you have something clever you want to do if the postmaster doesn't come down after an orderly shutdown, then please share it.  The current alternatives are 'leave running' or 'kill -9'.  I know I'd prefer the former.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
* Tom Lane <tgl@sss.pgh.pa.us> [010306 10:10] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> > I'm sure some sort of encoding of the PGDATA directory along with
> > the pids stored in the shm segment...
>
> I thought about this too, but it strikes me as not very trustworthy.
> The problem is that there's no guarantee that the new postmaster will
> even notice the old shmem segment: it might select a different shmem
> key.  (The 7.1 coding of shmem key selection makes this more likely
> than it used to be, but even under 7.0, it will certainly fail to work
> if I choose to start the new postmaster using a different port number
> than the old one had.  The shmem key is driven primarily by port number,
> not data directory...)

This seems like a mistake.  I'm surprised you guys aren't just using some form of the FreeBSD ftok() algorithm for this:

	FTOK(3)          FreeBSD Library Functions Manual          FTOK(3)
	...
	The ftok() function attempts to create a unique key suitable for use
	with the msgget(3), semget(2) and shmget(2) functions given the path
	of an existing file and a user-selectable id.

	The specified path must specify an existing file that is accessible
	to the calling process or the call will fail.  Also, note that links
	to files will return the same key, given the same id.

	BUGS
	The returned key is computed based on the device minor number and
	inode of the specified path in combination with the lower 8 bits of
	the given id.  Thus it is quite possible for the routine to return
	duplicate keys.

The "BUGS" seems to be exactly what you guys are looking for: a somewhat reliable method of obtaining a system id.  If that sounds evil, read below for an alternate suggestion.

> The interlock has to be tightly tied to the PGDATA directory, because
> what we're trying to protect is the files in and under that directory.
> It seems that something based on file(s) in that directory is the way
> to go.
>
> [description of the fcntl-lock protocol snipped]
>
> Comments?  Does anyone see a better way to do it?

Possibly...  What about encoding the shm id in the pidfile?  Then one can just ask how many processes are attached to that segment?  (If it doesn't exist, one can assume all backends have exited.)  You want the field 'shm_nattch'.  The shmid_ds struct is defined as follows:

	struct shmid_ds {
	    struct ipc_perm shm_perm;     /* operation permission structure */
	    int             shm_segsz;    /* size of segment in bytes */
	    pid_t           shm_lpid;     /* process ID of last shared memory op */
	    pid_t           shm_cpid;     /* process ID of creator */
	    short           shm_nattch;   /* number of current attaches */
	    time_t          shm_atime;    /* time of last shmat() */
	    time_t          shm_dtime;    /* time of last shmdt() */
	    time_t          shm_ctime;    /* time of last change by shmctl() */
	    void           *shm_internal; /* sysv stupidity */
	};

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein <bright@wintelcom.net> writes:
> * Tom Lane <tgl@sss.pgh.pa.us> [010306 10:10] wrote:
>> The shmem key is driven primarily by port number,
>> not data directory...)

> This seems like a mistake.  I'm surprised you guys aren't just using
> some form of the FreeBSD ftok() algorithm for this:

This has been discussed before --- see the archives.  The conclusion was that since ftok doesn't guarantee uniqueness, it adds nothing except lack of predictability to the shmem key selection process.  We'd still need logic to cope with key collisions, and given that, we might as well select keys that have some obvious relationship to user-visible parameters, viz the port number.  As is, you can fairly easily tell which shmem segment belongs to which postmaster from the shmem key; with ftok-derived keys, you couldn't tell a thing.

>> Comments?  Does anyone see a better way to do it?

> What about encoding the shm id in the pidfile?  Then one can just ask
> how many processes are attached to that segment?  (If it doesn't
> exist, one can assume all backends have exited.)

Hmm ... that might actually be a pretty good idea.  A small problem is that the shm key isn't yet selected at the time we initially create the lockfile, but I can't think of any reason that we could not go back and append the key to the lockfile afterwards.

> You want the field 'shm_nattch'.

Are there any portability problems with relying on shm_nattch to be available?  If not, I like this a lot...

			regards, tom lane
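A rough sketch of the startup-time test this implies, assuming the old key has already been read back out of the leftover lockfile (the function name is hypothetical and error handling is abbreviated; the conservative refuse-on-failure behavior matches the EPERM reasoning that comes up a bit further on):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Returns 1 if it is provably safe to start: the old segment is gone,
     * or nobody is attached to it anymore.  Sketch only. */
    static int old_shmem_unattached(key_t old_key)
    {
        struct shmid_ds buf;
        int shmid = shmget(old_key, 0, 0);  /* look up existing segment */

        if (shmid < 0)
            return errno == ENOENT;  /* segment already removed: fine */
        if (shmctl(shmid, IPC_STAT, &buf) < 0)
            return 0;                /* EACCES/EPERM etc: be conservative */
        return buf.shm_nattch == 0;  /* no leftover backends attached */
    }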
* Tom Lane <tgl@sss.pgh.pa.us> [010306 10:35] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
>
> > What about encoding the shm id in the pidfile?  Then one can just ask
> > how many processes are attached to that segment?  (If it doesn't
> > exist, one can assume all backends have exited.)
>
> Hmm ... that might actually be a pretty good idea.  A small problem is
> that the shm key isn't yet selected at the time we initially create the
> lockfile, but I can't think of any reason that we could not go back and
> append the key to the lockfile afterwards.
>
> > You want the field 'shm_nattch'.
>
> Are there any portability problems with relying on shm_nattch to be
> available?  If not, I like this a lot...

Well, it's available on FreeBSD and Solaris.  I'm sure RedHat has some daemon that resets the value to 0 periodically just for kicks, so it might not be viable... :)

Seriously, there's some dispute on the type that 'shm_nattch' is: under Solaris it's "shmatt_t" (unsigned long afaik), under FreeBSD it's 'short' (I should fix this. :)).  But since you're really only testing for 0-ness, it shouldn't really be a problem.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein <bright@wintelcom.net> writes:
>> Are there any portability problems with relying on shm_nattch to be
>> available?  If not, I like this a lot...

> Well, it's available on FreeBSD and Solaris.  I'm sure RedHat has
> some daemon that resets the value to 0 periodically just for kicks,
> so it might not be viable... :)

I notice that our BeOS and QNX emulations of shmctl() don't support IPC_STAT, but that could be dealt with, at least to the extent of stubbing it out.

This does raise the question of what to do if shmctl(IPC_STAT) fails for a reason other than EINVAL.  I think the conservative thing to do is refuse to start up.  On EPERM, for example, it's possible that there is a postmaster running in your PGDATA but with a different userid.

> Seriously, there's some dispute on the type that 'shm_nattch' is:
> under Solaris it's "shmatt_t" (unsigned long afaik), under FreeBSD
> it's 'short' (I should fix this. :)).
> But since you're really only testing for 0-ness, it shouldn't
> really be a problem.

We need not copy the value anywhere, so as long as the struct is correctly declared in the system header files, I don't think it matters what the field type is ...

			regards, tom lane
Peter Eisentraut wrote:
> Well, if you have something clever you want to do if the postmaster
> doesn't come down after an orderly shutdown, then please share it.  The
> current alternatives are 'leave running' or 'kill -9'.  I know I'd prefer
> the former.

Well, my preferences aren't really relevant here.  I have a job to do as an initscript in the RPMish environment -- and I really have to meet my obligations (using the first person pronoun there to anthropomorphize the initscript, allowing us to have a little sympathy for the poor shell script's plight :-)).  My preference is to let it float in limbo -- if it's in limbo and won't come out, then we have bigger issues.

However, I could do something really sneaky in the RedHat environment and let init do the dirty work for me -- but, again, I am not at all guaranteed that things will come down orderly.  If it is at all possible for me to bring about an orderly (if slow) shutdown that does terminate as the rest of the system needs it to, then I'll attempt to do so.

But the immediate issue is preventing chaotic stops within the initscript, so I'm going to experiment with things and see if I can make the initscript hang -- if I can't, then I'll likely put in the 'killproc postmaster -INT' with escalation to -TERM if it doesn't come down within sixty seconds (and, no, I am not going to sleep 60 then check things -- I am going to sleep 1 and loop sixty times) -- no need to unnecessarily delay system shutdown (and potential restart).  And I won't put in the -KILL unless I can find a safe and thorough way to do so.

Or I may go ahead and pg_ctl-ize things and let pg_ctl do the dirty work, as that IS what pg_ctl is supposed to accomplish.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Alfred Perlstein writes:
> Seriously, there's some dispute on the type that 'shm_nattch' is:
> under Solaris it's "shmatt_t" (unsigned long afaik), under FreeBSD
> it's 'short' (I should fix this. :)).

What I don't like is that my /usr/include/sys/shm.h (through other headers) has:

	typedef unsigned long int shmatt_t;

	/* Data structure describing a set of semaphores.  */
	struct shmid_ds {
	    struct ipc_perm shm_perm;    /* operation permission struct */
	    size_t shm_segsz;            /* size of segment in bytes */
	    __time_t shm_atime;          /* time of last shmat() */
	    unsigned long int __unused1;
	    __time_t shm_dtime;          /* time of last shmdt() */
	    unsigned long int __unused2;
	    __time_t shm_ctime;          /* time of last change by shmctl() */
	    unsigned long int __unused3;
	    __pid_t shm_cpid;            /* pid of creator */
	    __pid_t shm_lpid;            /* pid of last shmop */
	    shmatt_t shm_nattch;         /* number of current attaches */
	    unsigned long int __unused4;
	    unsigned long int __unused5;
	};

whereas /usr/src/linux/include/shm.h has:

	struct shmid_ds {
	    struct ipc_perm shm_perm;    /* operation perms */
	    int shm_segsz;               /* size of segment (bytes) */
	    __kernel_time_t shm_atime;   /* last attach time */
	    __kernel_time_t shm_dtime;   /* last detach time */
	    __kernel_time_t shm_ctime;   /* last change time */
	    __kernel_ipc_pid_t shm_cpid; /* pid of creator */
	    __kernel_ipc_pid_t shm_lpid; /* pid of last operator */
	    unsigned short shm_nattch;   /* no. of current attaches */
	    unsigned short shm_unused;   /* compatibility */
	    void *shm_unused2;           /* ditto - used by DIPC */
	    void *shm_unused3;           /* unused */
	};

Not only note the shm_nattch type, but also shm_segsz, and the "unused" fields in between.  I don't know a thing about the Linux kernel sources, but this doesn't seem right.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
* Tom Lane <tgl@sss.pgh.pa.us> [010306 11:03] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> >> Are there any portability problems with relying on shm_nattch to be
> >> available?  If not, I like this a lot...
>
> > Well, it's available on FreeBSD and Solaris.  I'm sure RedHat has
> > some daemon that resets the value to 0 periodically just for kicks,
> > so it might not be viable... :)
>
> I notice that our BeOS and QNX emulations of shmctl() don't support
> IPC_STAT, but that could be dealt with, at least to the extent of
> stubbing it out.

Well, since we already have spinlocks, I can't see why we can't keep the refcount and spinlock in a special place in the shm for all cases?

> This does raise the question of what to do if shmctl(IPC_STAT) fails
> for a reason other than EINVAL.  I think the conservative thing to do
> is refuse to start up.  On EPERM, for example, it's possible that there
> is a postmaster running in your PGDATA but with a different userid.

Yes, if possible a more meaningful error message and a pointer to some docco would be nice, or even a nice "I don't care, I killed all the backends, just start darnit" flag.  It's really no fun at all to have to attempt to decipher some cryptic error message at 3am when the database/system is acting up. :)

> > Seriously, there's some dispute on the type that 'shm_nattch' is:
> > under Solaris it's "shmatt_t" (unsigned long afaik), under FreeBSD
> > it's 'short' (I should fix this. :)).
> > But since you're really only testing for 0-ness, it shouldn't
> > really be a problem.
>
> We need not copy the value anywhere, so as long as the struct is
> correctly declared in the system header files, I don't think it matters
> what the field type is ...

Yup, my point exactly.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein <bright@wintelcom.net> writes:
> * Tom Lane <tgl@sss.pgh.pa.us> [010306 11:03] wrote:
>> I notice that our BeOS and QNX emulations of shmctl() don't support
>> IPC_STAT, but that could be dealt with, at least to the extent of
>> stubbing it out.

> Well, since we already have spinlocks, I can't see why we can't
> keep the refcount and spinlock in a special place in the shm
> for all cases?

No, we mustn't go there.  If the kernel isn't keeping the refcount, then it's worse than useless: as soon as some process crashes without decrementing its refcount, you have a condition that you can't recover from without reboot.

What I'm currently imagining is that the stub implementations will just return a failure code for IPC_STAT, and the outer code will in turn fail with a message along the lines of "It looks like there's a pre-existing shmem block (id XXX) still in use.  If you're sure there are no old backends still running, remove the shmem block with ipcrm(1), or just delete $PGDATA/postmaster.pid."  I dunno what shmem management tools exist on BeOS/QNX, but deleting the lockfile will definitely suppress the startup interlock ;-).

> Yes, if possible a more meaningful error message and a pointer to
> some docco would be nice

Is the above good enough?

			regards, tom lane
Peter Eisentraut wrote:
> Not only note the shm_nattch type, but also shm_segsz, and the "unused"
> fields in between. I don't know a thing about the Linux kernel sources,
> but this doesn't seem right.

RedHat 7, right? My RedHat 7 system isn't running RH 7 right now (it's this notebook, which is running Win95 at the moment), but see which RPMs own the two headers. You may be in for a shock. IIRC, the first system include is from the 2.4 kernel, and the second, in the kernel source tree, is from the 2.2 kernel.

Odd, but not really broken. It should be fixed in the latest public beta of RedHat, which actually has the 2.4 kernel. I can't really say any more about that, however.

--
Lamar Owen WGCR Internet Radio 1 Peter 4:11
Peter Eisentraut <peter_e@gmx.net> writes:
> What I don't like is that my /usr/include/sys/shm.h (through other
> headers) has [foo]
> whereas /usr/src/linux/include/shm.h has [bar]

Are those declarations perhaps bit-compatible? Looks a tad endian-dependent, though ...

regards, tom lane
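The endian hazard Tom alludes to looks like this in isolation: a short-declared field overlaying a value the kernel stored as an unsigned long reads the low-order half only on little-endian machines. A standalone illustration, not from the thread:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned long nattch = 3;   /* what a new-style kernel stores */
        unsigned short as_short;    /* what an old-style struct declares */

        /* Overlay the first two bytes, as a mismatched struct would. */
        memcpy(&as_short, &nattch, sizeof as_short);

        /* Little-endian: prints 3 (low half lands first, "works").
           Big-endian: prints 0 (high half lands first, silently wrong). */
        printf("%u\n", (unsigned) as_short);
        return 0;
    }

So a header mismatch could look harmless on x86 while a big-endian port would read zero attaches and cheerfully reuse a live segment.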
* Tom Lane <tgl@sss.pgh.pa.us> [010306 11:30] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> > * Tom Lane <tgl@sss.pgh.pa.us> [010306 11:03] wrote:
> >> I notice that our BeOS and QNX emulations of shmctl() don't support
> >> IPC_STAT, but that could be dealt with, at least to the extent of
> >> stubbing it out.
>
> > Well since we already have spinlocks, I can't see why we can't
> > keep the refcount and spinlock in a special place in the shm
> > for all cases?
>
> No, we mustn't go there. If the kernel isn't keeping the refcount
> then it's worse than useless: as soon as some process crashes without
> decrementing its refcount, you have a condition that you can't recover
> from without reboot.

Not if the postmaster outputs the following:

> What I'm currently imagining is that the stub implementations will just
> return a failure code for IPC_STAT, and the outer code will in turn fail
> with a message along the lines of "It looks like there's a pre-existing
> shmem block (id XXX) still in use. If you're sure there are no old
> backends still running, remove the shmem block with ipcrm(1), or just
> delete $PGDATA/postmaster.pid." I dunno what shmem management tools
> exist on BeOS/QNX, but deleting the lockfile will definitely suppress
> the startup interlock ;-).
>
> > Yes, if possible a more meaningful error message and pointer to
> > some docco would be nice
>
> Is the above good enough?

Sure. :)

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
* Lamar Owen <lamar.owen@wgcr.org> [010306 11:39] wrote:
> Peter Eisentraut wrote:
> > Not only note the shm_nattch type, but also shm_segsz, and the "unused"
> > fields in between. I don't know a thing about the Linux kernel sources,
> > but this doesn't seem right.
>
> RedHat 7, right? My RedHat 7 system isn't running RH 7 right now (it's
> this notebook, which is running Win95 at the moment), but see which RPMs
> own the two headers. You may be in for a shock. IIRC, the first system
> include is from the 2.4 kernel, and the second, in the kernel source
> tree, is from the 2.2 kernel.
>
> Odd, but not really broken. It should be fixed in the latest public beta
> of RedHat, which actually has the 2.4 kernel. I can't really say any
> more about that, however.

Y'know, I was only kidding about Linux going out of its way to defeat the 'shm_nattch' trick... *sigh*

As a FreeBSD developer I'm wondering if Linux keeps compatibility calls around for old binaries or not. Any idea?

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
* Tom Lane <tgl@sss.pgh.pa.us> [010306 11:49] wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > What I don't like is that my /usr/include/sys/shm.h (through other
> > headers) has [foo]
> > whereas /usr/src/linux/include/shm.h has [bar]
>
> Are those declarations perhaps bit-compatible? Looks a tad
> endian-dependent, though ...

Of course not: the size of the struct changed (short -> unsigned long, basically int16_t -> uint32_t). Because the kernel and userland in Linux are hardly in sync, you have the fun of guessing which combination you get:

    old struct -> old syscall (ok)
    new struct -> old syscall (boom)
    old struct -> new syscall (boom)
    new struct -> new syscall (ok)

Honestly, I think this problem should be left to the vendor to fix properly (if it needs fixing); the sysV API was published at least 6 years ago, and they ought to have it mostly correct by now.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein <bright@wintelcom.net> writes:
> Of course not: the size of the struct changed (short -> unsigned long,
> basically int16_t -> uint32_t). Because the kernel and userland in Linux
> are hardly in sync, you have the fun of guessing which combination you get:
>
>     old struct -> old syscall (ok)
>     new struct -> old syscall (boom)
>     old struct -> new syscall (boom)
>     new struct -> new syscall (ok)

Ugh. However, it looks like it might be fairly fail-soft: if we have the wrong declaration then we pick up some other field of the struct, and probably end up complaining because nattch appears nonzero. The recovery method (clean up the shm seg or delete the lockfile) is the same.

I'm still inclined to go with this; it beats corrupting the WAL log, and the fcntl(SETLK) alternative has its own set of portability booby-traps.

regards, tom lane
On Tue, Mar 06, 2001 at 08:19:12PM +0100, Peter Eisentraut wrote:
> What I don't like is that my /usr/include/sys/shm.h (through other
> headers) has:
>
> [glibc's shmid_ds declaration, quoted in full in Peter's message above]
>
> whereas /usr/src/linux/include/shm.h has:
>
> [the kernel's shmid_ds declaration, likewise quoted above]
>
> Not only note the shm_nattch type, but also shm_segsz, and the "unused"
> fields in between. I don't know a thing about the Linux kernel sources,
> but this doesn't seem right.

On Linux, /usr/src/linux/include is meaningless for anything in userland; it's meant only for building the kernel and kernel modules. That Red Hat tends to expose it to user-level builds is a long-standing bug in Red Hat's distribution, in violation of the File Hierarchy Standard as well as explicit instructions from Linus & crew and from the maintainer of the C library.

User-level programs see what's in /usr/include, which only has to match what the C library wants. It's the C library's job to do any mapping needed, and it does. The C library is maintained very, very carefully to keep binary compatibility with all old versions. (One sometimes encounters commercial programs that rely on a bug or undocumented/unsupported feature that disappears in a later library version.)

That is why there is no problem with version skew in the syscall argument structures on a correctly-configured Linux system. (On a Red Hat system it is very easy to get them out of sync, but RH fans are used to problems.)

Nathan Myers ncm@zembu.com
Bruce Momjian writes:
> This will try a pg_ctl shutdown for 60 seconds, then kill pg_ctl. You
> would then need a kill of your own.

pg_ctl automatically times out after 60 seconds.

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
Nathan Myers wrote:
> That is why there is no problem with version skew in the syscall
> argument structures on a correctly-configured Linux system. (On a
> Red Hat system it is very easy to get them out of sync, but RH fans
> are used to problems.)

Is RedHat bashing really necessary here? At least they are paying the salary of the second chair of the Linux kernel hierarchy. And they are very supportive of PostgreSQL (by shipping us with their distribution).

--
Lamar Owen WGCR Internet Radio 1 Peter 4:11
On Tue, Mar 06, 2001 at 12:46:24PM -0800, Nathan Myers wrote:
> On Linux, /usr/src/linux/include is meaningless for anything in userland;
> it's meant only for building the kernel and kernel modules. That Red Hat
> tends to expose it to user-level builds is a long-standing bug in Red
> Hat's distribution, in violation of the File Hierarchy Standard as well
> as explicit instructions from Linus & crew and from the maintainer of the
> C library.

Red Hat's Fisher Beta has split the two include trees, which caused an error trying to compile a (I guess badly configured) kernel module. The header files in /usr/include now give an error if you try to build a kernel module that gets its header files from there.

So whether they were wrong in the past or not, they are now doing things the way you say is proper.
On Tue 06 Mar 2001 18:56, Samuel Sieb wrote:
> On Tue, Mar 06, 2001 at 12:46:24PM -0800, Nathan Myers wrote:
> > On Linux, /usr/src/linux/include is meaningless for anything in userland;
> > it's meant only for building the kernel and kernel modules. That Red Hat
> > tends to expose it to user-level builds is a long-standing bug in Red
> > Hat's distribution, in violation of the File Hierarchy Standard as well
> > as explicit instructions from Linus & crew and from the maintainer of the
> > C library.
>
> Red Hat's Fisher Beta has split the two include trees, which caused an
> error trying to compile a (I guess badly configured) kernel module. The
> header files in /usr/include now give an error if you try to build a
> kernel module that gets its header files from there.
>
> So whether they were wrong in the past or not, they are now doing things
> the way you say is proper.

I am very happy to see RedHat putting out beta releases of their distribution. That's what's important about it all.

--
System Administration: It's a dirty job, but someone told me I had to do it.
-----------------------------------------------------------------
Martín Marqués                  email: martin@math.unl.edu.ar
Santa Fe - Argentina            http://math.unl.edu.ar/~martin/
System administrator at math.unl.edu.ar
-----------------------------------------------------------------
* Lamar Owen <lamar.owen@wgcr.org> [010306 13:27] wrote:
> Nathan Myers wrote:
> > That is why there is no problem with version skew in the syscall
> > argument structures on a correctly-configured Linux system. (On a
> > Red Hat system it is very easy to get them out of sync, but RH fans
> > are used to problems.)
>
> Is RedHat bashing really necessary here? At least they are paying the
> salary of the second chair of the Linux kernel hierarchy. And they are
> very supportive of PostgreSQL (by shipping us with their distribution).

Just because they do some really nice things and have some really nice stuff doesn't mean they should get cut any slack for shipping out-of-sync kernel/system headers, kill -9'ing databases, and having programs like 'tmpwatch' running on the boxes. It really shows a lack of understanding of how Unix is supposed to run.

What they really need to do is hire some grey beards (old school Unix folks) to QA the releases and keep stuff like this from happening/shipping.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
BeOS doesn't have this stat (I have a bunch of others, but not this one).

If I understand correctly, you want to check if there is some backend still attached to the shared mem segment of a given key? In this case, I have an easy solution to fake the stat, because all segments have an encoded name containing this key, so I can count them.

cyril

>
>Alfred Perlstein <bright@wintelcom.net> writes:
>>> Are there any portability problems with relying on shm_nattch to be
>>> available? If not, I like this a lot...
>
>> Well it's available on FreeBSD and Solaris, I'm sure Redhat has
>> some daemon that resets the value to 0 periodically just for kicks
>> so it might not be viable... :)
>
>I notice that our BeOS and QNX emulations of shmctl() don't support
>IPC_STAT, but that could be dealt with, at least to the extent of
>stubbing it out.
>
>This does raise the question of what to do if shmctl(IPC_STAT) fails
>for a reason other than EINVAL. I think the conservative thing to do
>is refuse to start up. On EPERM, for example, it's possible that there
>is a postmaster running in your PGDATA but with a different userid.
>
>> Seriously, there's some dispute on the type that 'shm_nattch' is,
>> under Solaris it's "shmatt_t" (unsigned long afaik), under FreeBSD
>> it's 'short' (i should fix this. :)).
>
>> But since you're really only testing for 0'ness then it shouldn't
>> really be a problem.
>
>We need not copy the value anywhere, so as long as the struct is
>correctly declared in the system header files I don't think it matters
>what the field type is ...
>
> regards, tom lane
* Cyril VELTER <cyril.velter@libertysurf.fr> [010306 16:15] wrote:
> BeOS doesn't have this stat (I have a bunch of others, but not this one).
>
> If I understand correctly, you want to check if there is some backend
> still attached to the shared mem segment of a given key? In this case,
> I have an easy solution to fake the stat, because all segments have an
> encoded name containing this key, so I can count them.

We need to be able to take a single shared memory segment and determine if any other process is using it.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
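Cyril's counting idea could be faked up roughly as follows with the BeOS kernel kit. This is a sketch from memory of the R5 API, and the encoded area name is whatever the emulation already assigns (the helper name here is made up):

    #include <OS.h>
    #include <string.h>

    /* Count how many areas (across all teams) carry the name that
       encodes the given SysV key; each attached process would own one
       such clone, so the count approximates shm_nattch. */
    static int32 count_attachments(const char *encoded_name)
    {
        int32 count = 0;
        int32 team_cookie = 0;
        team_info tinfo;

        while (get_next_team_info(&team_cookie, &tinfo) == B_OK)
        {
            int32 area_cookie = 0;
            area_info ainfo;

            while (get_next_area_info(tinfo.team, &area_cookie, &ainfo) == B_OK)
            {
                if (strcmp(ainfo.name, encoded_name) == 0)
                    count++;
            }
        }
        return count;
    }

The IPC_STAT stub could then fill in shm_nattch from this count and let the common startup check work unchanged.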
Alfred Perlstein wrote:
> What they really need to do is hire some grey beards (old school
> Unix folks) to QA the releases and keep stuff like this from
> happening/shipping.

Like the 250-strong RedHat Beta Team, of which I am a member? :-) I can't disclose the discussions on that list, but suffice it to say the traffic there is at least as great as the traffic on this one.

Of course, 7.1 hasn't shipped with a RedHat release yet -- and it's my job to make sure the postmaster gets shut down properly by the initscript inside the package for 7.1 -- there will be no kill -9 of the postmaster unless it is an emergency.

I've seen the advisories and the bug lists -- RedHat is not alone with bugs -- not even unusual with bugs. And every OS I know of (and you too) has had a brown paper bag release before. Even PostgreSQL, given its high release quality standards, has had a brown paper bag release -- we all still make mistakes (I know -- I've made more than my share of them).

Anyway, that's more than the rest of the list wanted to read. Replies to private e-mail, please.

--
Lamar Owen WGCR Internet Radio 1 Peter 4:11
On Tue, Mar 06, 2001 at 04:20:13PM -0500, Lamar Owen wrote:
> Nathan Myers wrote:
> > That is why there is no problem with version skew in the syscall
> > argument structures on a correctly-configured Linux system. (On a
> > Red Hat system it is very easy to get them out of sync, but RH fans
> > are used to problems.)
>
> Is RedHat bashing really necessary here?

I recognize that my last seven words above contributed nothing. In the future I will only post strictly factual statements about Red Hat and similarly charged topics, and keep the opinions to myself. I value the collegiality of this list too much to risk it further. I offer my apologies for violating it.

By the way... do they call Red Hat "RedHat" at Red Hat?

Nathan Myers ncm@zembu.com
Nathan Myers wrote:
> it further. I offer my apologies for violating it.

Well, an apology is not really necessary -- but I do get a little tired of the treatment a good open source company gets at the hands of open source advocates. Yes, they make mistakes. Everyone does.

> By the way... do they call Red Hat "RedHat" at Red Hat?

No, they don't. I don't know how I got into the habit of leaving out the space, but the space is supposed to be there -- unless you are on the Red Hat CD, where you will find a directory called 'RedHat'. Oh well. Totally off topic. If the From header had your personal address in it (Reply-All only lets me reply to the list for that message) I wouldn't grieve the list further with it. My last words on that subject.

Let's go on making PostgreSQL better. And preventing the kill -9 will make PostgreSQL better, even if it is masking a certain amount of shortsightedness on a certain initscripts author's part. :-)

--
Lamar Owen WGCR Internet Radio 1 Peter 4:11
ncm@zembu.com (Nathan Myers) writes:
> On Linux, /usr/src/linux/include is meaningless for anything in userland;
> it's meant only for building the kernel and kernel modules. That Red Hat
> tends to expose it to user-level builds is a long-standing bug in Red
> Hat's distribution

1) it isn't this way anymore
2) this was so for most distributions for a long time; it's not a "Red Hat" bug.

> in violation of the File Hierarchy Standard as well as explicit
> instructions from Linus & crew and from the maintainer of the C
> library.

Which obviously hasn't always been the case - the FHS isn't exactly old. Things have changed since then; we have followed.

--
Trond Eivind Glomsrød   Red Hat, Inc.
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> The interlock has to be tightly tied to the PGDATA directory, because
> what we're trying to protect is the files in and under that directory.
> It seems that something based on file(s) in that directory is the way
> to go.
>
> The best idea I've seen so far is Hiroshi's idea of having all the
> backends hold fcntl locks on the same file (probably postmaster.pid
> would do fine). Then the new postmaster can test whether any backends
> are still alive by trying to lock the old postmaster.pid file.
> Unfortunately, I read in the fcntl man page:
>
>     Locks are not inherited by a child process in a fork(2) system call.

Yes, flock() works well here but fcntl() doesn't.

> This makes the idea much less attractive than I originally thought:
> a new backend would not automatically inherit a lock on the
> postmaster.pid file from the postmaster, but would have to open/lock it
> for itself. That means there's a window where the new backend exists
> but would be invisible to a hypothetical new postmaster.
>
> We could work around this with the following, very ugly protocol:
>
> 1. Postmaster normally maintains fcntl read lock on its postmaster.pid
> file. Each spawned backend immediately opens and read-locks
> postmaster.pid, too, and holds that file open until it dies. (Thus
> wasting a kernel FD per backend, which is one of the less attractive
> things about this.) If the backend is unable to obtain read lock on
> postmaster.pid, then it complains and dies. We must use read locks
> here so that all these processes can hold them separately.
>
> 2. If a newly started postmaster sees a pre-existing postmaster.pid
> file, it tries to obtain a *write* lock on that file. If it fails,
> conclude that an old postmaster or backend is still alive; complain
> and quit. If it succeeds, sit for say 1 second before deleting the file
> and creating a new one. (The delay here is to allow any just-started
> old backends to fail to acquire read lock and quit. A possible
> objection is that we have no way to guarantee 1 second is enough, though
> it ought to be plenty if the lock acquisition is just after the fork.)

I have another idea. My main point is to not remove the existing pidfile. For example:

1) A newly started postmaster tries to obtain a write lock on the first byte of the pidfile. If it fails, the postmaster quits.
2) The postmaster tries to obtain a write lock on the second byte of the pidfile. If it fails, the postmaster quits.
3) The postmaster releases the lock of 2).
4) Each backend obtains a read lock on the second byte of the pidfile.

Regards,
Hiroshi Inoue
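A minimal sketch of what Hiroshi's two-byte protocol might look like in C. The pidfile path and messages are placeholders and error handling is reduced to the bare minimum; it is meant to show the locking sequence, not the real PostgreSQL startup code:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Set (or clear) a one-byte fcntl lock at the given offset.
       type is F_RDLCK, F_WRLCK, or F_UNLCK; returns -1 on conflict. */
    static int lock_byte(int fd, short type, off_t offset)
    {
        struct flock fl;

        fl.l_type = type;
        fl.l_whence = SEEK_SET;
        fl.l_start = offset;
        fl.l_len = 1;
        return fcntl(fd, F_SETLK, &fl);
    }

    int main(void)
    {
        int fd = open("postmaster.pid", O_RDWR | O_CREAT, 0600);

        if (fd < 0)
            return 1;

        /* Step 1: byte 0 marks the live postmaster; if another
           postmaster already holds it, give up. */
        if (lock_byte(fd, F_WRLCK, 0) < 0)
        {
            fprintf(stderr, "another postmaster is running\n");
            return 1;
        }

        /* Step 2: byte 1 can be write-locked only if no backend
           still holds a read lock on it. */
        if (lock_byte(fd, F_WRLCK, 1) < 0)
        {
            fprintf(stderr, "old backends are still running\n");
            return 1;
        }

        /* Step 3: release byte 1 so new backends can read-lock it. */
        lock_byte(fd, F_UNLCK, 1);

        /* Step 4 happens in each backend, which must re-acquire
           lock_byte(fd, F_RDLCK, 1) itself after fork(), precisely
           because fcntl locks are not inherited across fork(). */
        return 0;
    }

The pidfile itself is never deleted, which is the point of the scheme: the locks, not the file's existence, carry the liveness information, so a stale file left behind by kill -9 can't wedge a restart.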
> I have spent several days now puzzling over the corrupted WAL logfile
> that Scott Parish was kind enough to send me from a 7.1beta4 crash.
> It looks a lot like two different series of transactions were getting
> written into the same logfile. I'd been digging like mad in the WAL
> code to try to explain this as a buffer-management logic error, but
> after a fresh exchange of info it turns out that I was barking up the
> wrong tree.

Sorry about that. This is the same situation I was in myself a couple of times, and a "fresh exchange of info" was what saved me too -:) Anyway, it's good to know that it wasn't a buffer/etc logic error -:) (Actually, the logs you sent looked so grave that it was unclear how WAL worked at all -:).

Nevertheless, the subject is raised. BTW, does anybody know the results of kill -9 in Oracle/Informix/etc? Just curious -:)

Vadim
Vadim Mikheev wrote:
> Nevertheless, the subject is raised. BTW, does anybody know the results
> of kill -9 in Oracle/Informix/etc? Just curious -:)

Progress has no problem with it that I have ever seen.

Regards, Andrew.
--
_____________________________________________________________________
Andrew McMillan, e-mail: Andrew@catalyst.net.nz
Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington
Me: +64 (21) 635 694, Fax: +64 (4) 499 5596, Office: +64 (4) 499 2267
Samuel Sieb <samuel@sieb.net> writes:
> On Tue, Mar 06, 2001 at 12:46:24PM -0800, Nathan Myers wrote:
> > On Linux, /usr/src/linux/include is meaningless for anything in userland;
> > it's meant only for building the kernel and kernel modules. That Red Hat
> > tends to expose it to user-level builds is a long-standing bug in Red
> > Hat's distribution, in violation of the File Hierarchy Standard as well
> > as explicit instructions from Linus & crew and from the maintainer of the
> > C library.
>
> Red Hat's Fisher Beta has split the two include trees, which caused an
> error trying to compile a (I guess badly configured) kernel module.

It was split in Red Hat Linux 7 as well.

--
Trond Eivind Glomsrød   Red Hat, Inc.