Thread: windows doesn't notice backend death
While trying to test Tom's theory about Windows not being able to reinitialise shared memory when a backend crashes, I was doing some testing on my Vista box. I killed a backend using the task manager, and the postmaster never noticed. pg_cancel_backend() reported correctly that the process didn't exist, but pg_stat_activity still had it listed as IDLE, and the server certainly didn't re-initialise as it would have on Unix. This seems a fairly serious bug. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > ... I killed a backend using the task manager, and > the postmaster never noticed. Ick. Is it possible that the postmaster did get a report, but thought it was normal session termination? If so, how could we distinguish? regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> ... I killed a backend using the task manager, and >> the postmaster never noticed. >> > > Ick. Is it possible that the postmaster did get a report, but thought > it was normal session termination? If so, how could we distinguish? > > > If that were the case then it would not have the dead process still listed as a live backend, ISTM, which it does. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Ick. Is it possible that the postmaster did get a report, but thought >> it was normal session termination? If so, how could we distinguish? > If that were the case then it would not have the dead process still > listed as a live backend, ISTM, which it does. The postmaster does not control the content of the pg_stat_activity view. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> Ick. Is it possible that the postmaster did get a report, but thought >>> it was normal session termination? If so, how could we distinguish? >>> > > >> If that were the case then it would not have the dead process still >> listed as a live backend, ISTM, which it does. >> > > The postmaster does not control the content of the pg_stat_activity > view. > > > Well, I'm not I know how to find out the answer to your question. I could try attaching a debugger to the postmaster - if I knew where to put a breakpoint. cheers andrew
Andrew Dunstan wrote: > Well, I'm not I know how to find out the answer to your question. I > could try attaching a debugger to the postmaster - if I knew where to > put a breakpoint. Did you try reaper()? It's the SIGCHLD handler. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Andrew Dunstan wrote: > > > Tom Lane wrote: >> Andrew Dunstan <andrew@dunslane.net> writes: >> >>> Tom Lane wrote: >>> >>>> Ick. Is it possible that the postmaster did get a report, but thought >>>> it was normal session termination? If so, how could we distinguish? >>>> >> >> >>> If that were the case then it would not have the dead process still >>> listed as a live backend, ISTM, which it does. >>> >> >> The postmaster does not control the content of the pg_stat_activity >> view. >> >> >> > > Well, I'm not I know how to find out the answer to your question. I > could try attaching a debugger to the postmaster - if I knew where to > put a breakpoint. I'd start at pgwin32_deadchild_callback(). That's where the waiting thread should activate once a child goes away. That one should post a notice to the win32ChildQueue, which is polled in win32_waitpid() - that's a second good point for a breakpoint. This in turn should make things happen up in reaper() - as Alvaro suggests, a third good place for a breakpoint. FWIW, this certainly used to work. So we've either broken this recently, or it's always been broken on Vista (I've never tried it myself on Vista, only 2000, XP and 2003). //Magnus
Magnus Hagander <magnus@hagander.net> writes: > FWIW, this certainly used to work. So we've either broken this recently, > or it's always been broken on Vista (I've never tried it myself on > Vista, only 2000, XP and 2003). Maybe a quick check if it still works on non-Vista versions would be in order, to eliminate one or the other of those theories. regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: > >> FWIW, this certainly used to work. So we've either broken this recently, >> or it's always been broken on Vista (I've never tried it myself on >> Vista, only 2000, XP and 2003). >> > > Maybe a quick check if it still works on non-Vista versions would be > in order, to eliminate one or the other of those theories. > > > Well, I can tell you that it is getting an exit code of 1, which is why the postmaster isn't restarting. That raises two questions in my mind. First, is that the behaviour we expect when we kill the backend this way? And second, why is it still showing up in the output of pg_stat_activity? Meanwhile, I guess I can try to create a loadable function that will crash a backend badly so I can try to reproduce my client's problem. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Well, I can tell you that it is getting an exit code of 1, which is why > the postmaster isn't restarting. Blech. Count on Windows to find a way to break things. > That raises two questions in my mind. First, is that the behaviour we > expect when we kill the backend this way? And second, why is it still > showing up in the output of pg_stat_activity? Well, if the process is being hard-killed without an opportunity to run through proc_exit(), then yes it is going to still show up in pg_stat_activity. It's pgstat_beshutdown_hook that removes that entry. The problem here is that we need to be able to distinguish a task manager kill from a voluntary exit(1). Have M$ really been stupid enough to make an external kill look just like an exit() call? regards, tom lane
I wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> Well, I can tell you that it is getting an exit code of 1, which is why >> the postmaster isn't restarting. > Blech. Count on Windows to find a way to break things. I reflected on this a bit more. Even if we find a way around this particular task-manager behavior, it seems to me there is a generic problem here. If some bit of clueless code does exit(0) or exit(1) inside a backend session, the postmaster will think everything is fine, but actually we have an un-cleaned-up session that's probably still holding locks etc. It's fairly easy to demonstrate the issue: pl_regression=# create language plperlu; CREATE LANGUAGE pl_regression=# create or replace function trouble() returns void as pl_regression-# $$ exit 0; $$ language plperlu; CREATE FUNCTION pl_regression=# select trouble(); server closed the connection unexpectedly This probably means the server terminated abnormally before or whileprocessing the request. The connection to the server was lost. Attempting reset: Succeeded. pl_regression=# select * from pg_stat_activity;datid | datname | procpid | usesysid | usename | current_query | waiting | xact_start | query_start | backend_start | client_addr | client_port -------+---------------+---------+----------+---------+---------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------40179 |pl_regression | 20847 | 10 | tgl | select trouble(); | f | 2009-05-03 14:46:10.170604-04| 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:45:10.911359-04 | | -140179 | pl_regression| 20855 | 10 | tgl | select * from pg_stat_activity; | f | 2009-05-03 14:46:23.986909-04 |2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:17.920486-04 | | -1 (2 rows) Up to now we've always just dismissed the above possibility as "superusers should know better", but I think there's a reasonable case to be made that this is an obvious failure mode and we should put a bit more effort into being robust against it. With more and more external code being routinely run in the backend, who wants to swear that there is no "exit(1)" in the guts of libperl or libxml or whatever? The first idea that comes to mind is to have some sort of "dead man switch" that flags an active backend and is reset by proc_exit() after it's finished cleaning up everything else. If the postmaster sees this flag still set after backend exit, then it treats the backend as having crashed regardless of what the reported exit code is. We could implement this via an array of sig_atomic_t in shared memory, so as to minimize the postmaster's entanglement with shared memory (it'd be no worse than the old WIN32-specific child pid arrays). Or maybe there's a better way. Thoughts? regards, tom lane
Tom Lane wrote: <blockquote cite="mid:28338.1241374442@sss.pgh.pa.us" type="cite"><pre wrap=""> Have M$ really been stupid enough to make an external kill look just like an exit() call? regards, tom lane </pre></blockquote><font face="Arial">kind of :-(<br /><br /> Everything i have read seems to point the Task Manager callsTerminateProcess() in the kernel32.dll and passes a 1 setting the exitcode to 1. I have not found anything clearlystating that, yet logic points that is what its doing. <br /><br /><a class="moz-txt-link-freetext" href="http://msdn.microsoft.com/en-us/library/ms686714(VS.85).aspx">http://msdn.microsoft.com/en-us/library/ms686714(VS.85).aspx</a><br /><br/> Would it not be easy to set the normal exitcode to something other than 1 to see the difference <br /> ExitProcess()<br/><a class="moz-txt-link-freetext" href="http://msdn.microsoft.com/en-us/library/ms682658(VS.85).aspx">http://msdn.microsoft.com/en-us/library/ms682658(VS.85).aspx</a><br /><br/><br /> this article maybe helpful. <br /><a class="moz-txt-link-freetext" href="http://msdn.microsoft.com/en-us/library/ms686722(VS.85).aspx">http://msdn.microsoft.com/en-us/library/ms686722(VS.85).aspx</a><br /><br/> BTW tested on Windows 2003 SP1 this problem shows up in Postgresql 8.3.3<br /></font>
justin wrote: > Tom Lane wrote: >> Have M$ really been stupid >> enough to make an external kill look just like an exit() call? >> >> regards, tom lane >> >> > kind of :-( > > > > Would it not be easy to set the normal exitcode to something other > than 1 to see the difference > ExitProcess() > Not really, as Tom showed later this is an example of a more general problem. I think his solution of detecting when backends have cleaned up nicely and when they have not is the right way to go. cheers andrew
Andrew Dunstan wrote: > > > justin wrote: >> >> Would it not be easy to set the normal exitcode to something other >> than 1 to see the difference >> ExitProcess() >> > > Not really, as Tom showed later this is an example of a more general > problem. I think his solution of detecting when backends have cleaned > up nicely and when they have not is the right way to go. > > cheers > > andrew > Stupid thought why can the some clueless code set the exit status to crashed status??? Would it not be more prudent to remove that ability???
justin wrote: > Stupid thought why can the some clueless code set the exit status to > crashed status??? Would it not be more prudent to remove that ability??? > You're missing the point. The danger comes from code that we don't control. cheers andrew
I wrote: > The first idea that comes to mind is to have some sort of "dead man > switch" that flags an active backend and is reset by proc_exit() after > it's finished cleaning up everything else. If the postmaster sees > this flag still set after backend exit, then it treats the backend as > having crashed regardless of what the reported exit code is. Another thought that came to mind: we could set up an atexit hook that does all the work that proc_exit() currently does, and reduce proc_exit() itself to just an exit() call. psql already relies on having atexit (or on_exit) so this doesn't appear to add any new portability issues. This will probably not fix the Vista taskmanager issue, since I'll bet it's not running atexit hooks anyway. What it would do is improve the situation so that a "clueless" exit() call would be no worse than elog(FATAL), rather than triggering a DB-wide restart as the dead man switch would do. regards, tom lane
Tom Lane wrote: > Up to now we've always just dismissed the above possibility as > "superusers should know better", but I think there's a reasonable case > to be made that this is an obvious failure mode and we should put a bit > more effort into being robust against it. With more and more external > code being routinely run in the backend, who wants to swear that there > is no "exit(1)" in the guts of libperl or libxml or whatever? FWIW there is (or there was, last time I looked) an exit(1) call in the guts of the PHP library that PL/php uses, which is triggered when the memory used goes over the configured memory limit. It was very easily triggered with some of the test functions we had on our regression tests, and the only solution was to kludge up the limit. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > FWIW there is (or there was, last time I looked) an exit(1) call in the > guts of the PHP library that PL/php uses, which is triggered when the > memory used goes over the configured memory limit. It was very easily > triggered with some of the test functions we had on our regression > tests, and the only solution was to kludge up the limit. I don't think we'll be able to prevent PHP from doing that :-(. But it now seems clear that we should try to make the database as a whole recover with some degree of grace. I'll go work up a patch. regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> FWIW, this certainly used to work. So we've either broken this recently, >> or it's always been broken on Vista (I've never tried it myself on >> Vista, only 2000, XP and 2003). > > Maybe a quick check if it still works on non-Vista versions would be > in order, to eliminate one or the other of those theories. I'll have to take this back. I just tested back to 8.2.latest on XP, and it has the same behavior. We don't notice a "kill -9" from the task manager. If it worked, it's either <8.2, or it's something that changed with a service pack to XP as well. //Magnus
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> Well, I can tell you that it is getting an exit code of 1, which is why >> the postmaster isn't restarting. > > Blech. Count on Windows to find a way to break things. Yup. A quick search gives this: http://support.microsoft.com/kb/155075 which clearly states that task manager simply uses TerminateProcess(). And it's probably hard-coded with an exit code. Which really means this can't have worked, and my memory was wrong. I still claim it works when the process actually *crashes* though - since that will give you one of those funky huge exit numbers. >> That raises two questions in my mind. First, is that the behaviour we >> expect when we kill the backend this way? And second, why is it still >> showing up in the output of pg_stat_activity? > > Well, if the process is being hard-killed without an opportunity to run > through proc_exit(), then yes it is going to still show up in > pg_stat_activity. It's pgstat_beshutdown_hook that removes that entry. That's certainly what task manager does. TerminateProcess() just stops all threads and removes the process. No chance for atexit() stuff to run. > The problem here is that we need to be able to distinguish a task > manager kill from a voluntary exit(1). Have M$ really been stupid > enough to make an external kill look just like an exit() call? It certainly looks that way. It probably goes back to the "use threads, not processes, on this platform" design... //Magnus
Andrew Dunstan wrote: > > > justin wrote: >> Tom Lane wrote: >>> Have M$ really been stupid >>> enough to make an external kill look just like an exit() call? >>> >>> regards, tom lane >>> >>> >> kind of :-( >> >> >> >> Would it not be easy to set the normal exitcode to something other >> than 1 to see the difference >> ExitProcess() >> > > Not really, as Tom showed later this is an example of a more general > problem. I think his solution of detecting when backends have cleaned up > nicely and when they have not is the right way to go. Well, if we picked a different value than 1, the probability would certainly go up that things work. I bet most external libs that do such an evil thing would be doing exit(1). If we picked a return code way up in the space (say something well above 2^16), the likelihood that a lib would be using exactly that one decreases drastically. But there's no guarantee - so it'd just be a "better workaround". //Magnus
Tom Lane wrote: > I wrote: >> The first idea that comes to mind is to have some sort of "dead man >> switch" that flags an active backend and is reset by proc_exit() after >> it's finished cleaning up everything else. If the postmaster sees >> this flag still set after backend exit, then it treats the backend as >> having crashed regardless of what the reported exit code is. > > Another thought that came to mind: we could set up an atexit hook that > does all the work that proc_exit() currently does, and reduce > proc_exit() itself to just an exit() call. psql already relies on > having atexit (or on_exit) so this doesn't appear to add any new > portability issues. > > This will probably not fix the Vista taskmanager issue, since I'll > bet it's not running atexit hooks anyway. What it would do is improve > the situation so that a "clueless" exit() call would be no worse than > elog(FATAL), rather than triggering a DB-wide restart as the dead man > switch would do. This sounds like a good idea in general - because it will avoid having to restart on unix. It'd still have to be combined with the dead-man-switch idea to defend completely. But it could be worthwhile doing anyway, for other platforms. //Magnus
Magnus Hagander <magnus@hagander.net> writes: > Tom Lane wrote: >> Another thought that came to mind: we could set up an atexit hook that > This sounds like a good idea in general - because it will avoid having > to restart on unix. > It'd still have to be combined with the dead-man-switch idea to defend > completely. But it could be worthwhile doing anyway, for other platforms. Right, exactly. I'll try to produce a patch today. regards, tom lane
I wrote: > I don't think we'll be able to prevent PHP from doing that :-(. But > it now seems clear that we should try to make the database as a whole > recover with some degree of grace. I'll go work up a patch. Attached is a proposed patch for the "dead man switch" idea. The switch is armed when a child process acquires a regular PGPROC and disarmed when the PGPROC is released. (So there is no coverage for auxiliary processes, but I doubt we need that.) I chose to put the management code into pmsignal.c --- if you hold your head at the proper angle, this can be seen as a form of child-to-postmaster signaling, so that seemed like a reasonable place. The array slot number is passed down to child processes in the same way as for MyCancelKey. Also, since the number of array slots needed is exactly the same as the size of the ShmemBackendArray for the EXEC_BACKEND case, I tweaked the code for the latter a little bit to avoid duplicate array-searching --- the array slot number assigned for the deadman switch flag is also used to index ShmemBackendArray. One bit of ugliness is that in the EXEC_BACKEND case, InitProcess is done before CreateSharedMemoryAndSemaphores; so it's necessary to have special-case code to pass down the shmem state pointer for pmsignal.c, similarly to what we do for ProcGlobal and a few other pointers. We could avoid that by arming the switch only sometime after CreateSharedMemoryAndSemaphores ... but if we're gonna have this mechanism at all, I'd like it to cover the process's entire ownership of shared-memory objects, not only part of it. This doesn't include the proposed change to use an atexit callback to ensure that proc_exit cleanup gets done --- that looks like a trivial refactoring in proc_exit, but it's a separate idea anyway. Barring objections I'll go ahead and apply this to HEAD. I'm wondering whether we are sufficiently worried about the Windows task manager issue to risk back-patching into 8.3 and 8.2 ... comments? regards, tom lane
Attachment
I wrote: > Attached is a proposed patch for the "dead man switch" idea. > ... > Barring objections I'll go ahead and apply this to HEAD. I'm wondering > whether we are sufficiently worried about the Windows task manager issue > to risk back-patching into 8.3 and 8.2 ... comments? For lack of response, I assume no one wants to back-patch this. regards, tom lane
Tom Lane wrote: > I wrote: > >> Attached is a proposed patch for the "dead man switch" idea. >> ... >> Barring objections I'll go ahead and apply this to HEAD. I'm wondering >> whether we are sufficiently worried about the Windows task manager issue >> to risk back-patching into 8.3 and 8.2 ... comments? >> > > For lack of response, I assume no one wants to back-patch this. > > > No, I think it's unnecessary, now we understand what's going on. cheers andrew
Tom Lane wrote: > I wrote: >> Attached is a proposed patch for the "dead man switch" idea. >> ... >> Barring objections I'll go ahead and apply this to HEAD. I'm wondering >> whether we are sufficiently worried about the Windows task manager issue >> to risk back-patching into 8.3 and 8.2 ... comments? > > For lack of response, I assume no one wants to back-patch this. Hmm. I didn't have time to look it over :( In general, killing server processes from task manager in windows is less likely to be a popular thing than using kill on unix (and it still surprises me how many people that consider themselves experts still do "kill -9" by defualt whenever they want to stop something..) Given that it actually doesn't notice it if we do, we might have people doing this that don't know about it. But I think we can at least keep it HEAD only for a while until it's seen some productoin level testing... //Magnus