Thread: windows doesn't notice backend death

windows doesn't notice backend death

From
Andrew Dunstan
Date:
While trying to test Tom's theory about Windows not being able to 
reinitialise shared memory when a backend crashes, I was doing some 
testing on my Vista box. I killed a backend using the task manager, and 
the postmaster never noticed. pg_cancel_backend() reported correctly 
that the process didn't exist, but pg_stat_activity still had it listed 
as IDLE, and the server certainly didn't re-initialise as it would have 
on Unix. This seems a fairly serious bug.

cheers

andrew


Re: windows doesn't notice backend death

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> ... I killed a backend using the task manager, and 
> the postmaster never noticed.

Ick.  Is it possible that the postmaster did get a report, but thought
it was normal session termination?  If so, how could we distinguish?
        regards, tom lane


Re: windows doesn't notice backend death

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> ... I killed a backend using the task manager, and 
>> the postmaster never noticed.
>>     
>
> Ick.  Is it possible that the postmaster did get a report, but thought
> it was normal session termination?  If so, how could we distinguish?
>
>     
>   

If that were the case then it would not have the dead process still 
listed as a live backend, ISTM, which it does.

cheers

andrew


Re: windows doesn't notice backend death

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Ick.  Is it possible that the postmaster did get a report, but thought
>> it was normal session termination?  If so, how could we distinguish?

> If that were the case then it would not have the dead process still 
> listed as a live backend, ISTM, which it does.

The postmaster does not control the content of the pg_stat_activity
view.
        regards, tom lane


Re: windows doesn't notice backend death

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Tom Lane wrote:
>>     
>>> Ick.  Is it possible that the postmaster did get a report, but thought
>>> it was normal session termination?  If so, how could we distinguish?
>>>       
>
>   
>> If that were the case then it would not have the dead process still 
>> listed as a live backend, ISTM, which it does.
>>     
>
> The postmaster does not control the content of the pg_stat_activity
> view.
>
>             
>   

Well, I'm not I know how to find out the answer to your question. I 
could try attaching a debugger to the postmaster - if I knew where to 
put a breakpoint.

cheers

andrew


Re: windows doesn't notice backend death

From
Alvaro Herrera
Date:
Andrew Dunstan wrote:

> Well, I'm not I know how to find out the answer to your question. I  
> could try attaching a debugger to the postmaster - if I knew where to  
> put a breakpoint.

Did you try reaper()?  It's the SIGCHLD handler.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: windows doesn't notice backend death

From
Magnus Hagander
Date:
Andrew Dunstan wrote:
> 
> 
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>  
>>> Tom Lane wrote:
>>>    
>>>> Ick.  Is it possible that the postmaster did get a report, but thought
>>>> it was normal session termination?  If so, how could we distinguish?
>>>>       
>>
>>  
>>> If that were the case then it would not have the dead process still
>>> listed as a live backend, ISTM, which it does.
>>>     
>>
>> The postmaster does not control the content of the pg_stat_activity
>> view.
>>
>>            
>>   
> 
> Well, I'm not I know how to find out the answer to your question. I
> could try attaching a debugger to the postmaster - if I knew where to
> put a breakpoint.

I'd start at pgwin32_deadchild_callback(). That's where the waiting
thread should activate once a child goes away.

That one should post a notice to the win32ChildQueue, which is polled in
win32_waitpid() - that's a second good point for a breakpoint.

This in turn should make things happen up in reaper() - as Alvaro
suggests, a third good place for a breakpoint.


FWIW, this certainly used to work. So we've either broken this recently,
or it's always been broken on Vista (I've never tried it myself on
Vista, only 2000, XP and 2003).

//Magnus


Re: windows doesn't notice backend death

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> FWIW, this certainly used to work. So we've either broken this recently,
> or it's always been broken on Vista (I've never tried it myself on
> Vista, only 2000, XP and 2003).

Maybe a quick check if it still works on non-Vista versions would be
in order, to eliminate one or the other of those theories.
        regards, tom lane


Re: windows doesn't notice backend death

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>   
>> FWIW, this certainly used to work. So we've either broken this recently,
>> or it's always been broken on Vista (I've never tried it myself on
>> Vista, only 2000, XP and 2003).
>>     
>
> Maybe a quick check if it still works on non-Vista versions would be
> in order, to eliminate one or the other of those theories.
>
>             
>   

Well, I can tell you that it is getting an exit code of 1, which is why 
the postmaster isn't restarting.

That raises two questions in my mind. First, is that the behaviour we 
expect when we kill the backend this way? And second, why is it still 
showing up in the output of pg_stat_activity?

Meanwhile, I guess I can try to create a loadable function that will 
crash a backend badly so I can try to reproduce my client's problem.

cheers

andrew


Re: windows doesn't notice backend death

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Well, I can tell you that it is getting an exit code of 1, which is why 
> the postmaster isn't restarting.

Blech.  Count on Windows to find a way to break things.

> That raises two questions in my mind. First, is that the behaviour we 
> expect when we kill the backend this way? And second, why is it still 
> showing up in the output of pg_stat_activity?

Well, if the process is being hard-killed without an opportunity to run
through proc_exit(), then yes it is going to still show up in
pg_stat_activity.  It's pgstat_beshutdown_hook that removes that entry.

The problem here is that we need to be able to distinguish a task
manager kill from a voluntary exit(1).  Have M$ really been stupid
enough to make an external kill look just like an exit() call?
        regards, tom lane


Re: windows doesn't notice backend death

From
Tom Lane
Date:
I wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Well, I can tell you that it is getting an exit code of 1, which is why 
>> the postmaster isn't restarting.

> Blech.  Count on Windows to find a way to break things.

I reflected on this a bit more.  Even if we find a way around this
particular task-manager behavior, it seems to me there is a generic
problem here.  If some bit of clueless code does exit(0) or exit(1)
inside a backend session, the postmaster will think everything is fine,
but actually we have an un-cleaned-up session that's probably still
holding locks etc.  It's fairly easy to demonstrate the issue:

pl_regression=# create language plperlu;
CREATE LANGUAGE
pl_regression=# create or replace function trouble() returns void as
pl_regression-# $$ exit 0; $$ language plperlu;
CREATE FUNCTION
pl_regression=# select trouble();
server closed the connection unexpectedly       This probably means the server terminated abnormally       before or
whileprocessing the request.
 
The connection to the server was lost. Attempting reset: Succeeded.
pl_regression=# select * from pg_stat_activity;datid |    datname    | procpid | usesysid | usename |
current_query         | waiting |          xact_start           |          query_start          |         backend_start
       | client_addr | client_port 
 

-------+---------------+---------+----------+---------+---------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------40179
|pl_regression |   20847 |       10 | tgl     | select trouble();               | f       | 2009-05-03
14:46:10.170604-04| 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:45:10.911359-04 |             |          -140179 |
pl_regression|   20855 |       10 | tgl     | select * from pg_stat_activity; | f       | 2009-05-03 14:46:23.986909-04
|2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:17.920486-04 |             |          -1
 
(2 rows)


Up to now we've always just dismissed the above possibility as
"superusers should know better", but I think there's a reasonable case
to be made that this is an obvious failure mode and we should put a bit
more effort into being robust against it.  With more and more external
code being routinely run in the backend, who wants to swear that there
is no "exit(1)" in the guts of libperl or libxml or whatever?

The first idea that comes to mind is to have some sort of "dead man
switch" that flags an active backend and is reset by proc_exit() after
it's finished cleaning up everything else.  If the postmaster sees
this flag still set after backend exit, then it treats the backend as
having crashed regardless of what the reported exit code is.
We could implement this via an array of sig_atomic_t in shared memory,
so as to minimize the postmaster's entanglement with shared memory
(it'd be no worse than the old WIN32-specific child pid arrays).

Or maybe there's a better way.  Thoughts?
        regards, tom lane


Re: windows doesn't notice backend death

From
justin
Date:
Tom Lane wrote: <blockquote cite="mid:28338.1241374442@sss.pgh.pa.us" type="cite"><pre wrap="">  Have M$ really been
stupid
enough to make an external kill look just like an exit() call?
        regards, tom lane
 </pre></blockquote><font face="Arial">kind of :-(<br /><br /> Everything i have read seems to point the Task Manager
callsTerminateProcess() in the kernel32.dll and passes a 1 setting the exitcode to 1.  I have not found anything
clearlystating that, yet logic points that is what its doing.  <br /><br /><a class="moz-txt-link-freetext"
href="http://msdn.microsoft.com/en-us/library/ms686714(VS.85).aspx">http://msdn.microsoft.com/en-us/library/ms686714(VS.85).aspx</a><br
/><br/> Would it not be easy to set the normal exitcode to something other than 1 to see the difference <br />
ExitProcess()<br/><a class="moz-txt-link-freetext"
href="http://msdn.microsoft.com/en-us/library/ms682658(VS.85).aspx">http://msdn.microsoft.com/en-us/library/ms682658(VS.85).aspx</a><br
/><br/><br />  this article maybe helpful.  <br /><a class="moz-txt-link-freetext"
href="http://msdn.microsoft.com/en-us/library/ms686722(VS.85).aspx">http://msdn.microsoft.com/en-us/library/ms686722(VS.85).aspx</a><br
/><br/> BTW tested on Windows 2003 SP1  this problem shows up in Postgresql 8.3.3<br /></font> 

Re: windows doesn't notice backend death

From
Andrew Dunstan
Date:

justin wrote:
> Tom Lane wrote:
>>   Have M$ really been stupid
>> enough to make an external kill look just like an exit() call?
>>
>>             regards, tom lane
>>
>>   
> kind of :-(
>
>
>
> Would it not be easy to set the normal exitcode to something other 
> than 1 to see the difference
> ExitProcess()
>

Not really, as Tom showed later this is an example of a more general 
problem. I think his solution of detecting when backends have cleaned up 
nicely and when they have not is the right way to go.

cheers

andrew


Re: windows doesn't notice backend death

From
justin
Date:
Andrew Dunstan wrote:
>
>
> justin wrote:
>>
>> Would it not be easy to set the normal exitcode to something other 
>> than 1 to see the difference
>> ExitProcess()
>>
>
> Not really, as Tom showed later this is an example of a more general 
> problem. I think his solution of detecting when backends have cleaned 
> up nicely and when they have not is the right way to go.
>
> cheers
>
> andrew
>
Stupid thought  why can the some clueless code set the exit status to 
crashed status???  Would it not be more prudent to remove that ability??? 




Re: windows doesn't notice backend death

From
Andrew Dunstan
Date:

justin wrote:
> Stupid thought  why can the some clueless code set the exit status to 
> crashed status???  Would it not be more prudent to remove that ability???
>

You're missing the point. The danger comes from code that we don't control.

cheers

andrew


Re: windows doesn't notice backend death

From
Tom Lane
Date:
I wrote:
> The first idea that comes to mind is to have some sort of "dead man
> switch" that flags an active backend and is reset by proc_exit() after
> it's finished cleaning up everything else.  If the postmaster sees
> this flag still set after backend exit, then it treats the backend as
> having crashed regardless of what the reported exit code is.

Another thought that came to mind: we could set up an atexit hook that
does all the work that proc_exit() currently does, and reduce
proc_exit() itself to just an exit() call.  psql already relies on
having atexit (or on_exit) so this doesn't appear to add any new
portability issues.

This will probably not fix the Vista taskmanager issue, since I'll
bet it's not running atexit hooks anyway.  What it would do is improve
the situation so that a "clueless" exit() call would be no worse than
elog(FATAL), rather than triggering a DB-wide restart as the dead man
switch would do.
        regards, tom lane


Re: windows doesn't notice backend death

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Up to now we've always just dismissed the above possibility as
> "superusers should know better", but I think there's a reasonable case
> to be made that this is an obvious failure mode and we should put a bit
> more effort into being robust against it.  With more and more external
> code being routinely run in the backend, who wants to swear that there
> is no "exit(1)" in the guts of libperl or libxml or whatever?

FWIW there is (or there was, last time I looked) an exit(1) call in the
guts of the PHP library that PL/php uses, which is triggered when the
memory used goes over the configured memory limit.  It was very easily
triggered with some of the test functions we had on our regression
tests, and the only solution was to kludge up the limit.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: windows doesn't notice backend death

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> FWIW there is (or there was, last time I looked) an exit(1) call in the
> guts of the PHP library that PL/php uses, which is triggered when the
> memory used goes over the configured memory limit.  It was very easily
> triggered with some of the test functions we had on our regression
> tests, and the only solution was to kludge up the limit.

I don't think we'll be able to prevent PHP from doing that :-(.  But
it now seems clear that we should try to make the database as a whole
recover with some degree of grace.  I'll go work up a patch.
        regards, tom lane


Re: windows doesn't notice backend death

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> FWIW, this certainly used to work. So we've either broken this recently,
>> or it's always been broken on Vista (I've never tried it myself on
>> Vista, only 2000, XP and 2003).
> 
> Maybe a quick check if it still works on non-Vista versions would be
> in order, to eliminate one or the other of those theories.

I'll have to take this back.

I just tested back to 8.2.latest on XP, and it has the same behavior. We
don't notice a "kill -9" from the task manager.

If it worked, it's either <8.2, or it's something that changed with a
service pack to XP as well.

//Magnus




Re: windows doesn't notice backend death

From
Magnus Hagander
Date:
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Well, I can tell you that it is getting an exit code of 1, which is why 
>> the postmaster isn't restarting.
> 
> Blech.  Count on Windows to find a way to break things.

Yup. A quick search gives this:
http://support.microsoft.com/kb/155075

which clearly states that task manager simply uses TerminateProcess().
And it's probably hard-coded with an exit code.

Which really means this can't have worked, and my memory was wrong.

I still claim it works when the process actually *crashes* though -
since that will give you one of those funky huge exit numbers.


>> That raises two questions in my mind. First, is that the behaviour we 
>> expect when we kill the backend this way? And second, why is it still 
>> showing up in the output of pg_stat_activity?
> 
> Well, if the process is being hard-killed without an opportunity to run
> through proc_exit(), then yes it is going to still show up in
> pg_stat_activity.  It's pgstat_beshutdown_hook that removes that entry.

That's certainly what task manager does. TerminateProcess() just stops
all threads and removes the process. No chance for atexit() stuff to run.

> The problem here is that we need to be able to distinguish a task
> manager kill from a voluntary exit(1).  Have M$ really been stupid
> enough to make an external kill look just like an exit() call?

It certainly looks that way. It probably goes back to the "use threads,
not processes, on this platform" design...

//Magnus



Re: windows doesn't notice backend death

From
Magnus Hagander
Date:
Andrew Dunstan wrote:
> 
> 
> justin wrote:
>> Tom Lane wrote:
>>>   Have M$ really been stupid
>>> enough to make an external kill look just like an exit() call?
>>>
>>>             regards, tom lane
>>>
>>>   
>> kind of :-(
>>
>>
>>
>> Would it not be easy to set the normal exitcode to something other
>> than 1 to see the difference
>> ExitProcess()
>>
> 
> Not really, as Tom showed later this is an example of a more general
> problem. I think his solution of detecting when backends have cleaned up
> nicely and when they have not is the right way to go.

Well, if we picked a different value than 1, the probability would
certainly go up that things work. I bet most external libs that do such
an evil thing would be doing exit(1). If we picked a return code way up
in the space (say something well above 2^16), the likelihood that a lib
would be using exactly that one decreases drastically. But there's no
guarantee - so it'd just be a "better workaround".

//Magnus



Re: windows doesn't notice backend death

From
Magnus Hagander
Date:
Tom Lane wrote:
> I wrote:
>> The first idea that comes to mind is to have some sort of "dead man
>> switch" that flags an active backend and is reset by proc_exit() after
>> it's finished cleaning up everything else.  If the postmaster sees
>> this flag still set after backend exit, then it treats the backend as
>> having crashed regardless of what the reported exit code is.
> 
> Another thought that came to mind: we could set up an atexit hook that
> does all the work that proc_exit() currently does, and reduce
> proc_exit() itself to just an exit() call.  psql already relies on
> having atexit (or on_exit) so this doesn't appear to add any new
> portability issues.
> 
> This will probably not fix the Vista taskmanager issue, since I'll
> bet it's not running atexit hooks anyway.  What it would do is improve
> the situation so that a "clueless" exit() call would be no worse than
> elog(FATAL), rather than triggering a DB-wide restart as the dead man
> switch would do.

This sounds like a good idea in general - because it will avoid having
to restart on unix.

It'd still have to be combined with the dead-man-switch idea to defend
completely. But it could be worthwhile doing anyway, for other platforms.

//Magnus


Re: windows doesn't notice backend death

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> Another thought that came to mind: we could set up an atexit hook that

> This sounds like a good idea in general - because it will avoid having
> to restart on unix.

> It'd still have to be combined with the dead-man-switch idea to defend
> completely. But it could be worthwhile doing anyway, for other platforms.

Right, exactly.  I'll try to produce a patch today.
        regards, tom lane


Re: windows doesn't notice backend death

From
Tom Lane
Date:
I wrote:
> I don't think we'll be able to prevent PHP from doing that :-(.  But
> it now seems clear that we should try to make the database as a whole
> recover with some degree of grace.  I'll go work up a patch.

Attached is a proposed patch for the "dead man switch" idea.  The switch is
armed when a child process acquires a regular PGPROC and disarmed when the
PGPROC is released.  (So there is no coverage for auxiliary processes, but
I doubt we need that.)  I chose to put the management code into pmsignal.c
--- if you hold your head at the proper angle, this can be seen as a form
of child-to-postmaster signaling, so that seemed like a reasonable place.
The array slot number is passed down to child processes in the same way as
for MyCancelKey.

Also, since the number of array slots needed is exactly the same as the
size of the ShmemBackendArray for the EXEC_BACKEND case, I tweaked the
code for the latter a little bit to avoid duplicate array-searching ---
the array slot number assigned for the deadman switch flag is also used to
index ShmemBackendArray.

One bit of ugliness is that in the EXEC_BACKEND case, InitProcess is done
before CreateSharedMemoryAndSemaphores; so it's necessary to have
special-case code to pass down the shmem state pointer for pmsignal.c,
similarly to what we do for ProcGlobal and a few other pointers.
We could avoid that by arming the switch only sometime after
CreateSharedMemoryAndSemaphores ... but if we're gonna have this mechanism
at all, I'd like it to cover the process's entire ownership of
shared-memory objects, not only part of it.

This doesn't include the proposed change to use an atexit callback to
ensure that proc_exit cleanup gets done --- that looks like a trivial
refactoring in proc_exit, but it's a separate idea anyway.

Barring objections I'll go ahead and apply this to HEAD.  I'm wondering
whether we are sufficiently worried about the Windows task manager issue
to risk back-patching into 8.3 and 8.2 ... comments?

            regards, tom lane


Attachment

Re: windows doesn't notice backend death

From
Tom Lane
Date:
I wrote:
> Attached is a proposed patch for the "dead man switch" idea.
> ...
> Barring objections I'll go ahead and apply this to HEAD.  I'm wondering
> whether we are sufficiently worried about the Windows task manager issue
> to risk back-patching into 8.3 and 8.2 ... comments?

For lack of response, I assume no one wants to back-patch this.
        regards, tom lane


Re: windows doesn't notice backend death

From
Andrew Dunstan
Date:

Tom Lane wrote:
> I wrote:
>   
>> Attached is a proposed patch for the "dead man switch" idea.
>> ...
>> Barring objections I'll go ahead and apply this to HEAD.  I'm wondering
>> whether we are sufficiently worried about the Windows task manager issue
>> to risk back-patching into 8.3 and 8.2 ... comments?
>>     
>
> For lack of response, I assume no one wants to back-patch this.
>
>             
>   

No, I think it's unnecessary, now we understand what's going on.

cheers

andrew


Re: windows doesn't notice backend death

From
Magnus Hagander
Date:
Tom Lane wrote:
> I wrote:
>> Attached is a proposed patch for the "dead man switch" idea.
>> ...
>> Barring objections I'll go ahead and apply this to HEAD.  I'm wondering
>> whether we are sufficiently worried about the Windows task manager issue
>> to risk back-patching into 8.3 and 8.2 ... comments?
> 
> For lack of response, I assume no one wants to back-patch this.

Hmm. I didn't have time to look it over :(

In general, killing server processes from task manager in windows is
less likely to be a popular thing than using kill on unix (and it still
surprises me how many people that consider themselves experts still do
"kill -9" by defualt whenever they want to stop something..)

Given that it actually doesn't notice it if we do, we might have people
doing this that don't know about it. But I think we can at least keep it
HEAD only for a while until it's seen some productoin level testing...

//Magnus