Re: windows doesn't notice backend death - Mailing list pgsql-hackers

From Tom Lane
Subject Re: windows doesn't notice backend death
Date
Msg-id 29196.1241377467@sss.pgh.pa.us
In response to Re: windows doesn't notice backend death  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: windows doesn't notice backend death  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: windows doesn't notice backend death  (Alvaro Herrera <alvherre@commandprompt.com>)
List pgsql-hackers
I wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Well, I can tell you that it is getting an exit code of 1, which is why 
>> the postmaster isn't restarting.

> Blech.  Count on Windows to find a way to break things.

I reflected on this a bit more.  Even if we find a way around this
particular task-manager behavior, it seems to me there is a generic
problem here.  If some bit of clueless code does exit(0) or exit(1)
inside a backend session, the postmaster will think everything is fine,
but actually we have an un-cleaned-up session that's probably still
holding locks etc.  It's fairly easy to demonstrate the issue:

pl_regression=# create language plperlu;
CREATE LANGUAGE
pl_regression=# create or replace function trouble() returns void as
pl_regression-# $$ exit 0; $$ language plperlu;
CREATE FUNCTION
pl_regression=# select trouble();
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
pl_regression=# select * from pg_stat_activity;
 datid |    datname    | procpid | usesysid | usename |          current_query          | waiting |          xact_start           |          query_start          |         backend_start         | client_addr | client_port
-------+---------------+---------+----------+---------+---------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------
 40179 | pl_regression |   20847 |       10 | tgl     | select trouble();               | f       | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:45:10.911359-04 |             |          -1
 40179 | pl_regression |   20855 |       10 | tgl     | select * from pg_stat_activity; | f       | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:17.920486-04 |             |          -1
(2 rows)


Up to now we've always just dismissed the above possibility as
"superusers should know better", but I think there's a reasonable case
to be made that this is an obvious failure mode and we should put a bit
more effort into being robust against it.  With more and more external
code being routinely run in the backend, who wants to swear that there
is no "exit(1)" in the guts of libperl or libxml or whatever?

The first idea that comes to mind is to have some sort of "dead man
switch" that flags an active backend and is reset by proc_exit() after
it's finished cleaning up everything else.  If the postmaster sees
this flag still set after backend exit, then it treats the backend as
having crashed regardless of what the reported exit code is.
We could implement this via an array of sig_atomic_t in shared memory,
so as to minimize the postmaster's entanglement with shared memory
(it'd be no worse than the old WIN32-specific child pid arrays).
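
To make the dead-man-switch idea concrete, here is a minimal POSIX sketch.
It is not PostgreSQL code: MAX_BACKENDS, dead_man_init(), start_backend(),
clean_proc_exit(), backend_crashed() and the two example backends are all
hypothetical names invented for illustration, and an anonymous mmap() region
stands in for the real shared memory segment.  The only thing it is meant to
show is the flag being armed before the fork, cleared as the last step of an
orderly exit, and checked by the parent after reaping the child.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on strict glibc builds */
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_BACKENDS 64          /* hypothetical fixed number of slots */

/* One flag per backend slot, in memory shared between parent and children. */
static volatile sig_atomic_t *dead_man_flags;

/* Parent: allocate the flag array before forking any backend.
 * Anonymous mappings start zero-filled, so every switch starts disarmed. */
static int dead_man_init(void)
{
    void *p = mmap(NULL, MAX_BACKENDS * sizeof(sig_atomic_t),
                   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    dead_man_flags = p;
    return 0;
}

/* Backend: last step of an orderly shutdown, after all other cleanup. */
static void clean_proc_exit(int slot, int code)
{
    /* ... releasing locks and other session state would happen here ... */
    dead_man_flags[slot] = 0;    /* disarm only once cleanup is complete */
    _exit(code);
}

/* Example backend that shuts down through the proper path. */
static void well_behaved_backend(int slot)
{
    clean_proc_exit(slot, 0);
}

/* Example backend in which some library code bails out on its own.
 * _exit(0) stands in for a stray exit(0); it skips stdio flushing so the
 * demo child does not re-emit buffered output inherited from the parent. */
static void rogue_backend(int slot)
{
    (void) slot;
    _exit(0);
}

/* Parent: arm the switch for a slot, then fork the backend. */
static pid_t start_backend(int slot, void (*backend_main)(int))
{
    assert(slot >= 0 && slot < MAX_BACKENDS);
    dead_man_flags[slot] = 1;
    pid_t pid = fork();
    if (pid == 0)
    {
        backend_main(slot);
        _exit(0);                /* fell off the end without disarming */
    }
    if (pid < 0)
        dead_man_flags[slot] = 0;    /* fork failed, nothing to watch */
    return pid;
}

/* Parent: reap the child.  Returns 1 if it must be treated as crashed. */
static int backend_crashed(int slot, pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) != pid)
        return 1;
    int still_armed = (dead_man_flags[slot] != 0);
    dead_man_flags[slot] = 0;    /* recycle the slot */
    if (still_armed)
        return 1;                /* the exit code cannot be trusted */
    /* Otherwise exit codes 0 and 1 count as orderly, as described above. */
    return !(WIFEXITED(status) &&
             (WEXITSTATUS(status) == 0 || WEXITSTATUS(status) == 1));
}
```

With this arrangement, a child started via start_backend(0,
well_behaved_backend) is reported as a clean exit, while one started via
start_backend(1, rogue_backend) is reported as a crash even though both
return exit code 0.  The parent only ever reads and writes single
sig_atomic_t values and takes no locks, which is the sense in which its
entanglement with shared memory stays minimal.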

Or maybe there's a better way.  Thoughts?
        regards, tom lane

