[moving to -hackers]

On Thu, Aug 19, 2010 at 9:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I suspect this is the same problem as bug #4897, and probably also the
> same problem as this:
> http://archives.postgresql.org/pgsql-bugs/2009-08/msg00114.php
>
> and maybe also this and this:
> http://archives.postgresql.org/pgsql-bugs/2010-02/msg00179.php
> http://archives.postgresql.org/pgsql-admin/2009-05/msg00105.php
>
> Unfortunately, it seems that no one has been able to get a stack trace yet.

Bruce pointed out yet another report of this problem to me:
http://archives.postgresql.org/pgsql-general/2010-08/msg00550.php

After some discussion with Magnus, I think what is going on here is
that the postmaster kicks off a new child process, which terminates
before it actually starts running our code, either in OS-supplied code
or some sort of "filter" like anti-spam or anti-virus software. It's
presumably NOT dying in our code because - at least AFAICS - we don't
exit(128) anywhere. One way we could possibly improve the situation
is to not treat this as a child crash - that is, don't do a
crash-and-restart cycle; just treat that backend as having done
elog(FATAL). The trick is that you need a reliable way to distinguish
between a regular child crash and an "early" child crash. Magnus
suggested perhaps we could create a mutex that the child grabs before
mapping shared memory; the postmaster could check whether the mutex
had been taken. If so, we handle the crash normally; if not, we just
chalk it up to experience and continue on.

This isn't really a "fix" for the bug in the sense that the nicest
thing of all would be to prevent the child from exiting abnormally in
the first place. But it's far from clear that we can control that.
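
To make the proposed decision concrete, here is a rough sketch of how
the postmaster side might classify a dead child. It is only an
illustration: every name in it (ChildExitAction, classify_child_exit,
startup_marker_taken) is hypothetical and does not exist in the tree,
and it assumes that exit codes 0 and 1 are the ordinary non-crash
exits. How the postmaster would actually probe whether the child took
the mutex is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical names throughout; none of these exist in PostgreSQL. */
typedef enum ChildExitAction
{
    CHILD_EXIT_NORMAL,          /* ordinary exit, nothing special to do */
    CHILD_EXIT_TREAT_AS_FATAL,  /* died "early": log it and carry on */
    CHILD_EXIT_CRASH_RESTART    /* may have touched shared memory */
} ChildExitAction;

/*
 * startup_marker_taken stands in for "the child grabbed the mutex
 * before mapping shared memory".  Assumption: exit codes 0 and 1 are
 * ordinary exits and never trigger a crash-and-restart cycle.
 */
ChildExitAction
classify_child_exit(int exit_code, bool startup_marker_taken)
{
    if (exit_code == 0 || exit_code == 1)
        return CHILD_EXIT_NORMAL;

    /* Child never got as far as our code: treat like elog(FATAL). */
    if (!startup_marker_taken)
        return CHILD_EXIT_TREAT_AS_FATAL;

    /* Child was running with shared memory attached: real crash. */
    return CHILD_EXIT_CRASH_RESTART;
}
```

With this shape, an exit(128) from a child that never took the marker
would come back as CHILD_EXIT_TREAT_AS_FATAL, while the same exit code
from a child that had taken it would still force the usual
crash-and-restart cycle.
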
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company