Re: [SQL] PostgreSQL crashes on me :( - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [SQL] PostgreSQL crashes on me :(
Date
Msg-id 21339.977111275@sss.pgh.pa.us
Responses Re: Re: [SQL] PostgreSQL crashes on me :(  (ncm@zembu.com (Nathan Myers))
Re: Re: [SQL] PostgreSQL crashes on me :(  (Ian Lance Taylor <ian@airs.com>)
List pgsql-hackers
Mathijs Brands <mathijs@ilse.nl> writes:
> We recently installed a small server for an external party to develop
> websites on. This machine, a K6-233 with 256 MB, is running FreeBSD 3.3
> and PostgreSQL 7.0.2 (maybe I'll upgrade to 7.0.3 tonight). The database
> it's running is about 2 MB in size and gets to process an estimated 10000
> to 25000 queries per day. Nothing special, I'd say.

> However, pgsql keeps crashing. It can take days, but pgsql will crash.
> It spits out the following error:

> ServerLoop: select failed: No child processes

Hm.  It seems fairly unlikely that select() would return error ECHILD,
which is what this message *appears* to imply.  The code is
        if (select(nSockets, &rmask, &wmask, (fd_set *) NULL,
                   (struct timeval *) NULL) < 0)
        {
            if (errno == EINTR)
                continue;
            fprintf(stderr, "%s: ServerLoop: select failed: %s\n",
                    progname, strerror(errno));
            return STATUS_ERROR;
        }

which seems pretty straightforward.  BUT: I think there's a race
condition here, at least on systems where errno is not saved and
restored around a signal handler.  Consider the following scenario:
1. Postmaster is waiting at the select() --- its normal state.

2. Postmaster receives a SIGCHLD signal due to backend exit, so
   it goes off and does the reaper() thing.  On return from
   reaper() the system arranges to return EINTR error from
   the select().

3. Before control can reach the "if (errno..." test, another
   SIGCHLD comes in.  reaper() is invoked again and does its
   thing.

The normal exit condition from reaper() will be errno == ECHILD,
because that's what the waitpid() or wait3() call will return after
all children are dealt with.  If the signal-handling mechanism allows
that to be returned to the mainline code, we have a failure.

Can any FreeBSD hackers comment on the plausibility of this theory?

A quick-and-dirty workaround would be to save and restore errno in
reaper() and the other postmaster signal handlers.  It might be
a better idea in the long run to avoid doing system calls in the
signal handlers --- but that would take a more substantial rewrite.

I seem to recall previous pghackers discussions in which
saving/restoring errno looked like a good idea.  Not sure why
it hasn't been done already.
        regards, tom lane

