Re: 9.4 HEAD: select() failed in postmaster - Mailing list pgsql-hackers

From MauMau
Subject Re: 9.4 HEAD: select() failed in postmaster
Date
Msg-id 53F0692AB35345348E29FB46F33127E3@maumau
In response to 9.4 HEAD: select() failed in postmaster  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: 9.4 HEAD: select() failed in postmaster
List pgsql-hackers
From: "Jeff Janes" <jeff.janes@gmail.com>
--------------------------------------------------
I've implemented the Min to Max change and done some more testing.  Now I
have a different but related problem (which I also saw before, but less
often than the select() one).  The 5-second clock doesn't get turned off.
So after all processes end, and a new startup is launched, if that startup
doesn't report back to the postmaster soon enough, it gets SIGKILLed.

postmaster.c near line 1681


        if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
            now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs to have an additional and-test of pmState, but
which states to test I don't really know.

I've added in "&& (pmState>PM_RUN)" and have not had any more failures, so
I think that this is on the right path, but testing an enum for inequality
feels wrong.
--------------------------------------------------


"AbortStartTime > 0" is also necessary to avoid sending SIGKILL repeatedly.
I sent the attached patch during the original discussion.  The fragment
below is relevant:


--- 1663,1688 ----
     TouchSocketLockFiles();
     last_touch_time = now;
    }
+
+   /*
+    * When postmaster got an immediate shutdown request
+    * or some child terminated abnormally (FatalError case),
+    * postmaster sends SIGQUIT to all children except
+    * syslogger and dead_end ones, then wait for them to terminate.
+    * If some children didn't terminate within a certain amount of time,
+    * postmaster sends SIGKILL to them and wait again.
+    * This resolves, for example, the hang situation where
+    * a backend gets stuck in the call chain:
+    * free() acquires some lock -> <received SIGQUIT> ->
+    * quickdie() -> ereport() -> gettext() -> malloc() -> <lock acquisition>
+    */
+   if (AbortStartTime > 0 &&  /* SIGKILL only once */
+    (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
+    now - AbortStartTime >= 10)
+   {
+    SignalAllChildren(SIGKILL);
+    AbortStartTime = 0;
+   }
   }
  }
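
To illustrate why the "AbortStartTime > 0" test matters, here is a minimal
standalone sketch of the same one-shot pattern.  It is not postmaster code:
EscalationState, begin_escalation() and maybe_escalate() are hypothetical
names invented for this example, the quit_requested flag merely stands in for
the Shutdown/FatalError test, and a counter replaces the real
SignalAllChildren(SIGKILL) call.  The timestamp doubles as an "armed" flag, so
clearing it after the first escalation keeps later passes of the loop from
firing again.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the postmaster's shutdown bookkeeping. */
typedef struct
{
    time_t abort_start_time;  /* 0 means "no escalation pending" */
    bool   quit_requested;    /* stands in for the Shutdown/FatalError test */
    int    sigkill_rounds;    /* how many times SIGKILL would have been sent */
} EscalationState;

/* Call at the moment SIGQUIT is sent to the children: arms the timer. */
static void
begin_escalation(EscalationState *st, time_t now)
{
    assert(st != NULL);
    st->abort_start_time = now;   /* must be nonzero to count as armed */
    st->quit_requested = true;
}

/*
 * One pass of the periodic check in the main loop.
 * Returns true if SIGKILL would be sent on this pass.
 */
static bool
maybe_escalate(EscalationState *st, time_t now, int timeout_secs)
{
    assert(st != NULL);

    if (st->abort_start_time > 0 &&     /* armed: escalate only once */
        st->quit_requested &&
        now - st->abort_start_time >= timeout_secs)
    {
        st->sigkill_rounds++;           /* real code would signal children */
        st->abort_start_time = 0;       /* disarm so later passes do nothing */
        return true;
    }
    return false;
}
```

With a 10 second timeout and begin_escalation() called at time 1000, a pass
at 1005 does nothing, a pass at 1010 escalates, and any later pass is a no-op
because the timestamp has been reset to 0.  Without the first condition, every
pass after the timeout would escalate again.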


Regards
MauMau

Attachment
