From: "Jeff Janes" <jeff.janes@gmail.com>
--------------------------------------------------
I've implemented the Min to Max change and did some more testing. Now I
have a different but related problem (which I also saw before, but less
often than the select() one). The 5 second clock doesn't get turned off.
So after all processes end, and a new startup is launched, if that startup
doesn't report back to the postmaster soon enough, it gets SIGKILLED.
postmaster.c near line 1681
if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)
It seems like this needs to have an additional and-test of pmState, but
which states to test I don't really know.
I've added in "&& (pmState>PM_RUN)" and have not had any more failures, so
I think that this is on the right path but testing an enum for inequality
feels wrong.
--------------------------------------------------
"AbortStartTime > 0" is also necessary to avoid sending SIGKILL repeatedly.
I sent the attached patch during the original discussion. The fragment
below is the relevant part:
--- 1663,1688 ----
TouchSocketLockFiles();
last_touch_time = now;
}
+
+ /*
+ * When postmaster got an immediate shutdown request
+ * or some child terminated abnormally (FatalError case),
+ * postmaster sends SIGQUIT to all children except
+ * syslogger and dead_end ones, then wait for them to terminate.
+ * If some children didn't terminate within a certain amount of time,
+ * postmaster sends SIGKILL to them and wait again.
+ * This resolves, for example, the hang situation where
+ * a backend gets stuck in the call chain:
+ * free() acquires some lock -> <received SIGQUIT> ->
+ * quickdie() -> ereport() -> gettext() -> malloc() -> <lock acquisition>
+ */
+ if (AbortStartTime > 0 && /* SIGKILL only once */
+ (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
+ now - AbortStartTime >= 10)
+ {
+ SignalAllChildren(SIGKILL);
+ AbortStartTime = 0;
+ }
}
}
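
To illustrate why the "AbortStartTime > 0" test matters, here is a minimal
standalone sketch of the same check. It is not postmaster code: PmSketch and
maybe_escalate() are hypothetical names, the flags only model Shutdown,
FatalError and SendStop, and a counter stands in for SignalAllChildren(SIGKILL).
The point is only that zeroing the start time after escalating disarms the
clock, so later passes through the loop do nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the postmaster globals; not the real definitions. */
typedef struct
{
    time_t abort_start_time;   /* 0 means no escalation is pending */
    bool   immediate_shutdown; /* models Shutdown == ImmediateShutdown */
    bool   fatal_error;        /* models FatalError */
    bool   send_stop;          /* models SendStop */
    int    sigkills_sent;      /* counts what would be SignalAllChildren(SIGKILL) */
} PmSketch;

/*
 * One pass of the periodic check. Returns true if this pass escalated
 * to SIGKILL. After escalating, the start time is reset to 0 so that
 * the condition cannot fire again until it is re-armed elsewhere.
 */
static bool
maybe_escalate(PmSketch *pm, time_t now, int timeout_secs)
{
    assert(timeout_secs > 0);

    if (pm->abort_start_time > 0 &&     /* SIGKILL only once */
        (pm->immediate_shutdown || (pm->fatal_error && !pm->send_stop)) &&
        now - pm->abort_start_time >= timeout_secs)
    {
        pm->sigkills_sent++;            /* would be SignalAllChildren(SIGKILL) */
        pm->abort_start_time = 0;       /* disarm the clock */
        return true;
    }
    return false;
}
```

With the start time at 100 and a 10 second timeout, a call at 105 does
nothing, a call at 110 escalates once, and every later call is a no-op
because the start time is 0 again. Without that reset, each later pass
would send SIGKILL again, and a newly launched startup process could be
hit by the stale clock.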
Regards
MauMau