On Tue, May 6, 2014 at 8:25 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>> What I'm inclined to do is change the logic so that:
>>
>> (1) After a crash-and-restart sequence, zero rw->rw_crashed_at, so
>> that anything which is still registered gets restarted immediately.
>
> Yes, that's quite obvious change which I missed completely :).
I've committed the portion of your patch which does this, with pretty
extensive changes. I moved the function which resets the crash times
to bgworker.c, renamed it, and made it just reset all of the crash
times unconditionally; there didn't seem to be any point in skipping
the irrelevant ones, so it seemed best to keep things simple. I also
moved the call from reaper() where you had it to
PostmasterStateMachine(), because the old placement did not work for
me when I carried out the following steps:
1. Kill a background worker with a 60-second restart time using
SIGTERM, so that it does exit(1).
2. Before it restarts, kill a regular backend with SIGQUIT.
Let me know if you think I've got it wrong.
>> (2) If a shmem-connected backend fails to release the deadman switch
>> or exits with an exit code other than 0 or 1, we crash-and-restart. A
>> non-shmem-connected backend never causes a crash-and-restart.
>
> +1
I did this as a separate commit,
e2ce9aa27bf20eff2d991d0267a15ea5f7024cd7, just moving the check for
ReleasePostmasterChildSlot inside the if statement. Then I realized
that was bogus, so I just pushed
eee6cf1f337aa488a20e9111df446cdad770e645 to fix that. Hopefully it's
OK now.
>> (3) When a background worker exits without triggering a
>> crash-and-restart, an exit code of precisely 0 causes the worker to be
>> unregistered; any other exit code has no special effect, so
>> bgw_restart_time controls.
>
> +1
This isn't done yet.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company