Hi,
On 2021-05-05 18:34:36 +0200, Magnus Hagander wrote:
> Is this really a problem we should fix ourselves? Most daemon-managers
> today will happily be configured to automatically restart a daemon on
> failure with a single setting since a long time now. E.g. in systemd
> (which most linuxen uses now) you just set Restart=on-failure (or
> maybe even Restart=always) and something like RestartSec=10.
I'm not convinced by this. For two main reasons:
1) Our own code can know a lot more about the different error types than
we can signal to systemd. The retry timeouts for e.g. a connection
failure (whatever) are different than for fsync failing (alarm
alarm). If we run out of space we might want to clean up space /
invoke a command to do so, but there's nothing equivalent for
systemd.
2) Do we really want to either implement at least 3 different ways to do
this kind of thing, or force users to do it over and over again?
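To make point 1) a bit more concrete, here is a rough Python sketch of
what a per-error-class retry policy could look like, as opposed to the
single restart delay a service manager applies to every failure. All of
the names (FailureKind, RetryPolicy, policy_for), the delay values, and
the "free-up-space" command are hypothetical and only for illustration;
none of this is existing postgres code.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class FailureKind(Enum):
    """Hypothetical error classes the program itself can tell apart."""
    CONNECTION_FAILURE = auto()  # transient, cheap to retry
    OUT_OF_SPACE = auto()        # retrying only helps after freeing space
    FSYNC_FAILURE = auto()       # serious, should not be retried blindly


@dataclass(frozen=True)
class RetryPolicy:
    retry: bool                            # retry automatically at all?
    delay_seconds: float                   # wait before the next attempt
    cleanup_command: Optional[str] = None  # command to run before retrying


# Illustrative values only (assumptions, not recommendations).
POLICIES = {
    FailureKind.CONNECTION_FAILURE: RetryPolicy(retry=True, delay_seconds=5.0),
    FailureKind.OUT_OF_SPACE: RetryPolicy(
        retry=True, delay_seconds=60.0, cleanup_command="free-up-space"
    ),
    FailureKind.FSYNC_FAILURE: RetryPolicy(retry=False, delay_seconds=0.0),
}


def policy_for(kind: FailureKind) -> RetryPolicy:
    """Look up how a given class of failure should be handled."""
    return POLICIES[kind]
```

An external supervisor only sees "the process exited" and so would have
to apply one delay to all three cases, whereas a lookup like
policy_for(FailureKind.OUT_OF_SPACE) lets the program choose a longer
delay and a cleanup step for that case alone.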
That's not to say that there's no space for handling "unexpected" errors
outside of postgres binaries, but I think it's pretty obvious that that
doesn't cover somewhat predictable types of errors.
And looking at the server side of things - it is *not* the same for
systemd to restart postgres, as postmaster doing so internally. The
latter can hold onto shared memory. Which e.g. with simple huge_pages
configurations is crucial, because it prevents other processes from using
that shared memory. And it accelerates restart by a lot - the kernel
needing to zero shared memory on first access (or allocation) can be a
very significant penalty.
Greetings,
Andres Freund