Re: postmaster recovery and automatic restart suppression - Mailing list pgsql-hackers

From: Czichy, Thoralf (NSN - FI/Helsinki)
Subject: Re: postmaster recovery and automatic restart suppression
Date:
Msg-id: 2CD972C575FD624E814506C4916F3E1F50E59C@FIESEXC035.nsn-intra.net
In response to: Re: postmaster recovery and automatic restart suppression (Alvaro Herrera <alvherre@commandprompt.com>)
Responses: Re: postmaster recovery and automatic restart suppression
           Re: postmaster recovery and automatic restart suppression
List: pgsql-hackers

hi,

I am working together with Harald on this issue. Below are some thoughts on
why we think it should be possible to disable the postmaster-internal
recovery attempt and instead have faults in the processes started by the
postmaster escalated to a postmaster exit.



[Our typical "embedded" situation]

* Database is small, 0.1 to 1 GB (e.g. we consider it the safest strategy
  to copy the whole database from the active to the standby before
  reconnecting the standby after a switchover or failover).

* Few clients only (10-100)

* There is no shared storage between the two instances (this means no
  concurrent access to shared resources and no isolation problems for
  shared resources)

* Switchover is fast, less than a few seconds

* Disk I/O is slow (no RAID, possibly (slow) flash-based)

* The same nodes that run the database also run lots of other
  functionality (some dependent on the DB, most not)



[Keep recovery decision and recovery action in cluster-HA-middleware]

Actually, the problem we're trying to solve is to keep the decision about
the best recovery strategy outside of the DB. In our use case this logic
is expressed in the cluster-HA-middleware, and recovery actions are
initiated by this middleware rather than by each individual piece of
software started by it; software is generally expected to "fail fast and
safe" in case of errors. As long as you trust the hardware and OS kernel,
a process exit is usually such a fail-fast-and-safe operation. It's "safe"
because process exit causes the kernel to release the resources the
process holds. It's also fast, though "fast" is a bit more debatable, as a
simple signal from the postmaster to the cluster middleware would probably
be faster. However, lacking such a signal, a SIGCHLD is the next best
thing.

The middleware can make decisions such as the following (all of this is
configurable, and postmaster health is _just_one_input_ of many to reach a
decision on the correct behavior):

Policy 1: By default, try to restart the active instance N times; after
          that, do a switchover.
Policy 2: If the active Postgres fails and the standby is available and
          up-to-date, do an immediate switchover. If the standby is not
          available, restart.
Policy 3: If the active Postgres fails, escalate the problem to node
          level, isolate the active node, and do the switchover to the
          standby.
Policy 4: In single-node systems, restart the db instance N times. If it
          fails more often than N times in X seconds, stop it and give an
          indication to the operator (SNMP trap to a management system,
          text message, ...) that something is seriously wrong and manual
          intervention is needed.

In the current setup we want to go for Policy 2. In earlier unrelated
products (not using PostgreSQL) we actually had policies 1, 3 and 4.
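
To make Policy 2 a bit more concrete, here is a rough sketch in Python.
It is purely illustrative, not our actual middleware; standby_available
and standby_up_to_date just stand in for whatever state the
cluster-HA-middleware really tracks about the standby.

    # Illustrative sketch of Policy 2 only.

    def recovery_action_policy2(standby_available, standby_up_to_date):
        """Pick the recovery action after the active instance has failed."""
        if standby_available and standby_up_to_date:
            # A usable standby exists: switch over instead of restarting
            # the failed active instance in place.
            return "switchover"
        # No usable standby: restarting the active instance is all we can do.
        return "restart_active"

    # Example: a crash with a healthy standby leads to an immediate switchover.
    assert recovery_action_policy2(True, True) == "switchover"
    assert recovery_action_policy2(False, False) == "restart_active"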

Another typical situation is that recovery behavior is different during
upgrades compared to the behavior during normal operation. E.g. when the
(new) database instance fails during an automatic schema conversion during
an upgrade, we would want to automatically fall back to the previous
version.



[STONITH is not always the best strategy if failures can be declared a
user-space software problem only; limit STONITH to HW/OS failures]

The isolation of the failing Postgres instance does not require a STONITH,
mainly because there is also other software running on the same node that
we'd not want to automatically switch over (e.g. because it takes longer
to do, or because that functionality is more critical or less critical).
Also, we generally trust the HW, OS kernel and cluster middleware to
behave correctly. These functions also follow the principle of
fail-fast-and-safe. This trust might be an assumption that not everybody
agrees with, though. So, if the failure originated from HW/OS/clusterware
it clearly is a STONITH situation, but if it's a user-space problem, the
default assumption is that isolation can be implemented at the OS level,
and that's a guarantee that the clusterware gives (using a separate quorum
mechanism to avoid split-brain situations).




[Example of user-space software failures]

So, what kind of failures would cause a user-space switchover rather than
node-level isolation? This gets a bit philosophical. If you assume that
many software failures are caused by concurrency issues, switching over to
the standby is actually a good strategy, as it's unlikely that the same
concurrency issue happens again on the standby. Another reason for
software failures is entering exceptional situations, such as the disk
getting full, overload on the node (caused by some other process), a
backup being taken, an upgrade conversion, etc. So here the idea is that
failover to a standby instance helps as long as there's some hope that on
the standby side the situation is different. If we just had an internal
Postgres restart in such situations, we'd have flapping db connectivity -
without the operator even being aware of it (awareness of problem
situations is also something that the cluster-HA-middleware takes care
of).



[Possible implementation options]

I see only two solutions to allow an external cluster-HA-middleware to
make recovery decisions:
  (1) the postmaster process exits if it detects any unpredicted failure, or
  (2) the postmaster provides an interface to notify about software
      failures (i.e. the case where it goes into postmaster
      re-initializing).

In case (2) it would be the cluster-HA-middleware that isolates the
postmaster process, e.g. by SIGKILL-ing all related processes and
forcefully releasing all shared resources that it uses. However, I favor
case (1), as long as we keep the logic that runs within the postmaster
when it detects a backend process failure as simple as possible - meaning
force-stop all postgres processes (SIGKILL), wait for the SIGCHLD from
them, and exit (should only take a few milliseconds).
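
To illustrate what case (1) looks like from the outside, here is a minimal
supervisor sketch in Python. It is not our real middleware; the postgres
invocation, the data directory and the escalate_to_switchover() hook are
placeholders.

    # Minimal supervisor sketch for case (1): the postmaster exits on any
    # unpredicted failure and the middleware reacts to that exit.

    import subprocess

    def escalate_to_switchover():
        # Placeholder: hand the failure to the cluster-HA-middleware,
        # which applies the configured policy (switchover, restart, ...).
        print("postmaster exited; escalating to cluster-HA-middleware")

    def supervise_postmaster(pgdata):
        # Run the postmaster as a direct child, so its exit is observed
        # the same way a SIGCHLD would be.
        proc = subprocess.Popen(["postgres", "-D", pgdata])
        rc = proc.wait()  # blocks until the postmaster exits
        escalate_to_switchover()
        return rc

    if __name__ == "__main__":
        supervise_postmaster("/var/lib/pgsql/data")  # placeholder data dir

The point is simply that with the internal restart suppressed, the exit
itself is the notification; no extra protocol between the postmaster and
the middleware is needed.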


[Question]

So the question remains: Is this behavior, and the most likely addition of
a postgresql.conf "automatic_restart_after_crash = on" setting, something
that completely goes against the Postgres philosophy, or is it something
that, once implemented, would be acceptable to have in the main Postgres
code base?


Thoralf

