Re: postmaster recovery and automatic restart suppression - Mailing list pgsql-hackers
From | Czichy, Thoralf (NSN - FI/Helsinki)
Subject | Re: postmaster recovery and automatic restart suppression
Date |
Msg-id | 2CD972C575FD624E814506C4916F3E1F50E59C@FIESEXC035.nsn-intra.net
In response to | Re: postmaster recovery and automatic restart suppression (Alvaro Herrera <alvherre@commandprompt.com>)
Responses | Re: postmaster recovery and automatic restart suppression
 | Re: postmaster recovery and automatic restart suppression
List | pgsql-hackers
hi,

I am working together with Harald on this issue. Below are some thoughts on why we think it should be possible to disable the postmaster-internal recovery attempt and instead have faults in the processes started by the postmaster escalated to a postmaster exit.

[Our typical "embedded" situation]

* Database is small, 0.1 to 1 GB (e.g. we consider it the safest strategy to copy the whole database from the active to the standby before reconnecting the standby after a switchover or failover).
* Few clients only (10-100).
* There is no shared storage between the two instances (this means no concurrent access to shared resources and no isolation problems for shared resources).
* Switchover is fast, less than a few seconds.
* Disk I/O is slow (no RAID, possibly (slow) flash-based).
* The same nodes running the database also run lots of other functionality (some dependent on the DB, most not).

[Keep recovery decision and recovery action in cluster-HA-middleware]

Actually, the problem we are trying to solve is to keep the decision about the best recovery strategy outside of the DB. In our use case this logic is expressed in the cluster-HA-middleware, and recovery actions are initiated by this middleware rather than by each individual piece of software started by it; software is generally expected to "fail fast and safe" in case of errors. As long as you trust the hardware and OS kernel, a process exit is usually such a fail-fast-and-safe operation. It is "safe" because a process exit causes the kernel to release the resources the process holds. It is also fast, though "fast" is a bit more debatable, as a simple signal from the postmaster to the cluster middleware would probably be faster. However, lacking such a signal, a SIGCHLD is the next best thing.

The middleware can then make decisions such as the following (all of this is configurable, and postmaster health is _just_one_input_ of many to reach a decision on the correct behavior):

Policy 1: By default, try to restart the active instance N times; after that, do a switchover.
Policy 2: If the active Postgres fails and the standby is available and up-to-date, do an immediate switchover. If the standby is not available, restart.
Policy 3: If the active Postgres fails, escalate the problem to node level, isolate the active node and do the switchover to the standby.
Policy 4: In single-node systems, restart the DB instance N times. If it fails more often than N times in X seconds, stop it and give an indication to the operator (SNMP trap to the management system, text message, ...) that something is seriously wrong and manual intervention is needed.

In the current setup we want to go for Policy 2. In earlier, unrelated products (not using PostgreSQL) we actually had policies 1, 3 and 4. (A rough sketch of such a supervision loop follows below.)

Another typical situation is that recovery behavior during upgrades differs from the behavior during normal operation. E.g. when the (new) database instance fails during an automatic schema conversion as part of an upgrade, we would want to automatically fall back to the previous version.
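To make the middleware side of this concrete, below is a rough, untested sketch of the kind of supervision loop we have in mind for Policy 2. The postmaster command line and the standby_is_up_to_date() / trigger_switchover() hooks are made-up placeholders for whatever the middleware actually provides; the only thing this model asks of Postgres is that the postmaster exits on failure, so that the parent gets a SIGCHLD.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder policy hooks -- in reality these talk to the middleware. */
static bool standby_is_up_to_date(void) { return true; }
static void trigger_switchover(void)    { puts("switching over to the standby"); }

static pid_t
start_postmaster(void)
{
    pid_t pid = fork();

    if (pid == 0)
    {
        /* Child: exec the postmaster; path and arguments are illustrative. */
        execlp("postgres", "postgres", "-D", "/var/lib/pgsql/data", (char *) NULL);
        _exit(127);              /* exec failed */
    }
    return pid;
}

int
main(void)
{
    pid_t postmaster = start_postmaster();

    for (;;)
    {
        int   status;
        /* Block until the postmaster exits; that exit (and the resulting
         * SIGCHLD) is the only failure indication this model relies on. */
        pid_t dead = waitpid(postmaster, &status, 0);

        if (dead < 0)
            break;

        /* Policy 2: prefer a switchover if the standby can take over,
         * otherwise restart the local instance. */
        if (standby_is_up_to_date())
        {
            trigger_switchover();
            break;
        }
        postmaster = start_postmaster();
    }
    return 0;
}
```

In practice the policy itself lives in the middleware's configuration rather than in code; the sketch is only meant to show how little glue is needed once postmaster exit is the failure indication.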
[STONITH is not always the best strategy if failures can be declared a user-space software problem only; limit STONITH to HW/OS failures]

The isolation of the failing Postgres instance does not require STONITH, mainly because there is also other software running on the same node that we would not want to switch over automatically (e.g. because it takes longer to do, or because the functionality is more critical or less critical). Also, we generally trust the HW, OS kernel and cluster middleware to behave correctly; these functions also follow the principle of fail-fast-and-safe. This trust might be an assumption that not everybody agrees with, though. So, if the failure originated from HW/OS/clusterware it clearly is a STONITH situation, but if it is a user-space problem, the default assumption is that isolation can be implemented on the OS level, and that is a guarantee the clusterware gives (using a separate quorum mechanism to avoid split-brain situations).

[Example of user-space software failures]

So, what kind of failures would cause a user-space switchover rather than node-level isolation? This gets a bit philosophical. If you assume that many software failures are caused by concurrency issues, switching over to the standby is actually a good strategy, as it is unlikely that the same concurrency issue happens again on the standby. Another cause of software failures is entering exceptional situations, such as the disk getting full, overload on the node (caused by some other process), a backup being taken, an upgrade conversion, etc. So the idea is that failover to a standby instance helps as long as there is some hope that the situation on the standby side is different. If we just had an internal Postgres restart in such situations, we would have flapping DB connectivity, without the operator even being aware of it (awareness of problem situations is also something that the cluster HA middleware takes care of).

[Possible implementation options]

I see only two solutions that allow an external cluster-HA-middleware to make recovery decisions: (1) the postmaster process exits if it detects any unpredicted failure, or (2) the postmaster provides an interface to notify about software failures (i.e. the case where it goes into postmaster re-initializing). In case (2) it would be the cluster-HA-middleware that isolates the postmaster process, e.g. by SIGKILL-ing all related processes and forcefully releasing all shared resources it uses. However, I favor case (1), as long as we keep the logic that runs within the postmaster when it detects a backend process failure as simple as possible, meaning: force-stop all Postgres processes (SIGKILL), wait for the SIGCHLD from each of them and exit (this should only take a few milliseconds). A rough sketch of that exit path is in the PS below.

[Question]

So the question remains: is this behavior, and the most likely addition of a postgresql.conf setting "automatic_restart_after_crash = on", something that completely goes against the Postgres philosophy, or is it something that, once implemented, would be acceptable to have in the main Postgres code base?

Thoralf
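PS: To make case (1) a bit more concrete, here is a rough, untested sketch of how small that crash path could be when restart is left to the middleware. This is not actual postmaster code: "automatic_restart_after_crash" is only the GUC name proposed above, and the child bookkeeping and the fake main() are placeholders for illustration.

```c
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static bool automatic_restart_after_crash = false;  /* proposed GUC */

static pid_t child_pids[16];    /* stand-in for the postmaster's child list */
static int   n_children = 0;

/* What the postmaster would do on detecting a crashed backend. */
static void
handle_backend_crash(void)
{
    if (automatic_restart_after_crash)
    {
        /* keep today's behavior: stop the children and re-initialize */
        return;
    }

    /* Proposed behavior: fail fast and safe. Force-stop every child,
     * reap them, then exit so the middleware sees a SIGCHLD and owns
     * the recovery decision (restart, switchover, escalation, ...). */
    for (int i = 0; i < n_children; i++)
        kill(child_pids[i], SIGKILL);

    while (waitpid(-1, NULL, 0) > 0)
        ;                        /* collect all children */

    exit(EXIT_FAILURE);
}

int
main(void)
{
    /* Fake "backends": two sleeping children, just to make this runnable. */
    for (int i = 0; i < 2; i++)
    {
        pid_t pid = fork();

        if (pid == 0)
        {
            pause();             /* pretend to be a backend */
            _exit(0);
        }
        child_pids[n_children++] = pid;
    }

    printf("simulating a backend crash...\n");
    handle_backend_crash();      /* does not return with restart disabled */
    return 0;
}
```

The only externally visible effect with the setting off is the SIGCHLD the middleware receives, which is exactly the fail-fast-and-safe contract described above.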