2.9. Fault Tolerance and High Availability
Shardman provides fault tolerance out of the box. The shardmand daemon monitors the cluster configuration and manages stolon clusters, which provide high availability and fault tolerance for all shards. The common Shardman configuration (shardmand, stolon clusters) is stored in an etcd cluster.
To ensure fault tolerance for each stolon cluster, you must set Repfactor > 0 in the cross-replication mode (PlacementPolicy=cross) or add at least one replica in the manual-topology mode (PlacementPolicy=manual).
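The rule above can be written as a small check. The following Python sketch is illustrative only: the function name is_fault_tolerant and the manual_replicas argument are hypothetical, and only the Repfactor and PlacementPolicy names come from the text.

```python
def is_fault_tolerant(spec: dict, manual_replicas: int = 0) -> bool:
    """Return True if each shard would have at least one replica.

    `spec` uses the Repfactor and PlacementPolicy names from the text.
    `manual_replicas` is a hypothetical input: the number of replicas
    added by hand per shard in the manual-topology mode.
    """
    policy = spec.get("PlacementPolicy")
    if policy == "cross":
        # Cross-replication mode: fault tolerance requires Repfactor > 0.
        return spec.get("Repfactor", 0) > 0
    if policy == "manual":
        # Manual-topology mode: at least one replica must have been added.
        return manual_replicas > 0
    raise ValueError(f"unknown PlacementPolicy: {policy!r}")


print(is_fault_tolerant({"PlacementPolicy": "cross", "Repfactor": 1}))  # True
print(is_fault_tolerant({"PlacementPolicy": "manual"}))                 # False
```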
stolon sentinels are responsible for observing the keepers and carrying out elections to choose one of the keepers as the master. Sentinels hold elections when the cluster starts and every time the current master keeper goes down.
One of the keepers is elected as the master. All write operations take place on the master, and the other instances act as followers.
During automatic failover, stolon promotes a standby to master and demotes the failed master to standby. The only additional requirement is etcd, which stolon uses to store information about the master and standby instances.
If necessary, you can switch to a new master manually by running the shardmanctl shard switch command.
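The election and failover behavior described above can be pictured with a toy model. The sketch below is not stolon's actual election logic; the Keeper class and the elect_master function are hypothetical names used only to illustrate "keep a healthy master, otherwise promote a healthy standby and demote the failed master".

```python
from dataclasses import dataclass


@dataclass
class Keeper:
    name: str
    healthy: bool = True
    role: str = "standby"


def elect_master(keepers: list[Keeper]) -> Keeper:
    """Toy election: keep a healthy master, else promote the first healthy standby."""
    current = next((k for k in keepers if k.role == "master"), None)
    if current is not None and current.healthy:
        return current
    candidate = next((k for k in keepers if k.healthy and k is not current), None)
    if candidate is None:
        raise RuntimeError("no healthy keeper available")
    if current is not None:
        current.role = "standby"  # the failed master becomes a standby
    candidate.role = "master"
    return candidate


keepers = [Keeper("keeper0"), Keeper("keeper1")]
print(elect_master(keepers).name)  # keeper0 (election at cluster start)
keepers[0].healthy = False         # the current master goes down
print(elect_master(keepers).name)  # keeper1 (new election)
```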
Automatic failover relies on timeouts, which can be overridden in sdmspec.json, as in the following example:
{
  "ShardSpec":{
    "failInterval": "20s",
    "sleepInterval": "5s",
    "convergenceTimeout": "30s",
    "deadKeeperRemovalInterval": "48h",
    "requestTimeout": "10s",
    ...
  },
  ...
},
...
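When inspecting such a specification from a script, the timeout strings need to be turned into durations. The following Python sketch assumes the values use a number followed by a unit (ms, s, m, h), possibly repeated as in "1h30m"; that syntax is an assumption based on the values shown above, and parse_duration is a hypothetical helper, not part of Shardman.

```python
import json
import re
from datetime import timedelta

# Assumed unit suffixes; only "s" and "h" appear in the example above.
_UNITS = {"ms": "milliseconds", "s": "seconds", "m": "minutes", "h": "hours"}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "20s", "48h" or "1h30m" into a timedelta."""
    pos, total = 0, timedelta()
    for match in _PART.finditer(text):
        if match.start() != pos:
            break  # unparsed characters between two parts
        value, unit = match.groups()
        total += timedelta(**{_UNITS[unit]: float(value)})
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


# The timeout values from the example, without the elided parts.
shard_spec = json.loads("""
{
  "failInterval": "20s",
  "sleepInterval": "5s",
  "convergenceTimeout": "30s",
  "deadKeeperRemovalInterval": "48h",
  "requestTimeout": "10s"
}
""")
timeouts = {name: parse_duration(value) for name, value in shard_spec.items()}
print(timeouts["deadKeeperRemovalInterval"].total_seconds())  # 172800.0
```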
You can specify some high-availability options to define cluster behavior in a fault state: masterDemotionEnabled, masterDemotionTimeout, minSyncMonitorEnabled and minSyncMonitorUnhealthyTimeout.
2.9.1. Timeouts
- convergenceTimeout
  Interval to wait for a database to converge to the required state when no long operations are expected. Default: 30s.
- deadKeeperRemovalInterval
  Interval after which a dead keeper is removed from the cluster data. Default: 48h.
- failInterval
  Interval after the first failure after which a keeper or a database is declared not healthy. Default: 20s.
- requestTimeout
  Time after which any request (keeper checks from a sentinel, etc.) fails. Default: 10s.
- sleepInterval
  Interval to wait before the next check. Default: 5s.
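To see how failInterval and sleepInterval combine, consider a simplified model in which checks run every sleepInterval and a keeper is declared not healthy once failInterval has passed since its first failed check. This is only a sketch built from the definitions above; it ignores requestTimeout and does not reproduce the actual sentinel implementation, and the function name is hypothetical.

```python
def time_declared_unhealthy(down_at: float,
                            fail_interval: float = 20.0,
                            sleep_interval: float = 5.0) -> float:
    """Return the check time (in seconds) at which a keeper is declared not healthy.

    Simplified model: checks run at t = 0, sleep_interval, 2 * sleep_interval, ...
    The keeper stops responding at `down_at` and never recovers.
    """
    if sleep_interval <= 0:
        raise ValueError("sleep_interval must be positive")
    t = 0.0
    first_failure = None
    while True:
        if t >= down_at:  # this check fails
            if first_failure is None:
                first_failure = t
            if t - first_failure >= fail_interval:
                return t
        t += sleep_interval


# With the defaults, a keeper that goes down at t = 2s first fails the check
# at t = 5s and is declared not healthy at the check at t = 25s.
print(time_declared_unhealthy(down_at=2.0))  # 25.0
```

In this model, the delay between the actual failure and the declaration is at most failInterval plus one sleepInterval, so lowering either value shortens detection at the cost of reacting to shorter outages.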