shardmand
shardmand — Postgres Pro Shardman configuration daemon
Postgres Pro Shardman Start-up with shardmand #
shardmand is a Postgres Pro Shardman configuration daemon. It runs on each node in a Shardman cluster, generates a boot_uuid at every startup, subscribes for changes of shardman/cluster0/data/ladle and shardman/cluster0/data/cluster keys in the etcd store (cluster0 is the default cluster name used by Shardman utils), and manages Postgres Pro Shardman processes on the node where it is running according to the configuration described in these JSON documents.
shardmand manages integrated keepers and sentinels. On startup and when one of the monitored etcd keys changes, shardmand reconfigures them as follows:
- It calculates the expected node configuration, i.e., the list of keepers and sentinels expected to run and their configurations, from the shardman/cluster0/data/ladle and shardman/cluster0/data/cluster values.
- It receives the list of running keepers and sentinels with their configurations from the internal process manager.
- It stops processes that are not expected to run. This can be a process that belongs to a cluster with the same name, but a different UUID, or a process whose description is no longer present in the expected node configuration. For keeper processes, shardmand purges their data directory.
- If a process should be running, but its settings are different from the expected ones, shardmand updates the configuration and restarts the process. If a process should be running, but it is not running, shardmand starts it.
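The reconfiguration steps above can be sketched as a small reconciliation function. This is only an illustration of the described logic, not shardmand's actual implementation: the Proc type, its fields, the action names, and the choice to start a fresh process after stopping one with a mismatched UUID are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """Hypothetical description of one managed process."""
    name: str             # illustrative identifier, e.g. "k1"
    kind: str             # "keeper" or "sentinel"
    cluster_uuid: str     # UUID of the cluster the process belongs to
    settings: tuple = ()  # hashable stand-in for the process configuration


def reconcile(expected, running):
    """Return the (action, process_name) pairs needed to reach `expected`."""
    want = {p.name: p for p in expected}
    actions = []
    kept = set()
    for proc in running:
        target = want.get(proc.name)
        if target is None or target.cluster_uuid != proc.cluster_uuid:
            # Not expected to run: unknown process or a different cluster UUID.
            actions.append(("stop", proc.name))
            if proc.kind == "keeper":
                actions.append(("purge_data_dir", proc.name))
            continue
        kept.add(proc.name)
        if target.settings != proc.settings:
            # Expected to run, but with different settings.
            actions.append(("reconfigure_and_restart", proc.name))
    for proc in expected:
        if proc.name not in kept:
            # Expected to run, but not running (or just stopped above).
            actions.append(("start", proc.name))
    return actions


expected = [
    Proc("k1", "keeper", "u2", (("port", 5432),)),
    Proc("s1", "sentinel", "u2"),
    Proc("k2", "keeper", "u2"),
]
running = [
    Proc("k1", "keeper", "u2", (("port", 5433),)),
    Proc("k0", "keeper", "u1"),
    Proc("s1", "sentinel", "u2"),
]
plan = reconcile(expected, running)
```

With this sample input, k1 is restarted with new settings, the stale keeper k0 is stopped and its data directory purged, s1 is left alone, and k2 is started.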
Also, a separate shardmand thread periodically updates the shardman/cluster0/data/shardmand/NODENAME etcd key with the ClusterUUID of the last cluster to which the configuration was applied. This lets the shardmanctl nodes add command ensure, before it tries to initialize new clusters for a clover, that no live threads from a previous cluster configuration remain on any node in the clover.
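For reference, the etcd keys mentioned so far can be written out as a small helper. This is a sketch under one assumption: that a non-default cluster name simply replaces cluster0 in the key path. The function name and the dictionary layout are hypothetical.

```python
def etcd_keys(cluster_name="cluster0", node_name=None):
    """Build the etcd key names named in the text for a given cluster."""
    base = f"shardman/{cluster_name}/data"
    keys = {
        "ladle": f"{base}/ladle",      # watched by shardmand
        "cluster": f"{base}/cluster",  # watched by shardmand
    }
    if node_name is not None:
        # Per-node key that shardmand updates with the last applied ClusterUUID.
        keys["shardmand"] = f"{base}/shardmand/{node_name}"
    return keys


default_keys = etcd_keys()
node_keys = etcd_keys("cluster0", "node1")
```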
Additionally, shardmand starts two HTTP servers in separate threads. If the server ports match, a single server running both roles is started. The first server provides the following metrics: shardmand_etcd_unavailable_time_seconds, shardmand_healthy_keepers, shardmand_sentinels, shardmand_uptime, shardmand_etcd_errors_total, shardmand_reconfigurations_number_total, shardmand_demotions_number_total. This server also provides a /healthz endpoint for the shardmand health check. The second server provides the following endpoints:
- /shardmand/v1/replica: returns the 200 status code if a secondary instance is running on the node, and the 500 status code if a primary instance is running on the node.
- /shardmand/v1/master: returns the 200 status code if a primary instance is running on the node, and the 500 status code if a secondary instance is running on the node.
- /shardmand/v1/referee: returns the 200 status code if the only instance running on the node is a referee, the 500 status code if the only instance running on the node is a primary or a secondary, and 404 if more than one instance is running. If both primary and secondary instances are running on the node, the /shardmand/v1/replica and /shardmand/v1/master endpoints return the 404 status code.
- /shardmand/v1/status: returns information about the shardmand status, including boot_uuid.
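The status-code rules for the role endpoints can be summarized as a lookup function. This is a sketch of the documented cases only, not the server's code: the function name and the role strings are illustrative, and combinations the text does not describe (for example, a node with no instances) return None instead of a guessed code.

```python
def role_endpoint_status(endpoint, instances):
    """Map the instance roles on a node to the documented status code.

    `instances` holds one role string per running instance:
    "primary", "secondary" or "referee".
    """
    instances = list(instances)
    roles = set(instances)
    if endpoint in ("/shardmand/v1/replica", "/shardmand/v1/master"):
        if {"primary", "secondary"} <= roles:
            return 404  # both a primary and a secondary run on the node
        is_replica = endpoint.endswith("/replica")
        good = "secondary" if is_replica else "primary"
        bad = "primary" if is_replica else "secondary"
        if good in roles:
            return 200
        if bad in roles:
            return 500
        return None  # not described in the text
    if endpoint == "/shardmand/v1/referee":
        if len(instances) > 1:
            return 404
        if instances == ["referee"]:
            return 200
        if instances and instances[0] in ("primary", "secondary"):
            return 500
        return None  # not described in the text
    raise ValueError(f"unknown endpoint: {endpoint}")
```

A load balancer health check could use the same rules in reverse: a 200 response from /shardmand/v1/master identifies a node that currently runs a primary instance.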
All Postgres Pro Shardman services are managed by shardmand@cluster0.service, so when it is started, stopped, or restarted, it also starts, stops, or restarts all other Postgres Pro Shardman processes (including DBMS instances).
shardmand Syntax #
shardmand [common_options] [ --system-bus ] [ --user user_name ]
Here common_options are:
[ --cluster-name cluster_name ] [ --log-level error | warn | info | debug ] [ --retries retries_number ] [ --session-timeout seconds ] [ --store-endpoints store_endpoints ] [ --store-ca-file store_ca_file ] [ --store-cert-file store_cert_file ] [ --store-key client_private_key ] [ --store-timeout duration ] [ --version ] [ -h | --help ] [ --log-format ]
Command-line Reference #
This section describes shardmand-specific command-line options.
Common Options #
shardmand common options are optional parameters that are not specific to the utility. They specify etcd connection settings, the cluster name, and a few more settings. By default, shardmand tries to connect to the etcd store at 127.0.0.1:2379 and uses the cluster0 cluster name. The default log level is info.
-
-h, --help# Show brief usage information.
-
--cluster-name cluster_name# Specifies the name for a cluster to operate on. The default is cluster0.
-
--log-level level# Specifies the log verbosity. Possible values of level are (from minimum to maximum): error, warn, info and debug. The default is info.
-
--retries number# Specifies how many times shardmanctl retries a failing etcd request. If an etcd request fails, most likely, due to a connectivity issue, shardmanctl retries it the specified number of times before reporting an error. The default is 5.
-
--session-timeout seconds# Specifies the session timeout for shardmanctl locks. If there is no connectivity between shardmanctl and the etcd store for the specified number of seconds, the lock is released. The default is 30.
-
--store-endpoints string# Specifies the etcd address in the format: http[s]://address[:port](,http[s]://address[:port])*. The default is http://127.0.0.1:2379.
-
--store-ca-file string# Verify the certificate of the HTTPS-enabled etcd store server using this CA bundle.
-
--store-cert-file string# Specifies the certificate file for client identification by the etcd store.
-
--store-key string# Specifies the private key file for client identification by the etcd store.
-
--store-timeout duration# Specifies the timeout for an etcd request. The default is 5 seconds.
-
--monitor-port number# Specifies the port for the shardmand HTTP server for metrics and probes. The default is 15432.
-
--api-port number# Specifies the port for the shardmand HTTP API server. The default is 15432.
-
--version# Show shardman-utils version information.
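The --store-endpoints format can be checked before the value is passed to shardmand. The following is a minimal sketch, not part of Shardman: the helper name is hypothetical, and the address pattern is a simplification that accepts host names and IPv4 addresses but not bracketed IPv6 literals.

```python
import re

# One endpoint: http[s]://address[:port]
_ENDPOINT = r"https?://[^\s,:/]+(?::\d+)?"
# Full value: one endpoint followed by zero or more comma-separated endpoints.
_ENDPOINTS_RE = re.compile(rf"{_ENDPOINT}(?:,{_ENDPOINT})*")


def split_store_endpoints(value):
    """Validate a --store-endpoints value and return its endpoints as a list."""
    if not _ENDPOINTS_RE.fullmatch(value):
        raise ValueError(f"malformed store endpoints: {value!r}")
    return value.split(",")


single = split_store_endpoints("http://127.0.0.1:2379")
several = split_store_endpoints("https://etcd1:2379,https://etcd2:2379")
```

A value without a scheme, such as 127.0.0.1:2379, is rejected by this check because the documented format requires http:// or https://.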
shardmand Environment #
A shardmand service reads the environment from /etc/shardman/shardmand-cluster0.env. The following environment variables affect the behavior of shardmand.
-
SDM_CLUSTER_NAME# An alternative to setting the --cluster-name option.
-
SDM_LOG_LEVEL# An alternative to setting the --log-level option.
-
SDM_RETRIES# An alternative to setting the --retries option.
-
SDM_SYSTEM_BUS# An alternative to setting the --system-bus option.
-
SDM_STORE_ENDPOINTS# An alternative to setting the --store-endpoints option.
-
SDM_STORE_CA_FILE# An alternative to setting the --store-ca-file option.
-
SDM_STORE_CERT_FILE# An alternative to setting the --store-cert-file option.
-
SDM_STORE_KEY# An alternative to setting the --store-key option.
-
SDM_STORE_TIMEOUT# An alternative to setting the --store-timeout option.
-
SDM_SESSION_TIMEOUT# An alternative to setting the --session-timeout option.
-
SDM_USER# An alternative to setting the --user option.
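The option-to-variable correspondence listed above can be used to generate the contents of an environment file. This is a sketch under the assumption that the file uses plain KEY=value lines, as systemd environment files commonly do; the render_env helper is hypothetical and the values shown are examples only.

```python
# Mapping taken from the list of environment variables above.
OPTION_TO_ENV = {
    "--cluster-name": "SDM_CLUSTER_NAME",
    "--log-level": "SDM_LOG_LEVEL",
    "--retries": "SDM_RETRIES",
    "--system-bus": "SDM_SYSTEM_BUS",
    "--store-endpoints": "SDM_STORE_ENDPOINTS",
    "--store-ca-file": "SDM_STORE_CA_FILE",
    "--store-cert-file": "SDM_STORE_CERT_FILE",
    "--store-key": "SDM_STORE_KEY",
    "--store-timeout": "SDM_STORE_TIMEOUT",
    "--session-timeout": "SDM_SESSION_TIMEOUT",
    "--user": "SDM_USER",
}


def render_env(options):
    """Render {option: value} as KEY=value lines (assumed file format).

    Raises KeyError for options that have no listed environment variable.
    """
    lines = [f"{OPTION_TO_ENV[opt]}={value}" for opt, value in options.items()]
    return "\n".join(lines) + "\n"


env_text = render_env({
    "--cluster-name": "cluster0",
    "--log-level": "debug",
    "--store-endpoints": "http://127.0.0.1:2379",
})
```

Options without a listed variable, such as --monitor-port and --api-port, are deliberately left out of the mapping, so passing them to this helper raises KeyError.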