sdmspec.json
sdmspec.json — Shardman initialization file
Synopsis
sdmspec.json
Description
shardmanctl uses the sdmspec.json configuration file during Shardman cluster initialization. The shardman-utils package provides a sample configuration file.
The sdmspec.json file contains the basic filesystem paths used by Shardman, global cluster settings, database-related settings (such as administrative and replication user logins and the authentication method), FDW parameters, and the shard configuration (ShardSpec).
Note that there are a number of internal Shardman parameters that, if modified by the user, can result in total cluster failure. These parameters are:
shardman.cluster_uuid: defines the version of the running cluster that the node belongs to.
shardman.config_uuid: defines the configuration version. Ignored if set via shardmanctl config update or shardmanctl init.
shardman.manual_execution: controls consistent work with global objects.
shardman.silk_never_restart: prohibits restarting the multiplexer workers in case of an error.
shardman.pre_promote_mode: applies the consistent promotion mechanism (from standby to primary).
List of Parameters
Repfactor #
Integer determining how many replicas shardmanctl should configure for each DBMS. This setting can only be changed for a Shardman cluster with the manual topology mode.

PlacementPolicy #
String determining the policy of placing DBMS instances. Currently, only the cross and manual placement policies are supported. The former value clover is used as an alias for the cross policy.
With the cross placement policy, nodes are grouped in clovers, where each node runs the master DBMS server and replicas for all other nodes in the clover. The number of nodes in a clover is determined by Repfactor and equals Repfactor + 1.
The manual placement policy allows you to manually add/remove the required number of replicas to/from the specified replication groups. In this case, Repfactor is only used for recommendation purposes and does not impose restrictions.

DataDir #
Allows you to specify a directory other than the default one (/var/lib/pgpro/sdm-14/data) for storing data. This parameter cannot be changed after the cluster has been initialized.

PGsInitialPort #
Ports starting with this integer are assigned to PostgreSQL instances. This parameter cannot be changed after the cluster has been initialized.

SilkInitialPort #
Ports starting with this integer are assigned to Silk (Shardman InterLinK) instances. This parameter cannot be changed after the cluster has been initialized.

AuthMethod #
Authentication method used by the administrative user to connect to the DBMS. Can be any authentication method supported by PostgreSQL. scram-sha-256 is currently recommended; md5 is currently allowed but not recommended. This parameter cannot be changed after the cluster has been initialized. Located under a separate Users block for each array element.
Default: trust.

Groups #
An array that can have two possible values: su for superuser or repl for replication.

HTTP #
Defines settings for the secure HTTP/HTTPS connection, with Port being the API port and PortMetrics being the port for metrics. If these ports are the same, the API and metrics listen on the same port.
Default: 15432.

Name #
Name of the user. Created on cluster initialization. Defaults to the name of the effective user running shardmanctl init. This parameter cannot be changed after the cluster has been initialized. Located under a separate Users block for each array element.

Password #
Password for the user. Can be changed using shardmanctl config update credentials. Located under a separate Users block for each array element.

PgSuSSLCert #
Client certificate for the administrative DBMS user.

PgSSLRootCert #
Location of the root certificate file for the DBMS user connection.

PgSuSSLKey #
Client private key for the administrative DBMS user.

PgSSLMode #
SSL mode for the DBMS user. Allowed values: verify-ca and verify-full.

PgReplSSLCert #
Client certificate for the replication DBMS user.

PgReplSSLKey #
Client private key for the replication DBMS user.

ShardSpec #
Shard cluster specification. For more details, see ShardSpec Parameters. Can be changed using shardmanctl config update.

FDWOptions #
This object contains FDW settings. These settings can be changed using shardmanctl config update (with the exception of settings related to authorization, server connection, SSL, and Kerberos, as well as the service and target_session_attrs options).
Foreign servers corresponding to Shardman replication groups also get the extended_features setting automatically enabled. Never set this parameter for postgres_fdw foreign servers that you define for your own purposes (for example, to load data into a Shardman cluster).
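To show how these top-level fields fit together, here is a minimal sketch; the values are illustrative, and complete generated examples are given in the Examples section below:
{
    "ConfigVersion": "1",
    "Repfactor": 1,
    "PlacementPolicy": "cross",
    "PGsInitialPort": 5432,
    "SilkInitialPort": 8000,
    "HTTP": { "Port": 15432, "PortMetrics": 15432 },
    "Users": [
        { "Name": "postgres", "Groups": ["su"], "AuthMethod": "scram-sha-256", "Password": "changeMe" },
        { "Name": "repluser", "Groups": ["repl"], "AuthMethod": "scram-sha-256", "Password": "changeMe" }
    ],
    "ShardSpec": { "synchronousReplication": true },
    "FDWOptions": { "fetch_size": "50000" }
}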
ShardSpec Parameters
The ShardSpec specification can include all the usual Stolon options described in the Stolon Cluster Specification. However, the following options should be carefully tuned for a Shardman cluster.
pgHBA #
JSON array of pg_hba.conf strings. The default value allows users from the su group access from anywhere with the AuthMethod authentication method. If the value of defaultSUReplAccessMode is strict, the pg_hba.conf strings must explicitly allow users from the su or repl groups access from all Shardman cluster nodes.

forceSuUserLocalPeerAuth #
When enabled, sets peer authentication via Unix socket for the postgres user, if strictUserHBA is not set to true.
Default: false.

synchronousReplication #
Determines whether replicas should use synchronous replication. Should be true in a Shardman cluster.
Default: true.

maxSynchronousStandbys #
Maximum number of required synchronous standbys when synchronous replication is enabled. Should be >= Repfactor in a Shardman cluster.
Default: Repfactor.

strictUserHBA #
Prohibits adding automatically generated lines to the pg_hba.conf file.
Default: false.

automaticPgRestart #
Determines whether a DBMS instance should be automatically restarted after a change of the pgParameters hash table that requires a restart. Should be enabled in a Shardman cluster.
Default: true.

masterDemotionEnabled #
Enables master demotion in case the replication group master loses connectivity with etcd. The master attempts to connect to each of its standby nodes to determine whether any of them has become the master. If it discovers another master, it shuts down its own DBMS instance until connectivity with etcd is restored. If the master fails to connect to one of its standby nodes for a long time, the DBMS instance is also shut down.
Default: false.

masterDemotionTimeout #
The timeout during which the master attempts to connect to its standbys when connectivity with etcd is lost. Works only if the masterDemotionEnabled parameter is set to true.
Default: 30s.

minSyncMonitorEnabled #
Enables the monitor of the MinSynchronousStandbys value for every replication group. If a node loses connection with the cluster (all keepers are unhealthy, that is, a keeper does not update its state for longer than minSyncMonitorUnhealthyTimeout), the monitor decreases the MinSynchronousStandbys value for every replication group related to the disconnected node to the maximum available value. This prevents the read-only condition caused by a fake replica. The maximum available value is always less than or equal to the value specified in the cluster configuration. If all keepers related to the disconnected node become healthy again, the monitor restores the MinSynchronousStandbys value for the replication group to the value specified in the cluster configuration.
Default: false.

minSyncMonitorUnhealthyTimeout #
Time interval after which the node (and all keepers related to this node) is considered unhealthy. Works only if the minSyncMonitorEnabled parameter is set to true.
Default: 30s.

syncPointMonitorEnabled #
Enables the monitor that creates a syncpoint every minute, ensuring that Shardman can restore to a consistent LSN. At each syncpoint, the cluster state is consistent, meaning that all transactions are complete. If this parameter is set to true, PITR is guaranteed to work, and the syncpoint history is saved in etcd under the key shardman/{cluster_name}/data/cluster/syncpoints.
Default: false.

syncpointInterval #
Specifies how often a syncpoint is created (seconds).
Default: 60 seconds.

syncpointEnabled #
Enables the monitor that periodically creates syncpoints.
Default: true.

syncpointLimit #
Number of the most recent syncpoints stored. A combined sketch of the syncpoint settings appears after this list.
Default: 60.
dbWaitRewindTimeout #
Before a full resync of a replica, the cluster software first tries pg_rewind, because the rewind operation is significantly faster than other approaches when the database is large and only a small fraction of blocks differs between the clusters. The dbWaitRewindTimeout parameter specifies the maximum working time for pg_rewind (example values: 5m, 30s, 1m30s).
Default: 7m.

additionalReplicationSlots #
Array of names of physical replication slots that are created on the master. Each slot name must begin with the stolon_ prefix.

createSlotsOnFollowers #
If true, physical replication slots are also created on standby nodes.

additionalSlotsLagLimit #
The limit on the volume by which replication slots defined by the additionalReplicationSlots configuration parameter can lag behind. If this value is exceeded, the slot is recreated. Specify the value as a number followed by a unit of measurement. Possible units: B, kB, kiB, MB, MiB, GB, GiB, TB, TiB, PB, PiB, EB, EiB, ZB, ZiB, YB, and YiB. For example: 100MB.

pgParameters #
Hash table that determines PostgreSQL settings, including Shardman-specific settings. Supports the following placeholders for postgres parameters: {{dataDir}} for the data directory, {{keeperDir}} for the keeper data directory under dataDir, {{keeperName}} for the keeper name, {{keeperID}} for the keeper ID, {{cluster}} for the cluster name, {{shard}} for the shard name, and {{host}} for the host running the postgres instance. Placeholder usage is illustrated in the sketch below.
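The sketch below pulls several of these options together: the syncpoint settings, an additional replication slot with a lag limit, and a pgParameters entry demonstrating the placeholders. All values are illustrative (the placeholder use in log_line_prefix is only an example), and the field spellings follow the example files later in this section:
"ShardSpec": {
    "syncPointEnabled": true,
    "syncpointInterval": "60s",
    "syncpointLimit": 60,
    "additionalReplicationSlots": ["stolon_geo_slot"],
    "additionalSlotsLagLimit": "1GiB",
    "createSlotsOnFollowers": true,
    "pgParameters": {
        "log_line_prefix": "{{cluster}}/{{shard}} %m [%p] "
    }
}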
Shardman-specific PostgreSQL Settings
The following settings in pgParameters are Shardman-specific:
enable_csn_snapshot (boolean) #
Enables or disables Commit Sequence Number (CSN) based tracking of transaction visibility for a snapshot. PostgreSQL uses the clock timestamp as a CSN, so enabling CSN-based snapshots can be useful for implementing global snapshots and global transaction visibility. When this parameter is enabled, PostgreSQL creates the pg_csn directory under PGDATA to keep track of CSN and XID mappings.
Default: off.

enable_custom_cache_costs (boolean) #
Enables estimation logic for plan costs. It helps the planner choose generic plans more often, taking runtime pruning into account.
Default: off.

enable_sql_func_custom_plans (boolean) #
If enabled, custom plans can be created to execute statements inside SQL functions. These plans depend on the parameter values. Query plans can be cached within one query. First, the plan is built five times with different parameter values, then a generic plan is created regardless of the values. If the custom and generic plan costs differ only slightly, the generic plan is cached and used in the future. However, custom plans allow a more effective way of excluding queries to sharded table partitions when the choice of these partitions depends on a query parameter.
Default: off.

enable_merge_append (boolean) #
Enables the use of MergeAppend plans by the query planner.
Default: on.

enable_async_merge_append (boolean) #
Enables or disables the query planner's use of async-aware merge append plan types.
Default: on.
csn_snapshot_defer_time (integer) #
Specifies the minimal age of records that are allowed to be vacuumed, in seconds. All global transactions must start on all participant nodes within csn_snapshot_defer_time seconds after start; otherwise, they are aborted with a “csn snapshot too old” error.
Default: 15.

csn_commit_delay (integer) #
Specifies the maximum possible clock skew in the cluster, in nanoseconds. Adds a delay before every commit in the system to ensure external consistency. If set to 0, external consistency is not guaranteed. The value suffixes ns, us, ms, and s are allowed.
Default: 0.

csn_lsn_map_size (integer) #
Size of the CSNLSNMap. The commit record of each completed transaction in Shardman contains the CSN assigned to this transaction. This value, together with the LSN of this record, forms a (CSN, LSN) pair. Each cluster node stores a certain number of such pairs in RAM in a special structure, the CSNLSNMap. This map is used to get the syncpoint. See the “Syncpoints and Consistent Backup” section of the Internals chapter for more information.
Default: 1024.

csn_max_shift_error (boolean) #
Determines whether exceeding the csn_max_shift value raises an error.
Default: off.

csn_max_shift (integer) #
Maximum CSN shift in seconds for distributed queries and imported snapshots. If the shift exceeds the csn_max_shift value, an error or warning occurs. If the value is set to 0, no check is run.
Default: 15 (seconds).

foreign_analyze_interval (integer) #
Specifies how often foreign statistics should be gathered during autovacuum, in seconds. If the value of foreign_analyze_interval is less than autovacuum_naptime, foreign statistics are gathered every autovacuum_naptime seconds.
Default: 60.

foreign_join_fast_path (boolean) #
Turns on a fast path for foreign join planning. When it is on, foreign join paths for SELECT queries are searched before all other possible paths, and the search stops for a join as soon as a foreign join path is found.
Default: off.

optimize_correlated_subqueries (boolean) #
Enables or disables the query planner's logic of transforming correlated subqueries into semi-joins.
Default: on.

port (integer) #
The TCP port the server listens on. For a Shardman cluster, the port is assigned automatically by the system based on the PGsInitialPort parameter. If changed manually, the value is overwritten by the automatically assigned configuration parameter.

enable_partition_pruning_extra (boolean) #
Enables extended partition pruning for prepared queries with a known partitioning key. If turned on, partition-wise join plans can be pruned.
Default: off.

crash_info (boolean) #
When set to on, Shardman writes diagnostic information about a backend crash into a file.
Default: on.

crash_info_dump (text) #
Specifies a comma-separated list of character strings that name the data sources for a crash dump. Possible values of the strings are as follows:
queries — query texts.
memory_context — PostgreSQL memory context (see Memory Management for details).
system — information on the OS.
module — information on modules loaded into the postgres process.
cpuinfo — information on the processor: number of cores, instruction set, etc.
cpu_context — processor context registers, plus general information on the error: signal number, signal code information, PID, PID of the parent process, user ID, etc.
virtual_memory — information on virtual memory regions.
instruction_pointer — bytes near RIP (the return instruction pointer, which points to a memory address indicating the progress of program execution in memory).
stack — stack bytes. The stack is dumped to an individual file in the directory where diagnostic information about backend crashes is logged ($PGDATA/crash_info by default). This file has the same name as the crash log file, but with the .data extension. The addresses of the upper and lower stack bounds are written to the crash log file.
Default: system, module, queries, cpu_context, instruction_pointer.

crash_info_location (string) #
Specifies the directory where information about a backend crash is stored. The value stderr sends information about the crash to stderr. If this parameter is set to the empty string '', the $PGDATA/crash_info directory is used. If you wish to keep the files elsewhere, create the target directory in advance and grant appropriate privileges.
Default: ''.

crash_info_timer (boolean) #
Enables or disables dumping of the long-running query state by timer, which allows profiling such queries. The timer is launched at the interval specified in crash_info_timer_interval and measures query processing time. When query execution time exceeds the threshold specified in crash_info_query_threshold, the query state is dumped to disk. For the timer to operate, crash_info_dump must be set at least to queries. This setting can be changed only by superusers, using the SET command in the current session or globally in the postgresql.conf configuration file. To re-read the parameter value, restart the server or send the SIGHUP signal to the main server process.
Default: off.

crash_info_timer_interval (integer) #
Sets the time interval at which the timer is launched, in milliseconds. This setting can be changed only by superusers, using the SET command in the current session or globally in the postgresql.conf configuration file. To re-read the parameter value, restart the server or send the SIGHUP signal to the main server process.
Default: 1000.

crash_info_query_threshold (integer) #
Sets the threshold value for the query processing time, in milliseconds. When reached, the query state is dumped to disk. This setting can be changed only by superusers, using the SET command in the current session or globally in the postgresql.conf configuration file. To re-read the parameter value, restart the server or send the SIGHUP signal to the main server process.
Default: 3000.

shardman.context_log (boolean) #
Logs remote contexts. If enabled, a Remote CONTEXT field is displayed in case of an error. Note that if the standard log level is set to log_verbosity=terse, shardman.context_log is disabled automatically.
Default: on.
postgres_fdw.enforce_foreign_join (boolean) #
Turns on alternative estimation of foreign join costs, which greatly increases the chances that a join of several foreign tables referring to the same server is pushed down. The cost of the original join is estimated as (1 - 1/(cost + 1)), where cost is the originally estimated cost of this remote join.
Default: off.

postgres_fdw.foreign_explain (enum) #
Defines how to include the EXPLAIN command output from the remote servers if the query plan contains ForeignScan nodes. The possible values are: none to exclude the EXPLAIN output from the remote servers, full to include the EXPLAIN output from the remote servers, and collapsed to include the EXPLAIN output only for the first ForeignScan node under its Append/MergeAppend.
Default: collapsed.

postgres_fdw.optimize_cursors (boolean) #
Sets postgres_fdw to try fetching the first portion of cursor data immediately after declaration and to delay closing the cursor. This postgres_fdw parameter forces it to avoid closing cursors after the end of a scan; cursors are closed at the end of the transaction.
Default: off.

postgres_fdw.subplan_pushdown (boolean) #
Enables or disables the postgres_fdw logic of pushing down subqueries that reference only tables of one foreign server to that foreign server.
Default: off.

postgres_fdw.use_twophase (enum) #
Sets postgres_fdw to use the two-phase commit (2PC) protocol for distributed transactions. This postgres_fdw parameter forces it to use a two-phase commit if the transaction touches several nodes. When set to auto, a two-phase commit is only used in transactions with enable_csn_snapshot=true and an isolation level equal to or higher than REPEATABLE READ. Temporary tables cannot be used in 2PC transactions.
Default: auto.

postgres_fdw.estimate_as_hashjoin (boolean) #
When enabled, the planner estimates a foreign join cost in a way similar to the cost of a hash join whenever possible. This cost is compared to the default cost (which is similar to nested loops), and the smaller cost is selected for the path.
Default: off.

postgres_fdw.estimate_indexscan (boolean) #
Enables cost estimation for foreign scans as if they were index scans. When disabled, the cost of scanning a foreign table is estimated as a sequential scan plus the cost of data transfer and of local filtering if some filters are applied locally. When enabled, any remote scan can be estimated as an index scan, should there be a suitable index. If a foreign table is part of a sharded table, information about its indexes is taken from a random local partition of the sharded table. This decision is based on the assumption that data distribution between the sharded table partitions is uniform and that the definitions of local and remote partitions match. Expecting the remote scan to be executed as an index scan makes foreign scan costs comparable to local scan costs, which can be important for further path selection. Note that this estimation makes planning more expensive and is not recommended for simple queries.
Default: off.

postgres_fdw.additional_ordered_paths (boolean) #
When enabled, sorting on the remote server is considered if it allows performing MergeJoin or MergeAppend operations. This parameter is enabled by default in new installations but must be explicitly enabled in upgraded clusters.

shardman.broadcast_ddl (boolean) #
Sets the Shardman extension to broadcast DDL statements to all replication groups. When this parameter is on, the Shardman extension broadcasts supported DDL statements to all replication groups if it makes sense for those statements. You can enable or disable this behavior at any time. This parameter is not honored when set in the configuration file. Note that it does not affect sequences, as they are only created per shard.
Default: off.

shardman.enable_limit_pushdown (boolean) #
Enables pushing down limit clauses through the underlying appends. When on, the Shardman optimizer tries to push down a limit clause to the subpaths of the underlying Append/MergeAppend plan node if they reference postgres_fdw foreign tables. This optimization works only for SELECT plans where the limit is expressed as a constant or a parameter. It is also restricted to Append paths corresponding to a partitioned table.
Default: on.

shardman.num_parts (integer) #
Specifies the default number of sharded table partitions. A sharded table has this default number of partitions unless num_parts is specified in CREATE TABLE. To allow scaling, shardman.num_parts should be larger than the expected maximum number of nodes in a Shardman cluster. Possible values are from 1 to 1000.
Default: 20.
shardman.rgid (integer) #
Specifies the replication group ID of a Shardman node. This parameter is set by Shardman utilities when the node is added to the cluster and should never be changed manually.
Default: -1.

shardman.sync_schema (boolean) #
Sets Shardman to propagate all DDL statements that touch sharded and global relations to all replication groups. When this parameter is on, Shardman broadcasts all supported utility statements touching sharded and global relations to all replication groups. It is not recommended to turn this off. This parameter is not honored when set in the configuration file.
Default: on.

shardman.sync_cluster_settings (boolean) #
Enables cluster-wide synchronization of configuration parameters set by the user. The configuration parameters are propagated with each remote query.
Default: on.

shardman.sync_cluster_settings_blacklist (string) #
Specifies options that are not to be propagated to a remote cluster.
Default: local system configuration parameters that are never synchronized.

shardman.query_engine_mode (enum) #
Switches between modes of query planning/execution. Possible values are none and text. none means that query planning/execution does not use the Silk transport. text means that the text query representation is transferred via the Silk transport for remote execution.
Default: none.

shardman.silk_use_ip (string) #
The Silk transport uses the IP address specified by this parameter for node identification. If a host name is specified, it is resolved, and the first IP address corresponding to this name is used.
Default: node hostname.

shardman.silk_listen_ip (string) #
The Silk routing daemon listens for incoming connections on this IP address. If a host name is specified, it is resolved, and the first IP address corresponding to this name is used.
Default: node hostname.

shardman.silk_use_port (integer) #
The Silk routing daemon listens for incoming connections on this port. This setting should be the same for all nodes in the Shardman cluster.
Default: 8888.

shardman.silk_tracepoints (boolean) #
Enables tracing of queries passing through the Silk pipeline. The tracing results can be accessed by running the EXPLAIN command with ANALYZE set to ON.
Default: off.

shardman.silk_num_workers (integer) #
Number of background workers allocated for distributed execution. This setting must be less than max_worker_processes (which also accounts for auxiliary postgres worker processes).
Default: 2.

shardman.silk_stream_work_mem (integer) #
Sets the base maximum amount of memory to be used by a Silk stream (as a buffer size) before writing to temporary disk files. If this value is specified without units, it is taken as kilobytes. Note that most queries can perform multiple fetch operations at the same time, usually one for each remote partition of a sharded table, if any. Each fetch operation is generally allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions can execute such operations concurrently. Therefore, the total memory used by Silk for buffers can be many times the value of shardman.silk_stream_work_mem and is correlated with shardman.num_parts. Keep this in mind when choosing the value.
Default: 16MB.

shardman.silkworm_fetch_size (integer) #
Number of rows in a chunk that the silkworm worker extracts and sends to the multiplexer as a result, per one reading iteration.
Default: 100.

shardman.silk_unassigned_job_queue_size (integer) #
Size of the queue for jobs that have not yet been assigned to the silkworm multiplexer workers, in case all the workers are busy.
Default: 1024.

shardman.silk_max_message (integer) #
Maximum message size that can be transferred with Silk, in bytes. Note that this parameter does not limit the maximum size of the result returned by a query; it only affects messages sent to workers. Increasing this parameter value results in a proportional increase in the memory consumed by Shardman. It is strongly recommended to use the default value unless there is an urgent need to change it.
Default: 524288.

shardman.silk_hello_timeout (integer) #
Handshake timeout between multiplexers of different nodes, in seconds.
Default: 3.
shardman.silk_scheduler_mode (enum) #
Enables additional CPU scheduling settings for the multiplexer processes (silkroad and silkworm). When this parameter is fifo, Shardman assigns the SCHED_FIFO scheduling policy to the silkroad process and to each silkworm process, and sets their static scheduling priority (sched_priority) to the values of shardman.silkroad_sched_priority and shardman.silkworm_sched_priority, respectively. This setting improves Silk transport performance when it operates under heavy CPU load.
Note that the postgres binary needs the CAP_SYS_NICE capability for this option to work. If no appropriate capability is assigned to the process, enabling this setting has no effect. The capability must be assigned to the postgres binary before starting postgres. Postgres (that is, the silkroad and silkworm processes) applies the scheduling options once during service start, so you need to restart the postgres service if you want to change the scheduling options.
Default: none.
To set the capability, execute the following command once after postgres is installed:
$ sudo setcap cap_sys_nice+ep /opt/pgpro/sdm-14/bin/postgres
Replace /opt/pgpro/sdm-14/bin/postgres with the correct path to your postgres binary if needed. Also note that your filesystem must support extended file attributes. Set the capability on each node of the cluster for it to take full effect.
In the Linux kernel, there is a mechanism called real-time throttling, which is designed to prevent tasks with real-time scheduling policies (like SCHED_FIFO) from monopolizing CPU resources. This ensures that other tasks with lower priorities, typically scheduled under the SCHED_OTHER policy, still get some amount of CPU time. This mechanism is controlled by two parameters, exported into the proc filesystem or the sysctl interface:
/proc/sys/kernel/sched_rt_period_us sets the duration of a scheduling period in microseconds. During this period, both real-time and non-real-time tasks share CPU time.
/proc/sys/kernel/sched_rt_runtime_us specifies how much of the scheduling period is allocated to real-time tasks (with SCHED_FIFO). The remainder of the time is left for non-real-time tasks (SCHED_OTHER).
A typical and acceptable configuration for Shardman might set these parameters as follows:
# cat /proc/sys/kernel/sched_rt_period_us
1000000
# cat /proc/sys/kernel/sched_rt_runtime_us
950000
This configuration allows real-time tasks to use up to 950 milliseconds of each second, leaving 50 milliseconds for non-real-time tasks.
However, in some Linux distributions, the default values for these parameters might be set so low (or even to zero) that real-time tasks receive very little or no CPU time. This can make real-time scheduling ineffective or prevent the configuration from being applied. For example, attempting to manually set a task to the SCHED_FIFO priority using chrt might result in an error like:
$ sudo chrt -f -p 2 $(pgrep -f silkroad)
chrt: failed to set pid 1897706's policy: Operation not permitted
This error indicates that the kernel parameters are not configured correctly. In such cases, run the following:
echo 1000000 > /proc/sys/kernel/sched_rt_period_us
echo 950000 > /proc/sys/kernel/sched_rt_runtime_us
Or add the corresponding values into /etc/sysctl.conf and reload the settings using sysctl -p:
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000

shardman.silkroad_sched_priority (integer) #
Value of the static scheduling priority (sched_priority) for the silkroad process. It only makes sense if shardman.silk_scheduler_mode equals fifo.
Default: 2.

shardman.silkworm_sched_priority (integer) #
Value of the static scheduling priority (sched_priority) for silkworm processes (the same value for each of them). It only makes sense if shardman.silk_scheduler_mode equals fifo.
Default: 1.
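Taken together, the scheduler settings from the preceding three entries could be specified through pgParameters as in this sketch (the priority values are illustrative):
"pgParameters": {
    "shardman.silk_scheduler_mode": "fifo",
    "shardman.silkroad_sched_priority": "2",
    "shardman.silkworm_sched_priority": "1"
}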
shardman.silk_set_affinity (boolean) #
Enables pinning of the multiplexer processes (silkroad and silkworm) to CPU cores to eliminate the negative effects of cross-CPU thread migration. When this parameter is true, the silkroad process is pinned to the first available CPU core, and the silkworm processes (all of them) are pinned to all available CPU cores except the first one. This setting improves Silk transport performance when it operates under heavy CPU load.
Note that the postgres binary needs the CAP_SYS_NICE capability for this option to work. If no appropriate capability is assigned to the process, enabling this setting has no effect. The capability must be assigned to the postgres binary before starting postgres. Postgres (that is, the silkroad and silkworm processes) applies the affinity options once during service start, so you need to restart the postgres service if you want to change the affinity options.
To set the capability, execute the following command once after postgres is installed:
$ sudo setcap cap_sys_nice+ep /opt/pgpro/sdm-14/bin/postgres
Replace /opt/pgpro/sdm-14/bin/postgres with the correct path to your postgres binary if needed. Also note that your filesystem must support extended file attributes. Set the capability on each node of the cluster for it to take full effect.
Default: false.

shardman.silk_flow_control (enum) #
Controls the mode of handling read events. It has three possible values: none, round_robin, and shortest_job_first.
The none mode means no control and no additional overhead. In this case, however, the channel may become occupied by just one distributed query.
The round_robin mode means the events created earlier are the first ones to be processed, for each event loop. If enabled, all the backends are grouped, and the client backends are prioritized over the others.
The shortest_job_first mode means full control over the traffic. If enabled, all the backends are grouped, and the client backends are prioritized over the others, along with the workers with the least session traffic.
Default: round_robin.

shardman.silk_track_time (boolean) #
Enables or disables the metrics with the transferred_ prefix and the time-based metrics (with the read_efd_, write_efd_, and sort_time_ prefixes). If disabled, these metrics have 0 values.
Default: off.

shardman.silk_tracelog (boolean) #
Enables or disables Silk logging.
Default: off.

shardman.silk_tracelog_categories (string) #
Defines the Silk message categories to be traced.
Default: streams, routing, events.

shardman.database (string) #
Name of the database that all Silk workers connect to.
Default: postgres.

shardman.monitor_interval (integer) #
shardman.monitor_interval is deprecated and acts as a no-op. Use shardman.monitor_dxact_interval instead.

shardman.monitor_dxact_interval (integer) #
Interval between checks for outdated prepared transactions. The Shardman monitor background process wakes up every shardman.monitor_dxact_interval seconds and attempts to check and resolve any prepared transactions that did not complete and became outdated for some reason. To resolve these transactions, the Shardman monitor process determines the coordinator of the transaction and requests the transaction status from the coordinator. Based on the status of the transaction, the Shardman monitor either rolls back or commits the transaction. To disable the prepared transaction resolution logic, set shardman.monitor_dxact_interval to 0.
Default: 5 (seconds).

shardman.monitor_trim_csnxid_map_interval (integer) #
Each cluster node freezes its own xmin value for csn_snapshot_defer_time seconds to support global transactions. Large csn_snapshot_defer_time values can negatively impact the performance. The Shardman monitor has a routine that every shardman.monitor_trim_csnxid_map_interval seconds updates xmin on all nodes to the minimum possible value (taking active transactions into account). The background routine runs on only one node in the Shardman cluster; note that this puts additional load on that node. To disable such updates, set shardman.monitor_trim_csnxid_map_interval to 0.
Default: 5 (seconds).

shardman.monitor_dxact_timeout (integer) #
Maximum allowed age of prepared transactions before a resolution attempt. During the resolution of a prepared transaction, the Shardman monitor determines whether the transaction is outdated. A transaction becomes outdated if it was prepared more than shardman.monitor_dxact_timeout seconds ago.
Default: 5 (seconds).

shardman.trim_csnxid_map_naptime (integer) #
Specifies the minimum delay between xmin updates on all nodes. See shardman.monitor_trim_csnxid_map_interval for more information. Possible values are from 1 to 600.
Default: 5.

shardman.monitor_deadlock_interval (integer) #
Interval between checks for distributed deadlock conditions. The Shardman monitor background process wakes up every shardman.monitor_deadlock_interval seconds and searches for distributed deadlocks in the cluster. It gathers information about mutual locks from all nodes and looks for circular dependencies between transactions. If it detects a deadlock, it resolves it by canceling one of the backend processes involved in the lock. To disable the distributed deadlock resolution logic, set shardman.monitor_deadlock_interval to 0.
Default: 2 (seconds).

postgres_fdw.remote_plan_cache (boolean) — EXPERIMENTAL #
Enables remote plan caching for FDW queries produced by locally cached plans.
Default: off.

shardman.plan_cache_mem (integer) — EXPERIMENTAL #
Specifies how much memory per worker can be used for remote plan caches.
Default: 0 (caches are disabled).

shardman.gt_batch_size (integer) #
Specifies the buffer size for INSERT and DELETE commands executed on a global table.
Default: 64K.

track_fdw_wait_timing (boolean) #
Collects statistics on the network latency (wait time) for inter-cluster operations, in milliseconds. The statistics can be accessed by running the EXPLAIN command with the network parameter enabled, and via the pgpro_stats view pgpro_stats_sdm_statements.
Default: on.

track_xact_time (boolean) #
Enables or disables statistics collection for the time spent on a transaction.
Default: off.

enable_non_equivalence_filters (boolean) #
Enables the optimizer to generate additional non-equivalence conditions using equivalence classes.
Default: off.

optimize_row_in_expr (boolean) #
Enables the optimizer to generate additional conditions from the IN () expression.
Default: off.
Examples
Spec File for a Cluster with Enabled scram-sha-256 Authentication
Note
The initial configuration file should be generated with the following command:
shardmanctl config generate > sdmspec.json
The example below is for educational purposes only and may lack the latest updates.
This is the contents of an example sdmspec.json configuration file:
{
"ConfigVersion": "1",
"Repfactor": 1,
"PlacementPolicy": "manual",
"PGsInitialPort": 5432,
"SilkInitialPort": 8000,
"HTTP": {
"Port": 15432,
"PortMetrics": 15432
},
"Users": [
{
"Name": "postgres",
"Groups": [ "su"],
"AuthMethod": "scram-sha-256",
"Password": "changeMe"
},
{
"Name": "repluser",
"Groups": ["repl"],
"AuthMethod": "scram-sha-256",
"Password": "changeMe"
}
],
"ShardSpec": {
"synchronousReplication": true,
"usePgrewind": true,
syncPointEnabled": true,
"syncpointLimit": 60,
"syncpointInterval": "60s",
"pgParameters": {
"csn_snapshot_defer_time": "300",
"enable_csn_snapshot": "on",
"enable_csn_wal": "true",
"shardman.query_engine_mode": "text",
"shardman.silk_num_workers": "8",
"max_connections": "600",
"max_files_per_process": "65535",
"max_logical_replication_workers": "14",
"max_prepared_transactions": "200",
"max_worker_processes": "24",
"shared_preload_libraries": "postgres_fdw, shardman"
},
"pgHBA": [
"host replication postgres 0.0.0.0/0 scram-sha-256",
"host replication postgres ::0/0 scram-sha-256"
],
"automaticPgRestart": true,
"masterDemotionEnabled": false
},
"FDWOptions": {
"async_capable": "on",
"batch_size": "100",
"connect_timeout": "5",
"fdw_tuple_cost": "0.2",
"fetch_size": "50000",
"tcp_user_timeout": "10000"
}
}
From this configuration file, you can see that a Shardman cluster initialized with this spec file has Repfactor equal to 1 (one replica for each master). The configuration file also shows that two special users are created in this cluster: superuser postgres and replication user repluser, both with changeMe passwords. They are authenticated using the scram-sha-256 authentication method. One postgres_fdw fetch operation gets up to 50000 rows from the remote server. The cost of fetching one row is set to a reasonably high value to make the PostgreSQL planner consider conditions pushdown-attractive. The pg_hba.conf settings allow the postgres user access from anywhere using the replication protocol; all other users can access any database from anywhere. Since defaultSUReplAccessMode is not set to strict, the utilities automatically add entries that allow the PgSuUsername user (postgres) access to any database from anywhere and the PgReplUsername user (repluser) replication access from anywhere.
Several parameters important for Shardman are set in the pgParameters hash table. These are:
wal_level #
Should be set to logical for Shardman to work correctly.
shared_preload_libraries #
Should include the postgres_fdw and shardman extensions, in the specified order.
max_logical_replication_workers #
Should be rather high since the rebalance process uses up to max(max_replication_slots, max_logical_replication_workers, max_worker_processes, max_wal_senders)/3 concurrent threads.
max_prepared_transactions #
Should be rather high since Shardman utilities use the 2PC protocol. If postgres_fdw.use_twophase is true, postgres_fdw also uses 2PC.
enable_csn_snapshot #
Should be enabled to achieve a true REPEATABLE READ isolation level in a distributed system.
csn_snapshot_defer_time #
All global transactions must start on all participant nodes within csn_snapshot_defer_time seconds after start; otherwise, they will be aborted.
enable_partitionwise_aggregate and enable_partitionwise_join #
Set to on to enable optimizations for partitioned tables.
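Collecting these requirements in one place, a pgParameters fragment might look as follows. This is a sketch rather than a generated file; the worker and transaction counts are illustrative, and wal_level and the enable_partitionwise_* settings are not shown in the example above:
"pgParameters": {
    "wal_level": "logical",
    "shared_preload_libraries": "postgres_fdw, shardman",
    "max_logical_replication_workers": "14",
    "max_prepared_transactions": "200",
    "enable_csn_snapshot": "on",
    "csn_snapshot_defer_time": "300",
    "enable_partitionwise_aggregate": "on",
    "enable_partitionwise_join": "on"
}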
Spec File for a Cluster with Enabled Certificate Authentication
This is the contents of an example sdmspec.json configuration file:
{
"ConfigVersion": "1",
"HTTP": {
"Port": 15432,
"PortMetrics": 15432
"SSLKey": "/pgpro/ssl/server.key",
"SSLCert": "/pgpro/ssl/server.crt"
},
"Users": [
{
"Name": "postgres",
"SSLKey": "/var/lib/postgresql/.ssh/client.key",
"SSLCert": "/var/lib/postgresql/.ssh/client.crt",
"Groups": ["su"],
"AuthMethod":"scram-sha-256"
},
{
"Name": "repluser",
"SSLKey": "/var/lib/postgresql/.ssh/repluser.key",
"SSLCert": "/var/lib/postgresql/.ssh/repluser.crt",
"Groups": ["repl"],
"AuthMethod":"scram-sha-256"
}
],
"ShardSpec": {
"synchronousReplication": true,
"usePgrewind": true,
"syncPointEnabled": true,
"syncpointLimit": 60,
"syncpointInterval": "60s",
"pgParameters": {
"ssl": "on",
"ssl_cert_file": "/var/lib/postgresql/.ssh/server.crt",
"ssl_key_file": "/var/lib/postgresql/.ssh/server.key",
"ssl_ca_file": "/var/lib/postgresql/.ssh/ca.crt",
"csn_snapshot_defer_time": "300",
"enable_csn_snapshot": "on",
"enable_csn_wal": "true",
"log_line_prefix": "%m [%r][%p]",
"log_min_messages": "INFO",
"log_statement": "none",
"maintenance_work_mem": "1GB",
"max_connections": "600",
"max_files_per_process": "65535",
"max_logical_replication_workers": "9",
"max_prepared_transactions": "200",
"max_wal_size": "4GB",
"max_worker_processes": "16",
"min_wal_size": "512MB",
"postgres_fdw.subplan_pushdown": "off",
"shardman.query_engine_mode": "text",
"shardman.silk_num_workers": "8",
"shared_buffers": "4GB",
"shared_preload_libraries": "postgres_fdw, shardman"
},
"strictUserHBA": true,
"pgHBA": [
"hostssl all postgres 0.0.0.0/0 cert clientcert=verify-full",
"hostssl all repluser 0.0.0.0/0 cert clientcert=verify-full",
"hostssl replication postgres 0.0.0.0/0 cert clientcert=verify-full",
"hostssl replication postgres ::0/0 cert clientcert=verify-full",
"hostssl replication repluser 0.0.0.0/0 cert clientcert=verify-full",
"hostssl replication repluser ::0/0 cert clientcert=verify-full",
"hostnossl all all 0.0.0.0/0 reject",
"local postgres postgres scram-sha-256",
"local replication repluser scram-sha-256"
],
"automaticPgRestart": true,
"masterDemotionEnabled": false
},
"FDWOptions": {
"async_capable": "on",
"batch_size": "100",
"connect_timeout": "5",
"fdw_tuple_cost": "0.2",
"fetch_size": "50000",
"tcp_user_timeout": "10000"
}
}
Spec File for a Disaster Recovery Cluster
A Disaster Recovery Cluster (DRC) has a number of its own parameters that must be included in the configuration file:
StandbyMode: shows whether the cluster has the standby mode enabled. It can only be enabled with shardmanctl cluster standby enable and disabled with shardmanctl cluster standby disable.
slotName: the name of the slot created on the primary cluster instance. This slot receives WAL segments, and it must be one of the slots defined in the additionalReplicationSlots parameter of the primary cluster.
primaryConnections: a connection mapping between the primary and standby cluster instances.
maxWalReaders: limits the maximum number of threads for WAL segment parsing and LSN search. Default: 100.
This is the contents of an example sdmspec.json configuration file:
{
"ConfigVersion": "1",
"Repfactor": 1,
"PlacementPolicy": "manual",
"PgSuAuthMethod": "scram-sha-256",
"PgSuUsername": "postgres",
"PgSuPassword": "Enter_YOUR_Password!",
"PgReplAuthMethod": "scram-sha-256",
"PgReplUsername": "repluser",
"PgReplPassword": "Enter_YOUR_Password!",
"ShardSpec": {
"additionalReplicationSlots": [
"stolon_geo_slot"
],
"createSlotsOnFollowers": true,
"additionalSlotsLagLimit": "1GiB",
"synchronousReplication": true,
"minSynchronousStandbys": 0,
"maxSynchronousStandbys": 4,
"usePgrewind": true,
"syncPointEnabled": true,
"syncpointLimit": 60,
"syncpointInterval": "60s",
"pgParameters": {
"csn_snapshot_defer_time": "300",
"enable_csn_snapshot": "on",
"enable_csn_wal": "true",
"log_statement":"none",
"log_line_prefix":"%m [%r][%p]",
"log_destination": "stderr",
"log_filename": "pg.log",
"logging_collector": "on",
"log_checkpoints": "false",
"log_min_messages": "INFO",
"log_min_error_statement":"INFO",
"maintenance_work_mem": "1GB",
"max_connections": "600",
"max_files_per_process": "65535",
"max_logical_replication_workers": "12",
"max_prepared_transactions": "200",
"max_wal_size": "4GB",
"max_worker_processes": "18",
"min_wal_size": "512MB",
"postgres_fdw.subplan_pushdown": "off",
"shardman.query_engine_mode": "text",
"shardman.silk_num_workers": "8",
"shared_buffers": "4GB",
"ptrack.map_size": "16",
"shared_preload_libraries": "postgres_fdw, shardman, ptrack"
},
"pgHBA": [
"local postgres postgres peer",
"local replication repluser scram-sha-256",
"host all postgres 0.0.0.0/0 scram-sha-256",
"host all postgres ::0/0 scram-sha-256",
"host replication repluser 0.0.0.0/0 scram-sha-256",
"host replication repluser ::0/0 scram-sha-256",
"host replication postgres 0.0.0.0/0 scram-sha-256",
"host replication postgres ::0/0 scram-sha-256"
],
"strictUserHBA": true,
"forceSuUserLocalPeerAuth": true,
"automaticPgRestart": true,
"masterDemotionEnabled": false
},
"StandbySettings": {
"maxWalReaders": 100,
"primaryConnections": {
"shard-1": {
"node1r": {
"slotName": "stolon_replication_slot",
"connString": "dbname=postgres host=shrn1 password=61f0cb5ff27c port=5432 user=postgres"
}
},
"shard-2": {
"node2r": {
"slotName": "stolon_replication_slot",
"connString": "dbname=postgres host=shrn2 password=61f0cb5ff27c port=5432 user=postgres"
}
}
}
},
"FDWOptions": {
"async_capable": "on",
"batch_size": "100",
"connect_timeout": "5",
"fdw_tuple_cost": "0.2",
"fetch_size": "50000",
"tcp_user_timeout": "10000"
}
}
pgpro_stats Parameters

pgpro_stats.track_sharded (boolean) #
Specifies whether sharded statements are tracked and aggregated by pgpro_stats.
Default: on.

pgpro_stats.pgss_max_nodes_tracked (integer) #
Sets the maximum number of nodes that are tracked by pgpro_stats for query fragments. It actually sets the maximum number of status entries that pgpro_stats can store for the pgpro_stats_sdm_stats_updated function. It does not affect the statistics tracking itself.
Default: 2048.

pgpro_stats.transport_compression (string) #
Sets the algorithm for transport compression when transferring statistics between nodes. Transport compression is used to compress statistics entries passed from the shard nodes to the coordinator. The possible values are pglz, zlib, lz4, zstd, or off.
Default: pglz.

pgpro_stats.enable_wait_counters (boolean) #
Enables or disables statistics collection for wait counters by enabling or disabling the functions that calculate metrics of wait events.
Default: off.

pgpro_stats.enable_inval_msgs_counters (boolean) #
Enables or disables statistics collection for invalidation messages by enabling or disabling the functions that calculate metrics of invalidation messages. If disabled, the pgpro_stats_inval_status view is empty.
Default: off.

pgpro_stats.enable_rusage_counters (boolean) #
Enables or disables statistics collection for resource usage counters by enabling or disabling the functions that calculate metrics of OS resource usage.
Default: off.

pgpro_stats.track_shardman_connections (enum) #
Enables or disables Shardman-specific statement processing. This parameter has three possible values: none with no processing, normalized (default) with generalized statements being processed, and all with all statements being processed.
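Assuming these pgpro_stats parameters are set like any other PostgreSQL settings, they would go into the pgParameters hash table of ShardSpec. The fragment below is hypothetical: the values are illustrative, and the inclusion and position of pgpro_stats in shared_preload_libraries is an assumption, not taken from this document:
"pgParameters": {
    "shared_preload_libraries": "postgres_fdw, shardman, pgpro_stats",
    "pgpro_stats.track_sharded": "on",
    "pgpro_stats.transport_compression": "zstd",
    "pgpro_stats.pgss_max_nodes_tracked": "2048"
}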