5.3. Time Synchronization
The algorithm that provides data consistency across all cluster nodes relies on the system clocks of the hosts. Transaction commit latency therefore depends on the clock drift between hosts, because the coordinator always waits for the most lagging host to catch up. It is crucial that the time on all connected nodes of a Shardman cluster is synchronized: poor synchronization degrades Shardman performance by increasing query latency.
First, to ensure time synchronization on all cluster nodes, install the chrony daemon when deploying a new cluster.
sudo apt update
sudo apt install -y chrony
sudo systemctl enable --now chrony
Check that chrony is working properly.
chronyc tracking
Expected output:
Reference ID    : C0248F82 (Time100.Stupi.SE)
Stratum         : 2
Ref time (UTC)  : Tue Apr 18 11:50:44 2023
System time     : 0.000019457 seconds slow of NTP time
Last offset     : -0.000005579 seconds
RMS offset      : 0.000089375 seconds
Frequency       : 30.777 ppm fast
Residual freq   : -0.000 ppm
Skew            : 0.003 ppm
Root delay      : 0.018349268 seconds
Root dispersion : 0.000334640 seconds
Update interval : 1039.1 seconds
Leap status     : Normal
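To automate this check, the "System time" line of the chronyc tracking output can be parsed in a script. The following is a minimal sketch, not part of Shardman or chrony: the function names chrony_offset_us and chrony_offset_ok are hypothetical, and the limit passed to chrony_offset_ok is a policy value you choose yourself.

```shell
# Print the "System time" offset from `chronyc tracking` output (read from
# stdin) in whole microseconds. chronyc reports the magnitude together with
# "slow" or "fast", so the printed value is non-negative.
chrony_offset_us() {
    awk -F' *: *' '/^System time/ {
        split($2, parts, " ")    # parts[1] is the offset in seconds
        printf "%d\n", parts[1] * 1000000 + 0.5
    }'
}

# Succeed (exit status 0) when the offset read from stdin does not exceed
# the limit given in microseconds as the first argument.
# The limit is a hypothetical policy value, not a Shardman setting.
chrony_offset_ok() {
    limit_us=$1
    offset_us=$(chrony_offset_us)
    [ -n "$offset_us" ] && [ "$offset_us" -le "$limit_us" ]
}
```

For example, running chronyc tracking | chrony_offset_ok 1000 || echo "clock offset above limit" on each node prints a warning when the local clock is more than 1000 microseconds away from NTP time.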
Note that clock drift should be managed with OS tools. Shardman diagnostic tools should not be treated as the only or definitive means of measuring it.
To see if any major drift already exists, use the shardman.pg_stat_csn view that shows statistics on delays that take place during import of CSN snapshots. Its values are calculated when any related action is performed, or if any of the shardman.trim_csnxid_map() or shardman.pg_oldest_csn_snapshot() functions are called. These functions are called from the csn trimmer routine worker, therefore disabling this worker will result in these statistics not being collected.
The csn_max_shift field of the shardman.pg_stat_csn view shows the maximum registered snapshot CSN shift that caused a delay. This value reflects the clock drift between the nodes in the cluster. If it keeps increasing, the system clock of at least one cluster node is out of sync. If it exceeds 1000 (microseconds), it is recommended to check the time synchronization settings.
The same conclusion can be drawn if the csn_total_import_delay value increases while csn_max_shift remains unchanged. However, a one-time increase may be caused by an isolated failure unrelated to time synchronization.
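These rules can be applied to two samples of the view taken some time apart. The sketch below is illustrative only: the function names csn_drift_hints and sample_csn_stats are hypothetical, both counters are assumed to be integers, and the psql call assumes that connection settings come from the standard PG* environment variables.

```shell
# Compare two samples of shardman.pg_stat_csn taken some time apart.
# Arguments: previous and current csn_max_shift, then previous and current
# csn_total_import_delay (all assumed to be integers).
# Prints one hint per matching rule and nothing when no rule matches.
csn_drift_hints() {
    prev_shift=$1; cur_shift=$2; prev_delay=$3; cur_delay=$4
    if [ "$cur_shift" -gt 1000 ]; then
        echo "csn_max_shift above 1000 us: check time synchronization settings"
    fi
    if [ "$cur_shift" -gt "$prev_shift" ]; then
        echo "csn_max_shift is growing: a node clock may be out of sync"
    elif [ "$cur_delay" -gt "$prev_delay" ]; then
        echo "csn_total_import_delay grows while csn_max_shift is unchanged"
    fi
}

# Hypothetical sampling helper: prints both counters separated by a space.
sample_csn_stats() {
    psql -At -F' ' -c \
        'SELECT csn_max_shift, csn_total_import_delay FROM shardman.pg_stat_csn'
}
```

A single hint from one pair of samples is not conclusive. Repeat the comparison over several intervals before treating the result as evidence of clock drift.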
Also, if the difference between CSNXidMap_head_csn and shardman.oldest_csn exceeds the csn_snapshot_defer_time parameter value and stays the same for a long time, it means that the CSNSnapshotXidMap map is full. This can result in a global transaction failure.
There are two main reasons for this issue.
A transaction runs for more than csn_snapshot_defer_time seconds and holds up the entire cluster, blocking the VACUUM process. In this case, use the xid field of the shardman.oldest_csn view to determine the ID of this transaction, and the rgid field to determine the cluster node where it is located.
The CSNSnapshotXidMap map lacks capacity. During normal operation, the system might have transactions that run longer than the csn_snapshot_defer_time value. To fix this, increase csn_snapshot_defer_time so that these transactions stay below this value.
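To tell the two cases apart, it can help to look at the oldest-CSN holder and the current setting side by side. The following is a small sketch: the function names oldest_csn_holder and current_defer_time are hypothetical, and the psql calls assume that connection settings come from the standard PG* environment variables.

```shell
# Print the xid and rgid columns of shardman.oldest_csn as "xid|rgid".
# Connection settings are taken from the PG* environment variables
# (an assumption of this sketch).
oldest_csn_holder() {
    psql -At -c 'SELECT xid, rgid FROM shardman.oldest_csn'
}

# Print the current value of the csn_snapshot_defer_time parameter.
current_defer_time() {
    psql -At -c 'SHOW csn_snapshot_defer_time'
}
```

If the same xid is reported for longer than the value printed by current_defer_time, the first case (a long-running transaction) is the more likely explanation.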
If the shardman.silk_tracepoints configuration parameter is enabled, executing the EXPLAIN command for distributed queries outputs rows with information about how much time was spent on query execution and what result it ended with, broken down by system component. These rows show metric values for the time spent in each component. The net (qry), net (1st tup), and net (last tup) metrics are calculated as the difference between timestamps taken on different servers. This difference includes both the time spent on message transfer and the clock drift (positive or negative) between these servers. Therefore, these metrics can also help to determine whether there is any clock drift.
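Because a transfer time cannot be negative, a negative value in one of these metrics already points to clock drift. When two apparent delays measured in opposite directions between the same pair of servers are available, the two components can be separated arithmetically. The sketch below is an illustration under stated assumptions: the function name split_delay is hypothetical, the transfer time is assumed to be the same in both directions, and the text above does not state which metric corresponds to which direction.

```shell
# Split two apparent one-way delays, measured in opposite directions between
# the same pair of servers, into transfer time and clock offset.
# Model (an assumption): a_to_b = transfer + offset, b_to_a = transfer - offset.
# Both arguments must use the same time unit; the output uses that unit too.
split_delay() {
    awk -v ab="$1" -v ba="$2" 'BEGIN {
        printf "transfer=%.1f offset=%.1f\n", (ab + ba) / 2, (ab - ba) / 2
    }'
}
```

For example, split_delay 1500 -500 prints transfer=500.0 offset=1000.0: the apparent delays of 1500 and -500 are consistent with a transfer time of 500 and a clock offset of 1000 in the same unit.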