7.3. Distributed Transactions #
7.3.1. Visibility and CSN #
7.3.1.1. CSN — Commit Sequence Number #
A Shardman cluster uses a snapshot isolation mechanism for distributed transactions. The mechanism provides a way to synchronize snapshots between different nodes of a cluster and a way to atomically commit such a transaction with respect to other concurrent global and local transactions. These global transactions can be coordinated by using provided SQL functions or through postgres_fdw, which uses these functions on remote nodes transparently.
Assume that each node uses CSN-based visibility: the database tracks a counter for each transaction commit (CSN). With such a setting, a snapshot is just a single number — a copy of the current CSN at the moment when the snapshot was taken. Visibility rules boil down to checking whether the current tuple's CSN is less than our snapshot's CSN.
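To make the rule concrete, here is a minimal Python sketch of CSN-based visibility; the TupleVersion and Node classes and their fields are invented for illustration and are not Shardman internals:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TupleVersion:
        csn: Optional[int] = None      # CSN of the committing transaction, if committed

    class Node:
        def __init__(self):
            self.current_csn = 0       # advanced on every transaction commit

        def commit(self, written_tuples):
            self.current_csn += 1      # stamp the commit with a fresh CSN
            for t in written_tuples:
                t.csn = self.current_csn
            return self.current_csn

        def take_snapshot(self):
            return self.current_csn    # a snapshot is just a copy of the current CSN

    def is_visible(t, snapshot_csn):
        # A tuple version is visible iff its CSN is less than the snapshot's CSN.
        return t.csn is not None and t.csn < snapshot_csn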
Let's assume that CSN is the current physical time on the node and call it GlobalCSN. If the physical time on different nodes were perfectly synchronized, then a snapshot obtained on one node could be used on other nodes to provide the necessary level of transaction isolation. Unfortunately, physical time is never perfectly synchronized and can drift, and this must be taken into account. Also, there is no easy notion of a lock or an atomic operation in a distributed environment, so commit atomicity on different nodes with respect to concurrent snapshot acquisition should be handled somehow. This is addressed in the following way:
- To achieve commit atomicity on different nodes, an intermediate step is introduced: at first, a transaction is marked as InDoubt on all nodes, and only after that each node commits it and stamps it with a given GlobalCSN. All readers that run into tuples of an InDoubt transaction should wait until it ends and then recheck visibility.
- When the coordinator is marking transactions as InDoubt on other nodes, it collects a ProposedGlobalCSN from each participant, which is the local time on that node. Next, it selects the maximal value of all ProposedGlobalCSNs and commits the transaction on all nodes with that maximal GlobalCSN, even if that value is greater than the current time on a node due to clock drift, so the GlobalCSN for the given transaction is the same on all nodes (see the sketch below). Each node records its last generated CSN (last_csn) and cannot generate a CSN ≤ last_csn. When a node commits a transaction with CSN > last_csn, last_csn is adjusted to record this CSN. Due to this mechanism, a node cannot generate a CSN that is less than the CSNs of already committed transactions.
- When a local transaction imports a foreign global snapshot with some GlobalCSN and the current time on this node is smaller than the incoming GlobalCSN, the transaction should wait until this GlobalCSN time comes to the local clock.
The last two rules provide protection against clock drift.
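For illustration, the following Python sketch models the protocol above, assuming that CSNs are physical timestamps; the Participant class and all of its methods are hypothetical stand-ins, not actual Shardman code:

    import time

    class Participant:
        def __init__(self):
            self.last_csn = 0.0
            self.in_doubt = set()

        def local_clock(self):
            return time.time()                    # node-local physical time

        def mark_in_doubt(self, xid):
            self.in_doubt.add(xid)                # readers hitting these tuples wait
            return self.local_clock()             # ProposedGlobalCSN

        def commit_in_doubt(self, xid, global_csn):
            # Stamp with the agreed GlobalCSN, even if it is ahead of our clock;
            # last_csn guarantees we never generate a CSN <= an already used one.
            self.last_csn = max(self.last_csn, global_csn)
            self.in_doubt.discard(xid)            # waiting readers recheck visibility

        def import_snapshot(self, global_csn):
            # Wait until the foreign GlobalCSN time comes to the local clock.
            while self.local_clock() < global_csn:
                time.sleep(0.0001)

    def distributed_commit(participants, xid):
        proposals = [p.mark_in_doubt(xid) for p in participants]  # phase 1
        global_csn = max(proposals)               # maximal ProposedGlobalCSN
        for p in participants:                    # phase 2
            p.commit_in_doubt(xid, global_csn)
        return global_csn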
7.3.1.2. Commit Delay and External Consistency #
The rules above still do not guarantee recency for snapshots generated on nodes that do not participate in a transaction. A read operation that originates from such a node can see stale data. The probability of this anomaly directly depends on the system clock skew in the Shardman cluster.
Particular attention should be paid to the synchronization of system clocks on all cluster nodes. The size of the clock skew must be measured. If external consistency is required, the clock skew can be compensated with a commit delay. This delay is added before every commit in the system, so it has a negative impact on the latency of transactions. Read-only transactions are not affected by this delay. The delay can be set using the csn_commit_delay configuration parameter.
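As a rough sketch of the idea (Python; node.stamp_commit() is a hypothetical helper, and csn_commit_delay stands for the value of the configuration parameter):

    import time

    def commit_with_delay(node, xid, csn_commit_delay):
        global_csn = node.stamp_commit(xid)   # hypothetical: assigns the commit CSN
        # Hold the commit acknowledgement until every clock in the cluster has
        # certainly passed global_csn (the delay must exceed the measured skew),
        # so a snapshot taken anywhere afterwards will include this transaction.
        time.sleep(csn_commit_delay)
        return global_csn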
7.3.1.3. CSN Map #
The CSN visibility mechanism described above is not a general way to check the visibility of all transactions. It is used to provide isolation only for distributed transactions. As a result, each cluster node uses a visibility checking mechanism based on xid and xmin. To be able to use a CSN snapshot that points to the past, we need to keep old versions of tuples on all nodes and therefore defer vacuuming them. To do this, each node in a Shardman cluster maintains a CSN to xid mapping called CSNSnapshotXidMap. This map is a ring buffer that stores the correspondence between the current snapshot_csn and xmin in a sparse way: snapshot_csn is rounded to seconds (here we use the fact that snapshot_csn is just a timestamp), and xmin is stored in the circular buffer where the rounded snapshot_csn acts as an offset from the current circular buffer head. The size of the circular buffer is controlled by the csn_snapshot_defer_time configuration setting. VACUUM is not allowed to clean up tuples whose xmax is newer than the oldest xmin in CSNSnapshotXidMap.
When a CSN snapshot arrives, we check that its snapshot_csn is still in our map; otherwise, we error out with a “snapshot too old” message. If the snapshot_csn is successfully mapped, we fill the backend's xmin with the value from the map. That way we can take into account backends with an imported CSN snapshot, and old tuple versions will be preserved.
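A toy model of such a map could look as follows (Python; the class layout and method names are invented for the example and do not mirror the server's internal data structures):

    class CSNSnapshotXidMap:
        # Toy ring buffer: one slot per second of csn_snapshot_defer_time.
        def __init__(self, csn_snapshot_defer_time):
            self.size = csn_snapshot_defer_time
            self.slots = [None] * self.size       # xmin for each rounded second
            self.head_sec = 0                     # newest rounded snapshot_csn

        def record(self, csn_sec, xmin):
            # Advance the head to the current second, recycling expired slots.
            while self.head_sec < csn_sec:
                self.head_sec += 1
                self.slots[self.head_sec % self.size] = None
            self.slots[csn_sec % self.size] = xmin

        def lookup(self, csn_sec):
            # Import of a CSN snapshot: map its rounded second back to an xmin.
            if self.head_sec - csn_sec >= self.size or self.slots[csn_sec % self.size] is None:
                raise RuntimeError("snapshot too old")
            return self.slots[csn_sec % self.size]  # becomes the backend's xmin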
7.3.1.4. CSN Map Trimming #
To support global transactions, each node keeps old versions of tuples for at least csn_snapshot_defer_time seconds. With large values of csn_snapshot_defer_time, this negatively affects performance, because nodes save all row versions from the last csn_snapshot_defer_time seconds even when no transaction in the cluster can still read them. A special task of the monitor periodically recalculates xmin in the cluster and sets it on all nodes to the minimum possible value. This allows the vacuuming routine to remove row versions that are no longer of interest to any transaction. The shardman.monitor_trim_csnxid_map_interval configuration setting controls this worker: the worker wakes up every shardman.monitor_trim_csnxid_map_interval seconds and performs the following operations (a sketch follows the list):
1. Checks whether the current node's repgroup ID is the smallest among all IDs in the cluster. If not, the work on the current node terminates, so only one node in the cluster performs the horizon negotiation.
2. Collects, from each node of the Shardman cluster, the oldest snapshot CSN among all active transactions on that node.
3. Chooses the smallest CSN and sends it to each node. Each node discards its csnXidMap values that are less than this value.
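In outline, the negotiation could be modeled like this (Python; repgroup_id, oldest_active_snapshot_csn(), and trim_csn_xid_map() are hypothetical stand-ins for internal operations):

    def negotiate_horizon(self_node, cluster_nodes):
        # Step 1: only the node with the smallest repgroup ID does the work.
        if self_node.repgroup_id != min(n.repgroup_id for n in cluster_nodes):
            return
        # Step 2: the oldest snapshot CSN among active transactions anywhere.
        horizon = min(n.oldest_active_snapshot_csn() for n in cluster_nodes)
        # Step 3: every node drops csnXidMap entries below the common horizon,
        # letting VACUUM reclaim row versions no transaction can still read.
        for n in cluster_nodes:
            n.trim_csn_xid_map(horizon)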
7.3.2. 2PC and Prepared Transaction Resolution #
Shardman implements a two-phase commit protocol to ensure the atomicity of distributed transactions. During the execution of a distributed transaction, the coordinator node sends the BEGIN command to participant nodes to initiate their local transactions.
The term "participant nodes" herein and subsequently refers to a subset of cluster nodes that participate in the execution of a transaction's command while the node is engaged in writing activity.
Additionally, a local transaction is created on the coordinator node. This ensures that there are corresponding local transactions on all nodes participating in the distributed transaction.
During the two-phase transaction commit, the coordinator node sends the PREPARE TRANSACTION command to the participant nodes to initiate the preparation of their local transactions for commit. If the preparation is successful, the local transaction data is stored on disk, making it persistent. If all participant nodes report successful preparation to the coordinator node, the coordinator node commits its local transaction. Subsequently, the coordinator node also commits the previously prepared transactions on the participant nodes using the COMMIT PREPARED command.
If a failure occurs during the PREPARE TRANSACTION command on any of the participant nodes, the distributed transaction is considered aborted. The coordinator node then broadcasts the ROLLBACK PREPARED command to abort the previously prepared transactions. If a local transaction was already prepared, it is aborted; if there was no prepared transaction with the specified name, the rollback command is simply ignored. Subsequently, the coordinator node rolls back its local transaction.
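Schematically, the commit path can be expressed with the following Python sketch; the node objects and their sql(), commit_local(), and rollback_local() helpers are hypothetical, while the SQL commands are the ones named above:

    def two_phase_commit(coordinator, participants, gid):
        try:
            for node in participants:
                node.sql(f"PREPARE TRANSACTION '{gid}'")    # phase 1: make durable
        except Exception:
            for node in participants:
                # Ignored by a node that has no prepared transaction with this name.
                node.sql(f"ROLLBACK PREPARED '{gid}'")
            coordinator.rollback_local()
            raise
        coordinator.commit_local()                          # decide the outcome
        for node in participants:
            node.sql(f"COMMIT PREPARED '{gid}'")            # phase 2: finalize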
After a successful preparation phase, a prepared transaction object exists on each of the participant nodes. These objects are actually disk files and records in server memory.
It is possible to have a prepared transaction that was created earlier through a two-phase operation and will never be completed. This can occur, for example, if the coordinator node fails right after the preparation step but before the commit step. It can also occur as a result of network connectivity issues. For instance, if the COMMIT PREPARED command from the coordinator node to a participant node ends with an error, local transactions will be committed on all participant nodes except the one with the error. The local transaction will also be committed on the coordinator node. All participants except the failed one believe that the distributed transaction was completed, but that participant, still waiting for COMMIT PREPARED, will never receive it, resulting in a prepared transaction that will never be completed.
A prepared transaction consumes system resources, such as memory and disk space. An incomplete prepared transaction causes other transactions that access rows modified by that transaction to wait until the distributed operation completes. Therefore, it is necessary to complete prepared transactions, even in cases where there were failures during commit, to free up resources and ensure that other transactions can proceed.
To resolve such situations, there is a mechanism for resolving prepared transactions that is implemented as part of the Shardman monitor. It is a background worker that wakes up periodically, acting as an internal “crontab” job. By default, the period is set to 5 seconds, but it can be configured using the shardman.monitor_dxact_interval configuration parameter. The worker checks, on the same node where the Shardman monitor is running, for prepared transactions that were created more than a certain amount of time ago, as specified by the shardman.monitor_dxact_timeout configuration parameter (also set to 5 seconds by default).
When the PREPARE TRANSACTION command is sent to a participant node, a special name is assigned to the prepared transaction. This name encodes useful information that allows identifying the coordinator node and its local transaction.
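The exact name format is internal; purely for illustration, such an encoding could look like this (Python; the shmn- prefix and layout are invented):

    def encode_gid(repgroup_id, xid):
        # Illustrative format only: carries the coordinator's repgroup ID
        # and the ID of the coordinator's local transaction.
        return f"shmn-{repgroup_id}-{xid}"

    def decode_gid(gid):
        _, repgroup_id, xid = gid.split("-")
        return int(repgroup_id), int(xid)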
If the Shardman monitor finds outdated prepared transactions, it extracts the coordinator's replication group ID and the transaction ID of the coordinator's local transaction. The monitor then sends a query to the coordinator:

SELECT shardman.xact_status(TransactionId)

This query requests the current status of the coordinator's local transaction. If the query fails, for example, due to network connectivity issues, the prepared transaction remains untouched until the next time the monitor wakes up.
In the case of a successful query, the coordinator node can reply with one of the following statuses (a summary sketch follows this list):

committed #

The local transaction on the coordinator node was completed successfully. Therefore, the Shardman monitor also commits this prepared transaction using the COMMIT PREPARED command.

aborted #

The local transaction on the coordinator node was aborted. Therefore, the monitor also aborts this transaction using the ROLLBACK PREPARED command.

unknown #

The transaction with such an identifier never existed on the coordinator node. Therefore, the monitor aborts this transaction using the ROLLBACK PREPARED command.

active #

The local transaction on the coordinator node is still somewhere inside the CommitTransaction() flow. Therefore, the monitor does nothing with this transaction and will try again at the next wake-up.

ambiguous #

This status can be returned when CLOG truncation is enabled on the coordinator node. The CLOG is a bitmap that stores the status of completed local transactions. When a transaction is committed or aborted, its status is marked in the CLOG. However, the CLOG can be truncated (garbage collected) by the VACUUM process to discard statuses of old transactions that do not affect the visibility of data for any existing transaction.

When the CLOG is truncated, the shardman.xact_status() function may not be able to unambiguously decide whether a transaction existed in the past (with some status) or never existed at all. In such cases, the function returns the ambiguous status. This can lead to uncertainty about the actual status of the transaction and can make it difficult to resolve the prepared transaction.

When the shardman.xact_status() function returns the ambiguous status for a prepared transaction, the monitor node logs a warning message indicating that the status could not be determined unambiguously. The prepared transaction is left untouched, and the monitor will try again at the next wake-up. To avoid ambiguity in the status of prepared transactions, it is important to set the min_clog_size parameter to 1024000 (which means "never truncate CLOG").
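Putting this together, the monitor's decision logic can be summarized with the following Python sketch; the monitor and coordinator objects and their sql() and log_warning() helpers are hypothetical, while the SQL commands and statuses are the ones described above:

    def resolve_prepared(monitor, coordinator, gid, coord_xid):
        status = coordinator.sql(f"SELECT shardman.xact_status({coord_xid})")
        if status == "committed":
            monitor.sql(f"COMMIT PREPARED '{gid}'")         # finish the commit
        elif status in ("aborted", "unknown"):
            monitor.sql(f"ROLLBACK PREPARED '{gid}'")       # safe to abort
        elif status == "active":
            pass                                            # retry at next wake-up
        elif status == "ambiguous":
            # CLOG truncation hides the answer; log and retry at next wake-up.
            monitor.log_warning(f"cannot resolve prepared transaction {gid}")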
In situations where the prepared transaction resolution mechanism is unable to resolve prepared transactions due to persistent errors or an ambiguous status, the administrator needs to intervene manually. This may involve examining the server logs and performing a manual rollback or commit operation on the prepared transaction. Note that leaving prepared transactions unresolved can lead to resource consumption and performance issues, so it is important to address these situations as soon as possible.