7.3. Distributed Transactions #
7.3.1. Visibility and CSN #
7.3.1.1. CSN — Commit Sequence Number #
A Shardman cluster uses a snapshot isolation mechanism for distributed transactions. The mechanism provides a way to synchronize snapshots between different nodes of a cluster and a way to atomically commit such a transaction with respect to other concurrent global and local transactions. These global transactions can be coordinated by using provided SQL functions or through postgres_fdw, which uses these functions on remote nodes transparently.
Assume that each node uses CSN-based visibility: the database tracks a counter for each transaction commit (CSN). With such a setting, a snapshot is just a single number — a copy of the current CSN at the moment when the snapshot was taken. Visibility rules boil down to checking whether the current tuple's CSN is less than our snapshot's CSN.
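To make the rule concrete, here is a minimal Python sketch of CSN-based visibility; the TupleVersion and Node classes and their fields are invented for illustration and are not Shardman internals:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TupleVersion:
        csn: Optional[int] = None      # CSN of the committing transaction, if committed

    class Node:
        def __init__(self):
            self.current_csn = 0       # advanced on every transaction commit

        def commit(self, written_tuples):
            self.current_csn += 1      # stamp the commit with a fresh CSN
            for t in written_tuples:
                t.csn = self.current_csn
            return self.current_csn

        def take_snapshot(self):
            return self.current_csn    # a snapshot is just a copy of the current CSN

    def is_visible(t, snapshot_csn):
        # A tuple version is visible iff its CSN is less than the snapshot's CSN.
        return t.csn is not None and t.csn < snapshot_csn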
Let's assume that CSN is the current physical time on the node and call it GlobalCSN. If the physical time on different nodes were perfectly synchronized, then a snapshot obtained on one node could be used on other nodes to provide the necessary level of transaction isolation. Unfortunately, physical time is never perfectly synchronized and can drift, and this must be taken into account. Also, there is no easy notion of a lock or an atomic operation in a distributed environment, so commit atomicity on different nodes with respect to concurrent snapshot acquisition should be handled somehow. This is addressed in the following way:
- To achieve commit atomicity on different nodes, an intermediate step is introduced: at first, a transaction is marked as InDoubt on all nodes, and only after that each node commits it and stamps it with a given GlobalCSN. All readers that run into tuples of an InDoubt transaction should wait until it ends and then recheck visibility.
- When the coordinator is marking transactions as InDoubt on other nodes, it collects a ProposedGlobalCSN from each participant, which is the local time on that node. Next, it selects the maximal value of all ProposedGlobalCSNs and commits the transaction on all nodes with that maximal GlobalCSN, even if that value is greater than the current time on a node due to clock drift, so the GlobalCSN for the given transaction is the same on all nodes (see the sketch below). Each node records its last generated CSN (last_csn) and cannot generate a CSN ≤ last_csn. When a node commits a transaction with CSN > last_csn, last_csn is adjusted to record this CSN. Due to this mechanism, a node cannot generate a CSN that is less than the CSNs of already committed transactions.
- When a local transaction imports a foreign global snapshot with some GlobalCSN and the current time on this node is smaller than the incoming GlobalCSN, the transaction should wait until this GlobalCSN time comes to the local clock.
The last two rules provide protection against clock drift.
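For illustration, the following Python sketch models the protocol above, assuming that CSNs are physical timestamps; the Participant class and all of its methods are hypothetical stand-ins, not actual Shardman code:

    import time

    class Participant:
        def __init__(self):
            self.last_csn = 0.0
            self.in_doubt = set()

        def local_clock(self):
            return time.time()                    # node-local physical time

        def mark_in_doubt(self, xid):
            self.in_doubt.add(xid)                # readers hitting these tuples wait
            return self.local_clock()             # ProposedGlobalCSN

        def commit_in_doubt(self, xid, global_csn):
            # Stamp with the agreed GlobalCSN, even if it is ahead of our clock;
            # last_csn guarantees we never generate a CSN <= an already used one.
            self.last_csn = max(self.last_csn, global_csn)
            self.in_doubt.discard(xid)            # waiting readers recheck visibility

        def import_snapshot(self, global_csn):
            # Wait until the foreign GlobalCSN time comes to the local clock.
            while self.local_clock() < global_csn:
                time.sleep(0.0001)

    def distributed_commit(participants, xid):
        proposals = [p.mark_in_doubt(xid) for p in participants]  # phase 1
        global_csn = max(proposals)               # maximal ProposedGlobalCSN
        for p in participants:                    # phase 2
            p.commit_in_doubt(xid, global_csn)
        return global_csn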
7.3.1.2. Commit Delay and External Consistency #
The rules above still do not guarantee recency for snapshots generated on nodes that do not participate in a transaction. A read operation that originates from such a node can see stale data. The probability of this anomaly directly depends on the system clock skew in the Shardman cluster.
Particular attention should be paid to the synchronization of system clocks on all cluster nodes. The size of the clock skew must be measured. If external consistency is required, the clock skew can be compensated with a commit delay. This delay is added before every commit in the system, so it has a negative impact on the latency of transactions. Read-only transactions are not affected by this delay. The delay can be set using the csn_commit_delay configuration parameter.
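As a rough sketch of the idea (Python; node.stamp_commit() is a hypothetical helper, and csn_commit_delay stands for the value of the configuration parameter):

    import time

    def commit_with_delay(node, xid, csn_commit_delay):
        global_csn = node.stamp_commit(xid)   # hypothetical: assigns the commit CSN
        # Hold the commit acknowledgement until every clock in the cluster has
        # certainly passed global_csn (the delay must exceed the measured skew),
        # so a snapshot taken anywhere afterwards will include this transaction.
        time.sleep(csn_commit_delay)
        return global_csn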
7.3.1.3. CSN Map #
The CSN visibility mechanism described above is not a general way to check the visibility of all transactions. It is used to provide isolation only for distributed transactions. As a result, each cluster node uses a visibility checking mechanism based on xid and xmin. To be able to use a CSN snapshot that points to the past, we need to keep old versions of tuples on all nodes and therefore defer vacuuming them. To do this, each node in a Shardman cluster maintains a CSN to xid mapping called CSNSnapshotXidMap. This map is a ring buffer that stores the correspondence between the current snapshot_csn and xmin in a sparse way: snapshot_csn is rounded to seconds (here we use the fact that snapshot_csn is just a timestamp), and xmin is stored in the circular buffer where the rounded snapshot_csn acts as an offset from the current circular buffer head. The size of the circular buffer is controlled by the csn_snapshot_defer_time configuration setting. VACUUM is not allowed to clean up tuples whose xmax is newer than the oldest xmin in CSNSnapshotXidMap.
When a CSN snapshot arrives, we check that its snapshot_csn is still in our map; otherwise, we error out with a “snapshot too old” message. If the snapshot_csn is successfully mapped, we fill the backend's xmin with the value from the map. That way we can take into account backends with an imported CSN snapshot, and old tuple versions will be preserved.
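A toy model of such a map could look as follows (Python; the class layout and method names are invented for the example and do not mirror the server's internal data structures):

    class CSNSnapshotXidMap:
        # Toy ring buffer: one slot per second of csn_snapshot_defer_time.
        def __init__(self, csn_snapshot_defer_time):
            self.size = csn_snapshot_defer_time
            self.slots = [None] * self.size       # xmin for each rounded second
            self.head_sec = 0                     # newest rounded snapshot_csn

        def record(self, csn_sec, xmin):
            # Advance the head to the current second, recycling expired slots.
            while self.head_sec < csn_sec:
                self.head_sec += 1
                self.slots[self.head_sec % self.size] = None
            self.slots[csn_sec % self.size] = xmin

        def lookup(self, csn_sec):
            # Import of a CSN snapshot: map its rounded second back to an xmin.
            if self.head_sec - csn_sec >= self.size or self.slots[csn_sec % self.size] is None:
                raise RuntimeError("snapshot too old")
            return self.slots[csn_sec % self.size]  # becomes the backend's xmin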
7.3.1.4. CSN Map Trimming #
To support global transactions, each node keeps old versions of tuples for at least csn_snapshot_defer_time seconds. With large values of csn_snapshot_defer_time, this negatively affects performance, because nodes save all row versions from the last csn_snapshot_defer_time seconds even when no transaction in the cluster can still read them. A special task of the monitor periodically recalculates xmin in the cluster and sets it on all nodes to the minimum possible value. This allows the vacuuming routine to remove row versions that are no longer of interest to any transaction. The shardman.monitor_trim_csnxid_map_interval configuration setting controls this worker: the worker wakes up every shardman.monitor_trim_csnxid_map_interval seconds and performs the following operations (a sketch follows the list):
1. Checks whether the current node's repgroup ID is the smallest among all IDs in the cluster. If not, the work on the current node terminates, so only one node in the cluster performs the horizon negotiation.
2. Collects, from each node of the Shardman cluster, the oldest snapshot CSN among all active transactions on that node.
3. Chooses the smallest CSN and sends it to each node. Each node discards its csnXidMap values that are less than this value.
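In outline, the negotiation could be modeled like this (Python; repgroup_id, oldest_active_snapshot_csn(), and trim_csn_xid_map() are hypothetical stand-ins for internal operations):

    def negotiate_horizon(self_node, cluster_nodes):
        # Step 1: only the node with the smallest repgroup ID does the work.
        if self_node.repgroup_id != min(n.repgroup_id for n in cluster_nodes):
            return
        # Step 2: the oldest snapshot CSN among active transactions anywhere.
        horizon = min(n.oldest_active_snapshot_csn() for n in cluster_nodes)
        # Step 3: every node drops csnXidMap entries below the common horizon,
        # letting VACUUM reclaim row versions no transaction can still read.
        for n in cluster_nodes:
            n.trim_csn_xid_map(horizon)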
7.3.2. 2PC and Prepared Transaction Resolution #
Shardman implements a two-phase commit protocol to ensure the atomicity of distributed transactions. During the execution of a distributed transaction, the coordinator node sends the BEGIN command to participant nodes to initiate their local transactions.
The term "participant nodes" herein and subsequently refers to a subset of cluster nodes that participate in the execution of a transaction's command while the node is engaged in writing activity.
Additionally, a local transaction is created on the coordinator node. This ensures that there are corresponding local transactions on all nodes participating in the distributed transaction.
During the two-phase transaction commit, the coordinator node sends the PREPARE TRANSACTION command to the participant nodes to initiate the preparation of their local transactions for commit. If the preparation is successful, the local transaction data is stored on disk, making it persistent. If all participant nodes report successful preparation to the coordinator node, the coordinator node commits its local transaction. Subsequently, the coordinator node also commits the previously prepared transactions on the participant nodes using the COMMIT PREPARED command.
If a failure occurs during the PREPARE TRANSACTION command on any of the participant nodes, the distributed transaction is considered aborted. The coordinator node then broadcasts the ROLLBACK PREPARED command to abort the previously prepared transactions. If a local transaction was already prepared, it is aborted; if there was no prepared transaction with the specified name, the rollback command is simply ignored. Subsequently, the coordinator node rolls back its local transaction.
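Schematically, the commit path can be expressed with the following Python sketch; the node objects and their sql(), commit_local(), and rollback_local() helpers are hypothetical, while the SQL commands are the ones named above:

    def two_phase_commit(coordinator, participants, gid):
        try:
            for node in participants:
                node.sql(f"PREPARE TRANSACTION '{gid}'")    # phase 1: make durable
        except Exception:
            for node in participants:
                # Ignored by a node that has no prepared transaction with this name.
                node.sql(f"ROLLBACK PREPARED '{gid}'")
            coordinator.rollback_local()
            raise
        coordinator.commit_local()                          # decide the outcome
        for node in participants:
            node.sql(f"COMMIT PREPARED '{gid}'")            # phase 2: finalize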
After a successful preparation phase, a prepared transaction object exists on each of the participant nodes. These objects are actually disk files and records in server memory.
It is possible to have a prepared transaction that was created earlier through a two-phase operation and will never be completed. This can occur, for example, if the coordinator node fails right after the preparation step but before the commit step. It can also occur as a result of network connectivity issues. For instance, if the COMMIT PREPARED command from the coordinator node to a participant node ends with an error, local transactions will be committed on all participant nodes except the one with the error. The local transaction will also be committed on the coordinator node. All participants except the failed one believe that the distributed transaction was completed, but that participant, still waiting for COMMIT PREPARED, will never receive it, resulting in a prepared transaction that will never be completed.
A prepared transaction consumes system resources, such as memory and disk space. An incomplete prepared transaction causes other transactions that access rows modified by that transaction to wait until the distributed operation completes. Therefore, it is necessary to complete prepared transactions, even in cases where there were failures during commit, to free up resources and ensure that other transactions can proceed.
To resolve such situations, there is a mechanism for resolving prepared transactions that is implemented as part of the Shardman monitor. It is a background worker that wakes up periodically, acting as an internal “crontab” job. By default, the period is set to 5 seconds, but it can be configured using the shardman.monitor_dxact_interval configuration parameter. The worker checks, on the same node where the Shardman monitor is running, for prepared transactions that were created more than a certain amount of time ago, as specified by the shardman.monitor_dxact_timeout configuration parameter (also set to 5 seconds by default).
When the PREPARE TRANSACTION command is sent to a participant node, a special name is assigned to the prepared transaction. This name encodes useful information that allows identifying the coordinator node and its local transaction.
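The exact name format is internal; purely for illustration, such an encoding could look like this (Python; the shmn- prefix and layout are invented):

    def encode_gid(repgroup_id, xid):
        # Illustrative format only: carries the coordinator's repgroup ID
        # and the ID of the coordinator's local transaction.
        return f"shmn-{repgroup_id}-{xid}"

    def decode_gid(gid):
        _, repgroup_id, xid = gid.split("-")
        return int(repgroup_id), int(xid)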
If the Shardman monitor finds outdated prepared transactions, it extracts the coordinator's replication group ID and the transaction ID of the coordinator's local transaction. The monitor then sends a query to the coordinator:

SELECT shardman.xact_status(TransactionId)

This query requests the current status of the coordinator's local transaction. If the query fails, for example, due to network connectivity issues, the prepared transaction remains untouched until the next time the monitor wakes up.
In the case of a successful query, the coordinator node can reply with one of the following statuses (a summary sketch follows this list):

committed #

The local transaction on the coordinator node was completed successfully. Therefore, the Shardman monitor also commits this prepared transaction using the COMMIT PREPARED command.

aborted #

The local transaction on the coordinator node was aborted. Therefore, the monitor also aborts this transaction using the ROLLBACK PREPARED command.

unknown #

The transaction with such an identifier never existed on the coordinator node. Therefore, the monitor aborts this transaction using the ROLLBACK PREPARED command.

active #

The local transaction on the coordinator node is still somewhere inside the CommitTransaction() flow. Therefore, the monitor does nothing with this transaction and will try again at the next wake-up.

ambiguous #

This status can be returned when CLOG truncation is enabled on the coordinator node. The CLOG is a bitmap that stores the status of completed local transactions. When a transaction is committed or aborted, its status is marked in the CLOG. However, the CLOG can be truncated (garbage collected) by the VACUUM process to discard statuses of old transactions that do not affect the visibility of data for any existing transaction.

When the CLOG is truncated, the shardman.xact_status() function may not be able to unambiguously decide whether a transaction existed in the past (with some status) or never existed at all. In such cases, the function returns the ambiguous status. This can lead to uncertainty about the actual status of the transaction and can make it difficult to resolve the prepared transaction.

When the shardman.xact_status() function returns the ambiguous status for a prepared transaction, the monitor node logs a warning message indicating that the status could not be determined unambiguously. The prepared transaction is left untouched, and the monitor will try again at the next wake-up. To avoid ambiguity in the status of prepared transactions, it is important to set the min_clog_size parameter to 1024000 (which means "never truncate CLOG").
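Putting this together, the monitor's decision logic can be summarized with the following Python sketch; the monitor and coordinator objects and their sql() and log_warning() helpers are hypothetical, while the SQL commands and statuses are the ones described above:

    def resolve_prepared(monitor, coordinator, gid, coord_xid):
        status = coordinator.sql(f"SELECT shardman.xact_status({coord_xid})")
        if status == "committed":
            monitor.sql(f"COMMIT PREPARED '{gid}'")         # finish the commit
        elif status in ("aborted", "unknown"):
            monitor.sql(f"ROLLBACK PREPARED '{gid}'")       # safe to abort
        elif status == "active":
            pass                                            # retry at next wake-up
        elif status == "ambiguous":
            # CLOG truncation hides the answer; log and retry at next wake-up.
            monitor.log_warning(f"cannot resolve prepared transaction {gid}")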
In situations where the prepared transaction resolution mechanism is unable to resolve prepared transactions due to persistent errors or an ambiguous status, the administrator needs to intervene manually. This may involve examining the server logs and performing a manual rollback or commit operation on the prepared transaction. Note that leaving prepared transactions unresolved can lead to resource consumption and performance issues, so it is important to address these situations as soon as possible.