F.35. multimaster — synchronous cluster to provide OLTP scalability and high availability #
multimaster
is a Postgres Pro Enterprise extension with a set of patches that turns Postgres Pro Enterprise into a synchronous shared-nothing cluster to provide Online Transaction Processing (OLTP) scalability for read transactions and high availability with automatic disaster recovery.
As compared to a standard PostgreSQL primary-standby cluster, a cluster configured with the multimaster
extension offers the following benefits:
Fault tolerance and automatic node recovery
Synchronous logical replication and DDL replication
Read scalability
Working with temporary tables on each cluster node
Upgrading minor releases of Postgres Pro Enterprise seamlessly for multimaster cluster clients.
Important
Before deploying multimaster
on production systems, make sure to take its replication restrictions into account. For details, see Section F.35.1.
The multimaster
extension replicates your database to all nodes of the cluster and allows write transactions on each node. Write transactions are synchronously replicated to all nodes, which increases commit latency. Read-only transactions and queries are executed locally, without any measurable overhead.
To ensure high availability and fault tolerance of the cluster, multimaster
determines each transaction outcome through Paxos consensus algorithm, uses custom recovery protocol and heartbeats for failure discovery. A multi-master cluster of N
nodes can continue working while the majority of the nodes are alive and reachable by other nodes. To be configured with multimaster
, the cluster must include at least three nodes. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes. There is also a special 2+1 (referee) mode in which 2 nodes hold data and an additional one called referee only participates in voting. Compared to traditional three nodes setup, this is cheaper (referee resources demands are low) but availability is decreased. For details, see Section F.35.3.3.
When a failed node is reconnected to the cluster, multimaster
automatically fast-forwards the node to the actual state based on the Write-Ahead Log (WAL) data in the corresponding replication slot. If a node was excluded from the cluster, you can add it back using pg_basebackup.
To learn more about the multimaster
internals, see Section F.35.2.
F.35.1. Limitations #
The multimaster
extension takes care of the database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:
Microsoft Windows operating system is not supported.
1C solutions are not supported.
multimaster
can replicate only one database in a cluster. If it is required to replicate the contents of several databases, you can either transfer all data into different schemas within a single database or create a separate cluster for each database and set upmultimaster
for each cluster.Large objects are not supported. Although creating large objects is allowed, multimaster cannot replicate such objects, and their OIDs may conflict on different nodes, so their use is not recommended.
Since
multimaster
is based on logical replication and Paxos over three-phase commit protocol, its operation is highly affected by network latency. It is not recommended to set up amultimaster
cluster with geographically distributed nodes.Using tables without primary keys can have negative impact on performance. In some cases, it can even lead to inability to restore a cluster node, so you should avoid replicating such tables with
multimaster
.Unlike in vanilla PostgreSQL,
read committed
isolation level can cause serialization failures on a multi-master cluster (with an SQLSTATE code '40001') if there are conflicting transactions from different nodes, so the application must be ready to retry transactions.Serializable
isolation level works only with respect to local transactions on the current node.Sequence generation. To avoid conflicts between unique identifiers on different nodes,
multimaster
modifies the default behavior of sequence generators. By default, ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly. Thus, the generated sequence values are not monotonic. If it is critical to get a monotonically increasing sequence cluster-wide, you can set themultimaster.monotonic_sequences
totrue
.Commit latency. In the current implementation of logical replication,
multimaster
sends data to subscriber nodes only after the local commit, so you have to wait for transaction processing twice: first on the local node, and then on all the other nodes simultaneously. In the case of a heavy-write transaction, this may result in a noticeable delay.Logical replication does not guarantee that a system object OID is the same on all cluster nodes, so OIDs for the same object may differ between
multimaster
cluster nodes. If your driver or application relies on OIDs, make sure that their use is restricted to connections to one and the same node to avoid errors. For example, theNpgsql
driver may not work correctly withmultimaster
if theNpgsqlConnection.GlobalTypeMapper
method tries using OIDs in connections to different cluster nodes.A
multimaster
cluster node cannot function as a logical replication subscriber. While amultimaster
node can function as a publisher, the subscription cannot automatically switch to another node if the node fails.Replicated non-conflicting transactions are applied on the receiving nodes in parallel, so such transactions may become visible on different nodes in different order.
If
multimaster
is working under heavy load, and one of the nodes stops even for a short time, this node can fall far behind other nodes as they take the load in multiple threads, whereas the lagging node catches up in a single thread. In this case, you may need to unload other nodes in order to synchronize all the nodes.CREATE INDEX CONCURRENTLY
,REINDEX CONCURRENTLY
,CREATE TABLESPACE
, andDROP TABLESPACE
are not supported.COMMIT AND CHAIN
feature is not supported.
F.35.2. Architecture #
F.35.2.1. Replication #
Since each server in a multi-master cluster can accept writes, any server can abort a transaction because of a concurrent update — in the same way as it happens on a single server between different backends. To ensure high availability and data consistency on all cluster nodes, multimaster
uses logical replication and the three-phase commit protocol with transaction outcome determined by Paxos consensus algorithm.
When Postgres Pro Enterprise loads the multimaster
shared library, multimaster
sets up a logical replication producer and consumer for each node, and hooks into the transaction commit pipeline. The typical data replication workflow consists of the following phases:
PREPARE
phase.multimaster
captures and implicitly transforms eachCOMMIT
statement to aPREPARE
statement. All the nodes that get the transaction via the replication protocol (the cohort nodes) send their vote for approving or declining the transaction to the backend process on the initiating node. This ensures that all the cohort can accept the transaction, and no write conflicts occur. For details onPREPARE
transactions support in PostgreSQL, see the PREPARE TRANSACTION topic.PRECOMMIT
phase. If all the cohort nodes approve the transaction, the backend process sends aPRECOMMIT
message to all the cohort nodes to express an intention to commit the transaction. The cohort nodes respond to the backend with thePRECOMMITTED
message. In case of a failure, all the nodes can use this information to complete the transaction using a quorum-based voting procedure.COMMIT
phase. IfPRECOMMIT
is successful, the transaction is committed to all nodes.
If a node crashes or gets disconnected from the cluster between the PREPARE
and COMMIT
phases, the PRECOMMIT
phase ensures that the survived nodes have enough information to complete the prepared transaction. The PRECOMMITTED
messages help avoid the situation when the crashed node has already committed or aborted the transaction, but has not notified other nodes about the transaction status. In a two-phase commit (2PC), such a transaction would block resources (hold locks) until the recovery of the crashed node. Otherwise, data inconsistencies can appear in the database when the failed node is recovered, for example, if the failed node committed the transaction, but the survived node aborted it.
To complete the transaction, the backend must receive a response from the majority of the nodes. For example, for a cluster of 2N
+1 nodes, at least N
+1 responses are required. Thus, multimaster
ensures that your cluster is available for reads and writes while the majority of the nodes are connected, and no data inconsistencies occur in case of a node or connection failure.
F.35.2.2. Failure Detection and Recovery #
Since multimaster
allows writes to each node, it has to wait for responses about transaction acknowledgment from all the other nodes. Without special actions in case of a node failure, each commit would have to wait until the failed node recovery. To deal with such situations, multimaster
periodically sends heartbeats to check the node state and the connectivity between nodes. When several heartbeats to the node are lost in a row, this node is kicked out of the cluster to allow writes to the remaining alive nodes. You can configure the heartbeat frequency and the response timeout in the multimaster.heartbeat_send_timeout
and multimaster.heartbeat_recv_timeout
parameters, respectively.
For example, suppose a five-node multi-master cluster experienced a network failure that split the network into two isolated subnets, with two and three cluster nodes. Based on heartbeats propagation information, multimaster
will continue accepting writes at each node in the bigger partition, and deny all writes in the smaller one. Thus, a cluster consisting of 2N
+1 nodes can tolerate N
node failures and stay alive if any N
+1 nodes are alive and connected to each other. You can also set up a two nodes cluster plus a lightweight referee node that does not hold the data, but acts as a tie-breaker during symmetric node partitioning. For details, see Section F.35.3.3.
In case of a partial network split when different nodes have different connectivity, multimaster
finds a fully connected subset of nodes and disconnects nodes outside of this subset. For example, in a three-node cluster, if node A can access both B and C, but node B cannot access node C, multimaster
isolates node C to ensure that both A and B can work.
To preserve order of transactions on different nodes and thus data integrity, the decision to exclude or add back node(s) must be taken coherently. Generations which represent a subset of currently supposedly live nodes serve this purpose. Technically, generation is a pair <n, members>
where n
is unique number and members
is subset of configured nodes. A node always lives in some generation and switches to the one with higher number as soon as it learns about its existence; generation numbers act as logical clocks/terms/epochs here. Each transaction is stamped during commit with current generation of the node it is being executed on. The transaction can be proposed to be committed only after it has been PREPAREd on all its generation members. This allows to design the recovery protocol so that order of conflicting committed transactions is the same on all nodes. Node resides in generation in one of three states (can be shown with mtm.status()
):
ONLINE
: node is member of the generation and making transactions normally;RECOVERY
: node is member of the generation, but it must apply in recovery mode transactions from previous generations to becomeONLINE
;DEAD
: node will never beONLINE
in this generation;
For alive nodes, there is no way to distinguish between a failed node that stopped serving requests and a network-partitioned node that can be accessed by database users, but is unreachable for other nodes. If during commit of writing transaction some of current generation members are disconnected, transaction is rolled back according to generation rules. To avoid futile work, connectivity is also checked during transaction start; if you try to access an isolated node, multimaster
returns an error message indicating the current status of the node. Thus, to prevent stale reads read-only queries are also forbidden. If you would like to continue using a disconnected node outside of the cluster in the standalone mode, you have to uninstall the multimaster
extension on this node, as explained in Section F.35.4.5.
Each node maintains a data structure that keeps the information about the state of all nodes in relation to this node. You can get this data by calling the mtm.status()
and the mtm.nodes()
functions.
When a failed node connects back to the cluster, multimaster
starts automatic recovery:
The reconnected node selects a cluster node, which is
ONLINE
in the highest generation, referred to as the donor node, and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL).When the node is caught up, it ballots for including itself in the next generation. Once generation is elected, commit of new transactions will start waiting for apply on the joining node.
When the rest of transactions till the switch to the new generation is applied, the reconnected node is promoted to the
online
state and included into the replication scheme.
The correctness of recovery protocol was verified with TLA+ model checker. You can find the model (and more detailed description) at doc/specs directory of the source code.
Automatic recovery requires presence of all WAL files generated after node failure. If a node is down for a long time and storing more WALs is unacceptable, you may have to exclude this node from the cluster and manually restore it from one of the working nodes using pg_basebackup. For details, see Section F.35.4.3.
F.35.2.3. Multimaster Background Workers #
- mtm-monitor #
Starts all other workers for a database managed with multimaster. This is the first worker loaded during multimaster boot. Each multimaster node has a single
mtm-monitor
worker. When a new node is added,mtm-monitor
startsmtm-logrep-receiver
andmtm-dmq-receiver
workers to enable replication to this node. If a node is dropped,mtm-monitor
stopsmtm-logrep-receiver
andmtm-dmq-receiver
workers that have been serving the dropped node. Eachmtm-monitor
controls workers on its own node only.- mtm-logrep-receiver #
Receives logical replication stream from a given peer node. During normal operation,
mtm-logrep-receiver
sends replicated transactions to the pool of dynamic workers (see mtm-logrep-receiver-dynworker). During catchup, depending on the value of the multimaster.catchup_algorithm configuration parameter,mtm-logrep-receiver
applies replicated transactions on the reconnected node or sends them to the pool of dynamic workers. The number ofmtm-logrep-receiver
workers on each node corresponds to the number of peer nodes available.- mtm-dmq-receiver #
Receives acknowledgment for transactions sent to peers and checks for heartbeat timeouts. The number of
mtm-logrep-receiver
workers on each node corresponds to the number of peer nodes available.- mtm-dmq-sender #
Collects acknowledgment for transactions applied on the current node and sends them to the corresponding mtm-dmq-receiver on the peer node. There is a single worker per Postgres Pro Enterprise instance.
- mtm-logrep-receiver-dynworker #
Dynamic pool worker for a given mtm-logrep-receiver. Applies replicated transactions received during normal operation or catchup. You can use the multimaster.max_workers configuration parameter to specify the maximum number of dynamic workers.
- mtm-resolver #
Performs Paxos to resolve unfinished transactions. This worker is only active during recovery or when connection with other nodes was lost. There is a single worker per Postgres Pro Enterprise instance.
- mtm-campaigner #
Ballots for new generations to exclude some node(s) or add myself. There is a single worker per Postgres Pro Enterprise instance.
- mtm-replier #
Responds to requests of mtm-campaigner and mtm-resolver.
F.35.3. Installation and Setup #
To use multimaster
, you need to install Postgres Pro or Postgres Pro Enterprise on all nodes of your cluster. Postgres Pro includes all the required dependencies and extensions. For PostgreSQL follow build and install instructions at readme.md.
F.35.3.1. Setting up a Multi-Master Cluster #
Suppose you are setting up a cluster of three nodes, with node1
, node2
, and node3
host names. After installing Postgres Pro Enterprise on all nodes, you need to initialize data directory on each node, as explained in Section 18.2. If you would like to set up a multi-master cluster for an already existing mydb
database, you can load data from mydb
to one of the nodes once the cluster is initialized, or you can load data to all new nodes before cluster initialization using any convenient mechanism, such as pg_basebackup or pg_dump.
Once the data directory is set up, complete the following steps on each cluster node:
Modify the
postgresql.conf
configuration file, as follows:Add
multimaster
to theshared_preload_libraries
variable:shared_preload_libraries = 'multimaster'
Tip
If the
shared_preload_libraries
variable is already defined inpostgresql.auto.conf
, you will need to modify its value using the ALTER SYSTEM command. For details, see Section 19.1.2. Note that in a multi-master cluster, theALTER SYSTEM
command only affects the configuration of the node from which it was run.Set up Postgres Pro Enterprise parameters related to replication:
wal_level = logical max_connections = 100 max_prepared_transactions = 300 # max_connections * N max_wal_senders = 10 # at least N max_replication_slots = 10 # at least 2N wal_sender_timeout = 0
where
N
is the number of nodes in your cluster.You must change the replication level to
logical
asmultimaster
relies on logical replication. For a cluster ofN
nodes, enable at leastN
WAL sender processes and replication slots. Sincemultimaster
implicitly adds aPREPARE
phase to eachCOMMIT
transaction, make sure to set the number of prepared transactions toN
*max_connections
.wal_sender_timeout
should be disabled as multimaster uses its custom logic for failure detection.Make sure you have enough background workers allocated for each node:
max_worker_processes = 250 # (N - 1) * (multimaster.max_workers + 1) + 5
For example, for a three-node cluster with
multimaster.max_workers
= 100,multimaster
may need up to 207 background workers at peak times: five always-on workers (monitor, resolver, dmq-sender, campaigner, replier), one walreceiver per each peer node and up to 200 replication dynamic workers. When setting this parameter, remember that other modules may also use background workers at the same time.Depending on your network environment and usage patterns, you may want to tune other
multimaster
parameters. For details, see Section F.35.3.2.
Start Postgres Pro Enterprise on all nodes.
Create database
mydb
and usermtmuser
on each node:CREATE USER mtmuser WITH SUPERUSER PASSWORD 'mtmuserpassword'; CREATE DATABASE mydb OWNER mtmuser;
If you are using password-based authentication, you may want to create a password file.
You can omit this step if you already have a database you are going to replicate, but you are recommended to create a separate superuser for multi-master replication. The examples below assume that you are going to replicate the
mydb
database on behalf ofmtmuser
.Allow replication of the
mydb
database to each cluster node on behalf ofmtmuser
, as explained in Section 20.1. Make sure to use the authentication method that satisfies your security requirements. For example,pg_hba.conf
might have the following lines onnode1
:host replication mtmuser node2 md5 host mydb mtmuser node2 md5 host replication mtmuser node3 md5 host mydb mtmuser node3 md5
Connect to any node on behalf of the
mtmuser
database user, create themultimaster
extension in themydb
database and runmtm.init_cluster()
, specifying the connection string to the current node as the first argument and an array of connection strings to the other nodes as the second argument.For example, if you would like to connect to
node1
, run:CREATE EXTENSION multimaster; SELECT mtm.init_cluster('dbname=mydb user=mtmuser host=node1', '{"dbname=mydb user=mtmuser host=node2", "dbname=mydb user=mtmuser host=node3"}');
To ensure that
multimaster
is enabled, you can run themtm.status()
andmtm.nodes()
functions:SELECT * FROM mtm.status(); SELECT * FROM mtm.nodes();
If
status
is equal toonline
and all nodes are present in themtm.nodes
output, your cluster is successfully configured and ready to use.
Tip
If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:
SELECT mtm.make_table_local('table_name')
F.35.3.2. Tuning Configuration Parameters #
While you can use multimaster
in the default configuration, you may want to tune several parameters for faster failure detection or more reliable automatic recovery.
F.35.3.2.1. Setting Timeout for Failure Detection #
To check availability of the peer nodes, multimaster
periodically sends heartbeat packets to all nodes. You can define the timeout for failure detection with the following variables:
The
multimaster.heartbeat_send_timeout
variable defines the time interval between the heartbeats. By default, this variable is set to 200ms.The
multimaster.heartbeat_recv_timeout
variable sets the timeout for the response. If no heartbeats are received during this time, the node is assumed to be disconnected and is excluded from the cluster. By default, this variable is set to 2000ms.
It's a good idea to set multimaster.heartbeat_send_timeout
based on typical ping latencies between the nodes. Small recv/send ratio decreases the time of failure detection, but increases the probability of false-positive failure detection. When setting this parameter, take into account the typical packet loss ratio between your cluster nodes.
F.35.3.3. 2+1 Mode: Setting up a Standalone Referee Node #
By default, multimaster
uses a majority-based algorithm to determine whether the cluster nodes have a quorum: a cluster can only continue working if the majority of its nodes are alive and can access each other. Majority-based approach is pointless for two nodes cluster: if one of them fails, another one becomes inaccessible. There is a special 2+1 or referee mode which trades less hardware resources by decreasing availability: two nodes hold full copy of data, and separate referee node participates only in voting, acting as a tie-breaker.
If one node goes down, another one requests referee grant (elects referee-approved generation with single node). Once the grant is received, it continues to work normally. If offline node gets up, it recovers and elects full generation containing both nodes, essentially removing the grant - this allows the node to get it in its turn later. While the grant is issued, it can't be given to another node until full generation is elected and excluded node recovers. This ensures data loss doesn't happen by the price of availability: in this setup two nodes (one normal and one referee) can be alive but cluster might be still unavailable if the referee winner is down, which is impossible with classic three nodes configuration.
The referee node does not store any cluster data, so it is not resource-intensive and can be configured on virtually any system with Postgres Pro Enterprise installed.
To avoid split-brain problems, you must have only a single referee in your cluster.
To set up a referee for your cluster:
Install Postgres Pro Enterprise on the node you are going to make a referee and create the
referee
extension:CREATE EXTENSION referee;
Make sure the
pg_hba.conf
file allows access to the referee node.Set up the nodes that will hold cluster data following the instructions in Section F.35.3.1.
On all data nodes, specify the referee connection string in the
postgresql.conf
file:multimaster.referee_connstring =
connstring
where
connstring
holds libpq options required to access the referee.
The first subset of nodes that gets connected to the referee wins the voting and starts working. The other nodes have to go through the recovery process to catch up with them and join the cluster. Under heavy load, the recovery can take unpredictably long, so it is recommended to wait for all data nodes going online before switching on the load when setting up a new cluster. Once all the nodes get online, the referee discards the voting result, and all data nodes start operating together.
In case of any failure, the voting mechanism is triggered again. At this time, all nodes appear to be offline for a short period of time to allow the referee to choose a new winner, so you can see the following error message when trying to access the cluster: [multimaster] node is not online: current status is "disabled"
.
F.35.4. Multi-Master Cluster Administration #
F.35.4.1. Monitoring Cluster Status #
multimaster
provides several functions to check the current cluster state.
To check node-specific information, use mtm.status()
:
SELECT * FROM mtm.status();
To get the list of all nodes in the cluster together with their status, use mtm.nodes()
:
SELECT * FROM mtm.nodes();
For details on all the returned information, see Section F.35.5.2.
F.35.4.2. Accessing Disabled Nodes #
If a cluster node is disabled, any attempt to read or write data on this node raises an error by default. If you need to access the data on a disabled node, you can override this behavior at connection time by setting the application_name parameter to mtm_admin
. In this case, you can run read and write queries on this node without multimaster supervision.
F.35.4.3. Adding New Nodes to the Cluster #
With the multimaster
extension, you can add or drop cluster nodes. Before adding node, stop the load and ensure (with mtm.status()
) that all nodes are online
. When adding a new node, you need to load all the data to this node using pg_basebackup from any cluster node, and then start this node.
Suppose we have a working cluster of three nodes, with node1
, node2
, and node3
host names. To add node4
, follow these steps:
Figure out the required connection string to access the new node. For example, for the database
mydb
, usermtmuser
, and the new nodenode4
, the connection string can be"dbname=mydb user=mtmuser host=node4"
.In
psql
connected to any alive node, run:SELECT mtm.add_node('dbname=mydb user=mtmuser host=node4');
This command changes the cluster configuration on all nodes and creates replication slots for the new node. It also returns
node_id
of the new node, which will be required to complete the setup.Go to the new node and clone all the data from one of the alive nodes to this node:
pg_basebackup -D
datadir
-h node1 -U mtmuser -c fast -vpg_basebackup copies the entire data directory from
node1
, together with configuration settings, and prints the last LSN replayed from WAL, such as'0/12D357F0'
. This value will be required to complete the setup.Configure the new node to boot with
recovery_target=immediate
to prevent redo past the point where replication will begin. Add topostgresql.conf
:restore_command = 'false' recovery_target = 'immediate' recovery_target_action = 'promote'
And create
recovery.signal
file in the data directory.Start Postgres Pro Enterprise on the new node.
In
psql
connected to the node used to take the base backup, run:SELECT mtm.join_node(4, '0/12D357F0');
where
4
is thenode_id
returned by themtm.add_node()
function call and'0/12D357F0'
is the LSN value returned by pg_basebackup.
F.35.4.4. Removing Nodes from the Cluster #
Before removing node, stop the load and ensure (with mtm.status()
) that all nodes (except the ones to be dropped) are online
. Shut down the nodes you are going to remove. To remove the node from the cluster:
Run the
mtm.nodes()
function to learn the ID of the node to be removed:SELECT * FROM mtm.nodes();
Run the
mtm.drop_node()
function with this node ID as a parameter:SELECT mtm.drop_node(3);
This will delete replication slots for node 3 on all cluster nodes and stop replication to this node.
If you would like to return the node to the cluster later, you will have to add it as a new node, as explained in Section F.35.4.3.
F.35.4.5. Uninstalling the multimaster Extension #
If you would like to continue using the node that has been removed from the cluster in the standalone mode, you have to drop the multimaster
extension on this node and clean up all multimaster
-related subscriptions and uncommitted transactions to ensure that the node is no longer associated with the cluster.
Remove
multimaster
from shared_preload_libraries and restart Postgres Pro Enterprise.Delete the
multimaster
extension and publication:DROP EXTENSION multimaster; DROP PUBLICATION multimaster;
Review the list of existing subscriptions using the
\dRs
command and delete each subscription that starts with themtm_sub_
prefix:\dRs DROP SUBSCRIPTION mtm_sub_
subscription_name
;Review the list of existing replication slots and delete each slot that starts with the
mtm_
prefix:SELECT * FROM pg_replication_slots; SELECT pg_drop_replication_slot('mtm_
slot_name
');Review the list of existing replication origins and delete each origin that starts with the
mtm_
prefix:SELECT * FROM pg_replication_origin; SELECT pg_replication_origin_drop('mtm_
origin_name
');Review the list of prepared transaction left, if any:
SELECT * FROM pg_prepared_xacts;
You have to commit or abort these transactions by running
ABORT PREPARED
ortransaction_id
COMMIT PREPARED
, respectively.transaction_id
Once all these steps are complete, you can start using the node in the standalone mode, if required.
F.35.4.6. Checking Data Consistency Across Cluster Nodes #
You can check that the data is the same on all cluster nodes using the mtm.check_query(query_text) function.
As a parameter, this function takes the text of a query you would like to run for data comparison. When you call this function, it takes a consistent snapshot of data on each cluster node and runs this query against the captured snapshots. The query results are compared between pairs of nodes. If there are no differences, this function returns true
. Otherwise, it reports the first detected difference in a warning and returns false
.
To avoid false-positive results, always use the ORDER BY
clause in your test query. For example, suppose you would like to check that the data in a my_table
is the same on all cluster nodes. Compare the results of the following queries:
postgres=# SELECT mtm.check_query('SELECT * FROM my_table ORDER BY id'); check_query ------------- t (1 row)
postgres=# SELECT mtm.check_query('SELECT * FROM my_table'); WARNING: mismatch in column 'b' of row 0: 256 on node0, 255 on node1 check_query ------------- f (1 row)
Even though the data is the same, the second query reports an issue because the order of the returned data differs between cluster nodes.
F.35.4.7. Delayed Transaction Commits #
When a lagging node is catching up to the donor node and cannot apply changes as quickly, you can slow down transaction execution on the donor node using the multimaster.tx_delay_on_slow_catchup
configuration parameter. To do this, set this parameter to on
in the postgresql.conf
configuration file on the donor node, but not on the lagging peer node. If you edit the file on a running server, you will need to signal the postmaster to make it re-read the file (see Chapter 19 for details). If necessary, you can also specify the maximum possible delay for transaction execution in the optional multimaster.max_tx_delay_on_slow_catchup
parameter (in milliseconds). A value of 0
means that no maximum delay is set. Currently, delays can range from 1 ms to approximately 4 seconds. Values outside the allowed range will be truncated towards the nearest valid value.
By default, this feature is disabled.
F.35.5. Reference #
F.35.5.1. Configuration Parameters #
multimaster.heartbeat_recv_timeout
#Timeout, in milliseconds. If no heartbeat message is received from the node within this timeframe, the node is excluded from the cluster.
Default: 2000 ms
multimaster.heartbeat_send_timeout
#Time interval between heartbeat messages, in milliseconds. An arbiter process broadcasts heartbeat messages to all nodes to detect connection problems.
Default: 200 ms
multimaster.max_workers
#The maximum number of
walreceiver
workers per peer node.Important
This parameter should be used with caution. If the number of simultaneous transactions in the whole cluster is bigger than the provided value, it can lead to undetected deadlocks.
Default: 100
multimaster.monotonic_sequences
#Defines the sequence generation mode for unique identifiers. This variable can take the following values:
false
(default) — ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly.true
— the generated sequence increases monotonically cluster-wide. ID generation on each node is started with this node number and is incremented by the number of nodes, but the values are omitted if they are smaller than the already generated IDs on another node. For example, in a three-node cluster, if 1, 4 and 7 IDs are already allocated to the objects on the first node, 2 and 5 IDs will be omitted on the second node. In this case, the first ID on the second node is 8. Thus, the next generated ID is always higher than the previous one, regardless of the cluster node.
Default:
false
multimaster.referee_connstring
#Connection string to access the referee node. You must set this parameter on all cluster nodes if the referee is set up.
multimaster.remote_functions
#Provides a comma-separated list of function names that should be executed remotely on all multimaster nodes instead of replicating the result of their work.
multimaster.trans_spill_threshold
#The maximal size of transaction, in kB. When this threshold is reached, the transaction is written to the disk.
Default: 100MB
multimaster.break_connection
#Break connection with clients connected to the node if this node disconnects from the cluster. If this variable is set to
false
, the client stays connected to the node but receives an error that the node is disabled.Default:
false
multimaster.connect_timeout
#Maximum time to wait while connecting, in seconds. Zero, negative, or not specified means wait indefinitely. The minimum allowed timeout is 2 seconds, therefore a value of
1
is interpreted as2
.Default:
0
multimaster.ignore_tables_without_pk
#Do not replicate tables without primary key. When
false
, such tables are replicated.Default:
false
multimaster.syncpoint_interval
#Amount of WAL generated between synchronization points.
Default:
10 MB
multimaster.binary_basetypes
#Send data of built-in types in binary format.
Default:
true
multimaster.wait_peer_commits
#Wait until all peers commit the transaction before the command returns a success indication to the client.
Default:
true
multimaster.deadlock_prevention
#Manage prevention of transaction deadlocks that can occur when the same tuple is updated or deleted on different nodes simultaneously. If set to
off
, deadlock prevention is disabled.If set to
simple
, the conflicting transactions are rejected. This setting may be used in any setup.If set to
smart
, a specific algorithm is used to provide best resource availability by selectively committing or rejecting transactions. This is recommended for a setup of two nodes and a referee. In three node setups, deadlocks are still possible. If there are more than four nodes, all conflicting transactions are rejected, just as withsimple
.Default:
off
multimaster.tx_delay_on_slow_catchup
#Enable delay of transaction commits on the donor node when peer nodes are catching up to this node. This parameter should only be set on the donor node.
Default:
off
multimaster.max_tx_delay_on_slow_catchup
#If
multimaster.tx_delay_on_slow_catchup
is enabled, this parameter specifies maximum transaction execution delay, in milliseconds. Possible values are integers greater than0
, but it allows values only up to 4 seconds.Default: 0
multimaster.enable_async_3pc_on_catchup
#Enables asynchronous commit operations (
PREPARE
,PRECOMMIT
, andCOMMIT PREPARED
) on a lagging node syncing with the donor node. This significantly accelerates the catchup process. Data integrity is maintained using the synchronization point mechanism: during catchup, the system periodically creates synchronization points when all preceding transactions are safely written to disk. This way, even if the lagging node is restarted, no transaction data is lost.Default: true
multimaster.catchup_algorithm
#Catchup mode for reconnected nodes. It determines how replicated transactions are applied on the reconnected node that is catching up. This configuration parameter can take one of the following values:
sequential
—mtm-logrep-receiver
applies replicated transactions sequentially in the order they are received.parallel
—mtm-logrep-receiver
sends replicated transactions to the pool of dynamic workers (see mtm-logrep-receiver-dynworker). Dynamic workers apply non-conflicting transactions in parallel and conflicting transactions sequentially in the receiving order, similar to thesequential
catchup mode. You can use the multimaster.parallel_catchup_workers configuration parameter to specify the maximum number of dynamic workers that can apply replicated transactions in this catchup mode.
Default:
sequential
multimaster.parallel_catchup_workers
#The maximum number of dynamic workers that can apply replicated transactions on the reconnected node in the
parallel
catchup mode. The value of this configuration parameter cannot be greater than that of the multimaster.max_workers configuration parameter. You can use the multimaster.catchup_algorithm configuration parameter to specify the catchup mode for reconnected nodes.Default:
8
F.35.5.2. Functions #
-
mtm.init_cluster(
#my_conninfo
text
,peers_conninfo
text[]
) Initializes cluster configuration on all nodes. It connects the current node to all nodes listed in
peers_conninfo
and creates the multimaster extension, replications slots, and replication origins on each node. Run this function once all the nodes are running and can accept connections.Arguments:
my_conninfo
— connection string to the node on which you are running this function. Peer nodes use this string to connect back to this node.peers_conninfo
— an array of connection strings to all the other nodes to be added to the cluster.
-
mtm.add_node(
#connstr
text
) Adds a new node to the cluster. This function should be called before loading data to this node using pg_basebackup.
mtm.add_node
creates the required replication slots for a new node, so you can add a node while the cluster is under load.Arguments:
connstr
— connection string for the new node. For example, for the databasemydb
, usermtmuser
, and the new nodenode4
, the connection string is"dbname=mydb user=mtmuser host=node4"
.
-
mtm.join_node(
#node_id
int
,backup_end_lsn
pg_lsn
) Completes the cluster setup after adding a new node. This function should be called after the added node has been started.
Arguments:
node_id
— ID of the node to add to the cluster. It corresponds to the value in theid
column returned bymtm.nodes()
.backup_end_lsn
— the last LSN of the base backup copied to the new node. This LSN will be used as the starting point for data replication once the node joins the cluster.
-
mtm.drop_node(
#node_id
integer
) Excludes a node from the cluster.
If you would like to continue using this node outside of the cluster in the standalone mode, you have to uninstall the
multimaster
extension from this node, as explained in Section F.35.4.5.Arguments:
node_id
— ID of the node being dropped. It corresponds to the value in theid
column returned bymtm.nodes()
.
-
mtm.alter_sequences()
# Fixes unique identifiers on all cluster nodes. This may be required after restoring all nodes from a single base backup.
-
mtm.status()
# Shows the status of the
multimaster
extension on the current node. Returns a tuple of the following values:my_node_id
,int
— ID of this node.status
,text
— status of the node. Possible values are:online
,recovery
,catchup
,disabled
(need to recover, but not yet clear from whom),isolated
(online in current generation, but some members are disconnected).connected
,int[]
— array of peer IDs connected to this node.gen_num
,int8
— current generation number.gen_members
,int[]
— array of current generation members node IDs.gen_members_online
,int[]
— array of current generation members node IDs which areonline
in it.gen_configured
,int[]
— array of node IDs configured in current generation.
-
mtm.nodes()
# Shows the information on all nodes in the cluster. Returns a tuple of the following values:
id
,integer
— node ID.conninfo
,text
— connection string to this node.is_self
,boolean
— is it me?enabled
,boolean
— is this node online in current generation?connected
,boolean
— shows whether the node is connected to our node.sender_pid
,integer
— WAL sender process ID.receiver_pid
,integer
— WAL receiver process ID.n_workers
,text
— number of started dynamic apply workers from this node.receiver_mode
,text
— in which mode receiver from this node works. Possible values are:disabled
,recovery
,normal
.
-
mtm.make_table_local(
#relation
regclass
) Stops replication for the specified table.
Arguments:
relation
— the table you would like to exclude from the replication scheme.
mtm.check_query(
#query_text
text
)Checks data consistency across cluster nodes. This function takes a snapshot of the current state of each node, runs the specified query against these snapshots, and compares the results. If the results are different between any two nodes, displays a warning with the first found issue and returns
false
. Otherwise, returnstrue
.Arguments:
query_text
— the query you would like to run on all nodes for data comparison. To avoid false-positive results, always use theORDER BY
clause in the test query.
mtm.get_snapshots()
#Takes a snapshot of data on each cluster node and returns the snapshot ID. The snapshots remain available until the
mtm.free_snapshots()
is called, or the current session is terminated. This function is used by the mtm.check_query(query_text), there is no need to call it manually.mtm.free_snapshots()
#Removes data snapshots taken by the
mtm.get_snapshots()
function. This function is used by the mtm.check_query(query_text), there is no need to call it manually.
F.35.6. Compatibility #
F.35.6.1. Local and Global DDL Statements #
By default, any DDL statement is executed on all cluster nodes, except the following statements that can only act locally on a given node:
ALTER SYSTEM
CREATE DATABASE
DROP DATABASE
REINDEX
CHECKPOINT
CLUSTER
LOAD
LISTEN
CHECKPOINT
NOTIFY
F.35.7. Authors #
Postgres Professional, Moscow, Russia.
F.35.7.1. Credits #
The replication mechanism is based on logical decoding and an earlier version of the pglogical
extension provided for community by the 2ndQuadrant team.
The Paxos consensus algorithm is described at:
Leslie Lamport. The Part-Time Parliament
Parallel replication and recovery mechanism is similar to the one described in:
Odorico M. Mendizabal, et al. Checkpointing in Parallel State-Machine Replication.