F.35. multimaster — synchronous cluster to provide OLTP scalability and high availability #

multimaster is a Postgres Pro Enterprise extension with a set of patches that turns Postgres Pro Enterprise into a synchronous shared-nothing cluster to provide Online Transaction Processing (OLTP) scalability for read transactions and high availability with automatic disaster recovery.

As compared to a standard PostgreSQL primary-standby cluster, a cluster configured with the multimaster extension offers the following benefits:

  • Fault tolerance and automatic node recovery

  • Synchronous logical replication and DDL replication

  • Read scalability

  • Working with temporary tables on each cluster node

  • Upgrading minor releases of Postgres Pro Enterprise seamlessly for multimaster cluster clients.

Important

Before deploying multimaster on production systems, make sure to take its replication restrictions into account. For details, see Section F.35.1.

The multimaster extension replicates your database to all nodes of the cluster and allows write transactions on each node. Write transactions are synchronously replicated to all nodes, which increases commit latency. Read-only transactions and queries are executed locally, without any measurable overhead.

To ensure high availability and fault tolerance of the cluster, multimaster determines each transaction outcome through Paxos consensus algorithm, uses custom recovery protocol and heartbeats for failure discovery. A multi-master cluster of N nodes can continue working while the majority of the nodes are alive and reachable by other nodes. To be configured with multimaster, the cluster must include at least three nodes. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes. There is also a special 2+1 (referee) mode in which 2 nodes hold data and an additional one called referee only participates in voting. Compared to traditional three nodes setup, this is cheaper (referee resources demands are low) but availability is decreased. For details, see Section F.35.3.3.

When a failed node is reconnected to the cluster, multimaster automatically fast-forwards the node to the actual state based on the Write-Ahead Log (WAL) data in the corresponding replication slot. If a node was excluded from the cluster, you can add it back using pg_basebackup.

To learn more about the multimaster internals, see Section F.35.2.

F.35.1. Limitations #

The multimaster extension takes care of the database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:

  • Microsoft Windows operating system is not supported.

  • 1C solutions are not supported.

  • multimaster can replicate only one database in a cluster. If it is required to replicate the contents of several databases, you can either transfer all data into different schemas within a single database or create a separate cluster for each database and set up multimaster for each cluster.

  • Large objects are not supported. Although creating large objects is allowed, multimaster cannot replicate such objects, and their OIDs may conflict on different nodes, so their use is not recommended.

  • Since multimaster is based on logical replication and Paxos over three-phase commit protocol, its operation is highly affected by network latency. It is not recommended to set up a multimaster cluster with geographically distributed nodes.

  • Using tables without primary keys can have negative impact on performance. In some cases, it can even lead to inability to restore a cluster node, so you should avoid replicating such tables with multimaster.

  • Unlike in vanilla PostgreSQL, read committed isolation level can cause serialization failures on a multi-master cluster (with an SQLSTATE code '40001') if there are conflicting transactions from different nodes, so the application must be ready to retry transactions. Serializable isolation level works only with respect to local transactions on the current node.

  • Sequence generation. To avoid conflicts between unique identifiers on different nodes, multimaster modifies the default behavior of sequence generators. By default, ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly. Thus, the generated sequence values are not monotonic. If it is critical to get a monotonically increasing sequence cluster-wide, you can set the multimaster.monotonic_sequences to true.

  • Commit latency. In the current implementation of logical replication, multimaster sends data to subscriber nodes only after the local commit, so you have to wait for transaction processing twice: first on the local node, and then on all the other nodes simultaneously. In the case of a heavy-write transaction, this may result in a noticeable delay.

  • Logical replication does not guarantee that a system object OID is the same on all cluster nodes, so OIDs for the same object may differ between multimaster cluster nodes. If your driver or application relies on OIDs, make sure that their use is restricted to connections to one and the same node to avoid errors. For example, the Npgsql driver may not work correctly with multimaster if the NpgsqlConnection.GlobalTypeMapper method tries using OIDs in connections to different cluster nodes.

  • A multimaster cluster node cannot function as a logical replication subscriber. While a multimaster node can function as a publisher, the subscription cannot automatically switch to another node if the node fails.

  • Replicated non-conflicting transactions are applied on the receiving nodes in parallel, so such transactions may become visible on different nodes in different order.

  • If multimaster is working under heavy load, and one of the nodes stops even for a short time, this node can fall far behind other nodes as they take the load in multiple threads, whereas the lagging node catches up in a single thread. In this case, you may need to unload other nodes in order to synchronize all the nodes.

  • CREATE INDEX CONCURRENTLY, REINDEX CONCURRENTLY, CREATE TABLESPACE, and DROP TABLESPACE are not supported.

  • COMMIT AND CHAIN feature is not supported.

F.35.2. Architecture #

F.35.2.1. Replication #

Since each server in a multi-master cluster can accept writes, any server can abort a transaction because of a concurrent update — in the same way as it happens on a single server between different backends. To ensure high availability and data consistency on all cluster nodes, multimaster uses logical replication and the three-phase commit protocol with transaction outcome determined by Paxos consensus algorithm.

When Postgres Pro Enterprise loads the multimaster shared library, multimaster sets up a logical replication producer and consumer for each node, and hooks into the transaction commit pipeline. The typical data replication workflow consists of the following phases:

  1. PREPARE phase. multimaster captures and implicitly transforms each COMMIT statement to a PREPARE statement. All the nodes that get the transaction via the replication protocol (the cohort nodes) send their vote for approving or declining the transaction to the backend process on the initiating node. This ensures that all the cohort can accept the transaction, and no write conflicts occur. For details on PREPARE transactions support in PostgreSQL, see the PREPARE TRANSACTION topic.

  2. PRECOMMIT phase. If all the cohort nodes approve the transaction, the backend process sends a PRECOMMIT message to all the cohort nodes to express an intention to commit the transaction. The cohort nodes respond to the backend with the PRECOMMITTED message. In case of a failure, all the nodes can use this information to complete the transaction using a quorum-based voting procedure.

  3. COMMIT phase. If PRECOMMIT is successful, the transaction is committed to all nodes.

If a node crashes or gets disconnected from the cluster between the PREPARE and COMMIT phases, the PRECOMMIT phase ensures that the survived nodes have enough information to complete the prepared transaction. The PRECOMMITTED messages help avoid the situation when the crashed node has already committed or aborted the transaction, but has not notified other nodes about the transaction status. In a two-phase commit (2PC), such a transaction would block resources (hold locks) until the recovery of the crashed node. Otherwise, data inconsistencies can appear in the database when the failed node is recovered, for example, if the failed node committed the transaction, but the survived node aborted it.

To complete the transaction, the backend must receive a response from the majority of the nodes. For example, for a cluster of 2N+1 nodes, at least N+1 responses are required. Thus, multimaster ensures that your cluster is available for reads and writes while the majority of the nodes are connected, and no data inconsistencies occur in case of a node or connection failure.

F.35.2.2. Failure Detection and Recovery #

Since multimaster allows writes to each node, it has to wait for responses about transaction acknowledgment from all the other nodes. Without special actions in case of a node failure, each commit would have to wait until the failed node recovery. To deal with such situations, multimaster periodically sends heartbeats to check the node state and the connectivity between nodes. When several heartbeats to the node are lost in a row, this node is kicked out of the cluster to allow writes to the remaining alive nodes. You can configure the heartbeat frequency and the response timeout in the multimaster.heartbeat_send_timeout and multimaster.heartbeat_recv_timeout parameters, respectively.

For example, suppose a five-node multi-master cluster experienced a network failure that split the network into two isolated subnets, with two and three cluster nodes. Based on heartbeats propagation information, multimaster will continue accepting writes at each node in the bigger partition, and deny all writes in the smaller one. Thus, a cluster consisting of 2N+1 nodes can tolerate N node failures and stay alive if any N+1 nodes are alive and connected to each other. You can also set up a two nodes cluster plus a lightweight referee node that does not hold the data, but acts as a tie-breaker during symmetric node partitioning. For details, see Section F.35.3.3.

In case of a partial network split when different nodes have different connectivity, multimaster finds a fully connected subset of nodes and disconnects nodes outside of this subset. For example, in a three-node cluster, if node A can access both B and C, but node B cannot access node C, multimaster isolates node C to ensure that both A and B can work.

To preserve order of transactions on different nodes and thus data integrity, the decision to exclude or add back node(s) must be taken coherently. Generations which represent a subset of currently supposedly live nodes serve this purpose. Technically, generation is a pair <n, members> where n is unique number and members is subset of configured nodes. A node always lives in some generation and switches to the one with higher number as soon as it learns about its existence; generation numbers act as logical clocks/terms/epochs here. Each transaction is stamped during commit with current generation of the node it is being executed on. The transaction can be proposed to be committed only after it has been PREPAREd on all its generation members. This allows to design the recovery protocol so that order of conflicting committed transactions is the same on all nodes. Node resides in generation in one of three states (can be shown with mtm.status()):

  1. ONLINE: node is member of the generation and making transactions normally;

  2. RECOVERY: node is member of the generation, but it must apply in recovery mode transactions from previous generations to become ONLINE;

  3. DEAD: node will never be ONLINE in this generation;

For alive nodes, there is no way to distinguish between a failed node that stopped serving requests and a network-partitioned node that can be accessed by database users, but is unreachable for other nodes. If during commit of writing transaction some of current generation members are disconnected, transaction is rolled back according to generation rules. To avoid futile work, connectivity is also checked during transaction start; if you try to access an isolated node, multimaster returns an error message indicating the current status of the node. Thus, to prevent stale reads read-only queries are also forbidden. If you would like to continue using a disconnected node outside of the cluster in the standalone mode, you have to uninstall the multimaster extension on this node, as explained in Section F.35.4.5.

Each node maintains a data structure that keeps the information about the state of all nodes in relation to this node. You can get this data by calling the mtm.status() and the mtm.nodes() functions.

When a failed node connects back to the cluster, multimaster starts automatic recovery:

  1. The reconnected node selects a cluster node, which is ONLINE in the highest generation, referred to as the donor node, and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL).

  2. When the node is caught up, it ballots for including itself in the next generation. Once generation is elected, commit of new transactions will start waiting for apply on the joining node.

  3. When the rest of transactions till the switch to the new generation is applied, the reconnected node is promoted to the online state and included into the replication scheme.

The correctness of recovery protocol was verified with TLA+ model checker. You can find the model (and more detailed description) at doc/specs directory of the source code.

Automatic recovery requires presence of all WAL files generated after node failure. If a node is down for a long time and storing more WALs is unacceptable, you may have to exclude this node from the cluster and manually restore it from one of the working nodes using pg_basebackup. For details, see Section F.35.4.3.

F.35.2.3. Multimaster Background Workers #

mtm-monitor #

Starts all other workers for a database managed with multimaster. This is the first worker loaded during multimaster boot. Each multimaster node has a single mtm-monitor worker. When a new node is added, mtm-monitor starts mtm-logrep-receiver and mtm-dmq-receiver workers to enable replication to this node. If a node is dropped, mtm-monitor stops mtm-logrep-receiver and mtm-dmq-receiver workers that have been serving the dropped node. Each mtm-monitor controls workers on its own node only.

mtm-logrep-receiver #

Receives logical replication stream from a given peer node. During normal operation, mtm-logrep-receiver sends replicated transactions to the pool of dynamic workers (see mtm-logrep-receiver-dynworker). During catchup, depending on the value of the multimaster.catchup_algorithm configuration parameter, mtm-logrep-receiver applies replicated transactions on the reconnected node or sends them to the pool of dynamic workers. The number of mtm-logrep-receiver workers on each node corresponds to the number of peer nodes available.

mtm-dmq-receiver #

Receives acknowledgment for transactions sent to peers and checks for heartbeat timeouts. The number of mtm-logrep-receiver workers on each node corresponds to the number of peer nodes available.

mtm-dmq-sender #

Collects acknowledgment for transactions applied on the current node and sends them to the corresponding mtm-dmq-receiver on the peer node. There is a single worker per Postgres Pro Enterprise instance.

mtm-logrep-receiver-dynworker #

Dynamic pool worker for a given mtm-logrep-receiver. Applies replicated transactions received during normal operation or catchup. You can use the multimaster.max_workers configuration parameter to specify the maximum number of dynamic workers.

mtm-resolver #

Performs Paxos to resolve unfinished transactions. This worker is only active during recovery or when connection with other nodes was lost. There is a single worker per Postgres Pro Enterprise instance.

mtm-campaigner #

Ballots for new generations to exclude some node(s) or add myself. There is a single worker per Postgres Pro Enterprise instance.

mtm-replier #

Responds to requests of mtm-campaigner and mtm-resolver.

F.35.3. Installation and Setup #

To use multimaster, you need to install Postgres Pro or Postgres Pro Enterprise on all nodes of your cluster. Postgres Pro includes all the required dependencies and extensions. For PostgreSQL follow build and install instructions at readme.md.

F.35.3.1. Setting up a Multi-Master Cluster #

Suppose you are setting up a cluster of three nodes, with node1, node2, and node3 host names. After installing Postgres Pro Enterprise on all nodes, you need to initialize data directory on each node, as explained in Section 18.2. If you would like to set up a multi-master cluster for an already existing mydb database, you can load data from mydb to one of the nodes once the cluster is initialized, or you can load data to all new nodes before cluster initialization using any convenient mechanism, such as pg_basebackup or pg_dump.

Once the data directory is set up, complete the following steps on each cluster node:

  1. Modify the postgresql.conf configuration file, as follows:

    • Add multimaster to the shared_preload_libraries variable:

      shared_preload_libraries = 'multimaster'
      

      Tip

      If the shared_preload_libraries variable is already defined in postgresql.auto.conf, you will need to modify its value using the ALTER SYSTEM command. For details, see Section 19.1.2. Note that in a multi-master cluster, the ALTER SYSTEM command only affects the configuration of the node from which it was run.

    • Set up Postgres Pro Enterprise parameters related to replication:

      wal_level = logical
      max_connections = 100
      max_prepared_transactions = 300 # max_connections * N
      max_wal_senders = 10            # at least N
      max_replication_slots = 10      # at least 2N
      wal_sender_timeout = 0
      

      where N is the number of nodes in your cluster.

      You must change the replication level to logical as multimaster relies on logical replication. For a cluster of N nodes, enable at least N WAL sender processes and replication slots. Since multimaster implicitly adds a PREPARE phase to each COMMIT transaction, make sure to set the number of prepared transactions to N * max_connections. wal_sender_timeout should be disabled as multimaster uses its custom logic for failure detection.

    • Make sure you have enough background workers allocated for each node:

      max_worker_processes = 250 # (N - 1) * (multimaster.max_workers + 1) + 5
      

      For example, for a three-node cluster with multimaster.max_workers = 100, multimaster may need up to 207 background workers at peak times: five always-on workers (monitor, resolver, dmq-sender, campaigner, replier), one walreceiver per each peer node and up to 200 replication dynamic workers. When setting this parameter, remember that other modules may also use background workers at the same time.

    • Depending on your network environment and usage patterns, you may want to tune other multimaster parameters. For details, see Section F.35.3.2.

  2. Start Postgres Pro Enterprise on all nodes.

  3. Create database mydb and user mtmuser on each node:

    CREATE USER mtmuser WITH SUPERUSER PASSWORD 'mtmuserpassword';
    CREATE DATABASE mydb OWNER mtmuser;
    

    If you are using password-based authentication, you may want to create a password file.

    You can omit this step if you already have a database you are going to replicate, but you are recommended to create a separate superuser for multi-master replication. The examples below assume that you are going to replicate the mydb database on behalf of mtmuser.

  4. Allow replication of the mydb database to each cluster node on behalf of mtmuser, as explained in Section 20.1. Make sure to use the authentication method that satisfies your security requirements. For example, pg_hba.conf might have the following lines on node1:

    host replication mtmuser node2 md5
    host mydb mtmuser node2 md5
    host replication mtmuser node3 md5
    host mydb mtmuser node3 md5
    

  5. Connect to any node on behalf of the mtmuser database user, create the multimaster extension in the mydb database and run mtm.init_cluster(), specifying the connection string to the current node as the first argument and an array of connection strings to the other nodes as the second argument.

    For example, if you would like to connect to node1, run:

    CREATE EXTENSION multimaster;
    SELECT mtm.init_cluster('dbname=mydb user=mtmuser host=node1',
    '{"dbname=mydb user=mtmuser host=node2", "dbname=mydb user=mtmuser host=node3"}');
    

  6. To ensure that multimaster is enabled, you can run the mtm.status() and mtm.nodes() functions:

    SELECT * FROM mtm.status();
    SELECT * FROM mtm.nodes();
    

    If status is equal to online and all nodes are present in the mtm.nodes output, your cluster is successfully configured and ready to use.

Tip

If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:

SELECT mtm.make_table_local('table_name') 

F.35.3.2. Tuning Configuration Parameters #

While you can use multimaster in the default configuration, you may want to tune several parameters for faster failure detection or more reliable automatic recovery.

F.35.3.2.1. Setting Timeout for Failure Detection #

To check availability of the peer nodes, multimaster periodically sends heartbeat packets to all nodes. You can define the timeout for failure detection with the following variables:

  • The multimaster.heartbeat_send_timeout variable defines the time interval between the heartbeats. By default, this variable is set to 200ms.

  • The multimaster.heartbeat_recv_timeout variable sets the timeout for the response. If no heartbeats are received during this time, the node is assumed to be disconnected and is excluded from the cluster. By default, this variable is set to 2000ms.

It's a good idea to set multimaster.heartbeat_send_timeout based on typical ping latencies between the nodes. Small recv/send ratio decreases the time of failure detection, but increases the probability of false-positive failure detection. When setting this parameter, take into account the typical packet loss ratio between your cluster nodes.

F.35.3.3. 2+1 Mode: Setting up a Standalone Referee Node #

By default, multimaster uses a majority-based algorithm to determine whether the cluster nodes have a quorum: a cluster can only continue working if the majority of its nodes are alive and can access each other. Majority-based approach is pointless for two nodes cluster: if one of them fails, another one becomes inaccessible. There is a special 2+1 or referee mode which trades less hardware resources by decreasing availability: two nodes hold full copy of data, and separate referee node participates only in voting, acting as a tie-breaker.

If one node goes down, another one requests referee grant (elects referee-approved generation with single node). Once the grant is received, it continues to work normally. If offline node gets up, it recovers and elects full generation containing both nodes, essentially removing the grant - this allows the node to get it in its turn later. While the grant is issued, it can't be given to another node until full generation is elected and excluded node recovers. This ensures data loss doesn't happen by the price of availability: in this setup two nodes (one normal and one referee) can be alive but cluster might be still unavailable if the referee winner is down, which is impossible with classic three nodes configuration.

The referee node does not store any cluster data, so it is not resource-intensive and can be configured on virtually any system with Postgres Pro Enterprise installed.

To avoid split-brain problems, you must have only a single referee in your cluster.

To set up a referee for your cluster:

  1. Install Postgres Pro Enterprise on the node you are going to make a referee and create the referee extension:

    CREATE EXTENSION referee;
    

  2. Make sure the pg_hba.conf file allows access to the referee node.

  3. Set up the nodes that will hold cluster data following the instructions in Section F.35.3.1.

  4. On all data nodes, specify the referee connection string in the postgresql.conf file:

    multimaster.referee_connstring = connstring
    

    where connstring holds libpq options required to access the referee.

The first subset of nodes that gets connected to the referee wins the voting and starts working. The other nodes have to go through the recovery process to catch up with them and join the cluster. Under heavy load, the recovery can take unpredictably long, so it is recommended to wait for all data nodes going online before switching on the load when setting up a new cluster. Once all the nodes get online, the referee discards the voting result, and all data nodes start operating together.

In case of any failure, the voting mechanism is triggered again. At this time, all nodes appear to be offline for a short period of time to allow the referee to choose a new winner, so you can see the following error message when trying to access the cluster: [multimaster] node is not online: current status is "disabled".

F.35.4. Multi-Master Cluster Administration #

F.35.4.1. Monitoring Cluster Status #

multimaster provides several functions to check the current cluster state.

To check node-specific information, use mtm.status():

SELECT * FROM mtm.status();

To get the list of all nodes in the cluster together with their status, use mtm.nodes():

SELECT * FROM mtm.nodes();

For details on all the returned information, see Section F.35.5.2.

F.35.4.2. Accessing Disabled Nodes #

If a cluster node is disabled, any attempt to read or write data on this node raises an error by default. If you need to access the data on a disabled node, you can override this behavior at connection time by setting the application_name parameter to mtm_admin. In this case, you can run read and write queries on this node without multimaster supervision.

F.35.4.3. Adding New Nodes to the Cluster #

With the multimaster extension, you can add or drop cluster nodes. Before adding node, stop the load and ensure (with mtm.status()) that all nodes are online. When adding a new node, you need to load all the data to this node using pg_basebackup from any cluster node, and then start this node.

Suppose we have a working cluster of three nodes, with node1, node2, and node3 host names. To add node4, follow these steps:

  1. Figure out the required connection string to access the new node. For example, for the database mydb, user mtmuser, and the new node node4, the connection string can be "dbname=mydb user=mtmuser host=node4".

  2. In psql connected to any alive node, run:

    SELECT mtm.add_node('dbname=mydb user=mtmuser host=node4');
    

    This command changes the cluster configuration on all nodes and creates replication slots for the new node. It also returns node_id of the new node, which will be required to complete the setup.

  3. Go to the new node and clone all the data from one of the alive nodes to this node:

    pg_basebackup -D datadir -h node1 -U mtmuser -c fast -v
    

    pg_basebackup copies the entire data directory from node1, together with configuration settings, and prints the last LSN replayed from WAL, such as '0/12D357F0'. This value will be required to complete the setup.

  4. Configure the new node to boot with recovery_target=immediate to prevent redo past the point where replication will begin. Add to postgresql.conf:

    restore_command = 'false'
    recovery_target = 'immediate'
    recovery_target_action = 'promote'
              

    And create recovery.signal file in the data directory.

  5. Start Postgres Pro Enterprise on the new node.

  6. In psql connected to the node used to take the base backup, run:

    SELECT mtm.join_node(4, '0/12D357F0');
    

    where 4 is the node_id returned by the mtm.add_node() function call and '0/12D357F0' is the LSN value returned by pg_basebackup.

F.35.4.4. Removing Nodes from the Cluster #

Before removing node, stop the load and ensure (with mtm.status()) that all nodes (except the ones to be dropped) are online. Shut down the nodes you are going to remove. To remove the node from the cluster:

  1. Run the mtm.nodes() function to learn the ID of the node to be removed:

    SELECT * FROM mtm.nodes();
    

  2. Run the mtm.drop_node() function with this node ID as a parameter:

    SELECT mtm.drop_node(3);
    

    This will delete replication slots for node 3 on all cluster nodes and stop replication to this node.

If you would like to return the node to the cluster later, you will have to add it as a new node, as explained in Section F.35.4.3.

F.35.4.5. Uninstalling the multimaster Extension #

If you would like to continue using the node that has been removed from the cluster in the standalone mode, you have to drop the multimaster extension on this node and clean up all multimaster-related subscriptions and uncommitted transactions to ensure that the node is no longer associated with the cluster.

  1. Remove multimaster from shared_preload_libraries and restart Postgres Pro Enterprise.

  2. Delete the multimaster extension and publication:

    DROP EXTENSION multimaster;
    DROP PUBLICATION multimaster;
    

  3. Review the list of existing subscriptions using the \dRs command and delete each subscription that starts with the mtm_sub_ prefix:

    \dRs
    DROP SUBSCRIPTION mtm_sub_subscription_name;
    

  4. Review the list of existing replication slots and delete each slot that starts with the mtm_ prefix:

    SELECT * FROM pg_replication_slots;
    SELECT pg_drop_replication_slot('mtm_slot_name');
    

  5. Review the list of existing replication origins and delete each origin that starts with the mtm_ prefix:

    SELECT * FROM pg_replication_origin;
    SELECT pg_replication_origin_drop('mtm_origin_name');
    

  6. Review the list of prepared transaction left, if any:

    SELECT * FROM pg_prepared_xacts;
    

    You have to commit or abort these transactions by running ABORT PREPARED transaction_id or COMMIT PREPARED transaction_id, respectively.

Once all these steps are complete, you can start using the node in the standalone mode, if required.

F.35.4.6. Checking Data Consistency Across Cluster Nodes #

You can check that the data is the same on all cluster nodes using the mtm.check_query(query_text) function.

As a parameter, this function takes the text of a query you would like to run for data comparison. When you call this function, it takes a consistent snapshot of data on each cluster node and runs this query against the captured snapshots. The query results are compared between pairs of nodes. If there are no differences, this function returns true. Otherwise, it reports the first detected difference in a warning and returns false.

To avoid false-positive results, always use the ORDER BY clause in your test query. For example, suppose you would like to check that the data in a my_table is the same on all cluster nodes. Compare the results of the following queries:

postgres=# SELECT mtm.check_query('SELECT * FROM my_table ORDER BY id');
 check_query
-------------
 t
(1 row)

postgres=# SELECT mtm.check_query('SELECT * FROM my_table');
WARNING: mismatch in column 'b' of row 0: 256 on node0, 255 on node1
 check_query
-------------
 f
(1 row)

Even though the data is the same, the second query reports an issue because the order of the returned data differs between cluster nodes.

F.35.4.7. Delayed Transaction Commits #

When a lagging node is catching up to the donor node and cannot apply changes as quickly, you can slow down transaction execution on the donor node using the multimaster.tx_delay_on_slow_catchup configuration parameter. To do this, set this parameter to on in the postgresql.conf configuration file on the donor node, but not on the lagging peer node. If you edit the file on a running server, you will need to signal the postmaster to make it re-read the file (see Chapter 19 for details). If necessary, you can also specify the maximum possible delay for transaction execution in the optional multimaster.max_tx_delay_on_slow_catchup parameter (in milliseconds). A value of 0 means that no maximum delay is set. Currently, delays can range from 1 ms to approximately 4 seconds. Values outside the allowed range will be truncated towards the nearest valid value.

By default, this feature is disabled.

F.35.5. Reference #

F.35.5.1. Configuration Parameters #

multimaster.heartbeat_recv_timeout #

Timeout, in milliseconds. If no heartbeat message is received from the node within this timeframe, the node is excluded from the cluster.

Default: 2000 ms

multimaster.heartbeat_send_timeout #

Time interval between heartbeat messages, in milliseconds. An arbiter process broadcasts heartbeat messages to all nodes to detect connection problems.

Default: 200 ms

multimaster.max_workers #

The maximum number of walreceiver workers per peer node.

Important

This parameter should be used with caution. If the number of simultaneous transactions in the whole cluster is bigger than the provided value, it can lead to undetected deadlocks.

Default: 100

multimaster.monotonic_sequences #

Defines the sequence generation mode for unique identifiers. This variable can take the following values:

  • false (default) — ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly.

  • true — the generated sequence increases monotonically cluster-wide. ID generation on each node is started with this node number and is incremented by the number of nodes, but the values are omitted if they are smaller than the already generated IDs on another node. For example, in a three-node cluster, if 1, 4 and 7 IDs are already allocated to the objects on the first node, 2 and 5 IDs will be omitted on the second node. In this case, the first ID on the second node is 8. Thus, the next generated ID is always higher than the previous one, regardless of the cluster node.

Default: false

multimaster.referee_connstring #

Connection string to access the referee node. You must set this parameter on all cluster nodes if the referee is set up.

multimaster.remote_functions #

Provides a comma-separated list of function names that should be executed remotely on all multimaster nodes instead of replicating the result of their work.

multimaster.trans_spill_threshold #

The maximal size of transaction, in kB. When this threshold is reached, the transaction is written to the disk.

Default: 100MB

multimaster.break_connection #

Break connection with clients connected to the node if this node disconnects from the cluster. If this variable is set to false, the client stays connected to the node but receives an error that the node is disabled.

Default: false

multimaster.connect_timeout #

Maximum time to wait while connecting, in seconds. Zero, negative, or not specified means wait indefinitely. The minimum allowed timeout is 2 seconds, therefore a value of 1 is interpreted as 2.

Default: 0

multimaster.ignore_tables_without_pk #

Do not replicate tables without primary key. When false, such tables are replicated.

Default: false

multimaster.syncpoint_interval #

Amount of WAL generated between synchronization points.

Default: 10 MB

multimaster.binary_basetypes #

Send data of built-in types in binary format.

Default: true

multimaster.wait_peer_commits #

Wait until all peers commit the transaction before the command returns a success indication to the client.

Default: true

multimaster.deadlock_prevention #

Manage prevention of transaction deadlocks that can occur when the same tuple is updated or deleted on different nodes simultaneously. If set to off, deadlock prevention is disabled.

If set to simple, the conflicting transactions are rejected. This setting may be used in any setup.

If set to smart, a specific algorithm is used to provide best resource availability by selectively committing or rejecting transactions. This is recommended for a setup of two nodes and a referee. In three node setups, deadlocks are still possible. If there are more than four nodes, all conflicting transactions are rejected, just as with simple.

Default: off

multimaster.tx_delay_on_slow_catchup #

Enable delay of transaction commits on the donor node when peer nodes are catching up to this node. This parameter should only be set on the donor node.

Default: off

multimaster.max_tx_delay_on_slow_catchup #

If multimaster.tx_delay_on_slow_catchup is enabled, this parameter specifies maximum transaction execution delay, in milliseconds. Possible values are integers greater than 0, but it allows values only up to 4 seconds.

Default: 0

multimaster.enable_async_3pc_on_catchup #

Enables asynchronous commit operations (PREPARE, PRECOMMIT, and COMMIT PREPARED) on a lagging node syncing with the donor node. This significantly accelerates the catchup process. Data integrity is maintained using the synchronization point mechanism: during catchup, the system periodically creates synchronization points when all preceding transactions are safely written to disk. This way, even if the lagging node is restarted, no transaction data is lost.

Default: true

multimaster.catchup_algorithm #

Catchup mode for reconnected nodes. It determines how replicated transactions are applied on the reconnected node that is catching up. This configuration parameter can take one of the following values:

  • sequentialmtm-logrep-receiver applies replicated transactions sequentially in the order they are received.

  • parallelmtm-logrep-receiver sends replicated transactions to the pool of dynamic workers (see mtm-logrep-receiver-dynworker). Dynamic workers apply non-conflicting transactions in parallel and conflicting transactions sequentially in the receiving order, similar to the sequential catchup mode. You can use the multimaster.parallel_catchup_workers configuration parameter to specify the maximum number of dynamic workers that can apply replicated transactions in this catchup mode.

Default: sequential

multimaster.parallel_catchup_workers #

The maximum number of dynamic workers that can apply replicated transactions on the reconnected node in the parallel catchup mode. The value of this configuration parameter cannot be greater than that of the multimaster.max_workers configuration parameter. You can use the multimaster.catchup_algorithm configuration parameter to specify the catchup mode for reconnected nodes.

Default: 8

F.35.5.2. Functions #

mtm.init_cluster(my_conninfo text, peers_conninfo text[]) #

Initializes cluster configuration on all nodes. It connects the current node to all nodes listed in peers_conninfo and creates the multimaster extension, replications slots, and replication origins on each node. Run this function once all the nodes are running and can accept connections.

Arguments:

  • my_conninfo — connection string to the node on which you are running this function. Peer nodes use this string to connect back to this node.

  • peers_conninfo — an array of connection strings to all the other nodes to be added to the cluster.

mtm.add_node(connstr text) #

Adds a new node to the cluster. This function should be called before loading data to this node using pg_basebackup. mtm.add_node creates the required replication slots for a new node, so you can add a node while the cluster is under load.

Arguments:

  • connstr — connection string for the new node. For example, for the database mydb, user mtmuser, and the new node node4, the connection string is "dbname=mydb user=mtmuser host=node4".

mtm.join_node(node_id int, backup_end_lsn pg_lsn) #

Completes the cluster setup after adding a new node. This function should be called after the added node has been started.

Arguments:

  • node_id — ID of the node to add to the cluster. It corresponds to the value in the id column returned by mtm.nodes().

    backup_end_lsn — the last LSN of the base backup copied to the new node. This LSN will be used as the starting point for data replication once the node joins the cluster.

mtm.drop_node(node_id integer) #

Excludes a node from the cluster.

If you would like to continue using this node outside of the cluster in the standalone mode, you have to uninstall the multimaster extension from this node, as explained in Section F.35.4.5.

Arguments:

  • node_id — ID of the node being dropped. It corresponds to the value in the id column returned by mtm.nodes().

mtm.alter_sequences() #

Fixes unique identifiers on all cluster nodes. This may be required after restoring all nodes from a single base backup.

mtm.status() #

Shows the status of the multimaster extension on the current node. Returns a tuple of the following values:

  • my_node_id, int — ID of this node.

  • status, text — status of the node. Possible values are: online, recovery, catchup, disabled (need to recover, but not yet clear from whom), isolated (online in current generation, but some members are disconnected).

  • connected, int[] — array of peer IDs connected to this node.

  • gen_num, int8 — current generation number.

  • gen_members, int[] — array of current generation members node IDs.

  • gen_members_online, int[] — array of current generation members node IDs which are online in it.

  • gen_configured, int[] — array of node IDs configured in current generation.

mtm.nodes() #

Shows the information on all nodes in the cluster. Returns a tuple of the following values:

  • id, integer — node ID.

  • conninfo, text — connection string to this node.

  • is_self, boolean — is it me?

  • enabled, boolean — is this node online in current generation?

  • connected, boolean — shows whether the node is connected to our node.

  • sender_pid, integer — WAL sender process ID.

  • receiver_pid, integer — WAL receiver process ID.

  • n_workers, text — number of started dynamic apply workers from this node.

  • receiver_mode, text — in which mode receiver from this node works. Possible values are: disabled, recovery, normal.

mtm.make_table_local(relation regclass) #

Stops replication for the specified table.

Arguments:

  • relation — the table you would like to exclude from the replication scheme.

mtm.check_query(query_text text) #

Checks data consistency across cluster nodes. This function takes a snapshot of the current state of each node, runs the specified query against these snapshots, and compares the results. If the results are different between any two nodes, displays a warning with the first found issue and returns false. Otherwise, returns true.

Arguments:

  • query_text — the query you would like to run on all nodes for data comparison. To avoid false-positive results, always use the ORDER BY clause in the test query.

mtm.get_snapshots() #

Takes a snapshot of data on each cluster node and returns the snapshot ID. The snapshots remain available until the mtm.free_snapshots() is called, or the current session is terminated. This function is used by the mtm.check_query(query_text), there is no need to call it manually.

mtm.free_snapshots() #

Removes data snapshots taken by the mtm.get_snapshots() function. This function is used by the mtm.check_query(query_text), there is no need to call it manually.

F.35.6. Compatibility #

F.35.6.1. Local and Global DDL Statements #

By default, any DDL statement is executed on all cluster nodes, except the following statements that can only act locally on a given node:

  • ALTER SYSTEM

  • CREATE DATABASE

  • DROP DATABASE

  • REINDEX

  • CHECKPOINT

  • CLUSTER

  • LOAD

  • LISTEN

  • CHECKPOINT

  • NOTIFY

F.35.7. Authors #

Postgres Professional, Moscow, Russia.

F.35.7.1. Credits #

The replication mechanism is based on logical decoding and an earlier version of the pglogical extension provided for community by the 2ndQuadrant team.

The Paxos consensus algorithm is described at:

Parallel replication and recovery mechanism is similar to the one described in: