F.31. multimaster

multimaster is a Postgres Pro Enterprise extension with a set of patches that turns Postgres Pro Enterprise into a synchronous shared-nothing cluster to provide Online Transaction Processing (OLTP) scalability for read transactions and high availability with automatic disaster recovery.

As compared to a standard PostgreSQL master-standby cluster, a cluster configured with the multimaster extension offers the following benefits:

  • Cluster-wide transaction isolation

  • Synchronous logical replication and DDL replication

  • Working with temporary tables on each cluster node

  • Fault tolerance and automatic node recovery

  • PostgreSQL online upgrades

The multimaster extension replicates your database to all nodes of the cluster and allows write transactions on each node. To ensure data consistency in the case of concurrent updates, multimaster enforces transaction isolation cluster-wide, using multiversion concurrency control (MVCC) at the read committed or repeatable read isolation levels. Any write transaction is synchronously replicated to all nodes, which increases commit latency for the time required for synchronization. Read-only transactions and queries are executed locally, without any measurable overhead.

To ensure high availability and fault tolerance of the cluster, multimaster uses three-phase commit protocol and heartbeats for failure discovery. A multi-master cluster of N nodes can continue working while the majority of the nodes are alive and reachable by other nodes. To be configured with multimaster, the cluster must include at least two nodes. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.

When a failed node is reconnected to the cluster, multimaster can automatically fast-forward the node to the actual state based on the Write-Ahead Log (WAL) data in the corresponding replication slot. If WAL data is no longer available for the time when the node was excluded from the cluster, you can restore the node using pg_basebackup.

Important

When using multimaster, make sure to take its replication restrictions into account. For details, see Section F.31.1.

To learn more about the multimaster internals, see Section F.31.2.

F.31.1. Limitations

The multimaster extension takes care of the database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:

  • Microsoft Windows operating system is not supported.

  • multimaster can only replicate one database per cluster. This database must be specified in the multimaster.conn_strings variable. If you connect to a different database, all operations will fail with the corresponding error message.

  • The replicated tables must have primary keys or replica identity because of logical replication restrictions. Although tables without primary keys can be replicated, multimaster does not allow UPDATE and DELETE operations on such tables. For details, see multimaster.ignore_tables_without_pk. Unlogged tables are not replicated, as in the standard PostgreSQL.

  • Isolation level. The multimaster extension supports read committed and repeatable read isolation levels. Serializable isolation level is currently not supported.

    Important

    Using repeatable read isolation level increases the probability of serialization failure at commit time. Unlike in the standard PostgreSQL, read committed level can also cause serialization failures on a multi-master cluster (with an SQLSTATE code '40001').

    When performing a write transaction, multimaster blocks the affected objects only on the node on which the transaction is performed. However, since write transactions are allowed on all nodes, other transactions can try to change the same objects on the neighbor nodes at the same time. In this case, the replication of the first transaction can fail because the affected objects on the neighbor nodes are already blocked by another transaction. Similarly, the latter transaction cannot be replicated to the first node. In this case, a distributed deadlock occurs. As a result, one of the transactions is automatically rolled back and needs to be repeated. The application must be ready to retry transactions.

    If your typical workload has too many rollbacks, it is recommended to use read committed isolation level. However, the read committed still does not guarantee the absence of deadlocks on a multi-master cluster. If using the read committed level does not help, you can try directing all the write transactions to a single node.

  • In a multi-master cluster, the ALTER SYSTEM command only affects the configuration of the current node. If you would like to change configuration parameters across the whole database cluster, you need to run this command on each node.

  • Sequence generation. To avoid conflicts between unique identifiers on different nodes, multimaster modifies the default behavior of sequence generators. By default, ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly. Thus, the generated sequence values are not monotonic. If it is critical to get a monotonically increasing sequence cluster-wide, you can set the multimaster.monotonic_sequences to true.

  • DDL replication. While multimaster replicates data on the logical level, DDL is replicated on the statement level, which results in distributed commits of the same statement on different nodes. As a result, complex DDL scenarios, such as stored procedures and temporary tables, may work differently as compared to the standard PostgreSQL.

  • Commit latency. In the current implementation of logical replication, multimaster sends data to subscriber nodes only after the local commit, so you have to wait for transaction processing twice: first on the local node, and then on all the other nodes simultaneously. In the case of a heavy-write transaction, this may result in a noticeable delay.

If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:

SELECT mtm.make_table_local('table_name') 

F.31.2. Architecture

F.31.2.1. Replication

Since each server in a multi-master cluster can accept writes, any server can abort a transaction because of a concurrent update — in the same way as it happens on a single server between different backends. To ensure high availability and data consistency on all cluster nodes, multimaster uses logical replication and the three-phase E3PC commit protocol.

When Postgres Pro Enterprise loads the multimaster shared library, multimaster sets up a logical replication producer and consumer for each node, and hooks into the transaction commit pipeline. The typical data replication workflow consists of the following phases:

  1. PREPARE phase. multimaster captures and implicitly transforms each COMMIT statement to a PREPARE statement. All the nodes that get the transaction via the replication protocol (the cohort nodes) send their vote for approving or declining the transaction to the arbiter process on the initiating node. This ensures that all the cohort can accept the transaction, and no write conflicts occur. For details on PREPARE transactions support in PostgreSQL, see the PREPARE TRANSACTION topic.

  2. PRECOMMIT phase. If all the cohort nodes approve the transaction, the arbiter process sends a PRECOMMIT message to all the cohort nodes to express an intention to commit the transaction. The cohort nodes respond to the arbiter with the PRECOMMITTED message. In case of a failure, all the nodes can use this information to complete the transaction using a quorum-based voting procedure.

  3. COMMIT phase. If PRECOMMIT is successful, the arbiter commits the transaction to all nodes.

Important

multimaster currently supports the read committed and repeatable read isolation levels only, which can cause unexpected serialization failures in your workload. For details, see Section F.31.1.

If a node crashes or gets disconnected from the cluster between the PREPARE and COMMIT phases, the PRECOMMIT phase ensures that the survived nodes have enough information to complete the prepared transaction. The PRECOMMITTED messages help avoid the situation when the crashed node has already committed or aborted the transaction, but has not notified other nodes about the transaction status. In a two-phase commit (2PC), such a transaction would block resources (hold locks) until the recovery of the crashed node. Otherwise, you could get data inconsistencies in the database when the failed node is recovered. For example, if the failed node committed the transaction but the survived node aborted it.

To complete the transaction, the arbiter must receive a response from the majority of the nodes. For example, for a cluster of 2N+1 nodes, at least N+1 responses are required. Thus, multimaster ensures that your cluster is available for reads and writes while the majority of the nodes are connected, and no data inconsistencies occur in case of a node or connection failure. For details on the failure detection mechanism, see Section F.31.2.2.

F.31.2.2. Failure Detection and Recovery

Since multimaster allows writes to each node, it has to wait for responses about transaction acknowledgement from all the other nodes. Without special actions in case of a node failure, each commit would have to wait until the failed node recovery. To deal with such situations, multimaster periodically sends heartbeats to check the node state and the connectivity between nodes. When several heartbeats to the node are lost in a row, this node is kicked out of the cluster to allow writes to the remaining alive nodes. You can configure the heartbeat frequency and the response timeout in the multimaster.heartbeat_send_timeout and multimaster.heartbeat_recv_timeout parameters, respectively.

For alive nodes, there is no way to distinguish between a failed node that stopped serving requests and a network-partitioned node that can be accessed by database users, but is unreachable for other nodes. To avoid conflicting writes to nodes in different network partitions, multimaster only allows writes to the nodes that see the majority of other nodes.

For example, suppose a five-node multi-master cluster experienced a network failure that split the network into two isolated subnets, with two and three cluster nodes. Based on heartbeats propagation information, multimaster will continue accepting writes at each node in the bigger partition, and deny all writes in the smaller one. Thus, a cluster consisting of 2N+1 nodes can tolerate N node failures and stay alive if any N+1 nodes are alive and connected to each other.

Tip

For clusters with an even number of nodes, you can override this behavior. For details, see Section F.31.3.3.

In case of a partial network split when different nodes have different connectivity, multimaster finds a fully connected subset of nodes and disconnects other nodes. For example, in a three-node cluster, if node A can access both B and C, but node B cannot access node C, multimaster isolates node C to ensure data consistency on nodes A and B.

If you try to access a disconnected node, multimaster returns an error message indicating the current status of the node. To prevent stale reads, read-only queries are also forbidden. Additionally, you can break connections between the disconnected node and the clients using the multimaster.break_connection variable.

Each node maintains a data structure that keeps the information about the state of all nodes in relation to this node. You can get this data in the mtm.get_nodes_state() view.

When a failed node connects back to the cluster, multimaster starts automatic recovery:

  1. The reconnected node selects a random cluster node and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL).

  2. When the node gets synchronized up to the minimum recovery lag, all the cluster nodes get locked for write transactions to allow the recovery process to finish. By default, the minimum recovery lag is 10KB. You can change this value in the multimaster.min_recovery_lag variable.

  3. When the recovery is complete, multimaster promotes the reconnected node to the online state and includes it into the replication scheme.

Note

Automatic recovery is only possible if the failed node WAL lag behind the working ones does not exceed the multimaster.max_recovery_lag value. When the WAL lag is bigger than multimaster.max_recovery_lag, you can manually restore the node from one of the working nodes using pg_basebackup.

See Also

Restoring a Cluster Node

F.31.3. Installation and Setup

To use multimaster, you need to install Postgres Pro Enterprise on all nodes of your cluster. Postgres Pro Enterprise includes all the required dependencies and extensions.

F.31.3.1. Setting up a Multi-Master Cluster

After installing Postgres Pro Enterprise on all nodes, you need to configure the cluster with multimaster.

Suppose you are setting up a cluster of three nodes, with node1, node2, and node3 domain names. First, set up the database to be replicated, and make sure you have a DBMS user with superuser rights to perform replication:

  • If you are starting from scratch, initialize a cluster, create an empty database mydb and a new DBMS user myuser with superuser rights, on each node of the cluster. For details, see Section 18.2.

  • If you already have a database mydb running on node1, create a new DBMS user myuser with superuser rights and initialize new nodes from the working node using pg_basebackup on behalf of this user. Run the following command on each node you are going to add:

    pg_basebackup -D datadir -h node1 -U myuser
    

    where datadir is the directory containing the database cluster. This directory is specified at the cluster initialization stage, or set in the PGDATA environment variable. You can also use any other DBMS user with superuser rights instead of myuser to perform this task.

    For details on using pg_basebackup, see pg_basebackup.

Once the database is set up, complete the following steps on each cluster node:

  1. Allow replication of the mydb database to each cluster node on behalf of myuser, as explained in Section 20.1. Make sure to use the authentication method that satisfies your security requirements.

  2. Modify the postgresql.conf configuration file, as follows:

    • Add multimaster to the shared_preload_libraries variable:

      shared_preload_libraries = 'multimaster'
      

      Tip

      If the shared_preload_libraries variable is already defined in postgresql.auto.conf, you will need to modify its value using the ALTER SYSTEM command. For details, see Section 19.1.2. Note that in a multi-master cluster, the ALTER SYSTEM command only affects the configuration of the node from which it was run.

    • Specify the transaction isolation level for your cluster. multimaster currently supports read committed and repeatable read isolation levels.

      default_transaction_isolation = 'read committed'

      Important

      Using repeatable read isolation level increases the probability of serialization failure at commit time. If such cases are not handled by your application, you are recommended to use read committed isolation level.

    • Set up PostgreSQL parameters related to replication.

      wal_level = logical
      max_connections = 100
      max_prepared_transactions = 300
      max_wal_senders = 10       # at least the number of nodes
      max_replication_slots = 10 # at least the number of nodes
      

      You must change the replication level to logical as multimaster relies on logical replication. For a cluster of N nodes, enable at least N WAL sender processes and replication slots. Since multimaster implicitly adds a PREPARE phase to each COMMIT transaction, make sure to set the number of prepared transactions to N*max_connections. Otherwise, prepared transactions may be queued.

    • Make sure you have enough background workers allocated for each node:

      max_worker_processes = 250
      

      For example, for a three-node cluster with max_connections = 100, multimaster may need up to 206 background workers at peak times: 200 workers for connections from the neighbor nodes, two workers for WAL sender processes, two workers for WAL receiver processes, and two workers for the arbiter sender and receiver processes. When setting this parameter, remember that other modules may also use background workers at the same time.

    • Add multimaster-specific options:

      multimaster.max_nodes = 3  # cluster size
      multimaster.node_id = 1    # the 1-based index of this node 
                                 # in the cluster
      multimaster.conn_strings = 'dbname=mydb user=myuser host=node1 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node2 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node3 port=5432 arbiter_port=5433'
                                 # comma-separated list 
                                 # of connection strings 
                                 # to neighbor nodes
      multimaster.arbiter_port = 5433
      

      The multimaster.max_nodes variable defines the maximum cluster size. If you plan to add new nodes to your cluster, the multimaster.max_nodes value should exceed the initial number of nodes. In this case, you can add new nodes without restarting Postgres Pro Enterprise until the specified number of nodes is reached.

      In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.

      Important

      The multimaster.node_id variable takes natural numbers starting from 1, without any gaps in numbering. For example, for a cluster of five nodes, set node IDs to 1, 2, 3, 4, and 5. In the multimaster.conn_strings variable, make sure to list the nodes in the order of their IDs. The multimaster.conn_strings variable must be the same on all nodes.

      You can customize connection parameters by adding other libpq connection options to connection strings in the multimaster.conn_strings variable. By default, multimaster uses port 5432 to establish connections between the nodes and port 5433 for the arbiter process to listen for connections. If these ports are already in use, you must specify other ports in port and arbiter_port options in each connection string in the multimaster.conn_strings variable.

      If you change the arbiter_port option, you must also specify this port in the multimaster.arbiter_port variable. For details, see multimaster.arbiter_port and multimaster.conn_strings.

      To check whether the default ports are available, you can run the following command:

      netstat -ln | grep -E '5432|5433'
      

      Make sure these ports are not blocked by firewall.

    Depending on your network environment and usage patterns, you may want to tune other multimaster parameters. For details, see Section F.31.3.2.

  3. On behalf of an OS superuser, restart Postgres Pro Enterprise:

    pg_ctl -D datadir -l pg.log restart
    

When Postgres Pro Enterprise is started on all nodes, connect to any node on behalf of the postgres OS user and create the multimaster extension in the mydb database that will be replicated:

psql -h node1 -d mydb
CREATE EXTENSION multimaster;

The CREATE EXTENSION query is replicated to all the cluster nodes.

To ensure that multimaster is enabled, check the mtm.get_cluster_state() view:

SELECT mtm.get_cluster_state();

If liveNodes is equal to allNodes, your cluster is successfully configured and ready to use.

See Also

Tuning Configuration Parameters

F.31.3.1.1. Sample Cluster Configuration for CentOS, Oracle Linux, and RHEL

This section shows a sample configuration of a three-node cluster with host names node1, node2 and node3. The name of the database to be replicated is mydb. The name of DBMS user with superuser rights to perform database replication is myuser.

  1. First, run the following commands on behalf of OS user postgres, on all cluster nodes:

    # Configure each node, and then create the myuser user
    # and mydb database owned by myuser.
    export PATH=/usr/pgproee-9.6/bin:$PATH
    export PGDATA=/var/lib/pgproee/9.6/data
    
    for i in `seq 1 3`; do
    echo "host replication myuser node$i md5" >> $PGDATA/pg_hba.conf
    echo "host mydb myuser node$i md5" >> $PGDATA/pg_hba.conf
    echo "node$i:5432:mydb:myuser:myuserpassword" >> ~/.pgpass
    done
    chmod 0600 ~/.pgpass
    
    cat << EOF | psql --dbname=postgres --username=postgres --port=5432
    ALTER SYSTEM SET default_transaction_isolation = 'read committed';
    ALTER SYSTEM SET wal_level = logical;
    ALTER SYSTEM SET max_connections = 100;
    ALTER SYSTEM SET max_prepared_transactions = 300;
    ALTER SYSTEM SET max_wal_senders = 10;
    ALTER SYSTEM SET max_replication_slots = 10;
    ALTER SYSTEM SET max_worker_processes = 250;
    ALTER SYSTEM SET shared_preload_libraries = 'multimaster';
    CREATE USER myuser WITH SUPERUSER PASSWORD 'myuserpassword';
    CREATE DATABASE mydb OWNER myuser;
    EOF
    
    # Define the number of nodes in the cluster, assign node IDs,
    # and connection strings to the nodes. This example assumes
    # that node hostnames are node1, node2, and node3, and their
    # IDs are the corresponding numbers starting with 1.
    cat << EOF >> $PGDATA/postgresql.conf
    multimaster.max_nodes = 3
    multimaster.node_id = `hostname | awk '{ print substr($1,5,1) }'`
    multimaster.arbiter_port = 5433
    multimaster.conn_strings = 'dbname=mydb user=myuser host=node1 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node2 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node3 port=5432 arbiter_port=5433'
    EOF
    
    
  2. On behalf of an OS user with superuser privileges, restart Postgres Pro Enterprise service on all cluster nodes:

    sudo service postgrespro-enterprise-9.6 restart
    
  3. Now create the multimaster extension on one of the cluster nodes, on behalf of OS user postgres. It will be replicated to all the other nodes automatically. The following example illustrates extension creation on node1.

    psql --dbname=mydb --username=myuser --host=node1 --port=5432 -c "CREATE EXTENSION IF NOT EXISTS multimaster"
    

    The cluster is set up and ready to use.

F.31.3.2. Tuning Configuration Parameters

While you can use multimaster in the default configuration, you may want to tune several parameters for faster failure detection or more reliable automatic recovery.

F.31.3.2.1. Setting Timeout for Failure Detection

To check availability of the neighbor nodes, multimaster periodically sends heartbeat packets to all nodes. You can define the timeout for failure detection with the following variables:

  • The multimaster.heartbeat_send_timeout variable defines the time interval between the heartbeats. By default, this variable is set to 200ms.

  • The multimaster.heartbeat_recv_timeout variable sets the timeout for the response. If no heartbeats are received during this time, the node is assumed to be disconnected and is excluded from the cluster. By default, this variable is set to 1000ms.

It's a good idea to set multimaster.heartbeat_send_timeout based on typical ping latencies between the nodes. Small recv/send ratio decreases the time of failure detection, but increases the probability of false-positive failure detection. When setting this parameter, take into account the typical packet loss ratio between your cluster nodes.

F.31.3.2.2. Configuring Automatic Recovery Parameters

If a cluster node fails, multimaster can automatically restore it based on the WAL collected on other cluster nodes. To control the recovery settings, use the following variables:

  • multimaster.min_recovery_lag — sets the minimal WAL lag between the node to be restored and the current cluster state. By default, multimaster.min_recovery_lag is set to 10KB. When the disconnected node is fast-forwarded up to the multimaster.min_recovery_lag threshold, multimaster stops all new commits to the alive nodes until the node fully catches up with the current state of the cluster. When the data is fully synchronized, the disconnected node is promoted to the online state, and the cluster resumes its work.

  • multimaster.max_recovery_lag — sets the maximum size of WAL. Upon reaching the multimaster.max_recovery_lag threshold, WAL for the disconnected node is overwritten. At this point, automatic recovery is no longer possible. In this case, you can restore the node manually by cloning the data from one of the alive nodes using pg_basebackup.

By default, multimaster.max_recovery_lag is set to 1GB. Setting multimaster.max_recovery_lag to a larger value increases the timeframe for automatic recovery, but requires more disk space for WAL collection.

See Also

GUC Variables

F.31.3.3. Defining Quorum Settings for Clusters with an Even Number of Nodes

By default, multimaster uses a majority-based algorithm to determine whether the cluster nodes have a quorum: a cluster can only continue working if the majority of its nodes are alive and can access each other. For clusters with an even number of nodes, this approach is not optimal. For example, if a network failure splits the cluster into equal parts, or one of the nodes fails in a two-node cluster, all the nodes stop accepting queries, even though at least half of the cluster nodes are running normally.

To enable a smooth failover for such cases, you can modify the multimaster majority-based behavior using one of the following options:

Important

To avoid split-brain problems, do not use the major node together with a referee in the same cluster.

F.31.3.3.1. Setting up a Standalone Referee Node

A referee is a voting node used to determine which subset of nodes has a quorum if the cluster is split into equal parts. The referee node does not store any cluster data, so it is not resource-intensive and can be configured on virtually any system with Postgres Pro Enterprise installed.

To set up a referee for your cluster:

  1. Install Postgres Pro Enterprise on the node you are going to make a referee and create the referee extension:

    CREATE EXTENSION referee;
    

  2. Make sure the pg_hba.conf file allows access to the referee node.

  3. On all your cluster nodes, specify the referee connection string in the postgresql.conf file:

    multimaster.referee_connstring = connstring
    

    where connstring holds libpq options required to access the referee.

The first subset of nodes that gets connected to the referee wins the voting and continues working. The referee keeps the voting result until all the other cluster nodes get online again. Then the result is discarded, and a new winner can be chosen in case of another network failure.

To avoid split-brain problems, you can only have a single referee in your cluster. Do not set up a referee if you have already configured the major node.

F.31.3.3.2. Configuring the Major Node

If you configure one of the nodes to be the major one, this node will continue accepting queries even if it is isolated by a network failure, or other nodes get broken. This setting is useful in a two-node cluster configuration, or to quickly restore a single node in a broken cluster.

Important

If your cluster has more than two nodes, promoting one of the nodes to the major status can lead to split-brain problems in case of network failures, and reduce the number of possible failover options. Consider setting up a standalone referee instead.

To make one of the nodes major, enable the multimaster.major_node parameter on this node:

ALTER SYSTEM SET multimaster.major_node TO on
SELECT pg_reload_conf();

Do not set the major_node parameter on more than one cluster node. When enabled on several nodes, it can cause the split-brain problem. If you have already set up a referee for your cluster, the major_node option is forbidden.

F.31.4. Multi-Master Cluster Administration

F.31.4.1. Monitoring Cluster Status

multimaster provides several views to check the current cluster state.

To check node-specific information, use mtm.get_nodes_state():

SELECT mtm.get_nodes_state();

To check the status of the whole cluster, use the mtm.get_cluster_state() view:

SELECT mtm.get_cluster_state();

For details on all the returned information, see Section F.31.5.2.

F.31.4.2. Adding New Nodes to the Cluster

With the multimaster extension, you can add or drop cluster nodes without stopping the database service.

To add a new node, you need to change the cluster configuration on alive nodes, load all the data to the new node using pg_basebackup, and start the node.

Suppose we have a working cluster of three nodes, with node1, node2, and node3 domain names. To add node4, follow these steps:

  1. Check whether the current number of cluster nodes has reached the value specified in the multimaster.max_nodes variable. If this value is reached, increase the multimaster.max_nodes value on each node and restart all nodes. You can restart the nodes one by one, without stopping the database. If the maximum number of nodes is not reached, proceed to the next step.

  2. Figure out the required connection string to access the new node. For example, for the database mydb, user myuser, and the new node node4, the connection string can be "dbname=mydb user=myuser host=node4". For details, see multimaster.conn_strings.

  3. In psql connected to any alive node, run:

    SELECT mtm.add_node('dbname=mydb user=myuser host=node4');
    

    This command changes the cluster configuration on all nodes and starts replication slots for the new node.

  4. Connect to the new node and clone all the data from one of the alive nodes to the new node:

    pg_basebackup -D datadir -h node1 -x
    

    pg_basebackup copies the entire data directory from node1, together with configuration settings.

  5. Update postgresql.conf settings on node4:

    multimaster.node_id = 4
    multimaster.conn_strings = 'dbname=mydb user=myuser host=node1,
                                dbname=mydb user=myuser host=node2,
                                dbname=mydb user=myuser host=node3,
                                dbname=mydb user=myuser host=node4'
    
  6. Start Postgres Pro on the new node:

    pg_ctl -D datadir -l pg.log start
    

    When the node gets synchronized up to the minimum recovery lag, all the cluster nodes get locked for write transactions until the new node retrieves all the updates. When data recovery is complete, multimaster promotes the new node to the online state and includes it into the replication scheme.

To ensure that the new configuration is loaded in the case of PostgreSQL restart, update configuration settings on all the cluster nodes:

  1. Change multimaster.conn_strings to include the new node.

  2. Make sure the pg_hba.conf file allows replication to the new node.

See Also

Setting up a Multi-Master Cluster

Monitoring Cluster Status

F.31.4.3. Removing Nodes from the Cluster

multimaster provides the mtm.stop_node() function that can temporarily or permanently remove nodes from the cluster.

To temporarily exclude a node from the cluster, run the mtm.stop_node() function specifying the node ID. For example, to exclude node 3, run the following command on any other node of the cluster:

SELECT mtm.stop_node(3);

This command excludes node 3 from the cluster and stops replication to this node.

While the WAL lag between the node and the current cluster state is less than the multimaster.max_recovery_lag value, you can restore the node by running the following command:

SELECT mtm.recover_node(3);

Otherwise, follow the procedure described in Section F.31.4.4.

Note

If you simply shut down a node, it will be excluded from the cluster as well. However, all transactions in the cluster will be frozen until other nodes detect the offline state of the node. This time interval is defined by the multimaster.heartbeat_recv_timeout parameter.

To permanently drop the node from the cluster:

  1. Run the mtm.stop_node() function with the drop_slot parameter set to true:

    SELECT mtm.stop_node(3, true);
    

    This disables replication slots for node 3 on all cluster nodes and stops replication to this node.

  2. Adjust multimaster.node_id and multimaster.conn_strings settings in postgresql.conf on the remaining cluster nodes to reflect the new state of the cluster.

  3. Edit the pg_hba.conf file on the remaining cluster nodes to disable replication to the removed node, if required.

If you would like to return the node to the cluster later, you will have to add it as a new node, as explained in Section F.31.4.2.

F.31.4.4. Restoring a Cluster Node Manually

The multimaster extension can automatically restore a failed node if the WAL is available for the time when the node was disconnected from the cluster. However, if the data updates on the alive nodes exceed the allowed WAL size specified in the multimaster.max_recovery_lag variable, automatic recovery is impossible. In this case, you can manually restore the failed node.

Suppose node2 got disconnected from your three-node cluster and needs to be manually restored. The typical workflow is as follows:

  1. In psql connected to any alive node, create a new replication slot for the disconnected node with the following command:

    SELECT mtm.recover_node(2);
    

    where 2 is the ID of the disconnected node specified in the multimaster.node_id variable.

  2. Connect to node2 and clone all the data from one of the alive nodes:

    pg_basebackup -D datadir -h node1 -x
    

    pg_basebackup copies the entire data directory from node1, together with configuration settings.

  3. On the restored node, update the multimaster.node_id setting to the value this node used to have before the failure.

  4. Make sure replication is enabled between the restored node and the rest of the cluster.

  5. Start Postgres Pro on the restored node:

    pg_ctl -D datadir -l pg.log start
    

    When the node gets synchronized up to the minimum recovery lag, all the cluster nodes get locked for write transactions until the restored node retrieves all the updates. When data recovery is complete, multimaster promotes the new node to the online state and includes it into the replication scheme.

See Also

Failure Detection and Recovery

F.31.5. Reference

F.31.5.1. GUC Variables

multimaster.node_id

Node ID — a unique natural number identifying the node of a multi-master cluster. You must start node numbering from 1. There must be no gaps in numbering. For example, for a cluster of five nodes, set node IDs to 1, 2, 3, 4, and 5.

multimaster.conn_strings

Connection strings for each node of a multi-master cluster, separated by commas. The multimaster.conn_strings parameter must be identical on all nodes. Each connection string must include the name of the database to replicate and the cluster node domain name. For example, 'dbname=mydb host=node1,dbname=mydb host=node2,dbname=mydb host=node3'. Optionally, you can add other connection parameters to change the default connection settings. Connection strings must appear in the order of the node IDs specified in the multimaster.node_id variable. Connection string for the i-th node must be on the i-th position. If you specify a custom port in the multimaster.arbiter_port, you must provide this value in the arbiter_port parameter in the connection string for the corresponding node.

multimaster.max_nodes

The maximum number of nodes allowed in the cluster. If you plan to add new nodes to your cluster, the multimaster.max_nodes value should exceed the initial number of nodes. In this case, you can add new nodes without restarting Postgres Pro Enterprise until the specified number of nodes is reached. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes. The maximum possible number of nodes is limited to 64.

Default: the number of nodes specified in the multimaster.conn_strings variable

multimaster.arbiter_port

Port for the arbiter process to listen on. If you change the default value, you must specify this value in the arbiter_port parameter in the connection string for the corresponding node.

Default: 5433

multimaster.heartbeat_send_timeout

Time interval between heartbeat messages, in milliseconds. An arbiter process broadcasts heartbeat messages to all nodes to detect connection problems.

Default: 200

multimaster.heartbeat_recv_timeout

Timeout, in milliseconds. If no heartbeat message is received from the node within this timeframe, the node is excluded from the cluster.

Default: 1000

multimaster.min_recovery_lag

Minimal WAL lag between the node to be restored and the current cluster state, in kB. When this threshold is reached during node recovery, the cluster is locked for write transactions until the recovery is complete.

Default: 10KB

multimaster.max_recovery_lag

Maximal WAL lag size, in kB. When a node is disconnected from the cluster, other nodes copy WAL data for all new transactions into the replication slot of this node. Upon reaching the multimaster.max_recovery_lag value, the replication slot for the disconnected node is dropped to avoid overflow. At this point, automatic recovery of the node is no longer possible. In this case, you can restore the node manually by cloning the data from one of the alive nodes using pg_basebackup or a similar tool. If you set this variable to zero, replication slot will not be dropped.

Default: 1GB

multimaster.ignore_tables_without_pk

Boolean. This variable enables/disables replication of INSERT operations for tables without primary keys. By default, such replication is enabled. Setting this parameter to on disables replication of INSERT operations for these tables. Regardless of this setting, DDL operations on tables without primary keys are always replicated, while UPDATE and DELETE operations are not replicated because of the logical replication restrictions.

Default: off

multimaster.cluster_name

Name of the cluster. If you define this variable when setting up the cluster, multimaster checks that the cluster name is the same for all the cluster nodes.

multimaster.break_connection

Break connection with clients connected to the node if this node disconnects from the cluster. If this variable is set to false, the client stays connected to the node but receives an error that the node is in minority.

Default: false

multimaster.major_node

The node with this flag continues working even if there is no quorum. This may be required to break the symmetry in two-node clusters, or to quickly restore a single node in a broken cluster.

Important

This parameter should be used with caution to avoid split-brain problems:

  • Do not use the major_node parameter on clusters with more than two nodes. If the major node gets isolated by a network failure while the majority of nodes are connected to each other, both the major node and the subset that has a quorum will continue working. Consider setting up a standalone referee instead.

  • Never set more than one major node in the cluster.

multimaster.referee_connstring

Connection string to access the referee node. You must set this parameter on all cluster nodes if the referee is set up.

multimaster.max_workers

The maximum number of walreceiver workers on this server.

Important

This parameter should be used with caution. If the number of simultaneous transactions in the whole cluster is bigger than the provided value, it can lead to undetected deadlocks.

multimaster.trans_spill_threshold

The maximal size of transaction, in kB. When this threshold is reached, the transaction is written to the disk.

Default: 100MB

multimaster.monotonic_sequences

Defines the sequence generation mode for unique identifiers. This variable can take the following values:

  • false (default) — ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly.

  • true — the generated sequence increases monotonically cluster-wide. ID generation on each node is started with this node number and is incremented by the number of nodes, but the values are omitted if they are smaller than the already generated IDs on another node. For example, in a three-node cluster, if 1, 4 and 7 IDs are already allocated to the objects on the first node, 2 and 5 IDs will be omitted on the second node. In this case, the first ID on the second node is 8. Thus, the next generated ID is always higher than the previous one, regardless of the cluster node.

Default: false

multimaster.remote_functions

Provides a comma-separated list of function names that should be executed remotely on all multimaster nodes instead of replicating the result of their work.

F.31.5.2. Functions

mtm.get_nodes_state()

Shows the status of all nodes in the cluster. Returns a tuple of the following values:

  • id, integer

    Node ID.

  • enabled, boolean

    Shows whether the node is excluded from the cluster. The node can only be disabled if responses to heartbeats are not received within the heartbeat_recv_timeout time interval. When the node starts responding to heartbeats, multimaster can automatically restore the node and switch it back to the enabled state. Automatic recovery is only possible if the replication slot is still active. Otherwise, you can restore the node manually.

  • connected, boolean

    Shows whether the node is connected to the WAL sender.

  • slot_active, boolean

    Shows whether the node has an active replication slot. For a disabled node, the slot remains active until the max_recovery_lag value is reached.

  • stopped, boolean

    Shows whether replication to this node was stopped by the mtm.stop_node() function. A stopped node acts as a disabled one, but cannot be automatically recovered. Call mtm.recover_node() to re-enable such a node.

  • catchUp, boolean

    During the node recovery, shows whether the data is recovered up to the min_recovery_lag value.

  • slotLag, bigint

    The size of WAL data that the replication slot holds for a disabled/stopped node. The slot is dropped when slotLag reaches the max_recovery_lag value.

  • avgTransDelay, bigint

    An average commit delay caused by this node, in microseconds.

  • lastStatusChange, timestamp

    Last time when the node changed its status (enabled/disabled).

  • oldestSnapshot, bigint

    The oldest global snapshot existing on this node.

  • SenderPid, integer

    Process ID of the WAL sender.

  • SenderStartTime, timestamp

    WAL sender start time.

  • ReceiverPid, integer

    Process ID of the WAL receiver.

  • ReceiverStartTime, timestamp

    WAL receiver start time.

  • connStr, text

    Connection string to this node.

  • connectivityMask, bigint

    Bitmask representing connectivity to neighbor nodes. Each bit represents a connection to node.

  • nHeartbeats, integer

    The number of heartbeat responses received from this node.

mtm.collect_cluster_info()

Collects the data returned by the mtm.get_cluster_state() function from all available nodes. For this function to work, in addition to replication connections, pg_hba.conf must allow ordinary connections to the node with the specified connection string.

mtm.get_cluster_state()

Shows the status of the multimaster extension. Returns a tuple of the following values:

  • id, integer

    Node ID.

  • status, text

    Node status. Possible values are: Initialization, Offline, Connected, Online, Recovery, Recovered, InMinor, OutOfService.

  • disabledNodeMask, bigint

    Bitmask of disabled nodes.

  • disconnectedNodeMask, bigint

    Bitmask of disconnected nodes.

  • catchUpNodeMask, bigint

    Bitmask of nodes that completed the recovery.

  • liveNodes, integer

    Number of enabled nodes.

  • allNodes, integer

    Number of nodes in the cluster. The majority of alive nodes is calculated based on this parameter.

  • nActiveQueries, integer

    Number of queries being currently processed on this node.

  • nPendingQueries, integer

    Number of queries waiting for execution on this node.

  • queueSize, bigint

    Size of the pending query queue, in bytes.

  • transCount, bigint

    The total number of replicated transactions processed by this node.

  • timeShift, bigint

    Global snapshot shift caused by unsynchronized clocks on nodes, in microseconds.

  • recoverySlot, integer

    The node from which a failed node gets data updates during automatic recovery.

  • xidHashSize, bigint

    Size of xid2state hash.

  • gidHashSize, bigint

    Size of gid2state hash.

  • oldestXid, bigint

    The oldest transaction ID on this node.

  • configChanges, integer

    Number of state changes (enabled/disabled) since the last reboot.

  • stalledNodeMask, bigint

    Bitmask of nodes for which replication slots were dropped.

  • stoppedNodeMask, bigint

    Bitmask of nodes that were stopped by mtm.stop_node().

  • lastStatusChange, timestamp

    Timestamp of the last state change.

mtm.add_node(conn_str text)

Adds a new node to the cluster.

Arguments:

  • conn_str — connection string for the new node. For example, for the database mydb, user myuser, and the new node node4, the connection string is "dbname=mydb user=myuser host=node4".

    Type: text

mtm.alter_sequences()

Fixes unique identifiers on all cluster nodes. This may be required after restoring all nodes from a single base backup.

mtm.stop_node(node integer, drop_slot bool default false)

Excludes a node from the cluster.

Arguments:

  • node — ID of the node to be dropped that you specified in the multimaster.node_id variable.

    Type: integer

  • drop_slot — Optional. Defines whether the replication slot should be dropped together with the node. Set this option to true if you do not plan to restore the node in the future.

    Type: boolean

    Default: false

mtm.recover_node(node integer)

Creates a replication slot for the node that was previously dropped together with its slot.

Arguments:

  • node — ID of the node to be restored.

mtm.make_table_local('table_name' regclass)

Stops replication for the specified table.

Arguments:

  • table_name — the table you would like to exclude from the replication scheme.

    Type: regclass

mtm.copy_table(table_name regclass, node_id integer)

Copies the specified table to another node. You can use this function to restore the corrupted data on one or more nodes of the cluster. This function must be called on the node from which the table is copied.

Arguments:

  • table_name — the table you would like to copy.

    Type: regclass

  • node_id — the node to copy the table to.

    Type: integer

mtm.broadcast_table(table_name regclass)

Copies the specified table to all the alive nodes of the cluster. You can use this function to restore the corrupted data. This function must be called on the node from which the table is copied.

Arguments:

  • table_name — the table you would like to copy.

    Type: regclass

F.31.6. Compatibility

The multimaster extension currently passes almost all PostgreSQL regression tests, except for a few tests for edge cases related to working with temporary tables and updating enum, which are not always transactional in PostgreSQL. We are working right now on providing full compatibility with the standard PostgreSQL.

F.31.7. Authors

Postgres Professional, Moscow, Russia.

F.31.7.1. Credits

The replication mechanism is based on logical decoding and an earlier version of the pglogical extension provided for community by the 2ndQuadrant team.

The three-phase E3PC commit protocol is based on the following works: