F.32. multimaster
multimaster is a Postgres Pro Enterprise extension with a set of patches that turns Postgres Pro Enterprise into a synchronous shared-nothing cluster to provide Online Transaction Processing (OLTP) scalability for read transactions and high availability with automatic disaster recovery.
As compared to a standard PostgreSQL master-standby cluster, a cluster configured with the multimaster extension offers the following benefits:
Cluster-wide transaction isolation
Synchronous logical replication and DDL replication
Working with temporary tables on each cluster node
Fault tolerance and automatic node recovery
PostgreSQL online upgrades
Important
The multimaster extension is no longer supported in Postgres Pro Enterprise 10. Please consider migrating to the latest Postgres Pro Enterprise major version to ensure stable performance of multimaster.
The multimaster extension replicates your database to all nodes of the cluster and allows write transactions on each node. To ensure data consistency in the case of concurrent updates, multimaster enforces transaction isolation cluster-wide, using multiversion concurrency control (MVCC) at the read committed or repeatable read isolation levels. Any write transaction is synchronously replicated to all nodes, which increases commit latency by the time required for synchronization. Read-only transactions and queries are executed locally, without any measurable overhead.
To ensure high availability and fault tolerance of the cluster, multimaster uses a three-phase commit protocol and heartbeats for failure discovery. A multi-master cluster of N nodes can continue working while the majority of the nodes are alive and reachable by other nodes. To be configured with multimaster, the cluster must include at least two nodes. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.
When a failed node is reconnected to the cluster, multimaster can automatically fast-forward the node to the actual state based on the Write-Ahead Log (WAL) data in the corresponding replication slot. If WAL data is no longer available for the time when the node was excluded from the cluster, you can restore the node using pg_basebackup.
Note
When using multimaster, make sure to take its replication restrictions into account. For details, see Section F.32.1.
To learn more about the multimaster internals, see Section F.32.2.
F.32.1. Limitations
The multimaster extension takes care of database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:
The Microsoft Windows operating system is not supported.
multimaster can only replicate one database per cluster. This database must be specified in the multimaster.conn_strings variable. If you connect to a different database, all operations will fail with the corresponding error message.
The replicated tables must have primary keys or replica identity because of logical replication restrictions. Although tables without primary keys can be replicated, multimaster does not allow UPDATE and DELETE operations on such tables. For details, see multimaster.ignore_tables_without_pk. Unlogged tables are not replicated, as in the standard PostgreSQL.
Isolation level. The multimaster extension supports the read committed and repeatable read isolation levels. The serializable isolation level is currently not supported.
Important
Using the repeatable read isolation level increases the probability of serialization failure at commit time. Unlike in the standard PostgreSQL, the read committed level can also cause serialization failures on a multi-master cluster (with SQLSTATE code '40001').
When performing a write transaction, multimaster locks the affected objects only on the node on which the transaction is performed. However, since write transactions are allowed on all nodes, other transactions can try to change the same objects on the neighbor nodes at the same time. In this case, the replication of the first transaction can fail because the affected objects on the neighbor nodes are already locked by another transaction. Similarly, the latter transaction cannot be replicated to the first node, so a distributed deadlock occurs. As a result, one of the transactions is automatically rolled back and needs to be repeated. The application must be ready to retry transactions (a minimal retry sketch follows this list).
If your typical workload has too many rollbacks, it is recommended to use the read committed isolation level. However, read committed still does not guarantee the absence of deadlocks on a multi-master cluster. If using the read committed level does not help, you can try directing all the write transactions to a single node.
In a multi-master cluster, the ALTER SYSTEM command only affects the configuration of the current node. If you would like to change configuration parameters across the whole database cluster, you need to run this command on each node.
Sequence generation. To avoid conflicts between unique identifiers on different nodes, multimaster modifies the default behavior of sequence generators. By default, ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, IDs 1, 4, and 7 are allocated to the objects written onto the first node, while IDs 2, 5, and 8 are reserved for the second node. If you change the number of nodes in the cluster, the increment interval for new IDs is adjusted accordingly. Thus, the generated sequence values are not monotonic. If it is critical to get a monotonically increasing sequence cluster-wide, you can set the multimaster.monotonic_sequences variable to true.
DDL replication. While multimaster replicates data on the logical level, DDL is replicated on the statement level, which results in distributed commits of the same statement on different nodes. As a result, complex DDL scenarios, such as stored procedures and temporary tables, may work differently as compared to the standard PostgreSQL.
Commit latency. In the current implementation of logical replication, multimaster sends data to subscriber nodes only after the local commit, so you have to wait for transaction processing twice: first on the local node, and then on all the other nodes simultaneously. In the case of a heavy-write transaction, this may result in a noticeable delay.
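As an illustration, a minimal client-side retry loop might look as follows. This is a sketch using bash and psql; tx.sql is a hypothetical file containing the statements of one transaction, and for brevity the loop retries on any psql failure rather than checking specifically for SQLSTATE '40001':
# Run tx.sql as a single transaction (-1) and retry up to 3 times.
# ON_ERROR_STOP=1 makes psql exit with a nonzero status on error.
for attempt in 1 2 3; do
    if psql -d mydb -v ON_ERROR_STOP=1 -1 -f tx.sql; then
        break        # committed successfully
    fi
    sleep 1          # back off before retrying
done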
If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:
SELECT mtm.make_table_local('table_name');
F.32.2. Architecture
F.32.2.1. Replication
Since each server in a multi-master cluster can accept writes, any server can abort a transaction because of a concurrent update — in the same way as it happens on a single server between different backends. To ensure high availability and data consistency on all cluster nodes, multimaster uses logical replication and the three-phase E3PC commit protocol.
When Postgres Pro Enterprise loads the multimaster shared library, multimaster sets up a logical replication producer and consumer for each node, and hooks into the transaction commit pipeline. The typical data replication workflow consists of the following phases:
PREPARE phase. multimaster captures and implicitly transforms each COMMIT statement to a PREPARE statement (observable as shown after this list). All the nodes that get the transaction via the replication protocol (the cohort nodes) send their vote for approving or declining the transaction to the arbiter process on the initiating node. This ensures that all the cohort nodes can accept the transaction, and no write conflicts occur. For details on PREPARE transactions support in PostgreSQL, see the PREPARE TRANSACTION topic.
PRECOMMIT phase. If all the cohort nodes approve the transaction, the arbiter process sends a PRECOMMIT message to all the cohort nodes to express an intention to commit the transaction. The cohort nodes respond to the arbiter with the PRECOMMITTED message. In case of a failure, all the nodes can use this information to complete the transaction using a quorum-based voting procedure.
COMMIT phase. If PRECOMMIT is successful, the arbiter commits the transaction on all nodes.
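Because each COMMIT is implicitly transformed into a prepared transaction, an in-flight distributed transaction can typically be observed on a cohort node through the standard pg_prepared_xacts view while it awaits the final commit. A diagnostic sketch (not part of the multimaster API):
-- Run on any node while a replicated write transaction is in flight:
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;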
Important
multimaster currently supports the read committed and repeatable read isolation levels only, which can cause unexpected serialization failures in your workload. For details, see Section F.32.1.
If a node crashes or gets disconnected from the cluster between the PREPARE and COMMIT phases, the PRECOMMIT phase ensures that the surviving nodes have enough information to complete the prepared transaction. The PRECOMMITTED messages help avoid the situation when the crashed node has already committed or aborted the transaction, but has not notified other nodes about the transaction status. In a two-phase commit (2PC), such a transaction would either block resources (hold locks) until the recovery of the crashed node, or lead to data inconsistencies in the database when the failed node is recovered: for example, if the failed node committed the transaction but the surviving node aborted it.
To complete the transaction, the arbiter must receive a response from the majority of the nodes. For example, for a cluster of 2N+1 nodes, at least N+1 responses are required. Thus, multimaster ensures that your cluster is available for reads and writes while the majority of the nodes are connected, and no data inconsistencies occur in case of a node or connection failure. For details on the failure detection mechanism, see Section F.32.2.2.
F.32.2.2. Failure Detection and Recovery
Since multimaster allows writes to each node, it has to wait for responses about transaction acknowledgment from all the other nodes. Without special actions in case of a node failure, each commit would have to wait until the failed node recovers. To deal with such situations, multimaster periodically sends heartbeats to check the node state and the connectivity between nodes. When several heartbeats to a node are lost in a row, this node is kicked out of the cluster to allow writes to the remaining alive nodes. You can configure the heartbeat frequency and the response timeout in the multimaster.heartbeat_send_timeout and multimaster.heartbeat_recv_timeout parameters, respectively.
For alive nodes, there is no way to distinguish between a failed node that stopped serving requests and a network-partitioned node that can be accessed by database users, but is unreachable for other nodes. To avoid conflicting writes to nodes in different network partitions, multimaster only allows writes to the nodes that see the majority of other nodes.
For example, suppose a five-node multi-master cluster experienced a network failure that split the network into two isolated subnets, with two and three cluster nodes. Based on heartbeat propagation information, multimaster will continue accepting writes on each node in the bigger partition, and deny all writes in the smaller one. Thus, a cluster consisting of 2N+1 nodes can tolerate N node failures and stay alive if any N+1 nodes are alive and connected to each other.
Tip
For clusters with an even number of nodes, you can override this behavior. For details, see Section F.32.3.3.
In case of a partial network split, when different nodes have different connectivity, multimaster finds a fully connected subset of nodes and disconnects the other nodes. For example, in a three-node cluster, if node A can access both B and C, but node B cannot access node C, multimaster isolates node C to ensure data consistency on nodes A and B.
If you try to access a disconnected node, multimaster returns an error message indicating the current status of the node. To prevent stale reads, read-only queries are also forbidden. Additionally, you can break connections between the disconnected node and its clients using the multimaster.break_connection variable.
Each node maintains a data structure that keeps the information about the state of all nodes in relation to this node. You can get this data from the mtm.get_nodes_state() view.
When a failed node connects back to the cluster, multimaster starts automatic recovery:
The reconnected node selects a random cluster node and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL).
When the node gets synchronized up to the minimum recovery lag, all the cluster nodes get locked for write transactions to allow the recovery process to finish. By default, the minimum recovery lag is 10KB. You can change this value in the multimaster.min_recovery_lag variable.
When the recovery is complete, multimaster promotes the reconnected node to the online state and includes it into the replication scheme.
Note
Automatic recovery is only possible if the failed node's WAL lag behind the working ones does not exceed the multimaster.max_recovery_lag value. When the WAL lag is bigger than multimaster.max_recovery_lag, you can manually restore the node from one of the working nodes using pg_basebackup.
F.32.3. Installation and Setup
To use multimaster, you need to install Postgres Pro Enterprise on all nodes of your cluster. Postgres Pro Enterprise includes all the required dependencies and extensions.
F.32.3.1. Setting up a Multi-Master Cluster
After installing Postgres Pro Enterprise on all nodes, you need to configure the cluster with multimaster.
Suppose you are setting up a cluster of three nodes, with node1, node2, and node3 domain names. First, set up the database to be replicated, and make sure you have a DBMS user with superuser rights to perform replication:
If you are starting from scratch, initialize a cluster, create an empty database mydb and a new DBMS user myuser with superuser rights, on each node of the cluster. For details, see Section 18.2.
If you already have a database mydb running on node1, create a new DBMS user myuser with superuser rights and initialize new nodes from the working node using pg_basebackup on behalf of this user. Run the following command on each node you are going to add:
pg_basebackup -D datadir -h node1 -U myuser -c fast
where datadir is the directory containing the database cluster. This directory is specified at the cluster initialization stage, or set in the PGDATA environment variable. You can also use any other DBMS user with superuser rights instead of myuser to perform this task.
For details on using pg_basebackup, see pg_basebackup.
Once the database is set up, complete the following steps on each cluster node:
Allow replication of the mydb database to each cluster node on behalf of myuser, as explained in Section 20.1. Make sure to use the authentication method that satisfies your security requirements.
Modify the postgresql.conf configuration file, as follows:
Add multimaster to the shared_preload_libraries variable:
shared_preload_libraries = 'multimaster'
Tip
If the shared_preload_libraries variable is already defined in postgresql.auto.conf, you will need to modify its value using the ALTER SYSTEM command. For details, see Section 19.1.2. Note that in a multi-master cluster, the ALTER SYSTEM command only affects the configuration of the node from which it was run.
Specify the transaction isolation level for your cluster. multimaster currently supports the read committed and repeatable read isolation levels:
default_transaction_isolation = 'read committed'
Important
Using the repeatable read isolation level increases the probability of serialization failure at commit time. If such cases are not handled by your application, you are recommended to use the read committed isolation level.
Set up PostgreSQL parameters related to replication:
wal_level = logical
max_connections = 100
max_prepared_transactions = 300
max_wal_senders = 10        # at least the number of nodes
max_replication_slots = 10  # at least the number of nodes
You must change the replication level to logical, as multimaster relies on logical replication. For a cluster of N nodes, enable at least N WAL sender processes and replication slots. Since multimaster implicitly adds a PREPARE phase to each COMMIT transaction, make sure to set the number of prepared transactions to N * max_connections. Otherwise, prepared transactions may be queued.
Make sure you have enough background workers allocated for each node:
max_worker_processes = 250
For example, for a three-node cluster with max_connections = 100, multimaster may need up to 206 background workers at peak times: 200 workers for connections from the neighbor nodes, two workers for WAL sender processes, two workers for WAL receiver processes, and two workers for the arbiter sender and receiver processes. When setting this parameter, remember that other modules may also use background workers at the same time.
Add multimaster-specific options:
multimaster.max_nodes = 3  # cluster size
multimaster.node_id = 1    # the 1-based index of this node in the cluster
multimaster.conn_strings = 'dbname=mydb user=myuser host=node1 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node2 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node3 port=5432 arbiter_port=5433'
                           # comma-separated list of connection strings to neighbor nodes
multimaster.arbiter_port = 5433
The multimaster.max_nodes variable defines the maximum cluster size. If you plan to add new nodes to your cluster, the multimaster.max_nodes value should exceed the initial number of nodes. In this case, you can add new nodes without restarting Postgres Pro Enterprise until the specified number of nodes is reached.
In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.
Important
The multimaster.node_id variable takes natural numbers starting from 1, without any gaps in numbering. For example, for a cluster of five nodes, set node IDs to 1, 2, 3, 4, and 5. In the multimaster.conn_strings variable, make sure to list the nodes in the order of their IDs. The multimaster.conn_strings variable must be the same on all nodes.
You can customize connection parameters by adding other libpq connection options to the connection strings in the multimaster.conn_strings variable. By default, multimaster uses port 5432 to establish connections between the nodes and port 5433 for the arbiter process to listen for connections. If these ports are already in use, you must specify other ports in the port and arbiter_port options in each connection string in the multimaster.conn_strings variable.
If you change the arbiter_port option, you must also specify this port in the multimaster.arbiter_port variable. For details, see multimaster.arbiter_port and multimaster.conn_strings.
To check whether the default ports are available, you can run the following command:
netstat -ln | grep -E '5432|5433'
Make sure these ports are not blocked by the firewall.
Depending on your network environment and usage patterns, you may want to tune other multimaster parameters. For details, see Section F.32.3.2.
On behalf of an OS superuser, restart Postgres Pro Enterprise:
pg_ctl -D datadir -l pg.log restart
When Postgres Pro Enterprise is started on all nodes, connect to any node on behalf of the postgres OS user and create the multimaster extension in the mydb database that will be replicated:
psql -h node1 -d mydb
CREATE EXTENSION multimaster;
The CREATE EXTENSION query is replicated to all the cluster nodes.
To ensure that multimaster is enabled, check the mtm.get_cluster_state() view:
SELECT mtm.get_cluster_state();
If liveNodes is equal to allNodes, your cluster is successfully configured and ready to use.
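To see the individual fields as separate columns, you can call the function in the FROM clause, which is standard PostgreSQL behavior for functions returning a composite type:
SELECT * FROM mtm.get_cluster_state();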
See Also: Tuning Configuration Parameters (Section F.32.3.2)
F.32.3.1.1. Sample Cluster Configuration
This section shows a sample configuration of a three-node cluster with host names node1, node2, and node3. The name of the database to be replicated is mydb. The name of the DBMS user with superuser rights to perform database replication is myuser.
First, run the following commands on behalf of OS user postgres, on all cluster nodes:
# Configure each node, and then create the myuser user
# and mydb database owned by myuser.
export PATH=/opt/pgpro/ent-10/bin:$PATH
export PGDATA=/var/lib/pgpro/ent-10/data
for i in `seq 1 3`; do
    echo "host replication myuser node$i md5" >> $PGDATA/pg_hba.conf
    echo "host mydb myuser node$i md5" >> $PGDATA/pg_hba.conf
    echo "node$i:5432:mydb:myuser:myuserpassword" >> ~/.pgpass
done
chmod 0600 ~/.pgpass
cat << EOF | psql --dbname=postgres --username=postgres --port=5432
ALTER SYSTEM SET default_transaction_isolation = 'read committed';
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_connections = 100;
ALTER SYSTEM SET max_prepared_transactions = 300;
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_worker_processes = 250;
ALTER SYSTEM SET shared_preload_libraries = 'multimaster';
CREATE USER myuser WITH SUPERUSER PASSWORD 'myuserpassword';
CREATE DATABASE mydb OWNER myuser;
EOF

# Define the number of nodes in the cluster, assign node IDs,
# and connection strings to the nodes. This example assumes
# that node hostnames are node1, node2, and node3, and their
# IDs are the corresponding numbers starting with 1.
cat << EOF >> $PGDATA/postgresql.conf
multimaster.max_nodes = 3
multimaster.node_id = `hostname | awk '{ print substr($1,5,1) }'`
multimaster.arbiter_port = 5433
multimaster.conn_strings = 'dbname=mydb user=myuser host=node1 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node2 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node3 port=5432 arbiter_port=5433'
EOF
On behalf of an OS user with superuser privileges, restart the Postgres Pro Enterprise service on all cluster nodes:
sudo service postgrespro-ent-10 restart
Now create the multimaster extension on one of the cluster nodes, on behalf of OS user postgres. It will be replicated to all the other nodes automatically. The following example illustrates extension creation on node1:
psql --dbname=mydb --username=myuser --host=node1 --port=5432 -c "CREATE EXTENSION IF NOT EXISTS multimaster"
The cluster is set up and ready to use.
F.32.3.2. Tuning Configuration Parameters
While you can use multimaster in the default configuration, you may want to tune several parameters for faster failure detection or more reliable automatic recovery.
F.32.3.2.1. Setting Timeout for Failure Detection
To check availability of the neighbor nodes, multimaster periodically sends heartbeat packets to all nodes. You can define the timeout for failure detection with the following variables:
The multimaster.heartbeat_send_timeout variable defines the time interval between the heartbeats. By default, this variable is set to 200ms.
The multimaster.heartbeat_recv_timeout variable sets the timeout for the response. If no heartbeats are received during this time, the node is assumed to be disconnected and is excluded from the cluster. By default, this variable is set to 1000ms.
It is a good idea to set multimaster.heartbeat_send_timeout based on typical ping latencies between the nodes. A small recv/send ratio decreases the failure detection time, but increases the probability of false-positive failure detection. When setting this parameter, take into account the typical packet loss ratio between your cluster nodes.
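For example, on a network with higher latencies or occasional packet loss, you might stretch both timeouts while keeping their ratio. The values below are illustrative only and go into postgresql.conf:
multimaster.heartbeat_send_timeout = 500   # ms between heartbeats
multimaster.heartbeat_recv_timeout = 2500  # ms of silence before the node
                                           # is considered disconnected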
F.32.3.2.2. Configuring Automatic Recovery Parameters
If a cluster node fails, multimaster can automatically restore it based on the WAL collected on other cluster nodes. To control the recovery settings, use the following variables:
multimaster.min_recovery_lag — sets the minimal WAL lag between the node to be restored and the current cluster state. By default, multimaster.min_recovery_lag is set to 10KB. When the disconnected node is fast-forwarded up to the multimaster.min_recovery_lag threshold, multimaster stops all new commits to the alive nodes until the node fully catches up with the current state of the cluster. When the data is fully synchronized, the disconnected node is promoted to the online state, and the cluster resumes its work.
multimaster.max_recovery_lag — sets the maximum size of WAL. Upon reaching the multimaster.max_recovery_lag threshold, WAL for the disconnected node is overwritten. At this point, automatic recovery is no longer possible. In this case, you can restore the node manually by cloning the data from one of the alive nodes using pg_basebackup.
By default, multimaster.max_recovery_lag is set to 1GB. Setting multimaster.max_recovery_lag to a larger value increases the timeframe for automatic recovery, but requires more disk space for WAL collection.
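For example, assuming both parameters take plain kB values as described above, a node with ample disk space could widen the automatic-recovery window as follows (illustrative values for postgresql.conf):
multimaster.min_recovery_lag = 10       # kB, the default
multimaster.max_recovery_lag = 4194304  # kB (4GB); a wider automatic-recovery
                                        # window at the cost of extra WAL disk space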
F.32.3.3. Defining Quorum Settings for Clusters with an Even Number of Nodes
By default, multimaster uses a majority-based algorithm to determine whether the cluster nodes have a quorum: a cluster can only continue working if the majority of its nodes are alive and can access each other. For clusters with an even number of nodes, this approach is not optimal. For example, if a network failure splits the cluster into equal parts, or one of the nodes fails in a two-node cluster, all the nodes stop accepting queries, even though at least half of the cluster nodes are running normally.
To enable a smooth failover for such cases, you can modify the multimaster majority-based behavior using one of the following options:
Set up a standalone referee node to assign the quorum status to a subset of nodes that constitutes half of the cluster.
Choose the major node that continues working regardless of the status of other nodes. Use this option in two-node cluster configurations only.
Important
To avoid split-brain problems, do not use the major node together with a referee in the same cluster.
F.32.3.3.1. Setting up a Standalone Referee Node
A referee is a voting node used to determine which subset of nodes has a quorum if the cluster is split into equal parts. The referee node does not store any cluster data, so it is not resource-intensive and can be configured on virtually any system with Postgres Pro Enterprise installed.
To set up a referee for your cluster:
Install Postgres Pro Enterprise on the node you are going to make a referee and create the referee extension:
CREATE EXTENSION referee;
Make sure the pg_hba.conf file allows access to the referee node.
On all your cluster nodes, specify the referee connection string in the postgresql.conf file (an example follows these steps):
multimaster.referee_connstring = connstring
where connstring holds the libpq options required to access the referee.
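For example, assuming the referee runs on a host named referee and accepts connections to the postgres database (both names are hypothetical):
multimaster.referee_connstring = 'dbname=postgres user=myuser host=referee port=5432'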
The first subset of nodes that gets connected to the referee wins the voting and continues working. The referee keeps the voting result until all the other cluster nodes get online again. Then the result is discarded, and a new winner can be chosen in case of another network failure.
To avoid split-brain problems, you can only have a single referee in your cluster. Do not set up a referee if you have already configured the major node.
F.32.3.3.2. Configuring the Major Node
If you configure one of the nodes to be the major one, this node will continue accepting queries even if it is isolated by a network failure, or other nodes get broken. This setting is useful in a two-node cluster configuration, or to quickly restore a single node in a broken cluster.
To make one of the nodes major, enable the multimaster.major_node parameter on this node:
ALTER SYSTEM SET multimaster.major_node TO on;
SELECT pg_reload_conf();
Do not set the major_node parameter on more than one cluster node. When enabled on several nodes, it can cause the split-brain problem. If you have already set up a referee for your cluster, the major_node option is forbidden.
Important
If your cluster has more than two nodes, do not use the major node setting, as it can lead to split-brain problems in case of network failures, and reduce the number of possible failover options. If the major node gets isolated by a network failure while the majority of nodes are connected to each other, both the major node and the subset that has a quorum will continue working. Consider setting up a standalone referee instead.
F.32.4. Multi-Master Cluster Administration
F.32.4.1. Monitoring Cluster Status
multimaster provides several views to check the current cluster state.
To check node-specific information, use mtm.get_nodes_state():
SELECT mtm.get_nodes_state();
To check the status of the whole cluster, use the mtm.get_cluster_state() view:
SELECT mtm.get_cluster_state();
For details on all the returned information, see Section F.32.5.2.
F.32.4.2. Adding New Nodes to the Cluster
With the multimaster extension, you can add or drop cluster nodes without stopping the database service.
To add a new node, you need to change the cluster configuration on alive nodes, load all the data to the new node using pg_basebackup, and start the node.
Suppose we have a working cluster of three nodes, with node1, node2, and node3 domain names. To add node4, follow these steps:
Check whether the current number of cluster nodes has reached the value specified in the multimaster.max_nodes variable. If this value is reached, increase the multimaster.max_nodes value on each node and restart all nodes. You can restart the nodes one by one, without stopping the database. If the maximum number of nodes is not reached, proceed to the next step.
Figure out the required connection string to access the new node. For example, for the database mydb, user myuser, and the new node node4, the connection string can be "dbname=mydb user=myuser host=node4". For details, see multimaster.conn_strings.
In psql connected to any alive node, run:
SELECT mtm.add_node('dbname=mydb user=myuser host=node4');
This command changes the cluster configuration on all nodes and starts replication slots for the new node.
Connect to the new node and clone all the data from one of the alive nodes to the new node:
pg_basebackup -D datadir -h node1 -U myuser -c fast
pg_basebackup copies the entire data directory from node1, together with configuration settings.
Update postgresql.conf settings on node4:
multimaster.node_id = 4
multimaster.conn_strings = 'dbname=mydb user=myuser host=node1, dbname=mydb user=myuser host=node2, dbname=mydb user=myuser host=node3, dbname=mydb user=myuser host=node4'
Start Postgres Pro on the new node:
pg_ctl -D datadir -l pg.log start
When the node gets synchronized up to the minimum recovery lag, all the cluster nodes get locked for write transactions until the new node retrieves all the updates. When data recovery is complete, multimaster promotes the new node to the online state and includes it into the replication scheme.
To ensure that the new configuration is loaded in the case of PostgreSQL restart, update configuration settings on all the cluster nodes:
Change multimaster.conn_strings to include the new node.
Make sure the pg_hba.conf file allows replication to the new node. After that, you can verify the cluster state as shown below.
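To verify that the new node has joined, you can query the cluster views from any node, for example:
SELECT mtm.get_nodes_state();    -- node4 should be listed and eventually enabled
SELECT mtm.get_cluster_state();  -- liveNodes should equal allNodes again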
F.32.4.3. Removing Nodes from the Cluster
multimaster provides the mtm.stop_node() function that can temporarily or permanently remove nodes from the cluster.
To temporarily exclude a node from the cluster, run the mtm.stop_node() function specifying the node ID. For example, to exclude node 3, run the following command on any other node of the cluster:
SELECT mtm.stop_node(3);
This command excludes node 3 from the cluster and stops replication to this node.
While the WAL lag between the node and the current cluster state is less than the multimaster.max_recovery_lag value, you can restore the node by running the following command:
SELECT mtm.recover_node(3);
Otherwise, follow the procedure described in Section F.32.4.4.
Note
If you simply shut down a node, it will be excluded from the cluster as well. However, all transactions in the cluster will be frozen until other nodes detect the offline state of the node. This time interval is defined by the multimaster.heartbeat_recv_timeout parameter.
To permanently drop the node from the cluster:
Run the mtm.stop_node() function with the drop_slot parameter set to true:
SELECT mtm.stop_node(3, true);
This disables replication slots for node 3 on all cluster nodes and stops replication to this node.
Adjust the multimaster.node_id and multimaster.conn_strings settings in postgresql.conf on the remaining cluster nodes to reflect the new state of the cluster.
Edit the pg_hba.conf file on the remaining cluster nodes to disable replication to the removed node, if required.
If you would like to return the node to the cluster later, you will have to add it as a new node, as explained in Section F.32.4.2.
F.32.4.4. Restoring a Cluster Node Manually
The multimaster extension can automatically restore a failed node if the WAL is available for the time when the node was disconnected from the cluster. However, if the data updates on the alive nodes exceed the allowed WAL size specified in the multimaster.max_recovery_lag variable, automatic recovery is impossible. In this case, you can manually restore the failed node.
Suppose node2 got disconnected from your three-node cluster and needs to be manually restored. The typical workflow is as follows:
In psql connected to any alive node, create a new replication slot for the disconnected node with the following command:
SELECT mtm.recover_node(2);
where 2 is the ID of the disconnected node specified in the multimaster.node_id variable.
Connect to node2 and clone all the data from one of the alive nodes:
pg_basebackup -D datadir -h node1 -U myuser -c fast
pg_basebackup copies the entire data directory from node1, together with configuration settings.
On the restored node, update the multimaster.node_id setting to the value this node used to have before the failure.
Make sure replication is enabled between the restored node and the rest of the cluster.
Start Postgres Pro on the restored node:
pg_ctl -D datadir -l pg.log start
When the node gets synchronized up to the minimum recovery lag, all the cluster nodes get locked for write transactions until the restored node retrieves all the updates. When data recovery is complete, multimaster promotes the restored node to the online state and includes it into the replication scheme.
F.32.5. Reference
F.32.5.1. GUC Variables
multimaster.node_id
Node ID — a unique natural number identifying the node of a multi-master cluster. You must start node numbering from 1. There must be no gaps in numbering. For example, for a cluster of five nodes, set node IDs to 1, 2, 3, 4, and 5.
multimaster.conn_strings
Connection strings for each node of a multi-master cluster, separated by commas. The multimaster.conn_strings parameter must be identical on all nodes. Each connection string must include the name of the database to replicate and the cluster node domain name, for example, 'dbname=mydb host=node1,dbname=mydb host=node2,dbname=mydb host=node3'. Optionally, you can add other connection parameters to change the default connection settings. Connection strings must appear in the order of the node IDs specified in the multimaster.node_id variable: the connection string for the i-th node must be in the i-th position. If you specify a custom port in the multimaster.arbiter_port variable, you must provide this value in the arbiter_port parameter in the connection string for the corresponding node.
multimaster.max_nodes
The maximum number of nodes allowed in the cluster. If you plan to add new nodes to your cluster, the multimaster.max_nodes value should exceed the initial number of nodes. In this case, you can add new nodes without restarting Postgres Pro Enterprise until the specified number of nodes is reached. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes. The maximum possible number of nodes is limited to 64.
Default: the number of nodes specified in the multimaster.conn_strings variable
multimaster.arbiter_port
Port for the arbiter process to listen on. If you change the default value, you must specify this value in the arbiter_port parameter in the connection string for the corresponding node.
Default: 5433
multimaster.heartbeat_send_timeout
Time interval between heartbeat messages, in milliseconds. An arbiter process broadcasts heartbeat messages to all nodes to detect connection problems.
Default: 200
multimaster.heartbeat_recv_timeout
Timeout, in milliseconds. If no heartbeat message is received from the node within this timeframe, the node is excluded from the cluster.
Default: 1000
multimaster.min_recovery_lag
Minimal WAL lag between the node to be restored and the current cluster state, in kB. When this threshold is reached during node recovery, the cluster is locked for write transactions until the recovery is complete.
Default: 10KB
multimaster.max_recovery_lag
Maximal WAL lag size, in kB. When a node is disconnected from the cluster, other nodes copy WAL data for all new transactions into the replication slot of this node. Upon reaching the multimaster.max_recovery_lag value, the replication slot for the disconnected node is dropped to avoid overflow. At this point, automatic recovery of the node is no longer possible. In this case, you can restore the node manually by cloning the data from one of the alive nodes using pg_basebackup or a similar tool. If you set this variable to zero, the replication slot will not be dropped.
Default: 1GB
multimaster.ignore_tables_without_pk
Boolean. This variable enables/disables replication of INSERT operations for tables without primary keys. By default, such replication is enabled. Setting this parameter to on disables replication of INSERT operations for these tables. Regardless of this setting, DDL operations on tables without primary keys are always replicated, while UPDATE and DELETE operations are not replicated because of the logical replication restrictions.
Default: false
multimaster.cluster_name
Name of the cluster. If you define this variable when setting up the cluster, multimaster checks that the cluster name is the same for all the cluster nodes.
multimaster.break_connection
Break connections with clients connected to the node if this node disconnects from the cluster. If this variable is set to false, the client stays connected to the node but receives an error that the node is in the minority.
Default: false
multimaster.major_node
The node with this flag continues working even if there is no quorum. This may be required to break the symmetry in two-node clusters, or to quickly restore a single node in a broken cluster.
Important
This parameter should be used with caution to avoid split-brain problems:
Do not use the major_node parameter on clusters with more than two nodes. If the major node gets isolated by a network failure while the majority of nodes are connected to each other, both the major node and the subset that has a quorum will continue working. Consider setting up a standalone referee instead.
Never set more than one major node in the cluster.
multimaster.referee_connstring
Connection string to access the referee node. You must set this parameter on all cluster nodes if the referee is set up.
multimaster.max_workers
The maximum number of walreceiver workers on this server.
Important
This parameter should be used with caution. If the number of simultaneous transactions in the whole cluster is bigger than the provided value, it can lead to undetected deadlocks.
Default: 100
multimaster.trans_spill_threshold
The maximal size of a transaction, in kB. When this threshold is reached, the transaction is written to disk.
Default: 100MB
multimaster.monotonic_sequences
Defines the sequence generation mode for unique identifiers. This variable can take the following values:
false (default) — ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, IDs 1, 4, and 7 are allocated to the objects written onto the first node, while IDs 2, 5, and 8 are reserved for the second node. If you change the number of nodes in the cluster, the increment interval for new IDs is adjusted accordingly.
true — the generated sequence increases monotonically cluster-wide. ID generation on each node is started with this node number and is incremented by the number of nodes, but the values are omitted if they are smaller than the already generated IDs on another node. For example, in a three-node cluster, if IDs 1, 4, and 7 are already allocated to the objects on the first node, IDs 2 and 5 will be omitted on the second node. In this case, the first ID on the second node is 8. Thus, the next generated ID is always higher than the previous one, regardless of the cluster node.
Default: false
multimaster.remote_functions
Provides a comma-separated list of function names that should be executed remotely on all multimaster nodes instead of replicating the result of their work.
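For example, with hypothetical function names, a maintenance routine that must run on every node can be listed so that the call itself, rather than its results, is replicated:
multimaster.remote_functions = 'public.refresh_local_cache,public.reset_node_counters'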
multimaster.use_rdma
Enables node-to-node connections using Remote Direct Memory Access (RDMA) technology. You must have an RDMA implementation set up on all the cluster nodes. For details, see Section 18.11.
F.32.5.2. Functions
mtm.get_nodes_state()
Shows the status of all nodes in the cluster. Returns a tuple of the following values:
id, integer
Node ID.
enabled, boolean
Shows whether the node is excluded from the cluster. The node can only be disabled if responses to heartbeats are not received within the heartbeat_recv_timeout time interval. When the node starts responding to heartbeats, multimaster can automatically restore the node and switch it back to the enabled state. Automatic recovery is only possible if the replication slot is still active. Otherwise, you can restore the node manually.
connected, boolean
Shows whether the node is connected to the WAL sender.
slot_active, boolean
Shows whether the node has an active replication slot. For a disabled node, the slot remains active until the max_recovery_lag value is reached.
stopped, boolean
Shows whether replication to this node was stopped by the mtm.stop_node() function. A stopped node acts as a disabled one, but cannot be automatically recovered. Call mtm.recover_node() to re-enable such a node.
catchUp, boolean
During the node recovery, shows whether the data is recovered up to the min_recovery_lag value.
slotLag, bigint
The size of WAL data that the replication slot holds for a disabled/stopped node. The slot is dropped when slotLag reaches the max_recovery_lag value.
avgTransDelay, bigint
An average commit delay caused by this node, in microseconds.
lastStatusChange, timestamp
Last time when the node changed its status (enabled/disabled).
oldestSnapshot, bigint
The oldest global snapshot existing on this node.
SenderPid, integer
Process ID of the WAL sender.
SenderStartTime, timestamp
WAL sender start time.
ReceiverPid, integer
Process ID of the WAL receiver.
ReceiverStartTime, timestamp
WAL receiver start time.
connStr, text
Connection string to this node.
connectivityMask, bigint
Bitmask representing connectivity to neighbor nodes. Each bit represents a connection to a node.
nHeartbeats, integer
The number of heartbeat responses received from this node.
mtm.collect_cluster_info()
Collects the data returned by the mtm.get_cluster_state() function from all available nodes. For this function to work, in addition to replication connections, pg_hba.conf must allow ordinary connections to the node with the specified connection string.
mtm.get_cluster_state()
Shows the status of the multimaster extension. Returns a tuple of the following values:
id, integer
Node ID.
status, text
Node status. Possible values are: Initialization, Offline, Connected, Online, Recovery, Recovered, InMinor, OutOfService.
disabledNodeMask, bigint
Bitmask of disabled nodes.
disconnectedNodeMask, bigint
Bitmask of disconnected nodes.
catchUpNodeMask, bigint
Bitmask of nodes that completed the recovery.
liveNodes, integer
Number of enabled nodes.
allNodes, integer
Number of nodes in the cluster. The majority of alive nodes is calculated based on this parameter.
nActiveQueries, integer
Number of queries being currently processed on this node.
nPendingQueries, integer
Number of queries waiting for execution on this node.
queueSize, bigint
Size of the pending query queue, in bytes.
transCount, bigint
The total number of replicated transactions processed by this node.
timeShift, bigint
Global snapshot shift caused by unsynchronized clocks on nodes, in microseconds.
recoverySlot, integer
The node from which a failed node gets data updates during automatic recovery.
xidHashSize, bigint
Size of xid2state hash.
gidHashSize, bigint
Size of gid2state hash.
oldestXid, bigint
The oldest transaction ID on this node.
configChanges, integer
Number of state changes (enabled/disabled) since the last reboot.
stalledNodeMask, bigint
Bitmask of nodes for which replication slots were dropped.
stoppedNodeMask, bigint
Bitmask of nodes that were stopped by mtm.stop_node().
lastStatusChange, timestamp
Timestamp of the last state change.
mtm.add_node(connstr text)
Adds a new node to the cluster.
Arguments:
connstr — connection string for the new node. For example, for the database mydb, user myuser, and the new node node4, the connection string is "dbname=mydb user=myuser host=node4".
Type: text
mtm.alter_sequences()
Fixes unique identifiers on all cluster nodes. This may be required after restoring all nodes from a single base backup.
mtm.stop_node(node integer, drop_slot bool default false)
Excludes a node from the cluster.
Arguments:
node — ID of the node to be dropped that you specified in the multimaster.node_id variable.
Type: integer
drop_slot — Optional. Defines whether the replication slot should be dropped together with the node. Set this option to true if you do not plan to restore the node in the future.
Type: boolean
Default: false
mtm.recover_node(node integer)
Creates a replication slot for the node that was previously dropped together with its slot.
Arguments:
node — ID of the node to be restored.
mtm.make_table_local(table_name regclass)
Stops replication for the specified table.
Arguments:
table_name — the table you would like to exclude from the replication scheme.
Type: regclass
mtm.copy_table(table_name regclass, node_id integer)
Copies the specified table to another node. You can use this function to restore the corrupted data on one or more nodes of the cluster. This function must be called on the node from which the table is copied.
Arguments:
table_name — the table you would like to copy.
Type: regclass
node_id — the node to copy the table to.
Type: integer
mtm.broadcast_table(table_name regclass)
Copies the specified table to all the alive nodes of the cluster. You can use this function to restore the corrupted data (see the example below). This function must be called on the node from which the table is copied.
Arguments:
table_name — the table you would like to copy.
Type: regclass
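For example, to repair a table whose contents diverged (table name and node ID are illustrative), run on a node that holds good data:
SELECT mtm.copy_table('public.accounts', 3);   -- repair node 3 only
SELECT mtm.broadcast_table('public.accounts'); -- or repair all alive nodes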
F.32.6. Compatibility
The multimaster extension currently passes almost all PostgreSQL regression tests, except for a few tests for edge cases related to working with temporary tables and updating enum values, which are not always transactional in PostgreSQL. We are currently working on providing full compatibility with the standard PostgreSQL.
F.32.7. Authors
Postgres Professional, Moscow, Russia.
F.32.7.1. Credits
The replication mechanism is based on logical decoding and an earlier version of the pglogical extension provided to the community by the 2ndQuadrant team.
The three-phase E3PC commit protocol is based on the following works:
Idit Keidar, Danny Dolev. Increasing the Resilience of Distributed and Replicated Database Systems.
Tim Kempster, Colin Stirling, Peter Thanisch. A More Committed Quorum-Based Three Phase Commit Protocol.