50.3. Streaming Replication Protocol
To initiate streaming replication, the frontend sends the replication parameter in the startup message. A Boolean value of true tells the backend to go into walsender mode, wherein a small set of replication commands can be issued instead of SQL statements. Only the simple query protocol can be used in walsender mode. Replication commands are logged in the server log when log_replication_commands is enabled. Passing database as the value instructs walsender to connect to the database specified in the dbname parameter, which will allow the connection to be used for logical replication from that database.
For the purpose of testing replication commands, you can make a replication connection via psql or any other libpq-using tool with a connection string including the replication option, e.g.:
psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
However, it is often more useful to use pg_receivexlog (for physical replication) or pg_recvlogical (for logical replication).
The commands accepted in walsender mode are:
- IDENTIFY_SYSTEM
Requests the server to identify itself. Server replies with a result set of a single row, containing four fields:
- systemid
The unique system identifier identifying the cluster. This can be used to check that the base backup used to initialize the standby came from the same cluster.
- timeline
Current TimelineID. Also useful to check that the standby is consistent with the master.
- xlogpos
Current xlog flush location. Useful to get a known location in the transaction log where streaming can start.
- dbname
Database connected to or NULL.
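The xlogpos field is returned as text in the XXX/XXX form that START_REPLICATION also accepts, while the CopyData messages described later carry WAL positions as Int64 values. A minimal sketch of converting between the two, assuming the two hexadecimal halves are the high and low 32 bits of the position (the function names are illustrative, not part of any library):

```python
def parse_lsn(text):
    """Convert an 'XXX/XXX' WAL location into a 64-bit integer.

    Assumption: the part before the slash is the high 32 bits and the
    part after it is the low 32 bits, both written in hexadecimal.
    """
    high, sep, low = text.partition("/")
    if not sep or not high or not low:
        raise ValueError("not a WAL location: %r" % (text,))
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(lsn):
    """Render a 64-bit WAL location back into 'XXX/XXX' form."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)
```

With this, the xlogpos value from IDENTIFY_SYSTEM can be compared numerically against the positions reported in later streaming messages.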
- TIMELINE_HISTORY tli
Requests the server to send over the timeline history file for timeline tli. Server replies with a result set of a single row, containing two fields. While the fields are labeled as text and bytea, they effectively return raw bytes, with no escaping or encoding conversion:
- filename
Filename of the timeline history file, e.g., 00000002.history.
- content
Contents of the timeline history file.
- CREATE_REPLICATION_SLOT slot_name { PHYSICAL | LOGICAL output_plugin }
Create a physical or logical replication slot. See Section 25.2.6 for more about replication slots.
slot_name
The name of the slot to create. Must be a valid replication slot name (see Section 25.2.6.1).
output_plugin
The name of the output plugin used for logical decoding (see Section 46.6).
- START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ]
Instructs server to start streaming WAL, starting at WAL position XXX/XXX. If the TIMELINE option is specified, streaming starts on timeline tli; otherwise, the server's current timeline is selected. The server can reply with an error, e.g., if the requested section of WAL has already been recycled. On success, server responds with a CopyBothResponse message, and then starts to stream WAL to the frontend.
If a slot's name is provided via slot_name, it will be updated as replication progresses so that the server knows which WAL segments, and if hot_standby_feedback is on which transactions, are still needed by the standby.
If the client requests a timeline that's not the latest, but is part of the history of the server, the server will stream all the WAL on that timeline starting from the requested startpoint, up to the point where the server switched to another timeline. If the client requests streaming at exactly the end of an old timeline, the server responds immediately with CommandComplete without entering COPY mode.
After streaming all the WAL on a timeline that is not the latest one, the server will end streaming by exiting the COPY mode. When the client acknowledges this by also exiting COPY mode, the server sends a result set with one row and two columns, indicating the next timeline in this server's history. The first column is the next timeline's ID, and the second column is the XLOG position where the switch happened. Usually, the switch position is the end of the WAL that was streamed, but there are corner cases where the server can send some WAL from the old timeline that it has not itself replayed before promoting. Finally, the server sends a CommandComplete message, and is ready to accept a new command.
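As an illustration of the command syntax, the following sketch assembles a physical START_REPLICATION command string. The helper name is hypothetical, and the slot-name check (lower-case letters, digits and underscores) is an assumption about the rules in Section 25.2.6.1, which are not reproduced here:

```python
import re

_LSN = re.compile(r"[0-9A-Fa-f]+/[0-9A-Fa-f]+")
# Assumed slot-name rule; see Section 25.2.6.1 for the authoritative one.
_SLOT = re.compile(r"[a-z0-9_]+")


def build_start_replication(start, slot=None, timeline=None):
    """Build 'START_REPLICATION [SLOT s] PHYSICAL XXX/XXX [TIMELINE tli]'."""
    if not _LSN.fullmatch(start):
        raise ValueError("start must look like XXX/XXX: %r" % (start,))
    parts = ["START_REPLICATION"]
    if slot is not None:
        if not _SLOT.fullmatch(slot):
            raise ValueError("unexpected slot name: %r" % (slot,))
        parts += ["SLOT", slot]
    # PHYSICAL is optional in the syntax; it is spelled out here for clarity.
    parts += ["PHYSICAL", start]
    if timeline is not None:
        parts += ["TIMELINE", str(int(timeline))]
    return " ".join(parts)
```

When the server reports the next timeline and switch position after an old timeline ends, a client could pass those two values back into such a helper to continue streaming.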
WAL data is sent as a series of CopyData messages. (This allows other information to be intermixed; in particular the server can send an ErrorResponse message if it encounters a failure after beginning to stream.) The payload of each CopyData message from server to the client contains a message of one of the following formats:
- XLogData (B)
- Byte1('w')
Identifies the message as WAL data.
- Int64
The starting point of the WAL data in this message.
- Int64
The current end of WAL on the server.
- Int64
The server's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.
- Byten
A section of the WAL data stream.
A single WAL record is never split across two XLogData messages. When a WAL record crosses a WAL page boundary, and is therefore already split using continuation records, it can be split at the page boundary. In other words, the first main WAL record and its continuation records can be sent in different XLogData messages.
- Primary keepalive message (B)
- Byte1('k')
Identifies the message as a sender keepalive.
- Int64
The current end of WAL on the server.
- Int64
The server's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.
- Byte1
1 means that the client should reply to this message as soon as possible, to avoid a timeout disconnect. 0 otherwise.
The receiving process can send replies back to the sender at any time, using one of the following message formats (also in the payload of a CopyData message):
- Standby status update (F)
- Byte1('r')
Identifies the message as a receiver status update.
- Int64
The location of the last WAL byte + 1 received and written to disk in the standby.
- Int64
The location of the last WAL byte + 1 flushed to disk in the standby.
- Int64
The location of the last WAL byte + 1 applied in the standby.
- Int64
The client's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.
- Byte1
If 1, the client requests the server to reply to this message immediately. This can be used to ping the server, to test if the connection is still healthy.
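A standby status update can be assembled as a 34-byte payload. The following is a sketch under the same assumptions as the message description (network byte order, and the 2000-01-01 epoch taken as UTC); the function name is illustrative:

```python
import struct
from datetime import datetime, timedelta, timezone

# Assumption: "midnight on 2000-01-01" is interpreted as UTC.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

# Byte1('r'), three Int64 positions, Int64 client time, Byte1 reply flag.
_STATUS_UPDATE = struct.Struct(">cQQQqB")


def build_status_update(written, flushed, applied, now=None, request_reply=False):
    """Build the CopyData payload for a standby status update.

    Each position is the location of the last WAL byte + 1, i.e. one past
    the last byte written, flushed, or applied.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    micros = (now - PG_EPOCH) // timedelta(microseconds=1)
    return _STATUS_UPDATE.pack(
        b"r", written, flushed, applied, micros, 1 if request_reply else 0
    )
```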
- Hot Standby feedback message (F)
- Byte1('h')
Identifies the message as a Hot Standby feedback message.
- Int64
The client's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.
- Int32
The standby's current xmin. This may be 0, if the standby is sending notification that Hot Standby feedback will no longer be sent on this connection. Later non-zero messages may reinitiate the feedback mechanism.
- Int32
The standby's current epoch.
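The Hot Standby feedback message is 17 bytes. In this sketch the two Int32 fields are packed as unsigned values so that the full transaction ID range fits; that choice is an assumption, although the bytes on the wire are the same either way. The function name is illustrative:

```python
import struct

# Byte1('h'), Int64 client time, Int32 xmin, Int32 epoch; big-endian.
_HS_FEEDBACK = struct.Struct(">cqII")


def build_hot_standby_feedback(send_time_us, xmin, epoch):
    """Build the CopyData payload for a Hot Standby feedback message.

    send_time_us is microseconds since midnight on 2000-01-01.
    Passing xmin=0 signals that feedback will no longer be sent.
    """
    return _HS_FEEDBACK.pack(b"h", send_time_us, xmin, epoch)
```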
- START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ ( option_name [ option_value ] [, ... ] ) ]
Instructs server to start streaming WAL for logical replication, starting at WAL position XXX/XXX. The server can reply with an error, e.g., if the requested section of WAL has already been recycled. On success, server responds with a CopyBothResponse message, and then starts to stream WAL to the frontend.
The messages inside the CopyBothResponse messages are of the same format documented for START_REPLICATION ... PHYSICAL.
The output plugin associated with the selected slot is used to process the output for streaming.
SLOT slot_name
The name of the slot to stream changes from. This parameter is required, and must correspond to an existing logical replication slot created with CREATE_REPLICATION_SLOT in LOGICAL mode.
XXX/XXX
The WAL position to begin streaming at.
option_name
The name of an option passed to the slot's logical decoding plugin.
option_value
Optional value, in the form of a string constant, associated with the specified option.
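The option list syntax can be illustrated with a small command builder. This is a sketch, not a definitive implementation: the helper names are hypothetical, and the quoting choices (option names always written as double-quoted identifiers, values as single-quoted string constants with embedded quotes doubled) are assumptions rather than rules stated in this section. The option names in the usage note are placeholders for whatever the chosen output plugin accepts.

```python
import re

_LSN = re.compile(r"[0-9A-Fa-f]+/[0-9A-Fa-f]+")


def _quote_ident(name):
    # Assumed quoting: double-quoted identifier, embedded quotes doubled.
    return '"' + name.replace('"', '""') + '"'


def _quote_literal(value):
    # String constant form: single-quoted, embedded quotes doubled.
    return "'" + value.replace("'", "''") + "'"


def build_logical_start(slot, start, options=None):
    """Build 'START_REPLICATION SLOT s LOGICAL XXX/XXX [ (opt [val], ...) ]'.

    options maps option_name to option_value, or to None for no value.
    """
    if not _LSN.fullmatch(start):
        raise ValueError("start must look like XXX/XXX: %r" % (start,))
    command = "START_REPLICATION SLOT %s LOGICAL %s" % (slot, start)
    if options:
        items = []
        for name, value in options.items():
            item = _quote_ident(name)
            if value is not None:
                item += " " + _quote_literal(value)
            items.append(item)
        command += " (" + ", ".join(items) + ")"
    return command
```

For example, `build_logical_start("s1", "0/0", {"opt-a": "0", "opt-b": None})` yields a command whose option list reads `("opt-a" '0', "opt-b")`.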
- DROP_REPLICATION_SLOT slot_name
Drops a replication slot, freeing any reserved server-side resources. If the slot is currently in use by an active connection, this command fails.
slot_name
The name of the slot to drop.
- BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ]
Instructs the server to start streaming a base backup. The system will automatically be put in backup mode before the backup is started, and taken out of it when the backup is complete. The following options are accepted:
LABEL 'label'
Sets the label of the backup. If none is specified, a backup label of base backup will be used. The quoting rules for the label are the same as a standard SQL string with standard_conforming_strings turned on.
PROGRESS
Request information required to generate a progress report. This will send back an approximate size in the header of each tablespace, which can be used to calculate how far along the stream is done. This is calculated by enumerating all the file sizes once before the transfer is even started, and may as such have a negative impact on the performance - in particular it may take longer before the first data is streamed. Since the database files can change during the backup, the size is only approximate and may both grow and shrink between the time of approximation and the sending of the actual files.
FAST
Request a fast checkpoint.
WAL
Include the necessary WAL segments in the backup. This will include all the files between start and stop backup in the pg_xlog directory of the base directory tar file.
NOWAIT
By default, the backup will wait until the last required xlog segment has been archived, or emit a warning if log archiving is not enabled. Specifying NOWAIT disables both the waiting and the warning, leaving the client responsible for ensuring the required log is available.
MAX_RATE rate
Limit (throttle) the maximum amount of data transferred from server to client per unit of time. The expected unit is kilobytes per second. If this option is specified, the value must either be equal to zero or it must fall within the range from 32 kB through 1 GB (inclusive). If zero is passed or the option is not specified, no restriction is imposed on the transfer.
TABLESPACE_MAP
Include information about symbolic links present in the directory pg_tblspc in a file named tablespace_map. The tablespace map file includes each symbolic link name as it exists in the directory pg_tblspc/ and the full path of that symbolic link.
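The options above can be combined into a command string as in the following sketch. The helper name is hypothetical. The label is quoted as a standard SQL string (embedded single quotes doubled), and the MAX_RATE check assumes that 1 GB means 1048576 kB:

```python
MAX_RATE_LOWER_KB = 32
MAX_RATE_UPPER_KB = 1024 * 1024  # assumption: 1 GB expressed in kB


def build_base_backup(label=None, progress=False, fast=False, wal=False,
                      nowait=False, max_rate=None, tablespace_map=False):
    """Build a BASE_BACKUP command with options in the documented order."""
    parts = ["BASE_BACKUP"]
    if label is not None:
        # Standard SQL string quoting: double any embedded single quote.
        parts.append("LABEL '" + label.replace("'", "''") + "'")
    if progress:
        parts.append("PROGRESS")
    if fast:
        parts.append("FAST")
    if wal:
        parts.append("WAL")
    if nowait:
        parts.append("NOWAIT")
    if max_rate is not None:
        # Zero means unthrottled; otherwise the rate (kB/s) must be in range.
        in_range = MAX_RATE_LOWER_KB <= max_rate <= MAX_RATE_UPPER_KB
        if max_rate != 0 and not in_range:
            raise ValueError("MAX_RATE out of range: %r" % (max_rate,))
        parts.append("MAX_RATE %d" % max_rate)
    if tablespace_map:
        parts.append("TABLESPACE_MAP")
    return " ".join(parts)
```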
When the backup is started, the server will first send two ordinary result sets, followed by one or more CopyResponse results.
The first ordinary result set contains the starting position of the backup, in a single row with two columns. The first column contains the start position given in XLogRecPtr format, and the second column contains the corresponding timeline ID.
The second ordinary result set has one row for each tablespace. The fields in this row are:
- spcoid
The oid of the tablespace, or NULL if it's the base directory.
- spclocation
The full path of the tablespace directory, or NULL if it's the base directory.
- size
The approximate size of the tablespace, if progress report has been requested; otherwise it's NULL.
After the second regular result set, one or more CopyResponse results will be sent, one for PGDATA and one for each additional tablespace other than pg_default and pg_global. The data in the CopyResponse results will be a tar format (following the “ustar interchange format” specified in the POSIX 1003.1-2008 standard) dump of the tablespace contents, except that the two trailing blocks of zeroes specified in the standard are omitted. After the tar data is complete, a final ordinary result set will be sent, containing the WAL end position of the backup, in the same format as the start position.
The tar archive for the data directory and each tablespace will contain all files in the directories, regardless of whether they are PostgreSQL files or other files added to the same directory. The only excluded files are:
- postmaster.pid
- postmaster.opts
- various temporary files created during the operation of the PostgreSQL server
- pg_xlog, including subdirectories. If the backup is run with WAL files included, a synthesized version of pg_xlog will be included, but it will only contain the files necessary for the backup to work, not the rest of the contents.
- pg_replslot is copied as an empty directory.
- Files other than regular files and directories, such as symbolic links and special device files, are skipped. (Symbolic links in pg_tblspc are maintained.)
Owner, group and file mode are set if the underlying file system on the server supports it.
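Because the two trailing zero blocks are omitted, a client that wants a standard-conforming archive can append them itself. The sketch below does so for a stream already collected in memory, assuming the standard 512-byte tar block size; the function names are illustrative:

```python
import io
import tarfile

TAR_BLOCK = 512  # block size of the ustar interchange format


def complete_tar(stream):
    """Append the two zero blocks that the server leaves out."""
    return stream + bytes(2 * TAR_BLOCK)


def list_members(stream):
    """Return the member names of one tablespace's tar stream."""
    archive = io.BytesIO(complete_tar(stream))
    with tarfile.open(fileobj=archive, mode="r:") as tar:
        return tar.getnames()
```

A real client would more likely write the CopyData chunks to a file as they arrive and append the trailer once the CopyResponse for that tablespace ends.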
Once all tablespaces have been sent, a final regular result set will be sent. This result set contains the end position of the backup, given in XLogRecPtr format as a single column in a single row.