pglogical_output defines a libpq subprotocol for streaming tuples, metadata, etc., from the decoding plugin to receivers.
This protocol is an inner layer in a stack:

- tcp or unix sockets
- libpq protocol
- libpq replication subprotocol (COPY BOTH etc)
- pglogical output plugin ⇒ consumer protocol

so clients can simply use libpq’s existing replication protocol support, directly or via their libpq-wrapper driver.
This is a binary protocol intended for compact representation.
pglogical_output also supports a json-based text protocol with json representations of the same changesets, supporting all the same hooks etc, intended mainly for tracing/debugging/diagnostics. That protocol is not discussed here.
Protocol flow
The protocol flow is primarily from upstream walsender/decoding plugin to the downstream receiver.
The only information that flows downstream-to-upstream is:

- the initial parameter list sent to START_REPLICATION; and
- replay progress messages
We can accept an arbitrary list of params to START_REPLICATION. After that we have no general purpose channel for information to flow upstream. That means we can’t do a multi-step negotiation/handshake for determining the replication options to use, binary protocol, etc.
The main form of negotiation is the client getting a "take it or leave it" set of settings from the server in an initial startup message sent before any replication data (see below) and, if it doesn’t like them, reconnecting with different startup options.
Except for the negotiation via initial parameter list and then startup message the protocol flow is the same as any other walsender-based logical replication plugin. The data stream is sent in COPY BOTH mode as a series of CopyData messages encapsulating replication data, and ends when the client disconnects. There’s no facility for ending the COPY BOTH mode and returning to the walsender command parser to issue new commands. This is a limitation of the walsender interface, not pglogical_output.
Protocol messages
The individual protocol messages are discussed in the following sub-sections. Protocol flow and logic comes in the next major section.
Absolutely all top-level protocol messages begin with a message type byte. While represented in code as a character, this is a signed byte with no associated encoding.
Since the PostgreSQL libpq COPY protocol supplies a message length there’s no need for top-level protocol messages to embed a length in their header.
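As an illustration, a receiver can dispatch on that first byte of each CopyData payload. This is a hypothetical Python sketch: the handler names are placeholders, not part of the protocol, and the CopyData payload is assumed to have already been unwrapped from the libpq stream.

```python
def dispatch(payload: bytes) -> str:
    """Return the handler name for one CopyData payload, keyed on the
    message type byte. Handler names here are illustrative stubs."""
    handlers = {
        b"B": "begin",     # BEGIN
        b"O": "origin",    # forwarded transaction origin
        b"R": "metadata",  # table metadata
        b"I": "insert",
        b"U": "update",
        b"D": "delete",
        b"C": "commit",    # COMMIT
        b"S": "startup",   # startup message, first message on the wire
    }
    msgtype = payload[0:1]
    try:
        return handlers[msgtype]
    except KeyError:
        # Unrecognised top-level message types are a protocol error.
        raise ValueError(f"unrecognised message type {msgtype!r}")
```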
BEGIN message
A stream of rows starts with a BEGIN message. Rows may only be sent after a BEGIN and before a COMMIT.
| Message | Type/Size | Notes |
|---------|-----------|-------|
| Message type | signed char | Literal ‘B’ (0x42) |
| flags | uint8 | 0-3: Reserved, client must ERROR if set and not recognised. |
| lsn | uint64 | “final_lsn” in decoding context - currently it means lsn of commit |
| commit time | uint64 | “commit_time” in decoding context |
| remote XID | uint32 | “xid” in decoding context |
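As an illustrative sketch, a BEGIN payload can be decoded with Python's struct module. Big-endian (network) byte order is an assumption here, as is the exact field packing; field names follow the table above.

```python
import struct

def parse_begin(payload: bytes) -> dict:
    """Parse a BEGIN message ('B'): flags, lsn, commit time, remote XID.

    Sketch only: assumes big-endian byte order for multi-byte integers.
    """
    msgtype, flags, lsn, commit_time, xid = struct.unpack_from(">cBQQI", payload)
    if msgtype != b"B":
        raise ValueError("not a BEGIN message")
    if flags & 0x0F:
        # Bits 0-3 are reserved; the client must ERROR if any are set.
        raise ValueError("unrecognised BEGIN flags")
    return {"lsn": lsn, "commit_time": commit_time, "xid": xid}
```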
Forwarded transaction origin message
The message after the BEGIN may be a forwarded transaction origin message indicating what upstream node the transaction came from.

Sent if the immediately prior message was a BEGIN message, the upstream transaction was forwarded from another node, and replication origin forwarding is enabled, i.e. forward_changeset_origins is t in the startup reply message.
A "node" could be another host, another DB on the same host, or pretty much anything. Whatever origin name is found gets forwarded. The origin identifier is of arbitrary and application-defined format. Applications should prefix their origin identifier with a fixed application name part, like bdr_, myapp_, etc. It is application-defined what an application does with forwarded transactions from other applications.
An origin message with a zero-length origin name indicates that the origin could not be identified but was (probably) not the local node. It is client-defined what action is taken in this case.
It is a protocol error to send/receive a forwarded transaction origin message at any time other than immediately after a BEGIN message.
The origin identifier is typically closely related to replication slot names and replication origins’ names in an application system.
For more detail see Changeset Forwarding in the README.
| Message | Type/Size | Notes |
|---------|-----------|-------|
| Message type | signed char | Literal ‘O’ (0x4f) |
| flags | uint8 | 0-3: Reserved, application must ERROR if set and not recognised |
| origin_lsn | uint64 | Log sequence number (LSN, XLogRecPtr) of the transaction’s commit record on its origin node (as opposed to the forwarding node’s commit LSN, which is ‘lsn’ in the BEGIN message) |
| origin_identifier_length | uint8 | Length in bytes of origin_identifier |
| origin_identifier | signed char[origin_identifier_length] | An origin identifier of arbitrary, upstream-application-defined structure. Should be text in the same encoding as the upstream database. NULL-terminated. Should be 7-bit ASCII. |
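The length-prefixed, NULL-terminated identifier can be read as in this hedged sketch (big-endian byte order is an assumption; the length is assumed to cover the NULL terminator, which is stripped before decoding):

```python
import struct

def parse_origin(payload: bytes) -> dict:
    """Parse a forwarded transaction origin message ('O').

    An empty origin name means the origin could not be identified but
    was (probably) not the local node.
    """
    msgtype, flags, origin_lsn, ident_len = struct.unpack_from(">cBQB", payload)
    if msgtype != b"O":
        raise ValueError("not an origin message")
    off = struct.calcsize(">cBQB")
    ident = payload[off:off + ident_len]
    # origin_identifier is NULL-terminated 7-bit ASCII; strip the terminator.
    return {"origin_lsn": origin_lsn,
            "origin": ident.rstrip(b"\x00").decode("ascii")}
```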
COMMIT message
A stream of rows ends with a COMMIT message.
There is no ROLLBACK message because aborted transactions are not sent by the upstream.
| Message | Type/Size | Notes |
|---------|-----------|-------|
| Message type | signed char | Literal ‘C’ (0x43) |
| Flags | uint8 | 0-3: Reserved, client must ERROR if set and not recognised |
| Commit LSN | uint64 | commit_lsn in decoding commit decode callback. This is the same value as in the BEGIN message, and marks the end of the transaction. |
| End LSN | uint64 | end_lsn in decoding transaction context |
| Commit time | uint64 | commit_time in decoding transaction context |
INSERT, UPDATE or DELETE message
After a BEGIN or metadata message, the downstream should expect to receive zero or more row change messages, composed of an insert/update/delete message with zero or more tuple fields, each of which has one or more tuple field values.
The row’s relidentifier must match that of the most recently preceding metadata message. All consecutive row messages must currently have the same relidentifier. (Later extensions to add metadata caching will relax these requirements for clients that advertise caching support; see the documentation on metadata messages for more detail).
It is an error to decode rows using metadata received after the row was received, or using metadata that is not the most recently received metadata revision that still predates the row. I.e. in the sequence M1, R1, R2, M2, R3, M4: R1 and R2 must be decoded using M1, and R3 must be decoded using M2. It is an error to use M4 to decode any of the rows, to use M1 to decode R3, or to use M2 to decode R1 and R2.
Row messages may not arrive except during a transaction as delimited by BEGIN and COMMIT messages. It is an error to receive a row message outside a transaction.
Any unrecognised tuple type or tuple part type is an error on the downstream that must result in a client disconnect and error message. Downstreams are expected to negotiate compatibility, and upstreams must not add new tuple types or tuple field types without negotiation.
The downstream reads rows until the next non-row message is received. There is no other end marker or any indication of how many rows to expect in a sequence.
Row message header
| Message | Type/Size | Notes |
|---------|-----------|-------|
| Message type | signed char | Literal ‘I’nsert (0x49), ‘U’pdate (0x55) or ‘D’elete (0x44) |
| flags | uint8 | Row flags (reserved) |
| relidentifier | uint32 | relidentifier that matches the table metadata message sent for this row. (Not present in BDR, which sends nspname and relname instead) |
| [tuple parts] | [composite] | One or more tuple-parts fields follow. |
Tuple fields
| Message | Type/Size | Notes |
|---------|-----------|-------|
| Tuple type | signed char | Identifies the kind of tuple being sent. |
| tupleformat | signed char | ‘T’ (0x54) |
| natts | uint16 | Number of fields sent in this tuple part. (Present in BDR, but meaning significantly different here) |
| [tuple field values] | [composite] | |
Tuple tupleformat compatibility
Unrecognised tupleformat kinds are a protocol error for the downstream.
Tuple field value fields
These message parts describe individual fields within a tuple.
There are two kinds of tuple value fields, abbreviated and full. Which is being read is determined based on the first field, kind.
Abbreviated tuple value fields are nothing but the message kind:
| Message | Type/Size | Notes |
|---------|-----------|-------|
| kind | signed char | ‘n’ull (0x6e) field |
Full tuple value fields have a length and datum:
| Message | Type/Size | Notes |
|---------|-----------|-------|
| kind | signed char | ‘i’nternal binary (0x69), ‘b’inary send/recv (0x62) or ‘t’ext (0x74) field |
| length | int4 | Only defined for kind = i\|b\|t |
| data | [length] | Data in a format defined by the table metadata and column kind. |
Tuple field values kind compatibility
Unrecognised field kind values are a protocol error for the downstream. The downstream may not continue processing the protocol stream after this point.
The upstream may not send ‘i’nternal or ‘b’inary format values to the downstream without the downstream negotiating acceptance of such values. The downstream will also generally negotiate to receive type information to use to decode the values. See the section on startup parameters and the startup message for details.
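Putting the row header, tuple fields and tuple field value formats together, a receiver might decode a row message as in this Python sketch. Big-endian byte order and a single tuple part are assumptions, and the tuple-type byte is read but not interpreted here.

```python
import struct

def parse_row(payload: bytes) -> dict:
    """Parse an INSERT/UPDATE/DELETE row message and its 'T' tuple.

    Sketch only: assumes one tuple part; UPDATE/DELETE key tuples are
    not handled separately.
    """
    msgtype, flags, relid, tupletype, tupleformat, natts = struct.unpack_from(
        ">cBIccH", payload)
    if msgtype not in (b"I", b"U", b"D"):
        raise ValueError("not a row message")
    if tupleformat != b"T":
        raise ValueError("unrecognised tupleformat")  # protocol error
    off = struct.calcsize(">cBIccH")
    fields = []
    for _ in range(natts):
        kind = payload[off:off + 1]
        off += 1
        if kind == b"n":                   # abbreviated: null field, no body
            fields.append(None)
        elif kind in (b"i", b"b", b"t"):   # full: length then datum
            (length,) = struct.unpack_from(">i", payload, off)
            off += 4
            fields.append(payload[off:off + length])
            off += length
        else:
            raise ValueError("unrecognised field kind")  # protocol error
    return {"op": msgtype.decode(), "relidentifier": relid, "fields": fields}
```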
Table/row metadata messages
Before sending changed rows for a relation, a metadata message for the relation must be sent so the downstream knows the namespace, table name, column names, optional column types, etc. A relidentifier field, an arbitrary numeric value unique for that relation on that upstream connection, maps the metadata to following rows.
A client should not assume that relation metadata will be followed immediately (or at all) by rows, since future changes may lead to metadata messages being delivered at other times. Metadata messages may arrive during or between transactions.
The upstream may not assume that the downstream retains more metadata than the one most recent table metadata message. This applies across all tables, so a client is permitted to discard metadata for table x when getting metadata for table y. The upstream must send a new metadata message before sending rows for a different table, even if that metadata was already sent in the same session or even same transaction. This requirement will later be weakened by the addition of client metadata caching, which will be advertised to the upstream with an output plugin parameter.
Columns in metadata messages are numbered from 0 to natts-1, reading consecutively from start to finish. The column numbers do not have to be a complete description of the columns in the upstream relation, so long as all columns that will later have row values sent are described. The upstream may choose to omit columns it doesn’t expect to send changes for in any given series of rows. Column numbers are not necessarily stable across different sets of metadata for the same table, even if the table hasn’t changed structurally.
A metadata message may not be used to decode rows received before that metadata message.
Table metadata header
| Message | Type/Size | Notes |
|---------|-----------|-------|
| Message type | signed char | Literal ‘R’ (0x52) |
| flags | uint8 | 0-6: Reserved, client must ERROR if set and not recognised. |
| relidentifier | uint32 | Arbitrary relation id, unique for this upstream. In practice this will probably be the upstream table’s oid, but the downstream can’t assume anything. |
| nspnamelength | uint8 | Length of namespace name |
| nspname | signed char[nspnamelength] | Relation namespace (null terminated) |
| relnamelength | uint8 | Length of relation name |
| relname | signed char[relnamelength] | Relation name (null terminated) |
| attrs block | signed char | Literal: ‘A’ (0x41) |
| natts | uint16 | Number of attributes |
| [fields] | [composite] | Sequence of ‘natts’ column metadata blocks, each of which begins with a column delimiter followed by zero or more column metadata blocks, each with the same column metadata block header. This chunked format is used so that new metadata messages can be added without breaking existing clients. |
Column delimiter
Each column’s metadata begins with a column metadata header. This comes immediately after the natts field in the table metadata header or after the last metadata block in the prior column.
It has the same char header as all the others, and the flags field is the same size as the length field in other blocks, so it’s safe to read this as a column metadata block header.
| Message | Type/Size | Notes |
|---------|-----------|-------|
| blocktype | signed char | ‘C’ (0x43) - column |
| flags | uint8 | Column info flags |
Column metadata block header
All column metadata blocks share the same header, which is the same length as a column delimiter:
| Message | Type/Size | Notes |
|---------|-----------|-------|
| blocktype | signed char | Identifies the kind of metadata block that follows. |
| blockbodylength | uint16 | Length of block in bytes, excluding blocktype char and length field. |
Column name block
This block just carries the name of the column, nothing more. It begins with a column metadata block header, and the rest of the block is the column name.
| Message | Type/Size | Notes |
|---------|-----------|-------|
| [column metadata block header] | [composite] | blocktype = ‘N’ (0x4e) |
| colname | char[blockbodylength] | Column name. |
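The chunked column-block format can be walked as in this sketch, which skips unrecognised block types using their length field (that forward-compatibility is the point of the format). Big-endian byte order is an assumption, and only the column name block is interpreted.

```python
import struct

def parse_metadata(payload: bytes) -> dict:
    """Parse a table metadata message ('R') into relidentifier, names,
    and per-column metadata. Sketch only."""
    if payload[0:1] != b"R":
        raise ValueError("not a metadata message")
    off = 2  # skip message type and flags
    (relid,) = struct.unpack_from(">I", payload, off); off += 4
    (nsplen,) = struct.unpack_from(">B", payload, off); off += 1
    nspname = payload[off:off + nsplen].rstrip(b"\x00").decode(); off += nsplen
    (rellen,) = struct.unpack_from(">B", payload, off); off += 1
    relname = payload[off:off + rellen].rstrip(b"\x00").decode(); off += rellen
    if payload[off:off + 1] != b"A":
        raise ValueError("expected attrs block")
    off += 1
    (natts,) = struct.unpack_from(">H", payload, off); off += 2
    columns = []
    while off < len(payload):
        blocktype = payload[off:off + 1]
        if blocktype == b"C":      # column delimiter starts a new column
            columns.append({})
            off += 2               # blocktype + flags
        else:                      # metadata block: blocktype, length, body
            (bodylen,) = struct.unpack_from(">H", payload, off + 1)
            body = payload[off + 3:off + 3 + bodylen]
            if blocktype == b"N":  # column name block
                columns[-1]["name"] = body.rstrip(b"\x00").decode()
            off += 3 + bodylen     # unknown block types are skipped
    assert len(columns) == natts
    return {"relidentifier": relid, "nspname": nspname,
            "relname": relname, "columns": columns}
```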
Column type block
T.B.D.
Not defined in first protocol revision.
Likely to send a type identifier (probably the upstream oid) as a reference to a “type info” protocol message to be delivered before. Then we can cache the type descriptions and avoid repeating long schemas and names, just using the oids.
Needs to have room to handle:

- built-in core types
- extension types (ext version may vary)
- enum types (CREATE TYPE … AS ENUM)
- range types (CREATE TYPE … AS RANGE)
- composite types (CREATE TYPE … AS (…))
- custom types (CREATE TYPE ( input = x_in, output = x_out ))

… some of which can be nested
Startup message
After processing output plugin arguments, the upstream output plugin must send a startup message as its first message on the wire. It is a trivial header followed by alternating key and value strings represented as null-terminated unsigned char strings.
This message specifies the capabilities the output plugin enabled and describes the upstream server and plugin. This may change how the client decodes the data stream, and/or permit the client to disconnect and report an error to the user if the result isn’t acceptable.
If replication is rejected because the client is incompatible or the server is unable to satisfy required options, the startup message may be followed by a libpq protocol FATAL message that terminates the session. See “Startup errors” below.
The parameter names and values are sent as alternating key/value pairs as null-terminated strings, e.g. “key1\0value1\0key2\0value2\0”
| Message | Type/Size | Notes |
|---------|-----------|-------|
| Message type | signed char | ‘S’ (0x53) - startup |
| Startup message version | uint8 | Value is always “1”. |
| (parameters) | null-terminated key/value pairs | See table below for parameter definitions. |
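The alternating key/value layout can be unpacked as below. This sketch assumes the version byte is the numeric value 1 (the table says uint8 “1”) and that the parameter area ends with a trailing NUL.

```python
def parse_startup(payload: bytes) -> dict:
    """Parse the startup message ('S') into a parameter dict.

    The body is alternating NUL-terminated key and value strings,
    e.g. b"key1\x00value1\x00key2\x00value2\x00".
    """
    if payload[0:1] != b"S":
        raise ValueError("not a startup message")
    if payload[1] != 1:
        raise ValueError("unsupported startup message version")
    # Split on NUL; the trailing terminator leaves one empty tail element.
    parts = payload[2:].split(b"\x00")[:-1]
    if len(parts) % 2:
        raise ValueError("odd number of startup parameter strings")
    return {k.decode(): v.decode() for k, v in zip(parts[::2], parts[1::2])}
```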
Startup message parameters
Since all parameter values are sent as strings, the value types given below specify what the value must be reasonably interpretable as.
| Key name | Value type | Description |
|----------|------------|-------------|
| max_proto_version | integer | Newest version of the protocol supported by output plugin. |
| min_proto_version | integer | Oldest protocol version supported by server. |
| proto_format | text | Protocol format requested. native (documented here) or json. Default is native. |
| coltypes | boolean | Column types will be sent in table metadata. |
| pg_version_num | integer | PostgreSQL server_version_num of server, if it’s PostgreSQL. e.g. 090400 |
| pg_version | string | PostgreSQL server_version of server, if it’s PostgreSQL. |
| pg_catversion | uint32 | Version of the PostgreSQL system catalogs on the upstream server, if it’s PostgreSQL. |
| binary | set of parameters, specified separately | See “_the ‘binary’ parameters_” below, and “_Parameters relating to exchange of binary values_” |
| encoding | string | The text encoding used in the upstream database. Used for text fields. |
| forward_changesets | bool | Specifies that all transactions, not just those originating on the upstream, will be forwarded. See “_Changeset forwarding_”. |
| forward_changeset_origins | bool | Tells the client that the server will send changeset origin information. Independent of forward_changesets. See “_Changeset forwarding_” for details. |
| no_txinfo | bool | Requests that variable transaction info such as XIDs, LSNs, and timestamps be omitted from output. Mainly for tests. Currently ignored for protos other than json. |
The ‘binary’ parameter set
| Key name | Value type | Description |
|----------|------------|-------------|
| binary.internal_basetypes | boolean | If true, PostgreSQL internal binary representations for row field data may be used for some or all row fields, where the type is appropriate and the binary compatibility parameters of upstream and downstream match. See binary.want_internal_basetypes in the output plugin parameters for details. May only be true if binary.want_internal_basetypes was set to true by the client in the parameters and the client’s accepted binary format matches that of the server. |
| binary.binary_basetypes | boolean | If true, external binary format (send/recv format) may be used for some or all row field data where the field type is a built-in base type whose send/recv format is compatible with binary.binary_pg_version. May only be set if binary.want_binary_basetypes was set to true by the client in the parameters and the client’s accepted send/recv format matches that of the server. |
| binary.binary_pg_version | uint16 | The PostgreSQL major version that send/recv format values will be compatible with. This is not necessarily the actual upstream PostgreSQL version. |
| binary.sizeof_int | uint8 | sizeof(int) on the upstream. |
| binary.sizeof_long | uint8 | sizeof(long) on the upstream. |
| binary.sizeof_datum | uint8 | Same as sizeof_int, but for the PostgreSQL Datum typedef. |
| binary.maxalign | uint8 | Upstream PostgreSQL server’s MAXIMUM_ALIGNOF value - platform dependent, determined at build time. |
| binary.bigendian | bool | True iff the upstream is big-endian. |
| binary.float4_byval | bool | Upstream PostgreSQL’s float4_byval compile option. |
| binary.float8_byval | bool | Upstream PostgreSQL’s float8_byval compile option. |
| binary.integer_datetimes | bool | Whether TIME, TIMESTAMP and TIMESTAMP WITH TIME ZONE will be sent using integer or floating point representation. Usually this is the value of the upstream PostgreSQL’s integer_datetimes compile option. |
Startup errors
If the server rejects the client’s connection - due to non-overlapping protocol support, unrecognised parameter formats, unsupported required parameters like hooks, etc - then it will follow the startup reply message with a normal libpq protocol error message. (Current versions send this before the startup message).
Arguments client supplies to output plugin
The one opportunity for the downstream client to send information (other than replay feedback) to the upstream is at connect-time, as an array of arguments to the output plugin supplied to START_REPLICATION.
There is no back-and-forth, no handshake.
As a result, the client mainly announces capabilities and makes requests of the output plugin. The output plugin will ERROR if required parameters are unset, or where incompatibilities that cannot be resolved are found. Otherwise the output plugin reports what it could and could not honour in the startup message it sends as the first message on the wire down to the client. The client chooses whether to continue replay or to disconnect and report an error to the user, then possibly reconnect with different options.
Output plugin arguments
The output plugin’s key/value arguments are specified in pairs, as key and value. They’re what’s passed to START_REPLICATION, etc.
All parameters are passed in text form. They should be limited to 7-bit ASCII, since the server’s text encoding is not known, but may be normalized precomposed UTF-8. The types specified for parameters indicate what the output plugin should attempt to convert the text into. Clients should not send text values that are outside the range for that type.
Capabilities
Many values are capabilities flags for the client, indicating that it understands optional features like metadata caching, binary format transfers, etc. In general the output plugin may disregard capabilities the client advertises as supported and act as if they are not supported. If a capability is advertised as unsupported or is not advertised the output plugin must not enable the corresponding features.
In other words, don’t send the client something it’s not expecting.
Protocol versioning
Two parameters max_proto_version and min_proto_version, which clients must always send, allow negotiation of the protocol version. The output plugin must ERROR if the client protocol support does not overlap its own protocol support range.
The protocol version is only incremented when there are major breaking changes that all or most clients must be modified to accommodate. Most changes are done by adding new optional messages and/or by having clients advertise capabilities to opt in to features.
Because these versions are expected to be incremented, to make it clear that the format of the startup parameters themselves haven’t changed, the first key/value pair must be the parameter startup_params_format with value “1”.
| Key | Type | Value(s) | Notes |
|-----|------|----------|-------|
| startup_params_format | int8 | 1 | The format version of this startup parameter set. Always the digit 1 (0x31), null terminated. |
| max_proto_version | int32 | 1 | Newest version of the protocol supported by client. Output plugin must ERROR if supported version too old. Required, ERROR if missing. |
| min_proto_version | int32 | 1 | Oldest version of the protocol supported by client. Output plugin must ERROR if supported version too old. Required, ERROR if missing. |
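A client might assemble its argument list as in this sketch. The flat key/value ordering and the required leading startup_params_format pair follow the text above; the "t"/"f" boolean spelling is an assumption, and how the list is handed to START_REPLICATION depends on the client library.

```python
def build_plugin_args(client_params: dict) -> list:
    """Build the flat output plugin argument list for START_REPLICATION.

    startup_params_format must be the first pair; max_proto_version and
    min_proto_version are required. All values are sent as text.
    """
    args = [("startup_params_format", "1"),
            ("max_proto_version", "1"),
            ("min_proto_version", "1")]
    for key, value in client_params.items():
        if isinstance(value, bool):
            value = "t" if value else "f"  # assumed boolean spelling
        args.append((key, str(value)))
    # Flatten pairs into alternating key, value entries.
    return [item for pair in args for item in pair]
```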
Client requirements and capabilities
| Key | Type | Default | Notes |
|-----|------|---------|-------|
| expected_encoding | string | null | The text encoding the downstream expects results to be in. If specified, the upstream must honour it. |
| forward_changesets | bool | false | Request that all transactions, not just those originating on the upstream, be forwarded. See “_Changeset forwarding_”. |
| want_coltypes | boolean | false | The client wants to receive data type information about columns. |
General client information
These keys tell the output plugin about the client. They’re mainly for informational purposes. In particular, the versions must not be used to determine compatibility for binary or send/recv format, as non-PostgreSQL clients will simply not send them at all but may still understand binary or send/recv format fields.
| Key | Type | Default | Notes |
|-----|------|---------|-------|
| pg_version_num | integer | null | PostgreSQL server_version_num of client, if it’s PostgreSQL. e.g. 090400 |
| pg_version | string | null | PostgreSQL server_version of client, if it’s PostgreSQL. |
Parameters relating to exchange of binary values
The downstream may specify to the upstream that it is capable of understanding binary (PostgreSQL internal binary datum format), and/or send/recv (PostgreSQL binary interchange) format data by setting the binary.want_binary_basetypes and/or binary.want_internal_basetypes options, or other yet-to-be-defined options.
An upstream output plugin that does not support one or both formats may ignore the downstream’s binary support and send text format, in which case it may ignore all binary. parameters. All downstreams must support text format. An upstream output plugin must not send binary or send/recv format unless the downstream has announced it can receive it. If both upstream and downstream support both formats an upstream should prefer binary format and fall back to send/recv, then to text, if compatibility requires.
Internal and binary format selection should be done on a type-by-type basis. It is quite normal to send ‘text’ format for extension types while sending binary for built-in types.
The downstream must specify its compatibility requirements for internal and binary data if it requests either or both formats. The upstream must honour these by falling back from binary to send/recv, and from send/recv to text, where the upstream and downstream are not compatible.
An unspecified compatibility field must be presumed to be unsupported by the downstream, so that older clients that don’t know about a change in a newer version don’t receive unexpected data. For example, in the unlikely event that PostgreSQL 99.8 switched to 128-bit DPD (Densely Packed Decimal) representations of NUMERIC instead of the current arbitrary-length BCD (Binary Coded Decimal) format, a new binary.dpd_numerics parameter would be added. Clients that didn’t know about the change wouldn’t know to set it, so the upstream would presume it unsupported and send text format NUMERIC to those clients. This also means that clients that support the new format wouldn’t be able to receive the old format in binary from older servers, since they’d specify dpd_numerics = true in their compatibility parameters.
At this time a downstream may specify compatibility with only one value for a given option; i.e. a downstream cannot say it supports both 4-byte and 8-byte sizeof(int). Leaving it unspecified means the upstream must assume the downstream supports neither. (A future protocol extension may allow clients to specify alternative sets of supported formats).
The pg_version option must not be used to decide compatibility. Use binary.basetypes_major_version instead.
| Key name | Value type | Default | Description |
|----------|------------|---------|-------------|
| binary.want_binary_basetypes | boolean | false | True if the client accepts binary interchange (send/recv) format rows for PostgreSQL built-in base types. |
| binary.want_internal_basetypes | boolean | false | True if the client accepts PostgreSQL internal-format binary output for base PostgreSQL types not otherwise specified elsewhere. |
| binary.basetypes_major_version | uint16 | null | The PostgreSQL major version (x.y) the downstream expects binary and send/recv format values to be in. Represented as an integer in XXYY format (no leading zero since it’s an integer), e.g. 9.5 is 905. This corresponds to PG_VERSION_NUM/100 in PostgreSQL. |
| binary.sizeof_int | uint8 | null | sizeof(int) on the downstream. |
| binary.sizeof_long | uint8 | null | sizeof(long) on the downstream. |
| binary.sizeof_datum | uint8 | null | Same as sizeof_int, but for the PostgreSQL Datum typedef. |
| binary.maxalign | uint8 | null | Downstream PostgreSQL server’s maxalign value - platform dependent, determined at build time. |
| binary.bigendian | bool | null | True iff the downstream is big-endian. |
| binary.float4_byval | bool | null | Downstream PostgreSQL’s float4_byval compile option. |
| binary.float8_byval | bool | null | Downstream PostgreSQL’s float8_byval compile option. |
| binary.integer_datetimes | bool | null | Downstream PostgreSQL’s integer_datetimes compile option. |
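Some of these values a client can derive from its own platform, while others must come from the local PostgreSQL build (pg_config / pg_controldata). This sketch computes what it can and marks the rest as placeholder examples; nothing in it should be taken as real values for any particular build.

```python
import struct
import sys

def local_binary_params() -> dict:
    """Compute binary.* compatibility parameters for the local platform.

    sizeof_datum, maxalign, the *_byval flags and integer_datetimes
    depend on the local PostgreSQL build; the values shown for them
    below are placeholders, not detected values.
    """
    return {
        "binary.want_internal_basetypes": "t",
        "binary.want_binary_basetypes": "t",
        "binary.basetypes_major_version": "905",  # example: targeting 9.5
        "binary.sizeof_int": str(struct.calcsize("i")),
        "binary.sizeof_long": str(struct.calcsize("l")),
        "binary.sizeof_datum": "8",               # placeholder, from PG build
        "binary.maxalign": "8",                   # placeholder, MAXIMUM_ALIGNOF
        "binary.bigendian": "t" if sys.byteorder == "big" else "f",
        "binary.float4_byval": "t",               # placeholder compile option
        "binary.float8_byval": "t",               # placeholder compile option
        "binary.integer_datetimes": "t",          # placeholder compile option
    }
```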
Extensibility
Because of the use of optional parameters in output plugin arguments, and the confirmation/response sent in the startup packet, a basic handshake is possible between upstream and downstream, allowing negotiation of capabilities.
The output plugin must never send non-optional data or change its wire format without confirmation from the client that it can understand the new data. It may send optional data without negotiation.
When extending the output plugin arguments, add-ons are expected to prefix all keys with the extension name, and should preferably use a single top level key with a json object value to carry their extension information. Additions to the startup message should follow the same pattern.
Hooks and plugins can be used to add functionality specific to a client.
JSON protocol
If proto_format is set to json then the output plugin will emit JSON instead of the custom binary protocol. JSON support is intended mainly for debugging and diagnostics.

The JSON format supports all the same hooks.