pglogical_output defines a libpq subprotocol for streaming tuples, metadata, etc. from the decoding plugin to receivers.

This protocol is an inner layer in a stack:

  • tcp or unix sockets

    • libpq protocol

      • libpq replication subprotocol (COPY BOTH etc)

        • pglogical_output plugin ⇒ consumer protocol

so clients can simply use libpq’s existing replication protocol support, directly or via their libpq-wrapper driver.

This is a binary protocol intended for compact representation.

pglogical_output also supports a json-based text protocol with json representations of the same changesets, supporting all the same hooks etc, intended mainly for tracing/debugging/diagnostics. That protocol is not discussed here.


Protocol flow

The protocol flow is primarily from upstream walsender/decoding plugin to the downstream receiver.

The only information that flows downstream-to-upstream is:

  • The initial parameter list sent to START_REPLICATION; and

  • replay progress messages

We can accept an arbitrary list of params to START_REPLICATION. After that we have no general purpose channel for information to flow upstream. That means we can’t do a multi-step negotiation/handshake for determining the replication options to use, binary protocol, etc.

The main form of negotiation is the client getting a "take it or leave it" set of settings from the server in an initial startup message sent before any replication data (see below) and, if it doesn’t like them, reconnecting with different startup options.

Except for the negotiation via the initial parameter list and then the startup message, the protocol flow is the same as for any other walsender-based logical replication plugin. The data stream is sent in COPY BOTH mode as a series of CopyData messages encapsulating replication data, and ends when the client disconnects. There’s no facility for ending COPY BOTH mode and returning to the walsender command parser to issue new commands. This is a limitation of the walsender interface, not of pglogical_output.
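
As an illustration only, the following minimal sketch uses psycopg2’s logical replication support to open the stream. The DSN, slot name and the handle_message() dispatcher are hypothetical placeholders, and only the required protocol-version parameters are shown; see “Output plugin arguments” below for the full parameter set.

```python
import psycopg2
import psycopg2.extras

# Hypothetical DSN and slot name; the slot must use the pglogical_output plugin.
conn = psycopg2.connect(
    "dbname=sourcedb",
    connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()
cur.create_replication_slot('demo_slot', output_plugin='pglogical_output')  # skip if it already exists

# Everything downstream-to-upstream is sent here: output plugin arguments are
# passed as START_REPLICATION options, with all values as text.
cur.start_replication(slot_name='demo_slot', options={
    'startup_params_format': '1',
    'max_proto_version': '1',
    'min_proto_version': '1',
})

def consume(msg):
    # Each CopyData payload is one pglogical_output protocol message;
    # handle_message() is the dispatcher sketched under "Protocol messages" below.
    handle_message(msg.payload)
    # ...and replay progress messages flow back upstream.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)
```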

Protocol messages

The individual protocol messages are discussed in the following sub-sections. Protocol flow and logic comes in the next major section.

Absolutely all top-level protocol messages begin with a message type byte. While represented in code as a character, this is a signed byte with no associated encoding.

Since the PostgreSQL libpq COPY protocol supplies a message length there’s no need for top-level protocol messages to embed a length in their header.
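
For example, a downstream might dispatch on that first byte alone. This is a sketch; the handle_* functions are hypothetical, and several of them are sketched in later sections.

```python
def handle_message(payload: bytes) -> None:
    # The libpq COPY framing already supplies the message length, so the
    # payload starts directly with the message type byte.
    msgtype = payload[:1]
    if msgtype == b'S':
        handle_startup(payload)
    elif msgtype == b'B':
        handle_begin(payload)
    elif msgtype == b'O':
        handle_origin(payload)
    elif msgtype == b'R':
        handle_metadata(payload)
    elif msgtype in (b'I', b'U', b'D'):
        handle_row(msgtype, payload)
    elif msgtype == b'C':
        handle_commit(payload)
    else:
        raise ValueError('unrecognised top-level message type %r' % msgtype)
```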

BEGIN message

A stream of rows starts with a BEGIN message. Rows may only be sent after a BEGIN and before a COMMIT.

Message | Type/Size | Notes
--- | --- | ---
Message type | signed char | Literal ‘B’ (0x42)
flags | uint8 | Bits 0-3: reserved; the client must ERROR if set and not recognised.
lsn | uint64 | “final_lsn” in the decoding context; currently the LSN of the commit record.
commit time | uint64 | “commit_time” in the decoding context
remote XID | uint32 | “xid” in the decoding context
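
A minimal decoding sketch in Python, assuming the integer fields are sent in network (big-endian) byte order like the rest of the libpq protocol:

```python
import struct

def handle_begin(payload: bytes) -> dict:
    # Skip the type byte, then read flags, lsn, commit time and xid.
    flags, final_lsn, commit_time, xid = struct.unpack_from('>BQQI', payload, 1)
    if flags & 0x0F:
        raise ValueError('BEGIN: reserved flag bits set')
    return {'final_lsn': final_lsn, 'commit_time': commit_time, 'xid': xid}
```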

Forwarded transaction origin message

The message after the BEGIN may be a forwarded transaction origin message indicating what upstream node the transaction came from.

Sent if the immediately prior message was a BEGIN message, the upstream transaction was forwarded from another node, and replication origin forwarding is enabled, i.e. forward_changeset_origins is true in the startup reply message.

A "node" could be another host, another DB on the same host, or pretty much anything. Whatever origin name is found gets forwarded. The origin identifier is of arbitrary and application-defined format. Applications should prefix their origin identifier with a fixed application name part, like bdr_, myapp_, etc. It is application-defined what an application does with forwarded transactions from other applications.

An origin message with a zero-length origin name indicates that the origin could not be identified but was (probably) not the local node. It is client-defined what action is taken in this case.

It is a protocol error to send/receive a forwarded transaction origin message at any time other than immediately after a BEGIN message.

The origin identifier is typically closely related to replication slot names and replication origins’ names in an application system.

For more detail see Changeset Forwarding in the README.

Message | Type/Size | Notes
--- | --- | ---
Message type | signed char | Literal ‘O’ (0x4f)
flags | uint8 | Bits 0-3: reserved; the application must ERROR if set and not recognised.
origin_lsn | uint64 | Log sequence number (LSN, XLogRecPtr) of the transaction’s commit record on its origin node (as opposed to the forwarding node’s commit LSN, which is ‘lsn’ in the BEGIN message)
origin_identifier_length | uint8 | Length in bytes of origin_identifier
origin_identifier | signed char[origin_identifier_length] | An origin identifier of arbitrary, upstream-application-defined structure. Should be text in the same encoding as the upstream database. NULL-terminated. Should be 7-bit ASCII.

COMMIT message

A stream of rows ends with a COMMIT message.

There is no ROLLBACK message because aborted transactions are not sent by the upstream.

Message | Type/Size | Notes
--- | --- | ---
Message type | signed char | Literal ‘C’ (0x43)
Flags | uint8 | Bits 0-3: reserved; the client must ERROR if set and not recognised.
Commit LSN | uint64 | commit_lsn in the decoding commit callback. This is the same value as in the BEGIN message, and marks the end of the transaction.
End LSN | uint64 | end_lsn in the decoding transaction context
Commit time | uint64 | commit_time in the decoding transaction context
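
Decoding follows the same pattern as for BEGIN, under the same big-endian byte-order assumption:

```python
import struct

def handle_commit(payload: bytes) -> dict:
    flags, commit_lsn, end_lsn, commit_time = struct.unpack_from('>BQQQ', payload, 1)
    if flags & 0x0F:
        raise ValueError('COMMIT: reserved flag bits set')
    return {'commit_lsn': commit_lsn, 'end_lsn': end_lsn,
            'commit_time': commit_time}
```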

INSERT, UPDATE or DELETE message

After a BEGIN or metadata message, the downstream should expect to receive zero or more row change messages, each composed of an insert/update/delete header followed by one or more tuple parts, each of which carries zero or more tuple field values.

The row’s relidentifier must match that of the most recently preceding metadata message. All consecutive row messages must currently have the same relidentifier. (Later extensions to add metadata caching will relax these requirements for clients that advertise caching support; see the documentation on metadata messages for more detail).

It is an error to decode rows using metadata received after the row was received, or using metadata that is not the most recently received metadata revision that still predates the row. I.e. in the sequence M1, R1, R2, M2, R3, M4: R1 and R2 must be decoded using M1, and R3 must be decoded using M2. It is an error to use M4 to decode any of the rows, to use M1 to decode R3, or to use M2 to decode R1 and R2.

Row messages may not arrive except during a transaction as delimited by BEGIN and COMMIT messages. It is an error to receive a row message outside a transaction.

Any unrecognised tuple type or tuple part type is an error on the downstream that must result in a client disconnect and error message. Downstreams are expected to negotiate compatibility, and upstreams must not add new tuple types or tuple field types without negotiation.

The downstream reads rows until the next non-row message is received. There is no other end marker or any indication of how many rows to expect in a sequence.

Row message header

Message | Type/Size | Notes
--- | --- | ---
Message type | signed char | Literal ‘I’nsert (0x49), ‘U’pdate (0x55) or ‘D’elete (0x44)
flags | uint8 | Row flags (reserved)
relidentifier | uint32 | relidentifier that matches the table metadata message sent for this row. (Not present in BDR, which sends nspname and relname instead.)
[tuple parts] | [composite] | One or more tuple-parts fields follow.

Tuple fields

Message | Type/Size | Notes
--- | --- | ---
Tuple type | signed char | Identifies the kind of tuple being sent.
tupleformat | signed char | ‘T’ (0x54)
natts | uint16 | Number of fields sent in this tuple part. (Present in BDR, but its meaning differs significantly here.)
[tuple field values] | [composite] | Sequence of ‘natts’ tuple field values.

Tuple tupleformat compatibility

Unrecognised tupleformat kinds are a protocol error for the downstream.

Tuple field value fields

These message parts describe individual fields within a tuple.

There are two kinds of tuple value fields, abbreviated and full. Which one is being read is determined by the first field, kind.

Abbreviated tuple value fields are nothing but the message kind:

Message | Type/Size | Notes
--- | --- | ---
kind | signed char | ‘n’ull (0x6e) field

Full tuple value fields have a length and datum:

Message | Type/Size | Notes
--- | --- | ---
kind | signed char | ‘i’nternal binary (0x69), ‘b’inary send/recv (0x62) or ‘t’ext (0x74) field
length | int4 | Only defined for kind = i|b|t
data | char[length] | Data in a format defined by the table metadata and the field kind.

Tuple field values kind compatibility

Unrecognised field kind values are a protocol error for the downstream. The downstream may not continue processing the protocol stream after this point.

The upstream may not send ‘i’nternal or ‘b’inary format values to the downstream without the downstream negotiating acceptance of such values. The downstream will also generally negotiate to receive type information to use to decode the values. See the section on startup parameters and the startup message for details.
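
Putting the row header, tuple parts and field values together, a downstream might decode a row message roughly as follows. This is a sketch with the same big-endian assumption as earlier; the tuple type byte is carried through uninterpreted.

```python
import struct

def handle_row(msgtype: bytes, payload: bytes):
    flags, relidentifier = struct.unpack_from('>BI', payload, 1)
    pos = 6
    parts = []
    while pos < len(payload):
        tupletype = payload[pos:pos + 1]        # kind of tuple; not interpreted here
        tupleformat = payload[pos + 1:pos + 2]
        if tupleformat != b'T':
            raise ValueError('unrecognised tupleformat %r' % tupleformat)
        (natts,) = struct.unpack_from('>H', payload, pos + 2)
        pos += 4
        fields = []
        for _ in range(natts):
            kind = payload[pos:pos + 1]
            pos += 1
            if kind == b'n':                    # abbreviated field: null, no body
                fields.append((kind, None))
            elif kind in (b'i', b'b', b't'):    # full field: int4 length + datum
                (length,) = struct.unpack_from('>i', payload, pos)
                pos += 4
                fields.append((kind, payload[pos:pos + length]))
                pos += length
            else:
                raise ValueError('unrecognised field kind %r' % kind)
        parts.append((tupletype, fields))
    return msgtype, relidentifier, parts
```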

Table/row metadata messages

Before sending changed rows for a relation, a metadata message for the relation must be sent so the downstream knows the namespace, table name, column names, optional column types, etc. A relidentifier field, an arbitrary numeric value unique for that relation on that upstream connection, maps the metadata to following rows.

A client should not assume that relation metadata will be followed immediately (or at all) by rows, since future changes may lead to metadata messages being delivered at other times. Metadata messages may arrive during or between transactions.

The upstream may not assume that the downstream retains any metadata beyond the single most recent table metadata message. This applies across all tables, so a client is permitted to discard metadata for table x when getting metadata for table y. The upstream must send a new metadata message before sending rows for a different table, even if that metadata was already sent in the same session or even the same transaction. This requirement will later be weakened by the addition of client metadata caching, which will be advertised to the upstream with an output plugin parameter.

Columns in metadata messages are numbered from 0 to natts-1, reading consecutively from start to finish. The column numbers do not have to be a complete description of the columns in the upstream relation, so long as all columns that will later have row values sent are described. The upstream may choose to omit columns it doesn’t expect to send changes for in any given series of rows. Column numbers are not necessarily stable across different sets of metadata for the same table, even if the table hasn’t changed structurally.

A metadata message may not be used to decode rows received before that metadata message.

Table metadata header

Message | Type/Size | Notes
--- | --- | ---
Message type | signed char | Literal ‘R’ (0x52)
flags | uint8 | Bits 0-6: reserved; the client must ERROR if set and not recognised.
relidentifier | uint32 | Arbitrary relation id, unique for this upstream. In practice this will probably be the upstream table’s oid, but the downstream can’t assume anything.
nspnamelength | uint8 | Length of namespace name
nspname | signed char[nspnamelength] | Relation namespace (null terminated)
relnamelength | uint8 | Length of relation name
relname | char[relnamelength] | Relation name (null terminated)
attrs block | signed char | Literal: ‘A’ (0x41)
natts | uint16 | Number of attributes
[fields] | [composite] | Sequence of ‘natts’ columns, each of which begins with a column delimiter followed by zero or more column metadata blocks, all sharing the same column metadata block header.

This chunked format is used so that new kinds of column metadata blocks can be added without breaking existing clients.

Column delimiter

Each column’s metadata begins with a column delimiter. This comes immediately after the natts field in the table metadata header, or after the last metadata block of the prior column.

It has the same char header as all the others, and the flags field is the same size as the length field in other blocks, so it’s safe to read this as a column metadata block header.

Message | Type/Size | Notes
--- | --- | ---
blocktype | signed char | ‘C’ (0x43) - column
flags | uint8 | Column info flags

Column metadata block header

All column metadata blocks share the same header, which is the same length as a column delimiter:

Message | Type/Size | Notes
--- | --- | ---
blocktype | signed char | Identifies the kind of metadata block that follows.
blockbodylength | uint16 | Length of block in bytes, excluding the blocktype char and length field.

Column name block

This block just carries the name of the column, nothing more. It begins with a column metadata block header, and the rest of the block is the column name.

Message | Type/Size | Notes
--- | --- | ---
[column metadata block header] | [composite] | blocktype = ‘N’ (0x4e)
colname | char[blockbodylength] | Column name.
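
A decoding sketch covering the metadata header, the column delimiters and the column name blocks. It assumes big-endian integers, assumes the name length fields include the null terminator, and (since only the name block is defined in this protocol revision) assumes exactly one metadata block per column, skipping any block type it does not recognise.

```python
import struct

def handle_metadata(payload: bytes):
    flags, relidentifier, nspnamelength = struct.unpack_from('>BIB', payload, 1)
    pos = 7
    # Assumption: the length fields count the null terminator, so the usable
    # name is one byte shorter. Adjust if your upstream differs.
    nspname = payload[pos:pos + nspnamelength - 1]
    pos += nspnamelength
    relnamelength = payload[pos]
    pos += 1
    relname = payload[pos:pos + relnamelength - 1]
    pos += relnamelength
    if payload[pos:pos + 1] != b'A':
        raise ValueError('expected attrs block')
    (natts,) = struct.unpack_from('>H', payload, pos + 1)
    pos += 3
    columns = []
    for _ in range(natts):
        if payload[pos:pos + 1] != b'C':        # column delimiter
            raise ValueError('expected column delimiter')
        pos += 2                                # skip 'C' and the flags byte
        blocktype = payload[pos:pos + 1]
        (blockbodylength,) = struct.unpack_from('>H', payload, pos + 1)
        pos += 3
        body = payload[pos:pos + blockbodylength]
        pos += blockbodylength
        if blocktype == b'N':
            columns.append(body.decode())       # use the upstream's encoding in real code
    return relidentifier, nspname.decode(), relname.decode(), columns
```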

Column type block

T.B.D.

Not defined in first protocol revision.

Likely to send a type identifier (probably the upstream oid) as a reference to a “type info” protocol message delivered beforehand. Then we can cache the type descriptions and avoid repeating long schemas and names, just using the oids.

Needs to have room to handle:

  • built-in core types

  • extension types (ext version may vary)

  • enum types (CREATE TYPE … AS ENUM)

  • range types (CREATE TYPE … AS RANGE)

  • composite types (CREATE TYPE … AS (…))

  • custom types (CREATE TYPE ( input = x_in, output = x_out ))

… some of which can be nested

Startup message

After processing output plugin arguments, the upstream output plugin must send a startup message as its first message on the wire. It is a trivial header followed by alternating key and value strings represented as null-terminated unsigned char strings.

This message specifies the capabilities the output plugin enabled and describes the upstream server and plugin. This may change how the client decodes the data stream, and/or permit the client to disconnect and report an error to the user if the result isn’t acceptable.

If replication is rejected because the client is incompatible or the server is unable to satisfy required options, the startup message may be followed by a libpq protocol FATAL message that terminates the session. See “Startup errors” below.

The parameter names and values are sent as alternating key/value pairs as null-terminated strings, e.g.

“key1\0value1\0key2\0value2\0”

Message | Type/Size | Notes
--- | --- | ---
Message type | signed char | ‘S’ (0x53) - startup
Startup message version | uint8 | Value is always “1”.
(parameters) | null-terminated key/value pairs | See the table below for parameter definitions.
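
A sketch of reading the startup message into a dict; it assumes the message version byte is the integer 1 rather than the ASCII digit:

```python
def handle_startup(payload: bytes) -> dict:
    # One byte message type 'S', one byte startup message version, then
    # alternating null-terminated key and value strings.
    if payload[1] != 1:
        raise ValueError('unexpected startup message version %d' % payload[1])
    parts = payload[2:].split(b'\0')
    if parts and parts[-1] == b'':
        parts = parts[:-1]          # drop the empty element after the final \0
    return {parts[i].decode(): parts[i + 1].decode()
            for i in range(0, len(parts) - 1, 2)}
```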

Startup message parameters

Since all parameter values are sent as strings, the value types given below specify what the value must be reasonably interpretable as.

Key name | Value type | Description
--- | --- | ---
max_proto_version | integer | Newest version of the protocol supported by the output plugin.
min_proto_version | integer | Oldest version of the protocol supported by the output plugin.
proto_format | text | Protocol format requested: native (documented here) or json. Default is native.
coltypes | boolean | Whether column types will be sent in table metadata.
pg_version_num | integer | PostgreSQL server_version_num of the server, if it’s PostgreSQL, e.g. 90400 for 9.4.
pg_version | string | PostgreSQL server_version of the server, if it’s PostgreSQL.
pg_catversion | uint32 | Version of the PostgreSQL system catalogs on the upstream server, if it’s PostgreSQL.
binary | set of parameters, specified separately | See “The ‘binary’ parameter set” below, and “Parameters relating to exchange of binary values”.
encoding | string | The text encoding used in the upstream database. Used for text fields.
forward_changesets | bool | Specifies that all transactions, not just those originating on the upstream, will be forwarded. See “Changeset forwarding”.
forward_changeset_origins | bool | Tells the client that the server will send changeset origin information. Independent of forward_changesets. See “Changeset forwarding” for details.
no_txinfo | bool | Requests that variable transaction info such as XIDs, LSNs, and timestamps be omitted from output. Mainly for tests. Currently ignored for protocols other than json.

The ‘binary’ parameter set

Key name | Value type | Description
--- | --- | ---
binary.internal_basetypes | boolean | If true, PostgreSQL internal binary representations may be used for some or all row field data, where the type is appropriate and the binary compatibility parameters of upstream and downstream match. See binary.want_internal_basetypes in the output plugin parameters for details. May only be true if binary.want_internal_basetypes was set to true by the client in the parameters and the client’s accepted binary format matches that of the server.
binary.binary_basetypes | boolean | If true, external binary format (send/recv format) may be used for some or all row field data where the field type is a built-in base type whose send/recv format is compatible with binary.binary_pg_version. May only be set if binary.want_binary_basetypes was set to true by the client in the parameters and the client’s accepted send/recv format matches that of the server.
binary.binary_pg_version | uint16 | The PostgreSQL major version that send/recv format values will be compatible with. This is not necessarily the actual upstream PostgreSQL version.
binary.sizeof_int | uint8 | sizeof(int) on the upstream.
binary.sizeof_long | uint8 | sizeof(long) on the upstream.
binary.sizeof_datum | uint8 | Same as sizeof_int, but for the PostgreSQL Datum typedef.
binary.maxalign | uint8 | Upstream PostgreSQL server’s MAXIMUM_ALIGNOF value - platform dependent, determined at build time.
binary.bigendian | bool | True iff the upstream is big-endian.
binary.float4_byval | bool | Upstream PostgreSQL’s float4_byval compile option.
binary.float8_byval | bool | Upstream PostgreSQL’s float8_byval compile option.
binary.integer_datetimes | bool | Whether TIME, TIMESTAMP and TIMESTAMP WITH TIME ZONE are sent using the integer or floating point representation. Usually this is the value of the upstream PostgreSQL’s integer_datetimes compile option.

Startup errors

If the server rejects the client’s connection - due to non-overlapping protocol support, unrecognised parameter formats, unsupported required parameters like hooks, etc - then it will follow the startup reply message with a normal libpq protocol error message. (Current versions send this before the startup message).

Arguments client supplies to output plugin

The one opportunity for the downstream client to send information (other than replay feedback) to the upstream is at connect time, as an array of arguments to the output plugin supplied to START_REPLICATION.

There is no back-and-forth, no handshake.

As a result, the client mainly announces capabilities and makes requests of the output plugin. The output plugin will ERROR if required parameters are unset, or where incompatibilities that cannot be resolved are found. Otherwise the output plugin reports what it could and could not honour in the startup message it sends as the first message on the wire down to the client. The client chooses whether to continue replay or to disconnect and report an error to the user, then possibly reconnect with different options.

Output plugin arguments

The output plugin’s arguments are specified as key/value pairs; they’re what’s passed to START_REPLICATION, etc.

All parameters are passed in text form. They should be limited to 7-bit ASCII, since the server’s text encoding is not known, but may be normalized precomposed UTF-8. The types specified for parameters indicate what the output plugin should attempt to convert the text into. Clients should not send text values that are outside the range for that type.

Capabilities

Many values are capabilities flags for the client, indicating that it understands optional features like metadata caching, binary format transfers, etc. In general the output plugin may disregard capabilities the client advertises as supported and act as if they are not supported. If a capability is advertised as unsupported or is not advertised the output plugin must not enable the corresponding features.

In other words, don’t send the client something it’s not expecting.

Protocol versioning

Two parameters max_proto_version and min_proto_version, which clients must always send, allow negotiation of the protocol version. The output plugin must ERROR if the client protocol support does not overlap its own protocol support range.

The protocol version is only incremented when there are major breaking changes that all or most clients must be modified to accommodate. Most changes are done by adding new optional messages and/or by having clients advertise capabilities to opt in to features.

Because these versions are expected to be incremented, and to make it clear that the format of the startup parameters themselves hasn’t changed, the first key/value pair must be the parameter startup_params_format with value “1”.

Key | Type | Value(s) | Notes
--- | --- | --- | ---
startup_params_format | int8 | 1 | The format version of this startup parameter set. Always the digit 1 (0x31), null terminated.
max_proto_version | int32 | 1 | Newest version of the protocol supported by the client. The output plugin must ERROR if this is older than the oldest protocol version it supports. Required; ERROR if missing.
min_proto_version | int32 | 1 | Oldest version of the protocol supported by the client. The output plugin must ERROR if this is newer than the newest protocol version it supports. Required; ERROR if missing.
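
The overlap check itself is simple. A sketch of the rule the output plugin applies (the version constants are illustrative):

```python
# The protocol range this hypothetical output plugin build understands.
PLUGIN_MIN_PROTO = 1
PLUGIN_MAX_PROTO = 1

def proto_versions_overlap(client_min: int, client_max: int) -> bool:
    # The client's [min, max] range must intersect the plugin's own range;
    # if it doesn't, the plugin must ERROR and the client can only reconnect
    # with different options, since there is no renegotiation channel.
    return client_min <= PLUGIN_MAX_PROTO and PLUGIN_MIN_PROTO <= client_max
```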

Client requirements and capabilities

Key | Type | Default | Notes
--- | --- | --- | ---
expected_encoding | string | null | The text encoding the downstream expects results to be in. If specified, the upstream must honour it.
forward_changesets | bool | false | Request that all transactions, not just those originating on the upstream, be forwarded. See “Changeset forwarding”.
want_coltypes | boolean | false | The client wants to receive data type information about columns.

General client information

These keys tell the output plugin about the client. They’re mainly for informational purposes. In particular, the versions must not be used to determine compatibility for binary or send/recv format, as non-PostgreSQL clients will simply not send them at all but may still understand binary or send/recv format fields.

Key | Type | Default | Notes
--- | --- | --- | ---
pg_version_num | integer | null | PostgreSQL server_version_num of the client, if it’s PostgreSQL, e.g. 90400 for 9.4.
pg_version | string | null | PostgreSQL server_version of the client, if it’s PostgreSQL.

Parameters relating to exchange of binary values

The downstream may specify to the upstream that it is capable of understanding binary (PostgreSQL internal binary datum format) and/or send/recv (PostgreSQL binary interchange) format data by setting the binary.want_internal_basetypes and/or binary.want_binary_basetypes options respectively, or other yet-to-be-defined options.

An upstream output plugin that does not support one or both formats may ignore the downstream’s binary support and send text format, in which case it may ignore all binary. parameters. All downstreams must support text format. An upstream output plugin must not send binary or send/recv format unless the downstream has announced it can receive it. If both upstream and downstream support both formats an upstream should prefer binary format and fall back to send/recv, then to text, if compatibility requires.

Internal and binary format selection should be done on a type-by-type basis. It is quite normal to send ‘text’ format for extension types while sending binary for built-in types.

The downstream must specify its compatibility requirements for internal and binary data if it requests either or both formats. The upstream must honour these by falling back from binary to send/recv, and from send/recv to text, where the upstream and downstream are not compatible.

An unspecified compatibility field must be presumed to be unsupported by the downstream, so that older clients that don’t know about a change in a newer version don’t receive unexpected data. For example, in the unlikely event that PostgreSQL 99.8 switched to 128-bit DPD (Densely Packed Decimal) representations of NUMERIC instead of the current arbitrary-length BCD (Binary Coded Decimal) format, a new binary.dpd_numerics parameter would be added. Clients that didn’t know about the change wouldn’t know to set it, so the upstream would presume it unsupported and send text format NUMERIC to those clients. This also means that clients that support the new format wouldn’t be able to receive the old format in binary from older servers, since they’d specify dpd_numerics = true in their compatibility parameters.

At this time a downstream may specify compatibility with only one value for a given option; i.e. a downstream cannot say it supports both 4-byte and 8-byte sizeof(int). Leaving it unspecified means the upstream must assume the downstream supports neither. (A future protocol extension may allow clients to specify alternative sets of supported formats).

The pg_version option must not be used to decide compatibility. Use binary.basetypes_major_version instead.

Key name | Value type | Default | Description
--- | --- | --- | ---
binary.want_binary_basetypes | boolean | false | True if the client accepts binary interchange (send/recv) format rows for PostgreSQL built-in base types.
binary.want_internal_basetypes | boolean | false | True if the client accepts PostgreSQL internal-format binary output for base PostgreSQL types not otherwise specified elsewhere.
binary.basetypes_major_version | uint16 | null | The PostgreSQL major version (x.y) the downstream expects binary and send/recv format values to be in. Represented as an integer in XXYY format (no leading zero since it’s an integer), e.g. 9.5 is 905. This corresponds to PG_VERSION_NUM/100 in PostgreSQL.
binary.sizeof_int | uint8 | null | sizeof(int) on the downstream.
binary.sizeof_long | uint8 | null | sizeof(long) on the downstream.
binary.sizeof_datum | uint8 | null | Same as sizeof_int, but for the PostgreSQL Datum typedef.
binary.maxalign | uint8 | null | Downstream PostgreSQL server’s maxalign value - platform dependent, determined at build time.
binary.bigendian | bool | null | True iff the downstream is big-endian.
binary.float4_byval | bool | null | Downstream PostgreSQL’s float4_byval compile option.
binary.float8_byval | bool | null | Downstream PostgreSQL’s float8_byval compile option.
binary.integer_datetimes | bool | null | Downstream PostgreSQL’s integer_datetimes compile option.
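
As an illustration, a 64-bit little-endian downstream built against PostgreSQL 9.5 might merge something like the following into its START_REPLICATION options. Every value here is hypothetical: a real client must derive each one from its own build, or omit the binary options entirely and accept text-format values.

```python
# Illustrative binary.* arguments; all values are sent as text.
binary_options = {
    'binary.want_internal_basetypes': 'true',
    'binary.want_binary_basetypes': 'true',
    'binary.basetypes_major_version': '905',   # 9.5, i.e. PG_VERSION_NUM/100
    'binary.sizeof_int': '4',
    'binary.sizeof_long': '8',
    'binary.sizeof_datum': '8',
    'binary.maxalign': '8',
    'binary.bigendian': 'false',
    'binary.float4_byval': 'true',
    'binary.float8_byval': 'true',
    'binary.integer_datetimes': 'true',
}
```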

Extensibility

Because of the use of optional parameters in output plugin arguments, and the confirmation/response sent in the startup packet, a basic handshake is possible between upstream and downstream, allowing negotiation of capabilities.

The output plugin must never send non-optional data or change its wire format without confirmation from the client that it can understand the new data. It may send optional data without negotiation.

When extending the output plugin arguments, add-ons are expected to prefix all keys with the extension name, and should preferably use a single top level key with a json object value to carry their extension information. Additions to the startup message should follow the same pattern.

Hooks and plugins can be used to add functionality specific to a client.

JSON protocol

If proto_format is set to json then the output plugin will emit JSON instead of the custom binary protocol. JSON support is intended mainly for debugging and diagnostics.

The JSON format supports all the same hooks.
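
For example, with the psycopg2 sketch from earlier, a client might request the JSON protocol like this; decode=True makes psycopg2 hand each payload over as text rather than bytes. The slot name remains a hypothetical placeholder.

```python
cur.start_replication(slot_name='demo_slot', decode=True, options={
    'startup_params_format': '1',
    'max_proto_version': '1',
    'min_proto_version': '1',
    'proto_format': 'json',   # emit JSON documents instead of the binary protocol
})
```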