Replication slot WAL reservation

From: Phillip Diffley
I am trying to understand how logical replication slots work, specifically in regard to how WAL is reserved and freed by a replication slot.

My current understanding of the WAL (set to wal_level logical) is that:
1. Every DML operation (insert, update, delete, truncate) will have a row in the WAL and that row will have an LSN assigned to it. 
2. The LSNs are assigned when the operation happens.
3. Operations within a transaction are written to the WAL in a contiguous block when the transaction commits.
4. Each transaction block in the WAL has a commit timestamp associated with it, and these timestamps are monotonically increasing (I'm only inferring this, so please correct me if this is not always the case).

A result of an operation being assigned an LSN when the operation occurs, but not being recorded to the WAL until its transaction commits, is that LSNs in the WAL are not necessarily in order.

But when processing data from a replication slot, we confirm rows that have been processed and can be deleted from the WAL based on the LSN (e.g. with pg_replication_slot_advance). How does Postgres identify what parts of the WAL can be freed?
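
For reference, this is roughly the kind of consumer I'm experimenting with (a sketch only; the connection string and slot name are placeholders):

    # Sketch of the SQL-level interface; DSN and slot name are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    conn.autocommit = True
    cur = conn.cursor()

    # Look at pending changes without consuming them.
    cur.execute(
        "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes(%s, NULL, NULL)",
        ("my_slot",),
    )
    rows = cur.fetchall()
    for lsn, xid, data in rows:
        print(lsn, xid, data)   # stand-in for real processing

    # Once everything up to the last LSN is safely processed, confirm it so
    # the server may eventually free the older WAL.
    if rows:
        cur.execute(
            "SELECT pg_replication_slot_advance(%s, %s::pg_lsn)",
            ("my_slot", rows[-1][0]),
        )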

Re: Replication slot WAL reservation

From: Christophe Pettus

> On Mar 25, 2025, at 09:56, Phillip Diffley <phillip6402@gmail.com> wrote:
> 1. Every DML operation (insert, update, delete, truncate) will have a row in the WAL and that row will have an LSN assigned to it.
> 2. The LSNs are assigned when the operation happens.
> 3. Operations within a transaction are written to the WAL in a contiguous block when the transaction commits.
> 4. Each transaction block in the WAL has a commit timestamp associated with it, and these timestamps are monotonically increasing (I'm only inferring this, so please correct me if this is not always the case).

This isn't *quite* right.  LSNs are byte offsets, not operation sequence numbers (despite the name), so they don't have to be "assigned"; they're just an offset to a particular point in the (virtual) WAL bytestream.

The WAL is written as it is generated (allowing for buffering etc.).  A transaction end flushes out any remaining WAL that is involved in that transaction (and any other WAL that might be between operations for that transaction), unless you are not in synchronous_commit mode.

It's not quite the "transaction block" that has the commit timestamp; it's the commit record for a particular transaction.  I can't come up with a scenario off the top of my head in which the commit timestamp would go down, but I wouldn't absolutely bet on it never doing so.
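
To make the byte-offset point concrete, here's a quick illustration (the connection string is a placeholder; psycopg2 assumed):

    # The "distance" between two LSNs is just a byte count of WAL generated.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()

    cur.execute("SELECT pg_current_wal_lsn()")
    (before,) = cur.fetchone()

    cur.execute("CREATE TEMP TABLE t (x int)")
    cur.execute("INSERT INTO t SELECT generate_series(1, 1000)")
    conn.commit()

    cur.execute(
        "SELECT pg_current_wal_lsn(), pg_wal_lsn_diff(pg_current_wal_lsn(), %s::pg_lsn)",
        (before,),
    )
    after, wal_bytes = cur.fetchone()
    print(f"{before} -> {after}: roughly {wal_bytes} bytes of WAL generated")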


Re: Replication slot WAL reservation

From: Christophe Pettus
Missed this question!

> On Mar 25, 2025, at 09:56, Phillip Diffley <phillip6402@gmail.com> wrote:
> But when processing data from a replication slot, we confirm rows that have been processed and can be deleted from the WAL based on the LSN (e.g. with pg_replication_slot_advance). How does Postgres identify what parts of the WAL can be freed?

Basically, if no part of the system "needs" a particular LSN position, the segments that include that LSN position and earlier can be freed.

The various things that can "need" a particular LSN point are:

1. Replication slots, if the other side has not confirmed that it has received it (under whatever synchronous commit rules that slot is operating under); one way to check this is sketched at the end of this message.
2. The wal_keep_size setting.
3. The max_wal_size setting.
4. The archive_command, if a WAL segment hasn't been successfully archived yet.

One thing to remember is that the WAL does *not* contain contiguous blocks of operations for a single transaction.  The operations are written to the WAL by every session as they do operations, so the WAL is a jumble of different transactions.  One of the jobs of the logical replication framework is to sort that out so it can present only the operations that belong to committed transactions to the output plugin.  (This is why there's an internal structure called the "reorder buffer": it reorders WAL operations into transaction blocks.)
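
Here's that sketch: one way to see how much WAL each logical slot is currently holding back (the connection string is a placeholder):

    # Compare each slot's restart_lsn to the current WAL position.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()
    cur.execute("""
        SELECT slot_name,
               restart_lsn,
               pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
        FROM pg_replication_slots
        WHERE slot_type = 'logical'
    """)
    for slot, restart_lsn, retained in cur.fetchall():
        print(f"{slot}: WAL retained since {restart_lsn} (~{retained})")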


Re: Replication slot WAL reservation

From: Phillip Diffley
Oh I see! I was conflating the data I see coming out of a replication slot with the internal organization of the WAL. I think the more specific question I am trying to answer is, as a consumer of a replication slot, how do I reason about what replication records will be made unavailable when I confirm an LSN? Here I am worried about situations where the replication connection is interrupted or the program processing the records crashes, and we need to replay records that may have been previously sent but were not fully processed.

For example, are the records sent by a replication slot always sent in the same order such that if I advance the confirmed_flush_lsn of a slot to the LSN of record "A", I will know that any records that had been streamed after record "A" will be replayable?



Re: Replication slot WAL reservation

From: Christophe Pettus

> On Mar 25, 2025, at 13:58, Phillip Diffley <phillip6402@gmail.com> wrote:
>
> Oh I see! I was conflating the data I see coming out of a replication slot with the internal organization of the WAL. I think the more specific question I am trying to answer is, as a consumer of a replication slot, how do I reason about what replication records will be made unavailable when I confirm an LSN? Here I am worried about situations where the replication connection is interrupted or the program processing the records crashes, and we need to replay records that may have been previously sent but were not fully processed.

It's up to the consuming client to keep track of where it is in the WAL (using an LSN).  When the client connects, it specifies what LSN to start streaming at.  If that LSN is no longer available, the publisher / primary returns an error.

The client shouldn't confirm the flush of an LSN unless it is crash-proof to that point, since any WAL before that should be assumed to be unavailable.

> For example, are the records sent by a replication slot always sent in the same order such that if I advance the confirmed_flush_lsn of a slot to the LSN of record "A", I will know that any records that had been streamed after record "A" will be replayable?

You know that any WAL generated after `confirmed_flush_lsn` is available for replay.  That's the oldest LSN that the client can specify on connection (although it can specify a later one, if it exists).  You shouldn't need to manually advance the replication slot.  Instead, the client specifies where it wants to start when it connects.  The client is also expected to send back regular messages letting the publisher / primary know that it has successfully consumed up to a particular point in the WAL, so the publisher / primary knows it can release that WAL information.
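
Roughly, in psycopg2 terms (a sketch only: the connection string and slot name are placeholders, the slot is assumed to use a text output plugin such as test_decoding, and the spool file stands in for whatever the real client does to be crash-safe):

    import os
    import psycopg2
    from psycopg2.extras import LogicalReplicationConnection

    conn = psycopg2.connect("dbname=mydb",
                            connection_factory=LogicalReplicationConnection)
    cur = conn.cursor()

    # In a real client this would be reloaded from durable local state.
    last_flushed_lsn = 0

    cur.start_replication(slot_name="my_slot", decode=True,
                          start_lsn=last_flushed_lsn)

    def consume(msg):
        # Make the change durable locally *before* confirming it...
        with open("changes.spool", "a") as f:
            f.write(f"{msg.data_start}\t{msg.payload}\n")
            f.flush()
            os.fsync(f.fileno())
        # ...then report it as flushed; WAL before this point may now be
        # released on the primary side.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(consume)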


Re: Replication slot WAL reservation

From: Phillip Diffley
> You shouldn't need to manually advance the replication slot.
> The client is also expected to send back regular messages letting the publisher / primary know that it has successfully consumed up to a particular point

I was thinking of these as the same thing, but it sounds like they are different. At the moment, the only method I know for letting the publisher/primary know what has been successfully consumed is pg_replication_slot_advance. I looked at the Message Formats and Logical Replication Message Formats pages, but I did not see a message type for updating confirmed_flush_lsn or otherwise letting the publisher/primary know what logs have been successfully consumed. There is the flush message, but it looks like it only passes a 4-byte int instead of the 8 bytes required for an LSN. Is there a message type that is used to confirm what logs have been successfully consumed?

> You know that any WAL generated after `confirmed_flush_lsn` is available for replay.

The part I am uncertain about is what "after" means here, since LSNs are not presented in order, and the order of data streamed over the replication slot does not match the order of the data in the WAL. 

I initially (and incorrectly) thought the confirmation order was based on LSN. So if you confirmed an LSN "x" then all logs with LSN less than "x" could be released by the publisher/primary. That can't work though since LSNs are not presented in order by the replication slot. Is there a monotonically increasing identifier that can be used to identify which logs come "after" another? Or do you just keep track of the order the replication slot delivers logs in and not confirm a log until it and all the logs received before it are processed to the point of being crash-proof?


Re: Replication slot WAL reservation

From: Christophe Pettus

> On Mar 25, 2025, at 20:56, Phillip Diffley <phillip6402@gmail.com> wrote:
>
> Is there a message type that is used to confirm what logs have been successfully consumed?

You're looking for Standby Status Update:

    https://www.postgresql.org/docs/current/protocol-replication.html#PROTOCOL-REPLICATION-STANDBY-STATUS-UPDATE

The logical replication protocol is a superset of the binary replication protocol, so many of the same messages are used.
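
As to the 4-byte question: the LSN fields in that message are 64-bit. A hand-rolled version of the message, just to show the layout (the LSN value is made up):

    # Standby Status Update copy-data payload: 'r', three Int64 LSNs, an Int64
    # timestamp (microseconds since 2000-01-01), and one reply-requested byte.
    import struct
    from datetime import datetime, timezone

    PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

    def standby_status_update(write_lsn, flush_lsn, apply_lsn, reply=False):
        now_us = int((datetime.now(timezone.utc) - PG_EPOCH).total_seconds() * 1_000_000)
        return struct.pack("!cQQQQB", b"r",
                           write_lsn, flush_lsn, apply_lsn, now_us,
                           1 if reply else 0)

    lsn = 0x16B374D48                      # e.g. LSN "1/6B374D48" as an integer
    message = standby_status_update(lsn, lsn, lsn)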

> The part I am uncertain about is what "after" means here, since LSNs are not presented in order, and the order of data streamed over the replication slot does not match the order of the data in the WAL.

I think there's a misunderstanding here (possibly my fault).  Transactions are always presented to the output plugin in commit order, and LSNs can be reliably used to determine the time ordering of commits.  LSNs are exactly what is used to determine how far into the WAL the replication slot has gotten.


Re: Replication slot WAL reservation

From: Phillip Diffley
> You're looking for Standby Status Update

Awesome! I completely missed that.

> Transactions are always presented to the output plugin in commit order

Ok great. I think that is what I needed to know. 

Just to confirm, it sounds like the order in which messages are sent from the output plugin is what matters here. When you update confirmed_flush_lsn to LSN "A", any messages that were sent by the output plugin after the message with LSN "A" will be replayable. Any messages sent by the output plugin before the message with LSN "A" will most likely not be replayable, since their data is now free to be deleted. Is that correct?


Re: Replication slot WAL reservation

From: Christophe Pettus

> On Mar 26, 2025, at 07:55, Phillip Diffley <phillip6402@gmail.com> wrote:
> Just to confirm, it sounds like the order in which messages are sent from the output plugin is what matters here. When you update confirmed_flush_lsn to LSN "A", any messages that were sent by the output plugin after the message with LSN "A" will be replayable. Any messages sent by the output plugin before the message with LSN "A" will most likely not be replayable, since their data is now free to be deleted. Is that correct?

The terminology is shifting around a bit here, so to be specific: When the primary (or publisher) receives a message from the secondary (or replica) that a particular LSN has been flushed, the primary at that point feels free to recycle any WAL segments that only contain WAL entries whose LSN is less than that flush point (whether or not it actually does depends on a lot of other factors).  The actual horizon that the primary needs to retain can be farther back than that, because there's no requirement that the secondary send an LSN as confirmed_flush_lsn that is at a transaction boundary, so the flush LSN might land in the middle of a transaction.  The actual point before which the primary can recycle WAL is restart_lsn, which the primary determines based on the flush LSN.
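
You can see both values on a live slot (a sketch; the connection string and slot name are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()
    cur.execute(
        "SELECT restart_lsn, confirmed_flush_lsn"
        "  FROM pg_replication_slots WHERE slot_name = %s",
        ("my_slot",),
    )
    restart_lsn, confirmed_flush_lsn = cur.fetchone()
    # restart_lsn can sit behind confirmed_flush_lsn, e.g. when the confirmed
    # point falls inside a transaction the slot still needs to decode.
    print(f"WAL kept from {restart_lsn}; client confirmed through {confirmed_flush_lsn}")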

When the secondary connects, it provides an LSN from which the primary should start sending WAL (if a binary replica) or decoded WAL via the plugin (if a logical replica).  For a logical replica, that can be confirmed_flush_lsn or any point after, but it can't be before.  (Even if the WAL exists, the primary will return an error if the start point provided in START_REPLICATION is before confirmed_flush_lsn for a logical replication slot.)  Of course, you'll get an error if START_REPLICATION supplies an LSN that doesn't actually exist yet.

The behavior that the primary is expecting from the secondary is that the secondary never sends back a confirmed_flush_lsn until everything up to that point is crash / disconnection-safe.  What "safe" means in this case depends on the client behavior.  It might be just spooling the incoming stream to disk and processing it later, or it might be processing it completely on the fly as it comes in.

The most important point here is that the client consuming the logical replication messages must keep track of the flush point (defined however the client implements processing the messages), and provide the right one back to the primary when it connects.  (Another option is that the client is written so that each transaction is idempotent, and even if transactions that it has already processed are sent again, the result is the same.)

One more note is that if the client supplies an LSN (for logical replication) that lands in the middle of a transaction, the primary will send over the complete transaction, so the actual start point may be earlier than the supplied start point.  Generally, this means that the client should respect transaction boundaries, and be able to deal with getting a partial transaction by discarding it if it doesn't get a commit record for it.
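
Putting that together, a rough consumer sketch (the slot is assumed to use test_decoding for readability, the connection string and slot name are placeholders, and apply_transaction() stands in for whatever the client does to make a transaction durable):

    import psycopg2
    from psycopg2.extras import LogicalReplicationConnection

    def apply_transaction(changes):
        # Stand-in: make the whole transaction durable before confirming it.
        pass

    conn = psycopg2.connect("dbname=mydb",
                            connection_factory=LogicalReplicationConnection)
    cur = conn.cursor()
    cur.start_replication(slot_name="my_slot", decode=True)

    pending = []  # changes of the transaction currently being streamed

    def consume(msg):
        text = msg.payload
        if text.startswith("BEGIN"):
            pending.clear()          # a partial transaction from before is simply discarded
        elif text.startswith("COMMIT"):
            apply_transaction(pending)
            # Confirm only at a commit boundary; WAL before this point may now be freed.
            msg.cursor.send_feedback(flush_lsn=msg.data_start)
            pending.clear()
        else:
            pending.append(text)

    cur.consume_stream(consume)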