Thread: Replication lag due to lagging restart_lsn

Replication lag due to lagging restart_lsn

From

Satyam Shekhar

Date:

18 August 2020, 16:27:34

Hello,

I wish to use logical replication in Postgres to capture transactions as CDC and forward them to a custom sink.

To understand the overhead of logical replication workflow I created a toy subscriber using the V3PGReplicationStream that acknowledges LSNs after every 16k reads by calling setAppliedLsn, setFlushedLsn, and forceUpdateState. The toy subscriber is set up as a subscriber for a master Postgres instance that publishes changes using a Publication. I then run a write-heavy workload on this setup that generates transaction logs at approximately 235MBps. Postgres is run on a beefy machine with a 10+GBps network link between Postgres and the toy subscriber.

My expectation with this setup was that the replication lag on master would be minimal as the subscriber acks the LSN almost immediately. However, I observe the replication lag to increase continuously for the duration of the test. Statistics in pg_replication_slots show that restart_lsn lags significantly behind the confirmed_flushed_lsn. Cursory reading on restart_lsn suggests that an increasing gap between restart_lsn and confirmed_flushed_lsn means that Postgres needs to reclaim disk space and advance restart_lsn to catch up to confirmed_flushed_lsn.

With that context, I am looking for answers for two questions -

1. What work needs to happen in the database to advance restart_lsn to confirmed_flushed_lsn?

2. What is the recommendation on tuning the database to improve the replication lag in such scenarios?

Regards,

Satyam

Re: Replication lag due to lagging restart_lsn

From

Kyotaro Horiguchi

Date:

19 August 2020, 11:43:18

Hello.

At Tue, 18 Aug 2020 09:27:34 -0700, Satyam Shekhar <satyamshekhar@gmail.com> wrote in 
> Hello,
> 
> I wish to use logical replication in Postgres to capture transactions as
> CDC and forward them to a custom sink.
> 
> To understand the overhead of logical replication workflow I created a toy
> subscriber using the V3PGReplicationStream that acknowledges LSNs after
> every 16k reads by calling setAppliedLsn, setFlushedLsn, and forceUpdateState.
> The toy subscriber is set up as a subscriber for a master Postgres instance
> that publishes changes using a Publication. I then run a write-heavy
> workload on this setup that generates transaction logs at approximately
> 235MBps. Postgres is run on a beefy machine with a 10+GBps network link
> between Postgres and the toy subscriber.
> 
> My expectation with this setup was that the replication lag on master would
> be minimal as the subscriber acks the LSN almost immediately. However, I
> observe the replication lag to increase continuously for the duration of
> the test. Statistics in pg_replication_slots show that restart_lsn
> lags significantly behind
> the confirmed_flushed_lsn. Cursory reading on restart_lsn suggests that an
> increasing gap between restart_lsn and confirmed_flushed_lsn means that
> Postgres needs to reclaim disk space and advance restart_lsn to catch up to
> confirmed_flushed_lsn.
> 
> With that context, I am looking for answers for two questions -
> 
> 1. What work needs to happen in the database to advance restart_lsn to
> confirmed_flushed_lsn?
> 2. What is the recommendation on tuning the database to improve the
> replication lag in such scenarios?

To make sure, replication delay or lag here is current_wal_lsn() -
confirmed_flush_lsn. restart_lsn has nothing to do with replication
lag. It is the minimum LSN the server thinks it needs for restarting
replication on the slot.

How long have you observed the increase of the gap? If no
long-transactions are running, restart_lsn is the current LSN about
from 15 to 30 seconds ago. That is, the gap between restart_lsn and
confirmed_flush_lsn would be at most the amount of WAL emitted in the
last 30 seconds. In this case, that is estimated to be 235MB*30 =
about 7GB or 440 in 16MB-segments even if the system is perfectly
working. Anyway the publisher server would need to preserve WAL files
up to about 68GB (in the case where checkpoint_timeout is 5 minutes)
so requirement of 7GB by restart_lsn doesn't matter.

In short, I don't think you need to do something against that "lag".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Replication lag due to lagging restart_lsn

From

Kiran Singh

Date:

19 August 2020, 13:15:44

When logical replication is setup, any wal generation on any tables will result in replication lag. Since you are running a long running transaction on the master, the maximum number of changes kept in the memory per transaction is 4MB. If the transaction requires more than 4MB the changes are spilled to disk. This is when you will start seeing

1. Replication lag spiking

2. Storage being consumed

3. Restart lsn stops moving forward

You can confirm if the heavy write that you are talking about is spilling to disk or not by setting log_min_messges to debug 2. Try to find if the changes are spilled to disk.

To answer your question:

1. As long as the write heavy query is running on the database, you will not see restart lsn moving.

2. You will have to have smaller transactions

3. When the query is completed, you will see restart_lsn moving forward

On Tue, Aug 18, 2020 at 11:27 AM Satyam Shekhar <satyamshekhar@gmail.com> wrote:

Hello,

I wish to use logical replication in Postgres to capture transactions as CDC and forward them to a custom sink.

To understand the overhead of logical replication workflow I created a toy subscriber using the V3PGReplicationStream that acknowledges LSNs after every 16k reads by calling setAppliedLsn, setFlushedLsn, and forceUpdateState. The toy subscriber is set up as a subscriber for a master Postgres instance that publishes changes using a Publication. I then run a write-heavy workload on this setup that generates transaction logs at approximately 235MBps. Postgres is run on a beefy machine with a 10+GBps network link between Postgres and the toy subscriber.

My expectation with this setup was that the replication lag on master would be minimal as the subscriber acks the LSN almost immediately. However, I observe the replication lag to increase continuously for the duration of the test. Statistics in pg_replication_slots show that restart_lsn lags significantly behind the confirmed_flushed_lsn. Cursory reading on restart_lsn suggests that an increasing gap between restart_lsn and confirmed_flushed_lsn means that Postgres needs to reclaim disk space and advance restart_lsn to catch up to confirmed_flushed_lsn.

With that context, I am looking for answers for two questions -

1. What work needs to happen in the database to advance restart_lsn to confirmed_flushed_lsn?
2. What is the recommendation on tuning the database to improve the replication lag in such scenarios?

Regards,
Satyam

Re: Replication lag due to lagging restart_lsn

From

milist ujang

Date:

11 April 2023, 03:39:38

I have a case that master has been restarted many times, restart_lsn not moved since nov 2022 till today apr 2023.

I have tried pg_replication_slot_advance() but no luck :-(

postgres 12.8

1 master (4 publisher) many SR slaves, and 1 logical replication (4 subscribers)

Is there a chance to edit the state file under the pg_replslots folder?

On Wed, Aug 19, 2020 at 8:16 PM Kiran Singh <kiranjanarthan24@gmail.com> wrote:

When logical replication is setup, any wal generation on any tables will result in replication lag. Since you are running a long running transaction on the master, the maximum number of changes kept in the memory per transaction is 4MB. If the transaction requires more than 4MB the changes are spilled to disk. This is when you will start seeing

1. Replication lag spiking
2. Storage being consumed
3. Restart lsn stops moving forward

You can confirm if the heavy write that you are talking about is spilling to disk or not by setting log_min_messges to debug 2. Try to find if the changes are spilled to disk.

To answer your question:

1. As long as the write heavy query is running on the database, you will not see restart lsn moving.
2. You will have to have smaller transactions
3. When the query is completed, you will see restart_lsn moving forward

regards

ujang jaenudin | Self-Employed, DBA Consultant
http://id.linkedin.com/pub/ujang-jaenudin/12/64/bab

Re: Replication lag due to lagging restart_lsn

From

milist ujang

Date:

15 April 2023, 10:41:16

Finally got this useful tool:

https://github.com/EnterpriseDB/pg_failover_slots

Since I usually do scheduled switchover, this tool really helps a lot.

On Tue, Apr 11, 2023 at 10:39 AM milist ujang <ujang.milist@gmail.com> wrote:

I have a case that master has been restarted many times, restart_lsn not moved since nov 2022 till today apr 2023.
I have tried pg_replication_slot_advance() but no luck :-(

postgres 12.8
1 master (4 publisher) many SR slaves, and 1 logical replication (4 subscribers)

Is there a chance to edit the state file under the pg_replslots folder?

regards

ujang jaenudin | Self-Employed, DBA Consultant
http://id.linkedin.com/pub/ujang-jaenudin/12/64/bab