Replication lag due to lagging restart_lsn - Mailing list pgsql-performance

From Satyam Shekhar
Subject Replication lag due to lagging restart_lsn
Msg-id CAAy_rtEP_CroVy4Gvcu3HmHxzRTKtYLC2JwNWSdsOPAsvMEyBQ@mail.gmail.com
Responses Re: Replication lag due to lagging restart_lsn  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: Replication lag due to lagging restart_lsn  (Kiran Singh <kiranjanarthan24@gmail.com>)
List pgsql-performance
Hello,

I wish to use logical replication in Postgres to capture transactions as a change data capture (CDC) stream and forward them to a custom sink.

To understand the overhead of the logical replication workflow, I created a toy subscriber using the V3PGReplicationStream that acknowledges LSNs after every 16k reads by calling setAppliedLSN, setFlushedLSN, and forceUpdateStatus. The toy subscriber is set up as a subscriber for a master Postgres instance that publishes changes using a publication. I then ran a write-heavy workload on this setup that generates transaction logs at approximately 235 MB/s. Postgres runs on a beefy machine with a 10+ GBps network link between Postgres and the toy subscriber.

My expectation with this setup was that the replication lag on master would be minimal, as the subscriber acks the LSN almost immediately. However, I observe the replication lag increasing continuously for the duration of the test. Statistics in pg_replication_slots show that restart_lsn lags significantly behind confirmed_flush_lsn. A cursory reading on restart_lsn suggests that an increasing gap between restart_lsn and confirmed_flush_lsn means that Postgres needs to reclaim disk space and advance restart_lsn to catch up to confirmed_flush_lsn.
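
As a rough illustration of how the gap between the two LSNs could be quantified, here is a minimal Python sketch. It assumes the LSNs are read as text from pg_replication_slots in the usual `high/low` hexadecimal form. The helper names and the sample values are hypothetical and are not taken from the test described above.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit byte position.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both written in hexadecimal.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_gap_bytes(confirmed_flush_lsn: str, restart_lsn: str) -> int:
    """Return how many bytes of WAL lie between restart_lsn and confirmed_flush_lsn."""
    return lsn_to_int(confirmed_flush_lsn) - lsn_to_int(restart_lsn)


# Made-up sample values, for illustration only.
gap = lsn_gap_bytes("2/40000000", "1/C0000000")
print(f"restart_lsn is {gap / 1024**2:.0f} MiB behind confirmed_flush_lsn")
```

The same difference can be computed on the server with pg_wal_lsn_diff(confirmed_flush_lsn, restart_lsn). Sampling it periodically during the run would show whether the gap grows steadily or in steps.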

With that context, I am looking for answers to two questions:

1. What work needs to happen in the database to advance restart_lsn to confirmed_flush_lsn?
2. What is the recommended way to tune the database to reduce the replication lag in such scenarios?

Regards,
Satyam
