Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes - Mailing list pgsql-hackers

From Ashwin Agrawal
Subject Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
Date
Msg-id CAKSySwfaXPtmGiJ_m9tmVGpuK9-VQ3T_j=wLuKd-tuo=UCCSnA@mail.gmail.com
List pgsql-hackers
On Wed, Dec 22, 2021 at 4:23 PM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
Hi Hackers,

I am considering implementing an RPO (recovery point objective) enforcement feature for Postgres, where WAL writes on the primary are stalled when the WAL distance between the primary and the standby exceeds a configured threshold (replica_lag_in_bytes). This feature is particularly useful in disaster recovery setups where the primary and standby are in different regions and synchronous replication can't be set up for latency and performance reasons, yet some level of RPO enforcement is still required.

The idea here is to calculate the lag between the primary and the standby (Async?) server during XLogInsert and block the caller until the lag is less than the threshold value. We can calculate the max lag by iterating over ReplicationSlotCtl->replication_slots. If this is not something we want to do in core, at least adding a hook for XLogInsert would be of great value.
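To make the lag check concrete, here is a minimal standalone sketch of the "max lag over all slots" computation. It does not use PostgreSQL's real structures: DemoSlot, WalPosition, demo_max_lag_bytes and demo_should_throttle are hypothetical names invented for illustration, and treating a threshold of 0 as "throttling disabled" is an assumption, not something proposed above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's definitions. */
typedef uint64_t WalPosition;       /* byte position in the WAL stream */

typedef struct DemoSlot
{
    bool        in_use;             /* is this slot currently allocated? */
    WalPosition confirmed_pos;      /* WAL position the standby has confirmed */
} DemoSlot;

/*
 * Return the largest byte distance between the primary's insert position
 * and the confirmed position of any in-use slot. Slots that are unused or
 * already caught up contribute no lag.
 */
static uint64_t
demo_max_lag_bytes(const DemoSlot *slots, size_t nslots, WalPosition insert_pos)
{
    uint64_t    max_lag = 0;

    for (size_t i = 0; i < nslots; i++)
    {
        uint64_t    lag;

        if (!slots[i].in_use || slots[i].confirmed_pos >= insert_pos)
            continue;

        lag = insert_pos - slots[i].confirmed_pos;
        if (lag > max_lag)
            max_lag = lag;
    }
    return max_lag;
}

/*
 * Decide whether a WAL insert should wait. Assumption: a threshold of 0
 * means throttling is disabled.
 */
static bool
demo_should_throttle(const DemoSlot *slots, size_t nslots,
                     WalPosition insert_pos, uint64_t replica_lag_in_bytes)
{
    if (replica_lag_in_bytes == 0)
        return false;
    return demo_max_lag_bytes(slots, nslots, insert_pos) > replica_lag_in_bytes;
}
```

In a real implementation the caller would presumably sleep and re-check in a loop while the function returns true; locking around the slot array and interaction with critical sections are deliberately left out of this sketch.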

A few other scenarios I can think of with the hook are:
  1. Enforcing RPO as described above
  2. Enforcing a rate limit and slow throttling when a sync standby is falling behind (could be flush lag or replay lag)
  3. Transactional log rate governance - useful for cloud providers to provide SKU sizes based on allowed WAL writes.
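As an illustration of scenario 3, a hook could consult something like a token bucket before allowing a WAL insert. This is only a sketch of one common rate-limiting technique, not a proposed design: DemoWalBucket and the demo_bucket_* functions are hypothetical, and the clock is passed in explicitly (in microseconds) so the example stays deterministic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token bucket for WAL bytes; illustration only. */
typedef struct DemoWalBucket
{
    uint64_t    rate_bytes_per_sec; /* sustained WAL rate allowed */
    uint64_t    burst_bytes;        /* maximum tokens the bucket can hold */
    uint64_t    tokens;             /* bytes currently available */
    uint64_t    last_refill_us;     /* time up to which tokens were credited */
} DemoWalBucket;

static void
demo_bucket_init(DemoWalBucket *b, uint64_t rate, uint64_t burst, uint64_t now_us)
{
    b->rate_bytes_per_sec = rate;
    b->burst_bytes = burst;
    b->tokens = burst;              /* start with a full bucket */
    b->last_refill_us = now_us;
}

/* Credit tokens for the time elapsed since the last refill. */
static void
demo_bucket_refill(DemoWalBucket *b, uint64_t now_us)
{
    uint64_t    elapsed_us;
    uint64_t    refill;

    if (b->rate_bytes_per_sec == 0 || now_us <= b->last_refill_us)
        return;

    /* Note: this product can overflow for extreme rates or long idle gaps. */
    elapsed_us = now_us - b->last_refill_us;
    refill = (b->rate_bytes_per_sec * elapsed_us) / 1000000u;
    if (refill == 0)
        return;                     /* keep the timestamp so fractions accumulate */

    if (refill >= b->burst_bytes - b->tokens)
    {
        b->tokens = b->burst_bytes; /* bucket is full; discard the excess */
        b->last_refill_us = now_us;
    }
    else
    {
        b->tokens += refill;
        /* Advance only by the time that was actually converted to tokens. */
        b->last_refill_us += (refill * 1000000u) / b->rate_bytes_per_sec;
    }
}

/* Return true and charge the bucket if nbytes of WAL may be written now. */
static bool
demo_bucket_try_consume(DemoWalBucket *b, uint64_t nbytes, uint64_t now_us)
{
    demo_bucket_refill(b, now_us);
    if (b->tokens < nbytes)
        return false;
    b->tokens -= nbytes;
    return true;
}
```

A caller that gets false back would wait and retry; a record larger than burst_bytes could never pass in this sketch, so a real design would need a policy for oversized records.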
Thoughts?

A very similar requirement was discussed in the past in [1]: not exactly RPO enforcement, but a large bulk operation/transaction negatively impacting concurrent transactions due to replication lag.
It would be good to refer to that thread, as it explains the challenges of implementing the functionality mentioned in this thread. The main challenge is that there is no common place to put the throttling logic, so calls have to be sprinkled around in various parts of the code.
