Thread: Metric to calculate WAL size left to transfer to Standby

Metric to calculate WAL size left to transfer to Standby

From
Viral Shah
Date:
Hello All,

We have a PostgreSQL 10.12 cluster of servers in two different data centers. Off lately, in the case of a large WAL generation, we are seeing replication delay between the master and the standby server. These delays have off lately been there for an unusually long time. I was wondering if we have any metric that can calculate the amount (size) of WAL transfer left between master and standby?

PS: We have ensured we have upgraded our firewalls for better speed transfer.

Any help on how to figure out the slowness in the WAL transfer would be much appreciated.

Thanks,
Viral Shah

Re: Metric to calculate WAL size left to transfer to Standby

From
Laurenz Albe
Date:
On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:
> We have a PostgreSQL 10.12 cluster of servers in two different data centers.
>  Off lately, in the case of a large WAL generation, we are seeing replication
>  delay between the master and the standby server. These delays have off lately
>  been there for an unusually long time. I was wondering if we have any metric
>  that can calculate the amount (size) of WAL transfer left between master and
>  standby?
> 
> PS: We have ensured we have upgraded our firewalls for better speed transfer.
> 
> Any help on how to figure out the slowness in the WAL transfer would be much appreciated.

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;

If both are delayed, it might be that the network cannot cope.

If only the second number is delayed, you have replication conflicts
with queries on the standby.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Metric to calculate WAL size left to transfer to Standby

From
Viral Shah
Date:
Hello Laurenz,

Thank you so much for sending the query. It was exactly what I needed. I just made 1 modification to beautify the transfer and replay lag and I can see the size in bytes.

SELECT application_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS transfer_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
I am now using zabbix to constantly monitor them and notify myself if it breaches a certain threshold.

Thanks again!

Best,
Viral Shah
Nodal Exchange LLC


On Thu, Apr 15, 2021 at 8:10 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:
> We have a PostgreSQL 10.12 cluster of servers in two different data centers.
>  Off lately, in the case of a large WAL generation, we are seeing replication
>  delay between the master and the standby server. These delays have off lately
>  been there for an unusually long time. I was wondering if we have any metric
>  that can calculate the amount (size) of WAL transfer left between master and
>  standby?
>
> PS: We have ensured we have upgraded our firewalls for better speed transfer.
>
> Any help on how to figure out the slowness in the WAL transfer would be much appreciated.

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;

If both are delayed, it might be that the network cannot cope.

If only the second number is delayed, you have replication conflicts
with queries on the standby.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Re: Metric to calculate WAL size left to transfer to Standby

From
Viral Shah
Date:
Hello Laurenz,

The above metric works fine for my primary server. However, We have a cascading setup in our production system. This particular query doesn't work for my intermediate server which is standby to the primary server but also a master to one more standby server. We get the following error:
prod=# SELECT round(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / (1024 :: NUMERIC ^ 2), 2 ) AS replay_lag FROM pg_stat_replication where application_name like 'rtv%';
ERROR:  recovery is in progress
HINT:  WAL control functions cannot be executed during recovery.


I wanted to also track the transfer/replay lag between the intermediate server and the final standby. I narrowed down the problem with pg_current_wal_lsn() procedure. Any idea on how to mitigate this? or any alternative to the pg_current_wal_lsn() procedure that I can use to get transfer lag?

Thanks,
Viral Shah


On Fri, Apr 16, 2021 at 7:52 PM Viral Shah <vshah@nodalexchange.com> wrote:
Hello Laurenz,

Thank you so much for sending the query. It was exactly what I needed. I just made 1 modification to beautify the transfer and replay lag and I can see the size in bytes.

SELECT application_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS transfer_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
I am now using zabbix to constantly monitor them and notify myself if it breaches a certain threshold.

Thanks again!

Best,
Viral Shah
Nodal Exchange LLC


On Thu, Apr 15, 2021 at 8:10 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:
> We have a PostgreSQL 10.12 cluster of servers in two different data centers.
>  Off lately, in the case of a large WAL generation, we are seeing replication
>  delay between the master and the standby server. These delays have off lately
>  been there for an unusually long time. I was wondering if we have any metric
>  that can calculate the amount (size) of WAL transfer left between master and
>  standby?
>
> PS: We have ensured we have upgraded our firewalls for better speed transfer.
>
> Any help on how to figure out the slowness in the WAL transfer would be much appreciated.

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;

If both are delayed, it might be that the network cannot cope.

If only the second number is delayed, you have replication conflicts
with queries on the standby.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com