Re: Metric to calculate WAL size left to transfer to Standby - Mailing list pgsql-general

From Viral Shah
Subject Re: Metric to calculate WAL size left to transfer to Standby
Date
Msg-id CAKL_ojwYZDnCPh7P82OX5pdY6EnpLYFbFGZ2NOm+pra-kueLHA@mail.gmail.com
Whole thread Raw
In response to Re: Metric to calculate WAL size left to transfer to Standby  (Viral Shah <vshah@nodalexchange.com>)
List pgsql-general
Hello Laurenz,

The above metric works fine for my primary server. However, We have a cascading setup in our production system. This particular query doesn't work for my intermediate server which is standby to the primary server but also a master to one more standby server. We get the following error:
prod=# SELECT round(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / (1024 :: NUMERIC ^ 2), 2 ) AS replay_lag FROM pg_stat_replication where application_name like 'rtv%';
ERROR:  recovery is in progress
HINT:  WAL control functions cannot be executed during recovery.


I wanted to also track the transfer/replay lag between the intermediate server and the final standby. I narrowed down the problem with pg_current_wal_lsn() procedure. Any idea on how to mitigate this? or any alternative to the pg_current_wal_lsn() procedure that I can use to get transfer lag?

Thanks,
Viral Shah


On Fri, Apr 16, 2021 at 7:52 PM Viral Shah <vshah@nodalexchange.com> wrote:
Hello Laurenz,

Thank you so much for sending the query. It was exactly what I needed. I just made 1 modification to beautify the transfer and replay lag and I can see the size in bytes.

SELECT application_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS transfer_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
I am now using zabbix to constantly monitor them and notify myself if it breaches a certain threshold.

Thanks again!

Best,
Viral Shah
Nodal Exchange LLC


On Thu, Apr 15, 2021 at 8:10 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:
> We have a PostgreSQL 10.12 cluster of servers in two different data centers.
>  Off lately, in the case of a large WAL generation, we are seeing replication
>  delay between the master and the standby server. These delays have off lately
>  been there for an unusually long time. I was wondering if we have any metric
>  that can calculate the amount (size) of WAL transfer left between master and
>  standby?
>
> PS: We have ensured we have upgraded our firewalls for better speed transfer.
>
> Any help on how to figure out the slowness in the WAL transfer would be much appreciated.

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;

If both are delayed, it might be that the network cannot cope.

If only the second number is delayed, you have replication conflicts
with queries on the standby.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: trigger impacting insertion of records
Next
From: Venkata B Nagothi
Date:
Subject: Re: Postgres upgrade 12 - issues with OIDs