Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls - Mailing list pgsql-hackers

From Shinya Kato
Subject Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls
Msg-id CAOzEurTw-Q2q9K+HFsD5nxibrb6n7vKz5xevWFrThGCKpGx0Wg@mail.gmail.com
In response to Fix lag columns in pg_stat_replication not advancing when replay LSN stalls  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
Hi,

On Fri, Oct 17, 2025 at 12:57 PM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> Hi,
>
> While testing, I noticed that write_lag and flush_lag in pg_stat_replication
> initially advanced but eventually stopped updating. This happened when
> I started pg_receivewal, ran pgbench, and periodically monitored
> pg_stat_replication.

Nice catch! I reproduced the same issue.

>
> My analysis shows that this issue occurs when any of the write, flush,
> or replay LSNs in the standby’s feedback message stop updating for some time.
> In the case of pg_receivewal, the replay LSN is always invalid (never updated),
> which triggers the problem. Similarly, in regular streaming replication,
> if the replay LSN remains unchanged for a long time—such as during
> a recovery conflict—the lag values for both write and flush can stop advancing.
>
> The root cause seems to be that when any of the LSNs stop updating,
> the lag tracker's cyclic buffer becomes full (the write head reaches
> the slowest read head). In this situation, LagTrackerWrite() and
> LagTrackerRead() didn't handle the full-buffer condition properly.
> For instance, if the replay LSN stalls, the buffer fills up and the read heads
> for "write" and "flush" end up at the same position as the write head.
> This causes LagTrackerRead() to return -1 for both, preventing write_lag
> and flush_lag from advancing.
>
> The attached patch fixes the problem by treating the slowest read entry
> (the one causing the buffer to fill up) as a separate overflow entry,
> allowing the lag tracker to continue operating correctly.

Thank you for the patch. I have one comment.

+       if (lag_tracker->overflowed[head].lsn > lsn)
+           return now - lag_tracker->overflowed[head].time;

Could this return a negative value if the clock somehow went
backwards, so that the stored time is later than now? The original
code returns -1 in that case, so I wonder whether this path should
do the same.
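
A minimal sketch of the kind of guard I have in mind, assuming the timestamps are 64-bit microsecond counts. The typedefs below are stand-ins for the PostgreSQL types, and elapsed_or_invalid() is a hypothetical helper, not something in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's TimestampTz and TimeOffset (assumed int64). */
typedef int64_t TimestampTz;
typedef int64_t TimeOffset;

/*
 * Hypothetical helper: return the elapsed time since sample_time, or -1
 * if sample_time is in the future (the clock went backwards).
 */
static TimeOffset
elapsed_or_invalid(TimestampTz sample_time, TimestampTz now)
{
	assert(sample_time != 0);	/* 0 would mean no sample was recorded */

	if (sample_time > now)
		return -1;
	return now - sample_time;
}
```

The overflow branch could then return the result of such a check instead of the raw difference, so callers never see a negative lag other than -1.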

--
Best regards,
Shinya Kato
NTT OSS Center


