Re: Monitoring gaps in XLogWalRcvWrite() for the WAL receiver - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Monitoring gaps in XLogWalRcvWrite() for the WAL receiver
Date
Msg-id Z8gFnH4o3jBm5BRz@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to Monitoring gaps in XLogWalRcvWrite() for the WAL receiver  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Monitoring gaps in XLogWalRcvWrite() for the WAL receiver
Re: Monitoring gaps in XLogWalRcvWrite() for the WAL receiver
List pgsql-hackers
Hi,

On Wed, Mar 05, 2025 at 12:35:26PM +0900, Michael Paquier wrote:
> Hi all,
> 
> While doing some monitoring of a replication setup for a stable
> branch, I have been surprised by the fact that we have never tracked
> WAL statistics for the WAL receiver in pg_stat_wal because we have
> never bothered to update its code so as WAL stats are reported.

Nice catch!

> This
> is relevant for the write and sync counts and timings.

Also for sync? sync looks fine as issue_xlog_fsync() is being called in 
XLogWalRcvFlush(), no?

> As of f4694e0f35b2, the situation is better thanks to the addition of
> a pgstat_report_wal() in the WAL receiver main loop, so we have some
> data.  However, we are only able to gather the data for segment syncs
> and initializations, not the writes themselves as these are managed by
> an independent code path, XLogWalRcvWrite().
> 
> A second thing that lacks in XLogWalRcvWrite() is a wait event around
> the pg_pwrite() call, which is useful as the WAL receiver is listed in
> pg_stat_activity.  Note that it is possible to re-use the same wait
> event as XLogWrite() for the WAL receiver, WAL_WRITE, because the WAL
> receiver does not rely on the write and flush calls from xlog.c when
> doing its work, and both have the same meaning, aka they write WAL.
> The fsync calls use issue_xlog_fsync() and the segment inits happen in
> XLogFileInit().
> 
> Perhaps there's a point in backpatching a portion of what's in the
> attached patch (the wait event?), but I am not planning to bother much
> with the stable branches based on the lack of complaints.

We're not emitting some statistics, so I think that it's hard for users to
complain about something they don't/can't see.

> If you
> have an opinion about that, please feel free.

I'm tempted to say that the wal receiver part of f4694e0f35b2 should be
backpatched as well as what you're doing here.

+               /*
+                * Measure I/O timing to write WAL data, for pg_stat_io.
+                */
+               start = pgstat_prepare_io_time(track_wal_io_timing);
+
+               pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
                byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+               pgstat_report_wait_end();

Same logic as in XLogWrite() and I don't think there is a need for a 
dedicated wait event, so LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Re: Add regression test checking combinations of (object,backend_type,context) in pg_stat_io
Next
From: Corey Huinker
Date:
Subject: Re: Statistics Import and Export