Re: [PATCH] Expose checkpoint timestamp and duration in pg_stat_checkpointer - Mailing list pgsql-hackers

From Álvaro Herrera
Subject Re: [PATCH] Expose checkpoint timestamp and duration in pg_stat_checkpointer
Msg-id 202511240955.vt3fjrb4ksrs@alvherre.pgsql
In response to Re: [PATCH] Expose checkpoint timestamp and duration in pg_stat_checkpointer  (Michael Banck <mbanck@gmx.net>)
Responses Re: [PATCH] Expose checkpoint timestamp and duration in pg_stat_checkpointer
List pgsql-hackers
On 2025-Nov-24, Michael Banck wrote:

> In general I doubt how much those gauges (as opposed to counters) only
> pertaining to the last checkpoint are useful in pg_stat_checkpointer.
> What would be the use case for those two values?

I think it's useful to know how long the checkpointer has to work.  It's a bit
lame to have only one duration (the last one), but at least with this
arrangement you can have external monitoring software connect to the
server, extract that value and save it somewhere else.  Monitoring
systems do this all the time, and we've been waiting for a better
implementation to store monitoring data inside Postgres for years.  I
think we shouldn't block this proposal just because of this issue,
because it can clearly be useful.
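A collector along those lines might look like the minimal Python sketch below. It assumes the server exposes the last checkpoint's end timestamp and duration as two values; how they are named and typed is hypothetical here, since that is what this patch would define. Each distinct checkpoint is recorded once, keyed by its timestamp, which also makes the limitation of a single "last" value visible: a checkpoint that starts and finishes between two polls is never seen.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


class LastCheckpointRecorder:
    """Keeps a history built from 'last checkpoint' gauge readings.

    The gauge only describes the most recent checkpoint, so each distinct
    checkpoint is stored once, keyed by its (hypothetical) timestamp.
    Checkpoints completed entirely between two polls are lost.
    """

    def __init__(self) -> None:
        self.history: List[Tuple[datetime, timedelta]] = []
        self._last_seen: Optional[datetime] = None

    def poll(self, ts: Optional[datetime],
             duration: Optional[timedelta]) -> bool:
        """Store one reading; return True only if it is a new checkpoint."""
        if ts is None or duration is None or ts == self._last_seen:
            return False  # no checkpoint yet, or same one as last poll
        self._last_seen = ts
        self.history.append((ts, duration))
        return True
```

Polling more often than checkpoints occur keeps the loss small, but it cannot be eliminated as long as the server only remembers one checkpoint.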

However, I'm not sure I'm very interested in knowing only the duration
of the checkpoint.  I mean, much of the time the duration is going to be
whatever fraction of the checkpoint timeout you have as
checkpoint_completion_target, right?  Which includes sleeps.  So I think
you really want two durations: one is the duration itself, and the other
is what fraction of that did the checkpointer sleep in order to achieve
that duration.  So you know how much time the checkpointer spent trying to
get the operating system to do stuff rather than just sit there waiting.
We already have that data, kinda, in write_time and sync_time, but those
are cumulative rather than just for the last one.  (I guess you can have
the monitoring system compute the deltas as it finds each new
checkpoint.)  I'm not sure how good this system is.
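The delta computation could be sketched in Python roughly as follows. It assumes the monitoring system periodically samples a count of completed checkpoints together with the cumulative write_time and sync_time columns (which pg_stat_checkpointer reports in milliseconds); the sample field names are hypothetical. When exactly one checkpoint finished between two samples, the deltas describe that checkpoint; otherwise only a per-checkpoint average is available. expected_spread_s() merely encodes the rule of thumb above that a timed checkpoint is paced to take about checkpoint_timeout * checkpoint_completion_target.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckpointerSample:
    """One poll of cumulative counters (field names are hypothetical)."""
    checkpoints: int      # count of completed checkpoints
    write_time_ms: float  # cumulative write_time, milliseconds
    sync_time_ms: float   # cumulative sync_time, milliseconds


@dataclass(frozen=True)
class CheckpointDelta:
    checkpoints: int      # checkpoints completed between the two samples
    avg_write_ms: float
    avg_sync_ms: float


def checkpoint_delta(prev: CheckpointerSample,
                     cur: CheckpointerSample) -> Optional[CheckpointDelta]:
    """Turn two cumulative samples into per-checkpoint averages.

    Returns None when no checkpoint completed in between, or when a
    counter went backwards (for example after a statistics reset).
    """
    n = cur.checkpoints - prev.checkpoints
    dw = cur.write_time_ms - prev.write_time_ms
    ds = cur.sync_time_ms - prev.sync_time_ms
    if n <= 0 or dw < 0 or ds < 0:
        return None
    return CheckpointDelta(n, dw / n, ds / n)


def expected_spread_s(checkpoint_timeout_s: float,
                      completion_target: float) -> float:
    """Approximate time a timed checkpoint is paced to take, in seconds."""
    return checkpoint_timeout_s * completion_target
```

With the defaults (checkpoint_timeout = 5min, checkpoint_completion_target = 0.9), expected_spread_s(300, 0.9) gives 270 seconds, so a time-driven checkpoint lasting about that long says little on its own; the write/sync split is where the signal would be.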

In the past, I looked at a couple of monitoring dashboards offered by
cloud vendors, searching for anything valuable in terms of checkpoints.
What I saw was very disappointing -- mostly just "how many checkpoints
per minute", which is mostly flat zero with periodic spikes.  Totally
useless.  Does anybody know if some vendor has good charts for this?
Also, if we were to add this new proposed duration, how could these
charts improve?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it."         (ncm, http://lwn.net/Articles/174769/)


