Re: Adding wait events statistics - Mailing list pgsql-hackers

From: Bertrand Drouvot
Subject: Re: Adding wait events statistics
Msg-id: aIIeX7p2cKUO6KTa@ip-10-97-1-34.eu-west-3.compute.internal
In response to: Re: Adding wait events statistics (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Hi,

On Wed, Jul 23, 2025 at 11:38:07AM -0400, Robert Haas wrote:
> On Tue, Jul 22, 2025 at 8:24 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > So based on the cycles metric I think it looks pretty safe to implement for the
> > whole majority of classes.
> 
> I'm not convinced that this is either cheap enough to implement,

Thanks for looking at it and providing your thoughts!

> and I
> don't understand the value proposition, either. I see the first couple
> of messages in the thread say that this is important for
> troubleshooting problems in "our user base" (not sure about the
> antecedent of "our") but the description of what those problems are
> seems pretty vague to me.

Well, the idea was more: since we are talking about "wait" events, it seemed
natural to add their duration. And then, to make the durations meaningful to
interpret, it also makes sense to add counters. With both in place, one could
answer questions like these (a rough sketch follows the list):

* Is the engine's wait pattern the same over time?
* Is application "A"'s wait pattern the same over time?
* I observe a peak in wait event "W": is it because "W" is now waiting longer,
  or because it is being hit more frequently?
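
As a rough illustration of the last question (everything below is hypothetical:
it assumes the proposal exposed a cumulative view, here called
pg_stat_wait_events, with per-event "calls" and "total_wait_time" columns), one
could snapshot the view and compare the deltas:

CREATE TEMP TABLE wait_snap AS
    SELECT now() AS ts, * FROM pg_stat_wait_events;

-- ... some time later ...
SELECT w.wait_event_type,
       w.wait_event,
       w.calls - s.calls                        AS delta_calls,
       w.total_wait_time - s.total_wait_time    AS delta_time,
       (w.total_wait_time - s.total_wait_time)
           / NULLIF(w.calls - s.calls, 0)       AS avg_wait_per_call
FROM   pg_stat_wait_events w
JOIN   wait_snap s USING (wait_event_type, wait_event)
ORDER  BY delta_time DESC;

A growing avg_wait_per_call would mean "W" is waiting longer; a stable average
with a growing delta_calls would mean it is simply being hit more often.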

> I think it's probably a mistake to even be thinking about this in
> terms of wait events.

I think it feels natural to add "wait time" to "wait events"; that's where this
started.

> For
> instance, if you asked "what information is useful to gather about
> heavyweight locks?" you might say "well, we'd like to know how many
> times we tried to acquire one, and how many of those times we had to
> wait, and how many of those times we waited for longer than
> deadlock_timeout".

Yes.

> And then you could design a facility to answer
> those questions.

I think the current proposal would provide a way to get some of those answers
and a way to "estimate" some others. But designing a facility to answer all of
those would be better.

> Or you might say "we'd like a history of all the
> different heavyweight locks that a certain backend has tried to
> acquire," and then you could design a tracing facility to give you
> that. Either of those hypothetical facilities involve providing more
> information than you would get from just counting wait events, or
> counting+timing wait events, or recording a complete history of all
> wait events.

Agreed.

> And, I would say that, for more or less that exact reason, there's a
> real use case for either of them. Maybe either or both are valuable
> enough to add and maybe they are not, and maybe the overhead is
> acceptable or maybe it isn't, but I think the arguments are much
> better than for a facility that just counts and times all the wait
> events.

If we add more knowledge per wait class/wait event by providing "dedicated"
facilities, I agree that would provide more value than just counts and durations.

> For instance, the former facility would let you answer the
> question "what percentage of lock acquisitions are contended?" whereas
> a pure wait event count just lets you answer the question "how many
> contended lock acquisitions were there?".

Right.
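
For illustration, the difference could look like this (the "attempts" counter
and the view name are purely hypothetical, just to make the distinction
concrete):

SELECT lock_type,
       waits,                                         -- what a wait event count gives
       round(100.0 * waits / NULLIF(attempts, 0), 2)  AS pct_contended
FROM   pg_stat_lock_acquisitions;                     -- hypothetical view

The second column needs a count of the acquisitions that did not wait, which
pure wait event counting cannot provide.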

> The latter isn't useless,
> but the former is a lot better,

Agreed.

> I'm almost sure that measuring LWLock wait
> times is going to be too expensive to be practical,

In my lab it added about 60 cycles; I'm not sure that is too expensive. But even
if we decide it is, maybe we could provide an option to turn this overhead
on/off with a GUC or a compilation flag.
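
For what it's worth, that would be similar in spirit to how track_io_timing
gates I/O timing today; the GUC name below is purely hypothetical:

SET track_wait_event_timing = off;               -- per session
ALTER SYSTEM SET track_wait_event_timing = off;  -- instance wide
SELECT pg_reload_conf();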

> and I don't really
> see why you'd want to: the right approach is sampling, which is cheap
> and in my experience highly effective.

I think sampling is very useful too (that's why I created the pgsentinel
extension [1]). But sampling and the current proposal do not answer the same
questions: most (if not all) of the questions I mentioned above cannot be
answered accurately with a sampling approach.

> Measuring counts doesn't seem
> very useful either: knowing the number of times that somebody tried to
> acquire a relation lock or a tuple lock arguably tells you something
> about your workload that you might want to know, whereas I would argue
> that knowing the number of times that somebody tried to acquire a
> buffer lock doesn't really tell you anything at all.

If you add the duration to the mix, that's more useful. And if you also add the
buffer's relfilenode information, that's even more insightful.

One could spot hot buffers with that data in hand (see the sketch after the
example output below).

> What you might
> want to know is how many buffers you accessed, which is why we've
> already got a system for tracking that. That's not to say that there's
> nothing at all that somebody could want to know about LWLocks that you
> can't already find out today: for example, a system that identifies
> which buffers are experiencing significant buffer lock contention, by
> relation OID and block number, sounds quite handy.

Could not agree more.

> But just counting
> wait events, or counting and timing them, will not give you that.

Right, I had in mind a step-by-step approach: counters, timings, and then
providing more details per wait event, something like:

   pid   | wait_event_type |  wait_event  |                            infos
---------+-----------------+--------------+-------------------------------------------------------------
 2560105 | IO              | DataFileRead | {"blocknum" : "9272", "dbnode" : "5", "relnode" : "16407"}
 2560135 | IO              | WalSync      | {"segno" : "1", "tli" : "1"}
 2560138 | IO              | DataFileRead | {"blocknum" : "78408", "dbnode" : "5", "relnode" : "16399"}
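
For instance, with such details in hand (and assuming "infos" were exposed as
jsonb and accumulated somewhere queryable, say by a sampling extension; the
table and column names below are made up), spotting hot buffers could be as
simple as:

SELECT infos->>'relnode'   AS relnode,
       infos->>'blocknum'  AS blocknum,
       count(*)            AS samples
FROM   wait_event_history            -- hypothetical history of the above
WHERE  wait_event = 'DataFileRead'
GROUP  BY 1, 2
ORDER  BY samples DESC
LIMIT  10;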

But it also has its own challenges, as you and Andres pointed out on the
Discord server (in the PG #core-hacking channel).

> Knowing which SLRUs are being heavily used could also be useful, but I
> think there's a good argument that having some instrumentation that
> cuts across SLRU types and exposes a bunch of useful numbers for each
> could be more useful than just hoping you can figure it out from
> LWLock wait event counts.

Right, I'm all in for even more details.
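
For reference, pg_stat_slru already exposes per-SLRU usage counters, e.g.:

SELECT name, blks_hit, blks_read, blks_written, flushes, truncates
FROM   pg_stat_slru
ORDER  BY blks_read DESC;

though it has no wait or timing dimension.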

> In short, I feel like just counting, or counting+timing, all the wait
> events is painting with too broad a brush. Wait events get inserted
> for a specific purpose: so you can know why a certain backend is
> off-CPU without having to use debugging tools. They serve that purpose
> pretty well, but that doesn't mean that they serve other purposes
> well, and I find it kind of hard to see the argument that just
> sticking a bunch of counters or timers in the same places where we put
> the wait event calls would be the right thing in general.
> Introspection is an important topic and, IMHO, deserves much more
> specific and nuanced thought about what we're trying to accomplish and
> how we're going about it.

I see what you mean, and I also think that providing more instrumentation is an
important topic that deserves improvement.

As you pointed out, the current proposal was maybe too "generic", and wait
events may not be the best place to attach this kind of instrumentation/details.

If we could agree on extra details/instrumentation per area (like the few
examples you provided), then I would be happy to work on it: as said, I think
this is an area that deserves improvement.

Your approach would probably take longer, but OTOH it would surely be even more
meaningful.

[1]: https://github.com/pgsentinel/pgsentinel

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


