Re: Adding wait events statistics - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Adding wait events statistics
Date
Msg-id aGvioIC2XpLzACy+@ip-10-97-1-34.eu-west-3.compute.internal
In response to Adding wait events statistics  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
List pgsql-hackers
Hi,

On Mon, Jul 07, 2025 at 02:56:29PM +0900, Michael Paquier wrote:
> On Mon, Jun 30, 2025 at 01:36:12PM +0000, Bertrand Drouvot wrote:

> >  * Do we need to serialize the stats based on their names (as for
> >  PGSTAT_KIND_REPLSLOT)? This question is the same as "is the ordering preserved
> >  if file stats format is not changed": I think the answer is yes (see f98dbdeb51d)
> >  , which means there is no need for to_serialized_name/from_serialized_name.
> 
> The serialization is important for the custom wait events, much less
> for wait events related to injection points (where we don't really
> care) which could be filtered out when reporting the stats.  It seems
> to me that associating the names (perhaps as $CLASS/$NAME on disk as a
> single string, two strings with two lengths are perhaps better) is
> going to be needed.  When the stats are read, the wait event ID could
> be allocated from the startup process in the from_serialized_name
> callback associated to the new stats kind, before extensions would try
> to allocate it.

Thanks for looking at it!

Yeah, I think this would be needed for the custom wait events. But do we need
to handle the custom wait events case at all? I think extensions already have
everything they need at their disposal to increment their own counters when
their custom wait events are triggered.

I think the question is: if the extension owner does not increment it, do we want
to do it in core on their behalf? I'm not sure, as it's still up to them to make
use of custom wait events: some extensions will, some won't, making the "in core"
counters inconsistent depending on the extension. That could lead end users to
ask why they see counters for some extensions and not for others.

I also have the feeling that adding this extra complexity for (I think) a really
small number of cases could impact the "in core" wait events stats negatively
(from a complexity and maybe a performance point of view).

I was thinking of starting with just the "in core" wait events and seeing whether
there is a need/ask to expand to custom wait events. Thoughts?

> >  * What if a new wait event is added? We'd need to change the stats file format,
> >  unless: the wait event stats kind becomes a variable one or we change a bit the
> >  way fixed stats are written/read to/from the stat file (we could add a new field
> >  in the PgStat_KindInfo for example).
> 
> I am pretty sure that everybody is going to forget about this
> dependency, myself included.  Please note that we add wait events in
> stable branches, sometimes as a portion of a bug fix, and we cannot
> bump the version of the stats file or stats would be lost on restart.
> So these stats should be variable,

Agreed, it's better to go the "variable" stats kind way.

> For the stats waiting to be flushed, you could
> allocate a single array of WAIT_EVENT_CUSTOM_HASH_MAX_SIZE elements in
> the TopMemoryContext of each backend, as we know that the number of
> custom wait events is never going to be higher than that, indexed by
> ID assigned.

That's an idea, but I'm not 100% sure we need to care about custom wait events
for now (see above).

> It may be simpler to allocate a series of arrays, one for each class,
> as well, sized depending on the wait event enum sizes.  These could be
> indexed based on the enums generated in wait_event_types.h, minus
> their initial value (for the array of events in WaitEventTimeout,
> minus PG_WAIT_TIMEOUT needs to be applied).  Deriving the contents
> that you need for the sizing and indexing of the pending data based on
> WaitClassTableEntry seems indeed necessary at some degree.

I'm not sure that would be simpler (or maybe it would simplify the perl
script a bit), but I think it might complicate the C code. I mean, given
a wait_event_info we'd need to know in which array to increment the related wait
event counter, so (at least) a switch on the wait class would be needed (and
that would probably need to be generated by the perl script, so I'm not sure
it would simplify the perl script after all).

Also, in pg_stat_get_wait_event(), we'd need to iterate over all the arrays, so
we'd need to know and identify them.
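To make the concern concrete, the per-class dispatch described above could look
roughly like the sketch below. This is purely illustrative: the class masks
follow the wait_event_info layout (class in the high byte, event ID in the low
16 bits), but the array names, sizes, and the `counter_slot()` helper are made
up for this example, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative class constants, matching the high-byte encoding. */
#define PG_WAIT_LWLOCK   0x01000000U
#define PG_WAIT_TIMEOUT  0x09000000U

/* One pending-counter array per wait class (sizes are arbitrary here). */
static uint64_t lwlock_counters[64];
static uint64_t timeout_counters[16];

/*
 * Return a pointer to the counter slot for this wait event, or NULL if
 * the class is not tracked: this is the switch on the wait class that
 * the mail says would have to exist (and be generated).
 */
static uint64_t *
counter_slot(uint32_t wait_event_info)
{
    uint32_t classId = wait_event_info & 0xFF000000U;
    uint32_t eventId = wait_event_info & 0x0000FFFFU;

    switch (classId)
    {
        case PG_WAIT_LWLOCK:
            return &lwlock_counters[eventId];
        case PG_WAIT_TIMEOUT:
            return &timeout_counters[eventId];
        default:
            return NULL;    /* class not tracked in this sketch */
    }
}
```

Reporting would then need a second loop over the same set of arrays, which is
the duplication being pointed out.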

> your proposal feels overcomplicated, and that the names may not be
> necessary as we already have calls able to do class name and event
> name lookups based on the uint32 class/name values.

I started without the names in the array (making use of pgstat_get_wait_event()),
but that did not work well for LWLOCK. Indeed, for LWLOCK we have "holes" in the
eventID numbering (see lwlocklist.h), which is why some names for PG_WAIT_LWLOCK
in WaitClassTable[] are NULL. So, instead of filling the non-NULL slots with
placeholder content, I thought it would be fine to store the names there (and
avoid calling pgstat_get_wait_event()).
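A hedged illustration of the "holes" problem: a generated per-event name table
for PG_WAIT_LWLOCK can contain NULL entries wherever lwlocklist.h skips an
eventID, so any lookup has to tolerate them. The names and hole positions below
are invented for the example and do not reflect the actual table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Invented per-event name table with a hole in the eventID numbering. */
static const char *lwlock_event_names[] = {
    "ExampleLockA",
    NULL,               /* hole: this eventID is unused in lwlocklist.h */
    "ExampleLockB",
};

/* Return the stored name, or NULL for a hole or out-of-range ID. */
static const char *
lwlock_event_name(size_t eventId)
{
    size_t n = sizeof(lwlock_event_names) / sizeof(lwlock_event_names[0]);

    if (eventId >= n)
        return NULL;
    return lwlock_event_names[eventId];
}
```

Storing the names directly keeps the table self-describing instead of requiring
a pgstat_get_wait_event() call for every slot, including ones that map to
nothing.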

> This is not new.  This limitation exists for some auxiliary processes
> or processes within their main loops.  As long as we have APIs that
> can be called to do the flushes, IMO we are fine, like we do for WAL,
> IO, etc.

I do agree. My point was just that I have not added those to the patch series yet.

> > It might be better for PGSTAT_KIND_WAIT_EVENT to be a variable sized stats kind.
> > That way:
> > 
> > * It would be easier to add custom wait events if we want to
> > * It would be possible to add new wait events without having to change the stats
> > file format
> > 
> > So adding 0004 to see what it would look like to have a variable sized stats kind
> > instead and decide how we want to move forward. 
> 
> IMO, I think that we should just do that.  ABI compatibility and
> custom wait events make fixed-sized stats unfit with what you are
> trying to achieve here.

Yeah, the next version will do that. Just waiting on an agreement about:

- custom wait events
- whether to split the array into multiple arrays

before sharing v2.

> > That said it might be better to use all the 64 bits (means not have the half full
> > of zeroes) for the hash key (better hashing distribution?): we could imagine
> > using something like:
> > 
> > ((uint64) wait_event_info) | (((uint64) wait_event_info) << 32)
> > 
> > for the hash key.
> 
> Using uint32 for the hash key is fine.  At least it's not an issue for
> the pgstats machinery.

Yeah, my doubt was more about the hashing distribution if all the keys have
their upper 32 bits set to zero (which is what using a uint32 key would
produce).
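For reference, the two key shapes under discussion can be written out as below.
This is a sketch of the arithmetic only; the helper names are invented, and
whether the duplication actually improves distribution depends on the hash
function the pgstats machinery applies to the key.

```c
#include <assert.h>
#include <stdint.h>

/* Option 1: plain uint32 key zero-extended to 64 bits.
 * The upper 32 bits of every key are identical (all zero). */
static uint64_t
key_zero_extended(uint32_t wait_event_info)
{
    return (uint64_t) wait_event_info;
}

/* Option 2: the expression quoted above, duplicating the 32 bits into
 * both halves so all 64 key bits carry entropy. */
static uint64_t
key_duplicated(uint32_t wait_event_info)
{
    return ((uint64_t) wait_event_info) |
           (((uint64_t) wait_event_info) << 32);
}
```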

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


