Re: Adding wait events statistics - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Adding wait events statistics
Msg-id aG94JEEWAebZoAtW@ip-10-97-1-34.eu-west-3.compute.internal
In response to Re: Adding wait events statistics  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On Wed, Jul 09, 2025 at 11:19:00AM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-07-09 07:26:05 +0000, Bertrand Drouvot wrote:
> > On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> > > On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> 
> TBH, the more I think about this, the less convinced a pure counter provides
> *any* additional value over stats we already have, in which case the overhead
> is moot.

I think there are a lot of wait events that don't have a stats counterpart: the
Controlfile ones, some of the DataFile ones (prefetch..), the Dsm ones, to name a few.

> We commonly use wait events unconditionally around system calls, e.g.:
> 
>     pgstat_report_wait_start(wait_event_info);
>     returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
>     pgstat_report_wait_end();
> 
> A counter for the number of wait events is just a less informative version of
> what we already have in pg_stat_io. There's literally no value provided by
> tracking the number of times that wait event is hit.

For this particular one I do agree; IO is kind of an exception here.

> A pretty large number of wait events operate that way. For those the
> interesting information would be to measure *how long* you are in the wait
> event, not the number of times the wait event was entered for a very short
> amount of time.

The counters on the wait events could be seen as "system/engine stats" rather than
as wait events per se. That could help describe the engine's behavior over time.

> The obvious counter to that is that you likely are planning to add timing
> information at some point down the line

Yes, and then the counters also become even more valuable. Also, timings without
counters would not make that much sense, so counters are kind of a mandatory
first step.

> - leaving overhead aside, that would
> *also* just duplicate information for the above case, namely track_io_timing.

Agreed for this particular case, but we currently have 273 in-core wait events.

> The wait events we have as-is are useful because they tell you what's
> currently happening, and you can correlate that with other things that are
> going on. But a pure counter of wait events encountered doesn't allow for
> that.

Yes, I think the pure counters could/should be seen as engine statistics instead,
and they become even more valuable once we add the timings.

> I think adding 6 conditional branches to this is just out of question...

This has to be optimized, I do agree.

> > > I think to make this viable, we would have to remove some wait events and be
> > > more careful about adding them.
> >
> > You mean incrementing the counters for wait events for which we expect that
> > the increment is "negligible" as compare to the wait event itself? (Like DATAFILE
> > IO and such)?
> 
> I mean that we would need to either *remove* a bunch of wait events, or change
> the API so that there are "cheap" wait events that just have the current
> overhead, and expensive wait events that do this tracking.

I had the latter in mind, as it's still good to know what the engine is doing/waiting
for even if that's "only" at a regular sampling interval (we would lose that
information if we were to *remove* a bunch of wait events).
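
To make this concrete, here is a rough sketch of what the "expensive" side of
such an API split could look like (pgstat_report_wait_start_tracked() and
waitEventIncrementCounter() are hypothetical names, just to illustrate the idea):

#include "utils/wait_event.h"

/* hypothetical helper bumping a backend-local pending counter */
extern void waitEventIncrementCounter(uint32 wait_event_info);

/*
 * Hypothetical "expensive" variant: advertises the wait event exactly as
 * pgstat_report_wait_start() does today, and additionally counts it.
 */
static inline void
pgstat_report_wait_start_tracked(uint32 wait_event_info)
{
    pgstat_report_wait_start(wait_event_info);
    waitEventIncrementCounter(wait_event_info);
}

Call sites considered worth tracking would move to the _tracked variant, while
hot paths like the pg_preadv() one above would keep the existing cheap call
unchanged.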

> > If so I agree and that's something I also have discussed with a few (also if we
> > add the timings later), that we could/should be more granular. If we go that way
> > we could also imagine a GUC to set the stats for all of them? (default being
> > off).
> 
> Any conditional code that you add makes this stuff more expensive...

Yes, or it could be a compile-time flag to avoid the condition being hit at
runtime. That said, I like your API change idea: we could decide which code
paths use the tracking and which do not.
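
For instance, something along these lines (WAIT_EVENT_STATS being a hypothetical
compile-time symbol):

/*
 * Hypothetical compile-time switch: when WAIT_EVENT_STATS is not defined, the
 * tracking call compiles away entirely, so no runtime branch is emitted.
 */
#ifdef WAIT_EVENT_STATS
#define WAIT_EVENT_TRACK(wait_event_info) \
    waitEventIncrementCounter(wait_event_info)
#else
#define WAIT_EVENT_TRACK(wait_event_info) ((void) 0)
#endif

The downside being that changing the behavior would require a rebuild, while the
API change idea keeps the decision per call site.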

> That's still way too many jumps.

Yes, I did not try to optimize that part yet. The above idea would certainly help
though; we could probably remove most of the jumps.

> > That looks better before the first branch (could be done in v1 too) as it uses
> > registers only but looks worse when it's time to increment the counter while
> > the code is much simpler:
> >
> > "
> >     WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);
> >
> >     AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
> > "
> 
> Too many indirections. The increment should be doable without needing to
> dereference intermediary memory locations.

Right. I can think of 2 approaches:

1. Automatically build a function (thanks to generate-wait_event_types.pl) that
returns the offset of the wait event, something like:

static inline int
get_wait_event_counter_index(uint32 wait_event_info)
{
    uint32 classId = wait_event_info & WAIT_EVENT_CLASS_MASK;
    uint32 eventId = wait_event_info & WAIT_EVENT_ID_MASK;

    switch (classId) {
        case PG_WAIT_LWLOCK:
            if (eventId >= 95) return -1;
            return 0 + eventId;
        case PG_WAIT_LOCK:
            if (eventId >= 12) return -1;
            return 95 + eventId;
        case PG_WAIT_BUFFERPIN:
.
.
.

Then the offsets are "hardcoded" and there are no memory reads to get the offset
(confirmed by the assembly code):

"
cmp    $0x8,%edi
ja     0x91b32e
add    $0x7e,%edi
"

But the switch is not that good because it also produces many more cmp
instructions, so the last case will "suffer" more than the first one, for example.

2. Make use of a lookup table, like:

static const int WaitClassOffsets[11] = {
    -1, 
    0,
    -1,
    95,
    107,
    108,
    126,
    135,
    136,
    194,
    204,
};

but then one indirection would still be needed, so a memory read to get
WaitClassOffsets[classId].
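
For reference, the hypothetical waitEventIncrementCounter() helper mentioned
above could then be built on that table, along these lines (WaitEventPendingCounts[]
is a made-up backend-local array, the class byte is assumed to be shifted down to
index the small table, and per-class bounds checks on eventId are omitted):

void
waitEventIncrementCounter(uint32 wait_event_info)
{
    uint32 classId = (wait_event_info & WAIT_EVENT_CLASS_MASK) >> 24;
    uint32 eventId = wait_event_info & WAIT_EVENT_ID_MASK;
    int    offset = WaitClassOffsets[classId]; /* the single memory read */

    if (offset >= 0)
        WaitEventPendingCounts[offset + eventId]++;
}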

I think that the read could need more CPU cycles (than the worst switch case)
depending on where it is read from (L1, L2, or L3 cache).

That said, given its size and how frequently it would be accessed, it's likely to
stay in L1, and if that's the case then it's probably better than the switch
idea. Thoughts?

Let me try to produce an action plan:

- Define a list of "cheap" and a list of "expensive" wait events
- Change the code paths to increment counters only for the "expensive" ones
- Optimize the counter increment as much as possible

Does that sound reasonable to you?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


