Re: Adding wait events statistics - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Adding wait events statistics
Msg-id: xuynb44ql3hhggvqtme7axbliww7gwuy6pbaohxc4ngu3ynbsi@rvvpjxf55aia
In response to: Re: Adding wait events statistics (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
List: pgsql-hackers

Hi,

On 2025-07-09 07:26:05 +0000, Bertrand Drouvot wrote:
> On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> > On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> > >  * we don't want to allocate memory in some places where the counters are
> > >  incremented (see 4feba03d8b9). We could still use the same implementation as for
> > >  backend statistics (i.e., make use of flush_static_cb) if we really want/need to
> > >  switch to variable stats.
> >
> > Even without allocating memory, this makes wait events considerably more
> > expensive. Going from ~4 instructions to something more like ~40.  Including
> > external function calls.
>
> Yeah, looking at the performance impact and discussing it was on my todo list
> (as mentioned above), thanks for looking at it!

TBH, the more I think about this, the less convinced I am that a pure counter
provides *any* additional value over the stats we already have, in which case the
overhead is moot.

We commonly use wait events unconditionally around system calls, e.g.:

    pgstat_report_wait_start(wait_event_info);
    returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
    pgstat_report_wait_end();

A counter for the number of wait events is just a less informative version of
what we already have in pg_stat_io. There's literally no value provided by
tracking the number of times that wait event is hit.

A pretty large number of wait events operate that way. For those, the
interesting information would be to measure *how long* you are in the wait
event, not the number of times the wait event was entered for a very short
amount of time.

The obvious counterargument is that you are likely planning to add timing
information at some point down the line - leaving overhead aside, for the above
case that would *also* just duplicate existing information, namely track_io_timing.
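
For the call site above, timing would look roughly like the sketch below (the
accumulator array and index helper are invented names), i.e. a second copy of
what track_io_timing already measures for this path:

    instr_time  start, duration;

    INSTR_TIME_SET_CURRENT(start);
    pgstat_report_wait_start(wait_event_info);
    returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
    pgstat_report_wait_end();
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);

    /* hypothetical per-event accumulator, duplicating track_io_timing here */
    wait_event_times[wait_event_time_index(wait_event_info)] +=
        INSTR_TIME_GET_MICROSEC(duration);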


The wait events we have as-is are useful because they tell you what's
currently happening, and you can correlate that with other things that are
going on. But a pure counter of wait events encountered doesn't allow for
that.


> > end wait:
> >    0x0000000000a08eaa <+90>:    mov    (%r12),%rax
>
> Yeah this one is unbeatable, it will have to be more expensive...
>
> On my side, I'm getting:
>
> => 0x000000000091a320 <+0>:     mov    0x5337f9(%rip),%rax        # 0xe4db20 <my_wait_event_info>
>    0x000000000091a327 <+7>:     mov    (%rax),%eax
>    0x000000000091a329 <+9>:     mov    %eax,%edx
>    0x000000000091a32b <+11>:    and    $0xff000000,%edx
>    0x000000000091a331 <+17>:    jne    0x91a340 <waitEventIncrementCounter+32>
>    0x000000000091a333 <+19>:    test   %ax,%ax
>    0x000000000091a336 <+22>:    jne    0x91a340 <waitEventIncrementCounter+32>
>    0x000000000091a338 <+24>:    ret
>    0x000000000091a339 <+25>:    nopl   0x0(%rax)
>    0x000000000091a340 <+32>:    cmp    $0x1000000,%edx
>    0x000000000091a346 <+38>:    jne    0x91a34e <waitEventIncrementCounter+46>
>    0x000000000091a348 <+40>:    cmp    $0x5e,%ax
>    0x000000000091a34c <+44>:    ja     0x91a338 <waitEventIncrementCounter+24>
>    0x000000000091a34e <+46>:    cmp    $0x7000000,%edx
>    0x000000000091a354 <+52>:    sete   %cl
>    0x000000000091a357 <+55>:    test   %ax,%ax
>    0x000000000091a35a <+58>:    setne  %al
>    0x000000000091a35d <+61>:    test   %al,%cl
>    0x000000000091a35f <+63>:    jne    0x91a338 <waitEventIncrementCounter+24>
>    0x000000000091a361 <+65>:    cmp    $0xb000000,%edx
>    0x000000000091a367 <+71>:    je     0x91a338 <waitEventIncrementCounter+24>
>    0x000000000091a369 <+73>:    mov    %edi,%eax
>    0x000000000091a36b <+75>:    movzwl %di,%edi
>    0x000000000091a36e <+78>:    shr    $0x18,%eax
>    0x000000000091a371 <+81>:    shl    $0x5,%rax
>    0x000000000091a375 <+85>:    movslq 0xe4d9c8(%rax),%rax
>    0x000000000091a37c <+92>:    movb   $0x1,0x56045d(%rip)        # 0xe7a7e0 <have_wait_event_stats>
>    0x000000000091a383 <+99>:    add    %rdi,%rax
>    0x000000000091a386 <+102>:   addq   $0x1,0xe7a800(,%rax,8)
>    0x000000000091a38f <+111>:   ret

I think adding 6 conditional branches to this is just out of the question...


> > I think to make this viable, we would have to remove some wait events and be
> > more careful about adding them.
>
> You mean incrementing the counters for wait events for which we expect that
> the increment is "negligible" as compared to the wait event itself? (Like DATAFILE
> IO and such)?

I mean that we would need to either *remove* a bunch of wait events, or change
the API so that there are "cheap" wait events that just have the current
overhead, and expensive wait events that do this tracking.
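
Roughly: pgstat_report_wait_start() stays exactly as it is today (a single store
through my_wait_event_info), and an opt-in variant does the extra work. The
_tracked name, the counter array and the index helper below are invented, this is
just a sketch of the split, not a concrete proposal:

    extern uint32 *my_wait_event_info;      /* existing backend-local pointer */
    extern uint64  wait_event_counts[];     /* hypothetical counter array */
    extern uint32  wait_event_counter_index(uint32 wait_event_info);  /* hypothetical */

    static inline void
    pgstat_report_wait_start_tracked(uint32 wait_event_info)
    {
        /* same single store as the cheap path ... */
        *(volatile uint32 *) my_wait_event_info = wait_event_info;

        /* ... plus the counter, only at call sites that opt in */
        wait_event_counts[wait_event_counter_index(wait_event_info)]++;
    }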


> If so I agree, and that's something I have also discussed with a few people (also
> if we add the timings later): that we could/should be more granular. If we go that
> way we could also imagine a GUC to enable the stats for all of them? (default
> being off).

Any conditional code that you add makes this stuff more expensive...
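
E.g. a GUC (name invented below) would look something like this sketch, and the
check alone is a global load plus a conditional branch on every report, even with
the feature turned off:

    extern bool track_wait_event_stats;     /* hypothetical GUC */

    static inline void
    waitEventIncrementCounter(uint32 wait_event_info)
    {
        /* paid on every wait event report, even when disabled */
        if (!track_wait_event_stats)
            return;

        /* ... counting ... */
    }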


> > And optimize this code a good bit. This
> > really shouldn't end up with this much code.
>
> Yes, with the idea mentioned in [1] (code not shared yet), I end up with:
>
> Dump of assembler code for function waitEventIncrementCounter:
> => 0x000000000091b320 <+0>:     mov    %edi,%eax
>    0x000000000091b322 <+2>:     and    $0xff000000,%eax
>    0x000000000091b327 <+7>:     jne    0x91b330 <waitEventIncrementCounter+16>
>    0x000000000091b329 <+9>:     test   %di,%di
>    0x000000000091b32c <+12>:    jne    0x91b330 <waitEventIncrementCounter+16>
>    0x000000000091b32e <+14>:    ret
>    0x000000000091b32f <+15>:    nop
>    0x000000000091b330 <+16>:    cmp    $0x1000000,%eax
>    0x000000000091b335 <+21>:    jne    0x91b33d <waitEventIncrementCounter+29>
>    0x000000000091b337 <+23>:    cmp    $0x5e,%di
>    0x000000000091b33b <+27>:    ja     0x91b32e <waitEventIncrementCounter+14>
>    0x000000000091b33d <+29>:    cmp    $0x7000000,%eax
>    0x000000000091b342 <+34>:    sete   %cl
>    0x000000000091b345 <+37>:    test   %di,%di
>    0x000000000091b348 <+40>:    setne  %dl
>    0x000000000091b34b <+43>:    test   %dl,%cl
>    0x000000000091b34d <+45>:    jne    0x91b32e <waitEventIncrementCounter+14>
>    0x000000000091b34f <+47>:    cmp    $0xb000000,%eax
>    0x000000000091b354 <+52>:    je     0x91b32e <waitEventIncrementCounter+14>
>    0x000000000091b356 <+54>:    mov    %edi,%eax
>    0x000000000091b358 <+56>:    movzwl %di,%edi
>    0x000000000091b35b <+59>:    shr    $0x18,%eax
>    0x000000000091b35e <+62>:    mov    0xe4e9c0(,%rax,8),%rax
>    0x000000000091b366 <+70>:    movb   $0x1,0x5603f3(%rip)        # 0xe7b760 <have_wait_event_stats>
>    0x000000000091b36d <+77>:    mov    0x8(%rax),%rax
>    0x000000000091b371 <+81>:    addq   $0x1,(%rax,%rdi,8)
>    0x000000000091b376 <+86>:    ret

That's still way too many jumps.


> That looks better before the first branch (could be done in v1 too) as it uses
> registers only, but looks worse when it's time to increment the counter, even
> though the code is much simpler:
>
> "
>     WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);
>
>     AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
> "

Too many indirections. The increment should be doable without needing to
dereference intermediary memory locations.
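
Something closer to the sketch below, where the slot index is a pure function of
wait_event_info and the array is statically sized, so the hot path is an address
calculation plus a single add (the bounds are made-up assumptions):

    #define MAX_WAIT_CLASSES        16      /* assumed upper bound on class ids */
    #define MAX_EVENTS_PER_CLASS    256     /* assumed upper bound on event ids */

    static uint64 wait_event_counts[MAX_WAIT_CLASSES * MAX_EVENTS_PER_CLASS];

    static inline void
    waitEventIncrementCounter(uint32 wait_event_info)
    {
        uint32      classId = wait_event_info >> 24;
        uint32      eventId = wait_event_info & 0xFFFF;

        /* assumes eventId < MAX_EVENTS_PER_CLASS; no intermediate loads needed */
        wait_event_counts[classId * MAX_EVENTS_PER_CLASS + eventId]++;
    }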

Greetings,

Andres Freund


