Re: Adding wait events statistics - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Adding wait events statistics
Date
Msg-id lt6n664ijbmfftnuv3bgvt47q7kjz4tflu4kg3ingv6njjtvld@kesknxnidemo
Whole thread Raw
In response to Adding wait events statistics  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
Responses Re: Adding wait events statistics
List pgsql-hackers
Hi,

On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> Wait events are useful to know what backends are waiting for when there is/was
> a performance issue: for this we can sample pg_stat_activity at regular intervals
> and record historical data. That’s how it is commonly used.
>
> It could also be useful to observe the engine/backends behavior over time and
> help answer questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application’s "A" wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting longer or
> is it because it is hit more frequently?
> * I observe a peak in some of them (say for example MultiXact%), is it due to a
> workload change?
>
> For the above use cases, we need a way to track the wait events that occur between
> samples: please find attached a proof of concept patch series doing so.
> >
> 0002 -  It adds wait events statistics
>
> It adds a new stat kind PGSTAT_KIND_WAIT_EVENT for the wait event statistics.
>
> This new statistic kind is a fixed one because we know the maximum number of wait
> events. Indeed:
>
>  * it does not take into account custom wait events as extensions have all they need
>  at their disposal to create custom stats on their own wait events should they
>  want to (limited by WAIT_EVENT_CUSTOM_HASH_MAX_SIZE though).
>
>  * it does not take into account LWLock > LWTRANCHE_FIRST_USER_DEFINED for the same
>  reasons as above. That said, there is no maximum limitation in LWLockNewTrancheId().
>
>  * we don't want to allocate memory in some places where the counters are
>  incremented (see 4feba03d8b9). We could still use the same implementation as for
>  backend statistics (i.e, make use of flush_static_cb) if we really want/need to
>  switch to variable stats.

Even without allocating memory, this makes wait events considerably more
expensive. Going from ~4 instructions to something more like ~40.  Including
external function calls.

Today:

start wait:
   0x0000000000a08e8f <+63>:    lea    0x735662(%rip),%r12        # 0x113e4f8 <my_wait_event_info>
   0x0000000000a08e96 <+70>:    mov    (%r12),%rax
   0x0000000000a08e9a <+74>:    movl   $0x9000006,(%rax)

   < do wait operation >


end wait:
   0x0000000000a08eaa <+90>:    mov    (%r12),%rax


After:

start wait:
   0x0000000000a0b39f <+63>:    lea    0x73673a(%rip),%r12        # 0x1141ae0 <my_wait_event_info>
   0x0000000000a0b3a6 <+70>:    mov    (%r12),%rax
   0x0000000000a0b3aa <+74>:    movl   $0x9000006,(%rax)


   < do wait operation >

end wait:

   0x0000000000a0b3ba <+90>:    mov    (%r12),%rax
   0x0000000000a0b3be <+94>:    mov    (%rax),%edi
   0x0000000000a0b3c0 <+96>:    rex call 0xa452e0 <waitEventIncrementCounter>
   0x0000000000a0b3c6 <+102>:    mov    (%r12),%rax

Where waitEventIncrementCounter ends up as:

   0x0000000000a452e0 <+0>:    lea    0x6fc7f9(%rip),%rax        # 0x1141ae0 <my_wait_event_info>
   0x0000000000a452e7 <+7>:    mov    (%rax),%rax
   0x0000000000a452ea <+10>:    mov    (%rax),%eax
   0x0000000000a452ec <+12>:    mov    %eax,%edx
   0x0000000000a452ee <+14>:    and    $0xff000000,%edx
   0x0000000000a452f4 <+20>:    jne    0xa45300 <waitEventIncrementCounter+32>
   0x0000000000a452f6 <+22>:    test   %ax,%ax
   0x0000000000a452f9 <+25>:    jne    0xa45300 <waitEventIncrementCounter+32>
   0x0000000000a452fb <+27>:    ret
   0x0000000000a452fc <+28>:    nopl   0x0(%rax)
   0x0000000000a45300 <+32>:    cmp    $0x1000000,%edx
   0x0000000000a45306 <+38>:    jne    0xa4530e <waitEventIncrementCounter+46>
   0x0000000000a45308 <+40>:    cmp    $0x5e,%ax
   0x0000000000a4530c <+44>:    ja     0xa452fb <waitEventIncrementCounter+27>
   0x0000000000a4530e <+46>:    cmp    $0x7000000,%edx
   0x0000000000a45314 <+52>:    sete   %cl
   0x0000000000a45317 <+55>:    test   %ax,%ax
   0x0000000000a4531a <+58>:    setne  %al
   0x0000000000a4531d <+61>:    test   %al,%cl
   0x0000000000a4531f <+63>:    jne    0xa452fb <waitEventIncrementCounter+27>
   0x0000000000a45321 <+65>:    cmp    $0xb000000,%edx
   0x0000000000a45327 <+71>:    je     0xa452fb <waitEventIncrementCounter+27>
   0x0000000000a45329 <+73>:    mov    %edi,%eax
   0x0000000000a4532b <+75>:    movzwl %di,%edi
   0x0000000000a4532e <+78>:    lea    0x714e4b(%rip),%rdx        # 0x115a180 <PendingWaitEventStats>
   0x0000000000a45335 <+85>:    shr    $0x18,%eax
   0x0000000000a45338 <+88>:    shl    $0x5,%rax
   0x0000000000a4533c <+92>:    add    0x4e1735(%rip),%rax        # 0xf26a78
   0x0000000000a45343 <+99>:    movslq 0x8(%rax),%rax
   0x0000000000a45347 <+103>:    add    %rdi,%rax
   0x0000000000a4534a <+106>:    incq   (%rdx,%rax,8)
   0x0000000000a4534e <+110>:    lea    0x714e0b(%rip),%rax        # 0x115a160 <have_wait_event_stats>
   0x0000000000a45355 <+117>:    movb   $0x1,(%rax)
   0x0000000000a45358 <+120>:    ret


That's about an order of magnitude increase in overhead, just based on the
number of instructions, and probably considerably xmore in runtime (you're
adding external function calls etc).


I think to make this viable, we would have to remove some wait events and be
more careful about adding them.  And optimize this code a good bit. This
really shouldn't end up with this mouch code.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Adding wait events statistics
Next
From: jian he
Date:
Subject: Re: [PATCH] Prevent replacement of a function if it's used in an index expression and is not IMMUTABLE