Re: Adding wait events statistics - Mailing list pgsql-hackers
From: Bertrand Drouvot
Subject: Re: Adding wait events statistics
Date:
Msg-id: aH+DDoiLujt+b9cV@ip-10-97-1-34.eu-west-3.compute.internal
In response to: Re: Adding wait events statistics (Andres Freund <andres@anarazel.de>)
Responses: Re: Adding wait events statistics
           Re: Adding wait events statistics
List: pgsql-hackers
Hi,

On Fri, Jul 18, 2025 at 11:43:47AM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-07-18 06:04:38 +0000, Bertrand Drouvot wrote:
> > Here is what I've done:
> >
> > 1. Added 2 probes: one in pgstat_report_wait_start() and one in
> > pgstat_report_wait_end() to measure the wait events duration. Then, I ran
> > make check-world and computed the duration percentiles (in microseconds)
> > per wait event (see attached).
> >
> > Note that this is without the counters patch, just to get an idea of wait
> > events duration on my lab.
> >
> > Due to the way I ran the eBPF script I've been able to trap 159 wait events
> > (out of 273) during "make check-world" (I think that's enough for a first
> > analysis, see below).
>
> It's important to note that eBPF tracepoints can add significant overhead,
> i.e. your baseline measurement is influenced by that overhead.

Thanks for looking at it!

I do agree that eBPF tracepoints can add significant overhead.

> The overhead of the explicit counter is going to be *WAY* lower than the
> eBPF based probes.

Agree.

> Unfortunately, due to the issues explained above, I don't think we can learn
> a lot from this experiment :(

I think that statement is debatable. While it's true that eBPF tracepoints add
overhead, I used them to get both the wait event percentile timings and the
counter increment timings. I then used those percentiles (both of which include
the overhead) to extract the wait events/classes for which the p50 counter
increment would represent less than 5% of the p50 wait time, i.e.:

100 * p50_counter_timing / p50_wait_time < 5

We have seen that this gives 0.7% for the Lock class and 0.03% for the Timeout
class. With those numbers, and even if the eBPF tracepoints add overhead, I
think we can learn that it would be safe to add the counters for those two
wait classes.

Anyway, let's forget about eBPF. I ran another experiment, counting the cycles
with:

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
}

and then calling this function before and after waitEventIncrementCounter() and
also at wait_start() and wait_end() (without the increment counters patch).
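Just to make the approach concrete, here is a minimal standalone sketch of that
measurement (not the actual patch code: waitEventIncrementCounter() only exists
with the counters patch applied, so a trivial stub is used in its place, and,
as with the snippet above, this is x86 only):

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
}

/* stand-in for the patch's waitEventIncrementCounter() */
static volatile uint64_t dummy_counter;

static void waitEventIncrementCounter_stub(void)
{
    dummy_counter++;
}

int main(void)
{
    uint64_t before = rdtsc();

    waitEventIncrementCounter_stub();

    uint64_t after = rdtsc();

    /* one sample; in the real measurement the deltas are collected at each
     * call and the percentiles below are computed from them */
    printf("counter increment took %" PRIu64 " cycles\n", after - before);
    return 0;
}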
This gives the following number of cycles for the counter increments (I also
added the INSTR* timings to get the full picture, as we want them anyway):

p50:   146.000
p90:   722.000
p95:   1148.000
p99:   3474.000
p99.9: 13600.132

so that we can compare with the percentile cycles per wait event (see attached).

We can see that, for those wait classes, all their wait events' overhead would
be < 5%, and more precisely:

Overhead on the lock class is about 0.03%
Overhead on the timeout class is less than 0.01%

and now we can also see that:

Overhead on the lwlock class is about 1%
Overhead on the client class is about 0.5%
Overhead on the bufferpin class is about 0.2%

while the io and ipc classes have mixed results.

So, based on the cycles metric, I think it looks pretty safe to implement for
the vast majority of classes.

> I also continue to not believe that pure event counters are going to be
> useful for the majority of wait events. I'm not sure it is really interesting
> for *any* wait event that we don't already have independent stats for.

For pure counters only I can see your point, but are you also not convinced by
counters + timings?

> I think if we do want to have wait events that have more details, we need to:
>
> a) make them explicitly opt-in, i.e. code has to be changed over to use the
>    extended wait events
> b) the extended wait events need to count both the number of encounters as
>    well as the duration, the number of encounters is not useful on its own
> c) for each callsite that is converted to the extended wait event, you either
>    need to reason why the added overhead is ok, or do a careful experiment

I do agree with the above. What do you think about this latest experiment
counting the cycles?

> Personally I'd rather have an in-core sampling collector, counting how often
> it sees certain wait events when sampling.

Yeah, but even if we are okay with losing "counters" by sampling, we'd still
not get the duration. For the duration to be meaningful we also need the exact
counts.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com