Re: Adding wait events statistics - Mailing list pgsql-hackers
From: Bertrand Drouvot
Subject: Re: Adding wait events statistics
Date:
Msg-id: aH+DDoiLujt+b9cV@ip-10-97-1-34.eu-west-3.compute.internal
In response to: Re: Adding wait events statistics (Andres Freund <andres@anarazel.de>)
Responses: Re: Adding wait events statistics
           Re: Adding wait events statistics
List: pgsql-hackers
Hi,

On Fri, Jul 18, 2025 at 11:43:47AM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-07-18 06:04:38 +0000, Bertrand Drouvot wrote:
> > Here is what I've done:
> >
> > 1. Added 2 probes: one in pgstat_report_wait_start() and one in
> > pgstat_report_wait_end() to measure the wait events duration. Then, I ran
> > make check-world and computed the duration percentiles (in microseconds)
> > per wait event (see attached).
> >
> > Note that this is without the counters patch, just to get an idea of wait
> > events duration on my lab.
> >
> > Due to the way I ran the eBPF script I've been able to trap 159 wait events
> > (out of 273) during "make check-world" (I think that's enough for a first
> > analysis, see below).
>
> It's important to note that eBPF tracepoints can add significant overhead,
> i.e. your baseline measurement is influenced by that overhead.

Thanks for looking at it!

I do agree that eBPF tracepoints can add significant overhead.

> The overhead of the explicit counter is going to be *WAY* lower than the
> eBPF based probes.

Agree.

> Unfortunately, due to the issues explained above, I don't think we can learn
> a lot from this experiment :(

I think that statement is debatable. While it's true that eBPF tracepoints add
overhead, I used them to get both the wait event percentile timings and the
counter increment timings. I then used those percentiles (both of which include
the overhead) to extract the wait events/classes for which the p50 counter
increment would represent less than 5% of the p50 wait time, i.e.:

100 * p50_counter_timing / p50_wait_time < 5

We have seen that this gives 0.7% for the Lock class and 0.03% for the Timeout
class. With those numbers, and even if the eBPF tracepoints add overhead, I
think we can learn that it would be safe to add the counters for those two
wait classes.

Anyway, let's forget about eBPF. I ran another experiment, counting the cycles
with:

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
}

and then calling this function before and after waitEventIncrementCounter() and
also at wait_start() and wait_end() (without the increment counters patch).
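Just to make the approach concrete, here is a minimal standalone sketch of that
measurement (not the actual patch code: waitEventIncrementCounter() only exists
with the counters patch applied, so a trivial stub is used in its place, and,
as with the snippet above, this is x86 only):

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
}

/* stand-in for the patch's waitEventIncrementCounter() */
static volatile uint64_t dummy_counter;

static void waitEventIncrementCounter_stub(void)
{
    dummy_counter++;
}

int main(void)
{
    uint64_t before = rdtsc();

    waitEventIncrementCounter_stub();

    uint64_t after = rdtsc();

    /* one sample; in the real measurement the deltas are collected at each
     * call and the percentiles below are computed from them */
    printf("counter increment took %" PRIu64 " cycles\n", after - before);
    return 0;
}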
This gives the following number of cycles for the counter increments (I also
added the INSTR* timings to get the full picture, as we want them anyway):

p50:   146.000
p90:   722.000
p95:   1148.000
p99:   3474.000
p99.9: 13600.132

so that we can compare with the percentile cycles per wait event (see attached).

We can see that, for those wait classes, all their wait events' overhead would
be < 5%, and more precisely:

Overhead on the lock class is about 0.03%
Overhead on the timeout class is less than 0.01%

and now we can also see that:

Overhead on the lwlock class is about 1%
Overhead on the client class is about 0.5%
Overhead on the bufferpin class is about 0.2%

while the io and ipc classes have mixed results.

So, based on the cycles metric, I think it looks pretty safe to implement for
the vast majority of classes.

> I also continue to not believe that pure event counters are going to be
> useful for the majority of wait events. I'm not sure it is really interesting
> for *any* wait event that we don't already have independent stats for.

For pure counters only I can see your point, but are you also not convinced by
counters + timings?

> I think if we do want to have wait events that have more details, we need to:
>
> a) make them explicitly opt-in, i.e. code has to be changed over to use the
>    extended wait events
> b) the extended wait events need to count both the number of encounters as
>    well as the duration, the number of encounters is not useful on its own
> c) for each callsite that is converted to the extended wait event, you either
>    need to reason why the added overhead is ok, or do a careful experiment

I do agree with the above. What do you think about this latest experiment
counting the cycles?

> Personally I'd rather have an in-core sampling collector, counting how often
> it sees certain wait events when sampling.

Yeah, but even if we are okay with losing "counters" by sampling, we'd still
not get the duration. For the duration to be meaningful we also need the exact
counts.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com