Re: Adding wait events statistics - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: Adding wait events statistics |
| Date | |
| Msg-id | sofkrmi3skg3ekc3y23uwxscbviy5lcbukincoyauypg4ylfdg@6lwzhi6uagc7 |
| In response to | Re: Adding wait events statistics (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>) |
| List | pgsql-hackers |
Hi,

On 2025-07-18 06:04:38 +0000, Bertrand Drouvot wrote:
> Here what I’ve done:
>
> 1. Added 2 probes: one in pgstat_report_wait_start() and one in pgstat_report_wait_end()
> to measure the wait events duration. Then, I ran make check-world and computed
> the duration percentiles (in microseconds) per wait event (see attached).
> Note that this is without the counters patch, just to get an idea of wait events
> duration on my lab.
>
> Due to the way I ran the eBPF script I’ve been able to trap 159 wait events
> (out of 273) during "make check-world" (I think that’s enough for a first
> analysis, see below).

It's important to note that eBPF tracepoints can add significant overhead, i.e. your baseline measurement is influenced by that overhead.

To showcase that, in a ridiculously exaggerated way, I added probes with bpftrace on the start and exit of heap_getnextslot(), measuring the time spent in the function, and compared the performance to not having the probes. Without probes my seqscan takes 1651.221ms, with probes 15417.930ms, i.e. it slowed down by about an order of magnitude.

I don't think you can use numbers gathered this way as a baseline measurement to evaluate the overhead added by explicit counters. The overhead of the explicit counter is going to be *WAY* lower than that of the eBPF based probes.

You can't usefully use eBPF probes to measure costs that are this small: each time you hit a probe it has to trap into kernel space, run a bit of code, and return to userspace. That'll be a few hundred cycles at the very least; if you actually want to compute a semi-accurate duration with eBPF, the duration of what you measure needs to be considerably longer.

> 2. Applied the patch with the counters (one that I did not share yet), removed
> the probes in pgstat_report_wait_start() and pgstat_report_wait_end() and added
> one probe before incrementing the counters (i.e before calling waitEventIncrementCounter())
> in pgstat_report_wait_end() and one probe after incrementing the counters to measure
> the counters increment duration.
>
> Then I ran make check-world and computed counter increment duration percentiles
> (in nanoseconds) and got:
>
> P50.00: 1138.000000
> P90.00: 2674.000000
> P95.00: 3741.000000
> P99.00: 7657.000000
> P99.90: 16973.881000
> P99.99: 55614.830000

The above confirms my suspicion. It's inconceivable that the explicit counting you added costs ~1k nanoseconds on average. Multiply that by the frequency of the CPU, factor in that CPUs are superscalar (they can execute multiple instructions at once), and the overhead you compute here would correspond to roughly 10k instructions. Obviously that's not true. If it were, we'd have to just mark this patch as rejected and never talk about it again.

Even if your baseline measurement did provide an accurate baseline number, it'd still not be particularly helpful, I suspect. Whether adding additional instructions actually causes slowdowns isn't just determined by the raw number of instructions; it also depends on what work is being done in the CPU. E.g. if you add an extra 50 cheap instructions into a codepath that's completely bottlenecked by memory latency, you'll often not increase the total time taken meaningfully, as all those instructions are executed while waiting for memory (which has a latency longer than those 50 cheap instructions added together). But if you add 50 cheap instructions to a workload that's purely CPU bound, you're much more likely to slow it down measurably.
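To put a rough upper bound on what a plain in-memory counter bump actually costs, it is easy to time one directly in userspace, with no kernel traps involved. A minimal sketch, assuming nothing about the patch itself (the counter below is just a stand-in, not the patch's data structure):

```c
/*
 * Illustrative sketch only: time an explicit counter increment entirely in
 * userspace instead of via per-event eBPF probes.  The counter here is a
 * stand-in, not the data structure from the counters patch.
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000ULL

int
main(void)
{
	static volatile uint64_t counter = 0;	/* volatile so the loop isn't optimized away */
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (uint64_t i = 0; i < ITERATIONS; i++)
		counter++;
	clock_gettime(CLOCK_MONOTONIC, &end);

	double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec);

	/* amortized cost per increment; expect low single-digit nanoseconds */
	printf("%.3f ns per increment\n", elapsed_ns / ITERATIONS);
	return 0;
}
```

A tight loop like this is of course the best case; as argued above, what the increment costs in situ depends on what the surrounding code is bottlenecked on. But it gives a sense of how far the real per-increment cost is from the ~1µs the eBPF numbers suggest.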
> I think that:
>
> - That’s easy enough to reason about and kind of expected for those 2 classes
> - We would miss other wait events but I think the trade-off is worth the simplicity
> - Having the ability to get the counters and the timings for those 2 wait classes
> is better than nothing
> - Having those metrics for the Lock class would be a nice addition to pg_locks
>
> Thoughts?

Unfortunately, due to the issues explained above, I don't think we can learn a lot from this experiment :(

I also continue to not believe that pure event counters are going to be useful for the majority of wait events. I'm not sure they are really interesting for *any* wait event that we don't already have independent stats for.

I think if we do want to have wait events that carry more detail, we need to:

a) make them explicitly opt-in, i.e. code has to be changed over to use the
   extended wait events

b) have the extended wait events count both the number of encounters and the
   duration; the number of encounters is not useful on its own

c) for each callsite that is converted to the extended wait event, either
   reason about why the added overhead is ok, or do a careful experiment

Personally I'd rather have an in-core sampling collector, counting how often it sees certain wait events when sampling. It can then also correlate those samples with other things on the system, e.g. by counting the number of active backends together with each sample, and eventually also memory usage etc.

Greetings,

Andres Freund
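For illustration, a rough sketch of what b) above could look like: a per-wait-event record that tracks both the number of encounters and the accumulated duration. All names here are hypothetical and not taken from any posted patch.

```c
/*
 * Hypothetical sketch only -- not from any posted patch.  An "extended" wait
 * event, per b) above, tracks both how often the wait was entered and how
 * long was spent in it; a count alone is not useful.
 */
#include <stdint.h>

typedef struct ExtendedWaitEventStats
{
	uint64_t	hits;			/* number of times the wait event was entered */
	uint64_t	total_wait_ns;	/* accumulated time spent waiting, in ns */
} ExtendedWaitEventStats;

/* opted-in callsites would bracket the wait and record both fields */
static inline void
extended_wait_event_record(ExtendedWaitEventStats *stats,
						   uint64_t start_ns, uint64_t end_ns)
{
	stats->hits += 1;
	stats->total_wait_ns += end_ns - start_ns;
}
```

Keeping both fields together means an opted-in callsite pays for one timestamp pair plus two additions per wait, which is the overhead that a) and c) ask each conversion to justify.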