Re: Adding wait events statistics - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Adding wait events statistics
Date
Msg-id sofkrmi3skg3ekc3y23uwxscbviy5lcbukincoyauypg4ylfdg@6lwzhi6uagc7
In response to Re: Adding wait events statistics  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
List pgsql-hackers
Hi,

On 2025-07-18 06:04:38 +0000, Bertrand Drouvot wrote:
> Here is what I’ve done:
>
> 1. Added 2 probes: one in pgstat_report_wait_start() and one in pgstat_report_wait_end()
> to measure the wait events duration. Then, I ran make check-world and computed
> the duration percentiles (in microseconds) per wait event (see attached).

> Note that this is without the counters patch, just to get an idea of wait events
> duration on my lab.
>
> Due to the way I ran the eBPF script I’ve been able to trap 159 wait events
> (out of 273) during "make check-world" (I think that’s enough for a first
> analysis, see below).

It's important to note that eBPF tracepoints can add significant overhead,
i.e. your baseline measurement is influenced by that overhead.  To showcase
that, in a ridiculously exaggerated way, I added probes with bpftrace on entry
and exit of heap_getnextslot(), measuring the time spent in the function, and
compared the performance of that to not having the probes.

Without probes my seqscan takes 1651.221ms, with probes 15417.930ms. I.e. it
slowed down by about an order of magnitude.
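
Roughly speaking, the probes look something like this (the binary path is
illustrative, and I've left out any per-pid filtering):

bpftrace -e '
uprobe:/usr/local/pgsql/bin/postgres:heap_getnextslot { @start[tid] = nsecs; }
uretprobe:/usr/local/pgsql/bin/postgres:heap_getnextslot /@start[tid]/ {
    @dur_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'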

I don't think you can use numbers gathered this way as a baseline measurement
to evaluate the overhead of adding explicit counters. The overhead of the
explicit counter is going to be *WAY* lower than that of the eBPF based probes.


You can't usefully use eBPF probes to measure costs that are this small: each
time you hit a probe it has to trap into kernel space, run a bit of code, and
return to userspace. That'll be a few hundred cycles at the very least. If you
actually want to compute a semi-accurate duration with eBPF, the duration of
what you measure needs to be considerably longer.


> 2. Applied the patch with the counters (one that I did not share yet), removed
> the probes in pgstat_report_wait_start() and pgstat_report_wait_end() and added
> one probe before incrementing the counters (i.e before calling waitEventIncrementCounter())
> in pgstat_report_wait_end() and one probe after incrementing the counters to measure
> the counters increment duration.
>
> Then I ran make check-world and computed counter increment duration percentiles
> (in nanoseconds) and got:
>
> P50.00: 1138.000000
> P90.00: 2674.000000
> P95.00: 3741.000000
> P99.00: 7657.000000
> P99.90: 16973.881000
> P99.99: 55614.830000

The above confirms my suspicion. It's inconceivable that the explicit counting
you added costs ~1k nanoseconds on average. Multiply that by the CPU's clock
frequency (at, say, ~3 GHz, 1138 ns is about 3,400 cycles) and factor in that
CPUs are superscalar (they can execute multiple instructions per cycle), and
the overhead you compute here works out to roughly 10k
instructions. Obviously that's not true. If it were, we'd have to just mark
this patch as rejected and never talk about it again.


Even if your baseline measurement did provide an accurate baseline number, I
suspect it still wouldn't be particularly helpful. Whether adding additional
instructions actually causes a slowdown isn't determined just by the raw number
of instructions; it also depends on what work is being done in the
CPU. E.g. if you add an extra 50 cheap instructions into a codepath that's
completely bottlenecked by memory latency, you'll often not increase the total
time taken meaningfully, as all those instructions are executed while waiting
for memory (which has a latency longer than those 50 cheap instructions added
together). But if you add 50 cheap instructions to a workload that's purely
CPU bound, you're much more likely to slow that down measurably.


> I think that:
>
> - That’s easy enough to reason about and kind of expected for those 2 classes
> - We would miss other wait events but I think the trade-off is worth the simplicity
> - Having the ability to get the counters and the timings for those 2 wait classes
> is better than nothing
> - Having those metrics for the Lock class would be a nice addition to pg_locks
>
> Thoughts?

Unfortunately, due to the issues explained above, I don't think we can learn a
lot from this experiment :(

I also continue to not believe that pure event counters are going to be useful
for the majority of wait events. I'm not sure they are really interesting for
*any* wait event that we don't already have independent stats for.

I think if we do want to have wait events that gather more detail, we need to:

a) make them explicitly opt-in, i.e. code has to be changed over to use the
   extended wait events
b) have the extended wait events count both the number of encounters and the
   duration; the number of encounters is not useful on its own (see the sketch
   below)
c) for each callsite that is converted to the extended wait event, either
   reason about why the added overhead is ok, or do a careful experiment
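
To illustrate b), a converted callsite could end up with something roughly like
the below. The type and function names are made up for the example, and
instr_time (portability/instr_time.h) is just one possible timing source; its
cost is exactly the kind of thing c) would have to justify:

typedef struct WaitEventExtStats
{
    uint64      count;          /* times the wait was entered */
    instr_time  total_time;     /* accumulated wait duration */
} WaitEventExtStats;

static instr_time wait_start_ts;

static inline void
pgstat_report_wait_start_extended(uint32 wait_event_info)
{
    INSTR_TIME_SET_CURRENT(wait_start_ts);
    pgstat_report_wait_start(wait_event_info);
}

static inline void
pgstat_report_wait_end_extended(WaitEventExtStats *stats)
{
    instr_time  end;

    INSTR_TIME_SET_CURRENT(end);
    stats->count++;
    INSTR_TIME_ACCUM_DIFF(stats->total_time, end, wait_start_ts);
    pgstat_report_wait_end();
}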


Personally I'd rather have an in-core sampling collector that counts how often
it sees certain wait events when sampling. It could then also correlate those
samples with other things on the system, e.g. by counting the number of active
backends together with each sample, and eventually also memory usage etc.
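
To sketch what I mean (very roughly; wait_event_sample_increment() and
record_sample_context() are made-up accumulators, and locking / bgworker
plumbing is omitted), the sampler's loop could be something like:

const long  sample_interval_ms = 10;    /* made up */

for (;;)
{
    int     active = 0;

    for (int i = 0; i < ProcGlobal->allProcCount; i++)
    {
        PGPROC *proc = &ProcGlobal->allProcs[i];
        uint32  wait_event_info = proc->wait_event_info;

        if (proc->pid != 0)
            active++;
        if (wait_event_info != 0)
            wait_event_sample_increment(wait_event_info);
    }

    /* correlate each sample with other system state */
    record_sample_context(active);

    pg_usleep(sample_interval_ms * 1000L);
}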

Greetings,

Andres Freund


