Re: Adding wait events statistics - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Adding wait events statistics |
Date | |
Msg-id | xuynb44ql3hhggvqtme7axbliww7gwuy6pbaohxc4ngu3ynbsi@rvvpjxf55aia Whole thread Raw |
In response to | Re: Adding wait events statistics (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>) |
List | pgsql-hackers |
Hi, On 2025-07-09 07:26:05 +0000, Bertrand Drouvot wrote: > On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote: > > On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote: > > > * we don't want to allocate memory in some places where the counters are > > > incremented (see 4feba03d8b9). We could still use the same implementation as for > > > backend statistics (i.e, make use of flush_static_cb) if we really want/need to > > > switch to variable stats. > > > > Even without allocating memory, this makes wait events considerably more > > expensive. Going from ~4 instructions to something more like ~40. Including > > external function calls. > > Yeah looking at the performance impacts and discuss those was in my todo (as > mentioned above), thanks for looking at it! TBH, the more I think about this, the less convinced a pure counter provides *any* additional value over stats we already have, in which case the overhead is moot. We use commonly use wait events unconditionally around system calls, e.g.: pgstat_report_wait_start(wait_event_info); returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset); pgstat_report_wait_end(); A counter for the number of wait events is just less informative version of what we already have in pg_stat_io. There's literally no value provided by tracking the number of times that wait event is hit. A pretty large number of wait events operate that way. For those the interesting information would be to measure *how long* you are in the wait event, not the number of times the wait event was entered for a very short amount of time. The obvious counter to that is that you likely are planning to add timing information at some point down the line - leaving overhead aside, that would *also* just duplicate information for the above case, namely track_io_timing. The wait events we have as-is are useful because they tell you what's currently happening, and you can correlate that with other things that are going on. But a pure counter of wait events encountered doesn't allow for that. > > end wait: > > 0x0000000000a08eaa <+90>: mov (%r12),%rax > > Yeah this one is unbeatable, it will have to be more expensive... > > On my side, I'm getting: > > => 0x000000000091a320 <+0>: mov 0x5337f9(%rip),%rax # 0xe4db20 <my_wait_event_info> > 0x000000000091a327 <+7>: mov (%rax),%eax > 0x000000000091a329 <+9>: mov %eax,%edx > 0x000000000091a32b <+11>: and $0xff000000,%edx > 0x000000000091a331 <+17>: jne 0x91a340 <waitEventIncrementCounter+32> > 0x000000000091a333 <+19>: test %ax,%ax > 0x000000000091a336 <+22>: jne 0x91a340 <waitEventIncrementCounter+32> > 0x000000000091a338 <+24>: ret > 0x000000000091a339 <+25>: nopl 0x0(%rax) > 0x000000000091a340 <+32>: cmp $0x1000000,%edx > 0x000000000091a346 <+38>: jne 0x91a34e <waitEventIncrementCounter+46> > 0x000000000091a348 <+40>: cmp $0x5e,%ax > 0x000000000091a34c <+44>: ja 0x91a338 <waitEventIncrementCounter+24> > 0x000000000091a34e <+46>: cmp $0x7000000,%edx > 0x000000000091a354 <+52>: sete %cl > 0x000000000091a357 <+55>: test %ax,%ax > 0x000000000091a35a <+58>: setne %al > 0x000000000091a35d <+61>: test %al,%cl > 0x000000000091a35f <+63>: jne 0x91a338 <waitEventIncrementCounter+24> > 0x000000000091a361 <+65>: cmp $0xb000000,%edx > 0x000000000091a367 <+71>: je 0x91a338 <waitEventIncrementCounter+24> > 0x000000000091a369 <+73>: mov %edi,%eax > 0x000000000091a36b <+75>: movzwl %di,%edi > 0x000000000091a36e <+78>: shr $0x18,%eax > 0x000000000091a371 <+81>: shl $0x5,%rax > 0x000000000091a375 <+85>: movslq 0xe4d9c8(%rax),%rax > 0x000000000091a37c <+92>: movb $0x1,0x56045d(%rip) # 0xe7a7e0 <have_wait_event_stats> > 0x000000000091a383 <+99>: add %rdi,%rax > 0x000000000091a386 <+102>: addq $0x1,0xe7a800(,%rax,8) > 0x000000000091a38f <+111>: ret I think adding 6 conditional branches to this is just out of question... > > I think to make this viable, we would have to remove some wait events and be > > more careful about adding them. > > You mean incrementing the counters for wait events for which we expect that > the increment is "negligible" as compare to the wait event itself? (Like DATAFILE > IO and such)? I mean that we would need to either *remove* a bunch of wait events, or change the API so that there are "cheap" wait events that just have the current overhead, and expensive wait events that do this tracking. > If so I agree and that's something I also have discussed with a few (also if we > add the timings later), that we could/should be more granular. If we go that way > we could also imagine a GUC to set the stats for all of them? (default being > off). Any conditional code that you add makes this stuff more expensive... > > And optimize this code a good bit. This > > really shouldn't end up with this mouch code. > > Yes, with the idea mentioned in [1] (code not shared yet), I end up with: > > Dump of assembler code for function waitEventIncrementCounter: > => 0x000000000091b320 <+0>: mov %edi,%eax > 0x000000000091b322 <+2>: and $0xff000000,%eax > 0x000000000091b327 <+7>: jne 0x91b330 <waitEventIncrementCounter+16> > 0x000000000091b329 <+9>: test %di,%di > 0x000000000091b32c <+12>: jne 0x91b330 <waitEventIncrementCounter+16> > 0x000000000091b32e <+14>: ret > 0x000000000091b32f <+15>: nop > 0x000000000091b330 <+16>: cmp $0x1000000,%eax > 0x000000000091b335 <+21>: jne 0x91b33d <waitEventIncrementCounter+29> > 0x000000000091b337 <+23>: cmp $0x5e,%di > 0x000000000091b33b <+27>: ja 0x91b32e <waitEventIncrementCounter+14> > 0x000000000091b33d <+29>: cmp $0x7000000,%eax > 0x000000000091b342 <+34>: sete %cl > 0x000000000091b345 <+37>: test %di,%di > 0x000000000091b348 <+40>: setne %dl > 0x000000000091b34b <+43>: test %dl,%cl > 0x000000000091b34d <+45>: jne 0x91b32e <waitEventIncrementCounter+14> > 0x000000000091b34f <+47>: cmp $0xb000000,%eax > 0x000000000091b354 <+52>: je 0x91b32e <waitEventIncrementCounter+14> > 0x000000000091b356 <+54>: mov %edi,%eax > 0x000000000091b358 <+56>: movzwl %di,%edi > 0x000000000091b35b <+59>: shr $0x18,%eax > 0x000000000091b35e <+62>: mov 0xe4e9c0(,%rax,8),%rax > 0x000000000091b366 <+70>: movb $0x1,0x5603f3(%rip) # 0xe7b760 <have_wait_event_stats> > 0x000000000091b36d <+77>: mov 0x8(%rax),%rax > 0x000000000091b371 <+81>: addq $0x1,(%rax,%rdi,8) > 0x000000000091b376 <+86>: ret That's still way too many jumps. > That looks better before the first branch (could be done in v1 too) as it uses > registers only but looks worse when it's time to increment the counter while > the code is much simpler: > > " > WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info); > > AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++; > " Too many indirections. The increment should be doable without needing to dereference intermediary memory locations. Greetings, Andres Freund
pgsql-hackers by date: