Thread: More idle thoughts
The Linux kernel had a big push to reduce latency, and one of the tricks they did was they replaced the usual interrupt points with a call which noted how long it had been since the last interrupt point. It occurs to me we could do the same for CHECK_FOR_INTERRUPTS() by conditionally having it call a function which calls gettimeofday and compares with the previous timestamp received at the last CFI().

I think the only gotcha would be at CHECK_FOR_INTERRUPTS() calls which occur after a syscall that an actual interrupt would interrupt. They might show a long wait even though an actual interrupt would trigger them right away -- the two that come to mind are reading from the client and blocking on a semaphore. So I think we would need to add a parameter to indicate if that's the case for a given call-site. I think it would be easy to know what value to put there because I think we always do a CFI() immediately after such syscalls but I'm not 100% sure that's the case.

Obviously this wouldn't run all the time. I'm not even sure it should run on the build farm because I think doing an extra syscall at these places might mask timing bugs by synchronizing the bus in a lot of places. But even a few build farm animals might uncover places where we don't respond to C-c or hold up the sinval messages etc.

It also doesn't replace our current method of responding to user complaints -- many if not all of them are relatively subtle cases where the user is doing something unusual to create a loop that doesn't normally occur. But we won't know that unless we try.

-- greg
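
A minimal sketch of the instrumentation described in this message, assuming a build-time switch. Every name here (CFI_TIMING, CFI_TIMED, cfi_note_time, the 100 ms threshold) is hypothetical and made up for illustration; none of it is existing PostgreSQL code. It only shows the shape of the idea: take a gettimeofday timestamp at each check, compare it with the previous one, and report a gap that is too long.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical names throughout; this is not PostgreSQL source. */
static struct timeval cfi_last_time;          /* timestamp taken at the previous check */
static int  cfi_have_last = 0;                /* 0 until the first check has run */
static long cfi_warn_threshold_us = 100000;   /* assumed threshold: 100 ms */
static long cfi_max_gap_us = 0;               /* longest gap seen so far */

/* Microseconds elapsed between two timestamps. */
static long
elapsed_us(const struct timeval *from, const struct timeval *to)
{
    return (long) (to->tv_sec - from->tv_sec) * 1000000L
         + (long) (to->tv_usec - from->tv_usec);
}

/*
 * Record the time of this check and return the gap since the previous one
 * (0 for the very first call). Only the call site that observes the long
 * gap is known, not the code path that caused it.
 */
static long
cfi_note_time(const char *file, int line)
{
    struct timeval now;
    long gap = 0;

    assert(file != NULL);
    gettimeofday(&now, NULL);

    if (cfi_have_last)
    {
        gap = elapsed_us(&cfi_last_time, &now);
        if (gap > cfi_max_gap_us)
            cfi_max_gap_us = gap;
        if (gap > cfi_warn_threshold_us)
            fprintf(stderr, "%s:%d: %ld us since previous interrupt check\n",
                    file, line, gap);
    }

    cfi_last_time = now;
    cfi_have_last = 1;
    return gap;
}

/* Compiled in only for a testing build, since it adds a syscall per check. */
#ifdef CFI_TIMING
#define CFI_TIMED() ((void) cfi_note_time(__FILE__, __LINE__))
#else
#define CFI_TIMED() ((void) 0)
#endif
```

In a testing build the real interrupt check would invoke something like CFI_TIMED() first; in a normal build the macro expands to nothing, so production code pays no cost.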

Greg Stark <stark@mit.edu> writes:
> The Linux kernel had a big push to reduce latency, and one of the
> tricks they did was they replaced the usual interrupt points with a
> call which noted how long it had been since the last interrupt point.
> It occurs to me we could do the same for CHECK_FOR_INTERRUPTS() by
> conditionally having it call a function which calls gettimeofday and
> compares with the previous timestamp received at the last CFI().

Hmm. The only thing you could find out is which CFI call was reporting the long delay, not what the path of control in between had been. If the long computation had been down inside some function that's no longer active, this wouldn't help much to track it down. Still, it'd be better than no data at all.

> I think the only gotcha would be at CHECK_FOR_INTERRUPTS() calls which
> occur after a syscall that an actual interrupt would interrupt. They
> might show a long wait even though an actual interrupt would trigger
> them right away -- the two that come to mind are reading from the
> client and blocking on a semaphore. So I think we would need to add a
> parameter to indicate if that's the case for a given call-site. I
> think it would be easy to know what value to put there because I think
> we always do a CFI() immediately after such syscalls but I'm not 100%
> sure that's the case.

I'm afraid you're being too optimistic about that. What I'd think about is just adding an extra call to reset the delay-time counter after each such syscall, rather than having two kinds of CFI.

> Obviously this wouldn't run all the time.

Yeah, it would increase the overhead of CFI by orders of magnitude, so you wouldn't even want to think about building a production version that way. But it might be a useful testing option.

regards, tom lane
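
The reset-call variant suggested in this reply might look roughly like the sketch below. All names (cfi_timing_reset, cfi_timing_check, the `_at` helpers, the 100 ms threshold) are hypothetical, not existing PostgreSQL functions. The `_at` forms take the timestamp as an argument so the logic can be exercised without real waiting; the plain forms read the wall clock with gettimeofday.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical names throughout; this is not PostgreSQL source. */
typedef long long usec_t;

static usec_t cfi_last_us = -1;           /* -1 means no timestamp recorded yet */
static usec_t cfi_threshold_us = 100000;  /* assumed threshold: 100 ms */
static int    cfi_reports = 0;            /* number of long gaps reported */

/* Current wall-clock time in microseconds. */
static usec_t
wall_clock_us(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (usec_t) tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Restart the delay counter, e.g. right after an interruptible syscall. */
static void
cfi_timing_reset_at(usec_t now)
{
    cfi_last_us = now;
}

/* Check the gap since the last check or reset; returns 1 if it was reported. */
static int
cfi_timing_check_at(usec_t now, const char *file, int line)
{
    int reported = 0;

    assert(file != NULL);
    if (cfi_last_us >= 0 && now - cfi_last_us > cfi_threshold_us)
    {
        fprintf(stderr, "%s:%d: %lld us since last check or reset\n",
                file, line, now - cfi_last_us);
        cfi_reports++;
        reported = 1;
    }
    cfi_last_us = now;
    return reported;
}

/* Wall-clock wrappers, as they would be called from instrumented code. */
static void
cfi_timing_reset(void)
{
    cfi_timing_reset_at(wall_clock_us());
}

static int
cfi_timing_check(const char *file, int line)
{
    return cfi_timing_check_at(wall_clock_us(), file, line);
}
```

With this shape there is only one kind of check. A call site that blocks in a client read or a semaphore wait would call cfi_timing_reset() once the syscall returns, so the time spent blocked is not charged to the next check.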

On Fri, 2010-03-26 at 18:59 +0000, Greg Stark wrote:
> The Linux kernel had a big push to reduce latency, and one of the
> tricks they did was they replaced the usual interrupt points with a
> call which noted how long it had been since the last interrupt point.
> It occurs to me we could do the same for CHECK_FOR_INTERRUPTS() by
> conditionally having it call a function which calls gettimeofday and
> compares with the previous timestamp received at the last CFI().

Reducing latency sounds good, but what has CFI got to do with that?

-- Simon Riggs www.2ndQuadrant.com

Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2010-03-26 at 18:59 +0000, Greg Stark wrote:
>> It occurs to me we could do the same for CHECK_FOR_INTERRUPTS() by
>> conditionally having it call a function which calls gettimeofday and
>> compares with the previous timestamp received at the last CFI().

> Reducing latency sounds good, but what has CFI got to do with that?

It took me about five minutes to figure out what Greg was on about too. His point is that we need to locate code paths in which an extremely long time can pass between successive CFI calls, because that means the backend will fail to respond to SIGINT/SIGTERM for a long time. Instrumenting CFI itself is a possible tool for that.

regards, tom lane

On Sun, 2010-03-28 at 12:47 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Fri, 2010-03-26 at 18:59 +0000, Greg Stark wrote:
> >> It occurs to me we could do the same for CHECK_FOR_INTERRUPTS() by
> >> conditionally having it call a function which calls gettimeofday and
> >> compares with the previous timestamp received at the last CFI().
>
> > Reducing latency sounds good, but what has CFI got to do with that?
>
> It took me about five minutes to figure out what Greg was on about too.
> His point is that we need to locate code paths in which an extremely
> long time can pass between successive CFI calls, because that means
> the backend will fail to respond to SIGINT/SIGTERM for a long time.
> Instrumenting CFI itself is a possible tool for that.

I was thinking we could do this via signals, but actually instrumenting the code paths seems better.

There probably are a few paths still to improve. Dare I suggest we follow the tried and tested open source approach of wait-for-complaint? Reducing latency elsewhere would be time better spent (!). I was thinking of adding a "reason" field onto ReadBuffer, so we can diagnose the source of buffer waits.

-- Simon Riggs www.2ndQuadrant.com