Thread: Generic Monitoring Framework Proposal
Motivation:
----------
The main goal for this Generic Monitoring Framework is to provide a common
interface for adding instrumentation points or probes to Postgres so its
behavior can be easily observed by developers and administrators even in
production systems. This framework will allow Postgres to use the appropriate
monitoring/tracing facility provided by each OS. For example, Solaris and
FreeBSD will use DTrace, and other OSes can use their respective tools.

What is DTrace?
--------------
Some of you may have heard about or used DTrace already. In a nutshell, DTrace
is a comprehensive dynamic tracing facility that is built into Solaris and
FreeBSD (mostly working) that can be used by administrators and developers on
live production systems to examine the behavior of both user programs and of
the operating system.

DTrace can help answer difficult questions about the OS and the application
itself. For example, you may want to ask:

- Show all functions that get invoked (userland & kernel) and execution time
  when my function foo() is called. Seeing the path a function takes into the
  kernel may provide clues for performance tuning.
- Show how many times a particular lock is acquired and how long it's held.
  This can help identify contention in the system.

The best way to appreciate DTrace capabilities is by seeing a demo or through
hands-on experience, and I plan to show some interesting demos at the PG
Summit.

There are a number of docs on DTrace, and here's a quick start doc and a
complete reference guide.
http://www.sun.com/software/solaris/howtoguides/dtracehowto.jsp
http://docs.sun.com/app/docs/doc/817-6223

Here is a recent DTrace for FreeBSD status
http://marc.theaimsgroup.com/?l=freebsd-current&m=114854018213275&w=2

Open source apps that provide user level probes (bottom of page)
http://uadmin.blogspot.com/2006/05/what-is-dtrace.html

Proposed Solution:
----------------
This solution is actually quite simple and non-intrusive.

1. Define macros PG_TRACE, PG_TRACE1, etc, in a new header file called
   pg_trace.h with multiple #if defined(xxx) sections for Solaris, FreeBSD,
   Linux, etc, and add pg_trace.h to c.h which is included in postgres.h and
   included by every C file.

   The macros will have the following format:

   PG_TRACE[n](module_name, probe_name [, arg1, ..., arg5])

   module_name = Name to identify PG module such as pg_backend, pg_psql,
                 pg_plpgsql, etc
   probe_name  = Probe name such as transaction_start, lwlock_acquire, etc
   arg1..arg5  = Any args to pass to the probe such as txn id, lock id, etc

2. Map PG_TRACE, PG_TRACE1, etc, to macros or functions appropriate for each
   OS. For OSes that don't have a suitable tracing facility, just map the
   macros to nothing - doing this will not have any effect on performance or
   existing behavior.

   Sample of pg_trace.h

       #if defined(sun) || defined(FreeBSD)

       #include <sys/sdt.h>
       #define PG_TRACE    DTRACE_PROBE
       #define PG_TRACE1   DTRACE_PROBE1
       ...
       #define PG_TRACE5   DTRACE_PROBE5

       #elif defined(__linux__) || defined(_AIX) || defined(__sgi) ...

       /* Map the macros to no-ops */
       #define PG_TRACE(module, name)
       #define PG_TRACE1(module, name, arg1)
       ...
       #define PG_TRACE5(module, name, arg1, arg2, arg3, arg4, arg5)

       #endif

3. Add any file(s) to support the particular OS tracing facility

4. Update the Makefiles as necessary for each OS

How to add probes:
-----------------
To add a probe, just add a one line macro in the appropriate location in the
source. Here's an example of two probes, one with no argument and the other
with 2 arguments:

    PG_TRACE (pg_backend, fsync_start);
    PG_TRACE2 (pg_backend, lwlock_acquire, lockid, mode);

If there are enough probes embedded in PG, its behavior can be easily
observed.

With the help of Gavin Sherry, we have added about 20 probes, and Gavin has
suggested a number of other interesting areas for additional probes. Pervasive
has also added some probes to PG 8.0.4 and posted the patch on
http://pgfoundry.org/projects/dtrace/. I hope to combine the probes using this
generic framework for 8.1.4, and make it available for folks to try.

Since my knowledge of the PG source code is limited, I'm looking for
assistance from experts to help identify some new interesting probe points.

How to use probes:
----------------
For DTrace, probes can be enabled using a D script. When the probes are not
enabled, there is absolutely no performance hit whatsoever.

Here is a simple example to print out the number of LWLock acquisitions for
each PG process.

test.d

    #!/usr/sbin/dtrace -s

    pg_backend*:::lwlock-acquire
    {
        @foo[pid] = count();
    }

    dtrace:::END
    {
        printf("\n%10s %15s\n", "PID", "Count");
        printa("%10d %@15d\n", @foo);
    }

    # ./test.d

           PID           Count
          1438              28
          1447            7240
          1448            9675
          1449           11972

I have a prototype working, so if anyone wants to try it, I can provide a
patch or give access to my test system.

This is a proposal, so comments, suggestions, and feedback are certainly
welcome.

Regards,
Robert

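A minimal sketch of the macro mapping from steps 1 and 2, for a platform that has no DTrace. Only PG_TRACE, PG_TRACE2, and the DTRACE_PROBE macros from <sys/sdt.h> come from the proposal. The PG_TRACE_USE_DTRACE switch, the pg_trace_hits counter, and the demo_* functions are hypothetical names introduced here for illustration. The counting branch simply stands in for "whatever facility the OS provides", so the same call sites can be exercised anywhere.

```c
#include <assert.h>

#if defined(PG_TRACE_USE_DTRACE)        /* hypothetical build switch */
/* DTrace branch: also needs a provider .d file and a dtrace -G link step. */
#include <sys/sdt.h>
#define PG_TRACE(module, name)               DTRACE_PROBE(module, name)
#define PG_TRACE2(module, name, arg1, arg2)  DTRACE_PROBE2(module, name, arg1, arg2)
#else
/* Stand-in backend (an assumption, not in the proposal): count probe hits. */
static unsigned long pg_trace_hits = 0;
#define PG_TRACE(module, name)               ((void) pg_trace_hits++)
#define PG_TRACE2(module, name, arg1, arg2) \
    ((void) (arg1), (void) (arg2), (void) pg_trace_hits++)
#endif

/* Hypothetical call sites; real ones would live in the backend source. */
void demo_fsync(void)
{
    PG_TRACE(pg_backend, fsync_start);
    /* ... the actual fsync work would go here ... */
}

int demo_lwlock_acquire(int lockid, int mode)
{
    assert(lockid >= 0);    /* demo-only sanity check */
    PG_TRACE2(pg_backend, lwlock_acquire, lockid, mode);
    return lockid + mode;   /* placeholder result so the demo is observable */
}
```

The call sites never change between branches; only the macro definitions do, which is the property the proposal relies on.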
Robert Lor <Robert.Lor@Sun.COM> writes:
> The main goal for this Generic Monitoring Framework is to provide a
> common interface for adding instrumentation points or probes to
> Postgres so its behavior can be easily observed by developers and
> administrators even in production systems.

What is the overhead of a "probe" when you're not using it? The answer
had better not include the phrase "kernel call", or this is unlikely to
pass muster...

> For DTrace, probes can be enabled using a D script. When the probes
> are not enabled, there is absolutely no performance hit whatsoever.

If you believe that, I have a bridge in Brooklyn you might be interested
in.

What are the criteria going to be for where to put probe calls? If it
has to be hard-wired into the source code, I foresee a lot of contention
about which probes are worth their overhead, because we'll need
one-size-fits-all answers.

> arg1..arg5 = Any args to pass to the probe such as txn id, lock id, etc

Where is the data type of a probe argument defined?

regards, tom lane

On Jun 19, 2006, at 4:40 PM, Tom Lane wrote:
> Robert Lor <Robert.Lor@Sun.COM> writes:
>> The main goal for this Generic Monitoring Framework is to provide a
>> common interface for adding instrumentation points or probes to
>> Postgres so its behavior can be easily observed by developers and
>> administrators even in production systems.
>
> What is the overhead of a "probe" when you're not using it? The answer
> had better not include the phrase "kernel call", or this is unlikely to
> pass muster...
>
>> For DTrace, probes can be enabled using a D script. When the
>> probes are not enabled, there is absolutely no performance hit
>> whatsoever.
>
> If you believe that, I have a bridge in Brooklyn you might be interested
> in.

Heh. Syscall probes and FBT probes in DTrace have zero overhead.
User-space probes do have overhead, but it is only a few instructions
(two I think). Basically, the probe points are replaced by illegal
instructions and the kernel infrastructure for DTrace will fasttrap the
ops and then act. So, it is tiny tiny overhead. Little enough that it
isn't unreasonable to instrument things like s_lock which are tiny.

> What are the criteria going to be for where to put probe calls? If it
> has to be hard-wired into the source code, I foresee a lot of contention
> about which probes are worth their overhead, because we'll need
> one-size-fits-all answers.
>
>> arg1..arg5 = Any args to pass to the probe such as txn id, lock
>> id, etc
>
> Where is the data type of a probe argument defined?

I assume it would depend on the probe implementation. In DTrace they
are implemented in .d files that will post-instrument the object before
final linkage. DTrace's whole purpose is to be low overhead and it
really does it in a fantastic way.

As an example, you can take an uninstrumented binary and add dynamic
instrumentation to the entry, exit and every instruction op-code over
every single routine in the process. And clearly, as the binary is
uninstrumented, the overhead is indeed zero when the probes are not
enabled.

The reason that Robert proposes user-space probes (I assume) is that
tracing C functions can be too granular and not conveniently expose the
"right" information to make tracing useful.

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
// Ecelerity: Run with it.

Tom Lane wrote:
> Robert Lor <Robert.Lor@Sun.COM> writes:
>> The main goal for this Generic Monitoring Framework is to provide a
>> common interface for adding instrumentation points or probes to
>> Postgres so its behavior can be easily observed by developers and
>> administrators even in production systems.
>
> What is the overhead of a "probe" when you're not using it? The answer
> had better not include the phrase "kernel call", or this is unlikely to
> pass muster...

Here's what the DTrace developers have to say in their Usenix paper.

"When not explicitly enabled, DTrace has zero probe effect - the system
operates exactly as if DTrace were not present at all."
http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf

The technical details are beyond me, so I can't tell you exactly what
happens internally. I can find out if you're interested!

>> For DTrace, probes can be enabled using a D script. When the probes
>> are not enabled, there is absolutely no performance hit whatsoever.
>
> If you believe that, I have a bridge in Brooklyn you might be interested
> in.
>
> What are the criteria going to be for where to put probe calls? If it
> has to be hard-wired into the source code, I foresee a lot of contention
> about which probes are worth their overhead, because we'll need
> one-size-fits-all answers.

I think we need to be selective in terms of which probes to add since we
don't want to scatter them all over the source files. For DTrace, the
overhead is very minimal, but you're right, other implementations of the
same probe may have more performance overhead.

>> arg1..arg5 = Any args to pass to the probe such as txn id, lock id, etc
>
> Where is the data type of a probe argument defined?

It's in a .d file which looks like below:

    provider pg_backend {
        probe fsync__start(void);
        probe fsync__end(void);
        probe lwlock__acquire (int, int);
        probe lwlock__release(int);
        ...
    }

Regards,
Robert

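One detail that connects the provider definition above to the earlier test.d script: as far as I understand the USDT naming convention, two consecutive underscores in a probe name declared in the .d file show up as a single hyphen in the name used from a D script, so lwlock__acquire is traced as lwlock-acquire. The helper below is only an illustration of that renaming rule; probe_name_to_d() is a hypothetical function, not part of DTrace or of the proposal.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src into dst, turning each "__" pair into "-".
 * dst must have room for dstlen bytes; the result is always NUL-terminated
 * when dstlen > 0. Returns dst.
 */
char *probe_name_to_d(const char *src, char *dst, size_t dstlen)
{
    size_t i = 0;   /* read position in src */
    size_t j = 0;   /* write position in dst */

    assert(src != NULL && dst != NULL);
    if (dstlen == 0)
        return dst;

    while (src[i] != '\0' && j + 1 < dstlen)
    {
        if (src[i] == '_' && src[i + 1] == '_')
        {
            dst[j++] = '-';     /* "__" in the provider file -> "-" in D */
            i += 2;
        }
        else
            dst[j++] = src[i++];
    }
    dst[j] = '\0';
    return dst;
}
```

A single underscore is left untouched, so a name like lwlock_acquire would stay as written.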
On Mon, Jun 19, 2006 at 05:20:31PM -0400, Theo Schlossnagle wrote:
> Heh. Syscall probes and FBT probes in DTrace have zero overhead.
> User-space probes do have overhead, but it is only a few instructions
> (two I think). Basically, the probe points are replaced by illegal
> instructions and the kernel infrastructure for DTrace will fasttrap
> the ops and then act. So, it is tiny tiny overhead. Little enough
> that it isn't unreasonable to instrument things like s_lock which are
> tiny.

If someone wanted to, they should be able to do benchmarking with the
DTrace patches on pgFoundry to see the overhead of just having the
probes in, and then having the probes in and actually using them. If you
*really* want to see the difference, add a probe in s_lock. :)
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Robert.Lor@Sun.COM (Robert Lor) writes:
> For DTrace, probes can be enabled using a D script. When the probes
> are not enabled, there is absolutely no performance hit whatsoever.

That seems inconceivable.

In order to have a way of deciding whether or not the probes are
enabled, there has *got* to be at least one instruction executed, and
that can't be costless.
--
output = reverse("gro.mca" "@" "enworbbc")
http://www.ntlug.org/~cbbrowne/wp.html
"...while I know many people who emphatically believe in
reincarnation, I have never met or read one who could satisfactorily
explain population growth." -- Spider Robinson

Theo Schlossnagle wrote:
> Heh. Syscall probes and FBT probes in DTrace have zero overhead.
> User-space probes do have overhead, but it is only a few instructions
> (two I think). Basically, the probe points are replaced by illegal
> instructions and the kernel infrastructure for DTrace will fasttrap
> the ops and then act. So, it is tiny tiny overhead. Little enough
> that it isn't unreasonable to instrument things like s_lock which are
> tiny.

Theo, you're a genius. FBT (function boundary tracing) probes have zero
overhead (section 4.1) and user-space probes have two instructions of
overhead (section 4.2). I was incorrect about making a general zero
overhead statement. But it's so close to zero :-)

http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf

> The reason that Robert proposes user-space probes (I assume) is that
> tracing C functions can be too granular and not conveniently expose
> the "right" information to make tracing useful.

Yes, I'm proposing user-space probes (aka User Statically-Defined
Tracing - USDT). USDT provides a high-level abstraction so the
application can expose well defined probes without the user having to
know the detailed implementation. For example, instead of having to
know the function LWLockAcquire(), a well documented probe called
lwlock_acquire with the appropriate args is much more usable.

Regards,
Robert

On Jun 19, 2006, at 6:41 PM, Robert Lor wrote:
> Theo Schlossnagle wrote:
>> Heh. Syscall probes and FBT probes in DTrace have zero
>> overhead. User-space probes do have overhead, but it is only a
>> few instructions (two I think). Basically, the probe points are
>> replaced by illegal instructions and the kernel infrastructure
>> for DTrace will fasttrap the ops and then act. So, it is tiny
>> tiny overhead. Little enough that it isn't unreasonable to
>> instrument things like s_lock which are tiny.
>
> Theo, you're a genius. FBT (function boundary tracing) probes have
> zero overhead (section 4.1) and user-space probes have two
> instructions of overhead (section 4.2). I was incorrect about making
> a general zero overhead statement. But it's so close to zero :-)
>
> http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf
>
>> The reason that Robert proposes user-space probes (I assume) is
>> that tracing C functions can be too granular and not conveniently
>> expose the "right" information to make tracing useful.
>
> Yes, I'm proposing user-space probes (aka User Statically-Defined
> Tracing - USDT). USDT provides a high-level abstraction so the
> application can expose well defined probes without the user having
> to know the detailed implementation. For example, instead of
> having to know the function LWLockAcquire(), a well documented
> probe called lwlock_acquire with the appropriate args is much more
> usable.

I am giving a talk at OSCON this year about PostgreSQL on "big systems".
Big is all relative, but I will be talking about DTrace a bit and the
advantages of running PostgreSQL on Solaris, which is what we ended up
doing after some extremely disturbing experiences on Linux.

I was able to track a very acute memory "leak" in pl/perl (which Neil so
kindly fixed) within a few moments -- and this is without explicit
user-space trace points. If there were good user-space points, I likely
wouldn't have had to dig in the source as a precursor to my DTrace
efforts.

The things you might be able to do with user-specific trace points:

o better understand the block scatter (distance of block-level reads)
  for a specific query.
o understand lock contention in vastly multiprocessor systems using
  plockstat (my hunch is that heavy-weight locks might be better).
o our current box is 4 way opteron, but we have a 16-way T2000 as well.
o report on queries including turn-around time, block-accesses, lock
  acquisitions grouped by query for specific time windows.

The nice thing about DTrace is that it requires no "prep" to look at a
problem. When something is acting odd in production, you don't want to
attempt to repeat it in a test environment first. You want to observe
it. DTrace allows you to dig in "really deep" in production with an
acceptable performance penalty and ask questions that couldn't be asked
before. It is exceptionally clever stuff. Of all the new "neat stuff"
in Solaris 10, it has my vote for coolest and most useful. I've nailed
several production problems (outside of Postgres) using DTrace with
accuracy and efficiency. When Solaris 10u2 is released, we'll be trying
Postgres on ZFS, so my rankings may change :-)

The idea of having intelligently placed DTrace probes in Postgres would
allow us to deal with Postgres as a "first class" app on Solaris 10 with
respect to troubleshooting obtuse production problems. That, to me, is
exciting stuff.

Best regards,
Theo

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
// Ecelerity: Run with it.

Jim C. Nasby wrote:
> On Mon, Jun 19, 2006 at 05:20:31PM -0400, Theo Schlossnagle wrote:
>> Heh. Syscall probes and FBT probes in DTrace have zero overhead.
>> User-space probes do have overhead, but it is only a few instructions
>> (two I think). Basically, the probe points are replaced by illegal
>> instructions and the kernel infrastructure for DTrace will fasttrap
>> the ops and then act. So, it is tiny tiny overhead. Little enough
>> that it isn't unreasonable to instrument things like s_lock which are
>> tiny.
>
> If someone wanted to, they should be able to do benchmarking with the
> DTrace patches on pgFoundry to see the overhead of just having the
> probes in, and then having the probes in and actually using them. If you
> *really* want to see the difference, add a probe in s_lock. :)

We will need to benchmark on FreeBSD to see if those comments about
overhead stand up to scrutiny there too.

I would think that even if (for instance) we find that there is no
overhead on Solaris, those of us on platforms where DTrace is less
mature would want the option of building without any probes at all in
the code - I guess a configure option "--without-dtrace" on by default
on those platforms would do it.

regards

Mark

On Jun 19, 2006, at 7:39 PM, Mark Kirkwood wrote:
> We will need to benchmark on FreeBSD to see if those comments about
> overhead stand up to scrutiny there too.

I've followed the development of DTrace on FreeBSD and the design
approach is mostly identical to the Solaris one. This would mean that
if there is overhead on FreeBSD not present on Solaris, it would be
considered a bug and likely fixed.

> I would think that even if (for instance) we find that there is no
> overhead on Solaris, those of us on platforms where DTrace is less
> mature would want the option of building without any probes at all
> in the code - I guess a configure option "--without-dtrace" on by
> default on those platforms would do it.

Absolutely. As they are all proposed as preprocessor macros, this would
be trivial to accomplish.

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
// Ecelerity: Run with it.

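A rough sketch of why compiling the probes out is cheap when they are plain preprocessor macros. PG_TRACE_ENABLED is a hypothetical symbol that a configure option could define; the demo functions and the counter are likewise made up for this example. The sketch expands the disabled macro to ((void) 0) rather than to nothing, which is a small design choice so the macro still behaves like an expression.

```c
#include <assert.h>

#ifdef PG_TRACE_ENABLED                 /* hypothetical configure-time symbol */
#include <sys/sdt.h>
#define PG_TRACE1(module, name, arg1)   DTRACE_PROBE1(module, name, arg1)
#else
#define PG_TRACE1(module, name, arg1)   ((void) 0)   /* probe compiled out */
#endif

static int arg_evaluations = 0;

/* Hypothetical argument expression with a visible side effect. */
int next_lock_id(void)
{
    arg_evaluations++;
    return 42;
}

/* Returns how many times the probe argument was evaluated so far. */
int release_lock_demo(void)
{
    PG_TRACE1(pg_backend, lwlock_release, next_lock_id());
    return arg_evaluations;
}

/* Self-check for the default build, where probes are compiled out. */
void check_compiled_out(void)
{
#ifndef PG_TRACE_ENABLED
    assert(release_lock_demo() == 0);   /* argument expression never ran */
#endif
}
```

One consequence worth keeping in mind: when the macro is compiled out, its arguments are never evaluated, so probe arguments should not carry side effects the surrounding code depends on.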
On Mon, 2006-06-19 at 19:36 -0400, Theo Schlossnagle wrote:
> The idea of having intelligently placed dtrace probes in Postgres
> would allow us ...
> to troubleshoot[ing] obtuse production
> problems. That, to me, is exciting stuff.
[paraphrased by SR]

I very much agree with the requirement here.

This needs to work on Linux and Windows, minimum, also.

It's obviously impossible to move a production system to a different OS
just to use a cool tracing tool. So the architecture must intelligently
handle the needs of multiple OSes - even if the underlying facilities on
them do not yet provide what we'd like. So I'm OK with Solaris being the
best, just as long as it's not the only one that benefits.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

On Mon, Jun 19, 2006 at 05:14:15PM -0400, Chris Browne wrote:
> Robert.Lor@Sun.COM (Robert Lor) writes:
> > For DTrace, probes can be enabled using a D script. When the probes
> > are not enabled, there is absolutely no performance hit whatsoever.
>
> That seems inconceivable.
>
> In order to have a way of deciding whether or not the probes are
> enabled, there has *got* to be at least one instruction executed, and
> that can't be costless.

I think the trick is that the probes are enabled by overwriting bits of
code. So by default you might put a no-op instruction and if you want to
trace you replace that with an illegal instruction or the special
one-byte INT3 instruction x86 systems have for this purpose.

With a 17-stage pipelined processor I imagine the cost of a no-op would
indeed be almost unmeasurable (it increases code size, I suppose).

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

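The patching idea described above can be modelled in a few lines. This is only a toy: it flips one byte in an ordinary array and never executes or modifies real machine code, and it is not a description of how DTrace is actually implemented. The opcode values 0x90 (NOP) and 0xCC (INT3) are the standard x86 one-byte encodings; the ToyText type and the toy_* functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { OP_NOP = 0x90, OP_INT3 = 0xCC };   /* x86 one-byte opcodes */

#define TOY_TEXT_SIZE 8

typedef struct ToyText
{
    uint8_t code[TOY_TEXT_SIZE];   /* pretend instruction bytes, never run */
    size_t  probe_offset;          /* where the probe site sits */
} ToyText;

/* Start with the probe site holding a NOP, i.e. the probe is disabled. */
void toy_init(ToyText *t, size_t probe_offset)
{
    assert(t != NULL && probe_offset < TOY_TEXT_SIZE);
    memset(t->code, OP_NOP, sizeof t->code);
    t->probe_offset = probe_offset;
}

/* "Enabling" overwrites the NOP with a trapping instruction. */
void toy_enable_probe(ToyText *t)
{
    t->code[t->probe_offset] = OP_INT3;
}

/* "Disabling" restores the NOP. */
void toy_disable_probe(ToyText *t)
{
    t->code[t->probe_offset] = OP_NOP;
}

int toy_probe_enabled(const ToyText *t)
{
    return t->code[t->probe_offset] == OP_INT3;
}
```

In this model the running program never tests an "is tracing on?" flag; the only cost of a disabled probe is the NOP byte sitting in the instruction stream.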
Robert Lor <Robert.Lor@Sun.COM> writes:
> Yes, I'm proposing user-space probes (aka User Statically-Defined Tracing -
> USDT). USDT provides a high-level abstraction so the application can expose
> well defined probes without the user having to know the detailed
> implementation. For example, instead of having to know the function
> LWLockAcquire(), a well documented probe called lwlock_acquire with the
> appropriate args is much more usable.

It seems pointless to me to expose things like lwlock_acquire that map 1-1 to
C function calls like LWLockAcquire. They're useless except to people who
understand what's going on, and if people know the low level implementation
details of Postgres they can already trace those calls with dtrace without any
help.

What would be useful is instrumenting high level calls that can't be traced
without application guidance. For example, inserting a dtrace probe for each
SQL and each plan node. That way someone could get the same info as EXPLAIN
ANALYZE from a production server without having to make application
modifications (or suffer the gettimeofday overhead).

It's one thing to know "I seem to be acquiring a lot of locks" or "I'm
spending all my time in sorting". It's another to be able to ask dtrace "what
query am I running when doing all this sorting?" or "what kind of plan node am
I running when I'm acquiring all these locks?"

--
greg

Greg Stark <gsstark@mit.edu> writes:
> What would be useful is instrumenting high level calls that can't be traced
> without application guidance. For example, inserting a dtrace probe for each
> SQL and each plan node. That way someone could get the same info as EXPLAIN
> ANALYZE from a production server without having to make application
> modifications (or suffer the gettimeofday overhead).

My bogometer just went off again. How is something like dtrace going to
magically get realtime information without reading the clock?

regards, tom lane

Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
> > What would be useful is instrumenting high level calls that can't be traced
> > without application guidance. For example, inserting a dtrace probe for each
> > SQL and each plan node. That way someone could get the same info as EXPLAIN
> > ANALYZE from a production server without having to make application
> > modifications (or suffer the gettimeofday overhead).
>
> My bogometer just went off again. How is something like dtrace going to
> magically get realtime information without reading the clock?

Sorry, I meant get the same info as EXPLAIN ANALYZE minus the timing.

I'm not familiar with DTrace first-hand but I did have the impression it was
possible to get timing information though. I don't know how much overhead it
has but I wouldn't be surprised if it was lower for a kernel-based profiling
elapsed time counter on Sun hardware than a general purpose gettimeofday call
on commodity PC hardware.

For example it could use a cpu instruction counter and have hooks in the
scheduler for saving and restoring the counter to avoid the familiar gotchas
with being rescheduled across processors.

--
greg

Simon Riggs wrote:
> This needs to work on Linux and Windows, minimum, also.

The proposed solution will work on Linux & Windows if they have a similar
facility that the macros can map to. Otherwise, the macros stay as
no-ops and will not affect those platforms at all.

> It's obviously impossible to move a production system to a different OS
> just to use a cool tracing tool. So the architecture must intelligently
> handle the needs of multiple OS - even if the underlying facilities on
> them do not yet provide what we'd like. So I'm OK with Solaris being the
> best, just as long as its not the only one that benefits.

The way it's proposed now, any OS can use the same interfaces and map to
their underlying facilities. Does it look reasonable?

Regards,
Robert

Greg Stark wrote:
> It seems pointless to me to expose things like lwlock_acquire that map 1-1 to
> C function calls like LWLockAcquire. They're useless except to people who
> understand what's going on and if people know the low level implementation
> details of Postgres they can already trace those calls with dtrace without
> any help.

lwlock_acquire is just an example. I think once we decide to go down this
path, we can solicit ideas for interesting probes and put them up for
discussion on this alias whether or not they are needed. I think we need
to have two categories of probes, for admins and for developers. Perhaps
the probes for admins are more important since, as you said, the
developers already know which function does what, but I think the
low-level probes are still useful for new developers as their behavior
will be documented.

> What would be useful is instrumenting high level calls that can't be traced
> without application guidance. For example, inserting a dtrace probe for each
> SQL and each plan node. That way someone could get the same info as EXPLAIN
> ANALYZE from a production server without having to make application
> modifications (or suffer the gettimeofday overhead).
>
> It's one thing to know "I seem to be acquiring a lot of locks" or "i'm
> spending all my time in sorting". It's another to be able to ask dtrace "what
> query am I running when doing all this sorting?" or "what kind of plan node
> am I running when I'm acquiring all these locks?"

Completely agree.

Regards,
Robert