Thread: Patch: add timing of buffer I/O requests
Hi, I know I'm late for the ongoing commitfest, but I thought I might as well be early for the next one. Attached is a patch against master that implements timing of shared buffer fills and per-relation stats collection of said timings. Buffer flushes are timed as well, but aren't exposed per table because of the difficulty of correctly attributing them.

Some notes on the implementation:

* The timing is done in bufmgr.c. Under high load some CPU contention will get attributed to I/O because the process doing the I/O won't get a time-slice immediately.
* I decided to also account waiting for others doing the I/O as I/O waits. They aren't double counted for per-relation stats though.
* I added a GUC track_iotiming, defaulting to off, because timing isn't cheap on all platforms.
* I used instr_time to keep the counts, to be consistent with function timings, but maybe both should convert to plain uint64 counts to make the arithmetic code cleaner.
* Timings are exposed via EXPLAIN (BUFFERS), pg_stat_statements and the pg_statio_* views.
* I noticed there aren't any pg_statio_xact_* views. I don't have any need for them myself, but thought I'd mention the inconsistency.
* The feature is really useful for me with auto_explain. Even better with Peter's pg_stat_statements query cleaning applied.

I did some testing on an older AMD Athlon X2 BE-2350 and an Intel i5 M 540 to see the overhead. The AMD CPU doesn't have the necessary features for fast user-mode timing and has an overhead of about 900ns per gettimeofday call. The Intel processor has an overhead of 22ns per call.

I tried a read-only pgbench with scale factor 50 and shared_buffers=32MB to induce a lot of I/O traffic that hits the OS cache. That seems like it should be the worst case for this patch. On the AMD I saw about a 3% performance drop with timing enabled. On the Intel machine I couldn't measure any statistically significant change. The median was actually higher with timing enabled, but stddevs were large enough to hide a couple of percent of performance loss. This needs some further testing.

Preliminary results for the Intel machine with stddev (10 5min runs):

 -c | master          | io-stats
  4 | 16521.53 ±4.49% | +1.16% ±3.21%
 16 | 13923.49 ±5.98% | +0.56% ±7.11%

This is my first patch, so I hope I haven't missed anything too trivial.

--
Ants Aasma
ants.aasma@eesti.ee
Attachment
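To make the pattern described above concrete, here is a standalone sketch of timing an I/O call behind an on/off switch. The names (timed_read, the track_iotiming stand-in) are invented for illustration; the actual patch uses PostgreSQL's instr_time machinery inside bufmgr.c rather than code like this.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>

static bool track_iotiming = true;   /* stands in for the GUC */
static uint64_t read_time_ns;        /* accumulated read-wait time */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static ssize_t timed_read(int fd, void *buf, size_t len)
{
    uint64_t start = 0;
    ssize_t  rc;

    if (track_iotiming)
        start = now_ns();            /* pay for the clock only when enabled */
    rc = read(fd, buf, len);
    if (track_iotiming)
        read_time_ns += now_ns() - start;
    return rc;
}

int main(void)
{
    char buf[8192];
    int  fd = open("/etc/hostname", O_RDONLY);

    if (fd >= 0)
    {
        timed_read(fd, buf, sizeof(buf));
        close(fd);
    }
    printf("read wait: %llu ns\n", (unsigned long long) read_time_ns);
    return 0;
}

Note how the clock is read twice per I/O when the switch is on; that is why the per-call cost of the clock source dominates the overhead discussion that follows.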
On 11/27/2011 04:39 PM, Ants Aasma wrote:
> On the AMD I saw about 3% performance drop with timing enabled. On the
> Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to do it this round. The real problem with this whole area is that we know there are systems floating around where the amount of time taken to grab timestamps like this is just terrible. I've been annoyed enough by that problem to spend some time digging into why that is--it seems to be a bunch of trivia around the multiple ways to collect time info on x86 systems--and after this CommitFest is over I was already hoping to dig through my notes and start quantifying that more. So you can't really prove the overhead of this approach is acceptable just by showing two examples; we need to find one of the really terrible clocks and test there to get a real feel for the worst case.

I recall a patch similar to this one was submitted by Greg Stark some time ago. It used the info for different reasons--to try and figure out whether reads were cached or not--but I believe it withered rather than being implemented mainly because it ran into the same fundamental roadblocks here. My memory could be wrong here; there were also concerns about what the data would be used for.

I've been thinking about a few ways to try and cope with this whole class of timing problem:

-Document the underlying problem and known workarounds, provide a way to test how bad the overhead is, and just throw our hands up and say "sorry, you just can't instrument like this" if someone has a slow system.

-Have one of the PostgreSQL background processes keep track of a time estimate on its own, only periodically pausing to sync against the real time. Then most calls to gettimeofday() can use that value instead. I was thinking of that idea for slightly longer running things though; I doubt it can be made accurate enough to instrument buffer I/O.

And while I hate to kick off massive bike-shedding in your direction, I'm also afraid this area--collecting stats about how long individual operations take--will need a much wider-ranging approach than just looking at the buffer cache ones. If you step back and ask "what do people expect here?", there's a pretty large number who really want something like Oracle's v$session_wait and v$system_event interface for finding the underlying source of slow things. There's enough demand for that that EnterpriseDB has even done some work in this area too; what I've been told about it suggests the code isn't a great fit for contribution to community PostgreSQL though. Like I said, this area is really messy and hard to get right.

Something more ambitious like the v$ stuff would also take care of what you're doing here; I'm not sure that what you've done helps build it though. Please don't take that personally. Part of one of my own instrumentation patches recently was rejected out of hand for the same reason, just not being general enough.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
On 28 Listopad 2011, 8:54, Greg Smith wrote:
> -Have one of the PostgreSQL background processes keep track of a time
> estimate on its own, only periodically pausing to sync against the real
> time. Then most calls to gettimeofday() can use that value instead. I
> was thinking of that idea for slightly longer running things though; I
> doubt it can be made accurate enough to instrument buffer I/O.

What about random sampling, i.e. "measure just 5% of the events" or something like that? Sure, it's not exact, but it significantly reduces the overhead. And it might be a config parameter, so the user could decide how precise the results need to be, and even take into account how fast the clocks are.

> Something more ambitious like the v$ stuff would also take care of what
> you're doing here; I'm not sure that what you've done helps build it
> though. Please don't take that personally. Part of one of my own
> instrumentation patches recently was rejected out of hand for the same
> reason, just not being general enough.

Yes, that'd be a significant improvement. The wait-event stuff is very useful and changes the tuning significantly.

Tomas
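A standalone sketch of what that sampling idea could look like (illustrative only; the 1-in-20 rate, the names, and the scaling step are assumptions, not from any posted patch): time only a fraction of the I/O events, then scale the sampled mean up to all events.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define SAMPLE_ONE_IN 20             /* time ~5% of events */

static uint64_t sampled_ns;          /* raw timed nanoseconds */
static uint64_t sampled_events;      /* events actually timed */
static uint64_t total_events;        /* all events, timed or not */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void do_io_event(void)
{
    int timed = (rand() % SAMPLE_ONE_IN) == 0;
    uint64_t start = timed ? now_ns() : 0;

    /* ... the actual I/O would happen here ... */

    if (timed)
    {
        sampled_ns += now_ns() - start;
        sampled_events++;
    }
    total_events++;
}

int main(void)
{
    for (int i = 0; i < 100000; i++)
        do_io_event();

    /* estimate total time by scaling the sampled mean to all events */
    if (sampled_events > 0)
        printf("estimated total: %llu ns over %llu events\n",
               (unsigned long long) (sampled_ns / sampled_events * total_events),
               (unsigned long long) total_events);
    return 0;
}

The clock cost drops by roughly the sampling factor, at the price of statistical noise on rarely-hit relations.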
On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> The real problem with this whole area is that we know there are
> systems floating around where the amount of time taken to grab timestamps
> like this is just terrible.

Assuming the feature is off by default (and I can't imagine we'd consider anything else), I don't see why that should be cause for concern. If the instrumentation creates too much system load, then don't use it: simple as that. A more interesting question is "how much load does this feature create even when it's turned off?".

The other big problem for a patch of this sort is that it would bloat the stats file. I think we really need to come up with a more scalable alternative to the current system, but I haven't looked at the current system in enough detail to have a clear feeling about what such an alternative would look like.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Nov 28, 2011 8:55 AM, "Greg Smith" <greg@2ndquadrant.com> wrote:
> On 11/27/2011 04:39 PM, Ants Aasma wrote:
>> On the AMD I saw about 3% performance drop with timing enabled. On the
>> Intel machine I couldn't measure any statistically significant change.
>
> Oh no, it's party pooper time again. Sorry I have to be the one to do it
> this round. The real problem with this whole area is that we know there
> are systems floating around where the amount of time taken to grab
> timestamps like this is just terrible.

I believe on most systems with modern Linux kernels gettimeofday and its ilk will be a vsyscall and nearly as fast as a regular function call.

> I recall a patch similar to this one was submitted by Greg Stark some
> time ago. It used the info for different reasons--to try and figure out
> whether reads were cached or not--but I believe it withered rather than
> being implemented mainly because it ran into the same fundamental
> roadblocks here. My memory could be wrong here, there were also concerns
> about what the data would be used for.

I speculated about doing that but never did. I had an experimental patch using mincore to do what you describe, but it wasn't intended for production code, I think. The only real patch was to use getrusage, which I still intend to commit, but it doesn't tell you the time spent in I/O -- though it does tell you the sys time, which should be similar.
On 28 Listopad 2011, 15:40, Greg Stark wrote:
> On Nov 28, 2011 8:55 AM, "Greg Smith" <greg@2ndquadrant.com> wrote:
>> On 11/27/2011 04:39 PM, Ants Aasma wrote:
>>> On the AMD I saw about 3% performance drop with timing enabled. On the
>>> Intel machine I couldn't measure any statistically significant change.
>>
>> Oh no, it's party pooper time again. Sorry I have to be the one to do
>> it this round. The real problem with this whole area is that we know
>> there are systems floating around where the amount of time taken to
>> grab timestamps like this is just terrible.
>
> I believe on most systems with modern Linux kernels gettimeofday and its
> ilk will be a vsyscall and nearly as fast as a regular function call.

AFAIK a vsyscall should be faster than a regular syscall. It does not need to switch to kernel space at all, it "just" reads the data from a shared page. The problem is that this is Linux-specific - for example FreeBSD does not have vsyscall at all (it's actually one of the Linux-isms mentioned here: http://wiki.freebsd.org/AvoidingLinuxisms). There's also something called the vDSO, which (among other things) uses vsyscall if available, or the best implementation available. So there are platforms that do not provide vsyscall, and in that case it'd be just as slow as a regular syscall :(

I wouldn't expect a patch that works fine on Linux but not on other platforms to be accepted, unless there's a compile-time configure switch (--with-timings) that allows disabling it.

Another option would be to reimplement the vsyscall, even on platforms that don't provide it. The principle is actually quite simple - allocate shared memory, store the current time there, and update it whenever a clock interrupt happens. This is basically what Greg suggested in one of the previous posts, where "regularly" means "on every interrupt". Greg was worried about the precision, but this should be just fine I guess. It's the precision you get on Linux, anyway ...

>> I recall a patch similar to this one was submitted by Greg Stark some
>> time ago. It used the info for different reasons--to try and figure out
>> whether reads were cached or not--but I believe it withered rather than
>> being implemented mainly because it ran into the same fundamental
>> roadblocks here. My memory could be wrong here, there were also concerns
>> about what the data would be used for.

The difficulty of distinguishing whether reads were cached or not is the price we pay for using the filesystem cache instead of managing our own. Not sure if this can be solved just by measuring the latency - with spinners it's quite easy, the differences are rather huge (and it's not difficult to derive that even from the pgbench log). But with SSDs, multiple tablespaces on different storage, etc. it gets much harder.

Tomas
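A minimal sketch of that do-it-yourself vsyscall idea (illustrative only; a real implementation would have a background process update the shared value on a timer, not a forked toy loop): one process periodically stores the current time into shared memory, and everyone else reads it with a plain load instead of making a clock call.

#include <stdio.h>
#include <stdint.h>
#include <stdatomic.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

static _Atomic uint64_t *shared_now_ns;

static uint64_t real_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    shared_now_ns = mmap(NULL, sizeof(*shared_now_ns),
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared_now_ns == MAP_FAILED)
        return 1;
    atomic_store(shared_now_ns, real_now_ns());

    if (fork() == 0)
    {
        /* "ticker": refresh the shared clock every millisecond */
        for (int i = 0; i < 100; i++)
        {
            atomic_store(shared_now_ns, real_now_ns());
            usleep(1000);
        }
        _exit(0);
    }

    /* reader: a cheap memory load, accurate to roughly one tick */
    for (int i = 0; i < 5; i++)
    {
        printf("approx now: %llu ns\n",
               (unsigned long long) atomic_load(shared_now_ns));
        usleep(20000);
    }
    wait(NULL);
    return 0;
}

The trade-off is exactly the one under discussion: readers pay a memory load instead of a clock call, but the precision is bounded by the tick interval.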
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>> The real problem with this whole area is that we know there are
>> systems floating around where the amount of time taken to grab timestamps
>> like this is just terrible.

> Assuming the feature is off by default (and I can't imagine we'd
> consider anything else), I don't see why that should be cause for
> concern. If the instrumentation creates too much system load, then
> don't use it: simple as that. A more interesting question is "how
> much load does this feature create even when it's turned off?".

Right. I see that the code already has a switch to skip the gettimeofday calls, so the objection is only problematic if the added overhead is significant even with the switch off. I would worry mainly about the added time/space to deal with the extra stats counters.

> The other big problem for a patch of this sort is that it would bloat
> the stats file.

Yes. Which begs the question of why we need to measure this per-table. I would think per-tablespace would be sufficient.

			regards, tom lane
On Sun, Nov 27, 2011 at 11:54:38PM -0800, Greg Smith wrote:
> On 11/27/2011 04:39 PM, Ants Aasma wrote:
>> On the AMD I saw about 3% performance drop with timing enabled. On the
>> Intel machine I couldn't measure any statistically significant change.
>
> Oh no, it's party pooper time again. Sorry I have to be the one to
> do it this round. The real problem with this whole area is that we
> know there are systems floating around where the amount of time
> taken to grab timestamps like this is just terrible. I've been
> annoyed enough by that problem to spend some time digging into why
> that is--seems to be a bunch of trivia around the multiple ways to
> collect time info on x86 systems--and after this CommitFest is over

Something good to know: in Linux the file

  /sys/devices/system/clocksource/clocksource0/current_clocksource

lists the current clock source, and

  /sys/devices/system/clocksource/clocksource0/available_clocksource

lists the available clock sources. You can switch between them, so you may be able to quantify the effects on a single machine.

Learned the hard way while tracking clock skew on a multicore system. The hpet may not be the fastest (that would be the CPU timer), but it's the fastest (IME) that gives guaranteed monotonic time.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
  -- Arthur Schopenhauer
"Tomas Vondra" <tv@fuzzy.cz> writes: > Another option would be to reimplement the vsyscall, even on platforms > that don't provide it. The principle is actually quite simple - allocate a > shared memory, store there a current time and update it whenever a clock > interrupt happens. This is basically what Greg suggested in one of the > previous posts, where "regularly" means "on every interrupt". Greg was > worried about the precision, but this should be just fine I guess. It's > the precision you get on Linux, anyway ... That sounds good for other interesting things, which entails being able to have timing information attached to the XID sequence. If we go this way, how far are we from having a ticker in PostgreSQL? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Nov 28, 2011, at 9:29 AM, Tomas Vondra wrote:
>>> I recall a patch similar to this one was submitted by Greg Stark some
>>> time ago. It used the info for different reasons--to try and figure out
>>> whether reads were cached or not--but I believe it withered rather than
>>> being implemented mainly because it ran into the same fundamental
>>> roadblocks here. My memory could be wrong here, there were also concerns
>>> about what the data would be used for.
>
> The difficulty of distinguishing whether reads were cached or not is
> the price we pay for using the filesystem cache instead of managing our
> own. Not sure if this can be solved just by measuring the latency - with
> spinners it's quite easy, the differences are rather huge (and it's not
> difficult to derive that even from the pgbench log). But with SSDs,
> multiple tablespaces on different storage, etc. it gets much harder.

True, but every use case for this information I can think of ultimately only cares about how long it took to perform some kind of I/O; it doesn't *really* care about whether it was cached. So in that context, we don't really care if SSDs are fast enough that they look like cache, because that means they're performing (essentially) the same as cache.

--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
On 28.11.2011 22:32, Dimitri Fontaine wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> Another option would be to reimplement the vsyscall, even on platforms
>> that don't provide it. The principle is actually quite simple - allocate
>> shared memory, store the current time there, and update it whenever a
>> clock interrupt happens. This is basically what Greg suggested in one of
>> the previous posts, where "regularly" means "on every interrupt". Greg
>> was worried about the precision, but this should be just fine I guess.
>> It's the precision you get on Linux, anyway ...
>
> That sounds good for other interesting things too, such as being able
> to have timing information attached to the XID sequence. If we go this
> way, how far are we from having a ticker in PostgreSQL?

I'm not sure. On Linux/x86 this is already done, but my knowledge of kernel development is rather limited, especially when it comes to other OSes and platforms. E.g. I'm not sure why it's not available in FreeBSD on x86; I guess it's rather "we don't want it" than "it's not possible."

In the Linux sources, the most interesting pieces are probably these:

1) arch/x86/include/asm/vgtod.h - that's the shared memory structure
2) arch/x86/kernel/vsyscall_64.c - this is how the memory is read (do_vgettimeofday)
3) arch/x86/kernel/vsyscall_64.c - this is how the memory is updated (update_vsyscall)
4) kernel/time/timekeeping.c - do_settimeofday (calls update_vsyscall)
5) drivers/rtc/class.c (and other) RTC drivers call do_settimeofday

Tomas
On 29.11.2011 02:14, Jim Nasby wrote:
> On Nov 28, 2011, at 9:29 AM, Tomas Vondra wrote:
>>>> I recall a patch similar to this one was submitted by Greg Stark some
>>>> time ago. It used the info for different reasons--to try and figure
>>>> out whether reads were cached or not--but I believe it withered
>>>> rather than being implemented mainly because it ran into the same
>>>> fundamental roadblocks here. My memory could be wrong here, there
>>>> were also concerns about what the data would be used for.
>>
>> The difficulty of distinguishing whether reads were cached or not is
>> the price we pay for using the filesystem cache instead of managing
>> our own. Not sure if this can be solved just by measuring the latency -
>> with spinners it's quite easy, the differences are rather huge (and
>> it's not difficult to derive that even from the pgbench log). But with
>> SSDs, multiple tablespaces on different storage, etc. it gets much
>> harder.
>
> True, but every use case for this information I can think of
> ultimately only cares about how long it took to perform some kind of
> I/O; it doesn't *really* care about whether it was cached. So in that
> context, we don't really care if SSDs are fast enough that they look
> like cache, because that means they're performing (essentially) the
> same as cache.

Yup, that's right. The wait times are generally much more interesting than the cached/not-cached ratio.

Tomas
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
> That sounds good for other interesting things too, such as being able
> to have timing information attached to the XID sequence. If we go this
> way, how far are we from having a ticker in PostgreSQL?

Those of us who are trying to get rid of idle-state process wakeups will protest any such thing with vigor.

			regards, tom lane
Sorry for taking so long to respond, had a pretty busy day at work. Anyway..

On Mon, Nov 28, 2011 at 9:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Oh no, it's party pooper time again. Sorry I have to be the one to do it
> this round. The real problem with this whole area is that we know there
> are systems floating around where the amount of time taken to grab
> timestamps like this is just terrible. I've been annoyed enough by that
> problem to spend some time digging into why that is--seems to be a bunch
> of trivia around the multiple ways to collect time info on x86
> systems--and after this CommitFest is over I was already hoping to dig
> through my notes and start quantifying that more. So you can't really
> prove the overhead of this approach is acceptable just by showing two
> examples; we need to find one of the really terrible clocks and test
> there to get a real feel for the worst case.

Sure, I know that the timing calls might be awfully slow. That's why I turned it off by default. I saw that track_functions was already using this, so I figured it was OK to have it potentially run very slowly.

> -Document the underlying problem and known workarounds, provide a way to
> test how bad the overhead is, and just throw our hands up and say
> "sorry, you just can't instrument like this" if someone has a slow
> system.

Some documentation about potential problems would definitely be good. Same goes for a test tool. ISTM that fast, accurate timing is just not possible on all supported platforms. That doesn't seem like a good enough justification to refuse implementing something useful for the majority of platforms that do support it, as long as it doesn't cause regressions for those that don't, or significant code complexity.

> -Have one of the PostgreSQL background processes keep track of a time
> estimate on its own, only periodically pausing to sync against the real
> time. Then most calls to gettimeofday() can use that value instead. I
> was thinking of that idea for slightly longer running things though; I
> doubt it can be made accurate enough to instrument buffer I/O.

This would limit it to those cases where hundreds of milliseconds of jitter or more don't matter all that much.

> And while I hate to kick off massive bike-shedding in your direction,
> I'm also afraid this area--collecting stats about how long individual
> operations take--will need a much wider-ranging approach than just
> looking at the buffer cache ones. If you step back and ask "what do
> people expect here?", there's a pretty large number who really want
> something like Oracle's v$session_wait and v$system_event interface for
> finding the underlying source of slow things. There's enough demand for
> that that EnterpriseDB has even done some work in this area too; what
> I've been told about it suggests the code isn't a great fit for
> contribution to community PostgreSQL though. Like I said, this area is
> really messy and hard to get right.

Yeah, something like that should probably be something to strive for. I'll ponder a bit more about resource and latency tracking in general. Maybe the question here should be about the cost/benefit ratio of having some utility now vs. maintaining/deprecating the user-visible interface when a more general framework turns up.

> Something more ambitious like the v$ stuff would also take care of what
> you're doing here; I'm not sure that what you've done helps build it
> though. Please don't take that personally.
> Part of one of my own instrumentation patches recently was rejected out
> of hand for the same reason, just not being general enough.

No problem, I understand that half-way solutions can be more trouble than they're worth. I actually built this to help with performance testing an application, and thought it would be an interesting experience to try to give the community back something.

On Mon, Nov 28, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
> I believe on most systems with modern Linux kernels gettimeofday and its
> ilk will be a vsyscall and nearly as fast as a regular function call.

clock_gettime() has been implemented via the vDSO since 2.6.23. gettimeofday() has been user-context callable since before git shows any history (2.6.12).

On Mon, Nov 28, 2011 at 5:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The other big problem for a patch of this sort is that it would bloat
>> the stats file.
>
> Yes. Which begs the question of why we need to measure this per-table.
> I would think per-tablespace would be sufficient.

Yeah, I figured this is something that should be discussed. I implemented per-table collection because I thought it might be useful for tools to pick it up and show a quick overview of which tables are causing the most I/O overhead for queries.

On Mon, Nov 28, 2011 at 8:10 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Something good to know: in Linux the file
> /sys/devices/system/clocksource/clocksource0/current_clocksource
> lists the current clock source, and
> /sys/devices/system/clocksource/clocksource0/available_clocksource
> lists the available clock sources. You can switch between them, so you
> may be able to quantify the effects on a single machine.
>
> Learned the hard way while tracking clock skew on a multicore system.
> The hpet may not be the fastest (that would be the CPU timer), but it's
> the fastest (IME) that gives guaranteed monotonic time.

The Linux kernel seems to go pretty far out of its way to ensure that the TSC (CPU timestamp counter) based clock source returns monotonic values, including actually testing that it does. [1] If the hardware doesn't support stable and consistent TSC values, the TSC isn't used as a clock source. Of course, trying to keep it monotonic doesn't mean succeeding. I thought about inserting a sanity check, but as the current instrumentation doesn't use one, and it would catch errors only in one direction, biasing the long-term average, I decided against it. Because this is non-essential instrumentation, I don't see an issue with it returning bogus information when the system clock is broken. At least it seems that no one has complained about the same issue in track_functions. The only complaint I found is that it's off by default.

On Mon, Nov 28, 2011 at 5:29 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Another option would be to reimplement the vsyscall, even on platforms
> that don't provide it. The principle is actually quite simple - allocate
> shared memory, store the current time there, and update it whenever a
> clock interrupt happens. This is basically what Greg suggested in one of
> the previous posts, where "regularly" means "on every interrupt". Greg
> was worried about the precision, but this should be just fine I guess.
> It's the precision you get on Linux, anyway ...

On modern platforms you actually really do get microsecond precision. Even more: if you use clock_gettime(CLOCK_MONOTONIC), you get nanosecond precision and avoid issues with someone changing the system time while you're timing.
This precision does require OS and hardware cooperation, because of CPU offsets, TSCs changing frequencies, stopping, etc.

--
Ants Aasma

[1] https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc_sync.c#L143
On 11/28/2011 05:51 AM, Robert Haas wrote:
> On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>> The real problem with this whole area is that we know there are
>> systems floating around where the amount of time taken to grab
>> timestamps like this is just terrible.
> Assuming the feature is off by default (and I can't imagine we'd
> consider anything else), I don't see why that should be cause for
> concern. If the instrumentation creates too much system load, then
> don't use it: simple as that.

It's not quite that simple though. Releasing a performance measurement feature that can itself perform terribly under undocumented conditions has a wider downside than that. Consider that people aren't going to turn it on until they are already overloaded. If that has the potential to completely tank performance, we'd better make sure that area is at least explored usefully first; the minimum diligence would be to document that fact and make suggestions for avoiding or testing it.

Instrumentation that can itself become a performance problem is an advocacy problem waiting to happen. As I write this I'm picturing such an encounter resulting in an angry blog post about how this proves PostgreSQL isn't usable for serious systems, because someone sees massive overhead after turning this on.

Right now the primary exposure to this class of issue is EXPLAIN ANALYZE. When I was working on my book, I went out of my way to find a worst case for that[1], and that turned out to be a query that went from 7.994ms to 69.837ms when instrumented. I've been meaning to investigate what was going on there since finding it. The fact that we already have one such problem exposed already worries me; I'd really prefer not to have two.

[1] (Dell Store 2 schema, query was "SELECT count(*) FROM customers;")
Greg Smith <greg@2ndQuadrant.com> writes:
> On 11/28/2011 05:51 AM, Robert Haas wrote:
>> Assuming the feature is off by default (and I can't imagine we'd
>> consider anything else), I don't see why that should be cause for
>> concern. If the instrumentation creates too much system load, then
>> don't use it: simple as that.

> It's not quite that simple though. Releasing a performance measurement
> feature that can itself perform terribly under undocumented conditions
> has a wider downside than that.

Yeah, that's a good point, and the machines on which this would suck are exactly the ones where EXPLAIN ANALYZE creates very large overhead. We don't seem to see a lot of complaints about that anymore, but we do still see some ... and yes, it's documented that EXPLAIN ANALYZE can add significant overhead, but that doesn't stop the questions.

> Instrumentation that can itself become a performance problem is an
> advocacy problem waiting to happen. As I write this I'm picturing such
> an encounter resulting in an angry blog post about how this proves
> PostgreSQL isn't usable for serious systems, because someone sees
> massive overhead after turning this on.

Of course, the rejoinder could be that if you see that, you're not testing on serious hardware. But still, I take your point.

> Right now the primary exposure to this class of issue is EXPLAIN
> ANALYZE. When I was working on my book, I went out of my way to find a
> worst case for that[1],
> [1] (Dell Store 2 schema, query was "SELECT count(*) FROM customers;")

That's pretty meaningless without saying what sort of clock hardware was on the machine...

			regards, tom lane
Here's the second version of the I/O timings patch. Changes from the previous version:

* Rebased against master.
* Added the missing pg_stat_statements upgrade script that I accidentally left out of the previous version.
* Added a tool to test timing overhead under contrib/pg_test_timing.

I hope that having a tool to measure the overhead and check the sanity of clock sources is enough to answer the worries about the potential performance hit. We could also check that the clock source is fast enough on startup or when the GUC is changed, but that seems a bit too much and leaves open the question of what is fast enough.

About the issues with stats file bloat - if it really is a blocker, I can easily rip out the per-table or even per-database stats fields. The patch is plenty useful without them. Still, it seemed like a useful tool for overworked DBAs with a limited amount of SSD space available to easily figure out which tables and indexes would benefit most from fast storage.

--
Ants Aasma
Attachment
On 01/15/2012 05:14 PM, Ants Aasma wrote:
> I hope that having a tool to measure the overhead and check the sanity
> of clock sources is enough to answer the worries about the potential
> performance hit. We could also check that the clock source is fast
> enough on startup or when the GUC is changed, but that seems a bit too
> much and leaves open the question of what is fast enough.

I've been thinking along those same lines--check at startup, provide some guidance on the general range of what's considered fast vs. slow in both the code and documentation. What I'm hoping to do here is split your patch in half and work on the pg_test_timing contrib utility first. That part answers some overdue questions about when EXPLAIN ANALYZE can be expected to add a lot of overhead, which means it's useful all on its own. I'd like to see that utility go into 9.2, along with a new documentation section covering that topic. I'll write the new documentation bit.

As far as the buffer timing goes, there is a lot of low-level timing information I'd like to see the database collect. To pick a second example with very similar mechanics, I'd like to know which queries spend a lot of their time waiting on locks. The subset of time a statement spends waiting just for commit-related things is a third. The industry-standard term for these is wait events, as seen in Oracle, MySQL, MS SQL Server, etc. That's so standard I don't see an intellectual property issue with PostgreSQL using the same term. Talk with a random person who is converting from Oracle to PostgreSQL and ask them about their performance concerns: at least 3/4 of those conversations I have mention being nervous about not having wait event data.

Right now, I feel the biggest hurdle to performance tuning PostgreSQL is not having good enough built-in query log analysis tools. If the pg_stat_statements normalization upgrade in the CF queue is committed, that's enough to make me bump that to "solved well enough". After clearing that hurdle, figuring out how to log, analyze, and manage storage of wait events is the next biggest missing piece. One of my top goals for 9.3 was to make sure that happened.

I don't think the long-term answer for how to manage wait event data is to collect it as part of pg_stat_statements though. But I don't have a good alternate proposal, while you've submitted a patch that actually does something useful right now. I'm going to think some more about how to reconcile all that. There is an intermediate point to consider as well, which is just committing something that adjusts the core code to make the buffer wait event data available. pg_stat_statements is easy enough to continue work on outside of core. I could see a path where that happens in parallel with adding a better core wait event infrastructure, just to get the initial buffer wait info into people's hands earlier.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Attached is the pg_test_timing utility portion of this submission, broken out into its own patch. It's a contrib module modeled on pg_test_fsync. The documentation is still a bit rough; I'm not done with that yet. I have included an example of good timing results, switching to a bad clock source, and the resulting bad results.

Code review found some formatting things to nitpick that I've already fixed: non-standard brace locations and not including enough spaces in expressions were the main two.

This is now referenced by the existing cryptic documentation comment around EXPLAIN ANALYZE, which says that overhead can be high because gettimeofday is slow on some systems. Since this utility measures that directly, I think it's a clear win to include it just for that purpose. The fact that there are more places coming where timing overhead matters is also true. But this existing one is already bad enough to justify shipping something to help measure/manage it, in my mind.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Attachment
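For a feel of what such a measurement involves, here is a rough standalone sketch of a timing-overhead loop in the spirit of the utility (illustrative assumptions only; the actual contrib/pg_test_timing code differs in its reporting and histogram details): call the clock in a tight loop, then report the average cost per call and the worst gap seen.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define LOOPS 1000000L

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    uint64_t prev = now_ns();
    uint64_t total = 0, max_delta = 0, backwards = 0;

    for (long i = 0; i < LOOPS; i++)
    {
        uint64_t cur = now_ns();

        if (cur < prev)
            backwards++;             /* should never happen with MONOTONIC */
        else
        {
            uint64_t d = cur - prev;
            total += d;
            if (d > max_delta)
                max_delta = d;
        }
        prev = cur;
    }
    printf("avg per call: %.1f ns, max delta: %llu ns, backwards: %llu\n",
           (double) total / LOOPS, (unsigned long long) max_delta,
           (unsigned long long) backwards);
    return 0;
}

On a machine with a fast clock source the average lands in the tens of nanoseconds; on the slow systems discussed earlier it can be an order of magnitude or more worse, which is exactly what the utility is meant to expose.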
Attached are updated versions of this feature without the pg_test_timing tool part, since I broke that out into another discussion thread. I've split the part that updates pg_stat_statements out from the main feature too; a separate patch is attached here (but I'm not reviewing that yet). Lots of bitrot since this was submitted, and yes, I noticed that I've almost recreated earlier versions of this patch--by splitting off the parts that were developed later.

Earlier discussion of this got sidetracked on a few things, partly my fault. It's worth taking a look at what this provides before judging it too much. It can demo well. The stated purpose is helping figure out what relations are gobbling up the most access time, presumably to optimize them and/or the storage they are on. "What do I put onto SSD" is surely a popular request nowadays.

To check suitability for that, I decided to run the standard pgbench test and see what it would show lots of time being consumed by. Any answer other than "pgbench_accounts, and to a lesser extent its index" is a failing grade. I started with a clean database and OS cache so I'd get real read timings:

$ psql -d pgbench -x -c "select relname, heap_blks_read, heap_blks_hit, heap_blks_time, idx_blks_read, idx_blks_hit, idx_blks_time from pg_statio_user_tables where idx_blks_read > 0 order by heap_blks_time desc"

relname        | pgbench_accounts
heap_blks_read | 7879
heap_blks_hit  | 43837
heap_blks_time | 151770
idx_blks_read  | 7503
idx_blks_hit   | 60484
idx_blks_time  | 70968

relname        | pgbench_tellers
heap_blks_read | 19
heap_blks_hit  | 15856
heap_blks_time | 105
idx_blks_read  | 11
idx_blks_hit   | 15745
idx_blks_time  | 62

relname        | pgbench_branches
heap_blks_read | 11
heap_blks_hit  | 32333
heap_blks_time | 77
idx_blks_read  | 2
idx_blks_hit   | 0
idx_blks_time  | 9

Now, the first critical question to ask is "what additional information is this providing above the existing counters?" After all, it's possible to tell pgbench_accounts is the hotspot just from comparing heap_blks_read, right? To really be useful, this would need to make it obvious that reads from pgbench_accounts are slower than the other two, because it's bigger and requires more seeking around to populate. That should show up if we compute time-per-read numbers:

$ psql -d pgbench -x -c "select relname, 1.0 * heap_blks_time / heap_blks_read as time_per_read, 1.0 * idx_blks_time / idx_blks_read as time_per_idx_read from pg_statio_user_tables where idx_blks_read > 0 order by heap_blks_time desc"

relname           | pgbench_accounts
time_per_read     | 19.2625967762406397
time_per_idx_read | 9.4586165533786485

relname           | pgbench_tellers
time_per_read     | 5.5263157894736842
time_per_idx_read | 5.6363636363636364

relname           | pgbench_branches
time_per_read     | 7.0000000000000000
time_per_idx_read | 4.5000000000000000

This run looks useful at providing the data wished for--that read times are slower per capita from the accounts table. The first time I tried this I got a bizarrely high number for pgbench_branches.heap_blks_time; I'm not sure how reliable this is yet. One problem that might be easy to fix is that the write timing info doesn't show up in any of these system views, only in EXPLAIN and the statement-level ones.

I still think a full wait timing interface is the right long-term direction here. It's hard to reject this idea when it seems to be working right now though, while more comprehensive wait storage is still at least a release off. Opinions welcome; I'm still juggling this around now that I have it working again.
Some implementation notes. This currently fails regression test create_function_3; I haven't looked into why yet. I've confirmed that on a system where timing is cheap, this is cheap too. On something touching real data, not just a synthetic thing moving memory around, Ants couldn't measure it on a server similar to mine; I can't either. Yes, this is going to gobble up more room for statistics.

The track_iotiming GUC seems to work as expected. Off by default, you can turn it on in a session and see that session's work get timed, and it toggles on a config reload. Everything needed to only turn it on selectively; the main penalty you pay all the time is the stats bloat.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Attachment
On Wed, Feb 22, 2012 at 4:43 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Attached are updated versions of this feature without the pg_test_timing
> tool part, since I broke that out into another discussion thread. I've
> split the part that updates pg_stat_statements out from the main feature
> too; a separate patch is attached here (but I'm not reviewing that yet).
> Lots of bitrot since this was submitted, and yes, I noticed that I've
> almost recreated earlier versions of this patch--by splitting off the
> parts that were developed later.

Thanks for the review and splitting. Sorry I didn't fix up the bitrot myself.

> Earlier discussion of this got sidetracked on a few things, partly my
> fault. It's worth taking a look at what this provides before judging it
> too much. It can demo well.
>
> The stated purpose is helping figure out what relations are gobbling up
> the most access time, presumably to optimize them and/or the storage
> they are on. "What do I put onto SSD" is surely a popular request
> nowadays.

I should have stated the purpose more clearly. The original reason for developing this patch was to figure out "what queries are taking the most time and why", specifically in the case where OS memory is a lot larger than shared_buffers. Basically the following query, to get a quick overview of where the bottlenecks are:

SELECT query, total_time,
       (time_read + time_write) / total_time AS io_fraction
  FROM pg_stat_statements
 ORDER BY total_time DESC LIMIT 20;

This of course hugely benefits from Peter's pg_stat_statements normalization patch. Tracking timings per relation was actually an afterthought.

> Now, the first critical question to ask is "what additional information
> is this providing above the existing counters?" After all, it's possible
> to tell pgbench_accounts is the hotspot just from comparing
> heap_blks_read, right?

Like I said above, I find it mostly useful to see what is missing the OS cache. With memory being as cheap as it is, a reasonably priced server can have 128GB of memory, while the max recommended value for shared_buffers is 8GB. It's quite likely to have tables that fit into the OS cache but not into shared_buffers, and it's not trivial to figure out which those are.

> This run looks useful at providing the data wished for--that read times
> are slower per capita from the accounts table. The first time I tried
> this I got a bizarrely high number for pgbench_branches.heap_blks_time;
> I'm not sure how reliable this is yet. One problem that might be easy to
> fix is that the write timing info doesn't show up in any of these system
> views, only in EXPLAIN and the statement-level ones.

I'm not sure about the source of the huge number; it might be instability in the clock source. Have you tried running the monotonicity check for a longer period while the system is under load? Another issue with the current timing code is that gettimeofday isn't guaranteed to be monotonic anyway; things like NTP adjustments can make time go backwards. clock_gettime with CLOCK_MONOTONIC_RAW would be better, but that's Linux-specific :(

The reason why I didn't add write timings to relation stats is that I couldn't figure out what the semantics should be. It could be either "time spent waiting for this relation's blocks to be written out" or "time spent waiting for some other relation's blocks to be written out to free space for this relation's block", or maybe distribute the cost; background writes could be included or excluded.
Writes usually return quickly, unless lots of possibly unrelated writes have dirtied enough of the OS cache, etc. I figured that whatever choices I made, they wouldn't really help anyone diagnose anything. Having global write timings in pg_stat_bgwriter might be useful, but I feel that is something for another patch.

> I still think a full wait timing interface is the right long-term
> direction here. It's hard to reject this idea when it seems to be
> working right now though, while more comprehensive wait storage is still
> at least a release off. Opinions welcome; I'm still juggling this around
> now that I have it working again.

I agree that a wait timing interface is the right direction. I have thought a bit about it and could share some ideas - maybe I should create a wiki page where the general design could be hashed out? Anyway, the user-visible information from this patch should be trivial to extract from a general wait timing framework. Pushing my own agenda a bit - having this patch in the current release would help to get some field experience on any issues surrounding timing :)

> Some implementation notes. This currently fails regression test
> create_function_3; I haven't looked into why yet.

I'll take a look at it.

Thanks again.

--
Ants Aasma
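The long-running monotonicity check suggested above might look roughly like this as a standalone loop (illustrative only): repeatedly read gettimeofday() under load and count how often the wall clock steps backwards, e.g. due to NTP adjustments.

#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

int main(void)
{
    struct timeval tv;
    uint64_t prev = 0, backwards = 0, samples = 0;

    for (;;)                         /* run until interrupted */
    {
        gettimeofday(&tv, NULL);
        uint64_t cur = (uint64_t) tv.tv_sec * 1000000 + tv.tv_usec;

        if (cur < prev)
            backwards++;             /* wall clock stepped backwards */
        prev = cur;

        if (++samples % 10000000ULL == 0)
            printf("%llu samples, %llu backwards steps\n",
                   (unsigned long long) samples,
                   (unsigned long long) backwards);
    }
    return 0;
}

Any nonzero backwards count during a long run would show up as negative (or wrapped) intervals in naive gettimeofday-based instrumentation, which is the argument for CLOCK_MONOTONIC_RAW where it's available.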
On Wed, Feb 22, 2012 at 6:35 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> Some implementation notes. This currently fails regression test
>> create_function_3; I haven't looked into why yet.
>
> I'll take a look at it.

The failure was due to the leakproof changes to pg_proc. The attached patches are adjusted accordingly and rebased over Robert's blocks-dirtied patch.

Cheers,
Ants Aasma
Attachment
On 2/22/12 10:35 AM, Ants Aasma wrote:
> The reason why I didn't add write timings to relation stats is that I
> couldn't figure out what the semantics should be. It could be either
> "time spent waiting for this relation's blocks to be written out" or
> "time spent waiting for some other relation's blocks to be written out
> to free space for this relation's block", or maybe distribute the cost;
> background writes could be included or excluded. Writes usually return
> quickly, unless lots of possibly unrelated writes have dirtied enough
> of the OS cache, etc. I figured that whatever choices I made, they
> wouldn't really help anyone diagnose anything. Having global write
> timings in pg_stat_bgwriter might be useful, but I feel that is
> something for another patch.

I know it's not perfect, but I would argue that what users care about most of the time is time taken up in actual backends. So I wouldn't worry about the bgwriter. I also wouldn't worry about time spent waiting to find a buffer at this point (see below).

>> I still think a full wait timing interface is the right long-term
>> direction here. It's hard to reject this idea when it seems to be
>> working right now though, while more comprehensive wait storage is
>> still at least a release off. Opinions welcome; I'm still juggling this
>> around now that I have it working again.
>
> I agree that a wait timing interface is the right direction. I have
> thought a bit about it and could share some ideas - maybe I should
> create a wiki page where the general design could be hashed out?

Yes, I think a wiki page would be a good place to start. As you showed in your previous question about writes, there's a *lot* of places where timing info would be useful to us.

--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
On Fri, Feb 24, 2012 at 2:23 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Wed, Feb 22, 2012 at 6:35 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>>> Some implementation notes. This currently fails regression test
>>> create_function_3; I haven't looked into why yet.
>>
>> I'll take a look at it.
>
> The failure was due to the leakproof changes to pg_proc. The attached
> patches are adjusted accordingly and rebased over Robert's
> blocks-dirtied patch.

This seems to have bitrotted again. :-(

Can you please rebase again?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 21, 2012 at 5:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This seems to have bitrotted again. :-(
>
> Can you please rebase again?

Rebased.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Attachment
On Thu, Mar 22, 2012 at 7:38 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Wed, Mar 21, 2012 at 5:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This seems to have bitrotted again. :-(
>>
>> Can you please rebase again?
>
> Rebased.

I've committed the core of this. I left out the stats collector stuff, because it's still per-table and I think perhaps we should back off to just per-database. I changed it so that it does not conflate wait time with I/O time. Maybe there should be a separate method of measuring wait time, but I don't think it's a good idea for the per-backend stats to measure a different thing than what gets reported up to the stats collector - we should have ONE definition of each counter. I also tweaked the EXPLAIN output format a bit, and the docs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 27, 2012 at 2:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 22, 2012 at 7:38 PM, Ants Aasma <ants@cybertec.at> wrote:
>> On Wed, Mar 21, 2012 at 5:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> This seems to have bitrotted again. :-(
>>>
>>> Can you please rebase again?
>>
>> Rebased.
>
> I've committed the core of this. I left out the stats collector
> stuff, because it's still per-table and I think perhaps we should back
> off to just per-database. I changed it so that it does not conflate
> wait time with I/O time. Maybe there should be a separate method of
> measuring wait time, but I don't think it's a good idea for the
> per-backend stats to measure a different thing than what gets reported
> up to the stats collector - we should have ONE definition of each
> counter. I also tweaked the EXPLAIN output format a bit, and the docs.

And I've now committed the pg_stat_statements code as well, hopefully not stomping too badly on what Tom's apparently in the midst of doing with Peter's pg_stat_statements patch. I committed this almost exactly as submitted; just a minor code style correction and a few documentation enhancements.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 27, 2012 at 9:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've committed the core of this. I left out the stats collector
> stuff, because it's still per-table and I think perhaps we should back
> off to just per-database. I changed it so that it does not conflate
> wait time with I/O time. Maybe there should be a separate method of
> measuring wait time, but I don't think it's a good idea for the
> per-backend stats to measure a different thing than what gets reported
> up to the stats collector - we should have ONE definition of each
> counter. I also tweaked the EXPLAIN output format a bit, and the docs.

Thank you for working on this.

Taking a fresh look at it, I agree with you that conflating waits for backend-local I/O with I/O performed by some other backend might paint us into a corner. For most workloads the wait for I/O performed by others should be the minority anyway.

I won't really miss the per-table stats. But if the pg_stat_statements normalisation patch gets committed, it would be really neat to also have I/O waits there. It would be much easier to quickly diagnose "what is causing all this I/O" issues.

Thanks again,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Tue, Mar 27, 2012 at 3:22 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Tue, Mar 27, 2012 at 9:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I've committed the core of this. I left out the stats collector
>> stuff, because it's still per-table and I think perhaps we should back
>> off to just per-database. I changed it so that it does not conflate
>> wait time with I/O time. Maybe there should be a separate method of
>> measuring wait time, but I don't think it's a good idea for the
>> per-backend stats to measure a different thing than what gets reported
>> up to the stats collector - we should have ONE definition of each
>> counter. I also tweaked the EXPLAIN output format a bit, and the docs.
>
> Thank you for working on this.
>
> Taking a fresh look at it, I agree with you that conflating waits for
> backend-local I/O with I/O performed by some other backend might paint
> us into a corner. For most workloads the wait for I/O performed by
> others should be the minority anyway.
>
> I won't really miss the per-table stats. But if the pg_stat_statements
> normalisation patch gets committed, it would be really neat to also
> have I/O waits there. It would be much easier to quickly diagnose "what
> is causing all this I/O" issues.

So, the pg_stat_statements part of this is committed now. Do you want to prepare a revised patch to add per-database counters to the statistics collector? I think that might be a good idea...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 27, 2012 at 7:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've committed the core of this. I left out the stats collector
> stuff, because it's still per-table and I think perhaps we should back
> off to just per-database. I changed it so that it does not conflate
> wait time with I/O time. Maybe there should be a separate method of
> measuring wait time, but I don't think it's a good idea for the
> per-backend stats to measure a different thing than what gets reported
> up to the stats collector - we should have ONE definition of each
> counter. I also tweaked the EXPLAIN output format a bit, and the docs.

Maybe I missed some earlier discussion -- I've been having trouble keeping up with the lists. But was there discussion of why this is a GUC? Why not just another parameter to EXPLAIN like the others? i.e.

EXPLAIN (ANALYZE, BUFFERS, IOTIMING)

--
greg
On Tue, Mar 27, 2012 at 8:41 PM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Mar 27, 2012 at 7:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I've committed the core of this. I left out the stats collector
>> stuff, because it's still per-table and I think perhaps we should back
>> off to just per-database. I changed it so that it does not conflate
>> wait time with I/O time. Maybe there should be a separate method of
>> measuring wait time, but I don't think it's a good idea for the
>> per-backend stats to measure a different thing than what gets reported
>> up to the stats collector - we should have ONE definition of each
>> counter. I also tweaked the EXPLAIN output format a bit, and the docs.
>
> Maybe I missed some earlier discussion -- I've been having trouble
> keeping up with the lists.
>
> But was there discussion of why this is a GUC? Why not just another
> parameter to EXPLAIN like the others?
> i.e. EXPLAIN (ANALYZE, BUFFERS, IOTIMING)

Because you want to be able to expose the data even for queries that aren't explained. Right now, you can do that with pg_stat_statements; the original patch also had per-table counters, but I didn't commit that part due to some concerns about stats bloat.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 27, 2012 at 8:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 27, 2012 at 3:22 PM, Ants Aasma <ants@cybertec.at> wrote:
>> On Tue, Mar 27, 2012 at 9:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I've committed the core of this. I left out the stats collector
>>> stuff, because it's still per-table and I think perhaps we should back
>>> off to just per-database. I changed it so that it does not conflate
>>> wait time with I/O time. Maybe there should be a separate method of
>>> measuring wait time, but I don't think it's a good idea for the
>>> per-backend stats to measure a different thing than what gets reported
>>> up to the stats collector - we should have ONE definition of each
>>> counter. I also tweaked the EXPLAIN output format a bit, and the docs.
>>
>> Thank you for working on this.
>>
>> Taking a fresh look at it, I agree with you that conflating waits for
>> backend-local I/O with I/O performed by some other backend might paint
>> us into a corner. For most workloads the wait for I/O performed by
>> others should be the minority anyway.
>>
>> I won't really miss the per-table stats. But if the pg_stat_statements
>> normalisation patch gets committed, it would be really neat to also
>> have I/O waits there. It would be much easier to quickly diagnose "what
>> is causing all this I/O" issues.
>
> So, the pg_stat_statements part of this is committed now. Do you want
> to prepare a revised patch to add per-database counters to the
> statistics collector? I think that might be a good idea...

Hearing nothing further on this point, I went and did it myself. In the process I noticed a couple of things that I think we need to fix.

Currently, the statistics views that include timing information display it in milliseconds (see pg_stat_user_functions), while the underlying SQL-callable functions return the data in microseconds. Whether or not this was a good design decision, we're stuck with it now; the documentation implies that the views and functions use the same units. I'll go fix that next.

Meanwhile, pg_stat_statements converts the same data to seconds but makes it a double rather than a bigint. I think that was a bad idea, and we should make it consistent - use a bigint and milliseconds - while we still can.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 5, 2012 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Meanwhile, pg_stat_statements converts the same data to seconds but > makes it a double rather than a bigint. I think that was a bad idea > and we should make it consistently use a bigint and milliseconds while > we still can. Hmm. So, on further review, this is not as simple as it seems. I'd like some input from other people on what we should do here. pg_stat_statements has long exposed a column called "total_time" as a float8. It now exposes columns "time_read" and "time_write" which are actually measuring the time spent reading and writing data blocks, and those are also exposed as a float8; all these count seconds. Meanwhile, all times exposed by the stats collector (including the new and analogous pg_stat_database.block_read_time and pg_stat_database.block_write_time columns) are exposed as int8; these count milliseconds. So, should we make the new columns exposed by pg_stat_statements use milliseconds, so that the block read/write timings are everywhere in milliseconds, or should we keep them as a float8, so that all the times exposed by pg_stat_statements use float8? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
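Until that is settled, lining the two up requires a conversion at query time; a sketch assuming the column definitions as described above (these columns were still candidates for renaming at this point):

    -- time_read is float8 seconds; block_read_time is int8 milliseconds
    SELECT s.query,
           s.time_read * 1000.0 AS stmt_read_ms,
           d.block_read_time    AS db_read_ms
    FROM pg_stat_statements s
    CROSS JOIN pg_stat_database d
    WHERE d.datname = current_database();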
Robert Haas <robertmhaas@gmail.com> writes: > Hmm. So, on further review, this is not as simple as it seems. I'd > like some input from other people on what we should do here. > pg_stat_statements has long exposed a column called "total_time" as a > float8. It now exposes columns "time_read" and "time_write" which are > actually measuring the time spent reading and writing data blocks, and > those are also exposed as a float8; all these count seconds. > Meanwhile, all times exposed by the stats collector (including the new > and analogous pg_stat_database.block_read_time and > pg_stat_database.block_write_time columns) are exposed as int8; these > count milliseconds. > So, should we make the new columns exposed by pg_stat_statements use > milliseconds, so that the block read/write timings are everywhere in > milliseconds, or should we keep them as a float8, so that all the > times exposed by pg_stat_statements use float8? Given that we've whacked pg_stat_statements' behavior around rather thoroughly in this release, maybe we could get away with redefining total_time as being measured in msec rather than sec, thereby aligning units as msec across the board. It's arguably a smaller deal than the way we've redefined what the query column contains... float8 vs int8 is a distinct issue, and probably one that is not as much of an impact on clients if we change it. It is not hard to predict that somebody will eventually want sub-msec resolution on these things, which would suggest that float8 would be the better idea. But perhaps we could leave that change for a future release. regards, tom lane
On 10 April 2012 14:33, Robert Haas <robertmhaas@gmail.com> wrote: > So, should we make the new columns exposed by pg_stat_statements use > milliseconds, so that the block read/write timings are everywhere in > milliseconds, or should we keep them as a float8, so that all the > times exposed by pg_stat_statements use float8? I believe that we should keep them as float8, on the basis that a user is more likely to generalise from total_time's format (when writing a script to query the view or whatever) than from that of pg_stat_database. A part of me would like to change the view definitions so that all the columns are strongly typed (i.e. all these values would be intervals). I realise that that isn't practical though. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Apr 10, 2012 at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Hmm. So, on further review, this is not as simple as it seems. I'd >> like some input from other people on what we should do here. > >> pg_stat_statements has long exposed a column called "total_time" as a >> float8. It now exposes columns "time_read" and "time_write" which are >> actually measuring the time spent reading and writing data blocks, and >> those are also exposed as a float8; all these count seconds. > >> Meanwhile, all times exposed by the stats collector (including the new >> and analogous pg_stat_database.block_read_time and >> pg_stat_database.block_write_time columns) are exposed as int8; these >> count milliseconds. > >> So, should we make the new columns exposed by pg_stat_statements use >> milliseconds, so that the block read/write timings are everywhere in >> milliseconds, or should we keep them as a float8, so that all the >> times exposed by pg_stat_statements use float8? > > Given that we've whacked pg_stat_statements' behavior around rather > thoroughly in this release, maybe we could get away with redefining > total_time as being measured in msec rather than sec, thereby aligning > units as msec across the board. It's arguably a smaller deal than the > way we've redefined what the query column contains... Retyping columns is an awfully good way to produce grumpy users. Then again, if we're going to do it, it would certainly be better to do it now rather than later, because right now I'm guessing pg_stat_statements is a lot less heavily used than it will be after 9.2 hits shelves. > float8 vs int8 is a distinct issue, and probably one that is not as > much of an impact on clients if we change it. It is not hard to predict > that somebody will eventually want sub-msec resolution on these things, > which would suggest that float8 would be the better idea. But perhaps > we could leave that change for a future release. Well, internally, the I/O timing stuff as well as the function timing stuff use microseconds, and the SQL functions expose it as microseconds, but then the view divides by 1000 to convert to milliseconds. I made the I/O timing stuff do it that way because that's how the function timing stuff does it, but it does seem a little random. One thing in its favor is that it provides a way for users to get this if they want it, without screwing readability for the vast majority who don't care. On the flip side, the new checkpoint timing stuff is in milliseconds all the way through, though it seems vanishingly unlikely that anyone needs more resolution in that case. We have lots of other things in milliseconds, too. No matter what we end up doing here it will be consistent with something; I am reminded of the phrase "the good thing about standards is that there are so many to choose from...". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
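Summarising the conventions named in this exchange (a snapshot of the discussion as it stood, not of the final committed state):

    counter                                     type     unit
    ------------------------------------------  -------  ------------
    pg_stat_statements.total_time               float8   seconds
    pg_stat_statements.time_read / time_write   float8   seconds
    pg_stat_database.block_read/write_time      int8     milliseconds
    underlying SQL-callable timing functions    int8     microseconds
    checkpoint timing counters                  --       milliseconds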
Robert Haas <robertmhaas@gmail.com> writes: > No matter what we end up doing here it will be consistent with > something; I am reminded of the phrase "the good thing about standards > is that there are so many to choose from...". Well, FWIW I vote for making the new columns be float8 msec. If you don't want to change total_time to match, I guess there's no law that says it *has* to be consistent ... regards, tom lane
On Tue, Apr 10, 2012 at 17:58, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 10, 2012 at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Hmm. So, on further review, this is not as simple as it seems. I'd >>> like some input from other people on what we should do here. >> >>> pg_stat_statements has long exposed a column called "total_time" as a >>> float8. It now exposes columns "time_read" and "time_write" which are >>> actually measuring the time spent reading and writing data blocks, and >>> those are also exposed as a float8; all these count seconds. >> >>> Meanwhile, all times exposed by the stats collector (including the new >>> and analogous pg_stat_database.block_read_time and >>> pg_stat_database.block_write_time columns) are exposed as int8; these >>> count milliseconds. >> >>> So, should we make the new columns exposed by pg_stat_statements use >>> milliseconds, so that the block read/write timings are everywhere in >>> milliseconds, or should we keep them as a float8, so that all the >>> times exposed by pg_stat_statements use float8? >> >> Given that we've whacked pg_stat_statements' behavior around rather >> thoroughly in this release, maybe we could get away with redefining >> total_time as being measured in msec rather than sec, thereby aligning >> units as msec across the board. It's arguably a smaller deal than the >> way we've redefined what the query column contains... > > Retyping columns is an awfully good way to produce grumpy users. Then > again, if we're going to do it, it would certainly be better to do it > now rather than later, because right now I'm guessing > pg_stat_statements is a lot less heavily used than it will be after > 9.2 hits shelves. Agreed. It's better if we can also change the name of it - provided we can come up with a reasonable new name. Then people's applications will break *visibly*, which is a lot better than breaking invisibly. (This is the main reason why we renamed current_query in pg_stat_activity..) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Tue, Apr 10, 2012 at 17:58, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Apr 10, 2012 at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Given that we've whacked pg_stat_statements' behavior around rather >>> thoroughly in this release, maybe we could get away with redefining >>> total_time as being measured in msec rather than sec, thereby aligning >>> units as msec across the board. It's arguably a smaller deal than the >>> way we've redefined what the query column contains... >> >> Retyping columns is an awfully good way to produce grumpy users. Then >> again, if we're going to do it, it would certainly be better to do it >> now rather than later, because right now I'm guessing >> pg_stat_statements is a lot less heavily used than it will be after >> 9.2 hits shelves. > Agreed. It's better if we can also change the name of it - provided we > can come up with a reasonable new name. Then people's applications will > break *visibly*, which is a lot better than breaking invisibly. (This > is the main reason why we renamed current_query in pg_stat_activity..) That might be overkill. Changing the column name will definitely break anything more specific than "select * from pg_stat_statements". However, it's less clear that changing the units in which the column is expressed will break things. It seems likely to me that nobody out there is doing anything much more sophisticated than sorting by the column, and that's still going to work the same. regards, tom lane
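For illustration, the kind of query Tom has in mind survives a unit change untouched, since only the ordering matters:

    -- top statements by cumulative time; works the same whether
    -- total_time counts seconds or milliseconds
    SELECT query, calls, total_time
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;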
On Tue, Apr 10, 2012 at 18:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Magnus Hagander <magnus@hagander.net> writes: >> On Tue, Apr 10, 2012 at 17:58, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Tue, Apr 10, 2012 at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Given that we've whacked pg_stat_statements' behavior around rather >>>> thoroughly in this release, maybe we could get away with redefining >>>> total_time as being measured in msec rather than sec, thereby aligning >>>> units as msec across the board. It's arguably a smaller deal than the >>>> way we've redefined what the query column contains... >>> >>> Retyping columns is an awfully good way to produce grumpy users. Then >>> again, if we're going to do it, it would certainly be better to do it >>> now rather than later, because right now I'm guessing >>> pg_stat_statements is a lot less heavily used than it will be after >>> 9.2 hits shelves. > >> Agreed. It's better if we can also change the name of it - provided we >> can come up with a reasonable new name. Then people's applications will >> break *visibly*, which is a lot better than breaking invisibly. (This >> is the main reason why we renamed current_query in pg_stat_activity..) > > That might be overkill. Changing the column name will definitely break > anything more specific than "select * from pg_stat_statements". > However, it's less clear that changing the units in which the column is > expressed will break things. It seems likely to me that nobody out > there is doing anything much more sophisticated than sorting by the > column, and that's still going to work the same. I've seen cases where the timing is correlated with external timings, e.g. from the application. Have I seen it a lot? No - but then I haven't seen a big usage of pg_stat_statements either, which might be the better argument for allowing a change of unit but not name. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Tue, Apr 10, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> No matter what we end up doing here it will be consistent with >> something; I am reminded of the phrase "the good thing about standards >> is that there are so many to choose from...". > > Well, FWIW I vote for making the new columns be float8 msec. If you > don't want to change total_time to match, I guess there's no law that > says it *has* to be consistent ... Ugh. So the three ways of doing timing that we have already aren't enough, and we need a fourth one? Ack! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 10, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, FWIW I vote for making the new columns be float8 msec. > Ugh. So the three ways of doing timing that we have already aren't > enough, and we need a fourth one? Ack! Huh? I understood what you said upthread to be that we have two ways in existing releases (anything unreleased has zero standing in this discussion): float8 sec in pg_stat_statements.total_time, and int8 msec everywhere else. Did I miss something? regards, tom lane
On Tue, Apr 10, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Apr 10, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Well, FWIW I vote for making the new columns be float8 msec. > >> Ugh. So the three ways of doing timing that we have already aren't >> enough, and we need a fourth one? Ack! > > Huh? I understood what you said upthread to be that we have two ways > in existing releases (anything unreleased has zero standing in this > discussion): float8 sec in pg_stat_statements.total_time, and > int8 msec everywhere else. Did I miss something? We also have int8 usec floating around. But even if we didn't, float8 msec would be a new one, regardless of whether it would be third or fourth... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 10, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Huh? I understood what you said upthread to be that we have two ways >> in existing releases (anything unreleased has zero standing in this >> discussion): float8 sec in pg_stat_statements.total_time, and >> int8 msec everywhere else. Did I miss something? > We also have int8 usec floating around. But even if we didn't, float8 > msec would be a new one, regardless of whether it would be third or > fourth... It would still be the second one, because it would replace the only use of float8 sec, no? And more to the point, it converges us on msec being the only exposed unit. The business about underlying microseconds is maybe not so good, but I don't think we want to touch that right now. In the long run I think it would make sense to converge on float8 msec as being the standard for exposed timing values, because that is readily adaptable to the underlying data having nsec or even better precision. regards, tom lane
On Tue, Apr 10, 2012 at 1:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Apr 10, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Huh? I understood what you said upthread to be that we have two ways >>> in existing releases (anything unreleased has zero standing in this >>> discussion): float8 sec in pg_stat_statements.total_time, and >>> int8 msec everywhere else. Did I miss something? > >> We also have int8 usec floating around. But even if we didn't, float8 >> msec would be a new one, regardless of whether it would be third or >> fourth... > > It would still be the second one, because it would replace the only use > of float8 sec, no? And more to the point, it converges us on msec being > the only exposed unit. > > The business about underlying microseconds is maybe not so good, but > I don't think we want to touch that right now. In the long run > I think it would make sense to converge on float8 msec as being the > standard for exposed timing values, because that is readily adaptable to > the underlying data having nsec or even better precision. Hmm. Maybe we should think about numeric ms, which would have all the same advantages but without the round-off error. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 10, 2012 at 02:01:02PM -0400, Robert Haas wrote: > On Tue, Apr 10, 2012 at 1:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> On Tue, Apr 10, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> Huh? I understood what you said upthread to be that we have two ways > >>> in existing releases (anything unreleased has zero standing in this > >>> discussion): float8 sec in pg_stat_statements.total_time, and > >>> int8 msec everywhere else. Did I miss something? > > > >> We also have int8 usec floating around. But even if we didn't, float8 > >> msec would be a new one, regardless of whether it would be third or > >> fourth... > > > > It would still be the second one, because it would replace the only use > > of float8 sec, no? And more to the point, it converges us on msec being > > the only exposed unit. > > > > The business about underlying microseconds is maybe not so good, but > > I don't think we want to touch that right now. In the long run > > I think it would make sense to converge on float8 msec as being the > > standard for exposed timing values, because that is readily adaptable to > > the underlying data having nsec or even better precision. > > Hmm. Maybe we should think about numeric ms, which would have all the > same advantages but without the round-off error. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > They are also a lot bigger with tons of added overhead. :) Regards, Ken
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 10, 2012 at 1:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The business about underlying microseconds is maybe not so good, but >> I don't think we want to touch that right now. In the long run >> I think it would make sense to converge on float8 msec as being the >> standard for exposed timing values, because that is readily adaptable to >> the underlying data having nsec or even better precision. > Hmm. Maybe we should think about numeric ms, which would have all the > same advantages but without the round-off error. Color me unimpressed ... numeric calculations are vastly more expensive than float, and where are you going to get timing data that has more than sixteen decimal digits of accuracy? IME we're lucky to get three repeatable digits in any timing measurement. The point of using a non-integer type here is not so much precision as dynamic range: sometimes you might be measuring queries that run for hours, and other times ones that run for microseconds. In the latter case it's important to be able to represent nanoseconds, but not so much in the former. regards, tom lane
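The dynamic-range point is easy to demonstrate: float8's roughly sixteen significant digits let one value carry an hours-scale accumulation and a sub-microsecond fraction at the same time. A quick sketch:

    -- one hour expressed in ms, plus one nanosecond: 13 significant
    -- digits, comfortably within float8's ~16
    SELECT 3600000::float8 + 0.000001 AS ms;  -- 3600000.000001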
On 04/10/2012 12:27 PM, Tom Lane wrote: > Changing the column name will definitely break > anything more specific than "select * from pg_stat_statements". > However, it's less clear that changing the units in which the column is > expressed will break things. It seems likely to me that nobody out > there is doing anything much more sophisticated than sorting by the > column, and that's still going to work the same. I am doing more sophisticated things with it, so I'll celebrate this as my opportunity to say I did something you didn't see coming for 2012. All the sites involved will happily shred those scripts and rewrite for either normalized queries *or* better I/O timing info though, so net positive for 9.2 changes even if this part breaks on them. I think this is one of those rare opportunities where there's enough positive goodwill from changes to ask "what's the best way to handle this long-term?" and get away with whatever change that requires, too. I'm really not liking the look of this wart now that Robert has pointed it out. I'd prefer to see at least usec resolution and 8 bytes of "dynamic range" for query related statistics. Any of these would be fine from a UI perspective to me:

-float8 seconds
-float8 msec
-float8 usec
-int8 usec

I don't think int8 msec will be enough resolution to time queries for very long, if it's not already obsolete. The committed example for pg_test_timing on good hardware already clocks trivial events at a single usec. Even I/O is getting there. I've measured my Fusion-io loaner card peaking at 8GB/s, which works out to 1 usec per 8K page. None of that is even price no object hardware today; it's the stuff sitting in my office. If anything, I'd expect more timing code in the database that only has ms resolution right now will start looking fat in a year or two, and more things might need to be shifted to usec instead. Checkpoint timing can survive having less resolution because its primary drumbeat is very unlikely to drop below the minutes range. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
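Checking the arithmetic behind that last figure (assuming a flat 8 GB/s and 8K pages):

    SELECT 8e9 / 8192       AS pages_per_sec,  -- ~976,562
           8192 / 8e9 * 1e6 AS usec_per_page;  -- ~1.02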
On 10 April 2012 23:07, Greg Smith <greg@2ndquadrant.com> wrote: > On 04/10/2012 12:27 PM, Tom Lane wrote: > I am doing more sophisticated things with it, so I'll celebrate this as my > opportunity to say I did something you didn't see coming for 2012. This is why I requested that we expose the query_id hash value - I believe that it will be generally useful in clustering situations. It would be nice to have a persistent identifier. While we're discussing revising pg_stat_statement's interface, are you still opposed to exposing that value, Tom? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Apr 10, 2012 at 6:32 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 10 April 2012 23:07, Greg Smith <greg@2ndquadrant.com> wrote: >> On 04/10/2012 12:27 PM, Tom Lane wrote: >> I am doing more sophisticated things with it, so I'll celebrate this as my >> opportunity to say I did something you didn't see coming for 2012. > > This is why I requested that we expose the query_id hash value - I > believe that it will be generally useful in clustering situations. It > would be nice to have a persistent identifier. While we're discussing > revising pg_stat_statement's interface, are you still opposed to > exposing that value, Tom? If people need something like that, couldn't they create it by hashing the normalized query text with an arbitrary algorithm? The only obvious advantage of exposing the value used internally is that it might be helpful in terms of understanding the collision behavior. But hopefully collisions are pretty rare anyway, so... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
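A sketch of the client-side alternative Robert describes, using md5() purely as an example of an arbitrary, stable algorithm:

    -- derive a portable identifier from the normalized query text,
    -- independent of the internal hash
    SELECT md5(query) AS query_key, calls, total_time
    FROM pg_stat_statements;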
Peter Geoghegan <peter@2ndquadrant.com> writes: > On 10 April 2012 23:07, Greg Smith <greg@2ndquadrant.com> wrote: >> On 04/10/2012 12:27 PM, Tom Lane wrote: >> I am doing more sophisticated things with it, so I'll celebrate this as my >> opportunity to say I did something you didn't see coming for 2012. > This is why I requested that we expose the query_id hash value - I > believe that it will be generally useful in clustering situations. It > would be nice to have a persistent identifier. While we're discussing > revising pg_stat_statement's interface, are you still opposed to > exposing that value, Tom? I still am. I'm unconvinced by references to "clustering situations", because as constructed the hash is extremely database-specific. It will vary depending on OID assignments, not to mention platform characteristics such as word width and endianness. regards, tom lane
On 11 April 2012 00:35, Robert Haas <robertmhaas@gmail.com> wrote: > If people need something like that, couldn't they create it by hashing > the normalized query text with an arbitrary algorithm? That supposes that the normalised query text is perfectly stable. It may well not be, particularly for things like ad-hoc queries or queries generated by ORMs, across database clusters and over long periods of time - you're basically throwing the benefit of all of that intelligent normalisation out of the window, because it's pretty close to free to expose it. What if a developer tweaks an alias in the application for clarity? Also, as you point out, it has additional utility in advertising when a collision has happened, and setting the user's expectations appropriately. I assume that collisions are very rare, but when they do happen, this gives you a fighting chance of noticing them. As Tom points out, the query hash will vary according to platform-specific characteristics, including endianness, and will require that OIDs are the same on every node. However, it is still going to be useful in clusters that use streaming replication, though not a third-party trigger-based replication system, for example, because streaming replication does of course require that those factors (and rather a lot more) be identical across the cluster anyway. Realistically, I'd expect a large majority of people interested in this feature to only want to use it with streaming replication anyway. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes: > On 11 April 2012 00:35, Robert Haas <robertmhaas@gmail.com> wrote: >> If people need something like that, couldn't they create it by hashing >> the normalized query text with an arbitrary algorithm? > That supposes that the normalised query text is perfectly stable. It > may well not be, particularly for things like ad-hoc queries or > queries generated by ORMs, across database clusters and over long > periods of time - Indeed, but the hash value isn't stable either given those sorts of assumptions, so I'm not convinced that there's any advantage there. What I think people would actually like to know, if they're in a situation where distinct query texts are getting hashed to the same thing, is *which* different texts got hashed to the same thing. But there's no good way to expose that given the pg_stat_statements infrastructure, and exposing the hash value doesn't help. regards, tom lane
On 11 April 2012 01:16, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <peter@2ndquadrant.com> writes: >> On 11 April 2012 00:35, Robert Haas <robertmhaas@gmail.com> wrote: >>> If people need something like that, couldn't they create it by hashing >>> the normalized query text with an arbitrary algorithm? > >> That supposes that the normalised query text is perfectly stable. It >> may well not be, particularly for things like ad-hoc queries or >> queries generated by ORMs, across database clusters and over long >> periods of time - > > Indeed, but the hash value isn't stable either given those sorts of > assumptions, so I'm not convinced that there's any advantage there. Isn't it? The hash captures the true meaning of the query, while having the database server's platform as a usually irrelevant artefact. Another thing that I forgot to mention is client encoding - it may well be fairly inconvenient to have to use the same algorithm to hash the query string across applications. You also have to hash the query string yourself again and again, which is expensive to do from Python or something, and is often inconvenient - differences beyond track_activity_query_size bytes (default: 1024) are not recognised. Using an SQL code beautifier where a single byte varies now breaks everything, which developers don't expect at all (we've trained them not to), so in many ways you're back to the same limitations as classic pg_stat_statements if you attempt to aggregate queries over time and across machines, which is a very real use case. It's probably pretty annoying to have to get your Python app to use the same hash function as your Java app or whatever, unless you want to use something heavyweight like a cryptographic hash function. By doing it within Postgres, you avoid those headaches. I'm not asking you to very loudly proclaim that it should be used like this - just expose it, accurately document it, and I'm quite confident that it will be widely used and relied upon by those that are reasonably well informed, and understand its limitations, which are really quite straightforward. > What I think people would actually like to know, if they're in a > situation where distinct query texts are getting hashed to the same > thing, is *which* different texts got hashed to the same thing. > But there's no good way to expose that given the pg_stat_statements > infrastructure, and exposing the hash value doesn't help. Apart from detecting the case where we get a straightforward collision, I don't expect that that would be useful. The whole point is that the user doesn't care about the difference, and I think we've specified a practical, widely useful standard for when queries should be considered equivalent. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Apr 11, 2012 at 01:53:06AM +0100, Peter Geoghegan wrote: > On 11 April 2012 01:16, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Geoghegan <peter@2ndquadrant.com> writes: > >> On 11 April 2012 00:35, Robert Haas <robertmhaas@gmail.com> wrote: > >>> If people need something like that, couldn't they create it by hashing > >>> the normalized query text with an arbitrary algorithm? > > > >> That supposes that the normalised query text is perfectly stable. It > >> may well not be, particularly for things like ad-hoc queries or > >> queries generated by ORMs, across database clusters and over long > >> periods of time - > > > > Indeed, but the hash value isn't stable either given those sorts of > > assumptions, so I'm not convinced that there's any advantage there. > > Isn't it? The hash captures the true meaning of the query, while > having the database server's platform as a usually irrelevant > artefact. Another thing that I forgot to mention is client encoding - > it may well be fairly inconvenient to have to use the same algorithm > to hash the query string across applications. You also have to hash > the query string yourself again and again, which is expensive to do > from Python or something, and is often inconvenient - differences > beyond track_activity_query_size bytes (default: 1024) are not > recognised. Using an SQL code beautifier where a single byte varies > now breaks everything, which developers don't expect at all (we've > trained them not to), so in many ways you're back to the same > limitations as classic pg_stat_statements if you attempt to aggregate > queries over time and across machines, which is a very real use case. > > It's probably pretty annoying to have to get your Python app to use > the same hash function as your Java app or whatever, unless you want > to use something heavyweight like a cryptographic hash function. By > doing it within Postgres, you avoid those headaches. > > I'm not asking you to very loudly proclaim that it should be used like > this - just expose it, accurately document it, and I'm quite confident > that it will be widely used and relied upon by those that are > reasonably well informed, and understand its limitations, which are > really quite straightforward. > > > What I think people would actually like to know, if they're in a > > situation where distinct query texts are getting hashed to the same > > thing, is *which* different texts got hashed to the same thing. > > But there's no good way to expose that given the pg_stat_statements > > infrastructure, and exposing the hash value doesn't help. > > Apart from detecting the case where we get a straightforward > collision, I don't expect that that would be useful. The whole point > is that the user doesn't care about the difference, and I think we've > specified a practical, widely useful standard for when queries should > be considered equivalent. > -- > Peter Geoghegan http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training and Services > By using all 64 bits of the hash that we currently calculate, instead of the current use of 32 bits only, the collision probabilities are very low. Regards, Ken
On Wed, Apr 11, 2012 at 9:02 AM, ktm@rice.edu <ktm@rice.edu> wrote: > By using all 64 bits of the hash that we currently calculate, instead > of the current use of 32 bits only, the collision probabilities are > very low. That's probably true, but I'm not sure it's worth worrying about - one-in-four-billion is a pretty small probability. On the broader issue, Peter's argument here seems to boil down to "there is probably a reason why this is useful" and Tom's argument seems to boil down to "I don't want to expose it without knowing what that reason is". Peter may well be right, but that doesn't make Tom wrong. If we are going to expose this, we ought to be able to document why we have it, and right now we can't, because we don't *really* know what it's good for, other than raising awareness of the theoretical possibility of collisions, which doesn't seem like quite enough. On another note, I think this is a sufficiently major change that we ought to add Peter's name to the "Author" section of the pg_stat_statements documentation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On another note, I think this is a sufficiently major change that we > ought to add Peter's name to the "Author" section of the > pg_stat_statements documentation. +1 ... as long as we have those at all, they probably ought to credit anybody who did substantial work on the module. I think that eventually we'll have to give them up, just the way that we don't credit anybody in particular as author of the core code; but for now this is a good change. regards, tom lane
On Fri, Apr 13, 2012 at 4:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On another note, I think this is a sufficiently major change that we >> ought to add Peter's name to the "Author" section of the >> pg_stat_statements documentation. > > +1 ... as long as we have those at all, they probably ought to credit > anybody who did substantial work on the module. > > I think that eventually we'll have to give them up, just the way that > we don't credit anybody in particular as author of the core code; but > for now this is a good change. Yeah. I'd actually be in favor of removing them, and similar references in the comments, because they do lead and have led to disputes about who deserves to be mentioned. However, a change of this magnitude certainly deserves mention; it's not really the same module any more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 13 April 2012 20:15, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Apr 11, 2012 at 9:02 AM, ktm@rice.edu <ktm@rice.edu> wrote: >> By using all 64 bits of the hash that we currently calculate, instead >> of the current use of 32 bits only, the collision probabilities are >> very low. > > That's probably true, but I'm not sure it's worth worrying about - > one-in-four-billion is a pretty small probability. That assumes that our hashing of query trees will exhibit uniform distribution. While it's easy enough to prove that hash_any does that, it would seem to me that that does not imply that we will exhibit perfectly uniform distribution within pg_stat_statements. The reason that I invented the jumble nomenclature to distinguish it from a straight serialisation is that, apart from the fact that many fields are simply ignored, we still couldn't deserialize what we do store from the jumble, because the correct way to serialise a tree entails serialising NULL pointers too - two non-equivalent jumbles could actually be bitwise identical. In practice, I think that adding NodeTags ought to be sufficient here, but I wouldn't like to bet my life on it. Actually, that is a point that perhaps should have been touched on in the comments at the top of the file. You may wish to take a look at my original analysis in the pg_stat_statements normalisation thread, which attempts to quantify the odds by drawing upon the mathematics of the birthday problem. > On the broader issue, Peter's argument here seems to boil down to > "there is probably a reason why this is useful" and Tom's argument > seems to boil down to "I don't want to expose it without knowing what > that reason is". Peter may well be right, but that doesn't make Tom > wrong. If we are going to expose this, we ought to be able to > document why we have it, and right now we can't, because we don't > *really* know what it's good for, other than raising awareness of the > theoretical possibility of collisions, which doesn't seem like quite > enough. Well, it's important to realise that the query string isn't really stable, to the extent that it could differ as the query is evicted and re-enters pg_stat_statements over time. The hash value is necessarily a stable identifier, as it is the very identifier used by pg_stat_statements. People are going to want to aggregate this information over long periods and across database clusters, and I would really like to facilitate that. To be honest, with the plans that we have for replication, with the addition of things like cascading replication, I think it will increasingly be the case that people prefer built-in replication. The fact that the actual hash value is affected by factors like the endianness of the architecture in question seems mostly irrelevant. If we were to expose this, I'd suggest that the documentation in the table describing each column say of the value:

query_hash | stable identifier used by pg_stat_statements for the query

There'd then be additional clarification after the existing reference to the hash value, giving the required caveats about the hash value being subject to various implementation artefacts that effectively only make the values persistently identify queries referencing particular objects in the same database (that is, the object OID values are used), or across databases that are binary clones, such as between a streaming replica master and its standby.
The OID restriction alone effectively shadows all others, so there's no need to mention endianness or anything like that. Anyone who jumped to the conclusion that their aggregation tool would work fine with Slony or something based on the query_hash description alone would probably have bigger problems. > On another note, I think this is a sufficiently major change that we > ought to add Peter's name to the "Author" section of the > pg_stat_statements documentation. Thanks. I actually thought this myself, but didn't want to mention it because I didn't think that it was up to me to decide, or to attempt to influence that decision. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 4/10/12 5:07 PM, Greg Smith wrote: > I'd prefer to see at least usec resolution and 8 bytes of "dynamic range" for query related statistics. Any of these would be fine from a UI perspective to me:
>
> -float8 seconds
> -float8 msec
> -float8 usec
> -int8 usec
>
> I don't think int8 msec will be enough resolution to time queries for very long, if it's not already obsolete. The committed example for pg_test_timing on good hardware already clocks trivial events at a single usec. Even I/O is getting there. I've measured my Fusion-io loaner card peaking at 8GB/s, which works out to 1 usec per 8K page. None of that is even price no object hardware today; it's the stuff sitting in my office.
>
> If anything, I'd expect more timing code in the database that only has ms resolution right now will start looking fat in a year or two, and more things might need to be shifted to usec instead. Checkpoint timing can survive having less resolution because its primary drumbeat is very unlikely to drop below the minutes range.

I agree that ms is on its way out... and frankly it wouldn't surprise me if at some point we actually had need of ns resolution. Given that, I don't think ms or us definitions make sense... float8 seconds seems much more logical to me. Though, if we're going to end up seriously breaking things anyway, perhaps it would make sense to switch everything over to interval... I realize that there's more overhead there, but I don't think selecting from the stats views is exactly performance critical. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Tue, 2012-04-10 at 09:33 -0400, Robert Haas wrote: > So, should we make the new columns exposed by pg_stat_statements use > milliseconds, so that the block read/write timings are everywhere in > milliseconds, or should we keep them as a float8, so that all the > times exposed by pg_stat_statements use float8? Wouldn't interval be the proper type to represent elapsed time?
Jim Nasby <jim@nasby.net> writes: > I agree that ms is on its way out... and frankly it wouldn't surprise me if at some point we actually had need of ns resolution. > Given that, I don't think ms or us definitions make sense... float8 seconds seems much more logical to me. Well, the important point is that it be float8, so that fractions of whatever units you choose can be represented. I do not feel a strong need to change the units in all the existing pg_stat_ columns from msec to sec; that strikes me as just breaking things to little gain. If the majority of them were in sec then I'd want to converge on that, but since the majority are in msec it seems like the path of least breakage is to converge on that. > Though, if we're going to end up seriously breaking things anyway, > perhaps it would make sense to switch everything over to interval... I > realize that there's more overhead there, but I don't think selecting > from the stats views is exactly performance critical. But (a) I *don't* want to seriously break things, and don't see a need to; (b) interval is expensive and has got its own problems, notably an internal limitation to usec resolution that we would not be able to get rid of easily. It's not even apparent to me that interval is semantically The Right Thing for values that represent an accumulation of a lot of different measurements. regards, tom lane
On Fri, Apr 13, 2012 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: > That's probably true, but I'm not sure it's worth worrying about - > one-in-four-billion is a pretty small probability. Is this not subject to the birthday paradox? If you have a given hash you're worried about a collision with then you have a one-in-four-billion chance. But if you have a collection of hashes and you're worried about any collisions then it only takes about 64k before there's likely a collision. -- greg
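For reference, the usual rule of thumb puts the threshold at sqrt(d) entries for a hash space of size d, with the 50% collision point slightly higher; a sketch assuming uniform hashing:

    SELECT sqrt(2 ^ 32)             AS rule_of_thumb,    -- 65536
           sqrt(2 * ln(2) * 2 ^ 32) AS fifty_pct_point;  -- ~77163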
Greg Stark <stark@mit.edu> writes: > On Fri, Apr 13, 2012 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> That's probably true, but I'm not sure it's worth worrying about - >> one-in-four-billion is a pretty small probability. > Is this not subject to the birthday paradox? If you have a given hash > you're worried about a collision with then you have a > one-in-four-billion chance. But if you have a collection of hashes and > you're worried about any collisions then it only takes about 64k > before there's likely a collision. ... so, if pg_stat_statements.max were set as high as 64k, you would need to worry. Realistically, I'm more worried about collisions due to inadequacies in the jumble calculation logic (Peter already pointed out some risk factors in that regard). regards, tom lane
On 14 April 2012 03:01, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Realistically, I'm more worried about collisions due to inadequacies in > the jumble calculation logic (Peter already pointed out some risk > factors in that regard). It's important to have a sense of proportion about the problem. The worst thing that a collision can do is lead the DBA on a bit of a wild goose chase. Importantly, collisions across databases and users are impossible. I've always taken the view that aggregating query statistics is a lossy process, and these kinds of problems seem like a more than acceptable price to pay for low-overhead dynamic query statistics. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Apr 13, 2012 at 10:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <stark@mit.edu> writes: >> On Fri, Apr 13, 2012 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> That's probably true, but I'm not sure it's worth worrying about - >>> one-in-four-billion is a pretty small probability. > >> Is this not subject to the birthday paradox? If you have a given hash >> you're worried about a collision with then you have a >> one-in-four-billion chance. But if you have a collection of hashes and >> you're worried about any collisions then it only takes about 64k >> before there's likely a collision. > > ... so, if pg_stat_statements.max were set as high as 64k, you would > need to worry. Well... at 64k, you'd be very likely to have a collision. But the whole birthday paradox thing means that there's a non-trivial collision probability even at lower numbers of entries. Seems like maybe we ought to be using 64 bits here... > Realistically, I'm more worried about collisions due to inadequacies in > the jumble calculation logic (Peter already pointed out some risk > factors in that regard). ...especially if collisions are even more frequent than random chance would suggest. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 04/13/2012 06:22 PM, Tom Lane wrote: > But (a) I *don't* want to seriously break things, and don't see a need > to; (b) interval is expensive and has got its own problems, notably an > internal limitation to usec resolution that we would not be able to get > rid of easily. A straight float seems pretty future-proof compared to a usec resolution interval. Jim was commenting in the same direction I already did, that ns resolution is not impossible to see coming. I also expect to compute plenty of derived statistics from these numbers. Interval math is good enough that I'm sure such things could be done, but it seems odd to start with those units. I appreciate that the interval type has a nice purist feel to it. My pragmatic side says we're going to pay overhead to create values in that type, only to find people end up converting it right back to other types for easier math tricks. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Sat, Apr 14, 2012 at 3:27 AM, Greg Smith <greg@2ndquadrant.com> wrote: > On 04/13/2012 06:22 PM, Tom Lane wrote: >> >> But (a) I *don't* want to seriously break things, and don't see a need >> to; (b) interval is expensive and has got its own problems, notably an >> internal limitation to usec resolution that we would not be able to get >> rid of easily. > > A straight float seems pretty future proof compared to a usec resolution > interval. Jim was commenting in the same direction I already did, that ns > resolution is not impossible to see coming. > > I also expect to compute plenty of derived statistics from these numbers. > Interval math is good enough that I'm sure such things could be done, but > it seems odd to start with those units. I appreciate that the interval type > has a nice purist feel to it. My pragmatic side says we're going to pay > overhead to create in that type, only to find people end up converting it > right back to other types for easier math tricks. I'm still rooting for numeric. As somebody said upthread, performance ain't critical here; and that lets us whack around the internal representation however we like without worrying about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10 April 2012 19:10, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm. Maybe we should think about numeric ms, which would have all the >> same advantages but without the round-off error. > > Color me unimpressed ... numeric calculations are vastly more expensive > than float, and where are you going to get timing data that has more > than sixteen decimal digits of accuracy? +1 Besides, how do you propose to solve the problem of storing numerics in a fixed allocation of shared memory? The only comparable thing in pg_stat_statements is the query string, which is capped at track_activity_query_size bytes for this very reason. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 14 April 2012 02:42, Greg Stark <stark@mit.edu> wrote: > Is this not subject to the birthday paradox? If you have a given hash > you're worried about a collision with then you have a > one-in-four-billion chance. But if you have a collection of hashes and > you're worried about any collisions then it only takes about 64k > before there's likely a collision. Just for the sake of the archives, assuming that there is a uniform distribution of values, by my calculations that puts the probability of collision at: pg_stat_statements.max of 1,000 = 0.00011562303995116263 and perhaps more representatively, if we follow the example in the docs: pg_stat_statements.max of 10,000 = 0.011496378237656368 It's enough of a problem that I'd expect to hear one or two complaints about it in the next few years, maybe. This is the probability of there being a collision, not the probability of someone caring about it. You probably wouldn't want to push your luck too far: pg_stat_statements.max of 100,000 = 0.6853509059051395 Even if you did, that would only mean that there was usually one, but perhaps two or three values that were collisions, out of an entire 100,000. To labour the point, you'd have to have a lot of bad luck for those to be the values that a human actually ended up caring about. Jim Nasby said upthread that selecting from the stats view isn't performance critical, and he's right. However, maintaining the stats themselves certainly is, since each and every query will have to update them, adding latency. pg_stat_statements is far from being a tool of minority interest, particularly now. People are going to want to add additional bells and whistles to it, which is fine by me, but I'm very much opposed to making everyone pay for additional features that imply performance overhead for all queries, particularly if the feature is of minority interest. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
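For the archives as well: the standard approximation p = 1 - exp(-n(n-1)/2d) reproduces these figures to within a few parts per thousand (the exact product formula accounts for the small differences):

    -- birthday-collision estimate for a 32-bit hash, uniform hashing
    SELECT n,
           1 - exp(-(n::float8 * (n - 1)) / (2 * 2 ^ 32)) AS p_collision
    FROM (VALUES (1000), (10000), (100000)) AS t(n);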
On Sat, Apr 14, 2012 at 9:54 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 10 April 2012 19:10, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Hmm. Maybe we should think about numeric ms, which would have all the >>> same advantages but without the round-off error. >> >> Color me unimpressed ... numeric calculations are vastly more expensive >> than float, and where are you going to get timing data that has more >> than sixteen decimal digits of accuracy? > > +1 > > Besides, how do you propose to solve the problem of storing numerics > in a fixed allocation of shared memory? The only comparable thing in > pg_stat_statements is the query string, which is capped at > track_activity_query_size bytes for this very reason. The internal representation doesn't have to be (and certainly shouldn't be) numeric. But if you translate to numeric before returning the data to the user, then you have the freedom, in the future, to whack around the internal representation however you like, without breaking backward compatibility. Choosing float8 for the external representation is fine as long as we're sure we're not ever going to want more than 16 significant digits, but I see no particular value in baking in that assumption. But perhaps, as the saying goes, 16 digits ought to be enough for anyone. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
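On how far those sixteen digits stretch: a float8 counter of milliseconds keeps microsecond-level resolution until the accumulated total reaches roughly 2^53 microseconds. A quick sketch of that horizon:

    -- 2^53 usec expressed in years: the point where adjacent usec
    -- steps stop being distinguishable in a float8 ms counter
    SELECT 2 ^ 53 / 1e6 / 86400 / 365.25 AS years;  -- ~285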
Robert Haas <robertmhaas@gmail.com> writes: > The internal representation doesn't have to be (and certainly > shouldn't be) numeric. But if you translate to numeric before > returning the data to the user, then you have the freedom, in the > future, to whack around the internal representation however you like, > without breaking backward compatibility. Choosing float8 for the > external representation is fine as long as we're sure we're not ever > going to want more than 16 significant digits, but I see no particular > value in baking in that assumption. But perhaps, as the saying goes, > 16 digits ought to be enough for anyone. There's no particular reason to think that Moore's Law is going to result in an increase in the fractional precision of timing data. It hasn't done so in the past, for sure. regards, tom lane
On Sat, Apr 14, 2012 at 10:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> The internal representation doesn't have to be (and certainly >> shouldn't be) numeric. But if you translate to numeric before >> returning the data to the user, then you have the freedom, in the >> future, to whack around the internal representation however you like, >> without breaking backward compatibility. Choosing float8 for the >> external representation is fine as long as we're sure we're not ever >> going to want more than 16 significant digits, but I see no particular >> value in baking in that assumption. But perhaps, as the saying goes, >> 16 digits ought to be enough for anyone. > > There's no particular reason to think that Moore's Law is going to > result in an increase in the fractional precision of timing data. > It hasn't done so in the past, for sure. Perhaps, but nobody's explained what we gain out of NOT using numeric. "It's slow" doesn't impress me; selecting from a system view doesn't need to be lightning-fast. However, the main thing here is that we need to do *something* here... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Apr 14, 2012 at 10:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There's no particular reason to think that Moore's Law is going to >> result in an increase in the fractional precision of timing data. >> It hasn't done so in the past, for sure. > Perhaps, but nobody's explained what we gain out of NOT using numeric. > "It's slow" doesn't impress me; selecting from a system view doesn't > need to be lightning-fast. Well, how about "the code is going to be quite a lot less readable"? C can manipulate floats natively, but not numerics. Also, as was pointed out upthread, the underlying data in shared memory is almost certainly never going to be infinite-precision; so using numeric in the API seems to me to be more likely to convey a false impression of exactness than to do anything useful. > However, the main thing here is that we need to do *something* here... Agreed, this has got to be pushed forward. regards, tom lane
On Wed, Apr 25, 2012 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sat, Apr 14, 2012 at 10:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> There's no particular reason to think that Moore's Law is going to >>> result in an increase in the fractional precision of timing data. >>> It hasn't done so in the past, for sure. > >> Perhaps, but nobody's explained what we gain out of NOT using numeric. >> "It's slow" doesn't impress me; selecting from a system view doesn't >> need to be lightning-fast. > > Well, how about "the code is going to be quite a lot less readable"? > C can manipulate floats natively, but not numerics. > > Also, as was pointed out upthread, the underlying data in shared memory > is almost certainly never going to be infinite-precision; so using > numeric in the API seems to me to be more likely to convey a false > impression of exactness than to do anything useful. > >> However, the main thing here is that we need to do *something* here... > > Agreed, this has got to be pushed forward. In the interest of furthering that goal, I propose that whoever is willing to take the time to clean this up gets to decide what to standardize on, and I'm happy to give you first crack at that if you have the cycles. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Apr 25, 2012 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> However, the main thing here is that we need to do *something* here... >> Agreed, this has got to be pushed forward. > In the interest of furthering that goal, I propose that whoever is > willing to take the time to clean this up gets to decide what to > standardize on, and I'm happy to give you first crack at that if you > have the cycles. OK. I have just returned from some emergency family business, and have got assorted catching-up to do, but I will try to get to that later this week. regards, tom lane
On Wed, Apr 25, 2012 at 1:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Apr 25, 2012 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> However, the main thing here is that we need to do *something* here... > >>> Agreed, this has got to be pushed forward. > >> In the interest of furthering that goal, I propose that whoever is >> willing to take the time to clean this up gets to decide what to >> standardize on, and I'm happy to give you first crack at that if you >> have the cycles. > > OK. I have just returned from some emergency family business, and have > got assorted catching-up to do, but I will try to get to that later > this week. Sounds good to me. You might want to revisit the issue of how the new columns in pg_stat_statements are named, as well. I am not sure I'm happy with that, but neither am I sure that I know what I'd like better. It's not too clear that the timing is specifically for data block reads and writes, for example. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 25, 2012 at 5:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also, as was pointed out upthread, the underlying data in shared memory > is almost certainly never going to be infinite-precision; so using > numeric in the API seems to me to be more likely to convey a false > impression of exactness than to do anything useful. I don't think that follows. The underlying data will be measured in some metric unit of time like microsecond or nanosecond or something like that. So a base-10 representation will show exactly the precision that the underlying data has. On the other hand, a floating point number will show a base-2 approximation that may in fact display with more digits than the underlying data representation has. -- greg
Greg Stark <stark@mit.edu> writes: > On Wed, Apr 25, 2012 at 5:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, as was pointed out upthread, the underlying data in shared memory >> is almost certainly never going to be infinite-precision; so using >> numeric in the API seems to me to be more likely to convey a false >> impression of exactness than to do anything useful. > I don't think that follows. The underlying data will be measured in > some metric unit of time like microsecond or nanosecond or something > like that. So a base-10 representation will show exactly the precision > that the underlying data has. On the other hand, a floating point > number will show a base-2 approximation that may in fact display with > more digits than the underlying data representation has. My point is that the underlying data is going to be stored in a fixed-width representation, and therefore it will have accuracy and/or range limitations that are considerably more severe than use of "numeric" for output might suggest to the user. In the current pg_stat_statements code, timings are in fact accumulated in float8, and emitting them as something other than float8 is just plain misleading IMHO. regards, tom lane
On Wed, Apr 25, 2012 at 1:58 PM, Greg Stark <stark@mit.edu> wrote: > On Wed, Apr 25, 2012 at 5:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, as was pointed out upthread, the underlying data in shared memory >> is almost certainly never going to be infinite-precision; so using >> numeric in the API seems to me to be more likely to convey a false >> impression of exactness than to do anything useful. > > I don't think that follows. The underlying data will be measured in > some metric unit of time like microsecond or nanosecond or something > like that. So a base-10 representation will show exactly the precision > that the underlying data has. On the other hand, a floating point > number will show a base-2 approximation that may in fact display with > more digits than the underlying data representation has. I wholeheartedly agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
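A minimal standalone sketch of the point at issue, assuming 64-bit IEEE-754 doubles: an elapsed time measured as an exact integer count of microseconds picks up a base-2 approximation the moment it is converted to float8 milliseconds, and printing it at full precision can show digits the measurement never contained.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* An elapsed time measured as an exact integer count of
     * microseconds, e.g. from gettimeofday(). */
    int64_t usec = 1234567;

    /* Base-10 view: exactly the precision the data has. */
    printf("exact:  %lld.%03lld ms\n",
           (long long) (usec / 1000), (long long) (usec % 1000));

    /* float8 view: 1234.567 has no exact base-2 representation, so
     * asking for all 17 significant digits can print trailing digits
     * that were never measured. */
    double msec = usec / 1000.0;
    printf("float8: %.17g ms\n", msec);
    return 0;
}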
On mån, 2012-04-23 at 22:03 -0400, Robert Haas wrote: > Perhaps, but nobody's explained what we gain out of NOT using numeric. So if you want to have possibly different internal and external representations, why not use interval for the external one?
Peter Eisentraut <peter_e@gmx.net> writes: > On mån, 2012-04-23 at 22:03 -0400, Robert Haas wrote: >> Perhaps, but nobody's explained what we gain out of NOT using numeric. > So if you want to have possibly different internal and external > representations, why not use interval for the external one? That doesn't add any usefulness, only extra complication for clients that want to do more arithmetic with the values. Also, as was pointed out earlier, we have a hard-coded restriction to microsecond precision with the default implementation of interval; and it's not hard to foresee the day when that won't do. regards, tom lane
On Wed, Apr 25, 2012 at 5:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Eisentraut <peter_e@gmx.net> writes: >> On mån, 2012-04-23 at 22:03 -0400, Robert Haas wrote: >>> Perhaps, but nobody's explained what we gain out of NOT using numeric. > >> So if you want to have possibly different internal and external >> representations, why not use interval for the external one? > > That doesn't add any usefulness, only extra complication for clients > that want to do more arithmetic with the values. Also, as was pointed > out earlier, we have a hard-coded restriction to microsecond precision > with the default implementation of interval; and it's not hard to > foresee the day when that won't do. Agreed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... You might want to revisit the issue of how the new > columns in pg_stat_statements are named, as well. I am not sure I'm > happy with that, but neither am I sure that I know what I'd like > better. It's not too clear that the timing is specifically for data > block reads and writes, for example. Well, the names "time_read" and "time_write" are certainly out of step with every other stats view in the system; everyplace else, such columns are named "something_time" (and even in this view itself the other timing column is "total_time", not "time_total"). So that's got to change. We could just reverse the word order to "read_time" and "write_time", or we could do something like "buf_read_time" or "data_read_time". IIUC block_read_time/block_write_time in the pg_stat_database view are database-wide totals for the same numbers, so perhaps the pg_stat_statements column names should be consistent with those. I am kinda wondering though why those columns spell out "block" where every single other column name in the stats views uses the abbreviation "blk". regards, tom lane
... btw, while I'm criticizing names, how about changing "track_iotiming" to "track_io_timing"? The former seems inelegant and unreadable. regards, tom lane
On Sat, Apr 28, 2012 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> ... You might want to revisit the issue of how the new >> columns in pg_stat_statements are named, as well. I am not sure I'm >> happy with that, but neither am I sure that I know what I'd like >> better. It's not too clear that the timing is specifically for data >> block reads and writes, for example. > > Well, the names "time_read" and "time_write" are certainly out of step > with every other stats view in the system; everyplace else, such columns > are named "something_time" (and even in this view itself the other > timing column is "total_time", not "time_total"). So that's got to > change. We could just reverse the word order to "read_time" and > "write_time", or we could do something like "buf_read_time" or > "data_read_time". IIUC block_read_time/block_write_time in the > pg_stat_database view are database-wide totals for the same numbers, so > perhaps the pg_stat_statements column names should be consistent with > those. I am kinda wondering though why those columns spell out "block" > where every single other column name in the stats views uses the > abbreviation "blk". I like the idea of including the word block in there. I don't think it was probably a terribly good idea to abbreviate that to blk everywhere, but at this point it's probably better to be consistent, sigh. As for track_iotiming -> track_io_timing, I'm fine with that as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I like the idea of including the word block in there. I don't think > it was probably a terribly good idea to abbreviate that to blk > everywhere, but at this point it's probably better to be consistent, > sigh. > As for track_iotiming -> track_io_timing, I'm fine with that as well. I made these changes, so I think we are done with the naming issues. However, I'd still like to propose that we think about adjusting the timing column datatypes, ie uniformly use float8 msec for values representing elapsed times. By my count there are six columns that would be affected:

pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time

The first four of these are new in 9.2, meaning that we would only be creating a compatibility issue for the last two. If we wait to do this in the future, we will have a significantly bigger problem than if we do it today. Advantages of the change are:

* Better precision exposed to the user (pg_stat_user_functions has historically provided only millisecond precision).
* Removal of arbitrary limit of microsecond precision. Of course, the underlying data is still no better than microsecond, but if we ever are able to migrate to OS APIs that return better-than-microsecond data, we won't have to adjust the stats view APIs to expose that data.
* A chance to make the functions underlying these stats view columns agree with the exposed column definitions.

Any objections out there? regards, tom lane
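For illustration, a minimal standalone sketch of the proposed convention, assuming POSIX gettimeofday() as the clock source (the backend's instr_time abstraction varies by platform): measure in integer microseconds, expose the elapsed time as float8 milliseconds.

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    usleep(1500);                     /* stand-in for the timed I/O */
    gettimeofday(&end, NULL);

    /* Expose the elapsed time as float8 milliseconds: the double
     * keeps the full microsecond precision of the source data and
     * leaves headroom if the clock source ever gets finer. */
    double msec = (end.tv_sec - start.tv_sec) * 1000.0
                + (end.tv_usec - start.tv_usec) / 1000.0;

    printf("%.3f ms\n", msec);
    return 0;
}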
On Sun, Apr 29, 2012 at 12:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: > As for track_iotiming -> track_io_timing, I'm fine with that as well. I'm still grumpy about the idea of a GUC changing the explain analyze output. How would people feel about adding an explain option that explicitly requests io timing for this explain analyze, and then having the io timing be enabled if either it's requested by explain analyze or it's turned on globally? That would make it more consistent with the other explain analyze options. I realize I don't get to be grumpy without actually contributing anything, but I'm happy to write up the patch if people agree with the change. -- greg
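A minimal sketch of the rule being proposed, with hypothetical names (explain_wants_io_timing stands in for the suggested per-statement EXPLAIN option; the track_io_timing argument mirrors the GUC of the same name):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: time I/O for this query if either the EXPLAIN
 * option asks for it or the GUC is on globally. */
static bool io_timing_enabled(bool explain_wants_io_timing,
                              bool track_io_timing)
{
    return explain_wants_io_timing || track_io_timing;
}

int main(void)
{
    printf("%d\n", io_timing_enabled(true, false));  /* 1: EXPLAIN asked */
    printf("%d\n", io_timing_enabled(false, true));  /* 1: GUC is on */
    printf("%d\n", io_timing_enabled(false, false)); /* 0: neither */
    return 0;
}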
Greg Stark <stark@mit.edu> writes: > On Sun, Apr 29, 2012 at 12:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> As for track_iotiming -> track_io_timing, I'm fine with that as well. > I'm still grumpy about the idea of a GUC changing the explain analyze > output. How would people feel about adding an explain option that > explicitly requests io timing for this explain analyze and then having > the io timing be enabled if either it's requested by explain analyze > or if it's set on globally? That would make it more consistent with > the other explain analyze options? I think it's going to be hard to decouple that altogether. For instance, if track_io_timing were not on but you did EXPLAIN (TIMING), you'd end up with timing info getting sent to the stats collector for just that one statement. That seems a bit weird too. I see where you're coming from but I don't think it's a good idea to add an EXPLAIN option unless you can make the two behaviors (EXPLAIN reporting and stats collection) truly independent. regards, tom lane