Thread: EXPLAIN ANALYZE on 8.2
Hi, everybody!

Running the same query on pg 8.2 through EXPLAIN ANALYZE takes 4x-10x as much time as running it without it. Is it ok?

Example:

testing=> select count(*) from auth_user;
  count
---------
 2575675
(1 row)

Time: 1450,829 ms

testing=> explain analyze select count(*) from auth_user;
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=89814.87..89814.88 rows=1 width=0) (actual time=18460.436..18460.439 rows=1 loops=1)
   ->  Seq Scan on auth_user  (cost=0.00..83373.89 rows=2576389 width=0) (actual time=0.424..9871.520 rows=2575675 loops=1)
 Total runtime: 18460.535 ms
(3 rows)

Time: 18461,194 ms
Evgeny Gridasov <eugrid@fpm.kubsu.ru> writes:
> Running the same query on pg 8.2 through EXPLAIN ANALYZE takes 4x-10x as much time as running it without it.

If your machine has slow gettimeofday() this is not surprising.  8.2 is no worse (or better) than any prior version.

Some quick arithmetic from your results suggests that gettimeofday() is taking about 3.3 microseconds, which is indeed pretty awful.  What sort of machine is this exactly?

			regards, tom lane
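(For anyone wanting to reproduce that "quick arithmetic": assuming the instrumented Seq Scan node pays roughly two gettimeofday() calls per row it returns -- an assumption about how the instrumentation works, not something spelled out in the thread -- the numbers from the first message work out to about:

    instrumentation overhead  =  18461 ms - 1451 ms              ~ 17010 ms
    gettimeofday() calls      ~  2,575,675 rows x 2 calls/row    ~ 5.15 million
    cost per call             ~  17010 ms / 5,150,000            ~ 3.3 microseconds
)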
Tom, Hello.

This is Linux Debian 3.1 on top of 2x XEON 3.4 GHz. PostgreSQL is 8.2 checked out from CVS REL8_2_STABLE yesterday. I'm running the same Postgres on another machine, with Debian Etch, and have the same results.

On Thu, 14 Dec 2006 11:11:42 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Evgeny Gridasov <eugrid@fpm.kubsu.ru> writes:
> > Running the same query on pg 8.2 through EXPLAIN ANALYZE takes 4x-10x as much time as running it without it.
>
> If your machine has slow gettimeofday() this is not surprising.  8.2 is
> no worse (or better) than any prior version.
>
> Some quick arithmetic from your results suggests that gettimeofday() is
> taking about 3.3 microseconds, which is indeed pretty awful.  What sort
> of machine is this exactly?
>
> 			regards, tom lane
Evgeny Gridasov <eugrid@fpm.kubsu.ru> writes:
> This is Linux Debian 3.1 on top of 2x XEON 3.4 GHz.
> PostgreSQL is 8.2 checked out from CVS REL8_2_STABLE yesterday.
> I'm running the same Postgres on another machine,
> with Debian Etch and have the same results.

Hmph.  With 8.2 on Fedora 5 on a 2.8GHz dual Xeon, I get this:

regression=# create table foo as select x from generate_series(1,2500000) x;
SELECT
regression=# vacuum foo;
VACUUM
regression=# checkpoint;
CHECKPOINT
regression=# \timing
Timing is on.
regression=# select count(*) from foo;
  count
---------
 2500000
(1 row)

Time: 666.639 ms
regression=# select count(*) from foo;
  count
---------
 2500000
(1 row)

Time: 609.514 ms
regression=# explain analyze select count(*) from foo;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=44764.00..44764.01 rows=1 width=0) (actual time=1344.812..1344.813 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..38514.00 rows=2500000 width=0) (actual time=0.031..748.571 rows=2500000 loops=1)
 Total runtime: 1344.891 ms
(3 rows)

Time: 1345.755 ms
regression=# explain analyze select count(*) from foo;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=44764.00..44764.01 rows=1 width=0) (actual time=1324.846..1324.847 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..38514.00 rows=2500000 width=0) (actual time=0.046..748.582 rows=2500000 loops=1)
 Total runtime: 1324.902 ms
(3 rows)

Time: 1325.591 ms
regression=#

which works out to about 0.14 microsec per gettimeofday call, on a machine that ought to be slower than yours.  So I think you've got either a crummy motherboard, or a kernel that doesn't know the best way to read the clock on that hardware.  There is some discussion of this in the archives (probably in pgsql-hackers); look back around May or so when we were unsuccessfully trying to hack EXPLAIN to use fewer gettimeofday calls.

			regards, tom lane
On 12/14/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Evgeny Gridasov <eugrid@fpm.kubsu.ru> writes:
> > This is Linux Debian 3.1 on top of 2x XEON 3.4 GHz.
> > PostgreSQL is 8.2 checked out from CVS REL8_2_STABLE yesterday.
> > I'm running the same Postgres on another machine,
> > with Debian Etch and have the same results.
>
> Hmph.  With 8.2 on Fedora 5 on a 2.8GHz dual Xeon, I get this:
> <snip>
> regression=# explain analyze select count(*) from foo;
>                                                       QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=44764.00..44764.01 rows=1 width=0) (actual time=1324.846..1324.847 rows=1 loops=1)
>    ->  Seq Scan on foo  (cost=0.00..38514.00 rows=2500000 width=0) (actual time=0.046..748.582 rows=2500000 loops=1)
>  Total runtime: 1324.902 ms
> (3 rows)
>
> Time: 1325.591 ms
> regression=#
>
> which works out to about 0.14 microsec per gettimeofday call, on a
> machine that ought to be slower than yours.  So I think you've got
> either a crummy motherboard, or a kernel that doesn't know the best
> way to read the clock on that hardware.  There is some discussion
> of this in the archives (probably in pgsql-hackers); look back around
> May or so when we were unsuccessfully trying to hack EXPLAIN to use
> fewer gettimeofday calls.

Yow!  I notice the same thing on our HP BL25p blades w/2*opteron 270 (four total cores, AMD 8111 or 8131 chipset): 1.25 microsec/call vs my new desktop (Intel Core2 6300) at 0.16 microsec/call.

I hope this isn't a "crummy mainboard" but I can't seem to affect things by changing clock source (kernel 2.6.16 SLES10).  I tried kernel command option clock=XXX where XXX in (cyclone,hpet,pmtmr,tsc,pit); no option was significantly better than the default.

Anyone know how this might be improved (short of replacing hardware)?

-K
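(A per-call figure like Kelly's can be measured with a few lines of C; a minimal sketch along these lines, not the program he actually used:

/* gtod_bench.c -- rough per-call cost of gettimeofday()
 * build: gcc -O2 -o gtod_bench gtod_bench.c
 */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval start, end, tv;
    const long n = 10000000;            /* ten million calls */
    long i;
    double elapsed_us;

    gettimeofday(&start, NULL);
    for (i = 0; i < n; i++)
        gettimeofday(&tv, NULL);        /* the call being measured */
    gettimeofday(&end, NULL);

    elapsed_us = (end.tv_sec - start.tv_sec) * 1e6
               + (end.tv_usec - start.tv_usec);
    printf("%.3f microsec per gettimeofday() call\n", elapsed_us / n);
    return 0;
}
)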
Kelly Burkhart wrote:
>
> I hope this isn't a "crummy mainboard" but I can't seem to affect
> things by changing clock source (kernel 2.6.16 SLES10).  I tried
> kernel command option clock=XXX where XXX in
> (cyclone,hpet,pmtmr,tsc,pit), no option was significantly better than
> the default.
>
> Anyone know how this might be improved (short of replacing hardware)?
>

Updating the BIOS might be worth investigating, and then bugging your Linux distro mailing list/support etc for more help. (What sort of motherboard is it?)

Cheers

Mark
"Kelly Burkhart" <kelly.burkhart@gmail.com> writes: > I hope this isn't a "crummy mainboard" but I can't seem to affect > things by changing clock source (kernel 2.6.16 SLES10). I tried > kernel command option clock=XXX where XXX in > (cyclone,hpet,pmtmr,tsc,pit), no option was significantly better than > the default. I believe that on machines where gettimeofday is really nice and fast, it doesn't require entry to the kernel at all; there's some hack that makes the clock readable from userspace. (Obviously a switch to kernel mode would set you back a lot of the cycles involved here.) So it's not so much the kernel that you need to teach as glibc. How you do that is beyond my expertise, but maybe that will help you google for the right thing ... regards, tom lane
On Thu, 2006-12-14 at 19:05 -0500, Tom Lane wrote:
> "Kelly Burkhart" <kelly.burkhart@gmail.com> writes:
> > I hope this isn't a "crummy mainboard" but I can't seem to affect
> > things by changing clock source (kernel 2.6.16 SLES10).  I tried
> > kernel command option clock=XXX where XXX in
> > (cyclone,hpet,pmtmr,tsc,pit), no option was significantly better than
> > the default.
>
> I believe that on machines where gettimeofday is really nice and fast,
> it doesn't require entry to the kernel at all; there's some hack that
> makes the clock readable from userspace.  (Obviously a switch to kernel
> mode would set you back a lot of the cycles involved here.)  So it's not
> so much the kernel that you need to teach as glibc.  How you do that is
> beyond my expertise, but maybe that will help you google for the right
> thing ...

Until we work out a better solution we can fix this in two ways:

1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...

2. enable_analyze_timer = off | on (default)  (USERSET)

A performance drop of 4x-10x is simply unacceptable when trying to tune queries where the current untuned time is already too high. Tying down production servers for hours on end when we know for certain all they are doing is calling gettimeofday millions of times is not good. This quickly leads to the view from objective people that PostgreSQL doesn't have a great optimizer, whatever we say in its defence. I don't want to leave this alone, but I don't want to spend a month fixing it either.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Fri, Dec 15, 2006 at 10:28:08AM +0000, Simon Riggs wrote:
> Until we work out a better solution we can fix this in two ways:
>
> 1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...
>
> 2. enable_analyze_timer = off | on (default)  (USERSET)

What exactly would this do? Only count actual rows or something? I wrote a patch that tried statistical sampling, but the figures were too far off for people's liking.

> A performance drop of 4x-10x is simply unacceptable when trying to tune
> queries where the current untuned time is already too high. Tying down
> production servers for hours on end when we know for certain all they
> are doing is calling gettimeofday millions of times is not good. This
> quickly leads to the view from objective people that PostgreSQL doesn't
> have a great optimizer, whatever we say in its defence. I don't want to
> leave this alone, but I don't want to spend a month fixing it either.

I think the best option is setitimer(), but it's not POSIX so platform support is going to be patchy.

BTW, doing gettimeofday() without kernel entry is not really possible. You could use the cycle counter, but it has the problem that if you have multiple CPUs you need to calibrate the result. If the CPU goes to sleep, there is no way for the userspace process to know. Only the kernel has all the relevant information about what "time" is to get a reasonable result.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Hi list,

On Friday 15 December 2006 11:50, Martijn van Oosterhout wrote:
> BTW, doing gettimeofday() without kernel entry is not really possible.
> You could use the cycle counter, but it has the problem that if you have
> multiple CPUs you need to calibrate the result. If the CPU goes to
> sleep, there is no way for the userspace process to know. Only the
> kernel has all the relevant information about what "time" is to get a
> reasonable result.

I remember having played with the intel RDTSC (time stamp counter) for some timing measurement, but I just read from several sources (including linux kernel hackers considering its usage for the gettimeofday() implementation) that the TSC is not an accurate way to obtain elapsed-time information.

Maybe some other methods than gettimeofday() are available (Lamport timestamps, as PGDG may have to consider a distributed-processing-ready EXPLAIN ANALYZE at some point in the future), cheaper and still accurate? After all, the discussion, as far as I understand it, is about having an accurate measure of the duration of events; knowing when they occurred in the day does not seem to be the point.

My 2¢, hoping this could be somehow helpful,
--
Dimitri Fontaine
http://www.dalibo.com/
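(For reference, reading the TSC directly looks roughly like this -- x86 inline assembly, a sketch only, and it has exactly the calibration and multi-CPU problems mentioned above, which is why it isn't a drop-in replacement for gettimeofday():

/* read the x86 time stamp counter; raw cycles, not wall-clock time */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = rdtsc();
    /* ... code being timed ... */
    uint64_t t1 = rdtsc();

    /* converting cycles to time requires knowing the CPU frequency,
     * and the counters on different CPUs/cores need not agree */
    printf("elapsed cycles: %llu\n", (unsigned long long) (t1 - t0));
    return 0;
}
)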
On Fri, 2006-12-15 at 11:50 +0100, Martijn van Oosterhout wrote:
> On Fri, Dec 15, 2006 at 10:28:08AM +0000, Simon Riggs wrote:
> > Until we work out a better solution we can fix this in two ways:
> >
> > 1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...
> >
> > 2. enable_analyze_timer = off | on (default)  (USERSET)
>
> What exactly would this do? Only count actual rows or something?

Yes. It's better to have this than nothing at all.

> I wrote a patch that tried statistical sampling, but the figures were too
> far off for people's liking.

Well, I like your ideas, so if you have any more...

Maybe sampling every 10 rows will bring things down to an acceptable level (after the first N). You tried less than 10 didn't you?

Maybe we can count how many real I/Os were required to perform each particular row, so we can adjust the time per row based upon I/Os. ISTM that sampling at too low a rate means we can't spot the effects of cache and I/O which can often be low frequency but high impact.

> I think the best option is setitimer(), but it's not POSIX so
> platform support is going to be patchy.

Don't understand that. I thought that was to do with alarms and signals.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Fri, Dec 15, 2006 at 12:15:59PM +0000, Gregory Stark wrote:
> There are various attempts at providing better timing infrastructure at low
> overhead but I'm not sure what's out there currently. I expect that to do this,
> what we'll have to do is invent a pg_* abstraction that has various implementations
> on different architectures. On Solaris it can use DTrace internally, on Linux
> it might have something else (or more likely several different options
> depending on the age and config options of the kernel).

I think we need to move to a sampling approach. setitimer is good, except it doesn't tell you if signals have been lost. Given they are most likely to be lost during high disk I/O, they're actually significant. I'm trying to think of a way around that.

Then you don't need a cheap gettimeofday at all...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
On Fri, Dec 15, 2006 at 12:20:46PM +0000, Simon Riggs wrote:
> > I wrote a patch that tried statistical sampling, but the figures were too
> > far off for people's liking.
>
> Well, I like your ideas, so if you have any more...
>
> Maybe sampling every 10 rows will bring things down to an acceptable
> level (after the first N). You tried less than 10 didn't you?

Yeah, it reduced the number of calls as the count got larger. It broke somewhere, though I don't quite remember why.

> Maybe we can count how many real I/Os were required to perform each
> particular row, so we can adjust the time per row based upon I/Os. ISTM
> that sampling at too low a rate means we can't spot the effects of cache
> and I/O which can often be low frequency but high impact.

One idea is to sample at a fixed interval. Where the I/O cost is high, there'll be a lot of sampling points. The issue is that the final results are not totally accurate anymore.

> > I think the best option is setitimer(), but it's not POSIX so
> > platform support is going to be patchy.
>
> Don't understand that. I thought that was to do with alarms and signals.

On my system the manpage says setitimer() conforms to: SVr4, 4.4BSD. I'm unsure how many of the supported platforms fall under that categorisation.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
* Simon Riggs:

>> I think the best option is setitimer(), but it's not POSIX so
>> platform support is going to be patchy.
>
> Don't understand that. I thought that was to do with alarms and
> signals.

You could use it for sampling.  Every few milliseconds, you record which code is executing (possibly using a global variable which is set and reset accordingly).  But this might lead to tons of interrupted system calls, and not all systems mask them, so this might not be an option for the PostgreSQL code base.

On the other hand, if performance in more recent Linux releases (from kernel.org) is acceptable, you should assume that the problem will eventually fix itself.  FWIW, I see the 9x overhead on something that is close to 2.6.17 (on AMD64/Opteron), so this could be wishful thinking. 8-(

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
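(A minimal sketch of that sampling idea -- a periodic timer plus a global "currently executing node" variable, with SA_RESTART to limit the interrupted-syscall problem mentioned above.  Illustrative only, not a proposed patch; MAX_NODES and the 1ms interval are made-up parameters:

/* sample "which plan node is running" on a periodic timer,
 * instead of calling gettimeofday() around every row */
#include <signal.h>
#include <string.h>
#include <sys/time.h>

#define MAX_NODES 256

static volatile sig_atomic_t current_node = -1;   /* set/cleared by the executor */
static volatile long sample_counts[MAX_NODES];    /* timer ticks seen per node */

static void sample_handler(int signo)
{
    (void) signo;
    if (current_node >= 0 && current_node < MAX_NODES)
        sample_counts[current_node]++;
}

static void start_sampling(void)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sample_handler;
    sa.sa_flags = SA_RESTART;           /* reduce EINTR breakage of syscalls */
    sigaction(SIGALRM, &sa, NULL);

    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = 1000;      /* ~1ms wall-clock ticks, so time spent
                                         * waiting on I/O is still charged to
                                         * the node that was running */
    it.it_value = it.it_interval;
    setitimer(ITIMER_REAL, &it, NULL);  /* time per node ~ ticks * interval */
}

Estimated time per node is then sample_counts[node] times the tick interval, with all the accuracy caveats discussed in this thread.)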
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Fri, Dec 15, 2006 at 12:20:46PM +0000, Simon Riggs wrote:
>> Maybe sampling every 10 rows will bring things down to an acceptable
>> level (after the first N). You tried less than 10 didn't you?

> Yeah, it reduced the number of calls as the count got larger. It broke
> somewhere, though I don't quite remember why.

The fundamental problem with it was the assumption that different executions of a plan node will have the same timing.  That's not true, in fact not even approximately true.  IIRC the patch did realize that first-time-through is not a predictor for the rest, but some of our plan nodes have enormous variance even after the first time.  I think the worst case is batched hash joins.

			regards, tom lane
Gregory Stark <stark@enterprisedb.com> writes:
> There are various attempts at providing better timing infrastructure at low
> overhead but I'm not sure what's out there currently. I expect that to do this,
> what we'll have to do is invent a pg_* abstraction that has various implementations
> on different architectures.

You've got to be kidding.  Surely it's glibc's responsibility, not ours, to implement gettimeofday correctly for the hardware.

			regards, tom lane
On Fri, Dec 15, 2006 at 09:56:57AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > On Fri, Dec 15, 2006 at 12:20:46PM +0000, Simon Riggs wrote:
> >> Maybe sampling every 10 rows will bring things down to an acceptable
> >> level (after the first N). You tried less than 10 didn't you?
>
> > Yeah, it reduced the number of calls as the count got larger. It broke
> > somewhere, though I don't quite remember why.
>
> The fundamental problem with it was the assumption that different
> executions of a plan node will have the same timing.  That's not true,
> in fact not even approximately true.  IIRC the patch did realize
> that first-time-through is not a predictor for the rest, but some of
> our plan nodes have enormous variance even after the first time.
> I think the worst case is batched hash joins.

It didn't assume that, because that's obviously bogus. It assumed the durations would follow a normal distribution, which meant that over time the average of the measured iterations would approach the actual average. It tried to take enough measurements to keep the expected error small, but it's statistics: you can only say "this will give the right answer >95% of the time".

You are correct, though: the error was caused by unexpectedly large variations, or more likely, an unexpected distribution curve. Statistically, we took enough samples not to be affected significantly by large variations. Even if it looked more like a gamma distribution it should not have been as far off as it was.

Looking at alternative approaches, like sampling with a timer, you end up with the same problem: sometimes the calculations will fail and produce something strange. The simplest example being that a 100Hz timer is not going to produce any useful information for queries in the millisecond range. A higher frequency timer than that is not going to be available portably.

You could probably throw more effort into refining the statistics behind it, but at some point we're going to have to draw a line and say: it's going to be wrong X% of the time, deal with it. If we're not willing to say that, then there's no point going ahead with any statistical approach.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
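(The estimator being described, stripped to its skeleton, looks something like the following -- a sketch of the general approach, not the actual patch; the warm-up window and sampling interval are illustrative values:

/* time every iteration for a warm-up window, then only every k-th one,
 * and scale the sampled total up to estimate the true total */
#define ALWAYS_TIME_FIRST 50
#define SAMPLE_EVERY      10

typedef struct SampleInstr
{
    long   iterations;      /* total node executions */
    long   sampled;         /* executions we actually timed */
    double sampled_total;   /* seconds accumulated over sampled executions */
} SampleInstr;

static int should_time(const SampleInstr *si)
{
    return si->iterations < ALWAYS_TIME_FIRST
        || si->iterations % SAMPLE_EVERY == 0;
}

/* call once per node execution; 'elapsed' is only valid when timed */
static void record(SampleInstr *si, int timed, double elapsed)
{
    si->iterations++;
    if (timed) {
        si->sampled++;
        si->sampled_total += elapsed;
    }
}

/* estimated total: assumes the sampled iterations are representative,
 * which is exactly the assumption that broke down in practice
 * (e.g. batched hash joins have wildly uneven per-iteration costs) */
static double estimate_total(const SampleInstr *si)
{
    if (si->sampled == 0)
        return 0.0;
    return si->sampled_total * ((double) si->iterations / si->sampled);
}
)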
Peter Eisentraut <peter_e@gmx.net> writes:
> On Friday, 15 December 2006 11:28, Simon Riggs wrote:
>> Until we work out a better solution we can fix this in two ways:
>>
>> 1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...
>> 2. enable_analyze_timer = off | on (default)  (USERSET)

> The second one is enough in my mind.

I don't see any point in either one.  If you're not going to collect timing data then the only useful info EXPLAIN ANALYZE could provide is knowledge of which rowcount estimates are badly off ... and to get that, you have to wait for the query to finish, which may well be impractical even without the gettimeofday overhead.  We had discussed upthread the idea of having an option to issue a NOTICE as soon as any actual rowcount exceeds the estimate by some-configurable-percentage, and that seems to me to be a much more useful response to the problem of "E.A. takes too long" than removing gettimeofday.

One thing that's not too clear to me though is how the NOTICE would identify the node at which the rowcount was exceeded...

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
> > There are various attempts at providing better timing infrastructure at low
> > overhead but I'm not sure what's out there currently. I expect that to do this,
> > what we'll have to do is invent a pg_* abstraction that has various implementations
> > on different architectures.
>
> You've got to be kidding.  Surely it's glibc's responsibility, not ours,
> to implement gettimeofday correctly for the hardware.

Except for two things:

a) We don't really need gettimeofday. That means we don't need something sensitive to adjustments made by ntp etc. In fact that would be actively bad. Currently if the user runs "date" to reset his clock back a few days, I bet interesting things happen to a large explain analyze that's running. In fact we don't need something that represents any absolute time, only time elapsed since some other point we choose. That might be easier to implement than what glibc has to do to implement gettimeofday fully.

b) glibc may not want to endure an overhead on every syscall and context switch to make gettimeofday faster, on the assumption that gettimeofday is a rare call and it should pay the price rather than imposing an overhead on everything else. Postgres knows when it's running an explain analyze, and a 1% overhead would be entirely tolerable, especially if it affected the process pretty much evenly, unlike the per-gettimeofday overhead, which can get as high as 100% on some types of subplans and is negligible on others. And more to the point, Postgres wouldn't have to endure this overhead at all when it's not needed, whereas glibc has no idea when you're going to need gettimeofday next.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
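(On point (a), POSIX does expose an elapsed-time-only clock, clock_gettime(CLOCK_MONOTONIC), which is unaffected by date/ntp adjustments; whether it is available and fast on a given platform is a separate question, so treat this as a sketch of the interface rather than a portability claim:

/* elapsed time from a monotonic clock: unaffected by settimeofday()/ntp steps
 * (link with -lrt on older glibc) */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    double elapsed;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... work being measured ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %.9f seconds\n", elapsed);
    return 0;
}
)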
At 10:45 AM 12/15/2006, Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
> > There are various attempts at providing better timing infrastructure at low
> > overhead but I'm not sure what's out there currently. I expect that to do this,
> > what we'll have to do is invent a pg_* abstraction that has various
> > implementations on different architectures.
>
> You've got to be kidding.  Surely it's glibc's responsibility, not ours,
> to implement gettimeofday correctly for the hardware.
>
> 			regards, tom lane

I agree with Tom on this. Perhaps the best compromise is for the pg community to make thoughtful suggestions to the glibc community?

Ron Peacetree
On Fri, 2006-12-15 at 10:57 -0500, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > On Friday, 15 December 2006 11:28, Simon Riggs wrote:
> >> Until we work out a better solution we can fix this in two ways:
> >>
> >> 1. EXPLAIN ANALYZE [ [ WITH | WITHOUT ] TIME STATISTICS ] ...
> >> 2. enable_analyze_timer = off | on (default)  (USERSET)
>
> > The second one is enough in my mind.
>
> I don't see any point in either one.  If you're not going to collect
> timing data then the only useful info EXPLAIN ANALYZE could provide is
> knowledge of which rowcount estimates are badly off ... and to get that,
> you have to wait for the query to finish, which may well be impractical
> even without the gettimeofday overhead.

On a different part of this thread, you say:

On Fri, 2006-12-15 at 09:56 -0500, Tom Lane wrote:
> The fundamental problem with it was the assumption that different
> executions of a plan node will have the same timing.  That's not true,
> in fact not even approximately true.

It doesn't make sense to me to claim that the timing is so important that we cannot do without it, while at the same time saying it isn't even approximately repeatable, i.e. that it is highly variable.

> We had discussed upthread the
> idea of having an option to issue a NOTICE as soon as any actual
> rowcount exceeds the estimate by some-configurable-percentage, and that
> seems to me to be a much more useful response to the problem of
> "E.A. takes too long" than removing gettimeofday.

> One thing that's not too clear to me though is how the NOTICE would
> identify the node at which the rowcount was exceeded...

We'd have to output the whole EXPLAIN as a NOTICE for it to make any sense. If we can't do without the timings, then half an EXPLAIN would be even worse.

We'd need to take account of non-linear nodes. Hash nodes react badly beyond a certain point, HashAgg even worse. Sort performs poorly after the end of memory, as does Materialize. Other nodes are more linear, so would need a different percentage. I don't like the sound of a whole gaggle of GUCs to describe that. Any ideas?

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Fri, 2006-12-15 at 09:56 -0500, Tom Lane wrote: >> The fundamental problem with it was the assumption that different >> executions of a plan node will have the same timing. That's not true, >> in fact not even approximately true. > It doesn't make sense to me to claim that the timing is so important > that we cannot do without it, at the same time as saying it isn't even > approximately true that is highly variable. Huh? What I said was that successive executions of the same plan node may take considerably different amounts of time, and the proposed sampling patch failed to handle that situation with acceptable accuracy. regards, tom lane