Thread: That EXPLAIN ANALYZE patch still needs work
Just got this rather surprising result:

regression=# \timing
Timing is on.
regression=# explain analyze select * from tenk1 a join tenk1 b on a.unique1 = b.unique2;
                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=250.510..2557.129 rows=10000 loops=1)
   Hash Cond: (a.unique1 = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.084..30.070 rows=10000 loops=1)
   ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=250.242..250.242 rows=10000 loops=1)
         ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.019..23.317 rows=10000 loops=1)
 Total runtime: 847.855 ms
(6 rows)

Time: 856.179 ms
regression=#

The "Total runtime" is correct AFAICT, which puts the top node's "actual time" rather far out in left field.  This is pretty repeatable on my old slow HPPA machine.  A new Xeon shows less of a discrepancy, but it's still claiming top node actual > total, which is not right.

I realize that the sampling code can't be expected to be exactly right, but a factor-of-three error is not acceptable.  One of us broke this :-(

			regards, tom lane
On Mon, Jun 05, 2006 at 11:02:33PM -0400, Tom Lane wrote:
> Just got this rather surprising result:

<snip bogus explain output>

> The "Total runtime" is correct AFAICT, which puts the top node's "actual
> time" rather far out in left field.  This is pretty repeatable on my old
> slow HPPA machine.  A new Xeon shows less of a discrepancy, but it's
> still claiming top node actual > total, which is not right.

Weird. Can you get the output of *instr in each call of InstrEndLoop? Preferably after it does the calculation but before it clears the values, so we get an idea of the number of samples and what it guesses. SampleOverhead would be good too.

I know my version produced sensible results on my machine and for the handful of people testing, so I'll try it again with your changes and see how it looks...

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes:
> Weird. Can you get the output of *instr in each call of
> InstrEndLoop? Preferably after it does the calculation but before it
> clears the values, so we get an idea of the number of samples and what
> it guesses. SampleOverhead would be good too.

The problem looks to be an underestimation of SampleOverhead, and on reflection it's clear why: what CalculateSampleOverhead is measuring isn't the total overhead, but the time between the two gettimeofday calls.  Which is probably about half the true overhead.  What we ought to do is iterate InstrStartNode/InstrStopNode N times, and *separately* measure the total elapsed time spent.

It occurs to me that what we really want to know is not so much the total time spent in InstrStartNode/InstrStopNode, as the difference in the time spent when sampling is on vs. when it is off.  I'm not quite sure that the time spent when it's off is negligible.

Off to do some measuring ...

			regards, tom lane
I wrote:
> The problem looks to be an underestimation of SampleOverhead, and on
> reflection it's clear why: what CalculateSampleOverhead is measuring
> isn't the total overhead, but the time between the two gettimeofday
> calls.  Which is probably about half the true overhead.

On further thought, I take that back: the "true overhead" is not the point here.  The time elapsed during a plan node execution when sampling can be broken down into three phases:

	time before the first gettimeofday call
	time between the gettimeofday calls
	time after the second gettimeofday call

Only the second part is actually measured by the instrumentation system; the other parts are overhead that has never been counted by EXPLAIN ANALYZE, sampling version or no.  Moreover, most of the runtime of InstrStartNode and InstrStopNode falls into the first or third parts.

What we would actually like to set SampleOverhead to is the portion of the second-part runtime that doesn't occur when sampling = false.  Assuming that gettimeofday() has consistent runtime and the actual time reported is measured at a consistent instant within that runtime, I believe we should take SampleOverhead as just equal to the runtime of a single gettimeofday() call.  The added or removed second-part time within InstrStartNode is just the tail time of gettimeofday, and the added or removed second-part time within InstrStopNode is basically just the head time of gettimeofday.  (To make this as true as possible, we need to change the order of operations so that gettimeofday is invoked *immediately* after the "if (sampling)" test, but that's easy.)

So this line of thought leads to the conclusion that CalculateSampleOverhead is actually overestimating SampleOverhead a bit, and we should simplify it to just time INSTR_TIME_SET_CURRENT().

But that still leaves me with a problem, because my machine is clearly overestimating the correction needed.  I added some printouts and got

	raw totaltime = 0.370937
	per_iter = 0.000156913, SampleOverhead = 3.28e-06
	adj totaltime = 1.82976
	sampling = 0
	starttime = 0/000000
	counter = 0/370937
	firsttuple = 0.258321
	tuplecount = 10000
	itercount = 10001
	samplecount = 704
	nextsample = 10011
	startup = 0.258321
	total = 1.82976
	ntuples = 10000
	nloops = 1

on a run with an actual elapsed time near 750 msec.  Clearly the sampling adjustment is wrong, but why?

I have a theory about this, and it's not pleasant at all.  What I think is that we have a Heisenberg problem here: the act of invoking gettimeofday() actually changes what is measured.  That is, the runtime of the "second part" of ExecProcNode is actually longer when we sample than when we don't, not merely due to the extra time spent in gettimeofday().  It's not very hard to guess at reasons why, either.  The kernel entry is probably flushing some part of the CPU's state, such as the virtual/physical address mapping for the userland address space.  After returning from the kernel call, the time to reload that state shows up as more execution time within the "second part".

This theory explains two observations that otherwise are hard to explain.  One, that the effect is platform-specific: your machine may avoid flushing as much state during a kernel call as mine does.  And two, that upper plan nodes seem much more affected than lower ones.  That makes sense because the execution cycle of an upper node will involve touching more userspace data than a lower node, and therefore more of the flushed TLB entries will need to be reloaded.
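[For concreteness, a minimal standalone sketch of the simplification described above -- estimate SampleOverhead as the average cost of one clock read.  This is not the backend's CalculateSampleOverhead; INSTR_TIME_SET_CURRENT is stood in for by a bare gettimeofday(), and the loop count is arbitrary.]

/*
 * Sketch: estimate the per-sample overhead as the cost of a single
 * gettimeofday() call, averaged over many calls.
 */
#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
	struct timeval start, stop, dummy;
	const int	n_calls = 100000;	/* arbitrary */
	double		elapsed_usec;
	int			i;

	gettimeofday(&start, NULL);
	for (i = 0; i < n_calls; i++)
		gettimeofday(&dummy, NULL);
	gettimeofday(&stop, NULL);

	elapsed_usec = (stop.tv_sec - start.tv_sec) * 1e6 +
		(stop.tv_usec - start.tv_usec);
	printf("approx per-call overhead: %.3f usec\n", elapsed_usec / n_calls);
	return 0;
}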
If this theory is correct, then the entire notion of EXPLAIN ANALYZE sampling has just crashed and burned. We can't ship a measurement tool that is accurate on some platforms and not others. I'm wondering if it's at all practical to go over to a gprof-like measurement method for taking EXPLAIN ANALYZE runtime measurements; that is, take an interrupt every so often and bump the count for the currently active plan node. This would spread the TLB-flush overhead more evenly and thus avoid introducing that bias. There may be too much platform dependency in this idea, though, and I also wonder if it'd break the ability to do normal gprof profiling of the backend. regards, tom lane
On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:
> I have a theory about this, and it's not pleasant at all.  What I
> think is that we have a Heisenberg problem here: the act of invoking
> gettimeofday() actually changes what is measured.  That is, the
> runtime of the "second part" of ExecProcNode is actually longer when
> we sample than when we don't, not merely due to the extra time spent
> in gettimeofday().  It's not very hard to guess at reasons why, either.
> The kernel entry is probably flushing some part of the CPU's state,
> such as the virtual/physical address mapping for the userland address
> space.  After returning from the kernel call, the time to reload
> that state shows up as more execution time within the "second part".
>
> This theory explains two observations that otherwise are hard to
> explain.  One, that the effect is platform-specific: your machine
> may avoid flushing as much state during a kernel call as mine does.
> And two, that upper plan nodes seem much more affected than lower
> ones.  That makes sense because the execution cycle of an upper node
> will involve touching more userspace data than a lower node, and
> therefore more of the flushed TLB entries will need to be reloaded.

If that's the case, then maybe a more sophisticated method of measuring the overhead would work. My thought is that on the second call to pull a tuple from a node (second because the first probably has some anomalies due to startup), we measure the overhead for that node. This would probably mean doing the following:

	get start time    # I'm not referring to this as gettimeofday to avoid confusion
	gettimeofday()    # this is the gettimeofday call that will happen during normal operation
	get end time

Hopefully, there's no caching effect that would come into play from not actually touching any of the data structures after the gettimeofday() call. If that's not the case, it makes measuring the overhead more complex, but I think it should still be doable...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
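[A minimal standalone sketch of that bracketing idea.  For simplicity the outer "start"/"end" reads are themselves gettimeofday() calls, and the function name is made up; this is not part of the actual patch.]

#include <stdio.h>
#include <sys/time.h>

/* time the one gettimeofday() call that normal sampling would make */
static double
measure_sampling_overhead_usec(void)
{
	struct timeval before, sample, after;

	gettimeofday(&before, NULL);	/* "get start time" */
	gettimeofday(&sample, NULL);	/* the call made during normal operation */
	gettimeofday(&after, NULL);		/* "get end time" */

	return (after.tv_sec - before.tv_sec) * 1e6 +
		(after.tv_usec - before.tv_usec);
}

int
main(void)
{
	/* e.g. this would be invoked once per node, on its second tuple */
	printf("observed overhead: %.1f usec\n", measure_sampling_overhead_usec());
	return 0;
}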
"Jim C. Nasby" <jnasby@pervasive.com> writes: > If that's the case, then maybe a more sopdisticated method of measuring > the overhead would work. I think you missed the point: the time spent in gettimeofday() itself is not the major overhead of EXPLAIN ANALYZE. At least it appears that this is the case on my machine. I'm thinking that interrupt-driven sampling might work OK though. My previous worries were based on trying to use the ITIMER_PROF timer, which might not be portable and would conflict with gprof anyway. But EXPLAIN ANALYZE has always been interested in real time rather than CPU time, so the correct interrupt to use is ITIMER_REAL. That means we only have to share with our own existing usages of that interrupt, which turns it from a portability issue into just a Small Matter Of Programming. regards, tom lane
On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:
> But that still leaves me with a problem because my machine is clearly
> overestimating the correction needed.  I added some printouts and got
>
> 	raw totaltime = 0.370937
> 	per_iter = 0.000156913, SampleOverhead = 3.28e-06
> 	adj totaltime = 1.82976
> 	sampling = 0
> 	starttime = 0/000000
> 	counter = 0/370937
> 	firsttuple = 0.258321
> 	tuplecount = 10000
> 	itercount = 10001
> 	samplecount = 704
> 	nextsample = 10011
> 	startup = 0.258321
> 	total = 1.82976
> 	ntuples = 10000
> 	nloops = 1
>
> on a run with an actual elapsed time near 750 msec.  Clearly the
> sampling adjustment is wrong, but why?

This doesn't make any sense at all. How can a sampling run that only sampled 7% of the actual tuples end up with an actual measured time that's more than 50% of the actual final runtime?

Can you get an estimate of the actual overhead (i.e. compare wall clock time of EXPLAIN ANALYZE vs. the actual query)? The cost of gettimeofday() is on the order of a few hundred cycles; I'm not sure about the cost of TLB flushes (that's probably highly architecture dependent).

To be honest, I wonder about caching effects, but for the disks. The first few cycles of any plan (like an index scan) are going to incur costs that won't happen later on. Because we sample much more heavily at the beginning than at the end, this will bias towards higher numbers. You should be able to see this by checking whether queries that don't require disk access fare better. That would suggest a much more careful correction method that works for non-linear timing patterns...

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes:
> This doesn't make any sense at all. How can a sampling run that only
> sampled 7% of the actual tuples end up with an actual measured time
> that's more than 50% of the actual final runtime?

AFAICS, the only conclusion is that the sampled executions are in fact taking longer than supposedly-equivalent unsampled ones, and by a pretty good percentage too.  I'm growing unsure again about the mechanism responsible for that, however.

> Can you get an estimate of the actual overhead (i.e. compare wall clock
> time of EXPLAIN ANALYZE vs. the actual query)? The cost of gettimeofday()
> is on the order of a few hundred cycles; I'm not sure about the cost of
> TLB flushes (that's probably highly architecture dependent).

Here are some examples.  Keep in mind I've already determined that gettimeofday() takes about 3 usec on this hardware ...

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=797.412..797.416 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.208..2576.528 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.190..2082.577 rows=10000 loops=1)
               Hash Cond: (a.unique1 = b.unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.082..3.718 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=207.933..207.933 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.017..3.583 rows=10000 loops=1)
 Total runtime: 805.036 ms
(8 rows)

Time: 816.463 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
 count
-------
 10000
(1 row)

Time: 816.970 ms

The actual elapsed time for EXPLAIN ANALYZE seems to jump around quite a bit, probably because of the random variation we're using in the sampling interval.  This particular combination was unusually close.  But in any case, the *actual* overhead of EXPLAIN ANALYZE is clearly pretty small here; the problem is that we're incorrectly extrapolating the measured runtime to the unmeasured executions.
What's especially interesting is that the excess time doesn't seem to show up if I form the query in a way that doesn't require pushing as much data around:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1341.00..1341.01 rows=1 width=0) (actual time=212.313..212.317 rows=1 loops=1)
   ->  Hash Join  (cost=483.00..1316.00 rows=10000 width=0) (actual time=88.061..160.886 rows=10000 loops=1)
         Hash Cond: (a.unique1 = b.unique2)
         ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.071..4.068 rows=10000 loops=1)
         ->  Hash  (cost=458.00..458.00 rows=10000 width=4) (actual time=87.862..87.862 rows=10000 loops=1)
               ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.031..4.780 rows=10000 loops=1)
 Total runtime: 221.022 ms
(7 rows)

Time: 229.377 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
 count
-------
 10000
(1 row)

Time: 202.531 ms
regression=#

(Without the OFFSET 0, the planner flattens the subquery and discovers that it doesn't actually need to fetch any of the non-join-key table columns.)  Note the only plan nodes showing whacked-out timings are the ones returning wide tuples (large "width" values).

I'm not entirely sure what to make of this.  It could be interpreted as evidence for my theory about TLB reloads during userspace data access being the problem.  But I'm getting a bit disenchanted with that theory after running the same test case in 8.1:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=1033.866..1033.870 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.961..989.272 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.946..920.398 rows=10000 loops=1)
               Hash Cond: ("outer".unique1 = "inner".unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.050..66.861 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=330.745..330.745 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.010..65.890 rows=10000 loops=1)
 Total runtime: 1041.293 ms
(8 rows)

Time: 1051.594 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
 count
-------
 10000
(1 row)

Time: 870.348 ms

Both runtimes are pretty repeatable.  That says that on this example, where we have sampled 40000 node executions (remember HASH is only called once), the overhead of EXPLAIN ANALYZE is about 4.5 usec per node, which is not very far from twice the gettimeofday() time; it certainly doesn't show any evidence for a huge additional overhead due to TLB flushes.

One thing I was wondering about was whether the samplings of upper nodes might be correlating to a greater-than-random extent with the samplings of lower nodes.  That would result in the observed problem of overestimating upper nodes.  It doesn't explain the apparent dependence on targetlist width, though.  So I'm confused again.
> To be honest, I wonder about caching effects, but for the disks. All of the above are repeated executions that should be fully cached in RAM. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > And two, that upper plan nodes seem much more affected than lower > ones. That makes sense because the execution cycle of an upper node > will involve touching more userspace data than a lower node, and > therefore more of the flushed TLB entries will need to be reloaded. I would have expected the opposite effect. If you only execute one instruction then the cache miss can make it take many times longer than normal. But as the number of instructions grows the cache gets repopulated and the overhead levels off and becomes negligible relative to the total time. The other option aside from gprof-like profiling would be to investigate those cpu timing instructions again. I know some of them are unsafe on multi-cpu systems but surely there's a solution out there. It's not like there aren't a million games, music playing, and other kewl kid toys that depend on accurate low overhead timing these days. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> And two, that upper plan nodes seem much more affected than lower >> ones. That makes sense because the execution cycle of an upper node >> will involve touching more userspace data than a lower node, and >> therefore more of the flushed TLB entries will need to be reloaded. > I would have expected the opposite effect. If you only execute one instruction > then the cache miss can make it take many times longer than normal. But as the > number of instructions grows the cache gets repopulated and the overhead > levels off and becomes negligible relative to the total time. Well, none of our plan nodes are in the "one instruction" regime ;-). I was thinking that the total volume of data accessed was the critical factor. Right at the moment I'm disillusioned with the TLB-access theory though. Something I'm noticing right now is that it seems like only hash joins are really seriously misestimated --- nest and merge joins have some small issues but only the hash is way out there. What's going on?? Can anyone else reproduce this? > The other option aside from gprof-like profiling would be to > investigate those cpu timing instructions again. I know some of them > are unsafe on multi-cpu systems but surely there's a solution out > there. It's not like there aren't a million games, music playing, and > other kewl kid toys that depend on accurate low overhead timing these > days. Yeah, and they all work only on Windoze and Intel chips :-( regards, tom lane
On Tue, 2006-06-06 at 16:50 -0400, Tom Lane wrote:
> I have a theory about this, and it's not pleasant at all.  What I
> think is that we have a Heisenberg problem here: the act of invoking
> gettimeofday() actually changes what is measured.

> If this theory is correct, then the entire notion of EXPLAIN ANALYZE
> sampling has just crashed and burned.  We can't ship a measurement
> tool that is accurate on some platforms and not others.

Regrettably, I would agree and so conclude that we shouldn't pursue the sampling idea further.  Heisenbugs suck time like no other.  Interesting, though.

That leaves us with a number of possibilities:

0. Do nothing
1. Option to skip timing altogether on an EXPLAIN ANALYZE
2. Option to produce a partial execution only, to locate problem areas

Any others?

Option 2 would be harder to interpret, but still useful - originally discussed in a current thread on -perform.  Option 1 wouldn't be as useful as the original sampling idea, but if that's not on the table any longer... I'd revert back to Option 1 as being the best choice for further work.

Do we agree the idea can't go further?  What next?

-- 
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > Do we agree the idea can't go further? What next? It still needs investigation; I'm no longer convinced that the TLB-flush theory is correct. See rest of thread. We may well have to revert the current patch, but I'd like to be sure we understand why. If we do have to revert, I'd propose that we pursue the notion of interrupt-driven sampling like gprof uses. regards, tom lane
On Wed, Jun 07, 2006 at 09:53:32AM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Do we agree the idea can't go further? What next?
>
> It still needs investigation; I'm no longer convinced that the TLB-flush
> theory is correct.  See rest of thread.  We may well have to revert the
> current patch, but I'd like to be sure we understand why.

One thing I'm thinking of trying: instead of assuming we can estimate the durations of all the nodes by taking the total time divided by samples, assume that the duration of tuple X is similar to the duration of tuple X+1, but not necessarily the same as all other tuples. This moves the calculation from EndLoop to StopInstr. Basically, in StopInstr you do the steps:

	if (sampling)
	{
		x = get time for this tuple
		n = number of tuples skipped
		cumulativetime += x*n
	}

This would mean that we wouldn't be assuming that tuples near the end take as long as tuples near the beginning. Except we're now dealing with smaller numbers, so I'm worried about error accumulation.

> If we do have to revert, I'd propose that we pursue the notion of
> interrupt-driven sampling like gprof uses.

How would that work? You could then estimate how much time was spent in each node, but you no longer have any idea about when they were entered or left.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
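[A minimal standalone sketch of the per-sample extrapolation described above.  This is not the real instrument.c code; the struct and names are illustrative only.]

#include <stdio.h>

typedef struct SketchInstr
{
	double		cumulative_time;	/* extrapolated total, in seconds */
	long		last_sampled;		/* iteration number of the previous sample */
} SketchInstr;

/* charge the measured duration to every iteration skipped since the last sample */
static void
sketch_stop_instr(SketchInstr *instr, long iterno, double this_tuple_time)
{
	long		n_skipped = iterno - instr->last_sampled;

	instr->cumulative_time += this_tuple_time * n_skipped;
	instr->last_sampled = iterno;
}

int
main(void)
{
	SketchInstr instr = {0.0, 0};

	/* pretend iterations 3 and 10 were sampled, at 50us and 20us respectively */
	sketch_stop_instr(&instr, 3, 50e-6);
	sketch_stop_instr(&instr, 10, 20e-6);

	printf("extrapolated total: %g sec\n", instr.cumulative_time);
	return 0;
}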
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Wed, Jun 07, 2006 at 09:53:32AM -0400, Tom Lane wrote:
>> If we do have to revert, I'd propose that we pursue the notion of
>> interrupt-driven sampling like gprof uses.

> How would that work? You could then estimate how much time was spent in
> each node, but you no longer have any idea about when they were entered
> or left.

Hm?  It would work the same way gprof works.  I'm imagining something along the lines of

global variable:
	volatile Instrumentation *current_instr = NULL;

also add an "Instrumentation *parent_instr" field to Instrumentation

InstrStartNode does:
	instr->parent_instr = current_instr;
	current_instr = instr;

InstrStopNode restores the previous value of the global:
	current_instr = instr->parent_instr;

timer interrupt routine does this once every few milliseconds:
	total_samples++;
	for (instr = current_instr; instr; instr = instr->parent_instr)
		instr->samples++;

You still count executions and tuples in InstrStartNode/InstrStopNode, but you never call gettimeofday there.  You *do* call gettimeofday at the beginning and end of the whole query to measure the total runtime, and then you blame a fraction samples/total_samples of that on each node.

The bubble-up of sample counts to parent nodes could perhaps be done while printing the results instead of on-the-fly as sketched above, but the above seems simpler.

			regards, tom lane
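[For concreteness, a minimal standalone C sketch of that interrupt-driven idea using ITIMER_REAL and SIGALRM.  The Instr struct is a toy stand-in, not the backend's Instrumentation, and the 5 ms interval and busy loop are arbitrary.]

#include <stdio.h>
#include <signal.h>
#include <sys/time.h>

typedef struct Instr
{
	volatile long samples;
	struct Instr *parent;
} Instr;

static Instr *volatile current_instr = NULL;
static volatile long total_samples = 0;

/* timer interrupt: charge one sample to the active node and all its parents */
static void
sample_handler(int signo)
{
	Instr	   *i;

	(void) signo;
	total_samples++;
	for (i = current_instr; i != NULL; i = i->parent)
		i->samples++;
}

static void
start_sampling(long usec_interval)
{
	struct sigaction sa;
	struct itimerval t;

	sa.sa_handler = sample_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	sigaction(SIGALRM, &sa, NULL);

	t.it_interval.tv_sec = 0;
	t.it_interval.tv_usec = usec_interval;
	t.it_value = t.it_interval;
	setitimer(ITIMER_REAL, &t, NULL);
}

int
main(void)
{
	Instr		outer = {0, NULL};
	Instr		inner = {0, &outer};	/* inner node's parent is outer */
	volatile double burn = 0.0;
	long		i;

	start_sampling(5000);		/* sample every 5 ms */

	current_instr = &inner;		/* "enter" the inner node */
	for (i = 0; i < 50000000L; i++)
		burn += i * 0.5;		/* stand-in for real executor work */
	current_instr = NULL;		/* "leave" */

	printf("total=%ld inner=%ld outer=%ld (burn=%g)\n",
		   total_samples, inner.samples, outer.samples, burn);
	return 0;
}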
Martijn van Oosterhout <kleptog@svana.org> writes:
> This would mean that we wouldn't be assuming that tuples near the end
> take as long as tuples near the beginning. Except we're now dealing
> with smaller numbers, so I'm worried about error accumulation.

Hm, that would explain why Hash joins suffer from this especially.  Even when functioning properly, hashes get slower as the buckets fill up and there are longer lists to traverse.

Perhaps the hashes are suffering inordinately from collisions though.  Some of the data type hash functions looked kind of suspect when I peeked at them a while back.

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> This would mean that we wouldn't be assuming that tuples near the end
>> take as long as tuples near the beginning. Except we're now dealing
>> with smaller numbers, so I'm worried about error accumulation.

> Hm, that would explain why Hash joins suffer from this especially.  Even
> when functioning properly, hashes get slower as the buckets fill up and
> there are longer lists to traverse.

Nope, that is certainly not the explanation, because the hash table is loaded in the (single) call of the Hash node at the start of the query.  It is static all through the sampled-and-not executions of the Hash Join node, which is where our problem is.

I don't see that Martijn's idea responds to the problem anyway, if it is some sort of TLB-related issue.  The assumption we are making is not "tuples near the end take as long as tuples near the beginning", it is "tuples we sample take as long as tuples we don't" (both statements of course meaning "on the average").  If the act of sampling incurs overhead beyond the gettimeofday() call itself, then we are screwed, and playing around with which iterations we sample and how we do the extrapolation won't make the slightest bit of difference.

I'm unsure about the TLB-flush theory because I see no evidence of any such overhead in the 8.1 timings; but on the other hand it's hard to see what else could explain the apparent dependence on targetlist width.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Nope, that is certainly not the explanation, because the hash table is > loaded in the (single) call of the Hash node at the start of the query. > It is static all through the sampled-and-not executions of the Hash Join > node, which is where our problem is. At the risk of asking a stupid question, it's not perchance including that hash build in the first sample it takes of the hash join node? -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Nope, that is certainly not the explanation, because the hash table is >> loaded in the (single) call of the Hash node at the start of the query. >> It is static all through the sampled-and-not executions of the Hash Join >> node, which is where our problem is. > At the risk of asking a stupid question, it's not perchance including that > hash build in the first sample it takes of the hash join node? Sure. Which is one of the reasons why the first tuple is excluded from the extrapolation... regards, tom lane
Ah-hah, I've sussed it. The faulty assumption can actually be stated as: "all the executions, except maybe the first, will take approximately the same amount of time". The failing test case I've been looking at is one where the system decides to use a "batched" hash join, and in this plan type some iterations take much longer than others. (The apparent dependence on targetlist width is now easy to understand, because that affects the estimated hashtable size and hence the decision whether batching is needed. I'm not sure why I don't see the effect when running the identical case on my other machine, but since the other one is a 64-bit machine its space calculations are probably a bit different.) I added some printf's to instrument.c to print out the actual time measurements for each sample, as well as the calculations in InstrEndLoop. Attached are the printouts that occurred for the HashJoin node. The thing that is killing the extrapolation is of course the very large time for iteration 2, which the extrapolation includes in its average. But there's well over 10:1 variation in the later samples as well. On reflection it's easy to imagine other cases where some iterations take much longer than others in a not-very-predictable way. For instance, a join where only a subset of the outer tuples have a match is going to act that way. I don't think there's any good way that we can be sure we have a representative sample of executions, and so I'm afraid this approach to sampling EXPLAIN ANALYZE is a failure. I propose we revert this patch and think about an interrupt-driven sampling method instead. regards, tom lane 401489a0 1: 331616 usec in iter 1 401489a0 1: 110338 usec in iter 2 401489a0 1: 54 usec in iter 3 401489a0 1: 99 usec in iter 4 401489a0 1: 77 usec in iter 5 401489a0 1: 145 usec in iter 6 401489a0 1: 117 usec in iter 7 401489a0 1: 33 usec in iter 8 401489a0 1: 97 usec in iter 9 401489a0 1: 98 usec in iter 10 401489a0 1: 52 usec in iter 11 401489a0 1: 33 usec in iter 12 401489a0 1: 51 usec in iter 13 401489a0 1: 83 usec in iter 14 401489a0 1: 153 usec in iter 15 401489a0 1: 115 usec in iter 16 401489a0 1: 52 usec in iter 17 401489a0 1: 242 usec in iter 18 401489a0 1: 48 usec in iter 19 401489a0 1: 87 usec in iter 20 401489a0 1: 23 usec in iter 21 401489a0 1: 80 usec in iter 22 401489a0 1: 57 usec in iter 23 401489a0 1: 17 usec in iter 24 401489a0 1: 51 usec in iter 25 401489a0 1: 18 usec in iter 26 401489a0 1: 16 usec in iter 27 401489a0 1: 100 usec in iter 28 401489a0 1: 45 usec in iter 29 401489a0 1: 174 usec in iter 30 401489a0 1: 131 usec in iter 31 401489a0 1: 17 usec in iter 32 401489a0 1: 45 usec in iter 33 401489a0 1: 16 usec in iter 34 401489a0 1: 120 usec in iter 35 401489a0 1: 15 usec in iter 36 401489a0 1: 17 usec in iter 37 401489a0 1: 15 usec in iter 38 401489a0 1: 48 usec in iter 39 401489a0 1: 127 usec in iter 40 401489a0 1: 36 usec in iter 41 401489a0 1: 41 usec in iter 42 401489a0 1: 69 usec in iter 43 401489a0 1: 50 usec in iter 44 401489a0 1: 104 usec in iter 45 401489a0 1: 22 usec in iter 46 401489a0 1: 50 usec in iter 47 401489a0 1: 17 usec in iter 48 401489a0 1: 47 usec in iter 49 401489a0 1: 54 usec in iter 50 401489a0 1: 46 usec in iter 51 401489a0 1: 20 usec in iter 54 401489a0 1: 38 usec in iter 55 401489a0 1: 68 usec in iter 56 401489a0 1: 17 usec in iter 60 401489a0 1: 16 usec in iter 61 401489a0 1: 15 usec in iter 67 401489a0 1: 31 usec in iter 68 401489a0 1: 15 usec in iter 70 401489a0 1: 61 usec in iter 78 401489a0 1: 143 usec in iter 85 401489a0 1: 21 
usec in iter 89 401489a0 1: 14 usec in iter 96 401489a0 1: 21 usec in iter 104 401489a0 1: 21 usec in iter 107 401489a0 1: 16 usec in iter 116 401489a0 1: 194 usec in iter 118 401489a0 1: 136 usec in iter 122 401489a0 1: 34 usec in iter 127 401489a0 1: 46 usec in iter 131 401489a0 1: 15 usec in iter 133 401489a0 1: 15 usec in iter 135 401489a0 1: 34 usec in iter 137 401489a0 1: 54 usec in iter 142 401489a0 1: 206 usec in iter 151 401489a0 1: 75 usec in iter 162 401489a0 1: 20 usec in iter 172 401489a0 1: 66 usec in iter 177 401489a0 1: 21 usec in iter 181 401489a0 1: 69 usec in iter 186 401489a0 1: 16 usec in iter 193 401489a0 1: 46 usec in iter 201 401489a0 1: 33 usec in iter 210 401489a0 1: 50 usec in iter 216 401489a0 1: 21 usec in iter 222 401489a0 1: 18 usec in iter 224 401489a0 1: 33 usec in iter 229 401489a0 1: 20 usec in iter 232 401489a0 1: 44 usec in iter 236 401489a0 1: 29 usec in iter 239 401489a0 1: 34 usec in iter 240 401489a0 1: 31 usec in iter 241 401489a0 1: 27 usec in iter 254 401489a0 1: 45 usec in iter 257 401489a0 1: 147 usec in iter 259 401489a0 1: 15 usec in iter 269 401489a0 1: 16 usec in iter 278 401489a0 1: 14 usec in iter 279 401489a0 1: 58 usec in iter 290 401489a0 1: 15 usec in iter 291 401489a0 1: 53 usec in iter 295 401489a0 1: 15 usec in iter 306 401489a0 1: 16 usec in iter 318 401489a0 1: 34 usec in iter 328 401489a0 1: 37 usec in iter 339 401489a0 1: 28 usec in iter 346 401489a0 1: 21 usec in iter 353 401489a0 1: 15 usec in iter 361 401489a0 1: 49 usec in iter 370 401489a0 1: 61 usec in iter 380 401489a0 1: 21 usec in iter 385 401489a0 1: 50 usec in iter 397 401489a0 1: 36 usec in iter 411 401489a0 1: 44 usec in iter 417 401489a0 1: 55 usec in iter 420 401489a0 1: 15 usec in iter 427 401489a0 1: 74 usec in iter 429 401489a0 1: 178 usec in iter 433 401489a0 1: 182 usec in iter 440 401489a0 1: 41 usec in iter 454 401489a0 1: 39 usec in iter 455 401489a0 1: 28 usec in iter 470 401489a0 1: 57 usec in iter 473 401489a0 1: 15 usec in iter 480 401489a0 1: 21 usec in iter 489 401489a0 1: 76 usec in iter 498 401489a0 1: 38 usec in iter 513 401489a0 1: 21 usec in iter 523 401489a0 1: 33 usec in iter 529 401489a0 1: 39 usec in iter 545 401489a0 1: 25 usec in iter 559 401489a0 1: 27 usec in iter 566 401489a0 1: 15 usec in iter 574 401489a0 1: 25 usec in iter 590 401489a0 1: 62 usec in iter 606 401489a0 1: 27 usec in iter 610 401489a0 1: 16 usec in iter 625 401489a0 1: 49 usec in iter 641 401489a0 1: 27 usec in iter 654 401489a0 1: 22 usec in iter 665 401489a0 1: 17 usec in iter 669 401489a0 1: 39 usec in iter 682 401489a0 1: 27 usec in iter 684 401489a0 1: 68 usec in iter 697 401489a0 1: 15 usec in iter 700 401489a0 1: 16 usec in iter 712 401489a0 1: 15 usec in iter 728 401489a0 1: 22 usec in iter 746 401489a0 1: 16 usec in iter 752 401489a0 1: 45 usec in iter 766 401489a0 1: 44 usec in iter 777 401489a0 1: 33 usec in iter 785 401489a0 1: 21 usec in iter 802 401489a0 1: 23 usec in iter 814 401489a0 1: 15 usec in iter 820 401489a0 1: 21 usec in iter 839 401489a0 1: 23 usec in iter 840 401489a0 1: 39 usec in iter 858 401489a0 1: 94 usec in iter 867 401489a0 1: 33 usec in iter 877 401489a0 1: 56 usec in iter 895 401489a0 1: 14 usec in iter 898 401489a0 1: 22 usec in iter 900 401489a0 1: 14 usec in iter 910 401489a0 1: 20 usec in iter 918 401489a0 1: 14 usec in iter 937 401489a0 1: 28 usec in iter 942 401489a0 1: 29 usec in iter 944 401489a0 1: 21 usec in iter 947 401489a0 1: 68 usec in iter 957 401489a0 1: 46 usec in iter 969 401489a0 1: 15 usec in iter 971 401489a0 1: 
16 usec in iter 987 401489a0 1: 22 usec in iter 1000 401489a0 1: 36 usec in iter 1001 401489a0 1: 22 usec in iter 1017 401489a0 1: 17 usec in iter 1031 401489a0 1: 36 usec in iter 1042 401489a0 1: 21 usec in iter 1062 401489a0 1: 15 usec in iter 1077 401489a0 1: 21 usec in iter 1097 401489a0 1: 88 usec in iter 1114 401489a0 1: 62 usec in iter 1123 401489a0 1: 16 usec in iter 1131 401489a0 1: 21 usec in iter 1149 401489a0 1: 15 usec in iter 1156 401489a0 1: 33 usec in iter 1177 401489a0 1: 22 usec in iter 1190 401489a0 1: 83 usec in iter 1201 401489a0 1: 16 usec in iter 1209 401489a0 1: 184 usec in iter 1224 401489a0 1: 33 usec in iter 1234 401489a0 1: 39 usec in iter 1236 401489a0 1: 21 usec in iter 1247 401489a0 1: 33 usec in iter 1267 401489a0 1: 16 usec in iter 1277 401489a0 1: 15 usec in iter 1297 401489a0 1: 14 usec in iter 1319 401489a0 1: 21 usec in iter 1329 401489a0 1: 15 usec in iter 1342 401489a0 1: 16 usec in iter 1357 401489a0 1: 57 usec in iter 1378 401489a0 1: 16 usec in iter 1400 401489a0 1: 16 usec in iter 1410 401489a0 1: 113 usec in iter 1417 401489a0 1: 16 usec in iter 1418 401489a0 1: 15 usec in iter 1433 401489a0 1: 23 usec in iter 1434 401489a0 1: 96 usec in iter 1442 401489a0 1: 30 usec in iter 1449 401489a0 1: 121 usec in iter 1454 401489a0 1: 16 usec in iter 1462 401489a0 1: 29 usec in iter 1482 401489a0 1: 48 usec in iter 1490 401489a0 1: 40 usec in iter 1512 401489a0 1: 20 usec in iter 1529 401489a0 1: 48 usec in iter 1545 401489a0 1: 45 usec in iter 1547 401489a0 1: 271 usec in iter 1557 401489a0 1: 15 usec in iter 1558 401489a0 1: 44 usec in iter 1576 401489a0 1: 15 usec in iter 1592 401489a0 1: 148 usec in iter 1616 401489a0 1: 15 usec in iter 1640 401489a0 1: 34 usec in iter 1650 401489a0 1: 27 usec in iter 1659 401489a0 1: 15 usec in iter 1673 401489a0 1: 59 usec in iter 1686 401489a0 1: 22 usec in iter 1689 401489a0 1: 16 usec in iter 1696 401489a0 1: 27 usec in iter 1715 401489a0 1: 23 usec in iter 1736 401489a0 1: 16 usec in iter 1749 401489a0 1: 27 usec in iter 1757 401489a0 1: 15 usec in iter 1777 401489a0 1: 15 usec in iter 1801 401489a0 1: 16 usec in iter 1804 401489a0 1: 227 usec in iter 1829 401489a0 1: 27 usec in iter 1848 401489a0 1: 16 usec in iter 1869 401489a0 1: 33 usec in iter 1881 401489a0 1: 57 usec in iter 1902 401489a0 1: 29 usec in iter 1926 401489a0 1: 15 usec in iter 1932 401489a0 1: 58 usec in iter 1955 401489a0 1: 15 usec in iter 1980 401489a0 1: 22 usec in iter 1992 401489a0 1: 14 usec in iter 2001 401489a0 1: 56 usec in iter 2010 401489a0 1: 22 usec in iter 2018 401489a0 1: 16 usec in iter 2043 401489a0 1: 15 usec in iter 2046 401489a0 1: 50 usec in iter 2051 401489a0 1: 16 usec in iter 2053 401489a0 1: 22 usec in iter 2078 401489a0 1: 89 usec in iter 2094 401489a0 1: 69 usec in iter 2103 401489a0 1: 33 usec in iter 2121 401489a0 1: 28 usec in iter 2124 401489a0 1: 34 usec in iter 2135 401489a0 1: 22 usec in iter 2148 401489a0 1: 22 usec in iter 2167 401489a0 1: 15 usec in iter 2186 401489a0 1: 40 usec in iter 2199 401489a0 1: 23 usec in iter 2219 401489a0 1: 21 usec in iter 2223 401489a0 1: 27 usec in iter 2235 401489a0 1: 14 usec in iter 2238 401489a0 1: 22 usec in iter 2250 401489a0 1: 31 usec in iter 2252 401489a0 1: 36 usec in iter 2259 401489a0 1: 15 usec in iter 2282 401489a0 1: 259 usec in iter 2300 401489a0 1: 37 usec in iter 2318 401489a0 1: 21 usec in iter 2331 401489a0 1: 16 usec in iter 2345 401489a0 1: 15 usec in iter 2350 401489a0 1: 16 usec in iter 2361 401489a0 1: 69 usec in iter 2364 401489a0 1: 29 usec in iter 
2374 401489a0 1: 33 usec in iter 2383 401489a0 1: 116 usec in iter 2400 401489a0 1: 37 usec in iter 2420 401489a0 1: 39 usec in iter 2447 401489a0 1: 29 usec in iter 2455 401489a0 1: 114 usec in iter 2476 401489a0 1: 33 usec in iter 2488 401489a0 1: 22 usec in iter 2493 401489a0 1: 53 usec in iter 2512 401489a0 1: 22 usec in iter 2539 401489a0 1: 57 usec in iter 2563 401489a0 1: 32 usec in iter 2564 401489a0 1: 39 usec in iter 2572 401489a0 1: 22 usec in iter 2574 401489a0 1: 14 usec in iter 2597 401489a0 1: 74 usec in iter 2610 401489a0 1: 23 usec in iter 2618 401489a0 1: 21 usec in iter 2626 401489a0 1: 16 usec in iter 2627 401489a0 1: 45 usec in iter 2653 401489a0 1: 28 usec in iter 2662 401489a0 1: 20 usec in iter 2679 401489a0 1: 27 usec in iter 2694 401489a0 1: 17 usec in iter 2696 401489a0 1: 14 usec in iter 2706 401489a0 1: 22 usec in iter 2720 401489a0 1: 36 usec in iter 2742 401489a0 1: 20 usec in iter 2753 401489a0 1: 20 usec in iter 2761 401489a0 1: 16 usec in iter 2788 401489a0 1: 22 usec in iter 2795 401489a0 1: 16 usec in iter 2804 401489a0 1: 144 usec in iter 2812 401489a0 1: 16 usec in iter 2817 401489a0 1: 16 usec in iter 2834 401489a0 1: 16 usec in iter 2853 401489a0 1: 53 usec in iter 2872 401489a0 1: 22 usec in iter 2882 401489a0 1: 28 usec in iter 2907 401489a0 1: 38 usec in iter 2922 401489a0 1: 56 usec in iter 2935 401489a0 1: 188 usec in iter 2954 401489a0 1: 15 usec in iter 2964 401489a0 1: 22 usec in iter 2984 401489a0 1: 23 usec in iter 3007 401489a0 1: 27 usec in iter 3028 401489a0 1: 16 usec in iter 3051 401489a0 1: 178 usec in iter 3064 401489a0 1: 15 usec in iter 3082 401489a0 1: 20 usec in iter 3083 401489a0 1: 15 usec in iter 3097 401489a0 1: 14 usec in iter 3120 401489a0 1: 16 usec in iter 3126 401489a0 1: 15 usec in iter 3145 401489a0 1: 15 usec in iter 3171 401489a0 1: 15 usec in iter 3200 401489a0 1: 15 usec in iter 3218 401489a0 1: 15 usec in iter 3242 401489a0 1: 50 usec in iter 3263 401489a0 1: 16 usec in iter 3267 401489a0 1: 15 usec in iter 3284 401489a0 1: 14 usec in iter 3300 401489a0 1: 16 usec in iter 3317 401489a0 1: 15 usec in iter 3342 401489a0 1: 14 usec in iter 3372 401489a0 1: 15 usec in iter 3377 401489a0 1: 14 usec in iter 3379 401489a0 1: 14 usec in iter 3389 401489a0 1: 14 usec in iter 3417 401489a0 1: 16 usec in iter 3419 401489a0 1: 16 usec in iter 3446 401489a0 1: 16 usec in iter 3461 401489a0 1: 15 usec in iter 3475 401489a0 1: 16 usec in iter 3484 401489a0 1: 16 usec in iter 3489 401489a0 1: 14 usec in iter 3511 401489a0 1: 15 usec in iter 3519 401489a0 1: 15 usec in iter 3527 401489a0 1: 15 usec in iter 3552 401489a0 1: 15 usec in iter 3572 401489a0 1: 16 usec in iter 3574 401489a0 1: 14 usec in iter 3584 401489a0 1: 15 usec in iter 3586 401489a0 1: 16 usec in iter 3610 401489a0 1: 14 usec in iter 3619 401489a0 1: 15 usec in iter 3629 401489a0 1: 15 usec in iter 3658 401489a0 1: 14 usec in iter 3682 401489a0 1: 14 usec in iter 3683 401489a0 1: 15 usec in iter 3692 401489a0 1: 15 usec in iter 3718 401489a0 1: 15 usec in iter 3749 401489a0 1: 15 usec in iter 3758 401489a0 1: 16 usec in iter 3783 401489a0 1: 15 usec in iter 3796 401489a0 1: 15 usec in iter 3812 401489a0 1: 15 usec in iter 3818 401489a0 1: 15 usec in iter 3826 401489a0 1: 15 usec in iter 3836 401489a0 1: 15 usec in iter 3851 401489a0 1: 15 usec in iter 3875 401489a0 1: 57 usec in iter 3891 401489a0 1: 15 usec in iter 3922 401489a0 1: 15 usec in iter 3925 401489a0 1: 15 usec in iter 3933 401489a0 1: 16 usec in iter 3942 401489a0 1: 15 usec in iter 3967 401489a0 1: 
14 usec in iter 3994 401489a0 1: 14 usec in iter 4015 401489a0 1: 15 usec in iter 4025 401489a0 1: 16 usec in iter 4056 401489a0 1: 15 usec in iter 4063 401489a0 1: 15 usec in iter 4070 401489a0 1: 15 usec in iter 4079 401489a0 1: 15 usec in iter 4100 401489a0 1: 15 usec in iter 4122 401489a0 1: 14 usec in iter 4149 401489a0 1: 14 usec in iter 4171 401489a0 1: 14 usec in iter 4186 401489a0 1: 16 usec in iter 4208 401489a0 1: 15 usec in iter 4231 401489a0 1: 15 usec in iter 4262 401489a0 1: 14 usec in iter 4280 401489a0 1: 14 usec in iter 4293 401489a0 1: 15 usec in iter 4300 401489a0 1: 14 usec in iter 4323 401489a0 1: 34 usec in iter 4336 401489a0 1: 15 usec in iter 4359 401489a0 1: 15 usec in iter 4378 401489a0 1: 15 usec in iter 4384 401489a0 1: 15 usec in iter 4388 401489a0 1: 15 usec in iter 4406 401489a0 1: 14 usec in iter 4432 401489a0 1: 15 usec in iter 4452 401489a0 1: 15 usec in iter 4460 401489a0 1: 15 usec in iter 4469 401489a0 1: 54 usec in iter 4492 401489a0 1: 15 usec in iter 4493 401489a0 1: 16 usec in iter 4505 401489a0 1: 16 usec in iter 4527 401489a0 1: 16 usec in iter 4557 401489a0 1: 15 usec in iter 4569 401489a0 1: 16 usec in iter 4588 401489a0 1: 15 usec in iter 4598 401489a0 1: 15 usec in iter 4607 401489a0 1: 55 usec in iter 4629 401489a0 1: 14 usec in iter 4644 401489a0 1: 14 usec in iter 4652 401489a0 1: 14 usec in iter 4680 401489a0 1: 14 usec in iter 4692 401489a0 1: 16 usec in iter 4696 401489a0 1: 15 usec in iter 4706 401489a0 1: 15 usec in iter 4718 401489a0 1: 15 usec in iter 4722 401489a0 1: 14 usec in iter 4730 401489a0 1: 16 usec in iter 4731 401489a0 1: 15 usec in iter 4743 401489a0 1: 15 usec in iter 4754 401489a0 1: 16 usec in iter 4785 401489a0 1: 14 usec in iter 4787 401489a0 1: 14 usec in iter 4808 401489a0 1: 14 usec in iter 4827 401489a0 1: 15 usec in iter 4861 401489a0 1: 15 usec in iter 4869 401489a0 1: 14 usec in iter 4890 401489a0 1: 15 usec in iter 4910 401489a0 1: 15 usec in iter 4943 401489a0 1: 14 usec in iter 4967 401489a0 1: 15 usec in iter 4969 401489a0 1: 15 usec in iter 4980 401489a0 1: 15 usec in iter 4985 401489a0 1: 16 usec in iter 5008 401489a0 1: 16 usec in iter 5042 401489a0 1: 15 usec in iter 5049 401489a0 1: 15 usec in iter 5055 401489a0 1: 15 usec in iter 5064 401489a0 1: 16 usec in iter 5071 401489a0 1: 16 usec in iter 5091 401489a0 1: 15 usec in iter 5117 401489a0 1: 15 usec in iter 5136 401489a0 1: 15 usec in iter 5162 401489a0 1: 15 usec in iter 5167 401489a0 1: 14 usec in iter 5180 401489a0 1: 15 usec in iter 5182 401489a0 1: 15 usec in iter 5193 401489a0 1: 14 usec in iter 5196 401489a0 1: 16 usec in iter 5206 401489a0 1: 14 usec in iter 5211 401489a0 1: 16 usec in iter 5221 401489a0 1: 15 usec in iter 5241 401489a0 1: 15 usec in iter 5272 401489a0 1: 54 usec in iter 5284 401489a0 1: 16 usec in iter 5288 401489a0 1: 15 usec in iter 5307 401489a0 1: 16 usec in iter 5337 401489a0 1: 15 usec in iter 5351 401489a0 1: 15 usec in iter 5352 401489a0 1: 16 usec in iter 5374 401489a0 1: 14 usec in iter 5409 401489a0 1: 15 usec in iter 5410 401489a0 1: 14 usec in iter 5442 401489a0 1: 14 usec in iter 5446 401489a0 1: 15 usec in iter 5465 401489a0 1: 14 usec in iter 5489 401489a0 1: 14 usec in iter 5518 401489a0 1: 14 usec in iter 5536 401489a0 1: 15 usec in iter 5563 401489a0 1: 53 usec in iter 5594 401489a0 1: 16 usec in iter 5610 401489a0 1: 15 usec in iter 5616 401489a0 1: 14 usec in iter 5626 401489a0 1: 15 usec in iter 5630 401489a0 1: 15 usec in iter 5662 401489a0 1: 15 usec in iter 5693 401489a0 1: 14 usec in iter 5699 
401489a0 1: 14 usec in iter 5730 401489a0 1: 16 usec in iter 5754 401489a0 1: 15 usec in iter 5787 401489a0 1: 14 usec in iter 5789 401489a0 1: 15 usec in iter 5808 401489a0 1: 16 usec in iter 5809 401489a0 1: 16 usec in iter 5841 401489a0 1: 14 usec in iter 5849 401489a0 1: 14 usec in iter 5873 401489a0 1: 15 usec in iter 5887 401489a0 1: 14 usec in iter 5908 401489a0 1: 16 usec in iter 5937 401489a0 1: 15 usec in iter 5961 401489a0 1: 15 usec in iter 5982 401489a0 1: 15 usec in iter 5984 401489a0 1: 59 usec in iter 6004 401489a0 1: 14 usec in iter 6032 401489a0 1: 15 usec in iter 6051 401489a0 1: 15 usec in iter 6081 401489a0 1: 15 usec in iter 6112 401489a0 1: 15 usec in iter 6119 401489a0 1: 15 usec in iter 6136 401489a0 1: 60 usec in iter 6140 401489a0 1: 14 usec in iter 6167 401489a0 1: 56 usec in iter 6168 401489a0 1: 14 usec in iter 6179 401489a0 1: 13 usec in iter 6209 401489a0 1: 15 usec in iter 6227 401489a0 1: 14 usec in iter 6249 401489a0 1: 61 usec in iter 6250 401489a0 1: 14 usec in iter 6270 401489a0 1: 14 usec in iter 6291 401489a0 1: 13 usec in iter 6312 401489a0 1: 16 usec in iter 6331 401489a0 1: 14 usec in iter 6358 401489a0 1: 53 usec in iter 6359 401489a0 1: 15 usec in iter 6368 401489a0 1: 14 usec in iter 6373 401489a0 1: 13 usec in iter 6391 401489a0 1: 15 usec in iter 6411 401489a0 1: 15 usec in iter 6432 401489a0 1: 15 usec in iter 6454 401489a0 1: 15 usec in iter 6476 401489a0 1: 15 usec in iter 6508 401489a0 1: 15 usec in iter 6545 401489a0 1: 14 usec in iter 6564 401489a0 1: 15 usec in iter 6581 401489a0 1: 15 usec in iter 6611 401489a0 1: 14 usec in iter 6641 401489a0 1: 14 usec in iter 6661 401489a0 1: 15 usec in iter 6680 401489a0 1: 16 usec in iter 6700 401489a0 1: 15 usec in iter 6728 401489a0 1: 14 usec in iter 6738 401489a0 1: 15 usec in iter 6758 401489a0 1: 14 usec in iter 6777 401489a0 1: 15 usec in iter 6782 401489a0 1: 15 usec in iter 6784 401489a0 1: 15 usec in iter 6799 401489a0 1: 15 usec in iter 6803 401489a0 1: 16 usec in iter 6817 401489a0 1: 15 usec in iter 6831 401489a0 1: 14 usec in iter 6840 401489a0 1: 15 usec in iter 6861 401489a0 1: 14 usec in iter 6893 401489a0 1: 15 usec in iter 6911 401489a0 1: 14 usec in iter 6935 401489a0 1: 16 usec in iter 6950 401489a0 1: 15 usec in iter 6976 401489a0 1: 54 usec in iter 6987 401489a0 1: 16 usec in iter 6989 401489a0 1: 15 usec in iter 7020 401489a0 1: 14 usec in iter 7043 401489a0 1: 15 usec in iter 7065 401489a0 1: 14 usec in iter 7097 401489a0 1: 14 usec in iter 7110 401489a0 1: 16 usec in iter 7117 401489a0 1: 15 usec in iter 7126 401489a0 1: 15 usec in iter 7142 401489a0 1: 14 usec in iter 7153 401489a0 1: 14 usec in iter 7157 401489a0 1: 15 usec in iter 7160 401489a0 1: 15 usec in iter 7187 401489a0 1: 15 usec in iter 7223 401489a0 1: 15 usec in iter 7251 401489a0 1: 15 usec in iter 7271 401489a0 1: 15 usec in iter 7277 401489a0 1: 15 usec in iter 7298 401489a0 1: 14 usec in iter 7301 401489a0 1: 15 usec in iter 7334 401489a0 1: 16 usec in iter 7359 401489a0 1: 15 usec in iter 7378 401489a0 1: 15 usec in iter 7404 401489a0 1: 17 usec in iter 7417 401489a0 1: 17 usec in iter 7420 401489a0 1: 15 usec in iter 7431 401489a0 1: 16 usec in iter 7437 401489a0 1: 16 usec in iter 7451 401489a0 1: 16 usec in iter 7478 401489a0 1: 15 usec in iter 7502 401489a0 1: 15 usec in iter 7503 401489a0 1: 14 usec in iter 7535 401489a0 1: 15 usec in iter 7556 401489a0 1: 15 usec in iter 7558 401489a0 1: 52 usec in iter 7578 401489a0 1: 16 usec in iter 7583 401489a0 1: 13 usec in iter 7596 401489a0 1: 15 usec in 
iter 7618 401489a0 1: 14 usec in iter 7623 401489a0 1: 16 usec in iter 7624 401489a0 1: 15 usec in iter 7626 401489a0 1: 15 usec in iter 7636 401489a0 1: 16 usec in iter 7638 401489a0 1: 16 usec in iter 7661 401489a0 1: 16 usec in iter 7690 401489a0 1: 14 usec in iter 7716 401489a0 1: 16 usec in iter 7739 401489a0 1: 15 usec in iter 7767 401489a0 1: 15 usec in iter 7776 401489a0 1: 15 usec in iter 7815 401489a0 1: 16 usec in iter 7844 401489a0 1: 15 usec in iter 7875 401489a0 1: 15 usec in iter 7891 401489a0 1: 14 usec in iter 7893 401489a0 1: 14 usec in iter 7930 401489a0 1: 15 usec in iter 7937 401489a0 1: 14 usec in iter 7975 401489a0 1: 16 usec in iter 8001 401489a0 1: 14 usec in iter 8039 401489a0 1: 14 usec in iter 8044 401489a0 1: 15 usec in iter 8054 401489a0 1: 15 usec in iter 8080 401489a0 1: 15 usec in iter 8113 401489a0 1: 16 usec in iter 8120 401489a0 1: 15 usec in iter 8159 401489a0 1: 15 usec in iter 8174 401489a0 1: 16 usec in iter 8210 401489a0 1: 14 usec in iter 8226 401489a0 1: 55 usec in iter 8261 401489a0 1: 16 usec in iter 8291 401489a0 1: 15 usec in iter 8319 401489a0 1: 15 usec in iter 8322 401489a0 1: 14 usec in iter 8348 401489a0 1: 14 usec in iter 8359 401489a0 1: 14 usec in iter 8375 401489a0 1: 15 usec in iter 8401 401489a0 1: 15 usec in iter 8422 401489a0 1: 16 usec in iter 8437 401489a0 1: 15 usec in iter 8457 401489a0 1: 15 usec in iter 8485 401489a0 1: 14 usec in iter 8511 401489a0 1: 15 usec in iter 8512 401489a0 1: 16 usec in iter 8536 401489a0 1: 15 usec in iter 8540 401489a0 1: 15 usec in iter 8570 401489a0 1: 15 usec in iter 8609 401489a0 1: 15 usec in iter 8637 401489a0 1: 15 usec in iter 8648 401489a0 1: 15 usec in iter 8670 401489a0 1: 14 usec in iter 8673 401489a0 1: 16 usec in iter 8714 401489a0 1: 15 usec in iter 8729 401489a0 1: 14 usec in iter 8750 401489a0 1: 15 usec in iter 8787 401489a0 1: 14 usec in iter 8803 401489a0 1: 16 usec in iter 8816 401489a0 1: 15 usec in iter 8838 401489a0 1: 14 usec in iter 8843 401489a0 1: 15 usec in iter 8844 401489a0 1: 16 usec in iter 8865 401489a0 1: 17 usec in iter 8870 401489a0 1: 16 usec in iter 8879 401489a0 1: 16 usec in iter 8896 401489a0 1: 15 usec in iter 8937 401489a0 1: 15 usec in iter 8949 401489a0 1: 15 usec in iter 8960 401489a0 1: 15 usec in iter 8994 401489a0 1: 16 usec in iter 9027 401489a0 1: 17 usec in iter 9040 401489a0 1: 15 usec in iter 9046 401489a0 1: 15 usec in iter 9050 401489a0 1: 14 usec in iter 9057 401489a0 1: 16 usec in iter 9058 401489a0 1: 14 usec in iter 9075 401489a0 1: 15 usec in iter 9076 401489a0 1: 16 usec in iter 9099 401489a0 1: 14 usec in iter 9139 401489a0 1: 15 usec in iter 9158 401489a0 1: 15 usec in iter 9186 401489a0 1: 14 usec in iter 9212 401489a0 1: 15 usec in iter 9238 401489a0 1: 16 usec in iter 9269 401489a0 1: 17 usec in iter 9275 401489a0 1: 14 usec in iter 9298 401489a0 1: 16 usec in iter 9312 401489a0 1: 14 usec in iter 9325 401489a0 1: 15 usec in iter 9345 401489a0 1: 16 usec in iter 9362 401489a0 1: 16 usec in iter 9382 401489a0 1: 16 usec in iter 9387 401489a0 1: 15 usec in iter 9424 401489a0 1: 52 usec in iter 9435 401489a0 1: 15 usec in iter 9438 401489a0 1: 15 usec in iter 9481 401489a0 1: 15 usec in iter 9491 401489a0 1: 15 usec in iter 9499 401489a0 1: 16 usec in iter 9535 401489a0 1: 15 usec in iter 9554 401489a0 1: 14 usec in iter 9561 401489a0 1: 14 usec in iter 9576 401489a0 1: 15 usec in iter 9614 401489a0 1: 15 usec in iter 9640 401489a0 1: 14 usec in iter 9668 401489a0 1: 15 usec in iter 9700 401489a0 1: 16 usec in iter 9705 401489a0 1: 
15 usec in iter 9729 401489a0 1: 15 usec in iter 9738 401489a0 1: 15 usec in iter 9777 401489a0 1: 15 usec in iter 9816 401489a0 1: 16 usec in iter 9822 401489a0 1: 15 usec in iter 9828 401489a0 1: 17 usec in iter 9870 401489a0 1: 15 usec in iter 9911 401489a0 1: 15 usec in iter 9916 401489a0 1: 15 usec in iter 9956 401489a0 1: 15 usec in iter 9991 401489a0 1: raw totaltime = 0.462843 401489a0 1: per_iter = 0.00017897, SampleOverhead = 3.29e-06 401489a0 1: adj totaltime = 2.12368 401489a0 1: sampling = 0 401489a0 1: starttime = 0/000000 401489a0 1: counter = 0/462843 401489a0 1: firsttuple = 0.331616 401489a0 1: tuplecount = 10000 401489a0 1: itercount = 10001 401489a0 1: samplecount = 721 401489a0 1: nextsample = 10018.7 401489a0 1: startup = 0.331616 401489a0 1: total = 2.12368 401489a0 1: ntuples = 10000
On Wed, Jun 07, 2006 at 03:32:35PM -0400, Tom Lane wrote:
> On reflection it's easy to imagine other cases where some iterations
> take much longer than others in a not-very-predictable way.  For
> instance, a join where only a subset of the outer tuples have a match
> is going to act that way.  I don't think there's any good way that we
> can be sure we have a representative sample of executions, and so I'm
> afraid this approach to sampling EXPLAIN ANALYZE is a failure.

I don't think we ever assumed it would never be a problem. We just assumed that the sampling would cancel the effect out to give a decent average. Thing is, I never expected to get a three-order-of-magnitude difference between samples. That's just far too much to be corrected in any way. The random sampling should counter most effects, and I didn't consider the one tuple in a million that takes much longer to be a particularly common case.

Anyway, as a test, if you take the approach that the measurement at item X only applies to the tuples immediately preceding it, for the data you posted you get a result of 0.681148 seconds. How long did the query run that produced that data? (The bit of perl I used is:

cat data | perl -lne 'BEGIN { $last=0; $sum=0 } /: (\d+) usec in iter (\d+)/ && do { $sum += ($2-$last)*$1; $last=$2 }; END { print "$sum\n" }'

)

> I propose we revert this patch and think about an interrupt-driven
> sampling method instead.

That's another possibility of course...

-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes:
> Anyway, as a test, if you take the approach that the measurement at
> item X only applies to the tuples immediately preceding it, for the
> data you posted you get a result of 0.681148 seconds. How long did the
> query run that produced that data?

I didn't save the corresponding printout unfortunately, but it was probably pretty similar to this:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=869.395..869.399 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=248.640..3368.313 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=248.609..2983.528 rows=10000 loops=1)
               Hash Cond: (a.unique1 = b.unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.084..21.525 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=248.269..248.269 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.025..22.760 rows=10000 loops=1)
 Total runtime: 877.265 ms
(8 rows)

Time: 888.469 ms
regression=#

The above idea won't fix it anyway, only move the failure cases around.

			regards, tom lane
On Wed, Jun 07, 2006 at 11:34:30AM -0400, Tom Lane wrote: > timer interrupt routine does this once every few milliseconds: One issue is that on very fast queries, you'd get absolutely no data this way. A possible solution would be to start with an extremely short timer interval, and ramp it up if the query runs longer. The downside is that you'd need to somehow re-scale iterations every time the timer interval changed. Of course another option is to start off using the gettimeofday() method and switch to sampling after X seconds, but it'd be nice if both code paths weren't needed. > The bubble-up of sample counts to parent nodes could perhaps be done > while printing the results instead of on-the-fly as sketched above, but > the above seems simpler. It'd be nice if there was an option to not aggregate child runtimes to their parents at all, since it'd make spotting hot spots much easier. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote: > Ah-hah, I've sussed it. > so I'm > afraid this approach to sampling EXPLAIN ANALYZE is a failure. Hmmm, I thought we already got that bit...sorry to sound negative. > I propose we revert this patch and think about an interrupt-driven > sampling method instead. I don't have much more faith in crazy scheme No.2 either. (Mine or yours...) Can we just have an option to avoid the timing altogether, please? I don't want to have long discussions about instrumentation, I just want a reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one that we never, ever have to doubt whether the sampling works correctly on a Miasmic-367 with HyperKooling. You could lose a month on -perform going into the details of this for everybody - this was supposed to be a simple additional feature. If you're set on the sampling, great, but can we have the option to avoid it completely also? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote: >> I propose we revert this patch and think about an interrupt-driven >> sampling method instead. > I don't have much more faith in crazy scheme No.2 either. (Mine or > yours...) > Can we just have an option to avoid the timing altogether, please? I > don't want to have long discussions about instrumentation, I just want a > reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one > that we never, ever have to doubt whether the sampling works correctly > on a Miasmic-367 with HyperKooling. Frankly, I think the pre-existing version of EXPLAIN ANALYZE is fine. People have been hyperventilating about the timing overhead but I think that it's perfectly acceptable as-is. Certainly the removal of timing is not going to convert an intolerable EXPLAIN ANALYZE runtime into an acceptable one; what it *is* likely to do is let you be misled about which part of the query is the problem. regards, tom lane
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Wed, Jun 07, 2006 at 11:34:30AM -0400, Tom Lane wrote: >> timer interrupt routine does this once every few milliseconds: > One issue is that on very fast queries, you'd get absolutely no data > this way. Yeah. Of course, on "very fast" queries you don't get very good data from the existing method either --- with a machine fast enough to have sub-microsecond plan node execution times, the accuracy of gettimeofday really isn't good enough. The other thing that was bothering me was whether disk I/O might be undercounted by an interrupt-driven method. I kinda doubt that any kernel will save up N interrupts that occur while the process is blocked on a slow read() ... you'll probably get only one. So the whole idea may be unworkable. At the moment I think we need to revert the recent patch and go back to the drawing board. Improving on the original implementation of EXPLAIN ANALYZE is clearly not as easy as it looks. regards, tom lane
On Wed, Jun 07, 2006 at 04:04:33PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Anyway, as a test, if you take the approach that the measurement at
> > item X only applies to the tuples immediately preceding it, for the
> > data you posted you get a result of 0.681148 seconds. How long did that
> > query run that produced that data?

<snip>

> The above idea won't fix it anyway, only move the failure cases around.

Well, if we're trying to make a system that never ever produces strange-looking results, then yes, we have to drop it. No matter how you sample, there's going to be a failure mode somewhere.

If you are prepared to live with a certain margin (it will be within X% of the real value Y% of the time) then I think that's an achievable goal (I'm not saying that's necessarily what we have now). You could always give people the choice of disabling sampling if it looks weird, but that just gives people more knobs to twiddle and get upset about.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote: > >> I propose we revert this patch and think about an interrupt-driven > >> sampling method instead. > > > I don't have much more faith in crazy scheme No.2 either. (Mine or > > yours...) > > > Can we just have an option to avoid the timing altogether, please? I > > don't want to have long discussions about instrumentation, I just want a > > reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one > > that we never, ever have to doubt whether the sampling works correctly > > on a Miasmic-367 with HyperKooling. > > Frankly, I think the pre-existing version of EXPLAIN ANALYZE is fine. I respect everybody's opinion, yours doubly so, as you know. But with this current opinion, on this occasion, I believe you are in the minority of those who have spent some time asking for EXPLAIN ANALYSEs and have not received them because the run time is unacceptable. > People have been hyperventilating about the timing overhead but I think > that it's perfectly acceptable as-is. The phrase hyperventilating is subjective. From my side, I could use the phrase explaining-the-obvious and potentially cause rankle also. We should be able to discuss things without that. > Certainly the removal of timing > is not going to convert an intolerable EXPLAIN ANALYZE runtime into an > acceptable one; I disagree, as have others. > what it *is* likely to do is let you be misled about > which part of the query is the problem. A full EXPLAIN ANALYZE is always desirable - we agree on that. The question is what we do when one is not available. Guessing leaves you much more open to being misled. I'd like an option that we can take when a full EXPLAIN ANALYZE is not possible, one that doesn't give different results on different CPUs/OS. We need this to help people who are in difficulty. This isn't a vanity feature or a they-do-it-so-we-should-also. It's for support. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote: >> Certainly the removal of timing >> is not going to convert an intolerable EXPLAIN ANALYZE runtime into an >> acceptable one; > I disagree, as have others. The overhead seems to be on the order of a couple tens of percent usually. I don't see how that makes the difference between an EXPLAIN ANALYZE you can run and one you can't. > A full EXPLAIN ANALYZE is always desirable - we agree on that. The > question is what we do when one is not available. The least bad alternative I've heard is to let EXPLAIN ANALYZE print out stats-so-far if the query is canceled by control-C or statement timeout. The objection to this is you may mistake startup transients for full query behavior ... but at least the numbers will be good as far as they go. regards, tom lane
On Wed, Jun 07, 2006 at 05:05:23PM -0400, Tom Lane wrote: > The other thing that was bothering me was whether disk I/O might be > undercounted by an interrupt-driven method. I kinda doubt that any > kernel will save up N interrupts that occur while the process is blocked > on a slow read() ... you'll probably get only one. So the whole idea > may be unworkable. True, but if you get to the point where you're waiting on I/O, I would think you could afford a gettimeofday() call. Even if the block we need is in the OS cache, it's possible that the overhead of getting it from there is high enough that the gettimeofday() call won't matter. FWIW, it looks like running a much larger setting for shared_buffers (like, 50% of memory) is a good way to boost performance, and that configuration would make it less likely that you'd do a gettimeofday just to pull data out of the OS cache. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote:
> >> Certainly the removal of timing
> >> is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
> >> acceptable one;
>
> > I disagree, as have others.
>
> The overhead seems to be on the order of a couple tens of percent usually.
> I don't see how that makes the difference between an EXPLAIN ANALYZE you
> can run and one you can't.

Well, that's not my experience, and it doesn't match others posted on -hackers.

A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE to be consistently above 500% (or more than +400%, depending upon how you style those numbers). This is for in-shared-buffers data, following cache priming. The test shown below is a typical result from 5 runs:

> postgres=# select count(*) from accounts;
>  count
> --------
>  100000
> (1 row)
>
> Time: 267.008 ms
> postgres=# explain analyze select count(*) from accounts;
>                                                        QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=10945.00..10945.01 rows=1 width=0) (actual time=1497.830..1497.837 rows=1 loops=1)
>    ->  Seq Scan on accounts  (cost=0.00..9998.20 rows=378720 width=0) (actual time=0.078..828.455 rows=100000 loops=1)
>  Total runtime: 1497.954 ms
> (3 rows)
>
> Time: 1498.983 ms

Other timings were: 1493 1498 1707 1814 1827

EXPLAIN ANALYZE is designed to be run *when* you have unacceptable run times and need to find out why. So making the execution time even more unacceptable makes the utility infeasible at the time you need it most. The additional run time occurs on people's production systems, so they generally aren't happy running long tasks.

BTW I think EA is great - so good in fact I want to run it more often.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes:
> On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
>> The overhead seems to be on the order of a couple tens of percent usually.
>> I don't see how that makes the difference between an EXPLAIN ANALYZE you
>> can run and one you can't.

> Well, that's not my experience and doesn't match others posted on
> -hackers.

> A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE
> to be consistently above 500% (or more than +400%, depending upon how
> you style those numbers).

I think we ought to find out why your machine is so broken. Even in this pretty-much-worst-case scenario (a seqscan does about as little real work per plan node call as possible, especially if the table is already fully cached), I don't see more than about a 2X degradation. On queries that are complicated enough to actually need EXPLAIN ANALYZE, it's not nearly that bad.

Old slow HPUX/HPPA machine, PG 8.1 branch tip:

bench=# \timing
Timing is on.
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 543.565 ms
-- do it again to ensure fully cached
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 492.667 ms
bench=# explain analyze select count(*) from accounts;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2975.00..2975.01 rows=1 width=0) (actual time=1172.856..1172.860 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2725.00 rows=100000 width=0) (actual time=0.175..720.741 rows=100000 loops=1)
 Total runtime: 1173.290 ms
(3 rows)

Time: 1176.293 ms
bench=#

Spiffy new Fedora 5/dual Xeon machine, PG 8.1 branch tip:

bench=# \timing
Timing is on.
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 61.737 ms
-- do it again to ensure fully cached
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 53.941 ms
bench=# explain analyze select count(*) from accounts;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2975.00..2975.01 rows=1 width=0) (actual time=117.881..117.882 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2725.00 rows=100000 width=0) (actual time=0.041..77.628 rows=100000 loops=1)
 Total runtime: 117.936 ms
(3 rows)

Time: 118.510 ms
bench=#

I'm too lazy to pull up any of my other machines right now, but this is generally consistent with my experience ever since EXPLAIN ANALYZE was written.

So: what's your platform exactly?

regards, tom lane
On Thu, 2006-06-08 at 10:27 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
> >> The overhead seems to be on the order of a couple tens of percent usually.
> >> I don't see how that makes the difference between an EXPLAIN ANALYZE you
> >> can run and one you can't.
>
> > Well, that's not my experience and doesn't match others posted on
> > -hackers.
>
> > A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE
> > to be consistently above 500% (or more than +400%, depending upon how
> > you style those numbers).
>
> I think we ought to find out why your machine is so broken.

> I'm too lazy to pull up any of my other machines right now, but this is
> generally consistent with my experience ever since EXPLAIN ANALYZE was
> written.

Great. Well, it isn't consistent with mine, or with others who've posted to this list.

> So: what's your platform exactly?

FC5, Intel laptop running CVS tip, new in January. But that's irrelevant. I'm not a user; I solve others' problems, as you know. Hence my interest in a usable tool to do that.

So far we have myself, Kevin, Martijn and Luke all saying there is a distortion or a massive overhead caused by EXPLAIN ANALYZE.
http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php
http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php

It's real. I won't press the point further.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > So far we have myself, Kevin, Martijn and Luke all saying there is a > distortion or a massive overhead caused by EXPLAIN ANALYZE. > http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php > http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php The first of these shows the exact opposite of what you are claiming, and the second is without details of any sort that might help determine what the actual problem is. Given that we're seeing diametrically opposed results on the same OS (FC5) and similar (at least all Intel) hardware, I think the prudent thing is to find out what's really going on before leaping in with proposed solutions. As the person who's *not* seeing the problem, I'm not in a position to do that investigation... regards, tom lane
Ending EXPLAIN ANALYZE early (was Re: That EXPLAIN ANALYZE patch still needs work)
From: Tom Lane
I wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> A full EXPLAIN ANALYZE is always desirable - we agree on that. The
>> question is what we do when one is not available.

> The least bad alternative I've heard is to let EXPLAIN ANALYZE print
> out stats-so-far if the query is canceled by control-C or statement
> timeout. The objection to this is you may mistake startup transients
> for full query behavior ... but at least the numbers will be good as
> far as they go.

I thought some more about this, and it seems doable but tricky (ie, there are many wrong ways to do it). Here are my conclusions:

We can't just use the normal QueryCancel logic that throws an elog(ERROR) from someplace down inside the query. This would leave the backend in an unclean state from which we could only certainly recover by doing AbortTransaction. And once we've aborted the transaction we can't do catalog accesses, which gets in the way of producing the EXPLAIN printout. Running the test query inside a subtransaction would fix that, but aborting the subtransaction would throw away the executor state, including the Instrumentation nodes we need.

So it seems we need a way to stop the query without raising an error per se. What I'm thinking is that while EXPLAIN ANALYZE is running, timeout or SIGINT should not set QueryCancelPending, but instead set a separate flag "ExplainCancelPending", which we then test in ExecProcNode(), say

	if (node->instrument)
	{
+		if (ExplainCancelPending)
+			return NULL;
		InstrStartNode(node->instrument);
	}

There might be one or two other places to check it, but basically we'd only notice the flag in very limited circumstances where it's definitely safe to force early termination of ExecutorRun. When control comes back to explain.c, we just print the results as normal (but probably adding a line explicitly noting that the query was abandoned before completion). Note that we won't have any "running" Instrumentation nodes to contend with, since the change doesn't cause nodes to drop out after they've started timing. So the data is good, just incomplete.

After we've printed the results, we have a bit of a problem: if ExplainCancelPending is set, we now need to abort the transaction. It would not do at all to allow an incompletely executed UPDATE to commit. I experimented with throwing an elog() at the bottom of ExplainQuery() after end_tup_output(), but this does not work: psql only prints the error and not the data, because libpq throws away the query result upon seeing the error. We could probably hack psql to print the results before noting the error, but I'm concerned about making a change that would change the behavior for other error-at-end-of-statement cases. Also, it's not clear what might have to happen to get non-psql clients to play along.

It seems like the best solution is to establish a subtransaction around the entire EXPLAIN command (not just the test query), which we can abort after we've printed the results. A possible objection to this is that running a query inside a subtransaction might have different/worse performance than running it at top level. I don't recall any severe bottlenecks of that kind but that doesn't mean there aren't any (Alvaro, any comments?)

Note that this would mean that ending an EXPLAIN ANALYZE early, via either control-C or statement timeout, would be treated as a non-error situation from the point of view of the outer transaction. This bothers me a bit, because in fact the effects if any of the tested query would have been rolled back.
Not sure we have any choice though. If we expose the error then we'll have problems with clients not showing the EXPLAIN results. Thoughts? regards, tom lane
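As a toy, self-contained model of the idea above (not the actual PostgreSQL code; the flag name follows the message above, everything else is invented for illustration): a separate flag makes the node-iteration loop stop returning tuples instead of throwing an error, so the counters accumulated so far stay valid and printable.

/* Toy model only, not PostgreSQL source.  A separate "ExplainCancelPending"
 * flag terminates execution cleanly, leaving instrumentation intact. */
#include <stdbool.h>
#include <stdio.h>

static volatile bool ExplainCancelPending = false;  /* set by timeout/SIGINT */

typedef struct ToyNode
{
    long ntuples;   /* tuples produced so far (the "instrumentation") */
    long limit;     /* tuples the node would produce if run to completion */
} ToyNode;

/* Returns true while a tuple was produced; false at end-of-stream or cancel. */
static bool
toy_exec_proc_node(ToyNode *node)
{
    if (ExplainCancelPending)
        return false;           /* stop early, counters left as-is */
    if (node->ntuples >= node->limit)
        return false;
    node->ntuples++;            /* InstrStartNode/InstrStopNode would bracket this */
    return true;
}

int
main(void)
{
    ToyNode node = {0, 1000000};

    while (toy_exec_proc_node(&node))
    {
        if (node.ntuples == 12345)
            ExplainCancelPending = true;    /* simulate statement_timeout firing */
    }
    printf("partial EXPLAIN ANALYZE would report ntuples = %ld\n", node.ntuples);
    return 0;
}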
Tom Lane wrote:
> -- do it again to ensure fully cached
> bench=# select count(*) from accounts;

Short summary: Does running a query only twice really ensure that a result is cached? It seems not to be the case for seq-scans on Linux. I think this may matter to the discussions about a readahead thread/process that come up on this list occasionally.

Experimental results here suggest that for larger tables Linux seems to detect a seq-scan and not bother caching. It's very reproducible for me here to do a reboot and not see the full speedup on a seq_scan until the third time I run a query. The example below [1] shows that the third run of a query is faster than the second run. The output of a 'vmstat 5' [2] while these queries were happening agrees that significant I/O was still happening on the second run, but no I/O happened the third time. The table comfortably fits in memory (700MB table on a 2GB ram machine) and the machine was otherwise idle, so no one else wiped out the cache between the first and second runs.

Why do I think this is worth mentioning here?

* I think it impacts the occasional thread about wanting to include logic in postgresql for readahead [3], or the threads suggesting hinting to the OS through madvise or similar to avoid caching seq-scans. It seems that Linux is detecting and at least somewhat reacting to seq scans even with no hinting. Anything added to postgresql might end up being a duplicated effort. I think Bruce suggested that Solaris does this free-behind automatically [4], but this is the first I've noticed that Linux seems to do similar.

* I think it matters to people who post explain analyze results after running the query only twice, rather than often enough to get stable results. (I note that this was not a problem for Tom since the timing of his first and second runs were the same, so I assume he was just saying that he observed that the query was cached rather than that the first run forced the second run to be cached.)

Ron

=========================================================================
== [note 1] the repeated queries showing the speedup after 3 runs.
== Seems the OS's caching logic decided that the first seq_scan
== wasn't 'interesting' enough
=========================================================================

fli=# select count(*) from facets_s;
  count
----------
 15976558
(1 row)

Time: 29788.047 ms
fli=# select count(*) from facets_s;
  count
----------
 15976558
(1 row)

Time: 19344.573 ms
fli=# select count(*) from facets_s;
  count
----------
 15976558
(1 row)

Time: 13411.272 ms
fli=# select count(*) from facets_s;
  count
----------
 15976558
(1 row)

Time: 13107.856 ms

################################################################################
# [note 2] vmstat 5 while the above queries were being run
################################################################################
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1    140  62140  71256 713360    0    0    47    31   92    84  7  1 92  0
*** the first time
 1  0    140  50860  31912 808304    0    2 25215    29 1147  2612 49 15  0 36
 1  0    360  54420  11112 855240    0    0 23934     7 1139  2553 47 14  0 39
 0  1    360  54008  11100 878708    0    0 23704    25 1149  2467 46 12  0 41
 0  1    360  52512  11140 896592    0    0 24062     6 1135  2460 47 11  0 41
*** the second time
 0  0    360  52688  11172 906916    0    0 13357    19 1085  1989 31  7 38 24
 1  0    360  53976  11076 912540    0   44 14273    57 1113  2102 32  7 29 32
 2  0    360  54788  10908 923788    0    0 24509    54 1171  2474 46 12  0 42
 1  0    360  54944   3096 939948    0    0 11180    39 1093  1976 65 13  0 22
*** the third time
 3  0    360  54280   3872 940508    0    0   264    14 1041  1560 85 15  0  0
 1  0    360  53852   3904 940940    0    0    88    29 1022  1505 53  9 36  2
 2  0    360  51616   4052 943068    0    0   443    54 1037  1552 82 15  0  4
 1  0    360  51488   4060 943180    0    0    22     2 1013  1522 84 16  0  0

#############
[3] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01449.php
[4] http://archives.postgresql.org/pgsql-performance/2003-10/msg00188.php
Re: Ending EXPLAIN ANALYZE early (was Re: That EXPLAIN ANALYZE patch still needs work)
From: Gregory Stark
Tom Lane <tgl@sss.pgh.pa.us> writes: > After we've printed the results, we have a bit of a problem: if > ExplainCancelPending is set, we now need to abort the transaction. It > would not do at all to allow an incompletely executed UPDATE to commit. > I experimented with throwing an elog() at the bottom of ExplainQuery() > after end_tup_output(), but this does not work: psql only prints the > error and not the data, because libpq throws away the query result > upon seeing the error. We could probably hack psql to print the results > before noting the error, but I'm concerned about making a change that > would change the behavior for other error-at-end-of-statement cases. > Also, it's not clear what might have to happen to get non-psql clients > to play along. Would it be possible to make a whole new protocol message for EXPLAIN results? The reason I'm suggesting that is because it might make it easier to implement the SIGINFO handler that sends incremental EXPLAIN results on demand that I was describing earlier. Then libpq would have a separate api call to check for EXPLAIN results. If a non-psql client doesn't check it then it doesn't get the EXPLAIN results but the behaviour is correct. If it does then it can get the EXPLAIN results. This might also let you run with EXPLAIN ANALYZE instrumentation but still get the regular query results. Since the explain analyze results would still be available out of band. -- Gregory Stark T + 514 938 2456
Re: Ending EXPLAIN ANALYZE early (was Re: That EXPLAIN ANALYZE patch still needs work)
From: Tom Lane
Gregory Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: > Would it be possible to make a whole new protocol message for EXPLAIN results? I'm really unwilling to get into that. For one thing, that would absolutely positively break *all* use of EXPLAIN from un-fixed clients, whether you were trying to stop early or not. > The reason I'm suggesting that is because it might make it easier to implement > the SIGINFO handler that sends incremental EXPLAIN results on demand that I > was describing earlier. Doesn't matter, because that's not happening ;-) SIGINFO isn't portable, and even if it were, a signal handler couldn't possibly generate EXPLAIN output (remember those catalog accesses). regards, tom lane
I think what he meant was "a separate EXPLAIN-CANCEL message" on a cancel-type connection, which would be completely backwards compatible. Old clients simply wouldn't be able to use the special EXPLAIN cancel, just like it is now.

On Thu, June 8, 2006 3:01 pm, Tom Lane wrote:
> Gregory Stark <gsstark@mit.edu> writes:
>
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Would it be possible to make a whole new protocol message for EXPLAIN
>> results?
>
> I'm really unwilling to get into that. For one thing, that would
> absolutely positively break *all* use of EXPLAIN from un-fixed clients,
> whether you were trying to stop early or not.
>
>> The reason I'm suggesting that is because it might make it easier to
>> implement the SIGINFO handler that sends incremental EXPLAIN results on
>> demand that I was describing earlier.
>
> Doesn't matter, because that's not happening ;-) SIGINFO isn't
> portable, and even if it were, a signal handler couldn't possibly generate
> EXPLAIN output (remember those catalog accesses).
>
> regards, tom lane
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > Does running a query only twice really insure that a result is cached? > It seems not to be the case for seq-scans on Linux. Should work for tables small enough to fit into the shared_buffers arena. I wouldn't necessarily assume it for large tables. Note that what I was really doing was taking the timing measurement again on data cached by the *first* run, so that I would have something that could fairly be compared to the following EXPLAIN ANALYZE --- which of course would likewise be operating on cached data. regards, tom lane
Re: Ending EXPLAIN ANALYZE early (was Re: That EXPLAIN ANALYZE patch still needs work)
From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes: > Gregory Stark <gsstark@mit.edu> writes: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > Would it be possible to make a whole new protocol message for EXPLAIN results? > > I'm really unwilling to get into that. For one thing, that would > absolutely positively break *all* use of EXPLAIN from un-fixed clients, > whether you were trying to stop early or not. Well I was picturing still having the manual EXPLAIN ANALYZE which works as it does now replacing the query results in addition to a lower level protocol request which generates a protocol response with the data. > > The reason I'm suggesting that is because it might make it easier to implement > > the SIGINFO handler that sends incremental EXPLAIN results on demand that I > > was describing earlier. > > Doesn't matter, because that's not happening ;-) SIGINFO isn't > portable, and even if it were, a signal handler couldn't possibly > generate EXPLAIN output (remember those catalog accesses). Well the signal gets handled by psql. It would have to be a wire protocol message like Query Cancel that the backend watches for and handles the first time it can. -- greg
Re: Ending EXPLAIN ANALYZE early (was Re: That EXPLAIN ANALYZE patch still needs work)
From: Alvaro Herrera
Tom Lane wrote: > A possible objection to this is that running a query inside a > subtransaction might have different/worse performance than running it > at top level. I don't recall any severe bottlenecks of that kind but > that doesn't mean there aren't any (Alvaro, any comments?) Nope, nothing that I recall. Maybe make sure to generate the Xid of the subtransaction before starting the EXPLAIN itself (so that there is no pg_subtrans I/O pending, etc) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Thu, 2006-06-08 at 13:51 -0400, Tom Lane wrote: > I wrote: > > Simon Riggs <simon@2ndquadrant.com> writes: > >> A full EXPLAIN ANALYZE is always desirable - we agree on that. The > >> question is what we do when one is not available. > > > The least bad alternative I've heard is to let EXPLAIN ANALYZE print > > out stats-so-far if the query is canceled by control-C or statement > > timeout. The objection to this is you may mistake startup transients > > for full query behavior ... but at least the numbers will be good as > > far as they go. > > I thought some more about this Thanks. > So it seems we need a way to stop the query If we were able to stream results back to the client, then EA would be simply a meta-query, emitting a stream of explain text tuples, rather than result tuples. No further need for stopping the query to return intermediate results. If we did this say every minute, rather than on interrupt, you'd be getting close to being able to build that "progress bar" that people have been mentioning - I'd not really thought that was possible until now. Would allow us to build a pg_top that would show the current EA results for a query as it executes. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Thu, 2006-06-08 at 12:56 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > So far we have myself, Kevin, Martijn and Luke all saying there is a
> > distortion or a massive overhead caused by EXPLAIN ANALYZE.
> > http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php
> > http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php
>
> Given that we're seeing diametrically opposed results on the same OS
> (FC5) and similar (at least all Intel) hardware, I think the prudent
> thing is to find out what's really going on before leaping in with
> proposed solutions. As the person who's *not* seeing the problem,
> I'm not in a position to do that investigation...

That seems reasonable.

I've cut a patch to remove timing from the EA results. Output shown here:

postgres=# set explain_analyze_timing = on;
SET
Time: 0.673 ms
postgres=# explain analyze select count(*) from accounts;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2890.00..2890.01 rows=1 width=0) (actual time=0.000..690.780 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2640.00 rows=100000 width=0) (actual time=0.000..2672.562 rows=100000 loops=1)
 Total runtime: 734.474 ms
(3 rows)

Time: 891.822 ms
postgres=# set explain_analyze_timing = off;
SET
Time: 0.480 ms
postgres=# explain analyze select count(*) from accounts;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=2890.00..2890.01 rows=1 width=0) (rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2640.00 rows=100000 width=0) (rows=100000 loops=1)
 Total runtime: 133.674 ms
(3 rows)

Time: 134.565 ms
postgres=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 130.528 ms

So the timing is clearly responsible for the additional time I'm personally experiencing and very likely to be that for others also. As to why that should be the case, I'm not sure. The timing overhead seems fairly constant on particular hardware/OS, just different for each.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > So the timing is clearly responsible for the additional time I'm > personally experiencing and very likely to be that for others also. Well, that's certainly unsurprising, but the question is why it's such a large overhead for you when it's not on other apparently-similar kit. If anyone else is still following this discussion, could you try the test case Simon gave a bit earlier (select count(*) from 100000-row table, check \timing with and without explain analyze)? If we could get a clearer picture of which platforms show the problem and which don't, maybe we could make some progress. (BTW, I just tried it on my Apple Powerbook G4, and got results in line with what I get elsewhere. So I've now tried it on three very different CPU types and OSes, and not duplicated Simon's problem on any of them.) regards, tom lane
Simon Riggs wrote: > postgres=# explain analyze select count(*) from accounts; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------ > Aggregate (cost=2890.00..2890.01 rows=1 width=0) (actual > time=0.000..690.780 rows=1 loops=1) > -> Seq Scan on accounts (cost=0.00..2640.00 rows=100000 width=0) > (actual time=0.000..2672.562 rows=100000 loops=1) > Total runtime: 734.474 ms > (3 rows) > > Time: 891.822 ms Wow, that is slow. Maybe a problem in the kernel? Perhaps something similar to this: http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282 -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> So the timing is clearly responsible for the additional time I'm
>> personally experiencing and very likely to be that for others also.
>
> Well, that's certainly unsurprising, but the question is why it's such
> a large overhead for you when it's not on other apparently-similar
> kit.
>
> If anyone else is still following this discussion, could you try the
> test case Simon gave a bit earlier (select count(*) from 100000-row
> table, check \timing with and without explain analyze)? If we could
> get a clearer picture of which platforms show the problem and which
> don't, maybe we could make some progress.
>
> (BTW, I just tried it on my Apple Powerbook G4, and got results in
> line with what I get elsewhere. So I've now tried it on three very
> different CPU types and OSes, and not duplicated Simon's problem on
> any of them.)

$ psql
Welcome to psql 8.1.4, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

ler=# create table test1(id int);
CREATE TABLE
ler=# insert into test1 select * from generate_series(1,100000);
INSERT 0 100000
ler=# \timing
Timing is on.
ler=# select count(*) from test1;
 count
--------
 100000
(1 row)

Time: 39.191 ms
ler=# explain analyze select count(*) from test1;
                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1988.18..1988.19 rows=1 width=0) (actual time=282.899..282.901 rows=1 loops=1)
   ->  Seq Scan on test1  (cost=0.00..1698.74 rows=115774 width=0) (actual time=0.007..147.845 rows=100000 loops=1)
 Total runtime: 282.987 ms
(3 rows)

Time: 283.764 ms
ler=# \q

$ uname -a
FreeBSD thebighonker.lerctr.org 6.1-STABLE FreeBSD 6.1-STABLE #59: Thu Jun 1 09:40:47 CDT 2006 root@thebighonker.lerctr.org:/usr/obj/usr/src/sys/THEBIGHONKER amd64
$ sysctl hw
hw.machine: amd64
hw.model: Intel(R) Xeon(TM) CPU 3.00GHz
hw.ncpu: 4
hw.byteorder: 1234
hw.physmem: 4286132224
hw.usermem: 4003151872
hw.pagesize: 4096
hw.floatingpoint: 1
hw.machine_arch: amd64
hw.realmem: 5368709120
[snip]

The database is 8.1.4.

--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 512-248-2683                 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893
Tom Lane wrote:
> If anyone else is still following this discussion, could you try the
> test case Simon gave a bit earlier (select count(*) from 100000-row
> table, check \timing with and without explain analyze)? If we could get
> a clearer picture of which platforms show the problem and which don't,
> maybe we could make some progress.

Tried on two machines. The first (Turion64 laptop) gives 44-45 ms for the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.

The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction you were looking for.

All results are repeated multiple times; I'm reporting times other than the first query (which is slower). More specs on the Celeron machine follow:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 6
model name      : Celeron (Mendocino)
stepping        : 5
cpu MHz         : 534.616
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips        : 1070.25

$ uname -a
Linux drizzt 2.6.16-1-686 #2 Sat Apr 15 20:45:20 UTC 2006 i686 GNU/Linux

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tried on two machines. The first (Turion64 laptop) gives 44-45 ms for > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE. > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT > and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction > you were looking for. Do you have oprofile installed on these? Comparing oprofile results might give some more insight where the time is going. regards, tom lane
On Thu, Jun 08, 2006 at 04:58:07PM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > So the timing is clearly responsible for the additional time I'm
> > personally experiencing and very likely to be that for others also.
>
> Well, that's certainly unsurprising, but the question is why it's such
> a large overhead for you when it's not on other apparently-similar kit.
>
> If anyone else is still following this discussion, could you try the
> test case Simon gave a bit earlier (select count(*) from 100000-row
> table, check \timing with and without explain analyze)? If we could get
> a clearer picture of which platforms show the problem and which don't,
> maybe we could make some progress.
>
> (BTW, I just tried it on my Apple Powerbook G4, and got results in line
> with what I get elsewhere. So I've now tried it on three very different
> CPU types and OSes, and not duplicated Simon's problem on any of them.)

CPU: AMD Athlon(tm) 64 Processor 3500+ (2210.20-MHz K8-class CPU)
8.1.3 on amd64-portbld-freebsd6.0
25ms 353ms

CPU: AMD Opteron(tm) Processor 244 (1792.50-MHz K8-class CPU) (dual CPU)
8.1.4 on amd64-portbld-freebsd6.0
31ms 295ms

Powerbook G4 1.33GHz
8.1.4 on powerpc-apple-darwin8.6.0
5.1s 5.8s*

The powerbook tests were not very repeatable at 100,000 rows, so I bumped up to 1M. The results still aren't very repeatable...

decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4914.604 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 5186.516 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 5174.418 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5729.623..5729.624 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.155..4039.317 rows=1000000 loops=1)
 Total runtime: 5729.907 ms
(3 rows)

Time: 5732.076 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5916.025..5916.026 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.157..4246.123 rows=1000000 loops=1)
 Total runtime: 5916.261 ms
(3 rows)

Time: 5922.542 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5800.788..5800.789 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.152..4126.781 rows=1000000 loops=1)
 Total runtime: 5801.026 ms
(3 rows)

Time: 5803.070 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5994.285..5994.286 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.156..4341.463 rows=1000000 loops=1)
 Total runtime: 5994.520 ms
(3 rows)

Time: 5996.577 ms

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Tried on two machines. The first (Turion64 laptop) gives 44-45 ms for > > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE. > > > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT > > and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction > > you were looking for. > > Do you have oprofile installed on these? Comparing oprofile results > might give some more insight where the time is going. No, I don't. I'll have a look. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
"Jim C. Nasby" <jnasby@pervasive.com> writes: > The powerbook tests were not very repeatable at 100,000 rows, so I > bumped up to 1M. The results still aren't very repeatable... Hmm, I suspect you pushed it into the realm where it's doing actual I/O ... which we don't want for this particular test. (Although it's worth noting that if your query *is* doing actual I/O, the EXPLAIN overhead certainly ought to be down in the noise.) regards, tom lane
Just tested 2 windows boxes, both running 8.1.3 and XP SP2.

P4 2.8G desktop: 47ms 297ms
Pentium M 2G laptop: 40ms 240ms

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Thu, Jun 08, 2006 at 05:32:36PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > The powerbook tests were not very repeatable at 100,000 rows, so I
> > bumped up to 1M. The results still aren't very repeatable...
>
> Hmm, I suspect you pushed it into the realm where it's doing actual
> I/O ... which we don't want for this particular test. (Although it's
> worth noting that if your query *is* doing actual I/O, the EXPLAIN
> overhead certainly ought to be down in the noise.)

Hmm... I guess it depends on how smart the OS cache is; the table is 36MB, 4406 pages. But shared buffers is 3000... Bumping shared_buffers up to 5k, there is a bigger difference, but these numbers are still all over the place, so I don't know that they're very trustworthy.

decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5398.625..5398.626 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.034..3967.927 rows=1000000 loops=1)
 Total runtime: 5398.871 ms
(3 rows)

Time: 5400.900 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5626.671..5626.671 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.035..3875.641 rows=1000000 loops=1)
 Total runtime: 5626.904 ms
(3 rows)

Time: 5628.896 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5137.549..5137.549 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.020..3440.034 rows=1000000 loops=1)
 Total runtime: 5137.707 ms
(3 rows)

Time: 5139.178 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4357.443 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4251.208 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4712.912 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4479.278 ms
decibel=#

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Wow, that is slow. Maybe a problem in the kernel? Perhaps something
> similar to this:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282

Yeah, that's a pretty interesting thread. I came across something similar on a Red Hat internal list. It seems there are three or four different popular standards for clock hardware in the Intel world, and some good implementations and some pretty bad implementations of each. So the answer may well boil down to "if you're using cheap junk PC hardware then gettimeofday will be slow".

In fact, I just got around to trying this on my old x86 PC, and behold:

regression=# select count(*) from tenk1;
 count
-------
 10000
(1 row)

Time: 9.670 ms
regression=# explain analyze select count(*) from tenk1;
                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=470.00..470.01 rows=1 width=0) (actual time=93.549..93.553 rows=1 loops=1)
   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=0) (actual time=0.014..49.261 rows=10000 loops=1)
 Total runtime: 93.663 ms
(3 rows)

Time: 94.431 ms
regression=#

So that says that this machine takes about 4 usec to do gettimeofday(), as compared to 3 usec on my nominally 4x slower HPPA machine. The new dual Xeon sitting next to it probably has a much less junky motherboard, with a better clock ... and I find it unsurprising that the HP and Apple machines I was trying aren't subject to such problems.

I didn't get the impression from that linux-kernel thread that the proposed patch had actually gone in yet anyplace; anyone know how to track that?

regards, tom lane
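For anyone who wants to check their own hardware, a quick standalone microbenchmark along these lines (a rough sketch, not from the thread or the PostgreSQL source) gives an approximate per-call figure for gettimeofday():

/* Rough sketch: estimate the per-call cost of gettimeofday() by timing a
 * tight loop of calls.  Compile with something like: cc -O2 -std=c99 gtod.c */
#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
    const long      iters = 1000000;
    struct timeval  start, end, tv;

    gettimeofday(&start, NULL);
    for (long i = 0; i < iters; i++)
        gettimeofday(&tv, NULL);        /* the call being measured */
    gettimeofday(&end, NULL);

    double elapsed_usec = (end.tv_sec - start.tv_sec) * 1e6
                        + (end.tv_usec - start.tv_usec);
    printf("%.3f usec per gettimeofday() call\n", elapsed_usec / iters);
    return 0;
}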
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Wow, that is slow. Maybe a problem in the kernel? Perhaps something
>> similar to this:
>> http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282
>
> Yeah, that's a pretty interesting thread. I came across something
> similar on a Red Hat internal list. It seems there are three or four
> different popular standards for clock hardware in the Intel world,
> and some good implementations and some pretty bad implementations
> of each. So the answer may well boil down to "if you're using cheap
> junk PC hardware then gettimeofday will be slow".

OS seems to matter as well - I've got two identical Supermicro P3TDER dual Intel boxes, one running FreeBSD 6.1, one running Gentoo Linux 2.6.16.

Doing the 'select count(*)' vs 'explain analyze select count(*)' on a 100000 row table gives:

FreeBSD : select 108 ms, explain analyze 688 ms
Linux   : select 100 ms, explain analyze 196 ms

Both systems have ACPI enabled in BIOS, which means there is a better timecounter than 'i8254' available (FreeBSD says it's using 'ACPI-safe'; not sure how to check on Linux).

Cheers

Mark
It's worth noting that on Darwin (on Apple hardware) gettimeofday is never a syscall whereas on Linux (AFAIK), it always is.

On Jun 8, 2006, at 7:58 PM, Mark Kirkwood wrote:

> Tom Lane wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> Wow, that is slow. Maybe a problem in the kernel? Perhaps something
>>> similar to this:
>>> http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282
>> Yeah, that's a pretty interesting thread. I came across something
>> similar on a Red Hat internal list. It seems there are three or four
>> different popular standards for clock hardware in the Intel world,
>> and some good implementations and some pretty bad implementations
>> of each. So the answer may well boil down to "if you're using cheap
>> junk PC hardware then gettimeofday will be slow".
>
> OS seems to matter as well - I've got two identical Supermicro P3TDER
> dual intel boxes. 1 running FreeBSD 6.1, one running Gentoo Linux
> 2.6.16.
>
> Doing the 'select count(*) vs explain analyze select count(*) on
> 100000 row table gives:
>
> Freebsd : select 108 ms explain analyze 688 ms
> Linux : select 100 ms explain analyze 196 ms
>
> Both systems have ACPI enabled in BIOS (which means there is a better
> timecounter than 'i8254' available (FreeBSD says its using 'ACPI-safe'
> - not sure how to check on Linux).

AgentM
agentm@themactionfaction.com
On Thu, 2006-06-08 at 17:21 -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Tried on two machines. The first (Turion64 laptop) gives 44-45 ms for > > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE. > > > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT > > and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction > > you were looking for. > > Do you have oprofile installed on these? Comparing oprofile results > might give some more insight where the time is going. On my office PC the overhead from EA is about 20%. stracing a backend, I also notice that on a 1.5 million row table we make only 39158 calls to gettimeofday, on a table of 24591 blocks. I'm thinking that the instrumentation overhead has been reduced as a result of the page-at-a-time heap scans we now have? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Alvaro Herrera wrote: > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT > and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction > you were looking for. Hmm, I take that back. I updated to the current CVS tip on this machine and now I don't see the problem. Without EXPLAIN ANALYZE, the times are around 115-117 ms. With EXPLAIN ANALYZE, 149-152ms. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Simon Riggs wrote:
> On Thu, 2006-06-08 at 17:21 -0400, Tom Lane wrote:
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> > > Tried on two machines. The first (Turion64 laptop) gives 44-45 ms for
> > > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.
> >
> > > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
> > > and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction
> > > you were looking for.
> >
> > Do you have oprofile installed on these? Comparing oprofile results
> > might give some more insight where the time is going.
>
> On my office PC the overhead from EA is about 20%.

Can you please run the test program in this thread, if you are running Linux?
http://marc.theaimsgroup.com/?l=linux-kernel&m=114297656924494&w=2

The machine I reported to show the problem says

# ./a.out
0xe408 0
./a.out: Detect PM-Timer Bug

--
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Hmm, I take that back. I updated to the current CVS tip on this machine > and now I don't see the problem. Without EXPLAIN ANALYZE, the times are > around 115-117 ms. With EXPLAIN ANALYZE, 149-152ms. At the moment, CVS HEAD still has the sampling patch in it. You need to be looking at 8.1, or HEAD from several weeks ago, to see the full impact of gettimeofday-every-time. (I imagine Simon's strace count is wrong for the same reason.) regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> Hmm, I take that back. I updated to the current CVS tip on this >> machine and now I don't see the problem. Without EXPLAIN ANALYZE, >> the times are around 115-117 ms. With EXPLAIN ANALYZE, 149-152ms. > > At the moment, CVS HEAD still has the sampling patch in it. You need > to be looking at 8.1, or HEAD from several weeks ago, to see the full > impact of gettimeofday-every-time. (I imagine Simon's strace count > is wrong for the same reason.) > Does my post from yesterday (39ms without explain analyze, 280+ with explain analyze) on FreeBSD/amd64, Dual Xeon's in HTT mode help? What can I do to further the investigation? It has 8.1.4 on it. LER -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 512-248-2683 E-Mail: ler@lerctr.org US Mail: 430 Valona Loop, Round Rock, TX 78681-3893
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Hmm, I take that back. I updated to the current CVS tip on this machine > > and now I don't see the problem. Without EXPLAIN ANALYZE, the times are > > around 115-117 ms. With EXPLAIN ANALYZE, 149-152ms. > > At the moment, CVS HEAD still has the sampling patch in it. You need to > be looking at 8.1, or HEAD from several weeks ago, to see the full > impact of gettimeofday-every-time. (I imagine Simon's strace count > is wrong for the same reason.) Ok, then I re-retract, because the test I posted previously was done with 8.1. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
"Larry Rosenman" <ler@lerctr.org> writes: > Does my post from yesterday (39ms without explain analyze, 280+ with explain > analyze) on FreeBSD/amd64, Dual Xeon's in HTT mode help? Well, it confirms that FreeBSD is subject to the same problem ;-). I think the bottom line here is that there are some machines out there where gettimeofday() is fast enough for our purposes, and some where it is nowhere near fast enough. And that kernel-level fixes may be possible for some of the latter, but not all, and we shouldn't hold our breath waiting for that to happen anyway. To tell you the truth, this information makes me even less pleased with the sampling-gettimeofday patch than I was before. If gettimeofday() in itself increases the runtime of a node by a factor of 10, then just trying to subtract off that time is no solution. There's too much impact on surrounding nodes, and too much roundoff error anyhow. I had thought we were applying an order-of-ten-percent correction by subtracting SampleOverhead, not an order-of-10x correction :-( regards, tom lane
On Fri, 2006-06-09 at 10:00 -0400, Tom Lane wrote: > I think the bottom line here is that there are some machines out there > where gettimeofday() is fast enough for our purposes, and some where > it is nowhere near fast enough. And that kernel-level fixes may be > possible for some of the latter, but not all, and we shouldn't hold our > breath waiting for that to happen anyway. Agreed. The issue seems to be some systems are set to get exactly correct times and some are set to return a time (possibly imprecise) with low overhead. Even if fixes existed, OS packagers may not pick the right one of those two options for our purposes for EA. (We might prefer accuracy to speed for other parts of PostgreSQL anyway). I propose we revert the sampling patch (sorry Martijn) and go with the patch to have an explain_analyze_timing parameter (default=on). That way we'll have an option to turn off timing *if* we happen to be running on a OS/hw combo that sucks *and* trying to run large enough EAs that we care. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > I propose we revert the sampling patch (sorry Martijn) yah ... > and go with the > patch to have an explain_analyze_timing parameter (default=on). This I'm unexcited about. EXPLAIN output isn't all that transparent anyway, and losing the extra cue of seeing where the time is really going would make it extremely easy for people to misinterpret their problems. I was intending to push forward with the idea of being able to get numbers out of a canceled EXPLAIN. That will allow you to get some information even when the underlying query runs longer than you're willing to tolerate. I still say that the number of queries where avoiding gettimeofday overhead would transform an intolerable runtime into a tolerable one is pretty limited. The other thing that I think would be worth investigating is timer-driven sampling, although it's not yet clear whether we can make that work usefully. Anyone want to take up that project? regards, tom lane
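As a rough illustration of the timer-driven idea (not an actual proposal or patch), a periodic profiling signal could be charged to whichever plan node is currently executing; the node bookkeeping below is entirely hypothetical:

    /* Sketch under stated assumptions: "current_node" would be maintained by
     * the executor as it enters and leaves nodes; the tick just attributes
     * time to whichever node is active when the signal fires. */
    #include <signal.h>
    #include <sys/time.h>

    #define MAX_PLAN_NODES 256

    static volatile sig_atomic_t sample_ticks[MAX_PLAN_NODES];
    static volatile sig_atomic_t current_node = 0;

    static void
    profile_tick(int signo)
    {
        sample_ticks[current_node]++;       /* one tick charged to the active node */
    }

    static void
    start_timer_sampling(void)
    {
        struct sigaction sa;
        struct itimerval it;

        sa.sa_handler = profile_tick;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;
        sigaction(SIGPROF, &sa, NULL);

        it.it_interval.tv_sec = 0;
        it.it_interval.tv_usec = 1000;      /* 1 ms tick, chosen arbitrarily */
        it.it_value = it.it_interval;
        setitimer(ITIMER_PROF, &it, NULL);  /* ticks against CPU time used */
    }

The attraction is that no gettimeofday() calls appear in the per-tuple path at all; the open question, as noted above, is whether the resulting per-node numbers are accurate enough to be useful.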
On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote: > To tell you the truth, this information makes me even less pleased with > the sampling-gettimeofday patch than I was before. If gettimeofday() in > itself increases the runtime of a node by a factor of 10, then just > trying to subtract off that time is no solution. There's too much > impact on surrounding nodes, and too much roundoff error anyhow. > I had thought we were applying an order-of-ten-percent correction by > subtracting SampleOverhead, not an order-of-10x correction :-( Eh? The whole point is to call gettimeofday() much less often. If you call it a thousandth as often, then the correction is only on the order of one hundredth of the normal query time... Subtracting SampleOverhead is only a correction on the order of a few percent; it's the reduced calling of gettimeofday() that provides the benefit. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
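To make that concrete, the sampling approach amounts to something like the following: time only every Nth execution of a node and extrapolate for the rest. This is a simplified sketch with made-up names, not the patch as committed:

    /* Simplified sketch of sampled instrumentation; names are hypothetical. */
    #include <sys/time.h>

    #define SAMPLE_EVERY 50              /* arbitrary interval for the sketch */

    typedef struct SampleInstr
    {
        long    itercount;               /* total node executions seen */
        long    samplecount;             /* executions actually timed */
        double  sampled_time;            /* seconds measured on those */
    } SampleInstr;

    static void
    run_node_once(SampleInstr *si, void (*node_exec)(void))
    {
        struct timeval t0, t1;
        int     do_time = (si->itercount % SAMPLE_EVERY) == 0;

        if (do_time)
            gettimeofday(&t0, NULL);
        node_exec();
        if (do_time)
        {
            gettimeofday(&t1, NULL);
            si->sampled_time += (t1.tv_sec - t0.tv_sec) +
                                (t1.tv_usec - t0.tv_usec) / 1e6;
            si->samplecount++;
        }
        si->itercount++;
    }

The total time is then estimated as the sampled time plus (average sampled time minus the estimated per-sample overhead) times the number of unsampled executions, which is exactly the correction dissected in the next message.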
It would be nice to keep the gettimeofday()s wherever they are most useful on hardware/software where they are cheap. Perhaps a compile-time option? On Fri, June 9, 2006 11:18 am, Martijn van Oosterhout wrote: > On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote: > >> To tell you the truth, this information makes me even less pleased with >> the sampling-gettimeofday patch than I was before. If gettimeofday() >> in itself increases the runtime of a node by a factor of 10, then just >> trying to subtract off that time is no solution. There's too much >> impact on surrounding nodes, and too much roundoff error anyhow. I had >> thought we were applying an order-of-ten-percent correction by >> subtracting SampleOverhead, not an order-of-10x correction :-( > > Eh? The whole point is to call gettimeofday() much less often. If you > call it 1000th as often, then the correction is only on the order of one > hundredth of the normal query time... > > Subtracting SampleOverhead is only a correction on the order of a few > percent, it's the reduced calling of gettimeofday() that provides the > benefit. > > Have a nice day, > -- > Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > >> From each according to his ability. To each according to his ability to >> litigate. >
Martijn van Oosterhout <kleptog@svana.org> writes: > On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote: >> I had thought we were applying an order-of-ten-percent correction by >> subtracting SampleOverhead, not an order-of-10x correction :-( > Eh? The whole point is to call gettimeofday() much less often. If you > call it 1000th as often, then the correction is only on the order of > one hundredth of the normal query time... No, because the correction calculation is

    totaltime += (average time per sampled execution - SampleOverhead)
                 * (number of unsampled executions)

If SampleOverhead is 90% of the average time per sampled execution, then you are multiplying something with a large component of cancellation error by a possibly-large number. As an example, using the numbers I posted here for my old PC: http://archives.postgresql.org/pgsql-hackers/2006-06/msg00407.php the actual runtime is clearly about 1 usec per tuple but enabling timing adds 8.5 usec per tuple. If we suppose we sampled 10000 out of 1 million rows, then we'd have

    raw totaltime = 10000 * 9.5 usec = 95 msec
    avg time/execution = 9.5 usec
    SampleOverhead = 8.5 usec
    number of unsampled executions = 990000
    correction = 990 msec

which means that a 10% error in estimating SampleOverhead would contribute as much to the final estimate as the actual measurement did. regards, tom lane
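Plugging those numbers into the correction formula makes the sensitivity concrete; the program below is only a sanity check of the arithmetic in this post, not backend code:

    /* Recomputes the example above and perturbs SampleOverhead by +/-10%. */
    #include <stdio.h>

    int
    main(void)
    {
        double  sampled = 10000, unsampled = 990000;
        double  avg_per_exec = 9.5e-6;                 /* seconds */
        double  raw_total = sampled * avg_per_exec;    /* 95 msec */
        double  overheads[] = {8.5e-6 * 0.9, 8.5e-6, 8.5e-6 * 1.1};

        for (int i = 0; i < 3; i++)
        {
            double  corr = (avg_per_exec - overheads[i]) * unsampled;

            printf("overhead %.2f usec -> correction %4.0f msec, total %4.0f msec\n",
                   overheads[i] * 1e6, corr * 1e3, (raw_total + corr) * 1e3);
        }
        return 0;
    }

With SampleOverhead at 8.5 usec the correction is 990 msec; misjudging it by 10% (0.85 usec) moves the correction by roughly 840 msec, which dwarfs the 95 msec that was actually measured.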
On Fri, 2006-06-09 at 10:56 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I propose we revert the sampling patch (sorry Martijn) > > yah ... > > > and go with the > > patch to have an explain_analyze_timing parameter (default=on). > > This I'm unexcited about. EXPLAIN output isn't all that transparent > anyway, and losing the extra cue of seeing where the time is really > going would make it extremely easy for people to misinterpret their > problems. As before, agreed, but it works and is available now. > I was intending to push forward with the idea of being able to get > numbers out of a canceled EXPLAIN. That will allow you to get some > information even when the underlying query runs longer than you're > willing to tolerate. I still say that the number of queries where > avoiding gettimeofday overhead would transform an intolerable runtime > into a tolerable one is pretty limited. That would be a good feature to have. > The other thing that I think would be worth investigating is > timer-driven sampling, although it's not yet clear whether we can > make that work usefully. Anyone want to take up that project? Not me, sorry. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Fri, Jun 09, 2006 at 11:22:11AM -0400, A.M. wrote: > It would be nice to keep the gettimeofday()s wherever they are most useful > on hardware/software where they are cheap. Perhaps a compile-time option? I think making it optional at compile time is worse, because then at support time you might have to recompile to get the option you want. Right now I'm confused though. I was under the impression the changes were going to be ripped out because it was decided to be unworkable. I think improvements can be made but I'm unsure if there's any interest... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Tom Lane <tgl@sss.pgh.pa.us> writes: > I was intending to push forward with the idea of being able to get > numbers out of a canceled EXPLAIN. That will allow you to get some > information even when the underlying query runs longer than you're > willing to tolerate. I still say that the number of queries where > avoiding gettimeofday overhead would transform an intolerable runtime > into a tolerable one is pretty limited. Are we still going to subtract out the gettimeofday overhead? I was always more excited about that than the sampling aspect. I've run into queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday overhead but I've never run into a query where gettimeofday overhead made running the query under EXPLAIN ANALYZE impractical. -- greg
Greg Stark <gsstark@mit.edu> writes: > Are we still going to subtract out the gettimeofday overhead? > I was always more excited about that than the sampling aspect. I've run into > queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday > overhead but I've never run into a query where gettimeofday overhead made > running the query under EXPLAIN ANALYZE impractical. That would be deceptive in a different way, ie, make it look like there was a whole bunch of outside-the-plan overhead. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > Are we still going to subtract out the gettimeofday overhead? > > > I was always more excited about that than the sampling aspect. I've run into > > queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday > > overhead but I've never run into a query where gettimeofday overhead made > > running the query under EXPLAIN ANALYZE impractical. > > That would be deceptive in a different way, ie, make it look like there > was a whole bunch of outside-the-plan overhead. Well that's exactly what there is. To avoid user confusion it would be reasonable to print out a line at the bottom:

    Explain analyze profiling overhead removed: xxx ms

That also gives the user feedback on how precise their explain analyze results are. If they see that the overhead being removed is as large as the timing remaining then they can realize that the results aren't especially precise. On the other hand if they see that the overhead being removed is small then they can be pretty confident in the precision of the results. -- greg
Greg Stark <gsstark@mit.edu> writes: > To avoid user confusion it would reasonable to print out a line at the bottom: > Explain analyze profiling overhead removed: xxx ms > That also gives the user feedback on how precise their explain analyze results > are. If they see that the overhead being removed is as large as the timing > remaining then they can realize that the results aren't especially precise. On > the other hand if they see that the overhead being removed is small then they > can be pretty confident in the precision of the results. I would prefer to leave the numbers unmanipulated, because frankly I'd have no confidence in the correction. Perhaps we could put a note at the bottom about "Estimated total profiling overhead: xxx ms". regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > To avoid user confusion it would reasonable to print out a line at the bottom: > > > Explain analyze profiling overhead removed: xxx ms > > > That also gives the user feedback on how precise their explain analyze results > > are. If they see that the overhead being removed is as large as the timing > > remaining then they can realize that the results aren't especially precise. On > > the other hand if they see that the overhead being removed is small then they > > can be pretty confident in the precision of the results. > > I would prefer to leave the numbers unmanipulated, because frankly I'd > have no confidence in the correction. Perhaps we could put a note at the > bottom about "Estimated total profiling overhead: xxx ms". Well the whole reason the overhead is deceptive is that some nodes call gettimeofday many more times proportionally to their running time than others. It makes users suspect cpu-intensive nodes that process many records when the real problem lies in a i/o-intensive node that processes few records. Just showing the total overhead will give the user the impression that time is distributed evenly throughout the nodes proportionally to their total time which doesn't help correct the distortion. There seem to be two types of overhead going on. There's the amount of time spent in gettimeofday itself which is pretty consistent. Then there's other effects that cause normal operation itself to take longer (or potentially even less long). Postgres may as well remove the consistent gettimeofday overhead it can identify and measure even if there's more it can't characterize perfectly. -- greg
On Fri, Jun 09, 2006 at 03:55:28PM -0400, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > There seem to be two types of overhead going on. There's the amount of time > > spent in gettimeofday itself which is pretty consistent. > > That is a fact not in evidence. The impression I had from that > linux-kernel discussion was that the problematic kernel code was looping > until it got consistent results from successive hardware reads. I'm > not at all sure that you can make the above assertion across all > varieties of clock hardware, or even all common varieties. The interesting thing about this is that they obviously are gearing gettimeofday() to be accurate, at the expense of speed, rather than just considering it a counter that is somewhat close to real time. What we want is just a monotonically increasing counter that can be read quickly and consistently; we're not majorly fussed if it doesn't match real time. This puts us back to CPU cycle counters, but they have drawbacks of their own. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
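For what it's worth, the kind of cheap, monotonic-but-not-wall-clock source being described could look roughly like this; whether either call is actually cheap, or the TSC usable at all, is exactly the platform-dependent question in this thread:

    /* Illustrative only: cost and reliability of both sources vary by platform. */
    #include <time.h>
    #include <stdint.h>

    #if defined(__x86_64__) || defined(__i386__)
    #include <x86intrin.h>

    static uint64_t
    read_cycle_counter(void)
    {
        /* raw TSC: very cheap, but can be unsynchronized across CPUs and
         * varies with clock-frequency changes, as noted downthread */
        return __rdtsc();
    }
    #endif

    static double
    read_monotonic_seconds(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);    /* POSIX; not wall-clock time */
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }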
Martijn van Oosterhout <kleptog@svana.org> writes: > Right now I'm confused though. I was under the impression the changes > were going to be ripped out because it was decided to be unworkable. I > think improvements can be made but I'm unsure if there's any > interest... I've reverted the current patch because it clearly doesn't work well enough. There's nothing stopping you from having a better idea though. It's clear that on some platforms the cost of gettimeofday is high enough that some workaround would be good. regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes: > The interesting thing about this is that they obviously are gearing > gettimeofday() to be accurate, rather than just considering it a > counter that is somewhat close to real time. At the expense of speed. Not sure that that's an accurate description. What I think the kernel fuss is about is that they have to read the counter value as several separate "byte read" operations, and if the hardware doesn't latch the whole counter value when the first byte is pulled then they'll get bytes from several distinct states of the counter, leading to something that's not consistent or even monotonic. On non-latching hardware there's really not a lot of choice what to do. The patch is about not using that same very-slow code path on hardware that does latch. regards, tom lane
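The "loop until consistent" reading being described works roughly like the following pattern; the hardware read functions here are hypothetical placeholders:

    /* Sketch of reading a non-latching counter split across two registers. */
    #include <stdint.h>

    extern uint32_t read_counter_high(void);   /* hypothetical hardware reads */
    extern uint32_t read_counter_low(void);

    static uint64_t
    read_counter64(void)
    {
        uint32_t    hi, lo, hi2;

        do
        {
            hi = read_counter_high();
            lo = read_counter_low();
            hi2 = read_counter_high();
        } while (hi != hi2);                   /* low half wrapped mid-read; retry */

        return ((uint64_t) hi << 32) | lo;
    }

On latching hardware the retry loop and the extra reads can be skipped, which is what the kernel patch under discussion tries to exploit.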
Greg Stark <gsstark@mit.edu> writes: > There seem to be two types of overhead going on. There's the amount of time > spent in gettimeofday itself which is pretty consistent. That is a fact not in evidence. The impression I had from that linux-kernel discussion was that the problematic kernel code was looping until it got consistent results from successive hardware reads. I'm not at all sure that you can make the above assertion across all varieties of clock hardware, or even all common varieties. regards, tom lane
>>>>> "MvO" == Martijn van Oosterhout <kleptog@svana.org> writes: MvO> What we want is just a monotonically increasing counter that can MvO> be read quickly and consistantly, we're not majorly fussed if it MvO> doesn't match real time. This puts us back to CPU cycle counters, MvO> but they have drawbacks of their own. It is amazing how this discussion is mirroring discussions on the linux-kernel list. Applications have been using CPU cycle counters on Linux to avoid the gettimeofday() overhead. With reasonably recent kernels, the overhead is very low when the CPU cycle counters are usable, because gettimeofday() never actually enters kernel space. Unfortunately fewer and fewer systems seem to have usable cycle counters. As an example, dual core Athlon64/Opteron boots with the cycle counters unsynced. The kernel can compensate for that. However they also lose sync whenever clock frequency changes, and the kernel has a really hard time compensating for it. On such systems the kernel switches back to slower timers and gettimeofday() becomes a real system call. Applications should not try to use cycle counters directly on such systems. If the kernel developers manage a good workaround, gettimeofday() becomes fast again, but applications which use cycle counters most likely stay broken. Basically either gettimeofday() is fast, or the cycle counters are useless -- unless you really care about counting CPU cycles and not real time. Some CPUs like Transmetas actually get the counter thing right and count 2 every tick when running at half speed and so on. /Benny
Ron, On 6/8/06 11:49 AM, "Ron Mayer" <rm_pg@cheapcomplexdevices.com> wrote: > Experimental results here suggest that for larger tables Linux seems > to detect a seq-scan and not bother caching. It's very reproducible > for me here to do a reboot and not see the full speedup on a seq_scan > until the third time I run a query. What you are seeing is the now infamous "Postgres writes a table one more time after loading" behavior. Simon Riggs once dug into it to find the root cause, and I no longer recall exactly why, but after you've loaded data, the first seq scan will re-write some large portion of the data while doing the initial scan. This wreaks havoc on normal benchmarking practices. Tom - can you explain what's going on with this? It seems to write more than just the contents of the WAL, so it's not a flush of the WAL writes AFAICT. - Luke
On Tue, Jun 13, 2006 at 04:54:05AM -0700, Luke Lonergan wrote: > > Experimental results here suggest that for larger tables Linux seems > > to detect a seq-scan and not bother caching. It's very reproducible > > for me here to do a reboot and not see the full speedup on a seq_scan > > until the third time I run a query. > > What you are seeing is the now infamous "Postgres writes a table one more > time after loading" behavior. > > Simon Riggs once dug into it to find the root cause, and I no longer recall > exactly why, but after you've loaded data, the first seq scan will re-write > some large portion of the data while doing the initial scan. This wreaks > havoc on normal benchmarking practices. Is it possible it may have something to do with the hint bits? There are a bunch of bits in the header for speeding up MVCC tests. Maybe changing those bits marks the page dirty and forces a write? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn, On 6/13/06 5:42 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote: > Is it possible it may have something to do with the hint bits? There's > are a bunch of bits in the header to deal with speeding up of MVCC > tests. Maybe changing those bits marks the page dirty and forces a > write? Yes, that's it. Writing data twice (three times including WAL) is a Very Bad Thing for large data work - how should we fix it? - Luke
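The mechanism being confirmed here works roughly as follows; the types and helper names below are simplified stand-ins for illustration, not the backend's actual tuple-header code:

    /* Why the first scan after a bulk load dirties (and rewrites) every page. */
    #include <stdbool.h>
    #include <stdint.h>

    #define HINT_XMIN_COMMITTED 0x0100      /* made-up flag bit for the sketch */

    typedef struct SketchTuple
    {
        uint32_t xmin;                      /* inserting transaction id */
        uint16_t infomask;                  /* hint bits live here */
    } SketchTuple;

    extern bool xid_did_commit(uint32_t xid);   /* stand-in for the clog lookup */
    extern void mark_page_dirty(void *page);    /* stand-in for the buffer call */

    static bool
    xmin_committed(SketchTuple *tup, void *page)
    {
        if (tup->infomask & HINT_XMIN_COMMITTED)
            return true;                        /* cheap path once the hint is set */
        if (!xid_did_commit(tup->xmin))
            return false;
        tup->infomask |= HINT_XMIN_COMMITTED;   /* cache the answer in the tuple... */
        mark_page_dirty(page);                  /* ...which dirties the page */
        return true;
    }

Every tuple visited for the first time after the load takes the slow path once, sets the hint, and leaves its page dirty, so the whole table ends up being written back out.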
On Tue, Jun 13, 2006 at 05:46:23AM -0700, Luke Lonergan wrote: > Martin, > > On 6/13/06 5:42 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote: > > > Is it possible it may have something to do with the hint bits? There's > > are a bunch of bits in the header to deal with speeding up of MVCC > > tests. Maybe changing those bits marks the page dirty and forces a > > write? > > Yes, that's it. Writing data twice (three times including WAL) is a Very > Bad Thing for large data work - how should we fix it? Well, I don't think you want to get rid of it entirely because otherwise, forevermore, every lookup in that table will require a check to see if the transaction is committed. So at some point the hint bit needs to be set and/or the xmin frozen (vacuum freeze does that for example). What you might be able to do is to reduce its effect. The thing that occurs to me is to make hint bit changes only mark a page half-dirty. If the page is evicted because the space is needed in the buffer cache, it can be just dropped. However, the bgwriter will write it as normal. Hence, setting hint bits will become a sort of background operation, done when there's time. It seems to me that if a large table is loaded in a single transaction, the check for whether the transaction is committed should be cheap because it's checking the same transaction id over and over. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
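Martijn's last observation, that a bulk-loaded table repeats the same xmin over and over, suggests the commit check can be memoized. A minimal sketch, with hypothetical names standing in for the real commit-log lookup:

    /* Caches the last xid found committed; commits don't un-happen, so a
     * positive answer stays valid. Negative answers are not cached. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    extern bool commit_log_lookup(TransactionId xid);   /* stand-in, assumed slow */

    static TransactionId last_committed_xid = 0;        /* 0 treated as invalid */

    static bool
    xid_did_commit_cached(TransactionId xid)
    {
        if (xid != 0 && xid == last_committed_xid)
            return true;                                 /* hot path for bulk loads */
        if (commit_log_lookup(xid))
        {
            last_committed_xid = xid;
            return true;
        }
        return false;
    }

Only positive answers are cached because an in-progress transaction can still commit later; for a table loaded in one transaction, nearly every check after the first hits the cached xid.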
Martijn, On 6/13/06 6:00 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote: > What you might be able to do is to reduce its effect. The thing that > occurs to me is to make hint bit changes only mark a page half-dirty. > If the page is evicted because the space is needed in the buffer cache, > it can be just dropped. However, the bgwriter will write it as normal. > Hence, setting hint bits will become a sort of background operation, > done when there's time. > > It seems to me that if a large table is loaded in a single transaction, > the check for if the transaction is committed should be cheap because > it's checking the same transaction id over and over. Thanks for the tip - it's now in the mill for mulling over and finding someone with the time to work something in... - Luke
"Luke Lonergan" <llonergan@greenplum.com> writes: > What you are seeing is the now infamous "Postgres writes a table one more > time after loading" behavior. > Tom - can you explain what's going on with this? Setting XMIN_COMMITTED hint bits. regards, tom lane
On Tue, 2006-06-13 at 15:00 +0200, Martijn van Oosterhout wrote: > What you might be able to do is to reduce its effect. The thing that > occurs to me is to make hint bit changes only mark a page half-dirty. > If the page is evicted because the space is needed in the buffer cache, > it can be just dropped. However, the bgwriter will write it as normal. > Hence, setting hint bits will become a sort of background operation, > done when there's time. Yes, I think that's a very good idea. What that introduces is the concept of "dirty priority" - i.e. some pages are more important to write out quickly than others. If the bgwriter cleans the higher priority ones first it should do a better job of keeping the bufferpool clean. That needs some work on it before it's a fully fledged proposal. > It seems to me that if a large table is loaded in a single transaction, > the check for if the transaction is committed should be cheap because > it's checking the same transaction id over and over. I was considering this the other day. My original idea was to set the xmin to be FrozenTransaction when loading a table with COPY in the same transaction as the one that loaded it but that didn't work because of an MVCC violation. If we just set the hint bit to show XMIN_COMMITTED then we need not worry about MVCC violations, since the xmin will still be higher than any pre-existing snapshots. I think Tom had a problem with that because it relied on file truncation on recovery, but I'm not clear on why exactly. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com