Thread: That EXPLAIN ANALYZE patch still needs work

That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
Just got this rather surprising result:

regression=# \timing
Timing is on.
regression=# explain analyze select * from tenk1 a join tenk1 b on a.unique1 = b.unique2;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=250.510..2557.129 rows=10000 loops=1)
   Hash Cond: (a.unique1 = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.084..30.070 rows=10000 loops=1)
   ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=250.242..250.242 rows=10000 loops=1)
         ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.019..23.317 rows=10000 loops=1)
 Total runtime: 847.855 ms
(6 rows)

Time: 856.179 ms
regression=# 

The "Total runtime" is correct AFAICT, which puts the top node's "actual
time" rather far out in left field.  This is pretty repeatable on my old
slow HPPA machine.  A new Xeon shows less of a discrepancy, but it's
still claiming top node actual > total, which is not right.

I realize that the sampling code can't be expected to be exactly right,
but a factor-of-three error is not acceptable.  One of us broke this :-(
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: Martijn van Oosterhout
On Mon, Jun 05, 2006 at 11:02:33PM -0400, Tom Lane wrote:
> Just got this rather surprising result:

<snip bogus explain output>

> The "Total runtime" is correct AFAICT, which puts the top node's "actual
> time" rather far out in left field.  This is pretty repeatable on my old
> slow HPPA machine.  A new Xeon shows less of a discrepancy, but it's
> still claiming top node actual > total, which is not right.

Weird. Can you get the output of *instr in each call of
InstrEndLoop? Preferably after it does the calculation but before it
clears the values. So we get an idea of number of samples and what it
guesses. SampleOverhead would be good too.

I know my version produced sensible results on my machine and the
handful of people testing, so I'll try it again with your changes, see
how it looks...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> Weird. Can you get the output of *instr in each call of
> InstrEndLoop? Preferably after it does the calculation but before it
> clears the values. So we get an idea of number of samples and what it
> guesses. SampleOverhead would be good too.

The problem looks to be an underestimation of SampleOverhead, and on
reflection it's clear why: what CalculateSampleOverhead is measuring
isn't the total overhead, but the time between the two gettimeofday
calls.  Which is probably about half the true overhead.  What we
ought to do is iterate InstrStartNode/InstrStopNode N times, and
*separately* measure the total elapsed time spent.

It occurs to me that what we really want to know is not so much the
total time spent in InstrStartNode/InstrStopNode, as the difference in
the time spent when sampling is on vs when it is off.  I'm not quite
sure if the time spent when it's off is negligible.  Off to do some
measuring ...
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
I wrote:
> The problem looks to be an underestimation of SampleOverhead, and on
> reflection it's clear why: what CalculateSampleOverhead is measuring
> isn't the total overhead, but the time between the two gettimeofday
> calls.  Which is probably about half the true overhead.

On further thought, I take that back: the "true overhead" is not the
point here.  The time elapsed during a plan node execution when
sampling can be broken down into three phases:

    1. time before first gettimeofday call
    2. time between gettimeofday calls
    3. time after second gettimeofday call
Only the second part is actually measured by the instrumentation system;
the other parts are overhead that has never been counted by EXPLAIN
ANALYZE, sampling version or no.  Moreover, most of the runtime of
InstrStartNode and InstrStopNode falls into the first or third parts.

What we would actually like to set SampleOverhead to is the portion
of the second-part runtime that doesn't occur when sampling = false.
Assuming that gettimeofday() has consistent runtime and the actual
time reported is measured at a consistent instant within that runtime,
I believe that we should take the SampleOverhead as just equal to
the runtime of a single gettimeofday() call.  The added or removed
second-part time within InstrStartNode is just the tail time of
gettimeofday, and the added or removed second-part time within
InstrStopNode is basically just the head time of gettimeofday.  (To make
this as true as possible, we need to change the order of operations so
that gettimeofday is invoked *immediately* after the "if (sampling)"
test, but that's easy.)

So this line of thought leads to the conclusion that
CalculateSampleOverhead is actually overestimating SampleOverhead
a bit, and we should simplify it to just time INSTR_TIME_SET_CURRENT().

But that still leaves me with a problem because my machine is clearly
overestimating the correction needed.  I added some printouts and got

raw totaltime = 0.370937
per_iter = 0.000156913, SampleOverhead = 3.28e-06
adj totaltime = 1.82976
sampling = 0
starttime = 0/000000
counter = 0/370937
firsttuple = 0.258321
tuplecount = 10000
itercount = 10001
samplecount = 704
nextsample = 10011
startup = 0.258321
total = 1.82976
ntuples = 10000
nloops = 1

on a run with an actual elapsed time near 750 msec.  Clearly the
sampling adjustment is wrong, but why?

I have a theory about this, and it's not pleasant at all.  What I
think is that we have a Heisenberg problem here: the act of invoking
gettimeofday() actually changes what is measured.  That is, the
runtime of the "second part" of ExecProcNode is actually longer when
we sample than when we don't, not merely due to the extra time spent
in gettimeofday().  It's not very hard to guess at reasons why, either.
The kernel entry is probably flushing some part of the CPU's state,
such as virtual/physical address mapping for the userland address
space.  After returning from the kernel call, the time to reload
that state shows up as more execution time within the "second part".

This theory explains two observations that otherwise are hard to
explain.  One, that the effect is platform-specific: your machine
may avoid flushing as much state during a kernel call as mine does.
And two, that upper plan nodes seem much more affected than lower
ones.  That makes sense because the execution cycle of an upper node
will involve touching more userspace data than a lower node, and
therefore more of the flushed TLB entries will need to be reloaded.

If this theory is correct, then the entire notion of EXPLAIN ANALYZE
sampling has just crashed and burned.  We can't ship a measurement
tool that is accurate on some platforms and not others.

I'm wondering if it's at all practical to go over to a gprof-like
measurement method for taking EXPLAIN ANALYZE runtime measurements;
that is, take an interrupt every so often and bump the count for the
currently active plan node.  This would spread the TLB-flush overhead
more evenly and thus avoid introducing that bias.  There may be too much
platform dependency in this idea, though, and I also wonder if it'd
break the ability to do normal gprof profiling of the backend.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: "Jim C. Nasby"
On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:
> I have a theory about this, and it's not pleasant at all.  What I
> think is that we have a Heisenberg problem here: the act of invoking
> gettimeofday() actually changes what is measured.  That is, the
> runtime of the "second part" of ExecProcNode is actually longer when
> we sample than when we don't, not merely due to the extra time spent
> in gettimeofday().  It's not very hard to guess at reasons why, either.
> The kernel entry is probably flushing some part of the CPU's state,
> such as virtual/physical address mapping for the userland address
> space.  After returning from the kernel call, the time to reload
> that state shows up as more execution time within the "second part".
> 
> This theory explains two observations that otherwise are hard to
> explain.  One, that the effect is platform-specific: your machine
> may avoid flushing as much state during a kernel call as mine does.
> And two, that upper plan nodes seem much more affected than lower
> ones.  That makes sense because the execution cycle of an upper node
> will involve touching more userspace data than a lower node, and
> therefore more of the flushed TLB entries will need to be reloaded.

If that's the case, then maybe a more sophisticated method of measuring
the overhead would work. My thought is that on the second call to pull a
tuple from a node (second because the first probably has some anomalies
due to startup), we measure the overhead for that node. This would
probably mean doing the following:

    get start time   # I'm not referring to this as gettimeofday to avoid confusion
    gettimeofday()   # this is the gettimeofday call that will happen during normal operation
    get end time

Hopefully, there's no caching effect that would come into play from not
actually touching any of the data structures after the gettimeofday()
call. If that's not the case, it makes measuring the overhead more
complex, but I think it should still be doable...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> If that's the case, then maybe a more sophisticated method of measuring
> the overhead would work.

I think you missed the point: the time spent in gettimeofday() itself
is not the major overhead of EXPLAIN ANALYZE.  At least it appears that
this is the case on my machine.

I'm thinking that interrupt-driven sampling might work OK though.
My previous worries were based on trying to use the ITIMER_PROF
timer, which might not be portable and would conflict with gprof
anyway.  But EXPLAIN ANALYZE has always been interested in real time
rather than CPU time, so the correct interrupt to use is ITIMER_REAL.
That means we only have to share with our own existing usages of that
interrupt, which turns it from a portability issue into just a Small
Matter Of Programming.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: Martijn van Oosterhout
On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:
> But that still leaves me with a problem because my machine is clearly
> overestimating the correction needed.  I added some printouts and got
>
> raw totaltime = 0.370937
> per_iter = 0.000156913, SampleOverhead = 3.28e-06
> adj totaltime = 1.82976
> sampling = 0
> starttime = 0/000000
> counter = 0/370937
> firsttuple = 0.258321
> tuplecount = 10000
> itercount = 10001
> samplecount = 704
> nextsample = 10011
> startup = 0.258321
> total = 1.82976
> ntuples = 10000
> nloops = 1
>
> on a run with an actual elapsed time near 750 msec.  Clearly the
> sampling adjustment is wrong, but why?

This doesn't make any sense at all. How can a sampling run that only
sampled 7% of the actual tuples end up with an actual measured time
that's more than 50% of the actual final runtime?

Can you get an estimate of the actual overhead (ie compare wall clock
time of EXPLAIN ANALYZE vs actual query). The cost of gettimeofday() is
on the order of a few hundred cycles; I'm not sure about the cost of
TLB flushes (that's probably highly architecture dependent).

To be honest, I wonder about caching effects, but for the disks. The
first few cycles of any plan (like an index scan) are going to incur
costs that won't happen later on. Because we sample much more heavily
at the beginning rather than the end, this will bias towards higher
numbers. You should be able to see this by seeing if running queries
that don't require disk access fare better.

That would suggest a much more careful correction method that works for
non-linear timing patterns...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> This doesn't make any sense at all. How can a sampling run that only
> sampled 7% of the actual tuples end up with an actual measured time
> that's more than 50% of the actual final runtime?

AFAICS, the only conclusion is that the sampled executions are in fact
taking longer than supposedly-equivalent unsampled ones, and by a pretty
good percentage too.  I'm growing unsure again about the mechanism
responsible for that, however.

> Can you get an estimate of the actual overhead (ie compare wall clock
> time of EXPLAIN ANALYZE vs actual query). The cost of gettimeofday() is
> on the order of a few hundred cycles; I'm not sure about the cost of
> TLB flushes (that's probably highly architecture dependent).

Here's some examples.  Keep in mind I've already determined that
gettimeofday() takes about 3 usec on this hardware ...

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=797.412..797.416 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.208..2576.528 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.190..2082.577 rows=10000 loops=1)
               Hash Cond: (a.unique1 = b.unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.082..3.718 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=207.933..207.933 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.017..3.583 rows=10000 loops=1)
 Total runtime: 805.036 ms
(8 rows)

Time: 816.463 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
 count 
-------
 10000
(1 row)

Time: 816.970 ms

The actual elapsed time for EXPLAIN ANALYZE seems to jump around quite
a bit, probably because of the random variation we're using in sampling
interval.  This particular combination was unusually close.  But in any
case, the *actual* overhead of EXPLAIN ANALYZE is clearly pretty small
here; the problem is that we're incorrectly extrapolating the measured
runtime to the unmeasured executions.

What's especially interesting is that the excess time doesn't seem to
show up if I form the query in a way that doesn't require pushing as
much data around:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1341.00..1341.01 rows=1 width=0) (actual time=212.313..212.317 rows=1 loops=1)
   ->  Hash Join  (cost=483.00..1316.00 rows=10000 width=0) (actual time=88.061..160.886 rows=10000 loops=1)
         Hash Cond: (a.unique1 = b.unique2)
         ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.071..4.068 rows=10000 loops=1)
         ->  Hash  (cost=458.00..458.00 rows=10000 width=4) (actual time=87.862..87.862 rows=10000 loops=1)
               ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.031..4.780 rows=10000 loops=1)
 Total runtime: 221.022 ms
(7 rows)

Time: 229.377 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
 count 
-------
 10000
(1 row)

Time: 202.531 ms
regression=# 

(Without the OFFSET 0, the planner flattens the subquery and discovers
that it doesn't actually need to fetch any of the non-join-key table
columns.)  Note the only plan nodes showing whacked-out timings are the
ones returning wide tuples (large "width" values).  I'm not entirely
sure what to make of this.  It could be interpreted as evidence for my
theory about TLB reloads during userspace data access being the problem.
But I'm getting a bit disenchanted with that theory after running the
same test case in 8.1:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=1033.866..1033.870 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.961..989.272 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.946..920.398 rows=10000 loops=1)
               Hash Cond: ("outer".unique1 = "inner".unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.050..66.861 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=330.745..330.745 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.010..65.890 rows=10000 loops=1)
 Total runtime: 1041.293 ms
(8 rows)

Time: 1051.594 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
 count 
-------
 10000
(1 row)

Time: 870.348 ms

Both runtimes are pretty repeatable.  That says that on this example,
where we have sampled 40000 node executions (remember HASH is only
called once), the overhead of EXPLAIN ANALYZE is about 4.5 usec per
node, which is not very far from twice the gettimeofday() time;
it certainly doesn't show any evidence for a huge additional overhead
due to TLB flushes.

One thing I was wondering about was whether the samplings of upper nodes
might be correlating to a greater-than-random extent with the samplings
of lower nodes.  That would result in the observed problem of
overestimating upper nodes.  It doesn't explain the apparent dependence
on targetlist width, though.  So I'm confused again.

> To be honest, I wonder about caching effects, but for the disks.

All of the above are repeated executions that should be fully cached
in RAM.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> And two, that upper plan nodes seem much more affected than lower
> ones.  That makes sense because the execution cycle of an upper node
> will involve touching more userspace data than a lower node, and
> therefore more of the flushed TLB entries will need to be reloaded.

I would have expected the opposite effect. If you only execute one instruction
then the cache miss can make it take many times longer than normal. But as the
number of instructions grows the cache gets repopulated and the overhead
levels off and becomes negligible relative to the total time.


The other option aside from gprof-like profiling would be to investigate those
cpu timing instructions again. I know some of them are unsafe on multi-cpu
systems but surely there's a solution out there. It's not like there aren't a
million games, music playing, and other kewl kid toys that depend on accurate
low overhead timing these days.


-- 
greg



Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> And two, that upper plan nodes seem much more affected than lower
>> ones.  That makes sense because the execution cycle of an upper node
>> will involve touching more userspace data than a lower node, and
>> therefore more of the flushed TLB entries will need to be reloaded.

> I would have expected the opposite effect. If you only execute one instruction
> then the cache miss can make it take many times longer than normal. But as the
> number of instructions grows the cache gets repopulated and the overhead
> levels off and becomes negligible relative to the total time.

Well, none of our plan nodes are in the "one instruction" regime ;-).
I was thinking that the total volume of data accessed was the critical
factor.  Right at the moment I'm disillusioned with the TLB-access
theory though.

Something I'm noticing right now is that it seems like only hash joins
are really seriously misestimated --- nest and merge joins have some
small issues but only the hash is way out there.  What's going on??
Can anyone else reproduce this?

> The other option aside from gprof-like profiling would be to
> investigate those cpu timing instructions again. I know some of them
> are unsafe on multi-cpu systems but surely there's a solution out
> there. It's not like there aren't a million games, music playing, and
> other kewl kid toys that depend on accurate low overhead timing these
> days.

Yeah, and they all work only on Windoze and Intel chips :-(
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: Simon Riggs
On Tue, 2006-06-06 at 16:50 -0400, Tom Lane wrote:

> I have a theory about this, and it's not pleasant at all.  What I
> think is that we have a Heisenberg problem here: the act of invoking
> gettimeofday() actually changes what is measured. 

> If this theory is correct, then the entire notion of EXPLAIN ANALYZE
> sampling has just crashed and burned.  We can't ship a measurement
> tool that is accurate on some platforms and not others.

Regrettably, I would agree and so conclude that we shouldn't pursue the
sampling idea further. Heisenbugs suck time like no other. Interesting,
though.

That leaves us with a number of possibilities:
0. Do Nothing
1. Option to skip timing altogether on an EXPLAIN ANALYZE
2. Option to produce a partial execution only, to locate problem areas.

Any others?

Option 2 would be harder to interpret, but still useful - originally
discussed in a current thread on -perform.
Option 1 wouldn't be as useful as the original sampling idea, but if it's
not on the table any longer....

I'd revert back to Option 1 as being the best choice for further work.

Do we agree the idea can't go further? What next?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com



Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> Do we agree the idea can't go further? What next?

It still needs investigation; I'm no longer convinced that the TLB-flush
theory is correct.  See rest of thread.  We may well have to revert the
current patch, but I'd like to be sure we understand why.

If we do have to revert, I'd propose that we pursue the notion of
interrupt-driven sampling like gprof uses.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: Martijn van Oosterhout
On Wed, Jun 07, 2006 at 09:53:32AM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Do we agree the idea can't go further? What next?
>
> It still needs investigation; I'm no longer convinced that the TLB-flush
> theory is correct.  See rest of thread.  We may well have to revert the
> current patch, but I'd like to be sure we understand why.

One thing I'm thinking of trying: instead of assuming we can estimate
the durations of all the nodes by taking the total time divided by the
number of samples, we assume that the duration of tuple X is similar
to that of tuple X+1 but not necessarily the same as all other
tuples.

This moves the calculation from EndLoop to StopInstr. Basically in
StopInstr you do the steps:

    if (sampling)
    {
        x = get time for this tuple
        n = number of tuples skipped
        cumulativetime += x * n
    }

This would mean that we wouldn't be assuming that tuples near the end
take as long as tuples near the beginning. Except we're now dealing
with smaller numbers, so I'm worried about error accumulation.

> If we do have to revert, I'd propose that we pursue the notion of
> interrupt-driven sampling like gprof uses.

How would that work? You could then estimate how much time was spent in
each node, but you no longer have any idea about when they were entered
or left.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Wed, Jun 07, 2006 at 09:53:32AM -0400, Tom Lane wrote:
>> If we do have to revert, I'd propose that we pursue the notion of
>> interrupt-driven sampling like gprof uses.

> How would that work? You could then estimate how much time was spent in
> each node, but you no longer have any idea about when they were entered
> or left.

Hm?  It would work the same way gprof works.  I'm imagining something
along the lines of

global variable:

    volatile Instrumentation *current_instr = NULL;

also add an "Instrumentation *parent_instr" field to Instrumentation

InstrStartNode does:

    instr->parent_instr = current_instr;
    current_instr = instr;

InstrStopNode restores the previous value of the global:

    current_instr = instr->parent_instr;

timer interrupt routine does this once every few milliseconds:

    total_samples++;
    for (instr = current_instr; instr; instr = instr->parent_instr)
        instr->samples++;

You still count executions and tuples in InstrStartNode/InstrStopNode,
but you never call gettimeofday there.  You *do* call gettimeofday at
beginning and end of the whole query to measure the total runtime,
and then you blame a fraction samples/total_samples of that on each
node.

The bubble-up of sample counts to parent nodes could perhaps be done
while printing the results instead of on-the-fly as sketched above, but
the above seems simpler.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: Greg Stark
Martijn van Oosterhout <kleptog@svana.org> writes:

> This would mean that we wouldn't be assuming that tuples near the end
> take as long as tuples near the beginning. Except we're now dealing
> with smaller numbers, so I'm worried about error accumulation.

Hm, that would explain why Hash joins suffer from this especially. Even when
functioning properly hashes get slower as the buckets fill up and there are
longer lists to traverse. Perhaps the hashes are suffering inordinately from
collisions though. Some of the data type hash functions looked kind of suspect
when I peeked at them a while back.

-- 
greg



Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> This would mean that we wouldn't be assuming that tuples near the end
>> take as long as tuples near the beginning. Except we're now dealing
>> with smaller numbers, so I'm worried about error accumulation.

> Hm, that would explain why Hash joins suffer from this especially. Even when
> functioning properly hashes get slower as the buckets fill up and there are
> longer lists to traverse.

Nope, that is certainly not the explanation, because the hash table is
loaded in the (single) call of the Hash node at the start of the query.
It is static all through the sampled-and-not executions of the Hash Join
node, which is where our problem is.

I don't see that Martijn's idea responds to the problem anyway, if it is
some sort of TLB-related issue.  The assumption we are making is not
"tuples near the end take as long as tuples near the beginning", it is
"tuples we sample take as long as tuples we don't" (both statements of
course meaning "on the average").  If the act of sampling incurs
overhead beyond the gettimeofday() call itself, then we are screwed,
and playing around with which iterations we sample and how we do the
extrapolation won't make the slightest bit of difference.

I'm unsure about the TLB-flush theory because I see no evidence of any
such overhead in the 8.1 timings; but on the other hand it's hard to see
what else could explain the apparent dependence on targetlist width.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Nope, that is certainly not the explanation, because the hash table is
> loaded in the (single) call of the Hash node at the start of the query.
> It is static all through the sampled-and-not executions of the Hash Join
> node, which is where our problem is.

At the risk of asking a stupid question, it's not perchance including that
hash build in the first sample it takes of the hash join node?

-- 
greg



Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Nope, that is certainly not the explanation, because the hash table is
>> loaded in the (single) call of the Hash node at the start of the query.
>> It is static all through the sampled-and-not executions of the Hash Join
>> node, which is where our problem is.

> At the risk of asking a stupid question, it's not perchance including that
> hash build in the first sample it takes of the hash join node?

Sure.  Which is one of the reasons why the first tuple is excluded from
the extrapolation...
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane
Ah-hah, I've sussed it.  The faulty assumption can actually be stated
as: "all the executions, except maybe the first, will take approximately
the same amount of time".  The failing test case I've been looking at
is one where the system decides to use a "batched" hash join, and in
this plan type some iterations take much longer than others.  (The
apparent dependence on targetlist width is now easy to understand,
because that affects the estimated hashtable size and hence the decision
whether batching is needed.  I'm not sure why I don't see the effect
when running the identical case on my other machine, but since the
other one is a 64-bit machine its space calculations are probably a
bit different.)

I added some printf's to instrument.c to print out the actual time
measurements for each sample, as well as the calculations in
InstrEndLoop.  Attached are the printouts that occurred for the HashJoin
node.  The thing that is killing the extrapolation is of course the very
large time for iteration 2, which the extrapolation includes in its
average.  But there's well over 10:1 variation in the later samples
as well.

On reflection it's easy to imagine other cases where some iterations
take much longer than others in a not-very-predictable way.  For
instance, a join where only a subset of the outer tuples have a match
is going to act that way.  I don't think there's any good way that we
can be sure we have a representative sample of executions, and so I'm
afraid this approach to sampling EXPLAIN ANALYZE is a failure.

I propose we revert this patch and think about an interrupt-driven
sampling method instead.
        regards, tom lane


401489a0 1: 331616 usec in iter 1
401489a0 1: 110338 usec in iter 2
401489a0 1: 54 usec in iter 3
401489a0 1: 99 usec in iter 4
401489a0 1: 77 usec in iter 5
401489a0 1: 145 usec in iter 6
401489a0 1: 117 usec in iter 7
401489a0 1: 33 usec in iter 8
401489a0 1: 97 usec in iter 9
401489a0 1: 98 usec in iter 10
401489a0 1: 52 usec in iter 11
401489a0 1: 33 usec in iter 12
401489a0 1: 51 usec in iter 13
401489a0 1: 83 usec in iter 14
401489a0 1: 153 usec in iter 15
401489a0 1: 115 usec in iter 16
401489a0 1: 52 usec in iter 17
401489a0 1: 242 usec in iter 18
401489a0 1: 48 usec in iter 19
401489a0 1: 87 usec in iter 20
401489a0 1: 23 usec in iter 21
401489a0 1: 80 usec in iter 22
401489a0 1: 57 usec in iter 23
401489a0 1: 17 usec in iter 24
401489a0 1: 51 usec in iter 25
401489a0 1: 18 usec in iter 26
401489a0 1: 16 usec in iter 27
401489a0 1: 100 usec in iter 28
401489a0 1: 45 usec in iter 29
401489a0 1: 174 usec in iter 30
401489a0 1: 131 usec in iter 31
401489a0 1: 17 usec in iter 32
401489a0 1: 45 usec in iter 33
401489a0 1: 16 usec in iter 34
401489a0 1: 120 usec in iter 35
401489a0 1: 15 usec in iter 36
401489a0 1: 17 usec in iter 37
401489a0 1: 15 usec in iter 38
401489a0 1: 48 usec in iter 39
401489a0 1: 127 usec in iter 40
401489a0 1: 36 usec in iter 41
401489a0 1: 41 usec in iter 42
401489a0 1: 69 usec in iter 43
401489a0 1: 50 usec in iter 44
401489a0 1: 104 usec in iter 45
401489a0 1: 22 usec in iter 46
401489a0 1: 50 usec in iter 47
401489a0 1: 17 usec in iter 48
401489a0 1: 47 usec in iter 49
401489a0 1: 54 usec in iter 50
401489a0 1: 46 usec in iter 51
401489a0 1: 20 usec in iter 54
401489a0 1: 38 usec in iter 55
401489a0 1: 68 usec in iter 56
401489a0 1: 17 usec in iter 60
401489a0 1: 16 usec in iter 61
401489a0 1: 15 usec in iter 67
401489a0 1: 31 usec in iter 68
401489a0 1: 15 usec in iter 70
401489a0 1: 61 usec in iter 78
401489a0 1: 143 usec in iter 85
401489a0 1: 21 usec in iter 89
401489a0 1: 14 usec in iter 96
401489a0 1: 21 usec in iter 104
401489a0 1: 21 usec in iter 107
401489a0 1: 16 usec in iter 116
401489a0 1: 194 usec in iter 118
401489a0 1: 136 usec in iter 122
401489a0 1: 34 usec in iter 127
401489a0 1: 46 usec in iter 131
401489a0 1: 15 usec in iter 133
401489a0 1: 15 usec in iter 135
401489a0 1: 34 usec in iter 137
401489a0 1: 54 usec in iter 142
401489a0 1: 206 usec in iter 151
401489a0 1: 75 usec in iter 162
401489a0 1: 20 usec in iter 172
401489a0 1: 66 usec in iter 177
401489a0 1: 21 usec in iter 181
401489a0 1: 69 usec in iter 186
401489a0 1: 16 usec in iter 193
401489a0 1: 46 usec in iter 201
401489a0 1: 33 usec in iter 210
401489a0 1: 50 usec in iter 216
401489a0 1: 21 usec in iter 222
401489a0 1: 18 usec in iter 224
401489a0 1: 33 usec in iter 229
401489a0 1: 20 usec in iter 232
401489a0 1: 44 usec in iter 236
401489a0 1: 29 usec in iter 239
401489a0 1: 34 usec in iter 240
401489a0 1: 31 usec in iter 241
401489a0 1: 27 usec in iter 254
401489a0 1: 45 usec in iter 257
401489a0 1: 147 usec in iter 259
401489a0 1: 15 usec in iter 269
401489a0 1: 16 usec in iter 278
401489a0 1: 14 usec in iter 279
401489a0 1: 58 usec in iter 290
401489a0 1: 15 usec in iter 291
401489a0 1: 53 usec in iter 295
401489a0 1: 15 usec in iter 306
401489a0 1: 16 usec in iter 318
401489a0 1: 34 usec in iter 328
401489a0 1: 37 usec in iter 339
401489a0 1: 28 usec in iter 346
401489a0 1: 21 usec in iter 353
401489a0 1: 15 usec in iter 361
401489a0 1: 49 usec in iter 370
401489a0 1: 61 usec in iter 380
401489a0 1: 21 usec in iter 385
401489a0 1: 50 usec in iter 397
401489a0 1: 36 usec in iter 411
401489a0 1: 44 usec in iter 417
401489a0 1: 55 usec in iter 420
401489a0 1: 15 usec in iter 427
401489a0 1: 74 usec in iter 429
401489a0 1: 178 usec in iter 433
401489a0 1: 182 usec in iter 440
401489a0 1: 41 usec in iter 454
401489a0 1: 39 usec in iter 455
401489a0 1: 28 usec in iter 470
401489a0 1: 57 usec in iter 473
401489a0 1: 15 usec in iter 480
401489a0 1: 21 usec in iter 489
401489a0 1: 76 usec in iter 498
401489a0 1: 38 usec in iter 513
401489a0 1: 21 usec in iter 523
401489a0 1: 33 usec in iter 529
401489a0 1: 39 usec in iter 545
401489a0 1: 25 usec in iter 559
401489a0 1: 27 usec in iter 566
401489a0 1: 15 usec in iter 574
401489a0 1: 25 usec in iter 590
401489a0 1: 62 usec in iter 606
401489a0 1: 27 usec in iter 610
401489a0 1: 16 usec in iter 625
401489a0 1: 49 usec in iter 641
401489a0 1: 27 usec in iter 654
401489a0 1: 22 usec in iter 665
401489a0 1: 17 usec in iter 669
401489a0 1: 39 usec in iter 682
401489a0 1: 27 usec in iter 684
401489a0 1: 68 usec in iter 697
401489a0 1: 15 usec in iter 700
401489a0 1: 16 usec in iter 712
401489a0 1: 15 usec in iter 728
401489a0 1: 22 usec in iter 746
401489a0 1: 16 usec in iter 752
401489a0 1: 45 usec in iter 766
401489a0 1: 44 usec in iter 777
401489a0 1: 33 usec in iter 785
401489a0 1: 21 usec in iter 802
401489a0 1: 23 usec in iter 814
401489a0 1: 15 usec in iter 820
401489a0 1: 21 usec in iter 839
401489a0 1: 23 usec in iter 840
401489a0 1: 39 usec in iter 858
401489a0 1: 94 usec in iter 867
401489a0 1: 33 usec in iter 877
401489a0 1: 56 usec in iter 895
401489a0 1: 14 usec in iter 898
401489a0 1: 22 usec in iter 900
401489a0 1: 14 usec in iter 910
401489a0 1: 20 usec in iter 918
401489a0 1: 14 usec in iter 937
401489a0 1: 28 usec in iter 942
401489a0 1: 29 usec in iter 944
401489a0 1: 21 usec in iter 947
401489a0 1: 68 usec in iter 957
401489a0 1: 46 usec in iter 969
401489a0 1: 15 usec in iter 971
401489a0 1: 16 usec in iter 987
401489a0 1: 22 usec in iter 1000
401489a0 1: 36 usec in iter 1001
401489a0 1: 22 usec in iter 1017
401489a0 1: 17 usec in iter 1031
401489a0 1: 36 usec in iter 1042
401489a0 1: 21 usec in iter 1062
401489a0 1: 15 usec in iter 1077
401489a0 1: 21 usec in iter 1097
401489a0 1: 88 usec in iter 1114
401489a0 1: 62 usec in iter 1123
401489a0 1: 16 usec in iter 1131
401489a0 1: 21 usec in iter 1149
401489a0 1: 15 usec in iter 1156
401489a0 1: 33 usec in iter 1177
401489a0 1: 22 usec in iter 1190
401489a0 1: 83 usec in iter 1201
401489a0 1: 16 usec in iter 1209
401489a0 1: 184 usec in iter 1224
401489a0 1: 33 usec in iter 1234
401489a0 1: 39 usec in iter 1236
401489a0 1: 21 usec in iter 1247
401489a0 1: 33 usec in iter 1267
401489a0 1: 16 usec in iter 1277
401489a0 1: 15 usec in iter 1297
401489a0 1: 14 usec in iter 1319
401489a0 1: 21 usec in iter 1329
401489a0 1: 15 usec in iter 1342
401489a0 1: 16 usec in iter 1357
401489a0 1: 57 usec in iter 1378
401489a0 1: 16 usec in iter 1400
401489a0 1: 16 usec in iter 1410
401489a0 1: 113 usec in iter 1417
401489a0 1: 16 usec in iter 1418
401489a0 1: 15 usec in iter 1433
401489a0 1: 23 usec in iter 1434
401489a0 1: 96 usec in iter 1442
401489a0 1: 30 usec in iter 1449
401489a0 1: 121 usec in iter 1454
401489a0 1: 16 usec in iter 1462
401489a0 1: 29 usec in iter 1482
401489a0 1: 48 usec in iter 1490
401489a0 1: 40 usec in iter 1512
401489a0 1: 20 usec in iter 1529
401489a0 1: 48 usec in iter 1545
401489a0 1: 45 usec in iter 1547
401489a0 1: 271 usec in iter 1557
401489a0 1: 15 usec in iter 1558
401489a0 1: 44 usec in iter 1576
401489a0 1: 15 usec in iter 1592
401489a0 1: 148 usec in iter 1616
401489a0 1: 15 usec in iter 1640
401489a0 1: 34 usec in iter 1650
401489a0 1: 27 usec in iter 1659
401489a0 1: 15 usec in iter 1673
401489a0 1: 59 usec in iter 1686
401489a0 1: 22 usec in iter 1689
401489a0 1: 16 usec in iter 1696
401489a0 1: 27 usec in iter 1715
401489a0 1: 23 usec in iter 1736
401489a0 1: 16 usec in iter 1749
401489a0 1: 27 usec in iter 1757
401489a0 1: 15 usec in iter 1777
401489a0 1: 15 usec in iter 1801
401489a0 1: 16 usec in iter 1804
401489a0 1: 227 usec in iter 1829
401489a0 1: 27 usec in iter 1848
401489a0 1: 16 usec in iter 1869
401489a0 1: 33 usec in iter 1881
401489a0 1: 57 usec in iter 1902
401489a0 1: 29 usec in iter 1926
401489a0 1: 15 usec in iter 1932
401489a0 1: 58 usec in iter 1955
401489a0 1: 15 usec in iter 1980
401489a0 1: 22 usec in iter 1992
401489a0 1: 14 usec in iter 2001
401489a0 1: 56 usec in iter 2010
401489a0 1: 22 usec in iter 2018
401489a0 1: 16 usec in iter 2043
401489a0 1: 15 usec in iter 2046
401489a0 1: 50 usec in iter 2051
401489a0 1: 16 usec in iter 2053
401489a0 1: 22 usec in iter 2078
401489a0 1: 89 usec in iter 2094
401489a0 1: 69 usec in iter 2103
401489a0 1: 33 usec in iter 2121
401489a0 1: 28 usec in iter 2124
401489a0 1: 34 usec in iter 2135
401489a0 1: 22 usec in iter 2148
401489a0 1: 22 usec in iter 2167
401489a0 1: 15 usec in iter 2186
401489a0 1: 40 usec in iter 2199
401489a0 1: 23 usec in iter 2219
401489a0 1: 21 usec in iter 2223
401489a0 1: 27 usec in iter 2235
401489a0 1: 14 usec in iter 2238
401489a0 1: 22 usec in iter 2250
401489a0 1: 31 usec in iter 2252
401489a0 1: 36 usec in iter 2259
401489a0 1: 15 usec in iter 2282
401489a0 1: 259 usec in iter 2300
401489a0 1: 37 usec in iter 2318
401489a0 1: 21 usec in iter 2331
401489a0 1: 16 usec in iter 2345
401489a0 1: 15 usec in iter 2350
401489a0 1: 16 usec in iter 2361
401489a0 1: 69 usec in iter 2364
401489a0 1: 29 usec in iter 2374
401489a0 1: 33 usec in iter 2383
401489a0 1: 116 usec in iter 2400
401489a0 1: 37 usec in iter 2420
401489a0 1: 39 usec in iter 2447
401489a0 1: 29 usec in iter 2455
401489a0 1: 114 usec in iter 2476
401489a0 1: 33 usec in iter 2488
401489a0 1: 22 usec in iter 2493
401489a0 1: 53 usec in iter 2512
401489a0 1: 22 usec in iter 2539
401489a0 1: 57 usec in iter 2563
401489a0 1: 32 usec in iter 2564
401489a0 1: 39 usec in iter 2572
401489a0 1: 22 usec in iter 2574
401489a0 1: 14 usec in iter 2597
401489a0 1: 74 usec in iter 2610
401489a0 1: 23 usec in iter 2618
401489a0 1: 21 usec in iter 2626
401489a0 1: 16 usec in iter 2627
401489a0 1: 45 usec in iter 2653
401489a0 1: 28 usec in iter 2662
401489a0 1: 20 usec in iter 2679
401489a0 1: 27 usec in iter 2694
401489a0 1: 17 usec in iter 2696
401489a0 1: 14 usec in iter 2706
401489a0 1: 22 usec in iter 2720
401489a0 1: 36 usec in iter 2742
401489a0 1: 20 usec in iter 2753
401489a0 1: 20 usec in iter 2761
401489a0 1: 16 usec in iter 2788
401489a0 1: 22 usec in iter 2795
401489a0 1: 16 usec in iter 2804
401489a0 1: 144 usec in iter 2812
401489a0 1: 16 usec in iter 2817
401489a0 1: 16 usec in iter 2834
401489a0 1: 16 usec in iter 2853
401489a0 1: 53 usec in iter 2872
401489a0 1: 22 usec in iter 2882
401489a0 1: 28 usec in iter 2907
401489a0 1: 38 usec in iter 2922
401489a0 1: 56 usec in iter 2935
401489a0 1: 188 usec in iter 2954
401489a0 1: 15 usec in iter 2964
401489a0 1: 22 usec in iter 2984
401489a0 1: 23 usec in iter 3007
401489a0 1: 27 usec in iter 3028
401489a0 1: 16 usec in iter 3051
401489a0 1: 178 usec in iter 3064
401489a0 1: 15 usec in iter 3082
401489a0 1: 20 usec in iter 3083
401489a0 1: 15 usec in iter 3097
401489a0 1: 14 usec in iter 3120
401489a0 1: 16 usec in iter 3126
401489a0 1: 15 usec in iter 3145
401489a0 1: 15 usec in iter 3171
401489a0 1: 15 usec in iter 3200
401489a0 1: 15 usec in iter 3218
401489a0 1: 15 usec in iter 3242
401489a0 1: 50 usec in iter 3263
401489a0 1: 16 usec in iter 3267
401489a0 1: 15 usec in iter 3284
401489a0 1: 14 usec in iter 3300
401489a0 1: 16 usec in iter 3317
401489a0 1: 15 usec in iter 3342
401489a0 1: 14 usec in iter 3372
401489a0 1: 15 usec in iter 3377
401489a0 1: 14 usec in iter 3379
401489a0 1: 14 usec in iter 3389
401489a0 1: 14 usec in iter 3417
401489a0 1: 16 usec in iter 3419
401489a0 1: 16 usec in iter 3446
401489a0 1: 16 usec in iter 3461
401489a0 1: 15 usec in iter 3475
401489a0 1: 16 usec in iter 3484
401489a0 1: 16 usec in iter 3489
401489a0 1: 14 usec in iter 3511
401489a0 1: 15 usec in iter 3519
401489a0 1: 15 usec in iter 3527
401489a0 1: 15 usec in iter 3552
401489a0 1: 15 usec in iter 3572
401489a0 1: 16 usec in iter 3574
401489a0 1: 14 usec in iter 3584
401489a0 1: 15 usec in iter 3586
401489a0 1: 16 usec in iter 3610
401489a0 1: 14 usec in iter 3619
401489a0 1: 15 usec in iter 3629
401489a0 1: 15 usec in iter 3658
401489a0 1: 14 usec in iter 3682
401489a0 1: 14 usec in iter 3683
401489a0 1: 15 usec in iter 3692
401489a0 1: 15 usec in iter 3718
401489a0 1: 15 usec in iter 3749
401489a0 1: 15 usec in iter 3758
401489a0 1: 16 usec in iter 3783
401489a0 1: 15 usec in iter 3796
401489a0 1: 15 usec in iter 3812
401489a0 1: 15 usec in iter 3818
401489a0 1: 15 usec in iter 3826
401489a0 1: 15 usec in iter 3836
401489a0 1: 15 usec in iter 3851
401489a0 1: 15 usec in iter 3875
401489a0 1: 57 usec in iter 3891
401489a0 1: 15 usec in iter 3922
401489a0 1: 15 usec in iter 3925
401489a0 1: 15 usec in iter 3933
401489a0 1: 16 usec in iter 3942
401489a0 1: 15 usec in iter 3967
401489a0 1: 14 usec in iter 3994
401489a0 1: 14 usec in iter 4015
401489a0 1: 15 usec in iter 4025
401489a0 1: 16 usec in iter 4056
401489a0 1: 15 usec in iter 4063
401489a0 1: 15 usec in iter 4070
401489a0 1: 15 usec in iter 4079
401489a0 1: 15 usec in iter 4100
401489a0 1: 15 usec in iter 4122
401489a0 1: 14 usec in iter 4149
401489a0 1: 14 usec in iter 4171
401489a0 1: 14 usec in iter 4186
401489a0 1: 16 usec in iter 4208
401489a0 1: 15 usec in iter 4231
401489a0 1: 15 usec in iter 4262
401489a0 1: 14 usec in iter 4280
401489a0 1: 14 usec in iter 4293
401489a0 1: 15 usec in iter 4300
401489a0 1: 14 usec in iter 4323
401489a0 1: 34 usec in iter 4336
401489a0 1: 15 usec in iter 4359
401489a0 1: 15 usec in iter 4378
401489a0 1: 15 usec in iter 4384
401489a0 1: 15 usec in iter 4388
401489a0 1: 15 usec in iter 4406
401489a0 1: 14 usec in iter 4432
401489a0 1: 15 usec in iter 4452
401489a0 1: 15 usec in iter 4460
401489a0 1: 15 usec in iter 4469
401489a0 1: 54 usec in iter 4492
401489a0 1: 15 usec in iter 4493
401489a0 1: 16 usec in iter 4505
401489a0 1: 16 usec in iter 4527
401489a0 1: 16 usec in iter 4557
401489a0 1: 15 usec in iter 4569
401489a0 1: 16 usec in iter 4588
401489a0 1: 15 usec in iter 4598
401489a0 1: 15 usec in iter 4607
401489a0 1: 55 usec in iter 4629
401489a0 1: 14 usec in iter 4644
401489a0 1: 14 usec in iter 4652
401489a0 1: 14 usec in iter 4680
401489a0 1: 14 usec in iter 4692
401489a0 1: 16 usec in iter 4696
401489a0 1: 15 usec in iter 4706
401489a0 1: 15 usec in iter 4718
401489a0 1: 15 usec in iter 4722
401489a0 1: 14 usec in iter 4730
401489a0 1: 16 usec in iter 4731
401489a0 1: 15 usec in iter 4743
401489a0 1: 15 usec in iter 4754
401489a0 1: 16 usec in iter 4785
401489a0 1: 14 usec in iter 4787
401489a0 1: 14 usec in iter 4808
401489a0 1: 14 usec in iter 4827
401489a0 1: 15 usec in iter 4861
401489a0 1: 15 usec in iter 4869
401489a0 1: 14 usec in iter 4890
401489a0 1: 15 usec in iter 4910
401489a0 1: 15 usec in iter 4943
401489a0 1: 14 usec in iter 4967
401489a0 1: 15 usec in iter 4969
401489a0 1: 15 usec in iter 4980
401489a0 1: 15 usec in iter 4985
401489a0 1: 16 usec in iter 5008
401489a0 1: 16 usec in iter 5042
401489a0 1: 15 usec in iter 5049
401489a0 1: 15 usec in iter 5055
401489a0 1: 15 usec in iter 5064
401489a0 1: 16 usec in iter 5071
401489a0 1: 16 usec in iter 5091
401489a0 1: 15 usec in iter 5117
401489a0 1: 15 usec in iter 5136
401489a0 1: 15 usec in iter 5162
401489a0 1: 15 usec in iter 5167
401489a0 1: 14 usec in iter 5180
401489a0 1: 15 usec in iter 5182
401489a0 1: 15 usec in iter 5193
401489a0 1: 14 usec in iter 5196
401489a0 1: 16 usec in iter 5206
401489a0 1: 14 usec in iter 5211
401489a0 1: 16 usec in iter 5221
401489a0 1: 15 usec in iter 5241
401489a0 1: 15 usec in iter 5272
401489a0 1: 54 usec in iter 5284
401489a0 1: 16 usec in iter 5288
401489a0 1: 15 usec in iter 5307
401489a0 1: 16 usec in iter 5337
401489a0 1: 15 usec in iter 5351
401489a0 1: 15 usec in iter 5352
401489a0 1: 16 usec in iter 5374
401489a0 1: 14 usec in iter 5409
401489a0 1: 15 usec in iter 5410
401489a0 1: 14 usec in iter 5442
401489a0 1: 14 usec in iter 5446
401489a0 1: 15 usec in iter 5465
401489a0 1: 14 usec in iter 5489
401489a0 1: 14 usec in iter 5518
401489a0 1: 14 usec in iter 5536
401489a0 1: 15 usec in iter 5563
401489a0 1: 53 usec in iter 5594
401489a0 1: 16 usec in iter 5610
401489a0 1: 15 usec in iter 5616
401489a0 1: 14 usec in iter 5626
401489a0 1: 15 usec in iter 5630
401489a0 1: 15 usec in iter 5662
401489a0 1: 15 usec in iter 5693
401489a0 1: 14 usec in iter 5699
401489a0 1: 14 usec in iter 5730
401489a0 1: 16 usec in iter 5754
401489a0 1: 15 usec in iter 5787
401489a0 1: 14 usec in iter 5789
401489a0 1: 15 usec in iter 5808
401489a0 1: 16 usec in iter 5809
401489a0 1: 16 usec in iter 5841
401489a0 1: 14 usec in iter 5849
401489a0 1: 14 usec in iter 5873
401489a0 1: 15 usec in iter 5887
401489a0 1: 14 usec in iter 5908
401489a0 1: 16 usec in iter 5937
401489a0 1: 15 usec in iter 5961
401489a0 1: 15 usec in iter 5982
401489a0 1: 15 usec in iter 5984
401489a0 1: 59 usec in iter 6004
401489a0 1: 14 usec in iter 6032
401489a0 1: 15 usec in iter 6051
401489a0 1: 15 usec in iter 6081
401489a0 1: 15 usec in iter 6112
401489a0 1: 15 usec in iter 6119
401489a0 1: 15 usec in iter 6136
401489a0 1: 60 usec in iter 6140
401489a0 1: 14 usec in iter 6167
401489a0 1: 56 usec in iter 6168
401489a0 1: 14 usec in iter 6179
401489a0 1: 13 usec in iter 6209
401489a0 1: 15 usec in iter 6227
401489a0 1: 14 usec in iter 6249
401489a0 1: 61 usec in iter 6250
401489a0 1: 14 usec in iter 6270
401489a0 1: 14 usec in iter 6291
401489a0 1: 13 usec in iter 6312
401489a0 1: 16 usec in iter 6331
401489a0 1: 14 usec in iter 6358
401489a0 1: 53 usec in iter 6359
401489a0 1: 15 usec in iter 6368
401489a0 1: 14 usec in iter 6373
401489a0 1: 13 usec in iter 6391
401489a0 1: 15 usec in iter 6411
401489a0 1: 15 usec in iter 6432
401489a0 1: 15 usec in iter 6454
401489a0 1: 15 usec in iter 6476
401489a0 1: 15 usec in iter 6508
401489a0 1: 15 usec in iter 6545
401489a0 1: 14 usec in iter 6564
401489a0 1: 15 usec in iter 6581
401489a0 1: 15 usec in iter 6611
401489a0 1: 14 usec in iter 6641
401489a0 1: 14 usec in iter 6661
401489a0 1: 15 usec in iter 6680
401489a0 1: 16 usec in iter 6700
401489a0 1: 15 usec in iter 6728
401489a0 1: 14 usec in iter 6738
401489a0 1: 15 usec in iter 6758
401489a0 1: 14 usec in iter 6777
401489a0 1: 15 usec in iter 6782
401489a0 1: 15 usec in iter 6784
401489a0 1: 15 usec in iter 6799
401489a0 1: 15 usec in iter 6803
401489a0 1: 16 usec in iter 6817
401489a0 1: 15 usec in iter 6831
401489a0 1: 14 usec in iter 6840
401489a0 1: 15 usec in iter 6861
401489a0 1: 14 usec in iter 6893
401489a0 1: 15 usec in iter 6911
401489a0 1: 14 usec in iter 6935
401489a0 1: 16 usec in iter 6950
401489a0 1: 15 usec in iter 6976
401489a0 1: 54 usec in iter 6987
401489a0 1: 16 usec in iter 6989
401489a0 1: 15 usec in iter 7020
401489a0 1: 14 usec in iter 7043
401489a0 1: 15 usec in iter 7065
401489a0 1: 14 usec in iter 7097
401489a0 1: 14 usec in iter 7110
401489a0 1: 16 usec in iter 7117
401489a0 1: 15 usec in iter 7126
401489a0 1: 15 usec in iter 7142
401489a0 1: 14 usec in iter 7153
401489a0 1: 14 usec in iter 7157
401489a0 1: 15 usec in iter 7160
401489a0 1: 15 usec in iter 7187
401489a0 1: 15 usec in iter 7223
401489a0 1: 15 usec in iter 7251
401489a0 1: 15 usec in iter 7271
401489a0 1: 15 usec in iter 7277
401489a0 1: 15 usec in iter 7298
401489a0 1: 14 usec in iter 7301
401489a0 1: 15 usec in iter 7334
401489a0 1: 16 usec in iter 7359
401489a0 1: 15 usec in iter 7378
401489a0 1: 15 usec in iter 7404
401489a0 1: 17 usec in iter 7417
401489a0 1: 17 usec in iter 7420
401489a0 1: 15 usec in iter 7431
401489a0 1: 16 usec in iter 7437
401489a0 1: 16 usec in iter 7451
401489a0 1: 16 usec in iter 7478
401489a0 1: 15 usec in iter 7502
401489a0 1: 15 usec in iter 7503
401489a0 1: 14 usec in iter 7535
401489a0 1: 15 usec in iter 7556
401489a0 1: 15 usec in iter 7558
401489a0 1: 52 usec in iter 7578
401489a0 1: 16 usec in iter 7583
401489a0 1: 13 usec in iter 7596
401489a0 1: 15 usec in iter 7618
401489a0 1: 14 usec in iter 7623
401489a0 1: 16 usec in iter 7624
401489a0 1: 15 usec in iter 7626
401489a0 1: 15 usec in iter 7636
401489a0 1: 16 usec in iter 7638
401489a0 1: 16 usec in iter 7661
401489a0 1: 16 usec in iter 7690
401489a0 1: 14 usec in iter 7716
401489a0 1: 16 usec in iter 7739
401489a0 1: 15 usec in iter 7767
401489a0 1: 15 usec in iter 7776
401489a0 1: 15 usec in iter 7815
401489a0 1: 16 usec in iter 7844
401489a0 1: 15 usec in iter 7875
401489a0 1: 15 usec in iter 7891
401489a0 1: 14 usec in iter 7893
401489a0 1: 14 usec in iter 7930
401489a0 1: 15 usec in iter 7937
401489a0 1: 14 usec in iter 7975
401489a0 1: 16 usec in iter 8001
401489a0 1: 14 usec in iter 8039
401489a0 1: 14 usec in iter 8044
401489a0 1: 15 usec in iter 8054
401489a0 1: 15 usec in iter 8080
401489a0 1: 15 usec in iter 8113
401489a0 1: 16 usec in iter 8120
401489a0 1: 15 usec in iter 8159
401489a0 1: 15 usec in iter 8174
401489a0 1: 16 usec in iter 8210
401489a0 1: 14 usec in iter 8226
401489a0 1: 55 usec in iter 8261
401489a0 1: 16 usec in iter 8291
401489a0 1: 15 usec in iter 8319
401489a0 1: 15 usec in iter 8322
401489a0 1: 14 usec in iter 8348
401489a0 1: 14 usec in iter 8359
401489a0 1: 14 usec in iter 8375
401489a0 1: 15 usec in iter 8401
401489a0 1: 15 usec in iter 8422
401489a0 1: 16 usec in iter 8437
401489a0 1: 15 usec in iter 8457
401489a0 1: 15 usec in iter 8485
401489a0 1: 14 usec in iter 8511
401489a0 1: 15 usec in iter 8512
401489a0 1: 16 usec in iter 8536
401489a0 1: 15 usec in iter 8540
401489a0 1: 15 usec in iter 8570
401489a0 1: 15 usec in iter 8609
401489a0 1: 15 usec in iter 8637
401489a0 1: 15 usec in iter 8648
401489a0 1: 15 usec in iter 8670
401489a0 1: 14 usec in iter 8673
401489a0 1: 16 usec in iter 8714
401489a0 1: 15 usec in iter 8729
401489a0 1: 14 usec in iter 8750
401489a0 1: 15 usec in iter 8787
401489a0 1: 14 usec in iter 8803
401489a0 1: 16 usec in iter 8816
401489a0 1: 15 usec in iter 8838
401489a0 1: 14 usec in iter 8843
401489a0 1: 15 usec in iter 8844
401489a0 1: 16 usec in iter 8865
401489a0 1: 17 usec in iter 8870
401489a0 1: 16 usec in iter 8879
401489a0 1: 16 usec in iter 8896
401489a0 1: 15 usec in iter 8937
401489a0 1: 15 usec in iter 8949
401489a0 1: 15 usec in iter 8960
401489a0 1: 15 usec in iter 8994
401489a0 1: 16 usec in iter 9027
401489a0 1: 17 usec in iter 9040
401489a0 1: 15 usec in iter 9046
401489a0 1: 15 usec in iter 9050
401489a0 1: 14 usec in iter 9057
401489a0 1: 16 usec in iter 9058
401489a0 1: 14 usec in iter 9075
401489a0 1: 15 usec in iter 9076
401489a0 1: 16 usec in iter 9099
401489a0 1: 14 usec in iter 9139
401489a0 1: 15 usec in iter 9158
401489a0 1: 15 usec in iter 9186
401489a0 1: 14 usec in iter 9212
401489a0 1: 15 usec in iter 9238
401489a0 1: 16 usec in iter 9269
401489a0 1: 17 usec in iter 9275
401489a0 1: 14 usec in iter 9298
401489a0 1: 16 usec in iter 9312
401489a0 1: 14 usec in iter 9325
401489a0 1: 15 usec in iter 9345
401489a0 1: 16 usec in iter 9362
401489a0 1: 16 usec in iter 9382
401489a0 1: 16 usec in iter 9387
401489a0 1: 15 usec in iter 9424
401489a0 1: 52 usec in iter 9435
401489a0 1: 15 usec in iter 9438
401489a0 1: 15 usec in iter 9481
401489a0 1: 15 usec in iter 9491
401489a0 1: 15 usec in iter 9499
401489a0 1: 16 usec in iter 9535
401489a0 1: 15 usec in iter 9554
401489a0 1: 14 usec in iter 9561
401489a0 1: 14 usec in iter 9576
401489a0 1: 15 usec in iter 9614
401489a0 1: 15 usec in iter 9640
401489a0 1: 14 usec in iter 9668
401489a0 1: 15 usec in iter 9700
401489a0 1: 16 usec in iter 9705
401489a0 1: 15 usec in iter 9729
401489a0 1: 15 usec in iter 9738
401489a0 1: 15 usec in iter 9777
401489a0 1: 15 usec in iter 9816
401489a0 1: 16 usec in iter 9822
401489a0 1: 15 usec in iter 9828
401489a0 1: 17 usec in iter 9870
401489a0 1: 15 usec in iter 9911
401489a0 1: 15 usec in iter 9916
401489a0 1: 15 usec in iter 9956
401489a0 1: 15 usec in iter 9991
401489a0 1: raw totaltime = 0.462843
401489a0 1: per_iter = 0.00017897, SampleOverhead = 3.29e-06
401489a0 1: adj totaltime = 2.12368
401489a0 1: sampling = 0
401489a0 1: starttime = 0/000000
401489a0 1: counter = 0/462843
401489a0 1: firsttuple = 0.331616
401489a0 1: tuplecount = 10000
401489a0 1: itercount = 10001
401489a0 1: samplecount = 721
401489a0 1: nextsample = 10018.7
401489a0 1: startup = 0.331616
401489a0 1: total = 2.12368
401489a0 1: ntuples = 10000
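
For reference, the adjusted total in the printout above can be roughly reconstructed from the raw figures: per_iter looks like the mean of the sampled iterations after the first, minus the estimated per-sample overhead, extrapolated over all iterations. This is a hedged reconstruction, not the actual instrument.c formula; it reproduces the printed per_iter exactly but the adjusted total only to within about 0.1%:

```python
# Reconstructing the extrapolation from the printed figures above.
# Approximation of what instrument.c appears to compute, not the
# actual source; "adj totaltime" matches only approximately.
raw_total   = 0.462843    # seconds actually spent in sampled iterations
firsttuple  = 0.331616    # iteration 1, excluded from the average
samplecount = 721
itercount   = 10001
overhead    = 3.29e-06    # estimated per-sample measurement overhead

# Mean of the 720 sampled iterations after the first, minus overhead.
per_iter = (raw_total - firsttuple) / (samplecount - 1) - overhead

# Extrapolate: first iteration's real time plus the mean for the rest.
adj_total = firsttuple + per_iter * (itercount - 1)

print(per_iter)   # ~0.00017897, matching the printout
print(adj_total)  # ~2.12, in the ballpark of "adj totaltime = 2.12368"
```

The reconstruction makes the failure visible: the node really ran for about 0.46 s of measured time, but one average inflated by early outliers gets multiplied by 10000 iterations.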


Re: That EXPLAIN ANALYZE patch still needs work

From
Martijn van Oosterhout
Date:
On Wed, Jun 07, 2006 at 03:32:35PM -0400, Tom Lane wrote:
> On reflection it's easy to imagine other cases where some iterations
> take much longer than others in a not-very-predictable way.  For
> instance, a join where only a subset of the outer tuples have a match
> is going to act that way.  I don't think there's any good way that we
> can be sure we have a representative sample of executions, and so I'm
> afraid this approach to sampling EXPLAIN ANALYZE is a failure.

I don't think we ever assumed it would never be a problem. We just
assumed that the sampling would cancel the effect out to give a decent
average.

Thing is, I never expected to see a difference of three orders of
magnitude between samples. That's far too much to correct for in any way.
The random sampling should counter most effects, and I didn't consider
the one tuple in a million that takes much longer to be a particularly
common case.

Anyway, as a test, if you take the approach that the measurement at
item X only applies to the tuples immediately preceding it, for the
data you posted you get a result of 0.681148 seconds. How long did that
query run that produced that data?

(The bit of perl I used is:

cat data | perl -lne 'BEGIN { $last=0; $sum =0 }
/: (\d+) usec in iter (\d+)/ && do { $sum += ($2-$last)*$1; $last=$2 };
END { print "$sum\n" }' )
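
For readers who don't speak perl, the same reweighting can be sketched in Python. The regex and field layout assume the trace format shown earlier in the thread; the function name is made up here:

```python
import re

def weighted_total_usec(lines):
    """Martijn's reweighting idea: credit each sampled measurement to
    every iteration since the previous sample, instead of assuming all
    iterations cost the per-sample average."""
    last_iter, total = 0, 0
    for line in lines:
        m = re.search(r': (\d+) usec in iter (\d+)', line)
        if m:
            usec, it = int(m.group(1)), int(m.group(2))
            total += (it - last_iter) * usec   # this sample covers the gap
            last_iter = it
    return total

# Tiny synthetic trace: 100 usec covers iter 1, 50 usec covers iters 2-3.
trace = ["x 1: 100 usec in iter 1", "x 1: 50 usec in iter 3"]
print(weighted_total_usec(trace))  # 200
```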

> I propose we revert this patch and think about an interrupt-driven
> sampling method instead.

That's another possibility of course...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Anyway, as a test, if you take the approach that the measurement at
> item X only applies to the tuples immediately preceding it, for the
> data you posted you get a result of 0.681148 seconds. How long did that
> query run that produced that data?

I didn't save the corresponding printout unfortunately, but it was
probably pretty similar to this:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=869.395..869.399 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=248.640..3368.313 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=248.609..2983.528 rows=10000 loops=1)
               Hash Cond: (a.unique1 = b.unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.084..21.525 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=248.269..248.269 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.025..22.760 rows=10000 loops=1)
 Total runtime: 877.265 ms
(8 rows)

Time: 888.469 ms
regression=#

The above idea won't fix it anyway, only move the failure cases around.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
"Jim C. Nasby"
Date:
On Wed, Jun 07, 2006 at 11:34:30AM -0400, Tom Lane wrote:
> timer interrupt routine does this once every few milliseconds:

One issue is that on very fast queries, you'd get absolutely no data
this way. A possible solution would be to start with an extremely short
timer interval, and ramp it up if the query runs longer. The downside is
that you'd need to somehow re-scale iterations every time the timer
interval changed. Of course another option is to start off using the
gettimeofday() method and switch to sampling after X seconds, but it'd
be nice if both code paths weren't needed.
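
The interrupt-driven scheme Tom floated upthread can be sketched roughly as follows. Everything here is illustrative and hypothetical, including the `Sampler` class, the `enter` call, and the 5 ms interval; it bears no relation to actual PostgreSQL code:

```python
import signal
from collections import Counter

class Sampler:
    """Rough sketch of interrupt-driven profiling: a periodic timer
    credits whichever plan node is currently executing with one tick,
    so node times are apportioned by tick counts rather than by
    per-tuple gettimeofday() calls.  Names and mechanism are made up."""

    def __init__(self, interval=0.005):
        self.interval = interval    # seconds between ticks
        self.ticks = Counter()      # node name -> sample count
        self.current = None         # node currently "on the CPU"

    def _tick(self, signum=None, frame=None):
        # Signal handler: attribute this tick to the active node.
        if self.current is not None:
            self.ticks[self.current] += 1

    def start(self):
        # Timer installation is separate from construction so the
        # accounting can be exercised without a live SIGALRM.
        signal.signal(signal.SIGALRM, self._tick)
        signal.setitimer(signal.ITIMER_REAL, self.interval, self.interval)

    def enter(self, node_name):
        # Executor would call this when control passes to a node.
        self.current = node_name
```

Jim's ramp-up concern maps onto `interval`: starting tiny and growing it would require rescaling earlier tick counts, which is exactly the wrinkle he points out.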

> The bubble-up of sample counts to parent nodes could perhaps be done
> while printing the results instead of on-the-fly as sketched above, but
> the above seems simpler.

It'd be nice if there was an option to not aggregate child runtimes to
their parents at all, since it'd make spotting hot spots much easier.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: That EXPLAIN ANALYZE patch still needs work

From
Simon Riggs
Date:
On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote:
> Ah-hah, I've sussed it. 

> so I'm
> afraid this approach to sampling EXPLAIN ANALYZE is a failure.

Hmmm, I thought we already got that bit...sorry to sound negative.

> I propose we revert this patch and think about an interrupt-driven
> sampling method instead.

I don't have much more faith in crazy scheme No.2 either. (Mine or
yours...)

Can we just have an option to avoid the timing altogether, please? I
don't want to have long discussions about instrumentation, I just want a
reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one
that we never, ever have to doubt whether the sampling works correctly
on a Miasmic-367 with HyperKooling. You could lose a month on -perform
going into the details of this for everybody - this was supposed to be a
simple additional feature.

If you're set on the sampling, great, but can we have the option to
avoid it completely also?

-- 
Simon Riggs
EnterpriseDB          http://www.enterprisedb.com



Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote:
>> I propose we revert this patch and think about an interrupt-driven
>> sampling method instead.

> I don't have much more faith in crazy scheme No.2 either. (Mine or
> yours...)

> Can we just have an option to avoid the timing altogether, please? I
> don't want to have long discussions about instrumentation, I just want a
> reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one
> that we never, ever have to doubt whether the sampling works correctly
> on a Miasmic-367 with HyperKooling.

Frankly, I think the pre-existing version of EXPLAIN ANALYZE is fine.
People have been hyperventilating about the timing overhead but I think
that it's perfectly acceptable as-is.  Certainly the removal of timing
is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
acceptable one; what it *is* likely to do is let you be misled about
which part of the query is the problem.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Wed, Jun 07, 2006 at 11:34:30AM -0400, Tom Lane wrote:
>> timer interrupt routine does this once every few milliseconds:

> One issue is that on very fast queries, you'd get absolutely no data
> this way.

Yeah.  Of course, on "very fast" queries you don't get very good data
from the existing method either --- with a machine fast enough to have
sub-microsecond plan node execution times, the accuracy of gettimeofday
really isn't good enough.

The other thing that was bothering me was whether disk I/O might be
undercounted by an interrupt-driven method.  I kinda doubt that any
kernel will save up N interrupts that occur while the process is blocked
on a slow read() ... you'll probably get only one.  So the whole idea
may be unworkable.

At the moment I think we need to revert the recent patch and go back
to the drawing board.  Improving on the original implementation of 
EXPLAIN ANALYZE is clearly not as easy as it looks.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Martijn van Oosterhout
Date:
On Wed, Jun 07, 2006 at 04:04:33PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Anyway, as a test, if you take the approach that the measurement at
> > item X only applies to the tuples immediately preceding it, for the
> > data you posted you get a result of 0.681148 seconds. How long did that
> > query run that produced that data?

<snip>

> The above idea won't fix it anyway, only move the failure cases around.

Well, if we're trying to make a system that never ever produces strange
looking results, then yes, we have to drop it. No matter how you
sample, there's going to be a failure mode somewhere.

If you are prepared to live with a certain margin (it will be within X%
of the real value Y% of the time) then I think that's an achievable
goal (I'm not saying that's necessarily what we have now). You could
always give people the choice of disabling sampling if it looks weird,
but that just gives people more knobs to twiddle and get upset about.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: That EXPLAIN ANALYZE patch still needs work

From
Simon Riggs
Date:
On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote:
> >> I propose we revert this patch and think about an interrupt-driven
> >> sampling method instead.
> 
> > I don't have much more faith in crazy scheme No.2 either. (Mine or
> > yours...)
> 
> > Can we just have an option to avoid the timing altogether, please? I
> > don't want to have long discussions about instrumentation, I just want a
> > reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one
> > that we never, ever have to doubt whether the sampling works correctly
> > on a Miasmic-367 with HyperKooling.
> 
> Frankly, I think the pre-existing version of EXPLAIN ANALYZE is fine.

I respect everybody's opinion, yours doubly so, as you know. But with
this current opinion, on this occasion, I believe you are in the
minority of those who have spent some time asking for EXPLAIN ANALYSEs
and have not received them because the run time is unacceptable.

> People have been hyperventilating about the timing overhead but I think
> that it's perfectly acceptable as-is.  

The word "hyperventilating" is subjective. From my side, I could use the
phrase "explaining the obvious" and potentially cause rancor also. We
should be able to discuss things without that. 

> Certainly the removal of timing
> is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
> acceptable one; 

I disagree, as have others.

> what it *is* likely to do is let you be misled about
> which part of the query is the problem.

A full EXPLAIN ANALYZE is always desirable - we agree on that. The
question is what we do when one is not available. Guessing leaves you
much more open to being misled. I'd like an option that we can take when
a full EXPLAIN ANALYZE is not possible, one that doesn't give different
results on different CPUs/OS.

We need this to help people who are in difficulty. This isn't a vanity
feature or a they-do-it-so-we-should-also. It's for support.

-- 
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com



Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote:
>> Certainly the removal of timing
>> is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
>> acceptable one; 

> I disagree, as have others.

The overhead seems to be on the order of a couple tens of percent usually.
I don't see how that makes the difference between an EXPLAIN ANALYZE you
can run and one you can't.

> A full EXPLAIN ANALYZE is always desirable - we agree on that. The
> question is what we do when one is not available.

The least bad alternative I've heard is to let EXPLAIN ANALYZE print
out stats-so-far if the query is canceled by control-C or statement
timeout.  The objection to this is you may mistake startup transients
for full query behavior ... but at least the numbers will be good as
far as they go.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
"Jim C. Nasby"
Date:
On Wed, Jun 07, 2006 at 05:05:23PM -0400, Tom Lane wrote:
> The other thing that was bothering me was whether disk I/O might be
> undercounted by an interrupt-driven method.  I kinda doubt that any
> kernel will save up N interrupts that occur while the process is blocked
> on a slow read() ... you'll probably get only one.  So the whole idea
> may be unworkable.

True, but if you get to the point where you're waiting on I/O, I would
think you could afford a gettimeofday() call. Even if the block we need
is in the OS cache, it's possible that the overhead of getting it from
there is high enough that the gettimeofday() call won't matter. FWIW, it
looks like running a much larger setting for shared_buffers (like, 50%
of memory) is a good way to boost performance, and that configuration
would make it less likely that you'd do a gettimeofday just to pull data
out of the OS cache.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: That EXPLAIN ANALYZE patch still needs work

From
Simon Riggs
Date:
On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote:
> >> Certainly the removal of timing
> >> is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
> >> acceptable one; 
> 
> > I disagree, as have others.
> 
> The overhead seems to be on the order of a couple tens of percent usually.
> I don't see how that makes the difference between an EXPLAIN ANALYZE you
> can run and one you can't.

Well, that's not my experience, and it doesn't match other results posted
on -hackers. 

A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE
to be consistently above 500% (or more than +400%, depending upon how
you style those numbers). This is for in-shared-buffers data, following
cache priming. Test shown below is typical result from 5 tests:

> postgres=# select count(*) from accounts;
>  count
> --------
>  100000
> (1 row)
> 
> Time: 267.008 ms
> postgres=# explain analyze select count(*) from accounts;
>                                                       QUERY PLAN 
> -----------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=10945.00..10945.01 rows=1 width=0) (actual time=1497.830..1497.837 rows=1 loops=1)
>    ->  Seq Scan on accounts  (cost=0.00..9998.20 rows=378720 width=0) (actual time=0.078..828.455 rows=100000 loops=1)
>  Total runtime: 1497.954 ms
> (3 rows)
> 
> Time: 1498.983 ms

Other timings were: 1493 1498 1707 1814 1827

EXPLAIN ANALYZE is designed to be run *when* you have unacceptable run
times and need to find out why. So making the execution time even more
unacceptable makes the utility infeasible at the time you need it most.
The additional run-time occurs on people's production systems, so they
generally aren't happy running long tasks.

BTW I think EA is great - so good in fact I want to run it more often.

-- 
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com



Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
>> The overhead seems to be on the order of a couple tens of percent usually.
>> I don't see how that makes the difference between an EXPLAIN ANALYZE you
>> can run and one you can't.

> Well, that's not my experience, and it doesn't match other results posted
> on -hackers. 

> A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE
> to be consistently above 500% (or more than +400%, depending upon how
> you style those numbers).

I think we ought to find out why your machine is so broken.

Even in this pretty-much-worst-case scenario (a seqscan does about as
little real work per plan node call as possible, especially if the table
is already fully cached), I don't see more than about a 2X degradation.
On queries that are complicated enough to actually need EXPLAIN ANALYZE,
it's not nearly that bad.

Old slow HPUX/HPPA machine, PG 8.1 branch tip:

bench=# \timing
Timing is on.
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 543.565 ms
-- do it again to ensure fully cached
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 492.667 ms
bench=# explain analyze select count(*) from accounts;
                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2975.00..2975.01 rows=1 width=0) (actual time=1172.856..1172.860 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2725.00 rows=100000 width=0) (actual time=0.175..720.741 rows=100000 loops=1)
 Total runtime: 1173.290 ms
(3 rows)

Time: 1176.293 ms
bench=#

Spiffy new Fedora 5/dual Xeon machine, PG 8.1 branch tip:

bench=# \timing
Timing is on.
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 61.737 ms
-- do it again to ensure fully cached
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 53.941 ms
bench=# explain analyze select count(*) from accounts;
                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2975.00..2975.01 rows=1 width=0) (actual time=117.881..117.882 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2725.00 rows=100000 width=0) (actual time=0.041..77.628 rows=100000 loops=1)
 Total runtime: 117.936 ms
(3 rows)

Time: 118.510 ms
bench=#

I'm too lazy to pull up any of my other machines right now, but this is
generally consistent with my experience ever since EXPLAIN ANALYZE was
written.

So: what's your platform exactly?
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Simon Riggs
Date:
On Thu, 2006-06-08 at 10:27 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
> >> The overhead seems to be on the order of a couple tens of percent usually.
> >> I don't see how that makes the difference between an EXPLAIN ANALYZE you
> >> can run and one you can't.
> 
> > Well, that's not my experience, and it doesn't match other results posted
> > on -hackers. 
> 
> > A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE
> > to be consistently above 500% (or more than +400%, depending upon how
> > you style those numbers).
> 
> I think we ought to find out why your machine is so broken.

> I'm too lazy to pull up any of my other machines right now, but this is
> generally consistent with my experience ever since EXPLAIN ANALYZE was
> written.

Great. Well, it isn't consistent with mine, or with what others have
posted to this list.

> So: what's your platform exactly?

FC5, Intel laptop running cvstip, new in January.

But that's irrelevant. I'm not a user; I solve others' problems, as you
know. Hence my interest in a usable tool to do that.

So far we have myself, Kevin, Martijn and Luke all saying there is a
distortion or a massive overhead caused by EXPLAIN ANALYZE.
http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php
http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php

It's real. I won't press the point further.

-- 
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com



Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> So far we have myself, Kevin, Martijn and Luke all saying there is a
> distortion or a massive overhead caused by EXPLAIN ANALYZE.
> http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php
> http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php

The first of these shows the exact opposite of what you are claiming,
and the second is without details of any sort that might help determine
what the actual problem is.

Given that we're seeing diametrically opposed results on the same OS
(FC5) and similar (at least all Intel) hardware, I think the prudent
thing is to find out what's really going on before leaping in with
proposed solutions.  As the person who's *not* seeing the problem,
I'm not in a position to do that investigation...
        regards, tom lane


I wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> A full EXPLAIN ANALYZE is always desirable - we agree on that. The
>> question is what we do when one is not available.

> The least bad alternative I've heard is to let EXPLAIN ANALYZE print
> out stats-so-far if the query is canceled by control-C or statement
> timeout.  The objection to this is you may mistake startup transients
> for full query behavior ... but at least the numbers will be good as
> far as they go.

I thought some more about this, and it seems doable but tricky (ie,
there are many wrong ways to do it).  Here are my conclusions:

We can't just use the normal QueryCancel logic that throws an
elog(ERROR) from someplace down inside the query.  This would leave
the backend in an unclean state from which we could only certainly
recover by doing AbortTransaction.  And once we've aborted the
transaction we can't do catalog accesses, which gets in the way of
producing the EXPLAIN printout.

Running the test query inside a subtransaction would fix that,
but aborting the subtransaction would throw away the executor state,
including the Instrumentation nodes we need.

So it seems we need a way to stop the query without raising an error
per se.  What I'm thinking is that while EXPLAIN ANALYZE is running,
timeout or SIGINT should not set QueryCancelPending, but instead
set a separate flag "ExplainCancelPending", which we then test in
ExecProcNode(), say
    if (node->instrument)
    {
+       if (ExplainCancelPending)
+           return NULL;
        InstrStartNode(node->instrument);
    }

There might be one or two other places to check it, but basically
we'd only notice the flag in very limited circumstances where
it's definitely safe to force early termination of ExecutorRun.

When control comes back to explain.c, we just print the results as
normal (but probably adding a line explicitly noting that the query
was abandoned before completion).  Note that we won't have any "running"
Instrumentation nodes to contend with, since the change doesn't cause
nodes to drop out after they've started timing.  So the data is good,
just incomplete.

After we've printed the results, we have a bit of a problem: if
ExplainCancelPending is set, we now need to abort the transaction.  It
would not do at all to allow an incompletely executed UPDATE to commit.
I experimented with throwing an elog() at the bottom of ExplainQuery()
after end_tup_output(), but this does not work: psql only prints the
error and not the data, because libpq throws away the query result
upon seeing the error.  We could probably hack psql to print the results
before noting the error, but I'm concerned about making a change that
would change the behavior for other error-at-end-of-statement cases.
Also, it's not clear what might have to happen to get non-psql clients
to play along.

It seems like the best solution is to establish a subtransaction around
the entire EXPLAIN command (not just the test query), which we can abort
after we've printed the results.

A possible objection to this is that running a query inside a
subtransaction might have different/worse performance than running it
at top level.  I don't recall any severe bottlenecks of that kind but
that doesn't mean there aren't any (Alvaro, any comments?)

Note that this would mean that ending an EXPLAIN ANALYZE early, via
either control-C or statement timeout, would be treated as a non-error
situation from the point of view of the outer transaction.  This bothers
me a bit, because in fact the effects if any of the tested query would
have been rolled back.  Not sure we have any choice though.  If we
expose the error then we'll have problems with clients not showing the
EXPLAIN results.

Thoughts?
        regards, tom lane


Running a query twice to ensure cached results.

From
Ron Mayer
Date:
Tom Lane wrote:
> -- do it again to ensure fully cached
> bench=# select count(*) from accounts;

Short summary:
 Does running a query only twice really ensure that a result is cached?
 It seems not to be the case for seq-scans on Linux.
 I think this may matter to the discussions about a readahead
 thread/process that come up on this list occasionally.
 

Experimental results here suggest that for larger tables Linux seems
to detect a seq-scan and not bother caching.   It's very reproducible
for me here to do a reboot and not see the full speedup on a seq_scan
until the third time I run a query.  An example shown below [1] shows
that the third run of a query is faster than the second run.  The
output of a 'vmstat 5' [2] while these queries was happening agrees
that significant I/O was still happening on the second run, but
no I/O happened the third time.   The table comfortably fits in
memory (700MB table on a 2GB ram machine) and the machine was
otherwise idle, so no one else wiped out the cache between the
first and second runs.

Why do I think this is worth mentioning here?

 * I think it impacts the occasional thread about wanting to include
   logic in postgresql for readahead [3], and the threads suggesting
   hinting to the OS through madvise or similar to avoid caching
   seq-scans.  It seems that Linux is detecting and at least somewhat
   reacting to seq scans even with no hinting, so anything added to
   postgresql might end up being a duplicated effort.  I think Bruce
   suggested that Solaris does this free-behind automatically [4], but
   this is the first I've noticed that Linux seems to do similar.

 * I think it matters to people who post explain analyze output without
   running the query enough times to get stable results.  (I note that
   this was not a problem for Tom, since the timing of his first and
   second runs were the same, so I assume he was just saying that he
   observed that the query was cached, rather than that the first run
   forced the second run to be cached.)
 
    Ron


=========================================================================
== [note 1] the repeated queries showing the speedup after 3 runs.
== Running the same select count(*) 4 times after a clean reboot.
== Seems the OS's caching logic decided that the first seq_scan
== wasn't 'interesting' enough
=========================================================================
fli=# select count(*) from facets_s;
  count
----------
 15976558
(1 row)

Time: 29788.047 ms
fli=# select count(*) from facets_s;
  count
----------
 15976558
(1 row)

Time: 19344.573 ms
fli=# select count(*) from facets_s;
  count
----------
 15976558
(1 row)

Time: 13411.272 ms
fli=# select count(*) from facets_s;
  count
----------
 15976558
(1 row)

Time: 13107.856 ms

################################################################################
# [note 2] vmstat 5 while the above queries were being run
################################################################################

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1    140  62140  71256 713360    0    0    47    31   92    84  7  1 92  0

*** the first time
 1  0    140  50860  31912 808304    0    2 25215    29 1147  2612 49 15  0 36
 1  0    360  54420  11112 855240    0    0 23934     7 1139  2553 47 14  0 39
 0  1    360  54008  11100 878708    0    0 23704    25 1149  2467 46 12  0 41
 0  1    360  52512  11140 896592    0    0 24062     6 1135  2460 47 11  0 41

*** the second time
 0  0    360  52688  11172 906916    0    0 13357    19 1085  1989 31  7 38 24
 1  0    360  53976  11076 912540    0   44 14273    57 1113  2102 32  7 29 32
 2  0    360  54788  10908 923788    0    0 24509    54 1171  2474 46 12  0 42
 1  0    360  54944   3096 939948    0    0 11180    39 1093  1976 65 13  0 22

*** the third time
 3  0    360  54280   3872 940508    0    0   264    14 1041  1560 85 15  0  0
 1  0    360  53852   3904 940940    0    0    88    29 1022  1505 53  9 36  2
 2  0    360  51616   4052 943068    0    0   443    54 1037  1552 82 15  0  4
 1  0    360  51488   4060 943180    0    0    22     2 1013  1522 84 16  0  0
 

#############
[3] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01449.php
[4] http://archives.postgresql.org/pgsql-performance/2003-10/msg00188.php


Tom Lane <tgl@sss.pgh.pa.us> writes:

> After we've printed the results, we have a bit of a problem: if
> ExplainCancelPending is set, we now need to abort the transaction.  It
> would not do at all to allow an incompletely executed UPDATE to commit.
> I experimented with throwing an elog() at the bottom of ExplainQuery()
> after end_tup_output(), but this does not work: psql only prints the
> error and not the data, because libpq throws away the query result
> upon seeing the error.  We could probably hack psql to print the results
> before noting the error, but I'm concerned about making a change that
> would change the behavior for other error-at-end-of-statement cases.
> Also, it's not clear what might have to happen to get non-psql clients
> to play along.

Would it be possible to make a whole new protocol message for EXPLAIN results?

The reason I'm suggesting that is because it might make it easier to implement
the SIGINFO handler that sends incremental EXPLAIN results on demand that I
was describing earlier.

Then libpq would have a separate api call to check for EXPLAIN results. If a
non-psql client doesn't check it then it doesn't get the EXPLAIN results but
the behaviour is correct. If it does then it can get the EXPLAIN results.

This might also let you run with EXPLAIN ANALYZE instrumentation but still get
the regular query results, since the explain analyze results would be
available out of band.

-- 
Gregory Stark
T + 514 938 2456



Gregory Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> Would it be possible to make a whole new protocol message for EXPLAIN results?

I'm really unwilling to get into that.  For one thing, that would
absolutely positively break *all* use of EXPLAIN from un-fixed clients,
whether you were trying to stop early or not.

> The reason I'm suggesting that is because it might make it easier to implement
> the SIGINFO handler that sends incremental EXPLAIN results on demand that I
> was describing earlier.

Doesn't matter, because that's not happening ;-)  SIGINFO isn't
portable, and even if it were, a signal handler couldn't possibly
generate EXPLAIN output (remember those catalog accesses).
        regards, tom lane


Re: Ending EXPLAIN ANALYZE early (was Re: That EXPLAIN

From
"A.M."
Date:
I think what he meant was "a separate EXPLAIN-CANCEL message" on a
cancel-type connection, which would be completely backwards compatible.
Old clients simply wouldn't be able to use the special EXPLAIN cancel,
just like it is now.

On Thu, June 8, 2006 3:01 pm, Tom Lane wrote:
> Gregory Stark <gsstark@mit.edu> writes:
>
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Would it be possible to make a whole new protocol message for EXPLAIN
>> results?
>
> I'm really unwilling to get into that.  For one thing, that would
> absolutely positively break *all* use of EXPLAIN from un-fixed clients,
> whether you were trying to stop early or not.
>
>> The reason I'm suggesting that is because it might make it easier to
>> implement the SIGINFO handler that sends incremental EXPLAIN results on
>> demand that I was describing earlier.
>
> Doesn't matter, because that's not happening ;-)  SIGINFO isn't
> portable, and even if it were, a signal handler couldn't possibly generate
> EXPLAIN output (remember those catalog accesses).
>
>
> regards, tom lane




Re: Running a query twice to ensure cached results.

From
Tom Lane
Date:
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>   Does running a query only twice really ensure that a result is cached?
>   It seems not to be the case for seq-scans on Linux.

Should work for tables small enough to fit into the shared_buffers
arena.  I wouldn't necessarily assume it for large tables.

Note that what I was really doing was taking the timing measurement
again on data cached by the *first* run, so that I would have something
that could fairly be compared to the following EXPLAIN ANALYZE --- which
of course would likewise be operating on cached data.
        regards, tom lane


Tom Lane <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Would it be possible to make a whole new protocol message for EXPLAIN results?
> 
> I'm really unwilling to get into that.  For one thing, that would
> absolutely positively break *all* use of EXPLAIN from un-fixed clients,
> whether you were trying to stop early or not.

Well, I was picturing still having the manual EXPLAIN ANALYZE, which works as
it does now (replacing the query results), in addition to a lower-level
protocol request which generates a protocol response with the data.

> > The reason I'm suggesting that is because it might make it easier to implement
> > the SIGINFO handler that sends incremental EXPLAIN results on demand that I
> > was describing earlier.
> 
> Doesn't matter, because that's not happening ;-)  SIGINFO isn't
> portable, and even if it were, a signal handler couldn't possibly
> generate EXPLAIN output (remember those catalog accesses).

Well the signal gets handled by psql. It would have to be a wire protocol
message like Query Cancel that the backend watches for and handles the first
time it can.

-- 
greg



Tom Lane wrote:

> A possible objection to this is that running a query inside a
> subtransaction might have different/worse performance than running it
> at top level.  I don't recall any severe bottlenecks of that kind but
> that doesn't mean there aren't any (Alvaro, any comments?)

Nope, nothing that I recall.  Maybe make sure to generate the Xid of the
subtransaction before starting the EXPLAIN itself (so that there is no
pg_subtrans I/O pending, etc)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Ending EXPLAIN ANALYZE early (was Re: That EXPLAIN

From
Simon Riggs
Date:
On Thu, 2006-06-08 at 13:51 -0400, Tom Lane wrote:
> I wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> >> A full EXPLAIN ANALYZE is always desirable - we agree on that. The
> >> question is what we do when one is not available.
> 
> > The least bad alternative I've heard is to let EXPLAIN ANALYZE print
> > out stats-so-far if the query is canceled by control-C or statement
> > timeout.  The objection to this is you may mistake startup transients
> > for full query behavior ... but at least the numbers will be good as
> > far as they go.
> 
> I thought some more about this

Thanks.

> So it seems we need a way to stop the query 

If we were able to stream results back to the client, then EA would be
simply a meta-query, emitting a stream of explain text tuples, rather
than result tuples. No further need for stopping the query to return
intermediate results.

If we did this say every minute, rather than on interrupt, you'd be
getting close to being able to build that "progress bar" that people
have been mentioning - I'd not really thought that was possible until
now. Would allow us to build a pg_top that would show the current EA
results for a query as it executes.

-- 
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com



Re: That EXPLAIN ANALYZE patch still needs work

From
Simon Riggs
Date:
On Thu, 2006-06-08 at 12:56 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > So far we have myself, Kevin, Martijn and Luke all saying there is a
> > distortion or a massive overhead caused by EXPLAIN ANALYZE.
> > http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php
> > http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php
>
> Given that we're seeing diametrically opposed results on the same OS
> (FC5) and similar (at least all Intel) hardware, I think the prudent
> thing is to find out what's really going on before leaping in with
> proposed solutions.  As the person who's *not* seeing the problem,
> I'm not in a position to do that investigation...

That seems reasonable.

I've cut a patch to remove timing from the EA results.
Output shown here:

postgres=# set explain_analyze_timing = on;
SET
Time: 0.673 ms
postgres=# explain analyze select count(*) from accounts;
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2890.00..2890.01 rows=1 width=0) (actual time=0.000..690.780 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2640.00 rows=100000 width=0) (actual time=0.000..2672.562 rows=100000 loops=1)
 Total runtime: 734.474 ms
(3 rows)

Time: 891.822 ms
postgres=# set explain_analyze_timing = off;
SET
Time: 0.480 ms
postgres=# explain analyze select count(*) from accounts;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=2890.00..2890.01 rows=1 width=0) (rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2640.00 rows=100000 width=0) (rows=100000 loops=1)
 Total runtime: 133.674 ms
(3 rows)

Time: 134.565 ms
postgres=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 130.528 ms

So the timing is clearly responsible for the additional time I'm
personally experiencing, and very likely for others also.

As to why that should be the case, I'm not sure. The timing overhead
seems fairly constant on particular hardware/OS, just different for
each.

--
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com


Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> So the timing is clearly responsible for the additional time I'm
> personally experiencing and very likely to be that for others also.

Well, that's certainly unsurprising, but the question is why it's such
a large overhead for you when it's not on other apparently-similar kit.

If anyone else is still following this discussion, could you try the
test case Simon gave a bit earlier (select count(*) from 100000-row
table, check \timing with and without explain analyze)?  If we could get
a clearer picture of which platforms show the problem and which don't,
maybe we could make some progress.

(BTW, I just tried it on my Apple Powerbook G4, and got results in line
with what I get elsewhere.  So I've now tried it on three very different
CPU types and OSes, and not duplicated Simon's problem on any of them.)
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Alvaro Herrera
Date:
Simon Riggs wrote:

> postgres=# explain analyze select count(*) from accounts;
>                                                        QUERY PLAN 
> ------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=2890.00..2890.01 rows=1 width=0) (actual
> time=0.000..690.780 rows=1 loops=1)
>    ->  Seq Scan on accounts  (cost=0.00..2640.00 rows=100000 width=0)
> (actual time=0.000..2672.562 rows=100000 loops=1)
>  Total runtime: 734.474 ms
> (3 rows)
> 
> Time: 891.822 ms

Wow, that is slow.  Maybe a problem in the kernel?  Perhaps something
similar to this:

http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: That EXPLAIN ANALYZE patch still needs work

From
"Larry Rosenman"
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> So the timing is clearly responsible for the additional time I'm
>> personally experiencing and very likely to be that for others also.
> 
> Well, that's certainly unsurprising, but the question is why it's such
> a large overhead for you when it's not on other apparently-similar
> kit. 
> 
> If anyone else is still following this discussion, could you try the
> test case Simon gave a bit earlier (select count(*) from 100000-row
> table, check \timing with and without explain analyze)?  If we could
> get a clearer picture of which platforms show the problem and which
> don't, maybe we could make some progress.
> 
> (BTW, I just tried it on my Apple Powerbook G4, and got results in
> line with what I get elsewhere.  So I've now tried it on three very
> different CPU types and OSes, and not duplicated Simon's problem on
> any of them.) 
> 

$ psql
Welcome to psql 8.1.4, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit
 

ler=# create table test1(id int);
CREATE TABLE
ler=# insert into test1 select * from generate_series(1,100000);
INSERT 0 100000
ler=# \timing
Timing is on.
ler=# select count(*) from test1;
 count
--------
 100000
(1 row)

Time: 39.191 ms
ler=# explain analyze select count(*) from test1;
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1988.18..1988.19 rows=1 width=0) (actual time=282.899..282.901 rows=1 loops=1)
   ->  Seq Scan on test1  (cost=0.00..1698.74 rows=115774 width=0) (actual time=0.007..147.845 rows=100000 loops=1)
 Total runtime: 282.987 ms
(3 rows)

Time: 283.764 ms
ler=# \q
$ uname -a
FreeBSD thebighonker.lerctr.org 6.1-STABLE FreeBSD 6.1-STABLE #59: Thu Jun
1 09:40:47 CDT 2006
root@thebighonker.lerctr.org:/usr/obj/usr/src/sys/THEBIGHONKER  amd64
$ sysctl hw
hw.machine: amd64
hw.model: Intel(R) Xeon(TM) CPU 3.00GHz
hw.ncpu: 4
hw.byteorder: 1234
hw.physmem: 4286132224
hw.usermem: 4003151872
hw.pagesize: 4096
hw.floatingpoint: 1
hw.machine_arch: amd64
hw.realmem: 5368709120

[snip]

the database is 8.1.4
-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 512-248-2683                 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893



Re: That EXPLAIN ANALYZE patch still needs work

From
Alvaro Herrera
Date:
Tom Lane wrote:

> If anyone else is still following this discussion, could you try the
> test case Simon gave a bit earlier (select count(*) from 100000-row
> table, check \timing with and without explain analyze)?  If we could get
> a clearer picture of which platforms show the problem and which don't,
> maybe we could make some progress.

Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.

The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
you were looking for.

All results are repeated multiple times, I'm reporting times other than
the first query (which is slower).

More specs on the Celeron machine follow:

$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 6
model name      : Celeron (Mendocino)
stepping        : 5
cpu MHz         : 534.616
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips        : 1070.25

$ uname -a
Linux drizzt 2.6.16-1-686 #2 Sat Apr 15 20:45:20 UTC 2006 i686 GNU/Linux

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
> the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.

> The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
> and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
> you were looking for.

Do you have oprofile installed on these?  Comparing oprofile results
might give some more insight where the time is going.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
"Jim C. Nasby"
Date:
On Thu, Jun 08, 2006 at 04:58:07PM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > So the timing is clearly responsible for the additional time I'm
> > personally experiencing and very likely to be that for others also.
> 
> Well, that's certainly unsurprising, but the question is why it's such
> a large overhead for you when it's not on other apparently-similar kit.
> 
> If anyone else is still following this discussion, could you try the
> test case Simon gave a bit earlier (select count(*) from 100000-row
> table, check \timing with and without explain analyze)?  If we could get
> a clearer picture of which platforms show the problem and which don't,
> maybe we could make some progress.
> 
> (BTW, I just tried it on my Apple Powerbook G4, and got results in line
> with what I get elsewhere.  So I've now tried it on three very different
> CPU types and OSes, and not duplicated Simon's problem on any of them.)

CPU: AMD Athlon(tm) 64 Processor 3500+ (2210.20-MHz K8-class CPU)
8.1.3 on amd64-portbld-freebsd6.0   25ms    353ms

CPU: AMD Opteron(tm) Processor 244 (1792.50-MHz K8-class CPU) (dual CPU)
8.1.4 on amd64-portbld-freebsd6.0   31ms    295ms

Powerbook G4 1.33GHz
8.1.4 on powerpc-apple-darwin8.6.0  5.1s    5.8s*

The powerbook tests were not very repeatable at 100,000 rows, so I
bumped up to 1M. The results still aren't very repeatable...
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4914.604 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 5186.516 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 5174.418 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5729.623..5729.624 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.155..4039.317 rows=1000000 loops=1)
 Total runtime: 5729.907 ms
(3 rows)

Time: 5732.076 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5916.025..5916.026 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.157..4246.123 rows=1000000 loops=1)
 Total runtime: 5916.261 ms
(3 rows)

Time: 5922.542 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5800.788..5800.789 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.152..4126.781 rows=1000000 loops=1)
 Total runtime: 5801.026 ms
(3 rows)

Time: 5803.070 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5994.285..5994.286 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.156..4341.463 rows=1000000 loops=1)
 Total runtime: 5994.520 ms
(3 rows)

Time: 5996.577 ms
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: That EXPLAIN ANALYZE patch still needs work

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
> > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.
> 
> > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
> > and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
> > you were looking for.
> 
> Do you have oprofile installed on these?  Comparing oprofile results
> might give some more insight where the time is going.

No, I don't.  I'll have a look.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> The powerbook tests were not very repeatable at 100,000 rows, so I
> bumped up to 1M. The results still aren't very repeatable...

Hmm, I suspect you pushed it into the realm where it's doing actual
I/O ... which we don't want for this particular test.  (Although it's
worth noting that if your query *is* doing actual I/O, the EXPLAIN
overhead certainly ought to be down in the noise.)
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
"Jim C. Nasby"
Date:
Just tested 2 windows boxes, both running 8.1.3 and XP SP2.

P4 2.8G desktop     47ms    297ms
Pentium M 2G laptop 40ms    240ms
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: That EXPLAIN ANALYZE patch still needs work

From
"Jim C. Nasby"
Date:
On Thu, Jun 08, 2006 at 05:32:36PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > The powerbook tests were not very repeatable at 100,000 rows, so I
> > bumped up to 1M. The results still aren't very repeatable...
> 
> Hmm, I suspect you pushed it into the realm where it's doing actual
> I/O ... which we don't want for this particular test.  (Although it's
> worth noting that if your query *is* doing actual I/O, the EXPLAIN
> overhead certainly ought to be down in the noise.)

Hmm... I guess it depends on how smart the OS cache is; the table is
36MB; 4406 pages. But shared buffers is 3000...

Bumping shared_buffers up to 5k, there is a bigger difference, but these
numbers are still all over the place, so I don't know that they're very
trust-worthy.

decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5398.625..5398.626 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.034..3967.927 rows=1000000 loops=1)
 Total runtime: 5398.871 ms
(3 rows)

Time: 5400.900 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5626.671..5626.671 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.035..3875.641 rows=1000000 loops=1)
 Total runtime: 5626.904 ms
(3 rows)

Time: 5628.896 ms
decibel=# explain analyze select count(*) from i;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5137.549..5137.549 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=999924 width=0) (actual time=0.020..3440.034 rows=1000000 loops=1)
 Total runtime: 5137.707 ms
(3 rows)

Time: 5139.178 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4357.443 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4251.208 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4712.912 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4479.278 ms
decibel=# 
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Wow, that is slow.  Maybe a problem in the kernel?  Perhaps something
> similar to this:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282

Yeah, that's a pretty interesting thread.  I came across something
similar on a Red Hat internal list.  It seems there are three or four
different popular standards for clock hardware in the Intel world,
and some good implementations and some pretty bad implementations
of each.  So the answer may well boil down to "if you're using cheap
junk PC hardware then gettimeofday will be slow".

In fact, I just got around to trying this on my old x86 PC, and
behold:

regression=# select count(*) from tenk1;
 count
-------
 10000
(1 row)

Time: 9.670 ms
regression=# explain analyze select count(*) from tenk1;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=470.00..470.01 rows=1 width=0) (actual time=93.549..93.553 rows=1 loops=1)
   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=0) (actual time=0.014..49.261 rows=10000 loops=1)
 Total runtime: 93.663 ms
(3 rows)

Time: 94.431 ms
regression=# 

So that says that this machine takes about 4 usec to do gettimeofday(),
as compared to 3 usec on my nominally 4x slower HPPA machine.

The new dual Xeon sitting next to it probably has a much less junky
motherboard, with a better clock ... and I find it unsurprising that
the HP and Apple machines I was trying aren't subject to such problems.

I didn't get the impression from that linux-kernel thread that the
proposed patch had actually gone in yet anyplace; anyone know how to
track that?
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Mark Kirkwood
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Wow, that is slow.  Maybe a problem in the kernel?  Perhaps something
>> similar to this:
>> http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282
> 
> Yeah, that's a pretty interesting thread.  I came across something
> similar on a Red Hat internal list.  It seems there are three or four
> different popular standards for clock hardware in the Intel world,
> and some good implementations and some pretty bad implementations
> of each.  So the answer may well boil down to "if you're using cheap
> junk PC hardware then gettimeofday will be slow".
> 

OS seems to matter as well - I've got two identical Supermicro P3TDER 
dual Intel boxes, one running FreeBSD 6.1, one running Gentoo Linux 2.6.16.

Doing the 'select count(*)' vs 'explain analyze select count(*)' test on 
a 100000-row table gives:

FreeBSD : select 108 ms  explain analyze 688 ms
Linux   : select 100 ms  explain analyze 196 ms

Both systems have ACPI enabled in the BIOS, which means there is a 
better timecounter than 'i8254' available (FreeBSD says it's using 
'ACPI-safe'; not sure how to check on Linux).

Cheers

Mark



Re: That EXPLAIN ANALYZE patch still needs work

From
Agent M
Date:
It's worth noting that on Darwin (on Apple hardware) gettimeofday is
never a syscall whereas on Linux (AFAIK), it always is.

On Jun 8, 2006, at 7:58 PM, Mark Kirkwood wrote:

> Tom Lane wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> Wow, that is slow.  Maybe a problem in the kernel?  Perhaps something
>>> similar to this:
>>> http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282
>> Yeah, that's a pretty interesting thread.  I came across something
>> similar on a Red Hat internal list.  It seems there are three or four
>> different popular standards for clock hardware in the Intel world,
>> and some good implementations and some pretty bad implementations
>> of each.  So the answer may well boil down to "if you're using cheap
>> junk PC hardware then gettimeofday will be slow".
>
> OS seems to matter as well - I've got two identical Supermicro P3TDER
> dual intel boxes. 1 running FreeBSD 6.1, one running Gentoo Linux
> 2.6.16.
>
> Doing the 'select count(*) vs explain analyze select count(*) on
> 100000 row table gives:
>
> Freebsd : select 108 ms  explain analyze 688 ms
> Linux   : select 100 ms  explain analyze 196 ms
>
> Both systems have ACPI enabled in BIOS (which means there is a better
> timecounter than 'i8254' available (FreeBSD says its using 'ACPI-safe'
> - not sure how to check on Linux).

AgentM
agentm@themactionfaction.com



Re: That EXPLAIN ANALYZE patch still needs work

From
Simon Riggs
Date:
On Thu, 2006-06-08 at 17:21 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
> > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.
> 
> > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
> > and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
> > you were looking for.
> 
> Do you have oprofile installed on these?  Comparing oprofile results
> might give some more insight where the time is going.

On my office PC the overhead from EA is about 20%.

stracing a backend, I also notice that on a 1.5 million row table we
make only 39158 calls to gettimeofday, on a table of 24591 blocks.

I'm thinking that the instrumentation overhead has been reduced as a
result of the page-at-a-time heap scans we now have?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com



Re: That EXPLAIN ANALYZE patch still needs work

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
> and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
> you were looking for.

Hmm, I take that back.  I updated to the current CVS tip on this machine
and now I don't see the problem.  Without EXPLAIN ANALYZE, the times are
around 115-117 ms.  With EXPLAIN ANALYZE, 149-152ms.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: That EXPLAIN ANALYZE patch still needs work

From
Alvaro Herrera
Date:
Simon Riggs wrote:
> On Thu, 2006-06-08 at 17:21 -0400, Tom Lane wrote:
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> > > Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
> > > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.
> > 
> > > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
> > > and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
> > > you were looking for.
> > 
> > Do you have oprofile installed on these?  Comparing oprofile results
> > might give some more insight where the time is going.
> 
> On my office PC the overhead from EA is about 20%.

Can you please run the test program in this thread, if you are running
Linux?

http://marc.theaimsgroup.com/?l=linux-kernel&m=114297656924494&w=2

The machine I reported to show the problem says

# ./a.out 0xe408
0
./a.out: Detect PM-Timer Bug

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Hmm, I take that back.  I updated to the current CVS tip on this machine
> and now I don't see the problem.  Without EXPLAIN ANALYZE, the times are
> around 115-117 ms.  With EXPLAIN ANALYZE, 149-152ms.

At the moment, CVS HEAD still has the sampling patch in it.  You need to
be looking at 8.1, or HEAD from several weeks ago, to see the full
impact of gettimeofday-every-time.  (I imagine Simon's strace count
is wrong for the same reason.)
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
"Larry Rosenman"
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Hmm, I take that back.  I updated to the current CVS tip on this
>> machine and now I don't see the problem.  Without EXPLAIN ANALYZE,
>> the times are around 115-117 ms.  With EXPLAIN ANALYZE, 149-152ms.
> 
> At the moment, CVS HEAD still has the sampling patch in it.  You need
> to be looking at 8.1, or HEAD from several weeks ago, to see the full
> impact of gettimeofday-every-time.  (I imagine Simon's strace count
> is wrong for the same reason.)
> 
Does my post from yesterday (39ms without explain analyze, 280+ ms with
explain analyze) on FreeBSD/amd64, dual Xeons in HTT mode help?

What can I do to further the investigation?

It has 8.1.4 on it.

LER



-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 512-248-2683                 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893



Re: That EXPLAIN ANALYZE patch still needs work

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Hmm, I take that back.  I updated to the current CVS tip on this machine
> > and now I don't see the problem.  Without EXPLAIN ANALYZE, the times are
> > around 115-117 ms.  With EXPLAIN ANALYZE, 149-152ms.
> 
> At the moment, CVS HEAD still has the sampling patch in it.  You need to
> be looking at 8.1, or HEAD from several weeks ago, to see the full
> impact of gettimeofday-every-time.  (I imagine Simon's strace count
> is wrong for the same reason.)

Ok, then I re-retract, because the test I posted previously was done
with 8.1.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
"Larry Rosenman" <ler@lerctr.org> writes:
> Does my post from yesterday (39ms without explain analyze, 280+ with explain
> analyze) on FreeBSD/amd64, Dual Xeon's in HTT mode help?

Well, it confirms that FreeBSD is subject to the same problem ;-).

I think the bottom line here is that there are some machines out there
where gettimeofday() is fast enough for our purposes, and some where
it is nowhere near fast enough.  And that kernel-level fixes may be
possible for some of the latter, but not all, and we shouldn't hold our
breath waiting for that to happen anyway.

To tell you the truth, this information makes me even less pleased with
the sampling-gettimeofday patch than I was before.  If gettimeofday() in
itself increases the runtime of a node by a factor of 10, then just
trying to subtract off that time is no solution.  There's too much
impact on surrounding nodes, and too much roundoff error anyhow.
I had thought we were applying an order-of-ten-percent correction by
subtracting SampleOverhead, not an order-of-10x correction :-(
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Simon Riggs
Date:
On Fri, 2006-06-09 at 10:00 -0400, Tom Lane wrote:

> I think the bottom line here is that there are some machines out there
> where gettimeofday() is fast enough for our purposes, and some where
> it is nowhere near fast enough.  And that kernel-level fixes may be
> possible for some of the latter, but not all, and we shouldn't hold our
> breath waiting for that to happen anyway.

Agreed.

The issue seems to be some systems are set to get exactly correct times
and some are set to return a time (possibly imprecise) with low
overhead. Even if fixes existed, OS packagers may not pick the right one
of those two options for our purposes for EA. (We might prefer accuracy
to speed for other parts of PostgreSQL anyway).

I propose we revert the sampling patch (sorry Martijn) and go with the
patch to have an explain_analyze_timing parameter (default=on). That way
we'll have an option to turn off timing *if* we happen to be running on
an OS/hw combo that sucks *and* trying to run large enough EAs that we
care.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com



Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I propose we revert the sampling patch (sorry Martijn)

yah ...

> and go with the
> patch to have an explain_analyze_timing parameter (default=on).

This I'm unexcited about.  EXPLAIN output isn't all that transparent
anyway, and losing the extra cue of seeing where the time is really
going would make it extremely easy for people to misinterpret their
problems.

I was intending to push forward with the idea of being able to get
numbers out of a canceled EXPLAIN.  That will allow you to get some
information even when the underlying query runs longer than you're
willing to tolerate.  I still say that the number of queries where
avoiding gettimeofday overhead would transform an intolerable runtime
into a tolerable one is pretty limited.

The other thing that I think would be worth investigating is
timer-driven sampling, although it's not yet clear whether we can
make that work usefully.  Anyone want to take up that project?
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Martijn van Oosterhout
Date:
On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote:
> To tell you the truth, this information makes me even less pleased with
> the sampling-gettimeofday patch than I was before.  If gettimeofday() in
> itself increases the runtime of a node by a factor of 10, then just
> trying to subtract off that time is no solution.  There's too much
> impact on surrounding nodes, and too much roundoff error anyhow.
> I had thought we were applying an order-of-ten-percent correction by
> subtracting SampleOverhead, not an order-of-10x correction :-(

Eh? The whole point is to call gettimeofday() much less often. If you
call it 1000th as often, then the correction is only on the order of
one hundredth of the normal query time...

Subtracting SampleOverhead is only a correction on the order of a few
percent, it's the reduced calling of gettimeofday() that provides the
benefit.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: That EXPLAIN ANALYZE patch still needs work

From
"A.M."
Date:
It would be nice to keep the gettimeofday()s wherever they are most useful
on hardware/software where they are cheap. Perhaps a compile-time option?

On Fri, June 9, 2006 11:18 am, Martijn van Oosterhout wrote:
> On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote:
>
>> To tell you the truth, this information makes me even less pleased with
>>  the sampling-gettimeofday patch than I was before.  If gettimeofday()
>> in itself increases the runtime of a node by a factor of 10, then just
>> trying to subtract off that time is no solution.  There's too much
>> impact on surrounding nodes, and too much roundoff error anyhow. I had
>> thought we were applying an order-of-ten-percent correction by
>> subtracting SampleOverhead, not an order-of-10x correction :-(
>
> Eh? The whole point is to call gettimeofday() much less often. If you
> call it 1000th as often, then the correction is only on the order of one
> hundredth of the normal query time...
>
> Subtracting SampleOverhead is only a correction on the order of a few
> percent, it's the reduced calling of gettimeofday() that provides the
> benefit.
>
> Have a nice day,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
>
>> From each according to his ability. To each according to his ability to
>> litigate.
>




Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote:
>> I had thought we were applying an order-of-ten-percent correction by
>> subtracting SampleOverhead, not an order-of-10x correction :-(

> Eh? The whole point is to call gettimeofday() much less often. If you
> call it 1000th as often, then the correction is only on the order of
> one hundredth of the normal query time...

No, because the correction calculation is
totaltime += (average time per sampled execution - SampleOverhead) * (number of unsampled executions)

If SampleOverhead is 90% of the average time per sampled execution,
then you are multiplying something with a large component of
cancellation error by a possibly-large number.

As an example, using the numbers I posted here for my old PC:
http://archives.postgresql.org/pgsql-hackers/2006-06/msg00407.php
the actual runtime is clearly about 1 usec per tuple but enabling
timing adds 8.5 usec per tuple.  If we suppose we sampled
10000 out of 1 million rows, then we'd have

raw totaltime = 10000 * 9.5 usec = 95msec
avg time/execution = 9.5 usec
SampleOverhead = 8.5 usec
number of unsampled executions = 990000
correction = 990msec

which means that a 10% error in estimating SampleOverhead would
contribute as much to the final estimate as the actual measurement did.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Simon Riggs
Date:
On Fri, 2006-06-09 at 10:56 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I propose we revert the sampling patch (sorry Martijn)
> 
> yah ...
> 
> > and go with the
> > patch to have an explain_analyze_timing parameter (default=on).
> 
> This I'm unexcited about.  EXPLAIN output isn't all that transparent
> anyway, and losing the extra cue of seeing where the time is really
> going would make it extremely easy for people to misinterpret their
> problems.

As before, agreed, but it works and is available now.

> I was intending to push forward with the idea of being able to get
> numbers out of a canceled EXPLAIN.  That will allow you to get some
> information even when the underlying query runs longer than you're
> willing to tolerate.  I still say that the number of queries where
> avoiding gettimeofday overhead would transform an intolerable runtime
> into a tolerable one is pretty limited.

That would be a good feature to have.

> The other thing that I think would be worth investigating is
> timer-driven sampling, although it's not yet clear whether we can
> make that work usefully.  Anyone want to take up that project?

Not me, sorry.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com



Re: That EXPLAIN ANALYZE patch still needs work

From
Martijn van Oosterhout
Date:
On Fri, Jun 09, 2006 at 11:22:11AM -0400, A.M. wrote:
> It would be nice to keep the gettimeofday()s wherever they are most useful
> on hardware/software where they are cheap. Perhaps a compile-time option?

I think making it optional at compile time is worse, because then at
support time you might have to recompile to get the option you want.

Right now I'm confused though. I was under the impression the changes
were going to be ripped out because they were judged unworkable. I
think improvements can be made but I'm unsure if there's any
interest...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: That EXPLAIN ANALYZE patch still needs work

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I was intending to push forward with the idea of being able to get
> numbers out of a canceled EXPLAIN.  That will allow you to get some
> information even when the underlying query runs longer than you're
> willing to tolerate.  I still say that the number of queries where
> avoiding gettimeofday overhead would transform an intolerable runtime
> into a tolerable one is pretty limited.

Are we still going to subtract out the gettimeofday overhead?

I was always more excited about that than the sampling aspect. I've run into
queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday
overhead but I've never run into a query where gettimeofday overhead made
running the query under EXPLAIN ANALYZE impractical.


-- 
greg



Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Are we still going to subtract out the gettimeofday overhead?

> I was always more excited about that than the sampling aspect. I've run into
> queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday
> overhead but I've never run into a query where gettimeofday overhead made
> running the query under EXPLAIN ANALYZE impractical.

That would be deceptive in a different way, ie, make it look like there
was a whole bunch of outside-the-plan overhead.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > Are we still going to subtract out the gettimeofday overhead?
> 
> > I was always more excited about that than the sampling aspect. I've run into
> > queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday
> > overhead but I've never run into a query where gettimeofday overhead made
> > running the query under EXPLAIN ANALYZE impractical.
> 
> That would be deceptive in a different way, ie, make it look like there
> was a whole bunch of outside-the-plan overhead.

Well that's exactly what there is.

To avoid user confusion it would be reasonable to print out a line at the bottom:
 Explain analyze profiling overhead removed: xxx ms

That also gives the user feedback on how precise their explain analyze results
are. If they see that the overhead being removed is as large as the timing
remaining then they can realize that the results aren't especially precise. On
the other hand if they see that the overhead being removed is small then they
can be pretty confident in the precision of the results.

-- 
greg



Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> To avoid user confusion it would be reasonable to print out a line at the bottom:

>   Explain analyze profiling overhead removed: xxx ms

> That also gives the user feedback on how precise their explain analyze results
> are. If they see that the overhead being removed is as large as the timing
> remaining then they can realize that the results aren't especially precise. On
> the other hand if they see that the overhead being removed is small then they
> can be pretty confident in the precision of the results.

I would prefer to leave the numbers unmanipulated, because frankly I'd
have no confidence in the correction.  Perhaps we could put a note at the
bottom about "Estimated total profiling overhead: xxx ms".
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > To avoid user confusion it would be reasonable to print out a line at the bottom:
> 
> >   Explain analyze profiling overhead removed: xxx ms
> 
> > That also gives the user feedback on how precise their explain analyze results
> > are. If they see that the overhead being removed is as large as the timing
> > remaining then they can realize that the results aren't especially precise. On
> > the other hand if they see that the overhead being removed is small then they
> > can be pretty confident in the precision of the results.
> 
> I would prefer to leave the numbers unmanipulated, because frankly I'd
> have no confidence in the correction.  Perhaps we could put a note at the
> bottom about "Estimated total profiling overhead: xxx ms".

Well the whole reason the overhead is deceptive is that some nodes call
gettimeofday many more times proportionally to their running time than others.
It makes users suspect cpu-intensive nodes that process many records when the
real problem lies in an I/O-intensive node that processes few records. Just
showing the total overhead will give the user the impression that time is
distributed evenly throughout the nodes proportionally to their total time
which doesn't help correct the distortion.

There seem to be two types of overhead going on. There's the amount of time
spent in gettimeofday itself which is pretty consistent. Then there's other
effects that cause normal operation itself to take longer (or potentially even
less long).

Postgres may as well remove the consistent gettimeofday overhead it can
identify and measure even if there's more it can't characterize perfectly.

-- 
greg



Re: That EXPLAIN ANALYZE patch still needs work

From
Martijn van Oosterhout
Date:
On Fri, Jun 09, 2006 at 03:55:28PM -0400, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > There seem to be two types of overhead going on. There's the amount of time
> > spent in gettimeofday itself which is pretty consistent.
>
> That is a fact not in evidence.  The impression I had from that
> linux-kernel discussion was that the problematic kernel code was looping
> until it got consistent results from successive hardware reads.  I'm
> not at all sure that you can make the above assertion across all
> varieties of clock hardware, or even all common varieties.

The interesting thing about this is that they obviously are gearing
gettimeofday() to be accurate, rather than just considering it a
counter that is somewhat close to real time. At the expense of speed.

What we want is just a monotonically increasing counter that can be
read quickly and consistently, we're not majorly fussed if it doesn't
match real time. This puts us back to CPU cycle counters, but they have
drawbacks of their own.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Right now I'm confused though. I was under the impression the changes
> were going to be ripped out because it was decided to be unworkable. I
> think improvements can be made but I'm unsure if there's any
> interest...

I've reverted the current patch because it clearly doesn't work well
enough.  There's nothing stopping you from having a better idea though.
It's clear that on some platforms the cost of gettimeofday is high
enough that some workaround would be good.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> The interesting thing about this is that they obviously are gearing
> gettimeofday() to be accurate, rather than just considering it a
> counter that is somewhat close to real time. At the expense of speed.

Not sure that that's an accurate description.  What I think the kernel
fuss is about is that they have to read the counter value as several
separate "byte read" operations, and if the hardware doesn't latch the
whole counter value when the first byte is pulled then they'll get bytes
from several distinct states of the counter, leading to something that's
not consistent or even monotonic.  On non-latching hardware there's
really not a lot of choice what to do.  The patch is about not using
that same very-slow code path on hardware that does latch.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> There seem to be two types of overhead going on. There's the amount of time
> spent in gettimeofday itself which is pretty consistent.

That is a fact not in evidence.  The impression I had from that
linux-kernel discussion was that the problematic kernel code was looping
until it got consistent results from successive hardware reads.  I'm
not at all sure that you can make the above assertion across all
varieties of clock hardware, or even all common varieties.
        regards, tom lane


Re: That EXPLAIN ANALYZE patch still needs work

From
Benny Amorsen
Date:
>>>>> "MvO" == Martijn van Oosterhout <kleptog@svana.org> writes:

MvO> What we want is just a monotonically increasing counter that can
MvO> be read quickly and consistantly, we're not majorly fussed if it
MvO> doesn't match real time. This puts us back to CPU cycle counters,
MvO> but they have drawbacks of their own.

It is amazing how this discussion is mirroring discussions on the
linux-kernel list.

Applications have been using CPU cycle counters on Linux to avoid the
gettimeofday() overhead. With reasonably recent kernels, the overhead
is very low when the CPU cycle counters are usable, because
gettimeofday() never actually enters kernel space.

Unfortunately fewer and fewer systems seem to have usable cycle
counters. As an example, dual core Athlon64/Opteron boots with the
cycle counters unsynced. The kernel can compensate for that. However
they also lose sync whenever clock frequency changes, and the kernel
has a really hard time compensating for it. On such systems the kernel
switches back to slower timers and gettimeofday() becomes a real
system call. Applications should not try to use cycle counters
directly on such systems. If the kernel developers manage a good
workaround, gettimeofday() becomes fast again, but applications which
use cycle counters most likely stay broken.

Basically either gettimeofday() is fast, or the cycle counters are
useless -- unless you really care about counting CPU cycles and not
real time. Some CPUs like Transmetas actually get the counter thing
right and count 2 every tick when running at half speed and so on.


/Benny




Re: Running a query twice to ensure cached results.

From
"Luke Lonergan"
Date:
Ron,

On 6/8/06 11:49 AM, "Ron Mayer" <rm_pg@cheapcomplexdevices.com> wrote:

> Experimental results here suggest that for larger tables Linux seems
> to detect a seq-scan and not bother caching.   It's very reproducible
> for me here to do a reboot and not see the full speedup on a seq_scan
> until the third time I run a query.

What you are seeing is the now infamous "Postgres writes a table one more
time after loading" behavior.

Simon Riggs once dug into it to find the root cause, and I no longer recall
exactly why, but after you've loaded data, the first seq scan will re-write
some large portion of the data while doing the initial scan. This wreaks
havoc on normal benchmarking practices.

Tom - can you explain what's going on with this?  It seems to write more
than just the contents of the WAL, so it's not a flush of the WAL writes
AFAICT.

- Luke 




Re: Running a query twice to ensure cached results.

From
Martijn van Oosterhout
Date:
On Tue, Jun 13, 2006 at 04:54:05AM -0700, Luke Lonergan wrote:
> > Experimental results here suggest that for larger tables Linux seems
> > to detect a seq-scan and not bother caching.   It's very reproducible
> > for me here to do a reboot and not see the full speedup on a seq_scan
> > until the third time I run a query.
>
> What you are seeing is the now infamous "Postgres writes a table one more
> time after loading" behavior.
>
> Simon Riggs once dug into it to find the root cause, and I no longer recall
> exactly why, but after you've loaded data, the first seq scan will re-write
> some large portion of the data while doing the initial scan. This wreaks
> havoc on normal benchmarking practices.

Is it possible it may have something to do with the hint bits? There
are a bunch of bits in the header to deal with speeding up of MVCC
tests. Maybe changing those bits marks the page dirty and forces a
write?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Running a query twice to ensure cached results.

From
"Luke Lonergan"
Date:
Martijn,

On 6/13/06 5:42 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote:

> Is it possible it may have something to do with the hint bits? There
> are a bunch of bits in the header to deal with speeding up of MVCC
> tests. Maybe changing those bits marks the page dirty and forces a
> write?

Yes, that's it.  Writing data twice (three times including WAL) is a Very
Bad Thing for large data work - how should we fix it?

- Luke




Re: Running a query twice to ensure cached results.

From
Martijn van Oosterhout
Date:
On Tue, Jun 13, 2006 at 05:46:23AM -0700, Luke Lonergan wrote:
> Martijn,
>
> On 6/13/06 5:42 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote:
>
> > Is it possible it may have something to do with the hint bits? There
> > are a bunch of bits in the header to deal with speeding up of MVCC
> > tests. Maybe changing those bits marks the page dirty and forces a
> > write?
>
> Yes, that's it.  Writing data twice (three times including WAL) is a Very
> Bad Thing for large data work - how should we fix it?

Well, I don't think you want to get rid of it entirely because
otherwise forevermore, every lookup in that table will require a check
to see if the transaction is committed. So at some point the hint bit
needs to be set and/or the xmin frozen (vacuum freeze does that for
example).

What you might be able to do is to reduce its effect. The thing that
occurs to me is to make hint bit changes only mark a page half-dirty.
If the page is evicted because the space is needed in the buffer cache,
it can be just dropped. However, the bgwriter will write it as normal.
Hence, setting hint bits will become a sort of background operation,
done when there's time.

It seems to me that if a large table is loaded in a single transaction,
the check for whether the transaction is committed should be cheap because
it's checking the same transaction id over and over.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Running a query twice to ensure cached results.

From
"Luke Lonergan"
Date:
Martijn,

On 6/13/06 6:00 AM, "Martijn van Oosterhout" <kleptog@svana.org> wrote:

> What you might be able to do is to reduce its effect. The thing that
> occurs to me is to make hint bit changes only mark a page half-dirty.
> If the page is evicted because the space is needed in the buffer cache,
> it can be just dropped. However, the bgwriter will write it as normal.
> Hence, setting hint bits will become a sort of background operation,
> done when there's time.
> 
> It seems to me that if a large table is loaded in a single transaction,
> the check for whether the transaction is committed should be cheap because
> it's checking the same transaction id over and over.

Thanks for the tip - it's now in the mill for mulling over and finding
someone with the time to work something in...

- Luke




Re: Running a query twice to ensure cached results.

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> What you are seeing is the now infamous "Postgres writes a table one more
> time after loading" behavior.

> Tom - can you explain what's going on with this?

Setting XMIN_COMMITTED hint bits.
        regards, tom lane


Re: Running a query twice to ensure cached results.

From
Simon Riggs
Date:
On Tue, 2006-06-13 at 15:00 +0200, Martijn van Oosterhout wrote:

> What you might be able to do is to reduce its effect. The thing that
> occurs to me is to make hint bit changes only mark a page half-dirty.
> If the page is evicted because the space is needed in the buffer cache,
> it can be just dropped. However, the bgwriter will write it as normal.
> Hence, setting hint bits will become a sort of background operation,
> done when there's time.

Yes, I think that's a very good idea. What that introduces is the
concept of "dirty priority" - i.e. some pages are more important to
write out quickly than others. If the bgwriter cleans the higher
priority ones first it should do a better job of keeping the bufferpool
clean. That needs some more work before it's a fully-fledged proposal.

> It seems to me that if a large table is loaded in a single transaction,
> the check for whether the transaction is committed should be cheap because
> it's checking the same transaction id over and over.

I was considering this the other day. My original idea was to set the
xmin to be FrozenTransaction when loading a table with COPY in the same
transaction as the one that loaded it but that didn't work because of
MVCC violation. If we just set the hint bit to show XMIN_COMMITTED then
we need not worry about MVCC violations, since the xmin will still be
higher than any pre-existing snapshots.

I think Tom had a problem with that because it relied on file truncation
on recovery, but I'm not clear on why exactly?

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com