From: Tom Lane
Subject: Re: That EXPLAIN ANALYZE patch still needs work
Msg-id: 23053.1149627028@sss.pgh.pa.us
In response to: Re: That EXPLAIN ANALYZE patch still needs work  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: That EXPLAIN ANALYZE patch still needs work  ("Jim C. Nasby" <jnasby@pervasive.com>)
           Re: That EXPLAIN ANALYZE patch still needs work  (Martijn van Oosterhout <kleptog@svana.org>)
           Re: That EXPLAIN ANALYZE patch still needs work  (Greg Stark <gsstark@mit.edu>)
           Re: That EXPLAIN ANALYZE patch still needs work  (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
I wrote:
> The problem looks to be an underestimation of SampleOverhead, and on
> reflection it's clear why: what CalculateSampleOverhead is measuring
> isn't the total overhead, but the time between the two gettimeofday
> calls.  Which is probably about half the true overhead.

On further thought, I take that back: the "true overhead" is not the
point here.  The time elapsed during a plan node execution when
sampling can be broken down into three phases:

    1. time before the first gettimeofday call
    2. time between the two gettimeofday calls
    3. time after the second gettimeofday call

Only the second part is actually measured by the instrumentation system;
the other parts are overhead that has never been counted by EXPLAIN
ANALYZE, sampling version or no.  Moreover, most of the runtime of
InstrStartNode and InstrStopNode falls into the first or third parts.

What we would actually like to set SampleOverhead to is the portion
of the second-part runtime that doesn't occur when sampling = false.
Assuming that gettimeofday() has consistent runtime and the actual
time reported is measured at a consistent instant within that runtime,
I believe that we should take the SampleOverhead as just equal to
the runtime of a single gettimeofday() call.  The added or removed
second-part time within InstrStartNode is just the tail time of
gettimeofday, and the added or removed second-part time within
InstrStopNode is basically just the head time of gettimeofday.  (To make
this as true as possible, we need to change the order of operations so
that gettimeofday is invoked *immediately* after the "if (sampling)"
test, but that's easy.)
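
To make the ordering concrete, here's a rough sketch of what I have in
mind (a sketch only, not the patch itself; the "sampling" flag and the
use of the instr_time macros just follow the discussion above):

void
InstrStartNode(Instrumentation *instr)
{
    if (instr->sampling)
        INSTR_TIME_SET_CURRENT(instr->starttime);   /* clock read is the whole branch */
}

void
InstrStopNode(Instrumentation *instr)
{
    instr_time  endtime;

    if (instr->sampling)
    {
        INSTR_TIME_SET_CURRENT(endtime);    /* read the clock before any arithmetic */
        INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
        INSTR_TIME_SET_ZERO(instr->starttime);
    }

    /* tuple-count and nextsample bookkeeping goes here, outside the measured window */
}

That way the only sampling-specific work inside the measured window is
the tail of the first clock read and the head of the second.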

So this line of thought leads to the conclusion that
CalculateSampleOverhead is actually overestimating SampleOverhead
a bit, and we should simplify it to just time INSTR_TIME_SET_CURRENT().
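
Something like this is all it would take (again a sketch; the loop count
and the file-level SampleOverhead variable are assumptions, not copied
from the patch):

static double SampleOverhead;

static void
CalculateSampleOverhead(void)
{
    instr_time  start,
                end;
    int         i;

    /* time a run of back-to-back clock reads and take the average */
    INSTR_TIME_SET_CURRENT(start);
    for (i = 0; i < 10000; i++)
        INSTR_TIME_SET_CURRENT(end);
    INSTR_TIME_SUBTRACT(end, start);
    SampleOverhead = INSTR_TIME_GET_DOUBLE(end) / 10000;
}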

But that still leaves me with a problem because my machine is clearly
overestimating the correction needed.  I added some printouts and got

raw totaltime = 0.370937
per_iter = 0.000156913, SampleOverhead = 3.28e-06
adj totaltime = 1.82976
sampling = 0
starttime = 0/000000
counter = 0/370937
firsttuple = 0.258321
tuplecount = 10000
itercount = 10001
samplecount = 704
nextsample = 10011
startup = 0.258321
total = 1.82976
ntuples = 10000
nloops = 1

on a run with an actual elapsed time near 750 msec.  Clearly the
sampling adjustment is wrong, but why?
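
A rough back-of-envelope check using just the numbers printed above (my
arithmetic here, not anything the patch computes): extrapolating the
per-sampled-iteration estimate over every iteration gives

    0.000156913 sec/iteration * 10001 iterations ~= 1.57 sec

which is already more than twice the ~750 msec the query actually took.
So the iterations we happen to sample are apparently running a good deal
slower than the ones we don't.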

I have a theory about this, and it's not pleasant at all.  What I
think is that we have a Heisenberg problem here: the act of invoking
gettimeofday() actually changes what is measured.  That is, the
runtime of the "second part" of ExecProcNode is actually longer when
we sample than when we don't, not merely due to the extra time spent
in gettimeofday().  It's not very hard to guess at reasons why, either.
The kernel entry is probably flushing some part of the CPU's state,
such as virtual/physical address mapping for the userland address
space.  After returning from the kernel call, the time to reload
that state shows up as more execution time within the "second part".

This theory explains two observations that otherwise are hard to
explain.  One, that the effect is platform-specific: your machine
may avoid flushing as much state during a kernel call as mine does.
And two, that upper plan nodes seem much more affected than lower
ones.  That makes sense because the execution cycle of an upper node
will involve touching more userspace data than a lower node, and
therefore more of the flushed TLB entries will need to be reloaded.
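
If anyone wants to see whether their platform shows the effect, here is
a quick standalone test I'd try (my own throwaway sketch, nothing to do
with the patch): time a fixed chunk of page-touching work with and
without an extra gettimeofday() immediately in front of it.  If the
difference comes out noticeably larger than the cost of a gettimeofday()
call by itself, the kernel entry really is penalizing the user-space
code that follows it.

#include <stdio.h>
#include <sys/time.h>

#define WORKSIZE (4 * 1024 * 1024)  /* 4MB: spans many pages, hence many TLB entries */
#define ITERS    1000

static char workbuf[WORKSIZE];

/* touch one byte per page, so each access wants a distinct TLB entry */
static void
touch_pages(void)
{
    int     i;

    for (i = 0; i < WORKSIZE; i += 4096)
        workbuf[i]++;
}

static double
usec_between(struct timeval *a, struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) * 1e6 + (b->tv_usec - a->tv_usec);
}

int
main(void)
{
    struct timeval  t0, t1, dummy;
    double  plain = 0.0,
            with_call = 0.0;
    int     i;

    for (i = 0; i < ITERS; i++)
    {
        gettimeofday(&t0, NULL);
        touch_pages();
        gettimeofday(&t1, NULL);
        plain += usec_between(&t0, &t1);

        gettimeofday(&t0, NULL);
        gettimeofday(&dummy, NULL);     /* the extra kernel entry under test */
        touch_pages();
        gettimeofday(&t1, NULL);
        with_call += usec_between(&t0, &t1);
    }

    printf("plain: %.2f usec/iter   with extra gettimeofday: %.2f usec/iter\n",
           plain / ITERS, with_call / ITERS);
    return 0;
}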

If this theory is correct, then the entire notion of EXPLAIN ANALYZE
sampling has just crashed and burned.  We can't ship a measurement
tool that is accurate on some platforms and not others.

I'm wondering if it's at all practical to go over to a gprof-like
measurement method for taking EXPLAIN ANALYZE runtime measurements;
that is, take an interrupt every so often and bump the count for the
currently active plan node.  This would spread the TLB-flush overhead
more evenly and thus avoid introducing that bias.  There may be too much
platform dependency in this idea, though, and I also wonder if it'd
break the ability to do normal gprof profiling of the backend.
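
For concreteness, the sort of thing I'm imagining looks roughly like
this (names such as current_instr and sampleticks are made up for
illustration; this isn't proposal-quality code):

#include <signal.h>
#include <string.h>
#include <sys/time.h>

typedef struct Instrumentation
{
    volatile long   sampleticks;    /* profiling ticks charged to this node */
    /* ... existing instrumentation fields ... */
} Instrumentation;

/* ExecProcNode would point this at whichever node is currently running */
static Instrumentation *volatile current_instr = NULL;

/* SIGPROF handler: charge one tick to the active node */
static void
profile_tick(int signo)
{
    (void) signo;
    if (current_instr != NULL)
        current_instr->sampleticks++;
}

static void
start_profile_timer(void)
{
    struct sigaction    sa;
    struct itimerval    iv;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = profile_tick;
    sigaction(SIGPROF, &sa, NULL);

    /* one tick per millisecond of CPU time; note this is the same
     * ITIMER_PROF/SIGPROF mechanism gprof itself relies on, which is
     * exactly the conflict mentioned above */
    iv.it_interval.tv_sec = 0;
    iv.it_interval.tv_usec = 1000;
    iv.it_value = iv.it_interval;
    setitimer(ITIMER_PROF, &iv, NULL);
}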
        regards, tom lane

