Thread: Tuning planner cost estimates

Tuning planner cost estimates

From: "Jim C. Nasby"
I've been doing some work to try to identify the actual costs
associated with an index scan, with some limited success. What's been run
so far can be seen at http://stats.distributed.net/~decibel. But there
are a couple of problems. First, I can't use the box exclusively for this
testing, which leads to some inconsistencies in the results. Second, I've been
using a dataset that I can't make public, which means no one else can
run these tests on different hardware.

So what I think would be useful is some way to generate a known dataset,
and then be able to run tests against it on different machines. In the
case of testing index scans, we need to be able to vary correlation,
which so far I've been doing by ordering by different columns. I suspect
it will also be important to test with different tuple sizes. There's
also the question of whether the cache should be flushed for each run.
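
As a rough illustration of the kind of generated dataset I mean (the
table and column names here are only placeholders): a driving column
that matches the physical row order gives correlation near 1.0, a random
column gives correlation near 0, and a filler column controls tuple size:

    create table idx_test as
    select g as ordered,                            -- physical order follows this
           (random() * 10000000)::int as scattered, -- effectively uncorrelated
           repeat('x', 100) as filler               -- widen this to vary tuple size
    from generate_series(1, 10000000) g;

    create index idx_test_ordered on idx_test (ordered);
    create index idx_test_scattered on idx_test (scattered);
    analyze idx_test;

    -- sanity-check the resulting correlations
    select attname, correlation from pg_stats where tablename = 'idx_test';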

Does this sound like a good way to determine actual costs for index
scans (and hopefully other access methods in the future)? If so, what
would be a good way to implement this?
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Tuning planner cost estimates

From: Josh Berkus
Jim,

> I've been doing some work to try to identify the actual costs
> associated with an index scan, with some limited success. What's been run
> so far can be seen at http://stats.distributed.net/~decibel. But there
> are a couple of problems. First, I can't use the box exclusively for this
> testing, which leads to some inconsistencies in the results.

I can get you access to boxes.  Chat on IRC?

> Second, I've been
> using a dataset that I can't make public, which means no one else can
> run these tests on different hardware.

Then use one of the DBT databases.

> In the
> case of testing index scans, we need to be able to vary correlation,
> which so far I've been doing by ordering by different columns. I suspect
> it will also be important to test with different tuple sizes. There's
> also the question of whether the cache should be flushed for each run.
>
> Does this sound like a good way to determine actual costs for index
> scans (and hopefully other access methods in the future)? If so, what
> would be a good way to implement this?

Well, the problem is that what we need for index scans is a formula, rather
than a graph.   The usefulness of benchmarking index scan cost is so that we
can test our formula for accuracy and precision.  However, such a formula
*does* need to take into account concurrent activity, updates, etc ... that
is, it needs to approximately estimate the relative cost on a live database,
not a test one.

This is also going to be a moving target because Tom's in-memory bitmapping
changes the relative cost equations.

I think a first step would be, in fact, to develop a tool that allows us to
put EXPLAIN ANALYZE results in a database table.  Without that, there is no
possibility of statistical-scale analysis.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Tuning planner cost estimates

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> I think a first step would be, in fact, to develop a tool that allows us to
> put EXPLAIN ANALYZE results in a database table.  Without that, there is no
> possibility of statistical-scale analysis.

AFAIK you can do that today using, eg, plpgsql:

    for rec in explain analyze ... loop
        insert into table values(rec."QUERY PLAN");
    end loop;
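
For instance, a minimal sketch along those lines (the table and function
names are only placeholders):

    create table plan_lines (
        run_id    int,
        lineno    serial,       -- preserves output order within a run
        plan_line text
    );

    create or replace function capture_plan(run int, q text) returns void as $$
    declare
        rec record;
    begin
        for rec in execute 'explain analyze ' || q loop
            insert into plan_lines (run_id, plan_line)
                values (run, rec."QUERY PLAN");
        end loop;
    end;
    $$ language plpgsql;

    -- e.g.:  select capture_plan(1, 'select count(*) from some_table');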

            regards, tom lane

Re: Tuning planner cost estimates

From: Josh Berkus
Tom,

>     for rec in explain analyze ... loop
>         insert into table values(rec."QUERY PLAN");
>     end loop;

I need to go further than that and parse the results as well.  And preserve
relationships and nesting levels.

Hmmmm ... what's the indenting formula for nesting levels?
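
One crude way to recover levels from captured plan text, assuming the
output indents each nesting level by a fixed width (the exact width
would need checking against explain.c), might be:

    -- plan_lines as sketched above; depth guessed from leading whitespace
    select plan_line,
           -- 6 here is only an assumed per-level indent width
           (length(plan_line) - length(ltrim(plan_line))) / 6 as approx_level
    from plan_lines
    order by run_id, lineno;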

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Tuning planner cost estimates

From: "Jim C. Nasby"
On Thu, May 19, 2005 at 09:31:47AM -0700, Josh Berkus wrote:
> > In the
> > case of testing index scans, we need to be able to vary correlation,
> > which so far I've been doing by ordering by different columns. I suspect
> > it will also be important to test with different tuple sizes. There's
> > also the question of whether the cache should be flushed for each run.
> >
> > Does this sound like a good way to determine actual costs for index
> > scans (and hopefully other access methods in the future)? If so, what
> > would be a good way to implement this?
>
> Well, the problem is that what we need for index scans is a formula, rather
> than a graph.   The usefulness of benchmarking index scan cost is so that we

True, but having a graphical representation of how different input
variables (such as correlation) affect runtime is a good way to derive
such a formula, or at least point you in the right direction.

> can test our formula for accuracy and precision.  However, such a formula
> *does* need to take into account concurrent activity, updates, etc ... that
> is, it needs to approximately estimate the relative cost on a live database,
> not a test one.

Well, that raises an interesting issue, because AFAIK none of the cost
estimate functions currently do that. Heck, AFAIK even the piggyback seqscan
code doesn't take other seqscans into account.

Another issue is: what state should the buffers/disk cache be in? In the
thread that kicked all this off, Tom noted that my results were skewed
because of caching, so I changed my tests to flush the disk cache as
effectively as I could (by running a program that would consume enough
available memory to just start the box swapping), but I don't think
that's necessarily realistic. Though at least it should preclude the
need to run tests multiple times on an otherwise idle box in order to
'pre-seed' the cache (not that that's any more realistic). If you don't
use one of these techniques you end up with results that depend on what
test was run before the current one...

> This is also going to be a moving target because Tom's in-memory bitmapping
> changes the relative cost equations.

I thought those all had separate costing functions...? In any case, if
we have a cost estimation tool, it will make it much easier to derive
cost estimation functions.

> I think a first step would be, in fact, to develop a tool that allows us to
> put EXPLAIN ANALYZE results in a database table.  Without that, there is no
> possibility of statistical-scale analysis.

Rather than trying to parse all possible output, ISTM it would be much
better if there were a way to access the info directly. Would it be
difficult to have an option that produces the output as a set of
distinct fields? I'm thinking something like:

Level (basically how far something's indented)
Parent node (what node a child node is feeding)
node_id (some kind of identifier for each step)
operation
(estimate|actual)_(startup|total|rows|width|loops)
other (something to hold index condition, filter, etc)
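
In table form, that might look something like this (purely illustrative;
the column names and types are just a first guess):

    create table explain_node (
        node_id     int,     -- identifier for each step
        parent_id   int,     -- node this one feeds; null for the top node
        level       int,     -- nesting depth
        operation   text,    -- e.g. 'Index Scan using foo_idx on foo'
        est_startup float8,
        est_total   float8,
        est_rows    float8,
        est_width   int,
        act_startup float8,
        act_total   float8,
        act_rows    float8,
        act_loops   int,
        other       text     -- index condition, filter, etc.
    );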

But ultimately, I'm not sure if this is really required or not, because
I don't see that we need to use explain when running queries. In fact,
it's possibly desirable that we don't, because of the overhead it
incurs. We would want to log an explain (maybe analyze) just to make
sure we knew what the optimizer was doing, but I think we shouldn't need
the info to produce cost estimates.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Tuning planner cost estimates

From: Tom Lane
"Jim C. Nasby" <decibel@decibel.org> writes:
> On Thu, May 19, 2005 at 09:31:47AM -0700, Josh Berkus wrote:
>> can test our formula for accuracy and precision.  However, such a formula
>> *does* need to take into account concurrent activity, updates, etc ... that
>> is, it needs to approximately estimate the relative cost on a live database,
>> not a test one.

> Well, that raises an interesting issue, because AFAIK none of the cost
> estimate functions currently do that.

I'm unconvinced that it'd be a good idea, either.  People already
complain that the planner's choices change when they ANALYZE; if the
current load factor or something like that were to be taken into account
then you'd *really* have a problem with irreproducible behavior.

It might make sense to have something a bit more static, perhaps a GUC
variable that says "plan on the assumption that there's X amount of
concurrent activity".  I'm not sure what scale to measure X on, nor
exactly how this would factor into the estimates anyway --- but at least
this approach would maintain reproducibility of behavior.

> Another issue is: what state should the buffers/disk cache be in?

The current cost models are all based on the assumption that every query
starts from ground zero: nothing in cache.  Which is pretty bogus in
most real-world scenarios.  We need to think about ways to tune that
assumption, too.  Maybe this is actually the same discussion, because
certainly one of the main impacts of a concurrent environment is on what
you can expect to find in cache.
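
About the only knobs we have today that encode any caching assumption
are effective_cache_size and, indirectly, random_page_cost; for example
(the values here are purely illustrative):

    set effective_cache_size = 100000;  -- measured in 8kB disk pages, roughly 800MB
    set random_page_cost = 2;           -- default 4; lower when reads mostly hit cache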

            regards, tom lane

Re: Tuning planner cost estimates

From: Josh Berkus
Jim,

> Well, that raises an interesting issue, because AFAIK none of the cost
> estimate functions currently do that. Heck, AFAIK even the piggyback
> seqscan code doesn't take other seqscans into account.

Sure.   But you're striving for greater accuracy, no?

Actually, all that's really needed in the way of concurrent activity is a
calculated factor that lets us know how likely a particular object is to be
cached, either in the fs cache or the pg cache (with different factors for
each, presumably), based on history.   Right now, that's based on
effective_cache_size, which is rather inaccurate: a table which is queried
once a month has the exact same cost factors as one which is queried every
2.1 seconds.  This would mean an extra column in pg_stats, I suppose.
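
A crude version of that history is already available from the stats
collector (assuming block-level stats are enabled); something like the
query below, though a real implementation would want it tracked per
object over time rather than as a single cumulative number:

    -- per-table buffer hit ratio as a rough "how cacheable is this" signal
    select relname,
           heap_blks_read,
           heap_blks_hit,
           round(heap_blks_hit::numeric
                 / nullif(heap_blks_hit + heap_blks_read, 0), 3) as hit_ratio
    from pg_statio_user_tables
    order by heap_blks_hit + heap_blks_read desc;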

> But ultimately, I'm not sure if this is really required or not, because
> I don't see that we need to use explain when running queries. In fact,
> it's possibly desirable that we don't, because of the overhead it
> incurs. We would want to log an explain (maybe analyze) just to make
> sure we knew what the optimizer was doing, but I think we shouldn't need
> the info to produce cost estimates.

Well, the problem is that you need to know how much time the index scan took
vs. other query steps.   I don't see a way to do this other than an analyze.

--
__Aglio Database Solutions_______________
Josh Berkus               Consultant
josh@agliodbs.com     www.agliodbs.com
Ph: 415-752-2500    Fax: 415-752-2387
2166 Hayes Suite 200    San Francisco, CA

Re: Tuning planner cost estimates

From: "Jim C. Nasby"
On Fri, May 20, 2005 at 04:47:38PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> > On Thu, May 19, 2005 at 09:31:47AM -0700, Josh Berkus wrote:
> >> can test our formula for accuracy and precision.  However, such a formula
> >> *does* need to take into account concurrent activity, updates, etc ... that
> >> is, it needs to approximately estimate the relative cost on a live database,
> >> not a test one.
>
> > Well, that raises an interesting issue, because AFAIK none of the cost
> > estimate functions currently do that.
>
> I'm unconvinced that it'd be a good idea, either.  People already
> complain that the planner's choices change when they ANALYZE; if the
> current load factor or something like that were to be taken into account
> then you'd *really* have a problem with irreproducible behavior.
>
> It might make sense to have something a bit more static, perhaps a GUC
> variable that says "plan on the assumption that there's X amount of
> concurrent activity".  I'm not sure what scale to measure X on, nor
> exactly how this would factor into the estimates anyway --- but at least
> this approach would maintain reproducibility of behavior.

Alternatively, allowing the load on the machine to affect query plans
dynamically could be something that's disabled by default, so presumably
if you turn it on you know what you're doing.

Of course this is all academic until we have a means to actually measure
how much system load affects the different things we estimate cost for,
and I don't see that happening until we have a system for measuring how
changing different input variables affects costs.

> > Another issue is: what state should the buffers/disk cache be in?
>
> The current cost models are all based on the assumption that every query
> starts from ground zero: nothing in cache.  Which is pretty bogus in
> most real-world scenarios.  We need to think about ways to tune that
> assumption, too.  Maybe this is actually the same discussion, because
> certainly one of the main impacts of a concurrent environment is on what
> you can expect to find in cache.

Well, load doesn't directly affect cache efficiency; it's really a
question of the ratios of how often different things in the database are
accessed. If you wanted to get a crude idea of how likely pages from
some relation are to be in cache, you could take periodic snapshots of
I/O stats and see what percentage of the I/O done in a given time period
was on the relation you're interested in as compared to the rest of the
database. But I think this is probably still a second-order effect.
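
For instance, something like this snapshot-and-diff approach (just a
sketch; the table name is made up):

    -- first snapshot of cumulative physical reads per table
    create table io_snap as
    select 1 as snap, relname,
           heap_blks_read + coalesce(idx_blks_read, 0) as blks_read
    from pg_statio_user_tables;

    -- ... let the workload run for a while, then take a second snapshot:
    insert into io_snap
    select 2, relname, heap_blks_read + coalesce(idx_blks_read, 0)
    from pg_statio_user_tables;

    -- each relation's share of the reads done during the interval
    select s2.relname,
           s2.blks_read - s1.blks_read as blks_read,
           (s2.blks_read - s1.blks_read)::float8
               / nullif((select sum(t2.blks_read - t1.blks_read)::float8
                         from io_snap t1, io_snap t2
                         where t1.relname = t2.relname
                           and t1.snap = 1 and t2.snap = 2), 0) as fraction
    from io_snap s1, io_snap s2
    where s1.relname = s2.relname
      and s1.snap = 1 and s2.snap = 2
    order by 2 desc;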

In terms of a testing system, here's what I'm thinking of. For each cost
estimate there will be a number of input variables we want to vary, and
we then check how changes in them affect run time. Using an index scan
as a simple example, first-order variables will likely be index and
table size (especially in relation to cache size), and correlation. So
we need some kind of test harness that can vary these variables
(preferably one at a time) and run a test case after each change. It
would then need to store the timing info, possibly along with other
information (such as explain output). If I'm the one to write this it'll
end up in Perl, since that's the only language I know that's up to the
task. DBT seems to be a reasonable test database to do this testing
with, especially since it would provide a ready means of generating
external load.
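
A bare-bones version of the harness, just to show the shape of it
(everything here, names included, is hypothetical, and it reuses the
made-up idx_test table from upthread; a real harness would also record
explain output, cache state, and so on):

    create table test_run (
        run_id    serial,
        test_name text,
        variable  text,            -- which input variable was varied
        started   timestamptz,
        finished  timestamptz
    );

    create or replace function run_one(label text, var text, q text)
    returns void as $$
    declare
        t0 timestamptz;
        t1 timestamptz;
    begin
        t0 := timeofday()::timestamptz;
        execute q;                 -- run the test query; results are discarded
        t1 := timeofday()::timestamptz;
        insert into test_run (test_name, variable, started, finished)
            values (label, var, t0, t1);
    end;
    $$ language plpgsql;

    -- one data point per correlation setting, for example:
    select run_one('index scan', 'ordered',
        'select count(*) from (select * from idx_test
          where ordered between 1 and 1000000) x');
    select run_one('index scan', 'scattered',
        'select count(*) from (select * from idx_test
          where scattered between 1 and 1000000) x');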

Does this sound like a reasonable approach? Also, how important do
people think it is to use explain analyze output instead of just doing
SELECT count(*) FROM (query you actually want to test)? (The select
count(*) wrapper is just a means to throw away the results since we
don't really want to worry about data transfer times, etc). The testing
I've done (http://stats.distributed.net/~decibel/base.log) shows explain
analyze to be almost 5x slower than select count(*), so it seems a big
gain if we can avoid that.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Tuning planner cost estimates

From: "Jim C. Nasby"
On Fri, May 20, 2005 at 03:23:16PM -0700, Josh Berkus wrote:
> Jim,
>
> > Well, that raises an interesting issue, because AFAIK none of the cost
> > estimate functions currently do that. Heck, AFAIK even the piggyback
> > seqscan code doesn't take other seqscans into account.
>
> Sure.   But you're striving for greater accuracy, no?
>
> Actually, all that's really needed in the way of concurrent activity is a
> calculated factor that lets us know how likely a particular object is to be
> cached, either in the fs cache or the pg cache (with different factors for
> each presumably) based on history.   Right now, that's based on
> effective_cache_size, which is rather inaccurate: a table which is queried
> once a month has the exact same cost factors as one which is queried every
> 2.1 seconds.  This would mean an extra column in pg_stats I suppose.

True, though that's a somewhat different issue than what the load on the
box is (see the reply I just posted). Load on the box (particularly I/O
load) will also be a factor; for example, it probably means seqscans end
up costing a lot more relative to random I/O, because the disk heads are
being sent all over the place anyway.

> > But ultimately, I'm not sure if this is really required or not, because
> > I don't see that we need to use explain when running queries. In fact,
> > it's possibly desirable that we don't, because of the overhead it
> > incurs. We would want to log an explain (maybe analyze) just to make
> > sure we knew what the optimizer was doing, but I think we shouldn't need
> > the info to produce cost estimates.
>
> Well, the problem is that you need to know how much time the index scan took
> vs. other query steps.   I don't see a way to do this other than an analyze.

True, but that can be done by a separate seqscan step. I would argue
that doing it that way is actually more accurate, because the overhead
of explain analyze is huge and tends to swamp other factors out. As I
mentioned in my other email, my tests show explain analyze select * from
table is 5x slower than select count(*) from table.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Tuning planner cost estimates

From: Tom Lane
"Jim C. Nasby" <decibel@decibel.org> writes:
> Does this sound like a reasonable approach? Also, how important do
> people think it is to use explain analyze output instead of just doing
> SELECT count(*) FROM (query you actually want to test)? (The select
> count(*) wrapper is just a means to throw away the results since we
> don't really want to worry about data transfer times, etc). The testing
> I've done (http://stats.distributed.net/~decibel/base.log) shows explain
> analyze to be almost 5x slower than select count(*), so it seems a big
> gain if we can avoid that.

I'd go with the select count(*) --- I can't imagine that we will be
trying to model the behavior of anything so complex that we really need
explain analyze output.  (On the other hand, recording explain output is
a good idea to make sure you are testing what you think you are testing.)

Actually, it might be worth using "select count(null)", which should
avoid the calls to int8inc.  I think this doesn't matter so much in CVS
tip, but certainly in existing releases the palloc overhead involved is
noticeable.
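
That is, with the made-up test table from earlier in the thread,
something like:

    -- count(expr) ignores NULL inputs, so the (strict) transition
    -- function int8inc is never called
    select count(null)
    from (select * from idx_test where ordered between 1 and 1000000) x;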

BTW, 5x is an awful lot; I've not noticed overheads more than about 2x.

            regards, tom lane

Re: Tuning planner cost estimates

From: Simon Riggs
On Fri, 2005-05-20 at 15:23 -0700, Josh Berkus wrote:
> > Well, that raises an interesting issue, because AFAIK none of the cost
> > estimate functions currently do that. Heck, AFAIK even the piggyback
> > seqscan code doesn't take other seqscans into account.
>
> Sure.   But you're striving for greater accuracy, no?
>
> Actually, all that's really needed in the way of concurrent activity is a
> calculated factor that lets us know how likely a particular object is to be
> cached, either in the fs cache or the pg cache (with different factors for
> each presumably) based on history.   Right now, that's based on
> effective_cache_size, which is rather inaccurate: a table which is queried
> once a month has the exact same cost factors as one which is queried every
> 2.1 seconds.  This would mean an extra column in pg_stats I suppose.

Hmmm...not sure that would be a good thing.

effective_cache_size isn't supposed to be set according to how much of a
table is in cache when the query starts. The setting is supposed to
reflect how much cache is *available* for the current index scan, when
performing an index scan on a table that is not in clustered sequence.
The more out of sequence the table is, the more memory is required to
avoid doing any repeated I/Os during the scan. Of course, if there are
many users, the available cache may be much reduced.

Best regards, Simon Riggs