Thread: What about utility to calculate planner cost constants?

What about utility to calculate planner cost constants?

From
"Tambet Matiisen"
Date:
I was following the cpu_tuple_cost thread and wondering if it would be
possible to make a PQA-style utility to calculate configuration-specific
values for planner cost constants. It could make use of the output of
log_(statement|parser|planner|executor)_stats, though I'm not sure if the
output contains anything useful for those purposes.

Otherwise it could just collect statements, run EXPLAIN ANALYZE for all
of them and then play with planner cost constants to get the estimated
values as close as possible to actual values. Something like Goal Seek
in Excel, if you pardon my reference to MS :).
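
For a single constant this could look something like the sketch below
(Python with psycopg2; the statement list, connection string and candidate
values are just placeholders, and the scale factor between cost units and
milliseconds is fitted along the way):

    import re
    import psycopg2

    STATEMENTS = ["SELECT ...", "SELECT ..."]   # collected from the log (placeholder)

    def top_cost_and_time(cur, sql):
        """Top-level estimated cost and actual time from EXPLAIN ANALYZE."""
        cur.execute("EXPLAIN ANALYZE " + sql)
        top = cur.fetchone()[0]
        cost = float(re.search(r"cost=[\d.]+\.\.([\d.]+)", top).group(1))
        ms = float(re.search(r"actual time=[\d.]+\.\.([\d.]+)", top).group(1))
        return cost, ms

    def error_for(conn, rpc):
        cur = conn.cursor()
        cur.execute("SET random_page_cost = %s" % rpc)
        pairs = [top_cost_and_time(cur, sql) for sql in STATEMENTS]
        # Best least-squares scale from cost units to ms, then the residual.
        scale = sum(c * t for c, t in pairs) / sum(c * c for c, t in pairs)
        return sum((scale * c - t) ** 2 for c, t in pairs)

    conn = psycopg2.connect("dbname=mydb")      # placeholder connection string
    best = min((error_for(conn, rpc), rpc) for rpc in [1.5, 2, 3, 4, 6, 8])
    print("lowest error at random_page_cost =", best[1])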

A somewhat similar project seems to be pgautotune from pgFoundry, but it
only considers shared_buffers, sort_mem and vacuum_mem. And it seems to
use synthetic data instead of actual database and actual statements from
log. And it has been inactive for a while.

  Tambet

Re: What about utility to calculate planner cost constants?

From
Josh Berkus
Date:
Tambet,

> I was following the cpu_tuple_cost thread and wondering if it would be
> possible to make a PQA-style utility to calculate configuration-specific
> values for planner cost constants. It could make use of the output of
> log_(statement|parser|planner|executor)_stats, though I'm not sure if the
> output contains anything useful for those purposes.

Yeah, that's something I need to look at.

> Otherwise it could just collect statements, run EXPLAIN ANALYZE for all
> of them and then play with planner cost constants to get the estimated
> values as close as possible to actual values. Something like Goal Seek
> in Excel, if you pardon my reference to MS :).

That's not really practical.   There are currently 5 major query tuning
parameters, not counting the memory adjustments which really can't be left
out.  You can't realistically test all combinations of 6 variables.

> A somewhat similar project seems to be pgautotune from pgFoundry, but it
> only considers shared_buffers, sort_mem and vacuum_mem. And it seems to
> use synthetic data instead of actual database and actual statements from
> log. And it has been inactive for a while.

Yeah, pg_autotune is a dead project.   Once we got OSDL able to run tests, we
came up with some rules-of-thumb which are more useful than autotune's
output.  More importantly, the approach doesn't scale to the 15-20 GUCs which
we'd realistically want to test.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: What about utility to calculate planner cost constants?

From
Thomas F. O'Connell
Date:
If by not practical you mean, "no one has implemented a multivariable
testing approach," I'll agree with you. But multivariable testing is
definitely a valid statistical approach to solving just such problems.

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Mar 21, 2005, at 11:51 AM, Josh Berkus wrote:

> That's not really practical.   There are currently 5 major query tuning
> parameters, not counting the memory adjustments which really can't be
> left
> out.  You can't realistically test all combinations of 6 variables.


Re: What about utility to calculate planner cost constants?

From
Josh Berkus
Date:
Thomas,

> If by not practical you mean, "no one has implemented a multivariable
> testing approach," I'll agree with you. But multivariable testing is
> definitely a valid statistical approach to solving just such problems.

Well, not practical as in:  "would take either $10 million in equipment or
10,000 hours or both"

--Josh

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: What about utility to calculate planner cost constants?

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> > Otherwise it could just collect statements, run EXPLAIN ANALYZE for all
> > of them and then play with planner cost constants to get the estimated
> > values as close as possible to actual values. Something like Goal Seek
> > in Excel, if you pardon my reference to MS :).
>
> That's not really practical.   There are currently 5 major query tuning
> parameters, not counting the memory adjustments which really can't be left
> out.  You can't realistically test all combinations of 6 variables.

I don't think it would be very hard at all actually.

It's just a linear algebra problem with a bunch of independent variables and a
system of equations. Solving for values for all of them is a straightforward
problem.

Of course in reality these variables aren't actually independent because the
costing model isn't perfect. But that wouldn't be a problem, it would just
reduce the accuracy of the results.

What's needed is for the explain plan to total up the costing penalties
independently. So the result would be something like

1000 * random_page_cost + 101 * sequential_page_cost + 2000 * index_tuple_cost
+ ...

In other words a tuple like <1000,101,2000,...>

And explain analyze would produce the above tuple along with the resulting
time.

Some program would have to gather these values from the log or stats data and
gather them up into a large linear system and solve for values that minimize
the divergence from the observed times.

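If EXPLAIN ANALYZE were taught to emit those per-operation totals, the
solving step itself is only a few lines with any linear-algebra library. A
rough sketch, assuming numpy and a hypothetical log already parsed into rows
of operation counts plus observed times (the numbers below are made up):

    import numpy as np

    # One row per logged query: counts of each primitive operation the plan
    # performed (hypothetical; EXPLAIN ANALYZE doesn't report these today).
    # Columns: seq_pages, random_pages, index_tuples, cpu_tuples
    counts = np.array([
        [101.0, 1000.0, 2000.0,  5000.0],
        [890.0,   12.0,    0.0, 90000.0],
        [ 40.0,  300.0,  800.0,  1200.0],
        [770.0,   55.0,  120.0, 30000.0],
        [  5.0, 2500.0, 6000.0,  9000.0],
        # ... thousands more rows gathered from the log
    ])
    observed_ms = np.array([850.0, 420.0, 310.0, 520.0, 1900.0])

    # Least-squares fit of the cost constants, in ms per operation.
    # (Constraining them to be non-negative would need scipy.optimize.nnls.)
    costs, residual, rank, _ = np.linalg.lstsq(counts, observed_ms, rcond=None)
    for name, value in zip(["seq_page", "random_page", "index_tuple", "cpu_tuple"],
                           costs):
        print("%s cost ~ %.4f ms" % (name, value))

Normalizing the result so that the sequential page cost is 1 is then just a
matter of dividing the other values through by it.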


(cost penalties are currently normalized to sequential_page_cost being 1.
That could be maintained, or it could be changed to be normalized to an
expected 1ms.)

(Also, currently explain analyze has overhead that makes this impractical.
Ideally it could subtract out its overhead so the solutions would be accurate
enough to be useful)

--
greg

Re: What about utility to calculate planner cost constants?

From
Dawid Kuroczko
Date:
On Mon, 21 Mar 2005 14:59:56 -0800, Josh Berkus <josh@agliodbs.com> wrote:
> > If by not practical you mean, "no one has implemented a multivariable
> > testing approach," I'll agree with you. But multivariable testing is
> > definitely a valid statistical approach to solving just such problems.
> Well, not practical as in:  "would take either $10 million in equipment or
> 10,000 hours or both"

I think you don't need to EXPLAIN ANALYZE each query with different GUCs;
you would only need EXPLAIN most of the time (which is way quicker).
Once you get 'near' the actual values, you would do EXPLAIN ANALYZE to
verify the variables.
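
Concretely, the two calls differ only in whether the query gets executed; a
small sketch of the parsing side, assuming psycopg2 (the regexes track the
current EXPLAIN output format and are the fragile part):

    import re
    import psycopg2

    COST_RE = re.compile(r"cost=[\d.]+\.\.([\d.]+)")
    TIME_RE = re.compile(r"Total runtime: ([\d.]+) ms")

    def estimated_cost(cur, sql):
        cur.execute("EXPLAIN " + sql)          # planner only, cheap to repeat
        return float(COST_RE.search(cur.fetchone()[0]).group(1))

    def actual_ms(cur, sql):
        cur.execute("EXPLAIN ANALYZE " + sql)  # actually runs the query
        lines = [row[0] for row in cur.fetchall()]
        return float(TIME_RE.search(lines[-1]).group(1))

The inner loop over candidate GUC values would then only ever call
estimated_cost(); actual_ms() runs once per statement up front, and once
more at the end to verify the winning settings.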

   Regards,
      Dawid

Re: What about utility to calculate planner cost constants?

From
Richard Huxton
Date:
Greg Stark wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>>That's not really practical.   There are currently 5 major query tuning
>>parameters, not counting the memory adjustments which really can't be left
>>out.  You can't realistically test all combinations of 6 variables.
>
> I don't think it would be very hard at all actually.
[snip]
> What's needed is for the explain plan to total up the costing penalties
> independently. So the result would be something like
>
> 1000 * random_page_cost + 101 * sequential_page_cost + 2000 * index_tuple_cost
> + ...
>
> In other words a tuple like <1000,101,2000,...>
 >
> And explain analyze would produce the above tuple along with the resulting
> time.
>
> Some program would have to gather these values from the log or stats data and
> gather them up into a large linear system and solve for values that minimize
> the divergence from the observed times.

You'd only need to log them if they diverged from expected anyway. That
should result in fairly low activity pretty quickly (or we're wasting
our time). Should they go to the stats collector rather than logs?

> (Also, currently explain analyze has overhead that makes this impractical.
> Ideally it could subtract out its overhead so the solutions would be accurate
> enough to be useful)

Don't we only need the top-level figures though? There's no need to
record timings for each stage, just work completed.

--
   Richard Huxton
   Archonet Ltd

Re: What about utility to calculate planner cost constants?

From
Christopher Browne
Date:
Martha Stewart called it a Good Thing when gsstark@mit.edu (Greg Stark) wrote:
> I don't think it would be very hard at all actually.
>
> It's just a linear algebra problem with a bunch of independent
> variables and a system of equations. Solving for values for all of
> them is a straightforward problem.
>
> Of course in reality these variables aren't actually independent
> because the costing model isn't perfect. But that wouldn't be a
> problem, it would just reduce the accuracy of the results.

Are you certain it's a linear system?  I'm not.  If it was a matter of
minimizing a linear expression subject to some set of linear
equations, then we could model this as a Linear Program for which
there are some perfectly good solvers available.  (Few with BSD-style
licenses, but we could probably get some insight out of running for a
while with something that's there...)

I think there's good reason to consider it to be distinctly
NON-linear, which makes it way more challenging to solve the problem.

There might well be some results to be gotten out of a linear
approximation; the Grand Challenge is to come up with the model in the
first place...
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','gmail.com').
http://linuxdatabases.info/info/or.html
"Tom Christiansen  asked me,  "Chip, is there  anything that  you like
that isn't big and complicated?" C++, EMACS, Perl, Unix, English-no, I
guess not." -- Chip Salzenberg, when commenting on Perl6/C++

Re: What about utility to calculate planner cost constants?

From
Bruno Wolff III
Date:
On Tue, Mar 22, 2005 at 08:09:40 -0500,
  Christopher Browne <cbbrowne@acm.org> wrote:
>
> Are you certain it's a linear system?  I'm not.  If it was a matter of
> minimizing a linear expression subject to some set of linear
> equations, then we could model this as a Linear Program for which
> there are some perfectly good solvers available.  (Few with BSD-style
> licenses, but we could probably get some insight out of running for a
> while with something that's there...)

For less than 100 equations and 100 unknowns, you should be able to use
naive solvers. After that you don't get very accurate answers without
being more careful. I still have my numerical analysis textbooks around
and can look up algorithms for doing this without too much trouble.
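
For what it's worth, a naive solver in that sense is only a screenful of
code; something like this (plain Python, Gaussian elimination with partial
pivoting) would do for a handful of cost constants, applied to the normal
equations of the overdetermined system:

    def solve(A, b):
        """Gaussian elimination with partial pivoting; fine for small systems."""
        n = len(A)
        M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
        for col in range(n):
            # Largest pivot in the column keeps rounding error in check.
            pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
            M[col], M[pivot] = M[pivot], M[col]
            for row in range(col + 1, n):
                factor = M[row][col] / M[col][col]
                for k in range(col, n + 1):
                    M[row][k] -= factor * M[col][k]
        x = [0.0] * n
        for row in range(n - 1, -1, -1):                   # back-substitution
            s = sum(M[row][k] * x[k] for k in range(row + 1, n))
            x[row] = (M[row][n] - s) / M[row][row]
        return x

    print(solve([[2.0, 1.0], [1.0, 3.0]], [5.0, 10.0]))    # -> [1.0, 3.0]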

Re: What about utility to calculate planner cost constants?

From
Greg Stark
Date:
Richard Huxton <dev@archonet.com> writes:

> You'd only need to log them if they diverged from expected anyway. That should
> result in fairly low activity pretty quickly (or we're wasting our time).
> Should they go to the stats collector rather than logs?

I think you need to log them all. Otherwise when you go to analyze the numbers
and come up with ideal values you're going to be basing your optimization on a
skewed subset.

I don't know whether the stats collector or the logs is better suited to this.

> > (Also, currently explain analyze has overhead that makes this impractical.
> > Ideally it could subtract out its overhead so the solutions would be accurate
> > enough to be useful)
>
> Don't we only need the top-level figures though? There's no need to record
> timings for each stage, just work completed.

I guess you only need top level values. But you also might want some flag if
the row counts for any node were far off. In that case perhaps you would want
to discard the data point.

--
greg

Re: What about utility to calculate planner cost constants?

From
Greg Stark
Date:
Christopher Browne <cbbrowne@acm.org> writes:

> Are you certain it's a linear system?

If you just consider the guc parameters that tell postgres how long various
real world operations take (all the *_cost parameters) then it's a linear
system. It has to be. The resulting time is just a sum of the times for some
number of each of these real world operations.

If you include parameters like the geqo_* parameters or the hypothetical
parameter that controls what selectivity to assume for clauses with unknown
selectivity then no, it wouldn't be.

But if you assume the estimated row counts are correct and you're just trying
to solve for the parameters to come up with the most accurate cost for the
current hardware then I think you're golden.

> There might well be some results to be gotten out of a linear
> approximation; the Grand Challenge is to come up with the model in the
> first place...

Indeed. The model's not perfect now of course, and it'll never really be
perfect since some of the parameters represent operations that aren't always a
consistent cost. But you should be able to solve for the values that result in
the most accurate totals the most often. There may be some tradeoffs (and
therefore new guc variables :)

PS

It occurs to me that there's no reason to use the unreliable EXPLAIN counts of
the costs. You may as well account accurately for them and use the actual
values used in performing the query. This means there's no reason to discard
inaccurately estimated data points.

Moreover, the overhead issue is a non-issue, since you only need the total time
and the total costs. You would have the overhead of performing lots of
increments on those costs, but you only have to do two gettimeofdays: once at
the beginning and once at the end.
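
In Python terms (the real executor accounting would be C, of course), the
bookkeeping amounts to something like:

    import time

    class CostAccounting:
        """Tally primitive operations; read the clock only twice per query."""
        def __init__(self):
            self.counts = {"seq_page": 0, "random_page": 0,
                           "index_tuple": 0, "cpu_tuple": 0}
            self.start = None

        def begin(self):
            self.start = time.time()                 # gettimeofday #1

        def charge(self, op, n=1):
            self.counts[op] += n                     # cheap increment, no clock

        def finish(self):
            elapsed_ms = (time.time() - self.start) * 1000.0   # gettimeofday #2
            return self.counts, elapsed_ms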

--
greg

Re: What about utility to calculate planner cost constants?

From
Richard Huxton
Date:
Greg Stark wrote:
> Richard Huxton <dev@archonet.com> writes:
>
>>You'd only need to log them if they diverged from expected anyway. That should
>>result in fairly low activity pretty quickly (or we're wasting our time).
>>Should they go to the stats collector rather than logs?
>
> I think you need to log them all. Otherwise when you go to analyze the numbers
> and come up with ideal values you're going to be basing your optimization on a
> skewed subset.

I can see your thinking, though I must admit I was thinking of a more
iterative process: estimate deltas, change config, check, repeat. I'm not
convinced there are "ideal" values with a changing workload - for
example, random_page_cost will presumably vary depending on how much
contention there is for random seeks. Certainly, effective_cache_size
could vary.

> I don't know whether the stats collector or the logs is better suited to this.
>
>>>(Also, currently explain analyze has overhead that makes this impractical.
>>>Ideally it could subtract out its overhead so the solutions would be accurate
>>>enough to be useful)
>>
>>Don't we only need the top-level figures though? There's no need to record
>>timings for each stage, just work completed.
>
> I guess you only need top level values. But you also might want some flag if
> the row counts for any node were far off. In that case perhaps you would want
> to discard the data point.

I think you'd need to adjust work-estimates by actual-rows / estimated-rows.

I _was_ trying to think of a clever way of using row mis-estimates to
correct statistics automatically. This was triggered by the discussion a
few weeks ago about hints to the planner and the recent talk about plan
caching. Some sort of feedback loop so the planner would know its
estimates were off should be a big win from an ease-of-use point of
view. It didn't look easy to do, though :-(

--
   Richard Huxton
   Archonet Ltd

Re: What about utility to calculate planner cost constants?

From
Tom Lane
Date:
Christopher Browne <cbbrowne@acm.org> writes:
> Martha Stewart called it a Good Thing when gsstark@mit.edu (Greg Stark) wrote:
>> It's just a linear algebra problem with a bunch of independent
>> variables and a system of equations. Solving for values for all of
>> them is a straightforward problem.

> Are you certain it's a linear system?  I'm not.

I'm quite certain it isn't a linear system, because the planner's cost
models include nonlinear equations.

While I don't have a whole lot of hard evidence to back this up, my
belief is that our worst problems stem not from bad parameter values
but from wrong models.  In particular we *know* that the cost model for
nestloop-inner-indexscan joins is wrong, because it doesn't account for
cacheing effects across repeated scans.  There are some other obvious
weak spots as well.  It could be argued that we ought to allow the
system to assume index cacheing even for standalone queries, on the
grounds that if you are doing a query often enough to care about it,
there was probably a recent use of the same query that pulled in the
upper index levels.  The current cost models all assume starting from
ground zero with empty caches for each query, and that is surely not
reflective of many real-world cases.

I've looked at fixing this a couple times, but so far my attempts
to devise a more believable index access cost estimator have come
out with numbers higher than the current estimates ... not the
direction we want it to go :-(

Anyway, I see little point in trying to build an automatic parameter
optimizer until we have cost models whose parameters are more stable
than the current ones.

            regards, tom lane

Re: What about utility to calculate planner cost constants?

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Christopher Browne <cbbrowne@acm.org> writes:
>> Are you certain it's a linear system?

> If you just consider the guc parameters that tell postgres how long various
> real world operations take (all the *_cost parameters) then it's a linear
> system. It has to be.

No it doesn't.  Think caching effects for instance.  We do have cache
effects in the cost models, even though they are wrong as per my nearby
rant...

            regards, tom lane

Re: What about utility to calculate planner cost constants?

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Christopher Browne <cbbrowne@acm.org> writes:
> > Martha Stewart called it a Good Thing when gsstark@mit.edu (Greg Stark) wrote:
> >> It's just a linear algebra problem with a bunch of independent
> >> variables and a system of equations. Solving for values for all of
> >> them is a straightforward problem.
>
> > Are you certain it's a linear system?  I'm not.
>
> I'm quite certain it isn't a linear system, because the planner's cost
> models include nonlinear equations.

The equations will all be linear for the *_cost variables. If they weren't,
they would be meaningless; the units would be wrong. Things like caching are
just going to be the linear factors that determine how many random page costs
and sequential page costs to charge the query.

> While I don't have a whole lot of hard evidence to back this up, my
> belief is that our worst problems stem not from bad parameter values
> but from wrong models.

I think these are orthogonal issues.

The times spent in real-world operations like random page accesses, sequential
page accesses, CPU operations, index lookups, etc., are all measurable
quantities. They can be directly measured or approximated by looking at the
resulting net times. Measuring these things instead of asking the user to
provide them is just a nicer user experience.

Separately, plugging these values into a more and more accurate model will come
up with better estimates for how many of these operations a query will need to
perform.

> Anyway, I see little point in trying to build an automatic parameter
> optimizer until we have cost models whose parameters are more stable
> than the current ones.

Well, what people currently do is tweak the physical values until they produce
results for their workload that match reality. It would be neat if postgres
could do this automatically.

Arguably, the more accurate the cost model, the less motivation there is for
automatic adjustments, since you could easily plug in accurate values
from the hardware specs. But actually I think it'll always be a nice feature.

--
greg

Re: What about utility to calculate planner cost constants?

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> The times spent in real-world operations like random page accesses, sequential
> page accesses, CPU operations, index lookups, etc., are all measurable
> quantities. They can be directly measured or approximated by looking at the
> resulting net times.

That's the theory, all right, and that's why I've been resistant to
lowering random_page_cost just because "it gives better answers".
To the extent that you believe that is a real physical parameter with
a definable value (which is a bit debatable actually, but nevermind)
it should be possible to measure it by experiment.

The difficulty with the notion of doing that measurement by timing
Postgres operations is that it's a horribly bad experimental setup.
You have no way to isolate the effects of just one variable, or even
a small number of variables, which you really need to do if you want
to estimate with any degree of precision.  What's more, there are plenty
of relevant factors that aren't in the model at all (such as the extent
of other load on the machine), and so the noise in the measurements
will be enormous.

And you can't just dismiss the issue of wrong cost models and say we can
get numbers anyway.  We see examples almost every day on this list where
the planner is so far off about indexscan vs seqscan costs that you'd
have to put random_page_cost far below 1 to make its numbers line up
with reality.  That's not a matter of tuning the parameter, it's
evidence that the cost model is wrong.  If you try to solve for the
"right value" of the parameter by comparing estimated and actual costs,
you'll get garbage, even without any issues about noisy measurements
or numerical instability of your system of equations.

            regards, tom lane

Re: What about utility to calculate planner cost constants?

From
"Dave Held"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Tuesday, March 22, 2005 3:48 PM
> To: Greg Stark
> Cc: Christopher Browne; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] What about utility to calculate planner cost
> constants?
> [...]
> The difficulty with the notion of doing that measurement by timing
> Postgres operations is that it's a horribly bad experimental setup.
> You have no way to isolate the effects of just one variable, or even
> a small number of variables, which you really need to do if you want
> to estimate with any degree of precision.  What's more, there
> are plenty of relevant factors that aren't in the model at all (such
> as the extent of other load on the machine), and so the noise in the
> measurements will be enormous.
>
> And you can't just dismiss the issue of wrong cost models and say we
> can get numbers anyway.  We see examples almost every day on this
> list where the planner is so far off about indexscan vs seqscan costs
> that you'd have to put random_page_cost far below 1 to make its numbers
> line up with reality.  That's not a matter of tuning the parameter,
> it's evidence that the cost model is wrong.  If you try to solve for
> the "right value" of the parameter by comparing estimated and actual
> costs, you'll get garbage, even without any issues about noisy
> measurements or numerical instability of your system of equations.

Then instead of building a fixed cost model, why not evolve an adaptive
model using an ANN or GA?  I can't say that I'm remotely familiar with
how the planner does its business, but perhaps we should throw away all
these tunable cost parameters and let a neural network create them
implicitly, if they really exist in any meaningful form.  I suppose the
inputs to the network would be the available scan types, the actual and
estimated rows, correlations, etc.  The outputs would be query plans, is
that right?  So we pick several representative data points in the query
space and train the network on those, to "bootstrap" it.  With any luck,
the network will generalize the training inputs and do a halfway decent
job on real-world values.  If a user is unhappy with the way the network
is operating, they can switch on a training mode whereby the network
tries some different plans for a given query and uses the execution time
to judge which plans worked the best.  The alternative plans can be
suggested by built-in heuristics or perhaps built randomly.  Of course,
such training would only be practical for smaller data sets, but perhaps
there would be a way to let the network perform a query on a subset of
the data and then extrapolate the behavior of a plan over the full data
set.
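
As a toy sketch of the GA half of this idea (evolving the cost GUCs directly
rather than a network; run_workload_ms() is a placeholder for "apply the
settings and time the core set of queries", and the parameter bounds are
made up):

    import random

    PARAMS = ["random_page_cost", "cpu_tuple_cost",
              "cpu_index_tuple_cost", "cpu_operator_cost"]
    BOUNDS = {"random_page_cost": (1.0, 10.0), "cpu_tuple_cost": (0.001, 0.1),
              "cpu_index_tuple_cost": (0.0005, 0.05),
              "cpu_operator_cost": (0.0001, 0.01)}

    def run_workload_ms(settings):
        raise NotImplementedError("apply settings, run the query mix, return total ms")

    def random_individual():
        return {p: random.uniform(*BOUNDS[p]) for p in PARAMS}

    def mutate(ind):
        child = dict(ind)
        p = random.choice(PARAMS)
        lo, hi = BOUNDS[p]
        child[p] = min(hi, max(lo, child[p] * random.uniform(0.8, 1.25)))
        return child

    def evolve(generations=20, pop_size=12):
        population = [random_individual() for _ in range(pop_size)]
        for _ in range(generations):
            scored = sorted(population, key=run_workload_ms)
            survivors = scored[: pop_size // 3]        # keep the fastest third
            population = survivors + [mutate(random.choice(survivors))
                                      for _ in range(pop_size - len(survivors))]
        return min(population, key=run_workload_ms)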

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

Re: What about utility to calculate planner cost constants?

From
"Dave Held"
Date:
> -----Original Message-----
> From: Dave Held
> Sent: Tuesday, March 22, 2005 4:16 PM
> To: Tom Lane
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] What about utility to calculate planner cost
> constants?
> [...]
> Then instead of building a fixed cost model, why not evolve
> an adaptive model using an ANN or GA?
> [...]

Another advantage to an adaptive planner is that for those who
can afford to duplicate/replicate their hardware/db, they can
perhaps dedicate a backup server to plan optimization where the
db just runs continuously in a learning mode trying out different
plans for a core set of important queries.  Then those network
parameters can get replicated back to the primary server(s),
hopefully improving query planning on the production dbs.  And
perhaps people could make those parameters public, with the hope
that someone with a similar setup could benefit from a pre-
learned network.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

Re: What about utility to calculate planner cost constants?

From
Ron Mayer
Date:
Tom Lane wrote:
>
> And you can't just dismiss the issue of wrong cost models and say we can
> get numbers anyway.

Is there a way to see more details about the cost estimates?
EXPLAIN ANALYZE seems to show the total time and rows, but not
information like how many disk pages were accessed.

I get the feeling that sometimes the number of rows is estimated
very well, but the amount of disk I/O is way off.

Sometimes the number of pages read/written is grossly
overestimated (if tables have a lot of locally clustered data)
or underestimated (if a sort barely exceeds sort_mem).


Perhaps an EXPLAIN ANALYZE VERBOSE that would add info like this:

    Index scan ([...]estimated 1000 pages read)  (actual[...] 10 pages read)

would help track those down?


Re: What about utility to calculate planner cost constants?

From
Chris Browne
Date:
dave.held@arrayservicesgrp.com ("Dave Held") writes:
>> -----Original Message-----
>> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> Sent: Tuesday, March 22, 2005 3:48 PM
>> To: Greg Stark
>> Cc: Christopher Browne; pgsql-performance@postgresql.org
>> Subject: Re: [PERFORM] What about utility to calculate planner cost
>> constants?
>> [...]
>> The difficulty with the notion of doing that measurement by timing
>> Postgres operations is that it's a horribly bad experimental setup.
>> You have no way to isolate the effects of just one variable, or even
>> a small number of variables, which you really need to do if you want
>> to estimate with any degree of precision.  What's more, there
>> are plenty of relevant factors that aren't in the model at all (such
>> as the extent of other load on the machine), and so the noise in the
>> measurements will be enormous.
>>
>> And you can't just dismiss the issue of wrong cost models and say we
>> can get numbers anyway.  We see examples almost every day on this
>> list where the planner is so far off about indexscan vs seqscan costs
>> that you'd have to put random_page_cost far below 1 to make its numbers
>> line up with reality.  That's not a matter of tuning the parameter,
>> it's evidence that the cost model is wrong.  If you try to solve for
>> the "right value" of the parameter by comparing estimated and actual
>> costs, you'll get garbage, even without any issues about noisy
>> measurements or numerical instability of your system of equations.
>
> Then instead of building a fixed cost model, why not evolve an adaptive
> model using an ANN or GA?  I can't say that I'm remotely familiar with
> how the planner does its business, but perhaps we should throw away all
> these tunable cost parameters and let a neural network create them
> implicitly, if they really exist in any meaningful form.  I suppose the
> inputs to the network would be the available scan types, the actual and
> estimated rows, correlations, etc.  The outputs would be query plans, is
> that right?  So we pick several representative data points in the query
> space and train the network on those, to "bootstrap" it.  With any luck,
> the network will generalize the training inputs and do a halfway decent
> job on real-world values.  If a user is unhappy with the way the network
> is operating, they can switch on a training mode whereby the network
> tries some different plans for a given query and uses the execution time
> to judge which plans worked the best.  The alternative plans can be
> suggested by built-in heuristics or perhaps built randomly.  Of course,
> such training would only be practical for smaller data sets, but perhaps
> there would be a way to let the network perform a query on a subset of
> the data and then extrapolate the behavior of a plan over the full data
> set.

This strikes me as an interesting approach for trying to determine the
proper shape of the cost model.  I'd also want to consider simulated
annealing (SA) (e.g. - perhaps Lester Ingber's ASA code...).

We take such a network, perhaps assuming some small-degree polynomial
set of parameters, train it on some reasonably sizable set of
queries, and then see which of those parameters wind up carrying real
weight and which do not.

That would provide results that would allow improving the model.

I wouldn't assume that an untuned ANN/GA/SA would provide useful results
in general.

It would certainly be pretty cool if we could get this approach into a
production query optimizer; we would hope that this would, in effect,
tune itself, over time.

And I suppose that actually is somewhat plausible...  Each query plan
that comes through winds up having some "expected cost", and after
executing that plan, we have an "actual cost" which could be used as
feedback to the effect that the estimate was either pretty right or
pretty wrong.

We'd presumably start by taking our "traditional cost estimate" as
being the way to go; when it gets sufficiently clear that the ANN/GA
network is providing a more accurate cost, it would make sense to make
the jump...

What would also be very interesting would be to see the degree to
which analytical results could be taken out of this.  For instance,
some cost factors might turn out to be pretty universally true, and we
might discover that most of the benefit can come from using a pretty
static network of parameters.
--
(format nil "~S@~S" "cbbrowne" "cbbrowne.com")
http://www3.sympatico.ca/cbbrowne/or.html
"The  present  need for  security  products far exceeds  the number of
individuals    capable of    designing  secure  systems. Consequently,
industry  has resorted to  employing folks  and purchasing "solutions"
from vendors that shouldn't be let near a project involving securing a
system."  -- Lucky Green

Re: What about utility to calculate planner cost constants?

From
Kenneth Marshall
Date:
On Tue, Mar 22, 2005 at 08:09:40AM -0500, Christopher Browne wrote:
> Martha Stewart called it a Good Thing when gsstark@mit.edu (Greg Stark) wrote:
> > I don't think it would be very hard at all actually.
> >
> > It's just a linear algebra problem with a bunch of independent
> > variables and a system of equations. Solving for values for all of
> > them is a straightforward problem.
> >
> > Of course in reality these variables aren't actually independent
> > because the costing model isn't perfect. But that wouldn't be a
> > problem, it would just reduce the accuracy of the results.
>
> Are you certain it's a linear system?  I'm not.  If it was a matter of
> minimizing a linear expression subject to some set of linear
> equations, then we could model this as a Linear Program for which
> there are some perfectly good solvers available.  (Few with BSD-style
> licenses, but we could probably get some insight out of running for a
> while with something that's there...)
>
> I think there's good reason to consider it to be distinctly
> NON-linear, which makes it way more challenging to solve the problem.
>
Non-linear optimization works very well in many cases. Issues such
as local minima can be addressed. In a sense, the planner output
can be treated as a blackbox function and the "goodness" of the
solution is how well it approximates the actual query times. In this
case, it will be imperative to constrain some of the values to prevent
"crazy" configurations.

Ken