Re: What about utility to calculate planner cost constants? - Mailing list pgsql-performance

From Chris Browne
Subject Re: What about utility to calculate planner cost constants?
Date
Msg-id 607jjztg1l.fsf@dba2.int.libertyrms.com
In response to Re: What about utility to calculate planner cost constants?  ("Dave Held" <dave.held@arrayservicesgrp.com>)
List pgsql-performance
dave.held@arrayservicesgrp.com ("Dave Held") writes:
>> -----Original Message-----
>> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> Sent: Tuesday, March 22, 2005 3:48 PM
>> To: Greg Stark
>> Cc: Christopher Browne; pgsql-performance@postgresql.org
>> Subject: Re: [PERFORM] What about utility to calculate planner cost
>> constants?
>> [...]
>> The difficulty with the notion of doing that measurement by timing
>> Postgres operations is that it's a horribly bad experimental setup.
>> You have no way to isolate the effects of just one variable, or even
>> a small number of variables, which you really need to do if you want
>> to estimate with any degree of precision.  What's more, there
>> are plenty of relevant factors that aren't in the model at all (such
>> as the extent of other load on the machine), and so the noise in the
>> measurements will be enormous.
>>
>> And you can't just dismiss the issue of wrong cost models and say we
>> can get numbers anyway.  We see examples almost every day on this
>> list where the planner is so far off about indexscan vs seqscan costs
>> that you'd have to put random_page_cost far below 1 to make its numbers
>> line up with reality.  That's not a matter of tuning the parameter,
>> it's evidence that the cost model is wrong.  If you try to solve for
>> the "right value" of the parameter by comparing estimated and actual
>> costs, you'll get garbage, even without any issues about noisy
>> measurements or numerical instability of your system of equations.
>
> Then instead of building a fixed cost model, why not evolve an adaptive
> model using an ANN or GA?  I can't say that I'm remotely familiar with
> how the planner does its business, but perhaps we should throw away all
> these tunable cost parameters and let a neural network create them
> implicitly, if they really exist in any meaningful form.  I suppose the
> inputs to the network would be the available scan types, the actual and
> estimated rows, correlations, etc.  The outputs would be query plans, is
> that right?  So we pick several representative data points in the query
> space and train the network on those, to "bootstrap" it.  With any luck,
> the network will generalize the training inputs and do a halfway decent
> job on real-world values.  If a user is unhappy with the way the network
> is operating, they can switch on a training mode whereby the network
> tries some different plans for a given query and uses the execution time
> to judge which plans worked the best.  The alternative plans can be
> suggested by built-in heuristics or perhaps built randomly.  Of course,
> such training would only be practical for smaller data sets, but perhaps
> there would be a way to let the network perform a query on a subset of
> the data and then extrapolate the behavior of a plan over the full data
> set.

This strikes me as an interesting approach for trying to determine the
proper shape of the cost model.  I'd also want to consider simulated
annealing (SA), e.g. Lester Ingber's ASA code...

We could take such a network, perhaps restricted to a low-degree
polynomial in the candidate parameters, train it on a reasonably
sizable set of queries, and then see which of those parameters end up
carrying significant weight and which do not.

That would give us results we could use to improve the cost model.
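
To make that concrete, here is a minimal sketch, in Python, of what the
low-degree-polynomial version might look like: regress measured
runtimes on per-plan resource counts and see which fitted coefficients
come out large.  The feature columns and all the numbers are invented
for illustration; a real harness would gather them from EXPLAIN ANALYZE
output over many queries, with far more samples than parameters.

  import numpy as np

  # Made-up per-execution resource counts, e.g. pulled from EXPLAIN
  # ANALYZE: sequential pages, random pages, tuples processed,
  # operator evaluations, index tuples processed.
  features = np.array([
      [1200.0,   4.0,  50000.0,  50000.0,    0.0],
      [  10.0, 900.0,    800.0,    800.0,  900.0],
      [3400.0,   0.0,  90000.0, 180000.0,    0.0],
  ])
  actual_ms = np.array([310.0, 95.0, 870.0])  # measured runtimes (made up)

  # Least-squares fit: runtime ~= sum(coef[i] * feature[i]).  The fitted
  # coefficients play the same role as seq_page_cost, random_page_cost,
  # cpu_tuple_cost, etc., but are expressed in milliseconds per unit
  # rather than in the planner's dimensionless cost units.
  coefs, _, _, _ = np.linalg.lstsq(features, actual_ms, rcond=None)

  names = ["seq_page", "rand_page", "cpu_tuple", "cpu_operator",
           "cpu_index_tuple"]
  for name, c in zip(names, coefs):
      print("%-16s %.6f ms per unit" % (name, c))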

I wouldn't assume that an untuned ANN/GA/SA would provide useful results
in general.

It would certainly be pretty cool if we could get this approach into a
production query optimizer; the hope would be that it would, in effect,
tune itself over time.

And I suppose that actually is somewhat plausible...  Each query plan
that comes through winds up having some "expected cost", and after
executing that plan, we have an "actual cost" that can be fed back to
indicate whether the estimate was roughly right or badly wrong.
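
A minimal sketch of that feedback loop, again in Python with invented
numbers: log each plan's estimated cost beside its measured runtime,
derive a scale factor from planner cost units to milliseconds, and
track how far off the estimates are on average.

  import math

  class CostFeedback:
      # Hypothetical feedback collector: pairs each plan's estimated
      # cost (planner units) with its measured runtime (ms).
      def __init__(self):
          self.samples = []              # (estimated_cost, actual_ms)

      def record(self, estimated_cost, actual_ms):
          self.samples.append((estimated_cost, actual_ms))

      def ms_per_cost_unit(self):
          # Least-squares scale through the origin: actual ~= k * estimate.
          num = sum(e * a for e, a in self.samples)
          den = sum(e * e for e, a in self.samples)
          return num / den if den else None

      def mean_log_error(self):
          # Average multiplicative error of the (rescaled) estimates.
          k = self.ms_per_cost_unit()
          errs = [abs(math.log(k * e / a)) for e, a in self.samples if a > 0]
          return sum(errs) / len(errs) if errs else None

  fb = CostFeedback()
  fb.record(estimated_cost=1540.2, actual_ms=220.0)   # invented numbers
  fb.record(estimated_cost=8800.0, actual_ms=2900.0)
  print(fb.ms_per_cost_unit(), fb.mean_log_error())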

We'd presumably start by treating our "traditional cost estimate" as
authoritative; once it became sufficiently clear that the ANN/GA model
was producing more accurate costs, it would make sense to make the
jump...
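
One simple way to decide when to "make the jump", sketched below on the
assumption that we record how far off each estimator was for every
executed query: keep a rolling window of errors for both the
traditional estimator and the learned one, and switch only once the
learned model is consistently better.  The window size and margin are
arbitrary placeholders.

  from collections import deque

  WINDOW = 200    # arbitrary: how many recent queries to judge on
  MARGIN = 0.8    # learned model must beat the traditional one by ~20%

  trad_err = deque(maxlen=WINDOW)     # |log(predicted/actual)| per query
  learned_err = deque(maxlen=WINDOW)

  def use_learned_model():
      # Switch only once both estimators have a full window of evidence
      # and the learned model's average error is clearly lower.
      if len(trad_err) < WINDOW or len(learned_err) < WINDOW:
          return False
      return sum(learned_err) / WINDOW < MARGIN * (sum(trad_err) / WINDOW)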

What would also be very interesting would be to see how much analytical
insight could be extracted from this.  For instance, some cost factors
might turn out to hold nearly universally, and we might discover that
most of the benefit comes from using a fairly static network of
parameters.
--
(format nil "~S@~S" "cbbrowne" "cbbrowne.com")
http://www3.sympatico.ca/cbbrowne/or.html
"The  present  need for  security  products far exceeds  the number of
individuals    capable of    designing  secure  systems. Consequently,
industry  has resorted to  employing folks  and purchasing "solutions"
from vendors that shouldn't be let near a project involving securing a
system."  -- Lucky Green
