Re: What about utility to calculate planner cost constants? - Mailing list pgsql-performance
| From | Chris Browne |
|---|---|
| Subject | Re: What about utility to calculate planner cost constants? |
| Date | |
| Msg-id | 607jjztg1l.fsf@dba2.int.libertyrms.com |
| In response to | Re: What about utility to calculate planner cost constants? ("Dave Held" <dave.held@arrayservicesgrp.com>) |
| List | pgsql-performance |
dave.held@arrayservicesgrp.com ("Dave Held") writes:
>> -----Original Message-----
>> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> Sent: Tuesday, March 22, 2005 3:48 PM
>> To: Greg Stark
>> Cc: Christopher Browne; pgsql-performance@postgresql.org
>> Subject: Re: [PERFORM] What about utility to calculate planner cost
>> constants?
>> [...]
>> The difficulty with the notion of doing that measurement by timing
>> Postgres operations is that it's a horribly bad experimental setup.
>> You have no way to isolate the effects of just one variable, or even
>> a small number of variables, which you really need to do if you want
>> to estimate with any degree of precision. What's more, there are
>> plenty of relevant factors that aren't in the model at all (such as
>> the extent of other load on the machine), and so the noise in the
>> measurements will be enormous.
>>
>> And you can't just dismiss the issue of wrong cost models and say we
>> can get numbers anyway. We see examples almost every day on this
>> list where the planner is so far off about indexscan vs seqscan
>> costs that you'd have to put random_page_cost far below 1 to make
>> its numbers line up with reality. That's not a matter of tuning the
>> parameter, it's evidence that the cost model is wrong. If you try to
>> solve for the "right value" of the parameter by comparing estimated
>> and actual costs, you'll get garbage, even without any issues about
>> noisy measurements or numerical instability of your system of
>> equations.
>
> Then instead of building a fixed cost model, why not evolve an
> adaptive model using an ANN or GA? I can't say that I'm remotely
> familiar with how the planner does its business, but perhaps we
> should throw away all these tunable cost parameters and let a neural
> network create them implicitly, if they really exist in any
> meaningful form. I suppose the inputs to the network would be the
> available scan types, the actual and estimated rows, correlations,
> etc. The outputs would be query plans, is that right? So we pick
> several representative data points in the query space and train the
> network on those, to "bootstrap" it. With any luck, the network will
> generalize the training inputs and do a halfway decent job on
> real-world values. If a user is unhappy with the way the network is
> operating, they can switch on a training mode whereby the network
> tries some different plans for a given query and uses the execution
> time to judge which plans worked the best. The alternative plans can
> be suggested by built-in heuristics or perhaps built randomly. Of
> course, such training would only be practical for smaller data sets,
> but perhaps there would be a way to let the network perform a query
> on a subset of the data and then extrapolate the behavior of a plan
> over the full data set.

This strikes me as an interesting approach to determining the proper
shape of the cost model. I'd also want to consider simulated annealing
(SA), perhaps using Lester Ingber's ASA code. We take such a network,
perhaps assuming some small-degree polynomial set of parameters, train
it on a reasonably sizable set of queries, and then see which of those
parameters wind up being strong and which do not. That would provide
results we could use to improve the model. I wouldn't assume that an
untuned ANN/GA/SA would provide useful results in general.
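To make the idea a bit more concrete, here is a minimal sketch in Python of that kind of fitting. Everything in it is invented for illustration: the per-plan counts and timings are made up, the cost model is a deliberately simplified linear one rather than anything the planner actually uses, and it pretends cost units are directly comparable to milliseconds, which the real model does not claim.

```python
# Toy sketch: treat a few planner-style cost constants as free parameters of
# a simplified linear cost model, and let simulated annealing look for values
# that make the estimated cost track the observed runtime.
import math
import random

# Each observation: (sequential pages read, random pages read, tuples
# processed, measured runtime in ms) for one executed plan.  Made-up numbers.
observations = [
    (1000,    0, 50000, 120.0),
    (  10,  400,  2000,  35.0),
    ( 500,   50, 20000,  80.0),
    (   5, 1200,  8000, 150.0),
]

def model_cost(params, obs):
    """Linear estimate from (seq_page_cost, random_page_cost, cpu_tuple_cost)."""
    seq_cost, rand_cost, tup_cost = params
    seq_pages, rand_pages, tuples, _ = obs
    return seq_cost * seq_pages + rand_cost * rand_pages + tup_cost * tuples

def error(params):
    """Sum of squared differences between model cost and measured runtime."""
    return sum((model_cost(params, o) - o[3]) ** 2 for o in observations)

def anneal(start, steps=20000, temp0=1.0):
    """Plain simulated annealing over the three cost constants."""
    current, best = list(start), list(start)
    cur_err = best_err = error(current)
    for i in range(steps):
        temp = temp0 * (1.0 - i / steps) + 1e-6
        # Multiplicative perturbation keeps each constant positive and scales
        # the step to the constant's own magnitude.
        candidate = [p * math.exp(random.gauss(0.0, 0.1)) for p in current]
        cand_err = error(candidate)
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the temperature drops.
        if cand_err < cur_err or random.random() < math.exp(
            (cur_err - cand_err) / (temp * max(cur_err, 1e-9))
        ):
            current, cur_err = candidate, cand_err
            if cur_err < best_err:
                best, best_err = list(current), cur_err
    return best, best_err

if __name__ == "__main__":
    # Start from the stock defaults: seq_page_cost=1.0, random_page_cost=4.0,
    # cpu_tuple_cost=0.01.
    fitted, err = anneal([1.0, 4.0, 0.01])
    print("fitted constants:", [round(p, 4) for p in fitted], "error:", round(err, 1))
```

Of course, a fitting loop like this tells you nothing useful if the model's shape is wrong to begin with, which is exactly Tom's point.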
It would certainly be pretty cool if we could get this approach into a
production query optimizer; we would hope that it would, in effect,
tune itself over time. And I suppose that actually is somewhat
plausible... Each query plan that comes through winds up having some
"expected cost", and after executing that plan we have an "actual
cost", which could be used as feedback to the effect that the estimate
was either pretty right or pretty wrong. We'd presumably start by
taking our traditional cost estimate as the way to go; once it becomes
sufficiently clear that the ANN/GA network is providing a more
accurate cost, it would make sense to make the jump.

It would also be very interesting to see the degree to which
analytical results could be taken out of this. For instance, some cost
factors might turn out to be pretty universally true, and we might
discover that most of the benefit comes from using a fairly static
network of parameters.

--
(format nil "~S@~S" "cbbrowne" "cbbrowne.com")
http://www3.sympatico.ca/cbbrowne/or.html
"The present need for security products far exceeds the number of
individuals capable of designing secure systems. Consequently,
industry has resorted to employing folks and purchasing "solutions"
from vendors that shouldn't be let near a project involving securing a
system." -- Lucky Green
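As a rough illustration of that expected-versus-actual feedback loop: the sketch below keeps a sliding window of (estimate, runtime) pairs for both estimators and only switches once the learned one is clearly more accurate. The class names, error metric, and switch-over margin are all hypothetical, and it again pretends estimates and measured runtimes share a scale.

```python
# Sketch of the feedback idea: record (estimated cost, observed runtime) pairs
# for both the traditional and the learned estimator, and switch over only
# once the learned estimates are clearly tracking reality better.
from collections import deque

class EstimatorTracker:
    """Tracks the relative error of one cost estimator over a sliding window."""
    def __init__(self, window=200):
        self.samples = deque(maxlen=window)

    def record(self, estimated_cost, actual_ms):
        if actual_ms > 0:
            # Relative error keeps cheap and expensive queries comparable.
            self.samples.append(abs(estimated_cost - actual_ms) / actual_ms)

    def mean_error(self):
        return sum(self.samples) / len(self.samples) if self.samples else float("inf")

class CostModelChooser:
    """Stays with the traditional estimate until the learned one clearly wins."""
    def __init__(self, margin=0.8):
        self.traditional = EstimatorTracker()
        self.learned = EstimatorTracker()
        self.margin = margin  # learned model must beat traditional by 20%

    def feedback(self, traditional_est, learned_est, actual_ms):
        """Called after each query executes, with both estimates and the runtime."""
        self.traditional.record(traditional_est, actual_ms)
        self.learned.record(learned_est, actual_ms)

    def use_learned_model(self):
        return self.learned.mean_error() < self.margin * self.traditional.mean_error()

if __name__ == "__main__":
    chooser = CostModelChooser()
    # A few made-up (traditional estimate, learned estimate, runtime) triples.
    for trad, learned, actual in [(1500.0, 130.0, 120.0), (900.0, 40.0, 35.0), (600.0, 95.0, 80.0)]:
        chooser.feedback(trad, learned, actual)
    print("switch to learned model?", chooser.use_learned_model())
```

The margin is there so the jump only happens once the learned estimates are consistently, not just occasionally, better than the traditional ones.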