Thread: What about utility to calculate planner cost constants?
I was following the cpu_tuple_cost thread and wondering whether it would be possible to make a PQA-style utility that calculates configuration-specific values for the planner cost constants. It could make use of the output of log_(statement|parser|planner|executor)_stats, though I'm not sure that output contains anything useful for the purpose.

Otherwise it could just collect statements, run EXPLAIN ANALYZE on all of them, and then play with the planner cost constants to get the estimated values as close as possible to the actual values. Something like Goal Seek in Excel, if you'll pardon my reference to MS :).

A somewhat similar project seems to be pgautotune from pgFoundry, but it only considers shared_buffers, sort_mem and vacuum_mem. It also seems to use synthetic data instead of the actual database and actual statements from the log, and it has been inactive for a while.

Tambet
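For illustration, a rough sketch of that Goal Seek idea: replay a handful of logged statements under different candidate values of one cost constant (random_page_cost here) and score each candidate by how closely the planner's top-level cost estimate, after a crude cost-unit-to-millisecond scaling, tracks the measured runtime. The statement list, the candidate values, the connection string, the regexes and the use of psycopg2 are all placeholders; a real tool would need far more care than this.

import re
import psycopg2   # any client library would do; psycopg2 is just an example

# Statements collected from the query log, and candidate values to try for a
# single constant.  Both lists are placeholders.
STATEMENTS = ["SELECT ...", "SELECT ..."]
CANDIDATES = [1.0, 2.0, 3.0, 4.0]

COST_RE = re.compile(r"cost=[\d.]+\.\.([\d.]+)")
TIME_RE = re.compile(r"actual time=[\d.]+\.\.([\d.]+)")

def top_cost_and_time(plan_rows):
    """Pull the top-level estimated cost and measured time (ms) out of
    EXPLAIN ANALYZE text output."""
    top_line = plan_rows[0][0]
    return (float(COST_RE.search(top_line).group(1)),
            float(TIME_RE.search(top_line).group(1)))

def score(conn, rpc):
    """Lower is better: squared gap between estimated cost and actual time,
    after fitting one crude cost-unit-to-millisecond scale factor."""
    cur = conn.cursor()
    cur.execute("SET random_page_cost = %s", (rpc,))
    pairs = []
    for sql in STATEMENTS:
        cur.execute("EXPLAIN ANALYZE " + sql)   # note: this runs the query
        pairs.append(top_cost_and_time(cur.fetchall()))
    scale = sum(t for _, t in pairs) / sum(c for c, _ in pairs)
    return sum((c * scale - t) ** 2 for c, t in pairs)

conn = psycopg2.connect("dbname=mydb")
best = min(CANDIDATES, key=lambda rpc: score(conn, rpc))
print("best of the candidates tried:", best)

Since EXPLAIN ANALYZE actually executes each statement, replaying a production log this way is only safe against a copy of the database.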
Tambet,

> I was following the cpu_tuple_cost thread and wondering whether it would be
> possible to make a PQA-style utility to calculate configuration-specific
> values for planner cost constants. It could make use of the output of
> log_(statement|parser|planner|executor)_stats, though I'm not sure if the
> output contains anything useful for those purposes.

Yeah, that's something I need to look at.

> Otherwise it could just collect statements, run EXPLAIN ANALYZE for all
> of them and then play with planner cost constants to get the estimated
> values as close as possible to actual values. Something like Goal Seek
> in Excel, if you pardon my reference to MS :).

That's not really practical. There are currently 5 major query-tuning parameters, not counting the memory adjustments, which really can't be left out. You can't realistically test all combinations of 6 variables.

> Somewhat similar project seems to be pgautotune from pgFoundry, but it
> only considers shared_buffers, sort_mem and vacuum_mem. And it seems to
> use synthetic data instead of actual database and actual statements from
> log. And it has been inactive for a while.

Yeah, pg_autotune is a dead project. Once we got OSDL able to run tests, we came up with some rules of thumb which are more useful than autotune's output. More importantly, the approach doesn't scale to the 15-20 GUCs we'd realistically want to test.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
If by "not practical" you mean "no one has implemented a multivariable testing approach," I'll agree with you. But multivariable testing is definitely a valid statistical approach to solving just such problems.

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Mar 21, 2005, at 11:51 AM, Josh Berkus wrote:

> That's not really practical. There are currently 5 major query tuning
> parameters, not counting the memory adjustments which really can't be left
> out. You can't realistically test all combinations of 6 variables.
Thomas,

> If by not practical you mean, "no one has implemented a multivariable
> testing approach," I'll agree with you. But multivariable testing is
> definitely a valid statistical approach to solving just such problems.

Well, not practical as in: "would take either $10 million in equipment or 10,000 hours, or both."

--Josh

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:

> > Otherwise it could just collect statements, run EXPLAIN ANALYZE for all
> > of them and then play with planner cost constants to get the estimated
> > values as close as possible to actual values. Something like Goal Seek
> > in Excel, if you pardon my reference to MS :).
>
> That's not really practical. There are currently 5 major query tuning
> parameters, not counting the memory adjustments which really can't be left
> out. You can't realistically test all combinations of 6 variables.

I don't think it would be very hard at all, actually.

It's just a linear algebra problem with a bunch of independent variables and a system of equations. Solving for values for all of them is a straightforward problem.

Of course in reality these variables aren't actually independent, because the costing model isn't perfect. But that wouldn't be a problem; it would just reduce the accuracy of the results.

What's needed is for the explain plan to total up the costing penalties independently. So the result would be something like

  1000 * random_page_cost + 101 * sequential_page_cost + 2000 * index_tuple_cost + ...

In other words, a tuple like <1000, 101, 2000, ...>.

EXPLAIN ANALYZE would produce the above tuple along with the resulting time. Some program would then have to gather these values from the log or stats data, assemble them into a large linear system, and solve for values that minimize the divergence from the observed times.

(Cost penalties are currently normalized to sequential_page_cost being 1. That could be maintained, or it could be changed to be normalized to an expected 1ms.)

(Also, EXPLAIN ANALYZE currently has overhead that makes this impractical. Ideally it could subtract out its own overhead so the solutions would be accurate enough to be useful.)

-- 
greg
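Assuming the instrumented EXPLAIN ANALYZE output described here existed - a tuple of per-operation counts plus the measured runtime for each query - the fitting step is an ordinary linear least-squares problem. A minimal sketch with invented numbers (the counts, times, and the parameter names are made up for illustration; numpy is used only for the solve):

import numpy as np

# One row per query: (random page fetches, sequential page fetches,
# index tuple visits) as totalled by the hypothetical instrumented output.
counts = np.array([
    [1000.0,  101.0, 2000.0],
    [  10.0, 5000.0,    0.0],
    [ 400.0,  250.0,  900.0],
    [   0.0, 8000.0,   50.0],
])
# Observed runtimes in milliseconds for the same queries.
times = np.array([950.0, 620.0, 480.0, 800.0])

# Solve for the per-operation costs (in ms) that minimise the squared
# divergence between predicted and observed runtimes.
costs, residuals, rank, _ = np.linalg.lstsq(counts, times, rcond=None)

for name, c in zip(["random_page_cost", "sequential_page_cost", "index_tuple_cost"], costs):
    print(f"{name}: {c:.4f} ms per operation")

# To express them the way the planner does (sequential page fetch = 1.0),
# divide everything by the fitted sequential page cost.
print("normalised:", costs / costs[1])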
On Mon, 21 Mar 2005 14:59:56 -0800, Josh Berkus <josh@agliodbs.com> wrote:

> > If by not practical you mean, "no one has implemented a multivariable
> > testing approach," I'll agree with you. But multivariable testing is
> > definitely a valid statistical approach to solving just such problems.
>
> Well, not practical as in: "would take either $10 million in equipment or
> 10,000 hours or both"

I think you don't need to EXPLAIN ANALYZE each query with different GUCs; you would only need EXPLAIN most of the time (which is way quicker). Once you get "near" the actual values, you would run EXPLAIN ANALYZE to verify them.

Regards,
Dawid
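One way to read this suggestion: plain EXPLAIN is cheap, so a tool could sweep the candidate settings with EXPLAIN alone and only pay for EXPLAIN ANALYZE when the chosen plan actually changes. A rough sketch of that loop, with the query, the candidate values, and the connection details as stand-ins, and "Total runtime" parsing that assumes the 8.0-era EXPLAIN ANALYZE text format:

import psycopg2

QUERY = "SELECT ..."                      # a statement taken from the log
CANDIDATES = [1.5, 2.0, 3.0, 4.0]         # candidate random_page_cost values

def plan_shape(cur, sql):
    """Plan text with the cost numbers stripped, so two plans compare equal
    if they pick the same nodes even though the estimates differ."""
    cur.execute("EXPLAIN " + sql)
    return tuple(line.split("(cost=")[0] for (line,) in cur.fetchall())

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

timed = {}                                # plan shape -> measured runtime (ms)
for rpc in CANDIDATES:
    cur.execute("SET random_page_cost = %s", (rpc,))
    shape = plan_shape(cur, QUERY)
    if shape not in timed:                # only ANALYZE genuinely new plans
        cur.execute("EXPLAIN ANALYZE " + QUERY)
        rows = cur.fetchall()
        timed[shape] = float(rows[-1][0].split("Total runtime:")[1].split("ms")[0])
    print(f"random_page_cost={rpc}: chosen plan runs in ~{timed[shape]:.1f} ms")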
Greg Stark wrote:

> Josh Berkus <josh@agliodbs.com> writes:
>> That's not really practical. There are currently 5 major query tuning
>> parameters, not counting the memory adjustments which really can't be left
>> out. You can't realistically test all combinations of 6 variables.
>
> I don't think it would be very hard at all actually.
[snip]
> What's needed is for the explain plan to total up the costing penalties
> independently. So the result would be something like
>
>   1000 * random_page_cost + 101 * sequential_page_cost + 2000 * index_tuple_cost + ...
>
> In other words a tuple like <1000,101,2000,...>
>
> And explain analyze would produce the above tuple along with the resulting
> time.
>
> Some program would have to gather these values from the log or stats data and
> gather them up into a large linear system and solve for values that minimize
> the divergence from the observed times.

You'd only need to log them if they diverged from what was expected anyway. That should result in fairly low activity pretty quickly (or we're wasting our time). Should they go to the stats collector rather than the logs?

> (Also, currently explain analyze has overhead that makes this impractical.
> Ideally it could subtract out its overhead so the solutions would be accurate
> enough to be useful)

Don't we only need the top-level figures, though? There's no need to record timings for each stage, just the work completed.

-- 
Richard Huxton
Archonet Ltd
Martha Stewart called it a Good Thing when gsstark@mit.edu (Greg Stark) wrote:

> I don't think it would be very hard at all actually.
>
> It's just a linear algebra problem with a bunch of independent
> variables and a system of equations. Solving for values for all of
> them is a straightforward problem.
>
> Of course in reality these variables aren't actually independent
> because the costing model isn't perfect. But that wouldn't be a
> problem, it would just reduce the accuracy of the results.

Are you certain it's a linear system? I'm not. If it were a matter of minimizing a linear expression subject to some set of linear equations, then we could model this as a Linear Program, for which there are some perfectly good solvers available. (Few with BSD-style licenses, but we could probably get some insight out of running for a while with something that's out there...)

I think there's good reason to consider it distinctly NON-linear, which makes the problem way more challenging to solve.

There might well be some results to be gotten out of a linear approximation; the Grand Challenge is to come up with the model in the first place...

-- 
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','gmail.com').
http://linuxdatabases.info/info/or.html
"Tom Christiansen asked me, 'Chip, is there anything that you like that isn't big and complicated?' C++, EMACS, Perl, Unix, English - no, I guess not." -- Chip Salzenberg, when commenting on Perl6/C++
On Tue, Mar 22, 2005 at 08:09:40 -0500, Christopher Browne <cbbrowne@acm.org> wrote:

> Are you certain it's a linear system? I'm not. If it was a matter of
> minimizing a linear expression subject to some set of linear
> equations, then we could model this as a Linear Program for which
> there are some perfectly good solvers available. (Few with BSD-style
> licenses, but we could probably get some insight out of running for a
> while with something that's there...)

For fewer than 100 equations and 100 unknowns, you should be able to use naive solvers. Beyond that you don't get very accurate answers without being more careful. I still have my numerical analysis textbooks around and can look up algorithms for doing this without too much trouble.
Richard Huxton <dev@archonet.com> writes:

> You'd only need to log them if they diverged from expected anyway. That should
> result in fairly low activity pretty quickly (or we're wasting our time).
> Should they go to the stats collector rather than logs?

I think you need to log them all. Otherwise, when you go to analyze the numbers and come up with ideal values, you're going to be basing your optimization on a skewed subset. I don't know whether the stats collector or the logs is better suited to this.

> > (Also, currently explain analyze has overhead that makes this impractical.
> > Ideally it could subtract out its overhead so the solutions would be accurate
> > enough to be useful)
>
> Don't we only need the top-level figures though? There's no need to record
> timings for each stage, just work completed.

I guess you only need top-level values. But you also might want some flag if the row counts for any node were far off. In that case perhaps you would want to discard the data point.

-- 
greg
Christopher Browne <cbbrowne@acm.org> writes:

> Are you certain it's a linear system?

If you just consider the GUC parameters that tell Postgres how long various real-world operations take (all the *_cost parameters), then it's a linear system. It has to be. The resulting time is just a sum of the times for some number of each of these real-world operations.

If you include parameters like the geqo_* parameters, or the hypothetical parameter that controls what selectivity to assume for clauses with unknown selectivity, then no, it wouldn't be. But if you assume the estimated row counts are correct and you're just trying to solve for the parameters that come up with the most accurate cost for the current hardware, then I think you're golden.

> There might well be some results to be gotten out of a linear
> approximation; the Grand Challenge is to come up with the model in the
> first place...

Indeed. The model's not perfect now, of course, and it'll never really be perfect, since some of the parameters represent operations that aren't always a consistent cost. But you should be able to solve for the values that result in the most accurate totals most often. There may be some tradeoffs (and therefore new GUC variables :).

PS: It occurs to me that there's no reason to use the unreliable EXPLAIN counts of the costs. You may as well account for them accurately and use the actual values incurred while performing the query. This means there's no reason to discard inaccurately estimated data points.

Moreover, the overhead issue is a non-issue, since you only need the total time and the total costs. You would have the overhead of performing lots of increments on those counters, but you only have to do two gettimeofdays: once at the beginning and once at the end.

-- 
greg
Greg Stark wrote:

> Richard Huxton <dev@archonet.com> writes:
>
>> You'd only need to log them if they diverged from expected anyway. That should
>> result in fairly low activity pretty quickly (or we're wasting our time).
>> Should they go to the stats collector rather than logs?
>
> I think you need to log them all. Otherwise when you go to analyze the numbers
> and come up with ideal values you're going to be basing your optimization on a
> skewed subset.

I can see your thinking; I must admit I was thinking of a more iterative process: estimate deltas, change config, check, repeat. I'm not convinced there are "ideal" values with a changing workload - for example, random_page_cost will presumably vary depending on how much contention there is for random seeks. Certainly, effective_cache_size could vary.

> I don't know whether the stats collector or the logs is better suited to this.
>
>>> (Also, currently explain analyze has overhead that makes this impractical.
>>> Ideally it could subtract out its overhead so the solutions would be accurate
>>> enough to be useful)
>>
>> Don't we only need the top-level figures though? There's no need to record
>> timings for each stage, just work completed.
>
> I guess you only need top level values. But you also might want some flag if
> the row counts for any node were far off. In that case perhaps you would want
> to discard the data point.

I think you'd need to adjust the work estimates by actual-rows / estimated-rows.

I _was_ trying to think of a clever way of using row mis-estimates to correct the statistics automatically. This was triggered by the discussion a few weeks ago about hints to the planner and the recent talk about plan caching. Some sort of feedback loop so the planner would know its estimates were off should be a big win from an ease-of-use point of view. Didn't look easy to do, though :-(

-- 
Richard Huxton
Archonet Ltd
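A toy illustration of the "adjust work-estimates by actual-rows / estimated-rows" correction: scale the operation counts a node charged by how far its row estimate was off, before feeding the data point into the fitting step. The node structure and the field names below are invented for the sketch; EXPLAIN does not emit anything like this today.

def corrected_counts(node):
    """Scale a node's charged operation counts by the row mis-estimation
    factor, so a badly estimated node need not be discarded outright."""
    ratio = node["actual_rows"] / max(node["estimated_rows"], 1)
    return {op: n * ratio for op, n in node["op_counts"].items()}

node = {
    "estimated_rows": 1000,
    "actual_rows": 10000,
    "op_counts": {"seq_page": 120, "cpu_tuple": 1000},
}
print(corrected_counts(node))   # each count scaled up by 10x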
Christopher Browne <cbbrowne@acm.org> writes:

> Martha Stewart called it a Good Thing when gsstark@mit.edu (Greg Stark) wrote:
>> It's just a linear algebra problem with a bunch of independent
>> variables and a system of equations. Solving for values for all of
>> them is a straightforward problem.
>
> Are you certain it's a linear system? I'm not.

I'm quite certain it isn't a linear system, because the planner's cost models include nonlinear equations.

While I don't have a whole lot of hard evidence to back this up, my belief is that our worst problems stem not from bad parameter values but from wrong models. In particular we *know* that the cost model for nestloop-inner-indexscan joins is wrong, because it doesn't account for caching effects across repeated scans. There are some other obvious weak spots as well.

It could be argued that we ought to allow the system to assume index caching even for standalone queries, on the grounds that if you are doing a query often enough to care about it, there was probably a recent use of the same query that pulled in the upper index levels. The current cost models all assume starting from ground zero with empty caches for each query, and that is surely not reflective of many real-world cases. I've looked at fixing this a couple of times, but so far my attempts to devise a more believable index access cost estimator have come out with numbers higher than the current estimates ... not the direction we want it to go :-(

Anyway, I see little point in trying to build an automatic parameter optimizer until we have cost models whose parameters are more stable than the current ones.

			regards, tom lane
Greg Stark <gsstark@mit.edu> writes:

> Christopher Browne <cbbrowne@acm.org> writes:
>> Are you certain it's a linear system?
>
> If you just consider the guc parameters that tell postgres how long various
> real world operations take (all the *_cost parameters) then it's a linear
> system. It has to be.

No it doesn't. Think caching effects, for instance. We do have cache effects in the cost models, even though they are wrong, as per my nearby rant...

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Christopher Browne <cbbrowne@acm.org> writes:
>> Martha Stewart called it a Good Thing when gsstark@mit.edu (Greg Stark) wrote:
>>> It's just a linear algebra problem with a bunch of independent
>>> variables and a system of equations. Solving for values for all of
>>> them is a straightforward problem.
>>
>> Are you certain it's a linear system? I'm not.
>
> I'm quite certain it isn't a linear system, because the planner's cost
> models include nonlinear equations.

The equations will all be linear in the *_cost variables. If they weren't, they would be meaningless - the units would be wrong. Things like caching are just going to be among the linear factors that determine how many random page costs and sequential page costs to charge the query.

> While I don't have a whole lot of hard evidence to back this up, my
> belief is that our worst problems stem not from bad parameter values
> but from wrong models.

I think these are orthogonal issues.

The times spent in real-world operations like random page accesses, sequential page accesses, cpu operations, index lookups, etc. are all measurable quantities. They can be directly measured, or approximated by looking at the resulting net times. Measuring these things instead of asking the user to provide them is just a nicer user experience.

Separately, plugging these values into a more and more accurate model will come up with better estimates for how many of these operations a query will need to perform.

> Anyway, I see little point in trying to build an automatic parameter
> optimizer until we have cost models whose parameters are more stable
> than the current ones.

Well, what people currently do is tweak the physical cost values until they produce results for their workload that match reality. It would be neat if Postgres could do this automatically.

Arguably, the more accurate the cost model, the less motivation there is for automatic adjustment, since you could easily plug in accurate values from the hardware specs. But actually I think it'll always be a nice feature.

-- 
greg
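Written out, the model being claimed linear here is the following, where $n_{q,i}$ is the number of operations of kind $i$ (random page fetch, sequential page fetch, index tuple visit, ...) that query $q$ actually performed and $c_i$ is the unknown time per operation; the symbols are notation introduced for this note, not anything taken from the planner source:

\[
  T_q \;\approx\; \sum_i n_{q,i}\, c_i
\]

On this reading, caching and the other nonlinearities only change how the $n_{q,i}$ come out, not how the $c_i$ enter the sum, so fitting the $c_i$ against many observed $(n_{q,\cdot}, T_q)$ pairs remains a linear least-squares problem even when the counts come from a complicated model.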
Greg Stark <gsstark@mit.edu> writes:

> The times spent in real-world operations like random page accesses, sequential
> page accesses, cpu operations, index lookups, etc. are all measurable
> quantities. They can be directly measured or approximated by looking at the
> resulting net times.

That's the theory, all right, and that's why I've been resistant to lowering random_page_cost just because "it gives better answers". To the extent that you believe it is a real physical parameter with a definable value (which is a bit debatable actually, but never mind), it should be possible to measure it by experiment.

The difficulty with the notion of doing that measurement by timing Postgres operations is that it's a horribly bad experimental setup. You have no way to isolate the effects of just one variable, or even a small number of variables, which you really need to do if you want to estimate with any degree of precision. What's more, there are plenty of relevant factors that aren't in the model at all (such as the extent of other load on the machine), and so the noise in the measurements will be enormous.

And you can't just dismiss the issue of wrong cost models and say we can get numbers anyway. We see examples almost every day on this list where the planner is so far off about indexscan vs seqscan costs that you'd have to put random_page_cost far below 1 to make its numbers line up with reality. That's not a matter of tuning the parameter; it's evidence that the cost model is wrong. If you try to solve for the "right value" of the parameter by comparing estimated and actual costs, you'll get garbage, even without any issues about noisy measurements or numerical instability of your system of equations.

			regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Tuesday, March 22, 2005 3:48 PM
> To: Greg Stark
> Cc: Christopher Browne; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] What about utility to calculate planner cost constants?
> [...]
> The difficulty with the notion of doing that measurement by timing
> Postgres operations is that it's a horribly bad experimental setup.
> You have no way to isolate the effects of just one variable, or even
> a small number of variables, which you really need to do if you want
> to estimate with any degree of precision. [...]
>
> And you can't just dismiss the issue of wrong cost models and say we
> can get numbers anyway. [...] If you try to solve for
> the "right value" of the parameter by comparing estimated and actual
> costs, you'll get garbage, even without any issues about noisy
> measurements or numerical instability of your system of equations.

Then instead of building a fixed cost model, why not evolve an adaptive model using an ANN or GA? I can't say that I'm remotely familiar with how the planner does its business, but perhaps we should throw away all these tunable cost parameters and let a neural network create them implicitly, if they really exist in any meaningful form. I suppose the inputs to the network would be the available scan types, the actual and estimated rows, correlations, etc. The outputs would be query plans, is that right?

So we pick several representative data points in the query space and train the network on those, to "bootstrap" it. With any luck, the network will generalize the training inputs and do a halfway decent job on real-world values. If a user is unhappy with the way the network is operating, they can switch on a training mode whereby the network tries some different plans for a given query and uses the execution time to judge which plans worked best. The alternative plans could be suggested by built-in heuristics or perhaps built randomly.

Of course, such training would only be practical for smaller data sets, but perhaps there would be a way to let the network perform a query on a subset of the data and then extrapolate the behavior of a plan over the full data set.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
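As a concrete, deliberately tiny sketch of just the "evolve it" part of this idea: a mutation-only evolutionary search over vectors of cost constants, scored by how badly the resulting runtime predictions miss some invented measurements. The data, the population sizes, and the simple linear predict() are all made-up stand-ins; a network that emits whole query plans, as proposed above, would be a far bigger job than this.

import random

random.seed(42)

# (operation counts per query, observed runtime in ms) -- invented data points.
DATA = [
    ([1000, 101, 2000], 950.0),
    ([10, 5000, 0], 620.0),
    ([400, 250, 900], 480.0),
]

def predict(costs, counts):
    # Stand-in for the planner's cost model: a plain weighted sum.
    return sum(c * n for c, n in zip(costs, counts))

def fitness(costs):
    # Higher is better: negated squared error against the observed runtimes.
    return -sum((predict(costs, n) - t) ** 2 for n, t in DATA)

def mutate(costs):
    # Mutation only (no crossover), to keep the sketch short.
    return [max(0.001, c * random.uniform(0.8, 1.25)) for c in costs]

population = [[random.uniform(0.01, 1.0) for _ in range(3)] for _ in range(40)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                       # keep the best vectors
    population = parents + [mutate(random.choice(parents)) for _ in range(30)]

print("best cost vector found:", [round(c, 4) for c in max(population, key=fitness)])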
> -----Original Message-----
> From: Dave Held
> Sent: Tuesday, March 22, 2005 4:16 PM
> To: Tom Lane
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] What about utility to calculate planner cost constants?
> [...]
> Then instead of building a fixed cost model, why not evolve
> an adaptive model using an ANN or GA?
> [...]

Another advantage to an adaptive planner is that those who can afford to duplicate/replicate their hardware/db could dedicate a backup server to plan optimization, where the db just runs continuously in a learning mode, trying out different plans for a core set of important queries. Those network parameters could then be replicated back to the primary server(s), hopefully improving query planning on the production dbs. And perhaps people could make those parameters public, with the hope that someone with a similar setup could benefit from a pre-learned network.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
Tom Lane wrote:

> And you can't just dismiss the issue of wrong cost models and say we can
> get numbers anyway.

Is there a way to see more details about the cost estimates? EXPLAIN ANALYZE seems to show the total time and rows, but not information like how many disk pages were accessed.

I get the feeling that sometimes the number of rows is estimated very well, but the amount of disk I/O is way off. Sometimes the number of pages read/written is grossly overestimated (if tables have a lot of locally clustered data) or underestimated (if a sort barely exceeds sort_mem). Perhaps an EXPLAIN ANALYZE VERBOSE that would add info like this:

  Index scan ([...]estimated 1000 pages read) (actual[...] 10 pages read)

would help track those down?
dave.held@arrayservicesgrp.com ("Dave Held") writes:

>> -----Original Message-----
>> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> [...]
>> And you can't just dismiss the issue of wrong cost models and say we
>> can get numbers anyway. [...] If you try to solve for
>> the "right value" of the parameter by comparing estimated and actual
>> costs, you'll get garbage, even without any issues about noisy
>> measurements or numerical instability of your system of equations.
>
> Then instead of building a fixed cost model, why not evolve an adaptive
> model using an ANN or GA? I can't say that I'm remotely familiar with
> how the planner does its business, but perhaps we should throw away all
> these tunable cost parameters and let a neural network create them
> implicitly, if they really exist in any meaningful form.
> [...]
> Of course, such training would only be practical for smaller data sets,
> but perhaps there would be a way to let the network perform a query on a
> subset of the data and then extrapolate the behavior of a plan over the
> full data set.

This strikes me as an interesting approach for trying to determine the proper shape of the cost model. I'd also want to consider simulated annealing (SA) (e.g. perhaps Lester Ingber's ASA code...).

We take such a network, perhaps assuming some small-degree polynomial set of parameters, train it based on some reasonably sizable set of queries, and then see which of those parameters wind up being treated as strong and which not. That would provide results that would allow improving the model.

I wouldn't assume that an untuned ANN/GA/SA would provide useful results in general.
It would certainly be pretty cool if we could get this approach into a production query optimizer; we would hope that it would, in effect, tune itself over time. And I suppose that actually is somewhat plausible...

Each query plan that comes through winds up having some "expected cost", and after executing that plan, we have an "actual cost" which could be used as feedback to the effect that the estimate was either pretty right or pretty wrong. We'd presumably start by taking our "traditional cost estimate" as the way to go; when it becomes sufficiently clear that the ANN/GA network is providing a more accurate cost, it would make sense to make the jump...

What would also be very interesting would be to see the degree to which analytical results could be taken out of this. For instance, some cost factors might turn out to be pretty universally true, and we might discover that most of the benefit can come from using a fairly static network of parameters.

-- 
(format nil "~S@~S" "cbbrowne" "cbbrowne.com")
http://www3.sympatico.ca/cbbrowne/or.html
"The present need for security products far exceeds the number of individuals capable of designing secure systems. Consequently, industry has resorted to employing folks and purchasing 'solutions' from vendors that shouldn't be let near a project involving securing a system." -- Lucky Green
On Tue, Mar 22, 2005 at 08:09:40AM -0500, Christopher Browne wrote:

> Martha Stewart called it a Good Thing when gsstark@mit.edu (Greg Stark) wrote:
>> I don't think it would be very hard at all actually.
>> [...]
>
> Are you certain it's a linear system? I'm not. If it was a matter of
> minimizing a linear expression subject to some set of linear
> equations, then we could model this as a Linear Program for which
> there are some perfectly good solvers available. [...]
>
> I think there's good reason to consider it to be distinctly
> NON-linear, which makes it way more challenging to solve the problem.

Non-linear optimization works very well in many cases. Issues such as local minima can be addressed. In a sense, the planner output can be treated as a black-box function, and the "goodness" of a solution is how well it approximates the actual query times. In this case, it will be imperative to constrain some of the values to prevent "crazy" configurations.

Ken
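A sketch of that constrained black-box fit, assuming the same kind of per-query operation counts discussed earlier in the thread: the objective is the squared divergence between predicted and measured runtimes, and simple bounds keep the optimizer away from absurd constant values. The data, the bounds, and the choice of SciPy's L-BFGS-B are illustrative assumptions, not recommendations.

import numpy as np
from scipy.optimize import minimize

# Per-query operation counts and observed runtimes (ms) -- invented data.
counts = np.array([[1000.0,  101.0, 2000.0],
                   [  10.0, 5000.0,    0.0],
                   [ 400.0,  250.0,  900.0],
                   [   0.0, 8000.0,   50.0]])
times = np.array([950.0, 620.0, 480.0, 800.0])

def objective(costs):
    # Squared divergence between predicted and observed runtimes.
    return float(np.sum((counts @ costs - times) ** 2))

# Keep each constant within a plausible absolute range (0.01 ms to 100 ms per
# operation) so the optimizer can't wander into a "crazy" configuration.
bounds = [(0.01, 100.0)] * 3
result = minimize(objective, x0=np.ones(3), method="L-BFGS-B", bounds=bounds)
print("fitted cost constants (ms per operation):", result.x)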