Thread: Tuning planner cost estimates
I've been doing some work to try and identify the actual costs associated with an index scan, with some limited success. What's been run so far can be seen at http://stats.distributed.net/~decibel. But there are a couple of problems. First, I can't use the box exclusively for this testing, which leads to some inconsistent results. Second, I've been using a dataset that I can't make public, which means no one else can run these tests on different hardware.

So what I think would be useful is some way to generate a known dataset, and then be able to run tests against it on different machines. In the case of testing index scans, we need to be able to vary correlation, which so far I've been doing by ordering by different columns. I suspect it will also be important to test with different tuple sizes. There's also the question of whether or not the cache should be flushed for each run.

Does this sound like a good way to determine actual costs for index scans (and hopefully other access methods in the future)? If so, what would be a good way to implement this?
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
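Something along these lines could serve as a starting point for the "known dataset" idea -- a minimal sketch, with all table and column names (test_source, test_corr, test_uncorr, a, b) invented for illustration, and pg_stats.correlation used only as a sanity check:

    -- Generate a source table whose physical order can be varied later.
    CREATE TABLE test_source AS
    SELECT s                         AS a,      -- correlated with insertion order
           (random() * 1000000)::int AS b,      -- essentially uncorrelated
           repeat('x', 100)          AS filler  -- padding; adjust to test different tuple sizes
    FROM generate_series(1, 1000000) AS s;

    -- High-correlation variant: physically ordered by the indexed column.
    CREATE TABLE test_corr AS SELECT * FROM test_source ORDER BY a;
    CREATE INDEX test_corr_a ON test_corr (a);

    -- Low-correlation variant: physically ordered by an unrelated column.
    CREATE TABLE test_uncorr AS SELECT * FROM test_source ORDER BY b;
    CREATE INDEX test_uncorr_a ON test_uncorr (a);

    ANALYZE test_corr;
    ANALYZE test_uncorr;

    -- Sanity check: the planner's view of the resulting correlation.
    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename IN ('test_corr', 'test_uncorr') AND attname = 'a';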
Jim,

> I've been doing some work to try and identify the actual costs
> associated with an index scan, with some limited success. What's been
> run so far can be seen at http://stats.distributed.net/~decibel. But
> there are a couple of problems. First, I can't use the box exclusively
> for this testing, which leads to some inconsistent results.

I can get you access to boxes. Chat on IRC?

> Second, I've been using a dataset that I can't make public, which means
> no one else can run these tests on different hardware.

Then use one of the DBT databases.

> In the case of testing index scans, we need to be able to vary
> correlation, which so far I've been doing by ordering by different
> columns. I suspect it will also be important to test with different
> tuple sizes. There's also the question of whether or not the cache
> should be flushed for each run.
>
> Does this sound like a good way to determine actual costs for index
> scans (and hopefully other access methods in the future)? If so, what
> would be a good way to implement this?

Well, the problem is that what we need for index scans is a formula, rather than a graph. The usefulness of benchmarking index scan cost is so that we can test our formula for accuracy and precision. However, such a formula *does* need to take into account concurrent activity, updates, etc. ... that is, it needs to approximately estimate the relative cost on a live database, not a test one.

This is also going to be a moving target, because Tom's in-memory bitmapping changes the relative cost equations.

I think a first step would be, in fact, to develop a tool that allows us to put EXPLAIN ANALYZE results in a database table. Without that, there is no possibility of statistical-scale analysis.

--
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> I think a first step would be, in fact, to develop a tool that allows
> us to put EXPLAIN ANALYZE results in a database table. Without that,
> there is no possibility of statistical-scale analysis.

AFAIK you can do that today using, eg, plpgsql:

	for rec in explain analyze ... loop
	  insert into table values(rec."QUERY PLAN");
	end loop;

			regards, tom lane
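A self-contained version of that idea might look like the sketch below; the table and function names (explain_output, capture_explain) are invented for illustration, and the query text is supplied by the caller:

    -- Capture the rows of EXPLAIN ANALYZE output into a plain text table
    -- for later analysis.
    CREATE TABLE explain_output (plan_line text);

    CREATE OR REPLACE FUNCTION capture_explain(query text) RETURNS void AS $$
    DECLARE
        rec record;
    BEGIN
        FOR rec IN EXECUTE 'EXPLAIN ANALYZE ' || query LOOP
            INSERT INTO explain_output VALUES (rec."QUERY PLAN");
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    -- Usage:
    -- SELECT capture_explain('SELECT * FROM some_table WHERE id < 1000');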
Tom,

> 	for rec in explain analyze ... loop
> 	  insert into table values(rec."QUERY PLAN");
> 	end loop;

I need to go further than that and parse the results as well, and preserve relationships and nesting levels. Hmmmm ... what's the indenting formula for nesting levels?

--
Josh Berkus
Aglio Database Solutions
San Francisco
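One possible answer, assuming the text-format convention that each additional nesting level shifts the "->" marker six columns to the right (observed output rather than a documented rule), applied to the hypothetical explain_output table from the sketch above:

    -- Rough sketch: derive nesting level from the column where '->' appears.
    -- Lines without '->' (the top node, and detail lines such as "Index Cond:")
    -- would need separate handling; they come out as NULL here.
    SELECT plan_line,
           CASE WHEN position('->' IN plan_line) > 0
                THEN position('->' IN plan_line) / 6 + 1
           END AS nesting_level
    FROM explain_output;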
On Thu, May 19, 2005 at 09:31:47AM -0700, Josh Berkus wrote:
> > In the case of testing index scans, we need to be able to vary
> > correlation, which so far I've been doing by ordering by different
> > columns. I suspect it will also be important to test with different
> > tuple sizes. There's also the question of whether or not the cache
> > should be flushed for each run.
> >
> > Does this sound like a good way to determine actual costs for index
> > scans (and hopefully other access methods in the future)? If so, what
> > would be a good way to implement this?
>
> Well, the problem is that what we need for index scans is a formula,
> rather than a graph. The usefulness of benchmarking index scan cost is
> so that we

True, but having a graphical representation of how different input variables (such as correlation) affect runtime is a good way to derive such a formula, or at least point you in the right direction.

> can test our formula for accuracy and precision. However, such a
> formula *does* need to take into account concurrent activity, updates,
> etc. ... that is, it needs to approximately estimate the relative cost
> on a live database, not a test one.

Well, that raises an interesting issue, because AFAIK none of the cost estimate functions currently do that. Heck, AFAIK even the piggyback seqscan code doesn't take other seqscans into account.

Another issue is: what state should the buffers/disk cache be in? In the thread that kicked all this off, Tom noted that my results were skewed because of caching, so I changed my tests to flush the disk cache as effectively as I could (by running a program that would consume enough available memory to just start the box swapping), but I don't think that's necessarily realistic. Though at least it should preclude the need to run tests multiple times on an otherwise idle box in order to 'pre-seed' the cache (not that that's any more realistic). If you don't use one of these techniques, you end up with results that depend on what test was run before the current one.

> This is also going to be a moving target, because Tom's in-memory
> bitmapping changes the relative cost equations.

I thought those all had separate costing functions...? In any case, if we have a cost estimation tool it will make it much easier to derive cost estimation functions.

> I think a first step would be, in fact, to develop a tool that allows
> us to put EXPLAIN ANALYZE results in a database table. Without that,
> there is no possibility of statistical-scale analysis.

Rather than trying to parse all possible output, ISTM it would be much better if there were a way to access the info directly. Would it be difficult to have an option that produces output as a set of distinct fields? I'm thinking something like:

Level (basically how far something's indented)
Parent node (what node a child node is feeding)
node_id (some kind of identifier for each step)
operation
(estimate|actual)_(startup|total|rows|width|loops)
other (something to hold index condition, filter, etc.)

But ultimately, I'm not sure if this is really required or not, because I don't see that we need to use EXPLAIN when running queries. In fact, it's possibly desirable that we don't, because of the overhead it incurs. We would want to log an EXPLAIN (maybe ANALYZE) just to make sure we knew what the optimizer was doing, but I think we shouldn't need that info to produce cost estimates.

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
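A hypothetical table layout for the fields listed above might look like the following; the column names and types are purely illustrative and do not correspond to any existing PostgreSQL output option:

    CREATE TABLE plan_node (
        node_id      serial PRIMARY KEY,
        parent_node  integer REFERENCES plan_node (node_id),
        level        integer,   -- nesting depth (how far the node is indented)
        operation    text,      -- e.g. 'Index Scan', 'Seq Scan'
        est_startup  numeric,   -- estimated startup cost
        est_total    numeric,   -- estimated total cost
        est_rows     numeric,
        est_width    integer,
        act_startup  numeric,   -- actual startup time (ms)
        act_total    numeric,   -- actual total time (ms)
        act_rows     numeric,
        act_loops    integer,
        other        text       -- index condition, filter, etc.
    );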
"Jim C. Nasby" <decibel@decibel.org> writes: > On Thu, May 19, 2005 at 09:31:47AM -0700, Josh Berkus wrote: >> can test our formula for accuracy and precision. However, such a formula >> *does* need to take into account concurrent activity, updates, etc ... that >> is, it needs to approximately estimate the relative cost on a live database, >> not a test one. > Well, that raises an interesting issue, because AFAIK none of the cost > estimate functions currently do that. I'm unconvinced that it'd be a good idea, either. People already complain that the planner's choices change when they ANALYZE; if the current load factor or something like that were to be taken into account then you'd *really* have a problem with irreproducible behavior. It might make sense to have something a bit more static, perhaps a GUC variable that says "plan on the assumption that there's X amount of concurrent activity". I'm not sure what scale to measure X on, nor exactly how this would factor into the estimates anyway --- but at least this approach would maintain reproducibility of behavior. > Another issue is: what state should the buffers/disk cache be in? The current cost models are all based on the assumption that every query starts from ground zero: nothing in cache. Which is pretty bogus in most real-world scenarios. We need to think about ways to tune that assumption, too. Maybe this is actually the same discussion, because certainly one of the main impacts of a concurrent environment is on what you can expect to find in cache. regards, tom lane
Jim,

> Well, that raises an interesting issue, because AFAIK none of the cost
> estimate functions currently do that. Heck, AFAIK even the piggyback
> seqscan code doesn't take other seqscans into account.

Sure. But you're striving for greater accuracy, no?

Actually, all that's really needed in the way of concurrent activity is a calculated factor that lets us know how likely a particular object is to be cached, either in the fs cache or the pg cache (with different factors for each, presumably) based on history. Right now, that's based on effective_cache_size, which is rather inaccurate: a table which is queried once a month has the exact same cost factors as one which is queried every 2.1 seconds. This would mean an extra column in pg_stats, I suppose.

> But ultimately, I'm not sure if this is really required or not, because
> I don't see that we need to use EXPLAIN when running queries. In fact,
> it's possibly desirable that we don't, because of the overhead it
> incurs. We would want to log an EXPLAIN (maybe ANALYZE) just to make
> sure we knew what the optimizer was doing, but I think we shouldn't
> need that info to produce cost estimates.

Well, the problem is that you need to know how much time the index scan took vs. other query steps. I don't see a way to do this other than an ANALYZE.

--
__Aglio Database Solutions_______________
Josh Berkus                     Consultant
josh@agliodbs.com          www.agliodbs.com
Ph: 415-752-2500           Fax: 415-752-2387
2166 Hayes Suite 200       San Francisco, CA
On Fri, May 20, 2005 at 04:47:38PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> > Well, that raises an interesting issue, because AFAIK none of the cost
> > estimate functions currently do that.
>
> I'm unconvinced that it'd be a good idea, either. People already
> complain that the planner's choices change when they ANALYZE; if the
> current load factor or something like that were to be taken into
> account, then you'd *really* have a problem with irreproducible
> behavior.
>
> It might make sense to have something a bit more static, perhaps a GUC
> variable that says "plan on the assumption that there's X amount of
> concurrent activity". I'm not sure what scale to measure X on, nor
> exactly how this would factor into the estimates anyway --- but at
> least this approach would maintain reproducibility of behavior.

Or allowing the load of the machine to affect query plans dynamically is something that could be disabled by default, so presumably if you turn it on it means you know what you're doing. Of course, this is all academic until we have a means to actually measure how much system load affects the different things we estimate cost for, and I don't see that happening until we have a system for measuring how changing different input variables affects costs.

> > Another issue is: what state should the buffers/disk cache be in?
>
> The current cost models are all based on the assumption that every
> query starts from ground zero: nothing in cache. Which is pretty bogus
> in most real-world scenarios. We need to think about ways to tune that
> assumption, too. Maybe this is actually the same discussion, because
> certainly one of the main impacts of a concurrent environment is on
> what you can expect to find in cache.

Well, load doesn't directly affect cache efficiency; it's really a question of the ratios of how often different things in the database are accessed. If you wanted to get a crude idea of how likely pages from some relation are to be in cache, you could take periodic snapshots of I/O stats and see what percentage of the I/O done in a given time period was on the relation you're interested in, as compared to the rest of the database. But I think this is probably still a second-order effect.

In terms of a testing system, here's what I'm thinking of. For each cost estimate, there will be a number of input variables we want to vary, and then check to see how changes in them affect run time. Using index scan as a simple example, first-order variables will likely be index and table size (especially in relation to cache size), and correlation. So we need some kind of test harness that can vary these variables (preferably one at a time) and run a test case after each change. It would then need to store the timing info, possibly along with other information (such as EXPLAIN output). If I'm the one to write this it'll end up in Perl, since that's the only language I know that would be able to accomplish this.

DBT seems to be a reasonable test database to do this testing with, especially since it would provide a ready means to provide external load.

Does this sound like a reasonable approach?

Also, how important do people think it is to use EXPLAIN ANALYZE output instead of just doing SELECT count(*) FROM (query you actually want to test)? (The SELECT count(*) wrapper is just a means to throw away the results, since we don't really want to worry about data transfer times, etc.) The testing I've done (http://stats.distributed.net/~decibel/base.log) shows EXPLAIN ANALYZE to be almost 5x slower than SELECT count(*), so it seems a big gain if we can avoid that.

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
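A rough plpgsql sketch of such a harness: it varies correlation by rebuilding the test table ordered on different columns, times the query through the SELECT count(*) wrapper, and records a plain EXPLAIN alongside the timing. Everything here is an assumption -- the table and column names (test_source and test_data from the earlier sketch, column a), the test query, and the use of clock_timestamp() and format(), which come from PostgreSQL releases later than the one under discussion:

    -- Hypothetical results table.
    CREATE TABLE index_scan_results (
        run_id     serial PRIMARY KEY,
        order_col  text,                 -- column used to vary correlation
        run_at     timestamptz DEFAULT now(),
        elapsed_ms numeric,
        plan_text  text
    );

    CREATE OR REPLACE FUNCTION run_index_scan_test(order_col text) RETURNS void AS $$
    DECLARE
        test_query text := 'SELECT count(*) FROM (SELECT * FROM test_data WHERE a BETWEEN 1 AND 100000) q';
        plan       text := '';
        line       record;
        dummy      bigint;
        t0         timestamptz;
        t1         timestamptz;
    BEGIN
        -- Rebuild the test table ordered by the chosen column to vary correlation.
        EXECUTE 'DROP TABLE IF EXISTS test_data';
        EXECUTE format('CREATE TABLE test_data AS SELECT * FROM test_source ORDER BY %I', order_col);
        EXECUTE 'CREATE INDEX test_data_a ON test_data (a)';
        EXECUTE 'ANALYZE test_data';

        -- Record the plan with plain EXPLAIN so its overhead stays out of the timing.
        FOR line IN EXECUTE 'EXPLAIN ' || test_query LOOP
            plan := plan || line."QUERY PLAN" || E'\n';
        END LOOP;

        -- Time the query itself, discarding the result via the count(*) wrapper.
        t0 := clock_timestamp();
        EXECUTE test_query INTO dummy;
        t1 := clock_timestamp();

        INSERT INTO index_scan_results (order_col, elapsed_ms, plan_text)
        VALUES (order_col, extract(epoch FROM (t1 - t0)) * 1000, plan);
    END;
    $$ LANGUAGE plpgsql;

    -- Usage: one run per ordering column, e.g.
    -- SELECT run_index_scan_test('a');   -- high correlation
    -- SELECT run_index_scan_test('b');   -- low correlation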
On Fri, May 20, 2005 at 03:23:16PM -0700, Josh Berkus wrote:
> Jim,
>
> > Well, that raises an interesting issue, because AFAIK none of the cost
> > estimate functions currently do that. Heck, AFAIK even the piggyback
> > seqscan code doesn't take other seqscans into account.
>
> Sure. But you're striving for greater accuracy, no?
>
> Actually, all that's really needed in the way of concurrent activity is
> a calculated factor that lets us know how likely a particular object is
> to be cached, either in the fs cache or the pg cache (with different
> factors for each, presumably) based on history. Right now, that's based
> on effective_cache_size, which is rather inaccurate: a table which is
> queried once a month has the exact same cost factors as one which is
> queried every 2.1 seconds. This would mean an extra column in pg_stats,
> I suppose.

True, though that's a somewhat different issue than what the load on the box is (see the reply I just posted). Load on the box (particularly I/O load) will also play a factor in things; for example, it probably means seqscans end up costing a lot more than random I/O does, because the disk heads are being sent all over the place anyway.

> > But ultimately, I'm not sure if this is really required or not,
> > because I don't see that we need to use EXPLAIN when running queries.
> > In fact, it's possibly desirable that we don't, because of the
> > overhead it incurs. We would want to log an EXPLAIN (maybe ANALYZE)
> > just to make sure we knew what the optimizer was doing, but I think we
> > shouldn't need that info to produce cost estimates.
>
> Well, the problem is that you need to know how much time the index scan
> took vs. other query steps. I don't see a way to do this other than an
> ANALYZE.

True, but that can be done by a separate seqscan step. I would argue that doing it that way is actually more accurate, because the overhead of EXPLAIN ANALYZE is huge and tends to swamp other factors out. As I mentioned in my other email, my tests show EXPLAIN ANALYZE SELECT * FROM table is 5x slower than SELECT count(*) FROM table.

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <decibel@decibel.org> writes: > Does this sound like a reasonable approach? Also, how important do > people think it is to use explain analyze output instead of just doing > SELECT count(*) FROM (query you actually want to test)? (The select > count(*) wrapper is just a means to throw away the results since we > don't really want to worry about data transfer times, etc). The testing > I've done (http://stats.distributed.net/~decibel/base.log) shows explain > analyze to be almost 5x slower than select count(*), so it seems a big > gain if we can avoid that. I'd go with the select count(*) --- I can't imagine that we will be trying to model the behavior of anything so complex that we really need explain analyze output. (On the other hand, recording explain output is a good idea to make sure you are testing what you think you are testing.) Actually, it might be worth using "select count(null)", which should avoid the calls to int8inc. I think this doesn't matter so much in CVS tip, but certainly in existing releases the palloc overhead involved is noticeable. BTW, 5x is an awful lot; I've not noticed overheads more than about 2x. regards, tom lane
On Fri, 2005-05-20 at 15:23 -0700, Josh Berkus wrote:
> > Well, that raises an interesting issue, because AFAIK none of the cost
> > estimate functions currently do that. Heck, AFAIK even the piggyback
> > seqscan code doesn't take other seqscans into account.
>
> Sure. But you're striving for greater accuracy, no?
>
> Actually, all that's really needed in the way of concurrent activity is
> a calculated factor that lets us know how likely a particular object is
> to be cached, either in the fs cache or the pg cache (with different
> factors for each, presumably) based on history. Right now, that's based
> on effective_cache_size, which is rather inaccurate: a table which is
> queried once a month has the exact same cost factors as one which is
> queried every 2.1 seconds. This would mean an extra column in pg_stats,
> I suppose.

Hmmm... not sure that would be a good thing.

effective_cache_size isn't supposed to be set according to how much of a table is in cache when the query starts. The setting is supposed to reflect how much cache is *available* for the current index scan, when performing an index scan on a table that is not in clustered sequence. The more out of sequence the table is, the more memory is required to avoid doing any repeated I/Os during the scan.

Of course, if there are many users, the available cache may be much reduced.

Best regards,
Simon Riggs