Thread: More thoughts about planner's cost estimates
I've been thinking about the planner's costing problems again, particularly in connection with Philippe Lang's complaint here:
http://archives.postgresql.org/pgsql-general/2006-05/msg01526.php

Investigation showed that the planner is much too optimistic about the usefulness of a multi-index BitmapAnd plan, and since that comparison is just a cost-estimate comparison, it implies we are underestimating the cost of an index scan. A typical example from his results is

   ->  BitmapAnd  (cost=12.30..12.30 rows=1 width=0) (actual time=0.306..0.306 rows=0 loops=13628)
         ->  Bitmap Index Scan on lw_id_workflow  (cost=0.00..2.02 rows=7 width=0) (actual time=0.009..0.009 rows=7 loops=13628)
               Index Cond: (lw.id_workflow = "outer".id)
         ->  Bitmap Index Scan on lw_ordre  (cost=0.00..10.03 rows=1437 width=0) (actual time=0.293..0.293 rows=1714 loops=13628)
               Index Cond: (ordre = $4)

There are two variables involved here: the cost of touching an index page and the cost of processing an index tuple. Given the two comparable data points above, we can solve for those numbers, and it turns out that the real cost ratio on Philippe's machine is about 50 to 1. Which says that for him, cpu_index_tuple_cost plus cpu_operator_cost should be around 0.02, nearly an order of magnitude higher than their current default values (0.001 and 0.0025 respectively).

In general it seems to me that for CPU-bound databases, the default values of the cpu_xxx_cost variables are too low. I am tempted to raise the default value of cpu_index_tuple_cost to 0.005, which would be half of cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with my feel for the relative costs of things. For a CPU-bound database all three would need to go up more than that. But rather than telling people to manipulate all three of these variables individually, I think it might also be a good idea to provide a new GUC variable named something like "cpu_speed_scale" that would just be a multiplier for the other variables. It would default to 1.0 and our standard advice for CPU-bound databases would be "decrease random_page_cost to 1.0 and raise cpu_speed_scale to 10.0 or so". Seems cleaner than telling people to muck with three or so individual values.

Another thing that's bothering me is that the index access cost computation (in genericcostestimate) is looking sillier and sillier:

    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index is
     * a metadata page, and we should figure on fetching that plus a pro-rated
     * fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;        /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to btrees. It's not counting descent of the btree (ie, touches of the root page and intermediate-level pages). On the other hand it's surely overcharging for metapage touches. As of CVS tip we cache the metapage in the relcache, so it's probably fair to disregard that cost altogether. And on the third hand, when we need to retrieve multiple leaf pages it's over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in one that's suffered any amount of page splitting or recycling.)

I've desisted from touching this mainly because the obvious sanity adjustments, such as adding something for tree descent and charging random_page_cost not 1.0 for leaf page touches, would increase the estimated cost of index usage, which seemed like not the direction we wanted to go in. So I figured that the errors were more or less cancelling each other. But the addition of bitmap index scans is now exposing the weaknesses, so we need to face up to the problem.

I am inclined to think that a reasonable model is to continue to estimate the number of index leaf pages touched as above (pro-rating against the total number of index entries), to charge random_page_cost per leaf page touched, and not to count anything for metapage or tree descent. I justify the latter on the grounds that upper tree levels will tend to stay in cache because they're touched so often. random_page_cost per leaf page may be an overestimate, since sometimes logically adjacent leaf pages will be physically adjacent too, but not charging for tree descent should help to cancel that out. With this model, the disk cost to fetch a single index entry will be estimated as random_page_cost (default 4.0) rather than the current fixed 2.0. This shouldn't hurt things too much for simple indexscans --- especially since anyone running with a reduced random_page_cost won't see as much change. And it will increase the costs for bitmap scans that cross many index pages, which is what we need in light of Philippe's numbers.

Now we have seen a lot of cases in which indexscans were being drastically overestimated, so increasing the cost estimate even more may seem like a horrid idea. But I believe that most or all of these cases were ones where the same index was being visited many times, and so the real estimation failure is not accounting for cache effects across multiple indexscans. Rather than being afraid to raise the cost estimate for an indexscan in isolation, I think it's time to bite the bullet and do something about that.

The big difficulty in modeling cache effects from multiple scans is that we usually don't know how many times the index will be scanned. If we did have that number, I think it'd be reasonable to use the Mackert and Lohman approximation already used in cost_index to estimate the total number of index leaf pages fetched over N scans, and then divide by N to get our per-scan estimated cost. But because the planner builds cost estimates bottom-up, in most scenarios we don't have any idea whether an indexscan plan node will be iterated once or many times in the finished plan. However there is one case where we do have some information: when computing a best_inner_indexscan() for a nestloop join, it'd be reasonable to use the estimated size of the outer relation as the loop count. And this is exactly the case where we are falling down in practice: the complaints we've seen are mostly about overestimating the cost of a nestloop-with-inner-index-scan. So I think handling just this case would be a big step forward, even if it's not a theoretically pure general solution.

Thoughts, gripes, better ideas?

			regards, tom lane
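A minimal sketch of the disk-cost shape proposed above, assuming the pro-rating of leaf pages stays as in genericcostestimate and that the loop count is known only in the nestloop-inner-indexscan case. The names here are illustrative rather than actual PostgreSQL source, and the multi-scan branch simply caps at the whole index as a crude stand-in for the Mackert/Lohman formula:

    #include <math.h>

    /*
     * Illustrative only: pro-rate leaf pages touched, charge
     * random_page_cost per leaf page, ignore the metapage and tree
     * descent, and amortize over loop_count scans when that is known.
     */
    double
    proposed_index_disk_cost(double numIndexTuples, double indexTuples,
                             double indexPages, double random_page_cost,
                             double loop_count)
    {
        double      leafPages;
        double      totalPages;

        /* leaf pages touched per scan, pro-rated against total entries */
        if (indexPages > 1 && indexTuples > 0)
            leafPages = ceil((numIndexTuples / indexTuples) * (indexPages - 1));
        else
            leafPages = 1.0;

        /* single scan: every leaf page is treated as a random fetch */
        if (loop_count <= 1)
            return leafPages * random_page_cost;

        /*
         * Repeated scans (the nestloop inner-indexscan case): estimate the
         * total leaf pages fetched across all scans --- a Mackert/Lohman-style
         * formula belongs here; capping at the whole index is only a
         * placeholder --- then divide by the loop count for a per-scan cost.
         */
        totalPages = fmin(leafPages * loop_count, indexPages);
        return (totalPages * random_page_cost) / loop_count;
    }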
Tom,

As you know, this is something I think about a bit too, though not nearly as deeply as you.

> In general it seems to me that for CPU-bound databases, the default values
> of the cpu_xxx_cost variables are too low. I am tempted to raise the
> default value of cpu_index_tuple_cost to 0.005, which would be half of
> cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with
> my feel for the relative costs of things. For a CPU-bound database all
> three would need to go up more than that. But rather than telling people
> to manipulate all three of these variables individually, I think it might
> also be a good idea to provide a new GUC variable named something like
> "cpu_speed_scale" that would just be a multiplier for the other variables.

Yes, please! I've done a bit of playing around with the cpu_* variables, and frankly have always manipulated them by the same multiple. Also, my testing has shown that on *large* databases (20x RAM +) with fast processors (Opteron) the cpu_* values are too high. But a lot of that is offsetting estimates which are too high elsewhere ... as is random_page_cost.

> I've desisted from touching this mainly because the obvious sanity
> adjustments, such as adding something for tree descent and charging
> random_page_cost not 1.0 for leaf page touches, would increase the
> estimated cost of index usage, which seemed like not the direction we
> wanted to go in. So I figured that the errors were more or less
> cancelling each other. But the addition of bitmap index scans is now
> exposing the weaknesses, so we need to face up to the problem.

Yeah. I've refrained from proposing changes because it's a pick-up-sticks. If we start modifying the model, we need to fix *everything*, not just one item. And then educate our users that they need to use the GUC variables in a different way. Here are the issues I see:

1. n-distinct estimation is bad, as previously discussed;

2. our current heuristic sampling methods prevent us from sampling more than 0.5% of any reasonably large table, causing all statistics on those tables to be off for any table with irregular distribution of values;

3. We don't have any method to analyze inter-column correlation within a table;

4. We don't keep statistics on foreign key correlation;

5. random_page_cost (as previously discussed) is actually a function of relatively immutable hardware statistics, and as such should not need to exist as a GUC once the cost model is fixed.

6. We haven't added any way to estimate rows returned from SRFs.

> I am inclined to think that a reasonable model is to continue to estimate
> the number of index leaf pages touched as above (pro-rating against the
> total number of index entries), to charge random_page_cost per leaf page
> touched, and not to count anything for metapage or tree descent. I
> justify the latter on the grounds that upper tree levels will tend to stay
> in cache because they're touched so often. random_page_cost per leaf page
> may be an overestimate, since sometimes logically adjacent leaf pages will
> be physically adjacent too, but not charging for tree descent should help
> to cancel that out. With this model, the disk cost to fetch a single
> index entry will be estimated as random_page_cost (default 4.0) rather
> than the current fixed 2.0. This shouldn't hurt things too much for
> simple indexscans --- especially since anyone running with a reduced
> random_page_cost won't see as much change.
> And it will increase the costs
> for bitmap scans that cross many index pages, which is what we need in
> light of Philippe's numbers.

Per above, I think anything involving random_page_cost is the wrong way to go. RPC is a GUC assigned by the DBA on pure guesswork. We should be decreasing its use or eliminating it entirely, not increasing the amount we use it.

For btrees, we should be able to accurately estimate the cost of the index access given:

a) the depth of the tree;

b) the likelihood that requested values are adjacent;

c) whether the index and tables are already in shared_mem, or else (d) and (e) below:

d) the probability that the index pages are cached in memory, which is composed of:
   (1) the frequency with which that index is accessed, modified by
   (2) whether the index is a fraction of available RAM, or larger than RAM

e) the probability that the relevant table pages are cached in memory, determined by the same two factors.

*None* of this should involve a userset parameter (random_page_cost) since as you pointed out months ago, the ratio of seq/seek access has remained relatively constant over the past 6 years of HDD and I/O engineering.

Sadly, I don't have the math for all of the above but I'm hoping someone (either here or on ACM) does.

Also, this would require us to have different *formulas* and not merely different cost factors for different kinds of indexes. I believe that this is something that Jie is already struggling with. Jie?

> The big difficulty in modeling cache effects from multiple scans is that
> we usually don't know how many times the index will be scanned. If we did
> have that number, I think it'd be reasonable to use the Mackert and Lohman
> approximation already used in cost_index to estimate the total number of
> index leaf pages fetched over N scans, and then divide by N to get our
> per-scan estimated cost. But because the planner builds cost estimates
> bottom-up, in most scenarios we don't have any idea whether an indexscan
> plan node will be iterated once or many times in the finished plan.

I think we can work around this by figuring out whether the index has already been scanned in the plan, something we *can* know. So the first scan will be full cost and remaining scans will be fractional cost. Possibly not as good as Mackert/Lohman, but it allows us reasonably accurate *overall* estimates without needing to move away from bottom-up plan building.

> Thoughts, gripes, better ideas?

Yes. I think if we're going to fix the cost model, we should overhaul it and target 8.3. Realistically, once we start tinkering with the cost model most queries are going to get worse before they get better.

--Josh
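One purely hypothetical way to fold the factors (d) and (e) listed above into a page cost, assuming we could estimate a cache-hit probability at all (which, as Tom notes downthread, we have no reasonable way to do today). Every name and constant here is invented for illustration:

    #include <math.h>

    /*
     * Hypothetical: weight the cost of touching a page by an estimated
     * probability that it is already cached.  Nothing like this exists in
     * the planner; obtaining the inputs is the hard part.
     */
    double
    cache_weighted_page_cost(double accesses_per_interval, /* factor (d)(1) */
                             double object_size_pages,     /* factor (d)(2) */
                             double cache_size_pages,
                             double random_page_cost)
    {
        /* crude guess: the smaller the object relative to available cache,
         * and the more often it is hit, the likelier its pages are resident */
        double      fits = fmin(1.0, cache_size_pages / object_size_pages);
        double      p_cached = fits * (1.0 - exp(-accesses_per_interval));

        /* a cached page is nearly free; an uncached one is a random fetch */
        return p_cached * 0.01 + (1.0 - p_cached) * random_page_cost;
    }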
Josh Berkus <josh@agliodbs.com> writes:

> 1. n-distinct estimation is bad, as previously discussed;
>
> 2. our current heuristic sampling methods prevent us from sampling more than
> 0.5% of any reasonably large table, causing all statistics on those tables to
> be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For distributions of values in a range, using small samples is on a solid statistical basis. It's the same reason poll results based on only a few hundred people can accurately predict elections in a country with hundreds of millions of voters.

However for things like estimating discrete values there's absolutely no solid foundation for these small samples. Your accuracy is going to be simply proportional to the sample size, meaning you can't get anything trustworthy without sampling large enough portions of the table that a full sequential scan would be just as fast.

I might be interested in implementing that algorithm that was posted a while back that involved generating good unbiased samples of discrete values. The algorithm was quite clever and well described, and the paper claimed impressively good results.

However it will only make sense if people are willing to accept that analyze will need a full table scan -- at least for tables where the DBA knows that good n_distinct estimates are necessary.

> 3. We don't have any method to analyze inter-column correlation within a table;
>
> 4. We don't keep statistics on foreign key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even made any headway in brainstorming how to tackle them?

> 5. random_page_cost (as previously discussed) is actually a function of
> relatively immutable hardware statistics, and as such should not need to exist
> as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if the cache effects are modelled properly and the only excuses left are legitimate hardware differences.

--
greg
Greg,

> I'm convinced these two are more connected than you believe.

Actually, I think they are inseparable.

> I might be interested in implementing that algorithm that was posted a
> while back that involved generating good unbiased samples of discrete
> values. The algorithm was quite clever and well described, and the paper
> claimed impressively good results.
>
> However it will only make sense if people are willing to accept that
> analyze will need a full table scan -- at least for tables where the DBA
> knows that good n_distinct estimates are necessary.

What about block-based sampling? Sampling 1 in 20 disk pages, rather than 1 in 20 rows, should require significantly less scanning, and yet give us a large enough sample for reasonable accuracy.

> > 3. We don't have any method to analyze inter-column correlation within
> > a table;
> >
> > 4. We don't keep statistics on foreign key correlation;
>
> Gosh these would be nice but they sound like hard problems. Has anybody
> even made any headway in brainstorming how to tackle them?

There's no time like the present! Actually, these both seem like fairly straightforward problems storage-wise. The issue is deriving the statistics, for tables with many columns or FKs.

> > 5. random_page_cost (as previously discussed) is actually a function
> > of relatively immutable hardware statistics, and as such should not
> > need to exist as a GUC once the cost model is fixed.
>
> I don't think that's true at all. Not all hardware is the same.
>
> Certainly the need to twiddle this GUC should be drastically reduced if
> the cache effects are modelled properly and the only excuses left are
> legitimate hardware differences.

OK ... but still, it should become a "little knob" rather than the "big knob" it is currently.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes:

> Yeah. I've refrained from proposing changes because it's a
> pick-up-sticks. If we start modifying the model, we need to fix
> *everything*, not just one item. And then educate our users that they
> need to use the GUC variables in a different way. Here are the issues I see:
> [ various deficiencies in our statistics gathering ]

Those are all valid points, but pretty much orthogonal to what I'm on about today. The problems I'm thinking about are seen even when the planner's rowcount estimates are dead on (as indeed they mostly were in Philippe's example).

> 5. random_page_cost (as previously discussed) is actually a function of
> relatively immutable hardware statistics, and as such should not need to
> exist as a GUC once the cost model is fixed.

Well, it'll still exist as a GUC for the same reasons the other ones are GUCs. But I think the main reasons people have needed to twiddle it are (1) their database is completely RAM-resident (where RPC *should* logically be 1.0), or (2) they're trying to compensate for the overestimation of nestloop indexscans. The changes I'm proposing should help with (2).

> For btrees, we should be able to accurately estimate the cost of the
> index access given:
> a) the depth of the tree;

Although logically the tree descent *ought* to be a factor, I haven't seen any evidence that we need to take it into account; in fact all the evidence points to the conclusion that we're better off ignoring it. My theory about why is that the upper levels of the tree stay in cache. I have to admit that's just handwaving, but counting additional disk hits to fetch the first index tuple is clearly not the direction the cost model needs to go in. Look again at the examples in Philippe's output: an indexscan fetching 1700 tuples (about 5 leaf pages probably) took 32x longer than one fetching 7 tuples (presumably all on the same leaf page). There's no way that a model that counts an additional couple of page fetches for tree descent is going to arrive at those numbers. And we see this over and over again: incredibly small times to fetch a single index tuple, at least on the average when the index is being hit many times in one query.

> c) whether the index and tables are already in shared_mem, or else (d)
> and (e) below:
> d) the probability that the index pages are cached in memory, which is
> composed of:
> (1) the frequency with which that index is accessed, modified by
> (2) whether the index is a fraction of available RAM, or larger than RAM
> e) the probability that the relevant table pages are cached in memory,
> determined by the same two factors.

These would all be nice things to know, but I'm afraid it's pie in the sky. We have no reasonable way to get those numbers. (And if we could get them, there would be another set of problems, namely plan instability: the planner's choices would become very difficult to reproduce.)

>> The big difficulty in modeling cache effects from multiple scans is that
>> we usually don't know how many times the index will be scanned.

> I think we can work around this by figuring out whether the index has
> already been scanned in the plan, something we *can* know. So the first
> scan will be full cost and remaining scans will be fractional cost.

Uh, no, because what we're normally thinking about is independent probes into the index with different keys. For a small number of probes you have to figure that those all hit different leaf pages and aren't going to amortize well.
As the number of probes goes up, you can start charging fractional costs because it's more likely you're hitting a leaf page you hit recently. The Mackert/Lohman equations do this reasonably well --- it's possible we can find something better, but those seem like a good place to start. The immediate problem is to get an infrastructure in place that gives us some numbers to apply equations to.

			regards, tom lane
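For reference, the Mackert and Lohman approximation being invoked here, sketched roughly as cost_index applies it: T is the number of pages in the relation, N the number of tuples fetched, and b the number of buffer pages assumed available (effective_cache_size, more or less). The surrounding PostgreSQL plumbing is paraphrased away; only the piecewise formula is the point:

    #include <math.h>

    /*
     * Mackert/Lohman estimate of the number of distinct pages fetched when
     * N tuples are fetched at random from a T-page relation through a
     * cache of b pages.  This is the formula cost_index already relies on;
     * everything around it in the real code is paraphrased here.
     */
    double
    mackert_lohman_pages_fetched(double N, double T, double b)
    {
        double      pages_fetched;

        if (T <= b)
        {
            /* relation fits in cache: approaches T asymptotically */
            pages_fetched = (2.0 * T * N) / (2.0 * T + N);
            if (pages_fetched > T)
                pages_fetched = T;
        }
        else
        {
            double      lim = (2.0 * T * b) / (2.0 * T - b);

            if (N <= lim)
                pages_fetched = (2.0 * T * N) / (2.0 * T + N);
            else
                /* cache saturated: further fetches mostly miss */
                pages_fetched = b + (N - lim) * (T - b) / T;
        }
        return ceil(pages_fetched);
    }

Applying something like this across all probes of an inner indexscan and dividing by the number of probes is what would yield the fractional per-scan charge described above.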
Josh Berkus <josh@agliodbs.com> writes:

> > However it will only make sense if people are willing to accept that
> > analyze will need a full table scan -- at least for tables where the DBA
> > knows that good n_distinct estimates are necessary.
>
> What about block-based sampling? Sampling 1 in 20 disk pages, rather than
> 1 in 20 rows, should require significantly less scanning, and yet give us
> a large enough sample for reasonable accuracy.

a) We already use block based sampling to reduce overhead. If you're talking about using the entire block and not just randomly sampled tuples from within those blocks then your sample will be biased.

b) It *still* wouldn't give you reasonable accuracy for n_distinct estimates. Your probability of being accurate for discrete processes like n_distinct is going to be more or less proportional to your sample. So sampling 5% of the table gives you only a 5% chance of an accurate prediction. More or less.

> Actually, these both seem like fairly straightforward problems
> storage-wise. The issue is deriving the statistics, for tables with many
> columns or FKs.

Well there are problems on many levels:

1) You have n^2 possible two-column combinations. That's a lot of processing and storage.

2) It isn't even clear what data you're exactly looking for. Certainly "correlation" is just shorthand here and isn't what you're actually looking for. You need something like a complete histogram for the dependent column for every possible value of the forcing column. That would be an enormous amount of storage.

> OK ... but still, it should become a "little knob" rather than the "big
> knob" it is currently.

Sure, the more accurately you model the cache the less need there would be to adjust random_page_cost. I'm not sure how accurate Postgres can really get unless it switches philosophies towards using O_DIRECT and doing its own caching and scheduling. But it could certainly be much better than today. Tom's comment makes it look like there's hope for pretty accurate cache modelling for nested loop plans.

--
greg
Greg, Tom,

> a) We already use block based sampling to reduce overhead. If you're
> talking about using the entire block and not just randomly sampled
> tuples from within those blocks then your sample will be biased.

There are actually some really good equations to work with this, estimating both row correlation and n-distinct based on sampling complete blocks. See prior discussions on this list on N-distinct.

> 1) You have n^2 possible two-column combinations. That's a lot of
> processing and storage.

Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2.

> 2) It isn't even clear what data you're exactly looking for. Certainly
> "correlation" is just shorthand here and isn't what you're actually
> looking for.

Actually, I'd think that a correlation number estimate (0 = completely uncorrelated, 1 = completely correlated) would be sufficient to improve row count estimation significantly, without incurring the vast overhead of histogram X histogram manipulation.

> Those are all valid points, but pretty much orthogonal to what I'm on
> about today. The problems I'm thinking about are seen even when the
> planner's rowcount estimates are dead on (as indeed they mostly were
> in Philippe's example).

OK, I was afraid they were interdependent. You would know better than me.

> Well, it'll still exist as a GUC for the same reasons the other ones are
> GUCs. But I think the main reasons people have needed to twiddle it are
> (1) their database is completely RAM-resident (where RPC
> *should* logically be 1.0), or
> (2) they're trying to compensate for the overestimation of
> nestloop indexscans.
> The changes I'm proposing should help with (2).

Right. What I'm saying is that (1) should be derived from estimated_cache_size and DBSIZE, not by setting an additional GUC.

> > (1) the frequency with which that index is accessed, modified by
> > (2) whether the index is a fraction of available RAM, or larger than
> > RAM
> > e) the probability that the relevant table pages are cached in
> > memory, determined by the same two factors.
>
> These would all be nice things to know, but I'm afraid it's pie in the
> sky. We have no reasonable way to get those numbers. (And if we could
> get them, there would be another set of problems, namely plan
> instability: the planner's choices would become very difficult to
> reproduce.)

Hmmm ... I think those numbers are possible to infer, at least on a gross level. Whether the cost of calculating them outweighs the benefit of having them is another question, resolvable only through some experimentation.

> a good place to start. The immediate problem is to get an
> infrastructure in place that gives us some numbers to apply equations
> to.

No arguments, of course.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Greg,

> > 1) You have n^2 possible two-column combinations. That's a lot of
> > processing and storage.
>
> Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2.

Ooops, bad math. Andrew pointed out it's actually n*(n-1)/2, not n!.

Also, we could omit columns unlikely to correlate, such as large text columns, bytea and numerics with high precisions. Also, we probably don't need to correlate UNIQUE columns inside ... I think.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote:

> Josh Berkus <josh@agliodbs.com> writes:
>
> > 1. n-distinct estimation is bad, as previously discussed;
> >
> > 2. our current heuristic sampling methods prevent us from sampling more than
> > 0.5% of any reasonably large table, causing all statistics on those tables to
> > be off for any table with irregular distribution of values;
>
> I'm convinced these two are more connected than you believe. For distributions
> of values in a range, using small samples is on a solid statistical basis. It's
> the same reason poll results based on only a few hundred people can accurately
> predict elections in a country with hundreds of millions of voters.
>
> However for things like estimating discrete values there's absolutely no solid
> foundation for these small samples. Your accuracy is going to be simply
> proportional to the sample size, meaning you can't get anything trustworthy
> without sampling large enough portions of the table that a full sequential
> scan would be just as fast.
>
> I might be interested in implementing that algorithm that was posted a while
> back that involved generating good unbiased samples of discrete values. The
> algorithm was quite clever and well described, and the paper claimed impressively
> good results.
>
> However it will only make sense if people are willing to accept that analyze
> will need a full table scan -- at least for tables where the DBA knows that
> good n_distinct estimates are necessary.

There *might* be an alternative....

Suppose that the executor could keep track of what values it's seeing from a table as it's actually running a query. These could be used to build statistical information without actually running analyze.

Of course, the problem is that you could end up with some seriously skewed statistics, depending on what your query patterns are. But in some ways that's not bad; you're optimizing for the queries you most often run.

One possible way to handle this would be to keep track of how many times each value has been seen, as well as when it was last seen, so you have some idea of the quality of the data. Another idea is to somehow blend these stats with the traditional method.

> > 3. We don't have any method to analyze inter-column correlation within a table;
> >
> > 4. We don't keep statistics on foreign key correlation;
>
> Gosh these would be nice but they sound like hard problems. Has anybody even
> made any headway in brainstorming how to tackle them?

A *huge* improvement would be gathering statistics on all multi-column indexes. Some of the stats might not make sense in this context, but others (such as correlation) would.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Thu, Jun 01, 2006 at 03:15:09PM -0400, Tom Lane wrote:

> These would all be nice things to know, but I'm afraid it's pie in the
> sky. We have no reasonable way to get those numbers. (And if we could
> get them, there would be another set of problems, namely plan instability:
> the planner's choices would become very difficult to reproduce.)

Speaking of plan instability, something that's badly needed is the ability to steer away from query plans that *might* be the most optimal, but also will fail horribly should the cost estimates be wrong. People generally don't care about getting the absolutely most optimal plan; they do care about NOT getting a plan that's horribly bad.

One possible way to do this would be to have the estimator calculate a worst-case cost to go along with the best case cost, or maybe even 3 numbers: ideal, what we think will happen, and worst-case. I know that index scans already have that information, at least in the form of cost for a plan with 0 correlation and one with perfect correlation.

Does anyone have any experience developing genetic algorithms? I think coming up with cost estimators might be an ideal use for that technology...

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > Speaking of plan instability, something that's badly needed is the > ability to steer away from query plans that *might* be the most optimal, > but also will fail horribly should the cost estimates be wrong. You sure that doesn't leave us with the empty set :-( ? Any plan we pick will be horrible under some scenario. I do agree that the current lowest-cost-and-nothing-else criterion can pick some pretty brittle plans, but it's not that easy to see how to improve it. I don't think either "best case" or "worst case" are particularly helpful concepts here. You'd almost have to try to develop an estimated probability distribution, and that's way more info than we have. > People generally don't care about getting the absolutely most optimal > plan; they do care about NOT getting a plan that's horribly bad. If 8.2 runs a query at half the speed of 8.1, people will be unhappy, and they won't be mollified if you tell them that that plan is "better" because it would have degraded less quickly if the planner's estimates were wrong. The complaint that started this thread (Philippe Lang's a couple days ago) was in fact of exactly that nature: his query was running slower than it had been because the planner was picking bitmap scans instead of plain index scans. Well, those bitmap scans would have been a lot more robust in the face of bad rowcount estimates, but the estimates weren't wrong. regards, tom lane
Josh Berkus <josh@agliodbs.com> writes:

> Greg, Tom,
>
> > a) We already use block based sampling to reduce overhead. If you're
> > talking about using the entire block and not just randomly sampled
> > tuples from within those blocks then your sample will be biased.
>
> There are actually some really good equations to work with this, estimating
> both row correlation and n-distinct based on sampling complete blocks.
> See prior discussions on this list on N-distinct.

In the prior discussions someone posted the paper with the algorithm I mentioned. That paper mentions that previous work showed poor results at estimating n_distinct even with sample sizes as large as 50% or more.

> > 1) You have n^2 possible two-column combinations. That's a lot of
> > processing and storage.
>
> Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2.

You have O(n^2) possible *two-column* combinations and O(n!) n-column combinations. I was hoping two-column combinations would be enough information to extrapolate from for larger combinations.

> > 2) It isn't even clear what data you're exactly looking for. Certainly
> > "correlation" is just shorthand here and isn't what you're actually
> > looking for.
>
> Actually, I'd think that a correlation number estimate (0 = completely
> uncorrelated, 1 = completely correlated) would be sufficient to improve
> row count estimation significantly, without incurring the vast overhead of
> histogram X histogram manipulation.

No, correlation is actually quite uninteresting here. Consider for example two columns "state" and "area code". The "area code" is pretty much 100% dependent on state, but will show very little correlation. Similarly things like "product name" and "catalog number" or even "author" and "genre" would be problem cases but have little correlation. And you can easily imagine queries that specify restrictive clauses on two such columns for which the existing statistics will overestimate the selectivity because it assumes no interdependency.

Hopefully you're right that you don't need complete histograms. Perhaps there's some statistics concept they don't teach in stats 101 that would cover this need. What we're looking for is a function f(a,b,x) that gives the net selectivity given a and b, the selectivity of two clauses based on two columns, and x some simple value that can be easily calculated by looking at the data in advance.

--
greg
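To put rough numbers on the state/area-code example, a toy illustration with made-up uniform distributions. It shows how badly multiplying per-column selectivities can misjudge the combined selectivity when one column functionally determines the other, even though an ordinary correlation statistic on such data would be near zero:

    #include <stdio.h>

    int
    main(void)
    {
        /* hypothetical uniform distributions: 50 states, 300 area codes */
        double      sel_state = 1.0 / 50;   /* WHERE state = 'CA'      */
        double      sel_area = 1.0 / 300;   /* WHERE area_code = '415' */

        /* what multiplying per-column selectivities assumes: independence */
        double      est_independent = sel_state * sel_area;    /* 1/15000 */

        /* reality when area code determines state: the state clause is
         * redundant, so the combined selectivity is just sel_area */
        double      actual = sel_area;                          /* 1/300 */

        printf("independence estimate: %.6f  actual: %.6f  (off by %.0fx)\n",
               est_independent, actual, actual / est_independent);
        return 0;
    }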
On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:

> Josh Berkus <josh@agliodbs.com> writes:
>
> > Greg, Tom,
> >
> > > a) We already use block based sampling to reduce overhead. If
> > > you're talking about using the entire block and not just
> > > randomly sampled tuples from within those blocks then your
> > > sample will be biased.
> >
> > There are actually some really good equations to work with this,
> > estimating both row correlation and n-distinct based on sampling
> > complete blocks. See prior discussions on this list on
> > N-distinct.
>
> In the prior discussions someone posted the paper with the algorithm
> I mentioned. That paper mentions that previous work showed poor
> results at estimating n_distinct even with sample sizes as large as
> 50% or more.

Which paper? People have referenced several different ones.

> > > 1) You have n^2 possible two-column combinations. That's a lot
> > > of processing and storage.
> >
> > Yes, that's the hard problem to solve. Actually, btw, it's n!,
> > not n^2.
>
> You have O(n^2) possible *two-column* combinations and O(n!)
> n-column combinations. I was hoping two-column combinations would
> be enough information to extrapolate from for larger combinations.

The math nerd in me says that this is bad math, but it might work "well enough" by ass-u-ming a lack of higher-order correlations.

> > > 2) It isn't even clear what data you're exactly looking for.
> > > Certainly "correlation" is just shorthand here and isn't what
> > > you're actually looking for.
> >
> > Actually, I'd think that a correlation number estimate (0 =
> > completely uncorrelated, 1 = completely correlated) would be
> > sufficient to improve row count estimation significantly, without
> > incurring the vast overhead of histogram X histogram manipulation.
>
> No, correlation is actually quite uninteresting here. Consider for
> example two columns "state" and "area code". The "area code" is
> pretty much 100% dependent on state, but will show very little
> correlation.

There are quantitative tests of independence available. I'm not sure whether any of these have been applied even theoretically in DBMS-land.

> Hopefully you're right that you don't need complete histograms.
> Perhaps there's some statistics concept they don't teach in stats
> 101 that would cover this need. What we're looking for is a
> function f(a,b,x) that gives the net selectivity given a and b, the
> selectivity of two clauses based on two columns, and x some simple
> value that can be easily calculated by looking at the data in
> advance.

That would be neat :)

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666        Skype: davidfetter

Remember to vote!
Tom Lane wrote:

> Another thing that's bothering me is that the index access cost computation
> (in genericcostestimate) is looking sillier and sillier:
>
>     /*
>      * Estimate the number of index pages that will be retrieved.
>      *
>      * For all currently-supported index types, the first page of the index is
>      * a metadata page, and we should figure on fetching that plus a pro-rated
>      * fraction of the remaining pages.
>      */
>     if (index->pages > 1 && index->tuples > 0)
>     {
>         numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
>         numIndexPages += 1;        /* count the metapage too */
>         numIndexPages = ceil(numIndexPages);
>     }
>     else
>         numIndexPages = 1.0;
>
>     /*
>      * Compute the index access cost.
>      *
>      * Disk cost: our generic assumption is that the index pages will be read
>      * sequentially, so they have cost 1.0 each, not random_page_cost.
>      */
>     *indexTotalCost = numIndexPages;
>
> There are several things wrong with this, at least in its application to
> btrees. It's not counting descent of the btree (ie, touches of the root
> page and intermediate-level pages). On the other hand it's surely
> overcharging for metapage touches. As of CVS tip we cache the metapage in
> the relcache, so it's probably fair to disregard that cost altogether.
> And on the third hand, when we need to retrieve multiple leaf pages it's
> over-optimistic to assume that those'll be purely sequential fetches.
> (That would be approximately true in a freshly-built btree, but not in one
> that's suffered any amount of page splitting or recycling.)
>
> I am inclined to think that a reasonable model is to continue to estimate
> the number of index leaf pages touched as above (pro-rating against the
> total number of index entries), to charge random_page_cost per leaf page
> touched, and not to count anything for metapage or tree descent. I
> justify the latter on the grounds that upper tree levels will tend to stay
> in cache because they're touched so often. random_page_cost per leaf page
> may be an overestimate, since sometimes logically adjacent leaf pages will
> be physically adjacent too, but not charging for tree descent should help
> to cancel that out. With this model, the disk cost to fetch a single
> index entry will be estimated as random_page_cost (default 4.0) rather
> than the current fixed 2.0. This shouldn't hurt things too much for
> simple indexscans --- especially since anyone running with a reduced
> random_page_cost won't see as much change. And it will increase the costs
> for bitmap scans that cross many index pages, which is what we need in
> light of Philippe's numbers.

This sounds good to me, although the 2.0 -> 4.0 cost jump may cause (more) cases of people seeing their index scans in pre-8.2 versions becoming some other type of access in 8.2. I guess a comment about testing existing applications could be placed in the 8.2 release notes?

> Now we have seen a lot of cases in which indexscans were being drastically
> overestimated, so increasing the cost estimate even more may seem like a
> horrid idea. But I believe that most or all of these cases were ones
> where the same index was being visited many times, and so the real
> estimation failure is not accounting for cache effects across multiple
> indexscans. Rather than being afraid to raise the cost estimate for an
> indexscan in isolation, I think it's time to bite the bullet and do
> something about that.
> The big difficulty in modeling cache effects from multiple scans is that
> we usually don't know how many times the index will be scanned. If we did
> have that number, I think it'd be reasonable to use the Mackert and Lohman
> approximation already used in cost_index to estimate the total number of
> index leaf pages fetched over N scans, and then divide by N to get our
> per-scan estimated cost. But because the planner builds cost estimates
> bottom-up, in most scenarios we don't have any idea whether an indexscan
> plan node will be iterated once or many times in the finished plan.
> However there is one case where we do have some information: when
> computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
> to use the estimated size of the outer relation as the loop count.
> And this is exactly the case where we are falling down in practice:
> the complaints we've seen are mostly about overestimating the cost of a
> nestloop-with-inner-index-scan. So I think handling just this case would
> be a big step forward, even if it's not a theoretically pure general
> solution.

If I understand correctly, this is the case where we normally see folks needing to add several 'set enable_xxx=off' statements to get the nested loop plan that *actually* works best :-). So, also looks good!

Cheers

Mark
Mark Kirkwood <markir@paradise.net.nz> writes:

> Tom Lane wrote:
>> With this model, the disk cost to fetch a single
>> index entry will be estimated as random_page_cost (default 4.0) rather
>> than the current fixed 2.0. This shouldn't hurt things too much for
>> simple indexscans --- especially since anyone running with a reduced
>> random_page_cost won't see as much change. And it will increase the costs
>> for bitmap scans that cross many index pages, which is what we need in
>> light of Philippe's numbers.

> This sounds good to me, although the 2.0 -> 4.0 cost jump may cause
> (more) cases of people seeing their index scans in pre-8.2 versions
> becoming some other type of access in 8.2. I guess a comment about
> testing existing applications could be placed in the 8.2 release notes?

Yeah, that comes with the territory. One point to note is that with this model, setting random_page_cost below 2.0 will actually make small indexscans look *cheaper* than they do now. So it'll certainly be possible to make the thing jump in that direction if you need to.

			regards, tom lane
Hi All,

Just a small comment from a mortal user.

On Thursday 01 June 2006 19:28, Josh Berkus wrote:

> 5. random_page_cost (as previously discussed) is actually a function of
> relatively immutable hardware statistics, and as such should not need to
> exist as a GUC once the cost model is fixed.

It's correct that the hardware statistics are quite immutable - per tablespace. This suggests to me that the GUC should be a default value only and then overridable per tablespace. It's quite normal to have primary (current data) and secondary (historical data) storage for larger databases. This can also fit nicely into the pg_tmp storage if we tie support for multiple tmp folders together with tablespaces.

> 6. We haven't added any way to estimate rows returned from SRFs.

This would also be very cool - currently the planner can really get annoyed when joining SRF functions with tables.

---
Nicolai
Josh, Greg, and Tom,

I do not know how sensitive the plans will be to the correlation, but one thought might be to map the histogram X histogram correlation to a square grid of values. Then you can map them to an integer which would give you 8 x 8 with binary values, a 5 x 5 with 4 values per point, or a 4 x 4 with 8 values per point. If close is good enough, that would be a compact way to store some histogram cross-correlation information.

Ken

On Thu, Jun 01, 2006 at 01:50:26PM -0700, Josh Berkus wrote:
> Greg, Tom,
>
> ...
> > 2) It isn't even clear what data you're exactly looking for. Certainly
> > "correlation" is just shorthand here and isn't what you're actually
> > looking for.
>
> Actually, I'd think that a correlation number estimate (0 = completely
> uncorrelated, 1 = completely correlated) would be sufficient to improve
> row count estimation significantly, without incurring the vast overhead of
> histogram X histogram manipulation.
> ...
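A sketch of the packing Ken describes, for the 8 x 8 binary-cell case, which fits in a single 64-bit integer. The representation and the threshold are illustrative only, not a concrete proposal:

    #include <stdint.h>

    /*
     * Pack an 8 x 8 histogram-bucket co-occurrence grid into one 64-bit
     * value, one bit per cell: a set bit means "this (row-bucket,
     * column-bucket) pair occurs more often than the threshold".
     */
    uint64_t
    pack_correlation_grid(const double freq[8][8], double threshold)
    {
        uint64_t    grid = 0;

        for (int r = 0; r < 8; r++)
            for (int c = 0; c < 8; c++)
                if (freq[r][c] > threshold)
                    grid |= (uint64_t) 1 << (r * 8 + c);
        return grid;
    }

    /* lookup at plan time: is this pair of buckets a "hot" combination? */
    int
    grid_cell_set(uint64_t grid, int r, int c)
    {
        return (int) ((grid >> (r * 8 + c)) & 1);
    }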
David Fetter <david@fetter.org> writes:

> > In the prior discussions someone posted the paper with the algorithm
> > I mentioned. That paper mentions that previous work showed poor
> > results at estimating n_distinct even with sample sizes as large as
> > 50% or more.
>
> Which paper? People have referenced several different ones.

Phillip B Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. In Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.

Which says (emphasis in original as italics):

    The most well-studied approach for distinct-values estimation is to collect a uniform random sample S of the data, store S in a database, and then use S at query time to provide fast, approximate answers to distinct values queries [22, 23, 27, 21, 29, 5, 28, 18, 19, 9, 7]. However, previous work [28, 18, 9, 7] has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data), even for the simple case of counting the number of distinct values in a column. The strongest negative result is due to Charikar et al. [7], who proved that estimating the number of distinct values in a column to within a small constant factor (with probability > 1/2) requires that *nearly* *the* *entire* *data* *set* *be* *sampled*. Moreover, all known sampling-based estimators provide unsatisfactory results on data sets of interest [7], even for this simple case.

And later:

    Using a variety of synthetic and real-world data sets, we show that distinct sampling gives estimates for distinct values queries that are within 0%-10%, whereas previous methods were typically 50%-250% off, across the spectrum of data sets and queries studied.

Here "distinct sampling" is the algorithm being proposed which requires looking at every record and keeping a sample *of the distinct values*. The "previous methods" are methods based on sampling the records.

I haven't read the citation [7] that proves the strong negative result for any estimator of distinct values based on sampling. It's:

M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In Proc. 19th ACM Symp. on Principles of Database Systems, pages 268-279, May 2000.

> > Hopefully you're right that you don't need complete histograms.
> > Perhaps there's some statistics concept they don't teach in stats
> > 101 that would cover this need. What we're looking for is a
> > function f(a,b,x) that gives the net selectivity given a and b, the
> > selectivity of two clauses based on two columns, and x some simple
> > value that can be easily calculated by looking at the data in
> > advance.
>
> That would be neat :)

Doing a bit of basic searching around I think the tool we're looking for here is called a "chi-squared test for independence". I haven't read up on how it works so I'm unclear if it can be calculated using a simple O(n) scan or if it would require some non-linear post-processing after the analyze pass, which would be unfortunate. And I haven't found anything that describes how to make use of the resulting number. Does it actually give a formula f(a,b,x) that gives a nice convenient expected selectivity for the clause?

Oh, and incidentally, at first glance it seems like calculating it depends on having good distinct value sampling.

--
greg
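For what it's worth, the chi-squared statistic itself is a single pass over an r x c contingency table of observed counts once that table has been built (building the table is where the distinct-value problem sneaks back in). A generic textbook sketch, not tied to any PostgreSQL code:

    #include <stddef.h>

    /*
     * Pearson chi-squared statistic for independence over an r x c
     * contingency table of observed counts (row-major).  A large value
     * relative to the chi-squared distribution with (r-1)(c-1) degrees of
     * freedom means the two columns are unlikely to be independent.
     */
    double
    chi_squared(const double *observed, size_t rows, size_t cols)
    {
        double      grand = 0.0;
        double      chi2 = 0.0;
        double      row_tot[rows];
        double      col_tot[cols];

        for (size_t i = 0; i < rows; i++)
            row_tot[i] = 0.0;
        for (size_t j = 0; j < cols; j++)
            col_tot[j] = 0.0;

        for (size_t i = 0; i < rows; i++)
            for (size_t j = 0; j < cols; j++)
            {
                double      o = observed[i * cols + j];

                row_tot[i] += o;
                col_tot[j] += o;
                grand += o;
            }

        for (size_t i = 0; i < rows; i++)
            for (size_t j = 0; j < cols; j++)
            {
                double      expected = row_tot[i] * col_tot[j] / grand;
                double      o = observed[i * cols + j];

                if (expected > 0.0)
                    chi2 += (o - expected) * (o - expected) / expected;
            }
        return chi2;
    }

Note that it answers "are these columns independent?" rather than handing back a combined-selectivity formula f(a,b,x) directly, which matches the reservation expressed above.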
Greg,

> Using a variety of synthetic and real-world data sets, we show that
> distinct sampling gives estimates for distinct values queries that
> are within 0%-10%, whereas previous methods were typically 50%-250% off,
> across the spectrum of data sets and queries studied.

Aha. It's a question of the level of error permissible. For our estimates, being 100% off is actually OK. That's why I was looking at 5% block sampling; it stays within the range of +/- 50% n-distinct in 95% of cases.

> Doing a bit of basic searching around I think the tool we're looking for
> here is called a "chi-squared test for independence".

Augh. I wrote a program (in Pascal) to do this back in 1988. Now I can't remember the math. For a two-column test it's relatively computation-light, though, as I recall ... but I don't remember whether standard chi-square works with a random sample.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes:

> > Using a variety of synthetic and real-world data sets, we show that
> > distinct sampling gives estimates for distinct values queries that
> > are within 0%-10%, whereas previous methods were typically 50%-250% off,
> > across the spectrum of data sets and queries studied.
>
> Aha. It's a question of the level of error permissible. For our
> estimates, being 100% off is actually OK. That's why I was looking at 5%
> block sampling; it stays within the range of +/- 50% n-distinct in 95% of
> cases.

Well, let's see. Say for example we're scanning 50,000 records out of 1M records. Using the Theorem from Charikar et al quoted as Theorem 1 in section 2 we get that there is at least a 5% chance of getting a result with a ratio error at least sqrt((1M-50k)/100k ln(20)) or 5.33.

So no, even assuming you have a good unbiased sample, a 5% sample is only going to get you to a result within 19%-533% of the real values 95% of the time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On some distributions it may be outside that range even more than 95% of the time.

This is entirely unlike the histograms where we have a solid foundation for a positive result. We can guarantee that the result will be outside +/- x% *at most* 5% of the time. For most distributions it'll be outside that range even less.

And a 5% sample is pretty big. In fact my tests earlier showed the i/o from 5% block sampling took just as long as reading all the blocks. Even if we figure out what's causing that (IMHO surprising) result and improve matters I would only expect it to be 3-4x faster than a full scan.

http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php

--
greg
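Reproducing the arithmetic so the 5.33 isn't magic: the bound being quoted works out to a ratio error of at least sqrt((n - r) / (2r) * ln(1/delta)) with probability at least delta, with n = 1M rows, r = 50k sampled, delta = 0.05. This is just a quick check of the numbers above, not a restatement of the theorem's exact conditions:

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        double      n = 1e6;        /* rows in the table   */
        double      r = 5e4;        /* rows sampled (5%)   */
        double      delta = 0.05;   /* failure probability */

        /* ratio error >= sqrt((n - r) / (2r) * ln(1/delta)) */
        double      ratio = sqrt((n - r) / (2.0 * r) * log(1.0 / delta));

        printf("ratio error >= %.2f\n", ratio);     /* about 5.33 */
        return 0;
    }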
Greg Stark wrote:

> Josh Berkus <josh@agliodbs.com> writes:
>
> > > Using a variety of synthetic and real-world data sets, we show that
> > > distinct sampling gives estimates for distinct values queries that
> > > are within 0%-10%, whereas previous methods were typically 50%-250% off,
> > > across the spectrum of data sets and queries studied.
> >
> > Aha. It's a question of the level of error permissible. For our
> > estimates, being 100% off is actually OK. That's why I was looking at 5%
> > block sampling; it stays within the range of +/- 50% n-distinct in 95% of
> > cases.
>
> Well, let's see. Say for example we're scanning 50,000 records out of 1M
> records. Using the Theorem from Charikar et al quoted as Theorem 1 in section
> 2 we get that there is at least a 5% chance of getting a result with a ratio
> error at least sqrt((1M-50k)/100k ln(20)) or 5.33.
>
> So no, even assuming you have a good unbiased sample, a 5% sample is only
> going to get you to a result within 19%-533% of the real values 95% of the
> time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On
> some distributions it may be outside that range even more than 95% of the
> time.
>
> This is entirely unlike the histograms where we have a solid foundation for a
> positive result. We can guarantee that the result will be outside +/- x% *at
> most* 5% of the time. For most distributions it'll be outside that range even
> less.
>
> And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
> 5% block sampling took just as long as reading all the blocks. Even if we
> figure out what's causing that (IMHO surprising) result and improve matters I
> would only expect it to be 3-4x faster than a full scan.
>
> http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php

I'm sorry to interrupt your esoteric (to me) discussion, but I have a very simple question: would you define a "good unbiased sample"? My statistics professor Dan Price (rest his soul) would tell me there are only random samples of some sort, and "other", which are always biased, and never good. Excuse my absolutes, I didn't mean Good or Evil. And how do you calculate a level of error without randomization? And are you talking of type I or type II error?

Michael
Greg Stark <gsstark@mit.edu> writes:

> And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
> 5% block sampling took just as long as reading all the blocks. Even if we
> figure out what's causing that (IMHO surprising) result and improve matters I
> would only expect it to be 3-4x faster than a full scan.

One way to reduce the I/O pain from extensive sampling would be to turn VACUUM ANALYZE into a genuine combined operation instead of a mere notational shorthand for two separate scans.

I'd still be worried about the CPU pain though. ANALYZE can afford to expend a pretty fair number of cycles per sampled tuple, but with a whole-table sample that's going to add up.

			regards, tom lane
On Fri, Jun 02, 2006 at 01:39:32PM -0700, Michael Dean wrote:

> I'm sorry to interrupt your esoteric (to me) discussion, but I have
> a very simple question: would you define a "good unbiased sample"?
> My statistics professor Dan Price (rest his soul) would tell me
> there are only random samples of some sort, and "other", which are
> always biased, and never good.

What's at issue here is a biased estimator, not a biased sample. If you know an unbiased estimator of multiplicity based on a random sample, that would be great :)

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666        Skype: davidfetter

Remember to vote!
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
> > 5% block sampling took just as long as reading all the blocks. Even if we
> > figure out what's causing that (IMHO surprising) result and improve matters I
> > would only expect it to be 3-4x faster than a full scan.
>
> One way to reduce the I/O pain from extensive sampling would be to turn
> VACUUM ANALYZE into a genuine combined operation instead of a mere
> notational shorthand for two separate scans.

Except that we're also looking for every way we can to avoid having vacuum touch every page, so that large tables with narrow hot spots can still afford to be vacuumed regularly during peak hours.

But for most users analyze doesn't really have to run as often as vacuum. One sequential scan per night doesn't seem like that big a deal to me.

> I'd still be worried about the CPU pain though. ANALYZE can afford to
> expend a pretty fair number of cycles per sampled tuple, but with a
> whole-table sample that's going to add up.

That is a concern. Though the algorithm is pretty efficient, it basically amounts to hashing all the tuples, keeping only a limited number of hash buckets, and just throwing away the rest of the data. Things could get more complicated if we get to more complex stats than simple n_distinct estimates and performing dependence tests on the data for multi-column statistics.

--
greg
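For anyone who hasn't seen the Gibbons paper, the flavor of the hash-bucket idea described above looks roughly like the following. This is a simplification with invented names, not the paper's exact algorithm: every row is examined once, but only values whose hash clears the current level are retained, so memory stays bounded. The caller is assumed to allocate the arrays and start with level = 0:

    #include <stdint.h>

    /* keep a value only if the low 'level' bits of its hash are zero */
    static int
    passes(uint64_t hash, int level)
    {
        return (hash & (((uint64_t) 1 << level) - 1)) == 0;
    }

    typedef struct
    {
        uint64_t   *values;         /* distinct values currently retained */
        uint64_t   *hashes;         /* their hashes */
        int         nvalues;
        int         capacity;
        int         level;
    } DistinctSample;

    void
    distinct_sample_add(DistinctSample *ds, uint64_t value, uint64_t hash)
    {
        if (!passes(hash, ds->level))
            return;

        /* ignore values we already retain */
        for (int i = 0; i < ds->nvalues; i++)
            if (ds->values[i] == value)
                return;

        ds->values[ds->nvalues] = value;
        ds->hashes[ds->nvalues] = hash;
        ds->nvalues++;

        /* over capacity: raise the bar and evict values that no longer qualify */
        while (ds->nvalues > ds->capacity && ds->level < 63)
        {
            int         keep = 0;

            ds->level++;
            for (int i = 0; i < ds->nvalues; i++)
                if (passes(ds->hashes[i], ds->level))
                {
                    ds->values[keep] = ds->values[i];
                    ds->hashes[keep] = ds->hashes[i];
                    keep++;
                }
            ds->nvalues = keep;
        }
    }

    double
    distinct_sample_estimate(const DistinctSample *ds)
    {
        /* each retained value stands in for about 2^level distinct values */
        return (double) ds->nvalues * (double) ((uint64_t) 1 << ds->level);
    }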
Greg, Tom,

> But for most users analyze doesn't really have to run as often as
> vacuum. One sequential scan per night doesn't seem like that big a deal
> to me.

Clearly you don't have any 0.5 TB databases.

> > I'd still be worried about the CPU pain though. ANALYZE can afford to
> > expend a pretty fair number of cycles per sampled tuple, but with a
> > whole-table sample that's going to add up.

Agreed. Despite conventional wisdom, most PostgreSQL databases ... even those with high level OLTP or very large DW ... are CPU-bound. We really don't want an ANALYZE which is an order-of-magnitude increase in CPU activity.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus wrote:

> Greg, Tom,
>
>> But for most users analyze doesn't really have to run as often as
>> vacuum. One sequential scan per night doesn't seem like that big a deal
>> to me.
>
> Clearly you don't have any 0.5 TB databases.

Perhaps something like "ANALYZE FULL"? Then only those who need the more precise statistics would pay the cost for a full scan.

--
todd
I wrote: > In general it seems to me that for CPU-bound databases, the default values > of the cpu_xxx_cost variables are too low. ... rather than telling people > to manipulate all three of these variables individually, I think it might > also be a good idea to provide a new GUC variable named something like > "cpu_speed_scale" that would just be a multiplier for the other variables. > It would default to 1.0 and our standard advice for CPU-bound databases > would be "decrease random_page_cost to 1.0 and raise cpu_speed_scale to > 10.0 or so". Seems cleaner than telling people to muck with three or so > individual values. Nicolai Petri's comment about per-tablespace access costs caused me to rethink the above proposal. Instead of inventing "cpu_speed_scale", which seems rather baroque after thinking about it more, what I now think we should do is invent a "seq_page_cost" GUC to replace the traditionally hardwired value of 1.0 cost unit per sequential page fetch. Then, if you've got different tablespaces with different disk speeds, you could imagine having per-tablespace values of seq_page_cost and random_page_cost, whereas you probably want the CPU cost numbers to remain the same across all tables. I don't really want to get into inventing per-tablespace settings right now, because the need hasn't been demonstrated; but if we ever do want to do it, this approach will be a whole lot less confusing than something involving a cpu_speed_scale knob. This still leaves you twiddling two knobs (now random_page_cost and seq_page_cost) if you want to set up the planner for an all-in-memory database. So it's not any more complicated for that purpose. One objection to this is that after moving "off the gold standard" of 1.0 = one page fetch, there is no longer any clear meaning to the cost estimate units; you're faced with the fact that they're just an arbitrary scale. I'm not sure that's such a bad thing, though. For instance, some people might want to try to tune their settings so that the estimates are actually comparable to milliseconds of real time. Comments? regards, tom lane
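To make the proposal concrete, here is a sketch (not an actual patch) of what replacing the hardwired one-cost-unit-per-sequential-page assumption could look like, modeled loosely on cost_seqscan() in costsize.c. The seq_page_cost variable is of course hypothetical at this point, and the function is stripped down to the two cost terms that matter for this discussion.

#include "postgres.h"
#include "nodes/relation.h"
#include "optimizer/cost.h"

double      seq_page_cost = 1.0;    /* proposed GUC; 1.0 preserves today's scale */

void
cost_seqscan_sketch(Path *path, PlannerInfo *root, RelOptInfo *baserel)
{
    Cost        run_cost = 0;

    /*
     * Disk cost: previously a hardwired 1.0 per sequential page fetch;
     * a per-tablespace version would simply look up a different
     * seq_page_cost here.
     */
    run_cost += seq_page_cost * baserel->pages;

    /* CPU cost of processing the tuples, unchanged */
    run_cost += cpu_tuple_cost * baserel->tuples;

    path->startup_cost = 0;
    path->total_cost = run_cost;
}

Under this scheme an all-in-memory database is approximated by dropping both seq_page_cost and random_page_cost to the same small value relative to the cpu_* settings, which is the two-knob situation described above.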
Josh Berkus <josh@agliodbs.com> writes: > Greg, Tom, > > > But for most users analyze doesn't really have to run as often as > > vacuum. One sequential scan per night doesn't seem like that big a deal > > to me. > > Clearly you don't have any 0.5 TB databases. Actually I did, not so long ago. Sequential scans in an OLTP query would be a disaster. But a single sequential scan run at a controlled time wouldn't concern me as long as *all* of the following constraints are met: a) You can run them at your leisure at off-peak times when your i/o bandwidth isn't in short supply. b) You don't need the results urgently so you don't care if it takes a while to run. c) You don't need many of them at the same time. Even on your production system surely you occasionally, say, take a full backup or run "select count(*)" or other sanity checks on the data? > > > I'd still be worried about the CPU pain though. ANALYZE can afford to > > > expend a pretty fair number of cycles per sampled tuple, but with a > > > whole-table sample that's going to add up. > > Agreed. Despite conventional wisdom, most PostgreSQL databases ... even > those with high levels of OLTP or very large DW workloads ... are CPU-bound. We > really don't want an ANALYZE which is an order-of-magnitude increase in > CPU activity. I don't think Tom was actually expressing concern about ANALYZE becoming more expensive, but about tying ANALYZE and VACUUM together and making VACUUM more expensive. VACUUM is something we want to encourage people to think they can run all day long, not just occasionally. -- greg
> One objection to this is that after moving "off the gold standard" of > 1.0 = one page fetch, there is no longer any clear meaning to the > cost estimate units; you're faced with the fact that they're just an > arbitrary scale. I'm not sure that's such a bad thing, though. For > instance, some people might want to try to tune their settings so that > the estimates are actually comparable to milliseconds of real time. Any chance that the correspondence to time could be made a deliberate part of the design, with the general advice being to follow that rule? If we could tell people to run *benchmark* and use those numbers directly as a first-approximation tuning, it could help quite a bit for people new to PostgreSQL experiencing poor performance. effective_cache_size then becomes essentially the last hand-set variable for medium-sized installations. --
Rod Taylor <pg@rbt.ca> writes: >> One objection to this is that after moving "off the gold standard" of >> 1.0 = one page fetch, there is no longer any clear meaning to the >> cost estimate units; you're faced with the fact that they're just an >> arbitrary scale. I'm not sure that's such a bad thing, though. For >> instance, some people might want to try to tune their settings so that >> the estimates are actually comparable to milliseconds of real time. > Any chance that the correspondence to time could be made a deliberate > part of the design, with the general advice being to follow that rule? We might eventually get to that point, but I'm hesitant to try to do it immediately. For one thing, I really *don't* want to get bug reports from newbies complaining that the cost estimates are always off by a factor of X. (Not but what we haven't gotten some of those anyway :-() In the short term I see us sticking to the convention that seq_page_cost is 1.0 in a "typical" database, while anyone who's really hot to try to make the other happen is free to experiment. > If we could tell people to run *benchmark* and use those numbers > directly as a first-approximation tuning, it could help quite a bit > for people new to PostgreSQL experiencing poor performance. We don't have such a benchmark ... if we did, we could have told people how to use it to set the variables already. I'm very very suspicious of any suggestion that it's easy to derive appropriate numbers for these settings from one magic benchmark. regards, tom lane
On Jun 2, 2006, at 5:24 PM, Todd A. Cook wrote: > Josh Berkus wrote: >> Greg, Tom, >>> But for most users analyze doesn't really have to run as often as >>> vacuum. One sequential scan per night doesn't seem like that big >>> a deal >>> to me. >> Clearly you don't have any 0.5 TB databases. > > Perhaps something like "ANALYZE FULL"? Then only those who need the > more precise statistics would pay the cost for a full scan. Might also be worth adding analyze delay settings, ala vacuum_cost_delay.
On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: > Might also be worth adding analyze delay settings, ala > vacuum_cost_delay. Actually we should have delay settings for all potential (almost-)full-scan service ops - VACUUM, ANALYSE, CREATE INDEX, ADD CONSTRAINT, maybe more - so that there would be better chances of running those on busy databases without disastrous effects. Probably we should use the same settings for all these, not invent a new set for each. -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
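For readers who haven't looked at what the existing delay settings actually do, here is a simplified sketch of the cost-based throttling, modeled loosely on vacuum_delay_point(); this is not the real code, and the variable names are stand-ins for vacuum_cost_limit and vacuum_cost_delay. Each page touched adds to a balance, and once the balance reaches the limit the backend sleeps.

#include <unistd.h>

static int  cost_balance = 0;
static int  cost_limit = 200;       /* stand-in for vacuum_cost_limit */
static int  cost_delay_ms = 10;     /* stand-in for vacuum_cost_delay, in ms */

static void
delay_point(int page_cost)
{
    if (cost_delay_ms <= 0)
        return;                     /* throttling disabled */

    cost_balance += page_cost;
    if (cost_balance >= cost_limit)
    {
        /* nap, then start a new accounting period */
        usleep(cost_delay_ms * 1000L);
        cost_balance = 0;
    }
}

Extending the same settings to other (almost-)full-scan operations would then largely be a matter of calling such a delay point from their scan loops, which is presumably what is being proposed here.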
On Fri, 2006-06-02 at 16:23, Greg Stark wrote: > And a 5% sample is a pretty big. In fact my tests earlier showed the i/o from > 5% block sampling took just as long as reading all the blocks. Even if we > figure out what's causing that (IMHO surprising) result and improve matters I > would only expect it to be 3-4x faster than a full scan. You should not be surprised by this once you visualise what happens at the disk level with all those platters spinning and heads moving :) Disks can read at full rotation speed, so skipping (not reading) some blocks will not make reading the remaining blocks from the same track faster. And if there are more than 20 8k pages per track, you still have a very high probability you need to read all tracks. You may be able to move to the next track a little earlier compared to reading all blocks, but then you are likely to miss the block from the next track and have to wait a full rotation. You will get some win from skipping pages only once your % falls so low that you can also skip a significant number of tracks. > http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php Your test program could have gotten somewhat better results if you had somehow managed to tell the system all the block numbers to read in one go, rather than asking for each one only after getting the previous one. As it stands, it quite likely had to wait several disk rotations even for sectors on the same track, since for small steps it may have just missed the next sector. This does not apply to disks which always read a full track into RAM cache, but even there all the tracks are actually read. The fact that 5% was not slower than seqscan seems to indicate that actually all track reads were cached inside the disk or controller. -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
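To put a rough number on that intuition: if sampled pages are chosen independently, the probability that a track holding N pages contributes at least one page to a 5% sample is 1 - 0.95^N. The back-of-envelope program below (purely illustrative arithmetic, not a claim about any particular drive) shows that with a few dozen pages per track nearly every track gets touched anyway.

#include <math.h>
#include <stdio.h>

int
main(void)
{
    const double sample_fraction = 0.05;    /* 5% block sample */

    for (int pages_per_track = 10; pages_per_track <= 80; pages_per_track *= 2)
        printf("%2d pages/track: P(track is touched) = %.2f\n",
               pages_per_track,
               1.0 - pow(1.0 - sample_fraction, pages_per_track));
    return 0;
}

It prints roughly 0.40 at 10 pages per track, 0.64 at 20, 0.87 at 40, and 0.98 at 80, which fits the observation that a 5% block sample costs nearly as much I/O as reading everything.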
pgbench appears to already support arbitrary SQL queries with the -f switch, so why couldn't we just make it a little smarter and have people enable SQL query logging for a day or two, then pass the log off to pgbench: pgbench -f <log file> Seems to me like that wouldn't be too difficult to do, and would give much closer "real-world" results than pgbench's built-in benchmark. On top of that, the community could start offering up "template" benchmarks like: "busy website", "data warehouse", "forums", "financial" and distribute them with pgbench: pgbench -f templates/data_warehouse.pgbench pgbench -f templates/forums.pgbench ... From that point a brute-force auto-tune utility would be pretty straightforward to write. pgautotune -f templates/data_warehouse.pgbench,myapp.sqllog Or if one server runs multiple custom apps that you want to tune for: pgautotune -f myapp1.sqllog,myapp2.sqllog,myapp3.sqllog Even if it took 48hrs to run, it would be a good burn-in test for a brand new server. ;) On Fri, 2006-06-02 at 19:38 -0400, Tom Lane wrote: > Rod Taylor <pg@rbt.ca> writes: > >> One objection to this is that after moving "off the gold standard" of > >> 1.0 = one page fetch, there is no longer any clear meaning to the > >> cost estimate units; you're faced with the fact that they're just an > >> arbitrary scale. I'm not sure that's such a bad thing, though. For > >> instance, some people might want to try to tune their settings so that > >> the estimates are actually comparable to milliseconds of real time. > > > Any chance that the correspondence to time could be made a deliberate > > part of the design, with the general advice being to follow that rule? > > We might eventually get to that point, but I'm hesitant to try to do it > immediately. For one thing, I really *don't* want to get bug reports > from newbies complaining that the cost estimates are always off by a > factor of X. (Not but what we haven't gotten some of those anyway :-() > In the short term I see us sticking to the convention that seq_page_cost > is 1.0 in a "typical" database, while anyone who's really hot to try to > make the other happen is free to experiment. > > > If we could tell people to run *benchmark* and use those numbers > > directly as a first-approximation tuning, it could help quite a bit > > for people new to PostgreSQL experiencing poor performance. > > We don't have such a benchmark ... if we did, we could have told > people how to use it to set the variables already. I'm very very > suspicious of any suggestion that it's easy to derive appropriate > numbers for these settings from one magic benchmark. > > regards, tom lane -- Mike Benoit <ipso@snappymail.ca>
Hannu Krosing <hannu@skype.net> writes: > Disks can read at full rotation speed, so skipping (not reading) some > blocks will not make reading the remaining blocks from the same track > faster. And if there are more than 20 8k pages per track, you still have > a very high probability you need to read all tracks. Well, if there are exactly 20 you would expect a 50% chance of having to read a track, so you would expect double the effective bandwidth. It would have to be substantially bigger to not have any noticeable effect. > You may be able to move to the next track a little earlier compared to > reading all blocks, but then you are likely to miss the block from the > next track and have to wait a full rotation. No, I don't think that works out. You always have a chance of missing the block from the next track and having to wait a full rotation, and your chance isn't increased or decreased by seeking earlier in the rotation. So you would expect each track to take <20 block reads> less time on average. > Your test program could have gotten somewhat better results if you had > somehow managed to tell the system all the block numbers to read in one > go, rather than asking for each one only after getting the previous one. I was trying to simulate the kind of read pattern that postgres generates, which I believe looks like that. > The fact that 5% was not slower than seqscan seems to indicate that > actually all track reads were cached inside the disk or controller. I dunno, your first explanation seemed pretty convincing and doesn't depend on specific assumptions about the caching. Moreover, this doesn't explain why you *do* get a speedup when reading less than 5%. Perhaps what this indicates is that the real meat is in track sampling, not block sampling. -- greg
Greg Stark <gsstark@MIT.EDU> writes: > Perhaps what this indicates is that the real meat is in track sampling, not > block sampling. Fwiw, I've done a little benchmarking and I'm starting to think this isn't a bad idea. I see a dramatic speed improvement for samples of 1-10% as the block size increases. Presumably this is, as Hannu said, due to reducing the number of tracks necessary to cover the sample. I see improvements up to around 256M blocks or so, but my data is pretty questionable since I'm busy watching TV in MythTV in another window. It's on another drive but it still seems to be making the numbers jump around a bit. I expect there's a trade-off between keeping enough blocks for the sample to be representative on the one hand, and large blocks being much faster to read in on the other. I would suggest setting the block size in the block sampling algorithm to something like max(8k, sqrt(table size)). That gives 8k blocks for anything up to 255M, but takes better advantage of the speed increase available from sequential i/o for larger tables; from my experiments, about a 50% increase in speed. Actually, maybe something even more aggressive would be better, say (table size)^0.75, so it kicks in sooner than on 256M tables and gets to larger block sizes on reasonably sized tables. Note, this doesn't mean anything like changing page sizes, just selecting more blocks that hopefully lie on the same track when possible. -- greg
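For a feel for what those two rules give, here is a throwaway calculation; it is nothing more than the arithmetic of the formulas floated above, applied to a handful of example table sizes.

#include <math.h>
#include <stdio.h>

static double
chunk_bytes(double table_bytes, double exponent)
{
    double      chunk = pow(table_bytes, exponent);

    /* never sample in chunks smaller than one 8k page */
    return chunk < 8192.0 ? 8192.0 : chunk;
}

int
main(void)
{
    const double sizes_mb[] = {64, 1024, 102400};   /* 64 MB, 1 GB, 100 GB */

    for (int i = 0; i < 3; i++)
    {
        double      bytes = sizes_mb[i] * 1024.0 * 1024.0;

        printf("%8.0f MB table: sqrt rule %8.0f kB, ^0.75 rule %8.0f kB\n",
               sizes_mb[i],
               chunk_bytes(bytes, 0.5) / 1024.0,
               chunk_bytes(bytes, 0.75) / 1024.0);
    }
    return 0;
}

As the mail says, these are sampling chunk sizes, not page sizes; the point of growing the chunk with the table is just to make the sampled pages more likely to share tracks.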
Greg Stark <gsstark@MIT.EDU> writes: > I see improvements up to around 256M blocks or so, but my data is pretty > questionable since I'm busy watching tv in Mythtv in another window. It's on > another drive but it still seems to be making the numbers jump around a bit. Oops, I meant blocks of 256k there. Sorry. -- greg
Hannu Krosing <hannu@skype.net> writes: > On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: > > > Might also be worth adding analyze delay settings, ala > > vacuum_cost_delay. > > Actually we should have delay settings for all potential > (almost-)full-scan service ops - VACUUM, ANALYSE, CREATE INDEX, ADD > CONSTRAINT, maybe more - so that there would be better chances of > running those on busy databases without disastrous effects. What about UPDATE and DELETE and for that matter SELECT? -- greg
Greg Stark <gsstark@mit.edu> writes: > Hannu Krosing <hannu@skype.net> writes: >> On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: >>> Might also be worth adding analyze delay settings, ala >>> vacuum_cost_delay. ANALYZE already respects the vacuum delay settings. >> Actually we should have delay settings for all potential >> (almost-)full-scan service ops - VACUUM, ANALYSE, CREATE INDEX, ADD >> CONSTRAINT, maybe more - so that there would be better chances of >> running those on busy databases without disastrous effects. > What about UPDATE and DELETE and for that matter SELECT? This seems pretty silly. The point of the delay stuff is to prevent background maintenance operations from eating an unreasonable share of resources compared to foreground queries. I don't see why you'd put delays into queries --- if your machine is loaded, it's loaded. I think the existing features are sufficient in this line and that doing more is just adding complexity for complexity's sake. regards, tom lane
On Sun, 2006-06-04 at 18:09, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > Hannu Krosing <hannu@skype.net> writes: > >> On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: > >>> Might also be worth adding analyze delay settings, ala > >>> vacuum_cost_delay. > > ANALYZE already respects the vacuum delay settings. > > >> Actually we should have delay settings for all potential > >> (almost-)full-scan service ops - VACUUM, ANALYSE, CREATE INDEX, ADD > >> CONSTRAINT, maybe more - so that there would be better chances of > >> running those on busy databases without disastrous effects. > > > What about UPDATE and DELETE and for that matter SELECT? > > This seems pretty silly. The point of the delay stuff is to prevent > background maintenance operations from eating an unreasonable share > of resources compared to foreground queries. I don't see why you'd > put delays into queries --- if your machine is loaded, it's loaded. > > I think the existing features are sufficient in this line and that > doing more is just adding complexity for complexity's sake. Making CREATE INDEX respect delay settings will be reasonable once we get it to run without locking the table. And if non-locking is doable for ADD/ALTER CONSTRAINT, then it makes sense there too. -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
On Jun 4, 2006, at 5:09 PM, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: >> Hannu Krosing <hannu@skype.net> writes: >>> On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: >>>> Might also be worth adding analyze delay settings, ala >>>> vacuum_cost_delay. > > ANALYZE already respects the vacuum delay settings. > >>> Actually we should have delay settings for all potential >>> (almost-)full-scan service ops - VACUUM, ANALYSE, CREATE INDEX, ADD >>> CONSTRAINT, maybe more - so that there would be better chances of >>> running those on busy databases without disastrous effects. > >> What about UPDATE and DELETE and for that matter SELECT? > > This seems pretty silly. The point of the delay stuff is to prevent > background maintenance operations from eating an unreasonable share > of resources compared to foreground queries. I don't see why you'd > put delays into queries --- if your machine is loaded, it's loaded. 'maintenance operations' often also means running large updates. Being able to run those at a reduced priority would certainly be helpful in many cases. A better way to accomplish this would be to have the OS handle prioritized IO scheduling, but since pretty much none of them seem to do that... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Tom Lane wrote: > > One objection to this is that after moving "off the gold standard" of > 1.0 = one page fetch, there is no longer any clear meaning to the > cost estimate units; you're faced with the fact that they're just an > arbitrary scale. I'm not sure that's such a bad thing, though. It seems to me the appropriate gold standard is Time, in microseconds or milliseconds. The default postgresql.conf can come with a set of hardcoded values that reasonably approximate some real-world system; and if that's documented in the file, someone reading it can say "hey, my CPU's about the same but my disk subsystem is much faster, so I know in which direction to change things". And another person may say "ooh, now I know that my 4GHz machines should have about twice the number here as my 2GHz box". People who *really* care a lot (HW vendors?) could eventually make measurements on their own systems.
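As a toy illustration of the milliseconds-as-cost-units idea, the snippet below takes a few invented per-operation timings for a hypothetical reference machine and uses them directly as the planner settings, so that a plan's estimated cost reads as an estimated runtime in milliseconds on that machine. The specific numbers are made up.

#include <stdio.h>

int
main(void)
{
    /* hypothetical measurements on some reference box, in milliseconds */
    const double seq_fetch_ms = 0.8;    /* one sequential 8k page read */
    const double rand_fetch_ms = 6.5;   /* one random 8k page read */
    const double tuple_cpu_ms = 0.003;  /* CPU cost of processing one heap tuple */

    /* e.g. a 10,000-page, 1,000,000-tuple sequential scan */
    double      est = 10000 * seq_fetch_ms + 1000000 * tuple_cpu_ms;

    printf("seq_page_cost = %.1f, random_page_cost = %.1f, cpu_tuple_cost = %.3f\n",
           seq_fetch_ms, rand_fetch_ms, tuple_cpu_ms);
    printf("estimated seqscan cost = %.0f, i.e. about %.0f ms on this box\n",
           est, est);
    return 0;
}

The hard part, as noted upthread, is producing measurements like these reliably enough that newcomers can be told to trust them.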