Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
"Nathan Boley"
Date:
Currently eqsel assumes that, except for the values stored as MCVs, the number of times that a value appears in a table column is independent of its value. Unfortunately, this often yields inappropriate selectivity estimates and frequently leads to inappropriate plans.

As an example, consider an insurance company that keeps a record of patient heights. Assume there are 1000000 patient heights in this column, and they are distributed normally with a mean of 1.7526 and a standard deviation of 0.0762. Furthermore, assume that the heights are only measured to the nearest centimeter. Then we'd expect there to be about 73 distinct heights, with an SD of 1.5. Ignoring the effects of MCVs, the planner expects

    SELECT height FROM heights WHERE height = 1.75;

to yield roughly 13000 results. However, given that we know the underlying distribution, we would expect to see ~52000 results. Similarly, the planner expects to see 13000 results from

    SELECT height FROM heights WHERE height = 2.05;

while we expect to see ~2.7.

Obviously this example is not totally convincing: if I were to post this to pgsql-general looking for advice, I'm sure that everyone would tell me to just increase the size of my MCV stats. However, in cases where the number of distinct values is higher, this isn't always feasible. Also, why store a list of 50 values and their frequencies when 10, plus the extra statistic proposed below, would provide the same plans without bloating pg_statistic?

To combat this problem, I have two different proposals.

Idea 1: Keep an array of stadistinct values, one per histogram bucket.

In the example above (again ignoring MCVs) the quantile data is

    0%     10%    20%    30%    40%    50%    60%    70%    80%    90%    100%
    1.38   1.66   1.69   1.71   1.73   1.75   1.77   1.79   1.82   1.85   2.12

with per-bucket ndistinct values of (respectively)

    29  2  2  2  2  2  2  3  3  25

For the two examples above, this new approach would yield selectivity estimates of

    (1000000/10)/2  = 50000  ( vs an actual expected count of ~52000 )
    (1000000/10)/25 = 4000   ( vs an actual expected count of ~2.7 )

Furthermore, this is done without MCVs. Since MCVs would make the histogram more sensitive at the edges, the estimates with MCVs should be correspondingly better.

There are two potential problems that I see with this approach:
1) It assumes that = is equivalent to <= AND >=. This is certainly true for real numbers, but is it true for every equality relation that eqsel predicts for?
2) It bloats the stats table.

Idea 2: Keep a correlation statistic between ndistinct and bucket size.

This addresses problem #2. In lieu of keeping an actual list of ndistinct per histogram bucket, we store the linear scaling coefficient between histogram bucket width and ndistinct/(avg ndistinct). To visualize this, it is easiest to imagine plotting bucket width versus ndistinct. The scaling coefficient is the slope of the line that passes through the origin and minimizes the squared difference between its estimate for ndistinct and the actual value.

When I apply this method to the above data I find a coefficient of 13.63 for an average ndistinct of 72/10. This provides selectivity estimates, for the above two examples, of

    (1000000/10)/( 13.63*7.2*(1.77 - 1.75) ) = 50950  ( vs an actual expected count of ~52000 )
    (1000000/10)/( 13.63*7.2*(2.12 - 1.85) ) = 3774   ( vs an actual expected count of ~2.7 )

Although this yields better results than Idea 1 for this particular example, it will be much more sensitive to weird distributions. Obviously there are some special cases to consider: we wouldn't want the stats to be skewed such that they provide really bad plans.
However, with some carefully designed caps I believe that we could ensure that the estimates are at least as good as they are now. In fact, I'm not certain that an R^2 penalty is the correct loss function. Ideally, we want to minimize the extra time that the db spends by choosing an incorrect plan. Maybe slight overestimations are better than slight underestimations? Maybe the cost of the occasional (really) bad plan is less than the cost of a bunch of kinda-bad plans?

Finally, we aren't limited to just one coefficient. We could also store multiple coefficients to improve our estimates, and provide a compromise between Ideas 1 and 2. Food for future thought...

In addition to the previous benefits, I think that this method has the potential to make the process by which MCVs are chosen (or not chosen) smarter. Currently the planner chooses a value to be an MCV candidate if its frequency is greater than 1.25 * the average frequency. Given that this improved selectivity estimate is implemented, maybe a better way would be to include a value as an MCV if a) it's above a certain threshold and b) the histogram selectivity estimator does a poor job on it.

What are people's thoughts on Idea 1 vs Idea 2? Am I missing any relevant details about the planner's operation? Do people think that the improved estimates would be worth the additional overhead?

-Nathan
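PS: for anyone who wants to play with the example, here is a rough numpy sketch of where the numbers above come from (a simulation only; a fresh run will give slightly different counts, and the bucketing is only approximately what an equi-depth histogram would do):

    # Simulate the insurance-heights column and compare the current flat
    # eqsel estimate with the per-bucket-ndistinct estimates of Ideas 1 and 2.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 1_000_000
    # Heights ~ Normal(1.7526, 0.0762), measured to the nearest centimeter.
    heights = np.round(rng.normal(1.7526, 0.0762, N), 2)

    # A 10-bucket equi-depth histogram (the last boundary is treated as
    # open, which is close enough for this sketch).
    bounds = np.quantile(heights, np.linspace(0, 1, 11))
    ndist = np.array([len(np.unique(heights[(heights >= lo) & (heights < hi)]))
                      for lo, hi in zip(bounds[:-1], bounds[1:])])
    print("boundaries:          ", np.round(bounds, 2))
    print("ndistinct per bucket:", ndist)

    rows_per_bucket = N / 10
    total_ndistinct = len(np.unique(heights))
    b = np.searchsorted(bounds, 1.75, side="right") - 1   # bucket holding 1.75

    # Current eqsel-style estimate (ignoring MCVs): rows / ndistinct.
    print("flat estimate:  ", round(N / total_ndistinct))
    # Idea 1: rows-per-bucket / ndistinct-of-that-bucket.
    print("Idea 1 estimate:", round(rows_per_bucket / ndist[b]))

    # Idea 2: least-squares-through-the-origin coefficient between bucket
    # width and ndistinct/(avg ndistinct).
    widths = np.diff(bounds)
    coef = np.sum(widths * (ndist / np.mean(ndist))) / np.sum(widths ** 2)
    print("Idea 2 estimate:", round(rows_per_bucket / (coef * np.mean(ndist) * widths[b])))

    print("actual count:   ", np.count_nonzero(heights == 1.75))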
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Tom Lane
Date:
"Nathan Boley" <npboley@gmail.com> writes:
> ... There are two potential problems that I see with this approach:
> 1) It assumes that = is equivalent to <= AND >=. This is certainly
> true for real numbers, but is it true for every equality relation that
> eqsel predicts for?

The cases that compute_scalar_stats is used in have that property, since the < and = operators are taken from the same btree opclass.

> Do people think that the improved estimates would be worth the
> additional overhead?

Your argument seems to consider only columns having a normal distribution. How badly does it fall apart for non-normal distributions? (For instance, Zipfian distributions seem to be pretty common in database work, from what I've seen.)

regards, tom lane
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Jeff Davis
Date:
On Sun, 2008-06-08 at 19:03 -0400, Tom Lane wrote:
> Your argument seems to consider only columns having a normal
> distribution. How badly does it fall apart for non-normal
> distributions? (For instance, Zipfian distributions seem to be pretty
> common in database work, from what I've seen.)

If using "Idea 1: Keep an array of stadistinct values, one per histogram bucket," I would expect it to still be a better estimate than what we have currently, because it keeps a separate ndistinct for each histogram bucket.

Regards, Jeff Davis
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
"Nathan Boley"
Date:
> Your argument seems to consider only columns having a normal
> distribution.

My example was based upon normally distributed data because people usually know what a normal distribution is, and such data is reasonably common.

> How badly does it fall apart for non-normal distributions?

This should work to the extent that there is a correlation between the number of distinct values in a histogram bin and the 'size' of the bin. I know that this isn't a very good answer, but it's a hard problem in the general case. However, for any non-uniform probability distribution there exists a measure under which there is a non-zero correlation between these. I wrote up a hand-wavy semi-formal argument at the end, but it's not tough to see intuitively. Just think about the graph of the CDF. The histogram boundaries are horizontal lines that are equally spaced from top to bottom. If we rescale the x-axis such that there is a fixed distance between each distinct value, and define the distance as (ndistinct in a given interval)/(total ndistinct), then the only case where there is no such correlation is when the CDF is f(x) = x, aka the distribution is uniform. And even then we just get a 0 coefficient, which is exactly what we are always assuming now.

Obviously we run into problems when
a) we have a poor estimate for ndistinct - but then we have worse problems
b) our length measure doesn't correspond well with ndistinct in an interval

Expanding on b)... your mention of Zipfian distributions is a particularly good example of where this could be poor. Right now (someone correct me if I'm wrong) words are sorted alphabetically. However, if we wanted this estimator to do a good job, we would sort them by their length or, better yet, frequency in the English language (which is highly correlated with length).

> (For instance, Zipfian distributions seem to be pretty common in
> database work, from what I've seen.)

This should work well for any power law distribution. If people are curious, I can rerun my previous example using a power law distribution instead of a normal one. However, the easy way of thinking about all of this is that we're building a linear model between ndistinct and histogram bucket width. It's intuitive to expect there to be a correlation between the two, and so the model should have some predictive power.

-Nathan

<somewhat formal - probably will be difficult to read without some basic probability theory>

To see this, first note that we can alternatively define the uniform distribution on [a,b] as the distribution whose CDF is a straight line that passes through both (a,0) and (b,1) (just integrate the PDF). So any non-uniform distribution will have a CDF whose slope is both below and above 1/(b-a) at some set of points, implying the existence of an interval [i1, i2] such that

    ( CDF(i2) - CDF(i1) ) < ( i2 - i1 )/(b-a)

Then, under the constraints of the probability measure, there exists a second, disjoint interval [i1', i2'] such that

    ( CDF(i2') - CDF(i1') ) > ( i2' - i1' )/(b-a)

In other words, some intervals carry less than their proportional (i.e. uniform) share of the probability mass, and others more.

Next, assume that the number of potential distinct values in our interval scales linearly with the length of the interval. Although it seems as if this assumption could be somewhat burdensome, there always exists a measure under which this is true for sets with a defined order relation. (As remarked earlier by Tom, we are already making this assumption.) To see this, consider defining length(i1, i2) as (the number of distinct values in [i1, i2])/(total number of distinct values), where the set of distinct values in [i1, i2] is { v | v >= i1 and v <= i2 }.

Next, note that the joint distribution of the value counts of identically distributed, independent random variables is multinomial, with cell probabilities given by the value of the PDF at each distinct point. Next, I'll state without proof that for IID RVs the expected number of distinct values is maximized by a uniform distribution (this is pretty easy to see: think about the binomial case - would you rather have cell probabilities of (1.0, 0) or (0.5, 0.5)?).

Finally, note that the expected number of distinct values decreases faster than linearly in the length of the interval. This is pretty clear when we consider the sparse case. As the number of potential entries (in this case, the interval length) approaches infinity, the probability of a new entry being distinct approaches 1. This means that, in this limit, every new entry ends up being distinct, aka the number of distinct values scales linearly in the number of new entries. As the interval shrinks, new entries have some probability of being repeats. As the interval shrinks to 0, there is a zero probability of new entries being unique. Since
a) there doesn't exist a linear relationship that contains the two boundary points,
b) the multinomial distribution of the PDF is continuous, and
c) the relationship is clearly decreasing,
we can surmise that it is sub-linear.

Therefore, we have two intervals that have sub- and super-linear slopes that cancel one another. However, the change in ndistinct for the super-linear slope scales super-linearly, while the change in ndistinct for the sub-linear slope scales sub-linearly. So there is a net correlation.

</somewhat formal>
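PS: since it's easy, here is the earlier sketch rerun with a power law column instead of a normal one (again just a simulation; integer "ranks" stand in for the real values, so the sort order matches frequency, which is the favorable case described above):

    # Per-bucket ndistinct for a Zipfian column.
    import numpy as np

    rng = np.random.default_rng(0)
    vals = rng.zipf(1.5, 1_000_000)   # power-law distributed ranks
    vals = vals[vals <= 100_000]      # trim the extreme tail for the demo

    bounds = np.quantile(vals, np.linspace(0, 1, 11))
    ndist = [len(np.unique(vals[(vals >= lo) & (vals < hi)]))
             for lo, hi in zip(bounds[:-1], bounds[1:])]
    print("boundaries:          ", np.round(bounds).astype(int))
    print("ndistinct per bucket:", ndist)
    # The first few buckets share duplicate boundaries (those are exactly
    # the values an MCV list would absorb), while the wide tail bucket
    # holds nearly all the distinct values. Bucket width and ndistinct
    # are strongly correlated, which is what Idea 2 needs.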
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Zeugswetter Andreas OSB sIT
Date:
> Obviously we run into problems when
> a) we have a poor estimate for ndistinct - but then we have worse problems
> b) our length measure doesn't correspond well with ndistinct in an interval

One more problem with low ndistinct values is that the condition might very well hit no rows at all, but Idea 1 will largely overestimate the number of hits. E.g. a char(2) field has a histogram bin for 'a1' - 'b1'; ndistinct is 2 because the actual values in the bin are 'a1' and 'a2'. A query for 'a3' now gets a bogus estimate of nrowsperbin / 2.

I think for low ndistinct values we will want to know the exact values + counts and not a bin. So I think we will want additional stats rows that represent "value 'a1' stats".

Andreas
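PS: in numbers (made up, just to illustrate the shape of the problem):

    # A bin covering 'a1' .. 'b1' whose only actual values are 'a1' and 'a2'.
    rows_per_bin = 10_000
    ndistinct_in_bin = 2
    # Idea 1's estimate for any value that falls in the bin, e.g. 'a3':
    print(rows_per_bin / ndistinct_in_bin)   # 5000.0 estimated rows
    # ... but 'a3' never occurs in the table, so the true count is 0.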
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Gregory Stark
Date:
"Zeugswetter Andreas OSB sIT" <Andreas.Zeugswetter@s-itsolutions.at> writes:
> I think for low ndistinct values we will want to know the exact
> values + counts and not a bin. So I think we will want additional stats rows
> that represent "value 'a1' stats".

Isn't that what our most frequent values list does?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Zeugswetter Andreas OSB sIT
Date:
>>> I think for low ndistinct values we will want to know the exact
>>> values + counts and not a bin. So I think we will want additional stats rows
>>> that represent "value 'a1' stats".
>>
>> Isn't that what our most frequent values list does?

Maybe? Do we have the relevant stats for each?

But the trick is to then exclude those values from the histogram bins.

Andreas
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
"Nathan Boley"
Date:
> One more problem with low ndistinct values is that the condition might very well
> hit no rows at all, but Idea 1 will largely overestimate the number of hits.

That's a good point, but I don't see a clear solution. Maybe we could look at past queries and keep track of how often they return empty result sets? It seems that, in some ways, we care about the distribution of the query values in addition to the column values...

>>> I think for low ndistinct values we will want to know the exact
>>> values + counts and not a bin. So I think we will want additional stats rows
>>> that represent "value 'a1' stats".
>>
>> Isn't that what our most frequent values list does?
>
> Maybe? Do we have the relevant stats for each?
> But the trick is to then exclude those values from the histogram bins.

Currently, the histogram is only made up of non-MCV values.
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Jeff Davis
Date:
On Tue, 2008-06-10 at 08:51 -0700, Nathan Boley wrote:
>> One more problem with low ndistinct values is that the condition might very well
>> hit no rows at all, but Idea 1 will largely overestimate the number of hits.
>
> That's a good point, but I don't see a clear solution. Maybe we could

I think that MCVs are the solution, right? A low ndistinct means that those values will likely be MCVs.

Since you brought up a different process for choosing MCVs, maybe a better name might be "Most Interesting Values".

Regards, Jeff Davis
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
"Nathan Boley"
Date:
>>> One more problem with low ndistinct values is that the condition might very well
>>> hit no rows at all, but Idea 1 will largely overestimate the number of hits.
>>
>> That's a good point, but I don't see a clear solution. Maybe we could
>
> I think that MCVs are the solution, right?

Only if they cover the entire range of values in the table.

> A low ndistinct means that those values will likely be MCVs.

Yes, but I don't think that's the point. If we query on values that aren't in the table, the planner will always overestimate the expected number of returned rows, because it (implicitly) assumes that every query will return at least 1 record.
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Tom Lane
Date:
"Nathan Boley" <npboley@gmail.com> writes:
> If we query on values that aren't in the table, the planner will
> always overestimate the expected number of returned rows, because it
> (implicitly) assumes that every query will return at least 1 record.

That's intentional and should not be changed. I can't see the value of allowing fractional-row estimates anyway.

Now if we can *prove* the query matches no rows, that's a whole nother matter, but statistics won't help us with that.

regards, tom lane
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
"Nathan Boley"
Date:
>> If we query on values that aren't in the table, the planner will
>> always overestimate the expected number of returned rows, because it
>> (implicitly) assumes that every query will return at least 1 record.
>
> That's intentional and should not be changed.

Why? What if (somehow) we knew that there was a 90% chance that a query would return an empty result set on a big table with 20 non-MCV distinct values? Currently the planner would always choose a seq scan, where an index scan might be better. Better yet, couldn't that be optimized to *if the record exists, execute a seq scan*?

That being said, I think queries are generally searching for values that exist in the table.

> I can't see the value of allowing fractional-row estimates anyway.

Neither can I, but I could probably think of cases where knowing the SD of the result set could result in better plans.

-Nathan
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Tom Lane
Date:
"Nathan Boley" <npboley@gmail.com> writes:
>>> If we query on values that aren't in the table, the planner will
>>> always overestimate the expected number of returned rows, because it
>>> (implicitly) assumes that every query will return at least 1 record.
>>
>> That's intentional and should not be changed.
>
> Why? What if (somehow) we knew that there was a 90% chance that a
> query would return an empty result set on a big table with 20 non-MCV
> distinct values? Currently the planner would always choose a seq scan,
> where an index scan might be better.

(1) On what grounds do you assert the above?

(2) What makes you think that an estimate of zero rather than one row would change the plan?

(In fact, I don't think the plan would change, in this case. The reason for the clamp to 1 row is to avoid foolish results for join situations.)

regards, tom lane
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
"Nathan Boley"
Date:
>> Why? What if (somehow) we knew that there was a 90% chance that a
>> query would return an empty result set on a big table with 20 non-MCV
>> distinct values? Currently the planner would always choose a seq scan,
>> where an index scan might be better.
>
> (1) On what grounds do you assert the above?

For a table with 1000000 non-MCV rows, the planner estimates a result set of cardinality 1000000/20 = 50000, not 1.

> (2) What makes you think that an estimate of zero rather than one row
> would change the plan?

I see where the confusion is coming from. When I said

>> What if ( somehow ) we knew that there was a 90%
>> chance that query would return an empty result set

I meant that the planner doesn't know that information. And how could it? The estimate for ndistinct is an estimate for the number of distinct values in the table, not an estimate for the number of distinct values that will be queried for. My original point was that we sometimes care about the distribution of what's being queried for and not just what's in the table. But this is all silly anyway: if this was really a concern you would write a function along the lines of

    if values exist
        return values
    else
        return none

> (In fact, I don't think the plan would change, in this case. The reason
> for the clamp to 1 row is to avoid foolish results for join situations.)

Which makes sense. My point certainly wasn't, in any way, a criticism of clamping selectivity to 1.

Cheers, Nathan
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Tom Lane
Date:
"Nathan Boley" <npboley@gmail.com> writes:
>> (1) On what grounds do you assert the above?
>
> For a table with 1000000 non-MCV rows, the planner estimates a result
> set of cardinality 1000000/20 = 50000, not 1.

The real problem in that situation is that you need another twenty slots in the MCV list. The MCV list should *always* exhaust the set of values for which it'd be bad to do an indexscan. Assuming that the threshold for switching to an indexscan is somewhere around selectivity 0.005 (I am not certain offhand, but it's in that general area), this cannot possibly require more than 200 MCV slots, and for most data distributions it'd be a whole lot less.

Given such an MCV list, the planner will always make the right choice of whether to do index or seqscan ... as long as it knows the value being searched for, that is. Parameterized plans have a hard time here, but that's not really the fault of the statistics.

> I see where the confusion is coming from. When I said
>> What if ( somehow ) we knew that there was a 90%
>> chance that query would return an empty result set
> I meant that the planner doesn't know that information. And how could it?

Hmm. IIRC the estimates are set up on the assumption that you are searching for a value that occurs in the table. I suppose there are applications where that's often false, but as you say, it's hard to know that in advance.

regards, tom lane
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> (In fact, I don't think the plan would change, in this case. The reason
> for the clamp to 1 row is to avoid foolish results for join situations.)

The screw case I've seen is when you have a large partitioned table where constraint_exclusion fails to exclude the irrelevant partitions. You're going to get 0 rows from all but the one partition which contains the 1 row you're looking for. But since each partition is clamped to 1, you end up with an estimate of a few hundred rows coming out of the Append node.

The natural way to kill this is to allow fractional rows for these scans. We know they're usually going to be producing 0, so if the estimates produced the right average expected value, the sum would add up to 1 and the Append node would get the right value.

Alternatively we could make Append more clever about estimating the number of rows it produces. Somehow it could be informed of some holistic view of the quals on its children and how they're inter-dependent. If it's told that only one of its children will produce rows then it can use max() instead of sum() to calculate its rows estimate.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> The screw case I've seen is when you have a large partitioned table where
> constraint_exclusion fails to exclude the irrelevant partitions. You're going
> to get 0 rows from all but the one partition which contains the 1 row you're
> looking for. But since each partition is clamped to 1, you end up with an
> estimate of a few hundred rows coming out of the Append node.
>
> The natural way to kill this is to allow fractional rows for these scans.

No, the right fix is to fix the constraint-exclusion failure.

> Alternatively we could make Append more clever about estimating the number of
> rows it produces. Somehow it could be informed of some holistic view of the
> quals on its children and how they're inter-dependent. If it's told that only
> one of its children will produce rows then it can use max() instead of sum()
> to calculate its rows estimate.

This gets back to the discussions at PGCon about needing to have a more explicit representation of partitioning. Right now, for a many-partition table we spend huge amounts of time deriving the expected behavior from first principles, each time we make a plan. And even then we can only get it right for limited cases (eg, no parameterized queries). If the planner actually understood that a set of tables formed a partitioned set then it'd be a lot easier and faster to get the desired behavior, not only with respect to the rowcount estimates but the plan's structure.

regards, tom lane
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> The screw case I've seen is when you have a large partitioned table where
>> constraint_exclusion fails to exclude the irrelevant partitions. You're going
>> to get 0 rows from all but the one partition which contains the 1 row you're
>> looking for. But since each partition is clamped to 1, you end up with an
>> estimate of a few hundred rows coming out of the Append node.
>>
>> The natural way to kill this is to allow fractional rows for these scans.
>
> No, the right fix is to fix the constraint-exclusion failure.

There are always going to be cases where that's not possible.

It's possible that the second option I described -- teaching Append when to use something other than sum() -- would only work in the cases where constraint exclusion could be fixed, though. In which case having fractional row counts might actually be necessary.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> It's possible that the second option I described -- teaching Append when to
> use something other than sum() -- would only work in the cases where
> constraint exclusion could be fixed, though. In which case having fractional
> row counts might actually be necessary.

The above is just armwaving. IMHO, if you don't understand the structure of the table set then you're not going to be able to get the desired behavior via fractional rowcounts either.

regards, tom lane
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Andrew Dunstan
Date:
Tom Lane wrote:
> This gets back to the discussions at PGCon about needing to have a more
> explicit representation of partitioning. Right now, for a
> many-partition table we spend huge amounts of time deriving the expected
> behavior from first principles, each time we make a plan. And even then
> we can only get it right for limited cases (eg, no parameterized
> queries). If the planner actually understood that a set of tables
> formed a partitioned set then it'd be a lot easier and faster to get the
> desired behavior, not only with respect to the rowcount estimates but
> the plan's structure.

Even if this doesn't solve every problem, it's surely worth doing for those it will solve.

cheers

andrew
Re: Runtime checking of MCV (Was: ... histogram bucket numdistinct statistics)
From
Csaba Nagy
Date:
On Tue, 2008-06-10 at 19:03 -0400, Tom Lane wrote:
> Given such an MCV list, the planner will always make the right choice
> of whether to do index or seqscan ... as long as it knows the value
> being searched for, that is. Parameterized plans have a hard time here,
> but that's not really the fault of the statistics.

This is maybe the best example of where multiple (sub)plans could be glued together with some kind of plan-fork node, so that the actual plan to be executed is decided by checking the parameter values against the statistics at runtime instead of plan time. So the planner creates alternative (sub)plans (e.g. seqscan vs index scan) for the cases where the parameters are MCVs or not, and then places them in different branches of a runtime check of the parameter values vs the statistics.

Of course the number of branches must be limited; this would be the challenge of such a feature... to cover the parameter space with the minimal number of plan branches so that disastrous plans for special parameter values are avoided. It would also be possible, perhaps, to gradually grow the alternative counts as a reaction to the actual parameter values used by queries, so that only the parameter space actually in use by queries is covered.

In fact I would be interested in experimenting with this. Would it be possible to add new planner behavior as external code? I would expect not, as the planner is also in charge of the correctness of the results, and any external code would put that correctness at risk, I guess... in any case, I'll go and check the source.

BTW, there was a discussion about global prepared statements/caching of query plans; is there any advance on that? Thorough planning would make the most sense in that context, possibly by using a special syntax for the application to signal the need for such planning for the most problematic (not necessarily the most used) queries.

Cheers, Csaba.
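PS: in pseudocode, the kind of runtime fork I have in mind (all names invented; nothing like this exists in the planner today):

    # Choose between pre-built subplans at execution time by checking the
    # bound parameter against the column statistics.
    def execute_fork(param, stats, seqscan_plan, indexscan_plan):
        if param in stats.mcv_values:          # frequent value: scan the heap
            return seqscan_plan.execute(param)
        return indexscan_plan.execute(param)   # rare value: use the index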
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> It's possible that the second option I described -- teaching Append when to
>> use something other than sum() -- would only work in the cases where
>> constraint exclusion could be fixed, though. In which case having fractional
>> row counts might actually be necessary.
>
> The above is just armwaving. IMHO, if you don't understand the
> structure of the table set then you're not going to be able to get the
> desired behavior via fractional rowcounts either.

That's only a specific subset of cases. You could just as easily have quals which are only coincidentally related to the partition key, or even not related at all, just very selective, producing no records from some partitions.

The bottom line is that if you have a large table, our statistics do a good job estimating the selectivity of a WHERE clause, with the minimum clamped to 1. If you partition it into 100 partitions, then the minimum is clamped to 100.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
"Nathan Boley"
Date:
> Assuming that the threshold
> for switching to an indexscan is somewhere around selectivity 0.005
> (I am not certain offhand, but it's in that general area), this cannot
> possibly require more than 200 MCV slots, and for most data
> distributions it'd be a whole lot less.

That's a really good point.

> Given such an MCV list, the planner will always make the right choice
> of whether to do index or seqscan

Given that, wouldn't it be smarter to consider a value as an MCV candidate iff it has a density greater than 0.005, rather than a count greater than 1.25 * the average? This would allow people to raise the hard MCV limit without having to worry as much about including worthless MCV values...

Cheers, Nathan
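PS: the difference between the two admission rules, on made-up sample numbers (the real logic in analyze.c has more conditions than this sketch):

    # Compare the current count-based MCV rule with a density-based one.
    sample_rows = 30_000
    n_distinct = 5_000
    value_counts = {"a": 900, "b": 400, "c": 150, "d": 150, "e": 30}

    avg_count = sample_rows / n_distinct   # = 6 rows per distinct value

    # Current-style rule: count above 1.25 * the average frequency.
    current = [v for v, c in value_counts.items() if c > 1.25 * avg_count]

    # Proposed rule: density above the seqscan/indexscan switch point.
    proposed = [v for v, c in value_counts.items() if c / sample_rows > 0.005]

    print(current)    # ['a', 'b', 'c', 'd', 'e'] -- everything above average
    print(proposed)   # ['a', 'b'] -- only the values dense enough to matter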
Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
From
Tom Lane
Date:
"Nathan Boley" <npboley@gmail.com> writes:
> Given that, wouldn't it be smarter to consider a value as an MCV
> candidate iff it has a density greater than 0.005, rather than a
> count greater than 1.25 * the average?

Yeah, perhaps ... want to experiment with that? Though I'd be a bit worried about how to get the threshold right, seeing that it will depend a whole lot on what the user has selected for random_page_cost and other parameters.

regards, tom lane