Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0? - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
Msg-id: 20201030234039.5ficlsnzgvm4d5ob@development
In response to: Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0? (Zhenghua Lyu <zlyu@vmware.com>)
Responses: Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
List: pgsql-hackers
On Mon, Oct 26, 2020 at 03:01:41PM +0000, Zhenghua Lyu wrote:
>Hi,
>    when group by multi-columns, it will multiply all the distinct values together, and if one column is all null,
>    it also contributes 200 to the final estimate, and if the product is over the relation size, it will be clamped.
>
>    So the value of the agg rel size is not correct, and it impacts the upper path's cost estimate and does not
>    give a good plan.
>
>    I debugged some other queries and found this issue, but I am not sure if it is the root cause of my problem,
>    so I just opened a thread here for discussion.

I think we understand what the issue is, in principle - if the column is
all-null, the ndistinct estimate of 200 is bogus, and when multiplied with
estimates for other Vars it may lead to over-estimates. That's a valid
issue, of course.

The question is whether the proposed patch is a good way to handle it. I'm
not sure what exactly Tom's concerns are, but I was worried that relying on
(stanullfrac == 1.0) might result in abrupt changes in estimates when the
actual difference in the data is minor. For example, if a column is "almost
NULL" we may end up with either 1.0 or (1.0 - epsilon), and the question is
what estimates we end up with ...

Imagine a column that is "almost NULL" - it's 99.99% NULL with a couple of
non-NULL values. When ANALYZE samples just NULLs, we'll end up with

    n_distinct = 0.0
    stanullfrac = 1.0

and we'll end up estimating either 200 (current estimate) or 1.0 (with this
patch).

Now, what if stanullfrac is not 1.0 but a little bit less? Say only 1 of
the 30k rows is non-NULL? Well, in that case we won't even get to this
condition, because we'll have

    n_distinct = -3.3318996e-05
    stanullfrac = 0.9999667

which means get_variable_numdistinct will return from either

    if (stadistinct > 0.0)
        return ...

or

    if (stadistinct < 0.0)
        return ...

and we'll never even reach the new condition. And by definition, the
estimate has to be very low, because otherwise we'd need more non-NULL
distinct rows in the sample, which makes it less likely to ever see
stanullfrac being 1.0. And even if we could get a bigger difference (say,
50 vs. 1.0), I don't think that's very different from the current situation
with 200 as a default.

Of course, using 1.0 in these cases may make us more vulnerable to
under-estimates for large tables. But for that to happen we must not sample
any of the non-NULL values, and if there are many distinct values that's
probably even less likely than sampling just one (when we end up with an
estimate of 1.0 already).

So I'm not sure I understand what the risk with this would be ... Tom, can
you elaborate on why you dislike the patch?

BTW we already have a way to improve the estimate - setting n_distinct for
the column to 1.0 using ALTER TABLE should do the trick, I think.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
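To make the control flow being discussed easier to follow, here is a small
standalone C sketch of the estimation logic. It is not the actual code from
src/backend/utils/adt/selfuncs.c and not the proposed patch; the function
name sketch_numdistinct and the all_null_returns_one flag are invented for
illustration only.

    /*
     * Simplified sketch of the ndistinct estimation discussed above.
     * NOT the real get_variable_numdistinct(), just its rough shape.
     */
    #include <stdio.h>

    #define DEFAULT_NUM_DISTINCT 200.0

    static double
    sketch_numdistinct(double stadistinct, double stanullfrac,
                       double ntuples, int all_null_returns_one)
    {
        /* positive stadistinct is an absolute ndistinct estimate */
        if (stadistinct > 0.0)
            return stadistinct;

        /* negative stadistinct is a fraction of the table's rows */
        if (stadistinct < 0.0)
            return -stadistinct * ntuples;

        /* hypothetical check along the lines of the patch under discussion */
        if (all_null_returns_one && stanullfrac == 1.0)
            return 1.0;

        /* no usable statistics -> fall back to the default */
        return DEFAULT_NUM_DISTINCT;
    }

    int
    main(void)
    {
        /* all-NULL sample: stadistinct = 0.0, stanullfrac = 1.0 */
        printf("all-NULL, current:  %g\n",
               sketch_numdistinct(0.0, 1.0, 30000, 0));  /* 200 */
        printf("all-NULL, patched:  %g\n",
               sketch_numdistinct(0.0, 1.0, 30000, 1));  /* 1 */

        /* 1 non-NULL row in ~30k rows: the new branch is never reached,
         * because the (stadistinct < 0.0) branch returns ~1 first
         * (the real code additionally clamps this to at least 1) */
        printf("almost-NULL:        %g\n",
               sketch_numdistinct(-3.3318996e-05, 0.9999667, 30000, 0));
        return 0;
    }

Run as shown, the all-NULL case prints 200 without the check and 1 with it,
while the almost-NULL case already yields roughly 1 from the negative
stadistinct branch - which is the point made above about the patch not
causing abrupt jumps between the two cases.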