Thread: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
From
Zhenghua Lyu
Date:
Hi hackers,
It seems the function `get_variable_numdistinct` ignores the case when stanullfrac is 1.0:
# create table t(a int, b int);
CREATE TABLE
# insert into t select i from generate_series(1, 10000)i;
INSERT 0 10000
gpadmin=# analyze t;
ANALYZE
# explain analyze select b, count(1) from t group by b;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=195.00..197.00 rows=200 width=12) (actual time=5.928..5.930 rows=1 loops=1)
   Group Key: b
   Batches: 1  Memory Usage: 40kB
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.018..1.747 rows=10000 loops=1)
 Planning Time: 0.237 ms
 Execution Time: 5.983 ms
(6 rows)
So it gives the estimate using the default value: 200.
I have added some lines of code to take `stanullfrac == 1.0` into account. With the patch attached, we now get:
# explain analyze select b, count(1) from t group by b;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=195.00..195.01 rows=1 width=12) (actual time=6.163..6.164 rows=1 loops=1)
   Group Key: b
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.024..1.823 rows=10000 loops=1)
 Planning Time: 0.535 ms
 Execution Time: 6.344 ms
(6 rows)
I am not sure how valuable this change is in a practical environment, but it seems to be a step in the right direction.
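The patch itself is an attachment against the C function in selfuncs.c and is not reproduced here. As a rough illustration only, here is a toy Python model of the idea; it mirrors get_variable_numdistinct's decision order only loosely (the function name and structure below are mine, not the real code), with 200 being the actual DEFAULT_NUM_DISTINCT:

```python
# Toy model of the idea behind the patch -- NOT the real C code from
# selfuncs.c, just an illustration of where a stanullfrac == 1.0 check
# would short-circuit the default estimate.

DEFAULT_NUM_DISTINCT = 200.0  # fallback when ndistinct is unknown

def toy_numdistinct(stadistinct, stanullfrac, ntuples):
    """Loose sketch of get_variable_numdistinct's decision order."""
    # Proposed special case: an all-NULL column has no distinct
    # non-null values, so estimate 1 group instead of the default.
    if stanullfrac == 1.0:
        return 1.0
    if stadistinct > 0.0:
        return stadistinct             # absolute number of distinct values
    if stadistinct < 0.0:
        return -stadistinct * ntuples  # fraction of rows that are distinct
    return DEFAULT_NUM_DISTINCT        # stadistinct == 0 -> "unknown"

# All-NULL column from the example: stadistinct = 0, stanullfrac = 1.0
print(toy_numdistinct(0.0, 1.0, 10000))  # patched behavior: 1.0
print(toy_numdistinct(0.0, 0.0, 10000))  # unknown: falls back to 200.0
```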
Any comments on this are appreciated.
Attachment
Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
From
Tom Lane
Date:
Zhenghua Lyu <zlyu@vmware.com> writes:
> It seems the function `get_variable_numdistinct` ignores the case when stanullfrac is 1.0:

I don't like this patch at all.  What's the argument for having a special
case for this value?  When would we ever get exactly 1.0 in practice?

			regards, tom lane
Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
From
Zhenghua Lyu
Date:
Hi,
when grouping by multiple columns, the planner multiplies all the per-column distinct estimates together; if one column is all NULL, it still contributes the default of 200 to the final estimate, and if the product exceeds the relation size, it is clamped.

So the estimated size of the aggregate rel is incorrect, which distorts the cost estimates of the upper paths and does not give a good plan.

I found this issue while debugging some other queries, but I am not sure it is the root cause of my problem, so I am just opening a thread here for discussion.
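The multiply-then-clamp effect described above can be sketched in a few lines of Python (a toy model, not PostgreSQL's actual estimate_num_groups code; the function name is mine):

```python
# Sketch of how a bogus per-column default inflates a multi-column
# GROUP BY estimate: multiply the per-column ndistinct values, then
# clamp to the relation size.

def toy_group_estimate(per_column_ndistinct, rel_rows):
    est = 1.0
    for nd in per_column_ndistinct:
        est *= nd
    # clamp: there cannot be more groups than input rows
    return min(est, rel_rows)

# Column a: 100 distinct values; column b: all NULL, but estimated at
# the default of 200 instead of 1, so the product hits the clamp.
print(toy_group_estimate([100.0, 200.0], 10000.0))  # clamped to 10000
# With the all-NULL column estimated as 1 group, the estimate is sane:
print(toy_group_estimate([100.0, 1.0], 10000.0))    # 100.0
```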
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Monday, October 26, 2020 10:37 PM
To: Zhenghua Lyu <zlyu@vmware.com>
Cc: pgsql-hackers@lists.postgresql.org <pgsql-hackers@lists.postgresql.org>
Subject: Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
Zhenghua Lyu <zlyu@vmware.com> writes:
> It seems the function `get_variable_numdistinct` ignores the case when stanullfrac is 1.0:
I don't like this patch at all. What's the argument for having a special
case for this value? When would we ever get exactly 1.0 in practice?
regards, tom lane
Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
From
Tomas Vondra
Date:
On Mon, Oct 26, 2020 at 03:01:41PM +0000, Zhenghua Lyu wrote:
>Hi,
>    when grouping by multiple columns, the planner multiplies all the
>per-column distinct estimates together; if one column is all NULL, it
>still contributes the default of 200 to the final estimate, and if the
>product exceeds the relation size, it is clamped.
>
>    So the estimated size of the aggregate rel is incorrect, which
>distorts the cost estimates of the upper paths and does not give a good
>plan.
>
>    I found this issue while debugging some other queries, but I am not
>sure it is the root cause of my problem, so I am just opening a thread
>here for discussion.

I think we understand what the issue is, in principle - if the column is
all-null, the ndistinct estimate 200 is bogus and when multiplied with
estimates for other Vars it may lead to over-estimates. That's a valid
issue, of course.

The question is whether the proposed patch is a good way to handle it.
I'm not sure what exactly Tom's concerns are, but I was worried that
relying on (stanullfrac == 1.0) might result in abrupt changes in
estimates when there's only a minor difference in the statistics. For
example, for a column that is "almost NULL" we may end up with either
1.0 or (1.0 - epsilon), and the question is what estimates we end up
with ...

Imagine a column that is 'almost NULL' - it's 99.99% NULL with a couple
non-NULL values. When the ANALYZE sample is just NULLs, we'll end up with

    n_distinct = 0.0
    stanullfrac = 1.0

and we'll end up estimating either 200 (current estimate) or 1.0 (with
this patch). Now, what if stanullfrac is not 1.0 but a little bit less?
Say only 1 of the 30k rows is non-NULL? Well, in that case we'll not
even get to this condition, because we'll have

    n_distinct = -3.3318996e-05
    stanullfrac = 0.9999667

which means get_variable_numdistinct will return from either

    if (stadistinct > 0.0)
        return ...

or

    if (stadistinct < 0.0)
        return ...

and we'll never even get to that new condition.
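The branch order described here can be sketched as a toy Python model (not the real C function; the function name is mine, and the stadistinct value is the one from the 1-of-30k example):

```python
# Toy model of the branch order in get_variable_numdistinct, showing
# why an "almost NULL" column never reaches a stanullfrac == 1.0 check:
# it returns early from the stadistinct < 0 branch.

def toy_numdistinct_branches(stadistinct, ntuples):
    if stadistinct > 0.0:
        return ("positive branch", stadistinct)
    if stadistinct < 0.0:
        # negative stadistinct means "this fraction of rows is distinct"
        return ("negative branch", -stadistinct * ntuples)
    return ("default branch", 200.0)

# 1 non-NULL row out of 30000: ANALYZE stores a tiny negative fraction,
# so we return from the negative branch with an estimate near 1.
branch, est = toy_numdistinct_branches(-3.3318996e-05, 30000)
print(branch, est)
```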
And by definition, the estimate has to be very low, because otherwise
we'd need more non-NULL distinct rows in the sample, which makes it less
likely to ever see stanullfrac being 1.0. And even if we could get a
bigger difference (say, 50 vs. 1.0), I don't think that's very different
from the current situation with 200 as a default.

Of course, using 1.0 in these cases may make us more vulnerable to
under-estimates for large tables. But for that to happen we must not
sample any of the non-NULL values, and if there are many distinct values
that's probably even less likely than sampling just one (when we end up
with an estimate of 1.0 already). So I'm not sure I understand what
would be the risk with this ... Tom, can you elaborate on why you
dislike the patch?

BTW we already have a way to improve the estimate - setting n_distinct
for the column to 1.0 using ALTER TABLE should do the trick, I think.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> So I'm not sure I understand what would be the risk with this ... Tom,
> can you elaborate why you dislike the patch?

I've got a couple issues with the patch as presented.

* As you said, it creates discontinuous behavior for stanullfrac = 1.0
versus stanullfrac = 1.0 - epsilon.  That doesn't seem good.

* It's not apparent why, if ANALYZE's sample is all nulls, we wouldn't
conclude stadistinct = 0 and thus arrive at the desired answer that way.
(Since we have a complaint, I'm guessing that ANALYZE might disbelieve
its own result and stick in some larger stadistinct.  But then maybe
that's where to fix this, not here.)

* We generally disbelieve edge-case estimates to begin with.  The most
obvious example is that we don't accept rowcount estimates that are
zero.  There are also some clamps that disbelieve selectivities
approaching 0.0 or 1.0 when estimating from a histogram, and I think we
have a couple other similar rules.  The reason for this is mainly that
taking such estimates at face value creates too much risk of severe
relative error due to imprecise or out-of-date statistics.  So a special
case for stanullfrac = 1.0 seems to go directly against that mindset.

I agree that there might be some gold to be mined in this area, as we
haven't thought particularly hard about high-stanullfrac situations.
One idea is to figure what stanullfrac says about the number of non-null
rows, and clamp the get_variable_numdistinct result to be not more than
that.  But I still would not want to trust an exact zero result.

			regards, tom lane
Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
From
Tom Lane
Date:
I wrote:
> * It's not apparent why, if ANALYZE's sample is all nulls, we wouldn't
> conclude stadistinct = 0 and thus arrive at the desired answer that
> way.  (Since we have a complaint, I'm guessing that ANALYZE might
> disbelieve its own result and stick in some larger stadistinct.  But
> then maybe that's where to fix this, not here.)

Oh, on second thought (and with some testing): ANALYZE *does* report
stadistinct = 0.  The real issue is that get_variable_numdistinct is
assuming it can use that value as meaning "stadistinct is unknown".
So maybe we should just fix that, probably by adding an explicit bool
flag for that condition.

BTW ... I've not looked at the callers, but now I'm wondering whether
get_variable_numdistinct ought to count NULL as one of the "distinct"
values.  In applications such as estimating the number of GROUP BY
groups, it seems like that would be correct.  There might be some
callers that don't want it though.

			regards, tom lane
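The point about counting NULL as one of the "distinct" values can be seen with a toy example (Python standing in for SQL semantics here: GROUP BY puts all NULLs into a single group, so an all-NULL column produces 1 group, not 0):

```python
# GROUP BY treats all NULLs as a single group, so for group-count
# estimation an all-NULL column effectively has 1 "distinct" value.
# None stands in for SQL NULL; a Python set collapses all Nones into
# one element, matching GROUP BY's one-NULL-group behavior.

def count_groups(values):
    return len(set(values))

print(count_groups([None] * 5))          # 1 -- a single all-NULL group
print(count_groups([1, 2, None, None]))  # 3 -- two values plus the NULL group
```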