Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
Date
Msg-id 20201030234039.5ficlsnzgvm4d5ob@development
Whole thread Raw
In response to Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?  (Zhenghua Lyu <zlyu@vmware.com>)
Responses Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
List pgsql-hackers
On Mon, Oct 26, 2020 at 03:01:41PM +0000, Zhenghua Lyu wrote:
>Hi,
>    when group by multi-columns, it will multiply all the distinct values together, and if one column is all null,
>    it also contributes 200 to the final estimate, and if the product is over the relation size, it will be clamp.
>
>    So the the value of the agg rel size is not correct, and impacts the upper path's cost estimate, and do not
>    give a good plan.
>
>    I debug some other queries and find this issue, but not sure if this issue is the root cause of my problem,
>    just open a thread here for discussion.

I think we understand what the issue is, in principle - if the column is
all-null, the ndistinct estimate 200 is bogus and when multiplied with
estimates for other Vars it may lead to over-estimates. That's a valid
issue, of course.

The question is whether the proposed patch is a good way to handle it.

I'm not sure what exactly are Tom's concerns, but I was worried relying
on (stanullfrac == 1.0) might result in abrupt changes in estimates when
that's a minor difference. For example if column is "almost NULL" we may
end up with either 1.0 or (1.0 - epsilon) and the question is what
estimates we end up with ...

Imagine a column that is 'almost NULL' - it's 99.99% NULL with a couple
non-NULL values. When the ANALYZE samples just NULLs, we'll end up with

   n_distinct = 0.0
   stanullfrac = 1.0

and we'll end up estimating either 200 (current estimate) or 1.0 (with
this patch). Now, what if stanullfrac is not 1.0 but a little bit less?
Say only 1 of the 30k rows is non-NULL? Well, in that case we'll not
even get to this condition, because we'll have

   n_distinct = -3.3318996e-05
   stanullfrac = 0.9999667

which means get_variable_numdistinct will return from either

     if (stadistinct > 0.0)
         return ...

or

     if (stadistinct < 0.0)
         return ...

and we'll never even get to that new condition. And by definition, the
estimate has to be very low, because otherwise we'd need more non-NULL
distinct rows in the sample, which makes it less likely to ever see
stanullfrac being 1.0. And even if we could get a bigger difference
(say, 50 vs. 1.0), but I don't think that's very different from the
current situation with 200 as a default.

Of course, using 1.0 in these cases may make us more vulnerable to
under-estimates for large tables. But for that to happen we must not
sample any of the non-NULL values, and if there are many distinct values
that's probably even less likely than sampling just one (when we end up
with an estimate of 1.0 already).

So I'm not sure I understand what would be the risk with this ... Tom,
can you elaborate why you dislike the patch?


BTW we already have a way to improve the estimate - setting n_distinct
for the column to 1.0 using ALTER TABLE should do the trick, I think.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



pgsql-hackers by date:

Previous
From: James Coleman
Date:
Subject: Re: enable_incremental_sort changes query behavior
Next
From: Tomas Vondra
Date:
Subject: Re: A couple questions about ordered-set aggregates