Thread: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?

Hi hackers,
   
   It seems the function `get_variable_numdistinct` ignores the case when stanullfrac is 1.0. Note that in the example below only column a is filled, so b is entirely NULL:

# create table t(a int, b int);
CREATE TABLE
# insert into t select i from generate_series(1, 10000)i;
INSERT 0 10000
# analyze t;
ANALYZE
# explain analyze select b, count(1) from t group by b;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=195.00..197.00 rows=200 width=12) (actual time=5.928..5.930 rows=1 loops=1)
   Group Key: b
   Batches: 1  Memory Usage: 40kB
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.018..1.747 rows=10000 loops=1)
 Planning Time: 0.237 ms
 Execution Time: 5.983 ms
(6 rows)

So for the all-NULL column b, the planner falls back to the default ndistinct value of 200 for the group estimate.


I have added some lines of code to take `stanullfrac == 1.0` into account. With the patch attached, we now get:

# explain analyze select b, count(1) from t group by b;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=195.00..195.01 rows=1 width=12) (actual time=6.163..6.164 rows=1 loops=1)
   Group Key: b
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.024..1.823 rows=10000 loops=1)
 Planning Time: 0.535 ms
 Execution Time: 6.344 ms
(6 rows)

I am not sure how valuable this change is in practical environments, but it seems like a step in the correct direction.

Any comments on this are appreciated.


Zhenghua Lyu <zlyu@vmware.com> writes:
>    It seems the function `get_variable_numdistinct` ignore the case when stanullfrac is 1.0:

I don't like this patch at all.  What's the argument for having a special
case for this value?  When would we ever get exactly 1.0 in practice?

            regards, tom lane



Hi,
    When grouping by multiple columns, the planner multiplies the per-column
    distinct estimates together. A column that is entirely NULL still
    contributes the default 200 to that product, and if the product exceeds
    the relation size it is clamped.

    As a result, the estimated size of the aggregate rel is incorrect, which
    skews the cost estimates of the upper paths and can lead to a poor plan.

    I found this issue while debugging some other queries. I am not sure it
    is the root cause of my original problem; I am just opening a thread here
    for discussion.

On Mon, Oct 26, 2020 at 03:01:41PM +0000, Zhenghua Lyu wrote:
>Hi,
>    When grouping by multiple columns, the planner multiplies the per-column
>    distinct estimates together. A column that is entirely NULL still
>    contributes the default 200 to that product, and if the product exceeds
>    the relation size it is clamped.
>
>    As a result, the estimated size of the aggregate rel is incorrect, which
>    skews the cost estimates of the upper paths and can lead to a poor plan.
>
>    I found this issue while debugging some other queries. I am not sure it
>    is the root cause of my original problem; I am just opening a thread here
>    for discussion.

I think we understand what the issue is, in principle - if the column is
all-null, the ndistinct estimate 200 is bogus and when multiplied with
estimates for other Vars it may lead to over-estimates. That's a valid
issue, of course.

The question is whether the proposed patch is a good way to handle it.

I'm not sure what exactly Tom's concerns are, but I was worried that
relying on (stanullfrac == 1.0) might result in abrupt changes in
estimates from only a minor difference in the statistics. For example,
if a column is "almost NULL", ANALYZE may compute either 1.0 or
(1.0 - epsilon), and the question is what estimates we end up with in
each case ...

Imagine a column that is 'almost NULL' - it's 99.99% NULL with a couple
of non-NULL values. When ANALYZE samples just NULLs, we'll end up with

   n_distinct = 0.0
   stanullfrac = 1.0

and we'll end up estimating either 200 (current estimate) or 1.0 (with
this patch). Now, what if stanullfrac is not 1.0 but a little bit less?
Say only 1 of the 30k rows is non-NULL? Well, in that case we'll not
even get to this condition, because we'll have

   n_distinct = -3.3318996e-05
   stanullfrac = 0.9999667

which means get_variable_numdistinct will return from either

     if (stadistinct > 0.0)
         return ...

or

     if (stadistinct < 0.0)
         return ...

and we'll never even get to that new condition. And by definition, the
estimate has to be very low, because otherwise we'd need more non-NULL
distinct rows in the sample, which makes it less likely to ever see
stanullfrac being 1.0. Even if we could get a bigger difference
(say, 50 vs. 1.0), I don't think that's very different from the
current situation with 200 as a default.

Of course, using 1.0 in these cases may make us more vulnerable to
under-estimates for large tables. But for that to happen we must not
sample any of the non-NULL values, and if there are many distinct values
that's probably even less likely than sampling just one (when we end up
with an estimate of 1.0 already).

So I'm not sure I understand what would be the risk with this ... Tom,
can you elaborate why you dislike the patch?


BTW we already have a way to improve the estimate - setting n_distinct
for the column to 1.0 using ALTER TABLE should do the trick, I think.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> So I'm not sure I understand what would be the risk with this ... Tom,
> can you elaborate why you dislike the patch?

I've got a couple issues with the patch as presented.

* As you said, it creates discontinuous behavior for stanullfrac = 1.0
versus stanullfrac = 1.0 - epsilon.  That doesn't seem good.

* It's not apparent why, if ANALYZE's sample is all nulls, we wouldn't
conclude stadistinct = 0 and thus arrive at the desired answer that
way.  (Since we have a complaint, I'm guessing that ANALYZE might
disbelieve its own result and stick in some larger stadistinct.  But
then maybe that's where to fix this, not here.)

* We generally disbelieve edge-case estimates to begin with.  The
most obvious example is that we don't accept rowcount estimates that
are zero.  There are also some clamps that disbelieve selectivities
approaching 0.0 or 1.0 when estimating from a histogram, and I think
we have a couple other similar rules.  The reason for this is mainly
that taking such estimates at face value creates too much risk of
severe relative error due to imprecise or out-of-date statistics.
So a special case for stanullfrac = 1.0 seems to go directly against
that mindset.

I agree that there might be some gold to be mined in this area,
as we haven't thought particularly hard about high-stanullfrac
situations.  One idea is to figure what stanullfrac says about the
number of non-null rows, and clamp the get_variable_numdistinct
result to be not more than that.  But I still would not want to
trust an exact zero result.

            regards, tom lane



I wrote:
> * It's not apparent why, if ANALYZE's sample is all nulls, we wouldn't
> conclude stadistinct = 0 and thus arrive at the desired answer that
> way.  (Since we have a complaint, I'm guessing that ANALYZE might
> disbelieve its own result and stick in some larger stadistinct.  But
> then maybe that's where to fix this, not here.)

Oh, on second thought (and with some testing): ANALYZE *does* report
stadistinct = 0.  The real issue is that get_variable_numdistinct is
assuming it can use that value as meaning "stadistinct is unknown".
So maybe we should just fix that, probably by adding an explicit
bool flag for that condition.

BTW ... I've not looked at the callers, but now I'm wondering whether
get_variable_numdistinct ought to count NULL as one of the "distinct"
values.  In applications such as estimating the number of GROUP BY
groups, it seems like that would be correct.  There might be some
callers that don't want it though.

            regards, tom lane