Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Date
Msg-id edd7ac83-f675-ae24-8012-04a5f20f743c@enterprisedb.com
In response to Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs  (Alena Rybakina <lena.ribackina@yandex.ru>)
Responses Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
List pgsql-hackers

On 7/6/23 15:51, Alena Rybakina wrote:
> Hi, all!
> 
> On 26.06.2023 12:22, Andrey Lepikhov wrote:
>> On 24/6/2023 17:23, Tomas Vondra wrote:
>>> I really hope what I just wrote makes at least a little bit of sense.
>> Throw in one more example:
>>
>> SELECT i AS id INTO l FROM generate_series(1,100000) i;
>> CREATE TABLE r (id int8, v text);
>> INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
>> ANALYZE l,r;
>> EXPLAIN ANALYZE
>> SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
>>
>> Here you can see the same kind of underestimation:
>> Hash Left Join  (... rows=500 width=14) (... rows=99999 ...)
>>
>> So the eqjoinsel_unmatch_left() function should be modified for the
>> case where nd1<nd2.
>>
>>
>> Unfortunately, this patch could not fix the cardinality estimate for
>> this query; I'll try to take a look and figure out what is missing here.
> 
> I tried to fix the cardinality estimate for the query above by changing:
> 
> diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
> index 8e18aa1dd2b..40901836146 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -2604,11 +2604,16 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
>                  * if we're calculating fraction of NULLs or fraction of unmatched rows.
>                  */
>                 // unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
> -               if (nd1 > nd2)
> +               if (nd1 != nd2)
>                 {
> -                       selec /= nd1;
> -                       *unmatched_frac = (nd1 - nd2) * 1.0 / nd1;
> +                       selec /= Max(nd1, nd2);
> +                       *unmatched_frac = abs(nd1 - nd2) * 1.0 / Max(nd1, nd2);
>                 }
> +               /*if (nd1 > nd2)
> +               {
> +                       selec /= nd1;
> +                       *unmatched_frac = nd1 - nd2 * 1.0 / nd1;
> +               }*/
>                 else
>                 {
>                         selec /= nd2;
> 
> and it worked:
> 
> SELECT i AS id INTO l FROM generate_series(1,100000) i;
> CREATE TABLE r (id int8, v text);
> INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
> ANALYZE l,r;
> EXPLAIN ANALYZE
> SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
> ERROR:  relation "l" already exists
> ERROR:  relation "r" already exists
> INSERT 0 2
> ANALYZE
>                                              QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------
>  Hash Left Join  (cost=1.09..1944.13 rows=99998 width=14) (actual time=0.152..84.184 rows=99999 loops=1)
>    Hash Cond: (l.id = r.id)
>    Filter: (r.v IS NULL)
>    Rows Removed by Filter: 2
>    ->  Seq Scan on l  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.040..27.635 rows=100000 loops=1)
>    ->  Hash  (cost=1.04..1.04 rows=4 width=10) (actual time=0.020..0.022 rows=4 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 9kB
>          ->  Seq Scan on r  (cost=0.00..1.04 rows=4 width=10) (actual time=0.009..0.011 rows=4 loops=1)
>  Planning Time: 0.954 ms
>  Execution Time: 92.309 ms
> (10 rows)
> 
> It looks too simple and I suspect that I might have missed something
> somewhere, but so far I haven't found any examples of queries where it
> doesn't work.
> 
> I didn't see it breaking anything in the examples from my previous
> letter [1].
> 

I think it's correct. Or at least it doesn't break anything my patch
didn't already break. My patch was written for one specific query, so it
didn't consider the possibility that the nd1 and nd2 values might be the
other way around ...
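
Just to spell out the symmetric form being discussed, here's a minimal
standalone sketch (the function name is made up for illustration, it's
not the actual selfuncs.c code):

  #include <math.h>

  #define Max(x, y)  ((x) > (y) ? (x) : (y))

  /*
   * Sketch only: fraction of rows on the side with more distinct values
   * that cannot have a match on the other side, computed symmetrically
   * so it no longer matters whether nd1 > nd2 or nd1 < nd2.  nd1 and nd2
   * are the ndistinct estimates of the two join inputs (doubles, hence
   * fabs).
   */
  double
  unmatched_fraction(double nd1, double nd2)
  {
      if (nd1 == nd2)
          return 0.0;
      return fabs(nd1 - nd2) / Max(nd1, nd2);
  }

For the example above that's (100000 - 2) / 100000 = 0.99998, which is
where the corrected rows=99998 estimate comes from.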

> 1.
> https://www.postgresql.org/message-id/7af1464e-2e24-cfb1-b6d4-1544757f8cfa%40yandex.ru
> 
> 
> Unfortunately, I can't understand your idea in point 4; could you please explain it?
> 
> The good thing is this helps even for IS NULL checks on non-join-key
> columns (where we don't switch to an antijoin), but there are a couple
> of things that I dislike ...
> 
> 1) It's not restricted to outer joins or anything like that (this is
> mostly just my laziness / interest in one particular query, but also
> something the outer-join-aware patch might help with).
> 
> 2) We probably don't want to pass this kind of information through
> sjinfo. That was the simplest thing for an experimental patch, but I
> suspect it's not the only piece of information we may need to pass to
> the lower levels of estimation code.
> 
> 3) I kinda doubt we actually want to move this responsibility (to
> consider fraction of unmatched rows) to the low-level estimation
> routines (e.g. nulltestsel and various others). AFAICS this just
> "introduces NULLs" into the relation, so maybe we could "adjust" the
> attribute statistics (in examine_variable?) by inflating null_frac and
> modifying the other frequencies in MCV/histogram.
> 
> 4) But I'm not sure we actually want to do that in these low-level
> selectivity functions. The outer join essentially produces output with
> two subsets - one with matches on the outer side, one without them. But
> the side without matches has NULLs in all columns. In a way, we know
> exactly how these columns are correlated - if we do the usual estimation
> (even with the null_frac adjusted), we just throw this information away.
> And when there's a lot of rows without a match, that seems bad.
> 

Well, one option would be to modify all selectivity functions to do
something like what the patch does for nulltestsel(). That seems rather
cumbersome: why should all those places have to care about possibly
running on the outer side of a join? For code in extensions this would
be particularly problematic, I think.

So I was thinking about doing this in a way that'd make it automatic,
without having to modify the selectivity functions themselves.

Option (3) is very simple - examine_variable would simply adjust the
statistics by tweaking the null_frac field, when looking at variables on
the outer side of the join. But it has issues when estimating multiple
conditions.
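
To illustrate what I mean by (3), a minimal sketch of the null_frac
adjustment (the function name and the unmatched_frac parameter are made
up here, it's not actual code from the patch):

  /*
   * Sketch only: null_frac is the pg_statistic fraction for a column
   * that gets NULL-extended by the outer join, unmatched_frac is the
   * estimated fraction of outer rows without a match (those rows
   * produce NULLs in all such columns).
   */
  double
  adjusted_null_frac(double null_frac, double unmatched_frac)
  {
      return unmatched_frac + (1.0 - unmatched_frac) * null_frac;
  }

For a single clause (like the r.v IS NULL example earlier) this works
fine; the trouble starts when multiple conditions get combined, as in
the example below.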

Imagine t1 has 1M rows, and we want to estimate

  SELECT * FROM t1 LEFT JOIN t2 ON (t1.id = t2.id)
          WHERE ((t2.a=1) AND (t2.b=1))

but only 50% of the t1 rows have a match in t2. Assume each of the t2
conditions matches 100% of the rows in the table. With the correction,
this means 50% selectivity for each condition. And if we combine them
the usual way, we get 0.5 * 0.5 = 0.25.

But we know all the rows in the "matching" part match the condition, so
the correct selectivity should be 0.5.

In a way, this is just another case of estimation issues due to the
assumption of independence.
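
Spelled out as a toy calculation (the numbers are from the example
above, the variable names are made up):

  #include <stdio.h>

  int
  main(void)
  {
      double match_frac = 0.5;   /* fraction of t1 rows with a match in t2 */
      double cond_sel = 1.0;     /* selectivity of each t2 condition on t2 */

      /* option (3) corrects each condition to 0.5 ... */
      double corrected = match_frac * cond_sel;

      /* ... and the independence assumption combines them into 0.25 */
      double naive = corrected * corrected;

      /* but every matched row satisfies both conditions, so 0.5 is right */
      double correct = match_frac * (cond_sel * cond_sel);

      printf("naive = %.2f  correct = %.2f\n", naive, correct);
      return 0;
  }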

But (4) was suggesting we could improve this essentially by treating the
join as two distinct sets of rows

 - the inner join result

 - rows without match on the outer side

For the inner part, we would do estimates as now (using the regular
per-column statistics). If we knew the conditions match 100% of the
rows, we'd still get 100% when the conditions are combined.

For the second part of the join we know the outer side is just NULLs in
all columns, and that'd make the estimation much simpler for most
clauses. We'd just need to have "fake" statistics with null_frac=1.0 and
that's it.

And then we'd just combine these two selectivities. If we know the inner
side is 50% and all rows match the conditions, and no rows in the other
50% match, the selectivity is 50%.

inner_part * inner_sel + outer_part * outer_sel = 0.5 * 1.0 + 0.5 * 0.0 = 0.5

Now, we still have issues with independence assumption in each of these
parts separately. But that's OK, I think.

I think (4) could be implemented by doing the current estimation for the
inner part, and by tweaking examine_variable in the "outer" part in a
way similar to (3). Except that it just sets null_frac=1.0 everywhere.
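
As a sketch, the combination then looks like this (made-up names, just
restating the formula above):

  /*
   * Sketch only: combine selectivities estimated separately for the two
   * parts of the outer-join result.
   *
   *   matched_frac - fraction of the join output that has a match
   *   inner_sel    - selectivity estimated with the regular statistics
   *   outer_sel    - selectivity estimated with "fake" stats (null_frac=1.0)
   */
  double
  outer_join_clause_sel(double matched_frac, double inner_sel, double outer_sel)
  {
      return matched_frac * inner_sel + (1.0 - matched_frac) * outer_sel;
  }

With matched_frac = 0.5, inner_sel = 1.0 and outer_sel = 0.0 that's the
0.5 from the example.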



FWIW, I used "AND" in the example for simplicity, but that'd probably be
pushed down to the baserel level. There'd need to be an OR to keep it at
the join level, but the overall issue is the same, I think.

Also, this entirely ignores extended statistics - I have no idea how we
might tweak those in (3). For (4) we don't need to tweak those at all,
because for inner part we can just apply them as is, and for outer part
it's irrelevant because everything is NULL.


I hope this makes more sense. If not, let me know and I'll try to
explain it better.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


