Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs - Mailing list pgsql-hackers

From Alena Rybakina
Subject Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Date
Msg-id 148ff8f1-067b-1409-c754-af6117de9b7d@yandex.ru
In response to Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
List pgsql-hackers

Hi, all!

On 26.06.2023 12:22, Andrey Lepikhov wrote:
> On 24/6/2023 17:23, Tomas Vondra wrote:
>> I really hope what I just wrote makes at least a little bit of sense.
> Throw in one more example:
>
> SELECT i AS id INTO l FROM generate_series(1,100000) i;
> CREATE TABLE r (id int8, v text);
> INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
> ANALYZE l,r;
> EXPLAIN ANALYZE
> SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
>
> Here you can see the same kind of underestimation:
> Hash Left Join  (... rows=500 width=14) (... rows=99999 ...)
>
> So the eqjoinsel_unmatch_left() function should be modified for the case where nd1 < nd2.


Unfortunately, that patch could not fix the cardinality estimate for this query, so I tried to look into what is missing there.

I tried to fix the cardinality estimate for the query above by changing:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 8e18aa1dd2b..40901836146 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2604,11 +2604,16 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
                 * if we're calculating fraction of NULLs or fraction of unmatched rows.
                 */
                // unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
-               if (nd1 > nd2)
+               if (nd1 != nd2)
                {
-                       selec /= nd1;
-                       *unmatched_frac = (nd1 - nd2) * 1.0 / nd1;
+                       selec /= Max(nd1, nd2);
+                       *unmatched_frac = abs(nd1 - nd2) * 1.0 / Max(nd1, nd2);
                }
+               /*if (nd1 > nd2)
+               {
+                       selec /= nd1;
+                       *unmatched_frac = nd1 - nd2 * 1.0 / nd1;
+               }*/
                else
                {
                        selec /= nd2;

and it worked:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
ERROR:  relation "l" already exists
ERROR:  relation "r" already exists
INSERT 0 2
ANALYZE
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.09..1944.13 rows=99998 width=14) (actual time=0.152..84.184 rows=99999 loops=1)
   Hash Cond: (l.id = r.id)
   Filter: (r.v IS NULL)
   Rows Removed by Filter: 2
   ->  Seq Scan on l  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.040..27.635 rows=100000 loops=1)
   ->  Hash  (cost=1.04..1.04 rows=4 width=10) (actual time=0.020..0.022 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on r  (cost=0.00..1.04 rows=4 width=10) (actual time=0.009..0.011 rows=4 loops=1)
 Planning Time: 0.954 ms
 Execution Time: 92.309 ms
(10 rows)

This looks too simple, and I suspect I might have missed something, but so far I haven't found any queries where it doesn't work.

I also didn't see it breaking anything in the examples from my previous message [1].

1. https://www.postgresql.org/message-id/7af1464e-2e24-cfb1-b6d4-1544757f8cfa%40yandex.ru


Unfortunately, I don't quite understand your idea in point 4; could you explain it?

> The good thing is this helps even for IS NULL checks on non-join-key
> columns (where we don't switch to an antijoin), but there's a couple
> things that I dislike ...
>
> 1) It's not restricted to outer joins or anything like that (this is
> mostly just my laziness / interest in one particular query, but also
> something the outer-join-aware patch might help with).
>
> 2) We probably don't want to pass this kind of information through
> sjinfo. That was the simplest thing for an experimental patch, but I
> suspect it's not the only piece of information we may need to pass to
> the lower levels of estimation code.
>
> 3) I kinda doubt we actually want to move this responsibility (to
> consider fraction of unmatched rows) to the low-level estimation
> routines (e.g. nulltestsel and various others). AFAICS this just
> "introduces NULLs" into the relation, so maybe we could "adjust" the
> attribute statistics (in examine_variable?) by inflating null_frac and
> modifying the other frequencies in MCV/histogram.
>
> 4) But I'm not sure we actually want to do that in these low-level
> selectivity functions. The outer join essentially produces output with
> two subsets - one with matches on the outer side, one without them. But
> the side without matches has NULLs in all columns. In a way, we know
> exactly how are these columns correlated - if we do the usual estimation
> (even with the null_frac adjusted), we just throw this information away.
> And when there's a lot of rows without a match, that seems bad.

-- 
Regards,
Alena Rybakina
Postgres Professional
