Home > mailing lists

Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs - Mailing list pgsql-hackers

From	Alena Rybakina
Subject	Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Date	July 6, 2023 16:51:48
Msg-id	148ff8f1-067b-1409-c754-af6117de9b7d@yandex.ru Whole thread Raw
In response to	Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses	Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
List	pgsql-hackers

Tree view

Hi, all!

On 26.06.2023 12:22, Andrey Lepikhov wrote:

On 24/6/2023 17:23, Tomas Vondra wrote:
I really hope what I just wrote makes at least a little bit of sense.
Throw in one more example:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;

Here you can see the same kind of underestimation:
Hash Left Join (... rows=500 width=14) (... rows=99999 ...)

So the eqjoinsel_unmatch_left() function should be modified for the case where nd1<nd2.

Unfortunately, this patch could not fix the cardinality calculation in this request, I'll try to look and figure out what is missing here.

I tried to fix the cardinality score in the query above by changing:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 8e18aa1dd2b..40901836146 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2604,11 +2604,16 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
                 * if we're calculating fraction of NULLs or fraction of unmatched rows.
                 */
                // unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
-               if (nd1 > nd2)
+               if (nd1 != nd2)
                {
-                       selec /= nd1;
-                       *unmatched_frac = (nd1 - nd2) * 1.0 / nd1;
+                       selec /= Max(nd1, nd2);
+                       *unmatched_frac = abs(nd1 - nd2) * 1.0 / Max(nd1, nd2);
                }
+               /*if (nd1 > nd2)
+               {
+                       selec /= nd1;
+                       *unmatched_frac = nd1 - nd2 * 1.0 / nd1;
+               }*/
                else
                {
                        selec /= nd2;

and it worked:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
ERROR: relation "l" already exists
ERROR: relation "r" already exists
INSERT 0 2
ANALYZE
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=1.09..1944.13 rows=99998 width=14) (actual time=0.152..84.184 rows=99999 loops=1)
   Hash Cond: (l.id = r.id)
   Filter: (r.v IS NULL)
   Rows Removed by Filter: 2
   -> Seq Scan on l (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.040..27.635 rows=100000 loops=1)
   -> Hash (cost=1.04..1.04 rows=4 width=10) (actual time=0.020..0.022 rows=4 loops=1)
         Buckets: 1024 Batches: 1 Memory Usage: 9kB
         -> Seq Scan on r (cost=0.00..1.04 rows=4 width=10) (actual time=0.009..0.011 rows=4 loops=1)
Planning Time: 0.954 ms
Execution Time: 92.309 ms
(10 rows)

It looks too simple and I suspect that I might have missed something somewhere, but so far I haven't found any examples of queries where it doesn't work.

I didn't see it breaking anything in the examples from my previous letter [1].

1. https://www.postgresql.org/message-id/7af1464e-2e24-cfb1-b6d4-1544757f8cfa%40yandex.ru

Unfortunately, I can't understand your idea from point 4, please explain it?

The good thing is this helps even for IS NULL checks on non-join-key
columns (where we don't switch to an antijoin), but there's a couple
things that I dislike ...

1) It's not restricted to outer joins or anything like that (this is
mostly just my laziness / interest in one particular query, but also
something the outer-join-aware patch might help with).

2) We probably don't want to pass this kind of information through
sjinfo. That was the simplest thing for an experimental patch, but I
suspect it's not the only piece of information we may need to pass to
the lower levels of estimation code.

3) I kinda doubt we actually want to move this responsibility (to
consider fraction of unmatched rows) to the low-level estimation
routines (e.g. nulltestsel and various others). AFAICS this just
"introduces NULLs" into the relation, so maybe we could "adjust" the
attribute statistics (in examine_variable?) by inflating null_frac and
modifying the other frequencies in MCV/histogram.

4) But I'm not sure we actually want to do that in these low-level
selectivity functions. The outer join essentially produces output with
two subsets - one with matches on the outer side, one without them. But
the side without matches has NULLs in all columns. In a way, we know
exactly how are these columns correlated - if we do the usual estimation
(even with the null_frac adjusted), we just throw this information away.
And when there's a lot of rows without a match, that seems bad.

-- 
Regards,
Alena Rybakina
Postgres Professional

Attachment

0001-Fixed-the-case-of-calculating-underestimated-cardina.patch

pgsql-hackers by date:

From: Daniel Gustafsson
Date: 06 July 2023, 16:50:57
Subject: Re: pg_basebackup check vs Windows file path limits

From: Tom Lane
Date: 06 July 2023, 17:02:44
Subject: Re: UUID v7

Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs - Mailing list pgsql-hackers

Attachment

Previous

Next