Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs - Mailing list pgsql-hackers

From: Alena Rybakina
Subject: Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Msg-id: 148ff8f1-067b-1409-c754-af6117de9b7d@yandex.ru
In response to: Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
Hi, all!
On 24/6/2023 17:23, Tomas Vondra wrote:
> I really hope what I just wrote makes at least a little bit of sense.

Throw in one more example:
SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
Here you can see the same kind of underestimation:

Hash Left Join (... rows=500 width=14) (... rows=99999 ...)

So the eqjoinsel_unmatch_left() function should be modified to handle the case where nd1 < nd2.
Unfortunately, the patch as posted could not fix the cardinality estimate for this query; I'll try to look into it and figure out what is missing.

I tried to fix the cardinality estimate for the query above by changing:
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 8e18aa1dd2b..40901836146 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2604,11 +2604,16 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
* if we're calculating fraction of NULLs or fraction of unmatched rows.
*/
// unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
- if (nd1 > nd2)
+ if (nd1 != nd2)
{
- selec /= nd1;
- *unmatched_frac = (nd1 - nd2) * 1.0 / nd1;
+ selec /= Max(nd1, nd2);
+ *unmatched_frac = abs(nd1 - nd2) * 1.0 / Max(nd1, nd2);
}
+ /*if (nd1 > nd2)
+ {
+ selec /= nd1;
+ *unmatched_frac = nd1 - nd2 * 1.0 / nd1;
+ }*/
else
{
selec /= nd2;
and it worked:
SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
ERROR: relation "l" already exists
ERROR: relation "r" already exists
INSERT 0 2
ANALYZE
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=1.09..1944.13 rows=99998 width=14) (actual time=0.152..84.184 rows=99999 loops=1)
Hash Cond: (l.id = r.id)
Filter: (r.v IS NULL)
Rows Removed by Filter: 2
-> Seq Scan on l (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.040..27.635 rows=100000 loops=1)
-> Hash (cost=1.04..1.04 rows=4 width=10) (actual time=0.020..0.022 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on r (cost=0.00..1.04 rows=4 width=10) (actual time=0.009..0.011 rows=4 loops=1)
Planning Time: 0.954 ms
Execution Time: 92.309 ms
(10 rows)
It looks too simple, and I suspect I might have missed something, but so far I haven't found any queries where it doesn't work. I also didn't see it break anything in the examples from my previous letter [1].
1. https://www.postgresql.org/message-id/7af1464e-2e24-cfb1-b6d4-1544757f8cfa%40yandex.ru
Unfortunately, I can't quite understand your idea from point 4; could you please explain it?
> The good thing is this helps even for IS NULL checks on non-join-key
> columns (where we don't switch to an antijoin), but there's a couple
> things that I dislike ...
>
> 1) It's not restricted to outer joins or anything like that (this is
> mostly just my laziness / interest in one particular query, but also
> something the outer-join-aware patch might help with).
>
> 2) We probably don't want to pass this kind of information through
> sjinfo. That was the simplest thing for an experimental patch, but I
> suspect it's not the only piece of information we may need to pass to
> the lower levels of estimation code.
>
> 3) I kinda doubt we actually want to move this responsibility (to
> consider fraction of unmatched rows) to the low-level estimation
> routines (e.g. nulltestsel and various others). AFAICS this just
> "introduces NULLs" into the relation, so maybe we could "adjust" the
> attribute statistics (in examine_variable?) by inflating null_frac and
> modifying the other frequencies in MCV/histogram.
>
> 4) But I'm not sure we actually want to do that in these low-level
> selectivity functions. The outer join essentially produces output with
> two subsets - one with matches on the outer side, one without them. But
> the side without matches has NULLs in all columns. In a way, we know
> exactly how are these columns correlated - if we do the usual estimation
> (even with the null_frac adjusted), we just throw this information away.
> And when there's a lot of rows without a match, that seems bad.
--
Regards,
Alena Rybakina
Postgres Professional