Re: Removing unneeded self joins - Mailing list pgsql-hackers
| From | Alena Rybakina |
|---|---|
| Subject | Re: Removing unneeded self joins |
| Date | |
| Msg-id | 5309f032-e2fc-4913-a5d6-a2a2b5e22b29@postgrespro.ru |
| In response to | Re: Removing unneeded self joins (Alexander Korotkov <aekorotkov@gmail.com>) |
| List | pgsql-hackers |
I agree with that, and I also believe this case will be quite rare in practice.

On Fri, Apr 4, 2025 at 11:35 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> On 4/4/25 04:53, Richard Guo wrote:
>> On Fri, Apr 4, 2025 at 1:02 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>>> I've got an off-list bug report from Alexander Lakhin involving a
>>> placeholder variable. Alena and Andrei proposed a fix. It is fairly
>>> simple: we just shouldn't remove PHVs during self-join elimination,
>>> as they might still be referenced from other parts of a query. The
>>> patch is attached. I'm going to fix this if no objections.
>>
>> Hmm, I'm not sure about the fix. It seems to me that it simply
>> prevents removing any PHVs in the self-join removal case. My concern
>> is that this might result in PHVs that could actually be removed not
>> being removed in many cases.
>
> Let's play with use cases:
>
> If a PHV is needed in the inner or outer relation only, it means we have
> a clause in the baserestrictinfo that will be transferred to the kept
> relation, and we shouldn't remove the PHV.
>
> Another case is when the PHV is needed in a join clause of the
> self-join. I can imagine such a case:
>
>     toKeep.x + toRemove.y = PHV
>
> This clause will be transformed to "toKeep.x + toKeep.y = PHV", pushed
> into the baserestrictinfo of the kept relation, and should be preserved.
>
> I think it is possible to invent quite a narrow case with a clause like
> the following:
>
>     PHV_evaluated_at_inner = PHV_evaluated_at_outer
>
> It remains to be shown that this is reproducible. But even if it makes
> sense, it seems to pose no danger for further selectivity estimation
> compared to the source clause, and it is too narrow a case, isn't it?
>
> In all other cases the PHV is needed for something else, and we can't
> remove it.
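As background, here is a minimal sketch of what self-join elimination does before any PHV is involved; it assumes (as the eliminated joins in the plans below imply) that tbl_phv has a unique index on y:

EXPLAIN (COSTS OFF)
SELECT t1.x
FROM tbl_phv t1
JOIN tbl_phv t2 ON t1.y = t2.y;
-- With the transformation this collapses to a single scan of tbl_phv,
-- with the join clause replaced by a "y IS NOT NULL" filter on the kept
-- relation; the question above is what should happen to PHVs that still
-- reference the removed side.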
I experimented with some examples like this and noticed that it does affect cardinality estimation, though I'm not sure the impact is significant. Should we add more regression tests covering these cases?
I used the tables from the regression tests, so if they’re appropriate for reproducing this case, it should be straightforward to add them.
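For reference, here is a hypothetical definition of tbl_phv that is consistent with the plans below (the actual regression-test table may differ): about 101 rows, a unique index on y so the self-join can be eliminated, and one row with a NULL x that the "(x * 0) IS NOT NULL" filter removes.

CREATE TABLE tbl_phv (x int, y int UNIQUE);
INSERT INTO tbl_phv SELECT g, g FROM generate_series(1, 100) g;
INSERT INTO tbl_phv VALUES (NULL, 101);  -- the one row removed by the (x * 0) IS NOT NULL filter
ANALYZE tbl_phv;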
EXPLAIN (ANALYZE, VERBOSE)
SELECT 1
FROM tbl_phv t1
LEFT JOIN (
    SELECT 1 AS extra, x, y FROM tbl_phv tl
) t3
JOIN (
    SELECT y, x FROM tbl_phv tr
) t4 ON t4.y = t3.y AND (t4.x * 0) = (t3.x * 0)
ON true
WHERE t3.extra IS NOT NULL
  AND t4.x + t3.y = (t1.x * 0);
The query plan with transformation:
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Hash Join (cost=3.27..7.16 rows=100 width=4) (actual time=0.149..0.151 rows=0.00 loops=1)
   Output: 1
   Hash Cond: ((tr.x + tr.y) = (t1.x * 0))
   Buffers: shared hit=2
   ->  Seq Scan on public.tbl_phv tr (cost=0.00..2.26 rows=100 width=8) (actual time=0.027..0.047 rows=100.00 loops=1)
         Output: tr.x, tr.y
         Filter: ((1 IS NOT NULL) AND ((tr.x * 0) IS NOT NULL))
         Rows Removed by Filter: 1
         Buffers: shared hit=1
   ->  Hash (cost=2.01..2.01 rows=101 width=4) (actual time=0.078..0.079 rows=100.00 loops=1)
         Output: t1.x
         Buckets: 1024 Batches: 1 Memory Usage: 12kB
         Buffers: shared hit=1
         ->  Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101 width=4) (actual time=0.005..0.038 rows=101.00 loops=1)
               Output: t1.x
               Buffers: shared hit=1
 Planning Time: 0.607 ms
 Execution Time: 0.194 ms
(18 rows)
The query plan without transformation:
EXPLAIN (ANALYZE, VERBOSE)
SELECT 1
FROM tbl_phv t1
LEFT JOIN (
    SELECT 1 AS extra, x, y FROM tbl_phv tl
) t3
JOIN (
    SELECT y, x FROM tbl_phv tr
) t4 ON t4.y = t3.y
ON true
WHERE t3.extra IS NOT NULL
  AND t4.x + t3.y = (t1.x * 0);
The query plan with transformation:
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Hash Join (cost=3.26..6.90 rows=100 width=4) (actual time=0.145..0.146 rows=0.00 loops=1)
   Output: 1
   Hash Cond: ((t1.x * 0) = (tr.x + tr.y))
   Buffers: shared hit=2
   ->  Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101 width=4) (actual time=0.030..0.040 rows=101.00 loops=1)
         Output: t1.x, t1.y
         Buffers: shared hit=1
   ->  Hash (cost=2.01..2.01 rows=100 width=8) (actual time=0.081..0.081 rows=100.00 loops=1)
         Output: tr.y, tr.x
         Buckets: 1024 Batches: 1 Memory Usage: 12kB
         Buffers: shared hit=1
         ->  Seq Scan on public.tbl_phv tr (cost=0.00..2.01 rows=100 width=8) (actual time=0.012..0.040 rows=101.00 loops=1)
               Output: tr.y, tr.x
               Filter: (1 IS NOT NULL)
               Buffers: shared hit=1
 Planning Time: 0.571 ms
 Execution Time: 0.195 ms
(17 rows)
The query plan without transformation:
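The plans without the transformation can be reproduced by disabling the feature for the session; this assumes the GUC is named enable_self_join_elimination, as introduced with the feature:

SET enable_self_join_elimination = off;
-- re-run the EXPLAIN statements above to get the plans without the transformation
RESET enable_self_join_elimination;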
--
Regards,
Alena Rybakina
Postgres Professional