Re: Removing unneeded self joins - Mailing list pgsql-hackers
| From | Alena Rybakina |
|---|---|
| Subject | Re: Removing unneeded self joins |
| Date | |
| Msg-id | 5309f032-e2fc-4913-a5d6-a2a2b5e22b29@postgrespro.ru |
| In response to | Re: Removing unneeded self joins (Alexander Korotkov <aekorotkov@gmail.com>) |
| List | pgsql-hackers |
I agree with that, and I also believe this case will be quite rare in practice.

On Fri, Apr 4, 2025 at 11:35 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> On 4/4/25 04:53, Richard Guo wrote:
>> On Fri, Apr 4, 2025 at 1:02 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>>> I've got an off-list bug report from Alexander Lakhin involving a
>>> placeholder variable. Alena and Andrei proposed a fix. It is fairly
>>> simple: we just shouldn't remove PHVs during self-join elimination,
>>> as they might still be referenced from other parts of a query. The
>>> patch is attached. I'm going to fix this if no objections.
>>
>> Hmm, I'm not sure about the fix. It seems to me that it simply
>> prevents removing any PHVs in the self-join removal case. My concern
>> is that this might result in PHVs that could actually be removed not
>> being removed in many cases.
>
> Let's play with use cases:
>
> If a PHV is needed in the inner or outer relation only, it means we have
> a clause in the baserestrictinfo that will be transferred to the kept
> relation, and we shouldn't remove the PHV.
>
> Another case is when the PHV is needed in a join clause of the
> self-join. I can imagine such a case:
>
>     toKeep.x + toRemove.y = PHV
>
> This clause will be transformed to "toKeep.x + toKeep.y = PHV", pushed
> into the baserestrictinfo of the kept relation, and should be preserved.
>
> I think it is possible to invent quite a narrow case with a clause like
> the following:
>
>     PHV_evaluated_at_inner = PHV_evaluated_at_outer
>
> It remains to be shown that this is reproducible. But even if it makes
> sense, it seems to pose no danger for further selectivity estimation
> compared to the source clause, and it is too narrow a case, isn't it?
>
> In all other cases the PHV is needed for something else, and we can't
> remove it.
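As background, here is a minimal sketch of what self-join elimination does before any PHV is involved; it assumes (as the eliminated joins in the plans below imply) that tbl_phv has a unique index on y:

EXPLAIN (COSTS OFF)
SELECT t1.x
FROM tbl_phv t1
JOIN tbl_phv t2 ON t1.y = t2.y;
-- With the transformation this collapses to a single scan of tbl_phv,
-- with the join clause replaced by a "y IS NOT NULL" filter on the kept
-- relation; the question above is what should happen to PHVs that still
-- reference the removed side.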
I experimented with some examples like this and noticed that it does affect cardinality estimation, though I'm not sure the impact is significant. Should we add more regression tests covering these cases?
I used the tables from the regression tests, so if they’re appropriate for reproducing this case, it should be straightforward to add them.
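For reference, here is a hypothetical definition of tbl_phv that is consistent with the plans below (the actual regression-test table may differ): about 101 rows, a unique index on y so the self-join can be eliminated, and one row with a NULL x that the "(x * 0) IS NOT NULL" filter removes.

CREATE TABLE tbl_phv (x int, y int UNIQUE);
INSERT INTO tbl_phv SELECT g, g FROM generate_series(1, 100) g;
INSERT INTO tbl_phv VALUES (NULL, 101);  -- the one row removed by the (x * 0) IS NOT NULL filter
ANALYZE tbl_phv;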
EXPLAIN (ANALYZE, VERBOSE)
SELECT 1
FROM tbl_phv t1
LEFT JOIN (
    SELECT 1 AS extra, x, y FROM tbl_phv tl
) t3
JOIN (
    SELECT y, x FROM tbl_phv tr
) t4 ON t4.y = t3.y AND (t4.x * 0) = (t3.x * 0)
ON true
WHERE t3.extra IS NOT NULL
  AND t4.x + t3.y = (t1.x * 0);
The query plan with transformation:
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Hash Join (cost=3.27..7.16 rows=100 width=4) (actual time=0.149..0.151 rows=0.00 loops=1)
   Output: 1
   Hash Cond: ((tr.x + tr.y) = (t1.x * 0))
   Buffers: shared hit=2
   ->  Seq Scan on public.tbl_phv tr (cost=0.00..2.26 rows=100 width=8) (actual time=0.027..0.047 rows=100.00 loops=1)
         Output: tr.x, tr.y
         Filter: ((1 IS NOT NULL) AND ((tr.x * 0) IS NOT NULL))
         Rows Removed by Filter: 1
         Buffers: shared hit=1
   ->  Hash (cost=2.01..2.01 rows=101 width=4) (actual time=0.078..0.079 rows=100.00 loops=1)
         Output: t1.x
         Buckets: 1024 Batches: 1 Memory Usage: 12kB
         Buffers: shared hit=1
         ->  Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101 width=4) (actual time=0.005..0.038 rows=101.00 loops=1)
               Output: t1.x
               Buffers: shared hit=1
 Planning Time: 0.607 ms
 Execution Time: 0.194 ms
(18 rows)
The query plan without transformation:
EXPLAIN (ANALYZE, VERBOSE)
SELECT 1
FROM tbl_phv t1
LEFT JOIN (
    SELECT 1 AS extra, x, y FROM tbl_phv tl
) t3
JOIN (
    SELECT y, x FROM tbl_phv tr
) t4 ON t4.y = t3.y
ON true
WHERE t3.extra IS NOT NULL
  AND t4.x + t3.y = (t1.x * 0);
The query plan with transformation:
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Hash Join (cost=3.26..6.90 rows=100 width=4) (actual time=0.145..0.146 rows=0.00 loops=1)
   Output: 1
   Hash Cond: ((t1.x * 0) = (tr.x + tr.y))
   Buffers: shared hit=2
   ->  Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101 width=4) (actual time=0.030..0.040 rows=101.00 loops=1)
         Output: t1.x, t1.y
         Buffers: shared hit=1
   ->  Hash (cost=2.01..2.01 rows=100 width=8) (actual time=0.081..0.081 rows=100.00 loops=1)
         Output: tr.y, tr.x
         Buckets: 1024 Batches: 1 Memory Usage: 12kB
         Buffers: shared hit=1
         ->  Seq Scan on public.tbl_phv tr (cost=0.00..2.01 rows=100 width=8) (actual time=0.012..0.040 rows=101.00 loops=1)
               Output: tr.y, tr.x
               Filter: (1 IS NOT NULL)
               Buffers: shared hit=1
 Planning Time: 0.571 ms
 Execution Time: 0.195 ms
(17 rows)
The query plan without transformation:
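The plans without the transformation can be reproduced by disabling the feature for the session; this assumes the GUC is named enable_self_join_elimination, as introduced with the feature:

SET enable_self_join_elimination = off;
-- re-run the EXPLAIN statements above to get the plans without the transformation
RESET enable_self_join_elimination;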
--
Regards,
Alena Rybakina
Postgres Professional