Home > mailing lists

Re: A new strategy for pull-up correlated ANY_SUBLINK - Mailing list pgsql-hackers

From	Alena Rybakina
Subject	Re: A new strategy for pull-up correlated ANY_SUBLINK
Date	October 12, 2023 00:01:37
Msg-id	b9437269-0480-4016-91bf-61168f72bdd3@yandex.ru Whole thread Raw
In response to	Re: A new strategy for pull-up correlated ANY_SUBLINK (Andy Fan <zhihui.fan1213@gmail.com>)
Responses	Re: A new strategy for pull-up correlated ANY_SUBLINK
List	pgsql-hackers

Tree view

Hi!

I reviewed your patch and it was interesting for me!

Thank you for the explanation. It was really informative for me!

I think we need the restriction and that should be enough for this feature
. Given the query Richard provided before:

explain
select * from tenk1 A where exists
(select 1 from tenk2 B
where A.hundred in (select C.hundred FROM tenk2 C
WHERE c.odd = b.odd));

It first can be converted to the below format without any issue.

SELECT * FROM tenk1 A SEMI JOIN tenk2 B
on A.hundred in (select C.hundred FROM tenk2 C
WHERE c.odd = b.odd);

Then without the restriction, since we only pull the varnos from
sublink->testexpr, then it is {A}, so it convert to

SELECT * FROM
(tenk1 A SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C)
ON c.odd = b.odd AND a.hundred = v.hundred)
SEMI JOIN on tenk2 B ON TRUE;

then the above query is NOT A VALID QUERY since:
1. The above query is *not* same as

SELECT * FROM (tenk1 A SEMI JOIN tenk2 B) on true
SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C) v
ON v.odd = b.odd;

2. The above query requires b.odd when B is not available. So it is
right that an optimizer can't generate a plan for it. The fix would
be to do the restriction before applying this optimization.

I'm not sure pull-up-subquery can play any role here, IIUC, the bad thing
happens before pull-up-subquery.

I also write & analyze more test and found no issue by me

1. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should not be pull-up to rarg of the left join since A.hundred is not
available.

2. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = a.odd);
==> should not be pull-up to rarg of the left join since A.odd is not
available.

3. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should be pull-up to rarg of left join.

4. SELECT * FROM tenk1 A INNER JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> pull-up as expected.

5. SELECT * FROM tenk1 A RIGHT JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should not be pull-up into larg of left join since b.odd is not
available.

After reviewing, I want to suggest some changes related to the code and tests.

First of all, I think, it would be better to "treat" change to "consider" and rewrite the pull-up check condition in two lines:/*
* If the sub-select refers to any Vars of the parent query, we so let's
* considering it as LATERAL. (Vars of higher levels don't matter here.)
*/

use_lateral = !bms_is_empty(sub_ref_outer_relids) &&
bms_is_subset(sub_ref_outer_relids, available_rels);if (!use_lateral && !bms_is_empty(sub_ref_outer_relids))
return NULL;

Secondly, I noticed another interesting feature in your patch and I think it could be added to the test.

If we get only one row from the aggregated subquery, we can pull-up it in the subquery scan filter.

postgres=# explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);                          QUERY PLAN
--------------------------------------------------------------
Nested Loop Left Join
   -> Seq Scan on tenk1 a
   -> Materialize
         -> Nested Loop
               -> Seq Scan on tenk2 b
               -> Subquery Scan on "ANY_subquery"
                     Filter: (b.hundred = "ANY_subquery".min)
                     -> Aggregate
                           -> Seq Scan on tenk2 c
                                 Filter: (odd = b.odd)
(10 rows)

It was impossible without your patch:postgres=# explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                    QUERY PLAN
---------------------------------------------------
Nested Loop Left Join
   -> Seq Scan on tenk1 a
   -> Materialize
         -> Seq Scan on tenk2 b
               Filter: (SubPlan 1)
               SubPlan 1
                 -> Aggregate
                       -> Seq Scan on tenk2 c
                             Filter: (odd = b.odd)
(9 rows)

And I found an alternative query, when aggregated sublink will pull-up into JoinExpr condition.explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
                         QUERY PLAN
-------------------------------------------------------------
Nested Loop Left Join
   -> Seq Scan on tenk1 a
   -> Materialize
         -> Hash Semi Join
               Hash Cond: (b.hundred = "ANY_subquery".count)
               -> Seq Scan on tenk2 b
               -> Hash
                     -> Subquery Scan on "ANY_subquery"
                           -> HashAggregate
                                 Group Key: c.odd
                                 -> Seq Scan on tenk2 c
(11 rows)

Unfortunately, I found a request when sublink did not pull-up, as in the examples above. I couldn't quite figure out why.create table a (x int, y int, z int, t int);
create table b (x int, t int);
create unique index on a (t, x);
create index on b (t,x);
insert into a select id, id, id, id FROM generate_series(1,100000) As id;
insert into b select id, id FROM generate_series(1,1000) As id;

explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a
        on b.x=a.x and
           b.t in
            (select max(a0.t)
             from a a0
             where a0.x = b.x and
                   a0.t = b.t);                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
Hash Right Join (actual time=1.150..58.512 rows=1000 loops=1)
   Hash Cond: (a.x = b.x)
   Join Filter: (SubPlan 2)
   Buffers: shared hit=3546
   -> Seq Scan on a (actual time=0.023..15.798 rows=100000 loops=1)
         Buffers: shared hit=541
   -> Hash (actual time=1.038..1.042 rows=1000 loops=1)
         Buckets: 4096 Batches: 1 Memory Usage: 72kB
         Buffers: shared hit=5
         -> Seq Scan on b (actual time=0.047..0.399 rows=1000 loops=1)
               Buffers: shared hit=5
   SubPlan 2
     -> Result (actual time=0.018..0.018 rows=1 loops=1000)
           Buffers: shared hit=3000
           InitPlan 1 (returns $2)
             -> Limit (actual time=0.015..0.016 rows=1 loops=1000)
                   Buffers: shared hit=3000
                   -> Index Only Scan using a_t_x_idx on a a0 (actual time=0.014..0.014 rows=1 loops=1000)
                         Index Cond: ((t IS NOT NULL) AND (t = b.t) AND (x = b.x))
                         Heap Fetches: 1000
                         Buffers: shared hit=3000
Planning Time: 0.630 ms
Execution Time: 58.941 ms
(23 rows)

I thought it would be:

explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a on
        b.x=a.x and
        b.t =
            (select max(a0.t)
             from a a0
             where a0.x = b.x and
                   a0.t <= b.t);                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1)
   Hash Cond: (a.x = b.x)
   Join Filter: (b.t = (SubPlan 2))
   Buffers: shared hit=3546
   -> Seq Scan on a (actual time=0.022..17.109 rows=100000 loops=1)
         Buffers: shared hit=541
   -> Hash (actual time=1.065..1.068 rows=1000 loops=1)
         Buckets: 4096 Batches: 1 Memory Usage: 72kB
         Buffers: shared hit=5
         -> Seq Scan on b (actual time=0.049..0.401 rows=1000 loops=1)
               Buffers: shared hit=5
   SubPlan 2
     -> Result (actual time=0.025..0.025 rows=1 loops=1000)
           Buffers: shared hit=3000
           InitPlan 1 (returns $2)
             -> Limit (actual time=0.024..0.024 rows=1 loops=1000)
                   Buffers: shared hit=3000
                   -> Index Only Scan Backward using a_t_x_idx on a a0 (actual time=0.023..0.023 rows=1 loops=1000)
                         Index Cond: ((t IS NOT NULL) AND (t <= b.t) AND (x = b.x))
                         Heap Fetches: 1000
                         Buffers: shared hit=3000
Planning Time: 0.689 ms
Execution Time: 68.220 ms
(23 rows)

If you noticed, it became possible after replacing the "in" operator with "=".

I took the liberty of adding this to your patch and added myself as reviewer, if you don't mind.

-- 
Regards,
Alena Rybakina

Attachment

pgsql-hackers by date:

From: Thomas Munro
Date: 11 October 2023, 23:47:29
Subject: Re: Lowering the default wal_blocksize to 4K

From: Ajay P S
Date: 12 October 2023, 00:30:33
Subject: Regarding Postgresql Transaction isolation

Re: A new strategy for pull-up correlated ANY_SUBLINK - Mailing list pgsql-hackers

Attachment

Previous

Next