Thread: Re: Exists pull-up application with JoinExpr

Re: Exists pull-up application with JoinExpr

From

Alena Rybakina

Date:

25 December 2024, 14:20:42

On 24.12.2024 13:25, Ranier Vilela wrote:

Hi Alena.
Em ter., 24 de dez. de 2024 às 01:44, Alena Rybakina <a.rybakina@postgrespro.ru> escreveu:
Hi, hackers!
I found one pull-up that works if the inner join condition is written through the where condition,
create temp table ta (id int primary key, val int);
insert into ta values(1,1);
insert into ta values(2,2);
insert into ta values(3,3);

create temp table tb (id int primary key, aval int);
insert into tb values(4,1);
insert into tb values(5,1);
insert into tb values(1,2);

create temp table tc (id int primary key, aid int);
insert into tc values(6,1);
insert into tc values(7,2);
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT *   FROM ta  WHERE EXISTS (SELECT *                  FROM tb, tc                  WHERE ta.id = tb.id);                               QUERY PLAN                                
------------------------------------------------------------------------- Nested Loop Semi Join (actual rows=1 loops=1)   Buffers: local hit=6   ->  Seq Scan on ta (actual rows=3 loops=1)         Buffers: local hit=1   ->  Nested Loop (actual rows=0 loops=3)         Buffers: local hit=5         ->  Index Only Scan using tb_pkey on tb (actual rows=0 loops=3)               Index Cond: (id = ta.id)               Heap Fetches: 1               Buffers: local hit=4         ->  Seq Scan on tc (actual rows=1 loops=1)               Buffers: local hit=1 Planning:   Buffers: shared hit=67 read=12
(14 rows)
but it doesn't work if it is written through the outside condition.
alena@postgres=# EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT *   FROM ta  WHERE EXISTS (SELECT *                  FROM tb JOIN tc                  ON ta.id = tb.id);                      QUERY PLAN                      
------------------------------------------------------ Seq Scan on ta (actual rows=1 loops=1)   Filter: EXISTS(SubPlan 1)   Rows Removed by Filter: 2   Buffers: local hit=5   SubPlan 1     ->  Nested Loop (actual rows=0 loops=3)           Buffers: local hit=4           ->  Seq Scan on tb (actual rows=0 loops=3)                 Filter: (ta.id = id)                 Rows Removed by Filter: 3                 Buffers: local hit=3           ->  Seq Scan on tc (actual rows=1 loops=1)                 Buffers: local hit=1 Planning:   Buffers: shared hit=16 read=9
(15 rows)
I have written a patch to add this functionality and now it gives an query plan: alena@postgres=# EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id = tb.id); QUERY PLAN ------------------------------------------------------------------------- Nested Loop Semi Join (actual rows=1 loops=1) Buffers: local hit=6 -> Seq Scan on ta (actual rows=3 loops=1) Buffers: local hit=1 -> Nested Loop (actual rows=0 loops=3) Buffers: local hit=5 -> Index Only Scan using tb_pkey on tb (actual rows=0 loops=3) Index Cond: (id = ta.id) Heap Fetches: 1 Buffers: local hit=4 -> Seq Scan on tc (actual rows=1 loops=1) Buffers: local hit=1 (12 rows)
tb and tc form a Cartesian product, but in the case of the intersection condition with tuples from the table ta (ta.id = tb.id). So, according to the join condition, tb intersects only with 1, and only it gets into the result, but at the same time they appear twice - this is because of the Cartesian product of tb with tc
How it works:
I rewrote the code a bit so that it considers not only the quals in jointree->quals, but also those in join expression (subselect->jointree->fromlist). If they satisfy the conditions for using pull up, I add them to the list of clauses and form a "Bool" expression from them, joined by an "AND" operation.
I took a look at this patch and I did a little polishing on it.

And I believe that in testing, you need to set it to BUFFERS OFF,
because of the recent change made to ANALYZE.

The tests are failing, like this:
QUERY PLAN
-------------------------------------------------------------------------
Nested Loop Semi Join (actual rows=2 loops=1)
+ Buffers: local hit=7
-> Seq Scan on ta (actual rows=2 loops=1)
+ Buffers: local hit=1
-> Nested Loop (actual rows=1 loops=2)
+ Buffers: local hit=6
-> Index Only Scan using tb_pkey on tb (actual rows=1 loops=2)
Index Cond: (id = ta.id)
Heap Fetches: 2
+ Buffers: local hit=4
-> Seq Scan on tc (actual rows=1 loops=2)
-(7 rows)
+ Buffers: local hit=2
+(12 rows)

Yes, you are right) Thank you for your interest to this thread)

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Exists pull-up application with JoinExpr

From

Ilia Evdokimov

Date:

27 December 2024, 15:53:35

Hi Alena,

Thank you for your work on subqueries with JOIN.

Have you considered the scenario where in subquery includes a qual like 
(tc.aid = 1)? When I tried executing those queries I receive different 
results. In my opinion, to prevent this, we should add filters for such 
quals within the loop 'foreach (lc, all_clauses)'

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ta
WHERE EXISTS (SELECT * FROM tb, tc WHERE ta.id = tb.id AND tc.aid = 1);
                               QUERY PLAN
----------------------------------------------------------------------
  Hash Join (actual rows=1 loops=1)
    Hash Cond: (ta.id = tb.id)
    Buffers: local hit=3
    ->  Seq Scan on ta (actual rows=3 loops=1)
          Buffers: local hit=1
    ->  Hash (actual rows=3 loops=1)
          Buckets: 4096  Batches: 1  Memory Usage: 33kB
          Buffers: local hit=2
          ->  HashAggregate (actual rows=3 loops=1)
                Group Key: tb.id
                Batches: 1  Memory Usage: 121kB
                Buffers: local hit=2
                ->  Nested Loop (actual rows=3 loops=1)
                      Buffers: local hit=2
                      ->  Seq Scan on tb (actual rows=3 loops=1)
                            Buffers: local hit=1
                      ->  Materialize (actual rows=1 loops=3)
                            Storage: Memory  Maximum Storage: 17kB
                            Buffers: local hit=1
                            ->  Seq Scan on tc (actual rows=1 loops=1)
                                  Filter: (aid = 1)
                                  Rows Removed by Filter: 1
                                  Buffers: local hit=1
(23 rows)

============================

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id =
tb.id WHERE tc.aid = 1);
                                 QUERY PLAN
---------------------------------------------------------------------------
  Seq Scan on ta (actual rows=1 loops=1)
    Filter: EXISTS(SubPlan 1)
    Rows Removed by Filter: 2
    Buffers: local hit=6
    SubPlan 1
      ->  Nested Loop (actual rows=0 loops=3)
            Buffers: local hit=5
            ->  Index Only Scan using tb_pkey on tb (actual rows=0 loops=3)
                  Index Cond: (id = ta.id)
                  Heap Fetches: 1
                  Buffers: local hit=4
            ->  Seq Scan on tc (actual rows=1 loops=1)
                  Filter: (aid = 1)
                  Buffers: local hit=1
(14 rows)

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Exists pull-up application with JoinExpr

From

Alena Rybakina

Date:

30 December 2024, 11:24:25

Hi! Thank you for your interest to this subject!

On 27.12.2024 15:53, Ilia Evdokimov wrote:

Hi Alena,

Thank you for your work on subqueries with JOIN.

Have you considered the scenario where in subquery includes a qual like (tc.aid = 1)? When I tried executing those queries I receive different results. In my opinion, to prevent this, we should add filters for such quals within the loop 'foreach (lc, all_clauses)'

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ta
WHERE EXISTS (SELECT * FROM tb, tc WHERE ta.id = tb.id AND tc.aid = 1);
                              QUERY PLAN
----------------------------------------------------------------------
Hash Join (actual rows=1 loops=1)
   Hash Cond: (ta.id = tb.id)
   Buffers: local hit=3
   -> Seq Scan on ta (actual rows=3 loops=1)
         Buffers: local hit=1
   -> Hash (actual rows=3 loops=1)
         Buckets: 4096 Batches: 1 Memory Usage: 33kB
         Buffers: local hit=2
         -> HashAggregate (actual rows=3 loops=1)
               Group Key: tb.id
               Batches: 1 Memory Usage: 121kB
               Buffers: local hit=2
               -> Nested Loop (actual rows=3 loops=1)
                     Buffers: local hit=2
                     -> Seq Scan on tb (actual rows=3 loops=1)
                           Buffers: local hit=1
                     -> Materialize (actual rows=1 loops=3)
                           Storage: Memory Maximum Storage: 17kB
                           Buffers: local hit=1
                           -> Seq Scan on tc (actual rows=1 loops=1)
                                 Filter: (aid = 1)
                                 Rows Removed by Filter: 1
                                 Buffers: local hit=1
(23 rows)

============================

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id =
tb.id WHERE tc.aid = 1);
                                QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on ta (actual rows=1 loops=1)
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 2
   Buffers: local hit=6
   SubPlan 1
     -> Nested Loop (actual rows=0 loops=3)
           Buffers: local hit=5
           -> Index Only Scan using tb_pkey on tb (actual rows=0 loops=3)
                 Index Cond: (id = ta.id)
                 Heap Fetches: 1
                 Buffers: local hit=4
           -> Seq Scan on tc (actual rows=1 loops=1)
                 Filter: (aid = 1)
                 Buffers: local hit=1
(14 rows)

You are right, at the moment the code is not processed if there is a constant qual in the subquery (like t1.x1=1 in the example below) and this problem is not only related to the current patch.

For example you can get such a query plan if you complete this request to the master:

create table t (x int);
create table t1 (x1 int);
create table t2 (x2 int); EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT 1   FROM t  WHERE EXISTS (SELECT 1                  FROM t1                  where t1.x1 = 1);                   QUERY PLAN                   
------------------------------------------------ Result (actual rows=0 loops=1)   One-Time Filter: (InitPlan 1).col1   InitPlan 1     ->  Seq Scan on t1 (actual rows=0 loops=1)           Filter: (x1 = 1)   ->  Seq Scan on t (never executed)
(6 rows)

It's all because of the check in this function - this qual has levelsoup = 0, not 1 (see (!contain_vars_of_level(whereClause, 1)), but I already found out that by changing this, the logic of correction there is required a little more complicated. At the moment, I'm working to add this processing to the patch.

Thanks for the case!

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: pull-up subquery if JOIN-ON contains refs to upper-query

From

Alena Rybakina

Date:

11 February, 18:59:55

On 10.02.2025 23:51, Ilia Evdokimov wrote:
>
> On 09.02.2025 18:14, Alena Rybakina wrote:
>> Hi! I found another example where the transformation worked 
>> incorrectly and reconsidered the idea.
>>
>> As for conversion of exists_sublink_to_ANY, we need to get the 
>> flattened implicit-AND list of clauses and pull out the chunks of the 
>> WHERE clause that belong to the parent query,
>> since we are called halfway through the parent's 
>> preprocess_expression() and earlier steps of preprocess_expression() 
>> wouldn't get applied to the pulled-up stuff unless we do them here.
>> We also do some processing for vars depending on which side the var 
>> is on - if it's in a subquery, we only need to lower its level 
>> (varlevel) because subquery will be flatted, while
>> for other vars that belong to the parent query, we need to do 
>> preparation to pull up the sub-select into top range table.
>>
>> For those expressions that we couldn't assign to either list, we 
>> define newWhere and apply both cases.
>>
>
> When I run 'make -C contrib/ check', tests of postgres_fdw extension 
> failed. I might be wrong, but you should be careful with LIMIT.
>
Thank you for the review, I'm working on it.

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: pull-up subquery if JOIN-ON contains refs to upper-query

From

Alena Rybakina

Date:

02 April, 19:39:58

Hi!

My colleague reviewed my patch and gave feedback on how to improve it - 
for some queries with data types that I did not consider, pull-up is not 
applied, although it should. Some of them:

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
SELECT 1
   FROM ta
  WHERE EXISTS (SELECT 1
                  FROM tb
                  JOIN tc
                    ON ta.id = tb.id
                   AND tb.id = ANY('{1}'::int[])
               );

                                 QUERY PLAN
--------------------------------------------------------------------------
  Seq Scan on ta (actual rows=1.00 loops=1)
    Filter: EXISTS(SubPlan 1)
    Rows Removed by Filter: 1
    SubPlan 1
      ->  Nested Loop (actual rows=0.50 loops=2)
            ->  Seq Scan on tb (actual rows=0.50 loops=2)
                  Filter: ((id = ANY ('{1}'::integer[])) AND (ta.id = id))
                  Rows Removed by Filter: 2
            ->  Seq Scan on tc (actual rows=1.00 loops=1)

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
  SELECT 1
    FROM ta
   WHERE EXISTS (SELECT 1
                   FROM tb
                   JOIN tc
                     ON ta.id = tb.id
                    AND tb.is_active
                );
                        QUERY PLAN
---------------------------------------------------------
  Seq Scan on ta (actual rows=2.00 loops=1)
    Filter: EXISTS(SubPlan 1)
    SubPlan 1
      ->  Nested Loop (actual rows=1.00 loops=2)
            ->  Seq Scan on tb (actual rows=1.00 loops=2)
                  Filter: (is_active AND (ta.id = id))
                  Rows Removed by Filter: 0
            ->  Seq Scan on tc (actual rows=1.00 loops=2)

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
  SELECT 1
    FROM ta
   WHERE EXISTS (SELECT 1
                   FROM tb
                   JOIN tc
                     ON ta.id = tb.id
                    AND tb.is_active IS NOT NULL
                );

                                    QUERY PLAN
--------------------------------------------------------------------
  Seq Scan on ta (actual rows=2.00 loops=1)
    Filter: EXISTS(SubPlan 1)
     SubPlan 1
        ->  Nested Loop (actual rows=1.00 loops=2)
              ->  Seq Scan on tb (actual rows=1.00 loops=2)
                    Filter: ((is_active IS NOT NULL) AND (ta.id = id))
                    Rows Removed by Filter: 0
              ->  Seq Scan on tc (actual rows=1.00 loops=2)

UPDATE tb SET is_active = NULL WHERE id = 2;

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
  SELECT 1
    FROM ta
   WHERE EXISTS (SELECT 1
                   FROM tb
                   JOIN tc
                     ON ta.id = tb.id
                    AND tb.is_active IS NULL
                );
                               QUERY PLAN
----------------------------------------------------------------
  Seq Scan on ta (actual rows=1.00 loops=1)
    Filter: EXISTS(SubPlan 1)
    Rows Removed by Filter: 1
    SubPlan 1
      ->  Nested Loop (actual rows=0.50 loops=2)
            ->  Seq Scan on tb (actual rows=0.50 loops=2)
                  Filter: ((is_active IS NULL) AND (ta.id = id))
                  Rows Removed by Filter: 4
            ->  Seq Scan on tc (actual rows=1.00 loops=1)

I see that I need to add a walker that, when traversing the tree, 
determines whether there are conditions under which pull-up is 
impossible - the presence of
volatility of functions and other restrictions, and leave the 
transformation for the var objects that I added before, I described it 
here.

Unfortunately, I need a few days to implement this and need time for a 
review, and I think I will not have time to do this before the code 
freeze, so
I am moving this to the next commitfest and not changing the status 
"awaiting the author".

On 11.02.2025 18:59, Alena Rybakina wrote:
> On 10.02.2025 23:51, Ilia Evdokimov wrote:
>>
>> On 09.02.2025 18:14, Alena Rybakina wrote:
>>> Hi! I found another example where the transformation worked 
>>> incorrectly and reconsidered the idea.
>>>
>>> As for conversion of exists_sublink_to_ANY, we need to get the 
>>> flattened implicit-AND list of clauses and pull out the chunks of 
>>> the WHERE clause that belong to the parent query,
>>> since we are called halfway through the parent's 
>>> preprocess_expression() and earlier steps of preprocess_expression() 
>>> wouldn't get applied to the pulled-up stuff unless we do them here.
>>> We also do some processing for vars depending on which side the var 
>>> is on - if it's in a subquery, we only need to lower its level 
>>> (varlevel) because subquery will be flatted, while
>>> for other vars that belong to the parent query, we need to do 
>>> preparation to pull up the sub-select into top range table.
>>>
>>> For those expressions that we couldn't assign to either list, we 
>>> define newWhere and apply both cases.
>>>
>>
>> When I run 'make -C contrib/ check', tests of postgres_fdw extension 
>> failed. I might be wrong, but you should be careful with LIMIT.
>>
> Thank you for the review, I'm working on it.
>
Sorry for not responding, but I will fix this bug after I update the 
code based on the comments above. Thank you for noticing and writing to 
me, your feedback is very important.

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: pull-up subquery if JOIN-ON contains refs to upper-query

From

Ilia Evdokimov

Date:

03 April, 16:56:52

On 02.04.2025 19:39, Alena Rybakina wrote:
>
> I see that I need to add a walker that, when traversing the tree, 
> determines whether there are conditions under which pull-up is 
> impossible - the presence of
> volatility of functions and other restrictions, and leave the 
> transformation for the var objects that I added before, I described it 
> here.
>

I have some concerns about pulling up every clause from the subquery 
with one column. In particular, not every clause is safe or beneficial 
to pull up: OR-clauses, CASE expressions, nested sublinks could 
significantly change how the planner estimates the number of rows or 
applies filters, especially when they are not true join predicates. 
Pulling them up might lead to worse plans, or even change the semantics 
in subtle ways. I think before applying such transformations, we should 
make sure they are not only safe but actually improve the resulting plan.

Re: pull-up subquery if JOIN-ON contains refs to upper-query

From

Alena Rybakina

Date:

04 June, 13:40:07

Hi, all! I updated the patch and it looks nice. All the problems have been solved.

On 03.04.2025 16:56, Ilia Evdokimov wrote:

On 02.04.2025 19:39, Alena Rybakina wrote:

I see that I need to add a walker that, when traversing the tree, determines whether there are conditions under which pull-up is impossible - the presence of
volatility of functions and other restrictions, and leave the transformation for the var objects that I added before, I described it here.

I have some concerns about pulling up every clause from the subquery with one column. In particular, not every clause is safe or beneficial to pull up: OR-clauses, CASE expressions, nested sublinks could significantly change how the planner estimates the number of rows or applies filters, especially when they are not true join predicates. Pulling them up might lead to worse plans, or even change the semantics in subtle ways. I think before applying such transformations, we should make sure they are not only safe but actually improve the resulting plan.

There may indeed be cases where a query plan without pull-up is worse than with pull-up.

For example, as shown below, with pull-up we don't need to scan two tables and perform a join, since the subquery returns 0 rows (no matching tuples in the inner sequential scan in a parameterized Nested Loop).
However, this cannot be detected at the current planning stage - we simply don't have that information yet.

Do you have any ideas on how to solve this problem? So far, the only approach I see is to try an alternative plan but I'm still learning this.

For example:
create table t(x int);
create table t1(x int);
create table t2(x int);

insert into t2 select id from generate_series(20001,30000) as id;
insert into t1 select id from generate_series(10001,20000) as id;
insert into t select id from generate_series(1,10000) as id;
vacuum analyze;
explain analyze select * from t where exists (select * from t1 join t2 on t.x = t1.x);

with my patch:
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1500540.00..1500822.50 rows=10000 width=4) (actual time=70694.658..70694.662 rows=0.00 loops=1)
   Hash Cond: (t.x = t1.x)
   Buffers: shared hit=135
   -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual time=0.009..1.545 rows=10000.00 loops=1)
         Buffers: shared hit=45
   -> Hash (cost=1500415.00..1500415.00 rows=10000 width=4) (actual time=70690.524..70690.526 rows=10000.00 loops=1)
         Buckets: 16384 Batches: 1 Memory Usage: 480kB
         Buffers: shared hit=90
         -> HashAggregate (cost=1500315.00..1500415.00 rows=10000 width=4) (actual time=70683.143..70686.590 rows=10000.00 loops=1)
               Group Key: t1.x
               Batches: 1 Memory Usage: 793kB
               Buffers: shared hit=90
               -> Nested Loop (cost=0.00..1250315.00 rows=100000000 width=4) (actual time=0.019..25650.447 rows=100000000.00 loops=1)
                     Buffers: shared hit=90
                     -> Seq Scan on t1 (cost=0.00..145.00 rows=10000 width=4) (actual time=0.006..4.931 rows=10000.00 loops=1)
                           Buffers: shared hit=45
                     -> Materialize (cost=0.00..195.00 rows=10000 width=0) (actual time=0.000..0.875 rows=10000.00 loops=10000)
                           Storage: Memory Maximum Storage: 519kB
                           Buffers: shared hit=45
                           -> Seq Scan on t2 (cost=0.00..145.00 rows=10000 width=0) (actual time=0.007..1.246 rows=10000.00 loops=1)
                                 Buffers: shared hit=45
Planning:
   Buffers: shared hit=36 read=3
Planning Time: 0.375 ms
Execution Time: 70695.154 ms

without my patch:

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..309.30 rows=5738 width=4) (actual time=68268.562..68268.565 rows=0.00 loops=1)
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 10000
   Buffers: shared hit=900045
   SubPlan 1
     -> Nested Loop (cost=0.00..8524.27 rows=654075 width=0) (actual time=6.823..6.823 rows=0.00 loops=10000)
           Buffers: shared hit=900000
           -> Seq Scan on t2 (cost=0.00..159.75 rows=11475 width=0) (actual time=0.011..1.660 rows=10000.00 loops=10000)
                 Buffers: shared hit=450000
           -> Materialize (cost=0.00..188.72 rows=57 width=0) (actual time=0.000..0.000 rows=0.00 loops=100000000)
                 Storage: Memory Maximum Storage: 17kB
                 Buffers: shared hit=450000
                 -> Seq Scan on t1 (cost=0.00..188.44 rows=57 width=0) (actual time=2.403..2.403 rows=0.00 loops=10000)
                       Filter: (t.x = x)
                       Rows Removed by Filter: 10000
                       Buffers: shared hit=450000
Planning:
   Buffers: shared hit=40 read=16
Planning Time: 0.487 ms
Execution Time: 68268.600 ms

-- 
Regards,
Alena Rybakina
Postgres Professional

Attachment

v4-0001-Teach-the-planner-to-convert-EXISTS-and-NOT-EXISTS-s.patch