Thread: Re: Exists pull-up application with JoinExpr

Re: Exists pull-up application with JoinExpr

From
Alena Rybakina
Date:
On 24.12.2024 13:25, Ranier Vilela wrote:
Hi Alena.

Em ter., 24 de dez. de 2024 às 01:44, Alena Rybakina <a.rybakina@postgrespro.ru> escreveu:

Hi, hackers!

I found one pull-up that works if the inner join condition is written through the where condition,

create temp table ta (id int primary key, val int);
insert into ta values(1,1);
insert into ta values(2,2);
insert into ta values(3,3);

create temp table tb (id int primary key, aval int);
insert into tb values(4,1);
insert into tb values(5,1);
insert into tb values(1,2);

create temp table tc (id int primary key, aid int);
insert into tc values(6,1);
insert into tc values(7,2);
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT *   FROM ta  WHERE EXISTS (SELECT *                  FROM tb, tc                  WHERE ta.id = tb.id);                               QUERY PLAN                                
------------------------------------------------------------------------- Nested Loop Semi Join (actual rows=1 loops=1)   Buffers: local hit=6   ->  Seq Scan on ta (actual rows=3 loops=1)         Buffers: local hit=1   ->  Nested Loop (actual rows=0 loops=3)         Buffers: local hit=5         ->  Index Only Scan using tb_pkey on tb (actual rows=0 loops=3)               Index Cond: (id = ta.id)               Heap Fetches: 1               Buffers: local hit=4         ->  Seq Scan on tc (actual rows=1 loops=1)               Buffers: local hit=1 Planning:   Buffers: shared hit=67 read=12
(14 rows)

but it doesn't work if it is written through the outside condition.

alena@postgres=# EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT *   FROM ta  WHERE EXISTS (SELECT *                  FROM tb JOIN tc                  ON ta.id = tb.id);                      QUERY PLAN                      
------------------------------------------------------ Seq Scan on ta (actual rows=1 loops=1)   Filter: EXISTS(SubPlan 1)   Rows Removed by Filter: 2   Buffers: local hit=5   SubPlan 1     ->  Nested Loop (actual rows=0 loops=3)           Buffers: local hit=4           ->  Seq Scan on tb (actual rows=0 loops=3)                 Filter: (ta.id = id)                 Rows Removed by Filter: 3                 Buffers: local hit=3           ->  Seq Scan on tc (actual rows=1 loops=1)                 Buffers: local hit=1 Planning:   Buffers: shared hit=16 read=9
(15 rows)

I have written a patch to add this functionality and now it gives an query plan: alena@postgres=# EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
 SELECT *
   FROM ta
  WHERE EXISTS (SELECT *
                  FROM tb JOIN tc
                  ON ta.id = tb.id);
                     QUERY PLAN                                
-------------------------------------------------------------------------
 Nested Loop Semi Join (actual rows=1 loops=1)
   Buffers: local hit=6
   ->  Seq Scan on ta (actual rows=3 loops=1)
         Buffers: local hit=1
   ->  Nested Loop (actual rows=0 loops=3)
         Buffers: local hit=5
         ->  Index Only Scan using tb_pkey on tb (actual rows=0 loops=3)
               Index Cond: (id = ta.id)
               Heap Fetches: 1
               Buffers: local hit=4
         ->  Seq Scan on tc (actual rows=1 loops=1)
               Buffers: local hit=1
(12 rows)

tb and tc form a Cartesian product, but in the case of the intersection condition with tuples from the table ta (ta.id = tb.id). So, according to the join condition, tb intersects only with 1, and only it gets into the result, but at the same time they appear twice - this is because of the Cartesian product of tb with tc
How it works:

I rewrote the code a bit so that it considers not only the quals in jointree->quals, but also those in join expression (subselect->jointree->fromlist). If they satisfy the conditions for using pull up, I add them to the list of clauses and form a "Bool" expression from them, joined by an "AND" operation.

I took a look at this patch and I did a little polishing on it.

And I believe that in testing, you need to set it to BUFFERS OFF,
because of the recent change made to ANALYZE.

The tests are failing, like this: 
QUERY PLAN
 -------------------------------------------------------------------------
 Nested Loop Semi Join (actual rows=2 loops=1)
+ Buffers: local hit=7
 -> Seq Scan on ta (actual rows=2 loops=1)
+ Buffers: local hit=1
 -> Nested Loop (actual rows=1 loops=2)
+ Buffers: local hit=6
 -> Index Only Scan using tb_pkey on tb (actual rows=1 loops=2)
 Index Cond: (id = ta.id)
 Heap Fetches: 2
+ Buffers: local hit=4
 -> Seq Scan on tc (actual rows=1 loops=2)
-(7 rows)
+ Buffers: local hit=2
+(12 rows)

Yes, you are right) Thank you for your interest to this thread)
-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Exists pull-up application with JoinExpr

From
Ilia Evdokimov
Date:
Hi Alena,

Thank you for your work on subqueries with JOIN.

Have you considered the scenario where in subquery includes a qual like 
(tc.aid = 1)? When I tried executing those queries I receive different 
results. In my opinion, to prevent this, we should add filters for such 
quals within the loop 'foreach (lc, all_clauses)'

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ta
WHERE EXISTS (SELECT * FROM tb, tc WHERE ta.id = tb.id AND tc.aid = 1);
                               QUERY PLAN
----------------------------------------------------------------------
  Hash Join (actual rows=1 loops=1)
    Hash Cond: (ta.id = tb.id)
    Buffers: local hit=3
    ->  Seq Scan on ta (actual rows=3 loops=1)
          Buffers: local hit=1
    ->  Hash (actual rows=3 loops=1)
          Buckets: 4096  Batches: 1  Memory Usage: 33kB
          Buffers: local hit=2
          ->  HashAggregate (actual rows=3 loops=1)
                Group Key: tb.id
                Batches: 1  Memory Usage: 121kB
                Buffers: local hit=2
                ->  Nested Loop (actual rows=3 loops=1)
                      Buffers: local hit=2
                      ->  Seq Scan on tb (actual rows=3 loops=1)
                            Buffers: local hit=1
                      ->  Materialize (actual rows=1 loops=3)
                            Storage: Memory  Maximum Storage: 17kB
                            Buffers: local hit=1
                            ->  Seq Scan on tc (actual rows=1 loops=1)
                                  Filter: (aid = 1)
                                  Rows Removed by Filter: 1
                                  Buffers: local hit=1
(23 rows)

============================

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id =
tb.id WHERE tc.aid = 1);
                                 QUERY PLAN
---------------------------------------------------------------------------
  Seq Scan on ta (actual rows=1 loops=1)
    Filter: EXISTS(SubPlan 1)
    Rows Removed by Filter: 2
    Buffers: local hit=6
    SubPlan 1
      ->  Nested Loop (actual rows=0 loops=3)
            Buffers: local hit=5
            ->  Index Only Scan using tb_pkey on tb (actual rows=0 loops=3)
                  Index Cond: (id = ta.id)
                  Heap Fetches: 1
                  Buffers: local hit=4
            ->  Seq Scan on tc (actual rows=1 loops=1)
                  Filter: (aid = 1)
                  Buffers: local hit=1
(14 rows)

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.




Re: Exists pull-up application with JoinExpr

From
Alena Rybakina
Date:

Hi! Thank you for your interest to this subject!

On 27.12.2024 15:53, Ilia Evdokimov wrote:
Hi Alena,

Thank you for your work on subqueries with JOIN.

Have you considered the scenario where in subquery includes a qual like (tc.aid = 1)? When I tried executing those queries I receive different results. In my opinion, to prevent this, we should add filters for such quals within the loop 'foreach (lc, all_clauses)'

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ta
WHERE EXISTS (SELECT * FROM tb, tc WHERE ta.id = tb.id AND tc.aid = 1);
                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join (actual rows=1 loops=1)
   Hash Cond: (ta.id = tb.id)
   Buffers: local hit=3
   ->  Seq Scan on ta (actual rows=3 loops=1)
         Buffers: local hit=1
   ->  Hash (actual rows=3 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 33kB
         Buffers: local hit=2
         ->  HashAggregate (actual rows=3 loops=1)
               Group Key: tb.id
               Batches: 1  Memory Usage: 121kB
               Buffers: local hit=2
               ->  Nested Loop (actual rows=3 loops=1)
                     Buffers: local hit=2
                     ->  Seq Scan on tb (actual rows=3 loops=1)
                           Buffers: local hit=1
                     ->  Materialize (actual rows=1 loops=3)
                           Storage: Memory  Maximum Storage: 17kB
                           Buffers: local hit=1
                           ->  Seq Scan on tc (actual rows=1 loops=1)
                                 Filter: (aid = 1)
                                 Rows Removed by Filter: 1
                                 Buffers: local hit=1
(23 rows)

============================

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id =
tb.id WHERE tc.aid = 1);
                                QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on ta (actual rows=1 loops=1)
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 2
   Buffers: local hit=6
   SubPlan 1
     ->  Nested Loop (actual rows=0 loops=3)
           Buffers: local hit=5
           ->  Index Only Scan using tb_pkey on tb (actual rows=0 loops=3)
                 Index Cond: (id = ta.id)
                 Heap Fetches: 1
                 Buffers: local hit=4
           ->  Seq Scan on tc (actual rows=1 loops=1)
                 Filter: (aid = 1)
                 Buffers: local hit=1
(14 rows)


You are right, at the moment the code is not processed if there is a constant qual in the subquery (like t1.x1=1 in the example below) and this problem is not only related to the current patch. 

For example you can get such a query plan if you complete this request to the master:

create table t (x int);
create table t1 (x1 int);
create table t2 (x2 int); EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT 1   FROM t  WHERE EXISTS (SELECT 1                  FROM t1                  where t1.x1 = 1);                   QUERY PLAN                   
------------------------------------------------ Result (actual rows=0 loops=1)   One-Time Filter: (InitPlan 1).col1   InitPlan 1     ->  Seq Scan on t1 (actual rows=0 loops=1)           Filter: (x1 = 1)   ->  Seq Scan on t (never executed)
(6 rows)

It's all because of the check in this function - this qual has levelsoup = 0, not 1 (see (!contain_vars_of_level(whereClause, 1)), but I already found out that by changing this, the logic of correction there is required a little more complicated. At the moment, I'm working to add this processing to the patch. 

Thanks for the case!

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: pull-up subquery if JOIN-ON contains refs to upper-query

From
Alena Rybakina
Date:
On 10.02.2025 23:51, Ilia Evdokimov wrote:
>
> On 09.02.2025 18:14, Alena Rybakina wrote:
>> Hi! I found another example where the transformation worked 
>> incorrectly and reconsidered the idea.
>>
>> As for conversion of exists_sublink_to_ANY, we need to get the 
>> flattened implicit-AND list of clauses and pull out the chunks of the 
>> WHERE clause that belong to the parent query,
>> since we are called halfway through the parent's 
>> preprocess_expression() and earlier steps of preprocess_expression() 
>> wouldn't get applied to the pulled-up stuff unless we do them here.
>> We also do some processing for vars depending on which side the var 
>> is on - if it's in a subquery, we only need to lower its level 
>> (varlevel) because subquery will be flatted, while
>> for other vars that belong to the parent query, we need to do 
>> preparation to pull up the sub-select into top range table.
>>
>> For those expressions that we couldn't assign to either list, we 
>> define newWhere and apply both cases.
>>
>
> When I run 'make -C contrib/ check', tests of postgres_fdw extension 
> failed. I might be wrong, but you should be careful with LIMIT.
>
Thank you for the review, I'm working on it.

-- 
Regards,
Alena Rybakina
Postgres Professional