A new strategy for pull-up correlated ANY_SUBLINK - Mailing list pgsql-hackers

From Andy Fan
Subject A new strategy for pull-up correlated ANY_SUBLINK
Date
Msg-id CAKU4AWoZksNZ4VR-fLTdwmiR91WU8qViDBNQKNwY=7iyo+uV0w@mail.gmail.com
Whole thread Raw
Responses Re: A new strategy for pull-up correlated ANY_SUBLINK
Re: A new strategy for pull-up correlated ANY_SUBLINK
List pgsql-hackers
In the past we pull-up the ANY-sublink with 2 steps, the first step is to
pull up the sublink as a subquery, and the next step is to pull up the
subquery if it is allowed.  The benefits of this method are obvious,
pulling up the subquery has more requirements, even if we can just finish
the first step, we still get huge benefits. However the bad stuff happens
if varlevelsup = 1 involves, things fail at step 1.

convert_ANY_sublink_to_join ...

    if (contain_vars_of_level((Node *) subselect, 1))
        return NULL; 

In this patch we distinguish the above case and try to pull-up it within
one step if it is helpful, It looks to me that what we need to do is just
transform it to EXIST-SUBLINK.

The only change is transforming the format of SUBLINK, so outer-join /
pull-up as semi-join is unrelated, so the correctness should not be an
issue.

I can help with the following query very much.

master:
explain (costs off, analyze) select * from tenk1 t1
where hundred in (select hundred from tenk2 t2
                  where t2.odd = t1.odd
                  and even in (select even from tenk1 t3
                               where t3.fivethous = t2.fivethous))
and even > 0;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Seq Scan on tenk1 t1 (actual time=0.023..234.955 rows=10000 loops=1)
   Filter: ((even > 0) AND (SubPlan 2))
   SubPlan 2
     ->  Seq Scan on tenk2 t2 (actual time=0.023..0.023 rows=1 loops=10000)
           Filter: ((odd = t1.odd) AND (SubPlan 1))
           Rows Removed by Filter: 94
           SubPlan 1
             ->  Seq Scan on tenk1 t3 (actual time=0.011..0.011 rows=1 loops=10000)
                   Filter: (fivethous = t2.fivethous)
                   Rows Removed by Filter: 94
 Planning Time: 0.169 ms
 Execution Time: 235.488 ms
(12 rows)

patched:

explain (costs off, analyze) select * from tenk1 t1
where hundred in (select hundred from tenk2 t2
                  where t2.odd = t1.odd
                  and even in (select even from tenk1 t3
                               where t3.fivethous = t2.fivethous))
and even > 0;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Hash Join (actual time=13.102..17.676 rows=10000 loops=1)
   Hash Cond: ((t1.odd = t2.odd) AND (t1.hundred = t2.hundred))
   ->  Seq Scan on tenk1 t1 (actual time=0.014..1.702 rows=10000 loops=1)
         Filter: (even > 0)
   ->  Hash (actual time=13.080..13.082 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  HashAggregate (actual time=13.041..13.060 rows=100 loops=1)
               Group Key: t2.odd, t2.hundred
               Batches: 1  Memory Usage: 73kB
               ->  Hash Join (actual time=8.044..11.296 rows=10000 loops=1)
                     Hash Cond: ((t3.fivethous = t2.fivethous) AND (t3.even = t2.even))
                     ->  HashAggregate (actual time=4.054..4.804 rows=5000 loops=1)
                           Group Key: t3.fivethous, t3.even
                           Batches: 1  Memory Usage: 465kB
                           ->  Seq Scan on tenk1 t3 (actual time=0.002..0.862 rows=10000 loops=1)
                     ->  Hash (actual time=3.962..3.962 rows=10000 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 597kB
                           ->  Seq Scan on tenk2 t2 (actual time=0.004..2.289 rows=10000 loops=1)
 Planning Time: 0.426 ms
 Execution Time: 18.129 ms
(20 rows)

The execution time is 33ms (patched) VS 235ms (master).
The planning time is 0.426ms (patched) VS 0.169ms (master).

I think the extra planning time comes from the search space increasing a
lot and that's where the better plan comes.

I used below queries to measure how much effort we made but got nothing:
run twice in 1 session and just count the second planning time.

explain (costs off, analyze) select * from tenk1 t1
where
(hundred, odd) in (select hundred, odd from tenk2 t2
                  where (even, fivethous) in
                  (select even, fivethous from tenk1 t3));


psql regression -f 1.sql | grep 'Planning Time' | tail -1

master:

Planning Time: 0.430 ms
Planning Time: 0.551 ms
Planning Time: 0.316 ms
Planning Time: 0.342 ms
Planning Time: 0.390 ms

patched:

Planning Time: 0.405 ms
Planning Time: 0.406 ms
Planning Time: 0.433 ms
Planning Time: 0.371 ms
Planning Time: 0.425 ms


I think this can show us the extra planning effort is pretty low. 

This topic has been raised many times, at least at [1] [2]. and even MySQL
can support some simple but common cases. I think we can do something
helpful as well.  Any feedback is welcome.

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Prefetch the next tuple's memory during seqscans
Next
From: Andrey Lepikhov
Date:
Subject: Re: A new strategy for pull-up correlated ANY_SUBLINK