Re: Pull up sublink of type 'NOT NOT (expr)' - Mailing list pgsql-hackers

From Richard Guo
Subject Re: Pull up sublink of type 'NOT NOT (expr)'
Date
Msg-id CAN_9JTyyi0Q=Mis5T2KL0QCSAhBJ0aGqnZm2hYjQ0E25bnTU=Q@mail.gmail.com
Whole thread Raw
In response to Re: Pull up sublink of type 'NOT NOT (expr)'  (Alexey Bashtanov <bashtanov@imap.cc>)
List pgsql-hackers
Hi Alex,

Yes hashed SubPlan preserves order and may be faster than hash join in some
cases. But I don't think that is a reason good enough to prevent the subplan
from being converted to join.

Let's suppose the subplan is uncorrelated, otherwise hashed SubPlan would not
be used. Hashed SubPlan can only be applied when the size of the subquery
result can fit in work_mem. When the data size increases or work_mem is set to
a smaller value that more than one batch is needed, hashed SubPlan will not be
used and the total cost will increase dramatically. Hash join, by contrast,
handles multiple batches better.

Below is an example to show the behavior of hash join and hashed SubPlan for
single batch and multiple batches.

create table a (i int);
create table b (i int);
insert into a select i from generate_series(1,10000)i;
insert into b select i from generate_series(1,10000)i;

For single batch:

Hash Join
gpadmin=# explain analyze select * from a where a.i in (select b.i from b);
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=270.00..552.50 rows=10000 width=4) (actual time=4.332..9.499 rows=10000 loops=1)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.016..1.328 rows=10000 loops=1)
   ->  Hash  (cost=145.00..145.00 rows=10000 width=4) (actual time=4.292..4.292 rows=10000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 480kB
         ->  Seq Scan on b  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.013..1.817 rows=10000 loops=1)
 Planning Time: 0.236 ms
 Execution Time: 10.099 ms
(8 rows)

Hashed SubPlan
gpadmin=# explain analyze select * from a where not not a.i in (select b.i from b);
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=170.00..340.00 rows=5000 width=4) (actual time=6.359..11.843 rows=10000 loops=1)
   Filter: (hashed SubPlan 1)
   SubPlan 1
     ->  Seq Scan on b  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.017..1.749 rows=10000 loops=1)
 Planning Time: 0.109 ms
 Execution Time: 12.905 ms
(6 rows)

insert into b select i from generate_series(1,100000)i;
insert into b select i from generate_series(1,100000)i;


For multiple batches:

Hash Join
gpadmin=# explain analyze select * from a where a.i in (select b.i from b);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=6476.00..7659.50 rows=10000 width=4) (actual time=89.915..121.016 rows=10000 loops=1)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.020..1.779 rows=10000 loops=1)
   ->  Hash  (cost=3030.00..3030.00 rows=210000 width=4) (actual time=89.666..89.667 rows=210000 loops=1)
         Buckets: 131072  Batches: 4  Memory Usage: 2860kB
         ->  Seq Scan on b  (cost=0.00..3030.00 rows=210000 width=4) (actual time=0.015..37.686 rows=210000 loops=1)
 Planning Time: 0.256 ms
 Execution Time: 121.769 ms
(8 rows)

Plain SubPlan
gpadmin=# explain analyze select * from a where not not a.i in (select b.i from b);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=0.00..27130170.00 rows=5000 width=4) (actual time=0.030..9994.036 rows=10000 loops=1)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Materialize  (cost=0.00..4901.00 rows=210000 width=4) (actual time=0.000..0.359 rows=5000 loops=10000)
           ->  Seq Scan on b  (cost=0.00..3030.00 rows=210000 width=4) (actual time=0.010..1.673 rows=10000 loops=1)
 Planning Time: 0.077 ms
 Execution Time: 9995.556 ms
(7 rows)

Thanks
Richard

On Wed, Oct 24, 2018 at 12:39 AM, Alexey Bashtanov <bashtanov@imap.cc> wrote:
Hello Richard,

Currently for quals in the form of "NOT NOT (SubLink)", this SubLink would not
be considered when pulling up sublinks. For instance:

gpadmin=# explain select * from a where NOT NOT (a.i in (select b.i from b));
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on a  (cost=51.50..85.62 rows=1005 width=8)
   Filter: (hashed SubPlan 1)
   SubPlan 1
     ->  Seq Scan on b  (cost=0.00..44.00 rows=3000 width=4)
(4 rows)


Should we give it a chance, like the attached does?

Sometimes hashed subplan is faster than hash join and than all the other options, as it preserves the order.
Using NOT NOT, one can control whether to use it or not:
https://pgblog.bashtanov.com/2017/12/08/double-negative-and-query-performance/ (test case and results in the bottom of the page).

Surely dirty tricks should not be the way to control the planner, but when breaking them we should probably provide a way to achieve the same result,
ideally making the planner choose the best plan without hints.

Best,
  Alex

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Alter index rename concurrently to
Next
From: Shay Rojansky
Date:
Subject: UNLISTEN, DISCARD ALL and readonly standby