Home > mailing lists

Re: Pull up sublink of type 'NOT NOT (expr)' - Mailing list pgsql-hackers

From	Richard Guo
Subject	Re: Pull up sublink of type 'NOT NOT (expr)'
Date	October 25, 2018 07:42:05
Msg-id	CAN_9JTyyi0Q=Mis5T2KL0QCSAhBJ0aGqnZm2hYjQ0E25bnTU=Q@mail.gmail.com Whole thread
In response to	Re: Pull up sublink of type 'NOT NOT (expr)' (Alexey Bashtanov <bashtanov@imap.cc>)
List	pgsql-hackers

Tree view

Hi Alex,

Yes hashed SubPlan preserves order and may be faster than hash join in some

cases. But I don't think that is a reason good enough to prevent the subplan

from being converted to join.

Let's suppose the subplan is uncorrelated, otherwise hashed SubPlan would not

be used. Hashed SubPlan can only be applied when the size of the subquery

result can fit in work_mem. When the data size increases or work_mem is set to

a smaller value that more than one batch is needed, hashed SubPlan will not be

used and the total cost will increase dramatically. Hash join, by contrast,

handles multiple batches better.

Below is an example to show the behavior of hash join and hashed SubPlan for

single batch and multiple batches.

create table a (i int);

create table b (i int);

insert into a select i from generate_series(1,10000)i;

insert into b select i from generate_series(1,10000)i;

For single batch:

Hash Join

gpadmin=# explain analyze select * from a where a.i in (select b.i from b);

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------

Hash Semi Join (cost=270.00..552.50 rows=10000 width=4) (actual time=4.332..9.499 rows=10000 loops=1)

Hash Cond: (a.i = b.i)

-> Seq Scan on a (cost=0.00..145.00 rows=10000 width=4) (actual time=0.016..1.328 rows=10000 loops=1)

-> Hash (cost=145.00..145.00 rows=10000 width=4) (actual time=4.292..4.292 rows=10000 loops=1)

Buckets: 16384 Batches: 1 Memory Usage: 480kB

-> Seq Scan on b (cost=0.00..145.00 rows=10000 width=4) (actual time=0.013..1.817 rows=10000 loops=1)

Planning Time: 0.236 ms

Execution Time: 10.099 ms

(8 rows)

Hashed SubPlan

gpadmin=# explain analyze select * from a where not not a.i in (select b.i from b);

QUERY PLAN

-------------------------------------------------------------------------------------------------------------

Seq Scan on a (cost=170.00..340.00 rows=5000 width=4) (actual time=6.359..11.843 rows=10000 loops=1)

Filter: (hashed SubPlan 1)

SubPlan 1

-> Seq Scan on b (cost=0.00..145.00 rows=10000 width=4) (actual time=0.017..1.749 rows=10000 loops=1)

Planning Time: 0.109 ms

Execution Time: 12.905 ms

(6 rows)

insert into b select i from generate_series(1,100000)i;

insert into b select i from generate_series(1,100000)i;

For multiple batches:

Hash Join

gpadmin=# explain analyze select * from a where a.i in (select b.i from b);

                                                     QUERY PLAN

---------------------------------------------------------------------------------------------------------------------

 Hash Semi Join  (cost=6476.00..7659.50 rows=10000 width=4) (actual time=89.915..121.016 rows=10000 loops=1)

   Hash Cond: (a.i = b.i)

   ->  Seq Scan on a  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.020..1.779 rows=10000 loops=1)

   ->  Hash  (cost=3030.00..3030.00 rows=210000 width=4) (actual time=89.666..89.667 rows=210000 loops=1)

         Buckets: 131072  Batches: 4  Memory Usage: 2860kB

         ->  Seq Scan on b  (cost=0.00..3030.00 rows=210000 width=4) (actual time=0.015..37.686 rows=210000 loops=1)

 Planning Time: 0.256 ms

 Execution Time: 121.769 ms

(8 rows)

Plain SubPlan

gpadmin=# explain analyze select * from a where not not a.i in (select b.i from b);

                                                     QUERY PLAN

---------------------------------------------------------------------------------------------------------------------

 Seq Scan on a  (cost=0.00..27130170.00 rows=5000 width=4) (actual time=0.030..9994.036 rows=10000 loops=1)

   Filter: (SubPlan 1)

   SubPlan 1

     ->  Materialize  (cost=0.00..4901.00 rows=210000 width=4) (actual time=0.000..0.359 rows=5000 loops=10000)

           ->  Seq Scan on b  (cost=0.00..3030.00 rows=210000 width=4) (actual time=0.010..1.673 rows=10000 loops=1)

 Planning Time: 0.077 ms

 Execution Time: 9995.556 ms

(7 rows)

Thanks

Richard

On Wed, Oct 24, 2018 at 12:39 AM, Alexey Bashtanov <bashtanov@imap.cc> wrote:

Hello Richard,

Currently for quals in the form of "NOT NOT (SubLink)", this SubLink would not
be considered when pulling up sublinks. For instance:

gpadmin=# explain select * from a where NOT NOT (a.i in (select b.i from b));
QUERY PLAN
-------------------------------------------------------------
Seq Scan on a (cost=51.50..85.62 rows=1005 width=8)
Filter: (hashed SubPlan 1)
SubPlan 1
-> Seq Scan on b (cost=0.00..44.00 rows=3000 width=4)
(4 rows)

Should we give it a chance, like the attached does?

Sometimes hashed subplan is faster than hash join and than all the other options, as it preserves the order.
Using NOT NOT, one can control whether to use it or not:
https://pgblog.bashtanov.com/2017/12/08/double-negative-and-query-performance/ (test case and results in the bottom of the page).

Surely dirty tricks should not be the way to control the planner, but when breaking them we should probably provide a way to achieve the same result,
ideally making the planner choose the best plan without hints.

Best,
Alex

pgsql-hackers by date:

From: Peter Eisentraut
Date: 25 October 2018, 07:41:13
Subject: Re: Alter index rename concurrently to

From: Shay Rojansky
Date: 25 October 2018, 07:53:14
Subject: UNLISTEN, DISCARD ALL and readonly standby

Re: Pull up sublink of type 'NOT NOT (expr)' - Mailing list pgsql-hackers

Previous

Next