Home > mailing lists

Thread: query plan question, nested loop vs hash join

query plan question, nested loop vs hash join

From

Andrey Lizenko

Date:

07 October 2014, 22:27:31

Hi,

I have similar problem as in

http://www.postgresql.org/message-id/flat/52B311C4.1070108@gmail.com#52B311C4.1070108@gmail.com

server version is 9.3.4

Here is only two quite simple tables:

db_new=# \d activities_example

Table "public.activities_example"

Column | Type | Modifiers

----------------+---------+-----------

id | integer |

order_chain_id | integer |

Indexes:

"activities_example_idx" btree (order_chain_id)

db_new=# \d orders_example

Table "public.orders_example"

Column | Type | Modifiers

--------+---------+-----------

id | integer |

Number of rows as below:

db_new=# select count(*) from activities_example ;

count

---------

3059965

db_new=# select count(*) from orders_example ;

count

-------

19038

db_new=# select count(*) from activities_example where order_chain_id in (select id from orders_example);

count

-------

91426

(1 row)

and I can see that planner uses hashjoin with all enabled options and nested loop with disabled parameter:

db_new=# explain analyze select * from activities_example where order_chain_id in (select id from orders_example);

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------

Hash Semi Join (cost=513.36..57547.59 rows=89551 width=8) (actual time=18.340..966.367 rows=91426 loops=1)

Hash Cond: (activities_example.order_chain_id = orders_example.id)

-> Seq Scan on activities_example (cost=0.00..44139.65 rows=3059965 width=8) (actual time=0.018..294.216 rows=3059965 loops=1)

-> Hash (cost=275.38..275.38 rows=19038 width=4) (actual time=5.458..5.458 rows=19038 loops=1)

Buckets: 2048 Batches: 1 Memory Usage: 670kB

-> Seq Scan on orders_example (cost=0.00..275.38 rows=19038 width=4) (actual time=0.015..2.308 rows=19038 loops=1)

Total runtime: 970.234 ms

(7 rows)

db_new=# set enable_hashjoin = off;

SET

db_new=# explain analyze select * from activities_example where order_chain_id in (select id from orders_example);

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=1629.09..166451.01 rows=89551 width=8) (actual time=16.091..116.476 rows=91426 loops=1)

-> Unique (cost=1628.66..1723.85 rows=19038 width=4) (actual time=15.929..23.156 rows=19038 loops=1)

-> Sort (cost=1628.66..1676.25 rows=19038 width=4) (actual time=15.892..19.884 rows=19038 loops=1)

Sort Key: orders_example.id

Sort Method: external sort Disk: 264kB

-> Seq Scan on orders_example (cost=0.00..275.38 rows=19038 width=4) (actual time=0.015..2.747 rows=19038 loops=1)

-> Index Scan using activities_example_idx on activities_example (cost=0.43..8.60 rows=5 width=8) (actual time=0.002..0.004 rows=5 loops=19038)

Index Cond: (order_chain_id = orders_example.id)

Total runtime: 121.366 ms

(9 rows)

second runtime is much more quicker.

What is the reason of "Seq Scan on activities_example" in the first case?

Is it possible to force optimizer choose the second plan without doing "set enable_hashjoin = off;" ?

Increasing of 'effective_cache_size' leads to similar thing with mergejoin,
other options (work_mem, shared_buffers. etc) do not change anything.

Thanks in advance.

Regards, Andrey Lizenko

Re: query plan question, nested loop vs hash join

From

Marti Raudsepp

Date:

07 October 2014, 23:27:20

On Fri, Oct 3, 2014 at 6:38 PM, Andrey Lizenko <lizenko79@gmail.com> wrote:
> Is it possible to force optimizer choose the second plan without doing  "set
> enable_hashjoin = off;" ?
>
> Increasing  of 'effective_cache_size' leads to similar thing with mergejoin,
> other options (work_mem, shared_buffers. etc) do not change anything.

Have you tried changing random_page_cost?

In small databases where most of the data is cached anyway, lowering
random_page_cost to somewhere between 1 and 2 usually leads to better
planner decisions.

Regards,
Marti

Re: query plan question, nested loop vs hash join

From

Andrey Lizenko

Date:

08 October 2014, 12:49:02

Thanks for your reply, Marti, as I answered to Tom couple of days ago adjusting of 'effective_cache_size' to 80% of RAM and 'random_page_cost' from 2 to 1 helped me.

On 8 October 2014 00:26, Marti Raudsepp <marti@juffo.org> wrote:

On Fri, Oct 3, 2014 at 6:38 PM, Andrey Lizenko <lizenko79@gmail.com> wrote:
> Is it possible to force optimizer choose the second plan without doing "set
> enable_hashjoin = off;" ?
>
> Increasing of 'effective_cache_size' leads to similar thing with mergejoin,
> other options (work_mem, shared_buffers. etc) do not change anything.

Have you tried changing random_page_cost?

In small databases where most of the data is cached anyway, lowering
random_page_cost to somewhere between 1 and 2 usually leads to better
planner decisions.

Regards,
Marti

С уважением, Андрей Лизенко