Thread: can't get rid of unnesesary SORT step in explain plan for hash join

can't get rid of unnesesary SORT step in explain plan for hash join

From
Alexey Nalbat
Date:
Hello.

I need to make some sql-statement to be executed as fast as possible. :) My database consists of:
1) table of categories having 1'000 rows, 2) table of manufacturers having 1'000 rows, 3) table of
resellers having 1'000 rows, 4) table of products having 1'000'000 rows. In the products table there
are aproximately 1'000 products per category; as well as per manufacturer and per reseller. Here
are these tables:

postgres=# create table categories ( c_id int4, c_name varchar(32) );
CREATE
postgres=# create table manufacturers ( m_id int4, m_name varchar(32) );
CREATE
postgres=# create table resellers ( r_id int4, r_name varchar(32) );
CREATE
postgres=# create table products ( p_id int4, c_id int4, m_id int4, r_id int4, p_name varchar(32) );
CREATE

I want to get category identifiers (c_id) for those categories which have products manufactured by
manufacturer with identifier (m_id) equal to 123 and reselled by companies which names (r_name)
start with 'a'. In order to make this query faster I created two indicies:

postgres=# create index products_mcr on products ( m_id, c_id, r_id );
CREATE
postgres=# create index resellers_n on resellers ( r_name );
CREATE

I tried several sql-statements which can be used for my purpose, because they have different
execution plans. They are listed below according with their plans. Each #a query is different from
the corresponding # one in that it has fictive "order by m_id, c_id" clause. This clause does not
change the result of a query, but in some way says the optimiser not to use "Sort" step. It works for
the 2a (SubPlan), 3a (SubPlan) and 4a (Nested Loop) queries, but not for the 1a (Hash Join).
And I want to use this query, because practice results show, that it is the fastest one. At my
computer these queries took: 1a) 1sec, 2a) 16sec, 3a) 3sec, 4a) 3sec.

So, my question is: how can I get rid of this unnesesary "Sort" step in the execution plan for hash join?

Thanks in advance.

P.S.: Excuse me for bad english.

+++
+++

1) select distinct c_id from products, ( select r_id from resellers where r_name like 'a%' ) as temp where m_id = 123
andproducts.r_id = temp.r_id;
 

Unique  (cost=16.83..16.83 rows=1 width=12) ->  Sort  (cost=16.83..16.83 rows=1 width=12)       ->  Hash Join
(cost=8.16..16.82rows=1 width=12)             ->  Index Scan using products_mcr on products  (cost=0.00..8.14 rows=10
width=8)            ->  Hash  (cost=8.14..8.14 rows=10 width=4)                   ->  Index Scan using resellers_n on
resellers (cost=0.00..8.14 rows=10 width=4)
 

1a) select distinct c_id, m_id from products, ( select r_id from resellers where r_name like 'a%' ) as temp where m_id
=123 and products.r_id = temp.r_id order by m_id, c_id;
 

Unique  (cost=16.83..16.83 rows=1 width=12) ->  Sort  (cost=16.83..16.83 rows=1 width=12)       ->  Hash Join
(cost=8.16..16.82rows=1 width=12)             ->  Index Scan using products_mcr on products  (cost=0.00..8.14 rows=10
width=8)            ->  Hash  (cost=8.14..8.14 rows=10 width=4)                   ->  Index Scan using resellers_n on
resellers (cost=0.00..8.14 rows=10 width=4)
 

2) select distinct c_id from products where m_id = 123 and r_id in ( select r_id from resellers where r_name like 'a%'
);

Unique  (cost=89.68..89.71 rows=1 width=4) ->  Sort  (cost=89.68..89.68 rows=10 width=4)       ->  Index Scan using
products_mcron products  (cost=0.00..89.52 rows=10 width=4)             SubPlan               ->  Materialize
(cost=8.14..8.14rows=10 width=4)                     ->  Index Scan using resellers_n on resellers  (cost=0.00..8.14
rows=10width=4)
 

2a) select distinct c_id, m_id from products where m_id = 123 and r_id in ( select r_id from resellers where r_name
like'a%' ) order by m_id, c_id;
 

Unique  (cost=0.00..89.57 rows=1 width=8) ->  Index Scan using products_mcr on products  (cost=0.00..89.52 rows=10
width=8)      SubPlan         ->  Materialize  (cost=8.14..8.14 rows=10 width=4)               ->  Index Scan using
resellers_non resellers  (cost=0.00..8.14 rows=10 width=4)
 

3) select distinct c_id from products where m_id = 123 and exists ( select r_id from resellers where r_name like 'a%'
andr_id = products.r_id );
 

Unique  (cost=89.91..89.93 rows=1 width=4) ->  Sort  (cost=89.91..89.91 rows=10 width=4)       ->  Index Scan using
products_mcron products  (cost=0.00..89.74 rows=10 width=4)             SubPlan               ->  Index Scan using
resellers_non resellers  (cost=0.00..8.16 rows=1 width=4)
 

3a) select distinct c_id, m_id from products where m_id = 123 and exists ( select r_id from resellers where r_name like
'a%'and r_id = products.r_id ) order by m_id, c_id;
 

Unique  (cost=0.00..89.79 rows=1 width=8) ->  Index Scan using products_mcr on products  (cost=0.00..89.74 rows=10
width=8)      SubPlan         ->  Index Scan using resellers_n on resellers  (cost=0.00..8.16 rows=1 width=4)
 

4) set enable_hashjoin = off; set enable_mergejoin = off; select distinct c_id from products, ( select r_id from
resellerswhere r_name like 'a%' ) as temp where m_id = 123 and products.r_id = temp.r_id;
 

Unique  (cost=90.75..90.76 rows=1 width=12) ->  Sort  (cost=90.75..90.75 rows=1 width=12)       ->  Nested Loop
(cost=0.00..90.74rows=1 width=12)             ->  Index Scan using products_mcr on products  (cost=0.00..8.14 rows=10
width=8)            ->  Index Scan using resellers_n on resellers  (cost=0.00..8.14 rows=10 width=4)
 

4a) set enable_hashjoin = off; set enable_mergejoin = off; select distinct c_id, m_id from products, ( select r_id from
resellerswhere r_name like 'a%' ) as temp where m_id = 123 and products.r_id = temp.r_id order by m_id, c_id;
 

Unique  (cost=0.00..90.75 rows=1 width=16) ->  Nested Loop  (cost=0.00..90.74 rows=1 width=16)       ->  Index Scan
usingproducts_mcr on products  (cost=0.00..8.14 rows=10 width=12)       ->  Index Scan using resellers_n on resellers
(cost=0.00..8.14rows=10 width=4)
 

-- 

WBR, Alexey Nalbat


Alexey Nalbat <alexey@price.ru> writes:
> So, my question is: how can I get rid of this unnesesary "Sort" step
> in the execution plan for hash join?

You can't, because it's not unnecessary.  Hash join doesn't promise
to produce its outputs in any particular order.  But the Unique
filter needs to see its inputs in order by the fields being made
unique.
        regards, tom lane


Alexey Nalbat <alexey@price.ru> writes:
> While executing this query postgres at first creates hash on table
> "resellers", then get from index "products_mcr" for rows with
> "m_id=123" already ordered (!!!)  pairs "c_id,r_id", for each that
> pair it checks join condition using hash. If postgers behaves like
> this then the result of this hash join is already sorted, and extra
> "Sort" is not needed.

No, because the hash step might choose to divide its processing into
multiple "batches", thereby destroying the sort order.  The outer path's
ordering is preserved only in relatively small test cases...
        regards, tom lane


Re: can't get rid of unnesesary SORT step in explain plan for hash join

From
Alexey Nalbat
Date:
On Sun, 13 May 2001, Tom Lane wrote:
> Alexey Nalbat <alexey@price.ru> writes:
> > So, my question is: how can I get rid of this unnesesary "Sort" step
> > in the execution plan for hash join?
> 
> You can't, because it's not unnecessary.  Hash join doesn't promise
> to produce its outputs in any particular order.  But the Unique
> filter needs to see its inputs in order by the fields being made
> unique.

Hello.

Tom, thanks for your answer. But I am not agree with it.

Here is the query of my interest:
+ select distinct c_id, m_id from products, ( select r_id from resellers where r_name like 'a%' ) as temp where m_id =
123and products.r_id = temp.r_id order by m_id, c_id;
 
+ 
+ Unique  (cost=16.83..16.83 rows=1 width=12)
+   ->  Sort  (cost=16.83..16.83 rows=1 width=12)
+     ->  Hash Join  (cost=8.16..16.82 rows=1 width=12)
+       ->  Index Scan using products_mcr on products  (cost=0.00..8.14 rows=10 width=8)
+       ->  Hash  (cost=8.14..8.14 rows=10 width=4)
+         ->  Index Scan using resellers_n on resellers  (cost=0.00..8.14 rows=10 width=4)

While executing this query postgres at first creates hash on table "resellers",
then get from index "products_mcr" for rows with "m_id=123" already ordered (!!!)
pairs "c_id,r_id", for each that pair it checks join condition using hash. If postgers
behaves like this then the result of this hash join is already sorted, and extra
"Sort" is not needed.

In fact query without "distinct" and "order by" clauses returns ordered values
of "c_id". I mean this query:
+ select c_id from products, ( select r_id from resellers where r_name like 'a%' ) as temp where m_id = 123 and
products.r_id= temp.r_id" returns ordered values of "c_id;
 
+ 
+ Hash Join  (cost=8.16..16.82 rows=1 width=12)
+   ->  Index Scan using products_mcr on products  (cost=0.00..8.14 rows=10 width=8)
+   ->  Hash  (cost=8.14..8.14 rows=10 width=4)
+     ->  Index Scan using resellers_n on resellers  (cost=0.00..8.14 rows=10 width=4)

And one more argument. We are now transferring our DB from Oracle to Postgres,
and in oracle this query does not have "Sort" in explain_plan, it has only "Unique":
+ select distinct c_id from products where m_id = 123 and r_id in ( select r_id from resellers where r_name like 'a%' )
orderby c_id
 
+ 
+ SORT, UNIQUE
+    HASH JOIN,
+       INDEX, RANGE SCAN PRODUCTS_MCR
+       TABLE ACCESS, FULL RESELLERS

So I suppose, that this "Sort" step can be removed from the execution plan. What
do you think about this?

Thanks a lot.

-- 

WBR, Alexey Nalbat