Thread: Join problem

Join problem

From
Silke Trissl
Date:
Hi,

I came across a very intriguing thing:

I had to join two tables and in both tables I wanted to restrict the
result set by some (text/varchar) attributes.

Here is an example:

Table "item" # 147 000 entries

     Column     |         Type          | Modifiers
---------------+-----------------------+------------
  id            | integer               | not null
  description   | text                  |
  comment       | text                  | not null
  order_id      | integer               |


Table "orders" # 210 000 entries
       Column      |          Type          | Modifiers
-----------------+------------------------+-----------
  order_id        | integer                |
  order_name      | character varying(255) |


The tables have 147 000 and 210 000 entries, respectively.

First I tried the following query, which took ages:

(Query 1)
EXPLAIN ANALYZE
SELECT item.id
FROM item, orders
WHERE orders.order_name ~* 'Smit'
   AND item.description ~* 'CD'
   and orders.order_id = item.order_id;



I found out, that the change of the operator from '~*' to '=' for the
item.description brought a great boost in performance (425 secs to 1
sec!), but not in cost (Query plans at the end).

(Query 2)
  EXPLAIN ANALYZE
SELECT item.id
FROM item, orders
WHERE orders.order_name ~* 'Smit'
   AND item.description = 'CD'
   and orders.order_id = item.order_id;


The main difference was that Query 2 used the Hash join instead of the
Nested Loop, so I disabled the option 'NESTED LOOP' and got for Query 1
a similar time as for Query 2.


Can anyone tell me, why in one case the Hash join and in the other the
much worse Nested Loop is prefered?
And my second question is, is there any possibility to execute the first
query without disabling the Nested Loop first, but get the good
performance of the Hash join?


Many thanks in advance for your help or suggestions

Silke


QUERY PLANS:

#####################################

Query 1:
   QUERY PLAN
------------------------------------------------------------------------------------------------------------------

  Nested Loop  (cost=0.00..28836.75 rows=1 width=4) (actual
time=65350.780..452130.702 rows=6 loops=1)
    Join Filter: ("inner".order_id = "outer".order_id)
    ->  Seq Scan on item  (cost=0.00..28814.24 rows=1 width=8) (actual
time=33.180..1365.190 rows=716 loops=1)
          Filter: (description ~* 'CD'::text)
    ->  Seq Scan on orders  (cost=0.00..22.50 rows=1 width=4) (actual
time=21.644..629.500 rows=18 loops=716)
          Filter: ((order_name)::text ~* 'Smith'::text)
  Total runtime: 452130.782 ms
###########################################################################

Query 2:
   QUERY PLAN
------------------------------------------------------------------------------------------------------------------

   Hash Join  (cost=22.50..28840.44 rows=4 width=4) (actual
time=1187.798..1187.798 rows=0 loops=1)
    Hash Cond: ("outer".order_id = "inner".order_id)
    ->  Seq Scan on item  (cost=0.00..28814.24 rows=733 width=8) (actual
time=542.737..542.737 rows=0 loops=1)
          Filter: (description = 'CD'::text)
    ->  Hash  (cost=22.50..22.50 rows=1 width=4) (actual
time=645.042..645.042 rows=0 loops=1)
          ->  Seq Scan on orders  (cost=0.00..22.50 rows=1 width=4)
(actual time=22.373..644.996 rows=18 loops=1)
                Filter: ((order_name)::text ~* 'Smith'::text)
  Total runtime: 1187.865 ms
############################################################################


Query 1 with  'set enable_nestloop to false'

    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
  Hash Join  (cost=22.50..28836.75 rows=1 width=4) (actual
time=1068.593..2003.330 rows=6 loops=1)
    Hash Cond: ("outer".item_id = "inner".item_id)
    ->  Seq Scan on item  (cost=0.00..28814.24 rows=1 width=8) (actual
time=33.347..1357.073 rows=716 loops=1)
          Filter: (description ~* 'CD'::text)
    ->  Hash  (cost=22.50..22.50 rows=1 width=4) (actual
time=645.287..645.287 rows=0 loops=1)
          ->  Seq Scan on orders  (cost=0.00..22.50 rows=1 width=4)
(actual time=22.212..645.239 rows=18 loops=1)
                Filter: ((order_name)::text ~* 'CD'::text)
  Total runtime: 2003.409 ms


Re: Join problem

From
Suller András
Date:
these two queries are not equal. Query1 returns 6 rows, query2 returns 0
rows, because '~*' and '=' operators are not same. BTW when you use '=',
it could use index on "item.description".
On query1, "Seq Scan on item" estimates 1 row, on query2 it estimates
733 rows. IMHO that's why query1 uses nested loop, query2 uses hash join.

bye,
Suller Andras

Silke Trissl írta:

> Hi,
>
> Query 1:
>   QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------
>
>  Nested Loop  (cost=0.00..28836.75 rows=1 width=4) (actual
> time=65350.780..452130.702 rows=6 loops=1)
>    Join Filter: ("inner".order_id = "outer".order_id)
>    ->  Seq Scan on item  (cost=0.00..28814.24 rows=1 width=8) (actual
> time=33.180..1365.190 rows=716 loops=1)
>          Filter: (description ~* 'CD'::text)
>    ->  Seq Scan on orders  (cost=0.00..22.50 rows=1 width=4) (actual
> time=21.644..629.500 rows=18 loops=716)
>          Filter: ((order_name)::text ~* 'Smith'::text)
>  Total runtime: 452130.782 ms
> ###########################################################################
>
>
> Query 2:
>   QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------
>
>   Hash Join  (cost=22.50..28840.44 rows=4 width=4) (actual
> time=1187.798..1187.798 rows=0 loops=1)
>    Hash Cond: ("outer".order_id = "inner".order_id)
>    ->  Seq Scan on item  (cost=0.00..28814.24 rows=733 width=8)
> (actual time=542.737..542.737 rows=0 loops=1)
>          Filter: (description = 'CD'::text)
>    ->  Hash  (cost=22.50..22.50 rows=1 width=4) (actual
> time=645.042..645.042 rows=0 loops=1)
>          ->  Seq Scan on orders  (cost=0.00..22.50 rows=1 width=4)
> (actual time=22.373..644.996 rows=18 loops=1)
>                Filter: ((order_name)::text ~* 'Smith'::text)
>  Total runtime: 1187.865 ms
> ############################################################################
>
>
>
> Query 1 with  'set enable_nestloop to false'
>
>    QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------- 
>
>  Hash Join  (cost=22.50..28836.75 rows=1 width=4) (actual
> time=1068.593..2003.330 rows=6 loops=1)
>    Hash Cond: ("outer".item_id = "inner".item_id)
>    ->  Seq Scan on item  (cost=0.00..28814.24 rows=1 width=8) (actual
> time=33.347..1357.073 rows=716 loops=1)
>          Filter: (description ~* 'CD'::text)
>    ->  Hash  (cost=22.50..22.50 rows=1 width=4) (actual
> time=645.287..645.287 rows=0 loops=1)
>          ->  Seq Scan on orders  (cost=0.00..22.50 rows=1 width=4)
> (actual time=22.212..645.239 rows=18 loops=1)
>                Filter: ((order_name)::text ~* 'CD'::text)
>  Total runtime: 2003.409 ms




Re: Join problem

From
Tom Lane
Date:
Silke Trissl <trissl@informatik.hu-berlin.de> writes:
> I found out, that the change of the operator from '~*' to '=' for the
> item.description brought a great boost in performance (425 secs to 1
> sec!), but not in cost (Query plans at the end).

The main problem seems to be bad estimation of the number of rows
extracted from the item table.  Have you ANALYZEd that table lately?

            regards, tom lane