Thread: subquery vs join on 7.4.5

subquery vs join on 7.4.5

From

David Haas

Date:

23 February 2005, 09:34:16

Hi -

This is based on a discussion I was having with neilc on IRC.  He
suggested I post it here.  Sorry for the length - I'm including
everything he requested

I'm comparing the speeds of the following two queries on 7.4.5.  I was
curious why query 1 was faster than query 2:

query 1:
Select layer_number
FROM batch_report_index
WHERE device_id = (SELECT device_id FROM device_index WHERE device_name
='CP8M')
AND technology_id = (SELECT technology_id FROM technology_index WHERE
technology_name = 'G12');

query 2:
Select b.layer_number
FROM batch_report_index b, device_index d, technology_index t
WHERE b.device_id = d.device_id
AND b.technology_id = t.technology_id
AND d.device_name = 'CP8M'
AND t.technology_name = 'G12';


Here were my first runs:
(query 1 explain analyze)
  Seq Scan on batch_report_index  (cost=6.05..12370.66 rows=83 width=4)
(actual time=19.274..1903.110 rows=61416 loops=1)
    Filter: ((device_id = $0) AND (technology_id = $1))
    InitPlan
      ->  Index Scan using device_index_device_name_key on device_index
(cost=0.00..4.88 rows=1 width=4) (actual time=0.310..0.320 rows=1
loops=1)
            Index Cond: (device_name = 'CP8M'::text)
      ->  Seq Scan on technology_index  (cost=0.00..1.18 rows=1 width=4)
(actual time=0.117..0.149 rows=1 loops=1)
            Filter: (technology_name = 'G12'::text)
  Total runtime: 1947.896 ms
(8 rows)


(query 2 explain analyze)
  Hash Join  (cost=6.06..12380.70 rows=46 width=4) (actual
time=35.509..2831.685 rows=61416 loops=1)
    Hash Cond: ("outer".technology_id = "inner".technology_id)
    ->  Hash Join  (cost=4.88..12375.87 rows=638 width=8) (actual
time=34.584..2448.862 rows=61416 loops=1)
          Hash Cond: ("outer".device_id = "inner".device_id)
          ->  Seq Scan on batch_report_index b  (cost=0.00..10182.74
rows=436374 width=12) (actual time=0.100..1373.085 rows=436374 loops=1)
          ->  Hash  (cost=4.88..4.88 rows=1 width=4) (actual
time=0.635..0.635 rows=0 loops=1)
                ->  Index Scan using device_index_device_name_key on
device_index d  (cost=0.00..4.88 rows=1 width=4) (actual
time=0.505..0.520 rows=1 loops=1)
                      Index Cond: (device_name = 'CP8M'::text)
    ->  Hash  (cost=1.18..1.18 rows=1 width=4) (actual time=0.348..0.348
rows=0 loops=1)
          ->  Seq Scan on technology_index t  (cost=0.00..1.18 rows=1
width=4) (actual time=0.198..0.239 rows=1 loops=1)
                Filter: (technology_name = 'G12'::text)
  Total runtime: 2872.252 ms
(12 rows)


On neilc's suggestion, I did a vacuum analyze, then turned off hash
joins.  Here's query 2, no hash joins:

(query 2 explain analyze)
  Nested Loop  (cost=0.00..15651.44 rows=46 width=4) (actual
time=22.079..2741.103 rows=61416 loops=1)
    Join Filter: ("inner".technology_id = "outer".technology_id)
    ->  Seq Scan on technology_index t  (cost=0.00..1.18 rows=1 width=4)
(actual time=0.178..0.218 rows=1 loops=1)
          Filter: (technology_name = 'G12'::text)
    ->  Nested Loop  (cost=0.00..15642.29 rows=638 width=8) (actual
time=21.792..2530.470 rows=61416 loops=1)
          Join Filter: ("inner".device_id = "outer".device_id)
          ->  Index Scan using device_index_device_name_key on
device_index d  (cost=0.00..4.88 rows=1 width=4) (actual
time=0.331..0.346 rows=1 loops=1)
                Index Cond: (device_name = 'CP8M'::text)
          ->  Seq Scan on batch_report_index b  (cost=0.00..10182.74
rows=436374 width=12) (actual time=0.070..1437.938 rows=436374 loops=1)
  Total runtime: 2782.628 ms
(10 rows)

He then suggested I turn hash_joins back on and put an index on the
batch_report_table's device_id.  Here's query 2 again:

(query 2 explain analyze)
Hash Join  (cost=1.18..2389.06 rows=46 width=4) (actual
time=1.562..2473.554 rows=61416 loops=1)
    Hash Cond: ("outer".technology_id = "inner".technology_id)
    ->  Nested Loop  (cost=0.00..2384.24 rows=638 width=8) (actual
time=0.747..2140.160 rows=61416 loops=1)
          ->  Index Scan using device_index_device_name_key on
device_index d  (cost=0.00..4.88 rows=1 width=4) (actual
time=0.423..0.435 rows=1 loops=1)
                Index Cond: (device_name = 'CP8M'::text)
          ->  Index Scan using b_r_device_index on batch_report_index b
(cost=0.00..2365.82 rows=1083 width=12) (actual time=0.288..1868.118
rows=61416 loops=1)
                Index Cond: (b.device_id = "outer".device_id)
    ->  Hash  (cost=1.18..1.18 rows=1 width=4) (actual time=0.359..0.359
rows=0 loops=1)
          ->  Seq Scan on technology_index t  (cost=0.00..1.18 rows=1
width=4) (actual time=0.198..0.237 rows=1 loops=1)
                Filter: (technology_name = 'G12'::text)
  Total runtime: 2515.950 ms
(11 rows)

He then suggested I increase the statistics on batch_report_index & run
the query again.  I "set statistics" for both
the device_id and technology_id column to 900, vacuum analyzed, and
re-ran the query (it's still slower than query
1 run after the same contortions ):

(query 2 explain analyze)
  Hash Join  (cost=1.18..1608.49 rows=46 width=4) (actual
time=1.437..1499.414 rows=61416 loops=1)
    Hash Cond: ("outer".technology_id = "inner".technology_id)
    ->  Nested Loop  (cost=0.00..1603.66 rows=638 width=8) (actual
time=0.613..1185.826 rows=61416 loops=1)
          ->  Index Scan using device_index_device_name_key on
device_index d  (cost=0.00..4.88 rows=1 width=4) (actual
time=0.246..0.259 rows=1 loops=1)
                Index Cond: (device_name = 'CP8M'::text)
          ->  Index Scan using b_r_device_index on batch_report_index b
(cost=0.00..1589.93 rows=708 width=12) (actual time=0.324..928.888
rows=61416 loops=1)
                Index Cond: (b.device_id = "outer".device_id)
    ->  Hash  (cost=1.18..1.18 rows=1 width=4) (actual time=0.358..0.358
rows=0 loops=1)
          ->  Seq Scan on technology_index t  (cost=0.00..1.18 rows=1
width=4) (actual time=0.196..0.238 rows=1 loops=1)
                Filter: (technology_name = 'G12'::text)
  Total runtime: 1538.302 ms

At this point, he said "send it to the -perform mailing list".  So here
I am.

The relevant table schemas as of right now:
reg=# \d batch_report_index
                                        Table "public.batch_report_index"
      Column      |          Type          |
  Modifiers
-----------------+------------------------
+---------------------------------------------------------------------
  batch_report_id | integer                | not null default
nextval('"batch_report__batch_report__seq"'::text)
  lot             | character varying(16)  |
  tool_id         | integer                | not null
  technology_id   | integer                | not null
  device_id       | integer                | not null
  reticle_id      | integer                | not null
  layer_id        | integer                | not null
  layer_number    | integer                | not null
  image_id        | integer                |
  start_date      | date                   |
  start_time      | time without time zone |
  stop_date       | date                   |
  stop_time       | time without time zone |
  in_system       | boolean                | default false
Indexes:
     "batch_report_index_pkey" primary key, btree (batch_report_id)
     "b_r_device_index" btree (device_id)
     "batch_report_stop_date_indexing" btree (stop_date)
     "batch_report_tool_id_index" btree (tool_id)

reg=# \d technology_index
                                  Table "public.technology_index"
      Column      |  Type   |                              Modifiers
-----------------+---------
+---------------------------------------------------------------------
  technology_id   | integer | not null default
nextval('"technology_in_technology_id_seq"'::text)
  technology_name | text    | not null
Indexes:
     "technology_index_pkey" primary key, btree (technology_id)
     "technology_in_technology_na_key" unique, btree (technology_name)

reg=# \d device_index
                               Table "public.device_index"
    Column    |  Type   |                           Modifiers
-------------+---------
+----------------------------------------------------------------
  device_id   | integer | not null default
nextval('"device_index_device_id_seq"'::text)
  device_name | text    | not null
Indexes:
     "device_index_pkey" primary key, btree (device_id)
     "device_index_device_name_key" unique, btree (device_name)

Thanks for the great work y'all do!

Re: subquery vs join on 7.4.5

From

Tom Lane

Date:

23 February 2005, 19:55:38

David Haas <dave@modelpredictivesystems.com> writes:
> I'm comparing the speeds of the following two queries on 7.4.5.  I was
> curious why query 1 was faster than query 2:

> query 1:
> Select layer_number
> FROM batch_report_index
> WHERE device_id = (SELECT device_id FROM device_index WHERE device_name
> ='CP8M')
> AND technology_id = (SELECT technology_id FROM technology_index WHERE
> technology_name = 'G12');

> query 2:
> Select b.layer_number
> FROM batch_report_index b, device_index d, technology_index t
> WHERE b.device_id = d.device_id
> AND b.technology_id = t.technology_id
> AND d.device_name = 'CP8M'
> AND t.technology_name = 'G12';

Why didn't you try a two-column index on batch_report_index(device_id,
technology_id) ?

Whether this would actually be better than a seqscan I'm not sure, given
the large number of matching rows.  But the planner would surely try it
given that it's drastically underestimating that number :-(

            regards, tom lane