join vs. subquery - Mailing list pgsql-performance
From | David Haas |
---|---|
Subject | join vs. subquery |
Date | |
Msg-id | a852bfe3b5d0e6099606d553291b674f@modelpredictivesystems.com Whole thread Raw |
List | pgsql-performance |
Hi - This is based on a discussion I was having with neilc on IRC. He suggested I post it here. Sorry for the length - I'm including everything he requested. I'm comparing the speeds of the following two queries. I was curious why query 1 was faster than query 2: query 1: Select layer_number FROM batch_report_index WHERE device_id = (SELECT device_id FROM device_index WHERE device_name ='CP8M') AND technology_id = (SELECT technology_id FROM technology_index WHERE technology_name = 'G12'); query 2: Select b.layer_number FROM batch_report_index b, device_index d, technology_index t WHERE b.device_id = d.device_id AND b.technology_id = t.technology_id AND d.device_name = 'CP8M' AND t.technology_name = 'G12'; Here were my first runs: (query 1 explain analyze) Seq Scan on batch_report_index (cost=6.05..12370.66 rows=83 width=4) (actual time=19.274..1903.110 rows=61416 loops=1) Filter: ((device_id = $0) AND (technology_id = $1)) InitPlan -> Index Scan using device_index_device_name_key on device_index (cost=0.00..4.88 rows=1 width=4) (actual time=0.310..0.320 rows=1 loops=1) Index Cond: (device_name = 'CP8M'::text) -> Seq Scan on technology_index (cost=0.00..1.18 rows=1 width=4) (actual time=0.117..0.149 rows=1 loops=1) Filter: (technology_name = 'G12'::text) Total runtime: 1947.896 ms (8 rows) (query 2 explain analyze) Hash Join (cost=6.06..12380.70 rows=46 width=4) (actual time=35.509..2831.685 rows=61416 loops=1) Hash Cond: ("outer".technology_id = "inner".technology_id) -> Hash Join (cost=4.88..12375.87 rows=638 width=8) (actual time=34.584..2448.862 rows=61416 loops=1) Hash Cond: ("outer".device_id = "inner".device_id) -> Seq Scan on batch_report_index b (cost=0.00..10182.74 rows=436374 width=12) (actual time=0.100..1373.085 rows=436374 loops=1) -> Hash (cost=4.88..4.88 rows=1 width=4) (actual time=0.635..0.635 rows=0 loops=1) -> Index Scan using device_index_device_name_key on device_index d (cost=0.00..4.88 rows=1 width=4) (actual time=0.505..0.520 rows=1 loops=1) Index Cond: (device_name = 'CP8M'::text) -> Hash (cost=1.18..1.18 rows=1 width=4) (actual time=0.348..0.348 rows=0 loops=1) -> Seq Scan on technology_index t (cost=0.00..1.18 rows=1 width=4) (actual time=0.198..0.239 rows=1 loops=1) Filter: (technology_name = 'G12'::text) Total runtime: 2872.252 ms (12 rows) On neilc's suggestion, I did a vacuum analyze, then turned off hash joins. Here's query 2, no hash joins: (query 2 explain analyze) Nested Loop (cost=0.00..15651.44 rows=46 width=4) (actual time=22.079..2741.103 rows=61416 loops=1) Join Filter: ("inner".technology_id = "outer".technology_id) -> Seq Scan on technology_index t (cost=0.00..1.18 rows=1 width=4) (actual time=0.178..0.218 rows=1 loops=1) Filter: (technology_name = 'G12'::text) -> Nested Loop (cost=0.00..15642.29 rows=638 width=8) (actual time=21.792..2530.470 rows=61416 loops=1) Join Filter: ("inner".device_id = "outer".device_id) -> Index Scan using device_index_device_name_key on device_index d (cost=0.00..4.88 rows=1 width=4) (actual time=0.331..0.346 rows=1 loops=1) Index Cond: (device_name = 'CP8M'::text) -> Seq Scan on batch_report_index b (cost=0.00..10182.74 rows=436374 width=12) (actual time=0.070..1437.938 rows=436374 loops=1) Total runtime: 2782.628 ms (10 rows) He then suggested I turn hash_joins back on and put an index on the batch_report_table's device_id. Here's query 2 again: (query 2 explain analyze) Hash Join (cost=1.18..2389.06 rows=46 width=4) (actual time=1.562..2473.554 rows=61416 loops=1) Hash Cond: ("outer".technology_id = "inner".technology_id) -> Nested Loop (cost=0.00..2384.24 rows=638 width=8) (actual time=0.747..2140.160 rows=61416 loops=1) -> Index Scan using device_index_device_name_key on device_index d (cost=0.00..4.88 rows=1 width=4) (actual time=0.423..0.435 rows=1 loops=1) Index Cond: (device_name = 'CP8M'::text) -> Index Scan using b_r_device_index on batch_report_index b (cost=0.00..2365.82 rows=1083 width=12) (actual time=0.288..1868.118 rows=61416 loops=1) Index Cond: (b.device_id = "outer".device_id) -> Hash (cost=1.18..1.18 rows=1 width=4) (actual time=0.359..0.359 rows=0 loops=1) -> Seq Scan on technology_index t (cost=0.00..1.18 rows=1 width=4) (actual time=0.198..0.237 rows=1 loops=1) Filter: (technology_name = 'G12'::text) Total runtime: 2515.950 ms (11 rows) He then suggested I increase the statistics on batch_report_index & run the query again. I "set statistics" for both the device_id and technology_id column to 900, vacuum analyzed, and re-ran the query (it's still slower than query 1 run after the same contortions ): (query 2 explain analyze) Hash Join (cost=1.18..1608.49 rows=46 width=4) (actual time=1.437..1499.414 rows=61416 loops=1) Hash Cond: ("outer".technology_id = "inner".technology_id) -> Nested Loop (cost=0.00..1603.66 rows=638 width=8) (actual time=0.613..1185.826 rows=61416 loops=1) -> Index Scan using device_index_device_name_key on device_index d (cost=0.00..4.88 rows=1 width=4) (actual time=0.246..0.259 rows=1 loops=1) Index Cond: (device_name = 'CP8M'::text) -> Index Scan using b_r_device_index on batch_report_index b (cost=0.00..1589.93 rows=708 width=12) (actual time=0.324..928.888 rows=61416 loops=1) Index Cond: (b.device_id = "outer".device_id) -> Hash (cost=1.18..1.18 rows=1 width=4) (actual time=0.358..0.358 rows=0 loops=1) -> Seq Scan on technology_index t (cost=0.00..1.18 rows=1 width=4) (actual time=0.196..0.238 rows=1 loops=1) Filter: (technology_name = 'G12'::text) Total runtime: 1538.302 ms At this point, he said "send it to the -perform mailing list". So here I am. The relevant table schemas as of right now: reg=# \d batch_report_index Table "public.batch_report_index" Column | Type | Modifiers -----------------+------------------------ +--------------------------------------------------------------------- batch_report_id | integer | not null default nextval('"batch_report__batch_report__seq"'::text) lot | character varying(16) | tool_id | integer | not null technology_id | integer | not null device_id | integer | not null reticle_id | integer | not null layer_id | integer | not null layer_number | integer | not null image_id | integer | start_date | date | start_time | time without time zone | stop_date | date | stop_time | time without time zone | in_system | boolean | default false Indexes: "batch_report_index_pkey" primary key, btree (batch_report_id) "b_r_device_index" btree (device_id) "batch_report_stop_date_indexing" btree (stop_date) "batch_report_tool_id_index" btree (tool_id) reg=# \d technology_index Table "public.technology_index" Column | Type | Modifiers -----------------+--------- +--------------------------------------------------------------------- technology_id | integer | not null default nextval('"technology_in_technology_id_seq"'::text) technology_name | text | not null Indexes: "technology_index_pkey" primary key, btree (technology_id) "technology_in_technology_na_key" unique, btree (technology_name) reg=# \d device_index Table "public.device_index" Column | Type | Modifiers -------------+--------- +---------------------------------------------------------------- device_id | integer | not null default nextval('"device_index_device_id_seq"'::text) device_name | text | not null Indexes: "device_index_pkey" primary key, btree (device_id) "device_index_device_name_key" unique, btree (device_name) Thanks for the great work y'all do!
pgsql-performance by date: