simple join query runs very slowly - Mailing list pgsql-general
From | rdnews@dahlsys.com (Roger Dahl) |
---|---|
Subject | simple join query runs very slowly |
Date | |
Msg-id | db3559c4.0403251728.1f38a59a@posting.google.com Whole thread Raw |
Responses |
Re: simple join query runs very slowly
Re: simple join query runs very slowly Re: simple join query runs very slowly |
List | pgsql-general |
Hello everyone, I'm having trouble with a slow query and I wonder if anyone can help me speed it up. I have read the PostgreSQL documentation and FAQ but have been unable to find anything to help me out. I have three tables: items, boxes and item_box_maps. The items table holds ids and item names. The boxes table holds ids and box names. The item_box_maps table holds box and item ids that "map" items to boxes. To register a box as containing a specific item, I just add a record in item_box_maps with the appropriate box name id and item name id. I have used this approach on SQL Server many times and have always gotten excellent performance. To find out which boxes contain a specific item, I would write a query like this: select b.name, from boxes b inner join item_box_maps m on m.box_id = b.id inner join items i on m.id = m.item_id where i.name = 'hammer'; I've set up a PostgreSQL database with around 2000 boxes, 2000 items and 80.000 mappings. The query above will typically return 40 records (names of boxes) for a given item name. I'm used to this query running "instantaneously" but on my PostgreSQL server this query takes over one second. I have btree indexes on everything and have vacuum analyze'ed everything. EXPLAIN gives me the following output on the query above: Hash Join (cost=5.00..3608.36 rows=14 width=114) (actual time=572.76..1043.98 rows=7 loops=1) Hash Cond: ("outer".item_id = "inner".id) -> Merge Join (cost=0.00..3371.16 rows=30938 width=110) (actual time=472.93..969.62 rows=27978 loops=1) Merge Cond: ("outer".id = "inner".box_id) -> Index Scan using boxes_pkey on boxes b (cost=0.00..1621.40 rows=2076 width=102) (actual time=2.95..26.0\ 8 rows=2076 loops=1) -> Index Scan using item_box_maps_idx_box_id on item_box_maps m (cost=0.00..1233.31 rows=51384 wi\ dth=8) (actual time=24.17..580.09 rows=78544 loops=1) -> Hash (cost=5.00..5.00 rows=1 width=4) (actual time=0.38..0.38 rows=0 loops=1) -> Index Scan using items_idx_item on items i (cost=0.00..5.00 rows=1 width=4) (actual time=0.35..0.37 rows=1 l\ oops=1) Index Cond: (item = 'hammer'::text) Total runtime: 1044.35 msec I don't have any experience interpreting these, but to my untrained eye, everything looks fine. I then tried rewriting the query to use subqueries: select b.name from boxes b where b.id in ( select box_id from item_box_maps where item_id in ( select id from items where item = 'hammer' ) ); This one runs only slightly faster at around 800ms and gives the following EXPLAIN output: Seq Scan on boxes b (cost=100000000.00..208235168085.56 rows=1038 width=98) (actual time=768.62..812.63 rows=7 loops=1) Filter: (subplan) SubPlan -> Materialize (cost=100257787.86..100257787.86 rows=25692 width=4) (actual time=0.35..0.36 rows=7 loops=2076) -> Seq Scan on item_box_maps (cost=100000000.00..100257787.86 rows=25692 width=4) (actual time=626.02..723\ .87 rows=7 loops=1) Filter: (subplan) SubPlan -> Materialize (cost=5.00..5.00 rows=1 width=4) (actual time=0.00..0.00 rows=1 loops=78544) -> Index Scan using items_idx_item on items (cost=0.00..5.00 rows=1 width=4) (actual time=0.32.\ .0.34 rows=1 loops=1) Index Cond: (item = 'hammer'::text) What seems to be wrong with this one is that it's not using the indexes. If I break up the query and run each part separate and manually insert the results into the next part, I get the following results: select id from items where item = 'hammer'; id ------- 10751 (1 row) select box_id from item_box_maps where item_id in (10751); box_id ------------- 12191 10537 10536 12169 12161 12151 11327 (7 rows) select name from boxes where id in (12191, 10537, 10536, 12169, 12161, 12151, 11327); name ----- box2910 box1223 box15 box433 box1233 box95 box1003 (7 rows) Each of these queries return instantly and the output from EXPLAIN indicates that they use the index. For instance: Index Scan using boxes_pkey, boxes_pkey, boxes_pkey, boxes_pkey, boxes_pkey, boxes_pkey, boxes_p\ key on boxes (cost=0.00..37.69 rows=7 width=98) (actual time=0.10..0.34 rows=7 loops=1) Index Cond: ((id = 12191) OR (id = 10537) OR (id = 10536) OR (id = 12169) OR (id = 12161) OR (id = 12151) OR (id = 11327)) Total runtime: 0.51 msec Some more information: SELECT relname, relkind, reltuples, relpages FROM pg_class WHERE relname LIKE 'items%'; relname | relkind | reltuples | relpages ------------------------+---------+-----------+---------- items_id_seq | S | 1 | 1 items_pkey | i | 2124 | 38 items_type | c | 0 | 0 items | r | 2124 | 81 items_idx_item | i | 2124 | 9 (5 rows) SELECT relname, relkind, reltuples, relpages FROM pg_class WHERE relname LIKE 'boxes%'; relname | relkind | reltuples | relpages --------------------------+---------+-----------+---------- boxes_id_seq | S | 1 | 1 boxes_pkey | i | 2076 | 67 boxes_type | c | 0 | 0 boxes | r | 2076 | 456 boxes_idx_name | i | 2076 | 12 (5 rows) SELECT relname, relkind, reltuples, relpages FROM pg_class WHERE relname LIKE 'item_box_maps%'; relname | relkind | reltuples | relpages ----------------------------------+---------+-----------+---------- item_box_maps_id_seq | S | 1 | 1 item_box_maps_pkey | i | 78544 | 246 item_box_maps | r | 51384 | 435 item_box_maps_idx_item_id | i | 78544 | 232 item_box_maps_idx_box_id | i | 78544 | 248 item_box_maps_idx_item_id_box_id | i | 78544 | 296 (6 rows) SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'item_box_maps'; attname | n_distinct | most_common_vals ----------+------------+---------------------------------------------------------- id | -1 | item_id | 2271 | {1,771,33,11104,81,677,4915,2,46,11853} box_id | 3448 | {10720,10600,10701,9329,9346,10611,10734,2960,5270,9332} (3 rows) The only thing I notice in here is that the indexes for the mapping table are huge -- over half the size of the table. Is that correct? Other than that, I have run out of ideas. SQL Server would run the query instantaneously and it should be possible to have PostgreSQL run it in much less than a second, but how? Anyone? Roger Dahl
pgsql-general by date: