simple join query runs very slowly - Mailing list pgsql-general

From rdnews@dahlsys.com (Roger Dahl)
Subject simple join query runs very slowly
Date
Msg-id db3559c4.0403251728.1f38a59a@posting.google.com
Whole thread Raw
Responses Re: simple join query runs very slowly
Re: simple join query runs very slowly
Re: simple join query runs very slowly
List pgsql-general
Hello everyone,

I'm having trouble with a slow query and I wonder if anyone can help
me speed it up. I have read the PostgreSQL documentation and FAQ but
have been unable to find anything to help me out.

I have three tables: items, boxes and item_box_maps. The items table
holds ids and item names. The boxes table holds ids and box names. The
item_box_maps table holds box and item ids that "map" items to boxes.

To register a box as containing a specific item, I just add a record
in item_box_maps with the appropriate box name id and item name id. I
have used this approach on SQL Server many times and have always
gotten excellent performance.

To find out which boxes contain a specific item, I would write a query
like this:

select b.name,
from boxes b
inner join item_box_maps m on m.box_id = b.id
inner join items i on m.id = m.item_id
where i.name = 'hammer';

I've set up a PostgreSQL database with around 2000 boxes, 2000 items
and 80.000 mappings. The query above will typically return 40 records
(names of boxes) for a given item name. I'm used to this query running
"instantaneously" but on my PostgreSQL server this query takes over
one second.

I have btree indexes on everything and have vacuum analyze'ed
everything. EXPLAIN gives me the following output on the query above:

Hash Join  (cost=5.00..3608.36 rows=14 width=114) (actual
time=572.76..1043.98 rows=7 loops=1)
   Hash Cond: ("outer".item_id = "inner".id)
   ->  Merge Join  (cost=0.00..3371.16 rows=30938 width=110) (actual
time=472.93..969.62 rows=27978 loops=1)
         Merge Cond: ("outer".id = "inner".box_id)
         ->  Index Scan using boxes_pkey on boxes b
(cost=0.00..1621.40 rows=2076 width=102) (actual time=2.95..26.0\
8 rows=2076 loops=1)
         ->  Index Scan using item_box_maps_idx_box_id on
item_box_maps m  (cost=0.00..1233.31 rows=51384 wi\
dth=8) (actual time=24.17..580.09 rows=78544 loops=1)
   ->  Hash  (cost=5.00..5.00 rows=1 width=4) (actual time=0.38..0.38
rows=0 loops=1)
         ->  Index Scan using items_idx_item on items i
(cost=0.00..5.00 rows=1 width=4) (actual time=0.35..0.37 rows=1 l\
oops=1)
               Index Cond: (item = 'hammer'::text)
 Total runtime: 1044.35 msec

I don't have any experience interpreting these, but to my untrained
eye, everything looks fine.

I then tried rewriting the query to use subqueries:

select b.name
from boxes b
where b.id in (
  select box_id
  from item_box_maps
  where item_id in (
    select id
    from items
    where item = 'hammer'
  )
);

This one runs only slightly faster at around 800ms and gives the
following EXPLAIN output:

Seq Scan on boxes b  (cost=100000000.00..208235168085.56 rows=1038
width=98) (actual time=768.62..812.63 rows=7 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Materialize  (cost=100257787.86..100257787.86 rows=25692
width=4) (actual time=0.35..0.36 rows=7 loops=2076)
           ->  Seq Scan on item_box_maps
(cost=100000000.00..100257787.86 rows=25692 width=4) (actual
time=626.02..723\
.87 rows=7 loops=1)
                 Filter: (subplan)
                 SubPlan
                   ->  Materialize  (cost=5.00..5.00 rows=1 width=4)
(actual time=0.00..0.00 rows=1 loops=78544)
                         ->  Index Scan using items_idx_item on items
(cost=0.00..5.00 rows=1 width=4) (actual time=0.32.\
.0.34 rows=1 loops=1)
                               Index Cond: (item = 'hammer'::text)


What seems to be wrong with this one is that it's not using the
indexes. If I break up the query and run each part separate and
manually insert the results into the next part, I get the following
results:

select id
from items
where item = 'hammer';
id
-------
10751
(1 row)

select box_id
from item_box_maps
where item_id in (10751);

box_id
-------------
12191
10537
10536
12169
12161
12151
11327
(7 rows)

select name
from boxes
where id in (12191, 10537, 10536, 12169, 12161, 12151, 11327);

name
-----
box2910
box1223
box15
box433
box1233
box95
box1003
(7 rows)


Each of these queries return instantly and the output from EXPLAIN
indicates that they use the index. For instance:

Index Scan using boxes_pkey, boxes_pkey, boxes_pkey, boxes_pkey,
boxes_pkey, boxes_pkey, boxes_p\
key on boxes  (cost=0.00..37.69 rows=7 width=98) (actual
time=0.10..0.34 rows=7 loops=1)
   Index Cond: ((id = 12191) OR (id = 10537) OR (id = 10536) OR (id =
12169) OR (id = 12161) OR (id = 12151) OR (id = 11327))
 Total runtime: 0.51 msec

Some more information:

SELECT relname, relkind, reltuples, relpages FROM pg_class WHERE
relname LIKE 'items%';

relname                 | relkind | reltuples | relpages
------------------------+---------+-----------+----------
 items_id_seq           | S       |         1 |        1
 items_pkey             | i       |      2124 |       38
 items_type             | c       |         0 |        0
 items                  | r       |      2124 |       81
 items_idx_item         | i       |      2124 |        9
(5 rows)

SELECT relname, relkind, reltuples, relpages FROM pg_class WHERE
relname LIKE 'boxes%';

relname                   | relkind | reltuples | relpages
--------------------------+---------+-----------+----------
 boxes_id_seq             | S       |         1 |        1
 boxes_pkey               | i       |      2076 |       67
 boxes_type               | c       |         0 |        0
 boxes                    | r       |      2076 |      456
 boxes_idx_name           | i       |      2076 |       12
 (5 rows)

SELECT relname, relkind, reltuples, relpages FROM pg_class WHERE
relname LIKE 'item_box_maps%';

relname                           | relkind | reltuples | relpages
----------------------------------+---------+-----------+----------
 item_box_maps_id_seq             | S       |         1 |        1
 item_box_maps_pkey               | i       |     78544 |      246
 item_box_maps                    | r       |     51384 |      435
 item_box_maps_idx_item_id        | i       |     78544 |      232
 item_box_maps_idx_box_id         | i       |     78544 |      248
 item_box_maps_idx_item_id_box_id | i       |     78544 |      296
(6 rows)

SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE
tablename = 'item_box_maps';

attname   | n_distinct |                     most_common_vals
----------+------------+----------------------------------------------------------
 id       |         -1 |
 item_id  |       2271 | {1,771,33,11104,81,677,4915,2,46,11853}
 box_id   |       3448 |
{10720,10600,10701,9329,9346,10611,10734,2960,5270,9332}
(3 rows)

The only thing I notice in here is that the indexes for the mapping
table are huge -- over half the size of the table. Is that correct?

Other than that, I have run out of ideas. SQL Server would run the
query instantaneously and it should be possible to have PostgreSQL run
it in much less than a second, but how?

Anyone?

Roger Dahl

pgsql-general by date:

Previous
From: "scott.marlowe"
Date:
Subject: Re: prob with really simple insert
Next
From: "scott.marlowe"
Date:
Subject: Re: Beginner Questions Please: Which To Go With ?