3 tables, slow count(*), order by Seq Scan in Query Plan - Mailing list pgsql-sql
| From | |
|---|---|
| Subject | 3 tables, slow count(*), order by Seq Scan in Query Plan |
| Date | |
| Msg-id | 20050527035634.49305.qmail@web31110.mail.mud.yahoo.com |
| Responses | Re: 3 tables, slow count(*), order by Seq Scan in Query Plan |
| List | pgsql-sql |
Hello,

I have 3 tables (2 tables + 1 lookup table that ties them), and running a
straight-forward aggregate count(*) query with a couple of joins takes about
10 seconds (and I need it to be sub-second or so).  Also, I am wondering
whether this approach is scalable with my row counts and my hardware (below).

My slow query is this:
----------------------
SELECT keyword.name, count(*)
FROM user_data, user_data_keyword, keyword
WHERE (user_data.user_id = 1)
  AND (user_data.id = user_data_keyword.user_data_id)
  AND (user_data_keyword.keyword_id = keyword.id)
GROUP BY keyword.name
ORDER BY count(*) DESC
LIMIT 10;

QUERY PLAN from EXPLAIN ANALYZE:
--------------------------------
Limit  (cost=27820.00..27820.03 rows=10 width=114) (actual time=9971.322..9971.401 rows=10 loops=1)
  ->  Sort  (cost=27820.00..27822.50 rows=1000 width=114) (actual time=9971.312..9971.338 rows=10 loops=1)
        Sort Key: count(*)
        ->  HashAggregate  (cost=27767.67..27770.17 rows=1000 width=114) (actual time=9955.457..9963.051 rows=2005 loops=1)
              ->  Hash Join  (cost=4459.64..27738.80 rows=5774 width=114) (actual time=1140.776..9919.852 rows=5516 loops=1)
                    Hash Cond: ("outer".keyword_id = "inner".id)
                    ->  Hash Join  (cost=4437.14..27600.81 rows=5774 width=4) (actual time=21.781..7804.329 rows=5516 loops=1)
                          Hash Cond: ("outer".user_data_id = "inner".id)
                          ->  Seq Scan on user_data_keyword  (cost=0.00..17332.29 rows=1154729 width=8) (actual time=2.717..3834.186 rows=1154729 loops=1)
                          ->  Hash  (cost=4433.94..4433.94 rows=1278 width=4) (actual time=18.862..18.862 rows=0 loops=1)
                                ->  Index Scan using ix_user_data_user_id_data_id on user_data  (cost=0.00..4433.94 rows=1278 width=4) (actual time=0.234..13.454 rows=1149 loops=1)
                                      Index Cond: (user_id = 1)
                    ->  Hash  (cost=20.00..20.00 rows=1000 width=118) (actual time=1118.616..1118.616 rows=0 loops=1)
                          ->  Seq Scan on keyword  (cost=0.00..20.00 rows=1000 width=118) (actual time=1.140..609.577 rows=105440 loops=1)
Total runtime: 9972.704 ms
(15 rows)

Ouch :)

I'm trying to analyze the query plan (I'm not very good at it, obviously),
and I see two Sequential Scans: one on the _big_ "user_data_keyword" table,
which accounts for about 60% of the total cost, and one on "keyword".  I also
see a HashAggregate with a high cost and a long actual time.

I'm not sure what to do or which indices I need to add, as the
"user_data_keyword" and "keyword" tables already have PK-based btree indices:

  "user_data_keyword" has: ... btree (user_data_id, keyword_id)
  "keyword"           has: ... btree (id)

Here are my 3 tables:

user_data (about 300K rows currently, will grow to 10M+)
---------
  id      (PK)
  user_id (FK)
  ... other columns ...

user_data_keyword (lookup table - size always 4 x user_data)
-----------------
  user_data_id (FK)
  keyword_id   (FK)
  PK (user_data_id, keyword_id)

keyword (size about 10 x smaller than user_data_keyword)
-------
  id       (PK)
  name     VARCHAR(64) NOT NULL UNIQUE
  add_date TIMEZONE

Is there any way of speeding up my query?

Also, given the number of rows expected in those tables:

  user_data:         10M
  user_data_keyword: 40M
  keyword:            4M

Any ideas how a query like this will scale when I hit those numbers?

These particular query-speed numbers are from 7.4.6 on a 1.5GHz laptop, but
the production server is running PG 8.0.3 on a 3.0GHz P4 with 2 SATA disks
in a RAID 1 config and 1GB RAM.  How realistic is it to get sub-second
responses on such hardware given the above numbers?

Thanks,
Otis
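
[Editor's note: for reference, below is a minimal DDL sketch of the schema
described in the post.  Column types beyond those stated, the SERIAL keys,
the REFERENCES clauses, and the column list of ix_user_data_user_id_data_id
are assumptions inferred from the table descriptions and the plan output,
not the poster's actual definitions.]

  -- user_data: ~300K rows now, expected to grow to 10M+
  CREATE TABLE user_data (
      id      SERIAL PRIMARY KEY,   -- integer surrogate key (assumed)
      user_id INTEGER NOT NULL      -- FK to a users table (not shown)
      -- ... other columns ...
  );

  -- Index used by the plan for the user_id = 1 condition; the (user_id, id)
  -- column order is assumed from the index name.
  CREATE INDEX ix_user_data_user_id_data_id ON user_data (user_id, id);

  -- keyword: ~10x smaller than user_data_keyword
  CREATE TABLE keyword (
      id       SERIAL PRIMARY KEY,
      name     VARCHAR(64) NOT NULL UNIQUE,
      add_date TIMESTAMP            -- type assumed; the post says "TIMEZONE"
  );

  -- user_data_keyword: lookup table, always ~4x the size of user_data
  CREATE TABLE user_data_keyword (
      user_data_id INTEGER NOT NULL REFERENCES user_data (id),
      keyword_id   INTEGER NOT NULL REFERENCES keyword (id),
      PRIMARY KEY (user_data_id, keyword_id)
  );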