Need another way to do this, my sql much to slow... - Mailing list pgsql-sql
From | Jerry Wintrode |
---|---|
Subject | Need another way to do this, my sql much to slow... |
Date | |
Msg-id | 4E676B0AAF74B443A18D7BC7AAB3CFFD1D44D6@s01-exch01.tripos.com Whole thread Raw |
List | pgsql-sql |
I am attempting to write a spam management add-on for my corporate server, I have the database and the apps written but one part is WAY to slow to be usefull. The following view: Column | Type | Modifiers -----------------------+---------+-----------env_sender_num | integer |msg_from_claims_count | bigint | is built on the follow table: Column | Type | Modifiers -----------------+-----------------------------+-----------msg_sender_num | integer |env_sender_num | integer |msg_from_domain | character varying(255) |env_from_domain | charactervarying(255) |recorded_date | timestamp without time zone | Indexes: record_of_claims_env_sender_num_idx btree (env_sender_num), record_of_claims_msg_sender_num_idx btree (msg_sender_num) With the following SQL: SELECT record_of_claims.env_sender_num, count(DISTINCT record_of_claims.msg_sender_num)AS msg_from_claims_count FROM record_of_claims GROUP BY record_of_claims.env_sender_num; A sample dataset follows: msg_sender_num | env_sender_num | msg_from_domain | env_from_domain | recorded_date ----------------+----------------+-----------------------+-------------- ---------+--------------------- 1 | 1 | yahoo.com | yahoo.com | 2003-11-18 13:21:07 2 | 2 | mx128.optinvc13y.com | mx128.optinvc13y.com | 2003-11-18 13:21:16 3 | 3 | yahoo.com | yahoo.com | 2003-11-18 13:21:17 4 | 4 | yahoo.com | yahoo.com | 2003-11-18 13:21:21 5 | 5 | biomarketgroup.com | biomarketgroup.com | 2003-11-18 13:21:24 6 | 6 | sohu.com | sohu.com | 2003-11-18 13:21:28 7 | 7 | lycos.com | lycos.com | 2003-11-18 13:21:38 8 | 8 | mail.expressrx.info | mail.expressrx.info | 2003-11-18 13:21:41 9 | 9 | approveddeals.com | approveddeals.com | 2003-11-18 13:21:41 10 | 10 | conceptholidays.co.uk | conceptholidays.co.uk | 2003-11-18 13:21:48 The msg_sender_num and env_sender_num come from another table of unique names of senders. What I am attempting to do is see how many times msg_sender_num 1 claims to be a different env_sender_num. So I have to find all the entries in msg_sender_num equal to 1 and build a count of the distinct numbers in env_sender_num. This number is then used later to say that if a msg_sender_num claims to be more then 2 env_sender_num's then the sender is a spammer and gets added to a list. Everything is working fine except the SQL above. It takes WAY to long to process on a 500000+ record database. Hell it takes 12 seconds or so on a 50000 record database. I have included the query plan to show that the indexes are being used. Query Plan: Aggregate (cost=0.00..166.16 rows=264 width=8) (actual time=0.98..7768.19 rows=62911 loops=1) -> Group (cost=0.00..159.57 rows=2635 width=8) (actual time=0.56..3179.14 rows=80466 loops=1) -> Index Scan using record_of_claims_env_sender_num_idx on record_of_claims (cost=0.00..152.99 rows=2635 width=8) (actual time=0.55..2240.15 rows=80466 loops=1) Total runtime: 7931.63 msec Is there a better, read "Faster", way to achieve this? Jerry Wintrode Very Newbie Postgres User