Need another way to do this, my sql much to slow... - Mailing list pgsql-sql

From Jerry Wintrode
Subject Need another way to do this, my sql much to slow...
Date
Msg-id 4E676B0AAF74B443A18D7BC7AAB3CFFD1D44D6@s01-exch01.tripos.com
Whole thread Raw
List pgsql-sql
I am attempting to write a spam management add-on for my corporate
server, I have the database and the apps written but one part is WAY to
slow to be usefull.

The following view:
       Column         |  Type   | Modifiers
-----------------------+---------+-----------env_sender_num        | integer |msg_from_claims_count | bigint  |

is built on the follow table:
    Column      |            Type             | Modifiers
-----------------+-----------------------------+-----------msg_sender_num  | integer
|env_sender_num | integer                     |msg_from_domain | character varying(255)      |env_from_domain |
charactervarying(255)      |recorded_date   | timestamp without time zone | 
Indexes: record_of_claims_env_sender_num_idx btree (env_sender_num),        record_of_claims_msg_sender_num_idx btree
(msg_sender_num)

With the following SQL:


SELECT record_of_claims.env_sender_num, count(DISTINCT
record_of_claims.msg_sender_num)AS msg_from_claims_count FROM record_of_claims GROUP BY
record_of_claims.env_sender_num;

A sample dataset follows:

msg_sender_num | env_sender_num |    msg_from_domain    |
env_from_domain    |    recorded_date
----------------+----------------+-----------------------+--------------
---------+---------------------             1 |              1 | yahoo.com             | yahoo.com
| 2003-11-18 13:21:07             2 |              2 | mx128.optinvc13y.com  |
mx128.optinvc13y.com  | 2003-11-18 13:21:16             3 |              3 | yahoo.com             | yahoo.com
| 2003-11-18 13:21:17             4 |              4 | yahoo.com             | yahoo.com
| 2003-11-18 13:21:21             5 |              5 | biomarketgroup.com    |
biomarketgroup.com    | 2003-11-18 13:21:24             6 |              6 | sohu.com              | sohu.com
| 2003-11-18 13:21:28             7 |              7 | lycos.com             | lycos.com
| 2003-11-18 13:21:38             8 |              8 | mail.expressrx.info   |
mail.expressrx.info   | 2003-11-18 13:21:41             9 |              9 | approveddeals.com     |
approveddeals.com     | 2003-11-18 13:21:41            10 |             10 | conceptholidays.co.uk |
conceptholidays.co.uk | 2003-11-18 13:21:48


The msg_sender_num and env_sender_num come from another table of unique
names of senders. What I am attempting to do is see how many times
msg_sender_num 1 claims to be a different env_sender_num. So I have to
find all the entries in msg_sender_num equal to 1 and build a count of
the distinct numbers in env_sender_num. This number is then used later
to say that if a msg_sender_num claims to be more then 2
env_sender_num's then the sender is a spammer and gets added to a list.

Everything is working fine except the SQL above. It takes WAY to long to
process on a 500000+ record database. Hell it takes 12 seconds or so on
a 50000 record database. I have included the query plan to show that the
indexes are being used.

Query Plan:

Aggregate  (cost=0.00..166.16 rows=264 width=8) (actual
time=0.98..7768.19 rows=62911 loops=1)  ->  Group  (cost=0.00..159.57 rows=2635 width=8) (actual
time=0.56..3179.14 rows=80466 loops=1)        ->  Index Scan using record_of_claims_env_sender_num_idx on
record_of_claims  (cost=0.00..152.99 rows=2635 width=8) (actual
time=0.55..2240.15 rows=80466 loops=1)

Total runtime: 7931.63 msec


Is there a better, read "Faster", way to achieve this?


Jerry Wintrode
Very Newbie Postgres User








pgsql-sql by date:

Previous
From: Erik Thiele
Date:
Subject: current_date timezone documentation suggestion
Next
From: Christoph Haller
Date:
Subject: Re: Need another way to do this, my sql much to slow...