Thread: Need another way to do this, my sql much too slow...
I am attempting to write a spam management add-on for my corporate server. I have the database and the apps written, but one part is WAY too slow to be useful.

The following view:

        Column         |  Type   | Modifiers
-----------------------+---------+-----------
 env_sender_num        | integer |
 msg_from_claims_count | bigint  |

is built on the following table:

     Column      |            Type             | Modifiers
-----------------+-----------------------------+-----------
 msg_sender_num  | integer                     |
 env_sender_num  | integer                     |
 msg_from_domain | character varying(255)      |
 env_from_domain | character varying(255)      |
 recorded_date   | timestamp without time zone |
Indexes: record_of_claims_env_sender_num_idx btree (env_sender_num),
         record_of_claims_msg_sender_num_idx btree (msg_sender_num)

with the following SQL:

SELECT record_of_claims.env_sender_num,
       count(DISTINCT record_of_claims.msg_sender_num) AS msg_from_claims_count
FROM record_of_claims
GROUP BY record_of_claims.env_sender_num;

A sample dataset follows:

 msg_sender_num | env_sender_num |    msg_from_domain    |    env_from_domain    |    recorded_date
----------------+----------------+-----------------------+-----------------------+---------------------
              1 |              1 | yahoo.com             | yahoo.com             | 2003-11-18 13:21:07
              2 |              2 | mx128.optinvc13y.com  | mx128.optinvc13y.com  | 2003-11-18 13:21:16
              3 |              3 | yahoo.com             | yahoo.com             | 2003-11-18 13:21:17
              4 |              4 | yahoo.com             | yahoo.com             | 2003-11-18 13:21:21
              5 |              5 | biomarketgroup.com    | biomarketgroup.com    | 2003-11-18 13:21:24
              6 |              6 | sohu.com              | sohu.com              | 2003-11-18 13:21:28
              7 |              7 | lycos.com             | lycos.com             | 2003-11-18 13:21:38
              8 |              8 | mail.expressrx.info   | mail.expressrx.info   | 2003-11-18 13:21:41
              9 |              9 | approveddeals.com     | approveddeals.com     | 2003-11-18 13:21:41
             10 |             10 | conceptholidays.co.uk | conceptholidays.co.uk | 2003-11-18 13:21:48

The msg_sender_num and env_sender_num come from another table of unique sender names. What I am attempting to do is see how many times msg_sender_num 1 claims to be a different env_sender_num.
So I have to find all the entries where msg_sender_num equals 1 and build a count of the distinct numbers in env_sender_num. This number is then used later to say that if a msg_sender_num claims to be more than 2 env_sender_nums, then the sender is a spammer and gets added to a list.

Everything is working fine except the SQL above. It takes WAY too long to process on a 500000+ record database. Hell, it takes 12 seconds or so on a 50000 record database. I have included the query plan to show that the indexes are being used.

Query Plan:

 Aggregate  (cost=0.00..166.16 rows=264 width=8) (actual time=0.98..7768.19 rows=62911 loops=1)
   ->  Group  (cost=0.00..159.57 rows=2635 width=8) (actual time=0.56..3179.14 rows=80466 loops=1)
         ->  Index Scan using record_of_claims_env_sender_num_idx on record_of_claims  (cost=0.00..152.99 rows=2635 width=8) (actual time=0.55..2240.15 rows=80466 loops=1)
 Total runtime: 7931.63 msec

Is there a better, read "faster", way to achieve this?

Jerry Wintrode
Very Newbie Postgres User
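For readers following along, the shape of the computation can be reproduced in miniature. This is a sketch using an in-memory SQLite database rather than the thread's PostgreSQL setup; the table and column names match the post, but the extra repeated-claim rows are invented so the DISTINCT count actually matters.

```python
# Miniature reproduction of the post's aggregate, using an in-memory
# SQLite database instead of PostgreSQL (illustration only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE record_of_claims (
        msg_sender_num INTEGER,
        env_sender_num INTEGER
    )
""")
# One (msg, env) pair per sender, as in the sample dataset, plus a
# duplicated claim (msg 1 also claims env 2, twice) so the DISTINCT
# in the count is exercised.
rows = [(n, n) for n in range(1, 11)] + [(1, 2), (1, 2)]
conn.executemany("INSERT INTO record_of_claims VALUES (?, ?)", rows)

# The view's defining query: per env_sender_num, how many distinct
# msg_sender_nums ever claimed it.
result = conn.execute("""
    SELECT env_sender_num,
           count(DISTINCT msg_sender_num) AS msg_from_claims_count
    FROM record_of_claims
    GROUP BY env_sender_num
""").fetchall()
print(result)
```

Here env_sender_num 2 ends up with a count of 2 (claimed by both msg 1 and msg 2), every other env with a count of 1.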
> I am attempting to write a spam management add-on for my corporate
> server, I have the database and the apps written but one part is WAY too
> slow to be useful.
>
> The following view:
>
>         Column         |  Type   | Modifiers
> -----------------------+---------+-----------
>  env_sender_num        | integer |
>  msg_from_claims_count | bigint  |
>
> is built on the following table:
>
>      Column      |            Type             | Modifiers
> -----------------+-----------------------------+-----------
>  msg_sender_num  | integer                     |
>  env_sender_num  | integer                     |
>  msg_from_domain | character varying(255)      |
>  env_from_domain | character varying(255)      |
>  recorded_date   | timestamp without time zone |
> Indexes: record_of_claims_env_sender_num_idx btree (env_sender_num),
>          record_of_claims_msg_sender_num_idx btree (msg_sender_num)
>
> With the following SQL:
>
> SELECT record_of_claims.env_sender_num,
>        count(DISTINCT record_of_claims.msg_sender_num) AS msg_from_claims_count
> FROM record_of_claims
> GROUP BY record_of_claims.env_sender_num;
>
> A sample dataset follows:
> [snip]

Not sure if this can speed things up:

SELECT env_sender_num, COUNT(msg_sender_num) AS msg_from_claims_count
FROM (
    SELECT DISTINCT ON (msg_sender_num) msg_sender_num, env_sender_num
    FROM record_of_claims
    ORDER BY msg_sender_num, env_sender_num DESC
) foo
GROUP BY env_sender_num;

but possibly it inspires you or someone else for a better one.

Regards,
Christoph
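One caveat worth noting (Christoph hedges it himself): DISTINCT ON keeps a single env_sender_num per msg_sender_num, so the rewrite answers a slightly different question when one sender claims several envelopes. A quick check in an in-memory SQLite database makes the difference visible; SQLite has no DISTINCT ON, so the inner query is emulated with max() per group, which matches `DISTINCT ON (msg_sender_num) ... ORDER BY msg_sender_num, env_sender_num DESC`. The three sample rows here are invented for the comparison.

```python
# Compare the original count(DISTINCT ...) aggregate with the
# DISTINCT ON rewrite on data where one msg_sender_num claims two
# env_sender_nums (illustration only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record_of_claims (msg_sender_num INT, env_sender_num INT)")
conn.executemany("INSERT INTO record_of_claims VALUES (?, ?)",
                 [(1, 5), (1, 3), (2, 3)])   # msg 1 claims envs 5 and 3

# Original: per env, count the distinct msgs that claimed it.
original = conn.execute("""
    SELECT env_sender_num, count(DISTINCT msg_sender_num)
    FROM record_of_claims GROUP BY env_sender_num
""").fetchall()

# Rewrite: keep one env per msg (the largest, mimicking ORDER BY
# env_sender_num DESC), then count msgs per env.
rewrite = conn.execute("""
    SELECT env_sender_num, count(msg_sender_num)
    FROM (SELECT msg_sender_num, max(env_sender_num) AS env_sender_num
          FROM record_of_claims GROUP BY msg_sender_num) foo
    GROUP BY env_sender_num
""").fetchall()

print(dict(original))  # env 3 claimed by msgs 1 and 2; env 5 by msg 1
print(dict(rewrite))   # msg 1 collapses onto env 5, so env 3 loses a claim
```

The two results agree only when every msg_sender_num maps to a single env_sender_num.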
Let me give you a better dataset:

 msg_sender_num | env_sender_num
----------------+----------------
            118 |          53003
            118 |          51778
            118 |          49679
            118 |          49457
            118 |          37434
            118 |          37389
            118 |          33644
            118 |          33609
            118 |          26043
            118 |          26004
            118 |          24288
            118 |          23357
            118 |          16246
            118 |          16103
            118 |          12967
            118 |          12140
            118 |           4191
            118 |            122
            118 |            860

With the SQL:

SELECT record_of_claims.msg_sender_num,
       count(DISTINCT record_of_claims.env_sender_num) AS env_from_claims_count
FROM record_of_claims
WHERE (record_of_claims.msg_sender_num = 118)
GROUP BY record_of_claims.msg_sender_num;

you get:

 msg_sender_num | env_from_claims_count
----------------+-----------------------
            118 |                    19

which is correct, for the following reason:

 msg_sender_num | envelope_from
----------------+---------------------------------------------
            118 | ABVQ3QQBAQAFfLcB9QAAAAACAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFgHcB9QAAOw4CAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFJkQB9QAAOw4CAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFKhMB9QAAOw4CAAAAAA@b.tpcper.com
            118 | ABVQ3QQBAQAFKhMB9QAAAAACAAAAAA@b.tpcper.com
            118 | ABVQ3QQBAQAFKz0B9QAAAAACAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFKz0B9QAAOw4CAAAAAA@b.tpcper.com
            118 | ABVQ3QQBAQAFKiMB9QAAAAACAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFKiMB9QAAOw4CAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFKxoB9QAAOw4CAAAAAA@b.tpcper.com
            118 | ABVQ3QQBAQAFKxoB9QAAAAACAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFK0QB9QAAOw4CAAAAAA@b.tpcper.com
            118 | ABVQ3QQBAQAFK0QB9QAAAAACAAAAAA@b.tpcper.com
            118 | ABVQ3QQBAQAFLuEB9QAAAAACAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFLuEB9QAAOw4CAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFf8wB9QAAOw4CAAAAAA@b.tpcper.com
            118 | ABVQ3QQBAQAFf8wB9QAAAAACAAAAAA@b.tpcper.com
            118 | AjEywAQBAQAFgAoB9QAAOw4CAAAAAA@b.tpcper.com
            118 | ABVQ3QQBAQAFgA4B9QAAAAACAAAAAA@b.tpcper.com

19 different envelope-from names, all claiming to be the same message-from: 118 ("TopOffers TopOffers@Topica.com"). All of the above addresses would be added to the blacklist for 120 days.
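The filtered query above can be checked in miniature with the same 19 env_sender_num values, again as an SQLite stand-in for the Postgres table rather than the thread's real database:

```python
# Reproduce the WHERE msg_sender_num = 118 query on the post's 19
# env_sender_num values (SQLite illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record_of_claims (msg_sender_num INT, env_sender_num INT)")
envs = [53003, 51778, 49679, 49457, 37434, 37389, 33644, 33609, 26043,
        26004, 24288, 23357, 16246, 16103, 12967, 12140, 4191, 122, 860]
conn.executemany("INSERT INTO record_of_claims VALUES (118, ?)",
                 [(e,) for e in envs])

# Restrict to one msg_sender_num, then count its distinct envelopes.
row = conn.execute("""
    SELECT msg_sender_num, count(DISTINCT env_sender_num)
    FROM record_of_claims
    WHERE msg_sender_num = 118
    GROUP BY msg_sender_num
""").fetchone()
print(row)  # -> (118, 19)
```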
If I, say, every 15 minutes or so create a new table full of the results of the SQL view, it sort of solves my problem. It still takes forever to process, but the next view that needs these results can do an index scan on the resulting table and not have to build the list all over again. In this way I can cut my processing time to 6/10ths of a second. But I have to create/drop/rename tables on a time interval. Not the best solution.

Jerry Wintrode
Network Administrator
Tripos, Inc.
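The periodic summary-table idea can be done without the create/drop/rename juggling: wipe and refill the summary inside one transaction, so readers always see either the old or the new snapshot. A sketch, again in SQLite with an invented `claims_summary` table (in PostgreSQL the same pattern works, and TRUNCATE inside a transaction is the usual refinement):

```python
# Transactional "wipe and refill" refresh of a precomputed summary
# table, replacing the post's create/drop/rename cycle (illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record_of_claims (msg_sender_num INT, env_sender_num INT);
    CREATE TABLE claims_summary (env_sender_num INT PRIMARY KEY,
                                 msg_from_claims_count INT);
    INSERT INTO record_of_claims VALUES (1, 1), (2, 1), (3, 2);
""")

def refresh_summary(conn):
    # Run the expensive aggregate once; later queries hit the small,
    # indexed claims_summary table instead.
    with conn:  # single transaction: old rows out, new rows in
        conn.execute("DELETE FROM claims_summary")
        conn.execute("""
            INSERT INTO claims_summary
            SELECT env_sender_num, count(DISTINCT msg_sender_num)
            FROM record_of_claims GROUP BY env_sender_num
        """)

refresh_summary(conn)
summary = conn.execute(
    "SELECT * FROM claims_summary ORDER BY env_sender_num").fetchall()
print(summary)  # -> [(1, 2), (2, 1)]
```

Calling `refresh_summary()` again on a schedule keeps the table current without any renaming.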
Ok, I figured that part out by simply changing the way I'm doing the query, and writing a function to handle the reply. But it introduces another problem: how do I pass special characters, any character, to a function like this:

select msg_2_env('"Ann's Free Gifts & Coupons" <server1@mail03a-free-gifts.mx07.com>');

As you can see, the message-from name is:

"Ann's Free Gifts & Coupons" server1@mail03a-free-gifts.mx07.com

I need that whole string to match, including the ", &, @, and yes, the single quote in Ann's. Passed as a variable this should not be a problem, I think, but how do I test this on the command line with psql?

Oh, here is the simple function in case anyone cares to have it...very simple. Now processing about 100000 records takes 1 ms, down from the 12-15 seconds. WooHoo. Just that other little issue..hehehe.

============================================================================

CREATE FUNCTION msg_2_env (text) RETURNS int4 AS '
DECLARE
    intext ALIAS FOR $1;
    result int4;
BEGIN
    result := ( SELECT count(DISTINCT record_of_claims.msg_sender_num) AS mclaim_count
                FROM record_of_claims
                WHERE (record_of_claims.env_sender_num =
                          (SELECT env_from_senders.env_sender_num
                           FROM env_from_senders
                           WHERE (env_from_senders.envelope_from = intext::character varying)))
                GROUP BY record_of_claims.env_sender_num );
    RETURN result;
END;
' LANGUAGE 'plpgsql';

============================================================================

Jerry Wintrode
Network Administrator
Tripos, Inc.

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
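The logic inside msg_2_env() is easy to exercise outside Postgres too: resolve the envelope string to its env_sender_num via the lookup table, then run the indexed count. A stand-in sketch in SQLite, with a made-up sample address, mirroring the plpgsql above:

```python
# Stand-in for the thread's msg_2_env() function: resolve the envelope
# string to its env_sender_num, then count the distinct msg_sender_nums
# that claimed it (SQLite illustration of the plpgsql original).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE env_from_senders (env_sender_num INT, envelope_from TEXT);
    CREATE INDEX env_from_idx ON env_from_senders (envelope_from);
    CREATE TABLE record_of_claims (msg_sender_num INT, env_sender_num INT);
    INSERT INTO env_from_senders VALUES (7, 'spammer@example.com');
    INSERT INTO record_of_claims VALUES (1, 7), (2, 7), (2, 7);
""")

def msg_2_env(conn, intext):
    # Same shape as the plpgsql body: indexed lookup, then aggregate.
    row = conn.execute("""
        SELECT count(DISTINCT msg_sender_num)
        FROM record_of_claims
        WHERE env_sender_num = (SELECT env_sender_num
                                FROM env_from_senders
                                WHERE envelope_from = ?)
        GROUP BY env_sender_num
    """, (intext,)).fetchone()
    return row[0] if row else None

print(msg_2_env(conn, 'spammer@example.com'))  # -> 2
```

An unknown envelope string yields no rows, so the function returns None, matching the plpgsql NULL result.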
> Ok, I figured that part out by simply changing the way I'm doing the
> query, and writing a function to handle the reply. But it introduces
> another problem: how do I pass special characters, any character, to a
> function like this:
>
> select msg_2_env('"Ann's Free Gifts & Coupons"
> <server1@mail03a-free-gifts.mx07.com>');
>
> As you can see, the message-from name is:
>
> "Ann's Free Gifts & Coupons" server1@mail03a-free-gifts.mx07.com
>
> I need that whole string to match, including the ", &, @, and yes, the
> single quote in Ann's. Passed as a variable this should not be a
> problem, I think, but how do I test this on the command line with psql?
>
> Oh, here is the simple function in case anyone cares to have it...very
> simple. Now processing about 100000 records takes 1 ms, down from the
> 12-15 seconds. WooHoo. Just that other little issue..hehehe.
>
> CREATE FUNCTION msg_2_env (text) RETURNS int4 AS '
> DECLARE
>     intext ALIAS FOR $1;
>     result int4;
> BEGIN
>     result := ( SELECT count(DISTINCT record_of_claims.msg_sender_num) AS mclaim_count
>                 FROM record_of_claims
>                 WHERE (record_of_claims.env_sender_num =
>                           (SELECT env_from_senders.env_sender_num
>                            FROM env_from_senders
>                            WHERE (env_from_senders.envelope_from = intext::character varying)))
>                 GROUP BY record_of_claims.env_sender_num );
>     RETURN result;
> END;
> ' LANGUAGE 'plpgsql';
>
> Jerry Wintrode
> Network Administrator
> Tripos, Inc.

The only character you have to care about is the single quote. Double it:

select msg_2_env('"Ann''s Free Gifts & Coupons" <server1@mail03a-free-gifts.mx07.com>');

One more thing: as COUNT returns a bigint, my coding would be

    ..
    result bigint;
    ..

Regards,
Christoph
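The quote-doubling applies to hand-written SQL literals, as in psql; from application code the usual route is a bound parameter, which sidesteps quoting entirely. A small check, using Python's sqlite3 driver purely as an illustration of the two approaches:

```python
# Two ways to handle the troublesome string: double the single quote
# when building a literal by hand, or bind it as a parameter and let
# the driver handle ', ", & and @ untouched (illustration only).
import sqlite3

raw = '"Ann\'s Free Gifts & Coupons" <server1@mail03a-free-gifts.mx07.com>'

# Hand-escaping for a SQL literal: double every single quote.
escaped = raw.replace("'", "''")
print(escaped)  # Ann''s ...

# With a placeholder, no escaping is needed at all.
conn = sqlite3.connect(":memory:")
(value,) = conn.execute("SELECT ?", (raw,)).fetchone()
print(value == raw)  # the string round-trips exactly
```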