Performance problems testing with Spamassassin 3.1.0 Bayes module. - Mailing list pgsql-performance
From | Matthew Schumacher |
---|---|
Subject | Performance problems testing with Spamassassin 3.1.0 Bayes module. |
Date | |
Msg-id | 42E80C28.2040608@aptalaska.net Whole thread Raw |
Responses |
Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.
Re: Performance problems testing with Spamassassin 3.1.0 Bayes module. Re: Performance problems testing with Spamassassin 3.1.0 |
List | pgsql-performance |
I'm not sure how much this has been discussed on the list, but wasn't able to find anything relevant in the archives. The new Spamassassin is due out pretty soon. They are currently testing 3.1.0pre4. One of the things I hope to get out of this release is bayes word stats moved to a real RDBMS. They have separated the mysql BayesStore module from the PgSQL one so now postgres can use it's own queries. I loaded all of this stuff up on a test server and am finding that the bayes put performance is really not good enough for any real amount of mail load. The performance problems seems to be when the bayes module is inserting/updating. This is now handled by the token_put procedure. After playing with various indexes and what not I simply am unable to make this procedure perform any better. Perhaps someone on the list can spot the bottleneck and reveal why this procedure isn't performing that well or ways to make it better. I put the rest of the schema up at http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone needs to see it too. CREATE OR REPLACE FUNCTION put_token(integer, bytea, integer, integer, integer) RETURNS bool AS ' DECLARE inuserid ALIAS for $1; intoken ALIAS for $2; inspam_count ALIAS for $3; inham_count ALIAS for $4; inatime ALIAS for $5; got_token record; updated_atime_p bool; BEGIN updated_atime_p := FALSE; SELECT INTO got_token spam_count, ham_count, atime FROM bayes_token WHERE id = inuserid AND token = intoken; IF NOT FOUND THEN -- we do not insert negative counts, just return true IF (inspam_count < 0 OR inham_count < 0) THEN RETURN TRUE; END IF; INSERT INTO bayes_token (id, token, spam_count, ham_count, atime) VALUES (inuserid, intoken, inspam_count, inham_count, inatime); IF NOT FOUND THEN RAISE EXCEPTION ''unable to insert into bayes_token''; return FALSE; END IF; UPDATE bayes_vars SET token_count = token_count + 1 WHERE id = inuserid; IF NOT FOUND THEN RAISE EXCEPTION ''unable to update token_count in bayes_vars''; return FALSE; END IF; UPDATE bayes_vars SET newest_token_age = inatime WHERE id = inuserid AND newest_token_age < inatime; IF NOT FOUND THEN UPDATE bayes_vars SET oldest_token_age = inatime WHERE id = inuserid AND oldest_token_age > inatime; END IF; return TRUE; ELSE IF (inspam_count != 0) THEN -- no need to update atime if it is < the existing value IF (inatime < got_token.atime) THEN UPDATE bayes_token SET spam_count = spam_count + inspam_count WHERE id = inuserid AND token = intoken AND spam_count + inspam_count >= 0; ELSE UPDATE bayes_token SET spam_count = spam_count + inspam_count, atime = inatime WHERE id = inuserid AND token = intoken AND spam_count + inspam_count >= 0; IF FOUND THEN updated_atime_p := TRUE; END IF; END IF; END IF; IF (inham_count != 0) THEN -- no need to update atime is < the existing value or if it was already updated IF inatime < got_token.atime OR updated_atime_p THEN UPDATE bayes_token SET ham_count = ham_count + inham_count WHERE id = inuserid AND token = intoken AND ham_count + inham_count >= 0; ELSE UPDATE bayes_token SET ham_count = ham_count + inham_count, atime = inatime WHERE id = inuserid AND token = intoken AND ham_count + inham_count >= 0; IF FOUND THEN updated_atime_p := TRUE; END IF; END IF; END IF; IF updated_atime_p THEN UPDATE bayes_vars SET oldest_token_age = inatime WHERE id = inuserid AND oldest_token_age > inatime; END IF; return TRUE; END IF; END; ' LANGUAGE 'plpgsql';
pgsql-performance by date: