Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance

From:           John A Meinel
Subject:        Re: Performance problems testing with Spamassassin 3.1.0
Date:
Msg-id:         42F221BB.5000702@arbash-meinel.com
In response to: Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher <matt.s@aptalaska.net>)
Responses:      Re: Performance problems testing with Spamassassin 3.1.0
List:           pgsql-performance
Matthew Schumacher wrote:
> Okay,
>
> Here is the status of the SA updates and a question:
>
> Michael got SA changed to pass an array of tokens to the proc so right
> there we gained a ton of performance due to connections and transactions
> being grouped into one per email instead of one per token.
>
> Now I am working on making the proc even faster.  Since we have all of
> the tokens coming in as an array, it should be possible to get this down
> to just a couple of queries.
>
> I have the proc using IN and NOT IN statements to update everything at
> once from a temp table, but it progressively gets slower because the
> temp table is growing between vacuums.  At this point it's slightly
> slower than the old update-or-else-insert on every token.

I recommend that you drop and re-create the temp table. There is no
reason to keep it around, considering you delete and re-add everything.
That way you never have to vacuum it, since it only ever contains the
latest rows.

> What I really want to do is have the token array available as a record
> so that I can query against it, but not have it take up the resources of
> a real table.  If I could copy from an array into a record then I could
> even get rid of the loop.  Anyone have any thoughts on how to do this?

My one question here: are inspam_count and inham_count *always* the same
for all tokens? I would have thought each token has its own count.

Anyway, there are a few lines I would change:

> CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
>                                       intokenary BYTEA[],
>                                       inspam_count INTEGER,
>                                       inham_count INTEGER,
>                                       inatime INTEGER)
> RETURNS VOID AS '
> DECLARE
>   _token BYTEA;
> BEGIN

-- create the table at the start of the procedure
  CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
-- You might also add a primary key if you are going to be adding
-- *lots* of entries, but it sounds like you are going to have
-- less than 1 page, so it doesn't matter

>   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>   LOOP
>     _token := intokenary[i];
>     INSERT INTO bayes_token_tmp VALUES (_token);
>   END LOOP;
>
>   UPDATE
>     bayes_token
>   SET
>     spam_count = greatest_int(spam_count + inspam_count, 0),
>     ham_count = greatest_int(ham_count + inham_count, 0),
>     atime = greatest_int(atime, 1000)
>   WHERE
>     id = inuserid
>   AND

--  (token) IN (SELECT intoken FROM bayes_token_tmp);
    EXISTS (SELECT token FROM bayes_token_tmp
             WHERE intoken = token LIMIT 1);

-- I would also avoid your intoken (NOT) IN (SELECT token FROM
-- bayes_token). There are a few possibilities, but to me,
-- as your bayes_token table becomes big, this will start
-- to be the slow point.

-- Rather than doing 2 NOT IN queries, it *might* be faster to do:
  DELETE FROM bayes_token_tmp
   WHERE NOT EXISTS (SELECT token FROM bayes_token
                      WHERE token = intoken);

>
>   UPDATE
>     bayes_vars
>   SET

--  token_count = token_count + (SELECT count(intoken) FROM
--      bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
    token_count = token_count + (SELECT count(intoken)
                                   FROM bayes_token_tmp),
-- You don't need the WHERE intoken NOT IN, since we already removed
-- those rows

>     newest_token_age = greatest_int(newest_token_age, inatime),
>     oldest_token_age = least_int(oldest_token_age, inatime)
>   WHERE
>     id = inuserid;
>
>   INSERT INTO
>     bayes_token
>     SELECT
>       inuserid,
>       intoken,
>       inspam_count,
>       inham_count,
>       inatime
>     FROM
>       bayes_token_tmp
>     WHERE
>       (inspam_count > 0 OR inham_count > 0)

-- AND
--   (intoken) NOT IN (SELECT token FROM bayes_token);
-- You don't need either of those lines, again because we already
-- filtered.

-- delete from bayes_token_tmp;
-- And rather than deleting all of the entries, just:
  DROP TABLE bayes_token_tmp;

>
>   RETURN;
> END;
> ' LANGUAGE 'plpgsql';
>
> CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
>   RETURNS INTEGER
>   IMMUTABLE STRICT
>   AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
>   LANGUAGE SQL;
>
> CREATE OR REPLACE FUNCTION least_int (integer, integer)
>   RETURNS INTEGER
>   IMMUTABLE STRICT
>   AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
>   LANGUAGE SQL;

So to clarify, here is my finished function:
------------------------------------

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
                                      intokenary BYTEA[],
                                      inspam_count INTEGER,
                                      inham_count INTEGER,
                                      inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN

  CREATE TEMP TABLE bayes_token_tmp (intoken bytea);

  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
    _token := intokenary[i];
    INSERT INTO bayes_token_tmp VALUES (_token);
  END LOOP;

  UPDATE
    bayes_token
  SET
    spam_count = greatest_int(spam_count + inspam_count, 0),
    ham_count = greatest_int(ham_count + inham_count, 0),
    atime = greatest_int(atime, 1000)
  WHERE
    id = inuserid
  AND
    EXISTS (SELECT token FROM bayes_token_tmp
             WHERE intoken = token LIMIT 1);

  DELETE FROM bayes_token_tmp
   WHERE NOT EXISTS (SELECT token FROM bayes_token
                      WHERE token = intoken);

  UPDATE
    bayes_vars
  SET
    token_count = token_count + (SELECT count(intoken)
                                   FROM bayes_token_tmp),
    newest_token_age = greatest_int(newest_token_age, inatime),
    oldest_token_age = least_int(oldest_token_age, inatime)
  WHERE
    id = inuserid;

  INSERT INTO
    bayes_token
    SELECT
      inuserid,
      intoken,
      inspam_count,
      inham_count,
      inatime
    FROM
      bayes_token_tmp
    WHERE
      (inspam_count > 0 OR inham_count > 0);

  DROP TABLE bayes_token_tmp;

  RETURN;
END;
' LANGUAGE 'plpgsql';
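One more thought on Matthew's question about querying the token array
directly instead of building the temp table row by row: later PostgreSQL
releases (8.4 and up, so not the servers this thread was written against)
ship an unnest() set-returning function that expands an array into rows.
A hedged sketch of how the loop could be replaced on such a version:

```sql
-- Sketch only, assuming PostgreSQL 8.4+ where unnest() exists:
-- expand the intokenary array into one row per token and load the
-- temp table in a single set-based statement, replacing the FOR loop.
INSERT INTO bayes_token_tmp (intoken)
SELECT unnest(intokenary);
```

On versions without unnest(), the same effect needs the explicit loop
shown above, or a helper set-returning function that walks the array.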