Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance
From | Matthew Schumacher |
---|---|
Subject | Re: Performance problems testing with Spamassassin 3.1.0 |
Date | |
Msg-id | 42F251F0.1070407@aptalaska.net Whole thread Raw |
In response to | Re: Performance problems testing with Spamassassin 3.1.0 (Matthew Schumacher <matt.s@aptalaska.net>) |
Responses |
Re: Performance problems testing with Spamassassin 3.1.0
|
List | pgsql-performance |
Matthew Schumacher wrote: > Tom Lane wrote: > > >>I don't really see why you think that this path is going to lead to >>better performance than where you were before. Manipulation of the >>temp table is never going to be free, and IN (sub-select) is always >>inherently not fast, and NOT IN (sub-select) is always inherently >>awful. Throwing a pile of simple queries at the problem is not >>necessarily the wrong way ... especially when you are doing it in >>plpgsql, because you've already eliminated the overhead of network >>round trips and repeated planning of the queries. >> >> regards, tom lane > > > The reason why I think this may be faster is because I would avoid > running an update on data that needs to be inserted which saves > searching though the table for a matching token. > > Perhaps I should do the insert first, then drop those tokens from the > temp table, then do my updates in a loop. > > I'll have to do some benchmarking... > > schu Tom, I think your right, whenever I do a NOT IN it does a full table scan against bayes_token and since that table is going to get very big doing the simple query in a loop that uses an index seems a bit faster. John, thanks for your help, it was worth a try, but it looks like the looping is just faster. Here is what I have so far in case anyone else has ideas before I abandon it: CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER, intokenary BYTEA[], inspam_count INTEGER, inham_count INTEGER, inatime INTEGER) RETURNS VOID AS ' DECLARE _token BYTEA; BEGIN UPDATE bayes_token SET spam_count = greatest_int(spam_count + inspam_count, 0), ham_count = greatest_int(ham_count + inham_count , 0), atime = greatest_int(atime, inatime) WHERE id = inuserid AND (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary)); UPDATE bayes_vars SET token_count = token_count + ( SELECT count(bayes_token_tmp) FROM bayes_token_tmp(intokenary) WHERE bayes_token_tmp NOT IN (SELECT token FROM bayes_token)), newest_token_age = greatest_int(newest_token_age, inatime), oldest_token_age = least_int(oldest_token_age, inatime) WHERE id = inuserid; INSERT INTO bayes_token SELECT inuserid, bayes_token_tmp, inspam_count, inham_count, inatime FROM bayes_token_tmp(intokenary) WHERE (inspam_count > 0 OR inham_count > 0) AND (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); RETURN; END; ' LANGUAGE 'plpgsql'; CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS SETOF bytea AS ' BEGIN for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) LOOP return next intokenary[i]; END LOOP; RETURN; end ' language 'plpgsql'; CREATE OR REPLACE FUNCTION greatest_int (integer, integer) RETURNS INTEGER IMMUTABLE STRICT AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;' LANGUAGE SQL; CREATE OR REPLACE FUNCTION least_int (integer, integer) RETURNS INTEGER IMMUTABLE STRICT AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;' LANGUAGE SQL;
pgsql-performance by date: