Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance

From Matthew Schumacher
Subject Re: Performance problems testing with Spamassassin 3.1.0
Date
Msg-id 42F1CE43.8040801@aptalaska.net
Whole thread Raw
In response to Re: Performance problems testing with Spamassassin 3.1.0  (Michael Parker <parkerm@pobox.com>)
Responses Re: Performance problems testing with Spamassassin 3.1.0
Re: Performance problems testing with Spamassassin 3.1.0
Re: Performance problems testing with Spamassassin 3.1.0
List pgsql-performance
Okay,

Here is the status of the SA updates and a question:

Michael got SA changed to pass an array of tokens to the proc so right
there we gained a ton of performance due to connections and transactions
being grouped into one per email instead of one per token.

Now I am working on making the proc even faster.  Since we have all of
the tokens coming in as an array, it should be possible to get this down
to just a couple of queries.

I have the proc using IN and NOT IN statements to update everything at
once from a temp table, but it progressively gets slower because the
temp table is growing between vacuums.  At this point it's slightly
slower than the old update or else insert on every token.

What I really want to do is have the token array available as a record
so that I can query against it, but not have it take up the resources of
a real table.  If I could copy from an array into a record then I can
even get rid of the loop.  Anyone have any thoughts on how to do this?


CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
                                     intokenary BYTEA[],
                                     inspam_count INTEGER,
                                     inham_count INTEGER,
                                     inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN

  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
    _token := intokenary[i];
    INSERT INTO bayes_token_tmp VALUES (_token);
  END LOOP;

  UPDATE
    bayes_token
  SET
    spam_count = greatest_int(spam_count + inspam_count, 0),
    ham_count = greatest_int(ham_count + inham_count , 0),
    atime = greatest_int(atime, 1000)
  WHERE
    id = inuserid
  AND
    (token) IN (SELECT intoken FROM bayes_token_tmp);

  UPDATE
    bayes_vars
  SET
    token_count = token_count + (SELECT count(intoken) FROM
bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
    newest_token_age = greatest_int(newest_token_age, inatime),
    oldest_token_age = least_int(oldest_token_age, inatime)
  WHERE
    id = inuserid;

  INSERT INTO
    bayes_token
    SELECT
      inuserid,
      intoken,
      inspam_count,
      inham_count,
      inatime
    FROM
      bayes_token_tmp
    WHERE
      (inspam_count > 0 OR inham_count > 0)
    AND
      (intoken) NOT IN (SELECT token FROM bayes_token);

  delete from bayes_token_tmp;

  RETURN;
END;
' LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
 LANGUAGE SQL;

CREATE OR REPLACE FUNCTION least_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
 LANGUAGE SQL;

pgsql-performance by date:

Previous
From: John A Meinel
Date:
Subject: Re: Is There A Windows Version of Performance Tuning Documents?
Next
From: prasanna s
Date:
Subject: Indexed views.