Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance

From John A Meinel
Subject Re: Performance problems testing with Spamassassin 3.1.0
Date
Msg-id 42F221BB.5000702@arbash-meinel.com
Whole thread Raw
In response to Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher <matt.s@aptalaska.net>)
Responses Re: Performance problems testing with Spamassassin 3.1.0
List pgsql-performance
Matthew Schumacher wrote:
> Okay,
>
> Here is the status of the SA updates and a question:
>
> Michael got SA changed to pass an array of tokens to the proc so right
> there we gained a ton of performance due to connections and transactions
> being grouped into one per email instead of one per token.
>
> Now I am working on making the proc even faster.  Since we have all of
> the tokens coming in as an array, it should be possible to get this down
> to just a couple of queries.
>
> I have the proc using IN and NOT IN statements to update everything at
> once from a temp table, but it progressively gets slower because the
> temp table is growing between vacuums.  At this point it's slightly
> slower than the old update or else insert on every token.

I recommend that you drop and re-create the temp table. There is no
reason to have it around, considering you delete and re-add everything.
That means you never have to vacuum it, since it always only contains
the latest rows.

>
> What I really want to do is have the token array available as a record
> so that I can query against it, but not have it take up the resources of
> a real table.  If I could copy from an array into a record then I can
> even get rid of the loop.  Anyone have any thoughts on how to do this?
>

My one question here, is the inspam_count and inham_count *always* the
same for all tokens? I would have thought each token has it's own count.
Anyway, there are a few lines I would change:

>
> CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
>                                      intokenary BYTEA[],
>                                      inspam_count INTEGER,
>                                      inham_count INTEGER,
>                                      inatime INTEGER)
> RETURNS VOID AS '
> DECLARE
>   _token BYTEA;
> BEGIN
>

    -- create the table at the start of the procedure
    CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
    -- You might also add primary key if you are going to be adding
    -- *lots* of entries, but it sounds like you are going to have
    -- less than 1 page, so it doesn't matter

>   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>   LOOP
>     _token := intokenary[i];
>     INSERT INTO bayes_token_tmp VALUES (_token);
>   END LOOP;
>
>   UPDATE
>     bayes_token
>   SET
>     spam_count = greatest_int(spam_count + inspam_count, 0),
>     ham_count = greatest_int(ham_count + inham_count , 0),
>     atime = greatest_int(atime, 1000)
>   WHERE
>     id = inuserid
>   AND

--    (token) IN (SELECT intoken FROM bayes_token_tmp);
      EXISTS (SELECT token FROM bayes_token_tmp
           WHERE intoken=token LIMIT 1);

-- I would also avoid your intoken (NOT) IN (SELECT token FROM
-- bayes_token) There are a few possibilities, but to me
-- as your bayes_token table becomes big, this will start
-- to be the slow point

-- Rather than doing 2 NOT IN queries, it *might* be faster to do
   DELETE FROM bayes_token_tmp
    WHERE NOT EXISTS (SELECT token FROM bayes_token
               WHERE token=intoken);


>
>   UPDATE
>     bayes_vars
>   SET

--     token_count = token_count + (SELECT count(intoken) FROM
-- bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
       token_count = token_count + (SELECT count(intoken)
                      FROM bayes_token_tmp)

-- You don't need the where NOT IN, since we already removed those rows

>     newest_token_age = greatest_int(newest_token_age, inatime),
>     oldest_token_age = least_int(oldest_token_age, inatime)
>   WHERE
>     id = inuserid;
>
>   INSERT INTO
>     bayes_token
>     SELECT
>       inuserid,
>       intoken,
>       inspam_count,
>       inham_count,
>       inatime
>     FROM
>       bayes_token_tmp
>     WHERE
>       (inspam_count > 0 OR inham_count > 0)

--     AND
--       (intoken) NOT IN (SELECT token FROM bayes_token);

-- You don't need either of those lines, again because we already
-- filtered

--   delete from bayes_token_tmp;
--   And rather than deleting all of the entries just
     DROP TABLE bayes_token_tmp;

>
>   RETURN;
> END;
> ' LANGUAGE 'plpgsql';
>
> CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
>  RETURNS INTEGER
>  IMMUTABLE STRICT
>  AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
>  LANGUAGE SQL;
>
> CREATE OR REPLACE FUNCTION least_int (integer, integer)
>  RETURNS INTEGER
>  IMMUTABLE STRICT
>  AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
>  LANGUAGE SQL;
>

So to clarify, here is my finished function:
------------------------------------
CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
                                     intokenary BYTEA[],
                                     inspam_count INTEGER,
                                     inham_count INTEGER,
                                     inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN

  CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
    _token := intokenary[i];
    INSERT INTO bayes_token_tmp VALUES (_token);
  END LOOP;

  UPDATE
    bayes_token
  SET
    spam_count = greatest_int(spam_count + inspam_count, 0),
    ham_count = greatest_int(ham_count + inham_count , 0),
    atime = greatest_int(atime, 1000)
  WHERE
    id = inuserid
  AND
    EXISTS (SELECT token FROM bayes_token_tmp
             WHERE intoken=token LIMIT 1);

   DELETE FROM bayes_token_tmp
    WHERE NOT EXISTS (SELECT token FROM bayes_token
               WHERE token=intoken);

  UPDATE
    bayes_vars
  SET
    token_count = token_count + (SELECT count(intoken)
                   FROM bayes_token_tmp),
    newest_token_age = greatest_int(newest_token_age, inatime),
    oldest_token_age = least_int(oldest_token_age, inatime)
  WHERE
    id = inuserid;

  INSERT INTO
    bayes_token
    SELECT
      inuserid,
      intoken,
      inspam_count,
      inham_count,
      inatime
    FROM
      bayes_token_tmp
    WHERE
      (inspam_count > 0 OR inham_count > 0)

  DROP TABLE bayes_token_tmp;

  RETURN;
END;
' LANGUAGE 'plpgsql';


Attachment

pgsql-performance by date:

Previous
From: Laszlo Hornyak
Date:
Subject: Re: Indexed views.
Next
From: Tom Lane
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0