Re: Performance problems testing with Spamassassin 3.1.0

From: Matthew Schumacher
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: ,
Msg-id: 42F251F0.1070407@aptalaska.net
(view: Whole thread, Raw)
In response to: Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher)
Responses: Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel)
List: pgsql-performance

Tree view

Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Matthew Schumacher, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Josh Berkus, )
  Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Josh Berkus, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Dennis Bjorklund, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Josh Berkus, )
     Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Tom Lane, )
  Re: Performance problems testing with Spamassassin  ("Luke Lonergan", )
 Re: Performance problems testing with Spamassassin 3.1.0  (Karim Nassar, )
  Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Gavin Sherry, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
     Re: Performance problems testing with Spamassassin 3.1.0  (Gavin Sherry, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Andrew McMillan, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
     Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
     Re: Performance problems testing with Spamassassin 3.1.0  (Andrew McMillan, )
      Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
      Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
       Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
        Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
          Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Michael Parker, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
           Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
          Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
           Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
            Re: Performance problems testing with Spamassassin 3.1.0  ("Jim C. Nasby", )
             Re: Performance problems testing with Spamassassin 3.1.0  (Michael Parker, )
              Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
               Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
               Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
               Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
                Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                 Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
                Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                 Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                  Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                   Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                    Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
        Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
         Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
           Re: Performance problems testing with Spamassassin 3.1.0  ("Jim C. Nasby", )
            Re: Performance problems testing with Spamassassin 3.1.0  (Andreas Pflug, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  ("Merlin Moncure", )

Matthew Schumacher wrote:
> Tom Lane wrote:
>
>
>>I don't really see why you think that this path is going to lead to
>>better performance than where you were before.  Manipulation of the
>>temp table is never going to be free, and IN (sub-select) is always
>>inherently not fast, and NOT IN (sub-select) is always inherently
>>awful.  Throwing a pile of simple queries at the problem is not
>>necessarily the wrong way ... especially when you are doing it in
>>plpgsql, because you've already eliminated the overhead of network
>>round trips and repeated planning of the queries.
>>
>>            regards, tom lane
>
>
> The reason why I think this may be faster is because I would avoid
> running an update on data that needs to be inserted which saves
> searching though the table for a matching token.
>
> Perhaps I should do the insert first, then drop those tokens from the
> temp table, then do my updates in a loop.
>
> I'll have to do some benchmarking...
>
> schu

Tom, I think your right, whenever I do a NOT IN it does a full table
scan against bayes_token and since that table is going to get very big
doing the simple query in a loop that uses an index seems a bit faster.

John, thanks for your help, it was worth a try, but it looks like the
looping is just faster.

Here is what I have so far in case anyone else has ideas before I
abandon it:

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
                                     intokenary BYTEA[],
                                     inspam_count INTEGER,
                                     inham_count INTEGER,
                                     inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN

  UPDATE
    bayes_token
  SET
    spam_count = greatest_int(spam_count + inspam_count, 0),
    ham_count = greatest_int(ham_count + inham_count , 0),
    atime = greatest_int(atime, inatime)
  WHERE
    id = inuserid
  AND
    (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary));

  UPDATE
    bayes_vars
  SET
    token_count = token_count + (
      SELECT
        count(bayes_token_tmp)
      FROM
        bayes_token_tmp(intokenary)
      WHERE
        bayes_token_tmp NOT IN (SELECT token FROM bayes_token)),
    newest_token_age = greatest_int(newest_token_age, inatime),
    oldest_token_age = least_int(oldest_token_age, inatime)
  WHERE
    id = inuserid;

  INSERT INTO
    bayes_token
    SELECT
      inuserid,
      bayes_token_tmp,
      inspam_count,
      inham_count,
      inatime
    FROM
      bayes_token_tmp(intokenary)
    WHERE
      (inspam_count > 0 OR inham_count > 0)
    AND
      (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

  RETURN;
END;
' LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS
SETOF bytea AS
'
BEGIN
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
    return next intokenary[i];
  END LOOP;
  RETURN;
end
'
language 'plpgsql';

CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
 LANGUAGE SQL;

CREATE OR REPLACE FUNCTION least_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
 LANGUAGE SQL;


pgsql-performance by date:

From: John A Meinel
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0
From: Dirk Lutzebäck
Date:
Subject: Performance problems on 4-way AMD Opteron 875 (dual core)