Performance problems testing with Spamassassin 3.1.0 Bayes module.

From: Matthew Schumacher
Subject: Performance problems testing with Spamassassin 3.1.0 Bayes module.
Date: ,
Msg-id: 42E80C28.2040608@aptalaska.net
(view: Whole thread, Raw)
Responses: Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Josh Berkus)
Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Tom Lane)
Re: Performance problems testing with Spamassassin 3.1.0  (Karim Nassar)
List: pgsql-performance

Tree view

Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Matthew Schumacher, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Josh Berkus, )
  Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Josh Berkus, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Dennis Bjorklund, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Josh Berkus, )
     Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Tom Lane, )
  Re: Performance problems testing with Spamassassin  ("Luke Lonergan", )
 Re: Performance problems testing with Spamassassin 3.1.0  (Karim Nassar, )
  Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Gavin Sherry, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
     Re: Performance problems testing with Spamassassin 3.1.0  (Gavin Sherry, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Andrew McMillan, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
     Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
     Re: Performance problems testing with Spamassassin 3.1.0  (Andrew McMillan, )
      Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
      Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
       Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
        Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
          Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Michael Parker, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
           Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
          Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
           Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
            Re: Performance problems testing with Spamassassin 3.1.0  ("Jim C. Nasby", )
             Re: Performance problems testing with Spamassassin 3.1.0  (Michael Parker, )
              Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
               Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
               Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
               Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
                Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                 Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
                Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                 Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                  Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                   Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                    Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
        Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
         Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
           Re: Performance problems testing with Spamassassin 3.1.0  ("Jim C. Nasby", )
            Re: Performance problems testing with Spamassassin 3.1.0  (Andreas Pflug, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  ("Merlin Moncure", )

I'm not sure how much this has been discussed on the list, but wasn't
able to find anything relevant in the archives.

The new Spamassassin is due out pretty soon.  They are currently testing
3.1.0pre4.  One of the things I hope to get out of this release is bayes
word stats moved to a real RDBMS.  They have separated the mysql
BayesStore module from the PgSQL one so now postgres can use it's own
queries.

I loaded all of this stuff up on a test server and am finding that the
bayes put performance is really not good enough for any real amount of
mail load.

The performance problems seems to be when the bayes module is
inserting/updating.  This is now handled by the token_put procedure.

After playing with various indexes and what not I simply am unable to
make this procedure perform any better.  Perhaps someone on the list can
spot the bottleneck and reveal why this procedure isn't performing that
well or ways to make it better.

I put the rest of the schema up at
http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone
needs to see it too.

CREATE OR REPLACE FUNCTION put_token(integer, bytea, integer, integer,
integer) RETURNS bool AS '
DECLARE
        inuserid ALIAS for $1;
        intoken ALIAS for $2;
        inspam_count ALIAS for $3;
        inham_count ALIAS for $4;
        inatime ALIAS for $5;
        got_token record;
        updated_atime_p bool;
BEGIN
  updated_atime_p := FALSE;
  SELECT INTO got_token spam_count, ham_count, atime
    FROM bayes_token
   WHERE id = inuserid
     AND token = intoken;
   IF NOT FOUND THEN
     -- we do not insert negative counts, just return true
     IF (inspam_count < 0 OR inham_count < 0) THEN
       RETURN TRUE;
     END IF;
     INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
     VALUES (inuserid, intoken, inspam_count, inham_count, inatime);
     IF NOT FOUND THEN
       RAISE EXCEPTION ''unable to insert into bayes_token'';
       return FALSE;
     END IF;
     UPDATE bayes_vars SET token_count = token_count + 1
      WHERE id = inuserid;
     IF NOT FOUND THEN
       RAISE EXCEPTION ''unable to update token_count in bayes_vars'';
       return FALSE;
     END IF;
     UPDATE bayes_vars SET newest_token_age = inatime
      WHERE id = inuserid AND newest_token_age < inatime;
     IF NOT FOUND THEN
       UPDATE bayes_vars
          SET oldest_token_age = inatime
        WHERE id = inuserid
          AND oldest_token_age > inatime;
     END IF;
     return TRUE;
  ELSE
    IF (inspam_count != 0) THEN
      -- no need to update atime if it is < the existing value
      IF (inatime < got_token.atime) THEN
        UPDATE bayes_token
           SET spam_count = spam_count + inspam_count
         WHERE id = inuserid
           AND token = intoken
           AND spam_count + inspam_count >= 0;
      ELSE
        UPDATE bayes_token
           SET spam_count = spam_count + inspam_count,
               atime = inatime
         WHERE id = inuserid
           AND token = intoken
           AND spam_count + inspam_count >= 0;
        IF FOUND THEN
          updated_atime_p := TRUE;
        END IF;
      END IF;
    END IF;
    IF (inham_count != 0) THEN
      -- no need to update atime is < the existing value or if it was
already updated
      IF inatime < got_token.atime OR updated_atime_p THEN
        UPDATE bayes_token
           SET ham_count = ham_count + inham_count
         WHERE id = inuserid
           AND token = intoken
           AND ham_count + inham_count >= 0;
      ELSE
        UPDATE bayes_token
           SET ham_count = ham_count + inham_count,
               atime = inatime
         WHERE id = inuserid
           AND token = intoken
           AND ham_count + inham_count >= 0;
        IF FOUND THEN
          updated_atime_p := TRUE;
        END IF;
      END IF;
    END IF;
    IF updated_atime_p THEN
      UPDATE bayes_vars
         SET oldest_token_age = inatime
       WHERE id = inuserid
         AND oldest_token_age > inatime;
    END IF;
    return TRUE;
  END IF;
END;
' LANGUAGE 'plpgsql';



pgsql-performance by date:

From: Matthew Schumacher
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0
From: Chris Travers
Date:
Subject: Left joining against two empty tables makes a query SLOW