Re: Performance problems testing with Spamassassin 3.1.0

From: Matthew Schumacher
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: ,
Msg-id: 42EBEBE9.4020504@aptalaska.net
(view: Whole thread, Raw)
In response to: Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane)
Responses: Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel)
List: pgsql-performance

Tree view

Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Matthew Schumacher, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Josh Berkus, )
  Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Josh Berkus, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Dennis Bjorklund, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Josh Berkus, )
     Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Tom Lane, )
  Re: Performance problems testing with Spamassassin  ("Luke Lonergan", )
 Re: Performance problems testing with Spamassassin 3.1.0  (Karim Nassar, )
  Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Gavin Sherry, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
     Re: Performance problems testing with Spamassassin 3.1.0  (Gavin Sherry, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Andrew McMillan, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
     Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
     Re: Performance problems testing with Spamassassin 3.1.0  (Andrew McMillan, )
      Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
      Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
       Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
        Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
          Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Michael Parker, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
           Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
          Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
           Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
            Re: Performance problems testing with Spamassassin 3.1.0  ("Jim C. Nasby", )
             Re: Performance problems testing with Spamassassin 3.1.0  (Michael Parker, )
              Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
               Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
               Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
               Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
                Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                 Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
                Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                 Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                  Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                   Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                    Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
        Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
         Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
           Re: Performance problems testing with Spamassassin 3.1.0  ("Jim C. Nasby", )
            Re: Performance problems testing with Spamassassin 3.1.0  (Andreas Pflug, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  ("Merlin Moncure", )

Tom Lane wrote:

> I looked into this a bit.  It seems that the problem when you wrap the
> entire insertion series into one transaction is associated with the fact
> that the test does so many successive updates of the single row in
> bayes_vars.  (VACUUM VERBOSE at the end of the test shows it cleaning up
> 49383 dead versions of the one row.)  This is bad enough when it's in
> separate transactions, but when it's in one transaction, none of those
> dead row versions can be marked "fully dead" yet --- so for every update
> of the row, the unique-key check has to visit every dead version to make
> sure it's dead in the context of the current transaction.  This makes
> the process O(N^2) in the number of updates per transaction.  Which is
> bad enough if you just want to do one transaction per message, but it's
> intolerable if you try to wrap the whole bulk-load scenario into one
> transaction.
>
> I'm not sure that we can do anything to make this a lot smarter, but
> in any case, the real problem is to not do quite so many updates of
> bayes_vars.
>
> How constrained are you as to the format of the SQL generated by
> SpamAssassin?  In particular, could you convert the commands generated
> for a single message into a single statement?  I experimented with
> passing all the tokens for a given message as a single bytea array,
> as in the attached, and got almost a factor of 4 runtime reduction
> on your test case.
>
> BTW, it's possible that this is all just a startup-transient problem:
> once the database has been reasonably well populated, one would expect
> new tokens to be added infrequently, and so the number of updates to
> bayes_vars ought to drop off.
>
>             regards, tom lane
>

The spamassassins bayes code calls the _put_token method in the storage
module a loop.  This means that the storage module isn't called once per
message, but once per token.

I'll look into modifying it to so that the bayes code passes a hash of
tokens to the storage module where they can loop or in the case of the
pgsql module pass an array of tokens to a procedure where we loop and
use temp tables to make this much more efficient.

I don't have much time this weekend to toss at this, but will be looking
at it on Monday.

Thanks,

schu


pgsql-performance by date:

From: Matthew Schumacher
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0
From: John Arbash Meinel
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0