Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance

From Matthew Schumacher
Subject Re: Performance problems testing with Spamassassin 3.1.0
Msg-id 42EBEBE9.4020504@aptalaska.net
In response to Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Performance problems testing with Spamassassin 3.1.0
List pgsql-performance
Tom Lane wrote:

> I looked into this a bit.  It seems that the problem when you wrap the
> entire insertion series into one transaction is associated with the fact
> that the test does so many successive updates of the single row in
> bayes_vars.  (VACUUM VERBOSE at the end of the test shows it cleaning up
> 49383 dead versions of the one row.)  This is bad enough when it's in
> separate transactions, but when it's in one transaction, none of those
> dead row versions can be marked "fully dead" yet --- so for every update
> of the row, the unique-key check has to visit every dead version to make
> sure it's dead in the context of the current transaction.  This makes
> the process O(N^2) in the number of updates per transaction.  Which is
> bad enough if you just want to do one transaction per message, but it's
> intolerable if you try to wrap the whole bulk-load scenario into one
> transaction.
>
> I'm not sure that we can do anything to make this a lot smarter, but
> in any case, the real problem is to not do quite so many updates of
> bayes_vars.
>
> How constrained are you as to the format of the SQL generated by
> SpamAssassin?  In particular, could you convert the commands generated
> for a single message into a single statement?  I experimented with
> passing all the tokens for a given message as a single bytea array,
> as in the attached, and got almost a factor of 4 runtime reduction
> on your test case.
>
> BTW, it's possible that this is all just a startup-transient problem:
> once the database has been reasonably well populated, one would expect
> new tokens to be added infrequently, and so the number of updates to
> bayes_vars ought to drop off.
>
>             regards, tom lane
>
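
As a rough illustration of the quadratic cost described in the quoted
message, here is a minimal sketch in Python. It assumes a simplified
model in which the k-th update of the row inside a transaction has to
recheck the k-1 versions left behind by earlier updates in that same
transaction, and it ignores dead versions left by transactions that have
already committed. The function names are hypothetical and are not part
of PostgreSQL or SpamAssassin.

```python
def dead_version_visits(updates_in_txn: int) -> int:
    """Visits to dead row versions for one transaction (simplified model).

    The k-th update rechecks the k-1 versions created by the earlier
    updates in the same transaction, so the total is 0 + 1 + ... + (n-1).
    """
    return sum(k - 1 for k in range(1, updates_in_txn + 1))


def total_visits(total_updates: int, txn_size: int) -> int:
    """Total visits when total_updates are split into transactions of txn_size."""
    full, rest = divmod(total_updates, txn_size)
    return full * dead_version_visits(txn_size) + dead_version_visits(rest)


if __name__ == "__main__":
    n = 49383  # dead versions reported by VACUUM VERBOSE in the quoted message
    print("one big transaction:", total_visits(n, n))
    print("100 updates per transaction:", total_visits(n, 100))
```

Under this model, one transaction containing all 49383 updates costs on
the order of a billion dead-version visits, while smaller transactions
cost far less in total. Fewer updates of the row shrinks the count in
either case.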

SpamAssassin's bayes code calls the _put_token method in the storage
module in a loop.  This means that the storage module isn't called once
per message, but once per token.

I'll look into modifying it so that the bayes code passes a hash of
tokens to the storage module, which can then loop over them itself.  In
the case of the pgsql module, it could instead pass an array of tokens
to a procedure that loops and uses temp tables, which should make this
much more efficient.
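
The difference between the two calling patterns can be sketched as
follows. This is a minimal illustration in Python, not SpamAssassin's
actual (Perl) storage API: the class CountingStore, the method
put_tokens, and the vars_updates counter are all hypothetical, and the
sketch assumes for simplicity that every per-token call touches the
single summary row (bayes_vars in the real schema).

```python
from collections import Counter


class CountingStore:
    """Hypothetical stand-in for a storage module.

    It counts how often the single summary row would be updated instead
    of talking to a database.
    """

    def __init__(self):
        self.tokens = Counter()
        self.vars_updates = 0  # updates of the one summary row

    def put_token(self, token, count):
        # Per-token pattern: one call and one summary-row update per token.
        self.tokens[token] += count
        self.vars_updates += 1

    def put_tokens(self, tokens):
        # Batched pattern: take the whole hash of tokens for a message
        # and update the summary row once.
        self.tokens.update(tokens)
        self.vars_updates += 1


if __name__ == "__main__":
    message = {"viagra": 1, "meeting": 2, "invoice": 1}

    per_token = CountingStore()
    for tok, cnt in message.items():
        per_token.put_token(tok, cnt)

    batched = CountingStore()
    batched.put_tokens(message)

    print(per_token.vars_updates, batched.vars_updates)
```

Both stores end up with the same token counts; only the number of
summary-row updates per message differs.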

I don't have much time this weekend to toss at this, but will be looking
at it on Monday.

Thanks,

schu
