Tom Lane wrote:
> I looked into this a bit. It seems that the problem when you wrap the
> entire insertion series into one transaction is associated with the fact
> that the test does so many successive updates of the single row in
> bayes_vars. (VACUUM VERBOSE at the end of the test shows it cleaning up
> 49383 dead versions of the one row.) This is bad enough when it's in
> separate transactions, but when it's in one transaction, none of those
> dead row versions can be marked "fully dead" yet --- so for every update
> of the row, the unique-key check has to visit every dead version to make
> sure it's dead in the context of the current transaction. This makes
> the process O(N^2) in the number of updates per transaction. Which is
> bad enough if you just want to do one transaction per message, but it's
> intolerable if you try to wrap the whole bulk-load scenario into one
> transaction.
>
> I'm not sure that we can do anything to make this a lot smarter, but
> in any case, the real problem is to not do quite so many updates of
> bayes_vars.
>
> How constrained are you as to the format of the SQL generated by
> SpamAssassin? In particular, could you convert the commands generated
> for a single message into a single statement? I experimented with
> passing all the tokens for a given message as a single bytea array,
> as in the attached, and got almost a factor of 4 runtime reduction
> on your test case.
>
> BTW, it's possible that this is all just a startup-transient problem:
> once the database has been reasonably well populated, one would expect
> new tokens to be added infrequently, and so the number of updates to
> bayes_vars ought to drop off.
>
> regards, tom lane
>
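For a rough sense of scale, here is a back-of-the-envelope sketch of the cost model described above. It assumes that update number k in a transaction has to visit the k-1 dead versions left by the earlier updates in that same transaction. The function name is made up for illustration; this is not PostgreSQL code, only the arithmetic.

```python
def dead_version_visits(n_updates):
    """Total dead row versions visited under the assumed model:
    the k-th update (1-based) visits the k-1 versions left before it."""
    return sum(range(n_updates))  # 0 + 1 + ... + (n_updates - 1)


if __name__ == "__main__":
    # Doubling the updates roughly quadruples the visits: the O(N^2) shape.
    for n in (1_000, 2_000, 49_383):
        print(n, dead_version_visits(n))
```

With the 49383 dead versions reported by VACUUM VERBOSE, this model gives on the order of a billion visits if everything runs in one transaction.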
The SpamAssassin bayes code calls the _put_token method in the storage
module in a loop. This means that the storage module isn't called once per
message, but once per token.
I'll look into modifying it so that the bayes code passes a hash of
tokens to the storage module, where each module can loop over them or, in
the case of the pgsql module, pass an array of tokens to a procedure that
loops and uses temp tables to make this much more efficient.
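
Roughly what I have in mind, sketched in Python rather than the actual Perl. CountingStore, put_tokens, learn_per_token and learn_batched are names invented for this sketch, not the real SpamAssassin API, and the in-memory counter only stands in for the database.

```python
from collections import Counter


class CountingStore:
    """Hypothetical stand-in for a storage module; counts calls made to it."""

    def __init__(self):
        self.calls = 0           # each call would be a round trip in a SQL backend
        self.counts = Counter()  # token -> accumulated count

    def put_token(self, token, count):
        # Current shape: one call per token.
        self.calls += 1
        self.counts[token] += count

    def put_tokens(self, tokens):
        # Proposed shape: one call per message, looping inside the module.
        self.calls += 1
        for token, count in tokens.items():
            self.counts[token] += count


def learn_per_token(store, tokens):
    # The calling code loops and hits the store once per token.
    for token, count in tokens.items():
        store.put_token(token, count)


def learn_batched(store, tokens):
    # The calling code hands over the whole hash of tokens at once.
    store.put_tokens(tokens)


if __name__ == "__main__":
    message_tokens = {"viagra": 1, "meeting": 1, "invoice": 2}
    a, b = CountingStore(), CountingStore()
    learn_per_token(a, message_tokens)
    learn_batched(b, message_tokens)
    print(a.calls, b.calls)  # per-token calls vs. a single batched call
```

Both paths end up with the same token counts; only the number of calls into the storage module changes. A SQL backend could then turn the single call into a single statement.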
I don't have much time this weekend to toss at this, but will be looking
at it on Monday.
Thanks,
schu