Re: Performance problems testing with Spamassassin - Mailing list pgsql-performance

From Matthew Schumacher
Subject Re: Performance problems testing with Spamassassin
Date
Msg-id 42EB3E63.6040300@aptalaska.net
In response to Re: Performance problems testing with Spamassassin  (Karim Nassar <karim.nassar@acm.org>)
Responses Re: Performance problems testing with Spamassassin
List pgsql-performance
Karim Nassar wrote:
>
> kan4@slap-happy:~/k-bayesBenchmark$ time ./test.pl
> <-- snip db creation stuff -->
> 17:18:44 -- START
> 17:19:37 -- AFTER TEMP LOAD : loaded 120596 records
> 17:19:46 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
> 17:19:50 -- AFTER bayes_vars UPDATE : updated 1 records
> 17:23:37 -- AFTER bayes_token UPDATE : updated 47537 records
> DONE
>
> real    5m4.551s
> user    0m29.442s
> sys     0m3.925s
>
>
> I am sure someone smarter could optimize further.
>
> Anyone with a super-spifty machine wanna see if there is an improvement
> here?
>

There is a great improvement in loading the data.  While I didn't load
it on my server, my test box shows significant gains.

It seems that the main thing your script does differently is separate the
updates from the inserts, so that an expensive update isn't called when we
want to insert.  The other major difference is the 'IN' and 'NOT IN'
syntax, which looks to be much faster than trying everything as an update
before inserting.
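
A minimal sketch of that set-based approach, in Python with SQLite standing
in for PostgreSQL.  The simplified bayes_token columns, the "incoming" temp
table, and the merge_tokens helper are assumptions for illustration; this is
not the code from test.pl.

```python
import sqlite3


def create_schema(conn):
    # Simplified, assumed schema: one row per token with its counters.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bayes_token ("
        " token TEXT PRIMARY KEY,"
        " spam_count INTEGER NOT NULL DEFAULT 0,"
        " ham_count INTEGER NOT NULL DEFAULT 0)"
    )


def merge_tokens(conn, tokens, spam_delta=1, ham_delta=0):
    """Apply one batch of tokens with one UPDATE and one INSERT.

    Returns (rows_updated, rows_inserted).
    """
    cur = conn.cursor()
    # Stage the batch in a temp table so both statements can be set-based.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS incoming (token TEXT PRIMARY KEY)")
    cur.execute("DELETE FROM incoming")
    cur.executemany(
        "INSERT OR IGNORE INTO incoming (token) VALUES (?)",
        [(t,) for t in tokens],
    )
    # Update only the tokens that already exist (the IN side).
    cur.execute(
        "UPDATE bayes_token"
        " SET spam_count = spam_count + ?, ham_count = ham_count + ?"
        " WHERE token IN (SELECT token FROM incoming)",
        (spam_delta, ham_delta),
    )
    updated = cur.rowcount
    # Insert only the tokens that are new (the NOT IN side).  This runs after
    # the UPDATE so freshly inserted rows are not counted twice.
    cur.execute(
        "INSERT INTO bayes_token (token, spam_count, ham_count)"
        " SELECT token, ?, ? FROM incoming"
        " WHERE token NOT IN (SELECT token FROM bayes_token)",
        (spam_delta, ham_delta),
    )
    inserted = cur.rowcount
    conn.commit()
    return updated, inserted
```

Each batch costs two statements no matter how many tokens it holds, instead
of one attempted update (and possibly one insert) per token.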

While these optimizations seem to make a huge difference in loading the
token data, the real life scenario is a little different.

You see, the database keeps track of the number of times each token was
found in ham or spam, so that when we see a new message we can parse it
into tokens and then compare them with the database to see how likely the
message is to be spam, based on the statistics of the tokens we have
already learned.
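
To make the role of those per-token counts concrete, here is a deliberately
simplified estimate of how "spammy" a single token is.  The function name and
the formula (relative frequency in spam versus ham) are assumptions for
illustration; SpamAssassin's real scoring is more involved than this.

```python
def token_spam_probability(spam_count, ham_count, total_spam, total_ham):
    """Rough per-token spam estimate from learned counts.

    spam_count / ham_count: times the token was seen in spam / ham.
    total_spam / total_ham: number of spam / ham messages learned so far.
    Returns 0.5 (no opinion) when there is not enough data to decide.
    """
    if total_spam == 0 or total_ham == 0:
        return 0.5
    spam_freq = spam_count / total_spam
    ham_freq = ham_count / total_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen before
    return spam_freq / (spam_freq + ham_freq)
```

Scoring a message only reads these counts; learning from a message is what
generates the inserts and updates discussed in this thread.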

Since we would want to commit this data after each message, the number
of tokens processed at one time would probably only be a few hundred,
most of which are probably updates after we have trained on a few
thousand emails.

I apologize if my crude benchmark was misleading.  It was meant to
simulate the sheer number of inserts/updates the database may go through,
in an environment that didn't require people to load spamassassin and
start training on spam.

I'll do some more testing on Monday.  Perhaps grouping even 200 tokens at
a time using your method will yield significant gains, but probably not
as dramatic as the gains it shows with my loading benchmark.
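
A small harness for that kind of test might look like the sketch below.  It
splits a token stream into groups of 200 and times a caller-supplied function
on each group.  The names batches, time_batches, and apply_batch are
hypothetical; apply_batch stands for whatever routine writes one group of
tokens to the database and commits.

```python
import time
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def time_batches(tokens, apply_batch, size=200):
    """Call apply_batch on each group and record (group_size, seconds)."""
    timings = []
    for batch in batches(tokens, size):
        start = time.perf_counter()
        apply_batch(batch)  # hypothetical: one commit per group
        timings.append((len(batch), time.perf_counter() - start))
    return timings
```

Comparing the per-group timings at size=200 against one large load should
show how much of the bulk-load gain survives at message-sized batches.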

I'll post more when I have a chance to look at this in more depth.

Thanks,
schu
