Karim Nassar wrote:
>
> kan4@slap-happy:~/k-bayesBenchmark$ time ./test.pl
> <-- snip db creation stuff -->
> 17:18:44 -- START
> 17:19:37 -- AFTER TEMP LOAD : loaded 120596 records
> 17:19:46 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
> 17:19:50 -- AFTER bayes_vars UPDATE : updated 1 records
> 17:23:37 -- AFTER bayes_token UPDATE : updated 47537 records
> DONE
>
> real 5m4.551s
> user 0m29.442s
> sys 0m3.925s
>
>
> I am sure someone smarter could optimize further.
>
> Anyone with a super-spifty machine wanna see if there is an improvement
> here?
>
There is a great improvement in loading the data. While I didn't load
it on my server, my test box shows significant gains.
It seems that the main thing your script does differently is separate the
updates from the inserts, so that an expensive update isn't attempted when
we only want to insert. The other major difference is the 'IN' and 'NOT IN'
syntax, which looks to be much faster than trying everything as an update
before inserting.
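
For anyone following along, here is a minimal sketch of that insert/update
split. It is not Karim's test.pl: it uses Python with the standard-library
sqlite3 module as a stand-in for the Perl script and its database. The
staging table name token_batch and the three-column bayes_token layout
(token, spam_count, ham_count) are assumptions made for the example. It
also runs the UPDATE before the INSERT, which is this sketch's own choice,
so that freshly inserted rows are not counted a second time.

```python
import sqlite3


def merge_tokens(conn, batch):
    """Merge (token, spam_count, ham_count) rows into bayes_token.

    Assumes each token appears at most once in `batch`.
    Returns (inserted, updated) row counts.
    """
    cur = conn.cursor()
    # Hypothetical staging table holding the incoming batch.
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS token_batch ("
        "token TEXT PRIMARY KEY, spam_count INTEGER, ham_count INTEGER)"
    )
    cur.execute("DELETE FROM token_batch")
    cur.executemany("INSERT INTO token_batch VALUES (?, ?, ?)", batch)

    # Tokens already known: add the batch counts to the stored counts.
    cur.execute(
        "UPDATE bayes_token SET "
        "spam_count = spam_count + (SELECT b.spam_count FROM token_batch b "
        "                           WHERE b.token = bayes_token.token), "
        "ham_count = ham_count + (SELECT b.ham_count FROM token_batch b "
        "                         WHERE b.token = bayes_token.token) "
        "WHERE token IN (SELECT token FROM token_batch)"
    )
    updated = cur.rowcount

    # Tokens not yet known: plain INSERT, no failed UPDATE attempt first.
    cur.execute(
        "INSERT INTO bayes_token (token, spam_count, ham_count) "
        "SELECT token, spam_count, ham_count FROM token_batch "
        "WHERE token NOT IN (SELECT token FROM bayes_token)"
    )
    inserted = cur.rowcount

    conn.commit()
    return inserted, updated


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes_token ("
    "token TEXT PRIMARY KEY, "
    "spam_count INTEGER NOT NULL DEFAULT 0, "
    "ham_count INTEGER NOT NULL DEFAULT 0)"
)
first = merge_tokens(conn, [("cheap", 1, 0), ("meeting", 0, 1)])
second = merge_tokens(conn, [("cheap", 1, 0), ("invoice", 0, 1)])
print(first, second)
```

With these two sample batches, the first call inserts both tokens and the
second inserts one new token and updates one existing token.
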
While these optimizations seem to make a huge difference in loading the
token data, the real life scenario is a little different.
You see, the database keeps track of the number of times each token was
found in ham or spam, so that when we see a new message we can parse it
into tokens, then compare with the database to see how likely the
message is to be spam based on the statistics of the tokens we have
already learned on.
Since we would want to commit this data after each message, the number
of tokens processed at one time would probably only be a few hundred,
most of which are probably updates after we have trained on a few
thousand emails.
I apologize if my crude benchmark was misleading. It was meant to
simulate the sheer number of inserts/updates the database may go through,
in an environment that didn't require people to load SpamAssassin and
start training on spam.
I'll do some more testing on Monday. Perhaps grouping even 200 tokens at
a time using your method will yield significant gains, but probably not
as dramatic as the gains in my loading benchmark.
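
To make the grouping idea concrete, here is a rough sketch in Python of
turning one message's tokens into per-token increments and handing them
off in groups of 200. The helper names message_rows and in_groups are
hypothetical, and counting each distinct token once per message is an
assumption of this example, not something taken from the benchmark.

```python
from itertools import islice


def message_rows(tokens, is_spam):
    """One (token, spam_inc, ham_inc) row per distinct token in a message.

    Assumption: a token counts once per message, however often it appears.
    """
    spam_inc, ham_inc = (1, 0) if is_spam else (0, 1)
    return [(tok, spam_inc, ham_inc) for tok in sorted(set(tokens))]


def in_groups(rows, size=200):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        group = list(islice(it, size))
        if not group:
            return
        yield group


# Synthetic message: 600 token occurrences, 450 of them distinct.
tokens = ["tok%d" % (i % 450) for i in range(600)]
rows = message_rows(tokens, is_spam=True)
sizes = [len(group) for group in in_groups(rows)]
print(sizes)
```

Each group would then go through one insert pass and one update pass,
followed by a commit, rather than one statement per token.
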
I'll post more when I have a chance to look at this in more depth.
Thanks,
schu