Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance

From John Arbash Meinel
Subject Re: Performance problems testing with Spamassassin 3.1.0
Date
Msg-id 42EC612A.1080908@arbash-meinel.com
In response to Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher <matt.s@aptalaska.net>)
List pgsql-performance
Matthew Schumacher wrote:

>All it's doing is trying the update before the insert to get around the
>problem of not knowing which is needed.  With only 2-3 of the queries
>implemented I'm already back to running about the same speed as the
>original SA proc that is going to ship with SA 3.1.0.
>
>All of the queries are using indexes so at this point I'm pretty
>convinced that the biggest problem is the sheer number of queries
>required to run this proc 200 times for each email (once for each token).
>
>I don't see anything that could be done to make this much faster on the
>postgres end; it's looking like the solution is going to involve cutting
>down the number of queries somehow.
>
>One thing that is still very puzzling to me is why this runs so much
>slower when I put the data.sql in a transaction.  Obviously transactions
>are acting differently when you call a proc a zillion times vs an insert
>query.
>
>
Well, I played with adding a COMMIT;BEGIN; pair to your exact test
every 1000 lines, and this is what I got:

Unmodified:
real    17m53.587s
user    0m6.204s
sys     0m3.556s

With BEGIN/COMMIT:
real    1m53.466s
user    0m5.203s
sys     0m3.211s
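
In case anyone wants to reproduce it, here is a minimal python sketch
of that rewrite (the filenames are hypothetical, and this is just one
way to inject the statements; it assumes one statement per line):

# Hypothetical helper: wrap data.sql in transactions by emitting
# COMMIT; BEGIN; every 1000 statements.
BATCH = 1000

with open('data.sql') as src, open('data_tx.sql', 'w') as dst:
    dst.write('BEGIN;\n')
    for n, line in enumerate(src, start=1):
        dst.write(line)
        if n % BATCH == 0:
            dst.write('COMMIT;\nBEGIN;\n')
    dst.write('COMMIT;\n')

Feeding data_tx.sql to psql then commits once per 1000 statements
instead of once per statement, which is where the ~10x comes from:
without explicit BEGIN/COMMIT, each INSERT is its own transaction and
has to hit disk on commit.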

So I see the potential for an almost 10-fold improvement just by
switching to transactions. I also played with the perl script (and
re-implemented it in python); for the same data, using COPY instead of
INSERT INTO for the initial load means 5s instead of 33s.
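
To make the COPY difference concrete, here is a rough sketch of the
two load paths. This assumes psycopg2, and the table/column names
(bayes_temp, token, cnt) are made up; it is not the actual script:

import io
import psycopg2

conn = psycopg2.connect('dbname=spamassassin')
cur = conn.cursor()

# Stand-in data; the real run loads ~120k (token, count) rows.
tokens = [('HTokenA', 3), ('HTokenB', 1)]

# Slow path: one INSERT per row (about 33s for the full data set).
for token, count in tokens:
    cur.execute('INSERT INTO bayes_temp (token, cnt) VALUES (%s, %s)',
                (token, count))

# Fast path: stream the same rows through COPY (about 5s).
# In practice you'd pick one path; values containing tabs or
# backslashes would need escaping before COPY.
buf = io.StringIO(''.join('%s\t%d\n' % (t, c) for t, c in tokens))
cur.copy_from(buf, 'bayes_temp', columns=('token', 'cnt'))

conn.commit()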

I also played around with adding VACUUM ANALYZE every 10 COMMITS, which
brings the speed to:

real    1m41.258s
user    0m5.394s
sys     0m3.212s

And doing VACUUM ANALYZE every 5 COMMITS makes it:
real    1m46.403s
user    0m5.597s
sys     0m3.244s

I'm assuming the slowdown at every 5 COMMITs is the extra time spent
vacuuming. In a real run, overall performance might still improve,
since you wouldn't actually be inserting all 100k rows at once.
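
The cadence I mean looks roughly like this (again a hypothetical
psycopg2 sketch, assuming one statement per line in data.sql):

import psycopg2

conn = psycopg2.connect('dbname=spamassassin')
cur = conn.cursor()

BATCH = 1000          # statements per transaction
VACUUM_EVERY = 10     # VACUUM ANALYZE every 10 commits

with open('data.sql') as f:
    for n, stmt in enumerate(f, start=1):
        if not stmt.strip():
            continue
        cur.execute(stmt)
        if n % BATCH == 0:
            conn.commit()
            if (n // BATCH) % VACUUM_EVERY == 0:
                # VACUUM can't run inside a transaction block.
                conn.autocommit = True
                cur.execute('VACUUM ANALYZE')
                conn.autocommit = False
conn.commit()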


Just to complete the reference, the perl version runs as:
10:44:02 -- START
10:44:35 -- AFTER TEMP LOAD : loaded 120596 records
10:44:39 -- AFTER bayes_token INSERT : inserted 49359 new records into
bayes_token
10:44:41 -- AFTER bayes_vars UPDATE : updated 1 records
10:46:42 -- AFTER bayes_token UPDATE : updated 47537 records
DONE

My python version runs as:
00:22:54 -- START
00:23:00 -- AFTER TEMP LOAD : loaded 120596 records
00:23:03 -- AFTER bayes_token INSERT : inserted 49359 new records into
bayes_token
00:23:06 -- AFTER bayes_vars UPDATE : updated 1 records
00:25:04 -- AFTER bayes_token UPDATE : updated 47537 records
DONE

The python version is effectively just a port of the perl code (with
many lines virtually unchanged); really the only performance difference
is that the initial data load is much faster with COPY.

This is all run on Ubuntu, with postgres 7.4.7, and a completely
unchanged postgresql.conf. (But the machine is a dual P4 2.4GHz, with
3GB of RAM).

John
=:->

>Anyway, if anyone else has any ideas I'm all ears, but at this point
>it's looking like raw query speed is needed for this app, and while I
>don't care for mysql as a database, it does have the speed going for it.
>
>schu
>


