Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance
From | John Arbash Meinel |
---|---|
Subject | Re: Performance problems testing with Spamassassin 3.1.0 |
Date | |
Msg-id | 42EC612A.1080908@arbash-meinel.com |
In response to | Re: Performance problems testing with Spamassassin 3.1.0 (Matthew Schumacher <matt.s@aptalaska.net>) |
Responses | Re: Performance problems testing with Spamassassin 3.1.0 |
List | pgsql-performance |
Matthew Schumacher wrote:

>All it's doing is trying the update before the insert to get around the
>problem of not knowing which is needed. With only 2-3 of the queries
>implemented I'm already back to running about the same speed as the
>original SA proc that is going to ship with SA 3.1.0.
>
>All of the queries are using indexes so at this point I'm pretty
>convinced that the biggest problem is the sheer number of queries
>required to run this proc 200 times for each email (once for each token).
>
>I don't see anything that could be done to make this much faster on the
>postgres end, it's looking like the solution is going to involve cutting
>down the number of queries somehow.
>
>One thing that is still very puzzling to me is why this runs so much
>slower when I put the data.sql in a transaction. Obviously transactions
>are acting different when you call a proc a zillion times vs an insert
>query.

Well, I played with adding a COMMIT;BEGIN; statement to your exact test
every 1000 lines, and this is what I got:

    Unmodified:
    real    17m53.587s
    user    0m6.204s
    sys     0m3.556s

    With BEGIN/COMMIT:
    real    1m53.466s
    user    0m5.203s
    sys     0m3.211s

So I see the potential for almost a 10-fold improvement by switching to
transactions. I also played with the perl script (and re-implemented it in
python), and for the same data, using COPY instead of INSERT INTO means
5s instead of 33s.

I also played around with adding a VACUUM ANALYZE every 10 COMMITs,
which brings the speed to:

    real    1m41.258s
    user    0m5.394s
    sys     0m3.212s

And doing a VACUUM ANALYZE every 5 COMMITs makes it:

    real    1m46.403s
    user    0m5.597s
    sys     0m3.244s

I'm assuming the slowdown relative to the 10-commit case is the extra time
spent vacuuming. In real use, overall performance might still improve,
since you wouldn't actually be inserting all 100k rows at once.

Just to complete the reference, the perl version runs as:

    10:44:02 -- START
    10:44:35 -- AFTER TEMP LOAD : loaded 120596 records
    10:44:39 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
    10:44:41 -- AFTER bayes_vars UPDATE : updated 1 records
    10:46:42 -- AFTER bayes_token UPDATE : updated 47537 records
    DONE

My python version runs as:

    00:22:54 -- START
    00:23:00 -- AFTER TEMP LOAD : loaded 120596 records
    00:23:03 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
    00:23:06 -- AFTER bayes_vars UPDATE : updated 1 records
    00:25:04 -- AFTER bayes_token UPDATE : updated 47537 records
    DONE

The python is effectively just a port of the perl code (with many lines
virtually unchanged); really the only performance difference is that the
initial data load is much faster with COPY.

This is all run on Ubuntu, with postgres 7.4.7 and a completely unchanged
postgresql.conf (but the machine is a dual P4 2.4GHz with 3GB of RAM).
Rough sketches of the update-before-insert proc and of the batched COPY
load are appended after the quoted text below.

John
=:->

>Anyway, if anyone else has any ideas I'm all ears, but at this point
>it's looking like raw query speed is needed for this app and while I
>don't care for mysql as a database, it does have the speed going for it.
>
>schu
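For reference, the update-before-insert pattern Matthew describes is
roughly the following shape. This is only a minimal sketch: the table
layout, column names, and the put_token() function are illustrative
assumptions, not the actual SA 3.1.0 schema or proc.

    -- Illustrative schema, NOT the real SA bayes schema.
    CREATE TABLE bayes_token (
        id         integer NOT NULL,
        token      bytea   NOT NULL,
        spam_count integer NOT NULL DEFAULT 0,
        ham_count  integer NOT NULL DEFAULT 0,
        PRIMARY KEY (id, token)
    );

    -- Sketch of the update-before-insert proc (7.4-style plpgsql,
    -- so arguments are aliased rather than named).
    CREATE OR REPLACE FUNCTION put_token(integer, bytea, integer, integer)
    RETURNS integer AS '
    DECLARE
        _uid   ALIAS FOR $1;
        _token ALIAS FOR $2;
        _spam  ALIAS FOR $3;
        _ham   ALIAS FOR $4;
    BEGIN
        -- Try the UPDATE first; if no row matched, fall back to INSERT.
        UPDATE bayes_token
           SET spam_count = spam_count + _spam,
               ham_count  = ham_count  + _ham
         WHERE id = _uid AND token = _token;
        IF NOT FOUND THEN
            INSERT INTO bayes_token (id, token, spam_count, ham_count)
            VALUES (_uid, _token, _spam, _ham);
        END IF;
        RETURN 1;
    END;
    ' LANGUAGE plpgsql;

Something like SELECT put_token(1, 'sometoken', 1, 0) then gets called once
per token, which is exactly where the ~200 queries per message come from.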
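And a rough sketch of the two load-side changes measured above: wrapping
the statements in batched transactions with an occasional VACUUM ANALYZE,
and loading a temp table with one COPY instead of row-at-a-time INSERTs.
The statement contents, batch size, and temp-table layout are placeholders
to show the shape of the change, not the actual data.sql or the python
script.

    -- 1. Batching data.sql: wrap every ~1000 statements in a transaction
    --    instead of letting each one commit on its own.
    BEGIN;
    SELECT put_token(1, 'token_0001', 1, 0);
    SELECT put_token(1, 'token_0002', 0, 1);
    -- ... roughly 1000 statements per batch ...
    COMMIT;
    BEGIN;
    SELECT put_token(1, 'token_1001', 1, 0);
    -- ... next batch ...
    COMMIT;

    -- 2. Every 10th COMMIT (outside any transaction block), keep the
    --    planner statistics and dead-tuple overhead under control:
    VACUUM ANALYZE bayes_token;

    -- 3. Loading the temp table with a single COPY instead of ~120k
    --    individual INSERTs (the 33s -> 5s difference).  The sample
    --    rows are tab-separated.
    CREATE TEMP TABLE bayes_temp (
        token      bytea,
        spam_count integer,
        ham_count  integer
    );
    COPY bayes_temp FROM STDIN;
    token_0001	1	0
    token_0002	0	1
    \.
    ANALYZE bayes_temp;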