Thread: Re: Performance problems testing with Spamassassin

Re: Performance problems testing with Spamassassin
From: "Luke Lonergan"
work_mem = 131072               # min 64, size in KB
shared_buffers = 16000          # min 16, at least max_connections*2, 8KB each
checkpoint_segments = 128       # in logfile segments, min 1, 16MB each
effective_cache_size = 750000   # typically 8KB each
fsync=false                     # turns forced synchronization on or off

------------------------------------------
On Bizgres (0_7_2) running on a 2GHz Opteron:
------------------------------------------
[llonergan@stinger4 bayesBenchmark]$ ./test.sh

real    0m38.348s
user    0m1.422s
sys     0m1.870s

------------------------------------------
On a 2.4GHz AMD64:
------------------------------------------
[llonergan@kite15 bayesBenchmark]$ ./test.sh

real    0m35.497s
user    0m2.250s
sys     0m0.470s

Now we turn fsync=true:

------------------------------------------
On a 2.4GHz AMD64:
------------------------------------------
[llonergan@kite15 bayesBenchmark]$ ./test.sh

real    2m7.368s
user    0m2.560s
sys     0m0.750s

I guess we see the real culprit here.  Anyone surprised it's the WAL?

- Luke

________________________________

From: pgsql-performance-owner@postgresql.org on behalf of Andrew McMillan
Sent: Thu 7/28/2005 10:50 PM
To: Matthew Schumacher
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0



On Thu, 2005-07-28 at 16:13 -0800, Matthew Schumacher wrote:
>
> Ok, I finally got some test data together so that others can test
> without installing SA.
>
> The schema and test dataset is over at
> http://www.aptalaska.net/~matt.s/bayes/bayesBenchmark.tar.gz
>
> I have a pretty fast machine with a tuned postgres and it takes about
> 2 minutes 30 seconds to load the test data.  Since the test data is the
> bayes information on 616 spam messages, that comes out to about 250ms
> per message.  While that is doable, it does add quite a bit of overhead
> to the email system.

On my laptop this takes:

real    1m33.758s
user    0m4.285s
sys     0m1.181s

One interesting effect is the data in bayes_vars has a huge number of
updates and needs vacuum _frequently_.  After the run a vacuum full
compacts it down from 461 pages to 1 page.

Regards,
                                        Andrew.

-------------------------------------------------------------------------
Andrew @ Catalyst .Net .NZ  Ltd,  PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/            PHYS: Level 2, 150-154 Willis St
DDI: +64(4)803-2201      MOB: +64(272)DEBIAN      OFFICE: +64(4)499-2267
                      I don't do it for the money.
                    -- Donald Trump, Art of the Deal

-------------------------------------------------------------------------

Re: Performance problems testing with Spamassassin
From: Alvaro Herrera
On Fri, Jul 29, 2005 at 03:01:07AM -0400, Luke Lonergan wrote:

> I guess we see the real culprit here.  Anyone surprised it's the WAL?

So what?  Are you planning to suggest that people set fsync=false?

I just had a person lose 3 days of data on some tables because of that,
even when checkpoints were 5 minutes apart.  With fsync off, there's no
fsync work _at all_ going on, not just for the WAL -- the heap/index file
fsync at checkpoint is also skipped.  This is no good.

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"In a specialized industrial society, it would be a disaster
to have kids running around loose." (Paul Graham)

Re: Performance problems testing with Spamassassin
From: Tom Lane
"Luke Lonergan" <LLonergan@greenplum.com> writes:
> I guess we see the real culprit here.  Anyone surprised it's the WAL?

You have not proved that at all.

I haven't had time to look at Matthew's problem, but someone upthread
implied that it was doing a separate transaction for each word.  If so,
collapsing that to something more reasonable (say one xact per message)
would probably help a great deal.
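The suggestion can be sketched roughly like this (a toy example with made-up
table and token names; sqlite3 stands in for PostgreSQL purely to show the
transaction-batching pattern, since with a real database each commit costs
an fsync):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bayes_token (token TEXT PRIMARY KEY, spam_count INTEGER)")

words = ["viagra", "free", "offer"]  # tokens parsed from one message

# Slow pattern: one transaction (one commit, hence one fsync) per word.
for w in words:
    with conn:  # the connection context manager commits on exit
        conn.execute("INSERT OR IGNORE INTO bayes_token VALUES (?, 0)", (w,))
        conn.execute("UPDATE bayes_token SET spam_count = spam_count + 1 "
                     "WHERE token = ?", (w,))

# Batched pattern: one transaction per message amortizes the commit cost.
with conn:
    for w in words:
        conn.execute("INSERT OR IGNORE INTO bayes_token VALUES (?, 0)", (w,))
        conn.execute("UPDATE bayes_token SET spam_count = spam_count + 1 "
                     "WHERE token = ?", (w,))
```

Both loops do identical logical work; only the commit granularity differs,
which is exactly the cost fsync makes visible.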

            regards, tom lane

Re: Performance problems testing with Spamassassin
From: Josh Berkus
Luke,

> work_mem = 131072               # min 64, size in KB

Incidentally, this is much too high for an OLTP application, although I don't
think this would have affected the test.

> shared_buffers = 16000          # min 16, at least max_connections*2, 8KB each
> checkpoint_segments = 128       # in logfile segments, min 1, 16MB each
> effective_cache_size = 750000   # typically 8KB each
> fsync=false                     # turns forced synchronization on or off

Try changing:
wal_buffers = 256

and try Bruce's stop full_page_writes patch.

> I guess we see the real culprit here.  Anyone surprised it's the WAL?

Nope.  On high-end OLTP stuff, it's crucial that the WAL have its own
dedicated disk resource.

Also, running a complex stored procedure for each and every word in each
e-mail is rather deadly ... with the e-mail traffic our server at Globix
receives, for example, that would amount to running it about 1,000 times a
minute.   It would be far better to batch this, somehow, maybe using temp
tables.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Performance problems testing with Spamassassin
From: "Luke Lonergan"
Alvaro,

On 7/29/05 6:23 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:

> On Fri, Jul 29, 2005 at 03:01:07AM -0400, Luke Lonergan wrote:
>
>> I guess we see the real culprit here.  Anyone surprised it's the WAL?
>
> So what?  Are you planning to suggest that people set fsync=false?

That's not the conclusion I made, no.  I was pointing out that fsync has a
HUGE impact on his problem, which implies something to do with the I/O sync
operations.  Black box bottleneck hunt approach #12.

> With fsync off, there's no fsync work _at all_ going on, not just for the
> WAL -- the heap/index file fsync at checkpoint is also skipped.  This is
> no good.

OK - so that's what Tom is pointing out, that fsync impacts more than WAL.

However, finding out that fsync/no fsync makes a 400% difference in speed
for this problem is interesting and relevant, no?

- Luke



Re: Performance problems testing with Spamassassin
From: "Luke Lonergan"
Tom,

On 7/29/05 7:12 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> "Luke Lonergan" <LLonergan@greenplum.com> writes:
>> I guess we see the real culprit here.  Anyone surprised it's the WAL?
>
> You have not proved that at all.

As Alvaro pointed out, fsync has impact on more than WAL, so good point.
Interesting that fsync has such a huge impact on this situation though.

- Luke



Re: Performance problems testing with Spamassassin
From: Matthew Schumacher
Ok,

Here is something new: when I take my data.sql file and add a BEGIN at the
top and a COMMIT at the bottom, the benchmark is a LOT slower.

My understanding is that it should be much faster, because fsync isn't
called until the COMMIT instead of on every SQL command.

I must be missing something here.

schu

Re: Performance problems testing with Spamassassin
From: Karim Nassar
On Fri, 2005-07-29 at 09:47 -0700, Josh Berkus wrote:
> Try changing:
> wal_buffers = 256
>
> and try Bruce's stop full_page_writes patch.
>
> > I guess we see the real culprit here.  Anyone surprised it's the WAL?
>
> Nope.  On high-end OLTP stuff, it's crucial that the WAL have its own
> dedicated disk resource.
>
> Also, running a complex stored procedure for each and every word in each
> e-mail is rather deadly ... with the e-mail traffic our server at Globix
> receives, for example, that would amount to running it about 1,000 times a
> minute.

Is this a real-world fix? Seems to me that SpamAssassin runs on a
plethora of mail servers, and optimizing his/her/my/your pg config
doesn't solve the root problem: there are thousands of (seemingly)
high-overhead function calls being executed.


> It would be far better to batch this, somehow, maybe using temp
> tables.

Agreed. On my G4 laptop running the default configured Ubuntu Linux
postgresql 7.4.7 package, it took 43 minutes for Matthew's script to run
(I ran it twice just to be sure). In my spare time over the last day, I
created a brute force perl script that took under 6 minutes. Am I on to
something, or did I just optimize for *my* system?

http://ccl.cens.nau.edu/~kan4/files/k-bayesBenchmark.tar.gz

kan4@slap-happy:~/k-bayesBenchmark$ time ./test.pl
<-- snip db creation stuff -->
17:18:44 -- START
17:19:37 -- AFTER TEMP LOAD : loaded 120596 records
17:19:46 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
17:19:50 -- AFTER bayes_vars UPDATE : updated 1 records
17:23:37 -- AFTER bayes_token UPDATE : updated 47537 records
DONE

real    5m4.551s
user    0m29.442s
sys     0m3.925s


I am sure someone smarter could optimize further.

Anyone with a super-spifty machine wanna see if there is an improvement
here?

--
Karim Nassar <karim.nassar@acm.org>


Re: Performance problems testing with Spamassassin
From: Matthew Schumacher
Karim Nassar wrote:
>
> kan4@slap-happy:~/k-bayesBenchmark$ time ./test.pl
> <-- snip db creation stuff -->
> 17:18:44 -- START
> 17:19:37 -- AFTER TEMP LOAD : loaded 120596 records
> 17:19:46 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
> 17:19:50 -- AFTER bayes_vars UPDATE : updated 1 records
> 17:23:37 -- AFTER bayes_token UPDATE : updated 47537 records
> DONE
>
> real    5m4.551s
> user    0m29.442s
> sys     0m3.925s
>
>
> I am sure someone smarter could optimize further.
>
> Anyone with a super-spifty machine wanna see if there is an improvement
> here?
>

There is a great improvement in loading the data.  While I didn't load
it on my server, my test box shows significant gains.

It seems that the only thing your script does differently is separate the
updates from the inserts, so that an expensive update isn't attempted when
we really want an insert.  The other major difference is the 'IN' and
'NOT IN' syntax, which looks to be much faster than trying everything as
an update before inserting.
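The pattern, roughly (table and column names are guesses at the benchmark
schema, and sqlite3 stands in for PostgreSQL): stage a message's tokens in a
temp table, then issue one UPDATE for known tokens and one INSERT for new
ones instead of one statement per word:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bayes_token (token TEXT PRIMARY KEY, spam_count INTEGER);
INSERT INTO bayes_token VALUES ('free', 5);   -- a token we already know
CREATE TEMP TABLE msg_tokens (token TEXT);
""")

# Stage every token from the incoming message.
for t in ["free", "offer", "offer", "viagra"]:
    conn.execute("INSERT INTO msg_tokens VALUES (?)", (t,))

with conn:
    # One UPDATE for tokens we have seen before ('IN') ...
    conn.execute("""
        UPDATE bayes_token
           SET spam_count = spam_count +
               (SELECT COUNT(*) FROM msg_tokens m WHERE m.token = bayes_token.token)
         WHERE token IN (SELECT token FROM msg_tokens)
    """)
    # ... and one INSERT for tokens we have never seen ('NOT IN').
    conn.execute("""
        INSERT INTO bayes_token
        SELECT token, COUNT(*) FROM msg_tokens
        WHERE token NOT IN (SELECT token FROM bayes_token)
        GROUP BY token
    """)
```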

While these optimizations seem to make a huge difference in loading the
token data, the real life scenario is a little different.

You see, the database keeps track of the number of times each token was
found in ham or spam, so that when we see a new message we can parse it
into tokens and compare them against the database to see how likely the
message is spam, based on the statistics for tokens we have already
learned.

Since we would want to commit this data after each message, the number
of tokens processed at one time would probably be only a few hundred,
most of which are probably updates once we have trained on a few
thousand emails.

I apologize if my crude benchmark was misleading; it was meant to
simulate the sheer number of inserts and updates the database may go
through in an environment that doesn't require people to load
spamassassin and start training on spam.

I'll do some more testing on Monday, perhaps grouping even 200 tokens at
a time using your method will yield significant gains, but probably not
as dramatic as it does using my loading benchmark.

I'll post more when I have a chance to look at this in more depth.

Thanks,
schu

Re: Performance problems testing with Spamassassin
From: Karim Nassar
On Sat, 2005-07-30 at 00:46 -0800, Matthew Schumacher wrote:

> I'll do some more testing on Monday, perhaps grouping even 200 tokens at
> a time using your method will yield significant gains, but probably not
> as dramatic as it does using my loading benchmark.

In that case, some of the clauses could be simplified further since we
know that we are dealing with only one user. I don't know what that will
get us, since postgres is so damn clever.

I suspect that the aggregate functions will be more efficient when you
do this, since the temp table will be much smaller, but I am only
guessing at this point.

If you need to support a massive initial data load, further time savings
are to be had by doing COPY instead of 126,000 inserts.
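In PostgreSQL that would be `COPY bayes_token FROM STDIN` rather than
row-at-a-time INSERTs. A rough sqlite3 stand-in for the same bulk-load idea
(one transaction, one prepared statement, many rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bayes_token (token TEXT, spam_count INTEGER)")

# Fabricated rows purely for illustration.
rows = [("tok%d" % i, 1) for i in range(10000)]

# One transaction, one statement reused for every row -- the same reason
# COPY beats 126,000 separately parsed INSERT statements.
with conn:
    conn.executemany("INSERT INTO bayes_token VALUES (?, ?)", rows)
```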

Please do keep us updated.

Thanking all the gods and/or developers for spamassassin,
--
Karim Nassar <karim.nassar@acm.org>