Thread: Performance problems testing with Spamassassin 3.1.0 Bayes module.
I'm not sure how much this has been discussed on the list, but I wasn't able to find anything relevant in the archives.

The new Spamassassin is due out pretty soon. They are currently testing 3.1.0pre4. One of the things I hope to get out of this release is bayes word stats moved to a real RDBMS. They have separated the mysql BayesStore module from the PgSQL one, so now postgres can use its own queries.

I loaded all of this stuff up on a test server and am finding that the bayes put performance is really not good enough for any real amount of mail load.

The performance problem seems to be when the bayes module is inserting/updating. This is now handled by the put_token procedure. After playing with various indexes and what not I simply am unable to make this procedure perform any better. Perhaps someone on the list can spot the bottleneck and reveal why this procedure isn't performing that well, or ways to make it better.

I put the rest of the schema up at http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone needs to see it too.

CREATE OR REPLACE FUNCTION put_token(integer,
                                     bytea,
                                     integer,
                                     integer,
                                     integer)
RETURNS bool AS '
DECLARE
  inuserid        ALIAS FOR $1;
  intoken         ALIAS FOR $2;
  inspam_count    ALIAS FOR $3;
  inham_count     ALIAS FOR $4;
  inatime         ALIAS FOR $5;
  got_token       record;
  updated_atime_p bool;
BEGIN
  updated_atime_p := FALSE;

  SELECT INTO got_token spam_count, ham_count, atime
    FROM bayes_token
   WHERE id = inuserid
     AND token = intoken;

  IF NOT FOUND THEN
    -- we do not insert negative counts, just return true
    IF (inspam_count < 0 OR inham_count < 0) THEN
      RETURN TRUE;
    END IF;

    INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
    VALUES (inuserid, intoken, inspam_count, inham_count, inatime);

    IF NOT FOUND THEN
      RAISE EXCEPTION ''unable to insert into bayes_token'';
      RETURN FALSE;
    END IF;

    UPDATE bayes_vars SET token_count = token_count + 1
     WHERE id = inuserid;

    IF NOT FOUND THEN
      RAISE EXCEPTION ''unable to update token_count in bayes_vars'';
      RETURN FALSE;
    END IF;

    UPDATE bayes_vars SET newest_token_age = inatime
     WHERE id = inuserid AND newest_token_age < inatime;

    IF NOT FOUND THEN
      UPDATE bayes_vars SET oldest_token_age = inatime
       WHERE id = inuserid AND oldest_token_age > inatime;
    END IF;

    RETURN TRUE;
  ELSE
    IF (inspam_count != 0) THEN
      -- no need to update atime if it is < the existing value
      IF (inatime < got_token.atime) THEN
        UPDATE bayes_token
           SET spam_count = spam_count + inspam_count
         WHERE id = inuserid
           AND token = intoken
           AND spam_count + inspam_count >= 0;
      ELSE
        UPDATE bayes_token
           SET spam_count = spam_count + inspam_count,
               atime = inatime
         WHERE id = inuserid
           AND token = intoken
           AND spam_count + inspam_count >= 0;
        IF FOUND THEN
          updated_atime_p := TRUE;
        END IF;
      END IF;
    END IF;

    IF (inham_count != 0) THEN
      -- no need to update atime if it is < the existing value or if it was already updated
      IF inatime < got_token.atime OR updated_atime_p THEN
        UPDATE bayes_token
           SET ham_count = ham_count + inham_count
         WHERE id = inuserid
           AND token = intoken
           AND ham_count + inham_count >= 0;
      ELSE
        UPDATE bayes_token
           SET ham_count = ham_count + inham_count,
               atime = inatime
         WHERE id = inuserid
           AND token = intoken
           AND ham_count + inham_count >= 0;
        IF FOUND THEN
          updated_atime_p := TRUE;
        END IF;
      END IF;
    END IF;

    IF updated_atime_p THEN
      UPDATE bayes_vars SET oldest_token_age = inatime
       WHERE id = inuserid AND oldest_token_age > inatime;
    END IF;

    RETURN TRUE;
  END IF;
END;
' LANGUAGE 'plpgsql';
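For reference, a minimal sketch of the two tables this procedure touches, with column names inferred from the procedure body and from the MySQL schema quoted later in the thread (the authoritative definitions are in the bayes_pg.sql file linked above), followed by an example call:

CREATE TABLE bayes_token (
    id          integer NOT NULL,
    token       bytea   NOT NULL,
    spam_count  integer NOT NULL DEFAULT 0,
    ham_count   integer NOT NULL DEFAULT 0,
    atime       integer NOT NULL DEFAULT 0,
    PRIMARY KEY (id, token)
);

CREATE TABLE bayes_vars (
    id                integer PRIMARY KEY,
    token_count       integer NOT NULL DEFAULT 0,
    oldest_token_age  integer NOT NULL DEFAULT 2147483647,
    newest_token_age  integer NOT NULL DEFAULT 0
    -- the real table also carries username and several per-user counters
);

-- one token for user 1, counted once as spam; atime is a unix timestamp
SELECT put_token(1, 'example-token'::bytea, 1, 0, 1122512160);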
Matt,

> After playing with various indexes and what not I simply am unable to
> make this procedure perform any better. Perhaps someone on the list can
> spot the bottleneck and reveal why this procedure isn't performing that
> well or ways to make it better.

Well, my first thought is that this is a pretty complicated procedure for something you want to perform well. Is all this logic really necessary? How does it get done for MySQL?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus wrote:
> Matt,
>
>> After playing with various indexes and what not I simply am unable to
>> make this procedure perform any better. Perhaps someone on the list can
>> spot the bottleneck and reveal why this procedure isn't performing that
>> well or ways to make it better.
>
> Well, my first thought is that this is a pretty complicated procedure for
> something you want to perform well. Is all this logic really necessary?
> How does it get done for MySQL?

I'm not sure if it's all needed; in MySQL they have this simple schema:

===============================================
CREATE TABLE bayes_expire (
  id int(11) NOT NULL default '0',
  runtime int(11) NOT NULL default '0',
  KEY bayes_expire_idx1 (id)
) TYPE=MyISAM;

CREATE TABLE bayes_global_vars (
  variable varchar(30) NOT NULL default '',
  value varchar(200) NOT NULL default '',
  PRIMARY KEY (variable)
) TYPE=MyISAM;

INSERT INTO bayes_global_vars VALUES ('VERSION','3');

CREATE TABLE bayes_seen (
  id int(11) NOT NULL default '0',
  msgid varchar(200) binary NOT NULL default '',
  flag char(1) NOT NULL default '',
  PRIMARY KEY (id,msgid)
) TYPE=MyISAM;

CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token char(5) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY (id, token),
  INDEX bayes_token_idx1 (token),
  INDEX bayes_token_idx2 (id, atime)
) TYPE=MyISAM;

CREATE TABLE bayes_vars (
  id int(11) NOT NULL AUTO_INCREMENT,
  username varchar(200) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  token_count int(11) NOT NULL default '0',
  last_expire int(11) NOT NULL default '0',
  last_atime_delta int(11) NOT NULL default '0',
  last_expire_reduce int(11) NOT NULL default '0',
  oldest_token_age int(11) NOT NULL default '2147483647',
  newest_token_age int(11) NOT NULL default '0',
  PRIMARY KEY (id),
  UNIQUE bayes_vars_idx1 (username)
) TYPE=MyISAM;
===============================================

Then they do this to insert the token:

INSERT INTO bayes_token (
  id,
  token,
  spam_count,
  ham_count,
  atime
) VALUES (
  ?,
  ?,
  ?,
  ?,
  ?
) ON DUPLICATE KEY UPDATE
  spam_count = GREATEST(spam_count + ?, 0),
  ham_count = GREATEST(ham_count + ?, 0),
  atime = GREATEST(atime, ?)

Or this to update bayes_vars:

UPDATE bayes_vars SET
  $token_count_update
  newest_token_age = GREATEST(newest_token_age, ?),
  oldest_token_age = LEAST(oldest_token_age, ?)
WHERE id = ?

I think the reason why the procedure was written for postgres was because of the GREATEST and LEAST statements performing poorly.

Honestly, I'm not real up on writing procs, I was hoping the problem would be obvious to someone.

schu
Matt,

> UPDATE bayes_vars SET
>   $token_count_update
>   newest_token_age = GREATEST(newest_token_age, ?),
>   oldest_token_age = LEAST(oldest_token_age, ?)
> WHERE id = ?
>
> I think the reason why the procedure was written for postgres was
> because of the greatest and least statements performing poorly.

Well, it might be because we don't have a built-in GREATEST or LEAST prior to 8.1. However, it's pretty darned easy to construct one.

> Honestly, I'm not real up on writing procs, I was hoping the problem
> would be obvious to someone.

Well, there's the general performance tuning stuff of course (postgresql.conf) which, if you've not done any of it, will pretty dramatically affect your throughput rates. And vacuum, analyze, indexes, etc.

You should also look at ways to make the SP simpler. For example, you have a cycle that looks like:

SELECT
  IF NOT FOUND
    INSERT
  ELSE
    UPDATE

Which could be made shorter as:

UPDATE
  IF NOT FOUND
    INSERT

... saving you one index scan.

Also, I don't quite follow it, but the procedure seems to be doing at least two steps that the MySQL version isn't doing at all. If the PG version is doing more things, of course it's going to take longer.

Finally, when you have a proc you're happy with, I suggest having an expert re-write it in C, which should double the procedure performance.

--
Josh Berkus
Aglio Database Solutions
San Francisco
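For what it's worth, prior to 8.1 the GREATEST(x, 0) / LEAST(x, y) behaviour can also be written inline with CASE expressions rather than a helper function; a sketch against the Postgres bayes_token table (the literal values here are placeholders):

UPDATE bayes_token
   SET spam_count = CASE WHEN spam_count + 1 > 0 THEN spam_count + 1 ELSE 0 END,
       atime      = CASE WHEN atime > 1122512160 THEN atime ELSE 1122512160 END
 WHERE id = 1
   AND token = 'example-token'::bytea;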
Josh Berkus wrote:
> Matt,
>
> Well, it might be because we don't have a built-in GREATEST or LEAST prior to
> 8.1. However, it's pretty darned easy to construct one.

I was more talking about min() and max() but yea, I think you knew where I was going with it...

> Well, there's the general performance tuning stuff of course (postgresql.conf)
> which, if you've not done any of it, will pretty dramatically affect your
> throughput rates. And vacuum, analyze, indexes, etc.

I have gone through all that.

> You should also look at ways to make the SP simpler. For example, you have a
> cycle that looks like:
>
> SELECT
>   IF NOT FOUND
>     INSERT
>   ELSE
>     UPDATE
>
> Which could be made shorter as:
>
> UPDATE
>   IF NOT FOUND
>     INSERT
>
> ... saving you one index scan.
>
> Also, I don't quite follow it, but the procedure seems to be doing at least
> two steps that the MySQL version isn't doing at all. If the PG version is
> doing more things, of course it's going to take longer.
>
> Finally, when you have a proc you're happy with, I suggest having an expert
> re-write it in C, which should double the procedure performance.

Sounds like I need to completely understand what the proc is doing and work on a rewrite. I'll look into writing it in C, I need to do some reading about how that works and exactly what it buys you.

Thanks for the helpful comments.

schu
Matthew Schumacher <matt.s@aptalaska.net> writes:
> After playing with various indexes and what not I simply am unable to
> make this procedure perform any better. Perhaps someone on the list can
> spot the bottleneck and reveal why this procedure isn't performing that
> well or ways to make it better.

There's not anything obviously wrong with that procedure --- all of the updates are on primary keys, so one would expect reasonably efficient query plans to get chosen. Perhaps it'd be worth the trouble to build the server with profiling enabled and get a gprof trace to see where the time is going.

			regards, tom lane
On Wed, 2005-07-27 at 14:35 -0800, Matthew Schumacher wrote: > I put the rest of the schema up at > http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone > needs to see it too. Do you have sample data too? -- Karim Nassar Collaborative Computing Lab of NAU Office: (928) 523 5868 -=- Mobile: (928) 699 9221 http://ccl.cens.nau.edu/~kan4
> I'm not sure how much this has been discussed on the list, but I wasn't
> able to find anything relevant in the archives.
>
> The new Spamassassin is due out pretty soon. They are currently testing
> 3.1.0pre4. One of the things I hope to get out of this release is bayes
> word stats moved to a real RDBMS. They have separated the mysql
> BayesStore module from the PgSQL one, so now postgres can use its own
> queries.
>
> I loaded all of this stuff up on a test server and am finding that the
> bayes put performance is really not good enough for any real amount of
> mail load.
>
> The performance problem seems to be when the bayes module is
> inserting/updating. This is now handled by the put_token procedure.

1. You need high performance client side timing (sub 1 millisecond). On win32 use QueryPerformanceCounter.

2. One by one, convert queries inside your routine into dynamic versions. That is, use execute 'query string'.

3. Identify the problem. Something somewhere is not using the index. Because of the way the planner works you have to do this sometimes.

Merlin
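As an illustration of point 2, one of the simpler statements inside put_token rewritten as a dynamic EXECUTE might look roughly like this (a sketch only; the quotes are doubled because the function body quoted earlier is itself single-quoted, and the integer is cast to text for the concatenation):

EXECUTE ''UPDATE bayes_vars SET token_count = token_count + 1 ''
     || ''WHERE id = '' || inuserid::text;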
Karim Nassar wrote:
> On Wed, 2005-07-27 at 14:35 -0800, Matthew Schumacher wrote:
>
>> I put the rest of the schema up at
>> http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone
>> needs to see it too.
>
> Do you have sample data too?

Ok, I finally got some test data together so that others can test without installing SA.

The schema and test dataset is over at http://www.aptalaska.net/~matt.s/bayes/bayesBenchmark.tar.gz

I have a pretty fast machine with a tuned postgres and it takes it about 2 minutes 30 seconds to load the test data. Since the test data is the bayes information on 616 spam messages, that comes out to be about 250ms per message. While that is doable, it does add quite a bit of overhead to the email system.

Perhaps this is as fast as I can expect it to go; if that's the case I may have to look at mysql, but I really don't want to do that.

I will be working on some other benchmarks, and reading through exactly how bayes works, but at least there is some data to play with.

schu
On Thu, 28 Jul 2005, Matthew Schumacher wrote:

> Karim Nassar wrote:
>> On Wed, 2005-07-27 at 14:35 -0800, Matthew Schumacher wrote:
>>
>>> I put the rest of the schema up at
>>> http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone
>>> needs to see it too.
>>
>> Do you have sample data too?
>
> Ok, I finally got some test data together so that others can test
> without installing SA.
>
> The schema and test dataset is over at
> http://www.aptalaska.net/~matt.s/bayes/bayesBenchmark.tar.gz
>
> I have a pretty fast machine with a tuned postgres and it takes it about
> 2 minutes 30 seconds to load the test data. Since the test data is the
> bayes information on 616 spam messages, that comes out to be about 250ms
> per message. While that is doable, it does add quite a bit of overhead
> to the email system.

I had a look at your data -- thanks.

I have a question though: put_token() is invoked 120596 times in your benchmark... for 616 messages. That's nearly 200 queries (not even counting the 1-8 (??) inside the function itself) per message. Something doesn't seem right there....

Gavin
Gavin Sherry wrote:
>
> I had a look at your data -- thanks.
>
> I have a question though: put_token() is invoked 120596 times in your
> benchmark... for 616 messages. That's nearly 200 queries (not even
> counting the 1-8 (??) inside the function itself) per message. Something
> doesn't seem right there....
>
> Gavin

I am pretty sure that's right because it is doing word statistics on email messages.

I need to spend some time studying the code, I just haven't found time yet.

Would it be safe to say that there aren't any glaring performance penalties other than the sheer volume of queries?

Thanks,
schu
On Thu, 2005-07-28 at 16:13 -0800, Matthew Schumacher wrote:
>
> Ok, I finally got some test data together so that others can test
> without installing SA.
>
> The schema and test dataset is over at
> http://www.aptalaska.net/~matt.s/bayes/bayesBenchmark.tar.gz
>
> I have a pretty fast machine with a tuned postgres and it takes it about
> 2 minutes 30 seconds to load the test data. Since the test data is the
> bayes information on 616 spam messages, that comes out to be about 250ms
> per message. While that is doable, it does add quite a bit of overhead
> to the email system.

On my laptop this takes:

real    1m33.758s
user    0m4.285s
sys     0m1.181s

One interesting effect is the data in bayes_vars has a huge number of updates and needs vacuum _frequently_. After the run a vacuum full compacts it down from 461 pages to 1 page.

Regards,
					Andrew.

-------------------------------------------------------------------------
Andrew @ Catalyst .Net .NZ Ltd,  PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/            PHYS: Level 2, 150-154 Willis St
DDI: +64(4)803-2201      MOB: +64(272)DEBIAN      OFFICE: +64(4)499-2267
       I don't do it for the money. -- Donald Trump, Art of the Deal
-------------------------------------------------------------------------
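To watch that bayes_vars bloat yourself after a run, something like this shows the dead-row cleanup and the page counts before and after compaction (table name taken from the benchmark schema):

VACUUM VERBOSE bayes_vars;
VACUUM FULL VERBOSE bayes_vars;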
On Thu, 28 Jul 2005, Matthew Schumacher wrote:

> Gavin Sherry wrote:
>>
>> I had a look at your data -- thanks.
>>
>> I have a question though: put_token() is invoked 120596 times in your
>> benchmark... for 616 messages. That's nearly 200 queries (not even
>> counting the 1-8 (??) inside the function itself) per message. Something
>> doesn't seem right there....
>>
>> Gavin
>
> I am pretty sure that's right because it is doing word statistics on
> email messages.
>
> I need to spend some time studying the code, I just haven't found time yet.
>
> Would it be safe to say that there aren't any glaring performance
> penalties other than the sheer volume of queries?

Well, everything relating to one message should be issued in a transaction block. Secondly, the initial select may be unnecessary -- I haven't looked at the logic that closely.

There is, potentially, some parser overhead. In C, you could get around this with PQprepare() et al.

It would also be interesting to look at the cost of a C function.

Gavin
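The same prepare-once idea can be tried at the SQL level without writing any C, by preparing the procedure call once per connection and executing it per token; a rough sketch (parameter values are placeholders):

PREPARE put_tok (integer, bytea, integer, integer, integer) AS
    SELECT put_token($1, $2, $3, $4, $5);

EXECUTE put_tok(1, 'example-token'::bytea, 1, 0, 1122512160);
-- ... one EXECUTE per token, all inside the per-message transaction ...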
On Wed, 27 Jul 2005, Matthew Schumacher wrote:

> Then they do this to insert the token:
>
> INSERT INTO bayes_token (
>   id,
>   token,
>   spam_count,
>   ham_count,
>   atime
> ) VALUES (
>   ?,
>   ?,
>   ?,
>   ?,
>   ?
> ) ON DUPLICATE KEY UPDATE
>   spam_count = GREATEST(spam_count + ?, 0),
>   ham_count = GREATEST(ham_count + ?, 0),
>   atime = GREATEST(atime, ?)
>
> Or this to update bayes_vars:
>
> UPDATE bayes_vars SET
>   $token_count_update
>   newest_token_age = GREATEST(newest_token_age, ?),
>   oldest_token_age = LEAST(oldest_token_age, ?)
> WHERE id = ?
>
> I think the reason why the procedure was written for postgres was
> because of the greatest and least statements performing poorly.

How can they perform poorly when they are dead simple? Here are 2 functions that work for the above cases of greatest:

CREATE FUNCTION greatest_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
 LANGUAGE SQL;

CREATE FUNCTION least_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
 LANGUAGE SQL;

and these should be inlined by pg and very fast to execute.

I wrote a function that should do what the insert above does. The update I've not looked at (I don't know what $token_count_update is) but the update looks simple enough to just implement the same way in pg as in mysql.

For the insert or replace case you can probably use this function:

CREATE FUNCTION insert_or_update_token (xid INTEGER,
                                        xtoken BYTEA,
                                        xspam_count INTEGER,
                                        xham_count INTEGER,
                                        xatime INTEGER)
RETURNS VOID AS
$$
BEGIN
  LOOP
    UPDATE bayes_token
       SET spam_count = greatest_int (spam_count + xspam_count, 0),
           ham_count  = greatest_int (ham_count + xham_count, 0),
           atime      = greatest_int (atime, xatime)
     WHERE id = xid
       AND token = xtoken;

    IF found THEN
      RETURN;
    END IF;

    BEGIN
      INSERT INTO bayes_token
      VALUES (xid, xtoken, xspam_count, xham_count, xatime);
      RETURN;
    EXCEPTION WHEN unique_violation THEN
      -- do nothing
    END;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

It's not really tested so I can't tell if it's faster than what you have. What it does do is mimic the way you insert values in mysql. It only works on pg 8.0 and later however, since the exception handling was added in 8.0.

--
/Dennis Björklund
Dennis, > EXCEPTION WHEN unique_violation THEN I seem to remember that catching an exception in a PL/pgSQL procedure was a large performance cost. It'd be better to do UPDATE ... IF NOT FOUND. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Andrew McMillan wrote:
> On Thu, 2005-07-28 at 16:13 -0800, Matthew Schumacher wrote:
>
>> Ok, I finally got some test data together so that others can test
>> without installing SA.
>>
>> The schema and test dataset is over at
>> http://www.aptalaska.net/~matt.s/bayes/bayesBenchmark.tar.gz
>>
>> I have a pretty fast machine with a tuned postgres and it takes it about
>> 2 minutes 30 seconds to load the test data. Since the test data is the
>> bayes information on 616 spam messages, that comes out to be about 250ms
>> per message. While that is doable, it does add quite a bit of overhead
>> to the email system.
>
> On my laptop this takes:
>
> real    1m33.758s
> user    0m4.285s
> sys     0m1.181s
>
> One interesting effect is the data in bayes_vars has a huge number of
> updates and needs vacuum _frequently_. After the run a vacuum full
> compacts it down from 461 pages to 1 page.
>
> Regards,
> Andrew.

I wonder why your laptop is so much faster. My 2 min 30 sec test was done on a dual xeon with a LSI megaraid with 128MB cache and writeback caching turned on.

Here are my memory settings:

shared_buffers = 16384
work_mem = 32768
maintenance_work_mem = 65536

I tried higher values before I came back to these but it didn't help my performance any. I should also mention that this is a production database server that was servicing other queries when I ran this test.

How often should this table be vacuumed, every 5 minutes?

Also, this test goes a bit faster with sync turned off; if mysql isn't using sync that would be why it's so much faster. Anyone know what the default for mysql is?

Thanks,
schu
Josh Berkus wrote:
> Dennis,
>
>> EXCEPTION WHEN unique_violation THEN
>
> I seem to remember that catching an exception in a PL/pgSQL procedure was a
> large performance cost. It'd be better to do UPDATE ... IF NOT FOUND.

Actually, he was doing an implicit UPDATE IF NOT FOUND in that he was doing:

UPDATE
IF found THEN return;
INSERT
EXCEPT ...

So really, the exception should never be triggered.

John
=:->
Tom, On 7/27/05 11:19 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > Matthew Schumacher <matt.s@aptalaska.net> writes: >> After playing with various indexes and what not I simply am unable to >> make this procedure perform any better. Perhaps someone on the list can >> spot the bottleneck and reveal why this procedure isn't performing that >> well or ways to make it better. > > There's not anything obviously wrong with that procedure --- all of the > updates are on primary keys, so one would expect reasonably efficient > query plans to get chosen. Perhaps it'd be worth the trouble to build > the server with profiling enabled and get a gprof trace to see where the > time is going. Yes - that would be excellent. We've used oprofile recently at Mark Wong's suggestion, which doesn't require rebuilding the source. - Luke
> Also, this test goes a bit faster with sync turned off, if mysql isn't > using sync that would be why it's so much faster. Anyone know what the > default for mysql is? For InnoDB I think it's like Postgres (only slower) ; for MyISAM it's no fsync, no transactions, no crash tolerance of any kind, and it's not a default value (in the sense that you could tweak it) it's just the way it's coded.
On Fri, 2005-07-29 at 09:37 -0800, Matthew Schumacher wrote: > > > > On my laptop this takes: > > > > real 1m33.758s > > user 0m4.285s > > sys 0m1.181s > > > > One interesting effect is the data in bayes_vars has a huge number of > > updates and needs vacuum _frequently_. After the run a vacuum full > > compacts it down from 461 pages to 1 page. > > > > I wonder why your laptop is so much faster. My 2 min 30 sec test was > done on a dual xeon with a LSI megaraid with 128MB cache and writeback > caching turned on. I only do development stuff on my laptop, and all of my databases are reconstructable from copies, etc... so I turn off fsync in this case. > How often should this table be vacuumed, every 5 minutes? I would be tempted to vacuum after each e-mail, in this case. > Also, this test goes a bit faster with sync turned off, if mysql isn't > using sync that would be why it's so much faster. Anyone know what the > default for mysql is? It depends on your table type for MySQL. For the data in question (i.e. bayes scoring) it would seem that not much would be lost if you did have to restore your data from a day old backup, so perhaps fsync=false is OK for this particular application. Regards, Andrew McMillan. ------------------------------------------------------------------------- Andrew @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St DDI: +64(4)803-2201 MOB: +64(272)DEBIAN OFFICE: +64(4)499-2267 What we wish, that we readily believe. -- Demosthenes -------------------------------------------------------------------------
Andrew McMillan wrote:
>
> For the data in question (i.e. bayes scoring) it would seem that not
> much would be lost if you did have to restore your data from a day old
> backup, so perhaps fsync=false is OK for this particular application.
>
> Regards,
> Andrew McMillan.

Restoring from the previous day's backup is plenty acceptable in this application, so is it possible to turn fsync off, but only for this database, or would I need to start another postgres instance?

schu
Andrew McMillan <andrew@catalyst.net.nz> writes:
> On Fri, 2005-07-29 at 09:37 -0800, Matthew Schumacher wrote:
>> How often should this table be vacuumed, every 5 minutes?

> I would be tempted to vacuum after each e-mail, in this case.

Perhaps the bulk of the transient states should be done in a temp table, and only write into a real table when you're done? Dropping a temp table is way quicker than vacuuming it ...

			regards, tom lane
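A sketch of that idea using the benchmark's table layout: collect one message's tokens in a temp table and fold them into the real tables once at the end (ON COMMIT DROP means there is nothing left to vacuum):

BEGIN;

CREATE TEMP TABLE bayes_token_tmp (
    id          integer,
    token       bytea,
    spam_count  integer,
    ham_count   integer,
    atime       integer
) ON COMMIT DROP;

-- load one message's tokens; COPY bayes_token_tmp FROM STDIN would be faster still
INSERT INTO bayes_token_tmp VALUES (1, 'example-token'::bytea, 1, 0, 1122512160);

-- merge bayes_token_tmp into bayes_token / bayes_vars here, then:
COMMIT;   -- the temp table vanishes with the transaction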
Ok, here is where I'm at, I reduced the proc down to this:

CREATE FUNCTION update_token (_id INTEGER,
                              _token BYTEA,
                              _spam_count INTEGER,
                              _ham_count INTEGER,
                              _atime INTEGER)
RETURNS VOID AS
$$
BEGIN
  LOOP
    UPDATE bayes_token
       SET spam_count = spam_count + _spam_count,
           ham_count = ham_count + _ham_count,
           atime = _atime
     WHERE id = _id
       AND token = _token;

    IF found THEN
      RETURN;
    END IF;

    INSERT INTO bayes_token VALUES (_id, _token, _spam_count, _ham_count, _atime);
    IF FOUND THEN
      UPDATE bayes_vars SET token_count = token_count + 1 WHERE id = _id;
      IF NOT FOUND THEN
        RAISE EXCEPTION 'unable to update token_count in bayes_vars';
        return FALSE;
      END IF;
      RETURN;
    END IF;

    RETURN;
  END LOOP;
END;
$$
LANGUAGE plpgsql;

All it's doing is trying the update before the insert to get around the problem of not knowing which is needed. With only 2-3 of the queries implemented I'm already back to running about the same speed as the original SA proc that is going to ship with SA 3.1.0.

All of the queries are using indexes so at this point I'm pretty convinced that the biggest problem is the sheer number of queries required to run this proc 200 times for each email (once for each token).

I don't see anything that could be done to make this much faster on the postgres end, it's looking like the solution is going to involve cutting down the number of queries somehow.

One thing that is still very puzzling to me is why this runs so much slower when I put the data.sql in a transaction. Obviously transactions are acting different when you call a proc a zillion times vs an insert query.

Anyway, if anyone else has any ideas I'm all ears, but at this point it's looking like raw query speed is needed for this app and while I don't care for mysql as a database, it does have the speed going for it.

schu
Matthew Schumacher <matt.s@aptalaska.net> writes:
> One thing that is still very puzzling to me is why this runs so much
> slower when I put the data.sql in a transaction. Obviously transactions
> are acting different when you call a proc a zillion times vs an insert
> query.

I looked into this a bit. It seems that the problem when you wrap the entire insertion series into one transaction is associated with the fact that the test does so many successive updates of the single row in bayes_vars. (VACUUM VERBOSE at the end of the test shows it cleaning up 49383 dead versions of the one row.) This is bad enough when it's in separate transactions, but when it's in one transaction, none of those dead row versions can be marked "fully dead" yet --- so for every update of the row, the unique-key check has to visit every dead version to make sure it's dead in the context of the current transaction. This makes the process O(N^2) in the number of updates per transaction. Which is bad enough if you just want to do one transaction per message, but it's intolerable if you try to wrap the whole bulk-load scenario into one transaction.

I'm not sure that we can do anything to make this a lot smarter, but in any case, the real problem is to not do quite so many updates of bayes_vars.

How constrained are you as to the format of the SQL generated by SpamAssassin? In particular, could you convert the commands generated for a single message into a single statement? I experimented with passing all the tokens for a given message as a single bytea array, as in the attached, and got almost a factor of 4 runtime reduction on your test case.

BTW, it's possible that this is all just a startup-transient problem: once the database has been reasonably well populated, one would expect new tokens to be added infrequently, and so the number of updates to bayes_vars ought to drop off.

			regards, tom lane

Revised insertion procedure:

CREATE or replace FUNCTION put_tokens (_id INTEGER,
                                       _tokens BYTEA[],
                                       _spam_count INTEGER,
                                       _ham_count INTEGER,
                                       _atime INTEGER)
RETURNS VOID AS
$$
declare _token bytea;
        new_tokens integer := 0;
BEGIN
  for i in array_lower(_tokens,1) ..
array_upper(_tokens,1) LOOP _token := _tokens[i]; UPDATE bayes_token SET spam_count = spam_count + _spam_count, ham_count = ham_count + _ham_count, atime = _atime WHERE id = _id AND token = _token; IF not found THEN INSERT INTO bayes_token VALUES (_id, _token, _spam_count, _ham_count, _atime); new_tokens := new_tokens + 1; END IF; END LOOP; if new_tokens > 0 THEN UPDATE bayes_vars SET token_count = token_count + new_tokens WHERE id = _id; IF NOT FOUND THEN RAISE EXCEPTION 'unable to update token_count in bayes_vars'; END IF; END IF; RETURN; END; $$ LANGUAGE plpgsql; Typical input: select put_tokens(1,'{"\\125\\42\\80\\167\\166","\\38\\153\\220\\93\\190","\\68\\7\\112\\52\\224","\\51\\14\\78\\155\\49","\\73\\245\\15\\221\\43","\\96\\179\\108\\197\\121","\\123\\97\\220\\173\\247","\\55\\132\\243\\51\\65","\\238\\36\\129\\75\\181","\\145\\253\\196\\106\\90","\\119\\0\\51\\127\\236","\\229\\35\\181\\222\\3","\\163\\1\\191\\220\\79","\\232\\97\\152\\207\\26","\\111\\146\\81\\182\\250","\\47\\141\\12\\76\\45","\\252\\97\\168\\243\\222","\\24\\157\\202\\45\\24","\\230\\207\\30\\46\\115","\\106\\45\\182\\94\\136","\\45\\66\\245\\41\\103","\\108\\126\\171\\154\\210","\\64\\90\\1\\184\\145","\\242\\78\\150\\104\\213","\\214\\\\134\\7\\179\\150","\\249\\12\\247\\164\\74","\\234\\35\\93\\118\\102","\\5\\152\\152\\219\\188","\\99\\186\\172\\56\\241","\\99\\220\\62\\240\\148","\\106\\12\\199\\33\\177","\\34\\74\\190\\192\\186","\\219\\127\\145\\132\\203","\\240\\113\\128\\160\\46","\\83\\5\\239\\206\\221","\\245\\253\\219\\83\\250","\\1\\53\\126\\56\\129","\\206\\1! 30\\97\\246\\47","\\217\\57\\185\\37\\202","\\235\\10\\74\\224\\150","\\80\\151\\70\\52\\96","\\126\\49\\156\\162\\93","\\243\\120\\218\\226\\49","\\251\\132\\118\\47\\221","\\241\\160\\120\\146\\198","\\183\\32\\161\\223\\178","\\80\\205\\77\\57\\2","\\121\\231\\13\\71\\218","\\71\\143\\184\\88\\185","\\163\\96\\119\\211\\142","\\20\\143\\90\\91\\211","\\179\\228\\212\\15\\22","\\243\\35\\149\\9\\55","\\140\\149\\99\\233\\241","\\164\\246\\101\\147\\107","\\202\\70\\218\\40\\114","\\39\\36\\186\\46\\84","\\58\\116\\44\\237\\2","\\80\\204\\185\\47\\105","\\64\\227\\29\\108\\222","\\173\\115\\56\\91\\52","\\102\\39\\157\\252\\64","\\133\\9\\89\\207\\62","\\27\\2\\230\\227\\201","\\163\\45\\123\\160\\129","\\170\\131\\168\\107\\198","\\236\\253\\0\\43\\228","\\44\\255\\93\\197\\136","\\64\\122\\42\\230\\126","\\207\\222\\104\\27\\239","\\26\\240\\78\\73\\45","\\225\\107\\181\\246\\160","\\231\\72\\243\\36\\159","\\248\\60\\14\\67\\145","\\21\\161\\247\\43\\198","\\81\\243\\19! 
1\\168\\18","\\237\\227\\23\\40\\140","\\60\\90\\96\\168\\201","\\211\ \107\\181\\46\\38","\\178\\129\\212\\16\\254","\\85\\177\\246\\29\\221","\\182\\123\\178\\157\\9","\\154\\159\\180\\116\\89","\\80\\136\\196\\242\\161","\\185\\110\\90\\163\\157","\\163\\191\\229\\13\\42","\\11\\119\\205\\160\\223","\\75\\216\\70\\223\\6","\\130\\48\\154\\145\\51","\\62\\104\\212\\72\\3","\\247\\105\\51\\64\\136","\\17\\96\\45\\40\\77","\\52\\1\\252\\53\\121","\\68\\195\\58\\103\\91","\\135\\131\\100\\4\\0","\\131\\129\\44\\193\\194","\\47\\234\\101\\143\\26","\\206\\\\134\\32\\154\\0","\\17\\41\\177\\34\\178","\\145\\127\\114\\231\\216","\\19\\172\\6\\39\\126","\\237\\233\\121\\43\\119","\\201\\167\\167\\67\\233","\\88\\159\\102\\50\\117","\\100\\133\\107\\190\\133","\\169\\146\\178\\120\\106"}',1,0,1088628232); select put_tokens(1,'{"\\196\\75\\30\\153\\73","\\73\\245\\15\\221\\43","\\14\\7\\116\\254\\162","\\244\\161\\139\\59\\16","\\214\\226\\238\\196\\30","\\209\\14\\131\\231\\30","\\41\\\\134\\176\\195\\166","\\70\\206\\48\\38\\33","\\247\\131\\136\\80\\31","\\4\\85\\5\\167\\214","\\246\\106\\225\\106\\242","\\28\\0\\229\\160\\90","\\127\\209\\58\\120\\83","\\12\\52\\52\\147\\95","\\255\\115\\21\\5\\68","\\244\\152\\121\\76\\20","\\19\\128\\183\\248\\181","\\140\\91\\18\\127\\208","\\93\\9\\62\\196\\247","\\248\\200\\31\\207\\108","\\44\\216\\247\\15\\195","\\59\\189\\9\\237\\142","\\1\\14\\10\\221\\68","\\163\\155\\122\\223\\104","\\97\\5\\105\\55\\137","\\184\\211\\162\\23\\247","\\239\\249\\83\\68\\54","\\67\\207\\180\\186\\234","\\99\\78\\237\\211\\180","\\200\\11\\32\\179\\50","\\95\\105\\18\\60\\253","\\207\\102\\227\\94\\84","\\71\\143\\184\\88\\185","\\13\\181\\75\\24\\192","\\188\\241\\141\\99\\242","\\139\\124\\248\\130\\4","\\25\\110\\149\\63\\114","\\21\\162\\199\\1! 29\\199","\\164\\246\\101\\147\\107","\\198\\202\\223\\58\\197","\\181\\10\\41\\25\\130","\\71\\163\\116\\239\\170","\\46\\170\\238\\142\\89","\\176\\120\\106\\103\\228","\\39\\228\\25\\38\\170","\\114\\79\\121\\18\\222","\\178\\105\\98\\61\\39","\\90\\61\\12\\23\\135","\\176\\118\\81\\65\\66","\\55\\104\\57\\198\\150","\\206\\251\\224\\128\\41","\\29\\158\\68\\146\\164","\\248\\60\\14\\67\\145","\\210\\220\\161\\10\\254","\\72\\81\\151\\213\\68","\\25\\236\\210\\197\\128","\\72\\37\\208\\227\\54","\\242\\24\\6\\88\\26","\\128\\197\\20\\5\\211","\\98\\105\\71\\42\\180","\\91\\43\\72\\84\\104","\\205\\254\\174\\65\\141","\\222\\194\\126\\204\\164","\\233\\153\\37\\148\\226","\\32\\195\\22\\153\\87","\\194\\97\\220\\251\\18","\\151\\201\\148\\52\\147","\\205\\55\\0\\226\\58","\\172\\12\\50\\0\\140","\\56\\32\\43\\9\\45","\\18\\174\\50\\162\\126","\\138\\150\\12\\72\\189","\\49\\230\\150\\210\\48","\\2\\140\\64\\104\\32","\\14\\174\\41\\196\\121","\\100\\195\\116\\130\\101","\! 
\222\\45\\94\\39\\64","\\178\\203\\221\\63\\94","\\26\\188\\157\\\\134 \\52","\\119\\0\\51\\127\\236","\\88\\32\\224\\142\\164","\\111\\146\\81\\182\\250","\\12\\177\\151\\83\\13","\\113\\27\\173\\162\\19","\\158\\216\\41\\236\\226","\\16\\88\\\\134\\180\\112","\\43\\32\\16\\77\\238","\\136\\93\\210\\172\\63","\\251\\214\\30\\40\\146","\\156\\27\\198\\60\\170","\\185\\29\\172\\30\\68","\\202\\83\\59\\228\\252","\\219\\127\\145\\132\\203","\\1\\223\\97\\229\\127","\\113\\83\\123\\167\\140","\\99\\1\\116\\56\\165","\\143\\224\\239\\1\\173","\\49\\186\\156\\51\\92","\\246\\224\\70\\245\\137","\\235\\10\\74\\224\\150","\\43\\88\\245\\14\\103","\\88\\128\\232\\142\\254","\\251\\132\\118\\47\\221","\\36\\7\\142\\234\\98","\\130\\126\\199\\170\\126","\\133\\23\\51\\253\\234","\\249\\89\\242\\87\\86","\\102\\243\\47\\193\\211","\\140\\18\\140\\164\\248","\\179\\228\\212\\15\\22","\\168\\155\\243\\169\\191","\\117\\37\\139\\241\\230","\\155\\11\\254\\171\\200","\\196\\159\\253\\223\\15","\\93\\207\\154\\106\\135","\\11\\255\\28\\123\\125","\\239\\9\\226! \\59\\198","\\191\\204\\230\\61\\39","\\175\\204\\181\\113\\\\134","\\64\\227\\29\\108\\222","\\169\\173\\194\\83\\40","\\212\\93\\170\\169\\12","\\249\\55\\232\\182\\208","\\75\\175\\181\\248\\246","\\108\\95\\114\\215\\138","\\220\\37\\59\\207\\197","\\45\\146\\43\\76\\81","\\166\\231\\20\\9\\189","\\27\\126\\81\\92\\75","\\66\\168\\119\\100\\196","\\229\\9\\196\\165\\250","\\83\\186\\103\\184\\46","\\85\\177\\246\\29\\221","\\140\\159\\53\\211\\157","\\214\\193\\192\\217\\109","\\10\\5\\64\\97\\157","\\92\\137\\120\\70\\55","\\235\\45\\181\\44\\98","\\150\\56\\132\\207\\19","\\67\\95\\161\\39\\122","\\109\\65\\145\\170\\79","\\\\134\\28\\90\\39\\33","\\226\\177\\240\\202\\157","\\1\\57\\50\\6\\240","\\249\\240\\222\\56\\161","\\110\\136\\88\\85\\249","\\82\\27\\239\\51\\211","\\114\\223\\252\\83\\189","\\129\\216\\251\\218\\80","\\247\\36\\101\\90\\229","\\209\\73\\221\\46\\11","\\242\\12\\120\\117\\\\134","\\146\\198\\57\\177\\49","\\212\\57\\9\\240\\216","\\215\\151\\2! 16\\59\\75","\\47\\132\\161\\165\\54","\\113\\4\\77\\241\\150","\\217\ \184\\149\\53\\124","\\152\\111\\25\\231\\104","\\42\\185\\112\\250\\156","\\39\\131\\14\\140\\189","\\148\\169\\158\\251\\150","\\184\\142\\204\\122\\179","\\19\\189\\181\\105\\116","\\116\\77\\22\\135\\50","\\236\\231\\60\\132\\229","\\200\\63\\76\\232\\9","\\32\\20\\168\\87\\45","\\99\\129\\99\\165\\29","\\2\\208\\66\\228\\105","\\99\\194\\194\\229\\17","\\85\\250\\55\\51\\114","\\200\\165\\249\\77\\72","\\5\\91\\178\\157\\24","\\245\\253\\219\\83\\250","\\166\\103\\181\\196\\34","\\227\\149\\148\\105\\157","\\95\\44\\15\\251\\98","\\183\\32\\161\\223\\178","\\120\\236\\145\\158\\78","\\244\\4\\92\\233\\112","\\189\\231\\124\\92\\19","\\112\\132\\8\\49\\157","\\160\\243\\244\\94\\104","\\150\\176\\139\\251\\157","\\176\\193\\155\\175\\144","\\161\\208\\145\\92\\92","\\77\\122\\94\\69\\182","\\77\\13\\131\\29\\27","\\92\\9\\178\\204\\254","\\177\\4\\154\\211\\63","\\62\\4\\242\\1\\78","\\4\\129\\113\\205\\164","\\168\\95\\68\\89\\38","\\173\\115\\56\\91\\52","\\212\\161\\1! 
59\\148\\179","\\133\\9\\89\\207\\62","\\242\\51\\168\\130\\86","\\154\\199\\208\\84\\2","\\160\\215\\250\\104\\22","\\45\\252\\143\\149\\204","\\48\\50\\91\\39\\243","\\94\\168\\48\\202\\122","\\238\\38\\180\\135\\142","\\234\\59\\24\\148\\2","\\237\\227\\23\\40\\140","\\7\\114\\176\\80\\123","\\204\\170\\0\\60\\65","\\217\\202\\249\\158\\182","\\82\\170\\45\\96\\86","\\118\\11\\123\\51\\216","\\192\\130\\153\\88\\59","\\219\\53\\146\\88\\198","\\203\\114\\182\\147\\145","\\158\\140\\239\\104\\247","\\179\\86\\111\\146\\65","\\192\\168\\51\\125\\183","\\8\\251\\77\\143\\231","\\237\\229\\173\\221\\29","\\69\\178\\247\\196\\175","\\114\\33\\237\\189\\119","\\220\\44\\144\\93\\98","\\241\\38\\138\\127\\252","\\66\\218\\237\\199\\157","\\132\\240\\212\\221\\48","\\180\\41\\157\\84\\37","\\203\\180\\58\\113\\136","\\156\\39\\111\\181\\34","\\16\\202\\216\\183\\55","\\154\\51\\122\\201\\45","\\218\\112\\47\\206\\142","\\189\\141\\110\\230\\132","\\80\\167\\61\\103\\247","\\186\! \15\\121\\27\\167","\\103\\163\\217\\19\\220","\\173\\116\\86\\7\\249" ,"\\25\\37\\98\\35\\127","\\44\\92\\200\\89\\84","\\171\\129\\106\\249\\38","\\24\\147\\77\\\\134\\62","\\254\\184\\72\\159\\91","\\221\\13\\18\\153\\154","\\109\\232\\79\\169\\176","\\152\\103\\190\\50\\18","\\51\\71\\217\\22\\76","\\105\\109\\7\\77\\198","\\250\\121\\163\\49\\73","\\138\\204\\\\134\\247\\116","\\130\\38\\156\\36\\27","\\20\\83\\86\\113\\124","\\40\\63\\161\\157\\76","\\205\\99\\150\\109\\249","\\111\\174\\57\\169\\238","\\106\\169\\245\\170\\240","\\32\\10\\53\\160\\76","\\226\\0\\58\\9\\22","\\63\\83\\21\\3\\205","\\212\\141\\249\\177\\102","\\197\\226\\42\\202\\130","\\70\\40\\85\\176\\2","\\3\\16\\133\\118\\91","\\232\\48\\176\\209\\77","\\20\\149\\0\\2\\144","\\50\\87\\138\\108\\149","\\13\\78\\64\\211\\245","\\15\\158\\123\\62\\103","\\239\\68\\210\\175\\197","\\247\\216\\7\\211\\5","\\112\\100\\135\\210\\101","\\47\\26\\118\\254\\62","\\123\\7\\143\\206\\114","\\184\\43\\252\\56\\194","\\55\\16\\219\\\\134\\201","\\170\\128\\224\\160\\251","\\180\\10! 8\\182\\255\\118","\\164\\155\\151\\195\\67","\\116\\56\\163\\249\\92","\\250\\207\\75\\244\\104","\\122\\219\\25\\49\\17","\\16\\61\\66\\50\\32","\\15\\223\\166\\\\134\\188","\\16\\221\\48\\159\\124","\\163\\66\\245\\19\\190","\\52\\177\\137\\57\\104","\\137\\158\\143\\12\\73","\\175\\156\\252\\243\\165","\\18\\119\\\\134\\198\\209","\\179\\60\\37\\63\\136","\\68\\117\\75\\163\\27","\\234\\108\\150\\93\\15","\\209\\159\\154\\221\\138","\\70\\215\\50\\36\\255","\\237\\64\\91\\125\\54","\\17\\41\\177\\34\\178","\\19\\241\\29\\12\\34","\\\\134\\151\\112\\6\\214","\\63\\61\\146\\243\\60"}',1,0,1088649636); For testing purposes, here is the Perl script I used to make the revised input from the original test script: #!/usr/bin/perl -w $lastat = 0; while (<>) { if (m/^select put_token\((\d+),'(.*)',(\d+),(\d+),(\d+)\);$/) { $d1 = $1; $data = $2; $d3 = $3; $d4 = $4; $d5 = $5; $data =~ s/\\/\\\\/g; if ($d5 == $lastat) { $last2 = $last2 . ',"' . $data . '"'; } else { if ($lastat) { print "select put_tokens($last1,'{$last2}',$last3,$last4,$lastat);\n"; } $last1 = $d1; $last2 = '"' . $data . '"'; $last3 = $d3; $last4 = $d4; $lastat = $d5; } } else { print ; } } print "select put_tokens($last1,'{$last2}',$last3,$last4,$lastat);\n"; BTW, the data quoting is probably wrong here, but I suppose that can be fixed easily.
Tom Lane wrote:
> I looked into this a bit. It seems that the problem when you wrap the
> entire insertion series into one transaction is associated with the fact
> that the test does so many successive updates of the single row in
> bayes_vars. (VACUUM VERBOSE at the end of the test shows it cleaning up
> 49383 dead versions of the one row.) This is bad enough when it's in
> separate transactions, but when it's in one transaction, none of those
> dead row versions can be marked "fully dead" yet --- so for every update
> of the row, the unique-key check has to visit every dead version to make
> sure it's dead in the context of the current transaction. This makes
> the process O(N^2) in the number of updates per transaction. Which is
> bad enough if you just want to do one transaction per message, but it's
> intolerable if you try to wrap the whole bulk-load scenario into one
> transaction.
>
> I'm not sure that we can do anything to make this a lot smarter, but
> in any case, the real problem is to not do quite so many updates of
> bayes_vars.
>
> How constrained are you as to the format of the SQL generated by
> SpamAssassin? In particular, could you convert the commands generated
> for a single message into a single statement? I experimented with
> passing all the tokens for a given message as a single bytea array,
> as in the attached, and got almost a factor of 4 runtime reduction
> on your test case.
>
> BTW, it's possible that this is all just a startup-transient problem:
> once the database has been reasonably well populated, one would expect
> new tokens to be added infrequently, and so the number of updates to
> bayes_vars ought to drop off.
>
> 			regards, tom lane

SpamAssassin's bayes code calls the _put_token method in the storage module in a loop. This means that the storage module isn't called once per message, but once per token.

I'll look into modifying it so that the bayes code passes a hash of tokens to the storage module, where they can loop, or in the case of the pgsql module pass an array of tokens to a procedure where we loop and use temp tables to make this much more efficient.

I don't have much time this weekend to toss at this, but will be looking at it on Monday.

Thanks,

schu
Matthew Schumacher wrote:
> All it's doing is trying the update before the insert to get around the
> problem of not knowing which is needed. With only 2-3 of the queries
> implemented I'm already back to running about the same speed as the
> original SA proc that is going to ship with SA 3.1.0.
>
> All of the queries are using indexes so at this point I'm pretty
> convinced that the biggest problem is the sheer number of queries
> required to run this proc 200 times for each email (once for each token).
>
> I don't see anything that could be done to make this much faster on the
> postgres end, it's looking like the solution is going to involve cutting
> down the number of queries somehow.
>
> One thing that is still very puzzling to me is why this runs so much
> slower when I put the data.sql in a transaction. Obviously transactions
> are acting different when you call a proc a zillion times vs an insert
> query.

Well, I played with adding a COMMIT;BEGIN; statement to your exact test every 1000 lines. And this is what I got:

Unmodified:
real    17m53.587s
user    0m6.204s
sys     0m3.556s

With BEGIN/COMMIT:
real    1m53.466s
user    0m5.203s
sys     0m3.211s

So I see the potential for improvement almost 10 fold by switching to transactions. I played with the perl script (and re-implemented it in python), and for the same data as the perl script, using COPY instead of INSERT INTO means 5s instead of 33s.

I also played around with adding VACUUM ANALYZE every 10 COMMITS, which brings the speed to:

real    1m41.258s
user    0m5.394s
sys     0m3.212s

And doing VACUUM ANALYZE every 5 COMMITS makes it:

real    1m46.403s
user    0m5.597s
sys     0m3.244s

I'm assuming the slowdown is because of the extra time spent vacuuming. Overall performance might still be improving, since you wouldn't actually be inserting all 100k rows at once.

Just to complete the reference, the perl version runs as:

10:44:02 -- START
10:44:35 -- AFTER TEMP LOAD : loaded 120596 records
10:44:39 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
10:44:41 -- AFTER bayes_vars UPDATE : updated 1 records
10:46:42 -- AFTER bayes_token UPDATE : updated 47537 records
DONE

My python version runs as:

00:22:54 -- START
00:23:00 -- AFTER TEMP LOAD : loaded 120596 records
00:23:03 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
00:23:06 -- AFTER bayes_vars UPDATE : updated 1 records
00:25:04 -- AFTER bayes_token UPDATE : updated 47537 records
DONE

The python is effectively just a port of the perl code (with many lines virtually unchanged), and really the only performance difference is that the initial data load is much faster with a COPY.

This is all run on Ubuntu, with postgres 7.4.7, and a completely unchanged postgresql.conf. (But the machine is a dual P4 2.4GHz, with 3GB of RAM).

John
=:->

> Anyway, if anyone else has any ideas I'm all ears, but at this point
> it's looking like raw query speed is needed for this app and while I
> don't care for mysql as a database, it does have the speed going for it.
>
> schu
Matthew Schumacher wrote:
> Tom Lane wrote:
>
>> I looked into this a bit. It seems that the problem when you wrap the
>> entire insertion series into one transaction is associated with the fact
>> that the test does so many successive updates of the single row in
>> bayes_vars. (VACUUM VERBOSE at the end of the test shows it cleaning up
>> 49383 dead versions of the one row.) This is bad enough when it's in
>> separate transactions, but when it's in one transaction, none of those
>> dead row versions can be marked "fully dead" yet --- so for every update
>> of the row, the unique-key check has to visit every dead version to make
>> sure it's dead in the context of the current transaction. This makes
>> the process O(N^2) in the number of updates per transaction. Which is
>> bad enough if you just want to do one transaction per message, but it's
>> intolerable if you try to wrap the whole bulk-load scenario into one
>> transaction.
>>
>> I'm not sure that we can do anything to make this a lot smarter, but
>> in any case, the real problem is to not do quite so many updates of
>> bayes_vars.
>>
>> How constrained are you as to the format of the SQL generated by
>> SpamAssassin? In particular, could you convert the commands generated
>> for a single message into a single statement? I experimented with
>> passing all the tokens for a given message as a single bytea array,
>> as in the attached, and got almost a factor of 4 runtime reduction
>> on your test case.
>>
>> BTW, it's possible that this is all just a startup-transient problem:
>> once the database has been reasonably well populated, one would expect
>> new tokens to be added infrequently, and so the number of updates to
>> bayes_vars ought to drop off.
>>
>> 			regards, tom lane
>
> SpamAssassin's bayes code calls the _put_token method in the storage
> module in a loop. This means that the storage module isn't called once per
> message, but once per token.

Well, putting everything into a transaction per email might make your pain go away. If you saw the email I just sent, I modified your data.sql file to add a "COMMIT;BEGIN" every 1000 selects, and I saw a performance jump from 18 minutes down to less than 2 minutes. Heck, on my machine, the advanced perl version takes more than 2 minutes to run. It is actually slower than the data.sql with commit statements.

> I'll look into modifying it so that the bayes code passes a hash of
> tokens to the storage module, where they can loop, or in the case of the
> pgsql module pass an array of tokens to a procedure where we loop and
> use temp tables to make this much more efficient.

Well, you could do that. Or you could just have the bayes code issue "BEGIN;" when it starts processing an email, and a "COMMIT;" when it finishes. From my testing, you will see an enormous speed improvement. (And you might consider including a fairly frequent VACUUM ANALYZE)

> I don't have much time this weekend to toss at this, but will be looking
> at it on Monday.

Good luck,
John
=:->

> Thanks,
>
> schu
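In terms of the SQL stream the storage module sends, John's suggestion just means bracketing each message's token calls, plus an occasional vacuum; a sketch (argument values are placeholders):

BEGIN;
SELECT put_token(1, 'token-1'::bytea, 1, 0, 1122512160);
SELECT put_token(1, 'token-2'::bytea, 1, 0, 1122512160);
-- ... roughly 200 such calls per message ...
COMMIT;

-- every N messages, or from cron:
VACUUM ANALYZE bayes_vars;
VACUUM ANALYZE bayes_token;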
John Arbash Meinel wrote:
> Matthew Schumacher wrote:
>
>> All it's doing is trying the update before the insert to get around the
>> problem of not knowing which is needed. With only 2-3 of the queries
>> implemented I'm already back to running about the same speed as the
>> original SA proc that is going to ship with SA 3.1.0.
>>
>> All of the queries are using indexes so at this point I'm pretty
>> convinced that the biggest problem is the sheer number of queries
>> required to run this proc 200 times for each email (once for each token).
>>
>> I don't see anything that could be done to make this much faster on the
>> postgres end, it's looking like the solution is going to involve cutting
>> down the number of queries somehow.
>>
>> One thing that is still very puzzling to me is why this runs so much
>> slower when I put the data.sql in a transaction. Obviously transactions
>> are acting different when you call a proc a zillion times vs an insert
>> query.
>
> Well, I played with adding a COMMIT;BEGIN; statement to your exact test
> every 1000 lines. And this is what I got:

Just for reference, I also tested this on my old server, which is a dual Celeron 450 with 256M ram. FC4 and Postgres 8.0.3

Unmodified:
real    54m15.557s
user    0m24.328s
sys     0m14.200s

With Transactions every 1000 selects, and vacuum every 5000:
real    8m36.528s
user    0m16.585s
sys     0m12.569s

With Transactions every 1000 selects, and vacuum every 10000:
real    7m50.748s
user    0m16.183s
sys     0m12.489s

On this machine vacuum is more expensive, since it doesn't have as much ram.

Anyway, on this machine, I see approx 7x improvement. Which I think is probably going to satisfy your spamassassin needs.

John
=:->

PS> Looking forward to having a spamassassin that can utilize my favorite db. Right now, I'm not using a db backend because it wasn't worth setting up mysql.

> Unmodified:
> real    17m53.587s
> user    0m6.204s
> sys     0m3.556s
>
> With BEGIN/COMMIT:
> real    1m53.466s
> user    0m5.203s
> sys     0m3.211s
>
> So I see the potential for improvement almost 10 fold by switching to
> transactions. I played with the perl script (and re-implemented it in
> python), and for the same data as the perl script, using COPY instead of
> INSERT INTO means 5s instead of 33s.
>
> I also played around with adding VACUUM ANALYZE every 10 COMMITS, which
> brings the speed to:
>
> real    1m41.258s
> user    0m5.394s
> sys     0m3.212s
>
> And doing VACUUM ANALYZE every 5 COMMITS makes it:
> real    1m46.403s
> user    0m5.597s
> sys     0m3.244s
>
> I'm assuming the slowdown is because of the extra time spent vacuuming.
> Overall performance might still be improving, since you wouldn't
> actually be inserting all 100k rows at once.
>
> ...
>
> This is all run on Ubuntu, with postgres 7.4.7, and a completely
> unchanged postgresql.conf. (But the machine is a dual P4 2.4GHz, with
> 3GB of RAM).
>
> John
> =:->
Ok, here is the current plan.

Change the spamassassin API to pass a hash of tokens into the storage module, pass the tokens to the proc as an array, start a transaction, load the tokens into a temp table using copy, select the tokens distinct into the token table for new tokens, update the token table for known tokens, then commit.

This solves the following problems:

1. Each email is a transaction instead of each token.

2. The update statement is only called when we really need an update which avoids all of those searches.

3. The looping work is done inside the proc instead of perl calling a method a zillion times per email.

I'm not sure how vacuuming will be done yet, if we vacuum once per email that may be too often, so I may do that every 5 mins in cron.

schu
On Sun, Jul 31, 2005 at 08:51:06AM -0800, Matthew Schumacher wrote:
> Ok, here is the current plan.
>
> Change the spamassassin API to pass a hash of tokens into the storage
> module, pass the tokens to the proc as an array, start a transaction,
> load the tokens into a temp table using copy, select the tokens distinct
> into the token table for new tokens, update the token table for known
> tokens, then commit.

You might consider:

UPDATE tokens
  FROM temp_table  (this updates existing records)

INSERT INTO tokens
  SELECT ...
    FROM temp_table
   WHERE NOT IN (SELECT ... FROM tokens)

This way you don't do an update to newly inserted tokens, which helps keep vacuuming needs in check.

> This solves the following problems:
>
> 1. Each email is a transaction instead of each token.
>
> 2. The update statement is only called when we really need an update
> which avoids all of those searches.
>
> 3. The looping work is done inside the proc instead of perl calling a
> method a zillion times per email.
>
> I'm not sure how vacuuming will be done yet, if we vacuum once per email
> that may be too often, so I may do that every 5 mins in cron.

I would suggest leaving an option to have SA vacuum every n emails, since some people may not want to mess with cron, etc. I suspect that pg_autovacuum would be able to keep up with things pretty well, though.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Jim C. Nasby wrote:
> On Sun, Jul 31, 2005 at 08:51:06AM -0800, Matthew Schumacher wrote:
>
>> Ok, here is the current plan.
>>
>> Change the spamassassin API to pass a hash of tokens into the storage
>> module, pass the tokens to the proc as an array, start a transaction,
>> load the tokens into a temp table using copy, select the tokens distinct
>> into the token table for new tokens, update the token table for known
>> tokens, then commit.
>
> You might consider:
>
> UPDATE tokens
>   FROM temp_table  (this updates existing records)
>
> INSERT INTO tokens
>   SELECT ...
>     FROM temp_table
>    WHERE NOT IN (SELECT ... FROM tokens)
>
> This way you don't do an update to newly inserted tokens, which helps
> keep vacuuming needs in check.

The subselect might be quite a big set, so avoiding a full table scan and materialization by

DELETE FROM temp_table
 WHERE key IN (SELECT key FROM tokens JOIN temp_table USING (key));

INSERT INTO tokens SELECT * FROM temp_table;

or

INSERT INTO tokens
SELECT temp_table.*
  FROM temp_table
  LEFT JOIN tokens USING (key)
 WHERE tokens.key IS NULL;

might be an additional win, assuming that only a small fraction of tokens is inserted and updated.

Regards,
Andreas
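Putting Jim's and Andreas's suggestions together against the actual benchmark schema, the per-message merge could look roughly like this (a sketch only; it assumes a temp table bayes_token_tmp carrying id, token, spam_count, ham_count and atime for the current message, and spells out the composite join key):

-- fold counts into tokens that already exist
UPDATE bayes_token
   SET spam_count = bayes_token.spam_count + t.spam_count,
       ham_count  = bayes_token.ham_count  + t.ham_count,
       atime      = t.atime
  FROM bayes_token_tmp t
 WHERE bayes_token.id = t.id
   AND bayes_token.token = t.token;

-- insert only the genuinely new tokens, so freshly inserted rows never get updated
INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
SELECT t.id, t.token, t.spam_count, t.ham_count, t.atime
  FROM bayes_token_tmp t
  LEFT JOIN bayes_token b ON (b.id = t.id AND b.token = t.token)
 WHERE b.id IS NULL;

The bayes_vars bookkeeping (token_count, oldest/newest token age) would be driven from the same counts, once per message rather than once per token.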
Hi All,

As a SpamAssassin developer, who by my own admission has real problems getting PostgreSQL to work well, I must thank everyone for their feedback on this issue. Believe me when I say what is in the tree now is a far cry from what used to be there, orders of magnitude faster for sure.

I think there are several good ideas that have come out of this thread and I've set about attempting to implement them. Here is a version of the stored procedure, based in large part on the one written by Tom Lane, that accepts an array of tokens and loops over them to either update or insert them into the database (I'm not including the greatest_int/least_int procedures but you've all seen them before):

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
                                      intokenary BYTEA[],
                                      inspam_count INTEGER,
                                      inham_count INTEGER,
                                      inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
  new_tokens INTEGER := 0;
BEGIN
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
    _token := intokenary[i];

    UPDATE bayes_token
       SET spam_count = greatest_int(spam_count + inspam_count, 0),
           ham_count  = greatest_int(ham_count + inham_count, 0),
           atime      = greatest_int(atime, inatime)
     WHERE id = inuserid
       AND token = _token;

    IF NOT FOUND THEN
      -- we do not insert negative counts, just return true
      IF NOT (inspam_count < 0 OR inham_count < 0) THEN
        INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
        VALUES (inuserid, _token, inspam_count, inham_count, inatime);
        IF FOUND THEN
          new_tokens := new_tokens + 1;
        END IF;
      END IF;
    END IF;
  END LOOP;

  UPDATE bayes_vars
     SET token_count = token_count + new_tokens,
         newest_token_age = greatest_int(newest_token_age, inatime),
         oldest_token_age = least_int(oldest_token_age, inatime)
   WHERE id = inuserid;

  RETURN;
END;
' LANGUAGE 'plpgsql';

This version is about 32x faster than the old version, with the default fsync value and autovacuum running in the background.

The next hurdle, and I've just posted to the DBD::Pg list, is escaping/quoting the token strings. They are true binary strings, substrings of SHA1 hashes; I don't think the original data set provided puts them in the right context. They have proved to be tricky.

I'm unable to call the stored procedure from perl because I keep getting a malformed array literal error. Here is some example code that shows the issue:

#!/usr/bin/perl -w

# from a new db, do this first
# INSERT INTO bayes_vars VALUES (1,'nobody',0,0,0,0,0,0,2147483647,0);

use strict;
use DBI;
use DBD::Pg qw(:pg_types);
use Digest::SHA1 qw(sha1);

my $dbh = DBI->connect("DBI:Pg:dbname=spamassassin","postgres") || die;

my @dataary;

# Input is just a list of words (ie /usr/share/dict/words) stop after 150
while(<>) {
  chomp;
  push(@dataary, substr(sha1($_), -5));
  # to see it work with normal string comment out above and uncomment below
  # push(@dataary, $_);
  last if scalar(@dataary) >= 150;
}

my $datastring = join(",", map { '"' . bytea_esc($_) . '"' } @dataary);
my $sql = "select put_tokens(1, '{$datastring}', 1, 1, 10000)";
my $sth = $dbh->prepare($sql);
my $rc = $sth->execute();
unless ($rc) {
  print "Error: " . $dbh->errstr() . "\n";
}
$sth->finish();

sub bytea_esc {
  my ($str) = @_;
  my $buf = "";
  foreach my $char (split(//,$str)) {
    if    (ord($char) == 0)  { $buf .= "\\\\000"; }
    elsif (ord($char) == 39) { $buf .= "\\\\047"; }
    elsif (ord($char) == 92) { $buf .= "\\\\134"; }
    else                     { $buf .= $char; }
  }
  return $buf;
}

Any ideas? or thoughts on the revised procedure? I'd greatly appreciate them.

Sorry for the length, but hopefully it gives a good enough example.
Thanks Michael Parker
Michael Parker <parkerm@pobox.com> writes: > The next hurdle, and I've just posted to the DBD::Pg list, is > escaping/quoting the token strings. If you're trying to write a bytea[] literal, I think the most reliable way to write the individual bytes is \\\\nnn where nnn is *octal*. The idea here is: * string literal parser takes off one level of backslashing, leaving \\nnn * array input parser takes off another level, leaving \nnn * bytea input parser knows about backslashed octal values Note it has to be 3 octal digits every time, no abbreviations. regards, tom lane
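A quick way to see the three levels Tom describes, typed straight into psql (a sketch; it assumes an 8.0-era server where backslashes in string literals are escape characters and bytea output is in the escape format):

SELECT ('{"\\\\101\\\\102\\\\103"}'::bytea[])[1];
-- the string-literal parser leaves \\101\\102\\103, the array parser
-- leaves \101\102\103, and the bytea input routine folds each \nnn
-- octal escape into one byte, so the first element comes back as 'ABC'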
Michael Parker <parkerm@pobox.com> writes: > sub bytea_esc { > my ($str) = @_; > my $buf = ""; > foreach my $char (split(//,$str)) { > if (ord($char) == 0) { $buf .= "\\\\000"; } > elsif (ord($char) == 39) { $buf .= "\\\\047"; } > elsif (ord($char) == 92) { $buf .= "\\\\134"; } > else { $buf .= $char; } > } > return $buf; > } Oh, I see the problem: you forgot to convert " to a backslash sequence. It would probably also be wise to convert anything >= 128 to a backslash sequence, so as to avoid any possible problems with multibyte character encodings. You wouldn't see this issue in a SQL_ASCII database, but I suspect it would rise up to bite you with other encoding settings. regards, tom lane
Tom Lane wrote: > Michael Parker <parkerm@pobox.com> writes: > >>sub bytea_esc { >> my ($str) = @_; >> my $buf = ""; >> foreach my $char (split(//,$str)) { >> if (ord($char) == 0) { $buf .= "\\\\000"; } >> elsif (ord($char) == 39) { $buf .= "\\\\047"; } >> elsif (ord($char) == 92) { $buf .= "\\\\134"; } >> else { $buf .= $char; } >> } >> return $buf; >>} > > > Oh, I see the problem: you forgot to convert " to a backslash sequence. > > It would probably also be wise to convert anything >= 128 to a backslash > sequence, so as to avoid any possible problems with multibyte character > encodings. You wouldn't see this issue in a SQL_ASCII database, but I > suspect it would rise up to bite you with other encoding settings. > > regards, tom lane Here is some code that applies Tom's suggestions: 38c39,41 < if (ord($char) == 0) { $buf .= "\\\\000"; } --- > if (ord($char) >= 128) { $buf .= "\\\\" . sprintf ("%lo", ord($char)); } > elsif (ord($char) == 0) { $buf .= "\\\\000"; } > elsif (ord($char) == 34) { $buf .= "\\\\042"; } But this raises the question: why not escape everything? schu
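One answer to that question, sketched here, is to escape every byte as a three-digit octal sequence and sidestep the special cases entirely. The sub name is made up, and note that per Tom's earlier \\\\nnn recipe four literal backslashes have to reach the SQL text for each byte, not the two that "\\\\" produces in Perl, because the array parser strips a level as well as the string-literal parser:

sub bytea_esc_oct {
    my ($str) = @_;
    my $buf = "";
    foreach my $char (split(//, $str)) {
        # "\\\\\\\\" in Perl source is four real backslashes; the
        # string-literal parser, the array parser and the bytea input
        # routine each strip one level, leaving a single \nnn escape
        $buf .= sprintf("\\\\\\\\%03o", ord($char));
    }
    return $buf;
}

It expands every token to four bytes of SQL text per input byte, which is heavier than selective escaping, but there are no quote, backslash, NUL, or high-bit edge cases left to worry about.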
Tom Lane wrote: > > Revised insertion procedure: > > > CREATE or replace FUNCTION put_tokens (_id INTEGER, > _tokens BYTEA[], > _spam_count INTEGER, > _ham_count INTEGER, > _atime INTEGER) > RETURNS VOID AS > $$ > declare _token bytea; > new_tokens integer := 0; > BEGIN > for i in array_lower(_tokens,1) .. array_upper(_tokens,1) > LOOP > _token := _tokens[i]; > UPDATE bayes_token > SET spam_count = spam_count + _spam_count, > ham_count = ham_count + _ham_count, > atime = _atime > WHERE id = _id > AND token = _token; > > IF not found THEN > INSERT INTO bayes_token VALUES (_id, _token, _spam_count, > _ham_count, _atime); > new_tokens := new_tokens + 1; > END IF; > END LOOP; > if new_tokens > 0 THEN > UPDATE bayes_vars SET token_count = token_count + new_tokens > WHERE id = _id; > IF NOT FOUND THEN > RAISE EXCEPTION 'unable to update token_count in bayes_vars'; > END IF; > END IF; > RETURN; > END; > $$ > LANGUAGE plpgsql; > Tom, thanks for all your help on this, I think we are fairly close to having this done in a proc. The biggest problem we are running into now is that the data gets inserted as an int. Even though your proc defines _token as bytea, I get numbers in the table. For example: select put_tokens2(1, '{"\\246\\323\\061\\332\\277"}', 1, 1, 10000); produces this: id | token | spam_count | ham_count | atime ----+-----------------+------------+-----------+------- 1 | 246323061332277 | 1 | 1 | 10000 I'm not sure why this is happening, perhaps the problem is obvious to you? Thanks, schu
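(For what it's worth, this is the behaviour Tom's earlier note predicts: with only two backslashes, the array parser eats the last one and the bytea input routine just sees the digit characters. Doubling them once more, so four backslashes per byte reach the SQL text, should make the same call store the intended 5-byte token rather than the digit string:

select put_tokens2(1, '{"\\\\246\\\\323\\\\061\\\\332\\\\277"}', 1, 1, 10000);

This is only a sketch of the quoting levels; the real difficulty, discussed below, is producing that form from Perl.)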
> select put_tokens2(1, '{"\\246\\323\\061\\332\\277"}', 1, 1, 10000); Try adding more backslashes until it works (seems that you need \\\\ or something). Don't DBI convert the language types to postgres quoted forms on its own ?
PFC wrote: > > >> select put_tokens2(1, '{"\\246\\323\\061\\332\\277"}', 1, 1, 10000); > > > Try adding more backslashes until it works (seems that you need \\\\ > or something). > Don't DBI convert the language types to postgres quoted forms on its > own ? > You're right.... I am finding that the proc is not the problem as I suspected; it works correctly when I am not calling it from perl, so something isn't escaped correctly.... schu
On Mon, Aug 01, 2005 at 01:28:24PM -0800, Matthew Schumacher wrote: > PFC wrote: > > > > > >> select put_tokens2(1, '{"\\246\\323\\061\\332\\277"}', 1, 1, 10000); > > > > > > Try adding more backslashes until it works (seems that you need \\\\ > > or something). > > Don't DBI convert the language types to postgres quoted forms on its > > own ? > > > > Your right.... I am finding that the proc is not the problem as I > suspected, it works correctly when I am not calling it from perl, > something isn't escaped correctly.... I'm not sure who's responsible for DBI::Pg (Josh?), but would it make sense to add better support for bytea to DBI::Pg? ISTM there should be a better way of doing this than adding gobs of \'s. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Jim C. Nasby wrote: >I'm not sure who's responsible for DBI::Pg (Josh?), but would it make >sense to add better support for bytea to DBI::Pg? ISTM there should be a >better way of doing this than adding gobs of \'s. > > It has support for binding a bytea parameter, but in this case we're trying to build up an array and pass that into a stored procedure. The $dbh->quote() method for DBD::Pg lacks the ability to quote bytea types. There is actually a TODO note in the code about adding support for quoting Pg specific types. Presumably the difficulties we are having with this would be solved by that, once it has been implemented. In the meantime, I believe it's just a matter of getting the right escapes to happen so that the procedure is inserting values that we can later retrieve via a select, using bind_param() with the PG_BYTEA type. Michael
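For the scalar case Michael mentions, the DBD::Pg type hint looks roughly like this (a sketch only; it handles a single bytea parameter and does not solve the bytea[] array-literal problem being discussed):

#!/usr/bin/perl -w
# Sketch: bind one bytea value with DBD::Pg's pg_type attribute.
use strict;
use DBI;
use DBD::Pg qw(:pg_types);
use Digest::SHA1 qw(sha1);

my $dbh = DBI->connect("DBI:Pg:dbname=spamassassin", "postgres") || die;
my $token = substr(sha1("example"), -5);   # same 5-byte tokens as before

my $sth = $dbh->prepare(
    "SELECT spam_count, ham_count FROM bayes_token WHERE id = ? AND token = ?");
$sth->bind_param(1, 1);
$sth->bind_param(2, $token, { pg_type => PG_BYTEA });
$sth->execute() or print "Error: " . $dbh->errstr() . "\n";
$sth->finish();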
Okay, Here is the status of the SA updates and a question: Michael got SA changed to pass an array of tokens to the proc so right there we gained a ton of performance due to connections and transactions being grouped into one per email instead of one per token. Now I am working on making the proc even faster. Since we have all of the tokens coming in as an array, it should be possible to get this down to just a couple of queries. I have the proc using IN and NOT IN statements to update everything at once from a temp table, but it progressively gets slower because the temp table is growing between vacuums. At this point it's slightly slower than the old update or else insert on every token. What I really want to do is have the token array available as a record so that I can query against it, but not have it take up the resources of a real table. If I could copy from an array into a record then I can even get rid of the loop. Anyone have any thoughts on how to do this? CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER, intokenary BYTEA[], inspam_count INTEGER, inham_count INTEGER, inatime INTEGER) RETURNS VOID AS ' DECLARE _token BYTEA; BEGIN for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) LOOP _token := intokenary[i]; INSERT INTO bayes_token_tmp VALUES (_token); END LOOP; UPDATE bayes_token SET spam_count = greatest_int(spam_count + inspam_count, 0), ham_count = greatest_int(ham_count + inham_count , 0), atime = greatest_int(atime, 1000) WHERE id = inuserid AND (token) IN (SELECT intoken FROM bayes_token_tmp); UPDATE bayes_vars SET token_count = token_count + (SELECT count(intoken) FROM bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)), newest_token_age = greatest_int(newest_token_age, inatime), oldest_token_age = least_int(oldest_token_age, inatime) WHERE id = inuserid; INSERT INTO bayes_token SELECT inuserid, intoken, inspam_count, inham_count, inatime FROM bayes_token_tmp WHERE (inspam_count > 0 OR inham_count > 0) AND (intoken) NOT IN (SELECT token FROM bayes_token); delete from bayes_token_tmp; RETURN; END; ' LANGUAGE 'plpgsql'; CREATE OR REPLACE FUNCTION greatest_int (integer, integer) RETURNS INTEGER IMMUTABLE STRICT AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;' LANGUAGE SQL; CREATE OR REPLACE FUNCTION least_int (integer, integer) RETURNS INTEGER IMMUTABLE STRICT AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;' LANGUAGE SQL;
> What I really want to do is have the token array available as a record > so that I can query against it, but not have it take up the resources of > a real table. If I could copy from an array into a record then I can > even get rid of the loop. Anyone have any thoughts on how to do this? You could make a set-returning-function (about 3 lines) which RETURNs NEXT every element in the array ; then you can use this SRF just like a table and SELECT from it, join it with your other tables, etc.
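Something like the following is probably what PFC has in mind (a sketch with a made-up name; schu posts essentially this function a few messages further down):

CREATE OR REPLACE FUNCTION array_to_set(intokenary BYTEA[]) RETURNS SETOF bytea AS '
BEGIN
  FOR i IN array_lower(intokenary, 1) .. array_upper(intokenary, 1) LOOP
    RETURN NEXT intokenary[i];
  END LOOP;
  RETURN;
END;
' LANGUAGE 'plpgsql';

-- which can then be used like a table, e.g.
-- SELECT * FROM array_to_set('{"\\\\101","\\\\102"}'::bytea[]);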
Matthew Schumacher wrote: > Okay, > > Here is the status of the SA updates and a question: > > Michael got SA changed to pass an array of tokens to the proc so right > there we gained a ton of performance due to connections and transactions > being grouped into one per email instead of one per token. > > Now I am working on making the proc even faster. Since we have all of > the tokens coming in as an array, it should be possible to get this down > to just a couple of queries. > > I have the proc using IN and NOT IN statements to update everything at > once from a temp table, but it progressively gets slower because the > temp table is growing between vacuums. At this point it's slightly > slower than the old update or else insert on every token. I recommend that you drop and re-create the temp table. There is no reason to have it around, considering you delete and re-add everything. That means you never have to vacuum it, since it always only contains the latest rows. > > What I really want to do is have the token array available as a record > so that I can query against it, but not have it take up the resources of > a real table. If I could copy from an array into a record then I can > even get rid of the loop. Anyone have any thoughts on how to do this? > My one question here, is the inspam_count and inham_count *always* the same for all tokens? I would have thought each token has its own count. Anyway, there are a few lines I would change: > > CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER, > intokenary BYTEA[], > inspam_count INTEGER, > inham_count INTEGER, > inatime INTEGER) > RETURNS VOID AS ' > DECLARE > _token BYTEA; > BEGIN > -- create the table at the start of the procedure CREATE TEMP TABLE bayes_token_tmp (intoken bytea); -- You might also add primary key if you are going to be adding -- *lots* of entries, but it sounds like you are going to have -- less than 1 page, so it doesn't matter > for i in array_lower(intokenary, 1) .. 
array_upper(intokenary, 1) > LOOP > _token := intokenary[i]; > INSERT INTO bayes_token_tmp VALUES (_token); > END LOOP; > > UPDATE > bayes_token > SET > spam_count = greatest_int(spam_count + inspam_count, 0), > ham_count = greatest_int(ham_count + inham_count , 0), > atime = greatest_int(atime, 1000) > WHERE > id = inuserid > AND -- (token) IN (SELECT intoken FROM bayes_token_tmp); EXISTS (SELECT token FROM bayes_token_tmp WHERE intoken=token LIMIT 1); -- I would also avoid your intoken (NOT) IN (SELECT token FROM -- bayes_token) There are a few possibilities, but to me -- as your bayes_token table becomes big, this will start -- to be the slow point -- Rather than doing 2 NOT IN queries, it *might* be faster to do DELETE FROM bayes_token_tmp WHERE EXISTS (SELECT token FROM bayes_token WHERE token=intoken); > > UPDATE > bayes_vars > SET -- token_count = token_count + (SELECT count(intoken) FROM -- bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)), token_count = token_count + (SELECT count(intoken) FROM bayes_token_tmp) -- You don't need the where NOT IN, since we already removed those rows > newest_token_age = greatest_int(newest_token_age, inatime), > oldest_token_age = least_int(oldest_token_age, inatime) > WHERE > id = inuserid; > > INSERT INTO > bayes_token > SELECT > inuserid, > intoken, > inspam_count, > inham_count, > inatime > FROM > bayes_token_tmp > WHERE > (inspam_count > 0 OR inham_count > 0) -- AND -- (intoken) NOT IN (SELECT token FROM bayes_token); -- You don't need either of those lines, again because we already -- filtered -- delete from bayes_token_tmp; -- And rather than deleting all of the entries just DROP TABLE bayes_token_tmp; > > RETURN; > END; > ' LANGUAGE 'plpgsql'; > > CREATE OR REPLACE FUNCTION greatest_int (integer, integer) > RETURNS INTEGER > IMMUTABLE STRICT > AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;' > LANGUAGE SQL; > > CREATE OR REPLACE FUNCTION least_int (integer, integer) > RETURNS INTEGER > IMMUTABLE STRICT > AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;' > LANGUAGE SQL; > So to clarify, here is my finished function: ------------------------------------ CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER, intokenary BYTEA[], inspam_count INTEGER, inham_count INTEGER, inatime INTEGER) RETURNS VOID AS ' DECLARE _token BYTEA; BEGIN CREATE TEMP TABLE bayes_token_tmp (intoken bytea); for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) LOOP _token := intokenary[i]; INSERT INTO bayes_token_tmp VALUES (_token); END LOOP; UPDATE bayes_token SET spam_count = greatest_int(spam_count + inspam_count, 0), ham_count = greatest_int(ham_count + inham_count , 0), atime = greatest_int(atime, 1000) WHERE id = inuserid AND EXISTS (SELECT token FROM bayes_token_tmp WHERE intoken=token LIMIT 1); DELETE FROM bayes_token_tmp WHERE EXISTS (SELECT token FROM bayes_token WHERE token=intoken); UPDATE bayes_vars SET token_count = token_count + (SELECT count(intoken) FROM bayes_token_tmp), newest_token_age = greatest_int(newest_token_age, inatime), oldest_token_age = least_int(oldest_token_age, inatime) WHERE id = inuserid; INSERT INTO bayes_token SELECT inuserid, intoken, inspam_count, inham_count, inatime FROM bayes_token_tmp WHERE (inspam_count > 0 OR inham_count > 0); DROP TABLE bayes_token_tmp; RETURN; END; ' LANGUAGE 'plpgsql';
Matthew Schumacher <matt.s@aptalaska.net> writes: > for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) > LOOP > _token := intokenary[i]; > INSERT INTO bayes_token_tmp VALUES (_token); > END LOOP; > UPDATE > bayes_token > SET > spam_count = greatest_int(spam_count + inspam_count, 0), > ham_count = greatest_int(ham_count + inham_count , 0), > atime = greatest_int(atime, 1000) > WHERE > id = inuserid > AND > (token) IN (SELECT intoken FROM bayes_token_tmp); I don't really see why you think that this path is going to lead to better performance than where you were before. Manipulation of the temp table is never going to be free, and IN (sub-select) is always inherently not fast, and NOT IN (sub-select) is always inherently awful. Throwing a pile of simple queries at the problem is not necessarily the wrong way ... especially when you are doing it in plpgsql, because you've already eliminated the overhead of network round trips and repeated planning of the queries. regards, tom lane
Tom Lane wrote: > Matthew Schumacher <matt.s@aptalaska.net> writes: > >> for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) >> LOOP >> _token := intokenary[i]; >> INSERT INTO bayes_token_tmp VALUES (_token); >> END LOOP; > > >> UPDATE >> bayes_token >> SET >> spam_count = greatest_int(spam_count + inspam_count, 0), >> ham_count = greatest_int(ham_count + inham_count , 0), >> atime = greatest_int(atime, 1000) >> WHERE >> id = inuserid >> AND >> (token) IN (SELECT intoken FROM bayes_token_tmp); > > > I don't really see why you think that this path is going to lead to > better performance than where you were before. Manipulation of the > temp table is never going to be free, and IN (sub-select) is always > inherently not fast, and NOT IN (sub-select) is always inherently > awful. Throwing a pile of simple queries at the problem is not > necessarily the wrong way ... especially when you are doing it in > plpgsql, because you've already eliminated the overhead of network > round trips and repeated planning of the queries. So for an IN (sub-select), does it actually pull all of the rows from the other table, or is the planner smart enough to stop once it finds something? Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)? What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y)? I would guess that the EXISTS/NOT EXISTS would be faster, though it probably would necessitate using a nested loop (at least that seems to be the way the query is written). I did some tests on a database with 800k rows, versus a temp table with 2k rows. I did one sequential test (1-2000, with 66 rows missing), and one sparse test (1-200, 100000-100200, 200000-200200, ... with 658 rows missing). I found that NOT IN did indeed have to load the whole table. IN was smart enough to do a nested loop. EXISTS and NOT EXISTS did a sequential scan on my temp table, with a SubPlan filter (which looks a whole lot like a Nested Loop). What I found was that IN performed about the same as EXISTS (since they are both effectively doing a nested loop), but that NOT IN took 4,000ms while NOT EXISTS was the same speed as EXISTS at around 166ms. Anyway, so it does seem like NOT IN is not a good choice, but IN seems to be equivalent to EXISTS, and NOT EXISTS is also very fast. Is this generally true, or did I just get lucky on my data? John =:-> > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? 
> > http://www.postgresql.org/docs/faq > mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM object_t); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Nested Loop IN Join (cost=0.00..9851.68 rows=2140 width=4) (actual time=0.085..183.889 rows=1351 loops=1) -> Seq Scan on ids (cost=0.00..31.40 rows=2140 width=4) (actual time=0.014..24.032 rows=2009 loops=1) -> Index Scan using object_t_pkey on object_t (cost=0.00..4.58 rows=1 width=4) (actual time=0.071..0.071 rows=1 loops=2009) Index Cond: ("outer".id = object_t.id) Total runtime: 184.823 ms (5 rows) Time: 186.931 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.086..165.053 rows=1351 loops=1) Filter: (subplan) SubPlan -> Index Scan using object_t_pkey on object_t o (cost=0.00..4.58 rows=1 width=4) (actual time=0.025..0.025 rows=1loops=2009) Index Cond: (id = $0) Total runtime: 165.995 ms (6 rows) Time: 167.795 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM object_t); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=36410.51..36447.26 rows=1070 width=4) (actual time=4168.247..4172.080 rows=658 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on object_t (cost=0.00..34381.81 rows=811481 width=4) (actual time=0.044..2464.296 rows=811481 loops=1) Total runtime: 4210.784 ms (5 rows) Time: 4212.276 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.372..164.510 rows=658 loops=1) Filter: (NOT (subplan)) SubPlan -> Index Scan using object_t_pkey on object_t o (cost=0.00..4.58 rows=1 width=4) (actual time=0.064..0.064 rows=1loops=2009) Index Cond: (id = $0) Total runtime: 165.016 ms (6 rows) Time: 166.786 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM object_t); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Nested Loop IN Join (cost=0.00..9851.68 rows=2140 width=4) (actual time=0.069..97.567 rows=1934 loops=1) -> Seq Scan on ids (cost=0.00..31.40 rows=2140 width=4) (actual time=0.012..1.868 rows=2000 loops=1) -> Index Scan using object_t_pkey on object_t (cost=0.00..4.58 rows=1 width=4) (actual time=0.045..0.045 rows=1 loops=2000) Index Cond: ("outer".id = object_t.id) Total runtime: 98.236 ms (5 rows) Time: 99.921 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.071..74.158 rows=1934 loops=1) Filter: (subplan) SubPlan -> Index Scan using object_t_pkey on object_t o (cost=0.00..4.58 rows=1 width=4) (actual time=0.013..0.013 rows=1loops=2000) Index 
Cond: (id = $0) Total runtime: 74.798 ms (6 rows) Time: 86.287 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM object_t); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=36410.51..36447.26 rows=1070 width=4) (actual time=4024.613..4028.774 rows=66 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on object_t (cost=0.00..34381.81 rows=811481 width=4) (actual time=0.040..2503.374 rows=811481 loops=1) Total runtime: 4122.327 ms (5 rows) Time: 4134.659 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.220..92.611 rows=66 loops=1) Filter: (NOT (subplan)) SubPlan -> Index Scan using object_t_pkey on object_t o (cost=0.00..4.58 rows=1 width=4) (actual time=0.043..0.043 rows=1loops=2000) Index Cond: (id = $0) Total runtime: 92.743 ms (6 rows) Time: 94.190 ms mifar07=#
John A Meinel wrote: > Matthew Schumacher wrote: > > I recommend that you drop and re-create the temp table. There is no > reason to have it around, considering you delete and re-add everything. > That means you never have to vacuum it, since it always only contains > the latest rows. Whenever I have a create temp and drop statement I get these errors: select put_tokens(1, '{"\\\\000"}', 1, 1, 1000); ERROR: relation with OID 582248 does not exist CONTEXT: SQL statement "INSERT INTO bayes_token_tmp VALUES ( $1 )" PL/pgSQL function "put_tokens" line 12 at SQL statement > > > > My one question here, is the inspam_count and inham_count *always* the > same for all tokens? I would have thought each token has its own count. > Anyway, there are a few lines I would change: No, we get the userid, inspam, inham, and atime, and they are the same for each token. If we have a different user we call the proc again. > -- create the table at the start of the procedure > CREATE TEMP TABLE bayes_token_tmp (intoken bytea); > -- You might also add primary key if you are going to be adding > -- *lots* of entries, but it sounds like you are going to have > -- less than 1 page, so it doesn't matter This causes errors, see above.... > -- (token) IN (SELECT intoken FROM bayes_token_tmp); > EXISTS (SELECT token FROM bayes_token_tmp > WHERE intoken=token LIMIT 1); > > -- I would also avoid your intoken (NOT) IN (SELECT token FROM > -- bayes_token) There are a few possibilities, but to me > -- as your bayes_token table becomes big, this will start > -- to be the slow point > > -- Rather than doing 2 NOT IN queries, it *might* be faster to do > DELETE FROM bayes_token_tmp > WHERE EXISTS (SELECT token FROM bayes_token > WHERE token=intoken); > > I'll look into this. thanks, schu
John A Meinel <john@arbash-meinel.com> writes: > Tom Lane wrote: >> I don't really see why you think that this path is going to lead to >> better performance than where you were before. > So for an IN (sub-select), does it actually pull all of the rows from > the other table, or is the planner smart enough to stop once it finds > something? It stops when it finds something --- but it's still a join operation in essence. I don't see that putting the values one by one into a table and then joining is going to be a win compared to just processing the values one at a time against the main table. > Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)? > What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y) The EXISTS variants are actually worse, because we've not spent as much time teaching the planner how to optimize them. There's effectively only one decent plan for an EXISTS, which is that the subselect's "x" is indexed and we do an indexscan probe using the outer "y" for each outer row. IN and NOT IN can do that, or several alternative plans that might be better depending on data statistics. However, that's cold comfort for Matthew's application -- the only way he'd get any benefit from all those planner smarts is if he ANALYZEs the temp table after loading it and then EXECUTEs the main query (so that it gets re-planned every time). Plus, at least some of those alternative plans would require an index on the temp table, which is unlikely to be worth the cost of setting up. And finally, this formulation requires separate IN and NOT IN tests that are necessarily going to do a lot of redundant work. There's enough overhead here that I find it highly doubtful that it'll be a win compared to the original approach of retail queries against the main table. regards, tom lane
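At the plain SQL level the sequence Tom describes would look something like the sketch below (names follow the earlier postings; inside plpgsql the last statement would have to go through EXECUTE to be re-planned against the fresh statistics on every call, and John notes later that EXECUTE has its own cost):

CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
-- ... load the incoming tokens here with COPY or a batch of INSERTs ...
ANALYZE bayes_token_tmp;        -- give the planner real row counts
UPDATE bayes_token
   SET atime = 10000
 WHERE id = 1
   AND token IN (SELECT intoken FROM bayes_token_tmp);
DROP TABLE bayes_token_tmp;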
Tom Lane wrote: > I don't really see why you think that this path is going to lead to > better performance than where you were before. Manipulation of the > temp table is never going to be free, and IN (sub-select) is always > inherently not fast, and NOT IN (sub-select) is always inherently > awful. Throwing a pile of simple queries at the problem is not > necessarily the wrong way ... especially when you are doing it in > plpgsql, because you've already eliminated the overhead of network > round trips and repeated planning of the queries. > > regards, tom lane The reason why I think this may be faster is because I would avoid running an update on data that needs to be inserted which saves searching through the table for a matching token. Perhaps I should do the insert first, then drop those tokens from the temp table, then do my updates in a loop. I'll have to do some benchmarking... schu
Matthew Schumacher wrote: > Tom Lane wrote: > > >>I don't really see why you think that this path is going to lead to >>better performance than where you were before. Manipulation of the >>temp table is never going to be free, and IN (sub-select) is always >>inherently not fast, and NOT IN (sub-select) is always inherently >>awful. Throwing a pile of simple queries at the problem is not >>necessarily the wrong way ... especially when you are doing it in >>plpgsql, because you've already eliminated the overhead of network >>round trips and repeated planning of the queries. >> >> regards, tom lane > > > The reason why I think this may be faster is because I would avoid > running an update on data that needs to be inserted which saves > searching through the table for a matching token. > > Perhaps I should do the insert first, then drop those tokens from the > temp table, then do my updates in a loop. > > I'll have to do some benchmarking... > > schu Tom, I think you're right, whenever I do a NOT IN it does a full table scan against bayes_token and since that table is going to get very big, doing the simple query in a loop that uses an index seems a bit faster. John, thanks for your help, it was worth a try, but it looks like the looping is just faster. Here is what I have so far in case anyone else has ideas before I abandon it: CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER, intokenary BYTEA[], inspam_count INTEGER, inham_count INTEGER, inatime INTEGER) RETURNS VOID AS ' DECLARE _token BYTEA; BEGIN UPDATE bayes_token SET spam_count = greatest_int(spam_count + inspam_count, 0), ham_count = greatest_int(ham_count + inham_count , 0), atime = greatest_int(atime, inatime) WHERE id = inuserid AND (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary)); UPDATE bayes_vars SET token_count = token_count + ( SELECT count(bayes_token_tmp) FROM bayes_token_tmp(intokenary) WHERE bayes_token_tmp NOT IN (SELECT token FROM bayes_token)), newest_token_age = greatest_int(newest_token_age, inatime), oldest_token_age = least_int(oldest_token_age, inatime) WHERE id = inuserid; INSERT INTO bayes_token SELECT inuserid, bayes_token_tmp, inspam_count, inham_count, inatime FROM bayes_token_tmp(intokenary) WHERE (inspam_count > 0 OR inham_count > 0) AND (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); RETURN; END; ' LANGUAGE 'plpgsql'; CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS SETOF bytea AS ' BEGIN for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) LOOP return next intokenary[i]; END LOOP; RETURN; end ' language 'plpgsql'; CREATE OR REPLACE FUNCTION greatest_int (integer, integer) RETURNS INTEGER IMMUTABLE STRICT AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;' LANGUAGE SQL; CREATE OR REPLACE FUNCTION least_int (integer, integer) RETURNS INTEGER IMMUTABLE STRICT AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;' LANGUAGE SQL;
Matthew Schumacher wrote: > Matthew Schumacher wrote: > >>Tom Lane wrote: >> >> >> >>>I don't really see why you think that this path is going to lead to >>>better performance than where you were before. Manipulation of the >>>temp table is never going to be free, and IN (sub-select) is always >>>inherently not fast, and NOT IN (sub-select) is always inherently >>>awful. Throwing a pile of simple queries at the problem is not >>>necessarily the wrong way ... especially when you are doing it in >>>plpgsql, because you've already eliminated the overhead of network >>>round trips and repeated planning of the queries. >>> >>> regards, tom lane >> >> >>The reason why I think this may be faster is because I would avoid >>running an update on data that needs to be inserted which saves >>searching though the table for a matching token. >> >>Perhaps I should do the insert first, then drop those tokens from the >>temp table, then do my updates in a loop. >> >>I'll have to do some benchmarking... >> >>schu > > > Tom, I think your right, whenever I do a NOT IN it does a full table > scan against bayes_token and since that table is going to get very big > doing the simple query in a loop that uses an index seems a bit faster. > > John, thanks for your help, it was worth a try, but it looks like the > looping is just faster. > > Here is what I have so far in case anyone else has ideas before I > abandon it: Surely this isn't what you have. You have *no* loop here, and you have stuff like: AND (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); I'm guessing this isn't your last version of the function. As far as putting the CREATE TEMP TABLE inside the function, I think the problem is that the first time it runs, it compiles the function, and when it gets to the UPDATE/INSERT with the temporary table name, at compile time it hard-codes that table id. I tried getting around it by using "EXECUTE" which worked, but it made the function horribly slow. So I don't recommend it. Anyway, if you want us to evaluate it, you really need to send us the real final function. John =:-> > > CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER, > intokenary BYTEA[], > inspam_count INTEGER, > inham_count INTEGER, > inatime INTEGER) > RETURNS VOID AS ' > DECLARE > _token BYTEA; > BEGIN > > UPDATE > bayes_token > SET > spam_count = greatest_int(spam_count + inspam_count, 0), > ham_count = greatest_int(ham_count + inham_count , 0), > atime = greatest_int(atime, inatime) > WHERE > id = inuserid > AND > (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary)); > > UPDATE > bayes_vars > SET > token_count = token_count + ( > SELECT > count(bayes_token_tmp) > FROM > bayes_token_tmp(intokenary) > WHERE > bayes_token_tmp NOT IN (SELECT token FROM bayes_token)), > newest_token_age = greatest_int(newest_token_age, inatime), > oldest_token_age = least_int(oldest_token_age, inatime) > WHERE > id = inuserid; > > INSERT INTO > bayes_token > SELECT > inuserid, > bayes_token_tmp, > inspam_count, > inham_count, > inatime > FROM > bayes_token_tmp(intokenary) > WHERE > (inspam_count > 0 OR inham_count > 0) > AND > (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); > > RETURN; > END; > ' LANGUAGE 'plpgsql'; > > CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS > SETOF bytea AS > ' > BEGIN > for i in array_lower(intokenary, 1) .. 
array_upper(intokenary, 1) > LOOP > return next intokenary[i]; > END LOOP; > RETURN; > end > ' > language 'plpgsql'; > > CREATE OR REPLACE FUNCTION greatest_int (integer, integer) > RETURNS INTEGER > IMMUTABLE STRICT > AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;' > LANGUAGE SQL; > > CREATE OR REPLACE FUNCTION least_int (integer, integer) > RETURNS INTEGER > IMMUTABLE STRICT > AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;' > LANGUAGE SQL; > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org >
John A Meinel wrote: > Surely this isn't what you have. You have *no* loop here, and you have stuff like: > AND > (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); > > I'm guessing this isn't your last version of the function. > > As far as putting the CREATE TEMP TABLE inside the function, I think the > problem is that the first time it runs, it compiles the function, and > when it gets to the UPDATE/INSERT with the temporary table name, at > compile time it hard-codes that table id. > > I tried getting around it by using "EXECUTE" which worked, but it made > the function horribly slow. So I don't recommend it. > > Anyway, if you want us to evaluate it, you really need to send us the > real final function. > > John > =:-> It is the final function. It doesn't need a loop because of the bayes_token_tmp function I added. The array is passed to it and it returns a record set so I can work off of it like it's a table. So the function works the same way it did before, but instead of using SELECT intoken from TEMPTABLE, you use SELECT bayes_token_tmp from bayes_token_tmp(intokenary). I think this is more efficient than the create table overhead, especially because the incoming record set won't be too big. Thanks, schu
Matthew Schumacher wrote: > John A Meinel wrote: > > >>Surely this isn't what you have. You have *no* loop here, and you have >>stuff like: >> AND >> (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); >> >>I'm guessing this isn't your last version of the function. >> >>As far as putting the CREATE TEMP TABLE inside the function, I think the >>problem is that the first time it runs, it compiles the function, and >>when it gets to the UPDATE/INSERT with the temporary table name, at >>compile time it hard-codes that table id. >> >>I tried getting around it by using "EXECUTE" which worked, but it made >>the function horribly slow. So I don't recommend it. >> >>Anyway, if you want us to evaluate it, you really need to send us the >>real final function. >> >>John >>=:-> > > > It is the final function. It doesn't need a loop because of the > bayes_token_tmp function I added. The array is passed to it and it > returns a record set so I can work off of it like it's a table. So the > function works the same way it did before, but instead of using SELECT > intoken from TEMPTABLE, you use SELECT bayes_token_tmp from > bayes_token_tmp(intokenary). > > I think this is more efficient than the create table overhead, > especially because the incoming record set won't be too big. > > Thanks, > > schu > > Well, I would at least recommend that you replace the "WHERE bayes_token_tmp NOT IN (SELECT token FROM bayes_token)" with a "WHERE NOT EXISTS (SELECT token FROM bayes_token WHERE token=bayes_token_tmp)" You might try experimenting with the differences, but on my system the NOT IN has to do a full sequential scan on bayes_token and load all entries into a list, while NOT EXISTS can do effectively a nested loop. The nested loop requires that there is an index on bayes_token(token), but I'm pretty sure there is anyway. Again, in my testing, it was a difference of 4200ms versus 180ms. (800k rows in my big table, 2k in the temp one) John =:->
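Applied to the INSERT in the last posted function, John's suggestion would read roughly as follows (untested sketch, shown as it would sit inside the function body; if the table is actually indexed on (id, token) rather than (token), adding an id = inuserid correlation to the subquery would let that composite index be used):

INSERT INTO bayes_token
  SELECT inuserid, bayes_token_tmp, inspam_count, inham_count, inatime
  FROM bayes_token_tmp(intokenary)
  WHERE (inspam_count > 0 OR inham_count > 0)
  AND NOT EXISTS (SELECT token FROM bayes_token
                  WHERE token = bayes_token_tmp);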