Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance

From John A Meinel
Subject Re: Performance problems testing with Spamassassin 3.1.0
Date
Msg-id 42F26E2C.8010607@arbash-meinel.com
Whole thread Raw
In response to Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher <matt.s@aptalaska.net>)
Responses Re: Performance problems testing with Spamassassin 3.1.0
List pgsql-performance
Matthew Schumacher wrote:
> Matthew Schumacher wrote:
>
>>Tom Lane wrote:
>>
>>
>>
>>>I don't really see why you think that this path is going to lead to
>>>better performance than where you were before.  Manipulation of the
>>>temp table is never going to be free, and IN (sub-select) is always
>>>inherently not fast, and NOT IN (sub-select) is always inherently
>>>awful.  Throwing a pile of simple queries at the problem is not
>>>necessarily the wrong way ... especially when you are doing it in
>>>plpgsql, because you've already eliminated the overhead of network
>>>round trips and repeated planning of the queries.
>>>
>>>            regards, tom lane
>>
>>
>>The reason why I think this may be faster is because I would avoid
>>running an update on data that needs to be inserted which saves
>>searching though the table for a matching token.
>>
>>Perhaps I should do the insert first, then drop those tokens from the
>>temp table, then do my updates in a loop.
>>
>>I'll have to do some benchmarking...
>>
>>schu
>
>
> Tom, I think your right, whenever I do a NOT IN it does a full table
> scan against bayes_token and since that table is going to get very big
> doing the simple query in a loop that uses an index seems a bit faster.
>
> John, thanks for your help, it was worth a try, but it looks like the
> looping is just faster.
>
> Here is what I have so far in case anyone else has ideas before I
> abandon it:

Surely this isn't what you have. You have *no* loop here, and you have
stuff like:
  AND
    (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

I'm guessing this isn't your last version of the function.

As far as putting the CREATE TEMP TABLE inside the function, I think the
problem is that the first time it runs, it compiles the function, and
when it gets to the UPDATE/INSERT with the temporary table name, at
compile time it hard-codes that table id.

I tried getting around it by using "EXECUTE" which worked, but it made
the function horribly slow. So I don't recommend it.

Anyway, if you want us to evaluate it, you really need to send us the
real final function.

John
=:->


>
> CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
>                                      intokenary BYTEA[],
>                                      inspam_count INTEGER,
>                                      inham_count INTEGER,
>                                      inatime INTEGER)
> RETURNS VOID AS '
> DECLARE
>   _token BYTEA;
> BEGIN
>
>   UPDATE
>     bayes_token
>   SET
>     spam_count = greatest_int(spam_count + inspam_count, 0),
>     ham_count = greatest_int(ham_count + inham_count , 0),
>     atime = greatest_int(atime, inatime)
>   WHERE
>     id = inuserid
>   AND
>     (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary));
>
>   UPDATE
>     bayes_vars
>   SET
>     token_count = token_count + (
>       SELECT
>         count(bayes_token_tmp)
>       FROM
>         bayes_token_tmp(intokenary)
>       WHERE
>         bayes_token_tmp NOT IN (SELECT token FROM bayes_token)),
>     newest_token_age = greatest_int(newest_token_age, inatime),
>     oldest_token_age = least_int(oldest_token_age, inatime)
>   WHERE
>     id = inuserid;
>
>   INSERT INTO
>     bayes_token
>     SELECT
>       inuserid,
>       bayes_token_tmp,
>       inspam_count,
>       inham_count,
>       inatime
>     FROM
>       bayes_token_tmp(intokenary)
>     WHERE
>       (inspam_count > 0 OR inham_count > 0)
>     AND
>       (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);
>
>   RETURN;
> END;
> ' LANGUAGE 'plpgsql';
>
> CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS
> SETOF bytea AS
> '
> BEGIN
>   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>   LOOP
>     return next intokenary[i];
>   END LOOP;
>   RETURN;
> end
> '
> language 'plpgsql';
>
> CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
>  RETURNS INTEGER
>  IMMUTABLE STRICT
>  AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
>  LANGUAGE SQL;
>
> CREATE OR REPLACE FUNCTION least_int (integer, integer)
>  RETURNS INTEGER
>  IMMUTABLE STRICT
>  AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
>  LANGUAGE SQL;
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>


Attachment

pgsql-performance by date:

Previous
From: Matthew Schumacher
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Next
From: Matthew Schumacher
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0