Re: Performance problems testing with Spamassassin 3.1.0

From: Michael Parker
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: ,
Msg-id: 20050801040443.GA6527@mail.herk.net
(view: Whole thread, Raw)
In response to: Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane)
Responses: Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane)
Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane)
List: pgsql-performance

Tree view

Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Matthew Schumacher, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Josh Berkus, )
  Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Josh Berkus, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Dennis Bjorklund, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Josh Berkus, )
     Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  (Tom Lane, )
  Re: Performance problems testing with Spamassassin  ("Luke Lonergan", )
 Re: Performance problems testing with Spamassassin 3.1.0  (Karim Nassar, )
  Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Gavin Sherry, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
     Re: Performance problems testing with Spamassassin 3.1.0  (Gavin Sherry, )
   Re: Performance problems testing with Spamassassin 3.1.0  (Andrew McMillan, )
    Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
     Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
     Re: Performance problems testing with Spamassassin 3.1.0  (Andrew McMillan, )
      Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
      Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
       Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
        Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
          Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Michael Parker, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
           Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
         Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
          Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
           Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
            Re: Performance problems testing with Spamassassin 3.1.0  ("Jim C. Nasby", )
             Re: Performance problems testing with Spamassassin 3.1.0  (Michael Parker, )
              Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
               Re: Performance problems testing with Spamassassin 3.1.0  (PFC, )
               Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
               Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
                Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                 Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane, )
                Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                 Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                  Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
                   Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
                    Re: Performance problems testing with Spamassassin 3.1.0  (John A Meinel, )
        Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
         Re: Performance problems testing with Spamassassin 3.1.0  (John Arbash Meinel, )
          Re: Performance problems testing with Spamassassin 3.1.0  (Matthew Schumacher, )
           Re: Performance problems testing with Spamassassin 3.1.0  ("Jim C. Nasby", )
            Re: Performance problems testing with Spamassassin 3.1.0  (Andreas Pflug, )
 Re: Performance problems testing with Spamassassin 3.1.0 Bayes module.  ("Merlin Moncure", )

Hi All,

As a SpamAssassin developer, who by my own admission has real problem
getting PostgreSQL to work well, I must thank everyone for their
feedback on this issue.  Believe me when I say what is in the tree now
is a far cry from what used to be there, orders of magnitude faster
for sure.  I think there are several good ideas that have come out of
this thread and I've set about attempting to implement them.

Here is a version of the stored procedure, based in large part by the
one written by Tom Lane, that accepts and array of tokens and loops
over them to either update or insert them into the database (I'm not
including the greatest_int/least_int procedures but you've all seen
them before):

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
                                     intokenary BYTEA[],
                                     inspam_count INTEGER,
                                     inham_count INTEGER,
                                     inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
  new_tokens INTEGER := 0;
BEGIN
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
    _token := intokenary[i];
    UPDATE bayes_token
       SET spam_count = greatest_int(spam_count + inspam_count, 0),
           ham_count = greatest_int(ham_count + inham_count, 0),
           atime = greatest_int(atime, inatime)
     WHERE id = inuserid
       AND token = _token;
    IF NOT FOUND THEN
      -- we do not insert negative counts, just return true
      IF NOT (inspam_count < 0 OR inham_count < 0) THEN
        INSERT INTO bayes_token (id, token, spam_count,
                                     ham_count, atime)
        VALUES (inuserid, _token, inspam_count, inham_count, inatime);
        IF FOUND THEN
          new_tokens := new_tokens + 1;
        END IF;
      END IF;
    END IF;
  END LOOP;

  UPDATE bayes_vars
     SET token_count = token_count + new_tokens,
         newest_token_age = greatest_int(newest_token_age, inatime),
         oldest_token_age = least_int(oldest_token_age, inatime)
   WHERE id = inuserid;
  RETURN;
END;
' LANGUAGE 'plpgsql';

This version is about 32x faster than the old version, with the
default fsync value and autovacuum running in the background.

The next hurdle, and I've just posted to the DBD::Pg list, is
escaping/quoting the token strings.  They are true binary strings,
substrings of SHA1 hashes, I don't think the original data set
provided puts them in the right context.  They have proved to be
tricky.  I'm unable to call the stored procedure from perl because I
keep getting a malformed array litteral error.

Here is some example code that shows the issue:
#!/usr/bin/perl -w

# from a new db, do this first
# INSERT INTO bayes_vars VALUES (1,'nobody',0,0,0,0,0,0,2147483647,0);

use strict;
use DBI;
use DBD::Pg qw(:pg_types);
use Digest::SHA1 qw(sha1);

my $dbh = DBI->connect("DBI:Pg:dbname=spamassassin","postgres") || die;

my @dataary;

# Input is just a list of words (ie /usr/share/dict/words) stop after 150
while(<>) {
  chomp;
  push(@dataary, substr(sha1($_), -5));
# to see it work with normal string comment out above and uncomment below
#  push(@dataary, $_);
  last if scalar(@dataary) >= 150;
}

my $datastring = join(",", map { '"' . bytea_esc($_) . '"' }
@dataary);
my $sql = "select put_tokens(1, '{$datastring}', 1, 1, 10000)";
my $sth = $dbh->prepare($sql);
my $rc = $sth->execute();
unless ($rc) {
  print "Error: " . $dbh->errstr() . "\n";
}
$sth->finish();

sub bytea_esc {
  my ($str) = @_;
  my $buf = "";
  foreach my $char (split(//,$str)) {
    if (ord($char) == 0) { $buf .= "\\\\000"; }
    elsif (ord($char) == 39) { $buf .= "\\\\047"; }
    elsif (ord($char) == 92) { $buf .= "\\\\134"; }
    else { $buf .= $char; }
  }
  return $buf;
}

Any ideas? or thoughts on the revised procedure?  I'd greatly
appriciate them.

Sorry for the length, but hopefully it give a good enough example.

Thanks
Michael Parker


Attachment

pgsql-performance by date:

From: Tom Lane
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0
From: Tom Lane
Date:
Subject: Re: [PATCHES] COPY FROM performance improvements