Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance

From Michael Parker
Subject Re: Performance problems testing with Spamassassin 3.1.0
Date
Msg-id 20050801040443.GA6527@mail.herk.net
Whole thread Raw
In response to Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Performance problems testing with Spamassassin 3.1.0
Re: Performance problems testing with Spamassassin 3.1.0
List pgsql-performance
Hi All,

As a SpamAssassin developer, who by my own admission has real problem
getting PostgreSQL to work well, I must thank everyone for their
feedback on this issue.  Believe me when I say what is in the tree now
is a far cry from what used to be there, orders of magnitude faster
for sure.  I think there are several good ideas that have come out of
this thread and I've set about attempting to implement them.

Here is a version of the stored procedure, based in large part by the
one written by Tom Lane, that accepts and array of tokens and loops
over them to either update or insert them into the database (I'm not
including the greatest_int/least_int procedures but you've all seen
them before):

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
                                     intokenary BYTEA[],
                                     inspam_count INTEGER,
                                     inham_count INTEGER,
                                     inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
  new_tokens INTEGER := 0;
BEGIN
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
    _token := intokenary[i];
    UPDATE bayes_token
       SET spam_count = greatest_int(spam_count + inspam_count, 0),
           ham_count = greatest_int(ham_count + inham_count, 0),
           atime = greatest_int(atime, inatime)
     WHERE id = inuserid
       AND token = _token;
    IF NOT FOUND THEN
      -- we do not insert negative counts, just return true
      IF NOT (inspam_count < 0 OR inham_count < 0) THEN
        INSERT INTO bayes_token (id, token, spam_count,
                                     ham_count, atime)
        VALUES (inuserid, _token, inspam_count, inham_count, inatime);
        IF FOUND THEN
          new_tokens := new_tokens + 1;
        END IF;
      END IF;
    END IF;
  END LOOP;

  UPDATE bayes_vars
     SET token_count = token_count + new_tokens,
         newest_token_age = greatest_int(newest_token_age, inatime),
         oldest_token_age = least_int(oldest_token_age, inatime)
   WHERE id = inuserid;
  RETURN;
END;
' LANGUAGE 'plpgsql';

This version is about 32x faster than the old version, with the
default fsync value and autovacuum running in the background.

The next hurdle, and I've just posted to the DBD::Pg list, is
escaping/quoting the token strings.  They are true binary strings,
substrings of SHA1 hashes, I don't think the original data set
provided puts them in the right context.  They have proved to be
tricky.  I'm unable to call the stored procedure from perl because I
keep getting a malformed array litteral error.

Here is some example code that shows the issue:
#!/usr/bin/perl -w

# from a new db, do this first
# INSERT INTO bayes_vars VALUES (1,'nobody',0,0,0,0,0,0,2147483647,0);

use strict;
use DBI;
use DBD::Pg qw(:pg_types);
use Digest::SHA1 qw(sha1);

my $dbh = DBI->connect("DBI:Pg:dbname=spamassassin","postgres") || die;

my @dataary;

# Input is just a list of words (ie /usr/share/dict/words) stop after 150
while(<>) {
  chomp;
  push(@dataary, substr(sha1($_), -5));
# to see it work with normal string comment out above and uncomment below
#  push(@dataary, $_);
  last if scalar(@dataary) >= 150;
}

my $datastring = join(",", map { '"' . bytea_esc($_) . '"' }
@dataary);
my $sql = "select put_tokens(1, '{$datastring}', 1, 1, 10000)";
my $sth = $dbh->prepare($sql);
my $rc = $sth->execute();
unless ($rc) {
  print "Error: " . $dbh->errstr() . "\n";
}
$sth->finish();

sub bytea_esc {
  my ($str) = @_;
  my $buf = "";
  foreach my $char (split(//,$str)) {
    if (ord($char) == 0) { $buf .= "\\\\000"; }
    elsif (ord($char) == 39) { $buf .= "\\\\047"; }
    elsif (ord($char) == 92) { $buf .= "\\\\134"; }
    else { $buf .= $char; }
  }
  return $buf;
}

Any ideas? or thoughts on the revised procedure?  I'd greatly
appriciate them.

Sorry for the length, but hopefully it give a good enough example.

Thanks
Michael Parker


Attachment

pgsql-performance by date:

Previous
From: William Yu
Date:
Subject: Re: Performance problems on 4/8way Opteron (dualcore)
Next
From: Tom Lane
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0