Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance
From | Michael Parker |
---|---|
Subject | Re: Performance problems testing with Spamassassin 3.1.0 |
Date | |
Msg-id | 20050801040443.GA6527@mail.herk.net Whole thread Raw |
In response to | Re: Performance problems testing with Spamassassin 3.1.0 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Performance problems testing with Spamassassin 3.1.0
Re: Performance problems testing with Spamassassin 3.1.0 |
List | pgsql-performance |
Hi All, As a SpamAssassin developer, who by my own admission has real problem getting PostgreSQL to work well, I must thank everyone for their feedback on this issue. Believe me when I say what is in the tree now is a far cry from what used to be there, orders of magnitude faster for sure. I think there are several good ideas that have come out of this thread and I've set about attempting to implement them. Here is a version of the stored procedure, based in large part by the one written by Tom Lane, that accepts and array of tokens and loops over them to either update or insert them into the database (I'm not including the greatest_int/least_int procedures but you've all seen them before): CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER, intokenary BYTEA[], inspam_count INTEGER, inham_count INTEGER, inatime INTEGER) RETURNS VOID AS ' DECLARE _token BYTEA; new_tokens INTEGER := 0; BEGIN for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) LOOP _token := intokenary[i]; UPDATE bayes_token SET spam_count = greatest_int(spam_count + inspam_count, 0), ham_count = greatest_int(ham_count + inham_count, 0), atime = greatest_int(atime, inatime) WHERE id = inuserid AND token = _token; IF NOT FOUND THEN -- we do not insert negative counts, just return true IF NOT (inspam_count < 0 OR inham_count < 0) THEN INSERT INTO bayes_token (id, token, spam_count, ham_count, atime) VALUES (inuserid, _token, inspam_count, inham_count, inatime); IF FOUND THEN new_tokens := new_tokens + 1; END IF; END IF; END IF; END LOOP; UPDATE bayes_vars SET token_count = token_count + new_tokens, newest_token_age = greatest_int(newest_token_age, inatime), oldest_token_age = least_int(oldest_token_age, inatime) WHERE id = inuserid; RETURN; END; ' LANGUAGE 'plpgsql'; This version is about 32x faster than the old version, with the default fsync value and autovacuum running in the background. The next hurdle, and I've just posted to the DBD::Pg list, is escaping/quoting the token strings. They are true binary strings, substrings of SHA1 hashes, I don't think the original data set provided puts them in the right context. They have proved to be tricky. I'm unable to call the stored procedure from perl because I keep getting a malformed array litteral error. Here is some example code that shows the issue: #!/usr/bin/perl -w # from a new db, do this first # INSERT INTO bayes_vars VALUES (1,'nobody',0,0,0,0,0,0,2147483647,0); use strict; use DBI; use DBD::Pg qw(:pg_types); use Digest::SHA1 qw(sha1); my $dbh = DBI->connect("DBI:Pg:dbname=spamassassin","postgres") || die; my @dataary; # Input is just a list of words (ie /usr/share/dict/words) stop after 150 while(<>) { chomp; push(@dataary, substr(sha1($_), -5)); # to see it work with normal string comment out above and uncomment below # push(@dataary, $_); last if scalar(@dataary) >= 150; } my $datastring = join(",", map { '"' . bytea_esc($_) . '"' } @dataary); my $sql = "select put_tokens(1, '{$datastring}', 1, 1, 10000)"; my $sth = $dbh->prepare($sql); my $rc = $sth->execute(); unless ($rc) { print "Error: " . $dbh->errstr() . "\n"; } $sth->finish(); sub bytea_esc { my ($str) = @_; my $buf = ""; foreach my $char (split(//,$str)) { if (ord($char) == 0) { $buf .= "\\\\000"; } elsif (ord($char) == 39) { $buf .= "\\\\047"; } elsif (ord($char) == 92) { $buf .= "\\\\134"; } else { $buf .= $char; } } return $buf; } Any ideas? or thoughts on the revised procedure? I'd greatly appriciate them. Sorry for the length, but hopefully it give a good enough example. Thanks Michael Parker
Attachment
pgsql-performance by date: