Thread: Postgres vs. MySQL
Hi, I have installed the dspam filter (http://www.nuclearelephant.com/projects/dspam) on our mail server (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users with a quite low traffic of 4000 messages/day. So it's a quite common platform/environment, nothing spectacular. First time(s) I tried the Postgres interface that was already installed for other applications. Whenever I begin to train and/or filter messages throug dspam the performance is incredibly bad. First messages are ok but soon the filter time begins to increase to about 30 seconds or more! ...so I looked for some optimization both for the linux kernel and the postgres server. Nothing has work for me. I always have the same behavior. For isolation purposes I started using another server just to hold the dspam database and nothing else. No matter what I do: postgres gets slower and slower with each new message fed or filtered. Several strategies have failed: newest RPMs from postgresql.org, pg_autovacuum, etc. I finally tried the MySQL driver. I have started using this tool right now for dspam, so I am a newcomer in MySQL. The result: after some preparation in configuring some parameters for mysqld (with the "QuickStart" Guide from mysql.com) all works fine! It's incredible! the same servers, the same messages, the same dspam compilation (well each one with the corresponding --with-storage-driver=*sql_drv). Postgres is getting worst than 30s/message and MySQL process the same in less than a second. I can surrender the Postgres server by just corpus-feeding one single short message to each user (it takes hours to process 300 users!). On the other hand, MySQL only takes a few minutes to process the same batch. I do not want to make flame over Postgres (I have always prefered it for its capabilities) but I am absolutely impressed by MySQL (I have seen the light!) Please, could anyone explain me this difference? Is Postgres that bad? Is MySQL that good? Am I the only one to have observed this behavior? TIA. Cheers, ---------------------------------------------------------------- Evilio Jose del Rio Silvan Centre Mediterrani d'Investigacions edelrio@cmima.csic.es Marines i Ambientals "Microsoft sells you Windows, Linux gives you the whole house" - Anonymous
On Wed, Nov 24, 2004 at 02:14:18PM +0100, Evilio del Rio wrote: > It's incredible! the same servers, the same messages, the same dspam > compilation (well each one with the corresponding > --with-storage-driver=*sql_drv). Postgres is getting worst than > 30s/message and MySQL process the same in less than a second. AFAIK dspam is heavily optimized for MySQL and not optimized for PostgreSQL at all; I believe there would be significant performance boosts available by "fixing" dspam. Example queries that are slow, as well as table schemas, would probably help a lot in tracking down the problems. /* Steinar */ -- Homepage: http://www.sesse.net/
Evilio del Rio wrote: > Please, could anyone explain me this difference? > Is Postgres that bad? > Is MySQL that good? > Am I the only one to have observed this behavior? Do you have any record of configuration, system hardware, usage patterns, queries executed? If you can tell us what you tried (and why) then we might be able to help, otherwise there's not much information here. -- Richard Huxton Archonet Ltd
As for performance, lots of others will probably volunteer tips and techniques. In my experience, properly written and tuned applications will show only minor speed differences. I have seen several open-source apps that "support postgres" but are not well tested on it. Query optimization can cause orders of magnitude performance differences. It sounds maybe dspam is in this bucket? > > Please, could anyone explain me this difference? > Is Postgres that bad? > Is MySQL that good? > Am I the only one to have observed this behavior? I made a little chart about these about a year ago: http://www.tikipro.org/wiki/index.php?page=DatabaseComparison If speed is what you need, and data integrity / safety is not, then MySQL may be a good choice. (Aggregate statistics tables and other such calculated denormalizations). IMHO, if all you need is dpsam running *now*, then I'd say MySQL might be good choice. If you ever need to run a DB application where data integrity is mission critical, then postgres is the top of my list. [ \ / [ >X< Christian Fowler | spider AT viovio.com [ / \ http://www.viovio.com | http://www.tikipro.org
Evilio del Rio wrote: >Hi, > >I have installed the dspam filter >(http://www.nuclearelephant.com/projects/dspam) on our mail server >(RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users >with a quite low traffic of 4000 messages/day. So it's a quite common >platform/environment, nothing spectacular. > > > The problem is definately dspam. We have been through their code. The new version is much, much better than the older one but I am sure there is more work to be done. The first version we installed suffered from a well known problem: It would use smallint/bigint but would not cast or quote the where clauses and thus PostgreSQL would never use the indexes. It was also missing several indexes on appropriate columns. We offered some advice and we know that some of it was taken but we don't know which. Sincerely, Joshua D. Drake -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
Attachment
On Wed, Nov 24, 2004 at 09:57:52AM -0500, Christian Fowler wrote: > As for performance, lots of others will probably volunteer tips and > techniques. In my experience, properly written and tuned applications will > show only minor speed differences. I have seen several open-source apps > that "support postgres" but are not well tested on it. Query optimization > can cause orders of magnitude performance differences. Definitely. My favourite is Request Tracker (we use 2.x, although 3.x is the latest version), which used something like 5-600 queries (all seqscans since the database schema only had an ordinary index on the varchar fields in question, and the queries were automatically searching on LOWER(field) to emulate MySQL's case-insensitivity on varchar fields) for _every_ page shown. Needless to say, the web interface was dog slow -- some index manipulation and a few bugfixes (they had some kind of cache layer which would eliminate 98% of the queries, but for some reason was broken for non-MySQL databases) later, and we were down to 3-4 index scans, a few orders of magnitude faster. :-) /* Steinar */ -- Homepage: http://www.sesse.net/
On Wed, Nov 24, 2004 at 02:14:18PM +0100, Evilio del Rio wrote: > I have installed the dspam filter > (http://www.nuclearelephant.com/projects/dspam) on our mail server > (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users > with a quite low traffic of 4000 messages/day. So it's a quite common > platform/environment, nothing spectacular. We just had a case just like this on #postgresql. The (somewhat surprising) solution was increasing the statistics target on the "token" column to something like 200, which makes the planner choose an index scan instead of a sequential scan. For the people who did not follow the case: The culprit is a query like SELECT * FROM table WHERE token IN ('346369873476346', '4376376034', ...) (token is a numeric(20,0)) With one entry in the IN (), the cost of an index scan was estimated to 4.77; with ten entries, it was about 48, but with 574 entries the estimated cost was 513565 (!!), making the planner prefer an index scan to 574 consecutive index scans. Upping the statistics target made the planner estimate the cost to about ~4000, and thus select the index scan, which was two orders of magnitude faster. BTW, this case was with PostgreSQL 7.4.6, not 7.3 as the poster here is reporting. /* Steinar */ -- Homepage: http://www.sesse.net/
I did some work on RT wrt Postgres for a company and found that their was lots of room for improvement particularly if you are linking requests. The latest RT code hopefully has fixes as a result of this work. Dave Steinar H. Gunderson wrote: >On Wed, Nov 24, 2004 at 09:57:52AM -0500, Christian Fowler wrote: > > >>As for performance, lots of others will probably volunteer tips and >>techniques. In my experience, properly written and tuned applications will >>show only minor speed differences. I have seen several open-source apps >>that "support postgres" but are not well tested on it. Query optimization >>can cause orders of magnitude performance differences. >> >> > >Definitely. My favourite is Request Tracker (we use 2.x, although 3.x is the >latest version), which used something like 5-600 queries (all seqscans since >the database schema only had an ordinary index on the varchar fields in >question, and the queries were automatically searching on LOWER(field) to >emulate MySQL's case-insensitivity on varchar fields) for _every_ page shown. >Needless to say, the web interface was dog slow -- some index manipulation >and a few bugfixes (they had some kind of cache layer which would eliminate >98% of the queries, but for some reason was broken for non-MySQL databases) >later, and we were down to 3-4 index scans, a few orders of magnitude faster. >:-) > >/* Steinar */ > > -- Dave Cramer http://www.postgresintl.com 519 939 0336 ICQ#14675561
On Wed, 2004-11-24 at 14:14 +0100, Evilio del Rio wrote: > Hi, > > I have installed the dspam filter > (http://www.nuclearelephant.com/projects/dspam) on our mail server > (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users > with a quite low traffic of 4000 messages/day. So it's a quite common > platform/environment, nothing spectacular. I am using DSpam with PostgreSQL here. I have a daily job that cleans the DSpam database up, as follows: DELETE FROM dspam_token_data WHERE (innocent_hits*2) + spam_hits < 5 AND CURRENT_DATE - last_hit > 60; DELETE FROM dspam_token_data WHERE innocent_hits = 1 AND CURRENT_DATE - last_hit > 30; DELETE FROM dspam_token_data WHERE CURRENT_DATE - last_hit > 180; DELETE FROM dspam_signature_data WHERE CURRENT_DATE - created_on > 14; VACUUM dspam_token_data; VACUUM dspam_signature_data; I also occasionally do a "VACUUM FULL ANALYZE;" on the database as well. In all honesty though, I think that MySQL is better suited to DSpam than PostgreSQL is. > Please, could anyone explain me this difference? > Is Postgres that bad? > Is MySQL that good? > Am I the only one to have observed this behavior? I believe that what DSpam does that is not well-catered for in the way PostgreSQL operates, is that it does very frequent updates to rows in (eventually) quite large tables. In PostgreSQL the UPDATE will result internally in a new record being written, with the old record being marked as deleted. That old record won't be re-used until after a VACUUM has run, and this means that the on-disk tables will have a lot of dead rows in them quite quickly. The reason that PostgreSQL operates this way, is a direct result of the way transactional support is implemented, and it may well change in a version or two. It's got better over the last few versions, with things like pg_autovacuum, but that approach still doesn't suit some types of database updating. Cheers, Andrew. ------------------------------------------------------------------------- Andrew @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St DDI: +64(4)803-2201 MOB: +64(272)DEBIAN OFFICE: +64(4)499-2267 These PRESERVES should be FORCE-FED to PENTAGON OFFICIALS!! -------------------------------------------------------------------------
Attachment
On Fri, 2004-11-26 at 14:37 +1300, Andrew McMillan wrote: > In PostgreSQL the UPDATE will result > internally in a new record being written, with the old record being > marked as deleted. That old record won't be re-used until after a > VACUUM has run, and this means that the on-disk tables will have a lot > of dead rows in them quite quickly. Not necessarily: yes, you need a VACUUM to begin reusing the space consumed by expired tuples, but that does not mean "tables will have a lot of dead rows in them quite quickly". VACUUM does not block concurrent database activity, so you can run it as frequently as you'd like (and as your database workload requires). There is a tradeoff between the space consumed by expired tuple versions and the I/O required to do a VACUUM -- it's up to the PG admin to decide what the right balance for their database is (pg_autovacuum et al. can help make this decision). > The reason that PostgreSQL operates this way, is a direct result of the > way transactional support is implemented, and it may well change in a > version or two. I doubt it. -Neil
On Wed, Nov 24, 2004 at 02:14:18PM +0100, Evilio del Rio wrote: > I have installed the dspam filter > (http://www.nuclearelephant.com/projects/dspam) on our mail server > (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users > with a quite low traffic of 4000 messages/day. So it's a quite common > platform/environment, nothing spectacular. > > First time(s) I tried the Postgres interface that was already installed > for other applications. Whenever I begin to train and/or filter > messages throug dspam the performance is incredibly bad. First messages > are ok but soon the filter time begins to increase to about 30 seconds > or more! > > ...so I looked for some optimization both for the linux kernel and the > postgres server. Nothing has work for me. I always have the same > behavior. For isolation purposes I started using another server just to > hold the dspam database and nothing else. No matter what I do: postgres > gets slower and slower with each new message fed or filtered. I know *somewhere* I recently read something indicating a critical configuration change for DSPAM + Postgres, but don't think I've seen it mentioned on this list. Possibly it is in the UPGRADING instructions for 3.2.1, or in a README file there. At any rate, it mentioned that it was essential to make some change to the table layout used by previous versions of DSPAM, and then Postgres would run many times faster. Unfortunately I no longer have 3.2.1 installed on my system, so I can't tell you if it was in there or somewhere else. -- Clifton -- Clifton Royston -- cliftonr@tikitechnologies.com Tiki Technologies Lead Programmer/Software Architect Did you ever fly a kite in bed? Did you ever walk with ten cats on your head? Did you ever milk this kind of cow? Well we can do it. We know how. If you never did, you should. These things are fun, and fun is good. -- Dr. Seuss
I posted about this a couple days ago on dspam-dev... I am using DSpam with PostgreSQL, and like you discovered the horrible performance. The reason is because the default PostgreSQL query planner settings determine that a sequence scan will be more efficient than an index scan, which is wrong. To correct this behavior, adjust the query planner settings for the appropriate table/column with this command: alter table "dspam_token_data" alter "token" set statistics 200; analyze; Let me know if it help you. It worked wonders for me. -- Casey Allen Shobe cshobe@osss.net On Fri, November 26, 2004 12:35 pm, Clifton Royston said: > On Wed, Nov 24, 2004 at 02:14:18PM +0100, Evilio del Rio wrote: >> I have installed the dspam filter >> (http://www.nuclearelephant.com/projects/dspam) on our mail server >> (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users >> with a quite low traffic of 4000 messages/day. So it's a quite common >> platform/environment, nothing spectacular. >> >> First time(s) I tried the Postgres interface that was already installed >> for other applications. Whenever I begin to train and/or filter >> messages throug dspam the performance is incredibly bad. First messages >> are ok but soon the filter time begins to increase to about 30 seconds >> or more! >> >> ...so I looked for some optimization both for the linux kernel and the >> postgres server. Nothing has work for me. I always have the same >> behavior. For isolation purposes I started using another server just to >> hold the dspam database and nothing else. No matter what I do: postgres >> gets slower and slower with each new message fed or filtered. > > I know *somewhere* I recently read something indicating a critical > configuration change for DSPAM + Postgres, but don't think I've seen it > mentioned on this list. Possibly it is in the UPGRADING instructions > for 3.2.1, or in a README file there. At any rate, it mentioned that > it was essential to make some change to the table layout used by previous > versions of DSPAM, and then Postgres would run many times faster. > > Unfortunately I no longer have 3.2.1 installed on my system, so I can't > tell you if it was in there or somewhere else. > > -- Clifton > > -- > Clifton Royston -- cliftonr@tikitechnologies.com > Tiki Technologies Lead Programmer/Software Architect > Did you ever fly a kite in bed? Did you ever walk with ten cats on your > head? > Did you ever milk this kind of cow? Well we can do it. We know how. > If you never did, you should. These things are fun, and fun is good. > -- Dr. > Seuss >
Casey Allen Shobe wrote the following on 11/27/04 03:11 : >I posted about this a couple days ago on dspam-dev... > >I am using DSpam with PostgreSQL, and like you discovered the horrible >performance. The reason is because the default PostgreSQL query planner >settings determine that a sequence scan will be more efficient than an >index scan, which is wrong. To correct this behavior, adjust the query >planner settings for the appropriate table/column with this command: > >alter table "dspam_token_data" alter "token" set statistics 200; analyze; > >Let me know if it help you. It worked wonders for me. > > > In tum mode, this could help too (I'm currently testing it) : CREATE INDEX id_token_data_sumhits ON dspam_token_data ((spam_hits + innocent_hits)); Indeed each UPDATE on dspam_token_data in TUM is done with : WHERE ... AND spam_hits + innocent_hits < 50
Martha Stewart called it a Good Thing when cshobe@osss.net ("Casey Allen Shobe") wrote: > I posted about this a couple days ago on dspam-dev... > > I am using DSpam with PostgreSQL, and like you discovered the horrible > performance. The reason is because the default PostgreSQL query planner > settings determine that a sequence scan will be more efficient than an > index scan, which is wrong. To correct this behavior, adjust the query > planner settings for the appropriate table/column with this command: > > alter table "dspam_token_data" alter "token" set statistics 200; analyze; > > Let me know if it help you. It worked wonders for me. That makes a great deal of sense; the number of tokens are likely to be rather larger than 10, and are likely to be quite unevenly distributed. That fits with the need you found to collect more statistics on that column. Other cases where it seems plausible that it would be worthwhile to do the same: alter table dspam_signature_data alter signature set statistics 200; alter table dspam_neural_data alter node set statistics 200; alter table dspam_neural_decisions alter signature set statistics 200; Lionel's suggestion of having a functional index on dspam_token_data (innocent_hits + spam_hits) also seems likely to be helpful. Along with that, it might prove necessary to alter stats on dspam_token_data thus: alter table dspam_token_data alter innocent_hits set statistics 200; alter table dspam_token_data alter spam_hits set statistics 200; None of these changes are likely to make things materially worse; if they do help, they'll help rather a lot. -- (format nil "~S@~S" "cbbrowne" "gmail.com") http://www.ntlug.org/~cbbrowne/nonrdbms.html Rules of the Evil Overlord #112. "I will not rely entirely upon "totally reliable" spells that can be neutralized by relatively inconspicuous talismans." <http://www.eviloverlord.com/>
FWIW, those queries won't be able to use an index. A better WHERE clause would be: AND last_hit < CURRENT_DATE - 60 On Fri, Nov 26, 2004 at 02:37:12PM +1300, Andrew McMillan wrote: > On Wed, 2004-11-24 at 14:14 +0100, Evilio del Rio wrote: > > Hi, > > > > I have installed the dspam filter > > (http://www.nuclearelephant.com/projects/dspam) on our mail server > > (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users > > with a quite low traffic of 4000 messages/day. So it's a quite common > > platform/environment, nothing spectacular. > > I am using DSpam with PostgreSQL here. I have a daily job that cleans > the DSpam database up, as follows: > > DELETE FROM dspam_token_data > WHERE (innocent_hits*2) + spam_hits < 5 > AND CURRENT_DATE - last_hit > 60; > > DELETE FROM dspam_token_data > WHERE innocent_hits = 1 > AND CURRENT_DATE - last_hit > 30; > > DELETE FROM dspam_token_data > WHERE CURRENT_DATE - last_hit > 180; > > DELETE FROM dspam_signature_data > WHERE CURRENT_DATE - created_on > 14; > > VACUUM dspam_token_data; > > VACUUM dspam_signature_data; > > > > I also occasionally do a "VACUUM FULL ANALYZE;" on the database as well. > > > In all honesty though, I think that MySQL is better suited to DSpam than > PostgreSQL is. > > > > Please, could anyone explain me this difference? > > Is Postgres that bad? > > Is MySQL that good? > > Am I the only one to have observed this behavior? > > I believe that what DSpam does that is not well-catered for in the way > PostgreSQL operates, is that it does very frequent updates to rows in > (eventually) quite large tables. In PostgreSQL the UPDATE will result > internally in a new record being written, with the old record being > marked as deleted. That old record won't be re-used until after a > VACUUM has run, and this means that the on-disk tables will have a lot > of dead rows in them quite quickly. > > The reason that PostgreSQL operates this way, is a direct result of the > way transactional support is implemented, and it may well change in a > version or two. It's got better over the last few versions, with things > like pg_autovacuum, but that approach still doesn't suit some types of > database updating. > > Cheers, > Andrew. > ------------------------------------------------------------------------- > Andrew @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington > WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St > DDI: +64(4)803-2201 MOB: +64(272)DEBIAN OFFICE: +64(4)499-2267 > These PRESERVES should be FORCE-FED to PENTAGON OFFICIALS!! > ------------------------------------------------------------------------- > -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"