Thread: Postgres vs. MySQL

Postgres vs. MySQL

From
Evilio del Rio
Date:
Hi,

I have installed the dspam filter
(http://www.nuclearelephant.com/projects/dspam) on our mail server
(RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users
with a quite low traffic of 4000 messages/day. So it's a quite common
platform/environment, nothing spectacular.

First time(s) I tried the Postgres interface that was already installed
for other applications.  Whenever I begin to train and/or filter
messages throug dspam the performance is incredibly bad. First messages
are ok but soon the filter time begins to increase to about 30 seconds
or more!

...so I looked for some optimization both for the linux  kernel and the
postgres server.  Nothing has work for me. I always have the same
behavior. For isolation purposes I started using another server just to
hold the dspam database and nothing else. No matter what I do: postgres
gets slower and slower with each new message fed or filtered.

Several strategies have failed: newest RPMs from postgresql.org,
pg_autovacuum, etc.

I finally tried the MySQL driver.

I have started using this tool right now for dspam, so I am a newcomer
in MySQL.

The result: after some preparation in configuring some parameters for
mysqld (with the "QuickStart" Guide from mysql.com) all works fine!

It's incredible! the same servers, the same messages, the same dspam
compilation (well each one with the corresponding
--with-storage-driver=*sql_drv). Postgres is getting worst than
30s/message and MySQL process the same in less than a second.

I can surrender the Postgres server by just corpus-feeding one single
short message to each user (it takes hours to process 300 users!).

On the other hand, MySQL only takes a few minutes to process the same
batch.

I do not want to make flame over Postgres (I have always prefered it for
its capabilities) but I am absolutely impressed by MySQL (I have seen
the light!)

Please, could anyone explain me this difference?
Is Postgres that bad?
Is MySQL that good?
Am I the only one to have observed this behavior?

TIA.

Cheers,

----------------------------------------------------------------
Evilio Jose del Rio Silvan   Centre Mediterrani d'Investigacions
edelrio@cmima.csic.es                Marines i Ambientals
"Microsoft sells you Windows, Linux gives you the whole house" - Anonymous


Re: Postgres vs. MySQL

From
"Steinar H. Gunderson"
Date:
On Wed, Nov 24, 2004 at 02:14:18PM +0100, Evilio del Rio wrote:
> It's incredible! the same servers, the same messages, the same dspam
> compilation (well each one with the corresponding
> --with-storage-driver=*sql_drv). Postgres is getting worst than
> 30s/message and MySQL process the same in less than a second.

AFAIK dspam is heavily optimized for MySQL and not optimized for PostgreSQL
at all; I believe there would be significant performance boosts available
by "fixing" dspam.

Example queries that are slow, as well as table schemas, would probably help
a lot in tracking down the problems.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Postgres vs. MySQL

From
Richard Huxton
Date:
Evilio del Rio wrote:

> Please, could anyone explain me this difference?
> Is Postgres that bad?
> Is MySQL that good?
> Am I the only one to have observed this behavior?

Do you have any record of configuration, system hardware, usage
patterns, queries executed?

If you can tell us what you tried (and why) then we might be able to
help, otherwise there's not much information here.

--
   Richard Huxton
   Archonet Ltd

Re: Postgres vs. MySQL

From
Christian Fowler
Date:
As for performance, lots of others will probably volunteer tips and
techniques. In my experience, properly written and tuned applications will
show only minor speed differences. I have seen several open-source apps
that "support postgres" but are not well tested on it. Query optimization
can cause orders of magnitude performance differences. It sounds maybe
dspam is in this bucket?


>
> Please, could anyone explain me this difference?
> Is Postgres that bad?
> Is MySQL that good?
> Am I the only one to have observed this behavior?

I made a little chart about these about a year ago:

http://www.tikipro.org/wiki/index.php?page=DatabaseComparison

If speed is what you need, and data integrity / safety is not, then MySQL
may be a good choice. (Aggregate statistics tables and other such
calculated denormalizations).

IMHO, if all you need is dpsam running *now*, then I'd say MySQL might be
good choice. If you ever need to run a DB application where data integrity
is mission critical, then postgres is the top of my list.


[ \ /
[ >X<   Christian Fowler      | spider AT viovio.com
[ / \   http://www.viovio.com | http://www.tikipro.org

Re: Postgres vs. MySQL

From
"Joshua D. Drake"
Date:
Evilio del Rio wrote:

>Hi,
>
>I have installed the dspam filter
>(http://www.nuclearelephant.com/projects/dspam) on our mail server
>(RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users
>with a quite low traffic of 4000 messages/day. So it's a quite common
>platform/environment, nothing spectacular.
>
>
>
The problem is definately dspam. We have been through their code.
The new version is much, much better than the older one but I am sure
there is more work to be done.

The first version we installed suffered from a well known problem:

It would use smallint/bigint but would not cast or quote the
where clauses and thus PostgreSQL would never use the indexes.

It was also missing several indexes on appropriate columns.

We offered some advice and we know that some of it was taken but
we don't know which.

Sincerely,

Joshua D. Drake



--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL


Attachment

Re: Postgres vs. MySQL

From
"Steinar H. Gunderson"
Date:
On Wed, Nov 24, 2004 at 09:57:52AM -0500, Christian Fowler wrote:
> As for performance, lots of others will probably volunteer tips and
> techniques. In my experience, properly written and tuned applications will
> show only minor speed differences. I have seen several open-source apps
> that "support postgres" but are not well tested on it. Query optimization
> can cause orders of magnitude performance differences.

Definitely. My favourite is Request Tracker (we use 2.x, although 3.x is the
latest version), which used something like 5-600 queries (all seqscans since
the database schema only had an ordinary index on the varchar fields in
question, and the queries were automatically searching on LOWER(field) to
emulate MySQL's case-insensitivity on varchar fields) for _every_ page shown.
Needless to say, the web interface was dog slow -- some index manipulation
and a few bugfixes (they had some kind of cache layer which would eliminate
98% of the queries, but for some reason was broken for non-MySQL databases)
later, and we were down to 3-4 index scans, a few orders of magnitude faster.
:-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Postgres vs. MySQL

From
"Steinar H. Gunderson"
Date:
On Wed, Nov 24, 2004 at 02:14:18PM +0100, Evilio del Rio wrote:
> I have installed the dspam filter
> (http://www.nuclearelephant.com/projects/dspam) on our mail server
> (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users
> with a quite low traffic of 4000 messages/day. So it's a quite common
> platform/environment, nothing spectacular.

We just had a case just like this on #postgresql. The (somewhat surprising)
solution was increasing the statistics target on the "token" column to
something like 200, which makes the planner choose an index scan instead of a
sequential scan.

For the people who did not follow the case: The culprit is a query like

  SELECT * FROM table WHERE token IN ('346369873476346', '4376376034', ...)

(token is a numeric(20,0)) With one entry in the IN (), the cost of an index
scan was estimated to 4.77; with ten entries, it was about 48, but with 574
entries the estimated cost was 513565 (!!), making the planner prefer an
index scan to 574 consecutive index scans. Upping the statistics target made
the planner estimate the cost to about ~4000, and thus select the index scan,
which was two orders of magnitude faster.

BTW, this case was with PostgreSQL 7.4.6, not 7.3 as the poster here is
reporting.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Postgres vs. MySQL

From
Dave Cramer
Date:
I did some work on RT wrt Postgres for a company and found that their
was lots of room for improvement
particularly if you are linking requests. The latest RT code hopefully
has fixes as a result of this work.

Dave

Steinar H. Gunderson wrote:

>On Wed, Nov 24, 2004 at 09:57:52AM -0500, Christian Fowler wrote:
>
>
>>As for performance, lots of others will probably volunteer tips and
>>techniques. In my experience, properly written and tuned applications will
>>show only minor speed differences. I have seen several open-source apps
>>that "support postgres" but are not well tested on it. Query optimization
>>can cause orders of magnitude performance differences.
>>
>>
>
>Definitely. My favourite is Request Tracker (we use 2.x, although 3.x is the
>latest version), which used something like 5-600 queries (all seqscans since
>the database schema only had an ordinary index on the varchar fields in
>question, and the queries were automatically searching on LOWER(field) to
>emulate MySQL's case-insensitivity on varchar fields) for _every_ page shown.
>Needless to say, the web interface was dog slow -- some index manipulation
>and a few bugfixes (they had some kind of cache layer which would eliminate
>98% of the queries, but for some reason was broken for non-MySQL databases)
>later, and we were down to 3-4 index scans, a few orders of magnitude faster.
>:-)
>
>/* Steinar */
>
>

--
Dave Cramer
http://www.postgresintl.com
519 939 0336
ICQ#14675561


Re: Postgres vs. DSpam

From
Andrew McMillan
Date:
On Wed, 2004-11-24 at 14:14 +0100, Evilio del Rio wrote:
> Hi,
>
> I have installed the dspam filter
> (http://www.nuclearelephant.com/projects/dspam) on our mail server
> (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users
> with a quite low traffic of 4000 messages/day. So it's a quite common
> platform/environment, nothing spectacular.

I am using DSpam with PostgreSQL here.  I have a daily job that cleans
the DSpam database up, as follows:

DELETE FROM dspam_token_data
  WHERE (innocent_hits*2) + spam_hits < 5
  AND CURRENT_DATE - last_hit > 60;

DELETE FROM dspam_token_data
  WHERE innocent_hits = 1
  AND CURRENT_DATE - last_hit > 30;

DELETE FROM dspam_token_data
  WHERE CURRENT_DATE - last_hit > 180;

DELETE FROM dspam_signature_data
  WHERE CURRENT_DATE - created_on > 14;

VACUUM dspam_token_data;

VACUUM dspam_signature_data;



I also occasionally do a "VACUUM FULL ANALYZE;" on the database as well.


In all honesty though, I think that MySQL is better suited to DSpam than
PostgreSQL is.


> Please, could anyone explain me this difference?
> Is Postgres that bad?
> Is MySQL that good?
> Am I the only one to have observed this behavior?

I believe that what DSpam does that is not well-catered for in the way
PostgreSQL operates, is that it does very frequent updates to rows in
(eventually) quite large tables.  In PostgreSQL the UPDATE will result
internally in a new record being written, with the old record being
marked as deleted.  That old record won't be re-used until after a
VACUUM has run, and this means that the on-disk tables will have a lot
of dead rows in them quite quickly.

The reason that PostgreSQL operates this way, is a direct result of the
way transactional support is implemented, and it may well change in a
version or two.  It's got better over the last few versions, with things
like pg_autovacuum, but that approach still doesn't suit some types of
database updating.

Cheers,
                    Andrew.
-------------------------------------------------------------------------
Andrew @ Catalyst .Net .NZ  Ltd,  PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/            PHYS: Level 2, 150-154 Willis St
DDI: +64(4)803-2201      MOB: +64(272)DEBIAN      OFFICE: +64(4)499-2267
       These PRESERVES should be FORCE-FED to PENTAGON OFFICIALS!!
-------------------------------------------------------------------------


Attachment

Re: Postgres vs. DSpam

From
Neil Conway
Date:
On Fri, 2004-11-26 at 14:37 +1300, Andrew McMillan wrote:
> In PostgreSQL the UPDATE will result
> internally in a new record being written, with the old record being
> marked as deleted.  That old record won't be re-used until after a
> VACUUM has run, and this means that the on-disk tables will have a lot
> of dead rows in them quite quickly.

Not necessarily: yes, you need a VACUUM to begin reusing the space
consumed by expired tuples, but that does not mean "tables will have a
lot of dead rows in them quite quickly". VACUUM does not block
concurrent database activity, so you can run it as frequently as you'd
like (and as your database workload requires). There is a tradeoff
between the space consumed by expired tuple versions and the I/O
required to do a VACUUM -- it's up to the PG admin to decide what the
right balance for their database is (pg_autovacuum et al. can help make
this decision).

> The reason that PostgreSQL operates this way, is a direct result of the
> way transactional support is implemented, and it may well change in a
> version or two.

I doubt it.

-Neil



Re: [dspam-users] Postgres vs. MySQL

From
Clifton Royston
Date:
On Wed, Nov 24, 2004 at 02:14:18PM +0100, Evilio del Rio wrote:
> I have installed the dspam filter
> (http://www.nuclearelephant.com/projects/dspam) on our mail server
> (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users
> with a quite low traffic of 4000 messages/day. So it's a quite common
> platform/environment, nothing spectacular.
>
> First time(s) I tried the Postgres interface that was already installed
> for other applications.  Whenever I begin to train and/or filter
> messages throug dspam the performance is incredibly bad. First messages
> are ok but soon the filter time begins to increase to about 30 seconds
> or more!
>
> ...so I looked for some optimization both for the linux  kernel and the
> postgres server.  Nothing has work for me. I always have the same
> behavior. For isolation purposes I started using another server just to
> hold the dspam database and nothing else. No matter what I do: postgres
> gets slower and slower with each new message fed or filtered.

I know *somewhere* I recently read something indicating a critical
configuration change for DSPAM + Postgres, but don't think I've seen it
mentioned on this list.  Possibly it is in the UPGRADING instructions
for 3.2.1, or in a README file there.  At any rate, it mentioned that
it was essential to make some change to the table layout used by previous
versions of DSPAM, and then Postgres would run many times faster.

Unfortunately I no longer have 3.2.1 installed on my system, so I can't
tell you if it was in there or somewhere else.

  -- Clifton

--
          Clifton Royston  --  cliftonr@tikitechnologies.com
         Tiki Technologies Lead Programmer/Software Architect
Did you ever fly a kite in bed?  Did you ever walk with ten cats on your head?
  Did you ever milk this kind of cow?  Well we can do it.  We know how.
If you never did, you should.  These things are fun, and fun is good.
                                                                 -- Dr. Seuss

Re: [dspam-users] Postgres vs. MySQL

From
"Casey Allen Shobe"
Date:
I posted about this a couple days ago on dspam-dev...

I am using DSpam with PostgreSQL, and like you discovered the horrible
performance.  The reason is because the default PostgreSQL query planner
settings determine that a sequence scan will be more efficient than an
index scan, which is wrong.  To correct this behavior, adjust the query
planner settings for the appropriate table/column with this command:

alter table "dspam_token_data" alter "token" set statistics 200; analyze;

Let me know if it help you.  It worked wonders for me.

--
Casey Allen Shobe
cshobe@osss.net

On Fri, November 26, 2004 12:35 pm, Clifton Royston said:
> On Wed, Nov 24, 2004 at 02:14:18PM +0100, Evilio del Rio wrote:
>> I have installed the dspam filter
>> (http://www.nuclearelephant.com/projects/dspam) on our mail server
>> (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users
>> with a quite low traffic of 4000 messages/day. So it's a quite common
>> platform/environment, nothing spectacular.
>>
>> First time(s) I tried the Postgres interface that was already installed
>> for other applications.  Whenever I begin to train and/or filter
>> messages throug dspam the performance is incredibly bad. First messages
>> are ok but soon the filter time begins to increase to about 30 seconds
>> or more!
>>
>> ...so I looked for some optimization both for the linux  kernel and the
>> postgres server.  Nothing has work for me. I always have the same
>> behavior. For isolation purposes I started using another server just to
>> hold the dspam database and nothing else. No matter what I do: postgres
>> gets slower and slower with each new message fed or filtered.
>
> I know *somewhere* I recently read something indicating a critical
> configuration change for DSPAM + Postgres, but don't think I've seen it
> mentioned on this list.  Possibly it is in the UPGRADING instructions
> for 3.2.1, or in a README file there.  At any rate, it mentioned that
> it was essential to make some change to the table layout used by previous
> versions of DSPAM, and then Postgres would run many times faster.
>
> Unfortunately I no longer have 3.2.1 installed on my system, so I can't
> tell you if it was in there or somewhere else.
>
>   -- Clifton
>
> --
>           Clifton Royston  --  cliftonr@tikitechnologies.com
>          Tiki Technologies Lead Programmer/Software Architect
> Did you ever fly a kite in bed?  Did you ever walk with ten cats on your
> head?
>   Did you ever milk this kind of cow?  Well we can do it.  We know how.
> If you never did, you should.  These things are fun, and fun is good.
>                                                                  -- Dr.
> Seuss
>


Re: [dspam-users] Postgres vs. MySQL

From
Lionel Bouton
Date:
Casey Allen Shobe wrote the following on 11/27/04 03:11 :

>I posted about this a couple days ago on dspam-dev...
>
>I am using DSpam with PostgreSQL, and like you discovered the horrible
>performance.  The reason is because the default PostgreSQL query planner
>settings determine that a sequence scan will be more efficient than an
>index scan, which is wrong.  To correct this behavior, adjust the query
>planner settings for the appropriate table/column with this command:
>
>alter table "dspam_token_data" alter "token" set statistics 200; analyze;
>
>Let me know if it help you.  It worked wonders for me.
>
>
>
In tum mode, this could help too (I'm currently testing it) :
CREATE INDEX id_token_data_sumhits ON dspam_token_data ((spam_hits +
innocent_hits));

Indeed each UPDATE on dspam_token_data in TUM is done with :
WHERE ... AND spam_hits + innocent_hits < 50

Re: [dspam-users] Postgres vs. MySQL

From
Christopher Browne
Date:
Martha Stewart called it a Good Thing when cshobe@osss.net ("Casey Allen Shobe") wrote:
> I posted about this a couple days ago on dspam-dev...
>
> I am using DSpam with PostgreSQL, and like you discovered the horrible
> performance.  The reason is because the default PostgreSQL query planner
> settings determine that a sequence scan will be more efficient than an
> index scan, which is wrong.  To correct this behavior, adjust the query
> planner settings for the appropriate table/column with this command:
>
> alter table "dspam_token_data" alter "token" set statistics 200; analyze;
>
> Let me know if it help you.  It worked wonders for me.

That makes a great deal of sense; the number of tokens are likely to
be rather larger than 10, and are likely to be quite unevenly
distributed.  That fits with the need you found to collect more
statistics on that column.

Other cases where it seems plausible that it would be worthwhile to do
the same:

  alter table dspam_signature_data alter signature set statistics 200;
  alter table dspam_neural_data alter node set statistics 200;
  alter table dspam_neural_decisions alter signature set statistics 200;

Lionel's suggestion of having a functional index on dspam_token_data
(innocent_hits + spam_hits) also seems likely to be helpful.  Along
with that, it might prove necessary to alter stats on dspam_token_data
thus:

  alter table dspam_token_data alter innocent_hits set statistics 200;
  alter table dspam_token_data alter spam_hits set statistics 200;

None of these changes are likely to make things materially worse; if
they do help, they'll help rather a lot.
--
(format nil "~S@~S" "cbbrowne" "gmail.com")
http://www.ntlug.org/~cbbrowne/nonrdbms.html
Rules  of the  Evil  Overlord #112.  "I  will not  rely entirely  upon
"totally  reliable"  spells  that  can be  neutralized  by  relatively
inconspicuous talismans." <http://www.eviloverlord.com/>

Re: Postgres vs. DSpam

From
"Jim C. Nasby"
Date:
FWIW, those queries won't be able to use an index. A better WHERE clause
would be:

AND last_hit < CURRENT_DATE - 60

On Fri, Nov 26, 2004 at 02:37:12PM +1300, Andrew McMillan wrote:
> On Wed, 2004-11-24 at 14:14 +0100, Evilio del Rio wrote:
> > Hi,
> >
> > I have installed the dspam filter
> > (http://www.nuclearelephant.com/projects/dspam) on our mail server
> > (RedHat 7.3 Linux with sendmail 8.13 and procmail). I have ~300 users
> > with a quite low traffic of 4000 messages/day. So it's a quite common
> > platform/environment, nothing spectacular.
>
> I am using DSpam with PostgreSQL here.  I have a daily job that cleans
> the DSpam database up, as follows:
>
> DELETE FROM dspam_token_data
>   WHERE (innocent_hits*2) + spam_hits < 5
>   AND CURRENT_DATE - last_hit > 60;
>
> DELETE FROM dspam_token_data
>   WHERE innocent_hits = 1
>   AND CURRENT_DATE - last_hit > 30;
>
> DELETE FROM dspam_token_data
>   WHERE CURRENT_DATE - last_hit > 180;
>
> DELETE FROM dspam_signature_data
>   WHERE CURRENT_DATE - created_on > 14;
>
> VACUUM dspam_token_data;
>
> VACUUM dspam_signature_data;
>
>
>
> I also occasionally do a "VACUUM FULL ANALYZE;" on the database as well.
>
>
> In all honesty though, I think that MySQL is better suited to DSpam than
> PostgreSQL is.
>
>
> > Please, could anyone explain me this difference?
> > Is Postgres that bad?
> > Is MySQL that good?
> > Am I the only one to have observed this behavior?
>
> I believe that what DSpam does that is not well-catered for in the way
> PostgreSQL operates, is that it does very frequent updates to rows in
> (eventually) quite large tables.  In PostgreSQL the UPDATE will result
> internally in a new record being written, with the old record being
> marked as deleted.  That old record won't be re-used until after a
> VACUUM has run, and this means that the on-disk tables will have a lot
> of dead rows in them quite quickly.
>
> The reason that PostgreSQL operates this way, is a direct result of the
> way transactional support is implemented, and it may well change in a
> version or two.  It's got better over the last few versions, with things
> like pg_autovacuum, but that approach still doesn't suit some types of
> database updating.
>
> Cheers,
>                     Andrew.
> -------------------------------------------------------------------------
> Andrew @ Catalyst .Net .NZ  Ltd,  PO Box 11-053, Manners St,  Wellington
> WEB: http://catalyst.net.nz/            PHYS: Level 2, 150-154 Willis St
> DDI: +64(4)803-2201      MOB: +64(272)DEBIAN      OFFICE: +64(4)499-2267
>        These PRESERVES should be FORCE-FED to PENTAGON OFFICIALS!!
> -------------------------------------------------------------------------
>



--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"