Thread: Comparative performance

Comparative performance

From: Joe
I'm converting a relatively small database (2 MB) from MySQL to PostgreSQL.  It
is used to generate web pages using PHP.  Although the actual website runs under
Linux, the development is done under XP.  I've completed most of the data
conversion and rewrite of the PHP scripts, so now I'm comparing relative
performance.

It appears that PostgreSQL is two to three times slower than MySQL.  For
example, some pages that have some 30,000 characters (when saved as HTML) take 1
to 1 1/2 seconds with MySQL but 3 to 4 seconds with PostgreSQL.  I had read that
the former was generally faster than the latter, particularly for simple web
applications but I was hoping that Postgres' performance would not be that
noticeably slower.

I'm trying to determine if the difference can be attributed to anything that
I've done or missed.  I've run VACUUM ANALYZE on the two main tables and I'm
looking at the results of EXPLAIN on the query that drives the retrieval of
probably 80% of the data for the pages in question.
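
For reference, those commands look like this (entry and topic are the two main
tables, which are named later in the thread):

VACUUM ANALYZE entry;   -- reclaims dead row versions and refreshes planner statistics
VACUUM ANALYZE topic;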

Before I post the EXPLAIN and the table schema I'd appreciate confirmation that
this list is the appropriate forum.  I'm a relative newcomer to PostgreSQL (but
not to relational databases), so I'm not sure if this belongs in the novice or
general lists.

Joe


Re: Comparative performance

From: Gavin Sherry
On Wed, 28 Sep 2005, Joe wrote:

> I'm converting a relatively small database (2 MB) from MySQL to PostgreSQL.  It
> is used to generate web pages using PHP.  Although the actual website runs under
> Linux, the development is done under XP.  I've completed most of the data
> conversion and rewrite of the PHP scripts, so now I'm comparing relative
> performance.
>
> It appears that PostgreSQL is two to three times slower than MySQL.  For
> example, some pages that have some 30,000 characters (when saved as HTML) take 1
> to 1 1/2 seconds with MySQL but 3 to 4 seconds with PostgreSQL.  I had read that
> the former was generally faster than the latter, particularly for simple web
> applications but I was hoping that Postgres' performance would not be that
> noticeably slower.

Are you comparing PostgreSQL on XP to MySQL on XP or PostgreSQL on Linux
to MySQL on Linux? Our performance on XP is not great. Also, which version
of PostgreSQL are you using?

>
> I'm trying to determine if the difference can be attributed to anything that
> I've done or missed.  I've run VACUUM ANALYZE on the two main tables and I'm
> looking at the results of EXPLAIN on the query that drives the retrieval of
> probably 80% of the data for the pages in question.

Good.

>
> Before I post the EXPLAIN and the table schema I'd appreciate confirmation that
> this list is the appropriate forum.  I'm a relative newcomer to PostgreSQL (but
> not to relational databases), so I'm not sure if this belongs in the novice or
> general lists.

You can post the results of EXPLAIN ANALYZE here. Please include the schema
definitions and the query string(s) themselves.
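
For example, in psql (the table and column names here are only placeholders):

EXPLAIN ANALYZE SELECT * FROM some_table WHERE some_column = 42;  -- plan with actual row counts and times
\d some_table                                                     -- prints the table definition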

Thanks,

Gavin

Re: Comparative performance

From: Dennis Bjorklund
On Wed, 28 Sep 2005, Joe wrote:

> Before I post the EXPLAIN and the table schema I'd appreciate
> confirmation that this list is the appropriate forum.

It is, and useful things to show are:

 * the slow query
 * EXPLAIN ANALYZE of the query
 * the output of \d for each table involved in the query
 * the output of SHOW ALL;
 * the amount of memory the machine has

The settings that are most important to tune in postgresql.conf for
performance are, in my opinion: shared_buffers, effective_cache_size and
(to a lesser extent) work_mem.
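
A quick way to see what the server is currently using, from psql (a sketch;
the commented values are just the 8.0 defaults and vary by installation):

SHOW shared_buffers;        -- e.g. 1000 (8 kB pages, i.e. ~8 MB) out of the box
SHOW effective_cache_size;  -- e.g. 1000 pages; often far too low for a dedicated server
SHOW work_mem;              -- e.g. 1024 (kB)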

--
/Dennis Björklund


Re: Comparative performance

From: "Magnus Hagander"
> > It appears that PostgreSQL is two to three times slower than MySQL.
> > For example, some pages that have some 30,000 characters (when saved
> > as HTML) take 1 to 1 1/2 seconds with MySQL but 3 to 4 seconds with
> > PostgreSQL.  I had read that the former was generally faster than the
> > latter, particularly for simple web applications but I was hoping that
> > Postgres' performance would not be that noticeably slower.
>
> Are you comparing PostgreSQL on XP to MySQL on XP or
> PostgreSQL on Linux to MySQL on Linux? Our performance on XP
> is not great. Also, which version of PostgreSQL are you using?

That actually depends a lot on *how* you use it. I've seen pg-on-windows
deployments that come within a few percent of the Linux performance.
I've also seen those that are absolutely horrible by comparison.

One sure way to kill the performance is to make a lot of small
connections. Using persistent connections is even more important on
Windows than it is on Unix. It could easily explain a difference like
this.

//Magnus

Re: Comparative performance

From: PFC
> It appears that PostgreSQL is two to three times slower than MySQL.  For
> example, some pages that have some 30,000 characters (when saved as
> HTML) take 1 to 1 1/2 seconds with MySQL but 3 to 4 seconds with
> PostgreSQL.  I had read that the former was generally faster than the
> latter, particularly for simple web applications but I was hoping that
> Postgres' performance would not be that noticeably slower.

    From my experience, the postgres libraries in PHP are a piece of crap,
and add a lot of overhead even for small queries.
    For instance, a simple query like "SELECT * FROM table WHERE
primary_key_id=1234" can take the following time, on my laptop, with data
in the filesystem cache of course:

EXPLAIN ANALYZE       <0.1 ms
python + psycopg 2     0.1 ms (damn fast)
php + mysql            0.3 ms
php + postgres         1-2 ms (damn slow)

    So, if your pages are designed in The PHP Way (i.e., a large number of
small queries), I might suggest using a language with a decent postgres
interface (python, among others), or rewriting your bunches of small
queries as Stored Procedures or Joins, which will provide large speedups.
Doing >50 queries on a page is always a bad idea, but it's tolerable in
php-mysql, not in php-postgres.
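
A sketch of the join idea (the topic and entry tables are the ones Joe posts
later in the thread; the ids are made up):

SELECT t.topic_name, e.title
  FROM topic t
  JOIN entry e ON e.subject_id = t.topic_id
 WHERE t.topic_id IN (1, 2, 3);  -- one round trip instead of three separate queries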

    If it's only one large query, then there is a real problem, as postgres
is usually a lot smarter about query optimization.

    If you use the usual mysql techniques (like storing a page counter in a
row in a table, or storing sessions in a table), beware: these are no-nos
for postgres.  These things should NOT be done with a database anyway; try
memcached, for instance.

Re: Comparative performance

From: Joe
Magnus Hagander wrote:
> That actually depends a lot on *how* you use it. I've seen pg-on-windows
> deployments that come within a few percent of the Linux performance.
> I've also seen those that are absolutely horrible by comparison.
>
> One sure way to kill the performance is to make a lot of small
> connections. Using persistent connections is even more important on
> Windows than it is on Unix. It could easily explain a difference like
> this.

I just tried using pg_pconnect() and I didn't notice any significant
improvement.  What bothers me most is that with Postgres I tend to see jerky
behavior on almost every page:  the upper 1/2 or 2/3 of the page is displayed
first and you can see a blank bottom (or you can see a half-filled completion
bar).  With MySQL each page is generally displayed in one swoop.

Joe


Re: Comparative performance

From: "Steinar H. Gunderson"
On Thu, Sep 29, 2005 at 08:16:11AM -0400, Joe wrote:
> I just tried using pg_pconnect() and I didn't notice any significant
> improvement.

PHP persistent connections are not really persistent -- or so I've been told.

Anyhow, what was discussed here was pg_query, not pg_connect. You really want
to reduce the number of pg_query() calls in any case; you haven't told us how
many there are yet, but it sounds like there are a lot of them.

> What bothers me most is that with Postgres I tend to see jerky behavior on
> almost every page:  the upper 1/2 or 2/3 of the page is displayed first and
> you can see a blank bottom (or you can see a half-filled completion bar).
> With MySQL each page is generally displayed in one swoop.

This might just be your TCP/IP stack deciding that the rest of the page
isn't likely to come anytime soon, and sending out what it has so far... or
something else. I wouldn't put too much weight on it; it's likely to go away
as soon as you fix the rest of the problem.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Comparative performance

From: Gavin Sherry
On Thu, 29 Sep 2005, Joe wrote:

> Magnus Hagander wrote:
> > That actually depends a lot on *how* you use it. I've seen pg-on-windows
> > deployments that come within a few percent of the Linux performance.
> > I've also seen those that are absolutely horrible by comparison.
> >
> > One sure way to kill the performance is to make a lot of small
> > connections. Using persistent connections is even more important on
> > Windows than it is on Unix. It could easily explain a difference like
> > this.
>
> I just tried using pg_pconnect() and I didn't notice any significant
> improvement.  What bothers me most is that with Postgres I tend to see jerky
> behavior on almost every page:  the upper 1/2 or 2/3 of the page is displayed
> first and you can see a blank bottom (or you can see a half-filled completion
> bar).  With MySQL each page is generally displayed in one swoop.

Please post the table definitions, queries and explain analyze results so
we can tell you why the performance is poor.

Gavin

Re: Comparative performance

From: Joe
PFC wrote:
>     From my experience, the postgres libraries in PHP are a piece of
> crap, and add a lot of overhead even for small queries.
>     For instance, a simple query like "SELECT * FROM table WHERE
> primary_key_id=1234" can take the following time, on my laptop, with
> data in the filesystem cache of course:
>
> EXPLAIN ANALYZE       <0.1 ms
> python + psycopg 2     0.1 ms (damn fast)
> php + mysql            0.3 ms
> php + postgres         1-2 ms (damn slow)

As a Trac user I was considering moving to Python, so it's good to know that,
but the rewrite is a longer-term project.

>     So, if your pages are designed in The PHP Way (i.e., a large number
> of small queries), I might suggest using a language with a decent
> postgres interface (python, among others), or rewriting your bunches of
> small queries as Stored Procedures or Joins, which will provide large
> speedups.  Doing >50 queries on a page is always a bad idea, but it's
> tolerable in php-mysql, not in php-postgres.

The pages do use a number of queries to collect all the data for display but
nowhere near 50.  I'd say it's probably less than a dozen.  As an aside, one of
my tasks (before the conversion) was to analyze the queries and see where they
could be tweaked for performance, but with MySQL that was never a top priority.

The schema is fairly simple having two main tables: topic and entry (sort of
like account and transaction in an accounting scenario).  There are two
additional tables that perhaps could be merged into the entry table (and that
would reduce the number of queries) but I do not want to make major changes to
the schema (and the app) for the PostgreSQL conversion.

Joe


Re: Comparative performance

From: Joe
Gavin Sherry wrote:
> Please post the table definitions, queries and explain analyze results so
> we can tell you why the performance is poor.

I did try to post that last night but apparently my reply didn't make it to the
list.  Here it is again:

Matthew Nuzum wrote:

 > This is the right list. Post detail and I'm sure you'll get some suggestions.


Thanks, Matthew (and Chris and Gavin).

The main table used in the query is defined as follows:

CREATE TABLE entry (
   entry_id serial PRIMARY KEY,
   title VARCHAR(128) NOT NULL,
   subtitle VARCHAR(128),
   subject_type SMALLINT,
   subject_id INTEGER REFERENCES topic,
   actor_type SMALLINT,
   actor_id INTEGER REFERENCES topic,
   actor VARCHAR(64),
   actor_role VARCHAR(64),
   rel_entry_id INTEGER,
   rel_entry VARCHAR(64),
   description VARCHAR(255),
   quote text,
   url VARCHAR(255),
   entry_date CHAR(10),
   created DATE NOT NULL DEFAULT CURRENT_DATE,
   updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP)
WITHOUT OIDS;
CREATE INDEX entry_actor_id ON entry (actor_id);
CREATE INDEX entry_subject_id ON entry (subject_id);

It has 3422 rows at this time.

The query for one of the pages is the following:

SELECT entry_id, subject_type AS type, subject_type, subject_id, actor_type,
actor_id, actor, actor_role, rel_entry_id, rel_entry, title, subtitle,
description, url, quote AS main_quote, NULL AS rel_quote, substring(entry_date
from 8) AS dom, substring(entry_date from 1) AS date_ymd, substring(entry_date
from 1 for 7) AS date_ym, substring(entry_date from 1 for 4) AS date_y, created,
updated FROM entry WHERE subject_id = 1079
UNION SELECT entry_id, actor_type AS type, subject_type, subject_id, actor_type,
actor_id, actor, actor_role, rel_entry_id, rel_entry, title, subtitle,
description, url, quote AS main_quote, NULL AS rel_quote, substring(entry_date
from 8) AS dom, substring(entry_date from 1) AS date_ymd, substring(entry_date
from 1 for 7) AS date_ym, substring(entry_date from 1 for 4) AS date_y, created,
updated FROM entry WHERE actor_id = 1079 ORDER BY type, title, subtitle;

The output of EXPLAIN ANALYZE is:

  Sort  (cost=158.98..159.14 rows=62 width=568) (actual time=16.000..16.000
rows=59 loops=1)
    Sort Key: "type", title, subtitle
    ->  Unique  (cost=153.57..157.14 rows=62 width=568) (actual
time=16.000..16.000 rows=59 loops=1)
          ->  Sort  (cost=153.57..153.73 rows=62 width=568) (actual
time=16.000..16.000 rows=59 loops=1)
                Sort Key: entry_id, "type", subject_type, subject_id,
actor_type, actor_id, actor, actor_role, rel_entry_id, rel_entry, title,
subtitle, description, url, main_quote, rel_quote, dom, date_ymd, date_ym,
date_y, created, updated
                ->  Append  (cost=0.00..151.73 rows=62 width=568) (actual
time=0.000..16.000 rows=59 loops=1)
                      ->  Subquery Scan "*SELECT* 1"  (cost=0.00..17.21 rows=4
width=568) (actual time=0.000..0.000 rows=3 loops=1)
                            ->  Index Scan using entry_subject_id on entry
(cost=0.00..17.17 rows=4 width=568) (actual time=0.000..0.000 rows=3 loops=1)
                                  Index Cond: (subject_id = 1079)
                      ->  Subquery Scan "*SELECT* 2"  (cost=0.00..134.52 rows=58
width=568) (actual time=0.000..16.000 rows=56 loops=1)
                            ->  Seq Scan on entry  (cost=0.00..133.94 rows=58
width=568) (actual time=0.000..16.000 rows=56 loops=1)
                                  Filter: (actor_id = 1079)
  Total runtime: 16.000 ms
(13 rows)

What I don't quite understand is why it's doing a sequential scan on actor_id
instead of using the entry_actor_id index.  Note that actor_id has 928 non-null
values (27%), whereas subject_id has 3089 non-null values (90%).
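
One way to test whether the index would actually be faster (a sketch to run in
psql; enable_seqscan merely discourages sequential scans for the current
session):

SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM entry WHERE actor_id = 1079;
SET enable_seqscan = on;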

Note that the entry_date column was originally a MySQL date but it had partial
dates, i.e., some days and months are set to zero.  Eventually I hope to define
a PostgreSQL datatype for it and to simplify the substring retrievals.  However,
I don't think the extra computational time should affect the overall runtime
significantly.

Gavin, I'm using PostgreSQL 8.0.3, Apache 1.3.28, PHP 4.3.4, MySQL 4.0.16 and
I'm comparing both databases on XP (on a Pentium 4, 1.6 GHz, 256 MB RAM).

Thanks for any feedback.

Joe


Re: Comparative performance

From: Andreas Pflug
Joe wrote:

>
>
> The pages do use a number of queries to collect all the data for display
> but nowhere near 50.  I'd say it's probably less than a dozen.
>
> The schema is fairly simple having two main tables: topic and entry
> (sort of like account and transaction in an accounting scenario).  There
> are two additional tables that perhaps could be merged into the entry
> table

Hm, if you only have 4 tables, why do you need 12 queries?
To reduce queries, join them in the query; no need to merge them
physically. If you have only two main tables, I'd bet you only need 1-2
queries for the whole page.

Regards,
Andreas

Re: Comparative performance

From: PFC

> I just tried using pg_pconnect() and I didn't notice any significant
> improvement.  What bothers me most is that with Postgres I tend to see
> jerky behavior on almost every page:  the upper 1/2 or 2/3 of the page
> is displayed first and you can see a blank bottom (or you can see a
> half-filled completion bar).  With MySQL each page is generally
> displayed in one swoop.

    Persistent connections are useful when your page is fast and the
connection time is an important part of your page time. It is mandatory if
you want to serve more than 20-50 hits/s without causing unnecessary load
on the database. This is not your case, which is why you don't notice any
improvement...

Re: Comparative performance

From: PFC
>   Total runtime: 16.000 ms

    Even though this query isn't that optimized, it's still only 16
milliseconds.
    Why does it take this long for PHP to get the results?

    Can you try pg_query'ing this exact same query, FROM PHP, and timing it
with getmicrotime()?

    You can even do an EXPLAIN ANALYZE from pg_query and display the results
in your webpage, to check how long the query takes on the server.

    You can also try it on a Linux box.

    This smells like a TCP communication problem.

Re: Comparative performance

From: Joe
Andreas Pflug wrote:
> Hm, if you only have 4 tables, why do you need 12 queries?
> To reduce queries, join them in the query; no need to merge them
> physically. If you have only two main tables, I'd bet you only need 1-2
> queries for the whole page.

There are more than four tables and the queries are not functionally
overlapping.  As an example, allow me to refer to the page
www.freedomcircle.com/topic.php/Economists.

The top row of navigation buttons (Life, Liberty, etc.) is created from a query
of the 'topic' table.  It could've been hard-coded as a PHP array, but with less
flexibility.  The alphabetical links are from a SELECT DISTINCT substring from
topic.  It could've been generated by a PHP for loop (originally implemented
that way) but again with less flexibility.  The listing of economists is another
SELECT from topic.  The subheadings (Articles, Books) come from a SELECT of an
entry_type table (which currently has 70 rows) and are read into a PHP array
since we don't know what headings will be used in a given page.  The detail of
the entries comes from that query that I posted earlier, but there are three
additional queries that are used for specialized entry types: relationships
between topics (e.g., Prof. Williams teaches at George Mason), events, and
multi-author or multi-subject articles and books.  And there's yet another
table for the specific book information.  Once the data is retrieved it's sorted
internally with PHP, at the heading level, before display.

Maybe there is some way to merge all the queries (some already fairly complex)
that fetch the data for the entries box but I believe it would be a monstrosity
with over 100 lines of SQL.

Thanks,

Joe


Re: Comparative performance

From: Joe
PFC wrote:
>     Even though this query isn't that optimized, it's still only 16
> milliseconds.
>     Why does it take this long for PHP to get the results?
>
>     Can you try pg_query'ing this exact same query, FROM PHP, and timing
> it with getmicrotime()?

Thanks, that's what I was looking for.  It's microtime(), BTW.  It'll take me
some time to instrument it, but that way I can pinpoint what is really slow.

>     You can even do an EXPLAIN ANALYZE from pg_query and display the
> results  in your webpage, to check how long the query takes on the server.
>
>     You can also try it on a Linux box.

My current host only supports MySQL.  I contacted hub.org to see if they could
assist in this transition but I haven't heard back.

>     This smells like a TCP communication problem.

I'm puzzled by that remark.  How much does TCP get into the picture in a local
Windows client/server environment?

Joe


Re: Comparative performance

From: "Magnus Hagander"
> >     This smells like a TCP communication problem.
>
> I'm puzzled by that remark.  How much does TCP get into the
> picture in a local Windows client/server environment?

Windows has no Unix Domain Sockets (no surprise there), so TCP
connections over the loopback interface are used to connect to the
server.

//Magnus

Re: Comparative performance

From: Joe
PFC wrote:
>     Even though this query isn't that optimized, it's still only 16
> milliseconds.
>     Why does it take this long for PHP to get the results?
>
>     Can you try pg_query'ing this exact same query, FROM PHP, and timing
> it with getmicrotime()?

That query took about 27 msec in actual PHP execution time.  It turns out the
real culprit is the following query, which interestingly enough retrieves zero
rows in the case of the Economists page that I've been using for testing, yet it
uses up about 1370 msec in actual runtime:

SELECT topic_id1, topic_id2, topic_name, categ_id, list_name, t.title, url,
page_type, rel_type, inverse_id, r.description AS rel_descrip, r.created, r.updated
FROM relationship r, topic t, entry_type e
WHERE ((topic_id1 = topic_id AND topic_id2 = 1252) OR (topic_id2 = topic_id and
topic_id1 = 1252)) AND rel_type = type_id AND e.class_id = 2
ORDER BY rel_type, list_name;

The EXPLAIN ANALYZE output, after I ran VACUUM ANALYZE on the three tables, is:

  Sort  (cost=4035.55..4035.56 rows=1 width=131) (actual time=2110.000..2110.000
rows=0 loops=1)
    Sort Key: r.rel_type, t.list_name
    ->  Nested Loop  (cost=36.06..4035.54 rows=1 width=131) (actual
time=2110.000..2110.000 rows=0 loops=1)
          Join Filter: ((("inner".topic_id1 = "outer".topic_id) AND
("inner".topic_id2 = 1252)) OR (("inner".topic_id2 = "outer".topic_id) AND
("inner".topic_id1 = 1252)))
          ->  Seq Scan on topic t  (cost=0.00..38.34 rows=1234 width=90) (actual
time=0.000..15.000 rows=1234 loops=1)
          ->  Materialize  (cost=36.06..37.13 rows=107 width=45) (actual
time=0.000..0.509 rows=466 loops=1234)
                ->  Merge Join  (cost=30.31..35.96 rows=107 width=45) (actual
time=0.000..0.000 rows=466 loops=1)
                      Merge Cond: ("outer".type_id = "inner".rel_type)
                      ->  Index Scan using entry_type_pkey on entry_type e  (cost
=0.00..3.94 rows=16 width=4) (actual time=0.000..0.000 rows=15 loops=1)
                            Filter: (class_id = 2)
                      ->  Sort  (cost=30.31..31.48 rows=466 width=43) (actual
time=0.000..0.000 rows=466 loops=1)
                            Sort Key: r.rel_type
                            ->  Seq Scan on relationship r  (cost=0.00..9.66
rows=466 width=43) (actual time=0.000..0.000 rows=466 loops=1)
  Total runtime: 2110.000 ms
(14 rows)

The tables are as follows:

CREATE TABLE entry_type (
   type_id SMALLINT NOT NULL PRIMARY KEY,
   title VARCHAR(32) NOT NULL,
   rel_title VARCHAR(32),
   class_id SMALLINT NOT NULL DEFAULT 1,
   inverse_id SMALLINT,
   updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP)
WITHOUT OIDS;

CREATE TABLE topic (
   topic_id serial PRIMARY KEY,
   topic_name VARCHAR(48) NOT NULL UNIQUE,
   categ_id SMALLINT NOT NULL,
   parent_entity INTEGER,
   parent_concept INTEGER,
   crossref_id INTEGER,
   list_name VARCHAR(80) NOT NULL,
   title VARCHAR(80),
   description VARCHAR(255),
   url VARCHAR(64),
   page_type SMALLINT NOT NULL,
   dark_ind BOOLEAN NOT NULL DEFAULT FALSE,
   ad_code INTEGER,
   created DATE NOT NULL DEFAULT CURRENT_DATE,
   updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP)
WITHOUT OIDS;

CREATE TABLE relationship (
   topic_id1 INTEGER NOT NULL REFERENCES topic,
   topic_id2 INTEGER NOT NULL REFERENCES topic,
   rel_type INTEGER NOT NULL,
   description VARCHAR(255),
   created DATE NOT NULL DEFAULT CURRENT_DATE,
   updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
   PRIMARY KEY (topic_id1, topic_id2, rel_type))
WITHOUT OIDS;

I'm thinking that perhaps I need to set up another index with topic_id2 first
and topic_id1 second.  In addition, an index on entry_type.class_id may improve
things.  Another possibility would be to rewrite the query as a UNION.

Of course, this doesn't explain how MySQL manages to execute the query in about
9 msec.  The only minor differences in the schema are:  entry_type.title and
rel_title are char(32) in MySQL, entry_type.class_id is a tinyint, and
topic.categ_id, page_type and dark_ind are also tinyints. MySQL also doesn't
have the REFERENCES.

A couple of interesting side notes from my testing.  First is that pg_connect()
took about 39 msec but mysql_connect() took only 4 msec; however, pg_pconnect()
took 0.14 msec while mysql_pconnect() took 0.99 msec (all tests were repeated
five times and the quoted results are averages).  Second is that PostgreSQL's
performance appears to be much more consistent in certain queries.  For example,
the query that retrieves the list of subtopics (the names and descriptions of
economists) took 17 msec in PG, with a low of 15 (three times) and a high of
21, whereas MySQL took 60 msec on average but had a low of 22 and a high of 102
msec.

Joe


Re: Comparative performance

From: PFC
It's more understandable if the table names are in front of the column
names:

SELECT relationship.topic_id1, relationship.topic_id2, topic.topic_name,
topic.categ_id, topic.list_name, topic.title,
topic.url, topic.page_type, relationship.rel_type, entry_type.inverse_id,
relationship.description AS rel_descrip,
relationship.created, relationship.updated
 FROM relationship, topic, entry_type
WHERE
    ((relationship.topic_id1 = topic.topic_id AND relationship.topic_id2 =
1252)
    OR (relationship.topic_id2 =  topic.topic_id and relationship.topic_id1 =
1252))
    AND    relationship.rel_type = entry_type.type_id
    AND    entry_type.class_id = 2
    ORDER BY rel_type, list_name;

I see a few problems in your schema.
- topic_id1 and topic_id2 play the same role; there is no constraint to
determine which is which, hence it is possible to define the same relation
twice.
- as you search on two columns with OR, you need UNION to use indexes.
- lack of indexes
- I don't understand why the planner doesn't pick up hash joins...
- if you use a version before 8, type mismatch will prevent use of the
indexes.

I'd suggest rewriting the query like this:
SELECT topic.*, foo.* FROM
topic,
(SELECT topic_id2 as fetch_id, topic_id1, topic_id2, rel_type, description
as rel_descrip, created, updated
    FROM relationship
    WHERE
        rel_type IN (SELECT type_id FROM entry_type WHERE class_id = 2)
    AND    topic_id1 = 1252
UNION
SELECT topic_id1 as fetch_id, topic_id1, topic_id2, rel_type, description
as rel_descrip, created, updated
    FROM relationship
    WHERE
        rel_type IN (SELECT type_id FROM entry_type WHERE class_id = 2)
    AND    topic_id2 = 1252)
AS foo
WHERE topic.topic_id = foo.fetch_id


CREATE INDEX'es ON
entry_type( class_id )

relationship( topic_id1, rel_type, topic_id2 )    which becomes your new
PRIMARY KEY
relationship( topic_id2, rel_type, topic_id1 )
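
Spelled out as complete statements, that would be something like the following
(the index names are placeholders, and the DROP assumes the default
primary-key constraint name):

CREATE INDEX entry_type_class_id ON entry_type (class_id);

ALTER TABLE relationship DROP CONSTRAINT relationship_pkey;
ALTER TABLE relationship ADD PRIMARY KEY (topic_id1, rel_type, topic_id2);

CREATE INDEX relationship_topic_id2 ON relationship (topic_id2, rel_type, topic_id1);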

> Of course, this doesn't explain how MySQL manages to execute the query
> in about 9 msec.  The only minor differences in the schema are:
> entry_type.title and rel_title are char(32) in MySQL,
> entry_type.class_id is a tinyint, and topic.categ_id, page_type and
> dark_ind are also tinyints. MySQL also doesn't have the REFERENCES.

Can you post the result from MySQL's EXPLAIN?

You might be interested in the following code. Just replace mysql_ with pg_;
it's quite useful.

$global_queries_log = array();

function _getmicrotime()    {    list($u,$s) = explode(' ',microtime());    return
$u+$s;    }

/*    Formats query, with given arguments, escaping all strings as needed.
    db_quote_query( 'UPDATE junk SET a=%s WHERE b=%s', array( 1,"po'po" ) )
    =>    UPDATE junk SET a='1' WHERE b='po\'po'
*/
function db_quote_query( $sql, $params=false )
{
    // if no params, send query raw
    if( !$params )
        return $sql;

    // quote params
    foreach( $params as $key => $val )
    {
        if( is_array( $val ))
            $val = implode( ',', $val );
        $params[$key] = "'".mysql_real_escape_string( $val )."'";
    }
    return vsprintf( $sql, $params );
}

/*    Formats query, with given arguments, escaping all strings as needed.
    Runs query, logging its execution time.
    Returns the query result, or dies with an error.
*/
function db_query( $sql, $params=false )
{
    // it's already a query
    if( is_resource( $sql ))
        return $sql;

    $sql = db_quote_query( $sql, $params );

    $t = _getmicrotime();
    $r = mysql_query( $sql );
    if( !$r )
    {
        echo "<div class=bigerror><b>Erreur MySQL
:</b><br>".mysql_error()."<br><br><b>Requ􏻪te</b>
:<br>".$sql."<br><br><b>Traceback </b>:<pre>";
        foreach( debug_backtrace() as $t ) xdump( $t );
        echo "</pre></div>";
        die();
    }
    global $global_queries_log;
    $global_queries_log[] = array( _getmicrotime()-$t, $sql );
    return $r;
}

At the end of your page, display the contents of $global_queries_log.




Re: Comparative performance

From: "Jim C. Nasby"
On Thu, Sep 29, 2005 at 08:44:16AM -0400, Joe wrote:
> CREATE TABLE entry (
>   entry_id serial PRIMARY KEY,
>   title VARCHAR(128) NOT NULL,
>   subtitle VARCHAR(128),
>   subject_type SMALLINT,
>   subject_id INTEGER REFERENCES topic,
>   actor_type SMALLINT,
>   actor_id INTEGER REFERENCES topic,
>   actor VARCHAR(64),
>   actor_role VARCHAR(64),
>   rel_entry_id INTEGER,
>   rel_entry VARCHAR(64),
>   description VARCHAR(255),
>   quote text,
>   url VARCHAR(255),
>   entry_date CHAR(10),
>   created DATE NOT NULL DEFAULT CURRENT_DATE,
>   updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP)
> WITHOUT OIDS;
> CREATE INDEX entry_actor_id ON entry (actor_id);
> CREATE INDEX entry_subject_id ON entry (subject_id);

A few tips...

Fields in PostgreSQL have alignment requirements, so the smallints
aren't saving you anything right now. If you put both of them together
though, you'll save 4 bytes on most hardware.

You'll also get some minor gains from putting all the variable-length
fields at the end, as well as nullable fields. If you search the
archives for 'field order' you should be able to find some useful info.
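
A sketch of Joe's entry table reordered along those lines (identical columns
and types, only the order changes; any actual savings would have to be
measured):

CREATE TABLE entry (
   entry_id serial PRIMARY KEY,
   created DATE NOT NULL DEFAULT CURRENT_DATE,
   updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
   subject_id INTEGER REFERENCES topic,
   actor_id INTEGER REFERENCES topic,
   rel_entry_id INTEGER,
   subject_type SMALLINT,          -- the two smallints are now adjacent,
   actor_type SMALLINT,            -- sharing a single 4-byte aligned slot
   title VARCHAR(128) NOT NULL,    -- variable-length and nullable fields last
   subtitle VARCHAR(128),
   actor VARCHAR(64),
   actor_role VARCHAR(64),
   rel_entry VARCHAR(64),
   description VARCHAR(255),
   quote text,
   url VARCHAR(255),
   entry_date CHAR(10))
WITHOUT OIDS;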

Make sure these indexes exist if you'll be updating or inserting into
entry:

CREATE INDEX topic__subject_id ON topic(subject_id);
CREATE INDEX topic__actor_id ON topic(actor_id);

Also, the fact that subject and actor both point to topic along with
subject_type and actor_type make me suspect that your design is
de-normalized. Of course there's no way to know without more info.

FWIW, I usually use timestamptz for both created and updated fields.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Comparative performance

From: "Jim C. Nasby"
On Thu, Sep 29, 2005 at 04:39:36PM -0400, Joe wrote:
> Andreas Pflug wrote:
> >Hm, if you only have 4 tables, why do you need 12 queries?
> >To reduce queries, join them in the query; no need to merge them
> >physically. If you have only two main tables, I'd bet you only need 1-2
> >queries for the whole page.
>
> There are more than four tables and the queries are not functionally
> overlapping.  As an example, allow me to refer to the page
> www.freedomcircle.com/topic.php/Economists.
>
> The top row of navigation buttons (Life, Liberty, etc.) is created from a
> query of the 'topic' table.  It could've been hard-coded as a PHP array,
> but with less flexibility.  The alphabetical links are from a SELECT
> DISTINCT substring from topic.  It could've been generated by a PHP for
> loop (originally implemented that way) but again with less flexibility.
> The listing of economists is another SELECT from topic.  The subheadings
> (Articles, Books) come from a SELECT of an entry_type table (which
> currently has 70 rows) and are read into a PHP array since we don't know
> what headings will be used in a given page.  The detail of the entries

I suspect this might be something better done in a join.

> comes from that query that I posted earlier, but there are three additional
> queries that are used for specialized entry types: relationships between
> topics (e.g., Prof. Williams teaches at George Mason), events, and
> multi-author or multi-subject articles and books).  And there's yet another

Likewise...

> table for the specific book information.  Once the data is retrieved it's
> sorted internally with PHP, at the heading level, before display.

It's often better to let the database sort and/or aggregate data.

> Maybe there is some way to merge all the queries (some already fairly
> complex) that fetch the data for the entries box but I believe it would be
> a monstrosity with over 100 lines of SQL.

Also, just because no one else has mentioned it, remember that it's very
easy to get MySQL into a mode where you have no data integrity. If
that's the case it's going to be faster than PostgreSQL (though I'm not
sure how much that affects the performance of SELECTs).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Comparative performance

From: Joe
Hi Jim,

Jim C. Nasby wrote:
> Also, just because no one else has mentioned it, remember that it's very
> easy to get MySQL into a mode where you have no data integrity. If
> that's the case it's going to be faster than PostgreSQL (though I'm not
> sure how much that affects the performance of SELECTs).

Yes indeed.  When I added the REFERENCES to the schema and reran the conversion
scripts, aside from having to reorder the table creation and loading (they used
to be in alphabetical order), I also found a few referential integrity errors in
the MySQL data.

Joe


Re: Comparative performance

From: Joe
PFC wrote:
> - if you use a version before 8, type mismatch will prevent use of the
> indexes.

I'm using 8.0.3, but the type mismatch between relationship.rel_type and
entry_type.type_id was unintended.  The current databases use SMALLINT for both.
  The PostgreSQL schema was derived from an export script stored in Subversion,
apparently before the column datatypes were changed.

> CREATE INDEX'es ON
> entry_type( class_id )
>
> relationship( topic_id1, rel_type, topic_id2 )    which becomes your
> new  PRIMARY KEY
> relationship( topic_id2, rel_type, topic_id1 )

Creating the second relationship index was sufficient to modify the query plan
and cut the runtime down to zero:

  Sort  (cost=75.94..75.95 rows=2 width=381) (actual time=0.000..0.000 rows=0
loops=1)
    Sort Key: r.rel_type, t.list_name
    ->  Nested Loop  (cost=16.00..75.93 rows=2 width=381) (actual
time=0.000..0.000 rows=0 loops=1)
          Join Filter: ((("outer".topic_id1 = "inner".topic_id) AND
("outer".topic_id2 = 1252)) OR (("outer".topic_id2 = "inner".topic_id) AND
("outer".topic_id1 = 1252)))
          ->  Nested Loop  (cost=16.00..35.11 rows=1 width=169) (actual
time=0.000..0.000 rows=0 loops=1)
                Join Filter: ("inner".rel_type = "outer".type_id)
                ->  Seq Scan on entry_type e  (cost=0.00..18.75 rows=4 width=4)
(actual time=0.000..0.000 rows=15 loops=1)
                      Filter: (class_id = 2)
                ->  Materialize  (cost=16.00..16.04 rows=4 width=167) (actual
time=0.000..0.000 rows=0 loops=15)
                      ->  Seq Scan on relationship r  (cost=0.00..16.00 rows=4
width=167) (actual time=0.000..0.000 rows=0 loops=1)
                            Filter: ((topic_id2 = 1252) OR (topic_id1 = 1252))
          ->  Seq Scan on topic t  (cost=0.00..30.94 rows=494 width=216) (never
executed)
  Total runtime: 0.000 ms
(13 rows)

The overall execution time for the Economists page for PostgreSQL is within 4%
of the MySQL time, so for the time being I'll leave the query in its current form.

Thanks for your help.

Joe


Re: Comparative performance

From: Joe
Jim C. Nasby wrote:
> Make sure these indexes exist if you'll be updating or inserting into
> entry:
>
> CREATE INDEX topic__subject_id ON topic(subject_id);
> CREATE INDEX topic__actor_id ON topic(actor_id);

Actually, topic's primary key is topic_id.

> Also, the fact that subject and actor both point to topic along with
> subject_type and actor_type make me suspect that your design is
> de-normalized. Of course there's no way to know without more info.

Yes, the design is denormalized.  The reason is that a book or article is
usually by a single author (an "actor" topic) and it will be listed under one
main topic (a "subject" topic).  There's a topic_entry table where additional
actors and subjects can be added.

It's somewhat ironic because I used to teach and/or preach normalization and the
"goodness" of a 3NF+ design (also about having the database do aggregation and
sorting as you mentioned in your other email).

> FWIW, I usually use timestamptz for both created and updated fields.

IIRC 'created' ended up as a DATE because MySQL 4 has a restriction about a
single TIMESTAMP column per table taking the default value of current_timestamp.

Joe


Re: Comparative performance

From: "Jim C. Nasby"
On Tue, Oct 04, 2005 at 05:11:19PM -0400, Joe wrote:
> Hi Jim,
>
> Jim C. Nasby wrote:
> >Also, just because no one else has mentioned it, remember that it's very
> >easy to get MySQL into a mode where you have no data integrity. If
> >that's the case it's going to be faster than PostgreSQL (though I'm not
> >sure how much that affects the performance of SELECTs).
>
> Yes indeed.  When I added the REFERENCES to the schema and reran the
> conversion scripts, aside from having to reorder the table creation and
> loading (they used to be in alphabetical order), I also found a few
> referential integrity errors in the MySQL data.

Data integrity != referential integrity. :) It's very easy to
accidentally get MyISAM tables in MySQL, which means you are nowhere
near ACID, which also means you can't get anything close to an
apples-to-apples comparison to PostgreSQL.
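
One way to check which storage engine each table actually uses (a sketch;
MySQL 4.0 labels the column Type, later versions call it Engine):

SHOW TABLE STATUS;
-- 'MyISAM' in the Type/Engine column means no transactions and no
-- enforced foreign keys for that table
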
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Comparative performance

From: "Dario"
PostgreSQL uses MVCC to ensure data integrity. The server must choose the
right version of each tuple according to the transaction ID of the statement.
Even for a SELECT (of PostgreSQL's ACID features, I think C and I apply
here), it must do some extra work.


