Thread: PostgreSQL runs a query much slower than BDE and MySQL

PostgreSQL runs a query much slower than BDE and MySQL

From
"Peter Hardman"
Date:
I'm in the process of migrating a Paradox 7/BDE 5.01 database from single-user
Paradox to a web based interface to either MySQL or PostgreSQL.
The database is a pedigree sheep breed society database recording sheep and
flocks (amongst other things).

My current problem is with one table and an associated query which takes 10
times longer to execute on PostgreSQL than BDE, which in turn takes 10 times
longer than MySQL. The table links sheep to flocks and is created as follows:

CREATE TABLE SHEEP_FLOCK
(
  regn_no varchar(7) NOT NULL,
  flock_no varchar(6) NOT NULL,
  transfer_date date NOT NULL,
  last_changed date NOT NULL,
  CONSTRAINT SHEEP_FLOCK_pkey PRIMARY KEY (regn_no, flock_no,
transfer_date)
)
WITHOUT OIDS;
ALTER TABLE SHEEP_FLOCK OWNER TO postgres;

I then populate the table with

COPY SHEEP_FLOCK
FROM 'e:/ssbg/devt/devt/export_data/sheep_flock.txt'
WITH CSV HEADER

The table then has about 82000 records

The query I run is:

/* Select all sheep who's most recent transfer was into the subject flock */
SELECT DISTINCT f1.regn_no, f1.transfer_date as date_in
FROM SHEEP_FLOCK f1 JOIN
    /* The last transfer date for each sheep */
    (SELECT f.regn_no, MAX(f.transfer_date) as last_xfer_date
    FROM  SHEEP_FLOCK f
    GROUP BY f.regn_no) f2
ON f1.regn_no = f2.regn_no
WHERE f1.flock_no = '1359'
AND f1.transfer_date = f2.last_xfer_date

The sub-select on it's own returns about 32000 rows.

Using identically structured tables and the same primary key, if I run this on
Paradox/BDE it takes about 120ms, on MySQL (5.0.24, local server) about 3ms,
and on PostgresSQL (8.1.3, local server) about 1290ms). All on the same
Windows XP Pro machine with 512MB ram of which nearly half is free.

The query plan shows most of the time is spent sorting the 30000+ rows from the subquery, so I added a further
subquery as follows:

/* Select all sheep who's most recent transfer was into the subject flock */
SELECT DISTINCT f1.regn_no, f1.transfer_date as date_in
FROM SHEEP_FLOCK f1 JOIN
    /* The last transfer date for each sheep */
    (SELECT f.regn_no, MAX(f.transfer_date) as last_xfer_date
    FROM  SHEEP_FLOCK f
    WHERE f.regn_no IN
        /* Limit the rows extracted by the outer sub-query to those relevant to the
subject flock */
    /* This typically reduces the time from 1297ms to 47ms - from 35000 rows
to 127 rows */
    (SELECT s.regn_no FROM SHEEP_FLOCK s where s.flock_no = '1359')
    GROUP BY f.regn_no) f2
ON f1.regn_no = f2.regn_no
WHERE f1.flock_no = '1359'
AND f1.transfer_date = f2.last_xfer_date

then as the comment suggests I get a considerable improvement, but it's still an
order of magnitude slower than MySQL.

Can anyone suggest why PostgreSQL performs the original query so much slower than even BDE?
 --
Peter Hardman
Acre Cottage, Horsebridge
King's Somborne
Stockbridge
SO20 6PT

== Breeder of Shetland Cattle and Shetland Sheep ==


Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Arjen van der Meijden
Date:
On 16-8-2006 18:48, Peter Hardman wrote:
> Using identically structured tables and the same primary key, if I run this on
> Paradox/BDE it takes about 120ms, on MySQL (5.0.24, local server) about 3ms,
> and on PostgresSQL (8.1.3, local server) about 1290ms). All on the same
> Windows XP Pro machine with 512MB ram of which nearly half is free.

Is that with or without query caching? I.e. can you test it with SELECT
SQL_NO_CACHE ... ?
In a read-only environment it will still beat PostgreSQL, but as soon as
you'd get a read-write environment, MySQL's query cache is of less use.
So you should compare both the cached and non-cached version, if applicable.

Besides that, most advices on this list are impossible without the
result of 'explain analyze', so you should probably get that as well.

I'm not sure whether this is the same query, but you might want to try:
SELECT DISTINCT f1.regn_no, f1.transfer_date as date_in
FROM SHEEP_FLOCK f1
WHERE
f1.flock_no = '1359'
AND f1.transfer_date = (SELECT MAX(f.transfer_date) FROM SHEEP_FLOCK f
WHERE regn_no = f1.regn_no)

And you might need an index on (regn_no, transfer_date) and/or one
combined with that flock_no.

Best regards,

Arjen

Re: PostgreSQL runs a query much slower than BDE and MySQL

From
"Rodrigo De León"
Date:
On 8/16/06, Peter Hardman <peter@ssbg.zetnet.co.uk> wrote:
> I'm in the process of migrating a Paradox 7/BDE 5.01 database from single-user
> Paradox to a web based interface to either MySQL or PostgreSQL.
> The database is a pedigree sheep breed society database recording sheep and
> flocks (amongst other things).
>
> My current problem is with one table and an associated query which takes 10
> times longer to execute on PostgreSQL than BDE, which in turn takes 10 times
> longer than MySQL. The table links sheep to flocks and is created as follows:
>
> CREATE TABLE SHEEP_FLOCK
> (
>   regn_no varchar(7) NOT NULL,
>   flock_no varchar(6) NOT NULL,
>   transfer_date date NOT NULL,
>   last_changed date NOT NULL,
>   CONSTRAINT SHEEP_FLOCK_pkey PRIMARY KEY (regn_no, flock_no,
> transfer_date)
> )
> WITHOUT OIDS;
> ALTER TABLE SHEEP_FLOCK OWNER TO postgres;
>
> I then populate the table with
>
> COPY SHEEP_FLOCK
> FROM 'e:/ssbg/devt/devt/export_data/sheep_flock.txt'
> WITH CSV HEADER
>
> The table then has about 82000 records
>
> The query I run is:
>
> /* Select all sheep who's most recent transfer was into the subject flock */
> SELECT DISTINCT f1.regn_no, f1.transfer_date as date_in
> FROM SHEEP_FLOCK f1 JOIN
>     /* The last transfer date for each sheep */
>     (SELECT f.regn_no, MAX(f.transfer_date) as last_xfer_date
>     FROM  SHEEP_FLOCK f
>     GROUP BY f.regn_no) f2
> ON f1.regn_no = f2.regn_no
> WHERE f1.flock_no = '1359'
> AND f1.transfer_date = f2.last_xfer_date
>
> The sub-select on it's own returns about 32000 rows.
>
> Using identically structured tables and the same primary key, if I run this on
> Paradox/BDE it takes about 120ms, on MySQL (5.0.24, local server) about 3ms,
> and on PostgresSQL (8.1.3, local server) about 1290ms). All on the same
> Windows XP Pro machine with 512MB ram of which nearly half is free.
>
> The query plan shows most of the time is spent sorting the 30000+ rows from the subquery, so I added a further
> subquery as follows:
>
> /* Select all sheep who's most recent transfer was into the subject flock */
> SELECT DISTINCT f1.regn_no, f1.transfer_date as date_in
> FROM SHEEP_FLOCK f1 JOIN
>     /* The last transfer date for each sheep */
>     (SELECT f.regn_no, MAX(f.transfer_date) as last_xfer_date
>     FROM  SHEEP_FLOCK f
>     WHERE f.regn_no IN
>         /* Limit the rows extracted by the outer sub-query to those relevant to the
> subject flock */
>         /* This typically reduces the time from 1297ms to 47ms - from 35000 rows
> to 127 rows */
>         (SELECT s.regn_no FROM SHEEP_FLOCK s where s.flock_no = '1359')
>     GROUP BY f.regn_no) f2
> ON f1.regn_no = f2.regn_no
> WHERE f1.flock_no = '1359'
> AND f1.transfer_date = f2.last_xfer_date
>
> then as the comment suggests I get a considerable improvement, but it's still an
> order of magnitude slower than MySQL.
>
> Can anyone suggest why PostgreSQL performs the original query so much slower than even BDE?

ANALYZE?

Regards,

Rodrigo

Re: PostgreSQL runs a query much slower than BDE and MySQL

From
"Peter Hardman"
Date:
On 16 Aug 2006 at 20:02, Arjen van der Meijden wrote:

> On 16-8-2006 18:48, Peter Hardman wrote:
> > Using identically structured tables and the same primary key, if I run this on
> > Paradox/BDE it takes about 120ms, on MySQL (5.0.24, local server) about 3ms,
> > and on PostgresSQL (8.1.3, local server) about 1290ms). All on the same
> > Windows XP Pro machine with 512MB ram of which nearly half is free.
>
> Is that with or without query caching? I.e. can you test it with SELECT
> SQL_NO_CACHE ... ?
> In a read-only environment it will still beat PostgreSQL, but as soon as
> you'd get a read-write environment, MySQL's query cache is of less use.
> So you should compare both the cached and non-cached version, if applicable.
It seems to make no difference - not surprising really as I'm just running the query
from the command line interface.
>
> Besides that, most advices on this list are impossible without the
> result of 'explain analyze', so you should probably get that as well.
Here is the output of EXPLAIN ANALYZE for the slow query:

Unique  (cost=7201.65..8487.81 rows=1 width=13) (actual
time=1649.733..1811.684 rows=32 loops=1)
  ->  Merge Join  (cost=7201.65..8487.80 rows=1 width=13) (actual
time=1649.726..1811.528 rows=32 loops=1)
        Merge Cond: ((("outer".regn_no)::text = "inner"."?column3?") AND
("outer".transfer_date = "inner".last_xfer_date))
        ->  Index Scan using sheep_flock_pkey on sheep_flock f1
(cost=0.00..1033.19 rows=77 width=13) (actual time=15.357..64.237 rows=127
loops=1)
              Index Cond: ((flock_no)::text = '1359'::text)
        ->  Sort  (cost=7201.65..7285.84 rows=33676 width=15) (actual
time=1580.198..1653.502 rows=38277 loops=1)
              Sort Key: (f2.regn_no)::text, f2.last_xfer_date
              ->  Subquery Scan f2  (cost=0.00..4261.67 rows=33676 width=15) (actual
time=0.331..598.246 rows=38815 loops=1)
                    ->  GroupAggregate  (cost=0.00..3924.91 rows=33676 width=13)
(actual time=0.324..473.131 rows=38815 loops=1)
                          ->  Index Scan using sheep_flock_pkey on sheep_flock f
(cost=0.00..3094.95 rows=81802 width=13) (actual time=0.295..232.156
rows=81802 loops=1)
Total runtime: 1812.737 ms


>
> I'm not sure whether this is the same query, but you might want to try:
> SELECT DISTINCT f1.regn_no, f1.transfer_date as date_in
> FROM SHEEP_FLOCK f1
> WHERE
> f1.flock_no = '1359'
> AND f1.transfer_date = (SELECT MAX(f.transfer_date) FROM SHEEP_FLOCK f
> WHERE regn_no = f1.regn_no)
>
That's neat - I didn't know you could make a reference from a subselect to the
outer select. Your query has the same performance as my very complex one on
both MySQL and PostgreSQL. However I'm not entirely sure about the times for
MySQL - every interface gives a different answer so I'll have to try them from a
script so I know whats going on.
Interestingly BDE takes 7 seconds to run your query. Just as well I didn't start
from there...
> And you might need an index on (regn_no, transfer_date) and/or one
> combined with that flock_no.
Explain says it only uses the primary key, so it seems there' no need for a
separate index

Thanks for the help
--
Peter Hardman
Acre Cottage, Horsebridge
King's Somborne
Stockbridge
SO20 6PT

== Breeder of Shetland Cattle and Shetland Sheep ==


Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Tom Lane
Date:
"Peter Hardman" <peter@ssbg.zetnet.co.uk> writes:
> I'm in the process of migrating a Paradox 7/BDE 5.01 database from single-user
> Paradox to a web based interface to either MySQL or PostgreSQL.
> The query I run is:

> /* Select all sheep who's most recent transfer was into the subject flock */
> SELECT DISTINCT f1.regn_no, f1.transfer_date as date_in
> FROM SHEEP_FLOCK f1 JOIN
>     /* The last transfer date for each sheep */
>     (SELECT f.regn_no, MAX(f.transfer_date) as last_xfer_date
>     FROM  SHEEP_FLOCK f
>     GROUP BY f.regn_no) f2
> ON f1.regn_no = f2.regn_no
> WHERE f1.flock_no = '1359'
> AND f1.transfer_date = f2.last_xfer_date

This seems pretty closely related to this recent thread:
http://archives.postgresql.org/pgsql-performance/2006-08/msg00220.php
in which the OP is doing a very similar kind of query in almost exactly
the same way.

I can't help thinking that there's probably a better way to phrase this
type of query in SQL, though it's not jumping out at me what that is.

What I find interesting though is that it sounds like both MSSQL and
Paradox know something we don't about how to optimize it.  PG doesn't
have any idea how to do the above query without forming the full output
of the sub-select, but I suspect that the commercial DBs know a
shortcut; perhaps they are able to automatically derive a restriction
in the subquery similar to what you did by hand.  Does Paradox have
anything comparable to EXPLAIN that would give a hint about the query
plan they are using?

Also, just as in the other thread, I'm thinking that a seqscan+hash
aggregate would be a better idea than this bit:

>                    ->  GroupAggregate  (cost=0.00..3924.91 rows=33676 width=13) (actual time=0.324..473.131
rows=38815loops=1) 
>                          ->  Index Scan using sheep_flock_pkey on sheep_flock f (cost=0.00..3094.95 rows=81802
width=13)(actual time=0.295..232.156) 

Possibly you need to raise work_mem to get it to consider the hash
aggregation method.

BTW, are you *sure* you are testing PG 8.1?  The "Subquery Scan f2" plan
node looks unnecessary to me, and I'd have expected 8.1 to drop it out.
8.0 and before would have left it in the plan though.  This doesn't make
all that much difference performance-wise in itself, but it does make me
wonder what you are testing.

            regards, tom lane

Re: PostgreSQL runs a query much slower than BDE and MySQL

From
"Peter Hardman"
Date:
On 17 Aug 2006 at 10:00, Mario Weilguni wrote:

> not really sure if this is right without any testdata, but isn't that what you
> want?
>
> CREATE index foo on sheep_flock (flock_no);
>
> SELECT DISTINCT on (f1.transfer_date) f1.regn_no, f1.transfer_date as date_in
> FROM SHEEP_FLOCK f1
> WHERE f1.flock_no = '1359'
> order by f1.transfer_date desc;
>
> best regards,
> mario weilguni
>
>
Mario, Thanks for the suggestion, but this query produces the wrong answer - but
then I provided no data, nor properly explained what the data would be.
Each sheep will have multiple records, starting with one for when it's first
registered, then one for each flock it's in (eg sold into) then one for when it dies
and goes to the 'big flock in the sky'.

 So first I need to find the most recent record for each sheep and then select the
sheep who's most recent record matches the flock in question.

Your query finds all the sheep that have been in the flock in question, then selects
the first one from each set of records with the same date. So it collects data on
dead sheep, and only selects one sheep if several were bought or registered on
the same day.

Forgive me for being verbose - I want to make sure I understand it propely myself!

regards,
 --
Peter Hardman
Acre Cottage, Horsebridge
King's Somborne
Stockbridge
SO20 6PT

== Breeder of Shetland Cattle and Shetland Sheep ==


Re: PostgreSQL runs a query much slower than BDE and MySQL

From
"Peter Hardman"
Date:
On 16 Aug 2006 at 18:51, Tom Lane wrote:

> "Peter Hardman" <peter@ssbg.zetnet.co.uk> writes:
> > I'm in the process of migrating a Paradox 7/BDE 5.01 database from single-user
<snip>

Arjen van der Meijden has proposed a very elegant query in another post.

> What I find interesting though is that it sounds like both MSSQL and
> Paradox know something we don't about how to optimize it.  PG doesn't
> have any idea how to do the above query without forming the full output
> of the sub-select, but I suspect that the commercial DBs know a
> shortcut; perhaps they are able to automatically derive a restriction
> in the subquery similar to what you did by hand.  Does Paradox have
> anything comparable to EXPLAIN that would give a hint about the query
> plan they are using?

Sadly, no. In fact the ability to use SQL from Paradox at all is not well known and
not very visible in the the documentation.

I wonder whether Paradox and MySQL are just not doing the sort (this seems to
be what eats up the time), since the output of the subquery is in fact already in the
proper order.

>
> Also, just as in the other thread, I'm thinking that a seqscan+hash
> aggregate would be a better idea than this bit:
>
> >                    ->  GroupAggregate  (cost=0.00..3924.91 rows=33676 width=13) (actual time=0.324..473.131
rows=38815loops=1) 
> >                          ->  Index Scan using sheep_flock_pkey on sheep_flock f (cost=0.00..3094.95 rows=81802
width=13)(actual time=0.295..232.156) 
>
> Possibly you need to raise work_mem to get it to consider the hash
> aggregation method.
>
> BTW, are you *sure* you are testing PG 8.1?  The "Subquery Scan f2" plan
> node looks unnecessary to me, and I'd have expected 8.1 to drop it out.
> 8.0 and before would have left it in the plan though.  This doesn't make
> all that much difference performance-wise in itself, but it does make me
> wonder what you are testing.

Yes, the executables all say version 8.1.3.6044
>
Regards,--
Peter Hardman
Acre Cottage, Horsebridge
King's Somborne
Stockbridge
SO20 6PT

== Breeder of Shetland Cattle and Shetland Sheep ==


Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Markus Schaber
Date:
Hi, Peter,

Peter Hardman wrote:

>> BTW, are you *sure* you are testing PG 8.1?  The "Subquery Scan f2" plan
>> node looks unnecessary to me, and I'd have expected 8.1 to drop it out.
>> 8.0 and before would have left it in the plan though.  This doesn't make
>> all that much difference performance-wise in itself, but it does make me
>> wonder what you are testing.
>
> Yes, the executables all say version 8.1.3.6044

Would you mind to look at the output of "select version();", too?

I ask this because I stumbled over it myself, that I had installed the
correct postgresql and psql versions, but accidentally connected to a
different database installation due to strange environment and script
settings...


Thanks,
Markus


--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: PostgreSQL runs a query much slower than BDE and MySQL

From
"Peter Hardman"
Date:
On 17 Aug 2006 at 12:11, Markus Schaber wrote:

> Hi, Peter,
>
> Peter Hardman wrote:
>
> >> BTW, are you *sure* you are testing PG 8.1?  The "Subquery Scan f2" plan
> >> node looks unnecessary to me, and I'd have expected 8.1 to drop it out.
> >> 8.0 and before would have left it in the plan though.  This doesn't make
> >> all that much difference performance-wise in itself, but it does make me
> >> wonder what you are testing.
> >
> > Yes, the executables all say version 8.1.3.6044
>
> Would you mind to look at the output of "select version();", too?
>
> I ask this because I stumbled over it myself, that I had installed the
> correct postgresql and psql versions, but accidentally connected to a
> different database installation due to strange environment and script
> settings...
select version() returns

PostgreSQL 8.1.3 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
(mingw-special)

Cheers,--
Peter Hardman
Acre Cottage, Horsebridge
King's Somborne
Stockbridge
SO20 6PT

== Breeder of Shetland Cattle and Shetland Sheep ==


Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Markus Schaber
Date:
Hi, Peter,

Peter Hardman wrote:

> select version() returns
>
> PostgreSQL 8.1.3 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
> (mingw-special)

That looks correct.

I also presume that your environment is not as fragile wr/t connecting
do wrong databases, compared to debian with their multi-cluster
multi-version script wrapper magic.

Don't mind.
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Tom Lane
Date:
"Peter Hardman" <peter@ssbg.zetnet.co.uk> writes:
> I wonder whether Paradox and MySQL are just not doing the sort (this
> seems to be what eats up the time), since the output of the subquery
> is in fact already in the proper order.

MSSQL (from the other thread).  I feel fairly safe in assuming that
MySQL's query optimizer is not nearly in the league to do this query
effectively.  (I like the theory Arjen mentioned that what you are
measuring there is the effects of their query cache rather than a
smart fundamental implementation.)  I wonder whether MSSQL has an
EXPLAIN equivalent ...

Anywy, your point about the sort being redundant is a good one, and
offhand I'd have expected PG to catch that; I'll have to look into
why it didn't.  But that's not going to explain a 10x speed
difference, because the sort isn't 90% of the runtime.

            regards, tom lane

Re: PostgreSQL runs a query much slower than BDE and

From
Mark Lewis
Date:
MSSQL can give either a graphical query plan or a text-based one similar
to PG.  There's no way that I've found to get the equivalent of an
EXPLAIN ANALYZE, but I'm by no means an MSSQL guru.

To get a neat-looking but not very useful graphical query plan from the
Query Analyzer tool, hit <Ctrl-L>.

To get the text-based one, execute "SET SHOWPLAN_ALL ON" which toggles
diagnostic mode on, and each query that you run will return the explain
plan instead of actually running until you execute "SET SHOWPLAN_ALL
OFF".

-- Mark Lewis

On Thu, 2006-08-17 at 09:11 -0400, Tom Lane wrote:
> "Peter Hardman" <peter@ssbg.zetnet.co.uk> writes:
> > I wonder whether Paradox and MySQL are just not doing the sort (this
> > seems to be what eats up the time), since the output of the subquery
> > is in fact already in the proper order.
>
> MSSQL (from the other thread).  I feel fairly safe in assuming that
> MySQL's query optimizer is not nearly in the league to do this query
> effectively.  (I like the theory Arjen mentioned that what you are
> measuring there is the effects of their query cache rather than a
> smart fundamental implementation.)  I wonder whether MSSQL has an
> EXPLAIN equivalent ...
>
> Anywy, your point about the sort being redundant is a good one, and
> offhand I'd have expected PG to catch that; I'll have to look into
> why it didn't.  But that's not going to explain a 10x speed
> difference, because the sort isn't 90% of the runtime.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

Re: PostgreSQL runs a query much slower than BDE and

From
"Magnus Hagander"
Date:
> MSSQL can give either a graphical query plan or a text-based
> one similar to PG.  There's no way that I've found to get the
> equivalent of an EXPLAIN ANALYZE, but I'm by no means an MSSQL guru.

SET STATISTICS IO ON
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON


//Magnus

Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Tom Lane
Date:
I wrote:
> Anywy, your point about the sort being redundant is a good one, and
> offhand I'd have expected PG to catch that; I'll have to look into
> why it didn't.  But that's not going to explain a 10x speed
> difference, because the sort isn't 90% of the runtime.

I dug into this using some made-up test data, and was able to reproduce
the plan you got after changing the order of the pkey index columns
to (regn_no, transfer_date, flock_no) ... are you sure you quoted that
accurately before?

I found a couple of minor planner problems, which I've repaired in CVS
HEAD.  You might consider using TEXT columns instead of VARCHAR(n),
because the only bug that actually seemed to change the chosen plan
involved the planner getting confused by the difference between
varchar_var and varchar_var::text (which is what gets generated for
sorting purposes because varchar doesn't have a separate sort operator).

There's a more interesting issue, which I'm afraid we do not have time
to fix for PG 8.2.  The crux of the matter is that given

SELECT ...
FROM SHEEP_FLOCK f1 JOIN
    (SELECT f.regn_no, MAX(f.transfer_date) as last_xfer_date
    FROM  SHEEP_FLOCK f
    GROUP BY f.regn_no) f2
ON f1.regn_no = f2.regn_no
AND f1.transfer_date = f2.last_xfer_date

if there is an index on (regn_no, transfer_date) then the planner could
in principle do a double-column merge join between an indexscan on this
index and the output of a GroupAggregate implementation of the subquery.
The GroupAggregate plan would in fact only be sorting on regn_no, so
it's not immediately obvious why this is OK.  The reason is that there
is only one MAX() value for any particular regn_no, and so the sort
condition that the last_xfer_date values be in order for any one value
of regn_no is vacuous.  We could consider the subquery's output to be
sorted by *any* list of the form "regn_no, other-stuff".

The planner's notion of matching pathkey lists to determine sortedness
is not at all capable of dealing with this.  After a little bit of
thought I'm tempted to propose that we add a concept that a particular
pathkey list is "unique", meaning that it is known to include a unique
key for the data.  Such a key would match, for sortedness purposes,
any requested sort ordering consisting of its columns followed by
others.  In the above example, we would know that a GROUP BY implemented
by GroupAggregate yields an output for which the grouping columns
are a unique sort key.

I imagine this concept is already known in the database research
literature; anyone recognize it and know a standard name for it?

What'd be really interesting is to know if MSSQL and Paradox are using
this concept to optimize their plans for Peter's query.  Can someone with
a copy of MSSQL try this test case and see what it reports as the plan?

BTW, I used this to generate some COPY data I could load into Peter's
example table:

perl -e 'for ($s = 1; $s < 32000; $s++) {
$f=int($s/100);
print "$s\t$f\t1\t0\n";
print "$s\t$f\t2\t0\n";
}' >sheep.data

I changed the date columns to integers rather than bother to make up
valid dates.  I think only the regn_no and flock_no statistics matter
to the planner for this particular query --- as you can see, I arranged
for 2 entries per sheep and 100 sheep per flock, which is in the general
ballpark of what Peter mentioned as his stats.

            regards, tom lane

Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Scott Lamb
Date:
I have no idea if there's a standard name or what it may be, but for
what it's worth, this sounds similar to the optimizations I wanted
for a different query:

http://archives.postgresql.org/pgsql-performance/2005-11/msg00037.php

1. Recognize that a term constant across the whole sort is
irrelevant. (In my earlier case, a constant number, but here MAX
(xxx), which seems harder.)
2. Put together two sequences already in the appropriate order,
without resorting. (In my case, a union; here a join.)

though I no longer need them for that problem. I'm quite happy with
the client-side solution we came up with.

--
Scott Lamb <http://www.slamb.org/>



Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Scott Lamb
Date:
On Aug 16, 2006, at 3:51 PM, Tom Lane wrote:
>> /* Select all sheep who's most recent transfer was into the
>> subject flock */
>> SELECT DISTINCT f1.regn_no, f1.transfer_date as date_in
>> FROM SHEEP_FLOCK f1 JOIN
>>     /* The last transfer date for each sheep */
>>     (SELECT f.regn_no, MAX(f.transfer_date) as last_xfer_date
>>     FROM  SHEEP_FLOCK f
>>     GROUP BY f.regn_no) f2
>> ON f1.regn_no = f2.regn_no
>> WHERE f1.flock_no = '1359'
>> AND f1.transfer_date = f2.last_xfer_date
>
> This seems pretty closely related to this recent thread:
> http://archives.postgresql.org/pgsql-performance/2006-08/msg00220.php
> in which the OP is doing a very similar kind of query in almost
> exactly
> the same way.
>
> I can't help thinking that there's probably a better way to phrase
> this
> type of query in SQL, though it's not jumping out at me what that is.

I don't know about better, but I tend to phrase these in a quite
different way that's (hopefully) equivalent:

select    latest.regn_no,
           latest.transfer_date as date_in
from      sheep_flock latest
where     not exists (
           select    'x'
           from      sheep_flock even_later
           where     latest.regn_no = even_later.regn_no
             and     latest.transfer_date < even_later.transfer_date)
   and     latest.flock_no = '1359'

There's no MAX() or DISTINCT here, so maybe this is easier to optimize?

--
Scott Lamb <http://www.slamb.org/>



Re: PostgreSQL runs a query much slower than BDE and MySQL

From
"Peter Hardman"
Date:

On 16 Aug 2006 at 17:48, Peter Hardman wrote:

> I'm in the process of migrating a Paradox 7/BDE 5.01 database from single-user
> Paradox to a web based interface to either MySQL or PostgreSQL.
<snip>

I've uploaded my data to www.shetland-sheep.org.uk/pgdata/sheep-flock.zip

The flock SSBXXX is the 'big flock in the sky' and thus there should never be any
date for a sheep greater than this.

Yes, the primary key is regn_no + flock_no + transfer_date.

Thanks again for all the help and advice.

Regards,--
Peter Hardman
Acre Cottage, Horsebridge
King's Somborne
Stockbridge
SO20 6PT

== Breeder of Shetland Cattle and Shetland Sheep ==


Re: PostgreSQL runs a query much slower than BDE and MySQL

From
"Peter Hardman"
Date:
On 17 Aug 2006 at 14:33, Tom Lane wrote:

> I wrote:
> > Anywy, your point about the sort being redundant is a good one, and
> > offhand I'd have expected PG to catch that; I'll have to look into
> > why it didn't.  But that's not going to explain a 10x speed
> > difference, because the sort isn't 90% of the runtime.
>
> I dug into this using some made-up test data, and was able to reproduce
> the plan you got after changing the order of the pkey index columns
> to (regn_no, transfer_date, flock_no) ... are you sure you quoted that
> accurately before?

Yes. Maybe the data I've uploaded to www.shetland-
sheep.org.uk/pgdata/sheep_flock.zip will help reproduce the plan.

<snip>
> I found a couple of minor planner problems, which I've repaired in CVS
> HEAD.  You might consider using TEXT columns instead of VARCHAR(n),
> because the only bug that actually seemed to change the chosen plan
> involved the planner getting confused by the difference between
> varchar_var and varchar_var::text (which is what gets generated for
> sorting purposes because varchar doesn't have a separate sort operator).

As someone else suggested, these fields ought really to be CHAR no VARCHAR.
I chose VARCHAR because the data mostly is shorter than the maximum lengths
(although probably not enough to matter). I'd not really got into the subtleties of
different behaviour of CHAR and VARCHAR.
>
<snip>

Regards,--
Peter Hardman
Acre Cottage, Horsebridge
King's Somborne
Stockbridge
SO20 6PT

== Breeder of Shetland Cattle and Shetland Sheep ==


Re: PostgreSQL runs a query much slower than BDE and MySQL

From
"Peter Hardman"
Date:
<div align="left"><font face="Arial" size="2"><span style="font-size:10pt">On 17 Aug 2006 at 20:58, Peter Hardman
wrote:</span></font></div><divalign="left"><font face="Arial" size="2"><span style="font-size:10pt"><br
/></span></font></div><divalign="left"><font color="#7f0000" face="Arial" size="2"><span style="font-size:10pt">>
</span></font></div><divalign="left"><font color="#7f0000" face="Arial" size="2"><span style="font-size:10pt">>
</span></font></div><divalign="left"><font color="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> On
16Aug 2006 at 17:48, Peter Hardman wrote:</span></font></div><div align="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">> </span></font></div><div align="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">> > I'm in the process of migrating a Paradox 7/BDE 5.01 database from
single-user</span></font></div><div align="left"><font color="#7f0000" face="Arial" size="2"><span
style="font-size:10pt">>> Paradox to a web based interface to either MySQL or PostgreSQL.</span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> <snip>
</span></font></div><divalign="left"><font color="#7f0000" face="Arial" size="2"><span style="font-size:10pt">>
</span></font></div><divalign="left"><font color="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> I've
uploadedmy data to www.shetland-sheep.org.uk/pgdata/sheep-flock.zip</span></font></div><div align="left"><font
color="#7f0000"face="Arial" size="2"><span style="font-size:10pt"><br /></span></font></div><div align="left"><font
face="Arial"size="2"><span style="font-size:10pt">Sorry - that should be </span></font><font color="#7f0000"
face="Arial"size="2"><span
style="font-size:10pt">www.shetland-sheep.org.uk/pgdata/sheep_flock.zip</span></font></div><divalign="left"><font
color="#7f0000"face="Arial" size="2"><span style="font-size:10pt">> </span></font></div><div align="left"><font
color="#7f0000"face="Arial" size="2"><span style="font-size:10pt">> The flock SSBXXX is the 'big flock in the sky'
andthus there should never be any </span></font></div><div align="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">> date for a sheep greater than this. </span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> </span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> Yes, the primary key is
regn_no+ flock_no + transfer_date.</span></font></div><div align="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">> </span></font></div><div align="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">> Thanks again for all the help and advice.</span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> </span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> Regards,--
</span></font></div><divalign="left"><font color="#7f0000" face="Arial" size="2"><span style="font-size:10pt">>
PeterHardman</span></font></div><div align="left"><font color="#7f0000" face="Arial" size="2"><span
style="font-size:10pt">>Acre Cottage, Horsebridge</span></font></div><div align="left"><font color="#7f0000"
face="Arial"size="2"><span style="font-size:10pt">> King's Somborne</span></font></div><div align="left"><font
color="#7f0000"face="Arial" size="2"><span style="font-size:10pt">> Stockbridge</span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> SO20
6PT</span></font></div><divalign="left"><font color="#7f0000" face="Arial" size="2"><span style="font-size:10pt">>
</span></font></div><divalign="left"><font color="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> ==
Breederof Shetland Cattle and Shetland Sheep ==</span></font></div><div align="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">> </span></font></div><div align="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">> </span></font></div><div align="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">> ---------------------------(end of
broadcast)---------------------------</span></font></div><divalign="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">> TIP 2: Don't 'kill -9' the postmaster</span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt">> </span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt"><br /></span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt">-- </span></font></div><div
align="left"><fontcolor="#7f0000" face="Arial" size="2"><span style="font-size:10pt">Peter
Hardman</span></font></div><divalign="left"><font color="#7f0000" face="Arial" size="2"><span
style="font-size:10pt">AcreCottage, Horsebridge</span></font></div><div align="left"><font color="#7f0000" face="Arial"
size="2"><spanstyle="font-size:10pt">King's Somborne</span></font></div><div align="left"><font color="#7f0000"
face="Arial"size="2"><span style="font-size:10pt">Stockbridge</span></font></div><div align="left"><font
color="#7f0000"face="Arial" size="2"><span style="font-size:10pt">SO20 6PT</span></font></div><div align="left"><font
color="#7f0000"face="Arial" size="2"><span style="font-size:10pt"><br /></span></font></div><div align="left"><font
color="#7f0000"face="Arial" size="2"><span style="font-size:10pt">== Breeder of Shetland Cattle and Shetland Sheep
==</span></font></div><divalign="left"></div> 

Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Scott Lamb
Date:
Peter, I compared these using the data you supplied on my PostgreSQL
8.1.4 system:

On Aug 17, 2006, at 12:09 PM, Scott Lamb wrote:

> On Aug 16, 2006, at 3:51 PM, Tom Lane wrote:
>>> /* Select all sheep who's most recent transfer was into the
>>> subject flock */
>>> SELECT DISTINCT f1.regn_no, f1.transfer_date as date_in
>>> FROM SHEEP_FLOCK f1 JOIN
>>>     /* The last transfer date for each sheep */
>>>     (SELECT f.regn_no, MAX(f.transfer_date) as last_xfer_date
>>>     FROM  SHEEP_FLOCK f
>>>     GROUP BY f.regn_no) f2
>>> ON f1.regn_no = f2.regn_no
>>> WHERE f1.flock_no = '1359'
>>> AND f1.transfer_date = f2.last_xfer_date


         QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------
Unique  (cost=2575.07..2575.08 rows=1 width=36) (actual
time=1083.579..1083.696 rows=32 loops=1)
    ->  Sort  (cost=2575.07..2575.07 rows=1 width=36) (actual
time=1083.576..1083.613 rows=32 loops=1)
          Sort Key: f1.regn_no, f1.transfer_date
          ->  Nested Loop  (cost=1364.00..2575.06 rows=1 width=36)
(actual time=287.895..1083.297 rows=32 loops=1)
                ->  HashAggregate  (cost=1364.00..1366.50 rows=200
width=36) (actual time=262.345..337.940 rows=38815 loops=1)
                      ->  Seq Scan on sheep_flock f
(cost=0.00..1116.00 rows=49600 width=36) (actual time=0.005..119.282
rows=81802 loops=1)
                ->  Index Scan using sheep_flock_pkey on sheep_flock
f1  (cost=0.00..6.02 rows=1 width=36) (actual time=0.016..0.016
rows=0 loops=38815)
                      Index Cond: (((f1.regn_no)::text =
("outer".regn_no)::text) AND ((f1.flock_no)::text = '1359'::text) AND
(f1.transfer_date = "outer"."?column2?"))
Total runtime: 1085.115 ms
(9 rows)

>>
>> This seems pretty closely related to this recent thread:
>> http://archives.postgresql.org/pgsql-performance/2006-08/msg00220.php
>> in which the OP is doing a very similar kind of query in almost
>> exactly
>> the same way.
>>
>> I can't help thinking that there's probably a better way to phrase
>> this
>> type of query in SQL, though it's not jumping out at me what that is.
>
> I don't know about better, but I tend to phrase these in a quite
> different way that's (hopefully) equivalent:
>
> select    latest.regn_no,
>           latest.transfer_date as date_in
> from      sheep_flock latest
> where     not exists (
>           select    'x'
>           from      sheep_flock even_later
>           where     latest.regn_no = even_later.regn_no
>             and     latest.transfer_date < even_later.transfer_date)
>   and     latest.flock_no = '1359'
>
> There's no MAX() or DISTINCT here, so maybe this is easier to
> optimize?

                                                                       Q
UERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------
Bitmap Heap Scan on sheep_flock latest  (cost=764.60..2185.05
rows=124 width=36) (actual time=11.915..13.800 rows=32 loops=1)
    Recheck Cond: ((flock_no)::text = '1359'::text)
    Filter: (NOT (subplan))
    ->  Bitmap Index Scan on sheep_flock_pkey  (cost=0.00..764.60
rows=248 width=0) (actual time=10.950..10.950 rows=127 loops=1)
          Index Cond: ((flock_no)::text = '1359'::text)
    SubPlan
      ->  Index Scan using sheep_flock_pkey on sheep_flock
even_later  (cost=0.00..317.49 rows=83 width=0) (actual
time=0.016..0.016 rows=1 loops=127)
            Index Cond: ((($0)::text = (regn_no)::text) AND ($1 <
transfer_date))
Total runtime: 13.902 ms
(9 rows)

seems to return the same data in two orders of magnitude less time.

--
Scott Lamb <http://www.slamb.org/>



Re: PostgreSQL runs a query much slower than BDE and MySQL

From
Tom Lane
Date:
"Peter Hardman" <peter@ssbg.zetnet.co.uk> writes:
> On 17 Aug 2006 at 14:33, Tom Lane wrote:
>> I found a couple of minor planner problems, which I've repaired in CVS
>> HEAD.  You might consider using TEXT columns instead of VARCHAR(n),

> As someone else suggested, these fields ought really to be CHAR no VARCHAR.

That should be fine too.  VARCHAR is sort of a poor stepchild in
Postgres, because it piggybacks on TEXT's operators --- but CHAR
has different comparison rules, hence its own operators, hence
doesn't trip over that bug.

There's still some things that don't add up to me, like the question
of the pkey column order.  Will look some more.

            regards, tom lane

Re: PostgreSQL runs a query much slower than BDE and

From
Simon Riggs
Date:
On Thu, 2006-08-17 at 14:33 -0400, Tom Lane wrote:

> There's a more interesting issue, which I'm afraid we do not have time
> to fix for PG 8.2.  The crux of the matter is that given
>
> SELECT ...
> FROM SHEEP_FLOCK f1 JOIN
>     (SELECT f.regn_no, MAX(f.transfer_date) as last_xfer_date
>     FROM  SHEEP_FLOCK f
>     GROUP BY f.regn_no) f2
> ON f1.regn_no = f2.regn_no
> AND f1.transfer_date = f2.last_xfer_date
>
> if there is an index on (regn_no, transfer_date) then the planner could
> in principle do a double-column merge join between an indexscan on this
> index and the output of a GroupAggregate implementation of the subquery.
> The GroupAggregate plan would in fact only be sorting on regn_no, so
> it's not immediately obvious why this is OK.  The reason is that there
> is only one MAX() value for any particular regn_no, and so the sort
> condition that the last_xfer_date values be in order for any one value
> of regn_no is vacuous.  We could consider the subquery's output to be
> sorted by *any* list of the form "regn_no, other-stuff".
>
> The planner's notion of matching pathkey lists to determine sortedness
> is not at all capable of dealing with this.  After a little bit of
> thought I'm tempted to propose that we add a concept that a particular
> pathkey list is "unique", meaning that it is known to include a unique
> key for the data.  Such a key would match, for sortedness purposes,
> any requested sort ordering consisting of its columns followed by
> others.  In the above example, we would know that a GROUP BY implemented
> by GroupAggregate yields an output for which the grouping columns
> are a unique sort key.
>
> I imagine this concept is already known in the database research
> literature; anyone recognize it and know a standard name for it?

(catching up on some earlier mails....)

Not seen any particular name for that around. There are quite a few
places in the optimizer, IIRC, that could use the concept of uniqueness
if it existed.

I would note that the above query plan is similar-ish to the one you'd
get if you tried to push down the GROUP BY from the top of a join. So
the uniqueness information sounds like an important precursor to that.

I've just rechecked out the lit I was reading on this earlier this year:

http://portal.acm.org/ft_gateway.cfm?id=233320&type=pdf&coll=&dl=acm&CFID=15151515&CFTOKEN=6184618#search=%22db2%20order%20optimization%20tpc-d%22
"Fundamental Techniques for Order Optimization" Simmen et al

Also, IIRC, there was some work talking about extending the Interesting
Order concept to allow groupings to be noted also.

From our work on sorting earlier, we had it that a Merge Join will
always require a Mark/Restore operation on its sorted inputs. If the
Outer input is unique then a Restore operation will never be required,
so the Mark can be avoided also and thus the materialization of the sort
can also be avoided. So some way of telling the MJ node that the sort
order is also unique would be very useful.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com