Thread: is it possible to make this faster?

is it possible to make this faster?

From
"Merlin Moncure"
Date:
been doing a lot of pgsql/mysql performance testing lately, and there
is one query that mysql does much better than pgsql...and I see it a
lot in normal development:

select a,b,max(c) from t group by a,b;

t has an index on a,b,c.

in my sample case with cardinality of 1000 for a, 2000 for b, and
300000 records in t, pgsql does a seq. scan on dev box in about a
second (returning 2000 records).

recent versions of mysql do much better, returning same set in < 20ms.
mysql explain says it uses an index to optimize the group by somehow.
is there a faster way to write this query?

Merlin

Re: is it possible to make this faster?

From
Bruno Wolff III
Date:
On Thu, May 25, 2006 at 16:07:19 -0400,
  Merlin Moncure <mmoncure@gmail.com> wrote:
> been doing a lot of pgsql/mysql performance testing lately, and there
> is one query that mysql does much better than pgsql...and I see it a
> lot in normal development:
>
> select a,b,max(c) from t group by a,b;
>
> t has an index on a,b,c.
>
> in my sample case with cardinality of 1000 for a, 2000 for b, and
> 300000 records in t, pgsql does a seq. scan on dev box in about a
> second (returning 2000 records).
>
> recent versions of mysql do much better, returning same set in < 20ms.
> mysql explain says it uses an index to optimize the group by somehow.
> is there a faster way to write this query?

SELECT DISTINCT ON (a, b) a, b, c FROM t ORDER BY a DESC, b DESC, c DESC;

Re: is it possible to make this faster?

From
"Merlin Moncure"
Date:
On 5/25/06, Bruno Wolff III <bruno@wolff.to> wrote:
> On Thu, May 25, 2006 at 16:07:19 -0400,
>   Merlin Moncure <mmoncure@gmail.com> wrote:
> > been doing a lot of pgsql/mysql performance testing lately, and there
> > is one query that mysql does much better than pgsql...and I see it a
> > lot in normal development:
> >
> > select a,b,max(c) from t group by a,b;
> >

> SELECT DISTINCT ON (a, b) a, b, c FROM t ORDER BY a DESC, b DESC, c DESC;

that is actually slower than group by in my case...am i missing
something? (both essentially resolved to seq_scan)

merlin

Re: is it possible to make this faster?

From
"Steinar H. Gunderson"
Date:
On Thu, May 25, 2006 at 04:07:19PM -0400, Merlin Moncure wrote:
> been doing a lot of pgsql/mysql performance testing lately, and there
> is one query that mysql does much better than pgsql...and I see it a
> lot in normal development:
>
> select a,b,max(c) from t group by a,b;
>
> t has an index on a,b,c.

The planner _should_ TTBOMK be able to do it by itself in 8.1, but have you
tried something along the following lines?

  select a,b,(select c from t t2 order by c desc where t1.a=t2.a and t1.b=t2.b)
  from t t1 group by a,b;

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: is it possible to make this faster?

From
Alan Hodgson
Date:
On May 25, 2006 01:31 pm, "Merlin Moncure" <mmoncure@gmail.com> wrote:
> > SELECT DISTINCT ON (a, b) a, b, c FROM t ORDER BY a DESC, b DESC, c
> > DESC;
>
> that is actually slower than group by in my case...am i missing
> something? (both essentially resolved to seq_scan)

Try it with an index on a,b,c.

--
Alan

Re: is it possible to make this faster?

From
Bruno Wolff III
Date:
On Thu, May 25, 2006 at 16:31:40 -0400,
  Merlin Moncure <mmoncure@gmail.com> wrote:
> On 5/25/06, Bruno Wolff III <bruno@wolff.to> wrote:
> >On Thu, May 25, 2006 at 16:07:19 -0400,
> >  Merlin Moncure <mmoncure@gmail.com> wrote:
> >> been doing a lot of pgsql/mysql performance testing lately, and there
> >> is one query that mysql does much better than pgsql...and I see it a
> >> lot in normal development:
> >>
> >> select a,b,max(c) from t group by a,b;
> >>
>
> >SELECT DISTINCT ON (a, b) a, b, c FROM t ORDER BY a DESC, b DESC, c DESC;
>
> that is actually slower than group by in my case...am i missing
> something? (both essentially resolved to seq_scan)

If there aren't many c's for each (a,b), then a sort might be the best way to
do this. I don't remember if skip scanning ever got done, but if it did, it
would have been 8.1 or later.

Re: is it possible to make this faster?

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> been doing a lot of pgsql/mysql performance testing lately, and there
> is one query that mysql does much better than pgsql...and I see it a
> lot in normal development:

> select a,b,max(c) from t group by a,b;

> t has an index on a,b,c.

The index won't help, as per this comment from planagg.c:

     * We don't handle GROUP BY, because our current implementations of
     * grouping require looking at all the rows anyway, and so there's not
     * much point in optimizing MIN/MAX.

Given the numbers you mention (300k rows in 2000 groups) I'm not
convinced that an index-based implementation would help much; we'd
still need to fetch at least one record out of every 150, which is
going to cost near as much as seqscanning all of them.

> recent versions of mysql do much better, returning same set in < 20ms.

Well, since they don't do MVCC they can answer this query from the
index without going to the heap at all.  But that still seems remarkably
fast for something that has to grovel through 300k index entries.

            regards, tom lane

Re: is it possible to make this faster?

From
"Merlin Moncure"
Date:
On 5/25/06, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> On Thu, May 25, 2006 at 04:07:19PM -0400, Merlin Moncure wrote:
> > been doing a lot of pgsql/mysql performance testing lately, and there
> > is one query that mysql does much better than pgsql...and I see it a
> > lot in normal development:
> >
> > select a,b,max(c) from t group by a,b;

>   select a,b,(select c from t t2 order by c desc where t1.a=t2.a and t1.b=t2.b)
>   from t t1 group by a,b;

this came out to a tie with the group by approach, although it
produced a different (but similar) plan.  we are still orders of
magnitude behind mysql here.

Interestingly, if I extract out the distinct values of a,b to a temp
table and rejoin to t using your approach, I get competitive times
with mysql.  this means the essential problem is:

select a,b from t group by a,b

is slow.  This feels like the same penalty for mvcc we pay with count(*)...hm.

merlin

Re: is it possible to make this faster?

From
"Steinar H. Gunderson"
Date:
On Thu, May 25, 2006 at 04:54:09PM -0400, Merlin Moncure wrote:
>>  select a,b,(select c from t t2 order by c desc where t1.a=t2.a and
>>  t1.b=t2.b)
>>  from t t1 group by a,b;
> this came out to a tie with the group by approach, although it
> produced a different (but similar) plan.  we are still orders of
> magnitude behind mysql here.

Actually, it _should_ produce a syntax error -- it's missing a LIMIT 1 in the
subquery.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: is it possible to make this faster?

From
Tom Lane
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> "Merlin Moncure" <mmoncure@gmail.com> writes:
>> recent versions of mysql do much better, returning same set in < 20ms.

> Well, since they don't do MVCC they can answer this query from the
> index without going to the heap at all.  But that still seems remarkably
> fast for something that has to grovel through 300k index entries.

Are you sure you measured that right?  I tried to duplicate this using
mysql 5.0.21, and I see runtimes of 0.45 sec without an index and
0.15 sec with.  This compares to psql times around 0.175 sec.  Doesn't
look to me like we're hurting all that badly, even without using the
index.

            regards, tom lane

Re: is it possible to make this faster?

From
Scott Marlowe
Date:
On Thu, 2006-05-25 at 15:52, Tom Lane wrote:
> "Merlin Moncure" <mmoncure@gmail.com> writes:
> > been doing a lot of pgsql/mysql performance testing lately, and there
> > is one query that mysql does much better than pgsql...and I see it a
> > lot in normal development:
>
> > select a,b,max(c) from t group by a,b;
>
> > t has an index on a,b,c.
>
> The index won't help, as per this comment from planagg.c:
>
>      * We don't handle GROUP BY, because our current implementations of
>      * grouping require looking at all the rows anyway, and so there's not
>      * much point in optimizing MIN/MAX.
>
> Given the numbers you mention (300k rows in 2000 groups) I'm not
> convinced that an index-based implementation would help much; we'd
> still need to fetch at least one record out of every 150, which is
> going to cost near as much as seqscanning all of them.
>
> > recent versions of mysql do much better, returning same set in < 20ms.
>
> Well, since they don't do MVCC they can answer this query from the
> index without going to the heap at all.  But that still seems remarkably
> fast for something that has to grovel through 300k index entries.

Well, they do, just with innodb tables.

Merlin, have you tried this against innodb tables to see what you get?

Re: is it possible to make this faster?

From
Mark Lewis
Date:
On Thu, 2006-05-25 at 16:52 -0400, Tom Lane wrote:
> "Merlin Moncure" <mmoncure@gmail.com> writes:
> > been doing a lot of pgsql/mysql performance testing lately, and there
> > is one query that mysql does much better than pgsql...and I see it a
> > lot in normal development:
>
> > select a,b,max(c) from t group by a,b;
>
> > t has an index on a,b,c.
>
> The index won't help, as per this comment from planagg.c:
>
>      * We don't handle GROUP BY, because our current implementations of
>      * grouping require looking at all the rows anyway, and so there's not
>      * much point in optimizing MIN/MAX.
>
> Given the numbers you mention (300k rows in 2000 groups) I'm not
> convinced that an index-based implementation would help much; we'd
> still need to fetch at least one record out of every 150, which is
> going to cost near as much as seqscanning all of them.

Well, if the MySQL server has enough RAM that the index is cached (or
index + relevant chunks of data file if using InnoDB?) then that would
explain how MySQL can use an index to get fast results.

-- Mark Lewis

Re: is it possible to make this faster?

From
Jim Nasby
Date:
On May 25, 2006, at 4:11 PM, Tom Lane wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> "Merlin Moncure" <mmoncure@gmail.com> writes:
>>> recent versions of mysql do much better, returning same set in <
>>> 20ms.
>
>> Well, since they don't do MVCC they can answer this query from the
>> index without going to the heap at all.  But that still seems
>> remarkably
>> fast for something that has to grovel through 300k index entries.
>
> Are you sure you measured that right?  I tried to duplicate this using
> mysql 5.0.21, and I see runtimes of 0.45 sec without an index and
> 0.15 sec with.  This compares to psql times around 0.175 sec.  Doesn't
> look to me like we're hurting all that badly, even without using the
> index.

Well, that would depend greatly on how wide the rows were, and I
don't believe the OP ever mentioned that. If he's got a nice, fat
varchar(1024) in that table, then it's not surprising that an index
would help things.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461



Re: is it possible to make this faster?

From
Tom Lane
Date:
Jim Nasby <jnasby@pervasive.com> writes:
> On May 25, 2006, at 4:11 PM, Tom Lane wrote:
>> Are you sure you measured that right?  I tried to duplicate this using
>> mysql 5.0.21, and I see runtimes of 0.45 sec without an index and
>> 0.15 sec with.  This compares to psql times around 0.175 sec.  Doesn't
>> look to me like we're hurting all that badly, even without using the
>> index.

> Well, that would depend greatly on how wide the rows were, and I
> don't believe the OP ever mentioned that. If he's got a nice, fat
> varchar(1024) in that table, then it's not surprising that an index
> would help things.

Wide rows might slow down the psql side of things somewhat (though
probably not as much as you think).  That doesn't account for the
discrepancy in our mysql results though.

For the record, I was testing with a table like
    create table t(a int, b int, c int);
    create index ti on t(a,b,c);

            regards, tom lane

Re: is it possible to make this faster?

From
Jeff -
Date:
Also, are you sure your numbers are not coming out of the mysql query
cache?

That might explain some of it - also with Tom seeing comprable
numbers in his test.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/





Re: is it possible to make this faster?

From
Tom Lane
Date:
Jeff - <threshar@real.jefftrout.com> writes:
> Also, are you sure your numbers are not coming out of the mysql query
> cache?
> That might explain some of it - also with Tom seeing comprable
> numbers in his test.

Indeed, enabling the mysql query cache makes the timings drop to
nil ... as long as I present a query that's strcmp-equal to the
last one (not different in whitespace for instance).

            regards, tom lane

Re: is it possible to make this faster?

From
"Merlin Moncure"
Date:
On 5/25/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > "Merlin Moncure" <mmoncure@gmail.com> writes:
> >> recent versions of mysql do much better, returning same set in < 20ms.

> Are you sure you measured that right?  I tried to duplicate this using
> mysql 5.0.21, and I see runtimes of 0.45 sec without an index and
> 0.15 sec with.  This compares to psql times around 0.175 sec.  Doesn't
> look to me like we're hurting all that badly, even without using the
> index.

Well, my numbers were approximate, but I tested on a few different
machines.  the times got closer as the cpu speed got faster.  pg
really loves a quick cpu.  on 600 mhz p3 I got 70ms on mysql and
1050ms on pg.  Mysql query cache is always off for my performance
testing.

My a and b columns were ID columns from another table, so I rewrote
the join and now pg is smoking mysql (again).

To quickly answer the other questions:

1. no, not testing innodb
2, rows are narrow

Merlin

Re: is it possible to make this faster?

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> On 5/25/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Merlin Moncure" <mmoncure@gmail.com> writes:
>>> recent versions of mysql do much better, returning same set in < 20ms.

>> Are you sure you measured that right?  I tried to duplicate this using
>> mysql 5.0.21, and I see runtimes of 0.45 sec without an index and
>> 0.15 sec with.  This compares to psql times around 0.175 sec.  Doesn't
>> look to me like we're hurting all that badly, even without using the
>> index.

> Well, my numbers were approximate, but I tested on a few different
> machines.  the times got closer as the cpu speed got faster.  pg
> really loves a quick cpu.  on 600 mhz p3 I got 70ms on mysql and
> 1050ms on pg.  Mysql query cache is always off for my performance
> testing.

Well, this bears looking into, because I couldn't get anywhere near 20ms
with mysql.  I was using a dual Xeon 2.8GHz machine which ought to be
quick enough, and the stock Fedora Core 5 RPM of mysql.  (Well, actually
that SRPM built on FC4, because this machine is still on FC4.)  I made a
MyISAM table with three integer columns as mentioned, and filled it with
about 300000 rows with 2000 distinct values of (a,b) and random values
of c.  I checked the timing both in the mysql CLI, and with a trivial
test program that timed mysql_real_query() plus mysql_store_result(),
getting pretty near the same timings each way.

BTW, in pgsql it helps a whole lot to raise work_mem a bit for this
example --- at default work_mem it wants to do sort + group_aggregate,
while with work_mem 2000 or more it'll use a hash_aggregate plan which
is quite a bit faster.

It seems possible that there is some equivalently simple tuning on the
mysql side that you did and I didn't.  This is an utterly stock mysql
install, just "rpm -i" and "service mysqld start".

            regards, tom lane

Re: is it possible to make this faster?

From
"Merlin Moncure"
Date:
On 5/26/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, this bears looking into, because I couldn't get anywhere near 20ms
> with mysql.  I was using a dual Xeon 2.8GHz machine which ought to be

did you have a key on a,b,c? if I include unimportant unkeyed field d
the query time drops from 70ms to ~ 1 second.  mysql planner is
tricky, it's full of special case optimizations...

select count(*) from (select a,b,max(c) group by a,b) q;
blows the high performance case as does putting the query in a view.

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.0.16    |
+-----------+
1 row in set (0.00 sec)

mysql> set global query_cache_size = 0;
Query OK, 0 rows affected (0.00 sec)

mysql> select user_id, acc_id, max(sample_date) from usage_samples group by 1,2
[...]
+---------+--------+------------------+
939 rows in set (0.07 sec)

mysql> select user_id, acc_id, max(sample_date) from usage_samples group by 1,2
[...]
+---------+--------+------------------+--------------+
939 rows in set (1.39 sec)

merlin

Re: is it possible to make this faster?

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> did you have a key on a,b,c?

Yeah, I did
    create index t1i on t1 (a,b,c);
Do I need to use some other syntax to get it to work?

> select count(*) from (select a,b,max(c) group by a,b) q;
> blows the high performance case as does putting the query in a view.

I noticed that too, while trying to suppress the returning of the
results for timing purposes ... still a few bugs in their optimizer
obviously.  (Curiously, EXPLAIN still claims that the index is being
used.)

> mysql> select user_id, acc_id, max(sample_date) from usage_samples group by 1,2
> [...]
> +---------+--------+------------------+
> 939 rows in set (0.07 sec)

> mysql> select user_id, acc_id, max(sample_date) from usage_samples group by 1,2
> [...]
> +---------+--------+------------------+--------------+
> 939 rows in set (1.39 sec)

I don't understand what you did differently in those two cases?
Or was there a DROP INDEX between?

            regards, tom lane

Re: is it possible to make this faster?

From
"Merlin Moncure"
Date:
On 5/26/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Merlin Moncure" <mmoncure@gmail.com> writes:
> > did you have a key on a,b,c?
> Yeah, I did
>         create index t1i on t1 (a,b,c);
> Do I need to use some other syntax to get it to work?

can't thing of anything, I'm running completely stock, did you do a
optimize table foo? is the wind blowing in the right direction?

> > select count(*) from (select a,b,max(c) group by a,b) q;
> > blows the high performance case as does putting the query in a view.

> I noticed that too, while trying to suppress the returning of the
> results for timing purposes ... still a few bugs in their optimizer
> obviously.  (Curiously, EXPLAIN still claims that the index is being
> used.)

well, they do some tricky things pg can't do for architectural reasons
but the special case is obviously hard to get right.  I suppose this
kinda agrues against doing all kinds of acrobatics to optimize mvcc
weak cases like the above and count(*)...better to make heap access as
quick as possible.

> > mysql> select user_id, acc_id, max(sample_date) from usage_samples group by 1,2
> > 939 rows in set (0.07 sec)

> > mysql> select user_id, acc_id, max(sample_date) from usage_samples group by 1,2
> > 939 rows in set (1.39 sec)

oops, pasted the wrong query..case 2 should have been
select user_id, acc_id, max(sample_date), disksize from usage_samples
group by 1,2
illustrating what going to the heap does to the time.

merlin

Re: is it possible to make this faster?

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> can't thing of anything, I'm running completely stock, did you do a
> optimize table foo?

Nope, never heard of that before.  But I did it, and it doesn't seem to
have changed my results at all.

> mysql> select user_id, acc_id, max(sample_date) from usage_samples group by 1,2
> 939 rows in set (0.07 sec)

0.07 seconds is not impossibly out of line with my result of 0.15 sec,
maybe your machine is just 2X faster than mine.  This is a 2.8GHz dual
Xeon EM64T, what are you testing?  You said "less than 20 msec" before,
what was that on?

            regards, tom lane

Re: is it possible to make this faster?

From
"Merlin Moncure"
Date:
On 5/26/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > mysql> select user_id, acc_id, max(sample_date) from usage_samples group by 1,2
> > 939 rows in set (0.07 sec)
>
> 0.07 seconds is not impossibly out of line with my result of 0.15 sec,
> maybe your machine is just 2X faster than mine.  This is a 2.8GHz dual
> Xeon EM64T, what are you testing?  You said "less than 20 msec" before,
> what was that on?

600 mhz p3: 70 ms, 1100 ms slow case
1600 mhz p4: 10-30ms (mysql timer not very precise) 710ms slow case
quad opteron 865: 0 :-)
dual p3 1133 Mhz xeon, mysql 4.0.16: 500 ms

using steinar's 'substitute group by' for pg I get 40ms on the p3 and
low times on all else.  your time of 150 ms is looking like the slow
case on my results.

merlin

Re: is it possible to make this faster?

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> your time of 150 ms is looking like the slow case on my results.

Yeah... so what's wrong with my test?  Anyone else care to duplicate
the test and see what they get?

            regards, tom lane

Re: is it possible to make this faster?

From
Mark Kirkwood
Date:
Tom Lane wrote:
> "Merlin Moncure" <mmoncure@gmail.com> writes:
>> your time of 150 ms is looking like the slow case on my results.
>
> Yeah... so what's wrong with my test?  Anyone else care to duplicate
> the test and see what they get?

Using your test [generating c from int(rand(1000))], I get 230 ms using
5.0.18 on a P3 1000 Mhz (doing optimize table on t made no difference at
all).

Cheers

Mark