Thread: Additional select fields in a GROUP BY

Additional select fields in a GROUP BY

From
Vitaly Belman
Date:
Hello,

Consider the following query:

select t1field1, avg(t2fieild2)
from t1, t2
where t1.field1 = t2.field2
group by t1field1

That works fine. But I'd really like to see more fields of t1 in this
query, however I can't add them into the select because they're not
part of the GROUP BY, thus I have to add them to there too:

select t1field1, t1field2, t1field3, avg(t2fieild2)
from t1, t2
where t1.field1 = t2.field2
group by t1field1, t1field2, t1field3

The problem is that addind them all to GROUP BY causes a performance
loss.. The only solution I found is using a subquery like this:

select * from
t1, (select t1field1, avg(t2fieild2)
from t1, t2
where t1.field1 = t2.field2
group by t1field1) t1inner
where t1.field1 = t1inner.field1

It works just fine.. But I prefer not to use subqueries unless I am
really forced to due to the design of my application.

Another solution I considered is using aggreate function like that:

select t1field1, max(t1field2), max(t1field3), avg(t2fieild2)
from t1, t2
where t1.field1 = t2.field2
group by t1field1

Sadly, this caused the same performance... I wonder though, is it
possible to make an aggregate function like first(), last() in Oracle
(IIRC)? I believe that in such cases MySQL does first() by itself.

Other ideas are welcome too.


Regards,
 Vitaly Belman

 ICQ: 1912453
 AIM: VitalyB1984
 MSN: tmdagent@hotmail.com
 Yahoo!: VitalyBe


Re: Additional select fields in a GROUP BY

From
Tom Lane
Date:
Vitaly Belman <vitalib@012.net.il> writes:
> The problem is that addind them all to GROUP BY causes a performance
> loss.

Really?  I'd think that there'd be no visible loss if the earlier
fields of the GROUP BY are already unique.  The sort comparison
should stop at the first field that determines the sort order.
Can you provide a self-contained example?

            regards, tom lane

Re: Additional select fields in a GROUP BY

From
Bruno Wolff III
Date:
On Sun, Jun 13, 2004 at 06:21:17 +0300,
  Vitaly Belman <vitalib@012.net.il> wrote:
>
> Consider the following query:
>
> select t1field1, avg(t2fieild2)
> from t1, t2
> where t1.field1 = t2.field2
> group by t1field1
>
> That works fine. But I'd really like to see more fields of t1 in this
> query, however I can't add them into the select because they're not
> part of the GROUP BY, thus I have to add them to there too:

If t1.field1 is a candiate key for t1, then the normal thing to do is
to group t2 by t2.field1 (assuming you really meant to join on t2.field1,
not t2.field2) and THEN join to t1. That may even be faster than the way you
are doing things now.

So the query would look like:

SELECT t1.field1, t1.field2, t1.field3, a.t2avg FROM t1,
  (SELECT field1, avg(field2) as t2avg FROM t2 GROUP BY field1) as a
  WHERE t1.field1 = a.field1

Re: Additional select fields in a GROUP BY

From
Vitaly Belman
Date:
Bruno:

It wasn't exactly my case but you did give me an idea by this tip,
changing a perspective did quite good to the timing of this query.

Tom:

Hmm.. I am not sure how I can demonstrate this to you... To see the
time differences you'd need the whole table.. That's quite a lot of
data to be posted on a mailing list, if you wish to test it on your
side, I'll dump this table partly and send them to you somehow.

I do stand by what I said though, here's the real query example:

Original query (execution time, 800ms):

select s.series_id, avg(vote_avg), sum(vote_count)
from v_bookseries s, bv_seriesgenres sg
where s.series_id = sg.series_id and sg.genre_id = 1
group by s.series_id
order by sum(vote_count) desc
limit 10

QUERY PLAN:

Limit  (cost=6523.51..6523.53 rows=10 width=12)
  ->  Sort  (cost=6523.51..6566.27 rows=17104 width=12)
        Sort Key: sum(b.vote_count)
        ->  GroupAggregate  (cost=1368.54..5320.92 rows=17104 width=12)
              ->  Merge Join  (cost=1368.54..4796.91 rows=58466 width=12)
                    Merge Cond: ("outer".series_id = "inner".series_id)
                    ->  Merge Join  (cost=0.00..6676.41 rows=65902 width=16)
                          Merge Cond: ("outer".series_id = "inner".series_id)
                          ->  Index Scan using bv_series_pkey on
bv_series s  (cost=0.00..386.83 rows=17104 width=4)
                          ->  Index Scan using i_books_series_id on
bv_books b  (cost=0.00..14148.38 rows=171918 width=12)
                    ->  Sort  (cost=1368.54..1406.47 rows=15173 width=4)
                          Sort Key: sg.series_id
                          ->  Index Scan using i_seriesgenres_genre_id
on bv_seriesgenres sg  (cost=0.00..314.83 rows=15173 width=4)
                                Index Cond: (genre_id = 1)


Query with added GROUP BY members (execution time, 1400ms):

select s.series_id, s.series_name, s.series_picture, avg(vote_avg),
sum(vote_count)
from v_bookseries s, bv_seriesgenres sg
where s.series_id = sg.series_id and sg.genre_id = 1
group by s.series_id, s.series_name, s.series_picture
order by sum(vote_count) desc
limit 10

QUERY PLAN:

Limit  (cost=12619.76..12619.79 rows=10 width=47)
  ->  Sort  (cost=12619.76..12662.52 rows=17104 width=47)
        Sort Key: sum(b.vote_count)
        ->  GroupAggregate  (cost=10454.67..11417.18 rows=17104 width=47)
              ->  Sort  (cost=10454.67..10600.83 rows=58466 width=47)
                    Sort Key: s.series_id, s.series_name, s.series_picture
                    ->  Merge Join  (cost=1368.54..4796.91 rows=58466 width=47)
                          Merge Cond: ("outer".series_id = "inner".series_id)
                          ->  Merge Join  (cost=0.00..6676.41
rows=65902 width=51)
                                Merge Cond: ("outer".series_id =
"inner".series_id)
                                ->  Index Scan using bv_series_pkey on
bv_series s  (cost=0.00..386.83 rows=17104 width=39)
                                ->  Index Scan using i_books_series_id
on bv_books b  (cost=0.00..14148.38 rows=171918 width=12)
                          ->  Sort  (cost=1368.54..1406.47 rows=15173 width=4)
                                Sort Key: sg.series_id
                                ->  Index Scan using
i_seriesgenres_genre_id on bv_seriesgenres sg  (cost=0.00..314.83
rows=15173 width=4)
                                      Index Cond: (genre_id = 1)

Notice that the GROUP BY items added the following to the plan:

        ->  GroupAggregate  (cost=10454.67..11417.18 rows=17104 width=47)
              ->  Sort  (cost=10454.67..10600.83 rows=58466 width=47)
                    Sort Key: s.series_id, s.series_name, s.series_picture

Which eventually almost doubles the execution time.


On Sun, 13 Jun 2004 08:52:12 -0500, Bruno Wolff III <bruno@wolff.to> wrote:
>
> On Sun, Jun 13, 2004 at 06:21:17 +0300,
>  Vitaly Belman <vitalib@012.net.il> wrote:
> >
> > Consider the following query:
> >
> > select t1field1, avg(t2fieild2)
> > from t1, t2
> > where t1.field1 = t2.field2
> > group by t1field1
> >
> > That works fine. But I'd really like to see more fields of t1 in this
> > query, however I can't add them into the select because they're not
> > part of the GROUP BY, thus I have to add them to there too:
>
> If t1.field1 is a candiate key for t1, then the normal thing to do is
> to group t2 by t2.field1 (assuming you really meant to join on t2.field1,
> not t2.field2) and THEN join to t1. That may even be faster than the way you
> are doing things now.
>
> So the query would look like:
>
> SELECT t1.field1, t1.field2, t1.field3, a.t2avg FROM t1,
>  (SELECT field1, avg(field2) as t2avg FROM t2 GROUP BY field1) as a
>  WHERE t1.field1 = a.field1
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

Re: Additional select fields in a GROUP BY

From
Tom Lane
Date:
Vitaly Belman <vitalyb@gmail.com> writes:
> Notice that the GROUP BY items added the following to the plan:

>               ->  Sort  (cost=10454.67..10600.83 rows=58466 width=47)
>                     Sort Key: s.series_id, s.series_name, s.series_picture

Oh, I see: in the first case you need no sort at all because the output
of the indexscan is already known to be sorted by s.series_id.  I was
thinking of a sort with more or fewer sort columns, but that's not the
issue here.

            regards, tom lane