Thread: Why does adding SUM and GROUP BY destroy performance?

Why does adding SUM and GROUP BY destroy performance?

From: David Link

Hi,

Why does adding SUM and GROUP BY destroy performance?
Details follow.
Thanks, David Link

s1.sql:
SELECT
    t.tid, t.title,
    COALESCE(s0c100r100.units, 0) as w0c100r100units,
    (COALESCE(r1c2r100.units, 0) + COALESCE(y0c2r100.units, 0))
        as r0c2r100units
FROM
    title t
    JOIN upc u1 ON t.tid = u1.tid
    LEFT OUTER JOIN sale_200331 s0c100r100 ON u1.upc = s0c100r100.upc
        AND s0c100r100.week = 200331 AND s0c100r100.channel = 100
        AND s0c100r100.region = 100
    LEFT OUTER JOIN rtd r1c2r100 ON u1.upc = r1c2r100.upc
        AND r1c2r100.year = 2002 AND r1c2r100.channel = 2
        AND r1c2r100.region = 100
    LEFT OUTER JOIN ytd_200331 y0c2r100 ON u1.upc = y0c2r100.upc
        AND y0c2r100.week = 200331 AND y0c2r100.channel = 2
        AND y0c2r100.region = 100
    LEFT OUTER JOIN media m ON t.media = m.key
    LEFT OUTER JOIN screen_format sf ON t.screen_format = sf.key
WHERE
    t.distributor != 'CONTROL LABEL'
ORDER BY
    t.title ASC
LIMIT 50
;


s2.sql:
SELECT
    t.tid, t.title,
    SUM(COALESCE(s0c100r100.units, 0)) as w0c100r100units,
    SUM((COALESCE(r1c2r100.units, 0) + COALESCE(y0c2r100.units, 0)))
        as r0c2r100units
FROM
    title t
    JOIN upc u1 ON t.tid = u1.tid
    LEFT OUTER JOIN sale_200331 s0c100r100 ON u1.upc = s0c100r100.upc
        AND s0c100r100.week = 200331 AND s0c100r100.channel = 100
        AND s0c100r100.region = 100
    LEFT OUTER JOIN rtd r1c2r100 ON u1.upc = r1c2r100.upc
        AND r1c2r100.year = 2002 AND r1c2r100.channel = 2
        AND r1c2r100.region = 100
    LEFT OUTER JOIN ytd_200331 y0c2r100 ON u1.upc = y0c2r100.upc
        AND y0c2r100.week = 200331 AND y0c2r100.channel = 2
        AND y0c2r100.region = 100
    LEFT OUTER JOIN media m ON t.media = m.key
    LEFT OUTER JOIN screen_format sf ON t.screen_format = sf.key
WHERE
    t.distributor != 'CONTROL LABEL'
GROUP BY
    t.tid, t.title
ORDER BY
    t.title ASC
LIMIT 50
;


Times:
 s1.sql takes 0m0.124s
 s2.sql takes 1m1.450s

Stats:
 title table: 68,000 rows
 sale_200331 table: 150,000 rows
 ytd_200331 table: 0 rows
 rtd table: 650,000 rows

Indexes are in place.

s1 explain plan:

                                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..65105.51 rows=50 width=132)
   ->  Nested Loop  (cost=0.00..91726868.54 rows=70445 width=132)
         Join Filter: ("outer".screen_format = "inner"."key")
         ->  Nested Loop  (cost=0.00..91651668.74 rows=70445 width=127)
               Join Filter: ("outer".media = "inner"."key")
               ->  Nested Loop  (cost=0.00..91578053.95 rows=70445 width=122)
                     ->  Nested Loop  (cost=0.00..91236359.89 rows=70445 width=98)
                           ->  Nested Loop  (cost=0.00..90894665.82 rows=70445 width=74)
                                 ->  Nested Loop  (cost=0.00..90539626.76 rows=70445 width=50)
                                       ->  Index Scan using title_title_ind on title t  (cost=0.00..193051.67 rows=68775 width=38)
                                             Filter: (distributor <> 'CONTROL LABEL'::character varying)
                                       ->  Index Scan using davids_tid_index on upc u1  (cost=0.00..1309.24 rows=353 width=12)
                                             Index Cond: ("outer".tid = u1.tid)
                                 ->  Index Scan using sale_200331_upc_wk_chl_reg_ind on sale_200331 s0c100r100  (cost=0.00..5.02 rows=1 width=24)
                                       Index Cond: (("outer".upc = s0c100r100.upc) AND (s0c100r100.week = 200331) AND (s0c100r100.channel = 100) AND (s0c100r100.region = 100))
                           ->  Index Scan using rtd_upc_year_chl_reg_ind on rtd r1c2r100  (cost=0.00..4.83 rows=1 width=24)
                                 Index Cond: (("outer".upc = r1c2r100.upc) AND (r1c2r100."year" = 2002) AND (r1c2r100.channel = 2) AND (r1c2r100.region = 100))
                     ->  Index Scan using ytd_200331_upc_wkchlreg_ind on ytd_200331 y0c2r100  (cost=0.00..4.83 rows=1 width=24)
                           Index Cond: (("outer".upc = y0c2r100.upc) AND (y0c2r100.week = 200331) AND (y0c2r100.channel = 2) AND (y0c2r100.region = 100))
               ->  Seq Scan on media m  (cost=0.00..1.02 rows=2 width=5)
         ->  Seq Scan on screen_format sf  (cost=0.00..1.03 rows=3 width=5)
(21 rows)


s2 explain plan:

                                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=403996.99..403997.11 rows=50 width=132)
   ->  Sort  (cost=403996.99..404014.60 rows=7044 width=132)
         Sort Key: t.title
         ->  Aggregate  (cost=402393.74..403274.30 rows=7044 width=132)
               ->  Group  (cost=402393.74..402922.08 rows=70445 width=132)
                     ->  Sort  (cost=402393.74..402569.86 rows=70445 width=132)
                           Sort Key: t.tid, t.title
                           ->  Hash Join  (cost=375382.76..392011.46 rows=70445 width=132)
                                 Hash Cond: ("outer".screen_format = "inner"."key")
                                 ->  Hash Join  (cost=375381.72..390997.78 rows=70445 width=127)
                                       Hash Cond: ("outer".media = "inner"."key")
                                       ->  Merge Join  (cost=375380.70..390057.49 rows=70445 width=122)
                                             Merge Cond: ("outer".upc = "inner".upc)
                                             Join Filter: (("inner".week = 200331) AND ("inner".channel = 2) AND ("inner".region = 100))
                                             ->  Merge Join  (cost=375380.70..382782.40 rows=70445 width=98)
                                                   Merge Cond: ("outer".upc = "inner".upc)
                                                   Join Filter: (("inner"."year" = 2002) AND ("inner".channel = 2) AND ("inner".region = 100))
                                                   ->  Sort  (cost=375310.87..375486.98 rows=70445 width=74)
                                                         Sort Key: u1.upc
                                                         ->  Nested Loop  (cost=6348.20..367282.53 rows=70445 width=74)
                                                               ->  Hash Join  (cost=6348.20..12243.46 rows=70445 width=50)
                                                                     Hash Cond: ("outer".tid = "inner".tid)
                                                                     ->  Seq Scan on upc u1  (cost=0.00..2795.28 rows=70628 width=12)
                                                                     ->  Hash  (cost=4114.93..4114.93 rows=68775 width=38)
                                                                           ->  Seq Scan on title t  (cost=0.00..4114.93 rows=68775 width=38)
                                                                                 Filter: (distributor <> 'CONTROL LABEL'::character varying)
                                                               ->  Index Scan using sale_200331_upc_wk_chl_reg_ind on sale_200331 s0c100r100  (cost=0.00..5.02 rows=1 width=24)
                                                                     Index Cond: (("outer".upc = s0c100r100.upc) AND (s0c100r100.week = 200331) AND (s0c100r100.channel = 100) AND (s0c100r100.region = 100))
                                                   ->  Sort  (cost=69.83..72.33 rows=1000 width=24)
                                                         Sort Key: r1c2r100.upc
                                                         ->  Seq Scan on rtd r1c2r100  (cost=0.00..20.00 rows=1000 width=24)
                                             ->  Index Scan using ytd_200331_upc_wkchlreg_ind on ytd_200331 y0c2r100  (cost=0.00..52.00 rows=1000 width=24)
                                       ->  Hash  (cost=1.02..1.02 rows=2 width=5)
                                             ->  Seq Scan on media m  (cost=0.00..1.02 rows=2 width=5)
                                 ->  Hash  (cost=1.03..1.03 rows=3 width=5)
                                       ->  Seq Scan on screen_format sf  (cost=0.00..1.03 rows=3 width=5)
(36 rows)





Re: Why does adding SUM and GROUP BY destroy performance?

From: Ron Johnson

On Wed, 2003-09-17 at 12:51, David Link wrote:
> Hi,
>
> Why does adding SUM and GROUP BY destroy performance?
> details follow.
> Thanks, David Link

PostgreSQL's generalized UDFs make optimizing for the standard
aggregates very hard to do.  There should be some improvement in
v7.4, though.

--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA

"Those who would give up essential Liberty to purchase a little
temporary safety, deserve neither Liberty nor safety." or
something like that
Ben Franklin, maybe


Re: Why does adding SUM and GROUP BY destroy performance?

From: Christopher Browne

In the last exciting episode, dvlink@yahoo.com (David Link) wrote:
> Why does adding SUM and GROUP BY destroy performance?

When you use SUM (or other aggregates), there are no short cuts to
walking through each and every tuple specified by the WHERE clause.

On some systems there are statistics mechanisms that can short-circuit
that.  On PostgreSQL, the use of MVCC to let new data "almost
magically appear" :-) has the demerit, in the case of aggregates, of
not leaving much opening for short cuts.

There are some cases where you CAN do much better than the aggregates
do.

  SELECT MAX(FIELD) FROM TABLE WHERE A='THIS' and B='THAT';

may be replaced with the likely-to-be-faster:

  select field from table where a = 'THIS' and b='THAT' order by field
    desc limit 1;

MIN() admits a similar rewriting.  If there is an index on FIELD, this
will likely be _way_ faster than using MIN()/MAX().
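
For instance, the MIN() version of the same hypothetical query just
flips the sort direction:

  select field from table where a = 'THIS' and b='THAT' order by field
    asc limit 1;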

In a sense, it's not that aggregates "destroy" performance; just that
there are no magical shortcuts to make them incredibly fast.
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','acm.org').
http://www.ntlug.org/~cbbrowne/multiplexor.html
"And  1.1.81 is  officially BugFree(tm),  so  if you  receive any  bug
reports on it, you know they are just evil lies." -- Linus Torvalds

Re: Why does adding SUM and GROUP BY destroy performance?

From: David Link

Thanks Ron, Thanks Christopher for your excellent feedback.

I guess it's back to the drawing board.  This is a very late-hour
business requirement change.  And we need quick real-time results.

Two things are being considered:

   1. loading the non-aggregate query entirely into memory (using Perl
CGI currently, but looking at the possibility of doing this in the
database with either PL/Perl or PL/pgSQL, though I don't know what
would be gained by doing it that way) and handling the summing and
then the sort ourselves in the program, or

   2. preprocessing a lot of the data at pre-determined times,
essentially doubling the size of our database.

I'd be open to any other suggestions.

Thanks again, very much.
-David




Re: Why does adding SUM and GROUP BY destroy performance?

From: Ang Chin Han

Christopher Browne wrote:
> In the last exciting episode, dvlink@yahoo.com (David Link) wrote:
>
>>Why does adding SUM and GROUP BY destroy performance?
>
>
> When you use SUM (or other aggregates), there are no short cuts to
> walking through each and every tuple specified by the WHERE clause.

Er... not in this case, if I read David's email correctly.

His first query is walking through every tuple anyway.
His second query is the one summing them up AFTER (and here's the
critical part) GROUPing them by t.tid.

I suspect 7.4 (now in beta), or rewriting the query for <7.4, would
speed things up. 7.4's Hash Aggregate would be the winner here.

As for rewriting this, David, try:

SELECT t.tid, t.title,
   (select the stuff you want from lots of tables where something = t.tid)
FROM
   title t;

Doubt it'll be as fast as using 7.4 though.
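
Filled in against the tables from the original post it might look
something like this (only the first summed column shown; note it is a
sketch rather than a drop-in replacement, since unlike s2.sql it also
returns titles that have no upc rows at all):

SELECT t.tid, t.title,
    COALESCE((SELECT SUM(s.units)
                FROM upc u JOIN sale_200331 s ON s.upc = u.upc
               WHERE u.tid = t.tid
                 AND s.week = 200331 AND s.channel = 100
                 AND s.region = 100), 0) AS w0c100r100units
    -- the rtd and ytd_200331 sums would follow the same pattern
FROM title t
WHERE t.distributor != 'CONTROL LABEL'
ORDER BY t.title ASC
LIMIT 50;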

--
Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386
GNU/Linux
   3:00pm  up 267 days,  6:26,  4 users,  load average: 5.44, 5.26, 5.17


Re: Why does adding SUM and GROUP BY destroy performance?

From: Tom Lane

Ang Chin Han <angch@bytecraft.com.my> writes:
> Christopher Browne wrote:
>> When you use SUM (or other aggregates), there are no short cuts to
>> walking through each and every tuple specified by the WHERE clause.

> Er... not in this case, if I read David's email correctly.

> His first query is walking through every tuple anyway.

No, it isn't, because he had a LIMIT.  I think the real point is that
computing the first fifty groups requires sucking in a lot more tuples
than just computing the first fifty rows.
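
(One way to act on that observation, assuming every title that passes
the WHERE clause has at least one upc row, is to push the LIMIT into a
subselect on title so that only fifty groups are ever formed; a sketch,
showing just one of the summed columns:

SELECT t.tid, t.title,
    SUM(COALESCE(s0c100r100.units, 0)) AS w0c100r100units
FROM
    (SELECT tid, title FROM title
      WHERE distributor != 'CONTROL LABEL'
      ORDER BY title LIMIT 50) t
    JOIN upc u1 ON t.tid = u1.tid
    LEFT OUTER JOIN sale_200331 s0c100r100 ON u1.upc = s0c100r100.upc
        AND s0c100r100.week = 200331 AND s0c100r100.channel = 100
        AND s0c100r100.region = 100
GROUP BY t.tid, t.title
ORDER BY t.title ASC;

The remaining joins and SUMs from s2.sql would be added in the same
way; if some of those fifty titles have no upc rows, fewer than fifty
rows come back.)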

            regards, tom lane

Re: Why does adding SUM and GROUP BY destroy performance?

From: Christopher Browne

dvlink@yahoo.com (David Link) writes:
> Thanks Ron, Thanks Christopher for your excellent feedback.
>
> I guess it's back to the drawing board.  This is a very late hour
> business requirement change.  And we need quick real-time results.
>
> Two things are being considered:
>
>    1. loading the non-aggregate query entirely into memory (using Perl
> CGI currently, but looking at the possibility of doing this in the
> database with either PL/Perl or PL/pgSQL, though I don't know what
> would be gained by doing it that way) and handling the summing and
> then the sort ourselves in the program, or

I wouldn't think that this would actually help, either way.

The problem with the aggregates is that they require that the
individual records all get visited, and the stats added together.

Suppose there are a million records, each 100 bytes in size.

Doing the aggregate on the database means reading through 1M records
and having the DBMS summarize them.  That's pretty expensive.

If you push the work into an in-memory Perl program, that doesn't
help; it means that each time it runs, it has to pull 100,000,000
bytes of data across the network connection, in addition to all the
other work.  That should be SLOWER than using the aggregate.

I think you may be misunderstanding the nature of the problem.  It's
not that aggregates are forcibly "performance destroying;" it's that
they require walking through the data, which happens not to be cheap
even though the result set may be pretty small.

>    2. preprocessing a lot of the data at pre-determined times,
> essentially doubling the size of our database.
>
> I'd be open to any other suggestions.

Well, if you have a field that allows handling it in a time-based
manner, and the aggregate is static/well-defined, you might create a
table that is used to collect daily summaries.

Every new day, you would do something like:

insert into summary_table (date, gparameter, count)
  (select '2003-09-16'::date, gparameter, count(*)
      from detail_table
      where trans_on between '2003-09-16' and '2003-09-17'
         group by gparameter);

Then set up a view...

create view summary as
  select gparameter, sum(count) as count from
    ((select gparameter, count from summary_table)  UNION ALL
     (select gparameter, count(*) as count from detail_table
      where trans_on > '2003-09-17' group by gparameter)) as s
  group by gparameter;

I am assuming that there is an index on trans_on; you might want to
handle that a bit differently.

I leave, as exercise for the reader, what to do about the '2003-09-17'
in the view.  That probably needs to be more dynamic.  Or perhaps not;
perhaps the process that augments the summary table would drop and
recreate the view.

Another improvement would be for the hourly/daily process to further
collapse summary_table so it only has one entry per 'gparameter'
grouping.
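
For what it's worth, that collapse could be as simple as rebuilding the
table from its own rollup (same hypothetical summary_table as above;
here the date column is kept as the most recent day folded in):

begin;
create temp table summary_rollup as
  select max(date) as date, gparameter, sum(count) as count
    from summary_table group by gparameter;
delete from summary_table;
insert into summary_table (date, gparameter, count)
  select date, gparameter, count from summary_rollup;
drop table summary_rollup;
commit;
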
--
"cbbrowne","@","libertyrms.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)