Thread: Why does adding SUM and GROUP BY destroy performance?
Hi,

Why does adding SUM and GROUP BY destroy performance? Details follow.

Thanks, David Link

s1.sql:

  SELECT t.tid, t.title,
         COALESCE(s0c100r100.units, 0) as w0c100r100units,
         (COALESCE(r1c2r100.units, 0) + COALESCE(y0c2r100.units, 0)) as r0c2r100units
    FROM title t
    JOIN upc u1 ON t.tid = u1.tid
    LEFT OUTER JOIN sale_200331 s0c100r100
      ON u1.upc = s0c100r100.upc
     AND s0c100r100.week = 200331
     AND s0c100r100.channel = 100
     AND s0c100r100.region = 100
    LEFT OUTER JOIN rtd r1c2r100
      ON u1.upc = r1c2r100.upc
     AND r1c2r100.year = 2002
     AND r1c2r100.channel = 2
     AND r1c2r100.region = 100
    LEFT OUTER JOIN ytd_200331 y0c2r100
      ON u1.upc = y0c2r100.upc
     AND y0c2r100.week = 200331
     AND y0c2r100.channel = 2
     AND y0c2r100.region = 100
    LEFT OUTER JOIN media m ON t.media = m.key
    LEFT OUTER JOIN screen_format sf ON t.screen_format = sf.key
   WHERE t.distributor != 'CONTROL LABEL'
   ORDER BY t.title ASC
   LIMIT 50;

s2.sql:

  SELECT t.tid, t.title,
         SUM(COALESCE(s0c100r100.units, 0)) as w0c100r100units,
         SUM(COALESCE(r1c2r100.units, 0) + COALESCE(y0c2r100.units, 0)) as r0c2r100units
    FROM title t
    JOIN upc u1 ON t.tid = u1.tid
    LEFT OUTER JOIN sale_200331 s0c100r100
      ON u1.upc = s0c100r100.upc
     AND s0c100r100.week = 200331
     AND s0c100r100.channel = 100
     AND s0c100r100.region = 100
    LEFT OUTER JOIN rtd r1c2r100
      ON u1.upc = r1c2r100.upc
     AND r1c2r100.year = 2002
     AND r1c2r100.channel = 2
     AND r1c2r100.region = 100
    LEFT OUTER JOIN ytd_200331 y0c2r100
      ON u1.upc = y0c2r100.upc
     AND y0c2r100.week = 200331
     AND y0c2r100.channel = 2
     AND y0c2r100.region = 100
    LEFT OUTER JOIN media m ON t.media = m.key
    LEFT OUTER JOIN screen_format sf ON t.screen_format = sf.key
   WHERE t.distributor != 'CONTROL LABEL'
   GROUP BY t.tid, t.title
   ORDER BY t.title ASC
   LIMIT 50;

Times:
  s1.sql takes 0m0.124s
  s2.sql takes 1m1.450s

Stats:
  title table:        68,000 rows
  sale_200331 table: 150,000 rows
  ytd_200331 table:        0 rows
  rtd table:         650,000 rows

Indexes are in place.
s1 explain plan:

                                                      QUERY PLAN
 --------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..65105.51 rows=50 width=132)
   ->  Nested Loop  (cost=0.00..91726868.54 rows=70445 width=132)
         Join Filter: ("outer".screen_format = "inner"."key")
         ->  Nested Loop  (cost=0.00..91651668.74 rows=70445 width=127)
               Join Filter: ("outer".media = "inner"."key")
               ->  Nested Loop  (cost=0.00..91578053.95 rows=70445 width=122)
                     ->  Nested Loop  (cost=0.00..91236359.89 rows=70445 width=98)
                           ->  Nested Loop  (cost=0.00..90894665.82 rows=70445 width=74)
                                 ->  Nested Loop  (cost=0.00..90539626.76 rows=70445 width=50)
                                       ->  Index Scan using title_title_ind on title t  (cost=0.00..193051.67 rows=68775 width=38)
                                             Filter: (distributor <> 'CONTROL LABEL'::character varying)
                                       ->  Index Scan using davids_tid_index on upc u1  (cost=0.00..1309.24 rows=353 width=12)
                                             Index Cond: ("outer".tid = u1.tid)
                                 ->  Index Scan using sale_200331_upc_wk_chl_reg_ind on sale_200331 s0c100r100  (cost=0.00..5.02 rows=1 width=24)
                                       Index Cond: (("outer".upc = s0c100r100.upc) AND (s0c100r100.week = 200331) AND (s0c100r100.channel = 100) AND (s0c100r100.region = 100))
                           ->  Index Scan using rtd_upc_year_chl_reg_ind on rtd r1c2r100  (cost=0.00..4.83 rows=1 width=24)
                                 Index Cond: (("outer".upc = r1c2r100.upc) AND (r1c2r100."year" = 2002) AND (r1c2r100.channel = 2) AND (r1c2r100.region = 100))
                     ->  Index Scan using ytd_200331_upc_wkchlreg_ind on ytd_200331 y0c2r100  (cost=0.00..4.83 rows=1 width=24)
                           Index Cond: (("outer".upc = y0c2r100.upc) AND (y0c2r100.week = 200331) AND (y0c2r100.channel = 2) AND (y0c2r100.region = 100))
               ->  Seq Scan on media m  (cost=0.00..1.02 rows=2 width=5)
         ->  Seq Scan on screen_format sf  (cost=0.00..1.03 rows=3 width=5)
(21 rows)

s2 explain plan:

                                                      QUERY PLAN
 --------------------------------------------------------------------------------------------------------------------
 Limit  (cost=403996.99..403997.11 rows=50 width=132)
   ->  Sort  (cost=403996.99..404014.60 rows=7044 width=132)
         Sort Key: t.title
         ->  Aggregate  (cost=402393.74..403274.30 rows=7044 width=132)
               ->  Group  (cost=402393.74..402922.08 rows=70445 width=132)
                     ->  Sort  (cost=402393.74..402569.86 rows=70445 width=132)
                           Sort Key: t.tid, t.title
                           ->  Hash Join  (cost=375382.76..392011.46 rows=70445 width=132)
                                 Hash Cond: ("outer".screen_format = "inner"."key")
                                 ->  Hash Join  (cost=375381.72..390997.78 rows=70445 width=127)
                                       Hash Cond: ("outer".media = "inner"."key")
                                       ->  Merge Join  (cost=375380.70..390057.49 rows=70445 width=122)
                                             Merge Cond: ("outer".upc = "inner".upc)
                                             Join Filter: (("inner".week = 200331) AND ("inner".channel = 2) AND ("inner".region = 100))
                                             ->  Merge Join  (cost=375380.70..382782.40 rows=70445 width=98)
                                                   Merge Cond: ("outer".upc = "inner".upc)
                                                   Join Filter: (("inner"."year" = 2002) AND ("inner".channel = 2) AND ("inner".region = 100))
                                                   ->  Sort  (cost=375310.87..375486.98 rows=70445 width=74)
                                                         Sort Key: u1.upc
                                                         ->  Nested Loop  (cost=6348.20..367282.53 rows=70445 width=74)
                                                               ->  Hash Join  (cost=6348.20..12243.46 rows=70445 width=50)
                                                                     Hash Cond: ("outer".tid = "inner".tid)
                                                                     ->  Seq Scan on upc u1  (cost=0.00..2795.28 rows=70628 width=12)
                                                                     ->  Hash  (cost=4114.93..4114.93 rows=68775 width=38)
                                                                           ->  Seq Scan on title t  (cost=0.00..4114.93 rows=68775 width=38)
                                                                                 Filter: (distributor <> 'CONTROL LABEL'::character varying)
                                                               ->  Index Scan using sale_200331_upc_wk_chl_reg_ind on sale_200331 s0c100r100  (cost=0.00..5.02 rows=1 width=24)
                                                                     Index Cond: (("outer".upc = s0c100r100.upc) AND (s0c100r100.week = 200331) AND (s0c100r100.channel = 100) AND (s0c100r100.region = 100))
                                                   ->  Sort  (cost=69.83..72.33 rows=1000 width=24)
                                                         Sort Key: r1c2r100.upc
                                                         ->  Seq Scan on rtd r1c2r100  (cost=0.00..20.00 rows=1000 width=24)
                                             ->  Index Scan using ytd_200331_upc_wkchlreg_ind on ytd_200331 y0c2r100  (cost=0.00..52.00 rows=1000 width=24)
                                 ->  Hash  (cost=1.02..1.02 rows=2 width=5)
                                       ->  Seq Scan on media m  (cost=0.00..1.02 rows=2 width=5)
                           ->  Hash  (cost=1.03..1.03 rows=3 width=5)
                                 ->  Seq Scan on screen_format sf  (cost=0.00..1.03 rows=3 width=5)
(36 rows)
On Wed, 2003-09-17 at 12:51, David Link wrote:
> Hi,
>
> Why does adding SUM and GROUP BY destroy performance?
> Details follow.
> Thanks, David Link

PostgreSQL's generalized UDFs make optimizing for the standard
aggregates very hard to do. There should be some improvement in v7.4,
though.

-- 
-----------------------------------------------------------------
Ron Johnson, Jr.                       ron.l.johnson@cox.net
Jefferson, LA USA

"Those who would give up essential Liberty to purchase a little
temporary safety, deserve neither Liberty nor safety."
   or something like that
   Ben Franklin, maybe
In the last exciting episode, dvlink@yahoo.com (David Link) wrote:
> Why does adding SUM and GROUP BY destroy performance?

When you use SUM (or other aggregates), there are no short cuts to
walking through each and every tuple specified by the WHERE clause.

On some systems there are statistics mechanisms that can short-circuit
that. On PostgreSQL, the use of MVCC to let new data "almost magically
appear" :-) has the demerit, in the case of aggregates, of not leaving
much opening for short cuts.

There are some cases where you CAN do much better than the aggregates
do.

  SELECT MAX(field) FROM table WHERE a = 'THIS' AND b = 'THAT';

may be replaced with the likely-to-be-faster:

  SELECT field FROM table
   WHERE a = 'THIS' AND b = 'THAT'
   ORDER BY field DESC LIMIT 1;

MIN() admits a similar rewriting. If there is an index on field, this
will likely be _way_ faster than using MIN()/MAX().

In a sense, it's not that aggregates "destroy" performance; just that
there are no magical shortcuts to make them incredibly fast.
-- 
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','acm.org').
http://www.ntlug.org/~cbbrowne/multiplexor.html
"And 1.1.81 is officially BugFree(tm), so if you receive any bug
reports on it, you know they are just evil lies."  -- Linus Torvalds
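The MIN() rewriting mentioned above follows the same pattern; a minimal
sketch, reusing the placeholder names from the message (an index on
field, or better a multi-column index on (a, b, field), is assumed):

  -- instead of:
  SELECT MIN(field) FROM table WHERE a = 'THIS' AND b = 'THAT';

  -- an index scan can stop after the first matching row:
  SELECT field FROM table
   WHERE a = 'THIS' AND b = 'THAT'
   ORDER BY field ASC LIMIT 1;

One caveat: if field can be NULL, the two forms are not strictly
interchangeable, since MIN()/MAX() ignore NULLs while ORDER BY simply
sorts them to one end.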
Thanks Ron, thanks Christopher, for your excellent feedback.

I guess it's back to the drawing board. This is a very late-hour
business requirement change, and we need quick, real-time results.

Two things are being considered:

1. Loading the non-aggregate query entirely into memory (using Perl
   CGI currently, but looking at the possibility of doing this in the
   database with either PL/Perl or PL/pgSQL, though I don't know what
   would be gained by doing it that way) and handling the summing and
   then the sort ourselves in the program, -- or --

2. Preprocessing a lot of the data at pre-determined times,
   essentially doubling the size of our database.

I'd be open to any other suggestions.

Thanks again, very much.
-David
Christopher Browne wrote:
> In the last exciting episode, dvlink@yahoo.com (David Link) wrote:
>
>> Why does adding SUM and GROUP BY destroy performance?
>
> When you use SUM (or other aggregates), there are no short cuts to
> walking through each and every tuple specified by the WHERE clause.

Er... not in this case, if I read David's email correctly.

His first query is walking through every tuple anyway. His second
query is the one summing them up, AFTER (here's the critical part)
GROUPing them by t.tid.

I suspect 7.4 (now in beta), or rewriting the query for <7.4, would
speed things up. 7.4's Hash Aggregate would be the winner here.

As for rewriting this, David, try:

  SELECT t.tid, t.title,
         (select the stuff you want from lots of tables
           where something = t.tid)
    FROM title t;

Doubt it'll be as fast as using 7.4, though.

-- 
Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386 GNU/Linux
  3:00pm  up 267 days,  6:26,  4 users,  load average: 5.44, 5.26, 5.17
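Spelled out against the original query, that subselect rewrite might
look roughly like the sketch below. It is not exactly equivalent:
titles with no upc rows would appear here with zero units (an EXISTS
filter could restore the inner-join behaviour), and, unlike s2.sql, it
cannot over-count when one upc matches several rows in more than one
of the sales tables.

  SELECT t.tid,
         t.title,
         (SELECT COALESCE(SUM(s.units), 0)
            FROM upc u
            JOIN sale_200331 s ON s.upc = u.upc
           WHERE u.tid = t.tid
             AND s.week = 200331 AND s.channel = 100 AND s.region = 100
         ) AS w0c100r100units,
         (SELECT COALESCE(SUM(r.units), 0)
            FROM upc u
            JOIN rtd r ON r.upc = u.upc
           WHERE u.tid = t.tid
             AND r.year = 2002 AND r.channel = 2 AND r.region = 100
         )
         +
         (SELECT COALESCE(SUM(y.units), 0)
            FROM upc u
            JOIN ytd_200331 y ON y.upc = u.upc
           WHERE u.tid = t.tid
             AND y.week = 200331 AND y.channel = 2 AND y.region = 100
         ) AS r0c2r100units
    FROM title t
   WHERE t.distributor != 'CONTROL LABEL'
   ORDER BY t.title ASC
   LIMIT 50;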
Ang Chin Han <angch@bytecraft.com.my> writes:
> Christopher Browne wrote:
>> When you use SUM (or other aggregates), there are no short cuts to
>> walking through each and every tuple specified by the WHERE clause.

> Er... not in this case, if I read David's email correctly.
> His first query is walking through every tuple anyway.

No, it isn't, because he had a LIMIT. I think the real point is that
computing the first fifty groups requires sucking in a lot more tuples
than just computing the first fifty rows.

			regards, tom lane
dvlink@yahoo.com (David Link) writes:
> Thanks Ron, thanks Christopher, for your excellent feedback.
>
> I guess it's back to the drawing board. This is a very late-hour
> business requirement change, and we need quick, real-time results.
>
> Two things are being considered:
>
> 1. Loading the non-aggregate query entirely into memory (using Perl
>    CGI currently, but looking at the possibility of doing this in the
>    database with either PL/Perl or PL/pgSQL, though I don't know what
>    would be gained by doing it that way) and handling the summing and
>    then the sort ourselves in the program, -- or --

I wouldn't think that this would actually help, either way.

The problem with the aggregates is that they require that the
individual records all get visited, and the stats added together.

Suppose there are a million records, each 100 bytes in size. Doing the
aggregate in the database means reading through 1M records and having
the DBMS summarize them. That's pretty expensive.

If you push the work into an in-memory Perl program, that doesn't
help; it means that each time it runs, it has to pull 100,000,000
bytes of data across the network connection, in addition to all the
other work. That should be SLOWER than using the aggregate.

I think you may be misunderstanding the nature of the problem. It's
not that aggregates are forcibly "performance destroying;" it's that
they require walking through the data, which happens not to be cheap
even though the result set may be pretty small.

> 2. Preprocessing a lot of the data at pre-determined times,
>    essentially doubling the size of our database.
>
> I'd be open to any other suggestions.

Well, if you have a field that allows handling it in a time-based
manner, and the aggregate is static/well-defined, you might create a
table that is used to collect daily summaries. Every new day, you
would do something like:

  insert into summary_table (date, gparameter, count)
    (select '2003-09-16'::date, gparameter, count(*)
       from detail_table
      where trans_on between '2003-09-16' and '2003-09-17'
      group by gparameter);

Then set up a view...

  create view summary as
    select gparameter, sum(count) as count
      from ((select gparameter, count from summary_table)
            UNION ALL
            (select gparameter, count(*) as count
               from detail_table
              where trans_on > '2003-09-17'
              group by gparameter)) as combined
     group by gparameter;

I am assuming that there is an index on trans_on; you might want to
handle that a bit differently.

I leave, as an exercise for the reader, what to do about the
'2003-09-17' in the view. That probably needs to be more dynamic. Or
perhaps not; perhaps the process that augments the summary table would
drop and recreate the view.

Another improvement would be for the hourly/daily process to further
collapse summary_table so it only has one entry per 'gparameter'
grouping.
-- 
"cbbrowne","@","libertyrms.info" <http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
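One possible way to make the '2003-09-17' cutoff dynamic, sketched with
a hypothetical one-row control table (summary_cutoff is not from the
original post): the job that fills summary_table also advances the
cutoff, and the view reads it instead of a literal date.

  -- hypothetical control table recording how far summary_table is filled
  create table summary_cutoff (cutoff date not null);
  insert into summary_cutoff values ('2003-09-17');

  create view summary as
    select gparameter, sum(count) as count
      from ((select gparameter, count from summary_table)
            UNION ALL
            (select gparameter, count(*) as count
               from detail_table
              where trans_on > (select cutoff from summary_cutoff)
              group by gparameter)) as combined
     group by gparameter;

  -- nightly, in the same transaction that inserts the day's summary rows:
  update summary_cutoff set cutoff = cutoff + 1;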