Thread: Better way to compute moving averages?
I've been asked to compute various moving averages of end-of-day sales by store. I can do this for all rows with no problem (the same query without the WHERE clause towards the end). That query took 10-15 minutes to run over approx 3.4 million rows. I'm sure they will want this information added to the daily end-of-day reports. I can run the query below for a single day, but it takes almost as long to run for one day as it does for the entire dataset; it looks like the inner select is still running over the entire dataset. I have added a "WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '365 days'" (as below) to the inner query, which allows the query to run in 1-2 minutes.

Question 1) This seems to work, but I was curious whether there is a better way.

Question 2) Is there a way to specify a date as a variable, instead of using current_date and CURRENT_TIMESTAMP, and use that in the query? I know I can do that in my Java program using variables, but wasn't sure if there was a way to do this with a function or stored procedure?
INSERT INTO historical_data_avg
    (store_id, date, avg7sales, avg14sales, avg30sales, avg60sales, avg90sales, avg180sales)
SELECT t1.store_id, t1.eod_ts, t1.avg5sales, t1.avg10sales, t1.avg20sales,
       t1.avg50sales, t1.avg100sales, t1.avg180sales
FROM (
    SELECT store_id, eod_ts,
           avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
                                ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING)   AS avg5sales,
           avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
                                ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING)   AS avg10sales,
           avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
                                ROWS BETWEEN CURRENT ROW AND 19 FOLLOWING)  AS avg20sales,
           avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
                                ROWS BETWEEN CURRENT ROW AND 49 FOLLOWING)  AS avg50sales,
           avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
                                ROWS BETWEEN CURRENT ROW AND 99 FOLLOWING)  AS avg100sales,
           avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
                                ROWS BETWEEN CURRENT ROW AND 179 FOLLOWING) AS avg180sales
    FROM end_of_day_data
    WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '260 days'
    GROUP BY store_id, eod_ts, eod_sales
    ORDER BY store_id, eod_ts
) AS t1
WHERE t1.eod_ts = current_date;
Jason Aleksi wrote:
> I've been asked to compute various moving averages of end-of-day sales by
> store. I can do this for all rows with no problem (same query without
> the WHERE clause towards the end). That query took 10-15 minutes to run
> over approx 3.4 million rows. [...] I have added a
> "WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '365 days'" (as below) to
> the inner query, which allows the query to run in 1-2 minutes.
>
> Question 1) This seems to work, but was curious if there is a better way.

Possibly, though I'm not sure you've given enough info...

Regardless, are you positive you need "FOLLOWING" instead of "PRECEDING" frames?

> Question 2) Is there a way to specify a date, instead of using current
> date and current_timestamp, as a variable and use that in the query? I
> know I can do that in my Java program using variables, but wasn't sure
> if there was a way to do this with a function or stored procedure?

Have you read about functions in the documentation?

CREATE FUNCTION echo_text(input_var text)
RETURNS text
AS $$
    SELECT input_var;
$$
LANGUAGE sql
STRICT;

> It looks like when I do the inner select, it is still running over the
> entire dataset. I have added a
> "WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '365 days'" (as below) to
> the inner query, which allows the query to run in 1-2 minutes.
> [...]
> FROM end_of_day_data
> WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '260 days'
> GROUP BY store_id, eod_ts, eod_sales
> ORDER BY store_id, eod_ts
> ) AS t1
> WHERE t1.eod_ts = current_date
> );

You expected differently? GROUP BY (in particular, though not only) prevents any kind of push-down of quals.
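To illustrate the FOLLOWING/PRECEDING point: a frame of `CURRENT ROW AND 4 FOLLOWING` over a DESC ordering covers the same five rows as `4 PRECEDING AND CURRENT ROW` over an ASC ordering, so either framing works. A fragment (not a complete query) sketching the two equivalent forms:

```sql
-- These two window expressions average the same five rows per store:
avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
                     ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING)

avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts ASC
                     ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
```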
Try something like (not tested or tried - just an idea that occurred to me this moment):

WITH scope_of_source_with_aging AS (
    SELECT eod_ts,
           first(eod_ts) OVER () AS ref_eod,
           eod_ts - first(eod_ts) OVER (ORDER BY eod_ts) AS eod_age
    FROM end_of_day_data
    WHERE eod_ts >= current_timestamp - interval '206 days'
)
SELECT ref_eod,
       avg(CASE WHEN eod_age BETWEEN 0 AND 7 THEN ... ELSE NULL END) AS average_7day
       [, ...]
FROM scope_of_source_with_aging
GROUP BY ref_eod;

The first, and most important, part is to limit your source data to only the desired time period. The rest of the above simply tries to limit how much work needs to be done. Window functions, while great, are fairly slow, so if you can minimize their use - in this case by limiting them to simply calculating an age - it should help performance.

The concept above should be generally applicable to your situation, but you will need to clarify your needs and then modify the WHERE and window clauses as appropriate.

I also don't recall whether there is a native "first" function, though you can either write one yourself or possibly add an appropriate range clause that will accomplish the same goal.

David J.

--
View this message in context: http://postgresql.nabble.com/Better-way-to-compute-moving-averages-tp5835291p5835295.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.
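A concrete variant of the sketch above (untested; since `first()` is not a built-in aggregate, it uses `max(eod_ts) OVER ()` as the reference day, adds the per-store grouping the original problem needs, and uses aggregate `FILTER`, which requires PostgreSQL 9.4 - on older versions the `CASE` form works). Note that ages here count calendar days, whereas the original `ROWS` frames count rows. The CTE and column names are illustrative:

```sql
-- Sketch, not tested: bucket each row by its age relative to the most
-- recent day in the filtered window, then average per bucket.
WITH aged AS (
    SELECT store_id,
           eod_sales,
           max(eod_ts) OVER () AS ref_eod,            -- most recent day
           max(eod_ts) OVER () - eod_ts AS eod_age    -- interval since then
    FROM end_of_day_data
    WHERE eod_ts > current_timestamp - INTERVAL '180 days'
)
SELECT store_id,
       ref_eod,
       avg(eod_sales) FILTER (WHERE eod_age < INTERVAL '7 days')   AS avg7sales,
       avg(eod_sales) FILTER (WHERE eod_age < INTERVAL '30 days')  AS avg30sales,
       avg(eod_sales) FILTER (WHERE eod_age < INTERVAL '180 days') AS avg180sales
FROM aged
GROUP BY store_id, ref_eod;
```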
On 1/23/2015 6:50 PM, David G Johnston wrote:
> Regardless, are you positive you need "FOLLOWING" instead of "PRECEDING"
> frames?
> [...]
> Have you read about functions in the documentation?
> [...]
>
> David J.

Thanks for the info.

1) Yes, I'm pretty sure I need FOLLOWING because I ordered eod_ts (a timestamp) using DESC.

2) The only functions I've created in the past have related primarily to queries that don't return any data - things such as "purge duplicate rows" - nothing that used input and output, only tasks executed on the backend.

Also, I'll look at the WITH method again. I saw it in an earlier search, but didn't fully understand it and looked for another way.
In both cases, thanks for the reassurance and for pointing me in a direction to explore.

Regards,
JA
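On Question 2, a minimal sketch of a function that takes the reference date as a parameter (untested; the function name is illustrative, the body is trimmed to a single average for brevity, and the real version would carry all of the poster's columns with `current_date` replaced by the parameter):

```sql
-- Hypothetical sketch: parameterize the reference date instead of
-- hard-coding current_date / CURRENT_TIMESTAMP in the query.
CREATE FUNCTION load_eod_averages(ref_date date)
RETURNS void AS $$
    INSERT INTO historical_data_avg (store_id, date, avg7sales)
    SELECT t1.store_id, t1.eod_ts, t1.avg7sales
    FROM (
        SELECT store_id, eod_ts,
               avg(eod_sales) OVER (PARTITION BY store_id
                                    ORDER BY eod_ts DESC
                                    ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING)
                   AS avg7sales
        FROM end_of_day_data
        WHERE eod_ts >  ref_date - INTERVAL '365 days'
          AND eod_ts <= ref_date
    ) AS t1
    WHERE t1.eod_ts::date = ref_date
$$ LANGUAGE sql;

-- Usage, e.g. from Java via a plain statement:
-- SELECT load_eod_averages(DATE '2015-01-23');
```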