Thread: Best way to aggregate sum for each month

Best way to aggregate sum for each month

From
Andreas Joseph Krogh
Date:
Hi all.
 
I'm using PG-9.4.
 
I'm having a table, activity_log, which holds activities with duration for a given date.
I'm trying to sum the duration for each month in a given year and am currently doing it like this:
 
select
    q.start_date    , sum(log.duration) / (3600 * 1000)::NUMERIC as total_duration
FROM
    (SELECT cast(generate_series('2014-01-01' :: DATE, '2014-12-01' :: DATE, '1 month') AS DATE) as start_date) AS q    LEFT OUTER JOIN activity_log log ON date_trunc('month', log.start_date::timestamp without time zone) = q.start_date
GROUP BY q.start_date
ORDER BY q.start_date
;

I have the current index defined:
 
create index activity_start_month ON activity_log(date_trunc('month', start_date::timestamp without time zone));

 
Here's the explain plan:
 
                                                                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=65.27..91102.17 rows=200 width=12) (actual time=179.983..344.517 rows=12 loops=1)
   Group Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
   ->  Merge Left Join  (cost=65.27..82950.79 rows=1629675 width=12) (actual time=163.894..310.011 rows=143551 loops=1)
         Merge Cond: (((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date) = date_trunc('month'::text, (log.start_date)::timestamp without time zone))
         ->  Sort  (cost=64.84..67.34 rows=1000 width=4) (actual time=0.045..0.052 rows=12 loops=1)
               Sort Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
               Sort Method: quicksort  Memory: 25kB
               ->  Result  (cost=0.00..5.01 rows=1000 width=0) (actual time=0.017..0.031 rows=12 loops=1)
         ->  Materialize  (cost=0.42..51097.29 rows=325935 width=12) (actual time=0.032..227.692 rows=323180 loops=1)
               ->  Index Scan using activity_start_month on activity_log log  (cost=0.42..50282.45 rows=325935 width=12) (actual time=0.029..185.746 rows=323180 loops=1)
 Planning time: 0.201 ms
 Execution time: 344.645 ms
(12 rows)
 
 
Are there any ways to improve this?
 
Thanks.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Attachment

Re: Best way to aggregate sum for each month

From
Naresh Kumar
Date:
Try this,


;WITH cte as
(
SELECT date_trunc('month', log.start_date::timestamp without time zone) AS log_date, sum(log.duration) / (3600 * 1000)::NUMERIC as total_duration
FROM activity_log
GROUP BY date_trunc('month', log.start_date::timestamp without time zone)
)

SELECT q.start_date, cte.total_duration
FROM
    (SELECT cast(generate_series('2014-01-01' :: DATE, '2014-12-01' :: DATE, '1 month') AS DATE) as start_date) AS q
    LEFT OUTER JOIN cte ON cte.log_date = q.start_date

On Fri, Apr 17, 2015 at 5:10 PM, Andreas Joseph Krogh <andreas@visena.com> wrote:
Hi all.
 
I'm using PG-9.4.
 
I'm having a table, activity_log, which holds activities with duration for a given date.
I'm trying to sum the duration for each month in a given year and am currently doing it like this:
 
select
    q.start_date    , sum(log.duration) / (3600 * 1000)::NUMERIC as total_duration
FROM
    (SELECT cast(generate_series('2014-01-01' :: DATE, '2014-12-01' :: DATE, '1 month') AS DATE) as start_date) AS q    LEFT OUTER JOIN activity_log log ON date_trunc('month', log.start_date::timestamp without time zone) = q.start_date
GROUP BY q.start_date
ORDER BY q.start_date
;

I have the current index defined:
 
create index activity_start_month ON activity_log(date_trunc('month', start_date::timestamp without time zone));

 
Here's the explain plan:
 
                                                                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=65.27..91102.17 rows=200 width=12) (actual time=179.983..344.517 rows=12 loops=1)
   Group Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
   ->  Merge Left Join  (cost=65.27..82950.79 rows=1629675 width=12) (actual time=163.894..310.011 rows=143551 loops=1)
         Merge Cond: (((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date) = date_trunc('month'::text, (log.start_date)::timestamp without time zone))
         ->  Sort  (cost=64.84..67.34 rows=1000 width=4) (actual time=0.045..0.052 rows=12 loops=1)
               Sort Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
               Sort Method: quicksort  Memory: 25kB
               ->  Result  (cost=0.00..5.01 rows=1000 width=0) (actual time=0.017..0.031 rows=12 loops=1)
         ->  Materialize  (cost=0.42..51097.29 rows=325935 width=12) (actual time=0.032..227.692 rows=323180 loops=1)
               ->  Index Scan using activity_start_month on activity_log log  (cost=0.42..50282.45 rows=325935 width=12) (actual time=0.029..185.746 rows=323180 loops=1)
 Planning time: 0.201 ms
 Execution time: 344.645 ms
(12 rows)
 
 
Are there any ways to improve this?
 
Thanks.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS

Attachment

Re: Best way to aggregate sum for each month

From
Andreas Joseph Krogh
Date:
From what I can see your query performs worse, as CTEs are optimisation-fences. Your query needs to scans thru all activity_log entries, then do Merge Left Join .
Maybe there's no way to optimize the Merge Left Join node which takes ~300ms?
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
 
På lørdag 18. april 2015 kl. 01:29:37, skrev Naresh Kumar <narain337@gmail.com>:
Try this,
 
 
;WITH cte as
(
SELECT date_trunc('month', log.start_date::timestamp without time zone) AS log_date, sum(log.duration) / (3600 * 1000)::NUMERIC as total_duration
FROM activity_log
GROUP BY date_trunc('month', log.start_date::timestamp without time zone)
)
 
SELECT q.start_date, cte.total_duration
FROM
    (SELECT cast(generate_series('2014-01-01' :: DATE, '2014-12-01' :: DATE, '1 month') AS DATE) as start_date) AS q
    LEFT OUTER JOIN cte ON cte.log_date = q.start_date
 
On Fri, Apr 17, 2015 at 5:10 PM, Andreas Joseph Krogh <andreas@visena.com> wrote:
Hi all.
 
I'm using PG-9.4.
 
I'm having a table, activity_log, which holds activities with duration for a given date.
I'm trying to sum the duration for each month in a given year and am currently doing it like this:
 
select
    q.start_date    , sum(log.duration) / (3600 * 1000)::NUMERIC as total_duration
FROM
    (SELECT cast(generate_series('2014-01-01' :: DATE, '2014-12-01' :: DATE, '1 month') AS DATE) as start_date) AS q    LEFT OUTER JOIN activity_log log ON date_trunc('month', log.start_date::timestamp without time zone) = q.start_date
GROUP BY q.start_date
ORDER BY q.start_date
;

I have the current index defined:
 
create index activity_start_month ON activity_log(date_trunc('month', start_date::timestamp without time zone));

 
Here's the explain plan:
 
                                                                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=65.27..91102.17 rows=200 width=12) (actual time=179.983..344.517 rows=12 loops=1)
   Group Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
   ->  Merge Left Join  (cost=65.27..82950.79 rows=1629675 width=12) (actual time=163.894..310.011 rows=143551 loops=1)
         Merge Cond: (((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date) = date_trunc('month'::text, (log.start_date)::timestamp without time zone))
         ->  Sort  (cost=64.84..67.34 rows=1000 width=4) (actual time=0.045..0.052 rows=12 loops=1)
               Sort Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
               Sort Method: quicksort  Memory: 25kB
               ->  Result  (cost=0.00..5.01 rows=1000 width=0) (actual time=0.017..0.031 rows=12 loops=1)
         ->  Materialize  (cost=0.42..51097.29 rows=325935 width=12) (actual time=0.032..227.692 rows=323180 loops=1)
               ->  Index Scan using activity_start_month on activity_log log  (cost=0.42..50282.45 rows=325935 width=12) (actual time=0.029..185.746 rows=323180 loops=1)
 Planning time: 0.201 ms
 Execution time: 344.645 ms
(12 rows)
 
 
Are there any ways to improve this?
 
Thanks.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
 
 
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Best way to aggregate sum for each month

From
"David G. Johnston"
Date:
On Friday, April 17, 2015, Andreas Joseph Krogh <andreas@visena.com> wrote:
Hi all.
 
I'm using PG-9.4.
 
I'm having a table, activity_log, which holds activities with duration for a given date.
I'm trying to sum the duration for each month in a given year and am currently doing it like this:
 
select
    q.start_date    , sum(log.duration) / (3600 * 1000)::NUMERIC as total_duration
FROM
    (SELECT cast(generate_series('2014-01-01' :: DATE, '2014-12-01' :: DATE, '1 month') AS DATE) as start_date) AS q    LEFT OUTER JOIN activity_log log ON date_trunc('month', log.start_date::timestamp without time zone) = q.start_date
GROUP BY q.start_date
ORDER BY q.start_date
;

I have the current index defined:
 
create index activity_start_month ON activity_log(date_trunc('month', start_date::timestamp without time zone));

 
Here's the explain plan:
 
                                                                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=65.27..91102.17 rows=200 width=12) (actual time=179.983..344.517 rows=12 loops=1)
   Group Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
   ->  Merge Left Join  (cost=65.27..82950.79 rows=1629675 width=12) (actual time=163.894..310.011 rows=143551 loops=1)
         Merge Cond: (((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date) = date_trunc('month'::text, (log.start_date)::timestamp without time zone))
         ->  Sort  (cost=64.84..67.34 rows=1000 width=4) (actual time=0.045..0.052 rows=12 loops=1)
               Sort Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
               Sort Method: quicksort  Memory: 25kB
               ->  Result  (cost=0.00..5.01 rows=1000 width=0) (actual time=0.017..0.031 rows=12 loops=1)
         ->  Materialize  (cost=0.42..51097.29 rows=325935 width=12) (actual time=0.032..227.692 rows=323180 loops=1)
               ->  Index Scan using activity_start_month on activity_log log  (cost=0.42..50282.45 rows=325935 width=12) (actual time=0.029..185.746 rows=323180 loops=1)
 Planning time: 0.201 ms
 Execution time: 344.645 ms
(12 rows)
 
 
Are there any ways to improve this?


I would usually turn activity_log log into (select dt, sum(...) from activity_log group by dt) log

And then left join with coalesce to build the final output.

Work_mem adjustment may help too.

David J.

Re: Best way to aggregate sum for each month

From
Naresh Kumar
Date:
I wrote it in the form of cte for readability purpose. 

If you are aggregating by month date and sure that there will be atleast one record for every month you don't require to generate the series and JOIN. 


On Fri, Apr 17, 2015 at 6:46 PM, Andreas Joseph Krogh <andreas@visena.com> wrote:
From what I can see your query performs worse, as CTEs are optimisation-fences. Your query needs to scans thru all activity_log entries, then do Merge Left Join .
Maybe there's no way to optimize the Merge Left Join node which takes ~300ms?
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
 
 
På lørdag 18. april 2015 kl. 01:29:37, skrev Naresh Kumar <narain337@gmail.com>:
Try this,
 
 
;WITH cte as
(
SELECT date_trunc('month', log.start_date::timestamp without time zone) AS log_date, sum(log.duration) / (3600 * 1000)::NUMERIC as total_duration
FROM activity_log
GROUP BY date_trunc('month', log.start_date::timestamp without time zone)
)
 
SELECT q.start_date, cte.total_duration
FROM
    (SELECT cast(generate_series('2014-01-01' :: DATE, '2014-12-01' :: DATE, '1 month') AS DATE) as start_date) AS q
    LEFT OUTER JOIN cte ON cte.log_date = q.start_date
 
On Fri, Apr 17, 2015 at 5:10 PM, Andreas Joseph Krogh <andreas@visena.com> wrote:
Hi all.
 
I'm using PG-9.4.
 
I'm having a table, activity_log, which holds activities with duration for a given date.
I'm trying to sum the duration for each month in a given year and am currently doing it like this:
 
select
    q.start_date    , sum(log.duration) / (3600 * 1000)::NUMERIC as total_duration
FROM
    (SELECT cast(generate_series('2014-01-01' :: DATE, '2014-12-01' :: DATE, '1 month') AS DATE) as start_date) AS q    LEFT OUTER JOIN activity_log log ON date_trunc('month', log.start_date::timestamp without time zone) = q.start_date
GROUP BY q.start_date
ORDER BY q.start_date
;

I have the current index defined:
 
create index activity_start_month ON activity_log(date_trunc('month', start_date::timestamp without time zone));

 
Here's the explain plan:
 
                                                                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=65.27..91102.17 rows=200 width=12) (actual time=179.983..344.517 rows=12 loops=1)
   Group Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
   ->  Merge Left Join  (cost=65.27..82950.79 rows=1629675 width=12) (actual time=163.894..310.011 rows=143551 loops=1)
         Merge Cond: (((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date) = date_trunc('month'::text, (log.start_date)::timestamp without time zone))
         ->  Sort  (cost=64.84..67.34 rows=1000 width=4) (actual time=0.045..0.052 rows=12 loops=1)
               Sort Key: ((generate_series(('2014-01-01'::date)::timestamp with time zone, ('2014-12-01'::date)::timestamp with time zone, '1 mon'::interval))::date)
               Sort Method: quicksort  Memory: 25kB
               ->  Result  (cost=0.00..5.01 rows=1000 width=0) (actual time=0.017..0.031 rows=12 loops=1)
         ->  Materialize  (cost=0.42..51097.29 rows=325935 width=12) (actual time=0.032..227.692 rows=323180 loops=1)
               ->  Index Scan using activity_start_month on activity_log log  (cost=0.42..50282.45 rows=325935 width=12) (actual time=0.029..185.746 rows=323180 loops=1)
 Planning time: 0.201 ms
 Execution time: 344.645 ms
(12 rows)
 
 
Are there any ways to improve this?
 
Thanks.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
 
 
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
 

Attachment