Thread: DATE_TRUNC() and GROUP BY?

DATE_TRUNC() and GROUP BY?

From
Dave Johansen
Date:
I just ran into an interesting issue on Postgres 8.4. I have a database with about 3 months of data and when I do following query:
SELECT DATE_TRUNC('day', time) AS time_t, COUNT(*) FROM mytable GROUP BY time_t;

EXPLAIN shows that it's doing a sort and then a GroupAggregate. There will only be ~90 outputs, so is there a way I can hint/force the planner to just do a HashAggregate?

Just to see if it would change the plan, I tried increasing the work_mem up to 1GB and it still did the same plan.

Thanks,
Dave

Re: DATE_TRUNC() and GROUP BY?

From
David Rowley
Date:
On Fri, Dec 20, 2013 at 1:35 PM, Dave Johansen <davejohansen@gmail.com> wrote:
I just ran into an interesting issue on Postgres 8.4. I have a database with about 3 months of data and when I do following query:
SELECT DATE_TRUNC('day', time) AS time_t, COUNT(*) FROM mytable GROUP BY time_t;

EXPLAIN shows that it's doing a sort and then a GroupAggregate. There will only be ~90 outputs, so is there a way I can hint/force the planner to just do a HashAggregate?

Just to see if it would change the plan, I tried increasing the work_mem up to 1GB and it still did the same plan.


PostgreSQL does not really have any stats on the selectivity of date_trunc('day', time) so my guess is that it can only assume that it has the same selectivity as the time column by itself... Which is very untrue in this case.
The group aggregate plan is chosen here as PostgreSQL thinks the the hash table is going to end up pretty big and decides that the group aggregate will be the cheaper option.

I mocked up your data and on 9.4 I can get the hash aggregate plan to run if I set the n_distinct value to 90 then analyze the table again.. Even if you could do this on 8.4 I'd not recommend it as it will probably cause havoc with other plans around the time column. I did also get the hash aggregate plan to run if I created a functional index on date_trunc('day', time) then ran analyze again. I don't have a copy of 8.4 around to see if the planner will make use of the index in the same way.

What would be really nice is if we could create our own statistics on what we want, something like:

CREATE STATISTICS name ON table (date_trunc('day', time));

That way postgres could have a better idea of the selectivity in this situation.

I'd give creating the function index a try, but keep in mind the overhead that it will cause with inserts, updates and deletes.

Regards

David Rowley
 
Thanks,
Dave

Re: DATE_TRUNC() and GROUP BY?

From
Dave Johansen
Date:
On Fri, Dec 20, 2013 at 10:46 PM, David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, Dec 20, 2013 at 1:35 PM, Dave Johansen <davejohansen@gmail.com> wrote:
I just ran into an interesting issue on Postgres 8.4. I have a database with about 3 months of data and when I do following query:
SELECT DATE_TRUNC('day', time) AS time_t, COUNT(*) FROM mytable GROUP BY time_t;

EXPLAIN shows that it's doing a sort and then a GroupAggregate. There will only be ~90 outputs, so is there a way I can hint/force the planner to just do a HashAggregate?

Just to see if it would change the plan, I tried increasing the work_mem up to 1GB and it still did the same plan.


PostgreSQL does not really have any stats on the selectivity of date_trunc('day', time) so my guess is that it can only assume that it has the same selectivity as the time column by itself... Which is very untrue in this case.
The group aggregate plan is chosen here as PostgreSQL thinks the the hash table is going to end up pretty big and decides that the group aggregate will be the cheaper option.

I mocked up your data and on 9.4 I can get the hash aggregate plan to run if I set the n_distinct value to 90 then analyze the table again.. Even if you could do this on 8.4 I'd not recommend it as it will probably cause havoc with other plans around the time column. I did also get the hash aggregate plan to run if I created a functional index on date_trunc('day', time) then ran analyze again. I don't have a copy of 8.4 around to see if the planner will make use of the index in the same way.

What would be really nice is if we could create our own statistics on what we want, something like:

CREATE STATISTICS name ON table (date_trunc('day', time));

That way postgres could have a better idea of the selectivity in this situation.

I'd give creating the function index a try, but keep in mind the overhead that it will cause with inserts, updates and deletes.

Thanks for the advice and I'll give the index a try. Are there any other tricks that I could try? Like maybe a custom aggregate or data type conversion (truncated date in seconds since an epoch) that would make the planner do the right thing? Or will those two ideas just run into the same planner problem?

Thanks again,
Dave

Re: DATE_TRUNC() and GROUP BY?

From
Dave Johansen
Date:
On Thu, Jan 2, 2014 at 11:39 AM, Dave Johansen <davejohansen@gmail.com> wrote:
On Fri, Dec 20, 2013 at 10:46 PM, David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, Dec 20, 2013 at 1:35 PM, Dave Johansen <davejohansen@gmail.com> wrote:
I just ran into an interesting issue on Postgres 8.4. I have a database with about 3 months of data and when I do following query:
SELECT DATE_TRUNC('day', time) AS time_t, COUNT(*) FROM mytable GROUP BY time_t;

EXPLAIN shows that it's doing a sort and then a GroupAggregate. There will only be ~90 outputs, so is there a way I can hint/force the planner to just do a HashAggregate?

Just to see if it would change the plan, I tried increasing the work_mem up to 1GB and it still did the same plan.


PostgreSQL does not really have any stats on the selectivity of date_trunc('day', time) so my guess is that it can only assume that it has the same selectivity as the time column by itself... Which is very untrue in this case.
The group aggregate plan is chosen here as PostgreSQL thinks the the hash table is going to end up pretty big and decides that the group aggregate will be the cheaper option.

I mocked up your data and on 9.4 I can get the hash aggregate plan to run if I set the n_distinct value to 90 then analyze the table again.. Even if you could do this on 8.4 I'd not recommend it as it will probably cause havoc with other plans around the time column. I did also get the hash aggregate plan to run if I created a functional index on date_trunc('day', time) then ran analyze again. I don't have a copy of 8.4 around to see if the planner will make use of the index in the same way.

I just tried this on 8.4 and it won't create the index because DATE_TRUNC() is not immutable. The exact error is:
ERROR:  function in index expression must be marked IMMUTABLE

Any suggestions or other ideas?

Thanks,
Dave

Re: DATE_TRUNC() and GROUP BY?

From
Dave Johansen
Date:
On Thu, Jan 2, 2014 at 12:36 PM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jan 2, 2014 at 11:39 AM, Dave Johansen <davejohansen@gmail.com> wrote:
On Fri, Dec 20, 2013 at 10:46 PM, David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, Dec 20, 2013 at 1:35 PM, Dave Johansen <davejohansen@gmail.com> wrote:
I just ran into an interesting issue on Postgres 8.4. I have a database with about 3 months of data and when I do following query:
SELECT DATE_TRUNC('day', time) AS time_t, COUNT(*) FROM mytable GROUP BY time_t;

EXPLAIN shows that it's doing a sort and then a GroupAggregate. There will only be ~90 outputs, so is there a way I can hint/force the planner to just do a HashAggregate?

Just to see if it would change the plan, I tried increasing the work_mem up to 1GB and it still did the same plan.


PostgreSQL does not really have any stats on the selectivity of date_trunc('day', time) so my guess is that it can only assume that it has the same selectivity as the time column by itself... Which is very untrue in this case.
The group aggregate plan is chosen here as PostgreSQL thinks the the hash table is going to end up pretty big and decides that the group aggregate will be the cheaper option.

I mocked up your data and on 9.4 I can get the hash aggregate plan to run if I set the n_distinct value to 90 then analyze the table again.. Even if you could do this on 8.4 I'd not recommend it as it will probably cause havoc with other plans around the time column. I did also get the hash aggregate plan to run if I created a functional index on date_trunc('day', time) then ran analyze again. I don't have a copy of 8.4 around to see if the planner will make use of the index in the same way.

I just tried this on 8.4 and it won't create the index because DATE_TRUNC() is not immutable. The exact error is:
ERROR:  function in index expression must be marked IMMUTABLE

Any suggestions or other ideas?

I apologize for the multiple emails, but I just looked at the definition of DATE_TRUNC() and for TIMESTAMP WITHOUT TIME ZONE it's IMMUTABLE, so I will look into switching to that and see if using the index speeds up the queries.