Thread: 8.3 planner handling of IS NULL in aggregations

8.3 planner handling of IS NULL in aggregations

From
Sam Mason
Date:
Hi,

I've just noticed that the planner in 8.3.3 doesn't seem to realize the
difference in the result of the following:

  SELECT col, COUNT(*)
  FROM tbl
  GROUP BY col;

and

  SELECT col IS NULL, COUNT(*)
  FROM tbl
  GROUP BY col IS NULL;

For a table with several million distinct values in "col" this
makes quite a difference.  I'd expect to be getting in memory hash
aggregations, but I'm getting a sort step in there instead.  Here's an
example:

  SELECT col1 IS NOT NULL, col2 IS NOT NULL, col3 IS NOT NULL,
    COUNT(*)
  FROM tbl
  GROUP BY 1,2,3
  ORDER BY 1,2,3;

gives the following plan:

  GroupAggregate  (cost=5018623.99..5387423.18 rows=4338999 width=12)
    ->  Sort  (cost=5018623.99..5081536.33 rows=25164936 width=12)
          Sort Key: ((col1 IS NOT NULL)), ((col2 IS NOT NULL)), ((col3 IS NOT NULL))
          ->  Seq Scan on tbl  (cost=0.00..376989.36 rows=25164936 width=12)

I can't see any way for it to produce more than 8 rows of output and so
I'd expect a hash aggregate to be best.  Removing the IS NOT NULLs from
the expression gives basically the same plan and expected number of rows
which then looks reasonable.


  Sam

Re: 8.3 planner handling of IS NULL in aggregations

From
Tom Lane
Date:
Sam Mason <sam@samason.me.uk> writes:
> I've just noticed that the planner in 8.3.3 doesn't seem to realize the
> difference in the result of the following:
>   GROUP BY col;
>   GROUP BY col IS NULL;

Yeah, estimate_num_groups doesn't have any special knowledge about IS
NULL -- it just sees this as "an expression involving col".  The
general assumption about that is that the expression doesn't reduce
the number of groups (think "col + 1" for example).  In general I'd
rather it overestimated the number of groups than underestimated,
so I don't think this heuristic is really wrong.

Putting in a special case for IS NULL seems a bit silly, but maybe
checking for a boolean result type would cover enough real-world
uses to be worth the trouble?  Not sure.

            regards, tom lane

Re: 8.3 planner handling of IS NULL in aggregations

From
Sam Mason
Date:
On Fri, Jul 04, 2008 at 12:37:48PM -0400, Tom Lane wrote:
> Yeah, estimate_num_groups doesn't have any special knowledge about IS
> NULL -- it just sees this as "an expression involving col".  The
> general assumption about that is that the expression doesn't reduce
> the number of groups (think "col + 1" for example).  In general I'd
> rather it overestimated the number of groups than underestimated,
> so I don't think this heuristic is really wrong.

Hum, I thought it tried harder than that---I remembered date_trunc doing
special things, but it doesn't seem to.  Ah well!

> Putting in a special case for IS NULL seems a bit silly, but maybe
> checking for a boolean result type would cover enough real-world
> uses to be worth the trouble?  Not sure.

Sounds sensible.  It would generalize nicely to the new enum types as
well, use the minimum of the expected number of rows or the size of the
result type's domain.


  Sam