Thread: Re: [HACKERS] No: implied sort with group by

Re: [HACKERS] No: implied sort with group by

From
darrenk@insightdist.com (Darren King)
Date:
> postgres=> select b,c,sum(a) from t1 group by b,c;
> b|c|sum
> -+-+---
>  |x|  5
>  |z|  3
>  |x|  0
> (3 rows)
>
> postgres=> select * from t1;
> a|b|c
> -+-+-
> 1| |x
> 2| |x
> 2| |x
> 3| |z
> 0| |x
> (5 rows)
>
> I just inserted a single out-of-order row at the end of the table which, since the
> integer value is zero, should have not affected the result. Sorry I didn't understand
> the nature of the test case.

The order of the implicit sort would be arbitrary, but should first sort on
any fields in a given ORDER BY to help speed things up later in the tree.

What are the effects of sorted or partially sorted input data to the sort code?

The current group/aggregate code seems to just loop over the tuples as they are.

I see two ways to fix the above, one w/minimal code, second w/more work, but
potentially better speed for large queries.

1.  Put a sort node immediately before the group node, taking into account
any user given ordering.  Also make sure the optimizer is aware of this sort
when calculating query costs.

2.  Instead of sorting the tuples before grouping, add a hashing system to
the group node so that the pre-sorting is not necessary.

Hmmm...is this a grouping problem or an aggregate problem?  Or both?  The first
query above should have the data sorted before aggregating, shouldn't it, or I
am still missing a piece of this puzzle?

darrenk

Re: [HACKERS] No: implied sort with group by

From
ocie@paracel.com
Date:
Darren King wrote:
>

[examples deleted]

> I see two ways to fix the above, one w/minimal code, second w/more work, but
> potentially better speed for large queries.
>
> 1.  Put a sort node immediately before the group node, taking into account
> any user given ordering.  Also make sure the optimizer is aware of this sort
> when calculating query costs.
>
> 2.  Instead of sorting the tuples before grouping, add a hashing system to
> the group node so that the pre-sorting is not necessary.
>
> Hmmm...is this a grouping problem or an aggregate problem?  Or both?  The first
> query above should have the data sorted before aggregating, shouldn't it, or I
> am still missing a piece of this puzzle?
>
> darrenk

The hash should work.  If the hash key is built on the group-by items,
then any row with the same entries in these columns will get hashed to
the same result row.  At this point, it should be fairly easy to
perform aggregation (test and substitute for min and max, add for
sum,avg, etc).

Ocie

Re: [HACKERS] No: implied sort with group by

From
"Thomas G. Lockhart"
Date:
> > postgres=> select b,c,sum(a) from t1 group by b,c;
> > b|c|sum
> > -+-+---
> >  |x|  5
> >  |z|  3
> >  |x|  0
> > (3 rows)
> >
> > postgres=> select * from t1;
> > a|b|c
> > -+-+-
> > 1| |x
> > 2| |x
> > 2| |x
> > 3| |z
> > 0| |x
> > (5 rows)
> >
> > I just inserted a single out-of-order row at the end of the table which, since the
> > integer value is zero, should have not affected the result. Sorry I didn't understand
> > the nature of the test case.

> Hmmm...is this a grouping problem or an aggregate problem?  Or both?  The first
> query above should have the data sorted before aggregating, shouldn't it, or I
> am still missing a piece of this puzzle?

fwiw, I see the same incorrect behavior in v6.2.1p5.

                                                - Tom