Thread: Re: [HACKERS] No: implied sort with group by

Re: [HACKERS] No: implied sort with group by

From
darrenk@insightdist.com (Darren King)
Date:
> > > This is what I think is missing or broken right now.
> > >
> > > > > select * from t1;
> > > >          a b  c
> > > >          1    x
> > > >          2    x
> > > >          3    z
> > > >          2    x
> > > >
> > > > 4 row(s) retrieved.
> > > > > select b,c,sum(a) from t1 group by b,c;
> > > > b  c             (sum)
> > > >
> > > >    x                 5
> > > >    z                 3
> > > >> 2 row(s) retrieved.
> >
> > Sorry, I've lost the thread. What is broken? I get this same result, and
> > (assuming that column "b" is full of nulls) I think this the correct result.
>
> At one point, it was thought that NULLs shouldn't be grouped, but I
> backed out the patch.  There is a problem with GROUP BY on large
> datasets, and Vadim knows the cause, and will work on it later.

Different from the grouping by NULLs issue...

The above results are from Sybase.  If these same four rows are inserted into
postgres, the second query will return three rows.  Something like...

b|c|sum(a)
 |x|3
 |z|3
 |x|2

It does this not because of the null values of column b, but because the data is
not sorted before getting to the group by node if the user does not explicitly put
an order by in the query.  IMHO, postgres should put an arbitrary sort node in the
tree so that the data can be properly grouped as the group node iterates over it.

And even if I put an "order by c" clause in there, I still get three rows, they're
just properly sorted. :)

darrenk


Re: [HACKERS] No: implied sort with group by

From
"Thomas G. Lockhart"
Date:
> > > > This is what I think is missing or broken right now.
> > > >
> > > > > > select * from t1;
> > > > >          a b  c
> > > > >          1    x
> > > > >          2    x
> > > > >          3    z
> > > > >          2    x
> > > > >
> > > > > 4 row(s) retrieved.
> > > > > > select b,c,sum(a) from t1 group by b,c;
> > > > > b  c             (sum)
> > > > >
> > > > >    x                 5
> > > > >    z                 3
> > > > >> 2 row(s) retrieved.
> > >
> > > Sorry, I've lost the thread. What is broken? I get this same result, and
> > > (assuming that column "b" is full of nulls) I think this the correct result.
> >
> > At one point, it was thought that NULLs shouldn't be grouped, but I
> > backed out the patch.  There is a problem with GROUP BY on large
> > datasets, and Vadim knows the cause, and will work on it later.
>
> Different from the grouping by NULLs issue...
>
> The above results are from Sybase.  If these same four rows are inserted into
> postgres, the second query will return three rows.  Something like...
>
> b|c|sum(a)
>  |x|3
>  |z|3
>  |x|2
>
> It does this not because of the null values of column b, but because the data is
> not sorted before getting to the group by node if the user does not explicitly put
> an order by in the query.  IMHO, postgres should put an arbitrary sort node in the
> tree so that the data can be properly grouped as the group node iterates over it.
>
> And even if I put an "order by c" clause in there, I still get three rows, they're
> just properly sorted. :)

Not necessarily true; as I said, I get the same result as above (with the 980112
source tree; have things changed since??). Perhaps you are running into the sorting
problem which seemed to be present on larger tables only?

                                                    - Tom

postgres=> select b,c,sum(a) from t1 group by b,c;
b|c|sum
-+-+---
 |x|  5
 |z|  3
(2 rows)

postgres=> select * from t1;
a|b|c
-+-+-
1| |x
2| |x
2| |x
3| |z
(4 rows)