Thread: Sorted group by

Sorted group by

From
Matthew Wakeling
Date:
I'm trying to eke a little bit more performance out of an application, and
I was wondering if there was a better way to do the following:

I am trying to retrieve, for many sets of rows grouped on a couple of
fields, the value of an ungrouped field where the row has the highest
value in another ungrouped field. For instance, I have the following table
setup:

group  | whatever type
value  | whatever type
number | int
Index: group

I then have rows like this:

group     | value         | number
-------------------------------------
Foo       | foo           | 1
Foo       | turnips       | 2
Bar       | albatross     | 3
Bar       | monkey        | 4

I want to receive results like this:

group     | value
-----------------------
Foo       | turnips
Bar       | monkey

Currently, I do this in my application by ordering by the number and only
using the last value. I imagine that this is something that can be done in
the new Postgres 9, with a sorted group by - something like this:

SELECT group, LAST(value, ORDER BY number) FROM table GROUP BY group

Is this something that is already built in, or would I have to write my
own LAST aggregate function?

Matthew

--
 The third years are wandering about all worried at the moment because they
 have to hand in their final projects. Please be sympathetic to them, say
 things like "ha-ha-ha", but in a sympathetic tone of voice
                                        -- Computer Science Lecturer

Re: Sorted group by

From
Thomas Kellerer
Date:
Matthew Wakeling wrote on 10.08.2010 17:40:
> Currently, I do this in my application by ordering by the number and
> only using the last value. I imagine that this is something that can be
> done in the new Postgres 9, with a sorted group by - something like this:
>
> SELECT group, LAST(value, ORDER BY number) FROM table GROUP BY group
>
> Is this something that is already built in, or would I have to write my
> own LAST aggregate function?

No. It's built in (8.4) and it's called Windowing functions:
http://www.postgresql.org/docs/8.4/static/tutorial-window.html
http://www.postgresql.org/docs/8.4/static/functions-window.html

SELECT group, last_value(value) over(ORDER BY number)
FROM table

You don't need the group by then (but you can apply e.g. an ORDER BY GROUP)

Thomas

Re: Sorted group by

From
Matthew Wakeling
Date:
On Tue, 10 Aug 2010, Thomas Kellerer wrote:
> No. It's built in (8.4) and it's called Windowing functions:
> http://www.postgresql.org/docs/8.4/static/tutorial-window.html
> http://www.postgresql.org/docs/8.4/static/functions-window.html
>
> SELECT group, last_value(value) over(ORDER BY number)
> FROM table

I may be mistaken, but as I understand it, a windowing function doesn't
reduce the number of rows in the results?

Matthew

--
 Don't worry about people stealing your ideas. If your ideas are any good,
 you'll have to ram them down people's throats.     -- Howard Aiken

Re: Sorted group by

From
Thom Brown
Date:
On 10 August 2010 17:03, Matthew Wakeling <matthew@flymine.org> wrote:
> On Tue, 10 Aug 2010, Thomas Kellerer wrote:
>>
>> No. It's built in (8.4) and it's called Windowing functions:
>> http://www.postgresql.org/docs/8.4/static/tutorial-window.html
>> http://www.postgresql.org/docs/8.4/static/functions-window.html
>>
>> SELECT group, last_value(value) over(ORDER BY number)
>> FROM table
>
> I may be mistaken, but as I understand it, a windowing function doesn't
> reduce the number of rows in the results?
>

I think you are mistaken.  The last_value function is a window
function aggregate.  Give it a try.

--
Thom Brown
Registered Linux user: #516935

Re: Sorted group by

From
hubert depesz lubaczewski
Date:
On Tue, Aug 10, 2010 at 04:40:16PM +0100, Matthew Wakeling wrote:
>
> I'm trying to eke a little bit more performance out of an
> application, and I was wondering if there was a better way to do the
> following:
>
> I am trying to retrieve, for many sets of rows grouped on a couple
> of fields, the value of an ungrouped field where the row has the
> highest value in another ungrouped field. For instance, I have the
> following table setup:
>
> group  | whatever type
> value  | whatever type
> number | int
> Index: group
>
> I then have rows like this:
>
> group     | value         | number
> -------------------------------------
> Foo       | foo           | 1
> Foo       | turnips       | 2
> Bar       | albatross     | 3
> Bar       | monkey        | 4
>
> I want to receive results like this:
>
> group     | value
> -----------------------
> Foo       | turnips
> Bar       | monkey
>
> Currently, I do this in my application by ordering by the number and
> only using the last value. I imagine that this is something that can
> be done in the new Postgres 9, with a sorted group by - something
> like this:
>
> SELECT group, LAST(value, ORDER BY number) FROM table GROUP BY group
>
> Is this something that is already built in, or would I have to write
> my own LAST aggregate function?

this is trivially done when usign 'distinct on':
select distinct on (group) *
from table
order by group desc, number desc;

depesz

--
Linkedin: http://www.linkedin.com/in/depesz  /  blog: http://www.depesz.com/
jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007

Re: Sorted group by

From
Thomas Kellerer
Date:
Matthew Wakeling wrote on 10.08.2010 18:03:
> On Tue, 10 Aug 2010, Thomas Kellerer wrote:
>> No. It's built in (8.4) and it's called Windowing functions:
>> http://www.postgresql.org/docs/8.4/static/tutorial-window.html
>> http://www.postgresql.org/docs/8.4/static/functions-window.html
>>
>> SELECT group, last_value(value) over(ORDER BY number)
>> FROM table
>
> I may be mistaken, but as I understand it, a windowing function doesn't
> reduce the number of rows in the results?

Yes you are right, a bit too quick on my side ;)

But this might be what you are after then:

select group_, value_
from (
   select group_, value_, number_, row_number() over (partition by group_ order by value_ desc) as row_num
   from numbers
) t
where row_num = 1
order by group_ desc

Re: Sorted group by

From
Thom Brown
Date:
On 10 August 2010 17:06, Thom Brown <thom@linux.com> wrote:
> On 10 August 2010 17:03, Matthew Wakeling <matthew@flymine.org> wrote:
>> On Tue, 10 Aug 2010, Thomas Kellerer wrote:
>>>
>>> No. It's built in (8.4) and it's called Windowing functions:
>>> http://www.postgresql.org/docs/8.4/static/tutorial-window.html
>>> http://www.postgresql.org/docs/8.4/static/functions-window.html
>>>
>>> SELECT group, last_value(value) over(ORDER BY number)
>>> FROM table
>>
>> I may be mistaken, but as I understand it, a windowing function doesn't
>> reduce the number of rows in the results?
>>
>
> I think you are mistaken.  The last_value function is a window
> function aggregate.  Give it a try.
>

D'oh, no, I'm mistaken.  My brain has been malfunctioning today.

--
Thom Brown
Registered Linux user: #516935

Re: Sorted group by

From
"Kevin Grittner"
Date:
Matthew Wakeling <matthew@flymine.org> wrote:

> I'm trying to eke a little bit more performance out of an
> application

In addition to the suggestion from Thomas Kellerer, it would be
interesting to try the following and see how performance compares
using real data.

select group, value from tbl x
  where not exists
        (select * from tbl y
          where y.group = x.group and y.number > x.number);

We have a lot of code using this general technique, and I'm curious
whether there are big gains to be had by moving to the windowing
functions.  (I suspect there are.)

-Kevin

Re: Sorted group by

From
Jonathan Blitz
Date:
Another couple of possible ways:

Select groupfield,value
From tbl x1
Where number = (select max(number) from tbl x2 where x2.groupfield=
x1.groupfield)



Select groupfield,value
From tbl x1
Where (groupfield,number) in (select groupfield,max(number) from tbl group
by groupfield)

Which is quickest?
Probably best to try out and see.

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Kevin Grittner
Sent: Tuesday, August 10, 2010 7:38 PM
To: Matthew Wakeling; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Sorted group by

Matthew Wakeling <matthew@flymine.org> wrote:

> I'm trying to eke a little bit more performance out of an application

In addition to the suggestion from Thomas Kellerer, it would be interesting
to try the following and see how performance compares using real data.

select group, value from tbl x
  where not exists
        (select * from tbl y
          where y.group = x.group and y.number > x.number);

We have a lot of code using this general technique, and I'm curious whether
there are big gains to be had by moving to the windowing functions.  (I
suspect there are.)

-Kevin

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 9.0.851 / Virus Database: 271.1.1/3061 - Release Date: 08/09/10
21:35:00


Re: Sorted group by

From
Matthew Wakeling
Date:
Original query:

explain analyse select * from tracker where objectid < 1200000;
                           QUERY PLAN
-----------------------------------------------------------------------
  Index Scan using tracker_objectid on tracker
         (cost=0.00..915152.62 rows=3684504 width=33)
         (actual time=0.061..5402.608 rows=3790872 loops=1)
    Index Cond: (objectid < 1200000)
  Total runtime: 9134.362 ms
(3 rows)

On Tue, 10 Aug 2010, hubert depesz lubaczewski wrote:
> select distinct on (group) *
> from table
> order by group desc, number desc;

This solution is rather obvious, and works on older versions of Postgres.
Thanks. However, the burden of sorting by two columns (actually, in our
application the group is two column, so sorting by three columns instead)
makes this significantly slower than just copying the whole data through
our application (which effectively does a hash aggregation).

explain analyse select distinct on (objectid, fieldname) objectid,
fieldname, sourcename, version from tracker where objectid < 1200000 order
by objectid, fieldname, version desc;

                             QUERY PLAN
--------------------------------------------------------------------------
  Unique  (cost=1330828.11..1357953.05 rows=361666 width=34)
          (actual time=12815.878..22452.737 rows=1782996 loops=1)
    ->  Sort  (cost=1330828.11..1339869.76 rows=3616658 width=34)
              (actual time=12815.873..16608.903 rows=3790872 loops=1)
          Sort Key: objectid, fieldname, version
          Sort Method:  quicksort  Memory: 420980kB
          ->  Index Scan using tracker_objectid on tracker
              (cost=0.00..936861.47 rows=3616658 width=34)
              (actual time=0.061..5441.050 rows=3790872 loops=1)
                Index Cond: (objectid < 1200000)
  Total runtime: 24228.724 ms
(7 rows)

On Tue, 10 Aug 2010, Thomas Kellerer wrote:
> select group_, value_
> from (
>  select group_, value_, number_, row_number() over (partition by group_
> order by value_ desc) as row_num
>  from numbers
> ) t
> where row_num = 1
> order by group_ desc

This looks quite cute, however it is slightly slower than the DISTINCT ON
approach.

explain analyse select objectid, fieldname, sourcename from (select
objectid, fieldname, sourcename, version, row_number() over (partition by
objectid, fieldname order by version desc) as row_num from tracker where
objectid < 1200000) as t where row_num = 1;
                            QUERY PLAN
-------------------------------------------------------------------------
  Subquery Scan t  (cost=1330828.11..1457411.14 rows=18083 width=68)
                   (actual time=12835.553..32220.075 rows=1782996 loops=1)
    Filter: (t.row_num = 1)
    ->  WindowAgg  (cost=1330828.11..1412202.92 rows=3616658 width=34)
                   (actual time=12835.541..26471.802 rows=3790872 loops=1)
          ->  Sort  (cost=1330828.11..1339869.76 rows=3616658 width=34)
                    (actual time=12822.560..16646.112 rows=3790872 loops=1)
                Sort Key: tracker.objectid, tracker.fieldname, tracker.version
                Sort Method:  quicksort  Memory: 420980kB
                ->  Index Scan using tracker_objectid on tracker
                    (cost=0.00..936861.47 rows=3616658 width=34)
                    (actual time=0.067..5433.790 rows=3790872 loops=1)
                      Index Cond: (objectid < 1200000)
  Total runtime: 34002.828 ms
(9 rows)

On Tue, 10 Aug 2010, Kevin Grittner wrote:
> select group, value from tbl x
>  where not exists
>        (select * from tbl y
>          where y.group = x.group and y.number > x.number);

This is a join, which is quite a bit slower:

explain analyse select objectid, fieldname, sourcename from tracker as a
where not exists(select * from tracker as b where a.objectid = b.objectid
and a.fieldname = b.fieldname and a.version < b.version and b.objectid <
1200000) and a.objectid < 1200000;

                                QUERY PLAN
---------------------------------------------------------------------------
  Merge Anti Join  (cost=2981427.73..3042564.32 rows=2411105 width=30)
                   (actual time=24834.372..53939.131 rows=1802376 loops=1)
    Merge Cond: ((a.objectid = b.objectid) AND (a.fieldname = b.fieldname))
    Join Filter: (a.version < b.version)
    ->  Sort  (cost=1490713.86..1499755.51 rows=3616658 width=34)
              (actual time=12122.478..15944.255 rows=3790872 loops=1)
          Sort Key: a.objectid, a.fieldname
          Sort Method:  quicksort  Memory: 420980kB
          ->  Index Scan using tracker_objectid on tracker a
              (cost=0.00..1096747.23 rows=3616658 width=34)
              (actual time=0.070..5403.235 rows=3790872 loops=1)
                Index Cond: (objectid < 1200000)
    ->  Sort  (cost=1490713.86..1499755.51 rows=3616658 width=17)
              (actual time=12710.564..20952.841 rows=8344994 loops=1)
          Sort Key: b.objectid, b.fieldname
          Sort Method:  quicksort  Memory: 336455kB
          ->  Index Scan using tracker_objectid on tracker b
              (cost=0.00..1096747.23 rows=3616658 width=17)
              (actual time=0.084..5781.844 rows=3790872 loops=1)
                Index Cond: (objectid < 1200000)
  Total runtime: 55756.482 ms
(14 rows)

On Tue, 10 Aug 2010, Jonathan Blitz wrote:
> Select groupfield,value
> From tbl x1
> Where number = (select max(number) from tbl x2 where x2.groupfield=
> x1.groupfield)

This one effectively forces a nested loop join:

explain analyse select objectid, fieldname, sourcename from tracker as a
where version = (select max(version) from tracker as b where a.objectid =
b.objectid and a.fieldname = b.fieldname) and a.objectid < 1200000;

                                QUERY PLAN
-------------------------------------------------------------------------
  Index Scan using tracker_objectid on tracker a
        (cost=0.00..58443381.75 rows=18083 width=34)
        (actual time=6.482..59803.225 rows=1802376 loops=1)
    Index Cond: (objectid < 1200000)
    Filter: (version = (SubPlan 2))
    SubPlan 2
      ->  Result  (cost=15.89..15.90 rows=1 width=0)
                  (actual time=0.011..0.012 rows=1 loops=3790872)
            InitPlan 1 (returns $2)
              ->  Limit  (cost=0.00..15.89 rows=1 width=4)
                         (actual time=0.007..0.008 rows=1 loops=3790872)
                    ->  Index Scan Backward using tracker_all on tracker b
                         (cost=0.00..31.78 rows=2 width=4)
                         (actual time=0.005..0.005 rows=1 loops=3790872)
                          Index Cond: (($0 = objectid) AND ($1 = fieldname))
                          Filter: (version IS NOT NULL)
  Total runtime: 61649.116 ms
(11 rows)

On Tue, 10 Aug 2010, Jonathan Blitz wrote:
> Select groupfield,value
> From tbl x1
> Where (groupfield,number) in (select groupfield,max(number) from tbl group
> by groupfield)

This is another join.

explain analyse select objectid, fieldname, sourcename from tracker where
(objectid, fieldname, version) in (select objectid, fieldname,
max(version) from tracker group by objectid, fieldname);

I terminated this query after about an hour. Here is the EXPLAIN:

                               QUERY PLAN
------------------------------------------------------------------------
  Hash Join  (cost=55310973.80..72060974.32 rows=55323 width=30)
    Hash Cond: ((public.tracker.objectid = public.tracker.objectid)
            AND (public.tracker.fieldname = public.tracker.fieldname)
            AND (public.tracker.version = (max(public.tracker.version))))
    ->  Seq Scan on tracker  (cost=0.00..5310600.80 rows=293972480 width=34)
    ->  Hash  (cost=54566855.96..54566855.96 rows=29397248 width=40)
          ->  GroupAggregate
              (cost=50965693.08..54272883.48 rows=29397248 width=17)
                ->  Sort
                    (cost=50965693.08..51700624.28 rows=293972480 width=17)
                      Sort Key: public.tracker.objectid, public.tracker.fieldname
                      ->  Seq Scan on tracker
                          (cost=0.00..5310600.80 rows=293972480 width=17)
(8 rows)

Matthew

--
 I quite understand I'm doing algebra on the blackboard and the usual response
 is to throw objects...  If you're going to freak out... wait until party time
 and invite me along                     -- Computer Science Lecturer

Re: Sorted group by

From
Alvaro Herrera
Date:
Excerpts from Matthew Wakeling's message of mar ago 10 11:40:16 -0400 2010:

> I am trying to retrieve, for many sets of rows grouped on a couple of
> fields, the value of an ungrouped field where the row has the highest
> value in another ungrouped field.

I think this does what you want (schema is from the tenk1 table in the
regression database):

select string4 as group,
   (array_agg(stringu1 order by unique1 desc))[1] as value
from tenk1
group by 1 ;

Please let me know how it performs with your data.  The plan is rather simple:

regression=# explain analyze select string4 as group, (array_agg(stringu1 order by unique1 desc))[1] as value from
tenk1group by 1 ; 
                                                          QUERY PLAN
       

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 GroupAggregate  (cost=0.00..1685.16 rows=4 width=132) (actual time=22.825..88.922 rows=4 loops=1)
   ->  Index Scan using ts4 on tenk1  (cost=0.00..1635.11 rows=10000 width=132) (actual time=0.135..33.188 rows=10000
loops=1)
 Total runtime: 89.348 ms
(3 filas)


--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support