Re: Sorted group by - Mailing list pgsql-performance

From Matthew Wakeling
Subject Re: Sorted group by
Date
Msg-id alpine.DEB.2.00.1008111148300.2654@aragorn.flymine.org
Whole thread Raw
In response to Re: Sorted group by  (Jonathan Blitz <jblitz@013.net>)
List pgsql-performance
Original query:

explain analyse select * from tracker where objectid < 1200000;
                           QUERY PLAN
-----------------------------------------------------------------------
  Index Scan using tracker_objectid on tracker
         (cost=0.00..915152.62 rows=3684504 width=33)
         (actual time=0.061..5402.608 rows=3790872 loops=1)
    Index Cond: (objectid < 1200000)
  Total runtime: 9134.362 ms
(3 rows)

On Tue, 10 Aug 2010, hubert depesz lubaczewski wrote:
> select distinct on (group) *
> from table
> order by group desc, number desc;

This solution is rather obvious, and works on older versions of Postgres.
Thanks. However, the burden of sorting by two columns (actually, in our
application the group is two column, so sorting by three columns instead)
makes this significantly slower than just copying the whole data through
our application (which effectively does a hash aggregation).

explain analyse select distinct on (objectid, fieldname) objectid,
fieldname, sourcename, version from tracker where objectid < 1200000 order
by objectid, fieldname, version desc;

                             QUERY PLAN
--------------------------------------------------------------------------
  Unique  (cost=1330828.11..1357953.05 rows=361666 width=34)
          (actual time=12815.878..22452.737 rows=1782996 loops=1)
    ->  Sort  (cost=1330828.11..1339869.76 rows=3616658 width=34)
              (actual time=12815.873..16608.903 rows=3790872 loops=1)
          Sort Key: objectid, fieldname, version
          Sort Method:  quicksort  Memory: 420980kB
          ->  Index Scan using tracker_objectid on tracker
              (cost=0.00..936861.47 rows=3616658 width=34)
              (actual time=0.061..5441.050 rows=3790872 loops=1)
                Index Cond: (objectid < 1200000)
  Total runtime: 24228.724 ms
(7 rows)

On Tue, 10 Aug 2010, Thomas Kellerer wrote:
> select group_, value_
> from (
>  select group_, value_, number_, row_number() over (partition by group_
> order by value_ desc) as row_num
>  from numbers
> ) t
> where row_num = 1
> order by group_ desc

This looks quite cute, however it is slightly slower than the DISTINCT ON
approach.

explain analyse select objectid, fieldname, sourcename from (select
objectid, fieldname, sourcename, version, row_number() over (partition by
objectid, fieldname order by version desc) as row_num from tracker where
objectid < 1200000) as t where row_num = 1;
                            QUERY PLAN
-------------------------------------------------------------------------
  Subquery Scan t  (cost=1330828.11..1457411.14 rows=18083 width=68)
                   (actual time=12835.553..32220.075 rows=1782996 loops=1)
    Filter: (t.row_num = 1)
    ->  WindowAgg  (cost=1330828.11..1412202.92 rows=3616658 width=34)
                   (actual time=12835.541..26471.802 rows=3790872 loops=1)
          ->  Sort  (cost=1330828.11..1339869.76 rows=3616658 width=34)
                    (actual time=12822.560..16646.112 rows=3790872 loops=1)
                Sort Key: tracker.objectid, tracker.fieldname, tracker.version
                Sort Method:  quicksort  Memory: 420980kB
                ->  Index Scan using tracker_objectid on tracker
                    (cost=0.00..936861.47 rows=3616658 width=34)
                    (actual time=0.067..5433.790 rows=3790872 loops=1)
                      Index Cond: (objectid < 1200000)
  Total runtime: 34002.828 ms
(9 rows)

On Tue, 10 Aug 2010, Kevin Grittner wrote:
> select group, value from tbl x
>  where not exists
>        (select * from tbl y
>          where y.group = x.group and y.number > x.number);

This is a join, which is quite a bit slower:

explain analyse select objectid, fieldname, sourcename from tracker as a
where not exists(select * from tracker as b where a.objectid = b.objectid
and a.fieldname = b.fieldname and a.version < b.version and b.objectid <
1200000) and a.objectid < 1200000;

                                QUERY PLAN
---------------------------------------------------------------------------
  Merge Anti Join  (cost=2981427.73..3042564.32 rows=2411105 width=30)
                   (actual time=24834.372..53939.131 rows=1802376 loops=1)
    Merge Cond: ((a.objectid = b.objectid) AND (a.fieldname = b.fieldname))
    Join Filter: (a.version < b.version)
    ->  Sort  (cost=1490713.86..1499755.51 rows=3616658 width=34)
              (actual time=12122.478..15944.255 rows=3790872 loops=1)
          Sort Key: a.objectid, a.fieldname
          Sort Method:  quicksort  Memory: 420980kB
          ->  Index Scan using tracker_objectid on tracker a
              (cost=0.00..1096747.23 rows=3616658 width=34)
              (actual time=0.070..5403.235 rows=3790872 loops=1)
                Index Cond: (objectid < 1200000)
    ->  Sort  (cost=1490713.86..1499755.51 rows=3616658 width=17)
              (actual time=12710.564..20952.841 rows=8344994 loops=1)
          Sort Key: b.objectid, b.fieldname
          Sort Method:  quicksort  Memory: 336455kB
          ->  Index Scan using tracker_objectid on tracker b
              (cost=0.00..1096747.23 rows=3616658 width=17)
              (actual time=0.084..5781.844 rows=3790872 loops=1)
                Index Cond: (objectid < 1200000)
  Total runtime: 55756.482 ms
(14 rows)

On Tue, 10 Aug 2010, Jonathan Blitz wrote:
> Select groupfield,value
> From tbl x1
> Where number = (select max(number) from tbl x2 where x2.groupfield=
> x1.groupfield)

This one effectively forces a nested loop join:

explain analyse select objectid, fieldname, sourcename from tracker as a
where version = (select max(version) from tracker as b where a.objectid =
b.objectid and a.fieldname = b.fieldname) and a.objectid < 1200000;

                                QUERY PLAN
-------------------------------------------------------------------------
  Index Scan using tracker_objectid on tracker a
        (cost=0.00..58443381.75 rows=18083 width=34)
        (actual time=6.482..59803.225 rows=1802376 loops=1)
    Index Cond: (objectid < 1200000)
    Filter: (version = (SubPlan 2))
    SubPlan 2
      ->  Result  (cost=15.89..15.90 rows=1 width=0)
                  (actual time=0.011..0.012 rows=1 loops=3790872)
            InitPlan 1 (returns $2)
              ->  Limit  (cost=0.00..15.89 rows=1 width=4)
                         (actual time=0.007..0.008 rows=1 loops=3790872)
                    ->  Index Scan Backward using tracker_all on tracker b
                         (cost=0.00..31.78 rows=2 width=4)
                         (actual time=0.005..0.005 rows=1 loops=3790872)
                          Index Cond: (($0 = objectid) AND ($1 = fieldname))
                          Filter: (version IS NOT NULL)
  Total runtime: 61649.116 ms
(11 rows)

On Tue, 10 Aug 2010, Jonathan Blitz wrote:
> Select groupfield,value
> From tbl x1
> Where (groupfield,number) in (select groupfield,max(number) from tbl group
> by groupfield)

This is another join.

explain analyse select objectid, fieldname, sourcename from tracker where
(objectid, fieldname, version) in (select objectid, fieldname,
max(version) from tracker group by objectid, fieldname);

I terminated this query after about an hour. Here is the EXPLAIN:

                               QUERY PLAN
------------------------------------------------------------------------
  Hash Join  (cost=55310973.80..72060974.32 rows=55323 width=30)
    Hash Cond: ((public.tracker.objectid = public.tracker.objectid)
            AND (public.tracker.fieldname = public.tracker.fieldname)
            AND (public.tracker.version = (max(public.tracker.version))))
    ->  Seq Scan on tracker  (cost=0.00..5310600.80 rows=293972480 width=34)
    ->  Hash  (cost=54566855.96..54566855.96 rows=29397248 width=40)
          ->  GroupAggregate
              (cost=50965693.08..54272883.48 rows=29397248 width=17)
                ->  Sort
                    (cost=50965693.08..51700624.28 rows=293972480 width=17)
                      Sort Key: public.tracker.objectid, public.tracker.fieldname
                      ->  Seq Scan on tracker
                          (cost=0.00..5310600.80 rows=293972480 width=17)
(8 rows)

Matthew

--
 I quite understand I'm doing algebra on the blackboard and the usual response
 is to throw objects...  If you're going to freak out... wait until party time
 and invite me along                     -- Computer Science Lecturer

pgsql-performance by date:

Previous
From: Scott Marlowe
Date:
Subject: Re: Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
Next
From: Joseph Conway
Date:
Subject: 32 vs 64 bit build on Solaris Sparc