Re: Sorted group by - Mailing list pgsql-performance
From | Matthew Wakeling |
---|---|
Subject | Re: Sorted group by |
Date | |
Msg-id | alpine.DEB.2.00.1008111148300.2654@aragorn.flymine.org Whole thread Raw |
In response to | Re: Sorted group by (Jonathan Blitz <jblitz@013.net>) |
List | pgsql-performance |
Original query: explain analyse select * from tracker where objectid < 1200000; QUERY PLAN ----------------------------------------------------------------------- Index Scan using tracker_objectid on tracker (cost=0.00..915152.62 rows=3684504 width=33) (actual time=0.061..5402.608 rows=3790872 loops=1) Index Cond: (objectid < 1200000) Total runtime: 9134.362 ms (3 rows) On Tue, 10 Aug 2010, hubert depesz lubaczewski wrote: > select distinct on (group) * > from table > order by group desc, number desc; This solution is rather obvious, and works on older versions of Postgres. Thanks. However, the burden of sorting by two columns (actually, in our application the group is two column, so sorting by three columns instead) makes this significantly slower than just copying the whole data through our application (which effectively does a hash aggregation). explain analyse select distinct on (objectid, fieldname) objectid, fieldname, sourcename, version from tracker where objectid < 1200000 order by objectid, fieldname, version desc; QUERY PLAN -------------------------------------------------------------------------- Unique (cost=1330828.11..1357953.05 rows=361666 width=34) (actual time=12815.878..22452.737 rows=1782996 loops=1) -> Sort (cost=1330828.11..1339869.76 rows=3616658 width=34) (actual time=12815.873..16608.903 rows=3790872 loops=1) Sort Key: objectid, fieldname, version Sort Method: quicksort Memory: 420980kB -> Index Scan using tracker_objectid on tracker (cost=0.00..936861.47 rows=3616658 width=34) (actual time=0.061..5441.050 rows=3790872 loops=1) Index Cond: (objectid < 1200000) Total runtime: 24228.724 ms (7 rows) On Tue, 10 Aug 2010, Thomas Kellerer wrote: > select group_, value_ > from ( > select group_, value_, number_, row_number() over (partition by group_ > order by value_ desc) as row_num > from numbers > ) t > where row_num = 1 > order by group_ desc This looks quite cute, however it is slightly slower than the DISTINCT ON approach. explain analyse select objectid, fieldname, sourcename from (select objectid, fieldname, sourcename, version, row_number() over (partition by objectid, fieldname order by version desc) as row_num from tracker where objectid < 1200000) as t where row_num = 1; QUERY PLAN ------------------------------------------------------------------------- Subquery Scan t (cost=1330828.11..1457411.14 rows=18083 width=68) (actual time=12835.553..32220.075 rows=1782996 loops=1) Filter: (t.row_num = 1) -> WindowAgg (cost=1330828.11..1412202.92 rows=3616658 width=34) (actual time=12835.541..26471.802 rows=3790872 loops=1) -> Sort (cost=1330828.11..1339869.76 rows=3616658 width=34) (actual time=12822.560..16646.112 rows=3790872 loops=1) Sort Key: tracker.objectid, tracker.fieldname, tracker.version Sort Method: quicksort Memory: 420980kB -> Index Scan using tracker_objectid on tracker (cost=0.00..936861.47 rows=3616658 width=34) (actual time=0.067..5433.790 rows=3790872 loops=1) Index Cond: (objectid < 1200000) Total runtime: 34002.828 ms (9 rows) On Tue, 10 Aug 2010, Kevin Grittner wrote: > select group, value from tbl x > where not exists > (select * from tbl y > where y.group = x.group and y.number > x.number); This is a join, which is quite a bit slower: explain analyse select objectid, fieldname, sourcename from tracker as a where not exists(select * from tracker as b where a.objectid = b.objectid and a.fieldname = b.fieldname and a.version < b.version and b.objectid < 1200000) and a.objectid < 1200000; QUERY PLAN --------------------------------------------------------------------------- Merge Anti Join (cost=2981427.73..3042564.32 rows=2411105 width=30) (actual time=24834.372..53939.131 rows=1802376 loops=1) Merge Cond: ((a.objectid = b.objectid) AND (a.fieldname = b.fieldname)) Join Filter: (a.version < b.version) -> Sort (cost=1490713.86..1499755.51 rows=3616658 width=34) (actual time=12122.478..15944.255 rows=3790872 loops=1) Sort Key: a.objectid, a.fieldname Sort Method: quicksort Memory: 420980kB -> Index Scan using tracker_objectid on tracker a (cost=0.00..1096747.23 rows=3616658 width=34) (actual time=0.070..5403.235 rows=3790872 loops=1) Index Cond: (objectid < 1200000) -> Sort (cost=1490713.86..1499755.51 rows=3616658 width=17) (actual time=12710.564..20952.841 rows=8344994 loops=1) Sort Key: b.objectid, b.fieldname Sort Method: quicksort Memory: 336455kB -> Index Scan using tracker_objectid on tracker b (cost=0.00..1096747.23 rows=3616658 width=17) (actual time=0.084..5781.844 rows=3790872 loops=1) Index Cond: (objectid < 1200000) Total runtime: 55756.482 ms (14 rows) On Tue, 10 Aug 2010, Jonathan Blitz wrote: > Select groupfield,value > From tbl x1 > Where number = (select max(number) from tbl x2 where x2.groupfield= > x1.groupfield) This one effectively forces a nested loop join: explain analyse select objectid, fieldname, sourcename from tracker as a where version = (select max(version) from tracker as b where a.objectid = b.objectid and a.fieldname = b.fieldname) and a.objectid < 1200000; QUERY PLAN ------------------------------------------------------------------------- Index Scan using tracker_objectid on tracker a (cost=0.00..58443381.75 rows=18083 width=34) (actual time=6.482..59803.225 rows=1802376 loops=1) Index Cond: (objectid < 1200000) Filter: (version = (SubPlan 2)) SubPlan 2 -> Result (cost=15.89..15.90 rows=1 width=0) (actual time=0.011..0.012 rows=1 loops=3790872) InitPlan 1 (returns $2) -> Limit (cost=0.00..15.89 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=3790872) -> Index Scan Backward using tracker_all on tracker b (cost=0.00..31.78 rows=2 width=4) (actual time=0.005..0.005 rows=1 loops=3790872) Index Cond: (($0 = objectid) AND ($1 = fieldname)) Filter: (version IS NOT NULL) Total runtime: 61649.116 ms (11 rows) On Tue, 10 Aug 2010, Jonathan Blitz wrote: > Select groupfield,value > From tbl x1 > Where (groupfield,number) in (select groupfield,max(number) from tbl group > by groupfield) This is another join. explain analyse select objectid, fieldname, sourcename from tracker where (objectid, fieldname, version) in (select objectid, fieldname, max(version) from tracker group by objectid, fieldname); I terminated this query after about an hour. Here is the EXPLAIN: QUERY PLAN ------------------------------------------------------------------------ Hash Join (cost=55310973.80..72060974.32 rows=55323 width=30) Hash Cond: ((public.tracker.objectid = public.tracker.objectid) AND (public.tracker.fieldname = public.tracker.fieldname) AND (public.tracker.version = (max(public.tracker.version)))) -> Seq Scan on tracker (cost=0.00..5310600.80 rows=293972480 width=34) -> Hash (cost=54566855.96..54566855.96 rows=29397248 width=40) -> GroupAggregate (cost=50965693.08..54272883.48 rows=29397248 width=17) -> Sort (cost=50965693.08..51700624.28 rows=293972480 width=17) Sort Key: public.tracker.objectid, public.tracker.fieldname -> Seq Scan on tracker (cost=0.00..5310600.80 rows=293972480 width=17) (8 rows) Matthew -- I quite understand I'm doing algebra on the blackboard and the usual response is to throw objects... If you're going to freak out... wait until party time and invite me along -- Computer Science Lecturer
pgsql-performance by date: