Re: Extracting superlatives - SQL design philosophy - Mailing list pgsql-performance
From | Richard Huxton
---|---
Subject | Re: Extracting superlatives - SQL design philosophy
Date |
Msg-id | 4B86307D.3000901@archonet.com
In response to | Re: Extracting superlatives - SQL design philosophy (Dave Crooke <dcrooke@gmail.com>)
List | pgsql-performance
On 24/02/10 23:37, Dave Crooke wrote:
> 1. The city temps table is a toy example, not meant to be realistic :-)

You knew that and I guessed it, but it's worth stating these things for
people who read the archives a year from now.

> 2. Yes, my (Java) algorithm is deterministic ... it will return
> exactly one row per city, and that will be the row (or strictly, *a*
> row) containing the highest temp. Temp value ties will break in favour
> of earlier rows in Guinness Book of Records tradition :-) It's
> equivalent to a HashAggregate implementation.

But not when you add in other columns (which is what you're trying to do).

> The following two query plans (from my real schema) illustrate the
> itch I am trying to scratch ... I want the functionality of the 2nd
> one, but with the execution plan structure of the first:
>
> # explain analyse select a, max(b) from perf_raw_2010_02_23 group by a;
>                                 QUERY PLAN
> --------------------------------------------------------------------------
>  HashAggregate  (cost=117953.09..117961.07 rows=639 width=8) (actual
> time=10861.845..10863.008 rows=1023 loops=1)
>    ->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39
> rows=5276139 width=8) (actual time=0.038..4459.222 rows=5276139
> loops=1)
>  Total runtime: 10863.856 ms
> (3 rows)
>
> Time: 10864.817 ms
> # explain analyse select distinct on (a) * from perf_raw_2010_02_23
> order by a, b desc ;

One big bit of the cost difference is going to be the ordering you need
to get a repeatable result.

>                                 QUERY PLAN
> --------------------------------------------------------------------------
>  Unique  (cost=1059395.04..1085775.73 rows=639 width=28) (actual
> time=46011.204..58428.210 rows=1023 loops=1)
>    ->  Sort  (cost=1059395.04..1072585.39 rows=5276139 width=28)
> (actual time=46011.200..53561.112 rows=5276139 loops=1)
>          Sort Key: a, b
>          Sort Method:  external merge  Disk: 247584kB
>    -- actually OS RAM buffers

Even if the sort never actually reaches a physical disk, you should
still see an improvement by increasing sort_mem for the duration of
that one query. It's not going to be of the magnitude you want, but
it's probably worth doing.

>          ->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39
> rows=5276139 width=28) (actual time=0.047..6491.036 rows=5276139
> loops=1)
>  Total runtime: 58516.185 ms
> (6 rows)
>
> Time: 58517.233 ms
>
> The only difference between these two is that the second query returns
> the whole row. The *ratio* in cost between these two plans increases
> in proportion to log(n) of the table size ... at 5.5m rows it's
> livable, at 500m it's probably not :-!

If performance on this query is vital to you, and the table doesn't
change after its initial population (which I'm guessing is true), then
try an index on (a ASC, b DESC) and CLUSTER the table on that index.
Depending on the ratio of distinct a:b values, that could be what
you're after. Rough sketches of these points follow below the sig.

--
  Richard Huxton
  Archonet Ltd
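P.S. To make the "other columns" point concrete, here's a sketch using
an illustrative extra column c (not a real column of your schema):

  -- PostgreSQL rejects this: c is neither grouped nor aggregated,
  -- and GROUP BY has no way to say "the c from the row with max(b)".
  SELECT a, max(b), c
    FROM perf_raw_2010_02_23
   GROUP BY a;

That's why you get pushed towards DISTINCT ON (or a self-join against
the max) as soon as you want the whole row back.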
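For the memory suggestion, a minimal sketch - the setting is spelled
work_mem from 8.0 onwards (sort_mem before that), and '512MB' is a
placeholder, not a tested value:

  BEGIN;
  -- SET LOCAL confines the change to this one transaction
  SET LOCAL work_mem = '512MB';
  EXPLAIN ANALYSE
  SELECT DISTINCT ON (a) *
    FROM perf_raw_2010_02_23
   ORDER BY a, b DESC;
  COMMIT;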
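And the index/CLUSTER suggestion - the index name is illustrative, and
CLUSTER ... USING is the 8.4 spelling (pre-8.4 it's
CLUSTER perf_raw_a_b_idx ON perf_raw_2010_02_23):

  CREATE INDEX perf_raw_a_b_idx
      ON perf_raw_2010_02_23 (a ASC, b DESC);
  -- Physically rewrite the table in index order
  CLUSTER perf_raw_2010_02_23 USING perf_raw_a_b_idx;
  -- Refresh planner statistics after the rewrite
  ANALYSE perf_raw_2010_02_23;

With the table laid out in that order, the DISTINCT ON plan can walk
the index and take the first row of each a-group rather than sorting
5.5m rows every time.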