Thread: Extracting superlatives - SQL design philosophy
This is a generic SQL issue and not PG specific, but I'd like to get
an opinion from this list.
Consider the following data:
# \d bar
Table "public.bar"
Column | Type | Modifiers
--------+-----------------------------+-----------
city | character varying(255) |
temp | integer |
date | timestamp without time zone |
# select * from bar order by city, date;
city | temp | date
-----------+------+---------------------
Austin | 75 | 2010-02-21 15:00:00
Austin | 35 | 2010-02-23 15:00:00
Edinburgh | 42 | 2010-02-23 15:00:00
New York | 56 | 2010-02-23 15:00:00
New York | 78 | 2010-06-23 15:00:00
(5 rows)
If you want the highest recorded temperature for a city, that's easy
to do, since the selection criteria works on the same column that we
are extracting:
# select city, max(temp) from bar group by city order by 1;
city | max
-----------+-----
Austin | 75
Edinburgh | 42
New York | 78
(3 rows)
However there is (AFAIK) no simple way in plain SQL to write a query
that performs such an aggregation where the aggregation criteria is on
one column and you want to return another, e.g. adding the *date
of* that highest temperature to the output above, or doing a query to
get the most recent temperature reading for each city.
What I'd like to do is something like the below (and I'm inventing
mock syntax here, the following is not valid SQL):
-- Ugly implicit syntax but no worse than an Oracle outer join ;-)
select city, temp, date from bar where date=max(date) group by city,
temp order by city;
or perhaps
-- More explicit
select aggregate_using(max(date), city, temp, date) from bar group by
city, temp order by city;
Both of the above, if they existed, would be a single data access
followed by a sort-merge.
The only way I know how to do it involves doing two accesses to the data, e.g.
# select city, temp, date from bar a where date=(select max(b.date)
from bar b where a.city=b.city) order by 1;
city | temp | date
-----------+------+---------------------
Austin | 35 | 2010-02-23 15:00:00
Edinburgh | 42 | 2010-02-23 15:00:00
New York | 78 | 2010-06-23 15:00:00
(3 rows)
# explain select * from bar a where date=(select max(b.date) from bar
b where a.city=b.city) order by 1;
QUERY PLAN
--------------------------------------------------------------------------
Sort (cost=1658.86..1658.87 rows=1 width=528)
Sort Key: a.city
-> Seq Scan on bar a (cost=0.00..1658.85 rows=1 width=528)
Filter: (date = (subplan))
SubPlan
-> Aggregate (cost=11.76..11.77 rows=1 width=8)
-> Seq Scan on bar b (cost=0.00..11.75 rows=1
width=8) -- would be an index lookup in a real scenario
Filter: (($0)::text = (city)::text)
(8 rows)
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
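For readers who want to poke at the problem, here is a self-contained reproduction of the correlated-subquery approach from the post above. It is a sketch only: it uses Python's bundled SQLite engine as a stand-in for PostgreSQL (the query itself is plain portable SQL), with the thread's toy "bar" schema.

```python
import sqlite3

# Rebuild the thread's toy "bar" table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (city TEXT, temp INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO bar VALUES (?, ?, ?)",
    [
        ("Austin", 75, "2010-02-21 15:00:00"),
        ("Austin", 35, "2010-02-23 15:00:00"),
        ("Edinburgh", 42, "2010-02-23 15:00:00"),
        ("New York", 56, "2010-02-23 15:00:00"),
        ("New York", 78, "2010-06-23 15:00:00"),
    ],
)

# Most recent reading per city via a correlated max() subquery -- the
# "two accesses to the data" the post complains about: an outer scan
# plus a per-row aggregate lookup.
rows = conn.execute(
    """SELECT city, temp, date FROM bar a
       WHERE date = (SELECT max(b.date) FROM bar b WHERE a.city = b.city)
       ORDER BY 1"""
).fetchall()
print(rows)
# -> Austin 35, Edinburgh 42, New York 78, each with its latest date
```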
This looks to be a perfect use for SELECT DISTINCT ON:

SELECT DISTINCT ON (city)
  * FROM bar
  ORDER BY city, temp desc

Or am I misunderstanding the issue?

Garrett Murphy

-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Dave Crooke
Sent: Wednesday, February 24, 2010 2:31 PM
To: pgsql-performance
Subject: [PERFORM] Extracting superlatives - SQL design philosophy

> [original message snipped]
You could do:

select
    B.City,
    MaxCityTemp.Temp,
    min(B.Date) as FirstMaxDate
from bar b
    INNER JOIN (select city, max(temp) as Temp from Bar group by City) as MaxCityTemp
        ON B.City = MaxCityTemp.City
Group by
    B.City,
    MaxCityTemp.Temp

George Sexton
MH Software, Inc.
http://www.mhsoftware.com/
Voice: 303 438 9585

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Dave Crooke
> Sent: Wednesday, February 24, 2010 2:31 PM
> To: pgsql-performance
> Subject: [PERFORM] Extracting superlatives - SQL design philosophy
>
> [original message snipped]
I missed something:

select
    B.City,
    MaxCityTemp.Temp,
    min(B.Date) as FirstMaxDate
from bar b
    INNER JOIN (select city, max(temp) as Temp from Bar group by City) as MaxCityTemp
        ON B.City = MaxCityTemp.City AND B.Temp = MaxCityTemp.Temp
Group by
    B.City,
    MaxCityTemp.Temp

George Sexton
MH Software, Inc.
http://www.mhsoftware.com/
Voice: 303 438 9585

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of George Sexton
> Sent: Wednesday, February 24, 2010 2:58 PM
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Extracting superlatives - SQL design philosophy
>
> [earlier messages snipped]
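George's corrected join-against-an-aggregate approach can be checked against the thread's sample data. The sketch below runs it on SQLite purely as a stand-in engine (the SQL is portable); min(date) implements his tie-break, returning the first date on which the maximum was recorded.

```python
import sqlite3

# The thread's toy "bar" table, rebuilt in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (city TEXT, temp INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO bar VALUES (?, ?, ?)",
    [
        ("Austin", 75, "2010-02-21 15:00:00"),
        ("Austin", 35, "2010-02-23 15:00:00"),
        ("Edinburgh", 42, "2010-02-23 15:00:00"),
        ("New York", 56, "2010-02-23 15:00:00"),
        ("New York", 78, "2010-06-23 15:00:00"),
    ],
)

# Join each row back to its city's max(temp); min(b.date) picks the
# earliest date on which that maximum was recorded.
rows = conn.execute(
    """SELECT b.city, m.max_temp, min(b.date) AS first_max_date
       FROM bar b
       JOIN (SELECT city, max(temp) AS max_temp FROM bar GROUP BY city) m
         ON b.city = m.city AND b.temp = m.max_temp
       GROUP BY b.city, m.max_temp
       ORDER BY b.city"""
).fetchall()
print(rows)
```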
Garrett's is the best answer from the list .... the only fly in the
ointment here is that it performs a full sort of the records, which
isn't strictly necessary to the required output. This is functionally
equivalent to what I came up with for a MODE (most common value)
aggregation, but the syntax is a bit neater.

Craig / George - there are lots of ways to do this with a subquery or
join back against the bar table. My goal was actually to avoid this
dual lookup and join, not for this contrived example but for my
real-world case, where the "bar" table is not actually a table but a
record set generated on the fly by a non-trivial subquery which I
don't want to repeat.

Mose - I think I get what you're aiming at here, but your query as
stated returns a syntax error.

What I'd like is a way in SQL to get an execution plan which matches
the Java algorithm below, which just does one "table scan" down the
data and aggregates in place, i.e. it's O(n) with row lookups:

HashMap<String, Row> winners = new HashMap<String, Row>();
for (Row r : rows) {
    Row oldrow = winners.get(r.city);
    if (oldrow == null || r.temp > oldrow.temp) winners.put(r.city, r);
}
for (String city : winners.keySet()) System.out.println(winners.get(city));

I'd imagine it would be possible to have a query planner optimization
that would convert Garrett's DISTINCT ON syntax to do what I was
trying to, by realizing that DISTINCT ON X ... ORDER BY Y DESC is
going to return the one row for each X which has the highest value of
Y, and so use a MAX-structured accumulation instead of a sort.

Cheers
Dave

On Wed, Feb 24, 2010 at 3:43 PM, Garrett Murphy <gmurphy@lawlogix.com> wrote:
> This looks to be a perfect use for SELECT DISTINCT ON:
>
> SELECT DISTINCT ON (city)
> * FROM bar
> ORDER BY city, temp desc
>
> Or am I misunderstanding the issue?
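Dave's single-pass hash aggregation translates almost line for line into Python. This sketch (plain Python, no database; the row tuples just mirror the thread's toy data) shows the O(n) behaviour he wants the planner to produce: one scan, one hash table, no sort.

```python
# One scan, one dict: keep the best row seen so far per city.
# Ties break in favour of the earlier row, as in the Java version.
rows = [
    ("Austin", 75, "2010-02-21 15:00:00"),
    ("Austin", 35, "2010-02-23 15:00:00"),
    ("Edinburgh", 42, "2010-02-23 15:00:00"),
    ("New York", 56, "2010-02-23 15:00:00"),
    ("New York", 78, "2010-06-23 15:00:00"),
]

winners = {}
for city, temp, date in rows:
    best = winners.get(city)
    if best is None or temp > best[1]:
        winners[city] = (city, temp, date)

for city in sorted(winners):
    print(winners[city])
# -> one row per city, carrying the date of its highest temp
```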
Dave Crooke wrote:
> However there is (AFAIK) no simple way in plain SQL to write a query
> that performs such an aggregation where the aggregation criteria is on
> one column and you want to return another, e.g. adding the *date of*
> that highest temperature to the output above, or doing a query to
> get the most recent temperature reading for each city.

If you add a unique-id column to your table that's filled in from a
sequence, it becomes easy:

select city, temp, date from bar
where id in (select id from bar where ... whatever you like ...);

Craig
On 24/02/10 22:47, Dave Crooke wrote:
> I'd imagine it would be possible to have a query planner optimization
> that would convert Garrett's DISTINCT ON syntax to do what I was
> trying to, by realizing that DISTINCT ON X ... ORDER BY Y DESC is
> going to return the one row for each X which has the highest value
> of Y, and so use a MAX-structured accumulation instead of a sort.

Why is there only one row? For city temperatures, that seems unlikely.

In the event of more than one row, does your algorithm give repeatable
results?

--
Richard Huxton
Archonet Ltd
1. The city temps table is a toy example, not meant to be realistic :-)

2. Yes, my (Java) algorithm is deterministic ... it will return
exactly one row per city, and that will be the row (or strictly, *a*
row) containing the highest temp. Temp value ties will break in favour
of earlier rows, in Guinness Book of Records tradition :-) It's
equivalent to a HashAggregate implementation.

The following two query plans (from my real schema) illustrate the
itch I am trying to scratch .... I want the functionality of the
second one, but with the execution plan structure of the first:

# explain analyse select a, max(b) from perf_raw_2010_02_23 group by a;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=117953.09..117961.07 rows=639 width=8) (actual time=10861.845..10863.008 rows=1023 loops=1)
   ->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39 rows=5276139 width=8) (actual time=0.038..4459.222 rows=5276139 loops=1)
 Total runtime: 10863.856 ms
(3 rows)

Time: 10864.817 ms

# explain analyse select distinct on (a) * from perf_raw_2010_02_23 order by a, b desc;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1059395.04..1085775.73 rows=639 width=28) (actual time=46011.204..58428.210 rows=1023 loops=1)
   ->  Sort  (cost=1059395.04..1072585.39 rows=5276139 width=28) (actual time=46011.200..53561.112 rows=5276139 loops=1)
         Sort Key: a, b
         Sort Method:  external merge  Disk: 247584kB    -- actually OS RAM buffers
         ->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39 rows=5276139 width=28) (actual time=0.047..6491.036 rows=5276139 loops=1)
 Total runtime: 58516.185 ms
(6 rows)

Time: 58517.233 ms

The only difference between these two is that the second query returns
the whole row. The *ratio* in cost between these two plans increases
in proportion to log(n) of the table size ... at 5.5m rows it's
livable, at 500m it's probably not :-!

Cheers
Dave

On Wed, Feb 24, 2010 at 5:12 PM, Richard Huxton <dev@archonet.com> wrote:
> Why is there only one row? For city temperatures, that seems unlikely.
>
> In the event of more than one row, does your algorithm give repeatable
> results?
On 24/02/10 23:37, Dave Crooke wrote:
> 1. The city temps table is a toy example, not meant to be realistic :-)

You knew that and I guessed it, but it's worth stating these things for
people who read the archives a year from now.

> 2. Yes, my (Java) algorithm is deterministic ... it will return
> exactly one row per city, and that will be the row (or strictly, *a*
> row) containing the highest temp.

But not when you add in other columns (which is what you're trying to do).

> The following two query plans (from my real schema) illustrate the
> itch I am trying to scratch .... I want the functionality of the
> second one, but with the execution plan structure of the first:

One big bit of the cost difference is going to be the ordering you need
to get a repeatable result.

>          Sort Method:  external merge  Disk: 247584kB    -- actually OS RAM buffers

Even if the sort never actually reaches a physical disk, you should
still see an improvement by increasing work_mem for the duration of the
one query. It's not going to be the magnitude you want, but probably
worth doing.

> The only difference between these two is that the second query returns
> the whole row. The *ratio* in cost between these two plans increases
> in proportion to log(n) of the table size ... at 5.5m rows it's
> livable, at 500m it's probably not :-!

If performance on this query is vital to you, and the table doesn't
change after initial population (which I'm guessing is true), then try
an index on (a asc, b desc) and CLUSTER that index. Depending on the
ratio of distinct a:b values, that could be what you're after.

--
Richard Huxton
Archonet Ltd
> -- More explicit
> select aggregate_using(max(date), city, temp, date) from bar group by
> city, temp order by city;

select city, max(ROW(temp, date)) from bar group by city;

Does not work (alas) for lack of a default comparison for the record type.

Another solution, which works wonders if you've got the list of cities
in a separate table, and an index on (city, temp), is this:

SELECT c.city,
       (SELECT ROW(t.date, t.temp)
        FROM cities_temp t
        WHERE t.city = c.city
        ORDER BY temp DESC LIMIT 1)
FROM cities;

This will do a nested loop index scan, and it is the fastest way,
except if you have very few rows per city. The syntax is ugly and you
have to extract the stuff from the ROW() afterwards, though.

Unfortunately, this does not work:

SELECT c.city,
       (SELECT t.date, t.temp
        FROM cities_temp t
        WHERE t.city = c.city
        ORDER BY temp DESC LIMIT 1) AS m
FROM cities;

because the subselect isn't allowed to return more than 1 column.

Note that you can also get the usually annoying top-N by category to
use the index by doing something like:

SELECT c.city,
       (SELECT array_agg(date)
        FROM (SELECT t.date
              FROM cities_temp t
              WHERE t.city = c.city
              ORDER BY temp DESC LIMIT 5) x) AS m
FROM cities;

The results aren't in a very usable form either, but:

CREATE INDEX ti ON annonces (type_id, price) WHERE price IS NOT NULL;

EXPLAIN ANALYZE SELECT t.id,
       (SELECT ROW(a.id, a.price, a.date_annonce)
        FROM annonces a
        WHERE a.type_id = t.id AND price IS NOT NULL
        ORDER BY price DESC LIMIT 1)
FROM types_bien t;

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on types_bien t  (cost=0.00..196.09 rows=57 width=4) (actual time=0.025..0.511 rows=57 loops=1)
   SubPlan 1
     ->  Limit  (cost=0.00..3.41 rows=1 width=16) (actual time=0.008..0.008 rows=1 loops=57)
           ->  Index Scan Backward using ti on annonces a  (cost=0.00..8845.65 rows=2592 width=16) (actual time=0.007..0.007 rows=1 loops=57)
                 Index Cond: (type_id = $0)
 Total runtime: 0.551 ms

explain analyze select distinct type_id, first_value(price) over w as max_price
from annonces where price is not null
window w as (partition by type_id order by price desc);

                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=30515.41..30626.87 rows=11146 width=10) (actual time=320.927..320.971 rows=46 loops=1)
   ->  WindowAgg  (cost=27729.14..29958.16 rows=111451 width=10) (actual time=195.289..282.150 rows=111289 loops=1)
         ->  Sort  (cost=27729.14..28007.76 rows=111451 width=10) (actual time=195.278..210.762 rows=111289 loops=1)
               Sort Key: type_id, price
               Sort Method:  quicksort  Memory: 8289kB
               ->  Seq Scan on annonces  (cost=0.00..18386.17 rows=111451 width=10) (actual time=0.009..72.589 rows=111289 loops=1)
                     Filter: (price IS NOT NULL)
 Total runtime: 322.382 ms

Here using the index is 600x faster ... worth a bit of ugly SQL, you
decide.

By disabling seq_scan and bitmapscan, you can coerce this plan:

EXPLAIN ANALYZE SELECT DISTINCT ON (type_id) type_id, date_annonce, price
FROM annonces WHERE price IS NOT NULL
ORDER BY type_id, price LIMIT 40;

                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..78757.61 rows=33 width=14) (actual time=0.021..145.509 rows=40 loops=1)
   ->  Unique  (cost=0.00..78757.61 rows=33 width=14) (actual time=0.021..145.498 rows=40 loops=1)
         ->  Index Scan using ti on annonces  (cost=0.00..78478.99 rows=111451 width=14) (actual time=0.018..132.671 rows=110796 loops=1)
 Total runtime: 145.549 ms

This plan would be very bad (unless the whole table is in RAM) because
I guess the index scan isn't aware of the DISTINCT ON, so it scans all
rows in the index and in the table.
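The correlated ORDER BY ... LIMIT 1 trick above can be tried outside PostgreSQL too. The sketch below is an assumption-laden adaptation: it uses SQLite and the thread's toy bar table rather than the cities/cities_temp schema, and since PostgreSQL's ROW() constructor has no SQLite equivalent, the scalar subselect returns just the date column.

```python
import sqlite3

# The thread's toy "bar" table, rebuilt in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (city TEXT, temp INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO bar VALUES (?, ?, ?)",
    [
        ("Austin", 75, "2010-02-21 15:00:00"),
        ("Austin", 35, "2010-02-23 15:00:00"),
        ("Edinburgh", 42, "2010-02-23 15:00:00"),
        ("New York", 56, "2010-02-23 15:00:00"),
        ("New York", 78, "2010-06-23 15:00:00"),
    ],
)

# For each distinct city, fetch the date of its hottest reading with a
# correlated LIMIT 1 subselect -- the shape that lets a (city, temp)
# index satisfy each lookup with one short descending scan.
rows = conn.execute(
    """SELECT c.city,
              (SELECT t.date FROM bar t
               WHERE t.city = c.city
               ORDER BY t.temp DESC LIMIT 1) AS date_of_max
       FROM (SELECT DISTINCT city FROM bar) c
       ORDER BY c.city"""
).fetchall()
print(rows)
```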
Hi all,
Just for your information, and this is not related to PG directly:
Teradata provides a “qualify” syntax that works as a filtering condition on a window function's result. As far as I know, it is the only database that allows filtering directly on window functions.
So, as an example, the query you ask for becomes very easy on this database:
select
city, temp, date
from bar
qualify row_number() over (partition by city order by temp desc)=1
This is very practical indeed (you can also mix it with the classical where/having/group by syntax).
On Postgres, you may get the same result using an inner query (sorry, I can’t test it right now) such as:
select
city, temp, date
from
(select city, temp, date, row_number() over (partition by city order by temp desc) as nr
from bar ) a1
where nr=1
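The inner-query rewrite above can be run as-is against any engine with window functions; a minimal runnable sketch, using Python's bundled SQLite (3.25+) in place of Postgres and the thread's sample data:

```python
import sqlite3

# Recreate the thread's sample "bar" table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (city TEXT, temp INTEGER, date TEXT)")
conn.executemany("INSERT INTO bar VALUES (?, ?, ?)", [
    ("Austin",    75, "2010-02-21 15:00:00"),
    ("Austin",    35, "2010-02-23 15:00:00"),
    ("Edinburgh", 42, "2010-02-23 15:00:00"),
    ("New York",  56, "2010-02-23 15:00:00"),
    ("New York",  78, "2010-06-23 15:00:00"),
])

# Rank readings within each city by temperature, keep only the top row:
# this returns the *date of* the highest temperature alongside it.
hottest = conn.execute("""
    SELECT city, temp, date FROM
        (SELECT city, temp, date,
                row_number() OVER (PARTITION BY city ORDER BY temp DESC) AS nr
         FROM bar) a1
    WHERE nr = 1
    ORDER BY city
""").fetchall()
print(hottest)
```

This is a single pass over the data followed by a sort, which is exactly the access pattern the original post was asking for.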
Julien Theulier
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On behalf of Mose
Sent: Wednesday, February 24, 2010 22:50
To: Dave Crooke
Cc: pgsql-performance
Subject: Re: [PERFORM] Extracting superlatives - SQL design philosophy
Can you try using window functions?
Something like this:
select distinct
city,
first_value(temp) over w as max_temp,
first_value(date) over w as max_temp_date
from
bar
window w as (partition by city order by temp desc)
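The first_value() approach needs the DISTINCT because the window function emits one value per input row, not per group. A runnable sketch of this (with the OVER clause inlined, since older SQLite versions lack named WINDOW clauses; SQLite 3.25+ standing in for Postgres):

```python
import sqlite3

# Sample "bar" table from the thread, in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (city TEXT, temp INTEGER, date TEXT)")
conn.executemany("INSERT INTO bar VALUES (?, ?, ?)", [
    ("Austin",    75, "2010-02-21 15:00:00"),
    ("Austin",    35, "2010-02-23 15:00:00"),
    ("Edinburgh", 42, "2010-02-23 15:00:00"),
    ("New York",  56, "2010-02-23 15:00:00"),
    ("New York",  78, "2010-06-23 15:00:00"),
])

# first_value() repeats the per-city maximum on every row of that city;
# DISTINCT then collapses the duplicates down to one row per city.
rows = conn.execute("""
    SELECT DISTINCT
        city,
        first_value(temp) OVER (PARTITION BY city ORDER BY temp DESC) AS max_temp,
        first_value(date) OVER (PARTITION BY city ORDER BY temp DESC) AS max_temp_date
    FROM bar
    ORDER BY city
""").fetchall()
print(rows)
```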
- Mose
On Wed, Feb 24, 2010 at 1:31 PM, Dave Crooke <dcrooke@gmail.com> wrote:
This is a generic SQL issue and not PG specific, but I'd like to get
an opinion from this list.
Consider the following data:
# \d bar
Table "public.bar"
Column | Type | Modifiers
--------+-----------------------------+-----------
city | character varying(255) |
temp | integer |
date | timestamp without time zone |
# select * from bar order by city, date;
city | temp | date
-----------+------+---------------------
Austin | 75 | 2010-02-21 15:00:00
Austin | 35 | 2010-02-23 15:00:00
Edinburgh | 42 | 2010-02-23 15:00:00
New York | 56 | 2010-02-23 15:00:00
New York | 78 | 2010-06-23 15:00:00
(5 rows)
If you want the highest recorded temperature for a city, that's easy
to do, since the selection criteria works on the same column that we
are extracting:
# select city, max(temp) from bar group by city order by 1;
city | max
-----------+-----
Austin | 75
Edinburgh | 42
New York | 78
(3 rows)
However there is (AFAIK) no simple way in plain SQL to write a query
that performs such an aggregation where the aggregation criteria is on
one column and you want to return another, e.g. adding the *date
of* that highest temperature to the output above, or doing a query to
get the most recent temperature reading for each city.
What I'd like to do is something like the below (and I'm inventing
mock syntax here, the following is not valid SQL):
-- Ugly implicit syntax but no worse than an Oracle outer join ;-)
select city, temp, date from bar where date=max(date) group by city,
temp order by city;
or perhaps
-- More explicit
select aggregate_using(max(date), city, temp, date) from bar group by
city, temp order by city;
Both of the above, if they existed, would be a single data access
followed by a sort-merge.
The only way I know how to do it involves doing two accesses to the data, e.g.
# select city, temp, date from bar a where date=(select max(b.date)
from bar b where a.city=b.city) order by 1;
city | temp | date
-----------+------+---------------------
Austin | 35 | 2010-02-23 15:00:00
Edinburgh | 42 | 2010-02-23 15:00:00
New York | 78 | 2010-06-23 15:00:00
(3 rows)
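The correlated-subquery form above is easy to verify against the sample data; a minimal runnable check using Python's bundled SQLite in place of Postgres (ISO timestamps compare correctly as strings, so max() on the date column behaves the same):

```python
import sqlite3

# Recreate the sample "bar" table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (city TEXT, temp INTEGER, date TEXT)")
conn.executemany("INSERT INTO bar VALUES (?, ?, ?)", [
    ("Austin",    75, "2010-02-21 15:00:00"),
    ("Austin",    35, "2010-02-23 15:00:00"),
    ("Edinburgh", 42, "2010-02-23 15:00:00"),
    ("New York",  56, "2010-02-23 15:00:00"),
    ("New York",  78, "2010-06-23 15:00:00"),
])

# Most recent reading per city: the inner query re-scans bar once per
# outer row -- the "two accesses to the data" the post complains about.
latest = conn.execute("""
    SELECT city, temp, date FROM bar a
    WHERE date = (SELECT max(b.date) FROM bar b WHERE a.city = b.city)
    ORDER BY 1
""").fetchall()
print(latest)
```

Note that this form returns multiple rows per city if two readings tie on the maximum date, which may or may not be what you want.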
# explain select * from bar a where date=(select max(b.date) from bar
b where a.city=b.city) order by 1;
QUERY PLAN
--------------------------------------------------------------------------
Sort (cost=1658.86..1658.87 rows=1 width=528)
Sort Key: a.city
-> Seq Scan on bar a (cost=0.00..1658.85 rows=1 width=528)
Filter: (date = (subplan))
SubPlan
-> Aggregate (cost=11.76..11.77 rows=1 width=8)
-> Seq Scan on bar b (cost=0.00..11.75 rows=1
width=8) -- would be an index lookup in a real scenario
Filter: (($0)::text = (city)::text)
(8 rows)
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
"Julien Theulier" <julien@squidsolutions.com> writes: > Teradata provides a �qualify� syntax which works as a filtering condition on > a windowed function result. This is the only DB allowing this direct > filtering on windowed functions, from what I know. Seems like you could easily translate that into SQL-standard syntax by adding a level of sub-select: select ... from (select *, window_function wf from ...) ss where wf=1; regards, tom lane
What I ended up doing for my app is just going with straight SQL that generates the "key" tuples with a SELECT DISTINCT, and then has a dependent subquery that does a very small index scan to pull the data for each row (I care somewhat about portability). In order to make this perform, I created a second index on the raw data table with the columns in the order I need for this rollup, which allows PG to do a fairly efficient index range scan.
I had been trying to avoid using the disk space to carry this 2nd index, since it is only needed for the bulk rollup, and then I realised I only have to keep it on the current day's partition, and I can drop it once that partition's data has been aggregated (the insert overhead of the index isn't as much of a concern).
Alternatively, I could have lived without the index by sharding the raw data right down to the rollup intervals, which would mean that rollups are effectively a full table scan anyway, but I didn't want to do that as it would make real-time data extraction queries slower if they had to go across 10-20 tables.
Thanks everyone for the insights
Cheers
Dave
On Wed, Feb 24, 2010 at 4:31 PM, Dave Crooke <dcrooke@gmail.com> wrote:
> This is a generic SQL issue and not PG specific, but I'd like to get
> an opinion from this list.
> [snip -- same message quoted in full above]

Another cool way to do this is via a custom aggregate:
create function maxdata(data, data) returns data as
$$
select case when ($1).date > ($2).date then $1 else $2 end;
$$ language sql;
create aggregate maxdata(data)
(
sfunc=maxdata,
stype=data
);
select (d).* from
(
  select maxdata(data) as d from data group by city
) q;  -- the subquery in FROM needs an alias in Postgres
It does it in a single pass. Where this approach can pay dividends is
when you have a very complicated 'max'-ing criteria to justify the
verbosity of creating the aggregate. If you are not doing the whole
table, the self join is often faster. I'm surprised custom aggregates
aren't used more...they seem very clean and neat to me.
merlin
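The same single-pass custom-aggregate idea can be sketched outside Postgres: Python's sqlite3 module lets you register an aggregate class, so the "keep the row with the greatest date" logic runs in one scan. The aggregate name and the "temp @ date" string encoding below are illustrative choices (SQLite aggregates must return scalars, unlike Postgres composite types):

```python
import sqlite3

class LatestReading:
    """Aggregate: keep the (temp, date) pair with the greatest date seen."""
    def __init__(self):
        self.best = None

    def step(self, temp, date):
        # ISO timestamps compare correctly as strings.
        if self.best is None or date > self.best[1]:
            self.best = (temp, date)

    def finalize(self):
        # SQLite aggregates return scalars, so encode the pair as a string.
        return None if self.best is None else "%d @ %s" % self.best

conn = sqlite3.connect(":memory:")
conn.create_aggregate("latest_reading", 2, LatestReading)
conn.execute("CREATE TABLE bar (city TEXT, temp INTEGER, date TEXT)")
conn.executemany("INSERT INTO bar VALUES (?, ?, ?)", [
    ("Austin",    75, "2010-02-21 15:00:00"),
    ("Austin",    35, "2010-02-23 15:00:00"),
    ("Edinburgh", 42, "2010-02-23 15:00:00"),
    ("New York",  56, "2010-02-23 15:00:00"),
    ("New York",  78, "2010-06-23 15:00:00"),
])

# One pass over bar, one result row per city.
rows = conn.execute(
    "SELECT city, latest_reading(temp, date) FROM bar GROUP BY city ORDER BY city"
).fetchall()
print(rows)
```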