Re: Extracting superlatives - SQL design philosophy - Mailing list pgsql-performance

From: Dave Crooke
Subject: Re: Extracting superlatives - SQL design philosophy
Date: 2010-03-09
Msg-id: ca24673e1003090813tc0126e8v5be0866c67827199@mail.gmail.com
In response to: Re: Extracting superlatives - SQL design philosophy (Merlin Moncure <mmoncure@gmail.com>)
List: pgsql-performance
Cool trick .... I didn't realise you could do this at the SQL level without a custom max() written in C.
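
FWIW, the one single-pass alternative I did know about is the Postgres-specific DISTINCT ON extension, but it's non-portable, which ruled it out for me. If I have the syntax right, against the bar example quoted below it would be:

select distinct on (city) city, temp, date
from bar
order by city, date desc;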


What I ended up doing for my app is going with straight SQL that generates the "key" tuples with a SELECT DISTINCT, and then uses a dependent subquery that does a very small index scan to pull the data for each row (I care somewhat about portability). To make this perform, I created a second index on the raw data table with the columns in the order I need for this rollup, which allows PG to do a fairly efficient index range scan.
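
Schematically it looks something like this (the column names here are invented for the example, the real schema is wider):

-- the 2nd index, with the columns in the order the rollup needs
create index raw_data_rollup_idx on raw_data (host, metric, ts);

-- key tuples via SELECT DISTINCT, then a small index scan per key
select k.host, k.metric,
       (select max(r.value)
          from raw_data r
         where r.host = k.host
           and r.metric = k.metric) as max_value
  from (select distinct host, metric from raw_data) k;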

I had been trying to avoid using the disk space to carry this 2nd index, since it is only needed for the bulk rollup. Then I realised I only have to keep it on the current day's partition, and can drop it once that partition's data has been aggregated (the insert overhead of the index isn't as much of a concern).
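
i.e. along these lines, with a made-up partition name:

-- build the rollup index on just today's partition
create index raw_data_20100309_rollup_idx
    on raw_data_20100309 (host, metric, ts);

-- ... run the bulk rollup here ...

-- reclaim the space once the partition is aggregated
drop index raw_data_20100309_rollup_idx;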

Alternatively, I could have lived without the index by sharding the raw data right down to the rollup intervals, which would mean that rollups are effectively a full table scan anyway, but I didn't want to do that because it would make real-time data extraction queries slower if they had to go across 10-20 tables.
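
That would have been one inheritance child per rollup interval, something like this with made-up names again:

create table raw_data_20100309_h00 (
    check (ts >= '2010-03-09 00:00' and ts < '2010-03-09 01:00')
) inherits (raw_data);

-- the rollup becomes a seq scan of one small shard, but a real-time
-- query spanning a few hours has to hit every matching child table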


Thanks everyone for the insights

Cheers
Dave

On Tue, Mar 9, 2010 at 6:46 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Feb 24, 2010 at 4:31 PM, Dave Crooke <dcrooke@gmail.com> wrote:
> This is a generic SQL issue and not PG specific, but I'd like to get
> an opinion from this list.
>
> Consider the following data:
>
> # \d bar
>                 Table "public.bar"
>  Column |            Type             | Modifiers
> --------+-----------------------------+-----------
>  city   | character varying(255)      |
>  temp   | integer                     |
>  date   | timestamp without time zone |
>
> # select * from bar order by city, date;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   75 | 2010-02-21 15:00:00
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   56 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (5 rows)
>
> If you want the highest recorded temperature for a city, that's easy
> to do, since the selection criteria works on the same column that we
> are extracting:
>
> # select city, max(temp) from bar group by city order by 1;
>   city    | max
> -----------+-----
>  Austin    |  75
>  Edinburgh |  42
>  New York  |  78
> (3 rows)
>
>
> However there is (AFAIK) no simple way in plain SQL to write a query
> that performs such an aggregation where the aggregation criteria is on
> one column and you want to return another, e.g. adding the *date
> of* that highest temperature to the output above, or doing a query to
> get the most recent temperature reading for each city.
>
> What I'd like to do is something like the below (and I'm inventing
> mock syntax here, the following is not valid SQL):
>
> -- Ugly implicit syntax but no worse than an Oracle outer join ;-)
> select city, temp, date from bar where date=max(date) group by city,
> temp order by city;
>
> or perhaps
>
> -- More explicit
> select aggregate_using(max(date), city, temp, date) from bar group by
> city, temp order by city;
>
> Both of the above, if they existed, would be a single data access
> followed by a sort-merge.
>
> The only way I know how to do it involves doing two accesses to the data, e.g.
>
> # select city, temp, date from bar a where date=(select max(b.date)
> from bar b where a.city=b.city) order by 1;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (3 rows)
>
>
> # explain select * from bar a where date=(select max(b.date) from bar
> b where a.city=b.city) order by 1;
>                                QUERY PLAN
> --------------------------------------------------------------------------
>  Sort  (cost=1658.86..1658.87 rows=1 width=528)
>   Sort Key: a.city
>   ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
>         Filter: (date = (subplan))
>         SubPlan
>           ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
>                 ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
> width=8)     -- would be an index lookup in a real scenario
>                       Filter: (($0)::text = (city)::text)

Another cool way to do this is via a custom aggregate:
create function maxdata(data, data) returns data as
$$
 -- keep whichever row has the later date
 select case when ($1).date > ($2).date then $1 else $2 end;
$$ language sql;

create aggregate maxdata(data)
(
  sfunc=maxdata,
  stype=data
);

select (d).* from
(
 select maxdata(data) as d from data group by city
) q;  -- the FROM subquery needs an alias in Postgres

It does it in a single pass.  Where this approach can pay dividends is
when you have a very complicated 'max'-ing criterion to justify the
verbosity of creating the aggregate.  If you are not doing the whole
table, the self join is often faster.  I'm surprised custom aggregates
aren't used more...they seem very clean and neat to me.
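
For instance, against the bar table upthread it would look something
like this (untested, names made up):

create function later_reading(bar, bar) returns bar as
$$
 select case when ($1).date > ($2).date then $1 else $2 end;
$$ language sql;

create aggregate latest(bar)
(
  sfunc=later_reading,
  stype=bar
);

select (d).city, (d).temp, (d).date
from (select latest(bar) as d from bar group by city) q;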

merlin
