Re: Extracting superlatives - SQL design philosophy

From: Pierre C
Subject: Re: Extracting superlatives - SQL design philosophy
Date: ,
Msg-id: op.u8octdbxeorkce@localhost
(view: Whole thread, Raw)
In response to: Extracting superlatives - SQL design philosophy  (Dave Crooke)
List: pgsql-performance

Tree view

Extracting superlatives - SQL design philosophy  (Dave Crooke, )
 Re: Extracting superlatives - SQL design philosophy  (Mose, )
  Re: Extracting superlatives - SQL design philosophy  ("Julien Theulier", )
   Re: Extracting superlatives - SQL design philosophy  (Tom Lane, )
 Re: Extracting superlatives - SQL design philosophy  ("Garrett Murphy", )
  Re: Extracting superlatives - SQL design philosophy  (Dave Crooke, )
   Re: Extracting superlatives - SQL design philosophy  (Richard Huxton, )
    Re: Extracting superlatives - SQL design philosophy  (Dave Crooke, )
     Re: Extracting superlatives - SQL design philosophy  (Richard Huxton, )
 Re: Extracting superlatives - SQL design philosophy  ("George Sexton", )
  Re: Extracting superlatives - SQL design philosophy  ("George Sexton", )
 Re: Extracting superlatives - SQL design philosophy  (Craig James, )
 Re: Extracting superlatives - SQL design philosophy  ("Pierre C", )
 Re: Extracting superlatives - SQL design philosophy  (Merlin Moncure, )
  Re: Extracting superlatives - SQL design philosophy  (Dave Crooke, )

> -- More explicit
> select aggregate_using(max(date), city, temp, date) from bar group by
> city, temp order by city;

select city, max(ROW(temp, date)) from bar group by city;

Does not work (alas) for lack of a default comparison for record type.

Another solution, which works wonders if you've got the list of cities in
a separate table, and an index on (city, temp) is this :

SELECT c.city, (SELECT ROW( t.date, t.temp ) FROM cities_temp t WHERE
t.city=c.city ORDER BY temp DESC LIMIT 1) FROM cities;

This will do a nested loop index scan and it is the fastest way, except if
you have very few rows per city.
The syntax is ugly and you have to extract the stuff from the ROW()
afterwards, though.

Unfortunately, this does not work :

SELECT c.city, (SELECT t.date, t.temp FROM cities_temp t WHERE
t.city=c.city ORDER BY temp DESC LIMIT 1) AS m FROM cities;

because the subselect isn't allowed to return more than 1 column.

Note that you can also get the usually annoying top-N by category to use
the index by doing something like :

SELECT c.city, (SELECT array_agg(date) FROM (SELECT t.date FROM
cities_temp t WHERE t.city=c.city ORDER BY temp DESC LIMIT 5)) AS m FROM
cities;

The results aren't in a very usable form either, but :

CREATE INDEX ti ON annonces( type_id, price ) WHERE price IS NOT NULL;

EXPLAIN ANALYZE SELECT
t.id, (SELECT ROW(a.id, a.price, a.date_annonce)
    FROM annonces a
    WHERE a.type_id = t.id AND price IS NOT NULL
    ORDER BY price DESC LIMIT 1)
 FROM types_bien t;
                                                                   QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
  Seq Scan on types_bien t  (cost=0.00..196.09 rows=57 width=4) (actual
time=0.025..0.511 rows=57 loops=1)
    SubPlan 1
      ->  Limit  (cost=0.00..3.41 rows=1 width=16) (actual
time=0.008..0.008 rows=1 loops=57)
            ->  Index Scan Backward using ti on annonces a
(cost=0.00..8845.65 rows=2592 width=16) (actual time=0.007..0.007 rows=1
loops=57)
                  Index Cond: (type_id = $0)
  Total runtime: 0.551 ms

explain analyze
select distinct type_id, first_value(price) over w as max_price
 from annonces where price is not null
window w as (partition by type_id order by price desc);
                                                              QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=30515.41..30626.87 rows=11146 width=10) (actual
time=320.927..320.971 rows=46 loops=1)
    ->  WindowAgg  (cost=27729.14..29958.16 rows=111451 width=10) (actual
time=195.289..282.150 rows=111289 loops=1)
          ->  Sort  (cost=27729.14..28007.76 rows=111451 width=10) (actual
time=195.278..210.762 rows=111289 loops=1)
                Sort Key: type_id, price
                Sort Method:  quicksort  Memory: 8289kB
                ->  Seq Scan on annonces  (cost=0.00..18386.17 rows=111451
width=10) (actual time=0.009..72.589 rows=111289 loops=1)
                      Filter: (price IS NOT NULL)
  Total runtime: 322.382 ms

Here using the index is 600x faster... worth a bit of ugly SQL, you decide.

By disabling seq_scan and bitmapscan, you can corecr this plan :

EXPLAIN ANALYZE SELECT DISTINCT ON (type_id) type_id, date_annonce, price
 FROM annonces WHERE price IS NOT NULL ORDER BY type_id, price LIMIT 40;
                                                                 QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..78757.61 rows=33 width=14) (actual time=0.021..145.509
rows=40 loops=1)
    ->  Unique  (cost=0.00..78757.61 rows=33 width=14) (actual
time=0.021..145.498 rows=40 loops=1)
          ->  Index Scan using ti on annonces  (cost=0.00..78478.99
rows=111451 width=14) (actual time=0.018..132.671 rows=110796 loops=1)
  Total runtime: 145.549 ms

This plan would be very bad (unless the whole table is in RAM) because I
guess the index scan isn't aware of the DISTINCT ON, so it scans all rows
in the index and in the table.









pgsql-performance by date:

From: "Julien Theulier"
Date:
Subject: Re: Extracting superlatives - SQL design philosophy
From: Tom Lane
Date:
Subject: Re: Extracting superlatives - SQL design philosophy