Sort and Limit - really nasty query and feature of the day - Mailing list pgsql-general
From | Listmail |
---|---|
Subject | Sort and Limit - really nasty query and feature of the day |
Date | |
Msg-id | op.tqb0s3sbzcizji@apollo13 Whole thread Raw |
In response to | Re: Storing blobs in PG DB ("Merlin Moncure" <mmoncure@gmail.com>) |
List | pgsql-general |
Today I rewrote a particularly nasty query involving a UNION ALL between an active table and a huge archive table, some left joins, order by and limit, and it went from 5 minutes to under one second ; however one query became 4 with some glue in between. EXPLAIN SELECT * FROM ( SELECT 0 AS archived, id, price, surface, coords, detect_time, type_id, vente, zipcode, city_id, description FROM annonces UNION ALL SELECT 1 AS archived, a.id, price, surface, coords, detect_time, type_id, vente, zipcode, city_id, description FROM archive_data a LEFT JOIN archive_ext d ON a.id=d.id ) AS foo WHERE detect_time >= '2006-10-30 16:17:45.064793' AND type_id IN (1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6) AND vente AND (zipcode IN (69001,69002,69003,69004,69005,69006,69007,69008,69009) OR city_id IN (27595) OR coords && '(45.74101689082,4.8371263505564),(45.75898310918,4.8628736494436)'::BOX) AND surface IS NOT NULL AND price IS NOT NULL ORDER BY price/surface LIMIT 100; Here is the messy explain : Limit (cost=333560.35..333560.60 rows=100 width=103) -> Sort (cost=333560.35..333656.88 rows=38610 width=103) Sort Key: (foo.price / (foo.surface)::double precision) -> Result (cost=133.21..328438.41 rows=38610 width=103) -> Append (cost=133.21..328245.36 rows=38610 width=103) -> Bitmap Heap Scan on annonces (cost=133.21..7520.56 rows=1426 width=190) Recheck Cond: ((vente AND (zipcode = ANY ('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[]))) OR (vente AND (city_id = 27595)) OR (coords && '(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box)) Filter: ((detect_time >= '2006-10-30 16:17:45.064793'::timestamp without time zone) AND (type_id = ANY ('{1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6}'::integer[])) AND vente AND (surface IS NOT NULL) AND (price IS NOT NULL)) -> BitmapOr (cost=133.21..133.21 rows=4294 width=0) -> Bitmap Index Scan on annonces_zip (cost=0.00..55.91 rows=1761 width=0) Index Cond: ((vente = true) AND (zipcode = ANY ('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[]))) -> Bitmap Index Scan on annonces_city (cost=0.00..42.85 rows=1859 width=0) Index Cond: ((vente = true) AND (city_id = 27595)) -> Bitmap Index Scan on annonces_coords (cost=0.00..33.37 rows=675 width=0) Index Cond: (coords && '(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box) -> Merge Right Join (cost=59679.03..320338.70 rows=37184 width=182) Merge Cond: (d.id = a.id) -> Index Scan using archive_ext_pkey on archive_ext d (cost=0.00..252661.12 rows=2976314 width=119) -> Sort (cost=59679.03..59771.99 rows=37184 width=67) Sort Key: a.id -> Bitmap Heap Scan on archive_data a (cost=3951.02..56856.32 rows=37184 width=67) Recheck Cond: ((vente AND (zipcode = ANY ('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[]))) OR (vente AND (city_id = 27595)) OR (coords && '(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box)) Filter: ((detect_time >= '2006-10-30 16:17:45.064793'::timestamp without time zone) AND (type_id = ANY ('{1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6}'::integer[])) AND vente AND (surface IS NOT NULL) AND (price IS NOT NULL)) -> BitmapOr (cost=3951.02..3951.02 rows=171699 width=0) -> Bitmap Index Scan on archive_data_zip (cost=0.00..1692.62 rows=80610 width=0) Index Cond: ((vente = true) AND (zipcode = ANY ('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[]))) -> Bitmap Index Scan on archive_data_city (cost=0.00..1695.31 rows=80683 width=0) Index Cond: ((vente = true) AND (city_id = 27595)) -> Bitmap Index Scan on archive_data_coords (cost=0.00..535.20 rows=10406 width=0) Index Cond: (coords && '(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box) I didn't redo the explain analyze, it takes too long ; however the stats and count estimates are pretty good, and it takes a good 5 minutes. However, the interesting parts of the query are very fast. Let's disassemble it : EXPLAIN ANALYZE SELECT 0 AS archived, id, price, surface, coords, detect_time, type_id, vente, zipcode, city_id, description FROM annonces WHERE detect_time >= '2006-10-30 16:17:45.064793' AND type_id IN (1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6) AND vente AND (zipcode IN (69001,69002,69003,69004,69005,69006,69007,69008,69009) OR city_id IN (27595) OR coords && '(45.74101689082,4.8371263505564),(45.75898310918,4.8628736494436)'::BOX) AND surface IS NOT NULL AND price IS NOT NULL ORDER BY price/surface LIMIT 100; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=7602.40..7602.65 rows=100 width=190) (actual time=19.102..19.163 rows=100 loops=1) -> Sort (cost=7602.40..7605.96 rows=1426 width=190) (actual time=19.100..19.123 rows=100 loops=1) Sort Key: (price / (surface)::double precision) -> Bitmap Heap Scan on annonces (cost=133.21..7527.69 rows=1426 width=190) (actual time=4.255..16.725 rows=910 loops=1) Recheck Cond: ((vente AND (zipcode = ANY ('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[]))) OR (vente AND (city_id = 27595)) OR (coords && '(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box)) Filter: ((detect_time >= '2006-10-30 16:17:45.064793'::timestamp without time zone) AND (type_id = ANY ('{1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6}'::integer[])) AND vente AND (surface IS NOT NULL) AND (price IS NOT NULL)) -> BitmapOr (cost=133.21..133.21 rows=4294 width=0) (actual time=2.662..2.662 rows=0 loops=1) -> Bitmap Index Scan on annonces_zip (cost=0.00..55.91 rows=1761 width=0) (actual time=0.518..0.518 rows=1367 loops=1) Index Cond: ((vente = true) AND (zipcode = ANY ('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[]))) -> Bitmap Index Scan on annonces_city (cost=0.00..42.85 rows=1859 width=0) (actual time=0.364..0.364 rows=1316 loops=1) Index Cond: ((vente = true) AND (city_id = 27595)) -> Bitmap Index Scan on annonces_coords (cost=0.00..33.37 rows=675 width=0) (actual time=1.776..1.776 rows=2449 loops=1) Index Cond: (coords && '(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box) Total runtime: 19.327 ms Bitmap saves the day. Now for the other part : EXPLAIN ANALYZE SELECT 1 AS archived, id, price, surface, coords, detect_time, type_id, vente, zipcode, city_id FROM archive_data WHERE detect_time >= '2006-10-30 16:17:45.064793' AND type_id IN (1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6) AND vente AND (zipcode IN (69001,69002,69003,69004,69005,69006,69007,69008,69009) OR city_id IN (27595) OR coords && '(45.74101689082,4.8371263505564),(45.75898310918,4.8628736494436)'::BOX) AND surface IS NOT NULL AND price IS NOT NULL ORDER BY price/surface LIMIT 100; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=59864.95..59865.20 rows=100 width=67) (actual time=490.718..490.793 rows=100 loops=1) -> Sort (cost=59864.95..59957.91 rows=37184 width=67) (actual time=490.716..490.765 rows=100 loops=1) Sort Key: (price / (surface)::double precision) -> Bitmap Heap Scan on archive_data (cost=3951.02..57042.24 rows=37184 width=67) (actual time=223.720..344.485 rows=27761 loops=1) Recheck Cond: ((vente AND (zipcode = ANY ('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[]))) OR (vente AND (city_id = 27595)) OR (coords && '(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box)) Filter: ((detect_time >= '2006-10-30 16:17:45.064793'::timestamp without time zone) AND (type_id = ANY ('{1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6}'::integer[])) AND vente AND (surface IS NOT NULL) AND (price IS NOT NULL)) -> BitmapOr (cost=3951.02..3951.02 rows=171699 width=0) (actual time=144.175..144.175 rows=0 loops=1) -> Bitmap Index Scan on archive_data_zip (cost=0.00..1692.62 rows=80610 width=0) (actual time=38.715..38.715 rows=86909 loops=1) Index Cond: ((vente = true) AND (zipcode = ANY ('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[]))) -> Bitmap Index Scan on archive_data_city (cost=0.00..1695.31 rows=80683 width=0) (actual time=26.576..26.576 rows=85868 loops=1) Index Cond: ((vente = true) AND (city_id = 27595)) -> Bitmap Index Scan on archive_data_coords (cost=0.00..535.20 rows=10406 width=0) (actual time=78.880..78.880 rows=117333 loops=1) Index Cond: (coords && '(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box) Total runtime: 492.530 ms So, taken individually, postgres is exceedingly good on the "hard" parts of this query (ie. finding what I want). Problem is that, on such a query, it would really pay to : - Move the ORDER BY and LIMIT inside the UNION ALL Postgres already moves the WHERE conditions. Obviously, if both sides of the UNION have the same ORDER BY and LIMIT, moving them inside would work well. This only works when LIMIT is present, of course. - Continue moving the ORDER BY and LIMIT inside the LEFT JOIN so it can be performed before the MERGE JOIN This allow merging 100 rows instead of 27761, which could even be done with some other join type like Nested Loop. This also only works with LIMIT. - and re-sort the final query result since it's an UNION. I don't think it would change the result since it is a left join, all the rows on the left part are in the end result anyway. Is it planned to add this to postgres sometimes ? In the end I used the two fast SELECTs to get the ids of the items to display and had PHP shove them back into the original view which has more stuff going on inside, I was too lazy to redo it and my search engine. The end result is about 300 times faster...
pgsql-general by date: