Thread: Select distinct sorting all rows 8.0rc1
System is 4-way Opteron 844, 16 GB memory, SCSI. This is a trial run for converting an Oracle DB, so system is not optimized. I have shared_buffers = 50000 work_mem = 65536 A table has about 65 million rows (data collection system.) It has a primary key, no other indexes, no OIDs. The primary key contains 5 columns, of which service_id is the first (i.e., higher order.) I've run a simple ANALYZE on this table. Trying to find the unique service_id values, I did the following. If I'm reading this right, the sequential scan is passing all 65 million retrieved values onto the sort, which understandably takes a long time. This query won't be run very often, if ever, in production, so I don't want to add additional indexes to support it. Our known queries run very fast, about 1.2 seconds the first time through, and 20 msecs on repeats. Can I run ANALYZE in a different way so that queries like this can be completed in a shorter amount of time? I'm trying to anticipate ad-hoc queries the user community might come up with. Since there are such a small number or result rows compared to the number of total rows, perhaps using a hash table to record unique values would avoid passing all 65 million rows to the sort. Thanks for all suggestions. estat=> explain analyze select distinct(service_id) from five_min_stats_200408; QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ ---- Unique (cost=13578354.70..13894902.76 rows=726 width=12) (actual time=1227906.271..1282110.055 rows=879 loops=1) -> Sort (cost=13578354.70..13736628.73 rows=63309612 width=12) (actual time=1227906.266..1255961.318 rows=63359396 loops=1) Sort Key: service_id -> Seq Scan on five_min_stats_200408 (cost=0.00..1668170.12 rows=63309612 width=12) (actual time=0.061..80398.222 rows=63359396 loops=1) Total runtime: 1284212.556 ms (5 rows) Time: 1284213.359 ms -- Guy Rouillier
The planner is smarter with GROUP BY than with DISTINCT, so you can rewrite your query as the following, whihc will probaly use a HashAggregate, and be a lot faster : SELECT service_id FROM five_min_stats_200408 GROUP BY service_id; This won't avoid the Seq Scan however. If you know you have few different service_ids, you can code a plpgsql function which does this : SELECT INTO value service_id FROM five_min_stats_200408 ORDER BY service_id ASC LIMIT 1; WHILE FOUND RETURN NEXT value SELECT INTO value service_id FROM five_min_stats_200408 WHERE service_is>value ORDER BY service_id ASC LIMIT 1; END WHILE Basically it skips from one value to the next using your index, and returns them as they come. You'll get one indexed SELECT by distinct value. If you have, say 100 distinct values in 1M rows it'll be many orders of magniude faster. > estat=> explain analyze select distinct(service_id) from > five_min_stats_200408; > QUERY PLAN > ------------------------------------------------------------------------ > ------------------------------------------------------------------------ > ---- > Unique (cost=13578354.70..13894902.76 rows=726 width=12) (actual > time=1227906.271..1282110.055 rows=879 loops=1) > -> Sort (cost=13578354.70..13736628.73 rows=63309612 width=12) > (actual time=1227906.266..1255961.318 rows=63359396 loops=1) > Sort Key: service_id > -> Seq Scan on five_min_stats_200408 (cost=0.00..1668170.12 > rows=63309612 width=12) (actual time=0.061..80398.222 rows=63359396 > loops=1) > Total runtime: 1284212.556 ms > (5 rows) > > Time: 1284213.359 ms >
Pierre-Frédéric Caillaud wrote: > The planner is smarter with GROUP BY than with DISTINCT, so you can > rewrite your query as the following, whihc will probaly use a > HashAggregate, and be a lot faster : > > SELECT service_id FROM five_min_stats_200408 GROUP BY service_id; Pierre-Frédéric, thanks! Vast improvement - this is definitely acceptable. It did indeed use a seq scan followed by a hashaggregate. Total runtime: 76295.775 ms >> estat=> explain analyze select distinct(service_id) from >> five_min_stats_200408; QUERY PLAN >> ------------------------------------------------------------------------ >> ------------------------------------------------------------------------ >> >> Time: 1284213.359 ms -- Guy Rouillier