Thread: Select distinct sorting all rows 8.0rc1

Select distinct sorting all rows 8.0rc1

From
"Guy Rouillier"
Date:
System is 4-way Opteron 844, 16 GB memory, SCSI.  This is a trial run
for converting an Oracle DB, so system is not optimized.  I have

shared_buffers = 50000
work_mem = 65536

A table has about 65 million rows (data collection system.)  It has a
primary key, no other indexes, no OIDs.  The primary key contains 5
columns, of which service_id is the first (i.e., higher order.)  I've
run a simple ANALYZE on this table.  Trying to find the unique
service_id values, I did the following.  If I'm reading this right, the
sequential scan is passing all 65 million retrieved values onto the
sort, which understandably takes a long time.

This query won't be run very often, if ever, in production, so I don't
want to add additional indexes to support it.  Our known queries run
very fast, about 1.2 seconds the first time through, and 20 msecs on
repeats.  Can I run ANALYZE in a different way so that queries like this
can be completed in a shorter amount of time?  I'm trying to anticipate
ad-hoc queries the user community might come up with.

Since there are such a small number or result rows compared to the
number of total rows, perhaps using a hash table to record unique values
would avoid passing all 65 million rows to the sort.  Thanks for all
suggestions.

estat=> explain analyze select distinct(service_id) from
five_min_stats_200408;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
----
 Unique  (cost=13578354.70..13894902.76 rows=726 width=12) (actual
time=1227906.271..1282110.055 rows=879 loops=1)
   ->  Sort  (cost=13578354.70..13736628.73 rows=63309612 width=12)
(actual time=1227906.266..1255961.318 rows=63359396 loops=1)
         Sort Key: service_id
         ->  Seq Scan on five_min_stats_200408  (cost=0.00..1668170.12
rows=63309612 width=12) (actual time=0.061..80398.222 rows=63359396
loops=1)
 Total runtime: 1284212.556 ms
(5 rows)

Time: 1284213.359 ms

--
Guy Rouillier


Re: Select distinct sorting all rows 8.0rc1

From
Pierre-Frédéric Caillaud
Date:
    The planner is smarter with GROUP BY than with DISTINCT, so you can
rewrite your query as the following, whihc will probaly use a
HashAggregate, and be a lot faster :

    SELECT service_id FROM five_min_stats_200408 GROUP BY service_id;

    This won't avoid the Seq Scan however.

    If you know you have few different service_ids, you can code a plpgsql
function which does this :

SELECT INTO value service_id FROM five_min_stats_200408 ORDER BY
service_id ASC LIMIT 1;
WHILE FOUND
    RETURN NEXT value
    SELECT INTO value service_id FROM five_min_stats_200408 WHERE
service_is>value ORDER BY service_id ASC LIMIT 1;
END WHILE

    Basically it skips from one value to the next using your index, and
returns them as they come. You'll get one indexed SELECT by distinct
value. If you have, say 100 distinct values in 1M rows it'll be many
orders of magniude faster.



> estat=> explain analyze select distinct(service_id) from
> five_min_stats_200408;
> QUERY PLAN
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> ----
>  Unique  (cost=13578354.70..13894902.76 rows=726 width=12) (actual
> time=1227906.271..1282110.055 rows=879 loops=1)
>    ->  Sort  (cost=13578354.70..13736628.73 rows=63309612 width=12)
> (actual time=1227906.266..1255961.318 rows=63359396 loops=1)
>          Sort Key: service_id
>          ->  Seq Scan on five_min_stats_200408  (cost=0.00..1668170.12
> rows=63309612 width=12) (actual time=0.061..80398.222 rows=63359396
> loops=1)
>  Total runtime: 1284212.556 ms
> (5 rows)
>
> Time: 1284213.359 ms
>



Re: Select distinct sorting all rows 8.0rc1

From
"Guy Rouillier"
Date:
Pierre-Frédéric Caillaud wrote:
>     The planner is smarter with GROUP BY than with DISTINCT, so you can
> rewrite your query as the following, whihc will probaly use a
> HashAggregate, and be a lot faster :
>
>     SELECT service_id FROM five_min_stats_200408 GROUP BY service_id;

Pierre-Frédéric, thanks!  Vast improvement - this is definitely acceptable.  It did indeed use a seq scan followed by a
hashaggregate. 

 Total runtime: 76295.775 ms


>> estat=> explain analyze select distinct(service_id) from
>> five_min_stats_200408; QUERY PLAN
>> ------------------------------------------------------------------------
>> ------------------------------------------------------------------------
>>
>> Time: 1284213.359 ms



--
Guy Rouillier