Select distinct sorting all rows 8.0rc1 - Mailing list pgsql-general

From Guy Rouillier
Subject Select distinct sorting all rows 8.0rc1
Date
Msg-id CC1CF380F4D70844B01D45982E671B2348E46A@mtxexch01.add0.masergy.com
Whole thread Raw
Responses Re: Select distinct sorting all rows 8.0rc1  (Pierre-Frédéric Caillaud<lists@boutiquenumerique.com>)
List pgsql-general
System is 4-way Opteron 844, 16 GB memory, SCSI.  This is a trial run
for converting an Oracle DB, so system is not optimized.  I have

shared_buffers = 50000
work_mem = 65536

A table has about 65 million rows (data collection system.)  It has a
primary key, no other indexes, no OIDs.  The primary key contains 5
columns, of which service_id is the first (i.e., higher order.)  I've
run a simple ANALYZE on this table.  Trying to find the unique
service_id values, I did the following.  If I'm reading this right, the
sequential scan is passing all 65 million retrieved values onto the
sort, which understandably takes a long time.

This query won't be run very often, if ever, in production, so I don't
want to add additional indexes to support it.  Our known queries run
very fast, about 1.2 seconds the first time through, and 20 msecs on
repeats.  Can I run ANALYZE in a different way so that queries like this
can be completed in a shorter amount of time?  I'm trying to anticipate
ad-hoc queries the user community might come up with.

Since there are such a small number or result rows compared to the
number of total rows, perhaps using a hash table to record unique values
would avoid passing all 65 million rows to the sort.  Thanks for all
suggestions.

estat=> explain analyze select distinct(service_id) from
five_min_stats_200408;

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
----
 Unique  (cost=13578354.70..13894902.76 rows=726 width=12) (actual
time=1227906.271..1282110.055 rows=879 loops=1)
   ->  Sort  (cost=13578354.70..13736628.73 rows=63309612 width=12)
(actual time=1227906.266..1255961.318 rows=63359396 loops=1)
         Sort Key: service_id
         ->  Seq Scan on five_min_stats_200408  (cost=0.00..1668170.12
rows=63309612 width=12) (actual time=0.061..80398.222 rows=63359396
loops=1)
 Total runtime: 1284212.556 ms
(5 rows)

Time: 1284213.359 ms

--
Guy Rouillier


pgsql-general by date:

Previous
From: Stephan Szabo
Date:
Subject: Re: Trigger problem
Next
From: Mike Cox
Date:
Subject: Preview of Fourth PostgreSQL RFD.