Re: select distinct and index usage - Mailing list pgsql-general

From David Wilson
Subject Re: select distinct and index usage
Date
Msg-id e7f9235d0804071754q52ccbe07h1c18df336c933564@mail.gmail.com
Whole thread Raw
In response to Re: select distinct and index usage  (Gregory Stark <stark@enterprisedb.com>)
Responses Re: select distinct and index usage  (Alvaro Herrera <alvherre@commandprompt.com>)
Re: select distinct and index usage  ("Stephen Denne" <Stephen.Denne@datamail.co.nz>)
List pgsql-general
On Mon, Apr 7, 2008 at 7:57 PM, Gregory Stark <stark@enterprisedb.com> wrote:
>
>  You could try changing it to the equivalent GROUP BY query. The planner,
>  unfortunately, doesn't know they're equivalent and has two separate sets of
>  plans available. In this case where there are only 4,000 distinct values out
>  of 75M original records you might find a HashAggregate plan, which the planner
>  doesn't know can be used for DISTINCT, best. You might have to raise work_mem
>  before the planner feels a hash will fit.
>
>  --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>   Ask me about EnterpriseDB's RemoteDBA services!
>

Progress!

explain analyze select datestamp from vals group by datestamp;
                                                             QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1719740.40..1719783.03 rows=4263 width=4)
(actual time=120192.018..120193.930 rows=4252 loops=1)
   ->  Seq Scan on vals  (cost=0.00..1531261.72 rows=75391472 width=4)
(actual time=17.441..66807.429 rows=75391476 loops=1)
 Total runtime: 120195.144 ms


Compared with:

explain analyze select distinct datestamp from vals;
                                                              QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=15003047.47..15380004.83 rows=4263 width=4) (actual
time=649599.159..721671.595 rows=4252 loops=1)
  ->  Sort  (cost=15003047.47..15191526.15 rows=75391472 width=4)
(actual time=649599.157..694392.602 rows=75391476 loops=1)
        Sort Key: datestamp
        Sort Method:  external merge  Disk: 1178592kB
        ->  Seq Scan on vals  (cost=0.00..1531261.72 rows=75391472
width=4) (actual time=9.104..93130.468 rows=75391476 loops=1)
 Total runtime: 722379.434 ms

Still doing the sequential scan on the table, but at least it's
avoiding the expensive disk merge sort. It still seems as if I ought
to be able to coax it into using an index for this type of query,
though- especially since it's using one on the other table. Is there
perhaps some way to reformulate the index in such a way as to make it
more useful to the planner?

--
- David T. Wilson
Princeton Satellite Systems
david.t.wilson@gmail.com

pgsql-general by date:

Previous
From: Mike Ginsburg
Date:
Subject: Most Occurring Value
Next
From: Alvaro Herrera
Date:
Subject: Re: select distinct and index usage