Re: select distinct and index usage - Mailing list pgsql-general

From David Wilson
Subject Re: select distinct and index usage
Date
Msg-id e7f9235d0804071127n5cd08096qec8653c905550f1e@mail.gmail.com
Whole thread Raw
In response to Re: select distinct and index usage  (Alban Hertroys <dalroi@solfertje.student.utwente.nl>)
Responses Re: select distinct and index usage  (Gregory Stark <stark@enterprisedb.com>)
Re: select distinct and index usage  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On Mon, Apr 7, 2008 at 1:42 PM, Alban Hertroys
<dalroi@solfertje.student.utwente.nl> wrote:
>
>  Have you tried this query with enable_seqscan=off? If my guess is right
> (and the planners, in that case) it'd be even slower.

set enable_seqscan=off;
explain select distinct datestamp from vals;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Unique  (cost=115003047.47..115380004.83 rows=4263 width=4)
   ->  Sort  (cost=115003047.47..115191526.15 rows=75391472 width=4)
         Sort Key: datestamp
         ->  Seq Scan on vals  (cost=100000000.00..101531261.72
rows=75391472 width=4)

It appears to be doing a sequential scan regardless of the set, as if
it doesn't believe it can use the index for some reason
>
>  Something that might help you, but I'm not sure whether it might hurt the
> performance of other queries, is to cluster that table on val_datestamp_idx.
> That way the records are already (mostly) sorted on disk in the order of the
> datestamps, which seems to be the brunt of above query plan.

That's a good thought. I'll give that a try this evening when the DB
has some downtime and see what happens.

>
>  There seems to be quite a bit of overlap in your index definitions. From my
> experience this can confuse the planner.
>
>  I suggest you combine them, but not knowing your data... Maybe rewriting
> your UNIQUE constraint to (val_dur, datestamp, eid, sid) would be enough to
> replace all those other indexes.
>  If not, it's probably better to have one index per column, so that the
> planner is free to combine them as it sees fit. That'd result in a bitmap
> index scan, btw.

I can take a look at the other indices again, but those are all in
place for specific other queries that generally involve some set of
a=1, b=2, c=3, datestamp>5 type of where-clause and were created
specifically in response to sequential scans showing up in other
queries (and had the proper effect of fixing them!)
>
>  I'm not a postgresql tuning expert (I know my way around though), other
> people can explain you way better than I can. Bruce Momjian for example:
> http://www.linuxjournal.com/article/4791

I'll take a look at that, thanks.

>  That calculation doesn't look familiar to me, I'm more used to:
>   select pg_size_pretty(pg_relation_size('...'));
>
>  You can put the name of any relation in there, be it tables, indexes, etc.
>
>  11GB is pretty large for an index on an integer column, especially with
> only 75M rows: that's 146 bytes/row in your index. Maybe your index got
> bloated somehow? I think it should be about a tenth of that.

pg_total_relation_size('..') gives the number of bytes for the table +
all associated indices; pg_relation_size('..') gives for just the
table. The difference between the two should be total bytes take up by
the 5 total indices (11 total index cols), giving a
back-of-the-envelope estimation of 1gb for the size of the datestamp
index. I am fairly certain that I didn't give pg 1gb to fit the index
in memory, so I'll try upping its total available memory tonight and
see if that doesn't improve things.

I appreciate the responses so far! I'm used to several minutes for
some of the complex queries on this DB, but 12.5 minutes for a select
distinct just seems wrong. :)

--
- David T. Wilson
Princeton Satellite Systems
david.t.wilson@gmail.com

pgsql-general by date:

Previous
From: Terry Lee Tucker
Date:
Subject: Subtracting Two Intervals
Next
From: Bruce Momjian
Date:
Subject: Re: Subtracting Two Intervals