Thread: Why isn't an index being used when selecting a distinct value?
Version: Postgres 8.1.4
Platform: RHEL
Given this scenario with the indexes in place, when I ask for the distinct field1_id values, why does the optimizer choose a sequential scan instead of just reading from the kda_log_fid_cre_20080123_idx index? The time it takes to perform the sequential scan against 20+ million records is way too slow.
CREATE TABLE kda_log_20080213 (
"field1" character varying(255),
field character varying(100),
value bigint,
period integer DEFAULT 60,
created timestamp with time zone DEFAULT ('now'::text)::timestamp(6) with time zone,
customer_id integer,
field1_id integer
);
CREATE INDEX kda_log_cid_cre_fld_20080213_idx ON kda_log_20080213 USING btree (customer_id, created, "field1");
CREATE INDEX kda_log_fid_cre_20080213_idx ON kda_log_20080213 USING btree (field1_id, created);
keaton=# explain select distinct field1_id into temp kda_temp from kda_log_20080213;
QUERY PLAN
----------------------------------------------------------------------------------------------
Unique (cost=5759201.93..5927827.87 rows=8545 width=4)
-> Sort (cost=5759201.93..5843514.90 rows=33725188 width=4)
Sort Key: field1_id
-> Seq Scan on kda_log_20080213 (cost=0.00..748067.88 rows=33725188 width=4)
(4 rows)
Thanks,
Keaton
------ End of Forwarded Message
Platform: RHEL
Given this scenario with the indexes in place, when I ask for the distinct field1_id values, why does the optimizer choose a sequential scan instead of just reading from the kda_log_fid_cre_20080123_idx index? The time it takes to perform the sequential scan against 20+ million records is way too slow.
CREATE TABLE kda_log_20080213 (
"field1" character varying(255),
field character varying(100),
value bigint,
period integer DEFAULT 60,
created timestamp with time zone DEFAULT ('now'::text)::timestamp(6) with time zone,
customer_id integer,
field1_id integer
);
CREATE INDEX kda_log_cid_cre_fld_20080213_idx ON kda_log_20080213 USING btree (customer_id, created, "field1");
CREATE INDEX kda_log_fid_cre_20080213_idx ON kda_log_20080213 USING btree (field1_id, created);
keaton=# explain select distinct field1_id into temp kda_temp from kda_log_20080213;
QUERY PLAN
----------------------------------------------------------------------------------------------
Unique (cost=5759201.93..5927827.87 rows=8545 width=4)
-> Sort (cost=5759201.93..5843514.90 rows=33725188 width=4)
Sort Key: field1_id
-> Seq Scan on kda_log_20080213 (cost=0.00..748067.88 rows=33725188 width=4)
(4 rows)
Thanks,
Keaton
------ End of Forwarded Message
"Keaton Adams" <kadams@mxlogic.com> writes: > Version: Postgres 8.1.4 > Platform: RHEL > > Given this scenario with the indexes in place, when I ask for the distinct > field1_id values, why does the optimizer choose a sequential scan instead of > just reading from the kda_log_fid_cre_20080123_idx index? The time it takes > to perform the sequential scan against 20+ million records is way too slow. Try (temporarily) doing: SET enable_seqscan = off; > keaton=# explain select distinct field1_id into temp kda_temp from > kda_log_20080213; If the database is right that will be even slower. Using a full index scan requires a lot of random access seeks, generally the larger the table the *more* likely a sequential scan and sort is a better approach than using an index. If it's wrong and it's faster then you have to consider whether it's only faster because you've read the table into cache already. Will it be in cache in production? If so then you migth try raising effective_cache_size or lowering random_page_cost. Another thing to try is using GROUP BY instead of DISTINCT. This is one case where the postgres optimizer doesn't handle the two equivalent cases in exactly the same way and there are some plans available in one method that aren't in the other. That's only likely to help if you have relative few values of field1_id but it's worth trying. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
The GROUP BY was the fastest method. Thanks for the suggestions, Keaton On 2/15/08 3:12 PM, "Gregory Stark" <stark@enterprisedb.com> wrote: > "Keaton Adams" <kadams@mxlogic.com> writes: > >> Version: Postgres 8.1.4 >> Platform: RHEL >> >> Given this scenario with the indexes in place, when I ask for the distinct >> field1_id values, why does the optimizer choose a sequential scan instead of >> just reading from the kda_log_fid_cre_20080123_idx index? The time it takes >> to perform the sequential scan against 20+ million records is way too slow. > > Try (temporarily) doing: > > SET enable_seqscan = off; > >> keaton=# explain select distinct field1_id into temp kda_temp from >> kda_log_20080213; > > If the database is right that will be even slower. Using a full index scan > requires a lot of random access seeks, generally the larger the table the > *more* likely a sequential scan and sort is a better approach than using an > index. > > If it's wrong and it's faster then you have to consider whether it's only > faster because you've read the table into cache already. Will it be in cache > in production? If so then you migth try raising effective_cache_size or > lowering random_page_cost. > > Another thing to try is using GROUP BY instead of DISTINCT. This is one case > where the postgres optimizer doesn't handle the two equivalent cases in > exactly the same way and there are some plans available in one method that > aren't in the other. That's only likely to help if you have relative few > values of field1_id but it's worth trying.