Thread: Seq Scans when index expected to be used
pgSql 7.4.0 Hi, Am trying to find duplicate values in a large table (about 80M records). Somehow, none of the two (2) queries (see below) is using the index "I_bca" that, I believe, should've been used. Any ideas? Thanks ------------------------------------------------------------------ CREATE TABLE te.test ( id te.didlong NOT NULL, a te.dtimestamp, b te.didint NOT NULL, c te.didint NOT NULL, d te.dstring, ) WITHOUT OIDS; -- attempt to create AK_bca failed due to duplicate values -- alter table te.test -- add constraint AK_bca unique (b, c, a); create index I_bca on te.test ( b, c, a ); alter table te.test add constraint PK_id primary key (id); analyze te.test; ------------------------------------------------------------------ -- first attempt to find duplicate values explain select b, c, a from test group by b, c, a having count(*) > 1 QUERY PLAN "GroupAggregate (cost=19644987.88..21026410.30 rows=78938424 width=16)" " Filter: (count(*) > 1)" " -> Sort (cost=19644987.88..19842333.94 rows=78938424 width=16)" " Sort Key: b, c, a" " -> Seq Scan on test (cost=0.00..1589706.24 rows=78938424 width=16)" ------------------------------------------------------------------ -- second attempt to find duplicate values explain select DV1.b, DV1.c, DV1.a from test DV1, test DV2 where DV1.b = DV2.b and DV1.c = DV2.c and DV1.a = DV2.a and DV1.id <> DV2.id QUERY PLAN "Merge Join (cost=42373495.75..45309925.87 rows=95424260 width=16)" " Merge Cond: (("outer"."?column5?" = "inner"."?column5?") AND ("outer"."?column6?" = "inner"."?column6?") AND ("outer"."?column7?" = "inner"."?column7?"))" " Join Filter: (("outer".id)::bigint <> ("inner".id)::bigint)" " -> Sort (cost=21186747.88..21384093.94 rows=78938424 width=24)" " Sort Key: (dv1.a)::timestamp without time zone, (dv1.c)::integer, (dv1.b)::integer" " -> Seq Scan on test dv1 (cost=0.00..1589706.24 rows=78938424 width=24)" " -> Sort (cost=21186747.88..21384093.94 rows=78938424 width=24)" " Sort Key: (dv2.a)::timestamp without time zone, (dv2.c)::integer, (dv2.b)::integer" " -> Seq Scan on test dv2 (cost=0.00..1589706.24 rows=78938424 width=24)" __________________________________ Do you Yahoo!? Free Pop-Up Blocker - Get it now http://companion.yahoo.com/
On Sat, 29 Nov 2003 08:49:24 -0800 (PST), ow wrote > explain select b, c, a > from test > group by b, c, a > having count(*) > 1 I'm not sure about 7.4 aggregate improvements, but <=7.3 didn't work good with aggregates at all. Maybe it's not directly an answer to your question, but try theses queries: select t1.b,t1.c,t1.a from test t1 join test t2 using (b,c,a) where t2.id<>t1.id group by t1.b,t1.c,t1.a or select a,b,c from test t1 where exists (select * from test t2 where t2.a=t1.a and t2.b=t1.b and t2.c=t1.c and t1.id<>t2.id) group by a,b,c Regards, Tomasz Myrta
ow <oneway_111@yahoo.com> writes: > Am trying to find duplicate values in a large table (about 80M records). > Somehow, none of the two (2) queries (see below) is using the index "I_bca" > that, I believe, should've been used. Those plans are perfectly reasonable (particularly the GroupAggregate one). regards, tom lane
jasiek wrote: > On Sat, 29 Nov 2003 08:49:24 -0800 (PST), ow wrote > >>explain select b, c, a >>from test >>group by b, c, a >>having count(*) > 1 Why would you expect this to use an index scan when it needs to read the entire table? If you read the whole table (or even a significant fraction of it), a seq scan is faster. Joe
On Sat, 29 Nov 2003, ow wrote: > Am trying to find duplicate values in a large table (about 80M records). > Somehow, none of the two (2) queries (see below) is using the index "I_bca" > that, I believe, should've been used. As a side note, to help determine if it should be used, you probably want explain analyze not just explain. In general the index is only going to help you if the sort is very expensive, not for the initial scan (where the seq scan is generally faster).
--- Joe Conway <mail@joeconway.com> wrote: > > > >>explain select b, c, a > >>from test > >>group by b, c, a > >>having count(*) > 1 > > Why would you expect this to use an index scan when it needs to read the > entire table? If you read the whole table (or even a significant > fraction of it), a seq scan is faster. My impression was that the index "I_bca" covers the query, hence there should not be a need to go to the table itself. Why would it? P.S. explain analyze vs explain. Normally, would've used "explain analyze" but in this case it's taking way too long so I used "explain". Thanks __________________________________ Do you Yahoo!? Free Pop-Up Blocker - Get it now http://companion.yahoo.com/
ow wrote: > My impression was that the index "I_bca" covers the query, hence there should > not be a need to go to the table itself. Why would it? Postgres always has to go to the table. The ability to read data directly from indexes (ala other RDBMSs) has been discussed, but not implemented. IIRC it's a hard problem due to the way Postgres does MVCC. Check the archives. > explain analyze vs explain. Normally, would've used "explain analyze" but in > this case it's taking way too long so I used "explain". I can understand that, but most people will ask for explain analyze anyway ;-) Joe