Thread: Seq Scans when index expected to be used

Seq Scans when index expected to be used

From
ow
Date:
pgSql 7.4.0

Hi,

Am trying to find duplicate values in a large table (about 80M records).
Somehow, none of the two (2) queries (see below) is using the index "I_bca"
that, I believe, should've been used.

Any ideas? Thanks

------------------------------------------------------------------

CREATE TABLE te.test
( id te.didlong NOT NULL, a te.dtimestamp, b te.didint NOT NULL, c te.didint NOT NULL, d te.dstring,
) WITHOUT OIDS;

-- attempt to create AK_bca failed due to duplicate values
-- alter table te.test
--   add constraint AK_bca unique (b, c, a);

create index I_bca on te.test (   b, c, a
);

alter table te.test  add constraint PK_id primary key (id);

analyze te.test;

------------------------------------------------------------------
--  first attempt to find duplicate values
explain select b, c, a
from test
group by b, c, a
having count(*) > 1

QUERY PLAN
"GroupAggregate  (cost=19644987.88..21026410.30 rows=78938424 width=16)"
"  Filter: (count(*) > 1)"
"  ->  Sort  (cost=19644987.88..19842333.94 rows=78938424 width=16)"
"        Sort Key: b, c, a"
"        ->  Seq Scan on test  (cost=0.00..1589706.24 rows=78938424 width=16)"


------------------------------------------------------------------
--  second attempt to find duplicate values
explain select DV1.b, DV1.c, DV1.a
from test DV1, test DV2
where DV1.b = DV2.b and DV1.c = DV2.c and DV1.a = DV2.a and DV1.id <> DV2.id

QUERY PLAN
"Merge Join  (cost=42373495.75..45309925.87 rows=95424260 width=16)"
"  Merge Cond: (("outer"."?column5?" = "inner"."?column5?") AND
("outer"."?column6?" = "inner"."?column6?") AND ("outer"."?column7?" =
"inner"."?column7?"))"
"  Join Filter: (("outer".id)::bigint <> ("inner".id)::bigint)"
"  ->  Sort  (cost=21186747.88..21384093.94 rows=78938424 width=24)"
"        Sort Key: (dv1.a)::timestamp without time zone, (dv1.c)::integer,
(dv1.b)::integer"
"        ->  Seq Scan on test dv1  (cost=0.00..1589706.24 rows=78938424
width=24)"
"  ->  Sort  (cost=21186747.88..21384093.94 rows=78938424 width=24)"
"        Sort Key: (dv2.a)::timestamp without time zone, (dv2.c)::integer,
(dv2.b)::integer"
"        ->  Seq Scan on test dv2  (cost=0.00..1589706.24 rows=78938424
width=24)"






__________________________________
Do you Yahoo!?
Free Pop-Up Blocker - Get it now
http://companion.yahoo.com/


Re: Seq Scans when index expected to be used

From
"jasiek"
Date:
On Sat, 29 Nov 2003 08:49:24 -0800 (PST), ow wrote
> explain select b, c, a
> from test
> group by b, c, a
> having count(*) > 1

I'm not sure about 7.4 aggregate improvements, but <=7.3 didn't work good 
with aggregates at all. Maybe it's not directly an answer to your question, 
but try theses queries:

select t1.b,t1.c,t1.a from test t1 join test t2 using (b,c,a)
where t2.id<>t1.id
group by t1.b,t1.c,t1.a

or

select a,b,c from test t1
where exists (select * from test t2 where t2.a=t1.a and t2.b=t1.b and 
t2.c=t1.c and t1.id<>t2.id)
group by a,b,c

Regards,
Tomasz Myrta


Re: Seq Scans when index expected to be used

From
Tom Lane
Date:
ow <oneway_111@yahoo.com> writes:
> Am trying to find duplicate values in a large table (about 80M records).
> Somehow, none of the two (2) queries (see below) is using the index "I_bca"
> that, I believe, should've been used.

Those plans are perfectly reasonable (particularly the GroupAggregate one).
        regards, tom lane


Re: Seq Scans when index expected to be used

From
Joe Conway
Date:
jasiek wrote:
> On Sat, 29 Nov 2003 08:49:24 -0800 (PST), ow wrote
> 
>>explain select b, c, a
>>from test
>>group by b, c, a
>>having count(*) > 1

Why would you expect this to use an index scan when it needs to read the 
entire table? If you read the whole table (or even a significant 
fraction of it), a seq scan is faster.

Joe





Re: Seq Scans when index expected to be used

From
Stephan Szabo
Date:
On Sat, 29 Nov 2003, ow wrote:

> Am trying to find duplicate values in a large table (about 80M records).
> Somehow, none of the two (2) queries (see below) is using the index "I_bca"
> that, I believe, should've been used.

As a side note, to help determine if it should be used, you probably want
explain analyze not just explain. In general the index is only going to
help you if the sort is very expensive, not for the initial scan (where
the seq scan is generally faster).


Re: Seq Scans when index expected to be used

From
ow
Date:
--- Joe Conway <mail@joeconway.com> wrote:
> > 
> >>explain select b, c, a
> >>from test
> >>group by b, c, a
> >>having count(*) > 1
> 
> Why would you expect this to use an index scan when it needs to read the 
> entire table? If you read the whole table (or even a significant 
> fraction of it), a seq scan is faster.

My impression was that the index "I_bca" covers the query, hence there should
not be a need to go to the table itself. Why would it?

P.S.
explain analyze vs explain. Normally, would've used "explain analyze" but in
this case it's taking way too long so I used "explain".

Thanks





__________________________________
Do you Yahoo!?
Free Pop-Up Blocker - Get it now
http://companion.yahoo.com/


Re: Seq Scans when index expected to be used

From
Joe Conway
Date:
ow wrote:
> My impression was that the index "I_bca" covers the query, hence there should
> not be a need to go to the table itself. Why would it?

Postgres always has to go to the table. The ability to read data 
directly from indexes (ala other RDBMSs) has been discussed, but not 
implemented. IIRC it's a hard problem due to the way Postgres does MVCC. 
Check the archives.

> explain analyze vs explain. Normally, would've used "explain analyze" but in
> this case it's taking way too long so I used "explain".

I can understand that, but most people will ask for explain analyze 
anyway ;-)

Joe