Thread: Simple query not using index: why?

Simple query not using index: why?

From
William Garrison
Date:
I am looking for records with duplicate keys, so I am running this query:

SELECT
    fileid, COUNT(*)
FROM
    file
GROUP BY
    fileid
HAVING
    COUNT(*)>1

The table has an index on fileid (non-unique index) so I am surprised
that postgres is doing a table scan.  This database is >15GB, and there
are a number of fairly large string columns in the table.  I am very
surprised that scanning the index is not faster than scanning the
table.  Any thoughts on that?  Is scanning the table faster than
scanning the index?  Is there a reason that it needs anything other than
the index?

Re: Simple query not using index: why?

From
aklaver@comcast.net (Adrian Klaver)
Date:
-------------- Original message ----------------------
From: William Garrison <postgres@mobydisk.com>
> I am looking for records with duplicate keys, so I am running this query:
>
> SELECT
>     fileid, COUNT(*)
> FROM
>     file
> GROUP BY
>     fileid
> HAVING
>     COUNT(*)>1
>
> The table has an index on fileid (non-unique index) so I am surprised
> that postgres is doing a table scan.  This database is >15GB, and there
> are a number of fairly large string columns in the table.  I am very
> surprised that scanning the index is not faster than scanning the
> table.  Any thoughts on that?  Is scanning the table faster than
> scanning the index?  Is there a reason that it needs anything other than
> the index?
>

I may be missing something, but it would have to scan the entire table to get all the occurrences of each fileid in
order to do the count(*).



--
Adrian Klaver
aklaver@comcast.net


Re: Simple query not using index: why?

From
William Garrison
Date:
Can't it just scan the index to get that?  I assumed the index had links to every fileid in the table.  In my over-simplified imagination, the table looks like this:

ctid|fileid|column|column|column|column
ctid|fileid|column|column|column|column
ctid|fileid|column|column|column|column
ctid|fileid|column|column|column|column
etc.

While the index looks like
fileid|ctid
fileid|ctid
fileid|ctid
fileid|ctid
...

So I expected scanning the index was faster, and still had everything it needed to do the count.  Or perhaps it was because I said COUNT(*) so it needs to look at the other columns in the table?  I really just wanted the number of "hits" not the number of records with distinct values or anything like that.  My understanding was that COUNT(*) did that, and didn't really look at the columns themselves.


Adrian Klaver wrote:
> -------------- Original message ----------------------
> From: William Garrison <postgres@mobydisk.com>
> > I am looking for records with duplicate keys, so I am running this query:
> >
> > SELECT
> >     fileid, COUNT(*)
> > FROM
> >     file
> > GROUP BY
> >     fileid
> > HAVING
> >     COUNT(*)>1
> >
> > The table has an index on fileid (non-unique index) so I am surprised
> > that postgres is doing a table scan.  This database is >15GB, and there
> > are a number of fairly large string columns in the table.  I am very
> > surprised that scanning the index is not faster than scanning the
> > table.  Any thoughts on that?  Is scanning the table faster than
> > scanning the index?  Is there a reason that it needs anything other than
> > the index?
> >
>
> I may be missing something, but it would have to scan the entire table to get all the occurrences of each fileid in order to do the count(*).
>
> --
> Adrian Klaver
> aklaver@comcast.net

 

Re: Simple query not using index: why?

From
Joshua Drake
Date:
On Wed, 03 Sep 2008 15:55:17 -0400
William Garrison <postgres@mobydisk.com> wrote:

> So I expected scanning the index was faster, and still had everything
> it needed to do the count.  Or perhaps it was because I said COUNT(*)
> so it needs to look at the other columns in the table?  I really just
> wanted the number of "hits" not the number of records with distinct
> values or anything like that.  My understanding was that COUNT(*) did
> that, and didn't really look at the columns themselves.

We do not have visibility information in the index, so we have to scan
the pages to see what tuples are live or dead (and thus count them).
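
To make the point above concrete, here is a small toy model in Python. It is only a sketch, not PostgreSQL's actual storage format: the `HeapTuple` class, the `live` flag, and both helper functions are hypothetical names invented for illustration, and a real visibility check depends on the transaction snapshot rather than a single boolean. It assumes a row with fileid 7 was deleted but not yet vacuumed, and a new row with fileid 7 was inserted afterwards, so the index still holds an entry for the dead tuple.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class HeapTuple:
    ctid: tuple   # (page, offset): physical location of the tuple
    fileid: int
    live: bool    # simplified stand-in for the real visibility check


# Toy heap (hypothetical data).
heap = [
    HeapTuple((0, 1), 7, live=False),  # deleted row, not yet vacuumed
    HeapTuple((0, 2), 7, live=True),   # row inserted later with the same key
    HeapTuple((0, 3), 8, live=True),
    HeapTuple((1, 1), 9, live=True),
    HeapTuple((1, 2), 9, live=True),   # a genuine duplicate key
]

# Toy index: (fileid, ctid) pairs only, with no visibility information.
index = sorted((t.fileid, t.ctid) for t in heap)


def duplicates_from_index_only(index):
    """Count entries per fileid using nothing but the index."""
    counts = Counter(fileid for fileid, _ctid in index)
    return {k: n for k, n in counts.items() if n > 1}


def duplicates_with_heap_check(index, heap):
    """Follow each ctid to the heap and count only live tuples."""
    by_ctid = {t.ctid: t for t in heap}
    counts = Counter(fileid for fileid, ctid in index if by_ctid[ctid].live)
    return {k: n for k, n in counts.items() if n > 1}


print(duplicates_from_index_only(index))        # {7: 2, 9: 2}
print(duplicates_with_heap_check(index, heap))  # {9: 2}
```

In this model the index alone reports fileid 7 as a duplicate even though only one live row has that key. Getting the right answer means visiting the heap for every index entry, and at that point reading the table sequentially can be the cheaper plan.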

Sincerely,

Joshua D. Drake


--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate