Thread: index scan on =, but not < ?

index scan on =, but not < ?

From
"Rick Schumeyer"
Date:

I have two index questions.  The first is about an issue that has been recently discussed,

and I just wanted to be sure of my understanding.  Functions like count(), max(), etc. will

use sequential scans instead of index scans because the index doesn’t know which rows

are actually visible…is this correct?

 

Second:

 

I created an index in a table with over 10 million rows.

The index is on field x, which is a double.

 

The following command, as I expected, results in an index scan:

 

=# explain select * from data where x = 0;

                               QUERY PLAN

-------------------------------------------------------------------------

 Index Scan using data_x_ix on data  (cost=0.00..78.25 rows=19 width=34)

   Index Cond: (x = 0::double precision)

(2 rows)

 

 

But this command, in which the only difference if > instead of =, is a sequential scan.

 

=# explain select * from data where x > 0;

                            QUERY PLAN

------------------------------------------------------------------

 Seq Scan on data  (cost=0.00..1722605.20 rows=62350411 width=34)

   Filter: (x > 0::double precision)

(2 rows)

 

Why is this?

(This is with pg 8.0.1 on a PC running FC3 with 1GB ram…if it matters)

Re: index scan on =, but not < ?

From
Thomas F.O'Connell
Date:
Your hypothesis about index usage of count() and max() is correct.

As for why you see index usage in your first example query and not your
second: compare the number of rows in question. An index is extremely
useful if 19 rows will be returned. But when 62350411 rows will be
returned, you're talking about a substantial fraction of the table. A
sequential scan will probably correctly be judged to be faster by the
planner.

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Mar 8, 2005, at 12:35 PM, Rick Schumeyer wrote:

> I have two index questions.  The first is about an issue that has been
> recently discussed,
> and I just wanted to be sure of my understanding.  Functions like
> count(), max(), etc. will
> use sequential scans instead of index scans because the index doesn’t
> know which rows
> are actually visible…is this correct?
>
>  
>
> Second:
>
>  
>
> I created an index in a table with over 10 million rows.
> The index is on field x, which is a double.
>
> The following command, as I expected, results in an index scan:
>
> =# explain select * from data where x = 0;
>
>                                QUERY PLAN
>
> -----------------------------------------------------------------------
> --
>  Index Scan using data_x_ix on data  (cost=0.00..78.25 rows=19
> width=34)
>    Index Cond: (x = 0::double precision)
> (2 rows)
>  
>
> But this command, in which the only difference if > instead of =, is a
> sequential scan.
>
>
> =# explain select * from data where x > 0;
>
>                             QUERY PLAN
>
> ------------------------------------------------------------------
>
>  Seq Scan on data  (cost=0.00..1722605.20 rows=62350411 width=34)
>    Filter: (x > 0::double precision)
> (2 rows)
>
> Why is this?
>
> (This is with pg 8.0.1 on a PC running FC3 with 1GB ram…if it matters)


Re: index scan on =, but not < ?

From
John Arbash Meinel
Date:
Rick Schumeyer wrote:

> I have two index questions. The first is about an issue that has been
> recently discussed,
>
> and I just wanted to be sure of my understanding. Functions like
> count(), max(), etc. will
>
> use sequential scans instead of index scans because the index doesn’t
> know which rows
>
> are actually visible…is this correct?
>
Actually, index scans are chosen whenever the cost is expected to be
cheaper than a sequential scan. This is generally about < 10% of the
total number of rows.

> Second:
>
> I created an index in a table with over 10 million rows.
>
> The index is on field x, which is a double.
>
> The following command, as I expected, results in an index scan:
>
> =# explain select * from data where x = 0;
>
> QUERY PLAN
>
> -------------------------------------------------------------------------
>
> Index Scan using data_x_ix on data (cost=0.00..78.25 rows=19 width=34)
>
> Index Cond: (x = 0::double precision)
>
> (2 rows)
>
Since you have 10m rows, when it expects to get only 19 rows, it is much
faster to use an index.

> But this command, in which the only difference if > instead of =, is a
> sequential scan.
>
> =# explain select * from data where x > 0;
>
> QUERY PLAN
>
> ------------------------------------------------------------------
>
> Seq Scan on data (cost=0.00..1722605.20 rows=62350411 width=34)
>
> Filter: (x > 0::double precision)
>
> (2 rows)
>
Here, pg expects to find 62M rows (you must have significantly more than
10M rows). In this case a sequential scan is much faster than an indexed
one, so that's what pg does.

> Why is this?
>
> (This is with pg 8.0.1 on a PC running FC3 with 1GB ram…if it matters)
>
If you think there is truly a performance problem, try attaching the
results of "explain analyze" in which we might be able to tell you that
your statistics inaccurate (run vacuum analyze if you haven't).

John
=:->


Attachment

Re: index scan on =, but not < ?

From
"Rick Schumeyer"
Date:
That makes a lot of sense.  Sure enough, if I change the query from
WHERE x > 0  (which return a lot of rows) to
WHERE x > 0 AND x < 1
I now get an index scan.

> As for why you see index usage in your first example query and not your
> second: compare the number of rows in question. An index is extremely
> useful if 19 rows will be returned. But when 62350411 rows will be
> returned, you're talking about a substantial fraction of the table. A
> sequential scan will probably correctly be judged to be faster by the
> planner.
>


Re: index scan on =, but not < ?

From
Dennis Bjorklund
Date:
On Tue, 8 Mar 2005, Rick Schumeyer wrote:

> =# explain select * from data where x = 0;
> -------------------------------------------------------------------------
>  Index Scan using data_x_ix on data  (cost=0.00..78.25 rows=19 width=34)
>    Index Cond: (x = 0::double precision)
>
> But this command, in which the only difference if > instead of =, is a
> sequential scan.
>
> =# explain select * from data where x > 0;
> ------------------------------------------------------------------
>  Seq Scan on data  (cost=0.00..1722605.20 rows=62350411 width=34)
>    Filter: (x > 0::double precision)
>
> Why is this?

That is because it's faster to execute the x>0 query with a seq. scan then
a index scan. Postgresql is doing the right thing here.

Pg estimates that the first query will return 19 rows and that the second
query will return 62350411 rows. To return 62350411 rows it's faster to
just scan the table and not use the index.

--
/Dennis Björklund


Re: index scan on =, but not < ?

From
Bruno Wolff III
Date:
On Tue, Mar 08, 2005 at 13:35:53 -0500,
  Rick Schumeyer <rschumeyer@ieee.org> wrote:
> I have two index questions.  The first is about an issue that has been
> recently discussed,
>
> and I just wanted to be sure of my understanding.  Functions like count(),
> max(), etc. will
>
> use sequential scans instead of index scans because the index doesn't know
> which rows
>
> are actually visible.is this correct?

Not exactly. If the number of rows to be examined is on the order of 5%
of the table, an index scan will probably be slower than a sequential
scan. The visibility issue makes index scans slower in the case that
the only columns of interest are in the index.
Another issue is that max could in theory use an index, but there isn't
a mechanism for Postgres to know how to do this in general for aggregates
where it is possible. There have been discussions in the past about
how this could be done, but no one has done it yet.

Re: index scan on =, but not < ?

From
"Jim C. Nasby"
Date:
On Tue, Mar 08, 2005 at 10:38:21PM -0600, Bruno Wolff III wrote:
> Not exactly. If the number of rows to be examined is on the order of 5%
> of the table, an index scan will probably be slower than a sequential
> scan. The visibility issue makes index scans slower in the case that

Shouldn't that be 50%?
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: index scan on =, but not < ?

From
Bruno Wolff III
Date:
On Tue, Mar 08, 2005 at 22:55:19 -0600,
  "Jim C. Nasby" <decibel@decibel.org> wrote:
> On Tue, Mar 08, 2005 at 10:38:21PM -0600, Bruno Wolff III wrote:
> > Not exactly. If the number of rows to be examined is on the order of 5%
> > of the table, an index scan will probably be slower than a sequential
> > scan. The visibility issue makes index scans slower in the case that
>
> Shouldn't that be 50%?

No. When you are doing an index scan of a significant part of the table,
you will fetch some heap pages more than once. You will also be fetching
blocks out of order, so you will lose out on read ahead optimization
by the OS. This assumes that you don't get a lot of cache hits on the
help pages. If a significant portion of the table is cached, then the
trade off point will be at a higher percentage of the table.

Re: index scan on =, but not < ?

From
"Jim C. Nasby"
Date:
On Tue, Mar 08, 2005 at 11:20:20PM -0600, Bruno Wolff III wrote:
> On Tue, Mar 08, 2005 at 22:55:19 -0600,
>   "Jim C. Nasby" <decibel@decibel.org> wrote:
> > On Tue, Mar 08, 2005 at 10:38:21PM -0600, Bruno Wolff III wrote:
> > > Not exactly. If the number of rows to be examined is on the order of 5%
> > > of the table, an index scan will probably be slower than a sequential
> > > scan. The visibility issue makes index scans slower in the case that
> >
> > Shouldn't that be 50%?
>
> No. When you are doing an index scan of a significant part of the table,
> you will fetch some heap pages more than once. You will also be fetching
> blocks out of order, so you will lose out on read ahead optimization
> by the OS. This assumes that you don't get a lot of cache hits on the
> help pages. If a significant portion of the table is cached, then the
> trade off point will be at a higher percentage of the table.

Ahh, I was thinking of a high correlation factor on the index. I still
question 5% though... that seems awefully low.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: index scan on =, but not < ?

From
David Brown
Date:
Assuming your system isn't starved for memory, shouldn't repeated page
fetches be hitting the cache?

I've also wondered about the conventional wisdom that read ahead doesn't
help random reads. I may well be missing something, but *if* the OS has
enough memory to cache most of the table, surely read ahead will still
work to your advantage?

Bruno Wolff III wrote:

>No. When you are doing an index scan of a significant part of the table,
>you will fetch some heap pages more than once. You will also be fetching
>blocks out of order, so you will lose out on read ahead optimization
>by the OS. This assumes that you don't get a lot of cache hits on the
>help pages. If a significant portion of the table is cached, then the
>trade off point will be at a higher percentage of the table.
>
>

Re: index scan on =, but not < ?

From
David Brown
Date:
Jim C. Nasby wrote:

>Ahh, I was thinking of a high correlation factor on the index. I still
>question 5% though... that seems awefully low.
>
>
Not really. It all depends on how many records you're packing into each
page. 1% may well be the threshold for small records.

Tom mentioned this in the last couple of months. He was citing a uniform
distribution as an example and I thought that sounded a little
pessimistic, but when I did the (possibly faulty) math with a random
distribution, I discovered he wasn't far off.

It's not this simple, but if you can fit 50 randomly organized records
into each page and you want to retrieve 2% of the rows, it's likely
you'll have to fetch every page - believe it or not.

What concerns me is that this all depends on the correlation factor, and
I suspect that the planner is not giving enough weight to this.
Actually, I'm wondering if it's even looking at the statistic, but I
haven't created a test to check. It might explain quite a few complaints
about the planner not utilizing indexes.

Re: index scan on =, but not < ?

From
Manfred Koizar
Date:
On Thu, 10 Mar 2005 10:24:46 +1000, David Brown <time@bigpond.net.au>
wrote:
>What concerns me is that this all depends on the correlation factor, and
>I suspect that the planner is not giving enough weight to this.

The planner does the right thing for correlations very close to 1 (and
-1) and for correlations near zero.  For correlations somewhere between
0 and 1 the cost is estimated by interpolation, but it tends too much
towards the bad end, IMHO.

Servus
 Manfred