Thread: index scan on =, but not < ?
I have two index questions. The first is about an issue that has been recently discussed, and I just wanted to be sure of my understanding. Functions like count(), max(), etc. will use sequential scans instead of index scans because the index doesn't know which rows are actually visible... is this correct?
Second:
I created an index in a table with over 10 million rows.
The index is on field x, which is a double.
The following command, as I expected, results in an index scan:
=# explain select * from data where x = 0;
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using data_x_ix on data (cost=0.00..78.25 rows=19 width=34)
Index Cond: (x = 0::double precision)
(2 rows)
But this command, in which the only difference is > instead of =, results in a sequential scan:
=# explain select * from data where x > 0;
QUERY PLAN
------------------------------------------------------------------
Seq Scan on data (cost=0.00..1722605.20 rows=62350411 width=34)
Filter: (x > 0::double precision)
(2 rows)
Why is this?
(This is with pg 8.0.1 on a PC running FC3 with 1GB RAM... if it matters.)
Your hypothesis about index usage of count() and max() is correct.

As for why you see index usage in your first example query and not your second: compare the number of rows in question. An index is extremely useful if 19 rows will be returned. But when 62350411 rows will be returned, you're talking about a substantial fraction of the table. A sequential scan will probably correctly be judged to be faster by the planner.

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005
Rick Schumeyer wrote:

> Functions like count(), max(), etc. will use sequential scans instead
> of index scans because the index doesn't know which rows are actually
> visible... is this correct?

Actually, index scans are chosen whenever the cost is expected to be cheaper than a sequential scan. This is generally the case when fewer than about 10% of the total number of rows will be fetched.

> Index Scan using data_x_ix on data (cost=0.00..78.25 rows=19 width=34)
>   Index Cond: (x = 0::double precision)

Since you have 10M rows, when it expects to get only 19 rows, it is much faster to use an index.

> Seq Scan on data (cost=0.00..1722605.20 rows=62350411 width=34)
>   Filter: (x > 0::double precision)

Here, pg expects to find 62M rows (so you must have significantly more than 10M rows). In this case a sequential scan is much faster than an indexed one, so that's what pg does.

If you think there is truly a performance problem, try attaching the results of "explain analyze", from which we might be able to tell you that your statistics are inaccurate (run vacuum analyze if you haven't).

John
=:->
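(A minimal sketch of that diagnostic, using the table from this thread; the output is omitted since it depends entirely on the data:

=# vacuum analyze data;
=# explain analyze select * from data where x > 0;

Unlike plain explain, explain analyze actually executes the query and prints actual times and row counts alongside the planner's estimates, so a large gap between estimated and actual rows is the telltale sign of stale statistics.)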
That makes a lot of sense. Sure enough, if I change the query from WHERE x > 0 (which returns a lot of rows) to WHERE x > 0 AND x < 1, I now get an index scan.

> As for why you see index usage in your first example query and not your
> second: compare the number of rows in question. An index is extremely
> useful if 19 rows will be returned. But when 62350411 rows will be
> returned, you're talking about a substantial fraction of the table. A
> sequential scan will probably correctly be judged to be faster by the
> planner.
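(For the archives, the winning query looks like this; no plan output is shown because the costs and row estimates will depend on the data:

=# explain select * from data where x > 0 and x < 1;

With a bounded range the row estimate drops far enough that the planner picks an Index Scan using data_x_ix, just as it did for the equality case.)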
On Tue, 8 Mar 2005, Rick Schumeyer wrote:

> =# explain select * from data where x = 0;
> Index Scan using data_x_ix on data (cost=0.00..78.25 rows=19 width=34)
>   Index Cond: (x = 0::double precision)
>
> But this command, in which the only difference is > instead of =,
> results in a sequential scan.
>
> =# explain select * from data where x > 0;
> Seq Scan on data (cost=0.00..1722605.20 rows=62350411 width=34)
>   Filter: (x > 0::double precision)
>
> Why is this?

That is because it's faster to execute the x > 0 query with a sequential scan than an index scan. PostgreSQL is doing the right thing here.

Pg estimates that the first query will return 19 rows and that the second query will return 62350411 rows. To return 62350411 rows it's faster to just scan the table and not use the index.

--
/Dennis Björklund
On Tue, Mar 08, 2005 at 13:35:53 -0500, Rick Schumeyer <rschumeyer@ieee.org> wrote:

> Functions like count(), max(), etc. will use sequential scans instead
> of index scans because the index doesn't know which rows are actually
> visible... is this correct?

Not exactly. If the number of rows to be examined is on the order of 5% of the table, an index scan will probably be slower than a sequential scan. The visibility issue makes index scans slower in the case where the only columns of interest are in the index.

Another issue is that max() could in theory use an index, but there isn't a mechanism for Postgres to know how to do this in general for aggregates, even where it is possible. There have been discussions in the past about how this could be done, but no one has done it yet.
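The usual workaround for max() is to rewrite it as an ordered, limited query, which the planner can satisfy by reading one end of the index. A minimal sketch, assuming x contains no NULLs (in a descending sort NULLs come first by default and would be returned instead of the real maximum):

=# select x from data order by x desc limit 1;

This returns the same value as select max(x) from data, but can walk data_x_ix instead of scanning the whole table.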
On Tue, Mar 08, 2005 at 10:38:21PM -0600, Bruno Wolff III wrote:

> Not exactly. If the number of rows to be examined is on the order of 5%
> of the table, an index scan will probably be slower than a sequential
> scan.

Shouldn't that be 50%?

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Tue, Mar 08, 2005 at 22:55:19 -0600, "Jim C. Nasby" <decibel@decibel.org> wrote:

> > Not exactly. If the number of rows to be examined is on the order of 5%
> > of the table, an index scan will probably be slower than a sequential
> > scan.
>
> Shouldn't that be 50%?

No. When you are doing an index scan of a significant part of the table, you will fetch some heap pages more than once. You will also be fetching blocks out of order, so you will lose out on read-ahead optimization by the OS.

This assumes that you don't get a lot of cache hits on the heap pages. If a significant portion of the table is cached, then the trade-off point will be at a higher percentage of the table.
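This trade-off is what the random_page_cost setting models: by default (4.0) the planner charges a random page fetch roughly four times the cost of a sequential one. The settings below are real, though the value used is purely illustrative; if most of the table is cached, lowering it shifts the crossover point toward the index scan:

=# show random_page_cost;
=# set random_page_cost = 2;
=# show effective_cache_size;

Raising effective_cache_size similarly tells the planner that repeated page fetches are likely to be cache hits.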
On Tue, Mar 08, 2005 at 11:20:20PM -0600, Bruno Wolff III wrote:

> No. When you are doing an index scan of a significant part of the table,
> you will fetch some heap pages more than once. You will also be fetching
> blocks out of order, so you will lose out on read-ahead optimization
> by the OS.

Ahh, I was thinking of a high correlation factor on the index. I still question 5% though... that seems awfully low.

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Assuming your system isn't starved for memory, shouldn't repeated page fetches be hitting the cache?

I've also wondered about the conventional wisdom that read-ahead doesn't help random reads. I may well be missing something, but *if* the OS has enough memory to cache most of the table, surely read-ahead will still work to your advantage?

Bruno Wolff III wrote:

> No. When you are doing an index scan of a significant part of the table,
> you will fetch some heap pages more than once. You will also be fetching
> blocks out of order, so you will lose out on read-ahead optimization
> by the OS. This assumes that you don't get a lot of cache hits on the
> heap pages. If a significant portion of the table is cached, then the
> trade-off point will be at a higher percentage of the table.
Jim C. Nasby wrote:

> Ahh, I was thinking of a high correlation factor on the index. I still
> question 5% though... that seems awfully low.

Not really. It all depends on how many records you're packing into each page. 1% may well be the threshold for small records. Tom mentioned this in the last couple of months. He was citing a uniform distribution as an example, and I thought that sounded a little pessimistic, but when I did the (possibly faulty) math with a random distribution, I discovered he wasn't far off.

It's not this simple, but if you can fit 50 randomly organized records into each page and you want to retrieve 2% of the rows, it's likely you'll have to fetch every page - believe it or not.

What concerns me is that this all depends on the correlation factor, and I suspect that the planner is not giving enough weight to this. Actually, I'm wondering if it's even looking at the statistic, but I haven't created a test to check. It might explain quite a few complaints about the planner not utilizing indexes.
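The back-of-the-envelope math is easy to check (a sketch using the figures from the example above: 50 rows per page, 2% of rows selected, placed uniformly at random). The chance that a given page contains none of the selected rows is 0.98^50, so the fraction of pages touched at least once is:

=# select 1 - power(0.98, 50) as fraction_of_pages_touched;

That comes to roughly 0.64, so about two thirds of the pages hold at least one wanted row. And since each page holds 50 * 0.02 = 1 selected row on average, an uncached index scan performs about one page fetch per page in the table - as much I/O as a full sequential scan, but in random order.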
On Thu, 10 Mar 2005 10:24:46 +1000, David Brown <time@bigpond.net.au> wrote:

> What concerns me is that this all depends on the correlation factor, and
> I suspect that the planner is not giving enough weight to this.

The planner does the right thing for correlations very close to 1 (and -1) and for correlations near zero. For correlations somewhere between 0 and 1 the cost is estimated by interpolation, but it tends too much towards the bad end, IMHO.

Servus
Manfred
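(For reference, the interpolation in question - as I read costsize.c, so treat the exact form as an assumption - weights the two I/O-cost extremes by the square of the index correlation:

    run_cost = max_IO_cost + correlation^2 * (min_IO_cost - max_IO_cost)

A correlation of 0.5 therefore moves the estimate only a quarter of the way from the worst case toward the best case, which is why mid-range correlations land "too much towards the bad end".)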