Thread: Ok, why isn't it using *this* index?
I have a table with columns 'country' and 'state'. I put indexes on both of them. I've done the "vacuum analyze" as per the faq. But when I ask it to explain, it says it will use the index on 'state' if I do a select * from waypoint where state = 'ON'; but it won't use the index on 'country' if I do a select * from waypoint where country = 'CANADA'; Some other interesting things are that it uses the index on state even if I say "where state in ('ON','QC','BC','AB')", and it uses the index on state but not the one on country if I combine "where state = 'ON' and country = 'CANADA'". Here's what it says: waypoint=> explain select * from waypoint where state = 'ON'; NOTICE: QUERY PLAN: Index Scan using waypoint_state on waypoint (cost=7.17 rows=84 width=130) EXPLAIN waypoint=> explain select * from waypoint where country = 'CANADA'; NOTICE: QUERY PLAN: Seq Scan on waypoint (cost=455.13 rows=6813 width=130) EXPLAIN Also, can anybody explain why the "rows=" doesn't correspond to anything logical? For instance, in the first one it says "rows=84" even though there are 107 matching records, and 71 different states. -- Paul Tomblin <ptomblin@xcski.com>, not speaking for anybody Diplomacy is the ability to let someone else have your way.
Paul Tomblin wrote: > I have a table with columns 'country' and 'state'. I put indexes on both > of them. I've done the "vacuum analyze" as per the faq. But when I ask > it to explain, it says it will use the index on 'state' if I do a > select * from waypoint where state = 'ON'; > but it won't use the index on 'country' if I do a > select * from waypoint where country = 'CANADA'; > > Some other interesting things are that it uses the index on state even if > I say "where state in ('ON','QC','BC','AB')", and it uses the index on > state but not the one on country if I combine "where state = 'ON' and > country = 'CANADA'". Maybe it's just my Oracle side, but doesn't country has a too low selectivity? -- _ / \ Leandro Guimarães Faria Corsetti Dutra +55 (11) 3040 8913 \ / Amdocs at Bell Canada +1 (514) 786 87 47 X Support Center, São Paulo, Brazil mailto:adbaamd@bell.ca / \ http://terravista.pt./Enseada/1989/ mailto:leandrod@amdocs.com
I would guess that the optimizer is choosing a sequential scan when the country is CANADA because the number of rows fetched as a percentage of total rows would warrant it. For example, country = 'CANADA' might be true for %30 of the total rows whereas STATE = 'ON' might only represent 2%, and thus the index scan. The EXPLAIN shows estimates and an excellent explanation of them can be found here: http://www.postgresql.org/users-lounge/docs/7.0/user/c4884.htm#AEN 4889 Hope that helps, Mike Mascari mascarm@mascari.com -----Original Message----- From: Paul Tomblin [SMTP:ptomblin@xcski.com] Sent: Sunday, April 01, 2001 3:39 PM To: pgsql-general@postgresql.org Subject: [GENERAL] Ok, why isn't it using *this* index? I have a table with columns 'country' and 'state'. I put indexes on both of them. I've done the "vacuum analyze" as per the faq. But when I ask it to explain, it says it will use the index on 'state' if I do a select * from waypoint where state = 'ON'; but it won't use the index on 'country' if I do a select * from waypoint where country = 'CANADA'; Some other interesting things are that it uses the index on state even if I say "where state in ('ON','QC','BC','AB')", and it uses the index on state but not the one on country if I combine "where state = 'ON' and country = 'CANADA'". Here's what it says: waypoint=> explain select * from waypoint where state = 'ON'; NOTICE: QUERY PLAN: Index Scan using waypoint_state on waypoint (cost=7.17 rows=84 width=130) EXPLAIN waypoint=> explain select * from waypoint where country = 'CANADA'; NOTICE: QUERY PLAN: Seq Scan on waypoint (cost=455.13 rows=6813 width=130) EXPLAIN Also, can anybody explain why the "rows=" doesn't correspond to anything logical? For instance, in the first one it says "rows=84" even though there are 107 matching records, and 71 different states. -- Paul Tomblin <ptomblin@xcski.com>, not speaking for anybody Diplomacy is the ability to let someone else have your way.
Paul Tomblin <ptomblin@xcski.com> writes: > waypoint=> explain select * from waypoint where country = 'CANADA'; > NOTICE: QUERY PLAN: > Seq Scan on waypoint (cost=455.13 rows=6813 width=130) I take it the majority of rows have country = 'CANADA'? How many rows in the table all together, anyway? Presumably you're seeing the results of an estimate that this WHERE clause is too unselective for an index scan to be profitable. But I can't tell if the rows estimate is any good or not. A rule of thumb is that an indexscan will only be used if the index clauses select no more than a few percent of the rows in the table. Otherwise the additional I/O to scan the index and to read the table in nonsequential fashion costs more than a sequential scan does. > Also, can anybody explain why the "rows=" doesn't correspond to anything > logical? For instance, in the first one it says "rows=84" even though > there are 107 matching records, and 71 different states. ROTFL ... given the thinness of the statistics used to make the estimate, I'd call rows=84 practically dead on, if the true value is 107. The system is doing real good here. You can read in the archives about lots of cases where the estimate is off by a factor of ten or worse, leading to bad plan choices. This estimate is plenty close enough to arrive at a reasonable plan. regards, tom lane
Quoting ADBAAMD (adba.amdocs@bell.ca): > Paul Tomblin wrote: > > I have a table with columns 'country' and 'state'. I put indexes on both > > of them. I've done the "vacuum analyze" as per the faq. But when I ask > > it to explain, it says it will use the index on 'state' if I do a > > select * from waypoint where state = 'ON'; > > but it won't use the index on 'country' if I do a > > select * from waypoint where country = 'CANADA'; > > Maybe it's just my Oracle side, but doesn't country has a too low > selectivity? If I try explain select * from waypoint where country = 'BELIZE'; a query that will only select one record out of the 8300-odd, it still doesn't use the index. Seq Scan on waypoint (cost=455.13 rows=6813 width=130) -- Paul Tomblin <ptomblin@xcski.com>, not speaking for anybody Mommy, what does "Formatting Drive C:" mean?
Paul Tomblin wrote: > Quoting ADBAAMD (adba.amdocs@bell.ca): > >> Paul Tomblin wrote: >> >>> I have a table with columns 'country' and 'state'. I put indexes on both >>> of them. I've done the "vacuum analyze" as per the faq. But when I ask >>> it to explain, it says it will use the index on 'state' if I do a >>> select * from waypoint where state = 'ON'; >>> but it won't use the index on 'country' if I do a >>> select * from waypoint where country = 'CANADA'; >> >> Maybe it's just my Oracle side, but doesn't country has a too low >> selectivity? > > > If I try > explain select * from waypoint where country = 'BELIZE'; > a query that will only select one record out of the 8300-odd, it still > doesn't use the index. > Seq Scan on waypoint (cost=455.13 rows=6813 width=130) Selectivity isn't about specific values, but about averages. If the planner would know statistics about each and every indexed value on the database, it would take a lot of effort to ANALYZE indexed data, the memory and disk consumption by statistics would be high, and CPU usage by the planner would go gaga. So it analyzes just averages. It doesn't matter that BELIZE has a high selectivity, but that country has a low one. -- _ / \ Leandro Guimarães Faria Corsetti Dutra +55 (11) 3040 8913 \ / Amdocs at Bell Canada +1 (514) 786 87 47 X Support Center, São Paulo, Brazil mailto:adbaamd@bell.ca / \ http://terravista.pt./Enseada/1989/ mailto:leandrod@amdocs.com
Quoting ADBAAMD (adba.amdocs@bell.ca): > Paul Tomblin wrote: > > If I try > > explain select * from waypoint where country = 'BELIZE'; > > a query that will only select one record out of the 8300-odd, it still > > doesn't use the index. > > Seq Scan on waypoint (cost=455.13 rows=6813 width=130) > > Selectivity isn't about specific values, but about averages. > > If the planner would know statistics about each and every indexed value > on the database, it would take a lot of effort to ANALYZE indexed data, > the memory and disk consumption by statistics would be high, and CPU > usage by the planner would go gaga. So it analyzes just averages. > > It doesn't matter that BELIZE has a high selectivity, but that country > has a low one. Ok, so if I understand you correctly, the fact that about 90% of the records have country='USA' and about 9% of the records have country='CANADA' means that it's never going to use the index because it on average, a query is going to be for USA, and a sequential scan is going to be better. I think I understand now. If this is correct, then doesn't it make sense just to drop that index? At least until I get a lot more data from other countries? waypoint=> select count(*), country from waypoint group by country; count|country -----+-------------------- 2|ANTIGUA AND BARBUDA 15|BAHAMAS 1|BARBADOS 1|BELIZE 741|CANADA 1|CAYMAN ISLANDS 5|COLOMBIA 2|COSTA RICA 23|CUBA 1|DOMINICA 3|DOMINICAN REPUBLIC 1|ECUADOR 3|FED STS MICRONESIA 4|FRENCH WEST INDIES 1|GRENADA 1|GUYANA 2|HAITI 2|HONDURAS 4|JAMAICA 2|MARSHALL ISLANDS 31|MEXICO 3|NETHERLANDS ANTILLES 2|NICARAGUA 1|PALAU 8|PANAMA 2|TRINIDAD AND TOBAGO 2|TRUST TERRITORIES 2|TURKS AND CAICOS ISL 7436|USA 5|VENEZUELA (30 rows) -- Paul Tomblin <ptomblin@xcski.com>, not speaking for anybody Every program has two purposes -- one for which it was written and another for which it wasn't.
Paul Tomblin wrote: > Quoting ADBAAMD (adba.amdocs@bell.ca): > > Ok, so if I understand you correctly, the fact that about 90% of the > records have country='USA' and about 9% of the records have > country='CANADA' means that it's never going to use the index because it > on average, a query is going to be for USA, and a sequential scan is going > to be better. > > I think I understand now. If this is correct, then doesn't it make sense > just to drop that index? At least until I get a lot more data from other > countries? Probably you are right. You could also try other index access methods besides the standard b-tree. I don't know about the situation in pgsql, but in Oracle we have even an index access method create specifically to serve low selectivity data: bitmaps. While I could find a succint description of access methods at http://www.postgresql.org/users-lounge/docs/7.0/postgres/indices.html, I don't know how each of them work exactly, nor if any would be any good for low selectivity situations. Is there any docs on that? -- _ / \ Leandro Guimarães Faria Corsetti Dutra +55 (11) 3040 8913 \ / Amdocs at Bell Canada +1 (514) 786 87 47 X Support Center, São Paulo, Brazil mailto:adbaamd@bell.ca / \ http://terravista.pt./Enseada/1989/ mailto:leandrod@amdocs.com
Quoting ADBAAMD (adba.amdocs@bell.ca): > Paul Tomblin wrote: > > I think I understand now. If this is correct, then doesn't it make sense > > just to drop that index? At least until I get a lot more data from other > > countries? > > Probably you are right. Ok, thanks for all the help everybody. -- Paul Tomblin <ptomblin@xcski.com>, not speaking for anybody Don't use a big word where a diminutive one will suffice.
On Sun, Apr 01, 2001 at 05:11:46PM -0400, Paul Tomblin wrote: > Ok, so if I understand you correctly, the fact that about 90% of the > records have country='USA' and about 9% of the records have > country='CANADA' means that it's never going to use the index because it > on average, a query is going to be for USA, and a sequential scan is going > to be better. > > I think I understand now. If this is correct, then doesn't it make sense > just to drop that index? At least until I get a lot more data from other > countries? huh, this seems to be ineteresting to know, just because this "feature" may seriously affect effectivity. by the way, have you tried to disable sequential scan? that may force pgsql to use an index in any case, AFAIK. -- Denis A. Doroshenko [GPRS/IN/WAP, VAS group engineer] .-. _|_ | [Omnitel Ltd., T.Sevcenkos st. 25, Vilnius, Lithuania] | | _ _ _ .| _ | [Phone: +370 9863486 E-mail: d.doroshenko@omnitel.net] |_|| | || |||(/_|_
Denis A. Doroshenko wrote: > On Sun, Apr 01, 2001 at 05:11:46PM -0400, Paul Tomblin wrote: > >> Ok, so if I understand you correctly, the fact that about 90% of the >> records have country='USA' and about 9% of the records have >> country='CANADA' means that it's never going to use the index because it >> on average, a query is going to be for USA, and a sequential scan is going >> to be better. >> >> I think I understand now. If this is correct, then doesn't it make sense >> just to drop that index? At least until I get a lot more data from other >> countries? > > > huh, this seems to be ineteresting to know, just because this > "feature" may seriously affect effectivity. by the way, have you tried > to disable sequential scan? that may force pgsql to use an index in > any case, AFAIK. It is not always a good idea to force the use of the index. Overriding the planner should only be done after tests on realistic volumes of data and with a thorough understanding about what's going on. In fact it is wise to work only with the indices you will use frequently and tables that have some significant queries as compared to updates, because updates get *slower* with indices! -- _ / \ Leandro Guimarães Faria Corsetti Dutra +55 (11) 3040 8913 \ / Amdocs at Bell Canada +1 (514) 786 87 47 X Support Center, São Paulo, Brazil mailto:adbaamd@bell.ca / \ http://terravista.pt./Enseada/1989/ mailto:leandrod@amdocs.com
Paul Tomblin <ptomblin@xcski.com> writes: > Ok, so if I understand you correctly, the fact that about 90% of the > records have country='USA' and about 9% of the records have > country='CANADA' means that it's never going to use the index because it > on average, a query is going to be for USA, and a sequential scan is going > to be better. Actually, 7.0 and later (which you are not using, I gather from your EXPLAIN display) do know the difference between the most common value in the column and the rest of 'em. I think that 7.0 would choose an indexscan in the case where it can see that you are not looking for 'USA'. Which would be the right choice for 'BELIZE', but probably not the right choice for 'CANADA'. For 7.2 I am hoping to extend the stored stats to know about the top three or so common values, not just one, so that we can deal more effectively with data distributions like this one. But in any case, most of the respondents in this thread have been assuming that you were running a reasonably current Postgres. Try upgrading ... regards, tom lane
Quoting Tom Lane (tgl@sss.pgh.pa.us): > Paul Tomblin <ptomblin@xcski.com> writes: > > Ok, so if I understand you correctly, the fact that about 90% of the > > records have country='USA' and about 9% of the records have > > country='CANADA' means that it's never going to use the index because it > > on average, a query is going to be for USA, and a sequential scan is going > > to be better. > > effectively with data distributions like this one. But in any case, > most of the respondents in this thread have been assuming that you > were running a reasonably current Postgres. Try upgrading ... Does anybody know if the RPM that comes with RedHat 7.0 will work with RedHat 6.2? I've just upgraded my workstation machine to RedHat 7.0, and I want to wait a bit before upgrading my server as well. (If only because lpr suddenly stopped working with the last "update" from RedHat.) -- Paul Tomblin <ptomblin@xcski.com>, not speaking for anybody "American 999, Heathrow Approach. Descend and maintain 4,000 feet, QNH 1011." "Uhhh Heathrow Approach, could you give us that in inches?" "American 999, descend and maintain 48,000 inches, QNH 1011."
Paul Tomblin <ptomblin@xcski.com> writes: > Does anybody know if the RPM that comes with RedHat 7.0 will work with > RedHat 6.2? I've just upgraded my workstation machine to RedHat 7.0, and > I want to wait a bit before upgrading my server as well. (If only because > lpr suddenly stopped working with the last "update" from RedHat.) It's unlikely that the binary RPM will work, but if you get the source RPM it *should* build on 6.2, though you might have to upgrade RPM on your 6.2 machine (if you haven't already). I'm running 7.1beta (compiled from source tarball) on multiple 6.2 machines. It's not too painful to manage because everything goes into /usr/local/pgsql by default--it doesn't scatter files all over the machine. -Doug