Thread: Poor index choice -- multiple indexes of the same columns
PostgreSQL 8.0.3

Hi,

I have a query

  select 1
    from census
    where date < '1975-9-21' and sname = 'RAD' and status != 'A'
    limit 1;

Explain analyze says it always uses the index made by:

  CREATE INDEX census_date_sname ON census (date, sname);

this is even after I made the index:

  CREATE INDEX census_sname_date ON census (sname, date);

I made census_sname_date because it ran too slow.  By deleting
census_date_sname (temporarily, because my apps don't like this) I can
force the use of census_sname_date and the query runs fine.

Seems to me that when there's a constant value in the query and an =
comparison, it will always be faster to use the (b-tree) index that's
ordered first by the constant value, as then all further blocks are
guaranteed to have a higher relevant information density, at least when
compared with another index that has the same columns in it.

As you might imagine, there are relatively few sname values and
relatively many date values in my data.

I use a query like the above in a trigger to enforce bounds
limitations.  I don't expect (want) to find any rows returned.

I've figured out how to avoid executing this code very often, so this
is not presently a serious problem for me.

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
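For context, a bounds-enforcing trigger of the kind described might look
roughly like the sketch below.  This is only an illustration: the
triggered table interval_data, its start_date column, and the error
wording are assumptions, not details given in the thread.

  -- Hypothetical sketch of the bounds check; names other than
  -- census/date/sname/status are assumed for illustration.
  CREATE OR REPLACE FUNCTION census_bounds_check() RETURNS trigger AS $$
  BEGIN
      -- Same "expect no rows" probe as the query above.
      PERFORM 1
         FROM census
        WHERE date < NEW.start_date
          AND sname = NEW.sname
          AND status != 'A'
        LIMIT 1;
      IF FOUND THEN
          RAISE EXCEPTION 'census rows for % violate the date bounds',
              NEW.sname;
      END IF;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER interval_data_bounds
      BEFORE INSERT OR UPDATE ON interval_data
      FOR EACH ROW EXECUTE PROCEDURE census_bounds_check();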
Karl,

> Seems to me that when there's a constant value in the query
> and an = comparison it will always be faster to use the (b-tree)
> index that's ordered first by the constant value, as then all further
> blocks are guaranteed to have a higher relevant information
> density.  At least when compared with another index that has the
> same columns in it.

That really depends on the stats.  Such a choice would *not* be
appropriate if the < comparison was expected to return 1- rows while
the = condition applied to 15% of the table.

What is your STATISTICS_TARGET for the relevant columns set to?  When's
the last time you ran ANALYZE?  If this is all updated, do you want to
post the pg_stats rows for the relevant columns?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
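For reference, a per-column statistics target can be checked and raised
along these lines; the value 100 below is just an illustrative choice,
not something recommended in the thread:

  -- Raise the statistics target for the columns in question
  -- (100 is an arbitrary illustrative value), then re-gather stats.
  ALTER TABLE census ALTER COLUMN date  SET STATISTICS 100;
  ALTER TABLE census ALTER COLUMN sname SET STATISTICS 100;
  ANALYZE census;

  -- See what the planner now knows about those columns.
  SELECT attname, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
   WHERE tablename = 'census' AND attname IN ('date', 'sname');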
On 06/27/2005 05:37:41 PM, Josh Berkus wrote:
> Karl,
>
> > Seems to me that when there's a constant value in the query
> > and an = comparison it will always be faster to use the (b-tree)
> > index that's ordered first by the constant value, as then all
> > further blocks are guaranteed to have a higher relevant information
> > density.  At least when compared with another index that has the
> > same columns in it.
>
> That really depends on the stats.  Such a choice would *not* be
> appropriate if the < comparison was expected to return 1- rows while
> the = condition applied to 15% of the table.

We're talking internals here so I don't know what I'm talking about,
but when the = comparison returns 15% of the table you can find your
way straight to the 1- (sic) relevant rows, because that 15% is further
sorted by the second column of the index.  So that's one disk read, and
after that, when you scan the rest of the blocks, every datum is
relevant/returned.  So your scan will pass through fewer disk blocks.
The only case where it would make sense to consider the other index is
if the planner knew it could get the answer in one disk read, in which
case it should be able to get the answer out of either index in one
disk read, as both indexes are on the same columns.

> What is your STATISTICS_TARGET for the relevant columns set to?

STATISTICS_TARGET is the default, which I read as 10 in the docs.

> When's the last time you ran ANALYZE?

I'm doing this in a torture test script, loading data.  Every Fibonacci
number of rows * 100 I VACUUM ANALYZE.  So, 100, 200, 300, 500, 800,
etc.

Just for grins I've created the index I'd like it to use, run VACUUM
ANALYZE, and shown the EXPLAIN ANALYZE below.

> If this is all updated, do you want to post the pg_stats rows for the
> relevant columns?

pg_stats rows below.  (I've tried to wrap the lines short so as not to
mess up anybody's mailer.)
# create index census_sname_date on census (sname, date);
CREATE INDEX
# vacuum analyze census;
VACUUM
# explain analyze select 1 from census
    where date < '1975-9-21' and sname = 'RAD' and status != 'A';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using census_date_sname on census  (cost=0.00..2169.51 rows=1437 width=0) (actual time=40.610..40.610 rows=0 loops=1)
   Index Cond: ((date < '1975-09-21'::date) AND (sname = 'RAD'::bpchar))
   Filter: (status <> 'A'::bpchar)
 Total runtime: 40.652 ms
(4 rows)

Compare with:

# drop index census_date_sname;
DROP INDEX
# explain analyze select date from census
    where sname = 'RAD' and date < '1975-9-21' and status != 'A' limit 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..3.37 rows=1 width=4) (actual time=0.097..0.097 rows=0 loops=1)
   ->  Index Scan using census_sname_date on census  (cost=0.00..5203.95 rows=1544 width=4) (actual time=0.094..0.094 rows=0 loops=1)
         Index Cond: ((sname = 'RAD'::bpchar) AND (date < '1975-09-21'::date))
         Filter: (status <> 'A'::bpchar)
 Total runtime: 0.133 ms
(5 rows)

# select * from pg_stats
    where tablename = 'census' and (attname = 'sname' or attname = 'date');

-[ RECORD 1 ]-----+--------------------------------------------------------
schemaname        | babase
tablename         | census
attname           | date
null_frac         | 0
avg_width         | 4
n_distinct        | 4687
most_common_vals  | {1979-02-01,1976-06-16,1977-03-23,1978-08-25,1979-09-20,
                  |  1971-06-28,1972-04-28,1972-08-27,1974-04-06,1975-03-19}
most_common_freqs | {0.002,0.00166667,0.00166667,0.00166667,0.00166667,
                  |  0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
histogram_bounds  | {1959-07-15,1966-02-18,1969-02-22,1971-01-10,1972-07-26,
                  |  1974-02-09,1975-05-27,1976-07-28,1977-08-19,1978-08-07,
                  |  1979-10-02}
correlation       | 1
-[ RECORD 2 ]-----+--------------------------------------------------------
schemaname        | babase
tablename         | census
attname           | sname
null_frac         | 0
avg_width         | 7
n_distinct        | 177
most_common_vals  | {MAX,ALT,PRE,COW,EST,JAN,RIN,ZUM,DUT,LUL}
most_common_freqs | {0.0166667,0.015,0.015,0.0146667,0.0143333,0.014,
                  |  0.0136667,0.0136667,0.0133333,0.0133333}
histogram_bounds  | {ALI,BUN,FAN,IBI,LER,NDO,PET,RUS,SLM,TOT,XEN}
correlation       | 0.0446897

Thanks.

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
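As an aside, because DDL is transactional in PostgreSQL, this sort of
before/after comparison can be made without leaving the index dropped,
roughly like so (note that DROP INDEX holds an exclusive lock on the
table until the transaction ends):

  BEGIN;
  DROP INDEX census_date_sname;
  EXPLAIN ANALYZE
      SELECT 1 FROM census
       WHERE date < '1975-9-21' AND sname = 'RAD' AND status != 'A'
       LIMIT 1;
  ROLLBACK;   -- the index comes back untouched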
On 06/27/2005 09:36:51 PM, Karl O. Pinc wrote:

> I'm doing this in a torture test script, loading data.  Every
> Fibonacci number of rows * 100 I VACUUM ANALYZE.  So, 100, 200, 300,
> 500, 800, etc.

(And of course I disconnect my client and re-connect so as to use the
new statistics.  It sure would be nice if I didn't have to do this.)

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
"Karl O. Pinc" <kop@meme.com> writes: > I have a query > select 1 > from census > where date < '1975-9-21' and sname = 'RAD' and status != 'A' > limit 1; > Explain analyze says it always uses the index made by: > CREATE INDEX census_date_sname ON census (date, sname); > this is even after I made the index: > CREATE INDEX census_sname_date ON census (sname, date); I don't believe that any existing release can tell the difference between these two indexes as far as costs go. I just recently added some code to btcostestimate that would cause it to prefer the index on (sname, date) but of course that's not released yet. However: isn't the above query pretty seriously underspecified? With a LIMIT and no ORDER BY, you are asking for a random one of the rows matching the condition. I realize that with "select 1" you may not care much, but adding a suitable ORDER BY would help push the planner towards using the right index. In this case "ORDER BY sname DESC, date DESC" would probably do the trick. regards, tom lane
On 06/28/2005 01:40:56 AM, Tom Lane wrote:
> "Karl O. Pinc" <kop@meme.com> writes:
>
> > I have a query
> >
> > select 1
> >   from census
> >   where date < '1975-9-21' and sname = 'RAD' and status != 'A'
> >   limit 1;
> >
> > Explain analyze says it always uses the index made by:
> >
> > CREATE INDEX census_date_sname ON census (date, sname);
> >
> > this is even after I made the index:
> >
> > CREATE INDEX census_sname_date ON census (sname, date);
>
> I don't believe that any existing release can tell the difference
> between these two indexes as far as costs go.  I just recently
> added some code to btcostestimate that would cause it to prefer
> the index on (sname, date), but of course that's not released yet.
>
> However: isn't the above query pretty seriously underspecified?
> With a LIMIT and no ORDER BY, you are asking for a random one
> of the rows matching the condition.  I realize that with
> "select 1" you may not care much, but adding a suitable ORDER BY
> would help push the planner towards using the right index.  In
> this case "ORDER BY sname DESC, date DESC" would probably do the
> trick.

Yes, that works.  I'd already tried "ORDER BY date DESC" before I first
wrote, and that did not work.  (I started with no LIMIT either, and
tried adding specifications until I gave up.  It's very good that the
new planner will figure out things by itself.)

"ORDER BY sname DESC" works as well.  This is a bit odd, as with the
constant in the = comparison "ORDER BY date DESC" is the same as
"ORDER BY sname DESC, date DESC".  I guess that's why I gave up on my
attempts to get the planner to use the (sname, date) index before I got
to your solution.

Thanks everybody for the help.

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein