Thread: why does this use the wrong index?
I have two identical queries except for the date range. In the first case, with the wider date range, the correct (I believe) index is used. In the second case where the date range is smaller a different index is used and a less efficient plan is chosen. In the second query the problem seems to be CPU resoures; while it is running 1 core of the CPU is 100% busy. Note, if I drop the ad_log_date index then this query is always fast, but some other queries I do require that index. So, What can I do to encourage Postgres to use the first index even when the date range is smaller. # explain analyze SELECT name FROM players AS foo WHERE EXISTS (SELECT 1 FROM ad_log WHERE player = foo.id AND date(start_time) BETWEEN E'2008-09-14' AND E'2008-09-18' LIMIT 1) ORDER BY name; QUERY PLAN ---------------------------------------------------------------------------- ---------------------------------------------------------------------------- - Sort (cost=1573.74..1574.31 rows=230 width=13) (actual time=28.421..28.505 rows=306 loops=1) Sort Key: foo.name Sort Method: quicksort Memory: 28kB -> Seq Scan on players foo (cost=0.00..1564.72 rows=230 width=13) (actual time=0.104..27.876 rows=306 loops=1) Filter: (subplan) SubPlan -> Limit (cost=0.01..3.39 rows=1 width=0) (actual time=0.058..0.058 rows=1 loops=460) -> Index Scan using ad_log_player_date on ad_log (cost=0.01..34571.03 rows=10228 width=0) (actual time=0.056..0.056 rows=1 loops=460) Index Cond: ((player = $0) AND (date(start_time) >= '2008-09-14'::date) AND (date(start_time) <= '2008-09-18'::date)) Total runtime: 28.623 ms (10 rows) # explain analyze SELECT name FROM players AS foo WHERE EXISTS (SELECT 1 FROM ad_log WHERE player = foo.id AND date(start_time) BETWEEN E'2008-09-18' AND E'2008-09-18' LIMIT 1) ORDER BY name; QUERY PLAN ---------------------------------------------------------------------------- ----------------------------------------------------------------- Index Scan using players_name_key on players foo (cost=0.00..8376.84 rows=230 width=13) (actual time=813.695..143452.810 rows=301 loops=1) Filter: (subplan) SubPlan -> Limit (cost=0.01..18.14 rows=1 width=0) (actual time=311.846..311.846 rows=1 loops=460) -> Index Scan using ad_log_date on ad_log (cost=0.01..18.14 rows=1 width=0) (actual time=311.844..311.844 rows=1 loops=460) Index Cond: ((date(start_time) >= '2008-09-18'::date) AND (date(start_time) <= '2008-09-18'::date)) Filter: (player = $0) Total runtime: 143453.100 ms (8 rows) Thanks, --Rainer
> So, What can I do to encourage Postgres to use the first index even when the > date range is smaller. > It looks like PostgreSQL is estimating the selectivity of your date ranges poorly. In the second (bad) plan it estimates that the index scan with the filter will return 1 row (and that's probably because it estimates that the date range you specify will match only one row). This leads PostgreSQL to choose the narrower index because, if the index scan is only going to return one row anyway, it might as well scan the smaller index. What's the n_distinct for start_time? => select n_distinct from pg_stats where tablename='ad_log' and attname='start_time'; If n_distinct is near -1, that would explain why it thinks that it will only get one result. Based on the difference between the good index scan (takes 0.056ms per loop) and the bad index scan with the filter (311ms per loop), the "player" condition must be very selective, but PostgreSQL doesn't care because it already thinks that the date range is selective. Regards, Jeff Davis
On Fri, 2008-09-19 at 11:25 -0700, Jeff Davis wrote: > What's the n_distinct for start_time? Actually, I take that back. Apparently, PostgreSQL can't change "x BETWEEN y AND y" into "x=y", so PostgreSQL can't use n_distinct at all. That's your problem. If it's one day only, change it to equality and it should be fine. Regards, Jeff Davis