Re: inconsistent/weird index usage - Mailing list pgsql-performance
From | John Meinel |
---|---|
Subject | Re: inconsistent/weird index usage |
Date | |
Msg-id | 415D614D.7010001@johnmeinel.com Whole thread Raw |
In response to | inconsistent/weird index usage (Dustin Sallings <dustin@spy.net>) |
List | pgsql-performance |
Dustin Sallings wrote: > [...] > OK, now that that's over with, I have this one particular query that > I attempt to run for a report from my phone that no longer works because > it tries to do a table scan on *some* of the tables. Why it chooses > this table scan, I can't imagine. The query is as follows: > > select > s.serial as serial_num, > s.name as name, > date(ts) as day, > min(sample) as min_temp, > avg(sample) as avg_temp, > stddev(sample) as stddev_temp, > max(sample) as max_temp > from > samples inner join sensors s using (sensor_id) > where > ts > current_date - 7 > group by > serial_num, name, day > order by > serial_num, day desc > > [ next section heavily clipped for clarity ] -> Seq Scan on samples_1999 (cost rows=103677) (actual rows=0 loops=1) -> Index Scan using samples_2000_bytime on samples_2000 (cost rows=714082 (actual rows=0 loops=1) -> Seq Scan on samples_2001 (cost rows=902191) (actual rows=0 loops=1) -> Seq Scan on samples_2002 (cost rows=1037201) (actual rows=0 loops=1) -> Index Scan using samples_2003_bytime on samples_2003 (cost rows=1049772) (actual rows=0 loops=1) -> Index Scan using samples_2004_bytime on samples_2004 (cost rows=791991) (actual rows=66389 loops=1) [...] > > > Essentially, what you can see here is that it's doing an index scan > on samples_2000, samples_2003, and samples_2004, but a sequential scan > on samples_1999, samples_2001, and samples_2002. It's very strange to > me that it would make these choices. If I disable sequential scans > altogether for this session, the query runs in under 4 seconds. > > This is a very cool solution for long-term storage, and isn't > terribly hard to manage. I actually have other report queries that seem > to be making pretty good index selection currently...but I always want > more! :) Does anyone have any suggestions as to how to get this to do > what I want? > > Of course, ideally, it would ignore five of the tables altogether. :) > > -- > SPY My girlfriend asked me which one I like better. > pub 1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net> > | Key fingerprint = 87 02 57 08 02 D0 DA D6 C8 0F 3E 65 51 98 D8 BE > L_______________________ I hope the answer won't upset her. ____________ > > Just as a heads up. You have run vacuum analyze before running this query, correct? Because you'll notice that the query planner is thinking that it will have 103677 rows from 1999, 700,000 rows from 2000, 900,000 rows from 2001, etc, etc. Obviously the query planner is not planning well considering it there are only 60,000 rows from 2004, and no rows from anything else. It just seems like it hasn't updated it's statistics to be aware of when the time is on most of the tables. (By the way, an indexed scan returning 0 entries is *really* fast, so I wouldn't worry about ignoring the extra tables. :) I suppose the other question is whether this is a prepared or stored query. Because sometimes the query planner cannot do enough optimization in a stored query. (I ran into this problem where I had 1 column with 500,000+ entries referencing 1 number. If I ran manually, the time was much better because I wasn't using *that* number. With a stored query, it had to take into account that I *might* use that number, and didn't want to do 500,000+ indexed lookups) The only other thing I can think of is that there might be some collision between datetime and date. Like it is thinking it is looking at the time of day when it plans the queries (hence why so many rows), but really it is looking at the date. Perhaps a cast is in order to make it work right. I don't really know. Interesting problem, though. John =:->
Attachment
pgsql-performance by date: