Thread: Use my (date) index, darn it!
I have a very large table (about a million rows) which I most frequently want to select a subset of rows from base on a date field. That date field is indexed, and when Postgres uses that index, queries are fast. But sometimes it decides not to use the index, resorting to a sequential scan instead. This is really, really slow. To try to convince it to use my date index, I turned off the sequential scan strategy in the planner. That worked on one copy of the db, but not on another, where it decided to use an index from another column entirely, which didn't help performance. I dropped the other index, leaving only the date index, and performance was good again. Obviously the planner is making some bad choices here. I know that it is trying to avoid random seeks or other scary things implied by a "correlation" statistic that is not close to 1 or -1, but it is massively overestimating the hit caused by those seeks and seemingly not taking into account the size of the table! This is Postgres 7.4 on Linux and Mac OS X, BTW. Anyway, to "fix" the situation, I clustered the table on the date column. But I fear that the data will slowly "drift" back to a state where the planner decides again that a sequential scan is a good idea. Blah. So, my question: how can I prevent this? Ideally, the planner should be smarter. Failing that, I'd like to be able to force it to use the index that I know will result in the fastest queries (3 seconds vs. 30 seconds in many cases). Suggestions? -John
John Siracusa <siracusa@mindspring.com> writes: > Obviously the planner is making some bad choices here. A fair conclusion ... > I know that it is trying to avoid random seeks or other scary things > implied by a "correlation" statistic that is not close to 1 or -1, but > it is massively overestimating the hit caused by those seeks and > seemingly not taking into account the size of the table! You haven't given any evidence to support these conclusions, though. Could we see some table schemas, EXPLAIN ANALYZE output, and relevant pg_stats entries for the various cases? regards, tom lane
On 1/5/04 1:55 AM, Tom Lane wrote: > John Siracusa <siracusa@mindspring.com> writes: >> Obviously the planner is making some bad choices here. > > A fair conclusion ... > >> I know that it is trying to avoid random seeks or other scary things >> implied by a "correlation" statistic that is not close to 1 or -1, but >> it is massively overestimating the hit caused by those seeks and >> seemingly not taking into account the size of the table! > > You haven't given any evidence to support these conclusions, though. Well here's what I was basing that theory on: before clustering, the correlation for the date column was around 0.3. After clustering, it was 1, and the index was always used. Does clustering change any other statistics other that correlation? I ran analyze immediately before and after the cluster operation. > Could we see some table schemas, EXPLAIN ANALYZE output, and relevant > pg_stats entries for the various cases? Well, the table is clustered now, so I can't reproduce the situation. Is there any way to "uncluster" a table? Should I just cluster it on a different column? -John
John Siracusa <siracusa@mindspring.com> writes: > Is there any way to "uncluster" a table? Should I just cluster it on a > different column? That should work, if you choose one that's uncorrelated with the previous clustering attribute. regards, tom lane
After a long battle with technology, siracusa@mindspring.com (John Siracusa), an earthling, wrote: > On 1/5/04 1:55 AM, Tom Lane wrote: >> John Siracusa <siracusa@mindspring.com> writes: >>> Obviously the planner is making some bad choices here. >> >> A fair conclusion ... >> >>> I know that it is trying to avoid random seeks or other scary things >>> implied by a "correlation" statistic that is not close to 1 or -1, but >>> it is massively overestimating the hit caused by those seeks and >>> seemingly not taking into account the size of the table! >> >> You haven't given any evidence to support these conclusions, though. > > Well here's what I was basing that theory on: before clustering, the > correlation for the date column was around 0.3. After clustering, it was 1, > and the index was always used. Does clustering change any other statistics > other that correlation? I ran analyze immediately before and after the > cluster operation. > >> Could we see some table schemas, EXPLAIN ANALYZE output, and relevant >> pg_stats entries for the various cases? > > Well, the table is clustered now, so I can't reproduce the situation. Is > there any way to "uncluster" a table? Should I just cluster it on a > different column? That would presumably work... It sounds to me as though the statistics that are being collected aren't "good enough." That tends to be a sign that the quantity of statistics (e.g. - bins in the histogram) are insufficient. This would be resolved by changing the number of bins (default of 10) via "ALTER TABLE FOO ALTER COLUMN BAR SET STATISTICS 100" (or some other value higher than 10). Clustering would rearrange the contents of the table, and perhaps make the histogram 'more representative.' Increasing the "SET STATISTICS" value will quite likely be even more helpful, and is a lot less expensive than clustering the table... -- If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me http://www.ntlug.org/~cbbrowne/nonrdbms.html Rules of the Evil Overlord #158. "I will exchange the labels on my folder of top-secret plans and my folder of family recipes. Imagine the hero's surprise when he decodes the stolen plans and finds instructions for Grandma's Potato Salad." <http://www.eviloverlord.com/>
On 1/5/04 11:45 AM, Christopher Browne wrote: > It sounds to me as though the statistics that are being collected > aren't "good enough." That tends to be a sign that the quantity of > statistics (e.g. - bins in the histogram) are insufficient. > > This would be resolved by changing the number of bins (default of 10) > via "ALTER TABLE FOO ALTER COLUMN BAR SET STATISTICS 100" (or some > other value higher than 10). I did that, but I wasn't sure what value to use and what column to increase. I believe I increased the date column itself to 50 or something, but then I wasn't sure what to do next. I re-analyzed the table with the date column set to 50 but it didn't seem to help, so I resorted to clustering. > Clustering would rearrange the contents of the table, and perhaps make > the histogram 'more representative.' Increasing the "SET STATISTICS" > value will quite likely be even more helpful, and is a lot less > expensive than clustering the table... What column(s) should I increase? Do I have to do anything after increasing the statistics, or do I just wait for the stats collector to do its thing? -John
In the last exciting episode, siracusa@mindspring.com (John Siracusa) wrote: > What column(s) should I increase? Do I have to do anything after increasing > the statistics, or do I just wait for the stats collector to do its thing? You have to ANALYZE the table again, to force in new statistics. And if the index in question is on _just_ the date column, then it is probably only that date column where the "SET STATISTICS" needs to be increased. -- let name="cbbrowne" and tld="cbbrowne.com" in name ^ "@" ^ tld;; http://www.ntlug.org/~cbbrowne/sap.html Faith is the quality that enables you to eat blackberry jam on a picnic without looking to see whether the seeds move. -- DeMara Cabrera