Thread: Column correlation drifts, index ignored again
This is a follow-up to an old thread of mine, but I can't find it now, so I'll just re-summarize.

I have a ~1 million row table that I mostly want to query by date range. The rows are spread pretty uniformly over a 3-year date range. I have an index on the date column, but it wasn't always used in the past. As a first fix, I disabled the seqscan plan before running my query, but it bothered me that I had to do that. Next, thanks to my earlier thread, I clustered the table on the date column and then "SET STATISTICS" on the date column to 100. That did the trick, and I stopped explicitly disabling seqscan.

Today, I noticed that Postgres (still 7.4) has stopped using the date index again. I checked the correlation for the date column and it was down to 0.4, so I guess that stat does drift away from 1.0 after clustering. That's a bummer, because clustering locks up the table while it works, which I can't really afford to do often. Even at a correlation of 0.4 on the date column, using the date index was still much faster than the seqscan plan that Postgres was choosing. Anyway, it's reclustering now.

A common query looks like this:

    SELECT SUM(amount), SUM(quantity), date_trunc('day', date) AS date
      FROM mytable
     WHERE col1 IS NOT NULL
       AND col2 = 'foo'
       AND col3 = 'bar'
       AND date BETWEEN '2004-02-01 00:00:00' AND '2004-02-28 23:59:59'
     GROUP BY date_trunc('day', date)
     ORDER BY date;

With the date index in use, the EXPLAIN ANALYZE output looks like this:

 Sort  (cost=4781.75..4824.15 rows=16963 width=23) (actual time=2243.595..2243.619 rows=21 loops=1)
   Sort Key: date_trunc('day'::text, date)
   ->  HashAggregate  (cost=3462.87..3590.09 rows=16963 width=23) (actual time=2241.773..2243.454 rows=21 loops=1)
         ->  Index Scan using mytable_date_idx on mytable  (cost=0.00..3071.70 rows=52155 width=23) (actual time=2.610..1688.111 rows=49679 loops=1)
               Index Cond: ((date >= '2004-02-01 00:00:00'::timestamp without time zone) AND (date <= '2004-02-28 23:59:59'::timestamp without time zone))
               Filter: ((col1 IS NOT NULL) AND ((col2)::text = 'foo'::text) AND ((col3)::text = 'bar'::text))
 Total runtime: 2244.391 ms

Unfortunately, since I just re-clustered, I can't get the old EXPLAIN output, but just imagine "Seq Scan" in place of "Index Scan using mytable_date_idx" to get the idea.

My question is: what other options do I have? Should I "SET STATISTICS" on the date column to 200? 500? The maximum value of 1000? I want to do something that will convince Postgres that using the date index is, by far, the best plan when running my queries, even when the date column correlation stat drops well below 1.0.

-John
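For reference, the maintenance sequence John describes (raising the per-column statistics target, then re-clustering on the date index) would look roughly like this on 7.4. The table, column, and index names are taken from the EXPLAIN output above; the ANALYZE afterwards is what actually refreshes the correlation statistic the planner looks at:

    -- Raise the statistics target on the date column (100, per the thread).
    ALTER TABLE mytable ALTER COLUMN date SET STATISTICS 100;

    -- Physically reorder the table by the date index.  This takes an
    -- exclusive lock for the duration, which is the "locks up the table"
    -- problem John mentions.
    CLUSTER mytable_date_idx ON mytable;

    -- Refresh statistics (including the correlation estimate) so the
    -- planner sees the newly clustered ordering.
    ANALYZE mytable;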
On Saturday 21 February 2004 16:18, John Siracusa wrote:

John,

> Next, thanks to my earlier thread, I clustered the table on the date
> column and then "SET STATISTICS" on the date column to be 100. That
> did the trick, and I stopped explicitly disabling seqscan.

100? Are you sure you don't mean some other number? 100 is not very high for problem ANALYZE issues. You might try 500. Generally when I have a problem query I raise stats to something like 1000 and drop it down until the problem behaviour starts re-appearing.

> date_trunc('day', date) AS date

Have you tried putting an index on date_trunc('day', date) and querying on that instead of using this:

> date BETWEEN '2004-02-01 00:00:00' AND '2004-02-28 23:59:59'

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
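For concreteness, Josh's suggestion amounts to an expression index (supported in 7.4) plus rewriting the query to filter on the same expression. The index name below is just a placeholder, and, as John points out in his reply, this only helps queries that restrict at whole-day granularity:

    CREATE INDEX mytable_date_day_idx ON mytable (date_trunc('day', date));
    ANALYZE mytable;

    -- The WHERE clause must then use the identical expression:
    SELECT SUM(amount), SUM(quantity), date_trunc('day', date) AS date
      FROM mytable
     WHERE col1 IS NOT NULL
       AND col2 = 'foo'
       AND col3 = 'bar'
       AND date_trunc('day', date) BETWEEN '2004-02-01' AND '2004-02-28'
     GROUP BY date_trunc('day', date)
     ORDER BY date;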
On 2/22/04 2:05 PM, Josh Berkus wrote:
> On Saturday 21 February 2004 16:18, John Siracusa wrote:
>> Next, thanks to my earlier thread, I clustered the table on the date
>> column and then "SET STATISTICS" on the date column to be 100. That
>> did the trick, and I stopped explicitly disabling seqscan.
>
> 100? Are you sure you don't mean some other number? 100 is not very high
> for problem analyze issues. You might try 500.

IIRC, 100 was the number suggested in the earlier thread. I did set it to 500 yesterday, I believe. We'll see how that goes.

> Generally when I have a problem query I raise stats to something like 1000 and
> drop it down until the problem behaviour starts re-appearing.

Since this problem takes a long time to appear (months), that cycle could take a long time... :)

>> date_trunc('day', date) AS date
>
> Have you tried putting an index on date_trunc('day', date) and querying on
> that instead of using this:
>
>> date BETWEEN '2004-02-01 00:00:00' AND '2004-02-28 23:59:59'

No, but then I'd just have a different index to persuade the planner to use :) Not every query does date_trunc() stuff, but they all do date ranges, often at a granularity of seconds.

-John
John Siracusa <siracusa@mindspring.com> writes:
> I want to do something that will convince Postgres that using the date
> index is, by far, the best plan when running my queries, even when the
> date column correlation stat drops well below 1.0.

Have you tried experimenting with random_page_cost? Seems like your results suggest that you need to lower it.

			regards, tom lane
On 2/22/04 5:06 PM, Tom Lane wrote:
> John Siracusa <siracusa@mindspring.com> writes:
>> I want to do something that will convince Postgres that using the date
>> index is, by far, the best plan when running my queries, even when the
>> date column correlation stat drops well below 1.0.
>
> Have you tried experimenting with random_page_cost? Seems like your
> results suggest that you need to lower it.

I don't want to do anything that "universal" if I can help it, because I don't want to adversely affect any other queries that the planner currently aces.

I'm guessing that the reason using the date index is always so much faster is that doing so only reads the rows in the date range (say, 1,000 of them) instead of reading every single row in the table (1,000,000) as in a seqscan plan. I think the key is to get the planner to correctly ballpark the number of rows in the date range. If it does, I can't imagine it ever deciding to read 1,000,000 rows instead of 1,000 with any sane "cost" setting. I'm assuming the defaults are sane :)

-John
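One way to experiment with Tom's suggestion without making it "universal" is to change random_page_cost only for the current session; SET affects just the backend that issues it, so other queries keep the configured default. The value 2 below is only an arbitrary trial value, not a recommendation:

    SET random_page_cost = 2;   -- session-local; other backends keep the default

    EXPLAIN ANALYZE
    SELECT SUM(amount), SUM(quantity), date_trunc('day', date) AS date
      FROM mytable
     WHERE col1 IS NOT NULL
       AND col2 = 'foo'
       AND col3 = 'bar'
       AND date BETWEEN '2004-02-01 00:00:00' AND '2004-02-28 23:59:59'
     GROUP BY date_trunc('day', date)
     ORDER BY date;

    RESET random_page_cost;     -- back to the configured default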
John Siracusa <siracusa@mindspring.com> writes:
> I think the key is to get the planner to correctly ballpark the number of
> rows in the date range.

I thought it was. What you showed was

  ->  Index Scan using mytable_date_idx on mytable  (cost=0.00..3071.70 rows=52155 width=23) (actual time=2.610..1688.111 rows=49679 loops=1)

which seemed plenty close enough to me.

			regards, tom lane
On 2/22/04 6:40 PM, Tom Lane wrote:
> John Siracusa <siracusa@mindspring.com> writes:
>> I think the key is to get the planner to correctly ballpark the number of
>> rows in the date range.
>
> I thought it was. What you showed was
>
>   ->  Index Scan using mytable_date_idx on mytable  (cost=0.00..3071.70 rows=52155
>       width=23) (actual time=2.610..1688.111 rows=49679 loops=1)
>
> which seemed plenty close enough to me.

That's after the planner correctly chooses the date index. Unfortunately, I forgot to save the EXPLAIN output from when it was choosing seqscan instead. Does the planner get estimates from both plans before deciding whether or not to use the one that references the date index?

-John
John,

> I think the key is to get the planner to correctly ballpark the number of
> rows in the date range. If it does, I can't imagine it ever deciding to
> read 1,000,000 rows instead of 1,000 with any sane "cost" setting. I'm
> assuming the defaults are sane :)

The default for random_page_cost is sane, but very conservative; it's pretty much assuming tables that are bigger than RAM and a single IDE disk. If your setup is better than that, you can lower it.

For example, in the ideal case (database fits in RAM, fast RAM, CPU, and random seek on the disk), you can lower it to 1.5. For less ideal situations, 1.8 to 2.5 is reasonable on high-end hardware.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco
John Siracusa <siracusa@mindspring.com> writes:
> Does the planner get estimates from both plans before deciding whether or
> not to use the one that references the date index?

The rowcount estimate is made prior to the plan cost estimate, much less the plan selection. So you'd see the same number either way.

			regards, tom lane
Josh Berkus wrote:
> John,
>
> > I think the key is to get the planner to correctly ballpark the number of
> > rows in the date range. If it does, I can't imagine it ever deciding to
> > read 1,000,000 rows instead of 1,000 with any sane "cost" setting. I'm
> > assuming the defaults are sane :)
>
> The default for random_page_cost is sane, but very conservative; it's pretty
> much assuming tables that are bigger than RAM and a single IDE disk. If
> your setup is better than that, you can lower it.
>
> For example, in the ideal case (database fits in RAM, fast RAM, CPU, and
> random seek on the disk), you can lower it to 1.5. For less ideal
> situations, 1.8 to 2.5 is reasonable on high-end hardware.

I suspect this ultimately depends on the types of queries you do, the size of the tables involved, disk cache, etc.

For instance, if you don't have sort_mem set high enough, then things like large hash joins will spill to disk and almost certainly cause a lot of contention (random access patterns) even if a sequential scan is being used to read the table data. The fix there is, of course, to increase sort_mem if possible (as long as you don't cause paging during the operation, which will also slow things down), but you might not really have that option -- in which case you might see some improvement by tweaking random_page_cost.

On a system where the PG data is stored on a disk that does other things, you'll actually want random_page_cost to be *closer* to 1 rather than further away. The reason is that the average access time of a sequential page in that case is much closer to that of a random page than it would be if the disk in question were dedicated to PG duty. This also goes for large RAID setups where multiple types of data (e.g., home directories, log files, etc.) are stored along with the PG data -- such disk setups will have more random activity happening on the disk while PG activity is happening, thus making the PG sequential access pattern appear more like random access.

The best way I can think of to tune random_page_cost is to do EXPLAIN ANALYZE on the queries you want to optimize the most, under the circumstances the queries are most likely to be run, then do the same with enable_seqscan off. Then look at the ratio of predicted and actual times for the scans themselves. Once you've done that, you can tweak random_page_cost up or down and do further EXPLAINs (with enable_seqscan off and without ANALYZE) until the ratio of the estimated index scan time to the actual index scan time of the same query (gotten previously via EXPLAIN ANALYZE) is the same as the ratio of the estimated sequential scan time (which won't change based on random_page_cost) to the actual sequential scan time.

So:

1.  set enable_seqscan = on
2.  set random_page_cost = <some really high value to force seqscans>
3.  EXPLAIN ANALYZE query
4.  record the ratio of estimated to actual scan times.
5.  set enable_seqscan = off
6.  set random_page_cost = <rough estimate of what it should be>
7.  EXPLAIN ANALYZE query
8.  record the actual index scan time(s)
9.  tweak random_page_cost
10. EXPLAIN query
11. If ratio of estimate to actual (recorded in step 8) is much
    different than that recorded in step 4, then go back to step 9.
    Reduce random_page_cost if the random ratio is larger than the
    sequential ratio, increase if it's smaller.

As a result, I ended up setting my random_page_cost to 1.5 on my system.
I suspect that the amount of pain you'll suffer when the planner incorrectly chooses a sequential scan is much greater on average than the amount of pain if it incorrectly chooses an index scan, so I'd tend to favor erring on the low side for random_page_cost.

I'll know tomorrow whether or not my tweaking worked properly, as I have a job that kicks off every night that scans the entire filesystem and stores all the inode information about every file in a newly-created table, then "merges" it into the existing file information table. Each table is about 2.5 million rows...

-- 
Kevin Brown	kevin@sysexperts.com
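As a concrete (if hypothetical) walk-through of Kevin's procedure, using John's query from the top of the thread; the particular random_page_cost values below are arbitrary starting points, not recommendations:

    -- Steps 1-4: force the sequential scan and record the ratio of the
    -- estimated seqscan time to the actual seqscan time.
    SET enable_seqscan = on;
    SET random_page_cost = 100;   -- any value high enough to force the seqscan
    EXPLAIN ANALYZE
    SELECT SUM(amount), SUM(quantity), date_trunc('day', date) AS date
      FROM mytable
     WHERE col1 IS NOT NULL AND col2 = 'foo' AND col3 = 'bar'
       AND date BETWEEN '2004-02-01 00:00:00' AND '2004-02-28 23:59:59'
     GROUP BY date_trunc('day', date)
     ORDER BY date;

    -- Steps 5-8: switch to the index plan and record its actual time once.
    SET enable_seqscan = off;
    SET random_page_cost = 2;     -- rough first guess
    -- ...rerun the same EXPLAIN ANALYZE statement as above...

    -- Steps 9-11: iterate with plain EXPLAIN (no ANALYZE), since only the
    -- estimates change; adjust random_page_cost until the index scan's
    -- estimate/actual ratio matches the seqscan ratio recorded in step 4.
    SET random_page_cost = 1.8;
    -- ...rerun the same statement with EXPLAIN only...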
Kevin,

> 1. set enable_seqscan = on
> 2. set random_page_cost = <some really high value to force seqscans>
> 3. EXPLAIN ANALYZE query
> 4. record the ratio of estimated to actual scan times.
> 5. set enable_seqscan = off
> 6. set random_page_cost = <rough estimate of what it should be>
> 7. EXPLAIN ANALYZE query
> 8. record the actual index scan time(s)
> 9. tweak random_page_cost
> 10. EXPLAIN query
> 11. If ratio of estimate to actual (recorded in step 8) is much
>     different than that recorded in step 4, then go back to step 9.
>     Reduce random_page_cost if the random ratio is larger than the
>     sequential ratio, increase if it's smaller.

Nice, we ought to post that somewhere people can find it in the future.

I'm also glad that your new job allows you to continue doing PostgreSQL stuff.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> Kevin,
>> 1. set enable_seqscan = on
>> 2. set random_page_cost = <some really high value to force seqscans>
>> 3. EXPLAIN ANALYZE query
>> 4. record the ratio of estimated to actual scan times.
>> 5. set enable_seqscan = off
>> 6. set random_page_cost = <rough estimate of what it should be>
>> 7. EXPLAIN ANALYZE query
>> 8. record the actual index scan time(s)
>> 9. tweak random_page_cost
>> 10. EXPLAIN query
>> 11. If ratio of estimate to actual (recorded in step 8) is much
>>     different than that recorded in step 4, then go back to step 9.
>>     Reduce random_page_cost if the random ratio is larger than the
>>     sequential ratio, increase if it's smaller.

> Nice, we ought to post that somewhere people can find it in the future.

If we post it as recommended procedure we had better put big caveat notices on it. The pitfalls with doing this are:

1. If you repeat the sequence exactly as given, you will be homing in on a RANDOM_PAGE_COST that describes your system's behavior with a fully cached query. It is to be expected that you will end up with 1.0 or something very close to it. The only way to avoid that is to use a query that is large enough to blow out your kernel's RAM cache; which of course will take long enough that iterating step 10 will be no fun, and people will be mighty tempted to take shortcuts.

2. Of course, you are computing a RANDOM_PAGE_COST that is relevant to just this single query. Prudence would suggest repeating the process with several different queries and taking some sort of average.

When I did the experiments that led up to choosing 4.0 as the default, some years ago, it took several days of thrashing the disks on a couple of different machines before I had numbers that I didn't think were mostly noise :-(. I am *real* suspicious of any replacement numbers that have been derived in just a few minutes.

			regards, tom lane
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Kevin,
> >> 1. set enable_seqscan = on
> >> 2. set random_page_cost = <some really high value to force seqscans>
> >> 3. EXPLAIN ANALYZE query
> >> 4. record the ratio of estimated to actual scan times.
> >> 5. set enable_seqscan = off
> >> 6. set random_page_cost = <rough estimate of what it should be>
> >> 7. EXPLAIN ANALYZE query
> >> 8. record the actual index scan time(s)
> >> 9. tweak random_page_cost
> >> 10. EXPLAIN query
> >> 11. If ratio of estimate to actual (recorded in step 8) is much
> >>     different than that recorded in step 4, then go back to step 9.
> >>     Reduce random_page_cost if the random ratio is larger than the
> >>     sequential ratio, increase if it's smaller.
>
> > Nice, we ought to post that somewhere people can find it in the future.
>
> If we post it as recommended procedure we had better put big caveat
> notices on it. The pitfalls with doing this are:
>
> 1. If you repeat the sequence exactly as given, you will be homing in on
> a RANDOM_PAGE_COST that describes your system's behavior with a fully
> cached query. It is to be expected that you will end up with 1.0 or
> something very close to it. The only way to avoid that is to use a
> query that is large enough to blow out your kernel's RAM cache; which of
> course will take long enough that iterating step 10 will be no fun,
> and people will be mighty tempted to take shortcuts.

Oops. You're right. I did this on my system, but forgot to put it in the list of things to do:

0. Fill the page cache with something other than PG data, e.g. by repeatedly catting several large files and redirecting the output to /dev/null. The sum total size of the files should exceed the amount of memory on the system.

The reason you might not have to do this between EXPLAIN ANALYZE queries is that the first query will scan the table itself while the second one will scan the index. But that was probably more specific to the query I was doing. If the one you're doing is complex enough, the system may have to read data pages from the table itself after fetching the index page, in which case you'll want to fill the page cache between the queries.

> 2. Of course, you are computing a RANDOM_PAGE_COST that is relevant to
> just this single query. Prudence would suggest repeating the process
> with several different queries and taking some sort of average.

Right. And the average should probably be weighted based on the relative frequency that the query in question will be executed. In my case, the query I was experimenting with was by far the biggest query that occurs on my system (though it turns out that there are others in that same process that I should look at as well).

> When I did the experiments that led up to choosing 4.0 as the default,
> some years ago, it took several days of thrashing the disks on a couple
> of different machines before I had numbers that I didn't think were
> mostly noise :-(. I am *real* suspicious of any replacement numbers
> that have been derived in just a few minutes.

One problem I've been running into is the merge join spilling to disk because sort_mem isn't big enough. The problem isn't that this is happening, it's that I think the planner is underestimating the impact that doing this will have on the time the merge join takes. Does the planner even account for the possibility that a sort or join will spill to disk?
Spilling to disk like that will suddenly cause sequential reads to perform much more like random reads, unless the sequential scans are performed in their entirety between sorts/merges.

In any case, one thing that none of this really accounts for is that it's better to set random_page_cost too low than too high. The reason is that index scans are more selective than sequential scans: a sequential scan will read the entire table every time, whereas an index scan will read only the index pages (and their parents) that match the query. My experience is that when the planner improperly computes the selectivity of the query (e.g., by not having good enough or sufficiently up-to-date statistics), it generally computes a lower selectivity than the query actually represents, and thus selects a sequential scan when an index scan would be more efficient. The auto vacuum daemon helps in this regard, by keeping the statistics more up to date.

Certainly you shouldn't go overboard by setting random_page_cost too low "just in case", but it does mean that if you go through the process of running tests to determine the proper value for random_page_cost, you should probably select a random_page_cost that's in the lower part of the range of values you got.

-- 
Kevin Brown	kevin@sysexperts.com
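On 7.4 the knob in question is sort_mem, which is measured in kilobytes and applies per sort operation; if there is memory to spare, it can be raised just for the session running the big merge join rather than server-wide. The 65536 below is purely illustrative:

    SET sort_mem = 65536;   -- 64 MB for this session only (the default is 1024, i.e. 1 MB)
    -- ...run the large merge-join query here...
    RESET sort_mem;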
On Tuesday February 24 2004 1:14, Kevin Brown wrote:
>
> One problem I've been running into is the merge join spilling to disk
> because sort_mem isn't big enough. The problem isn't that this is
> happening, it's that I think the planner is underestimating the impact
> that doing this will have on the time the merge join takes. Does the
> planner even account for the possibility that a sort or join will spill
> to disk? Spilling to disk like that will suddenly cause sequential
> reads to perform much more like random reads, unless the sequential
> scans are performed in their entirety between sorts/merges.

How do you know the merge join is spilling to disk? How are you identifying that? Just assuming from vmstat? iostat?
Ed L. wrote:
> How do you know the merge join is spilling to disk? How are you identifying
> that? Just assuming from vmstat? iostat?

The existence of files in $PG_DATA/base/<db-oid>/pgsql_tmp while the query is running, combined with the EXPLAIN output (which shows what sorts and joins are being performed).

-- 
Kevin Brown	kevin@sysexperts.com
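For anyone who wants to check the same thing, the <db-oid> part of that path can be looked up from inside the database rather than guessed; this is just a convenience query:

    SELECT oid, datname
      FROM pg_database
     WHERE datname = current_database();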
Kevin Brown <kevin@sysexperts.com> writes:
> One problem I've been running into is the merge join spilling to disk
> because sort_mem isn't big enough. The problem isn't that this is
> happening, it's that I think the planner is underestimating the impact
> that doing this will have on the time the merge join takes. Does the
> planner even account for the possibility that a sort or join will spill
> to disk?

Yes it does. I thought it was making a pretty good estimate, actually. The only obvious hole in the assumptions is

	 * The disk traffic is assumed to be half sequential and half random
	 * accesses (XXX can't we refine that guess?)

Because of the way that tuplesort.c works, the first merge pass should be pretty well sequential, but I think the passes after that might be mostly random from the kernel's viewpoint :-(. Possibly the I/O cost should be adjusted depending on how many merge passes we expect.

> In any case, one thing that none of this really accounts for is that
> it's better to set random_page_cost too low than too high.

That depends on what you are doing, although I will concede that a lot of people are doing things where indexscans should be favored.

			regards, tom lane