Re: Re. [HACKERS] Some notes on optimizer cost estimates - Mailing list pgsql-hackers
From | Don Baccus
---|---
Subject | Re: Re. [HACKERS] Some notes on optimizer cost estimates
Date |
Msg-id | 3.0.1.32.20000121081044.01036290@mail.pacifier.com
In response to | Re. [HACKERS] Some notes on optimizer cost estimates (Xun Cheng <xun@cs.ucsb.edu>)
List | pgsql-hackers
At 06:19 PM 1/20/00 -0800, Xun Cheng wrote:

>I'm very glad you bring up this cost estimate issue.
>Recent work in database research has argued that a more
>detailed disk access cost model should be used for
>large queries, especially joins.
>Traditional cost estimation considers only the number of
>disk pages accessed. A more detailed model
>would consider three parameters: avg. seek, avg. latency,
>and avg. page transfer. For an old disk, typical values are
>SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6 ms.
>A sequential continuous read of a table (assuming
>1000 contiguous pages) would cost
>(SEEK+LATENCY+1000*TRANSFER = 2617.8 ms), while quasi-randomly
>reading 200 times with 2 contiguous pages each time would
>cost (SEEK+200*LATENCY+400*TRANSFER ≈ 2700 ms).
>Someone from an IBM lab re-studied the traditional
>ad hoc join algorithms (nested, sort-merge, hash) using the detailed cost model
>and found some interesting results.

One complication when doing an index scan is that you are accessing two separate files (table and index), which can frequently be expected to cause a considerable increase in average seek time. Oracle and other commercial databases recommend spreading indices and tables over several spindles if at all possible, in order to minimize this effect. I suspect it also helps their optimizer make decisions that are more consistently good for customers with the largest and most complex databases and queries, by making cost estimates more predictably reasonable.

Still...this doesn't help with the question about the effect of the filesystem cache. I wandered around the web for a little bit last night, and found one summary of a paper by Ousterhout on the effect of the Solaris cache on a fileserver serving diskless workstations. There was reference to the hierarchy involved (i.e. the local workstation cache is faster than the fileserver's cache, which has to be read via the network, which in turn is faster than reading from the fileserver's disk).
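Plugging the quoted figures into that model is a quick back-of-the-envelope check (the formulas are exactly the ones given above; note that the quasi-random case actually works out to 2709.5 ms, which the post rounds to ~2700 ms):

```python
# "Old disk" figures quoted above, all in milliseconds.
SEEK = 9.5      # avg. seek time
LATENCY = 8.3   # avg. rotational latency
TRANSFER = 2.6  # avg. page transfer time

# Sequential scan: one seek, one rotational delay, then 1000 contiguous pages.
sequential = SEEK + LATENCY + 1000 * TRANSFER

# Quasi-random scan: 200 reads of 2 contiguous pages each; the quoted model
# charges one initial seek, one rotational delay per read, and 400 transfers.
quasi_random = SEEK + 200 * LATENCY + 400 * TRANSFER

print(f"sequential:   {sequential:.1f} ms")    # 2617.8 ms
print(f"quasi-random: {quasi_random:.1f} ms")  # 2709.5 ms
```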
It appears the rule-of-thumb for the cache-hit ratio on reads used in their calculations, presumably based on measuring some internal Sun systems, was 80%. Just a datapoint to think about.

There's also considerable operating-system theory on paging systems that might be useful for thinking about how to estimate the Postgres cache-hit ratio. Then again, maybe Postgres could just keep count of how many pages of a given table are in the cache at any given time? Or simply keep track of the current ratio of hits and misses?

>>I have been spending some time measuring actual runtimes for various
>>sequential-scan and index-scan query plans, and have learned that the
>>current Postgres optimizer's cost estimation equations are not very
>>close to reality at all.

>One interesting question I'd like to ask is whether this non-closeness
>really affects the optimal choice of postgresql's query optimizer,
>and to what degree. My point is that
>if the optimizer estimated the cost for a sequential scan at 10 and
>the cost for an index scan at 20, while the actual costs are 10 vs. 40,
>it should be OK, because the optimizer would still choose the sequential scan,
>as it should.

This is crucial, of course - if there are only two types of scans available, whatever heuristic is used only has to be accurate enough to pick the right one. Once the choice is made, it doesn't really matter (from the optimizer's POV) just how long it will actually take; the time will be spent, and presumably it will be shorter than the alternative.

How frequently will the optimizer choose wrongly if:

1. All of the tables and indices are in the PG buffer cache or filesystem cache (i.e. fixed access times for both types of scans)?

or

2. The table's so big that only a small fraction can reside in RAM during the scan and join, which means that the non-sequential disk access pattern of the index scan is much more expensive?
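One way to frame those two cases is to weight the per-page cost of an index scan by an assumed cache-hit ratio and see where the plan choice flips relative to a sequential scan. This is a hypothetical sketch with made-up per-page costs, not the actual Postgres cost equations:

```python
# Illustrative per-page costs in ms -- assumptions, not Postgres's numbers.
SEQ_PAGE = 2.6      # sequential page read: transfer time only
RANDOM_PAGE = 20.4  # random page read: seek + latency + transfer
CACHED_PAGE = 0.1   # page already in the buffer or filesystem cache

def seq_scan_cost(pages):
    return pages * SEQ_PAGE

def index_scan_cost(pages_fetched, hit_ratio):
    # Expected cost per fetched page, weighted by the cache-hit ratio.
    per_page = hit_ratio * CACHED_PAGE + (1 - hit_ratio) * RANDOM_PAGE
    return pages_fetched * per_page

pages, fetched = 1000, 300
for hit in (1.0, 0.8, 0.0):
    idx = index_scan_cost(fetched, hit)
    seq = seq_scan_cost(pages)
    choice = "index" if idx < seq else "seq"
    print(f"hit ratio {hit:.0%}: index={idx:7.1f} ms  seq={seq:.1f} ms -> {choice}")
```

With these numbers the choice only flips when the hit ratio collapses toward zero (case 2 above), which is the point of the quoted argument: the estimates don't have to be close to reality, they only have to preserve the ordering of the alternatives.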
Also, if you pick sequential scans more frequently based on a presumption that index scans are expensive due to increased average seek time, how often will this penalize the heavy-duty user who invests in extra drives and lots of RAM?

...

>>The current cost estimation
>>method essentially assumes that the buffer cache plus OS disk cache will
>>be 100% efficient --- we will never have to read the same page of the
>>main table twice in a scan, due to having discarded it between
>>references. This of course is unreasonably optimistic. Worst case
>>is that we'd fetch a main-table page for each selected tuple, but in
>>most cases that'd be unreasonably pessimistic.

>This is actually the motivation for my asking earlier whether postgresql
>has a raw disk facility. That way we have much more control over this cache
>issue. Of course, a raw disk facility will only be worthwhile (besides the
>recoverability) if we can provide some algo. better than the OS
>cache algo. (depending on the context, like large joins).

Postgres does have control over its buffer cache. The one thing that raw disk I/O would give you is control over where blocks are placed, meaning you could more accurately model the cost of retrieving them. So presumably the cache could be tuned to the allocation algorithm used to place various structures on the disk.

I still wonder just how much gain you get from this approach, compared to, say, simply spending $2,000 on a gigabyte of RAM. Heck, PCs even support a couple gigs of RAM now.

>Actually I have another question for you guys which is somewhat related
>to this cost estimation issue. You know the difference between OLTP
>and OLAP. My question is whether you target postgresql at both kinds
>of applications or just OLTP. From what I know, OLTP and OLAP
>differ greatly in query characteristics and thus
>in optimization. If postgresql is only targeted at
>OLTP, the above cost estimation issue might not be that
>important.
>However, for OLAP, large tables and large queries are
>common and optimization would be difficult.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.