Some notes on optimizer cost estimates - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Some notes on optimizer cost estimates |
| Date | |
| Msg-id | 25387.948414692@sss.pgh.pa.us |
| Responses | RE: [HACKERS] Some notes on optimizer cost estimates; Re: [HACKERS] Some notes on optimizer cost estimates; Re: [HACKERS] Some notes on optimizer cost estimates |
| List | pgsql-hackers |
I have been spending some time measuring actual runtimes for various sequential-scan and index-scan query plans, and have learned that the current Postgres optimizer's cost estimation equations are not very close to reality at all.

Presently we estimate the cost of a sequential scan as

	Nblocks + CPU_PAGE_WEIGHT * Ntuples

--- that is, the unit of cost is the time to read one disk page, and we have a "fudge factor" that relates CPU time per tuple to disk time per page. (The default CPU_PAGE_WEIGHT is 0.033, which is probably too high for modern hardware --- 0.01 seems like it might be a better default, at least for simple queries.) OK, it's a simplistic model, but not too unreasonable so far.

The cost of an index scan is measured in these same terms as

	Nblocks + CPU_PAGE_WEIGHT * Ntuples +
		CPU_INDEX_PAGE_WEIGHT * Nindextuples

Here Ntuples is the number of tuples selected by the index qual condition (typically, it's less than the total table size used in sequential-scan estimation). CPU_INDEX_PAGE_WEIGHT essentially estimates the cost of scanning an index tuple; by default it's 0.017, or half CPU_PAGE_WEIGHT. Nblocks is estimated as the index size plus an appropriate fraction of the main table size.

There are two big problems with this:

1. Since main-table tuples are visited in index order, we'll be hopping around from page to page in the table. The current cost estimation method essentially assumes that the buffer cache plus OS disk cache will be 100% efficient --- we will never have to read the same page of the main table twice in a scan, due to having discarded it between references. This of course is unreasonably optimistic. Worst case is that we'd fetch a main-table page for each selected tuple, but in most cases that'd be unreasonably pessimistic.

2. The cost of a disk page fetch is estimated at 1.0 unit for both sequential and index scans. In reality, sequential access is *much* cheaper than the quasi-random accesses performed by an index scan. This is partly a matter of physical disk seeks, and partly a matter of benefitting (or not) from any read-ahead logic the OS may employ.

As best I can measure on my hardware, the cost of a nonsequential disk read should be estimated at 4 to 5 times the cost of a sequential one --- I'm getting numbers like 2.2 msec per disk page for sequential scans, and as much as 11 msec per page for index scans. I don't know, however, if this ratio is similar enough on other platforms to be useful for cost estimating. We could make it a parameter like we do for CPU_PAGE_WEIGHT ... but you know and I know that no one ever bothers to adjust those numbers in the field ...

The other effect that needs to be modeled, and currently is not, is the "hit rate" of buffer cache. Presumably, this is 100% for tables smaller than the cache and drops off as the table size increases --- but I have no particular thoughts on the form of the dependency. Does anyone have ideas here? The problem is complicated by the fact that we don't really know how big the cache is; we know the number of buffers Postgres has, but we have no idea how big a disk cache the kernel is keeping. As near as I can tell, finding a hit in the kernel disk cache is not a lot more expensive than having the page sitting in Postgres' own buffers --- certainly it's much much cheaper than a disk read.

BTW, if you want to do some measurements of your own, try turning on PGOPTIONS="-d 2 -te". This will dump a lot of interesting numbers into the postmaster log, if your platform supports getrusage().

			regards, tom lane
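To make the arithmetic above concrete, here is a minimal standalone sketch in C of the two cost equations from the message, plus a hypothetical variant of the index-scan estimate that charges each nonsequential page fetch about 4x a sequential one, as the 2.2 msec vs. 11 msec measurements suggest. The function names, the `nonseq_page_cost` parameter, and the example relation sizes are illustrative assumptions, not the optimizer's actual code.

```c
/*
 * A minimal sketch of the cost equations discussed above.  This is NOT the
 * optimizer's real code; the function names, the nonseq_page_cost knob, and
 * the example sizes in main() are illustrative assumptions.
 */
#include <stdio.h>

#define CPU_PAGE_WEIGHT        0.033   /* default CPU-per-heap-tuple fudge factor */
#define CPU_INDEX_PAGE_WEIGHT  0.017   /* default, roughly half of CPU_PAGE_WEIGHT */

/* Sequential scan: Nblocks + CPU_PAGE_WEIGHT * Ntuples */
static double
cost_seqscan(double nblocks, double ntuples)
{
    return nblocks + CPU_PAGE_WEIGHT * ntuples;
}

/*
 * Index scan: Nblocks + CPU_PAGE_WEIGHT * Ntuples
 *                     + CPU_INDEX_PAGE_WEIGHT * Nindextuples
 * where Nblocks is the index size plus the fraction of the main table the
 * scan is expected to touch.  nonseq_page_cost scales the page-fetch term
 * to model nonsequential reads costing several sequential reads each
 * (about 4-5 per the measurements above); pass 1.0 for the current model.
 */
static double
cost_indexscan(double nblocks, double ntuples, double nindextuples,
               double nonseq_page_cost)
{
    return nblocks * nonseq_page_cost
           + CPU_PAGE_WEIGHT * ntuples
           + CPU_INDEX_PAGE_WEIGHT * nindextuples;
}

int
main(void)
{
    /*
     * Made-up example: a 10000-page, 1-million-tuple table with a 1%
     * selective index qual, estimated to touch ~100 heap pages plus
     * ~50 index pages (Nblocks = 150).
     */
    double seq      = cost_seqscan(10000.0, 1000000.0);
    double idx_curr = cost_indexscan(150.0, 10000.0, 10000.0, 1.0);
    double idx_4x   = cost_indexscan(150.0, 10000.0, 10000.0, 4.0);

    printf("seqscan estimate:                    %8.1f\n", seq);
    printf("indexscan, page cost 1.0 (current):  %8.1f\n", idx_curr);
    printf("indexscan, page cost 4.0 (proposed): %8.1f\n", idx_4x);
    return 0;
}
```

With these made-up numbers the index scan remains far cheaper than the sequential scan even at a 4x page-fetch penalty; the point is only to show where such a multiplier would enter the formula, not to predict a plan change.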