Thread: question about index cost estimates
i know tom lane is probably sick of me, but i'm trying to figure out how the cost estimators work for an index scan.  i'm logging a lot of the calculations that go into the cost estimate for some sample queries to see what factors are most important in a cost estimate.  it seems to me that in the queries that i'm running, a vast majority of the cost comes from reading the tuples from the relation.  i think the cost of nonsequential access seems reasonable (it's about 2), but the calculation of pages fetched from the relation seems high.  i'm not sure what the actual should be, but some queries have it at 2-4x the number of pages in the relation, which seemed high, so i started looking at that.  i don't understand why the function that is used would be a good model of what is actually happening.  here's the comment & the function:

     * Estimate number of main-table tuples and pages fetched.
     *
     * If the number of tuples is much smaller than the number of pages in
     * the relation, each tuple will cost a separate nonsequential fetch.
     * If it is comparable or larger, then probably we will be able to
     * avoid some fetches.  We use a growth rate of log(#tuples/#pages +
     * 1) --- probably totally bogus, but intuitively it gives the right
     * shape of curve at least.

    pages_fetched = ceil(baserel->pages *
                         log(tuples_fetched / baserel->pages + 1.0));

i'm at a total loss to explain how this works.  for all i know, it's correct and it is that costly, i don't know.  it just seems to me that there should be some notion of tuple size figured in to know how many tuples fit in a page.  can somebody explain it to me?

thanks,

jeff
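p.s. just to get a feel for what that formula does, i threw together a little standalone program -- this is strictly my own toy, not anything from the backend, and the page and tuple counts are numbers i made up -- that plugs a few values of tuples_fetched into the same expression:

    /* toy driver for the pages_fetched formula quoted above; the page and
     * tuple counts are invented for illustration.  compile with -lm. */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      pages = 10000.0;    /* pretend the relation has 10000 pages */
        double      tuples[] = {100.0, 1000.0, 10000.0, 100000.0};
        int         i;

        for (i = 0; i < 4; i++)
        {
            double      tuples_fetched = tuples[i];
            double      pages_fetched;

            pages_fetched = ceil(pages *
                                 log(tuples_fetched / pages + 1.0));
            printf("tuples_fetched = %6.0f  ->  pages_fetched = %6.0f\n",
                   tuples_fetched, pages_fetched);
        }
        return 0;
    }

once tuples_fetched gets up to 10x the number of pages, the estimate comes out around 2.4x the pages in the relation, which is roughly the 2-4x i was seeing in my queries.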
Jeff Hoffmann <jeff@propertykey.com> writes:
> calculation of pages fetched from the relation seems high. i'm not sure
> what the actual should be, but some queries have it at 2-4x the number
> of pages in the relation, which seemed high, so i started looking at
> that.

If the table is bigger than the disk cache, and we are hitting pages more than once, then we are likely to have to fetch some pages more than once.  The model is assuming that the tuples we want are scattered more or less randomly over the whole table.  If #tuples to be fetched approaches or exceeds #pages in table, then clearly we are going to be touching some pages more than once because we want more than one tuple from them.  The $64 question is: did the page fall out of buffer cache between references?  Worst case (if your table vastly exceeds your available buffer space), practically every tuple fetch will require a separate physical read, and then the number of page fetches is essentially the number of tuples returned --- which of course can be a lot more than the number of pages in the table.

Right now these considerations are divided between cost_index() and cost_nonsequential_access() in a way that might well be wrong.  I've been intending to dig into the literature and try to find a better cost model but haven't gotten to it yet.

> it just seems to me that there should be some notion of tuple size
> figured in to know how many tuples fit in a page.

It seemed to me that the critical ratios are #tuples fetched vs. #pages in table and table size vs. cache size.  I could be wrong though...

            regards, tom lane
Tom Lane wrote:
> If the table is bigger than the disk cache, and we are hitting pages
> more than once, then we are likely to have to fetch some pages more than
> once.  The model is assuming that the tuples we want are scattered more
> or less randomly over the whole table.  If #tuples to be fetched
> approaches or exceeds #pages in table, then clearly we are going to be
> touching some pages more than once because we want more than one tuple
> from them.

i understand that, but still, if you have a tuple that's, say, 144 bytes, you're going to have a higher chance of revisiting the same page than if the tuple is 4k.  it may be something that works itself out of the equation somewhere.

> The $64 question is: did the page fall out of buffer cache
> between references?  Worst case (if your table vastly exceeds your
> available buffer space), practically every tuple fetch will require a
> separate physical read, and then the number of page fetches is
> essentially the number of tuples returned --- which of course can be
> a lot more than the number of pages in the table.

i understand that, too.  i understand the need for a curve like that; maybe my question should be about the slope of the curve.  it turns out that, playing around with my queries, it seems like when i was selecting about 10% of the records, the page fetch estimate is pretty accurate, although when the selectivity falls significantly below that, the page fetch estimate stays quite a bit higher than actual.

part of what i'm doing is looking at the executor stats for each query.  can you explain what the shared blocks, local blocks, and direct blocks mean?  i'm assuming that the shared blocks refer to reads of the database files & indexes.  what do the others mean?  are the shared blocks shared amongst all of the backends or does each have its own pool?  i'm getting a 90%+ buffer hit rate on index scans, but i'm assuming that's because i'm the only one doing anything on this machine right now and that would go down with more processes.

> Right now these considerations are divided between cost_index() and
> cost_nonsequential_access() in a way that might well be wrong.
> I've been intending to dig into the literature and try to find a better
> cost model but haven't gotten to it yet.

well, there have been a lot of things that i've looked at and thought "that doesn't sound right", like for example the random page access cost, which defaults to 4.0.  i thought that was too high, but i ran bonnie on my hard drive and it looks like the actual value is around 3.9, so now i'm a believer.  somebody must have put some thought into it somewhere.  i'm still taking everything with a grain of salt until i can explain it, though.

jeff
Jeff Hoffmann <jeff@propertykey.com> writes:
> i understand that, but still, if you have a tuple that's, say, 144 bytes,
> you're going to have a higher chance of revisiting the same page than if
> the tuple is 4k.

Are you?  If you are pulling ten tuples from a thousand-page relation, seems to me the odds of revisiting any one page are somewhere around 0.01 (too lazy to do the exact combinatorics right now).  Doesn't really matter what the average tuple size is --- or if you prefer, we already accounted for that because we are looking at target tuples divided by total pages rather than target tuples divided by total tuples.

> maybe my question should be about the slope of the curve.  it turns out
> that, playing around with my queries, it seems like when i was selecting
> about 10% of the records, the page fetch estimate is pretty accurate,
> although when the selectivity falls significantly below that, the page
> fetch estimate stays quite a bit higher than actual.

Hmm.  I would think that the behavior for very low selectivity ought to be pretty close to the model: it's hard to see how it can be anything but one page fetch per selected tuple, unless you are seeing strong clustering effects.  I do *not* claim that the slope of the curve is necessarily right for higher selectivities though... that needs to be looked at.

> part of what i'm doing is looking at the executor stats for each query.
> can you explain what the shared blocks, local blocks, and direct blocks
> mean?

The filesystem blocks in/out are from the kernel getrusage() call.  On my box, at least, these seem to mean physical I/O operations initiated on behalf of the process --- AFAICT, touching a page that is already in kernel disk buffers does not increment the filesystem blocks count.

The other numbers are from PrintBufferUsage() in bufmgr.c.  It looks like the shared block counts are the number of block read and write requests made to the kernel by bufmgr.c, and the hit rate indicates the fraction of buffer fetch requests made to bufmgr.c that were satisfied in Postgres' own buffer area (ie, without a kernel request).

The local block counts are the same numbers for non-shared relations, which basically means tables created in the current transaction.  They'll probably be zero in most cases of practical interest.

The "direct" counts seem to be broken at the moment --- I can't find any code that increments them.  It looks like the intent was to count block I/O operations on temporary files (sorttemp and suchlike).  This is not of interest for pure indexscans...

> are the shared blocks shared amongst all of the backends or does each
> have its own pool?

Shared among all --- that's what the shared memory block is (mostly) for...

> i'm getting a 90%+ buffer hit rate on index scans,
> but i'm assuming that's because i'm the only one doing anything on
> this machine right now and that would go down with more processes.

It also suggests that your test table isn't much bigger than the buffer cache ;-) --- or at least the part of it that you're touching isn't.

> i'm still taking everything with a grain of salt until i can
> explain it, though.

Good man.  I don't think anyone but me has looked at this stuff in a long while, and it could use review.

            regards, tom lane
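P.S.  Here is the sort of back-of-the-envelope combinatorics I was waving my hands at above --- purely an illustration of the "tuples scattered uniformly over the pages" assumption, not anything cost_index actually computes.  If k tuples land independently and uniformly on N pages, the expected number of distinct pages touched is N * (1 - (1 - 1/N)^k), so the expected number of repeat visits is k minus that:

    /* expected number of distinct pages touched when k tuples are scattered
     * uniformly over N pages --- illustration only, not planner code.
     * compile with -lm. */
    #include <math.h>
    #include <stdio.h>

    static double
    expected_distinct_pages(double N, double k)
    {
        return N * (1.0 - pow(1.0 - 1.0 / N, k));
    }

    int
    main(void)
    {
        double      N = 1000.0;     /* pages in the relation */
        double      k = 10.0;       /* tuples to be fetched */
        double      d = expected_distinct_pages(N, k);

        printf("expect %.2f distinct pages, i.e. about %.2f revisits\n",
               d, k - d);
        return 0;
    }

Notice that tuple size appears nowhere in this; the only inputs are the number of tuples fetched and the number of pages.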
> -----Original Message-----
> From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> Behalf Of Jeff Hoffmann
>
> [snip]
>
>     pages_fetched = ceil(baserel->pages *
>                          log(tuples_fetched / baserel->pages + 1.0));
>

Unfortunately I didn't understand this well either.

pages_fetched seems to be able to be greater than baserel->pages.  But if there's sufficiently large buffer space, pages_fetched would be <= baserel->pages.  Are there any assumptions about buffer space?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Tom Lane wrote:
> Jeff Hoffmann <jeff@propertykey.com> writes:
> > i understand that, but still, if you have a tuple that's, say, 144 bytes,
> > you're going to have a higher chance of revisiting the same page than if
> > the tuple is 4k.
>
> Are you?  If you are pulling ten tuples from a thousand-page relation,
> seems to me the odds of revisiting any one page are somewhere around
> 0.01 (too lazy to do the exact combinatorics right now).

well, for something like 10 tuples out of, say, 50,000 - 100,000 records, that probably wouldn't come into play and you'd almost definitely be talking 1 page access per tuple.  i was thinking it would come into play with higher selectivity.  actually, i don't know if i'm thinking through the probabilities and formulas required as much as just brainstorming what things _might_ be coming into play.

> Doesn't really
> matter what the average tuple size is --- or if you prefer, we already
> accounted for that because we are looking at target tuples divided by
> total pages rather than target tuples divided by total tuples.

i thought about that after i sent my message.  i find that a lot of times, i'll answer my own question roughly 15 seconds after sending the message.  i'm still trying to figure out how that happens.

> > i'm still taking everything with a grain of salt until i can
> > explain it, though.
>
> Good man.  I don't think anyone but me has looked at this stuff in a
> long while, and it could use review.

i can tell you that the more i look at things and start understanding the logic, the more it seems that the numbers are fairly reasonable if you don't or can't know a lot of the factors like caching or clustering.  unfortunately those factors can make a lot of difference in actual results.  i was looking at the cost estimate of an rtree index scan, but it turns out that when you add in the page accesses for the actual tuples, it's rare to see an index scan cost that matters to the total cost for reasonably large tables.  because of that, i'm not sure what the value of fixing the rtree index scan cost estimator would be, other than making it semantically correct.  it seems that in the big picture, we're subject to the whims of the disk cache more than anything.

i think i'm starting to see what you said about fixing selectivity being more important, although i'm still not convinced that making the selectivity more accurate is going to come close to solving the problem.  the selectivity problem is better defined, though, and therefore easier to fix than the other problems in cost estimation.  i'm assuming you're considering adding some sort of histogram to the stats that vacuum collects.  is that something that you're seriously considering or do you have another plan to introduce a useful concept of the distribution of an attribute?  the concept of making a histogram on a b-tree is pretty simple, but if you're going to do it, you should probably be taking into account r-trees, where you'd need a 2-d histogram, making the job just a bit tougher.

jeff
Hiroshi Inoue wrote:
> > -----Original Message-----
> > From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> > Behalf Of Jeff Hoffmann
> >
> > [snip]
> >
> >     pages_fetched = ceil(baserel->pages *
> >                          log(tuples_fetched / baserel->pages + 1.0));
> >
>
> Unfortunately I didn't understand this well either.
>
> pages_fetched seems to be able to be greater than
> baserel->pages.

not only does it seem that way, you can expect it to happen fairly frequently, even if you're pulling only 1-2% of the records with a query.  if you don't believe it, check the actual performance of a few queries.

> But if there's sufficiently large buffer
> space, pages_fetched would be <= baserel->pages.
> Are there any assumptions about buffer space?

the # of pages fetched would be the same, it'd just be cheaper to pull them from the buffer instead of from disk.  that's what isn't being taken into consideration properly in the estimate.  the real question is what assumptions can you make about buffer space?  you don't know how many concurrent accesses there are (all sharing buffer space).  i also don't think you can count on knowing the size of the buffer space.  therefore, the buffer space is set to some constant intermediate value and it is taken into account, at least in cost_nonsequential_access.

the question is this: shouldn't you be able to make an educated guess at this by dividing the total buffer space allocated by the backend by the number of postmaster processes running at the time?  or don't you know those things?

jeff
Jeff Hoffmann <jeff@propertykey.com> writes:
> it seems that in the big picture, we're subject to the whims of the disk
> cache more than anything.

Ain't that the truth.  Maybe we ought to think about whether there's any way to find out how big the kernel's disk cache is.  Even if it only worked on some platforms, we'd be no worse off on the other ones...

> i'm assuming you're considering adding some sort of histogram to the
> stats that vacuum collects.  is that something that you're seriously
> considering or do you have another plan to introduce a useful concept
> of the distribution of an attribute?  the concept of making a
> histogram on a b-tree is pretty simple, but if you're going to do it,
> you should probably be taking into account r-trees, where you'd need a
> 2-d histogram, making the job just a bit tougher.

I was considering a histogram for b-trees, but I have to admit I hadn't thought about r-trees.  Seems like a 2-D histogram would be too bulky to be feasible.  Could we get any mileage out of two 1-D histograms (ie, examine the two coordinates independently)?  Or is that too simplistic to be worth bothering with?

            regards, tom lane
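P.S.  To make the two-1-D-histograms idea concrete, here is a quick sketch of what I have in mind --- the bin boundaries are invented, and this is not proposed planner code.  Keep one equal-height histogram per coordinate, estimate what fraction of the values falls inside the query rectangle along each axis, and multiply the two fractions.  The multiplication is exactly where the independence assumption sneaks in:

    /* selectivity of a box query from two independent 1-D equal-height
     * histograms --- a sketch only; the bin boundaries below are made up */
    #include <stdio.h>

    #define NBINS 10

    /*
     * Fraction of the rows whose value falls in [lo, hi], according to an
     * equal-height histogram whose bin boundaries are in bounds[0..NBINS].
     * Values are assumed uniformly distributed within each bin.
     */
    static double
    fraction_in_range(const double bounds[NBINS + 1], double lo, double hi)
    {
        double      frac = 0.0;
        int         i;

        for (i = 0; i < NBINS; i++)
        {
            double      binlo = bounds[i];
            double      binhi = bounds[i + 1];
            double      ovlo = (lo > binlo) ? lo : binlo;
            double      ovhi = (hi < binhi) ? hi : binhi;

            if (ovhi > ovlo)
                frac += (ovhi - ovlo) / (binhi - binlo) / NBINS;
        }
        return frac;
    }

    int
    main(void)
    {
        /* invented bin boundaries for the x and y coordinates */
        double      xbounds[NBINS + 1] = {0, 3, 7, 12, 20, 31, 45, 60, 78, 90, 100};
        double      ybounds[NBINS + 1] = {0, 5, 9, 15, 22, 30, 41, 55, 70, 85, 100};

        /* query rectangle: x in [10, 25], y in [20, 40] */
        double      selx = fraction_in_range(xbounds, 10.0, 25.0);
        double      sely = fraction_in_range(ybounds, 20.0, 40.0);

        printf("estimated selectivity = %g (assuming independent coordinates)\n",
               selx * sely);
        return 0;
    }

Whether this buys anything probably depends on how strongly correlated the coordinates are in real geometric data --- which is really just another way of asking whether it's too simplistic.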
Jeff Hoffmann <jeff@propertykey.com> writes:
> the question is this: shouldn't you be able to make an educated guess at
> this by dividing the total buffer space allocated by the backend by the
> number of postmaster processes running at the time?  or don't you know
> those things?

Two things here:

One, we could easily find out the number of active backends, and we certainly know the number of shared disk buffers.  BUT: it'd be a debugging nightmare if the planner's choices depended on the number of other backends that were running at the instant of planning.  Even though that'd theoretically be the right thing to do, I don't think we want to go there.  (If you want an argument less dependent on mere implementation convenience, consider that in many scenarios the N backends will be accessing more or less the same set of tables.  So the assumption that each backend only gets the use of 1/N of the shared buffer space is too pessimistic anyway.)

Two, the Postgres shared buffer cache is only the first-line cache.  We also have the Unix kernel's buffer cache underneath us, though we must share it with whatever else is going on on the machine.  As far as I've been able to measure, there is relatively little cost difference between finding a page in the Postgres cache and finding it in the kernel cache --- certainly a kernel call is still much cheaper than an actual disk access.  So the most relevant number seems to be the fraction of the kernel's buffer cache that's effectively available to Postgres.  Right now we have no way at all to measure that number, so we punt and treat it as a user-settable parameter (I think I made the default setting 10Mb or so).  It'd be worthwhile looking into whether we can do better than guessing about the kernel cache size.

            regards, tom lane
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > pages_fetched seems to be able to be greater than > baserel->pages. But if there's sufficiently large buffer > space pages_fetched would be <= baserel->pages. > Are there any assupmtions about buffer space ? Right now cost_index doesn't try to account for that, because it doesn't have any way of knowing the relevant buffer-space parameter. (As I said to Jeff, we have to consider kernel buffer space not just the number of Postgres shared buffers.) cost_nonsequential_access does have a dependence on (a totally bogus estimate of) effective cache size, but it's a considerably weaker dependence than you suggest above. If we had a reliable estimate of cache size I'd be inclined to restructure this code quite a bit... regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > pages_fetched seems to be able to be greater than
> > baserel->pages.  But if there's sufficiently large buffer
> > space, pages_fetched would be <= baserel->pages.
> > Are there any assumptions about buffer space?
>
> Right now cost_index doesn't try to account for that, because
> it doesn't have any way of knowing the relevant buffer-space
> parameter.  (As I said to Jeff, we have to consider kernel
> buffer space, not just the number of Postgres shared buffers.)
>
> cost_nonsequential_access does have a dependence on (a totally
> bogus estimate of) effective cache size, but it's a considerably
> weaker dependence than you suggest above.

Thanks.  I just wanted to confirm this, because I didn't understand whether effective cache size is relevant to the calculation or not.

> If we had a reliable
> estimate of cache size I'd be inclined to restructure this code
> quite a bit...

Yes, I know that a reliable estimate is very significant, but unfortunately I have no idea how to get one.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:
> On the postgres side, what if you actually made the cache _smaller_,
> only caching important stuff like system tables or indexes.  You yourself
> said it doesn't matter if you get it from the cache or the kernel, so
> why not let the kernel do it and prevent double buffering?

In fact I don't think it's productive to use an extremely large -B setting.  This is something that easily could be settled by experiment --- do people actually see any significant improvement in performance from increasing -B above, say, a few hundred?

            regards, tom lane
Tom Lane wrote:
> I was considering a histogram for b-trees, but I have to admit I hadn't
> thought about r-trees.  Seems like a 2-D histogram would be too bulky
> to be feasible.  Could we get any mileage out of two 1-D histograms
> (ie, examine the two coordinates independently)?  Or is that too
> simplistic to be worth bothering with?

i don't think it would do any good to look at them separately.  you might as well just assume a uniform distribution if you're going to do that.

does anybody on the list know anything about fractals & wavelets?  i know _nothing_ about them, but i know that you can use wavelets to compress photos (i.e., a 2-d data source), and my understanding is that you're essentially converting the image into a mathematical function.  it's also my understanding that wavelets work well on multiple scales, i.e., you can zoom in on a picture to get more detail or zoom out and get less detail.  my thought is: if you've got a 2-d histogram, would something like this be useful for storing it?  like i said, i have no idea how it works or if it is of any interest or if it's even practical, just throwing something out there.

jeff
Jeff Hoffmann <jeff@propertykey.com> writes:
> does anybody on the list know anything about fractals & wavelets?

Now there's an interesting idea: regard the stats as a lossy compression of the probability density of the original dataset.  Hmm ... this doesn't do anything for the problem of computing the pdf cheaply to begin with, but it might help with storing it compactly in pg_statistic.

            regards, tom lane
Tom Lane wrote:
> Jeff Hoffmann <jeff@propertykey.com> writes:
> > does anybody on the list know anything about fractals & wavelets?
>
> Now there's an interesting idea: regard the stats as a lossy compression
> of the probability density of the original dataset.  Hmm ... this
> doesn't do anything for the problem of computing the pdf cheaply to
> begin with, but it might help with storing it compactly in pg_statistic.
>
>             regards, tom lane

yeah, that's exactly what i meant.  you can probably tell math & statistics aren't my strongest points.  i'm trying to learn a little about fractals because about all i knew before today was those pretty little pictures.  just doing a quick search for their use with databases, though, i found at least one paper on selectivity estimates & fractals which i'm going to read.  just for your reference, the paper is from VLDB and it's called "Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension".  here's the link: http://www.vldb.org/conf/1995/P299.PDF.  i've only read the abstract, but it sounds pretty promising.

i'm starting to wish that i knew what was going on -- this is getting to be interesting to me.  i've been using postgresql for years, but up until the last few days, i was trying to treat it as a black box as much as possible -- i probably should have tried getting involved a lot earlier...

jeff
Jeff Hoffmann writes:
> just for your reference, the paper is from VLDB and it's called
> "Estimating the Selectivity of Spatial Queries Using the 'Correlation'
> Fractal Dimension".  here's the link:
> http://www.vldb.org/conf/1995/P299.PDF.  i've only read the abstract,
> but it sounds pretty promising.

That paper was very interesting.  I think it might in fact contain the answer for writing smart selectivity estimators for geometrical data; now someone only needs to be brave enough to try to code it.

Btw., it also contains a number of interesting references regarding R-trees and selectivity estimation in general.

Whodda thunk there's genuine information to be found on the Web ... :)

-- 
Peter Eisentraut              Sernanders väg 10:115
peter_e@gmx.net               75262 Uppsala
http://yi.org/peter-e/        Sweden