Thread: Re: order by and index path
Andread wrote: > > (who really selects nearly all rows from a 5M row > > table?). > > Data Warehouse apps > > > This will hurt if someone really selects most of the rows and the index > > scan jumps over the disc. > > I think this is a non issue, since if a qual is not restrictive enough, > the optimizer should choose a seq scan anyway. Doesn' t it do this already ? > [...] Right and wrong. Right - it is the optimizers job to decide if an index should be used or not. And the decision has to be made based on the cost. Wrong - PostgreSQL's query optimizer doesn't do it already. It assumes that a qualification is always restrictive enough and chooses an index scan any time if the qualification can be thrown into the indexqual. In the following I only discuss the situation where qualifications can be used in the indexqual. Calculating the cost of a query is easy. Have N tuples in P data-pages where the given qualification will match M. Assuming that the tuples are not in ascending order in the data pages, the cost fetching one tuple by its TID raises with P (more seeking necessary). Now you can calculate the cost of an index scan by C=M/N*P*F where F is some constant factor to make C comparable to a current seqscan cost value (I know, it must be smarter, but for this description a simple calculation is enough). The only problem is that the optimizer has absolutely no chance to estimate M (the mystic value as Bruce called it). In a given qualification WHERE key > 0 AND key <= 100 it cannot know if this would result in 0 or 100% of the rows. To estimate that, it needs statistical information about the key ranges that are present. Assume there would be 11 keys remembered by the last vacuum run, that break up the whole present key range of 10000 tuples into 10 chunks and they read 5 40 70 90 500 600 1000 1100 1400 1500 2000 where 5 is the lowest key at all, 40 is the key of tuple 1000 (in key order), 70 is the key of tuple 2000 and so on. Now looking at the qualification and this key range information would tell, that the absolute limit of rows returned by an index scan would be 3999 (which still could have a key value of 100). But the qualification WHERE key >= 50 could return at max 8999 tuples and WHERE key > 50 AND key < 70 has a maximum of 998 result tuples. This would be the information required to make the right decision for the case where all rows selected are wanted. We do not have this statistical information. So the whole thing is at this time academic. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #======================================== jwieck@debis.com (Jan Wieck) #
> We do not have this statistical information. So the whole > thing is at this time academic. I recall that Commercial Ingres made the assumption that one row (or 1% of rows? My memory of Ingres is fading :) would be returned from a qualified query if no statistics were available to suggest otherwise. It did collect statistics on data distribution to try to help make those optimizer choices. It may be reasonable to assume that if there is an index, then using it with any qualified query would be a win. Since the alternative is to decide to _not_ use an index, a decision for which we have no support with existing statistics. - Tom
Statistics on key distribution (was: Re: order by and index path)
From
jwieck@debis.com (Jan Wieck)
Date:
> > > We do not have this statistical information. So the whole > > thing is at this time academic. > > I recall that Commercial Ingres made the assumption that one row (or 1% > of rows? My memory of Ingres is fading :) would be returned from a > qualified query if no statistics were available to suggest otherwise. > > It did collect statistics on data distribution to try to help make those > optimizer choices. > > It may be reasonable to assume that if there is an index, then using it > with any qualified query would be a win. Since the alternative is to > decide to _not_ use an index, a decision for which we have no support > with existing statistics. It may be also reasonable to collect statistic information and use that to quantify the cost of an index scan. The vacuum cleaner scans all indices on a relation vacuum'd completely. And at that time it already knows the number of pages and tuples in the heap relation (has that in the vcrelstats). Based on this it could decide to take every n'th index tuple while scanning and drop them somewhere where other backends can find them. This would be the statistical information needed by the optimizer to estimate the real cost of an index scan. It is only of interest for big tables, where hopping from block to block will make an index scan a looser against a seqscan in a many row matching scan. So it's up to the optimizer do decide based on the # of pages if statistical information is really required for cost calculation. Having the final indexqual along with the statistical information it will be a little tricky to figure out how many rows it might return, but not impossible. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #======================================== jwieck@debis.com (Jan Wieck) #
> could return at max 8999 tuples and > > WHERE key > 50 AND key < 70 > > has a maximum of 998 result tuples. This would be the > information required to make the right decision for the case > where all rows selected are wanted. > > We do not have this statistical information. So the whole > thing is at this time academic. But we do have statistical information in pg_statistic if you run vacuum analyze. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> > We do not have this statistical information. So the whole > > thing is at this time academic. > > I recall that Commercial Ingres made the assumption that one row (or 1% > of rows? My memory of Ingres is fading :) would be returned from a > qualified query if no statistics were available to suggest otherwise. > > It did collect statistics on data distribution to try to help make those > optimizer choices. > > It may be reasonable to assume that if there is an index, then using it > with any qualified query would be a win. Since the alternative is to > decide to _not_ use an index, a decision for which we have no support > with existing statistics. For =, the assumion is 1 row, for > the assumption is 1/3 of the table. With pg_statistic, it uses that. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> > > could return at max 8999 tuples and > > > > WHERE key > 50 AND key < 70 > > > > has a maximum of 998 result tuples. This would be the > > information required to make the right decision for the case > > where all rows selected are wanted. > > > > We do not have this statistical information. So the whole > > thing is at this time academic. > > But we do have statistical information in pg_statistic if you run vacuum > analyze. Nice (forgot that - pardon), anyway only having lowest and highest key values isn't enough to make a useful estimation about how many rows an indexqual will return. If we change pg_statistic in a way that more keys can get stored per relation/attribute, then the optimizer would have a real chance on it. I have starelid staattnum staitupno staop stakey in mind, where staitupno tells the position of the key in a complete index scan. Then it becomes the place to fill in the key range information as described in my posting. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #======================================== jwieck@debis.com (Jan Wieck) #