Thread: Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple select)

From: Zeugswetter Andreas IZ5
Date:
> So, the selectivity that a search for the most common value would
> have is a reasonable estimate for the selectivity of a search for any
> value.  That's a bogus assumption in this case --- but it's hard to
> justify making any other assumption in general.

Other db's usually use the value count(*) / nunique for the lightweight
statistics.  This makes the assumption that the distinct index values
are evenly distributed.  That is on average a correct assumption,
whereas our assumption on average overestimates the number of rows
returned.  I am not sure we have a nunique info though.

Andreas
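The count(*) / nunique rule Andreas describes can be sketched as a tiny helper. The function name and signature here are illustrative, not from any actual backend source:

```c
/* Even-distribution estimate: every distinct value is assumed to occur
 * ntuples/nunique times, so the selectivity of "= x" is 1/nunique.
 * (Hypothetical helper; not a real PostgreSQL function.) */
static double
eqsel_even_distribution(double ntuples, double nunique)
{
    if (nunique <= 0.0 || ntuples <= 0.0)
        return 0.0;             /* no usable stats */
    /* expected matching rows = ntuples/nunique; divide by ntuples
     * to get a selectivity fraction */
    return 1.0 / nunique;
}
```

With 1000 rows and 100 distinct values, this estimates a selectivity of 0.01, i.e. 10 rows returned, regardless of which value is probed.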
Zeugswetter Andreas IZ5 <Andreas.Zeugswetter@telecom.at> writes:
> Other db's usually use the value count(*) / nunique for the light
> weight statistics.  This makes the assumption that the distinct index
> values are evenly distributed.  That is on average a correct
> assumption, whereas our assumption on average overestimates the number
> of rows returned.  I am not sure we have a nunique info though.

We don't, and AFAICS it would be an expensive statistic to compute.

I have thought about this a little more overnight, and I have come up
with what I think is a better idea.  Suppose that VACUUM ANALYZE stores
in pg_statistic not only the disbursion, but also the most frequently
occurring value of each column.  It already computes (or I should say
estimates) the most frequently occurring value (MFOV) in order to arrive
at the disbursion, so storing the value costs nothing except a little
more space in pg_statistic.  Now, the logic that eqsel() should use is

	if constant-being-compared-against == MFOV then
		return disbursion;
	else
		return MIN(disbursion, 1.0 - disbursion);

which works like this: if we are indeed looking for the MFOV then the
selectivity is just the disbursion, no question.  If we are looking for
a value *other* than the MFOV, then the selectivity must be less than
the disbursion, since surely this value occurs less often than the MFOV.
But the total fraction of non-MFOV values in the table is
1.0 - disbursion, so the fraction that are the specific value we want
can't exceed that either.  The MIN() above is therefore a hard upper
bound for the selectivity of the non-MFOV case.  In practice we might
want to multiply the MIN by a fudge-factor somewhat less than one, to
arrive at what we hope is a reasonable estimate rather than a worst-case
estimate.

BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.

If we use this logic, the stat we really want is exactly the frequency
of the MFOV, not the disbursion, which is just closely related to it.
I have not looked at the other uses of disbursion, but if they all can
work like this we might want to forget the statistical niceties and
just store the frequency of the MFOV.

A final comment is that NULL would be treated just like any regular
value in determining what is the MFOV...

			regards, tom lane
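The proposed eqsel() logic can be sketched in C. The function name and the particular fudge factor below are illustrative assumptions, not from the PostgreSQL source:

```c
#define NOT_MFOV_FUDGE 0.8      /* illustrative fudge factor < 1 */

/* Sketch of the proposed eqsel() logic.  "disbursion" is the estimated
 * frequency of the most frequently occurring value (MFOV).
 * const_is_mfov is nonzero when the constant being compared against
 * is known to equal the MFOV. */
static double
eqsel_mfov(double disbursion, int const_is_mfov)
{
    double cap;

    if (const_is_mfov)
        return disbursion;      /* exact frequency of the MFOV */

    /* Any other value occurs at most MIN(d, 1-d) of the time:
     * less often than the MFOV, and within the non-MFOV fraction. */
    cap = (disbursion < 1.0 - disbursion) ? disbursion
                                          : 1.0 - disbursion;
    return cap * NOT_MFOV_FUDGE;    /* estimate, not worst case */
}
```

Note that the MIN() cap can never exceed 0.5, which is exactly the rigorous bound claimed above for the non-MFOV case.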
> > So, the selectivity that a search for the most common value would
> > have is a reasonable estimate for the selectivity of a search for any
> > value.  That's a bogus assumption in this case --- but it's hard to
> > justify making any other assumption in general.
>
> Other db's usually use the value count(*) / nunique for the light weight
> statistics.  This makes the assumption that the distinct index values
> are evenly distributed.  That is on average a correct assumption,
> whereas our assumption on average overestimates the number of rows
> returned.  I am not sure we have a nunique info though.

Yes, that's the problem.  Figuring out the number of uniques is hard,
especially with no index.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Zeugswetter Andreas IZ5 <Andreas.Zeugswetter@telecom.at> writes:
> > Other db's usually use the value count(*) / nunique for the light
> > weight statistics.  This makes the assumption that the distinct index
> > values are evenly distributed.  That is on average a correct
> > assumption, whereas our assumption on average overestimates the number
> > of rows returned.  I am not sure we have a nunique info though.
>
> We don't, and AFAICS it would be an expensive statistic to compute.
>
> I have thought about this a little more overnight, and I have come up
> with what I think is a better idea.  Suppose that VACUUM ANALYZE stores
> in pg_statistic not only the disbursion, but also the most frequently
> occurring value of each column.  It already computes (or I should say
> estimates) the most frequently occurring value (MFOV) in order to arrive
> at the disbursion, so storing the value costs nothing except a little
> more space in pg_statistic.  Now, the logic that eqsel() should use is
>
>	if constant-being-compared-against == MFOV then
>		return disbursion;
>	else
>		return MIN(disbursion, 1.0 - disbursion);

Yes, I like this.

> BTW, this argument proves rigorously that the selectivity of a search
> for any value other than the MFOV is not more than 0.5, so there is some
> basis for my intuition that eqsel should not return a value above 0.5.
> So, in the cases where eqsel does not know the exact value being
> searched for, I'd still be inclined to cap its result at 0.5.

I don't follow this.  If the most frequent value occurs 95% of the time,
wouldn't the selectivity be 0.95?  And why use 1 - disbursion?  You may
find the existing code does a better job than the MIN() computation.

--
  Bruce Momjian  <maillist@candle.pha.pa.us>
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> BTW, this argument proves rigorously that the selectivity of a search
>> for any value other than the MFOV is not more than 0.5, so there is some
>> basis for my intuition that eqsel should not return a value above 0.5.
>> So, in the cases where eqsel does not know the exact value being
>> searched for, I'd still be inclined to cap its result at 0.5.

> I don't follow this.  If the most frequent value occurs 95% of the time,
> wouldn't the selectivity be 0.95?

If you are searching for the most frequent value, then the selectivity
estimate should indeed be 0.95.  If you are searching for anything else,
the selectivity estimate ought to be 0.05 or less.  If you don't know
what value you will be searching for, which number should you use?

The unsupported assumption here is that if the table contains 95%
occurrence of a particular value, then the odds are also 95% (or at
least high) that that's the value you are searching for in any given
query that has an "= something" WHERE qual.

That assumption is pretty reasonable in some cases (such as your example
earlier of "WHERE state = 'PA'" in a Pennsylvania-local database), but
it falls down badly in others, such as where the most common value is
NULL or an empty string or some other indication that there's no useful
data.  In that sort of situation it's actually pretty unlikely that the
user will be searching for field = most-common-value ... but the system
probably has no way to know that.

I wonder whether it would help to add even more data to pg_statistic.
For example, suppose we store the fraction of the column's entries that
are NULL, plus the most frequently occurring *non-null* value, plus the
fraction of the entries that are that value.  This would allow us to be
very smart about columns in which "no data" is represented by NULL (as a
good DB designer would do):

selectivity of "IS NULL":
	NULLfraction

selectivity of "IS NOT NULL":
	1 - NULLfraction

selectivity of "= X" for a known non-null constant X:
	if X == MFOV: MFOVfraction
	else: MIN(MFOVfraction, 1 - MFOVfraction - NULLfraction)

selectivity of "= X" when X is not known a priori, but presumably is not null:
	MIN(MFOVfraction, 1 - NULLfraction)

Both of the MIN()s are upper bounds, so multiplying them by a
fudge-factor < 1 would be reasonable.  These rules would guarantee small
selectivity values when either MFOVfraction or 1 - NULLfraction is
small.  It still wouldn't cost much, since I believe VACUUM ANALYZE is
counting nulls already...

			regards, tom lane
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> BTW, this argument proves rigorously that the selectivity of a search
> >> for any value other than the MFOV is not more than 0.5, so there is some
> >> basis for my intuition that eqsel should not return a value above 0.5.
> >> So, in the cases where eqsel does not know the exact value being
> >> searched for, I'd still be inclined to cap its result at 0.5.
>
> > I don't follow this.  If the most frequent value occurs 95% of the time,
> > wouldn't the selectivity be 0.95?
>
> If you are searching for the most frequent value, then the selectivity
> estimate should indeed be 0.95.  If you are searching for anything else,
> the selectivity estimate ought to be 0.05 or less.  If you don't know
> what value you will be searching for, which number should you use?

You are going to love this:

	0.95 * 0.95 + 0.05 * 0.05

This is because, with 95% of one value, you would think they ask for
that value 95% of the time, and another value 5% of the time.  The last
0.05 is not really accurate.  It assumes there are only two unique
values in the table, which may be wrong, but it is close enough.

> The unsupported assumption here is that if the table contains 95%
> occurrence of a particular value, then the odds are also 95% (or at
> least high) that that's the value you are searching for in any given
> query that has an "= something" WHERE qual.

Yes.

> That assumption is pretty reasonable in some cases (such as your
> example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
> database), but it falls down badly in others, such as where the
> most common value is NULL or an empty string or some other indication
> that there's no useful data.  In that sort of situation it's actually
> pretty unlikely that the user will be searching for field =
> most-common-value ... but the system probably has no way to know that.

Well, if NULL is most common, it is very probable they would be looking
for "col IS NULL".

> I wonder whether it would help to add even more data to pg_statistic.
> For example, suppose we store the fraction of the column's entries that
> are NULL, plus the most frequently occurring *non-null* value, plus the
> fraction of the entries that are that value.  This would allow us to be
> very smart about columns in which "no data" is represented by NULL (as a
> good DB designer would do):

That would be nice.

> selectivity of "IS NULL": NULLfraction
>
> selectivity of "IS NOT NULL": 1 - NULLfraction
>
> selectivity of "= X" for a known non-null constant X:
>	if X == MFOV: MFOVfraction
>	else: MIN(MFOVfraction, 1 - MFOVfraction - NULLfraction)
>
> selectivity of "= X" when X is not known a priori, but presumably is not
> null:
>	MIN(MFOVfraction, 1 - NULLfraction)
>
> Both of the MIN()s are upper bounds, so multiplying them by a
> fudge-factor < 1 would be reasonable.

Yes, I am with you here.

> These rules would guarantee small selectivity values when either
> MFOVfraction or 1 - NULLfraction is small.  It still wouldn't cost
> much, since I believe VACUUM ANALYZE is counting nulls already...

Yes, it is.  Sounds nice.

--
  Bruce Momjian  <maillist@candle.pha.pa.us>
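Bruce's 0.95 * 0.95 + 0.05 * 0.05 figure is an expected selectivity under the assumption that a value is queried in proportion to how often it appears; generalized beyond two values, it is the sum of squared frequencies. A minimal sketch (the function name is illustrative):

```c
/* Expected selectivity if each value is queried with probability equal
 * to its frequency in the table: sum over values of freq(v)^2.
 * Bruce's two-value example is {0.95, 0.05}. */
static double
expected_selectivity(const double *freqs, int n)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum += freqs[i] * freqs[i];     /* P(query v) * P(row = v) */
    return sum;
}
```

For {0.95, 0.05} this gives 0.905, far above the 0.5 cap; the two positions in the thread differ precisely on whether querying probability should be assumed to track value frequency.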
At 20:21 28/07/99 -0400, you wrote:
>> I wonder whether it would help to add even more data to pg_statistic.
>> For example, suppose we store the fraction of the columns that are NULL,
>> plus the most frequently occurring *non null* value, plus the fraction
>> of the columns that are that value.  This would allow us to be very
>> smart about columns in which "no data" is represented by NULL (as a good
>> DB designer would do):
>
> That would be nice.

I know I've mentioned this before, but can't the designer of the query
be given some influence over optimizer index choices?  We can circle
around the problem of understanding the demographics of a table, but
without row-by-row analysis you'll *never* get the complete and accurate
view that is needed to cater for all cases.  OTOH, a query designer
often knows that a particular query will only be run to find
'exceptions' (i.e. non-nulls when 95% are nulls), or to find 'small'
ranges.  IMO, when a DBA is in a position to help the optimizer, they
*should* be allowed to.

PG *already* has something like this in the form of partial indexes: you
can view the query that is associated with the index as a 'hint' as to
when that index should be used.  All I'm asking is for queries, not
indexes, to specify when an index is used.  This will not in any way
replace the optimizer, but it will give users the ability to deal with
pathological cases.

In terms of the statistics collected, it *may* also be worth doing some
rudimentary analysis on the data to see if it conforms to any common
distribution (or sum of distributions), and if it does, save that
information.  E.g. the optimizer will do pretty well if it *knows* the
data is in a normal distribution, with a mean of 972 and a stdev of 70!
Of course, you must be sure that it *is* a normal distribution to start
with.

FWIW, statisticians often seem worried about three values: the mean,
median and mode.  I don't really know which is which, but they are:

o The average of all values
o The average of the min and max value
o The most common value

Someone who knows a lot more about this stuff than me can probably tell
us how these values will affect the trust we place in the index
statistics.  Someone on this list must be able to give us some
insight???

----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/
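For reference, the standard definitions of the three statistics are: the mean is the average of all values, the median is the middle value of the sorted data (not the average of min and max), and the mode is the most common value. A minimal sketch, with illustrative helper names, assuming the input array is already sorted:

```c
/* mean: arithmetic average of all values */
static double
mean(const double *vals, int n)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum += vals[i];
    return sum / n;
}

/* median: middle element of the sorted data (average of the two
 * middle elements when n is even) */
static double
median_sorted(const double *vals, int n)
{
    return (n % 2) ? vals[n / 2]
                   : (vals[n / 2 - 1] + vals[n / 2]) / 2.0;
}

/* mode: most frequently occurring value (single scan of sorted data) */
static double
mode_sorted(const double *vals, int n)
{
    double best = vals[0], cur = vals[0];
    int bestcnt = 1, curcnt = 1;
    int i;

    for (i = 1; i < n; i++)
    {
        curcnt = (vals[i] == cur) ? curcnt + 1 : 1;
        cur = vals[i];
        if (curcnt > bestcnt)
        {
            bestcnt = curcnt;
            best = cur;
        }
    }
    return best;
}
```

Of the three, the mode (and its frequency) is the one the MFOV proposal in this thread cares about; the mean and median matter more for range (>, <) estimates.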
> FWIW, statisticians often seem worried about three values: the mean,
> median and mode.  I don't really know which is which, but they are:
>
> o The average of all values
> o The average of the min and max value

We have this value.  Good for > and < comparisons.

> o The most common value.

We know the number of times the most common value occurs, but not the
actual value.

> Someone who knows a lot more about this stuff than me can probably tell
> us how these values will affect the trust we place in the index
> statistics.  Someone on this list must be able to give us some
> insight???

--
  Bruce Momjian  <maillist@candle.pha.pa.us>
Tom Lane wrote:
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> BTW, this argument proves rigorously that the selectivity of a search
> >> for any value other than the MFOV is not more than 0.5, so there is some
> >> basis for my intuition that eqsel should not return a value above 0.5.
> >> So, in the cases where eqsel does not know the exact value being
> >> searched for, I'd still be inclined to cap its result at 0.5.
>
> > I don't follow this.  If the most frequent value occurs 95% of the time,
> > wouldn't the selectivity be 0.95?
>
> If you are searching for the most frequent value, then the selectivity
> estimate should indeed be 0.95.  If you are searching for anything else,
> the selectivity estimate ought to be 0.05 or less.  If you don't know
> what value you will be searching for, which number should you use?
>
> The unsupported assumption here is that if the table contains 95%
> occurrence of a particular value, then the odds are also 95% (or at
> least high) that that's the value you are searching for in any given
> query that has an "= something" WHERE qual.
>
> That assumption is pretty reasonable in some cases (such as your
> example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
> database), but it falls down badly in others, such as where the
> most common value is NULL or an empty string or some other indication
> that there's no useful data.  In that sort of situation it's actually
> pretty unlikely that the user will be searching for field =
> most-common-value ... but the system probably has no way to know that.

This is exactly what a partial index is supposed to do.  And then the
system knows it...

                     - Thomas

--
Thomas Lockhart    lockhart@alumni.caltech.edu    South Pasadena, California
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Tom Lane wrote:
>> ... it falls down badly in others, such as where the
>> most common value is NULL or an empty string or some other indication
>> that there's no useful data.  In that sort of situation it's actually
>> pretty unlikely that the user will be searching for field =
>> most-common-value ... but the system probably has no way to know that.

> This is exactly what a partial index is supposed to do.  And then the
> system knows it...

I've heard a couple of people assert in this thread that partial indexes
are the answer, but I don't believe it.  Two reasons:

(1) The system won't use a partial index *at all* unless it can prove
that the index's predicate (condition for including tuples) is implied
by the query's WHERE condition.  So the predicate doesn't add a thing
to the system's knowledge about the query.

(2) The statistics that we have available are stats about a column.
Not stats about a column given the predicate of some index.  So there's
no gain in our statistical knowledge either.

Partial indexes might be a component of a solution, but they are
very far from being a solution all by themselves.

			regards, tom lane

PS: a quick glance at gram.y shows that we don't actually accept
partial-index predicates in CREATE INDEX, so Andreas was right that
the feature got ripped out at some point.  I have no idea how much
work might be required to re-enable it... but I'll bet it's not
trivial.
Tom Lane wrote:
> (2) The statistics that we have available are stats about a column.
> Not stats about a column given the predicate of some index.  So there's
> no gain in our statistical knowledge either.

If we added just a count of NULLs, we would cover the NOT NULL case.

> Partial indexes might be a component of a solution, but they are
> very far from being a solution all by themselves.
>
> PS: a quick glance at gram.y shows that we don't actually accept
> partial-index predicates in CREATE INDEX, so Andreas was right that
> the feature got ripped out at some point.  I have no idea how much
> work might be required to re-enable it... but I'll bet it's not
> trivial.

That's why I suggested getting just the simplest case (NOT NULL)
working first.  The more general approach would of course be to gather
stats by index.

--------------
Hannu
> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > Tom Lane wrote:
> >> ... it falls down badly in others, such as where the
> >> most common value is NULL or an empty string or some other indication
> >> that there's no useful data.  In that sort of situation it's actually
> >> pretty unlikely that the user will be searching for field =
> >> most-common-value ... but the system probably has no way to know that.
>
> > This is exactly what a partial index is supposed to do.  And then the
> > system knows it...
>
> I've heard a couple of people assert in this thread that partial indexes
> are the answer, but I don't believe it.  Two reasons:
>
> (1) The system won't use a partial index *at all* unless it can prove
> that the index's predicate (condition for including tuples) is implied
> by the query's WHERE condition.  So the predicate doesn't add a thing
> to the system's knowledge about the query.
>
> (2) The statistics that we have available are stats about a column.
> Not stats about a column given the predicate of some index.  So there's
> no gain in our statistical knowledge either.
>
> Partial indexes might be a component of a solution, but they are
> very far from being a solution all by themselves.

Agreed.

--
  Bruce Momjian  <maillist@candle.pha.pa.us>
OK, I haven't heard any objections to my last proposal for improving
selectivity estimates for "=", so I'm going to bull ahead and implement
it.

This will require adding columns to a system table, which I've never
done before.  There are some attribute statistics in pg_attribute and
some in pg_statistic, but it looks like changing pg_attribute is a
pretty dangerous business, so I'm inclined to leave pg_attribute alone
and just add columns to pg_statistic.

Do I need to do anything beyond making the obvious additions to
catalog/pg_statistic.h, rebuild, and initdb?  I see that pg_attribute.h
doesn't contain any handmade entries for pg_statistic, so that at least
is no problem...

			regards, tom lane
> OK, I haven't heard any objections to my last proposal for improving
> selectivity estimates for "=", so I'm going to bull ahead and implement it.
>
> This will require adding columns to a system table, which I've never
> done before.  There are some attribute statistics in pg_attribute and
> some in pg_statistic, but it looks like changing pg_attribute is a
> pretty dangerous business, so I'm inclined to leave pg_attribute alone
> and just add columns to pg_statistic.
>
> Do I need to do anything beyond making the obvious additions to
> catalog/pg_statistic.h, rebuild, and initdb?  I see that pg_attribute.h
> doesn't contain any handmade entries for pg_statistic, so that at least
> is no problem...

No, that's pretty much it.  You have to make sure you get it all
correct, or it will not work, but you know that.  The only other thing
is that you have to make sure that every insert into pg_statistic in
the backend knows about the new fields.  I do that by looking for all
references to pg_statistic in the backend, and making sure I have the
new fields covered.  If you add a varlena field, and there is an index
on the table, you may have to add something to the cache.

--
  Bruce Momjian  <maillist@candle.pha.pa.us>
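The kind of catalog addition being discussed can be sketched as a C struct. This is purely illustrative: the field names below are invented and do not match the real pg_statistic layout of any PostgreSQL release:

```c
/* Hypothetical sketch of a pg_statistic row with the proposed extra
 * columns; names and layout are invented for illustration only. */
typedef struct FormData_pg_statistic_sketch
{
    unsigned int starelid;      /* OID of the relation this row is for */
    short        staattnum;     /* attribute (column) number */
    float        stanullfrac;   /* proposed: fraction of entries that are NULL */
    float        stacommonfrac; /* proposed: fraction that equal the MFOV */
    /* the MFOV itself would follow as a variable-length (varlena)
     * field, which is why Bruce warns about the cache code */
} FormData_pg_statistic_sketch;
```

As the reply notes, the struct change is the easy part; every place the backend builds or reads a pg_statistic tuple has to be taught about the new fields.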