Thread: Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Zeugswetter Andreas IZ5
> So, the selectivity that a search for the most common value would
> have is a reasonable estimate for the selectivity of a search for any
> value.  That's a bogus assumption in this case --- but it's hard to
> justify making any other assumption in general.
> 
Other db's usually use the value count(*) / nunique for the light weight
statistics.
This makes the assumption that the distinct index values are evenly
distributed.
That is on average a correct assumption, whereas our assumption on average
overestimates the number of rows returned.
I am not sure we have nunique info, though.
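The heuristic described above can be sketched in a few lines. This is an illustrative sketch only, not PostgreSQL's actual statistics API; the function name and arguments are invented for the example.

```python
# Sketch of the "light weight" estimate: assume the ndistinct values
# are evenly distributed, so any "= const" qual matches
# count(*) / ndistinct rows.

def eqsel_uniform(total_rows: int, ndistinct: int) -> float:
    """Selectivity of '= const' under the even-distribution assumption."""
    if total_rows == 0 or ndistinct == 0:
        return 0.0
    # (rows per value) / total rows == 1 / ndistinct
    return (total_rows / ndistinct) / total_rows

print(eqsel_uniform(10000, 50))  # 50 values, ~200 rows each -> 0.02
```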

Andreas



Zeugswetter Andreas IZ5 <Andreas.Zeugswetter@telecom.at> writes:
> Other db's usually use the value count(*) / nunique for the light
> weight statistics.  This makes the assumption that the distinct index
> values are evenly distributed.  That is on average a correct
> assumption, whereas our assumption on average overestimates the number
> of rows returned.  I am not sure we have a nunique info though.

We don't, and AFAICS it would be an expensive statistic to compute.

I have thought about this a little more overnight, and I have come up
with what I think is a better idea.  Suppose that VACUUM ANALYZE stores
in pg_statistic not only the disbursion, but also the most frequently
occurring value of each column.  It already computes (or I should say
estimates) the most frequently occurring value (MFOV) in order to arrive
at the disbursion, so storing the value costs nothing except a little
more space in pg_statistic.  Now, the logic that eqsel() should use is

    if constant-being-compared-against == MFOV then
        return disbursion;
    else
        return MIN(disbursion, 1.0 - disbursion);

which works like this: if we are indeed looking for the MFOV then the
selectivity is just the disbursion, no question.  If we are looking for
a value *other* than the MFOV, then the selectivity must be less than
the disbursion, since surely this value occurs less often than the MFOV.
But the total fraction of non-MFOV values in the table is
1.0-disbursion, so the fraction that are the specific value we want
can't exceed that either.

The MIN() above is therefore a hard upper bound for the selectivity
of the non-MFOV case.  In practice we might want to multiply the MIN
by a fudge-factor somewhat less than one, to arrive at what we hope
is a reasonable estimate rather than a worst-case estimate.
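The proposed eqsel() logic can be sketched as follows, assuming VACUUM ANALYZE has stored both the disbursion (the frequency of the most frequently occurring value) and the MFOV itself. The 0.5 fudge factor is only a placeholder for the "somewhat less than one" multiplier suggested above, not a value from this discussion.

```python
# Sketch of the proposed selectivity logic for "=".

def eqsel(const, mfov, disbursion: float, fudge: float = 0.5) -> float:
    if const == mfov:
        # Searching for the most frequent value: its frequency is exact.
        return disbursion
    # Any other value occurs at most min(disbursion, 1 - disbursion)
    # of the time, so that MIN is a hard upper bound; scale it down to
    # get a plausible estimate rather than a worst case.
    return fudge * min(disbursion, 1.0 - disbursion)

# 95% of rows are 'PA': searching for 'PA' yields 0.95, while
# searching for 'NY' yields at most 0.05 (times the fudge factor).
print(eqsel('PA', 'PA', 0.95))  # 0.95
print(eqsel('NY', 'PA', 0.95))  # ~0.025
```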

BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.

If we use this logic, the stat we really want is exactly the frequency
of the MFOV, not the disbursion which is just closely related to it.
I have not looked at the other uses of disbursion, but if they all can
work like this we might want to forget the statistical niceties and just
store the frequency of the MFOV.

A final comment is that NULL would be treated just like any regular
value in determining what is the MFOV...
        regards, tom lane


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Bruce Momjian
> 
> > So, the selectivity that a search for the most common value would
> > have is a reasonable estimate for the selectivity of a search for any
> > value.  That's a bogus assumption in this case --- but it's hard to
> > justify making any other assumption in general.
> > 
> Other db's usually use the value count(*) / nunique for the light weight
> statistics.
> This makes the assumption that the distinct index values are evenly
> distributed.
> That is on average a correct assumption, whereas our assumption on average
> overestimates the number of rows returned.
> I am not sure we have a nunique info though.
> 

Yes, that's the problem.  Figuring out the number of uniques is hard,
especially with no index.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Bruce Momjian
> Zeugswetter Andreas IZ5 <Andreas.Zeugswetter@telecom.at> writes:
> > Other db's usually use the value count(*) / nunique for the light
> > weight statistics.  This makes the assumption that the distinct index
> > values are evenly distributed.  That is on average a correct
> > assumption, whereas our assumption on average overestimates the number
> > of rows returned.  I am not sure we have a nunique info though.
> 
> We don't, and AFAICS it would be an expensive statistic to compute.
> 
> I have thought about this a little more overnight, and I have come up
> with what I think is a better idea.  Suppose that VACUUM ANALYZE stores
> in pg_statistic not only the disbursion, but also the most frequently
> occurring value of each column.  It already computes (or I should say
> estimates) the most frequently occurring value (MFOV) in order to arrive
> at the disbursion, so storing the value costs nothing except a little
> more space in pg_statistic.  Now, the logic that eqsel() should use is
> 
>     if constant-being-compared-against == MFOV then
>         return disbursion;
>     else
>         return MIN(disbursion, 1.0 - disbursion);

Yes, I like this.

> BTW, this argument proves rigorously that the selectivity of a search
> for any value other than the MFOV is not more than 0.5, so there is some
> basis for my intuition that eqsel should not return a value above 0.5.
> So, in the cases where eqsel does not know the exact value being
> searched for, I'd still be inclined to cap its result at 0.5.

I don't follow this. If the most frequent value occurs 95% of the time,
wouldn't the selectivity be 0.95?  And why use 1-disbursion.  You may
find the existing code does  a better job than the MIN() computation.




Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> BTW, this argument proves rigorously that the selectivity of a search
>> for any value other than the MFOV is not more than 0.5, so there is some
>> basis for my intuition that eqsel should not return a value above 0.5.
>> So, in the cases where eqsel does not know the exact value being
>> searched for, I'd still be inclined to cap its result at 0.5.

> I don't follow this. If the most frequent value occurs 95% of the time,
> wouldn't the selectivity be 0.95?

If you are searching for the most frequent value, then the selectivity
estimate should indeed be 0.95.  If you are searching for anything else,
the selectivity estimate ought to be 0.05 or less.  If you don't know
what value you will be searching for, which number should you use?

The unsupported assumption here is that if the table contains 95%
occurrence of a particular value, then the odds are also 95% (or at
least high) that that's the value you are searching for in any given
query that has an "= something" WHERE qual.

That assumption is pretty reasonable in some cases (such as your
example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
database), but it falls down badly in others, such as where the
most common value is NULL or an empty string or some other indication
that there's no useful data.  In that sort of situation it's actually
pretty unlikely that the user will be searching for field =
most-common-value ... but the system probably has no way to know that.

I wonder whether it would help to add even more data to pg_statistic.
For example, suppose we store the fraction of the columns that are NULL,
plus the most frequently occurring *non null* value, plus the fraction
of the columns that are that value.  This would allow us to be very
smart about columns in which "no data" is represented by NULL (as a good
DB designer would do):

selectivity of "IS NULL": NULLfraction

selectivity of "IS NOT NULL": 1 - NULLfraction

selectivity of "= X" for a known non-null constant X:
    if X == MFOV: MFOVfraction
    else: MIN(MFOVfraction, 1-MFOVfraction-NULLfraction)

selectivity of "= X" when X is not known a priori, but presumably is not
null:
    MIN(MFOVfraction, 1-NULLfraction)

Both of the MIN()s are upper bounds, so multiplying them by a
fudge-factor < 1 would be reasonable.

These rules would guarantee small selectivity values when either
MFOVfraction or 1-NULLfraction is small.  It still wouldn't cost
much, since I believe VACUUM ANALYZE is counting nulls already...
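The four estimation rules above can be sketched with the proposed pg_statistic fields spelled out as plain arguments. The function names and the FUDGE constant are invented for illustration; 0.5 is only a placeholder for the suggested sub-unity multiplier.

```python
# Sketch of the NULL-aware selectivity rules.
FUDGE = 0.5  # placeholder fudge factor < 1 applied to the MIN() upper bounds

def sel_is_null(null_frac: float) -> float:
    return null_frac

def sel_is_not_null(null_frac: float) -> float:
    return 1.0 - null_frac

def sel_eq_known(x, mfov, mfov_frac: float, null_frac: float) -> float:
    """Selectivity of '= X' for a known non-null constant X."""
    if x == mfov:
        return mfov_frac
    # Non-MFOV, non-NULL rows total 1 - mfov_frac - null_frac.
    return FUDGE * min(mfov_frac, 1.0 - mfov_frac - null_frac)

def sel_eq_unknown(mfov_frac: float, null_frac: float) -> float:
    """Selectivity of '= X' when X is unknown but presumed non-null."""
    return FUDGE * min(mfov_frac, 1.0 - null_frac)

# Column where 90% of rows are NULL and the top non-null value covers 6%:
print(sel_is_null(0.90))                     # 0.9
print(sel_eq_known('x', 'top', 0.06, 0.90))  # ~0.02: small, as desired
```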
        regards, tom lane


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Bruce Momjian
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> BTW, this argument proves rigorously that the selectivity of a search
> >> for any value other than the MFOV is not more than 0.5, so there is some
> >> basis for my intuition that eqsel should not return a value above 0.5.
> >> So, in the cases where eqsel does not know the exact value being
> >> searched for, I'd still be inclined to cap its result at 0.5.
> 
> > I don't follow this. If the most frequent value occurs 95% of the time,
> > wouldn't the selectivity be 0.95?
> 
> If you are searching for the most frequent value, then the selectivity
> estimate should indeed be 0.95.  If you are searching for anything else,
> the selectivity estimate ought to be 0.05 or less.  If you don't know
> what value you will be searching for, which number should you use?

You are going to love this:
0.95 * 0.95 + 0.05 * 0.05

This is because with 95% of one value, you would think they would ask
for that value 95% of the time, and another value 5% of the time.  The
last 0.05 is not really accurate.  It assumes there are only two unique
values in the table, which may be wrong, but it is close enough.
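The arithmetic above generalizes: if each value is searched for in proportion to how often it occurs, the expected selectivity of "= X" is the sum of the squared value frequencies. The function name here is invented for the example.

```python
# Expected selectivity when query constants follow the data distribution:
# sum over values of P(ask for v) * P(row is v) = sum of f_v^2.

def expected_eq_selectivity(freqs):
    return sum(f * f for f in freqs)

# Two-value approximation from the message above:
print(round(expected_eq_selectivity([0.95, 0.05]), 6))  # 0.905
```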

> 
> The unsupported assumption here is that if the table contains 95%
> occurrence of a particular value, then the odds are also 95% (or at
> least high) that that's the value you are searching for in any given
> query that has an "= something" WHERE qual.

Yes.

> That assumption is pretty reasonable in some cases (such as your
> example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
> database), but it falls down badly in others, such as where the
> most common value is NULL or an empty string or some other indication
> that there's no useful data.  In that sort of situation it's actually
> pretty unlikely that the user will be searching for field =
> most-common-value ... but the system probably has no way to know that.

Well, if null is most common, it is very probable they would be looking
for col IS NULL.

> I wonder whether it would help to add even more data to pg_statistic.
> For example, suppose we store the fraction of the columns that are NULL,
> plus the most frequently occurring *non null* value, plus the fraction
> of the columns that are that value.  This would allow us to be very
> smart about columns in which "no data" is represented by NULL (as a good
> DB designer would do):

That would be nice.

> 
> selectivity of "IS NULL": NULLfraction
> 
> selectivity of "IS NOT NULL": 1 - NULLfraction
> 
> selectivity of "= X" for a known non-null constant X:
>     if X == MFOV: MFOVfraction
>     else: MIN(MFOVfraction, 1-MFOVfraction-NULLfraction)
> 
> selectivity of "= X" when X is not known a priori, but presumably is not
> null:
>     MIN(MFOVfraction, 1-NULLfraction)
> 
> Both of the MIN()s are upper bounds, so multiplying them by a
> fudge-factor < 1 would be reasonable.

Yes, I am with you here.

> These rules would guarantee small selectivity values when either
> MFOVfraction or 1-NULLfraction is small.  It still wouldn't cost
> much, since I believe VACUUM ANALYZE is counting nulls already...

Yes, it is.  Sounds nice.



Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Philip Warner
At 20:21 28/07/99 -0400, you wrote:
>
>> I wonder whether it would help to add even more data to pg_statistic.
>> For example, suppose we store the fraction of the columns that are NULL,
>> plus the most frequently occurring *non null* value, plus the fraction
>> of the columns that are that value.  This would allow us to be very
>> smart about columns in which "no data" is represented by NULL (as a good
>> DB designer would do):
>
>That would be nice.
>

I know I've mentioned this before, but can't the designer of the query be
given some influence over optimizer index choices? We can circle around the
problem of understanding the demographics of a table, but without
row-by-row analysis, you'll *never* get the complete and accurate view that
is needed to cater for all cases.

OTOH, a query designer often knows that a particular query will only be run
to find 'exceptions' (ie. non-nulls when 95% are nulls), or to find 'small'
ranges. IMO, when a DBA is in a position to help the optimizer, they
*should* be allowed to. PG *already* has something like this in the form of
partial indexes: you can view the query that is associated with the index
as a 'hint' as to when that index should be used. All I'm asking is for
queries, not indexes, to specify when an index is used.

This will not in any way replace the optimizer, but it will give users the
ability to deal with pathological cases.

In terms of the statistics collected, it *may* also be worth doing some
rudimentary analysis on the data to see if it conforms to any common
distribution (or sum of distributions), and if it does, save that
information. eg. the optimizer will do pretty well if it *knows* the data
is in a normal distribution, with a mean of 972 and a stdev of 70! Of
course, you must be sure that it *is* a normal distribution to start with.
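The idea above can be illustrated with the standard library alone: if the optimizer *knew* a column was normally distributed with mean 972 and stdev 70, it could estimate range selectivities from the normal CDF. This is a sketch of the statistical idea, not a proposal for the actual pg_statistic layout.

```python
import math

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(value <= x) for a normal distribution with mean mu, stdev sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def range_selectivity(lo: float, hi: float, mu: float, sigma: float) -> float:
    """Estimated fraction of rows with lo < col < hi."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# About 68% of values fall within one stdev of the mean:
print(round(range_selectivity(902, 1042, 972, 70), 3))  # 0.683
```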

FWIW, statisticians often seem worried about three values: the mean, median
and mode. I don't really know which is which, but they are:

o The average of all values (the mean)
o The average of the min and max value (strictly the midrange, not the median)
o The most common value (the mode).

Someone who knows a lot more about this stuff than me can probably tell us
how these values will affect the trust we place in the index statistics.
Someone on this list must be able to give us some insight???


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Bruce Momjian
> FWIW, statisticians often seem worried about three values: the mean, median
> and mode. I don't really know which is which, but they are:
> 
> o The average of all values
> o The average of the min and max value

We have this value.  Good for > and < comparisons.

> o The most common value.

We know the number of times the most common value occurs, but not the
actual value.

> 
> Someone who knows a lot more about this stuff than me can probably tell us
> how these values will affect the trust we place in the index statistics.
> Someone on this list must be able to give us some insight???
> 
> 
> ----------------------------------------------------------------
> Philip Warner                    |     __---_____
> Albatross Consulting Pty. Ltd.   |----/       -  \
> (A.C.N. 008 659 498)             |          /(@)   ______---_
> Tel: +61-03-5367 7422            |                 _________  \
> Fax: +61-03-5367 7430            |                 ___________ |
> Http://www.rhyme.com.au          |                /           \|
>                                  |    --________--
> PGP key available upon request,  |  /
> and from pgp5.ai.mit.edu:11371   |/
> 




Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Thomas Lockhart
Tom Lane wrote:
> 
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >> BTW, this argument proves rigorously that the selectivity of a search
> >> for any value other than the MFOV is not more than 0.5, so there is some
> >> basis for my intuition that eqsel should not return a value above 0.5.
> >> So, in the cases where eqsel does not know the exact value being
> >> searched for, I'd still be inclined to cap its result at 0.5.
> 
> > I don't follow this. If the most frequent value occurs 95% of the time,
> > wouldn't the selectivity be 0.95?
> 
> If you are searching for the most frequent value, then the selectivity
> estimate should indeed be 0.95.  If you are searching for anything else,
> the selectivity estimate ought to be 0.05 or less.  If you don't know
> what value you will be searching for, which number should you use?
> 
> The unsupported assumption here is that if the table contains 95%
> occurrence of a particular value, then the odds are also 95% (or at
> least high) that that's the value you are searching for in any given
> query that has an "= something" WHERE qual.
> 
> That assumption is pretty reasonable in some cases (such as your
> example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
> database), but it falls down badly in others, such as where the
> most common value is NULL or an empty string or some other indication
> that there's no useful data.  In that sort of situation it's actually
> pretty unlikely that the user will be searching for field =
> most-common-value ... but the system probably has no way to know that.

This is exactly what a partial index is supposed to do. And then the
system knows it...
                      - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Tom Lane wrote:
>> ... it falls down badly in others, such as where the
>> most common value is NULL or an empty string or some other indication
>> that there's no useful data.  In that sort of situation it's actually
>> pretty unlikely that the user will be searching for field =
>> most-common-value ... but the system probably has no way to know that.

> This is exactly what a partial index is supposed to do. And then the
> system knows it...

I've heard a couple of people assert in this thread that partial indexes
are the answer, but I don't believe it.  Two reasons:

(1) The system won't use a partial index *at all* unless it can prove
that the index's predicate (condition for including tuples) is implied
by the query's WHERE condition.  So the predicate doesn't add a thing
to the system's knowledge about the query.

(2) The statistics that we have available are stats about a column.
Not stats about a column given the predicate of some index.  So there's
no gain in our statistical knowledge either.

Partial indexes might be a component of a solution, but they are
very far from being a solution all by themselves.
        regards, tom lane

PS: a quick glance at gram.y shows that we don't actually accept
partial-index predicates in CREATE INDEX, so Andreas was right that
the feature got ripped out at some point.  I have no idea how much
work might be required to re-enable it... but I'll bet it's not
trivial.


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Hannu Krosing
Tom Lane wrote:
> 
> (2) The statistics that we have available are stats about a column.
> Not stats about a column given the predicate of some index.  So there's
> no gain in our statistical knowledge either.

If we added just a count of NULLs, we would cover the NOT NULL case.

> Partial indexes might be a component of a solution, but they are
> very far from being a solution all by themselves.
> 
>                         regards, tom lane
> 
> PS: a quick glance at gram.y shows that we don't actually accept
> partial-index predicates in CREATE INDEX, so Andreas was right that
> the feature got ripped out at some point.  I have no idea how much
> work might be required to re-enable it... but I'll bet it's not
> trivial.

That's why I suggested getting just the simplest case (NOT NULL) 
working first.

The more general approach would of course be to gather stats by index.

--------------
Hannu


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Bruce Momjian
> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > Tom Lane wrote:
> >> ... it falls down badly in others, such as where the
> >> most common value is NULL or an empty string or some other indication
> >> that there's no useful data.  In that sort of situation it's actually
> >> pretty unlikely that the user will be searching for field =
> >> most-common-value ... but the system probably has no way to know that.
> 
> > This is exactly what a partial index is supposed to do. And then the
> > system knows it...
> 
> I've heard a couple of people assert in this thread that partial indexes
> are the answer, but I don't believe it.  Two reasons:
> 
> (1) The system won't use a partial index *at all* unless it can prove
> that the index's predicate (condition for including tuples) is implied
> by the query's WHERE condition.  So the predicate doesn't add a thing
> to the system's knowledge about the query.
> 
> (2) The statistics that we have available are stats about a column.
> Not stats about a column given the predicate of some index.  So there's
> no gain in our statistical knowledge either.
> 
> Partial indexes might be a component of a solution, but they are
> very far from being a solution all by themselves.

Agreed.


OK, I haven't heard any objections to my last proposal for improving
selectivity estimates for "=", so I'm going to bull ahead and implement it.

This will require adding columns to a system table, which I've never
done before.  There are some attribute statistics in pg_attribute and
some in pg_statistic, but it looks like changing pg_attribute is a
pretty dangerous business, so I'm inclined to leave pg_attribute alone
and just add columns to pg_statistic.

Do I need to do anything beyond making the obvious additions to
catalog/pg_statistic.h, rebuild, and initdb?  I see that pg_attribute.h
doesn't contain any handmade entries for pg_statistic, so that at least
is no problem...
        regards, tom lane


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Bruce Momjian
> OK, I haven't heard any objections to my last proposal for improving
> selectivity estimates for "=", so I'm going to bull ahead and implement it.
> 
> This will require adding columns to a system table, which I've never
> done before.  There are some attribute statistics in pg_attribute and
> some in pg_statistic, but it looks like changing pg_attribute is a
> pretty dangerous business, so I'm inclined to leave pg_attribute alone
> and just add columns to pg_statistic.
> 
> Do I need to do anything beyond making the obvious additions to
> catalog/pg_statistic.h, rebuild, and initdb?  I see that pg_attribute.h
> doesn't contain any handmade entries for pg_statistic, so that at least
> is no problem...

No, that's pretty much it.  You have to make sure you get it all
correct, or it will not work, but you know that.  The only other thing
is that you have to make sure that every insert into pg_statistic in the
backend knows about the new fields.  I do that by looking for all
references to pg_statistic in the backend, and making sure I have the
new fields covered.  If you add a varlena field, and there is an index
on the table, you may have to add something to the cache.

