Thread: Re: [HACKERS] optimizer pruning problem

Re: [HACKERS] optimizer pruning problem

From: Roberto Cornacchia
Thanks a lot for your explanations.
We have started working with the snapshot and have found a problem with
the selectivity estimation of join clauses; you probably already know
about this.
When the join is between attnos < 0 (such as oids), the selectivity is
estimated as 0.5 (leading to very bad size estimates), since this code
in function compute_clause_selec (clausesel.c):
  if (relid1 > 0 && relid2 > 0 && attno1 > 0 && attno2 > 0)
    ...
  else
    s1 = (Cost) (0.5);

So what is the aim of the last two AND conditions?

Best regards

Roberto Cornacchia
Andrea Ghidini


Re: [HACKERS] optimizer pruning problem

From: Tom Lane
> When the join is between attnos < 0 (such as oids), the selectivity is
> estimated as 0.5 (leading to very bad size estimates), since this code
> in function compute_clause_selec (clausesel.c):

>   if (relid1 > 0 && relid2 > 0 && attno1 > 0 && attno2 > 0)
>     ...
>   else
>     s1 = (Cost) (0.5);

> So what is the aim of the last two AND conditions?

That's a bug, I guess.  -1 is used to signal "couldn't find the
attribute", but there's no real need to check *both* relid and attno
to determine that.  It should consider positive relid and negative
attno to be valid.
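
For illustration only, a minimal sketch of how the relaxed test could
look (variable names follow the snippet quoted above; this is a guess
at the shape of the fix, not the actual code):

    /*
     * Sketch: a join clause is usable for the statistics lookup as long
     * as both relations were identified.  A negative attno just means a
     * system attribute (e.g. oid), not "couldn't find the attribute".
     */
    static int
    join_clause_attrs_found(int relid1, int relid2)
    {
        return relid1 > 0 && relid2 > 0;
    }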

Since vacuum doesn't record statistics for the system attributes,
there probably also needs to be a hack in the code that looks in
pg_statistic so that it will produce reasonable estimates.   We
should assume that OID has perfect disbursion, for sure.  I don't
know if we can assume anything much about the other sys attributes...
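
As a rough sketch of what that hack might look like (the attribute
number for oid is the backend's usual -2; the helper itself is
hypothetical, not the real pg_statistic lookup):

    #define ObjectIdAttributeNumber  (-2)   /* oid's system attno */

    /*
     * Hypothetical helper: VACUUM stores nothing for system attributes,
     * so guess a disbursion for them before consulting pg_statistic.
     * oid is unique per row, so its most common value occurs exactly
     * once: disbursion ~ 1/ntuples.  For the other system attributes
     * we return 0 ("unknown") and let the caller fall back to a default.
     */
    static double
    system_attr_disbursion(int attno, double ntuples)
    {
        if (attno == ObjectIdAttributeNumber && ntuples > 0)
            return 1.0 / ntuples;
        return 0.0;
    }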
        regards, tom lane


RE: [HACKERS] optimizer pruning problem

From: "Hiroshi Inoue"
> 
> > When the join is between attnos < 0 (such as oids), the selectivity is
> > estimated as 0.5 (leading to very bad size estimates), since this code
> > in function compute_clause_selec (clausesel.c):
> 
> >   if (relid1 > 0 && relid2 > 0 && attno1 > 0 && attno2 > 0)
> >     ...
> >   else
> >     s1 = (Cost) (0.5);
> 
> > So what is the aim of the last two AND conditions?
> 
> That's a bug, I guess.  -1 is used to signal "couldn't find the
> attribute", but there's no real need to check *both* relid and attno
> to determine that.  It should consider positive relid and negative
> attno to be valid.
>
> Since vacuum doesn't record statistics for the system attributes,
> there probably also needs to be a hack in the code that looks in
> pg_statistic so that it will produce reasonable estimates.   We
> should assume that OID has perfect disbursion, for sure.  I don't
> know if we can assume anything much about the other sys attributes...
>

CTID has perfect disbursion too. The selectivity is needed in order to
estimate the number of rows returned by a scan using TIDs, though we
can't use a WHERE restriction on ctid yet.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Top N queries and disbursion

From: Roberto Cornacchia
We have finished our work on the optimization of Top N queries (with the
clauses STOP AFTER ... FOR EACH ... RANK BY ...), and now it's time to
validate it with performance tests. Since we are working on snapshots of
the 6.6 release (currently the snapshot dated 9/13/99), we are afraid that
instability problems may affect the results. Could you give us any
suggestions about this? We are quite close to our degree day, so we have
to use our time well...
BTW, the first results look interesting.

We would like to ask you one last thing.
We need to estimate the number of distinct values of an attribute. We
thought 1/disbursion was the right solution, but the results were quite
wrong:
with 100 distinct values of an attribute uniformly distributed in a
relation of 10000 tuples, disbursion was estimated as 0.002275, giving
us 440 distinct values. We have seen that these disbursion estimates
also result in bad join selectivity estimates.
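(Spelling out the arithmetic: with a truly uniform column each value
occurs 100/10000 = 0.01 of the time, so 1/disbursion would give back
exactly 100 distinct values; the estimated 0.002275 gives
1/0.002275 ~ 440 instead.)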
Could this be due to a bad disbursion estimate, or is our solution
completely wrong?
Thanks a lot

Roberto Cornacchia
Andrea Ghidini


Re: Top N queries and disbursion

From: Tom Lane
Roberto Cornacchia <rcorna@tin.it> writes:
> ... Since we are working on snapshots of the 6.6
> release (currently the snapshot dated 9/13/99), we are afraid that
> instability problems may affect the results. Could you give us any
> suggestions about this? We are quite close to our degree day, so we
> have to use our time well...

If you don't want to spend time tracking development changes then you
probably ought to stick with the snapshot you have.  I don't see any
reason that you should try to track changes right now...


> We need to estimate the number of distinct values of an attribute. We
> thought 1/disbursion was the right solution, but the results were quite
> wrong:

No, it's certainly not the right thing.  To my understanding, disbursion
is a measure of the frequency of the most common value of an attribute;
but that tells you very little about how many other values there are.
1/disbursion is a lower bound on the number of values, but it wouldn't
be a good estimate unless you had reason to think that the values were
pretty evenly distributed.  There could be a *lot* of very-infrequent
values.
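
For instance, a made-up skewed column: in a 10000-row table where one
value fills 2000 rows and the remaining 8000 rows are all distinct, the
disbursion (frequency of the most common value) is about 0.2, so
1/disbursion says 5 distinct values, while the true number is 8001.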

> with 100 distinct values of an attribute uniformly distributed in a
> relation of 10000 tuples, disbursion was estimated as 0.002275, giving
> us 440 distinct values.

This is an illustration of the fact that Postgres' disbursion-estimator
is pretty bad :-(.  It usually underestimates the frequency of the most
common value, unless the most common value is really frequent
(probability > 0.2 or so).  I've been trying to think of a more accurate
way of figuring the statistic that wouldn't be unreasonably slow.
Or, perhaps, we should forget all about disbursion and adopt some other
statistic(s).
        regards, tom lane


Re: [HACKERS] Re: Top N queries and disbursion

From: Bruce Momjian
> No, it's certainly not the right thing.  To my understanding, disbursion
> is a measure of the frequency of the most common value of an attribute;
> but that tells you very little about how many other values there are.
> 1/disbursion is a lower bound on the number of values, but it wouldn't
> be a good estimate unless you had reason to think that the values were
> pretty evenly distributed.  There could be a *lot* of very-infrequent
> values.
> 
> > with 100 distinct values of an attribute uniformly distributed in a
> > relation of 10000 tuples, disbursion was estimated as 0.002275, giving
> > us 440 distinct values.
> 
> This is an illustration of the fact that Postgres' disbursion-estimator
> is pretty bad :-(.  It usually underestimates the frequency of the most
> common value, unless the most common value is really frequent
> (probability > 0.2 or so).  I've been trying to think of a more accurate
> way of figuring the statistic that wouldn't be unreasonably slow.
> Or, perhaps, we should forget all about disbursion and adopt some other
> statistic(s).

Yes, you have the crux of the issue.  I wrote it because it was the best
thing I could think of, but it is non-optimal.  All the optimal
solutions seemed too slow to me, so I couldn't come up with a better
one.

Here is my narrative on it from vacuum.c:

---------------------------------------------------------------------------
 *  We compute the column min, max, null and non-null counts.
 *  Plus we attempt to find the count of the value that occurs most
 *  frequently in each column.
 *  These figures are used to compute the selectivity of the column.
 *
 *  We use a three-bucket cache to get the most frequent item.
 *  The 'guess' buckets count hits.  A cache miss causes guess1 to get
 *  the most hit 'guess' item in the most recent cycle, and the new
 *  item goes into guess2.  Whenever the total count of hits of a
 *  'guess' entry is larger than 'best', 'guess' becomes 'best'.
 *
 *  This method works perfectly for columns with unique values, and
 *  columns with only two unique values, plus nulls.
 *
 *  It becomes less perfect as the number of unique values increases
 *  and their distribution in the table becomes more random.
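
A rough sketch of that scheme in plain C, stripped of the backend's
Datum and per-column machinery (the real logic is in vacuum.c and
differs in detail; all state is assumed to start out zeroed):

    typedef struct
    {
        int  value;              /* stand-in for the column value */
        long hits;               /* times this value has been seen */
    } Bucket;

    typedef struct
    {
        Bucket best;             /* current most-common-value candidate */
        Bucket guess1;           /* most promising recent value */
        Bucket guess2;           /* newest value being counted */
    } MCVState;

    static void
    mcv_observe(MCVState *st, int value)
    {
        Bucket *top;

        if (st->best.hits > 0 && value == st->best.value)
            st->best.hits++;
        else if (st->guess1.hits > 0 && value == st->guess1.value)
            st->guess1.hits++;
        else if (st->guess2.hits > 0 && value == st->guess2.value)
            st->guess2.hits++;
        else
        {
            /* cache miss: keep the better of the two guesses as guess1
             * and start counting the new value in guess2 */
            if (st->guess2.hits > st->guess1.hits)
                st->guess1 = st->guess2;
            st->guess2.value = value;
            st->guess2.hits = 1;
        }

        /* whenever a guess has more hits than 'best', it becomes 'best' */
        top = (st->guess1.hits >= st->guess2.hits) ? &st->guess1 : &st->guess2;
        if (top->hits > st->best.hits)
        {
            Bucket tmp = *top;

            *top = st->best;
            st->best = tmp;
        }
    }

In the real code values are compared with the datatype's equality
function and a separate set of buckets is kept for each column, but the
bucket shuffle is the same idea.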
 


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] Re: Top N queries and disbursion

From: Roberto Cornacchia
Bruce Momjian wrote:
> 
> > No, it's certainly not the right thing.  To my understanding, disbursion
> > is a measure of the frequency of the most common value of an attribute;
> > but that tells you very little about how many other values there are.
> > 1/disbursion is a lower bound on the number of values, but it wouldn't
> > be a good estimate unless you had reason to think that the values were
> > pretty evenly distributed.  There could be a *lot* of very-infrequent
> > values.
> >
> > > with 100 distinct values of an attribute uniformly distributed in a
> > > relation of 10000 tuples, disbursion was estimated as 0.002275, giving
> > > us 440 distinct values.
> >
> > This is an illustration of the fact that Postgres' disbursion-estimator
> > is pretty bad :-(.  It usually underestimates the frequency of the most
> > common value, unless the most common value is really frequent
> > (probability > 0.2 or so).  I've been trying to think of a more accurate
> > way of figuring the statistic that wouldn't be unreasonably slow.
> > Or, perhaps, we should forget all about disbursion and adopt some other
> > statistic(s).
> 
> Yes, you have the crux of the issue.  I wrote it because it was the best
> thing I could think of, but it is non-optimal.  All the optimal
> solutions seemed too slow to me, so I couldn't come up with a better
> one.

Thank you, Tom and Bruce.
This is not good news for us :-(. In any case, is 1/disbursion the
best estimate we can get for now, even if not optimal?

Roberto Cornacchia
Andrea Ghidini



Re: [HACKERS] Re: Top N queries and disbursion

From: Tom Lane
Roberto Cornacchia <rcorna@tin.it> writes:
>>>> 1/disbursion is a lower bound on the number of values, but it wouldn't
>>>> be a good estimate unless you had reason to think that the values were
>>>> pretty evenly distributed.

> Thank you, Tom and Bruce.
> This is not good news for us :-(. In any case, is 1/disbursion the
> best estimate we can get for now, even if not optimal?

I don't have a better idea right at the moment.  I'm open to the idea
that VACUUM should compute more or different statistics, though ---
as long as it doesn't slow things down too much.  (How much is too much
would probably depend on how much win the new stats would provide for
normal query-planning.  For example, I'd resist making two passes over
the table during VACUUM ANALYZE, but I wouldn't rule it out completely;
you could sell me on it if the advantages were great enough.)

Hey, you guys are the researchers ... give us a better approach to
keeping table statistics ;-)
        regards, tom lane


Re: [HACKERS] Re: Top N queries and disbursion

From: Bruce Momjian
> > Yes, you have the crux of the issue.  I wrote it because it was the best
> > thing I could think of, but it is non-optimal.  All the optimal
> > solutions seemed too slow to me, so I couldn't come up with a better
> > one.
> 
> Thank you, Tom and Bruce.
> This is not good news for us :-(. In any case, is 1/disbursion the
> best estimate we can get for now, even if not optimal?

That is the best one maintained by the database.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] Re: Top N queries and disbursion

From: Bruce Momjian
> I don't have a better idea right at the moment.  I'm open to the idea
> that VACUUM should compute more or different statistics, though ---
> as long as it doesn't slow things down too much.  (How much is too much
> would probably depend on how much win the new stats would provide for
> normal query-planning.  For example, I'd resist making two passes over
> the table during VACUUM ANALYZE, but I wouldn't rule it out completely;
> you could sell me on it if the advantages were great enough.)
> 
> Hey, you guys are the researchers ... give us a better approach to
> keeping table statistics ;-)

Yes, I am open to better ideas.  The current code handles 2-value
columns and unique columns perfectly.  For other distributions it gets
only approximate answers, but for a one-pass system, that's the best I
could do.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026