Thread: RE: [HACKERS] Re: Top N queries and disbursion

RE: [HACKERS] Re: Top N queries and disbursion

From

"Ansley, Michael"

Date:

08 October 1999, 09:43:59

Surely we could store more information (using vacuum) about each table, to
be able to produce good stats relatively quickly?  This would mean that the
estimates would be dependent on vacuum, but there are worse options.  Also,
can't we do something similar to what Oracle does, where you can define your
optimisation to be rule-based, or stats-based.  If it's rule based, the
optimizer looks only at the schema to decide how to optimize.  If
stats-based, then it has a huge amount of information at its disposal to
determine how to optimise.  However, those stats are compiled by something
like vacuum.

MikeA



>> -----Original Message-----
>> From: Roberto Cornacchia [mailto:rcorna@tin.it]
>> Sent: Friday, October 08, 1999 3:19 PM
>> To: Bruce Momjian
>> Cc: Tom Lane; pgsql-hackers@postgreSQL.org
>> Subject: Re: [HACKERS] Re: Top N queries and disbursion
>> 
>> 
>> Bruce Momjian wrote:
>> > 
>> > > No, it's certainly not the right thing.  To my 
>> understanding, disbursion
>> > > is a measure of the frequency of the most common value 
>> of an attribute;
>> > > but that tells you very little about how many other 
>> values there are.
>> > > 1/disbursion is a lower bound on the number of values, 
>> but it wouldn't
>> > > be a good estimate unless you had reason to think that 
>> the values were
>> > > pretty evenly distributed.  There could be a *lot* of 
>> very-infrequent
>> > > values.
>> > >
>> > > > with 100 distinct values of an attribute uniformly 
>> distribuited in a
>> > > > relation of 10000 tuples, disbursion was estimated as 
>> 0.002275, giving
>> > > > us 440 distinct values.
>> > >
>> > > This is an illustration of the fact that Postgres' 
>> disbursion-estimator
>> > > is pretty bad :-(.  It usually underestimates the 
>> frequency of the most
>> > > common value, unless the most common value is really frequent
>> > > (probability > 0.2 or so).  I've been trying to think of 
>> a more accurate
>> > > way of figuring the statistic that wouldn't be unreasonably slow.
>> > > Or, perhaps, we should forget all about disbursion and 
>> adopt some other
>> > > statistic(s).
>> > 
>> > Yes, you have the crux of the issue.  I wrote it because 
>> it was the best
>> > thing I could think of, but it is non-optimimal.  Because all the
>> > optimal solutions seemed too slow to me, I couldn't think 
>> of a better
>> > one.
>> 
>> Thank you, Tom and Bruce.
>> This is not a good news for us :-(. In any case, is 1/disbursion the
>> best estimate we can have by now, even if not optimal?
>> 
>> Roberto Cornacchia
>> Andrea Ghidini
>> 
>> 
>> ************
>>

Re: [HACKERS] Re: Top N queries and disbursion

From

Tom Lane

Date:

08 October 1999, 10:30:58

"Ansley, Michael" <Michael.Ansley@intec.co.za> writes:
> can't we do something similar to what Oracle does, where you can define your
> optimisation to be rule-based, or stats-based.  If it's rule based, the
> optimizer looks only at the schema to decide how to optimize.  If
> stats-based, then it has a huge amount of information at its disposal to
> determine how to optimise.  However, those stats are compiled by something
> like vacuum.

We pretty much do that already; the "rules" are embodied in the default
cost estimates that get used if there's no statistical data from VACUUM.
        regards, tom lane

Re: [HACKERS] Re: Top N queries and disbursion

From

Bruce Momjian

Date:

08 October 1999, 12:22:01

[Charset iso-8859-1 unsupported, filtering to ASCII...]
> Surely we could store more information (using vacuum) about each table, to
> be able to produce good stats relatively quickly?  This would mean that the
> estimates would be dependent on vacuum, but there are worse options.  Also,
> can't we do something similar to what Oracle does, where you can define your
> optimisation to be rule-based, or stats-based.  If it's rule based, the
> optimizer looks only at the schema to decide how to optimize.  If
> stats-based, then it has a huge amount of information at its disposal to
> determine how to optimise.  However, those stats are compiled by something
> like vacuum.

Stats are compiled by vacuum analyze, and every column is analyzed the
same way.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026