RE: [HACKERS] Re: Top N queries and disbursion - Mailing list pgsql-hackers

From Ansley, Michael
Subject RE: [HACKERS] Re: Top N queries and disbursion
Date
Msg-id 1BF7C7482189D211B03F00805F8527F748C131@S-NATH-EXCH2
Whole thread Raw
Responses Re: [HACKERS] Re: Top N queries and disbursion
Re: [HACKERS] Re: Top N queries and disbursion
List pgsql-hackers
Surely we could store more information (using vacuum) about each table, to
be able to produce good stats relatively quickly?  This would mean that the
estimates would be dependent on vacuum, but there are worse options.  Also,
can't we do something similar to what Oracle does, where you can define your
optimisation to be rule-based, or stats-based.  If it's rule based, the
optimizer looks only at the schema to decide how to optimize.  If
stats-based, then it has a huge amount of information at its disposal to
determine how to optimise.  However, those stats are compiled by something
like vacuum.

MikeA



>> -----Original Message-----
>> From: Roberto Cornacchia [mailto:rcorna@tin.it]
>> Sent: Friday, October 08, 1999 3:19 PM
>> To: Bruce Momjian
>> Cc: Tom Lane; pgsql-hackers@postgreSQL.org
>> Subject: Re: [HACKERS] Re: Top N queries and disbursion
>> 
>> 
>> Bruce Momjian wrote:
>> > 
>> > > No, it's certainly not the right thing.  To my 
>> understanding, disbursion
>> > > is a measure of the frequency of the most common value 
>> of an attribute;
>> > > but that tells you very little about how many other 
>> values there are.
>> > > 1/disbursion is a lower bound on the number of values, 
>> but it wouldn't
>> > > be a good estimate unless you had reason to think that 
>> the values were
>> > > pretty evenly distributed.  There could be a *lot* of 
>> very-infrequent
>> > > values.
>> > >
>> > > > with 100 distinct values of an attribute uniformly 
>> distribuited in a
>> > > > relation of 10000 tuples, disbursion was estimated as 
>> 0.002275, giving
>> > > > us 440 distinct values.
>> > >
>> > > This is an illustration of the fact that Postgres' 
>> disbursion-estimator
>> > > is pretty bad :-(.  It usually underestimates the 
>> frequency of the most
>> > > common value, unless the most common value is really frequent
>> > > (probability > 0.2 or so).  I've been trying to think of 
>> a more accurate
>> > > way of figuring the statistic that wouldn't be unreasonably slow.
>> > > Or, perhaps, we should forget all about disbursion and 
>> adopt some other
>> > > statistic(s).
>> > 
>> > Yes, you have the crux of the issue.  I wrote it because 
>> it was the best
>> > thing I could think of, but it is non-optimimal.  Because all the
>> > optimal solutions seemed too slow to me, I couldn't think 
>> of a better
>> > one.
>> 
>> Thank you, Tom and Bruce.
>> This is not a good news for us :-(. In any case, is 1/disbursion the
>> best estimate we can have by now, even if not optimal?
>> 
>> Roberto Cornacchia
>> Andrea Ghidini
>> 
>> 
>> ************
>> 


pgsql-hackers by date:

Previous
From: Roberto Cornacchia
Date:
Subject: Re: [HACKERS] Re: Top N queries and disbursion
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] Re: [GENERAL] Re: [PHP3] Re: PostgreSQL vs Mysql comparison