Re: [HACKERS] Re: Top N queries and disbursion - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [HACKERS] Re: Top N queries and disbursion
Date
Msg-id 199910072353.TAA09358@candle.pha.pa.us
Whole thread Raw
In response to Re: Top N queries and disbursion  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
> No, it's certainly not the right thing.  To my understanding, disbursion
> is a measure of the frequency of the most common value of an attribute;
> but that tells you very little about how many other values there are.
> 1/disbursion is a lower bound on the number of values, but it wouldn't
> be a good estimate unless you had reason to think that the values were
> pretty evenly distributed.  There could be a *lot* of very-infrequent
> values.
> 
> > with 100 distinct values of an attribute uniformly distribuited in a
> > relation of 10000 tuples, disbursion was estimated as 0.002275, giving
> > us 440 distinct values.
> 
> This is an illustration of the fact that Postgres' disbursion-estimator
> is pretty bad :-(.  It usually underestimates the frequency of the most
> common value, unless the most common value is really frequent
> (probability > 0.2 or so).  I've been trying to think of a more accurate
> way of figuring the statistic that wouldn't be unreasonably slow.
> Or, perhaps, we should forget all about disbursion and adopt some other
> statistic(s).

Yes, you have the crux of the issue.  I wrote it because it was the best
thing I could think of, but it is non-optimimal.  Because all the
optimal solutions seemed too slow to me, I couldn't think of a better
one.

Here is my narrative on it from vacuum.c:

---------------------------------------------------------------------------
*  We compute the column min, max, null and non-null counts.*  Plus we attempt to find the count of the value that
occursmost*  frequently in each column*  These figures are used to compute the selectivity of the column**  We use a
three-buckedcache to get the most frequent item*  The 'guess' buckets count hits.  A cache miss causes guess1*  to get
themost hit 'guess' item in the most recent cycle, and*  the new item goes into guess2.  Whenever the total count of
hits* of a 'guess' entry is larger than 'best', 'guess' becomes 'best'.**  This method works perfectly for columns with
uniquevalues, and columns*  with only two unique values, plus nulls.**  It becomes less perfect as the number of unique
valuesincreases and*  their distribution in the table becomes more random.
 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] Re: psql and comments
Next
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] Re: psql and comments