Re: Understanding histograms - Mailing list pgsql-performance

From Tom Lane
Subject Re: Understanding histograms
Date
Msg-id 18628.1209566591@sss.pgh.pa.us
Whole thread Raw
In response to Re: Understanding histograms  ("Len Shapiro" <lenshap@gmail.com>)
Responses Re: Understanding histograms  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-performance
"Len Shapiro" <lenshap@gmail.com> writes:
> I asked about n_distinct, whose documentation reads in part "The
> negated form is used when ANALYZE believes that the number of distinct
> values is likely to increase as the table grows".  and I asked about
> why ANALYZE believes that the number of distinct values is likely to
> increase.  I'm unclear why you quoted to me the documentation on
> stadistinct.

n_distinct is just a view of stadistinct.  I assumed you'd poked around
in the code enough to know that ...

>>> The "rows=2" estimate makes sense when const = 1 or 5, but it makes no
>>> sense to me for other values of const not in the MVC list.
>>
>> I'm not sure what estimate you'd expect instead?

> Instead I would expect an estimate of "rows=0" for values of const
> that are not in the MCV list and not in the histogram.

Surely that's not very sane?  The MCV list plus histogram generally
don't include every value in the table.  IIRC the estimate for values
not present in the MCV list is (1 - sum(MCV frequencies)) divided by
(n_distinct - number of MCV entries), which amounts to assuming that
all values not present in the MCV list occur equally often.  The weak
spot of course is that the n_distinct estimate may be pretty inaccurate.

> Where in the source is the code that manipulates the histogram?

commands/analyze.c builds it, and most of the estimation with it
happens in utils/adt/selfuncs.c.

            regards, tom lane

pgsql-performance by date:

Previous
From: "Pavan Deolasee"
Date:
Subject: Re: Replication Syatem
Next
From: Tom Lane
Date:
Subject: Re: Replication Syatem