On Mon, 2006-09-18 at 17:46 +0200, Matteo Beccati wrote:
> Tom Lane wrote:
> > Matteo Beccati <php@beccati.com> writes:
> >> I cannot see anything wrong with using something like this:
> >> if (histogram is large/representative enough)
> >
> > Well, the question is exactly what is "large enough"? I feel a bit
> > uncomfortable about applying the idea to a histogram with only 10
> > entries (especially if we ignore two of 'em). With 100 or more,
> > it sounds all right. What's the breakpoint?
>
> Yes, I think 100-200 could be a good breakpoint. I don't actually know
> what the current usage of SET STATISTICS is; I usually set it to 1000
> for columns that need more precise selectivity estimates.
>
> The breakpoint could be set even higher (500?) so that there is room to
> increase the statistics target without enabling the histogram check, but
> I don't feel very comfortable suggesting this kind of possibly
> undocumented side effect...
Hi everyone,
You may be interested in having a look at the statistics collector for the
geometry type within PostGIS. To prevent very large or very small
geometries from ruining the statistics histogram and generating incorrect
query plans, we assume that the column distribution is likely to be close
to normal, and we remove any ANALYZE-collected geometries that lie more
than 3.25 standard deviations from the mean before creating the final
histogram (this trims well under 1% of the data from each end of an
assumed normal distribution). This works well, and AFAIK we've only ever
had one reported case of an incorrect query plan being generated with this
method.
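For illustration, the trimming step amounts to something like the sketch
below. This is not the actual PostGIS code; the flat array of sample
values, the helper name and the use of the sample standard deviation are
simplifying assumptions, with 3.25 as the cutoff mentioned above.

#include <math.h>

#define SD_CUTOFF 3.25

/*
 * Drop samples lying more than SD_CUTOFF standard deviations from the
 * mean, compacting the array in place. Returns the number of samples
 * kept; the histogram is then built from samples[0 .. kept-1].
 */
static int
trim_outliers(double *samples, int n)
{
	double	sum = 0.0, sumsq = 0.0;
	double	mean, var, sd, lo, hi;
	int		i, kept = 0;

	if (n < 2)
		return n;			/* too few samples to estimate a spread */

	for (i = 0; i < n; i++)
	{
		sum += samples[i];
		sumsq += samples[i] * samples[i];
	}
	mean = sum / n;
	var = (sumsq - n * mean * mean) / (n - 1);
	if (var < 0.0)
		var = 0.0;			/* guard against floating-point round-off */
	sd = sqrt(var);

	lo = mean - SD_CUTOFF * sd;
	hi = mean + SD_CUTOFF * sd;

	for (i = 0; i < n; i++)
	{
		if (samples[i] >= lo && samples[i] <= hi)
			samples[kept++] = samples[i];
	}
	return kept;
}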
Kind regards,
Mark.