Tom Lane writes:
> 2. The M boundary values of an equi-depth histogram, ie, the values that
> divide the data distribution into equal-population sets. For example, if
> M is 3 this would consist of the min, median, and max values.
Why not model that data with a normal distribution (or some other such
model)? This should give fairly good estimates for <, >, and BETWEEN-type
queries. That way the user wouldn't have to tweak M. (I couldn't even
imagine how to explain to a user how to pick a good M.)
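For what it's worth, here is a minimal sketch of what such a model-based
estimate could look like, assuming ANALYZE stored just a column's mean and
standard deviation (the function names here are made up, not anything in
the tree):

    #include <math.h>

    /* Hypothetical: estimated fraction of rows satisfying "col < c",
     * assuming the column is roughly normal with the mean and stddev
     * gathered by ANALYZE.  The normal CDF is 0.5*(1 + erf(z/sqrt(2))). */
    static double
    normal_lt_selectivity(double c, double mean, double stddev)
    {
        double  z = (c - mean) / stddev;

        return 0.5 * (1.0 + erf(z / sqrt(2.0)));
    }

    /* "a < col AND col < b" follows directly from the CDF. */
    static double
    normal_between_selectivity(double a, double b,
                               double mean, double stddev)
    {
        return normal_lt_selectivity(b, mean, stddev)
             - normal_lt_selectivity(a, mean, stddev);
    }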
> Issue: for datatypes that have no '<' operator, we of course cannot
> compute a histogram --- but if there is an '=' then the most-common-value
> and number-of-distinct-values stats still make sense.
I think the statistics calculation should be data type (or operator, or
opclass) specific, like the selectivity estimation functions. Geometric
data types, for example, need completely different kinds of statistics,
and this would let users plug in different methods. The pg_statistics
table could just hold an array of floats in which each data type stores
whatever statistics it chooses, and the selectivity estimation routines
would interpret those values per data type. That way you also get some
common-sense optimizations without hacks; e.g., for boolean columns you
possibly only need to calculate one value (the number of trues).
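As a rough illustration of that convention (all names hypothetical): say
ANALYZE stores a single float for a boolean column, stats[0] = the number
of true values it saw. Together with the table's row count, which is
already kept in pg_class.reltuples, that one value covers both "col = true"
and "col = false":

    /* Hypothetical per-type estimator reading its own slice of the
     * proposed float array in pg_statistics. */
    static double
    bool_eq_selectivity(const float *stats, int nstats,
                        double reltuples, int comparing_to_true)
    {
        double  frac_true;

        if (nstats < 1 || reltuples <= 0)
            return 0.5;         /* no statistics; fall back to a guess */
        frac_true = stats[0] / reltuples;
        return comparing_to_true ? frac_true : 1.0 - frac_true;
    }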
Of course, how to calculate up to N different sets of statistics for N
different columns, each possibly requiring a different number of passes
over the table, is left as a challenge. ;-)
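(Though many useful statistics are single-pass accumulations, so one
sequential scan could feed every column's accumulator at once. A rough
sketch, again with made-up names, of the accumulation the normal model
above would need:

    #include <math.h>

    typedef struct
    {
        double  n;      /* non-null values seen */
        double  sum;    /* running sum, for the mean */
        double  sumsq;  /* running sum of squares, for the stddev */
    } NumericStats;

    static void
    numeric_stats_accum(NumericStats *s, double value)
    {
        s->n += 1.0;
        s->sum += value;
        s->sumsq += value * value;
    }

    static double
    numeric_stats_stddev(const NumericStats *s)
    {
        double  mean = s->sum / s->n;

        return sqrt(s->sumsq / s->n - mean * mean);
    }

The boolean true-count earlier is single-pass too.)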
This brings up a question I have: are statistics currently calculated for
every column? Should they be? Could ANALYZE be sped up by letting the user
control this?
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter