Re: RFC: planner statistics in 7.2 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: RFC: planner statistics in 7.2
Date
Msg-id 26502.987781583@sss.pgh.pa.us
Whole thread Raw
In response to Re: RFC: planner statistics in 7.2  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-hackers
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> 2. The M boundary values of an equi-depth histogram, ie, the values that
>> divide the data distribution into equal-population sets.  For example, if
>> M is 3 this would consist of the min, median, and max values.

> Why not model that data in a normal distribution (or some other such
> model)?

If we knew an appropriate distribution model, we could do that.  There
is no distribution model that is appropriate for everything... certainly
the normal distribution is not.  A compressed histogram is more flexible
than any other simple model I know of.

> (I couldn't even imagine a way to explain to the user how to pick good
> M's.)

I was just planning to say "if you get bad estimates on a column with a
highly irregular distribution, try increasing the default M".
Eventually we might have enough experience with it to offer more
detailed advice.

>> Issue: for datatypes that have no '<' operator, we of course cannot
>> compute a histogram --- but if there is an '=' then the most-common-value
>> and number-of-distinct-values stats still make sense.

> I think the statistics calculation should be data type (or operator, or
> opclass) specific, like the selectivity estimation functions.

Yeah, that's probably what we should do, but so far no one has even
suggested what stats we might want for geometric datatypes.  Tell you
what: let's throw in a field that indicates "kind of statistics
gathered", and have the meaning of the array fields depend on that.
The proposal I gave describes just the stats to gather for scalar types.

> pg_statistics table could just have an array of floats where each data
> type can store statistics as it chooses and the selectivity estimation
> routines can interpret the values in a different way per data type.

Man does not live by floats alone --- in particular we need to be able
to store specific values of the target datatype.  Probably what we want
is three or four instances of the pattern

stats_kind    int,        -- identifies type of info
stats_numbers    float[],    -- numerical info such as occurrence fraction
stats_values    text[],        -- array of values of column datatype            -- (in external representation)

Depending on the value of stats_kind, either stats_numbers or
stats_values might be unused, in which case it could be set to NULL so
it doesn't waste space.  For scalar datatypes, an instance of this
pattern could hold the most common values and their frequencies, and
another one could hold the histogram boundary points (with stats_numbers
unused).

> Of course, how to calculate up to N different sets of statistics for N
> different columns that require up to N different numbers of passes over
> the table is left as a challenge.  ;-)

As long as it's OK to gather the stats from a random sample of the
table, I think the structure I proposed would work fine.  Remember the
second part of ANALYZE is looping over each column separately anyway ---
so it could easily apply a datatype-dependent algorithm for computing
statistics.

> This brings up a question I have:  Are statistics calculated for every
> column?  Should they be?  Is there a possibility to speed up ANALYZE by
> controlling this?

Presently, frequency stats will be calculated for every column that has
an '=' operator, and min/max stats will be calculated if there's a '<'
operator for the type.  I was planning to retain that convention.
However if we are going to allow adjustable M, we could also add the
stipulation that the DBA could set M = 0 for columns that he doesn't
want stats gathered for.  (This would make sense for a column that is
never used in a WHERE clause.)
        regards, tom lane


pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: System catalog representation of access privileges
Next
From: Jan Wieck
Date:
Subject: Re: Re: [BUG?] tgconstrrelid doesn't survive a dump/restore