Re: RFC: planner statistics in 7.2 - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: RFC: planner statistics in 7.2
Msg-id: 26502.987781583@sss.pgh.pa.us
In response to: Re: RFC: planner statistics in 7.2 (Peter Eisentraut <peter_e@gmx.net>)
List: pgsql-hackers
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> 2. The M boundary values of an equi-depth histogram, ie, the values that
>> divide the data distribution into equal-population sets. For example, if
>> M is 3 this would consist of the min, median, and max values.

> Why not model that data in a normal distribution (or some other such
> model)?

If we knew an appropriate distribution model, we could do that. There is
no distribution model that is appropriate for everything... certainly the
normal distribution is not. A compressed histogram is more flexible than
any other simple model I know of.

> (I couldn't even imagine a way to explain to the user how to pick good
> M's.)

I was just planning to say "if you get bad estimates on a column with a
highly irregular distribution, try increasing the default M". Eventually
we might have enough experience with it to offer more detailed advice.

>> Issue: for datatypes that have no '<' operator, we of course cannot
>> compute a histogram --- but if there is an '=' then the most-common-value
>> and number-of-distinct-values stats still make sense.

> I think the statistics calculation should be data type (or operator, or
> opclass) specific, like the selectivity estimation functions.

Yeah, that's probably what we should do, but so far no one has even
suggested what stats we might want for geometric datatypes. Tell you
what: let's throw in a field that indicates "kind of statistics
gathered", and have the meaning of the array fields depend on that.
The proposal I gave describes just the stats to gather for scalar types.

> pg_statistics table could just have an array of floats where each data
> type can store statistics as it chooses and the selectivity estimation
> routines can interpret the values in a different way per data type.

Man does not live by floats alone --- in particular we need to be able
to store specific values of the target datatype. Probably what we want
is three or four instances of the pattern

    stats_kind     int,      -- identifies type of info
    stats_numbers  float[],  -- numerical info such as occurrence fraction
    stats_values   text[],   -- array of values of column datatype
                             -- (in external representation)

Depending on the value of stats_kind, either stats_numbers or
stats_values might be unused, in which case it could be set to NULL so
it doesn't waste space. For scalar datatypes, an instance of this
pattern could hold the most common values and their frequencies, and
another one could hold the histogram boundary points (with stats_numbers
unused); a concrete sketch of this layout appears below.

> Of course, how to calculate up to N different sets of statistics for N
> different columns that require up to N different numbers of passes over
> the table is left as a challenge. ;-)

As long as it's OK to gather the stats from a random sample of the
table, I think the structure I proposed would work fine. Remember the
second part of ANALYZE is looping over each column separately anyway ---
so it could easily apply a datatype-dependent algorithm for computing
statistics.

> This brings up a question I have: Are statistics calculated for every
> column? Should they be? Is there a possibility to speed up ANALYZE by
> controlling this?

Presently, frequency stats will be calculated for every column that has
an '=' operator, and min/max stats will be calculated if there's a '<'
operator for the type. I was planning to retain that convention.
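To make the proposed pattern concrete, here is a minimal sketch of such
a catalog with two instances ("slots") of the pattern, plus example
contents for one scalar column. The table name, column names, slot
count, and kind codes are all illustrative assumptions, not part of the
proposal:

    -- Hypothetical catalog layout following the pattern above; every
    -- name and kind code here is a placeholder, not a real definition.
    CREATE TABLE pg_statistic_sketch (
        starelid     oid,       -- table whose column is described
        staattnum    int2,      -- column number within that table
        stakind1     int4,      -- what slot 1 holds (0 = unused)
        stanumbers1  float4[],  -- numeric info, e.g. occurrence fractions
        stavalues1   text[],    -- column values, external representation
        stakind2     int4,      -- what slot 2 holds (0 = unused)
        stanumbers2  float4[],
        stavalues2   text[]
    );

    -- Example contents for one scalar column, using made-up kind codes
    -- (1 = most common values, 2 = equi-depth histogram boundaries):
    --
    --   stakind1 = 1, stavalues1 = '{CA,NY,TX}', stanumbers1 = '{0.20,0.15,0.10}'
    --   stakind2 = 2, stavalues2 = '{1,500,999}', stanumbers2 = NULL
    --
    -- With M = 3 the histogram slot holds just the min, median, and max,
    -- matching the example quoted at the top of this message.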
However if we are going to allow adjustable M, we could also add the
stipulation that the DBA could set M = 0 for columns that he doesn't
want stats gathered for. (This would make sense for a column that is
never used in a WHERE clause.)

            regards, tom lane
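To illustrate the per-column M idea above: one way such a knob could
surface at the SQL level is sketched below. The syntax is purely an
assumption for illustration; nothing like it is specified in this
message:

    -- Hypothetical per-column statistics commands (illustrative syntax):
    ALTER TABLE orders ALTER COLUMN note   SET STATISTICS 0;   -- M = 0: gather no stats
    ALTER TABLE orders ALTER COLUMN region SET STATISTICS 50;  -- larger M for a highly
                                                               -- irregular distribution

A column like "note" that never appears in a WHERE clause would be the
natural candidate for the zero setting, per the parenthetical above.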