Here is my opinion.
I hope this helps.
Maybe there is no one good formula:
On boolean type, there are at most 3 distinct values.
There is an upper bound for fornames in one country.
There is an upper bound for last names in one country.
There is a fixed number of states and postal codes in one country.
On the other hand, with timestamp, every value could be distinct.
A primary key with only one column has only distinct values.
If the integer column refers with a foreign key into another table's
only primary key, we could take advantage of that knolege.
A column with a unique index has only distinct values.
First ones are for classifying and the second ones measure continuous
or discrete time or something like the time.
The upper bound for classifying might be 3 (boolean), or it might be
one million. The properties of the distribution might be hard to guess.
Here is one way:
1. Find out the number of distinct values for 500 rows.
2. Try to guess, how many distinct values are for 1000 rows.
Find out the real number of distinct values for 1000 rows.
3. If the guess and the reality are 50% wrong, do the iteration for
2x1000 rows.
Iterate using a power of two to increase the samples, until you trust the
estimate enough.
So, in the phase two, you could try to guess with two distinct formulas:
One for the classifying target (boolean columns hit there).
Another one for the timestamp and numerical values.
If there are one million classifications on one column, how you
can find it out, by other means than checking at least two million
rows?
This means, that the user should have a possibility to tell the lower
bound for the number of rows for sampling.
Regards,
Marko Ristola
Tom Lane wrote:
>Josh Berkus <josh@agliodbs.com> writes:
>
>
>>Overall, our formula is inherently conservative of n_distinct. That is, I
>>believe that it is actually computing the *smallest* number of distinct
>>values which would reasonably produce the given sample, rather than the
>>*median* one. This is contrary to the notes in analyze.c, which seem to
>>think that we're *overestimating* n_distinct.
>>
>>
>
>Well, the notes are there because the early tests I ran on that formula
>did show it overestimating n_distinct more often than not. Greg is
>correct that this is inherently a hard problem :-(
>
>I have nothing against adopting a different formula, if you can find
>something with a comparable amount of math behind it ... but I fear
>it'd only shift the failure cases around.
>
> regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 9: the planner will ignore your desire to choose an index scan if your
> joining column's datatypes do not match
>
>