Tom,
I'm interested in the problem of cross-column statistics from a
theoretical perspective. It would be interesting to sit down and try to
reason out a useful solution, or at very least to understand the problem
better so I can anticipate when it might come and eat me.
From my understanding, the main problem is that if PG knows the
selectivity of n conditions C1,C2,...,Cn then it doesn't know whether
the combined selectivity will be C1*C2*...*Cn (conditions are
independent) or max(C1,C2,...,Cn) (conditions are strictly dependent),
or somewhere in the middle. Therefore, row estimates could be orders of
magnitude off.
I suppose a common example would be a table with a serial primary key
column and a timestamp value which is always inserted as
CURRENT_TIMESTAMP, so the two columns are strongly correlated. If the
planner guesses that 1% of the rows of the table will match pk>1000000,
and 1% of the rows of the table will match timestamp > X, then it would
be nice for it to know that if you specify both "pk>1000000 AND
timestamp>X" that the combined selectivity is still only 1% and not 1% *
1% = 0.01%.
As long as I'm sitting down and reasoning about the problem anyway, are
there any other types of cases you're aware of where some form of cross-
column statistics would be useful? In the unlikely event that I
actually come up with a brilliant and simple solution, I'd at least like
to make sure that I'm solving the right problem :)
Thanks,
Mark Lewis
On Tue, 2006-10-10 at 22:38 -0400, Tom Lane wrote:
> Brian Herlihy <btherl@yahoo.com.au> writes:
> > What would it take for hints to be added to postgres?
>
> A *whole lot* more thought and effort than has been expended on the
> subject to date.
>
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y". You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).
> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions. Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match