On Aug 12, 2008, at 4:59 PM, Chris Kratz wrote:
> Ran into a re-occuring performance problem with some report queries
> again today. In a nutshell, we have filters on either multiple
> joined tables, or multiple columns on a single table that are
> highly correlated. So, the estimates come out grossly incorrect
> (the planner has no way to know they are correlated). 2000:1 for
> one I'm looking at right now. Generally this doesn't matter,
> except in complex reporting queries like these when this is the
> first join of 40 other joins. Because the estimate is wrong at the
> lowest level, it snowballs up through the rest of the joins causing
> the query to run very, very slowly. In many of these cases,
> forcing nested loops off for the duration of the query fixes the
> problem. But I have a couple that still are painfully slow and
> shouldn't be.
>
> I've been reading through the archives with others having similar
> problems (including myself a year ago). Am I right in assuming
> that at this point there is still little we can do in postgres to
> speed up this kind of query? Right now the planner has no way to
> know the correlation between different columns in the same table,
> let alone columns in different tables. So, it just assumes no
> correlation and returns incorrectly low estimates in cases like these.
>
> The only solution I've come up with so far is to materialize
> portions of the larger query into subqueries with these correlated
> filters which are indexed and analyzed before joining into the
> larger query. This would keep the incorrect estimates from
> snowballing up through the chain of joins.
>
> Are there any other solutions to this problem?
Well... you could try and convince certain members of the community
that we actually do need some kind of a query hint mechanism... ;)
I did make a suggestion a few months ago that involved sorting a
table on different columns and recording the correlation of other
columns. The scheme isn't perfect, but it would help detect cases
like a field populated by a sequence and another field that's insert
timestamp; those two fields would correlate highly, and you should
even be able to correlate the two histograms; that would allow you to
infer that most of the insert times for _id's between 100 and 200
will be between 2008-01-01 00:10 and 2008-01-01 00:20, for example.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828