Re: Incorrect estimates on correlated filters - Mailing list pgsql-performance

From Decibel!
Subject Re: Incorrect estimates on correlated filters
Date
Msg-id 12707C41-48C3-4072-A152-46FAB1FEE1E2@decibel.org
Whole thread Raw
In response to Incorrect estimates on correlated filters  ("Chris Kratz" <chris.kratz@vistashare.com>)
Responses Re: Incorrect estimates on correlated filters
Re: Incorrect estimates on correlated filters
List pgsql-performance
On Aug 12, 2008, at 4:59 PM, Chris Kratz wrote:
> Ran into a re-occuring performance problem with some report queries
> again today.  In a nutshell, we have filters on either multiple
> joined tables, or multiple columns on a single table that are
> highly correlated.  So, the estimates come out grossly incorrect
> (the planner has no way to know they are correlated).  2000:1 for
> one I'm looking at right now.  Generally this doesn't matter,
> except in complex reporting queries like these when this is the
> first join of 40 other joins.  Because the estimate is wrong at the
> lowest level, it snowballs up through the rest of the joins causing
> the query to run very, very slowly.   In many of these cases,
> forcing nested loops off for the duration of the query fixes the
> problem.  But I have a couple that still are painfully slow and
> shouldn't be.
>
> I've been reading through the archives with others having similar
> problems (including myself a year ago).  Am I right in assuming
> that at this point there is still little we can do in postgres to
> speed up this kind of query?  Right now the planner has no way to
> know the correlation between different columns in the same table,
> let alone columns in different tables.  So, it just assumes no
> correlation and returns incorrectly low estimates in cases like these.
>
> The only solution I've come up with so far is to materialize
> portions of the larger query into subqueries with these correlated
> filters which are indexed and analyzed before joining into the
> larger query.  This would keep the incorrect estimates from
> snowballing up through the chain of joins.
>
> Are there any other solutions to this problem?


Well... you could try and convince certain members of the community
that we actually do need some kind of a query hint mechanism... ;)

I did make a suggestion a few months ago that involved sorting a
table on different columns and recording the correlation of other
columns. The scheme isn't perfect, but it would help detect cases
like a field populated by a sequence and another field that's insert
timestamp; those two fields would correlate highly, and you should
even be able to correlate the two histograms; that would allow you to
infer that most of the insert times for _id's between 100 and 200
will be between 2008-01-01 00:10 and 2008-01-01 00:20, for example.
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Attachment

pgsql-performance by date:

Previous
From: Decibel!
Date:
Subject: Re: Filesystem benchmarking for pg 8.3.3 server
Next
From: Greg Smith
Date:
Subject: Re: Filesystem benchmarking for pg 8.3.3 server