Re: limit clause breaks query planner? - Mailing list pgsql-performance

From Scott Carey
Subject Re: limit clause breaks query planner?
Date
Msg-id a1ec7d000809041104j554bc9d9i641ce1a594108a42@mail.gmail.com
Whole thread Raw
In response to Re: limit clause breaks query planner?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance


On Thu, Sep 4, 2008 at 10:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Actually, an even easier hack (which would have the nice property of not
needing to know the exact value being searched for), would simply use
the existing cost estimates if the WHERE variables have low correlation
(meaning the random-locations assumption is probably good), but apply
some sort of penalty factor if the correlation is high.  This would
amount to assuming that the universe is perverse and high correlation
will always mean that the rows we want are at the wrong end of the table
not the right one.  But any DBA will tell you that the universe is
indeed perverse ;-)

As a user, I prefer this solution.  For one, statistics get out of date.  A few updates/ inserts (maybe 3% of the table) can greatly
affect an individual value in the histogram, breaking the location assumptions and create a worst case result.
But when 3% of rows change, the correlation cannot change as drastically as the histogram in the worst case.  When deciding how to go about using the statistics for a query plan, its best to assume that they are not perfect, and that there has been some change since they were last gathered.  This is but one thing that makes the universe perverse :)
The safe assumption is that if the distribution is not random, the expected average number of rows to scan goes up -- this is statistically correct.  With perfect correlation and unknown locations the average expected value number of scanned rows would be ~ half the table.  Equal likelihood of the first 15 rows as the last 15.  Thus, for perfect correlation the average penalty would be half the table scanned, and worst case would be the whole table.  In this case, if the statistics are out of date somewhat, the cost estimate is not likely to be more than a factor of 2 off.  If one were to use the histogram, the cost estimate could be many orders of magnitude off. 
I'm fairly sure the penalty function to use for correlation values can be derived statistically -- or at least approximated well enough for a look up table.
If the histogram is used, the odds of it being wrong or out of date have to be taken into account since the penalty for being incorrect is potentially very large -- its not a gradual increase in cost for a small error, it is a big and uncertain increase.  I see the query planner's main goal is to avoid the worst outcomes more than finding the absolute best one.  Better to produce 90% optimized queries 100% of the time than make 100%  perfect plans  90% of the time and then 10% of the time produce very bad plans.  Your suggestion above would do the best job avoiding bad plans but could miss squeezing out that last few % in rare cases that probably don't matter that much.



OTOH, since indexscans get a cost estimate reduction in the presence of
high correlation, we're already biasing the choice in the direction of
indexscans for high correlation.  We may not need to do it twice.

Because the full scan is compared against more than just index scans (potentially), it makes sense to adjust each accordingly and independently.  Additionally, the index scan cost reduction function and the full table scan cost increase function as correlation changes may have very different, nonlinear 'shapes' between 0 and 1.
 

I don't recall whether the OP ever showed us his statistics for the
table in question --- did it even appear to have high correlation?

                       regards, tom lane

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: limit clause breaks query planner?
Next
From: Kiran Mukhyala
Date:
Subject: inaccurate stats on large tables